Approximate search in Linq

Many applications have a search functionality. In such a case, we would like an approximate (or fuzzy) search, which returns all items that are similar (in stead of identical) to the item we're looking for. In the past days, I developped a library which implements this for Linq (Language INtergated Query) to objects. So, all you .Net 3.5 programmers out there that need approximate search, download this dll from our website and get started!

Basically, the library adds two methods to LINQ: ProximitySort and ProximitySearch. The first one sorts all objects in the collection based on how similar they are to a given object, so it returns a collection. The second one returns only the best match. Because of this limitation, it can apply some optimisations. In short, ProximitySearch is conceptually similar to ProximitySort(...).First(), only more performant.

Both methods are based on string comparison: they just apply the ToString method on every object in the collection, and then calculate a distance between this string and the string of the SearchTerm object. The distance used is the Lhevenstein distance, a variant of the edit distance. These distances are language independant, they work on names, product numbers,... . If you want to compare objects on some other string than the one returned by ToString, there is an extra overload which allows you to specify a string selector.

Let's show some sample code.

"A few minutes ago somebody called with a complaint on the order he made on 10/5/2007, his name was 'Peterson' or something like that..."

var allShoppers = from c in Customers
                  where c.Orders.OrderDate = "10/5/2007"
                  select c;
var peterson = allShoppers.ProximitySearch("Peterson", c => c.CustomerName);

First we retrieve all people who bought something on that particular day (allShoppers), then we filter out the single person who's customername looks most like 'Peterson'.

 

"On our website, customers hate that huge drop down box which contains all our product names, but when we let them select a product category first, the complain that they don't know in which category to look... What can we do?"

Product searchProduct = new Product();
searchProduct.Name = textbox.Text;
var similarProd = Products.ProximitySort(searchProduct).Take(10);

If Product has a ToString method that shows the productname, and textbox.Text contains the name of the product the user is searching for, then the similarProd collection will contain all the products who's name resembles the one the user typed. Show them in a drop down box and let the user select from that limited box.

Have fun with the linq extensions! If I find some time next week (I'll be at TechEd) I'll post a follow-up article with some best practises and timings.