Thursday, October 01, 2009

Simple Simhashing, My favorite trick.

Knol got equation editing a few days ago, and to celebrate I've written up a knol on one of my favorite CS tricks, simhashing using sets. It's a technique I picked up at Google, and it's extremely useful for grouping together similar things in a large dataset.
Suppose you have a huge number of items that you would like to group together by a fuzzy notion of similarity. Suppose the only tool available to you is a key-value store. Suppose you only have the resources to consider each object once. Never fear, simhashing is here!