Thursday, October 01, 2009

Simple Simhashing, My favorite trick.

Knol got equation editing a few days ago, and to celebrate I've written up a knol on one of my favorite CS tricks, simhashing using sets. It's a technique I picked up at Google, and it's extremely useful for grouping together similar things in a large dataset.
Suppose you have a huge number of items that you would like to group together by a fuzzy notion of similarity. Suppose the only tool available to you is a key-value store. Suppose you only have the resources to consider each object once. Never fear, simhashing is here!


Michael Chermside said...

Wow. Neat! Thank you -- I had never seen this.

Markus Kohler said...

Hi Ryan,
It seems I'm a few date too late to read your article on knol.
Do you have a copy of the article somewhere else?


Ryan Moulton said...

Thanks for the note! I've updated the link to the new location.