home | books | articles | gleanings | case studies | hire
other sites: widgetopia | blueprints for the web | metafooder | Mammahood


 


 


« First day of preschool! | main | Researching Search »

Detecting documents near duplications in realtime

via Wondering around

To make it clear, our problem is not to find all near duplicates. We just wish to find near duplicates among the articles we serve, but it must be very fast. We might return 100 articles in one set. Comparing all of them to each other takes on the order of 10K comparisons (100 × 100), or about 5K if each pair is compared only once.
Some of the conventional methods I know of for detecting near duplicates are shingling and matching term-frequency vectors. Shingling is great and the most accurate, but it is expensive. You can't take all the articles and compare them, not to mention keep them all in memory. Creating the shingles takes time, and large documents may produce many of them. Vectors might be less accurate for these purposes and have similar caveats.
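For anyone else who hadn't run into shingling before, here is a minimal sketch of the idea as I understand it, not the quoted author's implementation. It breaks each document into overlapping word k-grams ("shingles"), scores two documents by the Jaccard similarity of their shingle sets, and checks every unordered pair in one result set. The function names, the shingle size `k=3`, and the `0.5` threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles in text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty documents count as identical
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, k=3, threshold=0.5):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    # Build each shingle set once, then compare every unordered pair.
    sets = [shingles(d, k) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
        "completely unrelated text about cooking pasta tonight",
    ]
    print(near_duplicate_pairs(docs))
```

The cost the excerpt complains about is visible here: every document needs its full shingle set built and held in memory, and the pairwise loop grows quadratically with the size of the result set.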

I went most of my professional life never hearing about shingling; suddenly I hear about it twice in a week. An artifact of growing computing power?

Posted at September 11, 2008 11:28 AM

