MinHash + LSH
Also: MinHash · Locality-sensitive hashing
A pair of techniques that find near-duplicate texts in a corpus without comparing every pair.
Last updated
MinHash is a compact fingerprint of a text (derived from hashed n-grams) that lets you estimate how similar two texts are from the fingerprints alone. Locality-Sensitive Hashing (LSH) indexes those fingerprints so near-duplicates can be found without comparing every pair. Together they scale to corpora where all-pairs comparison would be infeasible.
Why it matters for your research. Primary-source corpora are full of reprints, boilerplate, syndicated articles, translated copies, and scans of the same issue from multiple libraries. Finding these is a research question (which papers copied from which?) and a data-quality tool — duplicates inflate term frequencies, distort clusters, and mislead counts.
In Archēglyph. On the 90-day roadmap as the near- duplicate detection plugin. Improves cluster quality as a side effect.
Not to be confused with. Cryptographic hashes (SHA-256, etc.) are designed so similar inputs produce different hashes. MinHash is designed for the opposite — similar inputs produce similar hashes.