Archēglyph

MinHash + LSH

Also: MinHash · Locality-sensitive hashing

A pair of techniques that find near-duplicate texts in a corpus without comparing every pair.

Last updated

MinHash is a compact fingerprint of a text (derived from hashed n-grams) that lets you estimate how similar two texts are from the fingerprints alone. Locality-Sensitive Hashing (LSH) indexes those fingerprints so near-duplicates can be found without comparing every pair. Together they scale to corpora where all-pairs comparison would be infeasible.

Why it matters for your research. Primary-source corpora are full of reprints, boilerplate, syndicated articles, translated copies, and scans of the same issue from multiple libraries. Finding these is a research question (which papers copied from which?) and a data-quality tool — duplicates inflate term frequencies, distort clusters, and mislead counts.

In Archēglyph. On the 90-day roadmap as the near- duplicate detection plugin. Improves cluster quality as a side effect.

Not to be confused with. Cryptographic hashes (SHA-256, etc.) are designed so similar inputs produce different hashes. MinHash is designed for the opposite — similar inputs produce similar hashes.

Related terms

References

← Back to the glossary