TF-IDF
Also: Term frequency–inverse document frequency
A classical score for a word's importance to a document — high when the word is frequent *here* but rare across the corpus.
Term Frequency–Inverse Document Frequency scores a word’s importance to a document by weighing how often it appears there against how rare it is across the whole corpus. A high TF-IDF score means the word is distinctive to that document.
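To make the weighing concrete, here is a minimal sketch in plain Python. It assumes one common variant of the formula: term frequency as count divided by document length, and inverse document frequency as log(N / df), where N is the number of documents and df the number of documents containing the term. Other weightings exist, and library implementations usually add smoothing. The function name `tf_idf` and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumes `docs` is a list of token lists. Uses tf = count / doc length
    and idf = log(N / df), one common variant among several.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Toy corpus (hypothetical), already tokenised by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)

# The highest-scoring term of the first document.
top_term = max(scores[0], key=scores[0].get)
```

In this toy corpus "the" occurs in every document, so its idf is log(3/3) = 0 and it scores zero everywhere, however frequent it is. "mat" occurs only in the first document and therefore outranks "cat", which also appears in the third.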
Why it matters for your research. TF-IDF is the simplest, most legible way to surface the vocabulary that characterises a document or a cluster. The score is interpretable — no black-box embeddings — which makes it a good diagnostic tool even in projects that rely on modern methods elsewhere.
In Archēglyph. A stepping stone towards the NMF-based topic-modelling plugin on the roadmap; TF-IDF is also the conceptual foundation of BM25.
Not to be confused with. An embedding encodes meaning; TF-IDF encodes vocabulary weight. Both can place documents in a geometry, but they are different geometries.
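The difference in geometry can be illustrated with a small sketch, under the same assumed formula (tf = count / length, idf = log(N / df)) and with hypothetical helper names and sentences. In TF-IDF space, two documents are close only if they share distinctive words; synonyms count for nothing.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors as {term: weight} dicts (assumed variant)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [
        {t: (c / len(d)) * math.log(n / df[t]) for t, c in Counter(d).items()}
        for d in docs
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sentences: the first two are close in meaning but share
# no distinctive word; the first and third share the word "physician".
docs = [
    "the physician treated the patient".split(),
    "the doctor cured the sick man".split(),
    "the physician wrote a letter".split(),
]
vecs = tfidf_vectors(docs)
sim_synonyms = cosine(vecs[0], vecs[1])  # related meaning, no shared vocabulary
sim_shared = cosine(vecs[0], vecs[2])    # different meaning, shared vocabulary
```

Here `sim_synonyms` is exactly zero, because the only word the two sentences share is "the", whose weight is zero. `sim_shared` is positive purely because of the shared word "physician". An embedding model would typically rank these pairs the other way round, which is the sense in which the two geometries differ.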