Archēglyph

TF-IDF

Also: Term frequency–inverse document frequency

A classical score for a word's importance to a document — high when the word is frequent *here* but rare across the corpus.


Term Frequency–Inverse Document Frequency scores a word’s importance to a document by weighing how often it appears there against how rare it is across the whole corpus. A high TF-IDF score means the word is distinctive to that document.
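A minimal sketch of the computation may make the definition concrete. TF-IDF has several common variants; this one assumes term frequency is a word's share of the document's tokens and inverse document frequency is the unsmoothed natural logarithm of (number of documents / number of documents containing the word). The function name `tf_idf` and the toy corpus are illustrative only and are not taken from Archēglyph.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenised document.

    Assumed variant: tf = count / document length,
    idf = ln(N / document frequency), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, tokenised by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)

# The highest-scoring term in the first document.
top_term = max(scores[0], key=scores[0].get)
```

In this toy corpus "the" occurs in every document, so its inverse document frequency is ln(1) = 0 and it scores zero everywhere, however frequent it is. "mat" occurs only in the first document, so it comes out as that document's most distinctive word.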

Why it matters for your research. TF-IDF is the simplest, most legible way to surface the vocabulary that characterises a document or a cluster. The score is interpretable — no black-box embeddings — which makes it a good diagnostic tool even in projects that rely on modern methods elsewhere.

In Archēglyph. A stepping-stone towards the NMF-based topic-modelling plugin on the roadmap; also, conceptually, the foundation beneath BM25.

Not to be confused with. An embedding encodes meaning; TF-IDF encodes vocabulary weight. Both can place documents in a geometry, but they are different geometries.
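The difference in geometry can be illustrated with a small sketch. It assumes the same simple TF-IDF variant as is commonly taught (relative term frequency times unsmoothed natural-log inverse document frequency) and compares documents by cosine similarity; the helper names `tfidf_vectors` and `cosine` and the three sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors as {term: weight} dicts (assumed variant:
    relative term frequency times ln(N / document frequency))."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n_docs / df[t])
         for t, c in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "the physician examined the patient".split(),
    "a doctor saw someone ill".split(),        # similar meaning, no shared words
    "the patient paid the invoice".split(),    # different topic, shared words
]
vectors = tfidf_vectors(docs)

sim_synonyms = cosine(vectors[0], vectors[1])
sim_shared_words = cosine(vectors[0], vectors[2])
```

Here the first two sentences describe much the same event but share no vocabulary, so their TF-IDF similarity is exactly zero, while the first and third share "the" and "patient" and therefore score above zero. An embedding model would typically be expected to rank these pairs the other way round.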
