Language detection
A classifier that assigns a likely natural language (or script) to a span of text.
Short spans such as captions, marginalia, quoted phrases, and headings are harder to classify than long documents, which give the classifier more context to work with.
Why it matters for your research. Corpora with code-switching, translated quotations, or multi-script pages need per-chunk language labels so that analysis tools, especially embedding models, aren’t applied to text they weren’t trained on. For many digital humanities (DH) corpora, language labelling is also a research dimension in its own right.
In Archēglyph. On the roadmap as a faceted statistic; later feeds the choice of embedding model per chunk.
Not to be confused with. Script detection (Latin vs Cyrillic vs Arabic) is easier than language detection within a script; both may be needed on mixed corpora.
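One reason script detection is the easier task is that scripts occupy largely separate regions of Unicode, so a rough label can be read off the characters themselves. The sketch below is illustrative only and is not Archēglyph's implementation: the function name `dominant_script` is hypothetical, and it approximates the script by taking the first word of each letter's Unicode character name, since Python's standard library does not expose the Unicode Script property directly.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return a rough script label (e.g. 'LATIN') for the letters in `text`.

    Heuristic: the first word of a letter's Unicode name usually names its
    script. This is an approximation, not the official Script property.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and whitespace
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue  # character has no name in this Unicode database
        counts[name.split()[0]] += 1
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["le chat dort", "the chat log", "привет мир", "مرحبا"]:
        print(repr(sample), dominant_script(sample))
```

Both the French and the English sample come back as LATIN, which is the distinction drawn above: the script is recoverable from the characters alone, while telling languages apart within a script needs a statistical model of words or character sequences.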