Archēglyph

Language detection

A classifier that assigns a likely natural language (or script) to a span of text. The short-text case (captions, marginalia, quoted phrases, headings) is harder than the long-document case, because a short span gives the classifier far less evidence to work with.

Why it matters for your research. Corpora with code-switching, translated quotations, or multi-script pages need per-chunk language labels so analysis tools — especially embedding models — aren’t applied to text they weren’t trained on. For many DH corpora, language labelling is also a research dimension in its own right.
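As a rough illustration of per-chunk labelling, the sketch below tags each chunk with a language code by counting stopword hits, and returns "und" (undetermined) when the evidence is too thin, which is what tends to happen with very short spans. The stopword sets, the `label_chunk` helper, the `min_hits` threshold, and the sample chunks are all assumptions made up for this example; a real corpus would call for a trained language identifier rather than a toy like this.

```python
import re
from collections import Counter

# Tiny hand-picked stopword sets: illustrative assumptions, not a real model.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "que", "un", "une"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu"},
}


def label_chunk(text, min_hits=2):
    """Return a language code, or "und" when the chunk offers too little evidence."""
    tokens = re.findall(r"\w+", text.lower())
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    ranked = sorted(hits.values(), reverse=True)
    # Refuse to guess on too few hits or on a tie between the top two languages.
    if ranked[0] < min_hits or ranked[0] == ranked[1]:
        return "und"
    return max(hits, key=hits.get)


# Hypothetical chunks from a mixed-language page.
chunks = [
    "The author notes that the manuscript is in poor condition.",
    "Le copiste précise que la marge est de la main d'un autre.",
    "Die Handschrift ist nicht datiert und das Blatt ist beschädigt.",
    "Nota bene",
]

labels = [label_chunk(chunk) for chunk in chunks]

# Counting labels gives a simple per-language breakdown of the corpus.
facet = Counter(labels)
print(labels)
print(facet)
```

The two-word heading "Nota bene" comes back as "und" rather than being forced into one of the three languages. For downstream tools such as embedding models, an explicit "don't know" label is usually safer than a confident wrong one.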

In Archēglyph. On the roadmap as a faceted statistic; the resulting labels will later feed the choice of embedding model per chunk.

Not to be confused with. Script detection (Latin vs Cyrillic vs Arabic) is easier than language detection within a script; both may be needed on mixed corpora.
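To make the distinction concrete, here is a minimal sketch of script detection using only Python's standard library. It relies on a heuristic, not an official script property: the first word of a character's Unicode name (for example "LATIN", "CYRILLIC", "ARABIC") is taken as its script. The `dominant_script` helper is a hypothetical name for this example.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the dominant script from Unicode character names (a heuristic)."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            # The first word of the name, e.g. "LATIN SMALL LETTER A" -> "LATIN".
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


print(dominant_script("the manuscript"))
print(dominant_script("le manuscrit"))
print(dominant_script("рукопись"))
print(dominant_script("مخطوطة"))
```

The English and French phrases both resolve to the same script, so a script label alone cannot separate them. That is the gap that language detection within a script has to fill.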
