Corpus

A bounded collection of texts gathered to answer a research question — not just 'all my PDFs'.

Last updated 20 April 2026

A collection becomes a corpus when you can describe what is in it, what is out, and why. Provenance, scope, and completeness are as much a part of the corpus as the documents themselves.

Why it matters for your research. Analysis of a badly scoped corpus produces badly scoped results, and no tool can diagnose that for you. Every dataset in Archēglyph starts with a dataset note because the corpus is the first object of study, before any model is run on it.

In Archēglyph. Each dataset is one corpus, with a human-editable dataset note describing scope, provenance, and known gaps.
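As a rough illustration of what such a note might record, here is a minimal Python sketch. The class name `DatasetNote`, its field names, and the example values are all hypothetical and chosen for this glossary entry; they are not Archēglyph's actual schema or data. The sketch only shows the idea that a pile of files lacks the scope and provenance a corpus needs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetNote:
    """Hypothetical dataset note; field names are illustrative only."""

    research_question: str  # the question the corpus was gathered to answer
    included: str           # what is in the corpus
    excluded: str           # what is deliberately left out
    rationale: str          # why the boundary sits where it does
    provenance: str         # where the texts came from
    known_gaps: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of required text fields that are blank."""
        required = ["research_question", "included", "excluded",
                    "rationale", "provenance"]
        return [name for name in required if not getattr(self, name).strip()]

    def is_bounded(self) -> bool:
        """True only when every required field has been filled in."""
        return not self.missing_fields()


# Invented example values, for illustration only.
note = DatasetNote(
    research_question="How do the letters describe travel?",
    included="Letters in the collection dated within the study period",
    excluded="Undated letters and printed circulars",
    rationale="Undated items cannot be placed on the timeline",
    provenance="Transcribed from a single archive's finding aid",
    known_gaps=["Some transcriptions are partial"],
)

# "All my PDFs": contents are named, but nothing else is described.
pile = DatasetNote("", "all my PDFs", "", "", "")

print(note.is_bounded())      # the described collection passes the check
print(pile.missing_fields())  # the undescribed pile reports its blank fields
```

An empty `known_gaps` list is allowed here on purpose: a note can honestly report no known gaps, whereas a blank scope or provenance means the collection has not yet been described.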

Not to be confused with. A “training corpus” is the set of texts a model was trained on — a property of the model, not the dataset you’re analysing.

References

McEnery & Hardie — Corpus Linguistics (2012)