Corpus
A bounded collection of texts gathered to answer a research question. A corpus is not just “all my PDFs”: a collection becomes a corpus when you can describe what is in it, what is out, and why. Provenance, scope, and completeness are as much a part of the corpus as the documents themselves.
Why it matters for your research. Analysis of a badly scoped corpus produces badly scoped results, and no tool can diagnose that for you. Every dataset in Archēglyph starts with a dataset note because the corpus is the first object of study, before any model is run on it.
In Archēglyph. Each dataset is one corpus, with a human-editable dataset note describing scope, provenance, and known gaps.
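To make the idea of a dataset note concrete, here is a minimal sketch in Python of what such a note might record. The class `DatasetNote`, its field names, and the `missing_fields` helper are hypothetical illustrations built from the three items named above (scope, provenance, known gaps); they are not Archēglyph's actual schema or API, and the sample values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetNote:
    """Hypothetical dataset note; field names are illustrative only."""

    scope: str        # what is in the corpus, what is out, and why
    provenance: str   # where the texts came from and how they were gathered
    known_gaps: list[str] = field(default_factory=list)  # documented omissions

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that were left blank."""
        required = {"scope": self.scope, "provenance": self.provenance}
        return [name for name, value in required.items() if not value.strip()]


# Placeholder values, invented purely for illustration.
note = DatasetNote(
    scope="Example: letters from one archive, 1850-1870; printed editions excluded.",
    provenance="",
    known_gaps=["Example: volumes lost before digitisation"],
)

print(note.missing_fields())  # prints ['provenance']
```

An empty `known_gaps` list is allowed here because a corpus may genuinely have no documented gaps, whereas a blank scope or provenance means the collection has not yet been described well enough to count as a corpus in the sense above.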
Not to be confused with. A “training corpus” is the set of texts a model was trained on — a property of the model, not the dataset you’re analysing.