Glossary
A working reference for the AI, NLP, and digital-humanities terms Archēglyph touches — written for researchers, not ML engineers. Each entry explains the idea in plain English, says why it matters for your work, and (where relevant) whether Archēglyph uses it, rejects it, or plans to.
If you landed here from an article or the roadmap and hit a term that feels implied rather than explained, this is the page that explains it.
Generative AI
What people usually mean by "AI" in 2026 headlines — and how it differs from what Archēglyph does.
-
Fine-tuning
Taking a pre-trained model and continuing to train it on a narrower dataset so its behaviour shifts towards that domain.
-
Generative AI
Any model whose output is newly produced content — text, image, audio — rather than a classification, ranking, or extracted span.
-
Hallucination
A generative model's output that is fluent, confident, and wrong.
-
Large language model
(LLM) A very large neural network trained to predict the next token of text, the thing meant by 'AI' in most 2020s headlines.
-
Prompt
The text you hand to a generative model to shape its output — instructions, context, worked examples, and the user's question.
-
Retrieval-augmented generation
(RAG) A pattern where a generative model is first fed the results of a search over your documents, then asked to write an answer grounded in what it retrieved.
-
Training vs inference
Training is how a model learns — done once, expensively. Inference is how the model is used — each prediction is a forward pass over frozen weights.
Machine learning basics
The vocabulary that shows up once per article: model, embedding, transformer, token.
-
Embedding
A list of numbers — usually 384 to 1024 dimensions — that represents the meaning of a chunk of text.
-
Embedding model
The specific neural network that turns a chunk of text into its embedding vector. Different embedding models produce different geometries from the same text.
-
Layout analysis
(Layout assessment) The step before OCR that identifies and labels regions on a page image: text blocks, headers, footnotes, figures, tables, marginalia.
-
Model
A trained function from input to output. Umbrella term — an LLM, an embedding model, an OCR engine, and a classifier are all models.
-
Tokenization
Splitting text into the sub-word pieces a model actually sees. Token length — not word count — is what drives cost and context-window limits.
-
Transformer
The neural-network architecture, introduced in 2017, that sits behind nearly every modern language, vision, and multimodal model.
-
Vision-language model
(VLM) A model that takes an image (or image + text) as input and produces text: describing, transcribing, or classifying what it sees.
Language techniques
Specific NLP methods you will see referenced in DH work — NER, stylometry, topic modelling.
-
Language detection
A classifier that assigns a likely natural language (or script) to a span of text.
-
MinHash + LSH
(MinHash · Locality-sensitive hashing) A pair of techniques that find near-duplicate texts in a corpus without comparing every pair.
-
Named-entity recognition
(NER) Tagging spans of text that refer to people, places, organisations, dates, and similar categories.
-
Sentence segmentation
Splitting a stream of prose into sentences. The rules look simple until you meet historical punctuation or OCR noise.
-
Stylometry
Using measurable features of writing — function-word frequencies, sentence lengths, punctuation — to profile authorship.
-
TF-IDF
(Term frequency–inverse document frequency) A classical score for a word's importance to a document: high when the word is frequent *here* but rare across the corpus.
-
Topic modelling
(Topic modeling) An unsupervised technique that groups documents or passages by the themes they share, surfaced as lists of related words per topic.
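To make the TF-IDF entry above concrete, here is a minimal sketch in plain Python. It assumes one common textbook variant (term frequency normalised by document length, and idf = log(N / df)); real search libraries use smoothed or otherwise adjusted formulas, and nothing here is meant to describe how Archēglyph computes its scores. The function name `tf_idf` and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    `docs` is a list of token lists. Assumed variant (one of several):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, invented for this example.
corpus = [
    "the whale surfaced near the ship".split(),
    "the ship left the harbour".split(),
    "the harbour was quiet".split(),
]
scores = tf_idf(corpus)
```

In this toy corpus "the" occurs in every document, so its score is zero everywhere, while "whale" (frequent here, absent elsewhere) outranks "ship" in the first document.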
Search and organisation
How a corpus gets indexed, ranked, clustered, and visualised.
-
BM25
The standard ranking function for lexical search — what Lucene, Tantivy, and Elasticsearch use under the hood.
-
Clustering
Grouping items by similarity without pre-specifying what the groups should be. An unsupervised view of the corpus.
-
Cosine similarity
A number between -1 and 1: the cosine of the angle between two vectors. 1 means they point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions.
-
Extractive question answering
(Extractive QA) Question answering that returns a span from a real document. It can pick the wrong span, but it cannot make one up.
-
HDBSCAN
(Hierarchical density-based clustering) A clustering algorithm that finds clusters of varying sizes in dense regions and leaves outliers unlabelled rather than forcing them in.
-
OCR
(Optical character recognition) Reading characters off a page image and producing machine-readable text.
-
Semantic search
Search that ranks by meaning rather than literal word overlap — useful when the corpus's vocabulary isn't your query's vocabulary.
-
UMAP
(Uniform Manifold Approximation and Projection) A technique that squashes high-dimensional vectors down to 2 or 3 dimensions for visualisation.
-
Vector search
(Approximate nearest neighbour search · ANN) Finding the k chunks whose embeddings are closest to a query embedding. The mechanical step behind semantic search and neighbour lookup.
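The Cosine similarity and Vector search entries above fit together in a few lines of code. The sketch below is an exact, brute-force version: it scores every chunk against the query and keeps the top k. Approximate nearest neighbour indexes exist to avoid this full scan on large corpora, at the price of occasionally missing a true neighbour. The function names, chunk IDs, and three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest_chunks(query, chunk_vectors, k=2):
    """Exact search: score every chunk, then keep the k highest scores."""
    scored = [
        (cosine_similarity(query, vector), chunk_id)
        for chunk_id, vector in chunk_vectors.items()
    ]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical toy embeddings, not output from any real model.
chunks = {
    "chunk-a": [1.0, 0.0, 0.0],   # same direction as the query
    "chunk-b": [0.9, 0.1, 0.0],   # almost the same direction
    "chunk-c": [0.0, 1.0, 0.0],   # orthogonal to the query
    "chunk-d": [-1.0, 0.0, 0.0],  # opposite direction
}
query = [1.0, 0.0, 0.0]
top = nearest_chunks(query, chunks, k=2)
```

Here `top` contains "chunk-a" followed by "chunk-b"; the orthogonal and opposite vectors score 0 and -1 and are left out.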
Digital humanities
Terms that predate the computational turn and still carry the argument — corpus, provenance, diachronic.
-
Citation extraction
Finding and structuring citations — footnote markers, bibliographic entries, in-line references — so they become queryable objects.
-
Co-occurrence
Two entities (or terms) co-occur when they appear together within some window: a sentence, a paragraph, or a whole document.
-
Corpus
A bounded collection of texts gathered to answer a research question — not just 'all my PDFs'.
-
Diachronic analysis
Analysis that tracks change across time — vocabulary, concepts, named-entity mentions, sentence structure.
-
Provenance
The record of where something came from. In DH, the chain of custody of a source; in computing, which model produced which output.
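As a rough illustration of the Co-occurrence entry above, the following sketch counts how often pairs of entities share a window. It assumes the entities have already been tagged (for example by NER) and grouped per sentence; the function name `cooccurrence_counts` and the entity names are placeholders invented for this example, not data from any real corpus.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(windows):
    """Count how often each unordered pair of entities shares a window.

    `windows` is a list of entity lists, one per sentence, paragraph,
    or document, depending on the window size you choose.
    """
    counts = Counter()
    for entities in windows:
        # set() so a repeated mention inside one window counts once;
        # sorted() so each pair is always stored in the same order.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Placeholder entities per sentence, invented for this example.
sentences = [
    ["Alice", "Bruges"],
    ["Alice", "Bruges", "Guild"],
    ["Guild", "Alice"],
    ["Bruges"],
]
counts = cooccurrence_counts(sentences)
```

Changing the window from sentence to paragraph or document only changes how `windows` is built; the counting step stays the same, though wider windows will generally report more pairs.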
Archēglyph vocabulary
Names for things that are specific to this platform — bundle, versioning, analysis plugin.
-
Analysis plugin
Archēglyph's contract for adding a new analysis technique. A single Python file that declares its dependencies, cost tier, and the UI surfaces it extends.
-
Artefact
Our internal library that owns every bundle read and write. Every pipeline step that changes a dataset goes through it.
-
Bundle
The per-dataset artefact that holds everything Archēglyph needs to serve a dataset — Tantivy index, zvec index, and metadata SQLite.
-
Chunk
The unit of analysis inside Archēglyph — a passage of several sentences, small enough for an embedding model to handle in one pass.
-
Dataset note
A plain-language summary of what a dataset contains and how its current results were produced — auto-generated, owner-editable.
-
Public dataset
A dataset whose owner has turned on public visibility. Gets a canonical URL and a license; some surfaces open up, others stay login-gated.
-
Versioning
(Dataset versions · Versions) Named, immutable snapshots of a dataset's bundle at a point in time. Let you return to exactly that state later.