Articles
Essays on the choices behind Archeglyph: why we chose certain defaults, where OCR still wins over VLMs, and how to read cluster views without a statistics background.
- positioning · method · digitisation
Downstream of Trove: where analysis fits in the corpus stack
Digitisation projects like Trove, Chronicling America, and Europeana produce the corpus. Archeglyph produces the analysis on top of it. They are layers of the same stack — not competitors, not substitutes.
- positioning · method · llm · research-workflow
Orthogonal to LLM 'deep research'
Deep-research agents synthesise. Archeglyph indexes. They are different products solving different problems for different research workflows. Knowing which you need keeps your citations defensible.
- method · citation · epistemology
The citable-claim test
A simple test for whether a research tool produces output you can defend in a footnote: can you, in one click, see the page the claim came from? If not, the tool is for exploration, not for scholarship.
- transparency · hallucination · method
Why Archeglyph cannot hallucinate
Hallucination is a property of generative systems. Archeglyph isn't one. Every line of text the system surfaces was already in the source corpus — and we can show you which page it came from.
- clustering · interpretation · ui
Reading clusters as a researcher
The Archeglyph cluster view leads with quotations, not scatterplots. Here is how to use it — and why the scatterplot is behind a toggle.
- architecture · product · snapshots
Why we snapshot per dataset
The product decision behind Archeglyph's dataset snapshot: one tarball that bundles a Tantivy index, a zvec embedding store, and a SQLite catalogue. Why the three belong together, and why the unit is the dataset, not the document.
- embeddings · semantic-search · practical
Choosing an embedding model for digital humanities
A practical comparison of MiniLM-L6-v2 and BGE-small-en-v1.5 for DH corpora: what each optimises for, when the extra dimensions earn their keep, and how to decide without running a benchmark you cannot reproduce.
- extraction · ocr · vlm
VLM vs OCR: when to pick what
Notes from the newspapers prototype on when Tesseract is still the right choice, when a vision-language model earns its cost, and how to tell the difference before a full run.
- transparency · ux · design
What a good provenance badge looks like
UX writing about the transparency contract: what goes inside the badge, what gets omitted, and why the re-run affordance lives next to it. With ASCII mockups of the patterns we use in the review screen, search results, and cluster cards.
- transparency · product
Transparency is a feature
Why every extracted text block in Archeglyph shows the model that produced it, and why we treat that disclosure as product surface rather than footer text.