Articles

Essays on the choices behind Archēglyph: why we chose certain defaults, where OCR still wins over VLMs, and how to read cluster views without a statistics background.

13 May 2026 positioning · method · digitisation

Downstream of Trove: where analysis fits in the corpus stack

Digitisation projects like Trove, Chronicling America, and Europeana produce the corpus. Archēglyph produces the analysis on top of it. They are layers of the same stack, not competitors and not substitutes.

11 May 2026 positioning · method · llm · research-workflow

Orthogonal to LLM 'deep research'

Deep-research agents synthesise. Archēglyph indexes. They are different products solving different problems for different research workflows. Knowing which you need keeps your citations defensible.

9 May 2026 method · citation · epistemology

The citable-claim test

A simple test for whether a research tool produces output you can defend in a footnote: can you, in one click, see the page the claim came from? If not, the tool is for exploration, not for scholarship.

7 May 2026 transparency · hallucination · method

Why Archēglyph cannot hallucinate

Hallucination is a property of generative systems. Archēglyph isn't one. Every line of text the system surfaces was already in the source corpus, and we can show you which page it came from.

5 May 2026 clustering · interpretation · ui

Reading clusters as a researcher

The Archēglyph cluster view leads with quotations, not scatterplots. Here is how to use it, and why the scatterplot is behind a toggle.

1 May 2026 architecture · product · snapshots

Why we snapshot per dataset

The product decision behind Archēglyph's dataset snapshot: one tarball that bundles a tantivy index, a zvec embedding store, and a sqlite catalogue. The essay explains why the three belong together and why the unit is the dataset, not the document.

29 Apr 2026 embeddings · semantic-search · practical

Choosing an embedding model for digital humanities

A practical comparison of MiniLM-L6-v2 and BGE-small-en-v1.5 for DH corpora: what each optimises for, when the extra dimensions earn their keep, and how to decide without running a benchmark you cannot reproduce.

25 Apr 2026 extraction · ocr · vlm

VLM vs OCR: when to pick what

Notes from the newspapers prototype on when Tesseract is still the right choice, when a vision-language model earns its cost, and how to tell the difference before a full run.

17 Apr 2026 transparency · ux · design

What a good provenance badge looks like

UX writing about the transparency contract: what goes inside the badge, what gets omitted, and why the re-run affordance lives next to it. With ASCII mockups of the patterns we use in the review screen, search results, and cluster cards.

15 Apr 2026 transparency · product

Transparency is a feature

Why every extracted text block in Archēglyph shows the model that produced it, and why we treat that disclosure as product surface rather than footer text.