Archeglyph
Statistics-driven insight into archives, interpreted by the domain experts who know them.
What researchers do with Archeglyph
- See patterns across a corpus you can't read cover to cover.
- Ground every cluster, every quotation, in a citable source image.
- Share a dataset as a single reproducible artefact — index, vectors, clusters, and all.
Who it's for
Built for researchers in the humanities — historians, philologists, archivists, intellectual historians, art historians, and social scientists working with archival material — who already know their sources. Archeglyph amplifies that expertise — it doesn't replace judgment.
Under the hood
Four stages, each one inspectable. Read the full pipeline guide for details.
Where we fit in the corpus stack — downstream of Trove, Chronicling America, and friends.
- 01 Upload: Bring image-based PDFs or page scans into a dataset. Each file is hashed and stored as-is.
- 02 Assess: A vision-language model reads the layout and returns ordered regions — headlines, body, captions, figures.
- 03 Extract: Each region is read with the engine you pick: Tesseract for clean print, a VLM for degraded or non-Latin scripts.
- 04 Analyse: Extracted text is chunked, embedded, indexed for lexical and semantic search, and grouped into readable clusters.
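The shape of these stages can be sketched in miniature with nothing but the standard library. Everything below is illustrative — the function names, chunk size, and index layout are assumptions for the sketch, not Archeglyph's actual API, and the VLM-driven Assess and Extract stages are elided.

```python
import hashlib
import re


def hash_file_bytes(data: bytes) -> str:
    # Stage 1, Upload: content-address each file so duplicates and
    # re-uploads are detectable from the digest alone.
    return hashlib.sha256(data).hexdigest()


def chunk(text: str, max_words: int = 50) -> list[str]:
    # Stage 4, Analyse: split extracted text into fixed-size word
    # windows, the unit that gets embedded and indexed.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(chunks: list[str]) -> dict[str, set[int]]:
    # A toy lexical index: token -> set of chunk ids. Real lexical
    # search adds ranking, but the lookup structure is the same idea.
    index: dict[str, set[int]] = {}
    for i, c in enumerate(chunks):
        for token in re.findall(r"\w+", c.lower()):
            index.setdefault(token, set()).add(i)
    return index
```

The point of the sketch is the data flow: a stable hash identifies the source, chunks are the searchable unit, and the index maps terms back to chunks — and, in the real pipeline, from chunks back to page images.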
- article: Downstream of Trove: where analysis fits in the corpus stack
  Digitisation projects like Trove, Chronicling America, and Europeana produce the corpus. Archeglyph produces the analysis on top of it. They are layers of the same stack — not competitors, not substitutes.
- article: Orthogonal to LLM 'deep research'
  Deep-research agents synthesise. Archeglyph indexes. They are different products solving different problems for different research workflows. Knowing which you need keeps your citations defensible.
- article: The citable-claim test
  A simple test for whether a research tool produces output you can defend in a footnote: can you, in one click, see the page the claim came from? If not, the tool is for exploration, not for scholarship.
- guide: Exporting and archiving a dataset
  A forward-looking but grounded walkthrough of Archeglyph's dataset snapshot: what goes into the tarball, how to open it without the product, and how to cite a snapshot in a paper.
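As a rough illustration of "opening it without the product": assuming the snapshot is an ordinary gzipped tarball (the guide above documents the actual layout; nothing here is specified), Python's stdlib `tarfile` can inspect it read-only.

```python
import tarfile


def list_snapshot(path: str) -> list[str]:
    # Open a snapshot read-only and list its files without extracting,
    # so the archive stays byte-identical to what was cited.
    with tarfile.open(path, "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]
```

Listing before extracting is a small archival habit worth keeping: it confirms the artefact is intact without modifying or scattering its contents.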
- guide: Reviewing a noisy scan
  A walkthrough of the review screen on a low-quality scan: what to look for, how to read the confidence tint, and when to re-run a region — or the whole page — with a VLM instead.
- guide: OCR vs VLM: a practical chooser
  A short, decision-oriented guide to picking the right extraction engine for your corpus. When Tesseract is the right default, when a VLM is worth the cost, and how to test the choice cheaply.
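One cheap way to test the choice is to hand-transcribe a page or two and rank each engine's output against that reference. The sketch below is one possible harness, not Archeglyph's: the engines are stubbed as plain strings, and the error metric is a rough difflib similarity proxy rather than a proper character error rate.

```python
import difflib


def char_error_rate(reference: str, hypothesis: str) -> float:
    # Rough proxy for CER: 1 minus difflib's similarity ratio.
    # Coarse, but enough to rank engines on a small hand-checked sample.
    return 1.0 - difflib.SequenceMatcher(None, reference, hypothesis).ratio()


def pick_engine(reference: str, outputs: dict[str, str]) -> str:
    # outputs maps engine name -> extracted text for the same sample page.
    # Returns the engine whose output is closest to the hand transcription.
    return min(outputs, key=lambda name: char_error_rate(reference, outputs[name]))
```

A few transcribed pages per document type is usually enough to see whether the pricier engine actually earns its cost on your corpus.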
Start a dataset.
Sign in with a magic link, upload a handful of pages, and watch the pipeline run end to end. Everything above is what you get by default.