Archeglyph
Statistics-driven insight into archives, interpreted by the domain experts who know them.
What researchers do with Archeglyph
- See patterns across a corpus you can't read cover to cover.
- Ground every cluster, every quotation, in a citable source image.
- Share a dataset as a single reproducible artefact — index, vectors, clusters, and all.
Who it's for
Built for researchers in the humanities — historians, philologists, archivists, intellectual historians, art historians, and social scientists working with archival material — who already know their sources. Archeglyph amplifies that expertise — it doesn't replace judgment.
Under the hood
Four stages, each one inspectable. Read the full pipeline guide for details.
Where we fit in the corpus stack — downstream of Trove, Chronicling America, and friends.
- 01 Upload: Bring image-based PDFs or page scans into a dataset. Each file is hashed and stored as-is.
- 02 Assess: A vision-language model reads the layout and returns ordered regions — headlines, body, captions, figures.
- 03 Extract: Each region is read with the engine you pick: Tesseract for clean print, a VLM for degraded or non-Latin scripts.
- 04 Analyse: Extracted text is chunked, embedded, indexed for lexical and semantic search, and grouped into readable clusters.
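The shape of these stages can be sketched in miniature with nothing but the standard library. Everything below is illustrative — the function names, chunk size, and index layout are assumptions for the sketch, not Archeglyph's actual API, and the VLM-driven Assess and Extract stages are elided.

```python
import hashlib
import re


def hash_file_bytes(data: bytes) -> str:
    # Stage 1, Upload: content-address each file so duplicates and
    # re-uploads are detectable from the digest alone.
    return hashlib.sha256(data).hexdigest()


def chunk(text: str, max_words: int = 50) -> list[str]:
    # Stage 4, Analyse: split extracted text into fixed-size word
    # windows, the unit that gets embedded and indexed.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(chunks: list[str]) -> dict[str, set[int]]:
    # A toy lexical index: token -> set of chunk ids. Real lexical
    # search adds ranking, but the lookup structure is the same idea.
    index: dict[str, set[int]] = {}
    for i, c in enumerate(chunks):
        for token in re.findall(r"\w+", c.lower()):
            index.setdefault(token, set()).add(i)
    return index
```

The point of the sketch is the data flow: a stable hash identifies the source, chunks are the searchable unit, and the index maps terms back to chunks — and, in the real pipeline, from chunks back to page images.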
- article: Downstream of Trove: where analysis fits in the corpus stack
  Digitisation projects like Trove, Chronicling America, and Europeana produce the corpus. Archeglyph produces the analysis on top of it. They are layers of the same stack — not competitors, not substitutes.
- article: Orthogonal to LLM 'deep research'
  Deep-research agents synthesise. Archeglyph indexes. They are different products solving different problems for different research workflows. Knowing which you need keeps your citations defensible.
- article: The citable-claim test
  A simple test for whether a research tool produces output you can defend in a footnote: can you, in one click, see the page the claim came from? If not, the tool is for exploration, not for scholarship.
- guide: Exporting and archiving a dataset
  A forward-looking but grounded walkthrough of Archeglyph's dataset snapshot: what goes into the tarball, how to open it without the product, and how to cite a snapshot in a paper.
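As a rough illustration of "opening it without the product": assuming the snapshot is an ordinary gzipped tarball (the guide above documents the actual layout; nothing here is specified), Python's stdlib `tarfile` can inspect it read-only.

```python
import tarfile


def list_snapshot(path: str) -> list[str]:
    # Open a snapshot read-only and list its files without extracting,
    # so the archive stays byte-identical to what was cited.
    with tarfile.open(path, "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]
```

Listing before extracting is a small archival habit worth keeping: it confirms the artefact is intact without modifying or scattering its contents.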
- guide: Reviewing a noisy scan
  A walkthrough of the review screen on a low-quality scan: what to look for, how to read the confidence tint, and when to re-run a region — or the whole page — with a VLM instead.
- guide: OCR vs VLM: a practical chooser
  A short, decision-oriented guide to picking the right extraction engine for your corpus. When Tesseract is the right default, when a VLM is worth the cost, and how to test the choice cheaply.
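One cheap way to test the choice is to hand-transcribe a page or two and rank each engine's output against that reference. The sketch below is one possible harness, not Archeglyph's: the engines are stubbed as plain strings, and the error metric is a rough difflib similarity proxy rather than a proper character error rate.

```python
import difflib


def char_error_rate(reference: str, hypothesis: str) -> float:
    # Rough proxy for CER: 1 minus difflib's similarity ratio.
    # Coarse, but enough to rank engines on a small hand-checked sample.
    return 1.0 - difflib.SequenceMatcher(None, reference, hypothesis).ratio()


def pick_engine(reference: str, outputs: dict[str, str]) -> str:
    # outputs maps engine name -> extracted text for the same sample page.
    # Returns the engine whose output is closest to the hand transcription.
    return min(outputs, key=lambda name: char_error_rate(reference, outputs[name]))
```

A few transcribed pages per document type is usually enough to see whether the pricier engine actually earns its cost on your corpus.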
Start a dataset.
Sign in with a magic link, upload a handful of pages, and watch the pipeline run end to end. Everything above is what you get by default.