Archeglyph

Statistics-driven insight into archives, interpreted by the domain experts who know them.

What researchers do with Archeglyph

  • See patterns across a corpus you can't read cover to cover.

  • Ground every cluster, every quotation, in a citable source image.

  • Share a dataset as a single reproducible artefact — index, vectors, clusters, and all.

Who it's for

Built for researchers in the humanities who already know their sources: historians, philologists, archivists, intellectual historians, art historians, and social scientists working with archival material. Archeglyph amplifies judgment; it doesn't replace it.

Under the hood

Four stages, each one inspectable. Read the full pipeline guide for details.

Where we fit in the corpus stack — downstream of Trove, Chronicling America, and friends.

  1. Upload

    Bring image-based PDFs or page scans into a dataset. Each file is hashed and stored as-is.

  2. Assess

    A vision-language model reads the layout and returns ordered regions: headlines, body, captions, figures.

  3. Extract

    Each region is read with the engine you pick: Tesseract for clean print, a VLM for degraded or non-Latin scripts.

  4. Analyse

    Extracted text is chunked, embedded, indexed for lexical and semantic search, and grouped into readable clusters.
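
For readers who think in code, the stages above can be pictured roughly as follows. This is a minimal sketch, not Archeglyph's actual implementation: the hash algorithm (SHA-256), the `Region` fields, the function names, and the chunk sizes are all assumptions made for illustration, and the embedding, indexing, and clustering steps are left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Region:
    """One layout region from the Assess stage (hypothetical shape)."""
    order: int          # reading order on the page
    kind: str           # e.g. "headline", "body", "caption", "figure"
    degraded: bool      # assumed flag: the scan is hard to read
    latin_script: bool  # assumed flag: the text is in a Latin script


def file_digest(data: bytes) -> str:
    """Upload: hash the raw bytes so the stored original can be identified.

    SHA-256 is an assumption; the page only says files are hashed.
    """
    return hashlib.sha256(data).hexdigest()


def pick_engine(region: Region) -> str:
    """Extract: Tesseract for clean print, a VLM for degraded or non-Latin text."""
    if region.degraded or not region.latin_script:
        return "vlm"
    return "tesseract"


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Analyse: split text into overlapping word windows ahead of embedding."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

In the product the engine is something you pick per dataset; `pick_engine` only illustrates the rule of thumb the page gives for making that choice.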

Start a dataset.

Sign in with a magic link, upload a handful of pages, and watch the pipeline run end to end. Everything above is what you get by default.