Archeglyph

The pipeline

A plain-language tour of the four stages a document passes through in Archeglyph: upload, assess, extract, and analyse. Written for the person using the product, not the person building it.

By Dipankar

On this page
  1. Upload
  2. Assess
  3. Extract
  4. Review (optional)
  5. Analyse
  6. What you see along the way
  7. What each stage costs
  8. Where to go next

This guide walks through what happens when you upload a page scan to Archeglyph, in the order it happens, at the level a researcher cares about. It is not the implementer’s view — if you want API shapes and worker topologies, see the architecture docs — but it should give you a clear mental model of what the product is doing with your files and why.

Upload

A document starts its life as an upload into a dataset. A dataset is the unit of grouping in Archeglyph: a corpus of related documents that share an extraction engine, an embedding model, and a clustering configuration. You might have one dataset per archival collection, or per research project, or per publication.

When you upload a PDF or image, Archeglyph:

  • Hashes the file so re-uploading the same PDF is a no-op.
  • Stores the original bytes untouched in object storage.
  • Renders the pages of a PDF to page images at a resolution suitable for both layout and extraction models.
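The dedupe step can be sketched in a few lines. This is an illustration, not the product's code: the guide only says the file is hashed, so the choice of SHA-256 and the in-memory dict standing in for object storage are both assumptions.

```python
import hashlib

def file_digest(data: bytes) -> str:
    # Content hash used as the dedupe key: identical bytes -> identical key.
    # SHA-256 is an illustrative choice; the guide doesn't name the hash.
    return hashlib.sha256(data).hexdigest()

def upload(store: dict, data: bytes) -> tuple[str, bool]:
    """Store original bytes under their hash; return (key, was_new)."""
    key = file_digest(data)
    if key in store:          # re-uploading the same file is a no-op
        return key, False
    store[key] = data         # original bytes kept untouched
    return key, True
```

Because the key is derived from the bytes themselves, uploading the same PDF twice hits the `key in store` branch and nothing is stored again.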

You see the file appear in the dataset’s document table with a status of uploaded. No extraction has happened yet — the next stage has to run first.

Assess

The second stage is layout assessment. Archeglyph sends each page image to the vision-language model you chose for the dataset (the default for new datasets is the smallest current Ollama Cloud VLM, which is cheap to run and adequate for clean scans). The model returns a list of regions: a bounding box, a kind (headline, body, caption, figure, or table), a reading order, and a confidence.

Why a VLM for this step rather than classical computer vision? In our experience on the newspapers prototype, classical column-detection works well on regular broadsheet layouts and breaks on almost everything else: irregular gutters, rotated headlines, embedded figures, book-style pages. A VLM handles the long tail because it has a language prior over what a page looks like. For the regular cases where classical CV would also work, Archeglyph retains a CV fallback that is offline and free — useful for very large newspaper-like runs where the VLM cost adds up.

The assessment’s output is what the next stage operates on: a set of labelled rectangles per page, each one a region that needs to be read.
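A region list shaped like the one described above might look like this. The field names, coordinate convention, and example values are all hypothetical; only the four facts from the text (bounding box, kind, reading order, confidence) are from the guide.

```python
from dataclasses import dataclass

KINDS = {"headline", "body", "caption", "figure", "table"}

@dataclass
class Region:
    # Bounding box in page-image pixel coordinates (illustrative convention).
    x: int
    y: int
    width: int
    height: int
    kind: str           # one of KINDS
    reading_order: int  # position in the page's reading sequence
    confidence: float   # model confidence in [0, 1]

# One page's assessment output: a headline over two body columns.
page_regions = sorted(
    [
        Region(40, 30, 1200, 90, "headline", 0, 0.97),
        Region(40, 140, 380, 900, "body", 1, 0.91),
        Region(440, 140, 380, 900, "body", 2, 0.88),
    ],
    key=lambda r: r.reading_order,  # extraction reads regions in this order
)
```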

Extract

The third stage is text extraction. For each region the layout model found, Archeglyph runs the extraction engine you chose — Tesseract by default, or any VLM in the Ollama Cloud list — and stores the resulting text, the engine’s name, and a timestamp.

A few things about this stage that matter:

  • It is per-region, not per-page. A page with a headline, three body columns, and a caption is five separate extraction runs, so you can re-run a single region with a different engine when one goes wrong, without touching the others.
  • Engine choice is per-dataset with per-document override. Most researchers pick one engine for the whole dataset. When they hit a tricky page, they override the choice for that page (or for one region on that page) without changing the dataset default.
  • Every extracted block carries its engine in the ProvenanceBadge. You can see at a glance which engine produced which block, and re-run a block with a different engine from the badge’s menu.
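The per-region model with override and provenance can be sketched as a loop. Everything here is illustrative: the function and field names are invented, and `engines` is a stand-in for however the product wires in Tesseract or a VLM.

```python
from datetime import datetime, timezone

def extract_page(regions, dataset_engine, overrides=None, engines=None):
    """Run one extraction per region, honouring per-region engine overrides.

    `engines` maps an engine name to a callable(region) -> text; the names
    and mapping are a sketch, not the product's API.
    """
    overrides = overrides or {}
    blocks = []
    for i, region in enumerate(regions):
        engine_name = overrides.get(i, dataset_engine)
        text = engines[engine_name](region)
        blocks.append({
            "region": i,
            "engine": engine_name,   # the provenance shown on each block
            "extracted_at": datetime.now(timezone.utc).isoformat(),
            "text": text,
        })
    return blocks
```

Re-running one region with a different engine is then just a call with a single entry in `overrides`, leaving every other block's text and provenance untouched.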

When extraction finishes, the document is in state extracted_text. This is the first point at which the text of the document is legible in Archeglyph’s search.

Review (optional)

Between extraction and analysis, Archeglyph offers an optional review step. This is the trust surface of the product: a three-pane screen showing the page image with region overlays on the left, the per-region extracted text (editable) in the middle, and a metadata panel — confidence histogram, engine list, per-region re-run buttons — on the right.

For small, important corpora (a few dozen pages you’re going to cite) we recommend using review. For large exploratory corpora (a thousand pages you are surveying) we recommend skipping it, knowing you can come back to the review screen any time to spot-check a document that looks off.

Reviewing a document doesn’t change how the analysis stage runs — it just gives you a chance to correct extraction errors before the text is chunked and indexed.

Analyse

The final stage is where your dataset turns into something searchable and clusterable. Archeglyph:

  • Chunks the extracted text into sentence units using syntok. A chunk is roughly one sentence, sometimes two if the sentences are short.
  • Embeds each chunk with the embedding model you chose — MiniLM-L6 by default, with BGE-small as an interchangeable alternative. The embedding model’s id is recorded alongside each chunk so a later re-embed is a tracked event, not a silent overwrite.
  • Indexes the chunks twice: once in a lexical index (Tantivy, with stemming and snippets) and once in a vector index (zvec, same dimension as the embedding model). The two indexes join on chunk id so hybrid search works transparently.
  • Clusters the chunks into semantic groups using HDBSCAN over the embeddings and into lexical groups using TF-IDF plus TruncatedSVD plus HDBSCAN. Each cluster gets a theme title and a one-sentence summary from a small text LLM, both of which disclose the LLM that wrote them.
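The chunking rule in the first bullet (one sentence per chunk, two when a sentence is short) can be approximated in a few lines. The product uses syntok for sentence segmentation; the regex split and the 40-character threshold below are illustrative substitutes, not the real values.

```python
import re

def chunk_sentences(text: str, short_len: int = 40) -> list[str]:
    """Approximate Archeglyph's chunker: one sentence per chunk, merging a
    short sentence into the next. (syntok does the real segmentation; the
    regex and threshold here are stand-ins.)"""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks = []
    for sent in sentences:
        if chunks and len(chunks[-1]) < short_len:
            chunks[-1] = chunks[-1] + " " + sent   # fold a short sentence forward
        else:
            chunks.append(sent)
    return chunks
```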

When analysis finishes, the document is in state ready. The dataset’s search, cluster browser, and fragment neighbourhood views all become available on the document’s text.
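The lexical and vector indexes described above join on chunk id; one common way to fuse two ranked result lists over a shared id is reciprocal rank fusion, sketched below. The RRF formula is an assumption for illustration — the guide says the indexes join on chunk id but does not name a fusion rule.

```python
def hybrid_rank(lexical_hits: list[str], vector_hits: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of chunk ids by reciprocal rank.

    Joining on chunk id is how the two indexes line up; the RRF scoring
    (1 / (k + rank)) is an illustrative choice, not a documented one.
    """
    scores: dict[str, float] = {}
    for hits in (lexical_hits, vector_hits):
        for rank, chunk_id in enumerate(hits):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that appears in both lists accumulates score from each, so agreement between the lexical and semantic sides pushes it to the top.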

What you see along the way

The dataset page shows each document’s current state and any running jobs. Jobs emit live events over a server-sent stream; the status column updates as each stage completes. If a stage fails — the VLM times out, a PDF has an unreadable page — the failure surfaces on the document with a retry button. The pipeline is fingerprinted per stage, so a retry re-runs only the failing stage, not the whole document.
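A client consuming that stream just parses standard server-sent-event framing. The `event:`/`data:` line format below is the SSE standard; the event name and payload fields are hypothetical, since the guide doesn't document the wire schema.

```python
import json

def parse_sse(raw: str) -> list[tuple[str, dict]]:
    """Parse a server-sent-event stream into (event, payload) pairs.

    The event:/data: framing is standard SSE; the payload fields in the
    sample stream (document, stage, state) are illustrative.
    """
    events, name, data_lines = [], "message", []
    for line in raw.splitlines():
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "":                      # a blank line ends one event
            if data_lines:
                events.append((name, json.loads("\n".join(data_lines))))
            name, data_lines = "message", []
    return events

sample_stream = (
    'event: stage_complete\n'
    'data: {"document": "doc-17", "stage": "assess", "state": "assessed"}\n'
    '\n'
    'event: stage_complete\n'
    'data: {"document": "doc-17", "stage": "extract", "state": "extracted_text"}\n'
    '\n'
)
```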

What each stage costs

A rough sense of cost per page on a medium-large corpus (hundreds of pages):

  • Upload and render: free (CPU).
  • Assess: 10-30 seconds and a few cents of VLM credit per page.
  • Extract: either a tenth of a second of CPU (Tesseract) or 30-60 seconds and single-digit cents of VLM credit (VLM read) per region.
  • Analyse: a few seconds of CPU per document for chunking, embedding, and index updates; clustering runs once per ingest batch and is usually under a minute for datasets up to around 10,000 chunks.

For a 1,000-page corpus with Tesseract extraction and VLM layout, the end-to-end cost is typically tens of minutes of wall-clock time and a few dollars of hosted-model credit.
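The back-of-envelope behind that estimate is easy to reproduce from the per-page figures above. The concurrency figure is an assumption (the guide gives only per-page ranges), as is the five-regions-per-page average.

```python
def pipeline_estimate(pages: int, assess_s_per_page: float,
                      extract_s_per_region: float, regions_per_page: int,
                      vlm_concurrency: int) -> dict:
    """Rough wall-clock estimate from the per-page figures in the table.

    Assumes VLM assessment calls run `vlm_concurrency` at a time and
    Tesseract extraction is serial CPU work.
    """
    assess_wall = pages * assess_s_per_page / vlm_concurrency
    extract_wall = pages * regions_per_page * extract_s_per_region
    return {
        "assess_minutes": assess_wall / 60,
        "extract_minutes": extract_wall / 60,
    }

# 1,000 pages, 20 s/page assessment at 16 concurrent VLM calls,
# Tesseract at 0.1 s per region, 5 regions per page:
est = pipeline_estimate(1000, 20, 0.1, 5, 16)
```

Under those assumptions assessment dominates at roughly 21 minutes of wall clock, with Tesseract extraction adding about 8 more — consistent with the "tens of minutes" figure above.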

Where to go next