<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Archeglyph — Articles &amp; Guides</title><description>Essays and how-tos on digital humanities pipelines: layout assessment, extraction, reading-first clusters.</description><link>https://www.archeglyph.com/</link><language>en-gb</language><item><title>Downstream of Trove: where analysis fits in the corpus stack</title><link>https://www.archeglyph.com/articles/downstream-of-trove/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/downstream-of-trove/</guid><description>Digitisation projects like Trove, Chronicling America, and Europeana produce the corpus. Archeglyph produces the analysis on top of it. They are layers of the same stack — not competitors, not substitutes.</description><pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate><content:encoded>A question we get, politely phrased, roughly once a week: *&quot;Is this a
competitor to Trove?&quot;* Or *&quot;How is this different from Chronicling America?&quot;*
Or — from someone who has spent longer thinking about it — *&quot;Where does
Archeglyph sit, exactly, in the ecosystem of digital tools for archival
research?&quot;*

The honest answer is that Archeglyph and the great digitisation projects
live on different floors of the same building. The building is the
corpus stack. This article is about what each floor does, and why
confusing the floors leads to bad tool choices.

## The corpus stack has layers

Digital research on archival material has, in practice, a vertical stack
of concerns. Roughly, from the bottom up:

1. **Preservation.** Keeping the physical artefact alive — paper, glass
   plate, wax cylinder, magnetic tape — for another century. This is the
   job of libraries, archives, and museums.
2. **Digitisation.** Turning the artefact into bits. Scanning a page,
   photographing a plate, ripping a cylinder. Producing a page image, an
   OCR text layer, and page-level metadata.
3. **Indexing.** Making the bits findable. Full-text search across the
   digitised corpus, search-by-metadata, browseable collection pages.
4. **Analysis.** Doing something with a *subset* of the digitised
   material once you have chosen it: clustering, close reading with
   navigation, extraction of claims, quantitative comparison.
5. **Interpretation.** Writing the paper, the chapter, the monograph.
   This is a human activity. It is not, at any foreseeable point, the
   job of software.

Each layer is somebody&apos;s job. Each layer has its own institutions, its
own funding model, its own timescale. Archeglyph occupies the
*analysis* layer; the layer immediately below it — the layer that
*feeds* Archeglyph — is digitisation.

## What digitisation projects do

The great digitisation projects are, by a wide margin, the most
impressive infrastructural work in the digital humanities. Some of
the ones we lean on daily:

- **Trove** (National Library of Australia) — hundreds of millions of
  digitised pages of Australian newspapers, gazettes, magazines, and
  books. Full-text searchable, with a community of volunteer text
  correctors improving the OCR a line at a time.
- **Chronicling America** (Library of Congress + NEH) — a growing
  corpus of historic US newspapers, state-by-state, with a public API
  and a clean page-image viewer.
- **Europeana** — a federation across European cultural heritage
  institutions, aggregating metadata and digitised objects from
  thousands of museums, libraries, and archives.
- **HathiTrust** — a shared digital library built on the mass
  digitisation of research-library holdings, with in-copyright and
  public-domain strata and a careful access model.
- **Internet Archive** — the public-facing generalist. Books, serials,
  audio, video, web. An indispensable safety net for everything the
  institutional projects haven&apos;t yet reached.
- **Google Books** — the largest of all by raw volume, with a search
  surface that is uneven but often surprising.
- **DPLA** (Digital Public Library of America) — an aggregator over
  US institutional collections, analogous in ambition to Europeana.

What these projects produce, broadly, is the same shape of output: a
scanned page, an OCR text layer of varying quality, page-level metadata
(title, date, publisher, rights), and a search surface that lets you
find pages across millions.

This is an enormous achievement. It is also expensive, institutional,
and slow. A digitisation project is measured in decades. Its output is
*broad* — it serves every downstream use from genealogy to literary
scholarship to local history — and, necessarily, *generic*: it does
not privilege any one research question.

## What Archeglyph does

Archeglyph is a research tool, not a digitisation tool. It starts from
material that has already been digitised — by an institution, by a
researcher with a scanner, by a photographer with a phone — and
produces a reading surface over the *specific subset* a researcher
cares about.

Concretely, given a set of page images or PDFs a researcher has
chosen, Archeglyph:

- Runs a transparent extraction pipeline (VLM-assisted layout
  assessment, OCR or VLM extraction at the researcher&apos;s choice) with
  every model disclosed at the point of output.
- Indexes the extracted text for full-text search across the
  researcher&apos;s corpus — not across all of Trove, just across what
  they uploaded.
- Clusters fragments semantically and presents each cluster as
  quotations with sources, not as a scatterplot.
- Keeps version history so that re-ingestion doesn&apos;t silently
  renumber clusters or invalidate saved links.
- Produces an auto-generated plain-language technique note so the
  researcher can cite how the corpus was processed.
- Ships the whole dataset as a single exportable snapshot — index,
  vectors, metadata — so the corpus can be archived, shared, or
  re-opened without the product.

Analysis, in other words. Project-bound, weeks-to-months, specific,
narrow — the opposite end of every axis from digitisation.

## Where they meet

The interface between the two layers is the PDF, or the image folder,
or the API response. A researcher searches Trove, finds the two hundred
pages that touch their research question, downloads them, and loads
them into Archeglyph. The bits come *from* the digitisation project.
The reading happens *on top of* Archeglyph.

Nothing about this is adversarial. We do not want to re-digitise what
Trove has already digitised; Trove has no plans to ship a clustering
UI. The digitisation projects built the library. Archeglyph is the
desk you read at.

&gt; *&quot;Trove built the library. Archeglyph is the desk you read at.&quot;*

## A worked example: Trove + Archeglyph in the same workflow

Concretely — a historian of Australian labour movements in the 1920s is
interested in one specific union&apos;s coverage in the Brisbane press.

1. **In Trove.** Search for the union&apos;s name across the *Brisbane
   Courier* and the *Daily Standard* for the years 1921–1929. Refine
   by date, by title, by page type. Select the two hundred-odd
   articles that look relevant. Download the page PDFs — Trove
   supports this for most of its newspaper holdings — or export the
   list as references.
2. **Between layers.** The researcher now has a folder of two hundred
   PDFs on their laptop. This is the handoff point. The digitisation
   layer has done its job; the analysis layer hasn&apos;t started.
3. **In Archeglyph.** Create a new dataset. Upload the PDFs. Let the
   pipeline run: layout assessment, OCR (Tesseract is usually fine for
   1920s Brisbane newsprint), chunking, embedding, clustering.
4. **Reading.** The cluster view surfaces themes — shipping strikes,
   wage arbitration, internal union politics, coverage of rival
   unions, editorial hostility. Each theme is a card with exemplar
   quotations and a page reference back to the original Trove scan.
5. **Citing.** Every quotation links to a source page. The researcher
   cites the Trove record for the canonical reference and uses the
   Archeglyph snapshot ID as the methodological appendix — *&quot;cluster
   analysis produced via Archeglyph snapshot XYZ, 2026-04-15&quot;*.

Trove did the work of digitising the Brisbane press in the 1920s.
Archeglyph did the work of letting this particular researcher read
two hundred pages as a corpus, not as two hundred separate documents.
The two tools did not compete at any step.

## What we don&apos;t claim

In the same spirit of spelling things out one axis at a time:

- We are not digitising more material. We have no scanners, no
  institutional mandate, no partnerships with rights-holders. If the
  material isn&apos;t already digitised, Archeglyph cannot help.
- We are not replacing the institutional archive. The canonical
  record stays where it is: in the library&apos;s catalogue, under the
  library&apos;s URL, with the library&apos;s metadata.
- We are not synthesising new prose. Archeglyph does not summarise a
  corpus into a paragraph. It does not answer a research question
  with generated text. Everything on screen is either a quotation
  from the corpus or a clearly-labelled technique note.
- We are not building a competing search engine. We search across
  *your* dataset. Trove searches across Trove. Those are different
  jobs.

## A closing line

If the phrase &quot;downstream of Trove&quot; lands badly — if it sounds
dismissive of the decades of work that Trove, Chronicling America,
Europeana, HathiTrust, the Internet Archive, Google Books, and DPLA
have put into digitising the record — that isn&apos;t our intent.
*Downstream* here is the geographical meaning, not the hierarchical
one. The river flows from the institutional archive, through the
digitisation project, to the researcher&apos;s desk. Archeglyph sits at
the desk. The water had to get there somehow, and it wasn&apos;t us who
carried it.</content:encoded><category>positioning</category><category>method</category><category>digitisation</category><author>Dipankar Sarkar</author></item><item><title>Orthogonal to LLM &apos;deep research&apos;</title><link>https://www.archeglyph.com/articles/orthogonal-to-llm-deep-research/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/orthogonal-to-llm-deep-research/</guid><description>Deep-research agents synthesise. Archeglyph indexes. They are different products solving different problems for different research workflows. Knowing which you need keeps your citations defensible.</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><content:encoded>There is a category of products getting a lot of attention right now —
&quot;deep research&quot; agents. They take a question, search the web (or a
private collection), read a few dozen sources, and produce a written
synthesis with footnote-style citations. Perplexity Research, ChatGPT
Search, Claude&apos;s research mode, Gemini Deep Research, and a long tail
of vertical clones all sit in this category.

Researchers often ask us how Archeglyph compares. The honest answer
is **we don&apos;t, because we&apos;re not in that category**. Deep-research
agents and Archeglyph aren&apos;t on the same axis. They solve different
problems for different stages of research. This article tries to
draw the orthogonal-axis distinction clearly, because choosing the
wrong tool for the work in front of you is expensive.

## What a deep-research agent does

A deep-research agent compresses *many sources* into *one written
answer*. Workflow:

1. Researcher asks a question.
2. Agent reads several dozen sources it selected.
3. Agent writes a synthesis — usually 800–3,000 words — with
   inline citations.
4. The footnotes link back to the sources.

This is genuinely useful. For an early-stage scoping pass on an
unfamiliar topic, or when you need a credible briefing in fifteen
minutes, the workflow saves real time. The footnotes are a meaningful
improvement over the previous generation of chatbots that gave you no
sources at all.

But the *output* is a written paragraph that the agent composed. The
sources fed it; it digested them; it produced new prose. The chain of
custody between any specific sentence in the output and any specific
sentence in a source is fragile in ways that are subtle and
adversarial:

- The cited source frequently doesn&apos;t say quite what the synthesis
  claims. (Multiple recent audits put this rate at 15–40% on real
  research questions.)
- A claim with no clear source in the cited material gets attributed
  anyway, often plausibly enough that nobody checks.
- The selection of sources is itself opaque — why these forty, and
  not the other forty?

For *exploratory* work, this is acceptable. For *citable* work, it
isn&apos;t. A researcher writing a footnote needs to point at a specific
page on a specific date. A philologist needs the passage as it
actually appears in the manuscript, not a paraphrase that may have
introduced a tense or a hedge.

## What Archeglyph does

Archeglyph compresses *one large corpus* into *a navigable index*.
Workflow:

1. Researcher uploads a corpus they already chose. (We don&apos;t pick
   sources for you. This is the scholar&apos;s job, not ours.)
2. Archeglyph reads each page (OCR or VLM, your choice, always
   disclosed). The text we extract is the text on the page; we
   show you the bbox.
3. We index it for full-text search and group it semantically into
   clusters of related fragments.
4. The output is a browsable view of the corpus: search returns real
   chunks, clusters surface real exemplar quotations, every fragment
   links back to the source page.

There is no &quot;synthesis&quot; anywhere in that pipeline. The deepest
generative thing we do is name a cluster — *&quot;Migrations across the
Bosphorus&quot;* — and even that comes with a &quot;this is an LLM-generated
label&quot; badge, with the actual quotations beneath it.

## The orthogonal axis

| Axis                          | Deep-research agents          | Archeglyph                                  |
|-------------------------------|-------------------------------|---------------------------------------------|
| Input                         | A question                    | A corpus                                    |
| Output                        | A written synthesis           | A navigable index                           |
| Source selection              | The agent picks               | You picked when you uploaded                |
| What the user reads           | Generated paragraphs          | Real fragments, in context                  |
| Provenance granularity        | Footnote per paragraph        | Region + page + bbox per fragment           |
| Confidence in any single line | Probabilistic                 | Deterministic (it&apos;s the OCR&apos;d text)         |
| Audit cost                    | Manual re-reading per claim   | One-time per region, then reusable          |
| Best for                      | Exploration, briefings        | Citable scholarship, longitudinal analysis  |

The two columns are not competing products. They&apos;re different jobs.

## Where they actually meet (and where they don&apos;t)

In a complete research workflow, both can have a place:

- **Scoping (early)** — a deep-research agent is a fast way to find
  out what conversation you&apos;re walking into. Read its synthesis
  knowing it might mislead, then go read the actual sources.
- **Working with primary corpora (the middle)** — a tool that lets
  you read your archive without reading it cover-to-cover. This is
  the hole Archeglyph fills.
- **Writing (later)** — your own synthesis, with citations to the
  primary sources you found via Archeglyph&apos;s search and clusters.
  Footnotes that point at page numbers, not at AI outputs.

What deep-research agents are *not* good at is being trusted at the
*writing* end of that pipeline. The hallucination rate is high enough
that any claim drawn from one needs to be re-verified against the
original. At which point the labour-saving has been transferred, not
removed — you&apos;re now reading the original anyway.

What Archeglyph is *not* trying to do is the synthesis. We have no
plans to add a chat interface, no plans to write summaries on your
behalf, no plans to &quot;answer the research question&quot;. A different
product can do that. We want to be the tool you cite *from*.

## Why this matters now

The noise floor of &quot;AI research tools&quot; is rising fast. A new product
launches every week claiming to do &quot;everything for the researcher&quot;.
Most of them are minor variations on the same generative pattern: an
LLM, a vector store, a chat interface, a hand-wave at hallucination.
The pattern isn&apos;t bad — it&apos;s just one workflow. The danger is that
researchers begin to assume *this is what AI does to research*: it
synthesises, it asserts, it occasionally invents.

We think there&apos;s an entire other axis to build along, and it&apos;s the
one humanities scholarship has always run on: read the source,
ground the claim, cite the page. Make that easier — much easier —
without changing what counts as a claim. Don&apos;t synthesise on the
researcher&apos;s behalf. Don&apos;t replace judgment. Amplify the corpus,
keep the interpretation human.

That&apos;s the orthogonal axis. We&apos;re betting it matters.</content:encoded><category>positioning</category><category>method</category><category>llm</category><category>research-workflow</category><author>Dipankar Sarkar</author></item><item><title>The citable-claim test</title><link>https://www.archeglyph.com/articles/the-citable-claim-test/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/the-citable-claim-test/</guid><description>A simple test for whether a research tool produces output you can defend in a footnote: can you, in one click, see the page the claim came from? If not, the tool is for exploration, not for scholarship.</description><pubDate>Sat, 09 May 2026 00:00:00 GMT</pubDate><content:encoded>There&apos;s a quiet test I run on every new &quot;AI for research&quot; tool that
crosses my desk, and it cuts through the marketing copy in about
thirty seconds. It&apos;s this:

&gt; *Pick a single sentence in the tool&apos;s output. Can you, in one
&gt; click, see the page in the original source the sentence came
&gt; from?*

If yes, the tool is a candidate for citable scholarship. If no — if
you have to manually re-search to verify it, or if the &quot;source link&quot;
points at a paragraph that doesn&apos;t quite say what the tool&apos;s
sentence claims — the tool is for exploration. Both are useful. They
are not interchangeable.

This is the test we built Archeglyph to pass.

## What the test actually checks

A footnote in academic writing makes a quiet but specific promise:
*if you go look at the cited source on the cited page, you will find
the thing I am attributing to it.* That&apos;s the contract scholarship
runs on. Peer review enforces it; tenure committees notice when it
breaks; entire careers turn on whether the contract holds.

The citable-claim test asks: does this tool let me make that
promise about anything I quote or summarise from its output?

Three failure modes are common in current &quot;AI research&quot; tooling:

**Failure 1 — The source exists but doesn&apos;t say it.**
The tool produces a sentence and footnotes it to a real document. You
click through. The document is real. The page exists. But the
sentence the tool composed isn&apos;t quite what the source says. It might
have shifted a tense, dropped a hedge, conflated two adjacent claims,
or attributed a quotation to the wrong speaker. You only catch it if
you read the source carefully — at which point the tool has saved
you no time.

**Failure 2 — The source link is to the document, not the page.**
The tool cites &quot;Smith 2018&quot;, but the document is 280 pages. To verify,
you read the whole thing or full-text search for the claim. Often
the search misses because the tool paraphrased.

**Failure 3 — There is no source.**
The tool produced a confident assertion with no citation at all.
This is the cleanest failure because it&apos;s obvious; it&apos;s also the
most common in casual chatbot output.

A tool that passes the citable-claim test avoids all three by
construction: it shows you fragments from the source instead of
generating new prose about them.

## Why this is a structural property, not a quality knob

You can&apos;t get a generative system to pass this test by being more
careful. You can lower its failure rate with better retrieval,
better prompting, better grounding — and the best deep-research
agents have lowered it considerably. But the structure is still:

```
sources → model → new prose → footnote → sources
```

There&apos;s a loss-of-fidelity step in the middle. The new prose is
*about* the sources; it isn&apos;t *from* them. Whether that loss is 5%
or 35% depends on the day, the question, and the model. It&apos;s never
zero, because the model&apos;s job is to write something new.

The structural alternative is:

```
sources → indexed fragments → fragments shown → click → source
```

No new prose in the middle. The tool&apos;s job is to make the existing
text findable, not to compose new text on top of it. The fragments
the researcher reads *are* the sources, sliced, indexed, surfaced
with context. You can quote them word-for-word; the audit trail is
trivially short.

This is why the citable-claim test discriminates so cleanly: it&apos;s
asking which structure the tool uses, and the structure cannot be
faked.

## What the test looks like in Archeglyph

Open a search result. The snippet you see is the actual chunk text
from the bundle&apos;s SQLite, with the matched terms wrapped in `&lt;em&gt;`.
Click it. You land on the review page for the document the chunk
came from, scrolled to the region. The bbox is highlighted on the
source image. There is a `ProvenanceBadge` showing the OCR engine
that read it and the date. If you don&apos;t trust the OCR, run it again
with a different engine on that one region; the system keeps the
audit trail.

Open a cluster. The exemplar quotations on the card are real chunks
from the corpus. Click one — same trip back to the source page.

Now write your footnote. *&quot;&lt;i&gt;Le Figaro&lt;/i&gt;, 11 January 1924, p. 3&quot;*
— exactly what the chunk&apos;s source-link gave you. The page is on
disk; you (and your reader) can return to it forever.

That&apos;s all the test asks. The tool either does this or it doesn&apos;t.

## Where this fits

Tools that pass the citable-claim test are good for the *writing*
end of research — the parts where your name goes on the claim. Tools
that fail are good for the *exploring* end — getting up to speed on
a topic, finding sources you didn&apos;t know existed, surfacing
unexpected angles.

A serious research workflow probably uses both. The mistake is
assuming they&apos;re substitutes. They&apos;re not. They&apos;re complementary
tools at different stages of the same workflow, and a tool that&apos;s
honest about which end it serves is a tool that respects how
scholarship actually works.

We built Archeglyph for the writing end on purpose. Every choice in
the pipeline — preserving the source image, recording the engine on
every region, keeping the cluster card text-first, refusing to add
a &quot;summarise this corpus&quot; button — is downstream of one commitment:
*everything you read in this tool, you can defend in a footnote*.

If that&apos;s the test you care about, you&apos;ll find Archeglyph passes
it. If you need a tool that synthesises, that&apos;s a different tool,
and we won&apos;t pretend to be it.</content:encoded><category>method</category><category>citation</category><category>epistemology</category><author>Maitrayee Roychoudhury</author></item><item><title>Why Archeglyph cannot hallucinate</title><link>https://www.archeglyph.com/articles/why-archeglyph-cannot-hallucinate/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/why-archeglyph-cannot-hallucinate/</guid><description>Hallucination is a property of generative systems. Archeglyph isn&apos;t one. Every line of text the system surfaces was already in the source corpus — and we can show you which page it came from.</description><pubDate>Thu, 07 May 2026 00:00:00 GMT</pubDate><content:encoded>A philologist recently asked us a sharper version of the question that&apos;s
quietly haunting every research tool right now: *&quot;How do I know your tool
isn&apos;t making things up?&quot;*

It&apos;s the right question to ask. The honest answer is short: **Archeglyph is
not a generative system, so it cannot hallucinate the things you read in
it.** The longer answer is worth writing down because it explains an
architectural choice we made very early, and it explains why that choice
is the reason we exist.

## What &quot;hallucination&quot; actually means

The word is used loosely. In the literature it has a specific shape:
a generative model produces content that is *fluent and plausible* but
unfaithful to its inputs — a quotation that was never said, a citation
that doesn&apos;t exist, a date that is off by twenty years and stated with
total confidence. The failure mode is intrinsic to how the system
works: a language model is trained to produce the next likely token,
not the next *true* token. Plausibility is the optimised target.
Truthfulness is, at best, correlated.

This is why retrieval-augmented generation, careful prompting, and
chain-of-thought tricks help but never close the gap. They lower the
hallucination rate. They don&apos;t change what kind of system you&apos;re using.

## Where Archeglyph&apos;s text comes from

Walk through the pipeline. At every step we can name the *source* of
the text on screen.

**1. The page image.** The starting point is a researcher-uploaded PDF
or image scan. The bytes don&apos;t change. The original is preserved in
object storage and re-downloadable forever.

**2. Region detection.** A vision model (or a CV fallback) draws boxes
on the page. The model&apos;s only output is *coordinates and a label*
(headline / body / caption / figure / table). It does not produce
text. If the model invents a region that isn&apos;t there, we crop air —
and the OCR step that follows produces empty text, which is easy
to notice.

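To make *coordinates and a label* concrete, here is the shape of that
output as a sketch (the field names are ours for illustration, not
Archeglyph&apos;s actual schema):

```
from dataclasses import dataclass

@dataclass
class Region:
    page_id: str
    bbox: tuple[int, int, int, int]  # x0, y0, x1, y1 in page-image pixels
    label: str                       # headline / body / caption / figure / table
    detector: str                    # engine id, recorded for provenance
```
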
**3. Text extraction.** Tesseract or a vision-language model is given
a single cropped region and asked: *&quot;Read what&apos;s on this image,
faithfully.&quot;* This is the only step where a model could plausibly
&quot;add&quot; text that wasn&apos;t there. We mitigate the risk three ways:

- The image *and* the extracted text are kept side-by-side in the
  review UI. Hover a region; the bbox highlights on the source page.
- Every region is stamped with the engine that produced its text and a
  confidence score.
- The dataset technique note (auto-generated, clearly labelled as
  such) tells the researcher how many regions were Tesseract-read
  versus VLM-read. A researcher can audit by sampling.

**4. Chunking, embedding, indexing.** These are deterministic
operations. `syntok` splits the extracted text on sentence boundaries.
A sentence-transformer turns each chunk into a vector. Tantivy
indexes the words for full-text search. None of these steps add
text. They make the existing text findable.

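A minimal sketch of this step with the named libraries, assuming
`region_text` holds one region&apos;s extracted text; the one-chunk-per-sentence
recipe here is a simplification of the configurable chunking parameters:

```
from syntok import segmenter
from sentence_transformers import SentenceTransformer

def chunk_text(extracted_text):
    # Deterministic sentence split; each sentence&apos;s original surface
    # form is reassembled from syntok&apos;s tokens.
    chunks = []
    for paragraph in segmenter.process(extracted_text):
        for sentence in paragraph:
            chunks.append(&apos;&apos;.join(t.spacing + t.value for t in sentence).strip())
    return chunks

chunks = chunk_text(region_text)
model = SentenceTransformer(&apos;all-MiniLM-L6-v2&apos;)  # or bge-small-en-v1.5
vectors = model.encode(chunks)  # one 384-dimensional vector per chunk
```
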
**5. Clustering.** HDBSCAN groups vectors. The output is *which chunk
is in which cluster*. There is no language generation here.

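Continuing the sketch with the `hdbscan` library (the
`min_cluster_size` value is illustrative, not Archeglyph&apos;s default):

```
import hdbscan

# Group the chunk vectors from the sketch above. fit_predict returns
# one integer label per chunk; -1 marks noise chunks in no cluster.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(vectors)

# Soft-membership probabilities, used to rank exemplar quotations.
probabilities = clusterer.probabilities_
```
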
**6. Cluster theme titles.** Yes, this step uses an LLM. The LLM is
given the top TF-IDF terms for a cluster plus a handful of sample
sentences, and asked to produce a four-to-six word label. The label
is shown with a `ProvenanceBadge` naming the model. If a researcher
doubts a label, they read the exemplars beneath it — which are real
quotations from the corpus, not LLM output.

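One way to compute those TF-IDF inputs, reusing `chunks` and `labels`
from the sketches above (a sketch of the idea, not Archeglyph&apos;s exact
term selection):

```
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Treat each cluster&apos;s concatenated chunks as one document, so terms
# distinctive to a cluster (not merely frequent) rank highest.
cluster_ids = sorted(c for c in set(labels) if c != -1)
cluster_docs = [
    &apos; &apos;.join(ch for ch, lab in zip(chunks, labels) if lab == c)
    for c in cluster_ids
]
vectoriser = TfidfVectorizer(stop_words=&apos;english&apos;, max_features=5000)
tfidf = vectoriser.fit_transform(cluster_docs)
terms = np.array(vectoriser.get_feature_names_out())

for cid, row in zip(cluster_ids, tfidf.toarray()):
    top_terms = terms[row.argsort()[::-1][:8]]
    print(cid, top_terms)  # these terms + samples go to the labelling LLM
```
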
**7. The dataset technique note.** Three to five sentences describing
how the dataset was processed. Generated by a small model from the
*known* engine choices and the *known* counts of files, regions, and
chunks. We cap its length, and if the model&apos;s output is missing or
malformed we fall back to a deterministic template. The note carries
a &quot;this summary is automatically generated&quot; caveat in every version.

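The deterministic fallback is the simplest piece of the pipeline:
something like this, where the `stats` field names are ours for
illustration:

```
def fallback_note(stats):
    # Deterministic template used when the model&apos;s output is missing
    # or malformed. Field names are illustrative, not the product&apos;s.
    return (
        f&apos;This dataset contains {stats[&quot;files&quot;]} files, segmented into &apos;
        f&apos;{stats[&quot;regions&quot;]} regions and {stats[&quot;chunks&quot;]} chunks. &apos;
        f&apos;{stats[&quot;tesseract_regions&quot;]} regions were read by Tesseract, &apos;
        f&apos;{stats[&quot;vlm_regions&quot;]} by a vision-language model. &apos;
        &apos;This summary is automatically generated.&apos;
    )
```
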
That is every model invocation in Archeglyph. None of them is asked
to *summarise the corpus*. None of them is asked to *answer a research
question*. None of them produces a paragraph that a researcher could
mistake for primary text.

## What we don&apos;t do

We don&apos;t have a chat interface. We don&apos;t have a &quot;summarise this
collection&quot; button. We don&apos;t have an &quot;ask a question of your archive&quot;
endpoint. Those are perfectly reasonable products to build — they&apos;re
just a different product. The research workflow they support is
*synthesis*. The research workflow we support is *reading*.

We made this call deliberately, and we don&apos;t expect to change it.
A tool that synthesises will always be liable to hallucinate, no
matter how careful the prompt engineering. Once a researcher has to
audit each generated sentence for fabrication, the tool has stopped
being a labour-saver and started being a liability.

## What you can verify

If you&apos;re evaluating Archeglyph, run this test:

1. Upload a page you know cold.
2. Watch the regions appear on the review screen. Open the bbox
   overlay. For each region, verify that the text is actually what&apos;s
   on the image at that location.
3. Run a search that you know should match. Verify every result is a
   real chunk from a real region.
4. Open the cluster browser. Pick a cluster. Click an exemplar. It
   takes you back to the source page, with the highlighted region.
5. Now try to find an unsupported claim in any of the surfaced text.
   You won&apos;t, because there isn&apos;t a step in the pipeline that could
   have produced one.

That&apos;s the audit. It scales.

## The promise, stated plainly

Archeglyph reads what is on the page. We disclose which model did the
reading. We index, group, and surface what was read. We don&apos;t write
anything new on top of it. When we *do* generate (cluster titles, the
note), we say so loudly and we keep it to under a hundred words.

This is the line we hold. Not because LLMs are bad — they&apos;re useful
for plenty of things — but because *citing what you read* is the
foundational act of scholarship, and we want to be a tool a
researcher can cite from without an audit trail of footnotes saying
&quot;the AI told me so&quot;.

If your work needs the corpus to mean what it says on the page,
Archeglyph is for you. If your work needs synthesis, we&apos;ll happily
recommend something else.</content:encoded><category>transparency</category><category>hallucination</category><category>method</category><author>Dipankar Sarkar</author></item><item><title>Reading clusters as a researcher</title><link>https://www.archeglyph.com/articles/reading-clusters-as-a-researcher/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/reading-clusters-as-a-researcher/</guid><description>The Archeglyph cluster view leads with quotations, not scatterplots. Here is how to use it — and why the scatterplot is behind a toggle.</description><pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate><content:encoded>If you have used a topic-modelling or embedding-clustering tool before, you have probably
seen the default view: a UMAP scatterplot with coloured dots, a sidebar of top terms per
cluster, and — if you are lucky — a list of document titles. This is a default designed for
someone debugging a clustering algorithm. It is not a default designed for someone reading a
corpus.

Archeglyph&apos;s cluster view reverses the priority. Each cluster is a card, and the card leads
with three things a researcher actually wants to see: a theme title, a one-sentence summary,
and three to six exemplar fragments rendered as readable quotations with source citations.
The scatterplot is behind a button called &quot;Advanced&quot;.

This article is about how to use the default view — and, for those inclined, when the
Advanced panel earns its place.

## What a card shows

A cluster card has the following anatomy, in order of visual weight:

1. **Theme title.** Four to six words. Generated by passing the cluster&apos;s top TF-IDF
   terms to a small text LLM, which polishes them into something readable. The card
   discloses which LLM wrote the title.
2. **One-sentence summary.** A short description — &quot;A group of 42 fragments discussing
   population movements between Europe and Asia Minor during the 1920s&quot; — produced by the
   same model from the exemplar fragments.
3. **Exemplar quotations.** Three to six fragments, each rendered as a block quote with its
   source document, page, and (where available) date. These are the highest-probability
   members of the cluster according to HDBSCAN&apos;s soft-clustering output.
4. **Size and link.** &quot;42 fragments · Open cluster →&quot; opens the *fragment neighbourhood* — a
   longer view of all members with a sentence of surrounding context on either side.

That is the whole default. There is no dot plot, no silhouette score, no outlier percentage.
Those numbers exist, and the Advanced toggle surfaces them, but the first read doesn&apos;t need
them.

## How to read a card

Our experience working with historians, philologists, and archivists on the newspapers prototype led to a short heuristic:

1. **Read the three quotations first, top to bottom.** Ignore the title; the title is a
   guess. The quotations are the data.
2. **Ask whether they feel like one group.** If yes, the cluster is doing useful work — even
   if the title is slightly off. If no, look at the outliers: sometimes one exemplar signals
   a sub-theme that was swept into the same bucket.
3. **Open the cluster.** The fragment neighbourhood shows all members with ±1 sentence of
   context. This is where the research actually happens. Skim the neighbourhood, flag
   fragments that feel adjacent but not central, and drop out to the document page for the
   ones that matter.
4. **Only then look at the title.** By this point you have your own sense of the cluster&apos;s
   shape. If the generated title fits, fine. If it doesn&apos;t, you can rename it, and the rename
   persists.

Notice how much of this is close reading. The clustering algorithm got the fragments into
roughly the same room; the philologist decides whether they are actually having the same
conversation. That division of labour is the whole point.

## What the Advanced toggle is for

There are three moments when the Advanced panel earns its place, and they have nothing to do
with the default reading loop:

- **When you suspect the clustering is wrong and want to know how wrong.** The probability
  histogram tells you whether a cluster&apos;s members are tightly bound or loosely attached.
  Loose clusters should be read sceptically.
- **When you are comparing two runs.** If you changed the embedding model or the HDBSCAN
  parameters, the UMAP projection lets you see at a glance whether the structure moved.
- **When you are teaching the algorithm to someone else.** The scatterplot is good pedagogy;
  it is mediocre daily bread.

For everything else, the numbers are noise. Our experience on the newspapers prototype was
that researchers who spent thirty minutes in a UMAP view ended up with a *worse* sense of
their corpus than researchers who spent thirty minutes reading exemplar quotations. The
geometric view feels authoritative in a way the quotations don&apos;t, and that authority is
misleading — distances in 2D UMAP space are not the distances the clustering algorithm used.

## What we don&apos;t do

A few things the cluster view deliberately omits, and why:

- **Word clouds.** They encode frequency as area, which the eye reads as importance. TF-IDF
  terms are already in the theme-title pipeline; that is enough.
- **Automatic cluster merging.** If two clusters are &quot;similar&quot; by some metric, the researcher
  — not the algorithm — decides whether to merge them. The tool proposes; the scholar
  disposes.
- **Sentiment or stance overlays.** Sentiment classifiers trained on 21st-century social
  media do poorly on 19th-century newspapers. We would rather ship no signal than a
  misleading one.

## What cluster IDs promise

When you re-ingest a dataset — add new documents, re-run extraction on a batch, change the
cluster parameters — the underlying clustering algorithm produces a fresh assignment. Naively
this would renumber every cluster, breaking any URL or note that references &quot;cluster #17&quot;.

Archeglyph stabilises cluster IDs via Hungarian matching against the previous assignment: a
cluster that has substantial overlap with a previous cluster keeps the previous ID. This
means saved cluster links survive incremental ingests. It also means a cluster whose
membership shifts dramatically — because, say, you added two hundred documents about a new
topic — will show up as a *new* cluster rather than hiding inside the old one.

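A sketch of that matching with `scipy`; the overlap measure and the 0.5 threshold are
illustrative assumptions, not the actual implementation:

```
import numpy as np
from scipy.optimize import linear_sum_assignment

def remap_cluster_ids(old, new, threshold=0.5):
    # old / new: {chunk_id: cluster_label}, with -1 marking noise.
    old_ids = sorted({c for c in old.values() if c != -1})
    new_ids = sorted({c for c in new.values() if c != -1})
    overlap = np.zeros((len(new_ids), len(old_ids)))
    for chunk, nc in new.items():
        oc = old.get(chunk, -1)
        if nc != -1 and oc != -1:
            overlap[new_ids.index(nc), old_ids.index(oc)] += 1

    # Hungarian assignment maximises total shared membership across the matching.
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {}
    for r, c in zip(rows, cols):
        size = sum(1 for v in new.values() if v == new_ids[r])
        if size and overlap[r, c] / size &gt;= threshold:
            mapping[new_ids[r]] = old_ids[c]  # substantial overlap: keep the old ID

    next_id = max(old_ids, default=-1) + 1
    for nc in new_ids:  # everything else is a genuinely new cluster
        if nc not in mapping:
            mapping[nc], next_id = next_id, next_id + 1
    return mapping
```
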
That stability is load-bearing. It lets an archivist or intellectual historian bookmark a cluster as they would
bookmark a chapter, and come back to it a month later without chasing a new number.

## The end state

The default view is not just an aesthetic choice. It is a bet that the closest thing
clustering tools have to an interface — the UMAP plot — was never the right one for the
humanities. A cluster is a reading unit. Make it look like one, and the tool recedes into
the background of the work, which is where research tools belong.</content:encoded><category>clustering</category><category>interpretation</category><category>ui</category><author>Dipankar</author></item><item><title>Exporting and archiving a dataset</title><link>https://www.archeglyph.com/guides/exporting-and-archiving-a-dataset/</link><guid isPermaLink="true">https://www.archeglyph.com/guides/exporting-and-archiving-a-dataset/</guid><description>A forward-looking but grounded walkthrough of Archeglyph&apos;s dataset snapshot: what goes into the tarball, how to open it without the product, and how to cite a snapshot in a paper.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate><content:encoded>import FaqSchema from &apos;../../components/seo/FaqSchema.astro&apos;;

Every study ends — a paper gets published, a grant closes, a postdoc moves institutions —
and the question becomes: *how do I keep the work in a form I can still use in five years,
without the tool that produced it?* Archeglyph&apos;s answer is the dataset snapshot, a single
tarball that bundles the catalogue, the lexical index, and the embedding store for one
dataset. This guide walks through how to create one, what is inside it, how to open it
without Archeglyph running, and how to cite it.

Some of the surfaces described below are still rolling out through M1. Where that is the
case the guide flags it.

## Creating a snapshot

From the dataset page, open the `⋯` menu on the header and choose `Export snapshot`. The
product computes the total size (it will tell you before you commit), asks you to confirm,
and produces a tarball named:

```
archeglyph-&lt;workspace&gt;-&lt;dataset-slug&gt;-&lt;YYYYMMDD-HHMM&gt;.tar.zst
```

The timestamp is UTC. The file is zstd-compressed; on a modern machine a dataset with tens
of thousands of documents typically lands in the low hundreds of megabytes.

While the export is running you can close the page — the job continues server-side and an
email with the download link arrives when it finishes. For datasets with tens of millions
of chunks the export can take several minutes; the job status surfaces in the dataset&apos;s
events feed the same way extraction jobs do.

## What is inside

Unpack the archive:

```
$ tar --zstd -xvf archeglyph-&lt;…&gt;.tar.zst
archeglyph-&lt;…&gt;/
├── README.txt
├── catalogue.sqlite
├── index.tantivy/
│   ├── meta.json
│   └── … segment files …
├── embeddings.zvec
├── settings.json
└── manifest.json
```

- **`catalogue.sqlite`** is a plain sqlite database containing the tables for documents,
  pages, regions, extracted text (with its engine provenance), edits, clusters, cluster
  memberships, and the settings that were active at snapshot time. You can open it in any
  sqlite browser; the schema is documented in `README.txt` and mirrors the tables described
  in the platform docs.
- **`index.tantivy/`** is the lexical search index, in tantivy&apos;s on-disk format. It can be
  opened by any tantivy 0.22+ reader; you do not need Archeglyph to query it.
- **`embeddings.zvec`** is the compressed embedding store, one vector per chunk plus a
  small metadata header (model id, dimension, chunking recipe). The zvec format is
  documented in its repository; a short Python reader script is bundled as
  `read_embeddings.py`.
- **`settings.json`** is a human-readable copy of the dataset&apos;s settings at the moment of
  export — engines, thresholds, chunking parameters. It is redundant with the sqlite
  catalogue but is present to make the snapshot legible without any database tooling.
- **`manifest.json`** lists every file, its SHA-256, and the snapshot schema version. Check
  the hashes after download if you intend to archive the tarball long-term.

The tarball does **not** contain the raw source images. It contains *references* — a
stable URL plus a SHA-256 — and a `rehydrate.sh` script that refetches the binaries from
the original object store. This is a licensing choice: many source archives grant
Archeglyph the right to process images but not to redistribute them. A future
`--with-images` flag will bundle the binaries for researchers whose sources are fully open.

## Opening a snapshot without Archeglyph

The design goal is that the snapshot opens with off-the-shelf tools. Three worked examples:

### Browse the catalogue in sqlite

```
$ sqlite3 catalogue.sqlite
sqlite&gt; .tables
documents   regions    texts    clusters   chunks   settings   engines
sqlite&gt; SELECT count(*) FROM chunks;
sqlite&gt; SELECT text FROM texts WHERE engine_id = &apos;qwen3-vl:235b-cloud&apos; LIMIT 5;
```

Every row carries the engine id that produced it; joining `texts` to `engines` gives you
the full provenance record in a single query.

### Search the lexical index from Python

```
from tantivy import Index

ix = Index.open(&apos;index.tantivy&apos;)
searcher = ix.searcher()
hits = searcher.search(ix.parse_query(&apos;wharves OR galata&apos;, [&apos;text&apos;]), limit=20)
for score, address in hits.hits:
    doc = searcher.doc(address)
    # tantivy-py returns field values as lists; take the first value
    print(score, doc[&apos;document_id&apos;][0], doc[&apos;page_no&apos;][0], doc[&apos;text&apos;][0][:80])
```

The tantivy Python bindings read Archeglyph&apos;s snapshot indexes directly; the field names
(`text`, `document_id`, `page_no`, `region_id`) are documented in `README.txt`.

### Load the embeddings

```
from zvec import read

store = read(&apos;embeddings.zvec&apos;)
print(store.metadata)  # {&apos;model&apos;: &apos;bge-small-en-v1.5&apos;, &apos;dim&apos;: 384, ...}
for chunk_id, vector in store:
    # use numpy, faiss, whatever
    ...
```

The embedding store carries enough metadata to reconstruct a search space without
Archeglyph; the model id is what lets you (or a future reader) know whether they can mix
these vectors with another corpus.

## Citing a snapshot

A snapshot is citable. The recommended format:

&gt; Author, *Dataset title*, Archeglyph snapshot `sha256:&lt;…&gt;` exported
&gt; `&lt;YYYY-MM-DD&gt;`, archived at `&lt;url-or-doi&gt;`.

The `manifest.json` contains a `snapshot_id` which is the SHA-256 of the concatenated file
hashes — that is the value to paste in the `sha256:` field. Two researchers with the same
`snapshot_id` are guaranteed to be looking at bit-identical data.

If you deposit the tarball in Zenodo or your institution&apos;s repository, Archeglyph will
accept the DOI on the dataset&apos;s settings page and show it on the dataset&apos;s landing card.
That feature lands in M1-D.

## Archiving versus re-importing

Two different verbs, two different use cases:

- **Archiving** — the tarball is the final form. You put it in a repository, you stop
  thinking about it. The three data artefacts inside are all openable with tools older than
  Archeglyph; whatever happens to us, the research artefact survives.
- **Re-importing** — the same tarball can be loaded back into Archeglyph (`⋯ → Import
  snapshot`) and becomes a new dataset in your workspace. The original snapshot is not
  mutated; re-imports are a fork, not a load. This is how a collaborator receives your
  study.

## Caveats we want to be honest about

- **Not every settings field is carried.** The snapshot preserves the engine selection, the
  chunking recipe, and the search configuration. Workspace-level things (billing, team
  membership, access policies) are intentionally not exported because they belong to a
  workspace, not a dataset.
- **Image rehydration depends on upstream availability.** If the source archive takes a
  document offline, the `rehydrate.sh` script will fail on that file. The extracted text,
  regions, index, and embeddings are untouched — you keep the scholarship, you lose the
  ability to redisplay the image.
- **Snapshot schema will version.** The format is at v1. Future versions will add fields,
  never remove them; a v1 reader will continue to open every snapshot produced today.

## A short checklist before you call it done

1. Download the tarball and verify the SHA-256s in `manifest.json` (see the sketch
   after this list).
2. Open `catalogue.sqlite` and confirm the document count matches what you expect.
3. Archive the tarball somewhere with an addressable URL (institutional repository,
   Zenodo, S3 bucket with public read).
4. If the study is published, paste the `snapshot_id` into the methods section and the DOI
   onto the dataset&apos;s settings page in Archeglyph so other readers can find it.

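For step 1, a minimal verification script. The `files` / `path` / `sha256` key names
are assumptions about the manifest layout; `snapshot_id` is the documented field:

```
import hashlib, json, pathlib

snapshot = pathlib.Path(&apos;archeglyph-&lt;…&gt;&apos;)  # the unpacked tarball directory
manifest = json.loads((snapshot / &apos;manifest.json&apos;).read_text())

# Assumed layout: a list of {path, sha256} entries plus snapshot_id.
for entry in manifest[&apos;files&apos;]:
    digest = hashlib.sha256((snapshot / entry[&apos;path&apos;]).read_bytes()).hexdigest()
    assert digest == entry[&apos;sha256&apos;], f&apos;hash mismatch: {entry[&quot;path&quot;]}&apos;

print(&apos;all file hashes verified&apos;)
print(&apos;snapshot_id:&apos;, manifest[&apos;snapshot_id&apos;])  # the value to cite
```
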
A snapshot is not the end of a dataset&apos;s life — it is the first moment it becomes a
citizen of the scholarly record rather than a row in our database. That is what we
built the format for.

&lt;FaqSchema items={[
  {
    question: &quot;Can I open an Archeglyph snapshot without running Archeglyph?&quot;,
    answer: &quot;Yes. The tarball contains a sqlite catalogue, a tantivy lexical index, and a zvec embedding store — all three are readable with their respective off-the-shelf libraries. The README inside the tarball points at the schema and the Python reader scripts we bundle.&quot;
  },
  {
    question: &quot;Does the snapshot include the source images?&quot;,
    answer: &quot;Not by default. The tarball carries stable URLs plus SHA-256 hashes for every source image and a rehydrate.sh script that re-downloads them from the original object store. A future --with-images flag will bundle binaries for researchers whose source material is fully open.&quot;
  },
  {
    question: &quot;How do I cite a specific snapshot?&quot;,
    answer: &quot;Cite the snapshot_id from manifest.json (a SHA-256 of the concatenated file hashes) alongside the export date. Two researchers quoting the same snapshot_id are guaranteed to be looking at bit-identical data.&quot;
  },
  {
    question: &quot;Can I re-import a snapshot to continue working on it?&quot;,
    answer: &quot;Yes. From the workspace menu, choose Import snapshot and upload the tarball. It loads as a new dataset; the original snapshot file is not mutated, so re-imports are forks rather than in-place loads.&quot;
  }
]} /&gt;</content:encoded><category>export</category><category>archiving</category><category>snapshots</category><category>how-to</category><author>Dipankar</author></item><item><title>Why we snapshot per dataset</title><link>https://www.archeglyph.com/articles/why-we-snapshot-per-dataset/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/why-we-snapshot-per-dataset/</guid><description>The product decision behind Archeglyph&apos;s dataset snapshot: one tarball that bundles a tantivy index, a zvec embedding store, and a sqlite catalogue — why the three belong together and why the unit is the dataset, not the document.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate><content:encoded>If you asked any three digital humanities tools to export &quot;everything about this corpus
right now, in a form I can archive,&quot; you would get three different answers and none of them
would round-trip. One tool would give you a folder of PDFs and shrug at the indexes. Another
would give you a Postgres dump with references to an Elasticsearch cluster you no longer
have. A third would give you a vendor-specific archive that needs the vendor&apos;s runtime to
open.

Archeglyph&apos;s answer is a single tarball per dataset, containing three files: a tantivy
lexical index, a zvec embedding store, and a sqlite catalogue. That choice — one unit, three
files, the dataset as the atom — took real work to converge on, and it is worth writing down
why.

## What a dataset is

A dataset in Archeglyph is a bounded collection of source pages a researcher has decided to
study together: a newspaper run, the plates from one expedition, one archive&apos;s photographs
of a monastery. It has a stable slug, a settings page that pins the engines used for its
pipeline, and a set of documents whose regions, text, clusters, and search indexes are all
derived from those engines. The dataset is the unit a researcher talks about at a
conference. It is also the unit we need to be able to hand back to them intact.

The corollary is that the dataset is *not* the document. A document snapshot that did not
carry its embedding space would lose the ability to search. A document snapshot that carried
an embedding space but not the index would ship a vector blob no one can query. The dataset
is the smallest scope at which the artefacts still compose into a usable tool.

## Why three files, and why these three

Each file in the tarball covers one mode of access and is lossless on its own:

- **The sqlite catalogue** is the system of record: documents, pages, regions, engine
  provenance, edit history, cluster membership, settings at time of snapshot. It is plain
  SQL, opens in any sqlite browser, and is the thing an archivist can read in ten years
  without us.
- **The tantivy index** is the lexical search layer. It is derived data — it can be rebuilt
  from the catalogue — but rebuilding it is minutes to hours, and a snapshot without it is
  noticeably worse to open. Ship it.
- **The zvec embedding store** holds the chunk embeddings plus their metadata (model id,
  dimension, chunking recipe). Like the index, it is derivable; unlike the index, rebuilding
  requires access to the embedding model, which may have been retired or paywalled by the
  time a snapshot is re-opened. Shipping the vectors is how you guarantee the semantic
  search still works years later.

There are tempting simplifications we rejected. A single sqlite file with vectors stored as
blobs would be convenient but would forfeit zvec&apos;s per-chunk compression and the ability to
load vectors without paging the whole DB. A single portable archive built on Parquet would
be elegant but we would have to re-implement the reader side of tantivy. The three-file
shape is a compromise: each file is authored by a battle-tested library, and the tarball is
what makes them feel like one object.

## Why the unit is the dataset

There is a pull, always, to snapshot at a coarser or finer grain:

- **Coarser: snapshot the whole workspace.** Attractive because it would be one button. Not
  what researchers want. A workspace often mixes a finished study with half-cooked
  exploratory corpora; the finished one needs citation-stable archiving, the exploratory
  ones don&apos;t, and bundling them conflates two lifecycles. A workspace snapshot also
  multiplies the size of every archive by a factor that has nothing to do with the
  scholarship.
- **Finer: snapshot one document.** Attractive because the individual page is the atomic
  image. Not useful on its own: the vector space a document&apos;s chunks live in is shared across
  the dataset, so a single-document snapshot either ships the whole embedding store (wasteful)
  or ships only that document&apos;s vectors (which cannot be searched without the rest). Either
  way the snapshot is no longer composable.

The dataset sits exactly where the scholarly unit and the technical unit agree. That is why
it is the snapshot.

## Operational consequences

Committing to a dataset snapshot format shaped parts of the product that look unrelated:

1. **Settings are copied into the snapshot.** The sqlite catalogue carries the engine
   selections active at the moment of the snapshot — not just the engine names, but their
   versions. Opening an older snapshot displays a settings banner that makes clear this is
   what the dataset was extracted with, even if the workspace has since moved on.
2. **Re-runs are idempotent within the snapshot.** Because the catalogue stores every
   re-run as a new row with its own provenance, a snapshot can be re-extracted selectively
   and the new rows either merge into a fresh snapshot or split off into a derived dataset.
   We did not want to teach ourselves two different &quot;which row is canonical&quot; rules.
3. **The tarball is the export format, full stop.** There is no JSON export, no CSV export,
   no &quot;lite&quot; mode. Every export is this tarball. Researchers get a format that round-trips
   back into the product; archivists get a format that opens without the product; and we get
   one thing to maintain instead of five.

## What we are deferring

The snapshot format does not yet carry the raw source images. That is deliberate — images
are large, often restricted by the source archive&apos;s license, and already live in our object
store with their own retention policy. A snapshot currently carries image *references* (a
stable URL plus a SHA-256) and a helper script that re-fetches them from the original
repository when the snapshot is opened. A future `--with-images` flag will bundle the
binaries for researchers whose source is fully open. We would rather ship the lean tarball
now than block on the harder legal question.

## Why this belongs in an article

Infrastructure choices usually hide inside release notes. We surface this one because the
snapshot format is a promise to researchers: *the work you do in Archeglyph is yours, and
the form in which you take it away is simple enough to still make sense after we&apos;re gone.*
That promise is only real if we explain what it looks like.</content:encoded><category>architecture</category><category>product</category><category>snapshots</category><author>Dipankar</author></item><item><title>Choosing an embedding model for digital humanities</title><link>https://www.archeglyph.com/articles/choosing-an-embedding-model-for-dh/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/choosing-an-embedding-model-for-dh/</guid><description>A practical comparison of MiniLM-L6-v2 and BGE-small-en-v1.5 for DH corpora: what each optimises for, when the extra dimensions earn their keep, and how to decide without running a benchmark you cannot reproduce.</description><pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate><content:encoded>A researcher opening Archeglyph for the first time sees two options under *Embedding model*:
`all-MiniLM-L6-v2` and `bge-small-en-v1.5`. Neither label is self-explanatory; neither
choice is obviously wrong. This article is the long-form version of the hover tooltip, for
the researcher who wants to make the choice with their eyes open.

We are not going to cite benchmarks. The published MTEB numbers are useful as an orientation
for engineers, but re-running them against a 1901 Ottoman-Greek newspaper or a set of
colonial-era expedition plates is not a thing any of us have the budget to do honestly.
What we can offer is a description of what each model optimises for, the operational
consequences of picking one, and the heuristics we use when the researcher asks us.

## What &quot;embedding model&quot; means in Archeglyph

An embedding model turns a chunk of text — here, a passage of a few sentences drawn from an
extracted region — into a fixed-length numeric vector. Vectors whose cosine similarity is
high are, in theory, about similar things. Archeglyph uses those vectors for two jobs:

1. **Semantic search.** The researcher types a query, the product embeds it, and ranks
   chunks by similarity.
2. **Clustering.** Chunks that land near each other in vector space form a candidate
   cluster; the theme-writing LLM is given the top-TF-IDF terms from that cluster and asked
   for a 4-6 word title.

Both uses depend on the vector space being coherent for the *kind of text* in the dataset.
That is the axis on which these two models differ in practice.
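
To make the two jobs concrete, here is a minimal sketch of both against the
`sentence-transformers` package. The chunk texts and the query are invented for
illustration; Archeglyph runs the equivalent operations behind the UI.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(&quot;all-MiniLM-L6-v2&quot;)

chunks = [
    &quot;The steamer arrived at the Galata wharves before dawn.&quot;,
    &quot;Customs receipts for March were published in the gazette.&quot;,
]

# Normalised embeddings make cosine similarity a plain dot product.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# Job 1: semantic search. Embed the query, rank chunks by similarity.
query_vec = model.encode([&quot;shipping traffic at Galata&quot;], normalize_embeddings=True)
scores = chunk_vecs @ query_vec.T        # one cosine score per chunk
ranked = scores.ravel().argsort()[::-1]  # best match first

# Job 2: clustering starts from the same chunk_vecs (HDBSCAN runs over them).
```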

## What MiniLM-L6-v2 is

`all-MiniLM-L6-v2` is a 384-dimensional model distilled from a larger MiniLM, trained on a
broad mix of general English Q&amp;A and paraphrase data. It is small, fast, and has been the
default &quot;try this first&quot; open embedding for several years. For Archeglyph it has three
practical virtues:

- **Low footprint.** 384-dimensional vectors compress well in zvec; a dataset of a million
  chunks fits in memory on a single modest server.
- **Fast embedding.** On CPU it will out-throughput most alternatives. On a machine without
  a GPU, this is the difference between waiting an hour for a dataset to embed and waiting a
  shift.
- **Long production history.** Its failure modes are well documented; when a cluster looks
  odd with MiniLM, there is usually a named reason.

What it is not especially good at: domain-shifted English, archaic spellings, multilingual
content, and sentences where the interesting signal is a small number of proper nouns
(place names, ship names, officer names). In those regimes it will still produce a vector,
but the vector will often cluster on surface features (sentence length, function-word mix)
rather than what the researcher cares about.

## What BGE-small-en-v1.5 is

`bge-small-en-v1.5` is a 384-dimensional model from the BGE family, trained with an explicit
instruction-tuning objective on retrieval pairs. It is the same size as MiniLM and embeds
at roughly comparable cost. The interesting differences show up qualitatively:

- **Retrieval-shaped training.** BGE was trained to make query-document pairs close and
  negatives far; MiniLM was trained more broadly. For Archeglyph&apos;s two use cases (search,
  then cluster-as-a-form-of-search), that objective is on-target.
- **Better handling of named entities.** In internal dogfooding on a 1900s newspaper
  corpus, BGE&apos;s top-k search results for a proper-noun query (`&quot;wharves of Galata&quot;`) more
  consistently surface the *narrative* contexts around that phrase rather than other
  sentences of similar shape. We do not have a publishable benchmark for this; we mention it
  as an intuition to keep.
- **Instruction prefix.** BGE expects a short prefix on query embeddings (e.g.
  `&quot;Represent this sentence for retrieval: &quot;`). Archeglyph applies this automatically — if
  you switch to BGE, the query side of the pipeline is handled. You do not need to think
  about it; the sketch after this list shows what that handling amounts to.
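
What &quot;applied automatically&quot; means, as a minimal sketch. The exact prefix string ships
with the model card (the one below is the wording quoted above), and the variable names
are ours, not Archeglyph&apos;s internals:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(&quot;BAAI/bge-small-en-v1.5&quot;)

# The asymmetry is the point: documents are embedded bare,
# queries get the instruction prefix.
QUERY_PREFIX = &quot;Represent this sentence for retrieval: &quot;

passages = [&quot;The steamer arrived at the Galata wharves before dawn.&quot;]
query = &quot;wharves of Galata&quot;

doc_vecs = model.encode(passages, normalize_embeddings=True)
query_vec = model.encode([QUERY_PREFIX + query], normalize_embeddings=True)
scores = doc_vecs @ query_vec.T
```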

What it is not: multilingual. `bge-small-en-v1.5` is English-tuned. For Ottoman-Turkish,
Italian, French, Greek, or Arabic sources, neither of these two models is ideal; the
researcher should pick whichever they judge less bad and plan on the cross-language failure
modes. A future Archeglyph release will surface BGE-M3 or the multilingual E5 family as
third and fourth options for exactly this reason.

## The heuristic we give researchers

A decision tree that does not require benchmarks:

- **English-only corpus, CPU-bound infrastructure, dataset &gt; 500k chunks** → start with
  MiniLM. The embedding pass is cheap and the search quality is &quot;good enough&quot; for the first
  exploratory read.
- **English-dominant corpus, quality matters more than throughput, GPU available** → start
  with BGE. The improvement is perceptible in the top-10 search results on the kinds of
  queries DH researchers actually type.
- **Mixed-language or heavily archaic corpus** → either, with the awareness that whichever
  you pick, you are going to see cross-language leakage. Consider using Archeglyph&apos;s cluster
  view as the primary reading surface rather than search, because clustering is slightly
  more forgiving of a noisy vector space than pinpoint retrieval.
- **Actively comparing models** → embed the dataset twice. Archeglyph&apos;s snapshots carry the
  embedding model id per chunk, so a dataset can live in the workspace with two embedding
  spaces and the provenance badge will keep them straight. This is the honest way to compare
  on *your* corpus; it is also the only way that yields a defensible answer.

## Operational notes

- Switching embedding model on a live dataset re-embeds all chunks and rebuilds the index.
  The settings page surfaces this as a rebuild step with an estimated time before you
  confirm. A researcher should expect minutes per thousand chunks on CPU, seconds on a
  modern GPU.
- The search result UI discloses the embedding model on hover. If you switched models
  mid-study, this is how you will notice that *this* result came from the old space.
- Clustering is not invariant across models. A dataset clustered under MiniLM and then
  re-clustered under BGE will *not* produce the same clusters, or even the same number of
  clusters; treat them as two separate analytic frames, not two views of one truth.

## The honest caveat

Picking an embedding model is one of several places in a DH pipeline where the default
should usually be *try one, read, try the other, read again*. We have shipped two defaults
because shipping zero is not useful and shipping ten is paralysing. The right reading of
this article is not *&quot;one of these is better&quot;* but *&quot;these are the two we shipped, here is
how they differ, and here is how Archeglyph helps you tell the difference on your own
corpus.&quot;* The scholarship is still yours.</content:encoded><category>embeddings</category><category>semantic-search</category><category>practical</category><author>Dipankar</author></item><item><title>Reviewing a noisy scan</title><link>https://www.archeglyph.com/guides/reviewing-a-noisy-scan/</link><guid isPermaLink="true">https://www.archeglyph.com/guides/reviewing-a-noisy-scan/</guid><description>A walkthrough of the review screen on a low-quality scan: what to look for, how to read the confidence tint, and when to re-run a region — or the whole page — with a VLM instead.</description><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate><content:encoded>import FaqSchema from &apos;../../components/seo/FaqSchema.astro&apos;;

Eventually every digital humanities pipeline meets the scan it cannot quite read. Paper
that was already foxed before 1950, microfilm that was printed hot, a colonial-era plate
whose register drifted during capture — these documents are the reason a reviewer seat
exists at all. This guide walks through how Archeglyph&apos;s review screen handles a bad scan
and how to decide, region by region, whether to accept, edit, or re-run.

## Before you open the review screen

Open the dataset&apos;s Settings page and check the extraction engine. If the dataset was
extracted with Tesseract and you are about to triage a batch of scans you know are noisy,
you have two options: leave the default and fix regions individually on the review screen
(cheap but slow), or switch the default to a VLM for the whole dataset (expensive but
systematic). This guide assumes the first — you&apos;re keeping the default and fixing the worst
offenders on a per-region basis.

## The review screen, at a glance

When you open a document, the review screen splits into two columns:

- **Left: the scan.** The source image with layout regions overlaid as bounding boxes.
  Hover any box and the corresponding text card on the right scrolls into view and tints.
  Click a box to activate it.
- **Right: the cards.** One card per region, in reading order, with the extracted text in a
  textarea, the provenance badge below, and an `Accept` button. Regions the extractor
  flagged low-confidence render in a warn-orange tint; high-confidence regions stay muted
  ink.

On a clean scan almost every card is muted; you skim, accept, move on. On a noisy scan the
column of warn-orange tints is what you will notice first.

## Reading the signals

Three signals together tell you whether a card needs work:

1. **Card tint.** Warn-orange = the extractor&apos;s own confidence score dropped below 65%.
   Muted ink = the extractor thinks it got this one.
2. **Region shape on the image.** Layout regions that overlap, clip through a fold, or run
   at an angle are a layout-assessment failure, not an extraction failure — re-running the
   text engine won&apos;t help.
3. **The text itself.** Look for the failure patterns: run-together words, characters
   replaced with punctuation (`d1e` instead of `die`), lines that start mid-word because
   the layout pass missed a break.

A region with all three signals lit (orange tint, odd bbox, garbled text) is almost
certainly a full-page candidate for re-running with a VLM. A region with only one signal lit
(say, orange tint but reasonable-looking text) is usually fixable inline.

## The keyboard rhythm

The review screen is designed to be operated from the keyboard. The essential four:

- `j` / `k` — move between regions.
- `e` — edit the focused region&apos;s text (focuses the textarea).
- `Enter` — accept the focused region.
- `r` — open the region re-run popover.

On a noisy scan, the rhythm becomes: `j j j`, stop on an orange card, press `e`, fix the
text, press `Enter`, continue. After a few pages you stop thinking about the keys.

## When to re-run a region

Press `r` on a focused region. The popover offers two tabs (OCR, VLM) and a short list of
available engines. The rules of thumb:

- **The text is garbled but the bbox is right** → re-run with a better OCR engine first. If
  the dataset&apos;s default is Tesseract and you have a cloud VLM configured, try the VLM
  anyway; on small regions the cost is negligible.
- **The region is a caption, a figure label, or a stamp** → VLMs read these better than
  Tesseract in almost all cases. Re-run with a VLM and accept the result.
- **The region is a column of a table** → neither engine is reliable on table cells in M0.
  Re-running does not help; correct inline or mark the region for a later pass.

Each re-run produces a new row in the region&apos;s history with its own provenance badge. The
previous row is not lost — the row-history disclosure on the left edge of the card shows
every attempt, and you can swap back if the re-run was worse.

## When to re-run the whole document

If more than roughly a third of a document&apos;s regions are orange, a per-region approach will
cost more reviewer time than a single document-level re-run. Open the right-pane &quot;Re-run
full document from…&quot; control, pick the extraction stage, and choose a VLM override. This
replaces the extraction outputs for all regions at once and leaves the layout assessment
intact (unless you also pick the `assess` stage).

Rule of thumb: document-level re-runs are worth it when you expect to accept most of the
new output. If you already know three-quarters of the page will need manual edits either
way, save the cloud call and fix inline.

## When to give up and re-scan

There is a scan quality below which no pipeline will help you. If the layout pass produces
overlapping bboxes that slice through columns, if regions disappear entirely on certain
pages, if the VLM comes back with plausible-looking prose that does not match the image —
the document is below threshold. Flag it with a review note (the textarea supports a
`[[rescan]]` tag that surfaces on the dataset&apos;s documents table) and move on. Archeglyph
does not pretend that a better model will rescue a photograph of a ruined page.

## A suggested workflow on a tough batch

1. Open the first document. `j` through every region without editing. Note how many orange
   cards you see per page.
2. If the ratio is low (&lt; 15%), fix regions inline as you go.
3. If the ratio is high (&gt; 30%), exit to the dataset level and re-run extraction on the
   whole batch with a VLM override. Come back to review fresh.
4. For regions where the new extraction is still wrong, edit the text inline and accept it
   rather than re-running a third time. At that point, you are the arbiter.

The review screen is designed around the assumption that a researcher&apos;s time is the most
expensive thing in the pipeline. Use it for judgement, not for data entry.

&lt;FaqSchema items={[
  {
    question: &quot;What does the warn-orange tint on a region card mean?&quot;,
    answer: &quot;The extraction engine&apos;s confidence score for that region fell below 65%. The tint is a signal that the region is worth pausing on; it does not mean the text is wrong, only that the engine was less sure of itself than it usually is.&quot;
  },
  {
    question: &quot;Will re-running a region lose the previous text?&quot;,
    answer: &quot;No. Every re-run produces a new row in the region&apos;s history with its own provenance badge. The previous row is still available from the row-history disclosure on the left edge of the region card, and you can swap back if the re-run was worse.&quot;
  },
  {
    question: &quot;When should I re-run a whole document instead of individual regions?&quot;,
    answer: &quot;As a rule of thumb, if more than roughly a third of a document&apos;s regions are flagged low-confidence, a document-level re-run with a VLM override is cheaper than fixing each region by hand. Below that ratio, per-region re-runs and inline edits are usually faster.&quot;
  },
  {
    question: &quot;Does Archeglyph re-run layout assessment when I re-run extraction?&quot;,
    answer: &quot;No. Re-running extraction leaves the layout regions untouched. If the problem is that regions are clipped, overlap columns, or miss a page break, you need to re-run from the assess stage — not the extract stage — on the full-document re-run control.&quot;
  }
]} /&gt;</content:encoded><category>review</category><category>ocr</category><category>vlm</category><category>how-to</category><author>Dipankar</author></item><item><title>VLM vs OCR: when to pick what</title><link>https://www.archeglyph.com/articles/vlm-vs-ocr-when-to-pick-what/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/vlm-vs-ocr-when-to-pick-what/</guid><description>Notes from the newspapers prototype on when Tesseract is still the right choice, when a vision-language model earns its cost, and how to tell the difference before a full run.</description><pubDate>Sat, 25 Apr 2026 00:00:00 GMT</pubDate><content:encoded>A common framing in the digital humanities community right now is that vision-language
models have made OCR obsolete. This is not what we found on the newspapers prototype. What
we found, roughly, is that each engine has a regime where it is straightforwardly the better
tool, and a middle regime where the choice depends on what you are going to do with the text
afterwards. This article is our attempt to describe those regimes concretely enough that you
can make the call on your own corpus.

Everything below is from our experience running a few thousand pages of archival
newspapers through both pipelines and hand-checking the outputs. It is not a benchmark
paper. Treat it as folklore from one project that we found held up.

## Where Tesseract still wins

Tesseract — by which we mean a recent Tesseract 5 with LSTM and the right language packs —
is, on our corpus, strictly better for:

- **Clean, high-resolution print.** 300+ dpi scans of 20th-century typeset text. The
  character accuracy on well-aligned Latin-script print is remarkably good, and Tesseract
  is fast and predictable in its failures.
- **Heavy throughput.** A page of newspaper text extracts in under a second on a modern
  CPU. A VLM run on the same page takes 10-60 seconds and a real amount of money. When the
  corpus is large and the downstream task is lexical search, the speed and cost ratio
  dominates.
- **Cases where you will post-process.** Tesseract&apos;s errors are *consistent*. It mis-reads
  the same letter-pair the same way across a page. That consistency is a gift for
  deduplication, lexical normalisation, and any downstream pipeline that can correct
  systematic errors in bulk.

On our newspapers corpus, Tesseract hit character accuracy above 98% on a sample of
well-scanned 1920s broadsheet pages, and the errors it did make were almost entirely in a
fixed set of confusions (`cl` ↔ `d`, `rn` ↔ `m`, `in` ↔ `m`).
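
That fixed confusion set is what makes bulk post-correction feasible. A minimal sketch of
the idea (not Archeglyph code), assuming a `lexicon` of known-good words built from a
wordlist or from the high-frequency tokens of the clean part of the corpus:

```python
CONFUSIONS = [(&quot;cl&quot;, &quot;d&quot;), (&quot;rn&quot;, &quot;m&quot;), (&quot;in&quot;, &quot;m&quot;)]

def correct_token(token: str, lexicon: set[str]) -&gt; str:
    &quot;&quot;&quot;Repair one OCR token using the known confusion pairs.

    Substitute only when the original is *not* a known word and the
    repaired form *is*; that guard is what keeps bulk correction safe.
    The pairs are bidirectional in practice; one direction shown here.
    &quot;&quot;&quot;
    if token.lower() in lexicon:
        return token
    for bad, good in CONFUSIONS:
        repaired = token.replace(bad, good)
        if repaired != token and repaired.lower() in lexicon:
            return repaired
    return token

# correct_token(&quot;clie&quot;, {&quot;die&quot;}) returns &quot;die&quot;
```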

## Where a VLM earns its cost

A vision-language model — in our case, various Ollama Cloud models that accept a region crop
and return text — is straightforwardly the better tool for:

- **Degraded scans.** Faded print, show-through from the reverse page, heavy staining, tight
  gutters. A VLM&apos;s language prior lets it read around damage that Tesseract refuses to
  touch.
- **Non-Latin scripts with limited training data.** We had a small set of Ottoman-Turkish
  pages. Tesseract&apos;s Ottoman language pack is workable but the VLM&apos;s Arabic-script handling
  was noticeably better — particularly on ligatures and diacritics.
- **Handwriting.** Tesseract is not a handwriting engine. There are specialised
  handwriting models; for mixed print/handwriting pages, a VLM is the pragmatic path.
- **Mixed content.** Pages with figures, tables, and running text intermixed — where the
  layout model has already produced a bbox but the bbox contents are heterogeneous. The
  VLM&apos;s &quot;just describe what&apos;s in this crop&quot; tolerance handles these better.

The cost side is real. On a mid-sized VLM, per-page extraction at hosted rates runs roughly
ten to a hundred times the operational cost of Tesseract on a CPU. For a 10,000-page
project, that is the difference between &quot;run it tonight&quot; and &quot;budget for a quarter.&quot;

## The middle regime

Many corpora sit in a regime where either engine could plausibly work. In that regime the
choice depends on what you will do next:

- **Planning to do lexical search and snippet retrieval?** Prefer Tesseract. Its
  consistent errors are easy to account for in a BM25-style index, and you will want the
  throughput.
- **Planning to do semantic search or clustering?** The choice is more subtle. Embedding
  models are surprisingly robust to moderate OCR noise — MiniLM still produces sensible
  cosine similarities on text that is 85-90% character-accurate. But once errors pass a
  threshold, clustering degrades: the fragments that end up in a cluster start including
  passages that share *misreading patterns* rather than *topics*. If you are seeing this on
  your own corpus (the tell is a cluster whose exemplars share an odd letter-confusion), a
  VLM run on the degraded pages will almost always tighten the clusters.
- **Planning to publish the extracted text as a resource?** Prefer the VLM. The bar for
  published text is higher than the bar for internal search, and the VLM&apos;s error modes are
  less systematic — where it fails, it tends to produce readable (if wrong) text rather
  than gibberish.

## A concrete check before committing

If you are unsure which engine to pick for a new corpus, Archeglyph makes this check cheap:

1. Upload 20 pages spanning the visual range of the corpus — a clean page, a damaged page,
   a page with unusual layout, a page in a less-familiar script.
2. Run extraction with Tesseract.
3. On those same 20 pages, re-run extraction per region with a VLM.
4. Open the review screen and scan the two outputs side by side.

Because both extractions are stamped with their engine in the `ProvenanceBadge`, you can
quickly see where they agree and where they diverge. Twenty pages is enough to form an
opinion; on our corpus, the kinds of regions where the engines disagreed on a 20-page
sample predicted where they disagreed at the full 5,000-page scale almost exactly.

## The hybrid strategy

The answer for large, heterogeneous corpora is usually neither pure-Tesseract nor pure-VLM.
It is a hybrid:

- Run Tesseract as the default on every region. It is fast and cheap.
- Use the VLM as a targeted re-run for regions flagged as low-confidence by Tesseract (low
  word count, low mean per-character confidence, high symbol-to-letter ratio).
- Expose both outputs in the review screen and let the researcher accept either, or edit
  in place.

Archeglyph supports this out of the box: per-region re-run with a different engine is a
first-class operation, the pipeline fingerprints each stage so re-runs skip unchanged work,
and the provenance badge keeps both outputs attributable.
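
For concreteness, the flagging heuristic can be sketched in a few lines. The threshold
values below are invented for illustration; the three signals are the ones listed above:

```python
def flag_for_vlm_rerun(text: str, mean_char_conf: float,
                       min_words: int = 3,
                       min_conf: float = 0.65,
                       max_symbol_ratio: float = 0.25) -&gt; bool:
    &quot;&quot;&quot;True when a Tesseract region looks bad enough to justify a VLM call.&quot;&quot;&quot;
    word_count = len(text.split())
    letters = sum(ch.isalpha() for ch in text)
    symbols = sum((not ch.isalnum()) and (not ch.isspace()) for ch in text)
    symbol_ratio = symbols / max(letters, 1)
    return (word_count &lt; min_words
            or mean_char_conf &lt; min_conf
            or symbol_ratio &gt; max_symbol_ratio)
```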

## The thing we got wrong

We built the newspapers prototype assuming VLM extraction would replace Tesseract wherever
we could afford it. On the first large run we found two things we did not expect:

- **VLM errors are less legible.** When a VLM mis-reads a word, the misreading is often a
  plausible other word — &quot;Galata&quot; becomes &quot;Golata&quot; becomes, a paragraph later, &quot;Gorata&quot;.
  Tesseract&apos;s errors look like OCR errors and are easy to spot. VLM errors look like
  paraphrases and are not.
- **VLMs hallucinate structure.** Given a crop that contains a half-visible column on one
  side, the VLM will sometimes confidently extract text from the half-visible column as if
  it were fully present. Tesseract, in the same situation, produces garbage that the
  reviewer can see is garbage.

Both of these argued for keeping Tesseract as the default and using the VLM as a targeted
tool. We still think that is the right default for most humanities corpora, and it is the
default Archeglyph ships with.</content:encoded><category>extraction</category><category>ocr</category><category>vlm</category><author>Dipankar</author></item><item><title>OCR vs VLM: a practical chooser</title><link>https://www.archeglyph.com/guides/ocr-vs-vlm/</link><guid isPermaLink="true">https://www.archeglyph.com/guides/ocr-vs-vlm/</guid><description>A short, decision-oriented guide to picking the right extraction engine for your corpus. When Tesseract is the right default, when a VLM is worth the cost, and how to test the choice cheaply.</description><pubDate>Thu, 23 Apr 2026 00:00:00 GMT</pubDate><content:encoded>This is a decision guide. If you want the reasoning behind it, read the companion article
[VLM vs OCR: when to pick what](/articles/vlm-vs-ocr-when-to-pick-what). If you want to just
decide what to set your dataset&apos;s `extract_engine` to, start here.

## The one-line answer

**Default to Tesseract; escalate to a VLM per-region when the output looks wrong.** This
handles almost every corpus we have seen.

The remainder of this guide is a more nuanced version of that same answer, for the cases
where the default isn&apos;t right.

## Pick Tesseract if

Any of these is true of your corpus:

- Printed text, typeset, post-1900.
- Scans at 300 dpi or better.
- Latin script, or a well-supported non-Latin script with a Tesseract language pack
  (Greek, Cyrillic, Arabic with `ara`, and so on).
- Your downstream use is lexical search or surveying, not publication of the extracted
  text.
- Your corpus is large enough that VLM per-page cost becomes a budget question.

Tesseract will produce good text quickly, the errors will be consistent, and you will have
headroom to re-run troublesome pages with a VLM individually.

## Pick a VLM if

Any of these is true, and especially if more than one is:

- Heavy degradation: staining, bleed-through, torn edges, uneven exposure.
- Low-resolution scans (below ~200 dpi).
- Handwriting or mixed print/handwriting.
- Non-Latin scripts with limited Tesseract support (historical Ottoman, older scripts, or
  very stylised typography).
- Your downstream use is publication of the extracted text as a resource, where the error
  bar matters.
- The corpus is small enough that per-page VLM cost is affordable.

Pick the smallest VLM on the Ollama Cloud list that works on a sample. Larger VLMs cost
more and are not always more accurate on extraction — some of them over-correct text in
ways you may not want.

## The hybrid default

Many corpora benefit from a hybrid approach:

- **Dataset default: Tesseract.** Runs on every region.
- **Per-document override: a VLM**, used when Tesseract output looks wrong on that document.
- **Per-region re-run: available from the provenance badge** in the review screen.

Archeglyph supports all three levels directly. No custom pipeline code is needed.

## How to test cheaply before committing

Before setting the extraction engine for a large dataset, run this 20-minute check:

1. **Pick a representative subset.** Twenty pages that span the visual range of your corpus
   — one clean page, one damaged page, one with unusual layout, one in the less-familiar
   script if your corpus has more than one.
2. **Upload the subset** as a fresh dataset with Tesseract as the default.
3. **Skim the review screen** for each page. Note the regions that look wrong.
4. **Re-run those regions** from the provenance badge with a VLM of your choice.
5. **Compare side by side.** The review screen will show both outputs attributed to their
   engines.

If Tesseract is right on 18 of 20 pages, stick with Tesseract and use per-region re-run as
needed. If it is wrong on 5 or more, switch the dataset default to a VLM. If it is in the
middle, consider the hybrid strategy above.

## A quick triage table

| Situation                                   | Default engine | Notes                                                     |
|---------------------------------------------|----------------|-----------------------------------------------------------|
| 20th-century typeset print, 300+ dpi        | Tesseract      | Expect 95-99% character accuracy                          |
| 19th-century print, 300+ dpi                | Tesseract      | Add post-processing for systematic errors                 |
| Pre-1850 print, letterpress                 | Tesseract → VLM| Test a subset first; VLM often wins                       |
| Typewritten 20th-century documents          | Tesseract      | Very reliable                                             |
| Degraded archival scans                     | VLM            | Tesseract output will look like noise                     |
| Handwriting                                 | VLM            | Tesseract is not designed for this                        |
| Mixed print + handwriting                   | VLM            | Mixed regions benefit from a VLM&apos;s tolerance              |
| Tables of numbers                           | Tesseract      | Specify PSM mode in settings if results look disordered   |
| Ottoman Turkish                             | VLM            | Our newspapers experience: noticeably better on ligatures |
| East Asian scripts (Chinese, Japanese)      | VLM            | Specialised OCR is an option; VLM is usually simpler      |

## Configuring the choice in Archeglyph

From the dataset&apos;s **Settings** tab:

- **Extraction engine**: set to `tesseract` or to any VLM id from the Ollama Cloud list.
- **Tesseract language**: set under the `extract_engine` sub-options when Tesseract is
  selected. Default is `eng`; change to `eng+fra`, `ara`, `ell`, etc., as your corpus
  requires.
- Saving the change applies to new documents. Existing documents keep their current
  extraction; to re-extract, use the per-document re-run button on the document&apos;s review
  screen or (for the whole dataset) the &quot;Re-extract all&quot; action.

Changing the extraction engine does not, by itself, invalidate embeddings or clusters:
those derive from the text, and existing text is untouched until you re-extract. Once you
do re-extract, the text changes, and the affected documents must be re-chunked,
re-embedded, and re-indexed. Archeglyph surfaces this in the confirmation modal when you
save a change that triggers a re-extraction.

## Further reading

- [The pipeline](/guides/pipeline) — where extraction sits in the full flow.
- [VLM vs OCR: when to pick what](/articles/vlm-vs-ocr-when-to-pick-what) — the reasoning
  and evidence behind the recommendations on this page.
- [Transparency is a feature](/articles/transparency-is-a-feature) — why every extracted
  block names the engine that produced it.</content:encoded><category>extraction</category><category>ocr</category><category>vlm</category><category>decision</category><author>Dipankar</author></item><item><title>Your first dataset</title><link>https://www.archeglyph.com/guides/first-dataset/</link><guid isPermaLink="true">https://www.archeglyph.com/guides/first-dataset/</guid><description>End-to-end walkthrough: sign in, create a dataset, upload pages, watch the pipeline run, review a document, and run your first search.</description><pubDate>Tue, 21 Apr 2026 00:00:00 GMT</pubDate><content:encoded>This guide takes you from zero to a searchable dataset in about 15 minutes of active time,
plus however long the pipeline takes to run on your uploads. You will need: a browser, an
email address, and a handful of scanned pages in PDF or image form — twenty pages is a good
starting size.

## Step 1 — Sign in

Archeglyph uses magic-link sign-in. Visit [/app/login](/app/login), enter your email, and
click the link we send you. There is no password to choose or remember. The magic link
expires 15 minutes after issue; if you miss the window, request another one.

The session cookie set on successful sign-in (`ag_sess`) is httpOnly and lasts 30 days. If
you sign out, or if 30 days of inactivity pass, you will be asked for another magic link.

## Step 2 — Create a dataset

From the datasets page, click **New dataset**. You will be asked for:

- A **name** — human-readable. &quot;Constantinople newspapers 1920s&quot; is fine.
- A **slug** — the URL-safe identifier. Derived from the name; you can edit it.
- A **description** — one or two sentences for your own future reference.

The first dataset you create uses Archeglyph&apos;s safe defaults: Tesseract for extraction,
MiniLM-L6 for embeddings, the smallest current Ollama Cloud VLM for layout assessment and
cluster labels. You can change any of these later from the dataset&apos;s **Settings** tab, and
the settings page will tell you which of your stored state (embeddings, clusters) would need
to be rebuilt if you do.

## Step 3 — Upload files

On the new dataset&apos;s page, click **Upload**. You can drag PDFs or image files directly onto
the page, or pick them from a file dialog. Archeglyph:

- Hashes each file. Duplicate uploads are detected and skipped.
- Accepts PDFs up to 500 MB and individual images up to 50 MB.
- Begins the pipeline automatically once a file has finished uploading.

You will see each file appear as a row in the document table. Its status column starts at
`uploaded` and moves through `assessed`, `extracted_text`, `chunked`, `embedded`, `indexed`,
`clustered`, `ready` as the pipeline runs. The updates arrive over a server-sent stream,
so no refresh is needed — the column updates in place.

## Step 4 — Watch the pipeline

For a typical 20-page upload, you will see:

- **Upload** complete in a few seconds (depends on your connection).
- **Assess** complete in a minute or two — this is the VLM looking at each page and
  returning regions.
- **Extract** complete in under a minute — Tesseract is fast.
- **Analyse** complete in another minute — chunking, embedding, indexing, clustering.

Five to ten minutes wall-clock is a fair estimate for twenty pages on a fresh dataset.
Longer documents with complex layouts will take longer; the progress bar on each row
reflects per-stage completion.

If anything fails, the row shows an error badge with a **Retry** button. The pipeline is
fingerprinted per stage so retries re-run only the failing stage.

## Step 5 — Review a document

Once a document&apos;s status hits `extracted_text`, its **Review** link becomes live. Click it
for one document. You will land on a three-pane screen:

- The **source image** on the left, with region bounding boxes overlaid.
- The **extracted text** in the middle, one editable block per region. Each block has a
  `ProvenanceBadge` showing the engine that produced it.
- A **metadata panel** on the right: confidence histogram, engine choices, per-region
  re-run buttons, and an escape hatch to re-run the whole document from the assess stage.

Scroll through the text. Click on a region in the image — the corresponding text block
highlights. If a block looks garbled, click the &quot;re-run with…&quot; affordance on its
provenance badge and pick a different engine. The re-run runs just that region, typically
in seconds.

When you are satisfied, click **Accept**. The document&apos;s status advances and the next
stages (chunking, embedding, indexing, clustering) proceed over the accepted text. You can
skip review entirely for corpora where that level of care is not needed.

Keyboard shortcuts help here: `j` and `k` move between regions, `e` opens the editor on
the current region, `r` opens the re-run menu, `Enter` accepts the region, `Esc` cancels.

## Step 6 — Run a search

Once at least one document is `ready`, the dataset&apos;s **Search** tab works. Type a query and
you will get back snippets from the dataset&apos;s text, each with:

- The document and page they come from.
- The matching phrases highlighted.
- The `ProvenanceBadge` for the extracted block they came from.
- A relevance score that combines lexical (Tantivy BM25) and semantic (zvec cosine) scores
  via reciprocal rank fusion; a sketch of the fusion follows this list.
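
Reciprocal rank fusion is simple enough to show in full. A minimal sketch; the constant
`k = 60` is the value the original RRF paper suggests, and we have not confirmed which
constant Archeglyph uses:

```python
def rrf(lexical_ids: list[str], semantic_ids: list[str], k: int = 60) -&gt; list[str]:
    &quot;&quot;&quot;Fuse two rankings of chunk ids into one, best first.

    Each input list is in rank order from its own index; a chunk earns
    1 / (k + rank) per list it appears in, and the sums decide the order.
    &quot;&quot;&quot;
    scores: dict[str, float] = {}
    for ranking in (lexical_ids, semantic_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```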

Use the **Lexical | Hybrid | Semantic** toggle at the top of the search box to change the
search mode. Lexical is best when you know the exact phrase; semantic is best when you are
searching for a concept; hybrid — the default — generally works well for both.

## Step 7 — Open the cluster browser

Click the **Clusters** tab. You will see a grid of cluster cards; each card leads with a
theme title, a one-sentence summary, and three exemplar fragments. Pick the card that
looks most interesting and click **Open cluster**. You will land in the **fragment
neighbourhood** view — all of the cluster&apos;s fragments with ±1 sentence of surrounding
context, grouped by document.

The fragment neighbourhood is where much of the research happens: read the fragments,
flag the ones that matter, and click through to the source page for the full context.
Flags and notes are per-user and persist across sessions.

If you want to see the more ML-flavoured view, click **Advanced** on any cluster card.
That reveals the probability histogram, outlier scores, and a UMAP scatter. These are
secondary by design; see [Reading clusters as a researcher](/articles/reading-clusters-as-a-researcher)
for why.

## Step 8 — Settings and snapshots

Visit the dataset&apos;s **Settings** tab. Every default Archeglyph uses for this dataset is
visible there and editable: the layout VLM, the extraction engine, the embedding model, the
cluster-label LLM, and the clustering parameters. Saving a change that invalidates derived
state (notably changing the embedding model) surfaces an explicit confirmation modal that
tells you what will be rebuilt and what it will cost.

The settings page also has an **Export snapshot** button. A dataset snapshot is a single
compressed archive of the lexical index, the vector index, and the metadata database. You
can download it, back it up, and later re-upload it to restore the dataset exactly. This is
the &quot;one file&quot; property mentioned on the landing page.

## What next

- [The pipeline](/guides/pipeline) — the same stages in more conceptual detail.
- [OCR vs VLM extraction](/guides/ocr-vs-vlm) — for when Tesseract is or isn&apos;t the right
  default on your corpus.
- [Reading clusters as a researcher](/articles/reading-clusters-as-a-researcher) — a reading
  guide for the cluster browser.</content:encoded><category>getting-started</category><category>tutorial</category><author>Dipankar</author></item><item><title>The pipeline</title><link>https://www.archeglyph.com/guides/pipeline/</link><guid isPermaLink="true">https://www.archeglyph.com/guides/pipeline/</guid><description>A plain-language tour of the four stages a document passes through in Archeglyph: upload, assess, extract, and analyse. Written for the person using the product, not the person building it.</description><pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate><content:encoded>This guide walks through what happens when you upload a page scan to Archeglyph, in the
order it happens, at the level a researcher cares about. It is not the implementer&apos;s view
— if you want API shapes and worker topologies, see the architecture docs — but it should
give you a clear mental model of what the product is doing with your files and why.

## Upload

A document starts its life as an upload into a **dataset**. A dataset is the unit of
grouping in Archeglyph: a corpus of related documents that share an extraction engine, an
embedding model, and a clustering configuration. You might have one dataset per archival
collection, or per research project, or per publication.

When you upload a PDF or image, Archeglyph:

- Hashes the file so re-uploading the same PDF is a no-op.
- Stores the original bytes untouched in object storage.
- Renders the pages of a PDF to page images at a resolution suitable for both layout and
  extraction models.

You see the file appear in the dataset&apos;s document table with a status of `uploaded`. No
extraction has happened yet — the next stage has to run first.
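
The dedup step is a plain content hash. A sketch of the idea, assuming SHA-256 (the
digest the snapshot format uses for image references; we have not confirmed the upload
path uses the same one):

```python
import hashlib

def file_fingerprint(path: str) -&gt; str:
    &quot;&quot;&quot;Hash a file in 1 MB blocks so a large PDF never loads into memory at once.&quot;&quot;&quot;
    h = hashlib.sha256()
    with open(path, &quot;rb&quot;) as f:
        for block in iter(lambda: f.read(1 &lt;&lt; 20), b&quot;&quot;):
            h.update(block)
    return h.hexdigest()

# An upload whose fingerprint already exists in the dataset is skipped as a duplicate.
```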

## Assess

The second stage is layout assessment. Archeglyph sends each page image to the
vision-language model you chose for the dataset (the default for new datasets is the
smallest current Ollama Cloud VLM, which is cheap to run and adequate for clean scans).
The model returns a list of **regions**: a bounding box, a `kind` (headline, body,
caption, figure, or table), a reading order, and a confidence.

Why a VLM for this step rather than classical computer vision? In our experience on the
newspapers prototype, classical column-detection works well on regular broadsheet layouts
and breaks on almost everything else: irregular gutters, rotated headlines, embedded
figures, book-style pages. A VLM handles the long tail because it has a language prior
over what a page looks like. For the regular cases where classical CV would also work,
Archeglyph retains a CV fallback that is offline and free — useful for very large
newspaper-like runs where the VLM cost adds up.

The assessment&apos;s output is what the next stage operates on: a set of labelled rectangles
per page, each one a region that needs to be read.
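
The region record is small enough to write down. A sketch of its shape; the field names
are ours, but the four fields are the ones the assessment returns:

```python
from dataclasses import dataclass

@dataclass
class Region:
    bbox: tuple[int, int, int, int]  # rectangle on the page image (assumed x0, y0, x1, y1)
    kind: str                        # headline | body | caption | figure | table
    reading_order: int               # position in the page&apos;s reading sequence
    confidence: float                # the layout model&apos;s own confidence
```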

## Extract

The third stage is text extraction. For each region the layout model found, Archeglyph
runs the extraction engine you chose — Tesseract by default, or any VLM in the Ollama
Cloud list — and stores the resulting text, the engine&apos;s name, and a timestamp.

A few things about this stage that matter:

- **It is per-region, not per-page.** A page with a headline, three body columns, and a
  caption is five separate extraction runs. This matters because it lets you re-run just
  one region with a different engine when one goes wrong, without touching the others.
- **Engine choice is per-dataset with per-document override.** Most researchers pick one
  engine for the whole dataset. When they hit a tricky page, they override the choice for
  that page (or for one region on that page) without changing the dataset default.
- **Every extracted block carries its engine in the `ProvenanceBadge`.** You can see at a
  glance which engine produced which block, and re-run a block with a different engine
  from the badge&apos;s menu.

When extraction finishes, the document is in state `extracted_text`. This is the first
point at which the document&apos;s text exists in Archeglyph: readable on the review screen,
and queued for indexing in the analyse stage.

## Review (optional)

Between extraction and analysis, Archeglyph offers an optional review step. This is the
**trust surface** of the product: a three-pane screen showing the page image with region
overlays on the left, the per-region extracted text (editable) in the middle, and a
metadata panel — confidence histogram, engine list, per-region re-run buttons — on the
right.

For small, important corpora (a few dozen pages you&apos;re going to cite) we recommend using
review. For large exploratory corpora (a thousand pages you are surveying) we recommend
skipping it, knowing you can come back to the review screen any time to spot-check a
document that looks off.

Reviewing a document doesn&apos;t change how the analysis stage runs — it just gives you a
chance to correct extraction errors before the text is chunked and indexed.

## Analyse

The final stage is where your dataset turns into something searchable and clusterable.
Archeglyph:

- **Chunks** the extracted text into sentence units using `syntok`. A chunk is roughly one
  sentence, sometimes two if the sentences are short.
- **Embeds** each chunk with the embedding model you chose — MiniLM-L6 by default, with
  BGE-small as an interchangeable alternative. The embedding model&apos;s id is recorded
  alongside each chunk so a later re-embed is a tracked event, not a silent overwrite.
- **Indexes** the chunks twice: once in a lexical index (Tantivy, with stemming and
  snippets) and once in a vector index (zvec, same dimension as the embedding model). The
  two indexes join on chunk id so hybrid search works transparently.
- **Clusters** the chunks into semantic groups using HDBSCAN over the embeddings and into
  lexical groups using TF-IDF plus TruncatedSVD plus HDBSCAN. Each cluster gets a theme
  title and a one-sentence summary from a small text LLM, both of which disclose the LLM
  that wrote them.

When analysis finishes, the document is in state `ready`. The dataset&apos;s search, cluster
browser, and fragment neighbourhood views all become available on the document&apos;s text.
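
The chunking step and the lexical clustering path are short enough to sketch end to end.
The parameters below (`max_features`, `n_components`, `min_cluster_size`) are illustrative
defaults, not Archeglyph&apos;s tuned values:

```python
import hdbscan
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from syntok import segmenter

def sentence_chunks(text: str) -&gt; list[str]:
    &quot;&quot;&quot;Split extracted text into sentence-sized chunks with syntok.&quot;&quot;&quot;
    chunks = []
    for paragraph in segmenter.process(text):
        for sentence in paragraph:
            chunks.append(&quot;&quot;.join(t.spacing + t.value for t in sentence).strip())
    return chunks

def lexical_clusters(chunks: list[str]) -&gt; list[int]:
    &quot;&quot;&quot;TF-IDF, then TruncatedSVD, then HDBSCAN; label -1 marks outliers.&quot;&quot;&quot;
    tfidf = TfidfVectorizer(max_features=20_000).fit_transform(chunks)
    reduced = TruncatedSVD(n_components=50).fit_transform(tfidf)
    return list(hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced))
```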

## What you see along the way

The dataset page shows each document&apos;s current state and any running jobs. Jobs emit live
events over a server-sent stream; the status column updates as each stage completes. If a
stage fails — the VLM times out, a PDF has an unreadable page — the failure surfaces on
the document with a retry button. The pipeline is fingerprinted per stage, so a retry
re-runs only the failing stage, not the whole document.

## What each stage costs

A rough sense of cost per page on a medium-large corpus (hundreds of pages):

- Upload and render: free (CPU).
- Assess: 10-30 seconds and a few cents of VLM credit per page.
- Extract: either a tenth of a second of CPU (Tesseract) or 30-60 seconds and single-digit
  cents of VLM credit (VLM read) per region.
- Analyse: a few seconds of CPU per document for chunking, embedding, and index updates;
  clustering runs once per ingest batch and is usually under a minute for datasets up to
  around 10,000 chunks.

For a 1,000-page corpus with Tesseract extraction and VLM layout, the end-to-end cost is
typically tens of minutes of wall-clock time and a few dollars of hosted-model credit.

## Where to go next

- [Your first dataset](/guides/first-dataset) walks through the same pipeline hands-on,
  from sign-in to first search.
- [OCR vs VLM extraction](/guides/ocr-vs-vlm) is the practical chooser for the extraction
  stage.
- [Transparency is a feature](/articles/transparency-is-a-feature) explains why every
  stage of the pipeline labels its output with the model that produced it.</content:encoded><category>pipeline</category><category>overview</category><author>Dipankar</author></item><item><title>What a good provenance badge looks like</title><link>https://www.archeglyph.com/articles/what-a-good-provenance-badge-looks-like/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/what-a-good-provenance-badge-looks-like/</guid><description>UX writing about the transparency contract: what goes inside the badge, what gets omitted, and why the re-run affordance lives next to it. With ASCII mockups of the patterns we use in the review screen, search results, and cluster cards.</description><pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate><content:encoded>If a provenance badge is a promise that &quot;this specific output was produced by this specific
engine,&quot; the badge has to be readable without training. It has to answer three questions at
a glance: *what model, what version, and can I try another?* It has to do that in a row of
search results without ballooning the row. And it has to mean the same thing whether it sits
beside an extracted paragraph, a cluster title, or a vector search hit.

We have iterated on the badge several times during M0. These notes describe where it landed
and why.

## Anatomy

A badge is three fields rendered as one pill:

```
┌─────────────────────────────────────────┐
│ qwen3-vl:235b-cloud · v2025.03 · 02:14 │
└─────────────────────────────────────────┘
```

- **Engine id.** The left-most field is the stable identifier used across the catalogue.
  Tesseract reads as `tesseract`, a VLM reads as its full Ollama tag. We never shorten
  `qwen3-vl:235b-cloud` to `qwen` — abbreviation was one of the first temptations and one of
  the first rejections, because &quot;qwen&quot; alone is not a citable reference.
- **Version.** For binary engines (Tesseract) this is the upstream semver. For cloud-backed
  VLMs this is a date tag that we reconcile nightly against the provider. If the provider
  does not expose a version, we surface the date we first observed that model id in our
  engine catalogue.
- **Timestamp.** HH:MM of when this specific block was produced. Not the full ISO-8601
  (which clutters), but enough to disambiguate the pre-review output from a re-run.

A badge never carries confidence scores. Confidence is useful in the review pane and on the
advanced panel of a cluster card, but folding it into the badge would pressure readers to
treat it as the headline number, and the headline of a provenance badge is *who produced
this*, not *how sure they were*.

## Where badges appear

### In the review pane

```
┌─ Region 14 ──────────────────────────────────────────────────────┐
│ &quot;reported from the wharves of Galata that the Russian           │
│  steamer...&quot;                                                    │
│                                                                 │
│ [ tesseract · 5.3.0 · 02:14 ]  [ accept ]  [ re-run with ⌄ ]    │
└──────────────────────────────────────────────────────────────────┘
```

The badge sits on the same row as the accept and re-run controls because those three things
compose one decision: *I have seen what produced this, I know my options, I choose to accept
or rework.* If the badge were in a tooltip, the action would lose the attribution that
justifies it.

### In search results

```
 #42  p=0.812   Document 117, p.3                                 
 &quot;…reported from the wharves of Galata that the Russian steamer…&quot; 
 [ tesseract · 5.3.0 ]  [ embed: bge-small-en-v1.5 ]              
```

Search results have two badges: the engine that extracted the text, and the model that
embedded the chunk. We show both because a user comparing two search results can form a
legitimate hypothesis like *&quot;the MiniLM rows rank differently from the BGE rows&quot;* only if
both badges are visible side by side.

### In cluster cards

```
┌─ Migrations across the Bosphorus ────────────────────────────────┐
│ Fourteen fragments, mostly port reporting from 1897–1901.       │
│ — theme_llm: gemma3:27b-cloud                                   │
│                                                                 │
│ &quot;the wharves of Galata...&quot;         — Doc 117, p.3  (tesseract)  │
│ &quot;steamers inward bound...&quot;         — Doc 204, p.1  (tesseract)  │
│ &quot;lo riferiva il console...&quot;        — Doc 91,  p.2  (qwen3-vl)   │
└──────────────────────────────────────────────────────────────────┘
```

On a cluster card, the theme-writing model is badged at the top of the card and each
exemplar carries its extraction engine. The rule is that every human-readable string a
person did not type at a keyboard has a badge somewhere within a one-glance radius.

## What we decided not to do

- **No &quot;AI generated&quot; disclaimer.** A badge that says `qwen3-vl:235b-cloud` is a piece of
  scholarly apparatus. A banner that says &quot;generated by AI&quot; is a legal posture. We made the
  mistake in an early prototype of bolting both on; readers ignored the banner entirely and
  dismissed the badge as redundant. We kept the badge.
- **No colour-coded risk.** We tried a green/amber/red scheme where high-confidence
  extractions got a muted badge and low-confidence ones got a warn tint. Reviewers read the
  colour as a judgement on the *engine* rather than the *region*, and argued with it. We
  moved confidence to the region tint instead, where it belongs.
- **No vendor logos.** A badge is text. Logos turn provenance into branding, and the moment
  a researcher sees a logo they stop treating the badge as information and start treating it
  as an endorsement.

## The re-run affordance

The badge is paired with a `re-run with…` trigger that opens a popover. The popover is split
into two tabs (OCR, VLM) with the current engine pre-selected and greyed out. Re-running
produces a new row in the region&apos;s history; the badge swaps to the new engine id but the
previous row is still available from the row-history disclosure on the left edge of the
region card.

The re-run button is never the default. In the review pane, `Accept` is the large button;
`re-run with…` is a secondary. In search results and cluster cards, the badge is purely
informational and the re-run affordance is gated behind clicking through to the review
pane. We resisted every design iteration where a researcher could re-run a region from a
search result, because the cost of mis-clicking a re-run in a scanning view is two minutes
of compute and a brief jitter in their own mental model of the dataset.

## The transparency contract, stated plainly

What the badge promises:

1. Every textual output in the product carries, on the same screen, an attribution to the
   engine that produced it.
2. Engine ids are stable: what appears in one snapshot resolves to the same model identity
   in every future snapshot.
3. Every output paired with a badge has a re-run path that is one or two clicks away.

What the badge does not promise:

1. That the engine is correct.
2. That the engine&apos;s weights will remain available upstream.
3. That we have any editorial opinion about the engine&apos;s output.

The badge is a pointer, not an endorsement. That is the whole shape of the transparency
contract, and it is the reason we obsess about the pixels.</content:encoded><category>transparency</category><category>ux</category><category>design</category><author>Dipankar</author></item><item><title>Transparency is a feature</title><link>https://www.archeglyph.com/articles/transparency-is-a-feature/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/transparency-is-a-feature/</guid><description>Why every extracted text block in Archeglyph shows the model that produced it, and why we treat that disclosure as product surface rather than footer text.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>A researcher opens a cluster in an automated text-analysis tool. The cluster is titled
&quot;Migrations across the Bosphorus&quot; and it contains forty-two fragments from a newspaper corpus.
Two of those fragments look very wrong — the OCR is garbled, the sentences don&apos;t quite close,
one of them seems to contain the word &quot;Golota&quot; where the city is clearly Galata. A reasonable
next question for the researcher is: *which engine produced that text, and can I re-run it with
something else?*

Most tools don&apos;t let that question get asked. The text just shows up. If the researcher is
suspicious, they can either ignore the fragment, dig through logs they don&apos;t have access to, or
trust the cluster anyway. None of those are good answers for scholarship.

Archeglyph&apos;s answer is to put the engine&apos;s name next to the text.

## The provenance badge

Every extracted text block in the product carries a small chip — we call it the
`ProvenanceBadge` — that shows the engine and version responsible for that block: for example
`tesseract 5.3` or `qwen3-vl:235b-cloud`, plus a timestamp. Next to the badge is a &quot;re-run
with…&quot; affordance that lets the researcher swap engines on that region without touching the
rest of the document. The badge appears in the document review screen, in search results, and
on every exemplar quotation inside a cluster card.

This sounds like a small UI element, and on the page it is. But the consequences run deep:

- **It forces the pipeline to be honest.** If we can&apos;t reliably attribute a text block to an
  engine, we can&apos;t render the badge. That constraint shaped our data model: every extracted
  region stores its engine id, and re-runs don&apos;t silently overwrite — they produce a new row
  with a new provenance stamp.
- **It turns failure into a question the researcher can answer.** A garbled OCR line stops
  being &quot;the machine failed&quot; and becomes &quot;Tesseract failed on this region; what happens if we
  try a VLM here?&quot; The failure mode is legible, and so is the remedy.
- **It makes cross-engine comparison part of normal reading.** When the cluster view shows that
  forty of the forty-two exemplars came from `tesseract 5.3` and two came from
  `qwen3-vl:235b-cloud`, the researcher can start forming intuitions about which engine earns
  its cost on which kind of page.

## Why this isn&apos;t a footer

The easy thing to do is put a line at the bottom of a report that says &quot;generated using an
AI-assisted pipeline.&quot; Every vendor does this and it satisfies nothing. A footer says: *there
is a machine somewhere, and the output might be wrong, and you should know that in the
abstract.* A badge next to each block says: *this specific sentence was produced by this
specific engine at this specific time, and here is the button to try again with a different
one.*

The first is a legal disclosure. The second is a piece of scholarly apparatus.

## What the product discloses

In M0 the badge surface covers:

- **Layout regions.** Each region&apos;s `kind` (headline, body, caption, figure, table) and the
  model that assessed the layout — e.g. `gemma3:27b-cloud` — with a confidence score when the
  model returns one.
- **Extracted text.** The engine that read each region and its version. For Tesseract that&apos;s
  the binary version. For a VLM that&apos;s the full Ollama tag.
- **Cluster theme titles.** When a small text LLM is used to polish the top-TF-IDF terms into
  a 4-6 word title, the title discloses the model that wrote it. The summary sentence gets the
  same treatment.
- **Embeddings.** Every chunk stores the embedding model id, and the search result UI surfaces
  it when the user hovers on a hit — because if you switch from MiniLM to BGE, results can
  reorder, and that reordering deserves a trail.

## What transparency is not

Transparency is not the same as openness. We do not claim the weights of the VLMs we call are
open or auditable. We do not claim you can reproduce a cluster bit-for-bit six months from now
if the upstream Ollama model has been retrained. What we do claim — and what the badge
delivers — is a second-order guarantee: *at the time you are looking at this output, you can
see exactly what produced it.* From there, if a claim matters, you can re-run the relevant
step with a different engine and compare.

That is enough for scholarship to work. A footnote that names the edition does not promise
the edition is correct; it promises the reader can go look. The provenance badge is the
same promise in a different medium.

## Implications for our roadmap

Treating provenance as surface shapes what we build next:

1. **Engine catalogue is a first-class object.** Not a config file; a database table, with a
   nightly reconciliation job that flags stale ids. If an engine disappears upstream, the
   dataset settings page warns you that your chosen default is no longer available.
2. **Re-run is cheap.** The pipeline is fingerprinted per stage, so re-running extraction on
   one region with a different engine costs only that region&apos;s compute, not the whole
   document&apos;s. The badge only makes sense if the &quot;re-run with…&quot; button is painless.
3. **The advanced toggle exists, but it&apos;s not the default.** Confidence histograms, outlier
   scores, UMAP projections — those matter when you&apos;re debugging a pipeline, not when you&apos;re
   reading a cluster. They live behind an explicit toggle on each cluster card.

## What we ask of readers

When you use Archeglyph outputs in published work, please cite the engine. The product makes
it easy — the badge text is already the citation string. In return, we commit to keeping the
badges stable: an engine id that appears in one snapshot will resolve to the same model
identity in all future snapshots, even if we retire the engine and archive the weights
metadata.

Transparency isn&apos;t a privacy stance or a compliance checkbox. It&apos;s the piece of product
surface that lets a researcher do their job without trusting us more than they should.</content:encoded><category>transparency</category><category>product</category><author>Dipankar</author></item></channel></rss>