<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Archeglyph — Articles &amp; Guides</title><description>Essays and how-tos on digital humanities pipelines: layout assessment, extraction, reading-first clusters.</description><link>https://www.archeglyph.com/</link><language>en-gb</language><item><title>Downstream of Trove: where analysis fits in the corpus stack</title><link>https://www.archeglyph.com/articles/downstream-of-trove/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/downstream-of-trove/</guid><description>Digitisation projects like Trove, Chronicling America, and Europeana produce the corpus. Archeglyph produces the analysis on top of it. They are layers of the same stack — not competitors, not substitutes.</description><pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate><content:encoded>A question we get, politely phrased, roughly once a week: *&quot;Is this a
competitor to Trove?&quot;* Or *&quot;How is this different from Chronicling America?&quot;*
Or — from someone who has spent longer thinking about it — *&quot;Where does
Archeglyph sit, exactly, in the ecosystem of digital tools for archival
research?&quot;*

The honest answer is that Archeglyph and the great digitisation projects
live on different floors of the same building. The building is the
corpus stack. This article is about what each floor does, and why
confusing the floors leads to bad tool choices.

## The corpus stack has layers

Digital research on archival material has, in practice, a vertical stack
of concerns. Roughly, from the bottom up:

1. **Preservation.** Keeping the physical artefact alive — paper, glass
   plate, wax cylinder, magnetic tape — for another century. This is the
   job of libraries, archives, and museums.
2. **Digitisation.** Turning the artefact into bits. Scanning a page,
   photographing a plate, ripping a cylinder. Producing a page image, an
   OCR text layer, and page-level metadata.
3. **Indexing.** Making the bits findable. Full-text search across the
   digitised corpus, search-by-metadata, browseable collection pages.
4. **Analysis.** Doing something with a *subset* of the digitised
   material once you have chosen it: clustering, close reading with
   navigation, extraction of claims, quantitative comparison.
5. **Interpretation.** Writing the paper, the chapter, the monograph.
   This is a human activity. It is not, at any foreseeable point, the
   job of software.

Each layer is somebody&apos;s job. Each layer has its own institutions, its
own funding model, its own timescale. Archeglyph occupies the
*analysis* layer; the layer immediately below it — the layer that
*feeds* Archeglyph — is digitisation.

## What digitisation projects do

The great digitisation projects are, by a wide margin, the most
impressive infrastructural work in the digital humanities. Some of
the ones we lean on daily:

- **Trove** (National Library of Australia) — hundreds of millions of
  digitised pages of Australian newspapers, gazettes, magazines, and
  books. Full-text searchable, with a community of volunteer text
  correctors improving the OCR a line at a time.
- **Chronicling America** (Library of Congress + NEH) — a growing
  corpus of historic US newspapers, state-by-state, with a public API
  and a clean page-image viewer.
- **Europeana** — a federation across European cultural heritage
  institutions, aggregating metadata and digitised objects from
  thousands of museums, libraries, and archives.
- **HathiTrust** — a shared digital library built on the mass
  digitisation of research-library holdings, with in-copyright and
  public-domain strata and a careful access model.
- **Internet Archive** — the public-facing generalist. Books, serials,
  audio, video, web. An indispensable safety net for everything the
  institutional projects haven&apos;t yet reached.
- **Google Books** — the largest of all by raw volume, with a search
  surface that is uneven but often surprising.
- **DPLA** (Digital Public Library of America) — an aggregator over
  US institutional collections, analogous in ambition to Europeana.

What these projects produce, broadly, is the same shape of output: a
scanned page, an OCR text layer of varying quality, page-level metadata
(title, date, publisher, rights), and a search surface that lets you
find pages across millions.

This is an enormous achievement. It is also expensive, institutional,
and slow. A digitisation project is measured in decades. Its output is
*broad* — it serves every downstream use from genealogy to literary
scholarship to local history — and, necessarily, *generic*: it does
not privilege any one research question.

## What Archeglyph does

Archeglyph is a research tool, not a digitisation tool. It starts from
material that has already been digitised — by an institution, by a
researcher with a scanner, by a photographer with a phone — and
produces a reading surface over the *specific subset* a researcher
cares about.

Concretely, given a set of page images or PDFs a researcher has
chosen, Archeglyph:

- Runs a transparent extraction pipeline (VLM-assisted layout
  assessment, OCR or VLM extraction at the researcher&apos;s choice) with
  every model disclosed at the point of output.
- Indexes the extracted text for full-text search across the
  researcher&apos;s corpus — not across all of Trove, just across what
  they uploaded.
- Clusters fragments semantically and presents each cluster as
  quotations with sources, not as a scatterplot.
- Keeps version history so that re-ingestion doesn&apos;t silently
  renumber clusters or invalidate saved links.
- Produces an auto-generated plain-language technique note so the
  researcher can cite how the corpus was processed.
- Ships the whole dataset as a single exportable snapshot — index,
  vectors, metadata — so the corpus can be archived, shared, or
  re-opened without the product.

Analysis, in other words. Project-bound, weeks-to-months, specific,
narrow — the opposite end of every axis from digitisation.

## Where they meet

The interface between the two layers is the PDF, or the image folder,
or the API response. A researcher searches Trove, finds the two hundred
pages that touch their research question, downloads them, and loads
them into Archeglyph. The bits come *from* the digitisation project.
The reading happens *on top of* Archeglyph.

Nothing about this is adversarial. We do not want to re-digitise what
Trove has already digitised; Trove has no plans to ship a clustering
UI. The digitisation projects built the library. Archeglyph is the
desk you read at.

&gt; *&quot;Trove built the library. Archeglyph is the desk you read at.&quot;*

## A worked example: Trove + Archeglyph in the same workflow

Concretely — a historian of Australian labour movements in the 1920s is
interested in one specific union&apos;s coverage in the Brisbane press.

1. **In Trove.** Search for the union&apos;s name across the *Brisbane
   Courier* and the *Daily Standard* for the years 1921–1929. Refine
   by date, by title, by page type. Select the two hundred-odd
   articles that look relevant. Download the page PDFs — Trove
   supports this for most of its newspaper holdings — or export the
   list as references.
2. **Between layers.** The researcher now has a folder of two hundred
   PDFs on their laptop. This is the handoff point. The digitisation
   layer has done its job; the analysis layer hasn&apos;t started.
3. **In Archeglyph.** Create a new dataset. Upload the PDFs. Let the
   pipeline run: layout assessment, OCR (Tesseract is usually fine for
   1920s Brisbane newsprint), chunking, embedding, clustering.
4. **Reading.** The cluster view surfaces themes — shipping strikes,
   wage arbitration, internal union politics, coverage of rival
   unions, editorial hostility. Each theme is a card with exemplar
   quotations and a page reference back to the original Trove scan.
5. **Citing.** Every quotation links to a source page. The researcher
   cites the Trove record for the canonical reference and uses the
   Archeglyph snapshot ID as the methodological appendix — *&quot;cluster
   analysis produced via Archeglyph snapshot XYZ, 2026-04-15&quot;*.

Trove did the work of digitising the Brisbane press in the 1920s.
Archeglyph did the work of letting this particular researcher read
two hundred pages as a corpus, not as two hundred separate documents.
The two tools did not compete at any step.

## What we don&apos;t claim

In the same spirit of spelling things out one axis at a time:

- We are not digitising more material. We have no scanners, no
  institutional mandate, no partnerships with rights-holders. If the
  material isn&apos;t already digitised, Archeglyph cannot help.
- We are not replacing the institutional archive. The canonical
  record stays where it is: in the library&apos;s catalogue, under the
  library&apos;s URL, with the library&apos;s metadata.
- We are not synthesising new prose. Archeglyph does not summarise a
  corpus into a paragraph. It does not answer a research question
  with generated text. Everything on screen is either a quotation
  from the corpus or a clearly-labelled technique note.
- We are not building a competing search engine. We search across
  *your* dataset. Trove searches across Trove. Those are different
  jobs.

## A closing line

If the phrase &quot;downstream of Trove&quot; lands badly — if it sounds
dismissive of the decades of work that Trove, Chronicling America,
Europeana, HathiTrust, the Internet Archive, Google Books, and DPLA
have put into digitising the record — that isn&apos;t our intent.
*Downstream* here is the geographical meaning, not the hierarchical
one. The river flows from the institutional archive, through the
digitisation project, to the researcher&apos;s desk. Archeglyph sits at
the desk. The water had to get there somehow, and it wasn&apos;t us who
carried it.</content:encoded><category>positioning</category><category>method</category><category>digitisation</category><author>Dipankar Sarkar</author></item><item><title>Orthogonal to LLM &apos;deep research&apos;</title><link>https://www.archeglyph.com/articles/orthogonal-to-llm-deep-research/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/orthogonal-to-llm-deep-research/</guid><description>Deep-research agents synthesise. Archeglyph indexes. They are different products solving different problems for different research workflows. Knowing which you need keeps your citations defensible.</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><content:encoded>There is a category of products getting a lot of attention right now —
&quot;deep research&quot; agents. They take a question, search the web (or a
private collection), read a few dozen sources, and produce a written
synthesis with footnote-style citations. Perplexity Research, ChatGPT
Search, Claude&apos;s research mode, Gemini Deep Research, and a long tail
of vertical clones all sit in this category.

Researchers often ask us how Archeglyph compares. The honest answer
is **we don&apos;t, because we&apos;re not in that category**. Deep-research
agents and Archeglyph aren&apos;t on the same axis. They solve different
problems for different stages of research. This article tries to
draw the orthogonal-axis distinction clearly, because choosing the
wrong tool for the work in front of you is expensive.

## What a deep-research agent does

A deep-research agent compresses *many sources* into *one written
answer*. Workflow:

1. Researcher asks a question.
2. Agent reads several dozen sources it selected.
3. Agent writes a synthesis — usually 800–3,000 words — with
   inline citations.
4. The footnotes link back to the sources.

This is genuinely useful. For an early-stage scoping pass on an
unfamiliar topic, or when you need a credible briefing in fifteen
minutes, the workflow saves real time. The footnotes are a meaningful
improvement over the previous generation of chatbots that gave you no
sources at all.

But the *output* is a written paragraph that the agent composed. The
sources fed it; it digested them; it produced new prose. The chain of
custody between any specific sentence in the output and any specific
sentence in a source is fragile in ways that are subtle and
adversarial:

- The cited source frequently doesn&apos;t say quite what the synthesis
  claims. (Multiple recent audits put this rate at 15–40% on real
  research questions.)
- A claim with no clear source in the cited material gets attributed
  anyway, often plausibly enough that nobody checks.
- The selection of sources is itself opaque — why these forty, and
  not the other forty?

For *exploratory* work, this is acceptable. For *citable* work, it
isn&apos;t. A researcher writing a footnote needs to point at a specific
page on a specific date. A philologist needs the passage as it
actually appears in the manuscript, not a paraphrase that may have
introduced a tense or a hedge.

## What Archeglyph does

Archeglyph compresses *one large corpus* into *a navigable index*.
Workflow:

1. Researcher uploads a corpus they already chose. (We don&apos;t pick
   sources for you. This is the scholar&apos;s job, not ours.)
2. Archeglyph reads each page (OCR or VLM, your choice, always
   disclosed). The text we extract is the text on the page; we
   show you the bbox.
3. We index it for full-text search and group it semantically into
   clusters of related fragments.
4. The output is a browsable view of the corpus: search returns real
   chunks, clusters surface real exemplar quotations, every fragment
   links back to the source page.

There is no &quot;synthesis&quot; anywhere in that pipeline. The deepest
generative thing we do is name a cluster — *&quot;Migrations across the
Bosphorus&quot;* — and even that comes with a &quot;this is an LLM-generated
label&quot; badge, with the actual quotations beneath it.

## The orthogonal axis

| Axis                          | Deep-research agents          | Archeglyph                                  |
|-------------------------------|-------------------------------|---------------------------------------------|
| Input                         | A question                    | A corpus                                    |
| Output                        | A written synthesis           | A navigable index                           |
| Source selection              | The agent picks               | You picked when you uploaded                |
| What the user reads           | Generated paragraphs          | Real fragments, in context                  |
| Provenance granularity        | Footnote per paragraph        | Region + page + bbox per fragment           |
| Confidence in any single line | Probabilistic                 | Deterministic (it&apos;s the OCR&apos;d text)         |
| Audit cost                    | Manual re-reading per claim   | One-time per region, then reusable          |
| Best for                      | Exploration, briefings        | Citable scholarship, longitudinal analysis  |

The two columns are not competing products. They&apos;re different jobs.

## Where they actually meet (and where they don&apos;t)

In a complete research workflow, both can have a place:

- **Scoping (early)** — a deep-research agent is a fast way to find
  out what conversation you&apos;re walking into. Read its synthesis
  knowing it might mislead, then go read the actual sources.
- **Working with primary corpora (the middle)** — a tool that lets
  you read your archive without reading it cover-to-cover. This is
  the hole Archeglyph fills.
- **Writing (later)** — your own synthesis, with citations to the
  primary sources you found via Archeglyph&apos;s search and clusters.
  Footnotes that point at page numbers, not at AI outputs.

What deep-research agents are *not* good at is being trusted at the
*writing* end of that pipeline. The hallucination rate is high enough
that any claim drawn from one needs to be re-verified against the
original. At which point the labour-saving has been transferred, not
removed — you&apos;re now reading the original anyway.

What Archeglyph is *not* trying to do is the synthesis. We have no
plans to add a chat interface, no plans to write summaries on your
behalf, no plans to &quot;answer the research question&quot;. A different
product can do that. We want to be the tool you cite *from*.

## Why this matters now

The noise floor of &quot;AI research tools&quot; is rising fast. A new product
launches every week claiming to do &quot;everything for the researcher&quot;.
Most of them are minor variations on the same generative pattern: an
LLM, a vector store, a chat interface, a hand-wave at hallucination.
The pattern isn&apos;t bad — it&apos;s just one workflow. The danger is that
researchers begin to assume *this is what AI does to research*: it
synthesises, it asserts, it occasionally invents.

We think there&apos;s an entire other axis to build along, and it&apos;s the
one humanities scholarship has always run on: read the source,
ground the claim, cite the page. Make that easier — much easier —
without changing what counts as a claim. Don&apos;t synthesise on the
researcher&apos;s behalf. Don&apos;t replace judgment. Amplify the corpus,
keep the interpretation human.

That&apos;s the orthogonal axis. We&apos;re betting it matters.</content:encoded><category>positioning</category><category>method</category><category>llm</category><category>research-workflow</category><author>Dipankar Sarkar</author></item><item><title>The citable-claim test</title><link>https://www.archeglyph.com/articles/the-citable-claim-test/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/the-citable-claim-test/</guid><description>A simple test for whether a research tool produces output you can defend in a footnote: can you, in one click, see the page the claim came from? If not, the tool is for exploration, not for scholarship.</description><pubDate>Sat, 09 May 2026 00:00:00 GMT</pubDate><content:encoded>There&apos;s a quiet test I run on every new &quot;AI for research&quot; tool that
crosses my desk, and it cuts through the marketing copy in about
thirty seconds. It&apos;s this:

&gt; *Pick a single sentence in the tool&apos;s output. Can you, in one
&gt; click, see the page in the original source the sentence came
&gt; from?*

If yes, the tool is a candidate for citable scholarship. If no — if
you have to manually re-search to verify it, or if the &quot;source link&quot;
points at a paragraph that doesn&apos;t quite say what the tool&apos;s
sentence claims — the tool is for exploration. Both are useful. They
are not interchangeable.

This is the test we built Archeglyph to pass.

## What the test actually checks

A footnote in academic writing makes a quiet but specific promise:
*if you go look at the cited source on the cited page, you will find
the thing I am attributing to it.* That&apos;s the contract scholarship
runs on. Peer review enforces it; tenure committees notice when it
breaks; entire careers turn on whether the contract holds.

The citable-claim test asks: does this tool let me make that
promise about anything I quote or summarise from its output?

Three failure modes are common in current &quot;AI research&quot; tooling:

**Failure 1 — The source exists but doesn&apos;t say it.**
The tool produces a sentence and footnotes it to a real document. You
click through. The document is real. The page exists. But the
sentence the tool composed isn&apos;t quite what the source says. It might
have shifted a tense, dropped a hedge, conflated two adjacent claims,
or attributed a quotation to the wrong speaker. You only catch it if
you read the source carefully — at which point the tool has saved
you no time.

**Failure 2 — The source link is to the document, not the page.**
The tool cites &quot;Smith 2018&quot;, but the document is 280 pages. To verify,
you read the whole thing or full-text search for the claim. Often
the search misses because the tool paraphrased.

**Failure 3 — There is no source.**
The tool produced a confident assertion with no citation at all.
This is the cleanest failure because it&apos;s obvious; it&apos;s also the
most common in casual chatbot output.

A tool that passes the citable-claim test avoids all three by
construction: it shows you fragments from the source instead of
generating new prose about them.

## Why this is a structural property, not a quality knob

You can&apos;t get a generative system to pass this test by being more
careful. You can lower its failure rate with better retrieval,
better prompting, better grounding — and the best deep-research
agents have lowered it considerably. But the structure is still:

```
sources → model → new prose → footnote → sources
```

There&apos;s a loss-of-fidelity step in the middle. The new prose is
*about* the sources; it isn&apos;t *from* them. Whether that loss is 5%
or 35% depends on the day, the question, and the model. It&apos;s never
zero, because the model&apos;s job is to write something new.

The structural alternative is:

```
sources → indexed fragments → fragments shown → click → source
```

No new prose in the middle. The tool&apos;s job is to make the existing
text findable, not to compose new text on top of it. The fragments
the researcher reads *are* the sources, sliced, indexed, surfaced
with context. You can quote them word-for-word; the audit trail is
trivially short.

This is why the citable-claim test discriminates so cleanly: it&apos;s
asking which structure the tool uses, and the structure cannot be
faked.

## What the test looks like in Archeglyph

Open a search result. The snippet you see is the actual chunk text
from the bundle&apos;s SQLite, with the matched terms wrapped in `&lt;em&gt;`.
Click it. You land on the review page for the document the chunk
came from, scrolled to the region. The bbox is highlighted on the
source image. There is a `ProvenanceBadge` showing the OCR engine
that read it and the date. If you don&apos;t trust the OCR, run it again
with a different engine on that one region; the system keeps the
audit trail.

Open a cluster. The exemplar quotations on the card are real chunks
from the corpus. Click one — same trip back to the source page.

Now write your footnote. *&quot;&lt;i&gt;Le Figaro&lt;/i&gt;, 11 January 1924, p. 3&quot;*
— exactly what the chunk&apos;s source-link gave you. The page is on
disk; you (and your reader) can return to it forever.

That&apos;s all the test asks. The tool either does this or it doesn&apos;t.

## Where this fits

Tools that pass the citable-claim test are good for the *writing*
end of research — the parts where your name goes on the claim. Tools
that fail are good for the *exploring* end — getting up to speed on
a topic, finding sources you didn&apos;t know existed, surfacing
unexpected angles.

A serious research workflow probably uses both. The mistake is
assuming they&apos;re substitutes. They&apos;re not. They&apos;re complementary
tools at different stages of the same workflow, and a tool that&apos;s
honest about which end it serves is a tool that respects how
scholarship actually works.

We built Archeglyph for the writing end on purpose. Every choice in
the pipeline — preserving the source image, recording the engine on
every region, keeping the cluster card text-first, refusing to add
a &quot;summarise this corpus&quot; button — is downstream of one commitment:
*everything you read in this tool, you can defend in a footnote*.

If that&apos;s the test you care about, you&apos;ll find Archeglyph passes
it. If you need a tool that synthesises, that&apos;s a different tool,
and we won&apos;t pretend to be it.</content:encoded><category>method</category><category>citation</category><category>epistemology</category><author>Maitrayee Roychoudhury</author></item><item><title>Why Archeglyph cannot hallucinate</title><link>https://www.archeglyph.com/articles/why-archeglyph-cannot-hallucinate/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/why-archeglyph-cannot-hallucinate/</guid><description>Hallucination is a property of generative systems. Archeglyph isn&apos;t one. Every line of text the system surfaces was already in the source corpus — and we can show you which page it came from.</description><pubDate>Thu, 07 May 2026 00:00:00 GMT</pubDate><content:encoded>A philologist recently asked us a sharper version of the question that&apos;s
quietly haunting every research tool right now: *&quot;How do I know your tool
isn&apos;t making things up?&quot;*

It&apos;s the right question to ask. The honest answer is short: **Archeglyph is
not a generative system, so it cannot hallucinate the things you read in
it.** The longer answer is worth writing down because it explains an
architectural choice we made very early, and it explains why that choice
is the reason we exist.

## What &quot;hallucination&quot; actually means

The word is used loosely. In the literature it has a specific shape:
a generative model produces content that is *fluent and plausible* but
unfaithful to its inputs — a quotation that was never said, a citation
that doesn&apos;t exist, a date that is off by twenty years and stated with
total confidence. The failure mode is intrinsic to how the system
works: a language model is trained to produce the next likely token,
not the next *true* token. Plausibility is the optimised target.
Truthfulness is, at best, correlated.

This is why retrieval-augmented generation, careful prompting, and
chain-of-thought tricks help but never close the gap. They lower the
hallucination rate. They don&apos;t change what kind of system you&apos;re using.

## Where Archeglyph&apos;s text comes from

Walk through the pipeline. At every step we can name the *source* of
the text on screen.

**1. The page image.** The starting point is a researcher-uploaded PDF
or image scan. The bytes don&apos;t change. The original is preserved in
object storage and re-downloadable forever.

**2. Region detection.** A vision model (or a CV fallback) draws boxes
on the page. The model&apos;s only output is *coordinates and a label*
(headline / body / caption / figure / table). It does not produce
text. If the model invents a region that isn&apos;t there, we crop air —
and the OCR step that follows produces empty text, which is easy
to notice.

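To make *coordinates and a label* concrete, here is the shape of that
output as a sketch (the field names are ours for illustration, not
Archeglyph&apos;s actual schema):

```
from dataclasses import dataclass

@dataclass
class Region:
    page_id: str
    bbox: tuple[int, int, int, int]  # x0, y0, x1, y1 in page-image pixels
    label: str                       # headline / body / caption / figure / table
    detector: str                    # engine id, recorded for provenance
```
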
**3. Text extraction.** Tesseract or a vision-language model is given
a single cropped region and asked: *&quot;Read what&apos;s on this image,
faithfully.&quot;* This is the only step where a model could plausibly
&quot;add&quot; text that wasn&apos;t there. We mitigate the risk three ways:

- The image *and* the extracted text are kept side-by-side in the
  review UI. Hover a region; the bbox highlights on the source page.
- Every region is stamped with the engine that produced its text and a
  confidence score.
- The dataset technique note (auto-generated, clearly labelled as
  such) tells the researcher how many regions were Tesseract-read
  versus VLM-read. A researcher can audit by sampling.

**4. Chunking, embedding, indexing.** These are deterministic
operations. `syntok` splits the extracted text on sentence boundaries.
A sentence-transformer turns each chunk into a vector. Tantivy
indexes the words for full-text search. None of these steps add
text. They make the existing text findable.

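A minimal sketch of this step with the named libraries, assuming
`region_text` holds one region&apos;s extracted text; the one-chunk-per-sentence
recipe here is a simplification of the configurable chunking parameters:

```
from syntok import segmenter
from sentence_transformers import SentenceTransformer

def chunk_text(extracted_text):
    # Deterministic sentence split; each sentence&apos;s original surface
    # form is reassembled from syntok&apos;s tokens.
    chunks = []
    for paragraph in segmenter.process(extracted_text):
        for sentence in paragraph:
            chunks.append(&apos;&apos;.join(t.spacing + t.value for t in sentence).strip())
    return chunks

chunks = chunk_text(region_text)
model = SentenceTransformer(&apos;all-MiniLM-L6-v2&apos;)  # or bge-small-en-v1.5
vectors = model.encode(chunks)  # one 384-dimensional vector per chunk
```
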
**5. Clustering.** HDBSCAN groups vectors. The output is *which chunk
is in which cluster*. There is no language generation here.

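Continuing the sketch with the `hdbscan` library (the
`min_cluster_size` value is illustrative, not Archeglyph&apos;s default):

```
import hdbscan

# Group the chunk vectors from the sketch above. fit_predict returns
# one integer label per chunk; -1 marks noise chunks in no cluster.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(vectors)

# Soft-membership probabilities, used to rank exemplar quotations.
probabilities = clusterer.probabilities_
```
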
**6. Cluster theme titles.** Yes, this step uses an LLM. The LLM is
given the top TF-IDF terms for a cluster plus a handful of sample
sentences, and asked to produce a four-to-six word label. The label
is shown with a `ProvenanceBadge` naming the model. If a researcher
doubts a label, they read the exemplars beneath it — which are real
quotations from the corpus, not LLM output.

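One way to compute those TF-IDF inputs, reusing `chunks` and `labels`
from the sketches above (a sketch of the idea, not Archeglyph&apos;s exact
term selection):

```
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Treat each cluster&apos;s concatenated chunks as one document, so terms
# distinctive to a cluster (not merely frequent) rank highest.
cluster_ids = sorted(c for c in set(labels) if c != -1)
cluster_docs = [
    &apos; &apos;.join(ch for ch, lab in zip(chunks, labels) if lab == c)
    for c in cluster_ids
]
vectoriser = TfidfVectorizer(stop_words=&apos;english&apos;, max_features=5000)
tfidf = vectoriser.fit_transform(cluster_docs)
terms = np.array(vectoriser.get_feature_names_out())

for cid, row in zip(cluster_ids, tfidf.toarray()):
    top_terms = terms[row.argsort()[::-1][:8]]
    print(cid, top_terms)  # these terms + samples go to the labelling LLM
```
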
**7. The dataset technique note.** Three to five sentences describing
how the dataset was processed. Generated by a small model from the
*known* engine choices and the *known* counts of files, regions, and
chunks. We cap its length, and if the model&apos;s output is missing or
malformed we fall back to a deterministic template. The note carries
a &quot;this summary is automatically generated&quot; caveat in every version.

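The deterministic fallback is the simplest piece of the pipeline:
something like this, where the `stats` field names are ours for
illustration:

```
def fallback_note(stats):
    # Deterministic template used when the model&apos;s output is missing
    # or malformed. Field names are illustrative, not the product&apos;s.
    return (
        f&apos;This dataset contains {stats[&quot;files&quot;]} files, segmented into &apos;
        f&apos;{stats[&quot;regions&quot;]} regions and {stats[&quot;chunks&quot;]} chunks. &apos;
        f&apos;{stats[&quot;tesseract_regions&quot;]} regions were read by Tesseract, &apos;
        f&apos;{stats[&quot;vlm_regions&quot;]} by a vision-language model. &apos;
        &apos;This summary is automatically generated.&apos;
    )
```
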
That is every model invocation in Archeglyph. None of them is asked
to *summarise the corpus*. None of them is asked to *answer a research
question*. None of them produces a paragraph that a researcher could
mistake for primary text.

## What we don&apos;t do

We don&apos;t have a chat interface. We don&apos;t have a &quot;summarise this
collection&quot; button. We don&apos;t have an &quot;ask a question of your archive&quot;
endpoint. Those are perfectly reasonable products to build — they&apos;re
just a different product. The research workflow they support is
*synthesis*. The research workflow we support is *reading*.

We made this call deliberately, and we don&apos;t expect to change it.
A tool that synthesises will always be liable to hallucinate, no
matter how careful the prompt engineering. Once a researcher has to
audit each generated sentence for fabrication, the tool has stopped
being a labour-saver and started being a liability.

## What you can verify

If you&apos;re evaluating Archeglyph, run this test:

1. Upload a page you know cold.
2. Watch the regions appear on the review screen. Open the bbox
   overlay. For each region, verify that the text is actually what&apos;s
   on the image at that location.
3. Run a search that you know should match. Verify every result is a
   real chunk from a real region.
4. Open the cluster browser. Pick a cluster. Click an exemplar. It
   takes you back to the source page, with the highlighted region.
5. Now try to find an unsupported claim in any of the surfaced text.
   You won&apos;t, because there isn&apos;t a step in the pipeline that could
   have produced one.

That&apos;s the audit. It scales.

## The promise, stated plainly

Archeglyph reads what is on the page. We disclose which model did the
reading. We index, group, and surface what was read. We don&apos;t write
anything new on top of it. When we *do* generate (cluster titles, the
note), we say so loudly and we keep it to under a hundred words.

This is the line we hold. Not because LLMs are bad — they&apos;re useful
for plenty of things — but because *citing what you read* is the
foundational act of scholarship, and we want to be a tool a
researcher can cite from without an audit trail of footnotes saying
&quot;the AI told me so&quot;.

If your work needs the corpus to mean what it says on the page,
Archeglyph is for you. If your work needs synthesis, we&apos;ll happily
recommend something else.</content:encoded><category>transparency</category><category>hallucination</category><category>method</category><author>Dipankar Sarkar</author></item><item><title>Reading clusters as a researcher</title><link>https://www.archeglyph.com/articles/reading-clusters-as-a-researcher/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/reading-clusters-as-a-researcher/</guid><description>The Archeglyph cluster view leads with quotations, not scatterplots. Here is how to use it — and why the scatterplot is behind a toggle.</description><pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate><content:encoded>If you have used a topic-modelling or embedding-clustering tool before, you have probably
seen the default view: a UMAP scatterplot with coloured dots, a sidebar of top terms per
cluster, and — if you are lucky — a list of document titles. This is a default designed for
someone debugging a clustering algorithm. It is not a default designed for someone reading a
corpus.

Archeglyph&apos;s cluster view reverses the priority. Each cluster is a card, and the card leads
with three things a researcher actually wants to see: a theme title, a one-sentence summary,
and three to six exemplar fragments rendered as readable quotations with source citations.
The scatterplot is behind a button called &quot;Advanced&quot;.

This article is about how to use the default view — and, for those inclined, when the
Advanced panel earns its place.

## What a card shows

A cluster card has the following anatomy, in order of visual weight:

1. **Theme title.** Four to six words. Generated by passing the cluster&apos;s top TF-IDF
   terms to a small text LLM, which polishes them into something readable. The card
   discloses which LLM wrote the title.
2. **One-sentence summary.** A short description — &quot;A group of 42 fragments discussing
   population movements between Europe and Asia Minor during the 1920s&quot; — produced by the
   same model from the exemplar fragments.
3. **Exemplar quotations.** Three to six fragments, each rendered as a block quote with its
   source document, page, and (where available) date. These are the highest-probability
   members of the cluster according to HDBSCAN&apos;s soft-clustering output.
4. **Size and link.** &quot;42 fragments · Open cluster →&quot; opens the *fragment neighbourhood* — a
   longer view of all members with a sentence of surrounding context on either side.

That is the whole default. There is no dot plot, no silhouette score, no outlier percentage.
Those numbers exist, and the Advanced toggle surfaces them, but the first read doesn&apos;t need
them.

## How to read a card

Our experience working with historians, philologists, and archivists on the newspapers prototype led to a short heuristic:

1. **Read the three quotations first, top to bottom.** Ignore the title; the title is a
   guess. The quotations are the data.
2. **Ask whether they feel like one group.** If yes, the cluster is doing useful work — even
   if the title is slightly off. If no, look at the outliers: sometimes one exemplar signals
   a sub-theme that was swept into the same bucket.
3. **Open the cluster.** The fragment neighbourhood shows all members with ±1 sentence of
   context. This is where the research actually happens. Skim the neighbourhood, flag
   fragments that feel adjacent but not central, and drop out to the document page for the
   ones that matter.
4. **Only then look at the title.** By this point you have your own sense of the cluster&apos;s
   shape. If the generated title fits, fine. If it doesn&apos;t, you can rename it, and the rename
   persists.

Notice how much of this is close reading. The clustering algorithm got the fragments into
roughly the same room; the philologist decides whether they are actually having the same
conversation. That division of labour is the whole point.

## What the Advanced toggle is for

There are three moments when the Advanced panel earns its place, and they have nothing to do
with the default reading loop:

- **When you suspect the clustering is wrong and want to know how wrong.** The probability
  histogram tells you whether a cluster&apos;s members are tightly bound or loosely attached.
  Loose clusters should be read sceptically.
- **When you are comparing two runs.** If you changed the embedding model or the HDBSCAN
  parameters, the UMAP projection lets you see at a glance whether the structure moved.
- **When you are teaching the algorithm to someone else.** The scatterplot is good pedagogy;
  it is mediocre daily bread.

For everything else, the numbers are noise. Our experience on the newspapers prototype was
that researchers who spent thirty minutes in a UMAP view ended up with a *worse* sense of
their corpus than researchers who spent thirty minutes reading exemplar quotations. The
geometric view feels authoritative in a way the quotations don&apos;t, and that authority is
misleading — distances in 2D UMAP space are not the distances the clustering algorithm used.

## What we don&apos;t do

A few things the cluster view deliberately omits, and why:

- **Word clouds.** They encode frequency as area, which the eye reads as importance. TF-IDF
  terms are already in the theme-title pipeline; that is enough.
- **Automatic cluster merging.** If two clusters are &quot;similar&quot; by some metric, the researcher
  — not the algorithm — decides whether to merge them. The tool proposes; the scholar
  disposes.
- **Sentiment or stance overlays.** Sentiment classifiers trained on 21st-century social
  media do poorly on 19th-century newspapers. We would rather ship no signal than a
  misleading one.

## What cluster IDs promise

When you re-ingest a dataset — add new documents, re-run extraction on a batch, change the
cluster parameters — the underlying clustering algorithm produces a fresh assignment. Naively
this would renumber every cluster, breaking any URL or note that references &quot;cluster #17&quot;.

Archeglyph stabilises cluster IDs via Hungarian matching against the previous assignment: a
cluster that has substantial overlap with a previous cluster keeps the previous ID. This
means saved cluster links survive incremental ingests. It also means a cluster whose
membership shifts dramatically — because, say, you added two hundred documents about a new
topic — will show up as a *new* cluster rather than hiding inside the old one.

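A sketch of that matching with `scipy`; the overlap measure and the 0.5 threshold are
illustrative assumptions, not the actual implementation:

```
import numpy as np
from scipy.optimize import linear_sum_assignment

def remap_cluster_ids(old, new, threshold=0.5):
    # old / new: {chunk_id: cluster_label}, with -1 marking noise.
    old_ids = sorted({c for c in old.values() if c != -1})
    new_ids = sorted({c for c in new.values() if c != -1})
    overlap = np.zeros((len(new_ids), len(old_ids)))
    for chunk, nc in new.items():
        oc = old.get(chunk, -1)
        if nc != -1 and oc != -1:
            overlap[new_ids.index(nc), old_ids.index(oc)] += 1

    # Hungarian assignment maximises total shared membership across the matching.
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {}
    for r, c in zip(rows, cols):
        size = sum(1 for v in new.values() if v == new_ids[r])
        if size and overlap[r, c] / size &gt;= threshold:
            mapping[new_ids[r]] = old_ids[c]  # substantial overlap: keep the old ID

    next_id = max(old_ids, default=-1) + 1
    for nc in new_ids:  # everything else is a genuinely new cluster
        if nc not in mapping:
            mapping[nc], next_id = next_id, next_id + 1
    return mapping
```
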
That stability is load-bearing. It lets an archivist or intellectual historian bookmark a cluster as they would
bookmark a chapter, and come back to it a month later without chasing a new number.

## The end state

The default view is not just an aesthetic choice. It is a bet that the closest thing
clustering tools have to an interface — the UMAP plot — was never the right one for the
humanities. A cluster is a reading unit. Make it look like one, and the tool recedes into
the background of the work, which is where research tools belong.</content:encoded><category>clustering</category><category>interpretation</category><category>ui</category><author>Dipankar</author></item><item><title>Exporting and archiving a dataset</title><link>https://www.archeglyph.com/guides/exporting-and-archiving-a-dataset/</link><guid isPermaLink="true">https://www.archeglyph.com/guides/exporting-and-archiving-a-dataset/</guid><description>A forward-looking but grounded walkthrough of Archeglyph&apos;s dataset snapshot: what goes into the tarball, how to open it without the product, and how to cite a snapshot in a paper.</description><pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate><content:encoded>import FaqSchema from &apos;../../components/seo/FaqSchema.astro&apos;;

Every study ends — a paper gets published, a grant closes, a postdoc moves institutions —
and the question becomes: *how do I keep the work in a form I can still use in five years,
without the tool that produced it?* Archeglyph&apos;s answer is the dataset snapshot, a single
tarball that bundles the catalogue, the lexical index, and the embedding store for one
dataset. This guide walks through how to create one, what is inside it, how to open it
without Archeglyph running, and how to cite it.

Some of the surfaces described below are still rolling out through M1. Where that is the
case the guide flags it.

## Creating a snapshot

From the dataset page, open the `⋯` menu on the header and choose `Export snapshot`. The
product computes the total size (it will tell you before you commit), asks you to confirm,
and produces a tarball named:

```
archeglyph-&lt;workspace&gt;-&lt;dataset-slug&gt;-&lt;YYYYMMDD-HHMM&gt;.tar.zst
```

The timestamp is UTC. The file is zstd-compressed; on a modern machine a dataset with tens
of thousands of documents typically lands in the low hundreds of megabytes.

While the export is running you can close the page — the job continues server-side and an
email with the download link arrives when it finishes. For datasets with tens of millions
of chunks the export can take several minutes; the job status surfaces in the dataset&apos;s
events feed the same way extraction jobs do.

## What is inside

Unpack the archive:

```
$ tar --zstd -xvf archeglyph-&lt;…&gt;.tar.zst
archeglyph-&lt;…&gt;/
├── README.txt
├── catalogue.sqlite
├── index.tantivy/
│   ├── meta.json
│   └── … segment files …
├── embeddings.zvec
├── settings.json
└── manifest.json
```

- **`catalogue.sqlite`** is a plain sqlite database containing the tables for documents,
  pages, regions, extracted text (with its engine provenance), edits, clusters, cluster
  memberships, and the settings that were active at snapshot time. You can open it in any
  sqlite browser; the schema is documented in `README.txt` and mirrors the tables described
  in the platform docs.
- **`index.tantivy/`** is the lexical search index, in tantivy&apos;s on-disk format. It can be
  opened by any tantivy 0.22+ reader; you do not need Archeglyph to query it.
- **`embeddings.zvec`** is the compressed embedding store, one vector per chunk plus a
  small metadata header (model id, dimension, chunking recipe). The zvec format is
  documented in its repository; a short Python reader script is bundled as
  `read_embeddings.py`.
- **`settings.json`** is a human-readable copy of the dataset&apos;s settings at the moment of
  export — engines, thresholds, chunking parameters. It is redundant with the sqlite
  catalogue but is present to make the snapshot legible without any database tooling.
- **`manifest.json`** lists every file, its SHA-256, and the snapshot schema version. Check
  the hashes after download if you intend to archive the tarball long-term.

The tarball does **not** contain the raw source images. It contains *references* — a
stable URL plus a SHA-256 — and a `rehydrate.sh` script that refetches the binaries from
the original object store. This is a licensing choice: many source archives grant
Archeglyph the right to process images but not to redistribute them. A future
`--with-images` flag will bundle the binaries for researchers whose sources are fully open.

## Opening a snapshot without Archeglyph

The design goal is that the snapshot opens with off-the-shelf tools. Three worked examples:

### Browse the catalogue in sqlite

```
$ sqlite3 catalogue.sqlite
sqlite&gt; .tables
documents   regions    texts    clusters   chunks   settings   engines
sqlite&gt; SELECT count(*) FROM chunks;
sqlite&gt; SELECT text FROM texts WHERE engine_id = &apos;qwen3-vl:235b-cloud&apos; LIMIT 5;
```

Every row carries the engine id that produced it; joining `texts` to `engines` gives you
the full provenance record in a single query.

### Search the lexical index from Python

```
from tantivy import Index

ix = Index.open(&apos;index.tantivy&apos;)
searcher = ix.searcher()
hits = searcher.search(ix.parse_query(&apos;wharves OR galata&apos;, [&apos;text&apos;]), limit=20)
for score, address in hits.hits:
    doc = searcher.doc(address)
    # tantivy-py returns field values as lists; take the first value
    print(score, doc[&apos;document_id&apos;][0], doc[&apos;page_no&apos;][0], doc[&apos;text&apos;][0][:80])
```

The tantivy Python bindings read Archeglyph&apos;s snapshot indexes directly; the field names
(`text`, `document_id`, `page_no`, `region_id`) are documented in `README.txt`.

### Load the embeddings

```
from zvec import read

store = read(&apos;embeddings.zvec&apos;)
print(store.metadata)  # {&apos;model&apos;: &apos;bge-small-en-v1.5&apos;, &apos;dim&apos;: 384, ...}
for chunk_id, vector in store:
    # use numpy, faiss, whatever
    ...
```

The embedding store carries enough metadata to reconstruct a search space without
Archeglyph; the model id is what lets you (or a future reader) know whether they can mix
these vectors with another corpus.

## Citing a snapshot

A snapshot is citable. The recommended format:

&gt; Author, *Dataset title*, Archeglyph snapshot `sha256:&lt;…&gt;` exported
&gt; `&lt;YYYY-MM-DD&gt;`, archived at `&lt;url-or-doi&gt;`.

The `manifest.json` contains a `snapshot_id` which is the SHA-256 of the concatenated file
hashes — that is the value to paste in the `sha256:` field. Two researchers with the same
`snapshot_id` are guaranteed to be looking at bit-identical data.

If you deposit the tarball in Zenodo or your institution&apos;s repository, Archeglyph will
accept the DOI on the dataset&apos;s settings page and show it on the dataset&apos;s landing card.
That feature lands in M1-D.

## Archiving versus re-importing

Two different verbs, two different use cases:

- **Archiving** — the tarball is the final form. You put it in a repository, you stop
  thinking about it. The three data artefacts inside are all openable with tools older than
  Archeglyph; whatever happens to us, the research artefact survives.
- **Re-importing** — the same tarball can be loaded back into Archeglyph (`⋯ → Import
  snapshot`) and becomes a new dataset in your workspace. The original snapshot is not
  mutated; re-imports are a fork, not a load. This is how a collaborator receives your
  study.

## Caveats we want to be honest about

- **Not every settings field is carried.** The snapshot preserves the engine selection, the
  chunking recipe, and the search configuration. Workspace-level things (billing, team
  membership, access policies) are intentionally not exported because they belong to a
  workspace, not a dataset.
- **Image rehydration depends on upstream availability.** If the source archive takes a
  document offline, the `rehydrate.sh` script will fail on that file. The extracted text,
  regions, index, and embeddings are untouched — you keep the scholarship, you lose the
  ability to redisplay the image.
- **Snapshot schema will version.** The format is at v1. Future versions will add fields,
  never remove them; a v1 reader will continue to open every snapshot produced today.

## A short checklist before you call it done

1. Download the tarball and verify the SHA-256s in `manifest.json` (see the sketch
   after this list).
2. Open `catalogue.sqlite` and confirm the document count matches what you expect.
3. Archive the tarball somewhere with an addressable URL (institutional repository,
   Zenodo, S3 bucket with public read).
4. If the study is published, paste the `snapshot_id` into the methods section and the DOI
   onto the dataset&apos;s settings page in Archeglyph so other readers can find it.

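For step 1, a minimal verification script. The `files` / `path` / `sha256` key names
are assumptions about the manifest layout; `snapshot_id` is the documented field:

```
import hashlib, json, pathlib

snapshot = pathlib.Path(&apos;archeglyph-&lt;…&gt;&apos;)  # the unpacked tarball directory
manifest = json.loads((snapshot / &apos;manifest.json&apos;).read_text())

# Assumed layout: a list of {path, sha256} entries plus snapshot_id.
for entry in manifest[&apos;files&apos;]:
    digest = hashlib.sha256((snapshot / entry[&apos;path&apos;]).read_bytes()).hexdigest()
    assert digest == entry[&apos;sha256&apos;], f&apos;hash mismatch: {entry[&quot;path&quot;]}&apos;

print(&apos;all file hashes verified&apos;)
print(&apos;snapshot_id:&apos;, manifest[&apos;snapshot_id&apos;])  # the value to cite
```
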
A snapshot is not the end of a dataset&apos;s life — it is the first moment it becomes a
citizen of the scholarly record rather than a row in our database. That is what we
built the format for.

&lt;FaqSchema items={[
  {
    question: &quot;Can I open an Archeglyph snapshot without running Archeglyph?&quot;,
    answer: &quot;Yes. The tarball contains a sqlite catalogue, a tantivy lexical index, and a zvec embedding store — all three are readable with their respective off-the-shelf libraries. The README inside the tarball points at the schema and the Python reader scripts we bundle.&quot;
  },
  {
    question: &quot;Does the snapshot include the source images?&quot;,
    answer: &quot;Not by default. The tarball carries stable URLs plus SHA-256 hashes for every source image and a rehydrate.sh script that re-downloads them from the original object store. A future --with-images flag will bundle binaries for researchers whose source material is fully open.&quot;
  },
  {
    question: &quot;How do I cite a specific snapshot?&quot;,
    answer: &quot;Cite the snapshot_id from manifest.json (a SHA-256 of the concatenated file hashes) alongside the export date. Two researchers quoting the same snapshot_id are guaranteed to be looking at bit-identical data.&quot;
  },
  {
    question: &quot;Can I re-import a snapshot to continue working on it?&quot;,
    answer: &quot;Yes. From the workspace menu, choose Import snapshot and upload the tarball. It loads as a new dataset; the original snapshot file is not mutated, so re-imports are forks rather than in-place loads.&quot;
  }
]} /&gt;</content:encoded><category>export</category><category>archiving</category><category>snapshots</category><category>how-to</category><author>Dipankar</author></item><item><title>Why we snapshot per dataset</title><link>https://www.archeglyph.com/articles/why-we-snapshot-per-dataset/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/why-we-snapshot-per-dataset/</guid><description>The product decision behind Archeglyph&apos;s dataset snapshot: one tarball that bundles a tantivy index, a zvec embedding store, and a sqlite catalogue — why the three belong together and why the unit is the dataset, not the document.</description><pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate><content:encoded>If you asked any three digital humanities tools to export &quot;everything about this corpus
right now, in a form I can archive,&quot; you would get three different answers and none of them
would round-trip. One tool would give you a folder of PDFs and shrug at the indexes. Another
would give you a Postgres dump with references to an Elasticsearch cluster you no longer
have. A third would give you a vendor-specific archive that needs the vendor&apos;s runtime to
open.

Archeglyph&apos;s answer is a single tarball per dataset, containing three files: a tantivy
lexical index, a zvec embedding store, and a sqlite catalogue. That choice — one unit, three
files, the dataset as the atom — took real work to converge on, and it is worth writing down
why.

## What a dataset is

A dataset in Archeglyph is a bounded collection of source pages a researcher has decided to
study together: a newspaper run, the plates from one expedition, one archive&apos;s photographs
of a monastery. It has a stable slug, a settings page that pins the engines used for its
pipeline, and a set of documents whose regions, text, clusters, and search indexes are all
derived from those engines. The dataset is the unit a researcher talks about at a
conference. It is also the unit we need to be able to hand back to them intact.

The corollary is that the dataset is *not* the document. A document snapshot that did not
carry its embedding space would lose the ability to search. A document snapshot that carried
an embedding space but not the index would ship a vector blob no one can query. The dataset
is the smallest scope at which the artefacts still compose into a usable tool.

## Why three files, and why these three

Each file in the tarball covers one mode of access and is lossless on its own:

- **The sqlite catalogue** is the system of record: documents, pages, regions, engine
  provenance, edit history, cluster membership, settings at time of snapshot. It is plain
  SQL, opens in any sqlite browser, and is the thing an archivist can read in ten years
  without us.
- **The tantivy index** is the lexical search layer. It is derived data — it can be rebuilt
  from the catalogue — but rebuilding it is minutes to hours, and a snapshot without it is
  noticeably worse to open. Ship it.
- **The zvec embedding store** holds the chunk embeddings plus their metadata (model id,
  dimension, chunking recipe). Like the index, it is derivable; unlike the index, rebuilding
  requires access to the embedding model, which may have been retired or paywalled by the
  time a snapshot is re-opened. Shipping the vectors is how you guarantee the semantic
  search still works years later.

There are tempting simplifications we rejected. A single sqlite file with vectors stored as
blobs would be convenient but would forfeit zvec&apos;s per-chunk compression and the ability to
load vectors without paging the whole DB. A single portable archive built on Parquet would
be elegant but we would have to re-implement the reader side of tantivy. The three-file
shape is a compromise: each file is authored by a battle-tested library, and the tarball is
what makes them feel like one object.

## Why the unit is the dataset

There is a pull, always, to snapshot at a coarser or finer grain:

- **Coarser: snapshot the whole workspace.** Attractive because it would be one button. Not
  what researchers want. A workspace often mixes a finished study with half-cooked
  exploratory corpora; the finished one needs citation-stable archiving, the exploratory
  ones don&apos;t, and bundling them conflates two lifecycles. A workspace snapshot also
  multiplies the size of every archive by a factor that has nothing to do with the
  scholarship.
- **Finer: snapshot one document.** Attractive because the individual page is the atomic
  image. Not useful on its own: the vector space a document&apos;s chunks live in is shared across
  the dataset, so a single-document snapshot either ships the whole embedding store (wasteful)
  or ships only that document&apos;s vectors (which cannot be searched without the rest). Either
  way the snapshot is no longer composable.

The dataset sits exactly where the scholarly unit and the technical unit agree. That is why
it is the snapshot.

## Operational consequences

Committing to a dataset snapshot format shaped parts of the product that look unrelated:

1. **Settings are copied into the snapshot.** The sqlite catalogue carries the engine
   selections active at the moment of the snapshot — not just the engine names, but their
   versions. Opening an older snapshot displays a settings banner that makes clear this is
   what the dataset was extracted with, even if the workspace has since moved on.
2. **Re-runs are idempotent within the snapshot.** Because the catalogue stores every
   re-run as a new row with its own provenance, a snapshot can be re-extracted selectively
   and the new rows either merge into a fresh snapshot or split off into a derived dataset.
   We did not want to teach ourselves two different &quot;which row is canonical&quot; rules.
3. **The tarball is the export format, full stop.** There is no JSON export, no CSV export,
   no &quot;lite&quot; mode. Every export is this tarball. Researchers get a format that round-trips
   back into the product; archivists get a format that opens without the product; and we get
   one thing to maintain instead of five.

## What we are deferring

The snapshot format does not yet carry the raw source images. That is deliberate — images
are large, often restricted by the source archive&apos;s license, and already live in our object
store with their own retention policy. A snapshot currently carries image *references* (a
stable URL plus a SHA-256) and a helper script that re-fetches them from the original
repository when the snapshot is opened. A future `--with-images` flag will bundle the
binaries for researchers whose source is fully open. We would rather ship the lean tarball
now than block on the harder legal question.

## Why this belongs in an article

Infrastructure choices usually hide inside release notes. We surface this one because the
snapshot format is a promise to researchers: *the work you do in Archeglyph is yours, and
the form in which you take it away is simple enough to still make sense after we&apos;re gone.*
That promise is only real if we explain what it looks like.</content:encoded><category>architecture</category><category>product</category><category>snapshots</category><author>Dipankar</author></item><item><title>Choosing an embedding model for digital humanities</title><link>https://www.archeglyph.com/articles/choosing-an-embedding-model-for-dh/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/choosing-an-embedding-model-for-dh/</guid><description>A practical comparison of MiniLM-L6-v2 and BGE-small-en-v1.5 for DH corpora: what each optimises for, when the extra dimensions earn their keep, and how to decide without running a benchmark you cannot reproduce.</description><pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate><content:encoded>A researcher opening Archeglyph for the first time sees two options under *Embedding model*:
`all-MiniLM-L6-v2` and `bge-small-en-v1.5`. Neither label is self-explanatory; neither
choice is obviously wrong. This article is the long-form version of the hover tooltip, for
the researcher who wants to make the choice with their eyes open.

We are not going to cite benchmarks. The published MTEB numbers are useful as an orientation
for engineers, but re-running them against a 1901 Ottoman-Greek newspaper or a set of
colonial-era expedition plates is not a thing any of us have the budget to do honestly.
What we can offer is a description of what each model optimises for, the operational
consequences of picking one, and the heuristics we use when the researcher asks us.

## What &quot;embedding model&quot; means in Archeglyph

An embedding model turns a chunk of text — here, a passage of a few sentences drawn from an
extracted region — into a fixed-length numeric vector. Vectors whose cosine similarity is
high are, in theory, about similar things. Archeglyph uses those vectors for two jobs:

1. **Semantic search.** The researcher types a query, the product embeds it, and ranks
   chunks by similarity.
2. **Clustering.** Chunks that land near each other in vector space form a candidate
   cluster; the theme-writing LLM is given the top-TF-IDF terms from that cluster and asked
   for a 4-6 word title.

Both uses depend on the vector space being coherent for the *kind of text* in the dataset.
That is the axis on which these two models differ in practice.
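
To make the two jobs concrete, here is a minimal sketch of both against the
`sentence-transformers` package. The chunk texts and the query are invented for
illustration; Archeglyph runs the equivalent operations behind the UI.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(&quot;all-MiniLM-L6-v2&quot;)

chunks = [
    &quot;The steamer arrived at the Galata wharves before dawn.&quot;,
    &quot;Customs receipts for March were published in the gazette.&quot;,
]

# Normalised embeddings make cosine similarity a plain dot product.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# Job 1: semantic search. Embed the query, rank chunks by similarity.
query_vec = model.encode([&quot;shipping traffic at Galata&quot;], normalize_embeddings=True)
scores = chunk_vecs @ query_vec.T        # one cosine score per chunk
ranked = scores.ravel().argsort()[::-1]  # best match first

# Job 2: clustering starts from the same chunk_vecs (HDBSCAN runs over them).
```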

## What MiniLM-L6-v2 is

`all-MiniLM-L6-v2` is a 384-dimensional model distilled from a larger MiniLM, trained on a
broad mix of general English Q&amp;A and paraphrase data. It is small, fast, and has been the
default &quot;try this first&quot; open embedding for several years. For Archeglyph it has three
practical virtues:

- **Low footprint.** 384-dimensional vectors compress well in zvec; a dataset of a million
  chunks fits in memory on a single modest server.
- **Fast embedding.** On CPU it will out-throughput most alternatives. On a machine without
  a GPU, this is the difference between waiting an hour for a dataset to embed and waiting a
  shift.
- **Long production history.** Its failure modes are well documented; when a cluster looks
  odd with MiniLM, there is usually a named reason.

What it is not especially good at: domain-shifted English, archaic spellings, multilingual
content, and sentences where the interesting signal is a small number of proper nouns
(place names, ship names, officer names). In those regimes it will still produce a vector,
but the vector will often cluster on surface features (sentence length, function-word mix)
rather than what the researcher cares about.

## What BGE-small-en-v1.5 is

`bge-small-en-v1.5` is a 384-dimensional model from the BGE family, trained with an explicit
instruction-tuning objective on retrieval pairs. It is the same size as MiniLM and embeds
at roughly comparable cost. The interesting differences show up qualitatively:

- **Retrieval-shaped training.** BGE was trained to make query-document pairs close and
  negatives far; MiniLM was trained more broadly. For Archeglyph&apos;s two use cases (search,
  then cluster-as-a-form-of-search), that objective is on-target.
- **Better handling of named entities.** In internal dogfooding on a 1900s newspaper
  corpus, BGE&apos;s top-k search results for a proper-noun query (`&quot;wharves of Galata&quot;`) more
  consistently surface the *narrative* contexts around that phrase rather than other
  sentences of similar shape. We do not have a publishable benchmark for this; we mention it
  as an intuition to keep.
- **Instruction prefix.** BGE expects a short prefix on query embeddings (e.g.
  `&quot;Represent this sentence for retrieval: &quot;`). Archeglyph applies this automatically — if
  you switch to BGE, the query side of the pipeline is handled. You do not need to think
  about it; the sketch after this list shows what that handling amounts to.
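
What &quot;applied automatically&quot; means, as a minimal sketch. The exact prefix string ships
with the model card (the one below is the wording quoted above), and the variable names
are ours, not Archeglyph&apos;s internals:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(&quot;BAAI/bge-small-en-v1.5&quot;)

# The asymmetry is the point: documents are embedded bare,
# queries get the instruction prefix.
QUERY_PREFIX = &quot;Represent this sentence for retrieval: &quot;

passages = [&quot;The steamer arrived at the Galata wharves before dawn.&quot;]
query = &quot;wharves of Galata&quot;

doc_vecs = model.encode(passages, normalize_embeddings=True)
query_vec = model.encode([QUERY_PREFIX + query], normalize_embeddings=True)
scores = doc_vecs @ query_vec.T
```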

What it is not: multilingual. `bge-small-en-v1.5` is English-tuned. For Ottoman-Turkish,
Italian, French, Greek, or Arabic sources, neither of these two models is ideal; the
researcher should pick whichever they judge less bad and plan on the cross-language failure
modes. A future Archeglyph release will surface BGE-M3 or the multilingual E5 family as
third and fourth options for exactly this reason.

## The heuristic we give researchers

A decision tree that does not require benchmarks:

- **English-only corpus, CPU-bound infrastructure, dataset &gt; 500k chunks** → start with
  MiniLM. The embedding pass is cheap and the search quality is &quot;good enough&quot; for the first
  exploratory read.
- **English-dominant corpus, quality matters more than throughput, GPU available** → start
  with BGE. The improvement is perceptible in the top-10 search results on the kinds of
  queries DH researchers actually type.
- **Mixed-language or heavily archaic corpus** → either, with the awareness that whichever
  you pick, you are going to see cross-language leakage. Consider using Archeglyph&apos;s cluster
  view as the primary reading surface rather than search, because clustering is slightly
  more forgiving of a noisy vector space than pinpoint retrieval.
- **Actively comparing models** → embed the dataset twice. Archeglyph&apos;s snapshots carry the
  embedding model id per chunk, so a dataset can live in the workspace with two embedding
  spaces and the provenance badge will keep them straight. This is the honest way to compare
  on *your* corpus; it is also the only way that yields a defensible answer.

## Operational notes

- Switching embedding model on a live dataset re-embeds all chunks and rebuilds the index.
  The settings page surfaces this as a rebuild step with an estimated time before you
  confirm. A researcher should expect minutes per thousand chunks on CPU, seconds on a
  modern GPU.
- The search result UI discloses the embedding model on hover. If you switched models
  mid-study, this is how you will notice that *this* result came from the old space.
- Clustering is not invariant across models. A dataset clustered under MiniLM and then
  re-clustered under BGE will *not* produce the same clusters, or even the same number of
  clusters; treat them as two separate analytic frames, not two views of one truth.

## The honest caveat

Picking an embedding model is one of several places in a DH pipeline where the default
should usually be *try one, read, try the other, read again*. We have shipped two defaults
because shipping zero is not useful and shipping ten is paralysing. The right reading of
this article is not *&quot;one of these is better&quot;* but *&quot;these are the two we shipped, here is
how they differ, and here is how Archeglyph helps you tell the difference on your own
corpus.&quot;* The scholarship is still yours.</content:encoded><category>embeddings</category><category>semantic-search</category><category>practical</category><author>Dipankar</author></item><item><title>Reviewing a noisy scan</title><link>https://www.archeglyph.com/guides/reviewing-a-noisy-scan/</link><guid isPermaLink="true">https://www.archeglyph.com/guides/reviewing-a-noisy-scan/</guid><description>A walkthrough of the review screen on a low-quality scan: what to look for, how to read the confidence tint, and when to re-run a region — or the whole page — with a VLM instead.</description><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate><content:encoded>import FaqSchema from &apos;../../components/seo/FaqSchema.astro&apos;;

Eventually every digital humanities pipeline meets the scan it cannot quite read. Paper
that was already foxed before 1950, microfilm that was printed hot, a colonial-era plate
whose register drifted during capture — these documents are the reason a reviewer seat
exists at all. This guide walks through how Archeglyph&apos;s review screen handles a bad scan
and how to decide, region by region, whether to accept, edit, or re-run.

## Before you open the review screen

Open the dataset&apos;s Settings page and check the extraction engine. If the dataset was
extracted with Tesseract and you are about to triage a batch of scans you know are noisy,
you have two options: leave the default and fix regions individually on the review screen
(cheap but slow), or switch the default to a VLM for the whole dataset (expensive but
systematic). This guide assumes the first — you&apos;re keeping the default and fixing the worst
offenders on a per-region basis.

## The review screen, at a glance

When you open a document, the review screen splits into two columns:

- **Left: the scan.** The source image with layout regions overlaid as bounding boxes.
  Hover any box and the corresponding text card on the right scrolls into view and tints.
  Click a box to activate it.
- **Right: the cards.** One card per region, in reading order, with the extracted text in a
  textarea, the provenance badge below, and an `Accept` button. Regions the extractor
  flagged low-confidence render in a warn-orange tint; high-confidence regions stay muted
  ink.

On a clean scan almost every card is muted; you skim, accept, move on. On a noisy scan the
column of warn-orange tints is what you will notice first.

## Reading the signals

Three signals together tell you whether a card needs work:

1. **Card tint.** Warn-orange = the extractor&apos;s own confidence score dropped below 65%.
   Muted ink = the extractor thinks it got this one.
2. **Region shape on the image.** Layout regions that overlap, clip through a fold, or run
   at an angle are a layout-assessment failure, not an extraction failure — re-running the
   text engine won&apos;t help.
3. **The text itself.** Look for the failure patterns: run-together words, characters
   replaced with punctuation (`d1e` instead of `die`), lines that start mid-word because
   the layout pass missed a break.

A region with all three signals lit (orange tint, odd bbox, garbled text) is almost
certainly a full-page candidate for re-running with a VLM. A region with only one signal lit
(say, orange tint but reasonable-looking text) is usually fixable inline.

## The keyboard rhythm

The review screen is designed to be operated from the keyboard. The essential four:

- `j` / `k` — move between regions.
- `e` — edit the focused region&apos;s text (focuses the textarea).
- `Enter` — accept the focused region.
- `r` — open the region re-run popover.

On a noisy scan, the rhythm becomes: `j j j`, stop on an orange card, press `e`, fix the
text, press `Enter`, continue. After a few pages you stop thinking about the keys.

## When to re-run a region

Press `r` on a focused region. The popover offers two tabs (OCR, VLM) and a short list of
available engines. The rules of thumb:

- **The text is garbled but the bbox is right** → re-run with a better OCR engine first. If
  the dataset&apos;s default is Tesseract and you have a cloud VLM configured, try the VLM
  anyway; on small regions the cost is negligible.
- **The region is a caption, a figure label, or a stamp** → VLMs read these better than
  Tesseract in almost all cases. Re-run with a VLM and accept the result.
- **The region is a column of a table** → neither engine is reliable on table cells in M0.
  Re-running does not help; correct inline or mark the region for a later pass.

Each re-run produces a new row in the region&apos;s history with its own provenance badge. The
previous row is not lost — the row-history disclosure on the left edge of the card shows
every attempt, and you can swap back if the re-run was worse.

## When to re-run the whole document

If more than roughly a third of a document&apos;s regions are orange, a per-region approach will
cost more reviewer time than a single document-level re-run. Open the right-pane &quot;Re-run
full document from…&quot; control, pick the extraction stage, and choose a VLM override. This
replaces the extraction outputs for all regions at once and leaves the layout assessment
intact (unless you also pick the `assess` stage).

Rule of thumb: document-level re-runs are worth it when you expect to accept most of the
new output. If you already know three-quarters of the page will need manual edits either
way, save the cloud call and fix inline.

## When to give up and re-scan

There is a scan quality below which no pipeline will help you. If the layout pass produces
overlapping bboxes that slice through columns, if regions disappear entirely on certain
pages, if the VLM comes back with plausible-looking prose that does not match the image —
the document is below threshold. Flag it with a review note (the textarea supports a
`[[rescan]]` tag that surfaces on the dataset&apos;s documents table) and move on. Archeglyph
does not pretend that a better model will rescue a photograph of a ruined page.

## A suggested workflow on a tough batch

1. Open the first document. `j` through every region without editing. Note how many orange
   cards you see per page.
2. If the ratio is low (&lt; 15%), fix regions inline as you go.
3. If the ratio is high (&gt; 30%), exit to the dataset level and re-run extraction on the
   whole batch with a VLM override. Come back to review fresh.
4. For regions where the new extraction is still wrong, edit the text inline and accept it
   rather than re-running a third time. At that point, you are the arbiter.

The review screen is designed around the assumption that a researcher&apos;s time is the most
expensive thing in the pipeline. Use it for judgement, not for data entry.

&lt;FaqSchema items={[
  {
    question: &quot;What does the warn-orange tint on a region card mean?&quot;,
    answer: &quot;The extraction engine&apos;s confidence score for that region fell below 65%. The tint is a signal that the region is worth pausing on; it does not mean the text is wrong, only that the engine was less sure of itself than it usually is.&quot;
  },
  {
    question: &quot;Will re-running a region lose the previous text?&quot;,
    answer: &quot;No. Every re-run produces a new row in the region&apos;s history with its own provenance badge. The previous row is still available from the row-history disclosure on the left edge of the region card, and you can swap back if the re-run was worse.&quot;
  },
  {
    question: &quot;When should I re-run a whole document instead of individual regions?&quot;,
    answer: &quot;As a rule of thumb, if more than roughly a third of a document&apos;s regions are flagged low-confidence, a document-level re-run with a VLM override is cheaper than fixing each region by hand. Below that ratio, per-region re-runs and inline edits are usually faster.&quot;
  },
  {
    question: &quot;Does Archeglyph re-run layout assessment when I re-run extraction?&quot;,
    answer: &quot;No. Re-running extraction leaves the layout regions untouched. If the problem is that regions are clipped, overlap columns, or miss a page break, you need to re-run from the assess stage — not the extract stage — on the full-document re-run control.&quot;
  }
]} /&gt;</content:encoded><category>review</category><category>ocr</category><category>vlm</category><category>how-to</category><author>Dipankar</author></item><item><title>VLM vs OCR: when to pick what</title><link>https://www.archeglyph.com/articles/vlm-vs-ocr-when-to-pick-what/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/vlm-vs-ocr-when-to-pick-what/</guid><description>Notes from the newspapers prototype on when Tesseract is still the right choice, when a vision-language model earns its cost, and how to tell the difference before a full run.</description><pubDate>Sat, 25 Apr 2026 00:00:00 GMT</pubDate><content:encoded>A common framing in the digital humanities community right now is that vision-language
models have made OCR obsolete. This is not what we found on the newspapers prototype. What
we found, roughly, is that each engine has a regime where it is straightforwardly the better
tool, and a middle regime where the choice depends on what you are going to do with the text
afterwards. This article is our attempt to describe those regimes concretely enough that you
can make the call on your own corpus.

Everything below is from our experience running a few thousand pages of archival
newspapers through both pipelines and hand-checking the outputs. It is not a benchmark
paper. Treat it as folklore from one project that we found held up.

## Where Tesseract still wins

Tesseract — by which we mean a recent Tesseract 5 with LSTM and the right language packs —
is, on our corpus, strictly better for:

- **Clean, high-resolution print.** 300+ dpi scans of 20th-century typeset text. The
  character accuracy on well-aligned Latin-script print is remarkably good, and Tesseract
  is fast and predictable in its failures.
- **Heavy throughput.** A page of newspaper text extracts in under a second on a modern
  CPU. A VLM run on the same page takes 10-60 seconds and a real amount of money. When the
  corpus is large and the downstream task is lexical search, the speed and cost ratio
  dominates.
- **Cases where you will post-process.** Tesseract&apos;s errors are *consistent*. It mis-reads
  the same letter-pair the same way across a page. That consistency is a gift for
  deduplication, lexical normalisation, and any downstream pipeline that can correct
  systematic errors in bulk.

On our newspapers corpus, Tesseract hit character accuracy above 98% on a sample of
well-scanned 1920s broadsheet pages, and the errors it did make were almost entirely in a
fixed set of confusions (`cl` ↔ `d`, `rn` ↔ `m`, `in` ↔ `m`).
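
That fixed confusion set is what makes bulk post-correction feasible. A minimal sketch of
the idea (not Archeglyph code), assuming a `lexicon` of known-good words built from a
wordlist or from the high-frequency tokens of the clean part of the corpus:

```python
CONFUSIONS = [(&quot;cl&quot;, &quot;d&quot;), (&quot;rn&quot;, &quot;m&quot;), (&quot;in&quot;, &quot;m&quot;)]

def correct_token(token: str, lexicon: set[str]) -&gt; str:
    &quot;&quot;&quot;Repair one OCR token using the known confusion pairs.

    Substitute only when the original is *not* a known word and the
    repaired form *is*; that guard is what keeps bulk correction safe.
    The pairs are bidirectional in practice; one direction shown here.
    &quot;&quot;&quot;
    if token.lower() in lexicon:
        return token
    for bad, good in CONFUSIONS:
        repaired = token.replace(bad, good)
        if repaired != token and repaired.lower() in lexicon:
            return repaired
    return token

# correct_token(&quot;clie&quot;, {&quot;die&quot;}) returns &quot;die&quot;
```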

## Where a VLM earns its cost

A vision-language model — in our case, various Ollama Cloud models that accept a region crop
and return text — is straightforwardly the better tool for:

- **Degraded scans.** Faded print, show-through from the reverse page, heavy staining, tight
  gutters. A VLM&apos;s language prior lets it read around damage that Tesseract refuses to
  touch.
- **Non-Latin scripts with limited training data.** We had a small set of Ottoman-Turkish
  pages. Tesseract&apos;s Ottoman language pack is workable but the VLM&apos;s Arabic-script handling
  was noticeably better — particularly on ligatures and diacritics.
- **Handwriting.** Tesseract is not a handwriting engine. There are specialised
  handwriting models; for mixed print/handwriting pages, a VLM is the pragmatic path.
- **Mixed content.** Pages with figures, tables, and running text intermixed — where the
  layout model has already produced a bbox but the bbox contents are heterogeneous. The
  VLM&apos;s &quot;just describe what&apos;s in this crop&quot; tolerance handles these better.

The cost side is real. On a mid-sized VLM, per-page extraction at hosted rates runs roughly
ten to a hundred times the operational cost of Tesseract on a CPU. For a 10,000-page
project, that is the difference between &quot;run it tonight&quot; and &quot;budget for a quarter.&quot;

## The middle regime

Many corpora sit in a regime where either engine could plausibly work. In that regime the
choice depends on what you will do next:

- **Planning to do lexical search and snippet retrieval?** Prefer Tesseract. Its
  consistent errors are easy to account for in a BM25-style index, and you will want the
  throughput.
- **Planning to do semantic search or clustering?** The choice is more subtle. Embedding
  models are surprisingly robust to moderate OCR noise — MiniLM still produces sensible
  cosine similarities on text that is 85-90% character-accurate. But once errors pass a
  threshold, clustering degrades: the fragments that end up in a cluster start including
  passages that share *misreading patterns* rather than *topics*. If you are seeing this on
  your own corpus (the tell is a cluster whose exemplars share an odd letter-confusion), a
  VLM run on the degraded pages will almost always tighten the clusters.
- **Planning to publish the extracted text as a resource?** Prefer the VLM. The bar for
  published text is higher than the bar for internal search, and the VLM&apos;s error modes are
  less systematic — where it fails, it tends to produce readable (if wrong) text rather
  than gibberish.

## A concrete check before committing

If you are unsure which engine to pick for a new corpus, Archeglyph makes this check cheap:

1. Upload 20 pages spanning the visual range of the corpus — a clean page, a damaged page,
   a page with unusual layout, a page in a less-familiar script.
2. Run extraction with Tesseract.
3. On those same 20 pages, re-run extraction per region with a VLM.
4. Open the review screen and scan the two outputs side by side.

Because both extractions are stamped with their engine in the `ProvenanceBadge`, you can
quickly see where they agree and where they diverge. Twenty pages is enough to form an
opinion; on our corpus, the kinds of regions where the engines disagreed on a 20-page
sample predicted where they disagreed at the full 5,000-page scale almost exactly.

## The hybrid strategy

The answer for large, heterogeneous corpora is usually neither pure-Tesseract nor pure-VLM.
It is a hybrid:

- Run Tesseract as the default on every region. It is fast and cheap.
- Use the VLM as a targeted re-run for regions flagged as low-confidence by Tesseract (low
  word count, low mean per-character confidence, high symbol-to-letter ratio).
- Expose both outputs in the review screen and let the researcher accept either, or edit
  in place.

Archeglyph supports this out of the box: per-region re-run with a different engine is a
first-class operation, the pipeline fingerprints each stage so re-runs skip unchanged work,
and the provenance badge keeps both outputs attributable.
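
For concreteness, the flagging heuristic can be sketched in a few lines. The threshold
values below are invented for illustration; the three signals are the ones listed above:

```python
def flag_for_vlm_rerun(text: str, mean_char_conf: float,
                       min_words: int = 3,
                       min_conf: float = 0.65,
                       max_symbol_ratio: float = 0.25) -&gt; bool:
    &quot;&quot;&quot;True when a Tesseract region looks bad enough to justify a VLM call.&quot;&quot;&quot;
    word_count = len(text.split())
    letters = sum(ch.isalpha() for ch in text)
    symbols = sum((not ch.isalnum()) and (not ch.isspace()) for ch in text)
    symbol_ratio = symbols / max(letters, 1)
    return (word_count &lt; min_words
            or mean_char_conf &lt; min_conf
            or symbol_ratio &gt; max_symbol_ratio)
```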

## The thing we got wrong

We built the newspapers prototype assuming VLM extraction would replace Tesseract wherever
we could afford it. On the first large run we found two things we did not expect:

- **VLM errors are less legible.** When a VLM mis-reads a word, the misreading is often a
  plausible other word — &quot;Galata&quot; becomes &quot;Golata&quot; becomes, a paragraph later, &quot;Gorata&quot;.
  Tesseract&apos;s errors look like OCR errors and are easy to spot. VLM errors look like
  paraphrases and are not.
- **VLMs hallucinate structure.** Given a crop that contains a half-visible column on one
  side, the VLM will sometimes confidently extract text from the half-visible column as if
  it were fully present. Tesseract, in the same situation, produces garbage that the
  reviewer can see is garbage.

Both of these argued for keeping Tesseract as the default and using the VLM as a targeted
tool. We still think that is the right default for most humanities corpora, and it is the
default Archeglyph ships with.</content:encoded><category>extraction</category><category>ocr</category><category>vlm</category><author>Dipankar</author></item><item><title>OCR vs VLM: a practical chooser</title><link>https://www.archeglyph.com/guides/ocr-vs-vlm/</link><guid isPermaLink="true">https://www.archeglyph.com/guides/ocr-vs-vlm/</guid><description>A short, decision-oriented guide to picking the right extraction engine for your corpus. When Tesseract is the right default, when a VLM is worth the cost, and how to test the choice cheaply.</description><pubDate>Thu, 23 Apr 2026 00:00:00 GMT</pubDate><content:encoded>This is a decision guide. If you want the reasoning behind it, read the companion article
[VLM vs OCR: when to pick what](/articles/vlm-vs-ocr-when-to-pick-what). If you want to just
decide what to set your dataset&apos;s `extract_engine` to, start here.

## The one-line answer

**Default to Tesseract; escalate to a VLM per-region when the output looks wrong.** This
handles almost every corpus we have seen.

The remainder of this guide is a more nuanced version of that same answer, for the cases
where the default isn&apos;t right.

## Pick Tesseract if

Any of these is true of your corpus:

- Printed text, typeset, post-1900.
- Scans at 300 dpi or better.
- Latin script, or a well-supported non-Latin script with a Tesseract language pack
  (Greek, Cyrillic, Arabic with `ara`, and so on).
- Your downstream use is lexical search or surveying, not publication of the extracted
  text.
- Your corpus is large enough that VLM per-page cost becomes a budget question.

Tesseract will produce good text quickly, the errors will be consistent, and you will have
headroom to re-run troublesome pages with a VLM individually.

## Pick a VLM if

Any of these is true, and especially if more than one is:

- Heavy degradation: staining, bleed-through, torn edges, uneven exposure.
- Low-resolution scans (below ~200 dpi).
- Handwriting or mixed print/handwriting.
- Non-Latin scripts with limited Tesseract support (historical Ottoman, older scripts, or
  very stylised typography).
- Your downstream use is publication of the extracted text as a resource, where the error
  bar matters.
- The corpus is small enough that per-page VLM cost is affordable.

Pick the smallest VLM on the Ollama Cloud list that works on a sample. Larger VLMs cost
more and are not always more accurate on extraction — some of them over-correct text in
ways you may not want.

## The hybrid default

Many corpora benefit from a hybrid approach:

- **Dataset default: Tesseract.** Runs on every region.
- **Per-document override: a VLM**, used when Tesseract output looks wrong on that document.
- **Per-region re-run: available from the provenance badge** in the review screen.

Archeglyph supports all three levels directly. No custom pipeline code is needed.

## How to test cheaply before committing

Before setting the extraction engine for a large dataset, run this 20-minute check:

1. **Pick a representative subset.** Twenty pages that span the visual range of your corpus
   — one clean page, one damaged page, one with unusual layout, one in the less-familiar
   script if your corpus has more than one.
2. **Upload the subset** as a fresh dataset with Tesseract as the default.
3. **Skim the review screen** for each page. Note the regions that look wrong.
4. **Re-run those regions** from the provenance badge with a VLM of your choice.
5. **Compare side by side.** The review screen will show both outputs attributed to their
   engines.

If Tesseract is right on 18 of 20 pages, stick with Tesseract and use per-region re-run as
needed. If it is wrong on 5 or more, switch the dataset default to a VLM. If it is in the
middle, consider the hybrid strategy above.

## A quick triage table

| Situation                                   | Default engine | Notes                                                     |
|---------------------------------------------|----------------|-----------------------------------------------------------|
| 20th-century typeset print, 300+ dpi        | Tesseract      | Expect 95-99% character accuracy                          |
| 19th-century print, 300+ dpi                | Tesseract      | Add post-processing for systematic errors                 |
| Pre-1850 print, letterpress                 | Tesseract → VLM| Test a subset first; VLM often wins                       |
| Typewritten 20th-century documents          | Tesseract      | Very reliable                                             |
| Degraded archival scans                     | VLM            | Tesseract output will look like noise                     |
| Handwriting                                 | VLM            | Tesseract is not designed for this                        |
| Mixed print + handwriting                   | VLM            | Mixed regions benefit from a VLM&apos;s tolerance              |
| Tables of numbers                           | Tesseract      | Specify PSM mode in settings if results look disordered   |
| Ottoman Turkish                             | VLM            | Our newspapers experience: noticeably better on ligatures |
| East Asian scripts (Chinese, Japanese)      | VLM            | Specialised OCR is an option; VLM is usually simpler      |

## Configuring the choice in Archeglyph

From the dataset&apos;s **Settings** tab:

- **Extraction engine**: set to `tesseract` or to any VLM id from the Ollama Cloud list.
- **Tesseract language**: set under the `extract_engine` sub-options when Tesseract is
  selected. Default is `eng`; change to `eng+fra`, `ara`, `ell`, etc., as your corpus
  requires.
- Saving the change applies to new documents. Existing documents keep their current
  extraction; to re-extract, use the per-document re-run button on the document&apos;s review
  screen or (for the whole dataset) the &quot;Re-extract all&quot; action.

Changing the extraction engine does not, by itself, invalidate embeddings or clusters:
those derive from the text, and existing text is untouched until you re-extract. Once you
do re-extract, the text changes, and the affected documents must be re-chunked,
re-embedded, and re-indexed. Archeglyph surfaces this in the confirmation modal when you
save a change that triggers a re-extraction.

## Further reading

- [The pipeline](/guides/pipeline) — where extraction sits in the full flow.
- [VLM vs OCR: when to pick what](/articles/vlm-vs-ocr-when-to-pick-what) — the reasoning
  and evidence behind the recommendations on this page.
- [Transparency is a feature](/articles/transparency-is-a-feature) — why every extracted
  block names the engine that produced it.</content:encoded><category>extraction</category><category>ocr</category><category>vlm</category><category>decision</category><author>Dipankar</author></item><item><title>Your first dataset</title><link>https://www.archeglyph.com/guides/first-dataset/</link><guid isPermaLink="true">https://www.archeglyph.com/guides/first-dataset/</guid><description>End-to-end walkthrough: sign in, create a dataset, upload pages, watch the pipeline run, review a document, and run your first search.</description><pubDate>Tue, 21 Apr 2026 00:00:00 GMT</pubDate><content:encoded>This guide takes you from zero to a searchable dataset in about 15 minutes of active time,
plus however long the pipeline takes to run on your uploads. You will need: a browser, an
email address, and a handful of scanned pages in PDF or image form — twenty pages is a good
starting size.

## Step 1 — Sign in

Archeglyph uses magic-link sign-in. Visit [/app/login](/app/login), enter your email, and
click the link we send you. There is no password to choose or remember. The magic link
expires 15 minutes after issue; if you miss the window, request another one.

The session cookie set on successful sign-in (`ag_sess`) is httpOnly and lasts 30 days. If
you sign out, or if 30 days of inactivity pass, you will be asked for another magic link.

## Step 2 — Create a dataset

From the datasets page, click **New dataset**. You will be asked for:

- A **name** — human-readable. &quot;Constantinople newspapers 1920s&quot; is fine.
- A **slug** — the URL-safe identifier. Derived from the name; you can edit it.
- A **description** — one or two sentences for your own future reference.

The first dataset you create uses Archeglyph&apos;s safe defaults: Tesseract for extraction,
MiniLM-L6 for embeddings, the smallest current Ollama Cloud VLM for layout assessment and
cluster labels. You can change any of these later from the dataset&apos;s **Settings** tab, and
the settings page will tell you which of your stored state (embeddings, clusters) would need
to be rebuilt if you do.

## Step 3 — Upload files

On the new dataset&apos;s page, click **Upload**. You can drag PDFs or image files directly onto
the page, or pick them from a file dialog. Archeglyph:

- Hashes each file. Duplicate uploads are detected and skipped.
- Accepts PDFs up to 500 MB and individual images up to 50 MB.
- Begins the pipeline automatically once a file has finished uploading.

You will see each file appear as a row in the document table. Its status column starts at
`uploaded` and moves through `assessed`, `extracted_text`, `chunked`, `embedded`, `indexed`,
`clustered`, `ready` as the pipeline runs. The updates arrive over a server-sent stream,
so no refresh is needed — the column updates in place.

## Step 4 — Watch the pipeline

For a typical 20-page upload, you will see:

- **Upload** complete in a few seconds (depends on your connection).
- **Assess** complete in a minute or two — this is the VLM looking at each page and
  returning regions.
- **Extract** complete in under a minute — Tesseract is fast.
- **Analyse** complete in another minute — chunking, embedding, indexing, clustering.

Five to ten minutes wall-clock is a fair estimate for twenty pages on a fresh dataset.
Longer documents with complex layouts will take longer; the progress bar on each row
reflects per-stage completion.

If anything fails, the row shows an error badge with a **Retry** button. The pipeline is
fingerprinted per stage so retries re-run only the failing stage.

## Step 5 — Review a document

Once a document&apos;s status hits `extracted_text`, its **Review** link becomes live. Click it
for one document. You will land on a three-pane screen:

- The **source image** on the left, with region bounding boxes overlaid.
- The **extracted text** in the middle, one editable block per region. Each block has a
  `ProvenanceBadge` showing the engine that produced it.
- A **metadata panel** on the right: confidence histogram, engine choices, per-region
  re-run buttons, and an escape hatch to re-run the whole document from the assess stage.

Scroll through the text. Click on a region in the image — the corresponding text block
highlights. If a block looks garbled, click the &quot;re-run with…&quot; affordance on its
provenance badge and pick a different engine. The re-run runs just that region, typically
in seconds.

When you are satisfied, click **Accept**. The document&apos;s status advances and the next
stages (chunking, embedding, indexing, clustering) proceed over the accepted text. You can
skip review entirely for corpora where that level of care is not needed.

Keyboard shortcuts help here: `j` and `k` move between regions, `e` opens the editor on
the current region, `r` opens the re-run menu, `Enter` accepts the region, `Esc` cancels.

## Step 6 — Run a search

Once at least one document is `ready`, the dataset&apos;s **Search** tab works. Type a query and
you will get back snippets from the dataset&apos;s text, each with:

- The document and page they come from.
- The matching phrases highlighted.
- The `ProvenanceBadge` for the extracted block they came from.
- A relevance score that combines lexical (Tantivy BM25) and semantic (zvec cosine) scores
  via reciprocal rank fusion; a sketch of the fusion follows this list.
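
Reciprocal rank fusion is simple enough to show in full. A minimal sketch; the constant
`k = 60` is the value the original RRF paper suggests, and we have not confirmed which
constant Archeglyph uses:

```python
def rrf(lexical_ids: list[str], semantic_ids: list[str], k: int = 60) -&gt; list[str]:
    &quot;&quot;&quot;Fuse two rankings of chunk ids into one, best first.

    Each input list is in rank order from its own index; a chunk earns
    1 / (k + rank) per list it appears in, and the sums decide the order.
    &quot;&quot;&quot;
    scores: dict[str, float] = {}
    for ranking in (lexical_ids, semantic_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```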

Use the **Lexical | Hybrid | Semantic** toggle at the top of the search box to change the
search mode. Lexical is best when you know the exact phrase; semantic is best when you are
searching for a concept; hybrid — the default — generally works well for both.

## Step 7 — Open the cluster browser

Click the **Clusters** tab. You will see a grid of cluster cards; each card leads with a
theme title, a one-sentence summary, and three exemplar fragments. Pick the card that
looks most interesting and click **Open cluster**. You will land in the **fragment
neighbourhood** view — all of the cluster&apos;s fragments with ±1 sentence of surrounding
context, grouped by document.

The fragment neighbourhood is where much of the research happens: read the fragments,
flag the ones that matter, and click through to the source page for the full context.
Flags and notes are per-user and persist across sessions.

If you want to see the more ML-flavoured view, click **Advanced** on any cluster card.
That reveals the probability histogram, outlier scores, and a UMAP scatter. These are
secondary by design; see [Reading clusters as a researcher](/articles/reading-clusters-as-a-researcher)
for why.

## Step 8 — Settings and snapshots

Visit the dataset&apos;s **Settings** tab. Every default Archeglyph uses for this dataset is
visible there and editable: the layout VLM, the extraction engine, the embedding model, the
cluster-label LLM, and the clustering parameters. Saving a change that invalidates derived
state (notably changing the embedding model) surfaces an explicit confirmation modal that
tells you what will be rebuilt and what it will cost.

The settings page also has an **Export snapshot** button. A dataset snapshot is a single
compressed archive of the lexical index, the vector index, and the metadata database. You
can download it, back it up, and later re-upload it to restore the dataset exactly. This is
the &quot;one file&quot; property mentioned on the landing page.

## What next

- [The pipeline](/guides/pipeline) — the same stages in more conceptual detail.
- [OCR vs VLM extraction](/guides/ocr-vs-vlm) — for when Tesseract is or isn&apos;t the right
  default on your corpus.
- [Reading clusters as a researcher](/articles/reading-clusters-as-a-researcher) — a reading
  guide for the cluster browser.</content:encoded><category>getting-started</category><category>tutorial</category><author>Dipankar</author></item><item><title>The pipeline</title><link>https://www.archeglyph.com/guides/pipeline/</link><guid isPermaLink="true">https://www.archeglyph.com/guides/pipeline/</guid><description>A plain-language tour of the four stages a document passes through in Archeglyph: upload, assess, extract, and analyse. Written for the person using the product, not the person building it.</description><pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate><content:encoded>This guide walks through what happens when you upload a page scan to Archeglyph, in the
order it happens, at the level a researcher cares about. It is not the implementer&apos;s view
— if you want API shapes and worker topologies, see the architecture docs — but it should
give you a clear mental model of what the product is doing with your files and why.

## Upload

A document starts its life as an upload into a **dataset**. A dataset is the unit of
grouping in Archeglyph: a corpus of related documents that share an extraction engine, an
embedding model, and a clustering configuration. You might have one dataset per archival
collection, or per research project, or per publication.

When you upload a PDF or image, Archeglyph:

- Hashes the file so re-uploading the same PDF is a no-op.
- Stores the original bytes untouched in object storage.
- Renders the pages of a PDF to page images at a resolution suitable for both layout and
  extraction models.

You see the file appear in the dataset&apos;s document table with a status of `uploaded`. No
extraction has happened yet — the next stage has to run first.
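
The dedup step is a plain content hash. A sketch of the idea, assuming SHA-256 (the
digest the snapshot format uses for image references; we have not confirmed the upload
path uses the same one):

```python
import hashlib

def file_fingerprint(path: str) -&gt; str:
    &quot;&quot;&quot;Hash a file in 1 MB blocks so a large PDF never loads into memory at once.&quot;&quot;&quot;
    h = hashlib.sha256()
    with open(path, &quot;rb&quot;) as f:
        for block in iter(lambda: f.read(1 &lt;&lt; 20), b&quot;&quot;):
            h.update(block)
    return h.hexdigest()

# An upload whose fingerprint already exists in the dataset is skipped as a duplicate.
```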

## Assess

The second stage is layout assessment. Archeglyph sends each page image to the
vision-language model you chose for the dataset (the default for new datasets is the
smallest current Ollama Cloud VLM, which is cheap to run and adequate for clean scans).
The model returns a list of **regions**: a bounding box, a `kind` (headline, body,
caption, figure, or table), a reading order, and a confidence.

Why a VLM for this step rather than classical computer vision? In our experience on the
newspapers prototype, classical column-detection works well on regular broadsheet layouts
and breaks on almost everything else: irregular gutters, rotated headlines, embedded
figures, book-style pages. A VLM handles the long tail because it has a language prior
over what a page looks like. For the regular cases where classical CV would also work,
Archeglyph retains a CV fallback that is offline and free — useful for very large
newspaper-like runs where the VLM cost adds up.

The assessment&apos;s output is what the next stage operates on: a set of labelled rectangles
per page, each one a region that needs to be read.
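
The region record is small enough to write down. A sketch of its shape; the field names
are ours, but the four fields are the ones the assessment returns:

```python
from dataclasses import dataclass

@dataclass
class Region:
    bbox: tuple[int, int, int, int]  # rectangle on the page image (assumed x0, y0, x1, y1)
    kind: str                        # headline | body | caption | figure | table
    reading_order: int               # position in the page&apos;s reading sequence
    confidence: float                # the layout model&apos;s own confidence
```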

## Extract

The third stage is text extraction. For each region the layout model found, Archeglyph
runs the extraction engine you chose — Tesseract by default, or any VLM in the Ollama
Cloud list — and stores the resulting text, the engine&apos;s name, and a timestamp.

A few things about this stage that matter:

- **It is per-region, not per-page.** A page with a headline, three body columns, and a
  caption is five separate extraction runs. This matters because it lets you re-run just
  one region with a different engine when one goes wrong, without touching the others.
- **Engine choice is per-dataset with per-document override.** Most researchers pick one
  engine for the whole dataset. When they hit a tricky page, they override the choice for
  that page (or for one region on that page) without changing the dataset default.
- **Every extracted block carries its engine in the `ProvenanceBadge`.** You can see at a
  glance which engine produced which block, and re-run a block with a different engine
  from the badge&apos;s menu.

When extraction finishes, the document is in state `extracted_text`. This is the first
point at which the document&apos;s text exists in Archeglyph: readable on the review screen,
and queued for indexing in the analyse stage.

## Review (optional)

Between extraction and analysis, Archeglyph offers an optional review step. This is the
**trust surface** of the product: a three-pane screen showing the page image with region
overlays on the left, the per-region extracted text (editable) in the middle, and a
metadata panel — confidence histogram, engine list, per-region re-run buttons — on the
right.

For small, important corpora (a few dozen pages you&apos;re going to cite) we recommend using
review. For large exploratory corpora (a thousand pages you are surveying) we recommend
skipping it, knowing you can come back to the review screen any time to spot-check a
document that looks off.

Reviewing a document doesn&apos;t change how the analysis stage runs — it just gives you a
chance to correct extraction errors before the text is chunked and indexed.

## Analyse

The final stage is where your dataset turns into something searchable and clusterable.
Archeglyph:

- **Chunks** the extracted text into sentence units using `syntok`. A chunk is roughly one
  sentence, sometimes two if the sentences are short.
- **Embeds** each chunk with the embedding model you chose — MiniLM-L6 by default, with
  BGE-small as an interchangeable alternative. The embedding model&apos;s id is recorded
  alongside each chunk so a later re-embed is a tracked event, not a silent overwrite.
- **Indexes** the chunks twice: once in a lexical index (Tantivy, with stemming and
  snippets) and once in a vector index (zvec, same dimension as the embedding model). The
  two indexes join on chunk id so hybrid search works transparently.
- **Clusters** the chunks into semantic groups using HDBSCAN over the embeddings and into
  lexical groups using TF-IDF plus TruncatedSVD plus HDBSCAN. Each cluster gets a theme
  title and a one-sentence summary from a small text LLM, both of which disclose the LLM
  that wrote them.

When analysis finishes, the document is in state `ready`. The dataset&apos;s search, cluster
browser, and fragment neighbourhood views all become available on the document&apos;s text.
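
The chunking step and the lexical clustering path are short enough to sketch end to end.
The parameters below (`max_features`, `n_components`, `min_cluster_size`) are illustrative
defaults, not Archeglyph&apos;s tuned values:

```python
import hdbscan
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from syntok import segmenter

def sentence_chunks(text: str) -&gt; list[str]:
    &quot;&quot;&quot;Split extracted text into sentence-sized chunks with syntok.&quot;&quot;&quot;
    chunks = []
    for paragraph in segmenter.process(text):
        for sentence in paragraph:
            chunks.append(&quot;&quot;.join(t.spacing + t.value for t in sentence).strip())
    return chunks

def lexical_clusters(chunks: list[str]) -&gt; list[int]:
    &quot;&quot;&quot;TF-IDF, then TruncatedSVD, then HDBSCAN; label -1 marks outliers.&quot;&quot;&quot;
    tfidf = TfidfVectorizer(max_features=20_000).fit_transform(chunks)
    reduced = TruncatedSVD(n_components=50).fit_transform(tfidf)
    return list(hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced))
```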

## What you see along the way

The dataset page shows each document&apos;s current state and any running jobs. Jobs emit live
events over a server-sent stream; the status column updates as each stage completes. If a
stage fails — the VLM times out, a PDF has an unreadable page — the failure surfaces on
the document with a retry button. The pipeline is fingerprinted per stage, so a retry
re-runs only the failing stage, not the whole document.

## What each stage costs

A rough sense of cost per page on a medium-large corpus (hundreds of pages):

- Upload and render: free (CPU).
- Assess: 10-30 seconds and a few cents of VLM credit per page.
- Extract: either a tenth of a second of CPU (Tesseract) or 30-60 seconds and single-digit
  cents of VLM credit (VLM read) per region.
- Analyse: a few seconds of CPU per document for chunking, embedding, and index updates;
  clustering runs once per ingest batch and is usually under a minute for datasets up to
  around 10,000 chunks.

For a 1,000-page corpus with Tesseract extraction and VLM layout, the end-to-end cost is
typically tens of minutes of wall-clock time and a few dollars of hosted-model credit.

## Where to go next

- [Your first dataset](/guides/first-dataset) walks through the same pipeline hands-on,
  from sign-in to first search.
- [OCR vs VLM extraction](/guides/ocr-vs-vlm) is the practical chooser for the extraction
  stage.
- [Transparency is a feature](/articles/transparency-is-a-feature) explains why every
  stage of the pipeline labels its output with the model that produced it.</content:encoded><category>pipeline</category><category>overview</category><author>Dipankar</author></item><item><title>What a good provenance badge looks like</title><link>https://www.archeglyph.com/articles/what-a-good-provenance-badge-looks-like/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/what-a-good-provenance-badge-looks-like/</guid><description>UX writing about the transparency contract: what goes inside the badge, what gets omitted, and why the re-run affordance lives next to it. With ASCII mockups of the patterns we use in the review screen, search results, and cluster cards.</description><pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate><content:encoded>If a provenance badge is a promise that &quot;this specific output was produced by this specific
engine,&quot; the badge has to be readable without training. It has to answer three questions at
a glance: *what model, what version, and can I try another?* It has to do that in a row of
search results without ballooning the row. And it has to mean the same thing whether it sits
beside an extracted paragraph, a cluster title, or a vector search hit.

We have iterated on the badge several times during M0. These notes describe where it landed
and why.

## Anatomy

A badge is three fields rendered as one pill:

```
┌─────────────────────────────────────────┐
│ qwen3-vl:235b-cloud · v2025.03 · 02:14 │
└─────────────────────────────────────────┘
```

- **Engine id.** The left-most field is the stable identifier used across the catalogue.
  Tesseract reads as `tesseract`, a VLM reads as its full Ollama tag. We never shorten
  `qwen3-vl:235b-cloud` to `qwen` — abbreviation was one of the first temptations and one of
  the first rejections, because &quot;qwen&quot; alone is not a citable reference.
- **Version.** For binary engines (Tesseract) this is the upstream semver. For cloud-backed
  VLMs this is a date tag that we reconcile nightly against the provider. If the provider
  does not expose a version, we surface the date we first observed that model id in our
  engine catalogue.
- **Timestamp.** HH:MM of when this specific block was produced. Not the full ISO-8601
  (which clutters), but enough to disambiguate the pre-review output from a re-run.

A badge never carries confidence scores. Confidence is useful in the review pane and on the
advanced panel of a cluster card, but folding it into the badge would pressure readers to
treat it as the headline number, and the headline of a provenance badge is *who produced
this*, not *how sure they were*.

## Where badges appear

### In the review pane

```
┌─ Region 14 ──────────────────────────────────────────────────────┐
│ &quot;reported from the wharves of Galata that the Russian           │
│  steamer...&quot;                                                    │
│                                                                 │
│ [ tesseract · 5.3.0 · 02:14 ]  [ accept ]  [ re-run with ⌄ ]    │
└──────────────────────────────────────────────────────────────────┘
```

The badge sits on the same row as the accept and re-run controls because those three things
compose one decision: *I have seen what produced this, I know my options, I choose to accept
or rework.* If the badge were in a tooltip, the action would lose the attribution that
justifies it.

### In search results

```
 #42  p=0.812   Document 117, p.3                                 
 &quot;…reported from the wharves of Galata that the Russian steamer…&quot; 
 [ tesseract · 5.3.0 ]  [ embed: bge-small-en-v1.5 ]              
```

Search results have two badges: the engine that extracted the text, and the model that
embedded the chunk. We show both because a user comparing two search results can form a
legitimate hypothesis like *&quot;the MiniLM rows rank differently from the BGE rows&quot;* only if
both badges are visible side by side.

### In cluster cards

```
┌─ Migrations across the Bosphorus ────────────────────────────────┐
│ Fourteen fragments, mostly port reporting from 1897–1901.       │
│ — theme_llm: gemma3:27b-cloud                                   │
│                                                                 │
│ &quot;the wharves of Galata...&quot;         — Doc 117, p.3  (tesseract)  │
│ &quot;steamers inward bound...&quot;         — Doc 204, p.1  (tesseract)  │
│ &quot;lo riferiva il console...&quot;        — Doc 91,  p.2  (qwen3-vl)   │
└──────────────────────────────────────────────────────────────────┘
```

On a cluster card, the theme-writing model is badged at the top of the card and each
exemplar carries its extraction engine. The rule is that every human-readable string a
person did not type at a keyboard has a badge somewhere within a one-glance radius.

## What we decided not to do

- **No &quot;AI generated&quot; disclaimer.** A badge that says `qwen3-vl:235b-cloud` is a piece of
  scholarly apparatus. A banner that says &quot;generated by AI&quot; is a legal posture. We made the
  mistake in an early prototype of bolting both on; readers ignored the banner entirely and
  dismissed the badge as redundant. We kept the badge.
- **No colour-coded risk.** We tried a green/amber/red scheme where high-confidence
  extractions got a muted badge and low-confidence ones got a warn tint. Reviewers read the
  colour as a judgement on the *engine* rather than the *region*, and argued with it. We
  moved confidence to the region tint instead, where it belongs.
- **No vendor logos.** A badge is text. Logos turn provenance into branding, and the moment
  a researcher sees a logo they stop treating the badge as information and start treating it
  as an endorsement.

## The re-run affordance

The badge is paired with a `re-run with…` trigger that opens a popover. The popover is split
into two tabs (OCR, VLM) with the current engine pre-selected and greyed out. Re-running
produces a new row in the region&apos;s history; the badge swaps to the new engine id but the
previous row is still available from the row-history disclosure on the left edge of the
region card.

The re-run button is never the default. In the review pane, `Accept` is the large button;
`re-run with…` is a secondary. In search results and cluster cards, the badge is purely
informational and the re-run affordance is gated behind clicking through to the review
pane. We resisted every design iteration where a researcher could re-run a region from a
search result, because the cost of mis-clicking a re-run in a scanning view is two minutes
of compute and a brief jitter in their own mental model of the dataset.

## The transparency contract, stated plainly

What the badge promises:

1. Every textual output in the product carries, on the same screen, an attribution to the
   engine that produced it.
2. Engine ids are stable: what appears in one snapshot resolves to the same model identity
   in every future snapshot.
3. Every output paired with a badge has a re-run path that is one or two clicks away.

What the badge does not promise:

1. That the engine is correct.
2. That the engine&apos;s weights will remain available upstream.
3. That we have any editorial opinion about the engine&apos;s output.

The badge is a pointer, not an endorsement. That is the whole shape of the transparency
contract, and it is the reason we obsess about the pixels.</content:encoded><category>transparency</category><category>ux</category><category>design</category><author>Dipankar</author></item><item><title>Transparency is a feature</title><link>https://www.archeglyph.com/articles/transparency-is-a-feature/</link><guid isPermaLink="true">https://www.archeglyph.com/articles/transparency-is-a-feature/</guid><description>Why every extracted text block in Archeglyph shows the model that produced it, and why we treat that disclosure as product surface rather than footer text.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>A researcher opens a cluster in an automated text-analysis tool. The cluster is titled
&quot;Migrations across the Bosphorus&quot; and it contains forty-two fragments from a newspaper corpus.
Two of those fragments look very wrong — the OCR is garbled, the sentences don&apos;t quite close,
one of them seems to contain the word &quot;Golota&quot; where the city is clearly Galata. A reasonable
next question for the researcher is: *which engine produced that text, and can I re-run it with
something else?*

Most tools don&apos;t let that question get asked. The text just shows up. If the researcher is
suspicious, they can either ignore the fragment, dig through logs they don&apos;t have access to, or
trust the cluster anyway. None of those are good answers for scholarship.

Archeglyph&apos;s answer is to put the engine&apos;s name next to the text.

## The provenance badge

Every extracted text block in the product carries a small chip — we call it the
`ProvenanceBadge` — that shows the engine and version responsible for that block: for example
`tesseract 5.3` or `qwen3-vl:235b-cloud`, plus a timestamp. Next to the badge is a &quot;re-run
with…&quot; affordance that lets the researcher swap engines on that region without touching the
rest of the document. The badge appears in the document review screen, in search results, and
on every exemplar quotation inside a cluster card.

This sounds like a small UI element, and on the page it is. But the consequences run deep:

- **It forces the pipeline to be honest.** If we can&apos;t reliably attribute a text block to an
  engine, we can&apos;t render the badge. That constraint shaped our data model: every extracted
  region stores its engine id, and re-runs don&apos;t silently overwrite — they produce a new row
  with a new provenance stamp.
- **It turns failure into a question the researcher can answer.** A garbled OCR line stops
  being &quot;the machine failed&quot; and becomes &quot;Tesseract failed on this region; what happens if we
  try a VLM here?&quot; The failure mode is legible, and so is the remedy.
- **It makes cross-engine comparison part of normal reading.** When the cluster view shows that
  forty of the forty-two exemplars came from `tesseract 5.3` and two came from
  `qwen3-vl:235b-cloud`, the researcher can start forming intuitions about which engine earns
  its cost on which kind of page.

## Why this isn&apos;t a footer

The easy thing to do is put a line at the bottom of a report that says &quot;generated using an
AI-assisted pipeline.&quot; Every vendor does this and it satisfies nothing. A footer says: *there
is a machine somewhere, and the output might be wrong, and you should know that in the
abstract.* A badge next to each block says: *this specific sentence was produced by this
specific engine at this specific time, and here is the button to try again with a different
one.*

The first is a legal disclosure. The second is a piece of scholarly apparatus.

## What the product discloses

In M0 the badge surface covers:

- **Layout regions.** Each region&apos;s `kind` (headline, body, caption, figure, table) and the
  model that assessed the layout — e.g. `gemma3:27b-cloud` — with a confidence score when the
  model returns one.
- **Extracted text.** The engine that read each region and its version. For Tesseract that&apos;s
  the binary version. For a VLM that&apos;s the full Ollama tag.
- **Cluster theme titles.** When a small text LLM is used to polish the top-TF-IDF terms into
  a 4-6 word title, the title discloses the model that wrote it. The summary sentence gets the
  same treatment.
- **Embeddings.** Every chunk stores the embedding model id, and the search result UI surfaces
  it when the user hovers on a hit — because if you switch from MiniLM to BGE, results can
  reorder, and that reordering deserves a trail.

## What transparency is not

Transparency is not the same as openness. We do not claim the weights of the VLMs we call are
open or auditable. We do not claim you can reproduce a cluster bit-for-bit six months from now
if the upstream Ollama model has been retrained. What we do claim — and what the badge
delivers — is a second-order guarantee: *at the time you are looking at this output, you can
see exactly what produced it.* From there, if a claim matters, you can re-run the relevant
step with a different engine and compare.

That is enough for scholarship to work. A footnote that names the edition does not promise
the edition is correct; it promises the reader can go look. The provenance badge is the
same promise in a different medium.

## Implications for our roadmap

Treating provenance as surface shapes what we build next:

1. **Engine catalogue is a first-class object.** Not a config file; a database table, with a
   nightly reconciliation job that flags stale ids. If an engine disappears upstream, the
   dataset settings page warns you that your chosen default is no longer available.
2. **Re-run is cheap.** The pipeline is fingerprinted per stage, so re-running extraction on
   one region with a different engine costs only that region&apos;s compute, not the whole
   document&apos;s. The badge only makes sense if the &quot;re-run with…&quot; button is painless.
3. **The advanced toggle exists, but it&apos;s not the default.** Confidence histograms, outlier
   scores, UMAP projections — those matter when you&apos;re debugging a pipeline, not when you&apos;re
   reading a cluster. They live behind an explicit toggle on each cluster card.

## What we ask of readers

When you use Archeglyph outputs in published work, please cite the engine. The product makes
it easy — the badge text is already the citation string. In return, we commit to keeping the
badges stable: an engine id that appears in one snapshot will resolve to the same model
identity in all future snapshots, even if we retire the engine and archive the weights
metadata.

Transparency isn&apos;t a privacy stance or a compliance checkbox. It&apos;s the piece of product
surface that lets a researcher do their job without trusting us more than they should.</content:encoded><category>transparency</category><category>product</category><author>Dipankar</author></item></channel></rss>