Archeglyph
positioning · method · digitisation

Downstream of Trove: where analysis fits in the corpus stack

Digitisation projects like Trove, Chronicling America, and Europeana produce the corpus. Archeglyph produces the analysis on top of it. They are layers of the same stack — not competitors, not substitutes.

By Dipankar Sarkar

On this page
  1. The corpus stack has layers
  2. What digitisation projects do
  3. What Archeglyph does
  4. Where they meet
  5. A worked example: Trove + Archeglyph in the same workflow
  6. What we don’t claim
  7. A closing line

A question we get, politely phrased, roughly once a week: “Is this a competitor to Trove?” Or “How is this different from Chronicling America?” Or — from someone who has spent longer thinking about it — “Where does Archeglyph sit, exactly, in the ecosystem of digital tools for archival research?”

The honest answer is that Archeglyph and the great digitisation projects live on different floors of the same building. The building is the corpus stack. This article is about what each floor does, and why confusing the floors leads to bad tool choices.

The corpus stack has layers

Digital research on archival material has, in practice, a vertical stack of concerns. Roughly, from the bottom up:

  1. Preservation. Keeping the physical artefact alive — paper, glass plate, wax cylinder, magnetic tape — for another century. This is the job of libraries, archives, and museums.
  2. Digitisation. Turning the artefact into bits. Scanning a page, photographing a plate, ripping a cylinder. Producing a page image, an OCR text layer, and page-level metadata.
  3. Indexing. Making the bits findable. Full-text search across the digitised corpus, search-by-metadata, browseable collection pages.
  4. Analysis. Doing something with a subset of the digitised material once you have chosen it: clustering, close reading with navigation, extraction of claims, quantitative comparison.
  5. Interpretation. Writing the paper, the chapter, the monograph. This is a human activity. It is not, at any foreseeable point, the job of software.

Each layer is somebody’s job. Each layer has its own institutions, its own funding model, its own timescale. Archeglyph occupies the analysis layer; the layer immediately below it — the layer that feeds Archeglyph — is digitisation.

What digitisation projects do

The great digitisation projects are, by a wide margin, the most impressive infrastructural work in the digital humanities. Some of the ones we lean on daily:

  • Trove (National Library of Australia) — hundreds of millions of digitised pages of Australian newspapers, gazettes, magazines, and books. Full-text searchable, with a community of volunteer text correctors improving the OCR a line at a time.
  • Chronicling America (Library of Congress + NEH) — a growing corpus of historic US newspapers, state-by-state, with a public API and a clean page-image viewer.
  • Europeana — a federation across European cultural heritage institutions, aggregating metadata and digitised objects from thousands of museums, libraries, and archives.
  • HathiTrust — a shared digital library built on the mass digitisation of research-library holdings, with in-copyright and public-domain strata and a careful access model.
  • Internet Archive — the public-facing generalist. Books, serials, audio, video, web. An indispensable safety net for everything the institutional projects haven’t yet reached.
  • Google Books — the largest of all by raw volume, with a search surface that is uneven but often surprising.
  • DPLA (Digital Public Library of America) — an aggregator over US institutional collections, analogous in ambition to Europeana.

What these projects produce, broadly, is the same shape of output: a scanned page, an OCR text layer of varying quality, page-level metadata (title, date, publisher, rights), and a search surface that lets you find pages across millions.
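That common shape can be sketched as a record type. This is an illustration only: the field names below are ours, not any project's actual schema, and real metadata models (METS/ALTO, IIIF manifests, project-specific APIs) are far richer.

```python
from dataclasses import dataclass

@dataclass
class DigitisedPage:
    """One page, in the shape a digitisation project typically delivers it.
    Field names are illustrative, not any project's real schema."""
    image_url: str   # link to the scanned page image
    ocr_text: str    # OCR text layer, quality varies widely
    title: str       # publication title
    date: str        # issue date
    publisher: str   # publisher of record
    rights: str      # rights / access statement

# A hypothetical record, with made-up values:
page = DigitisedPage(
    image_url="https://example.org/page1.jpg",
    ocr_text="SHIPPING NEWS. The steamer arrived...",
    title="Example Gazette",
    date="1923-05-14",
    publisher="Example Press",
    rights="public domain",
)
```

The point of the sketch is the uniformity: whatever the institution, the downstream consumer sees roughly these six things per page, and nothing tuned to any one research question.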

This is an enormous achievement. It is also expensive, institutional, and slow. A digitisation project is measured in decades. Its output is broad — it serves every downstream use from genealogy to literary scholarship to local history — and, necessarily, generic: it does not privilege any one research question.

What Archeglyph does

Archeglyph is a research tool, not a digitisation tool. It starts from material that has already been digitised — by an institution, by a researcher with a scanner, by a photographer with a phone — and produces a reading surface over the specific subset a researcher cares about.

Concretely, given a set of page images or PDFs a researcher has chosen, Archeglyph:

  • Runs a transparent extraction pipeline (VLM-assisted layout assessment, OCR or VLM extraction at the researcher’s choice) with every model disclosed at the point of output.
  • Indexes the extracted text for full-text search across the researcher’s corpus — not across all of Trove, just across what they uploaded.
  • Clusters fragments semantically and presents each cluster as quotations with sources, not as a scatterplot.
  • Keeps version history so that re-ingestion doesn’t silently renumber clusters or invalidate saved links.
  • Produces an auto-generated plain-language technique note so the researcher can cite how the corpus was processed.
  • Ships the whole dataset as a single exportable snapshot — index, vectors, metadata — so the corpus can be archived, shared, or re-opened without the product.

Analysis, in other words. Project-bound, weeks-to-months, specific, narrow — at the opposite end of every axis from digitisation.
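The analysis pipeline described above can be sketched in miniature. This is a toy, not Archeglyph's implementation: the word-count "embedding" and the greedy similarity grouping below stand in for whatever real models and clustering methods the pipeline actually uses.

```python
from collections import Counter
from math import sqrt

def chunk(text, size=40):
    """Split extracted page text into fixed-size word windows (toy chunking)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(fragment):
    """Bag-of-words counts as a stand-in for a real embedding model."""
    return Counter(fragment.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(fragments, threshold=0.3):
    """Greedy single-pass grouping: attach each fragment to the first
    cluster whose founding fragment is similar enough, else start a new one.
    Each cluster keeps its fragments, so output is quotations, not points."""
    clusters = []  # each cluster: list of (fragment, vector) pairs
    for frag in fragments:
        vec = embed(frag)
        for c in clusters:
            if cosine(vec, c[0][1]) >= threshold:
                c.append((frag, vec))
                break
        else:
            clusters.append([(frag, vec)])
    return [[frag for frag, _ in c] for c in clusters]
```

The design point the toy preserves: the output of clustering is the fragments themselves, grouped, so each group can be shown as quotations with source references rather than as a scatterplot.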

Where they meet

The interface between the two layers is the PDF, or the image folder, or the API response. A researcher searches Trove, finds the two hundred pages that touch their research question, downloads them, and loads them into Archeglyph. The bits come from the digitisation project; the reading happens on top of them, in Archeglyph.

Nothing about this is adversarial. We do not want to re-digitise what Trove has already digitised; Trove has no plans to ship a clustering UI. The digitisation projects built the library. Archeglyph is the desk you read at.

“Trove built the library. Archeglyph is the desk you read at.”

A worked example: Trove + Archeglyph in the same workflow

Concretely — a historian of Australian labour movements in the 1920s is interested in one specific union’s coverage in the Brisbane press.

  1. In Trove. Search for the union’s name across the Brisbane Courier and the Daily Standard for the years 1921–1929. Refine by date, by title, by page type. Select the two hundred-odd articles that look relevant. Download the page PDFs — Trove supports this for most of its newspaper holdings — or export the list as references.
  2. Between layers. The researcher now has a folder of two hundred PDFs on their laptop. This is the handoff point. The digitisation layer has done its job; the analysis layer hasn’t started.
  3. In Archeglyph. Create a new dataset. Upload the PDFs. Let the pipeline run: layout assessment, OCR (Tesseract is usually fine for 1920s Brisbane newsprint), chunking, embedding, clustering.
  4. Reading. The cluster view surfaces themes — shipping strikes, wage arbitration, internal union politics, coverage of rival unions, editorial hostility. Each theme is a card with exemplar quotations and a page reference back to the original Trove scan.
  5. Citing. Every quotation links to a source page. The researcher cites the Trove record for the canonical reference and uses the Archeglyph snapshot ID as the methodological appendix — “cluster analysis produced via Archeglyph snapshot XYZ, 2026-04-15”.
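The handoff at step 2 can be made concrete: pair each downloaded file with its citation record and fingerprint the set, so that re-ingestion can detect when the corpus has changed. Everything here is hypothetical — the function, the manifest shape, and the short "snapshot" digest are our illustration, not Archeglyph's actual snapshot format.

```python
import hashlib
import json

def build_manifest(pdf_bytes_by_name, references):
    """Pair each downloaded PDF (name -> raw bytes) with its citation record
    and derive a deterministic 'snapshot ID' over the whole set.
    A toy sketch of a corpus handoff manifest; not a real format."""
    entries = []
    for name in sorted(pdf_bytes_by_name):
        digest = hashlib.sha256(pdf_bytes_by_name[name]).hexdigest()
        entries.append({
            "file": name,
            "sha256": digest,
            "reference": references.get(name, "unknown"),
        })
    # Same files + same references -> same snapshot ID, every time.
    snapshot_id = hashlib.sha256(
        json.dumps(entries, sort_keys=True).encode()
    ).hexdigest()[:12]
    return {"snapshot": snapshot_id, "pages": entries}
```

Because the snapshot ID is a pure function of the files and their references, the citation in step 5 stays meaningful: anyone holding the same two hundred PDFs can verify they are looking at the corpus the appendix names.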

Trove did the work of digitising the Brisbane press in the 1920s. Archeglyph did the work of letting this particular researcher read two hundred pages as a corpus, not as two hundred separate documents. The two tools did not compete at any step.

What we don’t claim

In the same spirit of spelling things out one axis at a time:

  • We are not digitising more material. We have no scanners, no institutional mandate, no partnerships with rights-holders. If the material isn’t already digitised, Archeglyph cannot help.
  • We are not replacing the institutional archive. The canonical record stays where it is: in the library’s catalogue, under the library’s URL, with the library’s metadata.
  • We are not synthesising new prose. Archeglyph does not summarise a corpus into a paragraph. It does not answer a research question with generated text. Everything on screen is either a quotation from the corpus or a clearly labelled technique note.
  • We are not building a competing search engine. We search across your dataset. Trove searches across Trove. Those are different jobs.

A closing line

If the phrase “downstream of Trove” lands badly — if it sounds dismissive of the decades of work that Trove, Chronicling America, Europeana, HathiTrust, the Internet Archive, Google Books, and DPLA have put into digitising the record — that isn’t our intent. Downstream here is the geographical meaning, not the hierarchical one. The river flows from the institutional archive, through the digitisation project, to the researcher’s desk. Archeglyph sits at the desk. The water had to get there somehow, and it wasn’t us who carried it.