Archeglyph

Why we snapshot per dataset

The product decision behind Archeglyph's dataset snapshot: one tarball that bundles a tantivy index, a zvec embedding store, and a sqlite catalogue — why the three belong together and why the unit is the dataset, not the document.

By Dipankar

On this page
  1. What a dataset is
  2. Why three files, and why these three
  3. Why the unit is the dataset
  4. Operational consequences
  5. What we are deferring
  6. Why this belongs in an article

If you asked any three digital humanities tools to export “everything about this corpus right now, in a form I can archive,” you would get three different answers and none of them would round-trip. One tool would give you a folder of PDFs and shrug at the indexes. Another would give you a Postgres dump with references to an Elasticsearch cluster you no longer have. A third would give you a vendor-specific archive that needs the vendor’s runtime to open.

Archeglyph’s answer is a single tarball per dataset, containing three files: a tantivy lexical index, a zvec embedding store, and a sqlite catalogue. That choice — one unit, three files, the dataset as the atom — took real work to converge on, and it is worth writing down why.

What a dataset is

A dataset in Archeglyph is a bounded collection of source pages a researcher has decided to study together: a newspaper run, the plates from one expedition, one archive’s photographs of a monastery. It has a stable slug, a settings page that pins the engines used for its pipeline, and a set of documents whose regions, text, clusters, and search indexes are all derived from those engines. The dataset is the unit a researcher talks about at a conference. It is also the unit we need to be able to hand back to them intact.

The corollary is that the dataset is not the document. A document snapshot that did not carry its embedding space would lose the ability to search. A document snapshot that carried an embedding space but not the index would ship a vector blob no one can query. The dataset is the smallest scope at which the artefacts still compose into a usable tool.

Why three files, and why these three

Each file in the tarball covers one mode of access and is lossless on its own:

  • The sqlite catalogue is the system of record: documents, pages, regions, engine provenance, edit history, cluster membership, settings at time of snapshot. It is plain SQL, opens in any sqlite browser, and is the thing an archivist can read in ten years without us.
  • The tantivy index is the lexical search layer. It is derived data — it can be rebuilt from the catalogue — but rebuilding it is minutes to hours, and a snapshot without it is noticeably worse to open. Ship it.
  • The zvec embedding store holds the chunk embeddings plus their metadata (model id, dimension, chunking recipe). Like the index, it is derivable; unlike the index, rebuilding requires access to the embedding model, which may have been retired or paywalled by the time a snapshot is re-opened. Shipping the vectors is how you guarantee the semantic search still works years later.
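The shape is simple enough to sketch. The member names below (catalogue.sqlite, index.tantivy, embeddings.zvec) and the pack_snapshot/list_snapshot helpers are assumptions for illustration, not the product's actual layout — a real tantivy index is a directory of segment files, stood in for here by a placeholder:

```python
import sqlite3
import tarfile
import tempfile
from pathlib import Path

# Hypothetical member names -- the real snapshot layout may differ.
MEMBERS = ("catalogue.sqlite", "index.tantivy", "embeddings.zvec")

def pack_snapshot(src_dir: Path, out_path: Path) -> None:
    """Bundle the three artefacts into one gzipped tarball."""
    with tarfile.open(out_path, "w:gz") as tar:
        for name in MEMBERS:
            tar.add(src_dir / name, arcname=name)

def list_snapshot(path: Path) -> list[str]:
    """Return member names without extracting -- a cheap sanity check."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(tar.getnames())

# Build a toy snapshot: a real (empty) catalogue plus stand-in
# placeholder files for the index and the vector store.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    db = sqlite3.connect(src / "catalogue.sqlite")
    db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, slug TEXT)")
    db.commit()
    db.close()
    (src / "index.tantivy").write_bytes(b"lexical index placeholder")
    (src / "embeddings.zvec").write_bytes(b"vector store placeholder")

    tarball = src / "dataset.tar.gz"
    pack_snapshot(src, tarball)
    members = list_snapshot(tarball)

print(members)
```

Nothing in the tarball knows about the product: an archivist with tar, a sqlite browser, and the zvec/tantivy readers can open each piece independently.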

There are tempting simplifications we rejected. A single sqlite file with vectors stored as blobs would be convenient but would forfeit zvec’s per-chunk compression and the ability to load vectors without paging the whole DB. A single portable archive built on Parquet would be elegant but we would have to re-implement the reader side of tantivy. The three-file shape is a compromise: each file is authored by a battle-tested library, and the tarball is what makes them feel like one object.

Why the unit is the dataset

There is a pull, always, to snapshot at a coarser or finer grain:

  • Coarser: snapshot the whole workspace. Attractive because it would be one button. Not what researchers want. A workspace often mixes a finished study with half-cooked exploratory corpora; the finished one needs citation-stable archiving, the exploratory ones don’t, and bundling them conflates two lifecycles. A workspace snapshot also multiplies the size of every archive by a factor that has nothing to do with the scholarship.
  • Finer: snapshot one document. Attractive because the individual page is the atomic image. Not useful on its own: the vector space a document’s chunks live in is shared across the dataset, so a single-document snapshot either ships the whole embedding store (wasteful) or ships only that document’s vectors (which cannot be searched without the rest). Either way the snapshot is no longer composable.

The dataset sits exactly where the scholarly unit and the technical unit agree. That is why it is the snapshot.

Operational consequences

Committing to a dataset snapshot format shaped parts of the product that look unrelated:

  1. Settings are copied into the snapshot. The sqlite catalogue carries the engine selections active at the moment of the snapshot — not just the engine names, but their versions. Opening an older snapshot displays a settings banner that makes clear this is what the dataset was extracted with, even if the workspace has since moved on.
  2. Re-runs are idempotent within the snapshot. Because the catalogue stores every re-run as a new row with its own provenance, a snapshot can be re-extracted selectively and the new rows either merge into a fresh snapshot or split off into a derived dataset. We did not want to teach ourselves two different “which row is canonical” rules.
  3. The tarball is the export format, full stop. There is no JSON export, no CSV export, no “lite” mode. Every export is this tarball. Researchers get a format that round-trips back into the product; archivists get a format that opens without the product; and we get one thing to maintain instead of five.
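The append-only re-run rule in point 2 can be sketched with an in-memory catalogue. The table and column names here are hypothetical — the article does not show the real schema — but the invariant is the one described: every re-run adds a row with its own provenance, and one rule (highest run_id wins) decides which row is canonical:

```python
import sqlite3

# Minimal sketch with hypothetical table and column names.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE extractions (
        region_id  INTEGER,
        run_id     INTEGER,
        engine     TEXT,
        engine_ver TEXT,
        text       TEXT
    )
""")
# First extraction, then a re-run with a newer engine: both rows are kept,
# each carrying the provenance of the run that produced it.
db.executemany(
    "INSERT INTO extractions VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "ocr-engine", "1.0", "T0MB OF THE KING"),
        (1, 2, "ocr-engine", "1.2", "TOMB OF THE KING"),
    ],
)
# One canonical rule everywhere: the row from the latest run wins.
row = db.execute("""
    SELECT text, engine_ver FROM extractions
    WHERE region_id = 1
    ORDER BY run_id DESC
    LIMIT 1
""").fetchone()
print(row)
```

Because the rule lives in a query rather than in destructive updates, a selective re-extraction of a snapshot needs no special casing: new rows either join the fresh snapshot or are carried into a derived dataset, and the same query stays correct in both.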

What we are deferring

The snapshot format does not yet carry the raw source images. That is deliberate — images are large, often restricted by the source archive’s license, and already live in our object store with their own retention policy. A snapshot currently carries image references (a stable URL plus a SHA-256) and a helper script that re-fetches them from the original repository when the snapshot is opened. A future --with-images flag will bundle the binaries for researchers whose source is fully open. We would rather ship the lean tarball now than block on the harder legal question.
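The reference-plus-digest scheme can be sketched as follows. The manifest shape and the verify helper are assumptions for illustration; a real helper would fetch entry["url"] over HTTP, whereas this sketch substitutes known bytes so it runs offline:

```python
import hashlib

# Hypothetical shape of a snapshot's image manifest entry: the snapshot
# records a stable URL plus a SHA-256 digest per source image.
manifest = [
    {"page": "plate-017",
     "url": "https://archive.example.org/plate-017.tif",
     "sha256": hashlib.sha256(b"fake image bytes").hexdigest()},
]

def verify(data: bytes, expected_sha256: str) -> bool:
    """Check re-fetched bytes against the digest recorded at snapshot time."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

# In the real helper these bytes would come from an HTTP GET on
# entry["url"]; here we use known bytes so the sketch is self-contained.
fetched = b"fake image bytes"
ok = verify(fetched, manifest[0]["sha256"])
print(ok)
```

The digest is what makes the reference archival rather than hopeful: if the source repository silently replaces an image, the re-fetch fails loudly instead of quietly corrupting the dataset.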

Why this belongs in an article

Infrastructure choices usually hide inside release notes. We surface this one because the snapshot format is a promise to researchers: the work you do in Archeglyph is yours, and the form in which you take it away is simple enough to still make sense after we’re gone. That promise is only real if we explain what it looks like.