Archeglyph
Guide · export · archiving · snapshots · how-to

Exporting and archiving a dataset

A forward-looking but grounded walkthrough of Archeglyph's dataset snapshot: what goes into the tarball, how to open it without the product, and how to cite a snapshot in a paper.

By Dipankar

On this page
  1. Creating a snapshot
  2. What is inside
  3. Opening a snapshot without Archeglyph
  4. Browse the catalogue in sqlite
  5. Search the lexical index from Python
  6. Load the embeddings
  7. Citing a snapshot
  8. Archiving versus re-importing
  9. Caveats we want to be honest about
  10. A short checklist before you call it done

Every study ends — a paper gets published, a grant closes, a postdoc moves institutions — and the question becomes: how do I keep the work in a form I can still use in five years, without the tool that produced it? Archeglyph’s answer is the dataset snapshot, a single tarball that bundles the catalogue, the lexical index, and the embedding store for one dataset. This guide walks through how to create one, what is inside it, how to open it without Archeglyph running, and how to cite it.

Some of the surfaces described below are still rolling out through M1. Where that is the case the guide flags it.

Creating a snapshot

From the dataset page, open the ⋯ menu in the header and choose Export snapshot. The product computes the total size and shows it before you confirm, then produces a tarball named:

archeglyph-<workspace>-<dataset-slug>-<YYYYMMDD-HHMM>.tar.zst

The timestamp is UTC. The file is zstd-compressed; on a modern machine a dataset with tens of thousands of documents typically lands in the low hundreds of megabytes.

While the export is running you can close the page — the job continues server-side and an email with the download link arrives when it finishes. For datasets with tens of millions of chunks the export can take several minutes; the job status surfaces in the dataset’s events feed the same way extraction jobs do.

What is inside

Unpack the archive:

$ tar --zstd -xvf archeglyph-<…>.tar.zst
archeglyph-<…>/
├── README.txt
├── catalogue.sqlite
├── index.tantivy/
│   ├── meta.json
│   └── … segment files …
├── embeddings.zvec
├── settings.json
└── manifest.json
  • catalogue.sqlite is a plain sqlite database containing the tables for documents, pages, regions, extracted text (with its engine provenance), edits, clusters, cluster memberships, and the settings that were active at snapshot time. You can open it in any sqlite browser; the schema is documented in README.txt and mirrors the tables described in the platform docs.
  • index.tantivy/ is the lexical search index, in tantivy’s on-disk format. It can be opened by any tantivy 0.22+ reader; you do not need Archeglyph to query it.
  • embeddings.zvec is the compressed embedding store, one vector per chunk plus a small metadata header (model id, dimension, chunking recipe). The zvec format is documented in its repository; a short Python reader script is bundled as read_embeddings.py.
  • settings.json is a human-readable copy of the dataset’s settings at the moment of export — engines, thresholds, chunking parameters. It is redundant with the sqlite catalogue but is present to make the snapshot legible without any database tooling.
  • manifest.json lists every file, its SHA-256, and the snapshot schema version. Check the hashes after download if you intend to archive the tarball long-term.
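The hash check is easy to script. A minimal sketch — it assumes a manifest shape like {"files": [{"path": …, "sha256": …}]}, which you should confirm against your own manifest.json; the demo builds a throwaway snapshot directory so the function has something to verify:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def verify_snapshot(snapshot_dir):
    """Compare each file's SHA-256 against manifest.json; return mismatches."""
    root = Path(snapshot_dir)
    manifest = json.loads((root / 'manifest.json').read_text())
    bad = []
    for entry in manifest['files']:  # assumed shape: [{'path': ..., 'sha256': ...}]
        digest = hashlib.sha256((root / entry['path']).read_bytes()).hexdigest()
        if digest != entry['sha256']:
            bad.append((entry['path'], digest))
    return bad

# Demonstrate on a throwaway snapshot containing a single file.
root = Path(tempfile.mkdtemp())
(root / 'settings.json').write_bytes(b'{}')
(root / 'manifest.json').write_text(json.dumps({
    'files': [{'path': 'settings.json',
               'sha256': hashlib.sha256(b'{}').hexdigest()}]
}))
mismatches = verify_snapshot(root)
print('OK' if not mismatches else mismatches)
```

Run the same function against the unpacked tarball before depositing it anywhere long-term.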

The tarball does not contain the raw source images. It contains references — a stable URL plus a SHA-256 — and a rehydrate.sh script that refetches the binaries from the original object store. This is a licensing choice: many source archives grant Archeglyph the right to process images but not to redistribute them. A future --with-images flag will bundle the binaries for researchers whose sources are fully open.

Opening a snapshot without Archeglyph

The design goal is that the snapshot opens with off-the-shelf tools. Three worked examples:

Browse the catalogue in sqlite

$ sqlite3 catalogue.sqlite
sqlite> .tables
documents   regions    texts    clusters   chunks   settings   engines
sqlite> SELECT count(*) FROM chunks;
sqlite> SELECT text FROM texts WHERE engine_id = 'qwen3-vl:235b-cloud' LIMIT 5;

Every row carries the engine id that produced it; joining texts to engines gives you the full provenance record in a single query.
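The same provenance join works from Python's built-in sqlite3 module. A sketch with stand-in tables — the column names here (engine_id, name) are assumptions; check the real schema against README.txt before relying on them:

```python
import sqlite3

# Stand-in tables mimicking the snapshot catalogue; with a real snapshot
# you would connect to 'catalogue.sqlite' instead of ':memory:'.
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE engines (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE texts (id INTEGER PRIMARY KEY, engine_id TEXT, text TEXT);
    INSERT INTO engines VALUES ('qwen3-vl:235b-cloud', 'Qwen3-VL 235B (cloud)');
    INSERT INTO texts (engine_id, text) VALUES
        ('qwen3-vl:235b-cloud', 'the wharves at Galata');
""")

# One query returns each extracted text alongside the engine that produced it.
rows = con.execute("""
    SELECT t.text, e.name
    FROM texts t JOIN engines e ON t.engine_id = e.id
""").fetchall()
print(rows)
```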

Search the lexical index from Python

from tantivy import Index

ix = Index.open('index.tantivy')
searcher = ix.searcher()
hits = searcher.search(ix.parse_query('wharves OR galata', ['text']), limit=20)
for score, address in hits.hits:
    doc = searcher.doc(address)
    print(score, doc.get_first('document_id'), doc.get_first('page_no'),
          doc.get_first('text')[:80])

The tantivy Python bindings read Archeglyph’s snapshot indexes directly; the field names (text, document_id, page_no, region_id) are documented in README.txt.

Load the embeddings

from zvec import read

store = read('embeddings.zvec')
print(store.metadata)  # {'model': 'bge-small-en-v1.5', 'dim': 384, ...}
for chunk_id, vector in store:
    # use numpy, faiss, whatever
    ...

The embedding store carries enough metadata to reconstruct a search space without Archeglyph; the model id is what lets you (or a future reader) know whether they can mix these vectors with another corpus.
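Once the vectors are out of the store, nearest-neighbour search needs nothing heavier than cosine similarity. A sketch with toy 4-dimensional vectors standing in for the 384-dimensional (chunk_id, vector) pairs the store actually yields:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the (chunk_id, vector) pairs iterated out of embeddings.zvec.
vectors = {
    'chunk-001': [0.1, 0.9, 0.0, 0.2],
    'chunk-002': [0.8, 0.1, 0.1, 0.0],
    'chunk-003': [0.2, 0.8, 0.1, 0.1],
}
query = [0.15, 0.85, 0.05, 0.15]  # in practice: embed the query with the same model

# Rank chunks by how closely their vectors point toward the query.
ranked = sorted(vectors, key=lambda cid: cosine(query, vectors[cid]), reverse=True)
print(ranked)
```

For corpora of any real size you would hand the same vectors to faiss or numpy instead, but the model-id check in the metadata applies either way.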

Citing a snapshot

A snapshot is citable. The recommended format:

Author, Dataset title, Archeglyph snapshot sha256:<…> exported <YYYY-MM-DD>, archived at <url-or-doi>.

The manifest.json contains a snapshot_id which is the SHA-256 of the concatenated file hashes — that is the value to paste in the sha256: field. Two researchers with the same snapshot_id are guaranteed to be looking at bit-identical data.
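An independent reader can recompute that identifier. A sketch under stated assumptions — the concatenation recipe here (file hashes as hex strings, joined in manifest order with no separator) is a guess to confirm against README.txt:

```python
import hashlib

# A stand-in manifest; with a real snapshot, load manifest.json instead.
manifest = {
    'files': [
        {'path': 'catalogue.sqlite', 'sha256': 'a' * 64},
        {'path': 'embeddings.zvec', 'sha256': 'b' * 64},
    ]
}

# Assumed recipe: SHA-256 over the concatenated per-file hex digests.
concatenated = ''.join(f['sha256'] for f in manifest['files'])
snapshot_id = hashlib.sha256(concatenated.encode()).hexdigest()
print(snapshot_id)
```

If your recomputed value matches the snapshot_id in the manifest, you have independently confirmed the bit-identity guarantee the citation relies on.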

If you deposit the tarball in Zenodo or your institution’s repository, Archeglyph will accept the DOI on the dataset’s settings page and show it on the dataset’s landing card. That feature lands in M1-D.

Archiving versus re-importing

Two different verbs, two different use cases:

  • Archiving — the tarball is the final form. You put it in a repository, you stop thinking about it. The catalogue, index, and embedding store inside are all openable with tools that predate Archeglyph; whatever happens to us, the research artefact survives.
  • Re-importing — the same tarball can be loaded back into Archeglyph (⋯ → Import snapshot) and becomes a new dataset in your workspace. The original snapshot is not mutated; re-imports are a fork, not a load. This is how a collaborator receives your study.

Caveats we want to be honest about

  • Not every settings field is carried. The snapshot preserves the engine selection, the chunking recipe, and the search configuration. Workspace-level things (billing, team membership, access policies) are intentionally not exported because they belong to a workspace, not a dataset.
  • Image rehydration depends on upstream availability. If the source archive takes a document offline, the rehydrate.sh script will fail on that file. The extracted text, regions, index, and embeddings are untouched — you keep the scholarship, you lose the ability to redisplay the image.
  • Snapshot schema will version. The format is at v1. Future versions will add fields, never remove them; a v1 reader will continue to open every snapshot produced today.
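A forward-compatible reader can enforce that contract with one check before touching anything else. This assumes the manifest exposes the version under a key like schema_version — an assumption to verify against your own manifest.json:

```python
import json

SUPPORTED_MAJOR = 1

# Stand-in for reading manifest.json; the 'schema_version' key is assumed.
manifest = json.loads('{"schema_version": "1.2", "files": []}')
major = int(str(manifest['schema_version']).split('.')[0])
if major != SUPPORTED_MAJOR:
    raise SystemExit(f'unsupported snapshot schema v{major}')
print('schema ok')
```

Because future versions only add fields, ignoring unknown keys after this check is safe by design.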

A short checklist before you call it done

  1. Download the tarball and verify the SHA-256s in manifest.json.
  2. Open catalogue.sqlite and confirm the document count matches what you expect.
  3. Archive the tarball somewhere with an addressable URL (institutional repository, Zenodo, S3 bucket with public read).
  4. If the study is published, paste the snapshot_id into the methods section and the DOI onto the dataset’s settings page in Archeglyph so other readers can find it.

A snapshot is not the end of a dataset’s life — it is the first moment the dataset becomes a citizen of the scholarly record rather than a row in our database. That is what the format was built for.