Exporting and archiving a dataset
A forward-looking but grounded walkthrough of Archeglyph's dataset snapshot: what goes into the tarball, how to open it without the product, and how to cite a snapshot in a paper.
By Dipankar
Every study ends — a paper gets published, a grant closes, a postdoc moves institutions — and the question becomes: how do I keep the work in a form I can still use in five years, without the tool that produced it? Archeglyph’s answer is the dataset snapshot, a single tarball that bundles the catalogue, the lexical index, and the embedding store for one dataset. This guide walks through how to create one, what is inside it, how to open it without Archeglyph running, and how to cite it.
Some of the surfaces described below are still rolling out through M1. Where that is the case, the guide flags it.
Creating a snapshot
From the dataset page, open the ⋯ menu on the header and choose Export snapshot. The product computes the total size (it will tell you before you commit), asks you to confirm, and produces a tarball named:
archeglyph-<workspace>-<dataset-slug>-<YYYYMMDD-HHMM>.tar.zst
The timestamp is UTC. The file is zstd-compressed; on a modern machine a dataset with tens of thousands of documents typically lands in the low hundreds of megabytes.
While the export is running you can close the page — the job continues server-side and an email with the download link arrives when it finishes. For datasets with tens of millions of chunks the export can take several minutes; the job status surfaces in the dataset’s events feed the same way extraction jobs do.
What is inside
Unpack the archive:
$ tar --zstd -xvf archeglyph-<…>.tar.zst
archeglyph-<…>/
├── README.txt
├── catalogue.sqlite
├── index.tantivy/
│ ├── meta.json
│ ├── … segment files …
├── embeddings.zvec
├── settings.json
└── manifest.json
- catalogue.sqlite is a plain sqlite database containing the tables for documents, pages, regions, extracted text (with its engine provenance), edits, clusters, cluster memberships, and the settings that were active at snapshot time. You can open it in any sqlite browser; the schema is documented in README.txt and mirrors the tables described in the platform docs.
- index.tantivy/ is the lexical search index, in tantivy’s on-disk format. It can be opened by any tantivy 0.22+ reader; you do not need Archeglyph to query it.
- embeddings.zvec is the compressed embedding store, one vector per chunk plus a small metadata header (model id, dimension, chunking recipe). The zvec format is documented in its repository; a short Python reader script is bundled as read_embeddings.py.
- settings.json is a human-readable copy of the dataset’s settings at the moment of export — engines, thresholds, chunking parameters. It is redundant with the sqlite catalogue but is present to make the snapshot legible without any database tooling.
- manifest.json lists every file, its SHA-256, and the snapshot schema version. Check the hashes after download if you intend to archive the tarball long-term.
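The hash check can be scripted. A minimal sketch, assuming manifest.json maps relative file paths to hex SHA-256 digests under a files key — the key names here are illustrative; the authoritative layout is described in the bundled README.txt:

```python
import hashlib
import json
from pathlib import Path

def verify_snapshot(root: str) -> list[str]:
    """Return the paths whose on-disk SHA-256 does not match the manifest."""
    base = Path(root)
    manifest = json.loads((base / 'manifest.json').read_text())
    mismatches = []
    # Assumed manifest shape: {"files": {"<relative path>": "<hex sha-256>", ...}}
    for rel_path, expected in manifest['files'].items():
        digest = hashlib.sha256((base / rel_path).read_bytes()).hexdigest()
        if digest != expected:
            mismatches.append(rel_path)
    return mismatches
```

An empty return list means the unpacked snapshot matches what was exported; anything else names the files to re-download.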
The tarball does not contain the raw source images. It contains references — a stable URL plus a SHA-256 — and a rehydrate.sh script that refetches the binaries from the original object store. This is a licensing choice: many source archives grant Archeglyph the right to process images but not to redistribute them. A future --with-images flag will bundle the binaries for researchers whose sources are fully open.
Opening a snapshot without Archeglyph
The design goal is that the snapshot opens with off-the-shelf tools. Three worked examples:
Browse the catalogue in sqlite
$ sqlite3 catalogue.sqlite
sqlite> .tables
documents regions texts clusters chunks settings engines
sqlite> SELECT count(*) FROM chunks;
sqlite> SELECT text FROM texts WHERE engine_id = 'qwen3-vl:235b-cloud' LIMIT 5;
Every row carries the engine id that produced it; joining texts to engines gives you the full provenance record in a single query.
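The same join works from Python's standard-library sqlite3 module. A sketch under assumed column names — texts.engine_id joining engines.id, with an engines.name column — which should be checked against the schema in README.txt:

```python
import sqlite3

def engine_provenance(db_path: str, limit: int = 5):
    """Fetch extracted text together with the engine that produced it.

    Column names (texts.engine_id -> engines.id, engines.name) are
    assumptions based on the table listing above, not a documented schema.
    """
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row  # lets us address columns by name
    rows = con.execute(
        "SELECT t.text, t.engine_id, e.name AS engine_name "
        "FROM texts AS t JOIN engines AS e ON e.id = t.engine_id "
        "ORDER BY t.rowid LIMIT ?",
        (limit,),
    ).fetchall()
    con.close()
    return [(r['engine_id'], r['engine_name'], r['text']) for r in rows]
```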
Search the lexical index from Python
from tantivy import Index

ix = Index.open('index.tantivy')
searcher = ix.searcher()
query = ix.parse_query('wharves OR galata', ['text'])
hits = searcher.search(query, limit=20)
for score, address in hits.hits:
    doc = searcher.doc(address)
    # tantivy documents hold a list of values per field; get_first
    # returns the first (here, only) value
    print(score, doc.get_first('document_id'), doc.get_first('page_no'),
          doc.get_first('text')[:80])
The tantivy Python bindings read Archeglyph’s snapshot indexes directly; the field names (text, document_id, page_no, region_id) are documented in README.txt.
Load the embeddings
from zvec import read

store = read('embeddings.zvec')
print(store.metadata)  # {'model': 'bge-small-en-v1.5', 'dim': 384, ...}
for chunk_id, vector in store:
    # use numpy, faiss, whatever
    ...
The embedding store carries enough metadata to reconstruct a search space without Archeglyph; the model id is what lets you (or a future reader) know whether they can mix these vectors with another corpus.
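Reconstructing a search space from those vectors needs nothing beyond numpy. A minimal cosine-similarity sketch — the toy identity-matrix vectors stand in for a real zvec store, and the function name is ours, not part of any Archeglyph API:

```python
import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k row vectors most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Dot product of unit vectors == cosine similarity; negate to sort descending.
    return np.argsort(-(v @ q))[:k]
```

Pair the returned indices with the chunk ids from the store iterator to get back to the catalogue rows.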
Citing a snapshot
A snapshot is citable. The recommended format:
Author, Dataset title, Archeglyph snapshot sha256:<…>, exported <YYYY-MM-DD>, archived at <url-or-doi>.
The manifest.json contains a snapshot_id which is the SHA-256 of the concatenated file hashes — that is the value to paste in the sha256: field. Two researchers with the same snapshot_id are guaranteed to be looking at bit-identical data.
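For readers who want to recompute the id rather than trust the manifest, a sketch under two explicit assumptions — that the hex digests are concatenated in sorted path order and hashed as ASCII; the snapshot_id field in manifest.json remains the authoritative value:

```python
import hashlib
import json

def snapshot_id(manifest_path: str) -> str:
    """Recompute a snapshot id from its manifest.

    Assumes manifest['files'] maps paths to hex SHA-256 digests and that
    the id hashes those digests concatenated in sorted path order — both
    assumptions about the format, not documented guarantees.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)
    concatenated = ''.join(manifest['files'][p] for p in sorted(manifest['files']))
    return hashlib.sha256(concatenated.encode('ascii')).hexdigest()
```

If the recomputed value matches the stored one, the ordering assumption holds for your snapshot and the check doubles as an integrity test.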
If you deposit the tarball in Zenodo or your institution’s repository, Archeglyph will accept the DOI on the dataset’s settings page and show it on the dataset’s landing card. That feature lands in M1-D.
Archiving versus re-importing
Two different verbs, two different use cases:
- Archiving — the tarball is the final form. You put it in a repository, you stop thinking about it. The three data stores inside — the sqlite catalogue, the tantivy index, the zvec embeddings — are all openable with tools older than Archeglyph; whatever happens to us, the research artefact survives.
- Re-importing — the same tarball can be loaded back into Archeglyph (⋯ → Import snapshot) and becomes a new dataset in your workspace. The original snapshot is not mutated; re-imports are a fork, not a load. This is how a collaborator receives your study.
Caveats we want to be honest about
- Not every settings field is carried. The snapshot preserves the engine selection, the chunking recipe, and the search configuration. Workspace-level things (billing, team membership, access policies) are intentionally not exported because they belong to a workspace, not a dataset.
- Image rehydration depends on upstream availability. If the source archive takes a document offline, the rehydrate.sh script will fail on that file. The extracted text, regions, index, and embeddings are untouched — you keep the scholarship, you lose the ability to redisplay the image.
- Snapshot schema will version. The format is at v1. Future versions will add fields, never remove them; a v1 reader will continue to open every snapshot produced today.
A short checklist before you call it done
- Download the tarball and verify the SHA-256s in manifest.json.
- Open catalogue.sqlite and confirm the document count matches what you expect.
- Archive the tarball somewhere with an addressable URL (institutional repository, Zenodo, S3 bucket with public read).
- If the study is published, paste the snapshot_id into the methods section and the DOI onto the dataset’s settings page in Archeglyph so other readers can find it.
A snapshot is not the end of a dataset’s life — it is the first moment it becomes a citizen of the scholarly record rather than a row in our database. That is what we built the format for.