Exporting and archiving a dataset
A forward-looking but grounded walkthrough of Archeglyph's dataset snapshot: what goes into the tarball, how to open it without the product, and how to cite a snapshot in a paper.
By Dipankar
Every study ends — a paper gets published, a grant closes, a postdoc moves institutions — and the question becomes: how do I keep the work in a form I can still use in five years, without the tool that produced it? Archeglyph’s answer is the dataset snapshot, a single tarball that bundles the catalogue, the lexical index, and the embedding store for one dataset. This guide walks through how to create one, what is inside it, how to open it without Archeglyph running, and how to cite it.
Some of the surfaces described below are still rolling out through M1. Where that is the case, the guide flags it.
Creating a snapshot
From the dataset page, open the ⋯ menu on the header and choose Export snapshot. The product computes the total size (it will tell you before you commit), asks you to confirm, and produces a tarball named:
archeglyph-<workspace>-<dataset-slug>-<YYYYMMDD-HHMM>.tar.zst
The timestamp is UTC. The file is zstd-compressed; on a modern machine a dataset with tens of thousands of documents typically lands in the low hundreds of megabytes.
While the export is running you can close the page — the job continues server-side and an email with the download link arrives when it finishes. For datasets with tens of millions of chunks the export can take several minutes; the job status surfaces in the dataset’s events feed the same way extraction jobs do.
What is inside
Unpack the archive:
$ tar --zstd -xvf archeglyph-<…>.tar.zst
archeglyph-<…>/
├── README.txt
├── catalogue.sqlite
├── index.tantivy/
│ ├── meta.json
│ ├── … segment files …
├── embeddings.zvec
├── settings.json
└── manifest.json
- catalogue.sqlite is a plain sqlite database containing the tables for documents, pages, regions, extracted text (with its engine provenance), edits, clusters, cluster memberships, and the settings that were active at snapshot time. You can open it in any sqlite browser; the schema is documented in README.txt and mirrors the tables described in the platform docs.
- index.tantivy/ is the lexical search index, in tantivy’s on-disk format. It can be opened by any tantivy 0.22+ reader; you do not need Archeglyph to query it.
- embeddings.zvec is the compressed embedding store, one vector per chunk plus a small metadata header (model id, dimension, chunking recipe). The zvec format is documented in its repository; a short Python reader script is bundled as read_embeddings.py.
- settings.json is a human-readable copy of the dataset’s settings at the moment of export — engines, thresholds, chunking parameters. It is redundant with the sqlite catalogue but is present to make the snapshot legible without any database tooling.
- manifest.json lists every file, its SHA-256, and the snapshot schema version. Check the hashes after download if you intend to archive the tarball long-term.
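The hash check can be scripted. A minimal sketch, assuming manifest.json maps relative file paths to hex SHA-256 digests under a files key — the key names here are illustrative; the authoritative layout is described in the bundled README.txt:

```python
import hashlib
import json
from pathlib import Path

def verify_snapshot(root: str) -> list[str]:
    """Return the paths whose on-disk SHA-256 does not match the manifest."""
    base = Path(root)
    manifest = json.loads((base / 'manifest.json').read_text())
    mismatches = []
    # Assumed manifest shape: {"files": {"<relative path>": "<hex sha-256>", ...}}
    for rel_path, expected in manifest['files'].items():
        digest = hashlib.sha256((base / rel_path).read_bytes()).hexdigest()
        if digest != expected:
            mismatches.append(rel_path)
    return mismatches
```

An empty return list means the unpacked snapshot matches what was exported; anything else names the files to re-download.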
The tarball does not contain the raw source images. It contains references — a stable URL plus a SHA-256 — and a rehydrate.sh script that refetches the binaries from the original object store. This is a licensing choice: many source archives grant Archeglyph the right to process images but not to redistribute them. A future --with-images flag will bundle the binaries for researchers whose sources are fully open.
Opening a snapshot without Archeglyph
The design goal is that the snapshot opens with off-the-shelf tools. Three worked examples:
Browse the catalogue in sqlite
$ sqlite3 catalogue.sqlite
sqlite> .tables
documents regions texts clusters chunks settings engines
sqlite> SELECT count(*) FROM chunks;
sqlite> SELECT text FROM texts WHERE engine_id = 'qwen3-vl:235b-cloud' LIMIT 5;
Every row carries the engine id that produced it; joining texts to engines gives you the full provenance record in a single query.
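The same join works from Python's standard-library sqlite3 module. A sketch under assumed column names — texts.engine_id joining engines.id, with an engines.name column — which should be checked against the schema in README.txt:

```python
import sqlite3

def engine_provenance(db_path: str, limit: int = 5):
    """Fetch extracted text together with the engine that produced it.

    Column names (texts.engine_id -> engines.id, engines.name) are
    assumptions based on the table listing above, not a documented schema.
    """
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row  # lets us address columns by name
    rows = con.execute(
        "SELECT t.text, t.engine_id, e.name AS engine_name "
        "FROM texts AS t JOIN engines AS e ON e.id = t.engine_id "
        "ORDER BY t.rowid LIMIT ?",
        (limit,),
    ).fetchall()
    con.close()
    return [(r['engine_id'], r['engine_name'], r['text']) for r in rows]
```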
Search the lexical index from Python
from tantivy import Index

ix = Index.open('index.tantivy')
searcher = ix.searcher()
query = ix.parse_query('wharves OR galata', ['text'])
hits = searcher.search(query, limit=20)
for score, address in hits.hits:
    doc = searcher.doc(address)
    # tantivy documents hold a list of values per field; get_first
    # returns the first (here, only) value
    print(score, doc.get_first('document_id'), doc.get_first('page_no'),
          doc.get_first('text')[:80])
The tantivy Python bindings read Archeglyph’s snapshot indexes directly; the field names (text, document_id, page_no, region_id) are documented in README.txt.
Load the embeddings
from zvec import read

store = read('embeddings.zvec')
print(store.metadata)  # {'model': 'bge-small-en-v1.5', 'dim': 384, ...}
for chunk_id, vector in store:
    # use numpy, faiss, whatever
    ...
The embedding store carries enough metadata to reconstruct a search space without Archeglyph; the model id is what lets you (or a future reader) know whether they can mix these vectors with another corpus.
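Reconstructing a search space from those vectors needs nothing beyond numpy. A minimal cosine-similarity sketch — the toy identity-matrix vectors stand in for a real zvec store, and the function name is ours, not part of any Archeglyph API:

```python
import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k row vectors most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Dot product of unit vectors == cosine similarity; negate to sort descending.
    return np.argsort(-(v @ q))[:k]
```

Pair the returned indices with the chunk ids from the store iterator to get back to the catalogue rows.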
Citing a snapshot
A snapshot is citable. The recommended format:
Author, Dataset title, Archeglyph snapshot sha256:<…>, exported <YYYY-MM-DD>, archived at <url-or-doi>.
The manifest.json contains a snapshot_id which is the SHA-256 of the concatenated file hashes — that is the value to paste in the sha256: field. Two researchers with the same snapshot_id are guaranteed to be looking at bit-identical data.
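For readers who want to recompute the id rather than trust the manifest, a sketch under two explicit assumptions — that the hex digests are concatenated in sorted path order and hashed as ASCII; the snapshot_id field in manifest.json remains the authoritative value:

```python
import hashlib
import json

def snapshot_id(manifest_path: str) -> str:
    """Recompute a snapshot id from its manifest.

    Assumes manifest['files'] maps paths to hex SHA-256 digests and that
    the id hashes those digests concatenated in sorted path order — both
    assumptions about the format, not documented guarantees.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)
    concatenated = ''.join(manifest['files'][p] for p in sorted(manifest['files']))
    return hashlib.sha256(concatenated.encode('ascii')).hexdigest()
```

If the recomputed value matches the stored one, the ordering assumption holds for your snapshot and the check doubles as an integrity test.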
If you deposit the tarball in Zenodo or your institution’s repository, Archeglyph will accept the DOI on the dataset’s settings page and show it on the dataset’s landing card. That feature lands in M1-D.
Archiving versus re-importing
Two different verbs, two different use cases:
- Archiving — the tarball is the final form. You put it in a repository, you stop thinking about it. The three data stores inside — the sqlite catalogue, the tantivy index, the zvec embeddings — are all openable with tools older than Archeglyph; whatever happens to us, the research artefact survives.
- Re-importing — the same tarball can be loaded back into Archeglyph (⋯ → Import snapshot) and becomes a new dataset in your workspace. The original snapshot is not mutated; re-imports are a fork, not a load. This is how a collaborator receives your study.
Caveats we want to be honest about
- Not every settings field is carried. The snapshot preserves the engine selection, the chunking recipe, and the search configuration. Workspace-level things (billing, team membership, access policies) are intentionally not exported because they belong to a workspace, not a dataset.
- Image rehydration depends on upstream availability. If the source archive takes a document offline, the rehydrate.sh script will fail on that file. The extracted text, regions, index, and embeddings are untouched — you keep the scholarship, you lose the ability to redisplay the image.
- Snapshot schema will version. The format is at v1. Future versions will add fields, never remove them; a v1 reader will continue to open every snapshot produced today.
A short checklist before you call it done
- Download the tarball and verify the SHA-256s in manifest.json.
- Open catalogue.sqlite and confirm the document count matches what you expect.
- Archive the tarball somewhere with an addressable URL (institutional repository, Zenodo, S3 bucket with public read).
- If the study is published, paste the snapshot_id into the methods section and the DOI onto the dataset’s settings page in Archeglyph so other readers can find it.
A snapshot is not the end of a dataset’s life — it is the first moment it becomes a citizen of the scholarly record rather than a row in our database. That is what we built the format for.