Archeglyph

Your first dataset

End-to-end walkthrough: sign in, create a dataset, upload pages, watch the pipeline run, review a document, and run your first search.

By Dipankar

On this page
  1. Step 1 — Sign in
  2. Step 2 — Create a dataset
  3. Step 3 — Upload files
  4. Step 4 — Watch the pipeline
  5. Step 5 — Review a document
  6. Step 6 — Run a search
  7. Step 7 — Open the cluster browser
  8. Step 8 — Settings and snapshots
  9. What next

This guide takes you from zero to a searchable dataset in about 15 minutes of active time, plus however long the pipeline takes to run on your uploads. You will need: a browser, an email address, and a handful of scanned pages in PDF or image form — twenty pages is a good starting size.

Step 1 — Sign in

Archeglyph uses magic-link sign-in. Visit /app/login, enter your email, and click the link we send you. There is no password to choose or remember. The magic link expires 15 minutes after issue; if you miss the window, request another one.

The session cookie set on successful sign-in (ag_sess) is httpOnly and lasts 30 days. If you sign out, or if 30 days of inactivity pass, you will be asked for another magic link.

Step 2 — Create a dataset

From the datasets page, click New dataset. You will be asked for:

  • A name — human-readable. “Constantinople newspapers 1920s” is fine.
  • A slug — the URL-safe identifier. Derived from the name; you can edit it.
  • A description — one or two sentences for your own future reference.

The first dataset you create uses Archeglyph’s safe defaults: Tesseract for extraction, MiniLM-L6 for embeddings, the smallest current Ollama Cloud VLM for layout assessment and cluster labels. You can change any of these later from the dataset’s Settings tab, and the settings page will tell you which of your stored state (embeddings, clusters) would need to be rebuilt if you do.

Step 3 — Upload files

On the new dataset’s page, click Upload. You can drag PDFs or image files directly onto the page, or pick them from a file dialog. Archeglyph:

  • Hashes each file. Duplicate uploads are detected and skipped.
  • Accepts PDFs up to 500 MB and individual images up to 50 MB.
  • Begins the pipeline automatically once a file has finished uploading.
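Those first two checks amount to a small gatekeeper. A minimal sketch, assuming a SHA-256 content hash (the source says only that files are hashed) and the size limits quoted above:

```python
import hashlib
from pathlib import Path

MAX_PDF_BYTES = 500 * 1024 * 1024   # PDFs up to 500 MB
MAX_IMAGE_BYTES = 50 * 1024 * 1024  # individual images up to 50 MB


def file_digest(path: Path) -> str:
    """Hash the file's contents in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def accept_upload(path: Path, seen: set[str]) -> bool:
    """Return True if the file should enter the pipeline."""
    limit = MAX_PDF_BYTES if path.suffix.lower() == ".pdf" else MAX_IMAGE_BYTES
    if path.stat().st_size > limit:
        return False
    digest = file_digest(path)
    if digest in seen:
        return False  # duplicate upload: detected and skipped
    seen.add(digest)
    return True
```

Note that deduplication is by content, not filename: uploading the same pages under a different name is still a skip.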

You will see each file appear as a row in the document table. Its status column starts at uploaded and advances through assessed, extracted_text, chunked, embedded, indexed, and clustered to ready as the pipeline runs. The updates arrive over a server-sent stream, so no refresh is needed — the column updates in place.

Step 4 — Watch the pipeline

For a typical 20-page upload, you will see:

  • Upload complete in a few seconds (depends on your connection).
  • Assess complete in a minute or two — this is the VLM looking at each page and returning regions.
  • Extract complete in under a minute — Tesseract is fast.
  • Analyse complete in another minute — chunking, embedding, indexing, clustering.

Five to ten minutes wall-clock is a fair estimate for twenty pages on a fresh dataset. Longer documents with complex layouts will take longer; the progress bar on each row reflects per-stage completion.

If anything fails, the row shows an error badge with a Retry button. The pipeline is fingerprinted per stage so retries re-run only the failing stage.
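The per-stage fingerprinting is what makes retries cheap. The exact scheme is not documented here; one plausible sketch is to fingerprint each stage run by its input digest plus the stage's configuration, and re-run only stages whose fingerprint has no recorded successful run:

```python
import hashlib
import json


def stage_fingerprint(input_digest: str, stage: str, config: dict) -> str:
    """Fingerprint a stage run by its input and its configuration."""
    payload = json.dumps(
        {"input": input_digest, "stage": stage, "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def stages_to_run(stages, cache, input_digest, configs):
    """Return the stages without a cached successful run.

    `cache` holds fingerprints of successful runs; a failed stage never
    enters the cache, so a retry re-runs only that stage.
    """
    pending = []
    for stage in stages:
        fp = stage_fingerprint(input_digest, stage, configs.get(stage, {}))
        if fp not in cache:
            pending.append(stage)
    return pending
```

The same mechanism explains why changing a setting later (Step 8) triggers rebuilds: a new config means a new fingerprint, so the affected stages come back as pending.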

Step 5 — Review a document

Once a document’s status hits extracted_text, its Review link becomes live. Click it for one document. You will land on a three-pane screen:

  • The source image on the left, with region bounding boxes overlaid.
  • The extracted text in the middle, one editable block per region. Each block has a ProvenanceBadge showing the engine that produced it.
  • A metadata panel on the right: confidence histogram, engine choices, per-region re-run buttons, and an escape hatch to re-run the whole document from segmentation.

Scroll through the text. Click a region in the image and the corresponding text block highlights. If a block looks garbled, click the “re-run with…” affordance on its provenance badge and pick a different engine. Re-running processes just that region, typically in seconds.

When you are satisfied, click Accept. The document’s status advances and the next stages (chunking, embedding, indexing, clustering) proceed over the accepted text. You can skip review entirely for corpora where that level of care is not needed.

Keyboard shortcuts help here: j and k move between regions, e opens the editor on the current region, r opens the re-run menu, Enter accepts the region, Esc cancels.

Step 6 — Run a search

Once at least one document is ready, the dataset’s Search tab works. Type a query and you will get back snippets from the dataset’s text, each with:

  • The document and page they come from.
  • The matching phrases highlighted.
  • The ProvenanceBadge for the extracted block they came from.
  • A relevance score that combines lexical (Tantivy BM25) and semantic (zvec cosine) scores via reciprocal rank fusion.

Use the Lexical | Hybrid | Semantic toggle at the top of the search box to change the search mode. Lexical is best when you know the exact phrase; semantic is best when you are searching for a concept; hybrid — the default — generally works well for both.
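Reciprocal rank fusion itself is a standard, simple algorithm: each result contributes 1/(k + rank) from every ranking it appears in, and the sums are sorted. A minimal sketch (the constant k = 60 is the common default from the RRF literature; Archeglyph's actual value is not documented here):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each contributes 1/(k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because only ranks matter, RRF never has to reconcile BM25 scores with cosine similarities directly — a document that places well in both lists rises to the top even if the raw score scales are incomparable.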

Step 7 — Open the cluster browser

Click the Clusters tab. You will see a grid of cluster cards; each card leads with a theme title, a one-sentence summary, and three exemplar fragments. Pick the card that looks most interesting and click Open cluster. You will land in the fragment neighbourhood view — all of the cluster’s fragments with ±1 sentence of surrounding context, grouped by document.

The fragment neighbourhood is where much of the research happens: read the fragments, flag the ones that matter, and click through to the source page for the full context. Flags and notes are per-user and persist across sessions.

If you want to see the more ML-flavoured view, click Advanced on any cluster card. That reveals the probability histogram, outlier scores, and a UMAP scatter. These are secondary by design; see Reading clusters as a researcher for why.

Step 8 — Settings and snapshots

Visit the dataset’s Settings tab. Every default Archeglyph uses for this dataset is visible there and editable: the layout VLM, the extraction engine, the embedding model, the cluster-label LLM, and the clustering parameters. Saving a change that invalidates derived state (notably changing the embedding model) surfaces an explicit confirmation modal that tells you what will be rebuilt and what it will cost.
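The confirmation modal is driven by a dependency relationship: each setting invalidates some suffix of the derived state. A sketch of that idea, with an illustrative dependency table (the real list lives in Archeglyph's settings page):

```python
# Illustrative only: which derived state each setting invalidates.
INVALIDATES = {
    "extraction_engine": ["extracted_text", "chunks", "embeddings", "index", "clusters"],
    "embedding_model": ["embeddings", "index", "clusters"],
    "clustering_params": ["clusters"],
}

# Derived state in pipeline order, so rebuilds run earliest-first.
REBUILD_ORDER = ["extracted_text", "chunks", "embeddings", "index", "clusters"]


def rebuild_plan(changed: list[str]) -> list[str]:
    """Union of derived state to rebuild, in pipeline order."""
    needed = {s for key in changed for s in INVALIDATES.get(key, [])}
    return [s for s in REBUILD_ORDER if s in needed]
```

This is why changing the embedding model is called out as the notable case: everything downstream of embeddings has to be rebuilt, while a clustering-parameter tweak touches only the clusters.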

The settings page also has an Export snapshot button. A dataset snapshot is a single compressed archive of the lexical index, the vector index, and the metadata database. You can download it, back it up, and later re-upload it to restore the dataset exactly. This is the “one file” property mentioned on the landing page.
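A snapshot like this is conceptually just the three stores bundled into one compressed archive. A minimal sketch — the component names and on-disk layout here are assumptions, not Archeglyph's documented format:

```python
import tarfile
from pathlib import Path

# Assumed layout: lexical index dir, vector index dir, metadata database file.
COMPONENTS = ["lexical_index", "vector_index", "metadata.db"]


def export_snapshot(dataset_dir: Path, out_path: Path) -> Path:
    """Bundle the three stores into a single compressed archive."""
    with tarfile.open(out_path, "w:gz") as tar:
        for name in COMPONENTS:
            tar.add(dataset_dir / name, arcname=name)
    return out_path


def restore_snapshot(archive: Path, dest: Path) -> None:
    """Unpack a snapshot so the dataset can be restored exactly."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

The useful property is exactly the one the landing page advertises: one file to download, back up, and re-upload, with no separate database dump to keep in sync.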

What next