Archeglyph

Your first dataset

End-to-end walkthrough: sign in, create a dataset, upload pages, watch the pipeline run, review a document, and run your first search.

By Dipankar

On this page
  1. Step 1 — Sign in
  2. Step 2 — Create a dataset
  3. Step 3 — Upload files
  4. Step 4 — Watch the pipeline
  5. Step 5 — Review a document
  6. Step 6 — Run a search
  7. Step 7 — Open the cluster browser
  8. Step 8 — Settings and snapshots
  9. What next

This guide takes you from zero to a searchable dataset in about 15 minutes of active time, plus however long the pipeline takes to run on your uploads. You will need: a browser, an email address, and a handful of scanned pages in PDF or image form — twenty pages is a good starting size.

Step 1 — Sign in

Archeglyph uses magic-link sign-in. Visit /app/login, enter your email, and click the link we send you. There is no password to choose or remember. The magic link expires 15 minutes after issue; if you miss the window, request another one.

The session cookie set on successful sign-in (ag_sess) is httpOnly and lasts 30 days. If you sign out, or if 30 days of inactivity pass, you will be asked for another magic link.

Step 2 — Create a dataset

From the datasets page, click New dataset. You will be asked for:

  • A name — human-readable. “Constantinople newspapers 1920s” is fine.
  • A slug — the URL-safe identifier. Derived from the name; you can edit it.
  • A description — one or two sentences for your own future reference.

The first dataset you create uses Archeglyph’s safe defaults: Tesseract for extraction, MiniLM-L6 for embeddings, the smallest current Ollama Cloud VLM for layout assessment and cluster labels. You can change any of these later from the dataset’s Settings tab, and the settings page will tell you which of your stored state (embeddings, clusters) would need to be rebuilt if you do.

Step 3 — Upload files

On the new dataset’s page, click Upload. You can drag PDFs or image files directly onto the page, or pick them from a file dialog. Archeglyph:

  • Hashes each file. Duplicate uploads are detected and skipped.
  • Accepts PDFs up to 500 MB and individual images up to 50 MB.
  • Begins the pipeline automatically once a file has finished uploading.
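Those first two checks amount to a small gatekeeper. A minimal sketch, assuming a SHA-256 content hash (the source says only that files are hashed) and the size limits quoted above:

```python
import hashlib
from pathlib import Path

MAX_PDF_BYTES = 500 * 1024 * 1024   # PDFs up to 500 MB
MAX_IMAGE_BYTES = 50 * 1024 * 1024  # individual images up to 50 MB


def file_digest(path: Path) -> str:
    """Hash the file's contents in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def accept_upload(path: Path, seen: set[str]) -> bool:
    """Return True if the file should enter the pipeline."""
    limit = MAX_PDF_BYTES if path.suffix.lower() == ".pdf" else MAX_IMAGE_BYTES
    if path.stat().st_size > limit:
        return False
    digest = file_digest(path)
    if digest in seen:
        return False  # duplicate upload: detected and skipped
    seen.add(digest)
    return True
```

Note that deduplication is by content, not filename: uploading the same pages under a different name is still a skip.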

You will see each file appear as a row in the document table. Its status column starts at uploaded and advances through assessed, extracted_text, chunked, embedded, indexed, and clustered to ready as the pipeline runs. The updates arrive over a server-sent stream, so no refresh is needed — the column updates in place.

Step 4 — Watch the pipeline

For a typical 20-page upload, you will see:

  • Upload complete in a few seconds (depends on your connection).
  • Assess complete in a minute or two — this is the VLM looking at each page and returning regions.
  • Extract complete in under a minute — Tesseract is fast.
  • Analyse complete in another minute — chunking, embedding, indexing, clustering.

Five to ten minutes wall-clock is a fair estimate for twenty pages on a fresh dataset. Longer documents with complex layouts will take longer; the progress bar on each row reflects per-stage completion.

If anything fails, the row shows an error badge with a Retry button. The pipeline is fingerprinted per stage so retries re-run only the failing stage.
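The per-stage fingerprinting is what makes retries cheap. The exact scheme is not documented here; one plausible sketch is to fingerprint each stage run by its input digest plus the stage's configuration, and re-run only stages whose fingerprint has no recorded successful run:

```python
import hashlib
import json


def stage_fingerprint(input_digest: str, stage: str, config: dict) -> str:
    """Fingerprint a stage run by its input and its configuration."""
    payload = json.dumps(
        {"input": input_digest, "stage": stage, "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def stages_to_run(stages, cache, input_digest, configs):
    """Return the stages without a cached successful run.

    `cache` holds fingerprints of successful runs; a failed stage never
    enters the cache, so a retry re-runs only that stage.
    """
    pending = []
    for stage in stages:
        fp = stage_fingerprint(input_digest, stage, configs.get(stage, {}))
        if fp not in cache:
            pending.append(stage)
    return pending
```

The same mechanism explains why changing a setting later (Step 8) triggers rebuilds: a new config means a new fingerprint, so the affected stages come back as pending.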

Step 5 — Review a document

Once a document’s status hits extracted_text, its Review link becomes live. Click it for one document. You will land on a three-pane screen:

  • The source image on the left, with region bounding boxes overlaid.
  • The extracted text in the middle, one editable block per region. Each block has a ProvenanceBadge showing the engine that produced it.
  • A metadata panel on the right: confidence histogram, engine choices, per-region re-run buttons, and an escape hatch to re-run the whole document from segmentation.

Scroll through the text. Click a region in the image and the corresponding text block highlights. If a block looks garbled, click the “re-run with…” affordance on its provenance badge and pick a different engine. Re-running processes just that region, typically in seconds.

When you are satisfied, click Accept. The document’s status advances and the next stages (chunking, embedding, indexing, clustering) proceed over the accepted text. You can skip review entirely for corpora where that level of care is not needed.

Keyboard shortcuts help here: j and k move between regions, e opens the editor on the current region, r opens the re-run menu, Enter accepts the region, Esc cancels.

Step 6 — Run a search

Once at least one document is ready, the dataset’s Search tab works. Type a query and you will get back snippets from the dataset’s text, each with:

  • The document and page they come from.
  • The matching phrases highlighted.
  • The ProvenanceBadge for the extracted block they came from.
  • A relevance score that combines lexical (Tantivy BM25) and semantic (zvec cosine) scores via reciprocal rank fusion.

Use the Lexical | Hybrid | Semantic toggle at the top of the search box to change the search mode. Lexical is best when you know the exact phrase; semantic is best when you are searching for a concept; hybrid — the default — generally works well for both.
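Reciprocal rank fusion itself is a standard, simple algorithm: each result contributes 1/(k + rank) from every ranking it appears in, and the sums are sorted. A minimal sketch (the constant k = 60 is the common default from the RRF literature; Archeglyph's actual value is not documented here):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each contributes 1/(k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because only ranks matter, RRF never has to reconcile BM25 scores with cosine similarities directly — a document that places well in both lists rises to the top even if the raw score scales are incomparable.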

Step 7 — Open the cluster browser

Click the Clusters tab. You will see a grid of cluster cards; each card leads with a theme title, a one-sentence summary, and three exemplar fragments. Pick the card that looks most interesting and click Open cluster. You will land in the fragment neighbourhood view — all of the cluster’s fragments with ±1 sentence of surrounding context, grouped by document.

The fragment neighbourhood is where much of the research happens: read the fragments, flag the ones that matter, and click through to the source page for the full context. Flags and notes are per-user and persist across sessions.

If you want to see the more ML-flavoured view, click Advanced on any cluster card. That reveals the probability histogram, outlier scores, and a UMAP scatter. These are secondary by design; see Reading clusters as a researcher for why.

Step 8 — Settings and snapshots

Visit the dataset’s Settings tab. Every default Archeglyph uses for this dataset is visible there and editable: the layout VLM, the extraction engine, the embedding model, the cluster-label LLM, and the clustering parameters. Saving a change that invalidates derived state (notably changing the embedding model) surfaces an explicit confirmation modal that tells you what will be rebuilt and what it will cost.
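The confirmation modal is driven by a dependency relationship: each setting invalidates some suffix of the derived state. A sketch of that idea, with an illustrative dependency table (the real list lives in Archeglyph's settings page):

```python
# Illustrative only: which derived state each setting invalidates.
INVALIDATES = {
    "extraction_engine": ["extracted_text", "chunks", "embeddings", "index", "clusters"],
    "embedding_model": ["embeddings", "index", "clusters"],
    "clustering_params": ["clusters"],
}

# Derived state in pipeline order, so rebuilds run earliest-first.
REBUILD_ORDER = ["extracted_text", "chunks", "embeddings", "index", "clusters"]


def rebuild_plan(changed: list[str]) -> list[str]:
    """Union of derived state to rebuild, in pipeline order."""
    needed = {s for key in changed for s in INVALIDATES.get(key, [])}
    return [s for s in REBUILD_ORDER if s in needed]
```

This is why changing the embedding model is called out as the notable case: everything downstream of embeddings has to be rebuilt, while a clustering-parameter tweak touches only the clusters.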

The settings page also has an Export snapshot button. A dataset snapshot is a single compressed archive of the lexical index, the vector index, and the metadata database. You can download it, back it up, and later re-upload it to restore the dataset exactly. This is the “one file” property mentioned on the landing page.
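A snapshot like this is conceptually just the three stores bundled into one compressed archive. A minimal sketch — the component names and on-disk layout here are assumptions, not Archeglyph's documented format:

```python
import tarfile
from pathlib import Path

# Assumed layout: lexical index dir, vector index dir, metadata database file.
COMPONENTS = ["lexical_index", "vector_index", "metadata.db"]


def export_snapshot(dataset_dir: Path, out_path: Path) -> Path:
    """Bundle the three stores into a single compressed archive."""
    with tarfile.open(out_path, "w:gz") as tar:
        for name in COMPONENTS:
            tar.add(dataset_dir / name, arcname=name)
    return out_path


def restore_snapshot(archive: Path, dest: Path) -> None:
    """Unpack a snapshot so the dataset can be restored exactly."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

The useful property is exactly the one the landing page advertises: one file to download, back up, and re-upload, with no separate database dump to keep in sync.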

What next