Archeglyph

Transparency is a feature

Why every extracted text block in Archeglyph shows the model that produced it, and why we treat that disclosure as product surface rather than footer text.

By Dipankar

On this page
  1. The provenance badge
  2. Why this isn’t a footer
  3. What the product discloses
  4. What transparency is not
  5. Implications for our roadmap
  6. What we ask of readers

A researcher opens a cluster in an automated text-analysis tool. The cluster is titled “Migrations across the Bosphorus” and it contains forty-two fragments from a newspaper corpus. Two of those fragments look very wrong — the OCR is garbled, the sentences don’t quite close, one of them seems to contain the word “Golota” where the city is clearly Galata. A reasonable next question for the researcher is: which engine produced that text, and can I re-run it with something else?

Most tools don’t let that question get asked. The text just shows up. If the researcher is suspicious, they can either ignore the fragment, dig through logs they don’t have access to, or trust the cluster anyway. None of those are good answers for scholarship.

Archeglyph’s answer is to put the engine’s name next to the text.

The provenance badge

Every extracted text block in the product carries a small chip — we call it the ProvenanceBadge — that shows the engine and version responsible for that block: for example tesseract 5.3 or qwen3-vl:235b-cloud, plus a timestamp. Next to the badge is a “re-run with…” affordance that lets the researcher swap engines on that region without touching the rest of the document. The badge appears in the document review screen, in search results, and on every exemplar quotation inside a cluster card.

This sounds like a small UI element, and on the page it is. But the consequences run deep:

  • It forces the pipeline to be honest. If we can’t reliably attribute a text block to an engine, we can’t render the badge. That constraint shaped our data model: every extracted region stores its engine id, and re-runs don’t silently overwrite — they produce a new row with a new provenance stamp.
  • It turns failure into a question the researcher can answer. A garbled OCR line stops being “the machine failed” and becomes “Tesseract failed on this region; what happens if we try a VLM here?” The failure mode is legible, and so is the remedy.
  • It makes cross-engine comparison part of normal reading. When the cluster view shows that forty of the forty-two exemplars came from tesseract 5.3 and two came from qwen3-vl:235b-cloud, the researcher can start forming intuitions about which engine earns its cost on which kind of page.
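The append-only constraint in the first bullet can be sketched as follows. The names (`ExtractionRow`, `rerun_region`, `latest`) and field layout are hypothetical, not Archeglyph's actual schema; the point is only that a re-run appends a new stamped row rather than overwriting the old one:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical shapes for illustration; not the product's real data model.
@dataclass(frozen=True)
class ExtractionRow:
    seq: int            # monotonically increasing row id
    region_id: str
    engine_id: str      # e.g. "tesseract 5.3" or "qwen3-vl:235b-cloud"
    text: str
    extracted_at: str   # ISO-8601 provenance timestamp

def rerun_region(rows: list[ExtractionRow], region_id: str,
                 engine_id: str, text: str) -> list[ExtractionRow]:
    """Re-runs append a new row with a fresh provenance stamp;
    earlier rows for the same region are never overwritten."""
    stamp = datetime.now(timezone.utc).isoformat()
    return rows + [ExtractionRow(len(rows), region_id, engine_id, text, stamp)]

def latest(rows: list[ExtractionRow], region_id: str) -> ExtractionRow:
    """The badge renders the most recent row for a region."""
    return max((r for r in rows if r.region_id == region_id),
               key=lambda r: r.seq)
```

Because every row survives, the "re-run with…" affordance is just another append, and the old attribution remains inspectable.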

Why this isn’t a footer

The easy thing to do is put a line at the bottom of a report that says “generated using an AI-assisted pipeline.” Every vendor does this and it satisfies nothing. A footer says: there is a machine somewhere, and the output might be wrong, and you should know that in the abstract. A badge next to each block says: this specific sentence was produced by this specific engine at this specific time, and here is the button to try again with a different one.

The first is a legal disclosure. The second is a piece of scholarly apparatus.

What the product discloses

In M0 the badge surface covers:

  • Layout regions. Each region’s kind (headline, body, caption, figure, table) and the model that assessed the layout — e.g. gemma3:27b-cloud — with a confidence score when the model returns one.
  • Extracted text. The engine that read each region and its version. For Tesseract that’s the binary version. For a VLM that’s the full Ollama tag.
  • Cluster theme titles. When a small text LLM is used to polish the top TF-IDF terms into a 4–6 word title, the title discloses the model that wrote it. The summary sentence gets the same treatment.
  • Embeddings. Every chunk stores the embedding model id, and the search result UI surfaces it when the user hovers on a hit — because if you switch from MiniLM to BGE, results can reorder, and that reordering deserves a trail.
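One way to read the constraint behind this list: an artifact without its required provenance fields cannot render a badge at all. The mapping below is a minimal sketch under assumed field names (`PROVENANCE_FIELDS`, `can_render_badge` are illustrative, not product API):

```python
# Hypothetical required fields per artifact type, mirroring the M0 list above.
# Layout confidence is surfaced only when the model returns one, so it is
# deliberately not in the required set.
PROVENANCE_FIELDS = {
    "layout_region":  ["kind", "layout_model"],           # e.g. gemma3:27b-cloud
    "extracted_text": ["engine", "engine_version"],       # binary version or full Ollama tag
    "cluster_title":  ["title_model"],
    "embedding":      ["embedding_model"],                # e.g. MiniLM vs BGE
}

def badge_fields(artifact_type: str) -> list[str]:
    """Which provenance fields the badge requires for an artifact type."""
    return PROVENANCE_FIELDS[artifact_type]

def can_render_badge(artifact: dict) -> bool:
    """No attribution, no badge: every required field must be present."""
    return all(artifact.get(f) for f in badge_fields(artifact["type"]))
```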

What transparency is not

Transparency is not the same as openness. We do not claim the weights of the VLMs we call are open or auditable. We do not claim you can reproduce a cluster bit-for-bit six months from now if the upstream Ollama model has been retrained. What we do claim — and what the badge delivers — is a second-order guarantee: at the time you are looking at this output, you can see exactly what produced it. From there, if a claim matters, you can re-run the relevant step with a different engine and compare.

That is enough for scholarship to work. A footnote that names the edition does not promise the edition is correct; it promises the reader can go look. The provenance badge is the same promise in a different medium.

Implications for our roadmap

Treating provenance as surface shapes what we build next:

  1. Engine catalogue is a first-class object. Not a config file; a database table, with a nightly reconciliation job that flags stale ids. If an engine disappears upstream, the dataset settings page warns you that your chosen default is no longer available.
  2. Re-run is cheap. The pipeline is fingerprinted per stage, so re-running extraction on one region with a different engine costs only that region, not the whole document. The badge only makes sense if the “re-run with…” button is painless.
  3. The advanced toggle exists, but it’s not the default. Confidence histograms, outlier scores, UMAP projections — those matter when you’re debugging a pipeline, not when you’re reading a cluster. They live behind an explicit toggle on each cluster card.
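Per-stage fingerprinting, as in point 2, amounts to keying cached work on the (stage, engine, input) triple, so swapping the engine on one region invalidates only that region's entry. A minimal sketch, with hypothetical names (`stage_fingerprint`, `extract`, and the in-memory `cache` are illustrative only):

```python
import hashlib

def stage_fingerprint(stage: str, engine_id: str, region_bytes: bytes) -> str:
    """Cache key per (stage, engine, input region). If none of the three
    change, cached output is reused."""
    h = hashlib.sha256()
    for part in (stage.encode(), engine_id.encode(), region_bytes):
        h.update(len(part).to_bytes(8, "big"))  # length-prefix each part
        h.update(part)
    return h.hexdigest()

cache: dict[str, str] = {}

def extract(stage, engine_id, region_bytes, run_engine):
    """Run the engine only on a cache miss; re-runs with a new engine
    touch just this region, never the whole document."""
    key = stage_fingerprint(stage, engine_id, region_bytes)
    if key not in cache:
        cache[key] = run_engine(region_bytes)
    return cache[key]
```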

What we ask of readers

When you use Archeglyph outputs in published work, please cite the engine. The product makes it easy — the badge text is already the citation string. In return, we commit to keeping the badges stable: an engine id that appears in one snapshot will resolve to the same model identity in all future snapshots, even if we retire the engine and archive the weights metadata.
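Since the badge text doubles as the citation string, rendering it is a one-liner. The exact format below is an assumption for illustration; the document does not specify it:

```python
def citation_string(engine_id: str, version: str, timestamp: str) -> str:
    """Hypothetical badge-to-citation rendering; format is an assumption."""
    return f"{engine_id} {version}, extracted {timestamp}, via Archeglyph"
```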

Transparency isn’t a privacy stance or a compliance checkbox. It’s the piece of product surface that lets a researcher do their job without trusting us more than they should.