Why Archeglyph cannot hallucinate
Hallucination is a property of generative systems. Archeglyph isn't one. Every line of text the system surfaces was already in the source corpus — and we can show you which page it came from.
By Dipankar Sarkar
A philologist recently asked us a sharper version of the question that’s quietly haunting every research tool right now: “How do I know your tool isn’t making things up?”
It’s the right question to ask. The honest answer is short: Archeglyph is not a generative system, so it cannot hallucinate the things you read in it. The longer answer is worth writing down because it explains an architectural choice we made very early, and it explains why that choice is the reason we exist.
What “hallucination” actually means
The word is used loosely. In the literature it has a specific shape: a generative model produces content that is fluent and plausible but unfaithful to its inputs — a quotation that was never said, a citation that doesn’t exist, a date that is off by twenty years and stated with total confidence. The failure mode is intrinsic to how the system works: a language model is trained to produce the next likely token, not the next true token. Plausibility is the optimised target. Truthfulness is, at best, correlated.
This is why retrieval-augmented generation, careful prompting, and chain-of-thought tricks help but never close the gap. They lower the hallucination rate. They don’t change what kind of system you’re using.
Where Archeglyph’s text comes from
Walk through the pipeline. At every step we can name the source of the text on screen.
1. The page image. The starting point is a researcher-uploaded PDF or image scan. The bytes don’t change. The original is preserved in object storage and re-downloadable forever.
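The immutability claim in step 1 is the kind of thing a researcher can check themselves: a content digest taken at upload time will match the re-downloaded bytes forever. A minimal sketch (the digest choice and variable names here are illustrative, not a description of Archeglyph's internals):

```python
import hashlib

def digest(data: bytes) -> str:
    # Content hash of the uploaded scan; any change to the bytes changes this.
    return hashlib.sha256(data).hexdigest()

# Stand-in for an uploaded scan's raw bytes.
original = b"%PDF-1.4 ...scanned page bytes..."
stored_digest = digest(original)

# Later: re-download from object storage and verify nothing changed.
redownloaded = original  # in practice, fetched back from storage
assert digest(redownloaded) == stored_digest
```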
2. Region detection. A vision model (or a CV fallback) draws boxes on the page. The model’s only output is coordinates and a label (headline / body / caption / figure / table). It does not produce text. If the model invents a region that isn’t there, we crop air, and the OCR step that follows produces empty text, which is easy to notice.
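The contract in step 2 can be made concrete: the detector’s entire output vocabulary is a box and a label, so there is no field where invented text could live. A sketch of that contract (the class and field names are hypothetical, not Archeglyph’s actual schema):

```python
from dataclasses import dataclass

# Illustrative sketch of the detector's output contract.
# Names are hypothetical, not Archeglyph's real types.
LABELS = {"headline", "body", "caption", "figure", "table"}

@dataclass(frozen=True)
class Region:
    x0: float  # bounding box, page coordinates
    y0: float
    x1: float
    y1: float
    label: str  # one of LABELS -- and nothing else; no text field exists

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown region label: {self.label}")

# An "invented" region is just a box over blank paper: the OCR pass
# over its crop returns empty text, which is easy to spot downstream.
r = Region(10.0, 20.0, 200.0, 60.0, "headline")
```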
3. Text extraction. Tesseract or a vision-language model is given a single cropped region and asked: “Read what’s on this image, faithfully.” This is the only step where a model could plausibly “add” text that wasn’t there. We mitigate the risk three ways:
- The image and the extracted text are kept side-by-side in the review UI. Hover a region; the bbox highlights on the source page.
- Every region is stamped with the engine that produced its text and a confidence score.
- The dataset technique note (auto-generated, clearly labelled as such) tells the researcher how many regions were Tesseract-read versus VLM-read. A researcher can audit by sampling.
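The second and third mitigations above can be sketched together: once every region is stamped with its engine and confidence, the engine tally for the technique note is a one-liner, and an audit sample falls out of a sort. (The record shape and field names below are illustrative.)

```python
from collections import Counter

# Hypothetical extracted-region records, as stamped by the extraction step.
regions = [
    {"text": "MAYOR OPENS NEW BRIDGE", "engine": "tesseract", "confidence": 0.91},
    {"text": "The ceremony drew a large crowd.", "engine": "tesseract", "confidence": 0.88},
    {"text": "Fig. 3: the bridge at dusk.", "engine": "vlm", "confidence": 0.74},
]

# Engine tally for the auto-generated technique note.
by_engine = Counter(r["engine"] for r in regions)
print(by_engine)  # Counter({'tesseract': 2, 'vlm': 1})

# A researcher auditing by sampling would start with the
# lowest-confidence regions first.
audit_queue = sorted(regions, key=lambda r: r["confidence"])[:2]
```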
4. Chunking, embedding, indexing. These are deterministic operations. syntok splits the extracted text on sentence boundaries. A sentence-transformer turns each chunk into a vector. Tantivy indexes the words for full-text search. None of these steps adds text. They make the existing text findable.
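The determinism claim in step 4 is testable: chunking is a pure rearrangement of its input. A toy stand-in for the sentence splitter (Archeglyph uses syntok; the regex below is a simplified substitute, not syntok’s actual algorithm):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Simplified stand-in for syntok: split on sentence-final
    # punctuation followed by whitespace. Deterministic; adds nothing.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

extracted = "The bridge opened in May. Crowds gathered! Rain held off."
chunks = split_sentences(extracted)
print(chunks)
# ['The bridge opened in May.', 'Crowds gathered!', 'Rain held off.']

# The invariant worth holding any chunker to: every chunk is a
# literal substring of the extracted text.
assert all(c in extracted for c in chunks)
```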
5. Clustering. HDBSCAN groups vectors. The output is which chunk is in which cluster. There is no language generation here.
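Step 5’s output really is just a membership map. With HDBSCAN the result of fitting is one integer label per chunk, with -1 meaning noise; here the labels are hard-coded so the sketch stays dependency-free:

```python
from collections import defaultdict

# Hypothetical output of HDBSCAN over four chunk vectors:
# one integer label per chunk; -1 = noise / unclustered.
chunk_ids = ["c1", "c2", "c3", "c4"]
labels = [0, 0, 1, -1]  # stand-in for clusterer.fit_predict(vectors)

# The entire "output" of the clustering step, materialised:
clusters = defaultdict(list)
for chunk_id, label in zip(chunk_ids, labels):
    clusters[label].append(chunk_id)

print(dict(clusters))  # {0: ['c1', 'c2'], 1: ['c3'], -1: ['c4']}
```

There is nowhere in this step for generated language to appear: the inputs are vectors, the outputs are integers.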
6. Cluster theme titles. Yes, this step uses an LLM. The LLM is given the top TF-IDF terms for a cluster plus a handful of sample sentences, and asked to produce a four-to-six word label. The label is shown with a ProvenanceBadge naming the model. If a researcher doubts a label, they read the exemplars beneath it — which are real quotations from the corpus, not LLM output.
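The LLM’s input in step 6 is itself a deterministic artefact. A minimal sketch of extracting top TF-IDF terms per cluster (deliberately simplified: raw counts, no stemming or stopword removal, so the weighting differs from a production vectoriser):

```python
import math
from collections import Counter

# Hypothetical clusters of chunk texts.
clusters = {
    0: ["the bridge opened in may", "crowds gathered at the bridge"],
    1: ["grain prices fell sharply", "grain exports resumed"],
}

def top_tfidf_terms(clusters, k=3):
    # Document frequency: in how many clusters does each term appear?
    df = Counter()
    for docs in clusters.values():
        df.update({w for d in docs for w in d.split()})
    n = len(clusters)
    out = {}
    for cid, docs in clusters.items():
        tf = Counter(w for d in docs for w in d.split())
        # Smoothed IDF; a production pipeline would use a proper vectoriser.
        scored = {w: tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf}
        out[cid] = [w for w, _ in sorted(scored.items(), key=lambda x: -x[1])[:k]]
    return out

print(top_tfidf_terms(clusters))
```

Only after this deterministic extraction does the LLM see anything, and all it is asked for is a short label over these terms and a few sampled sentences.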
7. The dataset technique note. Three to five sentences describing how the dataset was processed. Generated by a small model from the known engine choices and the known counts of files, regions, and chunks. We cap its length, and if the model’s output is missing or malformed we fall back to a deterministic template. The note carries a “this summary is automatically generated” caveat in every version.
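The fallback behaviour in step 7 can be sketched as a single function: take the small model’s output if it is usable, fall back to a deterministic template if not, and in either case cap the length and append the caveat. (The function name, the cap value, and the malformed-ness check are illustrative, not Archeglyph’s actual logic.)

```python
CAVEAT = "This summary is automatically generated."
MAX_WORDS = 100  # illustrative cap, not Archeglyph's real limit

def technique_note(model_output, files: int, regions: int, chunks: int) -> str:
    text = (model_output or "").strip()
    # Treat empty or implausibly short output as malformed and
    # fall back to the deterministic template.
    if len(text.split()) < 3:
        text = (f"Processed {files} files into {regions} regions "
                f"and {chunks} chunks.")
    words = text.split()[:MAX_WORDS]  # hard length cap
    return " ".join(words) + " " + CAVEAT

# Model failed or returned garbage: deterministic template takes over.
print(technique_note(None, files=12, regions=340, chunks=2100))
# Model produced a usable note: capped and caveated, never trusted raw.
print(technique_note("OCR used Tesseract for most regions.", 12, 340, 2100))
```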
That is every model invocation in Archeglyph. None of them is asked to summarise the corpus. None of them is asked to answer a research question. None of them produces a paragraph that a researcher could mistake for primary text.
What we don’t do
We don’t have a chat interface. We don’t have a “summarise this collection” button. We don’t have an “ask a question of your archive” endpoint. Those are perfectly reasonable products to build — they’re just a different product. The research workflow they support is synthesis. The research workflow we support is reading.
We made this call deliberately, and we don’t expect to change it. A tool that synthesises will always be liable to hallucinate, no matter how careful the prompt engineering. Once a researcher has to audit each generated sentence for fabrication, the tool has stopped being a labour-saver and started being a liability.
What you can verify
If you’re evaluating Archeglyph, run this test:
- Upload a page you know cold.
- Watch the regions appear on the review screen. Open the bbox overlay. For each region, verify that the text is actually what’s on the image at that location.
- Run a search that you know should match. Verify every result is a real chunk from a real region.
- Open the cluster browser. Pick a cluster. Click an exemplar. It takes you back to the source page, with the highlighted region.
- Now try to find an unsupported claim in any of the surfaced text. You won’t, because there isn’t a step in the pipeline that could have produced one.
That’s the audit. It scales.
The promise, stated plainly
Archeglyph reads what is on the page. We disclose which model did the reading. We index, group, and surface what was read. We don’t write anything new on top of it. When we do generate (cluster titles, the note), we say so loudly and we keep it to under a hundred words.
This is the line we hold. Not because LLMs are bad — they’re useful for plenty of things — but because citing what you read is the foundational act of scholarship, and we want to be a tool a researcher can cite from without an audit trail of footnotes saying “the AI told me so”.
If your work needs the corpus to mean what it says on the page, Archeglyph is for you. If your work needs synthesis, we’ll happily recommend something else.