Why Archeglyph cannot hallucinate
Hallucination is a property of generative systems. Archeglyph isn't one. Every line of text the system surfaces was already in the source corpus — and we can show you which page it came from.
By Dipankar Sarkar
A philologist recently asked us a sharper version of the question that’s quietly haunting every research tool right now: “How do I know your tool isn’t making things up?”
It’s the right question to ask. The honest answer is short: Archeglyph is not a generative system, so it cannot hallucinate the things you read in it. The longer answer is worth writing down because it explains an architectural choice we made very early, and it explains why that choice is the reason we exist.
What “hallucination” actually means
The word is used loosely. In the literature it has a specific shape: a generative model produces content that is fluent and plausible but unfaithful to its inputs — a quotation that was never said, a citation that doesn’t exist, a date that is off by twenty years and stated with total confidence. The failure mode is intrinsic to how the system works: a language model is trained to produce the next likely token, not the next true token. Plausibility is the optimised target. Truthfulness is, at best, correlated.
This is why retrieval-augmented generation, careful prompting, and chain-of-thought tricks help but never close the gap. They lower the hallucination rate. They don’t change what kind of system you’re using.
Where Archeglyph’s text comes from
Walk through the pipeline. At every step we can name the source of the text on screen.
1. The page image. The starting point is a researcher-uploaded PDF or image scan. The bytes don’t change. The original is preserved in object storage and re-downloadable forever.
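The immutability claim in step 1 is the kind of thing a researcher can check themselves: a content digest taken at upload time will match the re-downloaded bytes forever. A minimal sketch (the digest choice and variable names here are illustrative, not a description of Archeglyph's internals):

```python
import hashlib

def digest(data: bytes) -> str:
    # Content hash of the uploaded scan; any change to the bytes changes this.
    return hashlib.sha256(data).hexdigest()

# Stand-in for an uploaded scan's raw bytes.
original = b"%PDF-1.4 ...scanned page bytes..."
stored_digest = digest(original)

# Later: re-download from object storage and verify nothing changed.
redownloaded = original  # in practice, fetched back from storage
assert digest(redownloaded) == stored_digest
```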
2. Region detection. A vision model (or a CV fallback) draws boxes on the page. The model’s only output is coordinates and a label (headline / body / caption / figure / table). It does not produce text. If the model invents a region that isn’t there, we crop air, and the OCR step that follows produces empty text, which is easy to notice.
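The contract in step 2 can be made concrete: the detector’s entire output vocabulary is a box and a label, so there is no field where invented text could live. A sketch of that contract (the class and field names are hypothetical, not Archeglyph’s actual schema):

```python
from dataclasses import dataclass

# Illustrative sketch of the detector's output contract.
# Names are hypothetical, not Archeglyph's real types.
LABELS = {"headline", "body", "caption", "figure", "table"}

@dataclass(frozen=True)
class Region:
    x0: float  # bounding box, page coordinates
    y0: float
    x1: float
    y1: float
    label: str  # one of LABELS -- and nothing else; no text field exists

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown region label: {self.label}")

# An "invented" region is just a box over blank paper: the OCR pass
# over its crop returns empty text, which is easy to spot downstream.
r = Region(10.0, 20.0, 200.0, 60.0, "headline")
```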
3. Text extraction. Tesseract or a vision-language model is given a single cropped region and asked: “Read what’s on this image, faithfully.” This is the only step where a model could plausibly “add” text that wasn’t there. We mitigate the risk three ways:
- The image and the extracted text are kept side-by-side in the review UI. Hover a region; the bbox highlights on the source page.
- Every region is stamped with the engine that produced its text and a confidence score.
- The dataset technique note (auto-generated, clearly labelled as such) tells the researcher how many regions were Tesseract-read versus VLM-read. A researcher can audit by sampling.
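The second and third mitigations above can be sketched together: once every region is stamped with its engine and confidence, the engine tally for the technique note is a one-liner, and an audit sample falls out of a sort. (The record shape and field names below are illustrative.)

```python
from collections import Counter

# Hypothetical extracted-region records, as stamped by the extraction step.
regions = [
    {"text": "MAYOR OPENS NEW BRIDGE", "engine": "tesseract", "confidence": 0.91},
    {"text": "The ceremony drew a large crowd.", "engine": "tesseract", "confidence": 0.88},
    {"text": "Fig. 3: the bridge at dusk.", "engine": "vlm", "confidence": 0.74},
]

# Engine tally for the auto-generated technique note.
by_engine = Counter(r["engine"] for r in regions)
print(by_engine)  # Counter({'tesseract': 2, 'vlm': 1})

# A researcher auditing by sampling would start with the
# lowest-confidence regions first.
audit_queue = sorted(regions, key=lambda r: r["confidence"])[:2]
```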
4. Chunking, embedding, indexing. These are deterministic operations. syntok splits the extracted text on sentence boundaries. A sentence-transformer turns each chunk into a vector. Tantivy indexes the words for full-text search. None of these steps adds text. They make the existing text findable.
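The determinism claim in step 4 is testable: chunking is a pure rearrangement of its input. A toy stand-in for the sentence splitter (Archeglyph uses syntok; the regex below is a simplified substitute, not syntok’s actual algorithm):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Simplified stand-in for syntok: split on sentence-final
    # punctuation followed by whitespace. Deterministic; adds nothing.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

extracted = "The bridge opened in May. Crowds gathered! Rain held off."
chunks = split_sentences(extracted)
print(chunks)
# ['The bridge opened in May.', 'Crowds gathered!', 'Rain held off.']

# The invariant worth holding any chunker to: every chunk is a
# literal substring of the extracted text.
assert all(c in extracted for c in chunks)
```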
5. Clustering. HDBSCAN groups vectors. The output is which chunk is in which cluster. There is no language generation here.
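Step 5’s output really is just a membership map. With HDBSCAN the result of fitting is one integer label per chunk, with -1 meaning noise; here the labels are hard-coded so the sketch stays dependency-free:

```python
from collections import defaultdict

# Hypothetical output of HDBSCAN over four chunk vectors:
# one integer label per chunk; -1 = noise / unclustered.
chunk_ids = ["c1", "c2", "c3", "c4"]
labels = [0, 0, 1, -1]  # stand-in for clusterer.fit_predict(vectors)

# The entire "output" of the clustering step, materialised:
clusters = defaultdict(list)
for chunk_id, label in zip(chunk_ids, labels):
    clusters[label].append(chunk_id)

print(dict(clusters))  # {0: ['c1', 'c2'], 1: ['c3'], -1: ['c4']}
```

There is nowhere in this step for generated language to appear: the inputs are vectors, the outputs are integers.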
6. Cluster theme titles. Yes, this step uses an LLM. The LLM is given the top TF-IDF terms for a cluster plus a handful of sample sentences, and asked to produce a four-to-six word label. The label is shown with a ProvenanceBadge naming the model. If a researcher doubts a label, they read the exemplars beneath it — which are real quotations from the corpus, not LLM output.
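The LLM’s input in step 6 is itself a deterministic artefact. A minimal sketch of extracting top TF-IDF terms per cluster (deliberately simplified: raw counts, no stemming or stopword removal, so the weighting differs from a production vectoriser):

```python
import math
from collections import Counter

# Hypothetical clusters of chunk texts.
clusters = {
    0: ["the bridge opened in may", "crowds gathered at the bridge"],
    1: ["grain prices fell sharply", "grain exports resumed"],
}

def top_tfidf_terms(clusters, k=3):
    # Document frequency: in how many clusters does each term appear?
    df = Counter()
    for docs in clusters.values():
        df.update({w for d in docs for w in d.split()})
    n = len(clusters)
    out = {}
    for cid, docs in clusters.items():
        tf = Counter(w for d in docs for w in d.split())
        # Smoothed IDF; a production pipeline would use a proper vectoriser.
        scored = {w: tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf}
        out[cid] = [w for w, _ in sorted(scored.items(), key=lambda x: -x[1])[:k]]
    return out

print(top_tfidf_terms(clusters))
```

Only after this deterministic extraction does the LLM see anything, and all it is asked for is a short label over these terms and a few sampled sentences.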
7. The dataset technique note. Three to five sentences describing how the dataset was processed. Generated by a small model from the known engine choices and the known counts of files, regions, and chunks. We cap its length, and if the model’s output is missing or malformed we fall back to a deterministic template. The note carries a “this summary is automatically generated” caveat in every version.
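The fallback behaviour in step 7 can be sketched as a single function: take the small model’s output if it is usable, fall back to a deterministic template if not, and in either case cap the length and append the caveat. (The function name, the cap value, and the malformed-ness check are illustrative, not Archeglyph’s actual logic.)

```python
CAVEAT = "This summary is automatically generated."
MAX_WORDS = 100  # illustrative cap, not Archeglyph's real limit

def technique_note(model_output, files: int, regions: int, chunks: int) -> str:
    text = (model_output or "").strip()
    # Treat empty or implausibly short output as malformed and
    # fall back to the deterministic template.
    if len(text.split()) < 3:
        text = (f"Processed {files} files into {regions} regions "
                f"and {chunks} chunks.")
    words = text.split()[:MAX_WORDS]  # hard length cap
    return " ".join(words) + " " + CAVEAT

# Model failed or returned garbage: deterministic template takes over.
print(technique_note(None, files=12, regions=340, chunks=2100))
# Model produced a usable note: capped and caveated, never trusted raw.
print(technique_note("OCR used Tesseract for most regions.", 12, 340, 2100))
```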
That is every model invocation in Archeglyph. None of them is asked to summarise the corpus. None of them is asked to answer a research question. None of them produces a paragraph that a researcher could mistake for primary text.
What we don’t do
We don’t have a chat interface. We don’t have a “summarise this collection” button. We don’t have an “ask a question of your archive” endpoint. Those are perfectly reasonable products to build — they’re just a different product. The research workflow they support is synthesis. The research workflow we support is reading.
We made this call deliberately, and we don’t expect to change it. A tool that synthesises will always be liable to hallucinate, no matter how careful the prompt engineering. Once a researcher has to audit each generated sentence for fabrication, the tool has stopped being a labour-saver and started being a liability.
What you can verify
If you’re evaluating Archeglyph, run this test:
- Upload a page you know cold.
- Watch the regions appear on the review screen. Open the bbox overlay. For each region, verify that the text is actually what’s on the image at that location.
- Run a search that you know should match. Verify every result is a real chunk from a real region.
- Open the cluster browser. Pick a cluster. Click an exemplar. It takes you back to the source page, with the highlighted region.
- Now try to find an unsupported claim in any of the surfaced text. You won’t, because there isn’t a step in the pipeline that could have produced one.
That’s the audit. It scales.
The promise, stated plainly
Archeglyph reads what is on the page. We disclose which model did the reading. We index, group, and surface what was read. We don’t write anything new on top of it. When we do generate (cluster titles, the note), we say so loudly and we keep it to under a hundred words.
This is the line we hold. Not because LLMs are bad — they’re useful for plenty of things — but because citing what you read is the foundational act of scholarship, and we want to be a tool a researcher can cite from without an audit trail of footnotes saying “the AI told me so”.
If your work needs the corpus to mean what it says on the page, Archeglyph is for you. If your work needs synthesis, we’ll happily recommend something else.