VLM vs OCR: when to pick what
Notes from the newspapers prototype on when Tesseract is still the right choice, when a vision-language model earns its cost, and how to tell the difference before a full run.
By Dipankar · Last updated
A common framing in the digital humanities community right now is that vision-language models have made OCR obsolete. This is not what we found on the newspapers prototype. What we found, roughly, is that each engine has a regime where it is straightforwardly the better tool, and a middle regime where the choice depends on what you are going to do with the text afterwards. This article is our attempt to describe those regimes concretely enough that you can make the call on your own corpus.
Everything below is from our experience running a few thousand pages of archival newspapers through both pipelines and hand-checking the outputs. It is not a benchmark paper. Treat it as folklore from one project that we found held up.
Where Tesseract still wins
Tesseract — by which we mean a recent Tesseract 5 with LSTM and the right language packs — is, on our corpus, strictly better for:
- Clean, high-resolution print. 300+ dpi scans of 20th-century typeset text. The character accuracy on well-aligned Latin-script print is remarkably good, and Tesseract is fast and predictable in its failures.
- Heavy throughput. A page of newspaper text extracts in under a second on a modern CPU. A VLM run on the same page takes 10-60 seconds and a real amount of money. When the corpus is large and the downstream task is lexical search, the speed and cost ratio dominates.
- Cases where you will post-process. Tesseract’s errors are consistent. It mis-reads the same letter-pair the same way across a page. That consistency is a gift for deduplication, lexical normalisation, and any downstream pipeline that can correct systematic errors in bulk.
On our newspapers corpus, Tesseract hit character accuracy above 98% on a sample of well-scanned 1920s broadsheet pages, and the errors it did make were almost entirely in a fixed set of confusions (cl ↔ d, rn ↔ m, in ↔ m).
Where a VLM earns its cost
A vision-language model — in our case, various Ollama Cloud models that accept a region crop and return text — is straightforwardly the better tool for:
- Degraded scans. Faded print, show-through from the reverse page, heavy staining, tight gutters. A VLM’s language prior lets it read around damage that Tesseract refuses to touch.
- Non-Latin scripts with limited training data. We had a small set of Ottoman-Turkish pages. Tesseract’s Ottoman language pack is workable but the VLM’s Arabic-script handling was noticeably better — particularly on ligatures and diacritics.
- Handwriting. Tesseract is not a handwriting engine. There are specialised handwriting models; for mixed print/handwriting pages, a VLM is the pragmatic path.
- Mixed content. Pages with figures, tables, and running text intermixed — where the layout model has already produced a bbox but the bbox contents are heterogeneous. The VLM’s “just describe what’s in this crop” tolerance handles these better.
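For the "region crop in, text out" call itself, the shape of a request to an Ollama-style endpoint looks roughly like this. The model name, prompt, and endpoint below are placeholders, not the exact ones we used; Ollama's `/api/generate` does accept base64-encoded images for multimodal models:

```python
import base64

def build_vlm_request(crop_png: bytes, model: str = "llava") -> dict:
    """Request body for an Ollama-style /api/generate endpoint.
    'llava' and the prompt are illustrative placeholders."""
    return {
        "model": model,
        "prompt": "Transcribe the text in this image exactly. Return only the text.",
        "images": [base64.b64encode(crop_png).decode("ascii")],
        "stream": False,
    }

# POSTing it (sketch, not executed here):
#   import json, urllib.request
#   body = json.dumps(build_vlm_request(crop_bytes)).encode()
#   req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
#                                headers={"Content-Type": "application/json"})
#   text = json.loads(urllib.request.urlopen(req).read())["response"]
```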
The cost side is real. On a mid-sized VLM, per-page extraction at hosted rates runs roughly ten to a hundred times the operational cost of Tesseract on a CPU. For a 10,000-page project, that is the difference between “run it tonight” and “budget for a quarter.”
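To make the ratio concrete, a back-of-envelope using the per-page timings above (the 30 s figure is just the midpoint of the 10–60 s range; sequential calls, no batching):

```python
PAGES = 10_000
TESSERACT_S_PER_PAGE = 1.0   # "under a second on a modern CPU", rounded up
VLM_S_PER_PAGE = 30.0        # midpoint of the 10-60 s range quoted above

tesseract_hours = PAGES * TESSERACT_S_PER_PAGE / 3600
vlm_hours = PAGES * VLM_S_PER_PAGE / 3600

print(f"Tesseract: ~{tesseract_hours:.1f} h, VLM: ~{vlm_hours:.0f} h")
# → Tesseract: ~2.8 h, VLM: ~83 h
```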
The middle regime
Many corpora sit in a regime where either engine could plausibly work. In that regime the choice depends on what you will do next:
- Planning to do lexical search and snippet retrieval? Prefer Tesseract. Its consistent errors are easy to account for in a BM25-style index, and you will want the throughput.
- Planning to do semantic search or clustering? The choice is more subtle. Embedding models are surprisingly robust to moderate OCR noise — MiniLM still produces sensible cosine similarities on text that is 85-90% character-accurate. But once errors pass a threshold, clustering degrades: the fragments that end up in a cluster start including passages that share misreading patterns rather than topics. If you are seeing this on your own corpus (the tell is a cluster whose exemplars share an odd letter-confusion), a VLM run on the degraded pages will almost always tighten the clusters.
- Planning to publish the extracted text as a resource? Prefer the VLM. The bar for published text is higher than the bar for internal search, and the VLM’s error modes are less systematic — where it fails, it usually produces readable (if wrong) text rather than gibberish.
A concrete check before committing
If you are unsure which engine to pick for a new corpus, Archeglyph makes this check cheap:
- Upload 20 pages spanning the visual range of the corpus — a clean page, a damaged page, a page with unusual layout, a page in a less-familiar script.
- Run extraction with Tesseract.
- On those same 20 pages, re-run extraction per region with a VLM.
- Open the review screen and scan the two outputs side by side.
Because both extractions are stamped with their engine in the ProvenanceBadge, you can quickly see where they agree and where they diverge. Twenty pages is enough to form an opinion; on our corpus, the regions where the engines disagreed in a sample of 20 pages predicted the regions where they disagreed at the full 5,000-page scale almost exactly.
The hybrid strategy
The answer for large, heterogeneous corpora is usually neither pure-Tesseract nor pure-VLM. It is a hybrid:
- Run Tesseract as the default on every region. It is fast and cheap.
- Use the VLM as a targeted re-run for regions flagged as low-confidence by Tesseract (low word count, low mean per-character confidence, high symbol-to-letter ratio).
- Expose both outputs in the review screen and let the researcher accept either, or edit in place.
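The flagging rule in the second step can be sketched as a predicate. The thresholds are illustrative defaults, not Archeglyph's shipped values, and the confidences are assumed to be on Tesseract's 0–100 per-character scale:

```python
def needs_vlm_rerun(text: str, char_confidences: list[float],
                    min_words: int = 3,
                    min_mean_conf: float = 75.0,
                    max_symbol_ratio: float = 0.3) -> bool:
    """Flag a region for a targeted VLM re-run: too few words,
    low mean confidence, or too many symbols relative to letters."""
    words = text.split()
    if len(words) < min_words:
        return True
    if char_confidences and sum(char_confidences) / len(char_confidences) < min_mean_conf:
        return True
    letters = sum(c.isalpha() for c in text)
    symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
    if letters and symbols / letters > max_symbol_ratio:
        return True
    return False
```

Any one of the three signals is enough to flag; the point is to spend VLM money only where Tesseract is visibly struggling.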
Archeglyph supports this out of the box: per-region re-run with a different engine is a first-class operation, the pipeline fingerprints each stage so re-runs skip unchanged work, and the provenance badge keeps both outputs attributable.
The thing we got wrong
We built the newspapers prototype assuming VLM extraction would replace Tesseract wherever we could afford it. On the first large run we found two things we did not expect:
- VLM errors are less legible. When a VLM mis-reads a word, the misreading is often a plausible other word — “Galata” becomes “Golata” becomes, a paragraph later, “Gorata”. Tesseract’s errors look like OCR errors and are easy to spot. VLM errors look like paraphrases and are not.
- VLMs hallucinate structure. Given a crop that contains a half-visible column on one side, the VLM will sometimes confidently extract text from the half-visible column as if it were fully present. Tesseract, in the same situation, produces garbage that the reviewer can see is garbage.
Both of these argued for keeping Tesseract as the default and using the VLM as a targeted tool. We still think that is the right default for most humanities corpora, and it is the default Archeglyph ships with.