OCR vs VLM: a practical chooser
A short, decision-oriented guide to picking the right extraction engine for your corpus. When Tesseract is the right default, when a VLM is worth the cost, and how to test the choice cheaply.
By Dipankar
This is a decision guide. If you want the reasoning behind it, read the companion article VLM vs OCR: when to pick what. If you just want to decide what to set your dataset's `extract_engine` to, start here.
The one-line answer
Default to Tesseract; escalate to a VLM per-region when the output looks wrong. This handles almost every corpus we have seen.
The remainder of this guide is a more nuanced version of that same answer, for the cases where the default isn’t right.
Pick Tesseract if
Any of these is true of your corpus:
- Printed text, typeset, post-1900.
- Scans at 300 dpi or better.
- Latin script, or a well-supported non-Latin script with a Tesseract language pack (Greek, Cyrillic, Arabic with `ara`, and so on).
- Your downstream use is lexical search or surveying, not publication of the extracted text.
- Your corpus is large enough that per-page VLM cost becomes a budget question.
Tesseract will produce good text quickly, the errors will be consistent, and you will have headroom to re-run troublesome pages with a VLM individually.
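The checklist above can be sketched as a simple pre-flight check. Everything here is illustrative, not an Archeglyph API: the `CorpusProfile` fields, the supported-script set, and the page-count threshold are assumptions you should adapt to your own corpus.

```python
from dataclasses import dataclass

# Scripts with solid Tesseract language packs (illustrative subset).
WELL_SUPPORTED_SCRIPTS = {"latin", "greek", "cyrillic", "arabic"}

@dataclass
class CorpusProfile:
    printed_post_1900: bool
    dpi: int
    script: str
    use: str          # "search", "survey", or "publication"
    page_count: int

def tesseract_is_a_good_default(c: CorpusProfile) -> bool:
    """Mirror the checklist: any one criterion argues for Tesseract."""
    return any([
        c.printed_post_1900,
        c.dpi >= 300,
        c.script in WELL_SUPPORTED_SCRIPTS,
        c.use in {"search", "survey"},
        c.page_count > 10_000,  # large enough that VLM cost is a budget question
    ])
```

Note the `any`: one matching criterion is enough, per the "any of these is true" framing above.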
Pick a VLM if
Any of these is true, and especially if more than one is:
- Heavy degradation: staining, bleed-through, torn edges, uneven exposure.
- Low-resolution scans (below ~200 dpi).
- Handwriting or mixed print/handwriting.
- Non-Latin scripts with limited Tesseract support (historical Ottoman, older scripts, or very stylised typography).
- Your downstream use is publication of the extracted text as a resource, where the error bar matters.
- The corpus is small enough that per-page VLM cost is affordable.
Pick the smallest VLM on the Ollama Cloud list that works on a sample. Larger VLMs cost more and are not always more accurate on extraction — some of them over-correct text in ways you may not want.
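One way to operationalise "the smallest VLM that works" is to walk the candidate list in size order and keep the first model that passes your sample check. The model names, parameter sizes, and the `passes_sample` callback below are placeholders for your own sample evaluation:

```python
from typing import Callable, Optional, Sequence

def pick_smallest_vlm(
    candidates: Sequence[tuple[str, float]],      # (model_id, size in billions of params)
    passes_sample: Callable[[str], bool],
) -> Optional[str]:
    """Return the smallest candidate that extracts the sample acceptably."""
    for model_id, _size in sorted(candidates, key=lambda m: m[1]):
        if passes_sample(model_id):
            return model_id
    return None  # nothing passed; revisit the sample or the candidate list

# Hypothetical candidates and sizes.
models = [("big-vlm", 72.0), ("mid-vlm", 11.0), ("small-vlm", 3.0)]
choice = pick_smallest_vlm(models, passes_sample=lambda m: m != "small-vlm")
# choice is "mid-vlm": the smallest model that passed the sample check.
```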
The hybrid default
Many corpora benefit from a hybrid approach:
- Dataset default: Tesseract. Runs on every region.
- Per-document override: a VLM, used when Tesseract output looks wrong on that document.
- Per-region re-run: available from the provenance badge in the review screen.
Archeglyph supports all three levels directly. No custom pipeline code is needed.
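The three levels compose by specificity. The precedence sketched below (region beats document beats dataset) is an assumption about how the overrides combine, not Archeglyph internals:

```python
from typing import Optional

def resolve_engine(
    dataset_default: str,
    doc_override: Optional[str] = None,
    region_override: Optional[str] = None,
) -> str:
    """Most specific setting wins: region, then document, then dataset."""
    return region_override or doc_override or dataset_default

# Typical hybrid setup: Tesseract everywhere, a VLM where it struggled.
engine = resolve_engine("tesseract", doc_override=None, region_override="some-vlm")
# engine is "some-vlm" for that region; other regions fall back to "tesseract".
```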
How to test cheaply before committing
Before setting the extraction engine for a large dataset, run this 20-minute check:
- Pick a representative subset. Twenty pages that span the visual range of your corpus — one clean page, one damaged page, one with unusual layout, one in the less-familiar script if your corpus has more than one.
- Upload the subset as a fresh dataset with Tesseract as the default.
- Skim the review screen for each page. Note the regions that look wrong.
- Re-run those regions from the provenance badge with a VLM of your choice.
- Compare side by side. The review screen will show both outputs attributed to their engines.
If Tesseract is right on 18 of 20 pages, stick with Tesseract and use per-region re-run as needed. If it is wrong on 5 or more, switch the dataset default to a VLM. If it is in the middle, consider the hybrid strategy above.
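The decision rule above reduces to a few lines. The thresholds are the ones stated in the text (right on 18 of 20 keeps Tesseract; wrong on 5 or more switches to a VLM; anything between suggests the hybrid):

```python
def recommend_default(wrong_pages: int, sample_size: int = 20) -> str:
    """Map a 20-page sample's error count to a dataset default."""
    if wrong_pages >= 5:
        return "vlm"
    if sample_size - wrong_pages >= 18:
        return "tesseract"
    return "hybrid"
```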
A quick triage table
| Situation | Default engine | Notes |
|---|---|---|
| 20th-century typeset print, 300+ dpi | Tesseract | Expect 95-99% character accuracy |
| 19th-century print, 300+ dpi | Tesseract | Add post-processing for systematic errors |
| Pre-1850 print, letterpress | Tesseract → VLM | Test a subset first; VLM often wins |
| Typewritten 20th-century documents | Tesseract | Very reliable |
| Degraded archival scans | VLM | Tesseract output will look like noise |
| Handwriting | VLM | Tesseract is not designed for this |
| Mixed print + handwriting | VLM | Mixed regions benefit from a VLM’s tolerance |
| Tables of numbers | Tesseract | Specify PSM mode in settings if results look disordered |
| Ottoman Turkish | VLM | In our experience with newspapers: noticeably better on ligatures |
| East Asian scripts (Chinese, Japanese) | VLM | Specialised OCR is an option; VLM is usually simpler |
Configuring the choice in Archeglyph
From the dataset’s Settings tab:
- Extraction engine: set to `tesseract` or any VLM id from the Ollama Cloud list.
- Tesseract language: set under the `extract_engine` sub-options when Tesseract is selected. Default is `eng`; change to `eng+fra`, `ara`, `ell`, etc., as your corpus requires.
- Saving the change applies to new documents. Existing documents keep their current extraction; to re-extract, use the per-document re-run button on the document’s review screen or (for the whole dataset) the “Re-extract all” action.
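Expressed as a configuration fragment, the same settings might look like this. Only `extract_engine` is named in this guide; the surrounding field names and layout are illustrative, since Archeglyph's settings live in the UI rather than a file:

```yaml
# Dataset-level extraction settings (illustrative shape, not a real config file)
extract_engine: tesseract        # or a VLM id from the Ollama Cloud list
extract_engine_options:
  tesseract:
    language: eng+fra            # eng is the default; ara, ell, etc. also valid
```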
Changing the extraction engine does not by itself invalidate embeddings or clusters: they derive from the text, not from the engine that produced it. Re-extraction, however, changes the text, so embeddings for re-extracted documents must be recomputed. Archeglyph surfaces this in the confirmation modal when you save a change that triggers a re-extraction.
Further reading
- The pipeline — where extraction sits in the full flow.
- VLM vs OCR: when to pick what — the reasoning and evidence behind the recommendations on this page.
- Transparency is a feature — why every extracted block names the engine that produced it.