OCR vs VLM: a practical chooser
A short, decision-oriented guide to picking the right extraction engine for your corpus. When Tesseract is the right default, when a VLM is worth the cost, and how to test the choice cheaply.
By Dipankar
This is a decision guide. If you want the reasoning behind it, read the companion article VLM vs OCR: when to pick what. If you just want to decide what to set your dataset's `extract_engine` to, start here.
The one-line answer
Default to Tesseract; escalate to a VLM per-region when the output looks wrong. This handles almost every corpus we have seen.
The remainder of this guide is a more nuanced version of that same answer, for the cases where the default isn’t right.
Pick Tesseract if
Any of these is true of your corpus:
- Printed text, typeset, post-1900.
- Scans at 300 dpi or better.
- Latin script, or a well-supported non-Latin script with a Tesseract language pack (Greek, Cyrillic, Arabic with `ara`, and so on).
- Your downstream use is lexical search or surveying, not publication of the extracted text.
- Your corpus is large enough that per-page VLM cost becomes a budget question.
Tesseract will produce good text quickly, the errors will be consistent, and you will have headroom to re-run troublesome pages with a VLM individually.
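The checklist above can be sketched as a simple pre-flight check. Everything here is illustrative, not an Archeglyph API: the `CorpusProfile` fields, the supported-script set, and the page-count threshold are assumptions you should adapt to your own corpus.

```python
from dataclasses import dataclass

# Scripts with solid Tesseract language packs (illustrative subset).
WELL_SUPPORTED_SCRIPTS = {"latin", "greek", "cyrillic", "arabic"}

@dataclass
class CorpusProfile:
    printed_post_1900: bool
    dpi: int
    script: str
    use: str          # "search", "survey", or "publication"
    page_count: int

def tesseract_is_a_good_default(c: CorpusProfile) -> bool:
    """Mirror the checklist: any one criterion argues for Tesseract."""
    return any([
        c.printed_post_1900,
        c.dpi >= 300,
        c.script in WELL_SUPPORTED_SCRIPTS,
        c.use in {"search", "survey"},
        c.page_count > 10_000,  # large enough that VLM cost is a budget question
    ])
```

Note the `any`: one matching criterion is enough, per the "any of these is true" framing above.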
Pick a VLM if
Any of these is true, and especially if more than one is:
- Heavy degradation: staining, bleed-through, torn edges, uneven exposure.
- Low-resolution scans (below ~200 dpi).
- Handwriting or mixed print/handwriting.
- Non-Latin scripts with limited Tesseract support (historical Ottoman, older scripts, or very stylised typography).
- Your downstream use is publication of the extracted text as a resource, where the error bar matters.
- The corpus is small enough that per-page VLM cost is affordable.
Pick the smallest VLM on the Ollama Cloud list that works on a sample. Larger VLMs cost more and are not always more accurate on extraction — some of them over-correct text in ways you may not want.
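One way to operationalise "the smallest VLM that works" is to walk the candidate list in size order and keep the first model that passes your sample check. The model names, parameter sizes, and the `passes_sample` callback below are placeholders for your own sample evaluation:

```python
from typing import Callable, Optional, Sequence

def pick_smallest_vlm(
    candidates: Sequence[tuple[str, float]],      # (model_id, size in billions of params)
    passes_sample: Callable[[str], bool],
) -> Optional[str]:
    """Return the smallest candidate that extracts the sample acceptably."""
    for model_id, _size in sorted(candidates, key=lambda m: m[1]):
        if passes_sample(model_id):
            return model_id
    return None  # nothing passed; revisit the sample or the candidate list

# Hypothetical candidates and sizes.
models = [("big-vlm", 72.0), ("mid-vlm", 11.0), ("small-vlm", 3.0)]
choice = pick_smallest_vlm(models, passes_sample=lambda m: m != "small-vlm")
# choice is "mid-vlm": the smallest model that passed the sample check.
```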
The hybrid default
Many corpora benefit from a hybrid approach:
- Dataset default: Tesseract. Runs on every region.
- Per-document override: a VLM, used when Tesseract output looks wrong on that document.
- Per-region re-run: available from the provenance badge in the review screen.
Archeglyph supports all three levels directly. No custom pipeline code is needed.
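The three levels compose by specificity. The precedence sketched below (region beats document beats dataset) is an assumption about how the overrides combine, not Archeglyph internals:

```python
from typing import Optional

def resolve_engine(
    dataset_default: str,
    doc_override: Optional[str] = None,
    region_override: Optional[str] = None,
) -> str:
    """Most specific setting wins: region, then document, then dataset."""
    return region_override or doc_override or dataset_default

# Typical hybrid setup: Tesseract everywhere, a VLM where it struggled.
engine = resolve_engine("tesseract", doc_override=None, region_override="some-vlm")
# engine is "some-vlm" for that region; other regions fall back to "tesseract".
```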
How to test cheaply before committing
Before setting the extraction engine for a large dataset, run this 20-minute check:
- Pick a representative subset. Twenty pages that span the visual range of your corpus — one clean page, one damaged page, one with unusual layout, one in the less-familiar script if your corpus has more than one.
- Upload the subset as a fresh dataset with Tesseract as the default.
- Skim the review screen for each page. Note the regions that look wrong.
- Re-run those regions from the provenance badge with a VLM of your choice.
- Compare side by side. The review screen will show both outputs attributed to their engines.
If Tesseract is right on 18 of 20 pages, stick with Tesseract and use per-region re-run as needed. If it is wrong on 5 or more, switch the dataset default to a VLM. If it is in the middle, consider the hybrid strategy above.
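The decision rule above reduces to a few lines. The thresholds are the ones stated in the text (right on 18 of 20 keeps Tesseract; wrong on 5 or more switches to a VLM; anything between suggests the hybrid):

```python
def recommend_default(wrong_pages: int, sample_size: int = 20) -> str:
    """Map a 20-page sample's error count to a dataset default."""
    if wrong_pages >= 5:
        return "vlm"
    if sample_size - wrong_pages >= 18:
        return "tesseract"
    return "hybrid"
```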
A quick triage table
| Situation | Default engine | Notes |
|---|---|---|
| 20th-century typeset print, 300+ dpi | Tesseract | Expect 95-99% character accuracy |
| 19th-century print, 300+ dpi | Tesseract | Add post-processing for systematic errors |
| Pre-1850 print, letterpress | Tesseract → VLM | Test a subset first; VLM often wins |
| Typewritten 20th-century documents | Tesseract | Very reliable |
| Degraded archival scans | VLM | Tesseract output will look like noise |
| Handwriting | VLM | Tesseract is not designed for this |
| Mixed print + handwriting | VLM | Mixed regions benefit from a VLM’s tolerance |
| Tables of numbers | Tesseract | Specify PSM mode in settings if results look disordered |
| Ottoman Turkish | VLM | In our experience with newspapers: noticeably better on ligatures |
| East Asian scripts (Chinese, Japanese) | VLM | Specialised OCR is an option; VLM is usually simpler |
Configuring the choice in Archeglyph
From the dataset’s Settings tab:
- Extraction engine: set to `tesseract` or any VLM id from the Ollama Cloud list.
- Tesseract language: set under the `extract_engine` sub-options when Tesseract is selected. Default is `eng`; change to `eng+fra`, `ara`, `ell`, etc., as your corpus requires.
- Saving the change applies to new documents. Existing documents keep their current extraction; to re-extract, use the per-document re-run button on the document’s review screen or (for the whole dataset) the “Re-extract all” action.
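Expressed as a configuration fragment, the same settings might look like this. Only `extract_engine` is named in this guide; the surrounding field names and layout are illustrative, since Archeglyph's settings live in the UI rather than a file:

```yaml
# Dataset-level extraction settings (illustrative shape, not a real config file)
extract_engine: tesseract        # or a VLM id from the Ollama Cloud list
extract_engine_options:
  tesseract:
    language: eng+fra            # eng is the default; ara, ell, etc. also valid
```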
Changing the extraction engine does not by itself invalidate embeddings or clusters: they derive from the text, not from the engine that produced it. Re-extraction, however, changes the text, so embeddings for re-extracted documents must be recomputed. Archeglyph surfaces this in the confirmation modal when you save a change that triggers a re-extraction.
Further reading
- The pipeline — where extraction sits in the full flow.
- VLM vs OCR: when to pick what — the reasoning and evidence behind the recommendations on this page.
- Transparency is a feature — why every extracted block names the engine that produced it.