Archeglyph
Guide · extraction · ocr · vlm · decision

OCR vs VLM: a practical chooser

A short, decision-oriented guide to picking the right extraction engine for your corpus. When Tesseract is the right default, when a VLM is worth the cost, and how to test the choice cheaply.

By Dipankar

On this page
  1. The one-line answer
  2. Pick Tesseract if
  3. Pick a VLM if
  4. The hybrid default
  5. How to test cheaply before committing
  6. A quick triage table
  7. Configuring the choice in Archeglyph
  8. Further reading

This is a decision guide. If you want the reasoning behind it, read the companion article VLM vs OCR: when to pick what. If you want to just decide what to set your dataset’s extract_engine to, start here.

The one-line answer

Default to Tesseract; escalate to a VLM per-region when the output looks wrong. This handles almost every corpus we have seen.

The remainder of this guide is a more nuanced version of that same answer, for the cases where the default isn’t right.

Pick Tesseract if

Any of these is true of your corpus:

  • Printed text, typeset, post-1900.
  • Scans at 300 dpi or better.
  • Latin script, or a well-supported non-Latin script with a Tesseract language pack (Greek, Cyrillic, Arabic with ara, and so on).
  • Your downstream use is lexical search or surveying, not publication of the extracted text.
  • Your corpus is large enough that VLM per-page cost becomes a budget question.

Tesseract will produce good text quickly, its errors will be consistent, and you will keep budget headroom to re-run troublesome pages individually with a VLM.

Pick a VLM if

Any of these is true, and especially if more than one is:

  • Heavy degradation: staining, bleed-through, torn edges, uneven exposure.
  • Low-resolution scans (below ~200 dpi).
  • Handwriting or mixed print/handwriting.
  • Non-Latin scripts with limited Tesseract support (historical Ottoman, older scripts, or very stylised typography).
  • Your downstream use is publication of the extracted text as a resource, where the error bar matters.
  • The corpus is small enough that per-page VLM cost is affordable.

Pick the smallest VLM on the Ollama Cloud list that works on a sample. Larger VLMs cost more and are not always more accurate on extraction — some of them over-correct text in ways you may not want.
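
The two checklists above can be read as a small rule set. The following is a minimal sketch of that reading, not part of Archeglyph: `CorpusProfile`, its field names, and the generic `"vlm"` return value are all hypothetical, and the cost criterion is left out because it depends on your budget.

```python
from dataclasses import dataclass


@dataclass
class CorpusProfile:
    """Hypothetical summary of a corpus; field names are illustrative only."""
    dpi: int
    heavy_degradation: bool = False
    has_handwriting: bool = False
    script_has_tesseract_pack: bool = True
    publishing_extracted_text: bool = False


def vlm_signals(profile: CorpusProfile) -> list[str]:
    """Return the VLM-leaning checklist criteria that apply to this corpus."""
    signals = []
    if profile.heavy_degradation:
        signals.append("heavy degradation")
    if profile.dpi < 200:  # the guide's "below ~200 dpi" threshold
        signals.append("low resolution")
    if profile.has_handwriting:
        signals.append("handwriting")
    if not profile.script_has_tesseract_pack:
        signals.append("limited Tesseract script support")
    if profile.publishing_extracted_text:
        signals.append("text will be published")
    return signals


def suggest_engine(profile: CorpusProfile) -> str:
    """Suggest 'tesseract' unless at least one VLM signal applies."""
    return "vlm" if vlm_signals(profile) else "tesseract"


clean = CorpusProfile(dpi=300)
damaged = CorpusProfile(dpi=150, heavy_degradation=True)
print(suggest_engine(clean), vlm_signals(clean))
print(suggest_engine(damaged), vlm_signals(damaged))
```

The more signals the list contains, the stronger the case for a VLM; a single signal is a reason to test a sample rather than to commit.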

The hybrid default

Many corpora benefit from a hybrid approach:

  • Dataset default: Tesseract. Runs on every region.
  • Per-document override: a VLM, used when Tesseract output looks wrong on that document.
  • Per-region re-run: available from the provenance badge in the review screen.

Archeglyph supports all three levels directly. No custom pipeline code is needed.
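
One way to picture the three levels is as a precedence rule in which the most specific choice wins. The sketch below assumes that reading. `resolve_engine` is a hypothetical helper, not an Archeglyph function, and `"example-vlm"` is a placeholder id rather than a real model name.

```python
from typing import Optional


def resolve_engine(
    dataset_default: str,
    document_override: Optional[str] = None,
    region_override: Optional[str] = None,
) -> str:
    """Pick the most specific engine choice that is present.

    Assumed precedence: region re-run, then document override,
    then the dataset default.
    """
    for choice in (region_override, document_override):
        if choice is not None:
            return choice
    return dataset_default


# "example-vlm" is a placeholder id, not a real model name.
print(resolve_engine("tesseract"))
print(resolve_engine("tesseract", document_override="example-vlm"))
print(resolve_engine("example-vlm", region_override="tesseract"))
```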

How to test cheaply before committing

Before setting the extraction engine for a large dataset, run this 20-minute check:

  1. Pick a representative subset. Twenty pages that span the visual range of your corpus — one clean page, one damaged page, one with unusual layout, one in the less-familiar script if your corpus has more than one.
  2. Upload the subset as a fresh dataset with Tesseract as the default.
  3. Skim the review screen for each page. Note the regions that look wrong.
  4. Re-run those regions from the provenance badge with a VLM of your choice.
  5. Compare side by side. The review screen will show both outputs attributed to their engines.

If Tesseract is right on at least 18 of the 20 pages, stick with Tesseract and use per-region re-run as needed. If it is wrong on 5 or more, switch the dataset default to a VLM. If it falls in between (wrong on 3 or 4 pages), consider the hybrid strategy above.
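
The thresholds can be written down as a small function. This is a sketch under one assumption: the 20-page cut-offs are scaled to proportions (at most 10% wrong, at least 25% wrong) so that other sample sizes work too. The guide itself only states the 20-page numbers, and `recommend` is a hypothetical name.

```python
def recommend(pages_wrong: int, pages_total: int = 20) -> str:
    """Map the subset check result to 'tesseract', 'hybrid', or 'vlm'."""
    if pages_total <= 0:
        raise ValueError("pages_total must be positive")
    if not 0 <= pages_wrong <= pages_total:
        raise ValueError("pages_wrong must be between 0 and pages_total")
    # Integer comparisons avoid floating-point edge cases at the thresholds.
    if pages_wrong * 10 <= pages_total:   # at most 10% wrong (2 of 20)
        return "tesseract"
    if pages_wrong * 4 >= pages_total:    # at least 25% wrong (5 of 20)
        return "vlm"
    return "hybrid"


for wrong in (2, 3, 4, 5):
    print(wrong, recommend(wrong))
```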

A quick triage table

Situation                              | Default engine  | Notes
---------------------------------------+-----------------+---------------------------------------------------------------
20th-century typeset print, 300+ dpi   | Tesseract       | Expect 95-99% character accuracy
19th-century print, 300+ dpi           | Tesseract       | Add post-processing for systematic errors
Pre-1850 print, letterpress            | Tesseract → VLM | Test a subset first; VLM often wins
Typewritten 20th-century documents     | Tesseract       | Very reliable
Degraded archival scans                | VLM             | Tesseract output will look like noise
Handwriting                            | VLM             | Tesseract is not designed for this
Mixed print + handwriting              | VLM             | Mixed regions benefit from a VLM’s tolerance
Tables of numbers                      | Tesseract       | Set the page segmentation mode (PSM) in settings if results look disordered
Ottoman Turkish                        | VLM             | In our experience with newspapers: noticeably better on ligatures
East Asian scripts (Chinese, Japanese) | VLM             | Specialised OCR is an option; VLM is usually simpler
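
To check a character-accuracy figure like the one in the first row against your own pages, you can compare engine output with a hand-corrected transcription. The sketch below uses one common definition, 1 minus the Levenshtein edit distance divided by the reference length. That definition is an assumption here; the guide does not say how its figures were measured.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Return 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        raise ValueError("reference must not be empty")
    distance = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - distance / len(reference))


reference = "The quick brown fox"
ocr_output = "Tbe quick brovvn fox"  # made-up example of typical OCR confusions
print(f"{character_accuracy(reference, ocr_output):.3f}")
```

A handful of transcribed regions per page type is usually enough to tell whether a corpus sits near the expected range or well below it.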

Configuring the choice in Archeglyph

From the dataset’s Settings tab:

  • Extraction engine: set to tesseract or to any VLM id from the Ollama Cloud list.
  • Tesseract language: set under the extract_engine sub-options when Tesseract is selected. Default is eng; change to eng+fra, ara, ell, etc., as your corpus requires.
  • A saved change applies only to new documents. Existing documents keep their current extraction; to re-extract, use the per-document re-run button on the document’s review screen or (for the whole dataset) the “Re-extract all” action.
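
For orientation, the language and page-segmentation settings correspond to the `-l` and `--psm` options of the standalone Tesseract command line. How Archeglyph passes them internally is not described here, so the following is only a sketch of an equivalent direct call. `tesseract_command` and `run_tesseract` are hypothetical helpers, and `page.png` is a placeholder path.

```python
import subprocess
from typing import Optional


def tesseract_command(
    image_path: str,
    languages: list[str],
    psm: Optional[int] = None,
) -> list[str]:
    """Build a Tesseract CLI call that writes recognised text to stdout."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Multiple language packs are joined with '+', e.g. eng+fra.
    cmd = ["tesseract", image_path, "stdout", "-l", "+".join(languages)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def run_tesseract(image_path: str, languages: list[str],
                  psm: Optional[int] = None) -> str:
    """Run Tesseract; requires the tesseract binary and language packs."""
    result = subprocess.run(
        tesseract_command(image_path, languages, psm),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Only the command is printed here; nothing is executed.
# PSM 6 treats the page as a single uniform block of text.
print(tesseract_command("page.png", ["eng", "fra"], psm=6))
```

Running the same page through the command line with two or three PSM values is a quick way to find the setting worth entering in the dataset’s Settings tab.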

Changing the extraction engine setting does not, by itself, invalidate embeddings or clusters: those derive from the extracted text, not from the engine. Re-extracting a document does change its text, however, so its embeddings (and any clusters built on them) must then be recomputed. Archeglyph surfaces this in the confirmation modal when you save a change that triggers a re-extraction.
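
The dependency runs from text to embeddings, so one way to reason about staleness is to compare the text before and after re-extraction. The sketch below illustrates that idea with a content hash. It is an assumption for illustration, not a description of how Archeglyph tracks this, and both function names are hypothetical.

```python
import hashlib


def text_fingerprint(text: str) -> str:
    """Stable fingerprint of extracted text (SHA-256 of its UTF-8 bytes)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reembedding(old_text: str, new_text: str) -> bool:
    """True when re-extraction changed the text, whichever engine ran."""
    return text_fingerprint(old_text) != text_fingerprint(new_text)


# Same text from a different engine: existing embeddings remain valid.
print(needs_reembedding("Annual report, 1923", "Annual report, 1923"))
# Changed text: embeddings are stale.
print(needs_reembedding("Annua1 rep0rt, 1923", "Annual report, 1923"))
```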

Further reading