OCR

Also: Optical character recognition

Reading characters off a page image and producing machine-readable text.

Last updated 20 April 2026

Optical character recognition (OCR) reads characters off a page image and produces machine-readable text. Classical OCR engines (Tesseract, ABBYY) work line by line, character by character, and output a confidence score for each recognized token.

Why it matters for your research. Most DH corpora begin life as page scans. Every downstream analysis — search, clustering, NER — is built on the OCR output, and OCR errors cascade. “The OCR was clean” is as important a finding as any result that depends on it.

In Archēglyph. We run classical OCR by default and fall back to a VLM for low-confidence pages. Per-chunk OCR confidence is preserved in the bundle so researchers can filter on it. See "VLM vs OCR: when to pick what" and "Reviewing a noisy scan".
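
The fallback and filtering described above can be pictured with a minimal Python sketch. It is an illustration only, not Archēglyph's actual implementation: the `Chunk` class, the `ocr_confidence` field name, the 0.0 to 1.0 confidence scale, the use of a mean over token confidences, and the 0.80 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Chunk:
    """Hypothetical chunk record; the real bundle schema may differ."""
    text: str
    ocr_confidence: float  # assumed scale: 0.0 (no trust) to 1.0 (full trust)


def needs_vlm_fallback(token_confidences: list[float], threshold: float = 0.80) -> bool:
    """Flag a page for a VLM pass when its mean token confidence is below the threshold.

    The threshold of 0.80 is an illustrative value, not a documented default.
    """
    if not token_confidences:
        # No tokens recognised at all: treat the page as low confidence.
        return True
    return mean(token_confidences) < threshold


def filter_chunks(chunks: list[Chunk], min_confidence: float) -> list[Chunk]:
    """Keep only the chunks whose OCR confidence meets the caller's minimum."""
    return [c for c in chunks if c.ocr_confidence >= min_confidence]


if __name__ == "__main__":
    page_tokens = [0.95, 0.62, 0.58, 0.91]
    print("fallback needed:", needs_vlm_fallback(page_tokens))

    bundle = [Chunk("clean paragraph", 0.97), Chunk("smudged margin note", 0.41)]
    for chunk in filter_chunks(bundle, min_confidence=0.90):
        print(chunk.text)
```

A researcher would typically choose `min_confidence` per analysis: a strict value for work that depends on exact strings (such as NER), and a looser one for coarse tasks like clustering.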

Not to be confused with. OCR errors are not hallucinations — OCR reads actual ink on a real page region. A VLM asked to OCR can hallucinate characters the page does not contain.
