OCR
Also: Optical character recognition
Reading characters off a page image and producing machine-readable text.
Optical Character Recognition reads characters off a page image and produces machine-readable text. Classical OCR engines (Tesseract, ABBYY) work line by line, character by character, and output a confidence score per token.
Why it matters for your research. Most DH corpora begin life as page scans. Every downstream analysis — search, clustering, NER — is built on the OCR output, and OCR errors cascade. “The OCR was clean” is as important a finding as any result that depends on it.
In Archēglyph. We run classical OCR by default and fall back to a VLM for low-confidence pages. Per-chunk OCR confidence is preserved in the bundle so researchers can filter on it. See "VLM vs OCR: when to pick what" and "Reviewing a noisy scan".
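As a rough illustration of confidence-based fallback and filtering, the sketch below routes a page to a VLM when its mean chunk confidence falls under a threshold, and filters chunks by confidence for analysis. The `Chunk` class, the `ocr_confidence` field name, the function names, and both thresholds are hypothetical and chosen for this example; they are not Archēglyph's actual bundle schema or defaults.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Chunk:
    text: str
    ocr_confidence: float  # assumed scale: 0.0 (worst) to 1.0 (best)


def page_confidence(chunks):
    """Mean chunk confidence for a page; 0.0 if the page has no chunks."""
    return mean(c.ocr_confidence for c in chunks) if chunks else 0.0


def needs_vlm_fallback(chunks, threshold=0.80):
    """True when the page's mean confidence is below the (assumed) threshold."""
    return page_confidence(chunks) < threshold


def filter_chunks(chunks, min_confidence=0.90):
    """Keep only chunks whose OCR confidence meets the (assumed) minimum."""
    return [c for c in chunks if c.ocr_confidence >= min_confidence]


# Illustrative data, not real OCR output.
page = [
    Chunk("In the beginning", 0.97),
    Chunk("tbe w0rd", 0.41),
    Chunk("was written", 0.93),
]

fallback = needs_vlm_fallback(page)           # one noisy chunk pulls the page mean down
clean_texts = [c.text for c in filter_chunks(page)]
```

Averaging is only one possible page-level rule; a minimum or a low percentile of chunk confidence would flag pages with a single badly damaged region more aggressively.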
Not to be confused with. OCR errors are not hallucinations — OCR reads actual ink on a real page region. A VLM asked to OCR can hallucinate characters the page does not contain.