Vision-language model

Also: VLM

A model that takes an image (or image + text) as input and produces text — describing, transcribing, or classifying what it sees.

Last updated 20 April 2026

A VLM takes an image (or image + text) as input and produces text. It can describe a photograph, transcribe a handwritten note, or classify regions of a scanned page. VLMs share transformer foundations with LLMs and add a vision encoder.

Why it matters for your research. VLMs changed what “OCR” means. Classical OCR reads characters; a modern VLM can also tell you what kind of region it is looking at (header, footnote, marginal note, table), and can read some handwritten or degraded scripts that classical engines miss entirely. They are also more prone to hallucination than classical OCR — a VLM can cheerfully “read” text that isn’t on the page.

In Archēglyph. We use a VLM for layout analysis on every page, and for OCR on individual pages where classical OCR confidence is low. The article VLM vs OCR: when to pick what explains how we choose.
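The confidence-based fallback can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not Archēglyph's actual code: the names `OcrResult`, `transcribe_page`, `classical_ocr`, and `vlm_ocr` are hypothetical, and the `0.85` threshold is an illustrative placeholder rather than a documented setting.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class OcrResult:
    text: str
    confidence: float  # assumed to be a page-level score in [0, 1]


def transcribe_page(
    page: bytes,
    classical_ocr: Callable[[bytes], OcrResult],  # hypothetical classical engine
    vlm_ocr: Callable[[bytes], str],              # hypothetical VLM transcriber
    min_confidence: float = 0.85,                 # illustrative threshold only
) -> Tuple[str, str]:
    """Return (text, engine_used) for a single page image."""
    result = classical_ocr(page)
    if result.confidence >= min_confidence:
        # Classical OCR is confident enough: keep its output.
        return result.text, "classical"
    # Low confidence: fall back to the VLM for this page only.
    return vlm_ocr(page), "vlm"
```

The sketch returns the engine name alongside the text so that VLM-transcribed pages can be flagged for closer review, since those are the pages where hallucinated text is a risk.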

Not to be confused with. An LLM. LLMs are text in, text out; a VLM also accepts images as input.

