Transformer
The neural-network architecture, introduced in 2017, that sits behind nearly every modern language, vision, and multimodal model.
Last updated
A neural-network architecture introduced in 2017 that learns which other tokens in the input to “attend to” when producing each output token. Almost every modern language model — LLMs, embedding models, translation, image captioning, vision-language models — is a transformer or a close relative.
Why it matters for your research. Knowing that the same architecture underpins ChatGPT, your embedding model, and your OCR VLM demystifies the space. The thing that changes between them is not the architecture but the training objective and the weights.
In Archēglyph. Transformers appear in the embedding model and in any VLM used for layout assessment or OCR. We don’t pick models because they’re transformers; we pick them because they behave well on prose.
Not to be confused with. The architecture is not the model. Two different transformers with different training data will behave very differently.