Archēglyph

Tokenization

Splitting text into the sub-word pieces a model actually sees. Token count, not word count, is what drives cost and context-window limits.


Models don’t see text as words; they see it as “tokens”, sub-word pieces drawn from a vocabulary that the tokenizer learned before the model was trained. “Researcher” might be one token; “philologist” might be three; an archaic spelling might be ten. Token count, not word count, is what drives cost and context-window limits.
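To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The five-piece vocabulary, the function name `tokenize`, and the sample spelling “vniuersity” are all hypothetical and chosen only for illustration; real tokenizers (byte-pair encoding, for instance) learn tens of thousands of pieces and differ in detail.

```python
# Toy vocabulary (hypothetical). A real tokenizer learns its pieces from data.
VOCAB = {"researcher", "phil", "olog", "ist", "the"}


def tokenize(word: str, vocab: frozenset = frozenset(VOCAB)) -> list:
    """Greedy longest-match split of a single word.

    At each position, take the longest vocabulary piece that matches;
    if nothing matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a piece matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece starts here: emit one character.
            tokens.append(word[i])
            i += 1
    return tokens


for w in ("researcher", "philologist", "vniuersity"):
    pieces = tokenize(w)
    print(w, len(pieces), pieces)
```

With this toy vocabulary, a word the tokenizer knows stays whole, a rarer word breaks into a few pieces, and an unfamiliar spelling degrades to one token per character.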

Why it matters for your research. When a tool quotes you a limit of “32k tokens”, that limit is counted in tokens, not words. Historical, code-switched, or heavily abbreviated text typically yields more tokens per page than modern prose, which raises the cost of each analysis run and reduces how much text fits in a single pass.
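A rough back-of-the-envelope sketch of that effect follows. Every number in it is an assumption made for illustration: 32,000 as the reading of “32k”, 400 words per page, and the tokens-per-word ratios of 1.3 and 2.5. None of them is a measurement of any particular tokenizer or corpus.

```python
def pages_per_pass(context_tokens: int, words_per_page: int, tokens_per_word: float) -> int:
    """Estimate how many whole pages fit into one context window."""
    tokens_per_page = words_per_page * tokens_per_word
    return int(context_tokens // tokens_per_page)


CONTEXT_TOKENS = 32_000   # assumed reading of "32k tokens"
WORDS_PER_PAGE = 400      # assumed page length

# Assumed tokens-per-word ratios, for illustration only.
modern = pages_per_pass(CONTEXT_TOKENS, WORDS_PER_PAGE, 1.3)
historical = pages_per_pass(CONTEXT_TOKENS, WORDS_PER_PAGE, 2.5)

print(f"modern prose:    about {modern} pages per pass")
print(f"historical text: about {historical} pages per pass")
```

Under these assumptions the same window holds roughly half as many pages of historical text. Where a tool bills per token, the cost of processing the same number of pages rises in the same proportion.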

In Archēglyph. Tokenization is a detail internal to the embedding and VLM models we use; you mostly don’t need to think about it. When you do, for example when an embedding run on an abbreviation-heavy corpus costs more than expected, the tokenizer is the likely cause.

Not to be confused with. Sentence segmentation splits prose into sentences; tokenization splits each sentence into model-sized pieces. See Sentence segmentation.
