Sentence segmentation
Splitting a stream of prose into sentences. The rules look simple until you meet historical punctuation or OCR noise.
Last updated
Splitting a stream of prose into sentences. The rules look simple — full stop, newline — until you meet “Dr. Huxley, pp. 3–5, wrote:”, a 19th-century semicolon-heavy period, or a scan where line breaks carry meaning.
Why it matters for your research. Nearly every downstream analysis — embedding, classification, summarisation — works on the sentence or the chunk. Bad segmentation is invisible but poisons the results: two half-sentences clustering together because the period was missed is not a finding.
In Archēglyph. We use syntok for sentence splitting, chosen for its robustness on historical prose. Chunks are built by grouping segmented sentences until a size threshold.
Not to be confused with. Tokenization breaks each sentence into the sub-word pieces a model sees; sentence segmentation operates a level higher.