Archēglyph

Sentence segmentation

Splitting a stream of prose into sentences. The rules look simple until you meet historical punctuation or OCR noise.

Last updated

Splitting a stream of prose into sentences. The rules look simple — full stop, newline — until you meet “Dr. Huxley, pp. 3–5, wrote:”, a 19th-century semicolon-heavy period, or a scan where line breaks carry meaning.

Why it matters for your research. Nearly every downstream analysis — embedding, classification, summarisation — works on the sentence or the chunk. Bad segmentation is invisible but poisons the results: two half-sentences clustering together because the period was missed is not a finding.

In Archēglyph. We use syntok for sentence splitting, chosen for its robustness on historical prose. Chunks are built by grouping segmented sentences until a size threshold.

Not to be confused with. Tokenization breaks each sentence into the sub-word pieces a model sees; sentence segmentation operates a level higher.

Related terms

References

← Back to the glossary