Reading clusters as a researcher
The Archeglyph cluster view leads with quotations, not scatterplots. Here is how to use it — and why the scatterplot is behind a toggle.
By Dipankar
If you have used a topic-modelling or embedding-clustering tool before, you have probably seen the default view: a UMAP scatterplot with coloured dots, a sidebar of top terms per cluster, and — if you are lucky — a list of document titles. This is a default designed for someone debugging a clustering algorithm. It is not a default designed for someone reading a corpus.
Archeglyph’s cluster view reverses the priority. Each cluster is a card, and the card leads with three things a researcher actually wants to see: a theme title, a one-sentence summary, and three to six exemplar fragments rendered as readable quotations with source citations. The scatterplot is behind a button called “Advanced”.
This article is about how to use the default view — and, for those inclined, when the Advanced panel earns its place.
What a card shows
A cluster card has the following anatomy, in order of visual weight:
- Theme title. Four to six words. Generated by passing the cluster’s top TF-IDF terms to a small text LLM, which polishes them into something readable. The card discloses which LLM wrote the title.
- One-sentence summary. A short description — “A group of 42 fragments discussing population movements between Europe and Asia Minor during the 1920s” — produced by the same model from the exemplar fragments.
- Exemplar quotations. Three to six fragments, each rendered as a block quote with its source document, page, and (where available) date. These are the highest-probability members of the cluster according to HDBSCAN’s soft-clustering output.
- Size and link. “42 fragments · Open cluster →” opens the fragment neighbourhood — a longer view of all members with a sentence of surrounding context on either side.
That is the whole default. There is no dot plot, no silhouette score, no outlier percentage. Those numbers exist, and the Advanced toggle surfaces them, but the first read doesn’t need them.
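As a rough illustration of how the exemplar slots on a card could be filled, the sketch below ranks a cluster's members by membership probability and renders the top few as cited quotations. The `Fragment` fields and the `pick_exemplars` and `render_card` helpers are hypothetical names invented for this example, not Archeglyph's API, and the probabilities are assumed to have been produced already by the clustering step.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    source: str          # source document title (hypothetical field)
    page: int
    cluster: int         # cluster label assigned by the clustering step
    probability: float   # membership strength in [0, 1], assumed precomputed


def pick_exemplars(fragments, cluster, k=3):
    """Return the k members of `cluster` with the highest membership probability."""
    members = [f for f in fragments if f.cluster == cluster]
    ranked = sorted(members, key=lambda f: f.probability, reverse=True)
    return ranked[:k]


def render_card(title, fragments, cluster, k=3):
    """Render a plain-text card: title, cited quotations, then cluster size."""
    members = [f for f in fragments if f.cluster == cluster]
    lines = [title]
    for f in pick_exemplars(fragments, cluster, k):
        lines.append(f'> "{f.text}" ({f.source}, p. {f.page})')
    lines.append(f"{len(members)} fragments")
    return "\n".join(lines)
```

Calling `render_card` with `k` between three and six would mirror the card anatomy described above; the summary sentence and the LLM disclosure are left out of this sketch.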
How to read a card
Our experience working with historians, philologists, and archivists on the newspapers prototype led to a short heuristic:
- Read the exemplar quotations first, top to bottom. Ignore the title; the title is a guess. The quotations are the data.
- Ask whether they feel like one group. If yes, the cluster is doing useful work, even if the title is slightly off. If no, look at the exemplar that doesn’t fit: sometimes one exemplar signals a sub-theme that was swept into the same bucket.
- Open the cluster. The fragment neighbourhood shows all members with ±1 sentence of context. This is where the research actually happens. Skim the neighbourhood, flag fragments that feel adjacent but not central, and drop out to the document page for the ones that matter.
- Only then look at the title. By this point you have your own sense of the cluster’s shape. If the generated title fits, fine. If it doesn’t, you can rename it, and the rename persists.
Notice how much of this is close reading. The clustering algorithm got the fragments into roughly the same room; the philologist decides whether they are actually having the same conversation. That division of labour is the whole point.
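The ±1 sentence window used by the fragment neighbourhood can be sketched in a few lines. This is a minimal illustration that assumes each fragment is a single sentence inside an already sentence-split document; the `neighbourhood` function and its return shape are hypothetical, not the tool's actual interface.

```python
def neighbourhood(sentences, index, radius=1):
    """Return the fragment at `index` with up to `radius` sentences on each side.

    `sentences` is a document already split into sentences (assumed input).
    The window is clipped at the start and end of the document.
    """
    start = max(0, index - radius)
    end = min(len(sentences), index + radius + 1)
    return {
        "before": sentences[start:index],
        "fragment": sentences[index],
        "after": sentences[index + 1:end],
    }
```

A fragment at the very start or end of a document simply gets an empty list on the missing side.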
What the Advanced toggle is for
There are three moments when the Advanced panel earns its place, and they have nothing to do with the default reading loop:
- When you suspect the clustering is wrong and want to know how wrong. The histogram of membership probabilities tells you whether a cluster’s members are tightly bound or loosely attached. Loose clusters should be read sceptically.
- When you are comparing two runs. If you changed the embedding model or the HDBSCAN parameters, the UMAP projection lets you see at a glance whether the structure moved.
- When you are teaching the algorithm to someone else. The scatterplot is good pedagogy; it is mediocre daily bread.
For everything else, the numbers are noise. Our experience on the newspapers prototype was that researchers who spent thirty minutes in a UMAP view ended up with a worse sense of their corpus than researchers who spent thirty minutes reading exemplar quotations. The geometric view feels authoritative in a way the quotations don’t, and that authority is misleading — distances in 2D UMAP space are not the distances the clustering algorithm used.
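To make the "tightly bound or loosely attached" distinction concrete, here is a small sketch of a probability histogram for one cluster. The function names, the five-bin layout, and the 0.5 cut-off for a "loose" member are all assumptions chosen for illustration; the Advanced panel may bin and threshold differently.

```python
def probability_histogram(probabilities, bins=5):
    """Count membership probabilities in `bins` equal-width bins over [0, 1]."""
    counts = [0] * bins
    for p in probabilities:
        # A probability of exactly 1.0 falls into the last bin.
        index = min(int(p * bins), bins - 1)
        counts[index] += 1
    return counts


def loose_share(probabilities, threshold=0.5):
    """Fraction of members whose probability is below `threshold` (hypothetical cut-off)."""
    if not probabilities:
        return 0.0
    return sum(1 for p in probabilities if p < threshold) / len(probabilities)
```

A cluster whose counts pile up in the last bin is tightly bound; one whose counts are spread across the lower bins deserves a more sceptical read.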
What we don’t do
A few things the cluster view deliberately omits, and why:
- Word clouds. They encode frequency as area, which the eye reads as importance. TF-IDF terms are already in the theme-title pipeline; that is enough.
- Automatic cluster merging. If two clusters are “similar” by some metric, the researcher — not the algorithm — decides whether to merge them. The tool proposes; the scholar disposes.
- Sentiment or stance overlays. Sentiment classifiers trained on 21st-century social media do poorly on 19th-century newspapers. We would rather ship no signal than a misleading one.
What cluster IDs promise
When you re-ingest a dataset (add new documents, re-run extraction on a batch, or change the clustering parameters), the underlying clustering algorithm produces a fresh assignment. Naively, this would renumber every cluster, breaking any URL or note that references “cluster #17”.
Archeglyph stabilises cluster IDs via Hungarian matching against the previous assignment: a cluster that has substantial overlap with a previous cluster keeps the previous ID. This means saved cluster links survive incremental ingests. It also means a cluster whose membership shifts dramatically — because, say, you added two hundred documents about a new topic — will show up as a new cluster rather than hiding inside the old one.
That stability is load-bearing. It lets an archivist or intellectual historian bookmark a cluster as they would bookmark a chapter, and come back to it a month later without chasing a new number.
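The ID-stabilising step can be sketched as an assignment problem over cluster overlaps. The example below is not Archeglyph's implementation: it scores overlap with Jaccard similarity, uses a hypothetical `min_overlap` of 0.5 as the meaning of "substantial", and finds the optimal one-to-one matching by brute force over permutations. That yields the same optimum the Hungarian algorithm would, but it is only practical for a handful of clusters; at real scale a solver such as SciPy's `linear_sum_assignment` would take its place.

```python
from itertools import permutations


def jaccard(a, b):
    """Overlap between two fragment-ID sets: 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def stabilise_ids(previous, fresh, min_overlap=0.5):
    """Map each fresh cluster label to a stable integer ID.

    previous: dict of stable ID -> set of fragment IDs from the last run
    fresh:    dict of raw label -> set of fragment IDs from this run
    min_overlap: hypothetical threshold for keeping a previous ID
    """
    old_ids = sorted(previous)
    new_labels = sorted(fresh)
    size = max(len(old_ids), len(new_labels))

    # Square score matrix, zero-padded where one side has fewer clusters.
    score = [[0.0] * size for _ in range(size)]
    for i, label in enumerate(new_labels):
        for j, old in enumerate(old_ids):
            score[i][j] = jaccard(fresh[label], previous[old])

    # Brute-force stand-in for Hungarian matching: maximise total overlap.
    best = max(
        permutations(range(size)),
        key=lambda perm: sum(score[i][perm[i]] for i in range(size)),
    )

    next_id = max(old_ids, default=0) + 1
    mapping = {}
    for i, label in enumerate(new_labels):
        j = best[i]
        if j < len(old_ids) and score[i][j] >= min_overlap:
            mapping[label] = old_ids[j]   # substantial overlap: keep the old ID
        else:
            mapping[label] = next_id      # too different: mint a new ID
            next_id += 1
    return mapping
```

With this shape, a cluster that gains a few fragments keeps its number, while a cluster made up mostly of newly added material gets a fresh ID instead of inheriting an old one.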
The end state
The default view is not just an aesthetic choice. It is a bet that the closest thing clustering tools have to a standard interface, the UMAP plot, was never the right one for the humanities. A cluster is a reading unit. Make it look like one, and the tool recedes into the background of the work, which is where research tools belong.