HDBSCAN
Also: Hierarchical density-based clustering
A clustering algorithm that finds clusters of varying sizes in dense regions and leaves outliers unlabelled rather than forcing them in.
Last updated
Hierarchical Density-Based Spatial Clustering of Applications with Noise. HDBSCAN builds a cluster tree from how densely points pack in embedding space, then picks stable clusters automatically. Points that don’t belong to any dense region get labelled as noise rather than forced into a cluster.
Why it matters for your research. Compared to k-means — which requires you to pick the number of clusters up front and assigns every point to one — HDBSCAN finds clusters of varying sizes and leaves the outliers alone. That is a better fit for messy corpora where the “correct” number of groupings is itself unknown.
In Archēglyph. The default clustering algorithm. Min-cluster-size and min-samples are configurable per analysis run, and the choices are recorded in the dataset note.
Not to be confused with. UMAP is not a clustering algorithm — it’s a dimensionality-reduction step for visualisation that sits alongside HDBSCAN.