Clustering Text With Embeddings ft. HDBSCAN
Someone dumps 40,000 support tickets on a data science team and asks them to find the main themes. No labels, no category taxonomy, just a pile of text and a vague deadline.
The first instinct is K-Means, and it dies quickly. K-Means needs you to decide how many clusters exist before it starts looking. Nobody knows that. The whole point of the exercise is to find out. Picking K=20 and hoping for the best is basically guessing the answer before doing the work, which is not really analysis so much as wishful thinking dressed up in math.
This is where the pipeline of embeddings, UMAP, and HDBSCAN earns its reputation. Not because it is trendy, but because it is the first approach that does not ask you to already know the answer before you start.
K-Means Has a Deeper Problem
Most tutorials gloss over this. They show a clean scatter plot with five obvious blobs and say "see, K-Means works great" while conveniently forgetting they generated that data with exactly five clusters to begin with.
Real text data does not behave that way. It has a few large themes, a bunch of small niche ones, and a long tail of one-off posts that do not belong anywhere in particular. K-Means will assign every single point to some cluster regardless of whether it belongs there. There is no concept of "this point is noise." Every outlier gets jammed into the nearest centroid, which quietly corrupts every cluster it touches.
HDBSCAN is different in this specific way. It finds dense regions in the data, draws cluster boundaries around them, and labels anything that does not clearly belong to a dense region as noise, with a -1. That -1 label is not a failure. It is the algorithm being honest about a genuinely ambiguous point, which is more useful than a confidently wrong cluster assignment.
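To make the K-Means side of that contrast concrete, here is a minimal sketch of just the assignment and centroid-update steps, not a full K-Means implementation. The points and centroids are made-up toy values, and the helper names are hypothetical.

```python
import math

def assign_to_nearest_centroid(points, centroids):
    """K-Means assignment step: index of the closest centroid for each point."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

def mean_point(pts):
    """K-Means update step: coordinate-wise mean of a group of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

# Toy 2-D data (made up for illustration): two tight groups and one far-away outlier.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.1, -0.2), (0.3, 0.1), (9.8, 10.1), (10.2, 9.9), (60.0, -40.0)]

labels = assign_to_nearest_centroid(points, centroids)
print(labels)  # [0, 0, 1, 1, 1] -- the outlier is forced into cluster 1

# Recompute the centroid of cluster 1 with the outlier included.
members = [p for p, label in zip(points, labels) if label == 1]
new_centroid = mean_point(members)
print(new_centroid)  # dragged far away from (10, 10)
```

There is no label available for "none of the above", so the outlier joins cluster 1 and pulls its centroid to roughly (26.7, -6.7), well away from the two points that actually belong there.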
Text Is Not Numbers, So You Have to Convert It First
Before any clustering can happen, there is a more fundamental problem to solve. Clustering algorithms operate on vectors. Text is not a vector.
The old approach was TF-IDF, which counts how often words appear in a document and weighs rare words more heavily. It technically converts text to numbers, but it has no sense of meaning. "Car" and "vehicle" are completely unrelated in TF-IDF's worldview, so two documents about the same topic written with slightly different vocabulary end up scattered apart in the vector space. It works well enough for some tasks but it is fragile for clustering, where you need semantic similarity to translate into geometric proximity.
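A small sketch of that vocabulary problem, using the plain textbook TF-IDF formula written out by hand. Libraries such as scikit-learn use smoothed variants, so exact weights will differ. The three example sentences are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors (dicts of term -> weight) for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "my car will not start",           # same problem as the next line...
    "the vehicle refuses to turn on",  # ...but with no words in common
    "my car will not stop beeping",    # different problem, overlapping words
]
vecs = tfidf_vectors(docs)

same_meaning = cosine(vecs[0], vecs[1])
same_words = cosine(vecs[0], vecs[2])
print(same_meaning, same_words)
```

The first pair shares no tokens, so its similarity is exactly zero, while the pair describing different problems scores higher because it happens to reuse the same words.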
Sentence embedding models do this properly. They encode meaning rather than word counts. Feed all-MiniLM-L6-v2 the sentences "the server is down" and "our API keeps returning 503 errors" and the resulting vectors should land close together in the 384-dimensional space the model produces. Both sentences describe the same kind of failure, and the model learned to reflect that during training on over a billion sentence pairs.
from sentence_transformers import SentenceTransformer
# documents is assumed to be a list of strings, one per ticket
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(documents, show_progress_bar=True)
# shape: (num_documents, 384)
all-MiniLM-L6-v2 is small (about 22 million parameters, roughly 80 MB on disk) and fast enough to run on CPU for moderate dataset sizes. For most text clustering tasks it is the sensible default before reaching for heavier models. Now every document is a point in 384-dimensional space, and semantically similar documents sit near each other. That is the foundation everything else builds on.
The 384-Dimension Problem
Here is where people run HDBSCAN directly on the embeddings and get disappointing results, then assume HDBSCAN is the problem. It is not.
HDBSCAN is a density-based algorithm. It clusters by finding regions where many points are packed tightly together, separated from other dense regions by sparser space. This works beautifully in low dimensions. In high dimensions, it falls apart because of something called the curse of dimensionality.
In very high-dimensional spaces, every point tends to become roughly equidistant from every other point. There are no meaningfully dense regions and no meaningfully sparse ones because the distances all converge toward the same value. The HDBSCAN documentation says as much: the algorithm can work well up to roughly 50 to 100 dimensions, and performance tends to drop off significantly beyond that. At 384 dimensions, results degrade and a large share of points often ends up classified as noise.
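The distance-concentration effect is easy to see on synthetic data. The sketch below draws random Gaussian points (an assumption made for illustration; real embeddings have more structure than this, so the effect is usually less extreme) and measures how much farther the farthest point is than the nearest one.

```python
import math
import random

def relative_contrast(dim, n_points=300, seed=0):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.gauss(0, 1) for _ in range(dim)]
    distances = [
        math.dist(query, [rng.gauss(0, 1) for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

# Contrast shrinks as the number of dimensions grows.
for dim in (2, 5, 50, 384):
    print(dim, round(relative_contrast(dim), 2))
```

In 2 or 5 dimensions the farthest point is many times farther away than the nearest. At 384 dimensions the gap shrinks to a small fraction of the nearest distance, which leaves a density-based algorithm with very little contrast to work with.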
The fix is UMAP (Uniform Manifold Approximation and Projection). UMAP compresses high-dimensional data into a lower-dimensional space while preserving local structure, meaning points that were close in 384D tend to stay close after compression. For clustering purposes, reducing to around 5 dimensions is a common choice. Two dimensions is for visualization and distorts the structure in ways that can hurt clustering quality.
from umap import UMAP
reducer = UMAP(n_components=5, metric='cosine', random_state=42)
reduced_embeddings = reducer.fit_transform(embeddings)
# shape: (num_documents, 5)
After this step, density becomes visible again. The semantic clusters that were hidden in 384 dimensions are now geometrically obvious in 5, and HDBSCAN can do its job.
Running HDBSCAN
With the reduced embeddings ready, HDBSCAN is relatively simple to configure. There are two parameters that actually matter in practice.
min_cluster_size sets the minimum number of documents a group needs before it qualifies as a cluster. Anything smaller gets classified as noise instead. A reasonable starting point is max(5, total_documents // 100), so for 40,000 tickets that would be 400. This means a topic needs to appear at least 400 times to be recognized as a real cluster. Adjust based on domain knowledge of what counts as a meaningful theme versus a one-off anomaly.
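The starting-point rule of thumb for min_cluster_size can be written as a tiny helper. The function and parameter names here are hypothetical; only the formula max(5, total_documents // 100) comes from the text.

```python
def starting_min_cluster_size(total_documents, floor=5, divisor=100):
    """Starting value from the rule of thumb max(5, total_documents // 100)."""
    return max(floor, total_documents // divisor)

for n in (200, 2_000, 40_000):
    print(n, starting_min_cluster_size(n))
# prints 5, 20, and 400 respectively
```

The floor matters for small corpora: 200 documents would otherwise give a minimum size of 2, which turns almost any pair of similar documents into a cluster.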
min_samples controls how conservative the algorithm is about designating core points. It defaults to the same value as min_cluster_size, which is usually fine to leave alone initially. Lowering it produces more clusters with less noise. Raising it produces fewer, denser clusters and more noise points.
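from hdbscan import HDBSCAN
# min_cluster_size=15 is an illustrative value; for 40,000 tickets the
# heuristic above, max(5, total_documents // 100), would give 400
clusterer = HDBSCAN(min_cluster_size=15, metric='euclidean')
# fit_predict returns one label per document
labels = clusterer.fit_predict(reduced_embeddings)
# -1 means noise, 0/1/2/... are cluster IDs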
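For intuition about what min_samples changes, it helps to look at the core distance: HDBSCAN treats a point's distance to its k-th nearest neighbor (with k set by min_samples) as a measure of how sparse its neighborhood is. The sketch below is conceptual, on made-up toy points. The hdbscan library's exact neighbor-counting convention may differ from this hand-rolled version.

```python
import math

def core_distance(points, index, min_samples):
    """Distance from points[index] to its min_samples-th nearest other point."""
    others = sorted(
        math.dist(points[index], p) for i, p in enumerate(points) if i != index
    )
    return others[min_samples - 1]

# Toy 2-D data (made up): five points packed together plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (3.0, 3.0)]

for k in (1, 2, 4):
    dense = core_distance(points, 0, k)     # a point inside the tight group
    isolated = core_distance(points, 5, k)  # the lone point at (3, 3)
    print(k, round(dense, 3), round(isolated, 3))
```

The isolated point has a much larger core distance at every k, which is what pushes it toward the noise label. Core distances can only grow as k increases, which is why raising min_samples makes the algorithm more conservative.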
The full pipeline end to end:
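from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
# 1. Encode each document as a 384-dimensional vector
embeddings = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2').encode(documents)
# 2. Reduce to 5 dimensions so that density is meaningful again
reduced = UMAP(n_components=5, metric='cosine', random_state=42).fit_transform(embeddings)
# 3. Cluster; -1 marks noise. Tune min_cluster_size to the size of the corpus
labels = HDBSCAN(min_cluster_size=15).fit_predict(reduced)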
Back to those 40,000 support tickets. Running this pipeline on them does not produce a clean taxonomy, but it does produce something far more useful than a blank stare. The data separates into billing questions, login failures, slow load times, mobile app crashes, refund requests, and one cluster that turned out to be feedback about a specific UI change from two weeks prior that nobody had flagged manually. That last cluster is the kind of thing you would have missed if you were reading tickets one by one. The algorithm found it because enough people wrote about the same thing, semantically close enough, to form a dense region.
The Noise Points Are Worth Reading
When -1 shows up in the labels, the instinct is to discard those points and focus on the named clusters. That is usually a mistake.
Noise points in HDBSCAN are documents that did not cluster with enough similar documents to form a group. In a support context, those are the genuinely weird, one-off issues: the error message nobody else has reported, the edge case that does not fit any known pattern. They are often worth reading first, not last, because they can be early signals of something new breaking.
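One way to make reading them routine is to pull a few examples per label, with the noise group listed first. This is a minimal sketch: the helper name, its parameters, and the toy tickets are all hypothetical, and it assumes documents and labels are aligned, with one label per document.

```python
import random
from collections import defaultdict

def sample_by_cluster(documents, labels, per_cluster=3, seed=0):
    """Group documents by label and return a few random examples per group, noise (-1) first."""
    groups = defaultdict(list)
    for doc, label in zip(documents, labels):
        groups[int(label)].append(doc)  # int() also accepts NumPy integer labels
    rng = random.Random(seed)
    samples = {}
    for label in sorted(groups):  # -1 sorts before 0, 1, 2, ...
        docs = groups[label]
        samples[label] = rng.sample(docs, min(per_cluster, len(docs)))
    return samples

# Made-up tickets and labels, standing in for real pipeline output.
documents = [
    "cannot log in",
    "login page loops forever",
    "charged twice this month",
    "refund for a double charge",
    "app plays a sound when idle",
]
labels = [0, 0, 1, 1, -1]

samples = sample_by_cluster(documents, labels)
for label, examples in samples.items():
    name = "noise" if label == -1 else f"cluster {label}"
    print(name, examples)
```

Sampling randomly rather than taking the first few documents avoids only ever seeing the oldest tickets in each group.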
A noise rate above 30% usually means min_cluster_size is too high, or the data has genuine diversity without strong repeating themes. A noise rate around 5 to 10% is normal and healthy. If noise is near zero, min_cluster_size is probably set too low and the algorithm is finding micro-clusters that do not represent real patterns.
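Those rules of thumb are simple to check after each run. In the sketch below, the 30% cutoff comes from the text, while the 1% cutoff for "near zero" is an assumption chosen for illustration. The function name and return format are hypothetical.

```python
from collections import Counter

def noise_report(labels, high=0.30, low=0.01):
    """Summarize noise rate and cluster count, with a hint based on rough thresholds."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    noise_rate = counts.get(-1, 0) / total if total else 0.0
    n_clusters = len([label for label in counts if label != -1])
    if noise_rate > high:
        hint = "high noise: consider lowering min_cluster_size"
    elif noise_rate < low:  # assumed cutoff for "near zero"
        hint = "almost no noise: min_cluster_size may be too low"
    else:
        hint = "noise rate within a plausible range"
    return {"noise_rate": noise_rate, "n_clusters": n_clusters, "hint": hint}

# Made-up label arrays standing in for real HDBSCAN output.
healthy = noise_report([0] * 60 + [1] * 30 + [-1] * 10)
noisy = noise_report([0] * 60 + [-1] * 40)
print(healthy)
print(noisy)
```

The hint is only a prompt to look closer. A high noise rate on a corpus that really is diverse is a finding, not a misconfiguration.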
Where This Pipeline Struggles
UMAP's output depends heavily on its own hyperparameters, particularly n_neighbors. Changing that value changes the low-dimensional layout, which changes what HDBSCAN finds downstream. There is not a single correct clustering waiting to be discovered. There are multiple reasonable interpretations of the same data at different scales, and the pipeline will give you one of them.
In practice, running with a few different n_neighbors values (15, 30, 50) and checking whether the cluster structure stays roughly stable across runs gives a sense of how much to trust the output. Treat the result as a strong hypothesis rather than ground truth. Someone with domain knowledge still needs to look at samples from each cluster and confirm that they actually make sense.
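"Roughly stable" can be put on a number by comparing the label arrays from two runs with the adjusted Rand index, which ignores how clusters happen to be numbered. The version below is a hand-written sketch of the standard formula, and scikit-learn's adjusted_rand_score computes the same quantity. Treating noise (-1) as just another group is a simplifying assumption, and the label arrays are made up.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings; 1.0 means identical partitions."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Count pairs of points grouped together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    max_index = (together_a + together_b) / 2
    if max_index == expected:  # degenerate case, e.g. everything in one cluster
        return 1.0
    return (together_both - expected) / (max_index - expected)

# Made-up labels from three hypothetical runs over the same eight documents.
run_a = [0, 0, 0, 1, 1, 1, -1, -1]
run_b = [2, 2, 2, 0, 0, 0, -1, -1]  # same grouping, different cluster numbers
run_c = [0, 0, 1, 1, 0, 1, -1, 0]   # a genuinely different grouping

print(adjusted_rand_index(run_a, run_b))  # 1.0
print(adjusted_rand_index(run_a, run_c))
```

Computing this for each pair of n_neighbors settings gives a quick read on stability. Scores near 1 across the board suggest the structure is robust, while scores near 0 suggest the clusters are mostly an artifact of one particular layout.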
Short text is also harder. Tweets and single-sentence inputs give the model less context to work with, so the embeddings carry weaker semantic signal. The pipeline usually still does better than TF-IDF on short text, but the cluster separations are less clean. A model fine-tuned specifically for short-form content is likely to do better than the general-purpose all-MiniLM-L6-v2 in those cases.
And if labeled training data already exists, a classifier will usually outperform this approach. This pipeline is for when you genuinely do not know what structure lives inside the data and need the algorithm to surface it.
Why It Works in Practice
The value here is not that the clustering is perfect. It is that it turns something that was not navigable into something that is. A formless pile of 40,000 tickets becomes a map with regions, and a human can walk that map, validate what they find, and decide what to do next.
Which is considerably more tractable than where things started.