Thursday, July 10, 2025

Can These Product Descriptions be Clustered?

This piece describes the first part of my dissertation project, where I assessed the best methods for clustering some technical product descriptions. These were difficult items to work with, full of abbreviations, numbers and technical language.

First - vectorisation. Here we need to convert the text to a numerical representation (a matrix of numbers). Statistical methods like term frequency (and TF-IDF) count how many times a word appears in a document, with TF-IDF also down-weighting words that appear in many documents across the corpus.
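
As a minimal sketch of this step, using scikit-learn's TfidfVectorizer (the three example descriptions below are invented placeholders, not the real data):

```python
# Minimal TF-IDF sketch - assumes scikit-learn; the descriptions below are
# made-up placeholders standing in for the real product data.
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "M8x40 hex bolt DIN 933 A2 stainless",
    "Bearing 6204-2RS 20x47x14 sealed",
    "PTFE gasket 1/2in BSP 3mm",
]

vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(descriptions)   # sparse document-term matrix

# Proportion of non-zero entries - the figure quoted below.
print(f"{X.shape} matrix, {X.nnz / (X.shape[0] * X.shape[1]):.3%} non-zero")
```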

TF-IDF vectorisation performed poorly in this case, giving a matrix in which only 0.006% of the entries are non-zero - i.e. 99.994% zeros. This reflects how little vocabulary the descriptions share with one another. K-means clustering combined with principal component analysis (PCA) to reduce the dimensionality also performed poorly: the first six components explain less than 20% of the variance, where you would typically expect a figure closer to 90%.
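
A rough sketch of that check, assuming scikit-learn's KMeans and PCA and the TF-IDF matrix X from the snippet above (the cluster count and component numbers are illustrative rather than tuned values):

```python
# Sketch of the K-means + PCA check - assumes scikit-learn and the TF-IDF
# matrix X from the snippet above (ideally built over the full corpus).
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

kmeans = KMeans(n_clusters=min(8, X.shape[0]), n_init=10, random_state=0)
labels = kmeans.fit_predict(X)               # cluster labels from raw TF-IDF

dense = X.toarray()                          # PCA needs a dense array
pca = PCA(n_components=min(6, dense.shape[0] - 1, dense.shape[1]),
          random_state=0)
pca.fit(dense)
print("Cumulative variance explained:", pca.explained_variance_ratio_.cumsum())
```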

So on the face of it, there doesn't seem to be any underlying structure in the data. On the other hand, the Hopkins statistic (which indicates clusterability at the 95% confidence level when it exceeds 0.7) clocked in at 0.97... and using HDBSCAN clustering with t-SNE dimensionality reduction we do get promising results.
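
As a rough sketch of that last step: a compact version of the Hopkins statistic, then t-SNE and HDBSCAN from scikit-learn (HDBSCAN needs scikit-learn 1.3 or later; the min_cluster_size and perplexity values here are illustrative choices rather than the ones I settled on):

```python
# Sketch of the Hopkins check and HDBSCAN + t-SNE clustering - assumes
# scikit-learn >= 1.3 and the TF-IDF matrix X from the snippets above.
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors


def hopkins(data, m=None, seed=0):
    """Compact Hopkins statistic: values near 1 suggest clusterable data."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    m = m or max(1, n // 10)
    nn = NearestNeighbors(n_neighbors=2).fit(data)
    # w: distance from sampled real points to their nearest *other* real
    # point (column 0 is the point itself, at distance zero).
    sample = data[rng.choice(n, size=m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]
    # u: distance from uniform random points (drawn inside the data's
    # bounding box) to their nearest real point.
    uniform = rng.uniform(data.min(axis=0), data.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]
    return u.sum() / (u.sum() + w.sum())


dense = X.toarray()
print("Hopkins statistic:", hopkins(dense))

# 2-D t-SNE map of the TF-IDF vectors, then density-based clustering on it.
tsne = TSNE(n_components=2, perplexity=min(30, dense.shape[0] - 1),
            init="random", random_state=0)
coords = tsne.fit_transform(dense)

clusters = HDBSCAN(min_cluster_size=2).fit_predict(coords)
print("Cluster labels (-1 = noise):", np.unique(clusters))
```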

Evaluating Embeddings for NLP and Document Clustering

Embeddings are an alternative to traditional vectorisation methods, first proposed by Bengio et al. in 2003, who developed the first langua...