This piece describes the first part of my dissertation project, in which I assessed methods for clustering technical product descriptions. These were difficult items to work with: short texts full of abbreviations, numbers and technical language.
First, vectorisation: we need to convert the text to a numerical representation (a matrix of numbers). Statistical methods like term frequency (and TF-IDF) count how many times a word appears in a document, with TF-IDF down-weighting words that appear across many documents.
TF-IDF vectorisation performed poorly in this case, giving a matrix in which only 0.006% of entries were non-zero, i.e. 99.994% zeros. This reflects how little vocabulary the descriptions share with one another. K-means clustering, with a principal component analysis (PCA) first to reduce the dimensionality, also fared poorly: the first six components explain less than 20% of the variance, where you would typically expect a figure closer to 90%.
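The two diagnostics mentioned above can be computed like this (again on an invented corpus, so the numbers will not match the 0.006% / 20% figures from the real data):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus standing in for the real product descriptions
docs = [
    "M6x20 hex bolt DIN 933 A2",
    "M8 hex nut DIN 934 zinc",
    "20mm flat washer form A",
    "PTFE tape 12mm x 12m",
    "cable tie 200x4.8 black nylon",
    "heat shrink tube 6mm black",
    "crimp terminal 6.3mm red",
    "O-ring NBR 70 25x3",
]
X = TfidfVectorizer().fit_transform(docs)

# Fraction of non-zero entries (0.006% for the real data in the post)
nonzero_fraction = X.nnz / (X.shape[0] * X.shape[1])
print(f"non-zero entries: {nonzero_fraction:.2%}")

# Cumulative variance explained by the first six principal components.
# PCA needs a dense array; use TruncatedSVD instead for large sparse matrices.
pca = PCA(n_components=6).fit(X.toarray())
print("cumulative explained variance:", pca.explained_variance_ratio_.cumsum())
```

A near-flat `explained_variance_ratio_` curve, as seen here, is the sign that PCA is not finding a low-dimensional structure to project onto.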
So on the face of it, there doesn't seem to be any underlying structure in the data. On the other hand, the Hopkins statistic (which indicates 'clusterability' at the 95% confidence level when greater than 0.7) clocked in at 0.97, and using HDBSCAN clustering with t-SNE dimensionality reduction we do get promising results.
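The post doesn't say which implementation of the Hopkins statistic was used; as a from-scratch sketch, it compares nearest-neighbour distances of real points against those of uniformly sampled points. Values near 0.5 suggest uniform (unclustered) data, values approaching 1 suggest clustering:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, sample_size=None, rng=None):
    """Hopkins statistic: ~0.5 for uniform data, -> 1.0 for clustered data."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    m = sample_size or min(n // 10 + 1, n - 1)

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # Distances from m sampled real points to their nearest *other* real point
    # (column 0 is the point itself at distance 0, so take column 1)
    idx = rng.choice(n, m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    # Distances from m uniform random points (in the data's bounding box)
    # to their nearest real point
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, d))
    u = nn.kneighbors(U, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())

# Two tight blobs: clearly clustered, so the statistic should be close to 1
gen = np.random.default_rng(0)
X = np.vstack([gen.normal(0, 0.1, (100, 2)), gen.normal(5, 0.1, (100, 2))])
print(hopkins(X, rng=1))
```

For high-dimensional sparse TF-IDF data you would typically run this on a reduced representation (e.g. the t-SNE embedding) rather than the raw matrix.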