Thursday, August 7, 2025

Evaluating Embeddings for NLP and Document Clustering

Embeddings are an alternative to traditional methods of vectorisation. They were first proposed by Bengio et al. in 2003, who developed the first language models based on neural networks and included a so-called "embedding layer" in the process. Embeddings can be used in many NLP tasks to obtain vector representations of words in documents, and they encode surprisingly accurate syntactic and semantic word relationships.

Having established that embeddings performed better than TF-IDF for vectorising descriptions for clustering, I wanted to assess which embedding model would perform best on my data. I chose four models (GloVe, sBERT, Granite and Qwen) based on their similar rankings on the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard).
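As a minimal sketch of the vectorising step, here is how the descriptions could be embedded with the sentence-transformers library. The checkpoint names are illustrative assumptions only; they may not be the exact variants I used.

```python
# Sketch: embed the descriptions with each candidate model.
# The checkpoint names below are illustrative assumptions, not
# necessarily the exact models behind the figures in this post.
from sentence_transformers import SentenceTransformer

MODELS = {
    "GloVe":   "sentence-transformers/average_word_embeddings_glove.6B.300d",
    "sBERT":   "sentence-transformers/all-MiniLM-L6-v2",
    "Granite": "ibm-granite/granite-embedding-125m-english",
    "Qwen":    "Qwen/Qwen3-Embedding-0.6B",
}

def embed(descriptions, checkpoint):
    """Return one dense vector per description."""
    model = SentenceTransformer(checkpoint)
    return model.encode(descriptions, show_progress_bar=True)

# Example usage:
# vectors = {name: embed(descriptions, ckpt) for name, ckpt in MODELS.items()}
```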


After vectorising the data with each of these embedding models and using the resulting vectors to cluster it, I scored their accuracy with a KNN (k-nearest neighbours) classification model using 5-fold cross-validation. Plotting the accuracy for various values of K showed that all four models scored well (over 85%).
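A minimal sketch of that scoring step, assuming scikit-learn and that each document already has a target label to classify against; the `vectors` and `labels` names are placeholders carried over from the sketch above.

```python
# Sketch: score each embedding with a KNN classifier and 5-fold CV.
# Assumes `vectors` maps model name -> document vectors and `labels`
# holds the target class for each document.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def knn_accuracy(X, y, k_values=range(1, 21)):
    """Mean 5-fold cross-validated accuracy for each number of neighbours K."""
    scores = {}
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(knn, X, y, cv=5, scoring="accuracy").mean()
    return scores

# Example: accuracy-vs-K curves like the figures below
# for name, X in vectors.items():
#     print(name, knn_accuracy(np.asarray(X), labels))
```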
Here are the figures:

[Figures: KNN classification accuracy vs. number of neighbours K, one plot each for GloVe, sBERT, Granite and Qwen]
Note: The graphs for sBERT and Qwen may look the same but are in fact slightly different
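Embedding speed also factored into the decision. As a rough illustration, here is a minimal sketch of how the encoding time per model could be compared; it reuses the hypothetical MODELS dictionary and embed() helper from the sketch above.

```python
# Sketch: time how long each model takes to embed the corpus,
# reusing the hypothetical MODELS dict and embed() helper above.
import time

def timed_embeddings(descriptions, models):
    """Return {model name: (vectors, seconds taken to encode)}."""
    results = {}
    for name, checkpoint in models.items():
        start = time.perf_counter()
        vectors = embed(descriptions, checkpoint)
        results[name] = (vectors, time.perf_counter() - start)
    return results
```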

So it looks like any of these embedding models would perform well in my clustering problem. GloVe showed the best accuracy with a single neighbour, but the Granite embeddings were by far the slowest to produce and Qwen produced a sub-optimal number of clusters, so sBERT was chosen as the best 'all-rounder'.



