Thursday, August 7, 2025

Evaluating Embeddings for NLP and Document Clustering

Embeddings are an alternative to traditional vectorisation methods. They were first proposed by Bengio et al. in 2003, who developed the first language models based on neural networks; their solution included a so-called "embedding layer" in the architecture. Embeddings can be used in many NLP tasks to obtain vector representations of words in documents, and they can encode surprisingly accurate syntactic and semantic word relationships.

Having established that embeddings performed better than TF-IDF for vectorising descriptions for clustering, I wanted to assess which embedding model would perform best on my data. I chose four models with similar rankings on the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard).
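For context, here's roughly how the embeddings themselves get generated - a minimal sketch calling the sentence-transformers Python library from R via reticulate. The model name and the 'descriptions' vector are placeholders rather than the exact set-up used:

          # Sketch: embed a character vector of product descriptions
          library(reticulate)
          st    <- import("sentence_transformers")
          model <- st$SentenceTransformer("all-MiniLM-L6-v2")    # an sBERT-family model, as an example
          emb   <- model$encode(descriptions)                    # one row per description
          dim(emb)                                               # documents x embedding dimensions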


Vectorising the data with each of these embedding models and using the vectors to cluster the data, I was able to score their accuracy using a KNN (k-nearest neighbours) classification model with 5-fold cross-validation. Plotting accuracy for various values of K showed they all performed well (over 85%). Here are the figures:
[Figures: KNN classification accuracy vs. K for each embedding model - GloVe, sBERT, Granite, Qwen]
Note: The graphs for sBERT and Qwen may look the same but are in fact slightly different
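For reference, the scoring set-up can be sketched like this - assuming 'emb' is the embedding matrix and 'labels' holds the known categories being predicted; caret handles the 5-fold cross-validation and the sweep over K:

          library(caret)
          set.seed(42)
          ctrl <- trainControl(method = "cv", number = 5)           # 5-fold cross-validation
          fit  <- train(x = as.data.frame(emb), y = factor(labels),
                        method = "knn",
                        trControl = ctrl,
                        tuneGrid  = data.frame(k = seq(1, 21, 2)))  # values of K to try
          plot(fit)                                                 # accuracy vs. K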

So it looks like any of these embedding models would perform well on my clustering problem. GloVe showed the best accuracy with one neighbour, the Granite embeddings were by far the slowest to produce, and Qwen produced a sub-optimal number of clusters, so sBERT was chosen as the best 'all-rounder'.




Thursday, July 10, 2025

Can These Product Descriptions be Clustered?

This piece describes the first part of my dissertation project, where I was assessing the best methods for clustering some technical product descriptions. These were very difficult texts to work with, full of abbreviations, numbers and technical language.

First - vectorisation. Here we need to convert the text to a numerical representation (a matrix of numbers). Statistical methods like term frequency (and TF-IDF) count the number of times a word appears in a document and in how many documents it appears across the corpus.
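As a rough sketch of the TF-IDF step in R (using the tm package; 'descriptions' is an assumed character vector of product descriptions):

          library(tm)
          corpus <- VCorpus(VectorSource(descriptions))
          corpus <- tm_map(corpus, content_transformer(tolower))
          corpus <- tm_map(corpus, removePunctuation)
          dtm <- DocumentTermMatrix(corpus,
                                    control = list(weighting = weightTfIdf))
          length(dtm$v) / (nrow(dtm) * ncol(dtm))   # proportion of non-zero entries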

TF-IDF vectorisation performed poorly in this case, giving a matrix with only 0.006% non-zero entries - i.e. 99.994% zeros. This reflects how little the vocabulary overlaps between descriptions. K-means clustering combined with principal component analysis (PCA) to reduce the dimensionality also gave poor results: the first six components explain less than 20% of the variance, where you would typically expect a figure closer to 90%.
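The variance-explained check is straightforward with prcomp - continuing the sketch above and densifying the matrix purely for illustration:

          pca <- prcomp(as.matrix(dtm))                 # PCA on the TF-IDF matrix
          var_explained <- pca$sdev^2 / sum(pca$sdev^2)
          cumsum(var_explained)[1:6]                    # under 20% here; ~90% would be typical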

So on the face of it, there doesn't seem to be any underlying structure in the data. On the other hand, the Hopkins statistic (which indicates clusterability at the 95% confidence level when greater than 0.7) clocked in at 0.97... and using HDBSCAN clustering with tSNE dimensionality reduction we do get promising results.
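A minimal sketch of those two checks, assuming 'X' is the vectorised data as a plain matrix (the hopkins, Rtsne and dbscan packages are one way to do it):

          library(hopkins)    # hopkins(): values near 1 suggest clusterable data
          library(Rtsne)
          library(dbscan)

          set.seed(42)
          hopkins(X, m = nrow(X) %/% 10)           # clocked in at ~0.97 here

          tsne <- Rtsne(X, perplexity = 30, check_duplicates = FALSE)
          cl   <- hdbscan(tsne$Y, minPts = 10)     # cluster the 2-D tSNE layout
          plot(tsne$Y, col = cl$cluster + 1, pch = 19)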








Thursday, May 1, 2025

Classification of Airline Customer Data

 

Introduction

The task under consideration is the classification of airline customer data to predict satisfaction. The data consists of a range of attributes: continuous variables such as departure delay and flight distance, ordinal data from passenger surveys rated 0 or 1 to 5, and some categorical data such as gender and whether the customer has a loyalty card. Unlike in other classification problems such as fraud detection or loan default prediction, the target is fairly balanced, with 44% 'satisfied' and 56% 'neutral or unsatisfied'. A few of the attributes show a significant amount of missing values - nearly 30% in some cases. Here, values were imputed using the mean or median, or dealt with in other ways.

Two machine learning techniques commonly found in the literature as appropriate for this kind of classification task are naïve Bayes and random forest, and these are the subject of this comparison work. Naïve Bayes is a simple probabilistic algorithm that uses Bayes' theorem and assumes feature independence. A random forest is a collection of decision trees, each trained on a random subset of the data, with the final classification decided by majority vote - an 'ensemble' technique. Both are simple to implement and robust in that they are not regarded as prone to overfitting. Despite their simplicity, both have been reported as "surprisingly accurate" in use.
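A minimal R sketch of the baseline comparison (the data frame, column names and the 'satisfied' factor level are assumptions, not the exact set-up used):

          library(randomForest)
          library(e1071)
          library(pROC)

          set.seed(1)
          idx   <- sample(nrow(air), 0.8 * nrow(air))
          train <- air[idx, ];  test <- air[-idx, ]

          rf <- randomForest(satisfaction ~ ., data = train, ntree = 500)
          nb <- naiveBayes(satisfaction ~ ., data = train)

          rf_prob <- predict(rf, test, type = "prob")[, "satisfied"]
          nb_prob <- predict(nb, test, type = "raw")[, "satisfied"]
          auc(test$satisfaction, rf_prob)    # AUC for the base random forest
          auc(test$satisfaction, nb_prob)    # AUC for naive Bayes

          rf_pred <- predict(rf, test)       # class predictions for an F-score
          caret::confusionMatrix(rf_pred, test$satisfaction,
                                 positive = "satisfied")$byClass["F1"]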

Results

Feature selection was undertaken by successively removing data using the column filter, one column at a time. Table 1 shows the change in scores as features were removed, with negative changes (more than -0.002 to allow for rounding differences) highlighted in pink. Positive differences (more than +0.002) are highlighted in green. If the score is reduced on feature removal it means that the model is negatively affected – the feature contributes to the model. Conversely if the score increases the model gets better without the feature.

With the random forest, most of the features make a contribution. For naïve Bayes, there are a number of features whose removal improves the model. 
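The drop-one-column-at-a-time check itself can be sketched as a loop (continuing the R sketch above; random forest and AUC shown, with the 'satisfied' level again an assumption):

          auc_without <- function(drop_col) {
            keep <- setdiff(names(train), drop_col)
            fit  <- randomForest(satisfaction ~ ., data = train[, keep], ntree = 200)
            prob <- predict(fit, test[, keep], type = "prob")[, "satisfied"]
            as.numeric(auc(test$satisfaction, prob))
          }

          features <- setdiff(names(train), "satisfaction")
          aucs     <- sapply(features, auc_without)
          sort(aucs)    # lowest first: removing these features hurts the model most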

Table 1: Feature selection - effect on model scores

 

                                      Random Forest        Naïve Bayes
Filtered Out                          F-score   AUC        F-score   AUC
None                                  0.892     0.968      0.778     0.881
Gender                                0.887     0.967      0.786     0.882
Customer loyalty                      0.871     0.958      0.785     0.878
Age                                   0.889     0.968      0.784     0.882
Type of Travel                        0.859     0.955      0.756     0.866
Class                                 0.886     0.967      0.771     0.875
Online check-in                       0.888     0.964      0.780     0.876
Flight Distance                       0.890     0.967      0.775     0.882
Departure/Arrival time convenient     0.889     0.966      0.784     0.882
Ease of Online booking                0.873     0.948      0.779     0.875
Gate location                         0.873     0.948      0.785     0.882
Food and drink                        0.886     0.967      0.787     0.884
Seat comfort                          0.882     0.964      0.777     0.883
Inflight entertainment                0.890     0.967      0.786     0.888
On-board service                      0.888     0.967      0.783     0.882
Leg room service                      0.889     0.966      0.788     0.880
Baggage handling                      0.890     0.965      0.781     0.884
Checkin service                       0.884     0.964      0.792     0.879
Inflight service                      0.886     0.966      0.784     0.883
Cleanliness                           0.885     0.964      0.782     0.883
Departure delay in minutes            0.890     0.968      0.795     0.882
Arrival delay in minutes              0.892     0.967      0.793     0.882


Using these set-ups, the random forest model showed a slight decline in both F-score and AUC, while the naïve Bayes model improved - see Table 2.

Table 2: Results summary before and after tuning and feature selection

 

                                 Random Forest        Naïve Bayes
                                 F-score   AUC        F-score   AUC
Base model                       0.892     0.968      0.778     0.881
Tuned parameters and features    0.886     0.966      0.789     0.887

 Conclusions

Overall, model performance was at a high level, with general 'accuracy' in the 80-90% range and recall of 80-85%. This proved hard to improve on, whether through tuning parameters in the random forest model or selecting which features to include. On the other hand, both models are simple to set up and run. Kelleher et al. (2015) note that naïve Bayes models are often used "to define a baseline accuracy score" because they are so easy to implement.

Predicting customer satisfaction was not the aim here. What the model might do is allow the airline to analyse which features are important contributors to satisfaction, which means studying 'accuracy' in terms of the positive outcomes - which is why evaluation measures such as recall, F-score and AUC were chosen. This worked well with the naïve Bayes model, where we saw a clear outcome from feature selection, with some attributes contributing positively to the model and some negatively. For the random forest algorithm this was not so successful, as all attributes made a mild positive contribution. In terms of interpretability a single decision tree may have been preferable, although that would have brought other problems such as a tendency to overfit and probably lower accuracy. Binary logistic regression is another method that might have been useful, as its outputs are explicit in showing attribute contribution. Another option would be to use several models together. Khan et al. (2024) review the literature on classification problems (albeit for class-imbalance problems) and conclude that ensemble methods generally show better performance.

References

Kelleher, J., Mac Namee, B., & D'Arcy, A. (2015). Fundamentals of machine learning for predictive data analytics. MIT Press.

Khan, A., Chaudhari, O., & Chandra, R. (2024). A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation. Expert Systems with Applications. https://doi.org/10.1016/j.eswa.2023.122778

Thursday, April 3, 2025

Sentiment Analysis Down The Pub

I've become fascinated by NLP recently - although, jarringly, I still primarily translate those initials as 'neuro-linguistic programming', a "pseudoscientific approach to communication, personal development and psychotherapy" (Wikipedia), first proposed in 1975. No, we are of course talking about Natural Language Processing - which essentially covers the myriad ways of extracting meaning from data. 

I was looking at the TripAdvisor reviews for a pub customer recently and wondering how I could represent them in an easy to digest format. For sure I could show a table of ratings - but they do that on the site anyway, and raw numbers are (a) sometimes hard to process and (b) wouldn't necessarily give you the full flavour of the reviews. Time for some NLP! No, not neuro-linguistic programming. Natural language processing, yes... we've been over that. 

The basic idea was to scrape the data, process it, then represent it in a meaningful way. Literal screen-scraping is tedious but not difficult; once I had a goodly number of reviews I edited/mashed them into a usable format: there were only a few hundred, so it was easiest to do this manually given the raw structure. The next step involved removing stopwords such as 'the', 'and' or 'of', followed by labelling the remaining words with sentiment from a lexicon (I used a standard R one). The lexicon contained a large number of words, each with an associated sentiment score - positive words like 'brilliant' or 'good' score +1, negative words like 'broken' or 'useless' score -1. So what we then had was a 'stopped' set of reviews scored for sentiment.
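In R, the stopword-plus-lexicon step can be sketched with tidytext - the Bing lexicon shown here is just one option, and 'raw_reviews' stands in for the scraped text:

          library(dplyr)
          library(tidytext)

          reviews <- tibble(review = seq_along(raw_reviews), text = raw_reviews)
          scored  <- reviews %>%
            unnest_tokens(word, text) %>%                        # one word per row
            anti_join(stop_words, by = "word") %>%               # drop the stopwords
            inner_join(get_sentiments("bing"), by = "word") %>%  # keep only lexicon words
            mutate(score = ifelse(sentiment == "positive", 1, -1))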

At this point you could take a measure of sentence-based sentiment - weighing up the sentiment of each review by adding the positive and negative scores, and perhaps totalling the positives and negatives overall. Are the reviews good or bad overall? There's your answer. But what I wanted to do was visualise the scores in some way, so I decided to make a word cloud. By representing the positive words in green and the negative words in red, with the size of each word representing the frequency of its occurrence in classic word-cloud style, I made what I think is a pleasingly striking representation of the TripAdvisor reviews that says so much more than a table of data.
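And the coloured word cloud itself, continuing the sketch above (the wordcloud package is one way to do it):

          library(wordcloud)

          counts <- scored %>% count(word, sentiment, sort = TRUE)
          wordcloud(counts$word, counts$n,
                    max.words      = 100,
                    colors         = ifelse(counts$sentiment == "positive",
                                            "forestgreen", "firebrick"),
                    ordered.colors = TRUE,      # colour each word by its own sentiment
                    random.order   = FALSE)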






Thursday, March 13, 2025

War, what is it good for... NLP!

On the back of the last post about sentiment analysis word clouds for pub reviews on TripAdvisor, I decided on another quick-and-dirty exercise: a word cloud for reviews of one of my favourite films, Saving Private Ryan (directed by Steven Spielberg, starring Tom Hanks, yes that one). IMDb was the source this time:




I like this one because it really captures the essence of the reviews. The colours are set at random so they don't represent anything; however, words like 'masterpiece', 'best' and other positives really stand out. And OK, 'movie', 'film' and 'war' are prominent, but they are descriptive and you wouldn't want to filter them out.



Wednesday, February 26, 2025

Modelling Daily Takings at the Pub



Our pub, the Talbot Taphouse, is not our pub any more - we quit nearly ten years ago. However, we still have data so I thought I'd give some modelling ideas a go. What factors might influence how much a pub makes on a daily basis? Day of the week. Weather. Beers on tap and their quality. Ambience and cleanliness. Staff... efficiency? Attitude? The Talbot was wet-led so food doesn't come into the equation here.

I had a sliver of data for Q3 2015 so day of the week we can do. Using the Weather Underground web site I can get daily weather data for the period, sourced from a weather station at East Midlands Airport: close enough? For some reason rainfall wasn't recorded so humidity will have to do, along with temperature, dew point, wind speed and pressure. 

First we check a few things in the data to see if multiple linear regression is a permissible modelling regime. This involves questions about distributions (normal?), outliers (none?) and variances (similar?), each of which has a specific statistical test that I won't go into right now. Suffice to say the data pretty much passed the tests, although I had some misgivings about outliers.
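For the record, those checks look something like this - a sketch assuming a data frame 'takings' with a numeric 'daily_take' column and a 'day' factor:

          shapiro.test(takings$daily_take)                    # normality
          boxplot(takings$daily_take, main = "Outlier check") # visual outlier check
          bartlett.test(daily_take ~ day, data = takings)     # similar variances across days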

I was going to do a layered approach, adding various data sets to a model and seeing which ones were effective at describing the takings. First up, I tried a 'weather-only' model, using all weather fields. This fared poorly, with an R-squared value of 0.00585, meaning the model explains just 0.6% of the variation in daily takings.

Add in the day of the week and the R-squared shoots up to 0.7747, or 77.5%! In fact, interrogating this model showed that only Thursday, Friday, Saturday and Sunday were significant in predicting takings. 

Next I removed the weather data and just used day of the week. Here, the R-squared value held steady at 0.7517, and all the days except Friday were significant predictors of takings. Going back to the data, I was intrigued to see how well individual points might be predicted, so I took 20% of the points as a test set and retrained the model on the remaining 80%. R-squared was 0.7548, comparable with previous iterations. Running the previously unseen test data through the model, I got its predicted values. These are shown in the graph above: blue crosses are the actual (training) values and red circles are the model's predictions. Looks pretty good, eh?
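Here's a sketch of the layered approach, using the same assumed 'takings' data frame with the weather columns joined on (column names are placeholders):

          # weather-only model: R-squared ~ 0.006
          m_weather <- lm(daily_take ~ temp + humidity + dew_point + wind + pressure,
                          data = takings)
          summary(m_weather)$r.squared

          # day-of-week only: R-squared ~ 0.75
          m_day <- lm(daily_take ~ day, data = takings)
          summary(m_day)$r.squared

          # 80/20 split to check predictions on unseen days
          set.seed(1)
          idx  <- sample(nrow(takings), 0.8 * nrow(takings))
          m_tr <- lm(daily_take ~ day, data = takings[idx, ])
          pred <- predict(m_tr, newdata = takings[-idx, ])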

As for the other data, well back in 2015 we didn't really do Untappd (for beer scores) or TripAdvisor (for ambience and staff) so I'll have to leave things there for this pub. I'm working on getting data for other pubs so watch this space...

Friday, February 21, 2025

KNN Clustering To Classify Song Popularity

Looking at the data from Spotify, we can see that in the 'Acoustic Features', the 'energy' and 'loudness' features are closely related - and it makes logical sense too, if you think about it. If we plot the two on a simple scatter chart we get this:



So I got to thinking: could we use these two features to classify a song in terms of its popularity? Here, KNN (k-nearest neighbours) classification seems the appropriate method. This is a type of supervised learning, where the classes are defined beforehand and the data is used to train a model to fit those categories. I used the boolean 'Is Popular' field as the target, with 80% of the data used for training and 20% reserved for testing.

So what we are attempting to do is "train a model that will predict whether a song is popular or not using the descriptor numbers loudness and energy". 

KNN classification assigns each item to a class according to the majority class among a number, K, of its nearest neighbours (hence the 'NN').

First we have to scale the values of loudness to between 0 and 1, the same as the energy scores. This means the model won't over-weight the feature with the larger numeric range and be biased towards it. A basic min-max normalisation looks like this:

                Value = (X - min(X)) / (max(X) - min(X))

We do this for the test and training values of loudness. 
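In code that's just a couple of lines - one way to apply it, using the training range to scale both sets:

                min_max <- function(x, lo = min(x), hi = max(x)) (x - lo) / (hi - lo)

                # scale loudness with the training range so test data doesn't leak into the scaling
                lo <- min(train$loudness);  hi <- max(train$loudness)
                train$loudness <- min_max(train$loudness, lo, hi)
                test$loudness  <- min_max(test$loudness,  lo, hi)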

The data sets look now like this:

                energy loudness is_pop
                    <dbl>    <dbl> <lgl> 
                 1  0.575    0.431 FALSE 
                 2  0.911    0.751 TRUE  
                 3  0.871    0.758 TRUE  
                 4  0.577    0.653 FALSE 
                 5  0.928    0.877 FALSE 
                 6  0.777    0.759 FALSE 
                 7  0.493    0.744 FALSE 

Finally we need to set up a file of labels for the model. These are just the TRUE and FALSE labels from the 'is_pop' field.

Running a KNN model is simple. First we add a few (packages and) libraries:

          library(e1071)    # misc statistics/ML functions
          library(caTools)  # utilities incl. train/test splitting
          library(class)    # provides knn()

Then run the model! Here I've taken K to be 3, but we can run through different values later.
                
                knn_model = knn(train,test,train_lab,k=3)

Using the test data and labels, which haven't been used in the model, we can work out how many of the new values are put in the TRUE and FALSE categories incorrectly. This is the 'misclassification rate', and subtracting it from 1 gives us a measure of accuracy. This changes with the value of K (although not by much in our model).
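A sketch of the loop that produces those numbers, assuming 'test_lab' holds the test labels:

                ks  <- 1:20
                acc <- sapply(ks, function(k) {
                  pred <- knn(train, test, train_lab, k = k)
                  mean(pred == test_lab)       # accuracy = 1 - misclassification rate
                })
                plot(ks, acc, type = "b", xlab = "K", ylab = "Accuracy")

Here are the accuracies for different K values: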

 

We can see that the misclassification rate drops as K increases. The best is K=18, where the accuracy is 53% - better than random chance, but not by much! Not a great model, but it demonstrates the principle pretty effectively. This plot shows that the classification errors are distributed across the values, which is good:


To improve the model, other acoustic features could be added or substituted. Take a look at the previous post for a view of how the acoustic features map to popularity (clue: it's a Sankey diagram!)









