Friday, February 21, 2025

Using KNN To Classify Song Popularity

Looking at the data from Spotify, we can see that among the 'Acoustic Features', the 'energy' and 'loudness' features are closely related - which makes logical sense, if you think about it. If we plot the two on a simple scatter chart we get this:

[Scatter plot: energy vs loudness]

So I got to thinking: could we use these two features to classify a song in terms of its popularity? Here, KNN classification seems the appropriate method. This is a type of supervised learning, where the classes are defined beforehand and the data is used to train the model to fit those categories. I used the boolean 'Is Popular' field as the target, with 80% of the data used for training and 20% reserved for testing.
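
As a minimal sketch of that split (the data frame name 'songs', the column selection and the use of sample() are my assumptions, not necessarily the original code):

          set.seed(42)  # make the 80/20 split reproducible

          # Hypothetical 'songs' data frame with energy, loudness and is_pop columns
          idx   <- sample(nrow(songs), size = round(0.8 * nrow(songs)))
          train <- songs[idx,  c("energy", "loudness")]
          test  <- songs[-idx, c("energy", "loudness")]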

So what we are attempting to do is "train a model that will predict whether a song is popular or not using the 'loudness' and 'energy' descriptors".

KNN classification assigns each item to a class according to the majority vote of a number, K, of its nearest neighbours (hence the 'NN'). For example, with K=3, if two of a point's three nearest neighbours are popular songs, the point is classified as popular.

First we have to scale the values of loudness to between 0 and 1, the same range as the energy scores. This stops the feature with the larger numeric range from dominating the distance calculation and biasing the model. A basic min-max normalisation looks like this:

                Value = (X - min(X)) / (max(X) - min(X))

We do this for the test and training values of loudness. 
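
In R, that might look like the sketch below, reusing the train and test frames from the split above. Taking the min and max from the training split only is my choice here, to avoid leaking information from the test set:

          # Min-max normalisation to the 0-1 range
          normalise <- function(x, lo, hi) (x - lo) / (hi - lo)

          # Use the training split's min and max for both splits
          lo <- min(train$loudness)
          hi <- max(train$loudness)
          train$loudness <- normalise(train$loudness, lo, hi)
          test$loudness  <- normalise(test$loudness, lo, hi)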

The data sets now look like this:

                   energy loudness is_pop
                    <dbl>    <dbl> <lgl>
                 1  0.575    0.431 FALSE
                 2  0.911    0.751 TRUE
                 3  0.871    0.758 TRUE
                 4  0.577    0.653 FALSE
                 5  0.928    0.877 FALSE
                 6  0.777    0.759 FALSE
                 7  0.493    0.744 FALSE

Finally we need to set up a vector of labels for the model. These are just the TRUE and FALSE values from the 'is_pop' field.
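
Continuing the sketch, and reusing the idx indices assumed above, that is just:

          # Label vectors: the TRUE/FALSE values of is_pop for each split
          train_lab <- songs$is_pop[idx]    # used to fit the model
          test_lab  <- songs$is_pop[-idx]   # held back to measure accuracy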

Running a KNN model is simple. First we install the packages (if needed) and load the libraries:

          # install.packages(c("e1071", "caTools", "class"))  # first-time setup
          library(e1071)
          library(caTools)
          library(class)    # provides the knn() function

Then run the model! Here I've taken K to be 3, but we can run through different values later.
                
                # knn() returns the predicted class for each row of 'test'
                knn_model <- knn(train, test, train_lab, k = 3)

Using the test data and labels, which haven't been used in the model, we can work out how many of these new values are put in the TRUE and FALSE categories incorrectly. This is the 'misclassification rate', and subtracting it from 1 gives us a measure of accuracy. This changes with the value of K (although not by much in our model).
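
A quick way to get that number is a confusion table of the predictions against the held-out labels. A sketch, reusing test_lab from above:

          # Confusion table: predicted vs actual labels on the test set
          tab <- table(predicted = knn_model, actual = test_lab)
          misclass <- 1 - sum(diag(tab)) / sum(tab)   # misclassification rate
          accuracy <- 1 - misclass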

Here are the accuracies for different K values:

[Plot: accuracy for different values of K]

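A sweep like the one behind this plot might look something like this (again a sketch, not the original code; knn() is simply re-run for each K):

          # Try K = 1..20 and record the test-set accuracy for each
          ks  <- 1:20
          acc <- sapply(ks, function(k) {
            pred <- knn(train, test, train_lab, k = k)
            mean(pred == test_lab)   # accuracy = 1 - misclassification rate
          })
          plot(ks, acc, type = "b", xlab = "K", ylab = "Accuracy")
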
We can see that the misclassification rate drops as K increases. The best is K=18, where the accuracy is 53%. Better than random chance, but not by much! Not a great model, but it demonstrates the principle pretty effectively. This plot shows that the classification errors are distributed across the values, which is good:

[Plot: distribution of classification errors across the values]

To improve the model, other acoustic features could be added or substituted. Take a look at the previous blog for a view on how the acoustic features map to popularity (clue: it's a Sankey diagram!)
