Looking at the data from Spotify, we can see that among the 'Acoustic Features', 'energy' and 'loudness' are closely related, which makes intuitive sense if you think about it. If we plot the two on a simple scatter chart we get this:
So what we are attempting to do is train a model that will predict whether a song is popular or not, using the loudness and energy descriptors.
KNN (K-nearest neighbours) classification assigns each item to a class by majority vote among a number, K, of its nearest neighbours (hence the 'NN').
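As a tiny illustration of the idea (using made-up points, not the Spotify data), here is `class::knn` classifying a new point by majority vote among its three nearest neighbours:

```r
library(class)

# Four labelled training points in a 2-D feature space
train  <- data.frame(x = c(0.1, 0.2, 0.8, 0.9),
                     y = c(0.1, 0.3, 0.8, 0.9))
labels <- factor(c("quiet", "quiet", "loud", "loud"))

# Classify a new point by majority vote among its K = 3 nearest neighbours
pred <- knn(train, data.frame(x = 0.85, y = 0.9), cl = labels, k = 3)
pred
# The new point sits next to the two "loud" examples, so two of its
# three nearest neighbours are "loud" and it is labelled "loud"
```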
First we have to scale the values of loudness to between 0 and 1, the same range as the energy scores. Loudness is measured in decibels and spans a much wider range than energy, so without scaling the distance calculation would be dominated by loudness and the model would be biased towards it. A basic min-max normalisation looks like this:
Value = (X - min(X)) / (max(X) - min(X))
We do this for the test and training values of loudness.
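That formula can be written as a small helper function. The loudness values below are made up for illustration; one detail worth noting is that the test set is usually scaled with the *training* set's min and max, so both sets share the same mapping:

```r
# Min-max normalisation: (x - min) / (max - min), mapping values into [0, 1]
normalise <- function(x, lo = min(x), hi = max(x)) {
  (x - lo) / (hi - lo)
}

# Illustrative loudness values in dB (Spotify loudness is typically negative)
train_loudness <- c(-21.3, -7.2, -4.5, -11.8)
test_loudness  <- c(-9.0, -15.6)

scaled_train <- normalise(train_loudness)
scaled_test  <- normalise(test_loudness,
                          lo = min(train_loudness),  # training min and max,
                          hi = max(train_loudness))  # not the test set's own
```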
The data sets look now like this:
energy loudness is_pop
<dbl> <dbl> <lgl>
1 0.575 0.431 FALSE
2 0.911 0.751 TRUE
3 0.871 0.758 TRUE
4 0.577 0.653 FALSE
5 0.928 0.877 FALSE
6 0.777 0.759 FALSE
7 0.493 0.744 FALSE
Finally we need to set up a separate vector of labels for the model. These are just the TRUE and FALSE values from the 'is_pop' field.
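In R that separation might look like the sketch below (the first few rows here mirror the table above; the column names are taken from it):

```r
# A few rows mirroring the table above (energy, loudness, is_pop)
train <- data.frame(energy   = c(0.575, 0.911, 0.871),
                    loudness = c(0.431, 0.751, 0.758),
                    is_pop   = c(FALSE, TRUE, TRUE))

# The labels are just the is_pop column, kept as a separate vector;
# the feature table passed to knn() should hold only the numeric columns
train_lab <- train$is_pop
train     <- train[, c("energy", "loudness")]
```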
Running a KNN model is simple. First we load a few libraries (installing the packages with install.packages() first, if needed):
library(e1071)
library(caTools)
library(class)
Then run the model! Here I've taken K to be 3, but we can run through different values later.
knn_model <- knn(train = train, test = test, cl = train_lab, k = 3)
Using the test data and labels, which haven't been used in training the model, we can work out how many of these new values are put in the TRUE and FALSE categories incorrectly. This is the 'misclassification rate', and subtracting it from 1 gives us a measure of accuracy. This changes with the value of K (although not by much in our model). Here are the accuracies for different K values:
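A sweep over K values can be sketched like this. Since the real Spotify extract isn't included here, the block below stands in synthetic data with a noisy rule linking the features to popularity, just to show the shape of the loop:

```r
library(class)

set.seed(1)  # reproducible synthetic stand-in for the Spotify data
n      <- 200
feats  <- data.frame(energy = runif(n), loudness = runif(n))
is_pop <- feats$energy + feats$loudness + rnorm(n, sd = 0.4) > 1  # noisy rule

idx       <- sample(n, 150)                # 150 training rows, 50 test rows
train     <- feats[idx, ];  test     <- feats[-idx, ]
train_lab <- is_pop[idx];   test_lab <- is_pop[-idx]

# Accuracy = 1 - misclassification rate, for each K from 1 to 20
accuracy <- sapply(1:20, function(k) {
  pred <- knn(train, test, cl = train_lab, k = k)
  mean(as.character(pred) == as.character(test_lab))
})
which.max(accuracy)   # the K with the highest accuracy on the test set
```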
We can see that the misclassification rate drops as K increases. The best is K=18, where the accuracy is 53%. Better than random chance, but not by much! Not a great model, but it demonstrates the principle pretty effectively. This plot shows that the classification errors are distributed across the range of values, which is good:
To improve the model, other acoustic features could be added or substituted. Take a look at the previous blog for a view of how the acoustic features map to Popularity (clue: it's a Sankey diagram!)