Friday, February 21, 2025

KNN Clustering To Classify Song Popularity

Looking at the data from Spotify, we can see that in the 'Acoustic Features', the 'energy' and 'loudness' features are closely related - and it makes logical sense too, if you think about it. If we plot the two on a simple scatter chart we get this:



So I got to thinking: could we use these two features to classify a song in terms of its popularity? Here, K-nearest neighbours (KNN) classification seems the appropriate method. This is a type of supervised learning, where the classes are defined beforehand and labelled data is used to train the model to assign new items to them. I used the boolean 'Is Popular' field as the target, with 80% of the data used for training and 20% reserved for testing.
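As a rough sketch of the 80/20 split described above (the `songs` data frame here is a made-up stand-in, and the names `train_idx`, `train_set` and `test_set` are my own assumptions rather than anything from the original analysis):

```r
# Hypothetical stand-in for the song data; the real data frame is not shown here.
set.seed(42)  # make the random split reproducible
songs <- data.frame(
  energy   = runif(100),
  loudness = runif(100, min = -30, max = 0),
  is_pop   = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

# Pick 80% of the row numbers at random for training
train_idx <- sample(nrow(songs), size = floor(0.8 * nrow(songs)))

train_set <- songs[train_idx, ]   # rows used to train the model
test_set  <- songs[-train_idx, ]  # remaining rows held back for testing
```

Setting a seed simply means the same rows land in each set every time the script is run.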

So what we are attempting to do is "train a model that will predict whether a song is popular or not using the descriptor numbers loudness and energy". 

KNN classification assigns each item to a class according to the classes of its K nearest neighbours (hence the 'NN'): the most common label among those K neighbours wins.

First we have to scale the values of loudness to between 0 and 1, the same as the energy scores. KNN works on distances between points, so without this the feature with the larger range would dominate and bias the model. A basic min-max normalisation looks like this:

                Value = (X - min(X)) / (max(X) - min(X))

We do this for the test and training values of loudness. 
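A minimal sketch of that min-max formula in R. The loudness numbers are invented, and reusing the training minimum and maximum for the test values is one common choice that I'm assuming here, not necessarily what was done for the results in this post:

```r
# Min-max normalisation: (x - min) / (max - min).
# lo and hi default to the range of x itself, but can be supplied explicitly.
min_max <- function(x, lo = min(x), hi = max(x)) {
  (x - lo) / (hi - lo)
}

# Made-up loudness values (in dB) purely for illustration
train_loudness <- c(-20, -10, -5, 0)
test_loudness  <- c(-15, -2.5)

train_scaled <- min_max(train_loudness)

# Assumption: scale the test values with the training range,
# so both sets are on exactly the same scale
test_scaled <- min_max(test_loudness,
                       lo = min(train_loudness),
                       hi = max(train_loudness))
```

With these numbers the training values become 0, 0.5, 0.75 and 1, and the test values 0.25 and 0.875.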

The data sets now look like this:

                   energy loudness is_pop
                    <dbl>    <dbl> <lgl> 
                 1  0.575    0.431 FALSE 
                 2  0.911    0.751 TRUE  
                 3  0.871    0.758 TRUE  
                 4  0.577    0.653 FALSE 
                 5  0.928    0.877 FALSE 
                 6  0.777    0.759 FALSE 
                 7  0.493    0.744 FALSE 

Finally we need to set up a vector of labels for the model. These are just the TRUE and FALSE values from the 'is_pop' field.

Running a KNN model is simple. First we load a few packages:

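          library(e1071)    # general statistics and machine-learning helpers
          library(caTools)  # provides sample.split(), handy for train/test splits
          library(class)    # provides knn()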

Then run the model! Here I've taken K to be 3, but we can run through different values later.
                
                # knn() returns the predicted class for each row of test
                knn_model <- knn(train, test, cl = train_lab, k = 3)

Using the test data and labels, which haven't been used to train the model, we can work out the proportion of these new values that are put in the TRUE and FALSE categories incorrectly. This is the 'misclassification rate', and subtracting it from 1 gives us a measure of accuracy. This changes with the value of K (although not by much in our model). Here are the accuracies for different K values:

 

We can see that the misclassification rate drops as K increases. The best is K=18, where the accuracy is 53%. Better than random chance, but not by much! Not a great model, but it demonstrates the principle pretty effectively. This plot shows that the classification errors are distributed across the values, which is good:


To improve the model, other acoustic features could be added or substituted. Take a look at the previous blog post for a view on how the acoustic features map to popularity (clue: it's a Sankey diagram!)
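For anyone wanting to try the accuracy-versus-K step described earlier, here is a minimal sketch. The data is simulated, and `test_lab`, `accuracy_for_k` and `best_k` are names I've assumed for illustration; only `knn()` and the `train`, `test` and `train_lab` names come from the call shown above:

```r
library(class)  # provides knn()

set.seed(1)
# Simulated, already-scaled features standing in for the real song data
n <- 200
features <- data.frame(energy = runif(n), loudness = runif(n))
is_pop <- factor(features$energy + features$loudness + rnorm(n, sd = 0.3) > 1)

# 80/20 split into training and test sets
train_idx <- sample(n, size = 0.8 * n)
train     <- features[train_idx, ]
test      <- features[-train_idx, ]
train_lab <- is_pop[train_idx]
test_lab  <- is_pop[-train_idx]  # assumed name for the held-back labels

# Accuracy = 1 - misclassification rate on the test set
accuracy_for_k <- function(k) {
  predicted <- knn(train, test, cl = train_lab, k = k)
  misclassification <- mean(as.character(predicted) != as.character(test_lab))
  1 - misclassification
}

k_values   <- 1:20
accuracies <- sapply(k_values, accuracy_for_k)
best_k     <- k_values[which.max(accuracies)]
```

Because the data is simulated, the accuracies it produces say nothing about the real songs; it only shows the mechanics of looping over K.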










Monday, February 17, 2025

Predicting Song Popularity From Acoustic Features

Previously, I've shown how I explored both the song popularity information derived from Billboard Top 100 data and Spotify 'Acoustic Features'. The first port of call for predicting one from the other is a multiple linear regression model, which is relatively straightforward and easy to understand. However, as previously covered, it's not a suitable model for this data, so I moved to the next option: binary logistic regression. This model is less sensitive to data issues but only predicts (as its name suggests) a binary outcome, i.e. yes/no, true/false, etc.

Handily, the data I'm looking at (MusicOSet, curated by Mariana Silva) defines a field called 'Is Pop' as part of the song entities. This is a binary yes/no field that says whether a song is popular, based on annual popularity scores. These are derived from Billboard chart data and can be seen as a measure of the commercial success of songs. See figure 5.1 for a Sankey diagram showing how 'popular or not' relates to song genre. Most genres are fairly equally split between popular and not; niche genres such as disco, funk and alternative metal are generally 'not popular', which seems logical given their relative lack of chart-beating success.

Fig. 5.1 Sankey diagram showing how yes/no popular relates to song genre

Remember that there are a number of acoustic features from the Spotify side of the data, which describe each song - I chose nine of the more subjective features and the distributions are shown in figure 5.2. 

Fig 5.2. Histograms showing the distribution of acoustic features values

So, using the 'is_pop' field to represent popularity/success, I built a binary logistic model with acoustic features as the inputs. My simple model gave an accuracy of 58.12%, with all inputs showing a statistically significant influence on popularity except acousticness, danceability and liveness (for the complete stats, see table 5.1). Not a fabulous result, but it is a fairly simple model. Academics have been chasing hit song prediction for years now and, by pulling in multiple methods and ever more fancy data processing, now report success in the region of 87% (for example Zhao et al., 2023).
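A minimal sketch of fitting this kind of binary logistic model with base R's `glm()`. The data is simulated with only three inputs, the 0.5 cut-off for calling a song 'popular' is an assumption, and the accuracy here is measured on the same data used for fitting, so it is only an illustration of the mechanics:

```r
set.seed(7)
n <- 500

# Simulated stand-in for the song data (three features instead of the full set)
songs <- data.frame(energy = runif(n), loudness = runif(n), valence = runif(n))
log_odds <- -0.7 - 0.75 * songs$energy + 0.8 * songs$loudness + 0.5 * songs$valence
songs$is_pop <- rbinom(n, size = 1, prob = plogis(log_odds)) == 1

# Binary logistic regression: family = binomial gives the logit link
model <- glm(is_pop ~ energy + loudness + valence, data = songs, family = binomial)

# Coefficient table: Estimate, Std. Error, z value and p-value for each input
coef_table <- summary(model)$coefficients

# Predicted probability above 0.5 counts as 'popular' (assumed cut-off)
predicted_pop <- predict(model, type = "response") > 0.5
accuracy <- mean(predicted_pop == songs$is_pop)
```

The columns of `coef_table` correspond to the columns reported in table 5.1.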

Table 5.1. Coefficients from logistic regression

                  Estimate  Std Err  z Value  significance (p)
(Intercept)          -0.69     0.15    -4.51            <0.001
acousticness          0.00     0.10     0.03             0.976
danceability          0.27     0.16     1.75             0.080
energy               -0.75     0.18    -4.24            <0.001
instrumentalness     -0.59     0.17    -3.39            <0.001
liveness             -0.07     0.13    -0.52             0.606
loudness              0.83     0.21     3.96            <0.001
speechiness          -1.78     0.28    -6.34            <0.001
valence               0.52     0.10     5.00            <0.001

<technical point> We start from the assumption (the 'null hypothesis') that a feature has no influence on popularity. When the significance value (p) is high, we have no grounds to reject that assumption. When it is low (in this case less than 0.001 or 0.1%), the data would be very unlikely if 'no influence' were true, so we reject it. Therefore the feature does have an influence. Confusingly double-negative but that's statistics for you! </technical point>
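To make the link between the z values and the significance column concrete, here is a small sketch that turns three of the z values from table 5.1 into two-sided p-values using the normal distribution. The 0.05 threshold is a common convention that I'm assuming; it isn't stated in the analysis above:

```r
# z values copied from table 5.1
z_values <- c(acousticness = 0.03, danceability = 1.75, energy = -4.24)

# Two-sided p-value: probability of a z at least this extreme if there is 'no influence'
p_values <- 2 * pnorm(-abs(z_values))

# Assumed conventional threshold of 0.05 for calling an input significant
significant <- p_values < 0.05

round(p_values, 3)
```

Acousticness and danceability come out at about 0.976 and 0.080, matching the table, while energy is well below 0.001.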

References

Zhao, M., Harvey, M., Cameron, D., Hopfgartner, F., & Gillet, V. J. (2023). An analysis of classification approaches for hit song prediction using engineered metadata features with lyrics and audio features. In International Conference on Information (pp. 303-311). Cham: Springer Nature Switzerland.


Monday, February 10, 2025

Text Mining and Sentiment Analysis in R

Running through the same concepts as before, I started looking at TripAdvisor data for a specific pub in Nottingham - the Malt Cross on St James Street. Their page of reviews can be seen here. I did a simple 'screen scrape' to get the review text, did a bit of work in Excel to bring it into better shape, then went through the pre-processing steps: lowercasing, and removing stopwords and punctuation. I didn't do any stemming (taking words back to a 'stem', where for instance both 'liked' and 'likes' would be reduced to 'lik' to remove duplication) as I wanted whole words for a word cloud. This would mean that 'disappointed' and 'disappointing', for example, would be shown separately but I was prepared to go with that.
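The pre-processing steps can be sketched in base R like this. The two reviews and the tiny stopword list are made up for illustration, and `clean_review` is a hypothetical helper rather than the code actually used on the TripAdvisor text:

```r
# Made-up reviews and a deliberately tiny stopword list, for illustration only
reviews    <- c("Loved the Atmosphere, and the staff were great!",
                "The food was disappointing.")
stop_words <- c("the", "and", "were", "was", "a", "of")

clean_review <- function(text, stop_words) {
  text  <- tolower(text)                    # lowercase everything
  text  <- gsub("[[:punct:]]", "", text)    # strip punctuation
  words <- strsplit(text, "\\s+")[[1]]      # split into words on whitespace
  words <- words[!(words %in% stop_words) & words != ""]  # drop stopwords
  paste(words, collapse = " ")
}

cleaned <- vapply(reviews, clean_review, character(1),
                  stop_words = stop_words, USE.NAMES = FALSE)
```

No stemming is applied, so 'disappointing' survives as a whole word, which is what a word cloud needs.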

The polarity() function, from the "qdap" package that is often used alongside the "tm" library in R, draws on a lexicon of around 6,500 words described as positive or negative that can be used for sentiment analysis with text mining. It scans through the data and assigns a number between -1 and +1 to words in each sentence and then an overall positivity or negativity score to each sentence. Apparently people are on average positive in their comments (even on the Internet!) so these numbers should be scaled so the mean value is zero. After scaling, the score distribution was as shown in figure 2.1. Slightly more positive than negative I'd say.

 

Figure 2.1. Frequency distribution (scaled) for polarity of documents
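The scaling step mentioned above can be sketched as follows. The polarity scores are invented, and whether to only centre the scores or to standardise them fully with `scale()` is a choice I'm leaving open rather than reporting:

```r
# Made-up polarity scores for five reviews, skewed positive on purpose
polarity_scores <- c(0.8, 0.3, 0.5, -0.2, 0.1)

# Option 1: centre only, so the mean becomes zero
centred <- polarity_scores - mean(polarity_scores)

# Option 2: scale() centres and also divides by the standard deviation
scaled <- as.numeric(scale(polarity_scores))
```

Either way, a review that was merely 'less positive than average' ends up with a negative score.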


Taking individual words, I constructed a 'word cloud' (figure 2.2). Here we see positive words in the reviews in green and negative ones in red. The size of the word corresponds to how often it appears in the reviews.

Figure 2.2. Word cloud for positive and negative sentiments






Monday, February 3, 2025

Exploring Musical 'Acoustic Features' in Spotify Data

A spun-out MIT research project called Echo Nest used a proprietary algorithm to describe songs in terms of 'acoustic features' such as 'danceability' and 'liveness'. Each one is a number, generally between 0 and 1 (but not always). These were used in a service aimed at music recommendations and audio fingerprinting, amongst other things. Since Echo Nest was acquired by Spotify in 2014, the data has been available through the Spotify API. Figure 3.1 shows distributions for nine acoustic features.

Fig. 3.1. Histogram plots of acoustic features

I wanted to see if you could use these features to predict song popularity, based on data from the Billboard Top 100 chart and streaming frequencies. These were obtained from Mariana Silva's 'MusicOSet' data.

Visually you might see possible correlations between pairs where the distributions are the same shape - tempo and danceability maybe? Let's look at the correlation matrix (figure 3.2):

Fig. 3.2. Correlation plots for acoustic features

So yes, there are some correlations - both positive (e.g. loudness and energy) and negative (e.g. acousticness and energy) - but nothing strong enough to prevent a multiple regression analysis (typically a concern where the absolute correlation is above 0.75).
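A minimal sketch of that kind of check with base R's `cor()`. The features are simulated, and I've deliberately made loudness track energy closely so that one pair gets flagged; that is not how the real data behaved. The 0.75 cut-off is the rule of thumb quoted above:

```r
set.seed(3)
n <- 300
energy <- runif(n)

# Simulated features: loudness follows energy closely, acousticness loosely opposes it
features <- data.frame(
  energy       = energy,
  loudness     = energy + rnorm(n, sd = 0.1),
  acousticness = 1 - energy + rnorm(n, sd = 0.4),
  valence      = runif(n)
)

corr <- cor(features)  # pairwise Pearson correlations

# Look only above the diagonal so each pair is reported once
high <- which(abs(corr) > 0.75 & upper.tri(corr), arr.ind = TRUE)
flagged <- data.frame(
  feature_1   = rownames(corr)[high[, "row"]],
  feature_2   = colnames(corr)[high[, "col"]],
  correlation = corr[high]
)
```

An empty `flagged` table would mean no pair is correlated strongly enough to worry about.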

Outliers (defined here as values more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile) can be a problem in regression models, and indeed we see that's the case here. Look at the boxplots, for example (fig 3.3). Also known as 'box and whisker plots', the box covers the interquartile range (1st quartile to 3rd quartile) and the 'whiskers' extend to the most extreme values that are not outliers. Dots show outliers and there are lots of them!

Fig. 3.3. Boxplots of acoustic features
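The 1.5 times interquartile range rule behind those dots can be sketched like this. The values are made up to look like a skewed feature; they are not taken from the Spotify data:

```r
# Made-up, right-skewed feature values for illustration
values <- c(0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.45, 0.90)

q1  <- quantile(values, 0.25, names = FALSE)  # 1st quartile
q3  <- quantile(values, 0.75, names = FALSE)  # 3rd quartile
iqr <- q3 - q1                                # interquartile range (the 'box')

# Fences: anything outside these limits counts as an outlier
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

outliers <- values[values < lower | values > upper]
```

Here the quartiles are 0.04 and 0.08, so the upper fence sits at 0.14 and the two largest values are flagged.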

So between that and an analysis of the variances of the acoustic features (a test for homogeneity if you must know!), I can see that a linear regression isn't a suitable way to predict popularity from this data.





Monday, January 27, 2025

Looking at Song Popularity

The popularity of musical genres has changed quite a bit over the years... and yet in many ways it's stayed constant. Looking at the data pulled together by Mariana Silva (called 'MusicOSet'), which calculates song and artist popularity based on Billboard chart success and radio and Spotify plays, the least popular music in the years 1960 to 2019 was, perhaps unsurprisingly, 'Bagpipe'! 'Halloween' also scored badly. 'Country Dawn' had the worst single score but featured in more than one decade. Figure 4.1 shows the five least popular musical genres by decade.

Fig. 4.1. Least popular musical genres by decade

For the most popular genres, ‘Album Rock’ and ‘Dance Pop’ had the highest ratings overall. Album Rock features in the top 5 most popular genres from the 1960’s to the 1990’s. Dance Pop is in the top 5 most popular from the 1990’s to the 2010’s. 

Another way of looking at musical genre is a treemap (figure 4.2). Here we can see a mapping of genre where the area is proportional to the popularity and colours correspond to artist type. Both Dance Pop and Album Rock are present in the 'Singer' and 'Band' artist types.

Fig 4.2. Treemap showing artist genre by artist type










Monday, January 20, 2025

Text Mining and Representing Tweets

I've been investigating text mining through the recommended book (Kwartler, 2017). It includes some interesting processing of Delta airlines support tweets: removing stopwords and punctuation, and lowercasing the text. From the processed block of tweets a 'Term Document Matrix' (TDM) was then made. This is a table of words and how often each one appears in each document (here, each tweet). The dendrogram in figure 1.1 essentially shows the way certain words cluster together. The 'Height' represents the "distance" between words in the matrix.

Figure 1.1. Dendrogram from Delta airlines tweets
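To give a feel for how a TDM leads to a dendrogram, here is a small base R sketch. The five tweets are made up, and using Euclidean distance with `hclust()` is my assumption about a typical approach, not a reproduction of the book's worked example:

```r
# Made-up, already-cleaned tweets for illustration
tweets <- c("flight delayed again",
            "lost bag on flight",
            "bag delayed at airport",
            "thanks for the help",
            "thanks great help")

tokens <- strsplit(tweets, " ")
terms  <- sort(unique(unlist(tokens)))

# Term Document Matrix: one row per term, one column per tweet, cells are counts
tdm <- vapply(tokens,
              function(words) tabulate(match(words, terms), nbins = length(terms)),
              integer(length(terms)))
rownames(tdm) <- terms
colnames(tdm) <- paste0("tweet_", seq_along(tweets))

# Distance between terms, then hierarchical clustering
word_dist <- dist(tdm)
tree      <- hclust(word_dist)

# plot(tree) would draw the dendrogram, with 'Height' on the vertical axis
```

Terms that appear in the same tweets end up close together, so they join low down in the tree.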

Another way to look at this is by colouring the branches. Representing the dendrogram in a circle gives a different perspective too (Fig 1.2). 


Figure 1.2 Circular dendrogram from Delta airlines tweets

I should reiterate that whilst these dendrograms formed the basis of my understanding, they are worked examples from Ted Kwartler's book. I just thought they looked cool!

References

Kwartler, T. (2017). Text mining in practice with R. John Wiley & Sons.

Evaluating Embeddings for NLP and Document Clustering

Embeddings are an alternative to traditional methods of vectorisation first proposed by Bengio et al in 2003, who developed the first langua...