Monday, February 17, 2025

Predicting Song Popularity From Acoustic Features

Previously, I've shown how I explored both the song popularity information derived from Billboard Top 100 data and Spotify 'Acoustic Features'. The first port of call for predicting one from the other is a multiple linear regression model: it's relatively straightforward and easy to understand. However, as previously covered, it's not a suitable model for this data, so I moved to the next option: binary logistic regression. This model is less sensitive to data issues but, as its name suggests, only predicts a binary outcome, i.e. yes/no, true/false, etc.

Handily, the data I'm looking at (MusicOSet, curated by Marianos Silva) defines a field called 'Is Pop' as part of the song entities. This is a binary yes/no field that says whether a song is popular, based on annual popularity scores. These scores are derived from Billboard chart data and can be seen as a measure of the commercial success of songs. See figure 5.1 for a Sankey diagram showing how a song's popular/not-popular status relates to its genre. Most genres are fairly equally split between popular and not; niche genres such as disco, funk and alternative metal are generally 'not popular', which seems logical given their relative lack of chart-beating success.

Fig. 5.1. Sankey diagram showing how yes/no popularity relates to song genre

Remember that the Spotify side of the data provides a number of acoustic features that describe each song. I chose nine of the more subjective features; their distributions are shown in figure 5.2.

Fig. 5.2. Histograms showing the distribution of acoustic feature values

So, using the 'is_pop' field to represent popularity/success, I built a binary logistic regression model with the acoustic features as inputs. My simple model gave an accuracy of 58.12%, with all inputs showing a statistically significant influence on popularity except acousticness, danceability and liveness (for the complete stats, see table 5.1). Not a fabulous result, but it is a fairly simple model. Academics have been chasing hit song prediction for years and, by combining multiple methods and ever fancier data processing, now report success in the region of 87% (for example, Zhao et al., 2023).
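As a rough illustration of where an accuracy figure like this comes from, here is a minimal Python sketch. It assumes the model's predicted probabilities are turned into yes/no predictions with a 0.5 cut-off, which the post does not state, and it compares the result with the accuracy of always guessing the more common class. The function names and the toy values are hypothetical and are not taken from MusicOSet.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Share of songs whose thresholded prediction matches the true label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for pred, y in zip(predictions, labels) if pred == y)
    return correct / len(labels)


def majority_baseline(labels):
    """Accuracy of always guessing the more common class."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


# Made-up values for illustration only; not taken from MusicOSet.
toy_probabilities = [0.62, 0.48, 0.55, 0.30, 0.71, 0.44, 0.52, 0.39]
toy_labels = [1, 0, 0, 0, 1, 1, 1, 0]

model_accuracy = accuracy(toy_probabilities, toy_labels)
baseline_accuracy = majority_baseline(toy_labels)
print(f"model: {model_accuracy:.2%}, baseline: {baseline_accuracy:.2%}")
```

Comparing against the majority-class baseline is a design choice worth keeping: an accuracy in the high fifties only means something once you know how often "always guess the bigger class" would be right.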

Table 5.1. Coefficients from logistic regression

| Term | Estimate | Std Err | z value | Significance (p) |
|---|---|---|---|---|
| (Intercept) | -0.69 | 0.15 | -4.51 | <0.001 |
| acousticness | 0.00 | 0.10 | 0.03 | 0.976 |
| danceability | 0.27 | 0.16 | 1.75 | 0.080 |
| energy | -0.75 | 0.18 | -4.24 | <0.001 |
| instrumentalness | -0.59 | 0.17 | -3.39 | <0.001 |
| liveness | -0.07 | 0.13 | -0.52 | 0.606 |
| loudness | 0.83 | 0.21 | 3.96 | <0.001 |
| speechiness | -1.78 | 0.28 | -6.34 | <0.001 |
| valence | 0.52 | 0.10 | 5.00 | <0.001 |
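To make the estimates in table 5.1 a little more concrete, the sketch below converts them into odds ratios and into a predicted probability for one song. It uses only the eight estimates and the intercept listed in the table. The song is hypothetical, and the assumption that every feature (including loudness) is scaled to the 0-1 range is mine, not something the post confirms, so read the probability as an illustration of the arithmetic and not as a real prediction.

```python
import math

# Estimates copied from table 5.1.
COEFFICIENTS = {
    "acousticness": 0.00,
    "danceability": 0.27,
    "energy": -0.75,
    "instrumentalness": -0.59,
    "liveness": -0.07,
    "loudness": 0.83,
    "speechiness": -1.78,
    "valence": 0.52,
}
INTERCEPT = -0.69


def odds_ratio(estimate):
    """Multiplicative change in the odds of 'popular' per one-unit increase."""
    return math.exp(estimate)


def predicted_probability(features):
    """Logistic function applied to intercept + sum(estimate * feature value)."""
    log_odds = INTERCEPT + sum(
        COEFFICIENTS[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical song; ASSUMES every feature is scaled to the 0-1 range.
hypothetical_song = {name: 0.5 for name in COEFFICIENTS}

speechiness_or = odds_ratio(COEFFICIENTS["speechiness"])
song_probability = predicted_probability(hypothetical_song)
print(f"speechiness odds ratio: {speechiness_or:.2f}")
print(f"P(popular) for the hypothetical song: {song_probability:.2f}")
```

An odds ratio below 1 (as for speechiness, at roughly 0.17) means the odds of being popular fall as the feature rises; above 1 (as for loudness and valence) means they rise.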

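<technical point> We start from the assumption that a feature has no influence on popularity (the 'null hypothesis'). The significance value (p) is the probability of seeing an effect at least this large if that assumption were true. When p is high, the data give us no reason to reject 'no influence', although that doesn't prove it is true. When p is low (in this case less than 0.001 or 0.1%), such a result would be very unlikely under 'no influence', so we reject that assumption and conclude that the feature does have an influence. Confusingly double-negative but that's statistics for you! </technical point>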
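The link between the z column and the significance column can be checked with a few lines of Python. This sketch assumes the p-values are two-sided and come from comparing z against a standard normal distribution, which is the usual convention for logistic regression output but is not stated in the post. The function name is my own.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under the standard normal distribution."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), where Phi is the normal CDF.
    return math.erfc(abs(z) / math.sqrt(2))


# z values copied from table 5.1.
for name, z in [("acousticness", 0.03), ("danceability", 1.75), ("valence", 5.00)]:
    print(f"{name}: z = {z:.2f}, p = {two_sided_p_value(z):.3f}")
```

Under that assumption the rounded z values reproduce the table's p-values for acousticness and danceability, and valence comes out well below 0.001.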

References

Zhao, M., Harvey, M., Cameron, D., Hopfgartner, F., & Gillet, V. J. (2023). An analysis of classification approaches for hit song prediction using engineered metadata features with lyrics and audio features. In International Conference on Information (pp. 303-311). Cham: Springer Nature Switzerland.

