Data Analysis Insight: Predicting Song Popularity From Acoustic Features

Previously, I've shown how I explored both the song popularity information derived from Billboard Top 100 data and the Spotify 'Acoustic Features'. The first port of call for predicting one from the other is a multiple linear regression model - it's relatively straightforward and easy to understand. However, as previously covered, it's not a suitable model for this data, so I moved to the next option: binary logistic regression. This model is less sensitive to data issues but only predicts (as its name suggests) a binary outcome, i.e. yes/no, true/false, etc.

Handily, the data I'm looking at (MusicOSet, curated by Mariana O. Silva) defines a field called 'Is Pop' as part of the song entities. This is a binary yes/no field that says whether a song is popular, based on annual popularity scores. These scores are derived from Billboard chart data and can be seen as a measure of a song's commercial success. See figure 5.1 for a Sankey diagram showing how a song being 'popular or not' relates to its genre. Most genres are fairly equally split between popular and not; niche genres such as disco, funk and alternative metal are generally 'not popular', which seems logical given their relative lack of chart-beating success.

Fig. 5.1. Sankey diagram showing how yes/no popular relates to song genre

Remember that there are a number of acoustic features from the Spotify side of the data, which describe each song - I chose nine of the more subjective features and the distributions are shown in figure 5.2.

Fig. 5.2. Histograms showing the distribution of acoustic feature values

So, using the 'is_pop' field to represent popularity/success, I built a binary logistic model with the acoustic features as the inputs. My simple model reached 58.12% accuracy, with all inputs showing a statistically significant influence on popularity except acousticness, danceability and liveness (for the complete stats, see table 5.1). Not a fabulous result, but it is a fairly simple model. Academics have been chasing hit song prediction for years and, by pulling in multiple methods and ever fancier data processing, now report success in the region of 87% (for example, Zhao et al., 2023).
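For anyone who wants to see the mechanics, here is a minimal Python sketch of what a binary logistic model does with its inputs and how an accuracy figure like the one above is calculated. It is not the code behind my model: the function names, the two toy coefficients and the four example songs are all made up for illustration.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Logistic regression: squash a weighted sum of features into a 0-1 probability."""
    linear = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-linear))

def accuracy(predicted_labels, true_labels):
    """Share of songs where the predicted label matches the real one."""
    matches = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return matches / len(true_labels)

# Hypothetical coefficients and songs, NOT taken from the fitted model.
toy_intercept = -0.5
toy_coefficients = {"valence": 1.0, "speechiness": -2.0}
toy_songs = [
    {"valence": 0.9, "speechiness": 0.1},
    {"valence": 0.2, "speechiness": 0.6},
    {"valence": 0.8, "speechiness": 0.05},
    {"valence": 0.3, "speechiness": 0.1},
]
toy_is_pop = [True, False, False, True]  # made-up 'is_pop' labels

probabilities = [predict_probability(toy_intercept, toy_coefficients, s) for s in toy_songs]
predictions = [p >= 0.5 for p in probabilities]  # call a song popular at 50% or more
print(accuracy(predictions, toy_is_pop))
```

The 0.5 cut-off is the conventional default and is an assumption here; moving it trades false 'popular' calls against missed hits.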

Table 5.1. Coefficients from logistic regression
Term	Estimate	Std Err	z value	Significance (p)
(Intercept)	-0.69	0.15	-4.51	<0.001
acousticness	0.00	0.10	0.03	0.976
danceability	0.27	0.16	1.75	0.080
energy	-0.75	0.18	-4.24	<0.001
instrumentalness	-0.59	0.17	-3.39	<0.001
liveness	-0.07	0.13	-0.52	0.606
loudness	0.83	0.21	3.96	<0.001
speechiness	-1.78	0.28	-6.34	<0.001
valence	0.52	0.10	5.00	<0.001
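
One way to read the estimates in table 5.1 is to exponentiate them, which turns each one into an odds ratio. The sketch below does that for the rounded values copied from the table. The dictionary and function names are my own, and since the scaling of each feature isn't restated here, treat any 'per one unit' reading as an assumption.

```python
import math

# Rounded estimates copied from table 5.1 (log-odds scale).
estimates = {
    "acousticness": 0.00,
    "danceability": 0.27,
    "energy": -0.75,
    "instrumentalness": -0.59,
    "liveness": -0.07,
    "loudness": 0.83,
    "speechiness": -1.78,
    "valence": 0.52,
}

def odds_ratio(estimate):
    """Convert a log-odds coefficient into a multiplicative change in the odds."""
    return math.exp(estimate)

for feature, estimate in estimates.items():
    print(f"{feature:<17} {odds_ratio(estimate):.2f}")
```

An odds ratio above 1 means higher values of the feature go with higher odds of a song being popular; below 1 means lower odds. Speechiness comes out well below 1 and loudness well above it, matching the signs in the table.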

<technical point> We start from the assumption that a feature has no influence on popularity. When the significance value (p) is high, we don't have enough evidence to reject that assumption, which is not quite the same as proving it true. When the value is low (in this case less than 0.001 or 0.1%), a result this strong would be very unlikely if there really were no influence, so we reject 'no influence' and conclude that the feature does have an effect. Confusingly double-negative but that's statistics for you! </technical point>
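
To make the link between the z and p columns of table 5.1 concrete, here is a small sketch that recomputes a two-sided p-value from a z value using the standard normal distribution. This is the usual Wald test that most statistics packages report for logistic regression; I'm assuming that is what sits behind the table, and the function name is just illustrative.

```python
import math

def two_sided_p_value(z):
    """Chance of a z value at least this extreme if the true effect were zero."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# z values copied from table 5.1.
z_values = {"acousticness": 0.03, "danceability": 1.75, "valence": 5.00}

for feature, z in z_values.items():
    print(f"{feature:<13} z={z:>5.2f}  p={two_sided_p_value(z):.3f}")
```

The further z sits from zero, the smaller p becomes, which is why the features with large z values in the table are the ones flagged as significant.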

References

Zhao, M., Harvey, M., Cameron, D., Hopfgartner, F., & Gillet, V. J. (2023). An analysis of classification approaches for hit song prediction using engineered metadata features with lyrics and audio features. In International Conference on Information (pp. 303-311). Cham: Springer Nature Switzerland.


Monday, February 17, 2025
