A research project spun out of MIT, called Echo Nest, used a proprietary algorithm to describe songs in terms of 'acoustic features' such as 'danceability' and 'liveness'. Each one is a number, generally (but not always) between 0 and 1. These were used in a service aimed at music recommendation and audio fingerprinting, amongst other things. Since Echo Nest was acquired by Spotify in 2014, the data has been available through the Spotify API. Figure 3.1 shows distributions for nine acoustic features.
Fig. 3.1. Histogram plots of acoustic features
I wanted to see whether you could use these features to predict song popularity, based on data from the Billboard Top 100 chart and streaming frequencies. The data were obtained from Mariana Silva's 'MusicOSet' dataset.
Visually, you might spot possible correlations between pairs of features whose distributions have the same shape: tempo and danceability, maybe? Let's look at the correlation matrix (figure 3.2):
Fig. 3.2. Correlation plots for acoustic features
So yes, there are some correlations, both positive (e.g. loudness and energy) and negative (e.g. acousticness and energy), but nothing strong enough to prevent a multiple regression analysis (the usual warning sign is a pair of predictors with |corr| > 0.75).
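To make the 0.75 rule of thumb concrete, here is a minimal sketch of how you might flag strongly correlated pairs. It uses only the Python standard library. The function names (`pearson`, `strongly_correlated`) and the toy values are my own for illustration and are not taken from the real dataset.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def strongly_correlated(features, threshold=0.75):
    """Return (name_a, name_b, r) for every pair whose |r| exceeds the threshold."""
    flagged = []
    for name_a, name_b in combinations(features, 2):
        r = pearson(features[name_a], features[name_b])
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Made-up values for illustration only, NOT the real feature data.
toy = {
    "energy": [0.2, 0.4, 0.6, 0.8, 0.9],
    "loudness": [-14.0, -10.0, -8.0, -5.0, -4.0],
    "acousticness": [0.9, 0.5, 0.6, 0.3, 0.4],
}
for pair in strongly_correlated(toy):
    print(pair)
```

An empty list from `strongly_correlated` would mean no pair crosses the threshold, which is the situation described above for the real features.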
Outliers (defined here as values more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile) can be a problem in regression models, and indeed that's the case here. Look at the boxplots, for example (fig 3.3). Also known as 'box and whisker plots', they use the box to cover the interquartile range (1st quartile to 3rd quartile), while the 'whiskers' conventionally extend to the most extreme values that are not outliers. Dots show outliers and there are lots of them!
Fig. 3.3. Boxplots of acoustic features
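The 1.5 × IQR rule is easy to check by hand. The sketch below is one way to do it with the standard library. The helper name `iqr_outliers` and the sample values are invented for illustration, and the quartile convention used by `statistics.quantiles` may differ slightly from the one your plotting library uses.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # 1st quartile, median, 3rd quartile
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Made-up 'liveness'-style values for illustration only, NOT the real data.
sample = [0.08, 0.10, 0.11, 0.12, 0.13, 0.15, 0.17, 0.35, 0.92]
print(iqr_outliers(sample))
```

Any value the function returns would be drawn as a dot beyond the whiskers in a conventional boxplot.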
So between that and an analysis of the variances of the acoustic features (a test for homogeneity of variance, if you must know!), I can see that you can't use a regression model on this data to predict popularity.