đź•’ This report is more than 5 years old (Published Oct 6, 2018).
Because I usually consume music by listening to full-length albums, I wanted to see whether my music album preferences—expressed through music album scores—can be quantitatively explained, and possibly predicted, by specific album features.
And, because I also read many music publications, I wanted to see whether my personal album scores correlate to those of the music magazines I enjoy reading.
This turned out to be one of my side projects, which, albeit simplistic, revealed some interesting findings about my affinity toward music.
Mathematical representation of my taste in music through album scores
The dataset in this analysis contains information on 97 albums that I have listened to in their entirety and for which I have consequently formed a strong opinion. Since I read a lot of album reviews, I was interested in understanding whether my album scores were correlated to album scores of several music publications — AllMusic, musicOMH, Pitchfork, The Guardian — and one aggregator website, Metacritic. I was also interested in finding out whether my album score could be predicted using information on album length, year of release, type of artist, and music genre.
Obviously, a significant drawback of this dataset stems from the fact that my system of evaluating album scores might be vastly different from that of music publications. I purposefully kept my method of scoring consistent across all albums using the following formula:

There is no guarantee, of course, that music publications used the same method. And, it’s naturally possible that, within the same music publication, scoring methodologies were inconsistent.
I first explored the data by looking at basic descriptive and inferential statistics, and by visualizing the frequency distribution of scores across different categories and publishers.
Descriptive and inferential statistics applied to my taste in music
I started by exploring the relative frequency of albums per year of release, type of artist, and genre. Since certain albums belonged to niche or cross-genre categories (such as hip-hop for Lauryn Hill’s The Miseducation of Lauryn Hill and classical + pop for Benjamin Clementine’s At Least for Now), I unfortunately had to file each of them under one of the major genres. The major category was picked based on my subjective evaluation of the “closest” genre of music. Those were:
- Pop: pop and pop-based dance music.
- Electronic: ambient and electronic dance music such as house, techno, and EDM.
- R&B: R&B, soul, and hip-hop.
- Rock.
- Experimental: cross- and multi-categorical genres that are not easily defined.
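In code, this subjective mapping reduces to a lookup table. The sub-genre keys below are illustrative examples only, not the full list used in the analysis:

```python
# Illustrative sub-genre -> major-category mapping (keys are examples only)
GENRE_MAP = {
    "dance-pop": "Pop",
    "ambient": "Electronic",
    "house": "Electronic",
    "techno": "Electronic",
    "soul": "R&B",
    "hip-hop": "R&B",
}

GENRE_MAP["hip-hop"]  # "R&B"
```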


As seen above, 35% of the albums in my playlist were released in 2009, 2011, and 2013. Interestingly, though not surprisingly given my pop- and dance-oriented ear, almost 80% of the albums in this dataset were released by solo female artists. More than 70% of the albums belong to the pop, electronic, and experimental genres.
I then plotted the distribution of albums per length and album score so that I could perform some basic inferential statistics.


If we assume these 97 albums represent my future listening profile, we can treat these data points as a statistically representative sample of the entire population of albums, current and future. With that in mind, I calculated the mean length and score of albums in my collection, along with 95% confidence intervals. Since the variance of the entire population is unknown, I opted for the t-statistic.
```python
from scipy import stats

# `data` is the albums dataframe; column names are assumed
def confidence_interval(sample, confidence=0.95):
    # t-based interval, since the population variance is unknown
    return stats.t.interval(confidence, len(sample) - 1,
                            loc=sample.mean(), scale=stats.sem(sample))

# 95% CONFIDENCE INTERVAL FOR LENGTH
print(confidence_interval(data.length))
# 95% CONFIDENCE INTERVAL FOR SCORE
print(confidence_interval(data.my_score))
```
These numbers indicate we can be 95% confident that the true population mean of all albums, current and future ones, will fall within the following ranges:
- (1) For length, the true population mean will be between 46.4 and 50.4 minutes.
- (2) For score, the true population mean will be between 66.0 and 73.4.
Put in less technical terms: if we assume my taste in music doesn’t notably change, albums that secure a permanent spot on my playlist will, on average, be of typical LP length (40–50 minutes) and, given the 65–75 score bracket, will not be outstanding albums.
Although this makes sense, I still found the latter conclusion surprising. It signals that many of the albums that I regularly listen to contain songs that have not grown on me. This means, furthermore, that I consider a notable chunk of my favorite albums to be “very good” or only “good” according to this scoring methodology.
Correlation between my album scores and those of influential music publications
Before comparing my scores with those of other music publications, it was important to note that musicOMH, The Guardian, and AllMusic assign scores on a 0–5 scale in increments of 0.5, which translates to multiples of 10 on a 0–100 scale. Meanwhile, Pitchfork assigns scores on a 0–10 scale in increments of 0.1, while Metacritic assigns scores on a 0–100 scale in increments of 1.
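Since all three scales translate cleanly onto 0–100, the conversion is a one-liner; `to_percent` is my name for it, not something from the original analysis:

```python
def to_percent(score, scale_max):
    # Map a score on a 0..scale_max scale onto 0..100
    return score * 100.0 / scale_max

# musicOMH / The Guardian / AllMusic grade out of 5, Pitchfork out of 10
to_percent(4.5, 5)    # 90.0
to_percent(8.2, 10)   # approximately 82.0
```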
These discrepancies can be clearly seen in the distribution histogram plots below.

Looking at these plots, we can see that the other publications’ scores tend to skew higher than mine. The frequency distribution of my scores, however, also has a long left tail. In general, one can notice that the albums in the dataset have good (60+) scores, an expected finding if we assume that:
- Albums that I listen to repeatedly tend to be those that I like and therefore score higher, and that
- Music journalists will, on average, agree on the overall quality of these albums, therefore also assigning higher scores.
This was further corroborated by the mean scores and sample standard deviations, which show that my distribution of album scores has the lowest mean and the largest standard deviation. MusicOMH’s distribution, on the other hand, has the highest mean, while Metacritic has the lowest sample standard deviation. The latter, in particular, was expected since Metacritic is an aggregator website.
As a side note, for simplicity and relevance, I did not calculate confidence intervals for the album scores of music publications.
```python
# Mean and sample standard deviation per score column
# (`scores` holds one column per publication plus my own; names assumed)
print(scores.mean())
print(scores.std())  # pandas uses ddof=1, i.e. the sample standard deviation
```
Going back to the comparison of scoring methodologies, I first rounded my scores to the nearest multiple of 10 when comparing against musicOMH, The Guardian, and AllMusic. So, for example, Grimes’ Art Angels album, which had a score of 86, was rounded to 90. The data remained untransformed for comparisons against Pitchfork and Metacritic, since their scores share my scale.
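The rounding itself can be done with integer arithmetic. This sketch assumes integer scores, and that halves round up (the original analysis may have broken ties differently):

```python
def round_to_ten(score):
    # Nearest multiple of 10 for an integer 0-100 score; halves round up
    return (score + 5) // 10 * 10

round_to_ten(86)  # 90, as in the Art Angels example
```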

```python
# Correlation coefficients (Pearson) between my scores and each
# publication's; dataframe column names below are assumed, and rounded
# scores are used where the publication grades in multiples of 10
correlation_OMH = data.my_score_rounded.corr(data.musicomh)
correlation_pitchfork = data.my_score.corr(data.pitchfork)
correlation_theguardian = data.my_score_rounded.corr(data.theguardian)
correlation_metacritic = data.my_score.corr(data.metacritic)
correlation_allmusic = data.my_score_rounded.corr(data.allmusic)

coefficients = [correlation_OMH, correlation_pitchfork, correlation_theguardian,
                correlation_metacritic, correlation_allmusic]
```
I expected to see a somewhat stronger linear correlation between these scores, but overall, my album scores were weakly correlated with those of the other music publications. The highest coefficients corresponded to the correlations with AllMusic, musicOMH, and The Guardian, for which my scores had been rounded to the nearest multiple of 10. Had the data not been transformed, these coefficients would have been slightly lower.
Generally, I was most surprised by the near-zero correlation coefficient with Pitchfork’s scores, indicating no linear relationship between our album scores. I considered transforming some of the variables (for example, onto a logarithmic scale), but I saw no solid theoretical ground for assuming a non-linear relationship between these scores.
Can regression models be used to predict my preference for music albums?
I then wanted to see whether my album score was, to some extent, influenced by album length, type of artist, genre, and year of release. Although such a model is a notably simplified representation of reality, I was interested in seeing whether regression could explain my taste in music.
Before doing any calculations, I defined a function that removes outliers based on the sample standard deviation, in case such an operation proved necessary. My reasoning was that the outliers among my scores were albums I really didn’t like (as a fun fact, those were Lady Gaga’s Artpop, Kanye West’s Yeezus, and The Knife’s Shaking the Habitual). As these scores would negatively impact the regression models, I wanted to delete them. Additionally, since I didn’t need the music publications’ scores for this analysis, I reduced the dataframe to the essential columns.
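The outlier filter might look like the sketch below; the cutoff of two sample standard deviations is my assumption, since the original threshold isn’t shown:

```python
import numpy as np
import pandas as pd

def remove_outliers(df, column, n_std=2):
    # Keep rows whose value in `column` lies within n_std sample
    # standard deviations of the column mean (cutoff assumed)
    return df[np.abs(column - column.mean()) <= n_std * column.std()]
```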
```python
# Removing the outliers using the remove_outliers function
data_filtered = remove_outliers(data_reduced, data_reduced.my_score)
```
Linear Regression Model
As with every linear regression model, I wanted to check for the following criteria before fitting the data:
- (1) Linearity
- (2) No endogeneity of regressors
- (3) Normality and homoscedasticity
- (4) No autocorrelation
- (5) No multicollinearity
* Linearity
Since genre and type of artist are categorical variables that would be assigned dummies, I only checked linearity for year of release and album length.

As can be seen, linearity is not observed for these two variables; the relationship appears random. Even after applying log and square-root transformations to the independent and dependent variables, the lack of a visible linear relationship persisted. This was obviously a detriment to the model, but I still wanted to see whether the remaining variables, type of artist and genre, had statistical significance in explaining some of the variability in the score distribution.
With only categorical variables left to explain variability in album scores, I assumed no endogeneity of regressors and normality and homoscedasticity of the error term. Since this is not time-series data, I also assumed no autocorrelation. However, I wanted to check whether there was any significant correlation between each of the genre variables and the type of artist.
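A pairwise check like this takes one call in pandas; the dummy-coded columns below are a hypothetical toy example, not the real data:

```python
import pandas as pd

# Toy dummy-coded columns; .corr() returns the pairwise Pearson matrix
dummies = pd.DataFrame({
    "female": [1, 1, 0, 1, 0],
    "pop":    [1, 0, 0, 1, 0],
    "electr": [0, 1, 0, 0, 1],
})
corr = dummies.corr()
```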

We can notice from this heatmap that the genres and type of artist are weakly correlated, with absolute values below 0.5, which means it’s valid to include them in the model. Fitting the regression on these variables, we get:
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 67.1489 | 7.366 | 9.116 | 0.000 | 52.518 | 81.780 |
| female | 9.0825 | 4.683 | 1.939 | 0.056 | -0.220 | 18.385 |
| pop | -5.3310 | 6.795 | -0.785 | 0.435 | -18.829 | 8.167 |
| electr | -4.7769 | 7.309 | -0.654 | 0.515 | -19.295 | 9.741 |
| experim | -5.0291 | 7.264 | -0.692 | 0.491 | -19.459 | 9.401 |
| rock | -2.4935 | 9.223 | -0.270 | 0.788 | -20.814 | 15.827 |
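A coefficient table in this shape is standard output of an ordinary-least-squares fit. The point estimates themselves reduce to a least-squares solve, sketched here on a toy design matrix (the standard errors and p-values come from the fitting library, not this sketch):

```python
import numpy as np

def ols_coefficients(X, y):
    # Ordinary least squares: minimizes ||X @ beta - y||^2
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Toy example: intercept column plus one dummy variable
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 3.0, 1.0, 3.0])
ols_coefficients(X, y)  # [1.0, 2.0]: intercept 1, dummy effect +2
```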
What were some findings from this — admittedly simplified — analysis?
- (1) That my methodology of scoring albums leads to generally lower scores compared to eminent music publications,
- (2) That my scores are weakly correlated with scores of those music publications, and that
- (3) The linear and logistic regression models cannot yet be used to explain variability in my album scores.
Looking ahead, as part of my next project, the analysis could potentially be improved by expanding the dataset and including additional variables, the most insightful of which would be the quality of lyrical content, of production, and of musical and vocal arrangements. These variables could also be scored on a scale from 1 to 100. The biggest improvement, however, will come after many more years of listening to music.
Note: This page contains only the write-up and the images from the analysis. I recommend visiting my GitHub to access the dataset in .csv format and the full analysis, including the Python code, in .ipynb format.