
Convolutional Neural Network Achieves Human-level Accuracy in Music Genre Classification


Mingwen Dong
Psychology, Rutgers University (New Brunswick)
[email protected]
arXiv:1802.09697v1 [cs.SD] 27 Feb 2018

Abstract
Music genre classification is one example of content-based analysis of music signals. Traditionally,
human-engineered features were used to automate this task, and 61% accuracy has been achieved on the
10-genre classification task. However, this is still below the 70% accuracy that humans achieve on the same
task. Here, we propose a new method that combines knowledge from human perception studies of music
genre classification and from the neurophysiology of the auditory system. The method works by training a
simple convolutional neural network (CNN) to classify short segments of the music signal. The genre of a
piece of music is then determined by splitting it into short segments and combining the CNN's predictions
from all segments. After training, this method achieves human-level (70%) accuracy, and the filters learned
by the CNN resemble the spectro-temporal receptive fields (STRFs) in the auditory system 1 .

Introduction
With the rapid development of digital technology, the amount of digital music content increases dramatically
every day. To give better music recommendations to users, it is essential to have an algorithm that can
automatically characterize music. This process is called Music Information Retrieval (MIR), and one
specific example is music genre classification.
However, music genre classification is a very difficult problem because the boundaries between different
genres can be inherently fuzzy. For example, in a 10-way forced-choice task, college students achieved 70%
classification accuracy after hearing 3 seconds of music, and the accuracy did not improve with longer
excerpts [1]. Also, the number of labeled examples is often much smaller than the dimension of the data.
For example, the GTZAN dataset 2 used in the current work contains only 1000 audio tracks, but each
track is 30 s long with a sampling rate of 22,050 Hz.
Traditionally, using human-engineered features like MFCCs (Mel-frequency cepstral coefficients), texture,
beat, and so on, 61% accuracy has been achieved on the 10-genre classification task [1]. More recently, using
PCA-whitened spectrograms as input, a convolutional deep belief network achieved 70% accuracy on a
5-genre classification task [4]. These results are reasonable but still not as good as human performance,
suggesting there is still room for improvement.
Psychophysics and physiology studies show that the human auditory system works in a hierarchical way [2].
First, the ear decomposes the continuous sound waveform into different frequencies, with higher precision
at low frequencies. Then, neurons from lower to higher auditory structures gradually extract more complex
spectro-temporal features through increasingly complex spectro-temporal receptive fields (STRFs) [3]. The
features used by the human auditory system for music genre classification probably rely on these STRFs.
With the spectrogram as input and the corresponding genre as label, a CNN will learn filters that extract
features in the frequency and time domains. If these learned filters mimic the STRFs in the human auditory
system, they can extract useful features for music genre classification. Because music signals are often
high-dimensional in the time domain, fitting a CNN on the complete spectrogram of a piece is not feasible.
To solve this problem, we used a "divide-and-conquer" method: split the spectrogram of the music signal
into consecutive 3-second segments, make predictions for each segment, and finally combine the predictions
together. The main rationale for this method is that humans' classification accuracy plateaus at 3 seconds
and good results were obtained using 3-second segments to train a convolutional deep belief network [1] [4].
It also makes intuitive sense because different parts of the same piece of music should belong to the same
genre.

1 All code is available at: https://github.com/ds7711/music_genre_classification
2 Available at: http://marsyasweb.appspot.com/download/data_sets/

Figure 1: Converting the waveform into a mel-spectrogram, with an example 3-second segment. The
mel-spectrogram mimics how the human ear works, with high precision in the low-frequency band and low
precision in the high-frequency band. Note that the mel-spectrogram shown in the figure is already
log-transformed.
To further reduce the dimension of the spectrogram, we used the mel-spectrogram as the input to the CNN.
The mel-spectrogram approximates how the human auditory system works and can be seen as a spectrogram
smoothed in the frequency domain, with high precision at low frequencies and low precision at high
frequencies [5] [6].

Data Processing & Models


Data pre-processing
Each music signal is first converted from the waveform into a mel-spectrogram z_i using the Librosa library
with a 23 ms time window and 50% overlap (figure 1). The mel-spectrogram is then log-transformed to bring
the values at different mel scales into the same range (f(z_i) = ln(z_i + 1)). Because the mel-spectrogram is
a biologically inspired representation [6], it has a simpler interpretation than the PCA-whitening method
used in [4].
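As a rough illustration of this preprocessing, the sketch below uses Librosa; the window length in samples
(512, roughly 23 ms at 22,050 Hz), the hop length (256, i.e., 50% overlap), and the 64 mel bands are
inferred from the text rather than taken from the released code.

```python
import librosa
import numpy as np

def waveform_to_log_mel(path, sr=22050, n_fft=512, hop_length=256, n_mels=64):
    """Convert an audio file into a log mel-spectrogram, roughly as described above."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1.0)  # f(z) = ln(z + 1)
```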

Network Architecture
1. Input layer: 64 * 256 neurons, corresponding to 64 mel scales and 256 time windows (23 ms, 50% overlap).
2. Convolution layer: 64 different 3 * 3 filters with a stride of 1.
3. Max pooling layer: 2 * 4.
4. Convolution layer: 64 different 3 * 5 filters with a stride of 1.

5. Max pooling layer: 2 * 4.
6. Fully connected layer: 32 neurons that are fully connected to the neurons in the previous layer.
7. Output layer: 10 neurons that are fully connected to neurons in the previous layer.

For the 2D layers/filters, the first dimension corresponds to the mel scale and the second dimension
corresponds to time. All hidden layers use ReLU activation functions, the output layer uses the softmax
function, and the loss is calculated with the cross-entropy function. Dropout and L2 regularization were used
to prevent extreme weights. The model is implemented in Keras (2.0.1) with TensorFlow as the backend and
trained on a single GTX-1070 using stochastic gradient descent.
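A minimal Keras sketch of this architecture is shown below; the padding mode, dropout rate, and L2
coefficient are not specified in the paper and are assumptions.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.regularizers import l2

model = Sequential([
    # 64 mel scales x 256 time windows, one channel
    Conv2D(64, (3, 3), strides=1, activation='relu', padding='same',
           kernel_regularizer=l2(1e-4), input_shape=(64, 256, 1)),
    MaxPooling2D(pool_size=(2, 4)),
    Conv2D(64, (3, 5), strides=1, activation='relu', padding='same',
           kernel_regularizer=l2(1e-4)),
    MaxPooling2D(pool_size=(2, 4)),
    Flatten(),
    Dropout(0.5),  # assumed dropout rate
    Dense(32, activation='relu', kernel_regularizer=l2(1e-4)),
    Dense(10, activation='softmax'),  # one output per genre
])
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
```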

Training & Prediction


The 1000 music tracks (converted into mel-spectrograms) are split into training, validation, and testing sets
with a ratio of 5 : 2 : 3. The training procedure is as follows (a sketch of the segment-sampling step is given
after the list):
1. Select a subset of tracks from the training set.
2. Randomly sample a starting point and take the 3-second continuous segment from each selected track.
3. Calculate the gradients with the back-propagation algorithm, using the segments as input and the labels
of the original tracks as target genres.
4. Update the weights using the gradients.
5. Repeat the procedure until the classification accuracy on the validation set no longer improves.
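A sketch of the random segment sampling (step 2) might look like the following, assuming that 256
spectrogram frames correspond to roughly 3 seconds, as in the input layer above.

```python
import numpy as np

FRAMES_PER_SEGMENT = 256  # ~3 s at 23 ms windows with 50% overlap

def sample_segment(mel_spec, rng=np.random):
    """Randomly crop one 256-frame (~3-second) segment from a (64, T) log mel-spectrogram."""
    start = rng.randint(0, mel_spec.shape[1] - FRAMES_PER_SEGMENT + 1)
    return mel_spec[:, start:start + FRAMES_PER_SEGMENT]

def make_batch(track_specs, labels, rng=np.random):
    """One training batch: a random segment from each selected track, labeled with the track's genre."""
    segments = np.stack([sample_segment(spec, rng) for spec in track_specs])
    return segments[..., np.newaxis], np.asarray(labels)  # add channel axis for the CNN
```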
During testing, each track (as a mel-spectrogram) is split into consecutive 3-second segments with 50%
overlap. Then, for each segment, the trained neural network predicts the probability of each genre. The
predicted genre of a track is the genre with the highest average probability.
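A sketch of this prediction rule, reusing FRAMES_PER_SEGMENT and numpy from the sampling sketch above,
could look like this:

```python
def predict_track_genre(model, mel_spec, hop=FRAMES_PER_SEGMENT // 2):
    """Average the per-segment genre probabilities over 50%-overlapping 3-second segments."""
    starts = range(0, mel_spec.shape[1] - FRAMES_PER_SEGMENT + 1, hop)
    segments = np.stack([mel_spec[:, s:s + FRAMES_PER_SEGMENT] for s in starts])
    probs = model.predict(segments[..., np.newaxis])  # shape: (n_segments, 10)
    return int(probs.mean(axis=0).argmax())           # genre with the highest mean probability
```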

Estimating the filters learned by the CNN


After training, all tracks are split into 3-second segments with 10% overlap. All segments are then fed
into the trained CNN, and the intermediate outputs are calculated and stored. We then estimated the
learned filters using the following method (a minimal sketch of the regression step follows the list):
1. Identify the range of input neurons (the specific section of the input mel-spectrogram) that could activate
the target neuron at a specific layer. E.g., c^(l)_{i,j} denotes the neuron at location (i, j) in the l-th layer.
2. Perform Lasso regression with that section of the mel-spectrogram (reshaped as a vector) as the
regressors and the corresponding activations of the neuron c^(l)_{i,j} as the target values.
3. Reshape the fitted Lasso coefficients to estimate the learned filters.
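Assuming the receptive-field patches and the neuron's activations have already been collected, the Lasso
step could be sketched as follows; the regularization strength alpha is not reported in the paper and is a
placeholder.

```python
from sklearn.linear_model import Lasso

def estimate_filter(patches, activations, patch_shape, alpha=0.01):
    """Estimate one neuron's effective filter by Lasso regression.

    patches:     (n_samples, patch_height * patch_width) flattened mel-spectrogram sections
                 inside the neuron's receptive field.
    activations: (n_samples,) activations of the target neuron for those inputs.
    """
    lasso = Lasso(alpha=alpha)  # alpha is an assumed value
    lasso.fit(patches, activations)
    return lasso.coef_.reshape(patch_shape)  # back to (mel, time) to visualize as a filter
```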

Results
To the best of our knowledge, the current model is the first to achieve human-level (70%) accuracy on the
10-genre classification task (figure 2). This is roughly 10 percentage points higher than the accuracy obtained
in [1], and the model classifies 5 more genres than [4] with similar accuracy.

Classification accuracy varies across genres.


From the confusion matrix (figure 2), we can see that the classification accuracy varies considerably across
genres. In particular, the accuracies for the country and rock genres are not only lower than the current
average but also lower than those reported in [1] (which has lower overall accuracy than our CNN). Because
some of the important human-engineered features used in [1] are long-term features like beat and rhythm,
this suggests that country and rock music may have characteristic features (e.g., beat) that take longer than
3 seconds to capture, so the 3-second segments used by our CNN are not long enough. One future direction
is to explore how to use a CNN to extract long-term features for classification; one possibility is to use an
additional down-sampled mel-spectrogram of the whole audio as input. Another explanation is that country
and rock share more features with the other genres and are intrinsically more difficult to classify.
Nonetheless, expert advice is probably required to improve the classification accuracy on the country and
rock genres.

Figure 2: Confusion matrix of the CNN classification on the testing set.

CNN learns filters resembling spectro-temporal receptive fields.


Figure 3 shows some of the filters learned by the CNN, estimated at the 2nd max-pooling layer; they are
qualitatively similar to the STRFs obtained from physiological experiments (figure 4). To visualize how these
filters help classify the audio, we fed all the 3-second segments from the testing set into the CNN and
calculated the activations of the last hidden layer. After this non-linear transformation, most testing data
points become linearly separable (figure 5). In contrast, the testing data points are much less separable when
the raw mel-spectrogram is used. Together, these results show that the CNN learns filters similar to the
spectro-temporal receptive fields observed in the brain. These filters transform the original mel-spectrogram
into a representation where the data is largely linearly separable.
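As a rough sketch of how the projection in figure 5 could be produced, assuming `model` is the trained Keras
network from the earlier sketch, one could extract the last hidden layer's activations and project them with
LDA; the use of `model.layers[-2]` as the last hidden layer is an assumption about the layer ordering.

```python
from keras.models import Model
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_projection(model, train_segments, train_labels, test_segments):
    """Project last-hidden-layer activations onto the first three LDA directions (cf. figure 5)."""
    hidden = Model(inputs=model.input, outputs=model.layers[-2].output)  # the 32-unit layer
    lda = LinearDiscriminantAnalysis(n_components=3)
    lda.fit(hidden.predict(train_segments), train_labels)  # LDA directions from training data
    return lda.transform(hidden.predict(test_segments))    # (n_test_segments, 3)
```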

Discussion
By combining knowledge from human psychophysics studies and neurophysiology, we used a CNN in a
"divide-and-conquer" way and classified audio waveforms into different genres with human-level accuracy.
The same technique may be used to solve problems that share similar characteristics, for example, music
tagging and artist identification from the raw audio waveform. With the current model, the genre of a piece
of music can be extracted efficiently with human-level accuracy and used as a feature for recommending
music to users.

Figure 3: Filters learned by the CNN are similar to the STRFs from physiological experiments. The mel
scale corresponds to frequency, and relative time corresponds to latency in figure 4.

Figure 4: STRFs obtained from physiological experiments. From left to right are the STRFs obtained from
lower to higher auditory structures. Adapted from [3] with permission.

Figure 5: Comparison of the separability of the testing data under the raw representation and the last-layer
representation of the CNN. The axes are the first three components when the data is projected onto the
directions obtained from linear discriminant analysis (LDA) using the training data.

References
[1] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE Transactions on
Speech and Audio Processing, 10(5):293–302, 2002.
[2] Jan Schnupp, Israel Nelken, and Andrew King. Auditory Neuroscience: Making Sense of Sound. MIT
Press, 2011.
[3] Frédéric E. Theunissen and Julie E. Elie. Neural processing of natural sounds. Nature Reviews
Neuroscience, 15(6):355–366, 2014.
[4] Honglak Lee, Peter Pham, Yan Largman, and Andrew Y. Ng. Unsupervised feature learning for audio
classification using convolutional deep belief networks. In Advances in Neural Information Processing
Systems, pages 1096–1104, 2009.
[5] Douglas O'Shaughnessy. Speech Communication: Human and Machine. Universities Press, 1987.
[6] Joseph W. Picone. Signal modeling techniques in speech recognition. Proceedings of the IEEE,
81(9):1215–1247, 1993.
