ABSTRACT
In this project we have implemented a system to identify the speaker of a given speech signal using features of the human voice and machine learning techniques. The project was implemented in the Python programming language, and the free audio editing tool Audacity was used to prepare the sample data.
The audio files collected for this project were inspected for noise and silent sections, and clean samples were extracted using Audacity.
The voice features extracted from the voice samples are Mel-Frequency Cepstral Coefficients (MFCCs) and their deltas. The data thus obtained is used to model each speaker's voice.
The speaker features have been fitted to Gaussian Mixture Models (GMMs), and the speaker of each test sample is predicted by computing the per-sample average log-likelihood of its features against these models.
CONTENTS
1. Introduction
   1.1 Speech Signal
   1.2 Feature Extraction
   1.3 Choice of features
   1.4 Mel-Frequency Cepstral Coefficients
   1.5 Delta MFCCs
   1.6 Modelling
   1.7 Choice of classification/modelling algorithm
   1.8 Gaussian Distribution
   1.9 Gaussian Mixture Model
2. Methodology and Implementation
   2.1 Data collection
   2.2 Extracting features
   2.3 Training speaker model
   2.4 Testing speaker model
3. Results
4. Conclusion
5. Uses and scope
Bibliography
Appendix
   1. Python code
   2. List of Figures
1. INTRODUCTION
Speaker recognition is the process of automatically identifying a person from their voice.
This technology is used in a variety of applications, such as voice-enabled devices like
smartphones and home assistants, security systems, and call centres. Speaker recognition
systems typically work by analysing the unique characteristics of an individual's voice and
representing these characteristics in a way that can be easily compared to other voices. The
system then uses this information to determine whether the person speaking is the same
person whose voice is in the pre-recorded sample.
Speaker modelling is an important part of speaker recognition systems, as it allows these
systems to accurately identify individuals based on their voice. Speaker modelling is typically
done using machine learning algorithms, which are trained on large datasets of voice samples
in order to learn the unique characteristics of individual voices.
1.1 Speech Signal
Speech is approximately stationary over short time intervals, but its frequency content changes over longer periods of time. A digital speech signal is a representation of spoken language in digital form. It is created by sampling an analog speech signal at a specific sampling rate and converting the resulting samples into a digital format.
The time domain speech signal is as shown in Figure 1a. To visualise the frequency domain
signal of speech, a Short Time Fourier Transform (STFT) is performed as given in Figure 1b.
A spectrogram is a plot of the STFT in which the log amplitude is indicated by colour intensity.
Figure 1a. Time domain speech signal
Figure 1b. Spectrogram of speech signal
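For illustration, the following is a minimal sketch (not the project code) of computing such a spectrogram in Python with SciPy and Matplotlib; the file name "sample.wav" and the window sizes are placeholders.

    # Minimal sketch: visualising the spectrogram of a speech recording.
    # "sample.wav" is a placeholder file name, not part of the project data.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, signal = wavfile.read("sample.wav")   # sampling rate (Hz) and raw samples
    if signal.ndim > 1:                         # keep one channel if the file is stereo
        signal = signal[:, 0]

    # Short-time Fourier analysis: split the signal into overlapping windows.
    freqs, times, Sxx = spectrogram(signal, fs=rate, nperseg=1024, noverlap=512)

    # Plot log amplitude as colour intensity (as in Figure 1b).
    plt.pcolormesh(times, freqs, 10 * np.log10(Sxx + 1e-10), shading="auto")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()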
1.2 Feature Extraction
In speaker recognition, feature extraction is the process of extracting distinctive
characteristics from an individual's speech. These characteristics, known as features, are then
used to create a unique representation of the individual's voice. This representation, known as
a speaker model, is used to identify the speaker in future speech samples.
There are various techniques used for feature extraction in speaker recognition. Some
common ones include Mel-Frequency Cepstral Coefficients (MFCCs), Linear Predictive
Coding (LPC), and Perceptual Linear Prediction (PLP). In this work, we have concentrated
on MFCCs.
MFCCs are calculated by converting the speech signal from the time domain to the frequency
domain using a Fourier transform, and then applying a triangular filterbank to the resulting
spectrum. This produces a set of coefficients that capture the spectral characteristics of the
speech signal.
These techniques are typically used in combination, with the resulting features being fed into
a machine learning algorithm for speaker recognition. The specific details of the feature
extraction process can vary depending on the specific application and the desired level of
accuracy.
1.3 Choice of features
Linear Predictive Coding Coefficients (LPCCs) and Mel-Frequency Cepstral Coefficients
(MFCCs) are both types of features that are commonly used in speaker recognition systems.
Both LPCCs and MFCCs are designed to capture spectral information about a speech signal,
which can be used to identify the speaker.
One key difference between LPCCs and MFCCs is the way in which they represent the spectral information of the signal [8]. LPCCs are obtained by applying linear prediction to the
signal, which involves fitting a linear model to the autocorrelation function of the signal.
MFCCs, on the other hand, are derived by applying a series of frequency transformations to
the signal, including a Mel-scale warping and a discrete cosine transform (DCT). This results
in a representation of the signal that is based on the spectral power of the signal in different
frequency bands.
Based on a literature survey [3][4], MFCCs have been found to be more commonly used in speech recognition and speaker identification systems because they are computationally efficient and have been shown to perform well in a variety of applications. Moreover, MFCCs have been found to perform better in noisy environments than LPCCs [7]. LPCCs can also be effective in these tasks and may be preferred in certain situations. In our work, we have used MFCCs and their deltas as our feature vector.
1.4 Mel-Frequency Cepstral Coefficients (MFCC)
The perception of hearing in the human ear is logarithmic in nature. The Mel frequency scale
is a measure of frequency in audio signals that is based on the perception of pitch by the
human ear. It is used in many speech and audio processing applications because it more
closely approximates the way the human ear perceives pitch than the linear frequency scale.
It is particularly useful for tasks such as speech recognition, where the pitch of the speaker's
voice can be an important feature for identifying words and sounds. The formula for converting a frequency f in Hz to the corresponding value m on the Mel scale is [8]:

m = 2595 * log10(1 + f / 700)
The Mel scale is logarithmic, which means that the perceived difference in pitch between two
frequencies is the same regardless of the actual frequency difference. MFCCs are a set of
features derived from the power spectrum of an audio signal by applying a filterbank spaced
equally on the Mel frequency scale. The MFCCs capture the spectral content of the signal in a
way that is more closely related to the way the human ear perceives pitch than the raw power
spectrum.
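As a small illustration (not part of the project code), this conversion and its inverse can be written directly in Python:

    import numpy as np

    def hz_to_mel(f_hz):
        # Mel value corresponding to a frequency in Hz (formula above).
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def mel_to_hz(mel):
        # Inverse mapping, from the Mel scale back to Hz.
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    print(hz_to_mel(1000.0))   # roughly 1000 mel near 1 kHz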
The steps to calculate Mel-Frequency Cepstral Coefficients (MFCCs) from a speech signal, including the formulas used at each step, are as follows [8]:
1. Pre-processing: The first step is to apply some pre-processing to the signal, such as
applying a window function and performing signal normalization. This helps to
reduce the effects of background noise and other variations in the signal. For example,
the signal can be windowed using a window function w[n] as follows:
s[n] = x[n] * w[n]
where x[n] is the original signal and s[n] is the windowed signal.
2. Fourier Transform: Next, the signal is transformed into the frequency domain using a
Fast Fourier Transform (FFT). This allows the spectral content of the signal to be
analysed. The FFT of the windowed signal s[n] is given by:
S[k] = FFT(s[n]) = ∑s[n]e^(-j2πkn/N)
where S[k] is the frequency spectrum, n is the time index, k is the frequency index, and N is
the number of samples in the signal.
3. Mel-scale warping: The frequency spectrum is then transformed onto the Mel scale, which is a non-linear frequency scale designed to more closely match the perceived frequency resolution of the human ear. This is typically done by applying a triangular or trapezoidal filterbank to the spectrum, with filter centres spaced uniformly on the Mel scale, dividing the spectrum into a set of overlapping frequency bands. The Mel value m(f) corresponding to a frequency f in Hz is given by:
m(f) = 2595 * log10(1 + f/700)
where f is the frequency in Hz.
4. Discrete Cosine Transform (DCT): The log of the Mel-scaled band energies is then transformed using a Discrete Cosine Transform (DCT). This transformation decorrelates the spectral coefficients, which can improve the robustness of the features to noise and other variations. The DCT coefficients C[k] are given by:
C[k] = Σ_{m=1}^{M} E[m] · cos((πk/M)(m − 0.5))
where E[m] is the log energy of the m'th Mel band and M is the number of Mel-scaled frequency bands.
5. MFCCs: The resulting DCT coefficients are the MFCCs. These coefficients capture
the spectral content of the signal and can be used for tasks such as speech recognition
and speaker identification.
It's worth noting that these steps are just a high-level overview of the process and that there
are many variations and refinements that can be applied to the calculation of MFCCs. For
example, the number of Mel-scaled frequency bands and the type of DCT used can be
adjusted to optimize the performance of the features for a particular application.
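For illustration, the following is a simplified, from-scratch sketch of these steps for a single frame, using only NumPy and SciPy. It is not the project code (the project uses the python_speech_features library, as described in Section 2.2), and the filter and coefficient counts are arbitrary example values.

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_frame(frame, rate, n_filters=26, n_ceps=13):
        n = len(frame)
        # Step 1: window the frame (here, a Hamming window).
        windowed = frame * np.hamming(n)
        # Step 2: FFT and power spectrum.
        spectrum = np.abs(np.fft.rfft(windowed)) ** 2
        # Step 3: triangular filterbank with centres equally spaced on the Mel scale.
        mel_max = 2595.0 * np.log10(1.0 + (rate / 2.0) / 700.0)
        mel_points = np.linspace(0.0, mel_max, n_filters + 2)
        hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
        bins = np.floor((n + 1) * hz_points / rate).astype(int)
        fbank = np.zeros((n_filters, len(spectrum)))
        for i in range(1, n_filters + 1):
            left, centre, right = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
            fbank[i - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
        log_energies = np.log(fbank @ spectrum + 1e-10)
        # Steps 4 and 5: DCT of the log Mel energies, keeping the first n_ceps coefficients.
        return dct(log_energies, type=2, norm="ortho")[:n_ceps]

    # Example usage on one 25 ms frame of a 48 kHz recording:
    # coeffs = mfcc_frame(signal[0:1200], 48000)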
1.5 Delta MFCCs
Delta MFCCs, also known as delta coefficients, are a type of feature used in speech
processing and speaker recognition. They are derived from the Mel-Frequency Cepstral
Coefficients (MFCCs), which are a set of coefficients that represent the spectral
characteristics of a sound.
Delta coefficients are calculated by taking the first-order or second-order derivative of the
MFCCs over time. This captures the changes in the MFCCs over time, and provides
additional information about the dynamics of the speech signal.
Delta coefficients are typically used in combination with the original MFCCs, resulting in a
set of features that includes both spectral and temporal information. This can improve the
performance of the speaker recognition system, as the additional temporal information can
provide additional cues for identifying the speaker.
The steps to calculate the delta-MFCCs are as follows:
1. First, calculate the MFCCs of the audio signal using a standard MFCC calculation
algorithm.
2. Once you have the MFCCs, calculate the difference between the MFCCs of the
current frame and the previous frame. This difference is called the first-order delta-
MFCCs.
3. You can also calculate the difference between the first-order delta-MFCCs and the
previous frame's delta-MFCCs. This difference is called the second-order delta-
MFCCs.
4. To build the final feature vector, you can concatenate the MFCCs, the first-order delta-MFCCs, and the second-order delta-MFCCs (see the sketch below). This gives a feature vector that captures the spectral characteristics of the audio signal as well as the changes in those characteristics over time.
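A minimal sketch of this calculation using the python_speech_features library is given below; mfcc_feat is assumed to be an (n_frames, n_ceps) MFCC matrix from step 1, and the delta window width of 2 frames is an illustrative choice.

    import numpy as np
    from python_speech_features import delta

    def add_deltas(mfcc_feat):
        # mfcc_feat: (n_frames, n_ceps) matrix of MFCCs.
        delta_feat = delta(mfcc_feat, 2)      # step 2: first-order deltas
        delta2_feat = delta(delta_feat, 2)    # step 3: second-order deltas ("delta-deltas")
        # Step 4: concatenate static and dynamic features frame by frame.
        return np.hstack((mfcc_feat, delta_feat, delta2_feat))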
1.6 Modelling
Speaker modelling is the process of creating a unique representation of an individual's voice
for the purpose of speaker recognition. This representation, known as a speaker model, is
derived from a set of speech samples from the individual, and is used to identify the speaker
in future speech samples. In this project, we have used the MFCCs and the Delta MFCCs as
features for modelling the speakers.
1.7 Choice of classification/modelling algorithm
There are many machine learning algorithms that can be used for speaker recognition. The
specific algorithm that is best for a given application will depend on a variety of factors, such
as the quality of the speech data, the amount of data available, and the desired level of
accuracy.
Some common algorithms used for speaker recognition include Gaussian mixture models
(GMMs), support vector machines (SVMs), and deep neural networks (DNNs).
GMMs are probabilistic models that represent the distribution of the data using a mixture of
Gaussian distributions. They are often used as a simple and effective baseline for speaker
recognition.
SVMs are a type of supervised learning algorithm that constructs a hyperplane or set of
hyperplanes in a high-dimensional space to separate different classes of data. They can be
effective for speaker recognition when the data is linearly separable.
DNNs are a type of neural network that has multiple layers of interconnected nodes. They are
capable of learning complex nonlinear relationships in the data, and have been shown to be
effective for speaker recognition when trained on large amounts of data.
In general, DNNs tend to perform better than other algorithms for speaker recognition, but they also require more data and computational resources to train [1].
Based on our literature survey, we have found GMMs to be a commonly used and reliable way to create speaker models without the need for a huge dataset or extensive computation [2][5][6]. Hence, in our work, we used GMMs to create the speaker models.
1.8 Gaussian Distribution
A Gaussian distribution, also known as a normal distribution, is a continuous probability
distribution that is symmetric about its mean. It is defined by its mean and standard deviation,
and is often used to model real-valued random variables. Most observations cluster around the mean, and the further a value lies from the mean, the lower its probability density.
Figure 4 - Gaussian distributions
The graph of a Gaussian function is a bell-shaped curve, given by the standard form:

f(x) = (1 / (σ √(2π))) · e^(−(x − μ)² / (2σ²))

where μ = mean and σ = standard deviation.
The parameters μ and σ determine the shape of the graph: the curve is centred around the mean μ and its width is proportional to the standard deviation σ, though the graphs of all Gaussian distributions have the same general bell shape.
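As a small illustration (not part of the project code), the density above can be evaluated in Python as follows:

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        # f(x) = (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
        return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

    print(gaussian_pdf(0.0, 0.0, 1.0))   # ≈ 0.3989, the peak of the standard normal curve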
1.9 Gaussian Mixture Model
Figure 5. Two component Gaussian Mixture Model
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the underlying
data is generated from a mixture of several different underlying distributions, each of which
is a Gaussian distribution. Each Gaussian distribution is associated with a cluster and each
data point is associated with a mixture component represented by the Gaussian distribution it
was generated from. They can be used to model the distribution of the speech features
extracted from the speech samples, and to score new speech samples based on their
likelihood of belonging to the speaker. Mathematically, a GMM is expressed as:

p(x) = Σ_{i=1}^{K} φ_i · N(x | μ_i, σ_i)

N(x | μ_i, σ_i) = (1 / (σ_i √(2π))) · e^(−(x − μ_i)² / (2σ_i²))

Σ_{i=1}^{K} φ_i = 1

where N(x | μ_i, σ_i) is the probability density function of the i'th Gaussian component in the model, μ_i and σ_i are the mean and standard deviation of that component, φ_i is its mixture weight, and K is the number of components in the model.
The likelihood that a given data sample was generated by a GMM is computed as the logarithm of the probability of the sample under the GMM. In other words, it is the logarithm of the sum of the probabilities of the sample under each of the mixture components, weighted by the mixture weights. The probability of a sample x under a GMM is given as:

P(x) = Σ_{i=1}^{K} φ_i · P(x | μ_i)

where K is the number of components in the model, and the log probability of x is given as:

log P(x) = log( Σ_{i=1}^{K} φ_i · P(x | μ_i) )
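For illustration, the following sketch evaluates the log-likelihood of a sample under a small one-dimensional GMM; the weights, means and standard deviations are arbitrary example values, not values from the project.

    import numpy as np
    from scipy.stats import norm

    phi = np.array([0.6, 0.4])      # mixture weights phi_i, summing to 1
    mu = np.array([-1.0, 2.0])      # component means mu_i
    sigma = np.array([0.5, 1.5])    # component standard deviations sigma_i

    x = 0.3
    component_probs = norm.pdf(x, loc=mu, scale=sigma)   # N(x | mu_i, sigma_i)
    log_p = np.log(np.sum(phi * component_probs))        # log P(x)
    print(log_p)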
2. METHODOLOGY AND IMPLEMENTATION
Figure 6 - Flowchart
2.1 Data collection
Table 1. Collected data

SPEAKER      GENDER    TRAINING SAMPLES    TEST SAMPLES    TOTAL
Speaker_1    Female    40                  10              50
Speaker_2    Female    40                  10              50
Speaker_3    Male      40                  10              50
We collect voice recordings of 3 different individuals, labelled Speaker_1, Speaker_2 and Speaker_3. Each speaker recorded words such as the digits "one" to "ten", letters such as "a", "b", "c" and "d", and short phrases such as "Good morning" and "Happy Birthday".
These audio files are saved in the .wav format.
The audio files are recorded with a sampling frequency of 48 kHz and a bit rate of 1400 kbps.
We use Audacity to remove silent parts and extract the voice samples. From each speaker we
collect 50 recordings totalling around 30 seconds. We divide the samples for training and
testing in a 40:10 split, i.e. we use 40 audio samples to train each speaker model and the remaining 10 samples to test it.
2.2 Extracting features
We obtain the MFCC features from each audio sample using the mfcc() function provided by the python_speech_features library. The following parameters are passed to the mfcc() function:
The audio signal
Sampling frequency (samplerate)
Window length (winlen) - 25 ms
Window step length (winstep) - 10 ms
Number of cepstral coefficients to return (numcep) - 20
FFT size (nfft) - 1200
appendEnergy - when set to True, the log of the total frame energy is used in place of the 0'th coefficient
The MFCCs are obtained as a matrix with 'n' rows, one per frame, and 20 columns for the 20 coefficients obtained for each frame. Their deltas are also computed and stacked horizontally, giving an array of shape (n, 40). This is the feature vector used for training our model.
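A minimal sketch of this feature-extraction step, using the parameters listed above, is given below; the use of scipy.io.wavfile for loading audio and the delta window width of 2 frames are assumptions.

    import numpy as np
    from scipy.io import wavfile
    from python_speech_features import mfcc, delta

    def extract_features(path):
        rate, signal = wavfile.read(path)        # e.g. a 48 kHz .wav recording
        mfcc_feat = mfcc(signal, samplerate=rate,
                         winlen=0.025,           # 25 ms window length
                         winstep=0.01,           # 10 ms window step
                         numcep=20,              # 20 cepstral coefficients per frame
                         nfft=1200,              # FFT size
                         appendEnergy=True)      # log frame energy replaces the 0'th coefficient
        delta_feat = delta(mfcc_feat, 2)         # first-order deltas (window width of 2 frames assumed)
        return np.hstack((mfcc_feat, delta_feat))  # shape (n, 40)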
2.3 Training speaker model
In this step, the features for each audio file for every speaker are extracted and stacked
vertically, thus forming a feature matrix consisting of features from all the audio samples of
one speaker. This feature matrix is then used to train our model.
The scikit-learn library provides the GaussianMixture class, which takes the following parameters:
The number of components in the mixture, n_components - 16
The type of covariance, covariance_type - 'diag' (diagonal covariance matrices)
The maximum number of fitting iterations to perform, max_iter - 500
The number of initialisations to perform, n_init - 3
We obtain a Gaussian mixture model object with the specified parameters. Then we fit the
model with the feature set we had created from the audio files of the speaker. After the model
has been fitted with the speaker’s feature data, we serialise the model object as a binary
stream and save it as a file using the pickle module.
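A minimal sketch of this training step is given below; the function name and the representation of the training data as a list of per-recording feature matrices are illustrative choices (the project's full code is listed in the appendix).

    import pickle
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_speaker_model(feature_matrices, out_path):
        # feature_matrices: list of (n_i, 40) arrays, one per training recording of the speaker.
        speaker_features = np.vstack(feature_matrices)   # stack all frames vertically
        gmm = GaussianMixture(n_components=16,
                              covariance_type="diag",    # diagonal covariance matrices
                              max_iter=500,              # maximum EM iterations
                              n_init=3)                  # number of initialisations
        gmm.fit(speaker_features)
        with open(out_path, "wb") as f:                  # serialise the fitted model
            pickle.dump(gmm, f)
        return gmm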
2.4 Testing speaker model
In this step, we take a voice recording of a speaker, extract its features and generate the feature matrix. The stored speaker models are loaded and, for each model, the per-sample average log-likelihood of the audio data is computed using the model's score() method. The model that produces the highest likelihood is the one matching the audio, and its speaker is predicted to be the speaker of the given audio file.
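A minimal sketch of this testing step is given below, with the extract_features sketch from Section 2.2 assumed to produce the test feature matrix; the model file paths are placeholders.

    import pickle
    import numpy as np

    def predict_speaker(test_features, model_paths):
        # test_features: (n, 40) feature matrix of the test recording.
        # model_paths: pickled GMM files, e.g. ["Speaker_1.gmm", "Speaker_2.gmm", "Speaker_3.gmm"].
        scores = []
        for path in model_paths:
            with open(path, "rb") as f:
                gmm = pickle.load(f)
            # score() returns the per-sample average log-likelihood under this model.
            scores.append(gmm.score(test_features))
        return model_paths[int(np.argmax(scores))]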
3. RESULTS
We test our speaker models with the 10 test samples kept aside for each speaker. The results for each speaker are as follows:
Table 2. Results of Speaker_1
Sample No. Predicted speaker
1 Speaker_1
2 Speaker_1
3 Speaker_1
4 Speaker_1
5 Speaker_1
6 Speaker_1
7 Speaker_1
8 Speaker_1
9 Speaker_1
10 Speaker_1
Table 3. Results of Speaker_2
Sample No. Predicted speaker
1 Speaker_2
2 Speaker_2
3 Speaker_2
4 Speaker_2
5 Speaker_2
6 Speaker_2
7 Speaker_2
8 Speaker_1
9 Speaker_2
10 Speaker_2
Table 4. Results of Speaker_3
Sample No. Predicted speaker
1 Speaker_3
2 Speaker_3
3 Speaker_3
4 Speaker_3
5 Speaker_3
6 Speaker_3
7 Speaker_3
8 Speaker_3
9 Speaker_3
10 Speaker_3
The test results can be visualised with the following confusion matrix -
Figure 7 - Confusion Matrix
Table 5. Outcomes and accuracy

Speaker      No. of test samples    No. of true outcomes    No. of false outcomes    Accuracy %
Speaker_1    10                     10                      0                        100
Speaker_2    10                     9                       1                        90
Speaker_3    10                     10                      0                        100
Thus, we observe that our speaker models are able to correctly identify voice recordings of Speaker_1, Speaker_2 and Speaker_3 with accuracies of 100%, 90% and 100% respectively, giving an average accuracy of 96.67%.
4. CONCLUSION
In this project, we have observed that by modelling MFCC features of speech using Gaussian Mixture Models, a speaker can be identified from a speech signal with good accuracy without a large training dataset.
It is thus concluded that MFCC features combined with GMMs are a viable method of performing speaker identification and verification.
5. USES AND SCOPE
Speaker recognition is used, or can be used, in a variety of areas such as:
Security and access control: Speaker recognition can be used to verify the identity of
an individual before granting access to a secure area or system.
Fraud prevention: Speaker recognition can be used to detect fraudulent activity by
verifying the identity of individuals making phone calls or accessing accounts.
Customer service: Speaker recognition can be used to identify customers when they
call a customer service hotline, allowing the system to provide personalized
assistance.
Voice assistants: Speaker recognition can be used to enable voice assistants, such as
Siri or Alexa, to recognize and respond to specific individuals.
Law enforcement: Speaker recognition can be used by law enforcement agencies to
identify individuals from recorded conversations or intercepts.
For future work, we wish to carry forward the successes and learnings of this project work to
develop a voice access control system which will require implementation of speaker
recognition processes in hardware devices. Such a system could be used in various secure and
restricted access environments.