ABSTRACT
In this project we have implemented a system to identify the speaker of a given speech signal using features of the human voice and machine learning techniques. The project was implemented in the Python programming language, and the free audio editing tool Audacity was used to prepare the sample data.
The audio files collected for this project were inspected for noise and silent sections, and clean samples were extracted using Audacity.
The voice features extracted from the voice samples are Mel-Frequency Cepstral Coefficients (MFCCs) and their deltas. The data thus obtained is used to model each speaker's voice.
The speaker features have been fitted to Gaussian Mixture Models (GMMs), and the speaker of each test sample is predicted by computing the per-sample average log-likelihood of its features against these models.
CONTENTS
1. Introduction
   1.1 Speech Signal
   1.2 Feature Extraction
   1.3 Choice of features
   1.4 Mel-Frequency Cepstral Coefficients
   1.5 Delta MFCCs
   1.6 Modelling
   1.7 Choice of classification/modelling algorithm
   1.8 Gaussian Distribution
   1.9 Gaussian Mixture Model
2. Methodology and Implementation
   2.1 Data collection
   2.2 Extracting features
   2.3 Training speaker model
   2.4 Testing speaker model
3. Results
4. Conclusion
5. Uses and scope
Bibliography
Appendix
   1. Python code
   2. List of Figures
1. INTRODUCTION
Speaker recognition is the process of automatically identifying a person from their voice.
This technology is used in a variety of applications, such as voice-enabled devices like
smartphones and home assistants, security systems, and call centres. Speaker recognition
systems typically work by analysing the unique characteristics of an individual's voice and
representing these characteristics in a way that can be easily compared to other voices. The
system then uses this information to determine whether the person speaking is the same
person whose voice is in the pre-recorded sample.
Speaker modelling is an important part of speaker recognition systems, as it allows these
systems to accurately identify individuals based on their voice. Speaker modelling is typically
done using machine learning algorithms, which are trained on large datasets of voice samples
in order to learn the unique characteristics of individual voices.
1.1 Speech Signal
Speech is approximately stationary over short time intervals, but its frequency content changes over longer periods of time. A digital speech signal is a representation of spoken language in digital form. It is created by sampling an analog speech signal at a specific sampling rate and converting the resulting samples into a digital format.
The time domain speech signal is as shown in Figure 1a. To visualise the frequency domain
signal of speech, a Short Time Fourier Transform (STFT) is performed as given in Figure 1b.
A spectrogram is a plot of the STFT in which the log amplitude is indicated by colour intensity.
Figure 1a. Time domain speech signal
Figure 1b. Spectrogram of speech signal
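For illustration, the following is a minimal sketch (not the project code) of computing such a spectrogram in Python with SciPy and Matplotlib; the file name "sample.wav" and the window sizes are placeholders.

    # Minimal sketch: visualising the spectrogram of a speech recording.
    # "sample.wav" is a placeholder file name, not part of the project data.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, signal = wavfile.read("sample.wav")   # sampling rate (Hz) and raw samples
    if signal.ndim > 1:                         # keep one channel if the file is stereo
        signal = signal[:, 0]

    # Short-time Fourier analysis: split the signal into overlapping windows.
    freqs, times, Sxx = spectrogram(signal, fs=rate, nperseg=1024, noverlap=512)

    # Plot log amplitude as colour intensity (as in Figure 1b).
    plt.pcolormesh(times, freqs, 10 * np.log10(Sxx + 1e-10), shading="auto")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()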
1.2 Feature Extraction
In speaker recognition, feature extraction is the process of extracting distinctive
characteristics from an individual's speech. These characteristics, known as features, are then
used to create a unique representation of the individual's voice. This representation, known as
a speaker model, is used to identify the speaker in future speech samples.
There are various techniques used for feature extraction in speaker recognition. Some
common ones include Mel-Frequency Cepstral Coefficients (MFCCs), Linear Predictive
Coding (LPC), and Perceptual Linear Prediction (PLP). In this work, we have concentrated
on MFCCs.
MFCCs are calculated by converting the speech signal from the time domain to the frequency
domain using a Fourier transform, and then applying a triangular filterbank to the resulting
spectrum. This produces a set of coefficients that capture the spectral characteristics of the
speech signal.
These techniques are typically used in combination, with the resulting features being fed into
a machine learning algorithm for speaker recognition. The specific details of the feature
extraction process can vary depending on the specific application and the desired level of
accuracy.
1.3 Choice of features
Linear Predictive Coding Coefficients (LPCCs) and Mel-Frequency Cepstral Coefficients
(MFCCs) are both types of features that are commonly used in speaker recognition systems.
Both LPCCs and MFCCs are designed to capture spectral information about a speech signal,
which can be used to identify the speaker.
One key difference between LPCCs and MFCCs is the way in which they represent the spectral information of the signal [8]. LPCCs are obtained by applying linear prediction to the
signal, which involves fitting a linear model to the autocorrelation function of the signal.
MFCCs, on the other hand, are derived by applying a series of frequency transformations to
the signal, including a Mel-scale warping and a discrete cosine transform (DCT). This results
in a representation of the signal that is based on the spectral power of the signal in different
frequency bands.
Based on a literature survey [3][4], MFCCs have been found to be more commonly used in speech recognition and speaker identification systems because they are computationally efficient and have been shown to perform well in a variety of applications. Moreover, MFCCs have been found to perform better in noisy environments than LPCCs [7]. LPCCs can also be effective in these tasks and may be preferred in certain situations. In our work, we have used MFCCs and their deltas as our feature vector.
1.4 Mel-Frequency Cepstral Coefficients (MFCC)
The perception of hearing in the human ear is logarithmic in nature. The Mel frequency scale
is a measure of frequency in audio signals that is based on the perception of pitch by the
human ear. It is used in many speech and audio processing applications because it more
closely approximates the way the human ear perceives pitch than the linear frequency scale.
It is particularly useful for tasks such as speech recognition, where the pitch of the speaker's
voice can be an important feature for identifying words and sounds. The formula for converting a frequency f in Hz to the corresponding value m on the Mel scale is [8]:

m = 2595 * log10(1 + f / 700)
The Mel scale is logarithmic, which means that the perceived difference in pitch between two
frequencies is the same regardless of the actual frequency difference. MFCCs are a set of
features derived from the power spectrum of an audio signal by applying a filterbank spaced
equally on the Mel frequency scale. The MFCCs capture the spectral content of the signal in a
way that is more closely related to the way the human ear perceives pitch than the raw power
spectrum.
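As a small illustration (not part of the project code), this conversion and its inverse can be written directly in Python:

    import numpy as np

    def hz_to_mel(f_hz):
        # Mel value corresponding to a frequency in Hz (formula above).
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def mel_to_hz(mel):
        # Inverse mapping, from the Mel scale back to Hz.
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    print(hz_to_mel(1000.0))   # roughly 1000 mel near 1 kHz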
The steps to calculate Mel-Frequency Cepstral Coefficients (MFCCs) from a speech signal, including the formulas used at each step, are as follows [8]:
1. Pre-processing: The first step is to apply some pre-processing to the signal, such as
applying a window function and performing signal normalization. This helps to
reduce the effects of background noise and other variations in the signal. For example,
the signal can be windowed using a window function w[n] as follows:
s[n] = x[n] * w[n]
where x[n] is the original signal and s[n] is the windowed signal.
2. Fourier Transform: Next, the signal is transformed into the frequency domain using a
Fast Fourier Transform (FFT). This allows the spectral content of the signal to be
analysed. The FFT of the windowed signal s[n] is given by:
S[k] = FFT(s[n]) = ∑s[n]e^(-j2πkn/N)
where S[k] is the frequency spectrum, n is the time index, k is the frequency index, and N is
the number of samples in the signal.
3. Mel-scale warping: The frequency spectrum is then transformed onto the Mel scale, which is a non-linear frequency scale designed to more closely match the perceived frequency resolution of the human ear. This is typically done by applying a triangular or trapezoidal filterbank to the spectrum, with filter centres spaced uniformly on the Mel scale, dividing the spectrum into a set of overlapping frequency bands. The Mel value m(f) corresponding to a frequency f in Hz is given by:
m(f) = 2595 * log10(1 + f/700)
where f is the frequency in Hz.
4. Discrete Cosine Transform (DCT): The log of the Mel-scaled band energies is then transformed using a Discrete Cosine Transform (DCT). This transformation decorrelates the spectral coefficients, which can improve the robustness of the features to noise and other variations. The DCT coefficients C[k] are given by:
C[k] = Σ_{m=1}^{M} E[m] · cos((πk/M)(m − 0.5))
where E[m] is the log energy of the m'th Mel band and M is the number of Mel-scaled frequency bands.
5. MFCCs: The resulting DCT coefficients are the MFCCs. These coefficients capture
the spectral content of the signal and can be used for tasks such as speech recognition
and speaker identification.
It's worth noting that these steps are just a high-level overview of the process and that there
are many variations and refinements that can be applied to the calculation of MFCCs. For
example, the number of Mel-scaled frequency bands and the type of DCT used can be
adjusted to optimize the performance of the features for a particular application.
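For illustration, the following is a simplified, from-scratch sketch of these steps for a single frame, using only NumPy and SciPy. It is not the project code (the project uses the python_speech_features library, as described in Section 2.2), and the filter and coefficient counts are arbitrary example values.

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_frame(frame, rate, n_filters=26, n_ceps=13):
        n = len(frame)
        # Step 1: window the frame (here, a Hamming window).
        windowed = frame * np.hamming(n)
        # Step 2: FFT and power spectrum.
        spectrum = np.abs(np.fft.rfft(windowed)) ** 2
        # Step 3: triangular filterbank with centres equally spaced on the Mel scale.
        mel_max = 2595.0 * np.log10(1.0 + (rate / 2.0) / 700.0)
        mel_points = np.linspace(0.0, mel_max, n_filters + 2)
        hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
        bins = np.floor((n + 1) * hz_points / rate).astype(int)
        fbank = np.zeros((n_filters, len(spectrum)))
        for i in range(1, n_filters + 1):
            left, centre, right = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
            fbank[i - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
        log_energies = np.log(fbank @ spectrum + 1e-10)
        # Steps 4 and 5: DCT of the log Mel energies, keeping the first n_ceps coefficients.
        return dct(log_energies, type=2, norm="ortho")[:n_ceps]

    # Example usage on one 25 ms frame of a 48 kHz recording:
    # coeffs = mfcc_frame(signal[0:1200], 48000)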
1.5 Delta MFCCs
Delta MFCCs, also known as delta coefficients, are a type of feature used in speech
processing and speaker recognition. They are derived from the Mel-Frequency Cepstral
Coefficients (MFCCs), which are a set of coefficients that represent the spectral
characteristics of a sound.
Delta coefficients are calculated by taking the first-order or second-order derivative of the
MFCCs over time. This captures the changes in the MFCCs over time, and provides
additional information about the dynamics of the speech signal.
Delta coefficients are typically used in combination with the original MFCCs, resulting in a
set of features that includes both spectral and temporal information. This can improve the
performance of the speaker recognition system, as the additional temporal information can
provide additional cues for identifying the speaker.
The steps to calculate the delta-MFCCs are as follows:
1. First, calculate the MFCCs of the audio signal using a standard MFCC calculation
algorithm.
2. Once you have the MFCCs, calculate the difference between the MFCCs of the
current frame and the previous frame. This difference is called the first-order delta-
MFCCs.
3. You can also calculate the difference between the first-order delta-MFCCs and the
previous frame's delta-MFCCs. This difference is called the second-order delta-
MFCCs.
4. To build the final feature vector, you can concatenate the MFCCs, the first-order delta-MFCCs, and the second-order delta-MFCCs (see the sketch below). This gives a feature vector that captures the spectral characteristics of the audio signal as well as the changes in those characteristics over time.
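A minimal sketch of this calculation using the python_speech_features library is given below; mfcc_feat is assumed to be an (n_frames, n_ceps) MFCC matrix from step 1, and the delta window width of 2 frames is an illustrative choice.

    import numpy as np
    from python_speech_features import delta

    def add_deltas(mfcc_feat):
        # mfcc_feat: (n_frames, n_ceps) matrix of MFCCs.
        delta_feat = delta(mfcc_feat, 2)      # step 2: first-order deltas
        delta2_feat = delta(delta_feat, 2)    # step 3: second-order deltas ("delta-deltas")
        # Step 4: concatenate static and dynamic features frame by frame.
        return np.hstack((mfcc_feat, delta_feat, delta2_feat))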
1.6 Modelling
Speaker modelling is the process of creating a unique representation of an individual's voice
for the purpose of speaker recognition. This representation, known as a speaker model, is
derived from a set of speech samples from the individual, and is used to identify the speaker
in future speech samples. In this project, we have used the MFCCs and the Delta MFCCs as
features for modelling the speakers.
1.7 Choice of classification/modelling algorithm
There are many machine learning algorithms that can be used for speaker recognition. The
specific algorithm that is best for a given application will depend on a variety of factors, such
as the quality of the speech data, the amount of data available, and the desired level of
accuracy.
Some common algorithms used for speaker recognition include Gaussian mixture models
(GMMs), support vector machines (SVMs), and deep neural networks (DNNs).
GMMs are probabilistic models that represent the distribution of the data using a mixture of
Gaussian distributions. They are often used as a simple and effective baseline for speaker
recognition.
SVMs are a type of supervised learning algorithm that constructs a hyperplane or set of
hyperplanes in a high-dimensional space to separate different classes of data. They can be
effective for speaker recognition when the data is linearly separable.
DNNs are a type of neural network that has multiple layers of interconnected nodes. They are
capable of learning complex nonlinear relationships in the data, and have been shown to be
effective for speaker recognition when trained on large amounts of data.
In general, DNNs tend to perform better than other algorithms for speaker recognition, but they also require more data and computational resources to train [1].
Based on our literature survey, we have found GMMs to be a commonly used and reliable way to create speaker models without the need for a huge dataset or extensive computation [2][5][6]. Hence, in our work, we used GMMs to create the speaker models.
1.8 Gaussian Distribution
A Gaussian distribution, also known as a normal distribution, is a continuous probability
distribution that is symmetric about its mean. It is defined by its mean and standard deviation,
and is often used to model real-valued random variables. Most observations cluster around the mean, and the further a value lies from the mean, the lower its probability density.
Figure 4 - Gaussian distributions
The graph of a Gaussian function is a bell-shaped curve, given by the standard form:

f(x) = (1 / (σ √(2π))) · e^(−(x − μ)² / (2σ²))

where μ = mean and σ = standard deviation.
The parameters μ and σ determine the shape of the graph: the curve is centred around the mean μ and its width is proportional to the standard deviation σ, though the graphs of all Gaussian distributions have the same general bell shape.
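As a small illustration (not part of the project code), the density above can be evaluated in Python as follows:

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        # f(x) = (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
        return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

    print(gaussian_pdf(0.0, 0.0, 1.0))   # ≈ 0.3989, the peak of the standard normal curve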
1.9 Gaussian Mixture Model
Figure 5. Two component Gaussian Mixture Model
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the underlying
data is generated from a mixture of several different underlying distributions, each of which
is a Gaussian distribution. Each Gaussian distribution is associated with a cluster and each
data point is associated with a mixture component represented by the Gaussian distribution it
was generated from. They can be used to model the distribution of the speech features
extracted from the speech samples, and to score new speech samples based on their
likelihood of belonging to the speaker. Mathematically, a GMM is expressed as:

p(x) = Σ_{i=1}^{K} φ_i · N(x | μ_i, σ_i)

N(x | μ_i, σ_i) = (1 / (σ_i √(2π))) · e^(−(x − μ_i)² / (2σ_i²))

Σ_{i=1}^{K} φ_i = 1

where N(x | μ_i, σ_i) is the probability density function of the i'th Gaussian component in the model, μ_i and σ_i are the mean and standard deviation of that component, φ_i is its mixture weight, and K is the number of components in the model.
The likelihood that a given data sample was generated by a GMM is computed as the logarithm of the probability of the sample under the GMM. In other words, it is the logarithm of the sum of the probabilities of the sample under each of the mixture components, weighted by the mixture weights. The probability of a sample x under a GMM is given as:

P(x) = Σ_{i=1}^{K} φ_i · P(x | μ_i)

where K is the number of components in the model, and the log probability of x is given as:

log P(x) = log( Σ_{i=1}^{K} φ_i · P(x | μ_i) )
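For illustration, the following sketch evaluates the log-likelihood of a sample under a small one-dimensional GMM; the weights, means and standard deviations are arbitrary example values, not values from the project.

    import numpy as np
    from scipy.stats import norm

    phi = np.array([0.6, 0.4])      # mixture weights phi_i, summing to 1
    mu = np.array([-1.0, 2.0])      # component means mu_i
    sigma = np.array([0.5, 1.5])    # component standard deviations sigma_i

    x = 0.3
    component_probs = norm.pdf(x, loc=mu, scale=sigma)   # N(x | mu_i, sigma_i)
    log_p = np.log(np.sum(phi * component_probs))        # log P(x)
    print(log_p)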
2. METHODOLOGY AND IMPLEMENTATION
Figure 6 - Flowchart
2.1 Data collection
Table 1. Collected data

SPEAKER      GENDER    TRAINING SAMPLES    TEST SAMPLES    TOTAL
Speaker_1    Female    40                  10              50
Speaker_2    Female    40                  10              50
Speaker_3    Male      40                  10              50
We collect voice recordings of 3 different individuals, labelled Speaker_1, Speaker_2 and Speaker_3. Each speaker recorded words such as the digits "one" to "ten", letters such as "a", "b", "c" and "d", and short phrases such as "Good morning" and "Happy Birthday".
These audio files are saved in the .wav format.
The audio files are recorded with a sampling frequency of 48 kHz and a bit rate of 1400 kbps.
We use Audacity to remove silent parts and extract the voice samples. From each speaker we
collect 50 recordings totalling around 30 seconds. We divide the samples for training and
testing in a 40:10 split, i.e. we use 40 audio samples to train each speaker model and the remaining 10 samples to test it.
2.2 Extracting features
We obtain the MFCC features from each audio sample using the mfcc() function provided by the python_speech_features library. The following parameters are passed to the mfcc() function:
The audio signal
Sampling frequency (samplerate)
Window length (winlen) - 25 ms
Window step length (winstep) - 10 ms
Number of cepstral coefficients to return (numcep) - 20
FFT size (nfft) - 1200
appendEnergy - when set to True, the log of the total frame energy is used in place of the 0'th coefficient
The MFCCs are obtained as a matrix with 'n' rows, one per frame, and 20 columns for the 20 coefficients obtained for each frame. Their deltas are also computed and stacked horizontally, giving an array of shape (n, 40). This is the feature vector used for training our model.
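A minimal sketch of this feature-extraction step, using the parameters listed above, is given below; the use of scipy.io.wavfile for loading audio and the delta window width of 2 frames are assumptions.

    import numpy as np
    from scipy.io import wavfile
    from python_speech_features import mfcc, delta

    def extract_features(path):
        rate, signal = wavfile.read(path)        # e.g. a 48 kHz .wav recording
        mfcc_feat = mfcc(signal, samplerate=rate,
                         winlen=0.025,           # 25 ms window length
                         winstep=0.01,           # 10 ms window step
                         numcep=20,              # 20 cepstral coefficients per frame
                         nfft=1200,              # FFT size
                         appendEnergy=True)      # log frame energy replaces the 0'th coefficient
        delta_feat = delta(mfcc_feat, 2)         # first-order deltas (window width of 2 frames assumed)
        return np.hstack((mfcc_feat, delta_feat))  # shape (n, 40)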
2.3 Training speaker model
In this step, the features for each audio file for every speaker are extracted and stacked
vertically, thus forming a feature matrix consisting of features from all the audio samples of
one speaker. This feature matrix is then used to train our model.
The scikit-learn library provides the GaussianMixture class, which takes the following parameters:
The number of components in the mixture, n_components - 16
The type of covariance, covariance_type - 'diag' (diagonal covariance matrices)
The maximum number of fitting iterations to perform, max_iter - 500
The number of initialisations to perform, n_init - 3
We obtain a Gaussian mixture model object with the specified parameters. Then we fit the
model with the feature set we had created from the audio files of the speaker. After the model
has been fitted with the speaker’s feature data, we serialise the model object as a binary
stream and save it as a file using the pickle module.
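A minimal sketch of this training step is given below; the function name and the representation of the training data as a list of per-recording feature matrices are illustrative choices (the project's full code is listed in the appendix).

    import pickle
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_speaker_model(feature_matrices, out_path):
        # feature_matrices: list of (n_i, 40) arrays, one per training recording of the speaker.
        speaker_features = np.vstack(feature_matrices)   # stack all frames vertically
        gmm = GaussianMixture(n_components=16,
                              covariance_type="diag",    # diagonal covariance matrices
                              max_iter=500,              # maximum EM iterations
                              n_init=3)                  # number of initialisations
        gmm.fit(speaker_features)
        with open(out_path, "wb") as f:                  # serialise the fitted model
            pickle.dump(gmm, f)
        return gmm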
2.4 Testing speaker model
In this step, we take a voice recording of a speaker, extract its features and generate the feature matrix. The stored speaker models are loaded and, for each model, the per-sample average log-likelihood of the audio data is computed using the model's score() method. The model that produces the highest likelihood is the one matching the audio, and its speaker is predicted to be the speaker of the given audio file.
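A minimal sketch of this testing step is given below, with the extract_features sketch from Section 2.2 assumed to produce the test feature matrix; the model file paths are placeholders.

    import pickle
    import numpy as np

    def predict_speaker(test_features, model_paths):
        # test_features: (n, 40) feature matrix of the test recording.
        # model_paths: pickled GMM files, e.g. ["Speaker_1.gmm", "Speaker_2.gmm", "Speaker_3.gmm"].
        scores = []
        for path in model_paths:
            with open(path, "rb") as f:
                gmm = pickle.load(f)
            # score() returns the per-sample average log-likelihood under this model.
            scores.append(gmm.score(test_features))
        return model_paths[int(np.argmax(scores))]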
3. RESULTS
We test our speaker models with the 10 test samples kept aside for each speaker. The results for each speaker are as follows:
Table 2. Results of Speaker_1
Sample No. Predicted speaker
1 Speaker_1
2 Speaker_1
3 Speaker_1
4 Speaker_1
5 Speaker_1
6 Speaker_1
7 Speaker_1
8 Speaker_1
9 Speaker_1
10 Speaker_1
Table 3. Results of Speaker_2
Sample No. Predicted speaker
1 Speaker_2
2 Speaker_2
3 Speaker_2
4 Speaker_2
5 Speaker_2
6 Speaker_2
7 Speaker_2
8 Speaker_1
9 Speaker_2
10 Speaker_2
Table 4. Results of Speaker_3
Sample No. Predicted speaker
1 Speaker_3
2 Speaker_3
3 Speaker_3
4 Speaker_3
5 Speaker_3
6 Speaker_3
7 Speaker_3
8 Speaker_3
9 Speaker_3
10 Speaker_3
The test results can be visualised with the following confusion matrix -
Figure 7 - Confusion Matrix
Table 5. Outcomes and accuracy

Speaker      No. of test samples    No. of true outcomes    No. of false outcomes    Accuracy %
Speaker_1    10                     10                      0                        100
Speaker_2    10                     9                       1                        90
Speaker_3    10                     10                      0                        100
Thus, we observe that our speaker models are able to correctly identify voice recordings of Speaker_1, Speaker_2 and Speaker_3 with accuracies of 100%, 90% and 100% respectively, giving an average accuracy of 96.67%.
4. CONCLUSION
In this project, we have observed that by modelling MFCC features of speech using Gaussian Mixture Models, a speaker can be identified from a speech signal with good accuracy without a large training dataset.
It is thus concluded that MFCC features combined with GMMs are a viable method of performing speaker identification and verification.
5. USES AND SCOPE
Speaker recognition is used, or can be used, in a variety of areas such as:
Security and access control: Speaker recognition can be used to verify the identity of
an individual before granting access to a secure area or system.
Fraud prevention: Speaker recognition can be used to detect fraudulent activity by
verifying the identity of individuals making phone calls or accessing accounts.
Customer service: Speaker recognition can be used to identify customers when they
call a customer service hotline, allowing the system to provide personalized
assistance.
Voice assistants: Speaker recognition can be used to enable voice assistants, such as
Siri or Alexa, to recognize and respond to specific individuals.
Law enforcement: Speaker recognition can be used by law enforcement agencies to
identify individuals from recorded conversations or intercepts.
For future work, we wish to carry forward the successes and learnings of this project work to
develop a voice access control system which will require implementation of speaker
recognition processes in hardware devices. Such a system could be used in various secure and
restricted access environments.