Speaker Recognition System
Ahmedabad, India
19bec037@[Link] 19bec039@[Link]
A. Speech Signal Acquisition

For the digital transformation of the speech signal, the acoustic pressure wave is first collected by a microphone or telephone and converted into an analog electrical signal. This signal is then passed through an anti-aliasing filter, which limits its bandwidth to approximately the Nyquist frequency. Finally, an A/D converter converts the signal into digital form.

B. Speech Production

The vocal tract helps produce speech. Different types of excitation occur depending on the way air flows through the vocal folds: phonated excitation, whispered excitation, frication excitation, compression excitation, etc. Each of these excitations has a different speech model for the recognition process.

Most systems used for voice recognition use features that are developed from the vocal tract, so the shape of the vocal tract also plays a major role in voice recognition. This shape can be estimated from the spectral shape of the voice signal. The excitation discussed above is speaker-dependent information, but it is not the only form of speaker-dependent information available: vital capacity, maximum phonation time, phonation quotient and glottal airflow also fall in this category. These characteristics are physical ones, but there are also learned characteristics that help distinguish between speakers.

In linear prediction (LP) analysis, each speech sample is estimated as a linear combination of the p previous samples:

ŝ_n = − Σ_{k=1..p} a_k · s_{n−k}

The prediction error is

e_n = s_n − ŝ_n = s_n + Σ_{k=1..p} a_k · s_{n−k}

Consider the mean squared error (MSE) to be E:

E = Σ_n e_n² = Σ_n [ s_n + Σ_{k=1..p} a_k · s_{n−k} ]²

Setting the derivatives of E with respect to the a_k to zero gives the minimum-MSE condition

Σ_{k=1..p} a_k · Σ_n s_{n−k} · s_{n−i} = − Σ_n s_n · s_{n−i},   i = 1, 2, …, p

This results in the autocorrelation method of LP. Its time-averaged estimates at lag τ are:

R_τ = Σ_{i=0..N−1−τ} s(i) · s(i+τ)

This method gives the Toeplitz system

| R_0     R_1     R_2     …  R_{p−1} | | a_1 |     | R_1 |
| R_1     R_0     R_1     …  R_{p−2} | | a_2 |     | R_2 |
| R_2     R_1     R_0     …  R_{p−3} | | a_3 |  = −| R_3 |
|  ⋮                      ⋱          | |  ⋮  |     |  ⋮  |
| R_{p−1} R_{p−2} R_{p−3} …  R_0     | | a_p |     | R_p |

This system of equations is solved by Durbin's recursive algorithm:

E_0 = R_0
k_i = −[ R_i + Σ_{j=1..i−1} a_j^{(i−1)} · R_{i−j} ] / E_{i−1}
a_i^{(i)} = k_i
a_j^{(i)} = a_j^{(i−1)} + k_i · a_{i−j}^{(i−1)},   1 ≤ j ≤ i−1
E_i = (1 − k_i²) · E_{i−1}

for i = 1, 2, …, p, and finally

a_j = a_j^{(p)},   1 ≤ j ≤ p

This shows that any signal can be represented by a linear predictor and the corresponding LP error:

s_n = − Σ_{k=1..p} a_k · s_{n−k} + e_n

1. Reflection Coefficients

The reflection coefficients k_i can be recovered from the LP coefficients by the backward recursion

k_i = a_i^{(i)}
a_j^{(i−1)} = [ a_j^{(i)} + a_i^{(i)} · a_{i−j}^{(i)} ] / (1 − k_i²),   1 ≤ j ≤ i−1

for i = p, p−1, …, 1, starting from a_j^{(p)} = a_j.

2. Log Area Ratios

Here the vocal tract is considered as a series of cylindrical acoustic tubes. At each junction there is a possibility of an impedance mismatch or an analogous difference, so part of the wave transmitted at each boundary is reflected back; these reflected portions can be termed k_i. If we consider that the tubes have equal length, the time taken by the sound to travel through each tube is the same, which allows a simple z-transformation for digital filter simulation. To derive the reflection coefficients, the boundary conditions are taken as

A_0 = 0,   A_{p+1} ≫ A_p

The log area ratio (LAR) can be defined as the log of the ratio of adjacent cross-sectional areas of the cylinders:

g_i = log[ A_{i+1} / A_i ] = log[ (1 + k_i) / (1 − k_i) ] = 2 · tanh⁻¹(k_i)

3. Arcsin Reflection Coefficients

Since the LAR becomes singular as |k_i| → 1, the arcsine of the reflection coefficient is used instead, so that the singularity of the LAR is prevented:

g′_i = sin⁻¹(k_i)

Frequency Response

D. Mel-Warped Cepstrum

This feature does not require LP analysis. The signal is windowed and its FFT is taken; the magnitude is then computed and its logarithm taken; the frequencies are warped according to the mel scale; and finally an inverse FFT is applied. It is beneficial because it is modeled well by a linear combination of Gaussian densities and gives better performance.

V. FEATURE SELECTION AND MEASURES

It includes:

A. Traditional Feature Selection
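The traditional LP-derived features discussed above (autocorrelation, Durbin's recursion, reflection coefficients, LAR and arcsine coefficients) can be sketched as follows. This is a minimal Python illustration, not the paper's MATLAB code; the frame and model order p = 4 are arbitrary synthetic choices.

```python
import math

def autocorr(s, p):
    # R_tau = sum_{i=0}^{N-1-tau} s(i) * s(i+tau), for tau = 0..p
    N = len(s)
    return [sum(s[i] * s[i + t] for i in range(N - t)) for t in range(p + 1)]

def durbin(R):
    # Durbin's recursion: solves the Toeplitz normal equations and returns
    # LP coefficients a_1..a_p and reflection coefficients k_1..k_p.
    p = len(R) - 1
    E = R[0]
    a = [0.0] * (p + 1)  # a[0] is unused
    k = [0.0] * (p + 1)
    for i in range(1, p + 1):
        acc = R[i] + sum(a[j] * R[i - j] for j in range(1, i))
        ki = -acc / E
        k[i] = ki
        new_a = a[:]
        new_a[i] = ki
        for j in range(1, i):
            new_a[j] = a[j] + ki * a[i - j]
        a = new_a
        E *= (1.0 - ki * ki)
    return a[1:], k[1:]

def lar(k):
    # g_i = log((1 + k_i) / (1 - k_i)) = 2 * atanh(k_i)
    return [2.0 * math.atanh(ki) for ki in k]

def arcsin_coeffs(k):
    # g'_i = asin(k_i)
    return [math.asin(ki) for ki in k]

# Hypothetical synthetic "voiced" frame (sum of three sinusoids)
frame = [math.sin(0.3 * n) + 0.5 * math.sin(0.7 * n) + 0.1 * math.sin(2.1 * n)
         for n in range(160)]
R = autocorr(frame, p=4)
a, k = durbin(R)
print("LP coefficients:", a)
print("Reflection coefficients:", k)
print("LARs:", lar(k))
print("Arcsin coefficients:", arcsin_coeffs(k))
```

Because the autocorrelation method guarantees |k_i| < 1 for a non-degenerate frame, both the LAR and arcsine transformations are well defined here.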
A. Hypothesis Testing

Hypothesis testing involves choosing between two hypotheses:
1. The user is the claimed speaker.
2. The user is not the claimed speaker.

Consider that the hypothesis that the user is not the claimed speaker is H0 and that the user is the claimed speaker is H1. Let the conditional density function of the match score z generated by a user who is not the claimed speaker be p(z|H0) and by the claimed speaker be p(z|H1). Assuming that the true conditional score densities are known both for the user who is not the claimed speaker and for the user who is the claimed speaker, Bayes's test is based upon the likelihood ratio for the speaker, λ(z):

λ(z) = p(z|H0) / p(z|H1)

If the overlap between the pdfs of the two scores is small, the probability of error is also small. The overlap between the two pdfs is measured by

F = (µ0 − µ1)² / σ²

where µ0 and µ1 are the means and σ² is the variance.

The likelihood-ratio decision is made by choosing a threshold X:

λ(z) ≥ X : choose H0
λ(z) < X : choose H1

B. ROC

One error type can be reduced only by increasing the other type of error: the relation between false acceptance and false rejection is a function of the decision threshold. For any system, the ROC can be traversed by changing the threshold on the acceptance likelihood ratio. A straight-line ROC plot indicates that the product of the probability of false acceptance and the probability of false rejection is constant; the Equal Error Rate (EER) is the value at which false acceptance and false rejection are equal.

CONCLUSION

Speaker recognition is the utilization of a machine to recognize an individual from a spoken utterance. Speaker recognition systems can be used either to identify a particular individual or to verify a person's claimed identity. The fundamentals of speaker recognition and measures for speaker recognition were introduced and contrasted with customary ones using speaker-discrimination criteria. Speaker recognition systems can be designed by matching patterns with hidden Markov models, vector quantization, MFCC, etc.

MATLAB CODES

• Train code

fs = 8000;                      % Sampling rate
nbits = 16;
nChannels = 1;
duration = 5;                   % Recording duration
arObj = audiorecorder(fs, nbits, nChannels);
fprintf('Please press any key to start %g seconds of recording: ', duration);
pause;
fprintf('Please wait while it is recording\n');
recordblocking(arObj, duration);
fprintf('Your voice has now been recorded\n');
fprintf('Please press any key to play the recording');
pause;
fprintf('\n');
play(arObj);
fprintf('Plotting the waveform\n');
y = getaudiodata(arObj);        % Fetching the audio sample data
plot(y);                        % Plotting the waveform
figure;
f = Simplefft(y);               % Dominant spectral peak used as the feature
% Autocorrelation-based pitch estimate for gender detection
ms2 = fs/500;                   % 2 ms lag (500 Hz upper pitch bound)
ms20 = fs/50;                   % 20 ms lag (50 Hz lower pitch bound)
r = xcorr(y, ms20);
d = (-ms20:ms20)/fs;
plot(d, r);
title('Autocorrelation Form');
xlabel('Delay (s)');
ylabel('Correlation Coefficients');
r = r(ms20+1 : 2*ms20+1);
[rmax, tx] = max(r(ms2:ms20));
Fx = fs/(ms2+tx-1);             % Estimated pitch frequency
Fth = 180;                      % Threshold frequency is 180 Hz
if Fx > Fth
    disp('Speaker is Female!')
    gender = 'female';
else
    disp('Speaker is Male!')
    gender = 'male';
end
%% Saving the user data in the MATLAB database
idno = input('Enter Corresponding Identity Number:');
try
    load database
    F = [F; f];
    C = [C; idno];
    G = [G; {gender}];          % cell array, since 'male'/'female' differ in length
    save database F C G
catch
    F = f;
    C = idno;
    G = {gender};
    save database F C G
end
msgbox('Thank You, your voice has been registered')

• Test code

fs = 8000;                      % Sampling rate
nbits = 16;
nChannels = 1;
duration = 5;                   % Recording duration
arObj = audiorecorder(fs, nbits, nChannels);
fprintf('Please press any key to start %g seconds of recording: ', duration);
pause;
fprintf('Please wait while it is recording\n');
recordblocking(arObj, duration);
fprintf('Your voice has now been recorded\n');
fprintf('Please press any key to play the recording');
pause;
fprintf('\n');
play(arObj);
fprintf('Plotting the waveform\n');
y = getaudiodata(arObj);        % Getting the audio sample data
plot(y);                        % Plotting the waveform
figure;
% Extraction and matching against the registered database
f = Simplefft(y);
load database
D = [];
for i = 1:size(F,1)
    d = sum(abs(F(i) - f));     % Distance to the i-th registered feature
    D = [D, d];
end
sm = inf;
ind = -1;
for i = 1:length(D)
    if D(i) < sm
        sm = D(i);
        ind = i;
    end
end
Identity_Number = C(ind)        % Identity of the closest match

• FFT function

% DSP voice database matching via the dominant spectral peak
function [xPitch] = Simplefft(y)
F = fft(y(:,1));
plot(real(F));                  % Plotting the spectrum
m = max(real(F));
xPitch = find(real(F) == m, 1); % Bin index of the dominant peak
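The hypothesis-testing measure of Section V can be illustrated numerically. The sketch below (Python, separate from the MATLAB code above) assumes Gaussian score densities with hypothetical means and variance, computes the likelihood ratio λ(z), applies the threshold rule, and evaluates the F-ratio overlap measure.

```python
import math

def gaussian_pdf(z, mu, sigma):
    # N(mu, sigma^2) density
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio(z, mu0, mu1, sigma):
    # lambda(z) = p(z|H0) / p(z|H1)
    return gaussian_pdf(z, mu0, sigma) / gaussian_pdf(z, mu1, sigma)

def decide(z, mu0, mu1, sigma, X=1.0):
    # lambda(z) >= X -> choose H0 (not the claimed speaker),
    # lambda(z) <  X -> choose H1 (the claimed speaker)
    return 'H0' if likelihood_ratio(z, mu0, mu1, sigma) >= X else 'H1'

def f_ratio(mu0, mu1, sigma):
    # F = (mu0 - mu1)^2 / sigma^2: larger F means less pdf overlap, fewer errors
    return (mu0 - mu1) ** 2 / sigma ** 2

# Hypothetical score statistics: impostor scores around 0, true-speaker scores around 3
mu0, mu1, sigma = 0.0, 3.0, 1.0
print("F-ratio:", f_ratio(mu0, mu1, sigma))
print("Decision for z = 2.9:", decide(2.9, mu0, mu1, sigma))  # near mu1, accepted as H1
```

With these assumed statistics a score near µ1 yields λ(z) ≪ 1, so the claimed-speaker hypothesis H1 is chosen; a score near µ0 yields λ(z) ≫ 1 and H0 is chosen.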