0% found this document useful (0 votes)
462 views

MFCC Code

The document describes the process of extracting mel frequency cepstral coefficients (MFCCs) from a speech signal. It involves framing the speech signal, taking the Fourier transform of each frame, warping the spectrum to the mel scale using triangular filters, taking the log and discrete cosine transform to produce MFCCs, and applying liftering to match the MFCCs to those produced by HTK. An example is also provided to demonstrate extracting MFCCs from a recorded speech file.

Uploaded by

sateesh83
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as TXT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
462 views

MFCC Code

The document describes the process of extracting mel frequency cepstral coefficients (MFCCs) from a speech signal. It involves framing the speech signal, taking the Fourier transform of each frame, warping the spectrum to the mel scale using triangular filters, taking the log and discrete cosine transform to produce MFCCs, and applying liftering to match the MFCCs to those produced by HTK. An example is also provided to demonstrate extracting MFCCs from a recorded speech file.

Uploaded by

sateesh83
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as TXT, PDF, TXT or read online on Scribd
You are on page 1/ 8

function [ CC, FBE, frames ] = mfcc( speech, fs, Tw, Ts, alpha, window, R, M, N,

L )
% MFCC Mel frequency cepstral coefficient feature extraction.
%
% MFCC(S,FS,TW,TS,ALPHA,WINDOW,R,M,N,L) returns mel frequency
% cepstral coefficients (MFCCs) computed from speech signal given
% in vector S and sampled at FS (Hz). The speech signal is first
% preemphasised using a first order FIR filter with preemphasis
% coefficient ALPHA. The preemphasised speech signal is subjected
% to the short-time Fourier transform analysis with frame durations
% of TW (ms), frame shifts of TS (ms) and analysis window function
% given as a function handle in WINDOW. This is followed by magnitude
% spectrum computation followed by filterbank design with M triangular
% filters uniformly spaced on the mel scale between lower and upper
% frequency limits given in R (Hz). The filterbank is applied to
% the magnitude spectrum values to produce filterbank energies (FBEs)
% (M per frame). Log-compressed FBEs are then decorrelated using the
% discrete cosine transform to produce cepstral coefficients. Final
% step applies sinusoidal lifter to produce liftered MFCCs that
% closely match those produced by HTK [1].
%
% [CC,FBE,FRAMES]=MFCC(...) also returns FBEs and windowed frames,
% with feature vectors and frames as columns.
%
% This framework is based on Dan Ellis' rastamat routines [2]. The
% emphasis is placed on closely matching MFCCs produced by HTK [1]
% (refer to p.337 of [1] for HTK's defaults) with simplicity and
% compactness as main considerations, but at a cost of reduced
% flexibility. This routine is meant to be easy to extend, and as
% a starting point for work with cepstral coefficients in MATLAB.
% The triangular filterbank equations are given in [3].
%
% Inputs
%
S is the input speech signal (as vector)
%
%
FS is the sampling frequency (Hz)
%
%
TW is the analysis frame duration (ms)
%
%
TS is the analysis frame shift (ms)
%
%
ALPHA is the preemphasis coefficient
%
%
WINDOW is a analysis window function handle
%
%
R is the frequency range (Hz) for filterbank analysis
%
%
M is the number of filterbank channels
%
%
N is the number of cepstral coefficients
%
(including the 0th coefficient)
%
%
L is the liftering parameter
%
% Outputs
%
CC is a matrix of mel frequency cepstral coefficients
%
(MFCCs) with feature vectors as columns
%
%
FBE is a matrix of filterbank energies
%
with feature vectors as columns

%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%

FRAMES is a matrix of windowed frames


(one frame per column)
Example
Tw = 25;
% analysis frame duration (ms)
Ts = 10;
% analysis frame shift (ms)
alpha = 0.97;
% preemphasis coefficient
R = [ 300 3700 ]; % frequency range to consider
M = 20;
% number of filterbank channels
C = 13;
% number of cepstral coefficients
L = 22;
% cepstral sine lifter parameter
% hamming window (see Eq. (5.2) on p.73 of [1])
hamming = @(N)(0.54-0.46*cos(2*pi*[0:N-1].'/(N-1)));
% Read speech samples, sampling rate and precision from file
[ speech, fs, nbits ] = wavread( 'sp10.wav' );
% Feature extraction (feature vectors as columns)
[ MFCCs, FBEs, frames ] = ...
mfcc( speech, fs, Tw, Ts, alpha, hamming, R, M, C, L

);
%
%
% Plot cepstrum over time
%
figure('Position', [30 100 800 200], 'PaperPositionMode', 'auto', ..
.
%
'color', 'w', 'PaperOrientation', 'landscape', 'Visible', 'on
' );
%
%
imagesc( [1:size(MFCCs,2)], [0:C-1], MFCCs );
%
axis( 'xy' );
%
xlabel( 'Frame index' );
%
ylabel( 'Cepstrum index' );
%
title( 'Mel frequency cepstrum' );
%
% References
%
%
[1] Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D.,
%
Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D.,
%
Valtchev, V., Woodland, P., 2006. The HTK Book (for HTK
%
Version 3.4.1). Engineering Department, Cambridge University.
%
(see also: http://htk.eng.cam.ac.uk)
%
%
[2] Ellis, D., 2005. Reproducing the feature outputs of
%
common programs using Matlab and melfcc.m. url:
%
http://labrosa.ee.columbia.edu/matlab/rastamat/mfccs.html
%
%
[3] Huang, X., Acero, A., Hon, H., 2001. Spoken Language
%
Processing: A guide to theory, algorithm, and system
%
development. Prentice Hall, Upper Saddle River, NJ,
%
USA (pp. 314-315).
%
% See also EXAMPLE, COMPARE, FRAMES2VEC, TRIFBANK.
%

Author: Kamil Wojcicki, September 2011


%% PRELIMINARIES

% Ensure correct number of inputs


if( nargin~= 10 ), help mfcc; return; end;
% Explode samples to the range of 16 bit shorts
if( max(abs(speech))<=1 ), speech = speech * 2^15; end;
Nw = round( 1E-3*Tw*fs );
Ns = round( 1E-3*Ts*fs );

% frame duration (samples)


% frame shift (samples)

nfft = 2^nextpow2( Nw );
K = nfft/2+1;

% length of FFT analysis


% length of the unique part of the FFT

%% HANDY INLINE FUNCTION HANDLES


% Forward and backward mel frequency warping (see Eq. (5.13) on p.76 of [1])
% Note that base 10 is used in [1], while base e is used here and in HTK cod
e
hz2mel = @( hz )( 1127*log(1+hz/700) );
% Hertz to mel warping function
mel2hz = @( mel )( 700*exp(mel/1127)-700 ); % mel to Hertz warping function
% Type III DCT matrix routine (see Eq. (5.14) on p.77 of [1])
dctm = @( N, M )( sqrt(2.0/M) * cos( repmat([0:N-1].',1,M) ...
.* repmat(pi*([1:M]-0.5)/M,N,1) ) );
% Cepstral lifter routine (see Eq. (5.12) on p.75 of [1])
ceplifter = @( N, L )( 1+0.5*L*sin(pi*[0:N-1]/L) );
%% FEATURE EXTRACTION
% Preemphasis filtering (see Eq. (5.1) on p.73 of [1])
speech = filter( [1 -alpha], 1, speech ); % fvtool( [1 -alpha], 1 );
% Framing and windowing (frames as columns)
frames = vec2frames( speech, Nw, Ns, 'cols', window, false );
% Magnitude spectrum computation (as column vectors)
MAG = abs( fft(frames,nfft,1) );
% Triangular filterbank with uniformly spaced filters on mel scale
H = trifbank( M, K, R, fs, hz2mel, mel2hz ); % size of H is M x K
% Filterbank application to unique part of the magnitude spectrum
FBE = H * MAG(1:K,:); % FBE( FBE<1.0 ) = 1.0; % apply mel floor
% DCT matrix computation
DCT = dctm( N, M );
% Conversion of logFBEs to cepstral coefficients through DCT
CC = DCT * log( FBE );
% Cepstral lifter computation
lifter = ceplifter( N, L );
% Cepstral liftering gives liftered cepstral coefficients
CC = diag( lifter ) * CC; % ~ HTK's MFCCs

% EOF

%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%% MFCC EXAMPLE %%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
%
);
%
%
%
.
%
' );
%
%
%
%
%
%

Example
Tw = 25;
% analysis frame duration (ms)
Ts = 10;
% analysis frame shift (ms)
alpha = 0.97;
% preemphasis coefficient
R = [ 300 3700 ]; % frequency range to consider
M = 20;
% number of filterbank channels
C = 13;
% number of cepstral coefficients
L = 22;
% cepstral sine lifter parameter
% hamming window (see Eq. (5.2) on p.73 of [1])
hamming = @(N)(0.54-0.46*cos(2*pi*[0:N-1].'/(N-1)));
% Read speech samples, sampling rate and precision from file
[ speech, fs, nbits ] = wavread( 'sp10.wav' );
% Feature extraction (feature vectors as columns)
[ MFCCs, FBEs, frames ] = ...
mfcc( speech, fs, Tw, Ts, alpha, hamming, R, M, C, L
% Plot cepstrum over time
figure('Position', [30 100 800 200], 'PaperPositionMode', 'auto', ..
'color', 'w', 'PaperOrientation', 'landscape', 'Visible', 'on
imagesc( [1:size(MFCCs,2)], [0:C-1], MFCCs );
axis( 'xy' );
xlabel( 'Frame index' );
ylabel( 'Cepstrum index' );
title( 'Mel frequency cepstrum' );

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%EBOARD CODE FOR MCCF %%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
AI=analoginput('winsound',1);
addchannel(AI,1);
set(AI,'samplerate',16000);
set(AI,'samplespertrigger',2*16000);
set(AI,'triggertype','immediate');
start(AI);
data=getdata(AI);
fs=16000;
%file = sprintf('%s%d.wav','database_new kar',i);
%%wavwrite(data,16000,file);
%% Computing MFCC Co-efficients..
%% (1) Frame Blocking..
N = 256; % N point FFT

M = 100;

% Overlapping

NN = floor(N/2+1); %N/2
nbFrames = ceil((length(data)-N)/M);
Frames = zeros(nbFrames+1,N);
for i = 0:nbFrames-1
temp = data(i*M+1:i*M+N);
Frames(i+1,1:N) = temp;
end
% Last Frame..
temp = zeros(1,N);
lastLength = length(data)- nbFrames*M;
temp(1:lastLength) = data(nbFrames*M+1:(nbFrames*M +1 + lastLength-1));
Frames(nbFrames+1, 1:N) = temp;
%% (2) Windowing..
frameSize = size(Frames);
nbFrames = frameSize(1);
nbSamples = frameSize(2);
% Hamming window..
w = hamming(nbSamples);
Windows = zeros(nbFrames,nbSamples);
for i = 1:nbFrames
temp = Frames(i,1:nbSamples);
Windows(i, 1:nbSamples) = w'.*temp;
end
%% (3) Fourier Transform..
ffts = fft(Windows');
%% (4) Mel-frequency Wrapping..
% (a) Calculate Power spectrum..
PowSpecs = abs(ffts).^2;
PowSpecs = PowSpecs(1:NN-1,:);
% (b) Mel filter generation
nof_c = 20; % Number of channels..
df = fs/N;
Nmax = N/2;
fmax = fs/2;
% Convert to mel scale..
melmax = 2595*log10(1+fmax/700);
melinc = melmax/(nof_c+1);
melcenters = (1:nof_c).*melinc;
% Convert to frequency scale..
fcenters = 700*((10.^(melcenters./2595))-1);
centerf = round(fcenters./df);
startf = [1,centerf(1:nof_c-1)];
stopf = [centerf(2:nof_c),Nmax];
W = zeros(nof_c,Nmax);
% Making filter..
for i = 1:nof_c
increment = 1.0/(centerf(i)-startf(i));
for j = startf(i):centerf(i)

W(i,j) = (j-startf(i))*increment;
end
decrement = 1.0/(stopf(i)-centerf(i));
for j = centerf(i):stopf(i)
W(i,j) = (j-centerf(i))*decrement;
end
end
% Normalising..
for i = 1:nof_c
W(i,:) = W(i,:)/sum(W(i,:));
end
% (c) Apply mel filters to Power spectrum coeffs..
melPowSpecs = W*PowSpecs;
% (d) MFCC calculations..
melCeps = dct(log(melPowSpecs));
melCeps(1,:) = [];
load CodeBook
dist_min = inf;
spkr = 0;
for ind = 1:length(codebook)
single_cb = codebook{ind};
[M1, N1] = size(melCeps);
[M2, N2] = size(single_cb);
dist_temp = zeros(N1,N2);
if N1<N2
repli = zeros(1,N2);
for n1 = 1:N1
dist_temp(n1,:) = sum((melCeps(:,n1+repli) - single_cb).^2,1);
end
else
repli = zeros(1,N1);
for n2 = 1:N2
dist_temp(:,n2) = sum((melCeps - single_cb(:,n2+repli)).^2,1);
end
end
dist_temp = sqrt(dist_temp);
dist_val(ind) = sum(min(dist_temp,[],2))/size(dist_temp,1);
if dist_val(ind) < dist_min
dist_min = dist_val(ind);
spkr = ind;
end
end
msg = sprintf('The Speaker is found);
disp(msg);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
AI=analoginput('winsound',1);
addchannel(AI,1);
set(AI,'samplerate',16000);
set(AI,'samplespertrigger',2*16000);
set(AI,'triggertype','immediate');
start(AI);
data=getdata(AI);
fs=16000;

%file = sprintf('%s%d.wav','database_new kar',i);


%%wavwrite(data,16000,file);
%% Computing MFCC Co-efficients..
%% (1) Frame Blocking..
N = 256; % N point FFT
M = 100; % Overlapping
NN = floor(N/2+1); %N/2
nbFrames = ceil((length(data)-N)/M);
Frames = zeros(nbFrames+1,N);
for i = 0:nbFrames-1
temp = data(i*M+1:i*M+N);
Frames(i+1,1:N) = temp;
end
% Last Frame..
temp = zeros(1,N);
lastLength = length(data)- nbFrames*M;
temp(1:lastLength) = data(nbFrames*M+1:(nbFrames*M +1 + lastLength-1));
Frames(nbFrames+1, 1:N) = temp;
%% (2) Windowing..
frameSize = size(Frames);
nbFrames = frameSize(1);
nbSamples = frameSize(2);
% Hamming window..
w = hamming(nbSamples);
Windows = zeros(nbFrames,nbSamples);
for i = 1:nbFrames
temp = Frames(i,1:nbSamples);
Windows(i, 1:nbSamples) = w'.*temp;
end
%% (3) Fourier Transform..
ffts = fft(Windows');
%% (4) Mel-frequency Wrapping..
% (a) Calculate Power spectrum..
PowSpecs = abs(ffts).^2;
PowSpecs = PowSpecs(1:NN-1,:);
% (b) Mel filter generation
nof_c = 20; % Number of channels..
df = fs/N;
Nmax = N/2;
fmax = fs/2;
% Convert to mel scale..
melmax = 2595*log10(1+fmax/700);
melinc = melmax/(nof_c+1);
melcenters = (1:nof_c).*melinc;
% Convert to frequency scale..
fcenters = 700*((10.^(melcenters./2595))-1);
centerf = round(fcenters./df);
startf = [1,centerf(1:nof_c-1)];
stopf = [centerf(2:nof_c),Nmax];

W = zeros(nof_c,Nmax);
% Making filter..
for i = 1:nof_c
increment = 1.0/(centerf(i)-startf(i));
for j = startf(i):centerf(i)
W(i,j) = (j-startf(i))*increment;
end
decrement = 1.0/(stopf(i)-centerf(i));
for j = centerf(i):stopf(i)
W(i,j) = (j-centerf(i))*decrement;
end
end
% Normalising..
for i = 1:nof_c
W(i,:) = W(i,:)/sum(W(i,:));
end
% (c) Apply mel filters to Power spectrum coeffs..
melPowSpecs = W*PowSpecs;
% (d) MFCC calculations..
melCeps = dct(log(melPowSpecs));
melCeps(1,:) = [];
load CodeBook
dist_min = inf;
spkr = 0;
for ind = 1:length(codebook)
single_cb = codebook{ind};
[M1, N1] = size(melCeps);
[M2, N2] = size(single_cb);
dist_temp = zeros(N1,N2);
if N1<N2
repli = zeros(1,N2);
for n1 = 1:N1
dist_temp(n1,:) = sum((melCeps(:,n1+repli) - single_cb).^2,1);
end
else
repli = zeros(1,N1);
for n2 = 1:N2
dist_temp(:,n2) = sum((melCeps - single_cb(:,n2+repli)).^2,1);
end
end
dist_temp = sqrt(dist_temp);
dist_val(ind) = sum(min(dist_temp,[],2))/size(dist_temp,1);
if dist_val(ind) < dist_min
dist_min = dist_val(ind);
spkr = ind;
end
end
msg = sprintf('The Speaker is found);
disp(msg);

You might also like