
Speech Emotion Recognition

I am going to build a speech emotion detection classifier.

But first, what is Speech Emotion Recognition (SER), and why are we building this project?
• Speech Emotion Recognition, abbreviated as SER, is the task of recognizing human emotion and affective states from speech. It capitalizes on the fact that the voice often reflects the underlying emotion through tone and pitch; this is also the phenomenon that animals like dogs and horses rely on to understand human emotion.

Why do we need it?

1. Emotion recognition is a part of speech processing that is gaining popularity, and the need for it is increasing enormously. Although there are machine learning techniques for recognizing emotion, this project attempts to use deep learning to recognize emotions from the data.

2. SER (Speech Emotion Recognition) is used in call centers to classify calls according to emotion; it can serve as a performance parameter for conversational analysis, identifying unsatisfied customers, measuring customer satisfaction, and so on, to help companies improve their services.

3. It can also be used in in-car onboard systems: information about the driver's mental state can be provided to the system so it can initiate safety measures and prevent accidents.

Datasets used in this project


• Crowd-sourced Emotional Multimodal Actors Dataset (Crema-D)
• Ryerson Audio-Visual Database of Emotional Speech and Song (Ravdess)
• Surrey Audio-Visual Expressed Emotion (Savee)
• Toronto emotional speech set (Tess)

Importing Libraries
import pandas as pd
import numpy as np

import os
import sys

# librosa is a Python library for analyzing audio and music.
# We will use it to extract data from the audio files later.
import librosa
import librosa.display
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# to play the audio files
from IPython.display import Audio

import keras
from keras.callbacks import ReduceLROnPlateau
from keras.models import Sequential
from keras.layers import Dense, Conv1D, MaxPooling1D, Flatten, Dropout, BatchNormalization
from keras.utils import np_utils, to_categorical
from keras.callbacks import ModelCheckpoint

import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)

Using TensorFlow backend.

Data Preparation
• As we are working with four different datasets, I will create a single dataframe that stores every file's emotion label together with its path.
• We will use this dataframe to extract features for our model training.
# Paths for data.
Ravdess = "/kaggle/input/ravdess-emotional-speech-audio/audio_speech_actors_01-24/"
Crema = "/kaggle/input/cremad/AudioWAV/"
Tess = "/kaggle/input/toronto-emotional-speech-set-tess/tess toronto emotional speech set data/TESS Toronto emotional speech set data/"
Savee = "/kaggle/input/surrey-audiovisual-expressed-emotion-savee/ALL/"

1. Ravdess Dataframe
Here are the filename identifiers as per the official RAVDESS website:

• Modality (01 = full-AV, 02 = video-only, 03 = audio-only).


• Vocal channel (01 = speech, 02 = song).
• Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 =
disgust, 08 = surprised).
• Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the
'neutral' emotion.
• Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
• Repetition (01 = 1st repetition, 02 = 2nd repetition).
• Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

So, here's an example filename: 02-01-06-01-02-01-12.mp4. The metadata encoded in this filename is:

• Video-only (02)
• Speech (01)
• Fearful (06)
• Normal intensity (01)
• Statement "dogs" (02)
• 1st Repetition (01)
• 12th Actor (12) - Female (as the actor ID number is even)
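For illustration, here is a minimal sketch that decodes all seven fields of a RAVDESS filename. The mapping dictionaries are transcribed from the identifier list above; the function name is just for this example and is not used elsewhere in the notebook.

def decode_ravdess_filename(filename):
    # decode the seven RAVDESS identifiers from a name like '03-01-06-01-02-01-12.wav'
    modality = {'01': 'full-AV', '02': 'video-only', '03': 'audio-only'}
    vocal_channel = {'01': 'speech', '02': 'song'}
    emotion = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
               '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}
    intensity = {'01': 'normal', '02': 'strong'}
    statement = {'01': 'Kids are talking by the door', '02': 'Dogs are sitting by the door'}

    parts = filename.split('.')[0].split('-')
    actor_id = int(parts[6])
    return {
        'modality': modality[parts[0]],
        'vocal channel': vocal_channel[parts[1]],
        'emotion': emotion[parts[2]],
        'intensity': intensity[parts[3]],
        'statement': statement[parts[4]],
        'repetition': int(parts[5]),
        'actor': actor_id,
        'gender': 'female' if actor_id % 2 == 0 else 'male',
    }

The dataframe-building code below only needs the emotion field (the third identifier), so it parses just that.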
ravdess_directory_list = os.listdir(Ravdess)

file_emotion = []
file_path = []
for dir in ravdess_directory_list:
    # as there are 24 different actor directories, we need to extract the files for each actor.
    actor = os.listdir(Ravdess + dir)
    for file in actor:
        part = file.split('.')[0]
        part = part.split('-')
        # the third part of each filename represents the emotion associated with that file.
        file_emotion.append(int(part[2]))
        file_path.append(Ravdess + dir + '/' + file)

# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Ravdess_df = pd.concat([emotion_df, path_df], axis=1)

# changing integers to actual emotions.
Ravdess_df.Emotions.replace({1:'neutral', 2:'calm', 3:'happy', 4:'sad',
                             5:'angry', 6:'fear', 7:'disgust', 8:'surprise'}, inplace=True)
Ravdess_df.head()
Emotions Path
0 surprise /kaggle/input/ravdess-emotional-speech-audio/a...
1 angry /kaggle/input/ravdess-emotional-speech-audio/a...
2 calm /kaggle/input/ravdess-emotional-speech-audio/a...
3 disgust /kaggle/input/ravdess-emotional-speech-audio/a...
4 sad /kaggle/input/ravdess-emotional-speech-audio/a...

2. Crema DataFrame
crema_directory_list = os.listdir(Crema)

file_emotion = []
file_path = []

for file in crema_directory_list:
    # storing file paths
    file_path.append(Crema + file)
    # storing file emotions
    part = file.split('_')
    if part[2] == 'SAD':
        file_emotion.append('sad')
    elif part[2] == 'ANG':
        file_emotion.append('angry')
    elif part[2] == 'DIS':
        file_emotion.append('disgust')
    elif part[2] == 'FEA':
        file_emotion.append('fear')
    elif part[2] == 'HAP':
        file_emotion.append('happy')
    elif part[2] == 'NEU':
        file_emotion.append('neutral')
    else:
        file_emotion.append('Unknown')

# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Crema_df = pd.concat([emotion_df, path_df], axis=1)
Crema_df.head()

Emotions Path
0 angry /kaggle/input/cremad/AudioWAV/1049_WSI_ANG_XX.wav
1 angry /kaggle/input/cremad/AudioWAV/1082_IWW_ANG_XX.wav
2 fear /kaggle/input/cremad/AudioWAV/1021_ITS_FEA_XX.wav
3 angry /kaggle/input/cremad/AudioWAV/1086_ITS_ANG_XX.wav
4 disgust /kaggle/input/cremad/AudioWAV/1026_ITS_DIS_XX.wav
3. TESS dataset
tess_directory_list = os.listdir(Tess)

file_emotion = []
file_path = []

for dir in tess_directory_list:
    directories = os.listdir(Tess + dir)
    for file in directories:
        part = file.split('.')[0]
        part = part.split('_')[2]
        if part == 'ps':
            file_emotion.append('surprise')
        else:
            file_emotion.append(part)
        file_path.append(Tess + dir + '/' + file)

# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Tess_df = pd.concat([emotion_df, path_df], axis=1)
Tess_df.head()

Emotions Path
0 sad /kaggle/input/toronto-emotional-speech-set-tes...
1 sad /kaggle/input/toronto-emotional-speech-set-tes...
2 sad /kaggle/input/toronto-emotional-speech-set-tes...
3 sad /kaggle/input/toronto-emotional-speech-set-tes...
4 sad /kaggle/input/toronto-emotional-speech-set-tes...

4. SAVEE dataset
The audio files in this dataset are named such that the letter prefix (after the speaker initials) describes the emotion class as follows:

• 'a' = 'anger'
• 'd' = 'disgust'
• 'f' = 'fear'
• 'h' = 'happiness'
• 'n' = 'neutral'
• 'sa' = 'sadness'
• 'su' = 'surprise'
savee_directory_list = os.listdir(Savee)

file_emotion = []
file_path = []

for file in savee_directory_list:
    file_path.append(Savee + file)
    part = file.split('_')[1]
    ele = part[:-6]
    if ele == 'a':
        file_emotion.append('angry')
    elif ele == 'd':
        file_emotion.append('disgust')
    elif ele == 'f':
        file_emotion.append('fear')
    elif ele == 'h':
        file_emotion.append('happy')
    elif ele == 'n':
        file_emotion.append('neutral')
    elif ele == 'sa':
        file_emotion.append('sad')
    else:
        file_emotion.append('surprise')

# dataframe for emotion of files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])

# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Savee_df = pd.concat([emotion_df, path_df], axis=1)
Savee_df.head()

Emotions Path
0 surprise /kaggle/input/surrey-audiovisual-expressed-emo...
1 disgust /kaggle/input/surrey-audiovisual-expressed-emo...
2 neutral /kaggle/input/surrey-audiovisual-expressed-emo...
3 disgust /kaggle/input/surrey-audiovisual-expressed-emo...
4 angry /kaggle/input/surrey-audiovisual-expressed-emo...

# creating a combined dataframe using all 4 dataframes we created so far.
data_path = pd.concat([Ravdess_df, Crema_df, Tess_df, Savee_df], axis=0)
data_path.to_csv("data_path.csv", index=False)
data_path.head()

Emotions Path
0 surprise /kaggle/input/ravdess-emotional-speech-audio/a...
1 angry /kaggle/input/ravdess-emotional-speech-audio/a...
2 calm /kaggle/input/ravdess-emotional-speech-audio/a...
3 disgust /kaggle/input/ravdess-emotional-speech-audio/a...
4 sad /kaggle/input/ravdess-emotional-speech-audio/a...
Data Visualisation and Exploration
First, let's plot the count of each emotion in our dataset.

plt.title('Count of Emotions', size=16)
sns.countplot(data_path.Emotions)
plt.ylabel('Count', size=12)
plt.xlabel('Emotions', size=12)
sns.despine(top=True, right=True, left=False, bottom=False)
plt.show()

We can also plot waveplots and spectrograms for the audio signals.

• Waveplots - Waveplots let us know the loudness of the audio at a given time.
• Spectrograms - A spectrogram is a visual representation of the spectrum of frequencies of a sound or other signal as it varies with time, i.e. a representation of frequencies changing with respect to time for a given audio/music signal.
def create_waveplot(data, sr, e):
    plt.figure(figsize=(10, 3))
    plt.title('Waveplot for audio with {} emotion'.format(e), size=15)
    librosa.display.waveplot(data, sr=sr)
    plt.show()

def create_spectrogram(data, sr, e):
    # the stft function converts the data into a short-time Fourier transform
    X = librosa.stft(data)
    Xdb = librosa.amplitude_to_db(abs(X))
    plt.figure(figsize=(12, 3))
    plt.title('Spectrogram for audio with {} emotion'.format(e), size=15)
    librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
    #librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar()

emotion='fear'
path = np.array(data_path.Path[data_path.Emotions==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)

<IPython.lib.display.Audio object>
emotion='angry'
path = np.array(data_path.Path[data_path.Emotions==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)

<IPython.lib.display.Audio object>

emotion='sad'
path = np.array(data_path.Path[data_path.Emotions==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)
<IPython.lib.display.Audio object>

emotion='happy'
path = np.array(data_path.Path[data_path.Emotions==emotion])[1]
data, sampling_rate = librosa.load(path)
create_waveplot(data, sampling_rate, emotion)
create_spectrogram(data, sampling_rate, emotion)
Audio(path)
<IPython.lib.display.Audio object>

Data Augmentation
• Data augmentation is the process by which we create new synthetic data samples by adding small perturbations to our initial training set.
• To generate synthetic data for audio, we can apply noise injection, time shifting, and changes in pitch and speed.
• The objective is to make our model invariant to those perturbations and enhance its ability to generalize.
• For this to work, the added perturbations must preserve the same label as the original training sample.
• For images, data augmentation can be performed by shifting, zooming, rotating, and so on.

First, let's check which augmentation techniques work best for our dataset.

def noise(data):
    noise_amp = 0.035*np.random.uniform()*np.amax(data)
    data = data + noise_amp*np.random.normal(size=data.shape[0])
    return data

def stretch(data, rate=0.8):
    return librosa.effects.time_stretch(data, rate)

def shift(data):
    shift_range = int(np.random.uniform(low=-5, high=5)*1000)
    return np.roll(data, shift_range)

def pitch(data, sampling_rate, pitch_factor=0.7):
    return librosa.effects.pitch_shift(data, sampling_rate, pitch_factor)

# taking an example file and checking the techniques on it.
path = np.array(data_path.Path)[1]
data, sample_rate = librosa.load(path)
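A quick compatibility note: this notebook runs on an older librosa/Keras stack (see the "Using TensorFlow backend." message above). If your environment has a newer librosa (0.10+, which is an assumption about your setup rather than something used here), the same helpers need keyword arguments and waveshow replaces waveplot:

# equivalent calls for librosa >= 0.10 (not used in this notebook)
# librosa.effects.time_stretch(data, rate=0.8)
# librosa.effects.pitch_shift(data, sr=sample_rate, n_steps=0.7)
# librosa.display.waveshow(data, sr=sample_rate)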

1. Simple Audio
plt.figure(figsize=(14,4))
librosa.display.waveplot(y=data, sr=sample_rate)
Audio(path)

<IPython.lib.display.Audio object>

2. Noise Injection
x = noise(data)
plt.figure(figsize=(14,4))
librosa.display.waveplot(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

<IPython.lib.display.Audio object>
We can hear that noise injection preserves the emotion while making the signal noticeably different, so it is a useful augmentation technique: it gives the model more varied training data and helps ensure it does not overfit.

3. Stretching
x = stretch(data)
plt.figure(figsize=(14,4))
librosa.display.waveplot(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

<IPython.lib.display.Audio object>

4. Shifting
x = shift(data)
plt.figure(figsize=(14,4))
librosa.display.waveplot(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

<IPython.lib.display.Audio object>
5. Pitch
x = pitch(data, sample_rate)
plt.figure(figsize=(14,4))
librosa.display.waveplot(y=x, sr=sample_rate)
Audio(x, rate=sample_rate)

<IPython.lib.display.Audio object>

• From the augmentation techniques above, I am using noise injection, stretching (i.e. changing speed), and pitch shifting.

Feature Extraction
• Feature extraction is a very important part of analyzing audio and finding relations between different signals. As we already know, raw audio data cannot be understood by models directly, so we need to convert it into an understandable format, which is what feature extraction is for.

The audio signal is a three-dimensional signal in which the three axes represent time, amplitude and frequency.
I am no expert on audio signals and feature extraction from audio files, so I searched and found a very good blog on feature extraction written by Askash Mallik.

As stated there, with the help of the sample rate and the sample data, one can perform several transformations to extract valuable features; a short librosa sketch of a few of these follows the list.

1. Zero Crossing Rate : The rate of sign-changes of the signal during the duration of a
particular frame.
2. Energy : The sum of squares of the signal values, normalized by the respective frame
length.
3. Entropy of Energy : The entropy of sub-frames’ normalized energies. It can be
interpreted as a measure of abrupt changes.
4. Spectral Centroid : The center of gravity of the spectrum.
5. Spectral Spread : The second central moment of the spectrum.
6. Spectral Entropy : Entropy of the normalized spectral energies for a set of sub-frames.
7. Spectral Flux : The squared difference between the normalized magnitudes of the
spectra of the two successive frames.
8. Spectral Rolloff : The frequency below which 90% of the magnitude distribution of the
spectrum is concentrated.
9. MFCCs : Mel-Frequency Cepstral Coefficients form a cepstral representation where the frequency bands are not linear but distributed according to the mel scale.
10. Chroma Vector : A 12-element representation of the spectral energy where the bins
represent the 12 equal-tempered pitch classes of western-type music (semitone
spacing).
11. Chroma Deviation : The standard deviation of the 12 chroma coefficients.
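For reference, several of the features above map directly onto librosa.feature calls. A minimal sketch, reusing the data and sample_rate loaded in the augmentation example; the frame parameters are librosa defaults, not values chosen for this project:

# a few of the listed features, computed frame-by-frame with librosa defaults
zcr        = librosa.feature.zero_crossing_rate(y=data)                  # zero crossing rate
rms        = librosa.feature.rms(y=data)                                 # frame energy (as RMS)
centroid   = librosa.feature.spectral_centroid(y=data, sr=sample_rate)   # spectral centroid
rolloff    = librosa.feature.spectral_rolloff(y=data, sr=sample_rate,
                                              roll_percent=0.90)         # spectral rolloff (90%)
mfccs      = librosa.feature.mfcc(y=data, sr=sample_rate, n_mfcc=20)     # MFCCs
chroma     = librosa.feature.chroma_stft(y=data, sr=sample_rate)         # chroma vector (12 bins)
chroma_dev = chroma.std(axis=0)                                          # chroma deviation per frame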

In this project I am not going deep into the feature selection process to check which features work best for our dataset; rather, I am only extracting 5 features to train our model:

• Zero Crossing Rate
• Chroma_stft
• MFCC
• RMS (root mean square) value
• MelSpectrogram
def extract_features(data):
    # ZCR
    result = np.array([])
    zcr = np.mean(librosa.feature.zero_crossing_rate(y=data).T, axis=0)
    result = np.hstack((result, zcr))  # stacking horizontally

    # Chroma_stft
    stft = np.abs(librosa.stft(data))
    chroma_stft = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
    result = np.hstack((result, chroma_stft))  # stacking horizontally

    # MFCC
    mfcc = np.mean(librosa.feature.mfcc(y=data, sr=sample_rate).T, axis=0)
    result = np.hstack((result, mfcc))  # stacking horizontally

    # Root Mean Square Value
    rms = np.mean(librosa.feature.rms(y=data).T, axis=0)
    result = np.hstack((result, rms))  # stacking horizontally

    # MelSpectrogram
    mel = np.mean(librosa.feature.melspectrogram(y=data, sr=sample_rate).T, axis=0)
    result = np.hstack((result, mel))  # stacking horizontally

    return result

def get_features(path):
    # duration and offset are used to trim the silence at the start and
    # end of each audio file, as seen in the waveplots above.
    data, sample_rate = librosa.load(path, duration=2.5, offset=0.6)

    # without augmentation
    res1 = extract_features(data)
    result = np.array(res1)

    # data with noise
    noise_data = noise(data)
    res2 = extract_features(noise_data)
    result = np.vstack((result, res2))  # stacking vertically

    # data with stretching and pitching
    new_data = stretch(data)
    data_stretch_pitch = pitch(new_data, sample_rate)
    res3 = extract_features(data_stretch_pitch)
    result = np.vstack((result, res3))  # stacking vertically

    return result

X, Y = [], []
for path, emotion in zip(data_path.Path, data_path.Emotions):
    feature = get_features(path)
    for ele in feature:
        X.append(ele)
        # appending the emotion 3 times, as get_features returns 3 feature vectors
        # per audio file (original, noise-injected, stretched + pitch-shifted).
        Y.append(emotion)

len(X), len(Y), data_path.Path.shape

(36486, 36486, (12162,))

Features = pd.DataFrame(X)
Features['labels'] = Y
Features.to_csv('features.csv', index=False)
Features.head()

          0         1         2         3         4         5         6  \
0  0.185239  0.585543  0.541992  0.555859  0.615102  0.599604  0.652054
1  0.302097  0.748427  0.716290  0.740596  0.802801  0.760048  0.693101
2  0.147298  0.646143  0.595935  0.561826  0.547853  0.612391  0.561209
3  0.199350  0.517106  0.521565  0.508298  0.564973  0.626469  0.698655
4  0.296762  0.653405  0.640598  0.633179  0.681640  0.741104  0.730206

          7         8         9  ...       153       154       155       156  \
0  0.691854  0.766230  0.791168  ...  0.002888  0.001964  0.001590  0.002071
1  0.699719  0.734826  0.753985  ...  0.003670  0.002759  0.002363  0.003003
2  0.622703  0.689758  0.756473  ...  0.001020  0.000665  0.000617  0.000406
3  0.668579  0.603630  0.621905  ...  0.052493  0.048467  0.046119  0.036382
4  0.660096  0.651581  0.663689  ...  0.083794  0.079053  0.073813  0.065715

        157       158       159       160       161    labels
0  0.002255  0.002727  0.001520  0.000461  0.000038  surprise
1  0.003083  0.003557  0.002395  0.001345  0.000886  surprise
2  0.000478  0.000603  0.000401  0.000094  0.000007  surprise
3  0.041288  0.027275  0.024452  0.006556  0.000462     angry
4  0.066659  0.054817  0.055254  0.036077  0.028982     angry

[5 rows x 163 columns]

• We have applied data augmentation, extracted the features for each audio file, and saved them to features.csv. Each row holds 162 feature values (1 ZCR + 12 chroma + 20 MFCC + 1 RMS + 128 mel bands, all librosa defaults) plus a label.
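Because the features are persisted to features.csv, a later session can reload them instead of re-running the slow extraction loop. A minimal sketch, assuming the CSV written above:

# reload previously extracted features instead of recomputing them
Features = pd.read_csv('features.csv')
print(Features.shape)  # expected: (36486, 163) -> 162 feature columns + 'labels'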

Data Preparation
• Now that we have extracted the features, we need to normalize the data and split it into training and testing sets.
X = Features.iloc[:, :-1].values
Y = Features['labels'].values

# As this is a multiclass classification problem, one-hot encode Y.
encoder = OneHotEncoder()
Y = encoder.fit_transform(np.array(Y).reshape(-1, 1)).toarray()

# splitting data
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=0, shuffle=True)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((27364, 162), (27364, 8), (9122, 162), (9122, 8))

# scaling our data with sklearn's StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((27364, 162), (27364, 8), (9122, 162), (9122, 8))

# making our data compatible with the model: Conv1D expects 3D input of
# shape (samples, timesteps, channels), so we add a trailing channel axis.
x_train = np.expand_dims(x_train, axis=2)
x_test = np.expand_dims(x_test, axis=2)
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((27364, 162, 1), (27364, 8), (9122, 162, 1), (9122, 8))

Modelling
model = Sequential()
model.add(Conv1D(256, kernel_size=5, strides=1, padding='same',
                 activation='relu', input_shape=(x_train.shape[1], 1)))
model.add(MaxPooling1D(pool_size=5, strides=2, padding='same'))

model.add(Conv1D(256, kernel_size=5, strides=1, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=5, strides=2, padding='same'))

model.add(Conv1D(128, kernel_size=5, strides=1, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=5, strides=2, padding='same'))
model.add(Dropout(0.2))

model.add(Conv1D(64, kernel_size=5, strides=1, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=5, strides=2, padding='same'))

model.add(Flatten())
model.add(Dense(units=32, activation='relu'))
model.add(Dropout(0.3))

model.add(Dense(units=8, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_28 (Conv1D) (None, 162, 256) 1536
_________________________________________________________________
max_pooling1d_28 (MaxPooling (None, 81, 256) 0
_________________________________________________________________
conv1d_29 (Conv1D) (None, 81, 256) 327936
_________________________________________________________________
max_pooling1d_29 (MaxPooling (None, 41, 256) 0
_________________________________________________________________
conv1d_30 (Conv1D) (None, 41, 128) 163968
_________________________________________________________________
max_pooling1d_30 (MaxPooling (None, 21, 128) 0
_________________________________________________________________
dropout_13 (Dropout) (None, 21, 128) 0
_________________________________________________________________
conv1d_31 (Conv1D) (None, 21, 64) 41024
_________________________________________________________________
max_pooling1d_31 (MaxPooling (None, 11, 64) 0
_________________________________________________________________
flatten_7 (Flatten) (None, 704) 0
_________________________________________________________________
dense_13 (Dense) (None, 32) 22560
_________________________________________________________________
dropout_14 (Dropout) (None, 32) 0
_________________________________________________________________
dense_14 (Dense) (None, 8) 264
=================================================================
Total params: 557,288
Trainable params: 557,288
Non-trainable params: 0
_________________________________________________________________

rlrp = ReduceLROnPlateau(monitor='loss', factor=0.4, verbose=0, patience=2, min_lr=0.0000001)
history = model.fit(x_train, y_train, batch_size=64, epochs=50,
                    validation_data=(x_test, y_test), callbacks=[rlrp])

Train on 27364 samples, validate on 9122 samples


Epoch 1/50
27364/27364 [==============================] - 5s 183us/step - loss:
1.6819 - accuracy: 0.3212 - val_loss: 1.4272 - val_accuracy: 0.4257
Epoch 2/50
27364/27364 [==============================] - 4s 163us/step - loss:
1.4340 - accuracy: 0.4279 - val_loss: 1.2990 - val_accuracy: 0.4752
Epoch 3/50
27364/27364 [==============================] - 4s 161us/step - loss:
1.3356 - accuracy: 0.4637 - val_loss: 1.2498 - val_accuracy: 0.5007
Epoch 4/50
27364/27364 [==============================] - 5s 168us/step - loss:
1.2843 - accuracy: 0.4928 - val_loss: 1.2138 - val_accuracy: 0.5027
Epoch 5/50
27364/27364 [==============================] - 4s 157us/step - loss:
1.2453 - accuracy: 0.5074 - val_loss: 1.1987 - val_accuracy: 0.5180
Epoch 6/50
27364/27364 [==============================] - 4s 163us/step - loss:
1.2134 - accuracy: 0.5164 - val_loss: 1.1540 - val_accuracy: 0.5406
Epoch 7/50
27364/27364 [==============================] - 4s 160us/step - loss:
1.1919 - accuracy: 0.5259 - val_loss: 1.1355 - val_accuracy: 0.5478
Epoch 8/50
27364/27364 [==============================] - 4s 161us/step - loss:
1.1666 - accuracy: 0.5315 - val_loss: 1.1249 - val_accuracy: 0.5466
Epoch 9/50
27364/27364 [==============================] - 5s 167us/step - loss:
1.1485 - accuracy: 0.5471 - val_loss: 1.1051 - val_accuracy: 0.5597
Epoch 10/50
27364/27364 [==============================] - 4s 159us/step - loss:
1.1272 - accuracy: 0.5506 - val_loss: 1.0989 - val_accuracy: 0.5628
Epoch 11/50
27364/27364 [==============================] - 5s 165us/step - loss:
1.1107 - accuracy: 0.5585 - val_loss: 1.0926 - val_accuracy: 0.5624
Epoch 12/50
27364/27364 [==============================] - 4s 163us/step - loss:
1.0959 - accuracy: 0.5676 - val_loss: 1.0999 - val_accuracy: 0.5573
Epoch 13/50
27364/27364 [==============================] - 5s 167us/step - loss:
1.0705 - accuracy: 0.5795 - val_loss: 1.0771 - val_accuracy: 0.5733
Epoch 14/50
27364/27364 [==============================] - 5s 170us/step - loss:
1.0616 - accuracy: 0.5799 - val_loss: 1.0927 - val_accuracy: 0.5704
Epoch 15/50
27364/27364 [==============================] - 4s 163us/step - loss:
1.0489 - accuracy: 0.5827 - val_loss: 1.0806 - val_accuracy: 0.5705
Epoch 16/50
27364/27364 [==============================] - 5s 165us/step - loss:
1.0354 - accuracy: 0.5871 - val_loss: 1.0807 - val_accuracy: 0.5725
Epoch 17/50
27364/27364 [==============================] - 4s 158us/step - loss:
1.0296 - accuracy: 0.5949 - val_loss: 1.0748 - val_accuracy: 0.5714
Epoch 18/50
27364/27364 [==============================] - 4s 157us/step - loss:
1.0165 - accuracy: 0.5972 - val_loss: 1.0925 - val_accuracy: 0.5640
Epoch 19/50
27364/27364 [==============================] - 5s 166us/step - loss:
0.9998 - accuracy: 0.6032 - val_loss: 1.0641 - val_accuracy: 0.5839
Epoch 20/50
27364/27364 [==============================] - 4s 156us/step - loss:
0.9847 - accuracy: 0.6125 - val_loss: 1.0481 - val_accuracy: 0.5858
Epoch 21/50
27364/27364 [==============================] - 4s 162us/step - loss:
0.9676 - accuracy: 0.6191 - val_loss: 1.0409 - val_accuracy: 0.5906
Epoch 22/50
27364/27364 [==============================] - 4s 156us/step - loss:
0.9564 - accuracy: 0.6208 - val_loss: 1.0426 - val_accuracy: 0.5932
Epoch 23/50
27364/27364 [==============================] - 4s 158us/step - loss:
0.9543 - accuracy: 0.6243 - val_loss: 1.0587 - val_accuracy: 0.5843
Epoch 24/50
27364/27364 [==============================] - 4s 164us/step - loss:
0.9350 - accuracy: 0.6343 - val_loss: 1.0419 - val_accuracy: 0.5904
Epoch 25/50
27364/27364 [==============================] - 4s 157us/step - loss:
0.9309 - accuracy: 0.6357 - val_loss: 1.0336 - val_accuracy: 0.5944
Epoch 26/50
27364/27364 [==============================] - 5s 170us/step - loss:
0.9138 - accuracy: 0.6396 - val_loss: 1.0423 - val_accuracy: 0.5932
Epoch 27/50
27364/27364 [==============================] - 5s 175us/step - loss:
0.9034 - accuracy: 0.6453 - val_loss: 1.0417 - val_accuracy: 0.5897
Epoch 28/50
27364/27364 [==============================] - 4s 163us/step - loss:
0.8986 - accuracy: 0.6473 - val_loss: 1.0485 - val_accuracy: 0.5930
Epoch 29/50
27364/27364 [==============================] - 4s 164us/step - loss:
0.8902 - accuracy: 0.6540 - val_loss: 1.0569 - val_accuracy: 0.5923
Epoch 30/50
27364/27364 [==============================] - 4s 157us/step - loss:
0.8712 - accuracy: 0.6599 - val_loss: 1.0322 - val_accuracy: 0.5996
Epoch 31/50
27364/27364 [==============================] - 5s 165us/step - loss:
0.8532 - accuracy: 0.6650 - val_loss: 1.0550 - val_accuracy: 0.5980
Epoch 32/50
27364/27364 [==============================] - 4s 156us/step - loss:
0.8493 - accuracy: 0.6661 - val_loss: 1.0356 - val_accuracy: 0.6074
Epoch 33/50
27364/27364 [==============================] - 4s 158us/step - loss:
0.8400 - accuracy: 0.6717 - val_loss: 1.0364 - val_accuracy: 0.6039
Epoch 34/50
27364/27364 [==============================] - 5s 165us/step - loss:
0.8331 - accuracy: 0.6750 - val_loss: 1.0686 - val_accuracy: 0.5956
Epoch 35/50
27364/27364 [==============================] - 4s 158us/step - loss:
0.8321 - accuracy: 0.6778 - val_loss: 1.0696 - val_accuracy: 0.6021
Epoch 36/50
27364/27364 [==============================] - 4s 161us/step - loss:
0.8138 - accuracy: 0.6814 - val_loss: 1.0885 - val_accuracy: 0.6040
Epoch 37/50
27364/27364 [==============================] - 4s 157us/step - loss:
0.8073 - accuracy: 0.6862 - val_loss: 1.0557 - val_accuracy: 0.6053
Epoch 38/50
27364/27364 [==============================] - 4s 156us/step - loss:
0.7902 - accuracy: 0.6902 - val_loss: 1.0833 - val_accuracy: 0.6030
Epoch 39/50
27364/27364 [==============================] - 4s 161us/step - loss:
0.7845 - accuracy: 0.6932 - val_loss: 1.0551 - val_accuracy: 0.6115
Epoch 40/50
27364/27364 [==============================] - 5s 171us/step - loss:
0.7798 - accuracy: 0.6967 - val_loss: 1.0646 - val_accuracy: 0.6027
Epoch 41/50
27364/27364 [==============================] - 5s 175us/step - loss:
0.7663 - accuracy: 0.6999 - val_loss: 1.0824 - val_accuracy: 0.6080
Epoch 42/50
27364/27364 [==============================] - 4s 164us/step - loss:
0.7606 - accuracy: 0.6999 - val_loss: 1.0736 - val_accuracy: 0.6095
Epoch 43/50
27364/27364 [==============================] - 4s 163us/step - loss:
0.7551 - accuracy: 0.7031 - val_loss: 1.0796 - val_accuracy: 0.6017
Epoch 44/50
27364/27364 [==============================] - 4s 160us/step - loss:
0.7462 - accuracy: 0.7140 - val_loss: 1.0945 - val_accuracy: 0.6095
Epoch 45/50
27364/27364 [==============================] - 4s 157us/step - loss:
0.7354 - accuracy: 0.7140 - val_loss: 1.0823 - val_accuracy: 0.6101
Epoch 46/50
27364/27364 [==============================] - 5s 164us/step - loss:
0.7295 - accuracy: 0.7170 - val_loss: 1.0735 - val_accuracy: 0.6084
Epoch 47/50
27364/27364 [==============================] - 4s 158us/step - loss:
0.7152 - accuracy: 0.7207 - val_loss: 1.0904 - val_accuracy: 0.6112
Epoch 48/50
27364/27364 [==============================] - 4s 158us/step - loss:
0.7172 - accuracy: 0.7233 - val_loss: 1.0881 - val_accuracy: 0.6132
Epoch 49/50
27364/27364 [==============================] - 5s 166us/step - loss:
0.6992 - accuracy: 0.7305 - val_loss: 1.0999 - val_accuracy: 0.6062
Epoch 50/50
27364/27364 [==============================] - 4s 157us/step - loss:
0.6977 - accuracy: 0.7294 - val_loss: 1.1017 - val_accuracy: 0.6074

print("Accuracy of our model on test data : " ,


model.evaluate(x_test,y_test)[1]*100 , "%")

epochs = [i for i in range(50)]
fig, ax = plt.subplots(1, 2)
train_acc = history.history['accuracy']
train_loss = history.history['loss']
test_acc = history.history['val_accuracy']
test_loss = history.history['val_loss']

fig.set_size_inches(20, 6)
ax[0].plot(epochs, train_loss, label='Training Loss')
ax[0].plot(epochs, test_loss, label='Testing Loss')
ax[0].set_title('Training & Testing Loss')
ax[0].legend()
ax[0].set_xlabel("Epochs")

ax[1].plot(epochs, train_acc, label='Training Accuracy')
ax[1].plot(epochs, test_acc, label='Testing Accuracy')
ax[1].set_title('Training & Testing Accuracy')
ax[1].legend()
ax[1].set_xlabel("Epochs")
plt.show()

9122/9122 [==============================] - 1s 92us/step
Accuracy of our model on test data :  60.74326038360596 %
# predicting on test data.
pred_test = model.predict(x_test)
y_pred = encoder.inverse_transform(pred_test)
y_test = encoder.inverse_transform(y_test)

df = pd.DataFrame(columns=['Predicted Labels', 'Actual Labels'])
df['Predicted Labels'] = y_pred.flatten()
df['Actual Labels'] = y_test.flatten()

df.head(10)

  Predicted Labels Actual Labels
0          neutral       disgust
1              sad           sad
2              sad           sad
3             fear       disgust
4            happy         happy
5              sad          fear
6          disgust           sad
7            happy         happy
8            angry         happy
9            happy         happy

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(12, 10))
cm = pd.DataFrame(cm, index=[i for i in encoder.categories_],
                  columns=[i for i in encoder.categories_])
sns.heatmap(cm, linecolor='white', cmap='Blues', linewidth=1, annot=True, fmt='')
plt.title('Confusion Matrix', size=20)
plt.xlabel('Predicted Labels', size=14)
plt.ylabel('Actual Labels', size=14)
plt.show()
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       angry       0.78      0.69      0.73      1396
        calm       0.62      0.86      0.72       142
     disgust       0.54      0.48      0.51      1461
        fear       0.63      0.51      0.57      1443
       happy       0.53      0.62      0.57      1450
     neutral       0.55      0.57      0.56      1265
         sad       0.58      0.68      0.62      1470
    surprise       0.85      0.79      0.82       495

    accuracy                           0.61      9122
   macro avg       0.63      0.65      0.64      9122
weighted avg       0.61      0.61      0.61      9122

• We can see our model is most accurate at predicting the surprise and angry emotions, which makes sense because audio for these emotions differs from the others in many ways, such as pitch and speed.
• Overall we achieved about 61% accuracy on our test data. That is decent, but we can improve it further by applying more augmentation techniques and using other feature extraction methods; one possible direction is sketched below.
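For example, one untested way to add more augmentation would be to extract features from a fourth, shifted-and-noisy copy of each file inside get_features, reusing the helpers defined earlier. This is only a sketch, not something run in this notebook:

    # hypothetical extra augmentation branch for get_features (untested sketch)
    shifted_noisy = noise(shift(data))
    res4 = extract_features(shifted_noisy)
    result = np.vstack((result, res4))  # the main loop would then append the emotion label a 4th time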

This is all I wanted to do in this project. I hope you guys liked it.
If you like the kernel, please make sure to upvote it :-)
