
Signal, Image and Video Processing, Springer manuscript No.
(will be inserted by the editor)

Distracted Driver Classification Using Deep Learning

Munif Alotaibi a,1, Bandar Alotaibi b,2

1 College of Computing and Information Technology, Shaqra University, Shaqra, Saudi Arabia
2 College of Computer Science and Information Technology, University of Tabuk, Tabuk, Saudi Arabia

Received: date / Accepted: date

Abstract One of the most challenging topics in the field of intelligent transportation systems (ITSs) is the automatic interpretation of the driver's behavior. This research investigates distracted driver posture recognition as a part of the human action recognition framework, since numerous car accidents have been reported that were caused by distracted drivers. Our aim is to improve the performance of detecting drivers' distracted actions. The developed system involves a dashboard camera capable of detecting distracted drivers through 2D camera images. We use a combination of three of the most advanced techniques in deep learning, namely, the Inception module with a residual block and a hierarchical recurrent neural network (HRNN), to enhance the performance of detecting the distracted behaviors of drivers. The proposed method yields very good results. The distracted driver behaviors include texting, talking on the phone, operating the radio, drinking, reaching behind, fixing hair and makeup, and talking to the passenger.

a e-mail: [email protected]
b e-mail: [email protected]

1 INTRODUCTION

As traffic density increases, the number of car crashes is expected to increase further. The World Health Organization (WHO) issued the 2015 Global Status Report, which indicated that nearly 1.25 million deaths are estimated to have occurred globally each year because of car accidents [1], [2]. Hazardous and risky driving behavior causes the deaths of more than a million people and 50 million serious injuries globally every year [3], [4]. The National Highway Traffic Safety Administration (NHTSA) states that, in 2015, car accidents involving distracted driving caused the deaths of 3,477 people, and 391,000 people were seriously injured. The main reason for the reported car crashes was texting or talking on the phone while driving [5]. Distracted driving is defined by the NHTSA as "any activity that diverts attention from driving". Distracted driving includes eating, drinking, talking to passengers, texting, talking on the phone, adjusting the stereo, drowsy driving, and operating a navigation or entertainment system [5], [33].

A clearer definition of distracted driving is provided by the Centers for Disease Control and Prevention (CDC), which categorizes distracted driving into three classes: visual (i.e., looking around and not concentrating visually on the road), cognitive (i.e., looking at the road but not concentrating mentally on the road), and manual (i.e., taking the driver's hands off the wheel) [2]. Reducing the number of car accidents caused by distracted driving and improving traffic safety using smart vehicles equipped with distracted-posture detectors have become a top priority for many governments and car manufacturers. Additionally, to increase road safety, such distracted-driving detectors could be supplied to police officers or paired with radar cameras to penalize offenders.

In this research, we consider the manual category, in which the distractions take the form of texting, talking on the phone, eating or drinking, reaching behind, adjusting the stereo, entertainment, or GPS, and fixing hair and makeup. We improved the performance of distracted-driving detection using an ensemble of convolutional neural network architectures and a hierarchical recurrent neural network (HRNN). The rest of the paper is arranged as follows. Section 2 reviews the related work. Section 3 explains deep learning algorithms. Section 4 presents the dataset information. Section 5 presents our proposed method. Section 6 introduces the experimental results. Section 7 discusses our results. Section 8 concludes the paper.

2 Related Work

Distracted driver classification falls into two main categories. The first category uses wearable sensors to measure physiological and biomedical signals such as brain activity, vascular and muscular activities, and heart rate. These methods [6], [7] have disadvantages such as hardware cost and user involvement. The second category employs a camera and consists of vision-based techniques that monitor distracted behaviors in real time, namely, head pose estimation [8], [9], [11], gaze detection [10], [12], extraction of fatigue cues from the driver's face [13], [15], [16], and body posture characterization (e.g., arm, foot, and hand positions) [17], [18].

Most of the vision-based approaches employ a two-step structure in which features are extracted from the raw data using hand-crafted methods and classifiers are fitted to the hand-crafted features. Approaches that follow this two-step architecture cannot attain an optimal trade-off between the robustness of the trained classifier and the distinctiveness of the hand-crafted features. In the last two decades, vision-based approaches to detecting distracted drivers based on support vector machines (SVMs) and decision trees have dominated the research area. Lately, with the great success of deep learning models, particularly convolutional neural networks (CNNs), in computer vision [19], [20], [21], [34], natural language processing [22], and speech recognition [23], [24], deep learning models have become the dominant approach to the distracted driving problem as well.

Yan et al. [4] proposed an approach to recognize and detect driving postures based on a deep CNN. The approach applies local neighborhood operations and trainable filters to select meaningful features automatically. Its advantage is the ability to learn meaningful features with minimal domain knowledge, which could give the model improved performance over models that used the hand-crafted features employed in previous works. The authors also pre-trained the filters using sparse filtering [25] to accelerate training, for faster convergence and better generalization. They tested several activation functions and pooling operations and found that the best activation function is the rectified linear unit (ReLU) and the best pooling operation is max pooling.

Abouelnaga et al. [2] proposed a method to recognize manual driver distractions, including texting while driving, using a cell phone, drinking, eating, adjusting the radio, reaching behind, and fixing hair and makeup. The authors' proposed technique combines two well-known CNN architectures, namely, AlexNet [19] and Inception V3, into a genetically weighted ensemble. The inputs to the model are raw images, face images, hand images, face-and-hand images, and skin-segmented images. The model was then trained on the input images to recognize the distraction behavior. The authors pre-trained the model using transfer learning (i.e., the ImageNet model) for faster convergence. Then, they obtained the final class distribution by evaluating the weighted sum of all the networks' outputs; they developed a genetic algorithm to evaluate the weights. Table 1 lists state-of-the-art methods, the dataset used with each method, and the overall accuracy of each method.

Table 1: Summary of state-of-the-art methods

Method | Dataset | Classes | Accuracy
Baseline [14] | SEU | 4 | 90.63%
CNN [4] | SEU | 4 | 99.78%
AlexNet [32] | AUC | 10 | 94.29%
GA-Weighted Ensemble [32] | AUC | 10 | 95.98%

3 Deep Learning Background

Recently, deep learning has dominated visual interpretation tasks and has shown outstanding performance. In particular, the convolutional neural network (CNN) has achieved significant progress on image recognition tasks. Finding the perfect CNN architecture is still a very difficult task; thus, many architectures have been proposed, such as GoogLeNet (i.e., Inception), AlexNet, VGGNet, and, most recently, the deep residual network (ResNet). The recurrent neural network (RNN), on the other hand, is a well-known algorithm that obtains impressive results on time series problems as well as language tasks such as speech recognition and machine translation.

3.1 ResNet Model

The ResNet model [21] was introduced by Microsoft researchers in 2016. The model achieved a state-of-the-art top-5 result of 96.4% in the ImageNet Large Scale Visual Recognition Competition (ILSVRC). The network is very deep, consisting of 152 layers. Furthermore, the ResNet model introduced residual blocks, in which identity skip connections are used to make training a very deep architecture feasible. The purpose of a residual block is to copy the inputs of a specific layer and carry them forward to the next layer. The identity skip connection overcomes the vanishing gradient issue and guarantees that the next layer trains on something other than the input it is already familiar with. In addition to its ILSVRC success, ResNet has shown impressive results on many computer vision tasks.
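As a rough illustration of the identity-skip idea (not the paper's exact block, which is specified in Sect. 5.2), a minimal residual block in Keras might look like the following; the filter count and kernel size here are placeholder assumptions:

from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Assumes x already has `filters` channels so the identity sum is valid.
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([x, y])              # identity skip connection: carry x forward
    return layers.Activation("relu")(y)   # activate after the sum, as in ResNet [21]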

3.2 Inception Model (GoogLeNet)

The GoogLeNet model [30] is a deep CNN that was proposed in 2014 by Google researchers and achieved a top-5 accuracy of 93.3% in the ILSVRC. The architecture is deep, consisting of twenty-two layers, and it was created upon a novel building block called the Inception module. This architecture uses a network-in-network layer instead of the typical sequential process: it computes a large convolutional layer, a small convolutional layer, and a pooling layer in parallel. The architecture also performs one-by-one convolution operations to reduce the dimensionality of the features. The number of parameters and operations is reduced significantly because of this dimensionality reduction and parallelism; therefore, these features save memory and minimize the computational cost [29].
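A minimal Keras sketch of one such module is shown below; the branch filter counts are illustrative assumptions, since GoogLeNet varies them per stage:

from tensorflow.keras import layers

def inception_module(x, f1=64, f3=96, f5=16, fp=32):
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)       # 1x1 branch
    b3 = layers.Conv2D(f3 // 2, 1, padding="same", activation="relu")(x)  # 1x1 reduction
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)      # 3x3 branch
    b5 = layers.Conv2D(f5 // 2, 1, padding="same", activation="relu")(x)  # 1x1 reduction
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)      # 5x5 branch
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)             # pooling branch
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)      # 1x1 projection
    return layers.Concatenate()([b1, b3, b5, bp])  # stack branch outputs channel-wise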

3.3 Hierarchical Multiscale Recurrent Neural Network

The hierarchical multiscale recurrent neural network (HM-RNN) was proposed by University of Montreal researchers in 2017 [27]. The HM-RNN uses temporal data to learn a hierarchical multiscale structure (i.e., it does not use explicit boundary information). Instead of assigning fixed update times, the model adaptively assigns update rates that match each layer's abstraction level, and the authors suggest using a binary boundary detector at each layer: high-level layers learn coarse timescales, and low-level layers learn fine timescales. At the time step at which the segment of the corresponding abstraction level has completely finished, the boundary detector is turned on; otherwise, during segment execution, the boundary detector remains off. The authors introduced three operations that act on the hierarchical boundary states (COPY, UPDATE, and FLUSH), one of which is applied at each time step. The UPDATE operation differs from the update rule of the long short-term memory (LSTM) [28] because it is processed sparsely, based on the detected boundaries.
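The following toy Python sketch illustrates only the selection among the three operations; in the actual model [27] the gates and binary boundary detectors are learned end to end, so all values passed in here are assumptions:

def hm_rnn_cell_state(c_prev, f, i, g, z_same_prev, z_below):
    """c_prev: previous cell state; f, i: forget/input gates; g: candidate
    state (all produced by the learned recurrence, omitted here);
    z_same_prev: this layer's boundary state at the previous time step;
    z_below: the boundary detected by the layer below at the current step."""
    if z_same_prev == 1:
        return i * g                  # FLUSH: segment ended; restart from the candidate
    if z_below == 1:
        return f * c_prev + i * g     # UPDATE: sparse LSTM-style update
    return c_prev                     # COPY: nothing finished below; keep the state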

4 Dataset Information

One of the first available datasets for driver distraction classification is the StateFarm 1 dataset, which was released on Kaggle 2 in 2016. The dataset consists of ten classes to be classified, namely, normal driving, texting while driving using the right hand, talking to someone on the phone using the right hand, texting while driving using the left hand, talking to someone on the phone using the left hand, reaching for the dashboard to operate the radio, drinking or eating, reaching behind, fixing hair and makeup, and talking to a passenger. Submissions to the StateFarm competition were evaluated using the multiclass logarithmic loss: a true target is attached to each image, and the goal is to submit a set of predicted probabilities for each image.

Fig. 1: Some samples of the dataset that represent the ten classes (panels (a) to (j) show classes 0 to 9)

1 An insurance company; its headquarters are located in Bloomington, IL, USA
2 A platform for data science and predictive-modeling competitions
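A small NumPy sketch of this evaluation metric, reflecting our reading of the Kaggle scoring (with the usual probability clipping), is:

import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.
    y_true: integer class labels, shape (n,); y_prob: probabilities, shape (n, 10)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)        # avoid log(0)
    y_prob /= y_prob.sum(axis=1, keepdims=True)   # renormalize each row
    return float(-np.log(y_prob[np.arange(len(y_true)), y_true]).mean())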

Table 2: Summary details of the StateFarm dataset

Class | Description | Images
0 | normal driving | 2,489
1 | texting while driving using the right hand | 2,267
2 | talking to somebody on the phone using the right hand | 2,317
3 | texting while driving using the left hand | 2,346
4 | talking to somebody on the phone using the left hand | 2,326
5 | reaching the dashboard to operate the radio | 2,312
6 | drinking or eating | 2,325
7 | reaching behind | 2,002
8 | fixing hair and makeup | 1,911
9 | talking to passenger | 2,129
Sum | | 33,636

Table 3: Summary details of the AUC dataset

Class | Description | Size (frames)
0 | safe driving | 2,986
1 | phone right | 1,256
2 | phone left | 1,320
3 | text right | 1,718
4 | text left | 1,124
5 | adjusting radio | 1,123
6 | drinking | 1,076
7 | hair or makeup | 1,044
8 | reaching behind | 1,034
9 | talking to passenger | 1,797
Sum | | 14,478

The original published dataset has two folders, the training and the testing data; we use only the data in the training folder to evaluate our method, since the data in the testing folder is unlabeled. Thus, we used 33,636 images; more details about the dataset that we used are shown in Table 2.

To further evaluate the strength and generalization performance of our approach, we also used another recent dataset [32] from the American University in Cairo (AUC). The AUC dataset has 44 individuals, 29 males and 15 females, from seven countries: the USA, Egypt, Germany, Uganda, Canada, Morocco, and Palestine. The videos were taken at different times of day, in different cars, with different driving conditions, and with the drivers wearing different clothes. The dataset consists of 14,478 frames distributed over 10 classes, as shown in Table 3. The dataset is divided into 75% for training and 25% for testing, and a split of this kind can be produced as in the sketch below.
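A 75/25 split of this kind can be reproduced with scikit-learn; the array names are hypothetical, and the published AUC split may differ in detail (for example, it may be grouped by driver rather than sampled at random):

from sklearn.model_selection import train_test_split

# frames: image array; labels: the ten AUC class ids (hypothetical names)
x_train, x_test, y_train, y_test = train_test_split(
    frames, labels, test_size=0.25, stratify=labels, random_state=42)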
5 Proposed Method

In this paper, we take advantage of three advanced deep learning models, namely, the residual network (ResNet), the hierarchical recurrent neural network (HRNN), and the Inception architecture, by combining them into one model. In our model, shown in Fig. 3, there is one ResNet block and two HRNN layers integrated with the Inception module; they are followed by two dense layers and finally by the softmax classifier. The ResNet block has two convolutional layers. The details of each of these entities are explained in the following sections.

5.1 Inception module

We use one Inception module that is similar to the original Inception module used in the GoogLeNet model. However, in addition to the entities of the original Inception module shown in Fig. 2, we add two entities: one ResNet block and two LSTM layers (the HRNN network), as shown in Fig. 3. The max pooling in our Inception module has a size of 3 × 3 with a stride of 1. The figure also shows the size of each filter (kernel) in each convolutional layer. Some convolutional layers in the Inception module perform a one-by-one convolution operation to reduce the dimensionality of the features; the size of the filters in the other two convolutional layers is 5 × 5. Each convolutional layer has seven kernels, and the output of each convolutional kernel has the same length as the original input (same padding). Each input image has RGB color channels; thus, we use 2D convolutional layers.

5.2 Our ResNet

There is one ResNet block, which has two convolutional layers. The raw image data is fed to the first convolutional layer, which convolves the input volume with three filters, each of size 5 × 5. The output feature maps, which have the same length as the original input (same padding), are fed into the second layer, which has the same settings as the first layer. Then, we sum the output features with the input and feed them to the next step. The operation of the filters in each convolutional layer is summarized in Equation 1:

X^i = φ(W^i ∗ X^{i−1} + β^i)    (1)

where X^{i−1} is the input and X^i is the output feature map; note that each convolutional layer outputs more than one feature map. β^i is the bias term, φ is the activation function, and ∗ is the convolution operator. The ReLU activation function φ applies an elementwise operation to the input data x and is defined as:

f(x) = 0 for x < 0
f(x) = x for x ≥ 0
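A rough Keras sketch of the convolutional part of this modified module, based on our reading of the description and Fig. 3, follows; the exact wiring is an assumption, and the residual and LSTM branches of Sects. 5.2 and 5.3 are concatenated with it later:

from tensorflow.keras import layers

def modified_inception(x):
    b1 = layers.Conv2D(7, 1, padding="same", activation="relu")(x)      # 1x1 branch
    b5a = layers.Conv2D(7, 1, padding="same", activation="relu")(x)     # 1x1 reduction
    b5a = layers.Conv2D(7, 5, padding="same", activation="relu")(b5a)   # 5x5, 7 kernels
    b5b = layers.Conv2D(7, 1, padding="same", activation="relu")(x)     # 1x1 reduction
    b5b = layers.Conv2D(7, 5, padding="same", activation="relu")(b5b)   # second 5x5 branch
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)           # 3x3 max pool, stride 1
    bp = layers.Conv2D(7, 1, padding="same", activation="relu")(bp)     # 1x1 projection
    return layers.Concatenate()([b1, b5a, b5b, bp])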

Fig. 2: The original Inception module (parallel 1×1, 3×3, and 5×5 convolution branches with 1×1 reductions and a 3×3 max-pooling branch, concatenated).

Fig. 3: Our modified version of the Inception module (the parallel convolution and 3×3 max-pooling branches are joined by a residual block of 5×5 convolutions and two LSTM layers after 3×3 average pooling; the concatenated output feeds the dense layers and the softmax).

Then, the output of the residual block is fed to the average pooling layer: we apply average pooling of size 2 with a stride of 2 to the data. The operation of the residual block with its two convolutional layers is defined in Equation 2:

x^1 = F(x^0),  x^2 = F(x^1),  x^3 = x^0 + x^2    (2)

where F(x) represents the operation of each convolutional layer, and x^i represents the input and output feature maps.
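A minimal Keras sketch of this block, matching Eqs. (1) and (2) and the settings above (three 5 × 5 kernels, same padding, ReLU, then 2 × 2 average pooling with stride 2), follows; the identity sum assumes a 3-channel input such as a raw RGB image:

from tensorflow.keras import layers

def residual_block_5x5(x0):
    x1 = layers.Conv2D(3, 5, padding="same", activation="relu")(x0)  # x1 = F(x0), Eq. (1)
    x2 = layers.Conv2D(3, 5, padding="same", activation="relu")(x1)  # x2 = F(x1)
    x3 = layers.Add()([x0, x2])                                      # x3 = x0 + x2, Eq. (2)
    return layers.AveragePooling2D(pool_size=2, strides=2)(x3)       # 2x2 average pooling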

5.3 Our HRNN

Before feeding the data to the HRNN, we use average pooling to reduce the dimensionality of the data. In the hierarchical RNN, there are two long short-term memory (LSTM) layers. The first one encodes every column of the data, which has a shape of (26, 80), to a column vector with a shape of (80). The second one encodes these 80 column vectors of shape (80, 80) to a single vector that represents all of the data. More details of the structure of our HRNN are shown in Table 4, and a sketch of this encoder follows the table.

Table 4: Summary details of the HRNN

Layer | Output tensor size | Number of parameters
1st LSTM layer | (26, 80) | 26,880
2nd LSTM layer | (80) | 51,520
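A minimal Keras sketch of this two-level encoder is given below. The input layout (26 sequences of 80 steps with 3 channels) is our assumption, chosen because it reproduces the output shapes and parameter counts in Table 4:

from tensorflow.keras import layers, models

# 1st LSTM: 4*80*(3+80+1) = 26,880 params, output (26, 80), as in Table 4.
# 2nd LSTM: 4*80*(80+80+1) = 51,520 params, output (80,).
inp = layers.Input(shape=(26, 80, 3))
rows = layers.TimeDistributed(layers.LSTM(80))(inp)  # encode each length-80 sequence
vec = layers.LSTM(80)(rows)                          # summarize the 26 encodings
encoder = models.Model(inp, vec)
encoder.summary()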
Finally, we concatenate the outputs of the three entities and feed them to two dense layers, each with 80 neurons. Then, the softmax function is applied to the output of the dense layers to classify the input image into one of the ten classes.

We use a batch size of 80 and set the number of epochs to 30. We use the Adam optimizer for our model, with the initial learning rate set to 0.001, beta1 to 0.9, and beta2 to 0.999. For every layer in our model, we use Glorot uniform (Xavier uniform) initialization [35] for the weight matrices, and we initialize all the bias terms of the convolutional layers to zero.

The proposed model was implemented using the Keras library [31]. It was trained and tested on a computer with an Intel Core(TM) i7 CPU @ 2.00 GHz, 16.0 GB of RAM, and a 64-bit Windows 10 operating system.
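A sketch of this training configuration in Keras, assuming `model`, `x_train`, and `y_train` are defined as above (hypothetical names; the cross-entropy loss is our assumption, consistent with the softmax output and the log-loss evaluation):

from tensorflow.keras import optimizers

# Glorot uniform is Keras's default kernel initializer, and biases default to zeros.
opt = optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=80, epochs=30)  # 30 epochs, batches of 80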
6 Experiment and Results

To evaluate the performance of our proposed method in detecting distracted actions of drivers, we used the State Farm Distracted Driver Detection dataset as provided by Kaggle. We measure and evaluate the performance using the overall accuracy, calculated as the number of images that are classified correctly divided by the total number of samples in the test set.

We used several percentages to divide the data into training and testing sets: 10%, 30%, and 50% for training, with the remainder for testing. The overall accuracies of our method on the dataset using these training splits are shown in Table 5.

The confusion matrix of our proposed deep learning architecture when using 10% of the data for training is shown in Fig. 4a, and Fig. 4b shows the confusion matrix when using 30% of the data for training. The accuracy increased when we increased the size of the training data.
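The overall accuracy is straightforward to compute from the predicted class probabilities; a small sketch with the hypothetical `model` and test arrays from above:

import numpy as np

y_pred = model.predict(x_test).argmax(axis=1)   # most probable class per image
overall_accuracy = float(np.mean(y_pred == y_test))
print(f"overall accuracy: {overall_accuracy:.2%}")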

Fig. 4: The confusion matrices of our proposed deep learning architecture using three different percentages to divide the State Farm Distracted Driver dataset into training and testing sets: (a) 10% of the data for training and the rest for testing, (b) 30% for training, and (c) 50% for training. Per-class accuracies (the diagonals) range from 89.59% to 98.01% in (a), from 97.61% to 99.75% in (b), and from 98.32% to 99.68% in (c).

Table 5: Overall accuracies of our method when applied on the dataset using different training split percentages

Training (%) | Training Samples | Testing Samples | Accuracy
10% | 2,242 | 20,182 | 96.23%
30% | 6,727 | 15,697 | 98.92%
50% | 11,212 | 11,212 | 99.30%

Fig. 4c shows the confusion matrix of our proposed deep learning architecture when using 50% of the data for training; the accuracy is 99.30%.

Our model can smoothly converge to a local minimum within a few epochs. Fig. 5a shows how our model minimizes the loss function during the training phase in order to optimize the parameters of the network. In addition, Fig. 5b shows the accuracy on the training data during the convergence steps of our model.

Moreover, we also applied our method to the AUC distracted driver dataset [32]. We resized each image to 67 × 120 pixels and used 75% of the dataset for training and the remainder for testing. The results, shown in Table 6, indicate that our model achieves promising results on this dataset as well.

Table 6: Overall accuracies of our method when applied to the State Farm Distracted Driver and AUC datasets

Method | Database | Training | Testing | Accuracy
Our method | StateFarm | 2,242 | 20,182 | 96.23%
ResNet [21] | StateFarm | 2,242 | 20,182 | 95.31%
HRNN [27] | StateFarm | 2,242 | 20,182 | 98.34%
ResNet + HRNN | StateFarm | 2,242 | 20,182 | 91.72%
Our method | AUC | 12,977 | 4,331 | 92.36%
ResNet [21] | AUC | 12,977 | 4,331 | 88.52%
HRNN [27] | AUC | 12,977 | 4,331 | 84.85%

7 Discussion

From the experiment, we can conclude that our model is capable of learning rich representations with a small number of parameters. It was able to achieve very accurate results, particularly when there is sufficient training data.

The motivation for the proposed model is to find a simpler and more efficient model by combining several techniques that have already been proven to be very effective. Moreover, we compare the performance of the proposed method with the ResNet model, as shown in Table 6. In this table we use the State Farm Distracted Driver dataset; for the ResNet baseline we use two residual blocks followed by the average pooling layer. We also list the result of the HRNN alone, without any convolutional layer, and the result of one ResNet block followed by the HRNN, using 10 percent of the data for training and the remaining 90% for testing.

We noted that larger architectures such as Xception, Inception, VGG, and ResNet50 cannot be optimized easily on this dataset; for example, Inception requires fine-tuning techniques (i.e., transfer learning). We found that using a few convolutional layers works best for our model: when we increase the number of layers, the accuracy decreases.

When we apply our method to the AUC distracted driver dataset, the results show that our method performs better than both ResNet and HRNN: our method obtained 92.36%, ResNet obtained 88.52%, and HRNN achieved 84.85%. Our method is better than ResNet by around 3.80% and better than HRNN by around 7%.

Fig. 5: (a) Loss and (b) accuracy on the training data during the convergence of our model over 30 epochs.

Table 7 shows the average computation time for all three models on the AUC dataset. ResNet has the lowest computation time, with one sample taking only 62 ms during the testing phase; the HRNN takes only 71 ms per image. Our model takes 114 ms to process one sample. Thus, our model requires more computation time than both ResNet and HRNN, which is due to our model's additional components. However, the results demonstrate that all the methods can maintain real-time performance.

Table 7: Computation time during the testing phase (AUC dataset)

Method | Time
Ours | 114 ms
ResNet | 62 ms
HRNN | 71 ms
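Per-image test-phase timings of this kind can be measured as in the following sketch (hypothetical `model` and `x_test` as above; absolute timings will of course vary with hardware):

import time

start = time.perf_counter()
model.predict(x_test, batch_size=1)             # one sample at a time
elapsed = time.perf_counter() - start
print(f"{1000 * elapsed / len(x_test):.0f} ms per image")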
8 Conclusions

The automatic recognition of the driver's behavior is one of the challenging problems in ITSs. In the last decade, as smartphones have become prevalent worldwide, many car crashes caused by distracted drivers have also occurred. We investigated distracted driver postures as a part of human action recognition to recognize the driver's behavior. Our objective is to improve the accuracy of the distracted driver classification problem. We propose a method that combines three of the most advanced models in deep learning, namely, the residual network, the Inception module, and the hierarchical recurrent neural network, to improve the performance of detecting a distracted driver's behavior. We tested our approach using two datasets (i.e., State Farm's dataset on the Kaggle platform and the AUC dataset). The images in the datasets were taken using a dashboard camera to detect distracted drivers from 2D images. The proposed method achieved promising results.

References

1. World Health Organization, & World Health Organization Management of Substance Abuse Unit. (2014). Global status report on alcohol and health, 2014. World Health Organization.
2. Abouelnaga, Y., Eraqi, H. M., & Moustafa, M. N. Real-time distracted driver posture classification.
3. Peden, M. (2004). World report on road traffic injury prevention.
4. Yan, C., Coenen, F., & Zhang, B. (2016). Driving posture recognition by convolutional neural networks. IET Computer Vision, 10(2), 103-114.
5. National Highway Traffic Safety Administration. (2016). 2015 motor vehicle crashes: overview. Traffic Safety Facts Research Note, 2016, 1-9.
6. Craye, C., & Karray, F. (2015). Driver distraction detection and recognition using RGB-D sensor. arXiv preprint arXiv:1502.00250.
7. Fernández, A., Usamentiaga, R., Carús, J. L., & Casado, R. (2016). Driver distraction using visual-based sensors and algorithms. Sensors, 16(11), 1805.
8. Watta, P., Lakshmanan, S., & Hou, Y. (2007). Nonparametric approaches for estimating driver pose. IEEE Transactions on Vehicular Technology, 56(4), 2028-2041.
9. Murphy-Chutorian, E., & Trivedi, M. M. (2010). Head pose estimation and augmented reality tracking: An integrated system and evaluation for monitoring driver awareness. IEEE Transactions on Intelligent Transportation Systems, 11(2), 300-311.
10. Doshi, A., & Trivedi, M. M. (2009). On the roles of eye gaze and head dynamics in predicting driver's intent to change lanes. IEEE Transactions on Intelligent Transportation Systems, 10(3), 453-462.
11. Teyeb, I., Jemai, O., Zaied, M., & Amar, C. B. (2014, September). A drowsy driver detection system based on a new method of head posture estimation. In International Conference on Intelligent Data Engineering and Automated Learning (pp. 362-369). Springer, Cham.
12. Teyeb, I., Jemai, O., Zaied, M., & Amar, C. B. (2014, July). A novel approach for drowsy driver detection using head posture estimation and eyes recognition system based on wavelet network. In Information, Intelligence, Systems and Applications (IISA 2014), The 5th International Conference on (pp. 379-384). IEEE.
13. Bergasa, L. M., Nuevo, J., Sotelo, M. A., Barea, R., & Lopez, M. E. (2006). Real-time system for monitoring driver vigilance. IEEE Transactions on Intelligent Transportation Systems, 7(1), 63-77.
14. Zhao, C. H., Zhang, B. L., He, J., & Lian, J. (2012). Recognition of driving postures by contourlet transform and random forests. IET Intelligent Transport Systems, 6(2), 161-168.
15. Jemai, O., Teyeb, I., & Bouchrika, T. (2013). A novel approach for drowsy driver detection using eyes recognition system based on wavelet network. International Journal of Recent Contributions from Engineering, Science & IT (iJES), 1(1), 46-52.
16. Lei, J., Han, Q., Chen, L., Lai, Z., Zeng, L., & Liu, X. (2017). A novel side face contour extraction algorithm for driving fatigue statue recognition. IEEE Access, 5, 5723-5730.
17. Cheng, S. Y., Park, S., & Trivedi, M. M. (2007). Multi-spectral and multi-perspective video arrays for driver body tracking and activity analysis. Computer Vision and Image Understanding, 106(2-3), 245-257.
18. Tran, C., Doshi, A., & Trivedi, M. M. (2012). Modeling and prediction of driver behavior by foot gesture analysis. Computer Vision and Image Understanding, 116(3), 435-445.
19. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
20. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
21. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
22. Hu, B., Lu, Z., Li, H., & Chen, Q. (2014). Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems (pp. 2042-2050).
23. Abdel-Hamid, O., Mohamed, A. R., Jiang, H., & Penn, G. (2012, March). Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (pp. 4277-4280). IEEE.
24. Abdel-Hamid, O., Mohamed, A. R., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10), 1533-1545.
25. Ngiam, J., Chen, Z., Bhaskar, S. A., Koh, P. W., & Ng, A. Y. (2011). Sparse filtering. In Advances in Neural Information Processing Systems (pp. 1125-1133).
26. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818-2826).
27. Chung, J., Ahn, S., & Bengio, Y. (2016). Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.
28. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
29. Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., & Garcia-Rodriguez, J. (2017). A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857.
30. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
31. Chollet, F., & others. (2015). Keras. https://keras.io
32. Eraqi, H. M., Abouelnaga, Y., Saad, M. H., & Moustafa, M. N. (2019). Driver distraction identification with an ensemble of convolutional neural networks. Journal of Advanced Transportation, Hindawi.
33. Resalat, S. N., & Saba, V. (2015). A practical method for driver sleepiness detection by processing the EEG signals stimulated with external flickering light. Signal, Image and Video Processing, 1751-1757.
34. Soon, F. C., Khaw, H. Y., Chuah, J. H., & Kanesan, J. (2019). Vehicle logo recognition using whitening transformation and deep learning. Signal, Image and Video Processing, 111-119.
35. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (pp. 249-256).
