Abstract One of the most challenging topics in the field of intelligent transportation systems (ITSs) is the automatic interpretation of the driver's behavior. This research investigates distracted driver posture recognition as part of the human action recognition framework. Numerous car accidents caused by distracted drivers have been reported. Our aim is to improve the performance of detecting drivers' distracted actions. The developed system involves a dashboard camera capable of detecting distracted drivers through 2D camera images. We use a combination of three of the most advanced techniques in deep learning, namely, the Inception module, a residual block, and a hierarchical recurrent neural network (HRNN), to enhance the performance of detecting the distracted behaviors of drivers. The proposed method yields very good results. The distracted driver behaviors include texting, talking on the phone, operating the radio, drinking, reaching behind, fixing hair and makeup, and talking to the passenger.

1 INTRODUCTION

As traffic density increases, the number of car crashes is expected to increase further. The World Health Organization (WHO) issued the 2015 Global Status Report, which indicated that nearly 1.25 million deaths are estimated to occur globally each year because of car accidents [1], [2]. Hazardous and risky driving behavior causes the deaths of more than a million people and 50 million serious injuries globally every year [3], [4]. The National Highway Traffic Safety Administration (NHTSA) states that, in 2015, car accidents involving distracted driving caused the deaths of 3,477 people, while 391,000 people were seriously injured. The main reason for the reported car crashes was texting or talking on the phone while driving [5]. Distracted driving is defined by the NHTSA as "any activity that diverts attention from driving"; it includes eating, drinking, talking to passengers, texting, talking on the phone, adjusting the stereo, drowsy driving, and operating a navigation or entertainment system [5], [33].

A clearer definition of distracted driving is provided by the Centers for Disease Control and Prevention (CDC), which categorizes distracted driving into three classes: visual (i.e., looking around and not concentrating visually on the road), cognitive (i.e., looking at the road but not concentrating mentally on it), and manual (i.e., taking the driver's hands off the wheel) [2]. Reducing the number of car accidents caused by distracted driving, and improving traffic safety with smart vehicles equipped with detectors of distracted driving postures, have become a first priority for many governments and car manufacturers. Additionally, to increase road safety, police officers or radar cameras could be supplied with such distracted driving detectors to penalize offenders.

In this research, we consider the manual category, in which the distractions take the form of texting, talking on the phone, eating or drinking, reaching behind, adjusting the stereo, entertainment, or GPS, and fixing hair and makeup. We improve the performance of distracted driving detection using an ensemble of convolutional neural network architectures and a hierarchical recurrent neural network (HRNN). The rest of the paper is arranged as follows. Section 2 reviews the related work. Section 3 explains deep learning algorithms. Section 4 presents the dataset information. Section 5 presents our proposed method. Section 6 introduces the experimental results. Section 7 discusses our results. Section 8 concludes the paper.
2 Related Work
Distracted driver classification approaches fall into two main categories. The first category uses wearable sensors to measure physiological and biomedical signals such as brain activity, vascular and muscular activities, and heart rate. These methods [6], [7] have some disadvantages, such as hardware cost and user involvement. The second category employs a camera and comprises three vision-based techniques to monitor distracted behaviors in real time, namely, head pose estimation [8], [9], [11] and gaze detection [10], [12], fatigue cue extraction from the driver's face [13], [15], [16], and body posture characterization (e.g., arm, foot, and hand positions) [17], [18].

Most of the vision-based approaches employ a two-step structure in which features are extracted from the raw data using hand-crafted methods and classifiers are fitted based on the hand-crafted features. Approaches that follow the two-step architecture cannot attain an optimal trade-off between the robustness of the trained classifier and the distinctiveness of the hand-crafted features. In the last two decades, vision-based approaches to detect distracted drivers based on support vector machines (SVMs) and decision trees have dominated the research area. Lately, with the great success of deep learning models, particularly convolutional neural networks (CNNs), in computer vision [19], [20], [21], [34], natural language processing [22], and speech recognition [23], [24], these deep learning models have become the dominant approach to the distracted driving problem as well.

Yan et al. [4] proposed an approach to recognize and detect driving postures based on a deep CNN. The approach applies local neighborhood operations and trainable filters to select meaningful features automatically. The advantage of this approach is the ability to learn meaningful features with minimal domain knowledge, which could give the model improved performance over models that used the hand-crafted features employed in previous works. The authors also pre-trained the filters using sparse filtering [25] to accelerate the training for faster convergence and better generalization. The authors tested several activation functions and pooling operations and found that the best activation function is the rectified linear unit (ReLU) and the best pooling operation is the max-pooling technique.

Abouelnaga et al. [2] proposed a method to recognize manual driver distractions, including texting while driving, using the cell phone, drinking, eating, adjusting the radio, reaching behind, and fixing hair and makeup. The authors' proposed technique combines two well-known CNN architectures, namely, AlexNet [19] and Inception V3, into a genetically weighted ensemble. The inputs to the model are raw images, face images, hands images, face-and-hands images, and skin-segmented images. The model was then trained on the input images to recognize the distraction behavior. The authors pre-trained the model using transfer learning (i.e., the ImageNet model) for faster convergence. Then, they obtained the final class distribution by evaluating the weighted sum of all the networks' outputs. They also developed a genetic algorithm to evaluate the weights. Table 1 lists state-of-the-art methods, the dataset used with each method, and the overall accuracy of each method.

Table 1: State-of-the-art methods, the dataset used with each method, and the overall accuracy of each method

Method                      Dataset  Classes  Accuracy
Baseline [14]               SEU      4        90.63%
CNN [4]                     SEU      4        99.78%
AlexNet [32]                AUC      10       94.29%
GA-Weighted Ensemble [32]   AUC      10       95.98%

3 Deep Learning Background

Recently, deep learning has dominated visual interpretation tasks and has shown outstanding performance. In particular, the convolutional neural network (CNN) has achieved significant progress on image recognition tasks. Finding the perfect CNN architecture is still a very difficult task; thus, many architectures have been proposed, such as GoogLeNet (i.e., Inception), AlexNet, VGGNet, and, most recently, the deep residual network (i.e., ResNet). On the other hand, the recurrent neural network (RNN) is a well-known algorithm that obtains impressive results on time series problems as well as language tasks such as speech recognition and machine translation.

3.1 ResNet Model

The ResNet model [21] was introduced by Microsoft researchers in 2016. The model achieved a state-of-the-art result of 96.4% in the ImageNet Large Scale Visual Recognition Competition (ILSVRC). The network is very deep, consisting of 152 layers. Furthermore, the ResNet model introduced unique residual blocks in which identity skip connections are used to make training a very deep architecture tractable. The purpose of the residual blocks is to copy and carry the inputs of a specific layer forward to the next layer. The vanishing gradient issue is mitigated by the identity skip connection, which guarantees that the next layer trains on something other than the input that the layer is already familiar with. In addition to its ILSVRC success, ResNet has shown impressive results on many computer vision tasks.
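To make the identity skip connection concrete, the following minimal Keras sketch builds a block whose output is F(x) + x; the filter count and kernel size here are illustrative, not ResNet's exact configuration:

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Residual block: learn F(x), then add the unchanged input back."""
    shortcut = x                                   # identity skip connection
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])                # output = F(x) + x
    # Note: assumes x already has `filters` channels so the addition is valid.
    return layers.Activation("relu")(y)
```

Because the gradient flows through the addition unchanged, very deep stacks of such blocks remain trainable, which is what makes the 152-layer depth mentioned above feasible.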
3.2 Inception Model (GoogLeNet)

The GoogLeNet model [30] is a deep CNN that was proposed in 2014 by Google researchers and achieved a top-5 accuracy of 93.3% in the ILSVRC. The GoogLeNet architecture is deep, consisting of twenty-two layers, and was created upon a novel building block called the Inception module. This architecture uses a network-in-network layer instead of the typical sequential process: it computes a large convolutional layer, a small convolutional layer, and a pooling layer in parallel, and it performs one-by-one convolution operations to reduce the dimensionality of the features. The number of parameters and operations is reduced significantly because of the dimensionality reduction used in this architecture and the parallelism that has been introduced; therefore, these features save memory and minimize the computational cost [29].
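A minimal Keras sketch of an Inception-style block illustrates these ideas; the branch widths below are illustrative defaults, not GoogLeNet's exact values:

```python
from tensorflow.keras import layers

def inception_module(x, f1=64, f3=96, f5=16, fp=32):
    """Parallel branches (1x1, 3x3, 5x5 convolutions, and pooling); cheap
    1x1 convolutions reduce channel dimensionality before the larger
    filters, and the branch outputs are concatenated channel-wise."""
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 1, padding="same", activation="relu")(x)   # 1x1 reduction
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(f5, 1, padding="same", activation="relu")(x)   # 1x1 reduction
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)  # 1x1 projection
    return layers.Concatenate()([b1, b3, b5, bp])
```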
4 Dataset Information

To evaluate our proposed method, we used the State Farm Distracted Driver Detection dataset provided by Kaggle. The dataset covers ten classes: safe driving, texting while driving using the right hand, talking to someone on the phone using the right hand, texting while driving using the left hand, talking to someone on the phone using the left hand, reaching for the dashboard to operate the radio, drinking or eating, reaching behind, fixing hair and makeup, and talking to a passenger. Submissions to the State Farm competition were evaluated using the multiclass logarithmic loss: a true target is attached to each image, and the goal is to submit a set of predicted probabilities for each image.

The originally published dataset has two folders, the testing data and the training data; we use only the data in the training folder to evaluate our method, since the data in the testing folder is unlabeled. Thus, we used 33,636 images; more details about the dataset that we used are shown in Table 2.

Table 2: Summary details of the 1st dataset

To further evaluate the strength and generalization performance of our approach, we also used another recent dataset [32] from the American University in Cairo (AUC). The AUC dataset has 44 individuals, 29 males and 15 females, from seven countries: the USA, Egypt, Germany, Uganda, Canada, Morocco, and Palestine. The videos were taken at different times of day, in different cars, with different driving conditions, and with the drivers wearing different clothes. The dataset consists of 14,478 frames distributed over 10 classes, as shown in Table 3. The dataset is divided into 75% for training and 25% for testing.

Table 3: Summary details of the AUC dataset
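For reference, the multiclass logarithmic loss used to score the State Farm competition, mentioned above, can be computed as follows; this is a NumPy sketch of the standard definition, with predictions clipped so the logarithm stays finite:

```python
import numpy as np

def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """-1/N * sum_i sum_j y_ij * log(p_ij), where y_true is one-hot
    (N, C) and y_pred holds the submitted class probabilities (N, C)."""
    p = np.clip(y_pred, eps, 1 - eps)
    p = p / p.sum(axis=1, keepdims=True)   # renormalize rows after clipping
    return -np.mean(np.sum(y_true * np.log(p), axis=1))
```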
5 Proposed Method

In this paper, we take advantage of three advanced deep learning models, namely, the residual network (ResNet), the hierarchical recurrent neural network (HRNN), and the Inception architecture, by combining them into one model. In our model, shown in Fig. 3, there is one ResNet block and there are two HRNN layers integrated with the Inception module; they are followed by two dense layers and finally by the softmax classifier. The ResNet block has two convolutional layers. The details of each of these entities are explained in the following sections.

5.1 Inception module

We use one Inception module that is similar to the original Inception module used in the GoogLeNet model. However, in addition to the entities of the original Inception module, shown in Fig. 2, we add two entities: one ResNet block and two LSTM layers (the HRNN network), as shown in Fig. 3. The max pooling in our Inception module has a size of 3 x 3 with a stride of 1. The figure also shows the size of each filter (kernel) in each convolutional layer. Some convolutional layers in the Inception module perform a one-by-one convolution operation to reduce the dimensionality of the features. The size of the filters in the other two convolutional layers is 5 x 5. Each convolutional layer has seven kernels, and the output of each convolutional kernel has the same length as the original input (same padding). Each input image has RGB color channels; thus, we are using 2D convolutional layers.

5.2 Our ResNet

There is one ResNet block, which has two convolutional layers. The raw image data is fed to the first convolutional layer, which convolves the input volume with three filters, each of size 5 x 5. The output feature maps, which have the same length as the original input (same padding), are fed into the second layer, which has the same settings as the first layer. Then, we sum the output features with the input and feed them to the next step. The operation of the filter in each convolutional layer is summarized in equation 1:

$X_i = \phi(W_i \ast X_{i-1} + \beta_i)$  (1)

where $X_{i-1}$ is the input and $X_i$ is the output feature map (note that each convolutional layer outputs more than one feature map), $\beta_i$ is the bias term, $\phi$ is the activation function, and $\ast$ is the convolution operator. The ReLU activation function $\phi$ applies an elementwise operation on the input data $x$ and is defined as:

$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}$
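The following Keras sketch assembles the convolutional entities with the sizes stated above (seven kernels per Inception convolution, 1 x 1 and 5 x 5 filters, a 3 x 3 max pool with stride 1, and three 5 x 5 filters in the ResNet block, all with same padding and ReLU as in equation 1). The exact wiring of the branches follows the original Inception layout of Fig. 2, which is an assumption where the text leaves it open; the pooling, HRNN branch, and concatenation are described next, so this is a partial sketch rather than the full model:

```python
from tensorflow.keras import layers

def inception_branches(x):
    """Our Inception entities: 1x1 and 5x5 convolutions with seven
    kernels each, plus a 3x3 max-pooling branch (stride 1, same padding)."""
    b1 = layers.Conv2D(7, 1, padding="same", activation="relu")(x)
    b5a = layers.Conv2D(7, 1, padding="same", activation="relu")(x)   # 1x1 reduction
    b5a = layers.Conv2D(7, 5, padding="same", activation="relu")(b5a)
    b5b = layers.Conv2D(7, 1, padding="same", activation="relu")(x)   # 1x1 reduction
    b5b = layers.Conv2D(7, 5, padding="same", activation="relu")(b5b)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(7, 1, padding="same", activation="relu")(bp)
    return [b1, b5a, b5b, bp]

def resnet_block(x):
    """ResNet entity: two 5x5 convolutions with three filters each, per
    equation 1, then the block input is summed with the output (the sum
    and the average pooling that follows are detailed in the next section).
    Assumes x has three (RGB) channels so the addition is valid."""
    y = layers.Conv2D(3, 5, padding="same", activation="relu")(x)
    y = layers.Conv2D(3, 5, padding="same", activation="relu")(y)
    return layers.Add()([x, y])
```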
Fig. 2: The original Inception module.

Fig. 3: Our proposed model: the Inception module with the added ResNet block and two LSTM layers (the HRNN), followed by two dense layers and the softmax classifier.
Then, the output of the residual block is fed to the average pooling layer. We apply average pooling of size 2 with a stride of 2 to the data. The operation of the residual block, which has two convolutional layers, is defined in equation 2:

$x^1 = F(x^0), \quad x^2 = F(x^1), \quad x^3 = x^0 + x^2$  (2)

where $F(x)$ represents the operation of each convolutional layer, and $x^i$ represents the input and output feature maps.

Before feeding the data to the HRNN, we use average pooling to reduce the dimensionality of the data. In the hierarchical RNN, there are two long short-term memory (LSTM) layers. The first one encodes every column of the data, which has a shape of (26, 80), into a column vector of shape (80). The second one encodes these column vectors of shape (80, 80) into a single vector that represents all of the data. More details of the structure of our HRNN are given in Table 4.

Table 4: Summary details of the HRNN

Layer            Output tensor size   Number of parameters
1st LSTM layer   (26, 80)             26,880
2nd LSTM layer   (80)                 51,520

Finally, we concatenate all the outputs of the three entities and feed them to two dense layers, each with 80 neurons. Then, we use the softmax function, which is applied to the output of the dense layers to classify the input image into one of the ten classes.

We use a batch size of 80 and set the number of epochs to 30. We use the Adam optimizer for our model, setting the initial adaptive learning rate to 0.001, beta1 to 0.9, and beta2 to 0.999. For every layer in our model, we use Glorot uniform (Xavier uniform) initialization [35] for the weight matrix, and we initialize all the bias terms of the convolutional layers to zero.

The proposed model was implemented using the Keras library [31]. It was trained and tested on a computer with an Intel Core(TM) i7 CPU @ 2.00 GHz, 16.0 GB of RAM, and a 64-bit Windows 10 operating system.
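The hierarchical encoder and the training configuration above can be sketched as follows. The (26, 80, 3) input shape is an assumption chosen to reproduce the tensor sizes and parameter counts in Table 4, the categorical cross-entropy loss is likewise an assumption, and the HRNN is shown standalone even though in the full model its output is concatenated with the Inception and ResNet outputs:

```python
from tensorflow.keras import layers, models, optimizers

inp = layers.Input(shape=(26, 80, 3))               # assumed pooled-feature shape
# First LSTM: encode each of the 26 slices (a sequence of 80 positions
# with 3 features) into an 80-dim vector -> output (26, 80), 26,880 params.
h = layers.TimeDistributed(layers.LSTM(80))(inp)
# Second LSTM: encode the sequence of those vectors into one 80-dim
# vector representing all of the data -> output (80,), 51,520 params.
h = layers.LSTM(80)(h)
h = layers.Dense(80, activation="relu")(h)          # two dense layers,
h = layers.Dense(80, activation="relu")(h)          # 80 neurons each
out = layers.Dense(10, activation="softmax")(h)     # ten-class softmax

model = models.Model(inp, out)   # Glorot uniform is Keras' default initializer
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# Training settings from the paper:
# model.fit(x_train, y_train, batch_size=80, epochs=30)
```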
6 Experimental Results

To evaluate the performance of our proposed method in detecting distracted actions of drivers, we used the State Farm Distracted Driver Detection dataset as provided by Kaggle. We measure and evaluate the performance using the overall accuracy, which is calculated as the number of images that are classified correctly divided by the total number of samples in the test set.

We used several percentages to divide the data into training and testing sets: 10%, 20%, and 30% for training, with the remainder for testing. The overall accuracies of our method under these splits are shown in Table 5.

The confusion matrix of our proposed deep learning architecture when using 10% of the data for training is shown in Fig. 4a, and Fig. 4b shows the confusion matrix when using 30% of the data for training. The accuracy increased when we increased the size of the training data. Fig. 4c shows the confusion matrix when using 50% of the data for training.
Fig. 4: The confusion matrices of our proposed deep learning architecture using three different percentages to divide the State Farm Distracted Driver dataset into testing and training sets: (a) 10% of the data for training and the rest for testing, (b) 30% of the data for training and the rest for testing, (c) 50% of the data for training and the rest for testing.
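The overall accuracy defined above and row-normalized confusion matrices like those in Fig. 4 can be reproduced from the trained model's predictions; a sketch using scikit-learn, with the test arrays assumed to be prepared as in the previous sketches:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(model, x_test, y_test_onehot):
    """Return overall accuracy and a row-normalized confusion matrix."""
    probs = model.predict(x_test)                 # softmax class probabilities
    y_pred = np.argmax(probs, axis=1)
    y_true = np.argmax(y_test_onehot, axis=1)
    acc = np.mean(y_true == y_pred)               # correct / total samples
    cm = confusion_matrix(y_true, y_pred)
    cm_pct = 100.0 * cm / cm.sum(axis=1, keepdims=True)  # percent per true class
    return acc, cm_pct
```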
Fig. 5: (a) Loss and (b) accuracy during the convergence of our model.
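Curves such as those in Fig. 5 come directly from the history object that Keras returns from fit(); a sketch, assuming model, x_train, and y_train from the earlier sketches (the metric key is "accuracy" in current tf.keras, while older Keras versions used "acc"):

```python
import matplotlib.pyplot as plt

history = model.fit(x_train, y_train, batch_size=80, epochs=30)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history["loss"])        # training loss per epoch
ax1.set_xlabel("epoch")
ax1.set_ylabel("loss")
ax2.plot(history.history["accuracy"])    # training accuracy per epoch
ax2.set_xlabel("epoch")
ax2.set_ylabel("accuracy")
plt.show()
```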
11. Teyeb, I., Jemai, O., Zaied, M., & Amar, C. B. (2014, September). A drowsy driver detection system based on a new method of head posture estimation. In International Conference on Intelligent Data Engineering and Automated Learning (pp. 362-369). Springer, Cham.
12. Teyeb, I., Jemai, O., Zaied, M., & Amar, C. B. (2014, July). A novel approach for drowsy driver detection using head posture estimation and eyes recognition system based on wavelet network. In Information, Intelligence, Systems and Applications, IISA 2014, The 5th International Conference on (pp. 379-384). IEEE.
13. Bergasa, L. M., Nuevo, J., Sotelo, M. A., Barea, R., & Lopez, M. E. (2006). Real-time system for monitoring driver vigilance. IEEE Transactions on Intelligent Transportation Systems, 7(1), 63-77.
14. Zhao, C. H., Zhang, B. L., He, J., & Lian, J. (2012). Recognition of driving postures by contourlet transform and random forests. IET Intelligent Transport Systems, 6(2), 161-168.
15. Jemai, O., Teyeb, I., & Bouchrika, T. (2013). A novel approach for drowsy driver detection using eyes recognition system based on wavelet network. International Journal of Recent Contributions from Engineering, Science & IT (iJES), 1(1), 46-52.
16. Lei, J., Han, Q., Chen, L., Lai, Z., Zeng, L., & Liu, X. (2017). A novel side face contour extraction algorithm for driving fatigue statue recognition. IEEE Access, 5, 5723-5730.
17. Cheng, S. Y., Park, S., & Trivedi, M. M. (2007). Multi-spectral and multi-perspective video arrays for driver body tracking and activity analysis. Computer Vision and Image Understanding, 106(2-3), 245-257.
18. Tran, C., Doshi, A., & Trivedi, M. M. (2012). Modeling and prediction of driver behavior by foot gesture analysis. Computer Vision and Image Understanding, 116(3), 435-445.
19. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
20. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
21. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
22. Hu, B., Lu, Z., Li, H., & Chen, Q. (2014). Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems (pp. 2042-2050).
23. Abdel-Hamid, O., Mohamed, A. R., Jiang, H., & Penn, G. (2012, March). Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (pp. 4277-4280). IEEE.
24. Abdel-Hamid, O., Mohamed, A. R., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10), 1533-1545.
25. Ngiam, J., Chen, Z., Bhaskar, S. A., Koh, P. W., & Ng, A. Y. (2011). Sparse filtering. In Advances in Neural Information Processing Systems (pp. 1125-1133).
26. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818-2826).
27. Chung, J., Ahn, S., & Bengio, Y. (2016). Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.
28. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
29. Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., & Garcia-Rodriguez, J. (2017). A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857.
30. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
31. Chollet, F., & others. (2015). Keras. https://keras.io
32. Eraqi, H. M., Abouelnaga, Y., Saad, M. H., & Moustafa, M. N. (2019). Driver distraction identification with an ensemble of convolutional neural networks. Journal of Advanced Transportation, Hindawi.
33. Resalat, S. N., & Saba, V. (2015). A practical method for driver sleepiness detection by processing the EEG signals stimulated with external flickering light. Signal, Image and Video Processing, 1751-1757.
34. Soon, F. C., Khaw, H. Y., Chuah, J. H., & Kanesan, J. (2019). Vehicle logo recognition using whitening transformation and deep learning. Signal, Image and Video Processing, 111-119.
35. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (pp. 249-256).