Big Data Deep Learning Framework Using Keras
Abstract—Big Data predictive analytics using machine learning techniques is currently a very active area of research in medical science. With the increasing size and complexity of medical data such as X-rays, deep learning has had great success in predicting many fatal diseases, including pneumonia. In this research work, a deep convolutional neural network (DCNN), an efficient predictive model for big data with deep layers, is proposed to classify whether or not a person has pneumonia. The experiments are carried out after extracting features from high-quality X-ray image data and achieve a prediction accuracy of 84% and an AUC of [...]. Promising results are found when the results of the DCNN framework are compared with those of regular classifiers such as SVM and random forest, using evaluation metrics such as accuracy and sensitivity. With the number of pneumonia cases increasing, careful implementation of deep learning can play a big part in improving the prediction of many fatal diseases in the future.

I. INTRODUCTION

In this research work, the proposed prediction model is implemented as a convolutional deep neural network (CNN) using the Python programming language. CNNs are also known as ConvNets in deep learning. After pre-processing of the data, different machine learning algorithms are trained in order to compare the performance of the CNN with that of popular and modern classifiers. Promising results are achieved when the results of the suggested framework are compared with those of regular classifiers such as SVM, random forest, and AdaBoost, using evaluation metrics such as accuracy, specificity, sensitivity, and area under the curve.
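Because the framework is described as a convolutional network implemented in Python with Keras, a minimal sketch of such a binary pneumonia classifier might look as follows. The paper does not specify the architecture here, so the function name `build_dcnn`, the input shape, the number of layers, the filter counts, the dropout rate, and the optimizer are all illustrative assumptions rather than the authors' configuration.

```python
from tensorflow import keras


def build_dcnn(input_shape=(128, 128, 1)):
    """Build a small CNN for binary X-ray classification.

    Every size below is an assumption made for illustration only.
    """
    model = keras.Sequential([
        keras.Input(shape=input_shape),                 # grayscale image
        keras.layers.Conv2D(32, 3, activation="relu"),  # first feature extractor
        keras.layers.MaxPooling2D(),
        keras.layers.Conv2D(64, 3, activation="relu"),  # deeper features
        keras.layers.MaxPooling2D(),
        keras.layers.Flatten(),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dropout(0.5),                      # regularization
        keras.layers.Dense(1, activation="sigmoid"),    # P(pneumonia)
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy", keras.metrics.AUC(name="auc")],
    )
    return model


model = build_dcnn()
```

The single sigmoid output unit gives a probability for the positive class, and accuracy and AUC are tracked because those are the two metrics the abstract reports.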
A person who has pneumonia should be classified as positive. If a person who does not have pneumonia is classified as positive, it is not a big issue, as the error can be rectified later.

B. Machine Learning Classifiers

i. Neural Network: The idea is based on the working of the human brain; the way neurons communicate in the brain is the concept applied here. There are different layers of neurons that activate other neurons, and in this way the network learns the right weights for prediction [8].
ii. Random Forest: It ensembles the results of different decision trees and takes their average, which improves accuracy and also avoids over-fitting [7].
iii. Support Vector Machine: Using vectors, this method finds a hyperplane that separates the classes in the dataset. The hyperplane acts like a wall between the different classes. The category of new, unseen data is determined by the side of the hyperplane on which it falls, and the corresponding results are also reported here. The dimension of the hyperplane depends on the number of features [9].
iv. AdaBoost: It is an ensemble-based method in which the output of one tree, after some adjustments, becomes the input of the next tree. Doing so improves accuracy and reduces over-fitting [10].
v. Logistic Regression: This is a classification method that learns the link between the dependent variable (label) and the independent variables (features) by considering probability [11, 15].
vi. Decision Tree: It is a graph-based machine learning classifier [16].

V. RESULTS AND DISCUSSION

This section discusses the evaluation metrics used to measure the performance of the various machine learning algorithms. The results are discussed in detail and are also presented graphically.

A. Performance Evaluation

The results and performance of the suggested framework are evaluated with the parameters shown in the confusion matrix in Table I. The evaluation metrics calculated from Table I are presented in Table II. Based on these metrics, DCNN performed better than the other models, as shown in Figure 4. Because the data were imbalanced, we cannot depend on accuracy alone; comparing the other metrics, DCNN still gives good results. Neural Network and Random Forest are also quite good and are strong competitors. Comparing TP rate and FP rate, DCNN maintains its position. Overall, DCNN gives efficient results on unseen data.

TABLE I. CONFUSION MATRIX

                                      True Reference
Predicted Condition      Condition Positive      Condition Negative
Pneumonia Positive       TP (A)                  FP (C)
Pneumonia Negative       FN (D)                  TN (B)

TABLE II. PERFORMANCE METRIC FORMULA

Sensitivity      A/(A + D)
Specificity      B/(B + C)
Accuracy         (A + B)/(A + B + C + D)
F Score          (2 * A)/((2 * A) + (C + D))
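The metrics in Table II can be computed directly from the four cells of Table I. The sketch below is a minimal illustration that uses the standard definitions of each metric, with A = true positives, B = true negatives, C = false positives, and D = false negatives as labeled in Table I. The function name and the sample counts are hypothetical and are not results from the paper.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute Table II style metrics from the four confusion-matrix cells."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),        # A / (A + D)
        "specificity": ratio(tn, tn + fp),        # B / (B + C)
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "f_score": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts chosen only for illustration.
metrics = confusion_metrics(tp=80, tn=60, fp=20, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced data, sensitivity and specificity are worth reading alongside accuracy, because accuracy alone can look high when one class dominates.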
TABLE III. COMPARISON OF DCNN PERFORMANCE WITH DIFFERENT STATE-OF-THE-ART METHODS USING MACHINE LEARNING PERFORMANCE METRICS

Logistic            77    23    0.90    0.26    0.57    0.50
DCNN                84    16    0.92    0.11    0.77    0.66
Neural Network      81    18    0.72    0.15    0.76    0.62
Naive Bayes         72    27    0.63    0.22    0.62    0.40

Fig. 6. K fold cross validation of AUC

Finally, K-fold cross-validation (with K = 10) is performed to test the robustness of the DCNN framework. The results of K-fold validation for accuracy and AUC are depicted graphically in Figure 5 and Figure 6, respectively.

As can be observed from the graphs, the values of accuracy and AUC are quite stable across all ten folds of cross-validation. Hence, promising results are achieved for the prediction of pneumonia by the proposed framework.
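The cross-validation procedure described above can be sketched in plain Python. This is a minimal illustration under several assumptions: the folds are stratified by class (the paper does not say whether they were), the "model" is a hypothetical one-feature midpoint threshold standing in for the DCNN, and the data are synthetic. AUC is computed as the fraction of positive/negative pairs that are ranked correctly.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping the class ratio per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)  # deal indices round-robin
    return folds


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_midpoint(xs, ys):
    """Stand-in model (assumption): midpoint between the two class means."""
    mean_pos = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    mean_neg = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    return (mean_pos + mean_neg) / 2


def cross_validate(xs, ys, k=10, seed=0):
    """Return one (accuracy, AUC) pair per fold."""
    folds = stratified_folds(ys, k, seed)
    results = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        threshold = fit_midpoint([xs[j] for j in train_idx],
                                 [ys[j] for j in train_idx])
        scores = [xs[j] - threshold for j in test_idx]
        truth = [ys[j] for j in test_idx]
        preds = [1 if s > 0 else 0 for s in scores]
        accuracy = sum(p == t for p, t in zip(preds, truth)) / len(truth)
        results.append((accuracy, auc(scores, truth)))
    return results


# Synthetic, imbalanced data for illustration only (not the X-ray dataset).
rng = random.Random(42)
ys = [1] * 150 + [0] * 50
xs = [rng.gauss(1.5 if y else 0.0, 1.0) for y in ys]
results = cross_validate(xs, ys, k=10)
for fold, (acc, fold_auc) in enumerate(results, start=1):
    print(f"fold {fold}: accuracy={acc:.2f} auc={fold_auc:.2f}")
```

Small variation in accuracy and AUC from fold to fold is the kind of stability the text refers to. Swapping in a real model only requires replacing `fit_midpoint` and the scoring line.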
REFERENCES
[1] P. Rajpurkar et al. "Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning." arXiv preprint arXiv:1711.0522, 2017.
[2] WHO. Pneumonia, 2016 [Online] Available: http://www.who.int/news-room/fact-sheets/detail/pneumonia [Accessed: May 24, 2018]
[3] X. Chen. "Big data deep learning: challenges and perspectives." IEEE Access, pp. 514-525, 2014.
[4] A. Gandomi et al. "Beyond the hype: Big data concepts, methods, and analytics." International Journal of Information Management, vol. 35(2), pp. 137-144, 2014.
[5] Chest XRay data, 2018 [Online] Available: http://dx.doi.org/10.17632/rscbjbr9sj.2#file-41d542e7-7f91-47f6-9ff2-dd8e5a5a7861 [Accessed: May 24, 2018]
[6] H. Nishtha et al. "B2FSE framework for high dimensional imbalanced data: A case study for drug toxicity prediction." Neurocomputing, vol. 276, pp. 31-41, 2018.
[7] Liaw, Andy, and Matthew Wiener. "Classification and regression by randomForest." R News, vol. 2(3), pp. 18-22, 2002.
[8] A. Rowley et al. "Neural network-based face detection." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20(1), pp. 23-38, 1998.
[9] M. Hearst et al. "Support vector machines." IEEE Intelligent Systems and their Applications, vol. 13(4), pp. 18-28, 1998.
[10] G. Rätsch, T. Onoda, and K.-R. Müller. "Soft margins for AdaBoost." Machine Learning, vol. 42(3), pp. 287-320, 2001.
[11] Hosmer Jr, David W., Stanley Lemeshow, and Rodney X. Sturdivant. Applied logistic regression. Vol. 398. John Wiley & Sons, 2013.
[12] P. Pedro et al. Community-acquired pneumonia: identification and evaluation of non responders. Therapeutic Advances in Infectious Disease, 1(1), pp. 5-17, 2013.
[13] M. Aydogdu et al. Mortality prediction in community-acquired pneumonia requiring mechanical ventilation; values of pneumonia and intensive care unit severity scores. Tuberk Toraks, vol. 58(1), pp. 25-34, 2010.
[14] D. Mollura et al. White paper report of the rad-aid conference on international radiology for developing countries: identifying challenges, opportunities, and strategies for imaging services in the developing world. Journal of the American College of Radiology, vol. 7(7), pp. 495-500, 2010.
[15] Press, S. James, and Sandra Wilson. "Choosing between logistic regression and discriminant analysis." Journal of the American Statistical Association 73.364 (1978): 699-705.
[16] Safavian, S. Rasoul, and David Landgrebe. "A survey of decision tree classifier methodology." IEEE Transactions on Systems, Man, and Cybernetics 21.3 (1991): 660-674.
[17] Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine. 2001 Aug 1;23(1):89-109.
[18] Magoulas GD, Prentza A. Machine learning in medical applications. In Advanced Course on Artificial Intelligence 1999 Jul 5 (pp. 300-307). Springer, Berlin, Heidelberg.
[19] Wernick MN, Yang Y, Brankov JG, Yourganov G, Strother SC. Machine learning in medical imaging. IEEE Signal Processing Magazine. 2010 Jul;27(4):25-38.
[20] Dietterich, Thomas G. "Ensemble methods in machine learning." International Workshop on Multiple Classifier Systems. Springer, Berlin, Heidelberg, 2000.
[21] Greenspan H, Van Ginneken B, Summers RM. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging. 2016 May;35(5):1153-9.
[22] Choi Y, Liu TT, Pankratz DG, Colby TV, Barth NM, Lynch DA, Walsh PS, Raghu G, Kennedy GC, Huang J. Identification of usual interstitial pneumonia pattern using RNA-Seq and machine learning: challenges and solutions. BMC Genomics. 2018 May;19(2):101.