Boosted Ensemble
PII: S1568-4946(19)30223-6
DOI: https://doi.org/10.1016/j.asoc.2019.04.031
Reference: ASOC 5461
Please cite this article as: J.A. ALzubi, B. Bharathikannan, S. Tanwar et al., Boosted neural
network ensemble classification for lung cancer disease diagnosis, Applied Soft Computing Journal
(2019), https://doi.org/10.1016/j.asoc.2019.04.031
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to
our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form.
Please note that during the production process errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.
Abstract
Accurate diagnosis of Lung Cancer Disease (LCD) is essential for providing timely treatment to
lung cancer patients. Artificial Neural Networks (ANN) are a Machine Learning (ML) technique
used on both large-scale and small-size datasets. In this paper, an ensemble of Weight Optimized
Neural Network with Maximum Likelihood Boosting (WONN-MLB) for LCD in big data is
analyzed. The proposed method is split into two stages: feature selection and ensemble
classification. In the first stage, the essential attributes are selected with an integrated Newton
Raphson's Maximum Likelihood and Minimum Redundancy (MLMR) preprocessing model to
minimize the classification time. In the second stage, the Boosted Weighted Optimized Neural
Network Ensemble Classification algorithm is applied to classify patients using the selected
attributes, which improves the cancer disease diagnosis accuracy and minimizes the false positive
rate. Experimental results demonstrate that the proposed approach achieves a lower false positive
rate, higher prediction accuracy, and reduced delay in comparison to conventional techniques.
Keywords: Machine Learning, Lung Cancer Disease, Weighted Optimized, Neural Network,
Maximum Likelihood Boosting
1. Introduction
In the present era, lung cancer is one of the foremost causes of death in developing
countries, and its incidence is increasing rapidly with the dramatic upsurge in cigarette smoking.
According to the survey of big data research in Non-Small Cell Lung Cancer (NSCLC) [1], deep
learning can be used to improve diagnostic accuracy by supporting prediction and decision
making in the medical system. Artificial Intelligence techniques were also used to solve
prediction and decision problems for big data in NSCLC, and the image and diagnostic
parameters were integrated via a machine learning algorithm. Combining image and diagnostic
parameters was therefore an efficient way for doctors to diagnose patients in a large data (i.e.,
Healthcare 4.0) environment. However, the time consumed for diagnosis with big data was not
addressed.
A Boosted Support Vector Machine (SVM) method for imbalanced data (BSI) was proposed by
Zięba et al. [2] to solve the issues related to imbalanced data. It combines the advantages of
ensemble classifiers for uneven data with cost-sensitive support vector machines. Three steps
were carried out on the input dataset. In the first step, the information gain criterion was used to
select the effective and required features. In the second step, following feature selection, the
problem of predicting postoperative life expectancy was analyzed according to the Gmean
criterion, and rules were extracted. In the third step, the accuracy and coverage measures were
evaluated for the extracted rules, which resulted in improved predictive accuracy. However,
although prediction accuracy was achieved, less focus was placed on the error aspect.
Considering the aforementioned issues, this paper proposes a new combination approach for
classifier ensembles using the Newton Raphson's MLMR preprocessing model, in which the
essential features are extracted to reduce the time for lung cancer disease diagnosis. The Newton
Raphson's Maximum Likelihood model is applied to the Maximum Relevance Minimum
Redundancy (MRMR) attributes, and the first and second derivative results of the maximum
relevance minimum redundant attributes are used to select the most relevant attributes. To achieve
this, we explore the features of the MRMR model and Newton Raphson's Maximum Likelihood in
the combination process of an ensemble. Then, the Boosted Weighted Optimized Neural Network
Ensemble Classification algorithm is proposed to minimize the error (i.e., false positive error
rate) and improve diagnosis accuracy. Optimized weights related to the decision of each
ensemble classifier are defined dynamically, according to the ensemble classifier outputs and the
relation among the outputs of all ensemble classifiers. To evaluate the feasibility of the proposed
approach, an empirical analysis of ensemble performance is conducted using the Thoracic Surgery
Dataset, comparing its performance with ensemble classifiers using traditional methods.
1.1 Motivation
Healthcare is one of the essential sources of big data. Accurate analysis of healthcare data
is in high demand for diagnosing disease at an early stage. Recently, many research works have
been designed for identifying disease in big data with higher quality. However, there is still a
need for a novel classification technique that increases the diagnosis accuracy while reducing
time. Moreover, although ML algorithms are designed to increase the prediction accuracy in big
data, the error rate has not yet been reduced to its full potential. This motivates the use of
optimized machine learning algorithms to improve the diagnosis accuracy with lower time and
error.
1.2 Research Contributions
Contributions of this paper are as follows.
To increase lung cancer diagnosis accuracy for big data as compared to state-of-the-art
works, the WONN-MLB method uses a Weight Optimized Neural Network with Maximum
Likelihood Boosting for Lung Cancer Disease.
To minimize the classification time for early lung cancer disease diagnosis, the integrated
Newton Raphson's MLMR preprocessing model is used to select the relevant attributes and
obtain higher diagnosis accuracy.
To reduce the error (i.e., false positive rate) and improve the disease diagnosis accuracy
with higher classification efficiency and lower classification time, the Boosted Weighted
Optimized Neural Network Ensemble Classification algorithm is designed in the WONN-MLB
method.
1.3 Organization
The rest of the paper is organized as follows. Section 2 reviews the related works on
lung cancer disease diagnosis. In Section 3, the ensemble classification method along with a
preprocessing model for lung cancer disease diagnosis is investigated, the maximum relevance
method along with the maximum likelihood function is explored in detail, and the effects of the
extracted relevant attributes on the ensemble classification performance of the WONN-MLB
method are also studied. In Section 4, the performance of the proposed approach is compared
with the state-of-the-art approaches to demonstrate its effectiveness for lung cancer disease
diagnosis. Section 5 concludes the paper.
2. Related works
With the invention of the microarray technique, scientists and researchers have an immense
opportunity to evaluate the expression levels of thousands of genes concurrently in a single
experiment. In Ghorai et al. [4], the Nonparallel Plane Proximal Classifier (NPPC) was proposed
for cancer classification in a Computer Aided Diagnosis (CAD) framework to ensure high
classification accuracy and to minimize the computation time. However, valvular heart disorders
were considered to be one of the most difficult classification problems. Sengur et al. [5] used
three powerful and popular ensemble learning methods, namely bagging, boosting, and random
subspaces, for early detection of valvular heart disorders. Although the classification time was
minimized using these methods, the accuracy attained remained unaddressed. In Costaa et al.
[6], three Generalized Mixture (GM) functions were applied via dynamic weights to improve the
classification accuracy of the classification system. Though the functions handle single-label
classification, the multi-label classification problem was not addressed.
A case study for brain tumor diagnosis using global optimization based hybrid wrapper-filter
feature selection with ensemble classification methods was proposed by Huda et al. [7]. It
increases the classification accuracy, but the classification time was not minimized.
Approximately 40% of the world’s population is affected by cancer. A Proportion SVM was
used by Hussein et al. [8] for efficient categorization of lung nodules, which results in
improved diagnosing accuracy. However, the Proportion SVM failed to minimize the error rate in
disease categorization. Another method for early detection of lung cancer was proposed by
Adetiba et al. [9] using a Radial Basis Function Neural Network with Affine Transforms, which
achieved high classification accuracy and low mean square error. But, the performance of feature
extraction was not improved. A review of feature selection and parallel classification systems
was carried out by Jain et al. [10] to enhance the classification accuracy for disease prediction,
but classification time was not minimized.
A critical assessment of ANN was carried out in Dande et al. [11], which results in an
increase in the efficacy and specificity of the diagnostic techniques, but it fails to minimize the
computational complexity. Pathological evaluation of tumor tissue is considered to be one of the
most pivotal steps for early diagnosis in cancer patients, and automated image analysis methods
have the potential to improve the accuracy of disease diagnosis and to minimize human errors.
Khosravia et al. [12] proposed different computational methods using convolutional neural
networks (CNN), where a stand-alone pipeline was constructed to classify several histopathology
images across different types of cancer. But, it fails to minimize the computation cost while
classifying the various types of cancer.
Sharma et al. [13] proposed a two-stage hybrid ensemble classification technique to
increase the prediction accuracy of chronic kidney disease with an ML technique. It improves the
disease diagnosis, but the multistage classification was not performed with minimum time. Early
diagnosis of lung cancers and differentiation between the tumor types and non-tumor types are
required to improve the patient survival rate. In Hosseinzadeh et al. [14], a diagnostic system
using structural and physicochemical attributes of proteins was designed via feature extraction,
feature selection, and prediction models. The ML models were then applied to both the original
and the newly created database to predict the lung cancer type of tumors, which results in
improved accuracy. Although the model reduces the processing time, the false positive rate was
not minimized. An evaluation of ML algorithms for lung cancer diagnosis was carried out by
Podolsky et al. [15]. It accurately predicts cancer vulnerability and minimizes the false positive
rate. But, the classification time, which can be helpful for early lung cancer detection, was not
addressed.
A narrative review based on radiomic features to help diagnose lung cancer at an early
stage was presented by Rabbani et al. [16], where the ML algorithms were combined with
artificial intelligence approaches. The objective of radiomics is to extract and analyze several
quantitative features from medical images. Moreover, they focused on its high promise in
staging, diagnosing, and predicting outcomes of cancer treatments. However, although machine
learning algorithms were used, the feature extraction did not provide accurate results. Zhou et al.
[17] proposed multi-modality and multi-classifier radiomics predictive models to address the
aforementioned issues using a new reliable classifier fusion strategy. Here, the modality-specific
classifiers were trained first, followed by an analytic evidential reasoning (ER) rule, which was
used to combine the output score from each modality to build an optimal predictive model for
disease diagnosis. This model failed to minimize the disease prediction time.
A systematic review of mortality and survival rates of lung cancer with evolutionary
algorithms was conducted by Dubey et al. [18] to identify a better method for early lung cancer
diagnosis and to achieve a higher accuracy rate with deep learning techniques. It does not
minimize the error rate. Liu et al. [19] proposed MultiView Convolutional Neural Networks
(MV-CNN) for efficient lung nodule classification, to improve the accuracy and the classification
time. Here, accurate detection was not performed with the features. Baz et al. [20] explored some
crucial challenges and methodologies of CAD systems for lung cancer. It improves the detection
and diagnosis of lung nodules, but accurate feature selection was not performed to minimize the
detection time.
Deep feature fusion with hand-crafted features for lung nodule classification was
developed by Wang et al. [21]. But, the classification performance was not accurate. CAD was
introduced for enhancing the performance of nodule candidate classification by Chen et al. [22].
However, the classification time was not minimized. In order to effectively classify lung nodules,
deep features were extracted from CT images with higher accuracy by Kumar et al. [23]. But, the
error rate remained unaddressed. An image-based feature selection method was developed by
Baranidharan et al. [24] for classifying lung cancer images with higher accuracy. In this method,
a novel fusion-based selection was used to select the features for classification. During feature
selection, the redundant features could not be removed, which introduced an error in the
classification process. To overcome this problem, the proposed WONN-MLB method uses the
Newton Raphson’s Maximum Likelihood model, where MLMR is used to choose the most
relevant attributes. Then, the boosting classifier is applied to classify the attributes for LCD
diagnosis, which reduces the error rate in the classification process.
Data analysis of population statistics and data mining techniques were used in [25] to
determine the cancer morbidity and mortality data in a regional cancer registry. However, the
false positive rate was not minimized. Multiple aspects of large-scale knowledge mining were
covered in [26] for medical and disease examination. A new image-based feature selection
method was presented in [27] to categorize lung computed tomography images with higher
accuracy. But, the feature selection rate was not improved.
Table 1 presents a comparison of the proposed approach with state-of-the-art approaches.
The main aim of this paper is to design an LCD diagnosis method using an ensemble
classification algorithm, with the objective of reducing the classification time and false positive
rate as compared to the state-of-the-art approaches.
Table 1: Comparison of the proposed approach with state-of-the-art approaches

| Reference | Year | Technique | Objective | Advantages | Limitations |
|---|---|---|---|---|---|
| Das et al. [5] | 2010 | Ensemble learning methods | To classify the valvular heart disease | Minimize the classification time | Classification accuracy rate remained unsolved |
| Ghorai et al. [4] | 2011 | Nonparallel Plane Proximal Classifier (NPPC) | Perform cancer classification with higher accuracy in a Computer Aided Diagnosis | Provides better classification accuracy with lesser computation time | Valvular heart disorders classification was difficult |
| Baz et al. [20] | 2012 | Computer-aided diagnosis (CAD) system | Lung cancer diagnosis | Achieve better detection and diagnosis of lung nodules | Accurate feature selection was not performed |
| Hosseinzadeh et al. [14] | 2013 | Machine learning models | Predict and detect the type of lung tumors | Provide more accurate results in lung tumor detection | The false positive rate was not minimized |
| Adetiba et al. [9] | 2015 | Radial Basis Function Neural Network with Affine Transforms | Classifies the lung cancer | Improve classification accuracy and achieves low mean square error | Performance of feature extraction was not improved |
| Kumar et al. [18] | 2016 | Evolutionary algorithms | Lung cancer detection | Detect the lung cancer accurately with minimum time | The error rate was not minimized |
| Huda et al. [7] | 2016 | Global optimization based hybrid wrapper-filter feature selection with ensemble classification | Tumor classification with the imbalanced healthcare data | Increases the imbalanced healthcare data classification | Classification time was not minimized |
| Podolsky et al. [15] | 2016 | Machine learning algorithm | Lung cancer diagnosis | Increase the accuracy of predicting cancer susceptibility and minimize false positive | Classification time remained unsolved |
| Zhou et al. [17] | 2017 | Multi-modality and multi-classifier radiomics predictive models | Extract numbers of quantitative features and disease prediction | Increase the disease prediction accuracy with the features | Failed to minimize the disease prediction time |
| Kang et al. [19] | 2017 | Multi-view convolutional neural networks (MV-CNN) | Lung nodule classification | Increases the classification accuracy and minimizes the time | Failed to attain accurate disease prediction with features |
| Dande et al. [11] | 2017 | Artificial Neural Network | Diagnosis and evaluation of medical conditions | Increase the efficacy and specificity of disease diagnosis | Failed to minimize the computational complexity |
| Costaa et al. [6] | 2018 | Generalized mixture (GM) functions | Increase the classification accuracy of a classification system | Handles single-label classification problems | Multi-label classification problem remained unaddressed |
| Hussein et al. [8] | 2018 | Proportion-Support Vector Machine (SVM) | Categorizes the lung nodules | Improve the diagnosing accuracy | Failed to minimize the error rate |
| Jain et al. [10] | 2018 | Feature selection and parallel classification systems | Classification systems for effective disease prediction | Enhancing the accuracy of classification systems | The classification time was not minimized |
| Khosravia et al. [12] | 2018 | Deep convolutional neural networks (CNN) | Classifying the various cancer tissues | Increase the precision of diagnosis and minimizes the error | Computation cost was not minimized |
| Sharma et al. [13] | 2018 | Two-stage hybrid ensemble technique | Classifying the chronic kidney disease | Accurate diagnosis of the disease with a feature set | The multi-stage diagnosis was not performed with minimum time |
| Rabbani et al. [16] | 2018 | Machine learning (ML) method combining artificial intelligence approaches | Extracting and analyzing several quantitative features from medical images | Improves diagnosis, treatment and outcomes | The ML algorithms used for feature extraction did not attain accurate results |
| Baranidharan et al. [24] | 2016 | Image-based features selection method | Classify the lung cancer images | Increase the true positive rate | Error rate was not effectively minimized |
| Proposed | - | Weight Optimized Neural Network with Maximum Likelihood Boosting (WONN-MLB) technique | Lung Cancer Disease diagnosis with big data | Increase diagnosing accuracy and minimizes the false positive rate, classification time | - |
This section describes the proposed approach and the architecture of the WONN-MLB method
for LCD, as shown in Fig. 1. The phases of the proposed approach are data acquisition (Thoracic
Surgery Dataset, Zięba et al. [2]), feature selection or preprocessing (reducing big data feature
dimensionality), and ensemble classification (using WONN-MLB). They are comprehensively
discussed in the next subsections.
[Fig. 1: Architecture of the WONN-MLB method. Blocks: data acquisition (Thoracic Surgery
Dataset), preprocessing (Maximum Likelihood Minimum Redundancy), optimal attributes,
ensemble classification.]
[Figure: block diagram of the preprocessing model. Blocks: Thoracic Surgery Dataset, instance,
mutual information, marginal probability, joint probability, maximum relevance, minimum
redundant, joint function, Newton Raphson's Maximum Likelihood function.]
(1)
However, when big data is considered for lung cancer analysis, both maximum relevance and
minimum redundancy are measured. Maximum relevance consists of searching for attributes with
a higher relevancy factor and is formulated as follows:
(2)
(3)
From Eq. 3, the minimum redundancy is obtained between pairs of attributes. From Eqs. 2
and 3, integrating and optimizing both the maximum relevancy and the minimum redundancy
results in the maximum relevance minimum redundancy (MRMR) measure, which is calculated
as follows:
(4)
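As an illustration of Eqs. 1 to 4, the selection can be sketched with the commonly used difference form of maximum relevance minimum redundancy (relevance minus mean redundancy, both measured by mutual information). This is a minimal sketch under assumptions: the exact form of Eqs. 1 to 4 is not recoverable from the text, and the function names `mutual_information` and `mrmr_select` as well as the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n                      # joint probability
        p_marginals = (px[x] / n) * (py[y] / n)  # product of marginal probabilities
        mi += p_joint * math.log(p_joint / p_marginals)
    return mi

def mrmr_select(features, target, k):
    """Greedy selection: maximize relevance to the target minus mean redundancy
    with the attributes already selected (assumed difference form of MRMR)."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name, col in features.items():
            if name in selected:
                continue
            if selected:
                redundancy = sum(mutual_information(col, features[s])
                                 for s in selected) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected

# Hypothetical toy data: 'b' duplicates 'a'; the label is (a AND c).
features = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
target = [0, 0, 0, 1, 0, 0, 0, 1]
selected = mrmr_select(features, target, 2)
print(selected)
```

In this toy case all three attributes are equally relevant, but the duplicate attribute is penalized for its redundancy with the first pick, so the second pick is the independent attribute.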
Once the maximum relevance minimum redundancy attributes are obtained for lung cancer
disease diagnosis, a Newton Raphson’s Maximum Likelihood function is applied to the resulting
attributes in this work, with the objective of minimizing the time consumed. The log-likelihood
function for Eq. 4 is formulated as follows:
(5)
In the log-likelihood function the first derivative and second derivative are formulated as
follows:
(6)
(7)
The log-likelihood function is maximized over the maximum relevance and minimum
redundant attributes. From these, the most relevant attributes are taken for the classification
process, which effectively reduces the time required for lung cancer disease diagnosis. The
pseudo-code of the proposed Maximum Likelihood Minimum Redundant preprocessing is given
in Algorithm 1.
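Input: dataset, attributes
Output: Maximum Likelihood Minimum Redundant attributes selected
1: Begin
2: For each attribute in the dataset
3: Find the mutual information (Eq. 1)
4: Determine the maximum relevance attribute (Eq. 2)
5: Minimize the redundant attribute (Eq. 3)
6: Combine maximum relevance and minimum redundancy (Eq. 4)
7: Formulate the log-likelihood function (Eq. 5)
8: Obtain the first and second derivatives (Eqs. 6 and 7)
9: End for
10: End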
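The role of the first and second derivatives in a Newton-Raphson maximum likelihood step can be sketched on a simple case. The example below is an assumption-laden illustration, not the log-likelihood of Eq. 5 (whose form is not recoverable here): it maximizes a Bernoulli log-likelihood in its logit parameter, and the function name `newton_raphson_mle` is hypothetical.

```python
import math

def newton_raphson_mle(successes, trials, theta=0.0, tol=1e-10, max_iter=50):
    """Maximize l(theta) = s*theta - n*log(1 + exp(theta)) by Newton-Raphson.

    Each update is theta <- theta - l'(theta) / l''(theta).
    """
    for _ in range(max_iter):
        p = 1.0 / (1.0 + math.exp(-theta))
        first_derivative = successes - trials * p
        second_derivative = -trials * p * (1.0 - p)  # negative, so l is concave
        step = first_derivative / second_derivative
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Hypothetical data: 7 positive outcomes in 10 trials.
theta_hat = newton_raphson_mle(successes=7, trials=10)
p_hat = 1.0 / (1.0 + math.exp(-theta_hat))
print(round(p_hat, 4))
```

The iteration stops where the first derivative vanishes; the second derivative sets the step size, which is why both are needed.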
Once the Maximum Likelihood Minimum Redundant attributes are obtained, an ensemble
classification model is used to improve the lung cancer diagnosis accuracy for big data. In this
work, the WONN-MLB ensemble is applied to these attributes to achieve accurate lung cancer
diagnosis with minimum time and error.
(8)
Then, a nonlinear activation function is applied to the weighted sum, as given in Eq. 9, which
results in an output:
(9)
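The weighted sum followed by a nonlinear activation can be sketched as below. The sigmoid is an assumption, since the activation function of Eq. 9 is not named in the text; `neuron_output` and the input values are hypothetical.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a sigmoid activation (assumed)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Hypothetical attribute values and weights.
output = neuron_output(inputs=[1.0, 0.5], weights=[0.8, -0.4], bias=0.1)
print(output)
```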
Then, a weak classifier with low weighted error is selected and is formulated as follows:
(10)
(11)
From Eqs. 10 and 11, the low weighted error is obtained based on the probability
distribution function for a linear combination of weighted inputs (i.e., attributes). Finally, a new
component based on the error function is calculated as follows:
(12)
Upon completion of all boosting iterations, the final ensemble learning classifier, which
possesses a weighted error that is better than chance, is obtained by combining all weak
classifiers with an optimal weight (Mana et al. [3]). This is formulated as follows:
(13)
From Eq. 13, the final ensemble learning classifier is measured as a weighted majority vote of
the weak classifiers, where each classifier is assigned a weight. The pseudo-code of ensemble
classification is given in Algorithm 2.
Input: Maximum Likelihood Minimum Redundant attributes ‘ ’, , , iteration ,
Optimal weight ‘ ’
Output: Improved lung cancer diagnosis accuracy
1: Procedure
2: Initialize
3: For each and iteration
4: Measure
5: If ‘ ’ then
6: Compute
7: Obtain
8: Obtain
9: Else
10: Go to step 4
11: End if
12: End for
13: End
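The boosting loop described above (select the weak classifier with the lowest weighted error, weight it, re-weight the samples, and combine by weighted majority vote) can be sketched with standard AdaBoost. This is an illustrative sketch under stated assumptions: decision stumps stand in for the weight optimized neural networks, the update rules are those of textbook AdaBoost rather than the lost Eqs. 10 to 13, and all names and data are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] >= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] >= thr else -polarity

def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n                      # uniform initial sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-12)         # guard against division by zero
        if err >= 0.5:                     # weak classifier must beat chance
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted majority vote of the weak classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D data that no single stump can separate.
X = [[float(v)] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(X, y, rounds=3)
predictions = [predict(model, row) for row in X]
print(predictions)
```

No single stump classifies this toy set correctly, yet the weighted vote of three stumps does, which is the effect the ensemble relies on.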
Diagnosing accuracy is considered one of the important parameters for early disease
diagnosis. The higher the diagnosing accuracy, the earlier the disease is diagnosed, and therefore
the more efficient the method is said to be. It provides evidence of how well a method recognizes
the disease and informs upcoming decisions about treatment for physicians or patients. It is given
as follows:
(14)
From Eq. 14, the diagnosing accuracy is obtained as the ratio of the number of data
correctly diagnosed as disease to the total samples considered for experimentation. It is measured
in percentage. The values obtained through Eq. 14 are shown in Fig. 3 for different patient data
using the proposed WONN-MLB approach, compared with the NSCLC and BSVM approaches.
The sample calculation to measure the diagnosing accuracy for each method is given as follows:
Sample calculation:
Proposed WONN-MLB: With ‘ ’ patient data considered for experimentation and
number of data correctly diagnosed as disease being ‘ ’, the diagnosing accuracy is
calculated as follows:
NSCLC: With ‘ ’ patient data considered for experimentation and number of data
correctly diagnosed as disease being ‘ ’, the diagnosing accuracy is calculated as
follows:
BSVM: With ‘ ’ patient data considered for experimentation and number of data
correctly diagnosed as disease being ‘ ’, the diagnosing accuracy is calculated as
follows:
NPPC: With ‘ ’ patient data considered for experimentation and number of data
correctly diagnosed as disease being ‘ ’, the diagnosing accuracy is calculated as
follows:
MV-CNN: With ‘ ’ patient data considered for experimentation and number of data
correctly diagnosed as disease being ‘ ’, the diagnosing accuracy is calculated as
follows:
Fig. 3 shows the diagnosing accuracy comparison between the proposed approach and the
existing NSCLC and BSVM. The diagnosing accuracy of lung cancer is improved using
WONN-MLB because the weak classifier with low weighted error is measured and a new
component based on the error function is computed through ensemble classification. The results
show that the diagnosing accuracy increases for a small number of patient data and then reduces
as the number of patient data increases. This happens because, with an increase in the number of
patient data, many irrelevant attributes are also present. Moreover, the preprocessing performed
in the WONN-MLB method introduces a certain error, so a certain amount of irrelevant attributes
remains even after preprocessing. Nevertheless, the comparison with the existing methods
NSCLC, BSVM, NPPC and MV-CNN shows an improvement when using the WONN-MLB
method. This is because the ensemble classification not only minimizes the error by updating the
weak classifier, but also minimizes the time by boosting the updated results. This in turn
improves the diagnosing accuracy using the WONN-MLB method by 7%, 11%, 19% and 28% as
compared to NSCLC, BSVM, NPPC and MV-CNN, respectively.
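The accuracy measure described for Eq. 14 (correctly diagnosed data over total samples, in percent) can be sketched as follows. The counts used here are hypothetical placeholders, not the values of the paper's sample calculations.

```python
def diagnosing_accuracy(correctly_diagnosed, total_samples):
    """Percentage of samples correctly diagnosed, as described for Eq. 14."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return 100.0 * correctly_diagnosed / total_samples

# Hypothetical counts for illustration only.
accuracy = diagnosing_accuracy(correctly_diagnosed=850, total_samples=1000)
print(accuracy)
```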
4.2 Scenario 2: Impact of false positive rate
The second important parameter used to measure the early diagnosis of lung cancer is
the false positive rate, or error. When conducting multiple comparisons in a statistical
framework, the false positive rate refers to the probability of falsely rejecting the null hypothesis
for a specific test. In other words, the false positive rate is measured as the ratio between the
number of negative events (i.e., not diagnosed with lung cancer) wrongly categorized as positive
(i.e., diagnosed with lung cancer) and the total number of actual negative events (i.e., not
diagnosed with lung cancer). It is formulated as follows:
(15)
From Eq. 15, the false positive rate refers to the ratio of the number of patient data
incorrectly diagnosed as disease to the total samples considered for experimentation. It is
measured in terms of percentage (%). The values obtained through Eq. 15 are shown in Fig. 4 for
different patient data using the proposed WONN-MLB approach, compared with NSCLC and
BSVM. The sample calculation for measuring the false positive rate is given as follows:
Sample calculation:
Proposed WONN-MLB: With ‘ ’ number of patient data considered as samples and
‘ ’ number of patient data incorrectly diagnosed with lung cancer disease, the false
positive rate is as given as follows:
From Eq. 15, the false positive rate is measured for different numbers of patient data in
the range of 1000 to 10000. The results of the experimental evaluations conducted to measure the
false positive rate are shown in Table 1. The false positive rate obtained using the proposed
WONN-MLB approach is lower than that of the state-of-the-art methods.
Fig. 4 shows the performance analysis of the false positive rate for disease diagnosis with
big data. As illustrated in Fig. 4, when 1000 patient data are considered as samples, 90 patient
data were incorrectly diagnosed with lung cancer using WONN-MLB, 120 using NSCLC, 140
using BSVM, 160 using NPPC, and 170 using MV-CNN. For this sample size, the false positive
rate using WONN-MLB is lower by 25%, 36%, 44% and 47% as compared to NSCLC, BSVM,
NPPC and MV-CNN, respectively. This result is achieved with the Newton Raphson's MLMR
preprocessing model. The advantage of applying the MLMR preprocessing model is that instead
of using all the attributes in the dataset, only the maximum likelihood and relevancy attributes
are considered for disease diagnosis. With the application of the log-likelihood function, the
attribute availability also changes, and this is reflected in the maximum relevance minimum
redundancy coefficient. This adaptive change made through the maximum relevance minimum
redundancy coefficient in turn minimizes the incorrect lung cancer diagnosis using the
WONN-MLB method. The resultant attributes are then used to classify the patients as lung
cancer or normal patients, which in turn minimizes the false positive rate by 39%, 53%, 58% and
61% as compared to NSCLC, BSVM, NPPC and MV-CNN, respectively.
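The figures quoted for 1000 patient data can be checked with a short calculation. The sketch follows the description of Eq. 15, which divides by the total samples (the conventional definition would divide by the number of actual negatives instead); the helper names are hypothetical.

```python
def false_positive_rate(false_positives, total_samples):
    """False positive rate in percent, as described for Eq. 15."""
    return 100.0 * false_positives / total_samples

def relative_reduction(baseline, proposed):
    """Percentage by which the proposed value is lower than the baseline."""
    return 100.0 * (baseline - proposed) / baseline

# Incorrectly diagnosed counts quoted in the text for 1000 patient data.
counts = {"WONN-MLB": 90, "NSCLC": 120, "BSVM": 140, "NPPC": 160, "MV-CNN": 170}
rates = {method: false_positive_rate(c, 1000) for method, c in counts.items()}
reductions = {
    method: round(relative_reduction(rate, rates["WONN-MLB"]))
    for method, rate in rates.items()
    if method != "WONN-MLB"
}
print(rates)
print(reductions)
```

The rounded reductions reproduce the 25%, 36%, 44% and 47% quoted for this sample size.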
(16)
From Eq. 16, the classification time is calculated from the number of samples and the time
consumed to perform ensemble classification. The lower the classification time, the earlier the
lung cancer diagnosis is said to be. It is measured in terms of milliseconds (ms). The values
obtained through Eq. 16 are shown in Fig. 5 for the proposed WONN-MLB approach and the
existing NSCLC and BSVM. The sample calculation for classification time is given as follows:
Sample calculations:
Proposed WONN-MLB: With the time taken for classification of single patient data
being ‘ ’, with ‘ ’ number of patient data considered as samples, the
classification time is calculated as follows:
NSCLC: With the time taken for classification of single patient data being ‘ ’,
with ‘ ’ number of patient data considered as samples, the classification time is given
as follows:
BSVM: With the time taken for classification of single patient data being ‘ ’,
with ‘ ’ number of patient data considered as samples, the classification time is given
as follows:
NPPC: With the time taken for classification of single patient data being ‘ ’,
with ‘ ’ number of patient data considered as samples, the classification time is given
as follows:
MV-CNN: With the time taken for classification of single patient data being
‘ ’, with ‘ ’ number of patient data considered as samples, the
classification time is given as follows:
Fig. 5 shows the classification time needed to classify patient data as diseased or not. The
proposed approach is implemented in the Java language using various numbers of patient data in
the range of 1000 to 10000. The experimental classification time of the proposed method is
compared with that of the existing methods. When considering 1000 patient data for the
experimental work, the proposed method consumed 8.5 ms to classify, whereas the existing
NSCLC, BSVM, NPPC and MV-CNN consumed 8.9 ms, 9.3 ms, 11.5 ms and 13.2 ms,
respectively. Thus, it is clear that the classification time using the proposed approach is less than
that of the other existing methods [1], [2]. However, with an increase in the number of patient
data and in the number and size of the attributes, the classification time also increases for all the
methods. Comparative analysis shows that the classification time using the proposed approach is
less than that of the [1], [2], [4] and [19] methods. This is because of the application of the
Newton Raphson’s Maximum Likelihood model in addition to the maximum relevance minimum
redundancy factor, which applies the first derivative and the second derivative to extract the most
relevant attributes. With these most relevant attributes extracted, the classification time is
reduced using the proposed approach by 34%, 51%, 56%, and 59% as compared to NSCLC by
Wu et al. [1], BSVM by Zięba et al. [2], NPPC by Ghorai et al. [4], and MV-CNN by Liu et al.
[19], respectively.
\[ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (17) \]
From Eq. 17, the F1-score is measured as the harmonic mean of the precision and recall
values. A higher F1-score indicates better diagnostic performance. The sample calculation for
F1-score using the existing methods is given as follows:
Sample calculations:
NSCLC: With a precision value of 89 and a recall value of 87 identified on the patient data
considered for experimentation, the F1-score is calculated as follows:
\[ F1 = \frac{2 \times 89 \times 87}{89 + 87} \approx 87.99 \]
BSVM: With a precision value of 86 and a recall value of 85 identified on the patient data
considered for experimentation, the F1-score is calculated as follows:
\[ F1 = \frac{2 \times 86 \times 85}{86 + 85} \approx 85.50 \]
NPPC: With a precision value of 80 and a recall value of 82 identified on the patient data
considered for experimentation, the F1-score is calculated as follows:
\[ F1 = \frac{2 \times 80 \times 82}{80 + 82} \approx 80.99 \]
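The F1-score sample calculations can be reproduced with a short sketch, assuming the standard definition of the F1-score as the harmonic mean of precision and recall. The function name `f1_score` and the dictionary layout are illustrative; the precision and recall values are the ones stated in the sample calculations.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both values are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall values (in %) stated in the sample calculations.
stated = {"NSCLC": (89, 87), "BSVM": (86, 85), "NPPC": (80, 82)}

for method, (precision, recall) in stated.items():
    print(f"{method}: F1-score = {f1_score(precision, recall):.2f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a method cannot obtain a high F1-score by improving precision alone at the expense of recall.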
In (18), ‘n’ denotes the number of patient data for which the space complexity is computed.
The sample calculation for space complexity using the existing methods is given as follows:
Sample calculations:
NSCLC: With 1000 patient data considered for experimentation and 0.012 MB of space required
to store one patient data, the space complexity is calculated as follows:
\[ 1000 \times 0.012\,\text{MB} = 12\,\text{MB} \]
BSVM: With 1000 patient data considered for experimentation and 0.015 MB of space required
to store one patient data, the space complexity is calculated as follows:
\[ 1000 \times 0.015\,\text{MB} = 15\,\text{MB} \]
NPPC: With 1000 patient data considered for experimentation and 0.018 MB of space required
to store one patient data, the space complexity is calculated as follows:
\[ 1000 \times 0.018\,\text{MB} = 18\,\text{MB} \]
MV-CNN: With 1000 patient data considered for experimentation and 0.021 MB of space required
to store one patient data, the space complexity is calculated as follows:
\[ 1000 \times 0.021\,\text{MB} = 21\,\text{MB} \]
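The space complexity calculation can be sketched as follows, assuming that (18) multiplies the number of patient data by the space needed to store one patient data. The function name `space_complexity_mb` is illustrative, and the sample size of 1000 is an assumption taken from the smallest experimental setting; the per-record sizes are the ones stated in the sample calculations.

```python
def space_complexity_mb(n_samples: int, mem_per_sample_mb: float) -> float:
    """Total storage in MB: number of records times the storage per record."""
    if n_samples < 0 or mem_per_sample_mb < 0:
        raise ValueError("inputs must be non-negative")
    return n_samples * mem_per_sample_mb


# Storage per patient record (MB) stated in the sample calculations.
per_record_mb = {"NSCLC": 0.012, "BSVM": 0.015, "NPPC": 0.018, "MV-CNN": 0.021}

n_samples = 1000  # assumed sample size for the illustration
for method, mem_mb in per_record_mb.items():
    print(f"{method}: {space_complexity_mb(n_samples, mem_mb):.0f} MB")
```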
Fig.7: Performance measure of space complexity
Fig.7 shows the space complexity, that is, the space required to store the patient data. To
conduct the experiments, 1000 to 10000 patient data are considered. In Fig.7, the space
complexity of the WONN-MLB approach is compared with that of the existing NSCLC, BSVM, NPPC,
and MV-CNN methods. With 1000 patient data considered for analyzing the performance, the
proposed WONN-MLB approach requires 10 MB of space, whereas the existing NSCLC, BSVM, NPPC,
and MV-CNN require 12 MB, 15 MB, 18 MB, and 21 MB, respectively. Thus, the space complexity of
the proposed WONN-MLB approach is lower than that of the existing methods [1], [2], [4], and
[19]. This is attributed to the boosted weighted optimized neural network ensemble
classification algorithm used in the proposed WONN-MLB approach, which classifies the patient
data with higher accuracy before the data is stored for diagnosing cancer diseases.
Overall, the proposed WONN-MLB approach reduces the space complexity by 13%, 24%, 31%, and
36% compared to NSCLC by Wu et al. [1], BSVM by Zięba et al. [2], NPPC by Ghorai et al. [4],
and MV-CNN by Liu et al. [19], respectively.
4.6 Scenario 6: Feature selection rate
Feature selection rate is defined as the ratio of the number of relevant features that are
correctly selected to the total number of features. It is measured as a percentage (%).
The feature selection rate is computed as follows:
\[ \text{Feature selection rate} = \frac{\text{number of features correctly selected}}{\text{total number of features}} \times 100 \qquad (19) \]
The sample calculation for the feature selection rate defined in (19) is given as follows for
the existing methods:
Sample calculations:
NSCLC: With n features considered for experimentation and 17 features correctly selected,
the feature selection rate is calculated as follows:
\[ \frac{17}{n} \times 100 \]
BSVM: With n features considered for experimentation and 16 features correctly selected,
the feature selection rate is calculated as follows:
\[ \frac{16}{n} \times 100 \]
NPPC: With n features considered for experimentation and 14 features correctly selected,
the feature selection rate is calculated as follows:
\[ \frac{14}{n} \times 100 \]
MV-CNN: With n features considered for experimentation and 13 features correctly selected,
the feature selection rate is calculated as follows:
\[ \frac{13}{n} \times 100 \]
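A minimal sketch of the feature selection rate, following the definition given above as the ratio of correctly selected features to the total number of features, expressed as a percentage. The function name `feature_selection_rate` is illustrative, and the total of 20 features is an assumption taken from the smallest experimental setting; the count of 17 correctly selected features is the one stated for NSCLC.

```python
def feature_selection_rate(n_selected: int, n_total: int) -> float:
    """Percentage of features that were correctly selected."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_selected <= n_total:
        raise ValueError("n_selected must lie between 0 and n_total")
    return 100.0 * n_selected / n_total


# NSCLC example: 17 correctly selected features out of an assumed total of 20.
rate = feature_selection_rate(17, 20)
print(f"NSCLC feature selection rate: {rate:.0f}%")
```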
Fig.8: Performance measure of feature selection rate
Fig.8 compares the feature selection rate of the proposed approach with that of the existing
NSCLC, BSVM, NPPC, and MV-CNN methods. To conduct the experiments, 20 to 200 features are
considered. With 20 features considered for the performance analysis, the proposed WONN-MLB
approach achieves a feature selection rate of 90%, whereas the existing NSCLC, BSVM, NPPC,
and MV-CNN obtain 85%, 80%, 72%, and 65%, respectively. Thus, the feature selection rate of
the proposed WONN-MLB approach is higher than that of the existing methods [1], [2], [4], and
[19]. This is due to identifying the maximum relevance between the set of attributes and
minimizing redundant attributes during preprocessing, which helps to select accurate
features for cancer disease diagnosis. Overall, the proposed WONN-MLB approach improves the
feature selection rate by 10%, 18%, 28%, and 41% compared to NSCLC by Wu et al. [1], BSVM by
Zięba et al. [2], NPPC by Ghorai et al. [4], and MV-CNN by Liu et al. [19], respectively.
5. Conclusion
An effective Weight Optimized Neural Network with Maximum Likelihood Boosting (WONN-MLB) for
LCD in big data is investigated to improve the LCD diagnosis accuracy and to minimize the false
positive rate as well as the classification time. To achieve this, a preprocessing model based
on Newton Raphson’s MLMR is used to retrieve the relevant attributes and remove the irrelevant
features, which minimizes the classification time. With the most relevant attributes, an
ensemble classification model called Weighted Optimized Neural Network and Boosting is applied
for early lung cancer diagnosis with a higher accuracy rate. Here, not only is the weighted sum
function considered, but the most optimal values are also obtained. The final ensemble
technique finds the weak classifier with the lowest error value and updates each new component
based on the error function. This process attains higher disease diagnosis accuracy with a
minimum false positive rate. Experimental evaluation is conducted with different parameters
such as disease diagnosis accuracy, false positive rate, and classification time. The
experimental results show that the proposed approach achieves more accurate results for big
data processing than the existing methods. The proposed WONN-MLB approach has been tested with
different datasets; however, a large number of data points remain to be tested with the
proposed approach in future work.
References
[1] Jia Wu, Yanlin Tan, Zhigang Chen, Ming Zhao, “Decision based on big data research for non-small cell lung
cancer in medical artificial system in developing country”, Computer Methods and Programs in Biomedicine,
Elsevier, Volume 159, Mar 2018, Pages 87-101 [Big data research in Non-Small Cell Lung Cancer – Big data
research in NSCLC]
[2] Maciej Zięba, Jakub M. Tomczak, Marek Lubicz, Jerzy Świątek, “Boosted SVM for extracting rules from
imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients”,
Applied Soft Computing, Elsevier, Volume 14, Part A, January 2014, Pages 99-108 [Boosted Support vector
machine (SVM) method for Imbalanced data (BSI)]
[3] Zhihong Man, Kevin Lee, Dianhui Wang, Zhenwei Cao, Suiyang Khoo, “An optimal weight learning machine
for handwritten digit image recognition”, Signal Processing, Elsevier, Volume 93, Issue 6, June 2013, Pages 1624-
1638
[4] Santanu Ghorai, Anirban Mukherjee, Sanghamitra Sengupta, Pranab K. Dutta, “Cancer Classification from Gene
Expression Data by NPPC Ensemble”, IEEE/ACM Transactions on Computational Biology and Bioinformatics,
Volume. 8, Issue 3, MAY/JUNE 2011, Pages 659 - 671
[5] Resul Das, Abdulkadir Sengur, “Evaluation of ensemble methods for diagnosing of valvular heart disease”,
Expert Systems with Applications, Elsevier, Volume 37, Issue 7, July 2010, Pages 5110-5115
[6] Valdigleis S. Costa, Antonio Diego S. Farias, Benjamín Bedregal, Regivan H. N. Santiago, Anne Magaly de
P. Canuto, “Combining Multiple Algorithms in Classifier Ensembles using Generalized Mixture Functions”,
Neurocomputing, Elsevier, Volume 313, November 2018, Pages 402-414
[7] Shamsul Huda, John Yearwood, Herbert F. Jelinek, Mohammad Mehedi Hassan, Giancarlo Fortino, Michael
Buckland, “A hybrid feature selection with ensemble classification for imbalanced healthcare data: A case study for
brain tumor diagnosis”, IEEE Access, Volume 4, May 2016, Pages 9145 - 9154
[8] Sarfaraz Hussein, Pujan Kandel, Candice W. Bolan, Michael B. Wallace, and Ulas Bagci, “Supervised and
Unsupervised Tumor Characterization in the Deep Learning Era”, IEEE Transactions on Medical Imaging, Volume
2, July 2018, Pages 1-11
[9] Emmanuel Adetiba, Oludayo O. Olugbara, “Improved Classification of Lung Cancer Using Radial Basis
Function Neural Network with Affine Transforms of Voss Representation”, PLOS ONE journal, Volume 10, Issue
12, 2015, pages 1-25
[10] Divya Jain, Vijendra Singh, “Feature selection and classification systems for chronic disease prediction: A
review”, Egyptian Informatics Journal, Elsevier, Volume 19, Issue 3, November 2018, Pages 179-189
[11] Payal Dande, Purva Samant, “Acquaintance to Artificial Neural Networks and use of artificial intelligence as a
diagnostic tool for tuberculosis: A review”, Tuberculosis, Elsevier, Volume 108, January 2018, Pages 1-9
[12] Pegah Khosravi, Ehsan Kazemi, Marcin Imielinski, Olivier Elemento, Iman Hajirasouliha, “Deep
Convolutional Neural Networks Enable Discrimination of Heterogeneous Digital Pathology Images”,
EBioMedicine, Elsevier, Volume 27, Jan 2018, Pages 317-328
[13] Sahil Sharma, Vinod Sharma, Atul Sharma, “A Two Stage Hybrid Ensemble Classifier Based Diagnostic Tool
for Chronic Kidney Disease Diagnosis Using Optimally Selected Reduced Feature Set”, International Journal of
Intelligent Systems and Applications in Engineering, Volume 6, Issue 2, Apr 2018, Pages 113-122
[14] Faezeh Hosseinzadeh, Amir Hossein KayvanJoo, Mansuor Ebrahimi, Bahram Goliaei, “Prediction of lung
tumor types based on protein attributes by machine learning algorithms”, Springer Plus, Volume 2, Issue 238, Sep
2013, Pages 1-14
[15] Maxim D Podolsky, Anton A Barchuk, Vladimir I Kuznetcov, Natalia F Gusarova, Vadim S Gaidukov, Sergey
A Tarakanov, “Evaluation of Machine Learning Algorithm Utilization for Lung Cancer Classification Based on
Gene Expression Levels”, Asian Pacific Journal of Cancer Prevention, Volume 17, Issue 2, 2016, Pages 835-838
[16] Mohamad Rabbani, Jonathan Kanevsky, Kamran Kafi, Florent Chandelier, Francis J. Giles, “Role of artificial
intelligence in the care of patients with nonsmall cell lung cancer”, European Journal of Clinical Investigation,
Wiley Online Library, Volume 48, Issue 4, January 2018, Pages 1-7
[17] Zhiguo Zhou, Zhi-Jie Zhou, Hongxia Hao, Shulong Li, Xi Chen, You Zhang, Michael Folkert, and Jing Wang,
“Constructing multi-modality and multiclassifier radiomics predictive models through reliable classifier fusion”,
IEEE Computer Society, Jun 2017, Pages 1-13
[18] Ashutosh Kumar Dubey, Umesh Gupta, Sonal Jain, “Epidemiology of lung cancer and approaches for its
prediction: a systematic review and analysis”, Chinese Journal of Cancer, Volume 35, Issue 71, July 2016, Pages 1-
13
[19] Kui Liu, Guixia Kang, “Multiview Convolutional Neural Networks for Lung Nodule Classification”,
International journal of imaging systems and technology, Wiley online library,
Volume 27, Issue 1, March 2017, Pages 12-22
[20] Ayman El-Baz, Garth M. Beache, Georgy Gimel’farb, Kenji Suzuki, Kazunori Okada, Ahmed Elnakib, Ahmed
Soliman, and Behnoush Abdollahi, “Computer-Aided Diagnosis Systems for Lung Cancer: Challenges and
Methodologies”, International Journal of Biomedical Imaging, Hindawi Publishing Corporation, Volume 2013,
November 2012, Pages 1-46.
[21] Changmiao Wang, Ahmed Elazab, Jianhuang Wu, Qingmao Hu, “Lung nodule classification using deep feature
fusion in chest radiography”, Computerized Medical Imaging and Graphics, Elsevier, Volume 57, April 2017, Pages
10-18
[22] Sheng Chen, Kenji Suzuki, and Heber MacMahon, “Development and evaluation of a computer-aided
diagnostic scheme for lung nodule detection in chest radiographs by means of two-stage nodule enhancement with
support vector classification”, The international journal of medical physics research and practice, Vol. 38, No. 4,
2011, Pages 1844-1858
[23] Devinder Kumar, Alexander Wong, David A. Clausi, “Lung Nodule Classification Using Deep Features in CT
Images”, 2015 12th Conference on Computer and Robot Vision, 3-5 June 2015, Pages 133-138
[24] Thangavel Baranidharan, Thangavel Sumathi, Vadivelraj Chandra Shekar, “Weight Optimized Neural Network
Using Metaheuristics for the Classification of Large Cell Carcinoma and Adenocarcinoma from Lung Imaging”,
Current Signal Transduction Therapy, Volume 11, Issue 2, 2016, Pages 91-97.
[25] Varlamis, Apostolakis, Sifaki-Pistolla, Dey, Georgoulias, Lionis, “Application of data mining techniques and
data analysis methods to measure cancer morbidity and mortality data in a regional cancer registry: The case of the
island of Crete, Greece.”, Computer methods and programs in biomedicine, Volume 145, 2017, Pages 73-83.
[26] Md. Sarwar Kamal, Nilanjan Dey and Amira S. Ashour, “Large Scale Medical Data Mining for Accurate
Diagnosis: A Blueprint” In Handbook of Large-Scale Distributed Computing in Smart Healthcare, Springer, 2017,
Pages 157-176
[27] Thoracic Surgery Data Data Set: https://archive.ics.uci.edu/ml/datasets/Thoracic+Surgery+Data