
Ma and Fan BMC Bioinformatics (2017) 18:169

DOI 10.1186/s12859-017-1578-z

RESEARCH ARTICLE Open Access

CURE-SMOTE algorithm and hybrid


algorithm for feature selection and
parameter optimization based on
random forests
Li Ma and Suohai Fan*

Abstract
Background: The random forests algorithm is a type of classifier with prominent universality, a wide application
range, and robustness for avoiding overfitting. But there are still some drawbacks to random forests. Therefore, to
improve the performance of random forests, this paper seeks to improve imbalanced data processing, feature
selection and parameter optimization.
Results: We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that combining Clustering Using Representatives (CURE) with the synthetic minority oversampling technique (SMOTE) effectively improves on the original SMOTE algorithm when compared with classification results on the original data and with random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE.
Additionally, the hybrid RF (random forests) algorithm has been proposed for feature selection and parameter
optimization, which uses the minimum out of bag (OOB) data error as its objective function. Simulation results on
binary and higher-dimensional data indicate that the proposed hybrid RF algorithms, hybrid genetic-random forests
algorithm, hybrid particle swarm-random forests algorithm and hybrid fish swarm-random forests algorithm can
achieve the minimum OOB error and show the best generalization ability.
Conclusion: The training set produced from the proposed CURE-SMOTE algorithm is closer to the original data
distribution because it contains minimal noise. Thus, better classification results are produced from this feasible and
effective algorithm. Moreover, the hybrid algorithm's F-value, G-mean, AUC and OOB scores demonstrate that they
surpass the performance of the original RF algorithm. Hence, this hybrid algorithm provides a new way to perform
feature selection and parameter optimization.
Keywords: Random forests, Imbalance data, Intelligence algorithm, Feature selection, Parameter optimization

Background
Tin Kam Ho proposed the random forests (RF) concept [1] and the Random Subspace algorithm [2] in 1995 and 1998, respectively. Breiman [3] proposed a novel ensemble learning classification, random forests, by combining bagging ensemble learning and Tin Kam Ho's concept in 2001. The feature of random forests that allows for avoiding over-fitting makes it suitable for use as a data dimension reduction method for processing data with missing values, noise and outliers. Although random forests have been applied to many other fields such as biological prediction [4], fault detection [5], and network attacks [6], studies seeking to improve the algorithm itself are lacking. The RF algorithm still has some shortcomings; for example, it performs poorly for classification on imbalanced data, fails to control the model during specific operations, and is sensitive to parameter adjustment and random data attempts. Usually, there are two ways to improve RF: increase the accuracy of each individual classifier or reduce the correlation between classifiers.

* Correspondence: [email protected]
School of Information Science and Technology, Jinan University, Guangzhou 510632, China


First, it is possible to increase the classification accuracy in minor class samples of RF for imbalanced training sets through data preprocessing. Several types of methods [7–10] based on both data and algorithms exist for imbalanced data. Chen [11] found that undersampling provides results closer to the original samples than does oversampling for large-scale data. A novel sampling approach [12] based on sub-modularity subset selection was employed to balance the data and select a more representative data subset for predicting local protein properties. Similarly, an algorithm combining RF and a Support Vector Machine (SVM) with stratified sampling [13] yielded a better performance than did other traditional algorithms for imbalanced-text categorization, including RF, SVM, SVM with undersampling and SVM with oversampling. A novel hybrid algorithm [14] using a radial basis function neural network (RBFNN) integrated with RF was proposed to improve the ability to classify the minor class of imbalanced datasets. In addition, imbalanced data for bioinformatics is a well-known problem and widely found in biomedical fields. Applying RF with SMOTE to the CHOM, CHOA and Vero (A) datasets [15] is considered a remarkable improvement that is helpful in the field of functional and structural proteomics as well as in drug discovery. Ali S [16] processed imbalanced breast cancer data using the CSL technique, which imposes a higher cost on misclassified examples and develops an effective Cost-Sensitive Classifier with a GentleBoost Ensemble (Can-CSC-GBE). The Mega-Trend-Diffusion (MTD) technique [17] was developed to obtain the best results on breast and colon cancer datasets by increasing the samples of the minority class when building the prediction model.

Second, it is possible to improve algorithm construction. Because the decision trees in the original algorithm have the same weights, a weighted RF was proposed that used different weights that affected the similarity [18] between trees, out-of-bag error [19], and so on. Weighted RF has been shown to be better than the original RF algorithm [20]. Ma [21] combined Adaboost with RF and adaptive weights to obtain a better performance. The weight of attributes reduces the similarity among trees and improves RF [22]. Moreover, the nearest K-neighbour [23] and pruning mechanism can help achieve a better result when using margin as the evaluation criterion [24].

In this paper, the main work is divided into two parts: first, the CURE-SMOTE algorithm is combined with RF to solve the shortcomings of using SMOTE alone. Compared with results on the original data, random oversampling, SMOTE, Borderline-SMOTE1, safe-level-SMOTE, C-SMOTE, and the k-means-SMOTE algorithm, CURE-SMOTE's effectiveness when classifying imbalanced data is verified. Then, to simultaneously optimize feature selection, tree size, and the number of sub-features, we propose a hybrid algorithm that includes a genetic-random forests algorithm (GA-RF), a particle swarm-random forests algorithm (PSO-RF) and an artificial fish swarm-random forests algorithm (AFSA-RF). Simulation experiments show that the hybrid algorithm obtains better features, selects better parameter values and achieves a higher performance than traditional methods.

Methods
Random forests algorithm review
Algorithm principle
RF is a combination of Bagging and Random Subspace, consisting of many binary or multi-way decision trees h1(x), h2(x), …, h_nTree(x), as shown in Fig. 1. The final decision is made by majority voting to aggregate the predictions of all the decision trees. The original dataset T = {(x_i1, x_i2, …, x_iM, y_i)}, i = 1, …, N, contains N samples; the vector (x_i1, x_i2, …, x_iM) denotes the M-dimension attributes or features; Y = {y_i} denotes the classification labels; and a sample is deduced as label c by y_i = c.

There are two random procedures in RF. First, training sets are constructed by using a bootstrap [25, 26] mechanism randomly with replacement [Fig. 2 (I)]. Second, random features are selected with non-replacement from the total features when the nodes of the trees are split. The size κ of the feature subset is usually far less than the size of the total features, M. The first step is to select κ features randomly, calculate the information gain of the κ split candidates and select the best features. Thus, the size of candidate features becomes M − κ. Then, continue as shown in Fig. 2 (II).

Classification rules and algorithmic procedure
The best attribute can be computed by three methods: information gain, information gain rate and Gini coefficient, which correspond to ID3, C4.5 [27] and CART [28], respectively. When the attribute value is continuous, the best split point must be selected. We use the CART method in this paper; hence, a smaller Gini coefficient indicates a better classification result. Let P_i represent the proportion of sample i in the total sample size. Assume that sample T is divided into k parts after splitting by attribute A:

Gini(T) = 1 − ∑_{i=1}^{c} P_i^2    (1)

Gini(T, A) = ∑_{j=1}^{k} (|T_j| / |T|) Gini(T_j)    (2)
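To make the CART criterion of Eqs. (1)–(2) concrete, the following minimal Python/NumPy sketch scores a candidate split by its weighted Gini coefficient; the helper names gini and gini_split are hypothetical and are not part of the authors' Matlab implementation.

```python
import numpy as np

def gini(labels):
    # Gini index of one node: 1 - sum_i p_i^2 (Eq. 1)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels, partition_ids):
    # Weighted Gini after splitting by an attribute: sum_j |T_j|/|T| * Gini(T_j) (Eq. 2)
    total = len(labels)
    return sum(np.sum(partition_ids == part) / total * gini(labels[partition_ids == part])
               for part in np.unique(partition_ids))

y = np.array([0, 0, 1, 1])
print(gini_split(y, np.array([0, 0, 1, 1])))   # 0.0: the split separates the classes perfectly
print(gini_split(y, np.array([0, 1, 0, 1])))   # 0.5: the split carries no information
```

A perfect split scores 0, so CART-style node splitting keeps the attribute and split point with the smallest weighted Gini.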

Fig. 1 Random forests algorithm

There are several ways by which the termination criteria for RF can be met. For example, termination occurs when the decision tree reaches maximum depth, the impurity of the end node reaches the threshold, the number of final samples reaches a set point, or the candidate attributes are used up. The RF classification algorithm procedure is shown in Algorithm 1.

CURE-SMOTE algorithm
Definition and impact of imbalanced data
In recent years, the problem of classifying imbalanced data [29] has attracted increasing attention. Imbalanced data sets generally refer to data that is distributed unevenly among different categories, where the data in the smaller category is far less prevalent than data in the larger category. The Imbalance Ratio (IR) is defined as the ratio of the number of minor class samples to the number of major class samples. Therefore, imbalanced data causes the training set for each decision tree to be imbalanced during the first "random" procedure. The classification performance of traditional RF on imbalanced data sets [30] is even worse than that of SVMs [31].

SMOTE algorithm
Several methods exist for processing imbalanced data, including sample-based and algorithmic techniques, the combination of sampling and algorithm techniques, and feature selection. In particular, a type of synthesis resampling technique called the synthetic minority oversampling technique (SMOTE) [32–34] has a positive effect on the imbalanced data problem.

Fig. 2 Two random procedures diagram



The specific idea is implemented as follows: obtain the k-nearest neighbours of sample X in the minor class, select n samples randomly and record them as X_i. Finally, the new sample X_new is defined by interpolation as follows:

X_new = X_origin + rand × (X_i − X_origin),  i = 1, 2, …, n,    (3)

where rand is a random number uniformly distributed within the range (0,1), and the ratio for generating new samples approximates [1/IR] − 1.
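A minimal sketch of the interpolation in Eq. (3), assuming the minority class is stored as rows of a NumPy array; smote_like is a hypothetical helper rather than the reference implementation of [32].

```python
import numpy as np

def smote_like(minority, n_new, k=5, seed=0):
    # New samples via Eq. (3): x_new = x + rand * (x_neighbour - x), rand ~ U(0, 1)
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        dist = np.linalg.norm(minority - x, axis=1)      # distances inside the minor class
        neighbours = np.argsort(dist)[1:k + 1]           # k nearest neighbours, excluding x itself
        x_nb = minority[rng.choice(neighbours)]
        out.append(x + rng.random() * (x_nb - x))
    return np.array(out)

minor = np.random.rand(30, 2)                            # toy minority class
print(smote_like(minor, n_new=10).shape)                 # (10, 2)
```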
semi-unsupervised weighted oversampling (A-SUWO)

However, some flaws exist in the SMOTE algorithm. First, the selection of a value for k is not informed by the nearest neighbours selection. Second, it is impossible to completely reflect the distribution of the original data because the artificial samples generated by the minor class samples at the edges may lead to problems such as repeatability and noisy, fuzzy boundaries between the positive and negative classes.

Therefore, researchers have sought to improve the SMOTE algorithm. The Borderline-SMOTE1 algorithm [35] causes new samples to be more effective using interpolation along the border areas, but it fails to find all the boundary points. Definitions for this algorithm are shown in Table 1: m is the number of nearest-neighbour samples in the minor class, and k is the number of samples in the major class.

Table 1 Definitions in Borderline-SMOTE 1
Point | Definition
Noisy point | m = k
Boundary point/dangerous point | m/2 ≤ k < m
Safe point | 0 ≤ k < m/2

Motivated by Borderline-SMOTE1, safe-level-SMOTE [36] advocates calculating the safe level of minor class samples, but it can easily fall into overfitting. Cluster-SMOTE [37] obtains a satisfactory classification effect for imbalanced datasets by using K-means to find clusters of minor class samples and then applying SMOTE. In addition, spatial structures have been studied, such as N-SMOTE [38] and nuclear SMOTE [39]. The authors of [40] proposed an interpolation algorithm based on cluster centres. SMOTE was combined with a fuzzy nearest-neighbour algorithm in [41]. In [42], a preferable classification effect promoted by hierarchical clustering sampling was shown. Recently, a SMOTE noise-filtering algorithm [43] and MDO algorithms with Markov distance [44] have been proposed. In general, many improved versions of the SMOTE algorithm have been proposed, but none of these improvements seem perfect. This paper seeks to solve the shortcomings of SMOTE.

The K-means algorithm is effective only for spherical datasets and its application requires a certain amount of time. The CURE [45] hierarchical clustering algorithm is efficient for large datasets and suitable for datasets of any shape. Moreover, it is not sensitive to outliers and can recognize abnormal points. Consequently, CURE is better than the BIRCH, CLARANS and DBSCAN algorithms [46]. In the CURE algorithm, each sample point is initially assumed to be a cluster. These points are merged using local clustering until the end of the algorithm. Thus, the CURE algorithm is appropriate for distributed extensions. In this paper, inspired by C-SMOTE [40] and the hierarchical clustering sampling adaptive semi-unsupervised weighted oversampling (A-SUWO) [42] algorithms, the novel CURE-SMOTE algorithm is proposed to accommodate a wider range of application scenarios.

Design and analysis of CURE-SMOTE
The general idea of the CURE-SMOTE algorithm is as follows: cluster the samples of the minor class using CURE, remove the noise and outliers from the original samples, and then generate artificial samples randomly between representative points and the centre point. The implementation steps of the CURE-SMOTE algorithm are as follows:

Step 1. Normalize the dataset, extract the minor class samples, X, and calculate the distance dist among them. Each point is initially considered as a cluster. For each cluster U, Ur and Uc represent the representative set and the centre point, respectively. For two data items p and q, the distance between the two clusters U and V is:

dist(U, V) = min_{p ∈ Ur, q ∈ Vr} dist(p, q)    (4)

Step 2. Set the clustering number, c, and update the centre and representative points after clustering and merging based on the smallest distance of the two clusters:

Uc ← (|U|·Uc + |V|·Vc) / (|U| + |V|)    (5)

Ur ← {p + α·(Uc − p) | p ∈ Ur},    (6)

where |U| is the number of data items for class U, and the shrinkage factor α is generally 0.5. The class with the slowest growth speed is judged to contain abnormal points and will be deleted. If the number of representative points is larger than required, select the data point farthest from the clustering centre as the first representative point. Then, the next representative point is the one farthest from the former. When the number of clustering centres reaches a predefined setting, the algorithm terminates, and clusters containing only a few samples are removed.

Step 3. Generate a new sample according to the interpolation formula. X represents the samples after clustering by the CURE algorithm:

X_nnew = X + rand(0,1) × (Ur − X)    (7)

Step 4. Calculate IR, and return to Step 3 if IR ≤ IR0.

Step 5. Finally, classify the new dataset X_new = X ∪ X_nnew, with the samples of the major class added, by RF.

The distance is measured using the Euclidean distance. For example, the distance between sample X1 = (X11, X12, …, X1M) and sample X2 = (X21, X22, …, X2M) is d_12 = √( ∑_{j=1}^{M} (X_1j − X_2j)^2 ).

During the clustering process of the CURE-SMOTE algorithm, noisy points must be removed because they are far away from the normal points, and they hinder the merge speed in the corresponding class. When clustering is complete, the clusters containing only a few samples are also deemed to be noisy points. For the sample points after clustering, the interpolation can effectively prevent generalization and preserve the original distribution attributes of the data set. In the interpolation formula, X_i is replaced by the representative points; consequently, the samples are generated only between the representative samples and the samples in the original minor class, which effectively avoids the influence of boundary points. The combination of the clustering and merge operations serves to eliminate the noise points at the end of the process and reduce the complexity, because there is no need to eliminate the farthest generated artificial samples after the SMOTE algorithm runs. Moreover, all the termination criteria, such as reaching the pre-set number of clusters, the number of representative samples, or the distance threshold, avoid setting the k value of the original SMOTE algorithm and, thus, reduce the instability of the proposed algorithm.
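To make Steps 1–5 and Eqs. (4)–(7) concrete, here is a minimal Python sketch. It is not the authors' Matlab implementation: SciPy's single-linkage hierarchical clustering stands in for full CURE, the noise-removal rules are omitted, and cure_smote_like, n_reps and n_major are hypothetical names.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage   # stand-in for CURE clustering

def cure_smote_like(minority, n_major, n_clusters=2, n_reps=15, alpha=0.5, ir0=0.7, seed=0):
    # Cluster the minor class, shrink representative points toward each centre (Eq. 6),
    # then interpolate between a representative and a minority sample (Eq. 7)
    # until the imbalance ratio reaches ir0 (Step 4).
    rng = np.random.default_rng(seed)
    labels = fcluster(linkage(minority, method="single"), n_clusters, criterion="maxclust")
    synthetic = []
    while (len(minority) + len(synthetic)) / n_major < ir0:
        c = rng.choice(np.unique(labels))
        cluster = minority[labels == c]
        centre = cluster.mean(axis=0)
        far = np.argsort(np.linalg.norm(cluster - centre, axis=1))[-n_reps:]
        reps = cluster[far] + alpha * (centre - cluster[far])   # shrunken representatives
        x = cluster[rng.integers(len(cluster))]
        r = reps[rng.integers(len(reps))]
        synthetic.append(x + rng.random() * (r - x))
    return np.vstack([minority] + synthetic) if synthetic else minority

minor = np.random.rand(25, 2) * 0.3            # toy minority cluster
print(cure_smote_like(minor, n_major=100).shape)
```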

Research concerning feature selection and parameter optimization
Classification [47] and feature selection [48–50] are widely applied in bioinformatics applications such as gene selection [51, 52] and gene expression [53–55]. Chinnaswamy A [56] proposed a hybrid feature selection using correlation coefficients and particle swarm optimization on microarray gene expression data. The goal of feature selection is to choose a feature subset that retains most of the information of the original dataset, especially for high-dimensional data [57]. The authors of [58] showed that machine-learning algorithms achieve better results after feature selection. Kausar N. [59] proposed a scheme-based RF in which useful features were extracted from both the spatial and transform domains for medical image fusion. During the second "random" procedure of RF, a number of attributes are selected randomly to reduce the correlation between trees, but this operation promotes redundant features that may affect the generalization ability to some degree. Thus, new types of evaluation mechanisms were proposed based on the importance of the attributes [21, 60, 61], using weighted features as well as cost-sensitivity features [62], and so on; however, their calculations are comparatively complicated.

Recently, researchers have combined the RF algorithm with intelligent algorithms. Such combinations have achieved good results in a variety of fields. In [5], an improved feature selection method based on GA and RF was proposed for fault detection that significantly reduces the OOB error. The results of [4, 6] indicate that a type of hybrid PSO-RF feature selection algorithm is widely applied in certain fields. However, the works mentioned above do not involve parameter optimization.

Three main parameters influence the efficiency and performance of RF: nTree, the number of trees; MinLeaf, the minimum sample number of leaf nodes; and κ, the attribute subset size. Previous studies have shown that the classification performance of RF is less sensitive to MinLeaf [63]. A larger nTree increases the number of trees in the classifier, helps ensure the diversity of individual classifiers and, thus, improves performance. However, a larger nTree also increases the time cost and may lead to less interpretable results, while a small nTree results in increased classification errors and poor performance. Usually, κ is far less than the number of total attributes [64]. When all the similar attributes are used for splitting the tree nodes in the Bagging algorithm, the effect of the tree model worsens due to the higher similarity degree among trees [65]; when κ is smaller, the stronger effects of randomness lower the classification accuracy. The hyper parameter κ behaves differently for different issues [66]; hence, an appropriate value can cause the algorithm to have excellent performance for a specific problem. Breiman pointed out that selecting the proper κ value has a great influence on the performance of the algorithm [3] and suggested that the value should be 1, √M, (1/2)√M, 2√M or ⌊log2(M) + 1⌋. Generally, κ is fixed as √M, but that value does not guarantee obtaining the best classifier. Therefore, the authors of [67] suggested that the minimum OOB error be used to obtain the approximate value to overcome the shortcomings of the orthogonal validation method. Moreover, OOB data has been used to estimate the optimal training sample proportion to construct the Bagging classifier [68]. To sum up, it is difficult for traditional parameter values to achieve an optimal performance. In terms of the search for the optimal parameter, typical approaches have incorporated exhaustive search, grid search, and orthogonal selection, but these methods have a high time complexity.

Review of intelligent algorithms
Because intelligent algorithms are superior for solving NP-hard problems and for optimizing parameters, they have been the subject of many relevant and successful studies [69–72].
The main idea behind the genetic algorithm (GA) is to encode unknown variables into chromosomes and change the objective function into fitness functions. The fitness value drives the main operations—selection, crossover and mutation—to search for the best potential individuals iteratively. Eventually the algorithm converges, and the optimal or a suboptimal solution of the problem is obtained. GA has the advantage of searching in parallel, and it is suitable for a variety of complex scenarios.
The particle swarm optimization (PSO) algorithm is theoretically simpler and more efficient than the GA [73]. The main idea behind PSO is to simulate the predation behaviour of birds. Each particle represents a candidate solution and has a position, a speed and a fitness value. Historical information on the optimal solution instructs the particle to fly toward a better position.
The artificial fish swarm algorithm (AFSA) [74] is a novel algorithm with high potential. The main idea behind AFSA is to imitate the way that fish prey, swarm, follow and adopt random behaviours. The candidate solution is translated into the individual positions of the fish, while the objective function is converted to food concentration.
Diagrams for GA, PSO and AFSA are shown in Fig. 3.
There is little research on optimizing the hyper parameter κ of random forests. In [67], the size of the decision tree is fixed at 500, but this approach achieves the optimal parameter on only half the dataset. Worse, it requires considerable time and is suitable for single-parameter optimization only. Based on [4–6], this paper proposes a new hybrid algorithm that combines RF with intelligent algorithms for feature selection and parameter optimization.

The proposed hybrid algorithm for feature selection and parameter optimization
We propose the hybrid GA-RF, PSO-RF or AFSA-RF algorithm for feature selection, parameter optimization and classification. The algorithm seeks to remove redundant features and attain the optimal feature subset and, finally, to explore the relation between performance and nTree, as well as the hyper parameter κ.
Generally, p-fold cross validation is used to traverse the parameters and to estimate the algorithm in the experiment, but its time complexity is high. In this paper, the OOB error replaces the cross-validation error for binary classification, while the full misclassification error is used for multi-classification. Hence, the time complexity is reduced to 1/p. During the process, cross validation is required for classification.
Objective function:

f(nTree*, κ*, {Attribute_i | i = 1, 2, …, M}) = arg min(avg OOB_error)    (8)

Studies have shown that the larger nTree is, the more stable the classification accuracy will be. We set nTree and κ in the range [0, 500] and [1, M], respectively, by considering both the time and space complexities.
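For a single candidate solution, the objective in Eq. (8) can be evaluated roughly as follows. The sketch uses scikit-learn's RandomForestClassifier rather than the authors' Matlab code, and oob_objective and feature_mask are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oob_objective(X, y, n_tree, kappa, feature_mask, seed=0):
    # Eq. (8): OOB error of a forest built from one candidate (nTree, kappa, feature subset).
    cols = np.flatnonzero(feature_mask)
    kappa = max(1, min(kappa, len(cols)))             # constraint: 1 <= kappa <= number of selected features
    rf = RandomForestClassifier(n_estimators=max(1, n_tree), max_features=kappa,
                                bootstrap=True, oob_score=True, random_state=seed)
    rf.fit(X[:, cols], y)
    return 1.0 - rf.oob_score_                        # OOB error; the hybrids maximize fitness 1/f

# Example candidate on a 6-feature problem: 150 trees, kappa = 3, drop feature index 2
# error = oob_objective(X, y, 150, 3, np.array([1, 1, 0, 1, 1, 1]))
```

The intelligent algorithms described next differ only in how they propose the next candidate (nTree, κ, feature subset) to score with such an objective.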

Fig. 3 Diagrams of GA, PSO, and AFSA

Optimization variables: nTree, κ, {Attribute_i | i = 1, 2, …, M}.
Binary encoding involves two tangent points and three steps. Let nTree and κ be numbers in the binary system. A value of 0 in {Attribute_i | i = 1, 2, …, M} represents an unselected feature in the corresponding position, while a 1 represents a selected feature. The constraint condition is κ ≤ ∑_{i=1}^{M} Attribute_i.
Then, an nTree is generated randomly between [0, 500]. Because 2^9 = 512, a 9-bit length ensures a full set of variables. The bits used for κ and the bits used for the attributes are different for different data sets. The bits of κ are the binary representation of M, while the number of bits of the attributes is M (Fig. 4). The initialization continues until a valid variable is generated.

Fig. 4 Binary coding

The diagram for a hybrid algorithm based on RF and an artificial algorithm for feature selection and parameter optimization is shown in Fig. 5.

Hybrid GA-RF

Step 1. Initialize the population: Perform binary encoding. The population size is set to popsize, the max iteration time is set to maxgen, the crossover probability is Pc, and the mutation probability is Pm.
Step 2. Combine the GA with RF classification and calculate the fitness function, F = max(1/f), gen = 1.
Step 3. Perform the selection operation with the roulette method: the probability of selecting an individual is dependent on the proportion of the overall fitness value that the individual represents:

p_i = F_i / ∑_{i=1}^{popsize} F_i    (9)

Step 4. Conduct the crossover operation with the single-point method: two selected individuals cross at a random position with different values. The offspring generation will be regenerated until it turns out to be legal. The process is shown in Fig. 6.
Step 5. Mutation operation: select an individual and a position j randomly to mutate by switching 0 and 1. When a feasible solution is achieved, calculate the fitness value and update the optimal solution. The mutation operation is shown in Fig. 7.
Step 6. When gen > maxgen, the algorithm will terminate; otherwise, return to Step 3.
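A small Python sketch of the binary encoding and the roulette selection of Eq. (9), assuming M = 10 attributes; decode and roulette_select are hypothetical helpers, not code from the paper.

```python
import numpy as np

M = 10                                   # number of attributes in this example
KAPPA_BITS = len(np.binary_repr(M))      # bits needed for kappa, the binary representation of M

def decode(chrom):
    # 9 bits for nTree (2^9 = 512 covers [0, 500]), KAPPA_BITS bits for kappa, then M feature bits.
    n_tree = int("".join(map(str, chrom[:9])), 2)
    kappa = int("".join(map(str, chrom[9:9 + KAPPA_BITS])), 2)
    features = np.array(chrom[9 + KAPPA_BITS:])
    return n_tree, kappa, features       # a chromosome is valid only if kappa <= features.sum()

def roulette_select(fitness, rng):
    # Eq. (9): individual i is chosen with probability F_i / sum(F)
    p = np.asarray(fitness, dtype=float)
    return rng.choice(len(p), p=p / p.sum())

rng = np.random.default_rng(0)
chrom = rng.integers(0, 2, size=9 + KAPPA_BITS + M).tolist()
print(decode(chrom))
print(roulette_select([0.2, 0.5, 0.3], rng))
```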

Fig. 5 The diagram of a hybrid algorithm based on RF and an artificial algorithm

  
Fig. 6 Crossover operation
Fig. 7 Mutation operation

Hybrid PSO-RF

Step 1. Initialize the population. The population size is set to popsize, the max iteration time is set to maxgen, the position of the binary particle is X_k = {Z_k,1, Z_k,2, …}, k = 1, 2, …, popsize, the velocity is V, the learning factors are c1 and c2, and the weight is w.
Step 2. Combine the PSO with RF classification and calculate the fitness function F = max(1/f), gen = 1.
Step 3. Update the velocities V_{k+1} and positions X_{k+1} of the particles. Let P_k be the optimal position of an individual particle, Pg_k be the optimal position of all particles, and rand be a random number uniformly distributed in the range (0,1):

V_{k+1} = w·V_k + c1·r1·(P_k − X_k) + c2·r2·(Pg_k − X_k),  r1, r2 ∈ [0, 1]    (10)

sigmoid(V_{k+1}) = 1 / (1 + e^{−V_{k+1}})    (11)

Z_{k+1,j} = { 0, if rand > sigmoid(V_{k+1});  1, if rand ≤ sigmoid(V_{k+1}) },  rand ~ U(0, 1)    (12)

Step 4. If gen > maxgen, the algorithm will terminate; otherwise, return to Step 3.

Hybrid AFSA-RF

Step 1. Initialize the population. The population size is set to popsize, the maximum number of iterations is set to maxgen, the fish positions are X_k = {Z_k,1, Z_k,2, …}, k = 1, 2, …, popsize, the visual distance is visual, the crowding degree factor is delta, and the maximum number of behaviours to try is try_number.
Step 2. Combine with RF classification and calculate the food concentration F = max(1/f).
Step 3. Swarm and follow at the same time.
a) Swarm behaviour: The current state of a fish is X_i, the number of partners in view is nf, and the centre position is X_c.



When F_c / nf > delta·Fitness_i, move to the centre position according to the following formula; otherwise, conduct the prey behaviour:

Z_{k+1,i} = { Z_{k,i}, if Z_{k,i} = Z_{c,i};  0, if Z_{k,i} ≠ Z_{c,i} and rand > 0.5;  1, if Z_{k,i} ≠ Z_{c,i} and rand ≤ 0.5 }    (13)

b) Follow behaviour: Find the fish X_max with the maximum food concentration value, F_max. If F_max / nf > delta·F_i, move to X_max and calculate the food concentration value. Then, update the food concentration value by comparing it with the value of the swarm behaviour; otherwise, conduct the prey behaviour:

Z_{k+1,i} = { Z_{k,i}, if Z_{k,i} = Z_{max,i};  0, if Z_{k,i} ≠ Z_{max,i} and rand > 0.5;  1, if Z_{k,i} ≠ Z_{max,i} and rand ≤ 0.5 }    (14)

c) Prey behaviour: The current state is X_k = {Z_k,i}, and the random selection state is X_j = {Z_j,i} around the vision range with d_ij = visual. When F_k > F_j, restart to generate the next state, X_{k+1}, and calculate the food concentration until try_number is reached; otherwise, terminate the prey behaviour according to the following function:

Z_{k+1,i} = { Z_{k,i}, if Z_{k,i} = Z_{j,i};  0, if Z_{k,i} ≠ Z_{j,i} and rand > 0.5;  1, if Z_{k,i} ≠ Z_{j,i} and rand ≤ 0.5 }    (15)

Step 4. Update the state of the optimal fish. When gen > maxgen, the algorithm will terminate; otherwise, return to Step 3.

Results and discussion
The experiments in this paper are divided into two parts. Experiment 1 explores the validity of the CURE-SMOTE algorithm. Experiment 2 investigates the effectiveness of the hybrid algorithm.

Table 2 Dataset
Id | Dataset | N | M | Positive class | Negative class | IR | Label
1 | Circle | 1362 | 2 | 229 | 1133 | 0.2021:1 | 1:0
2 | Blood-transfusion | 748 | 4 | 178 | 570 | 0.3123:1 | 4:2
3 | Haberman's survival | 306 | 3 | 81 | 225 | 0.36:1 | 2:1
4 | Breast-cancer-wisconsin | 702 | 10 | 243 | 459 | 0.5249:1 | 1:0
5 | SPECT.train | 80 | 23 | 26 | 54 | 0.4815 | 1:0

Performance evaluation criteria
Referring to the evaluation used in [75], the measures of the quality of binary classification are built using a confusion matrix, where TP and FN are the numbers of correctly and incorrectly classified compounds of the actual positive class, respectively. Similarly, TN and FP denote the numbers of correctly and incorrectly classified compounds of the actual negative class. The measures accuracy, sensitivity, specificity and precision are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (TP + TN) / N    (16)

Sensitivity (or Recall) = TP / (TP + FN)    (17)

Specificity = TN / (FP + TN)    (18)

Precision = TP / (TP + FP)    (19)

The classifiers may have a high overall accuracy with 100% accuracy in the majority class while achieving only a 0–10% accuracy in the minority class because the overall accuracy is biased towards the majority class. Hence, the accuracy measure is not a proper evaluation metric for the imbalanced class problem. Instead, we suggest using F-value, Geometric Mean (G-mean) and AUC for imbalanced data evaluations.
The F-value measure is defined following [26]. A larger F-value indicates a better classifier. F-value is a performance metric that links both precision and recall:

F = 2 / (1/Precision + 1/Recall)    (20)

The G-mean [76] attempts to maximize the accuracy across the two classes with a good balance and is defined as follows. Only when both sensitivity and specificity are high can the G-mean attain its maximum, which indicates a better classifier:

G-mean = √(Sensitivity × Specificity)    (21)

AUC is the area under the receiver operating characteristics (ROC) curve. AUC has been shown to be a reliable performance measure for imbalanced and cost-sensitive problems.

Table 3 Comparison of algorithms and references
Algorithm | Reference | Algorithm | Reference
SMOTE | [32] | Safe-level SMOTE | [36]
Borderline-SMOTE 1 | [35] | C-SMOTE | [36]
k-means-SMOTE | [37] | - | -
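The measures in Eqs. (16)–(21) follow directly from the confusion matrix. A small Python sketch (imbalance_metrics is a hypothetical helper, not code from the paper):

```python
import numpy as np

def imbalance_metrics(tp, fn, fp, tn):
    # Eqs. (16)-(21): accuracy, recall/sensitivity, specificity, precision, F-value and G-mean
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_value = 2.0 / (1.0 / precision + 1.0 / recall)
    g_mean = np.sqrt(recall * specificity)
    return {"accuracy": accuracy, "F": f_value, "G-mean": g_mean}

# 80 of 100 minority and 950 of 1000 majority samples classified correctly:
print(imbalance_metrics(tp=80, fn=20, fp=50, tn=950))
# accuracy looks high (about 0.94) while F (about 0.70) and G-mean (about 0.87)
# expose the minority-class errors
```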

Fig. 8 CURE-SMOTE algorithm diagram

Fig. 9 Artificial samples generated by different methods



Fig. 10 The CURE clustering result (plot title: "Clustering diagram by CURE-SMOTE"; legend: original cluster 1, original cluster 2, centres, representative points 1, representative points 2; both axes span the normalized range [0, 1])

An AUC-based permutation variable importance measure is presented in [77]; this approach is more efficient than the approach based on the OOB error.
The training set is obtained by using the bootstrap method. Because of repeated extraction, it contains only 63% of the original data; the 37% of the data that never appear are called "out-of-bag" (OOB) data [78]. OOB estimation is an unbiased estimate of the RF algorithm and can be used to measure the classifier's generalization ability. A smaller OOB error indicates a better classification performance. The OOB error is defined as follows:

OOB_error = ∑_{i=1}^{nTree} OOB_error_i / nTree    (22)

Margin is a new evaluation criterion that has been applied to the classification of remote sensing data [79]. The larger the margin is, the higher the classifier's credibility is:

margin = ∑_{i=1}^{nTree} margin_i / nTree    (23)

Experiment 1 and parameter settings
The experiments were implemented using Matlab 2012a on a workstation with a 64-bit operating system, 2 GB of RAM and a 2.53 GHz CPU. The artificial data set Circle and UCI imbalanced datasets were selected for the experiments. More detailed information about the five datasets is listed in Table 2. To simulate the actual situation appropriately and preserve the degree of imbalance of the original data, the training set and testing set were divided using stratified random sampling at a ratio of 3:1, except for SPECT. The SPECT.test dataset incorporates 187 samples, and the proportions of the classes labelled 1 and 0 are 84:103, respectively. The tree size is 100 and the depth is 20.
To verify the effectiveness of the CURE-SMOTE algorithm, it was compared with the original data, random oversampling, SMOTE, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE (using the mean value as the centre) and k-means-SMOTE (shown in Table 3) algorithms. To evaluate the performance of the different algorithms, F-value, G-mean, AUC and OOB error are used as performance measures. The results of each experiment were averaged over 100 runs to eliminate random effects.
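As a rough illustration of how the OOB error of Eq. (22) and a binary-class margin in the spirit of Eq. (23) can be obtained, the following sketch uses scikit-learn on synthetic data; the paper's own experiments were run in Matlab, so this is only an assumed, simplified stand-in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the UCI sets used in the paper (3:1 stratified split).
X, y = make_classification(n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
oob_error = 1.0 - rf.oob_score_                       # Eq. (22): averaged OOB error of the forest

# Per-tree votes on the test set, mapped back to the original label values.
votes = np.stack([rf.classes_[tree.predict(X_test).astype(int)]
                  for tree in rf.estimators_], axis=1)
true_frac = (votes == y_test[:, None]).mean(axis=1)
margin = float((2.0 * true_frac - 1.0).mean())        # Eq. (23), binary-class case
print(oob_error, margin)
```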

Table 4 The classification results of different sampling algorithms


Dataset Method F G-Mean AUC OOB error
1. Circle Original data 0.9081 0.9339 0.9389 0.0296
Random oversampling 0.9249 0.9553 0.9567 0.0163
SMOTE 0.9086 0.9535 0.9579 0.0384
Borderline-SMOTE1 0.9110 0.9534 0.9619 0.0438
Safe-level-SMOTE 0.9146 0.9595 0.9559 0.0431
C-SMOTE 0.9302 0.9713 0.9813 0.0702
k-means-SMOTE 0.9262 0.9589 0.9602 0.0323
CURE-SMOTE 0.9431 0.9808 0.9855 0.0323
2. Blood-transfusion Original data 0.3509 0.5094 0.5083 0.2548
Random oversampling 0.3903 0.5490 0.5449 0.2250
SMOTE 0.4118 0.5798 0.5537 0.2152
Borderline-SMOTE1 0.4185 0.5832 0.5424 0.1630
Safe-level-SMOTE 0.4494 0.6174 0.5549 0.2479
C-SMOTE 0.4006 0.5549 0.5531 0.2418
k-means-SMOTE 0.4157 0.5941 0.5433 0.1872
CURE-SMOTE 0.5393 0.6719 0.6533 0.2531
3. Haberman’s survival Original data 0.3279 0.5018 0.6063 0.3149
Random oversampling 0.3504 0.5178 0.5959 0.1534
SMOTE 0.4350 0.5971 0.6259 0.1728
Borderline-SMOTE1 0.4523 0.6119 0.6298 0.2589
Safe-level-SMOTE 0.4762 0.6008 0.6030 0.3077
C-SMOTE 0.4528 0.5487 0.5656 0.2780
k-means-SMOTE 0.4685 0.6249 0.6328 0.1828
CURE-SMOTE 0.5000 0.6282 0.6940 0.2717
4. Breast–cancer-wisconsin Original data 0.9486 0.9619 0.9491 0.0446
Random oversampling 0.9451 0.9623 0.9620 0.0301
SMOTE 0.9502 0.9666 0.9627 0.0341
Borderline-SMOTE1 0.9506 0.9661 0.9635 0.0379
Safe-level-SMOTE 0.9509 0.9671 0.9638 0.0404
C-SMOTE 0.9491 0.9636 0.9561 0.0380
k-means-SMOTE 0.9449 0.9616 0.9562 0.0373
CURE-SMOTE 0.9511 0.9664 0.9621 0.0427
5. SPECT.train Original data 0.6348 0.6764 0.6579 0.3634
Random oversampling 0.6539 0.6924 0.6753 0.3468
SMOTE 0.6618 0.6990 0.6825 0.3688
Borderline-SMOTE1 0.6710 0.6926 0.6746 0.3489
Safe-level-SMOTE 0.6770 0.7074 0.6913 0.3160
C-SMOTE 0.6564 0.6936 0.6764 0.3448
k-means-SMOTE 0.6796 0.6941 0.6846 0.3599
CURE-SMOTE 0.6855 0.7155 0.6951 0.1108
From the classification results obtained by the different sampling algorithms in Table 4, the best F-value, G-mean and AUC were achieved on the Circle dataset by CURE-SMOTE, and its OOB error is second-best, behind only random sampling. The overall classification result on the blood-transfusion dataset is poorer, but the CURE-SMOTE algorithm achieves the best F-value, G-mean and AUC, while its OOB error is inferior to that on the original data. On the Haberman's survival dataset, the F-value, G-mean and AUC achieved by CURE-SMOTE are superior to the other sampling algorithms. For the breast-cancer-wisconsin dataset, CURE-SMOTE achieves the best F-value, but its G-mean and AUC are slightly lower, although they differ little from the other sampling algorithms. On the SPECT dataset, CURE-SMOTE surpasses the other sampling algorithms with regard to F-value, G-mean, AUC and OOB error.
The best value of every performance evaluation criterion obtained by the algorithms is marked in boldface.

To facilitate the comparisons, m and k were set to 20 and 5, respectively, in SMOTE, Borderline-SMOTE1 and safe-level-SMOTE. The number of clusters in C-SMOTE and k-means-SMOTE was set to five. Following the suggested setting for the CURE algorithm, the cluster results are better when the constriction factor is in the range [0.2, 0.7] and when the number of representative points is greater than 10. Thus, the constriction factor was set to 0.5 and the number of representative points was set to 15. The number of clusters was set to two for the Circle data, while the others were all five. Samples were removed when the number of representative points did not increase for ten iterations or when the sample size of the cluster class was less than 1/(10c) of the total sample size when clustering was complete. In the experiments in this paper, IR0 was fixed at 0.7. The CURE-SMOTE algorithm diagram is depicted in Fig. 8.

Results and discussion of CURE-SMOTE algorithm
Figure 9 shows the results of the original data, random sampling, SMOTE sampling, Borderline-SMOTE1 sampling, safe-level SMOTE sampling, C-SMOTE sampling, K-means SMOTE sampling and CURE-SMOTE sampling, as well as the CURE clustering result. The black circles and the red stars represent the major class samples and minor class samples, respectively, in the original data, and the blue squares represent the artificial samples generated by different methods. Figure 10 shows the CURE clustering results of the minor class sample. The number of clustering centres is two, the stars show the centres, and the blue diamonds indicate the representative points.
Figure 9 shows that a large number of data are obtained repeatedly by random sampling, and some data are not selected at all. The SMOTE algorithm also produces repeated data and generates mixed data in other classes as well as noise. Borderline-SMOTE1 picks out the boundary points of the minor class by calculating and comparing the samples of the major class around the minor class; consequently, the generated data are concentrated primarily at the edges of the class. Safe-level SMOTE follows the original distribution, but still generates repeated points and distinguishes the boundary incorrectly. Although C-SMOTE can erase the noise, the generated data are too close to the centre to accurately identify other centres. K-means-SMOTE can identify the area of the small class and slightly improves on the SMOTE effect. The proposed CURE-SMOTE algorithm generates data both near the centre and the representative points; overall, it follows the original distribution. Moreover, the representative points help to avoid noise being treated as a constraining boundary during the generating process. Detailed results are listed in Table 4.
In conclusion, the classification results of the CURE-SMOTE algorithm as measured by the F-value, G-mean, and AUC are substantially enhanced, whereas the results using SMOTE alone are not particularly stable. Meanwhile, Borderline-SMOTE1, C-SMOTE, and the k-means-SMOTE algorithm are even worse than random sampling on some datasets. Thus, the CURE-SMOTE algorithm combined with RF has a substantial effect on classification.

Experiment 2 and parameter settings
In this section, to test the effectiveness of the hybrid algorithm for feature selection and parameter optimization, we selected the representative binary classification and multi-classification imbalanced datasets shown in Table 5. These data are randomly stratified by sampling them into four parts with a training set to testing set ratio of 3:1. In this procedure, 4-fold stratified cross validation is used for classification. The parameter settings are listed in Table 6. The depth is set to 20 for Experiment 2.

Table 5 Dataset
id | Dataset | N | M | Positive class | Negative class | IR | Label
1 | Connectionist Bench | 208 | 17 | 97 | 111 | 0.8739 | R:M
2 | Wine | 130 | 13 | 59 | 71 | 0.831 | 1:2
3 | Ionosphere | 351 | 34 | 126 | 225 | 0.56 | b:g
4 | Breast-cancer-wisconsin | 702 | 10 | 243 | 459 | 0.5249 | 1:0
5 | Steel Plates Faults | 1,941 | 27 | - | - | - | 7 labels
6 | Libras Movement | 360 | 90 | - | - | - | 15 labels
7 | mfeat-factors | 2,000 | 216 | - | - | - | 10 labels

Results and discussion of the hybrid algorithm
According to the proposed settings in previous works, the parameters for all of the methods were set as follows: nTree = 100; κ = 1, √M, ⌊log2(M) + 1⌋ and M. Accuracy, OOB error and margin were selected as the evaluation criteria. The detailed results are listed in Table 7 and Table 8.
obtained repeatedly by random sampling, and some data teria. The detailed results are listed in Table 7 and Table 8.

Table 6 Parameter settings


Hybrid GA-RF: popsize = 5, maxgen = 20, Pc = 0.6, Pm = 0.1
Hybrid PSO-RF: popsize = 5, maxgen = 20, c1 = c2 = 1.5, r1, r2 ∈ [0, 1], Vmin:Vmax = -0.5:0.5, w = 0.5
Hybrid AFSA-RF: popsize = 5, maxgen = 20, visual = 3, try_number = 5, delta = 0.618

Table 7 The binary classification results


1 | √M | ⌊log2(M)+1⌋ | M | GA-RF | PSO-RF | AFSA-RF
Connectionist Bench Accuracy 0.6442 0.6442 0.6058 0.6635 0.6538 0.7308 0.6827
Sensitive 0.5882 0.6122 0.6500 0.7556 0.5741 0.6744 0.5870
Precision 0.6522 0.6250 0.4906 0.5862 0.7045 0.6744 0.6585
Specificity 0.6981 0.6727 0.5781 0.5932 0.7400 0.7705 0.7586
F 0.6186 0.6186 0.5591 0.6602 0.6327 0.6744 0.6207
G-mean 0.6408 0.6418 0.6130 0.6695 0.6518 0.7209 0.6673
AUC 0.4107 0.4119 0.3758 0.4482 0.4248 0.5196 0.4453
OOB 0.3808 0.3889 0.3344 0.3391 0.3314 0.3085 0.2932
margin 0.1078 0.1632 0.1991 0.2084 0.2056 0.1468 0.2418
nTree 100 100 100 100 315 193 151
κ 1 4 5 17 6 8 4
num (Attribute) 17 17 17 17 13 16 15
Wine Accuracy 0.9846 0.9692 0.9846 0.9692 0.9846 0.9846 0.9692
Sensitive 1.0000 0.9286 1.0000 1.0000 1.0000 1.0000 1.0000
Precision 0.9655 1.0000 0.9677 0.9333 0.9706 0.9643 0.9355
Specificity 0.9730 1.0000 0.9714 0.9459 0.9688 0.9737 0.9444
F 0.9825 0.9630 0.9836 0.9655 0.9851 0.9818 0.9667
G-mean 0.9864 0.9636 0.9856 0.9726 0.9843 0.9868 0.9718
AUC 0.9730 0.9286 0.9714 0.9459 0.9688 0.9737 0.9444
OOB 0.0442 0.0502 0.0288 0.0748 0.0246 0.0156 0.0238
margin 0.6951 0.7553 0.8149 0.7995 0.7863 0.7890 0.8345
nTree 100 100 100 100 349 354 90
κ 1 3 4 13 5 1 5
num (Attribute) 13 13 13 13 12 11 12
Ionosphere Accuracy 0.9200 0.9314 0.9371 0.9257 0.9371 0.9257 0.9314
Sensitive 0.9107 0.8475 0.8889 0.8824 0.8333 0.9032 0.9107
Precision 0.8500 0.9434 0.9057 0.9231 0.9804 0.8889 0.8793
Specificity 0.9244 0.9741 0.9587 0.9533 0.9913 0.9381 0.9412
F 0.8793 0.8929 0.8972 0.9003 0.9009 0.8960 0.8947
G-mean 0.9175 0.9086 0.9231 0.9171 0.9089 0.9205 0.9258
AUC 0.8956 0.8651 0.9002 0.8975 0.8548 0.8835 0.9029
OOB 0.1096 0.0860 0.1132 0.0884 0.0668 0.0831 0.0825
margin 0.5696 0.6918 0.6511 0.7041 0.7349 0.6934 0.6351
nTree 100 100 100 100 339 321 350
κ 1 5 6 34 9 15 2
num (Attribute) 34 34 34 34 29 30 28
Breast -cancer -wisconsin Accuracy 0.9801 0.9658 0.9715 0.9573 0.9544 0.9801 0.9658
Sensitive 0.9914 0.9474 0.9583 0.9748 0.9919 1.0000 0.9474
Precision 0.9504 0.9474 0.9583 0.9063 0.8905 0.9421 0.9474
Specificity 0.9745 0.9747 0.9784 0.9483 0.9342 0.9705 0.9747
F 0.9701 0.9474 0.9583 0.9393 0.9385 0.9702 0.9474
G-mean 0.9829 0.9609 0.9683 0.9614 0.9626 0.9851 0.9609
AUC 0.9844 0.9555 0.9595 0.9547 0.9601 0.9850 0.9474
OOB 0.0422 0.0399 0.0433 0.0467 0.0304 0.0411 0.0372
margin 0.8247 0.8569 0.8509 0.8652 0.8842 0.8179 0.8616
nTree 100 100 100 100 319 420 351
κ 1 3 4 10 3 1 3
num (Attribute) 10 10 10 10 9 9 7
The best value of every performance evaluation criterion obtained by the algorithms is marked in boldface

GA-RF, PSO-RF and AFSA-RF represent the hybrid algorithms.
From the Connectionist Bench results, we find that AFSA-RF achieves the minimum OOB error and the maximum margin. The best parameter combination is (151, 4), and κ is the same as the traditional value, √M. The features selected by AFSA-RF were [1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1], meaning that the 7th and 10th features were removed. PSO-RF obtained the best F-value, G-mean and AUC. On the Wine dataset, PSO-RF achieved the minimum OOB error and the maximum G-mean and AUC scores. The best parameter combination is (354, 1), and κ is the same as the traditional value, 1. There are 15 features selected in total. Moreover, GA-RF achieved the best F-value and AFSA-RF achieved the best margin. For Ionosphere, we find that GA-RF achieved the best OOB error, F-value and margin. The best parameter combination is (339, 9), but the value of κ is considerably different from the classic value. There are 29 total features selected. The best G-mean and AUC scores were obtained by AFSA-RF. For breast-cancer-wisconsin, GA-RF achieved the best performance for OOB error and margin. The best parameter combination is (319, 3), and κ is the same as the traditional value, √M. There are nine features selected in total. PSO-RF achieved the maximum F-value, G-mean and AUC.
The multi-classification results show that the hybrid GA-RF, PSO-RF and AFSA-RF almost always discover better features and select better parameter values than the traditional values. There are some differences between the best κ and the traditional value. The more features there are originally, the greater the number of redundant features that are removed.
Figure 11 demonstrates that, overall, the OOB error values for all the hybrid algorithms are lower than those of the traditional values with fixed parameters for the six datasets. Although the traditional value is reasonable for some datasets, it fails to achieve good performance over the entire problem set. In conclusion, the hybrid algorithm effectively eliminates redundant features and

Table 8 The multi-classification results


1 | √M | ⌊log2(M)+1⌋ | M | GA-RF | PSO-RF | AFSA-RF
Steel Plates Faults Accuracy 0.7464 0.7485 0.7598 0.7814 0.7881 0.7998 0.7914
OOB 0.3152 0.2819 0.2746 0.2640 0.2437 0.2276 0.2115
margin 0.2456 0.3384 0.3484 0.3789 0.3803 0.3812 0.3810
nTree 100 100 100 100 397 283 400
κ 1 5 5 27 8 6 6
num (Attribute) 27 27 27 27 23 22 22
Libras Movement Accuracy 0.7167 0.7556 0.6889 0.6444 0.7606 0.7767 0.7928
OOB 0.3546 0.3397 0.3480 0.3163 0.3030 0.3323 0.3116
margin 0.1464 0.1798 0.1990 0.2180 0.2443 0.2677 0.2910
nTree 100 100 100 100 258 348 135
κ 1 9 7 90 12 8 9
num (Attribute) 90 90 90 90 56 76 49
mfeat-fac Accuracy 0.4280 0.9030 0.8010 0.9620 0.9673 0.9600 0.9611
OOB 0.6949 0.1823 0.3192 0.0486 0.0416 0.0410 0.0361
margin −0.0987 0.4561 0.2361 0.8708 0.8749 0.8615 0.8698
nTree 100 100 100 100 377 270 196
κ 1 15 8 215 14 18 11
num (Attribute) 215 215 215 215 145 112 164
The best value of every performance evaluation criterion obtained by the algorithms is marked in boldface

Fig. 11 Comparison of OOB errors among different methods and datasets (y-axis: OOB error; x-axis: 1, sqrt, log+1, all, GA-RF, PSO-RF, AFSA-RF; one curve per dataset: Connectionist Bench, wine, Ionosphere, breast-cancer, Steel Plates Faults, Libras Movement, mfeat-fac)

obtains a suitable combination of parameters. Therefore, it enhances the classification performance of RF on imbalanced high-dimensional data.

Conclusions
To improve the performance of the random forests algorithm, the CURE-SMOTE algorithm is proposed for imbalanced data classification. The experiments show that the proposed algorithm effectively resolves the shortcomings of the original SMOTE algorithm for typical datasets and that various adaptive clustering techniques can be added to further improve the algorithm. We plan to continue to study the influence of feature selection and parameter settings on RF. The proposed hybrids of RF with intelligent algorithms are used to optimize RF for feature selection and parameter optimization. Simulation results show that the hybrid algorithms achieve the minimum OOB error and the best generalization ability and that their F-value, G-mean and AUC scores are generally better than those obtained using traditional values. The hybrid algorithm provides new effective guidance for feature selection and parameter optimization. The time and data dimensions of the experiments can be increased to further verify the algorithm's effectiveness.

Abbreviations
AFSA: Artificial fish swarm algorithm; AFSA-RF: Artificial fish swarm-random forests algorithm; AUC: Area under the ROC curve; Can-CSC-GBE: Cost-Sensitive Classifier with a GentleBoost Ensemble; CURE: Clustering using representatives; GA: Genetic algorithm; GA-RF: Genetic-random forests; G-mean: Geometric mean; IR: Imbalance ratio; MTD: Mega-trend-diffusion; OOB: Out of bag; PSO: Particle swarm optimization; PSO-RF: Particle swarm-random forests; RBFNN: Radial basis function neural network; RF: Random forests; ROC: Receiver operating characteristics; SMOTE: Synthetic minority oversampling technique; SVM: Support vector machine

Acknowledgements
The authors would like to thank the editor and the anonymous reviewers for their helpful suggestions and comments, which provide a great contribution to the research of this paper, and Wenxing Ye for linguistic improvements of the paper.

Funding
This work is supported in part by the National Natural Science Foundation of China (Grant No. 61572233) and the National Social Science Foundation of China (Grant No. 16BTJ032).

Availability of data and materials
All data generated or analysed during this study are included in this published article. The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Authors' contributions
LM wrote the paper and conducted all analyses. SHF developed the paper. Both authors contributed to the design of the analyses and substantially edited the manuscript. Both authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Consent for publication
Not applicable.

Ethics approval and consent to participate
Not applicable.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 25 August 2016 Accepted: 3 March 2017 30. Lusa L. Class prediction for high-dimensional class-imbalanced data. BMC