Cancer Type Prediction and Classification Based On RNA-sequencing Data

Yi-Hsin Hsu, Dong Si*, Computer and Software Systems Department, University of Washington Bothell
Authorized licensed use limited to: Cornell University Library. Downloaded on September 04,2020 at 05:52:45 UTC from IEEE Xplore. Restrictions apply.
the number of attributes was relatively larger than the number of samples, so a feature selection or feature reduction method would be useful [10][11]. Therefore, the tree classifier and variance threshold [12] methods were adopted to select features. The tree classifier method selects a subset of features based on their importance scores. The variance threshold method filters out features whose variances fall below a given threshold, so the filtered data set contains only high-variance attributes. Both feature selection methods were used in later experiments to see which provided the better result.

Imbalanced Classes: As mentioned above, the second issue was that the 33 classes were imbalanced (see Fig. 2).

[Figure 2. Number of Samples in Each Cancer Class: bar chart of sample counts (0 to 1,400) for the 33 classes ACC, BLCA, BRCA, CESC, CHOL, COAD, DLBC, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, MESO, OV, PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, TGCT, THCA, THYM, UCEC, UCS, UVM.]

One normalization approach was the min-max scaling (1):

x′ = (x − min(x)) / (max(x) − min(x)) × (new max(x) − new min(x)) + new min(x)    (1)

The other approach was the standardization scaling (2). This method standardizes features by removing the mean of each data point and scaling each feature to unit variance:

x′ = (x − x̄) / σ    (2)

III. EXPERIMENTS AND RESULTS

A. Baseline Measurement

First, without applying any pre-processing method, a baseline measurement was conducted to evaluate the original outcomes of the five ML algorithms. The measurement results can be compared with the performances of later experiments that use pre-processing procedures.

Baseline Measurement Data Set and Methods: The baseline data set removed the 212 attributes whose values were all zero, and was then normalized through the standardization method. In other words, the over-sampling, under-sampling, and feature selection methods were not applied in the baseline measurement phase. Next, the data set was split into an 80% training set and a 20% test set. After training was done, the accuracy scores were calculated. The accuracy score is the proportion of correctly classified samples among all test samples.
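The baseline procedure described above (dropping all-zero attributes, standardization scaling, and an 80/20 split) can be sketched as follows. The paper does not name its tooling, so scikit-learn, the toy data, and the choice of classifier here are illustrative assumptions, not the authors' actual setup:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy stand-in for the RNA-seq matrix: rows = samples, columns = genes.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(200, 50)).astype(float)
X[:, :5] = 0.0                      # a few all-zero genes, as in the paper
y = rng.integers(0, 4, size=200)    # toy class labels

# 1) Remove attributes whose values are all zero.
X = X[:, X.any(axis=0)]

# 2) Standardization scaling (2): x' = (x - mean) / std, per feature.
X = StandardScaler().fit_transform(X)

# 3) 80% training / 20% test split, then fit one classifier and score it.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"baseline accuracy: {acc:.3f}")
```

On the real data the same three steps apply unchanged; only the toy matrix and labels would be replaced by the TCGA expression matrix and cancer-type labels.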
In conclusion, the baseline measurement results not only gave us a basic understanding of the performance of each algorithm, but also revealed that the long training time and imbalanced class problems should be fixed.
B. Experiments Design

As discussed above, several pre-processing methods were selected for use in the following experiments. The tested pre-processing methods and ML algorithms were as follows:

Feature Selection: Tree Classifier and Variance Threshold

Under-sampling and Over-sampling: 45 unique samples per class (balanced data set), 300 repeated samples per class (balanced data set), and 10,471 original samples (imbalanced data set)

Normalization: Min-max and Standardization

ML Algorithms: DT, kNN, linear SVM, poly SVM, and ANN

Figure 3. Four box plots (A, B, C, and D) demonstrate how each testing variable impacts the accuracy score across the 21 experiments.
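The feature selection and sampling options listed above can be sketched with scikit-learn; the toolkit, the variance cutoff, the tree-ensemble settings, and the toy data are assumptions for illustration, not the paper's actual parameters:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 100))     # toy matrix: 300 samples, 100 features
y = rng.integers(0, 3, size=300)    # toy class labels

# Variance threshold: keep only features whose variance exceeds a cutoff.
X_vt = VarianceThreshold(threshold=0.9).fit_transform(X)

# Tree classifier: rank features by importance, keep the above-average ones.
trees = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
X_tree = SelectFromModel(trees, prefit=True).transform(X)

# Over-/under-sampling to a fixed number of samples per class
# (e.g. 45 unique or 300 repeated samples, as in the experiments).
n_per_class = 300
idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
    for c in np.unique(y)
])
X_bal, y_bal = X[idx], y[idx]
print(X_vt.shape, X_tree.shape, X_bal.shape)
```

Drawing with replacement (`replace=True`) reproduces the "repeated samples" balanced set; drawing 45 indices per class with `replace=False` would give the "unique samples" variant.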
C. Model Training

There is a total of 60 testing scenarios based on the listed testing variables. Considering the complexity of testing all scenarios, 21 important experiments were selected and conducted to measure their performances. Among these 21 experiments, there were five DT and five poly SVM experiments, four kNN and four linear SVM experiments, and three ANN experiments.

The whole data set was split into an 80% training set and a 20% testing set. All 21 experiments were analyzed and compared by their accuracy scores, training times, precisions, recalls, and F1-scores to identify the best model and best testing scenario.

D. Results

Among the 21 experiments, each testing variable was compared side-by-side (see Fig. 3). As seen, min-max scaling and standardization scaling were very close in their median accuracy scores. However, since min-max scaling had a smaller interquartile range, its performance could be considered more stable. Similarly, tree classifier and variance threshold were close in their median accuracies, but comparing their interquartile ranges, tree classifier was more stable.

On the other hand, the balanced data sets and the imbalanced data set produced very different results. The data set with 45 samples per class had the lowest accuracy score on average compared to the other two data sets; the reason might be the small total sample size. Conversely, the data set with 300 samples per class had the highest accuracy score on average, which can be linked to its decent total sample size and balanced classes. Lastly, the five ML algorithms were compared by their accuracy scores. Linear SVM was still the best classifier among the five algorithms in terms of accuracy rate, the same result as in the baseline measurement. Linear SVM also demonstrated the highest F1-score on average compared to the other algorithms (see Fig. 4).

As noted, in the baseline measurement the problem with linear SVM was its long training time. However, in the later 21 experiments the median training time was much shorter, even better than those of ANN and DT (see Fig. 4). In total, considering accuracy score, F1-score, and training time, linear SVM was the best algorithm among all. Besides, poly SVM performed much better in the later 21 experiments than in the baseline experiment in both accuracy score and training time. DT performed better than kNN in accuracy and F1-score, but it took longer to train. On the other hand, although kNN had the lowest accuracy score on average, it required the shortest training time.

Figure 4. A: box plot of the F1-scores of each algorithm; B: box plot of the training time of each algorithm.

E. Validation

Among the 21 models, the best model from each algorithm was selected and run with 5-fold cross validation. Both the accuracy score and the validation score were used to determine the top-performing model (see Table II).

TABLE II. THE PERFORMANCE OF THE BEST MODEL FROM EACH ALGORITHM

Model                   DT        kNN       Linear SVM  Poly SVM  ANN
Accuracy score          0.92222   0.86313   0.95808     0.94545   0.91515
Cross validation score  0.92444   0.87455   0.94980     0.94030   0.91394
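The validation step described above (fitting each algorithm, then checking it with 5-fold cross validation) might look like the following sketch; scikit-learn, the toy data, and the default hyperparameters are assumptions, not the paper's exact tuned models:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy, linearly separable labels

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
    "linear SVM": LinearSVC(),
    "poly SVM": SVC(kernel="poly"),
    "ANN": MLPClassifier(max_iter=500, random_state=0),
}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
results = {}
for name, model in models.items():
    # Held-out accuracy on the 20% test set.
    acc = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    # Mean accuracy over a 5-fold cross validation, as in Table II.
    cv = cross_val_score(model, X, y, cv=5).mean()
    results[name] = (acc, cv)
    print(f"{name}: accuracy={acc:.3f}, 5-fold CV={cv:.3f}")
```

Comparing the held-out accuracy with the cross validation mean, as Table II does, helps detect a model whose single-split score is optimistic.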
The 5-fold cross validation results confirmed that linear SVM was the top classifier among the five models. The accuracy score was 0.95808 and the cross validation score was 0.94980. The pre-processing methods of this model were variance threshold and standardization scaling, and the data set with 300 samples per class was used to train it. In addition, linear SVM also demonstrated the largest area under the curve (AUC) in the receiver operating characteristic (ROC) curve (see Fig. 5).

Figure 5. The ROC curve along with the AUC for the best model of each algorithm.

IV. CONCLUSION AND DISCUSSION

A. Conclusion

Comparing the initial baseline and the later 21 experiments, it is suggested that linear SVM was the best classifier among the five algorithms in terms of accuracy score, F1-score, training time, and AUC. Since the original data set is large and imbalanced, it is impractical to run and test ML algorithms on it directly. This study adopted several effective data pre-processing approaches to improve the performance of a classifier. By using ML to predict cancer types from gene expression levels, this study provides further guidance on how to use RNA-sequencing data to build a reliable classification model.

B. Discussion

Comparing the baseline and the later 21 experiments, we confirmed that normalization, feature selection, and balanced classes are all key factors that greatly impact the performance of models. However, identifying which pre-processing methods will be useful before model training remains an open question. Based on the experiment results, there was no significant evidence suggesting which normalization or feature selection method was better. These questions need further clarification. Although the data set with 300 samples per class was the best sampling data set in the experiments, it contained a large number of repeated samples. Therefore, new data are needed to further validate the model. In this study, linear SVM was the best algorithm for training this sequencing data set, and this finding was consistent with previous studies. However, it is important to note that if the data set included more cancer types, whether linear SVM would remain the best algorithm is an open question.

C. Future Work

This study presents a high-accuracy classification model. However, there are still many open questions that need to be addressed. First, we would like to see if these pre-processing methods are also applicable to other types of genomic data, or even clinical data. Second, as genomic data normally have a relatively large number of features or genes, using other methods to reduce features might be useful. Lastly, since imbalanced data is a very common issue in biomedical data, different strategies need to be applied to cope with this problem and improve the performance of a classifier.

ACKNOWLEDGMENT

This work was supported by the Graduate Research Award from the Computing and Software Systems division of University of Washington Bothell and the startup fund 74-0525.

REFERENCES

[1] K.A. Hoadley, C. Yau, and D.M. Wolf, "Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin," Cell, vol. 158, pp. 929-944, 2014.
[2] K. Tomczak, P. Czerwinska, and M. Wiznerowicz, "The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge," Contemporary Oncology, vol. 19, no. 1A, pp. A68-A77, 2015.
[3] T.Q. Gan, Z.C. Xie, and R.X. Tang, "Clinical value of miR-145-5p in NSCLC and potential molecular mechanism exploration: A retrospective study based on GEO, qRT-PCR, and TCGA data," Tumor Biology, vol. 39, pp. 1-23, 2017.
[4] Y. Guo, Q. Sheng, and J. Li, "Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data," PLoS One, vol. 8, pp. 1-10, 2013.
[5] Cancer Genome Atlas Research Network, J.N. Weinstein, E.A. Collisson, and G.B. Mills, "The Cancer Genome Atlas Pan-Cancer analysis project," Nature Genetics, vol. 45, pp. 1113-1120, 2013.
[6] A.G. Telonis, R. Magee, and P. Loher, "Knowledge about the presence or absence of miRNA isoforms (isomiRs) can successfully discriminate amongst 32 TCGA cancer types," Nucleic Acids Research, vol. 45, pp. 2973-2985, 2017.
[7] K. Kourou, T.P. Exarchos, and K.P. Exarchos, "Machine learning applications in cancer prognosis and prediction," Computational and Structural Biotechnology Journal, vol. 13, pp. 8-17, 2015.
[8] L. Omberg, K. Ellrott, and Y. Yuan, "Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas," Nature Genetics, vol. 45, pp. 1121-1126, 2013.
[9] A.E. Minoche, J.C. Dohm, and H. Himmelbauer, "Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems," Genome Biology, vol. 12, R112, 2011.
[10] B. Zhang, X. He, and F. Ouyang, "Radiomic machine-learning classifiers for prognostic biomarkers of advanced nasopharyngeal carcinoma," Cancer Letters, vol. 403, pp. 21-27, 2017.
[11] C. Parmar, P. Grossmann, and J. Bussink, "Machine Learning methods for Quantitative Radiomic Biomarkers," Scientific Reports, vol. 5, 13087, 2015.
[12] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, pp. 2507-2517, 2007.