
Cancer Type Prediction and Classification Based on RNA-sequencing Data

Yi-Hsin Hsu, Dong Si*, Computer and Software Systems Department, University of Washington Bothell

Yi-Hsin Hsu and Dong Si are with the Department of Computer and Software Systems, University of Washington Bothell, Bothell, WA 98011 USA (e-mail: [email protected]; corresponding author: [email protected]).

U.S. Government work not protected by U.S. copyright

Abstract— Pan-cancer analysis has been a significant research topic in the past few years. Owing to advances in sequencing technologies, researchers possess more resources and knowledge to identify the key factors that could trigger cancer. Furthermore, since The Cancer Genome Atlas (TCGA) project launched, using machine learning (ML) techniques to analyze TCGA data has been recognized as a useful approach in this line of research. Therefore, this study uses RNA-sequencing data from TCGA and focuses on classifying thirty-three types of cancer patients. Five ML algorithms, including decision tree (DT), k-nearest neighbor (kNN), linear support vector machine (linear SVM), polynomial support vector machine (poly SVM), and artificial neural network (ANN), are compared on accuracy, training time, precision, recall, and F1-score. The results show that linear SVM, with a 95.8% accuracy rate, is the best classifier in this study. Several critical data pre-processing experiments are also presented to clarify and improve the performance of the built model.

I. INTRODUCTION

The causes of tumor formation have been studied by scientists for years. Since cancer has many subtypes, it is challenging for researchers to identify its causes and treatments. To characterize and identify different types of cancer, The Cancer Genome Atlas (TCGA) project emerged. Its goal is to demonstrate that gathering and analyzing large data sets can advance cancer research [1]. The TCGA project has so far generated comprehensive genomic maps of 33 types of cancer. As TCGA provides an easily accessible platform of genomic data [2], many scientists have started to use TCGA data to discover genomic patterns in cancer [3][4].

To understand the commonalities and differences across cancer types, TCGA later launched the Pan-Cancer analysis project [5]. Recently, many studies [6][7] have used machine learning (ML) algorithms to analyze TCGA pan-cancer data sets and demonstrated their effectiveness in discovering cancer causes. As such, this study focuses on using ML to build a reliable classification model that can recognize 33 types of cancer patients. Five ML algorithms, i.e., decision tree (DT), k-nearest neighbor (kNN), linear support vector machine (linear SVM), polynomial support vector machine (poly SVM), and artificial neural network (ANN), were tested and compared in this study.

The rest of this paper is organized as follows: Section two describes the procedures and methods. Section three presents experiments and results. Section four discusses and summarizes the findings of this study.

II. METHOD

The workflow of the experiment is shown in Fig. 1.

[Figure 1 omitted: Raw Data → Pre-processing → Baseline Measurement → Experiments Design → Model Training → Evaluation of Models.]

Figure 1. The overview of the workflow.

A. Select Data

The raw data set of this study came from the TCGA Pan-Cancer analysis project [5] and was downloaded from synapse.org [8]; it included 10,471 cancer patient samples across 33 cancer types. The attributes were a list of 20,531 genes; each gene had a measurement of its RNA expression level. All measurements in the raw data set were taken with the Illumina HiSeq system [9].

B. Class Labeling and TCGA Barcode

Since the goal of this project is to classify 33 cancer types, additional class labels are needed before building the classification models. The labels were generated from the TCGA barcodes. Each sample in the data set had a unique barcode containing information such as the cancer type and sample source site. By mapping each barcode to the corresponding cancer type, we were able to label each sample with the correct cancer class.

TCGA barcodes also provided sample type information. The raw data set included primary solid tumor samples, secondary tumor samples, and other types of samples from cancer patients. Because this study focuses on classifying cancer patients rather than cancer sample types, we retained all types of samples in the study.

C. Data Pre-processing

To build a classification model with a reliable prediction rate, a data pre-processing step is needed. First, the raw data were checked to ensure there were no missing or duplicated values. Second, any row or column with all zero values was removed from the data set; 212 attributes with all zero values were found and removed. After these two steps, the attributes were reduced from 20,531 to 20,319 while the total number of samples remained the same. Next, several data pre-processing methods were considered for model training preparation.

Feature Selection: When observing the raw data, there were several issues which might affect the training results. First,

the number of attributes was relatively large compared with the number of samples, so a feature selection or feature reduction method would be useful [10][11]. Therefore, tree classifier and variance threshold [12] methods were adopted to select features. The tree classifier method selects a subset of features based on their importance scores. The variance threshold method filters out features with variances below a threshold, so the filtered data set contains high-variance attributes only. Both feature selection methods were used in later experiments to see which method provided a better result.

Imbalanced Classes: As mentioned above, the second issue was that the 33 classes were imbalanced (see Fig. 2). Breast invasive carcinoma (BRCA) had 1,218 samples while cholangiocarcinoma (CHOL) held only 45 samples. The difference between the largest class (BRCA) and the smallest class (CHOL) was 1,173 cases.

[Figure 2 bar chart omitted: number of samples for each of the 33 cancer types (THYM, KIRC, LUAD, LGG, MESO, LUSC, LAML, STAD, KIRP, GBM, KICH, ACC, PRAD, SKCM, BLCA, COAD, OV, UCS, SARC, CHOL, BRCA, HNSC, UCEC, PCPG, READ, UVM, DLBC, THCA, LIHC, CESC, ESCA, PAAD, TGCT).]

Figure 2. Each cancer type along with the number of samples.

To fix the imbalance problem, two methods were utilized. The first was under-sampling the 32 larger cancer classes to the same size as the smallest class. Since CHOL was the smallest class, we randomly selected 45 samples from each of the other 32 cancer classes and pooled them together to generate a 1,485-sample data set. In other words, this method selected 1,485 unique samples, and all classes were balanced. However, this relatively small data set might impact the accuracy rate of a classifier. To create a balanced data set of a decent size, another method was considered. Based on the raw data, the average number of samples per class was 317, so choosing 300 samples per class to create a 9,900-sample data set was reasonable. Classes with more than 300 samples were under-sampled by randomly selecting 300 unique samples each; classes with fewer than 300 samples were over-sampled by randomly drawing from their sample pools to reach 300 samples each. Finally, both the 45-samples-per-class data set and the 300-samples-per-class data set were tested in later experiments.

Normalization: The means and standard deviations of the attributes were widely spread in the raw data. Therefore, two normalization methods were adopted. The min-max scaling (1) is a common feature scaling method; it scales each feature separately into a given range:

x' = (x − min(x)) / (max(x) − min(x)) · (new max(x) − new min(x)) + new min(x)    (1)

The other approach was the standardization scaling (2). This method standardizes features by removing the mean and scaling each feature to unit variance:

x' = (x − x̄) / σ    (2)

III. EXPERIMENTS AND RESULTS

A. Baseline Measurement

First, without applying any pre-processing method, a baseline measurement was conducted to evaluate the original outcomes of the five ML algorithms. The measurement results serve as a reference for the performances of later experiments that use pre-processing procedures.

Baseline Measurement Data Set and Methods: The baseline data set had the 212 all-zero attributes removed and was then normalized with the standardization method. In other words, the over-sampling, under-sampling, and feature selection methods were not applied in the baseline measurement phase. Next, the data set was split into an 80% training set and a 20% test set. After training was done, the accuracy scores were calculated. The accuracy score is the proportion of samples in the test set which are correctly classified. In addition, the average precisions, recalls, F1-scores, and training times were also calculated to compare the performances. The baseline training results are presented in Table I.

TABLE I. BASELINE MEASUREMENT RESULTS

Testing Variables | Accuracy Score | Training Time | Ave. Precision | Ave. Recall | Ave. F1
DT                | 0.86014        | 23m 42s 121ms | 0.86           | 0.86        | 0.86
kNN               | 0.89212        | 30s 751ms     | 0.90           | 0.89        | 0.89
Linear SVM        | 0.94988        | ~4hr          | 0.95           | 0.95        | 0.95
Poly SVM          | 0.76754        | 52m 52s 518ms | 0.86           | 0.77        | 0.77
ANN               | 0.94797        | 18m 43s 312ms | 0.95           | 0.95        | 0.95

Baseline Measurement Results: As displayed in Table I, the linear SVM had the highest accuracy score but the longest training time among the five models. The long training time indicated that the data set was relatively big, so a feature selection method was needed. In addition, looking at the precision, recall, and F1-score for each class, we found that classes with relatively few samples had relatively low precision, recall, and F1-scores compared to the other classes. Therefore, a balanced data set was needed to improve the performance of a classifier.

In addition, when comparing all criteria, ANN was the best classifier among the five models. In this experiment, two hidden layers, one with 850 neurons and the other with 800 neurons, were used to train the ANN model. The activation function used was the rectified linear unit (ReLU).
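The ANN configuration described above (two hidden layers of 850 and 800 neurons, ReLU activation) can be sketched as follows. The paper does not name its toolkit, so scikit-learn's MLPClassifier, the adam solver, the iteration budget, and the synthetic stand-in data are all assumptions for illustration, not the authors' implementation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the RNA-seq matrix (samples x genes); the real data
# set has 10,471 samples and 20,319 retained gene attributes.
X, y = make_classification(n_samples=600, n_features=50, n_informative=30,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Standardization scaling, as in the baseline measurement.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Two hidden layers (850 and 800 neurons) with ReLU, matching the paper;
# solver and max_iter are assumptions chosen to keep the sketch fast.
ann = MLPClassifier(hidden_layer_sizes=(850, 800), activation="relu",
                    solver="adam", max_iter=50, random_state=0)
ann.fit(X_train, y_train)
print(round(ann.score(X_test, y_test), 3))
```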

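The baseline measurement loop itself (fit each of the five algorithms, record accuracy, training time, and macro-averaged precision/recall/F1 on a held-out 20% split) can be sketched as below. The library, the synthetic data, and the default hyperparameters are assumptions; the ANN is omitted here for brevity.

```python
import time
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=40, n_informative=20,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
    "linear SVM": SVC(kernel="linear"),
    "poly SVM": SVC(kernel="poly"),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()          # training time, as in Table I
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    y_pred = model.predict(X_test)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0)
    results[name] = (accuracy_score(y_test, y_pred), elapsed, prec, rec, f1)

for name, (acc, t, prec, rec, f1) in results.items():
    print(f"{name}: acc={acc:.3f} time={t:.2f}s P={prec:.2f} R={rec:.2f} F1={f1:.2f}")
```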
5375

Authorized licensed use limited to: Cornell University Library. Downloaded on September 04,2020 at 05:52:45 UTC from IEEE Xplore. Restrictions apply.
In conclusion, the baseline measurement results not only gave us a basic understanding of the performance of each algorithm, but also revealed that the long training time and imbalanced class problems should be fixed.
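Taken together, the pre-processing pipeline of Section II-C (dropping all-zero attributes, resampling every class to 300 samples, standardization scaling) can be sketched as follows. The toy expression matrix, the library choice, and the zero-variance threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Toy expression matrix: 3 classes with 20, 350, and 500 samples and 100
# genes, 10 of which are all-zero (mimicking the 212 all-zero attributes).
parts, labels = [], []
for cls, n in enumerate([20, 350, 500]):
    block = rng.gamma(2.0, 2.0, size=(n, 100))
    block[:, :10] = 0.0
    parts.append(block)
    labels.append(np.full(n, cls))
X, y = np.vstack(parts), np.concatenate(labels)

# 1) Drop zero-variance (all-zero) attributes.
X = VarianceThreshold(threshold=0.0).fit_transform(X)

# 2) Resample every class to 300 samples: under-sample larger classes
#    without replacement, over-sample smaller ones with replacement.
X_bal, y_bal = [], []
for cls in np.unique(y):
    Xc = X[y == cls]
    Xc = resample(Xc, replace=len(Xc) < 300, n_samples=300, random_state=0)
    X_bal.append(Xc)
    y_bal.append(np.full(300, cls))
X_bal, y_bal = np.vstack(X_bal), np.concatenate(y_bal)

# 3) Standardization scaling: x' = (x - mean) / std per attribute.
X_bal = StandardScaler().fit_transform(X_bal)
print(X_bal.shape)  # → (900, 90): 3 classes x 300 samples, 10 genes dropped
```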
B. Experiments Design

As discussed above, several pre-processing methods were selected for the following experiments. The tested pre-processing methods and ML algorithms were as follows:

 Feature Selection: Tree Classifier and Variance Threshold
 Under-sampling and Over-sampling: 45 unique samples per class (balanced data set), 300 repeated samples per class (balanced data set), and 10,471 original samples (imbalanced data set)
 Normalization: Min-max and Standardization
 ML Algorithms: DT, kNN, linear SVM, poly SVM, and ANN

C. Model Training

There is a total of 60 testing scenarios based on the testing variables listed above. Considering the complexity of testing all scenarios, 21 important experiments were selected and conducted to measure their performances. Among these 21 experiments, there were five DT and five poly SVM experiments, four kNN and four linear SVM experiments, and three ANN experiments.

The whole data set was split into an 80% training set and a 20% testing set. All 21 experiments were analyzed and compared by their accuracy scores, training time, precisions, recalls, and F1-scores to identify the best model and best testing scenario.

D. Results

Among the 21 experiments, each testing variable was compared side-by-side (see Fig. 3). As seen, min-max scaling and standardization scaling were very close in their median accuracy scores. However, since min-max scaling had a smaller interquartile range, its performance could be considered more stable. Similarly, tree classifier and variance threshold were close in their median accuracies, but comparing the interquartile ranges, the tree classifier was more stable.

Figure 3. Four box plots (A, B, C, and D) demonstrate how each testing variable impacts the accuracy score among the 21 experiments.

On the other hand, comparing the balanced data sets with the imbalanced data set, the results differed considerably. The data set with 45 samples per class had the lowest accuracy score on average compared to the other two data sets. The reason might be the small total sample size. Conversely, the data set with 300 samples per class had the highest accuracy score on average, which can be linked to its decent total sample size and balanced classes. Lastly, the five ML algorithms were compared on their accuracy scores. Linear SVM was still the best classifier among the five algorithms in terms of accuracy rate, the same result as in the baseline measurement. Linear SVM also demonstrated the highest F1-score on average compared to the other algorithms (see Fig. 4).

As noted, in the baseline measurement, the problem of linear SVM was its long training time. However, in the later 21 experiments, the median training time was much shorter; it was even better than ANN and DT (see Fig. 4). In total, considering accuracy score, F1-score, and training time, linear SVM was the best algorithm among all. Besides, poly SVM performed much better in the later 21 experiments than in the baseline experiment in both accuracy score and training time. DT performed better than kNN in accuracy and F1-score, but it took longer to train. On the other hand, although kNN had the lowest accuracy score on average, it took the shortest time to train.

Figure 4. A: box plot of F1-scores for each algorithm; B: box plot of training time for each algorithm.

E. Validation

Among the 21 models, the best model from each algorithm was selected and run with 5-fold cross validation. Both the accuracy score and the validation score were used to determine the top-performing model (see Table II).

TABLE II. THE PERFORMANCE OF THE BEST MODEL FROM EACH ALGORITHM

Model                  | DT      | kNN     | Linear SVM | Poly SVM | ANN
Accuracy score         | 0.92222 | 0.86313 | 0.95808    | 0.94545  | 0.91515
Cross validation score | 0.92444 | 0.87455 | 0.94980    | 0.94030  | 0.91394
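The validation step described above, 5-fold cross validation of a selected model, can be sketched as below. Scikit-learn, stratified folds, and the synthetic data are assumptions; the linear SVM stands in for the paper's best model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=30, n_informative=15,
                           n_classes=4, random_state=0)

# 5-fold cross validation of one candidate model; each fold returns an
# accuracy score, and the mean is compared against the held-out accuracy.
model = SVC(kernel="linear")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(round(scores.mean(), 3), round(scores.std(), 3))
```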

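The ROC/AUC comparison used for the best models (Fig. 5) requires per-class scores in the multi-class setting. A hedged one-vs-rest sketch, with scikit-learn, Platt-scaled SVM probabilities, and synthetic data as stand-ins for the paper's setup:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=30, n_informative=15,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Probability estimates enable per-class ROC curves; the one-vs-rest
# macro-averaged AUC summarizes them in a single number.
model = SVC(kernel="linear", probability=True, random_state=0)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)
macro_auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(round(macro_auc, 3))
```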
The 5-fold cross validation results confirmed that linear SVM was the top classifier among the 5 models. Its accuracy score was 0.95808 and its cross validation score was 0.94980. The pre-processing methods of this model were variance threshold and standardization scaling, and the data set with 300 samples per class was used to train it. In addition, linear SVM also demonstrated the largest area under the curve (AUC) in the receiver operating characteristic (ROC) curve (see Fig. 5).

Figure 5. The ROC curve along with the AUC for the best model of each algorithm.

IV. CONCLUSION AND DISCUSSION

A. Conclusion

Comparing the beginning baseline and the later 21 experiments, the results suggest that linear SVM was the best classifier among the 5 algorithms in terms of accuracy score, F1-score, training time, and AUC. Since the original data set is big and imbalanced, it is impractical to run ML algorithms on it directly; this study adopted several effective data pre-processing approaches to improve the performance of a classifier. By using ML to predict cancer types based on gene expression levels, this study provides further information for understanding how to use RNA-sequencing data to build a reliable classification model.

B. Discussion

Comparing the baseline and the later 21 experiments, we confirmed that normalization, feature selection, and balanced classes are all key factors that greatly impact the performance of models. However, identifying which pre-processing methods are useful before model training remains an open question: based on the experiment results, there was no significant evidence suggesting which normalization or feature selection method was better. These questions need further clarification. Although the data set with 300 samples per class was the best sampling data set in the experiments, it contained a large number of repeated samples. Therefore, new data are needed to further validate the model. In this study, linear SVM was the best algorithm for training this sequencing data set, a finding consistent with previous studies. However, it is important to note that if the data set included more cancer types, whether linear SVM would still be the best algorithm remains to be seen.

C. Future Work

This study presents a high-accuracy classification model. However, there are still many open questions that need to be addressed. First, we would like to see if these pre-processing methods are also applicable to other types of genomic data, or even clinical data. Second, as genomic data normally have a relatively large number of features, or genes, using other methods to reduce features might be useful. Lastly, since imbalanced data is a very common issue in biomedical data, different strategies need to be applied to cope with this problem and improve the performance of a classifier.

ACKNOWLEDGMENT

This work was supported by the Graduate Research Award from the Computing and Software Systems division of University of Washington Bothell and the startup fund 74-0525.

REFERENCES

[1] K.A. Hoadley, C. Yau, and D.M. Wolf, "Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin," Cell, vol. 158, pp. 929–944, 2014.
[2] K. Tomczak, P. Czerwinska, and M. Wiznerowicz, "The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge," Contemporary Oncology, vol. 19 (1A), pp. A68–A77, 2015.
[3] T.Q. Gan, Z.C. Xie, and R.X. Tang, "Clinical value of miR-145-5p in NSCLC and potential molecular mechanism exploration: A retrospective study based on GEO, qRT-PCR, and TCGA data," Tumor Biology, vol. 39, pp. 1–23, 2017.
[4] Y. Guo, Q. Sheng, and J. Li, "Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data," PLoS One, vol. 8, pp. 1–10, 2013.
[5] Cancer Genome Atlas Research Network, J.N. Weinstein, E.A. Collisson, and G.B. Mills, "The Cancer Genome Atlas Pan-Cancer analysis project," Nature Genetics, vol. 45, pp. 1113–1120, 2013.
[6] A.G. Telonis, R. Magee, and P. Loher, "Knowledge about the presence or absence of miRNA isoforms (isomiRs) can successfully discriminate amongst 32 TCGA cancer types," Nucleic Acids Research, vol. 45, pp. 2973–2985, 2017.
[7] K. Kourou, T.P. Exarchos, and K.P. Exarchos, "Machine learning applications in cancer prognosis and prediction," Computational and Structural Biotechnology Journal, vol. 13, pp. 8–17, 2015.
[8] L. Omberg, K. Ellrott, and Y. Yuan, "Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas," Nature Genetics, vol. 45, pp. 1121–1126, 2013.
[9] A.E. Minoche, J.C. Dohm, and H. Himmelbauer, "Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems," Genome Biology, vol. 12, R112, 2011.
[10] B. Zhang, X. He, and F. Ouyang, "Radiomic machine-learning classifiers for prognostic biomarkers of advanced nasopharyngeal carcinoma," Cancer Letters, vol. 403, pp. 21–27, 2017.
[11] C. Parmar, P. Grossmann, and J. Bussink, "Machine Learning methods for Quantitative Radiomic Biomarkers," Scientific Reports, vol. 5, 13087, 2015.
[12] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, pp. 2507–2517, 2007.

