This document discusses differential expression analysis in RNA-Seq. It begins with an introduction that defines key concepts like expression levels, sequencing depth, and differential expression. It then covers normalization methods to account for biases in RNA-Seq data. The main method discussed is NOISeq, a non-parametric approach that does not require replicates. NOISeq compares signal distributions between conditions to noise distributions within conditions to identify differentially expressed genes. The document concludes with exercises to run NOISeq on sample data.
Introduces RNA-Seq and differential expression, covering definitions, expression measurements, experimental design, and significance.
Discusses the necessity of normalization in RNA-Seq to reduce biases due to sequencing depth and gene length.
Covers various normalization methods including RPKM, FPKM, and TMM, showing how they correct for biases in RNA-Seq data.
Introduces parametric and non-parametric approaches to differential expression analysis, mentioning specific statistical methods and challenges.
Explains the NOISeq method for differential expression, discussing signal and noise distributions, and how it handles replicates and outputs results.
Includes practical exercises using R for NOISeq, key remarks on experimental design, and suggestions for further reading on RNA-Seq and differential expression.
Differential Expression in RNA-Seq
Sonia Tarazona
[email protected]
Data analysis workshop for massive sequencing data
Granada, June 2011
Sonia Tarazona Differential Expression in RNA-Seq
Outline
1 Introduction
Some questions
Some definitions
RNA-seq expression data
2 Normalization
3 Differential Expression
Some questions
How do we measure expression?
What is differential expression?
Experimental design in RNA-Seq
Can we use the same statistics as in microarrays?
Do I need any “normalization”?
Some definitions
Expression level
RNA-Seq: The number of reads (counts) mapping to the biological feature of
interest (gene, transcript, exon, etc.) is considered to be linearly
related to the abundance of the target feature.
Microarrays: The abundance of each sequence is a function of the fluorescence
level recovered after the hybridization process.
Sequencing depth: Total number of reads mapped to the genome. Library size.
Gene length: Number of bases.
Gene counts: Number of reads mapping to that gene (expression measurement).
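As a minimal sketch of these definitions, the R snippet below builds a toy count matrix and derives the sequencing depth and the counts of one gene from it. The matrix values and the gene and sample names are made up for illustration.

```r
# Toy count matrix: rows are genes, columns are samples (values are made up).
counts <- matrix(c(10, 0, 250, 40,
                   20, 5, 500, 75),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB")))

# Sequencing depth (library size): total number of mapped reads per sample.
depth <- colSums(counts)

# Gene counts: number of reads mapping to one gene in each sample.
gene3_counts <- counts["gene3", ]

print(depth)
print(gene3_counts)
```

Here sampleB has twice the depth of sampleA, so the doubled counts of gene3 do not by themselves indicate a change in expression.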
What is differential expression?
A gene is declared differentially expressed if an observed difference or change in
read counts between two experimental conditions is statistically significant, i.e.
if it is greater than what would be expected just due to natural random
variation.
Statistical tools are needed to make such a decision by studying the
probability distributions of counts.
RNA-Seq expression data
Experimental design
Pairwise comparisons: Only two experimental conditions or groups are to be
compared.
Multiple comparisons: More than two conditions or groups.
Replication
Biological replicates. To draw general conclusions: from samples to population.
Technical replicates. Conclusions are only valid for compared samples.
Why Normalization?
RNA-seq biases
Influence of sequencing depth: The higher the sequencing depth, the higher the counts.
Dependence on gene length: Counts are proportional to the transcript length
times the mRNA expression level.
Differences in the count distribution among samples.
Options
1 Normalization: Counts are corrected beforehand in order to minimize these
biases.
2 Statistical model: The model used for testing takes the biases into account.
Some Normalization Methods
RPKM (Mortazavi et al., 2008): Counts are divided by the transcript length
(kb) times the total number of millions of mapped reads.
$$\mathrm{RPKM} = \frac{\text{number of reads of the region}}{\dfrac{\text{total reads}}{1000000} \times \dfrac{\text{region length}}{1000}}$$
Upper-quartile (Bullard et al., 2010): Counts are divided by upper-quartile of
counts for transcripts with at least one read.
TMM (Robinson and Oshlack, 2010): Trimmed Mean of M values.
Quantiles, as in microarray normalization (Irizarry et al., 2003).
FPKM (Trapnell et al., 2010): Instead of counts, Cufflinks software generates
FPKM values (Fragments Per Kilobase of exon per Million fragments mapped)
to estimate gene expression, which are analogous to RPKM.
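Two of the methods listed above can be sketched in a few lines of base R. This is an illustrative sketch, not the code of any package: the function names `rpkm` and `upper_quartile`, the toy counts and the gene lengths are assumptions, and the upper quartile is taken per sample over genes with at least one read in that sample.

```r
# Toy data (made up): 4 genes, 2 samples, gene lengths in bases.
counts <- matrix(c(10, 0, 250, 40,
                   20, 5, 500, 75),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB")))
gene_length <- c(gene1 = 1000, gene2 = 500, gene3 = 2500, gene4 = 2000)

# RPKM: counts / (length in kb * total mapped reads in millions).
rpkm <- function(counts, gene_length) {
  depth_millions <- colSums(counts) / 1e6
  length_kb <- gene_length / 1e3
  # Row i is divided by length_kb[i]; then each column by its depth.
  sweep(counts / length_kb, 2, depth_millions, "/")
}

# Upper-quartile: counts / 75th percentile of the non-zero counts per sample.
upper_quartile <- function(counts) {
  uq <- apply(counts, 2, function(x) quantile(x[x > 0], 0.75))
  sweep(counts, 2, uq, "/")
}

rpkm_values <- rpkm(counts, gene_length)
uq_values <- upper_quartile(counts)
print(rpkm_values)
print(uq_values)
```

RPKM corrects for both gene length and depth, whereas the upper-quartile sketch only rescales each sample, so it leaves the gene-length dependence in place.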
Differential Expression
Parametric approaches
Counts are modeled using known probability distributions such as Binomial, Poisson,
Negative Binomial, etc.
R packages in Bioconductor:
edgeR (Robinson et al., 2010): Exact test based on Negative Binomial
distribution.
DESeq (Anders and Huber, 2010): Exact test based on Negative Binomial
distribution.
DEGseq (Wang et al., 2010): MA-plots based methods (MATR and MARS),
assuming Normal distribution for M|A.
baySeq (Hardcastle et al., 2010): Estimation of the posterior likelihood of
differential expression (or more complex hypotheses) via empirical Bayesian
methods using Poisson or NB distributions.
Non-parametric approaches
No assumptions about data distribution are made.
Fisher’s exact test (better with normalized counts).
cuffdiff (Trapnell et al., 2010): Based on entropy divergence for relative
transcript abundances. Divergence is a measurement of the "distance" between
the relative abundances of transcripts in two different conditions.
NOISeq (Tarazona et al., coming soon): http://bioinfo.cipf.es/noiseq/
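As an illustration of the Fisher's exact test option, the sketch below tests a single gene using base R's `fisher.test`. The counts and library sizes are hypothetical, and placing the remaining reads of each library in the second row of the table is one possible way to account for sequencing depth, not necessarily the one intended on the slide.

```r
# Hypothetical counts for one gene in two conditions, plus library sizes.
gene_counts <- c(cond1 = 30, cond2 = 90)
depth <- c(cond1 = 1e6, cond2 = 1.2e6)

# 2 x 2 table: reads on the gene vs. all other reads, per condition.
tab <- rbind(gene = gene_counts, other = depth - gene_counts)

# Fisher's exact test of independence between condition and gene membership.
res <- fisher.test(tab)
print(res$p.value)
```

In a real analysis the test would be repeated for every gene and the p-values adjusted for multiple testing, for example with `p.adjust`.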
Drawbacks of differential expression methods
Parametric assumptions: Are they fulfilled?
Need for replicates.
Difficulty detecting differential expression in genes with low counts.
NOISeq
Non-parametric method.
No need for replicates.
Less influenced by sequencing depth or number of counts.
NOISeq
Outline
Signal distribution. Computing changes in expression of each gene between the
two experimental conditions.
Noise distribution. Distribution of changes in expression values when comparing
replicates within the same condition.
Differential expression. Comparing signal and noise distributions to determine
differentially expressed genes.
NOISeq
NOISeq-real
Replicates are available for each condition.
Compute M-D in noise by comparing each pair of replicates within the same
condition.
NOISeq-sim
No replicates are available at all.
NOISeq simulates technical replicates for each condition. The replicates are
generated from a multinomial distribution, using the counts of the only available
sample (rescaled to proportions) as the probabilities of the distribution.
Compute M-D in noise by comparing each pair of simulated replicates within the
same condition.
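The simulation step can be sketched with base R's `rmultinom`, which rescales the `prob` argument to sum to one. This is not the NOISeq implementation: the toy counts are made up, `pnr` is read here as the fraction of the sample's total counts assigned to each simulated replicate (an assumption based on the value 0.2 used in the exercises), and the variability parameter `v` is ignored.

```r
set.seed(1)  # only to make the sketch reproducible

# Counts of the single available sample for one condition (made up).
sample_counts <- c(gene1 = 10, gene2 = 0, gene3 = 250, gene4 = 40)

nss <- 5    # number of replicates to simulate
pnr <- 0.2  # assumed: fraction of the total counts in each replicate

# Total counts per simulated replicate.
size <- round(pnr * sum(sample_counts))

# Each column is one simulated technical replicate; rmultinom normalizes
# `prob` internally, so raw counts can be passed directly.
sim <- rmultinom(nss, size = size, prob = sample_counts)
print(sim)
```

Genes with zero counts in the original sample have probability zero, so they stay at zero in every simulated replicate.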
NOISeq: Signal distribution
1 Calculate the expression of each gene at each experimental condition
($\text{expression}_i$, for $i = 1, 2$).
Technical replicates: $\text{expression}_i = \operatorname{sum}(\text{values}_i)$
Biological replicates: $\text{expression}_i = \operatorname{mean}(\text{values}_i)$
No replicates: $\text{expression}_i = \text{value}_i$
2 Compute for each gene the statistics measuring changes in expression:
$$M = \log_2 \frac{\text{expression}_1}{\text{expression}_2}$$
$$D = |\text{expression}_1 - \text{expression}_2|$$
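The two steps above can be sketched in base R as follows. The function name `md_signal` and the toy matrices are hypothetical, and replacing zero expression values by a small constant `k` before taking the ratio is an assumption made here to keep M finite.

```r
# Signal M and D per gene; cond1 and cond2 are genes x replicates matrices.
md_signal <- function(cond1, cond2, repl = c("tech", "bio"), k = 0.5) {
  repl <- match.arg(repl)
  # Technical replicates are summed, biological replicates are averaged.
  summarise <- if (repl == "tech") rowSums else rowMeans
  e1 <- summarise(cond1)
  e2 <- summarise(cond2)
  # Assumption: zeros are replaced by k to avoid log2(0) and division by zero.
  e1[e1 == 0] <- k
  e2[e2 == 0] <- k
  data.frame(M = log2(e1 / e2), D = abs(e1 - e2))
}

# Toy data (made up): 2 genes, 2 replicates per condition.
cond1 <- matrix(c(8, 0, 8, 0), nrow = 2, dimnames = list(c("g1", "g2"), NULL))
cond2 <- matrix(c(2, 4, 2, 4), nrow = 2, dimnames = list(c("g1", "g2"), NULL))

signal <- md_signal(cond1, cond2, repl = "tech")
print(signal)
```

For gene g1 the summed expression is 16 against 4, giving M = 2 and D = 12.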
NOISeq: Noise distribution
With replicates (NOISeq-real): For each condition, all the possible comparisons
among the available replicates are used to compute M and D values. All the
M-D values for all the comparisons, genes and for both conditions are pooled
together to create the noise distribution.
If the number of pairwise comparisons for a certain condition is higher than 30,
only 30 randomly selected comparisons are made.
Without replicates (NOISeq-sim): Replicates are simulated and the same
procedure is used to derive noise distribution.
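A rough sketch of how the noise distribution could be assembled for one condition is shown below, under stated assumptions: the function name `md_noise` is hypothetical, zeros are replaced by a constant `k` before computing M, and when more than 30 replicate pairs exist, 30 are drawn at random with `sample`.

```r
# Noise M and D values from all pairwise comparisons of replicates
# (columns) within one condition; rows are genes.
md_noise <- function(replicates, max_comparisons = 30, k = 0.5) {
  pairs <- combn(ncol(replicates), 2)  # 2 x (number of pairs) matrix
  if (ncol(pairs) > max_comparisons) {
    keep <- sample(ncol(pairs), max_comparisons)
    pairs <- pairs[, keep, drop = FALSE]
  }
  replicates[replicates == 0] <- k     # assumption: avoid log2(0)
  per_pair <- lapply(seq_len(ncol(pairs)), function(j) {
    a <- replicates[, pairs[1, j]]
    b <- replicates[, pairs[2, j]]
    data.frame(M = log2(a / b), D = abs(a - b))
  })
  do.call(rbind, per_pair)             # pool all genes and comparisons
}

# Toy data (made up): 4 genes, 3 replicates per condition.
cond1 <- matrix(c(10, 0, 250, 40, 12, 1, 240, 35, 9, 0, 260, 44), nrow = 4)
cond2 <- matrix(c(30, 5, 100, 42, 28, 4, 110, 39, 33, 6, 95, 41), nrow = 4)

# Pool the noise values of both conditions.
noise <- rbind(md_noise(cond1), md_noise(cond2))
print(dim(noise))
```

With 3 replicates there are 3 pairs per condition, so the pooled table has 3 pairs x 4 genes x 2 conditions = 24 rows.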
NOISeq: Differential expression
Probability for each gene of being differentially expressed: It is obtained by
comparing the M-D values of that gene against the noise distribution and computing
the fraction of cases in which the values in noise are lower than the values for signal.
A gene is declared differentially expressed if this probability is higher than q.
The threshold q is set to 0.8 by default, since this value is equivalent to odds
of 4:1 (the gene is 4 times more likely to be differentially expressed than not).
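The comparison of signal against noise can be sketched as below. This is one possible reading of the description above, not the NOISeq code: the function name `de_probability` and the toy values are hypothetical, and a noise point is counted as "lower" only when both its |M| and its D are below the gene's values.

```r
# Fraction of noise points whose |M| and D are both below the gene's values.
de_probability <- function(signal, noise) {
  vapply(seq_len(nrow(signal)), function(i) {
    mean(abs(noise$M) < abs(signal$M[i]) & noise$D < signal$D[i])
  }, numeric(1))
}

# Toy signal (2 genes) and pooled noise values (made up).
signal <- data.frame(M = c(3, 0.1), D = c(100, 1),
                     row.names = c("geneA", "geneB"))
noise <- data.frame(M = c(-0.5, 0.2, 1, -0.3, 0.4),
                    D = c(5, 2, 20, 3, 50))

prob <- de_probability(signal, noise)

# Genes whose probability exceeds the threshold q are declared DE.
q <- 0.8
de_genes <- rownames(signal)[prob > q]

# Odds corresponding to q: q / (1 - q), which is 4 for q = 0.8.
odds <- q / (1 - q)

print(prob)
print(de_genes)
```

In this toy case geneA lies beyond every noise point and is selected, while geneB falls inside the noise and is not.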
NOISeq
Input
Data: datos1, datos2
Features length: long (only if length correction is to be applied)
Normalization: norm = {“rpkm”, “uqua”, “tmm”, “none”}; lc = “length correction”
Replicates: repl = {“tech”, “bio”}
Simulation: nss = “number of replicates to be simulated”; pnr = “total counts
in each simulated replicate”; v = variability for pnr
Probability cutoff: q (≥ 0.8)
Others: k = 0.5
NOISeq
Output
Differential expression probability. For each feature, probability of being
differentially expressed.
Differentially expressed features. List of the names of the features that are
differentially expressed according to the q cutoff.
M-D values. For signal (between conditions and for each feature) and for noise
(among replicates within the same condition, pooled).
Exercises
Execute in an R terminal the code provided in exerciseDEngs.r
Reading data
# Columns 2 to 6 hold condition 1; columns 7 to 11 hold condition 2.
simCount <- readData(file = "simCount.txt", cond1 = c(2:6),
                     cond2 = c(7:11), header = TRUE)
lapply(simCount, head)
# Sequencing depth (library size) of every sample.
depth <- as.numeric(sapply(simCount, colSums))
Differential Expression by NOISeq
NOISeq-real
res1noiseq <- noiseq(simCount[[1]], simCount[[2]], nss = 0, q = 0.8,
                     repl = "tech")
NOISeq-sim
# Collapse the replicates of each condition into a single sample.
res2noiseq <- noiseq(as.matrix(rowSums(simCount[[1]])),
                     as.matrix(rowSums(simCount[[2]])), q = 0.8, pnr = 0.2, nss = 5)
See the tutorial at http://bioinfo.cipf.es/noiseq/ for more information.
Some remarks
Experimental design is decisive for correctly answering your biological questions.
Differential expression methods for RNA-Seq data must be different from
microarray methods.
Normalization should be applied to raw counts: at least a library size correction.
For Further Reading
Oshlack, A., Robinson, M., and Young, M. (2010) From RNA-seq reads to
differential expression results. Genome Biology , 11, 220+.
Review on RNA-Seq, including differential expression.
Auer, P.L., and Doerge R.W. (2010) Statistical Design and Analysis of RNA
Sequencing Data. Genetics, 185, 405-416.
Bullard, J. H., Purdom, E., Hansen, K. D., and Dudoit, S. (2010) Evaluation of
statistical methods for normalization and differential expression in mRNA-Seq
experiments. BMC Bioinformatics, 11, 94+.
Normalization methods (including Upper Quartile) and differential expression.
Tarazona, S., García-Alcalde, F., Dopazo, J., Ferrer, A., and Conesa, A.
(submitted) Differential expression in RNA-seq: a matter of depth.