An introduction to RNA-seq data analysis
Sonika Tyagi
Australian Genome Research Facility




1 August 2012
Outline
• Transcriptomics using RNA-seq:
  Applications
• Gene expression profiling workflows
• Design Challenges
RNA sequencing (mRNA-seq or RNA-seq)
“An experimental protocol that uses
next-generation sequencing
technologies to sequence the RNA
molecules within a biological sample in
an effort to determine the primary
sequence and relative abundance of
each RNA”
A typical RNA-seq experiment
Figure: library preparation and sequencing, followed by bioinformatics analysis.
Nature Reviews Genetics, November 2008; doi:10.1038/nrg2484
RNA-seq applications
• Allele-specific expression: prevalence
  of transcribed SNPs
• Fusion transcripts: e.g., in cancer
• Abundance estimation: alternative
  splicing, RNA-editing, novel
  transcripts
• Gene expression profiling
1. Raw sequences (fastq files)
2. Quality control (QC)
3. Spliced read alignment
4. Transcripts reconstruction
5. Differential expression analysis
6. Biology
RNA-seq workflows
Reference available?
• Annotated genome: reads mapping, then transcripts reconstruction
• Assembled/predicted transcriptome: reads mapping, then transcripts reconstruction
• Annotated de novo transcriptome assembly: de novo assembly or reference-assisted assembly
Downstream steps:
1. Summarization (by CDS, exon, gene, splice junctions)
2. Tables of counts (digital expression)
3. DE analysis
4. Biology (GO/pathways)
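The summarization step, turning aligned reads into a table of counts, can be sketched roughly as below. This is only an illustration under simplifying assumptions: the gene names, coordinates and reads are hypothetical, coordinates are treated as half-open intervals, and strand and multi-mapping reads are ignored, which dedicated counting tools do handle.

```python
from collections import Counter

def count_reads(features, reads):
    """Count reads that overlap exactly one feature (half-open coordinates)."""
    counts = Counter({name: 0 for name in features})
    for chrom, start, end in reads:
        hits = [
            name
            for name, (f_chrom, f_start, f_end) in features.items()
            if f_chrom == chrom and start < f_end and f_start < end
        ]
        # Reads hitting no feature, or more than one (ambiguous), are skipped.
        if len(hits) == 1:
            counts[hits[0]] += 1
    return dict(counts)

# Hypothetical annotation and alignments, for illustration only.
features = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [
    ("chr1", 120, 170),  # inside geneA only
    ("chr1", 460, 510),  # overlaps both genes, so it is ambiguous
    ("chr1", 600, 650),  # inside geneB only
    ("chr2", 120, 170),  # no annotated feature on this chromosome
]
table = count_reads(features, reads)
print(table)
```

Skipping ambiguous reads is one possible policy among several; the choice affects the counts for overlapping genes.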
QC tools
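As a rough idea of what a QC tool looks at, the sketch below decodes the quality line of each FASTQ record and reports the mean Phred score per read. It assumes 4-line records and the Phred+33 encoding; older Illumina pipelines used a +64 offset, so the `offset` parameter is an assumption to check against the data. The function names and example reads are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality_line]

def mean_qualities(fastq_lines):
    """Return {read_id: mean Phred score} for 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in fastq_lines]
    result = {}
    for i in range(0, len(lines) - 3, 4):
        header, _seq, _plus, qual = lines[i:i + 4]
        scores = phred_scores(qual)
        result[header[1:]] = sum(scores) / len(scores)  # drop the leading '@'
    return result

# Two made-up reads: one uniformly high quality, one with two poor bases.
example = ["@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "!!II"]
means = mean_qualities(example)
print(means)
```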
Alignments / mapping splice junctions

Unspliced read aligners
• Seed methods
  – Examples: MAQ, Stampy, ELAND
• Burrows-Wheeler transform methods
  – Examples: BWA, Bowtie
• Ideal for mapping reads against cDNA databases.
• Splice junctions/events are not picked up.

Spliced read aligners
• Exon first
  – Examples: TopHat, MapSplice, SpliceMap
• Seed-extend method
  – Examples: GSNAP, QPALMA, Elandv2e
• Novel splice junctions can be detected.
• Perform better for polymorphic regions and aligning pseudogenes.
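How a spliced alignment exposes a junction can be illustrated with the SAM format, where a spliced aligner reports the intron as an `N` (skipped region) operation in the CIGAR string. The sketch below is a minimal parser under that assumption; the function name and the example alignment are hypothetical, and it does not validate malformed CIGAR strings.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference

def splice_junctions(pos, cigar):
    """Return 1-based inclusive (intron_start, intron_end) pairs implied by N operations."""
    junctions = []
    ref = pos  # 1-based leftmost reference position of the alignment
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return junctions

# A read whose first 30 bases match one exon and last 20 bases the next,
# with a 200-base gap on the reference in between.
print(splice_junctions(1000, "30M200N20M"))
# An unspliced alignment yields no junctions.
print(splice_junctions(1000, "50M"))
```

An unspliced aligner run against the genome would have to soft-clip or reject the first read, which is why junctions go undetected with that class of tool.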
Transcripts reconstruction
• Genome guided
  – Examples: G-Mo.R-Se (short reads), Cufflinks and Scripture (for long reads)
• Genome independent
  – Examples: Trans-ABySS, Velvet+Oases, MIRA, Cufflinks*
Genome guided transcriptome assembly
Figure source: Martin J and Wang Z, Nat Rev Gen 2011; doi:10.1038/nrg3068 (published online)
Normalisation and DE
• Normalisation: library size, RPKM, FPKM, TMM, upper quartile
  – Examples: ERANGE, Cuffdiff, edgeR, Myrna
• DE models: Poisson GLM, negative binomial
  – Examples: DEGseq, Myrna, edgeR, baySeq, Cuffdiff
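Two of the simpler normalisation ideas listed above, library-size scaling and upper-quartile scaling, can be sketched as follows. This is an illustrative reading of those methods, not the implementation used by any of the named packages: the function names are hypothetical, and taking the upper quartile over non-zero counts with the "inclusive" quantile method is an assumption.

```python
from statistics import quantiles

def counts_per_million(counts):
    """Library-size scaling: express each count per million reads in the library."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def upper_quartile_scaled(counts):
    """Divide each count by the upper quartile of the non-zero counts."""
    nonzero = sorted(c for c in counts if c > 0)
    q3 = quantiles(nonzero, n=4, method="inclusive")[2]
    return [c / q3 for c in counts]

# Hypothetical counts for six genes; library_b is library_a sequenced twice as deep.
library_a = [10, 20, 30, 40, 50, 0]
library_b = [20, 40, 60, 80, 100, 0]

print(counts_per_million(library_a))
print(upper_quartile_scaled(library_a))
```

Both scalings make `library_a` and `library_b` comparable, since the only difference between them is sequencing depth. Upper-quartile scaling is less sensitive than total-count scaling to a few very highly expressed genes.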
Quantification and
normalisation
1. Digital expression or raw
   count: number of reads
   mapping to a region (exon/
   transcript/novel region)
2. Normalised counts: number of reads
   per kb of region per million
   mapped reads
3. Splice junction detection
4. Compare to existing gene
   models
Nat Meth 2008; doi:10.1038/NMETH.1226
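Steps 1 and 2 above amount to a small calculation, sketched here with hypothetical numbers. The function name is illustrative; the same arithmetic gives FPKM if fragments are counted instead of reads.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)

# Hypothetical gene: 500 reads on a 2,000 bp exon model, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)  # 25.0
```

Dividing by length removes the advantage long transcripts have in raw counts, and dividing by the number of mapped reads removes the effect of sequencing depth.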
Differential expression
• Normalised gene expression value as RPKM:
  – reads per kilobase of exon model per million mapped reads

• Or FPKM:
  – fragments per kilobase of exon model per million mapped reads

• Compare RPKM/FPKM across conditions or tissues




Nat Meth 2008; doi:10.1038/NMETH.1226
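A naive way to compare RPKM/FPKM values across two conditions is a log2 fold change, sketched below. The gene names and values are hypothetical, and the pseudocount of 1 is an assumption added to avoid division by zero. This is only a descriptive comparison: the DE packages listed earlier test for differences with statistical models of count variability across replicates, which this sketch does not attempt.

```python
import math

def log2_fold_change(value_a, value_b, pseudocount=1.0):
    """log2 of (condition B / condition A), with a pseudocount (an assumption)."""
    return math.log2((value_b + pseudocount) / (value_a + pseudocount))

# Hypothetical RPKM values per gene in two conditions.
condition_a = {"gene1": 7.0, "gene2": 15.0, "gene3": 0.0}
condition_b = {"gene1": 31.0, "gene2": 15.0, "gene3": 3.0}

changes = {
    gene: log2_fold_change(condition_a[gene], condition_b[gene])
    for gene in condition_a
}
print(changes)
```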
Systems biology: beyond the list of DE genes
• Ontologies: GO enrichment, goseq
  (R package)
• DAVID (http://david.abcc.ncifcrf.gov)
• Pathway analysis
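The idea behind GO enrichment can be sketched with a one-sided hypergeometric test: given a list of DE genes, how surprising is the number annotated to a given term? The numbers below are hypothetical and the function name is illustrative. This plain test assumes every gene is equally likely to be called DE; goseq exists because that assumption tends to fail for RNA-seq, where longer transcripts are detected more easily, and it is not what goseq computes.

```python
from math import comb

def enrichment_pvalue(total_genes, genes_in_term, de_genes, de_genes_in_term):
    """P(X >= de_genes_in_term) under the hypergeometric distribution."""
    upper = min(genes_in_term, de_genes)
    favourable = sum(
        comb(genes_in_term, i) * comb(total_genes - genes_in_term, de_genes - i)
        for i in range(de_genes_in_term, upper + 1)
    )
    return favourable / comb(total_genes, de_genes)

# Hypothetical experiment: 10 genes in total, 5 annotated to the term,
# 5 genes called DE, all 5 of them annotated to the term.
p = enrichment_pvalue(10, 5, 5, 5)
print(p)
```

In practice one such test is run per term, so the resulting p-values need a multiple-testing correction.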
RNA-seq experiment design challenges
• NGS biases:
    – Library prep (GC content, 5’ or 3’
      depletion, random hexamer primers,
      RNA species, bias towards 3’ end …).
    – Transcript length
• Sequencing depth
• Single or paired end
• Biological or technical replicates
• Validation
Briefings in Bioinformatics, vol. 12, no. 3, 280-287
RNA-seq and other transcriptomics methods
Nature Reviews Genetics, November 2008; doi:10.1038/nrg2484
Summary
• RNA-seq: more versatile and comprehensive than other
  transcriptomics methods, with superior reproducibility and resolution.
• Not dependent on prior sequence information:
  suitable for non-model organisms.
• Potentially provides information for all RNA
  species in the cell and allows discovery of novel
  ones.
• Still an actively developing field, with
  research areas that need refinement.
• Gold standards for experimental design and
  validation are yet to be set.
TopHat-Cufflinks pipeline reference

Differential gene and transcript expression
analysis of RNA-seq experiments with
TopHat and Cufflinks. Nat Protoc 7(3), 562-78.
R/Bioconductor-based RNA-seq packages
• edgeR
• voom
• DESeq

http://bioconductor.org/packages/release/BiocViews.html#___Software