January 7th, 2015
Mariam Quiñones
Computational Biology Specialist
Bioinformatics and Computational Biosciences Branch
Office of Cyber Infrastructure and Computational Biology
Upcoming Seminars on NGS Analysis
http://inside.niaid.nih.gov/topic/training/scientificsoftwaretraining/Pages/default.aspx
BCBB: A Branch Devoted to Bioinformatics and Computational Biosciences
  Researchers’ time is increasingly important
  BCBB saves our collaborators time and effort
  Researchers speed projects to completion using BCBB consultation and development services
  No need to hire extra postdocs or use external consultants or developers
BCBB Staff
•  Bioinformatics Software Developers
•  Computational Biologists
•  Project Managers and Analysts
Contact BCBB…
  NIH users: access a menu of BCBB services on the NIAID Intranet:
•  http://bioinformatics.niaid.nih.gov/
  Outside of NIH:
•  search “BCBB” on the NIAID public Internet page: www.niaid.nih.gov
– or – use this direct link: http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx
  Email us at:
•  ScienceApps@niaid.nih.gov
Why has the scientific community adopted deep sequencing?
•  Cheaper, faster sequencing
•  No need for cloning or probes
•  Many applications
•  Higher specificity and sensitivity (RNA-seq, ChIP-Seq)
•  And more
What is Next Generation Sequencing?
Image from: http://s.ngm.com/2009/06/tag-caves/img/01-rumbling-falls-615.jpg
•  It is sequencing produced by 2nd and 3rd generation instruments (e.g. Illumina, PacBio)
•  It is also known as High-Throughput Next Generation Sequencing (HT-NGS) or “Deep Sequencing”. It provides deeper coverage than typical Sanger sequencing
Agenda for today
  Overview of Next Generation Sequencing
  NGS sequencing platforms
  NGS Analysis Basics
•  File formats
•  Quality Control
•  Viewing alignment files
  Common applications of NGS
Remember Sanger?
•  Sanger introduced the “dideoxy method” (also known as Sanger sequencing) in December 1977
Alignment of reads using tools such as ‘Sequencher’
IMAGE: http://www.lifetechnologies.com
Sanger to Next Generation Sequencing
•  1977: Sanger, dideoxy chain termination
•  1986: Hood et al., fluorescently labeled ddNTPs, partial automation
•  1990: NIH begins Human Genome Project
•  2001: HGP/Celera draft assembly published in Nature / Science
•  2004: Next-gen sequencing (454 Roche)
•  2006: First Solexa sequencer, Genome Analyzer, 1G/run
•  1990 – 2003
•  “shotgun”
•  2007: J. Craig Venter, James Watson
Greater vision: Genomics to Bedside
  “Only a population perspective can fulfill the promise of genomic medicine. The scientific landscape for genomics is exciting, and the promise for improving health is great. Applying genomic tools in clinical and public health practice will require a multidisciplinary research collaboration of basic sciences with clinical and population sciences (e.g., epidemiologists; behavioral, social, and communication scientists; health services researchers; and public health practitioners)”
Am J Public Health. 2012 January; 102(1): 34–37
Popular Sequencing Platforms (non-Illumina)
•  SOLiD 5500 xl series: 320 Gb / 8 day run
•  GS FLX Titanium XL+: 700 bp reads, up to 700 Mb / 23 hours
•  PacBio RS II: 500 Mb – 1 Gb / 4 hr run (up to 40 kb read lengths)
•  Ion Torrent 318: 1.2–2 Gb / 7 hr
•  Ion Torrent Proton: 2 exomes / 2-4 hr run
Detection labels shown on the slide: pH sensing, sequencing by synthesis, single molecule
Roche 454
  Pyrosequencing
  Used mostly for targeted sequencing such as 16S rRNA
  Long reads (>500 bp) but with a high error rate in homopolymer regions
  Much lower yield than other platforms
PacBio - Single Molecule Real Time
•  Very long reads that are good for spanning repeats, but with an 11% error rate
•  Consensus analysis of reads corrects the error rate
•  It is good for base modification detection
•  It can be combined with shorter reads to improve de novo assemblies
Genome Res. 2013 Jan;23(1):121-8. doi: 10.1101/gr.141705.112. Epub 2012 Oct 11
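The consensus idea above can be illustrated with a toy majority vote. This is only a sketch under strong assumptions: the reads are already aligned to each other, have equal length, and contain substitution errors only. The function name `majority_consensus` is hypothetical and is not how any PacBio software is implemented.

```python
from collections import Counter
from typing import List


def majority_consensus(aligned_reads: List[str]) -> str:
    """Return the most common base at each column of pre-aligned, equal-length reads."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must have the same length (assumption of this sketch)")
    # zip(*reads) walks the alignment column by column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads))


if __name__ == "__main__":
    # Toy reads: the second read carries one substitution error at position 3.
    print(majority_consensus(["ACGT", "ACCT", "ACGT"]))
```

With enough independent reads covering the same position, random errors are outvoted, which is why a high single-read error rate can still give an accurate consensus.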
Ion Torrent sequence detection
http://en.wikipedia.org/wiki/Ion_semiconductor_sequencing
New kid - MinION
Bases identified by changes in current
Illumina platforms
ILLUMINA
HiSeq X = $1000/genome at 30X
And more throughput…
Broad Institute, Macrogen…
Illumina
  Sequencing by synthesis
  It uses dNTPs containing a terminator (with a fluorescent label) that blocks further polymerization, so only one base is added per cycle
http://nxseq.bitesizebio.com/articles/sequencing-by-synthesis-explaining-the-illumina-sequencing-technology/
Where are these sequences being stored?
Some data repositories include:
•  NCBI SRA database http://www.ncbi.nlm.nih.gov/sra
•  European Nucleotide Archive (ENA) http://www.ebi.ac.uk/ena/about/sra_submissions
•  1000 Genomes data http://www.1000genomes.org/data
•  Human Microbiome Project (microbiome data) http://hmpdacc.org/
Large Sequencing Projects
http://cancergenome.nih.gov/cancergenomics
www.1000genomes.org/
http://commonfund.nih.gov/hmp/
http://www.icgc.org/
http://img.jgi.doe.gov/cgi-bin/m/main.cgi
http://www.genome10k.org/
Major challenges when working with sequencing data
We need:
  Algorithms for managing (LIMS), analyzing and visualizing data
  Reproducible workflows and standards for analysis
  Better transfer and data storage technology
  Specialized tools for integrating various data types
Emerging solutions
Algorithms that can parallelize jobs in a cluster
  ABySS (uses MPI), ALLPATHS-LG, DISCOVAR
  GATK (Genome Analysis Toolkit) uses MapReduce (Google’s framework)
Web tools with workflow capabilities
  Galaxy Bioinformatics https://usegalaxy.org
  Various cloud-based solutions (e.g. Illumina BaseSpace)
  Lots of open source tools: see http://seqanswers.com/wiki/Software
Galaxy https://usegalaxy.org
  Makes analysis methods available to the community and facilitates reproducibility via creation of reusable workflows (read Galaxy slides)
  Free web service, also compatible with Cloud http://usegalaxy.org/cloud
  Open source
  Provides a Genome Track Browser to visualize custom data.
Cartoons from: fixingpcerrors.com and squido.com
I have data,
where do I start?
First the basics – NGS 101
 Sequence data
• What does a short read look like?
• How do I know if the sequencing facility has provided good quality reads?
• What should I expect if the sequencing facility has mapped the reads to my genome of interest?
Understanding file
formats
@F29EPBU01CZU4O
GCTCCGTCGTAAAAGGGG
+
24469:666811//..,,
@F29EPBU01D60ZF
CTCGTTCTTGATTAATGAAACATTCTTGGCAAA
TGCTTTCGCTCTGGTCCGTCTTGCGCCGGTCCA
AGAATTTCACCTCTAGCGGCGCAATACGAATG
CCCAAACACACCCAACACACCA
+
G???HHIIIIIIIIIBG555?
=IIIIIIIIHHGHHIHHHIIIIIIHHHIIHHHIIIIIIIIIH99;;CB
BCCEI???DEIIIIII??;;;IIGDBCEA?
9944215BB@>>@A=BEIEEE
@F29EPBU01EIPCX
TTAATGATTGGAGTCTTGGAAGCTTGACTACCC
TACGTTCTCCTACAAATGGACCTTGAGAGCTTG
TTTGGAGGTTCTAGCAGGGGAGCGCATCTCCC
CAAACACACCCAACACACCA
+
IIIIIIIIIIIIIIIIIIIIIIHHHHIIIIHHHIIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIH
HHIIIIIIIIEIIB94422=4GEEEEEIBBBBHHHFIH??
?CII=?AEEEE
@F29EPBU01DER7Q
TGACGTGCAAATCGGTCGTCCGACCTCGGTAT
AGGGGCGAAGACTAATCGAACCATCTAGTAGC
Common Sequence file formats
  Next gen sequence file formats are based on the commonly used FASTA format
>sequence_ID and optional comments
ATTCCGGTGCGGTGCGGTGCTGCCGTGCCGGTGC
TTCGAAATTGGCGTCAGT
  FASTQ adds a Phred quality score for each base
@HWI-ST406:207:D1DGFACXX:8:1101:20481:2058 1:N:0:AGTCAA
CATGGGGATCGAATTCATCGCCGTCCCCTCTGTTCCGATTTATTCCATATGTGCTTCGCAACAACGCTTTCTCACAGAATACAGGAGCTTCTATACTGTA
+
BBBFFFFFFFFFFIIIIIIFFIIFFIIIFFIIFFFIFBFIIIIIIIIIFIIFBFFIFFFBFFBFFBFBFFFFFFFBBFFFFFFBBBBBBBBBBBFFFBFB
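A record like the one above can be read with a few lines of code. The following is a minimal sketch, assuming every record occupies exactly four lines (header, sequence, "+", quality) with no line wrapping; wrapped records such as the 454 examples shown earlier would need a more tolerant parser. `FastqRecord` and `read_fastq` are hypothetical names, and the input in the demo is a toy record.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str    # header line without the leading "@"
    sequence: str   # base calls
    quality: str    # ASCII-encoded quality string, same length as sequence


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, quality)


if __name__ == "__main__":
    toy = io.StringIO("@read1\nACGT\n+\nIIII\n")
    for record in read_fastq(toy):
        print(record.read_id, record.sequence, record.quality)
```

For real data a maintained parser is usually a safer choice; the point here is only that FASTQ is FASTA plus one quality character per base.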
Raw sequence file formats
  FASTQ format (FASTA format with quality values for each base)
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - base calls
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ - base quality (ASCII, Phred+33)
Full read header description:
@<instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>
A space separates the read ID (first group) from the rest of the header (second group)
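The header layout described above can be split into named fields. This is a sketch that assumes the header follows exactly the two-group, colon-separated layout shown on the slide; other Illumina pipeline versions use different layouts and would fail here. `IlluminaHeader` and `parse_illumina_header` are hypothetical names.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read_number: int
    is_filtered: bool
    control_number: int
    barcode: str


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Split '@inst:run:flowcell:lane:tile:x:y read:filtered:control:barcode'."""
    read_id, description = line.strip().lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x_pos, y_pos = read_id.split(":")
    read_number, filtered, control, barcode = description.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read_number),
        filtered == "Y",  # "Y" means the read was flagged by the filter
        int(control), barcode,
    )


if __name__ == "__main__":
    header = parse_illumina_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
    print(header.lane, header.tile, header.barcode)
```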
Quality values
Quality scores are normally expected up to 40 on the Phred scale.
Scores are encoded as ASCII characters <http://en.wikipedia.org/wiki/ASCII>
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@
The highest base quality score in this sequence: ‘D’ = (68 - 33) = 35
If base quality = 35: $P = 10^{-35/10} \approx 0.00032$ (or about 1/3200 incorrect)
From http://en.wikipedia.org/wiki/FASTQ_format
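The arithmetic on this slide can be checked directly. The sketch below assumes the Phred+33 ASCII offset used in the example; `phred_scores` and `error_probability` are hypothetical helper names.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to integer Phred scores (offset 33 assumed)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Probability that a base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


if __name__ == "__main__":
    scores = phred_scores("BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@")
    best = max(scores)
    print(best, error_probability(best))
```

The highest score in the slide's quality string is the character 'D' (ASCII 68), which decodes to 35 and an error probability of roughly 0.0003.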
Other read formats
  SFF (Roche 454 or Ion Torrent)
•  SFF files contain flowgrams, Phred quality scores, and clipping information
•  454 reads are often reported as fasta and qual files or converted to fastq
First the basics – NGS 101
 Sequence data
• What does a short read look like?
• How do I know if the sequencing facility has provided good quality reads?
• What should I expect if the sequencing facility has mapped the reads to my genome of interest?
Basic Concepts in Quality Control of sequence data
  The sequencing facility runs quality control tests to ensure that the run was successful and/or to determine whether a new library is good enough to sequence further.
  The user should run quality control tests prior to full bioinformatics analyses
•  This will avoid misinterpretation of the data due to unexpected bias
•  QC measurements can report the following:
–  Percent GC in sample reads
–  Presence of overrepresented k-mers and sequences such as adapters
–  Per base quality score
–  Distribution of nucleotide bases
  After mapping reads to a genome, additional tests can be run to determine:
–  Mapping error rate
–  Percent of possible PCR duplicates (reads with the same start and end position in the reference genome)
–  Distribution of insert size (paired ends)
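Two of the pre-mapping measurements listed above, percent GC and per-base quality, are simple enough to sketch by hand. This is an illustration only, assuming Phred+33 quality strings; it is not how FastQC computes its reports, and both function names are hypothetical.

```python
from typing import List


def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in a read (0.0 for an empty read)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def mean_quality_per_position(qualities: List[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position across many quality strings."""
    longest = max((len(q) for q in qualities), default=0)
    means = []
    for position in range(longest):
        # Shorter reads simply do not contribute to later positions.
        scores = [ord(q[position]) - offset for q in qualities if position < len(q)]
        means.append(sum(scores) / len(scores))
    return means


if __name__ == "__main__":
    print(gc_percent("ATGC"))
    print(mean_quality_per_position(["II", "5I"]))
```

A drop in the per-position mean toward the 3' end of reads is the usual signal that quality trimming is worth considering.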
Demo – Fastq Quality
Quality control with FastQC
By: FastQC
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
•  Mean quality per base
•  Sequence content per base
•  GC content per read
For filtering or trimming by quality, use tools such as FastX-toolkit, Btrim, PrinSeq.
First the basics – NGS 101
 Sequence data
• What does a short read look like?
• How do I know if the sequencing facility has provided good quality reads?
• What should I expect if the sequencing facility has mapped (aligned) the reads to my genome of interest?
File formats for aligned reads
  SAM (sequence alignment map)
CTTGGGCTGCGTCGTGTCTTCGCTTCACACCCGCGACGAGCGCGGCTTCT
CTTGGGCTGCGTCGTGTCTTCGCTTCACACC
Chr_ start end
Chr2 100000 100050
Chr_ start end
Chr2 50000 50050
Most commonly used alignment file formats
  SAM (sequence alignment map)
Unified format for storing alignments to a reference genome
  BAM (binary version of SAM), commonly used to deliver data
Compressed SAM file, normally indexed
  BED
Commonly used to report features described by chrom, start, end, name, score, and strand.
For example:
chr1 11873 14409 uc001aaa.3 0 +
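The BED example line can be parsed into the six fields named above. A minimal sketch, assuming a six-column line separated by whitespace; BED start coordinates are 0-based and the end is exclusive, so the feature length is simply end minus start. `BedFeature` and `parse_bed_line` are hypothetical names.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: int
    strand: str

    @property
    def length(self) -> int:
        return self.end - self.start


def parse_bed_line(line: str) -> BedFeature:
    """Parse one six-column BED line (chrom, start, end, name, score, strand)."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedFeature(chrom, int(start), int(end), name, int(score), strand)


if __name__ == "__main__":
    feature = parse_bed_line("chr1 11873 14409 uc001aaa.3 0 +")
    print(feature.name, feature.length)
```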
SAM/BAM format (sequence alignment map)
QNAME FLAG RNAME POSITION MAPQ CIGAR MRNM MPOS TLEN SEQ QUAL OPT
http://samtools.sourceforge.net/samtools.shtml#8
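A SAM alignment line is tab-separated, with eleven mandatory columns followed by optional tags, so it can be unpacked with plain string handling. The sketch below uses the column names from the current SAM specification (POS, CIGAR, RNEXT, PNEXT), which correspond to the slide's POSITION, MRNM and MPOS. Only four of the FLAG bits are decoded, the input line is a toy example, and the function names are hypothetical.

```python
from typing import Dict, List

SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# A subset of the FLAG bits defined by the SAM specification.
FLAG_BITS = {0x1: "paired", 0x4: "unmapped", 0x10: "reverse_strand", 0x400: "duplicate"}


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one SAM alignment line into its mandatory columns plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_COLUMNS):
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    record: Dict[str, object] = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(str(record[key]))
    record["OPT"] = fields[11:]
    return record


def decode_flag(flag: int) -> List[str]:
    """Name the FLAG bits (from the subset above) that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


if __name__ == "__main__":
    toy = "\t".join(["read1", "16", "Chr2", "100000", "60", "50M",
                     "*", "0", "0", "A" * 50, "I" * 50, "NM:i:0"])
    record = parse_sam_line(toy)
    print(record["RNAME"], record["POS"], decode_flag(int(str(record["FLAG"]))))
```

For BAM files, which are binary, a library or `samtools view` is needed to obtain these text lines first.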
First the basics – NGS 101
 How to visualize an alignment?
• Use a genome browser
What is a Genome Browser?
Graphical interface for display of genomic information from biological databases
Besides genome sequence, they provide additional data:
• known and predicted genes
• ESTs
• mRNAs
• CpG islands
• assembly gaps and coverage
• chromosomal bands
• homology to other organisms
• RNA-seq data
• Transcription factor binding sites
• GC percent
• Splicing variants
• Known SNPs
• Associated publications
• Sequence repeats
Viewing reads in browser
  If your genome is available via the UCSC genome browser http://genome.ucsc.edu/, import the BAM file into the UCSC genome browser by hosting the file on a server and providing the link.
  If your genome is not in UCSC, use another browser such as IGV http://www.broadinstitute.org/igv/ or IGB http://bioviz.org/igb/
•  Import genome (fasta)
•  Import annotations (gff3 or bed format)
•  Import data (bam)
Data from ENCODE, Expression (RNA-Seq), Methylation and Transcription Factor binding (ChIP-Seq) and more.
Use UCSC Custom Track to display data
Next-gen sequencing data can also be imported, typically by hosting a BAM file on a server and providing the link
  It is used to display your own data and annotations
  A variety of formats are accepted (click on the links to file types for more information)
  Data remains available for a limited time after upload
Integrative Genomics Viewer (IGV)
http://www.broadinstitute.org/igv/
  Java tool that runs on the user’s computer
  Allows upload of custom annotations in many formats
•  Add your genome (for example, the latest P. falciparum version) in fasta format.
•  Add gene expression (gct format) or sequence alignments (bam format).
•  Add custom annotations (bed, wig, gff3 format).
Example of use of IGV to visualize custom genome and data
Reads imported in bam format
Annotation imported in bed format
Beyond the basics – a growing list of NGS applications
High throughput sequencing:
•  RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense)
•  Epigenetics (ChIP-Seq, MNase-seq, Bisulfite-Seq)
•  CNV, structural variations
•  Targeted resequencing, “exome analysis”
•  Whole genome sequencing
•  Metagenomics (16S microbiome, environmental WGS)
•  Somatic mutations
•  Variants in Mendelian diseases
•  De novo genome assembly
RNA-Seq, ChIP-Seq, Resequencing, Variant Analysis…
Most applications of NGS require alignment to a known genome as the first step
Slide modified from Andrew Oler (BCBB)
How to align reads to a genome?
  Step 1: Choose an appropriate alignment software http://seqanswers.com/wiki/Software
•  Common tools:
–  Bowtie: fast, accurate (e.g. for ChIP-Seq)
–  BWA: fast, accurate, gapped alignment (variant analysis)
–  TopHat: uses Bowtie for initial mapping and then maps junctions (good for RNA-seq mapping)
–  GSMapper, MIRA: developed for 454 Roche data
–  QIIME, mothur: suites for processing and alignment of 16S rRNA amplicon data for microbiome analysis
How to align reads to a genome?
  Step 2: Map the reads to generate an alignment file (BAM).
To visualize BAM output files, sort and index the output file using tools such as samtools or Picard.
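The sort-and-index step can be scripted. The sketch below builds the two samtools commands and runs them with `subprocess`. It assumes samtools 1.x is installed and on the PATH, where `samtools sort -o OUT IN` and `samtools index IN` are the expected forms (older 0.1.x releases took an output prefix instead). The helper names and the `.sorted.bam` naming are hypothetical choices for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def sort_index_commands(bam_path: str) -> List[List[str]]:
    """Build the samtools commands that sort a BAM by coordinate and then index it."""
    source = Path(bam_path)
    sorted_bam = source.with_name(source.stem + ".sorted.bam")
    return [
        ["samtools", "sort", "-o", str(sorted_bam), str(source)],
        ["samtools", "index", str(sorted_bam)],
    ]


def sort_and_index(bam_path: str) -> None:
    """Run the commands in order; raises CalledProcessError if samtools fails."""
    for command in sort_index_commands(bam_path):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Printing rather than running keeps the example safe without samtools installed.
    for command in sort_index_commands("sample.bam"):
        print(" ".join(command))
```

Genome browsers such as IGV expect the index file to sit next to the sorted BAM.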
Counting experiments (RNA-seq)
Methods
  Reverse transcribe to cDNA
  Prepare library (usually paired end and/or strand specific)
Features
  A design for capture is not required
  Alignment depth is proportional to the abundance of the transcript
Applications
  Identify coding sequences, miRNAs, alternative splicing, antisense transcripts
  Quantify differential expression
RPKM - Reads Per Kilobase of exon model per Million mapped reads (Haas & Zody, 2010)
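Read literally, the RPKM definition is the read count for a gene divided by the exon model length in kilobases and by the total mapped reads in millions. A minimal sketch with made-up numbers; the function name `rpkm` is a hypothetical helper.

```python
def rpkm(reads_in_gene: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of exon model per Million mapped reads.

    RPKM = reads * 1e9 / (exon length in bp * total mapped reads)
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return reads_in_gene * 1e9 / (exon_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Toy numbers: 500 reads on a 2 kb exon model in a library of 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

The normalization makes genes of different lengths and libraries of different depths roughly comparable.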
Strategies for Mapping Junction Reads
  Split reads and align separately to reference
•  Sometimes based on intermediate reference of reconstructed splice junction sequences
•  Finds known and novel splice sites
•  e.g., TopHat, SOAPsplice, Trinity
Frontiers in Genetics, Huang 2011
Slide courtesy of Andrew Oler (BCBB)
Strand-specific RNA-seq can more easily reveal antisense transcript regulation
Counting experiments (ChIP-seq)
(Chromatin Immunoprecipitation and Sequencing)
Features
  Allows genome-wide discovery of protein-DNA interactions (e.g. transcription factor, histone modification)
  DNA and proteins are cross-linked and purified; then bound DNA is analyzed by massively parallel short-read sequencing
  It is cheaper and provides a better signal-to-noise ratio than ChIP-chip, and it does not depend on probes
Counting experiments (ChIP-seq)
Features
  Analysis typically involves mapping, peak detection and binding motif analysis
  Challenges include scoring diffuse or low intensity peaks in relation to background and coverage
  Common tools: USeq, MACS
http://bit.ly/qLjRGA
ChIP-seq Downstream Analysis
Supplemental Table 2: D1 Histone-enriched loci (Illumina GAII, FDR < 0.0001)
GO Category | Total Genes | Changed Genes | Enrichment | FDR
Cell fate commitment | 75 | 60 | 1.59848 | 0
Sequence-specific DNA binding | 424 | 337 | 1.588112 | 0
Cellular morphogenesis during differentiation | 125 | 99 | 1.582495 | 0
Cell projection organization and biogenesis | 169 | 131 | 1.548823 | 0
Cell part morphogenesis | 169 | 131 | 1.548823 | 0
Embryonic morphogenesis | 88 | 68 | 1.543986 | 0
Regionalization | 82 | 63 | 1.535126 | 0
Neurogenesis | 221 | 168 | 1.518918 | 0
Wnt receptor signaling pathway | 107 | 80 | 1.493907 | 0
Regulation of cell differentiation | 119 | 88 | 1.477587 | 0
Regulation of transcription from RNA polymerase II promoter | 99 | 72 | 1.453164 | 0
Organ morphogenesis | 304 | 221 | 1.452566 | 0
Embryonic development | 226 | 164 | 1.449949 | 0
Regulation of developmental process | 191 | 138 | 1.443653 | 0
Voltage-gated ion channel activity | 171 | 123 | 1.43723 | 0
Nervous system development | 604 | 433 | 1.432413 | 0
Cation channel activity | 228 | 162 | 1.419703 | 0
Transcription factor activity | 791 | 552 | 1.394376 | 0
Muscle development | 136 | 94 | 1.38104 | 0
Other panels: # Peaks Found in Different Tissues; Allele-specific Binding
Oler et al., NSMB, 2010; Mikkelsen et al., Nature, 2007; Park, Nat Rev Genet, 2009; Barski et al., Cell, 2007
Slide courtesy of Andrew Oler / Vijay Nagarajan (BCBB)
Are you still awake?
De novo genome assembly
ALLPATHS-LG
http://www.broadinstitute.org/news/2787
De novo genome assembly
  A good assembly needs:
•  Library preparation that minimizes GC bias, which leads to poor coverage
•  High coverage (e.g. 100-fold Illumina) with low error rate
•  For a small genome, if possible, add 50-fold PacBio coverage (1500 bp read length) to reduce the number of contigs. Alternatively, use mate pairs and paired ends of various insert sizes.
  De novo assemblers for large genomes
•  ALLPATHS-LG and DISCOVAR – developed and recommended by the Broad Institute http://www.broadinstitute.org/science/programs/genome-biology/crd
•  SOAPdenovo – developed and used by BGI http://1.usa.gov/oTUrWC
•  ABySS http://www.bcgsc.ca/platform/bioinfo/software/abyss
  De novo assemblers for smaller genomes
•  VELVET
•  NEWBLER (454)
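The fold-coverage figures mentioned above follow from a simple average: total sequenced bases divided by genome size. The sketch below computes that average and the number of reads needed for a target depth. It ignores uneven coverage, duplicates and unmappable regions, so real projects need a margin on top; the function names and numbers are hypothetical.

```python
import math


def mean_coverage(read_count: int, read_length: int, genome_size: int) -> float:
    """Average depth = (number of reads * read length) / genome size."""
    return read_count * read_length / genome_size


def reads_needed(target_coverage: float, genome_size: int, read_length: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Toy case: a 5 Mb bacterial genome sequenced with 100 bp reads.
    print(mean_coverage(1_000_000, 100, 5_000_000))
    print(reads_needed(100, 5_000_000, 100))
```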
Related publications
http://1.usa.gov/id8h5d
Use case: Panda Genome
Published in Nature, 2010
•  SOAPdenovo (de Bruijn graph algorithm)
•  56-fold coverage
•  500 bp insert paired end
•  2 kb mate pair
•  Genome was 94% complete
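The de Bruijn graph idea named above can be shown on a toy scale: every k-mer in the reads becomes an edge from its first k-1 bases to its last k-1 bases, and contigs correspond to paths through that graph. This sketch is only an illustration of the data structure. It is not SOAPdenovo's implementation and omits error correction, reverse complements and graph simplification; `de_bruijn_graph` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def de_bruijn_graph(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it in the reads."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


if __name__ == "__main__":
    # Two overlapping toy reads from the sequence ACGTC.
    for node, successors in de_bruijn_graph(["ACGT", "CGTC"], 3).items():
        print(node, "->", successors)
```

Overlapping reads share k-mers, so their edges line up in the graph without any pairwise read-to-read alignment.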
Image courtesy of Zhihe Zhang
In Scientific American
De novo genome assembly
Asian Honey Bee (published January 2015)
  A 238 Mbp draft of the A. cerana genome was produced, yielding 10,651 genes.
•  72% of the A. cerana-specific genes had more than one GO term, and 1,696 enzymes were categorized into 125 pathways.
•  Genes involved in chemoreception and immunity were carefully identified and compared to those from other sequenced insect models. These included 10 gustatory receptors, 119 odorant receptors, 10 ionotropic receptors, and 160 immune-related genes.
•  3 libraries
•  Paired end: 500 bp
•  Mate pair: 3 kb and 10 kb
•  2,430 scaffolds
•  RNA-seq data also assembled
•  Tools: AllPaths-LG, RepeatMasker
•  RNA-seq tools: Trinity, TopHat, Cufflinks
Schematic overview of the SOAPdenovo algorithm
http://1.usa.gov/oTUrWC
•  Preassembly sequencing error correction
•  Contig assembly
•  Scaffolding
•  Gap closure
Metagenomics and microbiome analysis
Analysis methods:
•  Reference based analysis
•  16S rRNA – OTU based methods
•  Shotgun data (454, Illumina)
•  Assign taxonomy (RDP classifier, BLAST)
•  Pipelines for 16S rRNA: QIIME, mothur
•  Other tools: MEGAN, CARMA, MetaPhyler
•  De novo assembly of WGS and functional analysis of microbiomes
•  Methods are under development with the goal of dealing with insufficient coverage, sequencing errors, repeats
•  Tools: MG-RAST, metAMOS, HUMAnN
•  It looks at gene classes, metabolic pathways
http://bit.ly/o4dGqH http://www.hmpdacc.org/
Sample study: skin microbiome
  The skin is an ecosystem, host to a microbial milieu that, for the most part, is harmless.
  Analysis of 16S ribosomal RNA genes reveals a greater diversity of organisms than has been found by culture-based methods
  The cutaneous immune system modulates colonization by the microbiota and is also vital during infection and wounding. Dysregulation of the skin immune response is evident in several skin disorders
Elizabeth A. Grice & Julia A. Segre, Nature Reviews Microbiology 9, 244-253
Recommended software: mothur, QIIME
Variant Analysis
…like finding a needle in a ‘deep’ haystack
Variant = any position that differs from a specified reference sequence
  SNPs – single nucleotide polymorphisms
  Indels – insertions and deletions
  CNVs – copy number variations
  SVs – structural variations
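The definition of a variant as a difference from a reference can be made concrete with a toy comparison. This sketch reports single-base mismatches between a reference and a sample sequence of the same length. It assumes the two are already aligned without gaps, so it cannot find indels, CNVs or structural variants, and it is not a variant caller (real callers weigh many reads and their base qualities). `find_mismatches` is a hypothetical name.

```python
from typing import List, Tuple


def find_mismatches(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for every mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles gap-free, equal-length sequences")
    return [
        (index + 1, ref_base, alt_base)
        for index, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


if __name__ == "__main__":
    print(find_mismatches("ACGTAC", "ACCTAC"))
```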
Efforts at creating databases of variants:
HapMap Project
•  Started in 2002 with the goal of describing patterns of human genetic variation and creating a haplotype map using SNPs present in at least 1% of the population, which were deposited in dbSNP.
•  It used 269 individuals.
Haplotypes – adjacent SNPs that are inherited together
1000 Genomes
•  Started in 2008 with the goal of sequencing at least 1000 individuals (about 2,500 samples at 4X coverage), interrogating 1000 gene regions in 900 samples (exome analysis), and finding most genetic variants with allele frequencies above 1% (down to 0.1% in coding regions), as well as indels and structural variants
•  Data are available to the public at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/ or via Amazon Cloud http://s3.amazonaws.com/1000genomes
ENCODE: Encyclopedia of DNA Elements
Main Goal:
•  Find all functional elements in the genome
Exome-Seq
Targeted exome capture
  Targets ~20,000 variants near coding sequences and a few rare missense or loss of function variants
  Provides high depth of coverage for more accurate variant calling
  It is starting to be used as a diagnostic tool
Ann Neurol. 2012 Jan;71(1):5-14.
Vendors shown on the slide: Nimblegen, Agilent, Illumina
VCF format (version 4.0)
  Format used to report information about a position in the genome
  Used by the 1000 Genomes Project to report all variants
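Each VCF data line starts with eight fixed, tab-separated columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The sketch below parses those columns and unpacks the semicolon-separated INFO field; genotype columns and header lines are ignored. The input is an illustrative line and `parse_vcf_line` is a hypothetical name.

```python
from typing import Dict


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse the eight fixed columns of one VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated fields")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    info_dict: Dict[str, object] = {}
    if info != ".":
        for item in info.split(";"):
            key, has_value, value = item.partition("=")
            info_dict[key] = value if has_value else True  # bare keys are flags
    return {
        "CHROM": chrom,
        "POS": int(pos),                       # 1-based position
        "ID": variant_id,
        "REF": ref,
        "ALT": alt.split(","),                 # several alternate alleles are allowed
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt,
        "INFO": info_dict,
    }


if __name__ == "__main__":
    example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB"
    record = parse_vcf_line(example)
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"], record["INFO"])
```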
VCF format
http://www.broadinstitute.org/gsa/wiki/index.php/Understanding_the_Unified_Genotyper's_VCF_files
Thank You
Questions or comments? Please contact:
mariam.quinones@niaid.nih.gov
ScienceApps@niaid.nih.gov

Overview of Next Gen Sequencing Data Analysis

  • 1. January 7th, 2015 Mariam Quiñones Computational Biology Specialist Bioinformatics and Computational Biosciences Branch Office of Cyber Infrastructure and Computational Biology
  • 2. Upcoming Seminars on NGS Analysis 2 http://inside.niaid.nih.gov/topic/training/scientificsoftwaretraining/Pages/default.aspx
  • 3. BCBB: A Branch Devoted to Bioinformatics and Computational Biosciences   Researchers’ time is increasingly important   BCBB saves our collaborators time and effort   Researchers speed projects to completion using BCBB consultation and development services   No need to hire extra post docs or use external consultants or developers 3
  • 4. BCBB Staff 4 Bioinformatics Software Developers Computational Biologists Project Managers and Analysts
  • 5. Contact BCBB… NIH users: access a menu of BCBB services on the NIAID Intranet: • http://bioinformatics.niaid.nih.gov/ Outside of NIH: • search “BCBB” on the NIAID Public Internet Page: www.niaid.nih.gov or use this direct link http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx Email us at: • ScienceApps@niaid.nih.gov
  • 6. Why has the scientific community adopted deep sequencing? • Cheaper, faster sequencing • No need for cloning or probes • Many applications • Higher specificity and sensitivity (RNA-seq, ChIP-seq) • More…
  • 7. What is Next Generation Sequencing? Image from: http://s.ngm.com/2009/06/tag-caves/img/01-rumbling-falls-615.jpg • It is sequencing produced by 2nd and 3rd generation instruments (e.g. Illumina, PacBio) • It is also known as High-Throughput Next Generation Sequencing (HT-NGS) or “Deep Sequencing”. It provides deeper coverage than typical Sanger sequencing.
  • 8. Agenda for today   Overview of Next Generation Sequencing   NGS sequencing platforms   NGS Analysis Basics •  File formats •  Quality Control •  Viewing alignment files   Common applications of NGS 8
  • 9. Remember Sanger? • Sanger introduced the “dideoxy method” (also known as Sanger sequencing) in December 1977 • Reads are aligned using tools such as ‘Sequencher’ IMAGE: http://www.lifetechnologies.com
  • 10. Sanger to Next Generation Sequencing • 1977: Sanger, dideoxy chain termination • 1986: Hood et al., fluorescently labeled ddNTPs, partial automation • 1990: NIH begins Human Genome Project • 2001: HGP/Celera draft assembly published (Nature / Science) • 2004: Next-gen sequencing (454 Roche) • 2006: First Solexa sequencer, Genome Analyzer (1G/run) • 1990 – 2003 • “shotgun” • 2007: J. Craig Venter, James Watson
  • 11. Greater vision: Genomics to Bedside   “ Only a population perspective can fulfill the promise of genomic medicine. The scientific landscape for genomics is exciting, and the promise for improving health is great. Applying genomic tools in clinical and public health practice will require a multidisciplinary research collaboration of basic sciences with clinical and population sciences (e.g., epidemiologists; behavioral, social, and communication scientists; health services researchers; and public health practitioners)” Am J Public Health. 2012 January; 102(1): 34–37 11
  • 12. Popular Sequencing Platforms (non-Illumina) • SOLiD 5500 xl series: 320 Gb / 8 day run • GS FLX Titanium XL+: 700 bp reads, up to 700 Mb / 23 hours • PacBio RS II: 500 Mb – 1 Gb / 4 hr run (up to 40 kb read lengths) • Ion Torrent 318: 1.2–2 Gb / 7 hr • Ion Torrent Proton: 2 exomes / 2-4 hr run • Chemistries shown: pH sensing, sequencing by synthesis, single molecule
  • 13. Roche 454 • Pyrosequencing • Used mostly for targeted sequencing such as 16S rRNA • Long reads (>500 bp) but with a high error rate in homopolymer regions • Much lower yield than other platforms
  • 14. PacBio - Single Molecule Real Time • Very long reads that are good for spanning repeats, but with an 11% error rate • Consensus analysis of reads corrects errors • It is good for base modification detection • It can be combined with shorter reads to improve de novo assemblies Genome Res. 2013 Jan;23(1):121-8. doi: 10.1101/gr.141705.112. Epub 2012 Oct 11
  • 15. Ion Torrent sequence detection http://en.wikipedia.org/wiki/Ion_semiconductor_sequencing
  • 16. New kid - MinION 16 Bases identified by changes in current
  • 18. 18 HiSeq X = $1000/genome at 30X And more throughput.. BROAD Institute, Macrogen…
  • 19. Illumina • Sequencing by synthesis • It uses dNTPs containing a terminator (with a fluorescent label) that blocks further polymerization, so only one base is added per cycle http://nxseq.bitesizebio.com/articles/sequencing-by-synthesis-explaining-the-illumina-sequencing-technology/
  • 20. Where are these sequences being stored? Some data repositories include: • NCBI SRA database http://www.ncbi.nlm.nih.gov/sra • European Nucleotide Archive (ENA) http://www.ebi.ac.uk/ena/about/sra_submissions • 1000 Genomes data http://www.1000genomes.org/data • Human Microbiome Project (microbiome data) http://hmpdacc.org/
  • 22. Major challenges when working with sequencing data. We need: • Algorithms for managing (LIMS), analyzing and visualizing data • Reproducible workflows and standards for analysis • Better transfer and data storage technology • Specialized tools for integrating various data types Emerging solutions: Algorithms that can parallelize jobs in a cluster • ABySS (uses MPI), AllPaths-LG, Discovar • GATK (Genome Analysis Toolkit) uses MapReduce (Google’s framework) Web tools with workflow capabilities • Galaxy Bioinformatics https://usegalaxy.org • Various cloud-based solutions (e.g. Illumina BaseSpace) • Lots of open source tools: see http://seqanswers.com/wiki/Software
  • 23. Galaxy https://usegalaxy.org   Makes analysis methods available to the community and facilitates reproducibility via creation of reusable workflows (read Galaxy slides)   Free web service, also compatible with Cloud http://usegalaxy.org/cloud   Open source   Provides a Genome Track Browser to visualize custom data. 23
  • 24. Cartoons from: fixingpcerrors.com and squido.com I have data, where do I start?
  • 25. First the basics – NGS 101: Sequence data • What does a short read look like? • How do I know if the sequencing facility has provided good quality reads? • What should I expect if the sequencing facility has mapped the reads to my genome of interest?
  • 27. Common Sequence file formats • Next gen sequence file formats are based on the commonly used FASTA format:
      >sequence_ID and optional comments
      ATTCCGGTGCGGTGCGGTGCTGCCGTGCCGGTGC
      TTCGAAATTGGCGTCAGT
    • Phred quality scores per base were added:
      @HWI-ST406:207:D1DGFACXX:8:1101:20481:2058 1:N:0:AGTCAA
      CATGGGGATCGAATTCATCGCCGTCCCCTCTGTTCCGATTTATTCCATATGTGCTTCGCAACAACGCTTTCTCACAGAATACAGGAGCTTCTATACTGTA
      +
      BBBFFFFFFFFFFIIIIIIFFIIFFIIIFFIIFFFIFBFIIIIIIIIIFIIFBFFIFFFBFFBFFBFBFFFFFFFBBFFFFFFBBBBBBBBBBBFFFBFB
  • 28. Raw sequence file formats • FASTQ format (FASTA format with quality values for each base):
      @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
      AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA    (base calls)
      +
      BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@    (base quality + 33)
    • Full read header description: @<instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>
    • A space separates the read ID from the rest of the header
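The header layout described on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the colon-delimited Illumina layout shown above; the function name `parse_illumina_header` and the dictionary keys are hypothetical, and headers from other instruments or older pipelines may not follow this pattern.

```python
def parse_illumina_header(header):
    """Split an Illumina-style FASTQ header line into named fields.

    Assumes the layout shown on the slide; other header styles will fail.
    """
    if not header.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # A single space separates the read ID from the read description.
    read_id, description = header[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x_pos, y_pos = read_id.split(":")
    read_number, is_filtered, control_number, barcode = description.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x_pos": int(x_pos),
        "y_pos": int(y_pos),
        "read_number": int(read_number),
        "is_filtered": is_filtered == "Y",  # "Y" means the read failed the filter
        "control_number": int(control_number),
        "barcode": barcode,
    }


fields = parse_illumina_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
print(fields["lane"], fields["tile"], fields["barcode"])
```

Returning a dictionary keeps the sketch short; a real pipeline would more likely rely on an established FASTQ parser.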
  • 29. Quality values • Quality scores are normally expected up to 40 on the Phred scale and are encoded as ASCII characters <http://en.wikipedia.org/wiki/ASCII> • Example: BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ • The highest base quality score in this sequence: ‘D’ = (68 - 33) = 35 • If base quality = 35, P = 10^(-35/10) = 0.00032 (about 1 in 3,200 base calls incorrect) From http://en.wikipedia.org/wiki/FASTQ_format
  • 30. Other read formats • SFF (Roche 454 or Ion Torrent) • SFF files contain flowgrams, Phred quality scores and clipping information • 454 reads are often reported as FASTA and QUAL files or converted to FASTQ
  • 31. First the basics – NGS 101: Sequence data • What does a short read look like? • How do I know if the sequencing facility has provided good quality reads? • What should I expect if the sequencing facility has mapped the reads to my genome of interest?
  • 32. Basic Concepts in Quality Control of sequence data • The sequencing facility runs quality control tests to ensure that the run was successful and/or to determine whether a new library is good enough to sequence further. • The user should run quality control tests prior to full bioinformatics analyses – This avoids misinterpretation of the data due to unexpected bias – QC measurements can report the following: percent GC in sample reads; presence of overrepresented k-mers and sequences such as adapters; per-base quality score; distribution of nucleotide bases • After mapping reads to a genome, additional tests can be run to determine: – Mapping error rate – Percent of possible PCR duplicates (reads with the same start and end position in the reference genome) – Distribution of insert size (paired ends)
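The arithmetic on this slide can be reproduced with a short Python sketch. It assumes the Phred+33 encoding used in the example (quality = ASCII code minus 33); the function names are illustrative, not part of any library.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


quals = phred_scores("BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@")
best = max(quals)    # 'D' has ASCII code 68, so 68 - 33 = 35
worst = min(quals)   # '7' has ASCII code 55, so 55 - 33 = 22
print(best, round(error_probability(best), 5))
```

Older Illumina pipelines used an offset of 64 instead of 33, which is why the offset is a parameter here.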
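Two of the QC measurements listed on this slide, percent GC and per-base quality, are simple enough to sketch in Python. This is a toy illustration over in-memory lists with hypothetical function names and made-up reads; tools such as FastQC compute these and many more metrics on full files.

```python
def gc_percent(reads):
    """Percent of G and C bases across all reads."""
    total = sum(len(read) for read in reads)
    gc = sum(read.upper().count("G") + read.upper().count("C") for read in reads)
    return 100.0 * gc / total if total else 0.0


def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position (Phred+33 assumed)."""
    longest = max((len(q) for q in quality_strings), default=0)
    means = []
    for i in range(longest):
        # Shorter reads simply do not contribute to later positions.
        scores = [ord(q[i]) - offset for q in quality_strings if i < len(q)]
        means.append(sum(scores) / len(scores))
    return means


# Toy input, not real data.
reads = ["ACGT", "GGCC"]
quals = ["IIII", "II5!"]
print(gc_percent(reads), mean_quality_per_position(quals))
```

A drop in the per-position mean toward the end of the reads is the usual signal that quality trimming may help.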
  • 33. Demo – Fastq Quality 33
  • 34. Quality control with FastQC http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ • Mean quality per base • Sequence content per base • GC content per read • For filtering or trimming by quality, use tools such as FastX-toolkit, Btrim, PrinSeq.
  • 35. First the basics – NGS 101: Sequence data • What does a short read look like? • How do I know if the sequencing facility has provided good quality reads? • What should I expect if the sequencing facility has mapped (aligned) the reads to my genome of interest?
  • 36. File formats for aligned reads   SAM (sequence alignment map) 36 CTTGGGCTGCGTCGTGTCTTCGCTTCACACCCGCGACGAGCGCGGCTTCT CTTGGGCTGCGTCGTGTCTTCGCTTCACACC Chr_ start end Chr2 100000 100050 Chr_ start end Chr2 50000 50050
  • 37. Most commonly used alignment file formats • SAM (sequence alignment map): unified format for storing alignments to a reference genome • BAM (binary version of SAM): compressed SAM file, normally indexed; commonly used to deliver data • BED: commonly used to report features described by chrom, start, end, name, score, and strand. For example: chr1 11873 14409 uc001aaa.3 0 +
  • 38. SAM/BAM format (sequence alignment map) QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS TLEN SEQ QUAL OPT http://samtools.sourceforge.net/samtools.shtml#8
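As a small illustration of the BED example on this slide, the following Python sketch parses one six-column BED line. The `BedFeature` tuple and `parse_bed_line` function are hypothetical names; the sketch assumes whitespace-separated columns and relies on the BED convention that start is 0-based and end is exclusive, so feature length is end minus start.

```python
from collections import namedtuple

BedFeature = namedtuple("BedFeature", "chrom start end name score strand")


def parse_bed_line(line):
    """Parse the first six columns of a BED line; missing optional columns become '.'."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    optional = fields[3:6] + ["."] * (6 - len(fields[:6]))
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), *optional)


feature = parse_bed_line("chr1 11873 14409 uc001aaa.3 0 +")
length = feature.end - feature.start  # 0-based, half-open interval
print(feature.name, feature.strand, length)
```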
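To make the column list above more concrete, here is a minimal Python sketch that splits one SAM alignment line into its mandatory fields and decodes the bitwise FLAG. The example line is made up for illustration; the field names follow the current SAM specification, where RNEXT and PNEXT correspond to the older names MRNM and MPOS used on the slide. In practice a library such as pysam or the samtools command line would be used instead.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Bit values defined by the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "paired"), (0x2, "proper_pair"), (0x4, "unmapped"),
    (0x8, "mate_unmapped"), (0x10, "reverse"), (0x20, "mate_reverse"),
    (0x40, "first_in_pair"), (0x80, "second_in_pair"), (0x100, "secondary"),
    (0x200, "qc_fail"), (0x400, "duplicate"), (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


def parse_sam_line(line):
    """Split a tab-separated SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["OPT"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record


# Hypothetical alignment line, for illustration only.
record = parse_sam_line("read1\t99\tChr2\t100000\t60\t4M\t=\t100200\t204\tACGT\tIIII\tNM:i:0")
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
```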
  • 39. First the basics – NGS 101  How to visualize an alignment? • Use a genome browser 39
  • 40. What is a Genome Browser? Graphical interface for display of genomic information from biological databases. Besides genome sequence, they provide additional data: • known and predicted genes • ESTs • mRNAs • CpG islands • assembly gaps and coverage • chromosomal bands • homology to other organisms • RNA-seq data • Transcription factor binding sites • GC percent • Splicing variants • Known SNPs • Associated publications • Sequence repeats
  • 41. Viewing reads in browser   If your genome is available via the UCSC genome browser http://genome.ucsc.edu/, import bam format file to the UCSC genome browser by hosting the file on a server and providing the link.   If your genome is not in UCSC, use another browser such as IGV http://www.broadinstitute.org/igv/ , or IGB http://bioviz.org/igb/ •  Import genome (fasta) •  Import annotations (gff3 or bed format) •  Import data (bam)
  • 42. Data from ENCODE, Expression (RNA-Seq), Methylation and Transcription Factor binding (ChIP-Seq) and more.
  • 43. Use UCSC Custom Track to display data • Used to display your own data and annotations • A variety of formats are accepted (click on the links to file types for more information) • Data remains available for a limited time after upload • Next-gen sequencing data can also be imported, typically by hosting a BAM file on a server and providing the link
  • 44. Integrative Genomics Viewer (IGV) http://www.broadinstitute.org/igv/ • Java tool that runs on the user’s computer • Allows upload of custom annotations in many formats • Add your genome (for example P. falciparum latest version) in fasta format. • Add gene expression (gct format) or sequence alignments (bam format). • Add custom annotations (bed, wig format, gff3).
  • 45. Example of use of IGV to visualize custom genome and data • Reads imported in bam format • Annotation imported in bed format
  • 46. Beyond the basics: a growing list of NGS (high throughput sequencing) applications • RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense) • Epigenetics (ChIP-Seq, MNase-seq, Bisulfite-Seq) • CNV, structural variations • Targeted resequencing, “exome analysis” • Whole genome sequencing • Metagenomics (16S microbiome, environmental WGS) • Somatic mutations • Variants in Mendelian diseases • De novo genome assembly
  • 47. Most applications of NGS require alignment to a known genome as the first step • RNA-Seq, ChIP-Seq, Resequencing, Variant Analysis… Slide modified from Andrew Oler (BCBB)
  • 48. How to align reads to a genome? • Step 1: Choose appropriate alignment software http://seqanswers.com/wiki/Software • Common tools: – Bowtie: fast, accurate (e.g. for ChIP-Seq) – BWA: fast, accurate, gapped alignment (variant analysis) – TopHat: uses Bowtie for initial mapping and then maps junctions (good for RNA-seq mapping) – GSMapper, MIRA: developed for Roche 454 data – QIIME, mothur: suites for processing and alignment of 16S rRNA amplicon data for microbiome analysis
  • 49. How to align reads to a genome? • Step 2: Map the reads to generate an alignment file (bam). To visualize bam output files, sort and index the output file using tools such as samtools or Picard.
  • 50. Counting experiments (RNA-seq) Methods • Reverse transcribe to cDNA • Prepare library (usually paired end and/or strand specific) Features • A design for capture is not required • Alignment depth is proportional to the abundance of the transcript Applications • Identify coding sequences, miRNAs, alternative splicing, antisense transcripts • Quantify differential expression RPKM: Reads Per Kilobase of exon model per Million mapped reads (Haas & Zody, 2010)
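The sort-and-index step mentioned on this slide could be scripted roughly as follows. This is a sketch, assuming a samtools 1.x installation on the PATH (older releases used a different `sort` syntax); the file names and the helper names are placeholders. The command lists are built separately so they can be inspected without running anything.

```python
import subprocess


def sort_and_index_commands(bam_path, sorted_path):
    """Build the samtools commands that coordinate-sort and index a BAM file."""
    return [
        ["samtools", "sort", "-o", sorted_path, bam_path],
        ["samtools", "index", sorted_path],
    ]


def sort_and_index(bam_path, sorted_path):
    """Run the commands in order; raises CalledProcessError if samtools fails."""
    for command in sort_and_index_commands(bam_path, sorted_path):
        subprocess.run(command, check=True)


# Placeholder file names; nothing is executed here.
commands = sort_and_index_commands("aligned.bam", "aligned.sorted.bam")
for command in commands:
    print(" ".join(command))
```

Genome browsers such as IGV expect the index file to sit next to the sorted BAM file.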
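The RPKM definition on this slide translates directly into a one-line formula: reads mapped to the gene, divided by exon length in kilobases and by total mapped reads in millions. The Python sketch below uses made-up numbers, and the function name is illustrative.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # Equivalent to read_count / (length in kb) / (mapped reads in millions).
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2,000 bp of exons, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)
```

Normalizing by both length and depth is what allows comparison between genes of different sizes and between libraries sequenced to different depths.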
  • 51. Strategies for Mapping Junction Reads   Split reads and align separately to reference •  Sometimes based on intermediate reference of reconstructed splice junction sequences •  Finds known and novel splice sites •  e.g., TopHat, SOAPsplice, Trinity 51 Frontiers in Genetics, Huang 2011 Slide courtesy of Andrew Oler (BCBB)
  • 52. Strand specific RNA-seq can more easily reveal antisense transcript regulation 52
  • 53. Counting experiments (ChIP-seq: Chromatin Immunoprecipitation and Sequencing) Features • Allows genome-wide discovery of protein-DNA interactions (e.g. transcription factor binding, histone modifications) • DNA and proteins are cross-linked and purified; then bound DNA is analyzed by massively parallel short-read sequencing • It is cheaper and provides a better signal-to-noise ratio than ChIP-chip, and it does not depend on probes
  • 54. Counting experiments (ChIP-seq) Features • Analysis typically involves mapping, peak detection and binding motif analysis • Challenges include scoring diffuse or low intensity peaks in relation to background and coverage • Common tools: USeq, MACS http://bit.ly/qLjRGA
  • 55. ChIP-seq Downstream Analysis. Supplemental Table 2: D1 Histone-enriched loci (Illumina GAII, FDR < 0.0001)
      GO Category | Total Genes | Changed Genes | Enrichment | FDR
      Cell fate commitment | 75 | 60 | 1.59848 | 0
      Sequence-specific DNA binding | 424 | 337 | 1.588112 | 0
      Cellular morphogenesis during differentiation | 125 | 99 | 1.582495 | 0
      Cell projection organization and biogenesis | 169 | 131 | 1.548823 | 0
      Cell part morphogenesis | 169 | 131 | 1.548823 | 0
      Embryonic morphogenesis | 88 | 68 | 1.543986 | 0
      Regionalization | 82 | 63 | 1.535126 | 0
      Neurogenesis | 221 | 168 | 1.518918 | 0
      Wnt receptor signaling pathway | 107 | 80 | 1.493907 | 0
      Regulation of cell differentiation | 119 | 88 | 1.477587 | 0
      Regulation of transcription from RNA polymerase II promoter | 99 | 72 | 1.453164 | 0
      Organ morphogenesis | 304 | 221 | 1.452566 | 0
      Embryonic development | 226 | 164 | 1.449949 | 0
      Regulation of developmental process | 191 | 138 | 1.443653 | 0
      Voltage-gated ion channel activity | 171 | 123 | 1.43723 | 0
      Nervous system development | 604 | 433 | 1.432413 | 0
      Cation channel activity | 228 | 162 | 1.419703 | 0
      Transcription factor activity | 791 | 552 | 1.394376 | 0
      Muscle development | 136 | 94 | 1.38104 | 0
    Other panels: # Peaks Found in Different Tissues; Allele-specific Binding. Oler et al., NSMB, 2010; Mikkelsen et al., Nature, 2007; Park, Nat Rev Genet, 2009; Barski et al., Cell, 2007. Slide courtesy of Andrew Oler / Vijay Nagarajan (BCBB)
  • 56. Are you still awake? 56
  • 57. Beyond the basics: a growing list of NGS (high throughput sequencing) applications • RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense) • Epigenetics (ChIP-Seq, MNase-seq, Bisulfite-Seq) • CNV, structural variations • Targeted resequencing, “exome analysis” • Whole genome sequencing • Metagenomics (16S microbiome, environmental WGS) • Somatic mutations • Variants in Mendelian diseases • De novo genome assembly
  • 58. De novo genome assembly 58 AllPATHS-LG http://www.broadinstitute.org/news/2787
  • 59. De novo genome assembly • A good assembly needs: – Library preparation that minimizes GC bias, which leads to poor coverage – High coverage (e.g. 100-fold Illumina) with low error rate – For a small genome, if possible, add 50-fold PacBio coverage (1500 bp read length) to reduce the number of contigs. Alternatively, use mate pairs and paired ends of various insert sizes. • De novo assemblers for large genomes – ALLPATHS-LG and DISCOVAR: developed and recommended by the Broad Institute http://www.broadinstitute.org/science/programs/genome-biology/crd – SOAPdenovo: developed and used by BGI http://1.usa.gov/oTUrWC – ABySS http://www.bcgsc.ca/platform/bioinfo/software/abyss • De novo assemblers for smaller genomes – Velvet – Newbler (454) Related publications http://1.usa.gov/id8h5d
  • 60. De novo genome assembly. Use case: Panda Genome, published in Nature 2010 • SOAPdenovo (de Bruijn graph algorithm) • 56-fold coverage • 500 bp insert paired end • 2 kb mate pair • Genome was 94% complete Image courtesy of Zhihe Zhang, in Scientific American
  • 61. Asian Honey Bee (published January 2015) • 238 Mbp draft of the A. cerana genome, with 10,651 genes identified • 72% of the A. cerana-specific genes had more than one GO term, and 1,696 enzymes were categorized into 125 pathways • Genes involved in chemoreception and immunity were carefully identified and compared to those from other sequenced insect models. These included 10 gustatory receptors, 119 odorant receptors, 10 ionotropic receptors, and 160 immune-related genes. Sequencing and assembly: • 3 libraries • Paired end: 500 bp • Mate pair: 3 kb and 10 kb • 2,430 scaffolds • RNA-seq data also assembled • Tools: AllPaths-LG, RepeatMasker • RNA-seq tools: Trinity, TopHat, Cufflinks
  • 62. Schematic overview of SOAPdenovo algorithm http://1.usa.gov/oTUrWC • Preassembly sequencing error correction • Contig assembly • Scaffolding • Gap closure
  • 63. Beyond the basics: a growing list of NGS (high throughput sequencing) applications • RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense) • Epigenetics (ChIP-Seq, MNase-seq, Bisulfite-Seq) • CNV, structural variations • Targeted resequencing, “exome analysis” • Whole genome sequencing • Metagenomics (16S microbiome, environmental WGS) • Somatic mutations • Variants in Mendelian diseases • De novo genome assembly
  • 64. Metagenomics and microbiome analysis. Analysis methods: • Reference-based analysis • 16S rRNA: OTU-based methods • Shotgun data (454, Illumina) • Assign taxonomy (RDP classifier, BLAST) • Pipelines for 16S rRNA: QIIME, mothur • Other tools: MEGAN, CARMA, MetaPhyler • De novo assembly of WGS and functional analysis of microbiomes • Methods are under development with the goal of dealing with insufficient coverage, sequencing errors and repeats • Tools: MG-RAST, metAMOS, HUMAnN • Functional analysis looks at gene classes and metabolic pathways http://bit.ly/o4dGqH http://www.hmpdacc.org/
  • 65. Sample study: skin microbiome   The skin is an ecosystem, host to a microbial milieu that, for the most part, is harmless.   Analysis of 16S ribosomal RNA genes reveals a greater diversity of organisms than has been found by culture-based methods   The cutaneous immune system modulates colonization by the microbiota and is also vital during infection and wounding. Dysregulation of the skin immune response is evident in several skin disorders 65 Elizabeth A. Grice & Julia A. Segre Nature Reviews Microbiology 9, 244-253 Recommended software: mothur, qiime
  • 66. A growing list of applications (high throughput sequencing) • RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense) • Epigenetics (ChIP-Seq, MNase-seq, Bisulfite-Seq) • CNV, structural variations • Targeted resequencing, “exome analysis” • Whole genome sequencing • Metagenomics (16S microbiome, environmental WGS) • Somatic mutations • Variants in Mendelian diseases • De novo genome assembly
  • 67. Variant Analysis …like finding a needle in a ‘deep’ haystack • SNPs: single nucleotide polymorphisms • Indels: insertions and deletions • CNVs: copy number variations • SVs: structural variations Variant = any position that differs from a specified reference sequence
  • 68. Efforts at creating databases of variants: HapMap Project • Project that started in 2002 with the goal of describing patterns of human genetic variation and creating a haplotype map using SNPs present in at least 1% of the population, which were deposited in dbSNP. • It used 269 individuals. Haplotypes: adjacent SNPs that are inherited together 1000 Genomes • Started in 2008 with the goals of sequencing at least 1000 individuals (about 2,500 samples at 4X coverage), interrogating 1000 gene regions in 900 samples (exome analysis), and finding most genetic variants with allele frequencies above 1% (down to 0.1% in coding regions), as well as indels and structural variants • Data are available to the public at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/ or via Amazon Cloud http://s3.amazonaws.com/1000genomes
  • 69. ENCODE: Encyclopedia of DNA Elements Main Goal: • Find all functional elements in the genome
  • 70. Exome-Seq: targeted exome capture • Targets ~20,000 variants near coding sequences and a few rare missense or loss of function variants • Provides high depth of coverage for more accurate variant calling • It is starting to be used as a diagnostic tool Ann Neurol. 2012 Jan;71(1):5-14. • Nimblegen • Agilent • Illumina
  • 71. VCF format (version 4.0) • Format used to report information about a position in the genome • Used by the 1000 Genomes Project to report all variants
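To show what "information about a position" looks like in practice, here is a minimal Python sketch that parses one VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The example record and the function name are illustrative only, and per-sample genotype columns are ignored; a dedicated parser such as pysam or cyvcf2 would normally be used.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a tab-separated VCF data line."""
    chrom, pos, var_id, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    info_fields = {}
    if info != ".":
        for item in info.split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                info_fields[key] = value
            else:
                info_fields[item] = True  # flag entries carry no value
    return {
        "chrom": chrom,
        "pos": int(pos),            # 1-based position on the reference
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),      # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_fields,
    }


# Illustrative record, not taken from a real call set.
variant = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB")
print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
```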