2023-GenomicaFuncional y Biocomputacion-Day1


DAY 1

Genómica Funcional y
Biocomputación-2023/2024

Day 1: Introduction to transcriptomics. RNAseq. Practicals (Galaxy and Bioconductor).

Day 2: Discussion of RNAseq practicals (Galaxy). A real-case analysis explained in depth. Practicals (Galaxy).

Day 3: Introduction to proteomics and proteogenomics. Practicals (Galaxy).

Day 4: Introduction to networks and RNA interaction networks.


In Molecular Biology we focus on
1957, Crick
-Omics Universe

Genomics Transcriptomics
Proteomics

-Add your favourite “epi” stuff


-Omics combos

Genomics + Transcriptomics: Gene Regulatory Networks
Genomics + Epigenomics (Genome 3D, ATACseq, etc.) + Proteomics: Protein-DNA regulatory networks
Transcriptomics + Proteomics: Proteogenomics
Proteomics + Proteomics: Protein Interaction Networks
Russ F. Doolittle: January 10,
1931 – October 11, 2019

Roots are in the sequences


The Roots

“Sequences, the simple order of individual units in biological polymers, are at the heart of bioinformatics, and the search for relationships among them and the reconstruction of their histories has arguably proved the most informative of biological inquiries.”
The founder
Margaret Dayhoff
Illustration by Gloria Fuentes (@glogliiita)

• Trained in math and quantum chemistry.
• Associate director of the then National Biomedical Research Foundation (now NIH).
• Wrote seminal FORTRAN programs to derive amino acid sequences from the partial overlaps of fragmented amino acid sequences (from months to minutes!).

Developments

• Created the Protein Atlas (a team of 8 women and 1 man).
• Her work on organelles was essential for the scientific community to accept Lynn Margulis’ endosymbiotic theory.
• Worked with Sagan to model planetary atmospheres.
• Realized the potential of computer applications to nucleic acids and gene sequences.

“The mother and father of Bioinformatics” (D. Lipman)


Modified from Aaron Quinlan
https://github.com/quinlan-lab/applied-computational-genomics
Transcriptomics

Total RNA (figure legend: all organisms / eukaryotes only)

• Coding RNA (4 % of total): pre-mRNA (hnRNA) → mRNA
• Functional RNA (96 % of total): pre-rRNA → rRNA; pre-tRNA → tRNA; snRNA; snoRNA; miRNA; siRNA


Transcriptomics

The complete set of RNA transcripts produced from the genome under given conditions, at a particular place and time.
Technologies: microarrays, RNA-seq

Comparison of gene expression under different conditions (before/after treatment, during development, cancer vs normal cells, …)

Transcriptome assembly: discovery of novel genes, non-coding RNAs and novel splicing variants of known genes

An alternative to genome sequencing and assembly for a species with an unknown genome, when we are interested only in expressed genes
Transcriptomics- Techniques

• mRNA-seq
• Exome capture
• Targeted
• Small RNA (miRNAs, piRNAs, sncRNAs)
• Total RNA
• Ribosome profiling
• Single Cell RNA-Seq
Transcriptomics- Applications

Discovery:
• Transcripts
• Isoforms
• Splice junctions
• Fusion genes

Differential expression:
• Gene level expression changes
• Relative isoform abundance
• Splicing patterns

Variant calling
From Microarray to RNAseq

That is, from data-poor to data-intensive


A reminder: Microarray
Microarray (more limited):
• Requires prior knowledge
• Higher throughput
• Analysis is more user-friendly

RNA-seq (getting cheaper):
• Comprehensive view
• Best dynamic range
• Isoform discovery
• Can detect SNVs
From Microarray to RNAseq
• The dynamic range concept

Huge dynamic range of mRNA abundance (some mRNAs have only a few copies per cell, while the most abundant ones have >10,000 copies per cell).

When log-transformed RNA-seq and microarray data are plotted against each other, the points are not uniformly distributed around the trend line.

The dynamic range of RNA-seq depends on sequencing depth (microarray: roughly fixed dynamic range).

Zhao et al. (2014), PLOS One


RNAseq= mRNAs

Experimental Design

Sequencing Design

Quality control

Alignment & quantification

Differential Expression

Functional profiling
Experimental Design

RNA integrity number (RIN)

sRNA: size select by PAGE or kit → ligate RNA adapter → convert to cDNA → construct library → sequence
mRNA: polyA select → fragment → convert to cDNA → construct library → sequence

Ideally, samples with a high RIN (8) are used in RNA sequencing experiments.
New transcript discovery
Quantitation
Experimental Design

Multiplexing

modified from Malone JH, Oliver B (2011) BMC Biol.


Experimental Design
Considerations

Cheaper
Improves mapping of repeats and across exon-exon junctions
Improves accuracy for lowly expressed genes
Experimental Design
Considerations: adequate read depth to detect low-expression genes
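One rough way to reason about this: if a gene accounts for a fraction f of all sequenced fragments, the number of reads it receives at depth N is approximately Poisson with mean N·f. The sketch below is only a back-of-the-envelope model under that assumption; the function name and the example numbers are illustrative and not taken from the slides.

```python
import math

def prob_at_least(n_reads: int, fraction: float, min_reads: int) -> float:
    """P(X >= min_reads) for X ~ Poisson(n_reads * fraction)."""
    lam = n_reads * fraction
    term = math.exp(-lam)      # P(X = 0)
    below = 0.0
    for k in range(min_reads):
        below += term          # accumulate P(X = k) for k < min_reads
        term *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return 1.0 - below

# A gene making up one fragment in a million, at two hypothetical depths:
for depth in (1_000_000, 10_000_000):
    print(depth, prob_at_least(depth, 1e-6, min_reads=5))
```

Raising the depth raises the chance of seeing enough reads from a rare transcript, which is the point the slide makes.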

http://info.l7informatics.com/blog/ngs-101-biology-is-a-big-data-problem
Experimental Design
More sequences or more replication?

For high and medium expression: more replicates

For low expression: more replicates and more depth

Bioinformatics. 2014 Feb 1;30(3):301-4. doi: 10.1093/bioinformatics/btt688.
Experimental Design

A protein nanopore is set in an electrically resistant polymer membrane. An ionic current is passed through the nanopore by setting a voltage across this membrane. If an analyte passes through the pore or near its aperture, this event creates a characteristic disruption in current (as shown in the diagram below). Measurement of that current makes it possible to identify the molecule in question.

Long reads = many thousands of bases

https://nanoporetech.com/how-it-works
Sequencing on my laptop (MinION, Nanopore)
Experimental Design
Experimental Design
Sequencing Design
Sequencing Design
Sequencing Design
Distribution of RNA species in NGS samples after poly-A enrichment (mRNA) and rRNA depletion (whole transcriptome) sequencing

More depth is needed in total RNA to achieve the same sensitivity in measuring DE.

Enriched mRNA gives more reads from exons.

http://www.exiqon.com/whole-transcriptome-ngs
Sequencing Design

AVOID CONFOUNDING VARIABLES!
The NOISE PROBLEM
Sampling Bias
Sampling Bias
Process
Process

Duplicates (PCR or optical)

Index swapping:

Sequencing errors
File Quality control
FASTQC Quality control

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
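The per-base quality values that such reports summarise come from the fourth line of each FASTQ record. A minimal sketch of how they decode, assuming the common Phred+33 encoding (older files may use a different offset); the helper names are illustrative and this is not FastQC's own code:

```python
from typing import List

def phred_scores(quality_line: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def mean_quality(quality_line: str, offset: int = 33) -> float:
    """Average Phred score of one read."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores)

quality = "IIIIIIII5555"   # hypothetical quality string for a 12-base read
print(phred_scores(quality))
print(mean_quality(quality))
```

A Phred score of 20 corresponds to a 1 in 100 chance of a wrong base call, and 30 to 1 in 1000.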
MultiQC Quality control

https://multiqc.info/examples/rna-seq/multiqc_report.html#general_stats
Quality control
Quality control

• Useful to diagnose problems with the library preparation
• Contamination can produce peaks
• rRNA also produces peaks if a total RNA approach is used
Quality control
FastQ Screen: compares a sub-sample of your library against different genomes

Example: mouse DNA-seq experiment with 10% human contamination

What proportion of reads comes from each reference?

To check whether the NGS reads come from the intended genome

Samples are often contaminated (e.g. human material ends up in mouse cDNA)
Quality control

https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
Quality control
% of reads mapped to a reference
Normally 80-90% will map (the rest do not, for a variety of reasons)
Quality control
PCR duplicates

A high duplication level can indicate a problem with PCR amplification
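A crude way to get a feel for the duplication level is to count identical read sequences. This is only a sketch under that simplifying assumption: dedicated tools usually work on mapped coordinates instead, and in RNA-seq very highly expressed genes also produce identical reads without any PCR problem.

```python
from collections import Counter

def duplication_rate(reads):
    """Fraction of reads that are extra copies of an already seen sequence."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total

# Hypothetical toy reads: one sequence appears twice
reads = ["ACGT", "ACGT", "TTTT", "GGGG"]
print(duplication_rate(reads))
```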


DAY 2
Analyses Overview

HiSAT2

FeatureCounts
Analyses Overview
Alignment and quantification
Alignment and quantification
Reference, splice-aware

• Reads mapped directly to a reference genome (allows splice junctions)
• Requires a large amount of memory and CPU time (usually run on clusters)
• For discovery of isoforms: FULL alignment
• Tools: STAR, HiSAT2, TopHat2


Alignment and quantification
Pseudoalignment

• Maps to a reference transcriptome
• Does not identify the positions of the reads in the transcripts, only their potential transcripts of origin
• Extremely fast
• Tools: Salmon, Kallisto

From Bray et al. Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology, 2016

Pseudoalignment
Kallisto, Salmon…

Overview of kallisto. The input consists of a reference transcriptome and reads from an RNA-seq experiment.

(a) An example of a read (in black) and three overlapping transcripts with exonic regions as shown.

(b) An index is constructed by creating the transcriptome de Bruijn Graph (T-DBG) where nodes (v1, v2, v3, …) are k-mers, each transcript corresponds to a colored path as shown and the path cover of the transcriptome induces a k-compatibility class for each k-mer.

(c) Conceptually, the k-mers of a read are hashed (black nodes) to find the k-compatibility class of a read.

(d) Skipping (black dashed lines) uses the information stored in the T-DBG to skip k-mers that are redundant because they have the same k-compatibility class.

(e) The k-compatibility class of the read is determined by taking the intersection of the k-compatibility classes of its constituent k-mers.
Alignment and quantification

https://bioinfo.iric.ca/understanding-how-kallisto-works/
Alignment and quantification
Building the T-DBG graph and the kallisto index.
• All transcript sequences are decomposed into k-mers (here k=5) to construct the colored de Bruijn graph.
• The idea is that each different transcript will lead to a different path in the graph
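The indexing idea can be sketched with a plain dictionary that maps each k-mer to the set of transcripts containing it (its "colors"). This is a simplification for illustration only: kallisto stores a real colored de Bruijn graph, and the transcript names and sequences below are hypothetical.

```python
from collections import defaultdict

def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts in which it occurs."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return {kmer: frozenset(names) for kmer, names in index.items()}

# Hypothetical toy transcriptome, k = 5 as on the slide
transcripts = {"t1": "ACGTGA", "t2": "ACGTGT"}
index = build_index(transcripts, k=5)
for kmer, names in sorted(index.items()):
    print(kmer, sorted(names))
```

K-mers shared by several transcripts carry several colors; k-mers unique to one transcript are what distinguish its path.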

[Figure: colored de Bruijn graph built from the example transcripts, with k-mer nodes such as ACGTG]

https://bioinfo.iric.ca/understanding-how-kallisto-works/
Alignment and quantification
Reads are decomposed into k-mers (k=5 here too) and the pre-built index is used to determine the k-compatibility class of each k-mer.

For read 1, the intersection of all the k-compatibility classes of its k-mers suggests that it might come from transcript 1 or transcript 2.
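The intersection step can be sketched as follows. The index here is a hand-written, hypothetical k-mer to transcript-set mapping, and treating a k-mer that is absent from the index as an empty class is an assumption of this sketch, not necessarily what kallisto or Salmon do.

```python
def k_compatibility_class(read, index, k):
    """Intersect the transcript sets of all k-mers in the read."""
    result = None
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # Assumption: an unknown k-mer is compatible with no transcript
        transcripts = index.get(kmer, frozenset())
        result = transcripts if result is None else result & transcripts
    return result if result is not None else frozenset()

# Hypothetical pre-built index (k = 5)
index = {
    "ACGTG": frozenset({"t1", "t2"}),
    "CGTGA": frozenset({"t1"}),
    "CGTGT": frozenset({"t2"}),
}
print(sorted(k_compatibility_class("ACGTG", index, 5)))   # shared k-mer only
print(sorted(k_compatibility_class("ACGTGA", index, 5)))  # includes a t1-specific k-mer
```

A read made only of shared k-mers stays ambiguous between transcripts; a single transcript-specific k-mer narrows the class.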

https://bioinfo.iric.ca/understanding-how-kallisto-works/
Summarization

Read counts for a gene can be higher because

• There are more transcripts
• The transcripts are longer
Summarisation/Counting
Summarisation/Counting

To discard:
• Reads that do not map uniquely
• Reads whose positions overlap with several genes
• Poor quality alignments
• If paired-end, pairs in which only one read matches the gene
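These discard rules can be written as a small filter. The record layout, field names and the mapping-quality cutoff below are hypothetical choices made for this sketch; counting tools such as FeatureCounts expose their own options for each rule.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Alignment:
    n_hits: int                                   # locations the read maps to
    mapq: int                                     # mapping quality
    genes: Tuple[str, ...]                        # genes overlapped by the read
    mate_genes: Optional[Tuple[str, ...]] = None  # None for single-end data

def assigned_gene(aln, min_mapq=10):
    """Return the gene to count for this alignment, or None to discard it."""
    if aln.n_hits != 1:          # not uniquely mapped
        return None
    if aln.mapq < min_mapq:      # poor quality alignment (cutoff is an assumption)
        return None
    if len(aln.genes) != 1:      # overlaps no gene or several genes
        return None
    if aln.mate_genes is not None and aln.mate_genes != aln.genes:
        return None              # paired-end: the mate does not match the same gene
    return aln.genes[0]

def count_reads(alignments, min_mapq=10):
    """Count the alignments that survive every filter, per gene."""
    genes = (assigned_gene(a, min_mapq) for a in alignments)
    return Counter(g for g in genes if g is not None)
```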
Summarisation/Counting
Normalisation
Normalisation/Scaling
Normalisation/RPKM
• Gene length based normalisation
• Library size, number of reads
• Highly expressed genes or highly DE genes can distort values!!!
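RPKM stands for reads per kilobase of transcript per million mapped reads, i.e. count × 10^9 / (gene length in bp × total mapped reads). The sketch below uses hypothetical gene names, lengths and counts, and takes the library size as the sum of the counts in the table, which is a simplification.

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)

def rpkm_table(counts, lengths):
    """RPKM for every gene, using the summed counts as the library size."""
    total = sum(counts.values())
    return {gene: rpkm(c, lengths[gene], total) for gene, c in counts.items()}

lengths = {"g1": 2000, "g2": 1000}
sample_a = {"g1": 100, "g2": 100}
sample_b = {"g1": 100, "g2": 10_000}   # g2 is strongly up-regulated

print(rpkm_table(sample_a, lengths)["g1"])
print(rpkm_table(sample_b, lengths)["g1"])
```

Gene g1 has the same raw count in both samples, yet its RPKM drops in sample B because g2 inflates the library size. That is the distortion the last bullet warns about.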
Normalisation/geometric scaling
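Assuming "geometric scaling" here refers to the median-of-ratios approach used by DESeq (the slide does not say which method is meant), a sketch looks like this: each gene's count is compared with that gene's geometric mean across samples, and the median ratio per sample is its size factor. The count matrix is hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: one row per gene, one column per sample. Returns one factor per sample."""
    n_samples = len(counts[0])
    # Logs are only defined for genes with non-zero counts in every sample
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in all samples")
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1
counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
print(size_factors(counts))
```

Dividing each sample's counts by its size factor puts the samples on a common scale. Because the median is used, a few extreme genes have little influence on the factor.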

Normalisation/trimmed mean of M values (TMM)
Weighted trimmed mean of the log expression ratios.
“highly expressed genes and those that have a large variation of expression are excluded”

[Figure: distribution of M values for technical replicates and for liver to kidney; the liver-to-kidney distribution is skewed to negative values, i.e. higher expression in kidney]
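A simplified sketch of the trimmed-mean idea: compute the log2 ratio (M value) of each gene's proportion in the sample versus the reference, drop the most extreme M values, and average the rest. Unlike the weighted method named above, this sketch is unweighted and trims on M only; the trim fraction and the counts are assumptions for illustration.

```python
import math

def tmm_factor(sample, reference, trim=0.3):
    """Unweighted trimmed mean of M values, returned on the linear scale."""
    n_s, n_r = sum(sample), sum(reference)
    m_values = [
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0               # log ratio undefined for zero counts
    ]
    if not m_values:
        raise ValueError("no gene is expressed in both libraries")
    m_values.sort()
    cut = int(len(m_values) * trim)      # number trimmed from each tail
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))

# Hypothetical counts: four unchanged genes plus one very highly expressed gene
reference = [100, 100, 100, 100, 100]
sample = [100, 100, 100, 100, 10_000]
factor = tmm_factor(sample, reference)
print(factor, sum(sample) * factor)
```

Multiplying the sample's library size by the factor gives an effective library size that reflects the unchanged genes rather than the single dominant one.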


Differential Expression
Modelling Differential Expression
Differential Expression
8 replicates, 2 conditions
Differential Expression
Dimensionality reduction: PCA

More replicates
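PCA projects each sample onto the directions of greatest variance, so replicates of the same condition are expected to cluster. The sketch below finds only the first principal component by power iteration in plain Python; in practice a linear-algebra library would be used, and the expression values are hypothetical.

```python
import math

def first_principal_component(samples, n_iter=200):
    """samples: one list of (log) expression values per sample.
    Returns the PC1 direction over genes and each sample's score on it."""
    n, p = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(p)]
    centered = [[s[j] - means[j] for j in range(p)] for s in samples]
    # Starting direction; the iteration fails if it is exactly orthogonal to PC1
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        scores = [sum(row[j] * v[j] for j in range(p)) for row in centered]
        # w is proportional to (covariance matrix) x v
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    scores = [sum(row[j] * v[j] for j in range(p)) for row in centered]
    return v, scores

# Hypothetical log-expression of 3 genes in 2 control and 2 treated samples
samples = [
    [1.0, 1.0, 1.0],   # control 1
    [1.2, 0.9, 1.0],   # control 2
    [5.0, 1.0, 1.1],   # treated 1
    [5.2, 1.1, 1.0],   # treated 2
]
direction, scores = first_principal_component(samples)
print(scores)
```

In this toy data the first gene drives the separation, so control and treated samples land on opposite sides of PC1.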
Differential Expression
Which pipeline?

Aligner/Modeler/DE tester
(e.g. ThCuNo)

The DE tool is the one that makes the difference


Differential Expression

If we want most of the calls to validate (high precision pipeline)

If we want as many genes as possible, or for enrichment studies (high recall pipeline)
Differential Expression

The multiple testing issue

The hype…
Differential Expression

• Testing for DE at EACH gene is ONE experiment
• Testing across THOUSANDS of genes requires correction for multiple comparisons
• Bonferroni vs False Discovery Rate.
• FDR control aims to keep the expected proportion of false positives among the genes called significant below a threshold (usually 5%).
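Both corrections fit in a few lines. The FDR procedure shown is Benjamini-Hochberg, one common choice; the slide does not name a specific method, so treat that as an assumption. The p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvalues = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvalues))
print(benjamini_hochberg(pvalues))
```

Bonferroni is the stricter of the two: an adjusted value from Benjamini-Hochberg is never larger than the Bonferroni one for the same gene.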
Differential Expression

FDR-corrected value (qval)

LogFC
Functional analyses
Put data in Biological Context
https://usegalaxy.eu/

TRANSCRIPTOMICS

https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html

Time estimation: 8 hours


Repeat the mapping with HISAT2 instead of STAR.
Quality control FastQC
MultiQC: to aggregate data
