Taken and modified from Quality control. Bérénice Batut, Maria Doyle. https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html#per-base-sequence-quality.

Labwork 2

EXPLANATION

Table of contents
Understand the FASTQ format
Understand quality with FastQC
    Basic statistics
    Per base sequence quality
    Per sequence quality scores
    Per base sequence content
    Per sequence GC content
    Per base N content
    Sequence length distribution
    Sequence duplication levels
    Over-represented sequences
    Adapter content
    Per tile sequence quality
    Kmer Content
    Specific problem for alternate library types
More information
    Signal decay and phasing
    Other sequence quality profiles
    Biases by library type
    More details about duplication
    Other adapter content profiles

Understand the FASTQ format

Although it looks complicated (and maybe it is), the FASTQ format is easy to understand with a little
decoding.
Each read, representing a fragment of the library, is encoded by 4 lines:
Line  Description
1     Always begins with @, followed by information about the read
2     The actual nucleic acid sequence
3     Always begins with + and sometimes repeats the information from line 1
4     A string of characters representing the quality scores associated with each base of the nucleic acid
      sequence; it must have the same number of characters as line 2
So for example, the first sequence in our file is:
@M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT
GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTA
TGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAA
AAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTA
AAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGG
AFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*
02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(
((-((((-.(4(,,))).,(())))))).)))))))-))-(

It means that:
- The read identifier begins with @M00970 (full name: M00970:337:000000000-BR5KF:1:1102:17745:1557)
- Corresponds to the DNA sequence:
GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCC
GTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAG
AATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATC
AGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA
- Each base of this sequence was called with a quality encoded by the corresponding character of:
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGG
FGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++
0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-
.(4(,,))).,(())))))).)))))))-))-(
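The four-line layout described above can be sketched as a small parser. This is a minimal illustration, not the lab's own tooling: the names `FastqRecord` and `read_fastq` are hypothetical, the toy record is made up, and the sketch assumes each record occupies exactly four unwrapped lines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # line 1 without the leading "@"
    sequence: str  # line 2
    quality: str   # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of 4 lines (assumes unwrapped FASTQ)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Toy record, made up for illustration (not taken from the lab data).
toy_fastq = "@read1 1:N:0:ACGT\nGATTACA\n+\nGGGFF+*\n"
records = list(read_fastq(io.StringIO(toy_fastq)))
print(records[0].name, len(records[0].sequence))
```

The length check mirrors the rule that line 4 must have the same number of characters as line 2.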

The quality score for each sequence is a string of characters, one for each base of the nucleic acid
sequence, used to characterize the probability of misidentification of each base. The score is encoded using
the ASCII character table, so there is one ASCII character associated with each nucleotide. That character
represents the base's Phred quality score (footnote 1), which expresses the probability of an incorrect base
call, as listed in the table below.

1 https://en.wikipedia.org/wiki/Phred_quality_score

Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1000                            99.9%
40                    1 in 10,000                          99.99%
50                    1 in 100,000                         99.999%
60                    1 in 1,000,000                       99.9999%
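The table follows the relation Q = -10 * log10(P), where P is the probability of an incorrect base call. The sketch below decodes a quality character and converts the score back to a probability. It assumes the Phred+33 ASCII offset (the usual Sanger / Illumina 1.8+ encoding); check the Encoding field of the FastQC report before relying on that. The function names are hypothetical.

```python
def phred_score(char: str, offset: int = 33) -> int:
    """Decode one quality character; offset 33 is an assumption (Phred+33)."""
    return ord(char) - offset


def error_probability(score: float) -> float:
    """Invert Q = -10 * log10(P) to get the probability of a wrong base call."""
    return 10 ** (-score / 10)


# A few characters that appear in the example quality string above.
for char in "GF+*(":
    q = phred_score(char)
    print(char, q, f"{error_probability(q):.4f}")
```

Under this assumption the run of G characters at the start of the example read decodes to Q38, while the characters near the end of the read decode to scores of 10 or below.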

Understand quality with FastQC

Basic statistics
This box provides general information about the raw sequencing data, such as the encoding (which indicates
the technology used), the number of sequences in the file, the number of sequences flagged as poor quality
(if any), the number of nucleotides per sequence, and the GC content.

Per base sequence quality

With FastQC we can use the per base sequence quality plot to check the base quality of the reads.
The x-axis shows the base position in the read. In this example, the sample contains reads that are up to
296 bp long.
For each position, a boxplot is drawn with:
o the median value, represented by the central red line
o the inter-quartile range (25-75%), represented by the yellow box
o the 10% and 90% values, represented by the lower and upper whiskers
o the mean quality, represented by the blue line
The y-axis shows the quality scores. The higher the score, the better the base call. The background of the
graph divides the y-axis into very good quality scores (green), scores of reasonable quality (orange),
and scores of poor quality (red).
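The boxplot values listed above can be computed per position from the quality strings. The following is a rough sketch: `per_base_quality` and `percentile` are hypothetical helpers, Phred+33 is assumed, and the linear-interpolation percentile used here may not match FastQC's own method exactly.

```python
def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of an already sorted list."""
    if len(sorted_values) == 1:
        return float(sorted_values[0])
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def per_base_quality(quality_strings, offset=33):
    """Summarise quality scores at each read position (Phred+33 assumed)."""
    longest = max((len(q) for q in quality_strings), default=0)
    summary = []
    for position in range(longest):
        # Only reads long enough to cover this position contribute.
        scores = sorted(ord(q[position]) - offset
                        for q in quality_strings if len(q) > position)
        summary.append({
            "median": percentile(scores, 0.5),            # red line
            "q25": percentile(scores, 0.25),              # bottom of yellow box
            "q75": percentile(scores, 0.75),              # top of yellow box
            "p10": percentile(scores, 0.10),              # lower whisker
            "p90": percentile(scores, 0.90),              # upper whisker
            "mean": sum(scores) / len(scores),            # blue line
        })
    return summary


# Toy quality strings of different lengths (made up for illustration).
summary = per_base_quality(["II5", "I5+", "I+"])
print([row["median"] for row in summary])
```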
It is normal with all Illumina sequencers for the median quality score to start out lower over the first 5-7
bases and to then rise. The quality of reads on most platforms will drop at the end of the read. This is often
due to signal decay or phasing during the sequencing run (more information in Signal decay and phasing and
Other sequence quality profiles). Recent developments in sequencing chemistry have improved
this somewhat, but reads are now longer than ever.
When the median quality is below a Phred score of ~20, we should consider trimming away bad quality
bases from the sequence. We will explain that process in the Trim and filter section.

Per sequence quality scores

This plot shows the average quality score over the full length of each read on the x-axis and the number
of reads with that score on the y-axis.

The distribution of average read quality should be a tight peak in the upper range of the plot. The plot can
also reveal whether a subset of the sequences has universally low quality values. This can happen because
some sequences are poorly imaged (on the edge of the field of view, etc.); however, these should represent
only a small percentage of the total sequences.
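The data behind this plot is a histogram of per-read mean quality. A minimal sketch, assuming Phred+33 and rounding each mean down to an integer (the rounding choice and the function names are assumptions, not FastQC's documented behaviour):

```python
from collections import Counter


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score over the full length of one read."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def per_sequence_quality(quality_strings, offset=33):
    """Map each (truncated) mean quality to the number of reads with it."""
    return Counter(int(mean_quality(q, offset)) for q in quality_strings)


# Toy data: two good reads and one poor read (made up for illustration).
histogram = per_sequence_quality(["IIII", "5555", "IIII"])
print(sorted(histogram.items()))
```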

Per base sequence content

This plots the percentage of each of the four nucleotides (T, C, A, G) at each position across
all reads in the input sequence file. As for the per base sequence quality, the x-axis is non-uniform.

In a random library we would expect that there would be little to no difference between the four bases.
The proportion of each of the four bases should remain relatively constant over the length of the read
with %A=%T and %G=%C, and the lines in this plot should run parallel with each other. This is amplicon data,
where 16S DNA is PCR amplified and sequenced, so we’d expect this plot to have some bias (more
information in Biases by library type) and not show a random distribution.
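The per base sequence content values can be sketched as a column-wise count over the reads. `per_base_content` is a hypothetical helper; in this sketch any N calls count towards the total at a position but are not reported as a base.

```python
from collections import Counter


def per_base_content(reads):
    """Percentage of A, C, G and T at each position across all reads."""
    longest = max((len(read) for read in reads), default=0)
    content = []
    for position in range(longest):
        counts = Counter(read[position] for read in reads if len(read) > position)
        total = sum(counts.values())
        content.append({base: 100 * counts[base] / total for base in "ACGT"})
    return content


# Toy reads (made up for illustration).
content = per_base_content(["ACGT", "AAGT"])
print(content[1])
```

In a random library the four percentages would stay roughly flat along the read; in amplicon data such as this sample, strong position-specific peaks are expected.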

Per sequence GC content

This plot displays the number of reads vs. the percentage of G and C bases per read. It is compared to a
theoretical distribution assuming a uniform GC content for all reads, as expected for whole genome shotgun
sequencing, where the central peak corresponds to the overall GC content of the underlying genome. Since the
GC content of the genome is not known, the modal GC content is calculated from the observed data and used
to build a reference distribution.
An unusually-shaped distribution could indicate a contaminated library or some other kind of biased
subset. A shifted normal distribution indicates some systematic bias, which is independent of base position.
If there is a systematic bias which creates a shifted normal distribution then this won’t be flagged as an error
by the module since it doesn’t know what your genome’s GC content should be.
But there are also other situations in which an unusually-shaped distribution may occur. For example,
with RNA sequencing there may be a greater or lesser distribution of mean GC content among transcripts
causing the observed plot to be wider or narrower than an ideal normal distribution.
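The observed curve in this plot is a histogram of per-read GC percentage. A minimal sketch follows; the helper names are hypothetical, percentages are truncated to integers for binning, and the theoretical reference distribution that FastQC overlays is not reproduced here.

```python
from collections import Counter


def gc_percent(read: str) -> float:
    """Percentage of G and C bases in one read."""
    return 100 * sum(base in "GC" for base in read) / len(read)


def per_sequence_gc(reads):
    """Map each integer GC percentage to the number of reads with it."""
    return Counter(int(gc_percent(read)) for read in reads)


# Toy reads (made up for illustration).
gc_histogram = per_sequence_gc(["ATGC", "GGCC", "AATT", "ATGC"])
print(sorted(gc_histogram.items()))
```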

Per base N content

If a sequencer is unable to make a base call with sufficient confidence, it will write an “N” instead of a
conventional base call. This plot displays the percentage of base calls at each position or bin for which an
N was called.

It’s not unusual to see a very low proportion of Ns appearing in a sequence, especially near the end of a
sequence. But this curve should never rise noticeably above zero. If it does, this indicates that a problem
occurred during the sequencing run. In the example below, an error caused the instrument to be unable to
call a base for approximately 20% of the reads at position 29:
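The curve in this plot can be sketched as the share of N calls in each column of the reads. `per_base_n` is a hypothetical helper and the toy reads are made up.

```python
def per_base_n(reads):
    """Percentage of reads with an N call at each position."""
    longest = max((len(read) for read in reads), default=0)
    percentages = []
    for position in range(longest):
        column = [read[position] for read in reads if len(read) > position]
        percentages.append(100 * column.count("N") / len(column))
    return percentages


n_content = per_base_n(["ACN", "ANN", "ACG"])
print([round(value, 1) for value in n_content])
```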

Sequence length distribution

This plot shows the distribution of fragment sizes in the file which was analysed. In many cases this will
produce a simple plot showing a peak only at one size, but for variable length FASTQ files this will show the
relative amounts of each different size of sequence fragment. Our plot shows variable length as we trimmed
the data. The biggest peak is at 296bp but there is a second large peak at ~100bp. So even though our
sequences range up to 296bp in length, a lot of the good-quality sequences are shorter. This corresponds
with the drop we saw in the sequence quality at ~100bp and the red stripes starting at this position in the
per tile sequence quality plot.

Some high-throughput sequencers generate sequence fragments of uniform length, but others can
produce reads of widely varying lengths. Even within uniform length libraries, some pipelines trim
sequences to remove poor quality base calls from the end, or remove the first $n$ bases if they match the
first $n$ bases of the adapter up to 90% (by default), sometimes with $n = 1$.

Sequence duplication levels

The graph shows in blue the percentage of reads whose sequence is present a given number of times in
the file:

In a diverse library most sequences will occur only once in the final set. A low level of duplication may
indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to
indicate some kind of enrichment bias (more information in More details about duplication).
Two sources of duplicate reads can be found:
o PCR duplication in which library fragments have been over-represented due to biased PCR
enrichment. It is a concern because PCR duplicates misrepresent the true proportion of sequences
in the input.
o Truly over-represented sequences such as very abundant transcripts in an RNA-Seq library or in
amplicon data (like this sample). It is an expected case and not of concern because it does faithfully
represent the input.
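The idea behind the duplication plot can be sketched by counting how often each exact sequence occurs and then grouping reads by that count. This is a simplification: `duplication_levels` is a hypothetical helper that compares full sequences over all reads, which is not necessarily how FastQC samples or bins its data.

```python
from collections import Counter


def duplication_levels(reads):
    """Percentage of reads whose sequence occurs 1, 2, 3, ... times."""
    copies_per_sequence = Counter(reads)
    reads_per_level = Counter()
    for copies in copies_per_sequence.values():
        # A sequence seen `copies` times accounts for `copies` reads.
        reads_per_level[copies] += copies
    total = len(reads)
    return {level: 100 * count / total
            for level, count in sorted(reads_per_level.items())}


# Toy reads: one sequence seen twice, two seen once (made up).
levels = duplication_levels(["AAAA", "AAAA", "CCCC", "GGGG"])
print(levels)
```

A count like this cannot tell PCR duplicates from truly abundant sequences; that distinction depends on the library type, as described above.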

Over-represented sequences
A normal high-throughput library will contain a diverse set of sequences, with each individual sequence
making up only a tiny fraction of the whole. Finding that a single sequence is very over-represented in the
set either
means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse
as expected.
FastQC lists all of the sequences which make up more than 0.1% of the total. For each over-represented
sequence FastQC will look for matches in a database of common contaminants and will report the best hit it
finds. Hits must be at least 20bp in length and have no more than 1 mismatch. Finding a hit doesn’t necessarily
mean that this is the source of the contamination, but may point you in the right direction. It’s also worth
pointing out that many adapter sequences are very similar to each other so you may get a hit reported which
isn’t technically correct, but which has a very similar sequence to the actual match.
RNA sequencing data may have some transcripts that are so abundant that they register as over-represented
sequences. With DNA sequencing data no single sequence should be present at a high enough
frequency to be listed, but we can sometimes see a small percentage of adapter reads.
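The 0.1% rule can be sketched as a simple filter on sequence counts. `overrepresented` is a hypothetical helper; it does not reproduce the contaminant database lookup, and the toy reads are synthetic.

```python
from collections import Counter


def overrepresented(reads, threshold_percent=0.1):
    """List (sequence, count, percent) for sequences above the threshold."""
    total = len(reads)
    hits = []
    for sequence, count in Counter(reads).most_common():
        percent = 100 * count / total
        if percent > threshold_percent:
            hits.append((sequence, count, percent))
    return hits


# Synthetic library: 998 distinct reads plus one sequence seen twice,
# so the duplicate makes up 0.2% of 1000 reads.
unique_reads = [format(i, "010b").replace("0", "A").replace("1", "C")
                for i in range(998)]
hits = overrepresented(["GGGGGGGGGG"] * 2 + unique_reads)
print(hits)
```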

Adapter content
The plot shows the cumulative percentage of reads with the different adapter sequences at each position.
Once an adapter sequence is seen in a read it is counted as being present right through to the end of the
read, so the percentage increases with the read length. FastQC can detect some adapters by default (e.g.
Illumina, Nextera); for others we could provide a contaminants file as an input to the FastQC tool.
Ideally Illumina sequence data should not have any adapter sequence present. But with long reads, some
of the library inserts are shorter than the read length, resulting in read-through to the adapter at the 3’ end
of the read. This microbiome sample has relatively long reads and we can see that the Nextera adapter has
been detected (more information in Other adapter content profiles).
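The cumulative counting described above can be sketched as follows. `adapter_content` is a hypothetical helper that uses exact string matching and marks a read from the position where the adapter starts; FastQC's exact matching and position convention may differ. The 3-base "adapter" is a toy stand-in.

```python
def adapter_content(reads, adapter):
    """Cumulative percentage of reads containing the adapter, per position."""
    longest = max((len(read) for read in reads), default=0)
    counts = [0] * longest
    for read in reads:
        start = read.find(adapter)
        if start == -1:
            continue  # adapter not present in this read
        # Once seen, the adapter counts as present through to the read end.
        for position in range(start, len(read)):
            counts[position] += 1
    return [100 * count / len(reads) for count in counts]


# Toy reads: adapter at the 3' end, at the start, and absent (made up).
curve = adapter_content(["AAACTG", "CTGAAA", "AAAAAA"], "CTG")
print([round(value, 1) for value in curve])
```

A read made only of adapter contributes from position 1 onwards, which is the pattern listed under adapter dimer contamination.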

Per tile sequence quality

This plot enables you to look at the quality scores from each tile across all of your bases to see if there
was a loss in quality associated with only one part of the flowcell. The plot shows the deviation from the
average quality for each flowcell tile. The hotter colours indicate that reads in the given tile have worse
qualities for that position than reads in other tiles. With this sample, you can see that certain tiles show
consistently poor quality, especially from ~100bp onwards. A good plot should be blue all over.
This plot will only appear for Illumina libraries which retain their original sequence identifiers. Encoded
in these is the flowcell tile from which each read came.
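The tile can be read from the sequence identifier. The sketch below assumes the seven-field Illumina layout instrument:run:flowcell:lane:tile:x:y, which matches the example read shown earlier. The helper names are hypothetical, Phred+33 is assumed, and the sketch averages over all positions rather than computing FastQC's per-position deviation.

```python
from collections import defaultdict


def tile_from_header(header: str) -> int:
    """Extract the tile number from an Illumina-style read header."""
    read_id = header.lstrip("@").split()[0]  # drop "@" and the comment part
    fields = read_id.split(":")
    if len(fields) != 7:
        raise ValueError(f"unexpected identifier layout: {read_id!r}")
    return int(fields[4])


def mean_quality_by_tile(records, offset=33):
    """Average Phred score per tile from (header, quality) pairs."""
    totals = defaultdict(lambda: [0, 0])  # tile -> [score sum, base count]
    for header, quality in records:
        tile = tile_from_header(header)
        totals[tile][0] += sum(ord(char) - offset for char in quality)
        totals[tile][1] += len(quality)
    return {tile: score_sum / bases for tile, (score_sum, bases) in totals.items()}


example_header = "@M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT"
print(tile_from_header(example_header))

# Toy records on two tiles (made up for illustration).
by_tile = mean_quality_by_tile([("@M1:1:FC:1:1101:1:1", "IIII"),
                                ("@M1:1:FC:1:1102:1:2", "5555")])
print(by_tile)
```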

Kmer Content
This plot is not output by default. As stated in the tool form, if you want this module it needs to be enabled
using a custom Submodule and limits file. With this module, FastQC does a generic analysis of all of the short
nucleotide sequences of length k (kmer, with k = 7 by default) starting at each position along the read in the
library to find those which do not have an even coverage through the length of your reads. Any given kmer
should be evenly represented across the length of the read.
FastQC will report the list of kmers which appear at specific positions with a greater frequency than
expected. This can be due to different sources of bias in the library, including the presence of read-through
adapter sequences building up on the end of the sequences. The presence of any overrepresented sequences
in the library (such as adapter dimers) causes the kmer plot to be dominated by the kmer from these
sequences. Any biased kmer due to other interesting biases may be then diluted and not easy to see.
The following example is from a high-quality DNA-Seq library. The biased kmers near the start of the
read are likely due to slight sequence-dependent efficiency of DNA shearing or a result of random priming:

This module can be very difficult to interpret. The adapter content plot and overrepresented
sequences table are easier to interpret and may give you enough information without needing this
plot. RNA-seq libraries may have highly represented kmers that are derived from highly expressed
sequences. To learn more about this plot, please check the FastQC Kmer Content documentation (footnote 2).

Specific problem for alternate library types

Small/micro RNA
In small RNA libraries, we typically have a relatively small set of unique, short sequences.
Small RNA libraries are not randomly sheared before adding sequencing adapters to their ends: all
the reads for specific classes of microRNAs will be identical. It will result in:

2 https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/11%20Kmer%20Content.html
• Extremely biased per base sequence content

• Extremely narrow distribution of GC content
• Very high sequence duplication levels
• Abundance of overrepresented sequences
• Read-through into adapters

Amplicon
Amplicon libraries are prepared by PCR amplification of a specific target, for example the V4
hypervariable region of the bacterial 16S rRNA gene. All reads from this type of library are expected to be
nearly identical. It will result in:
• Extremely biased per base sequence content
• Extremely narrow distribution of GC content
• Very high sequence duplication levels
• Abundance of overrepresented sequences

Bisulfite or Methylation sequencing

With Bisulfite or methylation sequencing, the majority of the cytosine (C) bases are converted to
thymine (T). It will result in:
• Biased per base sequence content
• Biased per sequence GC content

Adapter dimer contamination

Any library type may contain a very small percentage of adapter dimer (i.e. no insert) fragments. They
are more likely to be found in amplicon libraries constructed entirely by PCR (by formation of PCR primer-
dimers) than in DNA-Seq or RNA-Seq libraries constructed by adapter ligation. If a sufficient fraction of the
library is adapter dimer it will become noticeable in the FastQC report:
• Drop in per base sequence quality after base 60
• Possible bi-modal distribution of per sequence quality scores
• Distinct pattern observed in per base sequence content up to base 60
• Spike in per sequence GC content
• Overrepresented sequence matching adapter
• Adapter content > 0% starting at base 1

More information
Signal decay and phasing

Other sequence quality profiles

Biases by library type

More details about duplication
