Lab02 - Reading Results
Labwork 2
EXPLANATION
Table of contents
Understand the FASTQ format
Understand quality with FastQC
  Basic statistics
  Per base sequence quality
  Per sequence quality scores
  Per base sequence content
  Per sequence GC content
  Per base N content
  Sequence length distribution
  Sequence duplication levels
  Over-represented sequences
  Adapter content
  Per tile sequence quality
  Kmer Content
  Specific problem for alternate library types
Taken and modified from Quality control. Bérénice Batut, Maria Doyle.
https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html#per-base-sequence-quality
It means that:
- The fragment named: @M00970
- Corresponds to the DNA sequence:
GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCC
GTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAG
AATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATC
AGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA
- The quality of each base call in this sequence is given by the corresponding string:
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGG
FGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++
0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-
.(4(,,))).,(())))))).)))))))-))-(
The quality score for each sequence is a string of characters, one for each base of the nucleic acid sequence,
used to characterize the probability of misidentification of each base. The score is encoded using the ASCII
character table:
So there is an ASCII character associated with each nucleotide, representing its Phred quality score
(https://en.wikipedia.org/wiki/Phred_quality_score), the probability of an incorrect base call:
Phred Quality Score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90%
20                     1 in 100                              99%
30                     1 in 1000                             99.9%
40                     1 in 10,000                           99.99%
50                     1 in 100,000                          99.999%
60                     1 in 1,000,000                        99.9999%
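The table follows the standard Phred relation Q = -10 * log10(P), where P is the probability of an incorrect base call. The sketch below decodes quality characters into scores and probabilities. It assumes the Phred+33 (Sanger / Illumina 1.8+) encoding, which is consistent with the characters in the quality string above but is not stated in this handout; the function names are ours.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores.

    The offset of 33 (Phred+33) is an assumption about the file's encoding.
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


# A few characters taken from the quality string shown above.
# 'G' decodes to 38 and '+' decodes to 10 under Phred+33.
for ch, q in zip("GF+(", phred_scores("GF+(")):
    print(ch, q, error_probability(q))
```

With this encoding, the long run of `G` at the start of the read corresponds to a score of 38, and the `+`, `*` and `(` characters near the end correspond to scores of 10 or less.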
Other sequence quality profiles). Recent developments in sequencing chemistry have improved
this somewhat, but reads are now longer than ever.
When the median quality is below a Phred score of ~20, we should consider trimming away low-quality
bases from the sequence. We will explain that process in the Trim and filter section.
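To make the "median below ~20" rule concrete, here is a minimal sketch that computes the median Phred score at each read position and reports the first position where it drops below a threshold. It only illustrates how the plot is read; it is not how FastQC or a trimming tool is implemented, and it assumes Phred+33 encoding. The function names and the toy data are hypothetical.

```python
from statistics import median


def per_base_median(qualities: list[str], offset: int = 33) -> list[float]:
    """Median Phred score at each read position, across all reads."""
    longest = max(len(q) for q in qualities)
    medians = []
    for pos in range(longest):
        # Only reads long enough to have a base at this position contribute.
        column = [ord(q[pos]) - offset for q in qualities if len(q) > pos]
        medians.append(median(column))
    return medians


def first_low_quality_position(medians: list[float], threshold: float = 20):
    """0-based index of the first position with median below threshold, or None."""
    for pos, value in enumerate(medians):
        if value < threshold:
            return pos
    return None


# Toy quality strings (hypothetical): quality drops towards the 3' end.
toy = ["GGGG+", "GGF+(", "GGG5("]
medians = per_base_median(toy)
print(medians, first_low_quality_position(medians))
```

Real trimming tools usually decide per read rather than from the per-position median, so this is best seen as a way to check what the plot is telling you.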
The distribution of average read quality should be a tight peak in the upper range of the plot. The plot can
also reveal whether a subset of the sequences has universally low quality values. This can happen because some
sequences are poorly imaged (for example, at the edge of the field of view), but these should represent only a
small percentage of the total sequences.
In a random library we would expect that there would be little to no difference between the four bases.
The proportion of each of the four bases should remain relatively constant over the length of the read
with %A=%T and %G=%C, and the lines in this plot should run parallel with each other. This is amplicon data,
where 16S DNA is PCR amplified and sequenced, so we’d expect this plot to have some bias (more
information in Biases by library type) and not show a random distribution.
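The quantity plotted here can be sketched as the percentage of each base at each read position. The example below is a simplified illustration on hypothetical toy reads, not FastQC's implementation; bases other than A, C, G and T (such as N) are ignored in the denominator.

```python
from collections import Counter


def per_base_content(reads: list[str]) -> list[dict[str, float]]:
    """Percentage of A, C, G and T at each read position."""
    longest = max(len(r) for r in reads)
    table = []
    for pos in range(longest):
        counts = Counter(r[pos] for r in reads if len(r) > pos)
        total = sum(counts[b] for b in "ACGT")
        table.append(
            {b: (100 * counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return table


# Toy reads (hypothetical). Every read starts with A, as could happen in
# amplicon data where all reads begin with the same primer.
toy_reads = ["ACGT", "ACGA", "ACTT", "AGGT"]
for pos, row in enumerate(per_base_content(toy_reads)):
    print(pos, row)
```

In a random library the four percentages would stay roughly constant from one position to the next; in the toy data the first position is 100% A, which is the kind of bias expected for amplicons.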
Some high-throughput sequencers generate sequence fragments of uniform length, but others can
produce reads of widely varying lengths. Even within uniform-length libraries, some pipelines will trim
sequences to remove poor-quality base calls from the end, or the first $n$ bases if they match the first $n$
bases of the adapter up to 90% (by default), with sometimes $n = 1$.
In a diverse library most sequences will occur only once in the final set. A low level of duplication may
indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to
indicate some kind of enrichment bias (more information in More details about duplication).
Two sources of duplicate reads can be found:
- PCR duplication, in which library fragments have been over-represented due to biased PCR
  enrichment. It is a concern because PCR duplicates misrepresent the true proportion of sequences
  in the input.
- Truly over-represented sequences, such as very abundant transcripts in an RNA-Seq library or in
  amplicon data (like this sample). It is an expected case and not of concern because it does faithfully
  represent the input.
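A duplication level can be sketched by counting how many times each exact sequence occurs. The example below is a simplified illustration on hypothetical toy reads; it does not attempt to reproduce FastQC's own sampling or binning, and counting alone cannot tell PCR duplicates apart from truly over-represented sequences.

```python
from collections import Counter


def duplication_summary(reads: list[str]) -> tuple[dict[int, int], float]:
    """Return (duplication level -> number of distinct sequences,
    percentage of reads that would remain after deduplication)."""
    counts = Counter(reads)
    levels = Counter(counts.values())
    remaining = 100 * len(counts) / len(reads)
    return dict(levels), remaining


# Toy library (hypothetical): one sequence seen 3 times, one twice, one once.
toy = ["AAAA", "AAAA", "AAAA", "CCCC", "GGGG", "GGGG"]
levels, remaining = duplication_summary(toy)
print(levels, remaining)
```

In a diverse library almost all distinct sequences would sit at level 1 and the remaining percentage would be close to 100.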
Over-represented sequences
A normal high-throughput library will contain a diverse set of sequences, with no individual sequence
making up more than a tiny fraction of the whole. Finding that a single sequence is very over-represented
in the set means either that it is highly biologically significant, or that the library is contaminated or
not as diverse as expected.
FastQC lists all of the sequences which make up more than 0.1% of the total. For each over-represented
sequence FastQC will look for matches in a database of common contaminants and will report the best hit it
finds. Hits must be at least 20 bp in length and have no more than 1 mismatch. Finding a hit doesn’t necessarily
mean that this is the source of the contamination, but may point you in the right direction. It’s also worth
pointing out that many adapter sequences are very similar to each other, so you may get a hit reported which
isn’t technically correct, but which has a very similar sequence to the actual match.
RNA sequencing data may have some transcripts that are so abundant that they register as over-represented
sequences. With DNA sequencing data no single sequence should be present at a high enough
frequency to be listed, but we can sometimes see a small percentage of adapter reads.
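The 0.1% threshold can be illustrated with a short sketch that lists every sequence exceeding a given share of the reads. It counts exact matches over all reads and leaves out the contaminant lookup, so it is a simplification rather than FastQC's procedure; the function name and the toy library are hypothetical.

```python
from collections import Counter
from itertools import islice, product


def overrepresented(reads: list[str], min_percent: float = 0.1):
    """List (sequence, count, percentage) for sequences above min_percent."""
    counts = Counter(reads)
    total = len(reads)
    return [
        (seq, n, 100 * n / total)
        for seq, n in counts.most_common()
        if 100 * n / total > min_percent
    ]


# Toy library (hypothetical): 1995 distinct 6-mers plus one sequence seen 5 times.
unique = ["".join(p) for p in islice(product("ACGT", repeat=6), 1995)]
toy = ["CTGTCTCTTATA"] * 5 + unique
print(overrepresented(toy))
```

Here the repeated sequence accounts for 5 of 2000 reads (0.25%) and is listed, while each unique read accounts for 0.05% and is not.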
Adapter content
The plot shows the cumulative percentage of reads with the different adapter sequences at each position.
Once an adapter sequence is seen in a read it is counted as being present right through to the end of the
read, so the percentage increases with the read length. FastQC can detect some adapters by default (e.g.
Illumina, Nextera); for others we could provide a contaminants file as an input to the FastQC tool.
Ideally Illumina sequence data should not have any adapter sequence present. But with long reads, some
of the library inserts are shorter than the read length, resulting in read-through to the adapter at the 3’ end
of the read. This microbiome sample has relatively long reads and we can see that the Nextera adapter has been
detected (more information in Other adapter content profiles).
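The cumulative counting described above can be sketched as follows: once the adapter is found in a read, that read is counted at every position from the match to its end. This is a minimal illustration that assumes reads of equal length and exact adapter matches; the adapter string and reads are hypothetical toy values, not FastQC's adapter list.

```python
def cumulative_adapter_percent(reads: list[str], adapter: str) -> list[float]:
    """Percentage of reads in which the adapter has been seen, per position."""
    longest = max(len(r) for r in reads)
    seen_at = [0] * longest
    for read in reads:
        start = read.find(adapter)
        if start != -1:
            # Count the read at every position from the match to its end.
            for pos in range(start, len(read)):
                seen_at[pos] += 1
    return [100 * n / len(reads) for n in seen_at]


# Toy reads (hypothetical) with a toy adapter "CTGT" at different offsets.
toy = ["AAAACTGT", "AACTGTAA", "AAAAAAAA", "CTGTAAAA"]
print(cumulative_adapter_percent(toy, "CTGT"))
```

Because a read stays counted once the adapter has appeared, the returned percentages can only stay level or rise along the read, which is the shape seen in the plot.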
Kmer Content
This plot is not output by default. As stated in the tool form, if you want this module it needs to be enabled
using a custom Submodule and limits file. With this module, FastQC does a generic analysis of all of the short
nucleotide sequences of length k (kmer, with k = 7 by default) starting at each position along the read in the
library to find those which do not have an even coverage through the length of your reads. Any given kmer
should be evenly represented across the length of the read.
FastQC will report the list of kmers which appear at specific positions with a greater frequency than
expected. This can be due to different sources of bias in the library, including the presence of read-through
adapter sequences building up on the end of the sequences. The presence of any overrepresented sequences
in the library (such as adapter dimers) causes the kmer plot to be dominated by the kmers from these
sequences. Kmers biased for other, more interesting reasons may then be diluted and hard to see.
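The idea of "even coverage along the read" can be sketched by counting, for each kmer, how often it starts at each position. The example below is a rough illustration only: FastQC uses its own statistical test to decide which kmers to report, whereas the simple max-to-mean ratio here is our stand-in. The function names and toy reads are hypothetical, and k is set to 3 in the toy call to keep the data small.

```python
from collections import defaultdict


def kmer_position_counts(reads: list[str], k: int = 7) -> dict[str, list[int]]:
    """Map each kmer to a list of counts, one per start position in the read."""
    longest = max(len(r) for r in reads)
    n_positions = longest - k + 1
    counts = defaultdict(lambda: [0] * n_positions)
    for read in reads:
        for pos in range(len(read) - k + 1):
            counts[read[pos:pos + k]][pos] += 1
    return dict(counts)


def positional_bias(position_counts: list[int]) -> float:
    """Highest count at any position divided by the mean count per position.

    A value near 1 means even coverage; larger values mean the kmer is
    concentrated at particular positions. This ratio is an assumption of
    this sketch, not FastQC's statistic.
    """
    mean = sum(position_counts) / len(position_counts)
    return max(position_counts) / mean


# Toy reads (hypothetical): "ACG" occurs mostly at the start of the read.
toy = ["ACGTT", "ACGAA", "TTACG"]
counts = kmer_position_counts(toy, k=3)
print(counts["ACG"], positional_bias(counts["ACG"]))
```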
The following example is from a high-quality DNA-Seq library. The biased kmers near the start of the
read are likely due to slight sequence-dependent efficiency of DNA shearing or a result of random priming:
This module can be very difficult to interpret. The adapter content plot and overrepresented
sequences table are easier to interpret and may give you enough information without needing this
plot. RNA-seq libraries may have highly represented kmers that are derived from highly expressed
sequences. To learn more about this plot, please check the FastQC Kmer Content documentation
(https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/11%20Kmer%20Content.html).
Amplicon
Amplicon libraries are prepared by PCR amplification of a specific target, for example the V4
hypervariable region of the bacterial 16S rRNA gene. All reads from this type of library are expected to be
nearly identical. This results in:
• Extremely biased per base sequence content
• Extremely narrow distribution of GC content
• Very high sequence duplication levels
• Abundance of overrepresented sequences
More information
Signal decay and phasing