Quality Control of NGS Data
Solutions
Surya Saha ss2489@cornell.edu
BTI PGRP Summer Internship Program 2014
Slides: Aureliano Bombarely ab782@cornell.edu
Goal:
Learn to use read evaluation programs, paying attention to
relevant parameters such as qscore distribution, read length
distribution and read duplication.
Data:
(Illumina data for two tomato ripening stages)
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
Tools:
tar -zxvf (command line, untar and unzip the files)
head (command line, take a quick look at the files)
mv (command line, change the name of the files)
grep (command line, find/count patterns in files)
FASTX toolkit (command line, process fasta/fastq)
FastQC (gui, to calculate several stats for each file)
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 2
Exercise 1:
1. Untar and Unzip the file:
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
2. Raw data will be found in two dirs: breaker and
immature_fruit. Print the first 10 lines for the files:
SRR404331_ch4.fq, SRR404333_ch4.fq, SRR404334_ch4.fq
and SRR404336_ch4.fq.
Question 1.1: Are these files in fastq format?
3. Change the extension of the .fq files to .fastq
4. Type ‘fastqc’ to start the FastQC program. Load the four
sequence files in the program.
Solution 1:
1. Untar and Unzip the file: ch4_demo_dataset.tar.gz
tar -zxvf ch4_demo_dataset.tar.gz
or in two steps:
gunzip ch4_demo_dataset.tar.gz
tar -xvf ch4_demo_dataset.tar
2. Raw data will be found in two dirs: breaker and immature_fruit.
Print the first 10 lines for the files: SRR404331_ch4.fq,
SRR404333_ch4.fq, SRR404334_ch4.fq and SRR404336_ch4.fq.
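head SRR404331_ch4.fq; head SRR404333_ch4.fq; head SRR404334_ch4.fq; head SRR404336_ch4.fq
Question 1.1: Are these files in fastq format? Yes. Each record has four lines: a header
starting with @, the sequence, a separator starting with +, and the quality string.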
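One rough way to back up the answer above is to test the first record of a file against the four-line fastq layout. This is only a sketch: the function name looks_like_fastq is made up for this example, and it inspects just the first record, so it is a sanity check rather than a full validator.

```shell
# Hypothetical helper: checks only the first 4-line record of a file.
looks_like_fastq() {
    head -n 4 "$1" | awk '
        NR == 1 && substr($0, 1, 1) != "@" { bad = 1 }   # header line
        NR == 2 { seqlen = length($0) }                   # sequence line
        NR == 3 && substr($0, 1, 1) != "+" { bad = 1 }   # separator line
        NR == 4 && length($0) != seqlen { bad = 1 }       # quality line
        END { exit (bad || NR < 4) ? 1 : 0 }
    '
}
```

For example, looks_like_fastq SRR404331_ch4.fq returns success (exit status 0) when the first record fits the layout.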
Solution 1:
• Change the extension of the .fq files to .fastq
mv SRR404331_ch4.fq SRR404331_ch4.fastq
mv SRR404333_ch4.fq SRR404333_ch4.fastq
mv SRR404334_ch4.fq SRR404334_ch4.fastq
mv SRR404336_ch4.fq SRR404336_ch4.fastq
• Count number of sequences in each fastq file using
commands you learnt earlier
grep -c '^+$' SRR404331_ch4.fastq
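The renaming and counting steps above can also be written as small shell functions. This is a sketch, not part of the original solution: the function names are made up, and count_reads assumes every record occupies exactly four lines. Counting lines avoids relying on the separator being a bare +, which is not guaranteed, because fastq allows the header to be repeated after the +.

```shell
# Hypothetical helpers; both assume plain, uncompressed fastq files.

# Rename every .fq file in a directory (default: current one) to .fastq.
rename_fq_to_fastq() {
    dir=${1:-.}
    for f in "$dir"/*.fq; do
        [ -e "$f" ] || continue          # no .fq files: the glob stays literal
        mv "$f" "${f%.fq}.fastq"
    done
}

# Number of reads = number of lines / 4 (assumes 4-line records).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}
```

For example, run rename_fq_to_fastq breaker and then count_reads on each renamed file. The result should match the grep count whenever all separator lines are a bare +.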
Solution 1:
• Convert the fastq files to fasta
fastq_to_fasta -i SRR404331_ch4.fastq -o SRR404331_ch4.fasta -Q33 -n
The -Q33 flag denotes Sanger Phred+33 encoding. The program expects Illumina Phred+64 by
default. See http://seqanswers.com/forums/showthread.php?t=7596
-n tells it not to remove any sequences from the file. By default it removes any reads
containing Ns.
• Now count the number of sequences in fasta file and see
if the number of sequences has changed.
grep -c '^>' SRR404331_ch4.fasta
It may be different if you did not use -n parameter. Not using -n will remove sequences with
any ‘N’ bases
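To estimate in advance how many reads would be dropped without -n, you can count the reads whose sequence contains an N. A minimal sketch, assuming 4-line fastq records and uppercase N only; the function name is made up for this example.

```shell
# Hypothetical helper: counts reads whose sequence line contains an N.
# Assumes 4-line fastq records, so sequence lines are lines 2, 6, 10, ...
count_reads_with_n() {
    awk 'NR % 4 == 2 && /N/ { n++ } END { print n + 0 }' "$1"
}
```

If the fasta file was made without -n, the difference between the fastq and fasta counts would be expected to match count_reads_with_n SRR404331_ch4.fastq.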
Solution 2:
Question 2.2: How many sequences are there per file according to
FastQC?
SRR404331_ch4.fastq = 762,365
SRR404333_ch4.fastq = 744,048
SRR404334_ch4.fastq = 592,123
SRR404336_ch4.fastq = 880,982
Question 2.3: What is the length range of these reads?
53 bp in most cases, except SRR404336, where it is 54 bp
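The read lengths reported by FastQC can be cross-checked on the command line. A minimal sketch, assuming 4-line fastq records; the function name is made up for this example.

```shell
# Hypothetical helper: prints "count length" pairs, shortest length first.
# Assumes 4-line fastq records.
read_length_distribution() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" | sort -n | uniq -c |
        awk '{ print $1, $2 }'
}
```

Each output line gives the number of reads followed by their length, so a file with a single read length produces a single line.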
Question 2.4: What is the qscore range of these reads? Which
one looks best quality-wise?
The range goes from 2 in some cases, such as SRR404334, to a maximum of 44. SRR404331 has
the most consistent quality over the entire read length, with the least drop-off at the 3’ end.
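The qscore range can also be computed directly from the quality lines. This is a sketch under stated assumptions: 4-line fastq records and Phred+33 encoding by default (pass 64 as the second argument for Phred+64 data). The function name is made up for this example.

```shell
# Hypothetical helper: prints "min max" Phred scores found in a fastq file.
# Assumes 4-line records; the offset defaults to 33 (Sanger encoding).
qscore_range() {
    awk -v offset="${2:-33}" '
        # Build a character -> ASCII code lookup table.
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 0 {
            for (j = 1; j <= length($0); j++) {
                q = ord[substr($0, j, 1)] - offset
                if (min == "" || q < min) min = q
                if (max == "" || q > max) max = q
            }
        }
        END { print min, max }
    ' "$1"
}
```

A negative minimum is a hint that the wrong offset was used for the file.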
Solution 2:
Question 2.5: Do these datasets contain overrepresented
reads?
In some cases, such as SRR404331. A BLAST search of the top overrepresented sequence
shows that it is the ASR1 gene from tomato, so it is not contamination and probably has
some biological relevance.
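To pull out candidate sequences for a BLAST search without FastQC, you can list the most frequent exact read sequences. A minimal sketch, assuming 4-line fastq records; it counts exact duplicates only, so it is cruder than the FastQC module. The function name is made up for this example.

```shell
# Hypothetical helper: lists the most frequent read sequences as
# "count sequence", most frequent first. Assumes 4-line fastq records.
# The second argument sets how many sequences to show (default 5).
top_duplicated_reads() {
    awk 'NR % 4 == 2' "$1" | sort | uniq -c | sort -k1,1nr |
        head -n "${2:-5}" | awk '{ print $1, $2 }'
}
```

For example, top_duplicated_reads SRR404331_ch4.fastq 1 prints the single most frequent sequence and how often it occurs.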
Question 2.6: Looking at the kmer content, do you think that
the samples contain an adaptor?
No. Some structure is visible at the 5’ end, but it starts at position 7 and
does not affect all the sequences.
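If you suspect a particular adaptor, a direct check is to count the reads that contain its sequence. A minimal sketch, assuming 4-line fastq records and an exact, case-sensitive match; the function name is made up, and the adaptor sequence is whatever you choose to test.

```shell
# Hypothetical helper: counts reads containing a given subsequence.
# Assumes 4-line fastq records; the match is exact and case-sensitive.
count_reads_containing() {
    awk -v s="$2" 'NR % 4 == 2 && index($0, s) { n++ } END { print n + 0 }' "$1"
}
```

Call it as count_reads_containing SRR404331_ch4.fastq followed by the sequence to look for. A count near zero is consistent with the answer above.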
Goal:
Trim the low quality ends of the reads and remove
the short reads.
Data:
(Illumina data for two tomato ripening stages)
ch4_demo_dataset.tar.gz
Tools:
fastq-mcf (command line tool to process reads)
FastQC (gui, to calculate several stats for each file)
Preprocessing
Exercise 3:
• Download the file adapters1.fa from
ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCorpoica/adapters1.fa
• Run the read processing program over each of the
datasets using a min. qscore of 30 and a min. length of 40
bp.
fastq-mcf -q 30 -l 40 -o SRR404331_ch4.q30l40.fastq /home/bioinfo/Downloads/adapters1.fa SRR404331_ch4.fastq
Remember to specify the correct paths for adapters1.fa and the input file
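Since the same command has to be run on each dataset, it can be wrapped in a loop. This is a sketch, not the course's own solution: run_mcf_all and the FASTQ_MCF variable are made up for this example, and the output naming simply follows the .q30l40.fastq pattern used above.

```shell
# Sketch: run fastq-mcf with the same thresholds on each sample.
# FASTQ_MCF is a made-up variable that lets you point at a different
# binary, or at echo for a dry run that only prints the arguments.
FASTQ_MCF=${FASTQ_MCF:-fastq-mcf}

# Usage: run_mcf_all <adapters.fa> <reads1.fastq> [reads2.fastq ...]
run_mcf_all() {
    adapters=$1
    shift
    for fq in "$@"; do
        out="${fq%.fastq}.q30l40.fastq"
        "$FASTQ_MCF" -q 30 -l 40 -o "$out" "$adapters" "$fq" || return 1
    done
}
```

For example: run_mcf_all /home/bioinfo/Downloads/adapters1.fa SRR404331_ch4.fastq SRR404333_ch4.fastq SRR404334_ch4.fastq SRR404336_ch4.fastq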
Exercise 3:
• Type ‘fastqc’ to start the FastQC program. Load the
four sequence files in the program. Compare the
results with the previous datasets.
SRR404336:
Before fastq-mcf processing: 880,982 reads
After fastq-mcf processing: 840,759 reads
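The before and after counts are easier to compare as a percentage. A minimal sketch; the function name is made up for this example.

```shell
# Hypothetical helper: percentage of reads kept after filtering.
# Usage: percent_retained <reads_before> <reads_after>
percent_retained() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.2f\n", 100 * after / before }'
}
```

For SRR404336, percent_retained 880982 840759 prints 95.43, so fastq-mcf removed 40,223 reads, about 4.6% of the total.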
Need Help??
