Briefly: Bioinformatics

File Formats
J Fass | 26 March 2018
Overview
● ASCII Text
○ Sequence
■ Fasta, Fastq
○ ~Annotation
■ TSV, CSV, BED, GFF, GTF, VCF, SAM
● Binary (Data, Compressed, Executable)
○ Data
■ HDF5
■ BAM / CRAM
■ 2bit
○ Compressed
■ gzip, bzip2, bgzip
○ Executable

UC Davis Genome Center | Bioinformatics Core | J Fass Formats 2018-03-26


TEXT



Fasta
>m54050R1_180210_051102/4194473/0_1421
CCCGGCTGCCCGCCCCGCTCGAAGCGATGACTTGCCGGCGGCCCGACGCGATTAGCTGCCGCGCATGCGATGCGGCCGCGGGCGGCGTGCTGACCTGGCTGGCGGTG
TTGAGCTGCTATACATCCGGCAACAACGCTGCCCAACGACTGACCTGACCGGCCGCCTCGATCCTGGCGGCCGCCGGCCTGGCCTGCGCTTTTCCTTCTTCTCTTTC
CTTC
>m54050R1_180210_051102/4194473/1497_4602
GGCGCGCTGATCGGCAAAACGGCTGGGGCGGCCGGAACACCTTTCAACCGTCGCCAACCGCGATCGCCGCGCGCACCCCGCCTTCCGCGCCGCTGTGGCGTTCCTCG
CCCGTCCTACTCTACTGGCATCCGTCTCATTTCTCCCGCTCTTCCCTCCACCCTTTCCCTGCTCACCGCTTCCGTCTTTTTGTCAACCTCTCCTCTGGGCCGACGAC
GTCGCCGCCTACTGCGACAAAAACCGAGGTCGACAAGGCCCGCCGTTACGACCGTCACCCCGAATTCCATCCGGCTGCTCCGCGGT
>m54050R1_180210_051102/4194551/0_17688
ACCGGACGTACCGCGGGCGGGGGCCTCCCCCCCGGGTGGCTCGGGTGCAGCGCAAATCCTTTCTTTGCTGACCCACCTGCGCAGCGAGTGTGAATCTGTGCGGATCG
AGAAAACAAGAAACCCGGCGGGCCCTGCCTGACGCGCGCCCGTCCCGCCGCGCCCCTTCCGCTTGGCGACGTCGAGTTTTTGACGGGAGGTTTGTCGCTTCGACAGA
CGGGTCCGCCAGCACCCCCTCGTCGCAGTCCCGTTAACTCAGGAAGAACTCCCAGTTGGCCCGGGCATCTGCCAACGCCTCCGGGG
>m54050R1_180210_051102/4194551/17752_17812
AAACATATTATTTTTTATTACTCAAATAATTATTATATTCACCTAATTTTCTTTATTATT
>m54050R1_180210_051102/4194552/0_89
CAGATCGGGGCCCAGCATGGCCACCCGTCCTGCACGTCTACGCGCACTTCGCCGGTGGGGATCGGCAGCGGGAACGGCTCGCGGGCTGG
>m54050R1_180210_051102/4194552/162_490
GCCGCACCCGAGCCGTTCCCGCTGCCGATCCACACCGTCGACGTGCGCGTCGACGTGCAGCCGGCGTCCATGCTTGCCCCGATCTTGGGCTAACAAGCCGCTGCTGA
CACCGACGGACGCCACCGCCCGCGACCAGCTGGCCCGGGCCTCGGTGATGGCGCTGTCCTACCGTCGCGCATTCCCGCGCTCGGCATCTATCAGCCTCGGTGCCGCA
GCGTCATCGACGATGGCGAAACCGTCACTGCACGTTTTCATGACGCGGGGCAGGCAGCGAACCGGGCACATCGGGCATCTACGCCT


Header symbol “>” is also the shell’s output-redirection operator, so quote it in bash commands
(e.g. grep '>' file.fa); an unquoted > will overwrite the file named after it!

Header text (sequence ID) follows conventions particular to different organizations and different
software, but there are no consistent rules that you can rely on.

Sequence can contain: newline characters (“\n”), ACGT, N, acgt, n, x, . or - (gaps), IUPAC ambiguity
codes (BDHV etc.), alternates like [A/T], or amino acid single-letter codes (protein fasta). File names
sometimes signal the content: ‘sequence.fna’ for fasta nucleic acid, ‘sequence.faa’ for fasta amino acid.
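The points above (a “>” header line, then sequence that may wrap over several lines) can be sketched as a small reader. This is a minimal illustration, not a full parser: the function name read_fasta and the two toy records are made up for the example, and no validation of the sequence alphabet is attempted.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            # Sequence may span many lines; join them back together.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical toy input: two records, the first wrapped over two lines.
example = [
    ">read1\n",
    "ACGT\n",
    "NNacgt\n",
    ">read2\n",
    "GG-CC\n",
]
records = list(read_fasta(example))
```

Because the header text has no reliable rules, the sketch returns it whole rather than trying to split it into fields.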



Fastq … “fasta + qualities”
@SN638:981:HK7HWBCXX:2:1101:14799:2762 1:N:0:TTAGGC
TGGCGCAACTGCCGATCACCATCGACACCAACGGGTATCTGGTCGCCAAC
+
GGGGGIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIG
@SN638:981:HK7HWBCXX:2:1101:14784:2782 1:N:0:TTAGGC
CATCATCGAGGACAGCGCCGGTGACCTGGCGGCCCGCATCGGTGCCCCCC
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIGIIII
@SN638:981:HK7HWBCXX:2:1101:14983:2799 1:N:0:TTAGGC
CGGCGCCGTTGCTGCTGCTGCCGGTGCTGCTTTCGGCGCTGATCGTGCGG
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIII
@SN638:981:HK7HWBCXX:2:1101:14763:2901 1:N:0:TTAGGC
CCTGACGACGGCACGAAGGACCTCTTCGTCCACTACTCCGAGATCCAGGG
+
GAGGGIGIGGGGGGGGIA.<GGGIGGAGGGGIIGIIGGIIIG<GA.<<GA

The four lines of each record, in order:
1 @Header1
2 Sequence
3 +Header2
4 Qualities

Blocks of four lines for each sequence (sequences shouldn’t occupy more than one line, as they can in
fasta). Second header line (starting with “+”) is mandatory, sometimes contains the same header as
the first line (that starts with “@”). Why??

The nth quality character applies to the nth nucleotide, and is a number that is encoded in a single
character from the ASCII table.
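The four-line block structure can be sketched as follows. This is a minimal reader under the assumption stated above (one line each for header, sequence, “+” line, and qualities); the name read_fastq and the toy records are hypothetical.

```python
def read_fastq(lines):
    """Yield (header, sequence, qualities) from four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of four")
    for i in range(0, len(stripped), 4):
        head, seq, plus, qual = stripped[i:i + 4]
        if not head.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # The nth quality character belongs to the nth base.
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield head[1:], seq, qual


# Hypothetical toy input; the second record repeats its header after "+".
fastq_lines = [
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2 extra\n", "GGCC\n", "+r2 extra\n", "@I#I\n",
]
fastq_records = list(read_fastq(fastq_lines))
```

Counting lines in fours, rather than searching for “@”, matters because “@” is itself a valid quality character (note the last quality string in the toy input).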



Fastq … “fasta + qualities”
@SN638:981:HK7HWBCXX:2:1101:14799:2762 1:N:0:TTAGGC
TGGCGCAACTGCCGATCACCATCGACACCAACGGGTATCTGGTCGCCAAC
+
GGGGGIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIG
@SN638:981:HK7HWBCXX:2:1101:14784:2782 1:N:0:TTAGGC
CATCATCGAGGACAGCGCCGGTGACCTGGCGGCCCGCATCGGTGCCCCCC
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIGIIII
@SN638:981:HK7HWBCXX:2:1101:14983:2799 1:N:0:TTAGGC
CGGCGCCGTTGCTGCTGCTGCCGGTGCTGCTTTCGGCGCTGATCGTGCGG
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIII
@SN638:981:HK7HWBCXX:2:1101:14763:2901 1:N:0:TTAGGC
CCTGACGACGGCACGAAGGACCTCTTCGTCCACTACTCCGAGATCCAGGG
+
GAGGGIGIGGGGGGGGIA.<GGGIGGAGGGGIIGIIGGIIIG<GA.<<GA

The “I” for base 17 (“C”) of the first read means that that base has a quality of
(I’s decimal value: 73) - 33 = 40 (sometimes referred to as “Q40”). Why 33? Because ASCII codes 0-31
are non-printable control “characters” and 32 is the space, so 33 (“!”) is the first visible
character! (type ‘man ascii’)

Q40 means that the probability of error (that C is actually the wrong basecall) is:

p_e = 10^{-40 / 10} = 10^{-4} = 0.0001, or 1 in 10,000

see also: https://en.wikipedia.org/wiki/FASTQ_format
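The quality arithmetic above can be checked in a few lines. This is a small sketch assuming the offset-33 encoding described on the slide; the function names are made up for the example.

```python
PHRED_OFFSET = 33  # offset used by the encoding described above


def phred33_to_quality(char):
    """Decode one quality character into its numeric quality score."""
    return ord(char) - PHRED_OFFSET


def error_probability(quality):
    """Probability that a basecall with this quality score is wrong."""
    return 10 ** (-quality / 10)


q = phred33_to_quality("I")   # ord("I") is 73, so q is 40
p = error_probability(q)      # 10 ** -4
```

The same two functions applied to “!” give quality 0, i.e. an error probability of 1.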



CSV and TSV - comma/tab-separated values
For example, abundances of mRNAs from genes (count data).

         B01  B02  B03  B04
PDCD1    0    0    0    0
GAL3ST2  0    0    0    0
D2HGDH   55   71   89   101
ING5     1    1    1    1
DTYMK    2    5    7    12
ATG4B    0    0    0    0
THAP4    136  158  85   161
BOK      0    0    0    0
STK25    145  175  195  141

(First tab character - “\t” - in column names sometimes omitted for ease of reading by R scripts).
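A reader for such a count table has to cope with a header row that may be one field shorter than the data rows. The sketch below handles both layouts with the standard csv module; the function name read_count_table and the shortened example table are assumptions made for illustration.

```python
import csv
import io


def read_count_table(handle):
    """Return (sample_names, {gene: [counts]}) from a tab-separated table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]
    if rows and len(header) == len(rows[0]) - 1:
        # Header lacks the leading field: every header entry is a sample name.
        samples = header
    else:
        # Header has a (possibly empty) name for the gene column; skip it.
        samples = header[1:]
    counts = {row[0]: [int(value) for value in row[1:]] for row in rows}
    return samples, counts


short_header = "B01\tB02\tB03\tB04\nPDCD1\t0\t0\t0\t0\nD2HGDH\t55\t71\t89\t101\n"
full_header = "\t" + short_header
samples, counts = read_count_table(io.StringIO(short_header))
samples_full, counts_full = read_count_table(io.StringIO(full_header))
```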



BED - tsv with defined column meanings
chr7 127471196 127472363 Pos1 0 +
chr7 127472363 127473530 Pos2 0 +
chr7 127473530 127474697 Pos3 0 +
chr7 127474697 127475864 Pos4 0 +
chr7 127475864 127477031 Neg1 0 -
chr7 127477031 127478198 Neg2 0 -
chr7 127478198 127479365 Neg3 0 -
chr7 127479365 127480532 Pos5 0 +
chr7 127480532 127481699 Neg4 0 -

column meaning
1 chromosome name
2 feature start coordinate (0-based, inclusive)
3 feature stop coordinate (0-based, exclusive, so the interval is half-open)
4 feature name
5 score (0-1000)
6 strand (‘+’ or ‘-’ or ‘.’ for unknown or not applicable)
… …

Number of columns used shouldn’t vary within a particular file.

see also:
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
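The first six BED columns can be sketched as a small record type. This is an illustration only: the names BedFeature and parse_bed6 are made up, and the length property assumes the usual BED convention of a 0-based start and an exclusive end.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: int
    strand: str

    @property
    def length(self):
        # With a half-open interval the length is a plain difference.
        return self.end - self.start


def parse_bed6(line):
    """Parse one six-column BED line (whitespace- or tab-separated)."""
    f = line.split()
    return BedFeature(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5])


bed = parse_bed6("chr7\t127471196\t127472363\tPos1\t0\t+")
```

In the example above each feature starts where the previous one stops, which is what adjacent, non-overlapping intervals look like under this convention.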



GFF / GTF - tsv with defined column meanings
chr22 TeleGene enhancer 10000000 10001000 500 + . touch1
chr22 TeleGene promoter 10010000 10010100 900 + . touch1
chr22 TeleGene promoter 10020000 10025000 800 - . touch2

column meaning
1 chromosome / scaffold name
2 source (e.g. software that generated this feature / gene call)
3 feature name (e.g. “exon1”, “enhancer”, “3’-UTR”)
4 feature start coordinate (1-based)
5 feature stop coordinate (1-based, inclusive)
6 score (0-1000)
7 strand (‘+’ or ‘-’ or ‘.’ for unknown or not applicable)
8 reading frame (0, 1, 2, or “.” if N/A)
9 group (allows grouping features together)

GTF is newer, and shares the first eight (8) columns. Column 9 has additional restrictions in format
(gene_id, transcript_id, etc.)

see also: https://genome.ucsc.edu/FAQ/FAQformat.html#format3
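Two things trip people up here: GFF coordinates are 1-based while BED starts are 0-based, and GTF packs key/value pairs into column 9. Both are sketched below. The function names and the attribute string are hypothetical, and the attribute parser is a simplification that assumes no semicolons inside quoted values.

```python
def gff_to_bed_interval(line):
    """Convert a GFF line's 1-based inclusive interval to BED-style half-open."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[3]), int(fields[4])
    # Only the start shifts; an inclusive 1-based end equals an exclusive 0-based end.
    return chrom, start - 1, end


def parse_gtf_attributes(column9):
    """Parse GTF column 9 such as: gene_id "G1"; transcript_id "G1.t1";"""
    attrs = {}
    for part in column9.strip().split(";"):
        part = part.strip()
        if part:
            key, _, value = part.partition(" ")
            attrs[key] = value.strip('"')
    return attrs


interval = gff_to_bed_interval(
    "chr22\tTeleGene\tenhancer\t10000000\t10001000\t500\t+\t.\ttouch1"
)
attrs = parse_gtf_attributes('gene_id "G1"; transcript_id "G1.t1";')
```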



VCF - tsv with defined column meanings
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample07 ...
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50
20 111069 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 123027 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60
20 123457 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4
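In the data lines above, the INFO column is a semicolon-separated list of KEY=VALUE pairs (or bare flags such as DB), and the FORMAT column names the colon-separated fields of each sample column. A minimal sketch of unpacking both follows; the function names are made up, and values are left as strings rather than typed according to the ## header lines.

```python
def parse_info(info):
    """Split a VCF INFO column into a dict; bare flags map to True."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_sample(format_col, sample_col):
    """Pair each FORMAT key with the matching field of one sample column."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


info = parse_info("NS=3;DP=14;AF=0.5;DB;H2")
sample = parse_sample("GT:GQ:DP:HQ", "0|0:48:1:51,51")
```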



SAM - tsv with defined column meanings
http://www.htslib.org/

See also samtools man page: http://samtools.sourceforge.net/

SAM spec grew out of 1000 Genomes Project (see Li et al. 2009 Bioinformatics
25:2078)

SAM is plain text; BAM is binary, compressed version of SAM; CRAM is further
compressed but not widely used / recognizable by many tools.



SAM - tsv with defined column meanings
[...]
@SQ SN:ctg103993 LN:217
@SQ SN:ctg103994 LN:222
@SQ SN:ctg103995 LN:205
@SQ SN:ctg103996 LN:210
@PG ID:bwa PN:bwa VN:0.7.13-r1126 CL:bwa mem -t 4 -M ../../01_Reference/Transcriptome-Contigs-Build2.fna
../../02-Cleaned/3E/3E_SE.fastq
@PG ID:bwa-7BC92A6F PN:bwa VN:0.7.13-r1126 CL:bwa mem -t 4 -M ../../01_Reference/Transcriptome-Contigs-Build2.fna
../../02-Cleaned/3E/3E_R1.fastq ../../02-Cleaned/3E/3E_R2.fastq
K00188:264:HG3WJBBXX:1:1116:14692:35180#0 121 ctg2 128 58 101M = 128 0
AAGTCTCGACCAAGTGGTTCAGATGGTGACACAGATGTTAGCCCCATCCACCATTCAGTTGCCGTTTTGATAGCTGGAAATCCTGTAAACACAATGCTGAG
FJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA NM:i:10
K00188:264:HG3WJBBXX:1:1116:14692:35180#0 181 ctg2 128 0 * = 128 0
TTTAGTTTTAATTTTTGACTTTGAATAGCGGGAGTCCAGATCGTGTGAACACAGCAGACTGAGCACTCCATTGACAGCCTTCTTCTGTACTTTAGCTATCC
FJFJJFAAJF7F7JJJJAFFFAF<7<AFFJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFAJJJJJJJJFFFJJJJJJJJJJJFFJJJJJJJFFFAA AS:i:0 XS:i:0
K00188:264:HG3WJBBXX:1:1202:11028:9596#0 121 ctg5 45 60 101M = 45 0
TTCTTTTTTCTACAGTTCATTGTCTGTATAAAGTATGCATCAGGAACAATCTGACTAGGAAGGTAAATAATGTAAAACAGATGATTATTGTATGAAAGTTG
JJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA NM:i:8
K00188:264:HG3WJBBXX:1:1202:11028:9596#0 181 ctg5 45 0 * = 45 0
TCAGCTGTATTAGTAATTTAGTAGAAAAGGTCTTGAGAGAATTATGTTTTTTAAAAATCCACATCACTTCAAACAAAAAGCCCCATTAGAATGGAGGGCCA
FJFJJJJJJFJJJJJJFFJJJJFJAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFF-JFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA AS:i:0
[...]
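The second column of each alignment line above (121, 181) is a bitwise flag. A small decoder makes such values readable. The bit values are those defined in the SAM specification; the short labels and the function name decode_flag are made up for this sketch.

```python
# Bit values from the SAM specification; the label strings are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the labels of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


flag_121 = decode_flag(121)
flag_181 = decode_flag(181)
```

Read this way, 121 marks a mapped first-in-pair read whose mate is unmapped, and 181 marks that unmapped mate, which fits the “*” in its CIGAR column.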



BINARY



HDF5
● “Hierarchical Data Format” used across many industries
● PacBio read data no longer comes in bas.h5 / bax.h5 files (instead, you
get BAM files) … so let’s forget about HDF5!



BAM / CRAM - compressed SAM
● Don’t dump binary formats to your terminal / shell …
● Indexing either BAM or CRAM allows rapid random read access to any
coordinate range, without uncompressing the whole file first
● CRAM restricts sequence alphabet, so compression ratio can be greater
● CRAM can also do lossy compression of base qualities (optional), which
further helps compression ratio
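One way to avoid dumping binary to the terminal is to look at the first few bytes before printing anything. The sketch below relies on two facts: BAM is stored in gzip-compatible blocks, so it starts with the gzip magic bytes, and CRAM files start with the ASCII letters “CRAM”. The function name sniff_format and its return labels are made up for the example.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"   # also the start of BAM, which is gzip-compatible
CRAM_MAGIC = b"CRAM"


def sniff_format(first_bytes):
    """Guess from the leading bytes whether data is safe to print as text."""
    if first_bytes[:2] == GZIP_MAGIC:
        return "gzip"
    if first_bytes[:4] == CRAM_MAGIC:
        return "cram"
    return "unknown"


sam_text = b"@HD\tVN:1.6\n"
guess_plain = sniff_format(sam_text)
guess_gzip = sniff_format(gzip.compress(sam_text))
guess_cram = sniff_format(b"CRAM\x03\x00")
```

In practice the bytes would come from something like open(path, "rb").read(4); “unknown” here only means that neither magic number matched, not that the data is guaranteed to be text.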



2bit
● Old format used for sequence in UCSC Genome Browser
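● Two bits per base, so only 4 possible bases per position:
○ 00 = T
○ 01 = C
○ 10 = A
○ 11 = G
○ … N? Lower case acgt for soft masking? Not in the 2-bit codes themselves; the UCSC .2bit
file records runs of N and soft-masked runs separately, as lists of block coordinates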
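Packing four symbols into two bits each can be sketched with plain integer shifts. This is an illustration of the idea, not a reader for real .2bit files: the function names are made up, and the order argument decides which base gets which code (the default "TCAG" gives T=00, C=01, A=10, G=11, the assignment documented for UCSC .2bit; pass another order to match a different table).

```python
def pack_2bit(seq, order="TCAG"):
    """Pack a DNA string into an integer, two bits per base."""
    codes = {base: i for i, base in enumerate(order)}
    value = 0
    for base in seq.upper():
        if base not in codes:
            # N and other symbols have no two-bit code.
            raise ValueError(f"cannot encode {base!r} in two bits")
        value = (value << 2) | codes[base]
    return value


def unpack_2bit(value, length, order="TCAG"):
    """Recover a DNA string of the given length from a packed integer."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(order[(value >> shift) & 0b11])
    return "".join(bases)


packed = pack_2bit("ACGT")
restored = unpack_2bit(packed, 4)
```

Lower-case input is upper-cased before packing, so soft masking is lost in this sketch, and an N raises an error; any format built on two-bit codes has to carry that information somewhere else.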



Questions … comments … confusion?
