File Formats
J Fass | 26 March 2018
Overview
● ASCII Text
○ Sequence
■ Fasta, Fastq
○ ~Annotation
■ TSV, CSV, BED, GFF, GTF, VCF, SAM
● Binary (Data, Compressed, Executable)
○ Data
■ HDF5
■ BAM / CRAM
■ 2bit
○ Compressed
■ gzip, bzip2, bgzip
○ Executable
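The FASTA header symbol “>” is also the shell's output-redirection operator, so be careful using > in bash commands! An unquoted > (e.g. grep > seqs.fa instead of grep ">" seqs.fa) overwrites the file instead of searching it.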
Header text (sequence ID) follows conventions particular to different organizations and different software, but
there are no consistent rules that you can rely on.
Sequence lines can contain: newline characters (“\n”), ACGT, N, acgt, n, x, . or - (gaps), IUPAC ambiguity
codes (B, D, H, V, etc.), alternates like [A/T], or single-letter amino acid codes (protein FASTA).
File names sometimes signal the type: ‘sequence.fna’ for FASTA nucleic acid, ‘sequence.faa’ for FASTA amino acid.
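A minimal Python sketch of reading the FASTA layout described above: a “>” header line followed by one or more sequence lines. The function name read_fasta and the toy records are illustrative assumptions, not part of any particular library. Because sequence content has no reliable rules, the sketch only strips line endings and joins wrapped lines; it does not validate the alphabet.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence may be wrapped over many lines; keep it verbatim
            # (case, gaps and ambiguity codes are left untouched).
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input (made up for illustration): one nucleotide and one protein record.
example = io.StringIO(">seq1 some description\nACGTNacgt\nRYKM--\n>seq2\nMKV\n")
records = list(read_fasta(example))
```

The header is returned whole, since how to split it into an ID and a description depends on whoever produced the file.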
see also:
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
column  meaning
1       chromosome / scaffold name
2       source (e.g. software that generated this feature / gene call)
3       feature name (e.g. “exon1”, “enhancer”, “3’-UTR”)
4       feature start coordinate (1-based)
5       feature stop coordinate (1-based)
6       score (1-1000)
7       strand (‘+’ or ‘-’ or ‘.’ for unknown or not applicable)
8       reading frame (0, 1, 2, or “.” if N/A)
9       group (allows grouping features together)
GTF is newer than GFF and shares its first eight (8) columns. Column 9 has additional restrictions in format (gene_id, transcript_id, etc.)
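The nine-column layout above can be sketched in Python as follows. This is a rough illustration under stated assumptions: fields are tab-separated, lines starting with “#” are comments, a “.” score means "no score", and GTF column 9 looks like key "value"; pairs. The names GffFeature, parse_gff_line and parse_gtf_attributes, and the example line, are hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str            # column 1: chromosome / scaffold name
    source: str             # column 2
    feature: str            # column 3
    start: int              # column 4, 1-based
    end: int                # column 5, 1-based
    score: Optional[float]  # column 6, None when "."
    strand: str             # column 7
    frame: Optional[int]    # column 8, None when "."
    group: str              # column 9, kept as raw text


def parse_gff_line(line: str) -> Optional[GffFeature]:
    """Parse one line; return None for blank lines and '#' comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, group = fields
    return GffFeature(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if frame == "." else int(frame),
        group,
    )


def parse_gtf_attributes(group: str) -> Dict[str, str]:
    """Split GTF-style column 9 into a dict (simplified: assumes values
    contain no ';' and keeps only the last value of a repeated key)."""
    attrs = {}
    for part in group.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


# Made-up example line, for illustration only.
line = 'chr1\texample_caller\texon\t1300\t1500\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'
feat = parse_gff_line(line)
attrs = parse_gtf_attributes(feat.group)
length = feat.end - feat.start + 1  # both coordinates are 1-based and inclusive
```

Column 9 is left as raw text in the feature record because its syntax is what differs between GFF and GTF; only the GTF-style helper tries to interpret it.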
The SAM spec grew out of the 1000 Genomes Project (see Li et al. 2009, Bioinformatics 25:2078).
SAM is plain text; BAM is a binary, compressed version of SAM; CRAM is compressed
further but is less widely used and not recognized by many tools.
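To make the text-versus-binary distinction concrete, here is a hedged Python sketch that guesses a file's type from its first bytes. It relies on three facts: BAM is stored in gzip-compatible (BGZF) blocks whose decompressed content begins with the bytes BAM\1, CRAM files begin with the bytes CRAM, and SAM is ordinary text. The function name sniff_alignment_format and the toy files are assumptions for illustration; the "BAM" file written here is only a gzip stream carrying the BAM magic bytes, not a valid alignment file.

```python
import gzip
import os
import tempfile


def sniff_alignment_format(path: str) -> str:
    """Guess SAM / BAM / CRAM from leading bytes (a heuristic, not validation)."""
    with open(path, "rb") as fh:
        magic = fh.read(4)
    if magic == b"CRAM":
        return "CRAM"
    if magic[:2] == b"\x1f\x8b":  # gzip magic number, shared by BGZF
        with gzip.open(path, "rb") as gz:
            if gz.read(4) == b"BAM\x01":
                return "BAM"
        return "gzip (not BAM)"
    return "text (possibly SAM)"


# Toy files, made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    sam_path = os.path.join(tmp, "toy.sam")
    bam_like_path = os.path.join(tmp, "toy.bam")
    cram_like_path = os.path.join(tmp, "toy.cram")
    with open(sam_path, "w") as fh:
        fh.write("@HD\tVN:1.6\n")
    with open(bam_like_path, "wb") as fh:
        fh.write(gzip.compress(b"BAM\x01" + b"\x00" * 8))
    with open(cram_like_path, "wb") as fh:
        fh.write(b"CRAM" + b"\x00" * 8)
    results = (
        sniff_alignment_format(sam_path),
        sniff_alignment_format(bam_like_path),
        sniff_alignment_format(cram_like_path),
    )
```

Checking content rather than the file extension is a deliberate choice here, since extensions are only a naming convention.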