Storing SNPs in a HDF5 file: my notebook
I discovered the benefits of using twitter in 2008, the day Deepak Singh replied to one of my tweets related to the storage of a large number of genotypes.
Since that day, I've tried to use the HDF5 library, without any success (there's a large disheartening documentation/API on the HDF5 site and the API seems to be only used by a small number of geeks). Furthermore, HDF5 is a technology used by the IGV genome browser.
In that post, I'll describe a C program loading a set of SNPs defined by the following C structure:
typedef struct structSnpThe SNPs will be read from stdin. Four columns are expected: rs-id, chrom, chromStart and chromEnd. Here is the command line I used to get a small input file:
{
char rsId[RS_LENGTH];//rs##
char chrom[CHROM_LENGTH];//chromosome name
int chromStart;//genomic start index
int chromEnd;//genomic end index
}Snp;
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp132.txt.gz" |\
gunzip -c |\
cut -d ' ' -f 2-5 |\
head -n 10000 > sample.tsv
Source code
The C program is described below. I hope my comments will make the code readable.
The Makefile
Compile and run
gcc -Wall -o a.out -I /path/to/hdf5/include -L /path/to/hdf5/lib dbsnp2hdf5.c -lhdf5
./a.out < sample.tsv
Dump the data
At the en, the program creates a structured binary file containing our SNPs. The HDF5 utility h5dump can be used to display the data as text:
$ h5dump storage.h5
HDF5 "storage.h5" {
GROUP "/" {
GROUP "variations" {
DATASET "dbSNP" {
DATATYPE H5T_COMPOUND {
H5T_STRING {
STRSIZE 13;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} "rs";
H5T_STRING {
STRSIZE 7;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} "chrom";
H5T_STD_I32LE "chromStart";
H5T_STD_I32LE "chromEnd";
}
DATASPACE SIMPLE { ( 10000 ) / ( H5S_UNLIMITED ) }
DATA {
(0): {
"rs112750067",
"chr1",
10326,
10327
},
(1): {
"rs56289060",
"chr1",
10433,
10433
},
(2): {
"rs112155239",
"chr1",
10439,
10440
},
(3): {
"rs112766696",
"chr1",
10439,
10440
},
(...)
450425,
450426
}
}
ATTRIBUTE "rdfs:comment" {
DATATYPE H5T_STRING {
STRSIZE 15;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "My list of SNPs"
}
}
}
}
}
}
That's it !
Pierre