23 April 2010

Short post: Plain text vs Binary data.

This week, I was asked the difference between storing some data using a plain text format and a binary format. I wrote the following code C++ to illustrate how to store some genotypes in a ( possibly huge )table saved as a binary file. The first bytes of this file tells the number of individuals, the number of markers, the name of the individuals. Then for each marker, we get the name of the marker, its position and an array of genotypes for each individual.

The first invocation of this program (-p write ) creates a random table and a second call (-p read ) answers the genotypes for some random individuals/markers.



That's it.

Pierre

20 April 2010

Reading/Writing SAM/BAM files with the picard API for java.

I'm currently playing with the BAM/SAM Java API available in the picard package. This API can be used for creating new programs that read and write SAM files, the generic format for storing large nucleotide sequence alignments. Both SAM text format and SAM binary (BAM) format are supported. The source code comes with a simple example showing how to read and write a SAM/BAM file.

(...)
public void convertReadNamesToUpperCase(
final File inputSamOrBamFile,
final File outputSamOrBamFile
)
{
final SAMFileReader inputSam = new SAMFileReader(inputSamOrBamFile);
final SAMFileWriter outputSam = new SAMFileWriterFactory().makeSAMOrBAMWriter(inputSam.getFileHeader(),
true, outputSamOrBamFile);

for (final SAMRecord samRecord : inputSam) {
// Convert read name to upper case.
samRecord.setReadName(samRecord.getReadName().toUpperCase());
outputSam.addAlignment(samRecord);
}
outputSam.close();
inputSam.close();
}
(...)
toay, I've played with this SAM API to build two simple java tools:

SbamGrep

SbamGrep is available at http://code.google.com/p/code915/wiki/SbamGrep. It filters a SAM/BAM file using the SAM flags.

Option

 -o (filename-out) or default is stdout SAM
-v inverse selection
-f [flag] add bam/sam flag for filtering. multiple separeted with a comma. One of:
READ_PAIRED or 1 or 0x1
PROPER_PAIR_ or 2 or 0x2
READ_UNMAPPED or 4 or 0x4
MATE_UNMAPPED or 8 or 0x8
READ_STRAND or 16 or 0x10
MATE_STRAND or 32 or 0x20
FIRST_OF_PAIR or 64 or 0x40
SECOND_OF_PAIR or 128 or 0x80
NOT_PRIMARY_ALIGNMENT or 256 or 0x100
READ_FAILS_VENDOR_QUALITY_CHECK or 512 or 0x200
DUPLICATE_READ or 1024 or 0x400

Example

The following command lists all the reads having a flag (READ_PAIRED and READ_UNMAPPED and MATE_UNMAPPED and FIRST_OF_PAIR):
java -cp sam-1.16.jar:dist/sbamgrep.jar fr.inserm.umr915.sbamtools.SbamGrep -f 0x4,0x1,0x8,0x40 file.bam
Result (masked):
@HD VN:1.0 SO:unsorted
@SQ SN:chrT LN:349250
IL_XXXX1 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (94**0-)*7=0688855555522@86;;;5;6:;63:4?-622647..-.5.%
IL_XXXX2 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9*+*2396,@5+5:@@@;;5)50)6960684;58;86*5102)0*+8:*137;
IL_XXXX3 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (/999-00328:88984@@=8??@@:-66,;8;5;6+;255,1;788883676'
IL_XXXX4 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (916928.82@@50854;33222224;@25?5522;5=;;858888555*0666
IL_XXXX5 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9*4*5-**32989+::@82;;853+39;80.53)-)79)..'55.8988*200
IL_XXXX6 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (*+**,14265;@@??@8?9@@@5@488?8666260.)-*9;;;88:8'05418
IL_XXXX7 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9136242-2@@@;96.888@@@@80$585882623.':**+3*03137..--.
IL_XXXX8 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (/89255855@88?585557..())@@@;5552286526755@@5888..3;/$
(...)

SbamStats

The second tool, SbamStats is available at http://code.google.com/p/code915/wiki/SbamStats. It provides a simple report about all the SAM flags used in one or more files.

Example

:
java -cp sam-1.16.jar:dist/sbamstats.jar fr.inserm.umr915.sbamtools.SBamStats file.bam
Output:
READ_PAIRED:0x1|READ_STRAND:0x10|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 10827
READ_PAIRED:0x1|READ_STRAND:0x10|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 10827
READ_PAIRED:0x1|FIRST_OF_PAIR:0x40 13951
READ_PAIRED:0x1|SECOND_OF_PAIR:0x80 13951
READ_PAIRED:0x1|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 23846
READ_PAIRED:0x1|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 23846
READ_PAIRED:0x1|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 33606
READ_PAIRED:0x1|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 33606
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 143090
READ_PAIRED:0x1|READ_UNMAPPED:0x4|READ_STRAND:0x10|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 143090
READ_PAIRED:0x1|READ_UNMAPPED:0x4|SECOND_OF_PAIR:0x80 161781
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|FIRST_OF_PAIR:0x40 161781
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|SECOND_OF_PAIR:0x80 174429
READ_PAIRED:0x1|READ_UNMAPPED:0x4|FIRST_OF_PAIR:0x40 174429
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 176219
READ_PAIRED:0x1|READ_UNMAPPED:0x4|READ_STRAND:0x10|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 176219
READ_PAIRED:0x1|PROPER_PAIR_:0x2|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 915161
READ_PAIRED:0x1|PROPER_PAIR_:0x2|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 915161
READ_PAIRED:0x1|PROPER_PAIR_:0x2|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 2623345
READ_PAIRED:0x1|PROPER_PAIR_:0x2|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 2623345
READ_PAIRED:0x1|READ_UNMAPPED:0x4|MATE_UNMAPPED:0x8|SECOND_OF_PAIR:0x80 26232123
READ_PAIRED:0x1|READ_UNMAPPED:0x4|MATE_UNMAPPED:0x8|FIRST_OF_PAIR:0x40 26232123



That's it !
Pierre

19 April 2010

A stateful C function for R: parsing Fasta sequences

In the following post, I'll create a C extension for R. This extension will iterate over all the FASTA sequences in a file and will return a pair(name,sequence) for each sequence, that is to say that I won't store all the sequences in memory.

The C code

The structure FastaHandler holds the state of the fasta parser. It contains a pointer to the FILE, the current sequence and fasta header. We also need to save the previous line.
/** stateful structure for the fastafile */
typedef struct fastaHandler_t
{
/** input file */
FILE* in;
/** save the previous line, it will contains the next fasta header */
char* previous_line;
/** sequence name */
char *seq_name;
/** sequence dna */
char *seq_dna;
/** sequence_length */
int seq_length;
}FastaHandler,*FastaHandlerPtr;

The function fasta_open opens the fasta file and initialize a new FastaHandlerPtr. This pointer is then persisted into a 'R' variable using R_MakeExternalPtr.
/**
* open a fasta file, init the FastaHandler
*/
SEXP fasta_open(SEXP filename)
{
FastaHandlerPtr handle=NULL;
if(!isString(filename)) error("filename is not a string");
if(length(filename)!=1) error("expected only one filename");

handle=(FastaHandlerPtr)calloc(1,sizeof(FastaHandler));
if(handle==NULL)
{
error("Cannot alloc FastaHandler");
}
const char* c_filename=CHAR(STRING_ELT(filename,0));
errno=0;
handle->in= fopen( c_filename,"r");
if(handle->in==NULL)
{
error("Cannot open \"%s\": %s",c_filename,strerror(errno));
}
/** the handle is bound a R variable */
return R_MakeExternalPtr(handle, R_NilValue, R_NilValue);
}
We also need a function to close the structure. The pointer to the FastaHandlerPtr is retrieved using R_ExternalPtrAddr(R_variable):
/**
* close the FastaHandlerPtr
*/
SEXP fasta_close(SEXP r_handle)
{
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in!=NULL)
{
fclose(handle->in);
handle->in=NULL;
}
if(handle->previous_line!=NULL)
{
free(handle->previous_line);
handle->previous_line=NULL;
}
free(handle->seq_name);
handle->seq_name=NULL;
free(handle->seq_dna);
handle->seq_dna=NULL;
R_ClearExternalPtr(r_handle);
return ScalarInteger(0);
}
The function fasta_next return the next pair(name,sequence):
/**
* read the next fasta sequence
*/
SEXP fasta_next(SEXP r_handle)
{
int count=0;
char* ptr=NULL;
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in==NULL) error("handle->in==NULL");

while((ptr=readLine(handle))!=NULL)
{
if(ptr[0]=='>')
{
//this is the header of a fasta sequence
if(handle->seq_name!=NULL || handle->seq_dna!=NULL)
{
handle->previous_line=ptr;
SEXP return_value=NULL;
//if there a sequence in memory ?
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
//create the sequence
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
//return the sequence if any
if(return_value!=NULL) return return_value;
}
//cleanup
handle->seq_name=ptr;
handle->seq_length=0;
handle->seq_dna=NULL;
}
else
{
int j=0;
int len=0;
//remove blank characters
while(ptr[j]!=0)
{
if(!isspace(ptr[j]))
{
ptr[len++]=ptr[j];
}
++j;
}
//enlarge the dna sequence
handle->seq_dna=realloc(
handle->seq_dna,
sizeof(char)*(handle->seq_length+1+len)
);
if(handle->seq_dna==NULL) error("cannot realloc seq_dna");
//append the new line
memcpy(&handle->seq_dna[handle->seq_length],ptr,sizeof(char)*len);
handle->seq_length+=len;
handle->seq_dna[handle->seq_length]=0;
free(ptr);
ptr=NULL;
}
}
//last sequence
SEXP return_value=NULL;
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
return (return_value==NULL?R_NilValue:return_value);
}
This function calls readLine reading the next non-empty line:
/**
* return the next non empty line from a FastaHandlerPtr
* should be free()
*/
static char* readLine(FastaHandlerPtr handle)
{
int c;
char* ptr=NULL;
int length=0;
/* if there was a saved line, return it */
if(handle->previous_line!=NULL)
{
ptr=handle->previous_line;
handle->previous_line=NULL;
return ptr;
}
/** while the line is empty */
while(length==0)
{
int capacity=0;
//no more to read ?
if(feof(handle->in)) return NULL;
ptr=NULL;
//read each char
while((c=fgetc(handle->in))!=EOF)
{
//if this is an EOL, break
if(c=='\n') break;
//enlarge the buffer if it is too small
if(ptr==NULL || (length+2)>=capacity)
{
capacity+=500;
ptr=(char*)realloc(ptr,capacity*sizeof(char));
if(ptr==NULL) error("cannot realloc to %d bytes",capacity);
}
//append the char to the buffer
ptr[(length)++]=c;
}
//nothing was read (empty line) ? continue
if(length==0)
{
free(ptr);
ptr=NULL;
}
}
//add a end-of-string char
ptr[length]=0;
//return the line
return ptr;
}
It also invokes the function make_sequence(name,dna). This function creates a new R pair(name,sequence) :
/** create a pair list containing the name and the dna */
static SEXP make_sequence(const char* seq_name,const char* length)
{
SEXP values,names;
//the R value contains two objects
PROTECT(values = allocVector(VECSXP, 2));
//first item is the sequence name
SET_VECTOR_ELT(values, 0, mkString(seq_name));
//second item is the sequence dna
SET_VECTOR_ELT(values, 1, mkString(length));

//the labels
PROTECT(names = allocVector(STRSXP, 2));
//first label is the name
SET_STRING_ELT(names, 0, mkChar("name"));
//2nd label is the name
SET_STRING_ELT(names, 1, mkChar("sequence"));

//bind the labels to the values
setAttrib(values, R_NamesSymbol, names);
UNPROTECT(2);
return values;
}

The 'R' code

The 'C' dynamic library is loaded and each C function is bound to R using ".Call":
dyn.load(paste("libfasta", .Platform$dynlib.ext, sep=""))

fasta.open <- function(filename)
{
.Call("fasta_open", filename)
}

fasta.close <- function(handler)
{
.Call("fasta_close",handler)
}

fasta.next <- function(handler)
{
.Call("fasta_next",handler)
}

Compiling

Here is my makefile:
run:libfasta.so rs_chLG1.fas
${R_HOME}/bin/R --no-save < fasta.R

rs_chLG1.fas:
wget -O [email protected] ftp://ftp.ncbi.nih.gov/snp/organisms/bee_7460/rs_fasta/rs_chLG1.fas.gz
gunzip [email protected]

libfasta.so:fasta.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include fasta.c
gcc -shared -Wl,-soname,fasta.so.1 -o $@ fasta.o

Testing

I've downloaded some fasta sequences from dbSNP/Apis Mellifera. This file is open with fasta.open, we loop over each sequence and we print those sequences having a length lower than 300bp, at the end we close the stream with fasta.close:
f<-fasta.open("rs_chLG1.fas")

while(!is.null(snp <-fasta.next(f)))
{
if(nchar(as.character(snp$sequence))< 300)
print(snp)
}

fasta.close(f)

Result:
$name
[1] "gnl|dbSNP|rs44106732 rs=44106732|pos=48|len=298|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "GAAAATCAAAGACAATTTTTGGAATGGAACTAAATAACTTTATTCTTYCTTTCTTTCTCGTCGAGAATATCGTTATCGTTGCATGGGTTATGGAATAGCGTTCGTTAAAAATGTTATATTTCGAGGAAATATCGAAGATAGGCTTTGCGAAAGTCTGTTTCTCTAGAATTAAGATTTATATGTGTTGCAAGGGGAAGTTCAAAGAGAAATCGTGGCCAGTTCGAACATTATTATGTCTATCAATGATCGAGAAGTGTCATTGATGCAAGAGAAAAGTTTTCTCTTGCATTTAAGTATC"

$name
[1] "gnl|dbSNP|rs44108824 rs=44108824|pos=251|len=268|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "GTTAAATGGGAATTTTGGGGATTATTGGAGGAGGATTTACGTTTCGAGGATTGTTGATGATCTTAGGATTGTTTTCAGTTTGGAATTTTTTCTTCTTCTTCAACGTTGTACAACATTCTCCTCGAATTTTGTGTAACGAGGAGGATTAAACCTTTGGAAAATCACGTAAATTAGAGGACGATATATTGGGTTGGCAACTAAGTAATTGCGGATTTTTTTTAGAAAATCAAAGACAATTTTTGGAATGGAAYTAAATAACTTTATTCTT"

$name
[1] "gnl|dbSNP|rs44150807 rs=44150807|pos=251|len=274|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "TTCGATCTTCGTTTCACGCTCACGTTTCACGTTTCTCGTCCGGCGTAACGAACGGTATTCCGCTGATCACGAATAATTTCTTTCTGGAGAGTCCATTAGGGGACCGTCCCCTCCCCCCTCTCGCGCACACAGACACCCATCGCTTTCGACGCCTCGTCCATTCGAGGGAGAAACGAACGACTATTAGAAAAAAATCTTCTTTATATCTCTATAAATTCAATTTGCAGAGAAGCAAAGAGCTTTAAAATATYAACCATTATAACCGAACTTGTTG"

(....)


Full Source code


Makefile


run:libfasta.so rs_chLG1.fas
${R_HOME}/bin/R --no-save < fasta.R

rs_chLG1.fas:
wget -O [email protected] ftp://ftp.ncbi.nih.gov/snp/organisms/bee_7460/rs_fasta/rs_chLG1.fas.gz
gunzip [email protected]

libfasta.so:fasta.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include fasta.c
gcc -shared -Wl,-soname,fasta.so.1 -o $@ fasta.o

fasta.c


#include <ctype.h>
#include <errno.h>
#include <R.h>
#include <Rinternals.h>

/** stateful structure for the fastafile */
typedef struct fastaHandler_t
{
/** input file */
FILE* in;
/** save previous line, it will contains the next fasta header */
char* previous_line;
/** sequence name */
char *seq_name;
/** sequence dna */
char *seq_dna;
/** sequence_length */
int seq_length;

}FastaHandler,*FastaHandlerPtr;


/**
* return the next non empty line from a FastaHandlerPtr
* should be free()
*/
static char* readLine(FastaHandlerPtr handle)
{
int c;
char* ptr=NULL;
int length=0;
/* if there was a saved line, return it */
if(handle->previous_line!=NULL)
{
ptr=handle->previous_line;
handle->previous_line=NULL;
return ptr;
}
/** while the line is empty */
while(length==0)
{
int capacity=0;
//no more to read ?
if(feof(handle->in)) return NULL;
ptr=NULL;
//read each char
while((c=fgetc(handle->in))!=EOF)
{
//if this is an EOL, break
if(c=='\n') break;
//enlarge the buffer if it is too small
if(ptr==NULL || (length+2)>=capacity)
{
capacity+=500;
ptr=(char*)realloc(ptr,capacity*sizeof(char));
if(ptr==NULL) error("cannot realloc to %d bytes",capacity);
}
//append the char to the buffer
ptr[(length)++]=c;
}
//nothing was read (empty line) ? continue
if(length==0)
{
free(ptr);
ptr=NULL;
}
}
//add a end-of-string char
ptr[length]=0;
//return the line
return ptr;
}

/**
* open a fasta file, init the FastaHandler
*/
SEXP fasta_open(SEXP filename)
{
FastaHandlerPtr handle=NULL;
if(!isString(filename)) error("filename is not a string");
if(length(filename)!=1) error("expected only one filename");

handle=(FastaHandlerPtr)calloc(1,sizeof(FastaHandler));
if(handle==NULL)
{
error("Cannot alloc FastaHandler");
}
const char* c_filename=CHAR(STRING_ELT(filename,0));
errno=0;
handle->in= fopen( c_filename,"r");
if(handle->in==NULL)
{
error("Cannot open \"%s\": %s",c_filename,strerror(errno));
}
/** the handle is bound a R variable */
return R_MakeExternalPtr(handle, R_NilValue, R_NilValue);
}

/** create a pair list containing the name and the dna */
static SEXP make_sequence(const char* seq_name,const char* length)
{
SEXP values,names;
//the R value contains two objects
PROTECT(values = allocVector(VECSXP, 2));
//first item is the sequence name
SET_VECTOR_ELT(values, 0, mkString(seq_name));
//second item is the sequence dna
SET_VECTOR_ELT(values, 1, mkString(length));

//the labels
PROTECT(names = allocVector(STRSXP, 2));
//first label is the name
SET_STRING_ELT(names, 0, mkChar("name"));
//2nd label is the name
SET_STRING_ELT(names, 1, mkChar("sequence"));

//bind the labels to the values
setAttrib(values, R_NamesSymbol, names);
UNPROTECT(2);
return values;
}

/**
* read the next fasta sequence
*/
SEXP fasta_next(SEXP r_handle)
{
int count=0;
char* ptr=NULL;
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in==NULL) error("handle->in==NULL");

while((ptr=readLine(handle))!=NULL)
{
if(ptr[0]=='>')
{
//this is the header of a fasta sequence
if(handle->seq_name!=NULL || handle->seq_dna!=NULL)
{
handle->previous_line=ptr;
SEXP return_value=NULL;
//if there a sequence in memory ?
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
//create the sequence
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
//return the sequence if any
if(return_value!=NULL) return return_value;
}
//cleanup
handle->seq_name=ptr;
handle->seq_length=0;
handle->seq_dna=NULL;
}
else
{
int j=0;
int len=0;
//remove blank characters
while(ptr[j]!=0)
{
if(!isspace(ptr[j]))
{
ptr[len++]=ptr[j];
}
++j;
}
//enlarge the dna sequence
handle->seq_dna=realloc(
handle->seq_dna,
sizeof(char)*(handle->seq_length+1+len)
);
if(handle->seq_dna==NULL) error("cannot realloc seq_dna");
//append the new line
memcpy(&handle->seq_dna[handle->seq_length],ptr,sizeof(char)*len);
handle->seq_length+=len;
handle->seq_dna[handle->seq_length]=0;
free(ptr);
ptr=NULL;
}
}
//last sequence
SEXP return_value=NULL;
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
return (return_value==NULL?R_NilValue:return_value);
}

/**
* close the FastaHandlerPtr
*/
SEXP fasta_close(SEXP r_handle)
{
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in!=NULL)
{
fclose(handle->in);
handle->in=NULL;
}
if(handle->previous_line!=NULL)
{
free(handle->previous_line);
handle->previous_line=NULL;
}
free(handle->seq_name);
handle->seq_name=NULL;
free(handle->seq_dna);
handle->seq_dna=NULL;
R_ClearExternalPtr(r_handle);
return ScalarInteger(0);
}

fasta.R


dyn.load(paste("libfasta", .Platform$dynlib.ext, sep=""))

fasta.open <- function(filename)
{
.Call("fasta_open", filename)
}
fasta.close <- function(handler)
{
.Call("fasta_close",handler)
}

fasta.next <- function(handler)
{
.Call("fasta_next",handler)
}

f<-fasta.open("rs_chLG1.fas")

while(!is.null(snp <-fasta.next(f)))
{
if(nchar(as.character(snp$sequence))< 300)
print(snp)
}

fasta.close(f)

That's it !
Pierre

14 April 2010

Object Oriented Programming with R: My notebook

In the following post, I describe how I've used the OOP features of R to create and use the following class hierarchy:

Your browser does not support the <CANVAS> element !


First the class Person is defined. It contains four fields : firstName, lastName, birthDate and birthPlace.
setClass("Person",
representation(
firstName="character",
lastName="character",
birthDate="Date",
birthPlace="character"
))

A kind of 'constructor' function can be called for Person to check that both firtsName and lastName are not empty:
setValidity("Person",
function(object)
{
length(object@firstName)>0 &&
length(object@lastName)>0
}
)
DeceasedPerson is a subClass of Person, it contains two more fields: deathPlace and deathDate:
setClass("DeceasedPerson",
representation(
deathDate="Date",
deathPlace="character"
),
contains="Person"
)
Scientist is another subClass of Person it contains one more field:'knownFor':
setClass("Scientist",
representation(
knownFor="character"
),
contains="Person"
)
Lastly, DeceasedScientist is a subClass of both Scientist and DeceasedPerson:
setClass("DeceasedScientist",
contains=c("Scientist","DeceasedPerson")
)
Let's define a 'generic' function 'age' returning the age of an individual from his 'birthdate':
age <- function(individual)
{
as.integer((Sys.Date()-individual@birthDate)/365)
}
setGeneric("age")
Polymorphism: for the DeceasedPerson another function will be used, it will calculate the age from both 'deathDate' and 'birthDate':
age.of.death <- function(individual)
{
as.integer((individual@deathDate-individual@birthDate)/365)
}
setMethod(age,signature=c("DeceasedPerson"),definition=age.of.death)
Ok, let's play with our class, we can first create a new instance of Scientist for Craig Venter:
craigVenter <-new(
"Scientist",
firstName="Craig",
lastName="Venter",
birthPlace="Salt Lake City",
birthDate=as.Date("1946-10-14", "%Y-%m-%d"),
knownFor=c("The Institute for Genomic Research","J. Craig Venter Institute")
)
... and Charles Darwin is a DeceasedScientist:
charlesDarwin <-new(
"DeceasedScientist",
firstName="Charles",
lastName="Darwin",
birthDate=as.Date("1809-02-12", "%Y-%m-%d"),
deathDate=as.Date("1882-04-19", "%Y-%m-%d"),
knownFor=c("Natural Selection","The Voyage of the Beagle")
)
Hey , we know where Charles was born!
charlesDarwin@birthPlace="Shrewsbury"
The following statement fails because the firstName is empty:
> try(new("Person",lastName="Darwin",birthDate=as.Date("1809-02-12", "%Y-%m-%d")),FALSE)
Error in validObject(.Object) : invalid class "Person" object: FALSE
Is Darwin a valid object?:
> validObject(charlesDarwin)
[1] TRUE
Print both individuals:
> charlesDarwin
An object of class “DeceasedScientist”
Slot "knownFor":
[1] "Natural Selection" "The Voyage of the Beagle"

Slot "firstName":
[1] "Charles"

Slot "lastName":
[1] "Darwin"

Slot "birthDate":
[1] "1809-02-12"

Slot "birthPlace":
[1] "Shrewsbury"

Slot "deathDate":
[1] "1882-04-19"

Slot "deathPlace":
character(0)

> craigVenter
An object of class “Scientist”
Slot "knownFor":
[1] "The Institute for Genomic Research" "J. Craig Venter Institute"

Slot "firstName":
[1] "Craig"

Slot "lastName":
[1] "Venter"

Slot "birthDate":
[1] "1946-10-14"

Slot "birthPlace":
[1] "Salt Lake City"
Let's use the 'is' operator:
> is(craigVenter,"Person")
[1] TRUE
> is(craigVenter,"DeceasedScientist")
[1] FALSE
> is(charlesDarwin,"DeceasedScientist")
[1] TRUE
Finally let's invoke the polymorhic function 'age' for both individuals:
> age(charlesDarwin)
[1] 73 #age.of.death was called
> age(craigVenter)
[1] 63 #generic "age'


Full source code

setClass("Person",
representation(
firstName="character",
lastName="character",
birthDate="Date",
birthPlace="character"
))

setValidity("Person",
function(object)
{
length(object@firstName)>0 &&
length(object@lastName)>0
}
)

setClass("DeceasedPerson",
representation(
deathDate="Date",
deathPlace="character"
),
contains="Person"
)

setClass("Scientist",
representation(
knownFor="character"
),
contains="Person"
)

age <- function(individual)
{
as.integer((Sys.Date()-individual@birthDate)/365)
}

setGeneric("age")

age.of.death <- function(individual)
{
as.integer((individual@deathDate-individual@birthDate)/365)
}


setClass("DeceasedScientist",
contains=c("Scientist","DeceasedPerson")
)

setMethod(age,signature=c("DeceasedPerson"),definition=age.of.death)

craigVenter <-new(
"Scientist",
firstName="Craig",
lastName="Venter",
birthPlace="Salt Lake City",
birthDate=as.Date("1946-10-14", "%Y-%m-%d"),
knownFor=c("The Institute for Genomic Research","J. Craig Venter Institute")
)

charlesDarwin <-new(
"DeceasedScientist",
firstName="Charles",
lastName="Darwin",
birthDate=as.Date("1809-02-12", "%Y-%m-%d"),
deathDate=as.Date("1882-04-19", "%Y-%m-%d"),
knownFor=c("Natural Selection","The Voyage of the Beagle")
)

try(new("Person",lastName="Darwin",birthDate=as.Date("1809-02-12", "%Y-%m-%d")),FALSE)

charlesDarwin@birthPlace="Shrewsbury"

validObject(charlesDarwin)

charlesDarwin
craigVenter


is(craigVenter,"Person")
is(craigVenter,"DeceasedScientist")
is(charlesDarwin,"DeceasedScientist")
age(charlesDarwin)
age(craigVenter)


That's it !
Pierre

08 April 2010

BAM, BWA and SAMTOOLS: my notebook

Here are my notes about MAQ,BWA and SAMTOOLS.

MAQ


Maq is a software that builds mapping assemblies from short reads generated by the next-generation sequencing machines. Here I used chrM.fa, the reference sequence for the mitochondrial genome
>chrM
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT
TTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTA.....
fasta2bfa converts the reference sequence to the binary fasta format
$(MAQ) fasta2bfa chrM.fa _reference.bfa


I also generated a set of random fastq sequences for the chrM. Each read contains a mutation A/G every 10 bases and an insertion of 3 bases every 50 bases.
(...)
@chrM|8|2
GTATATCACCCTATTAACCACTAACGGGAG
+
kzzst|{efokkdyirlqkflupvofumxe
@chrM|20|3
ATTAACCACTAACGGGAGCTATCCATGCAT
+
zplixjsdlmeemjrxxkt|nrd?qwqwym
@chrM|25|4
CCACTAACGGGAGCTATCCATGCATAAATGGT
+
|uklu}+rqepiuoi|qjyrhpyzlllrq|oz
@chrM|32|5
CGGGAGCTATCCATGCATAAATGGTATTTTAG
+
qyvylqiqkhddve4rddxh|frmznluouxp
(...)

fastq2bfq converts the short reads to the binary fasta format:
$(MAQ) fastq2bfq chrM.fastq _reads.bfq

We can now align the reads to the chrM:
$(MAQ) map -u _poor.txt -H _multiple.data _align.map _reference.bfa _reads.bfq
'_multiple.data' contains the multiple/all 01-mismatch hits , and '_poor.txt' contains the unmapped and poorly aligned reads:
chrM|0|1 99 AAAATCACAGGTATATCACCCTATTAACCACT ````````T``````)````````````````
chrM|25|4 99 CCACTAACGGGAGCTATCCATGCATAAATGGT ``````+`````````````````````````
chrM|32|5 99 CGGGAGCTATCCATGCATAAATGGTATTTTAG ``````````````4`````````````````
chrM|35|6 99 GAGCTATCCATGCATAAATGGTATTTTAGTCT ``````````````````````2`````````
chrM|49|7 99 TAAATGGTATTTTAGTCTGGGGGATGTGCACG ````````````````````````````````
chrM|76|10 99 ACGCAATAGCATTGAGAGACGCTGAAAAGCCG `````````````&``````````````````
chrM|80|11 99 AATAGCATTGAGAGACGCTGAAAAGCCGGAGC ``````````````B```Q``````2``````
chrM|89|12 99 GAGAGACGCTGAAAAGCCGGAGCACCCTATGT ````````````````:```````````````
chrM|90|13 99 AGAGACGCTGAAAAGCCGGAGCACCCTATGTC ````````````````````````````````
chrM|92|14 99 AGACGCTGAAAAGCCGGAGCACCCTATGTCAC ``````````````````?`````````````

View the mapping alignment with mapview:
$(MAQ) mapview _align.map > _mapview.txt

##result 'verticalized'

$1 read.name : chrM|71|26834
$2 chromosome : chrM
$3 position : 72
$4 strand : +
$5 insert.size : 0
$6 paired.flag : 0
$7 mapping.quality : 59
$8 single-end.mapping.quality : 59
$9 alternative.mapping.quality: 59
$10 number.mismatches.of.best.hit : 2
$11 sum.qualities.mismatched.bases.best.hit : 60
$12 number of 0-mismatch hits of the first 24bp : 0
$13 number of 1-mismatch hits of the first 24bp on ref : 1
$14 length of the read : 32
$15 sequence : TGTGCACGCGATAGCATTGGGAGACGCTGGGG
$16 quality : ```````````````````````X````````
(...)


mapstat : get statistics from the alignment:
$(MAQ) mapstat _align.map > _mapstat.txt


-- Total number of reads: 308
-- Sum of read length: 9856
-- Error rate: 0.058746
-- Average read length: 32.00

-- Mismatch statistics:

-- MM 0 1
-- MM 1 37
-- MM 2 268
-- MM 3 2

-- Mapping quality statistics:

-- MQ 20-29 1 1
-- MQ 40-49 138 138
-- MQ 50-59 107 107
-- MQ 70-79 62 62

-- Flag statistics:

-- FG 0 308


mapcheck: Read quality check
$(MAQ) mapcheck _reference.bfa _align.map > _mapcheck.txt

Number of reference sequences: 1
Length of reference sequences exlcuding gaps: 16571
Length of gaps in the reference sequences: 0
Length of non-gap regions covered by reads: 3846
Length of 24bp unique regions of the reference: 3846
Reference nucleotide composition: A: 30.86%, C: 31.33%, G: 13.16%, T: 24.66%
Reads nucleotide composition: A: 41.35%, C: 24.81%, G: 13.91%, T: 19.93%
Average depth across all non-gap regions: 0.591
Average depth across 24bp unique regions: 0.591

A C G T : AC AG AT CA CG CT GA GC GT TA TC TG : 0? 1? 2? 3? 4? 5? 6? : 0? 1? 2? 3? 4? 5? 6?
1 52.1 18.5 16.8 12.5 : 0 33 25 253 0 0 125 0 0 288 0 182 : 3 204 0 0 254 340 198 : 15 998 167 167 0 0 0
2 56.4 16.2 12.9 14.5 : 0 41 0 164 33 0 171 0 0 220 0 34 : 3 36 0 3 412 346 198 : 15 991 167 15 200 29 0
3 52.1 19.5 12.5 15.8 : 0 29 7 114 43 0 222 0 0 138 0 52 : 3 7 3 7 438 340 201 : 15 8 15 8 255 10 0
4 45.9 24.1 10.2 19.8 : 0 15 15 52 0 0 97 0 0 48 0 16 : 10 3 3 3 438 336 204 : 5 15 15 15 105 10 0
5 39.9 25.7 15.2 19.1 : 0 34 0 13 0 0 68 0 0 48 0 16 : 0 3 3 3 442 353 195 : 167 15 15 15 90 0 0
6 35.6 25.7 13.5 25.1 : 0 10 0 37 0 0 48 0 0 26 0 0 : 3 10 0 7 445 343 191 : 15 5 167 8 59 0 0
7 38.3 31.4 12.5 17.8 : 0 0 0 10 0 0 73 0 0 0 0 0 : 0 7 0 7 445 340 201 : 167 8 167 8 30 0 0
8 38.0 26.7 14.2 21.1 : 0 0 0 0 0 0 0 0 0 0 0 0 : 3 7 3 3 435 346 201 : 15 8 15 15 0 0 0
9 39.9 20.1 18.5 21.5 : 0 0 0 116 0 0 18 0 0 58 0 0 : 0 17 7 0 425 353 198 : 167 3 8 167 8 112 0
10 48.5 15.2 17.2 19.1 : 0 0 0 238 32 0 82 0 0 113 0 70 : 0 0 3 7 435 353 201 : 167 167 15 8 46 261 0
11 36.6 27.4 17.5 18.5 : 0 0 0 0 0 0 0 0 0 0 0 0 : 3 13 7 10 425 353 188 : 15 4 8 5 0 0 0
(...)


pileup displays the alignment in a ‘pileup’ text format.
$(MAQ) pileup _reference.bfa _align.map > _pileup.txt

#chrom pos. ref depth base.on.read(@=same)
chrM 1 G 0 @
chrM 2 A 0 @
chrM 3 T 0 @
(...)
chrM 72 T 1 @,
chrM 73 G 1 @,
chrM 74 T 1 @,
chrM 75 G 1 @,
(...)
chrM 110 C 2 @,,
chrM 111 A 2 @GG
chrM 112 C 2 @,,
(..)


assemble creates a consensus sequences from read mapping.
$(MAQ) assemble _consensus.cns _reference.bfa _align.map 2> /dev/null


cns2snp extracts SNP sites from the consensus.
$(MAQ) cns2snp _consensus.cns > _snp.txt

##verticalized
$1 chromosome : chrM
$2 position : 91
$3 ref.base : C
$4 consensus.base : G
$5 phred.qual : 30
$6 read.depth : 1
$7 average number of hits of reads : 1.00
$8 highest mapping quality of the reads covering the position : 59
$9 minimum consensus quality in the 3bp flanking regions at each side of the site : 30
$10 second best call : N
$11 log likelihood ratio of the second best and the third best call : 29
$12 third best call : N
(...)


cns2view shows detailed information at all sites from the consensus.
$(MAQ) cns2view _consensus.cns _viewsnp.txt

##verticalized
$1 chromosome : chrM
$2 position : 1
$3 ref.base : G
$4 consensus.base : N
$5 phred.qual : 0
$6 read.depth : 0
$7 average number of hits of reads : 0.00
$8 highest mapping quality of the reads covering the position : 0
$9 minimum consensus quality in the 3bp flanking regions at each side of the site : 0
$10 second best call : N
$11 log likelihood ratio of the second best and the third best call : 0
$12 third best call : N
(...)


cns2fq extracts the consensus sequences in FASTQ format. "Bases in lower case are essentially repeats or do not have sufficient coverage; bases in upper case indicate regions where SNPs can be reliably called"
$(MAQ) cns2fq _consensus.cns > _cns2fq.txt


@chrM
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnntgtgcacgcgatagcattgggagacgcTggrGccggagcgccctatgtc
gcagtatctgnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnntcaATATT
ACAGGCGAACATACCTACTAAAAAGtgtnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnncagACATCATAACAAAAAATTTCCACCA
AAAACCccccnnnnnnnnnnnntggccacaacacttaaacacatctctgcaaaannnnnn
(...)
+
(...)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!???????????????????????????2EE8EBBBBBBBBBBBBBBBBB
BBBBBBBBBB!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!BHHHKKKK
KKKKKKKKKKKKKKKKKKKKKKKKE???!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
(...)

BWA


Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.

index indexes the chrM in the FASTA format:
$(BWA) index -a is -p _bwa.db chrM.fa


aln finds the suffix array coordinates of the input reads.
$(BWA) aln _bwa.db chrM.fastq > _bam.aln.sai


samse generates alignments in the SAM format given single-end reads.
$(BWA) samse _bwa.db _bam.aln.sai chrM.fastq > _bam.samse.sam

#_bam.samse.sam verticalized
(...)
$1 query : chrM|12602|40780
$2 flag : 0
$3 ref: chrM
$4 position : 12603
$5 quality : 37
$6 cigar.string : 30M
$7 mate.ref.sequence : *
$8 mate.pos : 0
$9 insert.size : 0
$10 query.seq : TCATCCCTGTAGCATTGTGCGTTACATGGT
$11 query.qual : kdfz=|srknznewhnkftyowhtmevnhz
$12 ? : XT:A:U ##cf. manual
$13 ? : NM:i:1
$14 ? : X0:i:1
$15 ? : X1:i:0
$16 ? : XM:i:1
$17 ? : XO:i:0
$18 ? : XG:i:0
$19 ? : MD:Z:18T11

(...)


SAM tools

The SAM format is a generic format for storing large nucleotide sequence alignments. SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. Sometimes this tool needs a tab-delimited file 'chrM.ref' containing the length of each chromosome:
chrM 16571


import converts the alignment into BAM (a binary/compact version of SAM)
$(SAM) import chrM.ref _bam.samse.sam _unsorted.bam


sort BAM alignments by leftmost coordinates
$(SAM) sort _unsorted.bam _sorted


sort prints the alignments:
$(SAM) view -h _sorted.bam _sam.view.txt

#verticalized result (BAM format):

$1 ? : chrM|8|2
$2 ? : 0
$3 ? : chrM
$4 ? : 9
$5 ? : 25
$6 ? : 30M
$7 ? : *
$8 ? : 0
$9 ? : 0
$10 ? : GTATATCACCCTATTAACCACTAACGGGAG
$11 ? : kzzst|{efokkdyirlqkflupvofumxe
$12 ? : XT:A:U
$13 ? : NM:i:2
$14 ? : X0:i:1
$15 ? : X1:i:0
$16 ? : XM:i:2
$17 ? : XO:i:0
$18 ? : XG:i:0
$19 ? : MD:Z:2C19C7


pileup prints the alignment in the pileup format.
$(SAM) pileup -c -f chrM.fa _sorted.bam > _sam.pileup.txt

#verticalized...
(...)
$1 chromosome name : chrM
$2 ref coordinate : 11
$3 reference base : C
$4 read base : A
$5 consensus quality : 60
$6 SNP quality : 60
$7 root mean square mapping quality : 25
$8 reads covering the position : 11
$9 read bases at a SNP line : AAAAAAAAA^:A^:A
$10 read quality : yj
(...)


That's it
Pierre

05 April 2010

The Alchemist: CSS pseudo 3D.

In the following post, I use a two-and-a-half-dimensional (2.5) technique to fake a three-dimensional (3D) visualization of Joseph Wright of Derby's painting The Alchemist in Search of the Philosopher's Stone This post was mostly inspired by Román Cortès' 3D Menimas and a post on Experiment Garden.

The Alchemist in Search of the Philosopher's Stone


The following image/script was successfully tested under Firefox 3.6. Move your mouse over the image

How it works

Each component of the collage is part of a larger transparent picture:
An array defines each sprite. One sprite is defined by its rectangular clip (x,y,width,height) in the original image, its position (x,y) in the collage and a value (ratio) reflecting the amount of pixels it should scroll
var picts=[
{
/* background */
img:null,
clip:[0,0,image_width,image_height],
ratio:0.0,
x:0,y:0
},
{
/* people */
img:null,
clip:[227,1,111,147],
ratio:0.1,
x:5,y:154
},
(...)
];
The HTML is defined as follow:
<div id='id9879709_main' style='overflow:hidden;position:relative;background-color:red;width:226px;height:320px;border:1px solid black;'>
<span style="position:relative;padding:3px;background-color:white;top:15px;left:15px;border:2px solid black;z-index:100; font-weight: bold;"><a target="_wp" href="http://commons.wikimedia.org/wiki/File:JosephWright-Alchemist.jpg">The Alchemist</a></span>
</div>
.When this html document is loaded, a <div> clipping each sprite is created an inserted into the document:
var main=document.getElementById('main');
main.onmousemove = callback;
for(var i in picts)
{
var pict=picts[i];
pict.img=document.createElement("div"); pict.img.setAttribute("style","position:absolute;top:"+pict.y+"px;left:"+pict.x+"px;width:"+pict.clip[2]+"px;height:"+pict.clip[3]+"px;z-index:"+i+";background:url(alchemy02.png) no-repeat scroll -"+pict.clip[0]+"px -"+pict.clip[1]+"px;");
main.appendChild(pict.img);
}
Lastly a javascript callback is called each time the mouse moves over the image. This method shifts each sprites according to the value of its 'ratio':
var v=document.getElementById('id9879709_main');
var left=v.offsetLeft;
var width=226.0;
var midX= left+width/2.0;
var maxshift=30;
var mouseX=e.clientX;
if(mouseX &gt; left+width) mouseX=left+width;
if(mouseX &lt; left) mouseX=left;
var shift= ((e.clientX-midX)/(width/2.0))*maxshift;
for(var i in picts)
{
var pict=picts[i];
pict.img.style.left=(pict.x+shift*pict.ratio)+"px";
}


That's it
Pierre

26 March 2010

'R' = dna.translate("AGG") . A custom C function for R, My notebook.


In the following post, I will show how I've implemented a custom C function for R. This C function will translate a DNA to a protein. I'm very new to 'R' so feel free to make any comment about the code.

C code


The data in 'R' are stored in an opaque structure named 'SEXP'. A custom C function receives some SEXP arguments and returns another SEXP. So, the declaration of my function translate_dna is:
SEXP translate_dna(SEXP sequences)
{
(...)
}

We check if the argument 'sequences' has a type='character'.
if(!isString(sequences))
error("argument is not a string[]");
The input consists of one or more DNA, so our return value will be an SEXP array of type 'character' (STRSXP) with a size equals to the number of DNAs. This array is allocated:
SEXP array_of_peptides = allocVector(STRSXP,length(sequences));
We loop over each DNA sequence
for(in i=0;i<length(sequences);++i) ...
and we allocate some memory for the peptide:
const char *dna=CHAR(STRING_ELT(sequences, i));
int dna_length=strlen(dna);
char* peptide =malloc(dna_length/3+1);
The peptide is then filled with its amino acids:
for(j=0;j+2 < dna_length;j+=3)
{
peptide[n++]=_translate(dna[j],dna[j+1],dna[j+2]);
}
peptide[n]=0;
And we put this new string/peptide in the returned value
SET_STRING_ELT(array_of_peptides,i,Rf_mkChar(peptide));
Here, I am not sure how 'R' handles its memory. Should I care about the way R runs its garbage manager ? how should I use the macros PROTECT and UNPROTECT ?

Compilation


The code 'translate.c' is compiled as a dynamic library ' libtranslate.so':

gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include translate.c
gcc -shared -Wl,-soname,libtranslate.so.1 -o libtranslate.so translate.o

R code

on the R side , the previous dynamic library is loaded:
dyn.load(paste("libtranslate", .Platform$dynlib.ext, sep=""))
A function dna.translate is declared: it forces the array to be an array of string and it invokes the C function 'translate_dna'
storage.mode(dna) <- "character"
.Call("translate_dna", dna)
We can now invoke the R function dna.translate:
peptides <- dna.translate(
c( "ATGGAGAGGCAGAAACGGAAGGCGGACATCGAGAAAGGGCTGCAGTTCATTCAGTCGACACTAC",
NULL,
"CCCAAAAGCAAGAAGAATATGAGGCCTTTCTGCT",
"CAAACTGGTGCAGAATCTGTTTGCTGAGGGCAATGA",
NULL,
"GGCAGATCAGGGAACTTCTAATGGATTGGGGTCC",
"GGATAACTGCACCTTCGCCTACCATCAGGAGGA",
"GGGTCCCAGGCAGCGCTGCCTGGGGGCTGGGGAG",
"TTGCGACAGGCTCCAGAAGGGCAAAGCCTGCCCAGAT",
"GCACCCCCCTCCTCTCCACCCTACCTTCCATCAACCA",
"AGG"
))

print(peptides)

Result:
[1] "MERQKRKADIEKGLQFIQSTL" "PKSKKNMRPFC" "QTGAESVC*GQ*"
[4] "GRSGNF*WIGV" "G*LHLRLPSGG" "GSQAALPGGWG"
[7] "LRQAPEGQSLPR" "APPSSPPYLPST" "R"


Full source code

:
C code:
#include <stdio.h>
#include <ctype.h>
#include <R.h>
#include <Rinternals.h>
/* the genetic code */
static const char* STANDARD_GENETIC_CODE="FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

static char _translate(char a,char b,char c);


/** translates DNA to protein
* @arg sequences one or more DNA sequence
* @return an array of peptides
*/
SEXP translate_dna(SEXP sequences)
{
int i;
SEXP array_of_peptides;
//check input has type= characters
if(!isString(sequences))
error("argument is not a string[]");
//prepare a new list for the translated sequence
array_of_peptides = allocVector(STRSXP,length(sequences));

//loop over the input
for(i=0;i< length(sequences);++i)
{
int j=0,n=0;
//transform the sequence into a ptr* of char
const char *dna=CHAR(STRING_ELT(sequences, i));
//get the length of this sequence
int dna_length=strlen(dna);
//alloc the protein sequence
char* peptide =malloc(dna_length/3+1);
if(peptide==NULL) error("out of memory");
//loop over the codons
for(j=0;j+2 < dna_length;j+=3)
{
//set the amino acid at 'n'
peptide[n++]=_translate(dna[j],dna[j+1],dna[j+2]);
}
//add EOS
peptide[n]=0;
//put a copy of this peptide in the vector
SET_STRING_ELT(array_of_peptides,i,Rf_mkChar(peptide));
//free our copy
free(peptide);
}
//return the array of petptide
return array_of_peptides;
}


static int base2index(char c)
{
switch(tolower(c))
{
case 't': return 0;
case 'c': return 1;
case 'a': return 2;
case 'g': return 3;
default: return -1;
}
}

static char _translate(char a,char b,char c)
{
int base1= base2index(a);
int base2= base2index(b);
int base3= base2index(c);
if(base1==-1 || base2==-1 || base3==-1)
{
return '?';
}
else
{
return STANDARD_GENETIC_CODE[base1*16+base2*4+base3];
}

}

Makefile:
run:translate.so
${R_HOME}/bin/R --no-save < translate.R

translate.so:translate.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include translate.c
gcc -shared -Wl,-soname,libtranslate.so.1 -o libtranslate.so translate.o

R code:
#load the dynamic library translate
dyn.load(paste("libtranslate", .Platform$dynlib.ext, sep=""))
#declare the function dna.translate
dna.translate <- function(dna)
{
#force dna to be type=character
storage.mode(dna) <- "character"
#call the C function
.Call("translate_dna", dna)
}

peptides <- dna.translate(
c( "ATGGAGAGGCAGAAACGGAAGGCGGACATCGAGAAAGGGCTGCAGTTCATTCAGTCGACACTAC",
NULL,
"CCCAAAAGCAAGAAGAATATGAGGCCTTTCTGCT",
"CAAACTGGTGCAGAATCTGTTTGCTGAGGGCAATGA",
NULL,
"GGCAGATCAGGGAACTTCTAATGGATTGGGGTCC",
"GGATAACTGCACCTTCGCCTACCATCAGGAGGA",
"GGGTCCCAGGCAGCGCTGCCTGGGGGCTGGGGAG",
"TTGCGACAGGCTCCAGAAGGGCAAAGCCTGCCCAGAT",
"GCACCCCCCTCCTCTCCACCCTACCTTCCATCAACCA",
"AGG"
))

print(peptides)


That's it

Pierre

25 March 2010

Playing with Biomart. My notebook.


Thanks to Neil, I've discovered that Biomart was not just a web interace, it also contains a powerful API: Citing http://www.biomart.org/martservice.html:
To submit a query using our webservices generate an XML document conforming to our Query XML syntax. This can be achieved simply by building up your query using MartView and hitting the XML button . This XML should be posted to http://(...)/martservice attached to a single parameter of query.

The base URL for Biomart is http://www.biomart.org/biomart.

Retrieving the Registry information for this Biomart installation: http://www.biomart.org/biomart/martservice?type=registry
<MartRegistry>
<MartURLLocation database="ensembl_mart_57" default="1" displayName="ENSEMBL GENES 57 (SANGER UK)" host="www.biomart.org" includeDatasets="" martUser="" name="ensembl" path="/biomart/martservice" port="80" serverVirtualSchema="default" visible="1" />
<MartURLLocation database="hapmart_27_36" default="1" displayName="HAPMAP 27 (NCBI US)" host="hapmap.ncbi.nlm.nih.gov" includeDatasets="" name="HapMap_rel27" path="/biomart/martservice" port="80" redirect="1" serverVirtualSchema="rel27_NCBI_Build36" visible="1" />
<MartURLLocation database="vega_mart_57" default="0" displayName="VEGA 37 (SANGER UK)" host="www.biomart.org" includeDatasets="" martUser="" name="vega" path="/biomart/martservice" port="80" serverVirtualSchema="default" visible="1" />
<MartURLLocation database="genomic_features_mart_57" default="0" displayName="ENSEMBL GENOMIC FEATURES 57 (SANGER UK)" host="www.biomart.org" includeDatasets="" martUser="" name="genomic_features" path="/biomart/martservice" port="80" serverVirtualSchema="default" visible="0" />
(...)
</MartRegistry>


Retrieving the datasets available for the registry HapMap_rel27: http://www.biomart.org/biomart/martservice?type=datasets&mart=HapMap_rel27 :
TableSet hm27_variation_yri HapMap Population YRI - Yoruba in Ibadan, Nigeria (West Africa) 1 20050000 default 2009-10-28 11:19:52
TableSet hm27_variation_tsi HapMap Population TSI - Toscans in Italy 1 200 50000 default 2009-10-28 11:13:07
TableSet hm27_gene hm27_gene 0 200 50000 default 2009-09-17 12:04:20
TableSet hm27_variation_chd HapMap Population CHD - Chinese in Metropolitan Denver, Colorado 1 200 50000 default 2009-10-28 11:07:01
TableSet hm27_variation All Populations 1 200 50000 default 2009-10-28 11:22:21
TableSet hm27_variation_chb HapMap Population CHB - Han Chinese in Beijing, China 1 200 50000 default 2009-10-28 11:05:57
TableSet hm27_variation_asw HapMap Population ASW - African ancestry in Southwest USA 1 20050000 default 2009-10-28 08:53:43
TableSet hm27_variation_ceu HapMap Population CEU - Utah residents with Northern and Western European ancestry from the CEPH collection 1 200 50000 default 2009-10-28 11:04:39
TableSet hm27_variation_mkk HapMap Population MKK - Maasai in Kinyawa, Kenya 1 200 50000 default 2009-10-28 11:12:12
(...)

Where can I find a description of those columns ?


Retrieving the attributes for the dataset hm27_variation_ceu: http://www.biomart.org/biomart/martservice?type=attributes&dataset=hm27_variation_ceu :
chrom chromosome naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main chrom
start position naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main start
strand strand naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main strand
marker1 marker id naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main marker1
alleles alleles naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main alleles
refallele reference allele naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main refallele
center genotyping center naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__maincenter
platform genotyping platform naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main platform
ceu_id CEU genotype naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__dm genotype
ref_allele reference allele naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main ref_allele
ref_allele_freq reference allele frequency naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main ref_allele_freq
ref_allele_count reference allele count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main ref_allele_count
other_allele other allele naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__mainother_allele
other_allele_freq other allele frequency naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main other_allele_freq
other_allele_count other allele count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main other_allele_count
total_allele_count total allele count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main total_allele_count
refhom_gt reference homozygote genotype naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main refhom_gt
het_gt heterozygote genotype naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__mainhet_gt
otherhom_gt other homozygote genotype naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main otherhom_gt
refhom_gtfreq reference homozygote genotype frequency naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main refhom_gtfreq
het_gtfreq heterozygote genotype frequency naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main het_gtfreq
otherhom_gtfreq other homozygote genotype frequency naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main otherhom_gtfreq
refhom_gtcount reference homozygote genotype count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main refhom_gtcount
het_gtcount heterozygote genotype count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main het_gtcount
otherhom_gtcount other homozygote genotype count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main otherhom_gtcount
total_gtcount total genotype count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main total_gtcount
prot_lsid protocol LSID naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__assay__dm prot_lsid
assay_id assay LSID naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__assay__dm assay_lsid

(Where can I find a description of those columns ?). Each attribute is a specific column in the final result. Example of attributes available for the dataset hm27_variation_ceu:
  • chrom
  • start
  • marker1
  • alleles
  • (...)


Retrieving the filters for the dataset hm27_variation_ceu: http://www.biomart.org/biomart/martservice?type=filters&dataset=hm27_variation_ceu :
allele_freq Minor Allele Frequency [>=] [0.0,0.01,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5] naive_filters list >= hm27_variation_ceu__variation__main avg_dprime
center Genotyping Center [affymetrix,bcm,illumina,imsut-riken,mcgill-gqic,perlegen,ucsf-wu,broad,sanger] naive_filters list = hm27_variation_ceu__variation__main center
platform Genotyping Platform [AFFY_100K,AFFY_500K,BeadArray,FP-TDI,Infinium_H1,Infinium_H300,Invader,MIP,Perlegen,AFFY_6.0,Illumina_1M] naive_filters list = hm27_variation_ceu__variation__main platform
monomorphic Monomorphic SNPs [,0] naive_filters boolean_num =only,excluded hm27_variation_ceu__variation__main is_reference_bool
marker_name List of HapMap SNPs [] naive_filters = hm27_variation_ceu__variation__mainmarker1
utr3 3' UTR [] naive_filters boolean_list only,excluded hm27_variation_ceu__variation__main is_utr_bool
utr5 5' UTR [] naive_filters boolean_list only,excluded hm27_variation_ceu__variation__main is_5utr_bool
coding_synon Exons - synonymous coding SNPs [] naive_filters boolean_list only,excluded hm27_variation_ceu__variation__main is_coding_synon_bool
coding_nonsynon Exons - non-synonymous coding SNPs [] naive_filters boolean_list only,excluded hm27_variation_ceu__variation__main is_coding_nonsynon_bool
intronic introns [] naive_filters boolean_list only,excluded hm27_variation_ceu__variation__mainis_intronic_bool
chrom chromosome [chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY] naive_filters list = hm27_variation_ceu__variation__main chrom
start From position [] naive_filters text >= hm27_variation_ceu__variation__main start
stop To position [] naive_filters text <= hm27_variation_ceu__variation__main stop
gene_list List of Genes [] naive_filters =,in pointer dataset gene_primary_symbol
encode_region ENCODE region [ENm010: 7:26924045..27424045,ENm013: 7:89621624..90736048,ENm014: 7:125865891..127029088,ENr112: 2:51512208..52012208,ENr113: 4:118466103..118966103,ENr123: 12:38626476:39126476,ENr131: 2:234156563..234656627,ENr213: 18:23719231..24219231,ENr232: 9:130725122..131225122,ENr321: 8:118882220..119382220] naive_filters list = pointer dataset glook_encode_region
(Where can I find a description of those columns ?) Here we can find some filters for hm27_variation_ceu (e.g.: allele_freq)

Retrieving the configurations for the dataset hm27_variation_ceu: http://www.biomart.org/biomart/martservice?type=configuration&dataset=hm27_variation_ceu :
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DatasetConfig>
<DatasetConfig dataset="hm27_variation_ceu" datasetID="2" displayName="HapMap Population CEU - Utah residents with Northern and Western European ancestry from the CEPH collection" interfaces="default" internalName="default" martUsers="default" modified="2009-10-28 11:04:39" softwareVersion="0.6" template="hm27_variation_ceu" type="TableSet" visible="1">
<MainTable>hm27_variation_ceu__variation__main</MainTable>
<Key>marker_id_key</Key>
<Importable filters="chrom,start,stop" internalName="encode_filter" linkName="encode_filter" name="encode_filter" type="link"/>
<Importable displayName="gene_filter_35" filters="chrom,start,stop" internalName="gene_filter_35" linkName="gene_filter_35" name="gene_filter_35" type="link"/>
<FilterPage displayName="FILTERS" internalName="naive_filters">
<FilterGroup displayName="FILTERS" internalName="filters">
<FilterCollection displayName="MINOR ALLELE FREQUENCY [&gt;=]" internalName="pop_filter">
<FilterDescription displayName="Minor Allele Frequency [&gt;=]" displayType="list" field="avg_dprime" internalName="allele_freq" key="marker_id_key" legal_qualifiers="&gt;=" qualifier="&gt;=" style="menu" tableConstraint="main" type="list">
<Option displayName="0.0" internalName="0.0" isSelectable="true" value="0.0"/>
<Option displayName="0.01" internalName="0.01" isSelectable="true" value="0.01"/>
<Option displayName="0.05" internalName="0.05" isSelectable="true" value="0.05"/>
<Option displayName="0.1" internalName="0.1" isSelectable="true" value="0.1"/>
<Option displayName="0.15" internalName="0.15" isSelectable="true" value="0.15"/>
<Option displayName="0.2" internalName="0.2" isSelectable="true" value="0.2"/>
<Option displayName="0.25" internalName="0.25" isSelectable="true" value="0.25"/>
<Option displayName="0.3" internalName="0.3" isSelectable="true" value="0.3"/>
<Option displayName="0.35" internalName="0.35" isSelectable="true" value="0.35"/>
(...)
As far as I understand , this file is used to 'group' the filters.

At the end, here is Query sent to Biomart:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6">
<Dataset name="hm27_variation_ceu" interface="default">
<Filter name="stop" value="1000000"/>
<Filter name="allele_freq" value="0.3"/>
<Filter name="chrom" value="chr1"/>
<Filter name="start" value="0"/>
<Attribute name="chrom"/>
<Attribute name="start"/>
<Attribute name="strand"/>
<Attribute name="marker1"/>
<Attribute name="alleles"/>
<Attribute name="ceu_id"/>
</Dataset>
</Query>
and we can send it with curl:
curl --data 'query=<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query><Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6" ><Dataset name="hm27_variation_ceu" interface="default" ><Filter name="stop" value="1000000"/><Filter name="allele_freq" value="0.3"/><Filter name="chrom" value="chr1"/><Filter name="start" value="0"/><Attribute name="chrom" /><Attribute name="start" /><Attribute name="strand" /><Attribute name="marker1" /><Attribute name="alleles" /><Attribute name="ceu_id" /></Dataset></Query>' http://www.biomart.org/biomart/martservice


Response from the Biomart server:
chr1 843817 + rs1806509 A/C AC AA AC AC AC CC AC CC AC CC AC NN AC CC AA NN AC AC NN CC AC CC AC AC AA AC AC AC CC AA AC AC AC AC AA CC AC AC AA AA AA AA NN AA AA AA AC AA NN AA NN AC AC CC AA AA AC AC AC AC AA AA AA AC AA AA CC CC AC AA AC AC AC AA CC AA AC CC AA NN CC CC AA AC AA AC AA AA AC AC AC AC AA AA NN AA AA CC AC AC AA AA AA AC AC AC AA AC AA AC AC AA AA AC AA CC AC AC AC AC AA AA AA CC AC CC AC NN AC AC AA AA CC AA AC AC AC AA AC AA AC AC AC AA AC CC AA AA CC AA AA AC AC AC AC AA CC AA AC CC CC AC AC AA CC AA AC AC AC AA AA CC AA AA
chr1 881808 + rs13303106 A/G AG GG AG GG GG AG AG AG GG GG GG NN AA AG GG NN GG GG NN GG AG AG GG AG GG GG AG GG AA GG AG GG AG AA GG AG GG AG GG GG GG GG NN GG GG GG AA AG NN GG NN AG AG AA GG AG AG AA AG AG GG GG GG AG GG GG AG AA GG GG AG GG AG GG AG GG AG AG GG NN AA GG GG AG GG AG AG AG AG AG GG AG GG GG NN GG GG AG AG AG AG GG AG AG GG GG GG AG GG AG AG GG AG AG AG AA GG AG AG AG GG GG GG AG AG AA AG NN AG AG GG GG AA GG GG GG AA GG GG AG AG AG AG GG AG AA GG GG AA GG GG AG AG AG GG GG AA GG GG AA AA AG AG AG AA GG AA AG AG GG AG AA GG GG
chr1 906697 + rs6694632 A/G AG AA AA AG AA AG AG AG AA AA AG NN GG AA GG NN AA AA NN AA AG AG AA AG AA AA AG AA GG GG GG AG AA GG AA AA AA AG AG AG AG AA NN AA AA AA AG AG NN AA NN AG AG AG AG AG AG GG AG AG AA AA AA AG AG AG AA GG AA AA AA GG GG AA AG GG AA AG AA NN AG AA AG AG AA AG GG GG AA GG AG AG AG AG NN AA AG AG AG AG AG AA AA GG AA AA AA AA AA AG AG AA AG AG AG AA AG AA AG AG AA AA GG AG AG GG AG NN AG AG GG AG AA AG AG AA AG AA AA AA AG AA AG AA AA GG AA AA GG AA AA AG AG AG AA AA GG AA AA GG GG AG AG AG AG AG GG AA AG AG AG AG AG AG
chr1 908247 + rs13303118 G/T GT TT TT GT TT GT GT GT TT TT GT NN GG TT GG NN TT TT NN TT GT GT TT GT TT TT GT TT GG GG GG GT TT GG TT TT TT GT GT GT GT TT NN TT TT TT GT GT NN TT NN NN GT GT GT GT GT GG GT GT TT TT TT GT GT GT TT GG TT TT TT GG GG TT GT GG TT GT TT NN GT TT GT GT TT GT GG GG TT GG GT GT GT GT NN TT GT GT GT GT GT TT TT GG TT TT TT TT TT GT GT TT GT GT GT TT GT TT GT GT TT TT GG GT GT GG GT NN GT GT GG GT TT GT GT TT GT TT TT TT GT TT GT TT TT GG TT TT GG TT TT GT GT GT TT TT GG TT TT GG GG GT GT GT GT GT GG TT GT GT GT GT GT GT
chr1 908436 + rs2341354 A/G AG GG GG AG GG AG AG AG GG GG AG NN AA GG AA NN GG GG NN GG AG AG GG AG GG GG AG GG AA AA AA AG GG AA GG GG GG AG AG AG AG GG NN GG GG GG AG AG NN GG NN AG AG AG AG AG AG AA AG AG GG GG GG AG AG AG AG AA GG GG GG AA AA GG AG AA GG AG GG NN AG GG AG AG GG AG AA AA GG AA AG AG AG AG NN GG AG AG AG AG AG GG GG AA GG GG GG GG GG AG AG GG AG AG AG GG AG GG AG AG GG GG AA AG AG AA AG NN AG AG AA AG GG AG AG GG AG GG GG GG AG GG AG GG GG AA GG GG AA GG GG AG AG AG GG GG AA GG GG AA AA AG AG AG AG AG AA GG AG AG AG AG AG AG
chr1 934427 + rs3128117 C/T TT TT CT CC TT CT CC CT TT TT CT NN CT CT CC NN TT TT NN TT TT CT TT CC TT TT CT TT CC CC TT TT TT CT TT TT CT CC CT CT CT TT NN TT CT TT CT CT NN TT NN CT CC TT CT CT CT CC CT CT TT CT TT CT CT CT TT CT TT TT CT CC CT TT CT CC CT CT TT NN CT TT CT CT CT CT CC CC TT CC CT TT TT CC NN TT TT CT CC CT TT TT TT CT CT TT TT TT TT TT TT TT CT CT CT CT CT CT CC CT CT TT CC CT CT CT CT NN CC CC CC CT CT CT CC TT CT TT TT TT CT TT CC CT TT CC CT TT CC TT TT CT CC CT TT TT CT TT TT CT CC TT CT TT CT CT CT CT TT CT TT TT CT CT
chr1 940106 + rs1891906 A/C AA AA AC CC AA AC AC AC AA AA AC NN AC AC CC NN AA AA NN AA AA AC AA CC AA AA AC AA CC CC AA AA AA AC AA AA AC CC AC AC AC AA NN AA AC AA AC AC NN AA NN AC CC AA AC AC AC CC AC AC AA AC AA AC AC AC AA AC AA AA AC CC AC AA AC AC AC AC AA NN AC AA AC AC AC AC CC CC AA NN AC AA AA CC NN AA AA AC CC AC AA AA AA AC AC AA AA AA AA AA AA AA AC AC AC AC AC AC CC AC AC AA CC AC AC AC AC NN CC CC AC AC AC AC CC AA AC AA AA AA AC AA CC AC AA CC AC AA CC AA AA AC CC AC AA AA AC AA AA AC CC AA AC AA AC AC AC AC AA AC AA AA AC AC
chr1 949705 + rs2710888 C/T CT CC CT CT CC CT CT CT CC CC CT NN CT CT TT NN CC CC NN CC CC CC NN CT CC CC CT CC TT CT CC CC CT CC CC CC CT TT CT CT CT CC NN CC CC CT TT CT NN CC NN CT TT CC CT CC CT TT CT CT CT CC CC CT CC CT CC CT CC CC CT CT CT CT CT CC CT CT CC NN CT CC CT CT CT CT CT TT CT TT CC CC CC TT NN CC CC CT TT CC CC CC CT CT CT CC CC CT CC CC CT CC CT CT CT TT CC CT TT CT CT CC CT CT CT CT CC NN CC TT CT CT CT CT CT CC CT CC CC CC CT CT TT CT TT TT CT CC TT CC CC CT TT CT CC CC CT CC CC CT TT CC CT CC CT CT CT CT CT CC CT CT CC CT
chr1 952073 + rs3128126 A/G NN AA NN NN AA AG AG NN NN AA NN AG AG AG NN AG NN NN AG NN AG AA AG NN NN AA NN AA NN AG AA AA NN NN AA AG NN NN NN AG AG NN AA NN NN AA AG AG AA AA AG AG GG NN NN AA AG GG AA AG AA NN AG AG NN NN NN NN NN NN NN NN NN NN AG AG AG AG AA AG GG AA AG AG NN AG AG GG AG GG AA AA AA GG GG AA AG AG GG NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN GG NN AG GG NN NN AG NN NN AG AG AA NN AA AA AA AG NN NN NN NN NN NN AA GG AA AA AG GG NN NN NN NN NN NN NN NN NN GG AA AG GG AG AG NN AA NN NN AA AG
chr1 952469 + rs4970393 A/G AG AA AG GG AA AG AG AG AA AA AG NN AG AG AG NN AA AA NN AA AG AA AG AG AA AA AG AA GG GG AA AA AA GG AA AA AG GG AG AG AG AA NN AA AG AA AG AG NN AA NN AG GG AG AG AG AG GG AA AG AA AG AA AG AG AG AA GG AA AA AG AG AG AG AG AG AG AG AA NN AG AA AG AG AG AG GG NN AG GG AG AA AA GG NN AA AA AG GG AG AG AA AA AG AG AA AA AG AA AA AG AA AG AG AG AG AG AG GG AG AG AG NN AG AG AG AA NN AG NN GG AG AG AG GG AA AG AA AA AA AG AG GG AG GG GG AG AA GG AA AA AG GG AG AA AA AG AA AA GG AG AG AG AG AG AG NN AG AA AG AA AA AG AG



Note: I've also tried to use the SOAP/WSDL Biomart API, but as it was discussed on Biostar, the API does not work.


That's it

Pierre

24 March 2010

Learning DesignPatterns: The Factory Method Pattern

"The Factory method pattern defines a separate method for creating the objects, which subclasses can then override to specify the derived type of product that will be created"

Example

/** Abstract class for melting temperature calculation */
abstract class TmCalculator
{
public abstract double calcTm(CharSequence s);
}

/** the factory creates two kinds of TmCalculator
1) 2AT4GC
2) Nearest-Neighbor
*/
public class TmCalculatorFactory
{
public TmCalculator create2AT4GC()
{
return new TmCalculator()
{
@Override
public double calcTm(CharSequence s)
{
(...implements ... )
}
};
}

public TmCalculator createNearestNeighbor()
{
return new TmCalculator()
{
@Override
public double calcTm(CharSequence s)
{
(...implements ... )
}
};
}
}


That's it

Pierre