YOKOFAKUN

23 April 2010

Short post: Plain text vs Binary data.

This week, I was asked the difference between storing some data using a plain text format and a binary format. I wrote the following code C++ to illustrate how to store some genotypes in a ( possibly huge )table saved as a binary file. The first bytes of this file tells the number of individuals, the number of markers, the name of the individuals. Then for each marker, we get the name of the marker, its position and an array of genotypes for each individual.

The first invocation of this program (-p write ) creates a random table and a second call (-p read ) answers the genotypes for some random individuals/markers.

That's it.

Pierre

20 April 2010

Reading/Writing SAM/BAM files with the picard API for java.

I'm currently playing with the BAM/SAM Java API available in the picard package. This API can be used for creating new programs that read and write SAM files, the generic format for storing large nucleotide sequence alignments. Both SAM text format and SAM binary (BAM) format are supported. The source code comes with a simple example showing how to read and write a SAM/BAM file.

(...)
public void convertReadNamesToUpperCase(
final File inputSamOrBamFile,
final File outputSamOrBamFile
)
{
final SAMFileReader inputSam = new SAMFileReader(inputSamOrBamFile);
final SAMFileWriter outputSam = new SAMFileWriterFactory().makeSAMOrBAMWriter(inputSam.getFileHeader(),
true, outputSamOrBamFile);

for (final SAMRecord samRecord : inputSam) {
// Convert read name to upper case.
samRecord.setReadName(samRecord.getReadName().toUpperCase());
outputSam.addAlignment(samRecord);
}
outputSam.close();
inputSam.close();
}
(...)

toay, I've played with this SAM API to build two simple java tools:

SbamGrep

SbamGrep is available at http://code.google.com/p/code915/wiki/SbamGrep. It filters a SAM/BAM file using the SAM flags.

Option

 -o (filename-out) or default is stdout SAM
 -v inverse selection
 -f [flag] add bam/sam flag for filtering. multiple separeted with a comma. One of:
        READ_PAIRED or 1 or 0x1
        PROPER_PAIR_ or 2 or 0x2
        READ_UNMAPPED or 4 or 0x4
        MATE_UNMAPPED or 8 or 0x8
        READ_STRAND or 16 or 0x10
        MATE_STRAND or 32 or 0x20
        FIRST_OF_PAIR or 64 or 0x40
        SECOND_OF_PAIR or 128 or 0x80
        NOT_PRIMARY_ALIGNMENT or 256 or 0x100
        READ_FAILS_VENDOR_QUALITY_CHECK or 512 or 0x200
        DUPLICATE_READ or 1024 or 0x400

Example

The following command lists all the reads having a flag (READ_PAIRED and READ_UNMAPPED and MATE_UNMAPPED and FIRST_OF_PAIR):

java -cp sam-1.16.jar:dist/sbamgrep.jar fr.inserm.umr915.sbamtools.SbamGrep -f 0x4,0x1,0x8,0x40 file.bam

Result (masked):

@HD VN:1.0 SO:unsorted
@SQ SN:chrT LN:349250
IL_XXXX1 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (94**0-)*7=0688855555522@86;;;5;6:;63:4?-622647..-.5.%
IL_XXXX2 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9*+*2396,@5+5:@@@;;5)50)6960684;58;86*5102)0*+8:*137;
IL_XXXX3 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (/999-00328:88984@@=8??@@:-66,;8;5;6+;255,1;788883676'
IL_XXXX4 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (916928.82@@50854;33222224;@25?5522;5=;;858888555*0666
IL_XXXX5 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9*4*5-**32989+::@82;;853+39;80.53)-)79)..'55.8988*200
IL_XXXX6 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (*+**,14265;@@??@8?9@@@5@488?8666260.)-*9;;;88:8'05418
IL_XXXX7 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9136242-2@@@;96.888@@@@80$585882623.':**+3*03137..--.
IL_XXXX8 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (/89255855@88?585557..())@@@;5552286526755@@5888..3;/$
(...)

SbamStats

The second tool, SbamStats is available at http://code.google.com/p/code915/wiki/SbamStats. It provides a simple report about all the SAM flags used in one or more files.

Example

java -cp sam-1.16.jar:dist/sbamstats.jar fr.inserm.umr915.sbamtools.SBamStats file.bam

Output:

That's it !
Pierre

19 April 2010

A stateful C function for R: parsing Fasta sequences

In the following post, I'll create a C extension for R. This extension will iterate over all the FASTA sequences in a file and will return a pair(name,sequence) for each sequence, that is to say that I won't store all the sequences in memory.

The C code

The structure FastaHandler holds the state of the fasta parser. It contains a pointer to the FILE, the current sequence and fasta header. We also need to save the previous line.

/** stateful structure for the fastafile */
typedef struct fastaHandler_t
{
/** input file */
FILE* in;
/** save the previous line, it will contains the next fasta header */
char* previous_line;
/** sequence name */
char *seq_name;
/** sequence dna */
char *seq_dna;
/** sequence_length */
int seq_length;
}FastaHandler,*FastaHandlerPtr;

The function fasta_open opens the fasta file and initialize a new FastaHandlerPtr. This pointer is then persisted into a 'R' variable using R_MakeExternalPtr.

/**
* open a fasta file, init the FastaHandler
*/
SEXP fasta_open(SEXP filename)
{
FastaHandlerPtr handle=NULL;
if(!isString(filename)) error("filename is not a string");
if(length(filename)!=1) error("expected only one filename");

handle=(FastaHandlerPtr)calloc(1,sizeof(FastaHandler));
if(handle==NULL)
{
error("Cannot alloc FastaHandler");
}
const char* c_filename=CHAR(STRING_ELT(filename,0));
errno=0;
handle->in= fopen( c_filename,"r");
if(handle->in==NULL)
{
error("Cannot open \"%s\": %s",c_filename,strerror(errno));
}
/** the handle is bound a R variable */
return R_MakeExternalPtr(handle, R_NilValue, R_NilValue);
}

We also need a function to close the structure. The pointer to the FastaHandlerPtr is retrieved using R_ExternalPtrAddr(R_variable):

/**
* close the FastaHandlerPtr
*/
SEXP fasta_close(SEXP r_handle)
{
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in!=NULL)
{
fclose(handle->in);
handle->in=NULL;
}
if(handle->previous_line!=NULL)
{
free(handle->previous_line);
handle->previous_line=NULL;
}
free(handle->seq_name);
handle->seq_name=NULL;
free(handle->seq_dna);
handle->seq_dna=NULL;
R_ClearExternalPtr(r_handle);
return ScalarInteger(0);
}

The function fasta_next return the next pair(name,sequence):

/**
* read the next fasta sequence
*/
SEXP fasta_next(SEXP r_handle)
{
int count=0;
char* ptr=NULL;
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in==NULL) error("handle->in==NULL");

while((ptr=readLine(handle))!=NULL)
{
if(ptr[0]=='>')
{
//this is the header of a fasta sequence
if(handle->seq_name!=NULL || handle->seq_dna!=NULL)
{
handle->previous_line=ptr;
SEXP return_value=NULL;
//if there a sequence in memory ?
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
//create the sequence
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
//return the sequence if any
if(return_value!=NULL) return return_value;
}
//cleanup
handle->seq_name=ptr;
handle->seq_length=0;
handle->seq_dna=NULL;
}
else
{
int j=0;
int len=0;
//remove blank characters
while(ptr[j]!=0)
{
if(!isspace(ptr[j]))
{
ptr[len++]=ptr[j];
}
++j;
}
//enlarge the dna sequence
handle->seq_dna=realloc(
handle->seq_dna,
sizeof(char)*(handle->seq_length+1+len)
);
if(handle->seq_dna==NULL) error("cannot realloc seq_dna");
//append the new line
memcpy(&handle->seq_dna[handle->seq_length],ptr,sizeof(char)*len);
handle->seq_length+=len;
handle->seq_dna[handle->seq_length]=0;
free(ptr);
ptr=NULL;
}
}
//last sequence
SEXP return_value=NULL;
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
return (return_value==NULL?R_NilValue:return_value);
}

This function calls readLine reading the next non-empty line:

/**
* return the next non empty line from a FastaHandlerPtr
* should be free()
*/
static char* readLine(FastaHandlerPtr handle)
{
int c;
char* ptr=NULL;
int length=0;
/* if there was a saved line, return it */
if(handle->previous_line!=NULL)
{
ptr=handle->previous_line;
handle->previous_line=NULL;
return ptr;
}
/** while the line is empty */
while(length==0)
{
int capacity=0;
//no more to read ?
if(feof(handle->in)) return NULL;
ptr=NULL;
//read each char
while((c=fgetc(handle->in))!=EOF)
{
//if this is an EOL, break
if(c=='\n') break;
//enlarge the buffer if it is too small
if(ptr==NULL || (length+2)>=capacity)
{
capacity+=500;
ptr=(char*)realloc(ptr,capacity*sizeof(char));
if(ptr==NULL) error("cannot realloc to %d bytes",capacity);
}
//append the char to the buffer
ptr[(length)++]=c;
}
//nothing was read (empty line) ? continue
if(length==0)
{
free(ptr);
ptr=NULL;
}
}
//add a end-of-string char
ptr[length]=0;
//return the line
return ptr;
}

It also invokes the function make_sequence(name,dna). This function creates a new R pair(name,sequence) :

/** create a pair list containing the name and the dna */
static SEXP make_sequence(const char* seq_name,const char* length)
{
SEXP values,names;
//the R value contains two objects
PROTECT(values = allocVector(VECSXP, 2));
//first item is the sequence name
SET_VECTOR_ELT(values, 0, mkString(seq_name));
//second item is the sequence dna
SET_VECTOR_ELT(values, 1, mkString(length));

//the labels
PROTECT(names = allocVector(STRSXP, 2));
//first label is the name
SET_STRING_ELT(names, 0, mkChar("name"));
//2nd label is the name
SET_STRING_ELT(names, 1, mkChar("sequence"));

//bind the labels to the values
setAttrib(values, R_NamesSymbol, names);
UNPROTECT(2);
return values;
}

The 'R' code

The 'C' dynamic library is loaded and each C function is bound to R using ".Call":

Compiling

Here is my makefile:

run:libfasta.so rs_chLG1.fas
${R_HOME}/bin/R --no-save < fasta.R

rs_chLG1.fas:
wget -O [email protected] ftp://ftp.ncbi.nih.gov/snp/organisms/bee_7460/rs_fasta/rs_chLG1.fas.gz
gunzip [email protected]

libfasta.so:fasta.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include fasta.c
gcc -shared -Wl,-soname,fasta.so.1 -o $@ fasta.o

Testing

I've downloaded some fasta sequences from dbSNP/Apis Mellifera. This file is open with fasta.open, we loop over each sequence and we print those sequences having a length lower than 300bp, at the end we close the stream with fasta.close:

f<-fasta.open("rs_chLG1.fas")

while(!is.null(snp <-fasta.next(f)))
{
if(nchar(as.character(snp$sequence))< 300)
print(snp)
}

fasta.close(f)

Result:

$name
[1] "gnl|dbSNP|rs44106732 rs=44106732|pos=48|len=298|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "GAAAATCAAAGACAATTTTTGGAATGGAACTAAATAACTTTATTCTTYCTTTCTTTCTCGTCGAGAATATCGTTATCGTTGCATGGGTTATGGAATAGCGTTCGTTAAAAATGTTATATTTCGAGGAAATATCGAAGATAGGCTTTGCGAAAGTCTGTTTCTCTAGAATTAAGATTTATATGTGTTGCAAGGGGAAGTTCAAAGAGAAATCGTGGCCAGTTCGAACATTATTATGTCTATCAATGATCGAGAAGTGTCATTGATGCAAGAGAAAAGTTTTCTCTTGCATTTAAGTATC"

$name
[1] "gnl|dbSNP|rs44108824 rs=44108824|pos=251|len=268|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "GTTAAATGGGAATTTTGGGGATTATTGGAGGAGGATTTACGTTTCGAGGATTGTTGATGATCTTAGGATTGTTTTCAGTTTGGAATTTTTTCTTCTTCTTCAACGTTGTACAACATTCTCCTCGAATTTTGTGTAACGAGGAGGATTAAACCTTTGGAAAATCACGTAAATTAGAGGACGATATATTGGGTTGGCAACTAAGTAATTGCGGATTTTTTTTAGAAAATCAAAGACAATTTTTGGAATGGAAYTAAATAACTTTATTCTT"

$name
[1] "gnl|dbSNP|rs44150807 rs=44150807|pos=251|len=274|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "TTCGATCTTCGTTTCACGCTCACGTTTCACGTTTCTCGTCCGGCGTAACGAACGGTATTCCGCTGATCACGAATAATTTCTTTCTGGAGAGTCCATTAGGGGACCGTCCCCTCCCCCCTCTCGCGCACACAGACACCCATCGCTTTCGACGCCTCGTCCATTCGAGGGAGAAACGAACGACTATTAGAAAAAAATCTTCTTTATATCTCTATAAATTCAATTTGCAGAGAAGCAAAGAGCTTTAAAATATYAACCATTATAACCGAACTTGTTG"

(....)

Full Source code

Makefile

fasta.c

#include <ctype.h>
#include <errno.h>
#include <R.h>
#include <Rinternals.h>

/** stateful structure for the fastafile */
typedef struct fastaHandler_t
{
/** input file */
FILE* in;
/** save previous line, it will contains the next fasta header */
char* previous_line;
/** sequence name */
char *seq_name;
/** sequence dna */
char *seq_dna;
/** sequence_length */
int seq_length;

}FastaHandler,*FastaHandlerPtr;

/**
* return the next non empty line from a FastaHandlerPtr
* should be free()
*/
static char* readLine(FastaHandlerPtr handle)
{
int c;
char* ptr=NULL;
int length=0;
/* if there was a saved line, return it */
if(handle->previous_line!=NULL)
{
ptr=handle->previous_line;
handle->previous_line=NULL;
return ptr;
}
/** while the line is empty */
while(length==0)
{
int capacity=0;
//no more to read ?
if(feof(handle->in)) return NULL;
ptr=NULL;
//read each char
while((c=fgetc(handle->in))!=EOF)
{
//if this is an EOL, break
if(c=='\n') break;
//enlarge the buffer if it is too small
if(ptr==NULL || (length+2)>=capacity)
{
capacity+=500;
ptr=(char*)realloc(ptr,capacity*sizeof(char));
if(ptr==NULL) error("cannot realloc to %d bytes",capacity);
}
//append the char to the buffer
ptr[(length)++]=c;
}
//nothing was read (empty line) ? continue
if(length==0)
{
free(ptr);
ptr=NULL;
}
}
//add a end-of-string char
ptr[length]=0;
//return the line
return ptr;
}

/**
* open a fasta file, init the FastaHandler
*/
SEXP fasta_open(SEXP filename)
{
FastaHandlerPtr handle=NULL;
if(!isString(filename)) error("filename is not a string");
if(length(filename)!=1) error("expected only one filename");

handle=(FastaHandlerPtr)calloc(1,sizeof(FastaHandler));
if(handle==NULL)
{
error("Cannot alloc FastaHandler");
}
const char* c_filename=CHAR(STRING_ELT(filename,0));
errno=0;
handle->in= fopen( c_filename,"r");
if(handle->in==NULL)
{
error("Cannot open \"%s\": %s",c_filename,strerror(errno));
}
/** the handle is bound a R variable */
return R_MakeExternalPtr(handle, R_NilValue, R_NilValue);
}

/** create a pair list containing the name and the dna */
static SEXP make_sequence(const char* seq_name,const char* length)
{
SEXP values,names;
//the R value contains two objects
PROTECT(values = allocVector(VECSXP, 2));
//first item is the sequence name
SET_VECTOR_ELT(values, 0, mkString(seq_name));
//second item is the sequence dna
SET_VECTOR_ELT(values, 1, mkString(length));

//the labels
PROTECT(names = allocVector(STRSXP, 2));
//first label is the name
SET_STRING_ELT(names, 0, mkChar("name"));
//2nd label is the name
SET_STRING_ELT(names, 1, mkChar("sequence"));

//bind the labels to the values
setAttrib(values, R_NamesSymbol, names);
UNPROTECT(2);
return values;
}

/**
* read the next fasta sequence
*/
SEXP fasta_next(SEXP r_handle)
{
int count=0;
char* ptr=NULL;
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in==NULL) error("handle->in==NULL");

while((ptr=readLine(handle))!=NULL)
{
if(ptr[0]=='>')
{
//this is the header of a fasta sequence
if(handle->seq_name!=NULL || handle->seq_dna!=NULL)
{
handle->previous_line=ptr;
SEXP return_value=NULL;
//if there a sequence in memory ?
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
//create the sequence
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
//return the sequence if any
if(return_value!=NULL) return return_value;
}
//cleanup
handle->seq_name=ptr;
handle->seq_length=0;
handle->seq_dna=NULL;
}
else
{
int j=0;
int len=0;
//remove blank characters
while(ptr[j]!=0)
{
if(!isspace(ptr[j]))
{
ptr[len++]=ptr[j];
}
++j;
}
//enlarge the dna sequence
handle->seq_dna=realloc(
handle->seq_dna,
sizeof(char)*(handle->seq_length+1+len)
);
if(handle->seq_dna==NULL) error("cannot realloc seq_dna");
//append the new line
memcpy(&handle->seq_dna[handle->seq_length],ptr,sizeof(char)*len);
handle->seq_length+=len;
handle->seq_dna[handle->seq_length]=0;
free(ptr);
ptr=NULL;
}
}
//last sequence
SEXP return_value=NULL;
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
return (return_value==NULL?R_NilValue:return_value);
}

/**
* close the FastaHandlerPtr
*/
SEXP fasta_close(SEXP r_handle)
{
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in!=NULL)
{
fclose(handle->in);
handle->in=NULL;
}
if(handle->previous_line!=NULL)
{
free(handle->previous_line);
handle->previous_line=NULL;
}
free(handle->seq_name);
handle->seq_name=NULL;
free(handle->seq_dna);
handle->seq_dna=NULL;
R_ClearExternalPtr(r_handle);
return ScalarInteger(0);
}

fasta.R

dyn.load(paste("libfasta", .Platform$dynlib.ext, sep=""))

fasta.open <- function(filename)
{
.Call("fasta_open", filename)
}
fasta.close <- function(handler)
{
.Call("fasta_close",handler)
}

fasta.next <- function(handler)
{
.Call("fasta_next",handler)
}

f<-fasta.open("rs_chLG1.fas")

while(!is.null(snp <-fasta.next(f)))
{
if(nchar(as.character(snp$sequence))< 300)
print(snp)
}

fasta.close(f)

That's it !
Pierre

14 April 2010

Object Oriented Programming with R: My notebook

In the following post, I describe how I've used the OOP features of R to create and use the following class hierarchy:

First the class Person is defined. It contains four fields : firstName, lastName, birthDate and birthPlace.

setClass("Person",
representation(
firstName="character",
lastName="character",
birthDate="Date",
birthPlace="character"
))

A kind of 'constructor' function can be called for Person to check that both firtsName and lastName are not empty:

setValidity("Person",
function(object)
{
length(object@firstName)>0 &&
length(object@lastName)>0
}
)

DeceasedPerson is a subClass of Person, it contains two more fields: deathPlace and deathDate:

setClass("DeceasedPerson",
representation(
deathDate="Date",
deathPlace="character"
),
contains="Person"
)

Scientist is another subClass of Person it contains one more field:'knownFor':

setClass("Scientist",
representation(
knownFor="character"
),
contains="Person"
)

Lastly, DeceasedScientist is a subClass of both Scientist and DeceasedPerson:

setClass("DeceasedScientist",
contains=c("Scientist","DeceasedPerson")
)

Let's define a 'generic' function 'age' returning the age of an individual from his 'birthdate':

age <- function(individual)
{
as.integer((Sys.Date()-individual@birthDate)/365)
}
setGeneric("age")

Polymorphism: for the DeceasedPerson another function will be used, it will calculate the age from both 'deathDate' and 'birthDate':

age.of.death <- function(individual)
{
as.integer((individual@deathDate-individual@birthDate)/365)
}
setMethod(age,signature=c("DeceasedPerson"),definition=age.of.death)

Ok, let's play with our class, we can first create a new instance of Scientist for Craig Venter:

craigVenter <-new(
"Scientist",
firstName="Craig",
lastName="Venter",
birthPlace="Salt Lake City",
birthDate=as.Date("1946-10-14", "%Y-%m-%d"),
knownFor=c("The Institute for Genomic Research","J. Craig Venter Institute")
)

... and Charles Darwin is a DeceasedScientist:

charlesDarwin <-new(
"DeceasedScientist",
firstName="Charles",
lastName="Darwin",
birthDate=as.Date("1809-02-12", "%Y-%m-%d"),
deathDate=as.Date("1882-04-19", "%Y-%m-%d"),
knownFor=c("Natural Selection","The Voyage of the Beagle")
)

Hey , we know where Charles was born!

charlesDarwin@birthPlace="Shrewsbury"

The following statement fails because the firstName is empty:

> try(new("Person",lastName="Darwin",birthDate=as.Date("1809-02-12", "%Y-%m-%d")),FALSE)
Error in validObject(.Object) : invalid class "Person" object: FALSE

Is Darwin a valid object?:

> validObject(charlesDarwin)
[1] TRUE

Print both individuals:

> charlesDarwin
An object of class “DeceasedScientist”
Slot "knownFor":
[1] "Natural Selection" "The Voyage of the Beagle"

Slot "firstName":
[1] "Charles"

Slot "lastName":
[1] "Darwin"

Slot "birthDate":
[1] "1809-02-12"

Slot "birthPlace":
[1] "Shrewsbury"

Slot "deathDate":
[1] "1882-04-19"

Slot "deathPlace":
character(0)

> craigVenter
An object of class “Scientist”
Slot "knownFor":
[1] "The Institute for Genomic Research" "J. Craig Venter Institute"

Slot "firstName":
[1] "Craig"

Slot "lastName":
[1] "Venter"

Slot "birthDate":
[1] "1946-10-14"

Slot "birthPlace":
[1] "Salt Lake City"

Let's use the 'is' operator:

> is(craigVenter,"Person")
[1] TRUE
> is(craigVenter,"DeceasedScientist")
[1] FALSE
> is(charlesDarwin,"DeceasedScientist")
[1] TRUE

Finally let's invoke the polymorhic function 'age' for both individuals:

> age(charlesDarwin)
[1] 73 #age.of.death was called
> age(craigVenter)
[1] 63 #generic "age'

Full source code

setClass("Person",
representation(
firstName="character",
lastName="character",
birthDate="Date",
birthPlace="character"
))

setValidity("Person",
function(object)
{
length(object@firstName)>0 &&
length(object@lastName)>0
}
)

setClass("DeceasedPerson",
representation(
deathDate="Date",
deathPlace="character"
),
contains="Person"
)

setClass("Scientist",
representation(
knownFor="character"
),
contains="Person"
)

age <- function(individual)
{
as.integer((Sys.Date()-individual@birthDate)/365)
}

setGeneric("age")

age.of.death <- function(individual)
{
as.integer((individual@deathDate-individual@birthDate)/365)
}

setClass("DeceasedScientist",
contains=c("Scientist","DeceasedPerson")
)

setMethod(age,signature=c("DeceasedPerson"),definition=age.of.death)

craigVenter <-new(
"Scientist",
firstName="Craig",
lastName="Venter",
birthPlace="Salt Lake City",
birthDate=as.Date("1946-10-14", "%Y-%m-%d"),
knownFor=c("The Institute for Genomic Research","J. Craig Venter Institute")
)

charlesDarwin <-new(
"DeceasedScientist",
firstName="Charles",
lastName="Darwin",
birthDate=as.Date("1809-02-12", "%Y-%m-%d"),
deathDate=as.Date("1882-04-19", "%Y-%m-%d"),
knownFor=c("Natural Selection","The Voyage of the Beagle")
)

try(new("Person",lastName="Darwin",birthDate=as.Date("1809-02-12", "%Y-%m-%d")),FALSE)

charlesDarwin@birthPlace="Shrewsbury"

validObject(charlesDarwin)

charlesDarwin
craigVenter

is(craigVenter,"Person")
is(craigVenter,"DeceasedScientist")
is(charlesDarwin,"DeceasedScientist")
age(charlesDarwin)
age(craigVenter)

That's it !
Pierre

08 April 2010

BAM, BWA and SAMTOOLS: my notebook

Here are my notes about MAQ,BWA and SAMTOOLS.

MAQ

Maq is a software that builds mapping assemblies from short reads generated by the next-generation sequencing machines. Here I used chrM.fa, the reference sequence for the mitochondrial genome

>chrM
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT
TTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTA.....

fasta2bfa converts the reference sequence to the binary fasta format

$(MAQ) fasta2bfa chrM.fa _reference.bfa

I also generated a set of random fastq sequences for the chrM. Each read contains a mutation A/G every 10 bases and an insertion of 3 bases every 50 bases.

(...)
@chrM|8|2
GTATATCACCCTATTAACCACTAACGGGAG
+
kzzst|{efokkdyirlqkflupvofumxe
@chrM|20|3
ATTAACCACTAACGGGAGCTATCCATGCAT
+
zplixjsdlmeemjrxxkt|nrd?qwqwym
@chrM|25|4
CCACTAACGGGAGCTATCCATGCATAAATGGT
+
|uklu}+rqepiuoi|qjyrhpyzlllrq|oz
@chrM|32|5
CGGGAGCTATCCATGCATAAATGGTATTTTAG
+
qyvylqiqkhddve4rddxh|frmznluouxp
(...)

fastq2bfq converts the short reads to the binary fasta format:

$(MAQ) fastq2bfq chrM.fastq _reads.bfq

We can now align the reads to the chrM:

$(MAQ) map -u _poor.txt -H _multiple.data _align.map _reference.bfa _reads.bfq

'_multiple.data' contains the multiple/all 01-mismatch hits , and '_poor.txt' contains the unmapped and poorly aligned reads:

chrM|0|1 99 AAAATCACAGGTATATCACCCTATTAACCACT ````````T``````)````````````````
chrM|25|4 99 CCACTAACGGGAGCTATCCATGCATAAATGGT ``````+`````````````````````````
chrM|32|5 99 CGGGAGCTATCCATGCATAAATGGTATTTTAG ``````````````4`````````````````
chrM|35|6 99 GAGCTATCCATGCATAAATGGTATTTTAGTCT ``````````````````````2`````````
chrM|49|7 99 TAAATGGTATTTTAGTCTGGGGGATGTGCACG ````````````````````````````````
chrM|76|10 99 ACGCAATAGCATTGAGAGACGCTGAAAAGCCG `````````````&``````````````````
chrM|80|11 99 AATAGCATTGAGAGACGCTGAAAAGCCGGAGC ``````````````B```Q``````2``````
chrM|89|12 99 GAGAGACGCTGAAAAGCCGGAGCACCCTATGT ````````````````:```````````````
chrM|90|13 99 AGAGACGCTGAAAAGCCGGAGCACCCTATGTC ````````````````````````````````
chrM|92|14 99 AGACGCTGAAAAGCCGGAGCACCCTATGTCAC ``````````````````?`````````````

View the mapping alignment with mapview:

$(MAQ) mapview _align.map > _mapview.txt

##result 'verticalized'

$1 read.name : chrM|71|26834
$2 chromosome : chrM
$3 position : 72
$4 strand : +
$5 insert.size : 0
$6 paired.flag : 0
$7 mapping.quality : 59
$8 single-end.mapping.quality : 59
$9 alternative.mapping.quality: 59
$10 number.mismatches.of.best.hit : 2
$11 sum.qualities.mismatched.bases.best.hit : 60
$12 number of 0-mismatch hits of the first 24bp : 0
$13 number of 1-mismatch hits of the first 24bp on ref : 1
$14 length of the read : 32
$15 sequence : TGTGCACGCGATAGCATTGGGAGACGCTGGGG
$16 quality : ```````````````````````X````````
(...)

mapstat : get statistics from the alignment:

$(MAQ) mapstat _align.map > _mapstat.txt

-- Total number of reads: 308
-- Sum of read length: 9856
-- Error rate: 0.058746
-- Average read length: 32.00

-- Mismatch statistics:

-- MM 0 1
-- MM 1 37
-- MM 2 268
-- MM 3 2

-- Mapping quality statistics:

-- MQ 20-29 1 1
-- MQ 40-49 138 138
-- MQ 50-59 107 107
-- MQ 70-79 62 62

-- Flag statistics:

-- FG 0 308

mapcheck: Read quality check

$(MAQ) mapcheck _reference.bfa _align.map > _mapcheck.txt

Number of reference sequences: 1
Length of reference sequences exlcuding gaps: 16571
Length of gaps in the reference sequences: 0
Length of non-gap regions covered by reads: 3846
Length of 24bp unique regions of the reference: 3846
Reference nucleotide composition: A: 30.86%, C: 31.33%, G: 13.16%, T: 24.66%
Reads nucleotide composition: A: 41.35%, C: 24.81%, G: 13.91%, T: 19.93%
Average depth across all non-gap regions: 0.591
Average depth across 24bp unique regions: 0.591

A C G T : AC AG AT CA CG CT GA GC GT TA TC TG : 0? 1? 2? 3? 4? 5? 6? : 0? 1? 2? 3? 4? 5? 6?
1 52.1 18.5 16.8 12.5 : 0 33 25 253 0 0 125 0 0 288 0 182 : 3 204 0 0 254 340 198 : 15 998 167 167 0 0 0
2 56.4 16.2 12.9 14.5 : 0 41 0 164 33 0 171 0 0 220 0 34 : 3 36 0 3 412 346 198 : 15 991 167 15 200 29 0
3 52.1 19.5 12.5 15.8 : 0 29 7 114 43 0 222 0 0 138 0 52 : 3 7 3 7 438 340 201 : 15 8 15 8 255 10 0
4 45.9 24.1 10.2 19.8 : 0 15 15 52 0 0 97 0 0 48 0 16 : 10 3 3 3 438 336 204 : 5 15 15 15 105 10 0
5 39.9 25.7 15.2 19.1 : 0 34 0 13 0 0 68 0 0 48 0 16 : 0 3 3 3 442 353 195 : 167 15 15 15 90 0 0
6 35.6 25.7 13.5 25.1 : 0 10 0 37 0 0 48 0 0 26 0 0 : 3 10 0 7 445 343 191 : 15 5 167 8 59 0 0
7 38.3 31.4 12.5 17.8 : 0 0 0 10 0 0 73 0 0 0 0 0 : 0 7 0 7 445 340 201 : 167 8 167 8 30 0 0
8 38.0 26.7 14.2 21.1 : 0 0 0 0 0 0 0 0 0 0 0 0 : 3 7 3 3 435 346 201 : 15 8 15 15 0 0 0
9 39.9 20.1 18.5 21.5 : 0 0 0 116 0 0 18 0 0 58 0 0 : 0 17 7 0 425 353 198 : 167 3 8 167 8 112 0
10 48.5 15.2 17.2 19.1 : 0 0 0 238 32 0 82 0 0 113 0 70 : 0 0 3 7 435 353 201 : 167 167 15 8 46 261 0
11 36.6 27.4 17.5 18.5 : 0 0 0 0 0 0 0 0 0 0 0 0 : 3 13 7 10 425 353 188 : 15 4 8 5 0 0 0
(...)

pileup displays the alignment in a ‘pileup’ text format.

$(MAQ) pileup _reference.bfa _align.map > _pileup.txt

#chrom pos. ref depth base.on.read(@=same)
chrM 1 G 0 @
chrM 2 A 0 @
chrM 3 T 0 @
(...)
chrM 72 T 1 @,
chrM 73 G 1 @,
chrM 74 T 1 @,
chrM 75 G 1 @,
(...)
chrM 110 C 2 @,,
chrM 111 A 2 @GG
chrM 112 C 2 @,,
(..)

assemble creates a consensus sequences from read mapping.

$(MAQ) assemble _consensus.cns _reference.bfa _align.map 2> /dev/null

cns2snp extracts SNP sites from the consensus.

$(MAQ) cns2snp _consensus.cns > _snp.txt

##verticalized
$1 chromosome : chrM
$2 position : 91
$3 ref.base : C
$4 consensus.base : G
$5 phred.qual : 30
$6 read.depth : 1
$7 average number of hits of reads : 1.00
$8 highest mapping quality of the reads covering the position : 59
$9 minimum consensus quality in the 3bp flanking regions at each side of the site : 30
$10 second best call : N
$11 log likelihood ratio of the second best and the third best call : 29
$12 third best call : N
(...)

cns2view shows detailed information at all sites from the consensus.

$(MAQ) cns2view _consensus.cns _viewsnp.txt

##verticalized
$1 chromosome : chrM
$2 position : 1
$3 ref.base : G
$4 consensus.base : N
$5 phred.qual : 0
$6 read.depth : 0
$7 average number of hits of reads : 0.00
$8 highest mapping quality of the reads covering the position : 0
$9 minimum consensus quality in the 3bp flanking regions at each side of the site : 0
$10 second best call : N
$11 log likelihood ratio of the second best and the third best call : 0
$12 third best call : N
(...)

cns2fq extracts the consensus sequences in FASTQ format. "Bases in lower case are essentially repeats or do not have sufficient coverage; bases in upper case indicate regions where SNPs can be reliably called"

$(MAQ) cns2fq _consensus.cns > _cns2fq.txt

@chrM
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnntgtgcacgcgatagcattgggagacgcTggrGccggagcgccctatgtc
gcagtatctgnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnntcaATATT
ACAGGCGAACATACCTACTAAAAAGtgtnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnncagACATCATAACAAAAAATTTCCACCA
AAAACCccccnnnnnnnnnnnntggccacaacacttaaacacatctctgcaaaannnnnn
(...)
+
(...)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!???????????????????????????2EE8EBBBBBBBBBBBBBBBBB
BBBBBBBBBB!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!BHHHKKKK
KKKKKKKKKKKKKKKKKKKKKKKKE???!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
(...)

BWA

Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.

index indexes the chrM in the FASTA format:

$(BWA) index -a is -p _bwa.db chrM.fa

aln finds the suffix array coordinates of the input reads.

$(BWA) aln _bwa.db chrM.fastq > _bam.aln.sai

samse generates alignments in the SAM format given single-end reads.

$(BWA) samse _bwa.db _bam.aln.sai chrM.fastq > _bam.samse.sam

#_bam.samse.sam verticalized
(...)
$1 query : chrM|12602|40780
$2 flag : 0
$3 ref: chrM
$4 position : 12603
$5 quality : 37
$6 cigar.string : 30M
$7 mate.ref.sequence : *
$8 mate.pos : 0
$9 insert.size : 0
$10 query.seq : TCATCCCTGTAGCATTGTGCGTTACATGGT
$11 query.qual : kdfz=|srknznewhnkftyowhtmevnhz
$12 ? : XT:A:U ##cf. manual
$13 ? : NM:i:1
$14 ? : X0:i:1
$15 ? : X1:i:0
$16 ? : XM:i:1
$17 ? : XO:i:0
$18 ? : XG:i:0
$19 ? : MD:Z:18T11

(...)

SAM tools

The SAM format is a generic format for storing large nucleotide sequence alignments. SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. Sometimes this tool needs a tab-delimited file 'chrM.ref' containing the length of each chromosome:

chrM 16571

import converts the alignment into BAM (a binary/compact version of SAM)

$(SAM) import chrM.ref _bam.samse.sam _unsorted.bam

sort BAM alignments by leftmost coordinates

$(SAM) sort _unsorted.bam _sorted

sort prints the alignments:

$(SAM) view -h _sorted.bam _sam.view.txt

#verticalized result (BAM format):

$1 ? : chrM|8|2
$2 ? : 0
$3 ? : chrM
$4 ? : 9
$5 ? : 25
$6 ? : 30M
$7 ? : *
$8 ? : 0
$9 ? : 0
$10 ? : GTATATCACCCTATTAACCACTAACGGGAG
$11 ? : kzzst|{efokkdyirlqkflupvofumxe
$12 ? : XT:A:U
$13 ? : NM:i:2
$14 ? : X0:i:1
$15 ? : X1:i:0
$16 ? : XM:i:2
$17 ? : XO:i:0
$18 ? : XG:i:0
$19 ? : MD:Z:2C19C7

pileup prints the alignment in the pileup format.

$(SAM) pileup -c -f chrM.fa _sorted.bam > _sam.pileup.txt

#verticalized...
(...)
$1 chromosome name : chrM
$2 ref coordinate : 11
$3 reference base : C
$4 read base : A
$5 consensus quality : 60
$6 SNP quality : 60
$7 root mean square mapping quality : 25
$8 reads covering the position : 11
$9 read bases at a SNP line : AAAAAAAAA^:A^:A
$10 read quality : yj
(...)

That's it
Pierre

05 April 2010

The Alchemist: CSS pseudo 3D.

In the following post, I use a two-and-a-half-dimensional (2.5) technique to fake a three-dimensional (3D) visualization of Joseph Wright of Derby's painting The Alchemist in Search of the Philosopher's Stone This post was mostly inspired by Román Cortès' 3D Menimas and a post on Experiment Garden.

The Alchemist in Search of the Philosopher's Stone

The following image/script was successfully tested under Firefox 3.6. Move your mouse over the image

The Alchemist

How it works

Each component of the collage is part of a larger transparent picture:

An array defines each sprite. One sprite is defined by its rectangular clip (x,y,width,height) in the original image, its position (x,y) in the collage and a value (ratio) reflecting the amount of pixels it should scroll

var picts=[
{
/* background */
img:null,
clip:[0,0,image_width,image_height],
ratio:0.0,
x:0,y:0
},
{
/* people */
img:null,
clip:[227,1,111,147],
ratio:0.1,
x:5,y:154
},
(...)
];

The HTML is defined as follow:

<div id='id9879709_main' style='overflow:hidden;position:relative;background-color:red;width:226px;height:320px;border:1px solid black;'>
<span style="position:relative;padding:3px;background-color:white;top:15px;left:15px;border:2px solid black;z-index:100; font-weight: bold;"><a target="_wp" href="http://commons.wikimedia.org/wiki/File:JosephWright-Alchemist.jpg">The Alchemist</a></span>
</div>

.When this html document is loaded, a <div> clipping each sprite is created an inserted into the document:

var main=document.getElementById('main');
main.onmousemove = callback;
for(var i in picts)
{
var pict=picts[i];
pict.img=document.createElement("div"); pict.img.setAttribute("style","position:absolute;top:"+pict.y+"px;left:"+pict.x+"px;width:"+pict.clip[2]+"px;height:"+pict.clip[3]+"px;z-index:"+i+";background:url(alchemy02.png) no-repeat scroll -"+pict.clip[0]+"px -"+pict.clip[1]+"px;");
main.appendChild(pict.img);
}

Lastly a javascript callback is called each time the mouse moves over the image. This method shifts each sprites according to the value of its 'ratio':

var v=document.getElementById('id9879709_main');
var left=v.offsetLeft;
var width=226.0;
var midX= left+width/2.0;
var maxshift=30;
var mouseX=e.clientX;
if(mouseX > left+width) mouseX=left+width;
if(mouseX < left) mouseX=left;
var shift= ((e.clientX-midX)/(width/2.0))*maxshift;
for(var i in picts)
{
var pict=picts[i];
pict.img.style.left=(pict.x+shift*pict.ratio)+"px";
}

That's it
Pierre

26 March 2010

'R' = dna.translate("AGG") . A custom C function for R, My notebook.

In the following post, I will show how I've implemented a custom C function for R. This C function will translate a DNA to a protein. I'm very new to 'R' so feel free to make any comment about the code.

C code

The data in 'R' are stored in an opaque structure named 'SEXP'. A custom C function receives some SEXP arguments and returns another SEXP. So, the declaration of my function translate_dna is:

SEXP translate_dna(SEXP sequences)
{
(...)
}

We check if the argument 'sequences' has a type='character'.

if(!isString(sequences))
error("argument is not a string[]");

The input consists of one or more DNA, so our return value will be an SEXP array of type 'character' (STRSXP) with a size equals to the number of DNAs. This array is allocated:

SEXP array_of_peptides = allocVector(STRSXP,length(sequences));

We loop over each DNA sequence

for(in i=0;i<length(sequences);++i) ...

and we allocate some memory for the peptide:

const char *dna=CHAR(STRING_ELT(sequences, i));
int dna_length=strlen(dna);
char* peptide =malloc(dna_length/3+1);

The peptide is then filled with its amino acids:

for(j=0;j+2 < dna_length;j+=3)
{
peptide[n++]=_translate(dna[j],dna[j+1],dna[j+2]);
}
peptide[n]=0;

And we put this new string/peptide in the returned value

SET_STRING_ELT(array_of_peptides,i,Rf_mkChar(peptide));

Here, I am not sure how 'R' handles its memory. Should I care about the way R runs its garbage manager ? how should I use the macros PROTECT and UNPROTECT ?

Compilation

The code 'translate.c' is compiled as a dynamic library ' libtranslate.so':

gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include translate.c
gcc -shared -Wl,-soname,libtranslate.so.1 -o libtranslate.so translate.o

R code

on the R side , the previous dynamic library is loaded:

dyn.load(paste("libtranslate", .Platform$dynlib.ext, sep=""))

A function dna.translate is declared: it forces the array to be an array of string and it invokes the C function 'translate_dna'

storage.mode(dna) <- "character"
.Call("translate_dna", dna)

We can now invoke the R function dna.translate:

peptides <- dna.translate(
c( "ATGGAGAGGCAGAAACGGAAGGCGGACATCGAGAAAGGGCTGCAGTTCATTCAGTCGACACTAC",
NULL,
"CCCAAAAGCAAGAAGAATATGAGGCCTTTCTGCT",
"CAAACTGGTGCAGAATCTGTTTGCTGAGGGCAATGA",
NULL,
"GGCAGATCAGGGAACTTCTAATGGATTGGGGTCC",
"GGATAACTGCACCTTCGCCTACCATCAGGAGGA",
"GGGTCCCAGGCAGCGCTGCCTGGGGGCTGGGGAG",
"TTGCGACAGGCTCCAGAAGGGCAAAGCCTGCCCAGAT",
"GCACCCCCCTCCTCTCCACCCTACCTTCCATCAACCA",
"AGG"
))

print(peptides)

Result:

[1] "MERQKRKADIEKGLQFIQSTL" "PKSKKNMRPFC" "QTGAESVC*GQ*"
[4] "GRSGNF*WIGV" "G*LHLRLPSGG" "GSQAALPGGWG"
[7] "LRQAPEGQSLPR" "APPSSPPYLPST" "R"

Full source code

:
C code:

#include <stdio.h>
#include <ctype.h>
#include <R.h>
#include <Rinternals.h>
/* the genetic code */
static const char* STANDARD_GENETIC_CODE="FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

static char _translate(char a,char b,char c);

/** translates DNA to protein
* @arg sequences one or more DNA sequence
* @return an array of peptides
*/
SEXP translate_dna(SEXP sequences)
{
int i;
SEXP array_of_peptides;
//check input has type= characters
if(!isString(sequences))
error("argument is not a string[]");
//prepare a new list for the translated sequence
array_of_peptides = allocVector(STRSXP,length(sequences));

//loop over the input
for(i=0;i< length(sequences);++i)
{
int j=0,n=0;
//transform the sequence into a ptr* of char
const char *dna=CHAR(STRING_ELT(sequences, i));
//get the length of this sequence
int dna_length=strlen(dna);
//alloc the protein sequence
char* peptide =malloc(dna_length/3+1);
if(peptide==NULL) error("out of memory");
//loop over the codons
for(j=0;j+2 < dna_length;j+=3)
{
//set the amino acid at 'n'
peptide[n++]=_translate(dna[j],dna[j+1],dna[j+2]);
}
//add EOS
peptide[n]=0;
//put a copy of this peptide in the vector
SET_STRING_ELT(array_of_peptides,i,Rf_mkChar(peptide));
//free our copy
free(peptide);
}
//return the array of petptide
return array_of_peptides;
}

static int base2index(char c)
{
switch(tolower(c))
{
case 't': return 0;
case 'c': return 1;
case 'a': return 2;
case 'g': return 3;
default: return -1;
}
}

static char _translate(char a,char b,char c)
{
int base1= base2index(a);
int base2= base2index(b);
int base3= base2index(c);
if(base1==-1 || base2==-1 || base3==-1)
{
return '?';
}
else
{
return STANDARD_GENETIC_CODE[base1*16+base2*4+base3];
}

}

Makefile:

run:translate.so
${R_HOME}/bin/R --no-save < translate.R

translate.so:translate.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include translate.c
gcc -shared -Wl,-soname,libtranslate.so.1 -o libtranslate.so translate.o

R code:

#load the dynamic library translate
dyn.load(paste("libtranslate", .Platform$dynlib.ext, sep=""))
#declare the function dna.translate
dna.translate <- function(dna)
{
#force dna to be type=character
storage.mode(dna) <- "character"
#call the C function
.Call("translate_dna", dna)
}

peptides <- dna.translate(
c( "ATGGAGAGGCAGAAACGGAAGGCGGACATCGAGAAAGGGCTGCAGTTCATTCAGTCGACACTAC",
NULL,
"CCCAAAAGCAAGAAGAATATGAGGCCTTTCTGCT",
"CAAACTGGTGCAGAATCTGTTTGCTGAGGGCAATGA",
NULL,
"GGCAGATCAGGGAACTTCTAATGGATTGGGGTCC",
"GGATAACTGCACCTTCGCCTACCATCAGGAGGA",
"GGGTCCCAGGCAGCGCTGCCTGGGGGCTGGGGAG",
"TTGCGACAGGCTCCAGAAGGGCAAAGCCTGCCCAGAT",
"GCACCCCCCTCCTCTCCACCCTACCTTCCATCAACCA",
"AGG"
))

print(peptides)

That's it

Pierre

25 March 2010

Playing with Biomart. My notebook.

Thanks to Neil, I've discovered that Biomart was not just a web interace, it also contains a powerful API: Citing http://www.biomart.org/martservice.html:
To submit a query using our webservices generate an XML document conforming to our Query XML syntax. This can be achieved simply by building up your query using MartView and hitting the XML button . This XML should be posted to http://(...)/martservice attached to a single parameter of query.

The base URL for Biomart is http://www.biomart.org/biomart.

Retrieving the Registry information for this Biomart installation: http://www.biomart.org/biomart/martservice?type=registry

<MartRegistry>
  <MartURLLocation database="ensembl_mart_57" default="1" displayName="ENSEMBL GENES 57 (SANGER UK)" host="www.biomart.org" includeDatasets="" martUser="" name="ensembl" path="/biomart/martservice" port="80" serverVirtualSchema="default" visible="1" />
  <MartURLLocation database="hapmart_27_36" default="1" displayName="HAPMAP 27 (NCBI US)" host="hapmap.ncbi.nlm.nih.gov" includeDatasets="" name="HapMap_rel27" path="/biomart/martservice" port="80" redirect="1" serverVirtualSchema="rel27_NCBI_Build36" visible="1" />
  <MartURLLocation database="vega_mart_57" default="0" displayName="VEGA 37 (SANGER UK)" host="www.biomart.org" includeDatasets="" martUser="" name="vega" path="/biomart/martservice" port="80" serverVirtualSchema="default" visible="1" />
  <MartURLLocation database="genomic_features_mart_57" default="0" displayName="ENSEMBL GENOMIC FEATURES 57 (SANGER UK)" host="www.biomart.org" includeDatasets="" martUser="" name="genomic_features" path="/biomart/martservice" port="80" serverVirtualSchema="default" visible="0" />
(...)
</MartRegistry>

Retrieving the datasets available for the registry HapMap_rel27: http://www.biomart.org/biomart/martservice?type=datasets&mart=HapMap_rel27 :

TableSet        hm27_variation_yri      HapMap Population YRI - Yoruba in Ibadan, Nigeria (West Africa) 1               20050000    default 2009-10-28 11:19:52
TableSet        hm27_variation_tsi      HapMap Population TSI - Toscans in Italy        1               200     50000   default     2009-10-28 11:13:07
TableSet        hm27_gene       hm27_gene       0               200     50000   default 2009-09-17 12:04:20
TableSet        hm27_variation_chd      HapMap Population CHD - Chinese in Metropolitan Denver, Colorado        1          200      50000   default 2009-10-28 11:07:01
TableSet        hm27_variation  All Populations 1               200     50000   default 2009-10-28 11:22:21
TableSet        hm27_variation_chb      HapMap Population CHB - Han Chinese in Beijing, China   1               200     50000       default 2009-10-28 11:05:57
TableSet        hm27_variation_asw      HapMap Population ASW - African ancestry in Southwest USA       1               20050000    default 2009-10-28 08:53:43
TableSet        hm27_variation_ceu      HapMap Population CEU - Utah residents with Northern and Western European ancestry from the CEPH collection 1               200     50000   default 2009-10-28 11:04:39
TableSet        hm27_variation_mkk      HapMap Population MKK - Maasai in Kinyawa, Kenya        1               200     50000       default 2009-10-28 11:12:12
(...)

Where can I find a description of those columns ?

Retrieving the attributes for the dataset hm27_variation_ceu: http://www.biomart.org/biomart/martservice?type=attributes&dataset=hm27_variation_ceu :

chrom   chromosome              naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main     chrom
start   position                naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main     start
strand  strand          naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main     strand
marker1 marker id               naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main     marker1
alleles alleles         naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main     alleles
refallele       reference allele                naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main refallele
center  genotyping center               naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__maincenter
platform        genotyping platform             naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main platform
ceu_id  CEU genotype            naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__dm       genotype
ref_allele      reference allele                naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main ref_allele
ref_allele_freq reference allele frequency              naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main ref_allele_freq
ref_allele_count        reference allele count          naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main ref_allele_count
other_allele    other allele            naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__mainother_allele
other_allele_freq       other allele frequency          naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main other_allele_freq
other_allele_count      other allele count              naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main other_allele_count
total_allele_count      total allele count              naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main total_allele_count
refhom_gt       reference homozygote genotype           naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main refhom_gt
het_gt  heterozygote genotype           naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__mainhet_gt
otherhom_gt     other homozygote genotype               naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main otherhom_gt
refhom_gtfreq   reference homozygote genotype frequency         naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main refhom_gtfreq
het_gtfreq      heterozygote genotype frequency         naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main het_gtfreq
otherhom_gtfreq other homozygote genotype frequency             naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main otherhom_gtfreq
refhom_gtcount  reference homozygote genotype count             naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main refhom_gtcount
het_gtcount     heterozygote genotype count             naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main het_gtcount
otherhom_gtcount        other homozygote genotype count         naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main otherhom_gtcount
total_gtcount   total genotype count            naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__variation__main total_gtcount
prot_lsid       protocol LSID           naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__assay__dm   prot_lsid
assay_id        assay LSID              naive_attributes        html,txt,csv,tsv,xls    hm27_variation_ceu__assay__dm   assay_lsid

(Where can I find a description of those columns ?). Each attribute is a specific column in the final result. Example of attributes available for the dataset hm27_variation_ceu:

chrom
start
marker1
alleles
(...)

Retrieving the filters for the dataset hm27_variation_ceu: http://www.biomart.org/biomart/martservice?type=filters&dataset=hm27_variation_ceu :

allele_freq     Minor Allele Frequency [>=]     [0.0,0.01,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5]         naive_filters       list    >=      hm27_variation_ceu__variation__main     avg_dprime
center  Genotyping Center       [affymetrix,bcm,illumina,imsut-riken,mcgill-gqic,perlegen,ucsf-wu,broad,sanger]         naive_filters       list    =       hm27_variation_ceu__variation__main     center
platform        Genotyping Platform     [AFFY_100K,AFFY_500K,BeadArray,FP-TDI,Infinium_H1,Infinium_H300,Invader,MIP,Perlegen,AFFY_6.0,Illumina_1M]          naive_filters   list    =       hm27_variation_ceu__variation__main     platform
monomorphic     Monomorphic SNPs        [,0]            naive_filters   boolean_num     =only,excluded  hm27_variation_ceu__variation__main is_reference_bool
marker_name     List of HapMap SNPs     []              naive_filters           =       hm27_variation_ceu__variation__mainmarker1
utr3    3' UTR  []              naive_filters   boolean_list    only,excluded   hm27_variation_ceu__variation__main     is_utr_bool
utr5    5' UTR  []              naive_filters   boolean_list    only,excluded   hm27_variation_ceu__variation__main     is_5utr_bool
coding_synon    Exons - synonymous coding SNPs  []              naive_filters   boolean_list    only,excluded   hm27_variation_ceu__variation__main is_coding_synon_bool
coding_nonsynon Exons - non-synonymous coding SNPs      []              naive_filters   boolean_list    only,excluded   hm27_variation_ceu__variation__main is_coding_nonsynon_bool
intronic        introns []              naive_filters   boolean_list    only,excluded   hm27_variation_ceu__variation__mainis_intronic_bool
chrom   chromosome      [chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY]              naive_filters   list    =       hm27_variation_ceu__variation__main     chrom
start   From position   []              naive_filters   text    >=      hm27_variation_ceu__variation__main     start
stop    To position     []              naive_filters   text    <=      hm27_variation_ceu__variation__main     stop
gene_list       List of Genes   []              naive_filters           =,in    pointer dataset gene_primary_symbol
encode_region   ENCODE region   [ENm010: 7:26924045..27424045,ENm013: 7:89621624..90736048,ENm014: 7:125865891..127029088,ENr112: 2:51512208..52012208,ENr113: 4:118466103..118966103,ENr123: 12:38626476:39126476,ENr131: 2:234156563..234656627,ENr213: 18:23719231..24219231,ENr232: 9:130725122..131225122,ENr321: 8:118882220..119382220]          naive_filters   list    =  pointer dataset  glook_encode_region

(Where can I find a description of those columns ?) Here we can find some filters for hm27_variation_ceu (e.g.: allele_freq)

Retrieving the configurations for the dataset hm27_variation_ceu: http://www.biomart.org/biomart/martservice?type=configuration&dataset=hm27_variation_ceu :

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DatasetConfig>
<DatasetConfig dataset="hm27_variation_ceu" datasetID="2" displayName="HapMap Population CEU - Utah residents with Northern and Western European ancestry from the CEPH collection" interfaces="default" internalName="default" martUsers="default" modified="2009-10-28 11:04:39" softwareVersion="0.6" template="hm27_variation_ceu" type="TableSet" visible="1">
  <MainTable>hm27_variation_ceu__variation__main</MainTable>
  <Key>marker_id_key</Key>
  <Importable filters="chrom,start,stop" internalName="encode_filter" linkName="encode_filter" name="encode_filter" type="link"/>
  <Importable displayName="gene_filter_35" filters="chrom,start,stop" internalName="gene_filter_35" linkName="gene_filter_35" name="gene_filter_35" type="link"/>
  <FilterPage displayName="FILTERS" internalName="naive_filters">
    <FilterGroup displayName="FILTERS" internalName="filters">
      <FilterCollection displayName="MINOR ALLELE FREQUENCY [&gt;=]" internalName="pop_filter">
        <FilterDescription displayName="Minor Allele Frequency [&gt;=]" displayType="list" field="avg_dprime" internalName="allele_freq" key="marker_id_key" legal_qualifiers="&gt;=" qualifier="&gt;=" style="menu" tableConstraint="main" type="list">
          <Option displayName="0.0" internalName="0.0" isSelectable="true" value="0.0"/>
          <Option displayName="0.01" internalName="0.01" isSelectable="true" value="0.01"/>
          <Option displayName="0.05" internalName="0.05" isSelectable="true" value="0.05"/>
          <Option displayName="0.1" internalName="0.1" isSelectable="true" value="0.1"/>
          <Option displayName="0.15" internalName="0.15" isSelectable="true" value="0.15"/>
          <Option displayName="0.2" internalName="0.2" isSelectable="true" value="0.2"/>
          <Option displayName="0.25" internalName="0.25" isSelectable="true" value="0.25"/>
          <Option displayName="0.3" internalName="0.3" isSelectable="true" value="0.3"/>
          <Option displayName="0.35" internalName="0.35" isSelectable="true" value="0.35"/>
          (...)

As far as I understand , this file is used to 'group' the filters.

At the end, here is Query sent to Biomart:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6">
  <Dataset name="hm27_variation_ceu" interface="default">
    <Filter name="stop" value="1000000"/>
    <Filter name="allele_freq" value="0.3"/>
    <Filter name="chrom" value="chr1"/>
    <Filter name="start" value="0"/>
    <Attribute name="chrom"/>
    <Attribute name="start"/>
    <Attribute name="strand"/>
    <Attribute name="marker1"/>
    <Attribute name="alleles"/>
    <Attribute name="ceu_id"/>
  </Dataset>
</Query>

and we can send it with curl:

curl --data 'query=<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query><Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6" ><Dataset name="hm27_variation_ceu" interface="default" ><Filter name="stop" value="1000000"/><Filter name="allele_freq" value="0.3"/><Filter name="chrom" value="chr1"/><Filter name="start" value="0"/><Attribute name="chrom" /><Attribute name="start" /><Attribute name="strand" /><Attribute name="marker1" /><Attribute name="alleles" /><Attribute name="ceu_id" /></Dataset></Query>' http://www.biomart.org/biomart/martservice

Response from the Biomart server:

chr1    843817  +       rs1806509       A/C     AC AA AC AC AC CC AC CC AC CC AC NN AC CC AA NN AC AC NN CC AC CC AC AC AA AC AC AC CC AA AC AC AC AC AA CC AC AC AA AA AA AA NN AA AA AA AC AA NN AA NN AC AC CC AA AA AC AC AC AC AA AA AA AC AA AA CC CC AC AA AC AC AC AA CC AA AC CC AA NN CC CC AA AC AA AC AA AA AC AC AC AC AA AA NN AA AA CC AC AC AA AA AA AC AC AC AA AC AA AC AC AA AA AC AA CC AC AC AC AC AA AA AA CC AC CC AC NN AC AC AA AA CC AA AC AC AC AA AC AA AC AC AC AA AC CC AA AA CC AA AA AC AC AC AC AA CC AA AC CC CC AC AC AA CC AA AC AC AC AA AA CC AA AA
chr1    881808  +       rs13303106      A/G     AG GG AG GG GG AG AG AG GG GG GG NN AA AG GG NN GG GG NN GG AG AG GG AG GG GG AG GG AA GG AG GG AG AA GG AG GG AG GG GG GG GG NN GG GG GG AA AG NN GG NN AG AG AA GG AG AG AA AG AG GG GG GG AG GG GG AG AA GG GG AG GG AG GG AG GG AG AG GG NN AA GG GG AG GG AG AG AG AG AG GG AG GG GG NN GG GG AG AG AG AG GG AG AG GG GG GG AG GG AG AG GG AG AG AG AA GG AG AG AG GG GG GG AG AG AA AG NN AG AG GG GG AA GG GG GG AA GG GG AG AG AG AG GG AG AA GG GG AA GG GG AG AG AG GG GG AA GG GG AA AA AG AG AG AA GG AA AG AG GG AG AA GG GG
chr1    906697  +       rs6694632       A/G     AG AA AA AG AA AG AG AG AA AA AG NN GG AA GG NN AA AA NN AA AG AG AA AG AA AA AG AA GG GG GG AG AA GG AA AA AA AG AG AG AG AA NN AA AA AA AG AG NN AA NN AG AG AG AG AG AG GG AG AG AA AA AA AG AG AG AA GG AA AA AA GG GG AA AG GG AA AG AA NN AG AA AG AG AA AG GG GG AA GG AG AG AG AG NN AA AG AG AG AG AG AA AA GG AA AA AA AA AA AG AG AA AG AG AG AA AG AA AG AG AA AA GG AG AG GG AG NN AG AG GG AG AA AG AG AA AG AA AA AA AG AA AG AA AA GG AA AA GG AA AA AG AG AG AA AA GG AA AA GG GG AG AG AG AG AG GG AA AG AG AG AG AG AG
chr1    908247  +       rs13303118      G/T     GT TT TT GT TT GT GT GT TT TT GT NN GG TT GG NN TT TT NN TT GT GT TT GT TT TT GT TT GG GG GG GT TT GG TT TT TT GT GT GT GT TT NN TT TT TT GT GT NN TT NN NN GT GT GT GT GT GG GT GT TT TT TT GT GT GT TT GG TT TT TT GG GG TT GT GG TT GT TT NN GT TT GT GT TT GT GG GG TT GG GT GT GT GT NN TT GT GT GT GT GT TT TT GG TT TT TT TT TT GT GT TT GT GT GT TT GT TT GT GT TT TT GG GT GT GG GT NN GT GT GG GT TT GT GT TT GT TT TT TT GT TT GT TT TT GG TT TT GG TT TT GT GT GT TT TT GG TT TT GG GG GT GT GT GT GT GG TT GT GT GT GT GT GT
chr1    908436  +       rs2341354       A/G     AG GG GG AG GG AG AG AG GG GG AG NN AA GG AA NN GG GG NN GG AG AG GG AG GG GG AG GG AA AA AA AG GG AA GG GG GG AG AG AG AG GG NN GG GG GG AG AG NN GG NN AG AG AG AG AG AG AA AG AG GG GG GG AG AG AG AG AA GG GG GG AA AA GG AG AA GG AG GG NN AG GG AG AG GG AG AA AA GG AA AG AG AG AG NN GG AG AG AG AG AG GG GG AA GG GG GG GG GG AG AG GG AG AG AG GG AG GG AG AG GG GG AA AG AG AA AG NN AG AG AA AG GG AG AG GG AG GG GG GG AG GG AG GG GG AA GG GG AA GG GG AG AG AG GG GG AA GG GG AA AA AG AG AG AG AG AA GG AG AG AG AG AG AG
chr1    934427  +       rs3128117       C/T     TT TT CT CC TT CT CC CT TT TT CT NN CT CT CC NN TT TT NN TT TT CT TT CC TT TT CT TT CC CC TT TT TT CT TT TT CT CC CT CT CT TT NN TT CT TT CT CT NN TT NN CT CC TT CT CT CT CC CT CT TT CT TT CT CT CT TT CT TT TT CT CC CT TT CT CC CT CT TT NN CT TT CT CT CT CT CC CC TT CC CT TT TT CC NN TT TT CT CC CT TT TT TT CT CT TT TT TT TT TT TT TT CT CT CT CT CT CT CC CT CT TT CC CT CT CT CT NN CC CC CC CT CT CT CC TT CT TT TT TT CT TT CC CT TT CC CT TT CC TT TT CT CC CT TT TT CT TT TT CT CC TT CT TT CT CT CT CT TT CT TT TT CT CT
chr1    940106  +       rs1891906       A/C     AA AA AC CC AA AC AC AC AA AA AC NN AC AC CC NN AA AA NN AA AA AC AA CC AA AA AC AA CC CC AA AA AA AC AA AA AC CC AC AC AC AA NN AA AC AA AC AC NN AA NN AC CC AA AC AC AC CC AC AC AA AC AA AC AC AC AA AC AA AA AC CC AC AA AC AC AC AC AA NN AC AA AC AC AC AC CC CC AA NN AC AA AA CC NN AA AA AC CC AC AA AA AA AC AC AA AA AA AA AA AA AA AC AC AC AC AC AC CC AC AC AA CC AC AC AC AC NN CC CC AC AC AC AC CC AA AC AA AA AA AC AA CC AC AA CC AC AA CC AA AA AC CC AC AA AA AC AA AA AC CC AA AC AA AC AC AC AC AA AC AA AA AC AC
chr1    949705  +       rs2710888       C/T     CT CC CT CT CC CT CT CT CC CC CT NN CT CT TT NN CC CC NN CC CC CC NN CT CC CC CT CC TT CT CC CC CT CC CC CC CT TT CT CT CT CC NN CC CC CT TT CT NN CC NN CT TT CC CT CC CT TT CT CT CT CC CC CT CC CT CC CT CC CC CT CT CT CT CT CC CT CT CC NN CT CC CT CT CT CT CT TT CT TT CC CC CC TT NN CC CC CT TT CC CC CC CT CT CT CC CC CT CC CC CT CC CT CT CT TT CC CT TT CT CT CC CT CT CT CT CC NN CC TT CT CT CT CT CT CC CT CC CC CC CT CT TT CT TT TT CT CC TT CC CC CT TT CT CC CC CT CC CC CT TT CC CT CC CT CT CT CT CT CC CT CT CC CT
chr1    952073  +       rs3128126       A/G     NN AA NN NN AA AG AG NN NN AA NN AG AG AG NN AG NN NN AG NN AG AA AG NN NN AA NN AA NN AG AA AA NN NN AA AG NN NN NN AG AG NN AA NN NN AA AG AG AA AA AG AG GG NN NN AA AG GG AA AG AA NN AG AG NN NN NN NN NN NN NN NN NN NN AG AG AG AG AA AG GG AA AG AG NN AG AG GG AG GG AA AA AA GG GG AA AG AG GG NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN GG NN AG GG NN NN AG NN NN AG AG AA NN AA AA AA AG NN NN NN NN NN NN AA GG AA AA AG GG NN NN NN NN NN NN NN NN NN GG AA AG GG AG AG NN AA NN NN AA AG
chr1    952469  +       rs4970393       A/G     AG AA AG GG AA AG AG AG AA AA AG NN AG AG AG NN AA AA NN AA AG AA AG AG AA AA AG AA GG GG AA AA AA GG AA AA AG GG AG AG AG AA NN AA AG AA AG AG NN AA NN AG GG AG AG AG AG GG AA AG AA AG AA AG AG AG AA GG AA AA AG AG AG AG AG AG AG AG AA NN AG AA AG AG AG AG GG NN AG GG AG AA AA GG NN AA AA AG GG AG AG AA AA AG AG AA AA AG AA AA AG AA AG AG AG AG AG AG GG AG AG AG NN AG AG AG AA NN AG NN GG AG AG AG GG AA AG AA AA AA AG AG GG AG GG GG AG AA GG AA AA AG GG AG AA AA AG AA AA GG AG AG AG AG AG AG NN AG AA AG AA AA AG AG

Note: I've also tried to use the SOAP/WSDL Biomart API, but as it was discussed on Biostar, the API does not work.

That's it

Pierre

24 March 2010

Learning DesignPatterns: The Factory Method Pattern

"The Factory method pattern defines a separate method for creating the objects, which subclasses can then override to specify the derived type of product that will be created"

Example

/** Abstract class for melting temperature calculation */
abstract class TmCalculator
     {
     public abstract double calcTm(CharSequence s);
     }
 
/** the factory creates two kinds of TmCalculator 
   1) 2AT4GC
   2) Nearest-Neighbor
*/
public class TmCalculatorFactory
    {
    public TmCalculator create2AT4GC()
        {
        return new TmCalculator()
         {
         @Override
         public double calcTm(CharSequence s)
          {
          (...implements ... ) 
          }
         };
        }
  
    public TmCalculator createNearestNeighbor()
        {
        return new TmCalculator()
         {
         @Override
         public double calcTm(CharSequence s)
          {
          (...implements ... ) 
          }
         };
        }
   }

That's it

Pierre

23 April 2010

20 April 2010

SbamGrep

Option

Example

SbamStats

Example

19 April 2010

The C code

The 'R' code

Compiling

Testing

Full Source code

Makefile

fasta.c

fasta.R

14 April 2010

Full source code

08 April 2010

MAQ

BWA

SAM tools

05 April 2010

The Alchemist in Search of the Philosopher's Stone

How it works

26 March 2010

C code

Compilation

R code

Full source code

25 March 2010

24 March 2010

Example

About Me

Feeds

Blog Archive

Web2.0

Labels