Showing posts with label c++. Show all posts
Showing posts with label c++. Show all posts

29 June 2015

A BLAST to SAM converter.

Some times ago, I've received a set of Ion-Torrent /mate-reads with a poor quality. I wasn't able to align much things using bwa. I've always wondered if I could get better alignments using NCBI-BLASTN (short answer: no) . That's why I asked guyduche, my intership student to write a C program to convert the output of blastn to SAM. His code is available on github at :

The input for blast2sam is
  • the XML output of NCBI blastn (or stdin)
  • The single or pair of fastq file(s)
  • The reference sequence indexed with picard
.

Example:

fastq2fasta in.R1.fq.gz in.R2.fq.gz |\
blastn -db REFERENCE   -outfmt 5 | \
blast2bam -o result.bam -W 40 -R '@RG   ID:foo  SM:sample' - REFERENCE.dict  in.R1.fq.gz in.R2.fq.gz

Output:
@SQ SN:gi|9629357|ref|NC_001802.1|  LN:9181 
@RG ID:foo  SM:sample
@PG ID:Blast2Bam    PN:Blast2Bam    VN:0.1  CL:../../bin/blast2bam -o results.sam -W 40 -R @RG  ID:foo  SM:sample - db.dict test_1.fastq.gz test_2.fastq.gz
(...)
ERR656485.2 83  gi|9629357|ref|NC_001802.1| 715 60  180S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X6=1S =   715 -119    CCTAGTGTTGCTTGCTTTTCTTCTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATCCTCTATCGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTAAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAN    (),.((((,(((((,((.((.-(>69>20E>6/=>5EC@9-52?BEE::2951.)74B64=B==FFAF=A??59:>FFFDF:55GGFGF?DFGGFE868>GGGFGGGGED;FGFFGGGGGGGGGGGEFFGE9GGGGFGGGGGGGGDGECGGFGGGGGGGGGGFGGGGGEGGFGGGGGGFFGGGGGFF?EGGFFFEGGGGGGGGFEGGGEGGGFEGGGGGGGGGGDGFFCEGFGGGGGGGGGGGFFECFGGGGFGGGGGGGGGGGFCGGGGGGGGGGGGGGGGGGFGGGGGGGGF@CCA8!    NM:i:13 RG:Z:foo    AS:i:80 XB:f:148.852    XE:Z:4.07e-39
ERR656485.2 163 gi|9629357|ref|NC_001802.1| 715 60  73S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X8=106S    =   715 119 NAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTCTCATTACAAAAAAAACATACACAATAAATGATATAAGCGGAATCAACAGCATGA    !8A@CGGEFGFGCDFGGGGGGGGGGGGGGFGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGFGGFGGGGGGGGGEGFGFGGGFFGGGGGGGGFGGGGGGGGGGGGFFFFGGGGGG=FFGGFFDGGGGGGGG8FGFGGGGGGGGGFGGGGGGGGGGFDGGFGGFGGGFFFGFF8DFDFDFFFFFFFFFBCDB<@EAFB@ABAC@CDFF?4>EEFE<*>BDAFB@FFBFF>((6<5CC.;C;=D9106(.))).)-46<<))))))))))((,(-)))()((()))    NM:i:13 RG:Z:foo    AS:i:82 XB:f:152.546    XE:Z:3.15e-40
(...)
Now, I would be interested in finding another dataset where this tool could be successfully used.
That's it,
Pierre

30 January 2015

Filtering Fasta Sequences using the #NCBI C++ API. My notebook.

In my previous post (http://plindenbaum.blogspot.com/2015/01/compiling-c-hello-world-program-using.html) I've built a simple "Hello World" application using the
NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/). Here, I'm gonna to extend the code in order to create a program filtering FASTA sequences on their sizes.

This new application FastaFilterSize needs three new arguments:

  • '-m' min size of the fasta sequence
  • '-M' max size of the fasta sequence
  • '-v' inverse selection

the method FastaFilterSize::Init() is used to register those optional arguments

/* optional argument for min size */
arg_desc->AddOptionalKey(
    "m",
    "min_size",
    "min size inclusive",
    CArgDescriptions::eInteger
    );
/* optional argument for max size */
arg_desc->AddOptionalKey(
    "M",
    "max_size",
    "max size inclusive",
    CArgDescriptions::eInteger
    );
/* add simple flag to inverse selection */
arg_desc->AddFlag("v", 
    "inverse selection",
    true
    );

the FastaFilterSize::Run() method is the workhorse of the program. Here, we get the output stream that will be used by a CFastaOstream to print the filtered Fasta sequences:

(...)
/* get output stream */
CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
/* create a FASTA writer */
CFastaOstream* writer = new CFastaOstream(out);
(...)

We get the input stream that will be used by a CFastaReader to read the FASTA sequences to be filtered.

/* get input stream */
CNcbiIstream& input_stream  = this->GetArgs()["i"].AsInputFile();
CFastaReader reader(input_stream ,
        CFastaReader::fNoParseID /* http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/fasta_8hpp_source.html */
        );

We get the optional values for min-length and max-length:

if (this->GetArgs()["m"])
    {
    min_inclusive = this->GetArgs()["m"].AsInteger();
    }
if (this->GetArgs()["M"])
    {
    max_inclusive = this->GetArgs()["M"].AsInteger();
    }

We get the 'inverse' flag.

bool invert=this->GetArgs()["v"];

While a Fasta sequence can be read, we check its size and we print it if needed:

while( !reader.AtEOF() )
    {
    /* read next sequence */
    CRef<CSeq_entry> seq(reader.ReadOneSeq());
    /* should we print the sequence ?*/
    if((seq->GetSeq().GetLength()>= min_inclusive && 
        seq->GetSeq().GetLength()<= max_inclusive
        )!= invert)
        {
        /* yes, print */
        writer->Write(*seq);
        }
    }

All in one:

#include <memory>
#include <limits>
#include <objtools/readers/fasta.hpp>
#include "corelib/ncbiapp.hpp"
#include <objects/seq/Bioseq.hpp>
#include <objects/seq/Bioseq.hpp>
#include <objects/seq/Delta_ext.hpp>
#include <objects/seq/Delta_seq.hpp>
#include <objects/seq/NCBIeaa.hpp>
#include <objects/seq/IUPACaa.hpp>
#include <objects/seq/IUPACna.hpp>
#include <objects/seq/Seg_ext.hpp>
#include <objects/seq/Seq_annot.hpp>
#include <objects/seq/Seq_descr.hpp>
#include <objects/seq/Seq_ext.hpp>
#include <objects/seq/Seq_hist.hpp>
#include <objects/seq/Seq_inst.hpp>
#include <objects/seq/Seq_literal.hpp>
#include <objects/seq/Seqdesc.hpp>
#include <objects/seq/seqport_util.hpp>
#include <objects/seqalign/Dense_seg.hpp>
#include <objects/seqalign/Seq_align.hpp>
#include <objects/seqloc/Seq_id.hpp>
#include <objects/seqloc/Seq_interval.hpp>
#include <objects/seqloc/Seq_loc.hpp>
#include <objects/seqloc/Seq_loc_mix.hpp>
#include <objects/seqloc/Seq_point.hpp>
#include <objects/seqset/Bioseq_set.hpp>
#include <objects/seqset/Seq_entry.hpp>
#include <objtools/readers/message_listener.hpp>
#include <objtools/readers/line_error.hpp>
#include <objtools/seqmasks_io/mask_reader.hpp>
#include <objtools/seqmasks_io/mask_writer.hpp>
#include <objtools/seqmasks_io/mask_fasta_reader.hpp>
#include <objmgr/util/sequence.hpp>
#include <ctype.h>

USING_NCBI_SCOPE;
USING_SCOPE(objects);

/**
 * FastaFilterSize
 *    
 *  Motivation:
 *      filter sequences on size
 *  Author:
 *      Pierre Lindenbaum PhD 2015
 *
 */
class FastaFilterSize : public CNcbiApplication /* extends a basic NCBI application defined in c++/include/corelib/ncbiapp.hpp */
    {
    public:

        /* constructor, just set the version to '1.0.0'*/
        FastaFilterSize()
            {
            CRef<CVersion> version(new CVersion());
            version->SetVersionInfo(1, 0, 0);
            this->SetFullVersion(version);
            }
            
            
        /* called to initialize rge program.
        *  The default behavior of this in 'CNcbiApplication' is "do nothing".
        *  Here, we set the command line arguments.
        */
        virtual void Init()
            {
            CArgDescriptions* arg_desc = new CArgDescriptions ; /* defined in /c++/include/corelib/ncbiargs.hpp */
            arg_desc->SetUsageContext(
                GetArguments().GetProgramBasename(),
                "A program filtering fasta sequences on size"
                );
            /* argument for input file */
            arg_desc->AddDefaultKey(
                    "i",/* name */
                    "input_file_name",/* synopsis */
                    "input file name",/* comment */
                    CArgDescriptions::eInputFile, /* argument type */
                    "-" /* default value*/
                    );
           /* argument for output file */
           arg_desc->AddDefaultKey(
                    "o",/* name */
                    "output_file_name",/* synopsis */
                    "output file name",/* comment */
                    CArgDescriptions::eOutputFile, /* argument type */
                    "-" /* default value*/
                    );
            /* optional argument for min size */
            arg_desc->AddOptionalKey(
                    "m",
                    "min_size",
                    "min size inclusive",
                    CArgDescriptions::eInteger
                    );
            /* optional argument for max size */
            arg_desc->AddOptionalKey(
                    "M",
                    "max_size",
                    "max size inclusive",
                    CArgDescriptions::eInteger
                    );
            
            /* add simple flag to inverse selection */
            arg_desc->AddFlag("v", 
                "inverse selection",
                true
                );
            
            /* push this command args */
            SetupArgDescriptions( arg_desc );
            }   
        
        /* class destructor */
        virtual ~FastaFilterSize()
            {
            }
        
        /* workhorse of the program */
        virtual int Run()
            {
            /* get output stream */
            CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
            /* create a FASTA writer */
            CFastaOstream* writer = new CFastaOstream(out);
            /* shall we inverse the selection ? */
            bool invert=this->GetArgs()["v"];

            /* get input stream */
            CNcbiIstream& input_stream  = this->GetArgs()["i"].AsInputFile();
            /* init min and max inclusive */
            TSeqPos min_inclusive=0;
            TSeqPos max_inclusive= std::numeric_limits<TSeqPos>::max();
            /* if min and max set by user */
            if (this->GetArgs()["m"])
                {
                min_inclusive = this->GetArgs()["m"].AsInteger();
                }
            if (this->GetArgs()["M"])
                {
                max_inclusive = this->GetArgs()["M"].AsInteger();
                }
            /* create a FASTA parser */
            CFastaReader reader(input_stream ,
                    CFastaReader::fNoParseID /* http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/fasta_8hpp_source.html */
                    );
            /* while we can read something */
            while( !reader.AtEOF() )
                {
                /* read next sequence */
                CRef<CSeq_entry> seq(reader.ReadOneSeq());
                
                /* should we print the sequence ?*/
                if((seq->GetSeq().GetLength()>= min_inclusive && 
                    seq->GetSeq().GetLength()<= max_inclusive
                    )!= invert)
                    {
                    /* yes, print */
                    writer->Write(*seq);
                    }
                }
            /* cleanup */
            delete writer;
            out.flush();
            return 0;
            }
    };

int main(int argc,char** argv)
    {
    return FastaFilterSize().AppMain(
        argc,
        argv,
        0, /* envp Environment pointer */
        eDS_ToStderr,/* log message. In /c++/include/corelib/ncbidiag.hpp  */
        0, /* Specify registry to load, as per LoadConfig() */
        "filterfasta" /* Specify application name */
        );
    }

Compiling

g++  -std=gnu++11  -finline-functions -fstrict-aliasing -fomit-frame-pointer    
-I/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/include 
-I/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseM
T/inc  -Wno-deprecated -Wno-deprecated-declarations -o filtersize filtersize.cpp
 -L/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/Release
MT/lib -lblast_app_util -lblastinput-static -lncbi_xloader_blastdb-static -lncbi
_xloader_blastdb_rmt-static -lxblastformat-static -lalign_format-static -ltaxon1
-static -lblastdb_format-static -lgene_info-static -lxalnmgr-static -lblastxml-s
tatic -lblastxml2-static -lxcgi-static -lxhtml-static -lxblast-static -lxalgobla
stdbindex-static -lcomposition_adjustment-static -lxalgodustmask-static -lxalgow
inmask-static -lseqmasks_io-static -lseqdb-static -lblast_services-static -lxobj
util-static -lxobjread-static -lvariation-static -lcreaders-static -lsubmit-stat
ic -lxnetblastcli-static -lxnetblast-static -lblastdb-static -lscoremat-static -
ltables-static -lxalnmgr-static -lncbi_xloader_genbank-static -lncbi_xreader_id1
-static -lncbi_xreader_id2-static -lncbi_xreader_cache-static -lxconnext-static 
-lxconnect-static -ldbapi_driver-static -lncbi_xreader-static -lxconnext-static 
-lxconnect-static -lid1-static -lid2-static -lseqsplit-static -lxcompress-static
 -lxobjmgr-static -lgenome_collection-static -lseqedit-static -lseqset-static -l
seq-static -lseqcode-static -lsequtil-static -lpub-static -lmedline-static -lbib
lio-static -lgeneral-static -lxser-static -lxutil-static -lxncbi-static -lgomp -
lz -lbz2 -ldl -lnsl -lrt -lm -lpthread -L/commun/data/users/lindenb/src/blast2.2
.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib -lblast_app_util -lblastinput-stati
c -lncbi_xloader_blastdb-static -lncbi_xloader_blastdb_rmt-static -lxblastformat
-static -lalign_format-static -ltaxon1-static -lblastdb_format-static -lgene_inf
o-static -lxalnmgr-static -lblastxml-static -lblastxml2-static -lxcgi-static -lx
html-static -lxblast-static -lxalgoblastdbindex-static -lcomposition_adjustment-
static -lxalgodustmask-static -lxalgowinmask-static -lseqmasks_io-static -lseqdb
-static -lblast_services-static -lxobjutil-static -lxobjread-static -lvariation-
static -lcreaders-static -lsubmit-static -lxnetblastcli-static -lxnetblast-stati
c -lblastdb-static -lscoremat-static -ltables-static -lxalnmgr-static -lncbi_xlo
ader_genbank-static -lncbi_xreader_id1-static -lncbi_xreader_id2-static -lncbi_x
reader_cache-static -lxconnext-static -lxconnect-static -ldbapi_driver-static -l
ncbi_xreader-static -lxconnext-static -lxconnect-static -lid1-static -lid2-stati
c -lseqsplit-static -lxcompress-static -lxobjmgr-static -lgenome_collection-stat
ic -lseqedit-static -lseqset-static -lseq-static -lseqcode-static -lsequtil-stat
ic -lpub-static -lmedline-static -lbiblio-static -lgeneral-static -lxser-static 
-lxutil-static -lxncbi-static -lgomp -lz -lbz2 -ldl -lnsl -lrt -lm -lpthread

Testing

get help

$ ./filtersize  -help
USAGE
  filtersize [-h] [-help] [-xmlhelp] [-i input_file_name]
    [-o output_file_name] [-m min_size] [-M max_size] [-v]
    [-logfile File_Name] [-conffile File_Name] [-version] [-version-full]
    [-dryrun]

DESCRIPTION
   A program filtering fasta sequences on size

OPTIONAL ARGUMENTS
 (...)
 -i <File_In>
   input file name
   Default = `-'
 -o <File_Out>
   output file name
   Default = `-'
 -m <Integer>
   min size inclusive
 -M <Integer>
   max size inclusive
 -v
   inverse selection
(...)

test with some fasta sequences:

echo -e  ">seq1\nAATCGTACGCTAG\n>seq2\nAATCGTACGATGCAAAAAAAAAAAATCTAG\n>seq3\nAATC\n>seq3\nAATC" |\
./filtersize -m 10 -M 20

>lcl|1 seq1
AATCGTACGCTAG
echo -e  ">seq1\nAATCGTACGCTAG\n>seq2\nAATCGTACGATGCAAAAAAAAAAAATCTAG\n>seq3\nAATC\n>seq3\nAATC" |\
./filtersize -m 10 -M 20 -v 

>lcl|2 seq2
AATCGTACGATGCAAAAAAAAAAAATCTAG
>lcl|3 seq3
AATC
>lcl|4 seq3
AATC

That's it,
Pierre

08 August 2014

A GNU-make plug-in for the #Illumina FASTQs.

The latest version of GNU-Make http://www.gnu.org/software/make/ provides many advanced capabilities, including many useful functions. However, it does not contain a complete programming language and so it has limitations. Sometimes these limitations can be overcome through use of the shell function to invoke a separate program, although this can be inefficient. On systems which support dynamically loadable objects, you can write your own extension in any language (which can be compiled into such an object) and load it to provide extended capabilities ( see http://www.gnu.org/software/make/manual/make.html#Loading-Objects )

Building a plug-in for the Illumina FASTQs.

from http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm

Illumina FASTQ files use the following naming scheme:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits>.fastq.gz

For example, the following is a valid FASTQ file name:

NA10831_ATCACG_L002_R1_001.fastq.gz

Here I'm writing a set of new functions for makefile to extract the different parts (sample, lane...) of a fastq file-name:

First a struct holding the parts of the file is created:

enum E_IlluminaComponent
    {
    E_sampleName,
    E_barcodeSequence,
    E_lane,
    E_readNumber,
    E_setNumber
    };

typedef struct illumina_scheme_t
    {
    char* filename;
    char* components[NUM_ILLUMINA_COMPONENTS];
    } IlluminaScheme,*IlluminaSchemePtr ;

and a function parsing the filenames is created:

IlluminaSchemePtr IlluminaSchemeNew(const char* filename)
    {
    ...
    }

when the plugin llumina is loaded as a dynamic C library, the method llumina_gmk_setup is called,
and we tell make about the new functions with gmk_add_function(name,callback,min_args,max_args,no_expand_content) :

int illumina_gmk_setup ()
  {
   gmk_add_function ("illumina_sample",illumina_sample, 1, 1, 0);
   gmk_add_function ("illumina_lane",illumina_lane, 1, 1, 0);
   (...)
  }

A function registered with make must match the gmk_func_ptr type.
It will be invoked with three parameters: name (the name of the function), argc (the number of arguments to the function), and argv (an array of pointers to arguments to the function). The last pointer (that is, argv[argc]) will be null (0).
The return value of the function is the result of expanding the function.

char* illumina_sample(const char *function_name, unsigned int argc, char **argv)
    {
    /** extract the filename(s), build and return a string containing the samples */
    }

Compiling

the plugin must be compiled as a dynamic C library.

Note: The manual says this step can also be generated in the final 'Makefile' (via load ./illumina.so) but I was not able to compile a missing library (illumina.so cannot open shared object file: No such file or directory)

so I compiled it by hand:

gcc -Wall -I/path/to/sources/make-4.0 -shared -fPIC -o illumina.so illumina.c

Test

here is the makefile:

SAMPLES=  NA10831_ATCACG_L002_R1_001.fastq.gz \
      hello \
      NA10832_ATCACG_L002_R1_001.fastq.gz \
      NA10831_ATCACG_L002_R2_001.fastq.gz \
      NA10832_ATCACG_L002_R2_001.fastq.gz \
      NA10833_ATCAGG_L003_R1_003.fastq.gz \
      NA10833_ATCAGG_L003_R1_003.fastq.gz \
      ERROR_ATCAGG_x003_R3_0z3.fastq.gz \
      false

all:
    @echo "SAMPLES: " $(illumina_sample  ${SAMPLES} )
    @echo "BARCODES: " $(illumina_barcode  ${SAMPLES} )
    @echo "LANE: " $(illumina_lane  ${SAMPLES} )
    @echo "READ: " $(illumina_read  ${SAMPLES} )
    @echo "SET: " $(illumina_set  ${SAMPLES} )

output:

$ make
SAMPLES:  NA10831 NA10832 NA10833
BARCODES:  ATCACG ATCAGG
LANE:  L002 L003
READ:  R1 R2
SET:  001 003

That's it,

Pierre

30 July 2014

writing #rstats bindings for bwa-mem, my notebook.

I wanted to learn how to bind a C library to R, so I've created the following bindings for BWA. My code is available on github at :

Most of the C code was inspired from Heng Li's code https://github.com/lh3/bwa/blob/master/example.c.

A short description of the C code

In https://github.com/lindenb/rbwa/blob/master/rbwa.c:
struct RBwaHandler
This structure holds a pointer to the bwa-index (bwaidx_t) and to the options for bwa (mem_opt_t).
RBwaOpen(filename)
This methods opens the bwa index, wrap the pointer in a 'R' tructure using R_MakeExternalPtr and registers a destructor '_RBwaFinalizer' to be called by the garbage manager.
_RBwaFinalizer(handler)
This is the destructor called by the garbage manager. It calls 'RBwaClose'
RBwaClose(handler)
retrieves the pointer to the 'struct RBwaHandler' using 'R_ExternalPtrAddr', disposes the resources, free the RBwaHandler using 'R_ClearExternalPtr'
RBwaMap(handler,DNAsequence)
This is the workhorse of the code: it retrieves the pointer to the 'struct RBwaHandler' using 'R_ExternalPtrAddr', creates a "data.frame" with 6 columns (chromosome, position, strand, mapq, NM, secondary), maps the DNA sequence by calling bwa::mem_align1 and insert the hits in the data.frame.

The R code

See https://github.com/lindenb/rbwa/blob/master/rbwa.R. This R code loads the dynamic C libraries and declares the R functions calling the previous C functions using .Call:
  • bwa.open(filename)
  • bwa.close(bwt)
  • bwa.map(bwt,dnasequence)

Example

source("rbwa.R")
bwt <- bwa.open("test_files/chrM.fa")

for(s in c(
 "GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT",
 "GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT",
 "CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT",
 "TACTAAACCC",
 "GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT" 
 ))
 {
 print(paste("QUERY:",s));
 hits<-bwa.map(bwt,s)
 print(hits)
 }

bwa.close(bwt);
here is the R output:
> source("rbwa.R")
> 
> bwt <- bwa.open("test_files/chrM.fa")
> 
> for(s in c(
+   "GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT",
+   "GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT",
+   "CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT",
+   "TACTAAACCC",
+   "GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT" 
+   ))
+  {
+  print(paste("QUERY:",s));
+  hits<-bwa.map(bwt,s)
+  print(hits)
+  }
[1] "QUERY: GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT"
  chrom pos strand mapq NM secondary
1  chrM 650      1   60  0         0
[1] "QUERY: GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT"
  chrom pos strand mapq NM secondary
1  chrM 650      1   60  2         0
[1] "QUERY: CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT"
  chrom  pos strand mapq NM secondary
1  chrM 3100      0   60  4         0
[1] "QUERY: TACTAAACCC"
[1] chrom     pos       strand    mapq      NM        secondary
<0 rows> (or 0-length row.names)
[1] "QUERY: GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT"
[1] chrom     pos       strand    mapq      NM        secondary
<0 rows> (or 0-length row.names)
> 
> bwa.close(bwt);
[1] TRUE

That's it,

Pierre

29 July 2014

Including the hash for the current git-commit in a C program

Say you wrote the following simple C program:
It includes a file "githash.h" that would contain the hash for the current commit in Git:



Because you're working with a Makefile, the file "githash.h" is generated by invoking 'git rev-parse HEAD ':



the file "githash.h" loooks like this:




But WAIT that is not so simple, once the file 'githash.h' has been created it won't be updated by Make as it already exists. This file should be removed each time 'git commit' is invoked. We can do this by creating POST COMMIT git hook: we create a bash script named ".git/hooks/post-commit" removing 'githash.h:



don't forget make it executable: `chmod +x .git/hooks/post-commit`

Now, each time 'git commit' is called, the file githash.h for the previous git-commit will be deleted !


That's it,

Pierre








05 July 2014

Pushed : makefile2graph , creating a graph of dependencies from GNU-Make.

I pushed makefile2graph at https://github.com/lindenb/makefile2graph. This is the standalone and 'C' implementation of a program I first wrote in java in 2012. The program creates a graph of dependencies from GNU-Make and its output is a graphiz-dot file or a Gexf/Gephi-XML file.

Usage

$ make -Bnd | make2graph > output.dot
$ make -Bnd | make2graph -x > gephi.gexf.xml 

Example

Here is the output of makefile2graph for Tabix:
$ cd tabix-0.2.5
$ make -Bnd |make2graph
digraph G {
n1[label="", color="green"];
n2[label="Makefile", color="green"];
n4[label="all", color="red"];
n3[label="all-recur", color="red"];
n23[label="bedidx.c", color="green"];
n22[label="bedidx.o", color="red"];
n9[label="bgzf.c", color="green"];
n10[label="bgzf.h", color="green"];
n8[label="bgzf.o", color="red"];
n27[label="bgzip", color="red"];
n29[label="bgzip.c", color="green"];
n28[label="bgzip.o", color="red"];
n18[label="index.c", color="green"];
n17[label="index.o", color="red"];
n20[label="khash.h", color="green"];
n16[label="knetfile.c", color="green"];
n11[label="knetfile.h", color="green"];
n15[label="knetfile.o", color="red"];
n24[label="kseq.h", color="green"];
n21[label="ksort.h", color="green"];
n13[label="kstring.c", color="green"];
n14[label="kstring.h", color="green"];
n12[label="kstring.o", color="red"];
n6[label="lib", color="red"];
n7[label="libtabix.a", color="red"];
n26[label="main.c", color="green"];
n25[label="main.o", color="red"];
n5[label="tabix", color="red"];
n19[label="tabix.h", color="green"];
n2 -> n1 ; 
n4 -> n1 ; 
n3 -> n1 ; 
(..)
}

That's it
Pierre

29 May 2013

Binding a C library with Javascript/ #mozilla. An example with the Tabix library

In this post I'll show how to bind a C API to javascript using the mozilla/xul-runner API and the tabix library.

About xpcshell

XULRunner is a Mozilla runtime package. The SDK package contains xpcshell, a JavaScript Shell application that lets you run JavaScript code. "Unlike the ordinary JS shell (js), xpcshell lets the scripts running in it access the mozila technologies (XPCOM)." I've tested the current code with
$ xulrunner -v
Mozilla XULRunner 22.0 - 20130521223249
XULRunner is not installed by default on ubuntu on needs to be downloaded.

The js.type library

The js-ctypes is a foreign-function library for Mozilla's privileged JavaScript. It provides C-compatible data types and allows JS code to call functions in shared libraries (dll, so, dylib) and implement callback functions.

Tabix

Heng Li's Tabix is "a generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions.". The code is available in github at https://github.com/samtools/tabix.

Binding the Tabix library to javascript

First of all, the dynamic library for tabix must be compiled:
$ cd /path/to/tabix.dir
$ make libtabix.so.1
A javascript file tabix.js is created. At the top, we tell the javascrpipt engine we want to use the js.type library:
Components.utils.import("resource://gre/modules/ctypes.jsm")
The dynamic library for tabix is loaded:
var lib = ctypes.open("libtabix.so.1");
We bind each required methods of the tabix library to javascript. As an example we're going to bind ti_open. The C declaration for this method is:
tabix_t *ti_open(const char *fn, const char *fnidx);
Using js.type, the call to that method is wrapped to javascript using declare/:
var DLOpen= lib.declare("ti_open",/* method name */
 ctypes.default_abi,/* Application binary interface type */
 ctypes.voidptr_t, /* return type is a pointer 'void*' */
 ctypes.char.ptr,  /* first argument is 'char*' */
 ctypes.int32_t /* second argument is 'int' */
 );
In javascript, the library is used by invoking DLOpen :
function TabixFile(filename)
 {
 this.ptr= DLOpen(filename,0);
 if(this.ptr.isNull()) throw "I/O ERROR: Cannot open \""+filename+"\"";
 };
var tabix=new TabixFile("annotatons.bed.gz");

The tabix.js library

All in one, I wrote the following file.

Testing


load("tabix.js");
var tabix=new TabixFile("/path/to/tabix-0.2.5/example.gtf.gz");
var iter=tabix.query("chr2:32800-35441");
while((line=iter.next())!=null)
 {
 print(line);
 }
tabix.close();

Set the dynamic library path (LD_LIBRARY_PATH) and invoke this script with xpcshell:
LD_LIBRARY_PATH=/path/to/xulrunner-sdk/bin:/path/to/tabix-0.2.5 /path/to/xulrunner-sdk/bin/xpcshell -f test.js
Output:
chr2 HAVANA transcript 28814 36385 . - . gene_id "ENSG00000184731"; transcript_id "ENST00000327669"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; level 2; tag "CCDS"; ccdsid "CCDS42645"; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322220";
chr2 HAVANA gene 28814 36870 . - . gene_id "ENSG00000184731"; transcript_id "ENSG00000184731"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C"; level 2; havana_gene "OTTHUMG00000151321";
chr2 HAVANA transcript 31220 32952 . - . gene_id "ENSG00000184731"; transcript_id "ENST00000460464"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "FAM110C-003"; level 2; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322222";
chr2 HAVANA transcript 31221 36870 . - . gene_id "ENSG00000184731"; transcript_id "ENST00000461026"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "FAM110C-002"; level 2; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322221";
chr2 HAVANA exon 32809 32952 . - . gene_id "ENSG00000184731"; transcript_id "ENST00000460464"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "FAM110C-003"; level 2; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322222";
chr2 HAVANA CDS 35440 36385 . - 0 gene_id "ENSG00000184731"; transcript_id "ENST00000327669"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; level 2; tag "CCDS"; ccdsid "CCDS42645"; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322220";
chr2 HAVANA exon 35440 36385 . - . gene_id "ENSG00000184731"; transcript_id "ENST00000327669"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; level 2; tag "CCDS"; ccdsid "CCDS42645"; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322220";

That's it,

Pierre

26 April 2013

Java JNI bindings for BWA(mem-lite)

Motivation

BWA 7.4(http://bio-bwa.sourceforge.net/) contains a small C example(https://github.com/lh3/bwa/blob/master/example.c) for running bwa-mem as a library (bwamem-lite). I created some JNI bindings to see if I can bind the C bwa library to java and get the same output than bwamem-lite. I put the code on github at https://github.com/lindenb/jbwa.

Example


(compare to https://github.com/lh3/bwa/blob/master/example.c )

System.loadLibrary("bwajni");
BwaIndex index=new BwaIndex(new File("hg19.fa"));
BwaMem mem=new BwaMem(index);
KSeq kseq=new KSeq(new File("input.fastq.gz");
ShortRead read=null;
while((read=kseq.next())!=null)
        {
        for(AlnRgn a: mem.align(read))
                {
                if(a.getSecondary()>=0) continue;
                System.out.println(  read.getName()+"\t"+  a.getStrand()+"\t"+  a.getChrom()+"\t"+
                        a.getPos()+"\t"+ a.getMQual()+"\t"+ a.getCigar()+"\t"+  a.getNm() );
                }
        }
kseq.dispose();
index.close();
mem.dispose();

Testing


Here is the ouput of the JAVA version:

gunzip -c input.fastq.gz | head -n 4000 |\
java  -Djava.library.path=src/main/native -cp src/main/java \
   com.github.lindenb.jbwa.jni.Example human_g1k_v37.fasta -| tail 


HWI-1KL149:20:C1CU7ACXX:4:1101:3077:33410       +       3       38647538        60      89M11S  1
HWI-1KL149:20:C1CU7ACXX:4:1101:3396:33445       +       8       52567289        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10013:33288      -       1       156104115       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10390:33496      -       6       123824853       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:13537:33483      +       2       157367092       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14139:33390      +       20      31413797        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14514:33458      +       2       179401813       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:15292:33282      +       15      63335820        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:16960:33276      -       12      110782784       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:17355:33322      +       6       126077895       60      100M    1

And the ouput of the Native C version:

gunzip -c input.fastq.gz | head -n 4000 |\
bwa-0.7.4/bwamem-lite human_g1k_v37.fasta - | tail 

HWI-1KL149:20:C1CU7ACXX:4:1101:3077:33410       +       3       38647538        60      89M11S  1
HWI-1KL149:20:C1CU7ACXX:4:1101:3396:33445       +       8       52567289        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10013:33288      -       1       156104115       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10390:33496      -       6       123824853       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:13537:33483      +       2       157367092       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14139:33390      +       20      31413797        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14514:33458      +       2       179401813       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:15292:33282      +       15      63335820        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:16960:33276      -       12      110782784       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:17355:33322      +       6       126077895       60      100M    1

GUI


As a test I also created a swing-Based interface for BWA:

java  -Djava.library.path=src/main/native  -cp src/main/java \
    com.github.lindenb.jbwa.jni.BwaFrame human_g1k_v37.fasta



That's it,


Pierre

11 February 2013

Counting the reads in a BAM file using SGE and the Open-MPI library: my notebook.

In the current post, I'll describe my first Open MPI program. Open MPI is "a Message Passing Interface (MPI) library, a standardized and portable message-passing system to function on a wide variety of parallel computers". My C program takes a list of BAMs, distributes some jobs on the SGE (SUN/Oracle Grid Engine) to count the number of reads and returns the results to a master process.

Downloading and installing

The library was downloaded from http://www.open-mpi.org/software/ompi/v1.6/, compiled with the option --with-sge and installed in '/commun/data/packages/openmpi'.
./configure --prefix=/commun/data/packages/openmpi --with-sge
make 
make install

Configuring SGE to run MPI-based program

As described in http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html .
$ su
$ source /commun/sge_root/bird/common/settings.csh
$ setenv ARCH lx24-amd64
$ qconf -ap orte
And added:
pe_name            orte
slots              448
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
Add orte to the parallele env list:
qconf -mattr queue pe_list "orte" all.q
Using qconf -mconf I've also set shell_start_mode to 'unix_behavior'.

Algorithm

In the 'main' method, test if this is the master(==0) or a slave process(!=0):
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
 {
 master(argc-1,&argv[1]);
 }
else
 {
 slave(argc-1,&argv[1]);
 }

MASTER

The master gets the number of available proc:
MPI_Comm_size(MPI_COMM_WORLD, &proc_count)
Loop over the number of available processors and the BAMS, for each BAM, create a fixed size structure containing the path to the BAM as well as the count of reads initialized to '0':
typedef struct Params_t
 {
 char filename[FILENAME_MAX];
 int is_error;
 long count_total;
 long count_mapped;
 long count_properly_paired;
 long count_dup;
 }Params;
This structure is sent to the slaves:
 MPI_Send(
 &param, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 rank, /* destination process rank */
 TAG_COUNT_BAM, /* user chosen message tag */
 MPI_COMM_WORLD /* default communicator */
 ); 
and we wait for the slaves to return those 'Params' with the filled values.:
MPI_Recv(&param, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 MPI_ANY_SOURCE, /* receive from any sender */
 MPI_ANY_TAG, /* any type of message */
 MPI_COMM_WORLD, /* default communicator */
 &status);
And the result is printed:
printf("%s\t%ld\t%ld\t%ld\t%ld\n",
 param.filename,
 param.count_total,
 param.count_mapped,
 param.count_properly_paired,
 param.count_dup
 );
At the end of the master method, when no more BAM is available, the slaves are released (tag is 'TAG_RELEASE_SLAVE') with :
MPI_Send(
 &empty, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 rank, /* destination process rank */
 TAG_RELEASE_SLAVE, /* SHutdown */
 MPI_COMM_WORLD
 ); /* default communicator */

SLAVE

The slave method receives some messages from the master:
MPI_Recv(
 &param,
 sizeof(Params),
 MPI_CHAR,
 0,
 MPI_ANY_TAG,
 MPI_COMM_WORLD,
 &status);
If the message was 'TAG_RELEASE_SLAVE' we exit the method. Else the BAM file is scanned using the SAmtools API for C.
count_reads(&param);
and the result is sent back to the master:
MPI_Send(
 &param,
 sizeof(Params),
 MPI_CHAR,
 0,
 0,
 MPI_COMM_WORLD
 );
.

Running the Job on SGE

I still don't understand why I need to write the following line:
# qsh -pe orte 4
Your job 84581 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 84581 has been successfully scheduled.
Run the analysis:
qrsh -cwd -pe orte 100 /commun/data/packages/openmpi/bin/mpirun \
 /path/to/ompibam \
 `find /commun/data/projects/folder1234/align -name "*.bam"`
qstat -f shows the jobs running:
arch          states
   ---------------------------------------------------------------------------------
   all.q@node01                   BIP   0/15/64        2.76 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
   ---------------------------------------------------------------------------------
   all.q@node02                   BIP   0/14/64        3.89 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node03                   BIP   0/14/64        3.23 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node04                   BIP   0/14/64        3.68 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node05                   BIP   0/15/64        2.91 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
   ---------------------------------------------------------------------------------
   all.q@node06                   BIP   0/14/64        3.91 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node07                   BIP   0/14/64        3.79 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14 
output:
#filename total mapped properly_paired dup
/commun/data/projects/folder1234/11X0589_markdup.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_pair1_sorted.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_realigned.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_recal.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0602_pair1_sorted.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_realigned.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_markdup.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_pair1_unsorted.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0375_pair1_sorted.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0375_realigned.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0375_markdup.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0367_pair1_sorted.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0367_markdup.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0375_pair1_unsorted.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0367_realigned.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0291_pair1_sorted.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0291_markdup.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0291_realigned.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0311_markdup.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0311_realigned.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0375_recal.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0589_pair1_unsorted.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0367_pair1_unsorted.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0290_pair1_sorted.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0290_realigned.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0291_pair1_unsorted.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0311_pair1_sorted.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0240_pair1_sorted.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0240_realigned.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0240_markdup.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/XX10272_pair1_sorted.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX10272_markdup.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX10272_realigned.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/11X0602_recal.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0290_markdup.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0290_pair1_unsorted.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0247_pair1_sorted.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0456_pair1_sorted.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0456_realigned.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_markdup.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0247_realigned.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0456_markdup.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0367_recal.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0305_pair1_sorted.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0305_markdup.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0305_realigned.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0542_pair1_sorted.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/XX07551_recal.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0542_realigned.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0542_markdup.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0311_pair1_unsorted.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0291_recal.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/XX07551_pair1_sorted.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0240_pair1_unsorted.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/XX10272_pair1_unsorted.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX07551_markdup.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/XX07551_realigned.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0305_pair1_unsorted.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0311_recal.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0456_pair1_unsorted.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_pair1_unsorted.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0542_pair1_unsorted.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/XX07551_pair1_unsorted.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0433_pair1_sorted.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0433_markdup.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/XX10272_recal.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/11X0433_realigned.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0240_recal.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0290_recal.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0456_recal.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_recal.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0433_pair1_unsorted.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0305_recal.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0542_recal.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0433_recal.bam 2178798 1800943 1249738 0
However I noticed seems that running this code on the cluster is slower that running it on the master node...

The C code

Note: the C code is compiled with ortecc, not gcc.


I'm still very new to this API and to those concepts. Feel free to suggest anything :-)
That's it,
Pierre

25 January 2013

Samtools tview as a library to display the BAM

I've forked samtools and modified the code of tview to use it as a library to display the alignments. The original code is an inreractive interface using the ncurses library. I've modified the original code and changed the structure of the C 'struct tview' with a few callbacks to make it more object-oriented:

(...)
typedef struct AbstractTview {
 int mrow, mcol;
   (...)
    khash_t(kh_rg) *rg_hash;
    /* callbacks */
    void (*my_destroy)(struct AbstractTview* );
    void (*my_mvprintw)(struct AbstractTview* ,int,int,const char*,...);
    void (*my_mvaddch)(struct AbstractTview*,int,int,int);
    void (*my_attron)(struct AbstractTview*,int);
    void (*my_attroff)(struct AbstractTview*,int);
    void (*my_clear)(struct AbstractTview*);
    int (*my_colorpair)(struct AbstractTview*,int);
    int (*my_drawaln)(struct AbstractTview*,int,int);
    int (*my_loop)(struct AbstractTview*);
    int (*my_underline)(struct AbstractTview*);
} tview_t;
With those callbacks, there is a strong separation between the 'view' and the 'model'. Tview can now be used as a library and it's now easy to create any kind of view you need by extending the 'struct tview_t'. I've created two new interfaces: one for HTML:


samtools tview -d H examples/sorted.bam  -p seq1:20 
seq1:20
 21        31        41        51        61        71        81                 
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ATGTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTGGSTG
ATGTGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTG
ATGTGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTG
ATGTGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTG
ATGTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGCGCTGTGGACCCTGC       CTGTGGGGGCCGCAGTGGCTG
ATGTGTGGTTTAACTCGT     GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGCTG
ATGTGTGGTTTAACTCGT     GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGGTG
ATGTGTGGTTTAACTCGTCC   GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGCTG
ATGTGTGGTTTAACTCGTCC         CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTG
ATGTGTGGTTTAACTCGTCC         CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTG
TTTTTTGTTTTAACTCTTCTCT       CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTG
TTTTTTGTTTTAACTCTTCTCT        ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               
TTTTTTGTTTTAACTCTTCTCT        ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               
ATGTGTGGTTTAACTCGTCCATGG      ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               
ATGTGTGGTTTAACTCGTCCATGG       TTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGG              
ATGTGTGGTTTAACTCGTCCATGG       TTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGG              
ATGTGTGGTTTAACTCGTCCCTGGCCCA   TTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGG              
ATGTGTGGTTTAACTCGTCCATGGCCCAG   TAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGG            
ATGTGTGGTTTAACTCGTCCCTGGCCCA    TAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGG            
ATGTGTGGTTTAACTCGTCCATGGCCCAG   TAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGG            
ATGTGTGGTTTAACTCGTCCCTGGCCCA           CTGTGGACCCTGCAGCCTGGCTGTGGGGGGCGCCG      
ATGTGTGGTTTAACTCGTCCATGGCCCAG          CTGTGGACCCTGCAGCCTGGCTGTGGGGGGCGCCG      
ATGTGTGGTTTAACTCGTCCATTGCCCAGC         CTGTGGACCCTGCAGCCTGGCTGTGGGGGGCGCCG      
ATGTGTGGTTTAACTCGTCCATTGCCCAGC          TGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
ATGTGTGGTTTAACTCGTCCATTGCCCAGC          TGTGGACCCTGCAGCCTGGCTGGGGGGGGCGCAGT     
atgtgtggtttaactcgtccatggcccagcatt       TGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
atgtgtggtttaactcgtccatggcccagcatt       TGTGGACCCTGCAGCCTGGCTGGGGGGGGCGCAGT     
atgtgtggtttaactcgtccatggcccagcatt       TGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTTGGG   TGTGGACCCTGCAGCCTGGCTGGGGGGGGCGCAGT     
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGG    GTGGACCCTGCAGCCTGGCTGGGGGGGGCACGGGG    
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTTGGG    GTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGG    GTGGACCCTGCAGCCTGGCTGGGGGGGGCACGGGG    
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTTGGG    GTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGG    GTGGACCCTGCAGCCTGGCTGGGGGGGGCACGGGG    
    GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC GTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
    GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC   GGACCCTGCAGCCTGGCTGTGGGGGCCGCTGTGGG  
    GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC   GGACCCTGCAGCCTGGCTGTGGGGGCCGCTGTGGG  
        TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT    CCTGCAGCCTGGCTGTGGGGGCCGCAGCGGGTG
        TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT    CCTGCAGCCTGGCTGTGGGGGCCGCAGCGGGTG
        TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT    CCTGCAGCCTGGCTGTGGGGGCCGCAGCGGGTG
         TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG      GCAGCCTGGCTGTGGGGGCCGCAGTGGCTG
         TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG      GCAGCCTGGCTGTGGGGGCCGCAGTGGCTG
         TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG      GCAGCCTGGCTGTGGGGGCCGCAGTGGCTG
           AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA             CTGTGGGGGCCGCAGTGGGTG
           AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA             CTGTGGGGGCCGCAGTGGCTG
           AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA             CTGTGGGGGCCGCAGTGGCTG
                 TCCATGGCCCAGCATTAGGGCGCTGTGGACCCTGC       CTGTGGGGGCCGCAGTGGGTG
                 TCCATGGCCCAGCATTAGGGCGCTGTGGACCCTGC       CTGTGGGGGCCGCAGTGGCTG
                                           GGACCCTGCAGCCTGGCTGTGGGGGCCGCTGTGGG  
                                                            TGTGGGGGCCGCAGTGGCTG
                                                            TGTGGGGGCCGCAGTGGCTG
                                                            TGTGGGGGCCGCAGTGGCTG

And another one for TEXT (with colors on a terminal):

samtools tview -d t examples/sorted.bam  -p seq1:23

        31        41        51        61        71        81        91          
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
TGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTGGSTGAGG
TGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTGCGG
TGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTGCGG
TGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTGCGG
TGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGCGCTGTGGACCCTGC       CTGTGGGGGCCGCAGTGGCTGAGG
TGTGGTTTAACTCGT     GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGCTGAGG
TGTGGTTTAACTCGT     GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGGTGAGG
TGTGGTTTAACTCGTCC   GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGCTGAGG
TGTGGTTTAACTCGTCC         CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTGAGG
TGTGGTTTAACTCGTCC         CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTGAGG
TTTGTTTTAACTCTTCTCT       CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTGAGG
TTTGTTTTAACTCTTCTCT        ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               AGG
TTTGTTTTAACTCTTCTCT        ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               AGG
TGTGGTTTAACTCGTCCATGG      ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               AGG

The code is available on github:


That's it,

Pierre


13 July 2012

Parsing the Newick format in C using flex and bison.

The following post is my answer for this question on biostar "Newick 2 Json converter".
The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file .
The original question asked for a parser based on perl but here, I've implemented a C parser using flex/bison.


Example:

((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);

A formal grammar for the Newick format is available here
Items in { } may appear zero or more times.
   Items in [ ] are optional, they may appear once or not at all.
   All other punctuation marks (colon, semicolon, parentheses, comma and
         single quote) are required parts of the format.


              tree ==> descendant_list [ root_label ] [ : branch_length ] ;

   descendant_list ==> ( subtree { , subtree } )

           subtree ==> descendant_list [internal_node_label] [: branch_length]
                   ==> leaf_label [: branch_length]

            root_label ==> label
   internal_node_label ==> label
            leaf_label ==> label

                 label ==> unquoted_label
                       ==> quoted_label

        unquoted_label ==> string_of_printing_characters
          quoted_label ==> ' string_of_printing_characters '

         branch_length ==> signed_number
                       ==> unsigned_number

The Flex Lexer

The Flex Lexer is used to extract the terminal tokens of the grammar from the input stream.
Those terminals are '(' ')' ',' ';' ':' , strings and numbers. For the simple and double quoted strings, we tell the lexer to enter in a specific state ( 'apos' and 'quot').

The Bison Scanner

The Bison scanner reads the tokens returned by Flex and implements the grammar.
The simple structure holding the tree is defined in 'struct tree_t'. The code also contains some methods to dump the tree as JSON.

Makefile


Testing

compile:
$ make
bison -d newick.y
flex newick.l
gcc -Wall -O3 newick.tab.c lex.yy.c
lex.yy.c:1265:17: warning: ‘yyunput’ defined but not used [-Wunused-function]
lex.yy.c:1306:16: warning: ‘input’ defined but not used [-Wunused-function]
test:
echo "((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);" | ./a.out

{
    "children": [
        {
            "length": 0.1,
            "children": [
                {
                    "label": "Human",
                    "length": 0.3
                },
                {
                    "label": "Chimpanzee",
                    "length": 0.2
                }
            ]
        },
        {
            "label": "Gorilla",
            "length": 0.3
        },
        {
            "length": 0.2,
            "children": [
                {
                    "label": "Mouse",
                    "length": 0.6
                },
                {
                    "label": "Rat",
                    "length": 0.5
                }
            ]
        }
    ]
}


That's it,

Pierre

28 January 2012

Insert your VCFs in a sqlite database.

vcf2sqlite is C++ tool that is part of my Variation Toolkit.
It inserts a "Variant Call Format document" (VCF) into a sqlite3 database.

Download

Download the sources from Google-Code using subversion:....
svn checkout http://variationtoolkit.googlecode.com/svn/trunk/ variationtoolkit-read-only
... or update the sources of an existing installation...
cd variationtoolkit
svn update
... and edit the variationtoolkit/congig.mk file.

Dependencies

http://www.sqlite.org/ : libraries and headers for sqlite3.

Compilation

Define "SQLITE_LIB" and "SQLITE_CFLAGS" in config.mk (see HowToInstall )
$ cd variationtoolkit/src/
$ make ../bin/vcf2sqlite 

if ! [ -z "$(SQLITE_LIB)" ] ;then g++ -o ../bin/vcf2sqlite vcf2sqlite.cpp xsqlite.cpp application.o -O3 -Wall -lz   ; else g++ -o ../bin/vcf2sqlite vcf2sqlite.cpp  -DNOSQLITE -O3 -Wall  ; fi

Usage

vcf2sqlite -f database.sqlite (file1.vcf file2... | stdin )

Options

  • -f (file) sqlite3 database (REQUIRED).

Schema


Example:

$ vcf2sqlite -f db.sqlite file.vcf
$ sqlite3 -line db.sqlite  "select * from VCFCALL LIMIT 4"

       id = 1
   nIndex = 0
vcfrow_id = 1
sample_id = 1
     prop = GT
    value = 1/1

       id = 2
   nIndex = 1
vcfrow_id = 1
sample_id = 1
     prop = PL
    value = 46,6,0

       id = 3
   nIndex = 2
vcfrow_id = 1
sample_id = 1
     prop = GQ
    value = 10

       id = 4
   nIndex = 0
vcfrow_id = 2
sample_id = 1
     prop = GT
    value = 1/1

$ sqlite3 -column -header  db.sqlite \
   "select SAMPLE.name,VCFCALL.value,count(*) from VCFCALL,SAMPLE where SAMPLE.id=VCFCALL.sample_id and prop='GT' group by SAMPLE.id,VCFCALL.value"

name         value       count(*)  
-----------  ----------  ----------
rmdup_1.bam  0/1         545       
rmdup_1.bam  1/1         429       
rmdup_2.bam  0/1         625       
rmdup_2.bam  1/1         349       
rmdup_3.bam  0/1         595       
rmdup_3.bam  1/1         379       
rmdup_4.bam  0/1         548       
rmdup_4.bam  1/1         426       
rmdup_5.bam  0/1         564       
rmdup_5.bam  1/1         410       
rmdup_6.bam  0/1         724       
rmdup_6.bam  1/1         250
That's it
Pierre

09 September 2011

Customizing "C printf" to print a FASTA sequence.

The GNU C library lets you define your own custom conversion specifiers for printf template strings. In the current post, I will show how to add a new modifier for printf to print a fasta sequence. Say we have a C structure holding a FASTA sequence:

typedef struct fasta_t
	{
	char* name;
	char* sequence;
	}Fasta;
The new output conversion is registered with register_printf_specifier.
register_printf_specifier(
               'W',/* the new modifier */
		/* handler function */
		printf_fasta_handler,
		/* arginfo function */
		printf_fasta_arginfo )

The handler function

"The handler-function is the function called by printf and friends when this conversion appears in a template string... The structure printf_info is used to pass information about the options appearing in an instance of a conversion specifier in a printf template string to the handler and arginfo functions for that specifier.". Here the width is used to restrict the length of the fasta sequence and the precision defines the length of the lines.
#define FPUTC(C) ret+=(fputc(C,stream)==EOF?0:1)
/** handler for printf called by register_printf_function */
static int printf_fasta_handler(
	FILE *stream, /*  stream output */
	const struct printf_info *info, /* information about the various options */
         const void *const *args /* arguments */
	)
	{
	int n=0;
	int ret=0;
	char* p;

	int lline =(info->prec<1?60:info->prec);
	const Fasta* fasta=*((const Fasta**)(args[0]));
	
	if(fasta->name!=NULL)
		{
		FPUTC('>');
		ret+=fputs(fasta->name,stream);
		}
	else
		{
		FPUTC('>');
		}
	
	if(fasta->sequence!=NULL)
		{
		p=(char*)fasta->sequence;
		while(p[n]!=0 && (info->width<1 || n< info->width))
			{
			if(n%lline==0) 
				{
				FPUTC('\n');
				}
			FPUTC(p[n]);
			++n;
			}
		}
	FPUTC('\n');
       	/* returns the number of characters writtern */
	return ret;
	}

The arginfo function

"The arginfo-function is the function called by parse_printf_format when this conversion appears in a template string."
static int printf_fasta_arginfo(
	const struct printf_info *info,
	size_t n,
	int *argtypes,
	int *size)
	{
	 /* We always take exactly one argument and this is a pointer to the
          structure.. */
	 if (n > 0) argtypes[0] = PA_POINTER;
	return 1;
	}

Printing the FASTA sequence:


Fasta record;
(...)
printf("%80W\n",&record);
printf("%200.90W\n",&record);
printf("%200W\n",&record);
printf("%.80W\n",&record);

Compile and run

$ gcc -Wall source.c
$ ./a.out < sequence.fa

>gi|27592135
GGAAGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGT
CTGAGTGACTTCACGGGTCG

>gi|27592135
GGAAGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTCACGGGTCGCCTTTGTGCA
GTACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAA
GGGGAGAAACAGATCCAGAT

>gi|27592135
GGAAGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGT
CTGAGTGACTTCACGGGTCGCCTTTGTGCAGTACTAGATATGCAGCAGACCTATGACATG
TGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAA
GGGGAGAAACAGATCCAGAT

>gi|27592135
GGAAGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTCACGGGTCG
CCTTTGTGCAGTACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCTGGAGAGGGAA
CACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGACTATGCTGACATCATGATGGGCTACCACTGC
TGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTT
CACCTCAGACAGTGACTCCAGCTGCTGGAGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTACCATGCAC
ATGGCTCTGTTTGATCCCAGAAGTGATGACTACTTAGTGGTAAAAACACATTTCCAGACACACAACTTCAGAAAATGAGT
GCAAGCTTCAAGTCTGCCCTTTGTAGCCATAATGTGCTCAGCTCTCGGTCTGCTGAACAGAGTCTACTTGGCTCAATTCT
TGGGGGAATCCCAGATGCTTTATTAGATTGTTTGAATGTCTCACGCCCTCTGAATCAGTGCCTTGAGGTGCCTTCAGAAG
GCTTGTGATGGTTAGNNNTNGCATTTTGGTT

Code


That's it,

Pierre