22 February 2015

Drawing a Manhattan plot in SVG using a GWAS+XML model.

On Friday, I saw my colleague @b_l_k start writing SVG+XML code to draw a Manhattan plot. I told him that a better idea would be to describe the data in XML and to transform that XML into SVG using XSLT.


So, let's do this. I put the XSLT stylesheet on github at https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/manhattan.xsl . The data model looks like this (I took the data from @genetics_blog's http://www.gettinggeneticsdone.com/2011/04/annotated-manhattan-plots-and-qq-plots.html ) .


<?xml version="1.0"?>
<manhattan>
  <chromosome name="1" length="1500">
    <data rs="1" position="1" p="0.914806043496355"/>
    <data rs="2" position="2" p="0.937075413297862"/>
    <data rs="3" position="3" p="0.286139534786344"/>
    <data rs="4" position="4" p="0.830447626067325"/>
    <data rs="5" position="5" p="0.641745518893003"/>
 (...)
  </chromosome>
  <chromosome name="22" length="535">
    <data rs="15936" position="1" p="0.367785102687776"/>
    <data rs="15937" position="2" p="0.628192085539922"/>
(...)
    <data rs="1" position="1" p="0.914806043496355"/>
  </chromosome>
</manhattan>
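As a side note, a model like this is easy to generate from tabular GWAS results. Here is a rough Python sketch (the `to_manhattan_xml` helper and its record layout are my own illustration, not part of the post):

```python
# Hypothetical helper: build the <manhattan> XML model above from
# (chromosome, rs, position, p) records sorted by chromosome.
import xml.etree.ElementTree as ET
from itertools import groupby

def to_manhattan_xml(records):
    root = ET.Element("manhattan")
    for chrom, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        # use the highest SNP position as the chromosome length
        c = ET.SubElement(root, "chromosome",
                          name=chrom, length=str(max(r[2] for r in group)))
        for _, rs, pos, p in group:
            ET.SubElement(c, "data", rs=str(rs), position=str(pos), p=str(p))
    return ET.tostring(root, encoding="unicode")

xml = to_manhattan_xml([("1", 1, 1, 0.914806), ("1", 2, 2, 0.937075)])
print(xml)
```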

The stylesheet


At the beginning, we declare the size of the drawing:


<xsl:variable name="width" select="number(1000)"/>
<xsl:variable name="height" select="number(400)"/>

We need to get the total size of the genome.


<xsl:variable name="genomeSize">
    <xsl:call-template name="sumLengthChrom">
      <xsl:with-param name="length" select="number(0)"/>
      <xsl:with-param name="node" select="manhattan/chromosome[1]"/>
    </xsl:call-template>
</xsl:variable>

We could use the XPath function 'sum()', but here the user is free to omit the size of the chromosome. If the attribute '@length' is not declared, we use the maximum SNP position in that chromosome:


<xsl:template name="sumLengthChrom">
    <xsl:param name="length"/>
    <xsl:param name="node"/>
    <xsl:variable name="chromlen">
      <xsl:apply-templates select="$node" mode="length"/>
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="count($node/following-sibling::chromosome)&gt;0">
        <xsl:call-template name="sumLengthChrom">
          <xsl:with-param name="length" select="$length + number($chromlen)"/>
          <xsl:with-param name="node" select="$node/following-sibling::chromosome[1]"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$length + number($chromlen)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
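The recursive template above is just a sum with a fallback. In Python (a sketch of mine, not part of the stylesheet), the same logic reads:

```python
# Total genome size: use @length when present, otherwise fall back to
# the maximum data/@position in that chromosome.
import xml.etree.ElementTree as ET

def chromosome_length(chrom):
    if chrom.get("length") is not None:
        return int(chrom.get("length"))
    return max(int(d.get("position")) for d in chrom.findall("data"))

def genome_size(root):
    return sum(chromosome_length(c) for c in root.findall("chromosome"))

doc = ET.fromstring(
    '<manhattan>'
    '<chromosome name="1" length="1500"><data rs="1" position="1" p="0.9"/></chromosome>'
    '<chromosome name="22"><data rs="2" position="535" p="0.5"/></chromosome>'
    '</manhattan>')
print(genome_size(doc))  # 1500 + 535 = 2035
```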

We get the smallest p-value:


<xsl:variable name="minpvalue">
    <xsl:for-each select="manhattan/chromosome/data">
      <xsl:sort select="@p" data-type="number" order="ascending"/>
      <xsl:if test="position() = 1">
        <xsl:value-of select="number(@p)"/>
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>

Then we plot each chromosome. The XSL parameter "previous" contains the number of bases already printed.
We use the SVG attribute 'transform' to translate the current chromosome in the drawing:


<xsl:template name="plotChromosomes">
    <xsl:param name="previous"/>
    <xsl:param name="node"/>
(...)
      <xsl:attribute name="transform">
        <xsl:text>translate(</xsl:text>
        <xsl:value-of select="(number($previous) div number($genomeSize)) * $width"/>
        <xsl:text>,0)</xsl:text>
      </xsl:attribute>

We plot each SNP:


<svg:g style="fill-opacity:0.5;">
        <xsl:apply-templates select="data" mode="plot"/>
      </svg:g>

and we plot the remaining chromosomes, if any:


<xsl:if test="count($node/following-sibling::chromosome)&gt;0">
      <xsl:call-template name="plotChromosomes">
        <xsl:with-param name="previous" select="$previous + number($chromlen)"/>
        <xsl:with-param name="node" select="$node/following-sibling::chromosome[1]"/>
      </xsl:call-template>
    </xsl:if>
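Recursion is how XSLT 1.0 loops. Written as a plain loop (a Python sketch of mine, for illustration), the translate offsets are simply the cumulative fraction of the genome already drawn:

```python
# Horizontal offset of each chromosome, as used in transform="translate(...)":
# (bases already drawn / genome size) * drawing width.
def chromosome_offsets(lengths, width=1000.0):
    genome = float(sum(lengths))
    offsets, previous = [], 0
    for length in lengths:
        offsets.append((previous / genome) * width)
        previous += length
    return offsets

print([round(o, 1) for o in chromosome_offsets([1500, 535])])  # [0.0, 737.1]
```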

To plot each SNP, we get the X coordinate within the current chromosome:


<xsl:template match="data" mode="x">
    <xsl:variable name="chromWidth">
      <xsl:apply-templates select=".." mode="width"/>
    </xsl:variable>
    <xsl:variable name="chromLength">
      <xsl:apply-templates select=".." mode="length"/>
    </xsl:variable>
    <xsl:value-of select="(number(@position) div number($chromLength)) * $chromWidth"/>
  </xsl:template>

and the Y coordinate:


<xsl:template match="data" mode="y">
    <xsl:value-of select="$height - (( (math:log(number(@p)) * -1 ) div $maxlog2value ) * $height )"/>
  </xsl:template>
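Both templates boil down to linear scaling. A Python sketch of the same math (the function names are mine; `maxlog` stands in for the stylesheet's `$maxlog2value`, i.e. -log of the smallest p-value):

```python
import math

WIDTH, HEIGHT = 1000.0, 400.0

def x_coord(position, chrom_length, chrom_width):
    # position inside the chromosome, scaled to the chromosome's pixel width
    return (position / chrom_length) * chrom_width

def y_coord(p, maxlog):
    # smaller p-values (higher -log(p)) are drawn closer to the top (y=0)
    return HEIGHT - ((-math.log(p)) / maxlog) * HEIGHT

print(x_coord(750, 1500, 100))                   # 50.0
print(round(y_coord(0.01, -math.log(1e-4)), 1))  # 200.0
```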

We can also wrap the data in a hyperlink if an @rs attribute exists:


<xsl:choose>
      <xsl:when test="@rs">
        <svg:a target="_blank">
          <xsl:attribute name="xlink:href">
            <xsl:value-of select="concat('http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=',@rs)"/>
          </xsl:attribute>
          <xsl:apply-templates select="." mode="plotshape"/>
        </svg:a>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="." mode="plotshape"/>
      </xsl:otherwise>
    </xsl:choose>

We plot the SNP itself as a circle:


<xsl:template match="data" mode="plotshape">
    <svg:circle r="5">
      <xsl:attribute name="cx">
        <xsl:apply-templates select="." mode="x"/>
      </xsl:attribute>
      <xsl:attribute name="cy">
        <xsl:apply-templates select="." mode="y"/>
      </xsl:attribute>
    </svg:circle>
  </xsl:template>

Result:


$ xsltproc manhattan.xsl model.xml > plot.svg

Plot


That's it,
Pierre

18 February 2015

Automatic code generation for @knime with XSLT: An example with two nodes: fasta reader and writer.



KNIME is a java+eclipse-based graphical workflow-manager.


Biologists in my lab often use this tool to filter VCFs or other tabular data. A Software Development Kit (SDK) is provided to build new nodes. My main problem with this SDK is that you need to write a large number of similar files and you also have to interact with a graphical interface. I wanted to automate the generation of the java code for new nodes. In the following post I will describe how I wrote two new nodes for reading and writing fasta files.


The nodes are described in an XML file and the java code is generated with an XSLT stylesheet; it is available on github at:



Example


We're going to create two nodes for FASTA:


  • a fasta reader
  • a fasta writer

We define a plugin.xml file; it uses XInclude to include the definition of the two nodes. The base package will be com.github.lindenb.xsltsandbox . The nodes will be displayed in the knime-workbench under /community/bio/fasta


<?xml version="1.0" encoding="UTF-8"?>
<plugin xmlns:xi="http://www.w3.org/2001/XInclude"
        package="com.github.lindenb.xsltsandbox"
        >
  <category name="bio">
    <category name="fasta" label="Fasta">
      <xi:include href="node.read-fasta.xml"/>
      <xi:include href="node.write-fasta.xml"/>
    </category>
  </category>
</plugin>

node.read-fasta.xml : it takes a FileReader (for the input fasta file) and an integer to limit the number of fasta sequences to be read. The output will be a table with two columns (name/sequence). We only write the code for reading the fasta file.


<?xml version="1.0" encoding="UTF-8"?>
<node name="readfasta" label="Read Fasta" description="Reads a Fasta file">
  <property type="file-read" name="fastaIn">
    <extension>.fa</extension>
    <extension>.fasta</extension>
    <extension>.fasta.gz</extension>
    <extension>.fa.gz</extension>
  </property>
  <property type="int" name="limit" label="max sequences" description="number of sequences to be fetched. 0 = ALL" default="0">
  </property>
  <property type="bool" name="upper" label="Uppercase" description="Convert to Uppercase" default="false">
  </property>
  <outPort name="output">
    <column name="title" label="Title" type="string"/>
    <column name="sequence" label="Sequence" type="string"/>
  </outPort>
  <code>
    <import>
      import java.io.*;
      </import>
    <body>
    @Override
    protected BufferedDataTable[] execute(final BufferedDataTable[] inData, final ExecutionContext exec) throws Exception
        {
        int limit = this.getPropertyLimitValue();
        String url = this.getPropertyFastaInValue();
        boolean to_upper = this.getPropertyUpperValue();
        getLogger().info("reading "+url);
        java.io.BufferedReader r= null;
        int n_sequences = 0;
        try
            {
            r = this.openUriForBufferedReader(url);

            DataTableSpec dataspec0 = this.createOutTableSpec0();
            BufferedDataContainer container0 = exec.createDataContainer(dataspec0);

            String seqname="";
            StringBuilder sequence=new StringBuilder();
            for(;;)
                {
                exec.checkCanceled();
                exec.setMessage("Sequences "+n_sequences);
                String line= r.readLine();
                if(line==null || line.startsWith("&gt;"))
                    {
                    if(!(sequence.length()==0 &amp;&amp; seqname.trim().isEmpty()))
                        {
                        container0.addRowToTable(new org.knime.core.data.def.DefaultRow(
                            org.knime.core.data.RowKey.createRowKey(n_sequences),
                            this.createDataCellsForOutTableSpec0(seqname,sequence)));
                        ++n_sequences;
                        }
                    if(line==null) break;
                    if( limit!=0 &amp;&amp; limit==n_sequences) break;
                    seqname=line.substring(1);
                    sequence=new StringBuilder();
                    }
                else
                    {
                    line= line.trim();
                    if( to_upper ) line= line.toUpperCase();
                    sequence.append(line);
                    }
                }
            container0.close();
            BufferedDataTable out0 = container0.getTable();
            return new BufferedDataTable[]{out0};
            }
        finally
            {
            r.close();
            }
        }
      </body>
  </code>
</node>
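Stripped of the KNIME API, the reading loop in the &lt;body&gt; above follows this shape (a plain-Python sketch of mine; `read_fasta` is not part of the generated code):

```python
# Accumulate (title, sequence) rows from FASTA lines, honouring the
# 'limit' and 'upper' node properties, as in the Java loop above.
def read_fasta(lines, limit=0, upper=False):
    rows, name, seq = [], "", []
    for line in list(lines) + [None]:  # None acts as the EOF sentinel
        if line is None or line.startswith(">"):
            if seq or name.strip():
                rows.append((name, "".join(seq)))
            if line is None or (limit != 0 and len(rows) == limit):
                break
            name, seq = line[1:], []
        else:
            line = line.strip()
            if upper:
                line = line.upper()
            seq.append(line)
    return rows

print(read_fasta([">s1", "acgt", "ac", ">s2", "ttt"], upper=True))
# [('s1', 'ACGTAC'), ('s2', 'TTT')]
```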

node.write-fasta.xml : it needs an input dataTable with two columns (name/sequence), an integer to set the length of the lines, and a file-writer to write the fasta file.


<?xml version="1.0" encoding="UTF-8"?>
<node name="writefasta" label="Write Fasta" description="Write a Fasta file">
  <inPort name="input">
  </inPort>
  <property type="file-save" name="fastaOut">
  </property>
  <property type="column" name="title" label="Title" description="Fasta title" data-type="string">
  </property>
  <property type="column" name="sequence" label="Sequence" description="Fasta Sequence" data-type="string">
  </property>
  <property type="int" name="fold" label="Fold size" description="Fold sequences greater than..." default="60">
  </property>
  <code>
    <import>
      import org.knime.core.data.container.CloseableRowIterator;
      import java.io.*;
      </import>
    <body>
    @Override
    protected BufferedDataTable[] execute(final BufferedDataTable[] inData, final ExecutionContext exec) throws Exception
        {
        CloseableRowIterator iter=null;
        BufferedDataTable inTable=inData[0];
        int fold = this.getPropertyFoldValue();
        int tIndex = this.findTitleRequiredColumnIndex(inTable.getDataTableSpec());
        int sIndex = this.findSequenceRequiredColumnIndex(inTable.getDataTableSpec());
        PrintWriter w =null;
        try
            {
            w= openFastaOutForPrinting();

            int nRows=0;
            double total=inTable.getRowCount();
            iter=inTable.iterator();
            while(iter.hasNext())
                {
                DataRow row=iter.next();
                DataCell tCell =row.getCell(tIndex);

                DataCell sCell =row.getCell(sIndex);

                w.print("&gt;");
                if(!tCell.isMissing())
                    { 
                    w.print(StringCell.class.cast(tCell).getStringValue());
                    }
                if(!sCell.isMissing())
                    {
                    String sequence = StringCell.class.cast(sCell).getStringValue();
                    for(int i=0;i&lt;sequence.length();++i)
                        {
                        if(i%fold == 0) w.println();
                        w.print(sequence.charAt(i));
                        exec.checkCanceled();
                        }
                    }
                w.println();

                exec.checkCanceled();
                exec.setProgress(nRows/total,"Saving Fasta");
                ++nRows;
                }
            w.flush();
            return new BufferedDataTable[0];
            }
        finally
            {
            if(w!=null) w.close();
            }
        }
      </body>
  </code>
</node>
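The folding trick is worth noting: a newline is emitted whenever `i % fold == 0`, which at `i == 0` also conveniently terminates the title line. A plain-Python sketch of mine (not KNIME API):

```python
# Write one FASTA record, folding the sequence every 'fold' characters,
# with the same modulo trick as the Java loop above.
def write_fasta_record(title, sequence, fold=60):
    out = [">", title]
    for i, base in enumerate(sequence):
        if i % fold == 0:
            out.append("\n")   # also ends the title line when i == 0
        out.append(base)
    out.append("\n")
    return "".join(out)

print(write_fasta_record("seq1", "ACGTACGTAC", fold=4), end="")
# >seq1
# ACGT
# ACGT
# AC
```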

The following Makefile generates the code, compiles it, and installs the new plugin in the ${knime.root}/plugins directory:


.PHONY:all clean install run

knime.root=${HOME}/package/knime_2.11.2

all: install

run: install
        ${knime.root}/knime -clean

install:
        rm -rf generated
        xsltproc --xinclude \
                --stringparam base.dir generated \
                knime2java.xsl plugin.xml
        $(MAKE) -C generated install knime.root=${knime.root}


clean:
        rm -rf generated

The code generated by this Makefile:

$ find generated/ -type f
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/ReadfastaNodeFactory.xml
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/ReadfastaNodePlugin.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/ReadfastaNodeFactory.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/ReadfastaNodeDialog.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/AbstractReadfastaNodeModel.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/readfasta/ReadfastaNodeModel.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/WritefastaNodeFactory.xml
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/WritefastaNodePlugin.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/WritefastaNodeFactory.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/WritefastaNodeDialog.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/AbstractWritefastaNodeModel.java
generated/src/com/github/lindenb/xsltsandbox/bio/fasta/writefasta/WritefastaNodeModel.java
generated/src/com/github/lindenb/xsltsandbox/CompileAll__.java
generated/src/com/github/lindenb/xsltsandbox/AbstractNodeModel.java
generated/MANIFEST.MF
generated/Makefile
generated/plugin.xml
generated/dist/com_github_lindenb_xsltsandbox.jar
generated/dist/com.github.lindenb.xsltsandbox_2015.02.18.jar

The file generated/dist/com.github.lindenb.xsltsandbox_2015.02.18.jar is the file to move to ${knime.root}/plugins


(At the time of writing I put the jar at http://cardioserve.nantes.inserm.fr/~lindenb/knime/fasta/ )


Open KNIME: the new nodes are now displayed in the Node Repository.


Node Repository


You can now use the nodes; the code is displayed in the documentation:


Workbench



That's it,

Pierre


02 February 2015

Listing the 'Subject' Sequences in a BLAST database using the NCBI C++ toolbox. My notebook.

In my previous post (http://plindenbaum.blogspot.com/2015/01/filtering-fasta-sequences-using-ncbi-c.html) I built an application filtering FASTA sequences using the
NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/). Here, I'm going to write a tool listing the 'subject' sequences in a BLAST database.

This new application ListBlastDatabaseContent takes only one argument '-db', the name of the blast database :

void ListBlastDatabaseContent::Init()
    {
    (...)
    /* argument for name */
    arg_desc->AddDefaultKey(
            "db",/* name */
            "database",/* synopsis */
            "Blast database",/* comment */
            CArgDescriptions::eString, /* argument type */
            "" /* default value*/
            );
    (...)
    }   

In ListBlastDatabaseContent::Run(), we get the name of the database:

(...)
string database_name(  this->GetArgs()["db"].AsString() );
(...)

and we open a new CSearchDatabase object ("Blast Search Subject"):

CSearchDatabase* searchDatabase = new CSearchDatabase(
    database_name, /* db name */
    CSearchDatabase::eBlastDbIsNucleotide /* db type */
    );

We get a handle to the seqdb, "it defines access to the database component by calling methods on objects which represent the various database files, such as the index, header, sequence, and alias files".

CRef<CSeqDB> seqdb= searchDatabase->GetSeqDb();

We get an iterator to the seqdb database http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDBIter.html:

CSeqDBIter  iter = seqdb->Begin();

While this iterator is valid, we retrieve and print the DNA sequence and the header associated with the iterator:

while(iter)
    {
    string output;
    TSeqRange range; /* entire sequence */
    seqdb->GetSequenceAsString(iter.GetOID(),output,range); 
    CRef< CBlast_def_line_set > def_line= seqdb->GetHdr(iter.GetOID());
    CBlast_def_line_set:: Tdata::const_iterator r = def_line->Get().begin();
    while(r!= def_line->Get().end())
        {
        cout << ">" << (*r)->GetTitle()  << endl;
        cout << output << endl;
        ++r;
        }
    ++iter;
    }
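Note that when a subject carries several deflines, the inner loop writes the same sequence once per title. Schematically (a plain-Python sketch of mine, with records standing in for OIDs, not the NCBI API):

```python
# One record per OID: a list of defline titles plus the subject sequence.
# Each defline yields its own FASTA entry, repeating the sequence.
def dump_database(records):
    lines = []
    for titles, sequence in records:
        for title in titles:
            lines.append(">" + title)
            lines.append(sequence)
    return "\n".join(lines)

print(dump_database([(["1"], "GATC"), (["6", "6b"], "CAGA")]))
# >1
# GATC
# >6
# CAGA
# >6b
# CAGA
```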

All in one:

#include <memory>
#include <limits>
#include "corelib/ncbiapp.hpp"
#include <algo/blast/api/blast_types.hpp>
#include <algo/blast/api/sseqloc.hpp>
#include <algo/blast/api/blast_aux.hpp>
#include <algo/blast/api/blast_options_handle.hpp>
#include <algo/blast/api/blast_results.hpp>
#include <algo/blast/api/local_blast.hpp>
#include <ctype.h>



USING_NCBI_SCOPE;
using namespace blast;

/**
 *  Name
 *      ListBlastDatabaseContent
 *  Motivation:
 *      Print content of blast database
 *  Author:
 *      Pierre Lindenbaum PhD 2015
 *
 */
class ListBlastDatabaseContent : public CNcbiApplication /* extends a basic NCBI application defined in c++/include/corelib/ncbiapp.hpp */
    {
    public:

        /* constructor, just set the version to '1.0.0'*/
        ListBlastDatabaseContent()
            {
            CRef<CVersion> version(new CVersion());
            version->SetVersionInfo(1, 0, 0);
            this->SetFullVersion(version);
            }
            
            
        /* called to initialize the program.
        *  The default behavior of this in 'CNcbiApplication' is "do nothing".
        *  Here, we set the command line arguments.
        */
        virtual void Init()
            {
            CArgDescriptions* arg_desc = new CArgDescriptions ; /* defined in /c++/include/corelib/ncbiargs.hpp */
            arg_desc->SetUsageContext(
                GetArguments().GetProgramBasename(),
                __FILE__
                );
            
         
            /* argument for name */
           arg_desc->AddDefaultKey(
                    "db",/* name */
                    "database",/* synopsis */
                    "Blast database",/* comment */
                    CArgDescriptions::eString, /* argument type */
                    "" /* default value*/
                    );
            
            /* push this command args */
            SetupArgDescriptions( arg_desc );
            }   
        
        /* class destructor */
        virtual ~ListBlastDatabaseContent()
            {
            }
        
        /* workhorse of the program */
        virtual int Run()
            {
            string database_name(  this->GetArgs()["db"].AsString() );
            /* Blast Search Subject.  http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSearchDatabase.html#details */
            CSearchDatabase* searchDatabase = 0;
            
        
            /* initialize search database */
            searchDatabase = new CSearchDatabase(
                database_name, /* db name */
                CSearchDatabase::eBlastDbIsNucleotide /* db type */
                );
            
            /* This class provides the top-level interface class for BLAST database users. It 
            defines access to the database component by calling methods on objects which represent
            the various database files, such as the index, header, sequence, and alias files. 
            http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDB.html */
            CRef<CSeqDB> seqdb= searchDatabase->GetSeqDb();                 
            
            /* Small class to iterate over a seqdb database. http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDBIter.html#details */
            CSeqDBIter  iter = seqdb->Begin();
            
            while(iter)
                {
                /* we put the sequence here */
                 string output;
                 /* retrieve the entire sequence */
                 TSeqRange range;
                /* This method gets the sequence data for an OID, converts it to a human-readable encoding (either Iupacaa for protein, or Iupacna for nucleotide), and returns it in a string. This is equivalent to calling the three-argument versions of this method with those encodings. http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCSeqDB.html */
                seqdb->GetSequenceAsString(iter.GetOID(),output,range); 
                /* Represents ASN.1 type Blast-def-line-set defined in file blastdb.asn http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/classCBlast__def__line__set.html */
                CRef< CBlast_def_line_set > def_line= seqdb->GetHdr(iter.GetOID());
                CBlast_def_line_set:: Tdata::const_iterator r = def_line->Get().begin();
                while(r!= def_line->Get().end())
                    {
                    cout << ">" << (*r)->GetTitle()  << endl;
                    cout << output << endl;
                    ++r;
                    }

                ++iter;
                }
            if( searchDatabase != 0) delete searchDatabase;
            return 0;
            }
    };

int main(int argc,char** argv)
    {
    return ListBlastDatabaseContent().AppMain(
        argc,
        argv,
        0, /* envp Environment pointer */
        eDS_ToStderr,/* log message. In /c++/include/corelib/ncbidiag.hpp  */
        0, /* Specify registry to load, as per LoadConfig() */
        "loadblastdb" /* Specify application name */
        );
    }

Testing

Get some fasta sequences, replace the headers with line numbers, and create a blast database:

$ curl  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=281381068,281381067,281381066,281381065&rettype=fasta" |\
        awk '/^>/ {printf(">%d\n",NR);next;} {print;}' > sequences.fa

$ makeblastdb -in sequences.fa -dbtype nucl

Run the program:

$ loadblastdb -db sequences.fa
>1
GATCGCGGGTATCTTCGATGTAGGGCGAGTCGCGTCACCCTGACCAACCCCTCGAGGTGGTCCTCGTGGAAGGAGGAAGCGCAGTCGACGAAGCGGAAGATGTGGCTGAGCCATCTCGTCGTGTCGGCCGCGAGACGACCCGCGAGTTTCTGCGCGTGTGACTTTGGTAGCGCTACTGCGTAAGCCGAGA
>6
CAGAAACTCGCGGGTCGTCTCGCGGCCGACACGACGAGATGGCTCAGCCACATCTTCCGCTTCGTCGACTGCGCTTCCTCCTTCCACGAGGACCACCTCGAGGGGTTGGTCAGGGTGACGCGACTC
>10
ACCGAAACGGCAAAATTGGCCTTAGTCACGCACACTATCTCGGCTTACGCAGTAGCGCTACCAAAGTCACACGCGCAGAAACTCGCGGGTCGTCTCGCGGCCGACACGACGAGATGGCTCAGCCACATCTTCCGCTTCGTCGACTGCGCTTCCTCCTTCCACGAGGACCACCTCGAGGGGTTGGTCAGGGTGACGCGACTCGCCCTACATCGAAGATACCCGCGATC
>16
GATCGAACGATAAACTTCTGCTATCGGTTCCTCAGCAATCAGTATAACGAGCGGCACATGTATTGCGGCATATTGCAAAAAACATACCTATAAAGAAGGCTAAACTGATTGATTCAAATTAACTTGAGAATACACGATGTTGATGTTCAGTCGAAACAAATCGCTCGCGTATTGTGATC

That's it,
Pierre

30 January 2015

Filtering Fasta Sequences using the #NCBI C++ API. My notebook.

In my previous post (http://plindenbaum.blogspot.com/2015/01/compiling-c-hello-world-program-using.html) I built a simple "Hello World" application using the
NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/). Here, I'm going to extend the code to create a program filtering FASTA sequences on their sizes.

This new application FastaFilterSize needs three new arguments:

  • '-m' min size of the fasta sequence
  • '-M' max size of the fasta sequence
  • '-v' inverse selection

The method FastaFilterSize::Init() is used to register those optional arguments:

/* optional argument for min size */
arg_desc->AddOptionalKey(
    "m",
    "min_size",
    "min size inclusive",
    CArgDescriptions::eInteger
    );
/* optional argument for max size */
arg_desc->AddOptionalKey(
    "M",
    "max_size",
    "max size inclusive",
    CArgDescriptions::eInteger
    );
/* add simple flag to inverse selection */
arg_desc->AddFlag("v", 
    "inverse selection",
    true
    );

The FastaFilterSize::Run() method is the workhorse of the program. Here, we get the output stream that will be used by a CFastaOstream to print the filtered Fasta sequences:

(...)
/* get output stream */
CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
/* create a FASTA writer */
CFastaOstream* writer = new CFastaOstream(out);
(...)

We get the input stream that will be used by a CFastaReader to read the FASTA sequences to be filtered.

/* get input stream */
CNcbiIstream& input_stream  = this->GetArgs()["i"].AsInputFile();
CFastaReader reader(input_stream ,
        CFastaReader::fNoParseID /* http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/fasta_8hpp_source.html */
        );

We get the optional values for min-length and max-length:

if (this->GetArgs()["m"])
    {
    min_inclusive = this->GetArgs()["m"].AsInteger();
    }
if (this->GetArgs()["M"])
    {
    max_inclusive = this->GetArgs()["M"].AsInteger();
    }

We get the 'inverse' flag.

bool invert=this->GetArgs()["v"];

While a Fasta sequence can be read, we check its size and print it if needed:

while( !reader.AtEOF() )
    {
    /* read next sequence */
    CRef<CSeq_entry> seq(reader.ReadOneSeq());
    /* should we print the sequence ?*/
    if((seq->GetSeq().GetLength()>= min_inclusive && 
        seq->GetSeq().GetLength()<= max_inclusive
        )!= invert)
        {
        /* yes, print */
        writer->Write(*seq);
        }
    }
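The `!= invert` comparison is a compact XOR: the in-range test is flipped when '-v' was given. The predicate in isolation (a Python sketch of mine):

```python
# Keep a sequence when its length is within [min, max], XOR-ed with
# the invert flag - the same '!=' trick as in the C++ code above.
def keep(length, min_inclusive=0, max_inclusive=float("inf"), invert=False):
    return (min_inclusive <= length <= max_inclusive) != invert

print([l for l in (5, 50, 500) if keep(l, 10, 100)])               # [50]
print([l for l in (5, 50, 500) if keep(l, 10, 100, invert=True)])  # [5, 500]
```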

All in one:

#include <memory>
#include <limits>
#include <objtools/readers/fasta.hpp>
#include "corelib/ncbiapp.hpp"
#include <objects/seq/Bioseq.hpp>
#include <objects/seq/Bioseq.hpp>
#include <objects/seq/Delta_ext.hpp>
#include <objects/seq/Delta_seq.hpp>
#include <objects/seq/NCBIeaa.hpp>
#include <objects/seq/IUPACaa.hpp>
#include <objects/seq/IUPACna.hpp>
#include <objects/seq/Seg_ext.hpp>
#include <objects/seq/Seq_annot.hpp>
#include <objects/seq/Seq_descr.hpp>
#include <objects/seq/Seq_ext.hpp>
#include <objects/seq/Seq_hist.hpp>
#include <objects/seq/Seq_inst.hpp>
#include <objects/seq/Seq_literal.hpp>
#include <objects/seq/Seqdesc.hpp>
#include <objects/seq/seqport_util.hpp>
#include <objects/seqalign/Dense_seg.hpp>
#include <objects/seqalign/Seq_align.hpp>
#include <objects/seqloc/Seq_id.hpp>
#include <objects/seqloc/Seq_interval.hpp>
#include <objects/seqloc/Seq_loc.hpp>
#include <objects/seqloc/Seq_loc_mix.hpp>
#include <objects/seqloc/Seq_point.hpp>
#include <objects/seqset/Bioseq_set.hpp>
#include <objects/seqset/Seq_entry.hpp>
#include <objtools/readers/message_listener.hpp>
#include <objtools/readers/line_error.hpp>
#include <objtools/seqmasks_io/mask_reader.hpp>
#include <objtools/seqmasks_io/mask_writer.hpp>
#include <objtools/seqmasks_io/mask_fasta_reader.hpp>
#include <objmgr/util/sequence.hpp>
#include <ctype.h>

USING_NCBI_SCOPE;
USING_SCOPE(objects);

/**
 * FastaFilterSize
 *    
 *  Motivation:
 *      filter sequences on size
 *  Author:
 *      Pierre Lindenbaum PhD 2015
 *
 */
class FastaFilterSize : public CNcbiApplication /* extends a basic NCBI application defined in c++/include/corelib/ncbiapp.hpp */
    {
    public:

        /* constructor, just set the version to '1.0.0'*/
        FastaFilterSize()
            {
            CRef<CVersion> version(new CVersion());
            version->SetVersionInfo(1, 0, 0);
            this->SetFullVersion(version);
            }
            
            
        /* called to initialize the program.
        *  The default behavior of this in 'CNcbiApplication' is "do nothing".
        *  Here, we set the command line arguments.
        */
        virtual void Init()
            {
            CArgDescriptions* arg_desc = new CArgDescriptions ; /* defined in /c++/include/corelib/ncbiargs.hpp */
            arg_desc->SetUsageContext(
                GetArguments().GetProgramBasename(),
                "A program filtering fasta sequences on size"
                );
            /* argument for input file */
            arg_desc->AddDefaultKey(
                    "i",/* name */
                    "input_file_name",/* synopsis */
                    "input file name",/* comment */
                    CArgDescriptions::eInputFile, /* argument type */
                    "-" /* default value*/
                    );
           /* argument for output file */
           arg_desc->AddDefaultKey(
                    "o",/* name */
                    "output_file_name",/* synopsis */
                    "output file name",/* comment */
                    CArgDescriptions::eOutputFile, /* argument type */
                    "-" /* default value*/
                    );
            /* optional argument for min size */
            arg_desc->AddOptionalKey(
                    "m",
                    "min_size",
                    "min size inclusive",
                    CArgDescriptions::eInteger
                    );
            /* optional argument for max size */
            arg_desc->AddOptionalKey(
                    "M",
                    "max_size",
                    "max size inclusive",
                    CArgDescriptions::eInteger
                    );
            
            /* add simple flag to inverse selection */
            arg_desc->AddFlag("v", 
                "inverse selection",
                true
                );
            
            /* push this command args */
            SetupArgDescriptions( arg_desc );
            }   
        
        /* class destructor */
        virtual ~FastaFilterSize()
            {
            }
        
        /* workhorse of the program */
        virtual int Run()
            {
            /* get output stream */
            CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
            /* create a FASTA writer */
            CFastaOstream* writer = new CFastaOstream(out);
            /* shall we inverse the selection ? */
            bool invert=this->GetArgs()["v"];

            /* get input stream */
            CNcbiIstream& input_stream  = this->GetArgs()["i"].AsInputFile();
            /* init min and max inclusive */
            TSeqPos min_inclusive=0;
            TSeqPos max_inclusive= std::numeric_limits<TSeqPos>::max();
            /* if min and max set by user */
            if (this->GetArgs()["m"])
                {
                min_inclusive = this->GetArgs()["m"].AsInteger();
                }
            if (this->GetArgs()["M"])
                {
                max_inclusive = this->GetArgs()["M"].AsInteger();
                }
            /* create a FASTA parser */
            CFastaReader reader(input_stream ,
                    CFastaReader::fNoParseID /* http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/fasta_8hpp_source.html */
                    );
            /* while we can read something */
            while( !reader.AtEOF() )
                {
                /* read next sequence */
                CRef<CSeq_entry> seq(reader.ReadOneSeq());
                
                /* should we print the sequence ? the '!= invert' flips the test when -v is set */
                if((seq->GetSeq().GetLength()>= min_inclusive && 
                    seq->GetSeq().GetLength()<= max_inclusive
                    )!= invert)
                    {
                    /* yes, print */
                    writer->Write(*seq);
                    }
                }
            /* cleanup */
            delete writer;
            out.flush();
            return 0;
            }
    };

int main(int argc,char** argv)
    {
    return FastaFilterSize().AppMain(
        argc,
        argv,
        0, /* envp Environment pointer */
        eDS_ToStderr,/* log message. In /c++/include/corelib/ncbidiag.hpp  */
        0, /* Specify registry to load, as per LoadConfig() */
        "filterfasta" /* Specify application name */
        );
    }

Compiling

g++ -std=gnu++11 -finline-functions -fstrict-aliasing -fomit-frame-pointer \
  -I/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/include \
  -I/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/inc \
  -Wno-deprecated -Wno-deprecated-declarations -o filtersize filtersize.cpp \
  -L/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib \
  -lblast_app_util -lblastinput-static -lncbi_xloader_blastdb-static \
  -lncbi_xloader_blastdb_rmt-static -lxblastformat-static -lalign_format-static \
  -ltaxon1-static -lblastdb_format-static -lgene_info-static -lxalnmgr-static \
  -lblastxml-static -lblastxml2-static -lxcgi-static -lxhtml-static -lxblast-static \
  -lxalgoblastdbindex-static -lcomposition_adjustment-static -lxalgodustmask-static \
  -lxalgowinmask-static -lseqmasks_io-static -lseqdb-static -lblast_services-static \
  -lxobjutil-static -lxobjread-static -lvariation-static -lcreaders-static \
  -lsubmit-static -lxnetblastcli-static -lxnetblast-static -lblastdb-static \
  -lscoremat-static -ltables-static -lxalnmgr-static -lncbi_xloader_genbank-static \
  -lncbi_xreader_id1-static -lncbi_xreader_id2-static -lncbi_xreader_cache-static \
  -lxconnext-static -lxconnect-static -ldbapi_driver-static -lncbi_xreader-static \
  -lxconnext-static -lxconnect-static -lid1-static -lid2-static -lseqsplit-static \
  -lxcompress-static -lxobjmgr-static -lgenome_collection-static -lseqedit-static \
  -lseqset-static -lseq-static -lseqcode-static -lsequtil-static -lpub-static \
  -lmedline-static -lbiblio-static -lgeneral-static -lxser-static -lxutil-static \
  -lxncbi-static -lgomp -lz -lbz2 -ldl -lnsl -lrt -lm -lpthread \
  -L/commun/data/users/lindenb/src/blast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib \
  -lblast_app_util -lblastinput-static -lncbi_xloader_blastdb-static \
  -lncbi_xloader_blastdb_rmt-static -lxblastformat-static -lalign_format-static \
  -ltaxon1-static -lblastdb_format-static -lgene_info-static -lxalnmgr-static \
  -lblastxml-static -lblastxml2-static -lxcgi-static -lxhtml-static -lxblast-static \
  -lxalgoblastdbindex-static -lcomposition_adjustment-static -lxalgodustmask-static \
  -lxalgowinmask-static -lseqmasks_io-static -lseqdb-static -lblast_services-static \
  -lxobjutil-static -lxobjread-static -lvariation-static -lcreaders-static \
  -lsubmit-static -lxnetblastcli-static -lxnetblast-static -lblastdb-static \
  -lscoremat-static -ltables-static -lxalnmgr-static -lncbi_xloader_genbank-static \
  -lncbi_xreader_id1-static -lncbi_xreader_id2-static -lncbi_xreader_cache-static \
  -lxconnext-static -lxconnect-static -ldbapi_driver-static -lncbi_xreader-static \
  -lxconnext-static -lxconnect-static -lid1-static -lid2-static -lseqsplit-static \
  -lxcompress-static -lxobjmgr-static -lgenome_collection-static -lseqedit-static \
  -lseqset-static -lseq-static -lseqcode-static -lsequtil-static -lpub-static \
  -lmedline-static -lbiblio-static -lgeneral-static -lxser-static -lxutil-static \
  -lxncbi-static -lgomp -lz -lbz2 -ldl -lnsl -lrt -lm -lpthread

Testing

get help

$ ./filtersize  -help
USAGE
  filtersize [-h] [-help] [-xmlhelp] [-i input_file_name]
    [-o output_file_name] [-m min_size] [-M max_size] [-v]
    [-logfile File_Name] [-conffile File_Name] [-version] [-version-full]
    [-dryrun]

DESCRIPTION
   A program filtering fasta sequences on size

OPTIONAL ARGUMENTS
 (...)
 -i <File_In>
   input file name
   Default = `-'
 -o <File_Out>
   output file name
   Default = `-'
 -m <Integer>
   min size inclusive
 -M <Integer>
   max size inclusive
 -v
   inverse selection
(...)

test with some fasta sequences:

echo -e  ">seq1\nAATCGTACGCTAG\n>seq2\nAATCGTACGATGCAAAAAAAAAAAATCTAG\n>seq3\nAATC\n>seq3\nAATC" |\
./filtersize -m 10 -M 20

>lcl|1 seq1
AATCGTACGCTAG
echo -e  ">seq1\nAATCGTACGCTAG\n>seq2\nAATCGTACGATGCAAAAAAAAAAAATCTAG\n>seq3\nAATC\n>seq3\nAATC" |\
./filtersize -m 10 -M 20 -v 

>lcl|2 seq2
AATCGTACGATGCAAAAAAAAAAAATCTAG
>lcl|3 seq3
AATC
>lcl|4 seq3
AATC
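
by the way, the test in Run() above uses a small trick: '(size in range) != invert' keeps a sequence exactly when the range test and the '-v' flag disagree. The same size filter can be sketched without the NCBI toolkit at all, in plain shell+awk (a hypothetical stand-in I wrote for illustration, not the toolkit program):

```shell
# filter FASTA records whose sequence length is in [MIN,MAX];
# pass a third argument '1' to invert the selection, using the same
# 'keep != invert' trick as in Run() above.
filter_fasta_size() {
  local MIN=$1 MAX=$2 INVERT=${3:-0}
  awk -v min="$MIN" -v max="$MAX" -v invert="$INVERT" '
    /^>/ { if (name != "") emit(); name=$0; seq=""; next }
    { seq = seq $0 }
    END { if (name != "") emit() }
    # emit prints the current record when the range test XOR invert is true
    function emit(  keep) {
      keep = (length(seq) >= min && length(seq) <= max)
      if (keep != invert) printf("%s\n%s\n", name, seq)
    }'
}
```

note it concatenates the lines of each record before measuring, so multi-line sequences are handled too.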

That's it,
Pierre

29 January 2015

Compiling a C++ 'Hello world' program using the #NCBI C++ toolbox: my notebook.

This post is my notebook for compiling a simple C++ application using the NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/).

This application prints 'Hello world' and takes two arguments:

  • '-o' to specify the output filename (default is standard output)
  • '-n' to set the name to be printed (default: "World !")

The toolbox code I used is the one included in the distribution of blast 2.2.30.

The program itself is a C++ class: it extends the class CNcbiApplication

(...)
class HelloApp : public CNcbiApplication
    {
    (...)
    }

In the constructor, we just set the name and the version of the program:

HelloApp::HelloApp()
    {
    CRef<CVersion> version(new CVersion());
    version->SetVersionInfo(1, 0, 0);
    this->SetFullVersion(version);
    }

The destructor is empty

HelloApp::~HelloApp()
    {
    }

the HelloApp::Init() method is used to register the command line arguments.

void HelloApp::Init()
    {
    CArgDescriptions* arg_desc = new CArgDescriptions ;
    arg_desc->SetUsageContext(
        GetArguments().GetProgramBasename(),
        "Hello NCBI"
        );
    (...)

in HelloApp::Init() we register the '-o' argument:

arg_desc->AddDefaultKey(
    "o",/* name */
    "output_file_name",/* synopsis */
    "output file name",/* comment */
    CArgDescriptions::eOutputFile, /* argument type */
    "-" /* default value*/
    );

and the '-n' argument:

arg_desc->AddDefaultKey(
    "n",/* name */
    "name",/* synopsis */
    "Name",/* comment */
    CArgDescriptions::eString, /* argument type */
    "World !" /* default value*/
    );

once the two arguments have been registered, the new command line is pushed in the application:

SetupArgDescriptions( arg_desc );

the HelloApp::Run() method is the workhorse of the program. Here, we get the output stream and we print the name.

int HelloApp::Run()
    {
    /* get output stream */
    CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
    out << "Hello " << this->GetArgs()["n"].AsString() << endl;
    return 0;
    }

All in one:

#include <memory>
#include <limits>
#include "corelib/ncbiapp.hpp"
#include <ctype.h>



USING_NCBI_SCOPE;

/**
 * Hello NCBI
 *      testing NCBI C++ toolbox
 *  Motivation:
 *      print a simple greeting
 *  Author:
 *      Pierre Lindenbaum PhD 2015
 *
 */
class HelloApp : public CNcbiApplication /* extends a basic NCBI application defined in c++/include/corelib/ncbiapp.hpp */
    {
    public:

        /* constructor, just set the version to '1.0.0'*/
        HelloApp()
            {
            CRef<CVersion> version(new CVersion());
            version->SetVersionInfo(1, 0, 0);
            this->SetFullVersion(version);
            }
            
            
        /* called to initialize the program.
        *  The default behavior of this in 'CNcbiApplication' is "do nothing".
        *  Here, we set the command line arguments.
        */
        virtual void Init()
            {
            CArgDescriptions* arg_desc = new CArgDescriptions ; /* defined in /c++/include/corelib/ncbiargs.hpp */
            arg_desc->SetUsageContext(
                GetArguments().GetProgramBasename(),
                "Hello NCBI"
                );
            
           /* argument for output file */
           arg_desc->AddDefaultKey(
                    "o",/* name */
                    "output_file_name",/* synopsis */
                    "output file name",/* comment */
                    CArgDescriptions::eOutputFile, /* argument type */
                    "-" /* default value*/
                    );
            /* argument for name */
           arg_desc->AddDefaultKey(
                    "n",/* name */
                    "name",/* synopsis */
                    "Name",/* comment */
                    CArgDescriptions::eString, /* argument type */
                    "World !" /* default value*/
                    );
            
            /* push this command args */
            SetupArgDescriptions( arg_desc );
            }   
        
        /* class destructor */
        virtual ~HelloApp()
            {
            }
        
        /* workhorse of the program */
        virtual int Run()
            {
            /* get output stream */
            CNcbiOstream& out =  this->GetArgs()["o"].AsOutputFile();
            out << "Hello " << this->GetArgs()["n"].AsString() << endl;
            return 0;
            }
    };

int main(int argc,char** argv)
    {
    return HelloApp().AppMain(
        argc,
        argv,
        0, /* envp Environment pointer */
        eDS_ToStderr,/* log message. In /c++/include/corelib/ncbidiag.hpp  */
        0, /* Specify registry to load, as per LoadConfig() */
        "hello" /* Specify application name */
        );
    }

Compiling

there is a bunch of libraries available under ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib and it's hard to know which ones are required for linking, so I put everything on my command line. All in one it looks like:

g++ -std=gnu++11 -finline-functions -fstrict-aliasing -fomit-frame-pointer \
  -Iblast2.2.30/ncbi-blast-2.2.30+-src/c++/include \
  -Iblast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/inc \
  -Wno-deprecated -Wno-deprecated-declarations -o hello hello.cpp \
  -Lblast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib \
  -lblast_app_util -lblastinput-static -lncbi_xloader_blastdb-static \
  -lncbi_xloader_blastdb_rmt-static -lxblastformat-static -lalign_format-static \
  -ltaxon1-static -lblastdb_format-static -lgene_info-static -lxalnmgr-static \
  -lblastxml-static -lblastxml2-static -lxcgi-static -lxhtml-static -lxblast-static \
  -lxalgoblastdbindex-static -lcomposition_adjustment-static -lxalgodustmask-static \
  -lxalgowinmask-static -lseqmasks_io-static -lseqdb-static -lblast_services-static \
  -lxobjutil-static -lxobjread-static -lvariation-static -lcreaders-static \
  -lsubmit-static -lxnetblastcli-static -lxnetblast-static -lblastdb-static \
  -lscoremat-static -ltables-static -lxalnmgr-static -lncbi_xloader_genbank-static \
  -lncbi_xreader_id1-static -lncbi_xreader_id2-static -lncbi_xreader_cache-static \
  -lxconnext-static -lxconnect-static -ldbapi_driver-static -lncbi_xreader-static \
  -lxconnext-static -lxconnect-static -lid1-static -lid2-static -lseqsplit-static \
  -lxcompress-static -lxobjmgr-static -lgenome_collection-static -lseqedit-static \
  -lseqset-static -lseq-static -lseqcode-static -lsequtil-static -lpub-static \
  -lmedline-static -lbiblio-static -lgeneral-static -lxser-static -lxutil-static \
  -lxncbi-static -lgomp -lz -lbz2 -ldl -lnsl -lrt -lm -lpthread \
  -Lblast2.2.30/ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib \
  -lblast_app_util -lblastinput-static -lncbi_xloader_blastdb-static \
  -lncbi_xloader_blastdb_rmt-static -lxblastformat-static -lalign_format-static \
  -ltaxon1-static -lblastdb_format-static -lgene_info-static -lxalnmgr-static \
  -lblastxml-static -lblastxml2-static -lxcgi-static -lxhtml-static -lxblast-static \
  -lxalgoblastdbindex-static -lcomposition_adjustment-static -lxalgodustmask-static \
  -lxalgowinmask-static -lseqmasks_io-static -lseqdb-static -lblast_services-static \
  -lxobjutil-static -lxobjread-static -lvariation-static -lcreaders-static \
  -lsubmit-static -lxnetblastcli-static -lxnetblast-static -lblastdb-static \
  -lscoremat-static -ltables-static -lxalnmgr-static -lncbi_xloader_genbank-static \
  -lncbi_xreader_id1-static -lncbi_xreader_id2-static -lncbi_xreader_cache-static \
  -lxconnext-static -lxconnect-static -ldbapi_driver-static -lncbi_xreader-static \
  -lxconnext-static -lxconnect-static -lid1-static -lid2-static -lseqsplit-static \
  -lxcompress-static -lxobjmgr-static -lgenome_collection-static -lseqedit-static \
  -lseqset-static -lseq-static -lseqcode-static -lsequtil-static -lpub-static \
  -lmedline-static -lbiblio-static -lgeneral-static -lxser-static -lxutil-static \
  -lxncbi-static -lgomp -lz -lbz2 -ldl -lnsl -lrt -lm -lpthread
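
rather than copying that library list by hand, the '-l' flags can be derived from the 'lib*.a' file names in the ReleaseMT/lib directory. A small sketch (my own helper, not part of the toolkit; it ignores static-link ordering, which may matter):

```shell
# derive '-lfoo-static'-style linker flags from the 'lib*.a' files
# found in a directory (e.g. ncbi-blast-2.2.30+-src/c++/ReleaseMT/lib)
make_lib_flags() {
  ls "$1"/lib*.a 2>/dev/null |
    sed -e 's/.*\/lib/-l/' -e 's/\.a$//' |   # /path/libfoo.a -> -lfoo
    tr '\n' ' '                               # one line of flags
}
```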

Testing

run without arguments:

$ ./hello 
Hello World !

run with arguments:

$ ./hello -n Pierre
Hello Pierre
$ ./hello -n Pierre -o test.txt && cat test.txt 
Hello Pierre

errors:

$ ./hello -x Pierre
USAGE
  hello [-h] [-help] [-xmlhelp] [-o output_file_name] [-n name]
    [-logfile File_Name] [-conffile File_Name] [-version] [-version-full]
    [-dryrun]

DESCRIPTION
   Hello NCBI

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Unknown argument: "x"

get help:

$ ./hello -help
USAGE
  hello [-h] [-help] [-xmlhelp] [-o output_file_name] [-n name]
    [-logfile File_Name] [-conffile File_Name] [-version] [-version-full]
    [-dryrun]

DESCRIPTION
   Hello NCBI

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -xmlhelp
   Print USAGE, DESCRIPTION and ARGUMENTS in XML format; ignore all other
   parameters
 -o <File_Out>
   output file name
   Default = `-'
 -n <String>
   Name
   Default = `World !'
 -logfile <File_Out>
   File to which the program log should be redirected
 -conffile <File_In>
   Program's configuration (registry) data file
 -version
   Print version number;  ignore other arguments
 -version-full
   Print extended version data;  ignore other arguments
 -dryrun
   Dry run the application: do nothing, only test all preconditions

get xml help:

$ ./hello -xmlhelp | xmllint --format -
<?xml version="1.0"?>
<ncbi_application xmlns="ncbi:application" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="ncbi:application ncbi_application.xsd">
  <program type="regular">
    <name>hello</name>
    <version>1.0.0</version>
    <description>Hello NCBI</description>
  </program>
  <arguments>
    <key name="conffile" type="File_In" optional="true">
      <description>Program's configuration (registry) data file</description>
      <synopsis>File_Name</synopsis>
    </key>
    <key name="logfile" type="File_Out" optional="true">
      <description>File to which the program log should be redirected</description>
      <synopsis>File_Name</synopsis>
    </key>
    <key name="n" type="String" optional="true">
      <description>Name</description>
      <synopsis>name</synopsis>
      <default>World !</default>
    </key>
    <key name="o" type="File_Out" optional="true">
      <description>output file name</description>
      <synopsis>output_file_name</synopsis>
      <default>-</default>
    </key>
    <flag name="dryrun" optional="true">
      <description>Dry run the application: do nothing, only test all preconditions</description>
    </flag>
    <flag name="h" optional="true">
      <description>Print USAGE and DESCRIPTION;  ignore all other parameters</description>
    </flag>
    <flag name="help" optional="true">
      <description>Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters</description>
    </flag>
    <flag name="version" optional="true">
      <description>Print version number;  ignore other arguments</description>
    </flag>
    <flag name="version-full" optional="true">
      <description>Print extended version data;  ignore other arguments</description>
    </flag>
    <flag name="xmlhelp" optional="true">
      <description>Print USAGE, DESCRIPTION and ARGUMENTS in XML format; ignore all other parameters</description>
    </flag>
  </arguments>
</ncbi_application>

That's it,
Pierre

05 December 2014

Divide-and-conquer in a #Makefile : recursivity and #parallelism.

This post is my notebook about implementing a divide-and-conquer strategy in GNU make.
Say you have a list of 'N' VCF files. You want to create a list of:

  • common SNPs in vcf1 and vcf2
  • common SNPs in vcf3 and the previous list
  • common SNPs in vcf4 and the previous list
  • (...)
  • common SNPs in vcfN and the previous list
Yes, I know I can do this using: grep -v '^#' f.vcf | cut -f 1,2,4,5 | sort | uniq

Using a linear Makefile it could look like:

list2: vcf1 vcf2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list3: vcf3 list2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list4: vcf4 list3
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
list5: vcf5 list4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@
(...)

We can speed-up the workflow using the parallel option of make -j (number-of-parallel-jobs) and using a divide-and-conquer strategy. Here, the targets 'list1_2' and 'list3_4' can be processed independently in parallel.

list1_2: vcf1 vcf2
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

list3_4: vcf3 vcf4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

list1_4: list1_2 list3_4
    grep -v '^#' $^ |cut -f 1,2,4,5 | sort | uniq > $@

By using the internal 'Make' functions $(eval), $(call) and $(shell), we can define a recursive method "recursive". This method takes two arguments: the 0-based indexes delimiting a slice of the list of VCFs. Here is the Makefile:
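The Makefile itself is not reproduced in this archive. As a rough illustration of the same recursion (my own sketch in shell, not the original make code): each call splits the file list at the midpoint, recurses on both halves, and intersects the two results with 'comm -12', just like the targetB_E files in the log below.

```shell
# intersect_all FILE1..FILEN prints the lines common to all the
# LC_ALL=C-sorted files, by divide-and-conquer: split the argument list
# at the midpoint, recurse on each half, combine with 'comm -12'.
intersect_all() {
  if [ "$#" -eq 1 ]; then cat "$1"; return; fi
  local mid=$(( $# / 2 ))
  local left right
  left=$(mktemp) ; right=$(mktemp)
  intersect_all "${@:1:$mid}" > "$left"    # first half
  shift "$mid"
  intersect_all "$@" > "$right"            # second half
  LC_ALL=C comm -12 "$left" "$right"       # lines common to both halves
  rm -f "$left" "$right"
}
```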
Running the makefile:
$ make 
gunzip -c Sample18.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target0_1
gunzip -c Sample13.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target1_2
LC_ALL=C comm -12 target0_1 target1_2 > target0_2
gunzip -c Sample1.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target2_3
gunzip -c Sample19.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target3_4
gunzip -c Sample12.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target4_5
LC_ALL=C comm -12 target3_4 target4_5 > target3_5
LC_ALL=C comm -12 target2_3 target3_5 > target2_5
LC_ALL=C comm -12 target0_2 target2_5 > target0_5
gunzip -c Sample17.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target5_6
gunzip -c Sample16.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target6_7
LC_ALL=C comm -12 target5_6 target6_7 > target5_7
gunzip -c Sample9.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target7_8
gunzip -c Sample15.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target8_9
gunzip -c Sample5.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target9_10
LC_ALL=C comm -12 target8_9 target9_10 > target8_10
LC_ALL=C comm -12 target7_8 target8_10 > target7_10
LC_ALL=C comm -12 target5_7 target7_10 > target5_10
LC_ALL=C comm -12 target0_5 target5_10 > target0_10
gunzip -c Sample14.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target10_11
gunzip -c Sample3.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target11_12
LC_ALL=C comm -12 target10_11 target11_12 > target10_12
gunzip -c Sample11.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target12_13
gunzip -c Sample2.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target13_14
gunzip -c Sample6.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target14_15
LC_ALL=C comm -12 target13_14 target14_15 > target13_15
LC_ALL=C comm -12 target12_13 target13_15 > target12_15
LC_ALL=C comm -12 target10_12 target12_15 > target10_15
gunzip -c Sample20.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target15_16
gunzip -c Sample10.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target16_17
LC_ALL=C comm -12 target15_16 target16_17 > target15_17
gunzip -c Sample4.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target17_18
gunzip -c Sample8.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target18_19
gunzip -c Sample7.vcf.gz | grep -v '^#' | cut -f 1,2,4,5 | LC_ALL=C sort | uniq > target19_20
LC_ALL=C comm -12 target18_19 target19_20 > target18_20
LC_ALL=C comm -12 target17_18 target18_20 > target17_20
LC_ALL=C comm -12 target15_17 target17_20 > target15_20
LC_ALL=C comm -12 target10_15 target15_20 > target10_20
LC_ALL=C comm -12 target0_10 target10_20 > target0_20
and here is the generated workflow (drawn with make2graph).
That's it,
Pierre.

04 December 2014

XML+XSLT = #Makefile -based #workflows for #bioinformatics

I've recently read some conversations on Twitter about Makefile-based bioinformatics workflows. I suggested on biostars.org (Standard simple format to describe a bioinformatics analysis pipeline) that an XML file could be used to describe a model of data and that XSLT could transform this model into a Makefile-based workflow. I've already explored this idea in a previous post (Generating a pipeline of analysis (Makefile) for NGS with xslt) and, in our lab, we use JSON and jsvelocity to generate our workflows (e.g: https://gist.github.com/lindenb/3c07ca722f793cc5dd60).
In the current post, I'll describe my github repository containing a complete XML+XSLT example: https://github.com/lindenb/ngsxml.

Download the data

git clone "https://github.com/lindenb/ngsxml.git"


The model
The XML model is self-explanatory:
<?xml version="1.0" encoding="UTF-8"?>
<model name="myProject" description="my project" directory="OUT">
  <project name="Proj1">
    <sample name="Sample1">
      <fastq>
        <for>test/fastq/sample_1_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_01_R2.fastq.gz</rev>
      </fastq>
      <fastq id="groupid2" lane="2" library="lib1" platform="ILMN" median-size="98">
        <for>test/fastq/sample_1_02_R1.fastq.gz</for>
        <rev>test/fastq/sample_1_02_R2.fastq.gz</rev>
      </fastq>
    </sample>
    <sample name="Sample2">
      <fastq>
        <for>test/fastq/sample_2_01_R1.fastq.gz</for>
        <rev>test/fastq/sample_2_01_R2.fastq.gz</rev>
      </fastq>
    </sample>
  </project>
</model>


Validating the model:
This XML document can be validated against an XML schema:
$ xmllint  --schema xsd/schema01.xsd --noout test/model01.xml
test/model01.xml validates


Generating the Makefile:
The XML document is transformed into a Makefile using the following XSLT stylesheet: https://github.com/lindenb/ngsxml/blob/master/stylesheets/model2make.xsl
xsltproc --output Makefile stylesheets/model2make.xsl test/model01.xml
The Makefile:
# Name
#  myProject
# Description:
#  my project
# 
include config.mk
OUTDIR=OUT
BINDIR=$(abspath ${OUTDIR})/bin


# if tools are undefined
bwa.exe ?=${BINDIR}/bwa-0.7.10/bwa
samtools.exe ?=${BINDIR}/samtools-0.1.19/samtools
bcftools.exe ?=${BINDIR}/samtools-0.1.19/bcftools/bcftools
tabix.exe ?=${BINDIR}/tabix-0.2.6/tabix
bgzip.exe ?=${BINDIR}/tabix-0.2.6/bgzip

.PHONY: all clean all_bams all_vcfs

all: all_vcfs


all_vcfs:  \
 $(OUTDIR)/Projects/Proj1/VCF/Proj1.vcf.gz


all_bams:  \
 $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam \
 $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam

#
# VCF for project 'Proj1'
# 
$(OUTDIR)/Projects/Proj1/VCF/Proj1.vcf.gz : $(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam) \
 $(addsuffix .fai,${REFERENCE}) ${samtools.exe}  ${bgzip.exe} ${tabix.exe} ${bcftools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} mpileup -uf ${REFERENCE} $(basename $(filter %.bai,$^)) | \
 ${bcftools.exe} view -vcg - > $(basename $@)  && \
 ${bgzip.exe} -f $(basename $@) && \
 ${tabix.exe} -f -p vcf $@
 
    



#
# index final BAM for Sample 'Sample1'
# 
$(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} index  $<
#
# prepare final BAM for Sample 'Sample1'
# 
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}rmdup.bam
 mkdir -p $(dir $@) && \
 cp  $< $@


$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}rmdup.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}merged.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} rmdup  $<  $@



#
# merge BAMs 
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}merged.bam :  \
  $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam \
  $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
  ${samtools.exe} merge -f $@ $(filter %.bam,$^)
  



 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 ${samtools.exe} index $<

#
# Align test/fastq/sample_1_01_R1.fastq.gz and test/fastq/sample_1_01_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}1_sorted.bam : \
 test/fastq/sample_1_01_R1.fastq.gz  \
 test/fastq/sample_1_01_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:idp20678948\tSM:Sample1\tLB:Sample1\tPL:ILLUMINA\tPU:1' \
  ${REFERENCE} \
  test/fastq/sample_1_01_R1.fastq.gz \
  test/fastq/sample_1_01_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 





 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam ${samtools.exe}
 ${samtools.exe} index $<

#
# Align test/fastq/sample_1_02_R1.fastq.gz and test/fastq/sample_1_02_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample1/BAM/${tmp.prefix}2_sorted.bam : \
 test/fastq/sample_1_02_R1.fastq.gz  \
 test/fastq/sample_1_02_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:groupid2\tSM:Sample1\tLB:lib1\tPL:ILMN\tPU:2\tPI:98' \
  ${REFERENCE} \
  test/fastq/sample_1_02_R1.fastq.gz \
  test/fastq/sample_1_02_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 




#
# index final BAM for Sample 'Sample2'
# 
$(addsuffix .bai, $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam): $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} index  $<
#
# prepare final BAM for Sample 'Sample2'
# 
$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}rmdup.bam
 mkdir -p $(dir $@) && \
 cp  $< $@


$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}rmdup.bam : $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${samtools.exe} rmdup  $<  $@





 
#
# Index BAM $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam
#
$(addsuffix .bai,$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ): $(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam ${samtools.exe}
 ${samtools.exe} index $<

#
# Align test/fastq/sample_2_01_R1.fastq.gz and test/fastq/sample_2_01_R2.fastq.gz
#
$(OUTDIR)/Projects/Proj1/Samples/Sample2/BAM/${tmp.prefix}1_sorted.bam : \
 test/fastq/sample_2_01_R1.fastq.gz  \
 test/fastq/sample_2_01_R2.fastq.gz \
 $(addsuffix .bwt,${REFERENCE}) \
 ${bwa.exe} ${samtools.exe}
 mkdir -p $(dir $@) && \
 ${bwa.exe} mem -R '@RG\tID:idp20681172\tSM:Sample2\tLB:Sample2\tPL:ILLUMINA\tPU:1' \
  ${REFERENCE} \
  test/fastq/sample_2_01_R1.fastq.gz \
  test/fastq/sample_2_01_R2.fastq.gz |\
 ${samtools.exe} view -uS - |\
 ${samtools.exe} sort - $(basename $@) 




$(addsuffix .fai,${REFERENCE}): ${REFERENCE} ${samtools.exe}
 ${samtools.exe} faidx $<

$(addsuffix .bwt,${REFERENCE}): ${REFERENCE} ${bwa.exe}
 ${bwa.exe} index $<


${BINDIR}/bwa-0.7.10/bwa :
 rm -rf $(BINDIR)/bwa-0.7.10/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/bwa-0.7.10.tar.bz2 -L "http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/bwa-0.7.10.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/bwa-0.7.10.tar.bz2 && \
 make -C $(dir $@)

${BINDIR}/samtools-0.1.19/bcftools/bcftools: ${BINDIR}/samtools-0.1.19/samtools

${BINDIR}/samtools-0.1.19/samtools  : 
 rm -rf $(BINDIR)/samtools-0.1.19/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/samtools-0.1.19.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/samtools-0.1.19.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/samtools-0.1.19.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/samtools-0.1.19.tar.bz2 && \
 make -C $(dir $@)


${BINDIR}/tabix-0.2.6/bgzip : ${BINDIR}/tabix-0.2.6/tabix


${BINDIR}/tabix-0.2.6/tabix  : 
 rm -rf $(BINDIR)/tabix-0.2.6/ && \
 mkdir -p $(BINDIR) && \
 curl -o $(BINDIR)/tabix-0.2.6.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/tabix-0.2.6.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj $(BINDIR)/tabix-0.2.6.tar.bz2 -C $(OUTDIR)/bin  && \
 rm $(BINDIR)/tabix-0.2.6.tar.bz2 && \
 make -C $(dir $@) tabix bgzip

clean:
 rm -rf ${BINDIR}

Drawing the Workflow:
The workflow is drawn with https://github.com/lindenb/makefile2graph.


Running Make:
And here is the output of make:
rm -rf /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 -L "http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/
bwa-0.7.10/
bwa-0.7.10/bamlite.c
(...)
gcc -g -Wall -Wno-unused-function -O2 -msse -msse2 -msse3 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS QSufSort.o bwt_gen.o bwase.o bwaseqio.o bwtgap.o bwtaln.o bamlite.o is.o bwtindex.o bwape.o kopen.o pemerge.o bwtsw2_core.o bwtsw2_main.o bwtsw2_aux.o bwt_lite.o bwtsw2_chain.o fastmap.o bwtsw2_pair.o main.o -o bwa -L. -lbwa -lm -lz -lpthread
make[1]: Entering directory `/home/lindenb/src/ngsxml'
/home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa index test/ref/ref.fa
rm -rf /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/samtools-0.1.19.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/
samtools-0.1.19/
samtools-0.1.19/.gitignore
samtools-0.1.19/AUTHORS
(...)
gcc -g -Wall -O2 -o bamcheck bamcheck.o -L.. -lm -lbam -lpthread -lz
make[3]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/misc'
make[2]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19'
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:idp11671724\tSM:Sample1\tLB:Sample1\tPL:ILLUMINA\tPU:1' \
  test/ref/ref.fa \
  test/fastq/sample_1_01_R1.fastq.gz \
  test/fastq/sample_1_01_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__1_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:groupid2\tSM:Sample1\tLB:lib1\tPL:ILMN\tPU:2\tPI:98' \
  test/ref/ref.fa \
  test/fastq/sample_1_02_R1.fastq.gz \
  test/fastq/sample_1_02_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__2_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
  /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools merge -f OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__merged.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__1_sorted.bam OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__2_sorted.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools rmdup  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__merged.bam  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__rmdup.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 cp  OUT/Projects/Proj1/Samples/Sample1/BAM/__DELETE__rmdup.bam OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample1/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools index  OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/bwa-0.7.10/bwa mem -R '@RG\tID:idp11673828\tSM:Sample2\tLB:Sample2\tPL:ILLUMINA\tPU:1' \
  test/ref/ref.fa \
  test/fastq/sample_2_01_R1.fastq.gz \
  test/fastq/sample_2_01_R2.fastq.gz |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools view -uS - |\
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools sort - OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__1_sorted 
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools rmdup  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__1_sorted.bam  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__rmdup.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 cp  OUT/Projects/Proj1/Samples/Sample2/BAM/__DELETE__rmdup.bam OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam
mkdir -p OUT/Projects/Proj1/Samples/Sample2/BAM/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools index  OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam
/home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools faidx test/ref/ref.fa
rm -rf /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/ && \
 mkdir -p /home/lindenb/src/ngsxml/OUT/bin && \
 curl -o /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 -L "http://sourceforge.net/projects/samtools/files/tabix-0.2.6.tar.bz2/download?use_mirror=freefr" && \
 tar xvfj /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 -C OUT/bin  && \
 rm /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6.tar.bz2 && \
 make -C /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/ tabix bgzip
tabix-0.2.6/
tabix-0.2.6/ChangeLog
tabix-0.2.6/Makefile
(...)
make[2]: Leaving directory `/home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6'
mkdir -p OUT/Projects/Proj1/VCF/ && \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/samtools mpileup -uf test/ref/ref.fa OUT/Projects/Proj1/Samples/Sample1/BAM/Proj1_Sample1.bam OUT/Projects/Proj1/Samples/Sample2/BAM/Proj1_Sample2.bam | \
 /home/lindenb/src/ngsxml/OUT/bin/samtools-0.1.19/bcftools/bcftools view -vcg - > OUT/Projects/Proj1/VCF/Proj1.vcf  && \
 /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/bgzip -f OUT/Projects/Proj1/VCF/Proj1.vcf && \
 /home/lindenb/src/ngsxml/OUT/bin/tabix-0.2.6/tabix -f -p vcf OUT/Projects/Proj1/VCF/Proj1.vcf.gz
make[1]: Leaving directory `/home/lindenb/src/ngsxml'
At the end, a VCF file is generated:
##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
(...)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
chr4_gl000194_random 1973 . C G 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,3,0;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2462 . A T 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2492 . G C 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
chr4_gl000194_random 2504 . A T 14.4 . DP=3;VDB=2.063840e-02;AF1=1;AC1=4;DP4=0,0,0,3;MQ=60;FQ=-32.3 GT:PL:GQ 1/1:31,6,0:11 1/1:17,3,0:8
That's it,

Pierre

30 October 2014

Visualizing @GenomeBrowser liftOver/chain files using animated #SVG

I wrote a tool to visualize some UCSC "chain/liftOver" files as an animated SVG file. This tool is available on github at:

"A liftOver file is a chain file, where for each region in the genome the alignments of the best/longest syntenic regions are used to translate features from one version of a genome to another.".

SVG elements and CSS styles can be animated in an SVG file (see http://www.w3.org/TR/SVG/animate.html ) using the <animate/> element.

For example, the following SVG snippet:

  • defines a rectangle (x=351, y=35, width=6, height=15)
  • at t=60s, fades the opacity from 0 to 0.7 over 2 seconds
  • moves 'x' from x=351 to x=350, starting at t=62s, over 16 seconds
  • moves 'y' from y=35 to y=36, starting at t=62s, over 16 seconds
  • grows the 'width' from width=6 to width=100, starting at t=62s, over 16 seconds
  • at t=78s, fades the opacity from 0.7 back to 0 over 2 seconds
<rect x="351" y="35" width="6" height="15">
        <animate attributeType="CSS" attributeName="opacity" begin="60" dur="2" from="0" to="0.7" repeatCount="1" fill="freeze"/>
        <animate attributeType="XML" attributeName="x" begin="62" dur="16" from="351" to="350" repeatCount="1" fill="freeze"/>
        <animate attributeType="XML" attributeName="y" begin="62" dur="16" from="35" to="36" repeatCount="1" fill="freeze"/>
        <animate attributeType="XML" attributeName="width" begin="62" dur="16" from="6" to="100" repeatCount="1" fill="freeze"/>
        <animate attributeType="CSS" attributeName="opacity" begin="78" dur="2" from="0.7" to="0" repeatCount="1" fill="freeze"/>
</rect>

A demo covering hg16-hg17-hg18-hg19-hg38 was posted here: http://cardioserve.nantes.inserm.fr/~lindenb/liftover2svg/hg16ToHg38.svg




That's it,

Pierre.

16 October 2014

IGVFox: Integrative Genomics Viewer control through Mozilla Firefox

I've just pushed IGVFox 0.1, a Firefox add-on for controlling IGV, the Integrative Genomics Viewer.
This add-on allows users to set the genomic position in IGV by simply clicking a hyperlink in an HTML page. The source code is available on github at https://github.com/lindenb/igvfox and a first release is available as a *.xpi file at https://github.com/lindenb/igvfox/releases.
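For context on what such a hyperlink can target: IGV itself exposes a small HTTP control interface on a local port (60151 by default) accepting commands like 'goto'. The sketch below only builds and prints such a control URL; the actual `curl` call is commented out since it needs a running IGV instance, and the locus value is illustrative.

```shell
# Sketch: IGV's port control interface (default port 60151) accepts simple
# commands such as 'goto' over plain HTTP, so a hyperlink pointing at a URL
# like this is enough to move the IGV view. The locus below is illustrative.
LOCUS="chr4_gl000194_random:1973-2504"
URL="http://localhost:60151/goto?locus=${LOCUS}"
echo "${URL}"
# With IGV running locally:
#   curl -s "${URL}"
```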


That's it,

Pierre