08 March 2012

My first walker for the GATK : my notebook

This is my first notebook for developing a new Walker for the Genome Analysis Toolkit. This post was mostly inspired by the following pdf: kvg_20_line_lifesavers_mad_v2.pptx.pdf.

Get the sources

git clone http://github.com/broadgsa/gatk.git GATK.dev
The javac compiler also requires the following library from Google: http://code.google.com/p/cofoja/.

A first "Short-Reads" walker

The following ReadWalker class scans the reads and prints them as FASTA. The @Output annotation tells the GATK that we're going to channel our output through the java.io.PrintStream object. This field is automatically filled by the application runtime.
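As a rough, GATK-independent sketch of the formatting step only: the helper below turns a read name and its bases into a FASTA record. The class name FastaRecord, the method format and the line width are hypothetical. In a real walker the name and bases would come from the read handed over by the GATK, and the PrintStream would be the field filled through @Output rather than System.out.

```java
/** Hypothetical helper (not part of the GATK): formats one read as a FASTA record. */
class FastaRecord {
    /** Returns ">name", then the bases wrapped at lineWidth characters per line. */
    static String format(String readName, String bases, int lineWidth) {
        StringBuilder sb = new StringBuilder();
        sb.append('>').append(readName).append('\n');
        for (int i = 0; i < bases.length(); i += lineWidth) {
            int end = Math.min(bases.length(), i + lineWidth);
            sb.append(bases, i, end).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the PrintStream that the GATK runtime would inject.
        java.io.PrintStream out = System.out;
        out.print(format("read1", "ACGTACGTAC", 4));
    }
}
```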

Compilation

javac -cp /path/to/GenomeAnalysisTK.jar:/path/to/cofoja-1.0-r139.jar:. \
 -sourcepath src \
 -d tmp src/mygatk/HelloRead.java
jar cvf HelloRead.jar -C tmp .

Running

Here I'm using a BAM from the 'examples' folder of samtools (we need to pre-process this BAM with picard AddOrReplaceReadGroups). We then use our library as follows:
java -cp path/to/GenomeAnalysisTK.jar:HelloRead.jar \
org.broadinstitute.sting.gatk.CommandLineGATK -T HelloRead \
 -I test.bam \
 -R ${SAMTOOLS}/examples/ex1.fa 

Result:

The Makefile

That's it, Pierre

04 March 2012

Java Remote Method Invocation (RMI) for Bioinformatics

"Java Remote Method Invocation (Java RMI) enables the programmer to create distributed Java technology-based to Java technology-based applications, in which the methods of remote Java objects can be invoked from other Java virtual machines*, possibly on different hosts." [Oracle] In the current post, a Java client will use RMI to send a Java class to the server, where it will analyze a DNA sequence fetched from the NCBI.

Files and directories

In this example, my files are structured as defined below:
./sandbox/client/FirstBases.java
./sandbox/client/GCPercent.java
./sandbox/client/SequenceAnalyzerClient.java
./sandbox/server/SequenceAnalyzerServiceImpl.java
./sandbox/shared/SequenceAnalyzerService.java
./sandbox/shared/SequenceAnalyzer.java
./client.policy
./server.policy

The Service: SequenceAnalyzerService.java

The remote service provided by the server is defined as an interface named SequenceAnalyzerService: it fetches a DNA sequence for a given NCBI-gi, processes the sequence with an instance of SequenceAnalyzer (see below) and returns a serializable value (that is to say, we can transmit this value through the network).

Extract a value from a DNA sequence : SequenceAnalyzer

The interface SequenceAnalyzer defines how the remote service should parse a sequence. A SAX parser will be used by the 'SequenceAnalyzerService' to process a TinySeq-XML document from the NCBI. The method characters is called each time a chunk of sequence is found. At the end, the remote server will return the value calculated by getResult:
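A minimal sketch of what the two shared interfaces could look like. Only the names SequenceAnalyzerService, SequenceAnalyzer, analyse, characters and getResult come from the text. The parameter types and return types below are assumptions.

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;

/** Sent by the client to the server, so it has to be serializable. */
interface SequenceAnalyzer extends Serializable {
    /** Called for each chunk of sequence (assumed signature). */
    void characters(String chunk);

    /** The value sent back to the client; it must be serializable too. */
    Serializable getResult();
}

/** The remote service: every remote method must declare RemoteException. */
interface SequenceAnalyzerService extends Remote {
    /** Fetches the sequence for an NCBI gi and runs the analyzer on it (assumed signature). */
    Serializable analyse(int gi, SequenceAnalyzer analyzer) throws RemoteException;
}
```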

Server side : an implementation of SequenceAnalyzerService

The class SequenceAnalyzerServiceImpl is an implementation of the service SequenceAnalyzerService. In the method analyse, a SAXParser is created and the given 'gi' sequence is downloaded from the NCBI. The instance of SequenceAnalyzer received from the client is invoked for each chunk of DNA. At the end, the "value" calculated by the instance of SequenceAnalyzer is returned to the client through the network. The 'main' method contains the code to bind this service to the RMI registry:
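The SAX part of that flow can be sketched on its own, without RMI or any network access. The class TinySeqSaxSketch and its methods are hypothetical. The sketch parses a tiny hand-written TinySeq-like document from a string and forwards each chunk of the TSeq_sequence element to a callback, which is the role the client's SequenceAnalyzer would play on the server.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TinySeqSaxSketch {
    /** Streams the text of every TSeq_sequence element to onChunk, chunk by chunk. */
    static void parse(InputSource in, Consumer<String> onChunk) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private boolean inSequence = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                inSequence = qName.equals("TSeq_sequence");
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                inSequence = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver the sequence in several pieces.
                if (inSequence) onChunk.accept(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    }

    /** Toy document; a real server would read the XML from the NCBI instead. */
    static String demo() {
        String xml = "<TSeqSet><TSeq><TSeq_gi>25</TSeq_gi>"
                   + "<TSeq_sequence>TAGTTATTC</TSeq_sequence></TSeq></TSeqSet>";
        StringBuilder sequence = new StringBuilder();
        try {
            parse(new InputSource(new StringReader(xml)), sequence::append);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return sequence.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```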

Client side

On the client side, we're going to connect to the SequenceAnalyzerService and send two distinct implementations of SequenceAnalyzer. What's interesting here: the server doesn't know anything about those implementations of SequenceAnalyzer. The client's java compiled classes have to be sent to the service.

GCPercent.java

A first implementation of 'SequenceAnalyzer' computes the GC% of a sequence:
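One possible way to accumulate a GC percentage chunk by chunk is sketched below. The class GcPercentSketch is hypothetical and standalone. In the design described above it would implement SequenceAnalyzer, and the String parameter of characters is an assumption.

```java
/** Hypothetical, standalone sketch of a GC% accumulator. */
class GcPercentSketch implements java.io.Serializable {
    private static final long serialVersionUID = 1L;
    private long gc = 0;
    private long total = 0;

    /** Counts the letters of one chunk; anything that is not a letter is ignored. */
    void characters(String chunk) {
        for (int i = 0; i < chunk.length(); i++) {
            char c = Character.toUpperCase(chunk.charAt(i));
            if (!Character.isLetter(c)) continue;
            if (c == 'G' || c == 'C') gc++;
            total++;
        }
    }

    /** GC content as a percentage of all the letters seen so far. */
    double getResult() {
        return total == 0 ? 0.0 : 100.0 * gc / total;
    }
}
```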

FirstBases

The second implementation of 'SequenceAnalyzer' retrieves the first bases of a sequence.
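A sketch of how the first bases could be collected when the sequence arrives in several chunks. The class FirstBasesSketch and its max parameter are hypothetical. As with the GC% example, the real class would implement SequenceAnalyzer.

```java
/** Hypothetical, standalone sketch: keeps at most 'max' leading bases. */
class FirstBasesSketch implements java.io.Serializable {
    private static final long serialVersionUID = 1L;
    private final int max;
    private final StringBuilder prefix = new StringBuilder();

    FirstBasesSketch(int max) {
        this.max = max;
    }

    /** Appends as much of the chunk as still fits, then ignores the rest. */
    void characters(String chunk) {
        int room = max - prefix.length();
        if (room > 0) {
            prefix.append(chunk, 0, Math.min(room, chunk.length()));
        }
    }

    String getResult() {
        return prefix.toString();
    }
}
```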

The Client

And here is the java code for the client. The client connects to the RMI server and invokes 'analyse' with the two instances of SequenceAnalyzer for some NCBI-gi:

A note about security

As neither the server nor the client wants to receive malicious code, we have to use some policy files:
server.policy:

client.policy:

Compiling and Running

Compiling the client

javac -cp . sandbox/client/SequenceAnalyzerClient.java

Compiling the server

javac -cp . sandbox/server/SequenceAnalyzerServiceImpl.java

Starting the RMI registry

${JAVA_HOME}/bin/rmiregistry

Starting the SequenceAnalyzerServiceImpl

$ java \
 -Djava.security.policy=server.policy \
 -Djava.rmi.server.codebase=file:///path/to/RMI/ \
 -cp . sandbox.server.SequenceAnalyzerServiceImpl

SequenceAnalyzerService bound.

Running the client

$ java  \
 -Djava.rmi.server.codebase=file:///path/to/RMI/ \
 -Djava.security.policy=client.policy  \
 -cp . sandbox.client.SequenceAnalyzerClient  localhost

gi=25 gc%=2.1530612244897958
gi=25 start=TAGTTATTC
gi=26 gc%=2.1443298969072164
gi=26 start=TAGTTATTAA
gi=27 gc%=2.3022222222222224
gi=27 start=AACCAGTATTA
gi=28 gc%=2.376543209876543
gi=28 start=TCGTA
gi=29 gc%=2.2014742014742015
gi=29 start=TCTTTG
That's it, Pierre

31 January 2012

Inside the Variation Toolkit: Tools for Gene Ontology

GeneOntologyDbManager is a C++ tool that is part of my experimental Variation Toolkit.
This program is a set of tools for GeneOntology. It is based on the sqlite3 library.

Download

Download the sources from Google-Code using subversion...
svn checkout http://variationtoolkit.googlecode.com/svn/trunk/ variationtoolkit-read-only
... or update the sources of an existing installation...
cd variationtoolkit
svn update
... and edit the variationtoolkit/config.mk file.

Dependencies

Compilation

Define "SQLITE_LIB" and "SQLITE_CFLAGS" in config.mk (see HowToInstall )
$ cd variationtoolkit/src/
$ make ../bin/godbmgr 

if ! [ -z "-lsqlite3" ] ;then g++ -o ../bin/godbmgr godatabasemgr.cpp xsqlite.cpp application.o xstream.o xxml.o -g `xml2-config --cflags `  /usr/include/sqlite3.h   -lz -lsqlite3 `xml2-config  --libs` ; else g++ -o ../bin/godbmgr godatabasemgr.cpp  -DNOSQLITE -O3 -Wall  ; fi

Usage

godbmgr (program-name) -f database.sqlite [options] (file1.vcf file2... | stdin )

Sub-Program: loadrdf

Loads the RDF/XML GO database (http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz) into the sqlite3 database.

Usage

godbmgr loadrdf -f database.sqlite (stdin|file)

Options

  • -f (filename) the sqlite3 database

Example

$ curl -s "http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz" |\
  gunzip -c |\
  godbmgr loadrdf -f database.sqlite
List the content of the database:
$ sqlite3 -separator '  ' -header  database.sqlite 'select * from TERM where acn="GO:0000007"'
acn xml
GO:0000007 <go:term xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="http://www.geneontology.org/go#GO:0000007">
            <go:accession xmlns:go="http://www.geneontology.org/dtds/go.dtd#">GO:0000007</go:accession>
            <go:name xmlns:go="http://www.geneontology.org/dtds/go.dtd#">low-affinity zinc ion transmembrane transporter activity</go:name>
            <go:definition xmlns:go="http://www.geneontology.org/dtds/go.dtd#">Catalysis of the transfer of a solute or solutes from one side of a membrane to the other according to the reaction: Zn2+ = Zn2+, probably powered by proton motive force. In low affinity transport the transporter is able to bind the solute only if it is present at very high concentrations.</go:definition>
            <go:is_a xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:resource="http://www.geneontology.org/go#GO:0005385"/>
        </go:term>


$ sqlite3 -separator '  ' -header  database.sqlite 'select * from TERM2REL where acn="GO:0000007"'
acn rel target
GO:0000007 is_a GO:0005385

Sub-Program: loadgoa

Inserts the GOA database into a sqlite3 database (e.g. ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz).

Usage

godbmgr loadgoa -f database.sqlite (stdin|file)

Options

  • -f (filename) the sqlite3 database

Examples

$  curl -s "ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz" |\
     gunzip -c |\
     godbmgr loadgoa -f database.sqlite
List the content of the database:
$ sqlite3 -line   database.sqlite 'select * from GOA where term="GO:0005385" limit 2' 
              DB = UniProtKB
    DB_Object_ID = B3KU87
DB_Object_Symbol = SLC30A6
            term = GO:0005385
  DB_Object_Name = cDNA FLJ45816 fis, clone NT2RP7019682, highly similar to Homo sapiens solute carrier family 30 (zinc transporter), member 6 (SLC30A6), mRNA
         Synonym = B3KU87_HUMAN|SLC30A6|hCG_23082|IPI01009565|B7WP49
  DB_Object_Type = protein

              DB = UniProtKB
    DB_Object_ID = B5MCR8
DB_Object_Symbol = SLC30A6
            term = GO:0005385
  DB_Object_Name = Solute carrier family 30 (Zinc transporter), member 6, isoform CRA_b
         Synonym = B5MCR8_HUMAN|SLC30A6|hCG_23082|IPI00894292
  DB_Object_Type = protein

Sub-Program: desc

Prints the descendants (children) of a given GO node.
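The traversal behind such a query can be sketched in memory. The class GoDescendantsSketch is hypothetical and does not use sqlite3. It takes child-to-parent is_a pairs, like the rows of the TERM2REL table, inverts them and walks down from the requested term. As in the tool's output, the starting term is part of the result.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class GoDescendantsSketch {
    /** isA maps a child accession to its parent accessions. */
    static SortedSet<String> descendants(Map<String, List<String>> isA, String root) {
        // Invert the relation: parent -> children.
        Map<String, List<String>> children = new HashMap<>();
        for (Map.Entry<String, List<String>> e : isA.entrySet()) {
            for (String parent : e.getValue()) {
                children.computeIfAbsent(parent, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Depth-first walk; 'seen' also protects against visiting a term twice.
        SortedSet<String> seen = new TreeSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String acn = stack.pop();
            if (!seen.add(acn)) continue;
            for (String child : children.getOrDefault(acn, Collections.<String>emptyList())) {
                stack.push(child);
            }
        }
        return seen;
    }
}
```

Walking up to the ascendants is the same loop applied to the relation without inverting it.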

Usage

godbmgr desc -f db.sqlite [options] term1 term2 ... termn

Options

Examples

Default output

$ godbmgr desc -f database.sqlite "GO:0005385"
GO:0000006
GO:0000007
GO:0005385
GO:0015341
GO:0015633
GO:0016463
GO:0022883

xml/rdf output

$ godbmgr desc -f database.sqlite  -t xml "GO:0005385" | head

<go:go xmlns:go='http://www.geneontology.org/dtds/go.dtd#' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:RDF>
<go:term xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="http://www.geneontology.org/go#GO:0000006">
            <go:accession xmlns:go="http://www.geneontology.org/dtds/go.dtd#">GO:0000006</go:accession>
            <go:name xmlns:go="http://www.geneontology.org/dtds/go.dtd#">high affinity zinc uptake transmembrane transporter activity</go:name>
            <go:definition xmlns:go="http://www.geneontology.org/dtds/go.dtd#">Catalysis of the transfer of a solute or solutes from one side of a membrane to the other according to the reaction: Zn2+(out) = Zn2+(in), probably powered by proton motive force. In high affinity transport the transporter is able to bind the solute even if it is only present at very low concentrations.</go:definition>
            <go:is_a xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:resource="http://www.geneontology.org/go#GO:0005385"/>
        </go:term>
<go:term xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="http://www.geneontology.org/go#GO:0000007">
            <go:accession xmlns:go="http://www.geneontology.org/dtds/go.dtd#">GO:0000007</go:accession>

GOA output

$ godbmgr desc -f database.sqlite -t goa  "GO:0005385"

UniProtKB B3KU87 SLC30A6 GO:0005385 cDNA FLJ45816 fis, clone NT2RP7019682, highly similar to Homo sapiens solute carrier family 30 (zinc transporter), member 6 (SLC30A6), mRNA B3KU87_HUMAN|SLC30A6|hCG_23082|IPI01009565|B7WP49 protein
UniProtKB B5MCR8 SLC30A6 GO:0005385 Solute carrier family 30 (Zinc transporter), member 6, isoform CRA_b B5MCR8_HUMAN|SLC30A6|hCG_23082|IPI00894292 protein
(..)
UniProtKB Q99726 SLC30A3 GO:0015633 Zinc transporter 3 ZNT3_HUMAN|ZNT3|SLC30A3|IPI00293793|Q8TC03 protein

TSV output

$ godbmgr desc -f database.sqlite  -t tsv "GO:0022857" |\
    cut -c 1-100 |\
    head
#go:accession go.name go.def
GO:0000006 high affinity zinc uptake transmembrane transporter activity Catalysis of the transfer of
GO:0000007 low-affinity zinc ion transmembrane transporter activity Catalysis of the transfer of a s
GO:0000064 L-ornithine transmembrane transporter activity Catalysis of the transfer of L-ornithine f
GO:0000095 S-adenosylmethionine transmembrane transporter activity Catalysis of the transfer of S-ad
GO:0000099 sulfur amino acid transmembrane transporter activity Catalysis of the transfer of sulfur 
GO:0000100 S-methylmethionine transmembrane transporter activity Catalysis of the transfer of S-meth
GO:0000102 L-methionine secondary active transmembrane transporter activity Catalysis of the transfe
GO:0000227 oxaloacetate secondary active transmembrane transporter activity Catalysis of the transfe
GO:0000269 toxin export channel activity Enables the energy independent passage of toxins, sized les
(...)

Sub-Program: asc

Prints the ascendants (parents) of a given node.

Usage

godbmgr asc -f db.sqlite [options] term1 term2 ... termn

Options

Examples

Default output

$ godbmgr asc -f database.sqlite "GO:0022857"
GO:0003674
GO:0005215
GO:0022857
all

xml/rdf output

$ godbmgr asc -f database.sqlite  -t xml "GO:0022857" | head

<go:go xmlns:go='http://www.geneontology.org/dtds/go.dtd#' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:RDF>
<go:term xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:about="http://www.geneontology.org/go#GO:0003674">
            <go:accession xmlns:go="http://www.geneontology.org/dtds/go.dtd#">GO:0003674</go:accession>
            <go:name xmlns:go="http://www.geneontology.org/dtds/go.dtd#">molecular_function</go:name>
            <go:synonym xmlns:go="http://www.geneontology.org/dtds/go.dtd#">GO:0005554</go:synonym>
            <go:synonym xmlns:go="http://www.geneontology.org/dtds/go.dtd#">molecular function</go:synonym>
            <go:synonym xmlns:go="http://www.geneontology.org/dtds/go.dtd#">molecular function unknown</go:synonym>
            <go:definition xmlns:go="http://www.geneontology.org/dtds/go.dtd#">Elemental activities, such as catalysis or binding, describing the actions of a gene product at the molecular level. A given gene product may exhibit one or more molecular functions.</go:definition>
            <go:comment xmlns:go="http://www.geneontology.org/dtds/go.dtd#">Note that, in addition to forming the root of the molecular function ontology, this term is recommended for use for the annotation of gene products whose molecular function is unknown. Note that when this term is used for annotation, it indicates that no information was available about the molecular function of the gene product annotated as of the date the annotation was made; the evidence code ND, no data, is used to indicate this.</go:comment>

GOA output

$ godbmgr desc -f database.sqlite -t goa  "GO:0005385"

UniProtKB B3KU87 SLC30A6 GO:0005385 cDNA FLJ45816 fis, clone NT2RP7019682, highly similar to Homo sapiens solute carrier family 30 (zinc transporter), member 6 (SLC30A6), mRNA B3KU87_HUMAN|SLC30A6|hCG_23082|IPI01009565|B7WP49 protein
UniProtKB B5MCR8 SLC30A6 GO:0005385 Solute carrier family 30 (Zinc transporter), member 6, isoform CRA_b B5MCR8_HUMAN|SLC30A6|hCG_23082|IPI00894292 protein
(..)
UniProtKB Q99726 SLC30A3 GO:0015633 Zinc transporter 3 ZNT3_HUMAN|ZNT3|SLC30A3|IPI00293793|Q8TC03 protein

TSV output

$ godbmgr asc -f database.sqlite  -t tsv "GO:0022857"   
#go:accession go.name go.def
GO:0003674 molecular_function Elemental activities, such as catalysis or binding, describing the actions of a gene product at the molecular level. A given gene product may exhibit one or more molecular functions.
GO:0005215 transporter activity Enables the directed movement of substances (such as macromolecules, small molecules, ions) into, out of or within a cell, or between cells.
GO:0022857 transmembrane transporter activity Enables the transfer of a substance from one side of a membrane to the other.
all all .

Sub-program: goa

Annotate a TSV file with the GOA annotation.

Usage

godbmgr goa -f db.sqlite [options] (stdin|files)

Options

  • -f (filename) the sqlite3 database
  • -c (column index) REQUIRED. The observed column.

Example

$ echo -e "#MyGene\nHello\nNOTCH2" |\
   godbmgr goa -c 1 -f database.sqlite  |\
   head -n 4 |\
   verticalize

>>> 2
$1 #MyGene                Hello
$2 DB                     .
$3 DB_Object_ID           .
$4 DB_Object_Symbol       .
$5 term                   .
$6 DB_Object_Name,Synonym .
$7 DB_Object_Type         .
$8   ???                    .
<<< 2

>>> 3
$1 #MyGene                NOTCH2
$2 DB                     UniProtKB
$3 DB_Object_ID           Q04721
$4 DB_Object_Symbol       NOTCH2
$5 term                   GO:0001709
$6 DB_Object_Name,Synonym Neurogenic locus notch homolog protein 2
$7 DB_Object_Type         NOTC2_HUMAN|NOTCH2|IPI00297655|Q5T3X7|Q99734|Q9H240
$8   ???                    protein
<<< 3

>>> 4
$1 #MyGene                NOTCH2
$2 DB                     UniProtKB
$3 DB_Object_ID           Q04721
$4 DB_Object_Symbol       NOTCH2
$5 term                   GO:0004872
$6 DB_Object_Name,Synonym Neurogenic locus notch homolog protein 2
$7 DB_Object_Type         NOTC2_HUMAN|NOTCH2|IPI00297655|Q5T3X7|Q99734|Q9H240
$8   ???                    protein

Sub-Program: grep

Filters the lines whose identifier (gene...) is a child of a given GO term.
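A sketch of the filtering step alone, assuming the set of accepted accessions (the requested terms and all their children) has already been computed. The class GoGrepSketch is hypothetical. The 1-based column index mirrors the -c option, and the invert flag mirrors what -v appears to do in the example. Passing header lines through unchanged and skipping empty lines are assumptions based on that example's output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GoGrepSketch {
    /** Keeps the lines whose 1-based 'column' is in 'accepted', or the others if 'invert' is true. */
    static List<String> grep(List<String> lines, int column, Set<String> accepted, boolean invert) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) continue;
            if (line.startsWith("#")) {
                kept.add(line); // header lines are passed through unchanged
                continue;
            }
            String[] tokens = line.split("\t");
            boolean match = column <= tokens.length && accepted.contains(tokens[column - 1]);
            if (match != invert) kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("#MyACN", "GO:0003674", "GO:0001618", "");
        // Assumed to be among the children of the requested terms.
        Set<String> accepted = new HashSet<>(Arrays.asList("GO:0001618"));
        System.out.println(grep(input, 1, accepted, false));
        System.out.println(grep(input, 1, accepted, true));
    }
}
```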

Usage

godbmgr grep -f db.sqlite [options] (stdin|files)

Options

Example

$ echo -e "#MyACN\nGO:0003674\nGO:0001618\n" |\
  godbmgr grep -f database.sqlite -c 1 -t GO:0004872 -t GO:0004879 

#MyACN
GO:0001618
$ echo -e "#MyACN\nGO:0003674\nGO:0001618\n" |\
  godbmgr grep -f database.sqlite -c 1 -t GO:0004872 -t GO:0004879 -v

#MyACN
GO:0003674


That's it,

Pierre

28 January 2012

Inside the variation toolkit: VCF2XML

vcf2xml is a C++ tool that is part of my Variation Toolkit.
It transforms a "Variant Call Format document" to XML, so it can be later processed with xslt, xquery, etc...
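The mapping from one VCF data line to one variation element can be sketched with the JDK's StAX writer. The class VcfLineToXmlSketch is hypothetical and is not how the C++ tool is implemented. The element names are copied from the example output in this post. The sketch assumes a tab-delimited line with at least the eight fixed VCF columns and leaves out the sample calls.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class VcfLineToXmlSketch {
    private static void element(XMLStreamWriter w, String name, String text)
            throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(text);
        w.writeEndElement();
    }

    /** Converts CHROM, POS, REF, ALT, QUAL and INFO of one data line to XML. */
    static String toXml(String vcfLine) {
        String[] t = vcfLine.split("\t");
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("variation");
            element(w, "chrom", t[0]);
            element(w, "pos", t[1]);
            element(w, "ref", t[3]);
            element(w, "alt", t[4]);
            element(w, "qual", t[5]);
            w.writeStartElement("infos");
            for (String kv : t[7].split(";")) {
                String[] pair = kv.split("=", 2); // flags have no '=' and no value
                w.writeStartElement("info");
                w.writeAttribute("key", pair[0]);
                if (pair.length == 2) w.writeCharacters(pair[1]);
                w.writeEndElement();
            }
            w.writeEndElement(); // infos
            w.writeEndElement(); // variation
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml("chr1\t112697\t.\tT\tG\t10.4\t.\tDP=1;MQ=60"));
    }
}
```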

Dependencies

Download

Download the sources from Google-Code using subversion...
svn checkout http://variationtoolkit.googlecode.com/svn/trunk/ variationtoolkit-read-only
... or update the sources of an existing installation...
cd variationtoolkit
svn update
... and edit the variationtoolkit/config.mk file.

Compiling:

$ cd variationtoolkit/src/
$ make ../bin/vcf2xml

g++ -o ../bin/vcf2xml vcf2xml.cpp application.o -O3 -Wall `xml2-config --cflags --libs` -lz

Usage:

vcf2xml (file.vcf | stdin)

Example:

$ vcf2xml input.vcf | xmllint --format -

<?xml version="1.0" encoding="UTF-8"?>
<vcf>
  <head>
    <meta key="fileformat">VCFv4.1</meta>
    <meta key="samtoolsVersion">0.1.17 (r973:277)</meta>
    <infos>
      <info>
        <id>DP</id>
        <number>1</number>
        <type>Integer</type>
        <description>Raw read depth</description>
      </info>
      <info>
        <id>DP4</id>
        <number>4</number>
        <type>Integer</type>
        <description># high-quality ref-forward bases</description>
      </info>
      <info>
        <id>MQ</id>
(...)
      </calls>
    </variation>
    <variation>
      <chrom>chr1</chrom>
      <pos>112697</pos>
      <ref>T</ref>
      <alt>G</alt>
      <qual>10.4</qual>
      <infos>
        <info key="DP">1</info>
        <info key="AF1">1</info>
        <info key="AC1">2</info>
        <info key="DP4">0,0,0,1</info>
        <info key="MQ">60</info>
        <info key="FQ">-30</info>
      </infos>
      <calls>
        <call sample="input.bam">
          <prop key="GT">1/1</prop>
          <prop key="PL">40,3,0</prop>
          <prop key="GQ">5</prop>
        </call>
      </calls>
    </variation>
  </body>
</vcf>
That's it,
Pierre

Insert your VCFs in a sqlite database.

vcf2sqlite is a C++ tool that is part of my Variation Toolkit.
It inserts a "Variant Call Format document" (VCF) into a sqlite3 database.
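To give an idea of how a genotype column becomes rows of the VCFCALL table shown in the example further down, here is a sketch that splits the FORMAT column and one sample column into (nIndex, prop, value) triples. The class VcfCallRowsSketch is hypothetical. The rows are returned as plain strings instead of being inserted with sqlite3, and the ids of the VCF row and of the sample are left out.

```java
import java.util.ArrayList;
import java.util.List;

class VcfCallRowsSketch {
    /** One "nIndex|prop|value" string per key of the FORMAT column. */
    static List<String> rows(String format, String sampleColumn) {
        String[] keys = format.split(":");
        String[] values = sampleColumn.split(":");
        List<String> rows = new ArrayList<>();
        // A sample column may be shorter than FORMAT when trailing fields are dropped.
        for (int i = 0; i < keys.length && i < values.length; i++) {
            rows.add(i + "|" + keys[i] + "|" + values[i]);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : rows("GT:PL:GQ", "1/1:46,6,0:10")) {
            System.out.println(row);
        }
    }
}
```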

Download

Download the sources from Google-Code using subversion...
svn checkout http://variationtoolkit.googlecode.com/svn/trunk/ variationtoolkit-read-only
... or update the sources of an existing installation...
cd variationtoolkit
svn update
... and edit the variationtoolkit/config.mk file.

Dependencies

http://www.sqlite.org/ : libraries and headers for sqlite3.

Compilation

Define "SQLITE_LIB" and "SQLITE_CFLAGS" in config.mk (see HowToInstall )
$ cd variationtoolkit/src/
$ make ../bin/vcf2sqlite 

if ! [ -z "$(SQLITE_LIB)" ] ;then g++ -o ../bin/vcf2sqlite vcf2sqlite.cpp xsqlite.cpp application.o -O3 -Wall -lz   ; else g++ -o ../bin/vcf2sqlite vcf2sqlite.cpp  -DNOSQLITE -O3 -Wall  ; fi

Usage

vcf2sqlite -f database.sqlite (file1.vcf file2... | stdin )

Options

  • -f (file) sqlite3 database (REQUIRED).

Schema


Example:

$ vcf2sqlite -f db.sqlite file.vcf
$ sqlite3 -line db.sqlite  "select * from VCFCALL LIMIT 4"

       id = 1
   nIndex = 0
vcfrow_id = 1
sample_id = 1
     prop = GT
    value = 1/1

       id = 2
   nIndex = 1
vcfrow_id = 1
sample_id = 1
     prop = PL
    value = 46,6,0

       id = 3
   nIndex = 2
vcfrow_id = 1
sample_id = 1
     prop = GQ
    value = 10

       id = 4
   nIndex = 0
vcfrow_id = 2
sample_id = 1
     prop = GT
    value = 1/1

$ sqlite3 -column -header  db.sqlite \
   "select SAMPLE.name,VCFCALL.value,count(*) from VCFCALL,SAMPLE where SAMPLE.id=VCFCALL.sample_id and prop='GT' group by SAMPLE.id,VCFCALL.value"

name         value       count(*)  
-----------  ----------  ----------
rmdup_1.bam  0/1         545       
rmdup_1.bam  1/1         429       
rmdup_2.bam  0/1         625       
rmdup_2.bam  1/1         349       
rmdup_3.bam  0/1         595       
rmdup_3.bam  1/1         379       
rmdup_4.bam  0/1         548       
rmdup_4.bam  1/1         426       
rmdup_5.bam  0/1         564       
rmdup_5.bam  1/1         410       
rmdup_6.bam  0/1         724       
rmdup_6.bam  1/1         250
That's it
Pierre

07 January 2012

A CGI-version of samtools tview.

I've created a lightweight CGI-based web application for samtools tview. This C++ program, named ngsproject.cgi, uses the samtools API and allows any user to visualize all the alignments in a given NGS project. The projects and their BAMs are defined on the server side using a simple XML document, e.g.:

<?xml version="1.0"?>
<projects>
 <reference id="hg19">
  <path>/home/lindenb/samtools-0.1.18/examples/ex1.fa</path>
 </reference>
 <bam id="b1">
  <sample>Sample 1</sample>
  <path>/home/lindenb/samtools-0.1.18/examples/ex1.bam</path>
 </bam>
 <bam id="b2">
  <sample>Sample 2</sample>
  <path>/home/lindenb/samtools-0.1.18/examples/ex1.bam</path>
 </bam>
 <project id="1">
  <name>Test 1</name>
  <description>Test</description>
  <bam ref="b1"/>
  <bam ref="b2"/>
  <reference ref="hg19" />
 </project>
 <project id="2">
  <name>Test 2</name>
  <description>Test</description>
  <bam ref="b2"/>
  <reference ref="hg19" />
 </project>
</projects>

Once the CGI has been installed, the user can visualize the reads of each sample.

This tool is available in the variation toolkit at http://code.google.com/p/variationtoolkit/.

That's it.

Pierre

05 January 2012

The Variation Toolkit

Over the last few weeks, I've worked on an experimental C++ package named The Variation Toolkit (varkit). It was originally designed to provide command-line equivalents of knime4bio, but I've added more tools over time. Some of those tools are very simple and stupid (fasta2tsv), some reinvent the wheel ("numericsplit"), some are part of an answer on Biostar, and some are old tools (e.g. bam2wig) that have been moved to this package. Others, like "samplepersnp" and "groupbygene", might be useful to people.
The package is available at: http://code.google.com/p/variationtoolkit/.

Here is the current documentation (05 Jan 2012):




That's it,

Pierre

01 December 2011

Suggest some new terms for the EDAM Ontology for Bioinformatics

EDAM is an ontology of general bioinformatics concepts, including topics and data types, formats, identifiers and operations.
Is your specific subject of research present in this ontology (e.g. "RNA-Seq")? Go and have a look at http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM. If it is not, feel free to suggest a new term in the form below. Your term might be included in the next version of the ontology and it might be used as a possible choice for the Bioinformatics Career Survey 2011/2012.

That's it, Pierre

20 November 2011

Processing json data with apache velocity.

I've written a tool named "velocityjson" which parses JSON data and processes it with "Apache Velocity" (a template engine). The (javacc) source code is available here:


https://github.com/lindenb/jsandbox/blob/master/src/sandbox/VelocityJson.jj

Example

Say you have defined some classes using JSON:

[
  {
    "type": "record",
    "name": "Exon",
    "fields" : [
      {"name": "start", "type": "int"},
      {"name": "end", "type": "int"}
    ]
  },
  {
    "type": "record",
    "name": "Gene",
    "fields" : [
      {"name": "chrom", "type": "string"},
      {"name": "name", "type": "string"},
      {"name": "txStart", "type": "int"},
      {"name": "txEnd", "type": "int"},
      {"name": "cdsStart", "type": "int"},
      {"name": "cdsEnd", "type": "int"},
      {"name": "exons", "type":{"type":"array","items":"Exon"}}
    ]
  } 
 ]
And here is a Velocity template transforming this JSON structure to Java:

#macro(javaName $s)$s.substring(0,1).toUpperCase()$s.substring(1)#end
#macro(setter $s)set#javaName($s)#end
#macro(getter $s)get#javaName($s)#end
#macro(javaType $f)
#if($f.type.equals("string"))
java.lang.String#elseif($f.type.equals("boolean"))
boolean#elseif($f.type.equals("long"))
long#elseif($f.type.equals("float"))
float#elseif($f.type.equals("double"))
double#elseif($f.type.equals("int"))
int#elseif($f.items)
$f.items#elseif($f.type.type.equals("array"))
java.util.List<#javaType($f.type)>#else
$f.type
#end
#end

#foreach( $class in $avro)

class $class.name
{
#foreach( $field in $class.fields )
private  #javaType($field) $field.name;
#end

public ${class.name}()
 {
 }

public ${class.name}(#foreach( $field in $class.fields )
 #if($velocityCount>1),#end#javaType($field) $field.name
 #end
 )
 {
 #foreach( $field in $class.fields )
 this.$field.name=$field.name;
 #end
 }
 


#foreach( $field in $class.fields )
public void #setter($field.name)(#javaType($field) $field.name)
 {
 this.$field.name=$field.name;
 }
public #javaType($field) #getter($field.name)()
 {
 return this.$field.name;
 }
#end
}
#end
The json file can be processed with velocity using the following command line:

$ java -jar velocityjson.jar -f avro structure.json json2java.vm

Result

class Exon
{
private  int start;
private  int end;

public Exon()
 {
 }

public Exon( int start
  ,int end
  )
 {
  this.start=start;
  this.end=end;
  }
 


public void setStart(int start)
 {
 this.start=start;
 }
public int getStart()
 {
 return this.start;
 }
public void setEnd(int end)
 {
 this.end=end;
 }
public int getEnd()
 {
 return this.end;
 }
}

class Gene
{
private  java.lang.String chrom;
private  java.lang.String name;
private  int txStart;
private  int txEnd;
private  int cdsStart;
private  int cdsEnd;
private  java.util.List<Exon> exons;

public Gene()
 {
 }

public Gene( java.lang.String chrom
  ,java.lang.String name
  ,int txStart
  ,int txEnd
  ,int cdsStart
  ,int cdsEnd
  ,java.util.List<Exon> exons
  )
 {
  this.chrom=chrom;
  this.name=name;
  this.txStart=txStart;
  this.txEnd=txEnd;
  this.cdsStart=cdsStart;
  this.cdsEnd=cdsEnd;
  this.exons=exons;
  }
 


public void setChrom(java.lang.String chrom)
 {
 this.chrom=chrom;
 }
public java.lang.String getChrom()
 {
 return this.chrom;
 }
public void setName(java.lang.String name)
 {
 this.name=name;
 }
public java.lang.String getName()
 {
 return this.name;
 }
public void setTxStart(int txStart)
 {
 this.txStart=txStart;
 }
public int getTxStart()
 {
 return this.txStart;
 }
public void setTxEnd(int txEnd)
 {
 this.txEnd=txEnd;
 }
public int getTxEnd()
 {
 return this.txEnd;
 }
public void setCdsStart(int cdsStart)
 {
 this.cdsStart=cdsStart;
 }
public int getCdsStart()
 {
 return this.cdsStart;
 }
public void setCdsEnd(int cdsEnd)
 {
 this.cdsEnd=cdsEnd;
 }
public int getCdsEnd()
 {
 return this.cdsEnd;
 }
public void setExons(java.util.List<Exon> exons)
 {
 this.exons=exons;
 }
public java.util.List<Exon> getExons()
 {
 return this.exons;
 }
}


That's it,

Pierre

16 November 2011

"VCF annotation" with the NHLBI GO Exome Sequencing Project (JAX-WS)

The NHLBI Exome Sequencing Project (ESP) has released a web service to query their data. "The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.".
In the current post, I'll show how I've used this web service to annotate a VCF file with this information.
The web service provided by the ESP is based on the SOAP protocol.
Here is an example of the XML response:

We can generate the java classes for a client invoking this Web Service by using ${JAVA_HOME}/bin/wsimport.

$ wsimport -keep "http://evs.gs.washington.edu/wsEVS/EVSDataQueryService?wsdl"

parsing WSDL...
generating code...
compiling code...

Here is the java code running this client. It scans the VCF, calls the web service for each variation and inserts the annotation as JSON in a new column.
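The column-appending logic can be sketched separately from the SOAP call. The class VcfJsonColumnSketch and the lookup function are hypothetical. In the real client, the lookup would call the classes generated by wsimport and serialize their answer to JSON. Here it is any function from (chromosome, position) to a string.

```java
import java.util.function.BiFunction;

class VcfJsonColumnSketch {
    /** Returns the line with one extra tab-delimited column. */
    static String annotate(String line, BiFunction<String, Integer, String> lookup) {
        if (line.startsWith("##")) {
            return line; // meta-information lines are copied as they are
        }
        if (line.startsWith("#")) {
            return line + "\tEVS"; // header line: name the new column
        }
        String[] tokens = line.split("\t");
        String chrom = tokens[0];
        int pos = Integer.parseInt(tokens[1]);
        return line + "\t" + lookup.apply(chrom, pos);
    }

    public static void main(String[] args) {
        // Stub standing in for the web service call.
        BiFunction<String, Integer, String> stub =
            (chrom, pos) -> "{\"chromosome\":\"" + chrom + "\",\"start\":" + pos + "}";
        System.out.println(annotate("#CHROM\tPOS\tID", stub));
        System.out.println(annotate("1\t10469\trs117577454", stub));
    }
}
```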
... and the makefile:

Result (some columns have been cut)

curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/EUR.2of4intersection_allele_freq.20100804.sites.vcf.gz" |\
 gunzip -c |\
 java -jar evsclient.jar 



##fileformat=VCFv4.0
##filedat=20101112
##datarelease=20100804
##samples=629
##description="Where BI calls are present, genotypes and alleles are from BI.  In there absence, UM genotypes are used.  If neither are available, no genotype information is present and the alleles are from the NCBI calls."
(...)
#CHROM POS ID EVS
1 10469 rs117577454 {"start":10469,"chromosome":"1","stop":10470,"strand":"+","snpList":[],"setOfSiteCoverageInfo":[]}
1 10583 rs58108140 {"start":10583,"chromosome":"1","stop":10584,"strand":"+","snpList":[],"setOfSiteCoverageInfo":[]}
1 11508 . {"start":11508,"chromosome":"1","stop":11509,"strand":"
(...)
1 69511 . {"start":69511,"chromosome":"1","stop":69512,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"1.0","conservationScoreGERP":"0.5","refAllele":"A","ancestralAllele":"G","filters":"PASS","clinicalLink":"unknown","positionString":"1:69511","chrPosition":69511,"alleles":"G/A","uaAlleleCounts":"1373/47","aaAlleleCounts":"880/600","totalAlleleCounts":"2253/647","uaAlleleAndCount":"G=1373/A=47","aaAlleleAndCount":"G=880/A=600","totalAlleleAndCount":"G=2253/A=647","uaMAF":3.3099,"aaMAF":40.5405,"totalMAF":22.3103,"avgSampleReadDepth":185,"geneList":"OR4F5","snpFunction":{"chromosome":"1","position":69511,"conservationScore":"1.0","conservationScoreGERP":"0.5","snpFxnList":[{"mrnaAccession":"NM_001005484","fxnClassGVS":"missense","aminoAcids":"THR,ALA","proteinPos":"141/306","cdnaPos":421,"pphPrediction":"benign","granthamScore":"58"}],"refAllele":"A","ancestralAllele":"G","firstRsId":75062661,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"G","hasAtLeastOneAccession":"true","rsIds":"rs75062661"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":69511,"avgSampleReadDepth":185.0,"totalSamplesCovered":1452,"eaSamplesCovered":712,"avgEaSampleReadDepth":157.0,"aaSamplesCovered":740,"avgAaSampleReadDepth":211.0},{"chromosome":"1","position":69512,"avgSampleReadDepth":180.0,"totalSamplesCovered":1501,"eaSamplesCovered":739,"avgEaSampleReadDepth":153.0,"aaSamplesCovered":762,"avgAaSampleReadDepth":207.0}]}
(...)
1 901923 . {"start":901923,"chromosome":"1","stop":901924,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"1.0","conservationScoreGERP":"5.0","refAllele":"C","ancestralAllele":"C","filters":"PASS","clinicalLink":"unknown","positionString":"1:901923","chrPosition":901923,"alleles":"A/C","uaAlleleCounts":"2/2542","aaAlleleCounts":"52/1934","totalAlleleCounts":"54/4476","uaAlleleAndCount":"A=2/C=2542","aaAlleleAndCount":"A=52/C=1934","totalAlleleAndCount":"A=54/C=4476","uaMAF":0.0786,"aaMAF":2.6183,"totalMAF":1.1921,"avgSampleReadDepth":35,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":901923,"conservationScore":"1.0","conservationScoreGERP":"5.0","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"missense","aminoAcids":"SER,ARG","proteinPos":"4/612","cdnaPos":12,"pphPrediction":"probably-damaging","granthamScore":"110"}],"refAllele":"C","ancestralAllele":"C","firstRsId":0,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"A","hasAtLeastOneAccession":"true","rsIds":"none"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":901923,"avgSampleReadDepth":35.0,"totalSamplesCovered":2280,"eaSamplesCovered":1272,"avgEaSampleReadDepth":32.0,"aaSamplesCovered":1008,"avgAaSampleReadDepth":38.0},{"chromosome":"1","position":901924,"avgSampleReadDepth":35.0,"totalSamplesCovered":2283,"eaSamplesCovered":1273,"avgEaSampleReadDepth":32.0,"aaSamplesCovered":1010,"avgAaSampleReadDepth":38.0}]}
1 902069 rs116147894 {"start":902069,"chromosome":"1","stop":902070,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"0.0","conservationScoreGERP":"1.0","refAllele":"T","ancestralAllele":"T","filters":"PASS","clinicalLink":"unknown","positionString":"1:902069","chrPosition":902069,"alleles":"C/T","uaAlleleCounts":"2/320","aaAlleleCounts":"18/212","totalAlleleCounts":"20/532","uaAlleleAndCount":"C=2/T=320","aaAlleleAndCount":"C=18/T=212","totalAlleleAndCount":"C=20/T=532","uaMAF":0.6211,"aaMAF":7.8261,"totalMAF":3.6232,"avgSampleReadDepth":13,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":902069,"conservationScore":"0.0","conservationScoreGERP":"1.0","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"intron","aminoAcids":"none","proteinPos":"NA","cdnaPos":-1,"pphPrediction":"unknown","granthamScore":"NA"}],"refAllele":"T","ancestralAllele":"T","firstRsId":0,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"C","hasAtLeastOneAccession":"true","rsIds":"none"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":902069,"avgSampleReadDepth":13.0,"totalSamplesCovered":304,"eaSamplesCovered":169,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":135,"avgAaSampleReadDepth":12.0},{"chromosome":"1","position":902070,"avgSampleReadDepth":12.0,"totalSamplesCovered":338,"eaSamplesCovered":190,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":148,"avgAaSampleReadDepth":12.0}]}
1 902108 rs62639981 {"start":902108,"chromosome":"1","stop":902109,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"0.0","conservationScoreGERP":"-8.7","refAllele":"C","ancestralAllele":"unknown","filters":"PASS","clinicalLink":"unknown","positionString":"1:902108","chrPosition":902108,"alleles":"T/C","uaAlleleCounts":"5/333","aaAlleleCounts":"0/248","totalAlleleCounts":"5/581","uaAlleleAndCount":"T=5/C=333","aaAlleleAndCount":"T=0/C=248","totalAlleleAndCount":"T=5/C=581","uaMAF":1.4793,"aaMAF":0.0,"totalMAF":0.8532,"avgSampleReadDepth":13,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":902108,"conservationScore":"0.0","conservationScoreGERP":"-8.7","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"coding-synonymous","aminoAcids":"none","proteinPos":"36/612","cdnaPos":108,"pphPrediction":"unknown","granthamScore":"NA"}],"refAllele":"C","ancestralAllele":"unknown","firstRsId":62639981,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"T","hasAtLeastOneAccession":"true","rsIds":"rs62639981"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":902108,"avgSampleReadDepth":13.0,"totalSamplesCovered":294,"eaSamplesCovered":170,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":124,"avgAaSampleReadDepth":13.0},{"chromosome":"1","position":902109,"avgSampleReadDepth":13.0,"totalSamplesCovered":309,"eaSamplesCovered":177,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":132,"avgAaSampleReadDepth":13.0}]}
(...)
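The JSON column can then be post-processed. As a rough sketch, the snippet below pulls the first "totalMAF" value out of an annotation like the ones above. It is a shortcut that uses a regular expression rather than a real JSON parser, and the class and method names (EvsMafSketch, firstTotalMaf) are hypothetical. The two sample strings are shortened versions of the records shown in the result.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EvsMafSketch {
    // Matches "totalMAF":<number>, as it appears in the snpList entries above.
    private static final Pattern TOTAL_MAF =
        Pattern.compile("\"totalMAF\":(-?[0-9]+(?:\\.[0-9]+)?)");

    /** Returns the first totalMAF found in the JSON text, or empty if there is none. */
    static OptionalDouble firstTotalMaf(String json) {
        Matcher m = TOTAL_MAF.matcher(json);
        return m.find()
            ? OptionalDouble.of(Double.parseDouble(m.group(1)))
            : OptionalDouble.empty();
    }

    public static void main(String[] args) {
        // Shortened record with one SNP (values copied from position 1:69511)
        String withSnp = "{\"start\":69511,\"snpList\":[{\"uaMAF\":3.3099,"
            + "\"aaMAF\":40.5405,\"totalMAF\":22.3103}]}";
        // Record without any SNP (as for position 1:10469)
        String noSnp = "{\"start\":10469,\"snpList\":[],\"setOfSiteCoverageInfo\":[]}";
        System.out.println(firstTotalMaf(withSnp));
        System.out.println(firstTotalMaf(noSnp));
    }
}
```

For anything beyond a quick filter, a proper JSON library would be a safer choice than a regular expression.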
That's it
Pierre