07 January 2011

SVG image in Wikipedia + Zoom.it = a Zoomable "Tree of life"

Zoom.it is a free service for viewing and sharing high-resolution imagery. You give it the link to any image on the web (including SVG and PDF files), and it returns a zoomable view of that image along with a nice short URL. As a test, I used the SVG file "Tree of life with genome size.svg" from Wikipedia, and here is the awesome result generated by Zoom.it:



That's it,
Pierre

05 January 2011

Template:Infobox biodatabase

I've just started creating a Wikipedia infobox to annotate the biological databases in Wikipedia. If many articles use this template, it will be possible to parse them and to build a list of the databases that provide web services, SPARQL endpoints, a download area, etc.
The infobox itself is still a draft, so feel free to modify it or to suggest some other fields in the 'Talk' page.



That's it,

Pierre

Coding a CXF web service translating a DNA to a protein. My notebook

Apache CXF is a Web Services framework. In this post, I'll describe how I implemented a Web Service translating a DNA to a protein using the web server Apache Tomcat and the CXF libraries.

Defining the interface

First, a simple Java interface bio.Translate is needed to describe the service. This simple service receives a string (the DNA) and returns a string (the peptide). The annotations will be used by CXF to name the parameters in the WSDL file (see later).

Implementing the service

bio.TranslateImpl implements bio.Translate. The setter/getter for ncbiString will be used by a configuration file to specify a genetic code (standard, mitochondrial) for this service. The methods initIt and cleanUp could be used to acquire and to release some resources for the service when it is created and/or disposed.
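As a rough illustration of the idea (a sketch, not the original bio.TranslateImpl, and without the CXF annotations), a genetic code can be stored as a 64-letter string in the NCBI table order, where the bases of each codon are ranked T, C, A, G. The class name GeneticCodeSketch is hypothetical; only the ncbiString setter/getter mirrors the description above.

```java
/** Hypothetical sketch: translates DNA with an NCBI-style 64-letter genetic code string. */
class GeneticCodeSketch {
    /** Standard code; codons are ordered TTT, TTC, TTA, TTG, TCT ... GGG. */
    static final String STANDARD =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
    /** Vertebrate mitochondrial code (TGA=W, ATA=M, AGA and AGG=stop). */
    static final String MITOCHONDRIAL =
        "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG";

    private String ncbiString = STANDARD;

    public String getNcbiString() {
        return ncbiString;
    }

    /** This is the property a configuration file could set to choose the genetic code. */
    public void setNcbiString(String ncbiString) {
        if (ncbiString == null || ncbiString.length() != 64) {
            throw new IllegalArgumentException("expected a 64-letter genetic code");
        }
        this.ncbiString = ncbiString;
    }

    /** Rank of a base in the NCBI order T, C, A, G, or -1 if unknown. */
    private static int baseIndex(char c) {
        switch (Character.toUpperCase(c)) {
            case 'T': case 'U': return 0;
            case 'C': return 1;
            case 'A': return 2;
            case 'G': return 3;
            default: return -1;
        }
    }

    /** Translates complete codons only; a codon with an unknown base gives 'X'. */
    public String translate(String dna) {
        StringBuilder peptide = new StringBuilder();
        for (int i = 0; i + 2 < dna.length(); i += 3) {
            int b1 = baseIndex(dna.charAt(i));
            int b2 = baseIndex(dna.charAt(i + 1));
            int b3 = baseIndex(dna.charAt(i + 2));
            if (b1 < 0 || b2 < 0 || b3 < 0) {
                peptide.append('X');
            } else {
                peptide.append(ncbiString.charAt(b1 * 16 + b2 * 4 + b3));
            }
        }
        return peptide.toString();
    }

    public static void main(String[] args) {
        GeneticCodeSketch std = new GeneticCodeSketch();
        System.out.println(std.translate("ATGAACTATCGATGCTACGACTGATCG"));
    }
}
```

Keeping the table as a plain string is what makes it convenient to inject from a configuration file: switching to the mitochondrial code is only a matter of passing another 64-letter string to setNcbiString.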

Configuring the service

CXF uses the libraries of the Spring framework (I blogged about Spring here). An XML config file, beans.xml, makes it easy to configure two Java beans for the 'standard genetic code' and the 'mitochondrial code'. In this config file, we also tell Spring about the two methods initIt and cleanUp. Those two beans will be used by two Web Services.

Defining the CXF application for Tomcat

The following web.xml file only tells Tomcat, the web server, to use the CXFServlet to listen to the SOAP queries.

Compile & Deploy

Installing a CXF web service requires many libraries: in the end, the deployed 'war' file weighed 8.5 MB (!). The current layout of the project is:
./translate/WEB-INF/classes/bio/TranslateImpl.java
./translate/WEB-INF/classes/bio/Translate.java
./translate/WEB-INF/beans.xml
./translate/WEB-INF/web.xml
The service was compiled and deployed using the following Makefile:
cxf.lib=apache-cxf-2.3.1/lib
all:
	mkdir -p translate/WEB-INF/lib
	javac -d translate/WEB-INF/classes -sourcepath translate/WEB-INF/classes translate/WEB-INF/classes/bio/TranslateImpl.java
	cp ${cxf.lib}/cxf-2.3.1.jar \
		${cxf.lib}/geronimo-activation_1.1_spec-1.1.jar \
		${cxf.lib}/geronimo-annotation_1.0_spec-1.1.1.jar \
		${cxf.lib}/geronimo-javamail_1.4_spec-1.7.1.jar \
		${cxf.lib}/geronimo-servlet_3.0_spec-1.0.jar \
		${cxf.lib}/geronimo-ws-metadata_2.0_spec-1.1.3.jar \
		${cxf.lib}/geronimo-jaxws_2.2_spec-1.0.jar \
		${cxf.lib}/geronimo-stax-api_1.0_spec-1.0.1.jar \
		${cxf.lib}/jaxb-api-2.2.1.jar \
		${cxf.lib}/jaxb-impl-2.2.1.1.jar \
		${cxf.lib}/neethi-2.0.4.jar \
		${cxf.lib}/saaj-api-1.3.jar \
		${cxf.lib}/saaj-impl-1.3.2.jar \
		${cxf.lib}/wsdl4j-1.6.2.jar \
		${cxf.lib}/XmlSchema-1.4.7.jar \
		${cxf.lib}/xml-resolver-1.2.jar \
		${cxf.lib}/aopalliance-1.0.jar \
		${cxf.lib}/spring-core-3.0.5.RELEASE.jar \
		${cxf.lib}/spring-beans-3.0.5.RELEASE.jar \
		${cxf.lib}/spring-context-3.0.5.RELEASE.jar \
		${cxf.lib}/spring-web-3.0.5.RELEASE.jar \
		${cxf.lib}/commons-logging-1.1.1.jar \
		${cxf.lib}/spring-asm-3.0.5.RELEASE.jar \
		${cxf.lib}/spring-expression-3.0.5.RELEASE.jar \
		${cxf.lib}/spring-aop-3.0.5.RELEASE.jar \
		translate/WEB-INF/lib
	jar cvf translate.war -C translate .
	mv translate.war path-to-tomcat/webapps

Checking the URL

The service was correctly deployed: pointing a web browser at http://localhost:8080/translate/ lists the two services:
Available SOAP services:
Translate
  • translate
Endpoint address: http://localhost:8080/translate/translateMit
WSDL : {http://bio/}TranslateService
Target namespace: http://bio/
Translate
  • translate
Endpoint address: http://localhost:8080/translate/translateStd
WSDL : {http://bio/}TranslateService
Target namespace: http://bio/

Here, the URLs link to the WSDL definition for the web service:

Creating a client

To create a client consuming this service, I first used the code generated by CXF's wsdl2java, but there was a bug in one of the generated classes (it is a known bug, or rather a 'feature'), so here I'm going to use the standard ${JAVA_HOME}/bin/wsimport.
> wsimport -p generated -d client -keep "http://localhost:8080/translate/translateStd?wsdl"
parsing WSDL...
generating code...
compiling code...
I wrote a java client MyClient.java using this generated API:

Compiling

> cd client
> javac MyClient.java

Running

> java MyClient
EFIDHSIAC*


That's it,

Pierre

Don't mask this sequence, please.

I recently asked on Biostar whether it would be worth masking the non-genic sequences of the reference before aligning the short reads after an exome sequencing. Although I was convinced by lh3's answer, I was curious to observe the difference with some real data.

I've downloaded two FASTQ files from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA20772/sequence_read/ERR004053_*.recal.fastq.gz and the sequence of the human chromosome chr1 from the UCSC.

One copy of chr1.fa was masked with a custom program, using the UCSC knownGene table +/- 10 kb: every base that was not contained in a genic region +/- 10 kb was replaced by an 'N'.

  • Number of 'N' in chr1 without masking: 22,250,000
  • Number of 'N' in chr1 with masking: 108,399,153
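The custom masking program is not shown here; the following is only a minimal sketch of the principle, assuming the genic regions are given as 0-based, half-open [start, end) intervals. The class MaskSketch and its method are hypothetical names.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: replaces by 'N' every base lying outside the genes +/- a flank. */
class MaskSketch {
    /**
     * @param seq   the chromosome sequence
     * @param genes intervals as {start, end}, 0-based, end exclusive (assumption of this sketch)
     * @param flank number of bases kept on each side of a gene (10000 for +/- 10 kb)
     */
    static String mask(String seq, List<int[]> genes, int flank) {
        boolean[] keep = new boolean[seq.length()];
        for (int[] gene : genes) {
            int from = Math.max(0, gene[0] - flank);
            int to = Math.min(seq.length(), gene[1] + flank);
            for (int i = from; i < to; i++) {
                keep[i] = true;
            }
        }
        StringBuilder masked = new StringBuilder(seq.length());
        for (int i = 0; i < seq.length(); i++) {
            masked.append(keep[i] ? seq.charAt(i) : 'N');
        }
        return masked.toString();
    }

    public static void main(String[] args) {
        // One toy gene at [5,7) with a flank of 2 bases: only positions 3..8 are kept.
        List<int[]> genes = Collections.singletonList(new int[]{5, 7});
        System.out.println(mask("ACGTACGTACGT", genes, 2));
    }
}
```

A boolean array costs one flag per base, which stays manageable for a single chromosome and handles overlapping genes without any interval merging.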


Then, the two fastq were aligned on each chr1 (masked/not masked) and the mutations were called with 'samtools pileup':
bwa-0.5.9rc1/bwa index chr1.fa
bwa-0.5.9rc1/bwa aln chr1.fa ERR004053_1.recal.fastq.gz > aln1.sai
bwa-0.5.9rc1/bwa aln chr1.fa ERR004053_2.recal.fastq.gz > aln2.sai
bwa-0.5.9rc1/bwa sampe chr1.fa aln1.sai aln2.sai ERR004053_1.recal.fastq.gz ERR004053_2.recal.fastq.gz > file.sam
samtools-0.1.10/samtools faidx chr1.fa
samtools-0.1.10/samtools view -b -t chr1.fa.fai file.sam > file.bam
samtools-0.1.10/samtools sort file.bam sorted
samtools-0.1.10/samtools index sorted.bam
samtools-0.1.10/samtools pileup -vcf chr1.fa sorted.bam |\
awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' |\
cut -f 1-4 |\
sort | uniq > pileup.txt


At the end:
  • Number of mutations (no masking): 24921

  • Number of mutations (masking): 26062

  • Number of mutations common in 'masking'/'no masking': 13573

  • Number of mutations unique in 'no masking': 11348

  • Number of mutations unique in 'masking': 12489
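These counts boil down to set operations on the lines of the two pileup.txt files. Here is a small sketch, assuming each call has been reduced to one string (for instance the first four pileup columns joined together); CallSetCompareSketch is a hypothetical name, and the calls in main are toy values, not real data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: counts the calls shared by two sets and those unique to each. */
class CallSetCompareSketch {
    /** Returns {common, only in a, only in b}. */
    static int[] compare(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return new int[]{
            common.size(),
            a.size() - common.size(),
            b.size() - common.size()
        };
    }

    public static void main(String[] args) {
        Set<String> noMasking = new HashSet<>(Arrays.asList("chr1:10:A:G", "chr1:20:C:T", "chr1:30:G:A"));
        Set<String> masking = new HashSet<>(Arrays.asList("chr1:20:C:T", "chr1:30:G:A", "chr1:40:T:C", "chr1:50:A:C"));
        int[] counts = compare(noMasking, masking);
        System.out.println("common=" + counts[0]
            + " unique(no masking)=" + counts[1]
            + " unique(masking)=" + counts[2]);
    }
}
```

By construction, the size of each set equals the common count plus its own unique count, which gives a quick sanity check on the figures.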


'chr1:100005960 c/A': a mutation from the 'masked' sequence but not found in 'not-masked':

chr1 masked

  100005921 100005931 100005941  100005951 100005961 100005971
tgctaattggtcagattggagatggaatca*tggggggtcgacgtgaggttttcttgctgtcttct
....G.........G............... MM.......R.A....RK.................
.. ,,,,,,,,,,,,,,cac,,,,,,,,,a,,,,a,,, ,,,,,,,,,,,,,
.... ..........*CA.......A.A.......N....... ,,,,,,
.. ..........*CA.......A.A............... ,,,,
G...G..G.. ..........*CA.......A.A...............
....G.........G..T.. ...........T.................
....G....... .....A.....T.................
,,,,g,,,,,,,,,g,,,,,,,,,,,,,,,*,,,,, ..............G........

chr1 NOT masked

  100005931 100005941 100005951 100005961 100005971 100005981
tcagattggagatggaatcatggggggtcgacgtgaggttttcttgctgtcttctgttcctgggtg
..........CA.......R.A.....K............................
..........CA.......A.A.......N.......
...........T.........................
.....A.....T.........................
.A..................................
..............G...................
In this case, it is visible that the reads were aligned more correctly on the non-masked sequence.


That's it

Pierre

01 January 2011

My tool to annotate VCF files.

The Variant Call Format is a text file format generated by many tools for NGS. It contains meta-information lines, a header line, and then data lines describing how the mutations were called. I don't like this format because, unlike JSON or XML, it cannot store hierarchical annotations; nevertheless, it is a de facto standard.
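To make the flat structure concrete, here is a minimal sketch that reads the INFO column (the 8th one) of a data line into key/value pairs. It assumes tab-delimited columns, as in real VCF files, and VcfInfoSketch is a hypothetical name; this is not code from vcfannotator.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: parses the INFO column of one VCF data line. */
class VcfInfoSketch {
    /** Keys without a value (flags such as dbSNP) are mapped to an empty string. */
    static Map<String, String> parseInfo(String dataLine) {
        String[] cols = dataLine.split("\t");
        if (cols.length < 8) {
            throw new IllegalArgumentException("expected at least 8 tab-delimited columns");
        }
        Map<String, String> info = new LinkedHashMap<>();
        for (String item : cols[7].split(";")) {
            if (item.isEmpty()) {
                continue;
            }
            int eq = item.indexOf('=');
            if (eq < 0) {
                info.put(item, "");
            } else {
                info.put(item.substring(0, eq), item.substring(eq + 1));
            }
        }
        return info;
    }

    public static void main(String[] args) {
        String line = "1\t1108138\trs61733845\tC\tT\t99\t0\t"
            + "NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276";
        System.out.println(parseInfo(line));
    }
}
```

Every annotation has to fit in this single list of 'key=value' items, which is why nested data end up being encoded inside one value with ad hoc separators.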

I wrote a tool called vcfannotator to append a set of annotations from the UCSC database to a VCF file. As I wanted to keep this tool simple and without any dependencies, it only uses the flat files available from the download area at the UCSC.

This tool appends several pieces of information:

  • A prediction of the mutation: is it in the cDNA? In an intron? In an exon? Is it a non-synonymous mutation? Was a stop codon lost or gained? Is there any consequence on the splicing process? Here, the UCSC DAS server and the knownGene table are used to retrieve the genomic DNA and the structure of the gene.
  • dbSNP: was this mutation found before in dbSNP?
  • Personal genomes: was this mutation found before in Venter's genome? In Watson's genome? etc.
  • The mapability (Uniqueness of Reference Genome): gives an idea of how unique the sequence context is in the genome.
  • genomicSuperDups: is this mutation located in a segmental genomic duplication?
  • tfbsConsSites: is this mutation located in a transcription factor binding site?
  • phastCons44way: how conserved is this position in the multiple alignment of 44 vertebrate genomes to the human genome?
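The first item, the prediction, can be illustrated with a small sketch that decides whether a substitution in a codon is synonymous under the standard genetic code. It is not taken from VCFAnnotator.java: the class name is hypothetical, and the assumptions that 'pos.cdna' is a 0-based offset in the coding sequence and that the gene lies on the '+' strand are only inferred from the sample output shown later in this post.

```java
/** Hypothetical sketch: classifies a single-base change inside a codon. */
class CodonChangeSketch {
    /** Standard genetic code in NCBI order (bases ranked T, C, A, G). */
    static final String STANDARD =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    private static int baseIndex(char c) {
        switch (Character.toUpperCase(c)) {
            case 'T': return 0;
            case 'C': return 1;
            case 'A': return 2;
            case 'G': return 3;
            default: return -1;
        }
    }

    /** Amino acid for a 3-letter codon, or 'X' if the codon is not valid. */
    static char aminoAcid(String codon) {
        if (codon.length() != 3) return 'X';
        int b1 = baseIndex(codon.charAt(0));
        int b2 = baseIndex(codon.charAt(1));
        int b3 = baseIndex(codon.charAt(2));
        if (b1 < 0 || b2 < 0 || b3 < 0) return 'X';
        return STANDARD.charAt(b1 * 16 + b2 * 4 + b3);
    }

    /** Assumption: posCdna is 0-based, so the 1-based protein position is posCdna/3 + 1. */
    static int proteinPosition(int posCdna) {
        return posCdna / 3 + 1;
    }

    /** Replaces the base of the wild codon hit by the mutation ('+' strand only). */
    static String mutate(String wildCodon, int posCdna, char alt) {
        char[] bases = wildCodon.toCharArray();
        bases[posCdna % 3] = alt;
        return new String(bases);
    }

    static boolean isSynonymous(String wildCodon, String mutCodon) {
        return aminoAcid(wildCodon) == aminoAcid(mutCodon);
    }

    public static void main(String[] args) {
        String mut = mutate("TGC", 935, 'T');
        System.out.println(mut + " protein position " + proteinPosition(935)
            + " synonymous=" + isSynonymous("TGC", mut));
    }
}
```

A complete prediction would also have to handle the '-' strand (reverse-complementing the alleles), introns and splice sites; none of that is covered by this sketch.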


The tool is available on GitHub at: https://github.com/lindenb/jsandbox in jsandbox/src/sandbox/VCFAnnotator.java.

Compilation

> cd jsandbox
> ant vcfannotator

Execution & Options

> java -jar dist/vcfannotator.jar -h
VCF annotator
Pierre Lindenbaum PhD. 2010. http://plindenbaum.blogspot.com
Options:
-b ucsc.build default:hg18
-m mapability.
-g genomicSuperDups
-p basic prediction
-c phastcons prediction (phastCons44way)
-t transcription factors sites prediction
-pg personal genomes
-snp <id> add ucsc <id> must be present in "http://hgdownload.cse.ucsc.edu/goldenPath/<ucscdb>/database/<id>.txt.gz" e.g. snp129
-log <level> one value from java.util.logging.Level default:OFF
-proxyHost <host>
-proxyPort <port>

Example

Input

##fileformat=VCFv3.2
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1105366 . T C 99 0 NS=471;DP=16179;AC=8;AN=942;HWE=0.0685233
1 1105373 . C T 99 0 NS=469;DP=15318;AC=5;AN=938;HWE=0.0267954
1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276
1 1109206 . G A 99 0 NS=696;DP=54701;AC=2;AN=1392;HWE=0.0028777
1 1109262 . C T 99 0 NS=696;DP=55783;AC=12;AN=1392;HWE=0.104349
1 1110233 . C G 99 0 NS=694;DP=34301;AC=43;AN=1388;HWE=4.91397
1 1110240 . T A 99 0 NS=695;DP=35846;AC=3;AN=1390;HWE=0.00648883
1 1110294 rs1320571 G A 99 0 NS=608;DP=36050;AC=246;AN=1216;dbSNP;HM;HWE=3.05358
1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402
1 1110358 . T C 99 0 NS=694;DP=24596;AC=24;AN=1388;HWE=0.422309
1 1110366 . G A 99 0 NS=697;DP=22511;AC=16;AN=1394;HWE=0.185781
1 3537495 . C T 99 0 NS=697;DP=25302;AC=15;AN=1394;HWE=0.163165
1 3537910 . G A 99 0 NS=605;DP=15896;AC=34;AN=1210;HWE=0.46683
1 3537996 rs2760321 T C 99 0 NS=571;DP=15376;AC=1040;AN=1142;dbSNP;HM;HWE=4.28852
(...)

Command:

java -jar dist/vcfannotator.jar -pg -m -g -p -c -t -snp snp130 -log ALL input.vcf
## wait !

Result

##fileformat=VCFv3.2
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
##SNP130=table snp130 from UCSC
##INFO=NA12878,1,String,"NA12878's Personal genome"
##INFO=NA12891,1,String,"NA12891's Personal genome"
(...)
chr1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:935|kgId:uc001acy.1|position.protein:312|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 11|type:EXON_CODING_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:716|kgId:uc001acz.1|position.protein:239|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 7|type:EXON_CODING_SYNONYMOUS
(...)
chr1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|splicing:SPLICING_DONOR|pos.cdna:1399|kgId:uc001acy.1|position.protein:467|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 13|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|pos.cdna:1180|kgId:uc001acz.1|position.protein:394|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 9|type:EXON_CODING_NON_SYNONYMOUS
(...)
chr1 113068276 . C G 99 0 NS=697;DP=12774;AC=7;AN=1394;HWE=0.0353282;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:FAM19A3|wild.codon:CCA|wild.aa:P|pos.cdna:451|kgId:uc001ecu.1|position.protein:151|strand:+|mut.aa:R|mut.codon:CGA|exon.name:Exon 4|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:FAM19A3|wild.codon:ACC|wild.aa:T|pos.cdna:383|kgId:uc001ecv.1|position.protein:128|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 4|type:EXON_CODING_SYNONYMOUS
(...)
Interestingly, for this last mutation chr1:113068276, there are two transcripts at the same position with two different translation frames, so there are two predictions: one synonymous mutation and one non-synonymous mutation.


That's it,

Pierre

31 December 2010

Translating a DNA to a Protein using server-side javascript and C: my notebook

In my previous post, I used Node.js to translate a DNA to a protein on the server side, using javascript. In this post, I will again translate a DNA, but this time by calling a specialized C program on the server side.

Source code


The C program

The C program reads a DNA string from stdin and translates it using the standard genetic code.
Compilation:
gcc -o /my/bin/path/translate translate.c

The Node.js script

When the Node.js server receives a DNA parameter, it spawns a new process running the C program and writes the DNA to this process via 'stdin'.
Each time a new 'data' event (containing the protein) is received, it is printed to the http response. At the end of the process, we close the stream by calling 'end()'.

Test

> node-v0.2.5/node translate.js
Server running at http://127.0.0.1:8080

> curl -s "http://localhost:8080/?dna=ATGATGATAGATAGATATAGTAGATATGATCGTCAGCCATACG"
MMIDRYSRYDRQPY


That's it,

Pierre

Server-side javascript: translating a DNA with Node.js

(wikipedia) Node.js is an evented I/O framework for the V8 JavaScript engine on Unix-like platforms. It is intended for writing scalable (javascript-based) network programs such as web servers.

In the following post I will create a javascript server translating a DNA to a protein.

Installing Node.js

I've downloaded the sources for Node.js from http://nodejs.org/#download. It compiled (configure+make) and ran without any problem.

The script

The following script contains a class handling a GeneticCode and the server TranslateDna translating the DNA to a protein. It handles both the POST and the GET HTTP methods. If no parameter is found, it displays a simple HTML form; otherwise the form data are decoded and the DNA is translated. The protein is returned as a JSON structure.

Running the server

> node-v0.2.5/node translate.js
Server running at http://127.0.0.1:8080

Test


> curl "http://localhost:8080/"
<html><body><form action="/" method="GET"><h1>DNA</h1><textarea name="dna"></textarea><br/><input type="submit" value="Submit"></form></body></html>

> curl "http://localhost:8080/?dna=ATGAACTATCGATGCTACGACTGATCG"
{"protein":"MNYRCYD*S","query":"ATGAACTATCGATGCTACGACTGATCG"}



That's it,

Pierre

14 December 2010

Looking for an expert ?

Yesterday, Andrew Su asked on Biostar: "Given a gene, what is the best automated method to identify the world experts? ".

Here is my solution:

  • First, for a given gene name, we use NCBI-ESearch to find its Gene-Id in NCBI Gene.
  • The Gene record is then downloaded as XML using NCBI-EFetch.
  • XPath is used to retrieve all the PubMed articles about this gene; they are identified by the XML tags <PubMedId>.
  • Each article is downloaded from PubMed. The element <Affiliation> is extracted from the record; sometimes this tag contains the main contact's e-mail. The authors are also extracted and we count the number of times each author was found. I tried to solve the problem of ambiguity in the names of the authors by looking at the name, surname and initials. If an author's name was contained in the e-mail, this e-mail was assigned to him.
  • At the end, all the authors are sorted by the number of times they were seen and the most prolific author is printed out.
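The counting and sorting steps above can be sketched as follows. This is not the BioStar4296 source: it assumes the author names of each article have already been extracted from the PubMed XML as "LastName Initials" strings, and the class ExpertRankSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: ranks authors by the number of articles they signed. */
class ExpertRankSketch {
    /** Each inner list holds the authors of one article, e.g. "Sonenberg N". */
    static List<Map.Entry<String, Integer>> rank(List<List<String>> authorsPerArticle) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> authors : authorsPerArticle) {
            Set<String> seen = new HashSet<>();
            for (String author : authors) {
                // Crude disambiguation: compare names case-insensitively.
                String key = author.trim().toLowerCase(Locale.ROOT);
                if (seen.add(key)) {
                    counts.merge(key, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        // Most prolific first; ties are broken alphabetically to get a stable order.
        ranked.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        return ranked;
    }

    public static void main(String[] args) {
        List<List<String>> articles = Arrays.asList(
            Arrays.asList("Sonenberg N", "Gingras AC"),
            Arrays.asList("Sonenberg N"),
            Arrays.asList("Doe J"));
        Map.Entry<String, Integer> best = rank(articles).get(0);
        System.out.println(best.getKey() + " (" + best.getValue() + " articles)");
    }
}
```

As the ZC3H7B result further down shows, a raw count of this kind favours authors of large-scale studies, so the ranking is only a first approximation of expertise.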


Source code


Compilation

javac BioStar4296.java

Test

java BioStar4296 ZC3H7B eif4G1 PRNP

<?xml version="1.0" encoding="UTF-8"?>
<experts>
<gene name="ZC3H7B" geneId="23264" count-pmids="13">
<Person>
<firstName>Sumio</firstName>
<lastName>Sugano</lastName>
<pmid>8125298</pmid>
<pmid>9373149</pmid>
<pmid>14702039</pmid>
<affilitation>International and Interdisciplinary Studies, The University of Tokyo, Japan.</affilitation>
<affilitation>Institute of Medical Science, University of Tokyo, Japan.</affilitation>
<affilitation>Helix Research Institute, 1532-3 Yana, Kisarazu, Chiba 292-0812, Japan.</affilitation>
</Person>
</gene>
<gene name="eif4G1" geneId="1981" count-pmids="106">
<Person>
<firstName>Nahum</firstName>
<lastName>Sonenberg</lastName>
<pmid>7651417</pmid>
<pmid>7935836</pmid>
<pmid>8449919</pmid>
(...)
<affilitation>Department of Biochemistry and McGill Cancer Center, McGill University, Montreal, H3G 1Y6, Quebec, Canada.</affilitation>
<affilitation>Department of Biochemistry, McGill University, Montreal, Quebec, Canada.</affilitation>
<affilitation>Laboratories of Molecular Biophysics, The Rockefeller University, New York, New York 10021, USA.</affilitation>
(...)
</Person>
</gene>
<gene name="PRNP" geneId="5621" count-pmids="429">
<Person>
<firstName>John</firstName>
<lastName>Collinge</lastName>
<pmid>1352724</pmid>
<pmid>1677164</pmid>
<pmid>2159587</pmid>
<pmid>20583301</pmid>
(...)
<mail>[email protected]</mail>
<affilitation>Krebs Institute for Biomolecular Research, Department of Molecular Biology and Biotechnology, University of Sheffield, Sheffield S10 2TN, UK.</affilitation>
<affilitation>MRC Prion Unit and Department of Neurogenetics, Imperial College School of Medicine at St. Mary's, London, United Kingdom. [email protected]</affilitation>
<affilitation>Division of Neuroscience (Neurophysiology), Medical School, University of Birmingham, Edgbaston, Birmingham, UK. [email protected]</affilitation>
(...)
</Person>
</gene>
</experts>

About this result


  • ZC3H7B: the result is wrong. In Dr Sugano's 3 articles, ZC3H7B was only one gene among the large sets of genes used in his studies. The expert would be Dr D. Poncet, my former thesis advisor, but he 'only' wrote two articles about this protein.
  • Eif4G1: I know Dr Sonenberg is the expert. His e-mail wasn't found.
  • PRNP: Collinge seems to be the expert. Dr Collinge's e-mail was detected.


That's it,

Pierre

13 December 2010

A new journal: BMC Open Research Computation #OpenResComp


Citing the "Aims & scope": Open Research Computation publishes peer reviewed articles that describe the development, capacities, and uses of software designed for use by researchers in any field.

Submissions relating to software for use in any area of research are welcome as are articles dealing with algorithms, useful code snippets, as well as large applications or web services, and libraries.

Open Research Computation differs from other journals with a software focus in its requirement for the software source code to be made available under an Open Source Initiative compliant license, and in its assessment of the quality of documentation and testing of the software.

In addition to articles describing software Open Research Computation also welcomes submissions that review or describe developments relating to software based tools for research. These include, but are not limited to, reviews or proposals for standards, discussion of best practice in research software development, educational and support resources and tools for researchers that develop or use software based tools.


See also the insights from Cameron Neylon, Jan Aerts, Neil 10K Saunders ...

17 November 2010

BLAST/XML+Annotations

I recently asked on Biostar if it would be possible to align two sequences while displaying their respective annotations.

As both answers I received (SPICE and Jalview) require a graphical interface, I quickly wrote a command-line Java program doing the job. This program reads an NCBI BLAST XML output and, if the 'query' or the 'hit' definition lines start with "gi|....", it fetches the GenBank records and the annotations for the sequences and maps them onto the alignments.
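Mapping a feature onto an alignment essentially means walking along the aligned sequence while keeping track of the position of each base. The sketch below is not extracted from BlastAnnotation.java; the symbols are an assumption inferred from the output shown further down: '#' for an identical base inside the feature, ':' for a mismatch or a gap in the other sequence, '-' for a gap in the annotated sequence, and a space outside the feature.

```java
/** Hypothetical sketch: draws one annotation row above or below an aligned sequence. */
class AnnotationRowSketch {
    /**
     * @param self      aligned sequence carrying the feature (may contain '-')
     * @param other     the other aligned sequence, same length
     * @param start     1-based position of the first base of self
     * @param featStart 1-based start of the feature, inclusive
     * @param featEnd   1-based end of the feature, inclusive
     */
    static String row(String self, String other, int start, int featStart, int featEnd) {
        if (self.length() != other.length()) {
            throw new IllegalArgumentException("aligned sequences must have the same length");
        }
        StringBuilder row = new StringBuilder(self.length());
        int pos = start; // position of the next base of 'self'
        for (int i = 0; i < self.length(); i++) {
            char a = self.charAt(i);
            char b = other.charAt(i);
            boolean inside = pos >= featStart && pos <= featEnd;
            if (a == '-') {
                // A gap consumes no position in the annotated sequence.
                row.append(inside ? '-' : ' ');
            } else {
                if (!inside) {
                    row.append(' ');
                } else {
                    boolean same = Character.toUpperCase(a) == Character.toUpperCase(b);
                    row.append(same ? '#' : ':');
                }
                pos++;
            }
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // First bases of the query/hit alignment, with the query exon 1..180.
        System.out.println(row("GGCGCC", "GGCGCT", 53, 1, 180));
    }
}
```

The same method can be used for both sequences by swapping the 'self' and 'other' arguments, each with its own start position and its own features.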

The program is available on GitHub at https://github.com/lindenb/jsandbox/blob/master/src/sandbox/BlastAnnotation.java.

Do we need an external library parsing Blast?

No, the Java binding compiler, ${JAVA_HOME}/bin/xjc, can generate a Java parser from the BLAST DTD:
xjc -d src -p sandbox.ncbi.blast -dtd http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd

parsing a schema...
compiling a schema...
sandbox/ncbi/blast/BlastOutput.java
sandbox/ncbi/blast/BlastOutputIterations.java
sandbox/ncbi/blast/BlastOutputMbstat.java
sandbox/ncbi/blast/BlastOutputParam.java
sandbox/ncbi/blast/Hit.java
sandbox/ncbi/blast/HitHsps.java
sandbox/ncbi/blast/Hsp.java
sandbox/ncbi/blast/Iteration.java
sandbox/ncbi/blast/IterationHits.java
sandbox/ncbi/blast/IterationStat.java
sandbox/ncbi/blast/ObjectFactory.java
sandbox/ncbi/blast/Parameters.java
sandbox/ncbi/blast/Statistics.java


And do we need an external library parsing Genbank?

No, again xjc did the job:
xjc -d src -p sandbox.ncbi.gbc -dtd http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd

parsing a schema...
compiling a schema...
sandbox/ncbi/gbc/INSDAltSeqData.java
sandbox/ncbi/gbc/INSDAltSeqDataItems.java
sandbox/ncbi/gbc/INSDAltSeqItem.java
sandbox/ncbi/gbc/INSDAltSeqItemInterval.java
sandbox/ncbi/gbc/INSDAltSeqItemIsgap.java
sandbox/ncbi/gbc/INSDAuthor.java
sandbox/ncbi/gbc/INSDComment.java
sandbox/ncbi/gbc/INSDCommentItem.java
sandbox/ncbi/gbc/INSDCommentParagraph.java
sandbox/ncbi/gbc/INSDCommentParagraphItems.java
sandbox/ncbi/gbc/INSDCommentParagraphs.java
sandbox/ncbi/gbc/INSDFeature.java
sandbox/ncbi/gbc/INSDFeatureIntervals.java
sandbox/ncbi/gbc/INSDFeaturePartial3.java
sandbox/ncbi/gbc/INSDFeaturePartial5.java
sandbox/ncbi/gbc/INSDFeatureQuals.java
sandbox/ncbi/gbc/INSDFeatureSet.java
sandbox/ncbi/gbc/INSDFeatureSetFeatures.java
sandbox/ncbi/gbc/INSDFeatureXrefs.java
sandbox/ncbi/gbc/INSDInterval.java
sandbox/ncbi/gbc/INSDIntervalInterbp.java
sandbox/ncbi/gbc/INSDIntervalIscomp.java
sandbox/ncbi/gbc/INSDKeyword.java
sandbox/ncbi/gbc/INSDQualifier.java
sandbox/ncbi/gbc/INSDReference.java
sandbox/ncbi/gbc/INSDReferenceAuthors.java
sandbox/ncbi/gbc/INSDReferenceXref.java
sandbox/ncbi/gbc/INSDSecondaryAccn.java
sandbox/ncbi/gbc/INSDSeq.java
sandbox/ncbi/gbc/INSDSeqAltSeq.java
sandbox/ncbi/gbc/INSDSeqCommentSet.java
sandbox/ncbi/gbc/INSDSeqFeatureSet.java
sandbox/ncbi/gbc/INSDSeqFeatureTable.java
sandbox/ncbi/gbc/INSDSeqKeywords.java
sandbox/ncbi/gbc/INSDSeqOtherSeqids.java
sandbox/ncbi/gbc/INSDSeqReferences.java
sandbox/ncbi/gbc/INSDSeqSecondaryAccessions.java
sandbox/ncbi/gbc/INSDSeqStrucComments.java
sandbox/ncbi/gbc/INSDSeqid.java
sandbox/ncbi/gbc/INSDSet.java
sandbox/ncbi/gbc/INSDStrucComment.java
sandbox/ncbi/gbc/INSDStrucCommentItem.java
sandbox/ncbi/gbc/INSDStrucCommentItems.java
sandbox/ncbi/gbc/INSDXref.java
sandbox/ncbi/gbc/ObjectFactory.java

Example


As an example I've aligned the "human eif4G1" (gi|303227906) with "Mus musculus eif4G1" (gi|56699433).
The very first lines of the BLAST report are:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dt
<BlastOutput>
<BlastOutput_program>blastn</BlastOutput_program>
<BlastOutput_version>BLASTN 2.2.24+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch
<BlastOutput_db>n/a</BlastOutput_db>
<BlastOutput_query-ID>gi|303227906|ref|NM_198241.2|</BlastOutput_query-ID>
<BlastOutput_query-def>Homo sapiens eukaryotic translation initiation factor 4
<BlastOutput_query-len>5538</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_expect>10</Parameters_expect>
<Parameters_sc-match>2</Parameters_sc-match>
<Parameters_sc-mismatch>-3</Parameters_sc-mismatch>
<Parameters_gap-open>5</Parameters_gap-open>
<Parameters_gap-extend>2</Parameters_gap-extend>
<Parameters_filter>L;m;</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|303227906|ref|NM_198241.2|</Iteration_query-ID>
<Iteration_query-def>Homo sapiens eukaryotic translation initiation factor 4 g
<Iteration_query-len>5538</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|56699433|ref|NM_001005331.1|</Hit_id>
<Hit_def>Mus musculus eukaryotic translation initiation factor 4, gamma 1 (Eif
<Hit_accession>NM_001005331</Hit_accession>
<Hit_len>5460</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>6818.02</Hsp_bit-score>
<Hsp_score>7560</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>53</Hsp_query-from>
<Hsp_query-to>5538</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>5418</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>4820</Hsp_identity>
<Hsp_positive>4820</Hsp_positive>
<Hsp_gaps>138</Hsp_gaps>
<Hsp_align-len>5521</Hsp_align-len>
<Hsp_qseq>GGCGCCGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTGCGGGGAGCCGGAAA...
<Hsp_hseq>GGCGCTGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTGCGGGGAGCCGGAAA...
<Hsp_midline>||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||...
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>-1</Statistics_kappa>
<Statistics_lambda>-1</Statistics_lambda>
<Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

And here is the output of my program:

java -jar dist/blastannot.jar ~/jeter.blast.xml

QUERY: Homo sapiens eukaryotic translation initiation factor 4 gamma, 1 (EIF4G1), transcript variant 2, mRNA
ID:gi|303227906|ref|NM_198241.2| Len:5538
>Mus musculus eukaryotic translation initiation factor 4, gamma 1 (Eif4g1), transcript variant 2, mRNA
NM_001005331
id:gi|56699433|ref|NM_001005331.1| len:5460

e-value:0 gap:138 bitScore:6818.02

#####:############################################ exon 1..180 gene:EIF4G1
QUERY 000000053 GGCGCCGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTG 000000102
||||| ||||||||||||||||||||||||||||||||||||||||||||
HIT 000000001 GGCGCTGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTG 000000050
#####:############################################ exon 1..128 gene:Eif4g1



################################################## exon 1..180 gene:EIF4G1
QUERY 000000103 CGGGGAGCCGGAAATGGTTGTGGACTACGTCTGTGCGGCTGCGTGGGGCT 000000152
||||||||||||||||||||||||||||||||||||||||||||||||||
HIT 000000051 CGGGGAGCCGGAAATGGTTGTGGACTACGTCTGTGCGGCTGCGTGGGGCT 000000100
################################################## exon 1..128 gene:Eif4g1



############::::::::::###### exon 1..180 gene:EIF4G1
#:::::::::::::###::::: exon 181..237 gene:EIF4G1
QUERY 000000153 CGGCCGCGCGGACTGAAGGAGACTGAAGGCCCTCGGATGCCCAGAACCTG 000000202
|||||||||||| ||||||| |||
HIT 000000101 CGGCCGCGCGGA----------CTGAAGG-------------AGA----- 000000122
############----------#######-------------### gene 1..5460 gene:Eif4g1
############----------#######-------------### exon 1..128 gene:Eif4g1



::::::::::::::::::::::##:##:::::::# exon 181..237 gene:EIF4G1
############### exon 238..331 gene:EIF4G1
QUERY 000000203 TAGGCCGCACCGTGGACTTGTTCTTAATCGAGGGGGTGCTGGGGGGACCC 000000252
|| || ||||||||||||||||
HIT 000000123 ----------------------CTGAA-------GGTGCTGGGGGGACCC 000000143
----------------------##:##-------# exon 1..128 gene:Eif4g1
############### exon 129..222 gene:Eif4g1



#:###############################:###:############ exon 238..331 gene:EIF4G1
##############:###:############ CDS 272..5071 gene:EIF4G1
QUERY 000000253 TGATGTGGCACCAAATGAAATGAACAAAGCTCCACAGTCCACAGGCCCCC 000000302
| ||||||||||||||||||||||||||||||| ||| ||||||||||||
HIT 000000144 TAATGTGGCACCAAATGAAATGAACAAAGCTCCCCAGCCCACAGGCCCCC 000000193
#:###############################:###:############ exon 129..222 gene:Eif4g1
##############:###:############ CDS 163..4944 gene:Eif4g1


(...)


############:#:#:#####:######:#:########:##:###### exon 4890..5521 gene:EIF4G1
############:#:#:#####:######:#:########:##:###### STS 4948..5505 gene:EIF4G1
############:#:#:#####:######:#:########:##:###### STS 5174..5403 gene:EIF4G1
QUERY 000005319 TTGGTGTGTCTTGGGGTGGGGAGGGGCACCAACGCCTGCCCCTGGGGTCC 000005368
|||||||||||| | | ||||| |||||| | |||||||| || ||||||
HIT 000005201 TTGGTGTGTCTTTGCGGGGGGAAGGGCACTACCGCCTGCCTCTAGGGTCC 000005250
############:#:#:#####:######:#:########:##:###### exon 4760..5396 gene:Eif4g1



::##############:##########:###################### exon 4890..5521 gene:EIF4G1
::##############:##########:###################### STS 4948..5505 gene:EIF4G1
::##############:##########:####### STS 5174..5403 gene:EIF4G1
QUERY 000005369 TTTTTTTTATTTTCTGAAAATCACTCTCGGGACTGCCGTCCTCGCTGCTG 000005418
|||||||||||||| |||||||||| ||||||||||||||||||||||
HIT 000005251 --TTTTTTATTTTCTG-AAATCACTCTTGGGACTGCCGTCCTCGCTGCTG 000005297
--##############-##########:###################### exon 4760..5396 gene:Eif4g1



######################:#############:############# exon 4890..5521 gene:EIF4G1
######################:#############:############# STS 4948..5505 gene:EIF4G1
QUERY 000005419 GGGGCATATGCCCCAGCCCCTGTACCACCCCTGCTGTTGCCTGGGCAGGG 000005468
|||||||||||||||||||||| ||||||||||||| |||||||||||||
HIT 000005298 GGGGCATATGCCCCAGCCCCTGCACCACCCCTGCTGCTGCCTGGGCAGGG 000005347
######################:#############:############# exon 4760..5396 gene:Eif4g1



#:##-############################################: exon 4890..5521 gene:EIF4G1
#:##-################################# STS 4948..5505 gene:EIF4G1
###### polyA_signal 5496..5501 gene:EIF4G1
# polyA_site 5516 gene:EIF4G1
QUERY 000005469 GGAA-GGGGGGGCACGGTGCCTGTAATTATTAAACATGAATTCAATTAAG 000005517
| || ||||||||||||||||||||||||||||||||||||||||||||
HIT 000005348 GAAAGGGGGGGGCACGGTGCCTGTAATTATTAAACATGAATTCAATTAAA 000005397
#:##:############################################ exon 4760..5396 gene:Eif4g1



:::# exon 4890..5521 gene:EIF4G1
# polyA_site 5521 gene:EIF4G1
QUERY 000005518 CTCAAAAAAAAAAAAAAAAAA 000005538
||||||||||||||||||
HIT 000005398 AAAAAAAAAAAAAAAAAAAAA 000005418



That's it,
Pierre