Showing posts with label berkeleydb. Show all posts
Showing posts with label berkeleydb. Show all posts

07 February 2013

A tool to compare the BAMs

Following that thread on Biostar, I've created a tool that compares two or more BAMs. This java program uses the Picard and berkeleydb-JE libraries and is available at: http://code.google.com/p/jvarkit/wiki/CompareBams.

Download & install

see http://code.google.com/p/jvarkit/wiki/CompareBams.

Example

The following Makefile align the same pair of FASTQs with 5 different parameters for bwa aln -O (gap_open_penalty)

FASTQ1=001.fastq.gz
FASTQ2=002.fastq.gz
REF=/path/to/human_g1k_v37.fasta
BWA=bwa
SAMTOOLS=samtools
ALL_BAMS= 

define SAI
$(1)_1.sai : ${FASTQ1}  $(REF)
        $(BWA) aln  $(2) -t 2  -f $$@ $(REF) $$<
$(1)_2.sai :${FASTQ2}  $(REF)
        $(BWA) aln  $(2) -t 2  -f $$@ $(REF) $$<

endef

define ALN

ALL_BAMS+= $(1).bam 

$(eval $(call SAI,$(1),$(2)))

$(1).bam: $(1)_1.sai $(1)_2.sai
        $(BWA) sampe $(3)  ${REF} \
                $(1)_1.sai $(1)_2.sai  \
                $(FASTQ1) $(FASTQ2) |\
        $(SAMTOOLS) view -S -b -o $$@ -T $(REF) - 

endef

.PHONY:all

all: diff.gz

$(eval $(foreach GAP, 8 9 10 11 12 , $(call ALN,GAP$(GAP), -O $(GAP) , )))


diff.gz: $(ALL_BAMS)
        mkdir -p tmp.bdb
        java -jar  /commun/data/packages/jvarkit/comparebams.jar -d tmp.bdb $^ | gzip --best > $@
execute:
$ make

(...)
java -jar  /path/to/jvarkit/comparebams.jar -d tmp.bdb GAP8.bam GAP9.bam GAP10.bam GAP11.bam GAP12.bam | gzip --best > diff.gz
Feb 07, 2013 2:09:57 PM fr.inserm.umr1087.jvarkit.tools.cmpbams.CompareBams run
#INFO: in GAP8.bam count:1
Feb 07, 2013 2:10:30 PM fr.inserm.umr1087.jvarkit.tools.cmpbams.CompareBams run
INFO: in GAP9.bam count:1
(...)

The output is a tab delimited file containing

  • the name of the read
  • the comparison of each bam vs the others EQ=equals, NE=not-equals
  • the positions of the reads in each bam
$ gunzip -c diff.gz | egrep -E  '(#|NE)' | head


#READ-Name      GAP8.bam GAP9.bam|GAP8.bam GAP10.bam|GAP8.bam GAP11.bam|GAP8.bam GAP12.bam|GAP9.bam GAP10.bam|GAP9.bam GAP11.bam|GAP9.bam GAP12.bam|GAP10.bam GAP11.bam|GAP10.bam GAP12.bam|GAP11.bam GAP12.bam GAP8.bam        GAP9.bam        GAP10.bam       GAP11.bam       GAP12.bam
M00491:10:000000000-A27BP:1:1101:10029:10672    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   6:123892013,6:123892006 6:123892013,6:123892006 6:123892005,6:123892013 6:123892005,6:123892013 6:123892005,6:123892013
M00491:10:000000000-A27BP:1:1101:10265:10054    EQ|EQ|NE|NE|EQ|NE|NE|NE|NE|NE   19:49671437,19:49671412 19:49671437,19:49671412 19:49671437,19:49671412 19:49671435     19:49671412,19:49671435
M00491:10:000000000-A27BP:1:1101:10904:12333    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   10:88681151,10:88681156 10:88681151,10:88681156 10:88681156,10:88681150 10:88681156,10:88681150 10:88681156,10:88681150
M00491:10:000000000-A27BP:1:1101:11211:13492    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   8:52321469      8:52321469      8:52321470      8:52321470      8:52321470
M00491:10:000000000-A27BP:1:1101:11298:18283    NE|NE|NE|NE|EQ|EQ|EQ|EQ|EQ|EQ   6:126071103,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110
M00491:10:000000000-A27BP:1:1101:11381:15675    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   1:156106900,1:156106905 1:156106900,1:156106905 1:156106905,1:156106899 1:156106905,1:156106899 1:156106905,1:156106899
M00491:10:000000000-A27BP:1:1101:12189:14088    EQ|NE|NE|EQ|NE|NE|EQ|EQ|NE|NE   15:22015803     15:22015803     15:21009140     15:21009140     15:22015803
M00491:10:000000000-A27BP:1:1101:12382:11193    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   4:111542263,4:111542256 4:111542263,4:111542256 4:111542263,4:111542254 4:111542263,4:111542254 4:111542263,4:111542254
M00491:10:000000000-A27BP:1:1101:12998:24492    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   2:179433503,2:179433496 2:179433503,2:179433496 2:179433503,2:179433497 2:179433503,2:179433497 2:179433503,2:179433497
(....)

That's it
Pierre

19 June 2011

A couchdb-like application using apache JETTY and BerkeleyDB. My notebook


This is my notebook for the Jetty API. Jetty provides an Web server and javax.servlet container. It can be embedded in devices, tools, frameworks, application servers, and clusters. I find it more easier for developping and testing a new web application as it doesn't require some deployement descriptors, some configuration files, a web container, etc... like in apache tomcat. As I will show below, a whole web applications can be embedded in a single java class.
As an example, I wrote a couchdb-like web application, named DivanDB ('divan' is the french word for couch), using Jetty and Berkeleydb. The source file is available on github at:(See also my previous post :CouchDB for Bioinformatics: Storing SNPs - My Notebook)

The BerkeleyDB-base key/value datastore

The class BDBStorage consists in a BerkeleyDB environment and a BerkeleyDB database. The database stores keys and the values as String.
/** a berkeley-db String/String datastore */ 
private static class BDBStorage
{
/** bdb environment */
private Environment environment=null;
/** string/string database */
private Database database=null;

private BDBStorage()
{
}

/** open environment & database */
private void open(File dbHome) throws DatabaseException
{
(...)
}
/** close environment & database */
private void close()
{
(...)
}
}


The web context

A servlet context is created where we put a BDBStorage. We also tell the context about the DivanDB servlet and the path of the application.

DivanDB app=new DivanDB();
BDBStorage storage=new BDBStorage();
ServletContextHandler context = new ServletContextHandler();
context.setAttribute(STORAGE_ATTRIBUTE, storage);
context.addServlet(new ServletHolder(app),"/*");
context.setContextPath("/divandb");
context.setResourceBase(".");

The server

The embedded server listens to the specified port. We add the context to the server, it is started and listen to the queries forever.
/* create a new server */
Server server = new Server(port);
/* context */
server.setHandler(context);
/* start server */
server.start();
/* loop forever */
server.join();


The Servlet

The servlet DivanDB receives the POST, GET and DELETE queries just like couchdb. It decodes the JSON data using a JSON API provided in the jetty distribution. See the code for more information.

Start the server

$java -jar dist/divandb.jar
2011-06-19 18:12:16.823:INFO::jetty-8.0.0.M3
2011-06-19 18:12:16.896:INFO::Started [email protected]:8080 STARTING

Testing, using CURL

Add a single snp. Here the id is a concatenation of the chromosome number and the genomic position:
$curl -X PUT -d '{"id":"010000011390","name":"rs75100478","avHet":0.5,"chrom":"chr1","position":11390}' http://localhost:8080/divandb/
{"ok":"true","id":["010000011390"]}
Add single snp without id (it will be generated):
$curl -X PUT -d '{"name":"rs3877545","avHet":0.375,"chrom":"chr1","position":11507}' http://localhost:8080/divandb/
{"ok":"true","id":["id972755720"]}
Add an array of SNPs:
$ curl -X PUT -d '[{"id":"010000011390","name":"rs75100478","avHet":0.5,"chrom":"chr1","position":11390},{"id":"010000011507","name":"rs3877545","avHet":0.375,"chrom":"chr1","position":11507},{"id":"010000011863","name":"rs76933399","avHet":0.46875,"chrom":"chr1","position":11863},{"id":"010000011920","name":"rs78105014","avHet":0.5,"chrom":"chr1","position":11920},{"id":"010000011959","name":"rs71234145","avHet":0.5,"chrom":"chr1","position":11959},{"id":"010000012048","name":"rs7357889","avHet":0.277778,"chrom":"chr1","position":12048},{"id":"010000012073","name":"rs76972523","avHet":0.5,"chrom":"chr1","position":12073},{"id":"010000012120","name":"rs78253371","avHet":0.5,"chrom":"chr1","position":12120},{"id":"010000012148","name":"rs78371031","avHet":0.5,"chrom":"chr1","position":12148},{"id":"010000012158","name":"rs79313092","avHet":0.5,"chrom":"chr1","position":12158},{"id":"010000012168","name":"rs74400200","avHet":0.5,"chrom":"chr1","position":12168},{"id":"010000012169","name":"rs79903148","avHet":0.375,"chrom":"chr1","position":12169},{"id":"010000012213","name":"rs79272378","avHet":0.5,"chrom":"chr1","position":12213},{"id":"010000012264","name":"rs2475482","avHet":0.444444,"chrom":"chr1","position":12264},{"id":"010000012295","name":"rs76096671","avHet":0.375,"chrom":"chr1","position":12295},{"id":"010000012295","name":"rs79461972","avHet":0.375,"chrom":"chr1","position":12295},{"id":"010000012305","name":"rs71236994","avHet":0.5,"chrom":"chr1","position":12305},{"id":"010000012332","name":"rs2981846","avHet":0.375,"chrom":"chr1","position":12332},{"id":"010000012336","name":"rs76760687","avHet":0.375,"chrom":"chr1","position":12336},{"id":"010000012355","name":"rs76588012","avHet":0.444444,"chrom":"chr1","position":12355}]' http://localhost:8080/divandb/
{"ok":"true","id":["010000011390","010000011507","010000011863","010000011920","010000011959","010000012048","010000012073","010000012120","010000012148","010000012158","010000012168","010000012169","010000012213","010000012264","010000012295","010000012295","010000012305","010000012332","010000012336","010000012355"]
Deleting a SNP:
$curl -X DELETE -d '{"id":"2065420809"}' http://localhost:8080/divandb/
{"ok":"true","id":["id972755720"]}
List all the SNPs:
$ curl http://localhost:8080/divandb/
[
{"position":11390,"id":"010000011390","_timestamp":"1308501619331","name":"rs75100478","avHet":0.5,"chrom":"chr1"},
{"position":11507,"id":"010000011507","_timestamp":"1308501619335","name":"rs3877545","avHet":0.375,"chrom":"chr1"},
{"position":11863,"id":"010000011863","_timestamp":"1308501619336","name":"rs76933399","avHet":0.46875,"chrom":"chr1"},
{"position":11920,"id":"010000011920","_timestamp":"1308501619336","name":"rs78105014","avHet":0.5,"chrom":"chr1"},
{"position":11959,"id":"010000011959","_timestamp":"1308501619337","name":"rs71234145","avHet":0.5,"chrom":"chr1"},
{"position":12048,"id":"010000012048","_timestamp":"1308501619337","name":"rs7357889","avHet":0.277778,"chrom":"chr1"},
{"position":12073,"id":"010000012073","_timestamp":"1308501619338","name":"rs76972523","avHet":0.5,"chrom":"chr1"},
{"position":12120,"id":"010000012120","_timestamp":"1308501619338","name":"rs78253371","avHet":0.5,"chrom":"chr1"},
{"position":12148,"id":"010000012148","_timestamp":"1308501619339","name":"rs78371031","avHet":0.5,"chrom":"chr1"},
{"position":12158,"id":"010000012158","_timestamp":"1308501619339","name":"rs79313092","avHet":0.5,"chrom":"chr1"},
{"position":12168,"id":"010000012168","_timestamp":"1308501619339","name":"rs74400200","avHet":0.5,"chrom":"chr1"},
{"position":12169,"id":"010000012169","_timestamp":"1308501619340","name":"rs79903148","avHet":0.375,"chrom":"chr1"},
{"position":12213,"id":"010000012213","_timestamp":"1308501619340","name":"rs79272378","avHet":0.5,"chrom":"chr1"},
{"position":12264,"id":"010000012264","_timestamp":"1308501619341","name":"rs2475482","avHet":0.444444,"chrom":"chr1"},
{"position":12295,"id":"010000012295","_timestamp":"1308501619342","name":"rs79461972","avHet":0.375,"chrom":"chr1"},
{"position":12305,"id":"010000012305","_timestamp":"1308501619344","name":"rs71236994","avHet":0.5,"chrom":"chr1"},
{"position":12332,"id":"010000012332","_timestamp":"1308501619348","name":"rs2981846","avHet":0.375,"chrom":"chr1"},
{"position":12336,"id":"010000012336","_timestamp":"1308501619348","name":"rs76760687","avHet":0.375,"chrom":"chr1"},
{"position":12355,"id":"010000012355","_timestamp":"1308501619349","name":"rs76588012","avHet":0.444444,"chrom":"chr1"}
]
List the SNPs on chr1 between 12000 and 12150:
$ curl "http://localhost:8080/divandb/?startkey=010000012000&endkey=010000012150"
[
{"position":12048,"id":"010000012048","_timestamp":"1308501619337","name":"rs7357889","avHet":0.277778,"chrom":"chr1"},
{"position":12073,"id":"010000012073","_timestamp":"1308501619338","name":"rs76972523","avHet":0.5,"chrom":"chr1"},
{"position":12120,"id":"010000012120","_timestamp":"1308501619338","name":"rs78253371","avHet":0.5,"chrom":"chr1"},
{"position":12148,"id":"010000012148","_timestamp":"1308501619339","name":"rs78371031","avHet":0.5,"chrom":"chr1"}
]

List 3 SNPs starting from index=2
$ curl "http://localhost:8080/divandb/?start=2&limit=3"
[
{"position":11507,"id":"010000011507","_timestamp":"1308501619335","name":"rs3877545","avHet":0.375,"chrom":"chr1"},{"position":11863,"id":"010000011863","_timestamp":"1308501619336","name":"rs76933399","avHet":0.46875,"chrom":"chr1"},
{"position":11920,"id":"010000011920","_timestamp":"1308501619336","name":"rs78105014","avHet":0.5,"chrom":"chr1"}
]

That's it


Pierre

26 November 2009

A Java implementation of Jan Aerts' LocusTree

This post is a description of my implementation of Jan Aerts' LocusTree algorithm (I want to thank Jan, our discussion and his comments were as great source of inspiration) based on BerkeleyDB-JE, a Key/Value datastore. This implementation has been used to build a genome browser displaying its data with the SVG format. In brief: splicing each chromosome using a dichotomic approach allows to quickly find all the features in a given genomic region for a given resolution. A count of the total number of objects in the descent of each child node is used to produce a histogram of the number of objects smaller than the given resolution.
Your browser does not support the <CANVAS> element !

JSON/Metadata

All the information is stored in BerkeleyDB and I've used JSON to add some metadata about each object. The JSON is serialized, gzipped and stored in BerkeleyDB.
Your browser does not support the <CANVAS> element !

Organism

Each organism is defined by an ID and a Name. The Key of the BerkeleyDB is the organism.id.
Your browser does not support the <CANVAS> element !

The organisms are loaded in the database using a simple XML file:
<organisms>
<organism id="36">
<name>hg18</name>
<description>Human Genome Build v.36</description>
<metadata>{"taxon-id":9606}</metadata>
</organism>
</organisms>


Chromosome

Each chromosome is defined by an ID, a Name, its length and its organism-ID. The Key in berkeleyDB is the chromosome ID.
Your browser does not support the <CANVAS> element !

The chromosomes are loaded in the database using a simple XML file:
<chromosomes organism-id="36">
<chromosome id="1">
<name>chr1</name>
<metadata>{"size":247249719,"type":"autosomal"}</metadata>
</chromosome>
<chromosome id="10">
<name>chr10</name>
<metadata>{"size":135374737,"type":"autosomal"}</metadata>
</chromosome>
(...)
</chromosomes>



Track

Each track is defined by an ID and a Name. The Key in berkeleyDB is the track ID.
Your browser does not support the <CANVAS> element !