Showing posts with label filter. Show all posts
Showing posts with label filter. Show all posts

26 May 2013

Filtering a VCF with javascript

This is my answer for that question on biostar. I wrote a java program filtering the VCF with the rhino javascript-engine.
I put the code on github: see https://github.com/lindenb/jvarkit#-filtering-vcf-with-javascript-rhino-.

For each variation, the script binds the following variables:

  • variant : the current variation; a org.broadinstitute.variant.variantcontext.VariantContext ( http://sourceforge.net/p/picard/code/HEAD/tree/trunk/src/java/org/broadinstitute/variant/variantcontext/VariantContext.java )
  • header : the VCF header org.broadinstitute.variant.vcf.VCFHeader ( http://sourceforge.net/p/picard/code/HEAD/tree/trunk/src/java/org/broadinstitute/variant/vcf/VCFHeader.java).
and evaluate the user script. This user script should return '1' or true if the current VCF file should be printed.

For example, you want to keep the variants having at least two samples having a depth (DP) greater that 200.
The script would be:
function myfilterFunction()
    {
    var samples=header.genotypeSamples;
    var countOkDp=0;
    for(var i=0; i< samples.size();++i)
        {
        var sampleName=samples.get(i);
        if(! variant.hasGenotype(sampleName)) continue;
        var genotype = variant.genotypes.get(sampleName);
        if( ! genotype.hasDP()) continue;
        var dp= genotype.getDP();
        if(dp < 200 ) countOkDp++;
        }
    return (countOkDp>2)
    }
myfilterFunction();

Example:

curl -s "https://raw.github.com/jamescasbon/PyVCF/master/vcf/test/gatk.vcf" |\
java -jar  dist/vcffilterjs.jar  -f filter.js |\
grep -v "##"


#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BLANK NA12878 NA12891 NA12892 NA19238 NA19239 NA19240
chr22 42526449 . T A 151.47 . AC=1;AF=0.071;AN=14;BaseQRankSum=2.662;DP=1226;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=41.2083;MQ=240.47;MQ0=0;MQRankSum=0.578;QD=4.89;ReadPosRankSum=3.611 GT:AD:DP:GQ:PL 0/1:23,8:31:99:190,0,694 0/0:188,0:190:99:0,478,5376 0/0:187,0:187:99:0,493,5322 0/0:247,0:249:99:0,634,6728 0/0:185,0:185:99:0,487,5515 0/0:202,0:202:99:0,520,5857 0/0:181,1:182:99:0,440,5362
chr22 42526634 . T C 32.60 . AC=1;AF=0.071;AN=14;BaseQRankSum=1.147;DP=1225;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=50.0151;MQ=240.65;MQ0=0;MQRankSum=1.151;QD=1.30;ReadPosRankSum=1.276 GT:AD:DP:GQ:PL 0/1:21,4:25:71:71,0,702 0/0:187,2:189:99:0,481,6080 0/0:233,0:233:99:0,667,7351 0/0:230,0:230:99:0,667,7394 0/0:174,1:175:99:0,446,5469 0/0:194,2:196:99:0,498,6239 0/0:174,0:175:99:0,511,5894
chr22 42527793 rs1080989 C T 3454.66 . AC=2;AF=0.167;AN=12;BaseQRankSum=-3.007;DB;DP=1074;DS;Dels=0.01;FS=0.000;HRun=1;HaplotypeScore=75.7865;MQ=209.00;MQ0=0;MQRankSum=3.014;QD=9.36;ReadPosRankSum=0.618 GT:AD:DP:GQ:PL ./. 0/1:72,90:162:99:1699,0,1767 0/1:103,96:202:99:1756,0,2532 0/0:188,0:188:99:0,526,5889 0/0:160,0:160:99:0,457,4983 0/0:197,0:198:99:0,544,6100 0/0:156,0:156:99:0,439,5041

That's it,

Pierre

11 May 2011

Analyzing exome data with KNIME

A presentation I gave last week in my lab

20 April 2010

Reading/Writing SAM/BAM files with the picard API for java.

I'm currently playing with the BAM/SAM Java API available in the picard package. This API can be used for creating new programs that read and write SAM files, the generic format for storing large nucleotide sequence alignments. Both SAM text format and SAM binary (BAM) format are supported. The source code comes with a simple example showing how to read and write a SAM/BAM file.

(...)
public void convertReadNamesToUpperCase(
final File inputSamOrBamFile,
final File outputSamOrBamFile
)
{
final SAMFileReader inputSam = new SAMFileReader(inputSamOrBamFile);
final SAMFileWriter outputSam = new SAMFileWriterFactory().makeSAMOrBAMWriter(inputSam.getFileHeader(),
true, outputSamOrBamFile);

for (final SAMRecord samRecord : inputSam) {
// Convert read name to upper case.
samRecord.setReadName(samRecord.getReadName().toUpperCase());
outputSam.addAlignment(samRecord);
}
outputSam.close();
inputSam.close();
}
(...)
toay, I've played with this SAM API to build two simple java tools:

SbamGrep

SbamGrep is available at http://code.google.com/p/code915/wiki/SbamGrep. It filters a SAM/BAM file using the SAM flags.

Option

 -o (filename-out) or default is stdout SAM
-v inverse selection
-f [flag] add bam/sam flag for filtering. multiple separeted with a comma. One of:
READ_PAIRED or 1 or 0x1
PROPER_PAIR_ or 2 or 0x2
READ_UNMAPPED or 4 or 0x4
MATE_UNMAPPED or 8 or 0x8
READ_STRAND or 16 or 0x10
MATE_STRAND or 32 or 0x20
FIRST_OF_PAIR or 64 or 0x40
SECOND_OF_PAIR or 128 or 0x80
NOT_PRIMARY_ALIGNMENT or 256 or 0x100
READ_FAILS_VENDOR_QUALITY_CHECK or 512 or 0x200
DUPLICATE_READ or 1024 or 0x400

Example

The following command lists all the reads having a flag (READ_PAIRED and READ_UNMAPPED and MATE_UNMAPPED and FIRST_OF_PAIR):
java -cp sam-1.16.jar:dist/sbamgrep.jar fr.inserm.umr915.sbamtools.SbamGrep -f 0x4,0x1,0x8,0x40 file.bam
Result (masked):
@HD VN:1.0 SO:unsorted
@SQ SN:chrT LN:349250
IL_XXXX1 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (94**0-)*7=0688855555522@86;;;5;6:;63:4?-622647..-.5.%
IL_XXXX2 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9*+*2396,@5+5:@@@;;5)50)6960684;58;86*5102)0*+8:*137;
IL_XXXX3 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (/999-00328:88984@@=8??@@:-66,;8;5;6+;255,1;788883676'
IL_XXXX4 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (916928.82@@50854;33222224;@25?5522;5=;;858888555*0666
IL_XXXX5 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9*4*5-**32989+::@82;;853+39;80.53)-)79)..'55.8988*200
IL_XXXX6 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (*+**,14265;@@??@8?9@@@5@488?8666260.)-*9;;;88:8'05418
IL_XXXX7 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9136242-2@@@;96.888@@@@80$585882623.':**+3*03137..--.
IL_XXXX8 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (/89255855@88?585557..())@@@;5552286526755@@5888..3;/$
(...)

SbamStats

The second tool, SbamStats is available at http://code.google.com/p/code915/wiki/SbamStats. It provides a simple report about all the SAM flags used in one or more files.

Example

:
java -cp sam-1.16.jar:dist/sbamstats.jar fr.inserm.umr915.sbamtools.SBamStats file.bam
Output:
READ_PAIRED:0x1|READ_STRAND:0x10|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 10827
READ_PAIRED:0x1|READ_STRAND:0x10|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 10827
READ_PAIRED:0x1|FIRST_OF_PAIR:0x40 13951
READ_PAIRED:0x1|SECOND_OF_PAIR:0x80 13951
READ_PAIRED:0x1|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 23846
READ_PAIRED:0x1|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 23846
READ_PAIRED:0x1|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 33606
READ_PAIRED:0x1|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 33606
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 143090
READ_PAIRED:0x1|READ_UNMAPPED:0x4|READ_STRAND:0x10|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 143090
READ_PAIRED:0x1|READ_UNMAPPED:0x4|SECOND_OF_PAIR:0x80 161781
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|FIRST_OF_PAIR:0x40 161781
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|SECOND_OF_PAIR:0x80 174429
READ_PAIRED:0x1|READ_UNMAPPED:0x4|FIRST_OF_PAIR:0x40 174429
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 176219
READ_PAIRED:0x1|READ_UNMAPPED:0x4|READ_STRAND:0x10|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 176219
READ_PAIRED:0x1|PROPER_PAIR_:0x2|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 915161
READ_PAIRED:0x1|PROPER_PAIR_:0x2|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 915161
READ_PAIRED:0x1|PROPER_PAIR_:0x2|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 2623345
READ_PAIRED:0x1|PROPER_PAIR_:0x2|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 2623345
READ_PAIRED:0x1|READ_UNMAPPED:0x4|MATE_UNMAPPED:0x8|SECOND_OF_PAIR:0x80 26232123
READ_PAIRED:0x1|READ_UNMAPPED:0x4|MATE_UNMAPPED:0x8|FIRST_OF_PAIR:0x40 26232123



That's it !
Pierre

18 May 2007

A filter for RSS feeds

Stew has introduced the google-code-project where is hosted OpenRevien (the source code of Postgenomic). This gave me the idea to host feedfilter a program filtering rss feeds that I wrote quickly a few days ago. My project is hosted at http://code.google.com/p/feedfilter/. It's a small java program running as a CGI (you don't need tomcat or either..). A parameter in the cgi is used to filter the feeds:

Examples:

PNAS - RSS feed of Early Edition articles: keep posts containing the word "protein"
feed("http://www.pnas.org/rss/ahead.xml",contains("protein"))


PNAS - RSS feed of Early Edition articles: keep posts NOT containing the words "[IMMUNOLOGY]" or "[geology]"
feed("http://www.pnas.org/rss/ahead.xml",NOT(OR(contains("[IMMUNOLOGY]"),contains("[geology]"))))

RSS feeds from connotea about tag=bioinformatics where author is not dc:creator=lindenb
feed("http://www.connotea.org/rss/tag/bioinformatics",not(equals("dc:creator","lindenb")))

Syntax
input: feed( <url:string>, <node> )
node: AND( <node> , <node> )
| OR( <node> , <node> )
| NOT( <node> )
| contains( <qName:string>,<value:string>)
| contains( <value:string>)
| equals( <qName:string>,<value:string>)
| equals( <value:string>)
| regex( <qName:string>,<java regular expression:string>)
| regex( <java regular expression:string>)




Pierre