YOKOFAKUN: variation

Showing posts with label variation. Show all posts

25 October 2013

YES: "Choice of transcripts and software has a large effect on variant annotation"

This post was inspired by Aaron Quinlan's tweet:

Poster 1485: variant annotation is harder than you think. transcript set matters (surprise). #ASHG2013 http://t.co/FgJzomqVuV
— Aaron Quinlan (@aaronquinlan) October 23, 2013

Here is an example of a missense mutation found with VCFPredictions, a simple tool I wrote for variant effect prediction.

#CHROM POS ID ALT REF 
1 23710805 rs605468 A G

my tool uses the UCSC knownGene track, here is the context of the variant in the UCSC genome browser. There is one exon for TCEA3 (uc021oig.1) on the reverse strand.

If the base at 23710805 is changed from 'A'→'G' on the forward strand, there will be a non-synonymous variation Y(TAT)→H(CAT) on the reverse strand.

At the NCBI rs605468 is said to be " located in the intron region of NM_003196.1."

VEP cannot find this missense variation:

Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
rs605468 1:23710805 G - ENSR00001518296 Transcript regulatory_region_variant - - - - - rs605468 GMAF=A:0.1204
rs605468 1:23710805 G ENSG00000204219 ENST00000476978 Transcript intron_variant - - - - - rs605468 GMAF=A:0.1204
rs605468 1:23710805 G ENSG00000204219 ENST00000450454 Transcript intron_variant - - - - - rs605468 GMAF=A:0.1204

(of course, my tool doesn't find some variations found in VEP too)

That's it,
Pierre

27 September 2012

Playing with the #Ensembl REST API: filtering a VCF with javascript

The new Ensembl REST API was announced today: "We are pleased to announce the beta release of our programming language agnostic REST API, for Release 68 data, at http://beta.rest.ensembl. Our initial release provides access to:

Sequences (genomic, cDNA, CDS and protein)
VEP (Variant Effect Predictor)
Homologies
Gene Trees
Assembly and coordinate mapping."

In the current post I will filter a VCF file with this API and javascript.

The VCF

Our initial file is the following VCF:

##fileformat=VCFv4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10327 rs112750067 T C . PASS DP=65
1 69427 rs148502021 T A . PASS DP=1557
1 69452 rs142004627 A G . PASS DP=155
1 69475 rs148502021 C T . PASS DP=231
1 865583 rs148711625 A G . PASS DP=231
1 866460 rs148884928 A G . PASS DP=23
1 866461 . G A . PASS DP=24

The javascript

The VCF will be read on the standard input using the following script and the Rhino JS engine. The script reads the VCF and for each substitution, it calls the Ensembl REST API, parses the JSON response and return the transcript identifier if the mutation is a missense_variant or a polyphen_prediction="probably damaging":

importPackage(java.io);
importPackage(java.lang);



function sleep(milliseconds)
  {
  /* hacked from http://www.phpied.com/sleep-in-javascript/ */
  var start = new Date().getTime();
  for (var i = 0; i < 1e7; i++) {
    if ((new Date().getTime() - start) > milliseconds){
      break;
      }
    }
  }


function damagingTranscript(json)
 {
 for(var d in json.data)
  {
  var transcripts=json.data[d].transcripts;
 
  if(!transcripts) return null;
  for(var t in transcripts)
   {
   var transcript=transcripts[t];
   for(var a in transcript.alleles)
    {
    var allele=transcript.alleles[a];
    if(allele.polyphen_prediction=="probably damaging" ||
       allele.consequence_terms.indexOf("missense_variant")!=-1 )
        {
        return transcript.transcript_id;
        }
    }
   }
  }
 return null;
 }

var baseRegex=new RegExp(/^[ATGCatgc]$/);



var stdin = new java.io.BufferedReader( new java.io.InputStreamReader(java.lang.System['in']) );
var line;
while((line=stdin.readLine())!=null)
 {
 if(line.startsWith("#"))
  {
  print(line); continue;
  }
 var tokens=line.split("\t");
 var chrom=tokens[0];
 var pos= parseInt(tokens[1]);
 var ref= tokens[3];
 var alt= tokens[4];
 if(!baseRegex.test(ref)) continue;
 if(!baseRegex.test(alt)) continue;
 
 sleep(200);
 var url="http://beta.rest.ensembl.org/vep/human/"+
  chrom+":"+pos+"-"+pos+"/"+alt+
  "/consequences?content-type=application/json";
 var jsonStr=readUrl(url,"UTF-8");
 var json=eval("("+jsonStr+")");
 var transcript=damagingTranscript(json);
 if(transcript==null) continue;
 tokens[7]+=(";TRANSCRIPT="+transcript);
 print(tokens.join('\t'));
 }

As an example, here is the response from the ENSEMBL server for 1:866460 A/G:

http://beta.rest.ensembl.org/vep/human/1:866460-866460/G/consequences?content-type=application/json

{
    "data": [
        {
            "location": {
                "coord_system": "chromosome",
                "name": "1",
                "strand": 1,
                "end": "866460",
                "start": "866460"
            },
            "hgvs": {
                "G": "1:g.866460C>G"
            },
            "transcripts": [
                {
                    "translation_stable_id": "ENSP00000411579",
                    "intron_number": null,
                    "cdna_end": 385,
                    "translation_end": 99,
                    "exon_number": "4/7",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000420190",
                    "cdna_start": 385,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "translation_start": 99,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000411579.1:p.Ala99Gly",
                            "sift_score": 0.04,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000420190.1:c.296C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000393181",
                    "intron_number": null,
                    "cdna_end": 356,
                    "translation_end": 99,
                    "exon_number": "4/5",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000437963",
                    "cdna_start": 356,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "translation_start": 99,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000393181.1:p.Ala99Gly",
                            "sift_score": 0.03,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000437963.1:c.296C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000342313",
                    "intron_number": null,
                    "cdna_end": 379,
                    "translation_end": 99,
                    "exon_number": "4/14",
                    "is_canonical": 1,
                    "transcript_id": "ENST00000342066",
                    "cdna_start": 379,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000342313.3:p.Ala99Gly",
                            "sift_score": 0.01,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000342066.3:c.296C>G"
                        }
                    },
                    "translation_start": 99,
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "ccds": "CCDS2.2",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000349216",
                    "intron_number": null,
                    "cdna_end": 67,
                    "translation_end": 23,
                    "exon_number": "2/12",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000341065",
                    "cdna_start": 67,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 68,
                    "translation_start": 23,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.008,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000349216.4:p.Ala23Gly",
                            "sift_score": 0.01,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000341065.4:c.67C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 68,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                }
            ]
        }
    ]
}

Invoking RHINO, filtering the VCF

$ cat input.vcf | rhino -f restensembl.js
  
##fileformat=VCFv4.0
##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO
1 69427 rs148502021 T A . PASS DP=1557;TRANSCRIPT=ENST00000335137
1 69452 rs142004627 A G . PASS DP=155;TRANSCRIPT=ENST00000335137
1 69475 rs148502021 C T . PASS DP=231;TRANSCRIPT=ENST00000335137
1 865583 rs148711625 A G . PASS DP=231;TRANSCRIPT=ENST00000420190
1 866460 rs148884928 A G . PASS DP=23;TRANSCRIPT=ENST00000420190

Limitations for the Ensembl REST API:

@yokofakun how many requests have you done? We rate limit to 3 per second averaged over the hour and allow spikes of 6 per second
— andrewyatz (@andrewyatz) September 27, 2012

@yokofakun lol I like your style :). If you do 3 predictions per second you'll be fine :)
— andrewyatz (@andrewyatz) September 27, 2012

That's it,

Pierre

27 August 2012

Reasoning with the Variation Ontology using Apache Jena #OWL #RDF

The Variation Ontology (VariO), "is an ontology for standardized, systematic description of effects, consequences and mechanisms of variations".
In this post I will use the Apache Jena library for RDF to load this ontology. It will then be used to extract a set of variations that are a sub-class of a given class of Variation.

Loading the ontology

The OWL ontology is available for download here: http://www.variationontology.org/download/VariO_0.979.owl. A new RDF model for an OWL ontology is created and the owl file is loaded.

OntModel ontModel = ModelFactory.createOntologyModel();
InputStream in = FileManager.get().open(VO_OWL_URL);
ontModel.read(in, "");
in.close();

Creating a Reasoner

A OWL Reasoner is then created and associated to the previous model:

Reasoner reasoner = ReasonerRegistry.getOWLReasoner();
reasoner=this.reasoner.bindSchema(ontModel);

Creating a random set of variations

A new RDF model is created to hold a few instances of random Variations. For each instance, we add a random property 'my:chromosome', a random property 'my:position' and we associated one of the following type:

vo:VariO_0000029 "modified amino acid", a sub-Class of vo:VariO_0000028 ("post translationally modified protein")
vo:VariO_0000030 "spliced protein", a sub-Class of vo:VariO_0000028 ("post translationally modified protein")
vo:VariO_0000033 "effect on protein subcellular localization". It is NOT a sub-class of vo:VariO_0000028

Random rand=new Random();
com.hp.hpl.jena.rdf.model.Model instances = ModelFactory.createDefaultModel();
instances.setNsPrefix("vo",VO_PREFIX);
instances.setNsPrefix("my",MY_URI);

for(int i=0;i< 10;++i)
 {
 Resource subject= null;
 Resource rdftype=null;
 switch(i%3)
    {
    case 0:
       {
       //modified amino acid
       subject=instances.createResource(AnonId.create("modaa_"+i));
       rdftype=instances.createResource(VO_PREFIX+"VariO_0000029");
       break;
       }
    case 1:
       {
       //spliced protein
       subject=instances.createResource(AnonId.create("spliced_"+i));
       rdftype=instances.createResource(VO_PREFIX+"VariO_0000030");
       break;
       }
    default:
       {
       //effect on protein subcellular localization
       subject=instances.createResource(AnonId.create("subcell_"+i));
       rdftype=instances.createResource(VO_PREFIX+"VariO_0000033");
       break;
       }
    }
 instances.add(subject, RDF.type, rdftype);
 instances.add(subject, hasChromosome, instances.createLiteral("chr"+(1+rand.nextInt(22))));
 instances.add(subject, hasPosition, instances.createTypedLiteral(rand.nextInt(1000000)));
 }

Reasoning

A new inference model is created using the reasoner and the instances of variation. An iterator is used to only list the variations being a subclasses of vo:VariO_0000028 and having a property "my:chromosome" and a property "my:position".

InfModel model = ModelFactory.createInfModel (reasoner, instances);
ExtendedIterator<Statement> sti = model.listStatements(
        null, null, model.createResource(VO_PREFIX+"VariO_0000028"));
sti=sti.filterKeep(new Filter<Statement>()
      {
      @Override
      public boolean accept(Statement stmt)
         {
         return   stmt.getSubject().getProperty(hasChromosome)!=null &&
               stmt.getSubject().getProperty(hasPosition)!=null
               ;
         }
      });

Loop over the iterator and print the result:

while(sti.hasNext() )
         {
         Statement stmt = sti.next();
         System.out.println("\t+ " + PrintUtil.print(stmt));
         Statement val=stmt.getSubject().getProperty(hasChromosome);
         System.out.println("\t\tChromosome:\t"+val.getObject());
         val=stmt.getSubject().getProperty(hasPosition);
         System.out.println("\t\tPosition:\t"+val.getObject());
         }

Result

   + (spliced_7 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr7
      Position:   134172^^http://www.w3.org/2001/XMLSchema#int
   + (spliced_4 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr13
      Position:   674316^^http://www.w3.org/2001/XMLSchema#int
   + (spliced_1 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr22
      Position:   457596^^http://www.w3.org/2001/XMLSchema#int
   + (modaa_9 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr12
      Position:   803303^^http://www.w3.org/2001/XMLSchema#int
   + (modaa_6 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr15
      Position:   794137^^http://www.w3.org/2001/XMLSchema#int
   + (modaa_3 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr14
      Position:   34487^^http://www.w3.org/2001/XMLSchema#int
   + (modaa_0 rdf:type http://purl.obolibrary.org/obo/VariO_0000028)
      Chromosome:   chr15
      Position:   536371^^http://www.w3.org/2001/XMLSchema#int

Full source code

import java.io.IOException;
import java.io.InputStream;
import java.util.Random;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.AnonId;
import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.reasoner.Reasoner;
import com.hp.hpl.jena.reasoner.ReasonerRegistry;
import com.hp.hpl.jena.util.FileManager;
import com.hp.hpl.jena.util.PrintUtil;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;
import com.hp.hpl.jena.util.iterator.Filter;
import com.hp.hpl.jena.vocabulary.RDF;

public class VariationOntologyReasoner
  {
  private static final String VO_PREFIX="http://purl.obolibrary.org/obo/";
  private static final String MY_URI="urn:my:ontology";
  private static final String VO_OWL_URL="http://www.variationontology.org/download/VariO_0.979.owl";
  private Reasoner reasoner;
  static final private Property hasChromosome=ModelFactory.createDefaultModel().createProperty(MY_URI,"chromosome");
  static final private Property hasPosition=ModelFactory.createDefaultModel().createProperty(MY_URI,"position");
  
  private VariationOntologyReasoner() throws IOException
    {
    OntModel ontModel = ModelFactory.createOntologyModel();
     InputStream in = FileManager.get().open(VO_OWL_URL);
    ontModel.read(in, "");
    in.close();
    this.reasoner = ReasonerRegistry.getOWLReasoner();
    this.reasoner=this.reasoner.bindSchema(ontModel);
    }
  
  private void run()
    {
    Random rand=new Random();
    com.hp.hpl.jena.rdf.model.Model instances = ModelFactory.createDefaultModel();
    instances.setNsPrefix("vo",VO_PREFIX);
    instances.setNsPrefix("my",MY_URI);

    for(int i=0;i< 10;++i)
      {
      Resource subject= null;
      Resource rdftype=null;
      switch(i%3)
        {
        case 0:
          {
          //modified amino acid
          subject=instances.createResource(AnonId.create("modaa_"+i));
          rdftype=instances.createResource(VO_PREFIX+"VariO_0000029");
          break;
          }
        case 1:
          {
          subject=instances.createResource(AnonId.create("spliced_"+i));
          rdftype=instances.createResource(VO_PREFIX+"VariO_0000030");
          break;
          }
        default:
          {
          //effect on protein subcellular localization
          subject=instances.createResource(AnonId.create("subcell_"+i));
          rdftype=instances.createResource(VO_PREFIX+"VariO_0000033");
          break;
          }
        }
      instances.add(subject, RDF.type, rdftype);
      instances.add(subject, hasChromosome, instances.createLiteral("chr"+(1+rand.nextInt(22))));
      instances.add(subject, hasPosition, instances.createTypedLiteral(rand.nextInt(1000000)));
      }
    
    InfModel model = ModelFactory.createInfModel (reasoner, instances);
    ExtendedIterator<Statement> sti = model.listStatements(null, null, model.createResource(VO_PREFIX+"VariO_0000028"));
    sti=sti.filterKeep(new Filter<Statement>()
        {
        @Override
        public boolean accept(Statement stmt)
          {
          return  stmt.getSubject().getProperty(hasChromosome)!=null &&
              stmt.getSubject().getProperty(hasPosition)!=null
              ;
          }
        });
    while(sti.hasNext() )
      {
      Statement stmt = sti.next();
      System.out.println("\t+ " + PrintUtil.print(stmt));
      Statement val=stmt.getSubject().getProperty(hasChromosome);
      System.out.println("\t\tChromosome:\t"+val.getObject());
      val=stmt.getSubject().getProperty(hasPosition);
      System.out.println("\t\tPosition:\t"+val.getObject());
      }
    }
  
  public static void main(String[] args) throws Exception
    {
    VariationOntologyReasoner app=new VariationOntologyReasoner();
    app.run();
  }
}

That's it,

Pierre

23 January 2007

Create your Visualizations with IBM's Many Eyes

Many Eyes is a new social tool from IBM used to create a custom dynamic visualization (bar chart,treemap, block histogram, bubble chart, line graph, network diagram, pie, scatterplot, stack, GIS ...) charts from your data. The graphics are plotted via a java applet. I've tested it using the snp data from the UCSC: for example, I've running the following query in the snp table:
select chrom as "chromosome",molType,class ,func as "function",avg(avHet),min(avHet),max(avHet),count(*) as "count" from snp126 where avHet>0 and valid!="unknown" group by chrom,molType,class,func

which group the snps features by chromosome, class ,etc... and I've displayed it as a treemap using Many Eyes

Pierre

YOKOFAKUN