Showing posts with label hapmap. Show all posts
Showing posts with label hapmap. Show all posts

23 July 2010

Rules Engine for Bioinformatics. Playing with Drools and the Hapmap data


Drools is a Rules Engine implementation of the Rete algorithm. (from "http://java.sys-con.com/node/45082":)A rule engine is a system for applying some set of if/then rules to the data in a system and then taking some action on the data or the system itself. Its primary purpose is to separate the business logic from the system logic - to externalize the business logic so it can be maintained separately. Use a rule engine to separate the if/then rules from system code - to externalize them so they can be maintained separately from the system code. The behavior of the system can then be modified without changing code or needing to recompile/redeploy. Rules are stored in human-readable form in a file so they can be changed with a text editor or rule editor.

I would like to know if JBOSS can be used for some common bioinformatics tasks. For example, can I use Drools when my users want to select some candidate genes with various parameters : "I want a gene with 3 SNPs in the first exon but with less than 2 microsattelites in the 3'UTR and if there is another gene close to this one I want ... etc... etc...".

In the current post, I've played with the Drools engine to search some mendelians incompatibilities in the hapmap data.

First, download some the pedigrees and the genotypes:

Here is the structure of my project:
./test
./test/Individual.java
./test/Snp.java
./test/Drools01.java
./test/Genotype.java
./test/hapmap.drl


Individual, Genotype and Snp are 3 simple POJO files.

Individual.java


package test;
import java.io.Serializable;

public class Individual implements Serializable
{
private static final long serialVersionUID = 1L;
private int familyId;
private String name;
private String fatherName;
private String motherName;
private int gender;
//constructor, getters, setters, etc...
}


Snp.java


package test;
import java.io.Serializable;

public class Snp
implements Serializable
{
private static final long serialVersionUID = 1L;
private String name;
private String alleles;
private String chrom;
private int position;
//constructor, getters, setters, etc...
}


Genotype.java


package test;
import java.io.Serializable;

public class Genotype
implements Serializable
{
private static final long serialVersionUID = 1L;
private Snp snp;
private Individual individual;
private char a1;
private char a2;

public Genotype(Snp snp, Individual individual, String observed)
{
super();
this.snp = snp;
this.individual = individual;

this.a1 = observed.charAt(0);
this.a2 = observed.charAt(1);
if(this.a1> this.a2)
{
this.a2 = a1;
this.a1=observed.charAt(1);
}
}
public boolean contains(char allele)
{
return getA1()==allele || getA2()==allele;
}

//constructor, getters, setters, etc...
}

hapmap.drl


hapmap.drl is the Drools rules file. It contains the "business logic" and can be modified without changing anything in the java program:
package test

rule "Select snps"
salience 30
when
$snp:Snp($chrom:chrom,$pos:position)
eval(!($chrom=="chr21" && $pos> 9880000 && $pos< 10043000))
then
retract($snp);
System.err.println("Removing : "+$snp.getName());
end

rule "One Parent"
salience 20

when
$children : Individual($dad:fatherName,$mum:motherName)
$parent: Individual(name==$dad || name==$mum)
$genotypeChild : Genotype(individual==$children,$snp:snp,a1!='N',a2!='N' )
$genotypeParent : Genotype(individual==$parent,snp==$snp,a1!='N',a2!='N')
eval(!(
$genotypeParent.contains($genotypeChild.getA1()) ||
$genotypeParent.contains($genotypeChild.getA2())
))
then
retract($genotypeChild);
System.out.println($snp.getName()+" : problem with "+
$children+"("+$genotypeChild.getA1()+"/"+$genotypeChild.getA2()+") incompatibility with parent:"+
$parent+"("+$genotypeParent.getA1()+"/"+$genotypeParent.getA2()+")"
);
end

rule "Both Parents"
salience 10
when
$children : Individual($dad:fatherName,$mum:motherName)
$father: Individual(name==$dad )
$mother: Individual(name==$mum )
$genotypeChild : Genotype(individual==$children,$snp:snp,a1!='N',a2!='N' )
$genotypeDad : Genotype(individual==$father,snp==$snp,a1!='N',a2!='N')
$genotypeMum : Genotype(individual==$mother,snp==$snp,a1!='N',a2!='N')
eval(!(
($genotypeDad.contains( $genotypeChild.getA1()) && $genotypeMum.contains( $genotypeChild.getA2())) ||
($genotypeDad.contains( $genotypeChild.getA2()) && $genotypeMum.contains( $genotypeChild.getA1()))
))
then
retract($genotypeChild);
System.out.println($snp.getName()+" : problem with "+
$children+"("+$genotypeChild.getA1()+"/"+$genotypeChild.getA2()+") incompatibility with parents:"+
$father+"("+$genotypeDad.getA1()+"/"+$genotypeDad.getA2()+") "+
$mother+"("+$genotypeMum.getA1()+"/"+$genotypeMum.getA2()+") "
);
end
  • The first rule "Select snps" (=with the highest priority (salience)) remove all the SNPs that are not in "chr21:9880000-10043000"
  • The second rule "One Parent" prints a message if there is an incompatibility between a children and one of his parents
  • The last rule "Both Parents" prints a message if the is an incompatibility between a children and both of his parents


Drools01.java

Drools01.java initializes the Drools engine, parses the hapmap files, put those objects in the "KnowledgeBase" and fires all the rules:
package test;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.drools.KnowledgeBase;
import org.drools.KnowledgeBaseFactory;
import org.drools.builder.KnowledgeBuilder;
import org.drools.builder.KnowledgeBuilderError;
import org.drools.builder.KnowledgeBuilderErrors;
import org.drools.builder.KnowledgeBuilderFactory;
import org.drools.builder.ResourceType;
import org.drools.definition.KnowledgePackage;
import org.drools.io.ResourceFactory;
import org.drools.runtime.StatefulKnowledgeSession;

public class Drools01
{
private void loadHapMap(StatefulKnowledgeSession session)throws Exception
{
Map<String, Individual> name2individual=new HashMap<String, Individual>();
String line;
BufferedReader in= new BufferedReader(new FileReader(
"/home/lindenb/relationships_w_pops_121708.txt"
));

while((line=in.readLine())!=null)
{
if(line.startsWith("FID") || !line.endsWith("CEU")) continue;
String tokens[]=line.split("[\t]");
Individual indi=new Individual(
Integer.parseInt(tokens[0]),
tokens[1],
tokens[2].equals("0")?null:tokens[2],
tokens[3].equals("0")?null:tokens[3],
Integer.parseInt(tokens[4])
);
name2individual.put(indi.getName(), indi);
}
in.close();

in= new BufferedReader(new FileReader("/home/lindenb/genotypes_chr21_CEU_r27_nr.b36_fwd.txt"));
int r=0;

line=in.readLine();
String tokens[]=line.split("[ ]");
Map<Integer, Individual> index2individual=new HashMap<Integer, Individual>(tokens.length);

for(int i=11;i <tokens.length;++i)
{
Individual indi=name2individual.get(tokens[i]);
session.insert(indi);
if(indi==null) throw new NullPointerException("Cannot get "+tokens[i]);
index2individual.put(i, indi);
}

while((line=in.readLine())!=null)
{
tokens=line.split("[ ]");
Snp snp=new Snp(tokens[0], tokens[1],tokens[2], Integer.parseInt(tokens[3]));
session.insert(snp);
for(int i=11;i <tokens.length;++i)
{
session.insert(new Genotype(snp, index2individual.get(i), tokens[i]));
}
//System.err.println(line);
if(r++>2000) break;//just read the first 2000 rows everything...
}
in.close();
}

private void run()throws Exception
{
KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
kbuilder.add( ResourceFactory.newClassPathResource(
"hapmap.drl",
Drools01.class ),
ResourceType.DRL );
KnowledgeBuilderErrors errors= kbuilder.getErrors();
if(!errors.isEmpty())
{
for(KnowledgeBuilderError error:errors)
{
System.err.println(error.getMessage());
}
return;
}

Collection<KnowledgePackage> pkgs = kbuilder.getKnowledgePackages();
kbase.addKnowledgePackages( pkgs );
StatefulKnowledgeSession ksession = kbase.newStatefulKnowledgeSession();
loadHapMap(ksession);

ksession.fireAllRules();
}
/**
* @param args
*/
public static void main(String[] args) throws Exception
{
try {
new Drools01().run();
} catch(Exception err)
{
err.printStackTrace();
}
}
}

Compilation


Makefile::
DROOLS=/home/lindenb/package/drools-5.0
ECLIPSE=/home/lindenb/package/eclipse/plugins
CP=$(DROOLS)/drools-core-5.0.1.jar:$(DROOLS)/drools-api-5.0.1.jar:$(DROOLS)/lib/mvel2-2.0.10.jar:$(DROOLS)/lib/antlr-runtime-3.1.1.jar:$(DROOLS)/drools-compiler-5.0.1.jar:$(ECLIPSE)/org.eclipse.jdt.core_3.5.2.v_981_R35x.jar

all:
javac -cp ${CP}:. test/Drools01.java
java -cp ${CP}:. test.Drools01

Results

Removing : rs885550
Removing : rs28363862
Removing : rs28783163
rs1028272 : problem with NA10838(T/T) incompatibility with parent:NA12003(A/A)
rs9647052 : problem with NA10847(A/A) incompatibility with parent:NA12239(C/C)
rs1882882 : problem with NA07014(A/A) incompatibility with parent:NA07031(G/G)
rs12627714 : problem with NA07048(G/G) incompatibility with parent:NA07034(A/A)
rs17240368 : problem with NA07048(A/A) incompatibility with parent:NA07034(G/G)
rs2822605 : problem with NA07048(C/C) incompatibility with parent:NA07034(G/G)
rs2822593 : problem with NA07048(G/G) incompatibility with parent:NA07034(A/A)
rs7276922 : problem with NA07048(T/T) incompatibility with parent:NA07034(C/C)
rs10439653 : problem with NA07048(A/A) incompatibility with parent:NA07034(C/C)
rs10439652 : problem with NA07048(C/C) incompatibility with parent:NA07034(T/T)
rs17001769 : problem with NA10856(A/A) incompatibility with parent:NA11829(C/C)
rs9977658 : problem with NA10860(C/C) incompatibility with parent:NA11992(T/T)
rs8133625 : problem with NA10856(G/G) incompatibility with parent:NA11830(A/A)
rs12627045 : problem with NA12740(A/A) incompatibility with parent:NA12750(G/G)
rs416083 : problem with NA10843(G/G) incompatibility with parent:NA11919(A/A)
rs2822484 : problem with NA06991(A/A) incompatibility with parent:NA06985(G/G)
rs9977169 : problem with NA12865(C/C) incompatibility with parent:NA12875(T/T)
rs379724 : problem with NA07019(G/G) incompatibility with parent:NA07056(A/A)
rs13046593 : problem with NA10860(C/C) incompatibility with parent:NA11992(G/G)
rs9984592 : problem with NA10854(G/G) incompatibility with parent:NA11840(A/A)
rs2187275 : problem with NA10831(C/C) incompatibility with parent:NA12156(T/T)
rs6516605 : problem with NA12708(G/G) incompatibility with parent:NA12718(C/C)
rs7283783 : problem with NA10830(G/G) incompatibility with parent:NA12154(A/A)
rs13051673 : problem with NA12739(G/G) incompatibility with parent:NA12748(A/A)
rs3115488 : problem with NA10860(A/A) incompatibility with parent:NA11993(G/G)
rs8132413 : problem with NA10860(T/T) incompatibility with parent:NA11993(A/A)
rs2207842 : problem with NA10838(A/A) incompatibility with parent:NA12004(G/G)
rs2821973 : problem with NA12832(C/C) incompatibility with parent:NA12843(T/T)
rs469471 : problem with NA07349(A/A) incompatibility with parent:NA07347(G/G)
rs8129674 : problem with NA10839(G/G) incompatibility with parent:NA12006(A/A)
rs2257224 : problem with NA10854(G/G) incompatibility with parent:NA11840(A/A)
rs865859 : problem with NA10855(C/C) incompatibility with parent:NA11832(T/T)
rs2742158 : problem with NA10855(C/C) incompatibility with parent:NA11832(T/T)
rs4111253 : problem with NA10836(C/C) incompatibility with parent:NA12275(T/T)
rs240444 : problem with NA10861(T/T) incompatibility with parent:NA11994(C/C)
rs469812 : problem with NA06991(C/C) incompatibility with parent:NA06985(G/G)
rs210534 : problem with NA06991(T/T) incompatibility with parent:NA06985(A/A)
rs2822670 : problem with NA10861(C/T) incompatibility with parents:NA11994(C/C) NA11995(C/C)
rs9305335 : problem with NA10831(C/T) incompatibility with parents:NA12155(C/C) NA12156(C/C)
rs9305297 : problem with NA12801(A/G) incompatibility with parents:NA12812(G/G) NA12813(G/G)
rs2822641 : problem with NA07349(A/C) incompatibility with parents:NA07347(C/C) NA07346(C/C)
rs9977057 : problem with NA12801(G/T) incompatibility with parents:NA12812(T/T) NA12813(T/T)
rs2822614 : problem with NA12877(C/T) incompatibility with parents:NA12889(C/C) NA12890(C/C)
rs2178907 : problem with NA07349(A/G) incompatibility with parents:NA07347(G/G) NA07346(G/G)
rs1124322 : problem with NA10837(A/G) incompatibility with parents:NA12272(G/G) NA12273(G/G)
rs2822537 : problem with NA12336(A/G) incompatibility with parents:NA12342(G/G) NA12343(G/G)
rs386524 : problem with NA12753(C/T) incompatibility with parents:NA12762(C/C) NA12763(C/C)
rs367249 : problem with NA12802(A/G) incompatibility with parents:NA12814(G/G) NA12815(G/G)
rs17001380 : problem with NA10856(G/T) incompatibility with parents:NA11829(G/G) NA11830(G/G)
rs2155965 : problem with NA12336(A/T) incompatibility with parents:NA12342(A/A) NA12343(A/A)
rs8133601 : problem with NA10854(C/T) incompatibility with parents:NA11839(C/C) NA11840(C/C)
rs8134986 : problem with NA07014(A/C) incompatibility with parents:NA07051(C/C) NA07031(C/C)
rs3094804 : problem with NA12865(C/T) incompatibility with parents:NA12874(T/T) NA12875(T/T)
rs1929150 : problem with NA06997(C/T) incompatibility with parents:NA06986(T/T) NA07045(T/T)
rs7276195 : problem with NA07019(A/G) incompatibility with parents:NA07022(A/A) NA07056(A/A)
rs3855691 : problem with NA12877(A/G) incompatibility with parents:NA12889(G/G) NA12890(G/G)
rs17468376 : problem with NA10835(A/G) incompatibility with parents:NA12248(A/A) NA12249(A/A)
rs2747618 : problem with NA10835(C/T) incompatibility with parents:NA12248(C/C) NA12249(C/C)
rs2943900 : problem with NA10854(C/T) incompatibility with parents:NA11839(T/T) NA11840(T/T)
rs4009972 : problem with NA10835(G/T) incompatibility with parents:NA12248(G/G) NA12249(G/G)
rs8133159 : problem with NA12767(A/G) incompatibility with parents:NA12777(A/A) NA12778(A/A)


Questions

So many questions: how should I model my data ? what if those data are alread present in database) ? how drools supports a large amount of data ? etc... etc...

That's it

Pierre

11 May 2010

Playing with the C API for sqlite: inserting the hapmap data , my notebook.



In this post, I'll insert a Hapmap genotype file into a SQLite database using the C API for SQLIte. Of course, this is just a test and I would not use SQLite to store this kind of information. Moreover the speed of insertion of SQLite was deadly slow here.

The Genotype file

The first line contains the name of the individuals, the other rows of the file contain one SNP per line and the genotype for each individual:
rs# alleles chrom pos strand assembly# center protLSID assayLSID panelLSID QCcode NA06984 NA06985 NA06986 NA06989 NA06991 NA06993 NA06994 NA06995 NA06997 NA07000 NA07014 NA07019 NA07022 NA07029 NA07031 NA07034 NA07037 NA07045 NA07048 NA07051 NA07055 NA07056 NA07345 NA07346 NA07347 NA07348 NA07349 NA07357 NA07435 NA10830 NA10831 NA10835 NA10836 NA10837 NA10838 NA10839 NA10840 NA10843 NA10845 NA10846 NA10847 NA10850 NA10851 NA10852 NA10853 NA10854 NA10855 NA10856 NA10857 NA10859 NA10860 NA10861 NA10863 NA10864 NA10865 NA11829 NA11830 NA11831 NA11832 NA11839 NA11840 NA11843 NA11881 NA11882 NA11891 NA11892 NA11893 NA11894 NA11917 NA11918 NA11919 NA11920 NA11930 NA11931 NA11992 NA11993 NA11994 NA11995 NA12003 NA12004 NA12005 NA12006 NA12043 NA12044 NA12045 NA12056 NA12057 NA12144 NA12145 NA12146 NA12154 NA12155 NA12156 NA12234 NA12236 NA12239 NA12248 NA12249 NA12264 NA12272 NA12273 NA12275 NA12282 NA12283 NA12286 NA12287 NA12335 NA12336 NA12340 NA12341 NA12342 NA12343 NA12344 NA12347 NA12348 NA12375 NA12376 NA12383 NA12386 NA12399 NA12400 NA12413 NA12489 NA12546 NA12707 NA12708 NA12716 NA12717 NA12718 NA12739 NA12740 NA12748 NA12749 NA12750 NA12751 NA12752 NA12753 NA12760 NA12761 NA12762 NA12763 NA12766 NA12767 NA12775 NA12776 NA12777 NA12778 NA12801 NA12802 NA12812 NA12813 NA12814 NA12815 NA12817 NA12818 NA12827 NA12828 NA12829 NA12830 NA12832 NA12842 NA12843 NA12864 NA12865 NA12872 NA12873 NA12874 NA12875 NA12877 NA12878 NA12889 NA12890 NA12891 NA12892
rs6423165 A/G chrY 109805 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-8572888:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ GG AG GG AG AG AG AA AA GG AA AG NN AG AA AG NN AA AG NN GG GG GG GG AA GG GG AG AG AA AG GG AG AG AG AA AG AG AG GG GG AA AA NN GG AG NN AA AG NN GG NN GG GG GG AA AG AA AA AG GG AG AG GG AG AA AG GG AG GG GG AG GG AG GG GG AG GG GG AA NN AG GG AG GG GG GG AG GG GG AA AA GG GG GG NN AG AG AG GG AG AG AA AA GG AA GG AA AG AA AG GG AG GG GG AG AG GG GG AG AG AG NN AG AG AG AG AG NN AA AA AG AG AG AG AG AA AA AG AG AA AG GG AG GG GG AG AG AG AG AG AG AA AG AG AG AG AA AA AG GG GG GG GG AG AG GG GG AG AA GG AG AG GG AG
rs6608381 C/T chrY 109922 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-8528859:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ CC CT CC CT CT CT TT TT CC TT CT NN CT TT CT NN TT CC NN CC CC CC CC TT CC CC CT CT TT CT CC CT CT CT TT CT CT CT CC CC TT TT NN CC CT NN TT CT NN CC NN CC CC CC TT CT TT TT CT CC CT CT CC CT TT CT CC CT CC CC CT CC CT CC CC CT CC CC TT NN CT CC CT CC CC CC CT CC CC TT TT CC CC CC NN CT CT CT CC CT CT TT TT CC TT CC TT CT TT CT CC CT CC CC CT CT CC CC CT CT CT NN CT CT CT CT CT NN TT TT CT CT CT CT CT TT TT CT CT TT CT CC CT CC CC CT CT CC CT CT CC TT CT CT CT CT TT TT CT CC CC CC CC CT CT CC CC CT TT CC CT CT CC CT
rs6644972 A/G chrY 118624 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs6644972:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ GG GG AG GG AG AG AG GG AG GG AG NN GG GG AG NN GG GG NN AG AG GG GG GG GG GG GG GG GG AG AA GG GG GG AG GG GG GG GG GG GG GG NN GG AG GG GG GG NN AG NN AG AG AA GG GG GG GG GG GG GG GG GG AG GG GG AA AG GG AG GG GG GG GG GG GG GG AG AG NN AG GG AG GG GG GG AG GG GG GG GG AA AA AG NN GG GG GG GG GG GG GG AG GG GG GG GG AG GG GG AG GG GG GG GG GG GG GG GG GG GG AG GG GG GG GG GG NN GG GG AG GG GG AG GG AG GG AG AA GG GG GG GG AG GG GG GG AG GG AG GG GG GG AG GG GG AG GG AG GG GG GG GG GG GG AG GG GG GG GG GG GG GG GG
rs6655397 A/G chrY 141935 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs6655397:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ GG GG GG GG AG AG GG GG GG GG GG NN GG GG NN NN GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG NN AG GG GG GG GG NN AG NN GG GG GG NN AG GG GG GG AG GG GG AG GG GG GG GG AG GG AG GG GG GG GG AG AG GG GG GG NN AG GG GG GG GG GG GG GG GG GG GG GG GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG NN GG GG AG GG AG NN GG GG GG GG GG GG GG GG AG GG GG AG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG AG GG GG GG GG GG GG
rs6603172 C/T chrY 142439 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-8527413:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ TT TT TT TT CT CT TT TT TT TT TT NN TT TT TT NN TT TT NN TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT NN CT TT TT TT TT NN CT NN TT TT TT TT CT TT TT TT CT TT TT CT TT TT TT TT CT TT CT TT TT TT TT CT CT TT TT TT NN CT TT TT TT TT TT TT TT TT TT TT TT TT TT NN TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT CT NN TT TT TT TT TT TT TT TT CT TT TT CT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT TT TT TT TT TT
rs6644970 A/G chrY 142664 + ncbi_b36 broad urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:3 urn:LSID:broad.hapmap.org:Assay:SNP_A-4207883:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ AG AA AA AG AG GG AG AG AA AA GG NN AG AA GG NN AG AG NN AG AG AG GG AG AG AG AA AG GG GG AA AG AG AG AG AG AA AA AG GG AG GG NN GG AG NN AG AG NN AG NN AA AG AA AA AG GG GG AG AG AG AG AG AG AA AG AA AG GG GG AG AG AA GG AG GG AA AG AG NN AG AG GG AG GG GG AG GG GG AA AG AA AG GG NN GG AA GG AG AG GG GG AA GG AA AG AG GG GG AG AG GG AG AG AG GG AG AG AG AG AG NN AG GG AG AG GG NN AA AG AA AA GG AG AG AG GG AG AG AG GG AG AA AG GG AA AA AG GG AA AG AG GG AA GG AA AA AG AG AG AG AG GG AG GG GG AG AG AG AA AA GG AG AG
rs11019 A/G chrY 159978 + ncbi_B36 affymetrix urn:LSID:affymetrix.hapmap.org:Protocol:GenomeWideSNP_6.0:2 urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-4207841:2 urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1 QC+ NN AG NN NN AG GG AA NN NN AG NN AG GG AA NN AG NN NN GG NN GG AG GG NN NN GG NN GG NN GG AG GG NN NN GG AG NN NN NN AG AG NN NN NN NN AA AG GG GG AA AG GG GG NN NN GG GG AG GG AG AG NN AA AA NN NN NN NN NN NN NN NN NN NN GG AG GG AG AG GG AG AG GG GG NN GG GG AG AG GG GG AG AG AG AG AG GG GG GG NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN AG NN AA GG NN NN GG NN NN GG GG AG AA AG GG AA AA NN NN NN NN NN NN AG GG GG AG GG AG NN NN NN NN NN NN NN NN NN GG GG AG AG GG GG NN GG NN NN GG AG
rs1132353 C/T chrY 160116 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs1132353:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ CC CT TT CT CT TT NN CT TT CT CC NN TT CC CT NN TT CT NN CT TT CT CT TT TT CC TT CT CC TT CT TT TT TT TT CT TT TT TT CT CT TT NN TT CT NN CT TT NN CC NN TT TT CC NN TT TT CT TT CT CT CC CC CC CT CT CC CT CC CT CT TT TT TT TT CT TT CT CT NN CT CT TT TT TT TT TT CC CT TT TT CT CT CT NN CT TT TT TT TT TT TT TT TT TT TT CT CT CC TT CT TT CT CT CT CT CT CC CC CC CC TT CC CT CT CT CC NN TT CT TT TT CT TT CT CT CC CT TT CC CC TT TT TT CT TT TT CT TT TT CT TT CT TT TT CT TT CT TT CT CT CT TT TT CT CT TT TT CT TT TT CC TT CT
rs35047434 C/T chrY 169118 + ncbi_b36 sanger urn:LSID:illumina.hapmap.org:Protocol:Human_1M_BeadChip:3 urn:LSID:sanger.hapmap.org:Assay:H1Mrs35047434:3 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:3 QC+ TT TT TT TT CT CT TT TT TT TT TT NN TT TT TT NN TT TT NN CT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT TT TT CT TT TT CT TT NN CT TT TT TT TT NN TT NN TT TT TT TT TT TT TT TT TT CT TT TT TT TT TT TT CT TT TT TT CT TT CT CT TT TT TT TT NN TT TT TT TT TT TT TT TT TT CT TT TT TT TT NN TT CT TT TT TT CT CT TT TT TT CT CT TT TT CT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT NN TT CT TT CT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT CT TT TT TT TT CT TT TT TT CT TT TT TT TT CT TT TT TT CC TT TT TT TT TT TT
(...)

Opening the SQLIte Database

if(sqlite3_open(env->fileout,&(env->connection))!=SQLITE_OK)
{
fprintf(stderr,"Cannot open sqlite file %s.\n",env->fileout);
exit(EXIT_FAILURE);
}

Closing the SQLIte Database

sqlite3_close(env->connection);

Creating a table

if(sqlite3_exec(env->connection,
"create table markers(id INTEGER PRIMARY KEY,name varchar(20) unique,chrom varchar(10),position integer)",
NULL,NULL,&error
)!=SQLITE_OK)
{
fprintf(stderr,"Cannot create table markers: %s\n",error);
exit(EXIT_FAILURE);
}

Filling a prepared statement and inserting some data

if(sqlite3_prepare(env->connection,
"insert into markers(name,chrom,position) values(?,?,?)",
-1,&pstmt_insert_marker,NULL
)!=SQLITE_OK)
{
fprintf(stderr,"Cannot compile insert into markers.\n");
exit(EXIT_FAILURE);
}
(...)

if(sqlite3_bind_text(
pstmt_insert_marker,1,
name,
strlen(name),
NULL)!=SQLITE_OK)
{
fprintf(stderr,"Cannot bind markers's name.\n");
exit(EXIT_FAILURE);
}
(
if(sqlite3_bind_text(
pstmt_insert_marker,2,
chrom,
strlen(chrom),
NULL)!=SQLITE_OK)
{
fprintf(stderr,"Cannot bind markers's chrom.\n");
exit(EXIT_FAILURE);
}
if( sqlite3_bind_int(
pstmt_insert_marker,3,
genomic_position
)!=SQLITE_OK)
{
fprintf(stderr,"Cannot bind markers's position.\n");
exit(EXIT_FAILURE);
}

if (sqlite3_step(pstmt_insert_marker) != SQLITE_DONE) {
fprintf(stderr,"Could not step insert marker.\n");
exit(EXIT_FAILURE);
}

marker_id = sqlite3_last_insert_rowid(env->connection);

sqlite3_reset(pstmt_insert_marker);

Creating a custom function for sqlite

Here I implemented a custom function named homozygous it returns true if a genotype is a string of two identical characters (but not 'N').
void is_homozygous(
sqlite3_context* context, /* context */
int argc, /* number of arguments */
sqlite3_value** argv /* arguments */
)
{
/* one text arg */
if(argc==1 &&
sqlite3_value_type(argv[0])==SQLITE_TEXT)
{
const char *alleles=(const char*)sqlite3_value_text(argv[0]);
if( alleles!=NULL &&
strlen(alleles)==2 &&
alleles[0]==alleles[1] &&
alleles[0]!='N')
{
/* set result to TRUE */
sqlite3_result_int(context,1);
return;
}
}
/* set result to FALSE */
sqlite3_result_int(context,0);
}

Telling sqlite about the new custom function

sqlite3_create_function(env->connection,
"homozygous",/* function name */
1,/* number of args */
SQLITE_UTF8,/* encoding */
NULL,
is_homozygous,/* the implementation callback */
NULL,
NULL
);

Implementing a callback for a simple 'SELECT' query

The following callback prints the results to stdout.
static int callback_select_homozygous(
void* notUsed,
int argc,/* number of args */
char** argv,/* arguments as string */
char** columns /* labels for the columns */
)
{
int i;
/* prints all the columns to stdout*/
for(i=0; i<argc; i++)
{
printf( "\"%s\": \"%s\"\n",
columns[i],
argv[i]!=NULL? argv[i] : "NULL"
);
}
printf("\n");
return 0;
}

Executing a SELECT query

Here, we select all the homoygous genotypes:
if(sqlite3_exec(
env->connection, /* An open database */
"select individuals.name,"
"markers.name,"
"markers.chrom,"
"markers.position,"
"genotypes.alleles "
"from "
"genotypes,"
"markers,"
"individuals "
"where "
"markers.id=genotypes.marker_id and "
"individuals.id=genotypes.individual_id and "
"homozygous(genotypes.alleles)=1"
, /* SQL to be evaluated */
callback_select_homozygous, /* Callback function */
NULL, /* 1st argument to callback */
&errormsg /* Error msg written here */
)!=SQLITE_OK)
{
fprintf(stderr,"Cannot select.\n");
exit(EXIT_FAILURE);
}

The full source code



Excuting the C program

[LOG]inserted marker ID.1
[LOG]inserted marker ID.2
[LOG]inserted marker ID.3
[LOG]inserted marker ID.4
[LOG]inserted marker ID.5
(... WAIT !!! )
"name": "NA12878"
"name": "rs6603251"
"chrom": "chrY"
"position": "240580"
"alleles": "CC"

"name": "NA12890"
"name": "rs6603251"
"chrom": "chrY"
"position": "240580"
"alleles": "CC"

"name": "NA12892"
"name": "rs6603251"
"chrom": "chrY"
"position": "240580"
"alleles": "CC"


That's it

Pierre

04 April 2008

BerkeleyDB and Hapmap: My notebook.

I'm currently trying to find the best way to store some genotypes. For example I need to store 278.766.958 illumina genotypes (marker,individual, allele1, allele2) and mysql, even with indexes, is getting slow when I'm looking for the Mendelian incompatibilities. Deepak suggested via twitter to use href="http://hdf.ncsa.uiuc.edu/HDF5/">HDF5 but as far as I understand the documentation, HDF5 is "just" a smarter implementation of the C functions fseek/fread/fwrite.

I've been looking at the java implementation of the BerkeleyDB (BDB) just to watch its performance according to my needs. This engine can be used as en embedded database and doesn't need a database server running as a daemon (just like JavaDB). A BDB berkeley database contains records where each record contains a "key" and a "value" (a kind of two columns database, or a kind of C++ std::multimap). In the current post, I'll describe a program which 1) download some genotypes from HapMap 2) Find the pedigree of the samples 3) Loop over the markers (ordered on their position on the genome) and get the frequency 4) find Mendelian incompatibilities

The source code is available on pastie at:



First we need a few classes:

class Position holds a position on a chromosome
private static class Position
{
String chromosome;
int position;
Position(String chromosome,int position)
{
this.chromosome=chromosome;
this.position=position;
}
}


class Marker contains the informations about a snp :
private static class Marker
{
int rsId;
String alleles;
Position pos;
char strand;
}


class Individual contains the informations about an individual including his 'id' on corriel ( see http://cardiogenomics.med.harvard.edu)
private static class Individual
{
/**Coriell Repository Number*/
String name;
/** id in corriel */
String individualID=null;
/** father id */
String fatherID=null;
/** mother id */
String motherID=null;
}


As BDB is a collection of pairs(key,value) we need a class MarkerIndividual holding the pair(marker,individual) to store the genotypes with f(pair(marker,individual)=genotype.
private static class MarkerIndividual
{
private int rsId;
/**Coriell Repository Number*/
private String individualId;
}


At the end, we need a class Genotype to store two alleles:
private static class Genotype
{
char a1,a2;

public boolean same(char a1,char a2)
{
return (this.a1==a1 && this.a2==a2) ||
(this.a1==a2 && this.a2==a1)
;
}
}


BDB has several alternatives to read and write the java objects from/to the databases, this operation requires the object to be converted into an array of bytes: 1) the java Serialization can be used, 2) a TupleBinding can be implemented, this class must impements two functions which are used to encode/decode the object to/from an array of bytes. I've choosen to use this later option, and here is for example the TupleBinding implementation for the class Individual:

private TupleBinding individualTupleBinding=new TupleBinding()
{
@Override
public Object entryToObject(TupleInput input)
{
Individual indi= new Individual();
indi.name= input.readString();
indi.individualID= input.readString();
indi.fatherID=input.readString();
indi.motherID= input.readString();
return indi;
}

@Override
public void objectToEntry(Object object, TupleOutput output) {
Individual indi= Individual.class.cast(object);
output.writeString(indi.name);
output.writeString(indi.individualID);
output.writeString(indi.fatherID);
output.writeString(indi.motherID);
}
};


To open and create an Berkeley Environment the following code was written:
EnvironmentConfig envConf= new EnvironmentConfig();
envConf.setAllowCreate(!readOnly);
envConf.setReadOnly(readOnly);
envConf.setTransactional(false);
this.env= new Environment(
file,
envConf
);


Then, we open each database. We need 3 databases: markers, individuals and genotypes. Opening a Database looks like this:

DatabaseConfig dbConfig= new DatabaseConfig();
/* allow create if database does not exists */
dbConfig.setAllowCreate(!readOnly);
dbConfig.setReadOnly(readOnly);
Database db= this.env.openDatabase(null, "database-name", d
bConfig);


Althoug DBD is based on a pair(key,value) another component of the value could be searched and need to be indexed. This process is called a "Secondary Database". For example, with this project I created a secondary database:1) to find/loop over the markers using their position 2) to find/loop over the individuals using individualID instead of name. We need a class extending SecondaryKeyCreator
to extract this second key for the original value. For example here is the class extraction the Position from the Marker.
class MarkerPositionCreator implements SecondaryKeyCreator
{
public boolean createSecondaryKey(
SecondaryDatabase secDb,
DatabaseEntry keyEntry,
DatabaseEntry dataEntry,
DatabaseEntry resultEntry)
{
Marker marker= (Marker)Test01.this.markerTupleBinding.entryToObject(dataEntry);
Test01.this.positionTupleBinding.objectToEntry(marker.pos, resultEntry);
return true;
}
}

We also need to open those "secondary" databases
SecondaryConfig secondConfig= new SecondaryConfig();
secondConfig.setAllowCreate(!readOnly);
/* on open, if the secondary database is empty then the primary
database is read in its entirety and additions/modifications to the secondary's records occur automatically */
secondConfig.setAllowPopulate(true);
secondConfig.setSortedDuplicates(true);
MarkerPositionCreator posCreator = new MarkerPositionCreator();
/* Identifies the key creator object to be used for secondary key creation. */
secondConfig.setKeyCreator(posCreator);
positionDB = this.env.openSecondaryDatabase(null, "position
", this.markerDB,secondConfig);


OK, more genetic now. The HapMap genotypes are available here:http://www.hapmap.org/genotypes/latest/rs_strand/non-redundant/ (the path may change). A file looks like a table f(marker,individual)=genotype:
rs# SNPalleles chrom pos strand genome_build center protLSID assayLSID panelLSID QC_code NA18940 NA18942 NA189
rs28412942 A/T chrMT 410 + ncbi_B36 affymetrix GenomeWideSNP_6.02 Japanese:2 QC+ AA AA AA AA AA AA AA AA AA AA
rs3937039 A/G chrMT 665 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ AA AA AA AA AA AA AA AA AA AA AA
rs2853517 A/G chrMT 711 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ GG GG GG GG GG GG GG GG GG GG GG
rs28358568 C/T chrMT 712 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ TT TT TT TT TT TT TT TT TT TT TT
(...)


The file is processed as follow.

Pattern space=Pattern.compile("[ ]");
String HEADER[]=new String[]{"rs#","SNPalleles","chrom","pos","strand","genome_build","center","protLSID","assayLSID","panelLSID","QC_code"};
BufferedReader in= new BufferedReader(new InputStreamReader(new GZIPInputStream(url.openStream())));

String line= in.readLine();
if(line==null) throw new IOException("Empty file");
/* The first line of this file is the header*/
String header[]=space.split(line);
for(int i=0;i< HEADER.length;++i)
{
if(!header[i].equals(HEADER[i])) throw new IOException("Bad header "+header[i]+" expected "+HEADER[i]);
}
/* the header contains the name of the Individual which will be inserted. At this time we don't know what are the relationships between those individuals.*/
for(int i=HEADER.length;i< header.length;++i)
{
Individual individual= new Individual();
individual.name=header[i];
DatabaseEntry key= new DatabaseEntry(individual.name.getBytes());
DatabaseEntry data= new DatabaseEntry();
this.individualTupleBinding.objectToEntry(individual, data);
getIndividualDB().put(null
,key
,data
);
}
/** the following lines are the markers and the genotypes */
TupleBinding INT_BINDING=TupleBinding.getPrimitiveBinding(Integer.class);
while((line=in.readLine())!=null)
{
if(!line.startsWith("rs")) continue;
String tokens[]=space.split(line);
//System.err.println(line);
/* fill the information of this marker */
Marker marker= new Marker(Integer.parseInt( tokens[0].substring(2)));
marker.alleles= tokens[1];
marker.pos= new Position(tokens[2],Integer.parseInt(tokens[3]));
marker.strand= tokens[4].charAt(0);

DatabaseEntry key= new DatabaseEntry();
INT_BINDING.objectToEntry(marker.getRSId(), key);
DatabaseEntry data= new DatabaseEntry();
this.markerTupleBinding.objectToEntry(marker, data);
getMarkerDB().put(null
,key
,data
);
/** loop over this marker and get the genotypes */
for(int i=HEADER.length;i<header.length;++i)
{
if(tokens[i].length()!=2 || tokens[i].equals("NN")) continue;
/** create the KEY */
MarkerIndividual mi= new MarkerIndividual(marker.rsId,header[i]);
this.markerIndividualTupleBinding.objectToEntry(mi, key);
/** create the genotype */
this.genotypeTupleBinding.objectToEntry(
new Genotype(tokens[i].charAt(0),tokens[i].charAt(1)),
data
);
/** put the new pair( pair(marker,individual), genotype) in the BDB */
getGenotypeDB().put(null
,key
,data
);
}
}
in.close();


OK, I want to find the relationships between those individuals, this information is available here. For each "Coriell Repository Number" we find the individual in our database, if it exists we add the information and write the individual back to the database. (See function ">makePedigree line 466).

To retrieve the genotype g=f(marker,individual) I wrote the following simple utility function getGenotypeAt:

Genotype getGenotypeAt(Marker marker,Individual indi) throws DatabaseException
{
if(marker==null || indi==null) return null;
DatabaseEntry key=new DatabaseEntry();
DatabaseEntry value=new DatabaseEntry();
this.markerIndividualTupleBinding.objectToEntry(new MarkerIndividual(marker.rsId,indi.name),key);
if(this.genotypeDB.get(null, key, value, LockMode.DEFAULT)!=OperationStatus.SUCCESS) return null;
Genotype g= Genotype.class.cast(this.genotypeTupleBinding.entryToObject(value));
return g;
}



To get the frequencies of the alleles, we loop each over each marker (using a secondary database to get the markers ordered on the genome (not ordered on rs##)) and we get all the genotypes for each individual. To loop over a BDB an instance Cursor (looks like a java.util.Iterator) is used.
void frequencies() throws DatabaseException
{
SecondaryCursor cM= this.positionDB.openSecondaryCursor(null, null);
DatabaseEntry key=new DatabaseEntry();
DatabaseEntry value=new DatabaseEntry();
while(cM.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Marker m= Marker.class.cast(this.markerTupleBinding.entryToObject(value));
HashMap<Character, Integer> allele2count= new HashMap<Character, Integer>();
int totalGenotyped=0;
int totalFailures=0;
System.out.print("rs"+m.rsId+" "+m.alleles+" "+m.pos.chromosome+" "+m.pos.position+" "+m.strand);
Cursor cI= this.individualDB.openCursor(null, null);
while(cI.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Individual indi=Individual.class.cast(this.individualTupleBinding.entryToObject(value));
Genotype g= getGenotypeAt(m, indi);
if(g!=null)
{
totalGenotyped++;
for(int i=0;i< 2;++i)
{
char c= (i==0?g.a1:g.a2);
Integer count= allele2count.get(c);
if(count==null) count=0;
allele2count.put(c,count+1);
}
}
else
{
totalFailures++;
}
}
cI.close();
System.out.print(" genotyped:"+(int)(100.0*(totalGenotyped-totalFailures)/(float)totalGenotyped)+"%");
for(Character allele: allele2count.keySet())
{
System.out.print(" f("+allele+")="+allele2count.get(allele)/(2.0*totalGenotyped));
}
System.out.println();
}
cM.close();
}



Finding the Mendelian incompatibilities is much like the same: I sued here the brute force, we loop over each individual and over each marker. If the individual as any parent, we check that his genotype is compatible with them.
void incompats() throws DatabaseException
{
DatabaseEntry key=new DatabaseEntry();
DatabaseEntry value=new DatabaseEntry();

Cursor cI= this.individualDB.openCursor(null, null);
while(cI.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Individual indi=Individual.class.cast(this.individualTupleBinding.entryToObject(value));

if(indi.fatherID==null && indi.motherID==null)
{
continue;
}

Individual father= findIndividualByCorrielId(indi.fatherID);
Individual mother= findIndividualByCorrielId(indi.motherID);

Cursor cM= this.markerDB.openCursor(null, null);
while(cM.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Marker m= Marker.class.cast(this.markerTupleBinding.entryToObject(value));
Genotype gChild= getGenotypeAt(m, indi);
Genotype gFather= getGenotypeAt(m, father);
if(isIncompat(gChild,gFather))
{
System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
indi.individualID+" is "+gChild+" and his father "+indi.fatherID+" is "+
gFather
);
continue;
}
Genotype gMother= getGenotypeAt(m, mother);
if(isIncompat(gChild,gMother))
{
System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
indi.individualID+" is "+gChild+" and his mother "+indi.motherID+" is "+
gMother
);
continue;
}
if(isIncompat(gChild,gFather,gMother))
{
System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
indi.individualID+" is "+gChild+
" and his father "+indi.fatherID+" is "+ gFather+" and his mother "+indi.motherID+" is "+ gMother
);
}
}
cM.close();
}
cI.close();
}


That's it. I first test the chromosome 1 at http://www.hapmap.org/genotypes/latest/rs_strand/non-redundant/genotypes_chr1_CEU_r23a_nr.b36.txt.gz(11Mo) but I pressed Ctrl-C when the files reached 1.4Go !
I then used the chr22 file directly downloaded on my computer. The space required by BerkeleyDB to hold the database and the indexes was 721Mo whereas the zipped original source of data was 2Mo (25Mo unzipped)!!! (Arghhhhhhhhhhhh !!!!!).

  • Individuals count:=90

  • Markers count:=54786

  • Genotypes count:=4853237


Time required to load the hapmap file : 1174secs (20min)

rs9624968 A/G chr22 24783030 + genotyped:86% f(G)=0.879746835443038 f(A)=0.12025316455696203
rs9624969 C/T chr22 24784595 + genotyped:87% f(T)=0.075 f(C)=0.925
rs6004919 C/T chr22 24785216 + genotyped:100% f(T)=0.12777777777777777 f(C)=0.8722222222222222
rs4585127 A/G chr22 24785559 + genotyped:100% f(G)=0.8722222222222222 f(A)=0.12777777777777777
rs5752262 A/G chr22 24786367 + genotyped:95% f(G)=0.5116279069767442 f(A)=0.4883720930232558
rs16981296 C/G chr22 24787784 + genotyped:95% f(G)=0.8488372093023255 f(C)=0.1511627906976744
rs1003547 G/T chr22 24788134 + genotyped:86% f(T)=0.44936708860759494 f(G)=0.5506329113924051
rs9613094 A/G chr22 24788388 + genotyped:100% f(G)=0.2222222222222222 f(A)=0.7777777777777778
rs16986627 A/G chr22 24789298 + genotyped:88% f(G)=0.2654320987654321 f(A)=0.7345679012345679
(...)


Time required to generate the frequencies Time:955secs

Incompat: for rs133457(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs136009(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs394518(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs628437(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs731403(C/G) 1341M02 is GG and his mother 1341MM14 is CC
Incompat: for rs1122940(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs2845343(A/T) 1341M02 is AA and his mother 1341MM14 is TT
Incompat: for rs4822498(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs4822499(C/T) 1341M02 is TT and his father 1341MF13 is CC
Incompat: for rs4823195(A/G) 1341M02 is AA and his mother 1341MM14 is GG
Incompat: for rs4991267(C/T) 1341M02 is CC and his mother 1341MM14 is TT
Incompat: for rs5755047(A/T) 1341M02 is TT and his father 1341MF13 is AA
Incompat: for rs5755343(A/T) 1341M02 is AA and his mother 1341MM14 is TT
Incompat: for rs5755420(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765056(C/T) 1341M02 is CT and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765436(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765499(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs5768636(G/T) 1341M02 is GT and his father 1341MF13 is GG and his mother 1341MM14 is GG
Incompat: for rs5769710(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs5770600(A/C) 1341M02 is CC and his father 1341MF13 is AA
Incompat: for rs5997220(A/G) 1341M02 is AG and his father 1341MF13 is GG and his mother 1341MM14 is GG
(...)
764 errors


Time required to find the Mendelian incompatibilities: 11021secs (~3H00)

When the program closed, the database was compacted down to 710Mo.

Conclusion: Too slow, some huges files generated. Definitely a bad choice to handle this kind of data.