Showing posts with label flex. Show all posts
Showing posts with label flex. Show all posts

13 July 2012

Parsing the Newick format in C using flex and bison.

The following post is my answer for this question on biostar "Newick 2 Json converter".
The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file .
The original question asked for a parser based on perl but here, I've implemented a C parser using flex/bison.


Example:

((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);

A formal grammar for the Newick format is available here
Items in { } may appear zero or more times.
   Items in [ ] are optional, they may appear once or not at all.
   All other punctuation marks (colon, semicolon, parentheses, comma and
         single quote) are required parts of the format.


              tree ==> descendant_list [ root_label ] [ : branch_length ] ;

   descendant_list ==> ( subtree { , subtree } )

           subtree ==> descendant_list [internal_node_label] [: branch_length]
                   ==> leaf_label [: branch_length]

            root_label ==> label
   internal_node_label ==> label
            leaf_label ==> label

                 label ==> unquoted_label
                       ==> quoted_label

        unquoted_label ==> string_of_printing_characters
          quoted_label ==> ' string_of_printing_characters '

         branch_length ==> signed_number
                       ==> unsigned_number

The Flex Lexer

The Flex Lexer is used to extract the terminal tokens of the grammar from the input stream.
Those terminals are '(' ')' ',' ';' ':' , strings and numbers. For the simple and double quoted strings, we tell the lexer to enter in a specific state ( 'apos' and 'quot').

The Bison Scanner

The Bison scanner reads the tokens returned by Flex and implements the grammar.
The simple structure holding the tree is defined in 'struct tree_t'. The code also contains some methods to dump the tree as JSON.

Makefile


Testing

compile:
$ make
bison -d newick.y
flex newick.l
gcc -Wall -O3 newick.tab.c lex.yy.c
lex.yy.c:1265:17: warning: ‘yyunput’ defined but not used [-Wunused-function]
lex.yy.c:1306:16: warning: ‘input’ defined but not used [-Wunused-function]
test:
echo "((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);" | ./a.out

{
    "children": [
        {
            "length": 0.1,
            "children": [
                {
                    "label": "Human",
                    "length": 0.3
                },
                {
                    "label": "Chimpanzee",
                    "length": 0.2
                }
            ]
        },
        {
            "label": "Gorilla",
            "length": 0.3
        },
        {
            "length": 0.2,
            "children": [
                {
                    "label": "Mouse",
                    "length": 0.6
                },
                {
                    "label": "Rat",
                    "length": 0.5
                }
            ]
        }
    ]
}


That's it,

Pierre

22 February 2011

A flex scanner extracting the metadata from a PDF file.

4 years ago, I played with the adobe XMP library to extract the XMP metadata from a set of PDF files.

Today, I was suprised to simply display the XMP data contained in a PDF from Nature by using the following command line:



I've generalized this process by implementing a GNU-flex scanner that prints the XML content between two <xmp:xmpmeta/> tags. The source code is available on github at: https://github.com/lindenb/ccsandbox/blob/master/src/xmpextractor.l.

Compilation

flex -f -B --read xmpextractor.l
gcc -o xmpextractor lex.yy.c

Testing with "PLOS"

Harper MA, Chen Z, Toy T, Machado IMP, Nelson SF, et al. (2011) Phenotype Sequencing: Identifying the Genes That Cause a Phenotype Directly from Pooled Sequencing of Independent Mutants. PLoS ONE 6(2): e16517. doi:10.1371/journal.pone.0016517:

curl -s "http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0016517&representation=PDF" |\
./xmpextractor

<?xml version="1.0" encoding="UTF-8"?>
<XMP>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP toolkit 2.9.1-13, framework 1.6">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:iX="http://ns.adobe.com/iX/1.0/">
<rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" pdf:Producer="Acrobat Distiller 7.0 (Windows)"/>
<rdf:Description xmlns:xap="http://ns.adobe.com/xap/1.0/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" xap:CreateDate="2011-02-14T08:21:58+08:00" xap:CreatorTool="3B2 Total Publishing System 7.51n/W" xap:ModifyDate="2011-02-17T14:32:08+08:00" xap:MetadataDate="2011-02-17T14:32:08+08:00"/>
<rdf:Description xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" xapMM:DocumentID="uuid:f98295b5-0980-42f2-884e-6cecd2d75c90" xapMM:InstanceID="uuid:d226393a-bfaf-4a2b-8c28-08215008694e"/>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" dc:format="application/pdf">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">pone.0016517 1..16</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>

Hum, nothing really interesting here.

Testing with "Nature Reviews Cardiology"

A test with Percutaneous coronary intervention in the elderly. Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184
curl -s "http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" |\
./xmpextractor


<?xml version="1.0" encoding="UTF-8"?>
<XMP><x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:identifier>doi:10.1038/nrcardio.2010.184</dc:identifier>
<dc:creator>
<rdf:Seq>
<rdf:li>Tracy Y. Wang</rdf:li>
<rdf:li>Antonio Gutierrez</rdf:li>
<rdf:li>Eric D. Peterson</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Alt>
</dc:description>
<dc:publisher>
<rdf:Bag>
<rdf:li>Nature Publishing Group</rdf:li>
</rdf:Bag>
</dc:publisher>
<dc:rights>
<rdf:Alt>
<rdf:li xml:lang="x-default">&#xA; © 2010 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.</rdf:li>
</rdf:Alt>
</dc:rights>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Percutaneous coronary intervention in the elderly</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Adobe PDF Library 8.0</pdf:Producer>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
<prism:copyright>© 2010 Nature Publishing Group</prism:copyright>
<prism:doi>10.1038/nrcardio.2010.184</prism:doi>
<prism:eIssn>1759-5010</prism:eIssn>
<prism:endingPage>90</prism:endingPage>
<prism:issn>1759-5002</prism:issn>
<prism:number>2</prism:number>
<prism:publicationName>Nature Publishing Group</prism:publicationName>
<prism:rightsAgent>[email protected]</prism:rightsAgent>
<prism:startingPage>79</prism:startingPage>
<prism:volume>8</prism:volume>
<prism:publicationDate>
<rdf:Bag>
<rdf:li>2010-12-07</rdf:li>
</rdf:Bag>
</prism:publicationDate>
<prism:url>
<rdf:Bag>
<rdf:li>http://dx.doi.org/10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</prism:url>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:CreateDate>2011-01-10T10:09:23+05:30</xmp:CreateDate>
<xmp:CreatorTool/>
<xmp:Label>Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</xmp:Label>
<xmp:MetadataDate>2011-01-14T18:25:45+05:30</xmp:MetadataDate>
<xmp:ModifyDate>2011-01-14T18:25:45+05:30</xmp:ModifyDate>
<xmp:Identifier>
<rdf:Bag>
<rdf:li>doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</xmp:Identifier>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/">
<xmpRights:Marked>True</xmpRights:Marked>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:51421664-c9c6-4657-9fbe-318ac969ca26</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:c65fa1fb-b6f3-4a61-8615-07b04871749e</xmpMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>

That's more interesting isn't it ?

That's it,

Pierre

14 December 2009

Parsing a genetic code using flex and bison. Part 2/2

(this post was inspired from this tutorial: "Writing a Reentrant Parser with Flex and Bison").
In the previous post I've shown how to parse a NCBI genetic code in ASN.1 using flex and bison. However the generated code is non reentrant that is to say that some globals variables prevent it to be used in a concurrent environment.

If we want to use this code in a multithread environment, some changes must be added: first we need a new structure GCState where the state of the lexer will be saved between each time bison requests for a new token:
typedef struct gcState
{
/** reference to the flex lexer */
void* handle_to_the_scanner;
}GCStateStruct,*GCState;
Your browser does not support the <CANVAS> element !

The Lexer

In the lexer some options must be added: we tell Flex to create a reentrant parser and that the yylex function generated by flex takes an extra argument yylval
%option bison-bridge
%option reentrant
Prior to be used, this context 'yyscan_t' has to be initialized:
void gcParse(FILE* in)
{
GCStateStruct ctx;
memset(&ctx,0,sizeof(GCStateStruct));
yyscan_t scanner;//structure created by flex

if(yylex_init_extra(&ctx,&scanner)!=0)//initialize the scanner
{
return;
}
ctx.handle_to_the_scanner=&scanner;//will tell bison where is the lexer
yyset_in (in,scanner );//change the input stream

gc_parse(ctx);//generated by bison
yylex_destroy(scanner);
}
.The for each method in yylex, a reference to the state in GCState is added:
[0-9]+ {yyget_lval(yyscanner)->integer=atoi(yyget_text(yyscanner)) ;return INTEGER;}
\"[^\"]+\" {
yyget_lval(yyscanner)->s= malloc(yyget_leng(yyscanner)-1);
if(yyget_lval(yyscanner)->s==NULL) exit(EXIT_FAILURE);
strncpy(yyget_lval(yyscanner)->s,&yytext[1],yyget_leng(yyscanner)-2);
return TEXT;

The Parser

We tell bison that we want a reentrant parser
%pure-parser
, each time the lexer will be asked for a new token, we want to pass an extra GCState parameter:
%parse-param{GCState state}
. We also tell bison that the yylex function generated by flex takes an extra parameter hoding the state of the scanner:
%lex-param{void* handle_to_the_scanner}
.A #define is added to tell bison how to extract the handle to the scanner from the GCState:
#define handle_to_the_scanner (state->handle_to_the_scanner)

All in one

Here is the lexer gc.l:
%option prefix="gc_"
%option noyywrap
%option bison-bridge
%option reentrant

%{
#include <stdio.h>
#include <stdlib.h>
#include "gc.h"
#include "gc.tab.h" //generated by bison
%}

%%
\-\-.*\n ;/* ignore comments */

Genetic\-code\-table return GENETIC_CODE_TABLE;
name return NAME;
id return ID;
ncbieaa return NCBIEAA;
sncbieaa return SNCBIEAA;
[0-9]+ {yyget_lval(yyscanner)->integer=atoi(yyget_text(yyscanner)) ;return INTEGER;}
\:\:= return ASSIGN;
, return COMMA;
\{ return OPEN;
\} return CLOSE;
\"[^\"]+\" {
yyget_lval(yyscanner)->s= malloc(yyget_leng(yyscanner)-1);
if(yyget_lval(yyscanner)->s==NULL) exit(EXIT_FAILURE);
strncpy(yyget_lval(yyscanner)->s,&yytext[1],yyget_leng(yyscanner)-2);
return TEXT;
}
[\n\t ] ;//ignore blanks
. return yytext[0];

%%

void gcParse(FILE* in)
{
GCStateStruct ctx;
memset(&ctx,0,sizeof(GCStateStruct));
yyscan_t scanner;

if(yylex_init_extra(&ctx,&scanner)!=0)
{
return;
}
ctx.handle_to_the_scanner=&scanner;
yyset_in (in,scanner );
yyset_debug(1,scanner);
gc_parse(ctx);
yylex_destroy(scanner);
}
Here is the parser gc.y:

%{
#include <stdio.h>
#include <stdlib.h>
#include "gc.h"
%}

%union {
char* s;
int integer;
GeneticCode gCode;
}

%pure-parser
%name-prefix="gc_"
%defines
%error-verbose
%parse-param{GCState state}
%lex-param{void* handle_to_the_scanner}

%token GENETIC_CODE_TABLE NAME OPEN CLOSE COMMA ASSIGN NCBIEAA SNCBIEAA ID
%token<integer> INTEGER
%token<s> TEXT

%type<gCode> code
%type<gCode> codes
%type<s> optional_name
%start input

%{
void yyerror(GCState state,const char* message)
{
fprintf(stderr,"ERROR:%s\n",message);
}
#define handle_to_the_scanner (state->handle_to_the_scanner)

%}

%%

input: GENETIC_CODE_TABLE ASSIGN OPEN codes CLOSE
{
GeneticCode gc=$4;
fputs("Genetic-code-table ::= {",stdout);
while(gc!=NULL)
{
printf("{name \"%s\",", gc->name1);
if(gc->name2!=NULL) printf("name \"%s\",", gc->name2);
printf("id %d,ncbieaa \"%s\",sncbieaa \"%s\"}",
gc->id, gc->ncbieaa, gc->sncbieaa
);
if(gc->next!=NULL) fputc(',',stdout);
gc=gc->next;
}
fputc('}',stdout);
/** free memory here.... */
};

codes: code
{
$$=$1;
}
| codes COMMA code
{
$$=$3;
$$->next=$1;
}
;


code: OPEN
NAME TEXT COMMA
optional_name
ID INTEGER COMMA
NCBIEAA TEXT COMMA
SNCBIEAA TEXT
CLOSE
{
$$ = malloc(sizeof(GeneticCodeStruct));
if($$==NULL) exit(EXIT_FAILURE);
$$->name1=$3;
$$->name2=$5;
$$->id=$7;
$$->ncbieaa=$10;
$$->sncbieaa=$13;
}
;
optional_name:/* nothing */
{
$$=NULL;
}
| NAME TEXT COMMA
{
$$=$2;
}
%%

extern void gcParse(FILE* in);

int main(int argc,char** argv)
{
gcParse(stdin);
return 0;
}
and the file gc.h
#ifndef GENETIC_CODE_H
#define GENETIC_CODE_H

typedef struct geneticCode
{
char* name1;
char* name2;
int id;
char* ncbieaa;
char* sncbieaa;
struct geneticCode *next;
}GeneticCodeStruct,*GeneticCode;


typedef struct gcState
{
/** reference to the flex lexer */
void* handle_to_the_scanner;
}GCStateStruct,*GCState;

#endif
Compiling...
flex gc.l
bison gc.y
gcc gc.tab.c lex.gc_.c
Testing:
cat gc.ptr |./a.out | ./a.out |./a.out |./a.out

Genetic-code-table ::= {{name "Standard",name "SGC0",id 1,ncbieaa "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",sncbieaa "---M---------------M---------------M----------------------------"},
{name "Vertebrate Mitochondrial",name "SGC1",id 2,ncbieaa "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",sncbieaa "--------------------------------MMMM---------------M------------"},
(...)
,{name "Thraustochytrium Mitochondrial",id 23,ncbieaa "FF*LSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",sncbieaa "--------------------------------M--M---------------M------------"}}

That's it.
Pierre


Parsing a genetic code using flex and bison. Part 1/2

In this post I'll show how to write a lexer and a parser using Flex and Bison for parsing the genetic codes at NCBI. I'll write a non-reentrant parser that is to say that the generated code contains some global variables and cannot be safely executed concurrently. The input file is an ASN1 file that looks like this:
--*************************************************************************
-- This is the NCBI genetic code table
-- (...)
--**************************************************************************
Genetic-code-table ::= {
{
name "Standard" ,
name "SGC0" ,
id 1 ,
ncbieaa "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "---M---------------M---------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
(...)
} ,
{
name "Thraustochytrium Mitochondrial" ,
id 23 ,
ncbieaa "FF*LSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "--------------------------------M--M---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
}
}
Each genetic code will be stored as a linked list of C structures called GeneticCode:
file gc.h:
#ifndef GENETIC_CODE_H
#define GENETIC_CODE_H
typedef struct geneticCode
{
char* name1;
char* name2;
int id;
char* ncbieaa;
char* sncbieaa;
struct geneticCode *next;
}GeneticCodeStruct,*GeneticCode;
#endif
.The LEXER generated by flex converts the input into a sequence of tokens, and the parser generated by bison converts a grammar description into a C program.

Your browser does not support the <CANVAS> element !



The Parser

The grammar of Bison says that the lexical context might carry a string (the code ID) , a string (ncbieaa,ncbieaa...) or a GeneticCode:
%union {
char* s;
int integer;
GeneticCode gCode;
}
(...)
%token<integer> INTEGER
%token<s> TEXT
A genetic code is a series of tokens consisting in one or two names, the id, the ncbiaa and the sncbiaa strings. Every time this context is found, a new GeneticCode is allocated and its fields are filled:

%type<gCode> code
(...)
code: OPEN
NAME TEXT COMMA
optional_name
ID INTEGER COMMA
NCBIEAA TEXT COMMA
SNCBIEAA TEXT
CLOSE
{
$$ = malloc(sizeof(GeneticCodeStruct));
if($$==NULL) exit(EXIT_FAILURE);
$$->name1=$3;
$$->name2=$5;
$$->id=$7;
$$->ncbieaa=$10;
$$->sncbieaa=$13;
}
;
An optional name is 'nothing' or the following tokens: 'NAME TEXT COMMA'
optional_name:/* nothing */
{
$$=NULL;
}
| NAME TEXT COMMA
{
$$=$2;
}

A series of genetic codes is one genetic code, or a series of genetic codes followed by one genetic code:
codes: code
{
$$=$1;
}
| codes COMMA code
{
$$=$3;
$$->next=$1;
}
;
At last, the input constists in the tokens 'GENETIC_CODE_TABLE ASSIGN OPEN' and 'CLOSE' , surrounding a series of genetic codes. Each time, an input has been parsed, the linked list is scanned so each genetic code is printed to stdout.
input: GENETIC_CODE_TABLE ASSIGN OPEN codes CLOSE
{
GeneticCode gc=$4;
fputs("Genetic-code-table ::= {",stdout);
while(gc!=NULL)
{
printf("{name \"%s\",", gc->name1);
if(gc->name2!=NULL) printf("name \"%s\",", gc->name2);
printf("id %d,ncbieaa \"%s\",sncbieaa \"%s\"}",
gc->id, gc->ncbieaa, gc->sncbieaa
);
if(gc->next!=NULL) fputc(',',stdout);
gc=gc->next;
}
fputc('}',stdout);
/** free memory here.... */
};

The Lexer

The lexers breaks the input into tokens. In this lexer, comments are ignored:
\-\-.*\n ;/* ignore comments */
... keywords are returned
Genetic\-code\-table return GENETIC_CODE_TABLE;
name return NAME;
id return ID;
ncbieaa return NCBIEAA;
sncbieaa return SNCBIEAA;
\:\:= return ASSIGN;
, return COMMA;
\{ return OPEN;
\} return CLOSE;
... and the tokens carrying a context (integers, strings...) are decoded:
[0-9]+ {gc_lval.integer=atoi(yytext) ;return INTEGER;}
\"[^\"]+\" {
//copy the string without the quotes
gc_lval.s= malloc(yyleng-1);
if(gc_lval.s==NULL) exit(EXIT_FAILURE);
strncpy(gc_lval.s,&yytext[1],yyleng-2);
return TEXT;
}

All in one here is the flex file: gc.l
%option prefix="gc_"
%option noyywrap


%{
#include <stdio.h>
#include <stdlib.h>
#include "gc.h"
#include "gc.tab.h" //generated by bison
%}

%%
\-\-.*\n ;/* ignore comments */

Genetic\-code\-table return GENETIC_CODE_TABLE;
name return NAME;
id return ID;
ncbieaa return NCBIEAA;
sncbieaa return SNCBIEAA;
[0-9]+ {gc_lval.integer=atoi(yytext) ;return INTEGER;}
\:\:= return ASSIGN;
, return COMMA;
\{ return OPEN;
\} return CLOSE;
\"[^\"]+\" {
gc_lval.s= malloc(yyleng-1);
if(gc_lval.s==NULL) exit(EXIT_FAILURE);
strncpy(gc_lval.s,&yytext[1],yyleng-2);
return TEXT;
}
[\n\t ] ;//ignore blanks
. return yytext[0];

%%
... the bison file gc.y:
%{
#include <stdio.h>
#include <stdlib.h>
#include "gc.h"
%}

%union {
char* s;
int integer;
GeneticCode gCode;
}

%name-prefix="gc_"
%defines
%error-verbose

%token GENETIC_CODE_TABLE NAME OPEN CLOSE COMMA ASSIGN NCBIEAA SNCBIEAA ID
%token<integer> INTEGER
%token<s> TEXT

%type<gCode> code
%type<gCode> codes
%type<s> optional_name
%start input

%{
void yyerror(const char* message)
{
fprintf(stderr,"ERROR:%s\n",message);
}

%}

%%

input: GENETIC_CODE_TABLE ASSIGN OPEN codes CLOSE
{
GeneticCode gc=$4;
fputs("Genetic-code-table ::= {",stdout);
while(gc!=NULL)
{
printf("{name \"%s\",", gc->name1);
if(gc->name2!=NULL) printf("name \"%s\",", gc->name2);
printf("id %d,ncbieaa \"%s\",sncbieaa \"%s\"}",
gc->id, gc->ncbieaa, gc->sncbieaa
);
if(gc->next!=NULL) fputc(',',stdout);
gc=gc->next;
}
fputc('}',stdout);
/** free memory here.... */
};

codes: code
{
$$=$1;
}
| codes COMMA code
{
$$=$3;
$$->next=$1;
}
;


code: OPEN
NAME TEXT COMMA
optional_name
ID INTEGER COMMA
NCBIEAA TEXT COMMA
SNCBIEAA TEXT
CLOSE
{
$$ = malloc(sizeof(GeneticCodeStruct));
if($$==NULL) exit(EXIT_FAILURE);
$$->name1=$3;
$$->name2=$5;
$$->id=$7;
$$->ncbieaa=$10;
$$->sncbieaa=$13;
}
;
optional_name:/* nothing */
{
$$=NULL;
}
| NAME TEXT COMMA
{
$$=$2;
}
%%

int main(int argc,char** argv)
{
yyparse();
return 0;
}
Compiling:
flex gc.l
bison gc.y
gcc -o a.out gc.tab.c lex.gc_.c
Testing:
cat gc.ptr |./a.out | ./a.out |./a.out |./a.out

Genetic-code-table ::= {{name "Standard",name "SGC0",id 1,ncbieaa "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",sncbieaa "---M---------------M---------------M----------------------------"},
{name "Vertebrate Mitochondrial",name "SGC1",id 2,ncbieaa "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",sncbieaa "--------------------------------MMMM---------------M------------"},
(...)
,{name "Thraustochytrium Mitochondrial",id 23,ncbieaa "FF*LSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",sncbieaa "--------------------------------M--M---------------M------------"}}

That's it.
Pierre

In the next post I'll write a reentrant parser using flex and bison.

17 December 2008

Validating JSON with lex & yac

In a recent post on Twitter, Chris Lasher/agbiotec said:
I did a quick Google for "JSON schema" and "JSON validation"; looks like there's nothing in plac e yet like XML schema..
I suggested that lex/yacc could be used to create a trivial tool for this kind of validation. Here is an example.
Say you have a linkage file expressed as JSON. This file contains some information about a set of genetic markers, a set of samples and some genotypes.

{
"markers":[
{
"id":1,
"name":"rs1",
"chrom":"chr1",
"position":1
},
{
"id":2,
"name":"rs2",
"chrom":"chr1",
"position":2
},
{
"id":3,
"name":"rs3",
"chrom":"chr1",
"position":3
}
],
"samples":[
{
"id":1,
"name":"Individual1",
"father-id":2,
"mother-id":3,
"illness":true
},
{
"id":2,
"name":"Individual2",
"father-id":0,
"mother-id":0,
"illness":false
},
{
"id":3,
"name":"Individual3",
"father-id":0,
"mother-id":0,
"illness":true
}
],
"genotypes" : [

{
"sample":1,
"marker":2,
"allele-1":"A",
"allele-2":"T"
},
{
"sample":2,
"marker":2,
"allele-1":"A",
"allele-2":"T"
}

]
}


If we want to validate this file with lex/yacc or flex/bison. We need:

  • A Scanner generated by bison. This scanner contains the grammatical rules.

  • A Lexer generated by flex. This lexer transforms the input into a set of semantic tokens



sample.y: the Scanner


%{
#include <stdio.h>
int yywrap() { return 1;}
void yyerror(const char* s) {fprintf(stderr,"Error:%s.\n",s);}
%}


%token ALLELE_1 ALLELE_2 CHROM FATHER_ID GENOTYPES ID ILLNESS MARKER MARKERS MOTHER_ID NAME POSITION SAMPLE SAMPLES
%token BOOLEAN INTEGER STRING
%start linkage
%%

linkage: '{' markers ',' samples ',' genotypes '}' ;

markers: MARKERS ':' '[' marker_list ']';
marker_list: marker | marker_list ',' marker ;
marker: '{'
ID ':' INTEGER ','
NAME ':' STRING ','
CHROM ':' STRING ','
POSITION ':' INTEGER
'}';



samples: SAMPLES ':' '[' sample_list ']';
sample_list: sample | sample_list ',' sample ;
sample: '{'
ID ':' INTEGER ','
NAME ':' STRING ','
FATHER_ID ':' INTEGER ','
MOTHER_ID ':' INTEGER ','
ILLNESS ':' BOOLEAN
'}';


genotypes: GENOTYPES ':' '[' genotype_list ']';
genotype_list: genotype | genotype_list ',' genotype;
genotype: '{'
SAMPLE ':' INTEGER ','
MARKER ':' INTEGER ','
ALLELE_1 ':' STRING ','
ALLELE_2 ':' STRING
'}';
%%

int main(int argc,char** argv)
{
yyparse();
}
In this file the %token declares the keywords that will be accepted by the scanner. This grammer %starts with the linkage rule. This rule starst with a parenthesis followed by a 'markers' rule, followed by a comma, followed by a 'samples' rule, followed by a comma, followed by a 'genotypes' rule, followed by a parenthesis.
The 'markers' rule says that this rule is a 'marker_list' into a pair of brackets.
A marker_list is a marker or a marker_list (recursive rule) followed by a comma and another marker.
A marker is a set of JSON key/value. (Here, for simplicity, I expect that all the fields will appear in a given order)
etc...

To convert this file into a C source and a C header:
bison -d sample.y

sample.l: the Lexer


Here the lexer is basically a set of ordered regular expressions that will return a semantic identifier (e.g.BOOLEAN, INTEGER, MARKERS,...) about the tokens found in the input. Those identifiers were declared in a C header by the scanner.
%{
#include <stdio.h>
#include "sample.tab.h"/* generated by the scanner */
%}
%%
"\"allele-1\"" return ALLELE_1;
"\"allele-2\"" return ALLELE_2;
"\"chrom\"" return CHROM;
"\"father-id\"" return FATHER_ID;
"\"mother-id\"" return MOTHER_ID;
"\"id\"" return ID;
"\"markers\"" return MARKERS;
"\"marker\"" return MARKER;
"\"illness\"" return ILLNESS;
"\"genotypes\"" return GENOTYPES;
"\"name\"" return NAME;
"\"position\"" return POSITION;
"\"samples\"" return SAMPLES;
"\"sample\"" return SAMPLE;
true return BOOLEAN;
false return BOOLEAN;
[0-9]+ return INTEGER;
\"[^\"]*\" return STRING;/* a very simple string without escapes... */
[ \n\t\r] ;/* ignore */
. return yytext[0];
%%

To convert this file into a C source :
flex sample.l

Compilation


gcc -o validate sample.tab.c lex.yy.c

Testing


cat sample.json | ./validate
echo "Hello"| ./validate
Error: Syntax error


That's it

Pierre

17 May 2007

FLEX, Flash and Bioinformatics: episode II

Looking for resources about flash/flex and biofinformatics I found this nice (action)script example querying pubmed:

www: http://codelog.voisen.org
Here is the flash application
Here is the the source code

Pierre

08 May 2007

Hello World ! My first FLEX2 Application.

Adobe Flex 2 software is a rich Internet application framework based on Adobe Flash that will enable you to productively create beautiful, scalable applications that can reach virtually anyone on any platform.

The SDK was downloaded and extracted from : http://download.macromedia.com/pub/flex/sdk/flex_sdk_2.zip.

File:first.mxml

<?xml version="1.0" encoding="utf-8"?>
<mx:Application
xmlns:mx="http://www.adobe.com/2006/mxml"
horizontalAlign="center" verticalAlign="center">
<mx:Button id="myButton" label="Hello Bioinformatics World !" />
</mx:Application>


The file was compiled to flash using the mxmlc command (a java application):
>FLEX/SDK/bin/mxmlc --strict=true --file-specs first.mxml
Loading configuration file FLEX/SDK/frameworks/flex-config.xml
FLEX/test/first.swf (125402 bytes)
> ls
first.mxml first.swf


I then opened first.swf in firefox and...
My First FLEX2 app

Whaaaa ! It worked without any problem !!

I'M THE KING OF THE WORLD !!


This looks exciting ! I need a good tutorial ! I need a good tutorial ! I need ....


Pierre :-)