Datalift: A Catalyser for the Web of Data


                    François Scharffe
                    LIRMM/CNRS/University of Montpellier
                       francois.scharffe@lirmm.fr
                       @lechatpito




With the help of the Datalift team
And the support of the French National Research Agency



                FOSDEM 5/02/2011
The data revolution is on its way !

     As Open Data meets the Semantic Web
The promises of linked data
Richer Applications




Linked Data Lite | the Web on Steroids 1.0 (iPhone)
Richer applications




    BBC Programmes
More precise search and QA
Making your data 5 stars




http://www.w3.org/DesignIssues/LinkedData.html
So, how to lift data?
    How to publish data on the Web as linked data?
●   Basic principles: Tim Berners-Lee [2006] (Design Issues)
       –   Use URIs to identify things (not only documents)
       –   Use HTTP URIs
       –   When dereferencing URIs, return a description of the
           resource
       –   Include links to other resources on the Web
Welcome aboard the data lift
[Diagram: the data lift, from top floor to ground floor]
    Published and interlinked data on the Web
        Applications
        Interconnection
        Publication infrastructure
        Data conversion
        Vocabulary selection
    Raw data
Datalift


Datasets publication
R&D to automate the publication process
Tool suite to help publish data
Training, tutorials, data publication camps
1st floor - Selection
SemWebPro 18/01/2011
My friends' vocabularies …


Ø What is a (good) vocabulary for linked data?
    § Usability criteria
            Simplicity, visibility, sustainability, integration, coherence …

Ø Different types of vocabularies
    § Metadata, reference, domain, general …
    § The pillars of Linked Data: Dublin Core, FOAF, SKOS
Ø Good and less good practices
    § Ex: BBC Programmes vs legislation.gov.uk
    § Vocabulary of a Friend: networked vocabularies
Ø Linguistic problems
    § 99% of existing vocabularies are in English
    § Terminological approach: which vocabularies for « Event », « Organization »?
Did you say « vocabulary »?


… And why not « ontology »?
    § Or « schema » or « metadata schema »?
    § Or « model » (of data? of the world?)
Ø All these terms are used and justifiable
They are all « vocabularies »
    § They define types of objects (or classes)
      and the properties (or attributes) attached to these objects.
    § Types and attributes are logically defined
      and named using natural language
    § A (semantic) vocabulary
      is an explicit formalization
      of concepts existing in natural language

Vocabularies for linked data


Ø Are meant to describe resources in RDF
Ø Are based on one of the standard W3C languages
  § RDF Schema (RDFS)
     • For vocabularies without too much logical complexity
  § OWL
     • For more complex ontological constructs
  § These two languages are (almost) compatible
Ø They can be composed « ad libitum »
  § One can reuse just a few elements of a vocabulary
  § The original semantics have to be respected
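To illustrate composing vocabularies, here is a minimal Python sketch that describes one dataset using a few terms from Dublin Core and a few from FOAF, then prints the result as N-Triples. The dataset and person URIs under example.org, and the helper names, are hypothetical and only serve the illustration.

```python
# Namespaces of two widely reused vocabularies.
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def iri(value):
    """Format a URI as an N-Triples IRI term."""
    return f"<{value}>"


def literal(value):
    """Format a string as a plain N-Triples literal (escaping \\ and ")."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe_dataset(dataset_uri, title, creator_uri, creator_name):
    """Mix Dublin Core terms (title, creator) with FOAF terms (Person, name)."""
    return [
        (iri(dataset_uri), iri(DCTERMS + "title"), literal(title)),
        (iri(dataset_uri), iri(DCTERMS + "creator"), iri(creator_uri)),
        (iri(creator_uri), iri(RDF + "type"), iri(FOAF + "Person")),
        (iri(creator_uri), iri(FOAF + "name"), literal(creator_name)),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    # Hypothetical URIs, for illustration only.
    print(to_ntriples(describe_dataset(
        "http://example.org/dataset/1", "A sample dataset",
        "http://example.org/people/alice", "Alice")))
```

Only four terms are borrowed here; nothing requires importing either vocabulary as a whole.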
What makes a good vocabulary ?


Ø A good vocabulary is a used vocabulary
   § Data published on CKAN gives an idea of vocabulary usage
   § Example: list of datasets using FOAF http://xmlns.com/foaf/0.1/
Ø Other usability criteria
   § Simplicity and readability in natural language
   § Documentation of elements (definition in natural language)
   § Visibility and sustainability of the publication
   § Flexibility and extensibility
   § Semantic integration (with other vocabularies)
   § Social integration (with the user community)
A vocabulary is also a community


Ø Bad (but common) practice
  § Building a lonely vocabulary
     • For example as a research project
     • Without basing it on any existing vocabulary
  § Publishing it (or not) and then forgetting about it
  § Not caring about its users
Ø A good vocabulary has an organic life
  § Users and use cases
  § Revisions and extensions
  § Like a « natural » vocabulary
Types of vocabularies


Ø Metadata vocabularies
   § Used to annotate other vocabularies
       • Dublin Core, Vann, cc REL, Status
Ø Reference vocabularies
   § Provide « common » classes and properties
       • FOAF, Event, Time, Org Ontology
Ø Domain vocabularies
   § Specific to a domain of knowledge
       • Geonames, Music Ontology, WildLife Ontology
Ø « General » vocabularies
   § Describe « everything » at an arbitrary detail level
       • DBpedia Ontology, Cyc Ontology, SUMO
Vocabulary of a Friend


Ø http://www.mondeca.com/foaf/voaf
Ø A simple vocabulary...
Ø To represent interconnections between vocabularies
Ø A single entry point to the vocabularies and datasets of
  the Linked Data Cloud
Ø Ongoing work in Datalift
2nd floor - Conversion
URL design and URL patterns


Ø Good practices for linked data
  § Resource: http://dbpedia.org/resource/Paris
  § Document: http://dbpedia.org/page/Paris
  § Data: http://dbpedia.org/data/Paris
Ø … served using content negotiation
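As a rough illustration of the resource / document / data pattern, the Python sketch below picks the URL a server could redirect to, based on the client's Accept header. It only mirrors the three DBpedia URL shapes listed above; the function names, the list of RDF media types, and the simplified Accept parsing are assumptions, not a description of how DBpedia is actually implemented.

```python
# Media types treated as "the client wants RDF data" (an assumed, partial list).
RDF_TYPES = ("application/rdf+xml", "text/turtle")

RESOURCE_PREFIX = "http://dbpedia.org/resource/"
PAGE_PREFIX = "http://dbpedia.org/page/"
DATA_PREFIX = "http://dbpedia.org/data/"


def parse_accept(header):
    """Return the media types of an Accept header, most preferred first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by position in the header.
        ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def redirect_target(resource_uri, accept_header):
    """Choose the document (HTML) or data (RDF) URL for a resource URI."""
    if not resource_uri.startswith(RESOURCE_PREFIX):
        raise ValueError("not a resource URI: " + resource_uri)
    name = resource_uri[len(RESOURCE_PREFIX):]
    for media_type in parse_accept(accept_header):
        if media_type in RDF_TYPES:
            return DATA_PREFIX + name
        if media_type in ("text/html", "*/*"):
            return PAGE_PREFIX + name
    return PAGE_PREFIX + name  # default to the human-readable page


if __name__ == "__main__":
    print(redirect_target("http://dbpedia.org/resource/Paris", "text/turtle"))
```

The point of the pattern is that the resource URI names the thing itself (the city), while browsers and RDF clients are each sent to a representation they can use.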
URI patterns in REST


Ø REST (Representational State Transfer) services
  manipulate resources, and URLs are mainly used
  to address these resources
Ø A base URI:
   § http://www.example.com/bookstore/
Ø Each resource has a unique URL: (retrieve, update,
  create, delete)
   § http://www.example.com/bookstore/books/ISBN123
Ø Notion of collection: (list, replace, create, delete)
   § http://www.example.com/bookstore/books
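The bookstore pattern can be sketched in Python as a small table from operations to HTTP requests. The operation names come from the slide; the mapping to HTTP methods (GET, PUT, POST, DELETE) is the conventional one and is an assumption here, as is the helper name `build_request`.

```python
BASE_URI = "http://www.example.com/bookstore/"

# Conventional (assumed) mapping of operations to HTTP methods.
COLLECTION_METHODS = {
    "list": "GET",       # list the members of the collection
    "replace": "PUT",    # replace the whole collection
    "create": "POST",    # add a new member, the server picks its URL
    "delete": "DELETE",  # delete the whole collection
}
RESOURCE_METHODS = {
    "retrieve": "GET",   # fetch a representation of the resource
    "update": "PUT",     # replace the resource
    "create": "PUT",     # create the resource at a URL chosen by the client
    "delete": "DELETE",  # delete the resource
}


def build_request(operation, collection, item_id=None):
    """Return (HTTP method, URL) for an operation on a collection or an item."""
    if item_id is None:
        return COLLECTION_METHODS[operation], BASE_URI + collection
    return RESOURCE_METHODS[operation], f"{BASE_URI}{collection}/{item_id}"


if __name__ == "__main__":
    print(build_request("list", "books"))
    print(build_request("retrieve", "books", "ISBN123"))
```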
Conversion tools to RDF


Ø How is the raw data to be converted?
  § Relational database?
  § (Semi-)structured formats?
  § Programmatic access (API)?
Ø There are solutions for all cases
D2RQ Map
Triplify: Relational data to JSON/RDF




Ø Extract a folder into your web application:
  http://sourceforge.net/projects/triplify/
Ø Modify a config file:
   § SQL query … URI pattern
   § PHP lover!
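The idea of pairing an SQL query with a URI pattern can be sketched in a few lines of Python. This is not Triplify's configuration format (Triplify is configured in PHP); the table, the vocabulary namespace and the function name are hypothetical, and literals are kept as plain strings for brevity.

```python
import sqlite3


def rows_to_triples(connection, query, uri_pattern, predicate_base):
    """Turn each row of an SQL result into triples.

    The subject comes from uri_pattern (filled with the row's columns),
    and every other non-null column becomes one predicate/value pair.
    """
    cursor = connection.execute(query)
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = uri_pattern.format(**record)
        for column, value in record.items():
            if column == "id" or value is None:
                continue  # the id is already encoded in the subject URI
            triples.append((subject, predicate_base + column, str(value)))
    return triples


def demo():
    """Build a tiny in-memory database (hypothetical data) and convert it."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE books (id TEXT, title TEXT, year INTEGER)")
    connection.execute("INSERT INTO books VALUES ('ISBN123', 'Linked Data', 2011)")
    return rows_to_triples(
        connection,
        "SELECT id, title, year FROM books",
        "http://www.example.com/bookstore/books/{id}",
        "http://www.example.com/vocab#",
    )


if __name__ == "__main__":
    for triple in demo():
        print(triple)
```

In a real conversion the predicates would be mapped to terms of the vocabularies selected on the first floor rather than derived from column names.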
Working on spreadsheets
Google acquired Freebase




http://code.google.com/p/google-refine/
RDF extension for Google Refine


Ø A graphical extension for Google Refine that
  exports the cleaned data as RDF
  http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/

Name            Job Title                 Grade  Organization             Annual pay rate (including taxable   Notes
                                                                          benefits and allowances)
Stephan Wilcke  Chief Executive Officer          Asset Protection Agency  £150,000 - £154,999
Jens Bech       Chief Risk Officer               Asset Protection Agency  £165,000 - £169,999                  No pension
Ion Dagtoglou   Chief Investment Officer         Asset Protection Agency  £165,000 - £169,999                  No pension
Brian Scammell  Chief Credit Officer             Asset Protection Agency  £130,000 - £134,999                  4 days per week
Google Refine and RDF
3rd floor - Publication
Publication components

[Diagram: publication architecture]
    Querying and browsing
    SPARQL endpoint / REST
    RDF storage, coupled with an inference engine
    Data feeds (several sources loaded into the RDF storage)

A few products
    Virtuoso, Sesame, Mulgara, 4store,
    OWLIM, AllegroGraph, Bigdata, Jena
Named graphs



Ø RDF graphs are bags of triples, everything is mixed
Ø Delete on a graph
Ø SPARQL queries define graphs

[Figure: a graph of nodes numbered 1 to 16]
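To make the named-graph idea concrete, here is a minimal Python sketch of a store that keeps quads (graph, subject, predicate, object) instead of bare triples, so that one graph can be queried or deleted without touching the others. The class and the example names are hypothetical; real stores expose this through SPARQL (for instance `DROP GRAPH` in SPARQL 1.1 Update) rather than through such an API.

```python
class QuadStore:
    """A toy store where every triple is tagged with the graph it belongs to."""

    def __init__(self):
        self._quads = set()

    def add(self, graph, subject, predicate, obj):
        """Add one triple to the given named graph."""
        self._quads.add((graph, subject, predicate, obj))

    def triples(self, graph=None):
        """Return the triples of one graph, or of all graphs merged together."""
        return {
            (s, p, o)
            for g, s, p, o in self._quads
            if graph is None or g == graph
        }

    def drop_graph(self, graph):
        """Delete a whole graph and return the number of removed triples."""
        before = len(self._quads)
        self._quads = {quad for quad in self._quads if quad[0] != graph}
        return before - len(self._quads)


if __name__ == "__main__":
    store = QuadStore()
    store.add("ex:graphA", "ex:Paris", "rdf:type", "ex:City")
    store.add("ex:graphB", "ex:Paris", "ex:population", "2000000")
    store.drop_graph("ex:graphB")  # graph A is left untouched
    print(store.triples())
```

Without the graph component, the two statements about `ex:Paris` would be mixed in one bag, and retracting what one source said would require knowing exactly which triples it contributed.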
Inference

Ø Generating triples from other triples
Ø Deduction mechanism
   § Men are mortal, Socrates is a man, so Socrates is
     mortal
Ø Avoids exhaustive enumeration, gives meaning to
  defining hierarchies
Ø Constraints: cardinality, NFPs, ...

[Figure: a graph of nodes numbered 1 to 16]
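The Socrates example can be reproduced with a small forward-chaining sketch in Python. It applies two RDFS-style rules (type propagation along `rdfs:subClassOf`, and transitivity of `rdfs:subClassOf`) until nothing new is produced. The `ex:` names and the function name are hypothetical, and a real inference engine is far more optimized than this naive loop.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer(triples):
    """Return the triples that can be deduced from the given ones.

    Rules applied until a fixed point is reached:
      (x type C), (C subClassOf D)        =>  (x type D)
      (C subClassOf D), (D subClassOf E)  =>  (C subClassOf E)
    """
    asserted = set(triples)
    closure = set(asserted)
    changed = True
    while changed:
        changed = False
        derived = set()
        for s, p, o in closure:
            if p not in (TYPE, SUBCLASS):
                continue
            for s2, p2, o2 in closure:
                if p2 == SUBCLASS and s2 == o:
                    derived.add((s, p, o2))
        if not derived <= closure:
            closure |= derived
            changed = True
    return closure - asserted


if __name__ == "__main__":
    facts = {
        ("ex:Socrates", TYPE, "ex:Man"),
        ("ex:Man", SUBCLASS, "ex:Mortal"),
    }
    print(infer(facts))
```

Only the class hierarchy and one type statement are stored; the fact that Socrates is mortal is generated, which is what makes it unnecessary to state everything exhaustively.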
Analysing RDF stores: the QSOS method


Ø Qualification and Selection of Open Source Software
   §   An open source project about open source solutions
   §   http://www.qsos.org
Ø Goals of QSOS
   §   Qualify software
   §   Compare solutions after defining requirements and weighting the criteria
   §   Select the product best suited to a given need
Ø QSOS provides
   §   An objective, formalized method
   §   A repository of available evaluations
   §   Tools that support carrying out the method
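The comparison step (weighting criteria, then selecting the best-suited product) can be sketched in Python as a weighted average. The criteria, weights and candidate names below are entirely hypothetical and are not results of any actual evaluation; the 0 to 2 scale is the one QSOS uses for scoring criteria, but the aggregation shown is only one simple way to combine scores.

```python
def weighted_score(scores, weights):
    """Weighted average of per-criterion scores (each assumed to be 0, 1 or 2)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    return sum(scores[criterion] * weight
               for criterion, weight in weights.items()) / total_weight


def select(candidates, weights):
    """Return the name of the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


if __name__ == "__main__":
    # Hypothetical requirements: SPARQL support matters three times more.
    weights = {"sparql_support": 3, "maturity": 1, "documentation": 1}
    # Hypothetical scores for two imaginary stores.
    candidates = {
        "store_a": {"sparql_support": 2, "maturity": 1, "documentation": 0},
        "store_b": {"sparql_support": 1, "maturity": 2, "documentation": 1},
    }
    print(select(candidates, weights))
```

Changing the weights changes the winner, which is the reason requirements are defined before the comparison.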
4th floor - Interconnection
Linked data and interconnections


Ø Without links there is no Web, only data silos
Ø Links can be part of the dataset's design (reference
  datasets)
Ø Links can be found after publication: equivalence
  links between resources
How to interconnect your data?
Tools


Ø RKB-CRS: a coreference resolution service for the RKB
  knowledge base
Ø LD-mapper: a linkage tool for datasets described using the
  Music Ontology
Ø ODD Linker: a linkage tool based on SQL
Ø RDF-AI: multi-purpose data linkage and fusion
Ø Silk and Silk LSL: linkage tool and linkage specification language
Ø Knofuss architecture: dataset linkage and fusion
Example Silk specification
<Silk>
 <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
 <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
 <Prefix id="gn" namespace="http://www.geonames.org/ontology#" />

 <DataSource id="dbpedia">
  <EndpointURI>http://demo_sparql_server1/sparql</EndpointURI>
  <Graph>http://dbpedia.org</Graph>
 </DataSource>

 <DataSource id="geonames">
  <EndpointURI>http://demo_sparql_server2/sparql</EndpointURI>
  <Graph>http://sws.geonames.org/</Graph>
 </DataSource>

 <Thresholds accept="0.9" verify="0.7" />
 <Output acceptedLinks="accepted_links.n3"
   verifyLinks="verify_links.n3"
   mode="truncate" />

 <Interlink id="cities">
  <LinkType>owl:sameAs</LinkType>
  <SourceDataset dataSource="dbpedia" var="a">
   <RestrictTo>
    ?a rdf:type dbpedia:City
   </RestrictTo>
  </SourceDataset>
  <TargetDataset dataSource="geonames" var="b">
   <RestrictTo>
    ?b rdf:type gn:P
   </RestrictTo>
  </TargetDataset>
  <LinkCondition>
   <AVG>
    <Compare metric="jaroSimilarity">
     <Param name="str1" path="?a/rdfs:label" />
     <Param name="str2" path="?b/gn:name" />
    </Compare>
    <Compare metric="numSimilarity">
     <Param name="num1"
         path="?a/dbpedia:populationTotal" />
     <Param name="num2" path="?b/gn:population" />
    </Compare>
   </AVG>
  </LinkCondition>
 </Interlink>
</Silk>
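The logic of this link condition can be sketched in Python: average a Jaro string similarity on the labels with a numeric similarity on the populations, then compare the result with the accept (0.9) and verify (0.7) thresholds. The Jaro function follows the standard definition; the numeric similarity (ratio of the smaller to the larger value) is an assumption and may differ from Silk's own `numSimilarity`, and the city records are made up.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def num_similarity(a, b):
    """Assumed numeric similarity: ratio of the smaller to the larger value."""
    if a == b:
        return 1.0
    if a <= 0 or b <= 0:
        return 0.0
    return min(a, b) / max(a, b)


def link_decision(source, target, accept=0.9, verify=0.7):
    """Average both similarities and classify the candidate link."""
    score = (jaro(source["label"], target["name"])
             + num_similarity(source["population"], target["population"])) / 2
    if score >= accept:
        return "accept", score
    if score >= verify:
        return "verify", score
    return "reject", score


if __name__ == "__main__":
    # Made-up records standing in for a DBpedia city and a Geonames place.
    a = {"label": "Paris", "population": 2000000}
    b = {"name": "Paris", "population": 2100000}
    print(link_decision(a, b))
```

Links in the accept band would be written to the accepted links file, and those in the verify band set aside for manual review.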
Where to find links ?
Towards automated interconnection services


Ø The linkage specification could be simplified
  § Using alignments between vocabularies
  § Detecting discriminating properties
  § Indicating comparison methods by attaching metadata to
    ontologies
Ø Work in progress in Datalift
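One possible heuristic for spotting discriminating properties, sketched in Python: for each property, divide the number of distinct values by the number of resources that carry it. A ratio close to 1 means the values almost identify the resources and are good candidates for comparison. This is an illustrative assumption, not the method developed in Datalift, and the sample data is made up.

```python
from collections import defaultdict


def discriminability(triples):
    """Score each property by distinct values per described resource (max 1.0)."""
    values = defaultdict(set)
    subjects = defaultdict(set)
    for subject, predicate, obj in triples:
        values[predicate].add(obj)
        subjects[predicate].add(subject)
    return {
        predicate: min(1.0, len(values[predicate]) / len(subjects[predicate]))
        for predicate in values
    }


if __name__ == "__main__":
    # Made-up data: names differ for every city, the country never does.
    data = [
        ("ex:c1", "ex:name", "Paris"), ("ex:c1", "ex:country", "France"),
        ("ex:c2", "ex:name", "Lyon"), ("ex:c2", "ex:country", "France"),
        ("ex:c3", "ex:name", "Nice"), ("ex:c3", "ex:country", "France"),
    ]
    print(discriminability(data))
```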
5th floor - Applications
Data visualization




                Tabulator
                (CSAIL, MIT)
VisiNav
Sig.ma
Nos Députés . FR
A few examples from the US




http://data-gov.tw.rpi.edu/demo/USForeignAid/demo-1554.html
Mashups … Mashups … Mashups …
That's it !
●   Datalift.org
●   We're looking for a Datageek !
