Linked Open Data as an Enabler for
         Team Science

           Deborah L. McGuinness
      Tetherless World Senior Constellation Chair
     Professor of Computer and Cognitive Science
      Rensselaer Polytechnic Institute, Troy, NY
     & CEO McGuinness Associates, Latham, NY


       Science of Team Science; LOD and Team Science April 19, 2012
Background
– Semantic Technologies – technological support for
  encoding meaning in a form computers can
  understand and manipulate – are maturing and
  increasing in usage
– Computational encodings of meaning can be used
  to help integrate, link, validate, filter, …, essentially
  to make smarter, more context-aware applications
– Semantic Technologies enable linking data … and
  linked data provides a way of connecting and
  traversing information, nodes, graphs, webs, …
Linked Data

• Linked Data is quite simple and follows principles set
  out by Berners-Lee in
  http://www.w3.org/DesignIssues/LinkedData.html
  – Use URIs as names for things
  – Use HTTP URIs so that people can look up those names.
  – When someone looks up a URI, provide useful information,
    using the standards (RDF*, SPARQL)
  – Include links to other URIs, so that they can discover more
    things.

  – Introduction by examples and then discussion
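
The four principles above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: every URI is a made-up example.org name, and the in-memory `describe` function merely stands in for an HTTP lookup that would return RDF.

```python
from collections import defaultdict

# Hypothetical namespace; none of these URIs come from the slides.
EX = "http://example.org/id/"

# Each statement is a (subject, predicate, object) triple.
# Things are named with HTTP URIs (principles 1 and 2).
TRIPLES = [
    (EX + "state/HI", EX + "prop/label", "Hawaii"),
    (EX + "state/HI", EX + "prop/hasDataset", EX + "dataset/smoking-prevalence"),
    (EX + "dataset/smoking-prevalence", EX + "prop/label", "Smoking prevalence"),
]


def describe(uri, triples=TRIPLES):
    """Return what is known about a URI (principle 3, simulated locally)."""
    description = defaultdict(list)
    for s, p, o in triples:
        if s == uri:
            description[p].append(o)
    return dict(description)


def linked_uris(uri, triples=TRIPLES):
    """Return outgoing links to other URIs (principle 4)."""
    return [o for s, _, o in triples if s == uri and o.startswith("http://")]


if __name__ == "__main__":
    start = EX + "state/HI"
    print(describe(start))
    # Follow each link to discover more things.
    for target in linked_uris(start):
        print(target, describe(target))
```

Starting from one URI, a client reads the description, finds the links, and repeats the lookup on each linked URI; that traversal is what "discover more things" amounts to in practice.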
Population Sciences Grid Goals


• Convey complex health-related information to
  consumer and public health decision makers
  for community health impact
• Inform the development of future research
  opportunities effectively utilizing
  cyberinfrastructure for cancer prevention and
  control
McGuinness, D., Shaikh, A., Lebo, T., Ding, L., Courtney, P., McCusker, J., Moser,. Morgan, G.D., Tatalovich, Z., Willis, G., Contractor, N., and Hesse, B.
2012. Towards Semantically-Enabled Next Generation Community Health Information Portals: The PopSciGrid Pilot. In Proceedings of Hawaii
International Conference on System Sciences 2012.
Semantic Web Perspective on
                        Initial PopSciGrid Goals
• How can semantic technologies be used to integrate, present,
  and analyze data for a wide range of users?
• Can tools allow lay people to build their own demos and
  support public usage and accurate interpretation?
• How do we facilitate collaboration and “viral” applications?
• Within PopSciGrid:
   – Which policies (taxation, smoking bans, etc.) impact health and health
     care costs?
   – What data should be displayed to help scientists and lay people
     evaluate related questions?
   – What data might be presented so that people choose to make (positive)
     behavior changes?
   – What does the data show? Why should someone believe that?
   – What are appropriate follow-up questions to support actionability?
Foundations: The Tetherless World
            Constellation Linked Open Government
                          Data Portal




[Figure: TWC LOGD portal diagram. Labels: Create, Convert, Enhance; TWC LOGD; LOGD SPARQL Endpoint; Query/Access (RDF, RSS, JSON, XML, HTML, CSV, …); Community Portal; Data.gov deployment]
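
As a rough sketch of what querying a SPARQL endpoint and asking for one of several result formats can look like, the following builds (but does not send) a SPARQL Protocol GET request. The endpoint URL and the query are assumptions made for illustration; the slides do not give the portal's actual endpoint address or its format-selection mechanism, so standard HTTP content negotiation is assumed here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

# Hypothetical endpoint URL, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# Illustrative query: list things that have a Dublin Core title.
QUERY = """
SELECT ?dataset ?title
WHERE { ?dataset <http://purl.org/dc/terms/title> ?title }
LIMIT 10
"""

# Standard media types for SPARQL SELECT results.
RESULT_FORMATS = {
    "json": "application/sparql-results+json",
    "xml": "application/sparql-results+xml",
    "csv": "text/csv",
}


def build_request(query, fmt="json", endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET request without sending it."""
    url = endpoint + "?" + urlencode({"query": query})
    # The Accept header asks the server for a particular result format.
    return Request(url, headers={"Accept": RESULT_FORMATS[fmt]})


if __name__ == "__main__":
    req = build_request(QUERY, "csv")
    print(req.full_url)
    print(req.get_header("Accept"))
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.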
What is an Ontology?

[Figure: ontology spectrum, from informal to formal: Catalog/ID; Terms/glossary; Thesauri (“narrower term” relation); Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restrs.; Disjointness, Inverse, part-of…; General Logical constraints]




Ontologies Come of Age McGuinness, 2001, and From AAAI Panel 99 – McGuinness, Welty, Uschold, Gruninger, Lehmann
Plus basis of Ontologies Come of Age – McGuinness, 2003
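
To make the "formal is-a" and "formal instance" points of the spectrum concrete, here is a minimal sketch in which subclass links are followed transitively, so that an instance of a class also counts as an instance of every ancestor. The class and individual names are hypothetical and are not taken from any ontology mentioned in the talk.

```python
# Hypothetical mini-ontology; all names are illustrative only.
SUBCLASS_OF = {
    "Merlot": "RedWine",
    "RedWine": "Wine",
    "WhiteWine": "Wine",
}
INSTANCE_OF = {"bottle-42": "Merlot"}


def superclasses(cls):
    """Return all ancestors of cls by following is-a links transitively."""
    ancestors = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        ancestors.append(cls)
    return ancestors


def is_instance(individual, cls):
    """True if the individual belongs to cls directly or via inheritance."""
    direct = INSTANCE_OF[individual]
    return cls == direct or cls in superclasses(direct)


if __name__ == "__main__":
    print(superclasses("Merlot"))
    print(is_instance("bottle-42", "Wine"))
    print(is_instance("bottle-42", "WhiteWine"))
```

With an informal is-a (as in many web directories) this inference is not safe; the formal end of the spectrum is what licenses it.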
Inference Web: Making Data Transparent and
                            Actionable Using Semantic Technologies

• How and when does it make sense to use smart system results & how do we
  interact with them?




[Figure: Inference Web application areas: Knowledge Provenance in Virtual Observatories; (Mobile) Intelligent Agents; NSF Interops: SONET, SSIII – Sea Ice; Intelligence Analyst Tools; Hypothesis Investigation / Policy Advisors]
Foundations: Web Layer Cake

[Figure: Semantic Web layer cake annotated with related work]
• Visualization APIs; S2S; Govt Data
• Inference Web, Proof Markup Language, W3C Provenance Working group formal model, W3C incubator group, …
• Inference Web IW Trust, Air + Trust
• DL, KIF, CL, N3Logic
• Ontology repositories (Ontolingua), Ontology Evolution env: Chimaera, Semantic eScience Ontologies, MANY other ontologies
• OWL 1 & 2 WG: edited main OWL docs, quick reference, OWL profiles (OWL RL); earlier languages: DAML, DAML+OIL, Classic
• RIF WG; AIR accountability tool
• SPARQL WG; earlier QL: OWL-QL, Classic’ QL, …
• Govt metadata search; Linked Open Govt Data
• SPARQL to XQuery translator; RDFS materialization (Billion triple winner)
• Transparent Accountable Datamining Initiative (TAM
PopSciGrid Workflow

[Figure: PopSciGrid workflow. Labels: source data (Ban coverage, CHSI 2009); CSV2RDF4LOD Direct and CSV2RDF4LOD Enhance conversions; derive; archive; SemDiff; publish; visualize]
PopSciGrid Example
State: Hawaii

Extensible Mashups via Linked Data
• Diverse datasets from NIH
• Potentially linking to other content (e.g.
  “unemployment rate”)
Accountable Mashups via Provenance
• Annotate datasets used in demos
• Feed back users’ comments to gov contact (e.g. %)
• Annotation capabilities coming (and more)
PopSciGrid II
Reflections
Successful, but…
• What if we could allow data experts to build
  their own demos?
• What if we could allow non-subject matter
  experts to function as subject-literate staff?
• What if team members could interchange roles
  (and thus make contributions in other areas)?
• What technological infrastructure is required?
• Claim: all of this is being done now, but not at
  scale
Updates and Motivations from a
                   Computer Science Perspective

Old:                                  New:
• Raw conversions                     • Enhanced conversions
• Per-dataset vocabularies            • Vocabulary reuse
• Custom queries                      • Generic queries
• Custom data management code         • Re-usable data management code
• Limited use because of Google       • Unlimited use of new open source
  Visualization licenses                visualization toolkit
• State-level data                    • State and county-level data
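
The difference between a raw conversion with per-dataset vocabulary and an enhanced conversion with vocabulary reuse can be sketched as follows. This is only an illustration of the idea, not how the csv2rdf4lod tooling works; the namespace, the column names, the mapping to `dcterms:date`, and the CSV values are all assumptions or placeholders.

```python
import csv
import io

# Placeholder input; the values are invented for illustration.
CSV_TEXT = "state,year,rate\nHI,2009,0.0\n"

BASE = "http://example.org/"  # hypothetical namespace

# Enhancement parameters (assumed): reuse shared vocabulary terms
# and give columns proper datatypes.
SHARED = {"year": "http://purl.org/dc/terms/date"}
CASTS = {"year": int, "rate": float}


def convert(text, enhance=False):
    """Turn CSV rows into (subject, predicate, object) triples."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        subject = f"{BASE}row/{i}"
        for column, value in row.items():
            # Raw conversion: one dataset-specific predicate per column,
            # every value kept as a string.
            predicate = f"{BASE}raw/{column}"
            if enhance:
                # Enhanced conversion: reuse a shared term where one is
                # mapped, and cast the value to a typed literal.
                predicate = SHARED.get(column, predicate)
                value = CASTS.get(column, str)(value)
            triples.append((subject, predicate, value))
    return triples


if __name__ == "__main__":
    for triple in convert(CSV_TEXT):
        print("raw:     ", triple)
    for triple in convert(CSV_TEXT, enhance=True):
        print("enhanced:", triple)
```

Once several datasets share predicates such as the date term, one generic query can run across all of them, which is the step from "custom queries" to "generic queries" in the comparison above.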
RDF Data Cube Vocabulary
• For publishing multi-dimensional data, such
  as statistics, on the web in such a way that it can
  be linked to related data sets and concepts using
  RDF.
• Compatible with the cube model that underlies
  SDMX (Statistical Data and Metadata eXchange).
• Also compatible with:
   – SKOS, SCOVO, VoiD, FOAF, Dublin Core Terms
• Integrated with the LOGD data conversion
  infrastructure
• Integrated with other tooling like Stats2RDF
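
A minimal sketch of the cube model: each observation is a resource typed `qb:Observation`, attached to a data set, with one value per dimension plus a measured value. The `qb:` terms are from the RDF Data Cube vocabulary; the example.org dimension and measure properties, the county identifiers, and the numeric values are hypothetical placeholders.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace


def observation(obs_id, dataset, dimensions, measure, value):
    """Build the triples for one qb:Observation."""
    subject = f"{EX}obs/{obs_id}"
    triples = [
        (subject, RDF_TYPE, QB + "Observation"),
        (subject, QB + "dataSet", dataset),
    ]
    # One triple per dimension (e.g. place and time).
    triples += [(subject, prop, val) for prop, val in dimensions.items()]
    # The measured value itself.
    triples.append((subject, measure, value))
    return triples


def slice_by(triples, prop, val):
    """Return the observations whose dimension `prop` equals `val`."""
    return sorted({s for s, p, o in triples if p == prop and o == val})


if __name__ == "__main__":
    county, year = EX + "dim/county", EX + "dim/year"
    measure = EX + "measure/value"
    dataset = EX + "dataset/demo"
    # Placeholder observations; the numbers carry no meaning.
    cube = (
        observation("o1", dataset, {county: EX + "county/A", year: 2009}, measure, 0.0)
        + observation("o2", dataset, {county: EX + "county/B", year: 2009}, measure, 0.0)
    )
    print(slice_by(cube, year, 2009))
    print(slice_by(cube, county, EX + "county/A"))
```

Because dimension values such as counties are themselves URIs, they can link out to other data sets, which is what lets cube data participate in mashups.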
County average life expectancy
(Summary Measures of Health
SemantEco/SemantAqua
• Enable/Empower citizens &
  scientists to explore pollution
  sites, facilities, regulations, and
  health impacts along with
  provenance.
• Demonstrates semantic
  monitoring possibilities.
• Map presentation of analysis
• Explanations and Provenance
  available

http://was.tw.rpi.edu/swqp/map.html and
http://aquarius.tw.rpi.edu/projects/semantaqua

[Figure: annotated screenshot with numbered callouts]
     1.   Map view of analyzed results
     2.   Explanation of pollution
     3.   Possible health effect of contaminant (from EPA)
     4.   Filtering by facet to select type of data
     5.   Link for reporting problems
     6.   Now joint with USGS resource managers; expanded to
          endangered species; now more virtual observatory style
System Architecture




[Figure: system architecture diagram; legible labels: Virtuoso, access]
Originally developed for VSTO, now in SSIII, SESDI, SESF, OOI …

The Virtual Solar-Terrestrial Observatory: A Deployed Semantic Web Application Case Study for Scientific Research. Proc. 19th
Conf. on Innovative Applications of Artificial Intelligence (IAAI-07),
http://www.vsto.org
Discussion

• Semantic Technologies and Linked Data are
  powering a wide array of applications, many
  in Big Science, Team Science, or at least
  interdisciplinary science
• Labeled graphs as powered by structured
  data can be a nice corpus for evaluation
• Tools and methodologies are ready for use
• We love to partner in these areas
• What do you need or want from linked data?
Questions? - dlm @ cs . rpi . edu
Extra
Directions
•   Incorporation of TWC Data Quality Facts label
    (Zednik et al.)
•   Use of DataFAQs automated data quality
    framework (Lebo et al.)
•   Additional provenance inclusion / usage (Inference /
    Provenance Web)
•   Annotation / Collaboration facilities (Michaelis et al.)
•   Other data sets? Or exposition of other
    parameters?
•   Partners in additional topic areas
Enabling Subject Area Exploration
                      and Hypothesis Generation
• What factors influence prevalence (and under what conditions)?
• Within smoking, should we focus on prevalence, packs sold,
  quit rate, hospital admission diagnosis, other?
• What is prevalence (definition)? And how is it measured (overall
  / in this data set)?
• What are the conditions under which the data was obtained
  (date, sample set, extenuating conditions, …)?
• What other data might we include? And how might we show
  that data?
• What should be represented? And how should it be
  manipulated?
• What tools and services do people benefit from to explore?
  Encode? Act?
Semantically-enabled advisors
utilize:
      • Ontologies
      • Reasoning
      • Social
      • Mobile
      • Provenance
      • Context

Patton & McGuinness et al.
tw.rpi.edu/web/project/Wineagent
Semantic
            Sommelier
Previous versions used ontologies
to infer descriptions of wines for
meals and query for wines
New version uses
  Context: GPS location, local
  restaurants and wine lists, user
  preferences
  Social input: Twitter, Facebook, Wiki,
  mobile, …
Source variability in quality,
contradictions exist,
maintenance is an issue… however,
new models are emerging
•   Semantic Technologies: ready for use
    Tools & tutorials available; deep apps
    future planning may benefit from
    consultants
•   Context-aware, semantic
    apps are the future

The Semantic Web enables…
•   New models of intelligent services
    •   E-commerce solutions
    •   M-commerce
    •   Web assistants
    •   …

    New forms of web assistants/agents that act on a
    human’s behalf requiring less from humans
    and their communication devices…
