Tetherless World Constellation




   Data: Big and Broad
             Jim Hendler
    Tetherless World Constellation
Tetherless World Professor of Computer and Cognitive Science
            Head, Computer Science Department

   Rensselaer Polytechnic Institute
   http://www.cs.rpi.edu/~hendler
         @jahendler (twitter)
Outline (if I stick to it)

• What is big data?
• How big is big?
• What is big data on the Web?
• What is Broad data?
• Got an example?
• What’s the problem?
• What’s going on
Useful Terms

• Machine-readable Data
   – Information available in a form that is accessible and
     manipulable by computer
   – Accessible ≠ Manipulable
      • e.g., PDF documents can be read in and displayed, but the
        information in the document is not readily available without special
        tooling
• Metadata
   – Information associated with (machine-readable) data that
     provides information about the data set
• Workflow, Provenance, and lots of other terms
   – Useful sorts of metadata covering who created the data,
     when, how it was processed, etc.
• Metadata and the other stuff most useful when it is
  machine-readable and openly available in commonly agreed
  upon formats
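
A minimal sketch of the "Accessible ≠ Manipulable" distinction and of machine-readable metadata, using only the Python standard library. The dataset, the field names in `metadata`, and the publisher name are all hypothetical and chosen only for illustration; they do not follow any particular metadata standard.

```python
import csv
import io
import json

# Accessible but not manipulable: a person can read this sentence,
# but a program cannot reliably pull the numbers out of it.
display_only = "Harvest in tonnes: maize 120, wheat 95"

# Machine-readable: the same (hypothetical) information as CSV text.
machine_readable = "crop,tonnes\nmaize,120\nwheat,95\n"

# Hypothetical metadata about the dataset: who created it, when, and
# how it was processed. The field names are illustrative assumptions.
metadata = {
    "title": "Harvest by crop",
    "creator": "Example Statistics Office",
    "created": "2012-07-09",
    "processing": "tonnes aggregated from regional reports",
}


def load_rows(csv_text):
    """Parse CSV text into a list of dicts with integer tonnage."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"crop": r["crop"], "tonnes": int(r["tonnes"])} for r in reader]


rows = load_rows(machine_readable)
total_tonnes = sum(r["tonnes"] for r in rows)

# Serializing the metadata in an agreed format (JSON here) makes it
# machine-readable too, so it can travel with the data.
metadata_json = json.dumps(metadata, sort_keys=True)
```

Once the data is parsed into rows, totals, filters, and joins are one-liners; the free-text version would need special tooling for each of those.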
BIG Data is NOT the Web of Data

• The term “Big Data” is widely used
  nowadays to refer to a whole bunch of
  machine-readable data in one accessible
  (to the researcher) place
   – 3 main contexts
    • The large data collections of “big science” projects
       – in traditional data warehouse or database formats
    • The enterprise data of large, non-Web-based
      companies (IBM, TATA, etc.)
       – Generally in multiple
    • The data holdings of a Google, Facebook or other
      large Web company
       – Include large “unstructured” holdings
       – Include “graph” data
Tera, Peta, Zetta, yotta, yotta, yotta…


• World Wide Web data is extremely large
• Extremely well “funded”
  – e.g., Facebook
     • 25 terabytes of logged data per day; valuation $33B (US
       NIH budget ~ $31B)
  – e.g., Google
     • In 2008, estimated at 20 petabytes per day (not
       including YouTube); current valuation $190B (about 1/3
       the entire US DoD budget)

• And really, really fascinating stuff
  – Data about people and their relationships
     •   To each other
     •   To products
     •   To activities and actions
     •   …
How BIG is Big?

BIG Data

Google uses their data in many ways
         Search => ads => user
Big Data is becoming different on the Web

• New Work
  – Is moving away from traditional relational
    models
     • cf. NoSQL
  – Is moving towards third-party applications and
    extensions
     • cf. Mobile apps for local governments
  – Includes a focus on interoperability and
    exchange with “lightweight” semantics
     • Using ideas from the Semantic Web
        – Search: Schema.org
        – Social Networking: OGP
Which in part gives rise to BROAD data

• 4th context: Broad Data
  – The huge amount of freely available, but widely varied,
    Open Data on the World Wide Web (Structured and
    Semi-structured)
     • Example: The extended Facebook OGP graph (the
       part outside Facebook’s datasets)
     • Example: The growing linked open data cloud of
       freely available RDF linked data
     • Example: Hundreds of thousands of datasets that are
       available on the Web free from governments around
       the world
Example: adding “Breadth”

                    April 2010
Facebook’s Open Graph Protocol

• Facebook now allows other sites to extend the graph
• Open Graph Protocol uses RDFa to let web sites contain
  information about the things people “like”
       og:title - The title of your object as it should appear within the graph, e.g., "The Rock".
       og:type - The type of your object, e.g., "movie". Depending on the type you specify, other
       properties may also be required.
       og:image - An image URL which should represent your object within the graph.
       og:url - The canonical URL of your object that will be used as its permanent ID in the graph
       og:description - A one to two sentence description of your object.
       og:site_name - If your object is part of a larger web site, the name which should be
       displayed for the overall site. e.g., "IMDb".




   – Not a traditional “ontology”
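
As a rough illustration of how a third-party site's `og:` properties can be read back out of a page, here is a minimal sketch using Python's standard `html.parser`. The sample page, its `example.com` URLs, and the site name are hypothetical; only the property names (`og:title`, `og:type`, and so on) come from the list above.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        if prop.startswith("og:") and "content" in attr_map:
            self.properties[prop] = attr_map["content"]


# Hypothetical page marked up with the properties described above.
SAMPLE_PAGE = """
<html prefix="og: http://ogp.me/ns#">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.example.com/the-rock" />
<meta property="og:image" content="http://www.example.com/rock.jpg" />
<meta property="og:site_name" content="Example Movies" />
</head>
</html>
"""


def extract_open_graph(html_text):
    """Return a dict of og: property name to content value."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    return parser.properties


og_properties = extract_open_graph(SAMPLE_PAGE)
```

The result is a flat set of property/value pairs about one object, which is part of why the protocol reads as lightweight markup rather than a traditional ontology.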
Big Data

Facebook generates terabytes of data per day
          What could be learned from this?
Creates a platform for SW-powered apps

BROAD data challenges

• For broad data the new challenges
  that emerge include
  – (Web-scale) data search
  – “Crowd-sourced” modeling
  – rapid (and potentially ad hoc)
    integration of datasets
  – visualization and analysis of only
    partially modeled datasets
  – policies for data use, reuse and
    combination
Huh?

“The more I work with data, the more I
realize I need Semantics”

 Huh?

The traditional database community has,
umm, not always been the first to embrace
semantics

What is different here?
Government Data Sharing

The Web of Open Government Data is Growing
• Analytics based on over 1,000,000 datasets
  from around the world can be seen at
   – http://logd.tw.rpi.edu/iogds_data_analytics
• The examples that follow are from that page
Datasets                 1,028,054
Countries                43
Catalogs                 192
Categories               2,460
Languages                24
2012 International Open Government Data Conference: Open Gov Data Tutorial, 9 July 2012
International




Many others…




Important note: quantity is not really the most important issue

Topics (Across All Catalogs)




Topics (Across All Catalogs)




Combining data from different data sharing sites

Data Integration Problems

A head-to-head comparison appears to show that
burglaries in Avon and Somerset (UK) far
exceed those in Los Angeles, California
(one of the highest crime areas in the US)
The problem is (likely) semantics

Same or different?

Do the terms mean the same? Are they collected in the same way? Are
they processed differently? …
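
One way to make these questions operational is to carry the definitions along with the numbers and check them before comparing. The sketch below is an illustration only: the regions, counts, populations, and definition strings are invented placeholders, not figures from either police dataset, and `CrimeStat` is a hypothetical record type.

```python
from dataclasses import dataclass


@dataclass
class CrimeStat:
    region: str
    count: int
    population: int
    term_definition: str      # what the publisher means by the term
    collection_method: str    # how the figures were gathered


def rate_per_100k(stat):
    """Normalize a raw count to incidents per 100,000 residents."""
    return stat.count * 100_000 / stat.population


def semantic_mismatches(a, b):
    """Return the metadata fields on which two datasets disagree."""
    fields = ("term_definition", "collection_method")
    return [f for f in fields if getattr(a, f) != getattr(b, f)]


# Invented placeholder values, for illustration only.
region_a = CrimeStat(
    "Region A", 9_000, 1_500_000,
    "entry into any building with intent to steal",
    "police-recorded incidents",
)
region_b = CrimeStat(
    "Region B", 16_000, 4_000_000,
    "forced entry into a dwelling",
    "police-recorded incidents",
)

mismatches = semantic_mismatches(region_a, region_b)
rates_comparable = not mismatches
```

Here the per-capita rates differ, but `mismatches` shows the two publishers define the term differently, so the difference in rates cannot be read as a difference in crime.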
Example: Water

Example: Water/Kenya

Finding Data

World Bank: Africa     Africover: Agriculture




 Kenya: Agricultural   US Data.gov: Crop
5 Star Data

Broad Data “Integration” requires simple semantics
Example: any Wikipedia topic!
Arizona

Arizona info (From the previous)

USDA data turns out to be crucial

Metadata is crucial for Broad Data


• Metadata design is crucial to govt data
  sharing
  – Needed for search and federation in large data
    sharing efforts
• International data sharing
  – W3C Govt Linked Data Working Group
  – Need for vocabularies within govt sectors
     • Esp. for cross-language use
        – How can we compare health (or legal, or social, or ….) data
          between countries like US, UK, India, Kenya (English) with
          Norway, China, France, etc.
        – How can we link local govts (in traditional languages, local
          dialects, etc) w/national data
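
A minimal sketch of what a shared sector vocabulary buys in the cross-language case: catalog entries labelled in different languages are mapped to one concept identifier and can then be grouped together. The `concept:` identifiers, the mapping table, and the catalog entries are all hypothetical; a real effort would draw the terms from an agreed vocabulary rather than a hand-written dict.

```python
# Hypothetical mapping from (language, local category label) to a
# shared concept identifier.
SHARED_VOCABULARY = {
    ("en", "health"): "concept:health",
    ("fr", "santé"): "concept:health",
    ("no", "helse"): "concept:health",
    ("en", "agriculture"): "concept:agriculture",
    ("fr", "agriculture"): "concept:agriculture",
    ("no", "landbruk"): "concept:agriculture",
}


def to_shared_concept(language, label):
    """Look up the shared concept for a local label, or None if unmapped."""
    return SHARED_VOCABULARY.get((language, label.strip().lower()))


def group_by_concept(catalog_entries):
    """Group dataset titles by shared concept; unmapped entries go under None."""
    groups = {}
    for entry in catalog_entries:
        concept = to_shared_concept(entry["language"], entry["category"])
        groups.setdefault(concept, []).append(entry["title"])
    return groups


# Hypothetical catalog entries from three national portals.
catalog_entries = [
    {"title": "UK dataset 1", "language": "en", "category": "Health"},
    {"title": "Norway dataset 1", "language": "no", "category": "helse"},
    {"title": "France dataset 1", "language": "fr", "category": "Agriculture"},
    {"title": "UK dataset 2", "language": "en", "category": "misc"},
]

groups = group_by_concept(catalog_entries)
```

Entries that fall under `None` are the ones the vocabulary does not yet cover, which gives a simple measure of how much mapping work remains.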
Database metadata

Dataset extension to schema.org (pending)

Government Data in the linked open data cloud

    Government Data is
    currently over ½ the cloud in
    size (~17B triples), 10s of
    thousands of links to other
    data (within and without)

http://linkeddata.org/
Research in Govt Data => Broad Data challenges


• Trust
   – Government data is controversial, and potentially biased
       • How do we confirm or dispute?
• Combination
   – When we combine data we need to keep the provenance of
     information (see trust)
       • How do we make policies explicit and sharable?
• Scaling
   – Our project has already converted 9.9B triples from only
     a little over 2,000 of the 710,000 government databases we can identify
     (116 catalogs, 32 countries, 16 languages)
       • Cross-catalog
       • Cross-language
• Versioning and updating
• Archiving
• Visualization
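
The "Combination" point above can be sketched in a few lines: tag every row with where it came from, and merge the tags whenever rows are joined, so each combined value can still be traced back and disputed. The catalog names, keys, and values are hypothetical, and this plain-dict approach is an assumption for illustration, not the project's actual pipeline.

```python
def tag_with_source(rows, source):
    """Attach a provenance list naming the source catalog to every row."""
    return [{**row, "provenance": [source]} for row in rows]


def combine_on_key(left, right, key):
    """Join two row lists on `key`, merging provenance of matching rows."""
    right_index = {row[key]: row for row in right}
    combined = []
    for row in left:
        match = right_index.get(row[key])
        if match is None:
            continue  # no counterpart in the other dataset
        merged = {**row, **match}
        merged["provenance"] = row["provenance"] + match["provenance"]
        combined.append(merged)
    return combined


# Hypothetical rows from two hypothetical catalogs.
population = tag_with_source(
    [{"region": "R1", "population": 1000}, {"region": "R2", "population": 2000}],
    "catalog-a",
)
farmland = tag_with_source(
    [{"region": "R1", "farmland_km2": 50}],
    "catalog-b",
)

combined = combine_on_key(population, farmland, "region")
```

Each combined row lists every catalog that contributed to it, which is the minimum needed to attach trust judgments or usage policies to the result.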
Big Data needs bigger ideas for visualization




      (Fox & Hendler, Science, 2/11/10)
A new idea we’re playing with at RPI

• Data as “exhibition”
  – Museums and the performing arts have explored
    accessibility for real-world artifacts; can
    we extend these ideas to the data web?
• Data via physical interaction
  – Using theatre techniques
    we can literally move a
    person through a data landscape; what
    new metaphors does this open up?
Conclusions

• Big data is going Broad
  – World Wide Web trend towards more and more
    varied data
     • In many domains
        – E-commerce, Open Govt, many more (cf.
          Health/Medical care)

• Broad data requires thinking outside the
  “Database” box
  – Including considering access
• Broad data opens exciting possibilities for
  research and innovation
  – And I hope will help provide tools for making
    data more accessible

Data Big and Broad (Oxford, 2012)
