Big data (phenomenon): challenges and requirements in official statistics
Fernando Reis, Big Data Task-Force
European Commission (Eurostat)
Big Data Europe Workshop
Luxembourg, 18th November 2015
Defining big data in 1 minute
•  Data deluge
•  High Volume, High Velocity, High Variety
•  Data-driven analytical applications
•  Statistical modelling
•  Visualisation
•  Data-driven economy
•  Official statistics no longer has a near-monopoly on statistics
The data deluge
© Copyright Brett Ryder 2010
Digital footprint
Datafication
Sensors
The data deluge
•  Everything is data (data ubiquity)
•  Examples: Text, sound, images, video
•  Emergent use of unstructured data
•  Exhaust data and “reality mining”
•  Types of big data sources
•  Signal-to-noise ratio (the 4th ‘V’: Value)
•  The next age: Internet of things
The data deluge
•  Organic data / exhaust data / digital footprint
•  Data ubiquity: text, sound, images, video
•  Emergent use of unstructured data
•  Reality mining
The data deluge
•  Communication: mobile phone data, social media
•  WWW: web searches, businesses' websites, e-commerce websites, job advertisements, real estate websites
•  Sensors: traffic loops, smart meters, vessel identification, satellite images
•  Process-generated data: flight booking transactions, supermarket cashier data, financial transactions
•  Crowd sourcing: VGI websites (OpenStreetMap), community pictures collection
Analytics
This photo, “Cartoon: Big Data” is
copyright (c) 2014 Thierry Gregorius
and made available under an
Attribution 2.0 Generic license.
Analytics
•  How to deal with exhaust data?
•  Handled with machine learning / predictive analytics
•  Massive datasets
•  Foster machine learning
•  Data science: a new discipline?
•  Signal processing (audio, image, video)
•  Natural Language Processing (NLP)
•  Network data
•  Distributed computing
•  Multiple inference
•  Over-fitting
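The "Multiple inference" bullet can be illustrated numerically. The following is a minimal sketch, assuming independent tests and using the Bonferroni correction as one common remedy (the slides do not name a specific method); the function names are illustrative only.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent
    tests, each run at significance level alpha (independence is an
    assumption of this sketch)."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance level that keeps the family-wise error
    rate at or below alpha when m tests are run."""
    return alpha / m


if __name__ == "__main__":
    alpha = 0.05
    for m in (1, 10, 100):
        uncorrected = familywise_error_rate(alpha, m)
        corrected = familywise_error_rate(bonferroni_threshold(alpha, m), m)
        print(f"m={m:>3}  uncorrected={uncorrected:.3f}  corrected={corrected:.3f}")
```

With massive datasets it is easy to test hundreds of relationships at once, so the uncorrected rate approaches 1 quickly, while the corrected rate stays at or below the chosen level.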
Population statistics
Figures: mobile phone frequent locations; mobile phone commute map
Csáji, Balázs Cs., et al. "Exploring the mobility of mobile phone users." Physica A: Statistical Mechanics and its Applications 392.6 (2013): 1459-1473.
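The idea of "frequent locations" derived from mobile phone data can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the method of the cited paper: each event is assumed to be a hypothetical `(user_id, cell_id)` pair, and a user's frequent locations are taken to be the cells where that user appears most often.

```python
from collections import Counter, defaultdict


def most_frequent_cells(events, top_n=2):
    """Return, for each user, the top_n cell ids ranked by number of events.

    `events` is an iterable of (user_id, cell_id) pairs; this record
    layout is an assumption made for the example.
    """
    counts = defaultdict(Counter)
    for user_id, cell_id in events:
        counts[user_id][cell_id] += 1
    return {
        user: [cell for cell, _ in cell_counts.most_common(top_n)]
        for user, cell_counts in counts.items()
    }


if __name__ == "__main__":
    # Made-up sample events, for illustration only.
    sample = [
        ("u1", "A"), ("u1", "A"), ("u1", "A"),
        ("u1", "B"), ("u1", "B"), ("u1", "C"),
        ("u2", "X"),
    ]
    print(most_frequent_cells(sample))
```

In practice the two most frequent cells are often read as candidate home and work locations, which is what makes a commute map possible; that reading depends on time-of-day information this sketch leaves out.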
Population Mapping Using Mobile Phone Data
Deville, Pierre, et al. "Dynamic population mapping using mobile phone data." Proceedings of the National Academy of Sciences 111.45 (2014): 15888-15893.
https://www.youtube.com/watch?v=qsUDH5dUnvY
An emergent market
An emergent market
•  Monetisation of data: Data is the new oil
•  Data as a new factor of production (competitive
differentiating factor for businesses)
•  A threat to official statistics? (e.g. Argentina)
•  Data ecosystem
•  The cases of Google and Facebook
What does big data mean for official statistics?
•  Change of paradigm
•  From: finite population sampling methodology
•  To: additional statistical modelling and machine learning
•  From designers of data collection processes to designers of statistical products
•  Privacy
•  Use of digital footprint
•  Data subjects' lack of control over their data
•  High data detail and insight from analytics
Multisource statistics and multipurpose sources
•  Multipurpose source: mobile phone data feeding tourism, population, migration, traffic and commuting statistics
•  Multisource statistics: population statistics drawing on mobile phone data, smart meters, VGI websites and satellite images
ESS big data action plan
•  Policy
•  Quality
•  Skills
•  Experience sharing
•  Legislation
•  IT infrastructures
•  Methods
•  Ethics / communication
•  Pilots
Challenges for data management
•  Size of datasets (storage and processing)
•  Lack of control over data sources
•  Data ownership / licensing
•  Volatility / sustainability of data sources
•  Data integration (variety of data sources)
•  Open data (do we need to store it?)
•  Data types (natural language, images, geo-location)
•  Level of detail of the data
Challenges for data management
•  Technological change (tools change frequently)
•  Privacy (anonymisation methods)
•  Data security (which data to share)
•  Data interface with production / research
•  Metadata (are current standards enough?)
•  Replicability / auditability
•  Data versioning
•  Data handling methodologies
•  Applications (e.g. network analysis)
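For the "Data versioning" and "Replicability / auditability" bullets, one possible approach is to record a content fingerprint of every dataset snapshot that feeds a statistical output. The sketch below is an assumption-laden illustration, not a practice described in the slides: it assumes records are JSON-serialisable, and `dataset_fingerprint` is a hypothetical helper name.

```python
import hashlib
import json


def dataset_fingerprint(records) -> str:
    """Return a SHA-256 hex digest of a list of JSON-serialisable records.

    Keys are sorted so that the same content always yields the same
    digest, regardless of the key order in each record.
    """
    canonical = json.dumps(
        records, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Made-up snapshot, for illustration only.
    snapshot = [{"cell": "A", "events": 3}, {"cell": "B", "events": 2}]
    print(dataset_fingerprint(snapshot))
```

Storing the digest alongside a published figure lets an auditor confirm later that a re-run used exactly the same input, which matters when the source is volatile or not stored in full.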
Thank you for your attention
Fernando Reis
Eurostat Task Force on Big Data
https://github.com/reisfe/
https://twitter.com/reisfe/
https://linkedin.com/in/reisfe/
fernando.reis@ec.europa.eu

SC6 Workshop 1: Big data (phenomenon) challenges and requirements in official statistics - BigDataEurope, SC6 Workshop
