CBS CEDAR Presentation

CEDAR
Ashkan Ashkpour
Albert Meroño Peñuela
From fragment to fabric - Dutch census
data in a web of global cultural and
historic information
http://cedar-project.nl/

Double purpose of Linked Census Data
 Improve information retrieval for the general public (incl.
lay experts, students, researchers)
 Create new data sources and possibly new research
practices in social history, historic demography, general
history, …
 Create immediate access to digitized Dutch census data
 Semantic modeling of Dutch census data
 Further enriching of statistical information with context

The Historical Census Use Case
2011-2015

End of the door-to-door
censuses

• Source of historical statistical data, providing a
rich source of social, economic and
demographic data
• A relatively untapped source of information:
most research focuses on a specific year or a
subset instead of time series
• “...specific information about a nation's population
characteristics and needs at a given time in
history, providing invaluable snapshots of the
state of a nation”

Digitization Efforts – 1996
Cooperation between CBS and NIWI

VT 1869 Plaatselijke indeling NB

The census dataset (1795-1971)
 3 Types of Censuses
[Population/Occupation/Housing]
 17 census years
 Only the aggregated form is left
 2288 census tables
 33283 annotations
 17 million characters

Structural Heterogeneity
How can we maintain the same structure
and information?
1-to-1 representation

Going to RDF
 Model: RDF Data Cube, for multi-dimensional
statistical data
 Supervised conversion
 Need to define the layout structures per table
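The idea of defining a layout per table and then converting cell by cell can be sketched as follows. This is a minimal illustration, not the CEDAR conversion code: the `LAYOUT` structure, the function name and the tiny table (with invented figures) are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical layout annotation: which row holds the column headers and
# which column holds the row headers. Names are illustrative only.
LAYOUT = {"header_row": 0, "header_col": 0}

# A tiny made-up table; the figures are invented for illustration.
TABLE: List[List[str]] = [
    ["Municipality", "Men", "Women"],
    ["A", "10", "12"],
    ["B", "7", "9"],
]


def table_to_observations(table, layout) -> List[Dict[str, object]]:
    """Emit one observation per data cell, keyed by its two headers."""
    hr, hc = layout["header_row"], layout["header_col"]
    column_headers = table[hr]
    observations = []
    for r, row in enumerate(table):
        if r == hr:
            continue  # skip the header row itself
        for c, cell in enumerate(row):
            if c == hc:
                continue  # skip the row-header column
            observations.append({
                "row_header": row[hc],
                "column_header": column_headers[c],
                "value": int(cell),
                "cell": (r, c),  # keeps the 1-to-1 link to the source cell
            })
    return observations


observations = table_to_observations(TABLE, LAYOUT)
for obs in observations:
    print(obs)
```

Keeping the source cell position on every observation is one way to preserve the 1-to-1 representation of the original table.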

Going to RDF
 Styling of 2,288 tables
 Training and conversion at DANS
 Thanks to Michael and Jetske

RDF Statistics
 310,585,567 total triples
 389,132 hierarchical row headers
 17,960,911 data cells
 61,110 column headers
 3,609 row properties
 3,150 titles
 1,581,546 row headers
 274,404 metadata cells
 See http://lod.cedar-project.nl/cedar/data.html

Everything in one system: What does this
mean?
 No separate files
 Insight into the number of variables
 Availability of variables (preliminary analyses)
 Straightforward harmonizations
 Systematic data check
 Visualizations
 Other debugging purposes

Examples on raw data
Number of teachers Number of married women

Three tier model
 Raw data is filled
 Annotations
 Harmonization layer

Enriching the Harmonization layer
 Cleaning and correcting
 Standardizing variables and values
 Mappings
 Connecting to existing classification systems:
 HISCO (historical occupations)
 Amsterdam Code (historical municipalities)
 SDMX (demographic variables)
 Creating variables and bottom-up classification systems
(religious denominations, housing types, occupations,
age ranges, ...)
 Key: bringing all these practices together
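The cleaning and mapping steps above can be sketched as a lookup from cleaned source labels to standard codes. This is only an assumed shape for such a mapping: the table `OCCUPATION_MAPPING` and its codes are placeholders (not real HISCO or Amsterdam Code values), and the function names are hypothetical.

```python
import unicodedata

# Hypothetical mapping from cleaned source labels to standard codes.
# The codes are placeholders, not real HISCO or Amsterdam Code values.
OCCUPATION_MAPPING = {
    "onderwijzer": "occ-001",
    "onderwijzeres": "occ-001",
    "bakker": "occ-002",
}


def clean(label: str) -> str:
    """Normalize a raw label: drop accents, case, edge punctuation, extra spaces."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().strip(" .,;:").split())


def harmonize(raw_labels, mapping):
    """Return a dict of mapped labels and a list of labels still to be mapped."""
    mapped, unmapped = {}, []
    for raw in raw_labels:
        code = mapping.get(clean(raw))
        if code is None:
            unmapped.append(raw)  # left for manual, expert-driven mapping
        else:
            mapped[raw] = code
    return mapped, unmapped


mapped, unmapped = harmonize(["Onderwijzer.", " Bakker", "Smid"], OCCUPATION_MAPPING)
print(mapped)
print(unmapped)
```

Returning the unmapped labels separately keeps the raw data untouched and makes the remaining harmonization work visible.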

CEDAR goal: cross-query the Dutch
historical censuses on the Web
(aka integrating ~3K disparate tables)
[Figure: census tables from 1795, 1830, 1889, 1930 and 1971]

• Web publishable
• Machine processable
• Dynamic schema
• Easily link with other
datasets

Why Semantic Web
Technology?
 To W3C
 Web publishable
 Web exchangeable
 Human & machine readable
 Provide interesting links
 To us
 Finer granularity level (cell level)
 Statistical comparability by leveraging semantic
descriptions
 Provenance
 Harmonization through linkage to other datasets

Towards 5-star Census Data
>2 years ago
2 years ago

“There are many situations where it
would be useful to be able to publish
multi-dimensional data, such as
statistics, on the web in such a way
that they can be linked to related data
sets and concepts.”

RDF Data Cube vocabulary
(QB)

RDF Data Cube vocabulary
(QB)
• SDMX compatible
• Defines cubes as a set of observations that consist
of dimensions, measures and attributes
• Dimensions: time period, region, sex
(qb:DimensionProperty)
• Measure: population life expectancy (qb:MeasureProperty)
• Attribute: unit of measure = years, metadata status =
measured (qb:AttributeProperty)
Observation: “the measured life expectancy of males in
Newport in the period 2004-2006 is 76.7 years”
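The split into dimensions, measures and attributes can be made concrete with a small sketch. It models the life expectancy observation quoted above as plain Python data; the class and function names are illustrative assumptions and not part of the RDF Data Cube vocabulary itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Observation:
    dimensions: Dict[str, str]    # identify the observation (what, where, when)
    measures: Dict[str, float]    # the observed values
    attributes: Dict[str, str] = field(default_factory=dict)  # how to read them


# The example observation from the slide.
obs = Observation(
    dimensions={"period": "2004-2006", "region": "Newport", "sex": "male"},
    measures={"lifeExpectancy": 76.7},
    attributes={"unitOfMeasure": "years", "status": "measured"},
)

cube: List[Observation] = [obs]


def slice_cube(cube: List[Observation], **fixed: str) -> List[Observation]:
    """Keep the observations whose dimensions match all the fixed values."""
    return [
        o for o in cube
        if all(o.dimensions.get(name) == value for name, value in fixed.items())
    ]


print(slice_cube(cube, region="Newport", sex="male"))
print(slice_cube(cube, sex="female"))
```

Fixing some dimension values and keeping the matching observations is the basic operation behind cross-querying a cube.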

CEDAR Integrator
https://github.com/CEDAR-project/Integrator
http://lod.cedar-project.nl/cedar/data.html

http://lod.cedar-project.nl/cedar/stats.html

http://lod.cedar-project.nl/maps/

Dimension Reusability
cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    cedar:population "12"^^xsd:integer ;
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    cedarterms:belief hreligion:118 ;
    cedarterms:houseType cedar:Klooster ;
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
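One rough way to look at reuse in the observation above is to group its predicates by namespace prefix. The sketch below does only that, on the prefixed names copied from the slide. Treating `cedar` and `cedarterms` as the project's own namespaces is an assumption based on their names, not something the slide states.

```python
from collections import Counter

# Predicates of the observation above, copied as prefixed names.
PREDICATES = [
    "cedar:population",
    "maritalstatus:maritalStatus",
    "cedarterms:occupationPosition",
    "sdmx-dimension:sex",
    "cedarterms:occupation",
    "sdmx-dimension:refArea",
    "cedarterms:belief",
    "cedarterms:houseType",
    "prov:wasDerivedFrom",
    "prov:wasGeneratedBy",
]

# Assumption: these prefixes are project-defined; all others are reused.
LOCAL_PREFIXES = {"cedar", "cedarterms"}


def prefix_of(name: str) -> str:
    """Return the namespace prefix of a prefixed name."""
    return name.split(":", 1)[0]


counts = Counter(prefix_of(p) for p in PREDICATES)
reused = [p for p in PREDICATES if prefix_of(p) not in LOCAL_PREFIXES]

print(counts)
print(reused)
```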

LSD Dimensions
http://lsd-dimensions.org/
https://github.com/albertmeronyo/LSD-Dimensions
Hourly JSON-LD dumps

What if dimensions aren’t out
there?
 Need to build them
 Input: flat lists of non-standard values
 Output: standard concept scheme
 Knowledge intensive problem
https://github.com/CEDAR-project/TabCluster
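Going from a flat list of non-standard values to candidate concepts can be sketched with simple string similarity. This is not the method used by TabCluster: it is a greedy grouping with Python's `difflib`, the threshold is an arbitrary assumption, and the labels are made-up examples rather than census data.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two labels, from 0 to 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_labels(labels: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Greedily group each label with the first cluster whose seed is similar."""
    clusters: List[List[str]] = []
    for label in labels:
        for cluster in clusters:
            if similarity(label, cluster[0]) >= threshold:
                cluster.append(label)
                break
        else:
            clusters.append([label])  # no similar cluster: start a new one
    return clusters


# Made-up example labels with spelling variants.
labels = [
    "Roomsch Katholiek",
    "Roomsch-Katholiek",
    "Nederduitsch Hervormd",
    "Nederduitsch Hervormden",
    "Israeliet",
]

clusters = cluster_labels(labels)
for cluster in clusters:
    print(cluster)
```

String similarity only catches spelling variants. Deciding that two differently named values denote the same concept still needs domain knowledge, which is why the problem is knowledge intensive.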

Concept Drift
Census classification of
occupations as of
1859
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations

Concept Drift
Census classification of occupations as of 1889

Concept Drift
Census classification of occupations as of 1899

Concept Drift
 RQ: Can we use past knowledge to predict
when and where concept drift will happen in an
ontology?
 Theoretical framework: [1]
 Data: a number of ontology versions
 Method: supervised learning [2]
 Features: structural, membership, usage [3]
 Results: f-measures of 0.84, 0.93, 0.79
 https://github.com/albertmeronyo/ConceptDrift
[1] Shenghui Wang, Stefan Schlobach, Michael Klein. “What is Concept Drift and How to Identify It?”. EKAW 2010.
[2] Pesquita C, Couto FM (2012) Predicting the Extension of Biomedical Ontologies. PLoS Comput Biol 8(9): e1002630.
[3] Ljiljana Stojanovic. “Methods and Tools for Ontology Evolution” (2004).
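For readers unfamiliar with the f-measure reported on the Concept Drift slide, the sketch below shows how such a score is computed from predicted and actual labels. It assumes the balanced F1 variant (the slide does not say which one was used), and the labels are toy values, not results from the study.

```python
from typing import Sequence


def f_measure(actual: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Balanced F1 score: harmonic mean of precision and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)        # drift predicted and real
    fp = sum(1 for a, p in pairs if not a and p)    # drift predicted, not real
    fn = sum(1 for a, p in pairs if a and not p)    # real drift that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels: True means "this concept drifts in the next version".
actual = [True, True, False, True, False]
predicted = [True, False, False, True, True]

score = f_measure(actual, predicted)
print(round(score, 3))
```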

Compatibility? Remixability?
Reusability?
Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic
Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on
Semantic Statistics (SemStats) ISWC 2014.

Summary
 RDF Data Cube: publishing and integrating multi-dimensional
data in the Semantic Web
 Dutch historical censuses (increasingly)
published and queryable online
 Discoverability, reusability and remixability of
dimensions are important
 Bottom-up concept scheme generation only
semi-automatable
 Concept drift (or concept stability) can be
predicted accurately if enough historical data is
available
 Semantic representations can provide insight into
statistical correlation

Thank you
CEDAR Integrator
https://github.com/CEDAR-project/Integrator
LSD Dimensions
http://lsd-dimensions.org/
TabCluster
https://github.com/CEDAR-project/TabCluster
Concept Drift
https://github.com/albertmeronyo/ConceptDrift
Semantic Correlation
http://csarven.ca/sense-of-lsd-analysis
http://www.cedar-project.nl/
