CEDAR 
Ashkan Ashkpour 
Albert Meroño Peñuela 
From fragment to fabric - Dutch census 
data in a web of global cultural and 
historic information 
http://cedar-project.nl/
Affiliations
Double purpose of Linked Census Data 
 Improve information retrieval for the general public (incl. 
lay experts, students, researchers) 
 Create new data sources and possibly new research 
practices in social history, historic demography, general 
history, … 
 Create immediate access to digitized Dutch census data 
 Semantic modeling of Dutch census data 
 Further enriching of statistical information with context
The Historical Census Use Case 
2011-2015
Historical Censuses
End of the door-to-door 
censuses
• Source of historical statistical data, providing a 
rich source of social, economic and 
demographic data 
• A relatively untapped source of information: most
research focuses on a specific year or a subset
instead of time series
• “...specific information about a nation's population
characteristics and needs at a given time in 
history, providing invaluable snapshots of the 
state of a nation”
Digitization Efforts – 1996 
Cooperation between CBS and NIWI
Conversion Process
VT 1869 Plaatselijke indeling NB
The census dataset (1795- 
1971) 
 3 Types of Censuses 
[Population/Occupation/Housing] 
 17 census years 
 Only the aggregated form is left
 2,288 census tables
 33,283 annotations
 17 million characters
Structural Heterogeneity 
How can we maintain the same structure
and information?
One-to-one representation
Using the Layout
Going to RDF 
 Model: RDF Data Cube, for multi-dimensional
statistical data
 Supervised conversion 
 Need to define the layout structures per table
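A minimal sketch of what layout-guided conversion could look like, assuming each table comes with a parallel grid that labels every cell. The label names, the convert_table function, and the sample table are all invented for illustration; they are not the CEDAR tooling or real census figures.

```python
from typing import Dict, List

# Hypothetical layout labels; the markup actually used in CEDAR may differ.
TITLE, COL_HEADER, ROW_HEADER, DATA = "title", "col_header", "row_header", "data"


def convert_table(cells: List[List[str]], layout: List[List[str]]) -> List[Dict[str, str]]:
    """Emit one record per data cell, attached to the headers found via the layout."""
    records = []
    for r, row in enumerate(cells):
        for c, value in enumerate(row):
            if layout[r][c] != DATA or value == "":
                continue
            # Row headers: cells in the same row that are marked as row headers.
            row_headers = [cells[r][k] for k in range(len(row)) if layout[r][k] == ROW_HEADER]
            # Column headers: cells in the same column that are marked as column headers.
            col_headers = [cells[k][c] for k in range(len(cells)) if layout[k][c] == COL_HEADER]
            records.append({
                "cell": f"R{r}C{c}",
                "row_header": " / ".join(row_headers),
                "col_header": " / ".join(col_headers),
                "value": value,
            })
    return records


# Invented sample table (not real census data).
cells = [
    ["Occupations 1889", "", ""],
    ["", "Men", "Women"],
    ["Teachers", "120", "45"],
    ["Bakers", "310", "12"],
]
layout = [
    [TITLE, TITLE, TITLE],
    [COL_HEADER, COL_HEADER, COL_HEADER],
    [ROW_HEADER, DATA, DATA],
    [ROW_HEADER, DATA, DATA],
]
records = convert_table(cells, layout)
```

Because the layout is supplied per table, the same converter can handle tables with different shapes, which is why the layout has to be defined by hand first.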
Going to RDF 
 Styling of 2,288 tables 
 Training and conversion at DANS 
 Thanks to Michael and Jetske
RDF Statistics 
 310,585,567 total triples 
 389,132 hierarchical row headers 
 17,960,911 data cells 
 61,110 column headers 
 3,609 row properties 
 3,150 titles 
 1,581,546 row headers 
 274,404 metadata cells 
 See http://lod.cedar-project.nl/cedar/data.html
Everything in one system: what does this
mean?
 No separate files
 Insight into the number of variables
 Availability of variables (preliminary analyses) 
 Straightforward harmonizations 
 Systematic data check 
 Visualizations 
 Other debugging purposes
Examples on raw data
Number of teachers Number of married women
Three tier model 
 Raw data is filled 
 Annotations 
 Harmonization layer
Enriching the Harmonization layer 
 Cleaning and correcting 
 Standardizing variables and values
 Mappings
 Connecting to existing classification systems:
 HISCO (historical occupations)
 Amsterdam Code (historical municipalities)
 SDMX (demographic variables)
 Creating variables and bottom-up classification systems
(religious denominations, housing types, occupations,
age ranges, ...)
 Key: bringing all these practices together
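The standardization and mapping steps above can be sketched as a lookup from raw table labels to standard codes. This is an illustrative assumption, not the CEDAR harmonization code: the SEX_MAPPING table, the normalize and harmonize helpers, and the sample labels are hypothetical, and only the sdmx-code values follow the SDMX sex code list.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from raw (Dutch) header labels to standard codes.
SEX_MAPPING: Dict[str, str] = {
    "mannen": "sdmx-code:sex-M",
    "m.": "sdmx-code:sex-M",
    "vrouwen": "sdmx-code:sex-F",
    "v.": "sdmx-code:sex-F",
}


def normalize(label: str) -> str:
    """Lowercase the label and collapse surrounding or repeated whitespace."""
    return " ".join(label.strip().lower().split())


def harmonize(labels: Iterable[str], mapping: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return the mapped labels plus the labels that still need manual review."""
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for label in labels:
        code = mapping.get(normalize(label))
        if code is None:
            unmapped.append(label)
        else:
            mapped[label] = code
    return mapped, unmapped


mapped, unmapped = harmonize(["Mannen", " VROUWEN ", "V.", "Totaal"], SEX_MAPPING)
```

Returning the unmapped labels separately keeps the manual, knowledge-intensive part of the work visible instead of silently dropping values.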
CEDAR goal: cross-query the Dutch
historical censuses on the Web
(i.e. integrating ~3K disparate tables)
Timeline: 1795, 1830, 1889, 1930, 1971
• Web publishable 
• Machine processable 
• Dynamic schema 
• Easily link with other 
datasets
Why Semantic Web 
Technology? 
 To W3C 
 Web publishable
 Web exchangeable
 Human & machine readable 
 Provide interesting links 
 To us 
 Finer granularity level (cell level) 
 Statistical comparability by leveraging semantic 
descriptions 
 Provenance 
 Harmonization through linkage to other datasets
Towards 5-star Census Data
>2 years ago 
2 years ago
“There are many situations where it 
would be useful to be able to publish 
multi-dimensional data, such as 
statistics, on the web in such a way 
that they can be linked to related data 
sets and concepts.”
RDF Data Cube vocabulary 
(QB)
• SDMX compatible 
• Defines cubes as a set of observations that consist 
of dimensions, measures and attributes 
• Dimensions: time period, region, sex 
(qb:DimensionProperty) 
• Measure: population life expectancy (qb:MeasureProperty) 
• Attribute: unit of measure = years, metadata status = 
measured (qb:AttributeProperty) 
Observation: “the measured life expectancy of males in 
Newport in the period 2004-2006 is 76.7 years”
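To make the split between dimensions, measures and attributes concrete, here is a small sketch that models the life-expectancy observation quoted above and prints it as Turtle-like text. The Observation class and every ex: name are assumptions made for illustration; prefix declarations are omitted, so the output is a fragment rather than a complete Turtle document.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Observation:
    """One cube observation: dimensions locate it, measures hold the value,
    attributes qualify how the value should be read."""
    uri: str
    dimensions: Dict[str, str]
    measures: Dict[str, str]
    attributes: Dict[str, str] = field(default_factory=dict)

    def to_turtle(self) -> str:
        lines = [f"{self.uri} a qb:Observation"]
        for group in (self.dimensions, self.measures, self.attributes):
            for prop, value in group.items():
                lines.append(f"    {prop} {value}")
        return " ;\n".join(lines) + " ."


# All ex: identifiers below are hypothetical placeholders.
obs = Observation(
    uri="ex:obs1",
    dimensions={
        "sdmx-dimension:refPeriod": "ex:period-2004-2006",
        "sdmx-dimension:refArea": "ex:Newport",
        "sdmx-dimension:sex": "sdmx-code:sex-M",
    },
    measures={"ex:lifeExpectancy": '"76.7"^^xsd:decimal'},
    attributes={
        "sdmx-attribute:unitMeasure": "ex:years",
        "ex:status": "ex:measured",
    },
)
turtle = obs.to_turtle()
print(turtle)
```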
CEDAR Integrator 
https://github.com/CEDAR-project/Integrator
http://lod.cedar-project.nl/cedar/data.html
http://lod.cedar-project.nl/cedar/stats.html
http://lod.cedar-project.nl/maps/
Dimension Reusability 
cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    cedar:population "12"^^xsd:integer ;
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    cedarterms:belief hreligion:118 ;
    cedarterms:houseType cedar:Klooster ;
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
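One rough way to see how much of this observation relies on reused vocabulary is to group its properties by prefix. The sketch below does that for the property names in the example; treating the cedar and cedarterms prefixes as project-specific and everything else as external is an assumption made here, not something the slides state.

```python
from collections import defaultdict
from typing import Dict, List

# Property names copied from the example observation.
PROPERTIES = [
    "cedar:population",
    "maritalstatus:maritalStatus",
    "cedarterms:occupationPosition",
    "sdmx-dimension:sex",
    "cedarterms:occupation",
    "sdmx-dimension:refArea",
    "cedarterms:belief",
    "cedarterms:houseType",
    "prov:wasDerivedFrom",
    "prov:wasGeneratedBy",
]

# Assumption: only these prefixes are defined by the project itself.
LOCAL_PREFIXES = {"cedar", "cedarterms"}


def split_by_origin(properties: List[str]) -> Dict[str, List[str]]:
    """Group prefixed names into 'local' and 'external' by their prefix."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in properties:
        prefix = name.split(":", 1)[0]
        origin = "local" if prefix in LOCAL_PREFIXES else "external"
        groups[origin].append(name)
    return dict(groups)


groups = split_by_origin(PROPERTIES)
```

Properties in the external group are candidates for direct comparison with other datasets that use the same vocabularies; the local ones need explicit mappings first.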
LSD Dimensions 
http://lsd-dimensions.org/ 
https://github.com/albertmeronyo/LSD-Dimensions 
Hourly JSON-LD dumps
What if dimensions aren’t out 
there? 
 Need to build them 
 Input: flat lists of non-standard values 
 Output: standard concept scheme 
 Knowledge intensive problem 
https://github.com/CEDAR-project/TabCluster
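As a rough illustration of going from a flat list of non-standard values to a concept scheme, the sketch below groups spelling variants by string similarity. It is not the TabCluster algorithm: the greedy clustering, the 0.8 threshold, the build_scheme name, and the sample values are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two labels, in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_scheme(values: List[str], threshold: float = 0.8) -> List[Dict[str, object]]:
    """Greedily group values whose similarity to a cluster's first member
    reaches the threshold, then emit one concept per cluster."""
    clusters: List[List[str]] = []
    for value in values:
        for members in clusters:
            if similarity(value, members[0]) >= threshold:
                members.append(value)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([value])
    return [{"prefLabel": m[0], "altLabels": m[1:]} for m in clusters]


# Invented sample values (Dutch occupation labels with spelling variants).
scheme = build_scheme(["Bakker", "bakker", "Bakkers", "Onderwijzer", "onderwijzers", "Smid"])
```

String similarity only catches spelling variants. Deciding that two differently named occupations belong to the same group still needs domain knowledge, which is why the problem is described as knowledge intensive.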
Concept Drift 
Census classification of
occupations as of
1859
• Root node is void 
• Depth 1: occupation groups 
• Leaves: actual occupations
Concept Drift 
Census classification of
occupations as of
1889
• Root node is void 
• Depth 1: occupation groups 
• Leaves: actual occupations
Concept Drift
Census classification of
occupations as of
1899
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift 
 RQ: Can we use past knowledge to predict
when and where concept drift will happen in an
ontology?
 Theoretical framework: [1] 
 Data: a number of ontology versions 
 Method: supervised learning [2] 
 Features: structural, membership, usage [3] 
 Results: f-measures of 0.84, 0.93, 0.79 
 https://github.com/albertmeronyo/ConceptDrift 
[1] Shenghui Wang, Stefan Schlobach, Michael Klein. “What is Concept Drift and How to Identify It?”. EKAW 2010.
[2] Pesquita C, Couto FM (2012). “Predicting the Extension of Biomedical Ontologies”. PLoS Comput Biol 8(9): e1002630.
[3] Ljiljana Stojanovic. “Methods and Tools for Ontology Evolution” (2004).
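The prediction setup on the Concept Drift slide can be sketched as follows: features come from how a concept behaved between earlier versions, the label is whether it changes in the next version, and predictions are scored with the F-measure. The three tiny versions are invented, and the "changed before, so will change again" rule is only a stand-in for a trained classifier, not the method used in the ConceptDrift repository.

```python
from typing import Dict, List, Set

Version = Dict[str, Set[str]]  # concept -> set of child concepts

# Invented toy versions of an occupation classification.
v1859: Version = {"Trades": {"Baker", "Smith"}, "Teaching": {"Teacher"}}
v1889: Version = {"Trades": {"Baker", "Smith", "Printer"}, "Teaching": {"Teacher"}}
v1899: Version = {"Trades": {"Baker", "Printer"}, "Teaching": {"Teacher", "Professor"}}


def changed(concept: str, old: Version, new: Version) -> bool:
    """A concept counts as drifted when its set of children differs."""
    return old.get(concept, set()) != new.get(concept, set())


def f_measure(predicted: List[bool], actual: List[bool]) -> float:
    """Harmonic mean of precision and recall for the 'drifted' class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


concepts = sorted(v1889)
# Feature (stand-in predictor): did the concept change in the previous step?
predicted = [changed(c, v1859, v1889) for c in concepts]
# Label: does the concept change in the next step?
actual = [changed(c, v1889, v1899) for c in concepts]
score = f_measure(predicted, actual)
```

With more versions available, each consecutive pair contributes further training examples, which is one way to read the remark that prediction depends on having enough historical data.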
Compatibility? Remixability? 
Reusability? 
Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic 
Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on 
Semantic Statistics (SemStats) ISWC 2014.
Summary 
 RDF Data Cube: publishing and integrating multi-dimensional 
data in the Semantic Web 
 Dutch historical censuses (increasingly) 
published and queryable online 
 Discoverability, reusability and remixability of
dimensions are important
 Bottom-up concept scheme generation only 
semi-automatable 
 Concept drift (or concept stability) can be 
predicted accurately if enough historical data is 
available 
 Semantic representations can provide insight into
statistical correlation
Thank you 
CEDAR Integrator 
https://github.com/CEDAR-project/Integrator 
LSD Dimensions 
http://lsd-dimensions.org/ 
TabCluster 
https://github.com/CEDAR-project/TabCluster 
Concept Drift 
https://github.com/albertmeronyo/ConceptDrift 
Semantic Correlation 
http://csarven.ca/sense-of-lsd-analysis 
http://www.cedar-project.nl/

CBS CEDAR Presentation


Editor's Notes

  • #28 Two problems: layout interpretation and semantic alignment. NEXT: WEB STANDARDS
  • #30 Talk about URIs as resource identifiers on the web and as cell identifiers for CEDAR (we can use URIs to point to every little thing). Provenance: we want to provide the source data as it was, and we want to explain to historians everything we did when transforming the data.
  • #32 We like 5 star datasets. Historians also like 5 star datasets. HOWEVER, they still want their non-standard formats for data diving. Data diving guides their research and suggests new research questions. NEXT: DATA MODEL
  • #36 This is super cool. NOW, how do we connect with the archive to produce it?.... NEXT: INTEGRATOR
  • #37 From the ARCHIVE to RDF Data Cube (Turtle). Yellow: interfaces. Red: human tasks.
  • #38 ITERATIVE PROCESS
  • #39 NEXT: NOT TRIVIAL THINGS THAT ENABLED RESEARCH
  • #44 NEXT: TIME
  • #45 CHANGE OVER TIME
  • #49 Archiving the serialization of such semantic-statistic relationships?