Digital Demography
Ingmar Weber
@ingmarweber
September 26, 2018
Keynote at Social Informatics 2018
Or How I Learned to Love Online Advertising
About ACM
• ACM, the Association for Computing Machinery (www.acm.org), is the
premier global community of computing professionals and students
with nearly 100,000 members in more than 170 countries interacting
with more than 2 million computing professionals worldwide.
• OUR MISSION: We help computing professionals to be their best and
most creative. We connect them to their peers, to the latest
developments, and inspire them to advance the profession and make a
positive impact on society.
• OUR VISION: We see a world where computing helps solve tomorrow’s
problems – where we use our knowledge and skills to advance the
computing profession and make a positive social impact throughout the
world.
The Distinguished Speakers Program
is made possible by
For additional information, please visit http://dsp.acm.org/
Amazing Collaborators! (Alphabetical)
• Francesco Billari, Antoine Dubois, Masoomali
Fatehkia, Harsh Gandhi, Kiran Garimella,
Krishna Gummadi, Karri Haranko, Ridhi
Kashyap, Yelena Mejova, Joao Palotti, Tejas
Rafaliya, Francesco Rampazzo, Vatsala Singh,
Bogdan State, Reham Tamime, Agnese Vitali,
Emilio Zagheni, …
What is Demography?
Demography is the statistical study of populations.
According to IP address 70.67.193.176, user Pbsouthwood and other
contributors to https://en.wikipedia.org/wiki/Demography
The Population Equation
Change in population = Inputs – Outputs
Inputs = Births + In-migration
Outputs = Deaths + Out-Migration
• ∆P = (B + I) − (D + O)
Fertility, Mortality and Migration
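The population equation above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and all counts are hypothetical and chosen only for illustration, not taken from any real region.

```python
def population_change(births, in_migration, deaths, out_migration):
    """Return delta P = (B + I) - (D + O) for one region and one period."""
    inputs = births + in_migration      # Inputs = Births + In-migration
    outputs = deaths + out_migration    # Outputs = Deaths + Out-migration
    return inputs - outputs


# Hypothetical yearly counts (illustrative only, not real data).
delta_p = population_change(
    births=1200, in_migration=300, deaths=900, out_migration=450
)
print(f"Change in population: {delta_p:+d}")
```

Any of the four terms that is poorly measured (for example, irregular migration) translates directly into uncertainty about the change in population.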
Quant: How much? Where? When?
• Births
- Birth registry: India: ~75%, Kenya: ~65%, Liberia: ~25% (2017)
• Deaths
- “Global Burden of Disease” (Murray and Lopez, 1997):
“Medically certified information is available for less than 30% of
the estimated 50.5 million deaths that occur each year
worldwide.”
• Migration
- “The size of the irregular migrant stock of the EU-27 in 2008
was measured to be between 1.9 and 3.8 million, a decline from
between 2.4 and 5.4 million in the EU-25 in 2005” (Kovacheva
and Vogel, 2009).
Qual: Why? How?
• Births
- Effect of religiosity, available childcare, …
• Deaths
- Ikigai: “reason to get up in the morning”
• Migration
- Push/pull factors, assimilation, …
Opportunities for New Methods
• Filling data gaps
– New data on migration, fertility, employment, …
• Explaining behavior
– Richer data, including networks and long-term history
• Predictive modeling
– Multi-modal forecasting
• Take a global perspective on things
– Facebook, Google, satellites know (almost) no borders
Goal is to augment, not replace, traditional approaches
Big Data is not a panacea
Rest of the Talk: Data-Centric
• Online advertising audience estimates
- Migration stocks, migrant assimilation
- Male mean-age-at-childbirth
- Ethics, limitations and challenges
• More non-obvious data sources
- Google Correlate, Followerwonk
- Even more non-obvious data sources
• Thoughts on interdisciplinary work
Facebook’s Audience Estimates
LinkedIn’s Audience Estimates
Female-to-male ratio LI users: 0.94
Female-to-male ratio LI users w/ AI: 0.27
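A ratio like this is simply one audience estimate divided by another. The sketch below assumes the two estimates have already been read off an advertising interface for the same targeting criteria; the function name and the counts are hypothetical placeholders, not values returned by LinkedIn or any other platform.

```python
def female_to_male_ratio(female_audience, male_audience):
    """Ratio of two audience estimates obtained with identical targeting except gender."""
    if male_audience <= 0:
        raise ValueError("male audience estimate must be positive")
    return female_audience / male_audience


# Hypothetical audience estimates (illustrative only).
estimates = {"female": 300_000, "male": 400_000}
ratio = female_to_male_ratio(estimates["female"], estimates["male"])
print(f"Female-to-male ratio: {ratio:.2f}")
```

A ratio below 1.0 means fewer women than men match the targeting criteria; comparing the ratio with and without an extra criterion (such as a skill) shows how that criterion shifts the gender balance.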
Twitter’s Audience Estimates
Snapchat’s Audience Estimates
Female-to-male ratio SC users: 1.25
Female-to-male ratio SC users w/ STEM: 0.45
Google’s Impression Estimates
What they are actively researching or planning:
Baby and children’s products
Low-Cost Urban Census
http://fb-doha.qcri.org/
MIGRATION MONITORING
Expats Across US States (figure; labels: 2014, 2017)
Expats Across Countries (figure with regression line; labels: 2015, 2017)
Age-Specific Selection Biases
Bias Reduction via Model-Fitting
Mean out-of-sample absolute percentage error 37%,
down from 56% without origin-age bias correction
Adjusted R^2 = .70
Does not use GDP, language, internet penetration, …
z = age-gender group
i = country of birth
j = US state of residence
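The error metric quoted above, mean absolute percentage error on held-out data, can be computed as in the following sketch. The function name and the numbers are hypothetical; the slide's actual model and data are not reproduced here.

```python
def mean_absolute_percentage_error(actual, predicted):
    """MAPE in percent; assumes every actual value is non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical held-out ground truth and model predictions (illustrative only).
held_out_truth = [100.0, 200.0, 400.0]
held_out_pred = [110.0, 180.0, 400.0]
mape = mean_absolute_percentage_error(held_out_truth, held_out_pred)
print(f"Out-of-sample MAPE: {mape:.1f}%")
```

Because each error is divided by the true value, small groups with large relative errors weigh as much as large groups, which is one reason bias correction per origin and age group can reduce this metric noticeably.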
QUANTIFYING MIGRANT ASSIMILATION
Do Refugees Share German Interests?
What interests to consider? Everybody likes “Music” and “Technology”.
How to interpret the score? High/low compared to European migrants?
Germans in DEU
FB Interests:
Football (90%)
Max Planck (70%)
Sauerkraut (40%)
…
Arabs in MENA
FB Interests:
Quran (80%)
Ibn Al-Haytham (60%)
Falafel (60%)
…
Arabs in DEU
FB Interests:
?
Obtaining an Assimilation Score
Migrant Group                  Assim. Score
Austrian migrants              .900
Spanish migrants               .864
French migrants                .803
Turkish-speaking migrants      .746
Arabic-speaking migrants       .643
A: Women, non-uni, 45-64       .461
A: Men, uni, 18-24             .677
• Experimental methodology: take with a ton, not just a grain of salt
• Needs to be validated externally
• Goals include finding “bridging” interests/patterns
REAL-TIME MIGRATION MONITOR
Venezuela, Colombia, Brazil
STUDYING MALE FERTILITY RATES
Parenthood on Facebook
Mean Age at Child-Bearing
• Goal: fill data gaps on “mean age at child-bearing”
Out-of-Sample Predictions
Male MAC predictions for countries w/o ground truth
LIMITATIONS AND CHALLENGES
Ethical Challenges
• Privacy
– Was possible to obtain PII until early 2018 [Venkatadri
et al., 2018]
– Audience estimates for “custom audiences” no longer
supported
– The k in k-anonymity has been increased
• Vulnerable populations
– Was possible to exclude minorities from ads
– Was possible to target based on likely diseases
– Still targetable through proxy interests
We only use aggregate, anonymous data without
interacting with any user
Limitations: Selection Bias
Aren’t you just studying FB/LI/… vs. the “real
world”?
• If we understand the selection bias, we can
model it and de-bias the estimates
– Non-response biases in surveys
– Usual signal in a prediction model
– Non-random fake/duplicate accounts could
become problematic depending on domain
• Even if “only” LI, still real world implications
– LI used for hiring and to find keynote speakers
Limitations: Black Box
Who knows how FB’s classifier labels “expats” or
SC’s classifier labels “math enthusiasts”?
• Use as signal, not as ground truth
– Empirically, highly predictive of “proper” definition
– Unified definition can be a plus
• Incentives are in the right place
– Companies try to provide value to advertisers and,
hence, are incentivized to have correct labels
• Inconsistencies over time are problematic
– Especially for “interests”, which show large temporal fluctuations
Limitations: No Longitudinal Data
None of the services provide information on
running a hypothetical ad campaign in the past
• No historical data sets of audience estimates
exist
– Hard to do causal inference (natural experiments)
• Similar to Twitter streaming API
– The best time to start collecting data is 20 years
ago. The second best time is today.
Limitations: What about Myspace?
Services come and go and FB et al. might
become obsolete
• Only useful for understanding and modeling
processes with current relevance
Usage patterns change over time
• FB of 2008 unlike FB of 2018.
• Users might become more privacy concerned.
• Re-validate and re-train your model over time.
MORE NON-OBVIOUS DATA SOURCES
Google Correlate and Fertility
Discover search terms correlated with different fertility rates across US
states
https://www.google.com/trends/correlate/search?e=id:f7PU4mFDWV-&t=all
Remove terms with no conceivable link to sex, pregnancy or maternity
Predicting Spatial Variability
• Performance of the regression models using
leave-one-out cross-validation. SMAPE is in [%], RMSE
values are multiplied by 1,000.
Use the previous terms to build
models predicting state-level fertility
rates
All these models make predictions
based on linear combinations of
search intensity
Goal: apply these spatial models
across time
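The evaluation described above (leave-one-out cross-validation of a linear model, scored with SMAPE and RMSE) can be sketched as follows. This is a simplified illustration with a single search-intensity feature; the data are hypothetical, and the SMAPE formula shown is one common definition, assumed here because the slide does not spell one out.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_predictions(xs, ys):
    """Predict each point from a model fitted on all the other points."""
    predictions = []
    for k in range(len(xs)):
        train_x = xs[:k] + xs[k + 1:]
        train_y = ys[:k] + ys[k + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[k] + intercept)
    return predictions


def smape(actual, predicted):
    """Symmetric MAPE in percent: |p - a| / ((|a| + |p|) / 2), averaged."""
    terms = [abs(p - a) / ((abs(a) + abs(p)) / 2) for a, p in zip(actual, predicted)]
    return 100.0 * sum(terms) / len(terms)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical state-level data: search intensity vs. fertility rate.
search_intensity = [0.8, 1.1, 1.4, 1.9, 2.3]
fertility_rate = [0.058, 0.061, 0.064, 0.070, 0.072]

loo_pred = leave_one_out_predictions(search_intensity, fertility_rate)
print(f"SMAPE: {smape(fertility_rate, loo_pred):.2f}%")
print(f"RMSE x 1,000: {1000 * rmse(fertility_rate, loo_pred):.2f}")
```

With one held-out state per fold, every prediction is out of sample, so the scores reflect how well the search terms generalize to an unseen state rather than how well they fit the training data.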
Learning Across Space, Predicting Across Time
• Temporal trend when applying the “teen” model
across time. Values are rescaled to a maximum of 1.0.
Pearson r correlation across 2010-2015 when using
the spatial model to predict trends across time.
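The two operations named in these captions, rescaling a series to a maximum of 1.0 and computing Pearson's r between predicted and observed trends, can be sketched as below. The yearly values are hypothetical and only stand in for the 2010-2015 series shown on the slide.

```python
import math


def rescale_to_max_one(values):
    """Divide every value by the series maximum so the peak equals 1.0."""
    peak = max(values)
    return [v / peak for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical yearly values for 2010-2015 (illustrative only).
predicted_trend = rescale_to_max_one([34.0, 31.0, 29.5, 26.0, 24.0, 22.0])
observed_trend = rescale_to_max_one([34.2, 31.3, 29.4, 26.5, 24.2, 22.3])

print(f"Pearson r: {pearson_r(predicted_trend, observed_trend):.3f}")
```

Pearson's r is unaffected by rescaling with a positive constant, so the rescaling only matters for plotting the two trends on a shared axis.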
Followerwonk and Gender Roles
                        (mother|mom) of …   (father|dad) of …   Total
… (girls|daughters)     1,257               303                 1,560
… (sons|boys)           941                 545                 1,486
Total                   2,198               848
Location: (us|usa|united states)
https://followerwonk.com/bio/?q=(father|dad)%20of%20(sons|boys)&l=(us|usa|united%20states)
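The counts in the table above can be summarized with marginal totals, per-parent shares and an odds ratio. The sketch below uses only the four cell counts from the slide; the variable names and the choice of the odds ratio as a summary are assumptions made for illustration, not an analysis presented in the talk.

```python
# Cell counts from the slide: Twitter bios matching "<parent> of <children>".
counts = {
    ("mother", "daughters"): 1257,
    ("father", "daughters"): 303,
    ("mother", "sons"): 941,
    ("father", "sons"): 545,
}

# Marginal totals per child category (rows) and per parent (columns).
row_totals = {
    child: counts[("mother", child)] + counts[("father", child)]
    for child in ("daughters", "sons")
}
col_totals = {
    parent: counts[(parent, "daughters")] + counts[(parent, "sons")]
    for parent in ("mother", "father")
}

# Share of each parent group whose bio mentions sons.
share_sons = {
    parent: counts[(parent, "sons")] / col_totals[parent]
    for parent in ("mother", "father")
}

# Odds ratio: how much more the mother/daughters and father/sons cells
# dominate than would be expected if parent and child gender were unrelated.
odds_ratio = (
    counts[("mother", "daughters")] * counts[("father", "sons")]
) / (counts[("father", "daughters")] * counts[("mother", "sons")])

print(row_totals, col_totals)
print({parent: round(share, 3) for parent, share in share_sons.items()})
print(f"Odds ratio: {odds_ratio:.2f}")
```

An odds ratio above 1 indicates that, in these bios, fathers mention sons relatively more often than mothers do.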
More Creative Data Sources
Online genealogy
- see how marriage mobility has changed
Online obituaries
- monitor patients discharged from hospital
Google Street View
- parked cars tell income and political orientation
https://sites.google.com/site/digitaldemography/
Creative Offline Data
Closing Thoughts
• Receptive collaborators (re selection bias)
• Publication venue (re career considerations)
• Cross fertilization (re age-period-cohort)
• Data for Development (re practical use)
Again, amazing collaborators!
• Francesco Billari, Antoine Dubois, Masoomali
Fatehkia, Harsh Gandhi, Kiran Garimella,
Krishna Gummadi, Karri Haranko, Ridhi
Kashyap, Yelena Mejova, Joao Palotti, Tejas
Rafaliya, Francesco Rampazzo, Vatsala Singh,
Bogdan State, Reham Tamime, Agnese Vitali,
Emilio Zagheni, …
Thanks!
Interested in a collaboration? Get in touch:
iweber@hbku.edu.qa
References and full texts: https://ingmarweber.de/publications/
