Why Is It Difficult to Detect Outbreaks
in Twitter?
Avaré Stewart, Nattiya Kanhabua, Sara Romano
Ernesto Diaz-Aviles, Wolf Siberski, and Wolfgang Nejdl
L3S Research Center / Leibniz Universität Hannover, Germany
SIGIR 2013 Workshop on Health Search and Discovery
1 August 2013, Dublin, Ireland
Motivation
• Numerous studies use Twitter to infer the existence
and magnitude of real-world events in real time
– Earthquake [Sakaki et al., 2010]
– Predicting financial time series [Ruiz et al., 2012]
– Influenza epidemics [Culotta, 2010; Lampos et al.,
2011; Paul et al., 2011]
Early Warnings
Health-related tweets
• User status updates or news related to
public health are common on Twitter
– I have the mumps...am I alone?
– my baby girl has a Gastroenteritis so great!! Please
do not give it to meee
– #Cholera breaks out in #Dadaab refugee camp in
#Kenya http://t.co/....
– As many as 16 people have been found infected with
Anthrax in Shahjadpur upazila of the Sirajganj district
in Bangladesh.
Matching Tweets
[Kanhabua et al., CIKM’12]
Twitter vs. Official Source
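The notes for this comparison report a Pearson correlation coefficient of 0.864 between the number of EHEC cases reported to RKI and the number of tweets mentioning the disease. The following is a minimal sketch of how such a coefficient can be computed for two aligned daily series. The counts are invented for illustration and are not the RKI or Twitter data, and the function name `pearson` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Invented daily counts (NOT the data shown in the talk).
official_cases = [2, 3, 5, 9, 14, 20, 17, 11]
tweet_counts = [10, 18, 30, 55, 90, 120, 95, 60]
r = pearson(official_cases, tweet_counts)
print(f"Pearson r = {r:.3f}")
```

A high coefficient only says that the two curves move together; it does not say which one leads, so the timing of the first signal has to be checked separately.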
M-Eco System
Medical Ecosystem: Personalized Event-based Surveillance
http://www.meco-project.eu/
Data Collection
• Official outbreak reports
– ~3,000 ProMED-mail reports from 2011
– WHO reports have very small coverage
• Twitter data
– ~1,200 health-related terms (i.e., infectious
diseases, their synonyms, pathogens and symptoms)
– Over 112 million tweets from 2011
• Series of NLP tools including
– OpenNLP (tokenization, sentence splitting, POS
tagging)
– OpenCalais (named entity recognition)
– HeidelTime (temporal expression extraction)
Ground Truths
[Kanhabua et al., TAIA’ 12]
Event Extraction
• An event is a sentence containing two entities
– (1) medical condition and (2) geographic expression
– A minimum requirement by domain experts
• A victim and the time of an event can be identified
from the sentence itself, or its surrounding context
• Output: a set of event candidates
Example: reported by the World Health Organization (WHO) on
29 July 2012, an ongoing Ebola outbreak
in Uganda since the beginning of July 2012
[Kanhabua et al., TAIA’ 12]
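The event definition above (a sentence with at least one medical condition and one geographic expression) can be sketched as follows. This is an illustration under strong assumptions: the two small term sets stand in for the ~1,200-term list and the named entity recognizer, the regular-expression sentence splitter stands in for OpenNLP, and all function names are hypothetical.

```python
import re

# Tiny illustrative gazetteers; the real term list and entity recognizer
# used in the talk are not reproduced here.
MEDICAL_CONDITIONS = {"ebola", "cholera", "anthrax", "mumps"}
LOCATIONS = {"uganda", "kenya", "bangladesh", "dadaab"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_event_candidates(text):
    """Return (sentence, conditions, locations) for sentences with both entity types."""
    candidates = []
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        conditions = sorted(tokens & MEDICAL_CONDITIONS)
        locations = sorted(tokens & LOCATIONS)
        if conditions and locations:  # minimum requirement: both must be present
            candidates.append((sentence, conditions, locations))
    return candidates

text = (
    "WHO reported an ongoing Ebola outbreak in Uganda. "
    "I have the mumps...am I alone? "
    "#Cholera breaks out in #Dadaab refugee camp in #Kenya."
)
events = extract_event_candidates(text)
for sentence, conditions, locations in events:
    print(conditions, locations, "<-", sentence)
```

The mumps status update is dropped because it names no location, which shows how the two-entity requirement trades recall for precision.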
Message Filtering: Challenges
• Ambiguity
– having several meanings
– used in different contexts
• Incompleteness
– missing or under-reported events
– data processing errors
Category     Example tweet
Literature   A two hour train journey, Love In the Time of Cholera ...
Music        Dengue Fever’s “Uku,” Mixed by Paul Dreux Smith Universal Audio...
Marketing    Exclusive distributor of high quality #HIV/AIDS Blood & Urine and #Hepatitis #Self -testers.
General      Identification of genotype 4 Hepatitis E virus binding proteins on swine liver cells: Hepatitis E virus...
Negative     i dont have sniffles and no real coughing..well its coughing but not like an influenza cough.
Joke         Thought I had Bieber Fever. Ends up I just had a combo of the mumps, mono, measles & the hershey squ...
Challenge I. Noisy/evolving
• Evolving data
– Relevant features change over time
Approach for Noisy Data
• MedISys¹
– provides a list of negative keywords
created by medical experts
• Urban Dictionary²
– a Web-based dictionary of slang, ethnic
culture words or phrases
¹ http://medusa.jrc.it/medisys/homeedition/en/home.html
² http://www.urbandictionary.com/
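One simple way to use negative keywords is to reject any tweet that contains one of them. The sketch below assumes this plain rejection rule, which the slides do not spell out. The phrases are invented examples inspired by the ambiguity table, not the MedISys list or Urban Dictionary entries, and the function names are hypothetical.

```python
# Hypothetical negative phrases; the actual expert-created keyword list
# is not reproduced here.
NEGATIVE_PHRASES = (
    "bieber fever",
    "love in the time of cholera",
    "cabin fever",
)

def is_noise(tweet, negative_phrases=NEGATIVE_PHRASES):
    """True if the tweet contains any negative phrase (case-insensitive)."""
    text = " ".join(tweet.lower().split())  # normalize case and whitespace
    return any(phrase in text for phrase in negative_phrases)

def filter_tweets(tweets):
    """Keep only tweets that match no negative phrase."""
    return [t for t in tweets if not is_noise(t)]

tweets = [
    "A two hour train journey, Love In the Time of Cholera ...",
    "Thought I had Bieber Fever. Ends up I just had a combo of the mumps",
    "#Cholera breaks out in #Dadaab refugee camp in #Kenya",
]
kept = filter_tweets(tweets)
print(kept)
```

A static list like this cannot follow new slang or new book and band names, which is one reason the evolving-data challenge needs a separate approach.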
[Kanhabua and Nejdl, WOW’ 13]
Approach for Feature Changes
Signal Generation: Challenges
• Temporal Dynamics
– seasonal infectious diseases
– rare and spontaneous outbreaks
• Location Dynamics
– frequency and duration
– levels of prevalence or severity
[Rortais et al., 2010 in Journal of Food Research International]
[Emch et al., 2008 in International Journal of Health Geographics]
Outbreak Categorization
How to generate a reliable
signal for low aggregate counts?
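The notes list C1, C2, C3, F-Statistic, EWMA and Farrington as the anomaly detectors applied to tweet counts. As an illustration only, here is a C1-style detector as it is commonly described: each day's count is compared with the mean and standard deviation of the preceding 7 days, and a signal is raised above 3 standard deviations. The window, the threshold and the floor on the standard deviation are assumptions, not the configuration used in M-Eco.

```python
import statistics

def c1_signals(counts, baseline=7, threshold=3.0, min_std=0.5):
    """Return indices whose count exceeds baseline mean + threshold * std.

    The baseline is the `baseline` observations immediately before each day.
    `min_std` is an illustrative floor (assumption) so that flat baselines,
    which are common for rare diseases, do not cause a division by zero.
    """
    signals = []
    for t in range(baseline, len(counts)):
        window = counts[t - baseline:t]
        mean = statistics.mean(window)
        std = max(statistics.stdev(window), min_std)  # sample std with floor
        score = (counts[t] - mean) / std
        if score > threshold:
            signals.append(t)
    return signals

# Invented daily tweet counts with a burst on days 8 and 9.
daily_tweets = [4, 5, 3, 6, 4, 5, 4, 5, 30, 60, 6]
print(c1_signals(daily_tweets))

# Low aggregate counts: three tweets after a week of silence already alarm.
print(c1_signals([0, 0, 0, 0, 0, 0, 0, 3]))
```

The second call hints at the question on this slide: with near-zero baselines, a handful of tweets is enough to trigger a signal, so the result depends heavily on the chosen floor.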
Approach
[Kanhabua and Nejdl, WOW’ 13]
Temporal Diversity
• Refined Jaccard Index (RDJ-index)
– average Jaccard similarity of all object pairs:
$$\mathrm{RDJ} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} JS(O_i, O_j)$$
– Jaccard similarity:
$$JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$$
• Note: lower RDJ corresponds to higher diversity
• Problem: “All-Pair comparison”
• Solution: Estimation algorithms with probabilistic
error bound guarantees [Deng et al., CIKM’12]
(1) Top-k terms
(2) Entities
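To make the RDJ-index concrete, the sketch below computes it exactly over all pairs and also estimates it from randomly sampled pairs. The sampling function is a plain Monte Carlo estimate written for illustration; it is not SampleDJ or TrackDJ and carries none of their error bound guarantees. Each object is assumed to be a set of terms (or entities), and all names and data are hypothetical.

```python
import itertools
import random

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; 1.0 for two empty sets (convention)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)

def rdj_exact(objects):
    """RDJ = 2 / (n (n - 1)) * sum of JS(O_i, O_j) over all pairs i < j."""
    n = len(objects)
    if n < 2:
        raise ValueError("need at least two objects")
    total = sum(jaccard(a, b) for a, b in itertools.combinations(objects, 2))
    return 2.0 * total / (n * (n - 1))

def rdj_sampled(objects, num_pairs, seed=0):
    """Estimate RDJ as the mean similarity of uniformly sampled distinct pairs."""
    n = len(objects)
    if n < 2:
        raise ValueError("need at least two objects")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        i, j = rng.sample(range(n), 2)  # two distinct indices
        total += jaccard(objects[i], objects[j])
    return total / num_pairs

# Invented term sets for three tweets.
tweets = [
    {"cholera", "outbreak", "kenya"},
    {"cholera", "outbreak", "dadaab"},
    {"cholera", "book", "train"},
]
print(rdj_exact(tweets))
print(rdj_sampled(tweets, num_pairs=2000))
```

The exact version needs n(n-1)/2 comparisons, which is the all-pair problem named above; the sampled version fixes the number of comparisons regardless of n.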
Threat Assessment: Challenge
• Domain experts are overwhelmed by the large number of tweets
Approach
• Personalized Tweet Ranking for Epidemic
Intelligence
– Learning to rank and recommender systems
– User's context as implicit criteria for
recommendation
[Diaz-Aviles et al., WWW’ 12,
Diaz-Aviles et al., ICWSM’ 12]
Signal Search Prototype
Future Work
• Real-Time Analysis of Big and Fast
Social Web Streams
– Scalable, efficient methods for filtering and
generating signals in real-time
– Effective methods for aggregating and
visualizing information in a meaningful way
Thank you!
kanhabua@L3S.de
References
• [Culotta, 2010] A. Culotta. Towards detecting influenza epidemics by analyzing twitter messages. In
Proceedings of the First Workshop on Social Media Analytics (SOMA’2010), 2010.
• [Diaz-Aviles et al., 2012a] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Towards
personalized learning to rank for epidemic intelligence based on social media streams. In Proceedings of
the 21st World Wide Web Conference (WWW’2012), 2012.
• [Diaz-Aviles et al., 2012b] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Epidemic
intelligence for the crowd, by the crowd. In Proceedings of International AAAI Conference on Weblogs
and Social Media (ICWSM’2012), 2012.
• [Kanhabua et al., 2012a] N. Kanhabua, S. Romano, and A. Stewart. Identifying Relevant Temporal
Expressions for Real-world Events. In SIGIR 2012 Workshop on Time-aware Information Access
(TAIA’2012), 2012.
• [Kanhabua et al., 2012b] N. Kanhabua, S. Romano, A. Stewart, and W. Nejdl. Supporting
Temporal Analytics for Health Related Events in Microblogs. In Proceedings of CIKM’2012, 2012.
• [Kanhabua and Nejdl, 2013] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the
Time of Outbreaks. In Proceedings of the First International Web Observatory Workshop (WOW’2013) at
WWW’2013, 2013.
• [Lampos et al., 2011] V. Lampos and N. Cristianini. Nowcasting events from the social web with
statistical learning. ACM TIST, 3, 2011.
• [Paul et al., 2011] M. J. Paul and M. Dredze. You are what you tweet: Analyzing twitter for public health.
In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM’2011), 2011.
• [Ruiz et al., 2012] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes. Correlating financial time
series with micro-blogging activity. In Proceedings of WSDM’2012, 2012.
• [Sakaki et al., 2010] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time
event detection by social sensors. In Proceedings of WWW’2010, 2010.

Editor's Notes

  • #4 To exploit this timeliness potential, we present an event-based Epidemic Intelligence (EI) system. EI has emerged as a type of intelligence gathering that aims to detect events of interest to public health from unstructured text on the Web. In the medical domain, there has been a surge of work on detecting health-related tweets for early warning, which allows a rapid response from authorities [Diaz-Aviles et al., 2012].
  • #8 Note that there are existing EI systems, such as the BioCaster Global Health Monitor or HealthMap. However, they differ from our proposed system in the level of analysis, information sources, language coverage and visualization. Figure: frequencies of cases reported to RKI and number of tweets mentioning the name of the disease, EHEC. Pearson correlation coefficient = 0.864. Monitoring Twitter allowed M-Eco to generate the first signals on Friday, May 20th, 2011.
  • #9 We study and propose solutions to three main research challenges in gathering epidemic intelligence from social media streams: 1) dynamic classification to enable message filtering; 2) producing reliable warning signals (temporal anomalies) based on observed term frequency changes in these messages, using biosurveillance algorithms; and 3) providing suitable information and recommendations to domain experts, for better assessment of the potential outbreak threats associated with the generated signals. Part I. Ground truth creation: official outbreak reports from the World Health Organization and ProMED-mail. Part II. Creating Twitter time series: medical condition (disease name, synonyms, pathogens, symptoms) and location (geographic expressions, geo-location, or user profile) at 3 levels: country, continent, latitude.
  • #10 M-Eco strives to detect a large variety of infectious diseases, so for an initial filtering step we make use of a list of 1,258 terms consisting of infectious diseases, their synonyms, pathogens and symptoms, which are provided by the domain experts in two languages, namely English and German. All documents and tweets are annotated with locations, medical conditions and temporal expressions using a series of language processing tools, including OpenNLP for tokenization, sentence splitting and part-of-speech tagging, and HeidelTime [34] for temporal expression extraction.
  • #15, #16 Hash-tags co-occurring with #EHEC between May 23 and June 19, 2011, the main period of the outbreak. The hash-tags are classified as entities of type Medical Condition, Location, or Complementary Context; hash-tags outside these categories are discarded.
  • #21 Our approach builds upon [18] and extends it by: 1) incorporating the use of an orthogonal vector, which is learned by a Support Vector Machine (SVM), as a description of the feature change; and 2) computing a novelty score that lets the system identify those tweets that contribute to the feature change, so that their true labels can be obtained.
  • #26, #27 In order to detect outbreak events for early warning, we exploit different state-of-the-art biosurveillance algorithms as anomaly detectors in disease-related Twitter messages: C1, C2, C3, F-Statistic (FS), Exponentially Weighted Moving Average (EWMA) and Farrington (FA) [basseville1993detection, farrington_1996]. Traditional biosurveillance systems usually exploit information from official sources, e.g., laboratory results, mortality rates, or the number of reported patients suffering from a disease outbreak. In recent years, researchers in the medical domain have begun to leverage real-time social Web data, such as tweets.
  • #28 Identified topics show similar trends during the known time periods of real-world outbreaks. Diversity reflects how differently the language (i.e., terms and locations) is used. Div(entity) correlates highly with topic dynamics for some diseases, i.e., mumps, ebola, botulism and ehec. Div(term) shows correlation with topic dynamics for cholera, anthrax and rubella.
  • #29, #30 Algorithms: SampleDJ, TrackDJ (claims and proofs in [Deng et al., 2012])
  • #32, #33 (1) Rely upon abundant user interactions and/or the availability of explicit feedback (e.g., ratings, likes, dislikes). (2) Within M-Eco, we use the tweets from signals in developing techniques to provide a personalized short list of tweets that meets the context of the investigation. In this section, we review one of them, namely Personalized Tweet Ranking for Epidemic Intelligence (PTR4EI) [13, 14], and discuss the evaluation conducted during a major EHEC outbreak in Germany. In the user's context, Cu, t is a discrete time interval, MCu the set of medical conditions, and Lu the set of locations of user interest. More precisely, we expand the user's context, Cu, using latent topics computed with LDA [5] on: 1) an indexed collection of tweets for epidemic intelligence; and 2) the hash-tags that co-occur with this context.