Lidia Pivovarova
Classification and clustering
in media monitoring:
from knowledge engineering to deep learning
PhD Thesis
Supervised by Dr Roman Yangarber
December 21st 2018
2
Outline
● PULS media monitoring system
● News grouping
● Multi-label text classification
● Business polarity detection
● Conclusion
4
PULS media monitoring system
● puls.cs.helsinki.fi
● Applied to several domains, e.g., Cross-Border
Security and Epidemic Surveillance
● Thesis focuses on the most recent work on
Business Intelligence
5
PULS business monitoring
● Collects ~10,000 articles daily from more than
1,000 news sources
● Pipeline of natural language processing
modules:
– Named entity recognition (NER)
– Multi-label topic and industry classifiers
– Polarity detection
– Grouping of articles into stories
6
7
8
Named Entity Recognition
Grouping
Polarity Detection
Multi-label Text Classification
13
Outline
● PULS media monitoring system
● News grouping
● Multi-label text classification
● Business polarity detection
● Conclusion
14
15
16
17
Grouping into stories
● Different from topical text clustering:
– fine-grained
– named entities are crucial
– group size distribution is skewed
● The dataset: one manually annotated "typical"
day, ~4000 documents
18
19
NE salience
● General nature of news articles:
– most salient NEs are mentioned early in the text
and then repeated
– less salient NEs are mentioned in the later
paragraphs and are less frequent
● Hierarchical clustering: word-based and name-
based distances above a certain threshold
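The two ideas on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis algorithm: the salience formula (each mention counts more the earlier it occurs), the weighted-Jaccard distances, the equal weighting of name and word distance, average linkage, and the threshold value are all hypothetical choices, and the company names are invented. Merging stops once the closest pair of groups is farther apart than the threshold.

```python
from itertools import combinations

def ne_salience(mentions, n_tokens):
    """mentions: list of (entity, token_position). Returns {entity: salience}.
    Assumed formula: early and repeated mentions both raise the score."""
    scores = {}
    for name, pos in mentions:
        scores[name] = scores.get(name, 0.0) + (1.0 - pos / n_tokens)
    total = sum(scores.values()) or 1.0
    return {name: s / total for name, s in scores.items()}

def weighted_jaccard_distance(a, b):
    """1 - sum(min) / sum(max) over the keys of two weight dicts."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return 1.0 - num / den if den else 1.0

def doc_distance(d1, d2, name_weight=0.5):
    """Mix of name-based and word-based distance (the weighting is an assumption)."""
    name_d = weighted_jaccard_distance(d1["names"], d2["names"])
    word_d = weighted_jaccard_distance(d1["words"], d2["words"])
    return name_weight * name_d + (1.0 - name_weight) * word_d

def cluster(docs, threshold=0.6):
    """Average-link agglomerative clustering with a distance cut-off."""
    groups = [[i] for i in range(len(docs))]

    def link(g1, g2):
        pairs = [(i, j) for i in g1 for j in g2]
        return sum(doc_distance(docs[i], docs[j]) for i, j in pairs) / len(pairs)

    while len(groups) > 1:
        a, b = min(combinations(range(len(groups)), 2),
                   key=lambda p: link(groups[p[0]], groups[p[1]]))
        if link(groups[a], groups[b]) > threshold:
            break  # closest groups are too far apart: stop merging
        groups[a] = groups[a] + groups[b]
        del groups[b]  # safe because a < b
    return groups

# Toy usage with invented company names and token positions.
docs = [
    {"names": ne_salience([("AcmeTel", 0), ("AcmeTel", 10), ("NordNet", 40)], 50),
     "words": {"merger": 1, "telecom": 1}},
    {"names": ne_salience([("AcmeTel", 2), ("AcmeTel", 20)], 50),
     "words": {"merger": 1, "deal": 1}},
    {"names": ne_salience([("VoltCar", 1)], 50),
     "words": {"car": 1, "factory": 1}},
]
groups = cluster(docs)
print(groups)
```

In this toy input the first two documents share their most salient entity and end up in one story, while the third stays on its own.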
20
Outline
● PULS media monitoring system
● News grouping
● Multi-label text classification
● Business polarity detection
● Conclusion
21
Topic label
Top-level industry sector
2nd-level industry sector
24
Challenges
● Complex label hierarchy
● Significant class imbalance
● Data drawn from multiple sources over
significant time periods
25
Ensemble of SVM classifiers
● Single binary classifier trained for each label
● Key idea:
– training set balancing for stable performance
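One way to read "training set balancing" is sketched below, as an assumption rather than the method used in the thesis: for every label, build a separate binary training set and undersample the negative documents so they do not swamp the positives. The function name, the `ratio` parameter, and the toy corpus are hypothetical; the slide does not say how balancing was done.

```python
import random

def balanced_binary_sets(docs, labels, ratio=1.0, seed=0):
    """docs: list of (doc_id, set_of_labels).
    Returns {label: (positive_ids, negative_ids)} with at most
    ratio * len(positive_ids) randomly sampled negatives per label."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    out = {}
    for label in labels:
        pos = [d for d, ls in docs if label in ls]
        neg = [d for d, ls in docs if label not in ls]
        if not pos:
            continue  # nothing to learn for a label without positives
        k = min(len(neg), max(1, round(ratio * len(pos))))
        out[label] = (pos, rng.sample(neg, k))
    return out

# Toy corpus: "retail" is frequent, "mining" and "energy" are rare.
corpus = [(0, {"mining"}), (1, {"mining", "energy"})]
corpus += [(i, {"retail"}) for i in range(2, 10)]

training_sets = balanced_binary_sets(corpus, ["mining", "energy", "retail"])
for label, (pos, neg) in training_sets.items():
    print(label, len(pos), len(neg))
```

Each (positives, negatives) pair would then train one binary classifier for that label, for example a linear SVM; which SVM implementation was used is not stated on the slide.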
26
Rote classifier
● Key idea: use company label distribution collected
by PULS IE
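A rough sketch of the rote idea, under assumptions: the label distribution per company is taken to be the share of previously seen articles about that company carrying each label, and a new article receives every label whose share reaches a cut-off for any company it mentions. The function names, the `min_share` cut-off, and the company names are invented for illustration; the slide only states that the company label distribution comes from PULS IE.

```python
from collections import Counter, defaultdict

def build_label_distribution(history):
    """history: iterable of (company, labels) from previously labelled articles.
    Returns {company: {label: share of that company's articles with the label}}."""
    label_counts = defaultdict(Counter)
    doc_counts = Counter()
    for company, labels in history:
        doc_counts[company] += 1
        label_counts[company].update(set(labels))
    return {
        company: {label: n / doc_counts[company]
                  for label, n in label_counts[company].items()}
        for company in doc_counts
    }

def rote_classify(companies, distribution, min_share=0.5):
    """Assign every label that is frequent enough for a mentioned company."""
    predicted = set()
    for company in companies:
        for label, share in distribution.get(company, {}).items():
            if share >= min_share:
                predicted.add(label)
    return predicted

# Toy history with invented companies.
history = [
    ("AcmeTel", ["telecom"]),
    ("AcmeTel", ["telecom", "electronics"]),
    ("AcmeTel", ["telecom"]),
    ("NordPower", ["energy"]),
]
distribution = build_label_distribution(history)
print(rote_classify(["AcmeTel", "Unseen Oy"], distribution))
```

Companies that never appeared in the history contribute no labels, so such a classifier would only help for articles that mention known names.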
27
CNN classifier
28
RCV1 sector labels
29
Outline
● PULS media monitoring system
● News grouping
● Multi-label text classification
● Business polarity detection
● Conclusion
30
31
Task
● Similar to (entity-level) sentiment analysis
● But business news:
– contain genre-specific word usages
– typically do not express emotions or subjectivity
● Cannot use resources developed for more general
sentiment analysis
32
Data
● Novel dataset:
– ~17,000 documents, ~20,000 company names
● Larger than any existing dataset for the task
● Still smaller than datasets usually used for deep
learning
● Uses a much larger dataset annotated for event
classification:
– manual mapping
– unsupervised feature transfer
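The first of the two options, manual mapping, might look roughly like this. It is only a sketch: the event type names, the mapping table, and the tuple layout are hypothetical and are not taken from the PULS data. The idea illustrated is that event-annotated examples are relabelled with a polarity through a hand-written table, and events with no clear polarity are left out.

```python
# Hypothetical hand-written table from event types to business polarity.
EVENT_TO_POLARITY = {
    "bankruptcy": "negative",
    "layoff": "negative",
    "investment": "positive",
    "new_product": "positive",
}

def map_events_to_polarity(examples, mapping=EVENT_TO_POLARITY):
    """examples: iterable of (text, company, event_type).
    Returns (text, company, polarity) triples; unmapped event types are dropped."""
    mapped = []
    for text, company, event_type in examples:
        polarity = mapping.get(event_type)
        if polarity is not None:
            mapped.append((text, company, polarity))
    return mapped

# Toy usage with invented examples.
event_data = [
    ("AcmeTel files for bankruptcy", "AcmeTel", "bankruptcy"),
    ("NordPower opens a new plant", "NordPower", "investment"),
    ("AcmeTel appoints a new chairman", "AcmeTel", "appointment"),
]
polarity_data = map_events_to_polarity(event_data)
print(polarity_data)
```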
33
34
Outline
● PULS media monitoring system
● News grouping
● Multi-label text classification
● Business polarity detection
● Conclusion
35
Contributions
● Grouping:
– novel dataset based on real data
– novel algorithm based on NE salience
– best method combines salience with
domain-specific embeddings
36
Contributions
● Large-scale multi-label text classification:
– Balancing training set ensures stable performance
– NEs combined with keywords yield better
performance than keyword or name features alone
– Proposed CNN yields higher performance than
previously reported methods
– NEs are important for industry sector classification,
less important for topic classification
37
Contributions
● Entity-level polarity detection
– novel dataset, largest for the task
– unsupervised knowledge transfer outperforms
manual mapping
– result is consistent across two different CNNs
38
Lessons learnt
● data pre-processing is important:
much effort spent on data clean-up,
reorganization and manual annotation
● precise linguistic analysis is important:
pattern-based IE engine provides features for
ML components
● NEs are important:
most tasks in media monitoring gain advantage
from special treatment of NEs
