A Comparison of Different Strategies for Automated Semantic Document Annotation
Gregor Große-Bölting, Chifumi Nishioka, Ansgar Scherp
Chifumi Nishioka, chni@informatik.uni-kiel.de, K-CAP 2015
Motivation [1/2]
• Document annotation
– Helps users and search engines find documents
– Requires a huge amount of human effort
– e.g., subject indexers in ZBW labeled 1.6 million
scientific documents in economics
• Semantic document annotation
– Documents annotated with semantic entities
– e.g., PubMed and MeSH, ACM DL and ACM CCS
Focus on semantic document annotation
Necessity of automated document annotation
Motivation [2/2]
• Small-scale experiments so far
– Compared only a small number of strategies
– Datasets containing a few hundred documents
• Comparison of 43 strategies for document annotation
within the developed experiment framework
– The largest number of strategies
• Experiments with three datasets from different domains
– Contain full-texts of 100,000 documents annotated by subject
indexers
– The largest dataset of scientific publications
We conducted the largest scale experiment
Experiment Framework
Strategies are composed of methods from concept extraction,
concept activation, and annotation selection
1. Concept Extraction
detect concepts (candidate annotations) from each document
2. Concept Activation
compute a score for each concept of a document
3. Annotation Selection
select annotations from concepts for each document
4. Evaluation
measure performance of strategies with ground truth
Research Question
• Research questions addressed with the experiment
framework
(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs
best?
Concept Extraction [1/2]
• Entity
– Extract entities from documents using a domain-specific
knowledge base
– Domain-specific knowledge base
• Entities (subjects) in a specific domain (e.g., medicine)
• One or more labels for each entity
• Relationships between entities
– Detect entities by string matching with entity labels
• Tri-gram
– Extract contiguous sequences of one, two, and three
words in a document
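To make the string-matching step concrete, here is a minimal Python sketch (hypothetical entity IDs and labels, not the authors' implementation) that detects entities by matching labels against the text, longest labels first so that, e.g., "financial crisis" is preferred over "crisis".

```python
import re

# Hypothetical excerpt of a domain-specific knowledge base:
# entity id -> list of labels (each entity may have several labels).
KNOWLEDGE_BASE = {
    "stw:financial-crisis": ["financial crisis", "banking crisis"],
    "stw:central-bank": ["central bank"],
    "stw:interest-rate": ["interest rate"],
}

def extract_entities(text):
    """Detect entities whose labels occur in the text, longest labels first."""
    found = []
    text = text.lower()
    labels = sorted(
        ((label, eid) for eid, labels in KNOWLEDGE_BASE.items() for label in labels),
        key=lambda pair: len(pair[0]), reverse=True)
    for label, eid in labels:
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text):
            found.append(eid)
            text = re.sub(pattern, " ", text)  # consume the span so a shorter label is not re-matched
    return found

print(extract_entities("The financial crisis forced the central bank to cut the interest rate."))
```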
Concept Extraction [2/2]
• RAKE (Rapid Automatic Keyword
Extraction) [Rose et al. 10]
– Unsupervised method for extracting keywords
– Incorporate cooccurrence and frequency of words
• LDA (Latent Dirichlet Allocation) [Blei et al. 03]
– Unsupervised topic modeling method for inferring latent
topics in a document corpus
– Topic model
• Topic: A probability distribution over words
• Document: A probability distribution over topics
– Treat a topic as a concept
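As an illustration of treating LDA topics as concepts, the following minimal sketch uses the gensim library (an assumption; the slides do not name an implementation). Each inferred topic is one concept, and its probability in a document serves as the concept's score.

```python
from gensim import corpora, models

# Toy corpus: each document is a list of (already preprocessed) tokens.
docs = [["bank", "interest", "rate", "crisis"],
        ["tax", "policy", "government", "bank"],
        ["web", "search", "mining", "tagging"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Infer latent topics; every topic is treated as one concept.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=1)

# Concept activation for document 0: the probability of each topic in the document.
for topic_id, prob in lda.get_document_topics(corpus[0]):
    print(f"topic {topic_id} (concept) has score {prob:.2f}")
```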
Concept Activation [1/6]
• Three types of concept activation
methods
– Statistical Methods
• Baseline
• Use only directly mentioned concepts
– Hierarchy-based Methods
• Reveal concepts that are not mentioned explicitly using a
hierarchical knowledge base
– Graph-based Methods
• Use only directly mentioned concepts
• Represent concept
cooccurrences as a graph
e.g., the concept sequence "Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate" is represented as a cooccurrence graph over the concepts Tax, Bank, Interest Rate, Financial Crisis, and Central Bank
Concept Activation [2/6]
• Statistical Methods
– Frequency
• freq(c, d) depends on the Concept Extraction method
– The number of appearances (Entity and Tri-gram)
– The score output by RAKE (RAKE)
– The probability of a topic for a document d (LDA)
– CF-IDF [Goossen et al. 11]
• An extension of TF-IDF replacing words with concepts
• Lower scores for concepts that appear in many documents
score_freq(c, d) = freq(c, d)
score_cfidf(c, d) = cf(c, d) · log( |D| / |{d' ∈ D : c ∈ d'}| )
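The CF-IDF score can be transcribed directly into code; a minimal sketch, assuming concepts have already been extracted per document:

```python
import math
from collections import Counter

def cfidf_scores(doc_concepts, corpus_concepts):
    """doc_concepts: concepts of one document (with repetitions).
    corpus_concepts: list of concept lists, one per document in D."""
    cf = Counter(doc_concepts)                      # cf(c, d)
    n_docs = len(corpus_concepts)                   # |D|
    df = Counter()                                  # |{d' in D : c in d'}|
    for concepts in corpus_concepts:
        df.update(set(concepts))
    return {c: cf[c] * math.log(n_docs / df[c]) for c in cf}

corpus = [["bank", "tax", "bank"], ["bank", "crisis"], ["bank", "law"]]
# "bank" occurs in every document, so its CF-IDF score is 0; the rarer "tax" scores higher.
print(cfidf_scores(corpus[0], corpus))
```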
Concept Activation [3/6]
• Hierarchy-based Methods [1/2]
– Base Activation
• Cl(c): the set of child concepts of a concept c
• λ: decay parameter, set to λ = 0.4
score_base(c, d) = freq(c, d) + λ · Σ_{ci ∈ Cl(c)} score_base(ci, d)
• e.g., see the hierarchy excerpt and example scores below
[Figure: excerpt of a concept hierarchy with the concepts World Wide Web, Web Searching, Web Mining, Social Recommendation, Social Tagging, Site Wrapping, and Web Log Analysis; example scores for three concepts: score_base(c1, d) = 1.00, score_base(c2, d) = 0.40, score_base(c3, d) = 0.16]
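A minimal recursive sketch of Base Activation over a toy hierarchy (the hierarchy and concept names are hypothetical; λ = 0.4 as above) that reproduces the example scores:

```python
LAMBDA = 0.4

# Toy hierarchy (hypothetical structure): concept -> list of child concepts.
CHILDREN = {
    "World Wide Web": ["Web Searching"],
    "Web Searching": ["Social Tagging"],
    "Social Tagging": [],
}

def score_base(c, freq):
    """score_base(c, d) = freq(c, d) + lambda * sum of score_base(child, d) over children."""
    return freq.get(c, 0) + LAMBDA * sum(score_base(ci, freq) for ci in CHILDREN.get(c, []))

freq = {"Social Tagging": 1}  # only the leaf concept is mentioned in the document
for c in CHILDREN:
    print(c, round(score_base(c, freq), 2))
# World Wide Web 0.16, Web Searching 0.4, Social Tagging 1.0 -- matching the example scores
```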
Concept Activation [4/6]
• Hierarchy-based Methods [2/2]
– Branch Activation
• BN: the reciprocal of the number of concepts located one level above a concept c
score_branch(c, d) = freq(c, d) + λ · BN · Σ_{ci ∈ Cl(c)} score_branch(ci, d)
– OneHop Activation
• C_d: the set of concepts in a document d
• Activates concepts in a maximum distance of one hop
score_onehop(c, d) = freq(c, d)                                    if |Cl(c) ∩ C_d| ≥ 2
score_onehop(c, d) = freq(c, d) + λ · Σ_{ci ∈ Cl(c)} freq(ci, d)   otherwise
Concept Activation [5/6]
• Graph-based Methods [1/2]
– Degree [Zouaq et al. 12]
• degree(c, d): the number of edges linked with a concept c in the cooccurrence graph
• e.g., score_degree(Bank, d) = 3 in the graph below
score_degree(c, d) = degree(c, d)
– HITS [Kleinberg 99; Zouaq et al. 12]
• Link analysis algorithm for search engines [Kleinberg 99]
• Cn(c): the set of concepts connected with c in the graph
• hub(c, d) = Σ_{ci ∈ Cn(c)} auth(ci, d)
• auth(c, d) = Σ_{ci ∈ Cn(c)} hub(ci, d)
score_hits(c, d) = hub(c, d) + auth(c, d)
[Figure: the cooccurrence graph over Tax, Bank, Interest Rate, Financial Crisis, and Central Bank from the earlier example]
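A minimal sketch of the graph-based scores using the networkx library (an assumption; the edges below are one plausible cooccurrence graph for the running example in which Bank has degree 3):

```python
import networkx as nx

# Hypothetical cooccurrence graph: edges between concepts that cooccur in the document.
G = nx.Graph()
G.add_edges_from([
    ("Bank", "Interest Rate"),
    ("Bank", "Financial Crisis"),
    ("Bank", "Central Bank"),
    ("Central Bank", "Tax"),
])

# Degree: number of edges attached to a concept, e.g. score_degree(Bank, d) = 3.
print(dict(G.degree()))

# HITS: score_hits(c, d) = hub(c, d) + auth(c, d).
hubs, auths = nx.hits(G)
print({c: hubs[c] + auths[c] for c in G.nodes()})
```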
Concept Activation [6/6]
• Graph-based Methods [2/2]
– PageRank [Page et al. 99; Mihalcea & Tarau 04]
• Link analysis algorithm for search engines
• Based on the intuition that a node that is linked from many
important nodes is more important
• C_in(c): set of concepts linking to c via incoming edges
• C_out(c): set of concepts that c links to via outgoing edges
• μ: damping factor, μ = 0.85
score_page(c, d) = (1 − μ) + μ · Σ_{ci ∈ C_in(c)} score_page(ci, d) / |C_out(ci)|
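The PageRank update can be transcribed directly; a minimal sketch over the same toy cooccurrence graph (hypothetical adjacency, undirected edges written as links in both directions):

```python
MU = 0.85  # damping factor from the slide

# Hypothetical adjacency: concept -> concepts it links to.
LINKS = {
    "Bank": ["Interest Rate", "Financial Crisis", "Central Bank"],
    "Interest Rate": ["Bank"],
    "Financial Crisis": ["Bank"],
    "Central Bank": ["Bank", "Tax"],
    "Tax": ["Central Bank"],
}

def pagerank(links, iterations=50):
    """Iterate score_page(c) = (1 - mu) + mu * sum over incoming ci of score_page(ci) / |C_out(ci)|."""
    score = {c: 1.0 for c in links}
    for _ in range(iterations):
        score = {c: (1 - MU) + MU * sum(score[ci] / len(links[ci])
                                        for ci in links if c in links[ci])
                 for c in links}
    return score

print({c: round(s, 2) for c, s in pagerank(LINKS).items()})
```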
Annotation Selection
• Top-5 and Top-10
– Select concepts whose scores are ranked in top-k
• k Nearest Neighbor (kNN) [Huang et al. 11]
– Based on the assumption that documents with similar
concepts share similar annotations
1. Compute similarity scores between a target document
and all documents with annotations
2. Select union of annotations of k nearest documents
[Figure: a target document compared with four annotated documents (similarities 0.49, 0.45, 0.42, 0.60); candidate annotations include Central bank, Law, Financial crisis, Finance, China, Human resource, Leadership, Marketing, Competition law]
Example: k = 2 → selected annotations: Finance; China; Marketing; Competition Law
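A minimal sketch of kNN annotation selection (the document IDs and the assignment of annotation sets to similarity values are hypothetical, chosen to be consistent with the example above):

```python
def knn_annotations(similarities, annotations, k):
    """similarities: {doc_id: similarity to the target document}
    annotations:  {doc_id: set of annotations of that document}
    Returns the union of annotations of the k most similar documents."""
    nearest = sorted(similarities, key=similarities.get, reverse=True)[:k]
    selected = set()
    for doc_id in nearest:
        selected |= annotations[doc_id]
    return selected

similarities = {"d1": 0.49, "d2": 0.45, "d3": 0.42, "d4": 0.60}
annotations = {
    "d1": {"Finance", "China"},
    "d2": {"Central bank", "Law", "Financial crisis"},
    "d3": {"Human resource", "Leadership"},
    "d4": {"Marketing", "Competition law"},
}
print(knn_annotations(similarities, annotations, k=2))
# e.g. {'Marketing', 'Competition law', 'Finance', 'China'} -- the slide's k = 2 selection
```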
Configurations [1/5]
[Figure: overview of the framework dimensions — Concept Extraction: Entity, Tri-gram, RAKE, LDA; Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods); Annotation Selection: Top-k (2 methods), kNN (1 method)]
Configurations [2/5]
24 strategies: Entity combined with all 8 Concept Activation methods (2 statistical, 3 hierarchy-based, 3 graph-based) and all 3 Annotation Selection methods (Top-5, Top-10, kNN)
Configurations [3/5]
15 strategies: Tri-gram combined with 5 Concept Activation methods (Frequency, CF-IDF, Degree, HITS, PageRank) and all 3 Annotation Selection methods
Configurations [4/5]
3 strategies: RAKE combined with Frequency and all 3 Annotation Selection methods
Configurations [5/5]
1 strategy: LDA combined with Frequency and kNN
43 strategies in total
Datasets and Metrics of Experiments
                                   Economics        Political Science    Computer Science
publication source                 ZBW              FIV                  SemEval 2010
# of publications                  62,924           28,324               244
# of annotations per publication   5.26 (± 1.84)    12.00 (± 4.02)       5.05 (± 2.41)
knowledge base                     STW              European Thesaurus   ACM CCS
# of entities                      6,335            7,912                2,299
# of labels                        11,679           8,421                9,086
• Computer Science: SemEval 2010 dataset [Kim et al. 10]
– Publications are originally annotated with keywords
– We converted keywords to entities by string matching
• All publications and labels of entities are in English
• We use full-texts of publications
• All annotations are used as ground truth
• Evaluation metrics: Precision, Recall, F-measure
(I) Best Performing Strategies
• Economics and Political Science datasets
– The best strategy: Entity × HITS × kNN
– F-measure: 0.39 (economics), 0.28 (political science)
• Computer Science dataset
– The best strategy: Entity × Degree × kNN
– F-measure: 0.33 (computer science)
• Graph-based methods do not differ much from each other
In general, a document annotation strategy
Entity × Graph-based method × kNN performs best
(II) Influence of Concept Extraction
• Concept Extraction method: Entity
– Use domain-specific knowledge bases
– Knowledge bases: freely available and of high quality
– 32 thesauri listed in W3C SKOS Datasets
For Concept Extraction methods, Entity consistently
outperforms Tri-gram, RAKE, and LDA
(III) Influence of Concept Activation
• Poor performance of hierarchy-based methods
– We use full-texts in the experiments
• Full texts contain so many different concepts (203.80 unique
entities on average, SD: 24.50) that additional concepts rarely need to be activated
– However, OneHop can work as well as graph-based
methods
• It activates concepts only within a distance of one hop
For Concept Activation methods,
graph-based methods are better than statistical
methods or hierarchy-based methods
(IV) Influence of Annotation Selection
• kNN
– No learning process
– Confirms the assumption that documents with similar
concepts share similar annotations
For Annotation Selection methods, kNN can enhance
the performance
Conclusion
• Large scale experiment for automated semantic
document annotation for scientific publications
• Best strategy: Entity × Graph-based method × kNN
– Novel combination of methods
• Best concept extraction method: Entity
• Best concept activation method: Graph-based
methods
– OneHop can achieve similar performance at a lower
computational cost
Thank you!
Questions?
Appendix
Research Question
• Research questions addressed with the experiment
framework
(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs
best?
LDA (Latent Dirichlet Allocation)
[Figure from: D. M. Blei. Probabilistic topic models, CACM, 2012.]
Entity Extraction and Conversion
• Entity extraction
– String matching with entity labels
– Matching starts with longer entity labels
• e.g., from the text "financial crisis is …", only the entity "financial
crisis" is detected (not "crisis")
• Converting to entities
– Tri-gram and RAKE extract words and keywords (not entities)
– These are converted to entities by string matching with
entity labels before annotation selection
– If no matching entity label is found, the word or keyword is
discarded
kNN [1/2]
• Similarity measure
– Each document is represented as a vector where each
element is a score of a concept
– Cosine similarity is used as a similarity measure
Example: over the concepts (GDP, Immigration, Population, Bank, Interest rate, Canada),
d1 = (0.3, 0.5, 0.8, 0.1, 0.0, 0.5) and d2 = (0.6, 0.0, 0.4, 0.8, 0.4, 0.2);
cosine similarity sim(d1, d2) = 0.52
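The similarity value from this example can be reproduced in a few lines of numpy (a minimal sketch):

```python
import numpy as np

# Concept-score vectors over (GDP, Immigration, Population, Bank, Interest rate, Canada).
d1 = np.array([0.3, 0.5, 0.8, 0.1, 0.0, 0.5])
d2 = np.array([0.6, 0.0, 0.4, 0.8, 0.4, 0.2])

sim = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(sim, 2))  # 0.52
```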
kNN [2/2]
• k = 1: annotations of the single most similar document (similarity 0.60) are selected: Marketing; Competition law
• k = 2: annotations of the two most similar documents are selected: Marketing; Competition law; Finance; China
Evaluation Metrics
• Precision
• Recall
• F-measure
precision = |{relevant annotations} ∩ {retrieved annotations}| / |{retrieved annotations}|
recall = |{relevant annotations} ∩ {retrieved annotations}| / |{relevant annotations}|
F-measure = 2 · (precision · recall) / (precision + recall)
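A minimal sketch computing these metrics for one document (the annotation sets are hypothetical):

```python
def evaluate(retrieved, relevant):
    """Precision, recall, and F-measure for one document's annotation sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# 2 of 4 selected annotations are in the ground truth of 5.
print(evaluate({"Bank", "Tax", "Law", "China"}, {"Bank", "Tax", "Finance", "Trade", "Labour"}))
# (0.5, 0.4, 0.444...)
```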
Datasets
• Economics dataset
– 11 GB
• Political science dataset
– 3.8 GB
Experiments
• Preprocessing documents
– lemmatization
– stop word removal
• 10-fold cross validation
– split a dataset into 10 equal-sized subsets
– 8 subsets for training
– 1 subset for testing
– 1 subset for parameter optimization
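One way to realize this 8/1/1 split is sketched below (a minimal sketch; the exact fold assignment is an assumption):

```python
import random

def ten_fold_splits(doc_ids, seed=1):
    """Yield (train, validation, test) splits: 8 folds train, 1 validation, 1 test."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::10] for i in range(10)]
    for i in range(10):
        test = folds[i]
        validation = folds[(i + 1) % 10]
        train = [d for j, fold in enumerate(folds)
                 if j not in (i, (i + 1) % 10) for d in fold]
        yield train, validation, test

for train, val, test in ten_fold_splits(range(20)):
    print(len(train), len(val), len(test))  # 16 2 2
    break
```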
Result Table: Entity [1/2]
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .14 (.17) .14 (.15) .13 (.15) .22 (.20) .11 (.10) .14 (.12) .08 (.21) .08 (.21) .08 (.21)
CF-IDF .19 (.19) .18 (.17) .18 (.16) .24 (.21) .12 (.10) .15 (.12) .29 (.32) .30 (.32) .29 (.31)
Base Act. .10 (.14) .09 (.13) .09 (.13) .18 (.19) .09 (.09) .12 (.11) .20 (.30) .20 (.30) .20 (.29)
Branch Act. .08 (.14) .08 (.12) .08 (.12) .17 (.19) .08 (.09) .11 (.11) .17 (.28) .17 (.28) .17 (.27)
OneHop .12 (.16) .12 (.14) .12 (.14) .19 (.19) .09 (.09) .12 (.11) .35 (.34) .36 (.34) .35 (.33)
Degree .15 (.17) .14 (.15) .14 (.15) .23 (.20) .11 (.09) .14 (.12) .39 (.33) .40 (.33) .38 (.32)
HITS .14 (.17) .14 (.15) .14 (.15) .23 (.20) .11 (.10) .14 (.12) .40 (.32) .40 (.32) .39 (.31)
PageRank .14 (.17) .14 (.15) .14 (.15) .22 (.20) .11 (.09) .14 (.12) .39 (.33) .40 (.33) .38 (.32)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .12 (.11) .18 (.16) .14 (.12) .15 (.13) .12 (.10) .13 (.10) .14 (.17) .05 (.07) .07 (.09)
CF-IDF .05 (.07) .12 (.16) .07 (.10) .07 (.09) .08 (.10) .07 (.09) .24 (.22) .14 (.14) .17 (.16)
Base Act. .05 (.08) .10 (.13) .07 (.09) .10 (.10) .10 (.09) .09 (.09) .14 (.19) .07 (.10) .09 (.12)
Branch Act. .04 (.07) .08 (.12) .05 (.08) .09 (.09) .09 (.09) .08 (.09) .12 (.17) .06 (.10) .08 (.11)
OneHop .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .27 (.21) .26 (.21) .25 (.19)
Degree .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .29 (.21) .28 (.21) .27 (.19)
HITS .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .30 (.22) .29 (.21) .28 (.20)
PageRank .10 (.09) .20 (.17) .13 (.11) .13 (.10) .14 (.11) .13 (.10) .29 (.22) .29 (.21) .27 (.20)
Result Table: Entity [2/2]
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .18 (.21) .14 (.15) .15 (.16) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .24 (.16) .30 (.17)
CF-IDF .02 (.08) .02 (.06) .02 (.06) .03 (.11) .01 (.04) .02 (.05) .47 (.29) .23 (.17) .29 (.18)
Base Act. .17 (.20) .13 (.14) .14 (.15) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .22 (.15) .29 (.17)
Branch Act. .17 (.20) .12 (.14) .14 (.15) .21 (.22) .08 (.08) .11 (.11) .50 (.28) .22 (.15) .29 (.17)
OneHop .17 (.20) .13 (.14) .14 (.15) .21 (.22) .08 (.08) .11 (.11) .42 (.30) .25 (.21) .29 (.20)
Degree .17 (.21) .13 (.15) .14 (.16) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .27 (.17) .33 (.18)
HITS .18 (.21) .14 (.15) .15 (.16) .21 (.22) .08 (.08) .11 (.11) .48 (.31) .27 (.18) .32 (.20)
PageRank .17 (.21) .13 (.15) .14 (.16) .22 (.22) .08 (.08) .12 (.11) .50 (.29) .25 (.15) .31 (.18)
Result Table: Tri-gram
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .12 (.15) .12 (.14) .11 (.14) .19 (.19) .10 (.10) .13 (.12) .08 (.22) .08 (.22) .08 (.21)
CF-IDF .10 (.12) .10 (.12) .09 (.11) .17 (.17) .08 (.10) .12 (.12) .07 (.20) .06 (.22) .06 (.20)
Degree .03 (.09) .03 (.08) .03 (.08) .03 (.09) .03 (.08) .03 (.08) .07 (.21) .07 (.21) .07 (.20)
HITS .02 (.06) .02 (.06) .02 (.06) .02 (.06) .02 (.06) .02 (.06) .08 (.22) .08 (.22) .07 (.21)
PageRank .03 (.09) .03 (.08) .03 (.08) .03 (.09) .03 (.08) .03 (.08) .10 (.20) .04 (.08) .05 (.11)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .06 (.08) .14 (.16) .08 (.10) .10 (.10) .11 (.11) .10 (.09) .08 (.14) .05 (.08) .06 (.09)
CF-IDF .05 (.05) .06 (.07) .05 (.06) .09 (.10) .09 (.10) .08 (.09) .09 (.15) .04 (.08) .06 (.10)
Degree .01 (.03) .03 (.07) .01 (.04) .01 (.03) .03 (.07) .01 (.04) .11 (.14) .03 (.05) .05 (.07)
HITS .01 (.03) .02 (.06) .01 (.03) .01 (.03) .00 (.06) .01 (.03) .12 (.14) .04 (.06) .06 (.08)
PageRank .01 (.04) .03 (.08) .02 (.05) .01 (.04) .03 (.08) .02 (.05) .08 (.12) .03 (.05) .04 (.06)
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .26 (.24) .20 (.18) .22 (.19) .54 (.30) .20 (.13) .29 (.17) .44 (.28) .25 (.18) .30 (.19)
CF-IDF .23 (.24) .18 (.18) .19 (.19) .54 (.29) .22 (.14) .30 (.17) .48 (.28) .20 (.14) .26 (.15)
Degree .09 (.15) .07 (.11) .07 (.12) .13 (.19) .05 (.07) .07 (.09) .48 (.29) .23 (.16) .29 (.18)
HITS .05 (.14) .04 (.09) .04 (.10) .11 (.18) .04 (.06) .06 (.09) .39 (.29) .26 (.21) .28 (.19)
PageRank .02 (.06) .02 (.05) .02 (.06) .03 (.08) .01 (.03) .02 (.05) .46 (.29) .25 (.18) .30 (.18)
Result Table: RAKE
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .08 (.14) .08 (.12) .08 (.12) .15 (.18) .07 (.08) .10 (.11) .34 (.33) .34 (.33) .33 (.32)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .04 (.07) .08 (.13) .05 (.08) .07 (.09) .08 (.09) .07 (.08) .31 (.23) .18 (.15) .22 (.17)
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .24 (.24) .17 (.16) .19 (.17) .42 (.28) .15 (.10) .22 (.14) .42 (.27) .20 (.13) .25 (.15)
Result Table: LDA
Economics
kNN
Recall Precision F
Frequency .19 (.30) .19 (.30) .19 (.30)
Political Science
kNN
Recall Precision F
Frequency .15 (.19) .15 (.18) .14 (.17)
Computer Science
kNN
Recall Precision F
Frequency .28 (.27) .24 (.23) .24 (.22)
Materials
• Code
– https://github.com/ggb/ShortStories
• Datasets
– economics and political science
• not publicly available yet
• contact us directly if you are interested
– computer science
• publicly available
Presentation
• K-CAP 2015
– International Conference on Knowledge Capture
– Scope
• Knowledge Acquisition / Capture
• Knowledge Extraction from Text
• Semantic Web
• Knowledge Engineering and Modelling
• …
• Time slot
– Presentation: 25 minutes
– Q & A: 5 minutes
Reference
• [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation,
JMLR, 2003.
• [Blei 12] D. M. Blei. Probabilistic topic models, CACM, 2012.
• [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U.
Kaymak. News personalization using the CF-IDF semantic recommender, WIMS,
2011.
• [Grosse-Bolting et al. 15] G. Grosse-Bolting, C. Nishioka, and A. Scherp. Generic
process for extracting user profiles from social media using hierarchical knowledge
bases, ICSC, 2015.
• [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for
annotating biomedical articles, JAMIA, 2011.
• [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User
interests identification on Twitter using a hierarchical knowledge base, ESWC, 2014.
• [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. Semeval-2010
task 5: Automatic keyphrase extraction from scientific articles, International
Workshop on Semantic Evaluation, 2010.
Reference
• [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked
environment, Journal of the ACM, 1999.
• [Mihalcea & Tarau 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts,
EMNLP, 2004.
• [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank
citation ranking: bringing order to the web, TR of Stanford InfoLab, 1999.
• [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword
extraction from individual documents, Text Mining, 2010.
• [Zouaq et al. 12] A. Zouaq, D. Gašević, and M. Hatala. Voting theory for concept
detection, ESWC, 2012.
