A Comparison of Different Strategies for Automated Semantic Document Annotation
Gregor Große-Bölting, Chifumi Nishioka, Ansgar Scherp
Chifumi Nishioka, chni@informatik.uni-kiel.de, K-CAP 2015
Motivation [1/2]
• Document annotation
– Helps users and search engines find documents
– Requires a huge amount of human effort
– e.g., subject indexers in ZBW labeled 1.6 million
scientific documents in economics
• Semantic document annotation
– Documents annotated with semantic entities
– e.g., PubMed and MeSH, ACM DL and ACM CCS
Focus on semantic document annotation
Necessity of automated document annotation
Motivation [2/2]
• Small-scale experiments so far
– Compared only a small number of strategies
– Datasets containing a few hundred documents
• Comparison of 43 strategies for document annotation
within the developed experiment framework
– The largest number of strategies
• Experiments with three datasets from different domains
– Contain full-texts of 100,000 documents annotated by subject
indexers
– The largest dataset of scientific publications
We conducted the largest scale experiment
Experiment Framework
Strategies are composed of methods from concept extraction,
concept activation, and annotation selection
1. Concept Extraction
detect concepts (candidate annotations) from each document
2. Concept Activation
compute a score for each concept of a document
3. Annotation Selection
select annotations from concepts for each document
4. Evaluation
measure performance of strategies with ground truth
Research Question
• Research questions addressed with the experiment
framework
(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs
best?
Concept Extraction [1/2]
• Entity
– Extract entities from documents using a domain-specific
knowledge base
– Domain-specific knowledge base
• Entities (subjects) in a specific domain (e.g., medicine)
• One or more labels for each entity
• Relationships between entities
– Detect entities by string matching with entity labels
• Tri-gram
– Extract contiguous sequences of one, two, and three
words in a document
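To make the string-matching step concrete, here is a minimal Python sketch (hypothetical entity IDs and labels, not the authors' implementation) that detects entities by matching labels against the text, longest labels first so that, e.g., "financial crisis" is preferred over "crisis".

```python
import re

# Hypothetical excerpt of a domain-specific knowledge base:
# entity id -> list of labels (each entity may have several labels).
KNOWLEDGE_BASE = {
    "stw:financial-crisis": ["financial crisis", "banking crisis"],
    "stw:central-bank": ["central bank"],
    "stw:interest-rate": ["interest rate"],
}

def extract_entities(text):
    """Detect entities whose labels occur in the text, longest labels first."""
    found = []
    text = text.lower()
    labels = sorted(
        ((label, eid) for eid, labels in KNOWLEDGE_BASE.items() for label in labels),
        key=lambda pair: len(pair[0]), reverse=True)
    for label, eid in labels:
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text):
            found.append(eid)
            text = re.sub(pattern, " ", text)  # consume the span so a shorter label is not re-matched
    return found

print(extract_entities("The financial crisis forced the central bank to cut the interest rate."))
```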
Concept Extraction [2/2]
• RAKE (Rapid Automatic Keyword
Extraction) [Rose et al. 10]
– Unsupervised method for extracting keywords
– Incorporate cooccurrence and frequency of words
• LDA (Latent Dirichlet Allocation) [Blei et al. 03]
– Unsupervised topic modeling method for inferring latent
topics in a document corpus
– Topic model
• Topic: A probability distribution over words
• Document: A probability distribution over topics
– Treat a topic as a concept
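As an illustration of treating LDA topics as concepts, the following minimal sketch uses the gensim library (an assumption; the slides do not name an implementation). Each inferred topic is one concept, and its probability in a document serves as the concept's score.

```python
from gensim import corpora, models

# Toy corpus: each document is a list of (already preprocessed) tokens.
docs = [["bank", "interest", "rate", "crisis"],
        ["tax", "policy", "government", "bank"],
        ["web", "search", "mining", "tagging"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Infer latent topics; every topic is treated as one concept.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=1)

# Concept activation for document 0: the probability of each topic in the document.
for topic_id, prob in lda.get_document_topics(corpus[0]):
    print(f"topic {topic_id} (concept) has score {prob:.2f}")
```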
Concept Activation [1/6]
• Three types of concept activation
methods
– Statistical Methods
• Baseline
• Use only directly mentioned concepts
– Hierarchy-based Methods
• Reveal concepts that are not mentioned explicitly using a
hierarchical knowledge base
– Graph-based Methods
• Use only directly mentioned concepts
• Represent concept
cooccurrences as a graph
e.g., the concept sequence "Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate" is represented as a cooccurrence graph over the concepts Tax, Bank, Interest Rate, Financial Crisis, and Central Bank
Concept Activation [2/6]
• Statistical Methods
– Frequency
• freq(c, d) depends on the Concept Extraction method
– The number of appearances (Entity and Tri-gram)
– The score output by RAKE (RAKE)
– The probability of a topic for a document d (LDA)
– CF-IDF [Goossen et al. 11]
• An extension of TF-IDF replacing words with concepts
• Lower scores for concepts that appear in many documents
score_freq(c, d) = freq(c, d)
score_cfidf(c, d) = cf(c, d) · log( |D| / |{d' ∈ D : c ∈ d'}| )
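The CF-IDF score can be transcribed directly into code; a minimal sketch, assuming concepts have already been extracted per document:

```python
import math
from collections import Counter

def cfidf_scores(doc_concepts, corpus_concepts):
    """doc_concepts: concepts of one document (with repetitions).
    corpus_concepts: list of concept lists, one per document in D."""
    cf = Counter(doc_concepts)                      # cf(c, d)
    n_docs = len(corpus_concepts)                   # |D|
    df = Counter()                                  # |{d' in D : c in d'}|
    for concepts in corpus_concepts:
        df.update(set(concepts))
    return {c: cf[c] * math.log(n_docs / df[c]) for c in cf}

corpus = [["bank", "tax", "bank"], ["bank", "crisis"], ["bank", "law"]]
# "bank" occurs in every document, so its CF-IDF score is 0; the rarer "tax" scores higher.
print(cfidf_scores(corpus[0], corpus))
```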
Concept Activation [3/6]
• Hierarchy-based Methods [1/2]
– Base Activation
• Cl(c): the set of child concepts of a concept c
• λ: decay parameter, set to λ = 0.4
score_base(c, d) = freq(c, d) + λ · Σ_{ci ∈ Cl(c)} score_base(ci, d)
• e.g., see the hierarchy excerpt and example scores below
[Figure: excerpt of a concept hierarchy with the concepts World Wide Web, Web Searching, Web Mining, Social Recommendation, Social Tagging, Site Wrapping, and Web Log Analysis; example scores for three concepts: score_base(c1, d) = 1.00, score_base(c2, d) = 0.40, score_base(c3, d) = 0.16]
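A minimal recursive sketch of Base Activation over a toy hierarchy (the hierarchy and concept names are hypothetical; λ = 0.4 as above) that reproduces the example scores:

```python
LAMBDA = 0.4

# Toy hierarchy (hypothetical structure): concept -> list of child concepts.
CHILDREN = {
    "World Wide Web": ["Web Searching"],
    "Web Searching": ["Social Tagging"],
    "Social Tagging": [],
}

def score_base(c, freq):
    """score_base(c, d) = freq(c, d) + lambda * sum of score_base(child, d) over children."""
    return freq.get(c, 0) + LAMBDA * sum(score_base(ci, freq) for ci in CHILDREN.get(c, []))

freq = {"Social Tagging": 1}  # only the leaf concept is mentioned in the document
for c in CHILDREN:
    print(c, round(score_base(c, freq), 2))
# World Wide Web 0.16, Web Searching 0.4, Social Tagging 1.0 -- matching the example scores
```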
Concept Activation [4/6]
• Hierarchy-based Methods [2/2]
– Branch Activation
• BN: the reciprocal of the number of concepts located one level above a concept c
score_branch(c, d) = freq(c, d) + λ · BN · Σ_{ci ∈ Cl(c)} score_branch(ci, d)
– OneHop Activation
• C_d: the set of concepts in a document d
• Activates concepts in a maximum distance of one hop
score_onehop(c, d) = freq(c, d)                                    if |Cl(c) ∩ C_d| ≥ 2
score_onehop(c, d) = freq(c, d) + λ · Σ_{ci ∈ Cl(c)} freq(ci, d)   otherwise
Concept Activation [5/6]
• Graph-based Methods [1/2]
– Degree [Zouaq et al. 12]
• degree(c, d): the number of edges linked with a concept c in the cooccurrence graph
• e.g., score_degree(Bank, d) = 3 in the graph below
score_degree(c, d) = degree(c, d)
– HITS [Kleinberg 99; Zouaq et al. 12]
• Link analysis algorithm for search engines [Kleinberg 99]
• Cn(c): the set of concepts connected with c in the graph
• hub(c, d) = Σ_{ci ∈ Cn(c)} auth(ci, d)
• auth(c, d) = Σ_{ci ∈ Cn(c)} hub(ci, d)
score_hits(c, d) = hub(c, d) + auth(c, d)
[Figure: the cooccurrence graph over Tax, Bank, Interest Rate, Financial Crisis, and Central Bank from the earlier example]
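A minimal sketch of the graph-based scores using the networkx library (an assumption; the edges below are one plausible cooccurrence graph for the running example in which Bank has degree 3):

```python
import networkx as nx

# Hypothetical cooccurrence graph: edges between concepts that cooccur in the document.
G = nx.Graph()
G.add_edges_from([
    ("Bank", "Interest Rate"),
    ("Bank", "Financial Crisis"),
    ("Bank", "Central Bank"),
    ("Central Bank", "Tax"),
])

# Degree: number of edges attached to a concept, e.g. score_degree(Bank, d) = 3.
print(dict(G.degree()))

# HITS: score_hits(c, d) = hub(c, d) + auth(c, d).
hubs, auths = nx.hits(G)
print({c: hubs[c] + auths[c] for c in G.nodes()})
```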
Concept Activation [6/6]
• Graph-based Methods [2/2]
– PageRank [Page et al. 99; Mihalcea & Tarau 04]
• Link analysis algorithm for search engines
• Based on the intuition that a node that is linked from many
important nodes is more important
• C_in(c): set of concepts linking to c via incoming edges
• C_out(c): set of concepts that c links to via outgoing edges
• μ: damping factor, μ = 0.85
score_page(c, d) = (1 − μ) + μ · Σ_{ci ∈ C_in(c)} score_page(ci, d) / |C_out(ci)|
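The PageRank update can be transcribed directly; a minimal sketch over the same toy cooccurrence graph (hypothetical adjacency, undirected edges written as links in both directions):

```python
MU = 0.85  # damping factor from the slide

# Hypothetical adjacency: concept -> concepts it links to.
LINKS = {
    "Bank": ["Interest Rate", "Financial Crisis", "Central Bank"],
    "Interest Rate": ["Bank"],
    "Financial Crisis": ["Bank"],
    "Central Bank": ["Bank", "Tax"],
    "Tax": ["Central Bank"],
}

def pagerank(links, iterations=50):
    """Iterate score_page(c) = (1 - mu) + mu * sum over incoming ci of score_page(ci) / |C_out(ci)|."""
    score = {c: 1.0 for c in links}
    for _ in range(iterations):
        score = {c: (1 - MU) + MU * sum(score[ci] / len(links[ci])
                                        for ci in links if c in links[ci])
                 for c in links}
    return score

print({c: round(s, 2) for c, s in pagerank(LINKS).items()})
```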
Annotation Selection
• Top-5 and Top-10
– Select concepts whose scores are ranked in top-k
• k Nearest Neighbor (kNN) [Huang et al. 11]
– Based on the assumption that documents with similar
concepts share similar annotations
1. Compute similarity scores between a target document
and all documents with annotations
2. Select union of annotations of k nearest documents
[Figure: a target document compared with four annotated documents (similarities 0.49, 0.45, 0.42, 0.60); candidate annotations include Central bank, Law, Financial crisis, Finance, China, Human resource, Leadership, Marketing, Competition law]
Example: k = 2 → selected annotations: Finance; China; Marketing; Competition Law
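A minimal sketch of kNN annotation selection (the document IDs and the assignment of annotation sets to similarity values are hypothetical, chosen to be consistent with the example above):

```python
def knn_annotations(similarities, annotations, k):
    """similarities: {doc_id: similarity to the target document}
    annotations:  {doc_id: set of annotations of that document}
    Returns the union of annotations of the k most similar documents."""
    nearest = sorted(similarities, key=similarities.get, reverse=True)[:k]
    selected = set()
    for doc_id in nearest:
        selected |= annotations[doc_id]
    return selected

similarities = {"d1": 0.49, "d2": 0.45, "d3": 0.42, "d4": 0.60}
annotations = {
    "d1": {"Finance", "China"},
    "d2": {"Central bank", "Law", "Financial crisis"},
    "d3": {"Human resource", "Leadership"},
    "d4": {"Marketing", "Competition law"},
}
print(knn_annotations(similarities, annotations, k=2))
# e.g. {'Marketing', 'Competition law', 'Finance', 'China'} -- the slide's k = 2 selection
```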
Configurations [1/5]
[Figure: overview of the framework dimensions — Concept Extraction: Entity, Tri-gram, RAKE, LDA; Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods); Annotation Selection: Top-k (2 methods), kNN (1 method)]
Configurations [2/5]
24 strategies: Entity combined with all 8 Concept Activation methods (2 statistical, 3 hierarchy-based, 3 graph-based) and all 3 Annotation Selection methods (Top-5, Top-10, kNN)
Configurations [3/5]
15 strategies: Tri-gram combined with 5 Concept Activation methods (Frequency, CF-IDF, Degree, HITS, PageRank) and all 3 Annotation Selection methods
Configurations [4/5]
3 strategies: RAKE combined with Frequency and all 3 Annotation Selection methods
Configurations [5/5]
1 strategy: LDA combined with Frequency and kNN
43 strategies in total
Datasets and Metrics of Experiments
                                   Economics        Political Science    Computer Science
publication source                 ZBW              FIV                  SemEval 2010
# of publications                  62,924           28,324               244
# of annotations per publication   5.26 (± 1.84)    12.00 (± 4.02)       5.05 (± 2.41)
knowledge base                     STW              European Thesaurus   ACM CCS
# of entities                      6,335            7,912                2,299
# of labels                        11,679           8,421                9,086
• Computer Science: SemEval 2010 dataset [Kim et al. 10]
– Publications are originally annotated with keywords
– We converted keywords to entities by string matching
• All publications and labels of entities are in English
• We use full-texts of publications
• All annotations are used as ground truth
• Evaluation metrics: Precision, Recall, F-measure
(I) Best Performing Strategies
• Economics and Political Science datasets
– The best strategy: Entity × HITS × kNN
– F-measure: 0.39 (economics), 0.28 (political science)
• Computer Science dataset
– The best strategy: Entity × Degree × kNN
– F-measure: 0.33 (computer science)
• Graph-based methods do not differ much from each other
In general, a document annotation strategy
Entity × Graph-based method × kNN performs best
(II) Influence of Concept Extraction
• Concept Extraction method: Entity
– Use domain-specific knowledge bases
– Knowledge bases: freely available and of high quality
– 32 thesauri listed in W3C SKOS Datasets
For Concept Extraction methods, Entity consistently
outperforms Tri-gram, RAKE, and LDA
(III) Influence of Concept Activation
• Poor performance of hierarchy-based methods
– We use full-texts in the experiments
• Full texts contain so many different concepts (203.80 unique
entities on average, SD: 24.50) that additional concepts rarely need to be activated
– However, OneHop can work as well as graph-based
methods
• It activates concepts only within a distance of one hop
For Concept Activation methods,
graph-based methods are better than statistical
methods or hierarchy-based methods
(IV) Influence of Annotation Selection
• kNN
– No learning process
– Confirms the assumption that documents with similar
concepts share similar annotations
For Annotation Selection methods, kNN can enhance
the performance
Conclusion
• Large scale experiment for automated semantic
document annotation for scientific publications
• Best strategy: Entity × Graph-based method × kNN
– Novel combination of methods
• Best concept extraction method: Entity
• Best concept activation method: Graph-based
methods
– OneHop can achieve similar performance at a lower
computational cost
Thank you!
Questions?
Appendix
Research Question
• Research questions addressed with the experiment
framework
(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs
best?
LDA (Latent Dirichlet Allocation)
[Figure from: D. M. Blei. Probabilistic topic models, CACM, 2012.]
Entity Extraction and Conversion
• Entity extraction
– String matching with entity labels
– Matching starts with longer entity labels
• e.g., from the text "financial crisis is …", only the entity "financial
crisis" is detected (not "crisis")
• Converting to entities
– Tri-gram and RAKE extract words and keywords (not entities)
– These are converted to entities by string matching with
entity labels before annotation selection
– If no matching entity label is found, the word or keyword is
discarded
kNN [1/2]
• Similarity measure
– Each document is represented as a vector where each
element is a score of a concept
– Cosine similarity is used as a similarity measure
Example: over the concepts (GDP, Immigration, Population, Bank, Interest rate, Canada),
d1 = (0.3, 0.5, 0.8, 0.1, 0.0, 0.5) and d2 = (0.6, 0.0, 0.4, 0.8, 0.4, 0.2);
cosine similarity sim(d1, d2) = 0.52
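The similarity value from this example can be reproduced in a few lines of numpy (a minimal sketch):

```python
import numpy as np

# Concept-score vectors over (GDP, Immigration, Population, Bank, Interest rate, Canada).
d1 = np.array([0.3, 0.5, 0.8, 0.1, 0.0, 0.5])
d2 = np.array([0.6, 0.0, 0.4, 0.8, 0.4, 0.2])

sim = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(sim, 2))  # 0.52
```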
kNN [2/2]
• k = 1: annotations of the single most similar document (similarity 0.60) are selected: Marketing; Competition law
• k = 2: annotations of the two most similar documents are selected: Marketing; Competition law; Finance; China
Evaluation Metrics
• Precision
• Recall
• F-measure
precision = |{relevant annotations} ∩ {retrieved annotations}| / |{retrieved annotations}|
recall = |{relevant annotations} ∩ {retrieved annotations}| / |{relevant annotations}|
F-measure = 2 · (precision · recall) / (precision + recall)
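A minimal sketch computing these metrics for one document (the annotation sets are hypothetical):

```python
def evaluate(retrieved, relevant):
    """Precision, recall, and F-measure for one document's annotation sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# 2 of 4 selected annotations are in the ground truth of 5.
print(evaluate({"Bank", "Tax", "Law", "China"}, {"Bank", "Tax", "Finance", "Trade", "Labour"}))
# (0.5, 0.4, 0.444...)
```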
Datasets
• Economics dataset
– 11 GB
• Political science dataset
– 3.8 GB
Experiments
• Preprocessing documents
– lemmatization
– stop word removal
• 10-fold cross validation
– split a dataset into 10 equal-sized subsets
– 8 subsets for training
– 1 subset for testing
– 1 subset for parameter optimization
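One way to realize this 8/1/1 split is sketched below (a minimal sketch; the exact fold assignment is an assumption):

```python
import random

def ten_fold_splits(doc_ids, seed=1):
    """Yield (train, validation, test) splits: 8 folds train, 1 validation, 1 test."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::10] for i in range(10)]
    for i in range(10):
        test = folds[i]
        validation = folds[(i + 1) % 10]
        train = [d for j, fold in enumerate(folds)
                 if j not in (i, (i + 1) % 10) for d in fold]
        yield train, validation, test

for train, val, test in ten_fold_splits(range(20)):
    print(len(train), len(val), len(test))  # 16 2 2
    break
```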
Result Table: Entity [1/2]
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .14 (.17) .14 (.15) .13 (.15) .22 (.20) .11 (.10) .14 (.12) .08 (.21) .08 (.21) .08 (.21)
CF-IDF .19 (.19) .18 (.17) .18 (.16) .24 (.21) .12 (.10) .15 (.12) .29 (.32) .30 (.32) .29 (.31)
Base Act. .10 (.14) .09 (.13) .09 (.13) .18 (.19) .09 (.09) .12 (.11) .20 (.30) .20 (.30) .20 (.29)
Branch Act. .08 (.14) .08 (.12) .08 (.12) .17 (.19) .08 (.09) .11 (.11) .17 (.28) .17 (.28) .17 (.27)
OneHop .12 (.16) .12 (.14) .12 (.14) .19 (.19) .09 (.09) .12 (.11) .35 (.34) .36 (.34) .35 (.33)
Degree .15 (.17) .14 (.15) .14 (.15) .23 (.20) .11 (.09) .14 (.12) .39 (.33) .40 (.33) .38 (.32)
HITS .14 (.17) .14 (.15) .14 (.15) .23 (.20) .11 (.10) .14 (.12) .40 (.32) .40 (.32) .39 (.31)
PageRank .14 (.17) .14 (.15) .14 (.15) .22 (.20) .11 (.09) .14 (.12) .39 (.33) .40 (.33) .38 (.32)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .12 (.11) .18 (.16) .14 (.12) .15 (.13) .12 (.10) .13 (.10) .14 (.17) .05 (.07) .07 (.09)
CF-IDF .05 (.07) .12 (.16) .07 (.10) .07 (.09) .08 (.10) .07 (.09) .24 (.22) .14 (.14) .17 (.16)
Base Act. .05 (.08) .10 (.13) .07 (.09) .10 (.10) .10 (.09) .09 (.09) .14 (.19) .07 (.10) .09 (.12)
Branch Act. .04 (.07) .08 (.12) .05 (.08) .09 (.09) .09 (.09) .08 (.09) .12 (.17) .06 (.10) .08 (.11)
OneHop .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .27 (.21) .26 (.21) .25 (.19)
Degree .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .29 (.21) .28 (.21) .27 (.19)
HITS .10 (.09) .21 (.17) .13 (.11) .13 (.11) .14 (.11) .13 (.10) .30 (.22) .29 (.21) .28 (.20)
PageRank .10 (.09) .20 (.17) .13 (.11) .13 (.10) .14 (.11) .13 (.10) .29 (.22) .29 (.21) .27 (.20)
Result Table: Entity [2/2]
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .18 (.21) .14 (.15) .15 (.16) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .24 (.16) .30 (.17)
CF-IDF .02 (.08) .02 (.06) .02 (.06) .03 (.11) .01 (.04) .02 (.05) .47 (.29) .23 (.17) .29 (.18)
Base Act. .17 (.20) .13 (.14) .14 (.15) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .22 (.15) .29 (.17)
Branch Act. .17 (.20) .12 (.14) .14 (.15) .21 (.22) .08 (.08) .11 (.11) .50 (.28) .22 (.15) .29 (.17)
OneHop .17 (.20) .13 (.14) .14 (.15) .21 (.22) .08 (.08) .11 (.11) .42 (.30) .25 (.21) .29 (.20)
Degree .17 (.21) .13 (.15) .14 (.16) .22 (.22) .08 (.08) .12 (.11) .49 (.28) .27 (.17) .33 (.18)
HITS .18 (.21) .14 (.15) .15 (.16) .21 (.22) .08 (.08) .11 (.11) .48 (.31) .27 (.18) .32 (.20)
PageRank .17 (.21) .13 (.15) .14 (.16) .22 (.22) .08 (.08) .12 (.11) .50 (.29) .25 (.15) .31 (.18)
Result Table: Tri-gram
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .12 (.15) .12 (.14) .11 (.14) .19 (.19) .10 (.10) .13 (.12) .08 (.22) .08 (.22) .08 (.21)
CF-IDF .10 (.12) .10 (.12) .09 (.11) .17 (.17) .08 (.10) .12 (.12) .07 (.20) .06 (.22) .06 (.20)
Degree .03 (.09) .03 (.08) .03 (.08) .03 (.09) .03 (.08) .03 (.08) .07 (.21) .07 (.21) .07 (.20)
HITS .02 (.06) .02 (.06) .02 (.06) .02 (.06) .02 (.06) .02 (.06) .08 (.22) .08 (.22) .07 (.21)
PageRank .03 (.09) .03 (.08) .03 (.08) .03 (.09) .03 (.08) .03 (.08) .10 (.20) .04 (.08) .05 (.11)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .06 (.08) .14 (.16) .08 (.10) .10 (.10) .11 (.11) .10 (.09) .08 (.14) .05 (.08) .06 (.09)
CF-IDF .05 (.05) .06 (.07) .05 (.06) .09 (.10) .09 (.10) .08 (.09) .09 (.15) .04 (.08) .06 (.10)
Degree .01 (.03) .03 (.07) .01 (.04) .01 (.03) .03 (.07) .01 (.04) .11 (.14) .03 (.05) .05 (.07)
HITS .01 (.03) .02 (.06) .01 (.03) .01 (.03) .00 (.06) .01 (.03) .12 (.14) .04 (.06) .06 (.08)
PageRank .01 (.04) .03 (.08) .02 (.05) .01 (.04) .03 (.08) .02 (.05) .08 (.12) .03 (.05) .04 (.06)
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .26 (.24) .20 (.18) .22 (.19) .54 (.30) .20 (.13) .29 (.17) .44 (.28) .25 (.18) .30 (.19)
CF-IDF .23 (.24) .18 (.18) .19 (.19) .54 (.29) .22 (.14) .30 (.17) .48 (.28) .20 (.14) .26 (.15)
Degree .09 (.15) .07 (.11) .07 (.12) .13 (.19) .05 (.07) .07 (.09) .48 (.29) .23 (.16) .29 (.18)
HITS .05 (.14) .04 (.09) .04 (.10) .11 (.18) .04 (.06) .06 (.09) .39 (.29) .26 (.21) .28 (.19)
PageRank .02 (.06) .02 (.05) .02 (.06) .03 (.08) .01 (.03) .02 (.05) .46 (.29) .25 (.18) .30 (.18)
Result Table: RAKE
Economics
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .08 (.14) .08 (.12) .08 (.12) .15 (.18) .07 (.08) .10 (.11) .34 (.33) .34 (.33) .33 (.32)
Political Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .04 (.07) .08 (.13) .05 (.08) .07 (.09) .08 (.09) .07 (.08) .31 (.23) .18 (.15) .22 (.17)
Computer Science
top-5 top-10 kNN
Recall Precision F Recall Precision F Recall Precision F
Frequency .24 (.24) .17 (.16) .19 (.17) .42 (.28) .15 (.10) .22 (.14) .42 (.27) .20 (.13) .25 (.15)
Result Table: LDA
Economics
kNN
Recall Precision F
Frequency .19 (.30) .19 (.30) .19 (.30)
Political Science
kNN
Recall Precision F
Frequency .15 (.19) .15 (.18) .14 (.17)
Computer Science
kNN
Recall Precision F
Frequency .28 (.27) .24 (.23) .24 (.22)
Materials
• Code
– https://github.com/ggb/ShortStories
• Datasets
– economics and political science
• not publicly available yet
• contact us directly if you are interested
– computer science
• publicly available
Presentation
• K-CAP 2015
– International Conference on Knowledge Capture
– Scope
• Knowledge Acquisition / Capture
• Knowledge Extraction from Text
• Semantic Web
• Knowledge Engineering and Modelling
• …
• Time slot
– Presentation: 25 minutes
– Q & A: 5 minutes
Reference
• [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation,
JMLR, 2003.
• [Blei 12] D. M. Blei. Probabilistic topic models, CACM, 2012.
• [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U.
Kaymak. News personalization using the CF-IDF semantic recommender, WIMS,
2011.
• [Grosse-Bolting et al. 15] G. Grosse-Bolting, C. Nishioka, and A. Scherp. Generic
process for extracting user profiles from social media using hierarchical knowledge
bases, ICSC, 2015.
• [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for
annotating biomedical articles, JAMIA, 2011.
• [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User
interests identification on Twitter using a hierarchical knowledge base, ESWC, 2014.
• [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. Semeval-2010
task 5: Automatic keyphrase extraction from scientific articles, International
Workshop on Semantic Evaluation, 2010.
Reference
• [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked
environment, Journal of the ACM, 1999.
• [Mihalcea & Tarau 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts,
EMNLP, 2004.
• [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank
citation ranking: bringing order to the web, TR of Stanford InfoLab, 1999.
• [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword
extraction from individual documents, Text Mining, 2010.
• [Zouaq et al. 12] A. Zouaq, D. Gašević, and M. Hatala. Voting theory for concept
detection, ESWC, 2012.
