Part II
Entity Retrieval
Krisztian Balog
University of Stavanger
Half-day tutorial at the WWW’13 conference | Rio de Janeiro, Brazil, 2013
What is an entity?
- Uniquely identifiable “thing” or “object”
- Properties:
- ID
- Name(s)
- Type(s)
- Attributes
- Relationships to other entities
Entity retrieval tasks
- Ad-hoc entity retrieval
- List completion
- Question answering
- Factual questions
- List questions
- Related entity finding
- Type-restricted variations
- People, blogs, products, movies, etc.
What’s so special about it?
- Entities are not always directly represented
- Recognise and disambiguate entities in text
- Collect and aggregate information about a given
entity from multiple documents and even multiple
data collections
- More structure
- Types (from some taxonomy)
- Attributes (from some ontology)
- Relationships to other entities (“typed links”)
In this Part
- Focus on the ad-hoc entity retrieval task
- Mainly probabilistic models
- Specifically, Language Models
Outline for Part II
- Crash course into probability theory
- Ranking with ready-made entity descriptions
- Ranking without explicit entity representations
- Evaluation initiatives
- Future directions
Ad-hoc entity retrieval
- Input: unconstrained natural language query
- “telegraphic” queries (neither well-formed nor
grammatically correct sentences or questions)
- Output: ranked list of entities
- Collection: unstructured and/or semi-structured documents
Ranking with ready-made
entity descriptions
This is not unrealistic...
Document-based entity
representations
- Each entity is described by a document
- Ranking entities much like ranking documents
- Unstructured
- Semi-structured
Standard Language Modeling
approach
- Rank documents d according to their likelihood
of being relevant given a query q: P(d|q)
P(d|q) = P(q|d)P(d) / P(q) ∝ P(q|d)P(d)

- Document prior P(d): probability of the document being relevant to any query
- Query likelihood P(q|d): probability that query q was "produced" by document d

P(q|d) = ∏_{t∈q} P(t|θ_d)^{n(t,q)}
Standard Language Modeling
approach (2)
P(q|d) = ∏_{t∈q} P(t|θ_d)^{n(t,q)}

- n(t,q): number of times t appears in q
- P(t|θ_d): document language model, a multinomial probability distribution over the vocabulary of terms

P(t|θ_d) = (1−λ) P(t|d) + λ P(t|C)

- P(t|d) = n(t,d) / |d|: empirical document model (maximum likelihood estimate)
- P(t|C) = Σ_d n(t,d) / Σ_d |d|: collection model (maximum likelihood estimate)
- λ: smoothing parameter
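The smoothed query-likelihood estimate above can be sketched in a few lines of Python. This is a toy implementation over pre-tokenised documents; the λ value and example tokens are illustrative, not from the tutorial:

```python
from collections import Counter
from math import log

def query_likelihood(query, doc, collection, lam=0.1):
    """log P(q|d) with Jelinek-Mercer smoothing:
    P(t|theta_d) = (1 - lam) * n(t,d)/|d| + lam * P(t|C)."""
    d = Counter(doc)
    c = Counter(t for d_ in collection for t in d_)  # collection model counts
    c_len = sum(c.values())
    score = 0.0
    for t in query:
        p_ml = d[t] / len(doc)  # empirical document model (ML estimate)
        p_c = c[t] / c_len      # collection model (ML estimate)
        score += log((1 - lam) * p_ml + lam * p_c)
    return score
```

Smoothing keeps the score finite when a query term is missing from the document, as long as the term occurs somewhere in the collection.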
Here, documents=entities, so
P(e|q) ∝ P(e)P(q|θ_e) = P(e) ∏_{t∈q} P(t|θ_e)^{n(t,q)}

- Entity prior P(e): probability of the entity being relevant to any query
- Entity language model P(t|θ_e): a multinomial probability distribution over the vocabulary of terms
Semi-structured entity
representation
- Entity description documents are rarely
unstructured
- Representing entities as
- Fielded documents -- the IR approach
- Graphs -- the DB/SW approach
dbpedia:Audi_A4
foaf:name Audi A4
rdfs:label Audi A4
rdfs:comment The Audi A4 is a compact executive car
produced since late 1994 by the German car
manufacturer Audi, a subsidiary of the
Volkswagen Group. The A4 has been built [...]
dbpprop:production 1994
2001
2005
2008
rdf:type dbpedia-owl:MeanOfTransportation
dbpedia-owl:Automobile
dbpedia-owl:manufacturer dbpedia:Audi
dbpedia-owl:class dbpedia:Compact_executive_car
owl:sameAs freebase:Audi A4
is dbpedia-owl:predecessor of dbpedia:Audi_A5
is dbpprop:similar of dbpedia:Cadillac_BLS
Mixture of Language Models
[Ogilvie & Callan, 2003]
- Build a separate language model for each field
- Take a linear combination of them
P(t|θ_d) = Σ_{j=1}^{m} μ_j P(t|θ_{d_j}), with Σ_{j=1}^{m} μ_j = 1

- μ_j: field weights
- P(t|θ_{d_j}): field language model, smoothed with a collection model built from all document representations of the same type in the collection
P. Ogilvie and J. Callan. Combining document representations for known item search. SIGIR'03.
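The linear combination of field models can be sketched as follows. A toy implementation, assuming entities represented as dicts mapping field names to token lists; field names, weights, and the smoothing parameter are illustrative:

```python
from collections import Counter
from math import log

def mlm_score(query, fields, weights, collection, lam=0.1):
    """log P(q|d) under the Mixture of Language Models:
    P(t|theta_d) = sum_j mu_j * P(t|theta_dj), with each field model
    smoothed against the collection model for that field."""
    score = 0.0
    for t in query:
        p = 0.0
        for f, mu in weights.items():
            d_f = Counter(fields.get(f, []))
            c_f = Counter(tok for doc in collection for tok in doc.get(f, []))
            p_ml = d_f[t] / max(len(fields.get(f, [])), 1)   # field ML estimate
            p_c = c_f[t] / max(sum(c_f.values()), 1)         # field collection model
            p += mu * ((1 - lam) * p_ml + lam * p_c)
        score += log(p) if p > 0 else float("-inf")
    return score
```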
Setting field weights
- Heuristically
- Proportional to the length of text content in that field,
to the field’s individual performance, etc.
- Empirically (using training queries)
- Problems
- Number of possible fields is huge
- It is not possible to optimise their weights directly
- Entities are sparse w.r.t. different fields
- Most entities have only a handful of predicates
Predicate folding
- Idea: reduce the number of fields by grouping
them together
- Grouping based on
- Type [Pérez-Agüera et al. 2010]
- Manually determined importance [Blanco et al. 2011]
R. Blanco, P. Mika, and S. Vigna. Effective and efficient entity search in RDF data. ISWC'11.
J.R. Pérez-Agüera, J. Arroyo, J. Greenberg, J.P. Iglesias, and V. Fresno. Using BM25F for
semantic search. SemSearch'10.
Hierarchical Entity Model
[Neumayer et al. 2012]
- Organise fields into a 2-level hierarchy
- Field types (4) on the top level
- Individual fields of that type on the bottom level
- Estimate field weights
- Using training data for field types
- Using heuristics for bottom-level types
R. Neumayer, K. Balog and K. Nørvåg. On the modeling of entities for ad-hoc entity search in
the web of data. ECIR'12.
Two-level hierarchy
foaf:name Audi A4
rdfs:label Audi A4
rdfs:comment The Audi A4 is a compact executive car
produced since late 1994 by the German car
manufacturer Audi, a subsidiary of the
Volkswagen Group. The A4 has been built [...]
dbpprop:production 1994
2001
2005
2008
rdf:type dbpedia-owl:MeanOfTransportation
dbpedia-owl:Automobile
dbpedia-owl:manufacturer dbpedia:Audi
dbpedia-owl:class dbpedia:Compact_executive_car
owl:sameAs freebase:Audi A4
is dbpedia-owl:predecessor of dbpedia:Audi_A5
is dbpprop:similar of dbpedia:Cadillac_BLS
Name
Attributes
Out-relations
In-relations
Formally
P(t|θ_d) = Σ_F P(t|F, d) P(F|d)

P(t|F, d) = Σ_{d_f ∈ F} P(t|d_f, F) P(d_f|F, d)

- Field type importance P(F|d) = P(F): taken to be the same for all entities
- Field generation P(d_f|F, d): uniform or estimated heuristically (based on length, popularity, etc.)
- Term generation P(t|d_f, F) = (1−λ) P(t|d_f) + λ P(t|θ_{d_F}): term importance is jointly determined by the field the term occurs in as well as all fields of that type (smoothed with a collection-level model)
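The two-level sum can be sketched directly from the formula. A minimal skeleton, where the component estimators are passed in as functions standing in for the smoothed estimates (all names and numbers below are illustrative):

```python
def hierarchical_term_prob(t, entity, type_weights, field_prob, term_prob):
    """P(t|theta_d) = sum_F P(F) * sum_{df in F} P(t|df,F) * P(df|F,d).
    `entity` maps each field type F to the list of its fields;
    `type_weights[F]` is P(F); `field_prob(f, F)` is P(df|F,d);
    `term_prob(t, f, F)` is the smoothed P(t|df,F)."""
    p = 0.0
    for ftype, fields in entity.items():
        inner = sum(term_prob(t, f, ftype) * field_prob(f, ftype) for f in fields)
        p += type_weights[ftype] * inner
    return p
```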
Comparison of models
[Figure: graphical comparison of the three models — unstructured document model (d → t), fielded document model (d → d_f → t), hierarchical document model (d → F → d_f → t)]
Probabilistic Retrieval Model
for Semistructured data
[Kim et al. 2009]
- Extension to the Mixture of Language Models
- Find which document field each query term
may be associated with
- Mapping probability P(d_j|t): estimated for each query term

Mixture of Language Models:
P(t|θ_d) = Σ_{j=1}^{m} μ_j P(t|θ_{d_j})

PRMS (static field weights replaced by mapping probabilities):
P(t|θ_d) = Σ_{j=1}^{m} P(d_j|t) P(t|θ_{d_j})
J. Kim, X. Xue, and W.B. Croft. A probabilistic retrieval model for semistructured data. ECIR'09.
Estimating the mapping
probability
P(d_j|t) = P(t|d_j)P(d_j) / P(t) = P(t|d_j)P(d_j) / Σ_{d_k} P(t|d_k)P(d_k)

- Term likelihood P(t|C_j) = Σ_d n(t, d_j) / Σ_d |d_j|: probability of a query term occurring in a given field type
- Prior field probability P(d_j): probability of mapping the query term to this field before observing collection statistics
Example
Query: "meg ryan war"

P(d_j|t) for each query term:
  meg:  cast 0.407, team 0.382, title 0.187
  ryan: cast 0.601, team 0.381, title 0.017
  war:  genre 0.927, title 0.070, location 0.002
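Computing the mapping probability is a single Bayes-rule normalisation. A sketch, where the field-level term statistics passed in are made-up stand-ins for P(t|C_j), not the actual values behind the example:

```python
def mapping_probs(field_stats, priors):
    """P(dj|t) ∝ P(t|dj) * P(dj), normalised over all fields.
    `field_stats[f]` plays the role of the term likelihood P(t|Cj);
    `priors[f]` is the prior field probability P(dj)."""
    joint = {f: field_stats[f] * priors[f] for f in field_stats}
    z = sum(joint.values())
    return {f: v / z for f, v in joint.items()} if z > 0 else dict(priors)
```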
The usual suspects from
document retrieval...
- Priors
- HITS, PageRank
- Document link indegree [Kamps & Koolen 2008]
- Pseudo relevance feedback
- Document-centric vs. entity-centric [Macdonald &
Ounis 2007; Serdyukov et al. 2007]
- sampling expansion terms from top ranked documents
and/or (profiles of) top ranked candidates
- Field-based [Kim & Croft 2012]
J. Kamps and M. Koolen. The importance of link evidence in Wikipedia. ECIR'08.
C. Macdonald and I. Ounis. Expertise drift and query expansion in expert search. CIKM'07.
P. Serdyukov, S. Chernov, and W. Nejdl. Enhancing expert search through query modeling. ECIR'07.
J.Y. Kim and W.B. Croft. A Field Relevance Model for Structured Document Retrieval. ECIR'12.
So far...
- Ranking (fielded) documents...
- What is special about entities?
- Type(s)
- Relationships with other entities
Entity types
rdf:type dbpedia-owl:MeanOfTransportation
dbpedia-owl:Automobile
Using target types
Assuming they have been identified...
- Constraining results
- Soft/hard filtering
- Different ways to measure type similarity (between
target types and the types associated with the entity)
- Set-based
- Content-based
- Lexical similarity of type labels
- Query expansion
- Adding terms from type names to the query
- Entity expansion
- Categories as a separate metadata field
Modeling terms and categories
[Balog et al. 2011]
K. Balog, M. Bron, and M. de Rijke. Query modeling for entity search based on terms,
categories and examples. TOIS'11.
- Term-based representation
  - Query model p(t|θ_q^T) and entity model p(t|θ_e^T), compared with KL(θ_q^T || θ_e^T)
- Category-based representation
  - Query model p(c|θ_q^C) and entity model p(c|θ_e^C), compared with KL(θ_q^C || θ_e^C)

P(e|q) ∝ P(q|e)P(e)

P(q|e) = (1−λ) P(θ_q^T|θ_e^T) + λ P(θ_q^C|θ_e^C)
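One way to operationalise the two representations is to score entities by a weighted combination of the two KL divergences (lower is better). A minimal sketch, assuming toy term and category distributions; λ = 0.5 and the floor value are illustrative:

```python
from math import log

def kl(p, q, eps=1e-9):
    """KL(p || q) over a shared vocabulary; unseen events floored at eps."""
    return sum(pv * log(pv / max(q.get(k, 0.0), eps))
               for k, pv in p.items() if pv > 0)

def divergence_score(q_terms, q_cats, e_terms, e_cats, lam=0.5):
    """Mixture of term-based and category-based divergence (lower = better):
    (1 - lam) * KL(theta_q^T || theta_e^T) + lam * KL(theta_q^C || theta_e^C)."""
    return (1 - lam) * kl(q_terms, e_terms) + lam * kl(q_cats, e_cats)
```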
Identifying target types
- Types of top ranked entities [Vallet & Zaragoza
2008]
- Direct term-based vs. indirect entity-based
representations [Balog & Neumayer 2012]
- Hierarchical case is difficult...
D. Vallet and H. Zaragoza. Inferring the most important types of a query: a semantic approach. SIGIR'08.
K. Balog and R. Neumayer. Hierarchical target type identification for entity-oriented queries. CIKM'12.
U. Sawant and S. Chakrabarti. Learning joint query interpretation and response ranking. WWW'13.
Expanding target types
- Pseudo relevance feedback
- Based on hierarchical structure
- Using lexical similarity of type labels
Ranking without explicit
entity representations
Scenario
- Entity descriptions are not readily available
- Entity occurrences are annotated
The basic idea
Use documents to get from queries to entities
[Figure: query q is connected to entity e through the documents that mention e]

- Query-document association: the document's relevance
- Document-entity association: how well the document characterises the entity
Two principal approaches
- Profile-based methods
- Create a textual profile for entities, then rank them
(by adapting document retrieval techniques)
- Document-based methods
- Indirect representation based on mentions identified
in documents
- First ranking documents (or snippets) and then
aggregating evidence for associated entities
Profile-based methods
[Figure: documents mentioning entity e are aggregated into a textual profile of e, which is then matched against query q]
Document-based methods
[Figure: documents are first ranked with respect to query q; evidence from entity mentions in the top-ranked documents is then aggregated per entity]
Many possibilities in terms of
modeling
- Generative probabilistic models
- Discriminative probabilistic models
- Voting models
- Graph-based models
Generative probabilistic
models
- Candidate generation models (P(e|q))
- Two-stage language model
- Topic generation models (P(q|e))
- Candidate model, a.k.a. Model 1
- Document model, a.k.a. Model 2
- Proximity-based variations
- Both families of models can be derived from the
Probability Ranking Principle [Fang & Zhai 2007]
H. Fang and C. Zhai. Probabilistic models for expert finding. ECIR'07.
Candidate models (“Model 1”)
[Balog et al. 2006]
P(q|θ_e) = ∏_{t∈q} P(t|θ_e)^{n(t,q)}

Smoothing with a collection-wide background model:
P(t|θ_e) = (1−λ) P(t|e) + λ P(t)

P(t|e) = Σ_d P(t|d, e) P(d|e)

- Term-candidate co-occurrence P(t|d, e): in a particular document; in the simplest case P(t|d)
- Document-entity association P(d|e)

K. Balog, L. Azzopardi, and M. de Rijke. Formal Models for Expert Finding in Enterprise Corpora. SIGIR'06.
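Model 1 can be sketched as follows: build the entity's term distribution from its associated documents, smooth it, then compute the query likelihood. A toy implementation with P(t|d,e) = P(t|d); association weights, λ, and tokens are illustrative:

```python
from collections import Counter
from math import log

def candidate_model_score(query, entity_docs, assoc, collection, lam=0.1):
    """Model 1: P(t|e) = sum_d P(t|d) * P(d|e), smoothed with the
    collection-wide background model, then log P(q|theta_e)."""
    c = Counter(t for d in collection for t in d)
    c_len = sum(c.values())
    score = 0.0
    for t in query:
        p_te = sum((Counter(d)[t] / len(d)) * a
                   for d, a in zip(entity_docs, assoc))
        p = (1 - lam) * p_te + lam * (c[t] / c_len)
        score += log(p) if p > 0 else float("-inf")
    return score
```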
Document models (“Model 2”)
[Balog et al. 2006]
P(q|e) = Σ_d P(q|d, e) P(d|e)

- Document relevance P(q|d, e): how well document d supports the claim that e is relevant to q
- Document-entity association P(d|e)

P(q|d, e) = ∏_{t∈q} P(t|d, e)^{n(t,q)}

Simplifying assumption (t and e are conditionally independent given d): P(t|d, e) = P(t|θ_d)

K. Balog, L. Azzopardi, and M. de Rijke. Formal Models for Expert Finding in Enterprise Corpora. SIGIR'06.
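Model 2 scores each document against the query first and then aggregates over the entity's associated documents. A toy sketch under the simplifying assumption P(t|d,e) = P(t|θ_d); λ and the token lists are illustrative:

```python
from collections import Counter

def document_model_score(query, docs, assoc, collection, lam=0.1):
    """Model 2: P(q|e) = sum_d P(q|theta_d) * P(d|e), with smoothed
    document models and association weights P(d|e)."""
    c = Counter(t for d in collection for t in d)
    c_len = sum(c.values())
    total = 0.0
    for d, a in zip(docs, assoc):
        d_cnt = Counter(d)
        p_q = 1.0
        for t in query:
            p_q *= (1 - lam) * d_cnt[t] / len(d) + lam * c[t] / c_len
        total += p_q * a
    return total
```

Note the difference in aggregation order from Model 1: here each document's query likelihood is computed separately before summing, rather than first pooling terms into an entity profile.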
Document-entity associations
- Boolean (or set-based) approach
- Weighted by the confidence in entity linking
- Consider other entities mentioned in the
document
Proximity-based variations
- So far, conditional independence assumption
between candidates and terms when
computing the probability P(t|d,e)
- Relationship between terms and entities that occur in the same document is ignored
- Entity is equally strongly associated with everything
discussed in that document
- Let’s capture the dependence between entities
and terms
- Use their distance in the document
Using proximity kernels
[Petkova & Croft 2007]
D. Petkova and W.B. Croft. Proximity-based document representation for named entity retrieval. CIKM'07.
P(t|d, e) = (1/Z) Σ_{i=1}^{N} δ_d(i, t) k(i, e)

- Indicator function δ_d(i, t): 1 if the term at position i is t, 0 otherwise
- Normalising constant Z
- Proximity-based kernel k(i, e):
  - constant function
  - triangle kernel
  - Gaussian kernel
  - step function
Figure taken from D. Petkova and W.B. Croft. Proximity-based document representation for named entity
retrieval. CIKM'07.
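The kernel-weighted estimate can be sketched with a Gaussian kernel on the distance between a term position and the closest mention of the entity. A toy implementation; σ and the example document are illustrative:

```python
from math import exp

def proximity_term_prob(t, doc, entity_positions, sigma=10.0):
    """P(t|d,e) = (1/Z) * sum_i delta(i,t) * k(i,e), where k is a
    Gaussian kernel on the distance from position i to the nearest
    mention of e, and Z normalises over all positions."""
    def k(i):
        dist = min(abs(i - j) for j in entity_positions)
        return exp(-dist ** 2 / (2 * sigma ** 2))
    weights = [k(i) for i in range(len(doc))]
    z = sum(weights)
    return sum(w for i, w in enumerate(weights) if doc[i] == t) / z
```

Terms close to an entity mention thus receive more probability mass than equally frequent terms far away, breaking the uniform-association assumption.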
Many possibilities in terms of
modeling
- Generative probabilistic models
- Discriminative probabilistic models
- Voting models
- Graph-based models
Discriminative models
- Vs. generative models:
- Fewer assumptions (e.g., term independence)
- “Let the data speak”
- Sufficient amounts of training data required
- Incorporating more document features, multiple
signals for document-entity associations
- Estimating P(r=1|e,q) directly (instead of P(e,q|r=1))
- Optimisation can get trapped in a local optimum
Arithmetic Mean
Discriminative (AMD) model
[Fang et al. 2010]
Y. Fang, L. Si, and A. P. Mathur. Discriminative models of integrating document evidence and
document-candidate associations for expert search. SIGIR'10.
P_θ(r = 1|e, q) = Σ_d P(r_1 = 1|q, d) P(r_2 = 1|e, d) P(d)

- Document prior P(d)
- Query-document relevance P(r_1 = 1|q, d) = σ(Σ_{i=1}^{N_f} α_i f_i(q, d)): logistic function over a linear combination of features
- Document-entity relevance P(r_2 = 1|e, d) = σ(Σ_{j=1}^{N_g} β_j g_j(e, d))
- σ: standard logistic function; α, β: weight parameters (learned); f, g: features
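The scoring function is a sum over documents of two logistic factors. A minimal sketch with pre-computed feature vectors; the feature values, weights, and priors below are illustrative placeholders, not learned parameters:

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def amd_score(query_feats, entity_feats, alpha, beta, doc_prior):
    """P(r=1|e,q) = sum_d P(r1=1|q,d) * P(r2=1|e,d) * P(d), each
    relevance factor a logistic function over weighted features.
    One entry per document in each of the three parallel lists."""
    total = 0.0
    for f, g, pd in zip(query_feats, entity_feats, doc_prior):
        p_qd = sigmoid(sum(a * fi for a, fi in zip(alpha, f)))  # query-doc
        p_ed = sigmoid(sum(b * gi for b, gi in zip(beta, g)))   # doc-entity
        total += p_qd * p_ed * pd
    return total
```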
Learning to rank
- Pointwise
- AMD, GMD [Fang et al. 2010]
- Multilayer perceptrons, logistic regression [Sorg &
Cimiano 2011]
- Additive Groves [Moreira et al. 2011]
- Pairwise
- Ranking SVM [Yang et al. 2009]
- RankBoost, RankNet [Moreira et al. 2011]
- Listwise
- AdaRank, Coordinate Ascent [Moreira et al. 2011]
P. Sorg and P. Cimiano. Finding the right expert: Discriminative models for expert retrieval. KDIR’11.
C. Moreira, P. Calado, and B. Martins. Learning to rank for expert search in digital libraries of academic
publications. PAI'11.
Z. Yang, J. Tang, B. Wang, J. Guo, J. Li, and S. Chen. Expert2bole: From expert finding to bole search.
KDD'09.
Voting models
[Macdonald & Ounis 2006]
- Inspired by techniques from data fusion
- Combining evidence from different sources
- Documents ranked w.r.t. the query are seen as
“votes” for the entity
C. Macdonald and I. Ounis. Voting for candidates: Adapting data fusion techniques for an expert
search task. CIKM'06.
Voting models
Many different variants, including...
- Votes: number of documents mentioning the entity
  Score(e, q) = |M(e) ∩ R(q)|
- Reciprocal Rank: sum of inverse ranks of documents
  Score(e, q) = Σ_{d ∈ M(e) ∩ R(q)} 1/rank(d, q)
- CombSUM: sum of scores of documents
  Score(e, q) = Σ_{d ∈ M(e) ∩ R(q)} s(d, q)

where M(e) is the set of documents mentioning e and R(q) the set of documents retrieved for q.
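The three variants differ only in how votes from retrieved documents are aggregated. A toy sketch, with the ranking given as (doc_id, score) pairs in rank order:

```python
def votes(mentions, ranking):
    """Score(e,q) = |M(e) ∩ R(q)|: retrieved docs mentioning e."""
    retrieved = {d for d, _ in ranking}
    return len(mentions & retrieved)

def reciprocal_rank(mentions, ranking):
    """Sum of 1/rank over retrieved docs mentioning e."""
    return sum(1.0 / r for r, (d, _) in enumerate(ranking, 1) if d in mentions)

def comb_sum(mentions, ranking):
    """Sum of retrieval scores of docs mentioning e."""
    return sum(s for d, s in ranking if d in mentions)
```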
Graph-based models
[Serdyukov et al. 2008]
- One particular way of constructing graphs
- Vertices are documents and entities
- Only document-entity edges
- Search can be approached as a random walk
on this graph
- Pick a random document or entity
- Follow links to entities or other documents
- Repeat it a number of times
P. Serdyukov, H. Rode, and D. Hiemstra. Modeling multi-step relevance prop- agation for expert
finding. CIKM'08.
Infinite random walk model
[Serdyukov et al. 2008]
P. Serdyukov, H. Rode, and D. Hiemstra. Modeling multi-step relevance propagation for expert finding.
CIKM'08.
P_i(d) = λ P_J(d) + (1−λ) Σ_{e→d} P(d|e) P_{i−1}(e)

P_i(e) = Σ_{d→e} P(e|d) P_{i−1}(d)

P_J(d) = P(d|q)

[Figure: bipartite graph of documents and entities, connected only by document-entity edges]
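The iteration can be sketched directly from the update equations, with uniform transition probabilities over the document-entity edges. A toy implementation; λ, the iteration count, and the graph are illustrative:

```python
def random_walk(p_dq, d2e, e2d, lam=0.3, iters=50):
    """Infinite random walk with jumps to documents:
    P_i(d) = lam * P(d|q) + (1-lam) * sum_{e->d} P(d|e) * P_{i-1}(e)
    P_i(e) = sum_{d->e} P(e|d) * P_{i-1}(d)
    P(e|d) and P(d|e) are uniform over each node's edges."""
    pd = dict(p_dq)
    pe = {e: 0.0 for e in e2d}
    for _ in range(iters):
        pe = {e: sum(pd[d] / len(d2e[d]) for d in e2d[e]) for e in e2d}
        pd = {d: lam * p_dq[d] + (1 - lam) *
                 sum(pe[e] / len(e2d[e]) for e in d2e[d]) for d in p_dq}
    return pe
```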
Further reading
K. Balog, Y. Fang, M. de Rijke, P. Serdyukov, and L. Si.
Expertise Retrieval. FnTIR'12.
Evaluation initiatives
Test collections
Campaign | Task | Collection | Entity repr. | #Topics
TREC Enterprise (2005-08) | Expert finding | Enterprise intranets (W3C, CSIRO) | Indirect | 99 (W3C), 127 (CSIRO)
TREC Entity (2009-11) | Related entity finding | Web crawl (ClueWeb09) | Indirect | 120
TREC Entity (2009-11) | List completion | Web crawl (ClueWeb09) | Indirect | 70
INEX Entity Ranking (2007-09) | Entity search | Wikipedia | Direct | 55
INEX Entity Ranking (2007-09) | List completion | Wikipedia | Direct | 55
SemSearch Chall. (2010-11) | Entity search | Semantic Web crawl (BTC2009) | Direct | 142
SemSearch Chall. (2010-11) | List search | Semantic Web crawl (BTC2009) | Direct | 50
INEX Linked Data (2012-13) | Ad-hoc search | Wikipedia + RDF (Wikipedia-LOD) | Direct | 100 ('12), 144 ('13)
Test collections (2)
- Entity search as Question Answering
- TREC QA track
- QALD-2 challenge
- INEX-LD Jeopardy task
Entity search in DBpedia
[Balog & Neumayer 2013]
- Synthesising queries and relevance
assessments from previous eval. campaigns
- From short keyword queries to natural
language questions
- 485 queries in total
- Results are mapped to DBpedia
K. Balog and R. Neumayer. A test collection for entity search in DBpedia. SIGIR'13.
Open challenges
- Combining text and structure
- Knowledge bases and unstructured Web documents
- Query understanding and modeling
- See [Sawant & Chakrabarti 2013] at the main
conference
- Result presentation
- How to interact with entities
U. Sawant and S. Chakrabarti. Learning joint query interpretation and response ranking. WWW'13.
Resources
- Complete tutorial material
http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
- Referred papers
http://www.mendeley.com/groups/3339761/entity-linking-and-retrieval-
tutorial-at-www-2013-and-sigir-2013/papers/

Entity Retrieval (WWW 2013 tutorial)

  • 1.
    Part II Entity Retrieval KrisztianBalog University of Stavanger Half-day tutorial at the WWW’13 conference | Rio de Janeiro, Brazil, 2013
  • 3.
    What is anentity? - Uniquely identifiable “thing” or “object” - Properties: - ID - Name(s) - Type(s) - Attributes - Relationships to other entities
  • 4.
    Entity retrieval tasks -Ad-hoc entity retrieval - List completion - Question answering - Factual questions - List questions - Related entity finding - Type-restricted variations - People, blogs, products, movies, etc.
  • 5.
    What’s so specialabout it? - Entities are not always directly represented - Recognise and disambiguate entities in text - Collect and aggregate information about a given entity from multiple documents and even multiple data collections - More structure - Types (from some taxonomy) - Attributes (from some ontology) - Relationships to other entities (“typed links”)
  • 6.
    In this Part -Focus on the ad-hoc entity retieval task - Mainly probabilistic models - Specifically, Language Models
  • 7.
    Outline for PartII - Crash course into probability theory - Ranking with ready-made entity descriptions - Ranking without explicit entity representations - Evaluation initiatives - Future directions
  • 8.
    Ad-hoc entity retrieval -Input: unconstrained natural language query - “telegraphic” queries (neither well-formed nor grammatically correct sentences or questions) - Output: ranked list of entities - Collection: unstructured and/or semi- structured documents
  • 9.
  • 10.
    This is notunrealistic...
  • 11.
    Document-based entity representations - Eachentity is described by a document - Ranking entities much like ranking documents - Unstructured - Semi-structured
  • 12.
    Standard Language Modeling approach -Rank documents d according to their likelihood of being relevant given a query q: P(d|q) P(d|q) = P(q|d)P(d) P(q) / P(q|d)P(d) Document prior Probability of the document being relevant to any query Query likelihood Probability that query q was “produced” by document d P(q|d) = Y t2q P(t|✓d)n(t,q)
  • 13.
    Standard Language Modeling approach(2) Number of times t appears in q Empirical document model Collection model Smoothing parameter Maximum likelihood estimates P(q|d) = Y t2q P(t|✓d)n(t,q) Document language model Multinomial probability distribution over the vocabulary of terms P(t|✓d) = (1 )P(t|d) + P(t|C) n(t, d) |d| P d n(t, d) P d |d|
  • 14.
    Here, documents=entities, so P(e|q)/ P(e)P(q|✓e) = P(e) Y t2q P(t|✓e)n(t,q) Entity prior Probability of the entity being relevant to any query Entity language model Multinomial probability distribution over the vocabulary of terms
  • 15.
    Semi-structured entity representation - Entitydescription documents are rarely unstructured - Representing entities as - Fielded documents -- the IR approach - Graphs -- the DB/SW approach
  • 16.
    dbpedia:Audi_A4 foaf:name Audi A4 rdfs:labelAudi A4 rdfs:comment The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...] dbpprop:production 1994 2001 2005 2008 rdf:type dbpedia-owl:MeanOfTransportation dbpedia-owl:Automobile dbpedia-owl:manufacturer dbpedia:Audi dbpedia-owl:class dbpedia:Compact_executive_car owl:sameAs freebase:Audi A4 is dbpedia-owl:predecessor of dbpedia:Audi_A5 is dbpprop:similar of dbpedia:Cadillac_BLS
  • 17.
    Mixture of LanguageModels [Ogilvie & Callan, 2003] - Build a separate language model for each field - Take a linear combination of them mX j=1 µj = 1 Field language model Smoothed with a collection model built from all document representations of the same type in the collectionField weights P(t|✓d) = mX j=1 µjP(t|✓dj ) P. Ogilvie and J. Callan. Combining document representations for known item search. SIGIR'03.
  • 18.
    Setting field weights -Heuristically - Proportional to the length of text content in that field, to the field’s individual performance, etc. - Empirically (using training queries) - Problems - Number of possible fields is huge - It is not possible to optimise their weights directly - Entities are sparse w.r.t. different fields - Most entities have only a handful of predicates
  • 19.
    Predicate folding - Idea:reduce the number of fields by grouping them together - Grouping based on - Type [Pérez-Agüera et al. 2010] - Manually determined importance [Blanco et al. 2011] R. Blanco, P. Mika, and S. Vigna. Effective and efficient entity search in RDF data. ISWC'11. J.R. Pérez-Agüera, J. Arroyo, J. Greenberg, J.P. Iglesias, and V. Fresno. Using BM25F for semantic search. SemSearch'10.
  • 20.
    Hierarchical Entity Model [Neumayeret al. 2012] - Organise fields into a 2-level hierarchy - Field types (4) on the top level - Individual fields of that type on the bottom level - Estimate field weights - Using training data for field types - Using heuristics for bottom-level types R. Neumayer, K. Balog and K. Nørvåg. On the modeling of entities for ad-hoc entity search in the web of data. ECIR'12.
  • 21.
    Two-level hierarchy foaf:name AudiA4 rdfs:label Audi A4 rdfs:comment The Audi A4 is a compact executive car produced since late 1994 by the German car manufacturer Audi, a subsidiary of the Volkswagen Group. The A4 has been built [...] dbpprop:production 1994 2001 2005 2008 rdf:type dbpedia-owl:MeanOfTransportation dbpedia-owl:Automobile dbpedia-owl:manufacturer dbpedia:Audi dbpedia-owl:class dbpedia:Compact_executive_car owl:sameAs freebase:Audi A4 is dbpedia-owl:predecessor of dbpedia:Audi_A5 is dbpprop:similar of dbpedia:Cadillac_BLS Name Attributes Out-relations In-relations
  • 22.
    Formally P(t|✓d) = X F P(t|F, d)P(F|d) P(t|F,d) = X df 2F P(t|df , F)P(df |F, d) Field type importance Taken to be the same for all entities P(F|d) = P(F) Term generation Importance of a term is jointly determined by the field it occurs as well as all fields of that type (smoothed with a coll. level model) Field generation Uniform or estimated heuristically (based on length, popularity, etc) P(t|df , F) = (1 )P(t|df ) + P(t|✓dF ) Term importance
  • 23.
    Comparison of models d dfF ... t dfFt ... ...d tdf ... tdf ...d t ... t Unstructured document model Fielded document model Hierarchical document model
  • 24.
    Probabilistic Retrieval Model forSemistructured data [Kim et al. 2009] - Extension to the Mixture of Language Models - Find which document field each query term may be associated with Mapping probability Estimated for each query term P(t|✓d) = mX j=1 µjP(t|✓dj ) P(t|✓d) = mX j=1 P(dj|t)P(t|✓dj ) J. Kim, X. Xue, and W.B. Croft. A probabilistic retrieval model for semistructured data. ECIR'09.
  • 25.
    Estimating the mapping probability Termlikelihood Probability of a query term occurring in a given field type Prior field probability Probability of mapping the query term to this field before observing collection statistics P(dj|t) = P(t|dj)P(dj) P(t) X dk P(t|dk)P(dk) P(t|Cj) = P d n(t, dj) P d |dj|
  • 26.
    Example Query: meg ryanwar cast 0.407 team 0.382 title 0.187 genre 0.927 title 0.070 location 0.002 cast 0.601 team 0.381 title 0.017 dj dj djP(t|dj) P(t|dj) P(t|dj)
  • 27.
    The usual suspectsfrom document retrieval... - Priors - HITS, PageRank - Document link indegree [Kamps & Koolen 2008] - Pseudo relevance feedback - Document-centric vs. entity-centric [Macdonald & Ounis 2007; Serdyukov et al. 2007] - sampling expansion terms from top ranked documents and/or (profiles of) top ranked candidates - Field-based [Kim & Croft 2011] J. Kamps and M. Koolen. The importance of link evidence in Wikipedia. ECIR'08. C. Macdonald and I. Ounis. Expertise drift and query expansion in expert search. CIKM'07. P. Serdyukov, S. Chernov, and W. Nejdl. Enhancing expert search through query modeling. ECIR'07. J.Y. Kim and W.B. Croft. A Field Relevance Model for Structured Document Retrieval. ECIR'12.
  • 28.
    So far... - Ranking(fielded) documents... - What is special about entities? - Type(s) - Relationships with other entities
  • 29.
  • 30.
    Using target types Assumingthey have been identified... - Constraining results - Soft/hard filtering - Different ways to measure type similarity (between target types and the types associated with the entity) - Set-based - Content-based - Lexical similarity of type labels - Query expansion - Adding terms from type names to the query - Entity expansion - Categories as a separate metadata field
  • 31.
    Modeling terms andcategories [Balog et al. 2011] K. Balog, M. Bron, and M. de Rijke. Query modeling for entity search based on terms, categories and examples. TOIS'11. Term-based representation Query model p(t|✓T e )p(t|✓T q ) p(c|✓C q ) p(c|✓C e ) Entity model Query model Entity model Category-based representation KL(✓T q ||✓T e ) KL(✓C q ||✓C e ) P(e|q) / P(q|e)P(e) P(q|e) = (1 )P(✓T q |✓T e ) + P(✓C q |✓C e )
  • 32.
    Identifying target types -Types of top ranked entities [Vallet & Zaragoza 2008] - Direct term-based vs. indirect entity-based representations [Balog & Neumayer 2012] - Hierarchical case is difficult... D. Vallet and H. Zaragoza. Inferring the most important types of a query: a semantic approach. SIGIR'08. K. Balog and R. Neumayer. Hierarchical target type identification for entity-oriented queries. CIKM'12. U. Sawant and S. Chakrabarti. Learning joint query interpretation and response ranking. WWW'13.
  • 33.
    Expanding target types -Pseudo relevance feedback - Based on hierarchical structure - Using lexical similarity of type labels
  • 34.
  • 35.
    Scenario - Entity descriptionsare not readily available - Entity occurrences are annotated
  • 36.
    The basic idea Usedocuments to get from queries to entities e xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x q xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx Query-document association the document’s relevance Document-entity association how well the document characterises the entity
  • 37.
    Two principal approaches -Profile-based methods - Create a textual profile for entities, then rank them (by adapting document retrieval techniques) - Document-based methods - Indirect representation based on mentions identified in documents - First ranking documents (or snippets) and then aggregating evidence for associated entities
  • 38.
    Profile-based methods q xxxx xxxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx e xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx e e
  • 39.
    Document-based methods q xxxx xxxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx xx x xxxx x xxx xx xxxx x xxx xx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxxx xxxxxx xx x xxx xx x xxxx xx xxx x xxxxx xx x xxx xx xxxx xx xxx xx x xxxxx xxx X e X X e e
  • 40.
    Many possibilities interms of modeling - Generative probabilistic models - Discriminative probabilistic models - Voting models - Graph-based models
  • 41.
    Generative probabilistic models - Candidategeneration models (P(e|q)) - Two-stage language model - Topic generation models (P(q|e)) - Candidate model, a.k.a. Model 1 - Document model, a.k.a. Model 2 - Proximity-based variations - Both families of models can be derived from the Probability Ranking Principle [Fang & Zhai 2007] H. Fang and C. Zhai. Probabilistic models for expert finding. ECIR'07.
  • 42.
    Candidate models (“Model1”) [Balog et al. 2006] P(q|✓e) = Y t2q P(t|✓e)n(t,q) Smoothing With collection-wide background model (1 )P(t|e) + P(t) X d P(t|d, e)P(d|e) K. Balog, L. Azzopardi, and M. de Rijke. Formal Models for Expert Finding in Enterprise Corpora. SIGIR'06. Document-entity association Term-candidate co-occurrence In a particular document. In the simplest case:P(t|d)
Document models (“Model 2”) [Balog et al. 2006]

P(q|e) = ∑_d P(q|d, e) P(d|e)

- P(d|e): document-entity association
- P(q|d, e): document relevance; how well document d supports the claim that e is relevant to q

P(q|d, e) = ∏_{t ∈ q} P(t|d, e)^{n(t,q)}

Simplifying assumption (t and e are conditionally independent given d): P(t|d, e) = P(t|θ_d)

K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. SIGIR'06.
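Model 2 ranks by summing document query-likelihoods weighted by the document-entity associations. A minimal sketch under the conditional-independence assumption, with hypothetical inputs and no smoothing (a real system would smooth P(t|θ_d) as on the previous slides):

```python
from collections import Counter


def model2_score(query, assoc_docs):
    """Document model ("Model 2"): P(q|e) = sum_d P(q|d, e) P(d|e),
    with P(q|d, e) ~ prod_t P(t|theta_d)^n(t,q).

    assoc_docs: list of (doc_tokens, P(d|e)) pairs for entity e.
    """
    q_counts = Counter(query.split())
    total = 0.0
    for toks, p_d in assoc_docs:
        tf = Counter(toks)
        p_q_d = 1.0
        for t, n in q_counts.items():
            p_q_d *= (tf[t] / len(toks)) ** n  # unsmoothed P(t|theta_d)
        total += p_q_d * p_d                   # weight by P(d|e)
    return total
```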
Document-entity associations
- Boolean (or set-based) approach
- Weighted by the confidence in entity linking
- Consider other entities mentioned in the document
Proximity-based variations
- So far: conditional independence assumption between candidates and terms when computing the probability P(t|d, e)
  - The relationship between terms and entities that appear in the same document is ignored
  - An entity is equally strongly associated with everything discussed in that document
- Let's capture the dependence between entities and terms
  - Use their distance in the document
Using proximity kernels [Petkova & Croft 2007]

P(t|d, e) = (1/Z) ∑_{i=1}^{N} δ_d(i, t) k(i, e)

- δ_d(i, t): indicator function; 1 if the term at position i is t, 0 otherwise
- Z: normalising constant
- k(i, e): proximity-based kernel, based on the distance between position i and the mention of e:
  - constant function
  - triangle kernel
  - Gaussian kernel
  - step function

D. Petkova and W.B. Croft. Proximity-based document representation for named entity retrieval. CIKM'07.
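A small sketch of this kernel-based estimate using a Gaussian kernel; it is an illustration in the spirit of Petkova & Croft, not their code. The entity is assumed to occur at a single known token position, and Z is simply the sum of all kernel weights so the result is a distribution over terms:

```python
import math


def gaussian_kernel_lm(doc_tokens, entity_pos, sigma=5.0):
    """Proximity-kernel estimate of P(t|d, e): each position i
    contributes weight k(i, e) = exp(-(i - entity_pos)^2 / (2 sigma^2)),
    and Z normalises the per-term weights into a distribution."""
    weights = {}
    z = 0.0
    for i, t in enumerate(doc_tokens):
        k = math.exp(-((i - entity_pos) ** 2) / (2 * sigma ** 2))
        weights[t] = weights.get(t, 0.0) + k
        z += k
    return {t: w / z for t, w in weights.items()}
```

Terms near the entity mention get a higher probability than distant ones; with a constant kernel the estimate falls back to the plain P(t|d) of the previous slides.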
Figure taken from D. Petkova and W.B. Croft. Proximity-based document representation for named entity retrieval. CIKM'07.
Many possibilities in terms of modeling
- Generative probabilistic models
- Discriminative probabilistic models
- Voting models
- Graph-based models
Discriminative models
- Vs. generative models:
  - Fewer assumptions (e.g., term independence)
  - “Let the data speak”
  - Sufficient amounts of training data required
- Incorporating more document features, multiple signals for document-entity associations
- Estimating P(r=1|e,q) directly (instead of P(e,q|r=1))
- Optimisation can get trapped in a local optimum
Arithmetic Mean Discriminative (AMD) model [Fang et al. 2010]

P_θ(r = 1|e, q) = ∑_d P(r₁ = 1|q, d) P(r₂ = 1|e, d) P(d)

- P(d): document prior
- P(r₁ = 1|q, d): query-document relevance
- P(r₂ = 1|e, d): document-entity relevance

Both relevance probabilities are a standard logistic function over a linear combination of features:

P(r₁ = 1|q, d) = σ(∑_{i=1}^{N_f} α_i f_i(q, d)),  P(r₂ = 1|e, d) = σ(∑_{j=1}^{N_g} β_j g_j(e, d))

with learned weight parameters α, β and features f, g.

Y. Fang, L. Si, and A. P. Mathur. Discriminative models of integrating document evidence and document-candidate associations for expert search. SIGIR'10.
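The AMD scoring step (with weights already learned) reduces to a sum over documents of two sigmoid-transformed linear feature combinations. A minimal sketch with hypothetical feature vectors and weights; the learning procedure itself is not shown:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def amd_score(doc_feats, alpha, beta):
    """AMD-style score: sum_d sigma(alpha . f(q,d)) * sigma(beta . g(e,d)) * P(d).

    doc_feats: list of (f_vec, g_vec, P(d)) triples, where f_vec are
    query-document features and g_vec are entity-document features.
    alpha, beta: learned weight vectors (assumed given here).
    """
    return sum(
        sigmoid(sum(a * x for a, x in zip(alpha, f)))
        * sigmoid(sum(b * y for b, y in zip(beta, g)))
        * p_d
        for f, g, p_d in doc_feats
    )
```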
Learning to rank
- Pointwise
  - AMD, GMD [Fang et al. 2010]
  - Multilayer perceptrons, logistic regression [Sorg & Cimiano 2011]
  - Additive Groves [Moreira et al. 2011]
- Pairwise
  - Ranking SVM [Yang et al. 2009]
  - RankBoost, RankNet [Moreira et al. 2011]
- Listwise
  - AdaRank, Coordinate Ascent [Moreira et al. 2011]

P. Sorg and P. Cimiano. Finding the right expert: Discriminative models for expert retrieval. KDIR'11.
C. Moreira, P. Calado, and B. Martins. Learning to rank for expert search in digital libraries of academic publications. PAI'11.
Z. Yang, J. Tang, B. Wang, J. Guo, J. Li, and S. Chen. Expert2bole: From expert finding to bole search. KDD'09.
Voting models [Macdonald & Ounis 2006]
- Inspired by techniques from data fusion
- Combining evidence from different sources
- Documents ranked w.r.t. the query are seen as “votes” for the entity

C. Macdonald and I. Ounis. Voting for candidates: Adapting data fusion techniques for an expert search task. CIKM'06.
Voting models
Many different variants, including...
- Votes: number of documents mentioning the entity
  Score(e, q) = |M(e) ∩ R(q)|
- Reciprocal Rank: sum of inverse ranks of documents
  Score(e, q) = ∑_{d ∈ M(e) ∩ R(q)} 1 / rank(d, q)
- CombSUM: sum of scores of documents
  Score(e, q) = ∑_{d ∈ M(e) ∩ R(q)} s(d, q)

where M(e) is the set of documents mentioning e and R(q) is the set of documents retrieved for q.
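The three variants above can be computed in one pass over the ranked list. A minimal sketch with hypothetical inputs: `ranked_docs` stands in for R(q) as a list of (doc_id, retrieval score) in rank order, and `mentions` for the set M(e):

```python
def voting_scores(ranked_docs, mentions):
    """Voting scores for one entity (Macdonald & Ounis style):
    Votes, Reciprocal Rank, and CombSUM over the documents in
    M(e) ∩ R(q)."""
    votes = sum(1 for d, _ in ranked_docs if d in mentions)
    recip_rank = sum(1.0 / rank
                     for rank, (d, _) in enumerate(ranked_docs, start=1)
                     if d in mentions)
    comb_sum = sum(s for d, s in ranked_docs if d in mentions)
    return votes, recip_rank, comb_sum
```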
Graph-based models [Serdyukov et al. 2008]
- One particular way of constructing graphs:
  - Vertices are documents and entities
  - Only document-entity edges
- Search can be approached as a random walk on this graph
  - Pick a random document or entity
  - Follow links to entities or other documents
  - Repeat it a number of times

P. Serdyukov, H. Rode, and D. Hiemstra. Modeling multi-step relevance propagation for expert finding. CIKM'08.
Infinite random walk model [Serdyukov et al. 2008]

P_i(d) = λ P_J(d) + (1 − λ) ∑_{e→d} P(d|e) P_{i−1}(e)
P_i(e) = ∑_{d→e} P(e|d) P_{i−1}(d)
P_J(d) = P(d|q)

P. Serdyukov, H. Rode, and D. Hiemstra. Modeling multi-step relevance propagation for expert finding. CIKM'08.
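These recurrences can be run as a simple power iteration on the bipartite document-entity graph. A minimal sketch, assuming uniform transition probabilities over outgoing edges (P(e|d) = 1/|entities in d|, P(d|e) = 1/|docs mentioning e|) and a jump distribution P_J(d) = P(d|q) supplied by the document retrieval step:

```python
def random_walk(p_jump, doc_to_ents, ent_to_docs, lam=0.2, iters=50):
    """Power iteration for the infinite random-walk model:
    P_i(d) = lam * P_J(d) + (1 - lam) * sum_{e->d} P(d|e) P_{i-1}(e)
    P_i(e) = sum_{d->e} P(e|d) P_{i-1}(d)
    """
    p_d = dict(p_jump)                 # P_0(d) = P_J(d) = P(d|q)
    p_e = {e: 0.0 for e in ent_to_docs}
    for _ in range(iters):
        # propagate document mass to entities: P(e|d) uniform
        p_e = {e: sum(p_d[d] / len(doc_to_ents[d]) for d in docs)
               for e, docs in ent_to_docs.items()}
        # propagate back, mixing in the jump to P(d|q): P(d|e) uniform
        p_d = {d: lam * p_jump[d]
               + (1 - lam) * sum(p_e[e] / len(ent_to_docs[e]) for e in ents)
               for d, ents in doc_to_ents.items()}
    return p_e
```

Entities reachable from many high-probability documents accumulate the most mass, which is the multi-step relevance propagation idea in a nutshell.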
Further reading
K. Balog, Y. Fang, M. de Rijke, P. Serdyukov, and L. Si. Expertise retrieval. FnTIR'12.
Test collections

Campaign | Task | Collection | Entity repr. | #Topics
TREC Enterprise (2005-08) | Expert finding | Enterprise intranets (W3C, CSIRO) | Indirect | 99 (W3C), 127 (CSIRO)
TREC Entity (2009-11) | Rel. entity finding | Web crawl (ClueWeb09) | Indirect | 120
TREC Entity (2009-11) | List completion | Web crawl (ClueWeb09) | Indirect | 70
INEX Entity Ranking (2007-09) | Entity search | Wikipedia | Direct | 55
INEX Entity Ranking (2007-09) | List completion | Wikipedia | Direct | 55
SemSearch Chall. (2010-11) | Entity search | Semantic Web crawl (BTC2009) | Direct | 142
SemSearch Chall. (2010-11) | List search | Semantic Web crawl (BTC2009) | Direct | 50
INEX Linked Data (2012-13) | Ad-hoc search | Wikipedia + RDF (Wikipedia-LOD) | Direct | 100 ('12), 144 ('13)
Test collections (2)
- Entity search as Question Answering
  - TREC QA track
  - QALD-2 challenge
  - INEX-LD Jeopardy task
Entity search in DBpedia [Balog & Neumayer 2013]
- Synthesising queries and relevance assessments from previous eval. campaigns
  - From short keyword queries to natural language questions
  - 485 queries in total
- Results are mapped to DBpedia

K. Balog and R. Neumayer. A test collection for entity search in DBpedia. SIGIR'13.
Open challenges
- Combining text and structure
  - Knowledge bases and unstructured Web documents
- Query understanding and modeling
  - See [Sawant & Chakrabarti 2013] at the main conference
- Result presentation
  - How to interact with entities

U. Sawant and S. Chakrabarti. Learning joint query interpretation and response ranking. WWW'13.
Resources
- Complete tutorial material
  http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
- Referred papers
  http://www.mendeley.com/groups/3339761/entity-linking-and-retrieval-tutorial-at-www-2013-and-sigir-2013/papers/