Lecture 2: Data Structures and Algorithms for Indexing: Information Retrieval, Computer Science Tripos Part II
Indexing
Information Retrieval
Computer Science Tripos Part II
Simone Teufel
Lent 2014
IR System Components
Overview
1 Index construction
Definitions
Example: index creation by sorting
Index creation; grouping step (“uniq”)
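The sorting and grouping steps named on these two slides can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `build_index`, the whitespace tokenisation and the two sample documents are assumptions made for the example.

```python
from itertools import groupby

def build_index(docs):
    """Build a term -> (document frequency, postings list) mapping.

    docs is a dict {doc_id: text}. Tokenisation here is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    # Step 1: emit one (term, doc_id) pair per token occurrence.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in text.lower().split()]
    # Step 2: sort by term, then by doc_id.
    pairs.sort()
    # Step 3: group identical terms and drop duplicate doc_ids ("uniq").
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = sorted({doc_id for _, doc_id in group})
        index[term] = (len(postings), postings)
    return index

# Hypothetical two-document collection.
docs = {1: "I did enact Julius Caesar", 2: "So let it be with Caesar"}
index = build_index(docs)
```

With these sample documents, `index["caesar"]` holds a document frequency of 2 and the postings list `[1, 2]`.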
Optimisation: Skip Lists
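One way to picture the skip-list optimisation is postings intersection with skip pointers. The sketch below simulates skip pointers on plain Python lists; placing a pointer every sqrt(length) entries is a common heuristic and an assumption here, as is the function name `intersect_with_skips`.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, following skip pointers when safe.

    Skip pointers are simulated: position i has a pointer to i + skip
    whenever i is a multiple of skip (skip = floor(sqrt(len)), an assumption).
    """
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

# Hypothetical postings lists.
result = intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                              [1, 2, 3, 5, 8, 41, 51, 60, 71])
```

A skip is taken only when the target docID is still no larger than the current docID on the other list, so no common document can be jumped over.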
Parsing a document
Format/Language: Complications
Normalisation
Tokenisation
Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.
neill aren’t
oneill arent
o’neill aren t
o’ neill are n’t
o neill
?
?
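To make the alternatives above concrete, here is a small sketch of two possible tokenisation policies applied to the example sentence. Both functions are illustrative assumptions, not a recommended tokeniser, and the sentence is written with straight apostrophes for simplicity.

```python
import re

SENTENCE = ("Mr. O'Neill thinks that the boys' stories about "
            "Chile's capital aren't amusing.")

def split_on_non_letters(text):
    """Policy A: every run of letters is a token; apostrophes split words."""
    return re.findall(r"[a-z]+", text.lower())

def keep_internal_apostrophes(text):
    """Policy B: an apostrophe between letters stays inside the token."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

tokens_a = split_on_non_letters(SENTENCE)
tokens_b = keep_internal_apostrophes(SENTENCE)
```

Policy A yields the pieces `o`, `neill`, `aren`, `t`, while policy B yields `o'neill` and `aren't`. Whichever policy is chosen, queries must be tokenised the same way as documents, otherwise the terms will not match.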
Tokenisation problems: One word or two? (or several)
Hewlett-Packard
State-of-the-art
co-education
the hold-him-back-and-drag-him-away maneuver
data base
San Francisco
Los Angeles-based company
fares
Numbers
20/3/91
3/20/91
Mar 20, 1991
B-52
6-year-old
100.2.86.144
(800) 234-2333
800.234.2333
.74189359872398457
Chinese: No Whitespace
Script-related Problems
German
Lebensversicherungsgesellschaftsangestellter
leben+s+versicherung+s+gesellschaft+s+angestellter
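A dictionary-based compound splitter gives a rough idea of how the decomposition above could be computed. This is a simplified sketch: the tiny lexicon, the treatment of "s" as the only linking element, and the function name `split_compound` are all assumptions for illustration.

```python
def split_compound(word, lexicon, linkers=("s",)):
    """Split a compound into lexicon words, allowing a linking element
    (here only "s") between parts. Returns a list of parts, or None if
    no complete split is found."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest possible first part before shorter ones.
    for i in range(len(word) - 1, 0, -1):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        # Try each linking element, then no linking element at all.
        for link in tuple(linkers) + ("",):
            if rest.startswith(link) and len(rest) > len(link):
                tail = split_compound(rest[len(link):], lexicon, linkers)
                if tail:
                    return [head] + tail
    return None

# Hypothetical four-word lexicon.
LEXICON = {"leben", "versicherung", "gesellschaft", "angestellter"}
parts = split_compound("Lebensversicherungsgesellschaftsangestellter", LEXICON)
```

On this input the sketch returns the four content parts and drops the linking "s" elements. A real splitter needs a large lexicon and a way to choose between competing splits.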
Other cases of “no whitespace”: Agglutination
“Agglutinative” languages do this not just for compounds:
Inuit
tusaatsiarunnangittualuujunga
(= “I can’t hear very well”)
Finnish
epäjärjestelmällistyttämättömyydellänsäkäänköhän
(= “I wonder if - even with his/her quality of not
having been made unsystematized”)
Turkish
Çekoslovakyalılaştıramadıklarımızdanmışçasına
(= “as if you were one of those whom we could not
make resemble the Czechoslovakian people”)
Casefolding, accents, diacritics
Stop words
a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of,
on, that, the, to, was, were, will, with
to be or not to be
prince of Denmark
bamboo in water
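The effect of the stop list on the three example queries can be checked with a short sketch. The stop list is the one given above; the function name `remove_stop_words` and the whitespace tokenisation are assumptions.

```python
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "he", "in", "is", "it", "its", "of", "on", "that", "the", "to", "was",
    "were", "will", "with",
}

def remove_stop_words(query):
    """Drop every token that appears in the stop list (case-insensitive)."""
    return [tok for tok in query.split() if tok.lower() not in STOP_WORDS]

hamlet = remove_stop_words("to be or not to be")
prince = remove_stop_words("prince of Denmark")
bamboo = remove_stop_words("bamboo in water")
```

Only "or" and "not" survive from the first query, and the other two lose the preposition that relates their content words, which is why stop word removal can hurt phrase-like queries.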
Lemmatisation
am, are, is → be
car, car’s, cars’, cars → car
the boy’s cars are different colours → the boy car be different color
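As a toy illustration of the mapping above, lemmatisation can be pictured as a lookup from inflected form to dictionary form. The table below is hand-written for this one example and is an assumption; real lemmatisers rely on morphological analysis and a full lexicon rather than a fixed table.

```python
# Hand-written lookup table covering only the slide's examples (an assumption).
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "car's": "car", "cars'": "car", "cars": "car",
    "boy's": "boy",
    "colours": "color",
}

def lemmatise(tokens):
    """Replace each token by its lemma if the table knows it."""
    return [LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]

lemmas = lemmatise("the boy's cars are different colours".split())
```

Tokens that are missing from the table pass through unchanged, which is where a table-only approach breaks down on unseen text.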
Stemming
Porter Stemmer
Stemming: Representation of a word
[C] (VC){m}[V]
C : one or more adjacent consonants
V : one or more adjacent vowels
[ ] : optionality
( ) : group operator
{x} : repetition x times
m : the “measure” of a word
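The measure m can be computed by labelling each letter as consonant or vowel and counting how often a vowel run is followed by a consonant run. The sketch below assumes Porter's usual convention that "y" counts as a vowel only when it follows a consonant; the function names are illustrative.

```python
def cv_labels(word):
    """Label each letter 'C' or 'V'. The letter y is treated as a vowel
    only when the preceding letter is a consonant."""
    labels = []
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou" or (ch == "y" and i > 0 and labels[i - 1] == "C"):
            labels.append("V")
        else:
            labels.append("C")
    return "".join(labels)

def measure(word):
    """Return m in [C](VC){m}[V]: the number of vowel-to-consonant
    transitions in the word."""
    labels = cv_labels(word)
    return sum(1 for a, b in zip(labels, labels[1:]) if a == "V" and b == "C")
```

For instance, `measure("tree")` is 0, `measure("trouble")` is 1 and `measure("private")` is 2.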
Porter stemmer: selected rules
SSES → SS
IES → I
SS → SS
S →
caresses → caress
cares → care
(m>0) EED → EE
feed → feed
agreed → agree
BUT: freed, succeed
(*v*) ED →
plastered → plaster
bled → bled
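The selected rules above can be sketched in a few lines. This covers only the rules listed on the slide, not the full Porter algorithm (the clean-up that normally follows ED removal is omitted), and the helper names are assumptions.

```python
def _labels(word):
    """Label letters 'C'/'V'; y is a vowel only after a consonant."""
    out = []
    for i, ch in enumerate(word):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and out[i - 1] == "C")
        out.append("V" if is_vowel else "C")
    return "".join(out)

def _measure(stem):
    """Number of vowel-to-consonant transitions (the measure m)."""
    labels = _labels(stem)
    return sum(1 for a, b in zip(labels, labels[1:]) if a == "V" and b == "C")

def plural_rules(word):
    """SSES -> SS, IES -> I, SS -> SS, S -> (nothing); longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def past_tense_rules(word):
    """(m>0) EED -> EE and (*v*) ED -> (nothing)."""
    if word.endswith("eed"):
        # The condition is tested on the stem before the suffix.
        return word[:-1] if _measure(word[:-3]) > 0 else word
    if word.endswith("ed"):
        stem = word[:-2]
        return stem if "V" in _labels(stem) else word
    return word
```

Under these rules "freed" is left alone because its stem "fr" has m = 0, while "succeed" loses its final letter because "succ" has m = 1, which matches the exceptions flagged on the slide.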
Three stemmers: a comparison
Such an analysis can reveal features that are not easily visible from the
variations in the individual genes and can lead to a picture of expression that is
more biologically transparent and accessible to interpretation.
Porter Stemmer
such an analysi can reveal featur that ar not easili visibl from the variat in the
individu gene and can lead to a pictur of express that is more biolog transpar
and access to interpret
Lovins Stemmer
such an analys can reve featur that ar not eas vis from th vari in th individu
gen and can lead to a pictur of expres that is mor biolog transpar and acces to
interpres
Paice Stemmer
such an analys can rev feat that are not easy vis from the vary in the individ
gen and can lead to a pict of express that is mor biolog transp and access to
interpret
Does stemming improve effectiveness?
More equivalence classing
Phrase Queries
Biword indexes
Longer phrase queries
Issues with biword indexes
Positional indexes
Positional indexes: Example
be, 178239:
〈 1: 〈 17, 25 〉 ;
4: 〈 17, 191, 291, 430, 434 〉 ;
5: 〈 14, 19, 101 〉 ; . . . 〉
As always:
Document 4 is a match!
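A phrase query over a positional index can be sketched as follows. The postings for "be" are the ones listed above; the postings for "to" do not appear in this text, so the values used here are illustrative assumptions, as is the function name `phrase_match`.

```python
def phrase_match(first, second):
    """Return {doc_id: [start positions]} where a position p of the first
    term is immediately followed by the second term at p + 1."""
    hits = {}
    for doc_id in sorted(first.keys() & second.keys()):
        second_positions = set(second[doc_id])
        starts = [p for p in first[doc_id] if p + 1 in second_positions]
        if starts:
            hits[doc_id] = starts
    return hits

# Positional postings for "be", taken from the example above.
BE = {1: [17, 25], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
# Positional postings for "to": assumed values for illustration only.
TO = {1: [7, 18, 33, 72, 86, 231], 2: [1, 17, 74, 222, 255],
      4: [8, 16, 190, 429, 433], 5: [363, 367], 7: [13, 23, 191]}

hits = phrase_match(TO, BE)
```

With these postings only document 4 contains "to" directly followed by "be", for example at positions 16 and 17.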
Complexity of search with positional index
Proximity search
employment /4 place
Find all documents that contain employment and place within
4 words of each other.
HIT: Employment agencies that place healthcare workers are
seeing growth.
NO HIT: Employment agencies that have learned to adapt
now place healthcare workers.
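The /4 operator can be checked against the two example sentences with a small sketch that works on position lists, as a positional index would. The letter-only tokenisation and the names `positions` and `near` are assumptions made for the example.

```python
import re

def positions(text, term):
    """Sorted token positions of term in text (lowercased, letters only)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [i for i, tok in enumerate(tokens) if tok == term]

def near(pos1, pos2, k):
    """True if some position in pos1 is within k tokens of some position
    in pos2. Both lists must be sorted; one linear merge pass suffices."""
    i = j = 0
    while i < len(pos1) and j < len(pos2):
        if abs(pos1[i] - pos2[j]) <= k:
            return True
        # Advance the pointer that is behind to shrink the gap.
        if pos1[i] < pos2[j]:
            i += 1
        else:
            j += 1
    return False

HIT = "Employment agencies that place healthcare workers are seeing growth."
NO_HIT = ("Employment agencies that have learned to adapt "
          "now place healthcare workers.")

hit = near(positions(HIT, "employment"), positions(HIT, "place"), 4)
no_hit = near(positions(NO_HIT, "employment"), positions(NO_HIT, "place"), 4)
```

In the first sentence the two terms are 3 tokens apart, so it matches; in the second they are 8 tokens apart, so it does not.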
Proximity search with positional index
Take-away
Reading