CS726 Handouts
CS726
Lecture # 1
• Introduction
Library Digitization
Number of Results
Test Corpus
Search Computing
Books
Lecture # 2
• Introduction
• Information Retrieval Models
• Boolean Retrieval Model
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning,
and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• What is Information Retrieval???
• IR Models
• The Boolean Model
• Considerations on the Boolean Model
Introduction
• What is Information Retrieval???
• Information retrieval (IR) deals with the representation, storage,
organization of, and access to information items.
Information Retrieval
IR Models
• Modeling in IR is a complex process aimed at producing a ranking function
• Ranking function: a function that assigns scores to documents with regard to a given
query
• This process consists of two main tasks:
• The conception of a logical framework for representing documents and queries
• The definition of a ranking function that allows quantifying the similarities among
documents and queries
IR Models
• An IR model is a quadruple [D, Q, F, R(qi, dj)] where
• 1. D is a set of logical views for the documents in the collection
• 2. Q is a set of logical views for the user queries
• 3. F is a framework for modeling documents and queries
• 4. R(qi, dj) is a ranking function
d2^T = [1, 0, 0]
d3^T = [0, 1, 0]
R_t1 = {d1, d2}    R_t2 = {d1, d3}    R_t3 = {d1}
Query optimization
• Is the process of selecting how to organize the work of answering a
query
GOAL: minimize the total amount of work performed by the
system.
Order of evaluation
t1 = {d1, d3, d5, d7}
t2 = {d2, d3, d4, d5, d6}
t3 = {d4, d6, d8}
q = t1 AND t2 OR t3
• From the left: (t1 AND t2) OR t3 = {d3, d5} OR {d4, d6, d8} = {d3, d4, d5, d6, d8}
• From the right: t1 AND (t2 OR t3) = {d1, d3, d5, d7} AND {d2, d3, d4, d5, d6, d8} = {d3, d5}
• Standard evaluation priority: not, and, or
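• A minimal Python sketch of the two evaluation orders, using sets for the example postings above (the docIDs are the ones listed):

t1 = {1, 3, 5, 7}              # docIDs containing t1
t2 = {2, 3, 4, 5, 6}           # docIDs containing t2
t3 = {4, 6, 8}                 # docIDs containing t3

left_to_right = (t1 & t2) | t3     # (t1 AND t2) OR t3 -> {3, 4, 5, 6, 8}
right_to_left = t1 & (t2 | t3)     # t1 AND (t2 OR t3) -> {3, 5}
print(sorted(left_to_right), sorted(right_to_left))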
Resources
• Modern Information Retrieval
• Chapter 1 of IIR
• Resources at http://ifnlp.org/ir
Boolean Retrieval
Lecture # 3
• Boolean Retrieval Model
• Rank Retrieval Model
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning,
and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Boolean Retrieval Model
• Information Retrieval Ingredients
• Westlaw
• Ranked retrieval models
Boolean queries
The Boolean retrieval model can answer any query that is a Boolean expression.
Boolean queries are queries that use AND, OR and NOT to join
query terms.
Views each document as a set of terms.
Is precise: Document matches condition or not.
Many search systems you use are also Boolean: spotlight, email, intranet etc.
• Query formulation
• Query processing
Westlaw
Commercially successful Boolean retrieval: Westlaw
• Largest commercial legal search service in terms of the number of paying
subscribers
• Over half a million subscribers performing millions of searches a day over tens of
terabytes of text data
• The service was started in 1975.
• In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the
default, and used by a large percentage of users . . .
• Boolean queries often result in either too few (=0) or too many (1000s) results.
• It takes a lot of skill to come up with a query that produces a manageable number of hits.
• AND gives too few; OR gives too many
• Free text queries: Rather than a query language of operators and expressions, the user’s query is
just one or more words in a human language
• In principle, there are two separate choices here, but in practice, ranked retrieval has normally
been associated with free text queries and vice versa
• How can we rank-order the documents in the collection with respect to a query?
• Resources
• Chapter 1 of IIR
• Resources at http://ifnlp.org/ir
• Boolean Retrieval
Lecture # 4
• Vector Space Retrieval Model
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
Outline
• The Vector Model
• Term frequency tf
• Document frequency
• tf-idf weighting
• Term weights are used to compute the degree of similarity (or score) between each
document stored in the system and the user query
• By sorting in decreasing order of this degree of similarity, the vector model takes
into consideration documents which match the query terms only partially.
• Ranked document answer set is a lot more precise (in the sense that it better
matches the user information need) than the document answer set retrieved by
the Boolean model
• Documents whose content (terms in the document) correspond most closely to the
content of the query are judged to be the most relevant
Each document is represented by a binary vector ∈ {0,1}^|V|
Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0
Term frequency tf
• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
• We want to use tf when computing query-document match scores. But how?
• Raw term frequency is not what we want:
• A document with 10 occurrences of the term is more relevant than a document with 1
occurrence of the term.
• But not 10 times more relevant.
• Relevance does not increase proportionally with term frequency.
• Normalized term frequency: tf_i = tf_i / tf_max
• This brings the term frequency of each term in a document to a value between 0 and 1, irrespective of the size of the document.
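• A small Python sketch of this maximum-tf normalization (the example counts are illustrative):

def normalized_tf(term_counts):
    # term_counts: dict mapping term -> raw count within one document
    tf_max = max(term_counts.values())
    return {term: count / tf_max for term, count in term_counts.items()}

print(normalized_tf({"caesar": 232, "antony": 157, "brutus": 4}))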
Document frequency
• Rare terms are more informative than frequent terms
• Stop words
• Consider a term in the query that is rare in the collection (e.g., arachnocentric)
• A document containing this term is very likely to be relevant to the query arachnocentric
• → We want a high weight for rare terms like arachnocentric.
Document frequency, continued
• Consider a query term that is frequent in the collection (e.g., high, increase, line)
• A document containing such a term is more likely to be relevant than a document that doesn’t
term        df_t        idf_t = log10(1,000,000 / df_t)
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0
tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight.
w_{t,d} = log(1 + tf_{t,d}) × log10(N / df_t)
• Best known weighting scheme in information retrieval
• Note: the “-” in tf-idf is a hyphen, not a minus sign!
• Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
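• A sketch of the weighting formula above in Python (natural log for the tf part and base-10 log for the idf part, as the formula is written; a different base only rescales the scores):

import math

def tf_idf_weight(tf, df, N):
    # w_{t,d} = log(1 + tf_{t,d}) * log10(N / df_t); absent terms get weight 0
    if tf == 0 or df == 0:
        return 0.0
    return math.log(1 + tf) * math.log10(N / df)

# a term occurring 10 times in a document and appearing in 1,000 of 1,000,000 documents:
print(tf_idf_weight(tf=10, df=1000, N=1000000))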
Resources
• IIR 6.2 – 6.4.3
• http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
Lecture # 5
• TF-IDF Weighting
• Document Representation in Vector Space
• Query Representation in Vector Space
• Similarity Measures
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and
Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Computing TF-IDF
• Mapping to vectors
• Coefficients
• Query Vector
• Similarity Measure
• Jaccard coefficient
• Inner Product
Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      1                      1               0             0        0         1
Brutus      1                      1               0             1        0         0
Caesar      1                      1               0             1        1         1
Calpurnia   0                      1               0             0        0         0
Cleopatra   1                      0               0             0        0         0
mercy       1                      0               1             1        1         1
worser      1                      0               1             1        1         0
Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0
Mapping to vectors
• terms: an axis for every term
• vectors corresponding to terms are canonical vectors
• documents: sum of the vectors corresponding to terms in the doc
• queries: treated the same as documents
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
Coefficients
• The coefficients (vector lengths, term weights) represent term presence, importance, or “aboutness”
• Magnitude along each dimension
• The model gives no guidance on how to set term weights
• Some common choices:
• Binary: 1 = term is present, 0 = term not present in document
• tf: the frequency of the term in the document
• tf·idf: idf indicates the discriminatory power of the term
• tf·idf is far and away the most common
• Numerous variations…
Graphic Representation
• Example documents plotted in a two-term (cat, lion) vector space: “cat”, “cat cat”, “cat lion”, “lion cat”
Query Vector
• Query vector is typically treated as a document and also tf-idf weighted.
• Alternative is for the user to supply weights for the given query terms.
Similarity Measure
• A similarity measure is a function that computes the degree of similarity between two vectors.
• Using a similarity measure between the query and each document:
• It is possible to rank the retrieved documents in the order of presumed
relevance.
• It is possible to enforce a certain threshold so that the size of the retrieved set
can be controlled.
Jaccard coefficient
• A commonly used measure of overlap of two sets A and B
• jaccard(A,B) = |A ∩ B| / |A ∪ B|
• jaccard(A,A) = 1
• jaccard(A,B) = 0 if A ∩ B = 0
• A and B don’t have to be the same size.
• Always assigns a number between 0 and 1.
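• A minimal Python sketch of the Jaccard coefficient (the example sets are made up):

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard({"ides", "of", "march"}, {"march", "of", "caesar"}))   # 0.5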
Issues with Jaccard for scoring
• It doesn’t consider term frequency (how many times a term occurs in a document)
• Rare terms in a collection are more informative than frequent terms. Jaccard doesn’t
consider this information
• We need a more sophisticated way of normalizing for length
• Later we will use |A ∩ B| / √(|A ∪ B|) instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization
Inner Product
• sim(d_j, q) = d_j • q = Σ_{i=1..t} w_ij · w_iq
where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query
• For binary vectors, the inner product is the number of matched query terms in the document
(size of intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Similarity, Normalized
• Cosine similarity: sim(d_j, q) = (d_j • q) / (|d_j| · |q|) = Σ_i w_ij · w_iq / (√(Σ_i w_ij²) · √(Σ_i w_iq²))
Resources
• Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted
them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
Lecture # 6
• Similarity Measures
• Cosine Similarity Measure
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning,
and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Cosine Similarity
• Basic indexing pipeline
• Sparse Vectors
• Inverted Index
Similarity, Normalized
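• A sketch of the length-normalized (cosine) similarity, with documents and queries represented as term → weight dicts (an illustrative representation):

import math

def cosine_similarity(d, q):
    # sim(d, q) = (d • q) / (|d| |q|)
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_similarity({"cat": 2.0, "lion": 1.0}, {"cat": 1.0}))   # ≈ 0.894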
Sparse Vectors
• Vocabulary and therefore dimensionality of vectors can be very large, ~10^4.
• However, most documents and queries do not contain most words, so vectors are sparse (i.e.
most entries are 0).
• Need efficient methods for storing and computing with sparse vectors.
• Requires linear search of the list to find (or change) the weight of a specific token.
• The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-
based data structure (trie, B-tree).
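• One possible sketch of a sparse representation, with dictionaries standing in for the hash-table option above:

from collections import defaultdict

index = defaultdict(dict)               # term -> {docID: weight}; zeros are never stored

def add_document(doc_id, weights):
    # weights: dict of term -> weight for one document (illustrative input format)
    for term, w in weights.items():
        index[term][doc_id] = w

add_document(1, {"antony": 157, "brutus": 4})
add_document(2, {"brutus": 157, "caesar": 227})
print(index["brutus"])                  # {1: 4, 2: 157}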
Inverted Index
Lecture # 7
• Parsing Documents
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
Outline
• Basic indexing pipeline
• Inverted Index
• Cosine Similarity Measure
• Time Complexity of Indexing
• Retrieval with an Inverted Index
• Inverted Query Retrieval Efficiency
Basic indexing pipeline
Sparse Vectors
• Vocabulary and therefore dimensionality of vectors can be very large, ~10^4.
• However, most documents and queries do not contain most words, so vectors are sparse (i.e.
most entries are 0).
• Need efficient methods for storing and computing with sparse vectors.
• Requires linear search of the list to find (or change) the weight of a specific token.
• The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-
based data structure (trie, B-tree).
Inverted Index
Computing IDF
Let N be the total number of documents; for each term t, compute idf_t = log10(N / df_t), where df_t is the number of documents containing t.
Note this requires a second pass through all the tokens after all documents have been indexed.
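• A sketch of that second pass, assuming the collection is available as a list of per-document term-count dicts (a simplification of the real index structure):

import math

def compute_idf(doc_term_counts):
    N = len(doc_term_counts)
    df = {}
    for counts in doc_term_counts:       # second pass over the indexed documents
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return {term: math.log10(N / df_t) for term, df_t in df.items()}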
User Interface
Until user terminates with an empty query:
Lecture # 8
• Token
• Numbers
• Stop Words
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and
Hinrich Schütze
Outline
• Parsing a document
• Complications: Format/language
• Tokenization
• Numbers
• Stop words
Parsing a document
• What format is it in?
• pdf/word/excel/html?
• What character set is in use?
• (CP1252, UTF-8, …)
Complications: Format/language
• Documents being indexed can include docs from many different languages
• A single index may contain terms from many languages.
• Sometimes a document or its components can contain multiple languages/formats
• French email with a German pdf attachment.
• French email quote clauses from an English-language contract
• There are commercial and open source libraries that can handle a lot of this stuff
Complications: What is a document?
We return from our query “documents” but there are often interesting questions of grain size:
What is a unit document?
A file?
An email? (Perhaps one of many in a single mbox file)
What about an email with 5 attachments?
Tokenization
Input: “Friends, Romans and Countrymen”
Output: Tokens
Friends
Romans
Countrymen
A token is an instance of a sequence of characters
Each such token is now a candidate for an index entry, after further processing
Described below
But what are valid tokens to emit?
Tokenization
• Issues in tokenization:
• Finland’s capital
Finland AND s? Finlands? Finland’s?
• Hewlett-Packard → Hewlett and Packard as two tokens?
• state-of-the-art: break up hyphenated sequence.
• co-education
• lowercase, lower-case, lower case ?
• It can be effective to get the user to put in possible
hyphens
• San Francisco: one token or two?
• How do you decide it is one token?
Numbers
• 3/20/91 Mar. 12, 1991 20/3/91
• 55 B.C.
• B-52
• My PGP key is 324a3df234cb23e
• (800) 234-2333
• Often have embedded spaces
• Older IR systems may not index numbers
• But often very useful: think about things like looking up
error codes/stacktraces on the web
• (One answer is using n-grams: IIR ch. 3)
• Will often index “meta-data” separately
• Creation date, format, etc.
Tokenization: language issues
• French
• L'ensemble one token or two?
• L ? L’ ? Le ?
• Want l’ensemble to match with un ensemble
• Until at least 2003, it didn’t on Google
• Internationalization!
• German noun compounds are not segmented
• Lebensversicherungsgesellschaftsangestellter
• ‘life insurance company employee’
• German retrieval systems benefit greatly from a compound splitter
module
• Can give a 15% performance boost for
German
• ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
• With Unicode, the surface presentation is complex, but the stored form is
straightforward
Stop words
• With a stop list, you exclude from the dictionary entirely the commonest words.
Intuition:
• They have little semantic content: the, a, and, to, be
• There are a lot of them: ~30% of postings for top 30 words
• But the trend is away from doing this:
• Good compression techniques (IIR 5) means the space for including stop
words in a system is very small
• Good query optimization techniques (IIR 7) mean you pay little at query
time for including stop words.
• You need them for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
Resources
• MG 3.6, 4.3; MIR 7.2
• Porter’s stemmer: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
• H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined
Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4
Lecture # 9
• Terms Normalization
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning,
and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
Outline
• Normalization
• Case folding
• Normalization to terms
• Thesauri and soundex
Normalization to terms
• We may need to “normalize” words in indexed text as well as query words into the
same form
• We want to match U.S.A. and USA
• Tokens are transformed to terms which are then entered into the index
• A term is a (normalized) word type, which is an entry in our IR system dictionary
• We most commonly implicitly define equivalence classes of terms by, e.g.,
• deleting periods to form a term
• U.S.A., USA → USA
• deleting hyphens to form a term
• anti-discriminatory, antidiscriminatory → antidiscriminatory
1. Normalization: other languages
• Accents: e.g., French résumé vs. resume.
• A simple remedy is to remove the accent, but this is not always good: Resume with and without the accent are different words.
2. Case folding
• Reduce all letters to lower case
• exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
• Often best to lower case everything, since users will use lowercase regardless of
‘correct’ capitalization…
• A word starting with a capital letter in the middle of a sentence is usually a proper noun, so case folding needs care in this case.
• However, if users are not going to use capital letters in their queries, then there is little point in preserving case in the index.
• Longstanding Google example: [fixed in 2011…]
• Query C.A.T.
• #1 result is for “cats” not Caterpillar Inc.
3. Normalization to terms
• An alternative to equivalence classing is to do asymmetric expansion
• An example of where this may be useful
• Enter: window Search: window, windows
• Enter: windows Search: Windows, windows, window
Resources
• MG 3.6, 4.3; MIR 7.2
• Porter’s stemmer: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
• H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM
Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4
Lecture 10
• Lemmatization
• Stemming
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Lemmatization
• Stemming
• Porter’s algorithm
• Language-specificity
5. Lemmatization
• An NLP tool: it uses dictionaries and morphological analysis of words in order to return the base or dictionary form of a word
• Reduce inflectional/variant forms to base form
• E.g.,
• am, are, is → be
• car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• No change for proper nouns, e.g., Pakistan remains the same
• Lemmatization implies doing “proper” reduction to dictionary headword form
• Example: Lemmatization of “saw” attempts to return “see”
or “saw” depending on whether the use of the token is a verb
or a noun
6. Stemming
• Reduce terms to their “roots” before indexing
• “Stemming” suggests crude affix chopping
• language dependent
• e.g., automate(s), automatic, automation all reduced to automat.
• e.g., computation, computing, computer, all reduce to comput.
Porter’s algorithm
• Commonest algorithm for stemming English
• Results suggest it’s at least as good as other stemming options
• Conventions + 5 phases of reductions
• phases applied sequentially
• each phase consists of a set of commands
• sample convention: Of the rules in a compound command, select the
one that applies to the longest suffix.
Typical rules in Porter
• sses → ss: Processes → Process
• ies → i: Skies → Ski; ponies → poni
• ational → ate: Rotational → Rotate
• tional → tion: national → nation
• s → "" (drop): cats → cat
• Weight of word sensitive rules
• (m>1) EMENT →
• (if whatever comes before EMENT has measure m > 1, replace EMENT with null)
• replacement → replac
• cement → cement
Other stemmers
• Other stemmers exist:
• Lovins stemmer
• http://www.comp.lancs.ac.uk/computing/research/stem
ming/general/lovins.htm
• Single-pass, longest suffix removal (about 250 rules)
• Paice/Husk stemmer
• Snowball
• Full morphological analysis (lemmatization)
• At most modest benefits for retrieval
Stemming Example
Language-specificity
• The above methods embody transformations that are
• Language-specific, and often
• Application-specific
• These are “plug-in” addenda to the indexing process
• Both open source and commercial plug-ins are available for handling these
Does stemming/lemmatization help?
• English: very mixed results. Helps recall for some queries but harms precision on
others
• E.g., operative (dentistry) ⇒ oper
• operational (research) ⇒ oper
• operating (systems) ⇒ oper
• Such normalization increases recall but reduces precision, so it is not very useful for English.
• Definitely useful for Spanish, German, Finnish, …
• 30% performance gains for Finnish!
• The reason is that these languages have very regular morphological rules for forming words.
• Domain-specific normalization may also be helpful, e.g., normalizing words with respect to their usage in a particular domain.
Resources
• MG 3.6, 4.3; MIR 7.2
• Porter’s stemmer: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
• H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined
Indexes”, ACM Transactions on Information Systems.
http://www.seg.rmit.edu.au/research/research.php?author=4
Lecture # 11
• Compression
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• compression for inverted indexes
• Dictionary storage
• Dictionary-as-a-String
• Blocking
Net
• Where we used 3 bytes/pointer without blocking
• 3 × 4 = 12 bytes for k = 4 pointers,
now we use 3 + 4 = 7 bytes for 4 pointers: one 3-byte pointer into the block plus 4 one-byte term-length entries.
Lecture # 12
• Compression
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Blocking
• Front coding
• Postings compression
• Variable Byte (VB) codes
Dictionary as a string
Blocking
Front coding
• Front-coding:
• Sorted words commonly have long common prefix – store differences
only
• (for last k-1 in a block of k)
8automata8automate9automatic10automation
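• A minimal sketch of front coding for one block of the sorted dictionary (the block below is the example sequence above; each later term stores a shared-prefix length plus its suffix, one concrete way of "storing differences only"):

def front_code(sorted_terms):
    # first term stored in full; each later term stored as (shared-prefix length, suffix)
    coded = [(0, sorted_terms[0])]
    for prev, term in zip(sorted_terms, sorted_terms[1:]):
        k = 0
        while k < min(len(prev), len(term)) and prev[k] == term[k]:
            k += 1
        coded.append((k, term[k:]))
    return coded

print(front_code(["automata", "automate", "automatic", "automation"]))
# [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]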
Postings compression
The postings file is much larger than the dictionary, factor of at least 10.
Key desideratum: store each posting compactly
A posting for our purposes is a docID.
For Reuters (800,000 documents), we would use 32 bits per docID when
using 4-byte integers.
Alternatively, we can use log2 800,000 ≈ 19.6 < 20 bits per docID.
Our goal: use a lot less than 20 bits per docID.
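• A sketch of the Variable Byte codes listed in the outline, following the standard scheme of IIR 5.3 (7 payload bits per byte, continuation bit set on the last byte):

def vb_encode_number(n):
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128                  # mark the final byte
    return bytes(bytes_out)

print(list(vb_encode_number(5)))          # [133]
print(list(vb_encode_number(824)))        # [6, 184]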
Example
Resources
Chapter 5 of IIR
Resources at http://ifnlp.org/ir
Original publication on word-aligned binary codes by Anh
and Moffat (2005); also: Anh and Moffat (2006a)
Original publication on variable byte codes by Scholer,
Williams, Yiannis and Zobel (2002)
More details on compression (including compression of
positions and frequencies) in Zobel and Moffat (2006)
Lecture # 13
Compression
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Variable Byte (VB) codes
• Unary code
• Gamma codes
• Gamma code properties
• RCV1 compression
Unary code
• Represent n as n 1s with a final 0.
• Unary code for 3 is 1110.
• Unary code for 40 is
11111111111111111111111111111111111111110 .
• Unary code for 80 is:
111111111111111111111111111111111111111111111111111111111111111111111111111111
110
• This doesn’t look promising, but….
Gamma codes
• We can compress better with bit-level codes
• The Gamma code is the best known of these.
• Represent a gap G as a pair length and offset
• offset is G in binary, with the leading bit cut off
• For example 13 → 1101 → 101
• length is the length of offset
• For 13 (offset 101), this is 3.
• We encode length with unary code: 1110.
• Gamma code of 13 is the concatenation of length and offset: 1110101
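• A small Python sketch of the unary and gamma encoders described above:

def unary(n):
    # n ones followed by a zero, e.g. unary(3) = '1110'
    return "1" * n + "0"

def gamma(g):
    # offset = g in binary with the leading 1 removed; code = unary(len(offset)) + offset
    offset = bin(g)[3:]                   # bin(13) = '0b1101' -> '101'
    return unary(len(offset)) + offset

print(gamma(13))                          # '1110101', as above
print(gamma(1))                           # '0'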
Reuters RCV1
Reuters RCV1
RCV1 compression
Lecture # 14
• Index Constructions
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Scaling index construction
• Memory Hierarchy
• Hard Disk Tracks and Sectors
• Hard Disk Blocks
• Hardware basics
Scaling index construction
• In-memory index construction does not scale
• Can’t stuff entire collection into memory, sort, then write back
• How can we construct an index for very large collections?
• Taking into account the hardware constraints we just learned
about . . .
• Memory, disk, speed, etc.
Memory Hierarchy
Moore’s Law
Hardware basics
Inverted Index
Distributed computing
• Distributed computing is a field of computer science that studies distributed systems. A
distributed system is a software system in which components located on networked
computers communicate and coordinate their actions by passing messages. Wikipedia
Hardware assumptions
Lecture # 15
• Merge Sort
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Two-Way Merge Sort
• Single-pass in-memory indexing
• SPIMI-Invert
SPIMI:
Single-pass in-memory indexing
• Key idea 1: Generate separate dictionaries for each block – no need to maintain term-
termID mapping across blocks.
• Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur.
• With these two ideas we can generate a complete inverted index for each block.
• These separate indexes can then be merged into one big index.
SPIMI-Invert
Summary
• So far
• Static Collections (corpus is fixed)
• Linked List based indexing
• Array based indexing
• Data fits into the HD of a single machine
• Next
• Data does not fit in a single machine
• Requires more machines (clusters of machines).
Lecture # 16
• Phrase queries
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Types of Queries
• Phrase queries
• Biword indexes
• Extended biwords
• Positional indexes
Types of Queries
1. Phrase Queries
The crops in pakistan
“The crops in pakistan”
2. Proximity Queries
LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
3. Wild Card Queries
Results*
Phrase queries
• Want to answer queries such as “Stanford university” – as a phrase
• Thus the sentence “I went to university at Stanford” is not a match.
• No longer suffices to store only
<term : docs> entries
Extended biwords
• Parse the indexed text and perform part-of-speech-tagging (POST).
• Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
• Now deem any string of terms of the form NX*N to be an extended biword.
• Each such extended biword is now made a term in the dictionary.
• Example:
• catcher in the rye
• Capital of Pakistan
Query processing
• Given a query, parse it into N’s and X’s
• Segment query into enhanced biwords
• Look up index
• Issues
• Parsing longer queries into conjunctions
• E.g., the query tangerine trees and marmalade skies is parsed into
• tangerine trees AND trees and marmalade AND marmalade skies
Other issues
• False positives, as noted before
Positional indexes
• Store, for each term, entries of the form:
<number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Lecture # 17
Processing a phrase query
Proximity queries
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Processing a phrase query
• Proximity queries
• Combination schemes
Proximity queries
• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT Here, /k means “within k words of”.
• Clearly, positional indexes can be used for such queries; biword indexes cannot.
• Exercise: Adapt the linear merge of postings to handle proximity queries. Can you
make it work for any value of k?
Rules of thumb
• Positional index size factor of 2-4 over non-positional index
• Positional index size 35-50% of volume of original text
• Caveat: all of this holds for “English-like” languages
Combination schemes
• A positional index expands postings storage substantially (Why?)
• Biword indexes and positional indexes approaches
can be profitably combined
• For particular phrases (“Michael Jackson”, “Britney Spears”) it is
inefficient to keep on merging positional postings lists
• Even more so for phrases like “The Who”
Lecture # 18
• Wild Card Queries
• B Tree
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• How to Handle Wild-Card Queries
• Wild-card queries: *
• B-Tree
• B+ Tree
How to Handle Wild-Card Queries
• B-Trees
• Permuterm Index
• K-Grams
• Soundex Algorithms
Wild-card queries: *
• mon*: find all docs containing any word beginning with “mon”.
• Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
• *mon: find words ending in “mon”: harder
• Maintain an additional B-tree for terms backwards.
Can retrieve all words in range: nom ≤ w < non.
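• A sketch of the range lookup over a sorted in-memory lexicon, with binary search (bisect) standing in for the B-tree:

import bisect

def prefix_range(lexicon, prefix):
    # all terms w with prefix <= w < next prefix, e.g. 'mon' <= w < 'moo'
    lo = bisect.bisect_left(lexicon, prefix)
    hi = bisect.bisect_left(lexicon, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return lexicon[lo:hi]

lexicon = ["moan", "mon", "monday", "money", "monkey", "month", "moo", "moon"]
print(prefix_range(lexicon, "mon"))       # ['mon', 'monday', 'money', 'monkey', 'month']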
B-Tree
Resources
• IIR 3, MG 4.2
Lecture # 19
Permuterm index
k-gram
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Permuterm index
• k-gram
• Soundex
Permuterm index
Processing wild-cards
• Query mon* can now be run as
• $m AND mo AND on
• Gets terms that match AND version of our wildcard query.
• But we’d enumerate moon.
• Must post-filter these terms against query.
• Surviving enumerated terms are then looked up in the term-document inverted index.
• Fast, space efficient (compared to permuterm).
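• A sketch of the bigram-index processing just described (the vocabulary is made up; note how moon is enumerated and then removed by the post-filter):

from collections import defaultdict

def kgrams(term, k=2):
    padded = "$" + term + "$"             # '$' marks word boundaries
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_kgram_index(vocabulary, k=2):
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

def wildcard_prefix(index, prefix):
    grams = kgrams(prefix)
    grams.discard(prefix[-1] + "$")       # the term continues past the prefix
    candidates = set.intersection(*(index[g] for g in grams))   # $m AND mo AND on
    return {t for t in candidates if t.startswith(prefix)}      # post-filtering

index = build_kgram_index(["moon", "month", "monday", "melon"])
print(wildcard_prefix(index, "mon"))      # {'month', 'monday'}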
K-gram Index
• Larger k-grams give less flexibility in query processing.
• Normally bi-gram and tri-gram indexes are preferred.
Soundex
• Class of heuristics to expand a query into phonetic equivalents
• Language specific – mainly for names
• E.g., chebyshev → tchebycheff
• Invented for the U.S. census … in 1918
Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): A, E, I, O, U, H, W, Y.
3. Change letters to digits as follows: B, F, P, V → 1; C, G, J, K, Q, S, X, Z → 2; D, T → 3; L → 4; M, N → 5; R → 6.
Soundex continued
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be
of the form <uppercase letter> <digit> <digit> <digit>.
E.g., Herman becomes H655.
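• A sketch of the steps above in Python (ASCII names assumed); it reproduces the Herman → H655 example:

def soundex(name):
    codes = {**dict.fromkeys("AEIOUHWY", "0"),
             **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    digits = name[0] + "".join(codes[c] for c in name[1:] if c in codes)
    # remove one of each pair of consecutive identical digits
    deduped = digits[0]
    for c in digits[1:]:
        if c != deduped[-1]:
            deduped += c
    result = deduped[0] + deduped[1:].replace("0", "")   # drop the zeros
    return (result + "000")[:4]           # pad with trailing zeros, keep four positions

print(soundex("Herman"))                  # H655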
Soundex
• Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …)
• How useful is soundex?
• Generally used for Proper Nouns.
• Database systems also provide soundex option.
• E.g., get names whose last name has the same Soundex code as SMITH
• We may get Lindsay and William
• It lowers precision.
• But increases Recall.
• Can be used by Interpol to check names.
• Generally such algorithms are tailored to European names.
• There are other variants which work well for IR.
Resources
• IIR 3, MG 4.2
• Efficient spell retrieval:
• K. Kukich. Techniques for automatically correcting words in text. ACM
Computing Surveys 24(4), Dec 1992.
• J. Zobel and P. Dart. Finding approximate matches in large
lexicons. Software - practice and experience 25(3), March 1995.
http://citeseer.ist.psu.edu/zobel95finding.html
• Mikael Tillenius: Efficient Generation and Ranking of Spelling Error
Corrections. Master’s thesis at Sweden’s Royal Institute of Technology.
http://citeseer.ist.psu.edu/179155.html
• Nice, easy reading on spell correction:
• Peter Norvig: How to write a spelling corrector
http://norvig.com/spell-correct.html
Lecture # 20
• Spelling Correction
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Spell correction
• Document correction
• Query mis-spellings
• Isolated word correction
• Edit distance
• Fibonacci series
Spell correction
• Two principal uses
• Correcting document(s) being indexed
• Correcting user queries to retrieve “right” answers
• Two main flavors:
• Isolated word
• Check each word on its own for misspelling
• Will not catch typos resulting in correctly spelled words
• e.g., from form
• Context-sensitive
• Look at surrounding words,
• e.g., I flew form Lahore to Dubai.
Document correction
• Especially needed for OCR’ed documents
• Correction algorithms are tuned for this: rn/m
• modern may be considered as modem
• Can use domain-specific knowledge
• E.g., OCR can confuse O and D more often than it would
confuse O and I (adjacent on the QWERTY keyboard, so
more likely interchanged in typing).
• But also: web pages and even printed material have typos
• Goal: the dictionary contains fewer misspellings
• But often we don’t change the documents and instead fix the query-document
mapping
Query mis-spellings
• Our principal focus here
• E.g., the query Fasbinder – Fassbinder (correct German Surname)
• We can either
• Retrieve documents indexed by the correct spelling, OR
• Return several suggested alternative queries with the correct spelling
• Google’s Did you mean … ?
• Given a lexicon and a character sequence Q, return the words in the lexicon closest to
Q
• What’s “closest”?
• We’ll study several alternatives
• Edit distance (Levenshtein distance)
• Weighted edit distance
• n-gram overlap
Edit distance
• Given two strings S1 and S2, the minimum number of operations to convert one to the other
• Operations are typically character-level
• Insert, Delete, Replace, (Transposition)
• E.g., the edit distance from dof to dog is 1
• From cat to act is 2 (Just 1 with transpose.)
• from cat to dog is 3.
• Generally found by dynamic programming.
• See http://www.merriampark.com/ld.htm for a nice example plus an applet.
• Dynamic programming:
Dynamic programming solves the sub-problems bottom-up; the whole problem cannot be solved until all of its sub-problems are solved, and the final solution is then assembled from the sub-problem solutions.
Fibonacci series
• Definition:- The first two numbers in the Fibonacci sequence are 1 and 1, or 0 and 1,
depending on the chosen starting point of the sequence, and each subsequent number
is the sum of the previous two(Wikipedia).
/* Naive recursive Fibonacci (1-indexed): Fib(1) = 0, Fib(2) = 1.
   Recomputes the same sub-problems over and over -- exponential running time. */
int Fib(int n)
{
    if (n == 1)
        return 0;
    if (n == 2)
        return 1;
    return Fib(n - 1) + Fib(n - 2);
}
Fibonacci Tree
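• Applying the same bottom-up idea to edit distance gives the standard dynamic-programming table; a minimal sketch assuming unit costs for insert, delete, and replace:

def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                       # delete everything
    for j in range(n + 1):
        dp[0][j] = j                       # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete
                           dp[i][j - 1] + 1,          # insert
                           dp[i - 1][j - 1] + cost)   # replace or match
    return dp[m][n]

print(edit_distance("dof", "dog"), edit_distance("cat", "act"), edit_distance("cat", "dog"))
# 1 2 3, matching the examples above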
Resources
• IIR 3, MG 4.2
• Efficient spell retrieval:
• K. Kukich. Techniques for automatically correcting words in text. ACM
Computing Surveys 24(4), Dec 1992.
• J. Zobel and P. Dart. Finding approximate matches in large
lexicons. Software - practice and experience 25(3), March 1995.
http://citeseer.ist.psu.edu/zobel95finding.html
• Mikael Tillenius: Efficient Generation and Ranking of Spelling Error
Corrections. Master’s thesis at Sweden’s Royal Institute of Technology.
http://citeseer.ist.psu.edu/179155.html
• Nice, easy reading on spell correction:
• Peter Norvig: How to write a spelling corrector
http://norvig.com/spell-correct.html
Lecture # 21
• Spelling Correction
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Edit distance
• Using edit distances
• Weighted edit distance
• n-gram overlap
• One option – Jaccard coefficient
Edit distance
• Alternatively,
• We can look up all possible corrections in our inverted index and return
all docs … slow
• We can run with a single most likely correction
• The alternatives disempower the user, but save a round of interaction with the user
Weighted edit distance
• As before, but the weight of an operation depends on the character(s) involved
• Meant to capture OCR or keyboard errors
Example: m more likely to be mis-typed as n than as q
• Therefore, replacing m by n is a smaller edit distance than by q
• This may be formulated as a probability model
• Requires weight matrix as input
• Modify dynamic programming to handle weights
Weighted edit distance
Edit distance
• Given two strings S1 and S2, the minimum number of operations to convert one to the other
• Operations are typically character-level
• Insert, Delete, Replace, (Transposition)
• E.g., the edit distance from dof to dog is 1
• From cat to act is 2 (Just 1 with transpose.)
• from cat to dog is 3.
• Generally found by dynamic programming.
• See http://www.merriampark.com/ld.htm for a nice example plus an applet.
n-gram overlap
• Enumerate all the n-grams in the query string as well as in the lexicon
• Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching
any of the query n-grams
• Threshold by number of matching n-grams
• Variants – weight by keyboard layout, etc.
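• A sketch of candidate generation by trigram overlap; the lexicon and the 0.25 Jaccard threshold below are illustrative choices, not values from the lecture:

def trigrams(word):
    padded = "$" + word + "$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def spelling_candidates(query_term, lexicon, min_jaccard=0.25):
    q = trigrams(query_term)
    out = []
    for term in lexicon:
        t = trigrams(term)
        jc = len(q & t) / len(q | t)       # Jaccard overlap of the trigram sets
        if jc >= min_jaccard:
            out.append((term, round(jc, 2)))
    return sorted(out, key=lambda x: -x[1])

print(spelling_candidates("bord", ["board", "border", "lord", "morbid", "sordid"]))
# [('border', 0.43), ('lord', 0.33), ('board', 0.29)]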
Resources
• IIR 3, MG 4.2
• Efficient spell retrieval:
• K. Kukich. Techniques for automatically correcting words in text. ACM
Computing Surveys 24(4), Dec 1992.
• J. Zobel and P. Dart. Finding approximate matches in large
lexicons. Software - practice and experience 25(3), March 1995.
http://citeseer.ist.psu.edu/zobel95finding.html
Lecture # 22
• Spelling Correction
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Matching trigrams
• Computing Jaccard coefficient
• Context-sensitive spell correction
• General issues in spell correction
Matching trigrams
2. ANSWER LIST
• Improve the walkthrough to use the Jaccard coefficient to identify the candidate terms.
• HEURISTIC
• While computing the Jaccard coefficient we may disregard the repeated n-grams in the query term as well as in the current term.
• The reason is that at this stage we are only computing candidate terms, which we shall process later using edit distance.
Context-sensitive correction
Resources
• IIR 3, MG 4.2
• Efficient spell retrieval:
Lecture # 23
• Performance Evaluation
of Information Retrieval Systems
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Why System Evaluation?
• Difficulties in Evaluating IR Systems
• Measures for a search engine
• Measuring user happiness
• How do you tell if users are happy?
• Number of documents/hour
• (Average document size)
• How fast does it search
• Latency as a function of index size
• Expressiveness of query language
• Ability to express complex information needs
• Speed on complex queries
• Uncluttered UI
• Is it free?
Lecture # 24
• BENCHMARKS FOR THE EVALUATION OF IR SYSTEMS
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Happiness: elusive to measure
• Gold Standard
• search algorithm
• Relevance judgments
• Standard relevance benchmarks
• Evaluating an IR system
Relevance judgments
• Binary (relevant vs. non-relevant) in the simplest case, more nuanced (0, 1, 2, 3 …) in
others
• What are some issues already?
• 5 million times 50K takes us into the range of a quarter trillion judgments
• If each judgment took a human 2.5 seconds, we’d still need about 10^11 seconds, or nearly $300 million if you pay people $10 per hour to assess
• 10K new products per day
What else?
• Still need test queries
• Must be germane to docs available
• Must be representative of actual user needs
• Random query terms from the documents generally not a good idea
• Sample from query logs if available
• Classically (non-Web)
Benchmarks
Evaluating an IR system
• Note: user need is translated into a query
• Relevance is assessed relative to the user need, not the query
• E.g., Information need: My swimming pool bottom is becoming black and needs to be
cleaned.
• Query: pool cleaner
• Assess whether the doc addresses the underlying need, not whether it has these
words
Lecture # 25
• BENCHMARKS FOR THE EVALUATION OF IR SYSTEMS
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Evaluation Measures
• Precision and Recall
• Unranked retrieval evaluation
• Trade-off between Recall and Precision
• Computing Recall/Precision Points
Evaluation Measures
• Precision
• Recall
• Accuracy
• Mean Average Precision
• F-Measure/E-Measure
• Non Binary Relevance
• Discounted Cumulative Gain
• Normalized Discounted Cumulative Gain
Lecture # 26
Precision and Recall
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Computing Recall/Precision
• Interpolating a Recall/Precision
• Average Recall/Precision Curve
• R- Precision
• Precision@K
• F-Measure
• E Measure
• The curve closest to the upper right-hand corner of the graph indicates the best
performance
R- Precision
• Precision at the R-th position in the ranking of results for a query that has R relevant
documents.
Precision@K
F-Measure
• One measure of performance that takes into account both recall and precision.
• Harmonic mean of recall and precision: F = 2PR / (P + R)
• Compared to the arithmetic mean, both precision and recall need to be high for the harmonic mean to be high.
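• A small sketch computing the three measures for one query (the docID sets are made up):

def precision_recall_f1(retrieved, relevant):
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1({1, 2, 3, 4}, {2, 4, 6, 8, 10}))   # (0.5, 0.4, 0.444...)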
Lecture # 27
Mean Average Precision
Non Binary Relevance
DCG
NDCG
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Mean Average Precision
• Mean Reciprocal Rank
• Cumulative Gain
• Discounted Cumulative Gain
• Normalized Discounted Cumulative Gain
Non-Binary Relevance
• Documents are rarely entirely relevant or non-relevant to a query
• Many sources of graded relevance judgments
• Relevance judgments on a 5-point scale
• Multiple judges
• Click distribution and deviation from expected levels (but click-through
!= relevance judgments)
Cumulative Gain
Normalized Discounted Cumulative Gain (NDCG)
• To compare DCGs, normalize values so that an ideal ranking would have a normalized DCG of 1.0
• Ideal ranking: the documents sorted in decreasing order of their relevance grades
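• A sketch of DCG/NDCG using the discount from IIR (rel_1 + Σ_{i≥2} rel_i / log2 i); other discount variants exist, and the graded judgments below are hypothetical:

import math

def dcg(relevances):
    return relevances[0] + sum(r / math.log2(i) for i, r in enumerate(relevances[1:], start=2))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)        # the ideal ranking
    best = dcg(ideal)
    return dcg(relevances) / best if best > 0 else 0.0

ranking = [3, 2, 3, 0, 1, 2]                        # relevance grades in ranked order
print(round(ndcg(ranking), 3))                      # ≈ 0.93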
Lecture # 28
Using user Clicks
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
Kappa Example
• P(A) = 370/400 = 0.925
• P(nonrelevant) = (10+20+70+70)/800 = 0.2125
• P(relevant) = (10+20+300+300)/800 = 0.7875
• P(E) = 0.2125^2 + 0.7875^2 = 0.665
• Kappa = (0.925 – 0.665)/(1-0.665) = 0.776
• Kappa > 0.8 = good agreement
• 0.67 < Kappa < 0.8 -> “tentative conclusions” (Carletta ’96)
• Depends on purpose of study
• For >2 judges: average pairwise kappas
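• Reproducing the arithmetic above in a few lines of Python:

def kappa(p_agree, p_relevant, p_nonrelevant):
    p_e = p_relevant ** 2 + p_nonrelevant ** 2      # chance agreement from pooled marginals
    return (p_agree - p_e) / (1 - p_e)

p_a = 370 / 400                                     # 0.925
p_rel = (10 + 20 + 300 + 300) / 800                 # 0.7875
p_non = (10 + 20 + 70 + 70) / 800                   # 0.2125
print(round(kappa(p_a, p_rel, p_non), 3))           # 0.776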
A/B testing
• Purpose: Test a single innovation
• Prerequisite: You have a large search engine up and running.
• Have most users use old system
• Divert a small proportion of traffic (e.g., 1%) to the new system that includes the
innovation
• Evaluate with an “automatic” measure like clickthrough on first result
• Now we can directly see if the innovation does improve user happiness.
Lecture # 29
Cosine ranking
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Computing cosine-based ranking
• Efficient cosine ranking
• Computing the K largest cosines
• Use heap for selecting top K
• High-idf query terms
Generic approach
• Find a set A of contenders, with K < |A| << N
• A does not necessarily contain the top K, but has many docs from
among the top K
• Return the top K docs in A
• Think of A as pruning non-contenders
• The same approach is also used for other (non-cosine) scoring functions
• Will look at several schemes following this approach
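• Once scores for the contender set A are computed, the top K can be selected with a heap (as the outline suggests) rather than a full sort; a sketch with made-up scores:

import heapq

def top_k(scores, k):
    # scores: dict docID -> cosine score for the docs in A
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

scores = {"d1": 0.91, "d2": 0.15, "d3": 0.64, "d4": 0.87, "d5": 0.42}
print(top_k(scores, 3))                   # [('d1', 0.91), ('d4', 0.87), ('d3', 0.64)]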
Index elimination
• The basic cosine computation algorithm only considers docs containing at least one query term
• Take this further:
• Only consider high-idf query terms
• Only consider docs containing many query terms
• Postings of low-idf terms have many docs → these (many) docs get eliminated from set A of contenders
3 of 4 query terms
Lecture # 30
Sampling and pre-grouping
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Term-wise candidates
• Preferred List storage
• Sampling and pre-grouping
• General variants
Term-wise candidates
Preprocess: Pre-compute, for each term, its k nearest docs.
• (Treat each term as a 1-term query.)
• lots of preprocessing.
• Result: "preferred list" for each term.
Search:
• For a t-term query, take the union of their t preferred lists - call this set S.
• Compute cosines from the query to only the docs in S, and choose top k.
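• A sketch of the preferred-list search step; the lists and scores below are hypothetical stand-ins for the precomputed data and the cosine measure:

def preferred_list_search(query_terms, preferred, score, k):
    candidates = set()                     # S = union of the query terms' preferred lists
    for t in query_terms:
        candidates.update(preferred.get(t, []))
    return sorted(candidates, key=score, reverse=True)[:k]

preferred = {"cat": ["d1", "d4", "d7"], "lion": ["d4", "d2", "d9"]}
doc_scores = {"d1": 0.9, "d2": 0.3, "d4": 0.8, "d7": 0.5, "d9": 0.2}
print(preferred_list_search(["cat", "lion"], preferred, doc_scores.get, k=3))
# ['d1', 'd4', 'd7']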
Visualization
Visualization
General variants
• Have each follower attached to a=3 (say) nearest leaders.
• From query, find b=4 (say) nearest leaders and their followers.
• Can recur on leader/follower construction.
Lecture # 31
Dimensionality reduction
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Dimensionality reduction
• Random projection onto k<<m axes
• Computing the random projection
• Latent semantic indexing (LSI)
• The matrix
• Dimension reduction
Dimensionality reduction
• What if we could take our vectors and “pack” them into fewer dimensions (say 10,000 → 100) while preserving distances?
• (Well, almost.)
• Speeds up cosine computations.
• Two methods:
• Random projection.
• “Latent semantic indexing”.
Guarantee
• With high probability, relative distances are (approximately) preserved by projection.
• Pointer to precise theorem in Resources.
Cost of computation
• This takes a total of kmn multiplications.
• Expensive - see Resources for ways to do essentially the same thing, quicker.
• Exercise: by projecting from 10000 dimensions down to 100, are we really going to
make each cosine computation faster?
• Size of the vectors would decrease
• Will result in smaller postings
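• A numpy sketch of the projection (toy sizes; the random data stands in for real document vectors):

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 10000, 5, 100                   # m original dimensions, n docs, k target dimensions

A = rng.random((n, m))                    # n document vectors in m dimensions
R = rng.standard_normal((m, k)) / np.sqrt(k)   # random projection matrix
A_proj = A @ R                            # k*m*n multiplications, as noted above

print(A_proj.shape)                       # (5, 100)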
The matrix
• The matrix
has rank 2: the first two rows are linearly independent, so the rank is at least 2, but all three
rows are linearly dependent (the first is equal to the sum of the second and third) so the rank
must be less than 3.
• The matrix
has rank 1: there are nonzero columns, so the rank is positive, but any pair of columns is linearly dependent.
The matrix
• Define the term-term correlation matrix T = A·A^t
• A^t denotes the matrix transpose of A.
• T is a square, symmetric m × m matrix.
Eigenvectors
Eigen Vectors
Visualization
Dimension reduction
• For some s << r, zero out all but the s biggest eigenvalues in Q.
• Denote by Q_s this new version of Q.
• Typically s is in the hundreds while r could be in the (tens of) thousands.
• Let A_s = P Q_s R^t
• It turns out that A_s is a pretty good approximation to A.
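• A numpy sketch of the truncation just described, computed here via an SVD (the small matrix is made up; the factors correspond to P, Q_s, R^t above):

import numpy as np

def low_rank_approximation(A, s):
    # keep only the s largest singular values and rebuild A_s
    P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
    return P[:, :s] @ np.diag(sigma[:s]) @ Rt[:s, :]

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
print(np.round(low_rank_approximation(A, s=2), 2))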
Visualization
Guarantee
• Relative distances are (approximately) preserved by projection:
• Of all m × n rank-s matrices, A_s is the best approximation to A.
• Pointer to precise theorem in Resources.
Doc-doc similarities
Semi-precise intuition
• We accomplish more than dimension reduction here:
• Docs with lots of overlapping terms stay together
• Terms from these docs also get pulled together.
• Thus car and automobile get pulled together because both co-occur in docs with tires,
radiator, cylinder, etc.
Query processing
• View a query as a (short) doc:
• call it row 0 of A_s.
• Now the entries in row 0 of A_s·A_s^t give the similarities of the query with each doc.
• Entry (0,j) is the score of doc j on the query.
• Exercise: fill in the details of scoring/ranking.
• LSI is expensive in terms of computation…
• Randomly choosing a subset of documents for dimensional reduction can give a
significant boost in performance.
Resources
• Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html
• Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html
• Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html
• Books: MG 4.6, MIR 2.7.2.
Lecture # 32
Web Search
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• The World Wide Web
• Web Pre-History
• Web Search History
• Web Challenges for IR
• Graph Structure in the Web (Bowtie)
• Zipf’s Law on the Web
• Manual Hierarchical Web Taxonomies
• Business Models for Web Search
The World Wide Web
• Developed by Tim Berners-Lee in 1990 at CERN to organize research documents
available on the Internet.
• Combined idea of documents available by FTP with the idea of hypertext to link
documents.
• Developed initial HTTP network protocol, URLs, HTML, and first “web server.”
Web Pre-History
• Ted Nelson developed idea of hypertext in 1965.
• Doug Engelbart invented the mouse and built the first implementation of hypertext in
the late 1960’s at SRI.
• ARPANET was developed in the early 1970’s.
• The basic technology was in place in the 1970’s; but it took the PC revolution and
widespread networking to inspire the web and make it practical.
• Andreessen joined with James Clark (Stanford Prof. and Silicon Graphics founder) to
form Mosaic Communications Inc. in 1994 (which became Netscape to avoid conflict
with UIUC).
• Microsoft licensed the original Mosaic from UIUC and used it to build Internet Explorer
in 1995.
• Advertisers pay for banner ads on the site that do not depend on a user’s query.
• Goto/Overture
• CPM: Cost Per Mille (thousand impressions). Pay for each ad display.
• CPC: Cost Per Click. Pay only when user clicks on ad.
• CTR: Click Through Rate. Fraction of ad impressions that result in click-throughs.
CPC = CPM / (CTR * 1000)
• CPA: Cost Per Action (Acquisition). Pay only when user actually makes a
purchase on target site.
• Advertisers bid for “keywords”. Ads for highest bidders displayed when user query
contains a purchased keyword.
• PPC: Pay Per Click. CPC for bid word ads (e.g. Google AdWords).
Lecture # 33
Spidering
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Web Search Using IR
• Spiders (Robots/Bots/Crawlers)
• What any crawler must do
• What any crawler should do
• Search Strategies
• Basic crawl architecture
• Restricting Spidering
• Robot Exclusion
Spiders (Robots/Bots/Crawlers)
• Start with a comprehensive set of root URLs from which to start the search.
• Follow all links on these pages recursively to find additional pages.
• Index each newly found (novel) page in an inverted index as it is encountered.
• May allow users to directly submit pages to be indexed (and crawled from).
URL Syntax
• A URL has the following syntax:
• <scheme>://<authority><path>?<query>#<fragment>
• An authority has the syntax:
• <host>:<port-number>
• A query passes variable values from an HTML form and has the syntax:
• <variable>=<value>&<variable>=<value>…
• A fragment is also called a reference or a ref and is a pointer within the document to a
point specified by an anchor tag of the form:
• <A NAME=“<fragment>”>
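As a quick check of this decomposition, Python's standard urllib.parse pulls the same parts out of a URL; the example URL below is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.example.com:8080/docs/page.html?name=ali&id=7#section2"
p = urlparse(url)

print(p.scheme)            # 'http'
print(p.hostname, p.port)  # 'www.example.com' 8080  (the authority: host and port)
print(p.path)              # '/docs/page.html'
print(parse_qs(p.query))   # {'name': ['ali'], 'id': ['7']}  variable=value pairs
print(p.fragment)          # 'section2'  (the ref pointing at an anchor in the page)
```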
Robot Exclusion
• Web sites and pages can specify that robots should not crawl/index certain areas.
• Two components: the Robots Exclusion Protocol (a site-wide robots.txt file listing parts of the site that crawlers should not visit) and the robots META tag (per-page exclusion).
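A minimal sketch of honouring the site-wide component with the standard urllib.robotparser module; the site and the crawler name "MyBot" are assumptions.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # hypothetical site
rp.read()                                          # fetch and parse robots.txt

url = "http://www.example.com/private/report.html"
if rp.can_fetch("MyBot", url):
    pass   # allowed: fetch and index the page
else:
    pass   # excluded by the site: skip this URL
```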
Lecture # 34
Web Crawler
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Basic crawl architecture
• Communication between nodes
• URL frontier: Mercator scheme
• Duplicate documents
• Shingles + Set Intersection
Politeness – challenges
• Even if we restrict only one thread to fetch from a host, it can still hit that host repeatedly
• Common heuristic: insert time gap between successive requests to a host that is >>
time for most recent fetch from that host
Front queues
• The prioritizer assigns each URL an integer priority between 1 and K
• The URL is appended to the corresponding front queue
• Heuristics for assigning priority
• Refresh rate sampled from previous crawls
• Application-specific (e.g., “crawl news sites more often”)
Back queue invariants
• Each back queue is kept non-empty while the crawl is in progress
• Each back queue only contains URLs from a single host
• Maintain a table from hosts to back queues
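A toy sketch of this bookkeeping: one back queue per host plus a heap of (next-allowed-fetch-time, host) entries; the "wait 10x the last fetch time" gap and the URLs are made-up illustrations.

```python
import heapq
import time
from collections import defaultdict, deque

back_queue = defaultdict(deque)   # invariant: one host per back queue
heap = []                         # (earliest polite fetch time, host)

def enqueue(url):
    host = url.split("/")[2]
    if not back_queue[host]:                     # empty queue: schedule the host immediately
        heapq.heappush(heap, (time.time(), host))
    back_queue[host].append(url)

def fetch_next():
    t, host = heapq.heappop(heap)
    time.sleep(max(0.0, t - time.time()))        # wait until it is polite to hit this host
    url = back_queue[host].popleft()
    start = time.time()
    # ... fetch url here ...
    elapsed = time.time() - start
    if back_queue[host]:                         # keep non-empty queues scheduled
        heapq.heappush(heap, (time.time() + 10 * elapsed, host))
    return url

for u in ["http://a.com/x", "http://a.com/y", "http://b.org/z"]:
    enqueue(u)
print(fetch_next(), fetch_next(), fetch_next())
```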
Back queues
Duplicate/Near-Duplicate Detection
Computing Similarity
• Features:
• Segments of a document (natural or artificial breakpoints)
• Shingles (Word N-Grams)
• a rose is a rose is a rose → 4-grams are
a_rose_is_a
rose_is_a_rose
is_a_rose_is
a_rose_is_a
• Similarity Measure between two docs (= sets of shingles)
• Set intersection
• Specifically (Size_of_Intersection / Size_of_Union)
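A small sketch of 4-gram shingling and the intersection/union (Jaccard) similarity on the "a rose is a rose ..." example; the second document is made up.

```python
def shingles(text, k=4):
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)   # Size_of_Intersection / Size_of_Union

d1 = shingles("a rose is a rose is a rose")
d2 = shingles("a rose is a flower which is a rose")
print(sorted(d1))        # ['a_rose_is_a', 'is_a_rose_is', 'rose_is_a_rose']
print(jaccard(d1, d2))
```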
Sketch of a document
• Create a “sketch vector” (of size ~200) for each document
• Documents that share ≥ t (say 80%) corresponding vector elements are
deemed near duplicates
• For doc D, sketch_D[ i ] is computed as follows:
• Let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting)
• Let π_i be a random permutation on 0..2^m
• Pick MIN { π_i(f(s)) } over all shingles s in D
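A minimal min-hash version of this sketch: the permutations π_i are approximated with salted MD5 hashes and m = 32, both assumptions made purely for illustration.

```python
import hashlib

def fingerprint(shingle, salt):
    h = hashlib.md5((salt + shingle).encode()).hexdigest()
    return int(h, 16) & ((1 << 32) - 1)            # f(s): map shingle into 0..2^32 - 1

def sketch(shingle_set, size=200):
    # sketch[i] = MIN over all shingles s in D of pi_i(f(s))
    return [min(fingerprint(s, str(i)) for s in shingle_set) for i in range(size)]

def estimated_similarity(sk1, sk2):
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

s1 = sketch({"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"})
s2 = sketch({"a_rose_is_a", "rose_is_a_rose", "which_is_a_rose"})
print(estimated_similarity(s1, s2))   # near-duplicates if this is >= t (say 0.8)
```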
Final notes
• Shingling is a randomized algorithm
• Our analysis did not presume any probability model on the inputs
• It will give us the right (wrong) answer with some probability on any
input
Lecture # 35
Distributed Index
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Distributed Index
• Map Reduce
• Mapping onto Hadoop Map Reduce
• Dynamic Indexing
• Issues in Dynamic Indexing
Introduction
• Inverted Index in Memory
• Inverted Index on a HD of a machine
• Inverted Index on a number of machines
• Cluster Computing/Grid Computing
• Data Centers
• Large-scale IR systems have a number of data centers.
Map Reduce
• Utilize several machines to construct the index
• Map Phase
• Take documents and parse them
• Split a-h; i-p; q-z
• Reduce Phase
• Inversion: build the inverted index from the parsed (term, docID) pairs.
• Apache Hadoop is an open-source map-reduce framework.
• Each phase of an IR system can be mapped to map-reduce
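A single-process toy of the map (parse) and reduce (inversion) phases described above; the two documents and the a-h / i-p / q-z split are made up, and a real deployment would run on Hadoop.

```python
from collections import defaultdict

docs = {1: "information retrieval systems", 2: "web information systems"}

# Map phase: parse documents into (term, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Split phase: partition the term key space (a-h, i-p, q-z) across reducers.
def partition(term):
    return 0 if term[0] <= "h" else 1 if term[0] <= "p" else 2

buckets = defaultdict(list)
for term, doc_id in pairs:
    buckets[partition(term)].append((term, doc_id))

# Reduce phase (inversion): build sorted postings lists per term.
index = {}
for bucket in buckets.values():
    for term, doc_id in sorted(bucket):
        index.setdefault(term, []).append(doc_id)

print(index)   # e.g. {'information': [1, 2], 'retrieval': [1], 'systems': [1, 2], 'web': [2]}
```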
Dynamic Indexing
• Documents are updated, deleted
• Simple Approach
• Maintain a big main index.
• New documents go into a small auxiliary index, built in (say) RAM (a minimal sketch of this scheme follows the solutions below).
• Answer queries by involving both indexes
• For Deletions
• Invalidation bit vectors for deleted docs
• The size of the bit-vector equals the number of docs in the corpus
• Filter the search results by using the bit-vector
• Periodically, merge the main and aux indexes.
• Keep it simple and assume the index resides on a single machine
• Solutions
• A simple heuristic is to rely on the LARGEST (main) index only, but it then reflects stale (inaccurate) information.
• The large search engines keep building a new index on a separate set of machines, and once it is constructed they switch over to it.
• Google Dance
• Exercise: think about how the scheme extends when a positional index is incorporated.
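The sketch below illustrates the main + auxiliary scheme with an invalidation bit-vector, as promised above; the tiny corpus, postings, and docIDs are made up.

```python
main_index = {"car": [1, 3, 5], "automobile": [2, 5]}   # big, on-disk in reality
aux_index = {}                                          # small, in-memory index for new docs
deleted = [False] * 7                                   # invalidation bit-vector, one bit per doc

def add_document(doc_id, terms):
    for t in terms:
        aux_index.setdefault(t, []).append(doc_id)

def delete_document(doc_id):
    deleted[doc_id] = True        # mark only; postings are cleaned up at merge time

def search(term):
    postings = main_index.get(term, []) + aux_index.get(term, [])
    return [d for d in postings if not deleted[d]]      # filter through the bit-vector

add_document(6, ["car", "tires"])
delete_document(3)
print(search("car"))              # [1, 5, 6]

# Periodically: merge aux_index into main_index, drop invalidated docs, reset the bits.
```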
Lecture # 36
Link Analysis
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Hypertext and Links
• The Web as a Directed Graph
• Anchor Text
• Indexing anchor text
• PageRank scoring
• Social networks
Example 1: Good/Bad/Unknown
Anchor Text
WWW Worm - McBryan [Mcbr94]
• For the query ibm, how do we distinguish between:
• IBM’s home page (mostly graphical)
• IBM’s copyright page (high term freq. for ‘ibm’)
• Rival’s spam page (arbitrarily high term freq.)
Adjacency lists
• The set of neighbors of a node
• Assume each URL represented by an integer
• E.g., for a 4 billion page web, need 32 bits per node
• Naively, this demands 64 bits to represent each hyperlink
• Boldi/Vigna get down to an average of ~3 bits/link
• Further work achieves 2 bits/link
Gap encodings
• Given a sorted list of integers x, y, z, …, represent it by x, y−x, z−y, …
• Compress each integer using a code
• γ code – number of bits = 1 + 2⌊lg x⌋
• δ code: …
• Information-theoretic bound: 1 + ⌊lg x⌋ bits
• ζ code: works well for integers from a power law (Boldi & Vigna, DCC 2004)
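A small sketch of gap encoding followed by the γ code (a unary length prefix and then the binary offset of the gap); the postings list is made up.

```python
def gamma_encode(x):
    offset = bin(x)[3:]                       # binary representation of x without its leading 1
    length = "1" * len(offset) + "0"          # unary code for the offset length
    return length + offset                    # 1 + 2*floor(lg x) bits in total

def gaps(sorted_doc_ids):
    return [sorted_doc_ids[0]] + [b - a for a, b in zip(sorted_doc_ids, sorted_doc_ids[1:])]

postings = [3, 7, 11, 21]
print(gaps(postings))                              # [3, 4, 4, 10]
print([gamma_encode(g) for g in gaps(postings)])   # ['101', '11000', '11000', '1110010']
```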
Citation Analysis
• Citation frequency
• Bibliographic coupling frequency
• Articles that co-cite the same articles are related
• Citation indexing
• Who is this author cited by? (Garfield 1972)
• PageRank preview: Pinski and Narin, ’60s
• Asked: which journals are authoritative?
PageRank scoring
Teleporting
• At a dead end, jump to a random web page.
• At any non-dead end, with probability 10%, jump to a random web page.
• With remaining probability (90%), go out on a random link.
• 10% - a parameter.
Result of teleporting
• Now cannot get stuck locally.
• There is a long-term rate at which any page is visited (not obvious, will show this).
• How do we compute this visit rate?
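One standard answer is power iteration on the teleporting transition matrix; the three-page graph below is made up, and α = 0.10 matches the 10% teleport probability above.

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: []}            # tiny web graph; page 2 is a dead end
n, alpha = 3, 0.10

P = np.zeros((n, n))
for i, outs in links.items():
    if outs:                                   # non-dead end: 90% random out-link, 10% teleport
        P[i] = alpha / n
        for j in outs:
            P[i, j] += (1 - alpha) / len(outs)
    else:                                      # dead end: teleport to a random page
        P[i] = 1.0 / n

x = np.full(n, 1.0 / n)                        # start anywhere; the steady state is the visit rate
for _ in range(100):
    x = x @ P
print(x, x.sum())                              # PageRank scores; they sum to 1
```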
Resources
• IIR Chap 21
• http://www2004.org/proceedings/docs/1p309.pdf
• http://www2004.org/proceedings/docs/1p595.pdf
• http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
• http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html
• The WebGraph framework I: Compression techniques (Boldi et al. 2004)
Lecture # 37
Markov chains
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Markov chains
• Ergodic Markov chains
• Markov Chain with Teleporting
• Query Processing
• Personalized PageRank
Markov chains
• A Markov chain consists of n states, plus an n × n transition probability matrix P.
• At each step, we are in one of the states.
• For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i.
Markov chains
Query Processing
• Compute Page Rank for all pages
• Run the query
• Sort the obtained pages based on the page rank
• Integrate page rank and relevance to come up with the final result
Personalized PageRank
• PageRank can be biased (personalized) by changing E to a non-uniform distribution.
• Restrict “random jumps” to a set of specified relevant pages.
• For example, let E(p) = 0 except for one’s own home page, which receives all of the teleport probability.
• This results in a bias towards pages that are closer in the web graph to your own home page.
• Topic-specific PageRank vectors can be combined linearly, e.g. 0.6*sports + 0.4*news
Resources
• IIR Chap 21
• http://www2004.org/proceedings/docs/1p309.pdf
• http://www2004.org/proceedings/docs/1p595.pdf
• http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
• http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html
• The WebGraph framework I: Compression techniques (Boldi et al. 2004)
Lecture # 38
HITS
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Hyper-link induced topic search (HITS)
• Hubs and Authorities
• The hope
• Distilling hubs and authorities
• HITS for Clustering
Authorities
• Authorities are pages that are recognized as providing significant, trustworthy, and
useful information on a topic.
• In-degree (number of pointers to a page) is one simple measure of authority.
• However in-degree treats all links as equal.
The hope
High-level scheme
• Extract from the web a base set of pages that could be good hubs or authorities.
• From these, identify a small set of top hub and authority pages;
iterative algorithm.
Base set
• Given a text query (say “browser”), use a text index to get all pages containing the word “browser”.
• Call this the root set of pages.
• Add in any page that either
• points to a page in the root set, or
• is pointed to by a page in the root set.
• Call this the base set.
Scaling
• To prevent the h() and a() values from getting too big, can scale down after each
iteration.
• Scaling factor doesn’t really matter:
• we only care about the relative values of the scores.
How many iterations?
• Claim: relative values of scores will converge after a few iterations:
• in fact, suitably scaled, h() and a() scores settle into a steady state!
• proof of this comes later.
• In practice, ~5 iterations get you close to stability.
Results
• Authorities for query: “Java”
• java.sun.com
• comp.lang.java FAQ
• Authorities for query “search engine”
• Yahoo.com
• Excite.com
• Lycos.com
• Altavista.com
• Authorities for query “Gates”
• Microsoft.com
• roadahead.com
Proof of convergence
• n × n adjacency matrix A:
• each of the n pages in the base set has a row and column in the matrix.
• Entry A_ij = 1 if page i links to page j, else A_ij = 0.
Hub/authority vectors
• View the hub scores h() and the authority scores a() as vectors with n components.
• Recall the iterative updates
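A compact sketch of the iterative updates (authority scores summed over in-links, hub scores summed over out-links) with per-iteration scaling, on a made-up three-page adjacency matrix.

```python
import numpy as np

A = np.array([[0, 1, 1],        # A[i, j] = 1 if page i links to page j (toy base set)
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

h = np.ones(3)                  # hub scores
a = np.ones(3)                  # authority scores

for _ in range(5):              # ~5 iterations usually get close to stability
    a = A.T @ h                 # a(x) = sum of h(y) over pages y that link to x
    h = A @ a                   # h(x) = sum of a(y) over pages y that x links to
    a /= a.sum()                # scale down; only relative values matter
    h /= h.sum()

print("authorities:", a)
print("hubs:", h)
```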
HITS for Clustering
• An ambiguous query can result in the principal eigenvector only covering one of the
possible meanings.
• Non-principal eigenvectors may contain hubs & authorities for other meanings.
• Example: “jaguar”:
• Atari video game (principal eigenvector)
• NFL football team (2nd non-principal eigenvector)
• Automobile (3rd non-principal eigenvector)
Issues
• Topic Drift
• Off-topic pages can cause off-topic “authorities” to be returned
• E.g., the neighborhood graph can be about a “super topic”
• Mutually Reinforcing Affiliates
• Affiliated pages/sites can boost each others’ scores
• Linkage between affiliated pages is not a useful signal
Resources
• IIR Chap 21
• http://www2004.org/proceedings/docs/1p309.pdf
• http://www2004.org/proceedings/docs/1p595.pdf
• http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
• http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html
• The WebGraph framework I: Compression techniques (Boldi et al. 2004)
Lecture # 39
Search Computing
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Multi-domain queries with ranking
• Why can’t search engines do it?
• Observed trends
• Search Computing
• The Search Computing “Manifesto”
• Search Computing architecture
Motivation: multi-domain queries with ranking
• Note that enough data is on the Web, but not on a single web page.
Observed trends
• More and more data sources become accessible through Web APIs (as services)
• Surface & deep Web
• Data sources are often coupled with search APIs
• Publishing of structured and interconnected data is becoming popular (Linked Open
Data)
Opportunity for building focused search systems composing the results of several data sources
• easy-to-build, easy-to-query, easy-to-maintain, easy-to-scale...
• covering the functionalities of vertical search systems (e.g. “expedia”, “amazon”) on more focused application domains
• (e.g. localized real estate or leisure planning, sector-specific job market offers, support of biomedical research, ...)
Search Computing = service composition “on demand”
• Composition abstractions should emphasize a few aspects:
• service invocations
• fundamental operations (parallel invocations, joins, pipelining, …)
• global constraints on execution
• Data composition should be search-driven
• aimed at producing few top results very fast
Example query: a house in a walkable area, close to public transportation and located in a pleasant neighborhood.
Lecture # 40
Top-k Query Processing
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Top-k Query Processing
• Simple Database model
• Fagin’s Algorithm
• Threshold Algorithm
• Comparison of Fagin’s and Threshold Algorithm
Top-k Query Processing
Optimal aggregation algorithms for middleware
Ronald Fagin, Amnon Lotem, and Moni Naor
Top-k vs Nested Loop Query
Do sorted access (and the corresponding random accesses) until you have seen the top k answers.
• How do we know that the grades of seen objects are higher than the grades of unseen objects?
• Predict the maximum possible grade of unseen objects:
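A small sketch of the Threshold Algorithm with sum as the (monotone) aggregation function; the two sorted lists and their grades are made up, and the threshold is the aggregate of the grades last seen under sorted access.

```python
import heapq

# Each list is sorted by grade, descending: (object, grade).
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.3)]
L2 = [("b", 1.0), ("c", 0.7), ("a", 0.2), ("d", 0.1)]
grades = [dict(L1), dict(L2)]                  # supports random access by object

def top_k(lists, k=2):
    seen = {}                                  # object -> aggregated grade
    for depth in range(len(lists[0])):
        for lst in lists:                      # sorted access, plus random access for the rest
            obj = lst[depth][0]
            seen[obj] = sum(g[obj] for g in grades)
        threshold = sum(lst[depth][1] for lst in lists)   # bound on any unseen object
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best                        # k-th best already beats every unseen object
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

print(top_k([L1, L2]))                         # [('b', 1.8), ('c', 1.2)]
```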
Extending TA
• What if sorted access is restricted? (e.g. using a distance database)
• TA_z
• What if random access is not possible? (e.g. a web search engine)
• No Random Access (NRA) algorithm
• What if we want only the approximate top k objects?
• TA_θ
• What if we consider the relative costs of random and sorted access?
• Combined Algorithm (CA, between TA and NRA)
Lecture # 41
Clustering
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• What is clustering?
• Improving search recall
• Issues for clustering
• Notion of similarity/distance
• Hard vs. soft clustering
• Clustering Algorithms
What is clustering?
• Clustering: the process of grouping a set of objects into classes of similar objects
• Documents within a cluster should be similar.
• Documents from different clusters should be dissimilar.
• The commonest form of unsupervised learning
• Unsupervised learning = learning from raw data, as
opposed to supervised data where a classification of
examples is given
• A common and important task that finds many applications in IR and
other places
Applications of clustering in IR
• Whole corpus analysis/navigation
• Better user interface: search without typing
• For improving recall in search applications
• Better search results (like pseudo RF)
• For better navigation of search results
• Effective “user recall” will be higher
• For speeding up vector space retrieval
• Cluster-based retrieval gives faster search
For improving search recall
• Cluster hypothesis - Documents in the same cluster behave similarly with respect to
relevance to information needs
• Therefore, to improve search recall:
• Cluster docs in corpus a priori
• When a query matches a doc D, also return other docs in the cluster
containing D
• Hope if we do this: The query “car” will also return docs containing automobile
• Because clustering grouped together docs containing car with those
containing automobile.
Hard vs. soft clustering
• Soft clustering makes more sense for applications like creating browsable hierarchies
• You may want to put a pair of sneakers in two clusters: (i) sports apparel
and (ii) shoes
• You can only do that with a soft clustering approach.
Clustering Algorithms
• Flat algorithms
• Usually start with a random (partial) partitioning
• Refine it iteratively
• K means clustering
• (Model based clustering)
• Hierarchical algorithms
• Bottom-up, agglomerative
• (Top-down, divisive)
Partitioning Algorithms
• Partitioning method: Construct a partition of n documents into a set of K clusters
• Given: a set of documents and the number K
• Find: a partition of K clusters that optimizes the chosen partitioning criterion
• Globally optimal: exhaustively enumerate all partitions
• Intractable for many objective functions
• Effective heuristic methods: K-means and K-medoids algorithms
K-Means
• Assumes documents are real-valued vectors.
• Clusters are based on centroids (aka the center of gravity or mean) of the points in a cluster c: μ(c) = (1/|c|) Σ_{x ∈ c} x
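A compact K-means sketch on made-up 2-D vectors (K = 2), alternating the reassignment and recomputation steps discussed next and reporting the final RSS; empty-cluster handling is omitted for brevity.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [0.9, 1.1]])
K = 2
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), K, replace=False)]    # seed choice: K random documents

for _ in range(10):                                    # or stop when assignments stop changing
    # Reassignment: each doc goes to its closest centroid (never increases RSS).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Recomputation: each centroid becomes the mean of its cluster (never increases RSS).
    centroids = np.array([X[assign == j].mean(axis=0) for j in range(K)])

rss = sum(np.sum((X[assign == j] - centroids[j]) ** 2) for j in range(K))
print(assign, centroids, rss)
```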
Termination conditions
• Several possibilities, e.g.,
• A fixed number of iterations.
• Doc partition unchanged.
• Centroid positions don’t change.
Convergence
• Why should the K-means algorithm ever reach a fixed point?
• A state in which clusters don’t change.
• K-means is a special case of a general procedure known as the Expectation
Maximization (EM) algorithm.
• EM is known to converge.
• Number of iterations could be large.
• But in practice usually isn’t
Convergence of K-Means
• Residual Sum of Squares (RSS), a goodness measure of a cluster, is the sum of squared distances from the cluster centroid:
• RSS_j = Σ_i |d_i − c_j|²   (sum over all d_i in cluster j)
• RSS = Σ_j RSS_j
• Reassignment monotonically decreases RSS since each vector is assigned to the closest centroid.
• Recomputation also monotonically decreases each RSS_j because …
Seed Choice
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
Lecture # 42
Classification
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Why probabilities in IR?
• Document Classification
• Bayes’ Rule For Text Classification
• Bernoulli Random Variables
• Smoothing Function
Document Classification
• Bayes’ rule: P(C | D) = P(D | C) P(C) / P(D)
• For a document represented as <w_1, w_2, …, w_k>, the Bernoulli likelihood terms involve per-term presence/absence probabilities, e.g.
  {P(ā_1|C) P(ā_2|C) P(ā_3|C) … P(ā_L|C)}  and  [1 − {P(a_1|C) P(a_2|C) P(a_3|C) … P(a_L|C)}],
  where a_i is the event that term i occurs in the document and ā_i is its complement.
Doc   t1  t2  t3  t4  t5   Class
1     0   0   0   0   1    Not spam
2     1   0   1   0   1    Spam
3     0   0   0   0   1    Not spam
4     1   0   1   0   1    Spam
5     1   1   0   0   1    Spam
6     0   0   1   0   1    Not spam
7     0   1   1   0   1    Not spam
8     0   0   0   0   1    Not spam
9     0   0   0   0   1    Not spam
10    1   1   0   1   1    Not spam
(the term columns are shown generically as t1–t5; the handout does not give the term names)
Smoothing Function
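A sketch of Bernoulli Naive Bayes on the table above, using add-one (Laplace) smoothing so that a term never seen in a class does not zero out the whole product; the feature columns are referred to generically as t1–t5 because the original term names are not given.

```python
rows = [
    ([0, 0, 0, 0, 1], "not spam"), ([1, 0, 1, 0, 1], "spam"),
    ([0, 0, 0, 0, 1], "not spam"), ([1, 0, 1, 0, 1], "spam"),
    ([1, 1, 0, 0, 1], "spam"),     ([0, 0, 1, 0, 1], "not spam"),
    ([0, 1, 1, 0, 1], "not spam"), ([0, 0, 0, 0, 1], "not spam"),
    ([0, 0, 0, 0, 1], "not spam"), ([1, 1, 0, 1, 1], "not spam"),
]
classes = {"spam", "not spam"}

prior = {c: sum(1 for _, y in rows if y == c) / len(rows) for c in classes}
# P(t_i = 1 | C) with add-one smoothing: (docs in C containing t_i + 1) / (docs in C + 2)
p_term = {c: [(sum(x[i] for x, y in rows if y == c) + 1) /
              (sum(1 for _, y in rows if y == c) + 2) for i in range(5)]
          for c in classes}

def classify(x):
    def score(c):
        s = prior[c]
        for i, xi in enumerate(x):
            s *= p_term[c][i] if xi else (1 - p_term[c][i])   # absent terms count too
        return s
    return max(classes, key=score)

print(classify([1, 0, 1, 0, 1]))   # 'spam' (the doc looks like training docs 2 and 4)
```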
Lecture # 43
Classification
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Rochio classification
• K Nearest neighbors
• Nearest-Neighbor Learning
• kNN decision boundaries
• Bias vs. variance
Rocchio Algorithm
• Relevance feedback methods can be adapted for text categorization
• As noted before, relevance feedback can be viewed as 2-class
classification
• Relevant vs. non-relevant documents
• Use standard tf-idf weighted vectors to represent text documents
• For training documents in each category, compute a prototype vector by summing the
vectors of the training documents in the category.
• Prototype : centroid of members of class
• Assign test documents to the category with the closest prototype vector based on
cosine similarity
Rocchio Algorithm
• There may be many more red vectors along the y-axis, and they will drift the red centroid towards the y-axis; an awkward red document may then end up closer to the blue centroid.
Rocchio classification
• Little used outside text classification
Rocchio classification
• Rocchio forms a simple representative for each class: the centroid/prototype
• Classification: nearest prototype/centroid
• It does not guarantee that classifications are consistent with the given training data
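A minimal nearest-prototype (Rocchio) sketch; the two classes and the training vectors (assumed to be pre-weighted tf-idf vectors) are made up.

```python
import numpy as np

train = {
    "sports":   [np.array([0.9, 0.1]), np.array([0.8, 0.3])],
    "politics": [np.array([0.1, 0.9]), np.array([0.2, 0.8])],
}

# Prototype for each class: the centroid of its training vectors.
prototype = {c: np.mean(vecs, axis=0) for c, vecs in train.items()}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rocchio_classify(x):
    return max(prototype, key=lambda c: cosine(x, prototype[c]))   # closest prototype wins

print(rocchio_classify(np.array([0.7, 0.2])))   # 'sports'
```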
Nearest-Neighbor Learning
• Learning: just store the labeled training examples D
• Testing instance x (under 1NN):
• Compute similarity between x and all examples in D.
• Assign x the category of the most similar example in D.
• Does not compute anything beyond storing the examples
• Also called:
• Case-based learning (remembering every single example of each class)
• Memory-based learning (memorizing every instance of training set)
• Lazy learning
• Rationale of kNN: contiguity hypothesis (docs which are near to a given input doc will
decide its class)
Nearest-Neighbor
k Nearest Neighbor
• Using only the closest example (1NN) is subject to errors due to:
• A single atypical example.
• Noise (i.e., an error) in the category label of a single training example.
• More robust: find the k examples and return the majority category of these k
• k is typically odd to avoid ties; 3 and 5 are most common
• Assign weight (relevance) of neighbors to decide.
kNN: Discussion
• No feature selection necessary
• No training necessary
Evaluating Categorization
• Evaluation must be done on test data that are independent of the training data
• Sometimes use cross-validation (averaging results over multiple training
and test splits of the overall data)
• Easy to get good performance on a test set that was available to the learner during
training (e.g., just memorize the test set)
Lecture # 44
Recommender Systems
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Recommender Systems
• Personalization
• Basic Types of Recommender Systems
• Collaborative Filtering
• Content-Based Recommending
Recommender Systems
• Systems for recommending items (e.g. books, movies, CD’s, web pages, newsgroup
messages) to users based on examples of their preferences.
• Many on-line stores provide recommendations (e.g. Amazon, CDNow).
• Recommenders have been shown to substantially increase sales at on-line stores.
Book Recommender
Personalization
• Recommenders are instances of personalization software.
• Personalization concerns adapting to the individual needs, interests, and preferences
of each user.
• Includes:
• Recommending
• Filtering
• Predicting (e.g. form or calendar appt. completion)
• From a business perspective, it is viewed as part of Customer Relationship
Management (CRM).
• Almost all existing commercial recommenders use collaborative filtering, described next (e.g. Amazon).
Collaborative Filtering
Similarity Weighting
• Typically use Pearson correlation coefficient between ratings for active user, a, and
another user, u.
• Covariance:
Significance Weighting
• Important not to trust correlations based on very few co-rated items.
• Include significance weights, s_a,u, based on the number of co-rated items, m.
Neighbor Selection
• For a given active user, a, select correlated users to serve as source of predictions.
• The standard approach is to use the n most similar users, u, based on the similarity weights w_a,u.
• Alternate approach is to include all users whose similarity weight is above a given
threshold.
Rating Prediction
• Predict a rating, p_a,i, for each item i for the active user, a, by using the n selected neighbor users, u ∈ {1, 2, …, n}.
• To account for users’ different rating levels, base predictions on differences from each user’s average rating.
• Weight each user’s rating contribution by their similarity to the active user.
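A small user-based collaborative filtering sketch combining Pearson similarity over co-rated items with mean-offset prediction; the ratings matrix and user names are made up.

```python
import numpy as np

ratings = {                       # user -> {item: rating}
    "active": {"i1": 5, "i2": 3, "i3": 4},
    "u1":     {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
    "u2":     {"i1": 2, "i2": 5, "i3": 1, "i4": 1},
}

def pearson(a, u):
    common = [i for i in ratings[a] if i in ratings[u]]          # co-rated items only
    ra = np.array([ratings[a][i] for i in common], dtype=float)
    ru = np.array([ratings[u][i] for i in common], dtype=float)
    ra, ru = ra - ra.mean(), ru - ru.mean()
    denom = np.linalg.norm(ra) * np.linalg.norm(ru)
    return float(ra @ ru / denom) if denom else 0.0

def predict(a, item, neighbors):
    mean_a = np.mean(list(ratings[a].values()))
    num = den = 0.0
    for u in neighbors:
        if item in ratings[u]:
            w = pearson(a, u)
            num += w * (ratings[u][item] - np.mean(list(ratings[u].values())))
            den += abs(w)
    return mean_a + num / den if den else mean_a

print(predict("active", "i4", ["u1", "u2"]))   # predicted rating for an unrated item
```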
• Cold Start: There needs to be enough other users already in the system to find a
match.
• Sparsity: If there are many items to be recommended, even if there are many users,
the user/ratings matrix is sparse, and it is hard to find users that have rated the same
items.
• First Rater: Cannot recommend an item that has not been previously rated.
• New items
• Esoteric items (unique)
• Popularity Bias: Cannot recommend items to someone with unique tastes.
• Tends to recommend popular items.
Content-Based Recommending
• Uses a machine learning algorithm to induce a profile of the user’s preferences from examples, based on a featural description of content.
• Some previous applications:
• Newsweeder (Lang, 1995)
• Syskill and Webert (Pazzani et al., 1996)
LIBRA
Learning Intelligent Book Recommending Agent
• Content-based recommender for books using information about titles extracted from
Amazon.
• Uses information extraction from the web to organize text into fields:
• Author
• Title
• Editorial Reviews
• Customer Comments
• Subject terms
• Related authors
• Related titles
LIBRA System
Lecture # 45
Final Notes on Information Retrieval
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources
1. “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D.
Manning, and Hinrich Schütze
2. “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
3. “Modern information retrieval” by Baeza-Yates Ricardo,
4. “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline
• Topics that we covered
• Database Management Research