Overview: This chapter discusses techniques for preprocessing text documents prior to indexing them for information retrieval: lexical analysis, elimination of stopwords, stemming, and selection of index terms. It also covers the construction of term categorization structures (thesauri) and several methods for weighting terms, such as term frequency-inverse document frequency (TFxIDF), term discrimination value, and probabilistic term weighting. The goal of these techniques is to extract the most important and discriminative terms from documents to facilitate efficient and effective retrieval.

Modern Information Retrieval

Chapter 7: Text Operations

Ricardo Baeza-Yates
Berthier Ribeiro-Neto
Document Preprocessing
 Lexical analysis of the text
 Elimination of stopwords
 Stemming
 Selection of index terms
 Construction of term categorization structures
Lexical Analysis of the Text
 Word separators
 space (the basic separator)
 Issues to resolve (a tokenizer sketch follows)
 digits
 hyphens
 punctuation marks
 the case of the letters
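A minimal tokenizer sketch along these lines (assumptions for illustration: digits and hyphens are treated as separators and case is folded to lower; production lexers make finer-grained choices, e.g. keeping numbers like "2001" or hyphenated terms intact):

```python
import re

def tokenize(text):
    """Lexical analysis sketch: whitespace, digits, hyphens, and
    punctuation act as word separators; case is folded to lower."""
    # Anything that is not a letter is treated as a separator.
    return re.sub(r"[^A-Za-z]+", " ", text).lower().split()

print(tokenize("State-of-the-art IR systems (ca. 2001)!"))
# ['state', 'of', 'the', 'art', 'ir', 'systems', 'ca']
```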
Elimination of Stopwords
 A list of stopwords
 words that are too frequent among the documents
 articles, prepositions, conjunctions, etc.
 Can reduce the size of the indexing structure considerably
 Problem
 Search for “to be or not to be”?
Stemming
 Example
 connect, connected, connecting, connection, connections
 effectiveness --> effective --> effect
 picnicking --> picnic
 king -/-> k (a stemmer must not reduce “king” to “k”)
 Stemming strategies (a toy sketch of affix removal follows)
 affix removal: intuitive, simple
 table lookup
 successor variety
 n-gram
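A toy sketch of the affix-removal strategy (the suffix list and the minimum-stem condition are illustrative assumptions, not the Porter algorithm; the condition is what keeps "king" from being reduced to "k"):

```python
# Illustrative suffix rules, longest first; not a real stemmer.
SUFFIXES = sorted(["ations", "ation", "ions", "ing", "ion", "ed", "s"],
                  key=len, reverse=True)

def stem(word, min_stem=3):
    """Strip the first matching suffix, but only if the remaining
    stem keeps at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["connected", "connecting", "connection", "connections", "king"]:
    print(w, "->", stem(w))
# all 'connect...' forms map to 'connect'; 'king' stays 'king'
```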
Index Terms Selection
 Motivation
 A sentence is usually composed of nouns, pronouns, articles, verbs,
adjectives, adverbs, and connectives.
 Most of the semantics is carried by the nouns.
 Identification of noun groups (a sketch follows)
 A noun group is a set of nouns whose syntactic distance in the text
does not exceed a predefined threshold
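A sketch of noun-group identification under this definition. It assumes the text has already been part-of-speech tagged (a hardcoded list of (word, tag) pairs stands in for a real tagger) and approximates syntactic distance by the number of intervening tokens:

```python
def noun_groups(tagged_tokens, max_distance=2):
    """Collect nouns into groups; a gap larger than max_distance
    tokens between consecutive nouns starts a new group."""
    groups, current, last_pos = [], [], None
    for pos, (word, tag) in enumerate(tagged_tokens):
        if tag == "NOUN":
            if last_pos is not None and pos - last_pos > max_distance:
                groups.append(current)
                current = []
            current.append(word)
            last_pos = pos
    if current:
        groups.append(current)
    return groups

# Hypothetical pre-tagged sentence.
tagged = [("the", "DET"), ("house", "NOUN"), ("of", "ADP"),
          ("lords", "NOUN"), ("voted", "VERB"), ("on", "ADP"),
          ("new", "ADJ"), ("taxes", "NOUN")]
print(noun_groups(tagged))  # [['house', 'lords'], ['taxes']]
```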
Thesauri
 Roget’s Thesaurus (Peter Roget; 1988 edition)
 Example
cowardly adj.
Ignobly lacking in courage: cowardly turncoats
Syns: chicken (slang), chicken-hearted, craven, dastardly,
faint-hearted, gutless, lily-livered, pusillanimous, unmanly,
yellow (slang), yellow-bellied (slang).
 A controlled vocabulary for indexing and searching
The Purpose of a Thesaurus
 To provide a standard vocabulary for indexing
and searching
 To assist users with locating terms for proper
query formulation
 To provide classified hierarchies that allow the
broadening and narrowing of the current query
request
Thesaurus Term Relationships
 BT: broader term
 NT: narrower term
 RT: related term (non-hierarchical)
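A minimal sketch of how such relationships might be stored and used for query reformulation (the vocabulary and links below are invented for illustration):

```python
# Hypothetical controlled vocabulary with BT/NT/RT links per term.
THESAURUS = {
    "cat":    {"BT": ["feline"], "NT": ["siamese"], "RT": ["pet"]},
    "feline": {"BT": ["mammal"], "NT": ["cat"],     "RT": []},
}

def related_terms(term, relation):
    """Follow BT (broaden), NT (narrow), or RT (associate) links."""
    return THESAURUS.get(term, {}).get(relation, [])

print(related_terms("cat", "BT"))  # ['feline']  (broadens the query)
print(related_terms("cat", "NT"))  # ['siamese'] (narrows the query)
```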
Term Selection
Automatic Text Processing
by G. Salton, Chap 9,
Addison-Wesley, 1989.
Automatic Indexing
 Indexing:
 assign identifiers (index terms) to text documents.
 Identifiers:
 single-term vs. term phrase
 controlled vs. uncontrolled vocabularies
instruction manuals, terminological schedules, …
 objective vs. nonobjective text identifiers
cataloging rules define, e.g., author names, publisher names,
dates of publication, …
Two Issues
 Issue 1: indexing exhaustivity
 exhaustive: assign a large number of terms
 nonexhaustive
 Issue 2: term specificity
 broad terms (generic)
cannot distinguish relevant from nonrelevant documents
 narrow terms (specific)
retrieve relatively fewer documents, but most of them are
relevant
Parameters of Retrieval Effectiveness
 Recall

R = \frac{\text{number of relevant items retrieved}}{\text{total number of relevant items in collection}}

 Precision

P = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}}

 Goal: high recall and high precision

[Figure: the collection partitioned into a retrieved and a nonretrieved part]

                 Relevant   Nonrelevant
Retrieved            a           b
Not retrieved        d           c

\text{Recall} = \frac{a}{a+d} \qquad \text{Precision} = \frac{a}{a+b}
A Joint Measure
 F-score

F = \frac{(\beta^2 + 1) \times P \times R}{\beta^2 \times P + R}

 β is a parameter that encodes the relative importance of recall and precision
 β = 1: equal weight
 β < 1: precision is more important
 β > 1: recall is more important
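The contingency table and the F-score formula translate directly into code; a minimal sketch using the counts a, b, d defined above:

```python
def effectiveness(a, b, d, beta=1.0):
    """a: relevant retrieved, b: nonrelevant retrieved,
    d: relevant not retrieved (as in the table above)."""
    p = a / (a + b)                                # precision
    r = a / (a + d)                                # recall
    f = (beta**2 + 1) * p * r / (beta**2 * p + r)  # F-score
    return p, r, f

# 30 relevant retrieved, 20 nonrelevant retrieved, 10 relevant missed.
print(effectiveness(30, 20, 10))  # (0.6, 0.75, 0.666...)
```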
Choices of Recall and Precision
 Both recall and precision vary from 0 to 1.
 Particular choices of indexing and search policies
have produced variations in performance ranging
from 0.8 precision and 0.2 recall to 0.1 precision
and 0.8 recall.
 In many circumstances, recall and precision values between
0.5 and 0.6 are satisfactory for the average user.
Term-Frequency Consideration
 Function words
 for example, "and", "or", "of", "but", …
 the frequencies of these words are high in all texts
 Content words
 words that actually relate to document content
 varying frequencies in the different texts of a collection
 indicate term importance for content
A Frequency-Based Indexing Method
 Eliminate common function words from the document texts by
consulting a special dictionary, or stop list, containing a list of
high-frequency function words.
 Compute the term frequency tfij for all remaining terms Tj
in each document Di, specifying the number of occurrences of Tj in Di.
 Choose a threshold frequency T, and assign to each document Di
all terms Tj for which tfij > T. (A sketch of these steps follows.)
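A sketch of the three steps above (the stoplist and the threshold value are illustrative):

```python
from collections import Counter

STOPLIST = {"and", "or", "of", "but", "the", "a", "to", "in"}  # illustrative

def index_terms(tokens, threshold=1):
    """Drop stoplist words, count term frequencies, and keep the
    terms whose frequency exceeds the threshold T."""
    tf = Counter(t for t in tokens if t not in STOPLIST)
    return {term: f for term, f in tf.items() if f > threshold}

doc = "the cat and the dog and the cat".split()
print(index_terms(doc))  # {'cat': 2}
```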
Inverse Document Frequency
 Inverse Document Frequency (IDF) for term Tj

idf_j = \log \frac{N}{df_j}

where dfj (document frequency of term Tj) is the number of
documents in which Tj occurs, and N is the total number of
documents in the collection.
 Terms that fulfil both the recall and the precision requirements
occur frequently in individual documents but rarely in the
remainder of the collection.
TFxIDF
 Weight wij of a term Tj in a document Di

w_{ij} = tf_{ij} \times \log \frac{N}{df_j}

 Eliminating common function words
 Computing the value of wij for each term Tj in each document Di
 Assigning to the documents of a collection all terms with
sufficiently high (tf × idf) factors
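A minimal TFxIDF sketch over a toy collection (natural log is used here; the base only rescales the weights):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns, per document, a dict of
    weights w_ij = tf_ij * log(N / df_j)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: tf * math.log(n / df[term])
             for term, tf in Counter(doc).items()} for doc in docs]

docs = [["ir", "text", "text"], ["ir", "query"], ["query", "text"]]
for weights in tfidf(docs):
    print(weights)  # terms occurring in many documents get low weights
```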
Term-Discrimination Value
 Useful index terms
 distinguish the documents of a collection from each other
 Document space
 Two documents are assigned very similar term sets when the
corresponding points in the document space appear close together.
 When a high-frequency term without discrimination power is
assigned, it increases the document space density.
A Virtual Document Space

[Figure: three snapshots of the document space: the original state,
after assignment of a good discriminator, and after assignment of a
poor discriminator]
Good Term Assignment
 When a term is assigned to the documents of a collection, the few
objects to which the term is assigned will be distinguished from the
rest of the collection.
 This should increase the average distance between the objects in
the collection and hence produce a document space less dense than
before.
Poor Term Assignment
 A high-frequency term is assigned that does not discriminate
between the objects of a collection. Its assignment will render the
documents more similar.
 This is reflected in an increase in document space density.
Term Discrimination Value
 Definition

dv_j = Q - Q_j

where Q and Qj are the space densities before and after the
assignment of term Tj, with

Q = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \neq i}}^{N} sim(D_i, D_k)

 dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term.
Variations of Term-Discrimination Value with Document Frequency

[Figure: document frequency ranging from 0 to N]

 Low frequency: dvj = 0
 Medium frequency: dvj > 0
 High frequency: dvj < 0
TFij × dvj
 wij = tfij × dvj
 compared with w_{ij} = tf_{ij} \times \log \frac{N}{df_j}
 \log \frac{N}{df_j} decreases steadily with increasing document
frequency
 dvj increases from zero to a positive value as the document
frequency of the term increases, then decreases sharply as the
document frequency becomes still larger.
Document Centroid
 Issue: efficiency problem
 computing Q requires N(N-1) pairwise similarities
 Document centroid C = (c1, c2, c3, ..., ct)

c_j = \sum_{i=1}^{N} w_{ij}

where wij is the weight of term Tj in document Di.
 Space density (a sketch follows)

Q = \frac{1}{N} \sum_{i=1}^{N} sim(C, D_i)
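A sketch of the centroid shortcut (cosine similarity is assumed for sim; document vectors are dicts of term weights), reducing the N(N-1) pairwise similarities to N similarities against the centroid:

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """c_j = sum over all documents of w_ij."""
    c = {}
    for vec in vectors:
        for term, w in vec.items():
            c[term] = c.get(term, 0.0) + w
    return c

def space_density(vectors):
    """Q = (1/N) * sum_i sim(C, D_i)."""
    c = centroid(vectors)
    return sum(cosine(c, v) for v in vectors) / len(vectors)

# dv_j can then be estimated as the density without term T_j minus
# the density with T_j assigned.
```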
Probabilistic Term Weighting
 Goal
 Explicit distinctions between occurrences of terms in relevant and
nonrelevant documents of a collection
 Definition
 Given a user query q, and the ideal answer set of the relevant
documents
 From decision theory, the best ranking algorithm for a document D:

g(D) = \log \frac{Pr(D \mid rel)}{Pr(D \mid nonrel)} + \log \frac{Pr(rel)}{Pr(nonrel)}
Probabilistic Term Weighting
 Pr(rel), Pr(nonrel):
 the document’s a priori probabilities of relevance and nonrelevance
 Pr(D|rel), Pr(D|nonrel):
 occurrence probabilities of document D in the relevant and
nonrelevant document sets
Assumptions
 Terms occur independently in documents:

Pr(D \mid rel) = \prod_{i=1}^{t} Pr(x_i \mid rel)

Pr(D \mid nonrel) = \prod_{i=1}^{t} Pr(x_i \mid nonrel)
Derivation Process

g(D) = \log \frac{Pr(D \mid rel)}{Pr(D \mid nonrel)} + \log \frac{Pr(rel)}{Pr(nonrel)}

     = \log \frac{\prod_{i=1}^{t} Pr(x_i \mid rel)}{\prod_{i=1}^{t} Pr(x_i \mid nonrel)} + \text{constants}

     = \sum_{i=1}^{t} \log \frac{Pr(x_i \mid rel)}{Pr(x_i \mid nonrel)} + \text{constants}
For a Specific Document D
 Given a document D = (d1, d2, …, dt):

g(D) = \sum_{i=1}^{t} \log \frac{Pr(x_i = d_i \mid rel)}{Pr(x_i = d_i \mid nonrel)} + \text{constants}

 Assume di is either 0 (absent) or 1 (present), and let
Pr(xi = 1 | rel) = pi, Pr(xi = 0 | rel) = 1 - pi,
Pr(xi = 1 | nonrel) = qi, Pr(xi = 0 | nonrel) = 1 - qi, so that

Pr(x_i = d_i \mid rel) = p_i^{d_i} (1 - p_i)^{1 - d_i}
Pr(x_i = d_i \mid nonrel) = q_i^{d_i} (1 - q_i)^{1 - d_i}

 Substituting,

g(D) = \sum_{i=1}^{t} \log \frac{p_i^{d_i} (1 - p_i)^{1 - d_i}}{q_i^{d_i} (1 - q_i)^{1 - d_i}} + \text{constants}

     = \sum_{i=1}^{t} \log \left[ \left( \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \right)^{d_i} \frac{1 - p_i}{1 - q_i} \right] + \text{constants}
Term Relevance Weight

g(D) = \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i} + \sum_{i=1}^{t} d_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \text{constants}

 The first sum does not depend on D and is absorbed into the
constants; the term relevance weight is

tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}
Issue
 How to compute pj and qj?

pj = rj / R
qj = (dfj - rj) / (N - R)

 rj: the number of relevant documents that contain term Tj
 R: the total number of relevant documents
 N: the total number of documents
(a sketch using these estimates follows)
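A sketch computing tr_j from these estimates; the +0.5/+1 smoothing below is an added assumption (a common adjustment) to keep the logarithm finite when r_j = 0 or r_j = R:

```python
import math

def term_relevance(r_j, df_j, R, N):
    """tr_j = log(p_j (1 - q_j) / (q_j (1 - p_j))) with
    p_j = r_j / R and q_j = (df_j - r_j) / (N - R);
    +0.5/+1 smoothing (an assumption) keeps p and q away from 0 and 1."""
    p = (r_j + 0.5) / (R + 1.0)
    q = (df_j - r_j + 0.5) / (N - R + 1.0)
    return math.log(p * (1 - q) / (q * (1 - p)))

# 100 documents, 10 of them relevant; the term occurs in 12
# documents, 8 of which are relevant.
print(term_relevance(r_j=8, df_j=12, R=10, N=100))  # ≈ 4.2, a good term
```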
Estimation of Term-Relevance
 The occurrence probability of a term in the nonrelevant documents,
qj, is approximated by the occurrence probability of the term in the
entire document collection:
qj = dfj / N
 The occurrence probabilities of the terms in the small number of
relevant documents are assumed equal, using a constant value
pj = 0.5 for all j.
Comparison

tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}
     = \log \frac{0.5 \times \left(1 - \frac{df_j}{N}\right)}{\frac{df_j}{N} \times 0.5}
     = \log \frac{N - df_j}{df_j}

 When N is sufficiently large, N - dfj ≈ N, and

tr_j = \log \frac{N - df_j}{df_j} \approx \log \frac{N}{df_j} = idf_j
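A quick numeric check of the approximation (collection size and document frequency are illustrative):

```python
import math

N, df = 1_000_000, 100        # illustrative values
tr = math.log((N - df) / df)  # log((N - df_j) / df_j)
idf = math.log(N / df)        # log(N / df_j)
print(tr, idf)                # 9.21024... vs 9.21034...: nearly identical
```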
Estimation of Term-Relevance
 Alternatively, estimate the number of relevant documents rj in the
collection that contain term Tj as a function of the known document
frequency dfj of the term Tj, and use
pj = rj / R
qj = (dfj - rj) / (N - R)
R: an estimate of the total number of relevant documents in the
collection.
Summary
 Inverse document frequency, idfj
 tfij × idfj (TFxIDF)
 Term discrimination value, dvj
 tfij × dvj
 Probabilistic term weighting, trj
 tfij × trj
 All are based on global properties of terms in a document collection