Overview: This chapter discusses techniques for preprocessing text documents prior to indexing them for information retrieval: lexical analysis, elimination of stopwords, stemming, and selection of index terms. It also covers the construction of term categorization structures (thesauri) and several methods for weighting terms, such as term frequency-inverse document frequency (TFxIDF), term discrimination value, and probabilistic term weighting. The goal of these techniques is to extract the most important and discriminative terms from documents to facilitate efficient and effective retrieval.

Modern Information Retrieval

Chapter 7: Text Operations

Ricardo Baeza-Yates
Berthier Ribeiro-Neto
Document Preprocessing
 Lexical analysis of the text
 Elimination of stopwords
 Stemming
 Selection of index terms
 Construction of term categorization structures
Lexical Analysis of the Text
 Word separators
 space (the basic separator)
 Issues to resolve (a tokenizer sketch follows)
 digits
 hyphens
 punctuation marks
 the case of the letters
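A minimal tokenizer sketch along these lines (assumptions for illustration: digits and hyphens are treated as separators and case is folded to lower; production lexers make finer-grained choices, e.g. keeping numbers like "2001" or hyphenated terms intact):

```python
import re

def tokenize(text):
    """Lexical analysis sketch: whitespace, digits, hyphens, and
    punctuation act as word separators; case is folded to lower."""
    # Anything that is not a letter is treated as a separator.
    return re.sub(r"[^A-Za-z]+", " ", text).lower().split()

print(tokenize("State-of-the-art IR systems (ca. 2001)!"))
# ['state', 'of', 'the', 'art', 'ir', 'systems', 'ca']
```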
Elimination of Stopwords
 A list of stopwords
 words that are too frequent among the documents
 articles, prepositions, conjunctions, etc.
 Can reduce the size of the indexing structure considerably
 Problem
 Search for “to be or not to be”?
Stemming
 Example
 connect, connected, connecting, connection, connections
 effectiveness --> effective --> effect
 picnicking --> picnic
 king -/-> k (a stemmer must not reduce “king” to “k”)
 Stemming strategies (a toy sketch of affix removal follows)
 affix removal: intuitive, simple
 table lookup
 successor variety
 n-gram
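A toy sketch of the affix-removal strategy (the suffix list and the minimum-stem condition are illustrative assumptions, not the Porter algorithm; the condition is what keeps "king" from being reduced to "k"):

```python
# Illustrative suffix rules, longest first; not a real stemmer.
SUFFIXES = sorted(["ations", "ation", "ions", "ing", "ion", "ed", "s"],
                  key=len, reverse=True)

def stem(word, min_stem=3):
    """Strip the first matching suffix, but only if the remaining
    stem keeps at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["connected", "connecting", "connection", "connections", "king"]:
    print(w, "->", stem(w))
# all 'connect...' forms map to 'connect'; 'king' stays 'king'
```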
Index Terms Selection
 Motivation
 A sentence is usually composed of nouns, pronouns, articles, verbs,
adjectives, adverbs, and connectives.
 Most of the semantics is carried by the nouns.
 Identification of noun groups (a sketch follows)
 A noun group is a set of nouns whose syntactic distance in the text
does not exceed a predefined threshold
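A sketch of noun-group identification under this definition. It assumes the text has already been part-of-speech tagged (a hardcoded list of (word, tag) pairs stands in for a real tagger) and approximates syntactic distance by the number of intervening tokens:

```python
def noun_groups(tagged_tokens, max_distance=2):
    """Collect nouns into groups; a gap larger than max_distance
    tokens between consecutive nouns starts a new group."""
    groups, current, last_pos = [], [], None
    for pos, (word, tag) in enumerate(tagged_tokens):
        if tag == "NOUN":
            if last_pos is not None and pos - last_pos > max_distance:
                groups.append(current)
                current = []
            current.append(word)
            last_pos = pos
    if current:
        groups.append(current)
    return groups

# Hypothetical pre-tagged sentence.
tagged = [("the", "DET"), ("house", "NOUN"), ("of", "ADP"),
          ("lords", "NOUN"), ("voted", "VERB"), ("on", "ADP"),
          ("new", "ADJ"), ("taxes", "NOUN")]
print(noun_groups(tagged))  # [['house', 'lords'], ['taxes']]
```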
Thesauri
 Roget’s Thesaurus (Peter Roget; 1988 edition)
 Example
cowardly adj.
Ignobly lacking in courage: cowardly turncoats
Syns: chicken (slang), chicken-hearted, craven, dastardly,
faint-hearted, gutless, lily-livered, pusillanimous, unmanly,
yellow (slang), yellow-bellied (slang).
 A controlled vocabulary for indexing and searching
The Purpose of a Thesaurus
 To provide a standard vocabulary for indexing
and searching
 To assist users with locating terms for proper
query formulation
 To provide classified hierarchies that allow the
broadening and narrowing of the current query
request
Thesaurus Term Relationships
 BT: broader term
 NT: narrower term
 RT: related term (non-hierarchical)
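A minimal sketch of how such relationships might be stored and used for query reformulation (the vocabulary and links below are invented for illustration):

```python
# Hypothetical controlled vocabulary with BT/NT/RT links per term.
THESAURUS = {
    "cat":    {"BT": ["feline"], "NT": ["siamese"], "RT": ["pet"]},
    "feline": {"BT": ["mammal"], "NT": ["cat"],     "RT": []},
}

def related_terms(term, relation):
    """Follow BT (broaden), NT (narrow), or RT (associate) links."""
    return THESAURUS.get(term, {}).get(relation, [])

print(related_terms("cat", "BT"))  # ['feline']  (broadens the query)
print(related_terms("cat", "NT"))  # ['siamese'] (narrows the query)
```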
Term Selection
Automatic Text Processing
by G. Salton, Chap 9,
Addison-Wesley, 1989.
Automatic Indexing
 Indexing:
 assign identifiers (index terms) to text documents.
 Identifiers:
 single-term vs. term phrase
 controlled vs. uncontrolled vocabularies
instruction manuals, terminological schedules, …
 objective vs. nonobjective text identifiers
cataloging rules define, e.g., author names, publisher names,
dates of publication, …
Two Issues
 Issue 1: indexing exhaustivity
 exhaustive: assign a large number of terms
 nonexhaustive
 Issue 2: term specificity
 broad terms (generic)
cannot distinguish relevant from nonrelevant documents
 narrow terms (specific)
retrieve relatively fewer documents, but most of them are
relevant
Parameters of Retrieval Effectiveness
 Recall

R = \frac{\text{number of relevant items retrieved}}{\text{total number of relevant items in collection}}

 Precision

P = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}}

 Goal: high recall and high precision

[Figure: the collection partitioned into a retrieved and a nonretrieved part]

                 Relevant   Nonrelevant
Retrieved            a           b
Not retrieved        d           c

\text{Recall} = \frac{a}{a+d} \qquad \text{Precision} = \frac{a}{a+b}
A Joint Measure
 F-score

F = \frac{(\beta^2 + 1) \times P \times R}{\beta^2 \times P + R}

 β is a parameter that encodes the relative importance of recall and precision
 β = 1: equal weight
 β < 1: precision is more important
 β > 1: recall is more important
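The contingency table and the F-score formula translate directly into code; a minimal sketch using the counts a, b, d defined above:

```python
def effectiveness(a, b, d, beta=1.0):
    """a: relevant retrieved, b: nonrelevant retrieved,
    d: relevant not retrieved (as in the table above)."""
    p = a / (a + b)                                # precision
    r = a / (a + d)                                # recall
    f = (beta**2 + 1) * p * r / (beta**2 * p + r)  # F-score
    return p, r, f

# 30 relevant retrieved, 20 nonrelevant retrieved, 10 relevant missed.
print(effectiveness(30, 20, 10))  # (0.6, 0.75, 0.666...)
```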
Choices of Recall and Precision
 Both recall and precision vary from 0 to 1.
 Particular choices of indexing and search policies
have produced variations in performance ranging
from 0.8 precision and 0.2 recall to 0.1 precision
and 0.8 recall.
 In many circumstances, recall and precision values between
0.5 and 0.6 are satisfactory for the average user.
Term-Frequency Consideration
 Function words
 for example, "and", "or", "of", "but", …
 the frequencies of these words are high in all texts
 Content words
 words that actually relate to document content
 varying frequencies in the different texts of a collection
 indicate term importance for content
A Frequency-Based Indexing Method
 Eliminate common function words from the document texts by
consulting a special dictionary, or stop list, containing a list of
high-frequency function words.
 Compute the term frequency tfij for all remaining terms Tj
in each document Di, specifying the number of occurrences of Tj in Di.
 Choose a threshold frequency T, and assign to each document Di
all terms Tj for which tfij > T. (A sketch of these steps follows.)
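A sketch of the three steps above (the stoplist and the threshold value are illustrative):

```python
from collections import Counter

STOPLIST = {"and", "or", "of", "but", "the", "a", "to", "in"}  # illustrative

def index_terms(tokens, threshold=1):
    """Drop stoplist words, count term frequencies, and keep the
    terms whose frequency exceeds the threshold T."""
    tf = Counter(t for t in tokens if t not in STOPLIST)
    return {term: f for term, f in tf.items() if f > threshold}

doc = "the cat and the dog and the cat".split()
print(index_terms(doc))  # {'cat': 2}
```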
Inverse Document Frequency
 Inverse Document Frequency (IDF) for term Tj

idf_j = \log \frac{N}{df_j}

where dfj (document frequency of term Tj) is the number of
documents in which Tj occurs, and N is the total number of
documents in the collection.
 Terms that fulfil both the recall and the precision requirements
occur frequently in individual documents but rarely in the
remainder of the collection.
TFxIDF
 Weight wij of a term Tj in a document Di

w_{ij} = tf_{ij} \times \log \frac{N}{df_j}

 Eliminating common function words
 Computing the value of wij for each term Tj in each document Di
 Assigning to the documents of a collection all terms with
sufficiently high (tf × idf) factors
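A minimal TFxIDF sketch over a toy collection (natural log is used here; the base only rescales the weights):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns, per document, a dict of
    weights w_ij = tf_ij * log(N / df_j)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: tf * math.log(n / df[term])
             for term, tf in Counter(doc).items()} for doc in docs]

docs = [["ir", "text", "text"], ["ir", "query"], ["query", "text"]]
for weights in tfidf(docs):
    print(weights)  # terms occurring in many documents get low weights
```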
Term-Discrimination Value
 Useful index terms
 distinguish the documents of a collection from each other
 Document space
 Two documents are assigned very similar term sets when the
corresponding points in the document space appear close together.
 When a high-frequency term without discrimination power is
assigned, it increases the document space density.
A Virtual Document Space

[Figure: three snapshots of the document space: the original state,
after assignment of a good discriminator, and after assignment of a
poor discriminator]
Good Term Assignment
 When a term is assigned to the documents of a collection, the few
objects to which the term is assigned will be distinguished from the
rest of the collection.
 This should increase the average distance between the objects in
the collection and hence produce a document space less dense than
before.
Poor Term Assignment
 A high-frequency term is assigned that does not discriminate
between the objects of a collection. Its assignment will render the
documents more similar.
 This is reflected in an increase in document space density.
Term Discrimination Value
 Definition

dv_j = Q - Q_j

where Q and Qj are the space densities before and after the
assignment of term Tj, with

Q = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{\substack{k=1 \\ k \neq i}}^{N} sim(D_i, D_k)

 dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term.
Variations of Term-Discrimination Value with Document Frequency

[Figure: document frequency ranging from 0 to N]

 Low frequency: dvj = 0
 Medium frequency: dvj > 0
 High frequency: dvj < 0
TFij × dvj
 wij = tfij × dvj
 compared with w_{ij} = tf_{ij} \times \log \frac{N}{df_j}
 \log \frac{N}{df_j} decreases steadily with increasing document
frequency
 dvj increases from zero to a positive value as the document
frequency of the term increases, then decreases sharply as the
document frequency becomes still larger.
Document Centroid
 Issue: efficiency problem
 computing Q requires N(N-1) pairwise similarities
 Document centroid C = (c1, c2, c3, ..., ct)

c_j = \sum_{i=1}^{N} w_{ij}

where wij is the weight of term Tj in document Di.
 Space density (a sketch follows)

Q = \frac{1}{N} \sum_{i=1}^{N} sim(C, D_i)
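A sketch of the centroid shortcut (cosine similarity is assumed for sim; document vectors are dicts of term weights), reducing the N(N-1) pairwise similarities to N similarities against the centroid:

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """c_j = sum over all documents of w_ij."""
    c = {}
    for vec in vectors:
        for term, w in vec.items():
            c[term] = c.get(term, 0.0) + w
    return c

def space_density(vectors):
    """Q = (1/N) * sum_i sim(C, D_i)."""
    c = centroid(vectors)
    return sum(cosine(c, v) for v in vectors) / len(vectors)

# dv_j can then be estimated as the density without term T_j minus
# the density with T_j assigned.
```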
Probabilistic Term Weighting
 Goal
 Explicit distinctions between occurrences of terms in relevant and
nonrelevant documents of a collection
 Definition
 Given a user query q, and the ideal answer set of the relevant
documents
 From decision theory, the best ranking algorithm for a document D:

g(D) = \log \frac{Pr(D \mid rel)}{Pr(D \mid nonrel)} + \log \frac{Pr(rel)}{Pr(nonrel)}
Probabilistic Term Weighting
 Pr(rel), Pr(nonrel):
 the document’s a priori probabilities of relevance and nonrelevance
 Pr(D|rel), Pr(D|nonrel):
 occurrence probabilities of document D in the relevant and
nonrelevant document sets
Assumptions
 Terms occur independently in documents:

Pr(D \mid rel) = \prod_{i=1}^{t} Pr(x_i \mid rel)

Pr(D \mid nonrel) = \prod_{i=1}^{t} Pr(x_i \mid nonrel)
Derivation Process

g(D) = \log \frac{Pr(D \mid rel)}{Pr(D \mid nonrel)} + \log \frac{Pr(rel)}{Pr(nonrel)}

     = \log \frac{\prod_{i=1}^{t} Pr(x_i \mid rel)}{\prod_{i=1}^{t} Pr(x_i \mid nonrel)} + \text{constants}

     = \sum_{i=1}^{t} \log \frac{Pr(x_i \mid rel)}{Pr(x_i \mid nonrel)} + \text{constants}
For a Specific Document D
 Given a document D = (d1, d2, …, dt):

g(D) = \sum_{i=1}^{t} \log \frac{Pr(x_i = d_i \mid rel)}{Pr(x_i = d_i \mid nonrel)} + \text{constants}

 Assume di is either 0 (absent) or 1 (present), and let
Pr(xi = 1 | rel) = pi, Pr(xi = 0 | rel) = 1 - pi,
Pr(xi = 1 | nonrel) = qi, Pr(xi = 0 | nonrel) = 1 - qi, so that

Pr(x_i = d_i \mid rel) = p_i^{d_i} (1 - p_i)^{1 - d_i}
Pr(x_i = d_i \mid nonrel) = q_i^{d_i} (1 - q_i)^{1 - d_i}

 Substituting,

g(D) = \sum_{i=1}^{t} \log \frac{p_i^{d_i} (1 - p_i)^{1 - d_i}}{q_i^{d_i} (1 - q_i)^{1 - d_i}} + \text{constants}

     = \sum_{i=1}^{t} \log \left[ \left( \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \right)^{d_i} \frac{1 - p_i}{1 - q_i} \right] + \text{constants}
Term Relevance Weight

g(D) = \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i} + \sum_{i=1}^{t} d_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \text{constants}

 The first sum does not depend on D and is absorbed into the
constants; the term relevance weight is

tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}
Issue
 How to compute pj and qj?

pj = rj / R
qj = (dfj - rj) / (N - R)

 rj: the number of relevant documents that contain term Tj
 R: the total number of relevant documents
 N: the total number of documents
(a sketch using these estimates follows)
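A sketch computing tr_j from these estimates; the +0.5/+1 smoothing below is an added assumption (a common adjustment) to keep the logarithm finite when r_j = 0 or r_j = R:

```python
import math

def term_relevance(r_j, df_j, R, N):
    """tr_j = log(p_j (1 - q_j) / (q_j (1 - p_j))) with
    p_j = r_j / R and q_j = (df_j - r_j) / (N - R);
    +0.5/+1 smoothing (an assumption) keeps p and q away from 0 and 1."""
    p = (r_j + 0.5) / (R + 1.0)
    q = (df_j - r_j + 0.5) / (N - R + 1.0)
    return math.log(p * (1 - q) / (q * (1 - p)))

# 100 documents, 10 of them relevant; the term occurs in 12
# documents, 8 of which are relevant.
print(term_relevance(r_j=8, df_j=12, R=10, N=100))  # ≈ 4.2, a good term
```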
Estimation of Term-Relevance
 The occurrence probability of a term in the nonrelevant documents,
qj, is approximated by the occurrence probability of the term in the
entire document collection:
qj = dfj / N
 The occurrence probabilities of the terms in the small number of
relevant documents are assumed equal, using a constant value
pj = 0.5 for all j.
Comparison

tr_j = \log \frac{p_j (1 - q_j)}{q_j (1 - p_j)}
     = \log \frac{0.5 \times \left(1 - \frac{df_j}{N}\right)}{\frac{df_j}{N} \times 0.5}
     = \log \frac{N - df_j}{df_j}

 When N is sufficiently large, N - dfj ≈ N, and

tr_j = \log \frac{N - df_j}{df_j} \approx \log \frac{N}{df_j} = idf_j
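A quick numeric check of the approximation (collection size and document frequency are illustrative):

```python
import math

N, df = 1_000_000, 100        # illustrative values
tr = math.log((N - df) / df)  # log((N - df_j) / df_j)
idf = math.log(N / df)        # log(N / df_j)
print(tr, idf)                # 9.21024... vs 9.21034...: nearly identical
```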
Estimation of Term-Relevance
 Alternatively, estimate the number of relevant documents rj in the
collection that contain term Tj as a function of the known document
frequency dfj of the term Tj, and use
pj = rj / R
qj = (dfj - rj) / (N - R)
R: an estimate of the total number of relevant documents in the
collection.
Summary
 Inverse document frequency, idfj
 tfij × idfj (TFxIDF)
 Term discrimination value, dvj
 tfij × dvj
 Probabilistic term weighting, trj
 tfij × trj
 All are based on global properties of terms in a document collection