This document covers term weighting for text retrieval systems: term frequency, document frequency, and inverse document frequency, and how they combine into TF-IDF weights for terms. It also covers stoplists, stemming, and the bag-of-words model, which represents text as vectors of word occurrences without regard to word order. Term weighting schemes play a major role in the similarity measures that information retrieval systems use to rank document relevance.
Presented by Ms. T. Primya, Assistant Professor in the Department of Computer Science and Engineering at Dr. N. G. P. Institute of Technology, Coimbatore.
Term weighting is crucial to the retrieval performance of text retrieval systems. It combines a global weight and a local weight for each term in each document, for use in similarity measures.
Details on term frequency (tf), document frequency (df), and inverse document frequency (idf) used for determining TF-IDF scores to rank document relevance.
Example of calculating tf-idf for terms A, B, C with specific frequencies, showing tf-idf scores: A=7.6, B=2.0, C=1.8.
Discusses stoplists for ignoring non-informative words, stemming to treat word variants as the same term, and the bag-of-words model for feature extraction.
Steps in data collection and vocabulary design, showcasing a sample corpus with unique words identified and processed into a vocabulary.
Details on scoring words in documents and converting them into binary vectors for presence in the vocabulary.
Ms. T. Primya
Assistant Professor
Department of Computer Science and Engineering
Dr. N. G. P. Institute of Technology
Coimbatore
2.
Term weighting is an important aspect of modern text retrieval systems.
Terms are words, phrases, or any other indexing units used to identify the contents of a text.
Since different terms have different importance in a text, an important indicator, the term weight, is associated with every term.
3.
The retrieval performance of information retrieval systems depends largely on similarity measures.
The term weighting scheme plays a major role in the similarity measure:
aij = gi * tij * dj
where
gi is the global weight of the ith term,
tij is the local weight of the ith term in the jth document, and
dj is the normalization factor for the jth document.
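As a minimal sketch of how the three factors combine (the function names are illustrative, and the cosine normalization shown is just one common choice for dj):

    import math

    def document_norm(weights):
        # dj = 1 / Euclidean length of the jth document's weight vector
        # (cosine normalization; other normalization factors are possible).
        return 1.0 / math.sqrt(sum(w * w for w in weights))

    def term_weight(g_i, t_ij, d_j):
        # aij = gi * tij * dj
        return g_i * t_ij * d_j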
4.
Term weight is used to rank relevance.
Parameters in calculating a weight for a document term or query term:
Term Frequency (tf):
The raw frequency fij is the number of times term i appears in document j.
tf is normalized by dividing by the frequency of the most frequent term in the document:
tfij = fij / maxi{fij}
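A small Python sketch of this normalized term frequency (the function name is illustrative):

    from collections import Counter

    def term_frequencies(doc_tokens):
        # tfij = fij / maxi{fij}: raw counts scaled by the most frequent term.
        counts = Counter(doc_tokens)
        max_count = max(counts.values())
        return {term: f / max_count for term, f in counts.items()}

    # "to" and "be" occur twice (tf = 1.0); "or" and "not" once (tf = 0.5).
    print(term_frequencies("to be or not to be".split()))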
5.
Document Frequency (df): the number of documents that term i appears in (dfi).
Inverse Document Frequency (idf): a measure of how discriminating term i is in the collection:
idfi = log2(n / dfi)
where n is the number of documents in the collection and dfi is the number of documents in which term i appears. (The base of the logarithm is a free choice; base 2 matches the worked example on the next slide.)
TF-IDF => wij = tfij * idfi = tfij * log2(n / dfi)
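In Python, using the base-2 logarithm from the worked example that follows (function names illustrative):

    import math

    def idf(n_docs, df_i):
        # idfi = log2(n / dfi)
        return math.log2(n_docs / df_i)

    def tf_idf(tf_ij, n_docs, df_i):
        # wij = tfij * idfi
        return tf_ij * idf(n_docs, df_i)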
6.
Given a document containing terms with frequencies A(3), B(2), C(1).
Assume the collection contains 10,000 documents and the document frequencies of these terms are A(50), B(1300), C(250).
A: tf = 3/3, idf = log2(10000/50) = 7.6, tf-idf = 1 * 7.6 = 7.6
B: tf = 2/3, idf = log2(10000/1300) = 2.9, tf-idf = 2.0
C: tf = 1/3, idf = log2(10000/250) = 5.3, tf-idf = 1.8
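The numbers above can be reproduced with a few lines of Python (note the base-2 logarithm and the scaling of tf by the maximum frequency, 3):

    import math

    n = 10_000
    freqs = {"A": 3, "B": 2, "C": 1}      # term frequencies in the document
    dfs = {"A": 50, "B": 1300, "C": 250}  # document frequencies in the collection
    max_f = max(freqs.values())

    for term in freqs:
        tf = freqs[term] / max_f
        idf = math.log2(n / dfs[term])
        print(f"{term}: tf={tf:.2f} idf={idf:.1f} tf-idf={tf * idf:.1f}")
    # A: tf=1.00 idf=7.6 tf-idf=7.6
    # B: tf=0.67 idf=2.9 tf-idf=2.0
    # C: tf=0.33 idf=5.3 tf-idf=1.8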
7.
Stoplists:
A list of words to be ignored when processing documents, since they give no useful information about content.
Stemming (conflation):
The process of treating a set of word variants as instances of the same term.
Ex: fights, fighting, fighter, ...
The stem is fight.
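A toy sketch of both ideas (the stoplist is illustrative and the suffix stripping is deliberately naive; a real system would use a proper stemmer such as Porter's):

    STOPLIST = {"the", "a", "an", "of", "and", "is", "in", "it", "was"}
    SUFFIXES = ("ers", "er", "ing", "s")  # crude, for illustration only

    def normalize(tokens):
        # Drop stopwords, then conflate word variants by stripping suffixes.
        kept = [t.lower() for t in tokens if t.lower() not in STOPLIST]
        stems = []
        for t in kept:
            for suffix in SUFFIXES:
                if t.endswith(suffix) and len(t) > len(suffix) + 2:
                    t = t[: -len(suffix)]
                    break
            stems.append(t)
        return stems

    print(normalize("fights fighting fighter".split()))
    # ['fight', 'fight', 'fight']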
8.
A bag-of-words model (BoW) is a way of extracting
features from text for use in modeling, such as with
machine learning algorithms.
A bag-of-words is a representation of text that describes the
occurrence of words within a document. It involves two
things:
A vocabulary of known words.
A measure of the presence of known words.
It is called a “bag” of words, because any information about
the order or structure of words in the document is discarded.
The model is only concerned with whether known words
occur in the document, not where in the document.
9.
Step 1: Collect Data
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
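Treating each line as its own "document", the corpus can be written as a small Python list (reused by the sketches on the next slides):

    corpus = [
        "It was the best of times,",
        "it was the worst of times,",
        "it was the age of wisdom,",
        "it was the age of foolishness,",
    ]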
10.
Step 2: Design the Vocabulary
The unique words here (ignoring case and punctuation) are:
“it”
“was”
“the”
“best”
“of”
“times”
“worst”
“age”
“wisdom”
“foolishness”
That is a vocabulary of 10 words from a corpus containing 24
words.
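A sketch of this step, using the corpus list defined above (ignoring case and punctuation, and keeping first-seen order):

    import string

    def build_vocabulary(corpus):
        vocab = []
        for doc in corpus:
            cleaned = doc.lower().translate(str.maketrans("", "", string.punctuation))
            for word in cleaned.split():
                if word not in vocab:
                    vocab.append(word)  # a list preserves first-seen order
        return vocab

    vocab = build_vocabulary(corpus)
    print(vocab)       # ['it', 'was', 'the', 'best', 'of', 'times', 'worst',
                       #  'age', 'wisdom', 'foolishness']
    print(len(vocab))  # 10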
11.
Step 3: Create Document Vectors
The next step is to score the words in each document.
For the first document, "It was the best of times", the scoring would look as follows:
“it” = 1
“was” = 1
“the” = 1
“best” = 1
“of” = 1
“times” = 1
“worst” = 0
“age” = 0
“wisdom” = 0
“foolishness” = 0
12.
The binary vectors are:
"it was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]