Term Weighting

Ms. T. Primya
Assistant Professor
Department of Computer Science and Engineering
Dr. N. G. P. Institute of Technology
Coimbatore
• Term weighting is an important aspect of modern text retrieval systems.
• Terms are words, phrases, or any other indexing units used to identify the contents of a text.
• Since different terms have different importance in a text, an important indicator, the term weight, is associated with every term.
• The retrieval performance of information retrieval systems is largely dependent on similarity measures.
• The term weighting scheme plays a major role in the similarity measure:
a_ij = g_i * t_ij * d_j
where
g_i is the global weight of the ith term,
t_ij is the local weight of the ith term in the jth document, and
d_j is the normalization factor for the jth document.
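As a rough sketch of this decomposition (plain Python; the choice of idf as the global weight g_i, raw term frequency as the local weight t_ij, and cosine length normalization as d_j is an assumption for illustration, and all names are hypothetical):

```python
import math

def weight_document(doc_counts, global_weights):
    """a_ij = g_i * t_ij * d_j for one document j.

    doc_counts     -- dict: term -> raw frequency in the document (local weight t_ij)
    global_weights -- dict: term -> collection-level weight g_i (e.g. idf)
    d_j is taken here as 1 / Euclidean length, so the weighted vector
    is scaled to unit length (cosine normalization).
    """
    weighted = {t: global_weights.get(t, 0.0) * f for t, f in doc_counts.items()}
    length = math.sqrt(sum(w * w for w in weighted.values())) or 1.0
    return {t: w / length for t, w in weighted.items()}

# Hypothetical terms and weights, purely for illustration
print(weight_document({"retrieval": 3, "the": 10}, {"retrieval": 2.3, "the": 0.05}))
```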
• Term weight is used to rank relevance.
• Parameters in calculating a weight for a document term or query term:
• Term Frequency (tf):
Term frequency tf_ij is the number of times a term i appears in document j.
tf is normalized by dividing by the frequency of the most frequent term in the document:
tf_ij = f_ij / max_i{f_ij}
where f_ij = frequency of term i in document j.
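A minimal sketch of this maximum-normalized term frequency, assuming plain Python with whitespace tokenization (function and variable names are illustrative):

```python
def term_frequencies(tokens):
    """tf_ij = f_ij / max_i{f_ij}: raw counts scaled by the count of the
    most frequent term in the document."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    max_count = max(counts.values())
    return {t: c / max_count for t, c in counts.items()}

print(term_frequencies("it was the best of times it was".split()))
# 'it' and 'was' occur twice -> tf = 1.0; every other term -> tf = 0.5
```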
• Document Frequency (df): the number of documents a term i appears in (df_i).
• Inverse Document Frequency (idf): a discriminating measure for term i in the collection, i.e., how discriminating term i is:
idf_i = log10(n / df_i)
where n is the number of documents in the collection and
df_i = the number of documents in which term i appears.
• TF-IDF => w_ij = tf_ij * idf_i = tf_ij * log10(n / df_i)
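A small sketch of these two formulas in plain Python (names are illustrative; base-10 logarithm as in the definition above):

```python
import math

def idf(n_docs, df):
    """idf_i = log10(n / df_i)"""
    return math.log10(n_docs / df)

def tf_idf(tf, n_docs, df):
    """w_ij = tf_ij * idf_i"""
    return tf * idf(n_docs, df)

# A term appearing in 50 of 10,000 documents (hypothetical numbers)
print(idf(10000, 50))          # log10(200) ~= 2.30
print(tf_idf(0.5, 10000, 50))  # ~= 1.15
```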
• Given a document containing terms with the frequencies A(3), B(2), C(1).
• Assume the collection contains 10,000 documents and the document frequencies of these terms are A(50), B(1300), C(250).
• Using a base-2 logarithm for idf (which is what reproduces the values below):
A: tf = 3/3 = 1, idf = log2(10000/50) = 7.6, tf-idf = 1 * 7.6 = 7.6
B: tf = 2/3, idf = log2(10000/1300) = 2.9, tf-idf = 2.0
C: tf = 1/3, idf = log2(10000/250) = 5.3, tf-idf = 1.8
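A quick check of these numbers in plain Python (base-2 logarithm as noted above; variable names are illustrative):

```python
import math

n = 10000                                   # documents in the collection
freqs = {"A": 3, "B": 2, "C": 1}            # raw term frequencies in the document
dfs = {"A": 50, "B": 1300, "C": 250}        # document frequencies in the collection
max_f = max(freqs.values())

for term in ("A", "B", "C"):
    tf = freqs[term] / max_f
    idf = math.log2(n / dfs[term])          # base-2 log reproduces the idf values above
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))
# A 1.0 7.6 7.6
# B 0.67 2.9 2.0
# C 0.33 5.3 1.8
```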
Stoplists:
• A list of words to be ignored when processing documents, since they give no useful information about content.
Stemming (conflation):
• The process of treating a set of words as all instances of the same term.
• Example: fights, fighting, fighter, ... all reduce to the stem "fight".
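A toy sketch of both steps in plain Python; the stoplist and the suffix rules are illustrative stand-ins, not a real stemming algorithm such as Porter's:

```python
STOPLIST = {"the", "of", "it", "was", "a", "an", "and", "in", "to"}  # illustrative stoplist
SUFFIXES = ("ings", "ing", "ers", "er", "s")                         # crude suffix-stripping rules

def stem(word):
    """Toy conflation: strip the first matching suffix (not a real stemmer)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stopwords, and conflate the remaining words to stems."""
    return [stem(t) for t in text.lower().split() if t not in STOPLIST]

print(preprocess("The fighter was fighting in the fights"))
# ['fight', 'fight', 'fight'] -- all three surface forms conflate to 'fight'
```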
• A bag-of-words (BoW) model is a way of extracting features from text for use in modeling, such as with machine learning algorithms.
• A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
  - A vocabulary of known words.
  - A measure of the presence of known words.
• It is called a “bag” of words because any information about the order or structure of words in the document is discarded.
• The model is only concerned with whether known words occur in the document, not where in the document.
Step 1: Collect Data
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
Step 2: Design the Vocabulary
• The unique words here (ignoring case and punctuation) are:
“it”
“was”
“the”
“best”
“of”
“times”
“worst”
“age”
“wisdom”
“foolishness”
• That is a vocabulary of 10 words from a corpus containing 24 words.
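Assuming plain Python with lowercasing and punctuation stripping, Step 2 can be sketched as follows (names are illustrative):

```python
import re

docs = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

def tokenize(text):
    """Lowercase and keep only alphabetic word characters (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in first-seen order, matching the list above
vocabulary = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

print(vocabulary)                            # the 10 unique words listed above
print(len(vocabulary))                       # 10
print(sum(len(tokenize(d)) for d in docs))   # 24 words in the corpus
```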
Step 3: Create Document Vectors
• The next step is to score the words in each document.
• The scoring of the first document, “it was the best of times”, would look as follows:
“it” = 1
“was” = 1
“the” = 1
“best” = 1
“of” = 1
“times” = 1
“worst” = 0
“age” = 0
“wisdom” = 0
“foolishness” = 0
The binary vectors are:
• “it was the best of times” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
• “it was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
• “it was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
• “it was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

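A self-contained sketch of Step 3 in plain Python (illustrative names), reproducing the binary vectors above:

```python
import re

docs = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

vocabulary = ["it", "was", "the", "best", "of",
              "times", "worst", "age", "wisdom", "foolishness"]

def binary_vector(text, vocabulary):
    """Score 1 if the vocabulary word occurs in the document, else 0."""
    present = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if word in present else 0 for word in vocabulary]

for doc in docs:
    print(binary_vector(doc, vocabulary))
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
# [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
# [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
```

For comparison, scikit-learn's CountVectorizer with binary=True builds the same kind of presence/absence vectors, though it orders the vocabulary alphabetically rather than in first-seen order.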
