This document covers term weighting for text retrieval systems: term frequency, document frequency, and inverse document frequency, and how they combine into TF-IDF weights for terms. It also covers stoplists, stemming, and the bag-of-words model, which represents text as vectors of word occurrences without regard to word order. Term weighting schemes play a major role in the similarity measures that information retrieval systems use to rank document relevance.
Presented by Ms. T. Primya, Assistant Professor in the Department of Computer Science and Engineering at Dr. N. G. P. Institute of Technology, Coimbatore.
Term weighting is crucial to the retrieval performance of text retrieval systems. It combines a global weight and a local weight for each term in each document, for use in similarity measures.
Details on term frequency (tf), document frequency (df), and inverse document frequency (idf) used for determining TF-IDF scores to rank document relevance.
Example of calculating tf-idf for terms A, B, C with specific frequencies, showing tf-idf scores: A=7.6, B=2.0, C=1.8.
Discusses stoplists for ignoring non-informative words, stemming to treat word variants as the same term, and the bag-of-words model for feature extraction.
Steps in data collection and vocabulary design, showcasing a sample corpus with unique words identified and processed into a vocabulary.
Details on scoring words in documents and converting them into binary vectors for presence in the vocabulary.
Ms. T. Primya
Assistant Professor
Department of Computer Science and Engineering
Dr. N. G. P. Institute of Technology
Coimbatore
2.
Term weighting is an important aspect of modern text retrieval systems.
Terms are words, phrases, or any other indexing units used to identify the contents of a text.
Since different terms have different importance in a text, an important indicator, the term weight, is associated with every term.
3.
The retrieval performance of information retrieval systems depends largely on similarity measures.
The term weighting scheme plays a major role in the similarity measure:
aij = gi * tij * dj
where
gi is the global weight of the ith term,
tij is the local weight of the ith term in the jth document, and
dj is the normalization factor for the jth document.
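As a minimal sketch of how the three factors combine (the function names are illustrative, and the cosine normalization shown is just one common choice for dj):

    import math

    def document_norm(weights):
        # dj = 1 / Euclidean length of the jth document's weight vector
        # (cosine normalization; other normalization factors are possible).
        return 1.0 / math.sqrt(sum(w * w for w in weights))

    def term_weight(g_i, t_ij, d_j):
        # aij = gi * tij * dj
        return g_i * t_ij * d_j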
4.
Term weight is used to rank relevance.
Parameters in calculating a weight for a document term or query term:
Term Frequency (tf):
The raw frequency fij is the number of times term i appears in document j.
tf is normalized by dividing by the frequency of the most frequent term in the document:
tfij = fij / maxi{fij}
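A small Python sketch of this normalized term frequency (the function name is illustrative):

    from collections import Counter

    def term_frequencies(doc_tokens):
        # tfij = fij / maxi{fij}: raw counts scaled by the most frequent term.
        counts = Counter(doc_tokens)
        max_count = max(counts.values())
        return {term: f / max_count for term, f in counts.items()}

    # "to" and "be" occur twice (tf = 1.0); "or" and "not" once (tf = 0.5).
    print(term_frequencies("to be or not to be".split()))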
5.
Document Frequency (df): the number of documents that term i appears in (dfi).
Inverse Document Frequency (idf): a measure of how discriminating term i is in the collection:
idfi = log2(n / dfi)
where n is the number of documents in the collection and dfi is the number of documents in which term i appears. (The base of the logarithm is a free choice; base 2 matches the worked example on the next slide.)
TF-IDF => wij = tfij * idfi = tfij * log2(n / dfi)
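In Python, using the base-2 logarithm from the worked example that follows (function names illustrative):

    import math

    def idf(n_docs, df_i):
        # idfi = log2(n / dfi)
        return math.log2(n_docs / df_i)

    def tf_idf(tf_ij, n_docs, df_i):
        # wij = tfij * idfi
        return tf_ij * idf(n_docs, df_i)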
6.
Given a document containing terms with frequencies A(3), B(2), C(1).
Assume the collection contains 10,000 documents and the document frequencies of these terms are A(50), B(1300), C(250).
A: tf = 3/3, idf = log2(10000/50) = 7.6, tf-idf = 1 * 7.6 = 7.6
B: tf = 2/3, idf = log2(10000/1300) = 2.9, tf-idf = 2.0
C: tf = 1/3, idf = log2(10000/250) = 5.3, tf-idf = 1.8
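The numbers above can be reproduced with a few lines of Python (note the base-2 logarithm and the scaling of tf by the maximum frequency, 3):

    import math

    n = 10_000
    freqs = {"A": 3, "B": 2, "C": 1}      # term frequencies in the document
    dfs = {"A": 50, "B": 1300, "C": 250}  # document frequencies in the collection
    max_f = max(freqs.values())

    for term in freqs:
        tf = freqs[term] / max_f
        idf = math.log2(n / dfs[term])
        print(f"{term}: tf={tf:.2f} idf={idf:.1f} tf-idf={tf * idf:.1f}")
    # A: tf=1.00 idf=7.6 tf-idf=7.6
    # B: tf=0.67 idf=2.9 tf-idf=2.0
    # C: tf=0.33 idf=5.3 tf-idf=1.8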
7.
Stoplists:
A list of words to be ignored when processing documents, since they give no useful information about content.
Stemming (conflation):
The process of treating a set of word variants as instances of the same term.
Ex: fights, fighting, fighter, ...
The stem is fight.
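A toy sketch of both ideas (the stoplist is illustrative and the suffix stripping is deliberately naive; a real system would use a proper stemmer such as Porter's):

    STOPLIST = {"the", "a", "an", "of", "and", "is", "in", "it", "was"}
    SUFFIXES = ("ers", "er", "ing", "s")  # crude, for illustration only

    def normalize(tokens):
        # Drop stopwords, then conflate word variants by stripping suffixes.
        kept = [t.lower() for t in tokens if t.lower() not in STOPLIST]
        stems = []
        for t in kept:
            for suffix in SUFFIXES:
                if t.endswith(suffix) and len(t) > len(suffix) + 2:
                    t = t[: -len(suffix)]
                    break
            stems.append(t)
        return stems

    print(normalize("fights fighting fighter".split()))
    # ['fight', 'fight', 'fight']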
8.
A bag-of-words model (BoW) is a way of extracting
features from text for use in modeling, such as with
machine learning algorithms.
A bag-of-words is a representation of text that describes the
occurrence of words within a document. It involves two
things:
A vocabulary of known words.
A measure of the presence of known words.
It is called a “bag” of words, because any information about
the order or structure of words in the document is discarded.
The model is only concerned with whether known words
occur in the document, not where in the document.
9.
Step 1: Collect Data
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
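Treating each line as its own "document", the corpus can be written as a small Python list (reused by the sketches on the next slides):

    corpus = [
        "It was the best of times,",
        "it was the worst of times,",
        "it was the age of wisdom,",
        "it was the age of foolishness,",
    ]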
10.
Step 2: Design the Vocabulary
The unique words here (ignoring case and punctuation) are:
“it”
“was”
“the”
“best”
“of”
“times”
“worst”
“age”
“wisdom”
“foolishness”
That is a vocabulary of 10 words from a corpus containing 24
words.
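A sketch of this step, using the corpus list defined above (ignoring case and punctuation, and keeping first-seen order):

    import string

    def build_vocabulary(corpus):
        vocab = []
        for doc in corpus:
            cleaned = doc.lower().translate(str.maketrans("", "", string.punctuation))
            for word in cleaned.split():
                if word not in vocab:
                    vocab.append(word)  # a list preserves first-seen order
        return vocab

    vocab = build_vocabulary(corpus)
    print(vocab)       # ['it', 'was', 'the', 'best', 'of', 'times', 'worst',
                       #  'age', 'wisdom', 'foolishness']
    print(len(vocab))  # 10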
11.
Step 3: Create Document Vectors
The next step is to score the words in each document.
For the first document, "It was the best of times", the scoring would look as follows:
“it” = 1
“was” = 1
“the” = 1
“best” = 1
“of” = 1
“times” = 1
“worst” = 0
“age” = 0
“wisdom” = 0
“foolishness” = 0
12.
The binary vectors are:
"it was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]