Chapter 3
12/10/2024
Indexing structure
Outline
◼ Major Steps in Index Construction
◼ Index file Evaluation Metrics
◼ Building Index file
◼ Sequential File
◼ Inverted file
◼ Suffix tree
◼ Suffix Trie
◼ Suffix Tree Applications
Indexing: Basic Concepts
◼ Indexing is an arrangement of index terms that permits fast searching and reduces memory space requirements.
◼ It is used to speed up access to the desired information in a document collection according to the user's query, such that:
  ◼ It enhances efficiency in terms of retrieval time.
  ◼ Relevant documents are searched and retrieved quickly.
◼ An index file usually keeps its index terms in sorted order.
◼ Which list is easier to search?
fox pig zebra hen ant cat dog lion ox
ant cat dog fox hen lion ox pig zebra
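The difference between the two lists can be made concrete with a small sketch. It is illustrative only: the function names `linear_search` and `binary_search` are not from the slides. The unsorted list has to be scanned term by term, whereas the sorted list can be searched by repeated halving.

```python
from bisect import bisect_left

unsorted_terms = ["fox", "pig", "zebra", "hen", "ant", "cat", "dog", "lion", "ox"]
sorted_terms = sorted(unsorted_terms)

def linear_search(terms, key):
    """Scan the terms one by one; return (found, number of comparisons)."""
    for count, term in enumerate(terms, start=1):
        if term == key:
            return True, count
    return False, len(terms)

def binary_search(terms, key):
    """Binary search on a sorted list; return True if key is present."""
    i = bisect_left(terms, key)
    return i < len(terms) and terms[i] == key

print(linear_search(unsorted_terms, "ox"))  # "ox" is the last term scanned
print(binary_search(sorted_terms, "lion"))
```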
◼ An index file consists of records, called index entries.
◼ Index files are much smaller than the original file.
◼ Remember Heaps' Law: in a 1 GB text collection the vocabulary has a size of only about 5 MB. This size may be further reduced by linguistic pre-processing (text operations).
◼ The usual unit for indexing is the word.
◼ Index terms are used to look up records in a file.
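For reference, Heaps' Law is usually stated as below. The formula is the standard textbook form and is not given on the slide; V is the vocabulary size, n the size of the text in words, and K and β are collection-dependent constants.

```latex
V = K \cdot n^{\beta}, \qquad 0 < \beta < 1
```

Because β is less than 1, the vocabulary grows sublinearly with the text, which is why the vocabulary of a large collection stays comparatively small.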
Major Steps in Index Construction
◼ Source file: a collection of text documents.
  ◼ A document can be described by a set of representative keywords called index terms.
◼ Index term selection: apply text operations (preprocessing).
  ◼ Tokenize: identify the words in a document, so that each document is represented by a list of keywords or attributes.
  ◼ Stop-word removal: very frequent words carry little content and need to be removed from the text collection.
Major Steps in Index Construction …
  ◼ Stemming: reduce the morphological variants of a word to their common stem/root.
◼ Term relevance weight: different index terms have varying relevance when used to describe document contents. This effect is captured by assigning a numerical weight to each index term of a document. There are different term weighting methods, including TF, IDF and TF*IDF.
◼ Indexing structure: the set of index terms (the vocabulary) is organized in an index file so that the documents in which each term occurs can be identified easily.
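A minimal sketch of TF*IDF weighting follows. The slides do not fix a formula, so the variant used here (TF normalized by the largest TF in the document, IDF = log2(N/DF)) is an assumption, as is the helper name `tf_idf`. Every document is assumed to contain at least one term.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict doc_id -> non-empty list of index terms.
    Returns dict doc_id -> {term: weight}."""
    n_docs = len(docs)
    # DF: number of documents in which each term occurs.
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        max_tf = max(tf.values())
        # Assumed variant: normalized TF times log2(N / DF).
        weights[doc_id] = {
            term: (freq / max_tf) * math.log2(n_docs / df[term])
            for term, freq in tf.items()
        }
    return weights

example = {1: ["easy", "make", "easy"], 2: ["make", "task"]}
print(tf_idf(example))
```

A term that occurs in every document (here "make") gets weight 0 under this variant, which matches the intuition that it does not discriminate between documents.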
Basic Indexing Process
Documents to be indexed: Friends, Romans, countrymen.
1. Tokenizer produces the token stream: Friends Romans countrymen
2. Linguistic preprocessor produces the modified tokens: friend roman countryman
3. Indexer produces the index file (inverted file), mapping each term to the documents that contain it:
   friend       2, 4
   roman        1, 2
   countryman   13, 16
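The three stages of the process can be sketched as below. This is a toy illustration: the stop-word list and the `normalize` rule are assumptions chosen only to reproduce the "Friends, Romans, countrymen" example, not a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and"}  # illustrative only

def tokenize(text):
    """Tokenizer: split the text into a stream of word tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Toy linguistic preprocessing (assumption): lowercase, map a
    trailing 'men' to 'man', and strip a plural 's'."""
    word = token.lower()
    if word.endswith("men"):
        return word[:-3] + "man"
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(text):
    """Drop stop words, then normalize the remaining tokens."""
    return [normalize(t) for t in tokenize(text) if t.lower() not in STOP_WORDS]

def index(docs):
    """Indexer: map each term to the list of documents containing it."""
    inverted = {}
    for doc_id, text in docs.items():
        for term in preprocess(text):
            postings = inverted.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(sorted(inverted.items()))

print(preprocess("Friends, Romans, countrymen."))
```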
Index file Evaluation Metrics
◼ Running time of the main operations
  ◼ Access/search time
    ◼ How long does it take to find the required search key in the list?
  ◼ Update time (insertion time, deletion time)
    ◼ How long does it take to update existing records when adding new terms or deleting unnecessary ones?
    ◼ Does the indexing structure allow incremental update, or does it require re-indexing?
◼ Space overhead
  ◼ Computer storage space consumed for keeping the list.
Building Index file
◼ An index file for a document collection is a file consisting of a list of index terms, each with a link to the one or more documents that contain the term.
◼ An index file is a list of search terms organized for associative look-up, i.e., to answer the user's queries:
  ◼ In which documents does a specified search term appear?
  ◼ Where within each document does each term appear? (There may be several occurrences.)
◼ There are various options for organizing the index file of a document collection:
  ◼ Decide which data structure and/or file structure to use: sequential file, inverted file, suffix tree, etc.
Sequential File
◼ The sequential file is the most primitive file structure.
◼ It has no vocabulary and no linking pointers.
◼ The records are generally arranged serially, one after another, in lexicographic order on the value of some key field, i.e.:
  ◼ a particular attribute is chosen as the primary key, whose value determines the order of the records;
  ◼ when the primary key fails to discriminate among records, a second key is chosen to order them.
Example:
◼ Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.
Doc 1: Negative affect can make it harder to do even easy tasks. so make it easy
Doc 2: positive affect can make it easier to do difficult tasks
◼ After all documents have been tokenized, stop words are removed and normalization and stemming are applied to generate the index terms.
◼ The index terms in a sequential file are sorted in alphabetical order.
[Figure: sorting the vocabulary to produce the sequential file]
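A sequential file for the two example documents might be sketched as follows. The term lists in `doc_terms` are an assumed result of the text operations (stop-word removal and stemming) and are hard-coded here; the term is the primary key and the document ID the secondary key.

```python
# Assumed output of text operations on Doc 1 and Doc 2.
doc_terms = {
    1: ["negative", "affect", "make", "hard", "do", "easy", "task", "make", "easy"],
    2: ["positive", "affect", "make", "easy", "do", "difficult", "task"],
}

# One record per term occurrence, sorted on (term, doc_id):
# primary key = term, secondary key = document ID.
records = sorted(
    (term, doc_id) for doc_id, terms in doc_terms.items() for term in terms
)

def serial_search(records, key):
    """Read records from the start until the key has been passed
    or the end of the file is reached."""
    hits = []
    for term, doc_id in records:
        if term == key:
            hits.append(doc_id)
        elif term > key:
            break  # sorted order: no later record can match
    return hits

print(records[:4])
print(serial_search(records, "easy"))
```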
Sequential File
◼ To access records, search serially:
  ◼ starting at the first record, read and examine each succeeding record until the required record is found or the end of the file is reached.
◼ Update options: does the index need to be rebuilt, or is incremental update supported?
Sequential File …
◼ Its main advantages:
  ◼ It is easy to implement.
  ◼ It provides fast access to the next record in lexicographic order.
  ◼ It can be searched quickly using binary search, in O(log n) time.
◼ Its disadvantages:
  ◼ No weights are attached to terms.
  ◼ Random access is slow: since similar terms are indexed individually, we need to find all terms that match the query.
Inverted file
◼ A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it.
◼ Building and maintaining an inverted index is a relatively low-cost task. For a text of n words, an inverted index can be built in O(n) time.
◼ The list is inverted from a list of terms in location order to a list of terms in alphabetical order.
[Figure: original documents (document IDs) pass through word extraction (word IDs) to produce the inverted files]
• W1: d1, d2, d3
• W2: d2, d4, d7, d9
• …
• Wn: di, … dn
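The inversion step itself can be sketched in a few lines. The helper name `invert` and the sample data are hypothetical: documents list their terms in location order, and the result lists documents per term in alphabetical term order.

```python
from collections import defaultdict

def invert(docs):
    """docs: dict doc_id -> list of terms in location order.
    Returns dict term -> sorted list of doc_ids, with terms in sorted order."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {"d1": ["w1", "w3"], "d2": ["w2", "w1"], "d3": ["w1"]}
print(invert(docs))
```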
Inverted file
Data to be held in the inverted file includes:
◼ The vocabulary (list of terms):
  ◼ the set of all distinct words (index terms) in the text collection;
  ◼ having information about the vocabulary speeds up searching for relevant documents.
◼ For each term, the inverted file contains information related to:
  ◼ Location: all the text locations/positions where the word occurs.
  ◼ Frequency: the number of occurrences of the term in the document collection.
Enhancements to Inverted Files: Concept
◼ Location: each posting holds information about the location of each term within the document.
  ◼ Uses: user interface design (highlighting the location of the search term); adjacency and near operators (in Boolean searching).
◼ Frequency: each inverted list includes the number of postings for each term.
  ◼ Uses: term weighting; query processing optimization.
Inverted file
◼ Information about the location of each term within the document helps with:
  ◼ user interface design: highlighting the location of the search term;
  ◼ proximity-based ranking: adjacency and near operators (in Boolean searching).
◼ Information about frequency is used for:
  ◼ calculating term weights (such as TF, TF*IDF, …);
  ◼ optimizing query processing.
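As an illustration of how stored locations support adjacency and near operators, here is a small sketch. The index layout (term -> {doc_id: [positions]}), the function name `adjacent` and the sample postings are all assumptions made for the example.

```python
def adjacent(index, first, second, max_gap=1):
    """Return the doc ids in which `second` occurs at most `max_gap`
    positions after `first` (max_gap=1 means directly adjacent)."""
    hits = []
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Only documents containing both terms can match.
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        later = set(second_postings[doc_id])
        if any(pos + gap in later
               for pos in first_postings[doc_id]
               for gap in range(1, max_gap + 1)):
            hits.append(doc_id)
    return hits

# Hypothetical positional postings: term -> {doc_id: [positions]}.
index = {
    "department": {1: [1], 2: [1], 4: [1]},
    "comput": {1: [2], 2: [4], 3: [3]},
    "science": {1: [3], 3: [4]},
}
print(adjacent(index, "comput", "science"))
print(adjacent(index, "department", "science", max_gap=2))
```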
Inverted File
Term     CF   Doc ID   TF   Location
term 1   3    2        1    66
              19       1    213
              29       1    45
term 2   4    3        1    94
              19       2    7, 212
              22       1    56
term 3   1    5        1    43
term 4   3    11       2    3, 70
              34       1    40
◼ Documents are organized by the terms/words they contain; this is called an index file.
◼ Text operations are performed before building the index.
◼ CF is the total frequency of term tj in the corpus.
◼ Is it possible to keep all this information during searching?
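A table of this shape can be produced by a single pass over the documents. The sketch below is an assumption-laden illustration (the name `build_positional_index` and the nested-dict layout are not from the slides): it keeps CF per term and a list of locations per document, from which TF is simply the length of that list.

```python
def build_positional_index(docs):
    """docs: dict doc_id -> list of index terms in text order.
    Returns term -> {"cf": total occurrences,
                     "postings": {doc_id: [1-based locations]}}."""
    index = {}
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id], start=1):
            entry = index.setdefault(term, {"cf": 0, "postings": {}})
            entry["cf"] += 1
            entry["postings"].setdefault(doc_id, []).append(position)
    return index

index = build_positional_index({1: ["make", "easy", "make"], 2: ["easy"]})
for term in sorted(index):
    for doc_id, locations in index[term]["postings"].items():
        # TF of the term in this document = number of stored locations.
        print(term, index[term]["cf"], doc_id, len(locations), locations)
```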
Construction of Inverted file
An inverted index consists of two files: the vocabulary file and the posting file.
◼ A vocabulary file (word list):
  ◼ stores all of the distinct terms (keywords) that appear in any of the documents, in lexicographical order (i.e., like a dictionary), and
  ◼ for each word, a pointer to the posting file.
◼ The record kept for each term j in the vocabulary (word list) contains the following:
  ◼ term j
  ◼ number of documents in which term j occurs (DFj)
  ◼ collection frequency of term j (CFj)
  ◼ pointer to the inverted (postings) list for term j
Postings File (Inverted List)
◼ For each distinct term in the vocabulary, the posting file stores a list of pointers to the documents that contain that term.
◼ Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
◼ Each list consists of one or many individual postings.
Advantage of dividing the inverted file into vocabulary and postings:
◼ Keeping a pointer in the vocabulary to the list in the posting file allows:
  ◼ the vocabulary to be kept in memory at search time even for a large text collection, while the posting file is kept on disk for accessing the pointers to documents.
General structure of Inverted File
◼ The following figure shows the general structure of an inverted index file.
Organization of Index File
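Term     DF   CF   Pointer to posting
term 1   3    3    → inverted list of term 1
term 2   3    4    → inverted list of term 2
term 3   1    1    → inverted list of term 3
term 4   2    3    → inverted list of term 4
[Figure: each vocabulary (word list) entry points to its inverted list in the postings file, and each posting points to a document]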
Example:
◼ Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.
Doc 1: Negative affect can make it harder to do even easy tasks. so make it easy
Doc 2: positive affect can make it easier to do difficult tasks
◼ After all documents have been tokenized, the inverted file is sorted by terms.
◼ Steps:
  ◼ Extract the terms in each document.
  ◼ Sort the terms.
  ◼ Compile the terms, i.e., collect the frequencies for each term.
[Figure: sorting the vocabulary]
◼ Multiple entries of a term in a single document are merged, and frequency information is added.
[Figure: remove stop words and compute frequency]
[Figure: stemming and compute frequency]
The file is commonly split into a dictionary (vocabulary) file and a posting file; each vocabulary entry points to the start of its list in the posting file.

Vocabulary              Postings
Term        DF   CF     (Doc #, TF)
affect      2    2      (1, 1), (2, 1)
difficult   1    1      (2, 1)
do          2    2      (1, 1), (2, 1)
easy        2    3      (1, 2), (2, 1)
hard        1    1      (1, 1)
make        2    3      (1, 2), (2, 1)
negative    1    1      (1, 1)
positive    1    1      (2, 1)
task        2    2      (1, 1), (2, 1)
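The split into a vocabulary and a posting file can be sketched as below. This is a minimal in-memory illustration: the "pointer" is modelled as an offset into a flat list rather than a disk address, the names `build_files` and `lookup` are hypothetical, and `doc_terms` is the assumed result of text operations on Doc 1 and Doc 2.

```python
from collections import Counter

def build_files(doc_terms):
    """Return (vocabulary, postings).
    vocabulary: term -> (DF, CF, pointer into postings)
    postings:   flat list of (doc_id, TF), grouped by term in sorted order."""
    per_term = {}
    for doc_id in sorted(doc_terms):
        for term, tf in Counter(doc_terms[doc_id]).items():
            per_term.setdefault(term, []).append((doc_id, tf))
    vocabulary, postings = {}, []
    for term in sorted(per_term):
        plist = per_term[term]
        # Pointer = position where this term's list starts in the posting file.
        vocabulary[term] = (len(plist), sum(tf for _, tf in plist), len(postings))
        postings.extend(plist)
    return vocabulary, postings

def lookup(vocabulary, postings, term):
    """Follow the pointer from the vocabulary into the posting file."""
    if term not in vocabulary:
        return []
    df, _cf, start = vocabulary[term]
    return postings[start:start + df]

# Assumed output of text operations on Doc 1 and Doc 2.
doc_terms = {
    1: ["negative", "affect", "make", "hard", "do", "easy", "task", "make", "easy"],
    2: ["positive", "affect", "make", "easy", "do", "difficult", "task"],
}
vocabulary, postings = build_files(doc_terms)
print(vocabulary["easy"], lookup(vocabulary, postings, "easy"))
```

Only `vocabulary` needs to stay in memory at search time; in a real system `postings` would live on disk and the pointer would be a file offset.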
Searching on Inverted File
◼ Since the index file is divided into two, searching can be done faster by loading only the vocabulary list, which takes little memory even for a large document collection.
◼ Using binary search on the vocabulary list, searching takes logarithmic time.
◼ Updating an inverted file is complex:
  ◼ both the vocabulary and the posting file need to be updated.
Example: Create Inverted file
◼ Create an inverted file (both the vocabulary list and
the posting file) for the following document
collection
D1 The Department of Computer Science was established
in 1984.
D2 The Department launched its first BSc in Computer
Studies in 1987.
D3 Followed by the MSc in Computer Science which was
started in 1991.
D4 The Department also produced its first PhD graduate in
1994.
D5 Our staff have contributed intellectually and
professionally to the advancements in these fields.
Example: Create Inverted file …
◼ After text operations are performed, only the content-bearing terms (shown in red on the original slides) remain as index terms:
◼ D1= department comput science establish
◼ D2= department launch bsc comput study
◼ D3= follow msc comput science start
◼ D4= department produce phd graduat
◼ D5= staff contribut intellect profession advance field
Vocabulary                       Postings
Word         DF   CF   WID       Doc#   TF   mTF   Loc
advance      1    1    W1        5      1    1     5
bsc          1    1    W2        2      1    1     3
comput       3    3    W3        1      1    1     2
                                 2      1    1     4
                                 3      1    1     3
contribut    1    1    W4        5      1    1     2
department   3    3    W5        1      1    1     1
                                 2      1    1     1
                                 4      1    1     1
establish    1    1    W6        (postings continue in the same way for the remaining terms)
field        1    1    W7
follow       1    1
graduat      1    1
intellect    1    1
launch       1    1
msc          1    1
phd          1    1
produce      1    1
profession   1    1
science      2    2
staff        1    1
start        1    1
study        1    1
◼ Each vocabulary entry holds a pointer to its list in the posting file (W1: d5; W2: d2; W3: d1, d2, d3; …; Wn: di, … dn), and each posting points into the document file.
◼ All term-specific information (max TF, TF, TF-IDF, location, etc.) is stored in the posting file.
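The posting rows for this example can be checked with a short sketch. It takes the five processed documents as given; the function name `posting_rows` is hypothetical, and mTF is interpreted here as the largest TF of any term in the same document, which is an assumption about the slide's column.

```python
from collections import Counter

docs = {
    1: "department comput science establish",
    2: "department launch bsc comput study",
    3: "follow msc comput science start",
    4: "department produce phd graduat",
    5: "staff contribut intellect profession advance field",
}

def posting_rows(docs):
    """Return term -> list of (doc_id, TF, max TF in doc, [locations]),
    with terms in alphabetical order."""
    rows = {}
    for doc_id in sorted(docs):
        terms = docs[doc_id].split()
        tf = Counter(terms)
        max_tf = max(tf.values())
        for term in tf:
            # 1-based positions of the term in the processed document.
            locations = [i for i, t in enumerate(terms, start=1) if t == term]
            rows.setdefault(term, []).append((doc_id, tf[term], max_tf, locations))
    return dict(sorted(rows.items()))

rows = posting_rows(docs)
for term, plist in rows.items():
    print(term, len(plist), sum(p[1] for p in plist), plist)  # term, DF, CF, postings
```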
Suffix trie
• A suffix trie is an ordinary trie in which the input strings are all possible suffixes of a text.
  – Principle: the idea behind the suffix trie is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1 and the last symbol has index n, the number of symbols in the text).
  • To build the suffix trie we use these indices instead of the actual objects.
• The structure has several advantages:
  • It requires less storage space.
  • We do not have to worry about how the text is represented (binary, ASCII, etc.).
  • We do not have to store the same object twice (no duplicates).
Suffix Trie
• Construct a suffix trie for the following string: GOOGOL
• We begin by giving a position to every suffix in the text, from left to right, according to where its first character occurs in the string.
  TEXT:     G O O G O L $
  POSITION: 1 2 3 4 5 6 7
• Build a suffix trie for all n suffixes of the text.
• Note: the resulting tree has n leaves and height n.
• This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
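A suffix trie for GOOGOL$ can be sketched with nested dictionaries. The representation is an assumption made for brevity: each node is a dict from character to child node, and a leaf stores the 1-based start position of its suffix under the key `None`.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` (which should end in a unique
    terminator such as '$') into a nested-dict trie."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1  # leaf marker: position of the suffix
    return root

def count_leaves(node):
    """Number of leaves (stored suffix positions) below `node`."""
    return sum(1 if key is None else count_leaves(child)
               for key, child in node.items())

def height(node):
    """Length, in edges, of the longest path from `node` to a leaf."""
    children = [child for key, child in node.items() if key is not None]
    return 1 + max(map(height, children)) if children else 0

trie = build_suffix_trie("GOOGOL$")
print(count_leaves(trie), height(trie))
```

With the terminator included, the text has n = 7 suffixes, so the sketch reports 7 leaves and height 7, in line with the note above.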
Suffix tree
◼ A suffix tree is an extension of the suffix trie: a compacted trie of all the suffixes of a string S.
◼ The suffix tree is created by compacting the unary nodes of the suffix trie.
◼ We store pointers rather than words in the leaves.
◼ It is also possible to replace the string on every edge by a pair (a, b), where a and b are the beginning and end indices of the string in the text, e.g. for GOOGOL$:
  (3,7) for OGOL$
  (1,2) for GO
  (7,7) for $
Example: Suffix tree
• Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$:
  { $, b$, ab$, bab$, abab$ }
• We label each leaf with the starting point of the corresponding suffix.
  root
    $ → leaf 5
    ab
      ab$ → leaf 1
      $ → leaf 3
    b
      ab$ → leaf 2
      $ → leaf 4
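The compaction of unary nodes can be sketched by first building the suffix trie and then merging every chain of single-child nodes into one edge. This is a simple quadratic illustration under an assumed nested-dict representation (leaf position stored under the key `None`), not one of the linear-time construction algorithms.

```python
def build_suffix_trie(text):
    """Nested-dict suffix trie; leaves store the 1-based suffix start under None."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root

def compress(node):
    """Merge chains of unary (single-child, non-leaf) nodes into one edge
    whose label is the concatenated string."""
    compact = {}
    for ch, child in node.items():
        if ch is None:
            compact[None] = child  # keep the leaf's suffix position
            continue
        label = ch
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1 and None not in child:
            (next_ch, grandchild), = child.items()
            label += next_ch
            child = grandchild
        compact[label] = compress(child)
    return compact

tree = compress(build_suffix_trie("abab$"))
print(tree)
```

For abab$ the result has the edges $, ab and b at the root, with ab$ and $ below both ab and b, matching the tree drawn above.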
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
• To make the suffixes prefix-free, we add a special character, $, at the end of s.
• To associate each suffix with a unique string in S, add a different special symbol to each s.
• Build a suffix tree for the string s1$s2#, where '$' and '#' are special terminators for s1 and s2.
• Ex.: Let s1 = abab and s2 = aab. The suffixes are:
  s1$: 1. abab$  2. bab$  3. ab$  4. b$  5. $
  s2#: 1. aab#  2. ab#  3. b#  4. #
  A generalized suffix tree for s1 and s2 is:
  root
    $ → leaf 5 (s1)
    # → leaf 4 (s2)
    a
      ab# → leaf 1 (s2)
      b
        ab$ → leaf 1 (s1)
        $ → leaf 3 (s1)
        # → leaf 2 (s2)
    b
      ab$ → leaf 2 (s1)
      $ → leaf 4 (s1)
      # → leaf 3 (s2)
Search in suffix tree
◼ Searching for all instances of a substring S in a text using its suffix tree is easy, since any substring of the text is a prefix of some suffix.
◼ Pseudo-code for searching in a suffix tree:
  ◼ Start at the root.
  ◼ Go down the tree, each time taking the path that matches the next characters of S.
  ◼ If S is matched completely, ending at a node x, return all leaves in the subtree rooted at x:
    ◼ the places where S can be found are given by the pointers in all the leaves of that subtree.
  ◼ If a NIL pointer is encountered before reaching the end of S, then S is not in the tree.
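The search procedure can be sketched as follows. For simplicity the sketch walks an uncompacted suffix trie (one character per edge) rather than a compacted suffix tree; the logic of descending along the pattern and then collecting the leaves is the same. The nested-dict representation and the function names are assumptions.

```python
def build_suffix_trie(text):
    """Nested-dict suffix trie; leaves store the 1-based suffix start under None."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root

def collect_leaves(node):
    """Return the positions stored in all leaves below `node`."""
    positions = []
    for key, child in node.items():
        if key is None:
            positions.append(child)
        else:
            positions.extend(collect_leaves(child))
    return positions

def find_all(trie, pattern):
    """Return the sorted start positions of `pattern` in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:  # the NIL-pointer case: pattern is not in the text
            return []
        node = node[ch]
    return sorted(collect_leaves(node))

trie = build_suffix_trie("GOOGOL$")
print(find_all(trie, "GO"))
print(find_all(trie, "LOG"))
```

The descent costs one step per pattern character, independent of the text length; collecting the leaves adds time proportional to the size of the matched subtree.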
Suffix Tree Applications
◼ Suffix trees can be used to solve a large number of string problems that occur in:
  ◼ text editing,
  ◼ free-text search, etc.
◼ Some examples of string problems:
  ◼ String matching
  ◼ Longest common substring
  ◼ Longest repeated substring
  ◼ Palindromes
Complexity Analysis
◼ The suffix tree for a string of length n can be built in O(n^2) time with the straightforward construction (insert each suffix in turn).
◼ Searching is very fast: the search time is linear in the length of the search string S.
◼ The number of leaves is n+1, where n is the length of the input string (one leaf per suffix, including the terminator $).
◼ Furthermore, in the leaves we may store either the strings themselves or pointers to the strings (that is, integers).
◼ Searching for a substring[1..m] in string[1..n] can be solved in O(m) time.
Exercise
Given the following index terms:
worker, word and world
construct an index file using a suffix tree.
End of Chapter 3
