Information storage and Retrieval-chapter 3.pdf

Chapter 3
12/10/2024
Indexing structure

Outline
◼ Major Steps in Index Construction
◼ Index file Evaluation Metrics
◼ Building Index file
◼ Sequential File
◼ Inverted file
◼ Suffix tree
◼ Suffix Trie
◼ Suffix Tree Applications

Indexing: Basic Concepts
◼ Indexing is an arrangement of index terms that permits fast
searching and reduces memory space requirements.
◼ It is used to speed up access to desired information from a document
collection as per the user's query, such that:
  ◼ It enhances efficiency in terms of retrieval time.
  ◼ Relevant documents are searched and retrieved quickly.
◼ An index file usually has its index terms in sorted order.
◼ Which list is easier to search?
fox pig zebra hen ant cat dog lion ox
ant cat dog fox hen lion ox pig zebra
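One way to see why the second list is easier to search: a sorted list permits binary search, while the unsorted list forces a scan. The sketch below uses the two lists from the slide; the helper names `linear_search` and `binary_search` are illustrative assumptions, not part of the course material.

```python
from bisect import bisect_left

unsorted_terms = ["fox", "pig", "zebra", "hen", "ant", "cat", "dog", "lion", "ox"]
sorted_terms = sorted(unsorted_terms)

def linear_search(terms, key):
    """Scan entries in order; return (index, number of comparisons made)."""
    for i, term in enumerate(terms):
        if term == key:
            return i, i + 1
    return -1, len(terms)

def binary_search(terms, key):
    """Binary search on a sorted list; return the index, or -1 if absent."""
    i = bisect_left(terms, key)
    return i if i < len(terms) and terms[i] == key else -1

print(linear_search(unsorted_terms, "ox"))  # has to read the whole list
print(binary_search(sorted_terms, "ox"))    # index within the sorted list
```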

Indexing: Basic Concepts …
◼ An index file consists of records, called index entries.
◼ Index files are much smaller than the original file.
◼ Remember Heaps' Law: for a 1 GB text collection the
vocabulary has a size of only about 5 MB. This size may be
further reduced by linguistic pre-processing (or text
operations).
◼ The usual unit for indexing is the word.
◼ Index terms are used to look up records in a file.

Major Steps in Index Construction
◼ Source file: a collection of text documents.
◼ A document can be described by a set of representative
keywords called index terms.
◼ Index term selection: apply text operations or
preprocessing.
◼ Tokenize: identify the words in a document, so that each
document is represented by a list of keywords or attributes.
◼ Stop word removal: words with very high frequency carry
little content and need to be removed from the text
collection.

Major Steps in Index Construction …
◼ Word stemming: reduce morphological variants of a word to their
common stem/root word.
◼ Term relevance weight: different index terms have varying
relevance when used to describe document contents. This
effect is captured through the assignment of numerical
weights to each index term of a document. There are
different index term weighting methods, including TF, IDF,
TF*IDF, …
◼ Indexing structure: the set of index terms (vocabulary) is
organized in an index file so that the documents in which
each term occurs can be identified easily.
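The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stop list `STOP_WORDS` is a tiny made-up sample, and `naive_stem` is a toy suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import re
from collections import Counter

# Assumed, illustrative stop list; real systems use much larger lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(word):
    """Toy suffix stripper (NOT the Porter stemmer): drops a few endings."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, remove stop words, stem, and count term frequency (TF)."""
    stems = [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(stems)

print(index_terms("The indexing of documents and the indexed terms"))
```

Raw term frequency is only the simplest weight; IDF or TF*IDF would need document counts from the whole collection.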

Basic Indexing Process
Documents to be indexed:  Friends, Romans, countrymen.
  -> Tokenizer
Token stream:             Friends  Romans  countrymen
  -> Linguistic preprocessor
Modified tokens:          friend  roman  countryman
  -> Indexer
Index file (inverted file):
  friend      -> 2, 4
  roman       -> 1, 2
  countryman  -> 13, 16

Index file Evaluation Metrics
◼ Running time of the main operations
  ◼ Access/search time
    ◼ How long does it take to find the required search key
    in the list?
  ◼ Update time (insertion time, deletion time)
    ◼ How long does it take to update existing records in order
    to add new terms or delete unnecessary existing terms?
    ◼ Does the indexing structure allow incremental updates, or does
    it require re-indexing?
◼ Space overhead
  ◼ Computer storage space consumed for keeping the list.

Building Index file
◼ An index file of a document collection is a file consisting of a list of
index terms, each with a link to one or more documents that contain
the index term.
◼ An index file is a list of search terms that are organized for
associative look-up, i.e., to answer the user's queries:
  ◼ In which documents does a specified search term appear?
  ◼ Where within each document does each term appear? (There
  may be several occurrences.)
◼ For organizing the index file of a collection of documents, there
are various options available:
  ◼ Decide what data structure and/or file structure to use: sequential file,
  inverted file, suffix tree, etc.

Sequential File
◼ The sequential file is the most primitive file structure.
◼ It has no vocabulary and no linking pointers.
◼ The records are generally arranged serially, one after
another, but in lexicographic order on the value of some key
field, i.e.:
  ◼ a particular attribute is chosen as the primary key, whose value
  determines the order of the records;
  ◼ when the first key fails to discriminate among records, a
  second key is chosen to give an order.
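As a small illustration of ordering by a primary and a secondary key, the sketch below sorts (term, document ID) records. The sample records are made up for the example; the choice of the term as primary key and the document ID as secondary key is an assumption.

```python
# Records as (term, doc_id) pairs in the order they were extracted.
records = [("make", 2), ("affect", 2), ("easy", 1), ("make", 1), ("affect", 1)]

# Primary key: the term (lexicographic order). Secondary key: the document
# ID, which only matters when two records share the same term.
sequential_file = sorted(records, key=lambda rec: (rec[0], rec[1]))

for term, doc_id in sequential_file:
    print(term, doc_id)
```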

Example:
◼ Given a collection of documents, they are parsed to extract
words, and these are saved with the document ID.
Doc 1: Negative affect can make it harder to do even easy tasks. so make it easy
Doc 2: positive affect can make it easier to do difficult tasks

Sorting the Vocabulary (sequential file)
◼ After all documents have been tokenized, stop words are
removed, and normalization and stemming are applied to generate
index terms.
◼ These index terms in the sequential file are sorted in
alphabetical order.

Sequential File
◼ To access records, search serially:
  ◼ starting at the first record, read and investigate all
  the succeeding records until the required record is
  found or the end of the file is reached.
◼ Update options: does the index need to be rebuilt, or
is incremental update supported?
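A serial search as described above might look like the following sketch. The function name `serial_search` and the sample records are illustrative assumptions.

```python
def serial_search(sequential_file, key):
    """Read records from the first one on; stop at a match or at end of file.

    Returns (record, records_read); record is None when the key is absent.
    """
    records_read = 0
    for record in sequential_file:
        records_read += 1
        if record[0] == key:
            return record, records_read
    return None, records_read

sequential_file = [("affect", 1), ("affect", 2), ("easy", 1), ("make", 1)]
print(serial_search(sequential_file, "easy"))
print(serial_search(sequential_file, "zebra"))
```

A failed search reads every record, which is why the cost grows linearly with the file size.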

Sequential File …
◼ Its main advantages:
  ◼ easy to implement;
  ◼ provides fast access to the next record using lexicographic
  order;
  ◼ can be searched quickly using binary search, O(log n).
◼ Its disadvantages:
  ◼ No weights are attached to terms.
  ◼ Random access is slow: since similar terms are indexed
  individually, we need to find all terms that match
  the query.

Inverted file
◼ A word-oriented indexing mechanism based on a sorted list of
keywords, with each keyword having links to the documents
containing it.
◼ Building and maintaining an inverted index is a relatively
low-cost task. On a text of n words, an inverted index can be built in
O(n) time.
◼ This list is inverted from a list of terms in location order to a
list of terms in alphabetical order.
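A minimal sketch of this inversion, assuming each document has already been reduced to a list of index terms. The document IDs, the sample terms, and the function name `build_inverted_index` are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it.

    `docs` maps a document ID to its list of index terms. Each term
    occurrence is visited once, so this pass is linear in the text size
    (assuming constant-time hash operations); the final sort only orders
    the vocabulary and the posting lists.
    """
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {
    "d1": ["index", "file"],
    "d2": ["index", "term", "file"],
    "d3": ["term", "weight"],
}
print(build_inverted_index(docs))
```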
Original documents (document IDs) -> word extraction (word IDs) -> inverted files:
• W1: d1, d2, d3
• W2: d2, d4, d7, d9
• …
• Wn: di, …, dn
March 8, 2020

Inverted file
Data to be held in the inverted file includes:
◼ The vocabulary (list of terms):
  ◼ is the set of all distinct words (index terms) in the text
  collection;
  ◼ having information about the vocabulary (list of terms) speeds up
  searching for relevant documents.
◼ For each term, the inverted file contains information related to:
  ◼ Location: all the text locations/positions where the word
  occurs
  ◼ Frequency of occurrence of the term in the document collection

Enhancements to Inverted Files -- Concept
Location: Each posting holds information about the location of
each term within the document.
Uses:
◼ user interface design -- highlight location of search term
◼ adjacency and near operators (in Boolean searching)
Frequency: Each inverted list includes the number of postings
for each term.
Uses:
◼ term weighting
◼ query processing optimization
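To make the use of location information concrete, here is a hedged sketch of a positional index with a simple NEAR operator. The two toy documents, the function names, and the parameter `k` (maximum distance in word positions) are assumptions made for the example.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def near(index, term_a, term_b, k=1):
    """Doc IDs where term_a and term_b occur within k positions of each other."""
    hits = []
    docs_a, docs_b = index.get(term_a, {}), index.get(term_b, {})
    for doc_id in sorted(docs_a.keys() & docs_b.keys()):
        if any(abs(pa - pb) <= k for pa in docs_a[doc_id] for pb in docs_b[doc_id]):
            hits.append(doc_id)
    return hits

docs = {
    1: ["computer", "science", "department"],
    2: ["science", "of", "the", "computer"],
}
index = build_positional_index(docs)
print(index["computer"])                      # positions per document
print(near(index, "computer", "science"))     # adjacency (k=1)
print(near(index, "computer", "science", 3))  # looser NEAR
```

The term frequency of a term in a document is simply the length of its position list, so the same structure also supports term weighting.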

Inverted file
◼ Having information about the location of each term within
the document helps with:
  ◼ user interface design: highlight the location of the search term
  ◼ proximity-based ranking: adjacency and near operators (in
  Boolean searching)
◼ Having information about frequency is used for:
  ◼ calculating term weights (like TF, TF*IDF, …)
  ◼ optimizing query processing
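As an illustration of how stored frequencies feed term weighting, the sketch below computes TF*IDF with idf = log2(N / DF). That formula is one common variant and is an assumption here, since the slides do not fix a formula; the sample postings are made up.

```python
import math

def tf_idf(tf, df, n_docs):
    """TF*IDF with idf = log2(N / df); one common variant, assumed here."""
    return tf * math.log2(n_docs / df)

# Term frequencies per document: term -> {doc_id: tf}
postings = {"easy": {1: 2, 2: 1}, "hard": {1: 1}}
n_docs = 2

for term, docs in postings.items():
    df = len(docs)  # document frequency: number of documents with the term
    for doc_id, tf in docs.items():
        print(term, doc_id, tf_idf(tf, df, n_docs))
```

A term that occurs in every document gets weight 0 under this variant, which matches the intuition that it does not discriminate between documents.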

Inverted File
Term     CF   Doc ID   TF   Location
term 1    3        2    1   66
                  19    1   213
                  29    1   45
term 2    4        3    1   94
                  19    2   7, 212
                  22    1   56
term 3    1        5    1   43
term 4    3       11    2   3, 70
                  34    1   40
◼ Documents are organized by the terms/words they contain. This is called an
index file.
◼ Text operations are performed before building the index.
◼ CF: total frequency of tj in the corpus.
◼ Is it possible to keep all this information during searching?

Construction of Inverted file
An inverted index consists of two files: the vocabulary file and the posting file.
◼ A vocabulary file (word list):
  ◼ stores all of the distinct terms (keywords) that appear in any of the
  documents (in lexicographical order, i.e., like a dictionary), and
  ◼ for each word, a pointer to the posting file.
◼ The record kept for each term j in the vocabulary (word list) contains
the following:
  ◼ term j
  ◼ number of documents in which term j occurs (DFj)
  ◼ collection frequency of term j (CFj)
  ◼ pointer to the inverted (postings) list for term j

Postings File (Inverted List)
◼ For each distinct term in the vocabulary, the posting file stores a list of
pointers to the documents that contain that term.
◼ Each element in an inverted list is called a posting, i.e., the
occurrence of a term in a document.
◼ Each list consists of one or many individual postings.
Advantage of dividing the inverted file into vocabulary and postings:
◼ Keeping a pointer in the vocabulary to the list in the posting file
allows:
  ◼ the vocabulary to be kept in memory at search time even for a large
  text collection, while the posting file is kept on disk for accessing
  the pointers to documents.
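The split can be sketched as follows. This is an illustration under assumptions: an in-memory `io.BytesIO` stream stands in for the on-disk posting file, document IDs are stored as 4-byte unsigned integers, and the "pointer" is a byte offset. None of these choices comes from the slides.

```python
import io
import struct

def write_index(inverted):
    """Write postings to a byte stream; return (vocabulary, stream).

    vocabulary: term -> (df, offset), small enough to keep in memory.
    stream: stands in for the on-disk posting file.
    """
    stream = io.BytesIO()
    vocabulary = {}
    for term in sorted(inverted):
        doc_ids = inverted[term]
        vocabulary[term] = (len(doc_ids), stream.tell())
        stream.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return vocabulary, stream

def read_postings(vocabulary, stream, term):
    """Follow the pointer stored in the vocabulary to fetch one posting list."""
    if term not in vocabulary:
        return []
    df, offset = vocabulary[term]
    stream.seek(offset)
    return list(struct.unpack(f"<{df}I", stream.read(4 * df)))

vocabulary, posting_file = write_index({"make": [1, 2], "hard": [1], "task": [1, 2]})
print(vocabulary)
print(read_postings(vocabulary, posting_file, "task"))
```

Only the posting list of the queried term is read from the stream; the rest of the posting file is never touched.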

General structure of Inverted File
◼ The following figure shows the general structure of an inverted
index file.

Organization of Index File
Vocabulary (word list) -> Postings (inverted lists) -> Documents

Term     DF   CF   Pointer to posting
term 1    3    3   -> inverted list
term 2    3    4   -> inverted list
term 3    1    1   -> inverted list
term 4    2    3   -> inverted list

Example:
◼ Given a collection of documents, they are parsed to extract
words, and these are saved with the document ID.
Doc 1: Negative affect can make it harder to do even easy tasks. so make it easy
Doc 2: positive affect can make it easier to do difficult tasks

Sorting the Vocabulary
◼ After all documents have been tokenized, the inverted file is
sorted by terms.
◼ Steps:
  ◼ Extract the terms in each document
  ◼ Sort the terms
  ◼ Compile the terms, i.e., collect the frequencies for each
  term
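The three steps (extract, sort, compile) can be sketched on the two example documents. Stop word removal and stemming are deliberately left out here, so the output still contains words such as "it" and "can"; lowercasing and dropping punctuation by hand is a simplification assumed for the example.

```python
from itertools import groupby

docs = {
    1: "negative affect can make it harder to do even easy tasks so make it easy",
    2: "positive affect can make it easier to do difficult tasks",
}

# Step 1: extract (term, doc_id) pairs in location order.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Step 2: sort by term, then by document ID.
pairs.sort()

# Step 3: compile, merging repeated (term, doc_id) pairs into one entry
# that carries the term frequency (TF).
compiled = [(term, doc_id, len(list(group)))
            for (term, doc_id), group in groupby(pairs)]

for entry in compiled[:4]:
    print(entry)
```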

Remove stop words and compute frequency
◼ Multiple term entries in a single document are merged and
frequency information is added.

Stemming & compute frequency
◼ Multiple term entries in a single document are merged and
frequency information is added.

The file is commonly split into a dictionary (vocabulary) file and a posting file,
with pointers from each vocabulary entry to its postings.
Vocabulary and postings file
Vocabulary              Postings
Term        DF  CF      Doc #  TF
affect       2   2          1   1
                            2   1
difficult    1   1          2   1
do           2   2          1   1
                            2   1
easy         2   3          1   2
                            2   1
hard         1   1          1   1
make         2   3          1   2
                            2   1
negative     1   1          1   1
positive     1   1          2   1
task         2   2          1   1
                            2   1

Searching on Inverted File
◼ Since the whole index file is divided into two, searching can
be done faster by loading only the vocabulary list, which takes little
memory even for a large document collection.
◼ Using binary search, the search takes logarithmic time.
  ◼ The search is done in the vocabulary list.
◼ Updating an inverted file is complex.
  ◼ We need to update both the vocabulary and the posting file.
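A small sketch of both operations, assuming the vocabulary is a sorted in-memory list and the postings are a dictionary of sorted document-ID lists. The sample data and the names `lookup` and `add_document` are hypothetical.

```python
from bisect import bisect_left, insort

vocabulary = ["affect", "easy", "make"]  # kept sorted, in memory
postings = {"affect": [1, 2], "easy": [1, 2], "make": [1, 2]}

def lookup(term):
    """Binary-search the vocabulary, then fetch the term's posting list."""
    i = bisect_left(vocabulary, term)
    if i < len(vocabulary) and vocabulary[i] == term:
        return postings[term]
    return []

def add_document(doc_id, terms):
    """Incremental update: both the vocabulary and the postings change."""
    for term in set(terms):
        if term not in postings:
            insort(vocabulary, term)    # keep the vocabulary sorted
            postings[term] = []
        insort(postings[term], doc_id)  # keep document IDs sorted

add_document(3, ["easy", "hard", "easy"])
print(vocabulary)
print(lookup("easy"), lookup("hard"), lookup("zebra"))
```

With a posting file on disk, the update would additionally have to rewrite or extend the stored lists, which is where much of the complexity comes from.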

Example: Create Inverted file
◼ Create an inverted file (both the vocabulary list and
the posting file) for the following document
collection
D1 The Department of Computer Science was established
in 1984.
D2 The Department launched its first BSc in Computer
Studies in 1987.
D3 Followed by the MSc in Computer Science which was
started in 1991.
D4 The Department also produced its first PhD graduate in
1994.
D5 Our staff have contributed intellectually and
professionally to the advancements in these fields.

Example: Create Inverted file
◼ After the text operations, only the content-bearing terms (shown in red
on the original slide) remain as index terms.

Example: Create Inverted file …
After the text operations are performed:
◼ D1= department comput science establish
◼ D2= department launch bsc comput study
◼ D3= follow msc comput science start
◼ D4= department produce phd graduat
◼ D5= staff contribut intellect profession advance field
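The vocabulary and postings for these five term lists can be derived mechanically. In the sketch below, the location of a term is assumed to be its position (counted from 1) in the processed term list of its document.

```python
docs = {
    1: "department comput science establish",
    2: "department launch bsc comput study",
    3: "follow msc comput science start",
    4: "department produce phd graduat",
    5: "staff contribut intellect profession advance field",
}

# term -> {doc_id: [locations]}, locations counted from 1 within each list
index = {}
for doc_id, text in docs.items():
    for loc, term in enumerate(text.split(), start=1):
        index.setdefault(term, {}).setdefault(doc_id, []).append(loc)

print(f"{'word':<12}{'DF':>3}{'CF':>3}  postings (doc, TF, locations)")
for term in sorted(index):
    entries = index[term]
    df = len(entries)                                 # documents containing the term
    cf = sum(len(locs) for locs in entries.values())  # total occurrences
    posting = [(d, len(locs), locs) for d, locs in sorted(entries.items())]
    print(f"{term:<12}{df:>3}{cf:>3}  {posting}")
```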

Vocabulary                       Postings
word        DF  CF  WID          Doc#  TF  mTF  loc
advance      1   1  W1              5   1    1    5
bsc          1   1  W2              2   1    1    3
comput       3   3  W3              1   1    1    2
                                    2   1    1    4
                                    3   1    1    3
contribut    1   1  W4              5   1    1    2
department   3   3  W5              1   1    1    1
                                    2   1    1    1
                                    4   1    1    1
establish    1   1  W6              1   1    1    4
field        1   1  W7              5   1    1    6
follow       1   1  W8              3   1    1    1
graduat      1   1  W9              4   1    1    4
intellect    1   1  W10             5   1    1    3
launch       1   1  W11             2   1    1    2
msc          1   1  W12             3   1    1    2
phd          1   1  W13             4   1    1    3
produce      1   1  W14             4   1    1    2
profession   1   1  W15             5   1    1    4
science      2   2  W16             1   1    1    3
                                    3   1    1    4
staff        1   1  W17             5   1    1    1
start        1   1  W18             3   1    1    5
study        1   1  W19             2   1    1    5
Pointers: each vocabulary entry points to its posting list, which in turn points into the document file:
• W1: d5
• W2: d2
• W3: d1, d2, d3
• Wn: di, …, dn
All term-specific information (max TF, TF, TF-IDF, location, etc.) is stored in the posting file.

Suffix trie
• A suffix trie is an ordinary trie in which the input strings are
all possible suffixes.
  – Principle: the idea behind the suffix trie is to assign to each symbol
  in a text an index corresponding to its position in the text (i.e.,
  the first symbol has index 1, the last symbol has index n, the number of
  symbols in the text).
  • To build the suffix trie we use these indices instead of the actual object.
• The structure has several advantages:
  • It requires less storage space.
  • We do not have to worry about how the text is represented (binary,
  ASCII, etc.).
  • We do not have to store the same object twice (no duplicates).

Suffix Trie
• Construct a suffix trie for the following string: GOOGOL
• We begin by giving a position to every suffix in the text, starting from left to
right, as per the characters' occurrence in the string.
  TEXT:     G O O G O L $
  POSITION: 1 2 3 4 5 6 7
• Build a suffix trie for all n suffixes of the text.
• Note: the resulting tree has n leaves and height n.
• This structure is particularly useful for any application requiring
prefix-based ("starts with") pattern matching.
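A minimal sketch of this construction for GOOGOL$, using nested dictionaries as trie nodes. Storing the start position under the key "pos" is an implementation choice assumed here, and the function names are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts.

    Each leaf stores, under the key "pos", the 1-based position at which
    its suffix starts. `text` is assumed to end with a unique terminator.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1
    return root

def leaf_positions(node):
    """Collect the start positions stored in all leaves below `node`."""
    found = []
    for key, child in node.items():
        if key == "pos":
            found.append(child)
        else:
            found.extend(leaf_positions(child))
    return sorted(found)

trie = build_suffix_trie("GOOGOL$")
print(leaf_positions(trie))            # one leaf per suffix
print(leaf_positions(trie["G"]["O"]))  # suffixes that start with "GO"
```

Walking down the path G, O and collecting the leaves below it is exactly the "starts with" matching mentioned above.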

Suffix tree
◼ A suffix tree is an extension of the suffix
trie: a compacted trie of all the
suffixes of S.
◼ The suffix tree is created by
compacting the unary nodes of the
suffix trie.
◼ We store pointers rather than words in
the leaves.
◼ It is also possible to replace the string
on every edge by a pair (a, b), where
a and b are the beginning and end
index of the string, e.g., for GOOGOL$:
  (3,7) for OGOL$
  (1,2) for GO
  (7,7) for $
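The index pairs can be checked directly against the text. The helper `edge_label` below is a hypothetical name; it treats (a, b) as 1-based, inclusive positions, as in the slide.

```python
text = "GOOGOL$"

def edge_label(text, a, b):
    """Return the substring for the 1-based, inclusive index pair (a, b)."""
    return text[a - 1:b]

for pair in [(3, 7), (1, 2), (7, 7)]:
    print(pair, edge_label(text, *pair))
```

Storing two integers per edge keeps the space per edge constant, whatever the length of the label.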

Example: Suffix tree
• Let s = abab; a suffix tree of s is a compressed trie of all
suffixes of s$ = abab$:
  { $, b$, ab$, bab$, abab$ }
• We label each leaf with the starting point of the
corresponding suffix.
  root
    ab
      ab$  -> leaf 1
      $    -> leaf 3
    b
      ab$  -> leaf 2
      $    -> leaf 4
    $      -> leaf 5
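The tree for abab$ can be reproduced by building the suffix trie and then compacting its unary nodes. This is a straightforward quadratic-time sketch for illustration, not an efficient construction; the class `Node` and the function names are assumptions.

```python
class Node:
    def __init__(self):
        self.children = {}  # edge label -> Node
        self.start = None   # 1-based suffix start; set on leaves only

def build_suffix_trie(text):
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.start = start + 1
    return root

def compress(node):
    """Merge chains of single-child (unary) nodes into one labeled edge."""
    out = Node()
    out.start = node.start
    for label, child in node.children.items():
        while len(child.children) == 1 and child.start is None:
            (next_label, next_child), = child.children.items()
            label += next_label
            child = next_child
        out.children[label] = compress(child)
    return out

def show(node, depth=0):
    """Print the tree, one edge label per line, indented by depth."""
    for label, child in sorted(node.children.items()):
        leaf = f" -> leaf {child.start}" if child.start is not None else ""
        print("  " * depth + label + leaf)
        show(child, depth + 1)

tree = compress(build_suffix_trie("abab$"))
show(tree)
```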

Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of
all suffixes of every s ∈ S.
• To make suffixes prefix-free, we add a special character, $, at the end of s.
• To associate each suffix with a unique string in S, add a different special
symbol to each s.
• Build a suffix tree for the string s1$s2#, where `$' and `#' are special
terminators for s1 and s2.
• Ex.: Let s1 = abab and s2 = aab; the suffixes are
  s1: { 1. abab$, 2. bab$, 3. ab$, 4. b$, 5. $ }
  s2: { 1. aab#, 2. ab#, 3. b#, 4. # }
and a generalized suffix tree for s1 and s2 is:
  root
    a
      b
        ab$  -> s1, leaf 1
        $    -> s1, leaf 3
        #    -> s2, leaf 2
      ab#    -> s2, leaf 1
    b
      ab$    -> s1, leaf 2
      $      -> s1, leaf 4
      #      -> s2, leaf 3
    $        -> s1, leaf 5
    #        -> s2, leaf 4

Search in suffix tree
◼ Searching for all instances of a substring S in a suffix tree is easy,
since any substring of the text is a prefix of some suffix.
◼ Pseudo-code for searching in a suffix tree:
  ◼ Start at the root.
  ◼ Go down the tree, each time taking the path that matches the next symbols of S.
  ◼ If S corresponds to a node x, then return all leaves in the subtree rooted at x:
    ◼ the places where S can be found are given by the pointers in
    all the leaves in the subtree rooted at x.
  ◼ If a NIL pointer is encountered before reaching the end of S, then S is
  not in the tree.
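The pseudo-code above can be sketched as follows. For brevity the example walks an uncompressed suffix trie with one symbol per edge; the same walk applies to the compressed tree with multi-character edge labels. The function names are illustrative.

```python
def build_suffix_trie(text):
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1  # leaf: 1-based start of the suffix
    return root

def leaves(node):
    """Collect the start positions stored in all leaves below `node`."""
    out = []
    for key, child in node.items():
        if key == "pos":
            out.append(child)
        else:
            out.extend(leaves(child))
    return out

def search(root, pattern):
    """Return the sorted 1-based start positions of `pattern` in the text."""
    node = root
    for ch in pattern:           # go down, one symbol at a time
        if ch not in node:       # no such branch (the NIL pointer case)
            return []
        node = node[ch]
    return sorted(leaves(node))  # every leaf below marks an occurrence

trie = build_suffix_trie("abab$")
print(search(trie, "ab"))
print(search(trie, "ba"))
print(search(trie, "aa"))
```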

Suffix Tree Applications
◼ A suffix tree can be used to solve a large number of string problems
that occur in:
  ◼ text editing,
  ◼ free-text search, etc.
◼ Some examples of string problems are given below:
  ◼ String matching
  ◼ Longest Common Substring
  ◼ Longest Repeated Substring
  ◼ Palindromes
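As one example, the longest repeated substring is spelled by the deepest branching node: a node with two or more children means its path is followed by at least two different symbols, so it occurs at least twice. The sketch below works on an uncompressed suffix trie and assumes the terminator `$` does not occur in the text; when several answers have the same length, which one is returned is unspecified.

```python
def build_suffix_trie(text):
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def longest_repeated_substring(text, terminator="$"):
    """Return the path to the deepest node with at least two children."""
    root = build_suffix_trie(text + terminator)
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(node) >= 2 and len(prefix) > len(best):
            best = prefix
        for ch, child in node.items():
            stack.append((child, prefix + ch))
    return best

print(longest_repeated_substring("GOOGOL"))
print(longest_repeated_substring("abab"))
```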

Complexity Analysis
◼ The suffix tree for a string of length n can be built in O(n^2) time by
inserting the suffixes one at a time.
◼ Searching is very fast: the search time is linear in the length of the
query string S.
◼ The number of leaves is n+1, where n is the length of the input
string (one leaf per suffix, including the terminator).
◼ Furthermore, in the leaves, we may store either the strings
themselves or pointers to the strings (that is, integers).
◼ Searching for a substring[1..m] in string[1..n] can be solved in
O(m) time.

Exercise
Given the following index terms:
worker, word and world
construct an index file using a suffix tree.

End of Chapter 3
