Automatic Indexing

Automatic indexing analyzes items to extract information for permanent storage in an index. There are four main classes of automatic indexing: statistical indexing, natural language, concept linkages, and hypertext linkages. Statistical indexing uses the frequency of terms to calculate relevance, with approaches including probabilistic weighting, vector weighting, and inverse document frequency. Vector weighting represents items as vectors of term weights. Inverse document frequency adjusts weights based on how common a term is across items.

• Automatic indexing is the process of analyzing an item to extract the information to be permanently
kept in an index.

• Classes of Automatic Indexing


1. Statistical Indexing
2. Natural Language
3. Concept Linkages
4. Hypertext Linkages
Statistical Indexing

• Statistical indexing uses frequency of occurrence of events to calculate a number that is used to indicate
the potential relevance of an item.
1. Probabilistic Weighting
• The use of probability theory is a natural choice because it is the basis of evidential reasoning (i.e.,
drawing conclusions from evidence).
• It also leads to an invariant result that facilitates integration of results from different databases.

Probability Ranking Principle (PRP) and its Plausible Corollary


HYPOTHESIS: If a reference retrieval system’s response to each request is a ranking of the documents in
the collection in order of decreasing probability of usefulness to the user who submitted the request, where
the probabilities are estimated as accurately as possible on the basis of whatever data is available for this
purpose, then the overall effectiveness of the system to its users is the best obtainable on the basis of that
data.
PLAUSIBLE COROLLARY: The most promising source of techniques for estimating the probabilities of
usefulness for output ranking in IR is standard probability theory and statistics.
• There are several factors that make this hypothesis and its corollary difficult to apply in practice.
• Probabilities are usually based upon a binary condition: an item is either relevant or not.
• But in information systems the relevance of an item is a continuous function, ranging from non-relevant to
absolutely useful.
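
As a minimal illustration of the PRP (a sketch, not from the original slides), the fragment below simply ranks items by an assumed set of estimated probabilities of relevance; producing those estimates is exactly the hard part the corollary points to.

```python
# Minimal sketch of ranking under the Probability Ranking Principle (PRP).
# The probability estimates are hypothetical inputs supplied by hand.

def rank_by_probability(estimates):
    """Return item ids sorted by decreasing estimated probability of usefulness."""
    return sorted(estimates, key=estimates.get, reverse=True)

# Hypothetical probability-of-relevance estimates for four items.
p_relevance = {"doc1": 0.82, "doc2": 0.15, "doc3": 0.47, "doc4": 0.91}
print(rank_by_probability(p_relevance))  # ['doc4', 'doc1', 'doc3', 'doc2']
```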
2. Vector Weighting

• In information retrieval, each position in the vector typically represents a processing token.
• There are two approaches to the domain of values in the vector: binary and weighted.

• Under the binary approach, the domain contains only the values one and zero, with one representing the
presence of the processing token in the item.
• Binary vectors require a decision process to determine whether a particular processing token represents
the semantics of an item strongly enough to be included in the vector.
• In the weighted approach, the domain is typically the set of all real positive numbers.
• The value for each processing token represents the relative importance of that processing token in
representing the semantics of the item.
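
The following sketch (illustrative only; the vocabulary and the sample item are assumptions) contrasts a binary vector with a term-frequency-weighted vector for the same item.

```python
from collections import Counter

# Hypothetical vocabulary of processing tokens; each vector position
# corresponds to one token, as described above.
VOCAB = ["computer", "memory", "disk", "network", "index"]

def binary_vector(tokens):
    """1 if the processing token occurs in the item, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in VOCAB]

def weighted_vector(tokens):
    """Weight each position by the token's frequency in the item (simple TF)."""
    counts = Counter(tokens)
    return [counts[term] for term in VOCAB]

item = ["computer", "memory", "computer", "disk"]
print(binary_vector(item))    # [1, 1, 1, 0, 0]
print(weighted_vector(item))  # [2, 1, 1, 0, 0]
```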
Major algorithms that can be used in calculating the weights used to represent a processing token –

a. Simple Term Frequency Algorithm


• In both the binary (unweighted) and weighted approaches, an automatic indexing process implements an
algorithm to determine the weight to be assigned to a processing token for a particular item.
• In a statistical system, the data that are potentially available for calculating a weight are
• frequency of occurrence of the processing token in an existing item (i.e., term frequency - TF),
• frequency of occurrence of the processing token in the existing database (i.e., total frequency -
TOTF) and
• the number of unique items in the database that contain the processing token (i.e., item frequency
- IF, frequently labeled in other publications as document frequency - DF).

• The simplest approach is to have the weight equal to the term frequency.
• This approach emphasizes the use of a particular processing token within an item.
• Use of the absolute (raw) term frequency biases weights toward longer items, where a term is more likely
to occur with a higher frequency.
• Thus, one normalization typically used in weighting algorithms compensates for the number of words in
an item.
• The term frequency weighting formula used in TREC 4 (a pivoted unique normalization) was:

TF weight = (1 + log(TF)) / (1 + log(average TF)) / ((1 − slope) * pivot + slope * (number of unique terms in the item))

• where slope was set at 0.2 and the pivot was set to the average number of unique terms occurring in the
collection.
• In addition to compensating for document length, the formula is also designed to be insensitive to
anomalies introduced by stemming or misspellings.
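
As an illustrative sketch (not from the slides), the formula above can be computed as follows; the slope value is the one quoted, while the toy inputs are assumptions.

```python
import math

def trec4_tf_weight(tf, avg_tf, unique_terms, pivot, slope=0.2):
    """Pivoted unique normalization as a TREC-4-style term frequency weight:
    the log-dampened TF is divided by a length normalizer built from the
    number of unique terms in the item and the collection-wide pivot."""
    dampened = (1 + math.log(tf)) / (1 + math.log(avg_tf))
    normalizer = (1 - slope) * pivot + slope * unique_terms
    return dampened / normalizer

# Assumed example: a term occurring 4 times in an item whose average TF is 1.5,
# with 120 unique terms, in a collection whose average item has 100 unique terms.
print(trec4_tf_weight(tf=4, avg_tf=1.5, unique_terms=120, pivot=100))
```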

There are many approaches to account for different document lengths when determining the value of Term
Frequency to use –
• Maximum term frequency - the term frequency for each word is divided by the maximum frequency of
the word in any item.
 ◦ This normalizes the term frequency values to a value between zero and one.
 ◦ The problem with this technique is that the maximum term frequency can be so large that it decreases
the value of term frequency in short items to too small a value and loses significance.
• Logarithmic term frequency - the log of the term frequency plus a constant is used to replace the
term frequency.
 ◦ The log function performs the normalization when the term frequencies vary significantly due to the size
of documents.
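
A small sketch of the two normalizations just listed (illustrative only; the constant added to the log is an assumed choice, set to 1 here):

```python
import math

def max_tf_normalized(tf, max_tf):
    """Divide the term frequency by the maximum frequency of that word in any item,
    giving a value between zero and one."""
    return tf / max_tf

def log_tf(tf, constant=1.0):
    """Replace the raw term frequency with log(tf) + constant to dampen
    large differences caused by document size (the constant is an assumption)."""
    return math.log(tf) + constant if tf > 0 else 0.0

print(max_tf_normalized(3, 120))  # 0.025 - a short item is penalized by a large maximum
print(log_tf(3), log_tf(300))     # ~2.10 vs ~6.70 - frequencies are compressed
```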

• Another approach recognizes that the normalization process may be over penalizing long documents.
• To compensate, a correction factor was defined that is based upon document length that maps the Cosine
function into an adjusted normalization function.
• The function determines the document length crossover point for longer documents, where the
probability of relevance equals the probability of retrieval (given a query set).
• This value, called the "pivot point", is used to apply an adjustment to the normalization process.
Pivoted function = (slope) * (old normalization) + (1.0 – slope) * (pivot)
• Slope and pivot are constants for any document/query set.
b. Inverse Document Frequency
• The basic algorithm is improved by taking into consideration the frequency of occurrence of the
processing token in the database.
• One of the objectives of indexing an item is to discriminate the semantics of that item from other items
in the database.
• Algorithm - the weight assigned to a term should be inversely proportional to the number of items in the
database in which the term occurs.
• The un-normalized weighting formula is:

WEIGHTij = TFij * log2(n / IFj)
where -
• WEIGHTij is the vector weight that is assigned to term “j” in item “i”,
• TFij (term frequency) is the frequency of term “j” in item “i”,
• “n” is the number of items in the database and
• IFj (item frequency or document frequency) is the number of items in the database that have term “j” in
them.
• Since log2(n / IFj) = −log2(IFj / n), the weight in effect divides by the item frequency; this inversion is
the basis for the name of the algorithm.
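
A brief sketch of this weighting (a minimal illustration, assuming the log2(n / IFj) form given above and a tiny hypothetical collection):

```python
import math
from collections import Counter

# Tiny hypothetical collection of already-tokenized items.
items = [
    ["computer", "memory", "computer"],
    ["disk", "memory"],
    ["network", "computer", "disk", "disk"],
]

n = len(items)  # number of items in the database
# IFj: number of items that contain term j (item/document frequency).
item_freq = Counter(term for item in items for term in set(item))

def idf_weights(item):
    """WEIGHTij = TFij * log2(n / IFj) for each term j in item i."""
    tf = Counter(item)
    return {term: freq * math.log2(n / item_freq[term]) for term, freq in tf.items()}

print(idf_weights(items[0]))
# "computer" occurs in 2 of the 3 items, so its weight is damped relative to rarer terms.
```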
c. Signal Weighting

• Inverse document frequency adjusts the weight of a processing token for an item based upon the number
of items that contain the term in the existing database.
• What it does not account for is the term frequency distribution of the processing token in the items that
contain the term.
• The distribution of the frequency of processing tokens within an item can affect the ability to rank
items.

• In Information Theory, the information content of an event is inversely proportional to its probability of
occurrence.
• An instance of an event that occurs all the time has less information value than an instance of a seldom
occurring event.
• This is typically represented as INFORMATION = −log2(p), where p is the probability of occurrence of
the event.
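
To make the idea concrete, here is a small sketch of a signal-style weight. The exact formula is an assumption (a standard formulation from the signal-weighting literature), since the slides state only the information-theory principle.

```python
import math

def signal(tf_per_item):
    """Assumed signal measure for a term across the items that contain it:
    SIGNAL = log2(TOTF) + sum_i p_i * log2(p_i), where p_i = TF_i / TOTF.
    Terms concentrated in a few items score higher than terms spread evenly."""
    totf = sum(tf_per_item)
    probs = [tf / totf for tf in tf_per_item if tf > 0]
    return math.log2(totf) + sum(p * math.log2(p) for p in probs)

# A term occurring 12 times: spread evenly over 4 items vs. concentrated in 1 item.
print(signal([3, 3, 3, 3]))   # ~1.58  (low signal: evenly distributed)
print(signal([12, 0, 0, 0]))  # ~3.58  (high signal: concentrated)
```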
d. Discrimination Value
• Another approach to creating a weighting algorithm is to base it upon the discrimination value of a
term.
• To achieve the objective of finding relevant items, it is important that the index discriminates among
items.
• The more alike all the items appear, the harder it is to identify those that are needed.
• Discrimination value for each term “i”:

DISCRIMi = AVESIMi − AVESIM

where
• AVESIM is the average similarity between every item in the database and
• AVESIMi is the same calculation except that term “i” is removed from all items.
• There are three possibilities with the DISCRIMi value being positive, close to zero or negative.
• A positive value indicates that removal of term “i” has increased the similarity between items. In this
case, leaving the term in the database assists in discriminating between items and is of value.
• A value close to zero implies that the term’s removal or inclusion does not change the similarity
between items.
• If the value of DISCRIMi is negative, the term’s effect on the database is to make the items appear more
similar since their average similarity decreased with its removal.
• Once the value of DISCRIMi is normalized as a positive number, it can be used in the standard weighting
formula as:

WEIGHTki = TFki * DISCRIMi

• where TFki is the frequency of term “i” in item “k”.
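
A compact sketch of the discrimination-value computation (illustrative; cosine similarity between term-frequency vectors is an assumed choice of similarity measure, and the collection is hypothetical):

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def average_similarity(items, drop_term=None):
    """AVESIM over all item pairs; AVESIMi when drop_term is removed from all items."""
    vecs = [Counter({t: f for t, f in Counter(it).items() if t != drop_term}) for it in items]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

items = [
    ["computer", "memory", "computer"],
    ["computer", "disk", "memory"],
    ["computer", "network", "printer"],
]

avesim = average_similarity(items)
for term in ["computer", "memory"]:
    discrim = average_similarity(items, drop_term=term) - avesim  # DISCRIMi = AVESIMi - AVESIM
    print(term, round(discrim, 3))
# "computer" occurs in every item, so its discrimination value comes out negative here,
# while "memory" comes out slightly positive.
```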
3. Bayesian Model

Common questions


Normalization is crucial in term frequency weighting to mitigate bias towards longer documents, where term repetition can inflate relevance scores irrespective of genuine semantic importance. Techniques such as maximum term frequency normalization and logarithmic transformation scale term frequencies relative to document length, allowing comparisons across documents of varying sizes. The maximum term frequency approach scales values between zero and one, though it can diminish significance in short items when the maximum is exceedingly high. Logarithmic term frequency reduces the variance caused by differing document sizes by using the log of the frequency for more stability. Additionally, adjusted normalization functions built from a pivot point and slope offer refined corrections for the effect of typical document length on relevance.

Bayesian models in automatic indexing apply probabilistic inference to update beliefs about term relevance based on prior and observed data. By using Bayesian principles, indexing can continuously refine weight assignments as more information becomes available, enhancing adaptability in evolving contexts. Compared to other methods like vector or signal weighting, Bayesian models can incorporate uncertainty more comprehensively, leading to potentially more robust probability estimates. However, they may demand greater computational resources and more complex parameterization, which can be cumbersome when handling large datasets or ensuring quick retrieval times. Despite these challenges, incorporating Bayesian models offers dynamic insights into term importance, refining index precision over time.

Applying the PRP in real-world systems faces difficulties due to the continuous rather than binary nature of relevance assessments. Estimating probabilities of usefulness accurately is challenging given varied user needs and the contextual ambiguities inherent in real-world queries. Additionally, integrating diverse databases at varying scales compounds the complexity of maintaining consistency and effectiveness in PRP-driven results. While theoretically optimal, practical application of the PRP requires sophisticated algorithms that can adaptively refine probabilistic models based on evolving evidence and data patterns.

Statistical indexing primarily relies on the frequency of occurrence of events to calculate relevance, using probabilistic and vector weighting to rank items based on observable statistics such as term frequency and document frequency. On the other hand, natural language indexing uses linguistic structures and syntax to determine relevancy, whereas concept linkages focus on semantic connections between concepts, potentially incorporating ontologies or thesauri to understand relationships between terms.

The PRP suggests that an information retrieval system achieves its best effectiveness by ranking documents in order of their likelihood of usefulness to users based on accurately estimated probabilities. This principle implies that retrieval systems must integrate robust statistical methods to model these probabilities from the available data, which is a significant challenge given the continuous rather than binary nature of relevance in practical contexts. Implementing the PRP can lead to improved retrieval performance by focusing effort on refining probability estimates, but it requires overcoming challenges such as varying document relevancies and the integration of data from different databases.

Vector weighting represents items using vectors where each position corresponds to a processing token that can be binary or weighted to reflect its semantic relevance. A binary representation indicates presence or absence, while a weighted vector assigns real positive numbers indicating the token's relative importance. In contrast, probabilistic weighting relies on probability theory to estimate an item's likelihood of usefulness based on the available evidence, which results in an invariant outcome facilitating integration across databases. This makes probabilistic weighting more dynamic, as it considers broader evidence compared to vector weighting's token-specific weights.

The discrimination value approach enhances indexing by focusing on how well a term distinguishes items in a database. A positive DISCRIMi value indicates that a term's inclusion helps distinguish items, so keeping it in the index is of value. A value close to zero suggests the term contributes no significant discrimination: its presence or absence does not materially alter index outcomes, so it can be deprioritized. A negative DISCRIMi means the term itself makes the items appear more similar (average similarity drops when it is removed), suggesting it contributes little to distinguishing items and may warrant down-weighting or adjusted retrieval strategies.

Signal weighting can be considered more refined as it encompasses both the presence of terms and their frequency distribution, providing deeper insights into item-specific relevance beyond what simple term frequency or inverse document frequency methods offer. While term frequency places a fixed emphasis on appearance and inverse document frequency considers general occurrence across a database, signal weighting evaluates specificity by assigning greater informational value to terms with distinctive distribution patterns within items, reflecting their semantic significance more accurately. This multifaceted assessment allows for more nuanced ranking, thus improving retrieval quality in complex databases.

Inverse document frequency (IDF) plays a critical role by adjusting the weight of processing tokens to be inversely proportional to their occurrence across the database, thus reducing the influence of very common terms and enhancing the discriminating power of the index. By applying IDF, basic term frequency algorithms increase the significance of terms that occur in fewer items, balancing the emphasis that would otherwise be placed on terms purely because of their frequency within a document. This improvement reflects the inherent information value of infrequently appearing terms while reducing the impact of commonly present ones that may not aid specific retrieval tasks.

Signal weighting in automatic indexing is significant because it considers the distribution of term frequencies across the items containing the term, unlike inverse document frequency, which accounts only for how many items contain the term. By evaluating how a token's occurrences are distributed among items, signal weighting can more accurately predict the informativeness of terms. In Information Theory terms, a token spread evenly across items is predictable and carries less information, so terms with skewed, concentrated distributions are weighted more heavily to emphasize rarity and uniqueness. This enhances ranking strategies by recognizing nuanced term importance beyond mere presence in a database.
