Automatic Indexing

Automatic indexing analyzes items to extract information for permanent storage in an index. There are four main classes of automatic indexing: statistical indexing, natural language, concept linkages, and hypertext linkages. Statistical indexing uses the frequency of terms to calculate relevance, with approaches including probabilistic weighting, vector weighting, and inverse document frequency. Vector weighting represents items as vectors of term weights. Inverse document frequency adjusts weights based on how common a term is across items.

• Automatic indexing is the process of analyzing an item to extract the information to be permanently
kept in an index.

• Classes of Automatic Indexing


1. Statistical Indexing
2. Natural Language
3. Concept Linkages
4. Hypertext Linkages
Statistical Indexing

• Statistical indexing uses frequency of occurrence of events to calculate a number that is used to indicate
the potential relevance of an item.
1. Probabilistic Weighting
• The use of probability theory is a natural choice because it is the basis of evidential reasoning (i.e.,
drawing conclusions from evidence).
• It also leads to an invariant result that facilitates integration of results from different databases.

Probability Ranking Principle (PRP) and its Plausible Corollary


HYPOTHESIS: If a reference retrieval system’s response to each request is a ranking of the documents in
the collection in order of decreasing probability of usefulness to the user who submitted the request, where
the probabilities are estimated as accurately as possible on the basis of whatever data is available for this
purpose, then the overall effectiveness of the system to its users is the best obtainable on the basis of that
data.
PLAUSIBLE COROLLARY: The most promising source of techniques for estimating the probabilities of
usefulness for output ranking in IR is standard probability theory and statistics.
• There are several factors that make this hypothesis and its corollary difficult to apply in practice.
• Probabilities are usually based upon a binary condition: an item is either relevant or not.
• But in information systems the relevance of an item is a continuous function, ranging from non-relevant to
absolutely useful.
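
As a minimal illustration of the PRP (a sketch, not from the original slides), the fragment below simply ranks items by an assumed set of estimated probabilities of relevance; producing those estimates is exactly the hard part the corollary points to.

```python
# Minimal sketch of ranking under the Probability Ranking Principle (PRP).
# The probability estimates are hypothetical inputs supplied by hand.

def rank_by_probability(estimates):
    """Return item ids sorted by decreasing estimated probability of usefulness."""
    return sorted(estimates, key=estimates.get, reverse=True)

# Hypothetical probability-of-relevance estimates for four items.
p_relevance = {"doc1": 0.82, "doc2": 0.15, "doc3": 0.47, "doc4": 0.91}
print(rank_by_probability(p_relevance))  # ['doc4', 'doc1', 'doc3', 'doc2']
```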
2. Vector Weighting

• In information retrieval, each position in the vector typically represents a processing token.
• There are two approaches to the domain of values in the vector: binary and weighted.

• Under the binary approach, the domain contains only the values one and zero, with one representing the
presence of the processing token in the item.
• Binary vectors require a decision process to determine whether a particular processing token represents
the semantics of an item strongly enough to be included in the vector.
• In the weighted approach, the domain is typically the set of all real positive numbers.
• The value for each processing token represents the relative importance of that processing token in
representing the semantics of the item.
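
The following sketch (illustrative only; the vocabulary and the sample item are assumptions) contrasts a binary vector with a term-frequency-weighted vector for the same item.

```python
from collections import Counter

# Hypothetical vocabulary of processing tokens; each vector position
# corresponds to one token, as described above.
VOCAB = ["computer", "memory", "disk", "network", "index"]

def binary_vector(tokens):
    """1 if the processing token occurs in the item, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in VOCAB]

def weighted_vector(tokens):
    """Weight each position by the token's frequency in the item (simple TF)."""
    counts = Counter(tokens)
    return [counts[term] for term in VOCAB]

item = ["computer", "memory", "computer", "disk"]
print(binary_vector(item))    # [1, 1, 1, 0, 0]
print(weighted_vector(item))  # [2, 1, 1, 0, 0]
```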
Major algorithms that can be used in calculating the weights used to represent a processing token –

a. Simple Term Frequency Algorithm


• In both the binary (unweighted) and weighted approaches, an automatic indexing process implements an
algorithm to determine the weight to be assigned to a processing token for a particular item.
• In a statistical system, the data that are potentially available for calculating a weight are
• frequency of occurrence of the processing token in an existing item (i.e., term frequency - TF),
• frequency of occurrence of the processing token in the existing database (i.e., total frequency -
TOTF) and
• the number of unique items in the database that contain the processing token (i.e., item frequency
- IF, frequently labeled in other publications as document frequency - DF).

• The simplest approach is to have the weight equal to the term frequency.
• This approach emphasizes the use of a particular processing token within an item.
• Use of the absolute (raw) term frequency biases weights toward longer items, where a term is more likely
to occur with a higher frequency.
• Thus, one normalization typically used in weighting algorithms compensates for the number of words in
an item.
• The term frequency weighting formula used in TREC 4 (a pivoted unique normalization) was:

TF weight = (1 + log(TF)) / (1 + log(average TF)) / ((1 − slope) * pivot + slope * (number of unique terms in the item))

• where slope was set at 0.2 and the pivot was set to the average number of unique terms occurring in the
collection.
• In addition to compensating for document length, the formula is also designed to be insensitive to
anomalies introduced by stemming or misspellings.
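
As an illustrative sketch (not from the slides), the formula above can be computed as follows; the slope value is the one quoted, while the toy inputs are assumptions.

```python
import math

def trec4_tf_weight(tf, avg_tf, unique_terms, pivot, slope=0.2):
    """Pivoted unique normalization as a TREC-4-style term frequency weight:
    the log-dampened TF is divided by a length normalizer built from the
    number of unique terms in the item and the collection-wide pivot."""
    dampened = (1 + math.log(tf)) / (1 + math.log(avg_tf))
    normalizer = (1 - slope) * pivot + slope * unique_terms
    return dampened / normalizer

# Assumed example: a term occurring 4 times in an item whose average TF is 1.5,
# with 120 unique terms, in a collection whose average item has 100 unique terms.
print(trec4_tf_weight(tf=4, avg_tf=1.5, unique_terms=120, pivot=100))
```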

There are many approaches to account for different document lengths when determining the value of Term
Frequency to use –
• Maximum term frequency - the term frequency for each word is divided by the maximum frequency of
the word in any item.
 ◦ This normalizes the term frequency values to a value between zero and one.
 ◦ The problem with this technique is that the maximum term frequency can be so large that it decreases
the value of term frequency in short items to too small a value and loses significance.
• Logarithmic term frequency - the log of the term frequency plus a constant is used to replace the
term frequency.
 ◦ The log function performs the normalization when the term frequencies vary significantly due to the size
of documents.
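
A small sketch of the two normalizations just listed (illustrative only; the constant added to the log is an assumed choice, set to 1 here):

```python
import math

def max_tf_normalized(tf, max_tf):
    """Divide the term frequency by the maximum frequency of that word in any item,
    giving a value between zero and one."""
    return tf / max_tf

def log_tf(tf, constant=1.0):
    """Replace the raw term frequency with log(tf) + constant to dampen
    large differences caused by document size (the constant is an assumption)."""
    return math.log(tf) + constant if tf > 0 else 0.0

print(max_tf_normalized(3, 120))  # 0.025 - a short item is penalized by a large maximum
print(log_tf(3), log_tf(300))     # ~2.10 vs ~6.70 - frequencies are compressed
```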

• Another approach recognizes that the normalization process may be over penalizing long documents.
• To compensate, a correction factor was defined that is based upon document length that maps the Cosine
function into an adjusted normalization function.
• The function determines the document length crossover point for longer documents, where the
probability of relevance equals the probability of retrieval (given a query set).
• This value, called the "pivot point", is used to apply an adjustment to the normalization process.
Pivoted function = (slope) * (old normalization) + (1.0 – slope) * (pivot)
• Slope and pivot are constants for any document/query set.
b. Inverse Document Frequency
• The basic algorithm is improved by taking into consideration the frequency of occurrence of the
processing token in the database.
• One of the objectives of indexing an item is to discriminate the semantics of that item from other items
in the database.
• Algorithm - the weight assigned to a term should be inversely proportional to the number of items in the
database in which the term occurs.
• The un-normalized weighting formula is:

WEIGHTij = TFij * log2(n / IFj)
where -
• WEIGHTij is the vector weight that is assigned to term “j” in item “i”,
• TFij (term frequency) is the frequency of term “j” in item “i”,
• “n” is the number of items in the database and
• IFj (item frequency or document frequency) is the number of items in the database that have term “j” in
them.
• Since log2(n / IFj) = −log2(IFj / n), the weight in effect divides by the item frequency; this inversion is
the basis for the name of the algorithm.
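
A brief sketch of this weighting (a minimal illustration, assuming the log2(n / IFj) form given above and a tiny hypothetical collection):

```python
import math
from collections import Counter

# Tiny hypothetical collection of already-tokenized items.
items = [
    ["computer", "memory", "computer"],
    ["disk", "memory"],
    ["network", "computer", "disk", "disk"],
]

n = len(items)  # number of items in the database
# IFj: number of items that contain term j (item/document frequency).
item_freq = Counter(term for item in items for term in set(item))

def idf_weights(item):
    """WEIGHTij = TFij * log2(n / IFj) for each term j in item i."""
    tf = Counter(item)
    return {term: freq * math.log2(n / item_freq[term]) for term, freq in tf.items()}

print(idf_weights(items[0]))
# "computer" occurs in 2 of the 3 items, so its weight is damped relative to rarer terms.
```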
c. Signal Weighting

• Inverse document frequency adjusts the weight of a processing token for an item based upon the number
of items that contain the term in the existing database.
• What it does not account for is the term frequency distribution of the processing token in the items that
contain the term.
• The distribution of the frequency of processing tokens within an item can affect the ability to rank
items.

• In Information Theory, the information content of an event is inversely proportional to its probability of
occurrence.
• An instance of an event that occurs all the time has less information value than an instance of a seldom
occurring event.
• This is typically represented as INFORMATION = −log2(p), where p is the probability of occurrence of
the event.
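
To make the idea concrete, here is a small sketch of a signal-style weight. The exact formula is an assumption (a standard formulation from the signal-weighting literature), since the slides state only the information-theory principle.

```python
import math

def signal(tf_per_item):
    """Assumed signal measure for a term across the items that contain it:
    SIGNAL = log2(TOTF) + sum_i p_i * log2(p_i), where p_i = TF_i / TOTF.
    Terms concentrated in a few items score higher than terms spread evenly."""
    totf = sum(tf_per_item)
    probs = [tf / totf for tf in tf_per_item if tf > 0]
    return math.log2(totf) + sum(p * math.log2(p) for p in probs)

# A term occurring 12 times: spread evenly over 4 items vs. concentrated in 1 item.
print(signal([3, 3, 3, 3]))   # ~1.58  (low signal: evenly distributed)
print(signal([12, 0, 0, 0]))  # ~3.58  (high signal: concentrated)
```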
d. Discrimination Value
• Another approach to creating a weighting algorithm is to base it upon the discrimination value of a
term.
• To achieve the objective of finding relevant items, it is important that the index discriminates among
items.
• The more alike all the items appear, the harder it is to identify those that are needed.
• Discrimination value for each term “i”:

DISCRIMi = AVESIMi − AVESIM

where
• AVESIM is the average similarity between every item in the database and
• AVESIMi is the same calculation except that term “i” is removed from all items.
• There are three possibilities with the DISCRIMi value being positive, close to zero or negative.
• A positive value indicates that removal of term “i” has increased the similarity between items. In this
case, leaving the term in the database assists in discriminating between items and is of value.
• A value close to zero implies that the term’s removal or inclusion does not change the similarity
between items.
• If the value of DISCRIMi is negative, the term’s effect on the database is to make the items appear more
similar since their average similarity decreased with its removal.
• Once the value of DISCRIMi is normalized as a positive number, it can be used in the standard weighting
formula as:

WEIGHTki = TFki * DISCRIMi

• where TFki is the frequency of term “i” in item “k”.
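
A compact sketch of the discrimination-value computation (illustrative; cosine similarity between term-frequency vectors is an assumed choice of similarity measure, and the collection is hypothetical):

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def average_similarity(items, drop_term=None):
    """AVESIM over all item pairs; AVESIMi when drop_term is removed from all items."""
    vecs = [Counter({t: f for t, f in Counter(it).items() if t != drop_term}) for it in items]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

items = [
    ["computer", "memory", "computer"],
    ["computer", "disk", "memory"],
    ["computer", "network", "printer"],
]

avesim = average_similarity(items)
for term in ["computer", "memory"]:
    discrim = average_similarity(items, drop_term=term) - avesim  # DISCRIMi = AVESIMi - AVESIM
    print(term, round(discrim, 3))
# "computer" occurs in every item, so its discrimination value comes out negative here,
# while "memory" comes out slightly positive.
```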
3. Bayesian Model

Common questions


Normalization is crucial in term frequency weighting to mitigate bias towards longer documents, where term repetition can inflate relevance scores irrespective of genuine semantic importance. Techniques such as maximum term frequency normalization and logarithmic transformation scale term frequencies relative to document length, allowing comparisons across documents of varying sizes. The maximum term frequency approach scales values between zero and one, though it can diminish significance in short items when the maximum is exceedingly high. Logarithmic term frequency reduces the variance caused by differing document sizes by using the log of the frequency for more stability. Additionally, adjusted normalization functions built from a pivot point and slope offer refined corrections for the effect of typical document length on relevance.

Bayesian models in automatic indexing apply probabilistic inference to update beliefs about term relevance based on prior and observed data. By using Bayesian principles, indexing can continuously refine weight assignments as more information becomes available, enhancing adaptability in evolving contexts. Compared to other methods like vector or signal weighting, Bayesian models can incorporate uncertainty more comprehensively, leading to potentially more robust probability estimates. However, they may demand greater computational resources and more complex parameterization, which can be cumbersome when handling large datasets or ensuring quick retrieval times. Despite these challenges, incorporating Bayesian models offers dynamic insights into term importance, refining index precision over time.

Applying the PRP in real-world systems faces difficulties due to the continuous rather than binary nature of relevance assessments. Estimating probabilities of usefulness accurately is challenging given varied user needs and the contextual ambiguities inherent in real-world queries. Additionally, integrating diverse databases at varying scales compounds the complexity of maintaining consistency and effectiveness in PRP-driven results. While theoretically optimal, practical application of the PRP requires sophisticated algorithms that can adaptively refine probabilistic models based on evolving evidence and data patterns.

Statistical indexing primarily relies on the frequency of occurrence of events to calculate relevance, using probabilistic and vector weighting to rank items based on observable statistics such as term frequency and document frequency. On the other hand, natural language indexing uses linguistic structures and syntax to determine relevancy, whereas concept linkages focus on semantic connections between concepts, potentially incorporating ontologies or thesauri to understand relationships between terms.

The PRP suggests that an information retrieval system achieves its best effectiveness by ranking documents in order of their likelihood of usefulness to users based on accurately estimated probabilities. This principle implies that retrieval systems must integrate robust statistical methods to model these probabilities from the available data, which is a significant challenge given the continuous rather than binary nature of relevance in practical contexts. Implementing the PRP can lead to improved retrieval performance by focusing effort on refining probability estimates, but it requires overcoming challenges such as varying document relevancies and the integration of data from different databases.

Vector weighting represents items using vectors where each position corresponds to a processing token that can be binary or weighted to reflect its semantic relevance. A binary representation indicates presence or absence, while a weighted vector assigns real positive numbers indicating the token's relative importance. In contrast, probabilistic weighting relies on probability theory to estimate an item's likelihood of usefulness based on the available evidence, which results in an invariant outcome facilitating integration across databases. This makes probabilistic weighting more dynamic, as it considers broader evidence compared to vector weighting's token-specific weights.

The discrimination value approach enhances indexing by focusing on how well a term distinguishes items in a database. A positive DISCRIMi value indicates that a term's inclusion helps distinguish items, so keeping it in the index is of value. A value close to zero suggests the term contributes no significant discrimination: its presence or absence does not materially alter index outcomes, so it can be deprioritized. A negative DISCRIMi means the term itself makes the items appear more similar (average similarity drops when it is removed), suggesting it contributes little to distinguishing items and may warrant down-weighting or adjusted retrieval strategies.

Signal weighting can be considered more refined as it encompasses both the presence of terms and their frequency distribution, providing deeper insights into item-specific relevance beyond what simple term frequency or inverse document frequency methods offer. While term frequency places a fixed emphasis on appearance and inverse document frequency considers general occurrence across a database, signal weighting evaluates specificity by assigning greater informational value to terms with distinctive distribution patterns within items, reflecting their semantic significance more accurately. This multifaceted assessment allows for more nuanced ranking, thus improving retrieval quality in complex databases.

Inverse document frequency (IDF) plays a critical role by adjusting the weight of processing tokens to be inversely proportional to their occurrence across the database, thus reducing the influence of very common terms and enhancing the discriminating power of the index. By applying IDF, basic term frequency algorithms increase the significance of terms that occur in fewer items, balancing the emphasis that would otherwise be placed on terms purely because of their frequency within a document. This improvement reflects the inherent information value of infrequently appearing terms while reducing the impact of commonly present ones that may not aid specific retrieval tasks.

Signal weighting in automatic indexing is significant because it considers the distribution of term frequencies across the items containing the term, unlike inverse document frequency, which accounts only for how many items contain the term. By evaluating how a token's occurrences are distributed among items, signal weighting can more accurately predict the informativeness of terms. In Information Theory terms, a token spread evenly across items is predictable and carries less information, so terms with skewed, concentrated distributions are weighted more heavily to emphasize rarity and uniqueness. This enhances ranking strategies by recognizing nuanced term importance beyond mere presence in a database.
