Models of Modern IR Systems
Chapter Four - Part I

1
Template for Search Engine Evaluation
Task
• Cover page (course title, group members' names, search engines' names)
• Introduction (only one page)
  • How and why you selected the search engines for the comparison,
  and a brief description of each search engine
• Comparison (only one page)
  • Use the table on the next page
• Conclusion (only one page)
  • Discussion of the key findings you want to emphasize
• References

File naming convention: team leader's name followed by the first letter of his/her
father's name, then -UG-IR-SEE
• Example: tibebeb-UG-IR-SEE
2
Criteria                       SE1    SE2    SE3
Size of index database
Searching options
Stemming technique
Ranking approach
Similarity measure approach
User interface
Relevance feedback mechanism
Others

3
Objectives

• Understand the retrieval process
• Be familiar with the basic retrieval models
• Get hands-on experience with (simulating) the retrieval of
items

4
Topics

 Overview
 Boolean/Logical Model
 Vector Space Model (VSM)

5
Modeling of Modern IR Systems

• Retrieval based on index terms assumes that the
semantics of the documents and of the user's
information need can be expressed through sets of
index terms.
• A central problem in IR is predicting which
documents are relevant and which are not.
• This decision depends on the ranking algorithm
that is implemented.

6
Cont…
 Ranking is an ordering of the documents retrieved
that (hopefully) reflects the relevance of the
documents to the user query.
 Ranking is based on fundamental premises
regarding the notion of relevance, such as:
 common sets of index terms
 sharing of weighted terms
 likelihood of relevance

7
Cont…
• Each distinct set of premises (regarding document
relevance) leads to a distinct IR model.
• A model
  • is an idealization or abstraction of the actual
  process (here, retrieval).
  • represents something that exists or is
  planned in the real world and that is in some way
  too complex or large for us to understand
  as it stands.

8
Cont…

• A model is in some way simplified, or reduced in size,
scope, or scale.
• It helps us understand the system better.
• It is the best, and scientific, way to study reality.
• Thus, a model is a simplified representation of a complex
reality, usually built for the purpose of understanding that
reality, and it keeps all the features of that reality
that are necessary for the current task or problem.

9
Cont…

• A model may be conceptual, like a mathematical model,
which is full of equations and is used to study the
properties of the process, draw conclusions, and make
predictions.
• Statistical models, on the other hand, represent repetitive
processes, make predictions about the frequencies of
interesting events, and use probability as the fundamental
tool.
• What, then, is a retrieval model?

10
What is a retrieval model?

• It is a model that describes the computational
process (e.g. how documents are ranked) and the
human process (e.g. the information need and the
interaction).
• Note that how documents or indexes are stored
is an implementation matter.
• In relation to this, the retrieval variables are queries,
documents, terms, relevance judgments, users,
information needs …

11
What an IR Model includes

 Two elements
 The retrieval mechanism: used to match query with a set
of documents
 The ways in which the user’s information need can be
formulated as a query that can be searched by that
mechanism
 thus a retrieval model specifies the details of
 Document representation
 Query representation
 Retrieval function
12
Building a Model
• To build a model, we first need to think about the
representations of the documents and of the user's
information need.
• Given these representations, the next step is to conceive
a framework in which they can be modeled.
• This framework should also suggest how to
construct a ranking function.
• In the Boolean model, the framework is composed of sets
of documents and the standard operations on sets.
• For the vector space model, the framework is composed
of a t-dimensional vector space and standard linear
algebra operations on vectors.
13
Cont…

• The discussion so far prepares the ground for
the two basic information retrieval models, namely
  • Boolean retrieval models and
  • Vector space models (VSM)

14
Boolean/Logical model
• It is a simple retrieval model that is more about retrieval
than about document representation, and it is based on set
theory and Boolean algebra.
• Documents and queries are represented as sets of index
terms.
• It provides a framework that is easy for a
common user of an IR system to grasp.
• It attracted great attention in past years and was adopted
by many of the early commercial bibliographic systems.
15
Cont…
• It is a basis for the majority of DBMSs and conventional
IR systems.
• It is the most common exact-match model.
• Queries are logic expressions with document
features as operands; that is, query terms are
linked by the logical operators AND, OR and NOT.
• A document is an object, or a set, consisting of
terms.
• Terms are features of the objects (documents).
• The search engine retrieves those documents that
satisfy the logical constraints of the query.
16
Cont…
• Example
  • Doc 1: Information storage and retrieval
  • Doc 2: Expert system and information retrieval
  systems
  • Doc 3: Information processing and management
  • Doc 4: Information retrieval in archives

17
Cont…
Index term     Documents
Information    1, 2, 3, 4
Storage        1
Retrieval      1, 2, 4
System         2
Processing     3
Management     3
Archives       4

Suppose our query is: information AND retrieval.
Which documents will be retrieved under the Boolean
model?
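
The question above can be checked with a few lines of Python. This is only a minimal sketch: the dictionary `index` copies the index-term table, and the helper name `boolean_and` is an illustrative choice, not something defined in these slides.

```python
# Inverted index copied from the table: index term -> set of document numbers.
index = {
    "information": {1, 2, 3, 4},
    "storage": {1},
    "retrieval": {1, 2, 4},
    "system": {2},
    "processing": {3},
    "management": {3},
    "archives": {4},
}


def boolean_and(index, term_a, term_b):
    """Return the documents that contain both terms (set intersection)."""
    return index.get(term_a, set()) & index.get(term_b, set())


result = boolean_and(index, "information", "retrieval")
print(sorted(result))  # [1, 2, 4]
```

Under the Boolean model the answer is an unordered set: documents 1, 2 and 4 are all returned with no indication of which is the better match.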
18
Cont…
• The basic assumption is that there is a domain to which
both the author of the document and the readers
belong, and that at any one time the domain has
t index terms.
  • This means that any document in the
  domain is written in these terms.

19
Relevance – matching
• Matching, as a concept, is the degree of similarity
between D and Q.
• The degree of similarity determines the degree of
closeness between D and Q.
• If the query terms and the document terms share more,
the author and the user are
talking about the same thing.
• Thus, by taking the intersection, what we call similarity
can be captured mathematically.

20
Cont…
• The Boolean model's matching considers index terms
to be either present or absent in a document.
• Thus, the index term weights are assumed to be all
binary.
  • That is, wij ∈ {0, 1}
• A query q is composed of index terms linked by the
three connectives, for example
q = ka ∧ (kb ∨ ¬kc)
• Boolean expressions represent a request to determine which
documents contain (or do not contain) a given set of keywords.
• A query searches a set of documents to determine
their content.
21
Boolean model (Document, Term, Weight, Matching)
• Document (how a document is viewed in the Boolean model)
  • Is an object, a set consisting of terms
  • That is, documents are sets of terms
  • An instance of an object (i.e., a document or query) is created
  when we assign values (concepts) to the features
• Term (how a term is viewed in the Boolean model)
  • Terms are features of the objects (documents)
  • The terms come from the vocabulary of the subject
  • Documents are represented by terms, which together represent the document
  • The terms are the things we use to describe concepts in a
  particular domain
  • The vocabulary grows when new terms are introduced
22
Cont…
• Weight
  • Terms are either present or absent in documents
  • Thus, the index term weight variables are all binary, i.e.,
  wij ∈ {0, 1}
• Matching
  • Is the degree of similarity between D and Q. If the query terms and
  the document terms share more, the author and the user are talking about the
  same thing
  • By taking the intersection, what we call similarity can be captured
  mathematically in the Boolean model
  • Example
    • q = (1, 1, 0, 0, 1, 0, 0), d = (1, 0, 0, 1, 1, 0, 0), S(q, d) = 2
• Thus, intersection
  • Takes operands and returns the degree of commonness
  • Is a function that counts the number of matches
23
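
The matching example on the previous slide can be reproduced as follows. This is a sketch under two assumptions: the third entry of the query vector is read as 0, and the function name `match_count` is hypothetical.

```python
def match_count(q, d):
    """Count the positions where both binary vectors hold a 1."""
    return sum(qi & di for qi, di in zip(q, d))


q = (1, 1, 0, 0, 1, 0, 0)
d = (1, 0, 0, 1, 1, 0, 0)
score = match_count(q, d)
print(score)  # 2
```

The two vectors share a 1 in the first and fifth positions, which gives S(q, d) = 2.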
How do you explain the
essence of relevance in an IR system
designed using the Boolean model?

24
Example:

1. Query: Find all documents containing “information”
• Boolean expression
information
• Result (meaning)
A set whose elements are all the documents containing
the pattern “information”

25
Cont…
2. Query: Find all documents that do not contain
“information”
• This query attempts to find documents
that do not contain a particular pattern
• Boolean expression (representation)
NOT information
• Result
  • A set whose elements are all the documents that
  do not contain the pattern “information”

26
Cont…
• Most queries search for more than one term
  • Find all documents containing “information” and
  “retrieval”
  • Find all documents containing “information” or
  “retrieval” (or both)
  • Find all documents containing “information” or
  “retrieval”, but not both
• Each of these three queries illustrates a particular
concept that may form a Boolean expression, namely
conjunction, disjunction, and exclusive disjunction, respectively

27
Cont…
• Boolean expressions may be formed from other
Boolean expressions to yield complex structures
• Query
  • Find all documents containing both “information” and
  “retrieval”, or not containing both “retrieval”
  and “science”
• Boolean expression
  • (information AND retrieval) OR NOT (retrieval
  AND science); the parentheses avoid ambiguity

28
Cont…

• Each portion of a Boolean expression yields a set of
documents.
• These portions are evaluated separately.
• Combining the portions of a Boolean expression is simple
and is done as follows
  • Let U represent the set of all docs in the collection
  • Let d1 and d2 represent the sets of docs that contain patterns
  p1 and p2, respectively

29
The following list defines how to evaluate
the operators of Boolean expressions in terms of
the sets
• U − d1 is the set of all docs not containing p1
(NOT)
• d1 ∩ d2 is the set of all docs containing both
p1 and p2 (AND)
• d1 ∪ d2 is the set of all docs containing
either p1 or p2 (OR)
• (d1 ∪ d2) − (d1 ∩ d2) is the set of all docs
containing either p1 or p2, but not both (XOR)
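
These four definitions map directly onto Python's built-in set operators. The sketch below uses a made-up five-document collection; the contents of `U`, `d1` and `d2` are assumptions chosen only for illustration.

```python
# Hypothetical collection: U is every document, d1 and d2 are the
# documents that contain patterns p1 and p2 respectively.
U = {"D1", "D2", "D3", "D4", "D5"}
d1 = {"D1", "D2", "D3"}
d2 = {"D2", "D3", "D4"}

not_p1 = U - d1                  # NOT p1
p1_and_p2 = d1 & d2              # p1 AND p2
p1_or_p2 = d1 | d2               # p1 OR p2
p1_xor_p2 = (d1 | d2) - (d1 & d2)  # p1 XOR p2

print(sorted(not_p1))     # ['D4', 'D5']
print(sorted(p1_and_p2))  # ['D2', 'D3']
print(sorted(p1_or_p2))   # ['D1', 'D2', 'D3', 'D4']
print(sorted(p1_xor_p2))  # ['D1', 'D4']
```

Python also offers `d1 ^ d2` (symmetric difference), which gives the same result as the XOR construction above.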

30
Cont…

• Thus, in the Boolean model,
  • the use of AND requires that both terms that it
  connects be present in the retrieved documents,
  • the use of OR requires that at least one of the terms
  be present.
  • This is an inclusive use of OR, meaning that it is
  acceptable for both of the terms to be present.

31
Cont…
• If an exclusive use of OR is desired (one term or the
other, but not both), the construction is more
complex:
(A AND NOT B) OR (B AND NOT A)
or
(A OR B) AND NOT (A AND B)
• NOT requires that the specified term be absent from
any retrieved document
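
That the two exclusive-OR constructions agree can be checked with a small truth table. This is a minimal sketch; the function names `xor_form_1` and `xor_form_2` are illustrative only.

```python
from itertools import product


def xor_form_1(a, b):
    """(A AND NOT B) OR (B AND NOT A)"""
    return (a and not b) or (b and not a)


def xor_form_2(a, b):
    """(A OR B) AND NOT (A AND B)"""
    return (a or b) and not (a and b)


# Compare both forms, and Python's own "a != b", on all four input pairs.
equivalent = all(
    xor_form_1(a, b) == xor_form_2(a, b) == (a != b)
    for a, b in product([False, True], repeat=2)
)
print(equivalent)  # True
```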

32
Exercise: Consider a set of five docs and assume that
they contain the terms shown in the table

Doc.   Terms
D1     Algorithm, information, retrieval
D2     Retrieval, science
D3     Algorithm, information, science
D4     Pattern, retrieval, science
D5     Science, algorithm

Find the documents retrieved by the following expressions

• Information AND retrieval
• Information OR retrieval
• (Information AND retrieval) OR NOT (retrieval AND science)
33
Solution

• Information AND retrieval
{d1,d3} ∩ {d1,d2,d4} = {d1}
• Information OR retrieval
{d1,d3} ∪ {d1,d2,d4} = {d1,d2,d3,d4}
• (Information AND retrieval) OR NOT (retrieval
AND science)
{d1} ∪ (U − {d2,d4}) = {d1} ∪ {d1,d3,d5} = {d1,d3,d5}
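
The solution can be verified mechanically. The sketch below encodes the five documents of the exercise as term sets; the helper `having` and the variable names are assumptions made for this illustration.

```python
# Documents from the exercise table, each as a set of lower-cased terms.
docs = {
    "d1": {"algorithm", "information", "retrieval"},
    "d2": {"retrieval", "science"},
    "d3": {"algorithm", "information", "science"},
    "d4": {"pattern", "retrieval", "science"},
    "d5": {"science", "algorithm"},
}
U = set(docs)  # the whole collection


def having(term):
    """Return the names of the documents that contain the term."""
    return {name for name, terms in docs.items() if term in terms}


info, retr, sci = having("information"), having("retrieval"), having("science")

q1 = info & retr                       # Information AND retrieval
q2 = info | retr                       # Information OR retrieval
q3 = (info & retr) | (U - (retr & sci))  # (... AND ...) OR NOT (... AND ...)

print(sorted(q1))  # ['d1']
print(sorted(q2))  # ['d1', 'd2', 'd3', 'd4']
print(sorted(q3))  # ['d1', 'd3', 'd5']
```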

34
Advantages of the Boolean
model
• Simplicity
• It is still a dominant model in
commercial database systems
• It provides a good starting point for those
new to the field

35
Limitations (Drawbacks) of the
Boolean Model

• In pure Boolean retrieval, there is no good way to weight
terms for significance (thus it produces only a binary
partition).
  • A term is either present or absent. Thus, the
  user has little control over how important a
  given term is to the query.
  • That is, its retrieval strategy is based on
  binary decision criteria.
  • There is no weighting of document terms and no
  weighting of query terms.
36
Cont…
• That is, the concept of significance is totally ignored.
  • The representation is only binary.
  • The system is not flexible enough to represent weights, which are
  said to be very important in IR.
  • Reconsidering index term weights brings us to the
  vector model.
• It predicts that each document is either relevant or
non-relevant.
  • This is a simple partition into those documents that match the query and
  those that do not.
  • It divides the collection into two subsets only,
  retrieved and non-retrieved.
37
Cont…

• There is no notion of partial matching of the query
condition.
  • For example, let dj be a document with the
  vector
  dj = (0, 1, 0)
  • Document dj includes the index term kb but is
  considered non-relevant to the query
  q = ka ∧ (kb ∨ ¬kc)
• This prevents good retrieval performance.

38
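
The partial-matching example on the previous slide can be sketched in code. The sketch takes the query to be ka AND (kb OR NOT kc); the operator symbols are an assumption here, and the function name `satisfies` is hypothetical.

```python
def satisfies(doc):
    """Evaluate ka AND (kb OR NOT kc) on a binary vector (ka, kb, kc)."""
    ka, kb, kc = doc
    return bool(ka and (kb or not kc))


dj = (0, 1, 0)         # contains kb only
full_match = (1, 1, 0)  # contains ka and kb

print(satisfies(dj))          # False
print(satisfies(full_match))  # True
```

Although dj shares the term kb with the query, the Boolean model gives it no credit for that: the answer is simply "not retrieved".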
Cont…
• In the Boolean model, no ranking of the documents is
provided (there is no grading scale).
  • Because all retrieved documents are considered equal, the
  retrieved set is not ordered.
  • Retrieved documents are generally not ranked.
  • All retrieved documents are presumed to be equally useful.
• There is no mechanism to show the relative importance of
the different components of a query.

39
Cont…
• Query formulation with the Boolean
operators is difficult.
• Boolean expressions have precise semantics.
• Thus, it is not simple to translate an information
need into a Boolean expression.
• The information need has to be translated into a
Boolean expression, which most users find awkward.

40
Cont…

• To answer sophisticated queries, we need to know
more about Boolean logic.
• We also need a good knowledge of how to represent
queries in Boolean logic, which presumes knowledge
of the documents, the queries (user's needs), and so on.
• As a consequence, there is a
  • need for a trained intermediary, which creates
  another problem: the problem of understanding.
  • Instead of you, somebody else does the translation
  on your behalf.
41
Cont…
• It is very difficult to define the user's need precisely at the
beginning.
• As a result, the Boolean model frequently
returns either too few or too many documents in
response to a user query.
• That is, exact matching may lead to the retrieval of too few or
too many documents (the main problem).
  • The user has very little control over the size of the output of a
  particular query.
  • That is, the size of the retrieved set can hardly be controlled.

42
Cont…
• “NOT”, for instance, retrieves every document that does
not contain a specific term.
  • A query such as ‘NOT aardvark’ runs the risk of retrieving virtually
  the entire database.
• Another point is that the separation between retrieved and non-retrieved
documents is too strict, which means that
  • for q = t1 ∧ t2 ∧ t3, documents containing two of the
  terms are rejected, just like those containing none
  • analogously, for q = t1 ∨ t2 ∨ t3, there is no ordering within the
  retrieved documents
• Generally, it has poor retrieval quality, and its main problem
is the inability to recognize partial matches, which
frequently leads to poor performance.
43
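
The strictness of the retrieved / non-retrieved split can be seen on a toy collection. The four documents below are made up for this illustration, and the variable names are assumptions.

```python
# Hypothetical documents described by which of t1, t2, t3 they contain.
docs = {
    "A": {"t1", "t2", "t3"},
    "B": {"t1", "t2"},
    "C": {"t3"},
    "D": set(),
}
query = {"t1", "t2", "t3"}

# q = t1 AND t2 AND t3: every query term must be present.
and_result = {name for name, terms in docs.items() if query <= terms}

# q = t1 OR t2 OR t3: at least one query term must be present.
or_result = {name for name, terms in docs.items() if query & terms}

# Number of shared terms, which the Boolean model never looks at.
overlap = {name: len(query & terms) for name, terms in docs.items()}

print(sorted(and_result))  # ['A']
print(sorted(or_result))   # ['A', 'B', 'C']
print(overlap)
```

Under AND, document B (two of the three terms) is rejected exactly like D (none). Under OR, documents A, B and C come back as one unordered set even though their overlap with the query differs.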
Exercise
Given the following four documents with the following
contents:
 D1 = “computer information retrieval”
 D2 = “computer retrieval”
 D3 = “information”
 D4 = “computer information”
 What are the relevant documents retrieved for the
queries:
 Q1 = “information  retrieval”
 Q2 = “information  ¬computer”
44
The Boolean Model: Example
• Given the following determine documents retrieved by the Boolean
model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)

45
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
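
The answer can be reproduced with set operations. The sketch reads the query as K1 AND (K2 OR NOT K3), which is what the three sets in the answer imply; the helper `having` is a hypothetical name.

```python
# Documents from the example, each as a set of index terms.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}
U = set(docs)  # the whole collection


def having(term):
    """Return the names of the documents that contain the index term."""
    return {name for name, terms in docs.items() if term in terms}


# K1 AND (K2 OR NOT K3)
answer = having("K1") & (having("K2") | (U - having("K3")))
print(sorted(answer))  # ['D1', 'D2', 'D6']
```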

46
The Boolean Model: Example
Given the following three documents, construct the term-document matrix and
find the relevant documents retrieved by the Boolean model for the given
query
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Query: “gold silver truck”
• Find the relevant documents for the queries (use AND, OR)
(a) gold delivery
(b) ship gold
(c) silver truck
Use the table below for the term-document matrix

         arrive  damage  deliver  fire  gold  silver  ship  truck
D1
D2
D3
query
47
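
One way to fill in the table is sketched below. The stop-word list and the word-to-term mapping `STEMS` are hand-written assumptions that stand in for a real stemmer, chosen so that the terms match the table header; `index_terms` and `retrieve` are hypothetical helper names.

```python
# Hand-written normalization (an assumption, not a real stemming algorithm).
STOP_WORDS = {"of", "in", "a"}
STEMS = {"shipment": "ship", "damaged": "damage",
         "delivery": "deliver", "arrived": "arrive"}
TERMS = ["arrive", "damage", "deliver", "fire", "gold", "silver", "ship", "truck"]


def index_terms(text):
    """Lower-case, drop stop words, and map each word to its index term."""
    words = text.lower().split()
    return {STEMS.get(w, w) for w in words if w not in STOP_WORDS}


docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
doc_terms = {name: index_terms(text) for name, text in docs.items()}

# Binary term-document matrix: 1 if the term occurs in the document, else 0.
matrix = {name: [int(t in terms) for t in TERMS]
          for name, terms in doc_terms.items()}


def retrieve(query, mode):
    """Return documents matching all (AND) or any (OR) of the query terms."""
    q = index_terms(query)
    if mode == "AND":
        return {name for name, terms in doc_terms.items() if q <= terms}
    return {name for name, terms in doc_terms.items() if q & terms}


for name in docs:
    print(name, matrix[name])
print(sorted(retrieve("gold silver truck", "AND")))
print(sorted(retrieve("gold silver truck", "OR")))
```

The same `retrieve` call can be used for queries (a) to (c) by passing, for example, `"gold delivery"` with either mode.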
Next

On Vector Space Model

48
