Module 7: Communicating, Perceiving and Acting
- Information Retrieval
Information retrieval
IR is concerned with representing, searching and manipulating
large collections of electronic text and other human-language data.
IR is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections.
First, nomenclature…
Information retrieval (IR)
Focus on textual information (= text/document retrieval)
Other possibilities include image, video, music, …
What do we search?
Generically, “collections”
What do we find?
Generically, “documents”
Even though we may be referring to web pages, PDFs, PowerPoint slides,
paragraphs, etc.
Basic Terms
Corpus: a large repository of documents stored on a computer.
Information need: a topic about which the user wants to find information.
Relevance: a document is relevant if it contains information that satisfies
the user's information need.
The user expresses the information need in terms of a query.
Information Retrieval
Information Retrieval Cycle
[Figure: the cycle runs from source selection and resource selection through query formulation, search, selection of results, examination of documents, and delivery, with feedback loops for system discovery, vocabulary discovery, concept discovery, document discovery, and information source reselection.]
The Central Problem in Search
The author and the searcher each start from concepts: the searcher expresses them as query terms
("tragic love story"), while the author expresses them as document terms ("fateful star-crossed romance").
Do these represent the same concepts?
Abstract IR Architecture
[Figure: queries (online) and documents (offline, acquired by, e.g., web crawling) each pass through a representation function; document representations are stored in an index, the query representation is matched against that index, and the system returns hits.]
How do we represent text?
Remember: computers don’t “understand” anything!
“Bag of words”
Treat all the words in a document as index terms
Assign a “weight” to each term based on “importance”
(or, in simplest case, presence/absence of word)
Disregard order, structure, meaning, etc. of the words
Simple, yet effective!
Assumptions
Term occurrence is independent
Document relevance is independent
“Words” are well-defined
What’s a word?
(Chinese) 天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。
(Arabic) وقال مارك ريجيف الناطق باسم الخارجية الإسرائيلية إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.
(Russian) Выступая в Мещанском суде Москвы, экс-глава ЮКОСа заявил, что не совершал ничего противозаконного, в чём обвиняет его генпрокуратура России.
(Hindi) भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है।
(Japanese) 日米連合で台頭中国に対処…アーミテージ前副長官提言
(Korean) 조재영 기자 = 서울시는 25일 이명박 시장이 '행정중심복합도시' 건설안에 대해 '군대라도 동원해 막고 싶은 심정'이라고 말했다는 일부 언론의 보도를 부인했다.
Sample Document
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no.
"It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
…
"Bag of Words" (term counts for the document above):
14 × McDonalds
12 × fat
11 × fries
8 × new
7 × french
6 × company, said, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…
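As a minimal sketch of how such term counts can be produced (the tokenizer and sample text below are my own simplification, not code from the slides):

from collections import Counter
import re

def bag_of_words(text):
    # Lowercase and split on letter runs; real systems use proper tokenizers and stopword lists.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

doc = "McDonald's is cutting the amount of bad fat in its french fries, the fast-food chain said Tuesday."
print(bag_of_words(doc).most_common(5))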
Information retrieval models
An IR model governs how a document and a query are
represented and how the relevance of a document to
a user query is defined.
Main models:
Boolean model
Vector space model
Probabilistic model
etc.
Boolean Retrieval model
It is an exact-match retrieval model.
Each document or query is treated as a "bag" of words or terms. Word sequence is not considered.
Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinctive words/terms in the collection. V is called the vocabulary.
A weight wij > 0 is associated with each term ti of a document dj ∈ D. For a term that does not appear in document dj, wij = 0.
dj = (w1j, w2j, ..., w|V|j)
Two possible outcomes of query evaluation: True or False.
Boolean Retrieval model
Query terms are combined logically using the Boolean
operators AND, OR, and NOT.
E.g., ((data AND mining) AND (NOT text))
Retrieval
Given a Boolean query, the system retrieves every
document that makes the query logically true.
Called exact match.
The retrieval results are usually quite poor because
term frequency is not considered.
Boolean Retrieval model: Example
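The example on this slide was a figure in the original; as a stand-in, here is a minimal sketch of exact-match Boolean retrieval over an inverted index (the toy documents and variable names are my own):

# Build an inverted index, then evaluate AND / OR / NOT as set operations on posting lists.
docs = {
    1: "data mining and knowledge discovery",
    2: "text mining of web data",
    3: "mining gold data in California",
}

index = {}                                  # term -> set of document ids
for doc_id, text in docs.items():
    for term in set(text.lower().split()):
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    return index.get(term, set())

# Query: ((data AND mining) AND (NOT text))
result = (postings("data") & postings("mining")) - postings("text")
print(sorted(result))                       # -> [1, 3]; all matches are "equally good"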
Boolean queries: Exact match
• The Boolean retrieval model allows the user to ask any
query that is a Boolean expression:
– Boolean Queries are queries using AND, OR and NOT to
join query terms
Views each document as a set of words
Is precise: document matches condition or not.
– Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight
Strengths and Weaknesses
Strengths
Precise, if you know the right strategies
Precise, if you have an idea of what you’re looking for
Implementations are fast and efficient
Weaknesses
Users must learn Boolean logic
Boolean logic insufficient to capture the richness of language
No control over size of result set: either too many hits or none
When do you stop reading? All documents in the result set are considered
“equally good”
What about partial matches? Documents that “don’t quite match” the query may
be useful also
Vector Space Model
[Figure: document vectors d1–d5 plotted in a space whose axes are terms t1, t2, t3; angles such as θ and φ between vectors measure how close they are.]
Assumption: documents that are "close together" in vector space are about similar things.
Therefore, retrieve documents based on how close the document is to the
query (i.e., similarity ~ "closeness").
Vector Space Model
In this model, documents and queries are represented as vectors in a t-dimensional
vector space, where t is the number of index terms (words, phrases, etc.).
A document dj and a query q are each represented by a vector of term weights:
dj = (w1j, w2j, ..., wtj) and q = (w1q, w2q, ..., wtq)
Vector space model
Documents are also treated as a “bag” of words or terms.
Each document is represented as a vector.
However, the term weights are no longer restricted to 0 or 1. Each term
weight is computed using some variation of the TF or TF-IDF scheme.
Term Frequency - Inverse Document Frequency (TF-IDF) is a
widely used statistical method in natural language
processing and information retrieval. It measures how
important a term is within a document relative to a
collection of documents
Term Weighting
Term weights consist of two components
Local: how important is the term in this document?
Global: how important is the term in the collection?
Here’s the intuition:
Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights
How do we capture this mathematically?
Term frequency (local)
Inverse document frequency (global)
Term Weighting: TF-IDF
wi,j = tfi,j × log(N / ni)
where:
wi,j = weight assigned to term i in document j
tfi,j = number of occurrences of term i in document j
N = number of documents in the entire collection
ni = number of documents containing term i
Vector space model: Example
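The worked examples here were figures in the original slides; as a hedged substitute, a minimal sketch (the toy corpus and helper names are my own) that computes weights with wi,j = tfi,j × log(N / ni) and ranks documents by cosine similarity to the query:

import math
from collections import Counter

docs = ["new york times", "new york post", "los angeles times"]
query = "new new times"

N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))      # ni: number of documents containing term i

def tfidf(text):
    counts = Counter(text.split())                          # tf of each term in this text
    return {t: c * math.log(N / df[t]) for t, c in counts.items() if t in df}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

q_vec = tfidf(query)
for d in docs:
    print(d, "->", round(cosine(q_vec, tfidf(d)), 3))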
Some formulas for Similarity
Let the document be D = (a1, ..., at) and the query be Q = (b1, ..., bt); sums run over all terms i.
Dot product: Sim(D, Q) = Σ (ai · bi)
Cosine: Sim(D, Q) = Σ (ai · bi) / ( sqrt(Σ ai²) · sqrt(Σ bi²) )
Dice: Sim(D, Q) = 2 Σ (ai · bi) / ( Σ ai² + Σ bi² )
Jaccard: Sim(D, Q) = Σ (ai · bi) / ( Σ ai² + Σ bi² − Σ (ai · bi) )
[Figure: document vector D and query vector Q plotted in a two-dimensional term space with axes t1 and t2.]
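A minimal sketch of the four measures above for two term-weight vectors (plain Python; the example weights are my own):

import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def dice(a, b):
    return 2 * dot(a, b) / (dot(a, a) + dot(b, b))

def jaccard(a, b):
    return dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))

D = [0.5, 0.8, 0.3]    # document term weights
Q = [1.0, 0.0, 0.7]    # query term weights
print(dot(D, Q), cosine(D, Q), dice(D, Q), jaccard(D, Q))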
PageRank Algorithm
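The slides give only the heading here; as a hedged, minimal sketch (the toy link graph and function name are my own), PageRank can be computed by power iteration with a damping factor:

# Each page's rank is spread evenly over its out-links; the damping factor d models
# a random jump to any page. Dangling pages simply lose their mass in this sketch.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += d * share
        rank = new
    return rank

# Toy web graph: A links to B and C, B links to C, C links back to A.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))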
Probabilistic Language Models
This model assigns a probability to every sentence in English in such a way
that more likely sentences (in some sense) get higher probability.
If you are unsure between two possible sentences, pick the higher probability
one.
A probabilistic language model computes a probability distribution over a
sequence of words, i.e., P(s) = P(w1, w2, ..., wN).
Use of Language Model
Sometimes you hear or read a sentence that is not clear, but using your
language model you can still recognize it with high accuracy despite the noisy
speech or visual input.
Language Model: Completion Prediction
The Language Model
A Simple Language Model Implementation
Probabilistic Language Models: N-gram
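The "simple language model implementation" and n-gram slides were figures in the original; as a hedged sketch (the toy corpus is invented), a bigram model estimates P(s) ≈ Π P(wi | wi-1) from counts:

from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>", "<s> I do not like green eggs and ham </s>"]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(prev, w):
    # Maximum-likelihood estimate: P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(sentence):
    words = sentence.split()
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(prev, w)
    return p

print(p_sentence("<s> I am Sam </s>"))     # about 0.111 on this toy corpus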
Evaluating IR
Precision is the proportion of retrieved results that are relevant.
Recall is the proportion of relevant documents that appear in the results.
ROC curve (there are several varieties): the standard form plots the
true positive rate against the false positive rate.
More “practical” for web: reciprocal rank of first relevant
result, or just “time to answer”
Accuracy = (TP + TN)/(TP + FP + FN + TN).
Question: An IR system returns 12 relevant documents and 10
irrelevant documents. There are a total of 25 relevant documents in
the collection. What is the precision of the system on this search, and
what is the recall?
Solution:
Precision = (Number of relevant items retrieved) / (Total number of
retrieved items)
Recall = (Number of relevant items retrieved) / (Total number of
relevant items)
Precision = 12 / 22 ≈ 0.55
Recall = 12 / 25 = 0.48
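A small sketch of the same calculation (numbers taken from the question above):

def precision(relevant_retrieved, retrieved):
    return relevant_retrieved / retrieved

def recall(relevant_retrieved, relevant_total):
    return relevant_retrieved / relevant_total

print(precision(12, 12 + 10))   # 12/22, about 0.545
print(recall(12, 25))           # 12/25 = 0.48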
Information Extraction
Goal: create database entries from docs.
Emphasis on massive data, speed, stylized
expressions
Regular expression grammars OK if stylized
enough
Cascaded finite-state transducers: successive stages of grouping
and structure-finding.
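A hedged sketch of the regular-expression approach mentioned above (the pattern and sample sentence are my own; real IE systems cascade many such stages):

import re

# One stylized pattern: stock quotes of the form "(MCD: down $0.54 to $23.22"
pattern = re.compile(r"\((?P<ticker>[A-Z]+): (?P<direction>up|down) \$(?P<change>[\d.]+) to \$(?P<price>[\d.]+)")

text = "Shares of McDonald's (MCD: down $0.54 to $23.22, Research) were lower Tuesday."
for m in pattern.finditer(text):
    print(m.groupdict())        # each match becomes a structured database record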
What is Natural Language Processing (NLP)?
The process of computer analysis of input provided in a human
language (natural language), and conversion of this input into
a useful form of representation.
The field of NLP is primarily concerned with getting computers to
perform useful and interesting tasks with human languages.
The field of NLP is secondarily concerned with helping us come
to a better understanding of human language.
Forms of Natural Language
The input/output of a NLP system can be:
written text
speech
We will mostly be concerned with written text (not speech).
To process written text, we need:
lexical, syntactic, semantic knowledge about the language
discourse information, real world knowledge
To process spoken language, we need everything
required to process written text, plus the challenges of
speech recognition and speech synthesis.
Components of NLP
Natural Language Understanding
Mapping the given input in the natural language into a useful representation.
Different levels of analysis are required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
Natural Language Generation
Producing output in the natural language from some internal representation.
Different levels of synthesis are required:
deep planning (what to say),
syntactic generation
NL understanding is much harder than NL generation, but both are hard.
Why is NL Understanding hard?
Natural language is extremely rich in form and structure, and
very ambiguous.
How to represent meaning,
Which structures map to which meaning structures.
One input can mean many different things. Ambiguity can be at
different levels.
Lexical (word level) ambiguity -- different meanings of words
Syntactic ambiguity -- different ways to parse the sentence
Interpreting partial information -- how to interpret pronouns
Contextual information -- context of the sentence may affect the
meaning of that sentence.
Many inputs can mean the same thing.
Interaction among components of the input is not clear.
NLU: Knowledge of Language
Phonology – concerns how words are related to the sounds that realize
them.
Morphology – concerns how words are constructed from more basic
meaning units called morphemes. A morpheme is the primitive unit of
meaning in a language.
Syntax – concerns how words can be put together to form correct sentences,
and determines what structural role each word plays in the sentence and
what phrases are subparts of other phrases.
Semantics – concerns what words mean and how these meanings combine in
sentences to form sentence meaning. The study of context-independent meaning.
Knowledge of Language (cont.)
Pragmatics – concerns how sentences are used in different situations
and how use affects the interpretation of the sentence.
Discourse – concerns how the immediately preceding sentences affect
the interpretation of the next sentence. For example, interpreting
pronouns and interpreting the temporal aspects of the information.
World Knowledge – includes general knowledge about the world. What
each language user must know about the other’s beliefs and goals.
Language and Intelligence
Turing Test
[Figure: a human judge converses by teletype with both a computer and a human.]
The human judge asks teletyped questions of the computer and the human.
The computer's job is to act like a human.
The human's job is to convince the judge that he or she is not a machine.
The computer is judged "intelligent" if it can fool the judge.
Judgment of intelligence is linked to appropriate answers to
questions from the system.
NLP - an inter-disciplinary Field
NLP borrows techniques and insights from several disciplines.
Linguistics: How do words form phrases and sentences?
What constrains the possible meanings of a sentence?
Computational Linguistics: How is the structure of
sentences identified? How can knowledge and reasoning
be modeled?
Computer Science: Algorithms for automata and parsers.
Engineering: Stochastic techniques for ambiguity resolution.
Psychology: What linguistic constructions are easy or difficult
for people to learn to use?
Philosophy: What is meaning, and how do words and
sentences acquire it?
Some Buzz-Words
NLP – Natural Language Processing
CL – Computational Linguistics
SP – Speech Processing
HLT – Human Language Technology
NLE – Natural Language Engineering
SNLP – Statistical Natural Language Processing
Other Areas:
Speech Generation, Text Generation, Speech Understanding,
Information Retrieval,
Dialogue Processing, Inference, Spelling Correction, Grammar
Correction,
Text Summarization, Text Categorization,
Some NLP Applications
Machine Translation – Translation between two natural languages.
See the Babel Fish translation system on AltaVista.
Information Retrieval – Web search (uni-lingual or multi-lingual).
Query Answering/Dialogue – Natural language interface with a
database system, or a dialogue system.
Report Generation – Generation of reports such as weather reports.
Some Small Applications –
Grammar Checking, Spell Checking, Spell Corrector
Natural Language Generation
NLG is the process of constructing natural language outputs from
non-linguistic inputs.