Information Retrieval and
Extraction
First, nomenclature…
• Information retrieval (IR)
• Focus on textual information (= text/document retrieval)
• Other possibilities include image, video, music, …
• What do we search?
• Generically, “collections”
• Less-frequently used, “corpora”
• What do we find?
• Generically, “documents”
• Even though we may be referring to web pages, PDFs, PowerPoint slides,
paragraphs, etc.
Information Retrieval Cycle
[Diagram: the cycle runs Source Selection → Query Formulation → Search → Selection → Examination → Delivery; the intermediate products are Resource, Query, Results, Documents, and Information; feedback arrows mark source reselection and system, vocabulary, concept, and document discovery]
The Central Problem in Search
[Diagram: the Searcher turns Concepts into Query Terms; the Author turns Concepts into Document Terms]
Do these represent the same concepts?
E.g., “tragic love story” vs. “fateful star-crossed romance”
Abstract IR Architecture
[Diagram: offline, documents are acquired (e.g., by web crawling) and fed through a representation function to build document representations stored in an index; online, the query is fed through a representation function to build a query representation; a comparison function matches the query representation against the index to produce hits]
How do we represent text?
• Remember: computers don’t “understand” anything!
• “Bag of words”
• Treat all the words in a document as index terms
• Assign a “weight” to each term based on “importance”
(or, in simplest case, presence/absence of word)
• Disregard order, structure, meaning, etc. of the words
• Simple, yet effective!
• Assumptions
• Term occurrence is independent
• Document relevance is independent
• “Words” are well-defined
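As a concrete sketch of this representation (the crude whitespace tokenizer and the example string are illustrative assumptions, not from the slides):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace: a deliberately crude tokenizer.
    # Order, structure, and meaning are all discarded; only counts remain.
    return Counter(text.lower().split())

doc = "new cooking oil cuts bad fat in fries and fries taste the same"
print(bag_of_words(doc))
# Counter({'fries': 2, 'new': 1, 'cooking': 1, 'oil': 1, ...})
```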
What’s a word?
天主教教宗若望保祿二世因感冒再度住進醫院。
這是他今年第二度因同樣的病因住院。 ‫باسم‬ ‫الناطق‬ - ‫ريجيف‬ ‫مارك‬ ‫وقال‬
‫قبل‬ ‫شارون‬ ‫إن‬ - ‫اإلسرائيلية‬ ‫الخارجية‬
‫بزيارة‬ ‫األولى‬ ‫للمرة‬ ‫وسيقوم‬ ‫الدعوة‬
‫المقر‬ ‫طويلة‬ ‫لفترة‬ ‫كانت‬ ‫التي‬ ،‫تونس‬
‫عام‬ ‫لبنان‬ ‫من‬ ‫خروجها‬ ‫بعد‬ ‫الفلسطينية‬ ‫التحرير‬ ‫لمنظمة‬ ‫الرسمي‬
1982 .
Выступая в Мещанском суде Москвы экс-глава ЮКОСа
заявил не совершал ничего противозаконного, в чем
обвиняет его генпрокуратура России.
भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास
दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है
日米連合で台頭中国に対処…アーミテージ前副長官提言
조재영 기자 = 서울시는 25 일 이명박 시장이 ` 행정중심복합도시 ''
건설안에 대해 ` 군대라도 동원해 막고싶은 심정 '' 이라고 말했다는 일부
언론의 보도를 부인했다 .
Sample Document
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat
in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is
cutting the amount of "bad" fat in its french fries
nearly in half, the fast-food chain said Tuesday as it
moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't
taste the same? The company says no. "It's a win-win
for our customers because they are getting the same
great french-fry taste along with an even healthier
nutrition profile," said Mike Roberts, president of
McDonald's USA.
But others are not so sure. McDonald's will not
specifically discuss the kind of oil it plans to use, but
at least one nutrition expert says playing with the
formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD:
down $0.54 to $23.22, Research, Estimates) were
lower Tuesday afternoon. It was unclear Tuesday
whether competitors Burger King and Wendy's
International (WEN: down $0.80 to $34.91, Research,
Estimates) would follow suit. Neither company could
immediately be reached for comment.
…
14 × McDonalds
12 × fat
11 × fries
8 × new
7 × french
6 × company, said, nutrition
5 × food, oil, percent, reduce,
taste, Tuesday
…
“Bag of Words”
Information retrieval models
• An IR model governs how a document and a query are
represented and how the relevance of a document to
a user query is defined.
• Main models:
• Boolean model
• Vector space model
• Statistical language model
• etc.
Boolean model
• Each document or query is treated as a “bag” of
words or terms. Word sequence is not considered.
• Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinct words/terms in the collection. V is called the vocabulary.
• A weight wij > 0 is associated with each term ti of a
document dj ∈ D. For a term that does not appear in
document dj, wij = 0.
dj = (w1j, w2j, ..., w|V|j)
Boolean model (contd)
• Query terms are combined logically using the Boolean
operators AND, OR, and NOT.
• E.g., ((data AND mining) AND (NOT text))
• Retrieval
• Given a Boolean query, the system retrieves every
document that makes the query logically true.
• Called exact match.
• The retrieval results are usually quite poor because
term frequency is not considered.
Boolean queries: Exact match
• The Boolean retrieval model lets you ask a
query that is a Boolean expression:
– Boolean Queries are queries using AND, OR and NOT to join
query terms
• Views each document as a set of words
• Is precise: document matches condition or not.
– Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight
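A minimal sketch of exact-match Boolean retrieval over sets of words (the three-document toy collection is an assumption for illustration):

```python
docs = {
    "d1": {"data", "mining", "course"},
    "d2": {"text", "mining", "subfield", "data"},
    "d3": {"text", "interesting"},
}

def matches(terms):
    # ((data AND mining) AND (NOT text)) from the previous slide.
    return ("data" in terms and "mining" in terms) and "text" not in terms

hits = [doc_id for doc_id, terms in docs.items() if matches(terms)]
print(hits)  # ['d1'] -- exact match: a document is either in or out
```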
Strengths and Weaknesses
• Strengths
• Precise, if you know the right strategies
• Precise, if you have an idea of what you’re looking for
• Implementations are fast and efficient
• Weaknesses
• Users must learn Boolean logic
• Boolean logic insufficient to capture the richness of language
• No control over size of result set: either too many hits or none
• When do you stop reading? All documents in the result set are considered “equally
good”
• What about partial matches? Documents that “don’t quite match” the query may be
useful also
Vector Space Model
Assumption: Documents that are “close together” in
vector space “talk about” the same things
[Diagram: document vectors d1–d5 in a space whose axes are terms t1, t2, t3; the angles (θ, φ) between vectors measure how close documents are]
Therefore, retrieve documents based on how close the
document is to the query (i.e., similarity ~ “closeness”)
Similarity Metric
• Use the “angle” between the vectors:

$$\mathrm{sim}(d_j, d_k) = \cos(\theta) = \frac{d_j \cdot d_k}{\|d_j\|\,\|d_k\|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$$

• Or, more generally, inner products:

$$\mathrm{sim}(d_j, d_k) = d_j \cdot d_k = \sum_{i=1}^{n} w_{i,j}\, w_{i,k}$$
Vector space model
• Documents are also treated as a “bag” of words or terms.
• Each document is represented as a vector.
• However, the term weights are no longer 0 or 1. Each
term weight is computed based on some variations of TF
or TF-IDF scheme.
Term Weighting
• Term weights consist of two components
• Local: how important is the term in this document?
• Global: how important is the term in the collection?
• Here’s the intuition:
• Terms that appear often in a document should get high weights
• Terms that appear in many documents should get low weights
• How do we capture this mathematically?
• Term frequency (local)
• Inverse document frequency (global)
TF.IDF Term Weighting
$$w_{i,j} = \mathrm{tf}_{i,j} \cdot \log\frac{N}{n_i}$$

where
w_{i,j} — weight assigned to term i in document j
tf_{i,j} — number of occurrences of term i in document j
N — number of documents in the entire collection
n_i — number of documents containing term i
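A small sketch of this weighting in Python (the example counts are made up; base-10 logs match the running example later in this module):

```python
import math

def tf_idf(tf_ij, N, n_i):
    # w_ij = tf_ij * log(N / n_i): high when the term is frequent in this
    # document but rare in the collection.
    return tf_ij * math.log10(N / n_i)

# A term occurring 3 times in a document, in a collection of 1000 documents
# of which 10 contain the term:
print(tf_idf(3, 1000, 10))  # 3 * log10(100) = 6.0
```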
Retrieval in vector space model
• Query q is represented in the same way or slightly
differently.
• Relevance of di to q: Compare the similarity of query q
and document di.
• Cosine similarity (the cosine of the angle between the
two vectors)
• Cosine is also commonly used in text clustering
An Example
• A document space is defined by three terms:
• hardware, software, users
• the vocabulary
• A set of documents are defined as:
• A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
• A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
• A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the Query is “hardware and software”
• what documents should be retrieved?
An Example (cont.)
• In Boolean query matching:
• document A4, A7 will be retrieved (“AND”)
• retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
• In similarity matching (cosine):
• q=(1, 1, 0)
• S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
• S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
• S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
• Document retrieved set (with ranking)=
• {A4, A7, A1, A2, A5, A6, A8, A9}
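These scores are easy to verify with a few lines of Python (a sketch; only three of the nine documents are shown):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

q = (1, 1, 0)
docs = {"A1": (1, 0, 0), "A4": (1, 1, 0), "A7": (1, 1, 1)}
for name, d in docs.items():
    print(name, round(cosine(q, d), 2))  # A1 0.71, A4 1.0, A7 0.82
```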
Some formulas for Sim
Let D = (a_1, ..., a_n) and Q = (b_1, ..., b_n):

Dot product: $$\mathrm{Sim}(D, Q) = \sum_i a_i b_i$$

Cosine: $$\mathrm{Sim}(D, Q) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}$$

Dice: $$\mathrm{Sim}(D, Q) = \frac{2 \sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2}$$

Jaccard: $$\mathrm{Sim}(D, Q) = \frac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i}$$

[Diagram: document vector D and query vector Q plotted against term axes t1 and t2]
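All four measures in a short sketch (the example vectors D and Q are illustrative assumptions):

```python
def dot(a, b):     return sum(x * y for x, y in zip(a, b))
def norm_sq(a):    return sum(x * x for x in a)

def cosine(d, q):  return dot(d, q) / (norm_sq(d) ** 0.5 * norm_sq(q) ** 0.5)
def dice(d, q):    return 2 * dot(d, q) / (norm_sq(d) + norm_sq(q))
def jaccard(d, q): return dot(d, q) / (norm_sq(d) + norm_sq(q) - dot(d, q))

D, Q = [1, 2, 0], [2, 0, 1]
print(dot(D, Q), cosine(D, Q), dice(D, Q), jaccard(D, Q))
# 2, 0.4, 0.4, 0.25
```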
Vector Space Model
A Running Example
A Running Example
• Step 1 – Extract the text (i.e., strip punctuation)
This is a data
mining course.
We are studying
text mining. Text
mining is a
subfield of data
mining.
Mining text is
interesting, and I
am interested in
it.
This is a data mining course
We are studying text mining
Text mining is a subfield of
data mining
Mining text is interesting and
I am interested in it
A Running Example
• Step 2 – Remove stopwords
data mining course
studying text mining Text mining subfield data mining
Mining text interesting interested
A Running Example
• Step 3 – Convert all words to lower case
data mining course
studying text mining text mining subfield data mining
mining text interesting interested
A Running Example
• Step 4 – Stemming (studying → study, mining → mine, interesting/interested → interest)
data mine course
study text mine text mine subfield data mine
mine text interest interest
A Running Example
• Step 5 – Count the word frequencies
course×1, data×1, mine×1
data×1, mine×3, study×1, subfield×1, text×2
interest×2, mine×1, text×1
A Running Example
• Step 6 – Create an indexing file

ID  word      doc freq
1   course    1
2   data      2
3   interest  1
4   mine      3
5   study     1
6   subfield  1
7   text      2
A Running Example
• Step 7 – Create the vector space model (one component per term ID: course, data, interest, mine, study, subfield, text)
(1, 1, 0, 1, 0, 0, 0)
(0, 1, 0, 3, 1, 1, 2)
(0, 0, 2, 1, 0, 0, 1)
A Running Example
• Step 8 – Compute the inverse document frequency

$$IDF(word) = \log_{10}\frac{\text{total documents}}{\text{document frequency}}$$

ID  word      document frequency  IDF
1   course    1                   0.477
2   data      2                   0.176
3   interest  1                   0.477
4   mine      3                   0
5   study     1                   0.477
6   subfield  1                   0.477
7   text      2                   0.176
A Running Example
• Step 9 – Compute the weights of the words

$$w(word_i) = TF(word_i) \times IDF(word_i)$$

where TF(word_i) is the number of times word_i appears in the document.

(0.477, 0.176, 0, 0, 0, 0, 0)
(0, 0.176, 0, 0, 0.477, 0.477, 0.352)
(0, 0, 0.954, 0, 0, 0, 0.176)
A Running Example
• Step 10 – Normalize all documents to unit length

$$w(word_i) \leftarrow \frac{w(word_i)}{\sqrt{w(word_1)^2 + w(word_2)^2 + \cdots + w(word_n)^2}}$$

(0.938, 0.346, 0, 0, 0, 0, 0)
(0, 0.225, 0, 0, 0.611, 0.611, 0.450)
(0, 0, 0.983, 0, 0, 0, 0.181)
A Running Example
• Finally, we obtain the following unit-length document vectors:
(0.938, 0.346, 0, 0, 0, 0, 0)
(0, 0.225, 0, 0, 0.611, 0.611, 0.450)
(0, 0, 0.983, 0, 0, 0, 0.181)
• Everything has become structured!
• We can perform classification, clustering, etc.!
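The ten steps above collapse into a short script. A minimal sketch (the stopword list and the crude suffix-stripping stemmer are stand-ins chosen to reproduce this example, not the actual tools behind the slides):

```python
import math, re
from collections import Counter

DOCS = ["This is a data mining course.",
        "We are studying text mining. Text mining is a subfield of data mining.",
        "Mining text is interesting, and I am interested in it."]
STOP = {"this", "is", "a", "we", "are", "of", "and", "i", "am", "in", "it"}

def stem(w):
    # Crude suffix stripper: studying -> study, interested -> interest,
    # mining -> min -> mine (an assumption standing in for a real stemmer).
    for suf in ("ing", "ed"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            w = w[: -len(suf)]
    return "mine" if w == "min" else w

def terms(doc):
    # Steps 1-4: extract text, lowercase, remove stopwords, stem.
    return [stem(w) for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOP]

tfs = [Counter(terms(d)) for d in DOCS]                  # step 5
vocab = sorted({t for tf in tfs for t in tf})            # step 6
df = {t: sum(t in tf for tf in tfs) for t in vocab}
idf = {t: math.log10(len(DOCS) / df[t]) for t in vocab}  # step 8
for tf in tfs:                                           # steps 7, 9, 10
    w = [tf[t] * idf[t] for t in vocab]
    n = math.sqrt(sum(x * x for x in w)) or 1.0
    print([round(x / n, 3) for x in w])
# [0.938, 0.346, 0, 0, 0, 0, 0] etc., matching the slides.
```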
Unranked evaluation
• User happiness can only be measured by relevance to an information
need, not by relevance to queries.
[Figure-only slides: the relevance/retrieval contingency table (TP, FP, FN, TN)]
Accuracy
• Accuracy is the fraction of decisions (relevant/nonrelevant) that are
correct.
• In terms of the contingency table above, accuracy = (TP + TN)/(TP +
FP + FN + TN).
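A sketch of the computation (the counts are made up, skewed toward nonrelevant documents as is typical in IR, which is why accuracy alone can be misleading):

```python
def accuracy(tp, fp, fn, tn):
    # Fraction of relevant/nonrelevant decisions that are correct.
    return (tp + tn) / (tp + fp + fn + tn)

# E.g., 40 true positives, 10 false positives, 20 false negatives,
# 930 true negatives (almost everything in a collection is nonrelevant):
print(accuracy(40, 10, 20, 930))  # 0.97 -- high even for a mediocre system
```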
PAGE RANK ALGORITHM
[Figure-only slides illustrating the PageRank algorithm and its computation]
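Since the PageRank slides themselves are figures, here is a minimal sketch of the standard power-iteration form of the algorithm (the damping factor d = 0.85 and the toy link graph are conventional assumptions, not values taken from the slides):

```python
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to.
    Every page must appear as a key (no dangling-node handling here)."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}           # uniform start
    for _ in range(iters):
        new = {p: (1 - d) / len(pages) for p in pages}  # teleport term
        for p, outs in links.items():
            for q in outs:
                new[q] += d * pr[p] / len(outs)         # share p's rank over its out-links
        pr = new
    return pr

# Toy graph: A and B link to each other; C links into A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))
```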
Probabilistic Language Models
• Goal -- define probability distribution over set of strings
• Unigram, bigram, n-gram
• Count using corpus but need smoothing:
• add-one
• Linear interpolation
• Evaluate with Perplexity measure
• E.g., segmenting text written without spaces (“segmentwordswithoutspaces”) with Viterbi
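A sketch of a unigram model with add-one smoothing (the toy corpus is an assumption; a real model would also fix the vocabulary size in advance):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
V = len(counts)   # vocabulary size (here: just the observed types)
N = len(corpus)   # total tokens

def p_addone(word):
    # Add-one (Laplace) smoothing: unseen words get a small nonzero probability.
    return (counts[word] + 1) / (N + V)

print(p_addone("the"))  # (3 + 1) / (9 + 6) ≈ 0.267
print(p_addone("dog"))  # (0 + 1) / (9 + 6) ≈ 0.067
```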
PCFGs
• Rewrite rules have probabilities.
• Prob of a string is sum of probs of its parse trees.
• Context-freedom means no lexical constraints.
• Prefers short sentences.
Learning PCFGs
• Parsed corpus -- count trees.
• Unparsed corpus
• Rule structure known -- use EM (inside-outside algorithm)
• Rules unknown -- Chomsky normal form… problems.
Information Retrieval
• Goal: Google. Find docs relevant to user’s needs.
• An IR system has a document collection, a query in some language, a set of results, and a presentation of the results.
• Ideally, parse docs into knowledge base… too hard.
IR 2
• Boolean Keyword Model -- in or out?
• Problem -- single bit of “relevance”
• Boolean combinations a bit mysterious
• How to compute P(R=true | D,Q)?
• Estimate a language model for each doc; compute the prob of the query given the model.
• Can rank documents by P(r|D,Q)/P(~r|D,Q)
IR3
• For this, need model of how queries are related to docs. Bag of
words: freq of words in doc., naïve Bayes.
• Good example pp 842-843.
Evaluating IR
• Precision is the proportion of results that are relevant.
• Recall is the proportion of relevant docs that are in the results.
• ROC curve (there are several varieties): standard is to plot false
negatives vs. false positives.
• More “practical” for web: reciprocal rank of first relevant result, or
just “time to answer”
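A sketch of these measures over a ranked result list (the result list and relevance judgments are made up):

```python
def precision(results, relevant):
    return sum(r in relevant for r in results) / len(results)

def recall(results, relevant):
    return sum(r in relevant for r in results) / len(relevant)

def reciprocal_rank(results, relevant):
    # 1/rank of the first relevant result; 0 if none is retrieved.
    for rank, r in enumerate(results, start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

results = ["d7", "d2", "d9", "d4"]   # ranked system output
relevant = {"d2", "d4", "d5"}        # ground-truth relevant docs
print(precision(results, relevant))        # 2/4 = 0.5
print(recall(results, relevant))           # 2/3 ≈ 0.667
print(reciprocal_rank(results, relevant))  # first hit at rank 2 -> 0.5
```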
IR Refinements
• Case
• Stems
• Synonyms
• Spelling correction
• Metadata -- keywords
IR Presentation
• Give list in order of relevance, deal with duplicates
• Cluster results into classes
• Agglomerative
• K-means
• How to describe automatically-generated clusters? Word list? Title of the centroid doc?
IR Implementation
• CSC172!
• Lexicon with “stop list”
• “Inverted” index: where words occur
• Match with vectors: vector of word frequencies dotted with query terms.
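A sketch of the inverted-index idea (stop-list filtering and positional postings omitted; the toy documents reuse the running example's vocabulary):

```python
from collections import defaultdict

docs = {1: "data mining course",
        2: "text mining subfield of data mining",
        3: "mining text is interesting"}

index = defaultdict(set)  # word -> set of doc ids where it occurs
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# Query: documents containing both "data" and "mining" (set intersection).
print(index["data"] & index["mining"])  # {1, 2}
```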
Information Extraction
• Goal: create database entries from docs.
• Emphasis on massive data, speed, stylized expressions
• Regular expression grammars OK if stylized enough
• Cascaded Finite State Transducers -- stages of grouping and structure-finding
Machine Translation Goals
• Rough Translation (E.g. p. 851)
• Restricted Domain (mergers, weather)
• Pre-edited (Caterpillar or Xerox English)
• Literary Translation -- not yet!
• Interlingua -- or canonical semantic representation like Conceptual Dependency
• Basic problem: different languages, different categories
MT in Practice
• Transfer -- uses data base of rules for translating small units of
language
• Memory-based -- memorize sentence pairs
• Good diagram p. 853
Statistical MT
• Bilingual corpus
• Find most likely translation given corpus.
• Argmax_F P(F|E) = argmax_F P(E|F)P(F)
• P(F) is language model
• P(E|F) is translation model
• Lots of interesting problems: fertility (home vs. à la maison).
• Horribly drastic simplifications and hacks work pretty well!
Learning and MT
• Stat. MT needs: language model, fertility model, word choice model,
offset model.
• Millions of parameters
• Counting, estimation, EM.
What is Natural Language Processing (NLP)
• The process of computer analysis of input provided in a human
language (natural language), and conversion of this input into
a useful form of representation.
• The field of NLP is primarily concerned with getting computers to
perform useful and interesting tasks with human languages.
• The field of NLP is secondarily concerned with helping us come
to a better understanding of human language.
Forms of Natural Language
• The input/output of a NLP system can be:
• written text
• speech
• We will mostly be concerned with written text (not speech).
• To process written text, we need:
• lexical, syntactic, semantic knowledge about the language
• discourse information, real world knowledge
• To process spoken language, we need everything required
to process written text, plus the challenges of speech recognition
and speech synthesis.
Components of NLP
• Natural Language Understanding
• Mapping the given input in the natural language into a useful representation.
• Different levels of analysis are required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
• Producing output in the natural language from some internal representation.
• Different levels of synthesis are required:
deep planning (what to say),
syntactic generation
• NL Understanding is much harder than NL Generation. But still, both of them are
hard.
Why is NL Understanding hard?
• Natural language is extremely rich in form and structure, and very
ambiguous.
• How to represent meaning,
• Which structures map to which meaning structures.
• One input can mean many different things. Ambiguity can be at different levels.
• Lexical (word level) ambiguity -- different meanings of words
• Syntactic ambiguity -- different ways to parse the sentence
• Interpreting partial information -- how to interpret pronouns
• Contextual information -- context of the sentence may affect the meaning of that sentence.
• Many inputs can mean the same thing.
• Interaction among components of the input is not clear.
Knowledge of Language
• Phonology – concerns how words are related to the sounds that realize them.
• Morphology – concerns how words are constructed from more basic meaning
units called morphemes. A morpheme is the primitive unit of meaning in a language.
• Syntax – concerns how words can be put together to form correct sentences and
determines what structural role each word plays in the sentence and what phrases
are subparts of other phrases.
• Semantics – concerns what words mean and how these meanings combine in
sentences to form sentence meaning. The study of context-independent
meaning.
Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in different situations and
how use affects the interpretation of the sentence.
• Discourse – concerns how the immediately preceding sentences affect
the interpretation of the next sentence. For example, interpreting
pronouns and interpreting the temporal aspects of the information.
• World Knowledge – includes general knowledge about the world. What
each language user must know about the other’s beliefs and goals.
Models to Represent Linguistic Knowledge
• We will use certain formalisms (models) to represent the required
linguistic knowledge.
• State Machines -- FSAs, FSTs, HMMs, ATNs, RTNs
• Formal Rule Systems -- Context Free Grammars, Unification
Grammars, Probabilistic CFGs.
• Logic-based Formalisms -- first order predicate logic, some higher
order logic.
• Models of Uncertainty -- Bayesian probability theory.
Algorithms to Manipulate Linguistic
Knowledge
• We will use algorithms to manipulate the models of linguistic
knowledge to produce the desired behavior.
• Most of the algorithms we will study are transducers and parsers.
• These algorithms construct some structure based on their input.
• Since the language is ambiguous at all levels,
these algorithms are never simple processes.
• Most of the algorithms that will be used fall into the following categories:
• state space search
• dynamic programming
Language and Intelligence
Turing Test
[Diagram: a Human Judge converses with a hidden Computer and a hidden Human]
• The Human Judge asks tele-typed questions of the Computer and the Human.
• The Computer’s job is to act like a human.
• The Human’s job is to convince the Judge that he is not a machine.
• The Computer is judged “intelligent” if it can fool the judge
• Judgment of intelligence is linked to appropriate answers to questions from the system.
NLP- an inter-disciplinary Field
• NLP borrows techniques and insights from several disciplines.
• Linguistics: How do words form phrases and sentences? What constrains the
possible meanings for a sentence?
• Computational Linguistics: How is the structure of sentences identified?
How can knowledge and reasoning be modeled?
• Computer Science: Algorithms for automata, parsers.
• Engineering: Stochastic techniques for ambiguity resolution.
• Psychology: What linguistic constructions are easy or difficult for people to
learn to use?
• Philosophy: What is the meaning, and how do words and sentences acquire it?
Some Buzz-Words
• NLP – Natural Language Processing
• CL – Computational Linguistics
• SP – Speech Processing
• HLT – Human Language Technology
• NLE – Natural Language Engineering
• SNLP – Statistical Natural Language Processing
• Other Areas:
• Speech Generation, Text Generation, Speech Understanding, Information Retrieval,
• Dialogue Processing, Inference, Spelling Correction, Grammar Correction,
• Text Summarization, Text Categorization,
Some NLP Applications
• Machine Translation – Translation between two natural languages.
• See the Babel Fish translation system on Alta Vista.
• Information Retrieval – Web search (uni-lingual or multi-lingual).
• Query Answering/Dialogue – Natural language interface with a
database system, or a dialogue system.
• Report Generation – Generation of reports such as weather reports.
• Some Small Applications –
• Grammar Checking, Spell Checking, Spell Corrector
Brief History of NLP
• 1940s –1950s: Foundations
• Development of formal language theory (Chomsky, Backus, Naur, Kleene)
• Probabilities and information theory (Shannon)
• 1957 – 1970s:
• Use of formal grammars as basis for natural language processing (Chomsky, Kaplan)
• Use of logic and logic based programming (Minsky, Winograd, Colmerauer, Kay)
• 1970s – 1983:
• Probabilistic methods for early speech recognition (Jelinek, Mercer)
• Discourse modeling (Grosz, Sidner, Hobbs)
• 1983 – 1993:
• Finite state models (morphology) (Kaplan, Kay)
• 1993 – present:
• Strong integration of different techniques, different areas.
Natural Language Understanding
[Pipeline: Words → Morphological Analysis → morphologically analyzed words (another step: POS tagging) → Syntactic Analysis → syntactic structure → Semantic Analysis → context-independent meaning representation → Discourse Processing → final meaning representation]
Natural Language Generation
[Pipeline: meaning representation → Utterance Planning → meaning representations for sentences → Sentence Planning and Lexical Choice → syntactic structures of sentences with lexical choices → Sentence Generation → morphologically analyzed words → Morphological Generation → words]
Morphological Analysis
• Analyzing words into their linguistic components (morphemes).
• Morphemes are the smallest meaningful units of language.
cars → car+PLU
giving → give+PROG
geliyordum → gel+PROG+PAST+1SG (“I was coming”)
• Ambiguity: more than one alternative
flies → fly(VERB)+PROG or fly(NOUN)+PLU
adamı → adam+ACC (“the man”, accusative)
     or adam+P1SG (“my man”)
     or ada+P1SG+ACC (“my island”, accusative)
Morphological Analysis (cont.)
• Relatively simple for English. But for some languages such as Turkish, it is more difficult.
uygarlaştıramadıklarımızdanmışsınızcasına
uygar-laş-tır-ama-dık-lar-ımız-dan-mış-sınız-casına
uygar +BEC +CAUS +NEGABLE +PPART +PL +P1PL +ABL +PAST +2PL +AsIf
“(behaving) as if you are among those whom we could not civilize/cause to become civilized”
+BEC is “become” in English
+CAUS is the causative voice marker on a verb
+PPART marks a past participle form
+P1PL is the 1st person plural possessive marker
+2PL is 2nd person plural
+ABL is the ablative (from/among) case marker
+AsIf is a derivational marker that forms an adverb from a finite verb form
+NEGABLE is “not able” in English
• Inflectional and Derivational Morphology.
• Common tools: Finite-state transducers
Part-of-Speech (POS) Tagging
• Each word has a part-of-speech tag to describe its category.
• Part-of-speech tag of a word is one of major word groups
(or its subgroups).
• open classes -- noun, verb, adjective, adverb
• closed classes -- prepositions, determiners, conjunctions, pronouns, participles
• POS Taggers try to find POS tags for the words.
• Is duck a verb or a noun? (a morphological analyzer cannot make that decision).
• A POS tagger may make that decision by looking at the surrounding words.
• Duck! (verb)
• Duck is delicious for dinner. (noun)
Lexical Processing
• The purpose of lexical processing is to determine meanings of individual
words.
• The basic method is to look up the word in a database of meanings -- a lexicon
• We should also identify non-words such as punctuation marks.
• Word-level ambiguity -- words may have several meanings, and the
correct one cannot be chosen based solely on the word itself.
• bank in English
• yüz in Turkish
• Solution -- resolve the ambiguity on the spot by POS tagging (if
possible) or pass the ambiguity on to the other levels.
Syntactic Processing
• Parsing -- converting a flat input sentence into a hierarchical structure that
corresponds to the units of meaning in the sentence.
• There are different parsing formalisms and algorithms.
• Most formalisms have two main components:
• grammar -- a declarative representation describing the syntactic structure of
sentences in the language.
• parser -- an algorithm that analyzes the input and outputs its structural
representation (its parse) consistent with the grammar specification.
• CFGs are in the center of many of the parsing mechanisms. But they are
complemented by some additional features that make the formalism more
suitable to handle natural languages.
Semantic Analysis
• Assigning meanings to the structures created by syntactic analysis.
• Mapping words and structures to particular domain objects in a way
consistent with our knowledge of the world.
• Semantics can play an important role in selecting among competing
syntactic analyses and discarding illogical analyses.
• I robbed the bank -- is bank a river bank or a financial institution?
• We have to decide the formalisms which will be used in the meaning
representation.
Knowledge Representation for NLP
• Which knowledge representation will be used depends on the application --
Machine Translation, Database Query System.
• Requires the choice of representational framework, as well as the specific
meaning vocabulary (what are concepts and relationship between these
concepts -- ontology)
• Must be computationally effective.
• Common representational formalisms:
• first order predicate logic
• conceptual dependency graphs
• semantic networks
• Frame-based representations
Discourse
• Discourses are collections of coherent sentences (not arbitrary sets of sentences)
• Discourses also have hierarchical structures (similar to sentences)
• anaphora resolution -- to resolve referring expression
• Mary bought a book for Kelly. She didn’t like it.
• Does She refer to Mary or Kelly? -- possibly Kelly
• What does It refer to? -- the book
• Mary had to lie for Kelly. She didn’t like it.
• Discourse structure may depend on application.
• Monologue
• Dialogue
• Human-Computer Interaction
Natural Language Generation
• NLG is the process of constructing natural language outputs from
non-linguistic inputs.
• NLG can be viewed as the reverse process of NL understanding.
• An NLG system may have two main parts:
• Discourse Planner -- what will be generated; which sentences.
• Surface Realizer -- realizes a sentence from its internal representation.
• Lexical Selection -- selecting the correct words describing the
concepts.
Machine Translation
• Machine Translation -- converting a text in language A into the
corresponding text in language B (or speech).
• Different Machine Translation architectures:
• interlingua based systems
• transfer based systems
• How to acquire the required knowledge resources such as mapping
rules and bi-lingual dictionary? By hand or acquire them
automatically from corpora.
• Example Based Machine Translation acquires the required knowledge
(some of it or all of it) from corpora.

More Related Content

PDF
Information Retrieval and Map-Reduce Implementations
Jason J Pulikkottil
 
PPTX
Tdm information retrieval
KU Leuven
 
PPT
Cs583 info-retrieval
Borseshweta
 
PPTX
Chapter 1 Intro Information Rerieval.pptx
bekidea
 
PPTX
Information retrival system and PageRank algorithm
Rupali Bhatnagar
 
PPTX
Week14-Multimedia Information Retrieval.pptx
HasanulFahmi2
 
PPT
3392413.ppt information retreival systems
MARasheed3
 
Information Retrieval and Map-Reduce Implementations
Jason J Pulikkottil
 
Tdm information retrieval
KU Leuven
 
Cs583 info-retrieval
Borseshweta
 
Chapter 1 Intro Information Rerieval.pptx
bekidea
 
Information retrival system and PageRank algorithm
Rupali Bhatnagar
 
Week14-Multimedia Information Retrieval.pptx
HasanulFahmi2
 
3392413.ppt information retreival systems
MARasheed3
 

Similar to Information Retrieval and Extraction - Module 7 (20)

PDF
Information_Retrieval_Models_Nfaoui_El_Habib
El Habib NFAOUI
 
PDF
Information retrieval concept, practice and challenge
Gan Keng Hoon
 
PPT
4-IR Models_new.ppt
BereketAraya
 
PPT
4-IR Models_new.ppt
BereketAraya
 
PPTX
Text mining
Koshy Geoji
 
PPTX
Introduction to Information Retrieval (concepts and principles)
ImtithalSaeed1
 
PPT
Chapter 10 Data Mining Techniques
Houw Liong The
 
PPT
Copy of 10text (2)
Uma Se
 
PPT
chapter 5 Information Retrieval Models.ppt
KelemAlebachew
 
PPT
Slides
butest
 
PDF
Text databases and information retrieval
unyil96
 
PPT
Information Retrieval and Storage Systems
abduwasiahmed
 
PDF
Chapter 1: Introduction to Information Storage and Retrieval
captainmactavish1996
 
PPTX
Information retrieval introduction
nimmyjans4
 
PDF
Information Retrieval
rchbeir
 
PPT
Information Retrieval
ShujaatZaheer3
 
PDF
ICDIM 06 Web IR Tutorial [Compatibility Mode].pdf
siddiquitanveer1
 
PDF
Information retrieval systems irt ppt do
PonnuthuraiSelvaraj1
 
PPTX
IRT Unit_I.pptx
thenmozhip8
 
Information_Retrieval_Models_Nfaoui_El_Habib
El Habib NFAOUI
 
Information retrieval concept, practice and challenge
Gan Keng Hoon
 
4-IR Models_new.ppt
BereketAraya
 
4-IR Models_new.ppt
BereketAraya
 
Text mining
Koshy Geoji
 
Introduction to Information Retrieval (concepts and principles)
ImtithalSaeed1
 
Chapter 10 Data Mining Techniques
Houw Liong The
 
Copy of 10text (2)
Uma Se
 
chapter 5 Information Retrieval Models.ppt
KelemAlebachew
 
Slides
butest
 
Text databases and information retrieval
unyil96
 
Information Retrieval and Storage Systems
abduwasiahmed
 
Chapter 1: Introduction to Information Storage and Retrieval
captainmactavish1996
 
Information retrieval introduction
nimmyjans4
 
Information Retrieval
rchbeir
 
Information Retrieval
ShujaatZaheer3
 
ICDIM 06 Web IR Tutorial [Compatibility Mode].pdf
siddiquitanveer1
 
Information retrieval systems irt ppt do
PonnuthuraiSelvaraj1
 
IRT Unit_I.pptx
thenmozhip8
 
Ad

Recently uploaded (20)

PDF
Machine Learning All topics Covers In This Single Slides
AmritTiwari19
 
PDF
Introduction to Ship Engine Room Systems.pdf
Mahmoud Moghtaderi
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PPTX
MULTI LEVEL DATA TRACKING USING COOJA.pptx
dollysharma12ab
 
PDF
All chapters of Strength of materials.ppt
girmabiniyam1234
 
PPTX
sunil mishra pptmmmmmmmmmmmmmmmmmmmmmmmmm
singhamit111
 
PPTX
22PCOAM21 Session 1 Data Management.pptx
Guru Nanak Technical Institutions
 
PDF
CAD-CAM U-1 Combined Notes_57761226_2025_04_22_14_40.pdf
shailendrapratap2002
 
PPTX
IoT_Smart_Agriculture_Presentations.pptx
poojakumari696707
 
PDF
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
PPTX
22PCOAM21 Session 2 Understanding Data Source.pptx
Guru Nanak Technical Institutions
 
PPTX
MSME 4.0 Template idea hackathon pdf to understand
alaudeenaarish
 
PDF
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
PDF
Natural_Language_processing_Unit_I_notes.pdf
sanguleumeshit
 
PPTX
Chapter_Seven_Construction_Reliability_Elective_III_Msc CM
SubashKumarBhattarai
 
PDF
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
PPTX
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
PPT
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
PPTX
Online Cab Booking and Management System.pptx
diptipaneri80
 
PDF
20ME702-Mechatronics-UNIT-1,UNIT-2,UNIT-3,UNIT-4,UNIT-5, 2025-2026
Mohanumar S
 
Machine Learning All topics Covers In This Single Slides
AmritTiwari19
 
Introduction to Ship Engine Room Systems.pdf
Mahmoud Moghtaderi
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
MULTI LEVEL DATA TRACKING USING COOJA.pptx
dollysharma12ab
 
All chapters of Strength of materials.ppt
girmabiniyam1234
 
sunil mishra pptmmmmmmmmmmmmmmmmmmmmmmmmm
singhamit111
 
22PCOAM21 Session 1 Data Management.pptx
Guru Nanak Technical Institutions
 
CAD-CAM U-1 Combined Notes_57761226_2025_04_22_14_40.pdf
shailendrapratap2002
 
IoT_Smart_Agriculture_Presentations.pptx
poojakumari696707
 
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
22PCOAM21 Session 2 Understanding Data Source.pptx
Guru Nanak Technical Institutions
 
MSME 4.0 Template idea hackathon pdf to understand
alaudeenaarish
 
Biodegradable Plastics: Innovations and Market Potential (www.kiu.ac.ug)
publication11
 
Natural_Language_processing_Unit_I_notes.pdf
sanguleumeshit
 
Chapter_Seven_Construction_Reliability_Elective_III_Msc CM
SubashKumarBhattarai
 
top-5-use-cases-for-splunk-security-analytics.pdf
yaghutialireza
 
Civil Engineering Practices_BY Sh.JP Mishra 23.09.pptx
bineetmishra1990
 
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
Online Cab Booking and Management System.pptx
diptipaneri80
 
20ME702-Mechatronics-UNIT-1,UNIT-2,UNIT-3,UNIT-4,UNIT-5, 2025-2026
Mohanumar S
 
Ad

Information Retrieval and Extraction - Module 7

  • 2. First, nomenclature… • Information retrieval (IR) • Focus on textual information (= text/document retrieval) • Other possibilities include image, video, music, … • What do we search? • Generically, “collections” • Less-frequently used, “corpora” • What do we find? • Generically, “documents” • Even though we may be referring to web pages, PDFs, PowerPoint slides, paragraphs, etc.
  • 4. The Central Problem in Search Searcher Author Concepts Concepts Query Terms Document Terms Do these represent the same concepts? “tragic love story” “fateful star-crossed romance”
  • 5. Abstract IR Architecture Documents Query Hits Representation Function Representation Function Query Representation Document Representation Comparison Function Index offline online document acquisition (e.g., web crawling)
  • 6. How do we represent text? • Remember: computers don’t “understand” anything! • “Bag of words” • Treat all the words in a document as index terms • Assign a “weight” to each term based on “importance” (or, in simplest case, presence/absence of word) • Disregard order, structure, meaning, etc. of the words • Simple, yet effective! • Assumptions • Term occurrence is independent • Document relevance is independent • “Words” are well-defined
  • 7. What’s a word? 天主教教宗若望保祿二世因感冒再度住進醫院。 這是他今年第二度因同樣的病因住院。 ‫باسم‬ ‫الناطق‬ - ‫ريجيف‬ ‫مارك‬ ‫وقال‬ ‫قبل‬ ‫شارون‬ ‫إن‬ - ‫اإلسرائيلية‬ ‫الخارجية‬ ‫بزيارة‬ ‫األولى‬ ‫للمرة‬ ‫وسيقوم‬ ‫الدعوة‬ ‫المقر‬ ‫طويلة‬ ‫لفترة‬ ‫كانت‬ ‫التي‬ ،‫تونس‬ ‫عام‬ ‫لبنان‬ ‫من‬ ‫خروجها‬ ‫بعد‬ ‫الفلسطينية‬ ‫التحرير‬ ‫لمنظمة‬ ‫الرسمي‬ 1982 . Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России. भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है 日米連合で台頭中国に対処…アーミテージ前副長官提言 조재영 기자 = 서울시는 25 일 이명박 시장이 ` 행정중심복합도시 '' 건설안에 대해 ` 군대라도 동원해 막고싶은 심정 '' 이라고 말했다는 일부 언론의 보도를 부인했다 .
  • 8. Sample Document McDonald's slims down spuds Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. … 14 × McDonalds 12 × fat 11 × fries 8 × new 7 × french 6 × company, said, nutrition 5 × food, oil, percent, reduce, taste, Tuesday … “Bag of Words”
  • 9. Information retrieval models • An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined. • Main models: • Boolean model • Vector space model • Statistical language model.. • etc 9
  • 10. Boolean model • Each document or query is treated as a “bag” of words or terms. Word sequence is not considered. • Given a collection of documents D, let V = {t1, t2, ..., t| V|} be the set of distinctive words/terms in the collection. V is called the vocabulary. • A weight wij > 0 is associated with each term ti of a document dj ∈ D. For a term that does not appear in document dj, wij = 0. dj = (w1j, w2j, ..., w|V|j), 10
  • 11. Boolean model (contd) • Query terms are combined logically using the Boolean operators AND, OR, and NOT. • E.g., ((data AND mining) AND (NOT text)) • Retrieval • Given a Boolean query, the system retrieves every document that makes the query logically true. • Called exact match. • The retrieval results are usually quite poor because term frequency is not considered. 11
  • 12. Boolean queries: Exact match • The Boolean retrieval model is being able to ask a query that is a Boolean expression: – Boolean Queries are queries using AND, OR and NOT to join query terms • Views each document as a set of words • Is precise: document matches condition or not. – Perhaps the simplest model to build an IR system on • Primary commercial retrieval tool for 3 decades. • Many search systems you still use are Boolean: – Email, library catalog, Mac OS X Spotlight 12 Sec. 1.3
  • 13. Strengths and Weaknesses • Strengths • Precise, if you know the right strategies • Precise, if you have an idea of what you’re looking for • Implementations are fast and efficient • Weaknesses • Users must learn Boolean logic • Boolean logic insufficient to capture the richness of language • No control over size of result set: either too many hits or none • When do you stop reading? All documents in the result set are considered “equally good” • What about partial matches? Documents that “don’t quite match” the query may be useful also
  • 14. Vector Space Model Assumption: Documents that are “close together” in vector space “talk about” the same things t1 d2 d1 d3 d4 d5 t3 t2 θ φ Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
  • 15. Similarity Metric • Use “angle” between the vectors: • Or, more generally, inner products:          n i k i n i j i n i k i j i k j k j k j w w w w d d d d d d sim 1 2 , 1 2 , 1 , , ) , (     k j k j d d d d       ) cos(      n i k i j i k j k j w w d d d d sim 1 , , ) , (  
  • 16. Vector space model • Documents are also treated as a “bag” of words or terms. • Each document is represented as a vector. • However, the term weights are no longer 0 or 1. Each term weight is computed based on some variations of TF or TF-IDF scheme. 16
  • 17. Term Weighting • Term weights consist of two components • Local: how important is the term in this document? • Global: how important is the term in the collection? • Here’s the intuition: • Terms that appear often in a document should get high weights • Terms that appear in many documents should get low weights • How do we capture this mathematically? • Term frequency (local) • Inverse document frequency (global)
  • 18. TF.IDF Term Weighting i j i j i n N w log tf , ,   j i w , j i, tf N i n weight assigned to term i in document j number of occurrence of term i in document j number of documents in entire collection number of documents with term i
  • 19. Retrieval in vector space model • Query q is represented in the same way or slightly differently. • Relevance of di to q: Compare the similarity of query q and document di. • Cosine similarity (the cosine of the angle between the two vectors) • Cosine is also commonly used in text clustering 19
  • 20. An Example • A document space is defined by three terms: • hardware, software, users • the vocabulary • A set of documents are defined as: • A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1) • A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1) • A7=(1, 1, 1) A8=(1, 0, 1). A9=(0, 1, 1) • If the Query is “hardware and software” • what documents should be retrieved? 20
  • 21. An Example (cont.) • In Boolean query matching: • document A4, A7 will be retrieved (“AND”) • retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”) • In similarity matching (cosine): • q=(1, 1, 0) • S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0 • S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 • S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 • Document retrieved set (with ranking)= • {A4, A7, A1, A2, A5, A6, A8, A9} 21
  • 22. 22 Some formulas for Sim Dot product Cosine Dice Jaccard                   i i i i i i i i i i i i i i i i i i i i i i i i i i b a b a b a Q D Sim b a b a Q D Sim b a b a Q D Sim b a Q D Sim ) * ( ) * ( ) , ( ) * ( 2 ) , ( * ) * ( ) , ( ) * ( ) , ( 2 2 2 2 2 2 t1 t2 D Q
  • 23. Vector Space Model A Running Example
  • 24. 24 A Running Example • Step 1 – Extract text (i.e. no preposition) P. 24 This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. This is a data mining course We are studying text mining Text mining is a subfield of data mining Mining text is interesting and I am interested in it
  • 25. 25 A Running Example • Step 2 – Remove stopword P. 25 This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. This is a data mining course We are studying text mining Text mining is a subfield of data mining Mining text is interesting and I am interested in it
  • 26. 26 A Running Example • Step 3 – Convert all words to lower case P. 26 This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. This is a data mining course We are studying text mining Text mining is a subfield of data mining Mining text is interesting and I am interested in it text mining
  • 27. 27 A Running Example • Step 4 – Stemming P. 27 This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. This is a data mining course We are studying text mining Text mining is a subfield of data mining Mining text is interesting and I am interested in it study interest mine mine mine mine mine text interest
  • 28. 28 A Running Example • Step 5 – Count the word frequencies This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. This is a data mining course We are studying text mining Text mining is a subfield of data mining Mining text is interesting and I am interested in it study interest mine mine mine mine mine text interest coursex1, datax1, minex1 datax1, minex3, studyx1, subfieldx1, textx2 interestx2, minex1, textx1
  • 29. 29 ID word doc freq 1 course 1 2 data 2 3 interest 1 4 mine 3 5 study 1 6 subfield 1 7 text 2 A Running Example • Step 6 – Create an indexing file This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. This is a data mining course We are studying text mining Text mining is a subfield of data mining Mining text is interesting and I am interested in it study interest mine mine mine mine mine text interest coursex1, datax1, minex1 datax1, minex3, studyx1, subfieldx1, textx2 interestx2, minex1, textx1
  • 30. 30 A Running Example • Step 7 – Create the vector space model This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. This is a data mining course We are studying text mining Text mining is a subfield of data mining Mining text is interesting and I am interested in it study interest mine mine mine mine mine text interest coursex1, datax1, minex1 datax1, minex3, studyx1, subfieldx1, textx2 interestx2, minex1, textx1 (1, 1, 0, 1, 0, 0, 0) (0, 1, 0, 3, 1, 1, 2) (0, 0, 2, 1, 0, 0, 1) I D word docu ment freq uenc y 1 course 1 2 data 2 3 interest 1 4 mine 3 5 study 1 6 subfield 1 7 text 2
  • 31. 31 A Running Example • Step 8 – Compute the inverse document frequency This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. (1, 1, 0, 1, 0, 0, 0) (0, 1, 0, 3, 1, 1, 2) (0, 0, 2, 1, 0, 0, 1) frequency document documents total log ) (  word IDF ID word document frequency IDF 1 course 1 0.477 2 data 2 0.176 3 interest 1 0.477 4 mine 3 0 5 study 1 0.477 6 subfield 1 0.477 7 text 2 0.176
  • 32. 32 A Running Example • Step 9 – Compute the weights of the words P. 32 This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. (1, 1, 0, 1, 0, 0, 0) (0, 1, 0, 3, 1, 1, 2) (0, 0, 2, 1, 0, 0, 1) ID word document frequency IDF 1 course 1 0.477 2 data 2 0.176 3 interest 1 0.477 4 mine 3 0 5 study 1 0.477 6 subfield 1 0.477 7 text 2 0.176 document in the appears times of number ) ( ) ( ) ( ) ( i i i i i word word TF word IDF word TF word w    (0.477, 0.176, 0, 0, 0, 0, 0) (0, 0.176, 0, 0, 0.477, 0.477, 0.352) (0, 0, 0.954, 0, 0, 0, 0.176)
  • 33. 33 A Running Example • Step 10 – Normalize all documents to unit length This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. (1, 1, 0, 1, 0, 0, 0) (0, 1, 0, 3, 1, 1, 2) (0, 0, 2, 1, 0, 0, 1) ID word document frequency IDF 1 course 1 0.477 2 data 2 0.176 3 interest 1 0.477 4 mine 3 0 5 study 1 0.477 6 subfield 1 0.477 7 text 2 0.176 ) ( ) ( ) ( ) ( ) ( 2 2 2 1 2 n i i word w word w word w word w word w      (0.938, 0.346, 0, 0, 0, 0, 0) (0, 0.225 0, 0, 0.611, 0.611, 0.450) (0, 0, 0.983, 0, 0, 0, 0.181)
  • 34. 34 A Running Example • Finally, we obtain the following: • Everything become structural! • We can perform classification, clustering, etc!!!! P. 34 This is a data mining course. We are studying text mining. Text mining is a subfield of data mining. Mining text is interesting, and I am interested in it. (0.938, 0.346, 0, 0, 0, 0, 0) (0, 0.225 0, 0, 0.611, 0.611, 0.450) (0, 0, 0.983, 0, 0, 0, 0.181) ID word docume nt frequen cy IDF 1 course 1 0.477 2 data 2 0.176 3 interest 1 0.477 4 mine 3 0 5 study 1 0.477 6 subfield 1 0.477 7 text 2 0.176
  • 35. Unranked evaluation • User happiness can only be measured by relevance to an information need, not by relevance to queries.
  • 40. Accuracy • Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct. • In terms of the contingency table above, accuracy = (TP + TN)/(TP + FP + FN + TN).
  • 60. Probabilistic Language Models • Goal -- define probability distribution over set of strings • Unigram, bigram, n-gram • Count using corpus but need smoothing: • add-one • Linear interpolation • Evaluate with Perplexity measure • E.g. segmentwordswithoutspaces w/ Viterbi
  • 61. PCFGs • Rewrite rules have probabilities. • Prob of a string is sum of probs of its parse trees. • Context-freedom means no lexical constraints. • Prefers short sentences.
  • 62. Learning PCFGs • Parsed corpus -- count trees. • Unparsed corpus • Rule structure known -- use EM (inside-outside algorithm) • Rules unknown -- Chomsky normal form… problems.
  • 63. Information Retrieval • Goal: Google. Find docs relevant to user’s needs. • IR system has doc. Collection, query in some language, set of results, and a presentation of results. • Ideally, parse docs into knowledge base… too hard.
  • 64. IR 2 • Boolean Keyword Model -- in or out? • Problem -- single bit of “relevance” • Boolean combinations a bit mysterious • How compute P(R=true | D,Q)? • Estimate language model for each doc, computes prob of query given the model. • Can rank documents by P(r|D,Q)/P(~r|D,Q)
  • 65. IR3 • For this, need model of how queries are related to docs. Bag of words: freq of words in doc., naïve Bayes. • Good example pp 842-843.
  • 66. Evaluating IR • Precision is proportion of results that are relevant. • Recall is proportion of relevant docs that are in results • ROC curve (there are several varieties): standard is to plot false negatives vs. false positives. • More “practical” for web: reciprocal rank of first relevant result, or just “time to answer”
  • 67. IR Refinements • Case • Stems • Synonyms • Spelling correction • Metadata --keywords
  • 68. IR Presentation • Give list in order of relevance, deal with duplicates • Cluster results into classes • Agglomerative • K-means • How describe automatically-generated clusters? Word list? Title of centroid doc?
  • 69. IR Implementation • CSC172! • Lexicon with “stop list”, • “inverted” index: where words occur • Match with vectors: vectorof freq of words dotted with query terms.
  • 70. Information Extraction • Goal: create database entries from docs. • Emphasis on massive data, speed, stylized expressions • Regular expression grammars OK if stylized enough • Cascaded Finite State Transducers,,,stages of grouping and structure- finding
  • 71. Machine Translation Goals • Rough Translation (E.g. p. 851) • Restricted Doman (mergers, weather) • Pre-edited (Caterpillar or Xerox English) • Literary Translation -- not yet! • Interlingua-- or canonical semantic representation like Conceptual Dependency • Basic Problem != languages, != categories
  • 72. MT in Practice • Transfer -- uses data base of rules for translating small units of language • Memory -based. Memorize sentence pairs • Good diagram p. 853
  • 73. Statistical MT • Bilingual corpus • Find most likely translation given corpus. • Argmax_F P(F|E) = argmax_F P(E|F)P(F) • P(F) is language model • P(E|F) is translation model • Lots of interesting problems: fertility (home vs. a la maison). • Horrible drastic simplfications and hacks work pretty well!
  • 74. Learning and MT • Stat. MT needs: language model, fertility model, word choice model, offset model. • Millions of parameters • Counting , estimate, EM.
  • 75. What is Natural Language Processing (NLP) • The process of computer analysis of input provided in a human language (natural language), and conversion of this input into a useful form of representation. • The field of NLP is primarily concerned with getting computers to perform useful and interesting tasks with human languages. • The field of NLP is secondarily concerned with helping us come to a better understanding of human language.
  • 76. Forms of Natural Language • The input/output of a NLP system can be: • written text • speech • We will mostly concerned with written text (not speech). • To process written text, we need: • lexical, syntactic, semantic knowledge about the language • discourse information, real world knowledge • To process spoken language, we need everything required to process written text, plus the challenges of speech recognition and speech synthesis.
  • 77. Components of NLP • Natural Language Understanding • Mapping the given input in the natural language into a useful representation. • Different level of analysis required: morphological analysis, syntactic analysis, semantic analysis, discourse analysis, … • Natural Language Generation • Producing output in the natural language from some internal representation. • Different level of synthesis required: deep planning (what to say), syntactic generation • NL Understanding is much harder than NL Generation. But, still both of them are hard.
  • 78. Why NL Understanding is hard? • Natural language is extremely rich in form and structure, and very ambiguous. • How to represent meaning, • Which structures map to which meaning structures. • One input can mean many different things. Ambiguity can be at different levels. • Lexical (word level) ambiguity -- different meanings of words • Syntactic ambiguity -- different ways to parse the sentence • Interpreting partial information -- how to interpret pronouns • Contextual information -- context of the sentence may affect the meaning of that sentence. • Many input can mean the same thing. • Interaction among components of the input is not clear.
  • 79. Knowledge of Language • Phonology – concerns how words are related to the sounds that realize them. • Morphology – concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language. • Syntax – concerns how can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of other phrases. • Semantics – concerns what words mean and how these meaning combine in sentences to form sentence meaning. The study of context-independent meaning.
  • 80. Knowledge of Language (cont.) • Pragmatics – concerns how sentences are used in different situations and how use affects the interpretation of the sentence. • Discourse – concerns how the immediately preceding sentences affect the interpretation of the next sentence. For example, interpreting pronouns and interpreting the temporal aspects of the information. • World Knowledge – includes general knowledge about the world. What each language user must know about the other’s beliefs and goals.
  • 81. Models to Represent Linguistic Knowledge • We will use certain formalisms (models) to represent the required linguistic knowledge. • State Machines -- FSAs, FSTs, HMMs, ATNs, RTNs • Formal Rule Systems -- Context Free Grammars, Unification Grammars, Probabilistic CFGs. • Logic-based Formalisms -- first order predicate logic, some higher order logic. • Models of Uncertainty -- Bayesian probability theory.
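As a tiny illustration of the state-machine family listed above, here is a minimal deterministic FSA in Python. The "sheep language" /baa+!/ is a textbook standard example rather than something from these slides; the states and transitions are illustrative.

```python
# DFA accepting the "sheep language" baa+! (b, then two or more a's, then !).
TRANSITIONS = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3,
               (3, "a"): 3, (3, "!"): 4}
ACCEPT = {4}

def accepts(s: str) -> bool:
    state = 0
    for ch in s:
        if (state, ch) not in TRANSITIONS:   # no transition -> reject
            return False
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPT

print(accepts("baaa!"))  # True
print(accepts("ba!"))    # False
```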
  • 82. Algorithms to Manipulate Linguistic Knowledge • We will use algorithms to manipulate the models of linguistic knowledge to produce the desired behavior. • Most of the algorithms we will study are transducers and parsers. • These algorithms construct some structure based on their input. • Since language is ambiguous at all levels, these algorithms are never simple processes. • Most of these algorithms fall into two categories: • state space search • dynamic programming
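As a concrete instance of the dynamic-programming category, here is the standard minimum edit distance algorithm (used, for example, in spelling correction). This is generic textbook code with unit costs, not code taken from the course.

```python
def edit_distance(s: str, t: str) -> int:
    m, n = len(s), len(t)
    # d[i][j] = cheapest way to turn s[:i] into t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # i deletions
    for j in range(n + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # substitute / match
    return d[m][n]

print(edit_distance("intention", "execution"))    # 5
```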
  • 83. Language and Intelligence: the Turing Test (a human Judge converses by teletype with a Computer and a Human) • The Human Judge asks teletyped questions of the Computer and the Human. • The Computer’s job is to act like a human. • The Human’s job is to convince the Judge that he is not a machine. • The Computer is judged “intelligent” if it can fool the Judge. • The judgment of intelligence is linked to appropriate answers to questions from the system.
  • 84. NLP -- an Interdisciplinary Field • NLP borrows techniques and insights from several disciplines. • Linguistics: How do words form phrases and sentences? What constrains the possible meanings of a sentence? • Computational Linguistics: How is the structure of sentences identified? How can knowledge and reasoning be modeled? • Computer Science: Algorithms for automata and parsers. • Engineering: Stochastic techniques for ambiguity resolution. • Psychology: Which linguistic constructions are easy or difficult for people to learn and use? • Philosophy: What is meaning, and how do words and sentences acquire it?
  • 85. Some Buzzwords • NLP – Natural Language Processing • CL – Computational Linguistics • SP – Speech Processing • HLT – Human Language Technology • NLE – Natural Language Engineering • SNLP – Statistical Natural Language Processing • Other areas: Speech Generation, Text Generation, Speech Understanding, Information Retrieval, Dialogue Processing, Inference, Spelling Correction, Grammar Correction, Text Summarization, Text Categorization, …
  • 86. Some NLP Applications • Machine Translation – translation between two natural languages. • See the Babel Fish translation system on AltaVista. • Information Retrieval – web search (monolingual or multilingual). • Query Answering/Dialogue – a natural language interface to a database system, or a dialogue system. • Report Generation – generation of reports such as weather reports. • Some small applications: • Grammar Checking, Spell Checking, Spell Correction
  • 87. Brief History of NLP • 1940s –1950s: Foundations • Development of formal language theory (Chomsky, Backus, Naur, Kleene) • Probabilities and information theory (Shannon) • 1957 – 1970s: • Use of formal grammars as basis for natural language processing (Chomsky, Kaplan) • Use of logic and logic based programming (Minsky, Winograd, Colmerauer, Kay) • 1970s – 1983: • Probabilistic methods for early speech recognition (Jelinek, Mercer) • Discourse modeling (Grosz, Sidner, Hobbs) • 1983 – 1993: • Finite state models (morphology) (Kaplan, Kay) • 1993 – present: • Strong integration of different techniques, different areas.
  • 88. Natural Language Understanding (pipeline): Words → Morphological Analysis → morphologically analyzed words (another step: POS tagging) → Syntactic Analysis → syntactic structure → Semantic Analysis → context-independent meaning representation → Discourse Processing → final meaning representation
  • 89. Natural Language Generation (pipeline): meaning representation → Utterance Planning → meaning representations for sentences → Sentence Planning and Lexical Choice → syntactic structures of sentences with lexical choices → Sentence Generation → morphologically analyzed words → Morphological Generation → words
  • 90. Morphological Analysis • Analyzing words into their linguistic components (morphemes). • Morphemes are the smallest meaningful units of language: cars → car+PLU; giving → give+PROG; geliyordum → gel+PROG+PAST+1SG (“I was coming”) • Ambiguity: more than one alternative: flies → fly(NOUN)+PLU or fly(VERB)+3SG; adamı → adam+ACC “the man (accusative)”, adam+P1SG “my man”, or ada+P1SG+ACC “my island (accusative)”
  • 91. Morphological Analysis (cont.) • Relatively simple for English, but more difficult for some languages, such as Turkish: uygarlaştıramadıklarımızdanmışsınızcasına = uygar-laş-tır-ama-dık-lar-ımız-dan-mış-sınız-casına = uygar +BEC +CAUS +NEGABLE +PPART +PL +P1PL +ABL +PAST +2PL +AsIf -- “(behaving) as if you are among those whom we could not civilize/cause to become civilized” • +BEC is “become”; +CAUS is the causative voice marker on a verb; +NEGABLE is “not able”; +PPART marks a past participle form; +P1PL is the 1st person plural possessive marker; +ABL is the ablative (from/among) case marker; +2PL is 2nd person plural; +AsIf is a derivational marker that forms an adverb from a finite verb form • Inflectional and derivational morphology. • Common tools: finite-state transducers
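The toy analyzer below sketches the input/output behavior described above using simple suffix-stripping rules and a tiny hand-written lexicon. A real analyzer would be a finite-state transducer, as the slide notes; every rule and lexicon entry here is illustrative only.

```python
LEXICON = {"car": "NOUN", "give": "VERB", "fly": "NOUN/VERB"}
RULES = [("ies", "y", "+PLU"),    # noun plural: flies -> fly+PLU
         ("ies", "y", "+3SG"),    # verb 3sg:    flies -> fly+3SG
         ("ing", "e", "+PROG"),   # giving -> give+PROG
         ("s",   "",  "+PLU")]    # cars -> car+PLU

def analyze(word):
    analyses = []
    if word in LEXICON:                          # bare stem
        analyses.append(word)
    for suffix, repl, tag in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + repl
            if stem in LEXICON:                  # only keep real stems
                analyses.append(stem + tag)
    return analyses

print(analyze("cars"))    # ['car+PLU']
print(analyze("giving"))  # ['give+PROG']
print(analyze("flies"))   # ['fly+PLU', 'fly+3SG'] -- ambiguity passed on
```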
  • 92. Part-of-Speech (POS) Tagging • Each word has a part-of-speech tag describing its category. • The part-of-speech tag of a word is one of the major word groups (or its subgroups): • open classes -- noun, verb, adjective, adverb • closed classes -- prepositions, determiners, conjunctions, pronouns, participles • POS taggers try to find POS tags for the words. • Is duck a verb or a noun? (A morphological analyzer cannot make that decision.) • A POS tagger may make the decision by looking at the surrounding words: • Duck! (verb) • Duck is delicious for dinner. (noun)
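One standard way (not named on the slide) to let surrounding words decide the tag is a hidden Markov model decoded with the Viterbi algorithm. In this minimal sketch, the tag set is deliberately tiny and all probabilities are made-up toy numbers chosen so that the duck example resolves as described above.

```python
TAGS = ["NOUN", "VERB", "DET"]
START = {"NOUN": 0.2, "VERB": 0.5, "DET": 0.3}
TRANS = {"DET":  {"NOUN": 0.90, "VERB": 0.05, "DET": 0.05},
         "NOUN": {"NOUN": 0.20, "VERB": 0.60, "DET": 0.20},
         "VERB": {"NOUN": 0.20, "VERB": 0.20, "DET": 0.60}}
EMIT = {"NOUN": {"duck": 0.4, "dinner": 0.4},
        "VERB": {"duck": 0.3, "is": 0.5},
        "DET":  {"the": 0.9}}

def viterbi(words):
    # V[i][t] = probability of the best tag sequence for words[:i+1] ending in t
    V = [{t: START[t] * EMIT[t].get(words[0], 1e-6) for t in TAGS}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: V[-1][p] * TRANS[p][t])
            col[t] = V[-1][prev] * TRANS[prev][t] * EMIT[t].get(w, 1e-6)
            ptr[t] = prev
        V.append(col)
        back.append(ptr)
    tag = max(TAGS, key=lambda t: V[-1][t])      # best final tag, then backtrace
    tags = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        tags.append(tag)
    return tags[::-1]

print(viterbi(["duck"]))          # ['VERB'] -- imperative "Duck!"
print(viterbi(["the", "duck"]))   # ['DET', 'NOUN'] -- context flips the tag
```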
  • 93. Lexical Processing • The purpose of lexical processing is to determine the meanings of individual words. • The basic method is lookup in a database of meanings -- a lexicon. • We should also identify non-words such as punctuation marks. • Word-level ambiguity -- words may have several meanings, and the correct one cannot be chosen based solely on the word itself: • bank in English • yüz in Turkish • Solution -- resolve the ambiguity on the spot by POS tagging (if possible), or pass the ambiguity on to the other levels.
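A classic lightweight way to resolve such word-level ambiguity at lookup time is the simplified Lesk overlap heuristic (not named on the slide): pick the sense whose dictionary gloss shares the most words with the sentence. The two-sense lexicon below is a toy illustration.

```python
LEXICON = {
    "bank": [("bank/1", "financial institution that accepts deposits of money"),
             ("bank/2", "sloping land beside a river or lake")],
}

def lesk(word, sentence):
    context = set(sentence.lower().split())
    # Score each sense by word overlap between its gloss and the context
    def overlap(entry):
        return len(context & set(entry[1].split()))
    return max(LEXICON[word], key=overlap)[0]

print(lesk("bank", "I deposited money at the bank"))   # bank/1
print(lesk("bank", "We fished from the river bank"))   # bank/2
```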
  • 94. Syntactic Processing • Parsing -- converting a flat input sentence into a hierarchical structure that corresponds to the units of meaning in the sentence. • There are different parsing formalisms and algorithms. • Most formalisms have two main components: • grammar -- a declarative representation describing the syntactic structure of sentences in the language. • parser -- an algorithm that analyzes the input and outputs its structural representation (its parse) consistent with the grammar specification. • CFGs are at the center of many parsing mechanisms, but they are complemented by additional features that make the formalism more suitable for handling natural languages.
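As a concrete grammar-plus-parser pair, here is a minimal CKY recognizer over a toy context-free grammar in Chomsky normal form. The grammar and lexicon are illustrative, not from the course.

```python
GRAMMAR = [("S", ("NP", "VP")), ("NP", ("DET", "N")), ("VP", ("V", "NP"))]
LEXICON = {"the": {"DET"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

def cky(words):
    n = len(words)
    # table[i][j] = set of nonterminals that can span words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                   # split point
                for lhs, (b, c) in GRAMMAR:
                    if b in table[i][k] and c in table[k][j]:
                        table[i][j].add(lhs)
    return "S" in table[0][n]

print(cky("the dog saw the cat".split()))   # True
print(cky("dog the saw".split()))           # False
```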
  • 95. Semantic Analysis • Assigning meanings to the structures created by syntactic analysis. • Mapping words and structures to particular domain objects in a way consistent with our knowledge of the world. • Semantics can play an important role in selecting among competing syntactic analyses and discarding illogical ones. • I robbed the bank -- is bank a river bank or a financial institution? • We have to decide on the formalisms to be used for meaning representation.
  • 96. Knowledge Representation for NLP • Which knowledge representation is used depends on the application -- machine translation, database query system, … • Requires the choice of a representational framework, as well as a specific meaning vocabulary (what the concepts are and the relationships between them -- an ontology) • Must be computationally effective. • Common representational formalisms: • first-order predicate logic • conceptual dependency graphs • semantic networks • frame-based representations
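To make two of these formalisms concrete, here is how the sentence "John gave Mary a book" (an example of mine, not from the slides) might look in first-order predicate logic and in a frame-based representation, sketched as Python values.

```python
# First-order predicate logic, written as a string (neo-Davidsonian style):
fol = ("exists e, b. giving(e) & giver(e, John) & "
       "recipient(e, Mary) & given(e, b) & book(b)")

# Frame-based representation of the same event, as a nested dict:
frame = {
    "frame": "GIVING",
    "giver": "John",
    "recipient": "Mary",
    "given": {"frame": "BOOK", "determiner": "a"},
}

print(fol)
print(frame["recipient"])   # -> Mary
```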
  • 97. Discourse • Discourses are collections of coherent sentences (not arbitrary sets of sentences) • Discourses also have hierarchical structure (similar to sentences) • Anaphora resolution -- resolving referring expressions • Mary bought a book for Kelly. She didn’t like it. • Does she refer to Mary or Kelly? -- possibly Kelly • What does it refer to? -- the book • Mary had to lie for Kelly. She didn’t like it. • Discourse structure may depend on the application: • monologue • dialogue • human-computer interaction
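A naive baseline for anaphora resolution is to pick the most recent preceding mention that agrees with the pronoun; the sketch below applies it to the slide's example (the agreement labels are hand-assigned). Its shallowness is the point: recency happens to give "Kelly" here, but distinguishing the two "She didn't like it" sentences requires semantics and world knowledge.

```python
# Mentions from the example discourse, in order of occurrence,
# labeled with the pronoun form each could corefer with.
MENTIONS = [("Mary", "she"), ("a book", "it"), ("Kelly", "she")]

def resolve(pronoun):
    # Naive heuristic: the most recent agreeing mention wins
    for mention, agrees_with in reversed(MENTIONS):
        if agrees_with == pronoun:
            return mention
    return None

print(resolve("she"))   # 'Kelly' -- matches the slide's "possibly Kelly"
print(resolve("it"))    # 'a book'
```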
  • 98. Natural Language Generation • NLG is the process of constructing natural language outputs from non-linguistic inputs. • NLG can be viewed as the reverse of NL understanding. • An NLG system may have two main parts: • Discourse Planner -- decides what will be generated, i.e., which sentences • Surface Realizer -- realizes a sentence from its internal representation • Lexical Selection -- selecting the correct words to describe the concepts
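One simple way to build a surface realizer is with templates (a common baseline; the slide does not prescribe a method). The input frame format below is hypothetical and reuses the GIVING frame from the knowledge-representation sketch above.

```python
# One template per event type; a real system would have many more,
# plus morphological generation for agreement and tense.
TEMPLATES = {"GIVING": "{agent} gave {recipient} {theme}."}

def realize(frame):
    slots = {k: v for k, v in frame.items() if k != "event"}
    return TEMPLATES[frame["event"]].format(**slots)

print(realize({"event": "GIVING", "agent": "John",
               "recipient": "Mary", "theme": "a book"}))
# -> John gave Mary a book.
```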
  • 99. Machine Translation • Machine translation -- converting a text in language A into the corresponding text (or speech) in language B. • Different machine translation architectures: • interlingua-based systems • transfer-based systems • How do we acquire the required knowledge resources, such as mapping rules and a bilingual dictionary? By hand, or automatically from corpora. • Example-Based Machine Translation acquires the required knowledge (some or all of it) from corpora.
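Here is a minimal sketch of the retrieval step in example-based (memory-based) MT: find the stored sentence pair closest to the input and reuse its translation. The memory contents are toy data, and the output shows why real EBMT must also adapt the retrieved example rather than just copy it.

```python
import difflib

# Toy translation memory of source -> target sentence pairs.
MEMORY = {
    "the house is big": "la maison est grande",
    "the book is new": "le livre est nouveau",
}

def translate(sentence):
    # Retrieve the most similar stored source sentence
    best = difflib.get_close_matches(sentence, list(MEMORY), n=1, cutoff=0.0)[0]
    return MEMORY[best]

print(translate("the house is new"))
# -> "la maison est grande": the closest example wins, which is why
#    EBMT needs an adaptation step (grande -> nouvelle) on top of retrieval.
```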
