
Text Processing

Information Retrieval
Lecture 3

Lecture 3 Information Retrieval 1

Text Operations
Converting text to indexing terms
Goal: produce a set of indexing terms
that make the best use of resources

that will accurately match user query terms


Text Processing Steps
1. Lexical Analysis
2. Elimination of stopwords
3. Stemming
4. Selection of index terms
5. Building a thesaurus


Lexical Analysis
Converting byte stream to tokens
a.k.a. tokenization or lexing
Three ways to build your lexer
manually (in C or a scripting language)

use a generator such as lex or flex

use a special-purpose DFA generator

Handling of numbers and punctuation

should be tunable for the application
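A minimal sketch of a tunable tokenizer written in a scripting language. The function name `tokenize` and its `keep_numbers` and `lowercase` options are illustrative assumptions, not part of the lecture, and the pattern only recognizes ASCII letters and digits.

```python
import re

def tokenize(text, keep_numbers=True, lowercase=True):
    """Split text into tokens; number handling is left to the caller."""
    if lowercase:
        text = text.lower()
    # Runs of ASCII letters or digits are tokens; everything else is a separator.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if not keep_numbers:
        # Drop tokens that consist only of digits.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens
```

Calling `tokenize("Deaths from car accidents in 1989")` keeps `1989` as a token, while passing `keep_numbers=False` drops it. An application can pick whichever setting suits its collection.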


Lexing: Numbers and digits
Numbers need context
"deaths from car accidents in 1989"

{deaths, car, accidents, 1989}

{1989} could retrieve many irrelevant docs

However...
numbers do appear in user queries

rest of terms can give context

might be helped by using phrases


Lexing: Hyphens
Keep them?
query might use a non-hyphenated variant

end-of-line hyphens are noise

Throw them out?

can’t recognize a hyphenated term in a query

Two advanced solutions

index as phrase but allow partial matches

use proximity information
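One way to approximate "index as phrase but allow partial matches" is to emit both the parts of a hyphenated token and its joined form. The sketch below assumes that approach. Both helper names are hypothetical, and the end-of-line repair will also join words that were genuinely hyphenated at a line break.

```python
import re

def hyphen_variants(token):
    """Return index terms for one token: its parts, plus the joined form if hyphenated."""
    parts = [p for p in token.lower().split("-") if p]
    if len(parts) <= 1:
        return parts
    return parts + ["".join(parts)]

def join_line_break_hyphens(text):
    """Rejoin a word that was split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
```

With these terms in the index, a query for `web` still matches a document that contains `world-wide web`.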


Lexing: Punctuation
Obvious: segment on punctuation
But (like hyphens) punctuation can appear inside a single term:
"B.C.", "B.S.": without periods, these are just single letters
URLs as index terms?

Idea: look at the characters surrounding a period

followed by whitespace? end of sentence

not followed by whitespace? abbreviation
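The surrounding-character idea can be sketched as follows. The function names are assumptions made for illustration, and the rule is only the simple heuristic from the slide: a period followed by whitespace (or the end of input) ends a sentence, and any other period belongs to an abbreviation.

```python
def classify_period(text, i):
    """Classify the period at text[i] by the character that follows it."""
    following = text[i + 1] if i + 1 < len(text) else " "
    return "end_of_sentence" if following.isspace() else "abbreviation"

def tokenize_keeping_abbreviations(text):
    """Split on non-alphanumerics, but keep periods that sit inside a term."""
    tokens, current = [], []
    for i, ch in enumerate(text):
        if ch.isalnum():
            current.append(ch)
        elif ch == "." and current and classify_period(text, i) == "abbreviation":
            current.append(ch)  # interior period, such as the first one in "B.S."
        else:
            if current:
                tokens.append("".join(current))
                current = []
    if current:
        tokens.append("".join(current))
    return tokens
```

The heuristic has a visible weakness: the final period of "B.S." is followed by whitespace, so it is treated as a sentence end and the token comes out as `B.S`.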


Lexing: Markup
Nowadays, everything has markup
SGML, HTML, XML...

This information can be useful or not...

Some alternatives:
emit text appearing inside all or some tags

emit tags as tokens which can be interpreted by the indexer.
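A sketch of the first alternative, emitting text that appears inside all or some tags, using the standard library's `html.parser`. The class `TagTextExtractor` and the helper `text_inside` are hypothetical names. Real HTML with unclosed tags would need more careful handling than this.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect text inside a chosen set of tags (None means keep all text)."""

    def __init__(self, wanted_tags=None):
        super().__init__()
        self.wanted = set(wanted_tags) if wanted_tags is not None else None
        self.open_wanted = 0  # number of wanted tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.wanted is not None and tag in self.wanted:
            self.open_wanted += 1

    def handle_endtag(self, tag):
        if self.wanted is not None and tag in self.wanted and self.open_wanted > 0:
            self.open_wanted -= 1

    def handle_data(self, data):
        if self.wanted is None or self.open_wanted > 0:
            text = data.strip()
            if text:
                self.chunks.append(text)

def text_inside(html, wanted_tags=None):
    """Return the text chunks found inside wanted_tags, in document order."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Passing `["title"]` as `wanted_tags` would index only document titles. Passing nothing emits all text and discards the markup.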


Writing a lexer by hand
while ((c = getchar()) != EOF){
if (isalpha(c)) { …
Very fast! but
Error-prone

Hard to make it flexible or modular

Alternative: use a scripting language

Easier to describe text patterns

But can be hard to maintain


Using a DFA generator
Generalization of the hand-written lexer
Define a state machine
transitions occur on different character input

states define possible next steps

write a table, not a procedure

Program generates the lexer

Easier to maintain and debug!
(Frakes & Baeza-Yates ’92 have code)
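To illustrate "write a table, not a procedure", here is a small hand-built transition table driving a generic loop. This is not the Frakes & Baeza-Yates code and not the output of any generator. The states, character classes, and the name `dfa_lex` are assumptions chosen for the sketch.

```python
def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# (current state, character class) -> next state
TRANSITIONS = {
    ("START", "letter"): "WORD",
    ("START", "digit"): "NUMBER",
    ("START", "other"): "START",
    ("WORD", "letter"): "WORD",
    ("WORD", "digit"): "WORD",     # "mp3" stays one word
    ("WORD", "other"): "START",
    ("NUMBER", "digit"): "NUMBER",
    ("NUMBER", "letter"): "WORD",  # "3rd" is reclassified as a word
    ("NUMBER", "other"): "START",
}

def dfa_lex(text):
    """Run the table over text and return (token type, token) pairs."""
    state, buffer, tokens = "START", [], []
    for ch in text + " ":  # the trailing space flushes the last token
        next_state = TRANSITIONS[(state, char_class(ch))]
        if next_state == "START":
            if buffer:
                tokens.append((state, "".join(buffer)))
                buffer = []
        else:
            buffer.append(ch)
        state = next_state
    return tokens
```

Changing how the lexer treats digits or punctuation means editing entries in `TRANSITIONS`, while the loop in `dfa_lex` stays the same.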

Stop Words
the, of, and, a, in, to, is, for, with, are
take up a lot of space

retrieve all documents

don’t relate to information need

It’s easy to index something that appears everywhere
Removing stopwords can cause problems:
"to be or not to be" → {be}
"C" as a stop word would be trouble for a computer programming index!

Removing Stop Words
Start with a list of stop words
Table lookup
Make a table out of a static stoplist
Match each token against the table
Hashes, perfect hashing, tries
Build into the lexical analyzer (see F&BY)
Or take a statistical approach
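Both routes can be sketched briefly. The table lookup below uses a hash set. The stoplist holds the ten words listed earlier plus "or" and "not", which are added here as an assumption so that the "to be or not to be" example reduces to "be". The statistical function and its 0.8 threshold are likewise illustrative assumptions.

```python
from collections import Counter

# Ten words from the lecture, plus "or" and "not" (assumed for the example).
STOPLIST = frozenset(
    ["the", "of", "and", "a", "in", "to", "is", "for", "with", "are", "or", "not"]
)

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Table lookup: one hash-set membership test per token."""
    return [t for t in tokens if t not in stoplist]

def frequent_terms(documents, min_doc_fraction=0.8):
    """Statistical approach: terms found in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    threshold = min_doc_fraction * len(documents)
    return {term for term, df in doc_freq.items() if df >= threshold}
```

Applying `remove_stopwords` to the tokens of "to be or not to be" leaves only the two occurrences of "be", which is the problem case noted above.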


Stemming
Reduce variant word forms to a single
"stem" form
-'s, -ing, -ed, -s; in-, ad-, pre-, sub-, ...
Four approaches
table lookup - use a dictionary
successor variety - fancy suffix removal
affix removal - cut prefixes and suffixes
character n-grams (not really stemming)


Porter’s algorithm (1980)
Removes suffixes in five stages
Only one rule in each stage fires
Each depends on a suffix and the stem measure m
[C](VC)^m[V]

Stage 1a and b
          SSES -> SS    caresses -> caress
          IES  -> I     ponies -> poni
                        ties -> ti
          SS   -> SS    caress -> caress
          S    -> ø     cats -> cat
(m>0)     EED  -> EE    feed -> feed
                        agreed -> agree
(*v*)     ED   ->       plastered -> plaster
(*v*)     ING  ->       motoring -> motor
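The rules on this slide can be sketched in code as below. This covers only the step 1a and 1b rules shown here and assumes lowercase input. The complete algorithm includes a cleanup after ED and ING are removed, plus the later stages, all omitted. The function names are illustrative.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m in [C](VC)^m[V]: count vowel-run to consonant-run transitions."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m

def contains_vowel(stem):
    """The (*v*) condition: the stem has at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def step_1a(word):
    """Try the rules in order; only the first matching suffix fires."""
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]          # IES -> I
    if word.endswith("ss"):
        return word               # SS -> SS
    if word.endswith("s"):
        return word[:-1]          # S -> (nothing)
    return word

def step_1b(word):
    """EED, ED, ING rules with their conditions on the remaining stem."""
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    if word.endswith("ed"):
        return word[:-2] if contains_vowel(word[:-2]) else word
    if word.endswith("ing"):
        return word[:-3] if contains_vowel(word[:-3]) else word
    return word
```

In "feed" the stem before EED is "f", which has measure 0, so the word is left alone. In "agreed" the stem "agr" has measure 1, so the rule fires.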


Porter Errors (Krovetz 93)
Too eager                 Too cautious
organization/organ        european/europe
doing/doe                 matrices/matrix
policy/police             create/creation
university/universe       machine/machinery
negligible/negligent      explain/explanation
arm/army                  resolve/resolution
past/paste                triangle/triangular

Stems and roots
Stemmers are language specific
See the Snowball project
http://snowball.sourceforge.net/
for stemmers in other languages
Morphological analysis
reducing words to their linguistic roots
requires more sophisticated processing
Think about how this can affect the query

Character n-grams
Slide an n-character window through text
No stemming or stoplisting
May need to consider punctuation and
hyphens
Redundant tokens: good for noisy text
Less effective than word (stem) pairs in
clean text
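The sliding window is short enough to show in full. The function name `char_ngrams` and the default `n=3` are illustrative choices, not values from the lecture.

```python
def char_ngrams(text, n=3):
    """Slide an n-character window over text, one position at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

The redundancy is what helps with noisy text: "retrieval" and the misspelling "retreival" still share the trigrams `ret`, `etr`, and `val`, so a match on n-grams can survive the typo.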


Term Selection
Individual words
Adjacent word pairs (word n-grams)
Noun phrases
requires more sophisticated NLP
identify nouns along with adjectives and
adverbs in the same phrase
"computer science" and "world-wide web"

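A sketch of the first two term-selection options, individual words and adjacent word pairs. Noun-phrase selection needs a part-of-speech tagger or parser and is not attempted here. The names `word_pairs` and `index_terms` are hypothetical.

```python
def word_pairs(tokens):
    """Adjacent word pairs (word 2-grams), each joined with a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def index_terms(tokens):
    """Candidate index terms: the individual words followed by the pairs."""
    return list(tokens) + word_pairs(tokens)
```

For the tokens of "computer science department", the pairs are "computer science" and "science department". Only the first is a useful phrase, which is why the more selective noun-phrase approach exists.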

The Case for Complexity
User queries are only one or two words
The bag-of-words approach is too
simplistic given short queries
Using phrases, sophisticated handling for
numbers, etc. boosts the quality of that
first list of documents.


The Case for Simplicity
Query throughput is as important as (more important than?)
response quality
Disk is cheap
Complex processing takes too long
Easy to make a wrong decision
Feedback will improve the results


Simple or Complex?
Can look at it on two levels:
Does more sophisticated term
processing improve retrieval results?
... or ...
Does it enable a more sophisticated
interface for the user?


Designing with Filters
The UNIX philosophy: "do one thing and
do it well."
Filters read text input and produce text
output
can be linked together in pipes
can be simple (cut, nl) or complex (awk, perl)
Lexers are filters
You can have several in your toolbox
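The same idea can be mimicked inside one program with generators, each acting as a small filter over lines of text. This is a sketch under assumed names (`lowercase`, `one_token_per_line`, `pipeline`, `main`). Saved as a script with `main()` called at the bottom, it would read standard input and write standard output like any other filter.

```python
import sys

def lowercase(lines):
    """Filter: lowercase every line."""
    for line in lines:
        yield line.lower()

def one_token_per_line(lines):
    """Filter: split each line on whitespace and emit one token per line."""
    for line in lines:
        for token in line.split():
            yield token + "\n"

def pipeline(lines, *filters):
    """Chain filters left to right, like a shell pipe."""
    for f in filters:
        lines = f(lines)
    return lines

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Text in, text out: usable as one stage in a UNIX pipeline."""
    stdout.writelines(pipeline(stdin, lowercase, one_token_per_line))
```

Each stage does one thing, so a stopword filter or a stemmer can be added to the chain or left out without touching the other stages.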
