0% found this document useful (0 votes)
4 views22 pages

03Text Processing

Uploaded by

nivijune1306
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views22 pages

03Text Processing

Uploaded by

nivijune1306
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

Text Processing

Information Retrieval
Lecture 3

Lecture 3 Information Retrieval 1


Text Operations
Converting text to indexing terms
Goal: produce a set of indexing terms
that make the best use of resources


that will accurately match user query terms




Lecture 3 Information Retrieval 2


Text Processing Steps
1. Lexical Analysis
2. Elimination of stopwords
3. Stemming
4. Selection of index terms
5. Building a thesaurus

Lecture 3 Information Retrieval 3


Lexical Analysis
Converting byte stream to tokens
a.k.a tokenization or lexing
Three ways to build your lexer
manually (in C or a scripting language)


use a generator such as lex or flex




use a special-purpose DFA generator




Handling of numbers and punctuation


should be tunable for the application

Lecture 3 Information Retrieval 4


Lexing: Numbers and digits
Numbers need context
"deaths from car accidents in 1989"


{deaths, car, accidents, 1989}




{1989} could retrieve many irrelevant docs




However...
numbers do appear in user queries


rest of terms can give context




might be helped by using phrases




Lecture 3 Information Retrieval 5


Lexing: Hyphens
Keep them?
query might use a non-hyphenated variant


end-of-line hyphens are noise




Throw them out?


can’t recognize a hyphenated term in a query


Two advanced solutions


index as phrase but allow partial matches


use proximity information




Lecture 3 Information Retrieval 6


Lexing: Punctuation
Obvious: segment on puctuation
But (like hyphens) can appear inside a
single term:
"B.C.", "B.S.": without periods, these are just


single letters
URLs as index terms?


Idea: look at surrounding characters


whitespace? end of sentence


not whitespace? abbreviation




Lecture 3 Information Retrieval 7


Lexing: Markup
Nowadays, everything has markup
SGML, HTML, XML...


This information can be useful or not...




Some alternatives:
emit text appearing inside all or some tags


emit tags as tokens which can be interpreted




by the indexer.

Lecture 3 Information Retrieval 8


Writing a lexer by hand
while ((c = getchar()) != EOF){
if (isalpha(c)) { …
Very fast! but
Error-prone


Hard to make it flexible or modular




Alternative: use a scripting langauge


Easier to describe text patterns


But can be hard to maintain




Lecture 3 Information Retrieval 9


Using a DFA generator
Generalization of the hand-written lexer
Define a state machine
transitions occur on different character input


states define possible next steps




write a table, not a procedure




Program generates the lexer


Easier to maintain and debug!
(Frakes & Baeza-Yates ’92 have code)
Lecture 3 Information Retrieval 10
Stop Words
the, of, and, a, in, to, is, for, with, are
take up a lot of space


retrieve all documents




don’t relate to information need




It’s easy to index something that appears


everywhere
Removing stopwords can cause problems:
"to be or not to be" → {be}


"C" as a stop word would be trouble for a computer




programming index!
Lecture 3 Information Retrieval 11
Removing Stop Words
Start with a list of stop words
Table lookup
Make a table out of a static stoplist
Match each token against the table
Hashes, perfect hashing, tries
Build into the lexical analyzer (see F&BY)
Or take a statistical approach

Lecture 3 Information Retrieval 12


Stemming
Reduce variant word forms to a single
"stem" form
-'s, -ing, -ed, -s; in-, ad-, pre-, sub-, ...
Four approaches
table lookup - use a dictionary
successor variety - fancy suffix removal
affix removal - cut prefixes and suffixes
character n-grams (not really stemming)

Lecture 3 Information Retrieval 13


Porter’s algorithm (1980)
Stage 1a and b
SSES -> SS caresses -> caress
Removes suffixes in
five stages IES -> I ponies -> poni
ties -> ti
Only one rule in each
stage fires SS -> SS caress -> caress
Each depends on a S -> ø cats -> cat
suffix and the stem (m>0) feed -> feed
measure m EED->EE agreed -> agree
[C](VC)m[V] (*v*) ED-> plastered -> plaster

(*v*) ING-> motoring -> motor

Lecture 3 Information Retrieval 14


Porter Errors (Krovetz 93)
Too eager Too cautious
organization/organ european/europe
doing/doe matrices/matrix
policy/police create/creation
university/universe machine/machinery
negligible/negligent explain/explanation
arm/army resolve/resolution
past/paste triangle/triangular
Lecture 3 Information Retrieval 15
Stems and roots
Stemmers are language specific
See the Snowball project
http://snowball.sourceforge.net/
for stemmers in other languages
Morphological analysis
reducing words to their linguistic roots
requires more sophisticated processing
Think about how this can affect the query
Lecture 3 Information Retrieval 16
Character n-grams
Slide an n-character window through text
No stemming or stoplisting
May need to consider punctuation and
hyphens
Redundant tokens: good for noisy text
Less effective than word (stem) pairs in
clean text

Lecture 3 Information Retrieval 17


Term Selection
Individual words
Adjacent word pairs (word n-grams)
Noun phrases
requires more sophisticated NLP
identify nouns along with adjectives and
adverbs in the same phrase
"computer science" and "world-wide web"

Lecture 3 Information Retrieval 18


The Case for Complexity
User queries are only one or two words
The bag-of-words approach is too
simplistic given short queries
Using phrases, sophisticated handling for
numbers, etc. boosts the quality of that
first list of documents.

Lecture 3 Information Retrieval 19


The Case for Simplicity
Query throughput is as (more?) important
than quality responses
Disk is cheap
Complex processing takes too long
Easy to make a wrong decision
Feedback will improve the results

Lecture 3 Information Retrieval 20


Simple or Complex?
Can look at it on two levels:
Does more sophisticated term
processing improve retrieval results?
... or ...
Does it enable a more sophisticated
interface for the user?

Lecture 3 Information Retrieval 21


Designing with Filters
The UNIX philosophy: "do one thing and
do it well."
Filters read text input and produce text
output
can be linked together in pipes
can be simple (cut, nl) or complex (awk,perl)
Lexers are filters
You can have several in your toolbox

Lecture 3 Information Retrieval 22

You might also like