2 Text Processing
Tien-Lam Pham
Contents
Text processing
Regular Expression
Words and Corpora
Tokenization
Normalization: stemming and lemmatization
Edit Distance
Activities
Lab
Regular Expression
Regular expressions are case sensitive: lower case /s/ is distinct from upper case /S/ (/s/ matches a lower case s but not an upper case S). This means that the pattern /woodchucks/ will not match the string Woodchucks. We can solve this problem with the use of the square braces [ and ]. The string of characters inside the braces specifies a disjunction of characters to match. For example, the pattern /[wW]/ matches strings containing either w or W.
Pattern Matches
[wW]oodchuck Woodchuck, woodchuck
[1234567890] Any digit
Ranges [A-Z]
Pattern   Matches                 Example
[A-Z]     an upper case letter    Drenched Blossoms
[a-z]     a lower case letter     my beans were impatient
[0-9]     a single digit          Chapter 1: Down the Rabbit Hole
Activity
Negations [^Ss]
◦ Caret ^ means negation only when it is first in []
Pattern   Matches                     Example
[^A-Z]    not an upper case letter    Oyfn pripetchik
[^Ss]     neither ‘S’ nor ‘s’         I have no exquisite reason”
[^e^]     neither e nor ^             Look here
a^b       the literal pattern a^b     Look up a^b now
Regular Expressions: Disjunction
Woodchuck is another name for groundhog!
The pipe | for disjunction
Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
Regular Expressions: ? * + .

Pattern   Matches                       Example
colou?r   optional previous char        color colour
oo*h!     0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!      1 or more of previous char    oh! ooh! oooh! ooooh!
baa+      1 or more of previous char    baa baaa baaaa baaaaa
beg.n     any char between beg and n    begin begun begn beg3n

The * and + operators are called Kleene * and Kleene +, after Stephen C. Kleene.
Regular Expressions: Anchors ^ $
Pattern Matches
^[A-Z] Palo Alto
^[^A-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!
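These patterns carry over directly to Python's re module (written without the surrounding slashes); a quick sanity check of the examples above:

    import re

    # The slide patterns, tried in Python's re module.
    print(re.findall(r"[wW]oodchuck", "Woodchuck saw a woodchuck"))      # both cases match
    print(re.findall(r"[gG]roundhog|[Ww]oodchuck", "groundhog, Woodchuck"))  # disjunction
    print(re.search(r"colou?r", "colour"))        # ? makes the u optional
    print(re.findall(r"o+h!", "oh! ooh! oooh!"))  # + is one-or-more of the previous char
    print(re.search(r"^[A-Z]", "Palo Alto"))      # ^ anchors the match to the start
    print(re.search(r"\.$", "The end."))          # \. is a literal period, $ the end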
Activity
they lay back on the San Francisco grass and looked at the stars and their
Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample, how was it sampled? Was there consent? Pre-processing?
Plus: annotation process, language variety, demographics, etc.
Text processing basics
Counting word tokens in Shakespeare, sorted alphabetically (upper case sorts first):
1945 A
72 AARON
19 ABBESS
5 ABBOT
...
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot
...
More counting
Sorted by frequency:
23243 the
22225 i
18618 and
...
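The same kind of counts can be reproduced in a few lines of Python. A rough sketch (shakespeare.txt is a placeholder filename; the letters-only tokenization is deliberately crude):

    import collections
    import re

    # Crude tokenization: keep runs of letters, fold everything to lower case.
    with open("shakespeare.txt") as f:
        words = re.findall(r"[A-Za-z]+", f.read().lower())

    counts = collections.Counter(words)
    for word, n in counts.most_common(3):
        print(n, word)  # highest-frequency words first, e.g. the, i, and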
Issues in Tokenization
Figure 2.13: The token learner part of the BPE algorithm, which takes a corpus broken into individual characters or bytes and learns a vocabulary by iteratively merging tokens. Figure adapted from Bostrom and Durrett (2020).
Byte Pair Encoding (BPE) token learning
• BPE token learner
BPE Demonstration
Now the most frequent pair is (er, _), which we merge; our system has learned that there should be a token for word-final er, represented as er_ (with _ as the end-of-word symbol):

corpus                 vocabulary
5  l o w _             _, d, e, i, l, n, o, r, s, t, w, er, er_
2  l o w e s t _
6  n e w er_
3  w i d er_
2  n e w _
Next, n e (total count of 8) gets merged to ne:

corpus                 vocabulary
5  l o w _             _, d, e, i, l, n, o, r, s, t, w, er, er_, ne
2  l o w e s t _
6  ne w er_
3  w i d er_
2  ne w _
If we continue, the next merges are:
Merge        Current Vocabulary
(ne, w)      _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new
(l, o)       _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo
(lo, w)      _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low
(new, er_)   _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_
(low, _)     _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_, low_
Once we’ve learned our vocabulary, the token parser is used to tokenize a test sentence. The token parser just runs on the test data the merges we have learned from the training data, greedily, in the order we learned them. (Note that there can be ties; we could have instead chosen to merge r _ first, since that also has a frequency of 9.)
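The learner loop fits in a short Python sketch (the function names here are illustrative, not from any particular library). Run on the toy corpus above, it reproduces the merge sequence er, er_, ne, new, lo, low, newer_, low_:

    import collections

    def get_pair_counts(corpus):
        # corpus maps each word (a tuple of symbols) to its frequency.
        pairs = collections.Counter()
        for word, freq in corpus.items():
            for i in range(len(word) - 1):
                pairs[(word[i], word[i + 1])] += freq
        return pairs

    def merge_pair(pair, corpus):
        # Replace every adjacent occurrence of `pair` with one merged symbol.
        merged, new_corpus = pair[0] + pair[1], {}
        for word, freq in corpus.items():
            symbols, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                    symbols.append(merged)
                    i += 2
                else:
                    symbols.append(word[i])
                    i += 1
            new_corpus[tuple(symbols)] = freq
        return new_corpus

    # The toy corpus from the demonstration; "_" marks the end of a word.
    corpus = {
        tuple("low") + ("_",): 5,
        tuple("lowest") + ("_",): 2,
        tuple("newer") + ("_",): 6,
        tuple("wider") + ("_",): 3,
        tuple("new") + ("_",): 2,
    }
    for _ in range(8):  # the number of merges is a hyperparameter
        pairs = get_pair_counts(corpus)
        best = max(pairs, key=pairs.get)
        corpus = merge_pair(best, corpus)
        print(best, "->", best[0] + best[1])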
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Word Normalization
Case folding
Applications like IR: reduce all letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
◦ Fed vs. fed
◦ SAIL vs. sail
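A one-liner makes the case-folding trade-off concrete:

    # Naive case folding helps IR recall but collapses real distinctions.
    for word in ["Fed", "fed", "SAIL", "sail"]:
        print(word, "->", word.lower())
    # "Fed" (the central bank) and "fed" (past tense of feed) become identical.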
Porter Stemmer
The sample text:

This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things - names and heights and soundings - with the single exception of the red crosses and the written notes.

produces the following stemmed output:

Thi wa not the map we found in Billi Bone s chest but an accur copi complet in all thing name and height and sound with the singl except of the red cross and the written not

◦ Based on a series of rewrite rules run in series
◦ A cascade, in which the output of each pass is fed as input to the next pass

Some sample rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains a vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)

Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.), can be found on Martin Porter's homepage; see also the original paper (Porter, 1980). Simple stemmers can be useful in cases where we need to collapse across different variants of the same lemma.
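To experiment with the cascade, NLTK's PorterStemmer is handy (assuming the nltk package is installed; its variant may stem slightly differently from the 1980 paper):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["relational", "motoring", "grasses", "singles"]:
        print(word, "->", stemmer.stem(word))
    # The full cascade can go further than any single rule shown above,
    # e.g. relational -> relat rather than stopping at relate.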
Sentence Segmentation
!, ? mostly unambiguous but period “.” is very ambiguous
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: tokenize first, then use rules or ML to classify each period as either (a) part of the word or (b) a sentence boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules
based on this tokenization.
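A minimal rule-based splitter as a sketch (the tiny abbreviation list and number pattern are illustrative only; real systems use large dictionaries or a trained classifier):

    import re

    ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "etc."}  # tiny sample dictionary

    def split_sentences(text):
        sentences, current = [], []
        for tok in text.split():
            current.append(tok)
            if tok.endswith((".", "!", "?")):
                is_abbrev = tok.lower() in ABBREVIATIONS
                is_number = re.fullmatch(r"[\d.,%]+", tok) is not None
                if not is_abbrev and not is_number:
                    sentences.append(" ".join(current))
                    current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(split_sentences("Dr. Smith works at Acme Inc. in Boston. The fee was 0.02% of sales!"))
    # ['Dr. Smith works at Acme Inc. in Boston.', 'The fee was 0.02% of sales!']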
Minimum Edit Distance
Initialization:
D(i,0) = i
D(0,j) = j

Recurrence relation:
For each i = 1…M
  For each j = 1…N
    D(i,j) = min of:
      D(i-1,j) + 1                               (deletion)
      D(i,j-1) + 1                               (insertion)
      D(i-1,j-1) + 2 if X(i) ≠ Y(j), else + 0    (substitution)

Termination:
D(M,N) is the distance
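The recurrence translates directly into a dynamic-programming table; a sketch using the costs above (insertion and deletion 1, substitution 2):

    def min_edit_distance(source, target):
        # D[i][j] = edit distance between source[:i] and target[:j].
        m, n = len(source), len(target)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            D[i][0] = i  # D(i,0) = i
        for j in range(1, n + 1):
            D[0][j] = j  # D(0,j) = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if source[i - 1] == target[j - 1] else 2
                D[i][j] = min(D[i - 1][j] + 1,        # deletion
                              D[i][j - 1] + 1,        # insertion
                              D[i - 1][j - 1] + sub)  # substitution (or copy)
        return D[m][n]

    print(min_edit_distance("intention", "execution"))  # 8 with these costs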