0% found this document useful (0 votes)

0 views58 pages

2 Text Processing

The document provides an overview of basic text processing techniques in Natural Language Processing, including regular expressions, tokenization, and normalization methods like stemming and lemmatization. It discusses the importance of regex for pattern matching and introduces concepts like disjunctions and negation in regex. Additionally, it covers tokenization methods, including space-based and subword tokenization, and highlights the significance of corpora in linguistic analysis.

Uploaded by

nhatrang983

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

0 views58 pages

2 Text Processing

Uploaded by

nhatrang983

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 58

Natural Language Processing

Text Processing basic

Tien-Lam Pham
Contents

Text processing
Regular Expression
Word and Corpora
Tokenization
Normalization: stemming and lemmatization
Edit Distance
Activities
Lab
Regular Expression

A language for specifying text search strings (pattern

matching
APTER 2 An
• algebraic notation for
R EGULAR E XPRESSIONS characterizing
, T EXT a ,set
N ORMALIZATION E DITofD ISTANCE
strings

Regex Example Patterns Matched

/woodchucks/ “interesting links to woodchucks and lemurs”
/a/ “Mary Ann stopped by Mona’s”
/!/ “You’ve left the burglar behind again!” said Nori
Figure 2.1 Some simple regex searches.

case /S/ (/s/ matches a lower case s but not an upper case S). This means that
the pattern /woodchucks/ will not match the string Woodchucks. We can solve this
problem with the use of the square braces [ and ]. The string of characters inside the
braces specifies a disjunction of characters to match. For example, Fig. 2.2 shows
that the pattern /[wW]/ matches patterns containing either w or W.

Regex Match Example Patterns

/[wW]oodchuck/ Woodchuck or woodchuck “Woodchuck”
/[abc]/ ‘a’, ‘b’, or ‘c’ “In uomini, in soldati”
Regular Expressions: Disjunctions
Regular Expression: Disjunctions
Letters inside square brackets []

Pattern Matches
[wW]oodchuck Woodchuck, woodchuck
[1234567890] Any digit
Ranges [A-Z]
Pattern Matches
[A-Z] An upper case letter Drenched Blossoms
[a-z] A lower case letter my beans were impatient
[0-9] A single digit Chapter 1: Down the Rabbit Hole
Activity

Extract phone number from following text:

"The agent's phone number is 408-555-1234. Call soon!"
Regular Expressions: Negation in Disjunction
Regular Expression: Negation in Disjunctions

Negations [^Ss]
◦ Carat means negation only when first in []

Pattern Matches
[^A-Z] Not an upper case Oyfn pripetchik
letter
[^Ss] Neither ‘S’ nor ‘s’ I have no exquisite reason”
[^e^] Neither e nor ^ Look here
a^b The pattern a carat b Look up a^b now
Regular Expressions: More Dis
Regular Expression: Disjunctions
Woodchuck is another name for groundhog!
The pipe | for disjunction

Pattern Matches
groundhog|woodchuck woodchuck
yours|mine yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
Regular Expression
Regular Expressions: ? *+.
Pattern Matches
colou?r Optional color colour
previous char
oo*h! 0 or more of oh! ooh! oooh! ooooh!
previous char
o+h! 1 or more of oh! ooh! oooh! ooooh!
previous char
baa+ baa baaa baaaa baaaaa
Stephen C Kleene
beg.n begin begun begun beg3n
Kleene *, Kleene +
Regular Expressions: Anchors ^ $
Regular Expression: Anchors

Pattern Matches
^[A-Z] Palo Alto
^[^A-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!
Activity

Extract all word “the” or “The” from a text

Regular Expression: Identifiers
Regular Expression: Quantifiers
Capture Groups
Regular Expression: Capture Group
• Say we want to put angles around all numbers:
the 35 boxes à the <35> boxes
• Use parens () to "capture" a pattern into a
numbered register (1, 2, 3…)
• Use \1 to refer to the contents of the register
s/([0-9]+)/<\1>/
Regular Expression: Capture Group
Capture groups: multiple registers
/the (.*)er they (.*), the \1er we \2/
Matches
the faster they ran, the faster we ran
But not
the faster they ran, the faster we ate
ButRegular
suppose we don't
Expression: wantGroup
Capture to capture?
Parentheses have a double function: grouping terms, and
capturing
Non-capturing groups: add a ?: after paren:
/(?:some|a few) (people|cats) like some \1/
matches
◦ some cats like some cats
but not
◦ some cats like some some
Word in a Sentence
How many words in a sentence?
"I do uh main- mainly business data processing"
◦ Fragments, filled pauses
"Seuss’s cat in the hat is different from other cats!"
◦ Lemma: same stem, part of speech, rough word sense
◦ cat and cats = same lemma
◦ Wordform: the full inflected surface form
◦ cat and cats = different wordforms
How Word
manyinwords in a sentence?
a Sentence

they lay back on the San Francisco grass and looked at the stars
and their

Type: an element of the vocabulary.

Token: an instance of that type in running text.
How many?
◦ 15 tokens (or 14)
◦ 13 types (or 12) (or 11?)
glish corpora. The larger the corpora we look at, the more wo
d in fact this relationship between the number of types |V | and
HowHerdan’s
is called many words
Words in a in a corpus?
Corpus
Law (Herdan, 1960) or Heaps’ Law (Heap
overers (in linguistics
N = number of tokens and information retrieval respectively). It
where k and b are positive constants, and 0 < b < 1.
V = vocabulary = set of types, |V| is size of vocabulary
b where often .67 < β < .75
Heaps Law = Herdan's Law = |V | = kN
i.e., vocabulary size grows with > square root of the number of word tokens

Tokens = N Types = |V|

Switchboard phone conversations 2.4 million 20 thousand
Shakespeare 884,000 31 thousand
COCA 440 million 2 million
Google N-grams 1 trillion 13+ million
Corpora
Corpora

Words don't appear out of nowhere!

A text is produced by
• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific function.
Corpora vary along dimension like
Corpora

◦ Language: 7097 languages in the world

◦ Variety, like African American Language varieties.
◦ AAE Twitter posts might include forms like "iont" (I don't)
◦ Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! it was beautiful:) ]
H/E: dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
◦ Genre: newswire, fiction, scientific articles, Wikipedia
◦ Author Demographics: writer's age, gender, ethnicity, SES
CorpusCorpus
datasheets
datasets
Gebru et al (2020), Bender and Friedman (2018)

Motivation:
• Why was the corpus collected?
• By whom?
• Who funded it?
Situation: In what situation was the text written?
Collection process: If it is a subsample how was it sampled? Was
there consent? Pre-processing?
+Annotation process, language variety, demographics, etc.
Text processing basics

Tokenizing (segmenting) words

Normalizing word formats: Lematization and Stemming
Segmenting sentences
Space-based tokenization
Space-based tokenization

A very simple way to tokenize

◦ For languages that use space characters between words
◦ Arabic, Cyrillic, Greek, Latin, etc., based writing systems
◦ Segment off a token between instances of spaces
Unix tools for space-based tokenization
◦ The "tr" command
◦ Inspired by Ken Church's UNIX for Poets
◦ Given a text file, output the word tokens and their frequencies
Activity: tr linux command
Simple Tokenization in UNIX
(Inspired by Ken Church’s UNIX for Poets.)
Extract tokens and their frequency from a text file
Given a text file, output the word tokens and their frequencies
tr -sc ’A-Za-z’ ’\n’ < shakes.txt Change all non-alpha to newlines
| sort Sort in alphabetical order
| uniq –c Merge and count each type

1945 A
72 AARON
19 ABBESS
25 Aaron
5 ABBOT
6 Abate
... ... 1 Abates
5 Abbess
6 Abbey

More counting
3 Abbot
.... …

Merging upper and lower case

tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c

Sorting the counts

tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c | sort –n –r

23243 the
22225 i
18618 and
IssuesTokenization
in Tokenization
Issues

Can't just blindly remove punctuation:

◦ m.p.h., Ph.D., AT&T, cap’n
◦ prices ($45.55)
◦ dates (01/02/06)
◦ URLs (http://www.stanford.edu)
◦ hashtags (#nlproc)
◦ email addresses ([email protected])
Clitic: a word that doesn't stand on its own
◦ "are" in we're, French "je" in j'ai, "le" in l'honneur
When should multiword expressions (MWE) be words?
◦ New York, rock ’n’ roll
In practice, since tokenization needs to be run before any other language pro-
cessing, it needs to be very fast. The standard method for tokenization is therefore
to use deterministic algorithms based
Tokenization on regular
using NLTK expressions compiled into very ef-
ficient finite state automata. For example, Fig. 2.12 shows an example of a basic
Tokenization in NLTK
regular expression that can be used to tokenize with the nltk.regexp tokenize
function of the Python-based Natural Language Toolkit (NLTK) (Bird et al. 2009;
Bird, Loper and Klein (2009), Natural Language Processing with Python. O’Reilly
http://www.nltk.org).

>>> text = ’That U.S.A. poster-print costs $12.40...’

>>> pattern = r’’’(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"’?():-_‘] # these are separate tokens; includes ], [
... ’’’
>>> nltk.regexp_tokenize(text, pattern)
[’That’, ’U.S.A.’, ’poster-print’, ’costs’, ’$12.40’, ’...’]
Figure 2.12 A Python trace of regular expression tokenization in the NLTK Python-based
natural language processing toolkit (Bird et al., 2009), commented for readability; the (?x)
verbose flag tells Python to strip comments and whitespace. Figure from Chapter 3 of Bird
et al. (2009).
Tokenization
Word tokenization / segmentation
So in Chinese it's common to just treat each character
(zi) as a token.
• So the segmentation step is very simple
In other languages (like Thai and Japanese), more
complex word segmentation is required.
• The standard algorithms are neural sequence models
trained by supervised machine learning.
Vietnamese Tokenization
Learning for tokenizations
Another option for text tokenization
Instead of
• white-space segmentation
• single-character segmentation
Use the data to tell us how to tokenize.
Subword tokenization (because tokens can be parts
of words as well as whole words)
Subword tokenization
Learning for tokenizations

Three common algorithms:

◦ Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
◦ Unigram language modeling tokenization (Kudo, 2018)
◦ WordPiece (Schuster and Nakajima, 2012)
All have 2 parts:
◦ A token learner that takes a raw training corpus and induces
a vocabulary (a set of tokens).
◦ A token segmenter that takes a raw test sentence and
tokenizes it according to that vocabulary
Subword tokenization
Learning for tokenizations

Three common algorithms:

Let vocabulary be the set of all individual characters

= {A, B, C, D,…, a, b, c, d….}
Repeat:
◦ Choose the two symbols that are most frequently
adjacent in the training corpus (say 'A', 'B')
◦ Add a new merged symbol 'AB' to the vocabulary
◦ Replace every adjacent 'A' 'B' in the corpus with 'AB'.
Until k merges have been done.
BPE
Bytetoken learner(BPE)
Pair Encoding algorithm
token learning
2.4 • T EXT N ORMALIZATION

function B YTE - PAIR ENCODING (strings C, number of merges k) returns vocab V

V all unique characters in C # initial set of tokens is characters

for i = 1 to k do # merge tokens til k times
tL , tR Most frequent pair of adjacent tokens in C
tNEW tL + tR # make new token by concatenating
V V + tNEW # update the vocabulary
Replace each occurrence of tL , tR in C with tNEW # and update the corpus
return V

Figure 2.13 The token learner part of the BPE algorithm for taking a corpus broke
nto individual characters or bytes, and learning a vocabulary by iteratively merging tok
Figure adapted from Bostrom and Durrett (2020).

rom the training data, greedily, in the order we learned them. (Thus the frequen
Byte Pair Encoding (BPE) token learning

Byte Pair Encoding (BPE) Addendum

Most subword algorithms are run inside space-
separated tokens.
So we commonly first add a special end-of-word
symbol '__' before space in training corpus
Next, separate into letters.
BPE Demonstration: Learner

• BPE token
R EGULAR learner
E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE

Original (very fascinating!) corpus:

The algorithm is usually run inside words (not merging across word boundaries)
helow
inputlow
corpus
low is first
lowwhite-space-separated
low lowest lowest to give a set ofnewer
newer strings, each corre
newer
nding to the characters of a word, plus a special end-of-word symbol , and it
newer newer newer wider wider wider new new
unts. Let’s see its operation on the following tiny input corpus of 18 word token
h Add
countsend-of-word
for each word (the word low appears 5 times, the word
tokens, resulting in this vocabulary: newer 6 times
d so on), which would have a starting vocabulary of 11 letters:
corpus representation vocabulary
5 l o w , d, e, i, l, n, o, r, s, t, w
2 l o w e s t
6 n e w e r
3 w i d e r
2 n e w
The BPE algorithm first count all pairs of adjacent symbols: the most frequen
with counts
sponding to thefor each word
characters of a(the word
word, low
plus appearsend-of-word
a special 5 times, thesymbol
word newer, and6its
times,
and soLet’s
counts. on),see
which would have
its operation on athe
starting vocabulary
following of corpus
tiny input 11 letters:
of 18 word tokens
BPE token learner
with counts
and so on),
forBPE
corpus
5 whichl owould
Demonstration
each word (the word low vocabulary
appears 5 times, the word newer 6 times,
w have a starting vocabulary
, d, e, i, of 11
l,letters:
n, o, r, s, t, w
2
corpus l o w e s t vocabulary
5 6 l n o we w e r , d, e, i, l, n, o, r, s, t, w
2 3 l w o wi ed se tr
6 2 n n e we ew r
3 BPE
The w i algorithm
d e r first count all pairs of adjacent symbols: the most frequent
is the2 pairne er wbecause it occurs in newer (frequency of 6) and wider (frequency of
3)The
for BPE
a total
algorithm first count1 .allWe
of 9 occurrences then
pairs ofmerge these
adjacent symbols,
symbols: the treating er as one
most frequent
Merge
symbol, and e r
count to er
again:
is the pair e r because it occurs in newer (frequency of 6) and wider (frequency of
a total of 9 occurrences1 . Wevocabulary
3) for corpus then merge these symbols, treating er as one
symbol, 5 andlcount
o w again: , d, e, i, l, n, o, r, s, t, w, er
corpus
2 l o w e s t vocabulary
5 6 l no e w w er , d, e, i, l, n, o, r, s, t, w, er
2 3 l wo i w e
d sert
6 2 n ne e w er
w
3 w i d er
Now the most frequent pair is er , which we merge; our system has learned
2 n e w
that there should be a token for word-final er, represented as er :
Now the most frequent pair is er , which we merge; our system has learned
The BPE algorithm first count all pairs of adjacent symbols: the most frequent
The BPE algorithm first count all pairs of adjacent symbols: the most frequent
he pair e r because it occurs in newer (frequency of 6) and wider (frequency of
is the pair e r because it 1occurs in newer (frequency of 6) and wider (frequency of
BPE
for a total of 9 occurrences
BPE . We then merge these symbols, treating er as one
Demonstration
1
3) for a total of 9 occurrences . We then merge these symbols, treating er as one
mbol, and count again:
symbol, and count again:
corpus vocabulary
corpus vocabulary
5 l o w , d, e, i, l, n, o, r, s, t, w, er
5 l o w , d, e, i, l, n, o, r, s, t, w, er
2 l o w e s t
2 l o w e s t
6 n e w er
6 n e w er
3 w i d er
3 w i d er
2 n e w
2 n e w
Now the most frequent pair is er , which we merge; our system has learned
Merge er
Now the most _ to er_
frequent pair is er , which we merge; our system has learned
t there should be a token for word-final er, represented as er :
that there should be a token for word-final er, represented as er :
corpus vocabulary
corpus vocabulary
5 l o w , d, e, i, l, n, o, r, s, t, w, er, er
5 l o w , d, e, i, l, n, o, r, s, t, w, er, er
2 l o w e s t
2 l o w e s t
6 n e w er
3 6 n e
w i d er w er
2 3n e ww i d er
2 n e w
xt n e (total count of 8) get merged to ne:
Next n e (total count of 8) get merged to ne:
corpus vocabulary
5 corpus
l o w , d,vocabulary
e, i, l, n, o, r, s, t, w, er, er , ne
33 ww ii dd er er
22 nn ee ww

that
Now BPE
Now the
BPE
the most
most frequent
Demonstration
frequent pairpair isis er
er ,, which
which we we merge;
merge; ourour system
system has
has learned
learned
that there
there should
should bebe aa token
token for
for word-final
word-final er, er, represented
represented as er ::
as er
corpus
corpus vocabulary
vocabulary
55 ll oo ww ,, d,
d, e,
e, i,
i, l,
l, n,
n, o,
o, r,
r, s,
s, t,
t, w,
w, er,
er, er
er
22 ll oo ww ee ss tt
66 nn ee ww er er
33 ww ii dd er er
22 nn ee ww
Next nnMerge
Next ee (total n of
(total count
count e 8)to
of 8) getne
get merged
merged to to ne:
ne:
corpus
corpus vocabulary
vocabulary
55 ll oo ww ,, d,
d, e,
e, i,
i, l,
l, n,
n, o,
o, r,
r, s,
s, t,
t, w,
w, er, er ,, ne
er, er ne
22 ll oo ww ee ss tt
66 ne ne ww erer
33 ww ii dd er er
22 ne ne ww
IfIf we
we continue,
continue, the
the next
next merges
merges are:
are:
Merge
Merge Current
CurrentVocabulary
Vocabulary
(ne,
(ne, w)
w) ,,d,
d,e,
e,i,
i,l,
l,n,
n,o,
o,r,
r,s,
s,t,
t,w,
w,er, er ,,ne,
er,er ne,new
new
Next n e (total count of 8) get merged to ne:
corpusBPE Demonstration
vocabulary
5 l o w , d, e, i, l, n, o, r, s, t, w, er, er , ne
BPE
2 l o w e s t
6 ne w er
3 w i d er
The2next
nemerges
w are:
If we continue, the next merges are:
Merge Current Vocabulary
(ne, w) , d, e, i, l, n, o, r, s, t, w, er, er , ne, new
(l, o) , d, e, i, l, n, o, r, s, t, w, er, er , ne, new, lo
(lo, w) , d, e, i, l, n, o, r, s, t, w, er, er , ne, new, lo, low
(new, er ) , d, e, i, l, n, o, r, s, t, w, er, er , ne, new, lo, low, newer
(low, ) , d, e, i, l, n, o, r, s, t, w, er, er , ne, new, lo, low, newer , low
Once we’ve learned our vocabulary, the token parser is used to tokenize a t
sentence. The token parser just runs on the test data the merges we have learn
1 Note that there can be ties; we could have instead chosen to merge r first, since that also h
frequency of 9.
Word Normalization
Word Normalization
Putting words/tokens in a standard format
◦ U.S.A. or USA
◦ uhhuh or uh-huh
◦ Fed or fed
◦ am, is, be, are
Word Normalization

Case folding
Applications like IR: reduce all letters to lower case
◦ Since users tend to use lower case
◦ Possible exception: upper case in mid-sentence?
◦ e.g., General Motors
◦ Fed vs. fed
◦ SAIL vs. sail

For sentiment analysis, MT, Information extraction

◦ Case is helpful (US versus us is important)
Lemmatization
Lemmatization

Represent all words as their lemma, their shared root

= dictionary headword form:
◦ am, are, is ® be
◦ car, cars, car's, cars' ® car
◦ Spanish quiero (‘I want’), quieres (‘you want’)
® querer ‘want'
◦ He is reading detective stories
® He be read detective story
Lemmatization

Lemmatization is done by Morphological Parsing

Morphemes:
◦ The small meaningful units that make up words
◦ Stems: The core meaning-bearing units
◦ Affixes: Parts that adhere to stems, often with grammatical
functions
Morphological Parsers:
◦ Parse cats into two morphemes cat and s
◦ Parse Spanish amaren (‘if in the future they would love’) into
morpheme amar ‘to love’, and the morphological features
3PL and future subjunctive.
Stemming
Stemming
Reduce terms to stems, chopping off affixes crudely
This was not the map we
Thi wa not the map we
found in Billy Bones’s
found in Billi Bone s chest
chest, but an accurate
but an accur copi complet
copy, complete in all
in all thing name and
things-names and heights
height and sound with the
and soundings-with the
singl except of the red
single exception of the
cross and the written note
red crosses and the
.
written notes.
r the most widely used stemming algorithms is the Porter (1980). The Porter stemme
applied to the following paragraph:
ThisStemming
was not the map we found in Billy Bones’s chest, but
an accurate copy, complete in all things-names and heights
and soundings-with the single exception of the red crosses

Porter Stemmer
and the written notes.
produces the following stemmed output:
Thi wa not the map we found in Billi Bone s chest but an
Based oncopi
accur a series
completofinrewrite rules
all thing namerun
and in series
height and sound
with the singl except of the red cross and the written not
◦ A cascade, in which output of each pass fed to next pass
e The algorithm is based on series of rewrite rules run in series, as a cascade, i
Some
which sample
the output rules:
of each pass is fed as input to the next pass; here is a sampling o
the rules:
ATIONAL ! ATE (e.g., relational ! relate)
ING ! ✏ if stem contains vowel (e.g., motoring ! motor)
SSES ! SS (e.g., grasses ! grass)
Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.
can be found on Martin Porter’s homepage; see also the original paper (Porter, 1980)
Simple stemmers can be useful in cases where we need to collapse across differ
Sentence Segmentation
Sentence Segmentation
!, ? mostly unambiguous but period “.” is very ambiguous
◦ Sentence boundary
◦ Abbreviations like Inc. or Dr.
◦ Numbers like .02% or 4.3
Common algorithm: Tokenize first: use rules or ML to
classify a period as either (a) part of the word or (b) a
sentence-boundary.
◦ An abbreviation dictionary can help
Sentence segmentation can then often be done by rules
based on this tokenization.
Minimum Edit Distance

How similar are two strings?

Spell correction • Computational Biology

◦ The user typed “graffe” • Align two sequences of nucleotides
Which is closest? AGGCTATCACCTGACCTCCAGGCCGATGCCC
◦ graf TAGCTATCACGACCGCGGTCGATTTGCCCGAC
◦ graft • Resulting alignment:
◦ grail
◦ giraffe -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Also for Machine Translation, Information Extraction, Speech Recognition

Minimum Edit Distance
Edit Distance
The minimum edit distance between two strings
Is the minimum number of editing operations
◦ Insertion
◦ Deletion
◦ Substitution
Needed to transform one into the other
Minimum Edit Distance
Minimum Edit Distance
Two strings and their alignment:
Minimum Edit Distance
Minimum Edit Distance

If each operation has cost of 1

◦ Distance between these is 5
If substitutions cost 2 (Levenshtein)
◦ Distance between them is 8
HowMinimum
to find the
Edit Min Edit Distance?
Distance

Searching for a path (sequence of edits) from the

start string to the final string:
◦ Initial state: the word we’re transforming
◦ Operators: insert, delete, substitute
◦ Goal state: the word we’re trying to get to
◦ Path cost: what we want to minimize: the number of
edits
Minimum Edit Distance

Minimum Edit as Search

But the space of all edit sequences is huge!
◦ We can’t afford to navigate naïvely
◦ Lots of distinct paths wind up at the same state.
◦ We don’t have to keep track of all of them
◦ Just the shortest path to each of those revisted states.
Defining Min Edit Distance
Minimum Edit Distance

For two strings

◦ X of length n
◦ Y of length m
We define D(i,j)
◦ the edit distance between X[1..i] and Y[1..j]
◦ i.e., the first i characters of X and the first j characters of Y
◦ The edit distance between X and Y is thus D(n,m)
Minimum Edit Distance
Dynamic Programming for Minimum Edit Distance

Dynamic programming: A tabular computation of D(n,m)

Solving problems by combining solutions to subproblems.
Bottom-up
◦ We compute D(i,j) for small i,j
◦ And compute larger D(i,j) based on previously computed smaller
values
◦ i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
Defining Min Edit
Minimum Edit Distance
Distance (Levenshtein)

Initialization
D(i,0) = i
D(0,j) = j
Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1
D(i,j)= min D(i,j-1) + 1
D(i-1,j-1) + 2; if X(i) ≠ Y(j)
0; if X(i) = Y(j)
Termination:
D(N,M) is distance
Minimum Edit Distance

The Edit Distance Table

N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Minimum Edit Distance

The Edit Distance Table

N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Minimum Edit Distance

3023 Kanoodle Extreme Guide
No ratings yet
3023 Kanoodle Extreme Guide
29 pages
Prime Numbers: The Most Mysterious Figures in Math
From Everand
Prime Numbers: The Most Mysterious Figures in Math
David Wells
3/5 (11)
Text Proc
No ratings yet
Text Proc
55 pages
2 TextProc 2023
No ratings yet
2 TextProc 2023
74 pages
2 TextProc Mar 25 2021
No ratings yet
2 TextProc Mar 25 2021
71 pages
Regular Expression and BPE
No ratings yet
Regular Expression and BPE
68 pages
Week 2
No ratings yet
Week 2
90 pages
2 TextProc 2023
No ratings yet
2 TextProc 2023
35 pages
2-Regular expressions, Text Normalization, Edit Distance
No ratings yet
2-Regular expressions, Text Normalization, Edit Distance
42 pages
3b_TextProcessing
No ratings yet
3b_TextProcessing
32 pages
2.BasicTextProcessing NEW
No ratings yet
2.BasicTextProcessing NEW
39 pages
2
No ratings yet
2
29 pages
02 Textprocessingboth
No ratings yet
02 Textprocessingboth
46 pages
Lecture-2n-04032024-081220pm-19022025-105409am
No ratings yet
Lecture-2n-04032024-081220pm-19022025-105409am
38 pages
Natural Language Processing - Session 3 - Regular Expressions
No ratings yet
Natural Language Processing - Session 3 - Regular Expressions
39 pages
2 Textprocessingboth
No ratings yet
2 Textprocessingboth
46 pages
5 BASIC TEXT PROCESSING
No ratings yet
5 BASIC TEXT PROCESSING
6 pages
Final Summary NLP
No ratings yet
Final Summary NLP
446 pages
Regular Expressions, Text Normalization, Edit Distance
No ratings yet
Regular Expressions, Text Normalization, Edit Distance
30 pages
Corpora
No ratings yet
Corpora
48 pages
Grading: Final Term: 40 % Term Paper: 30% Assignments and Quizzes: 30%
No ratings yet
Grading: Final Term: 40 % Term Paper: 30% Assignments and Quizzes: 30%
46 pages
Basic Text Processing: Regular Expressions and Text Normalization
No ratings yet
Basic Text Processing: Regular Expressions and Text Normalization
53 pages
Basic Text Processing: Regular Expressions and Text Normalization
No ratings yet
Basic Text Processing: Regular Expressions and Text Normalization
53 pages
NLP Unit1Content
No ratings yet
NLP Unit1Content
106 pages
Regular Expressions, Text Normalization, Edit Distance
No ratings yet
Regular Expressions, Text Normalization, Edit Distance
23 pages
Natural Language Processing Dossier 20231110 141736 0000
No ratings yet
Natural Language Processing Dossier 20231110 141736 0000
114 pages
Natural Language Processing CS 1462: Some Slides Borrows From Carl Sable
No ratings yet
Natural Language Processing CS 1462: Some Slides Borrows From Carl Sable
54 pages
Lecture 2 NLP
No ratings yet
Lecture 2 NLP
27 pages
Basic Text Processing: Regular Expressions
No ratings yet
Basic Text Processing: Regular Expressions
41 pages
lec02-1-BasicTextProcessing
No ratings yet
lec02-1-BasicTextProcessing
47 pages
Basic Text Processing: Regular Expressions & Automata in NLP
No ratings yet
Basic Text Processing: Regular Expressions & Automata in NLP
27 pages
Kuhlmann - Introduction To Computational Linguistics (Slides) (2015)
100% (1)
Kuhlmann - Introduction To Computational Linguistics (Slides) (2015)
66 pages
AI6122 Topic 1.2 - WordLevel
No ratings yet
AI6122 Topic 1.2 - WordLevel
63 pages
Basic Text Processing: Regular Expressions
No ratings yet
Basic Text Processing: Regular Expressions
46 pages
Lect2 Regular Expressions
No ratings yet
Lect2 Regular Expressions
41 pages
Words and Corpora J+M
No ratings yet
Words and Corpora J+M
49 pages
Chapter 1
No ratings yet
Chapter 1
31 pages
Morphological Analysis (1)
No ratings yet
Morphological Analysis (1)
118 pages
CS 491 Natural Language Processing Module 2: Basic Text Processing
No ratings yet
CS 491 Natural Language Processing Module 2: Basic Text Processing
24 pages
NLP Lect-5 02.02.21
No ratings yet
NLP Lect-5 02.02.21
18 pages
NLP Notes Unit-1
No ratings yet
NLP Notes Unit-1
20 pages
Text Processing, Tokenization & Characteristics
100% (1)
Text Processing, Tokenization & Characteristics
89 pages
NLP Lect-6 03.02.21
No ratings yet
NLP Lect-6 03.02.21
17 pages
NLP Digital Notes
No ratings yet
NLP Digital Notes
128 pages
text-processing
No ratings yet
text-processing
114 pages
CC 2
No ratings yet
CC 2
65 pages
CCS369-Text and Speech Analysis Lab (1-9) (1)
No ratings yet
CCS369-Text and Speech Analysis Lab (1-9) (1)
37 pages
Intro To NLP: Natural Language Toolkit
No ratings yet
Intro To NLP: Natural Language Toolkit
11 pages
lec2
No ratings yet
lec2
21 pages
Chapter 1 + 2
No ratings yet
Chapter 1 + 2
9 pages
Ch3 Modified
No ratings yet
Ch3 Modified
80 pages
Text Preprocessing
No ratings yet
Text Preprocessing
59 pages
Multimedia Application L2
No ratings yet
Multimedia Application L2
47 pages
Apex Institute of Technology Natural Language Processing (20CST354)
No ratings yet
Apex Institute of Technology Natural Language Processing (20CST354)
43 pages
Tokeniz prob!
No ratings yet
Tokeniz prob!
4 pages
Regular Expressions
No ratings yet
Regular Expressions
20 pages
Chapter 3 - Scanning: 3.1 Kinds of Tokens
No ratings yet
Chapter 3 - Scanning: 3.1 Kinds of Tokens
17 pages
Module 1 Nlp
No ratings yet
Module 1 Nlp
26 pages
NLP Manual (1-12)
No ratings yet
NLP Manual (1-12)
54 pages
French Essentials
From Everand
French Essentials
Miriam Ellis
4/5 (3)
The Genetic Code of All Languages,(Part-1; An Overview)
From Everand
The Genetic Code of All Languages,(Part-1; An Overview)
Moni Kanchan Panda
No ratings yet
Af Personal Resume
No ratings yet
Af Personal Resume
2 pages
Supporting Families With Children With Disability PDF
No ratings yet
Supporting Families With Children With Disability PDF
22 pages
SS2028 Basic Psychology II
No ratings yet
SS2028 Basic Psychology II
5 pages
Specific Directions For The Administration of The Phil
No ratings yet
Specific Directions For The Administration of The Phil
4 pages
Exam Oral Com
No ratings yet
Exam Oral Com
1 page
Trauma Informed Education 1675651481. - Print
No ratings yet
Trauma Informed Education 1675651481. - Print
94 pages
Deployment Management_ ITIL 4 Practice Guide
No ratings yet
Deployment Management_ ITIL 4 Practice Guide
32 pages
GNS_101_WEEK_6 (1)
No ratings yet
GNS_101_WEEK_6 (1)
6 pages
The Perfect Storm Is Approaching... : Transformer Specialist Certification Program
No ratings yet
The Perfect Storm Is Approaching... : Transformer Specialist Certification Program
8 pages
Theory and Conceptual Framework
No ratings yet
Theory and Conceptual Framework
2 pages
Journal Record
No ratings yet
Journal Record
1 page
Wa0000.
No ratings yet
Wa0000.
5 pages
Asl 743 Syllabus Fass
No ratings yet
Asl 743 Syllabus Fass
9 pages
Practical Exam/Task Performance: Type in Your Student Number. Type in Your Complete Name. Example: Reyes, Nika P
No ratings yet
Practical Exam/Task Performance: Type in Your Student Number. Type in Your Complete Name. Example: Reyes, Nika P
2 pages
Untitled 1 PDF
No ratings yet
Untitled 1 PDF
2 pages
Malingering Clinical Presentation
0% (1)
Malingering Clinical Presentation
5 pages
Mapeh Class Card 2021
No ratings yet
Mapeh Class Card 2021
13 pages
UNESCO Defines Global Education As A Goal To Become Aware of Educational Conditions
No ratings yet
UNESCO Defines Global Education As A Goal To Become Aware of Educational Conditions
2 pages
5k Fitness Fusion Campus Run 2025
No ratings yet
5k Fitness Fusion Campus Run 2025
2 pages
Lucena Sec032017 Filipino PDF
No ratings yet
Lucena Sec032017 Filipino PDF
19 pages
G.Nandhu 21321A0551
No ratings yet
G.Nandhu 21321A0551
13 pages
Research Methodologies and Practical Applications of Chemistry 1st Edition Lionello Pogliani (Editor) pdf download
No ratings yet
Research Methodologies and Practical Applications of Chemistry 1st Edition Lionello Pogliani (Editor) pdf download
59 pages
Cot in Health
100% (1)
Cot in Health
3 pages
Half Ticket
No ratings yet
Half Ticket
2 pages
B2_French_Study_Plan
No ratings yet
B2_French_Study_Plan
2 pages
Week 3 - Barriers and Process in Tourism Planning
No ratings yet
Week 3 - Barriers and Process in Tourism Planning
16 pages
Course Outline Managerial Accounting
100% (1)
Course Outline Managerial Accounting
5 pages
Lili Canfora - Unit 5 Mastery
No ratings yet
Lili Canfora - Unit 5 Mastery
5 pages
Template - Intrenship Report
No ratings yet
Template - Intrenship Report
7 pages

2 Text Processing

Uploaded by

2 Text Processing

Uploaded by

Natural Language Processing

Text Processing basic

A language for specifying text search strings (pattern

Regex Example Patterns Matched

Regex Match Example Patterns

Extract phone number from following text:

Extract all word “the” or “The” from a text

Type: an element of the vocabulary.

Tokens = N Types = |V|

Words don't appear out of nowhere!

◦ Language: 7097 languages in the world

Tokenizing (segmenting) words

A very simple way to tokenize

Merging upper and lower case

Sorting the counts

Can't just blindly remove punctuation:

>>> text = ’That U.S.A. poster-print costs $12.40...’

Three common algorithms:

Three common algorithms:

Let vocabulary be the set of all individual characters

function B YTE - PAIR ENCODING (strings C, number of merges k) returns vocab V

V all unique characters in C # initial set of tokens is characters

Byte Pair Encoding (BPE) Addendum

Original (very fascinating!) corpus:

For sentiment analysis, MT, Information extraction

Represent all words as their lemma, their shared root

Lemmatization is done by Morphological Parsing

How similar are two strings?

Spell correction • Computational Biology

• Also for Machine Translation, Information Extraction, Speech Recognition

If each operation has cost of 1

Searching for a path (sequence of edits) from the

Minimum Edit as Search

For two strings

Dynamic programming: A tabular computation of D(n,m)

The Edit Distance Table

The Edit Distance Table

You might also like