
CS447: Natural Language Processing

http://courses.engr.illinois.edu/cs447

Lecture 2:
Tokenization and
Morphology
Julia Hockenmaier
[email protected]
3324 Siebel Center
Lecture 2: What will we discuss today?
Lecture 2: Overview
Today, we’ll look at words:
— How do we identify words in text?
— Word frequencies and Zipf’s Law
— What is a word, really?
— What is the structure of words?
— How can we identify the structure of words?

To do this, we’ll need a bit of linguistics,
some data wrangling, and a bit of automata theory.

Later in the semester we’ll ask more questions about words:
— How can we identify different word classes (parts of speech)?
— What is the meaning of words? How can we represent that?
Lecture 2: Reading
Most of the material is taken from Chapter 2
of Jurafsky & Martin (3rd edition).

I won’t cover regular expressions (2.1.1) or edit distance (2.5),
because I assume you have all seen this material before.
If you aren’t familiar with regular expressions, read this section,
because it’s very useful when dealing with text files!

The material on finite-state automata, finite-state
transducers and morphology is from the 2nd edition
of this textbook, but everything you need should be
explained in these slides.


Lecture 2: Key Concepts
You should understand the distinctions between
— Word forms vs. lemmas
— Word tokens vs. word types
— Finite-state automata vs. finite-state transducers
— Inflectional vs. derivational morphology

And you should know the implications of Zipf’s Law
for NLP (coverage!).


Lecture 2: Tokenization
Tokenization: Identifying word boundaries
Text is just a sequence of characters:

Of course he wants to take the advanced course
too. He already took two beginners’ courses.

How do we split this text into words and sentences?

[ [Of, course, he, wants, to, take, the, advanced, course, too, .],
[He, already, took, two, beginners’, courses, .]]



How do we identify the words in a text?
For a language like English, this seems like
a really easy problem:
A word is any sequence of alphabetical characters
between whitespaces that’s not a punctuation mark?

That works to a first approximation, but…


… what about abbreviations like D.C.?
… what about complex names like New York?
… what about contractions like doesn’t or couldn't've?
… what about New York-based ?
… what about names like SARS-Cov-2, or R2-D2?
… what about languages like Chinese that have no whitespace,
or languages like Turkish where one such “word” may
express as much information as an entire English sentence?
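To make these failure modes concrete, here is a minimal sketch (not from the slides) of the naive definition, using Python’s re module:

```python
import re

text = ("Of course he wants to take the advanced course too. "
        "He already took two beginners' courses.")

# Naive definition: a word is a maximal sequence of alphabetic characters.
print(re.findall(r"[A-Za-z]+", text)[:6])
# ['Of', 'course', 'he', 'wants', 'to', 'take']

# The same rule mangles the hard cases listed above:
for s in ["D.C.", "New York-based", "couldn't've", "SARS-CoV-2"]:
    print(s, "->", re.findall(r"[A-Za-z]+", s))
# D.C.           -> ['D', 'C']              (abbreviation split apart)
# New York-based -> ['New', 'York', 'based']
# couldn't've    -> ['couldn', 't', 've']
# SARS-CoV-2     -> ['SARS', 'CoV']         (the digit is lost entirely)
```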
Words aren’t just defined by blanks

Problem 1: Compounding
“ice cream”, “website”, “web site”, “New York-based”

Problem 2: Other writing systems have no blanks
Chinese: 我开始写⼩说 = 我 开始 写 ⼩说
I start(ed) writing novel(s)

Problem 3: Contractions and clitics
English: “doesn’t”, “I’m”
Italian: “dirglielo” = dir + gli(e) + lo
tell + him + it


Tokenization Standards
Any actual NLP system will assume a particular
tokenization standard.
Because so much NLP is based on systems that are trained on
particular corpora (text datasets) that everybody uses, these
corpora often define a de facto standard.

Penn Treebank 3 standard:

Input:
"The San Francisco-based restaurant,"
they said, "doesn’t charge $10".

Output (token boundaries marked with _):
"_ The_ San_ Francisco-based_ restaurant_ ,_ "_
they_ said_ ,_ "_ does_ n’t_ charge_ $_ 10_ "_ ._
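A quick way to try these conventions out is NLTK’s TreebankWordTokenizer, which implements most of the PTB rules (a sketch, assuming NLTK is installed; details such as quote handling may differ slightly from the slide):

```python
from nltk.tokenize import TreebankWordTokenizer  # pip install nltk

sent = '"The San Francisco-based restaurant," they said, "doesn\'t charge $10".'
print(TreebankWordTokenizer().tokenize(sent))
# Note how "doesn't" is split into does + n't and $10 into $ + 10,
# while Francisco-based stays a single token.
```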


Aside: What about sentence boundaries?
How can we identify that this is two sentences?
Mr. Smith went to D.C. Ms. Xu went to Chicago instead.
Challenge: punctuation marks in abbreviations (Mr., D.C., Ms., …)
[It’s easy to handle a small number of known exceptions,
but much harder to identify these cases in general]

See also this headline from the NYT (08/26/20):
Anthony Martignetti (‘Anthony!’), Who Raced Home for Spaghetti, Dies at 63

How many sentences are in this text?
"The San Francisco-based restaurant," they said, "doesn’t charge $10".
Answer: just one, even though “they said” appears in the
middle of another sentence.
Similarly, we typically treat this also as just one sentence:
They said: ”The San Francisco-based restaurant doesn’t charge $10".
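A naive splitter makes the problem concrete (a sketch using only the standard library):

```python
import re

text = "Mr. Smith went to D.C. Ms. Xu went to Chicago instead."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
print(re.split(r"(?<=[.!?])\s+", text))
# ['Mr.', 'Smith went to D.C.', 'Ms.', 'Xu went to Chicago instead.']
# Four "sentences" instead of two: every abbreviation triggers a split.
```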


Spelling variants, typos, etc.
The same word can be written in different ways:
— with different capitalizations:
lowercase “cat” (in standard running text)
capitalized “Cat” (as first word in a sentence, or in titles/headlines),
all-caps “CAT” (e.g. in headlines)
— with different abbreviation or hyphenation styles:
US-based, US based, U.S.-based, U.S. based
US-EU relations, U.S./E.U. relations, …
— with spelling variants (e.g. regional variants of English):
labor vs labour, materialize vs materialise,
— with typos (teh)

Good practice: Be aware of (and/or document) any normalization
(lowercasing, spell-checking, …) your system uses!


Lecture 2: Word Frequencies and Zipf’s Law
Counting words: tokens vs types
When counting words in text, we distinguish between
word types and word tokens:

— The vocabulary of a language
is the set of (unique) word types:
V = {a, aardvark, …, zyzzyva}

— The tokens in a document include all occurrences
of the word types in that document or corpus
(this is what a standard word count tells you)

— The frequency of a word (type) in a document
= the number of occurrences (tokens) of that type
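A minimal sketch of the type/token distinction in code, counting a whitespace-tokenized toy corpus with collections.Counter:

```python
from collections import Counter

tokens = ("of course he wants to take the advanced course too "
          "he already took two beginners courses").split()

freq = Counter(tokens)      # maps each word type to its frequency
print(len(tokens))          # 16 word tokens
print(len(freq))            # 14 word types ('course' and 'he' occur twice)
print(freq["course"])       # 2 = frequency of the type 'course'
```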
How many different words are there in English?

How large is the vocabulary of English (or any other language)?
Vocabulary size = the number of distinct word types
Google N-gram corpus: 1 trillion tokens,
13 million word types that appear 40+ times

If you count words in text, you will find that…
… a few words (mostly closed-class) are very frequent
(the, be, to, of, and, a, in, that, …)
… most words (all open class) are very rare.
… even if you’ve read a lot of text,
you will keep finding words you haven’t seen before.

Word frequency: the number of occurrences of a word type
in a text (or in a collection of texts)
Zipf’s law: the long tail

How many words occur once, twice, 100 times, 1000 times?
The r-th most common word wr has frequency P(wr) ∝ 1/r.

[Figure: log-log plot of word frequency against frequency rank for English
words sorted by frequency (w1 = the, w2 = to, …, w5346 = computer, …).
A few words are very frequent; most words are very rare.]

In natural language:
A small number of events (e.g. words) occur with high frequency
A large number of events occur with very low frequency
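A small empirical check (a sketch; "corpus.txt" is a placeholder for any large plain-text file): if f(r) is the frequency of the rank-r word, Zipf’s law predicts that r · f(r) stays roughly constant.

```python
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:   # hypothetical corpus file
    tokens = f.read().lower().split()

freq = Counter(tokens)
for r, (word, f_r) in enumerate(freq.most_common(), start=1):
    if r in (1, 10, 100, 1000, 10000):
        print(f"rank {r:>5}  {word!r:>15}  freq {f_r:>8}  r*f(r) = {r * f_r}")
```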
Implications of Zipf’s Law for NLP
The good:
Any text will contain a number of words that are very common.
We have seen these words often enough that we know (almost)
everything about them. These words will help us get at the
structure (and possibly meaning) of this text.
The bad:
Any text will contain a number of words that are rare.
We know something about these words, but haven’t seen them
often enough to know everything about them. They may occur
with a meaning or a part of speech we haven’t seen before.
The ugly:
Any text will contain a number of words that are unknown to us.
We have never seen them before, but we still need to get at the
structure (and meaning) of these texts.



Dealing with the bad and the ugly
Our systems need to be able to generalize
from what they have seen to unseen events.

There are two (complementary) approaches to generalization:

— Linguistics provides us with insights about the rules and
structures in language that we can exploit in the (symbolic)
representations we use
E.g.: a finite set of grammar rules is enough to describe an infinite language

— Machine Learning/Statistics allows us to learn models
(and/or representations) from real data that often work well
empirically on unseen data
E.g.: most statistical or neural NLP


How do we represent words?
Option 1: Words are atomic symbols
— Each (surface) word form is its own symbol
— Add some generalization by mapping
different forms of a word to the same symbol
— Normalization: map all variants of the same word (form)
to the same canonical variant (e.g. lowercase everything,
normalize spellings, perhaps spell-check)
— Lemmatization: map each word to its lemma
(esp. in English, the lemma is still a word in the language,
but lemmatized text is no longer grammatical)
— Stemming: remove endings that differ among word forms
(no guarantee that the resulting symbol is an actual word)

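A sketch contrasting stemming and lemmatization, assuming NLTK and its WordNet data are available (nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stem = PorterStemmer().stem
lemmatize = WordNetLemmatizer().lemmatize

for w in ["courses", "wants", "took"]:
    print(w, " stem:", stem(w), " lemma:", lemmatize(w, pos="v"))
# Stems need not be real words ('courses' -> 'cours'), while lemmas are
# dictionary forms ('took' -> 'take', given the verb POS).
```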


How do we represent words?
Option 2: Represent the structure of each word
“books” => “book N pl” (or “book V 3rd sg”)

This requires a morphological analyzer (more later today).
The output is often a lemma (“book”)
plus morphological information (“N pl”, i.e. plural noun).

This is particularly useful for highly inflected languages, e.g.
Czech, Finnish, Turkish, etc. (less so for English or Chinese):
In Czech, you might need to know that nejnezajímavějším
is a regular, feminine, plural, dative adjective in the superlative.


How do we represent unknown words?
Many NLP systems assume a fixed vocabulary, but still have
to handle out-of-vocabulary (OOV) words.

Option 1: the UNK token
Replace all rare words (with a frequency at or below a given threshold, e.g.
2, 3, or 5) in your training data with an UNK token (UNK = “unknown word”).
Replace all unknown words that you come across after training (including rare
training words) with the same UNK token.

Option 2: substring-based representations [often used in neural models]
Represent (rare and unknown) words [“Champaign”] as sequences of
characters [‘C’, ‘h’, ‘a’, …, ’g’, ’n’] or substrings [“Ch”, “amp”, “ai”, “gn”]

Byte Pair Encoding (BPE): learn which character sequences
are common in the vocabulary of your language, and treat those
common sequences as atomic units of your vocabulary (see the sketch below).
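Here is a toy sketch of BPE merge learning (after Sennrich et al. 2016): repeatedly merge the most frequent adjacent symbol pair in a frequency-weighted vocabulary. The four-word vocabulary is made up for illustration.

```python
from collections import Counter

# word (as a tuple of current symbols) -> corpus frequency
vocab = {("l","o","w"): 5, ("l","o","w","e","r"): 2,
         ("n","e","w","e","s","t"): 6, ("w","i","d","e","s","t"): 3}

def best_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i+1]) == pair:
                out.append(word[i] + word[i+1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):
    pair = best_pair(vocab)
    vocab = merge(vocab, pair)
    print("merged:", pair)
# merges ('e','s'), then ('es','t'), then ('l','o'): 'est' becomes one unit.
```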
Lecture 2: What is a word, really?


How many different words are there in English?

How large is the vocabulary of English (or any other language)?
Vocabulary size = the number of distinct word types
Google N-gram corpus: 1 trillion tokens,
13 million word types that appear 40+ times
[here, we’re treating inflected forms (took, taking) as distinct]

You may have heard statements such as
“adults know about 30,000 words”
“you need to know at least 5,000 words to be fluent”
Such statements do not refer to inflected word forms
(take/takes/taking/took) but to lemmas or
dictionary forms (take), and assume that if you know
a lemma, you know all its inflected forms too.
Which words appear in this text?
Of course he wants to take the advanced course
too. He already took two beginners’ courses.

Actual text doesn’t consist of dictionary entries:
wants is a form of want
took is a form of take
courses is a form of course

Linguists distinguish between
— the (surface) forms that occur in text:
want, wants, beginners’, took, …
— and the lemmas that are the uninflected forms of these words:
want, beginner, take, …

In NLP, we sometimes map words to lemmas (or simpler
“stems”), but the raw data always consists of surface forms.
How many different words are there?
Inflection creates different forms of the same word:
Verbs: to be, being, I am, you are, he is, I was,
Nouns: one book, two books

Derivation creates different words from the same lemma:
grace ⇒ disgrace ⇒ disgraceful ⇒ disgracefully

Compounding combines two words into a new word:
cream ⇒ ice cream ⇒ ice cream cone ⇒ ice cream cone bakery

Word formation is productive:
New words are subject to all of these processes:
Google ⇒ Googler, to google, to ungoogle, to misgoogle,
googlification, ungooglification, googlified, Google Maps,
Google Maps service, ...
A Turkish word
uygarlaştıramadıklarımızdanmışsınızcasına
uygar_laş_tır_ama_dık_lar_ımız_dan_mış_sınız_casına

“as if you are among those whom we were not able to civilize
(=cause to become civilized )”
uygar: civilized
_laş: become
_tır: cause somebody to do something
_ama: not able
_dık: past participle
_lar: plural
_ımız: 1st person plural possessive (our)
_dan: among (ablative case)
_mış: past
_sınız: 2nd person plural (you)
_casına: as if (forms an adverb from a verb)

(K. Oflazer, personal communication to Jurafsky & Martin)


Inflectional morphology in English
Verbs:
Infinitive/present tense: walk, go
3rd person singular present tense (s-form): walks, goes
Simple past: walked, went
Past participle (ed-form): walked, gone
Present participle (ing-form): walking, going

Nouns:
Common nouns inflect for number:
singular (book) vs. plural (books)
Personal pronouns inflect for person, number, gender, case:
I saw him; he saw me; you saw her; we saw them; they saw us.



Derivational morphology in English
Nominalization:
V + -ation: computerization
V+ -er: killer
Adj + -ness: fuzziness

Negation:
un-: undo, unseen, ...
mis-: mistake,...

Adjectivization:
V+ -able: doable
N + -al: national



Morphemes: stems, affixes
dis-grace-ful-ly
prefix-stem-suffix-suffix

Many word forms consist of a stem


plus a number of affixes (prefixes or suffixes)
Exceptions: Infixes are inserted inside the stem
Circumfixes (German gesehen) surround the stem
Morphemes: the smallest (meaningful/grammatical)
parts of words.
Stems (grace) are often free morphemes.
Free morphemes can occur by themselves as words.
Affixes (dis-, -ful, -ly) are usually bound morphemes.
Bound morphemes have to combine with others to form words.



Morphemes and morphs
The same information (plural, past tense, …) is often
expressed in different ways in the same language.
One way may be more common than others,
and exceptions may depend on specific words:
- Most plural nouns: add -s to singular: book-books,
but: box-boxes, fly-flies, child-children
- Most past tense verbs add -ed to infinitive: walk-walked,
but: like-liked, leap-leapt
Such exceptions are called irregular word forms

Linguists say that there is one underlying morpheme


(e.g. for plural nouns) that is “realized” as different “surface”
forms (morphs) (e.g. -s/-es/-ren)
Allomorphs: two different realizations (-s/-es/-ren)
of the same underlying morpheme (plural)
Side note: “Surface”?
This terminology comes from Chomskyan
Transformational Grammar.
- Dominant early approach in theoretical linguistics,
superseded by other approaches (“minimalism”).
- Not computational, but has some historical influence on
computational linguistics (e.g. Penn Treebank)

“Surface” = standard English (Chinese, Hindi, etc.).


“Surface string” = a written sequence of characters or words
vs. “Deep”/“Underlying” structure/representation:
A more abstract representation.
Might be the same for different sentences/words
with the same meaning.



Lecture 2: Finite-State Automata and Regular Languages


Formal languages
An alphabet ∑ is a set of symbols:
e.g. ∑= {a, b, c}

A string ω is a sequence of symbols, e.g. ω = abcb.
The empty string ε consists of zero symbols.

The Kleene closure ∑* (‘sigma star’) is the (infinite)
set of all strings that can be formed from ∑:
∑* = {ε, a, b, c, aa, ab, ba, aaa, ...}

A language L ⊆ ∑* over ∑ is also a set of strings.
Typically we only care about proper subsets of ∑* (L ⊂ ∑*).


Automata and languages
An automaton is an abstract model of a computer.
It reads an input string symbol by symbol.
It changes its internal state depending on
the current input symbol and its current internal state.
[Figure: the automaton reads the input string (e.g. a b a c d e) one symbol
at a time (1. read input); at each step it changes from its current state q
to a new state q’ (2. change state).]
Automata and languages
The automaton either accepts or rejects
the input string.
Every automaton defines a language
(= the set of strings it accepts).

[Figure: the automaton reads the input string and either accepts it
(the string is in the language) or rejects it (the string is not).]
Automata and languages

Different types of automata define different language classes:

— Finite-state automata define regular languages
— Pushdown automata define context-free languages
— Turing machines define recursively enumerable languages


Finite-state automata
A (deterministic) finite-state automaton (FSA)
consists of:
- a finite set of states Q = {q0, …, qN}, including a start state q0
and one (or more) final (= accepting) states, say qN
(final states are drawn with a double line)
- a (deterministic) transition function
δ(q, w) = q’ for q, q’ ∈ Q, w ∈ Σ

[Figure: a five-state FSA; e.g. the transition δ(q2, y) = q4 means
“move from state q2 to state q4 if you read ‘y’”.]


[Figure: stepping the FSA through the input “b a a a”, starting in q0;
each symbol read follows a matching transition.]

Accept!
We’ve reached the end of the string,
and are in an accepting state.


Rejection: automaton does not end up in an accepting state

[Figure: on input “b”, the automaton moves from q0 to q1 and the input
ends. Reject! (q1 is not a final state.)]


Rejection: transition not defined

[Figure: on input “b a c”, the automaton reads ‘b’ and ‘a’, then reaches
a state with no transition labeled ‘c’. Reject!]


Finite State Automata (FSAs)

Every NFA can be transformed into an equivalent DFA:

[Figure: an NFA with two b-transitions out of q0, and the equivalent DFA,
obtained by merging the states the NFA could be in simultaneously.]

Recognition of a string w with a DFA is linear in the length of w.

Finite-state automata define the class of regular languages:
L1 = { a^n b^m } = {ab, aab, abb, aaab, aabb, …} is a regular language,
L2 = { a^n b^n } = {ab, aabb, aaabbb, …} is not (it’s context-free).
You cannot construct an FSA that accepts all the strings in L2 and nothing else.


Regular Expressions
Regular expressions (regexes) can also be used
to define a regular language.
Simple patterns:
- Standard characters match themselves: ‘a’, ‘1’
- Character classes: ‘[abc]’, ‘[0-9]’, negation: ‘[^aeiou]’
(Predefined: \s (whitespace), \w (alphanumeric), etc.)
- Any character (except newline) is matched by ‘.’
Complex patterns: (e.g. ^[A-Z]([a-z])+\s )
- Group: ‘(…)’
- Repetition: 0 or more times: ‘*’, 1 or more times: ‘+’
- Disjunction: ‘...|…’
- Beginning of line ‘^’ and end of line ‘$’

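The same patterns, tried out with Python’s re module (a sketch; regex syntax varies slightly across tools):

```python
import re

print(re.findall(r"[0-9]+", "CS447, room 3324"))        # ['447', '3324']
print(re.findall(r"[^aeiou]", "book"))                  # ['b', 'k']
print(bool(re.match(r"^[A-Z]([a-z])+\s", "Cats are")))  # True
print(re.findall(r"colou?r|grey", "colour, grey"))      # ['colour', 'grey']
```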


Lecture 2: Finite-state automata for morphology


Finite state automata for morphology

grace:          q0 --stem--> q1
dis-grace:      q0 --prefix--> q1 --stem--> q2
grace-ful:      q0 --stem--> q1 --suffix--> q2
dis-grace-ful:  q0 --prefix--> q1 --stem--> q2 --suffix--> q3


Union: merging automata

We can merge the four automata for grace, dis-grace, grace-ful
and dis-grace-ful into a single FSA:

[Figure: q0 --prefix (or ε)--> q1 --stem--> q2 --suffix (or ε)--> q3;
the ε-transitions make the prefix and suffix optional.]


FSAs for derivational morphology

[Figure: a larger FSA over stems and derivational affixes
(-iz, -e, -able, -er, -ation, -al), licensing forms such as
fossil-iz-e and nation-al-iz-ation.
Stem classes: noun1 = {fossil, mineral, …}, adj1 = {equal, neutral},
adj2 = {minim, maxim}, noun2 = {nation, form, …}, noun3 = {natur, structur, …}]
Lecture 2: Finite-State Transducers
Recognition vs. Analysis
FSAs can recognize (accept) a string,
but they don’t tell us its internal structure.

What we need is a machine that maps (transduces)
the input string into an output string
that encodes its structure:

Input (surface form):   c a t s
Output (lexical form):  c a t +N +pl


Morphological parsing

disgracefully
dis grace ful ly
prefix stem suffix suffix
NEG grace+N +ADJ +ADV



Morphological generation
We cannot enumerate all possible English words,
but we would like to capture the rules that define
whether a string could be an English word or not.

That is, we want a procedure that can generate
(or accept) possible English words…
grace, graceful, gracefully
disgrace, disgraceful, disgracefully,
ungraceful, ungracefully,
undisgraceful, undisgracefully,…
without generating/accepting impossible English words
*gracelyful, *gracefuly, *disungracefully,…

NB: * is linguists’ shorthand for “this is ungrammatical”


Finite State Automata (FSAs)
A finite-state automaton M = 〈Q, Σ, q0, F, δ〉 consists of:
— A finite set of states Q = {q0, q1,.., qn}
— A finite alphabet Σ of input symbols (e.g. Σ = {a, b, c,…})
— A designated start state q0 ∈ Q
— A set of final states F ⊆ Q
— A transition function δ:
For a deterministic (D)FSA: Q × Σ → Q
δ(q,w) = q’ for q, q’ ∈ Q, w ∈ Σ
If the current state is q and the current input is w, go to q’

For a nondeterministic (N)FSA: Q × Σ → 2^Q
δ(q,w) = Q’ for q ∈ Q, Q’ ⊆ Q, w ∈ Σ
If the current state is q and the current input is w, go to any q’ ∈ Q’
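The DFA case translates almost directly into code (a sketch: states are strings, δ is a dict, and recognition walks the input once, i.e. in linear time, as noted above):

```python
def accepts(delta, start, finals, s):
    q = start
    for w in s:
        if (q, w) not in delta:   # no transition defined: reject
            return False
        q = delta[(q, w)]
    return q in finals            # accept iff we end in a final state

# Example DFA for the regular language a*b+ (any number of a's, then
# at least one b):
delta = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "b"): "q1"}
print(accepts(delta, "q0", {"q1"}, "aaab"))  # True
print(accepts(delta, "q0", {"q1"}, "aaba"))  # False: no 'a'-transition from q1
```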
Finite-state transducers
A finite-state transducer T = 〈Q, Σ, Δ, q0, F, δ, σ〉 consists of:
— A finite set of states Q = {q0, q1,.., qn}
— A finite alphabet Σ of input symbols (e.g. Σ = {a, b, c,…})
— A finite alphabet Δ of output symbols (e.g. Δ = {+N, +pl,…})
— A designated start state q0 ∈ Q
— A set of final states F ⊆ Q
— A transition function δ: Q × Σ → 2^Q
δ(q,w) = Q’ for q ∈ Q, Q’ ⊆ Q, w ∈ Σ
— An output function σ: Q × Σ → Δ*
σ(q,w) = ω for q ∈ Q, w ∈ Σ, ω ∈ Δ*
If the current state is q and the current input is w, write ω.
(NB: Jurafsky & Martin (2nd ed.) define σ: Q × Σ* → Δ*. Why is this equivalent?)


Finite-state transducers
An FST T = Lin ⨉ Lout defines a relation
between two regular languages Lin and Lout:

Lin = {cat, cats, fox, foxes, ...}
Lout = {cat+N+sg, cat+N+pl, fox+N+sg, fox+N+pl, ...}

T = { ⟨cat, cat+N+sg⟩,
⟨cats, cat+N+pl⟩,
⟨fox, fox+N+sg⟩,
⟨foxes, fox+N+pl⟩ }
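A toy sketch of the relation T as an executable transducer: transitions map (state, input symbol) to (next state, output string), and a final-state output supplies the +sg/+pl distinction. The tiny lexicon is hard-coded purely for illustration:

```python
trans = {
    ("q0", "c"): ("q1", "c"), ("q1", "a"): ("q2", "a"),
    ("q2", "t"): ("q3", "t+N"),       # end of the stem: emit lemma + tag
    ("q3", "s"): ("q4", "+pl"),       # plural suffix
}
finals = {"q3": "+sg", "q4": ""}      # output emitted at a final state

def transduce(s):
    q, out = "q0", []
    for w in s:
        q, o = trans[(q, w)]
        out.append(o)
    return "".join(out) + finals[q]

print(transduce("cat"))   # cat+N+sg
print(transduce("cats"))  # cat+N+pl
```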


Some FST operations

Inversion T⁻¹:
The inversion T⁻¹ of a transducer switches input and output labels.
This can be used to switch from parsing words to generating words.

Composition T∘T’ (cascade):
Two transducers T = L1 ⨉ L2 and T’ = L2 ⨉ L3 can be
composed into a third transducer T’’ = L1 ⨉ L3.
Sometimes intermediate representations are useful.


English spelling rules
Peculiarities of English spelling (orthography)

The same underlying morpheme (e.g. plural -s)
can have different orthographic “surface realizations” (-s, -es).

This leads to spelling changes at morpheme boundaries:
E-insertion: fox + s = foxes
E-deletion: make + ing = making


Intermediate representations
English plural -s: cat ⇒ cats, dog ⇒ dogs
but: fox ⇒ foxes, bus ⇒ buses, buzz ⇒ buzzes

We define an intermediate representation to capture
morpheme boundaries (^) and word boundaries (#):

Lexicon:                      cat+N+PL   fox+N+PL
Intermediate representation:  cat^s#     fox^s#
Surface string:               cats       foxes

Intermediate-to-surface spelling rule:
If plural ‘s’ follows a morpheme ending in ‘x’, ‘z’ or ‘s’, insert ‘e’.
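The same rule as plain string rewriting (a sketch; the transducers on the following slides implement it with states and arcs instead):

```python
import re

def to_surface(intermediate):
    # If plural 's' follows a morpheme ending in 'x', 'z' or 's', insert 'e':
    s = re.sub(r"([xzs])\^s#", r"\1es", intermediate)
    return s.replace("^", "").replace("#", "")   # drop remaining boundaries

print(to_surface("cat^s#"))   # cats
print(to_surface("fox^s#"))   # foxes
print(to_surface("buzz^s#"))  # buzzes
```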


FST composition/cascade:

[Figure: a cascade of two transducers maps the lexical level through the
intermediate level to the surface level, e.g. cat+N+PL ⇒ cat^s# ⇒ cats.]


Tlex: lexical to intermediate level

[Figure: the transducer Tlex maps lexical forms (e.g. cat+N+PL)
to intermediate forms (e.g. cat^s#).]


Te-insert: intermediate to surface level

Intermediate-to-surface spelling rule:
If plural ‘s’ follows a morpheme ending in ‘x’, ‘z’ or ‘s’, insert ‘e’.

[Figure: the transducer Te-insert implementing this rule. Arcs are labeled
input:output, e.g. ^:e inserts an ‘e’ at a morpheme boundary after x, z or s,
while ^:ε and #:ε delete the boundary symbols (^ = morpheme boundary,
# = word boundary, ε = empty string); other characters map to themselves
(a:a, …, s:s, x:x, z:z).]


Dealing with ambiguity
book: book +N +sg or book +V?
Generating words is generally unambiguous,
but analyzing words often requires disambiguation.

We need a nondeterministic FST.
Efficiency problem: not every nondeterministic FST
can be translated into a deterministic one!

We also need a scoring function to identify which
analysis is more likely.
We may need to know the context in which the word
appears (I read a book vs. I book flights).
What about compounds?
Semantically, compounds have hierarchical structure:

(((ice cream) cone) bakery)
not (ice ((cream cone) bakery))

((computer science) (graduate student))
not (computer ((science graduate) student))

We will need context-free grammars to capture this
underlying structure.


The end