Lecture 02
http://courses.engr.illinois.edu/cs447
Lecture 2:
Tokenization and
Morphology
Julia Hockenmaier
[email protected]
3324 Siebel Center
Lecture 2: What will we discuss today?
CS447 Natural Language Processing (J. Hockenmaier) https://courses.grainger.illinois.edu/cs447/ 2
Lecture 2: Overview
Today, we’ll look at words:
— How do we identify words in text?
— Word frequencies and Zipf’s Law
— What is a word, really?
— What is the structure of words?
— How can we identify the structure of words?
[ [Of, course, he, wants, to, take, the, advanced, course, too, .],
[He, already, took, two, beginners’, courses, .]]
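The nested list above (one list of tokens per sentence) can be reproduced with a small regular-expression sketch. The raw input text is reconstructed from the tokens shown, and the two patterns are illustrative assumptions, not the tokenizer used in the course; real text needs many more rules (abbreviations, numbers, quotes, ...).

```python
import re

def split_sentences(text):
    # Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # A token is either a run of word characters, optionally with internal
    # apostrophes and one trailing apostrophe (as in "beginners’"),
    # or a single punctuation mark.
    return re.findall(r"\w+(?:['’]\w+)*['’]?|[^\w\s]", sentence)

text = ("Of course he wants to take the advanced course too. "
        "He already took two beginners’ courses.")
tokens = [tokenize(s) for s in split_sentences(text)]

for sentence in tokens:
    print(sentence)
```

Note that the trailing-apostrophe rule would also swallow a closing single quote, which is one reason tokenization is harder than it looks.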
[Figure: Zipf’s Law. Log-log plot of word frequency (y-axis, log scale) against the rank r of the r-th most common word wr (x-axis, log scale). A few words are very frequent. Most words are very rare.]
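Zipf’s Law says that the frequency of the r-th most common word is roughly proportional to 1/r. As a minimal sketch, the rank-frequency table behind such a plot can be computed as follows. The toy token list is made up for illustration and is far too small to show the law itself; on a real corpus, one would plot log(frequency) against log(rank).

```python
from collections import Counter

def rank_frequency(tokens):
    # Count each word type, then sort by descending frequency.
    ranked = Counter(tokens).most_common()
    # Rank 1 is the most frequent word.
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Hypothetical toy corpus (not from the lecture).
toy_tokens = "the cat saw the dog and the dog saw a bird".split()
table = rank_frequency(toy_tokens)

for rank, word, freq in table:
    print(rank, word, freq)
```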
uygarlaştıramadıklarımızdanmışsınızcasına
“as if you are among those whom we were not able to civilize
(= cause to become civilized)”
uygar: civilized
_laş: become
_tır: cause somebody to do something
_ama: not able
_dık: past participle
_lar: plural
_ımız: 1st person plural possessive (our)
_dan: among (ablative case)
_mış: past
_sınız: 2nd person plural (you)
_casına: as if (forms an adverb from a verb)
(K. Oflazer, personal communication to Jurafsky & Martin)
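To make the agglutination concrete, the morphemes listed above can be concatenated back into a single surface word. This is only a sketch: plain string concatenation happens to work for this example, but it ignores vowel harmony and other alternations that a real morphological generator for Turkish would have to handle.

```python
# (morpheme, gloss) pairs, in the order given in the list above.
morphemes = [
    ("uygar", "civilized"),
    ("laş", "become"),
    ("tır", "cause somebody to do something"),
    ("ama", "not able"),
    ("dık", "past participle"),
    ("lar", "plural"),
    ("ımız", "1st person plural possessive (our)"),
    ("dan", "among (ablative case)"),
    ("mış", "past"),
    ("sınız", "2nd person plural (you)"),
    ("casına", "as if (forms an adverb from a verb)"),
]

# Concatenate the morphemes into one orthographic word.
word = "".join(morpheme for morpheme, _ in morphemes)
print(word, "=", len(morphemes), "morphemes")
```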
Nouns:
Common nouns inflect for number:
singular (book) vs. plural (books)
Personal pronouns inflect for person, number, gender, case:
I saw him; he saw me; you saw her; we saw them; they saw us.
Negation:
un-: undo, unseen, ...
mis-: mistake, ...
Adjectivization:
V + -able: doable
N + -al: national
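[Figure: An automaton processes an input string (a b a c d e) one symbol at a time. At each step it (1) reads the next input symbol and (2) changes state, moving from its current state q to a new state q’.]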
Automata and languages
The automaton either accepts or rejects
the input string.
Every automaton defines a language
(= the set of strings it accepts).
[Figure: The automaton reads the input string a b a c d e. If it accepts, the input string is in the language. If it rejects, the input string is NOT in the language.]
Automata and languages
[Figure: An FSA with states q0, q1 and q2 and transitions labeled b, a and a, stepped through symbol by symbol on the input string b a a a.]
Start in q0.
Accept!
We’ve reached the end of the string,
and are in an accepting state.
[Figure: The FSA with states q0, q1 and q2 on the input string b.]
Start in q0.
Reject!
(q1 is not a final state)
[Figure: The FSA with states q0, q1 and q2 on the input string b a c.]
Start in q0.
Reject!
(There is no transition labeled ‘c’)
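The three runs above can be replayed with a small deterministic FSA simulator. The transition table is an assumption reconstructed from the slide residue: q0 goes to q1 on b, q1 goes to q2 on a, q2 loops to itself on a, and q2 is the only accepting state. The function and variable names are illustrative.

```python
def accepts(transitions, start, finals, string):
    """Return True if the deterministic FSA accepts the string."""
    state = start
    for symbol in string:
        key = (state, symbol)
        if key not in transitions:
            # No transition labeled with this symbol: reject.
            return False
        state = transitions[key]
    # End of input: accept only if we are in a final (accepting) state.
    return state in finals

# Assumed transition table (see lead-in).
TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q2",
}
START = "q0"
FINALS = {"q2"}

for s in ["baaa", "b", "bac"]:
    verdict = "Accept!" if accepts(TRANSITIONS, START, FINALS, s) else "Reject!"
    print(s, verdict)
```

The two rejections fail for different reasons: b ends in the non-final state q1, while bac gets stuck because no transition is labeled c.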
dis-grace: q0 --prefix--> q1 --stem--> q2
grace-ful: q0 --stem--> q1 --suffix--> q2
grace, dis-grace, grace-ful, dis-grace-ful:
q0 --prefix or ε--> q1 --stem--> q2 --suffix--> q3
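The combined automaton for grace, dis-grace, grace-ful and dis-grace-ful can be sketched as a small nondeterministic recognizer with an ε-transition. The one-word lexicon per morpheme class and the choice of q2 and q3 as accepting states are assumptions made for illustration (q2 must accept so that bare "grace" is recognized).

```python
# Hypothetical toy lexicon: morphemes for each arc label.
LEXICON = {"prefix": ["dis"], "stem": ["grace"], "suffix": ["ful"]}

# Arcs as (source, label, target); a label of None is an ε-transition.
ARCS = [
    ("q0", "prefix", "q1"),
    ("q0", None, "q1"),      # ε: the prefix is optional
    ("q1", "stem", "q2"),
    ("q2", "suffix", "q3"),
]
FINALS = {"q2", "q3"}        # assumed accepting states

def accepts(word, state="q0"):
    """Try every path through the automaton that consumes the whole word."""
    if not word and state in FINALS:
        return True
    for source, label, target in ARCS:
        if source != state:
            continue
        if label is None:
            # ε-transition: change state without consuming input.
            if accepts(word, target):
                return True
        else:
            for morpheme in LEXICON[label]:
                if word.startswith(morpheme) and accepts(word[len(morpheme):], target):
                    return True
    return False

for w in ["grace", "disgrace", "graceful", "disgraceful", "dis", "gracedis"]:
    print(w, accepts(w))
```

The recursion terminates here because the ε-arcs form no cycle; a general ε-NFA simulator would need to track visited states.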
[Figure: FSA for English derivational morphology. Arcs are labeled with stem classes (noun1, noun2, noun3, adj1, adj2) and suffixes (-iz, -e, -able, -er, -ation, -al).]
noun1 = {fossil, mineral, ...}
adj1 = {equal, neutral}
adj2 = {minim, maxim}
noun2 = {nation, form, ...}
noun3 = {natur, structur, ...}
Lecture 2: Finite-State Transducers
Recognition vs. Analysis
FSAs can recognize (accept) a string,
but they don’t tell us its internal structure.
Input (Surface form): c a t s
Output (Lexical form): c a t +N +pl
disgracefully
dis      grace     ful      ly
prefix   stem      suffix   suffix
NEG      grace+N   +ADJ     +ADV
T = { ⟨cat, cat+N+sg⟩,
⟨cats, cat+N+pl⟩,
⟨fox, fox+N+sg⟩,
⟨foxes, fox+N+pl⟩ }
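The relation T above can be written down directly as a set of (surface form, lexical form) pairs and queried in both directions. This is a plain lookup table, not a finite-state transducer: it only illustrates that the same relation supports analysis (surface to lexical) and generation (lexical to surface). The function names are illustrative.

```python
# The relation T as a set of (surface form, lexical form) pairs.
T = {
    ("cat", "cat+N+sg"),
    ("cats", "cat+N+pl"),
    ("fox", "fox+N+sg"),
    ("foxes", "fox+N+pl"),
}

def analyze(surface):
    # Surface form -> all lexical forms paired with it in T.
    return sorted(lexical for s, lexical in T if s == surface)

def generate(lexical):
    # Lexical form -> all surface forms paired with it in T.
    return sorted(s for s, lex in T if lex == lexical)

print(analyze("foxes"))
print(generate("cat+N+pl"))
print(analyze("dog"))
```

Both functions return lists because a relation, unlike a function, may pair one input with several outputs (or with none, as for "dog" here).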