Lecture 6
Today
• Formal Grammars
– Context-free grammar
– Grammars for English
– Treebanks
– Dependency grammars
Syntax
• By grammar, or syntax, we have in mind
the kind of implicit knowledge of your
native language that you had mastered by
the time you were 3 years old without
explicit instruction
• Not the kind of stuff you were later taught
in “grammar” school
Syntax
• Why should you care?
• Grammars (and parsing) are key
components in many applications
– Grammar checkers
– Dialogue management
– Question answering
– Information extraction
– Machine translation
Syntax
• Key notions that we’ll cover
– Constituency
– Grammatical relations and Dependency
• Heads
• Key formalism
– Context-free grammars
• Resources
– Treebanks
Constituency
• The basic idea here is that groups of words
within utterances can be shown to act as
single units.
• And in a given language, these units form
coherent classes that can be shown to
behave in similar ways
– With respect to their internal structure
– And with respect to other units in the language
Constituency
• Internal structure
– We can describe an internal structure to the class
(might have to use disjunctions of somewhat
unlike sub-classes to do this).
• External behavior
– For example, we can say that noun phrases can
come before verbs
Constituency
• For example, it makes sense to say that
the following are all noun phrases in
English...
Grammars and Constituency
• Of course, there’s nothing easy or obvious about
how we come up with the right set of constituents and
the rules that govern how they combine...
• That’s why there are so many different theories of
grammar and competing analyses of the same
data.
• The approach to grammar, and the analyses,
adopted here are very generic (and don’t
correspond to any modern linguistic theory of
grammar).
Context-Free Grammars
• Context-free grammars (CFGs)
– Also known as
• Phrase structure grammars
• Backus-Naur form
• Consist of
– Rules
– Terminals
– Non-terminals
Context-Free Grammars
• Terminals
– We’ll take these to be words (for now)
• Non-Terminals
– The constituents in a language
• Like noun phrase, verb phrase and sentence
• Rules
– Rules are equations that consist of a single non-
terminal on the left and any number of
terminals and non-terminals on the right.
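A minimal sketch of how these three ingredients might be written down in Python. The toy grammar and the names `RULES`, `nonterminals`, and `terminals` are illustrative assumptions, not taken from the lecture.

```python
# Each rule pairs one non-terminal (left side) with a sequence of
# terminals and non-terminals (right side). Toy grammar, for illustration only.
RULES = [
    ("S", ["NP", "VP"]),
    ("NP", ["Det", "Noun"]),
    ("VP", ["Verb"]),
    ("Det", ["a"]),
    ("Noun", ["plane"]),
    ("Verb", ["left"]),
]

# Non-terminals are the symbols that appear on some left-hand side.
nonterminals = {lhs for lhs, _ in RULES}

# Terminals are right-hand-side symbols that never appear on a left-hand side.
terminals = {sym for _, rhs in RULES for sym in rhs} - nonterminals

print(sorted(nonterminals))
print(sorted(terminals))
```

Here the terminals come out as the words themselves, matching the "words (for now)" convention above.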
Some NP Rules
• Here are some rules for our noun phrases
Generativity
• As with FSAs and FSTs, you can view these
rules as either analysis or synthesis
machines
– Generate strings in the language
– Reject strings not in the language
– Impose structures (trees) on strings in the
language
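The generate/reject view can be sketched for a small, non-recursive toy grammar by enumerating every string it derives. The grammar and the helper names `expand` and `accepts` are assumptions made for illustration; a recursive grammar would need a real parser instead of enumeration.

```python
from itertools import product

# Toy grammar (assumed): non-terminal -> list of alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "Det": [["a"], ["the"]],
    "Noun": [["plane"], ["pilot"]],
    "Verb": [["left"], ["saw"]],
}

def expand(symbol):
    """Return every word sequence the symbol derives (grammar must be non-recursive)."""
    if symbol not in GRAMMAR:              # a terminal derives only itself
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Pick one expansion for each right-hand-side symbol, in order.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([w for part in parts for w in part])
    return results

# Synthesis: the full (finite) language of the toy grammar.
LANGUAGE = {" ".join(words) for words in expand("S")}

def accepts(sentence):
    """Analysis: a string is accepted only if the grammar can generate it."""
    return sentence in LANGUAGE

print(accepts("a plane left"))
print(accepts("plane a left"))
```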
Derivations
• A derivation is a sequence of rules applied to a string
that accounts for that string
– Covers all the elements in the string
– Covers only the elements in the string
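As a rough illustration, a derivation can be replayed by rewriting the leftmost non-terminal at each step. The rule sequence below is a made-up example for "a plane left", and `apply_leftmost` is a hypothetical helper, not something defined in the lecture.

```python
# A derivation recorded as the sequence of rules applied, always rewriting
# the leftmost non-terminal. Toy rules, assumed for illustration.
DERIVATION = [
    ("S", ["NP", "VP"]),
    ("NP", ["Det", "Noun"]),
    ("Det", ["a"]),
    ("Noun", ["plane"]),
    ("VP", ["Verb"]),
    ("Verb", ["left"]),
]
NONTERMINALS = {"S", "NP", "VP", "Det", "Noun", "Verb"}

def apply_leftmost(derivation, start="S"):
    """Replay the rules and return every intermediate form."""
    form = [start]
    steps = [form]
    for lhs, rhs in derivation:
        # Find the leftmost non-terminal; it must match the rule's left side.
        i = next(k for k, sym in enumerate(form) if sym in NONTERMINALS)
        if form[i] != lhs:
            raise ValueError(f"rule for {lhs} does not apply to {form}")
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps

steps = apply_leftmost(DERIVATION)
for form in steps:
    print(" ".join(form))
```

The final form contains all the words of the string and nothing else, which is the "covers all and only" condition.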
Definition
• More formally, a CFG consists of
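One common textbook formulation, given here as an assumption about notation rather than as this lecture's own definition, treats a CFG as a 4-tuple: a set of non-terminals N, a disjoint set of terminals Σ, a set of rules R, each with a single non-terminal on the left, and a start symbol S.

```latex
G = (N, \Sigma, R, S), \qquad
N \cap \Sigma = \emptyset, \qquad
R \subseteq N \times (N \cup \Sigma)^{*}, \qquad
S \in N
```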
Parsing
• Parsing is the process of taking a string and
a grammar and returning one or more parse
trees for that string
• It is completely analogous to running a
finite-state transducer with a tape
– It’s just more powerful
• Remember this means that there are languages we
can capture with CFGs that we can’t capture with
finite-state methods
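A minimal sketch of a parser that returns every tree for a string, using a CKY-style search over spans. This is a swapped-in textbook technique, not necessarily the algorithm this course uses, and the toy grammar (in Chomsky normal form) and lexicon are assumptions.

```python
from functools import lru_cache

# Toy grammar in Chomsky normal form (assumed): binary rules A -> B C ...
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "Verb", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "NP", "PP"),
    ("NP", "Det", "Noun"),
    ("PP", "Prep", "NP"),
]
# ... and lexical rules A -> word, stored as word -> possible categories.
LEXICAL = {
    "I": ["NP"],
    "saw": ["Verb"],
    "the": ["Det"],
    "man": ["Noun"],
    "telescope": ["Noun"],
    "with": ["Prep"],
}

def parse(words, start="S"):
    """Return every parse tree for `words` as nested tuples."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def trees(label, i, j):
        found = []
        if j - i == 1 and label in LEXICAL.get(words[i], []):
            found.append((label, words[i]))
        for lhs, left, right in BINARY:
            if lhs != label:
                continue
            for k in range(i + 1, j):          # every split point of the span
                for lt in trees(left, i, k):
                    for rt in trees(right, k, j):
                        found.append((label, lt, rt))
        return tuple(found)

    return list(trees(start, 0, len(words)))

for tree in parse("I saw the man with the telescope".split()):
    print(tree)
```

The example sentence gets two trees because the PP can attach either to the verb phrase or to the noun phrase, which is why a parser may return more than one tree.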
An English Grammar Fragment
• Sentences
• Noun phrases
– Agreement
• Verb phrases
– Subcategorization
Sentence Types
• Declaratives: A plane left.
S → NP VP
• Imperatives: Leave!
S → VP
• Yes-No Questions: Did the plane leave?
S → Aux NP VP
• WH Questions: When did the plane leave?
S → WH-NP Aux NP VP
Noun Phrases
• Let’s consider the following rule in more
detail...
NP → Det Nominal
• Most of the complexity of English noun
phrases is hidden in this rule.
• Consider the derivation for the following
example
– All the morning flights from Denver to Tampa
leaving before 10
Noun Phrases
NP Structure
• Clearly this NP is really about flights.
That’s the central, crucial noun in this NP.
Let’s call that the head.
• We can dissect this kind of NP into the
stuff that can come before the head, and the
stuff that can come after it.
Determiners
• Noun phrases can start with determiners...
• Determiners can be
– Simple lexical items: the, this, a, an, etc.
• A car
– Or simple possessives
• John’s car
– Or complex recursive versions of that
• John’s sister’s husband’s son’s car
Nominals
• A nominal contains the head and any pre- and
postmodifiers of the head.
– Premodifiers
• Quantifiers, cardinals, ordinals...
– Three cars
• Adjectives
– large cars
• Ordering constraints
– Three large cars
Postmodifiers
• Three kinds
– Prepositional phrases
• From Seattle
– Non-finite clauses
• Arriving before noon
– Relative clauses
• That serve breakfast
• Same general (recursive) rule to handle these
– Nominal → Nominal PP
– Nominal → Nominal GerundVP
– Nominal → Nominal RelClause
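A small sketch of how the recursive rule stacks postmodifiers: each application wraps the current Nominal in a new one. The base rule Nominal → Noun and the helpers `attach_postmodifiers` and `depth` are assumptions added for illustration.

```python
def attach_postmodifiers(head_noun, postmodifiers):
    """Apply Nominal -> Nominal X once per postmodifier, left to right."""
    # Base case (assumed rule): Nominal -> Noun
    nominal = ("Nominal", ("Noun", head_noun))
    for label, text in postmodifiers:
        # Recursive case: the old Nominal becomes the left child of a new one.
        nominal = ("Nominal", nominal, (label, text))
    return nominal

tree = attach_postmodifiers(
    "flights",
    [("PP", "from Denver"), ("PP", "to Tampa"), ("GerundVP", "leaving before 10")],
)

def depth(node):
    """Number of Nominal nodes stacked on the leftmost path."""
    return 1 + depth(node[1]) if node[0] == "Nominal" else 0

print(tree)
print(depth(tree))
```

Because the rule refers to itself, there is no fixed limit on how many postmodifiers a nominal can take.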
Agreement
• By agreement, we have in mind constraints
that hold among various constituents that take
part in a rule or set of rules
• For example, in English, determiners and the
head nouns in NPs have to agree in their
number.
Problem
• Our earlier NP rules are clearly deficient
since they don’t capture this constraint
– NP → Det Nominal
• Accepts, and assigns correct structures to,
grammatical examples (this flight)
• But it’s also happy with incorrect examples (*these
flight)
– Such a rule is said to overgenerate.
– We’ll come back to this in a bit
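The overgeneration problem can be sketched by comparing a check that ignores number with one that enforces it. The toy lexicon and both function names are assumptions, and the number check is only one possible remedy, not necessarily the one the lecture returns to.

```python
# Number features for a few words (toy lexicon, assumed for illustration).
DET_NUMBER = {"this": "sg", "these": "pl", "that": "sg", "those": "pl"}
NOUN_NUMBER = {"flight": "sg", "flights": "pl"}

def accepts_plain(det, noun):
    """NP -> Det Nominal with no number check: overgenerates."""
    return det in DET_NUMBER and noun in NOUN_NUMBER

def accepts_with_agreement(det, noun):
    """Same rule, but Det and head noun must share a number value."""
    return accepts_plain(det, noun) and DET_NUMBER[det] == NOUN_NUMBER[noun]

print(accepts_plain("these", "flight"))           # the plain rule lets this through
print(accepts_with_agreement("these", "flight"))  # the agreement check blocks it
```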
Verb Phrases
• English VPs consist of a head verb along with
0 or more following constituents which we’ll
call arguments.
Subcategorization
• But, even though there are many valid VP
rules in English, not all verbs are allowed to
participate in all those VP rules.
• We can subcategorize the verbs in a
language according to the sets of VP rules
that they participate in.
• This is a modern take on the traditional
notion of transitive/intransitive.
• Modern grammars may have 100s of such
classes.
Subcategorization
• Sneeze: John sneezed
• Find: Please find [a flight to NY]NP
• Give: Give [me]NP[a cheaper fare]NP
• Help: Can you help [me]NP[with a flight]PP
• Prefer: I prefer [to leave earlier]TO-VP
• Told: I was told [United has a flight]S
• …
Subcategorization
• *John sneezed the book
• *I prefer United has a flight
• *Give with a flight
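The examples and starred counterexamples above can be sketched as a table of frames per verb. The frames follow the slides; the dictionary encoding and the `licensed` helper are assumptions made for illustration.

```python
# Subcategorization frames: the argument sequences each verb allows.
# One frame per verb here, taken from the examples; real lexicons list several.
FRAMES = {
    "sneeze": [()],
    "find": [("NP",)],
    "give": [("NP", "NP")],
    "help": [("NP", "PP")],
    "prefer": [("TO-VP",)],
    "told": [("S",)],
}

def licensed(verb, arguments):
    """True if the verb allows exactly this sequence of argument types."""
    return tuple(arguments) in FRAMES.get(verb, [])

print(licensed("sneeze", []))          # John sneezed
print(licensed("sneeze", ["NP"]))      # *John sneezed the book
print(licensed("prefer", ["S"]))       # *I prefer United has a flight
print(licensed("give", ["PP"]))        # *Give with a flight
```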
Why?
Treebanks
• Treebanks are corpora in which each sentence has
been paired with a parse tree (presumably the right
one).
• These are generally created
– By first parsing the collection with an automatic parser
– And then having human annotators correct each parse
as necessary.
• This generally requires detailed annotation
guidelines that provide a POS tagset, a grammar
and instructions for how to deal with particular
grammatical constructions.
Penn Treebank
• Penn TreeBank is a widely used treebank.
Treebank Grammars
• Treebanks implicitly define a grammar for
the language covered in the treebank.
• Simply take the local rules that make up the
sub-trees in all the trees in the collection
and you have a grammar.
• Not complete, but if you have a decent-sized
corpus, you’ll have a grammar with decent
coverage.
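Reading a grammar off a treebank can be sketched as follows: parse each bracketed tree and count one local rule per node. The two-tree "treebank" is invented for illustration (with Penn-style tag names), and `read_tree` and `local_rules` are hypothetical helpers.

```python
from collections import Counter

def read_tree(text):
    """Parse a bracketed tree such as "(S (NP (DT a) (NN plane)) (VP (VBD left)))"."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                               # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())        # a sub-tree
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                               # skip ")"
        return (label, children)

    return node()

def local_rules(tree, counts):
    """Count one rule per node: label -> labels (or words) of its children."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            local_rules(child, counts)
    return counts

TREEBANK = [
    "(S (NP (DT a) (NN plane)) (VP (VBD left)))",
    "(S (NP (DT the) (NN plane)) (VP (VBD left)))",
]
grammar = Counter()
for line in TREEBANK:
    local_rules(read_tree(line), grammar)

for (lhs, rhs), n in grammar.items():
    print(lhs, "->", " ".join(rhs), n)
```

The counts are kept because they are what a statistical parser would later turn into rule probabilities.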
Treebank Grammars
• Such grammars tend to be very flat due to
the fact that they tend to avoid recursion.
– To ease the annotators’ burden
• For example, the Penn Treebank has 4500
different rules for VPs. Among them...
Heads in Trees
• Finding heads in treebank trees is a task that
arises frequently in many applications.
– Particularly important in statistical parsing
• We can visualize this task by annotating the
nodes of a parse tree with the heads of each
corresponding node.
Lexically Decorated Tree
Head Finding
• The standard way to do head finding is to
use a simple set of tree traversal rules
specific to each non-terminal in the
grammar.
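A minimal sketch of rule-based head finding. The `HEAD_RULES` table is a simplified assumption made up for this example, not a published head table; each entry gives a search direction and a priority list of child labels for one non-terminal.

```python
# Simplified head rules (assumed): parent label -> (search direction, child priorities).
HEAD_RULES = {
    "S": ("left", ["VP", "NP"]),
    "VP": ("left", ["VBD", "VB", "VP"]),
    "NP": ("right", ["NN", "NNS", "NP"]),
    "PP": ("left", ["IN"]),
}

def head_word(tree):
    """Return the head word of a tree given as (label, child, ...) tuples."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]                     # preterminal: the word itself
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    ordered = children if direction == "left" else children[::-1]
    for wanted in priorities:                  # highest-priority label first
        for child in ordered:
            if child[0] == wanted:
                return head_word(child)
    return head_word(ordered[0])               # fallback: first child in search order

TREE = ("S",
        ("NP", ("DT", "the"), ("NN", "plane")),
        ("VP", ("VBD", "left"),
               ("PP", ("IN", "from"), ("NP", ("NN", "Denver")))))

print(head_word(TREE))
print(head_word(TREE[1]))
```

Calling `head_word` on every node is what produces a lexically decorated tree.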
Noun Phrases
Treebank Uses
• Treebanks (and head finding) are
particularly critical to the development of
statistical parsers
• Also valuable to Corpus Linguistics
– Investigating the empirical details of various
constructions in a given language
Dependency Grammars
• In CFG-style phrase-structure grammars the
main focus is on constituents.
• But it turns out you can get a lot done with
just binary relations among the words in an
utterance.
• In a dependency grammar framework, a
parse is a tree where
– The nodes stand for the words in an utterance
– The links between the words represent
dependency relations between pairs of words.
• Relations may be typed (labeled), or not.
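A dependency parse can be sketched as a list of typed arcs between word positions, plus a check that the arcs form a tree. The sentence, the relation names, and the artificial ROOT node at index 0 are illustrative assumptions.

```python
# A dependency parse as typed arcs (head, relation, dependent) over word positions.
WORDS = ["ROOT", "the", "plane", "left", "Denver"]
ARCS = [
    (0, "root", 3),    # ROOT  -> left
    (3, "nsubj", 2),   # left  -> plane
    (2, "det", 1),     # plane -> the
    (3, "obj", 4),     # left  -> Denver
]

def is_tree(n_words, arcs):
    """Check that every word has exactly one head and reaches ROOT (index 0)."""
    heads = {}
    for head, _, dep in arcs:
        if dep in heads:                   # a second head for the same word
            return False
        heads[dep] = head
    if set(heads) != set(range(1, n_words)):
        return False                       # some word has no head
    for word in range(1, n_words):
        seen, node = set(), word
        while node != 0:                   # walk up towards ROOT
            if node in seen:
                return False               # cycle
            seen.add(node)
            node = heads[node]
    return True

print(is_tree(len(WORDS), ARCS))
```

Dropping the labels from the arcs gives the untyped variant mentioned above.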
Summary
• Context-free grammars can be used to model
various facts about the syntax of a language.
• When paired with parsers, such grammars
constitute a critical component in many
applications.
• Constituency is a key phenomenon easily captured
with CFG rules.
– But agreement and subcategorization do pose
significant problems
• Treebanks pair sentences in a corpus with their
corresponding trees.