MDAN 54233
Analysing the syntax and structure
• Parts of speech (POS) tagging
• Shallow parsing
• Dependency-based parsing
• Constituency-based parsing
Syntax and Structure
Supunmali Ahangama
Libraries
• The nltk library
• The spacy library
• The pattern library
• The Stanford parser
• Graphviz and the libraries it requires
Language Syntax and Structure
• A collection of words without any relation or structure
Hierarchical sentence tree
• sentence → clause → phrase → word
“The brown fox is quick and he is jumping over the lazy dog”
https://courses.grainger.illinois.edu/cs447/sp2023/Slides/Lecture11.pdf
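The sentence → clause → phrase → word hierarchy can be sketched as a nested data structure. The bracketing and the labels (S, CLAUSE, NP, VP, PP, CONJ) below are assumptions made by hand for illustration, not the output of a parser.

```python
# Hand-built tree for the example sentence. Each node is (label, children);
# a child is either another node or a plain word (string).
# Assumption: this bracketing is illustrative, not produced by a parser.
tree = ("S", [
    ("CLAUSE", [
        ("NP", ["The", "brown", "fox"]),
        ("VP", ["is", "quick"]),
    ]),
    ("CONJ", ["and"]),
    ("CLAUSE", [
        ("NP", ["he"]),
        ("VP", ["is", "jumping"]),
        ("PP", ["over", "the", "lazy", "dog"]),
    ]),
])


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def depth(node):
    """Number of labelled levels above the words (a word has depth 0)."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(child) for child in node[1])


print(" ".join(leaves(tree)))  # reads the sentence back off the tree
print(depth(tree))             # 3: sentence, clause, phrase
```

Reading the leaves left to right recovers the original sentence, while the labelled levels keep the structure that a flat word list loses.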
Word
• Words are the smallest units in a language that are independent and have a meaning of their own.
– Although morphemes are the smallest distinctive units, they are not independent like words, and a word can comprise several morphemes.
• Parts of speech (POS) group words into the major syntactic categories.
PoS Tags
• POS tags capture syntactic behaviour, not semantic meaning.
• A tag names an “equivalence class”: a category of linguistic entities that behave the same way.
– Members can perform similar kinds of functions
– Members can undergo similar transformations
Traditional PoS Tags (Word classes / Lexical Classes)
• N(oun) – objects/entities
– E.g., flower, dog, cat, book
• V(erb) – actions
– E.g., running, jumping, read, write, eat, have, is, go, like
• Adj(ective) – describe/qualify other words
– E.g., beautiful (beautiful flower), big, crazy, four, yellow
• Adv(erb) – modifiers
– E.g., (ADV) very (ADJ) beautiful (N) flower
– E.g., slowly, quietly, too
• DET(erminer) – a, an, the, each, some, which
• CONJ(unction) – and, or, but, because, if, for, when, yet
• PRON(oun) – I, he, she, it, they, this, that, him, her
• P(reposition) – on (on the shelf), before (before noon)
• INT(erjection) – Ouch! Wow! Great! Help! Oh! Hey! Hi!
• POS tagging means labelling words with their appropriate part of speech.
What is POS tagging?
• POS tagging is one of the first steps in the traditional NLP pipeline (right after tokenization and segmentation).
• It is the process of assigning a part-of-speech or lexical class marker to each word in a corpus.
• The tag explains how a word is used in a sentence.
• Input: a sequence of word tokens w
• Output: a sequence of part-of-speech tags t, one per word
• NB: Although many neural models don’t use POS tagging, it is still important to understand what makes POS tagging difficult (or easy), and how the basic models and algorithms work.
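The input/output contract of a tagger (tokens in, one tag per token out) can be illustrated with a most-frequent-tag baseline. This is a minimal sketch, assuming a tiny hand-tagged sample with Penn Treebank style tags; the data, the function names, and the default tag `NN` for unknown words are all choices made here for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged sample (Penn Treebank style tags); illustrative only.
TRAIN = [
    [("I", "PRP"), ("am", "VBP"), ("writing", "VBG"), ("a", "DT"), ("letter", "NN")],
    [("My", "PRP$"), ("writing", "NN"), ("is", "VBZ"), ("not", "RB"), ("good", "JJ")],
    [("She", "PRP"), ("is", "VBZ"), ("writing", "VBG"), ("a", "DT"), ("book", "NN")],
]


def train_most_frequent(tagged_sentences):
    """Map each lower-cased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag(tokens, lexicon, default="NN"):
    """Input: word tokens w. Output: one (word, tag) pair per token."""
    return [(word, lexicon.get(word.lower(), default)) for word in tokens]


lexicon = train_most_frequent(TRAIN)
tokens = ["My", "writing", "is", "good"]
tagged = tag(tokens, lexicon)
print(tagged)
```

In this sample "writing" is seen twice as VBG and once as NN, so the baseline tags it VBG even in "My writing is good", where it is a noun. A lookup that ignores context cannot do better.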
Closed vs. Open Class Words
• Closed class: a relatively fixed set that does not readily accept new members
– Prepositions: at, on, in, of, by, with …
– Auxiliaries: shall, could, am, is, are, may, can, will, has, had, been, …
– Pronouns: I, we, you, them, he, she, it, their, mine, her, his, …
• Open class: accepts new members
– English has four: nouns, verbs, adjectives, adverbs
Beyond English…
Chinese: no verb/adjective distinction!
• 漂亮: beautiful/to be beautiful
Sinhala
• https://github.com/nlpcuom/Sinhala-POS-Data/blob/master/Tagging%20Guide.pdf
Why POS Tagging?
• Parsing a sentence usually starts with POS tagging
• Finding phrases (shallow parsing) requires POS tagging
– Noun phrases, verb phrases, adverbial phrases, etc.
• POS tags are beneficial for word sense disambiguation
• Applications in all areas of Text Mining
– NER: ~10% boost using POS-features for single-token entities
– NER: ~20% boost using POS-tags during post-processing of multi-token entities
Applications of POS Tagging
• Speech recognition/synthesis
– object (noun vs verb) https://dictionary.cambridge.org/pronunciation/english/object
– discount (noun vs verb vs adjective) https://dictionary.cambridge.org/pronunciation/english/discount
• Word prediction
– E.g., possessive pronouns are used after the noun they refer back to: “Saman’s cat is healthier than mine”, “My phone is dead. Pass me yours.”
– E.g., personal pronouns are followed by verbs: “I am Sri Lankan.”, “He is tall.”
• Spelling and grammar checking
• Machine translation
• Language-based information retrieval on the web
POS Tagging tools
• NLTK
• TextBlob
• SpaCy
• Stanford CoreNLP
• Memory-Based Shallow Parser (MBSP)
• Apache OpenNLP
• Apache Lucene
Penn Treebank Tagset: 45 Tags
Tag sets
• For English, the Penn Treebank is the most common tag set.
– Hand-annotated corpus of the Wall Street Journal, 1 million words
• Tag sets differ across languages
• Even within one language, word classes are not uniformly defined
– London-Lund Corpus of Spoken English: 197 tags
– Lancaster-Oslo/ Bergen: 135 tags
– Penn tag set: 45 tags
– Brown tag set: 87 tags
– STTS (Stuttgart-Tübingen Tag set): ~50 tags
Universal POS Tags
• Universal POS tags are the part-of-speech labels used in Universal Dependencies (UD), a project developing cross-linguistically consistent treebank annotation for many languages.
• 17 POS tags in the current UD scheme (the earlier universal tagset it builds on had 12)
• Language-specific tag sets are mapped onto the universal tags
• https://universaldependencies.org/u/pos/
Try your hand at tagging…
• My writing is not good.
• I am writing a letter.
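Mapping a language-specific tag set onto the universal tags can be sketched as a lookup table. The table below is a partial, simplified assumption covering a handful of Penn Treebank tags; a real conversion depends on context (for example, IN can be ADP or SCONJ, and auxiliary verbs map to AUX), which this sketch ignores.

```python
# Partial, simplified mapping from Penn Treebank tags to UD universal POS tags.
# Assumption: context-dependent cases (IN as SCONJ, auxiliaries as AUX, ...)
# are deliberately ignored here.
PENN_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET", "IN": "ADP", "PRP": "PRON", "PRP$": "PRON",
    "CC": "CCONJ", "CD": "NUM", "UH": "INTJ",
}


def to_universal(tagged_tokens):
    """Replace each Penn tag with its universal tag; unmapped tags become X."""
    return [(word, PENN_TO_UPOS.get(tag, "X")) for word, tag in tagged_tokens]


penn = [("The", "DT"), ("lazy", "JJ"), ("dog", "NN"), ("jumped", "VBD")]
universal = to_universal(penn)
print(universal)
```

Several fine-grained tags collapse into one coarse tag (all six verb forms become VERB), which is what makes the universal set comparable across languages.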
Try your hand at tagging…
• The back door
• On my back
• Win the voters back
• Promised to back the bill
• I hope that she wins
• That day was nice
• You can go that far
Why is POS tagging hard?
• Ambiguity in English
– 11.5% of word types are ambiguous in the Brown corpus
– 40% of word tokens are ambiguous in the Brown corpus
– Annotator disagreement in the Penn Treebank: 3.5%
• A word can have more than one POS tag (depending on the context)
– The back door -> JJ (adjective)
– On my back -> NN (noun)
– Win the voters back -> RB (adverb)
– Promised to back the bill -> VB (verb, base form)
Ambiguity
• Common POS ambiguities in English:
– Noun-Verb: table
– Adjective-Verb: laughing, known
– Noun-Adjective: normal
• A word is ambiguous if it has more than one POS
– Unless we have a dictionary that gives all POS tags for each word, we only know the POS tags with which a word appears in our corpus.
– Since many words appear only once (or a few times) in any given corpus, we may not know all of their POS tags.
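The difference between type ambiguity and token ambiguity can be measured directly on a tagged corpus. A minimal sketch, assuming a toy corpus built from the four "back" phrases with hand-assigned Penn tags (the numbers it prints describe only this toy corpus, not the Brown corpus):

```python
from collections import defaultdict

# Toy tagged corpus built from the four "back" phrases; tags assigned by hand.
TAGGED = [
    ("the", "DT"), ("back", "JJ"), ("door", "NN"),
    ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
    ("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB"),
    ("promised", "VBD"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN"),
]

# Collect the set of tags observed for each distinct word (type).
tags_per_word = defaultdict(set)
for word, tag in TAGGED:
    tags_per_word[word].add(tag)

# A type is ambiguous if it was seen with more than one tag.
ambiguous = {word for word, tags in tags_per_word.items() if len(tags) > 1}

# Share of distinct words that are ambiguous.
type_ambiguity = len(ambiguous) / len(tags_per_word)
# Share of running words that belong to an ambiguous type.
token_ambiguity = sum(1 for word, _ in TAGGED if word in ambiguous) / len(TAGGED)

print(ambiguous)                  # {'back'}
print(round(type_ambiguity, 3))   # 0.1   (1 of 10 types)
print(round(token_ambiguity, 3))  # 0.267 (4 of 15 tokens)
```

Token ambiguity is higher than type ambiguity because ambiguous words tend to be frequent ones, the same pattern as the 11.5% versus 40% figures for the Brown corpus.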
Why is POS tagging hard?
• Due to ambiguity (and unknown words), we cannot rely on
a dictionary to look up the correct POS tags.
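Since a dictionary lookup cannot choose between several possible tags, a tagger has to use context. A minimal sketch, assuming a few hand-written rules for the single word "back" (the rules, the determiner list, and the function name are hypothetical and cover only the four example phrases; practical taggers learn such patterns from annotated data):

```python
# Hypothetical hand-written context rules for the word "back".
# Assumption: these rules are tuned to the four example phrases only.
DETERMINERS = {"the", "a", "an", "my", "your", "his", "her", "our", "their"}


def tag_back(tokens, i):
    """Guess a Penn tag for tokens[i] == 'back' from its neighbours."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    has_next = i + 1 < len(tokens)
    if prev == "to":
        return "VB"                        # "promised to back the bill"
    if prev in DETERMINERS:
        # Followed by another word: modifier; phrase-final: the noun itself.
        return "JJ" if has_next else "NN"  # "the back door" / "on my back"
    return "RB"                            # "win the voters back"


phrases = [
    "The back door",
    "On my back",
    "Win the voters back",
    "Promised to back the bill",
]
results = []
for phrase in phrases:
    tokens = phrase.split()
    position = [t.lower() for t in tokens].index("back")
    results.append(tag_back(tokens, position))
print(results)  # ['JJ', 'NN', 'RB', 'VB']
```

The same word receives four different tags purely from its neighbours, which is the information a plain dictionary lookup throws away.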
Chunking
• It is the process of extracting phrases from unstructured text.
• Groups of words are called “chunks”.
• Also known as shallow/light parsing.
• Multi-word names such as New York or Sri Lanka should be treated as a single unit
• Chunk tags include Noun Phrase (NP), Verb Phrase (VP), etc.
Phrases
• A phrase can be a single word or a combination of words, based on the syntax and position of the phrase in a clause or sentence.
• Phrases do not make sense on their own, as they do not have both a subject and a predicate (the verb-based part).
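Chunking works on POS-tagged tokens, so a noun-phrase chunker can be sketched as a pattern over tags. The example below is a minimal sketch, assuming hand-assigned Penn tags and one simple pattern (optional determiner, any adjectives, one or more nouns); libraries such as NLTK offer a regular-expression chunker (`nltk.RegexpParser`) for the same idea, which is not used here so that the example stays self-contained.

```python
# Example sentence with Penn tags assigned by hand (an assumption, not tagger output).
TAGGED = [
    ("The", "DT"), ("brown", "JJ"), ("fox", "NN"), ("is", "VBZ"), ("quick", "JJ"),
    ("and", "CC"), ("he", "PRP"), ("is", "VBZ"), ("jumping", "VBG"),
    ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]


def np_chunks(tagged):
    """Extract noun-phrase chunks matching: optional DT, any JJ*, one or more NN*."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                         # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):   # any adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):   # one or more nouns
            k += 1
        if k > j:                                        # found at least one noun
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:                                            # no chunk starts here
            i += 1
    return chunks


chunks = np_chunks(TAGGED)
print(chunks)  # ['The brown fox', 'the lazy dog']
```

The adjective "quick" is skipped because no noun follows it, and the pronoun "he" is not matched because this simple pattern covers only noun-headed chunks.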
Types of Phrases (1/3)
• Noun phrase (NP): acts as a subject or object to a verb.
– It is a set of words that can be replaced by a pronoun without making the sentence or clause syntactically incorrect.
– E.g., dessert, the lazy dog, the brown fox, an intelligent person, my car, his mother, each person, every dream, any plans, a lot of friends, some friends, two boys
– E.g., The person in the black shirt is a neighbor, Some people under your leadership are doing great.
• Verb phrase (VP): These phrases are lexical units that have a verb acting as the head word.
– E.g., “has started” in “he has started”
Types of Phrases (2/3)
• Adjective phrase (ADJP): These are phrases with an adjective as the head word.
– They describe or qualify nouns and pronouns in a sentence.
– E.g., The cat is too quick, Any person smarter than you can do it.
• Adverb phrase (ADVP): These phrases act like adverbs since the adverb acts as the head word in the phrase.
– Adverb phrases are used as modifiers for verbs, adjectives, or other adverbs, providing further details that describe or qualify them.
– E.g., The train should be at the station pretty soon
https://englishwithashish.com/adverb-phrases/
https://englishwithashish.com/adjective-phrases-examples/
Types of Phrases (3/3)
• Prepositional phrase (PP): These phrases usually contain a preposition as the head word and other lexical components like nouns, pronouns, and so on.
– It acts like an adjective or adverb describing other words or phrases
– E.g., “up the stairs” in “going up the stairs”
Libraries
• NLTK
• SpaCy
• TextBlob
Summary
• POS tagging is the process of marking up each word in a text with its part of speech, based on the word's definition and context.
• Chunking is the process of grouping words into larger syntactic units such as noun phrases and verb phrases.