
UNIT II

Natural Language Processing (NLP)

2.1. Syntax: Parsing Natural Language


Parsing in Natural Language Processing (NLP) refers to the process of analyzing a sentence's structure according to a formal grammar. It involves breaking down a sentence into its constituent parts (such as words, phrases, and clauses) and identifying the syntactic relationships between them. Parsing is a fundamental step in understanding the meaning of a sentence and is used in tasks like machine translation, question answering, and sentiment analysis.

Parsing uncovers the hidden structure of linguistic input. In many applications involving natural language, the underlying predicate-argument structure of sentences can be useful. The syntactic analysis of language provides a means to explicitly discover the various predicate-argument dependencies that may exist in a sentence.

There are two main types of parsing in NLP:

1. Syntactic Parsing:
o Focuses on the grammatical structure of a sentence.
o Produces a parse tree that represents the syntactic hierarchy of the sentence.
o Common approaches include constituency parsing and dependency parsing.

2. Semantic Parsing:
o Focuses on understanding the meaning of a sentence.
o Produces a representation of the sentence's meaning, often in the form of logical expressions or structured data.

Predicate-argument structure is a fundamental concept in linguistics that describes the relationships between verbs (predicates) and the entities involved in the actions or states they describe (arguments).

Ambiguity: It arises when a word, phrase, or sentence has multiple possible interpretations or meanings. Ambiguity can occur at various levels of language processing, including lexical, syntactic, and semantic levels.


In syntactic parsing, ambiguity is a particularly difficult problem, since the most plausible analysis has to be chosen from an exponentially large number of alternative analyses.

1. Parsing Natural Language

In a text-to-speech application, input sentences are converted to a spoken output that should sound as if it were spoken by a native speaker of the language.

Types of Syntactic Parsing

1. Constituency Parsing
 Breaks a sentence into sub-phrases or constituents.
 Uses a phrase structure grammar to represent the sentence as a tree of nested phrases.

2. Dependency Parsing
 Focuses on the relationships between words in a sentence.
 Represents the sentence as a directed graph, where nodes are words and edges represent syntactic dependencies.
 Example:
o Sentence: "The cat sat on the mat."
o Dependency Parse:
sat (root)
├── cat (nsubj)
│ └── The (det)
└── on (prep)
└── mat (pobj)
└── the (det)
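In practice, dependency parses like the one above can be produced with an off-the-shelf parser. The following minimal sketch uses the spaCy library (assuming the en_core_web_sm model has been installed separately); the exact labels depend on the model version.

import spacy

nlp = spacy.load("en_core_web_sm")          # small English pipeline, installed separately
doc = nlp("The cat sat on the mat.")
for token in doc:
    # token.head is the syntactic head; token.dep_ is the dependency label
    print(f"{token.text:<5} --{token.dep_}--> {token.head.text}")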

Phrase structure grammar is a system for describing the structure of sentences in a language. It breaks down sentences into their constituent parts (phrases) and shows how those parts relate to each other.

Consider the sentence "The cat sat on the mat." Here's how phrase structure grammar would analyze it:

1. Sentence (S): The entire sentence is the top-level constituent.
2. Noun Phrase (NP): "The cat" is a noun phrase, as it functions as the subject.
3. Verb Phrase (VP): "sat on the mat" is a verb phrase, as it contains the verb and its related elements.
4. Prepositional Phrase (PP): "on the mat" is a prepositional phrase, modifying the verb "sat."

Key Concepts in Phrase Structure Grammar

1. Constituents:
o A constituent is a group of words that function as a single unit within a sentence.
o Examples: Noun Phrase (NP), Verb Phrase (VP), Prepositional Phrase (PP).

2. Phrase Structure Rules:
o These rules define how constituents can be combined to form larger structures.
o Example:
 S -> NP VP (A sentence consists of a Noun Phrase followed by a Verb Phrase).
 NP -> Det N (A Noun Phrase consists of a Determiner followed by a Noun).

3. Parse Tree:
o A tree diagram that represents the hierarchical structure of a sentence according to the phrase structure rules.

o Example: S
├── NP
│ ├── Det (The)
│ └── N (cat)
└── VP
├── V (sat)
└── PP
├── P (on)
└── NP
├── Det (the)
└── N (mat)
4. Terminals and Non-Terminals:
o Terminals: The actual words in a sentence (e.g., "cat", "sat").
o Non-Terminals: The syntactic categories (e.g., NP, VP, Det).

Example of Phrase Structure Grammar

Consider the sentence: "The cat sat on the mat."

Phrase Structure Rules:

1. S -> NP VP
2. NP -> Det N
3. VP -> V PP
4. PP -> P NP
5. Det -> 'the'
6. N -> 'cat' | 'mat'
7. V -> 'sat'
8. P -> 'on'
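These rules can be tried out directly. The short sketch below encodes them with NLTK's CFG class and parses the (lowercased) example sentence; nltk.CFG and nltk.ChartParser are assumed to be available from a standard NLTK installation.

import nltk

# Phrase structure rules from the example above (terminals are lowercased)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V PP
PP -> P NP
Det -> 'the'
N -> 'cat' | 'mat'
V -> 'sat'
P -> 'on'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat sat on the mat".split()):
    tree.pretty_print()   # draws the parse tree as ASCII art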

Another motivation for parsing comes from the natural language task of summarization, in which several documents about the same topic should be condensed down to a small digest of information, typically limited in size to 100 or 250 words.

Types of Summarization

1. Extractive Summarization: Select the most important sentences or phrases from the original text and concatenate them to form a summary.

Example:
a. Original Text: "The cat sat on the mat. It was a sunny day. The cat enjoyed the warmth."
b. Extractive Summary: "The cat sat on the mat. The cat enjoyed the warmth."

2. Abstractive Summarization: Generate a summary by paraphrasing and rephrasing the original text, often producing new sentences that capture the essence of the content.

Example:
o Original Text: "The cat sat on the mat. It was a sunny day. The cat enjoyed the warmth."
o Abstractive Summary: "The cat relaxed on the mat, enjoying the sunny weather."
In contemporary natural language processing, syntactic parsers are routinely used in many applications including but not limited to:

1. statistical machine translation
2. information extraction from text collections
3. language summarization
4. producing entity grids for language generation
5. error correction in text
6. knowledge acquisition from language
7. language models for speech recognition systems and dialog systems
8. text-to-speech systems.


2.2. Treebanks: A Data-Driven Approach to Syntax

Treebanks are linguistic resources that consist of a collection of sentences, each annotated with its syntactic structure in the form of a parse tree. These parse trees represent the hierarchical structure of sentences, breaking them down into constituents like phrases and clauses. Treebanks are essential for data-driven approaches to syntax, enabling the development and evaluation of syntactic parsers and other NLP tools.

What is a Treebank?

 A treebank is a corpus of text where each sentence is annotated with its syntactic structure.
 The annotation typically includes:
o Part-of-speech (POS) tags for each word.
o Phrase structure (constituency) or dependency relations between words.
 Treebanks are used to train and evaluate parsers and other NLP models.

A corpus of text (plural: corpora) is a large and structured collection of texts, typically used for linguistic analysis, natural language processing (NLP), and machine learning. Corpora are essential resources for studying language patterns, training models, and developing NLP applications.

Parsing recovers information that is not explicit in the input sentence. This implies that a parser requires some knowledge, in addition to the input sentence, about the kind of syntactic analysis that should be produced as output.

One method to provide such knowledge to the parser is to write down a grammar of the language -- a set of rules of syntactic analysis. For instance, one might write down the rules of syntax as a context-free grammar (CFG).

A Context-Free Grammar (CFG) is a formal grammar widely used in Natural Language Processing (NLP) to model the syntactic structure of sentences. It is a type of phrase-structure grammar that defines how sentences in a language can be generated by recursively combining smaller units (e.g., words, phrases) into larger structures. CFGs are particularly useful for tasks like parsing, syntax analysis, and language generation.


1. Definition of Context-Free Grammar

A CFG consists of four components:

1. Non-Terminal Symbols (N):
o Represent syntactic categories or phrases (e.g., S for sentence, NP for noun phrase, VP for verb phrase).
2. Terminal Symbols (T):
o Represent the actual words in the language (e.g., "cat", "run").
3. Production Rules (P):
o Define how non-terminals can be rewritten as sequences of terminals and non-terminals.
o Example: S → NP VP (a sentence can be composed of a noun phrase followed by a verb phrase).
4. Start Symbol (S):
o The root non-terminal from which all derivations begin (e.g., S for sentence).

2. Example of a CFG

Consider a simple CFG for a subset of English:

 Non-Terminals: {S, NP, VP, Det, N, V}
 Terminals: {the, cat, dog, chased, slept}
 Production Rules:
S → NP VP
NP → Det N
VP → V NP | V
Det → the
N → cat | dog
V → chased | slept
 Start Symbol: S
Example Derivation:

 Sentence: "The cat chased the dog."
 Derivation:
S → NP VP
→ Det N VP
→ the N VP
→ the cat VP
→ the cat V NP
→ the cat chased NP
→ the cat chased Det N
→ the cat chased the N
→ the cat chased the dog

3. Parse Trees

A CFG generates a parse tree that represents the syntactic structure of a sentence. Each node in the tree corresponds to a non-terminal symbol, and the leaves are terminal symbols (words).

Example Parse Tree for "The cat chased the dog":


4. Applications of CFGs in NLP

CFGs are used in various NLP tasks, including:

1. Syntactic Parsing:
o Building parse trees for sentences to analyze their structure.
o Example: The Stanford Parser uses CFGs for constituency parsing.
2. Language Generation:
o Generating grammatically correct sentences by following CFG rules.
3. Grammar Checking:
o Detecting syntactic errors in text by comparing it to a CFG.
4. Machine Translation:
o Using CFGs to model the syntax of source and target languages.
5. Question Answering:
o Parsing questions and answers to extract relevant information.

The following CFG (written in a simple Backus-Naur form) represents a simple grammar of transitive verbs in English: verbs (V) that have a subject and object noun phrase (NP), plus modifiers of verb phrases (VP) in the form of prepositional phrases (PP).

S -> NP VP
NP -> `John' | `pockets' | D N | NP PP
VP -> V NP | VP PP
V -> `bought'
D -> `a'
N -> `shirt'
PP -> P NP
P -> `with'

Natural language grammars typically have the words w as terminal symbols in the CFG, and they are generated by rules of type X -> w, where X is the part of speech for the word w. For example, in the above CFG, the rule V -> `bought' has the part-of-speech symbol V generating the verb `bought'. Such non-terminals are called part-of-speech tags or pre-terminals.
The above CFG can produce a syntax analysis of a sentence like `John bought a shirt with pockets', with S as the start symbol of the grammar.

Parsing the sentence with the CFG rules gives us two possible derivations for this sentence. In one parse, pockets are a kind of currency which can be used to buy a shirt; in the other parse, which is the more plausible one, John is purchasing a kind of shirt that has pockets.
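As an illustration, the same grammar can be handed to NLTK, which then returns both analyses; this is only a sketch assuming a standard NLTK installation.

import nltk

# The transitive-verb grammar from the text, in NLTK's CFG notation
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'John' | 'pockets' | D N | NP PP
VP -> V NP | VP PP
V -> 'bought'
D -> 'a'
N -> 'shirt'
PP -> P NP
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
# Prints two trees: PP attached to the VP (buying with pockets)
# and PP attached to the NP (a shirt that has pockets)
for tree in parser.parse("John bought a shirt with pockets".split()):
    print(tree)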

However, writing down a CFG for the syntactic analysis of natural language is problematic. Unlike a programming language, natural language is far too complex to simply list all the syntactic rules in terms of a CFG.

A simple list of rules does not consider interactions between different components in the grammar. We could extend this grammar to include other types of verbs and other syntactic constructions, but listing all possible syntactic constructions in a language is a difficult task.

In addition, it is difficult to exhaustively list lexical properties of words, for instance, listing all the grammar rules in which a particular word can be a participant. This is a typical knowledge acquisition problem.

Apart from this knowledge acquisition problem, there is another less apparent problem: it turns out that the rules interact with each other in combinatorially explosive ways.

Consider a simple CFG that provides a syntactic analysis of noun phrases as a binary branching tree:

N -> N N
N -> `natural' | `language' | `processing' | `book'


Recursive rules produce ambiguity:

A recursive rule, in the context of formal grammars, is a production rule that allows a non-terminal symbol to be rewritten in terms of itself, either directly or indirectly.

With N as the start symbol, for the input `natural' there is one parse tree (N natural); for the input `natural language' we use the recursive rule once and obtain one parse tree (N (N natural) (N language)); for the input `natural language processing' we use the recursive rule twice in each parse and there are two ambiguous parses.

Note that the ambiguity in the syntactic analysis reflects a real ambiguity: is it a processing of natural language, or is it a natural way to do language processing?

So this issue cannot be resolved by changing the formalism in which the rules are written -- e.g. by using finite-state automata, which can be deterministic but cannot simultaneously model both meanings in a single grammar. Any system of writing down syntactic rules should represent this ambiguity.

However, by using the recursive rule 3 times we get 5 parses for `natural language processing book', and for longer and longer input noun phrases, using the recursive rule 4 times we get 14 parses, using it 5 times we get 42 parses, and using it 6 times we get 132 parses. In fact, for CFGs it can be proved that the number of parses obtained by using the recursive rule n times is the Catalan number of n: Cat(n) = (2n)! / ((n+1)! n!).
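The quoted counts can be checked with a few lines of Python; this small sketch simply evaluates the Catalan formula above.

from math import comb

def catalan(n):
    # Cat(n) = (2n)! / ((n+1)! n!) = C(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(1, 7)])   # [1, 2, 5, 14, 42, 132]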

This is a second knowledge acquisition problem -- not only do we need to know the syntactic rules for a particular language, but we also need to know which analysis is the most plausible for a given input sentence.

The construction of a treebank is a data-driven approach to syntax analysis that allows us to address both of these knowledge acquisition bottlenecks in one stroke.

A treebank is simply a collection of sentences (also called a corpus of text), where each sentence is provided a complete syntax analysis. The syntactic analysis for each sentence has been judged by a human expert as the most plausible analysis for that sentence.

There is no set of syntactic rules or linguistic grammar explicitly provided by a treebank, and typically there is no list of syntactic constructions provided explicitly in a treebank. In fact, no exhaustive set of rules is even assumed to exist, even though assumptions about syntax are implicit in a treebank. A detailed set of assumptions about syntax is typically used as an annotation guideline to help the human experts produce the single most plausible syntactic analysis for each sentence in the corpus.

The consistency of syntax analysis in a treebank is measured using inter-annotator agreement, by having approximately 10% of the material (overlapped between annotators) annotated by more than one annotator.

Treebanks provide a solution to the two kinds of knowledge acquisition bottlenecks. Treebanks provide annotations of syntactic structure for a large sample of sentences. We can use supervised machine learning methods in order to train a parser to produce a syntactic analysis for input sentences by generalizing appropriately from the training data extracted from the treebank.

Treebanks solve the first knowledge acquisition problem of finding the grammar underlying the syntax analysis, since the syntactic analysis is directly given instead of a grammar. In fact, the parser does not necessarily need any explicit grammar rules as long as it can faithfully produce a syntax analysis for an input sentence, although the information used by the trained parser can be said to represent a set of implicit grammar rules.

Treebanks solve the second knowledge acquisition problem as well. Since each sentence in a treebank has been given its most plausible syntactic analysis, supervised machine learning methods can be used to learn a scoring function over all possible syntax analyses.

A statistical parser trained on the treebank tries to mimic the human annotation decisions by using indicators from the input and previous decisions made in the parser itself to learn such a scoring function. For a given sentence not seen in the training data, a statistical parser can use this scoring function to return the syntax analysis that has the highest score, which is taken to be the most plausible analysis for that sentence. The scoring function can also be used to produce the k-best syntax analyses for a sentence.

There are two main approaches to syntax analysis which are used to construct treebanks: dependency graphs and phrase structure trees. These two representations are very closely related to each other, and under some assumptions one representation can be converted to another.

Dependency analysis is typically favoured for languages such as Czech, Turkish, etc., that have free(er) word order, where the arguments of a predicate are often seen in different orderings in the sentence, while phrase-structure analysis is often used to provide additional information about long-distance dependencies, mostly in languages like English, French, etc., where the word order is less flexible.

2.3 Representation of Syntactic Structure

The representation of syntactic structure in natural language processing (NLP) involves capturing the grammatical organization of a sentence. This is crucial for understanding how words relate to each other and how they contribute to the overall meaning of the sentence. There are two primary approaches to representing syntactic structure:

1. Constituency-Based Representation (Phrase Structure Trees)
2. Dependency-Based Representation (Dependency Graphs)

1. Constituency-Based Representation (Phrase Structure Trees)

Constituency-based representation organizes words into nested phrases, forming a hierarchical tree structure. This approach is based on phrase structure grammar, where sentences are broken down into constituents (phrases) such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP).

Key Features:

 Hierarchical Structure: Words are grouped into phrases, which are further grouped into larger phrases.
 Non-Terminal Nodes: Represent phrases (e.g., NP, VP, PP).
 Terminal Nodes: Represent individual words.
 Rules: Defined by a grammar (e.g., context-free grammar).

Example:

Sentence: "The cat sat on the mat."

Phrase Structure Tree:

Explanation:

 S: Sentence
 NP: Noun Phrase ("The cat")
 VP: Verb Phrase ("sat on the mat")
 PP: Prepositional Phrase ("on the mat")
 Det: Determiner ("The", "the")
 N: Noun ("cat", "mat")
 V: Verb ("sat")
 P: Preposition ("on")
2. Dependency-Based Representation (Dependency Graphs)

Dependency graphs in Natural Language Processing (NLP) are a way to represent the syntactic structure of a sentence by showing the relationships between words. These graphs are based on the concept of dependency grammar, where words are connected by directed edges that represent grammatical relationships. Each word in a sentence (except the root) is linked to exactly one other word, called its head, and the relationship between them is labeled with a specific syntactic role.

The main philosophy behind dependency graphs is to connect a word -- the head of a phrase -- with the dependents in that phrase. The notation connects a head with its dependents using directed (asymmetric) connections.

Dependency graphs, just like phrase structure trees, are a representation that is consistent with many different linguistic frameworks. The words in the input sentence are treated as the only vertices in the graph, which are linked together by directed arcs representing syntactic dependencies.

Key Features:

 Directed Edges: Show the relationship between a word (dependent) and its head.
 Labels: Describe the type of dependency (e.g., subject, object, modifier).
 Root: The main verb of the sentence, which has no incoming edges.

Types of Dependency Relationships

Dependency labels describe the syntactic role of a word in relation to its head. Some common dependency labels include:

 nsubj: Nominal subject (e.g., "She" in "She runs")
 dobj: Direct object (e.g., "book" in "She reads a book")
 amod: Adjectival modifier (e.g., "red" in "red apple")
 advmod: Adverbial modifier (e.g., "quickly" in "runs quickly")
 prep: Prepositional modifier (e.g., "on" in "book on the table")
 det: Determiner (e.g., "the" in "the book")
 pobj: Object of a preposition (e.g., "mat" in "on the mat")
 cc: Coordinating conjunction (e.g., "and" in "apples and oranges")
 conj: Conjunct (e.g., "oranges" in "apples and oranges")


Example:

Sentence: "The cat sat on the mat."

sat (root)
├── cat (nsubj)
│   └── The (det)
└── on (prep)
    └── mat (pobj)
        └── the (det)

Explanation:

 "sat" is the root (main verb).
 "cat" is the subject (nsubj) of "sat".
 "The" is a determiner (det) modifying "cat".
 "on" is a preposition (prep) depending on "sat".
 "mat" is the object of the preposition (pobj).
 "the" is a determiner (det) modifying "mat".

In dependency-based syntactic parsing, the task is to derive a syntactic structure for an input sentence by identifying the syntactic head of each word in the sentence. This defines a dependency graph, where the nodes are the words of the input sentence and the arcs are the binary relations from head to dependent.

There are many variants of dependency-style syntactic analysis, but the basic textual format for a dependency tree can be written in the following form, where each dependent word specifies the head word in the sentence, and exactly one word is dependent on the root of the sentence.
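For illustration only, one common plain-text convention (a simplified CoNLL-style listing, not necessarily the exact format intended here) gives each word an index, the index of its head, and the dependency label, with head index 0 marking the root:

1  The   2  det
2  cat   3  nsubj
3  sat   0  root
4  on    3  prep
5  the   6  det
6  mat   4  pobj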


An important notion in dependency analysis is the notion of projectivity, which is a constraint imposed by the linear order of words on the dependencies between words. A projective dependency tree is one where, if we put the words in a linear order based on the sentence with the root symbol in the first position, the dependency arcs can be drawn above the words without any crossing dependencies.

Comparison of Constituency and Dependency Representations

Feature        | Constituency-Based (Phrase Structure Trees) | Dependency-Based (Dependency Graphs)
Structure      | Hierarchical, nested phrases                | Flat, directed graph
Focus          | Grouping words into phrases                 | Relationships between words
Nodes          | Non-terminal (phrases) and terminal (words) | Only words (no phrase nodes)
Edges          | Parent-child relationships in a tree        | Directed dependencies between words
Use Cases      | Grammar checking, sentence generation       | Machine translation, information extraction
Example Tools  | Stanford Parser, NLTK                       | spaCy, Stanza, UDPipe

 Constituency-based representation uses hierarchical phrase structure trees to group words into nested phrases.
 Dependency-based representation uses directed graphs to show relationships between words.
 Both approaches are essential for understanding syntactic structure and are used in various NLP tasks.
 The choice between constituency and dependency representations depends on the specific application and the level of detail required.


2.4. Parsing Algorithms

Parsing algorithms in NLP (Natural Language Processing) are crucial for analyzing the syntactic structure of sentences. These algorithms break down a sentence into components like phrases and words and identify their relationships, such as subjects, objects, and verbs.

Given an input sentence, a parser produces an output analysis of that sentence, which we now assume is the analysis that is consistent with a treebank that is used to train the parser. Treebank parsers do not need to have an explicit grammar, but to make the explanation of parsing algorithms simpler we first consider parsing algorithms that assume the existence of a context-free grammar.

Consider the following simple CFG that can be used to derive strings such as a and b or c from the start symbol N.

N -> N `and' N
N -> N `or' N
N -> `a' | `b' | `c'

An important concept for parsing is a derivation. For the input string a and b or c the following sequence of steps, separated by the => symbol, represents a derivation:

N
=> N `or' N
=> N `or c'
=> N `and' N `or c'
=> N `and b or c'
=> `a and b or c'

In this derivation each line is called a sentential form. Furthermore, each line of the derivation applies a rule from the CFG in order to show that the input can, in fact, be derived from the start symbol N.

In the above derivation, we restricted ourselves to only expanding the rightmost non-terminal in each sentential form. This method is called the rightmost derivation of the input using a CFG. An interesting property of a rightmost derivation is revealed if we arrange the derivation in reverse order:

`a and b or c'
=> N `and b or c'    # use rule N -> a
=> N `and' N `or c'  # use rule N -> b
=> N `or c'          # use rule N -> N and N
=> N `or' N          # use rule N -> c
=> N                 # use rule N -> N or N

This derivation sequence exactly corresponds to the construction of the following parse tree from left to right, one symbol at a time.

(N (N (N a)
      and
      (N b))
   or
   (N c))

However, a unique derivation sequence is not guaranteed. There can be many different derivations, and as we have seen before the number of derivations can be exponential in the input length. For example, there is another rightmost derivation that results in the following parse tree:

(N (N a)
   and
   (N (N b)
      or
      (N c)))

`a and b or c'
=> N `and b or c'      # use rule N -> a
=> N `and' N `or c'    # use rule N -> b
=> N `and' N `or' N    # use rule N -> c
=> N `and' N           # use rule N -> N or N
=> N                   # use rule N -> N and N


There are several types of parsing algorithms commonly used in NLP:

1. Top-Down Parsing
2. Bottom-Up Parsing
3. Chart Parsing
4. Dependency Parsing
5. Constituency Parsing
6. Transition-Based Parsing
7. Neural Network-based Parsing

2.4.1. Shift Reduce Parsing

Shift-Reduce Parsing is a bottom-up parsing technique used in Natural Language Processing (NLP), particularly in dependency parsing and syntax analysis. It is a relatively simple and efficient algorithm designed to construct parse trees for sentences based on shift and reduce operations.

Shift-reduce parsing is a popular bottom-up technique used in syntax analysis, where the goal is to create a parse tree for a given input based on grammar rules. The process works by reading a stream of tokens (the input), and then working backwards through the grammar rules to discover how the input can be generated.

1. Input Buffer: This stores the string or sequence of tokens that needs to be parsed.
2. Stack: The parser uses a stack to keep track of which symbols or parts of the parse it has already processed. As it processes the input, symbols are pushed onto and popped off the stack.
3. Parsing Table: Similar to a predictive parser, a parsing table helps the parser decide what action to take next.

Shift-reduce parsing works by processing the input left to right and gradually building up a parse tree by shifting tokens onto the stack and reducing them using grammar rules, until it reaches the start symbol of the grammar.

1. Shift Operation:
o In the shift operation, the next input word is moved onto the stack. This means that the parser "shifts" the word from the input sequence and places it on a stack for further processing.
2. Reduce Operation:
o The reduce operation is applied when a specific set of conditions is met, typically when the stack has a substructure that can be reduced into a higher-level syntactic structure.
o This means combining the stack's top elements into a new structure according to a rule of the grammar (e.g., combining a determiner and a noun into a noun phrase).
3. Stack:
o A stack is where words and partial constituents are held while the parser works through the sentence.
4. Input Buffer:
o The input buffer contains the remaining words of the sentence that need to be processed.
5. Goal:
o The goal of a shift-reduce parser is to reduce the input buffer into a single tree structure by applying a sequence of shifts and reductions.

Parsing Process:

The shift-reduce parsing algorithm operates in two stages:

1. Shift: A word is moved from the input buffer onto the stack.
2. Reduce: A combination of words (or previously reduced structures) from the stack is replaced with a higher-level structure based on grammar rules.

Example -- Consider the grammar

S -> S + S
S -> S * S
S -> id

Perform shift-reduce parsing for the input string "id + id + id".
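One possible sequence of moves is sketched below (the grammar is ambiguous, so another reduce order, yielding a different tree, is also valid):

Stack          Input            Action
$              id + id + id $   shift
$ id           + id + id $      reduce S -> id
$ S            + id + id $      shift
$ S +          id + id $        shift
$ S + id       + id $           reduce S -> id
$ S + S        + id $           reduce S -> S + S
$ S            + id $           shift
$ S +          id $             shift
$ S + id       $                reduce S -> id
$ S + S        $                reduce S -> S + S
$ S            $                accept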


Step-by-Step Example:

Consider the sentence: "The cat sleeps"

Input: The cat sleeps

1. Stack: [] (empty stack); Input Buffer: [The, cat, sleeps]
2. Shift: Move "The" from input buffer to the stack.
o Stack: [The]
o Input Buffer: [cat, sleeps]
3. Shift: Move "cat" from input buffer to the stack.
o Stack: [The, cat]
o Input Buffer: [sleeps]
4. Reduce: Apply a reduction rule (e.g., "Determiner + Noun" → Noun Phrase).
o Stack: [NP(The, cat)] (Now, "The cat" is reduced to a noun phrase)
o Input Buffer: [sleeps]
5. Shift: Move "sleeps" from input buffer to the stack.
o Stack: [NP(The, cat), sleeps]
o Input Buffer: []
6. Reduce: Apply a reduction rule (e.g., "Noun Phrase + Verb" → Sentence).
o Stack: [Sentence(NP(The, cat), sleeps)]
o Input Buffer: []

Now the stack contains the complete parse tree of the sentence, represented as a "Sentence" containing a noun phrase and a verb.
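NLTK ships a small demonstration shift-reduce parser that can replay these steps on a toy grammar; the sketch below assumes a standard NLTK installation (note that NLTK's ShiftReduceParser is greedy and may fail to find a parse for more complex grammars).

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V
Det -> 'the'
N -> 'cat'
V -> 'sleeps'
""")

# trace=2 prints each shift and reduce action as it happens
parser = nltk.ShiftReduceParser(grammar, trace=2)
for tree in parser.parse("the cat sleeps".split()):
    print(tree)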

Every CFG turns out to have an automaton that is equivalent to it, called a pushdown automaton (just like regular expressions can be converted to finite-state automata). A pushdown automaton is simply a finite-state automaton with some additional memory in the form of a stack (or pushdown). This is a limited amount of memory since only the top of the stack is used by the machine.

This provides an algorithm for parsing that is general for any given CFG and input string. The algorithm is called shift-reduce parsing, which uses two data structures, a buffer for input symbols and a stack for storing CFG symbols, and is defined as follows:

1. Start with an empty stack and the buffer containing the input string.
2. Exit with success if the top of the stack contains the start symbol of the grammar and the buffer is empty.
3. Choose between the following two steps (if the choice is ambiguous, choose one based on an oracle):
   - Shift a symbol from the buffer onto the stack.
   - If the top k symbols of the stack are α1 ... αk, corresponding to the right-hand side of a CFG rule A -> α1 ... αk, then replace the top k symbols with the left-hand-side non-terminal A.
4. Exit with failure if no action can be taken in the previous step.
5. Else, go to Step 2.

At each step the parser has a choice: either shift a new token onto the stack, or combine the top two elements of the stack with a head -> dependent link or a dependent <- head link. When using the shift-reduce algorithm in a statistical dependency parser, it helps to combine a shift and a reduce action when possible.


2.4.2. Hypergraphs and Chart Parsing in NLP

1. Hypergraphs

A hypergraph is a generalization of a graph where an edge (called a hyperedge) can connect any number of vertices. In NLP, hypergraphs are used to represent complex structures like parse trees, dependency graphs, or semantic relations, where traditional graphs (with pairwise edges) are insufficient.

 Example: In a dependency parse, a hyperedge could represent a grammatical relation involving multiple words (e.g., a verb and its arguments).

2. Chart Parsing

Chart parsing is a dynamic programming technique used in NLP to efficiently parse sentences according to a grammar. It avoids redundant computations by storing intermediate results in a chart (a data structure like a table or graph).

 Example: Parsing the sentence "The cat sat on the mat" using a context-free grammar (CFG). The chart stores partial parses (e.g., "The cat" as an NP, "sat on the mat" as a VP).

Hypergraphs in Chart Parsing

Hypergraphs and chart parsing are two related concepts used in natural language processing (NLP) for syntactic parsing. Hypergraphs represent a generalization of traditional parse trees, allowing for more complex structures and more efficient parsing algorithms. A hypergraph consists of a set of nodes (representing words or phrases in the input sentence) and a set of hyperedges, which connect nodes and represent higher-level structures. A chart, on the other hand, is a data structure used in chart parsing to efficiently store and manipulate all possible partial parses of a sentence.

Here is an example of how chart parsing can be used to parse the sentence "the cat chased the mouse" using a simple grammar:

S -> NP VP
NP -> Det N
VP -> V NP
Det -> the
N -> cat | mouse
V -> chased

1. Initialization: We start by initializing an empty chart with the length of the input sentence (5 words) and a set of empty cells representing all possible partial parses.

2. Scanning: We scan each word in the input sentence and add a corresponding parse to the chart. For example, for the first word "the", we add a parse for the non-terminal symbol Det (Det -> the). We do this for each word in the sentence.

3. Predicting: We use the grammar rules to predict possible partial parses for each span of words in the sentence. For example, we can predict a partial parse for the span (1, 2) (i.e., the first two words "the cat") by applying the rule NP -> Det N to the parses for "the" and "cat". We add this partial parse to the chart cell for the span (1, 2).

4. Scanning again: We scan the input sentence again, this time looking for matches to predicted partial parses in the chart. For example, if we predicted a partial parse for the span (1, 2), we look for a parse for the exact same span in the chart. If we find a match, we can apply a grammar rule to combine the two partial parses into a larger parse. For example, if we find a parse for (1, 2) that matches the predicted parse for NP -> Det N, we can combine them to create a parse for the span (1, 3) and the non-terminal symbol NP.

5. Combining: We continue to combine partial parses in the chart using grammar rules until we have a complete parse for the entire sentence.

6. Output: The final parse tree for the sentence is represented by the complete parse in the chart cell for the span (1, 5) and the non-terminal symbol S.
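The same process can be observed with NLTK's chart parser, which prints the edges (partial parses) as they are added to the chart; this is a sketch assuming a standard NLTK installation.

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'cat' | 'mouse'
V -> 'chased'
""")

# trace=1 prints each chart edge, i.e. the stored partial parses
parser = nltk.ChartParser(grammar, trace=1)
for tree in parser.parse("the cat chased the mouse".split()):
    print(tree)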

Hypergraphs can be used to represent the chart in chart parsing. Each hyperedge corresponds to a rule application, and vertices represent states or partial parses.

 Example: Parsing "The cat sat on the mat":
o Vertices: Positions between words (0: start, 1: after "The", 2: after "cat", etc.).
o Hyperedges: Rules like NP → Det N (connecting positions 0-2) or VP → V PP (connecting positions 2-6).


Example: Parsing with Hypergraphs

Consider the sentence "The cat chased the mouse" and a simple CFG:

 S → NP VP
 NP → Det N
 VP → V NP
 Det → "the"
 N → "cat" | "mouse"
 V → "chased"

Hypergraph Representation:

 Vertices: {0, 1, 2, 3, 4, 5} (positions between words).
 Hyperedges:
o Det → "The" (0-1)
o N → "cat" (1-2)
o NP → Det N (0-2)
o V → "chased" (2-3)
o Det → "the" (3-4)
o N → "mouse" (4-5)
o NP → Det N (3-5)
o VP → V NP (2-5)
o S → NP VP (0-5)

The hypergraph compactly encodes all possible parses, and chart parsing algorithms can efficiently explore this structure.

 Shift-reduce parsing allows a linear-time parser but requires access to an oracle.
 CFGs in the worst case need backtracking and have a worst-case parsing algorithm which runs in O(n³), where n is the size of the input.
 Variants of this algorithm are used in statistical parsers that attempt to search the space of possible parse trees without the limitation of left-to-right parsing.
 Our example CFG G is rewritten as a new CFG Gc which contains up to two non-terminals on the right-hand side. We can specialize the CFG Gc to a particular input string by creating a new CFG that represents all possible parse trees that are valid in grammar Gc for this particular input sentence.
 For the input "a and b or c" the new CFG Gf that represents the forest of parse trees can be constructed. Let the input string be broken up into spans 0 a 1 and 2 b 3 or 4 c 5.
 Here a parsing algorithm is defined as taking as input a CFG and an input string and producing a specialized CFG that represents all legal parses for the input.
 A parser has to create all the valid specialized rules, from the start-symbol non-terminal that spans the entire string down to the leaf nodes that are the input tokens. To construct such a specialized CFG, the parser begins with the rules that generate only lexical items.

Chart parsing can be more efficient than other parsing algorithms, such as recursive descent or shift-reduce parsing, because it stores all possible partial parses in the chart and avoids redundant parsing of the same span multiple times. Hypergraphs can also be used in chart parsing to represent more complex structures and enable more efficient parsing algorithms.

2.4.3. Minimum Spanning Trees and Dependency Parsing

Dependency parsing is a type of syntactic parsing that represents the grammatical structure of a sentence as a directed acyclic graph (DAG). The nodes of the graph represent the words of the sentence, and the edges represent the syntactic relationships between the words.

Minimum spanning tree (MST) algorithms are often used for dependency parsing, as they provide an efficient way to find the most likely parse for a sentence given a set of syntactic dependencies.

Here's an example of how an MST algorithm can be used for dependency parsing. Consider the sentence "The cat chased the mouse". We can represent this sentence as a graph with nodes for each word and edges representing the candidate syntactic dependencies between them.

We can use an MST algorithm to find the most likely parse for this graph. One popular algorithm for this is the Chu-Liu/Edmonds algorithm:

1. We first remove all self-loops and multiple edges in the graph. This is because a valid dependency tree must be acyclic and have only one edge between any two nodes.
2. We then choose a node to be the root of the tree. In this example, we can choose "chased" to be the root since it is the main verb of the sentence.
3. We then compute the scores for each edge in the graph based on a scoring function that takes into account the probability of each edge being a valid dependency. The scoring function can be based on various linguistic features, such as part-of-speech tags or word embeddings.
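A rough sketch of the idea, using the Chu-Liu/Edmonds implementation in the networkx library (maximum_spanning_arborescence); the arc scores below are invented purely for illustration.

import networkx as nx

# Directed graph of candidate head -> dependent arcs with made-up scores
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("ROOT", "chased", 10), ("chased", "cat", 9), ("cat", "The", 8),
    ("chased", "mouse", 9), ("mouse", "the", 8),
    ("ROOT", "cat", 2), ("cat", "chased", 3), ("mouse", "chased", 1),
])

# The maximum spanning arborescence keeps, for each word, its best head
# while avoiding cycles -- here it recovers the expected dependency tree.
tree = nx.maximum_spanning_arborescence(G)
for head, dep in tree.edges():
    print(f"{head} -> {dep}")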
2.5. Models for Ambiguity Resolution in Parsing

Ambiguity resolution in parsing is a crucial aspect of Natural Language Processing (NLP). Parsing involves analyzing the syntactic structure of a sentence, and ambiguity arises when a sentence can be parsed in multiple ways, leading to different interpretations. Here are some common models and techniques used for ambiguity resolution in parsing.

1. Probabilistic Context-Free Grammar (PCFG)

 Probabilistic Context-Free Grammar (PCFG) is an extension of Context-Free Grammar (CFG) that assigns probabilities to production rules. These probabilities are learned from a corpus of annotated sentences (e.g., the Penn Treebank). PCFG is widely used in syntactic parsing to disambiguate sentences by selecting the most likely parse tree based on the probabilities of the production rules.

 Example:
o Sentence: "I saw the man with the telescope."
o Ambiguity: Does "with the telescope" modify "saw" (I used the telescope to see the man) or "the man" (the man has the telescope)?
o Resolution: PCFG assigns probabilities to each parse tree and selects the most likely one based on training data.

Key Concepts of PCFG:

1. Context-Free Grammar (CFG):
o A CFG consists of:
 Non-terminal symbols (e.g., S, NP, VP)
 Terminal symbols (e.g., words)
 Production rules (e.g., S → NP VP)
 A start symbol (e.g., S)
o Example: A simple CFG for English might have rules like:
S → NP VP
NP → Det N
VP → V NP
Det → "the"
N → "dog"
V → "chased"

2. Probabilities in PCFG:
o Each production rule is assigned a probability, indicating how likely it is to be used in a derivation.
o The probabilities of all rules with the same left-hand side (LHS) must sum to 1.
o Example:
S → NP VP [0.9]
S → VP [0.1]
NP → Det N [0.6]
NP → N [0.4]

3. Parse Tree Probability:
o The probability of a parse tree is the product of the probabilities of all the production rules used in the tree.
o The parser selects the parse tree with the highest probability.

Example of PCFG in Action

Sentence: "The dog chased the cat."

Step 1: Define the PCFG Rules

Assume the following PCFG rules and probabilities:

S → NP VP [0.9]
NP → Det N [0.6]
NP → N [0.4]
VP → V NP [0.7]
VP → V [0.3]
Det → "the" [1.0]
N → "dog" [0.5]
N → "cat" [0.5]
V → "chased" [1.0]
Step 2: Generate Parse Trees

There are two possible parse trees for this sentence:

1. Parse Tree 1:

S
├── NP
│ ├── Det: "the"
│ └── N: "dog"
└── VP
├── V: "chased"
└── NP
├── Det: "the"
└── N: "cat"
2. Parse Tree 2:
S
├── NP
│ └── N: "dog"
└── VP
├── V: "chased"
└── NP
├── Det: "the"
└── N: "cat"

Step 3: Calculate Parse Tree Probabilities

 Parse Tree 1:
o Rules used: S → NP VP, NP → Det N, Det → "the", N → "dog", VP → V NP, V → "chased", NP → Det N, Det → "the", N → "cat"
o Probability: 0.9 × 0.6 × 1.0 × 0.5 × 0.7 × 1.0 × 0.6 × 1.0 × 0.5 = 0.0567

 Parse Tree 2:
o Rules used: S → NP VP, NP → N, N → "dog", VP → V NP, V → "chased", NP → Det N, Det → "the", N → "cat"
o Probability: 0.9 × 0.4 × 0.5 × 0.7 × 1.0 × 0.6 × 1.0 × 0.5 = 0.0378

Step 4: Select the Most Likely Parse Tree

 Parse Tree 1 has a higher probability (0.0567) than Parse Tree 2 (0.0378), so it is selected as the correct parse.
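The same computation can be delegated to NLTK's Viterbi PCFG parser; the sketch below assumes a standard NLTK installation and uses the rule probabilities listed above.

import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP [0.9]
S -> VP [0.1]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.7]
VP -> V [0.3]
Det -> 'the' [1.0]
N -> 'dog' [0.5]
N -> 'cat' [0.5]
V -> 'chased' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)          # the highest-probability tree
    print(tree.prob())   # its probability (0.0567 with these rules)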
Ambiguity Resolution with PCFG

Example Sentence: "I saw the man with the telescope."

 Ambiguity: Does "with the telescope" modify "saw" (I used the telescope to see the man) or "the man" (the man has the telescope)?
 PCFG Resolution:
o The parser generates two parse trees:
1. "with the telescope" modifies "saw."
2. "with the telescope" modifies "the man."
o The probabilities of the production rules determine which parse tree is more likely.
o If the training data shows that "with the telescope" is more likely to modify "saw," the parser selects the first parse tree.

2. Generative Models for Parsing

Generative models in NLP are probabilistic models that generate sentences or parse trees by modeling the joint probability distribution of the input (e.g., words) and the output (e.g., syntactic structure). These models are widely used in parsing to predict the most likely syntactic structure of a sentence. Below is an explanation of generative models for parsing, along with examples.

Key Concepts of Generative Models for Parsing

1. Generative vs. Discriminative Models:
o Generative Models: Model the joint probability P(X, Y), where X is the input (words) and Y is the output (parse tree). They generate data by sampling from this distribution.
o Discriminative Models: Model the conditional probability P(Y|X), directly predicting the output given the input.

2. Goal of Generative Parsing:
o Given a sentence, the model generates the most likely parse tree by estimating P(Y|X) using Bayes' rule:

P(Y|X) = P(X|Y) · P(Y) / P(X)

Since P(X) is constant for a given sentence, the focus is on maximizing P(X|Y) · P(Y).

3. Components of Generative Parsing Models:
o Syntax Model (P(Y)): Models the probability of a parse tree (e.g., using a PCFG).
o Lexical Model (P(X|Y)): Models the probability of words given the parse tree.

Example of Generative Parsing in Action

Sentence: "The dog chased the cat."

Step 1: Define the Grammar (PCFG)

Assume the following PCFG rules:

S → NP VP [0.9]
NP → Det N [0.6]
NP → N [0.4]
VP → V NP [0.7]
VP → V [0.3]
Det → "the" [1.0]
N → "dog" [0.5]
N → "cat" [0.5]
V → "chased" [1.0]

Step 2: Generate Parse Trees

 The parser generates all possible parse trees for the sentence using the PCFG rules.

Step 3: Calculate Probabilities

 For each parse tree, the model calculates the probability as the product of the probabilities of the rules used.

Step 4: Select the Most Likely Parse Tree

 The parse tree with the highest probability is selected as the output.

3. Discriminative Models for Parsing in NLP

Discriminative models in NLP focus on directly modeling the conditional probability P(Y|X), where X is the input (e.g., a sentence) and Y is the output (e.g., a parse tree). Unlike generative models, which model the joint probability P(X, Y), discriminative models aim to predict the output directly from the input, often leveraging rich features and machine learning techniques. Discriminative models are widely used in parsing because they can incorporate a wide range of features (e.g., lexical, syntactic, and contextual) to improve accuracy.

Key Concepts of Discriminative Models for Parsing

1. Conditional Probability:
o Discriminative models estimate P(Y|X), the probability of a parse tree Y given a sentence X.
2. Feature-Based Representation:
o Discriminative models use feature vectors to represent the input and output. Features can include:
 Lexical features (e.g., words, part-of-speech tags)
 Syntactic features (e.g., dependencies, phrase structures)
 Contextual features (e.g., surrounding words)
3. Training:
o Discriminative models are trained on annotated corpora (e.g., the Penn Treebank) to learn the mapping from input features to output structures.
4. Inference:
o During inference, the model predicts the most likely parse tree Y for a given sentence X.

Example of Discriminative Parsing in Action

Sentence: "The dog chased the cat."

Step 1: Feature Extraction
 Extract features such as:
o Words: ["The", "dog", "chased", "the", "cat"]
o Part-of-speech tags: ["Det", "N", "V", "Det", "N"]
o Contextual embeddings: Pre-trained word vectors for each word.

Step 2: Model Prediction
 A discriminative model (e.g., a neural network or CRF) predicts the most likely parse tree based on the extracted features.

Step 3: Parse Tree Output
 The model outputs the following parse tree:

S
├── NP
│   ├── Det: "The"
│   └── N: "dog"
└── VP
    ├── V: "chased"
    └── NP
        ├── Det: "the"
        └── N: "cat"

Advantages of Discriminative Models for Parsing

1. Feature Richness: Discriminative models can incorporate a wide range of features, including lexical, syntactic, and contextual information.
2. Direct Prediction: They directly model P(Y|X), avoiding the need to model the joint distribution P(X, Y).
3. State-of-the-Art Performance: Neural discriminative models (e.g., Transformers) achieve state-of-the-art performance on parsing tasks.

2.6. Multilingual Issues: What is a token?

In Natural Language Processing (NLP), a token is the smallest unit of text that carries meaning. Tokenization is the process of splitting a text into individual tokens, which are typically words, symbols, or subwords. Tokens are the building blocks for most NLP tasks, such as text analysis, machine translation, and sentiment analysis.

In English, a token is typically separated from other tokens by a space character. However, in a parser/treebank for English a word like today's or There's is treated as two independent tokens: today and 's, or There and 's.

What is a Token?

A token can be:

1. Word: A single word (e.g., "cat", "running").
2. Symbol: A punctuation mark or special character (e.g., ".", "!", "@").
3. Subword: A part of a word, often used in languages with complex morphology or in deep learning models (e.g., "unhappiness" → ["un", "happiness"]).
4. Number: A numeric value (e.g., "123", "3.14").
5. Emoji: A single emoji character (e.g., 😊).

Challenges in Multilingual Tokenization

1. No Spaces Between Words:
o Languages like Chinese, Japanese, and Thai do not use spaces to separate words, making tokenization more complex.
2. Compound Words:
o Languages like German and Finnish form compound words, which need to be split into meaningful tokens.
3. Morphological Richness:
o Languages like Arabic and Turkish have rich morphology, where a single word can have multiple prefixes and suffixes.
4. Script Variations:
o Different scripts (e.g., Cyrillic, Devanagari, Hanzi) require specialized tokenization rules.
5. Ambiguity:
o Some words or characters can have multiple meanings, making it difficult to determine correct token boundaries.

Tokenization, Case and Encoding

In syntax parsing, tokenization, case handling, and encoding are critical preprocessing steps that significantly impact the performance of NLP models. Let's break down each of these concepts and their roles in syntax parsing.

1. Tokenization

Tokenization is the process of splitting text into individual tokens (words, symbols, or subwords). It is the first step in syntax parsing and directly affects how the parser interprets the structure of a sentence.

Role in Syntax Parsing:

 Tokens are the basic units used to build parse trees.
 Incorrect tokenization can lead to incorrect parsing (e.g., splitting "can't" into ["ca", "n't"] may confuse the parser).

Challenges:

 Multilingual Tokenization: Languages like Chinese and Japanese don't use spaces, requiring specialized tokenizers.
 Compound Words: Languages like German form long compound words that need to be split (e.g., "Donaudampfschifffahrtsgesellschaft").
 Contractions: English contractions like "can't" or "I'm" need careful handling.

Example:

from nltk.tokenize import word_tokenize

text = "I can't believe it's already 2023!"

tokens = word_tokenize(text)

print(tokens)

Output:

['I', 'ca', "n't", 'believe', 'it', "'s", 'already', '2023', '!']

2. Case Handling

Case refers to the distinction between uppercase and lowercase letters. Proper handling of case is crucial for syntax parsing, as it can affect the interpretation of words.

Role in Syntax Parsing:

 Proper Nouns: Uppercase letters often indicate proper nouns (e.g., "John", "Paris").
 Sentence Boundaries: Capitalization helps identify the start of sentences.
 Ambiguity: Case can change the meaning of words (e.g., "apple" vs. "Apple").

Challenges:

 Case Insensitivity: Some languages (e.g., German) capitalize all nouns, which can lead to ambiguity.
 Lowercasing: Lowercasing text can lose important information (e.g., "US" vs. "us").

Example:

text = "The Apple in the basket is from Apple Inc."

tokens = text.split()

lowercase_tokens = [token.lower() for token in tokens]

print(lowercase_tokens)

Output:

['the', 'apple', 'in', 'the', 'basket', 'is', 'from', 'apple', 'inc.']


3. Encoding

Encoding refers to the representation of text in a format that can be processed by

machines. In syntax parsing, text must be encoded into numerical vectors or other

formats that parsers can understand.

Role in Syntax Parsing:

 Character Encoding: Ensures text is represented correctly (e.g., UTF-8 for

multilingual text).

 Word Embeddings: Words are often encoded as dense vectors (e.g., Word2Vec,

GloVe, BERT).

 Positional Encoding: Used in transformer-based models to represent the

position of tokens in a sentence.

Challenges:

 Multilingual Encoding: Handling different scripts and special characters.

 Out-of-Vocabulary (OOV) Words: Words not present in the vocabulary need

special handling (e.g., subword tokenization in BERT).
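As a small illustration of subword encoding, the sketch below uses a BERT-style WordPiece tokenizer from the Hugging Face transformers library (the package and the bert-base-uncased vocabulary are assumed to be available; the exact subword pieces depend on the vocabulary).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("unhappiness"))    # the word split into WordPiece subwords
print(tokenizer.encode("The cat sat."))     # numerical token ids, with special tokens added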

Syntax Parsing Pipeline

Here's how tokenization, case handling, and encoding fit into a typical syntax parsing pipeline:

1. Input Text: Raw text is provided as input.
2. Tokenization: Split text into tokens.
3. Case Handling: Normalize case (optional, depending on the task).
4. Encoding: Convert tokens into numerical representations.
5. Parsing: Use a parser (e.g., dependency parser, constituency parser) to analyze the syntactic structure.

Key Points

1. Tokenization:
o Splits text into meaningful units.
o Language-specific tokenizers are often required.
2. Case Handling:
o Normalizes or preserves case based on the task.
o Important for identifying proper nouns and sentence boundaries.
3. Encoding:
o Converts text into numerical representations.
o Word embeddings and positional encoding are commonly used.
4. Syntax Parsing:
o Analyzes the grammatical structure of sentences.
o Relies on accurate tokenization, case handling, and encoding.

Challenges in Multilingual Syntax Parsing

1. Tokenization: Languages with no spaces (e.g., Chinese) or complex morphology (e.g., Finnish) require specialized tokenizers.
2. Case Handling: Some languages (e.g., German) capitalize all nouns, which can lead to ambiguity.
3. Encoding: Handling different scripts and special characters in multilingual text.

Word Segmentation

Word segmentation is a critical step in syntax parsing, especially for languages that do not use spaces to separate words (e.g., Chinese, Japanese, Thai). It involves splitting a continuous stream of characters into meaningful word units. Accurate word segmentation is essential for downstream NLP tasks like parsing, machine translation, and sentiment analysis.

What is Word Segmentation?

Word segmentation is the process of identifying word boundaries in text. For languages like English, this is relatively straightforward because words are separated by spaces. However, for languages like Chinese, word segmentation is more complex because there are no explicit delimiters between words.

Role of Word Segmentation in Syntax Parsing

1. Input for Parsers: Syntax parsers rely on correctly segmented words to build parse trees.
2. Dependency Parsing: Incorrect segmentation can lead to incorrect dependency relationships.
3. Constituency Parsing: Word boundaries determine how phrases are grouped.

Challenges in Word Segmentation

1. Ambiguity:
o Some character sequences can be segmented in multiple ways.
o Example (Chinese): "结婚的和尚未结婚的" can be segmented as:
 "结婚 的 和 尚未 结婚 的" (married and unmarried).
 "结婚 的 和尚 未 结婚 的" (married monks and unmarried).
2. Out-of-Vocabulary (OOV) Words:
o New or rare words may not be recognized by the segmenter.
3. Language-Specific Rules:
o Different languages have unique segmentation rules (e.g., compound words in German, agglutination in Turkish).

Word Segmentation Techniques

1. Rule-Based Methods:
o Use dictionaries and handcrafted rules to segment words.
o Example: Maximum Matching Algorithm (MMSEG for Chinese).
2. Statistical Methods:
o Use probabilistic models (e.g., Hidden Markov Models, Conditional Random Fields) to predict word boundaries based on training data.
3. Neural Methods:
o Use deep learning models (e.g., BiLSTM, Transformer) for segmentation.
o Example: BERT-based models fine-tuned for word segmentation.
4. Hybrid Methods:
o Combine rule-based and statistical/neural approaches for better accuracy.
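For Chinese, a widely used statistical segmenter is the jieba library; the sketch below assumes it is installed, and its output depends on jieba's dictionary, so it may differ from the two readings discussed above.

import jieba

# Segment the ambiguous example sentence from the text
print(list(jieba.cut("结婚的和尚未结婚的")))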
Morphology

Morphology plays a crucial role in syntax parsing in Natural Language Processing (NLP). Morphology deals with the structure and formation of words, including how words are inflected, derived, and composed. Understanding morphology is essential for accurate syntax parsing because it helps in identifying the grammatical roles of words in a sentence.

In many languages the notion of splitting up tokens using spaces is problematic, since each word can contain several components, called morphemes, such that the meaning of a word can be thought of as composed of the combination of the meanings of the morphemes. A word must now be thought of as being decomposed into a stem combined with several morphemes.

What is Morphology?

Morphology is the study of the internal structure of words and how they are formed. It involves:

1. Inflection: Modifying a word to express different grammatical categories (e.g., tense, number, case).
o Example: "run" → "running" (present participle), "runs" (third person singular).
2. Derivation: Creating new words by adding prefixes or suffixes.
o Example: "happy" → "unhappy" (prefix), "happiness" (suffix).
3. Compounding: Combining two or more words to form a new word.
o Example: "sun" + "flower" → "sunflower".

Role of Morphology in Syntax Parsing

1. Part-of-Speech (POS) Tagging:
o Morphological analysis helps determine the POS of a word, which is critical for parsing.
o Example: "running" can be a verb (e.g., "I am running") or a noun (e.g., "Running is fun").
2. Dependency Parsing:
o Morphological features (e.g., case, number, gender) help identify syntactic relationships between words.
o Example: In "She gives him a book," the morphological case of "him" (dative) indicates it is the indirect object.
3. Constituency Parsing:
o Morphological information helps group words into phrases.
o Example: In "The quick brown fox," the adjectives "quick" and "brown" modify the noun "fox."
4. Handling Ambiguity:
o Morphological analysis resolves ambiguities in word forms.
o Example: "Leaves" can be a noun (plural of "leaf") or a verb (third person singular of "leave").
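A small sketch of morphological normalization with NLTK's stemmer and lemmatizer (the WordNet lemmatizer assumes the WordNet data has been downloaded); it illustrates how the noun and verb readings of "leaves" map to different lemmas.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                 # 'run'
print(lemmatizer.lemmatize("leaves", pos="n")) # 'leaf'  (noun reading)
print(lemmatizer.lemmatize("leaves", pos="v")) # 'leave' (verb reading)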

Challenges in Morphological Analysis

1. Morphological Richness:
o Some languages (e.g., Finnish, Turkish) have highly complex morphology with extensive inflection and derivation.
o Example: Finnish "taloissani" (in my houses) → ["talo", "i", "ssa", "ni"].
2. Agglutination:
o Agglutinative languages (e.g., Turkish, Korean) add multiple affixes to a root word.
o Example: Turkish "evlerimizde" (in our houses) → ["ev", "ler", "imiz", "de"].
3. Irregular Forms:
o Some words have irregular inflections (e.g., "go" → "went").
o Example: English "child" → "children" (irregular plural).
4. Ambiguity:
o The same word form can have multiple morphological interpretations.
o Example: "flies" can be a noun (plural of "fly") or a verb (third person singular of "fly").
