Mod - 3 (2)
•A parser is the compiler component that breaks the data coming from the lexical
analysis phase into smaller elements.
•A parser takes input in the form of a sequence of tokens and produces output in
the form of a parse tree.
•Parsing is of two types: top-down parsing and bottom-up parsing.
What is parsing?
• Parsing in NLP is the process of determining the syntactic structure of a text by
analyzing its constituent words based on an underlying grammar (of the language).
• Nodes such as noun_phrase, verb_phrase, etc. have children, hence they are called
non-terminals; finally, the leaves of the tree are called terminals.
Set of grammar rules (productions)
Parse tree
•A parse tree is the graphical representation of symbols. Each symbol can be a
terminal or a non-terminal.
•In parsing, the string is derived using the start symbol. The root of the parse
tree is that start symbol.
•A parse tree follows the precedence of operators. The deepest sub-tree is traversed
first, so the operator in a parent node has lower precedence than the operator
in its sub-tree.
• Verb phrase (VP): These phrases are lexical units that have a verb acting as the
head word. Usually, there are two forms of verb phrases. One form has the verb
of the object.
• Adjective phrase (ADJP): These are phrases with an adjective as the head word.
Their main role is to describe or qualify nouns and pronouns in a sentence, and they
• Adverb phrase (ADVP): These phrases act like adverbs since the adverb acts as
the head word in the phrase. Adverb phrases are used as modifiers for nouns, verbs,
head word and other lexical components like nouns, pronouns, and so on. These act
sentence.
In some cases we need semantic parsing to understand the meaning of the
sentence.
• In deep or full parsing, typically, grammar concepts such as CFG, and probabilistic
Input:
a*b+c
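For the input a*b+c, the precedence rule described for parse trees can be sketched with a small recursive-descent parser. This is a minimal illustration, not a general expression parser: it assumes single-character operands and only the operators + and *, and the function names (`tokenize`, `parse_expr`, `parse_term`, `operators_in_evaluation_order`) are made up for this example.

```python
def tokenize(expr):
    """Split the input into single-character tokens (assumption: no multi-char names)."""
    return [ch for ch in expr if not ch.isspace()]


def parse_term(tokens, pos):
    """term -> operand ('*' operand)*   ('*' binds tighter, so it sits deeper)."""
    node, pos = tokens[pos], pos + 1
    while pos < len(tokens) and tokens[pos] == "*":
        node = ("*", node, tokens[pos + 1])
        pos += 2
    return node, pos


def parse_expr(tokens, pos=0):
    """expr -> term ('+' term)*   ('+' has lower precedence, so it sits nearer the root)."""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        node = ("+", node, right)
    return node, pos


def operators_in_evaluation_order(node):
    """Post-order walk: operators in deeper sub-trees are listed first."""
    if isinstance(node, str):
        return []
    op, left, right = node
    return (operators_in_evaluation_order(left)
            + operators_in_evaluation_order(right)
            + [op])


tree, _ = parse_expr(tokenize("a*b+c"))
print(tree)                                 # ('+', ('*', 'a', 'b'), 'c')
print(operators_in_evaluation_order(tree))  # ['*', '+']
```

The root holds +, the lower-precedence operator, while * sits in the deeper sub-tree and is therefore handled first.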
Example:
Production rules:
S → aSa
S → bSb
S → c
S ⇒ aSa
S ⇒ abSba
S ⇒ abbSbba
S ⇒ abbcbba
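The derivation above can be reproduced with a short sketch. It assumes exactly the three production rules listed (S → aSa, S → bSb, S → c); the function name `derive` is hypothetical and chosen only for this illustration.

```python
def derive(target):
    """Return the sentential forms that derive `target` from S, or None if
    `target` is not in the language of S -> aSa | bSb | c."""
    steps = []
    left, right = "", ""
    rest = target
    while True:
        if rest == "c":
            # Apply S -> c to finish the derivation.
            steps.append(left + "c" + right)
            return steps
        if len(rest) >= 3 and rest[0] == rest[-1] and rest[0] in "ab":
            # Apply S -> aSa or S -> bSb, depending on the outer symbol.
            left += rest[0]
            right = rest[-1] + right
            steps.append(left + "S" + right)
            rest = rest[1:-1]
        else:
            return None


for form in derive("abbcbba"):
    print("S =>", form)
# S => aSa
# S => abSba
# S => abbSbba
# S => abbcbba
```

A string such as "abca" is rejected (the function returns None) because no sequence of the three rules can produce it.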
Example:
http://www.nltk.org/howto/parse.html
4. A regex parser
A regex parser uses regular expressions, defined in the form of a grammar, on top
of a POS-tagged string. The parser uses these regular expressions to parse the
given sentences and generate a parse tree from them.
A working example of the regex parser is:
# Regex parser
>>>import nltk
>>>from nltk.chunk.regexp import *
# Catch-all rule; defined here but not used by reg_parser below
>>>chunk_rules=ChunkRule("<.*>+","chunk everything")
>>>reg_parser = RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>} # Preposition
V: {<V.*>} # Verb
PP: {<P> <NP>} # PP -> P NP
VP: {<V> <NP|PP>*} # VP -> V (NP|PP)*''')
>>>test_sent="Mr. Obama played a big role in the Health insurance bill"
# Tokenize and POS-tag the sentence before parsing
>>>test_sent_pos=nltk.pos_tag(nltk.word_tokenize(test_sent))
>>>parsed_out=reg_parser.parse(test_sent_pos)
>>>print(repr(parsed_out))
Tree('S', [('Mr.', 'NNP'), ('Obama', 'NNP'), Tree('VP', [Tree('V',
[('played', 'VBD')]), Tree('NP', [('a', 'DT'), ('big', 'JJ'), ('role',
'NN')])]), Tree('P', [('in', 'IN')]), ('Health', 'NNP'), Tree('NP',
The following is a graphical representation of the tree for the preceding code:
A chunk can be defined as the minimal unit that can be processed. For
example, the sentence "the President speaks about the health care reforms"
can be broken into two chunks. One is "the President", which is noun
dominated and hence is called a noun phrase (NP). The remaining part of
the sentence is dominated by a verb, hence it is called a verb phrase (VP).
If you look closely, there is one more sub-chunk in the part "speaks about the
health care reforms". Here, one more NP exists, so this part can be broken down
again into "speaks about" and "health care reforms", as shown in the following figure:
So, let's write some code snippets to do some basic chunking:
# Chunking
>>>import nltk
>>>from nltk.chunk.regexp import *
>>>test_sent=("The prime minister announced he had asked the chief "
"government whip, Philip Ruddock, to call a special party room meeting for "
"9am on Monday to consider the spill motion.")
>>>test_sent_pos=nltk.pos_tag(nltk.word_tokenize(test_sent))
# Verb phrase rule: one or more verbs, optionally followed by a pronoun
>>>rule_vp = ChunkRule(r'(<VB.*>)?(<VB.*>)+(<PRP>)?', 'Chunk VPs')
>>>parser_vp = RegexpChunkParser([rule_vp],chunk_label='VP')
>>>print (parser_vp.parse(test_sent_pos))
# Noun phrase rule: optional determiner/adverb, adjectives or numbers, then nouns
>>>rule_np = ChunkRule(r'(<DT>?<RB>?)?<JJ|CD>*(<JJ|CD><,>)*(<NN.*>)+',
'Chunk NPs')
>>>parser_np = RegexpChunkParser([rule_np],chunk_label="NP")
>>>print (parser_np.parse(test_sent_pos))
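To see what a tag-pattern rule does under the hood, the idea can be sketched in plain Python without NLTK. This is a simplified illustration, not how NLTK implements `ChunkRule`: the helper name `chunk` is hypothetical, the POS tags below are assigned by hand for the sentence "the President speaks about the health care reforms", and the pattern is written as an ordinary regular expression over a string of tags.

```python
import re


def chunk(tagged, tag_pattern, label):
    """Group runs of tokens whose tag sequence matches `tag_pattern` under `label`.

    `tagged` is a list of (word, tag) pairs. Tags are assumed not to contain
    '<' or '>', so every regex match starts and ends on a tag boundary.
    """
    # Encode the tag sequence as one string, e.g. "<DT><NNP><VBZ>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    # Map each character offset where a tag starts to its token index.
    offset_to_index = {}
    pos = 0
    for i, (_, tag) in enumerate(tagged):
        offset_to_index[pos] = i
        pos += len(tag) + 2
    offset_to_index[pos] = len(tagged)

    result, done = [], 0
    for m in re.finditer(tag_pattern, tag_string):
        start, end = offset_to_index[m.start()], offset_to_index[m.end()]
        result.extend(tagged[done:start])             # tokens outside any chunk
        result.append((label, tagged[start:end]))     # the chunk itself
        done = end
    result.extend(tagged[done:])
    return result


# Hand-tagged tokens (assumed tags, for illustration only).
tagged = [("the", "DT"), ("President", "NNP"), ("speaks", "VBZ"), ("about", "IN"),
          ("the", "DT"), ("health", "NN"), ("care", "NN"), ("reforms", "NNS")]

# NP: optional determiner, any adjectives, then one or more nouns.
np_pattern = r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+"

chunks = chunk(tagged, np_pattern, "NP")
for item in chunks:
    print(item)
```

With these tags, "the President" and "the health care reforms" come out as NP chunks, while "speaks" and "about" stay outside any chunk.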
Information Extraction
Two operations
Let's start with a very generic example where we are given a text file of
the content and we need to extract some of the most insightful named
entities from it:
Relation extraction
•Relation extraction is another commonly used information extraction
operation.
•Relation extraction, as the name suggests, is the process of extracting the
relationships between different entities.
•A variety of relationships can exist between entities.
•We have seen relationships such as inheritance, synonymy, and analogy.
•The definition of a relation depends on the information need.
•For example, if we want to find out from unstructured text data who wrote
which book, then authorship could be a relation between the author name and
the book name.
•With NLTK, the idea is to use the same IE pipeline that we used up to NER and
extend it with a relation pattern based on the NER tags.
Process of Relation Extraction
Example
So, in the following code, we use the built-in ieer corpus, where the
sentences are already tagged up to the NER stage. The only things we need to
specify are the relation pattern we want and the kinds of named entities we
want the relation to connect.
[ORG: 'Christian Democrats'] ', the leading political forces in' [LOC: 'Italy']
[ORG: 'AP'] ') _ Lebanese guerrillas attacked Israeli forces in southern' [LOC: 'Lebanon']
[ORG: 'Security Council'] 'adopted Resolution 425. Huge yellow banners hung across intersections in' [LOC: 'Beirut']
[ORG: 'U.N.'] 'failures in' [LOC: 'Africa']
[ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
[ORG: 'U.N.'] 'partners on a more effective role in' [LOC: 'Africa']
[ORG: 'AP'] ') _ A bomb exploded in a mosque in central' [LOC: 'San`a']
[ORG: 'Krasnoye Sormovo'] 'shipyard in the Soviet city of' [LOC: 'Gorky']
[ORG: 'Kelab Golf Darul Ridzuan'] 'in' [LOC: 'Perak']
[ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
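The pattern visible in this output (an ORG entity, some filler text containing the word "in", then a LOC entity) can be sketched in plain Python. This is only an illustration of pattern-based relation extraction and does not use the ieer corpus or NLTK: the input format (entity tuples interleaved with plain strings), the function name `extract_relations`, and the sample data are assumptions made for the example, with entity names borrowed from the output above.

```python
import re

# Relation pattern: the filler between the two entities must contain the word "in".
IN_PATTERN = re.compile(r"\bin\b")


def extract_relations(chunks, subj_type, obj_type, pattern):
    """Return (subject, filler, object) triples where a `subj_type` entity is
    followed by filler text matching `pattern` and then an `obj_type` entity."""
    relations = []
    for i in range(len(chunks) - 2):
        left, filler, right = chunks[i:i + 3]
        if (isinstance(left, tuple) and isinstance(filler, str)
                and isinstance(right, tuple)
                and left[0] == subj_type and right[0] == obj_type
                and pattern.search(filler)):
            relations.append((left[1], filler, right[1]))
    return relations


# Hypothetical NER-tagged text: (entity_type, entity_text) tuples and plain strings.
chunks = [
    ("ORG", "WHYY"), "in", ("LOC", "Philadelphia"), ". Elsewhere,",
    ("ORG", "Open Text"), ", based in", ("LOC", "Waterloo"), ", said that",
    ("ORG", "Idealab"), "is hiring for", ("LOC", "Los Angeles"),
]

relations = extract_relations(chunks, "ORG", "LOC", IN_PATTERN)
for subj, filler, obj in relations:
    print(f"[ORG: {subj!r}] {filler!r} [LOC: {obj!r}]")
# [ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
# [ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
```

The last ORG/LOC pair is skipped because its filler, "is hiring for", does not contain "in" as a separate word.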