NLP Introduction

The document discusses the field of Natural Language Processing (NLP), highlighting its goals, challenges, and applications such as language translation and sentiment analysis. It emphasizes the complexity of understanding diverse languages and the difficulties posed by ambiguity, idioms, and new word meanings. Additionally, it outlines the stages of NLP, including phonetics, morphology, and semantic analysis, while referencing key literature in the field.

Natural Language Processing

3/10/2022
Text is everywhere!

Social media

Text is everywhere!
Research papers, news, etc.

Diversity of Languages (Worldwide)

How many Languages are spoken today?


7,111 [1]

Can we understand the majority of the world’s data? What percentage?

[1] Source: https://www.ethnologue.com/guides/how-many-languages
Diversity of Languages (India)

Can we understand the majority of India’s data?

Total Languages: > 1650 (2011 census)

Hindi: 57.1%
English: 10.6%
Bengali: 8.9%
Marathi: 8.2%
Telugu: 7.8%
...

The goals of NLP

French Sentence: Tu bois un Coca-Cola

English Translation: You drink a Coca-Cola
We try to understand a foreign language using some known keywords

Goals of NLP
Deep understanding of broad language constructs.
Achieve human-like comprehension of texts/languages.
Enable computer systems to understand, draw inferences from, summarize, translate, and generate accurate and natural human text and language.

Some Applications: Language Translation

Language Translation

Language Translation is not easy even for humans

Pepsi’s Chinese blunder


“Come alive with the Pepsi Generation”, when translated into Chinese meant,
“Pepsi brings your relatives back from the dead.”

KFC’s Chinese blunder


KFC’s slogan, “Finger lickin’ good”, when translated into Chinese meant “We’ll
eat your fingers off.”

Some more examples...

Query Recommendation in Search Engines

Spelling Correction

Information Extraction

Sentiment Analysis

Recent Trends: Fake news detection

Recent Trends: Chatbots

Other Goals

Text Summarization
Opinion dynamics
Spam detection

...

Natural language technology is not yet perfect, but it is still good enough for several useful applications.

Why is NLP hard?

Compounding

The example word shown on the slide has 195 characters (428 characters when transliterated into the Roman writing system).

Why is NLP hard?

Ambiguity

Why else is NLP hard?

Shorthand text

Why else is NLP hard?
Non-standard English

Why else is NLP hard?

Segmentation Issues
the New York New Haven Railroad
the [New] [York New] [Haven] [Railroad]
the [New York] [New Haven] [Railroad]
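
The two bracketings above can be generated mechanically. The following is a minimal sketch, assuming a hypothetical lexicon of multiword names (`NAMES`); the entry "York New" is included only so that the wrong grouping from the slide shows up among the candidates. It enumerates the candidate segmentations but does not choose between them, which is the hard part.

```python
from typing import List, Set


def segmentations(tokens: List[str], names: Set[str]) -> List[List[str]]:
    """Enumerate every way to group tokens into single words or known multiword names."""
    if not tokens:
        return [[]]
    results = []
    for end in range(1, len(tokens) + 1):
        chunk = " ".join(tokens[:end])
        # A chunk is allowed if it is a single token or a known multiword name.
        if end == 1 or chunk in names:
            for rest in segmentations(tokens[end:], names):
                results.append([chunk] + rest)
    return results


# Hypothetical lexicon; "York New" is deliberately wrong, to mirror the slide.
NAMES = {"New York", "New Haven", "York New"}

tokens = "the New York New Haven Railroad".split()
candidates = segmentations(tokens, NAMES)
for seg in candidates:
    print(" ".join(f"[{chunk}]" for chunk in seg))
```

Picking the intended segmentation from these candidates needs extra evidence, such as corpus frequencies or a gazetteer of place names.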

Why else is NLP hard?
Idioms: An expression whose meaning is different from the meanings of the individual words in it.
Dark horse
Ball in your court
Burn the midnight oil

Neologisms: A new word, phrase, or expression, or a new meaning of a familiar word.
Unfriend
Retweet
Google/Skype/Photoshop

Why is NLP hard?

New Senses of a word


That’s sick dude!
Giants ... multinationals, conglomerates, manufacturers

Tricky Entity Names


Where is A Bug’s Life playing ...
Let It Be was recorded ...

Why is NLP hard?

Code mixing/switching, romanization

What do we do in NLP?

Create annotated corpora


Brown Corpus, Google Books Ngram Corpus, Reuters Newswire Topic
Classification, IMDB Movie Review Sentiment Classification, Project
Gutenberg, etc.

Create Models/Algorithms
LDA, BERT, CKY, Edit Distance, CRF++, etc.

Create Tools
CoreNLP, NLTK, Gensim, spaCy, etc.
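
Of the algorithms named above, edit distance is small enough to sketch in full. The version below is the standard Levenshtein distance (unit cost for insertion, deletion, and substitution), computed with dynamic programming one row at a time. The function name and the choice of unit costs are assumptions for illustration; spelling correctors often use weighted or transposition-aware variants.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance between two strings, using unit costs."""
    # previous[j] holds the distance between the processed prefix of source
    # and target[:j]; it starts as the cost of inserting j characters.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i characters turns source[:i] into ""
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

A spelling corrector can rank dictionary words by this distance to the misspelled input and propose the closest ones.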

Stages in NLP (traditional view)
• Phonetics and phonology
• Morphology
• Lexical Analysis
• Syntactic Analysis
• Semantic Analysis
• Pragmatics
• Discourse

Source: IITB NLP Course by Pushpak Bhattacharyya

Phonetics
• Processing of speech
• Challenges
• Homophones: bank (finance) vs. bank (river bank)
• Near Homophones: maatraa vs. maatra (Hindi)
• Word Boundary
• aajaayenge: aa jaayenge (will come) or aaj aayenge (will come today)
• I got [ua]plate (I got up late vs. I got a plate)
• Phrase boundary
• mtech1 students are especially exhorted to attend as such
seminars are integral to one's post-graduate education
• Disfluency: ah, um, ahem etc.

Morphology
• Word formation rules from root words
• Nouns: Plural (boy-boys); Gender marking (czar-czarina)
• Verbs: Tense (stretch-stretched); Aspect (e.g. perfective sit-had sat); Modality (e.g. request khaanaa khaaiie)
• Crucial first step in NLP
• Languages rich in morphology: e.g., Dravidian, Hungarian, Turkish
• Languages poor in morphology: Chinese, English
• Languages with rich morphology have the advantage of easier processing at higher stages of processing
• A task of interest to computer science: Finite State Machines for Word Morphology
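
To make the finite-state idea concrete, here is a minimal sketch of a deterministic recognizer for regular English plurals such as boy-boys. The stem list, state numbering, and function names are assumptions for illustration. Real morphological analysers typically use finite-state transducers and also handle spelling changes (for example, mango-mangoes), which this toy ignores.

```python
from typing import Dict, List, Optional, Tuple

STEMS = ["boy", "dog", "book"]  # hypothetical toy lexicon of regular nouns


def build_fsm(stems: List[str]) -> Tuple[Dict[int, Dict[str, int]], Dict[int, str]]:
    """Build a trie-shaped automaton: one path of states per stem."""
    transitions: Dict[int, Dict[str, int]] = {0: {}}  # state 0 is the start state
    stem_final: Dict[int, str] = {}  # states where a complete stem has been read
    for stem in stems:
        state = 0
        for ch in stem:
            if ch not in transitions[state]:
                new_state = len(transitions)
                transitions[new_state] = {}
                transitions[state][ch] = new_state
            state = transitions[state][ch]
        stem_final[state] = stem
    return transitions, stem_final


TRANSITIONS, STEM_FINAL = build_fsm(STEMS)


def analyse(word: str) -> Optional[Tuple[str, str]]:
    """Return (stem, number) if the automaton accepts the word, else None."""
    state = 0
    for i, ch in enumerate(word):
        if ch in TRANSITIONS[state]:
            state = TRANSITIONS[state][ch]
        elif ch == "s" and state in STEM_FINAL and i == len(word) - 1:
            # Plural arc: a word-final "s" read right after a complete stem.
            return STEM_FINAL[state], "plural"
        else:
            return None
    if state in STEM_FINAL:
        return STEM_FINAL[state], "singular"
    return None


print(analyse("boys"))
print(analyse("boy"))
```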
Lexical Analysis
• Essentially refers to dictionary access and obtaining the
properties of the word
e.g. dog
noun (lexical property)
take-’s’-in-plural (morph property)
animate (semantic property)
4-legged (semantic property)
carnivore (semantic property)
• Challenge: Lexical or word sense disambiguation
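
Dictionary access can be pictured as a lookup that returns the stored properties of a word. The sketch below assumes a hypothetical in-memory lexicon holding only the "dog" entry from the example above; the class and field names are illustrative and do not come from any particular toolkit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LexicalEntry:
    lemma: str
    pos: str                      # lexical property
    plural_suffix: Optional[str]  # morphological property
    semantic_features: List[str] = field(default_factory=list)


# Hypothetical toy lexicon; a word maps to a list because it may have several entries.
LEXICON: Dict[str, List[LexicalEntry]] = {
    "dog": [
        LexicalEntry(
            lemma="dog",
            pos="noun",
            plural_suffix="s",
            semantic_features=["animate", "4-legged", "carnivore"],
        )
    ],
}


def lookup(word: str) -> List[LexicalEntry]:
    """Return every stored entry for the word, or an empty list if unknown."""
    return LEXICON.get(word.lower(), [])


for entry in lookup("Dog"):
    print(entry.pos, entry.plural_suffix, entry.semantic_features)
```

Storing a list of entries per word leaves room for the ambiguity the challenge refers to: a second entry for the same spelling is what a disambiguation step would have to choose between.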

Lexical Disambiguation
First step: part-of-speech disambiguation
Dog as a noun (animal)
Dog as a verb (to pursue)
Sense Disambiguation
Dog (as animal)
Dog (as a very detestable person)
Needs word relationships in a context
The chair emphasized the need for adult education
Very common in day to day communications
Satellite Channel Ad: Watch what you want, when you want
(two senses of watch)
e.g., Ground breaking ceremony/research
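
One classic way to use "word relationships in a context" is a simplified Lesk approach: pick the sense whose dictionary gloss shares the most content words with the sentence. The sketch below applies it to "chair" in the sentence above. The two glosses and the stopword list are hand-written assumptions, chosen so the example works; a real system would take glosses from a resource such as WordNet.

```python
from typing import Dict, Set

# Hypothetical stopword list, just large enough for this example.
STOPWORDS = {"the", "a", "an", "for", "with", "and", "or", "to", "on", "who", "of"}


def content_words(text: str) -> Set[str]:
    """Lower-cased words with punctuation stripped and stopwords removed."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return words - STOPWORDS - {""}


def simplified_lesk(target: str, sentence: str, glosses: Dict[str, str]) -> str:
    """Return the sense whose gloss overlaps most with the sentence context."""
    context = content_words(sentence) - {target}
    return max(glosses, key=lambda sense: len(context & content_words(glosses[sense])))


# Hand-written glosses (assumptions, not taken from a real dictionary).
CHAIR_GLOSSES = {
    "furniture": "a seat with a back and legs for one person to sit on",
    "chairperson": "the person who leads a meeting, committee, or education board",
}

sentence = "The chair emphasized the need for adult education"
print(simplified_lesk("chair", sentence, CHAIR_GLOSSES))
```

With such short glosses the decision rests on a single shared word, which illustrates why practical systems rely on richer context and larger sense inventories.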

Syntax Processing Stage

Structure Detection

S
  NP: I
  VP
    V: like
    NP: mangoes
Parsing Strategy

Driven by grammar

S -> NP VP
NP -> N | PRON
VP -> V NP | V PP
N -> mangoes
PRON -> I
V -> like
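
A grammar-driven parse of "I like mangoes" can be sketched as a small top-down (recursive-descent) parser with backtracking. The grammar below copies the rules above, with the terminal written in lower case to match the example sentence. PP has no rule in the grammar, so the sketch treats it as a symbol that never matches. The function names and tree representation are assumptions for illustration.

```python
from typing import Dict, Iterator, List, Tuple

GRAMMAR: Dict[str, List[List[str]]] = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["PRON"]],
    "VP": [["V", "NP"], ["V", "PP"]],
    "N": [["mangoes"]],
    "PRON": [["I"]],
    "V": [["like"]],
}


def parse(symbol: str, tokens: List[str], start: int) -> Iterator[Tuple[object, int]]:
    """Yield (tree, end) for every way symbol derives tokens[start:end]."""
    if symbol not in GRAMMAR:
        # Terminal, or an undefined symbol such as PP: match the token literally.
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for production in GRAMMAR[symbol]:
        for children, end in parse_sequence(production, tokens, start):
            yield (symbol, children), end


def parse_sequence(symbols: List[str], tokens: List[str], start: int) -> Iterator[Tuple[list, int]]:
    """Yield (subtrees, end) for every way the symbol sequence derives a token span."""
    if not symbols:
        yield [], start
        return
    for subtree, middle in parse(symbols[0], tokens, start):
        for rest, end in parse_sequence(symbols[1:], tokens, middle):
            yield [subtree] + rest, end


def parse_sentence(sentence: str) -> List[object]:
    """Return all parse trees that cover the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


for tree in parse_sentence("I like mangoes"):
    print(tree)
```

Top-down parsing of this kind works here because the grammar has no left-recursive rules; chart parsers such as CKY are the usual choice for larger grammars.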

Challenges in Syntactic Processing: Structural Ambiguity
• Scope
1. The old men and women were taken to safe locations
(old (men and women)) vs. ((old men) and women)
2. No smoking areas will allow hookahs inside

• Preposition Phrase Attachment


• I saw the boy with a telescope
(who has the telescope?)
• I saw the mountain with a telescope
(world knowledge: mountain cannot be an instrument of
seeing)
• I saw the boy with the pony-tail
(world knowledge: pony-tail cannot be an instrument of
seeing)
• Ubiquitous: newspaper headline “20 years later,
BMC pays father 20 lakhs for causing son’s death”
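
The attachment ambiguity can be made visible by counting parse trees. The sketch below runs a CKY-style chart over a small hand-written grammar in Chomsky normal form; the rules and lexicon are assumptions chosen to cover the telescope sentences, not a grammar from the slides. The prepositional phrase may attach to the verb phrase (VP -> VP PP) or to the noun phrase (NP -> NP PP), so the sentence receives two parses.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy grammar in Chomsky normal form: (parent, left child, right child).
BINARY_RULES: List[Tuple[str, str, str]] = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),  # PP attaches to the verb: seeing with a telescope
    ("NP", "NP", "PP"),  # PP attaches to the noun: the boy who has a telescope
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
LEXICON: Dict[str, List[str]] = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"], "a": ["Det"],
    "boy": ["N"], "telescope": ["N"], "mountain": ["N"], "with": ["P"],
}


def count_parses(sentence: str, start_symbol: str = "S") -> int:
    """Count distinct parse trees for the sentence under the toy grammar."""
    words = sentence.split()
    n = len(words)
    # chart[(i, j, A)] = number of ways category A derives words[i:j]
    chart: Dict[Tuple[int, int, str], int] = defaultdict(int)
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[(i, i + 1, category)] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    chart[(i, j, parent)] += chart[(i, k, left)] * chart[(k, j, right)]
    return chart[(0, n, start_symbol)]


print(count_parses("I saw the boy with a telescope"))
print(count_parses("I saw the mountain with a telescope"))
```

Both sentences get the same number of parses from the grammar alone, which is the point made above: syntax cannot rule out the mountain holding the telescope, so world knowledge has to.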

Structural Ambiguity…

Overheard
I did not know my PDA had a phone for 3 months

An actual sentence in the newspaper


The camera man shot the man with the gun when
he was near Tendulkar

Semantic Analysis
• Representation in terms of predicate calculus, semantic nets, frames, conceptual dependencies, and scripts

• John gave a book to Mary


Give action: Agent: John, Object: Book, Recipient: Mary

• Challenge: ambiguity in semantic role labeling


(Eng) Visiting aunts can be a nuisance
(Hin) aapko mujhe mithaai khilaanii padegii (ambiguous
in Marathi and Bengali too; not in Dravidian languages)
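
The frame for "John gave a book to Mary" can be written down as a small data structure and also rendered as a predicate-calculus style term. This is a minimal sketch: the class name, the regular expression, and the `extract_give` helper are assumptions for illustration, and the pattern only covers sentences of the exact shape "X gave a Y to Z". Real semantic role labeling has to cope with the ambiguities listed above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GiveEvent:
    """Frame for a 'give' action with its three semantic roles."""
    agent: str
    obj: str
    recipient: str

    def as_predicate(self) -> str:
        """Render the frame as a predicate-calculus style term."""
        return f"give({self.agent}, {self.obj}, {self.recipient})"


# Hypothetical surface pattern; it matches only "<agent> gave a/an <object> to <recipient>".
GIVE_PATTERN = re.compile(r"^(?P<agent>\w+) gave an? (?P<obj>\w+) to (?P<recipient>\w+)$")


def extract_give(sentence: str) -> Optional[GiveEvent]:
    """Fill the frame from a sentence, or return None if the pattern does not match."""
    match = GIVE_PATTERN.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    return GiveEvent(match["agent"], match["obj"], match["recipient"])


event = extract_give("John gave a book to Mary")
print(event.as_predicate())
```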

Pragmatics
• Very hard problem

• Model user intention


• Tourist (in a hurry, checking out of the hotel,
motioning to the service boy): Boy, go upstairs and
see if my sandals are under the divan. Do not be late.
I just have 15 minutes to catch the train.
• Boy (running upstairs and coming back panting): yes
sir, they are there.

• World knowledge
• WHY INDIA NEEDS A SECOND OCTOBER (ToI,
2/10/07)
Discourse
• Processing of a sequence of sentences
• Mother to John:
  John, go to school. It is open today. Should you bunk? Father will be very angry.

• Ambiguity of open
• bunk what?
• Why will the father be angry?
• Complex chain of reasoning and application of world knowledge
• Ambiguity of father: father as parent or father as headmaster

Reference Books

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition. Prentice-Hall. 2009.
Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press. 1999.
Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media. 2009.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press. 2016.
