
INTRODUCTION TO NATURAL LANGUAGE PROCESSING (NLP)

Dr. Sukhnandan Kaur
TIET
CONTENTS
 NLP-Definition
 Origin of NLP
 Applications of NLP
 Challenges of NLP
 Processing Indian Languages
INTRODUCTION
 Language is the primary means of communication used by humans.
 It is the tool with which we express our ideas and emotions.
 Language shapes our thought, has a structure, and carries meaning.
 Representing ideas and thoughts is so natural that we hardly realize how we process knowledge.
 There must be some kind of representation (model) of the content of language in our mind that helps us represent it.
 Natural language processing is concerned with the development of computational models of aspects of human language processing.
INTRODUCTION CONTD….
 Building computational models with human language processing abilities requires knowledge of how humans acquire, store and process language.
 It also requires knowledge of the world and of language.
 These computational models are useful:
 in developing automated tools for language processing.
 in gaining a better understanding of human communication.
 Natural language processing is an interdisciplinary field that goes by many names, such as speech and language processing, human language technology, natural language processing, computational linguistics, and speech recognition and synthesis.
NLP FORMAL DEFINITIONS
 Natural language processing (NLP) is a field of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, and, in particular, with programming computers to fruitfully process large amounts of natural language data.
 Natural Language Processing is a field that covers computer
understanding and manipulation of human language.
 NLP is a way for computers to analyze, understand, and derive
meaning from human language in a smart and useful way.
KNOWLEDGE IN NATURAL LANGUAGE
PROCESSING
 NLP applications, like other conventional programs or data processing systems, require knowledge. But NLP applications also require knowledge of language.
 For instance, consider the UNIX wc program, which counts the total number of bytes, words, and lines in a text. When it counts the number of bytes, it is a simple data processing application.
 But when it is used to count the number of words in a file, it requires knowledge about what it means to be a word in a language, and thus becomes a language processing system.
 wc is a simple system with extremely limited and superficial knowledge of language, but advanced applications such as conversational agents (e.g. HAL), machine translation, and question answering require different levels of knowledge.
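The distinction can be sketched in a few lines of Python (a hypothetical illustration, not the actual wc implementation): counting bytes needs no linguistic knowledge, but counting words forces a decision about what counts as a word - for instance, whether can't is one token or two.

```python
import re

def count_bytes(text: str) -> int:
    # Pure data processing: no knowledge of language needed.
    return len(text.encode("utf-8"))

def count_words(text: str) -> int:
    # Even this simple regex embodies a linguistic decision:
    # an apostrophe is treated as word-internal, so "can't" is one word.
    return len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text))

print(count_words("I can't open John's door."))  # 5 words under this definition
```

Changing the regex changes the definition of a word, which is exactly the linguistic knowledge the slide refers to.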
KNOWLEDGE IN NLP (CONTD….)
 Any conversational agent or speech recognition and synthesis system must be able to recognize words from an audio signal and to generate an audio signal from words.
 These applications require knowledge about phonetics and phonology- how words are pronounced in terms of sequences of sounds and how each of these sounds is realized acoustically.
 These systems must also be able to produce and understand variations like I’m and can’t, as well as variations like pluralization, verb forms, etc. Producing and recognizing these variations of individual words requires knowledge about morphology- the study of words: the structure of words and the parts of words, such as stems, root words, prefixes, and suffixes.
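A toy suffix-stripping analyzer shows what minimal morphological knowledge looks like in code. This is only an illustration with an invented suffix list; real morphological analyzers (e.g. finite-state ones) are far more sophisticated.

```python
# Ordered list of suffixes to try; longer suffixes first so that
# "walking" strips "ing" rather than just "g".
SUFFIXES = ["ing", "ed", "es", "s"]

def strip_suffix(word: str):
    """Return a (stem, suffix) guess for an inflected word form."""
    for suf in SUFFIXES:
        # Require a reasonably long remaining stem so short words
        # like "sing" or "bed" are not mangled.
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)], suf
    return word, ""

print(strip_suffix("walking"))  # ('walk', 'ing')
print(strip_suffix("cats"))     # ('cat', 's')
```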
KNOWLEDGE IN NLP (CONTD….)
 Moving beyond words, these applications require structural knowledge to group together the words that constitute a response. The knowledge needed to group and order words is called syntactic knowledge.
 To answer questions or to understand text, various NLP applications need to have knowledge of semantics, i.e. meaning. Consider the question to be answered by a question answering system:
How much Chinese silk was exported to Western Europe by the end of the 18th century?
To answer this question, we need lexical semantics- the meaning of all the words (export, silk)- and compositional semantics (what constitutes Western Europe as opposed to Eastern or Southern Europe, and what end means in the context of the 18th century).
KNOWLEDGE IN NLP (CONTD….)
 Another kind of knowledge required by various NLP applications is pragmatic or dialogue knowledge- knowledge of the relationships of meaning to the goals and intentions of the speaker.
 For example, consider a question or request put to a conversational agent:
HAL, is John’s door open? or HAL, open John’s door.
Rather than simply replying No or No, I won’t open the door, HAL knows to be polite to the speaker and responds with the phrase I’m sorry, I can’t. This knowledge about the kinds of actions that speakers intend by their use of sentences is pragmatic or dialogue knowledge.
KNOWLEDGE IN NLP (CONTD….)
 Another form of pragmatic knowledge is discourse knowledge- knowledge about linguistic units larger than a sentence. For example, consider the question:
How many states were in the US that year?
To answer this question, the system must be able to interpret phrases like that year. This task of coreference resolution makes use of knowledge about how words like that, or pronouns like it or she, refer to previous parts of the discourse.
KNOWLEDGE IN NLP (CONTD….)
 To summarize, complex language processing applications require various kinds of knowledge:
 Phonetics and phonology- knowledge about linguistic sounds.
 Morphology- knowledge of the representation, structure and parts of words.
 Syntax- knowledge of the structural relationships between words.
 Semantics- knowledge of meaning.
 Pragmatics- knowledge of the relationships of meaning to the goals and intentions of the speaker.
 Discourse- knowledge about linguistic units larger than a sentence.
APPROACHES TO LANGUAGE
PROCESSING
 There have been two major approaches to natural language
processing: rationalist approach and empirical approach.
 Between about 1960 and 1985, most of linguistics, psychology,
artificial intelligence, and natural language processing was
completely dominated by a rationalist approach.
 A rationalist approach is characterized by the belief that a
significant part of the knowledge in the human mind is not
derived by the senses but is fixed in advance, presumably by
genetic inheritance.
 Within linguistics, this rationalist position has come to
dominate the field due to the widespread acceptance of
arguments by Noam Chomsky for an innate language faculty
(natural language processing capability).
APPROACHES TO LANGUAGE PROCESSING CONTD…
 Chomsky suggested that it is difficult to see how children can learn something as complex as a natural language from the limited input (of variable quality and interpretability) that they hear during their early years.
 Within artificial intelligence, rationalist beliefs can be seen as
supporting the attempt to create intelligent systems by hand
coding into them a lot of starting knowledge and reasoning
mechanisms, so as to duplicate what the human brain begins
with.
 An empiricist approach also begins by postulating some
cognitive abilities as present in the brain.
APPROACHES TO LANGUAGE PROCESSING CONTD…
 The empiricist approach assumes that a baby’s brain begins with general operations for association, pattern recognition, and generalization, and that these can be applied to the rich sensory input available to the child to learn the detailed structure of natural language.
 An empiricist approach to NLP suggests that we can learn the
complicated and extensive structure of language by specifying
an appropriate general language model, and then inducing the
values of parameters by applying statistical, pattern
recognition, and machine learning methods to a large amount
of language use.
STATISTICAL NATURAL LANGUAGE PROCESSING
 Statistical NLP focuses on corpus-driven methods that make use of supervised and unsupervised machine learning approaches and algorithms.
 Formerly, many language-processing tasks involved the direct hand coding of rules, which is not in general robust to natural language variation.
 The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples (a corpus, plural "corpora", is a set of documents, possibly with human or computer annotations).
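The corpus-driven idea can be shown in miniature: rather than hand-coding which words are likely, we estimate probabilities from counts in a corpus. The "corpus" below is a tiny invented sample purely for illustration.

```python
from collections import Counter

# Statistical inference from a (tiny) corpus: estimate unigram word
# probabilities from raw counts instead of writing rules by hand.
corpus = "the cat sat on the mat the cat ate".split()
counts = Counter(corpus)
total = sum(counts.values())

def p(word: str) -> float:
    """Maximum-likelihood estimate of P(word) from corpus counts."""
    return counts[word] / total

print(p("the"))  # 3/9: "the" occurred 3 times out of 9 tokens
```

With a larger annotated corpus, the same counting idea extends to tag probabilities, translation probabilities, and so on, which is what "learning rules from corpora" means in practice.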
STATISTICAL NLP CONTD……
 Systems based on machine-learning algorithms have many advantages over hand-produced rules:
 Less dependence on language-specific expertise.
 They automatically focus on the most common cases.
 They are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted).
 They can be made more accurate simply by supplying more input data.
ORIGIN OF NLP-PHASE I (1950S TO 1970S)
 The work of the first phase was focused on machine translation.
 Shortly after World War II ended came the Cold War with Soviet Russia.
 With the fear of nuclear war and Soviet spies, natural language processing research began, focusing on machine translation.
 The research began with the translation of the Russian language, both spoken and written, into English.
 This tool was considered vital to the United States government because, when fully developed, it would enable the translation of Russian text into English with a low chance of error and at speeds faster than humans.
 Because the government needed it, funding was readily available.
ORIGIN OF NLP-PHASE I CONTD.....
 Automatic translation from Russian to English, in a very rudimentary form and as a limited experiment, was exhibited in the IBM-Georgetown demonstration of 1954.
 The experiment converted more than sixty Russian sentences to English using the IBM-701 mainframe computer.
 The researchers claimed that the remaining problems of machine translation would be solved within the next three or four years, so things were looking good. However, real progress was much slower.
 In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence.
 For a machine to pass the Turing test, NLP was a major requirement.
ORIGIN OF NLP-PHASE I CONTD.....
 The United States’ National Research Council (NRC) founded the Automatic Language Processing Advisory Committee (ALPAC) in 1964.
 ALPAC was the committee assigned to evaluate the progress of NLP research.
 In 1966, after twelve years and twenty million dollars, ALPAC and the NRC halted research on machine translation because progress had slowed and machine translation had become more expensive than manual human translation.
ORIGIN OF NLP-PHASE I CONTD.....
 By the end of the 1950s, research in NLP had split into two paradigms: symbolic and stochastic.
 The symbolic paradigm followed two lines of research.
 The first was the work of Chomsky (Syntactic Structures) on formal language theory, and the work of many linguists and computer scientists on parsing algorithms.
 One of the earliest parsing systems was Zellig Harris’s Transformations and Discourse Analysis Project (TDAP) at the University of Pennsylvania.
 The second was the field of Artificial Intelligence. In 1956 the Dartmouth Conference took place at Dartmouth College, New Hampshire.
 At this conference John McCarthy coined the term ‘Artificial Intelligence’, and the participants held a two-month-long ‘brainstorming session’ on many topics related to AI.
ORIGIN OF NLP-PHASE I CONTD.....
 The major focus of the new field of AI was work on logic and reasoning, based on Newell and Simon’s work on the Logic Theorist and the General Problem Solver.
 Early natural language understanding systems, based on pattern matching and keyword search for reasoning and question answering, were built.
 The stochastic paradigm took hold mainly in departments of statistics and electrical engineering.
 By the late 1950s, Bayesian methods were being applied to optical character recognition and text recognition.
ORIGIN OF NLP-PHASE I CONTD.....
 The 1960s saw the rise of the first psychological models of human language processing, based on transformational grammar.
 The first online corpus, the Brown corpus of American English- a 1-million-word collection of samples from 500 written texts (newspapers, novels, non-fiction, academic prose, etc.)- was prepared at Brown University.
 An online Chinese dialect dictionary called DOC was also prepared.
ORIGIN OF NLP: PHASE II (1970-1983)
 The next period saw an explosion of research in NLP and in the number of research paradigms that are still dominant today.
 The stochastic paradigm played a huge role in the development of speech recognition algorithms, particularly through the use of Hidden Markov Models (HMMs) and noisy channel decoding.
 IBM’s Thomas J. Watson Research Center, Carnegie Mellon University, and AT&T’s Bell Labs were key centers for work on speech recognition and synthesis.
 The logic-based paradigm began with work based on Q-systems, transformational grammars, Definite Clause Grammars, and functional grammars.
ORIGIN OF NLP: PHASE II CONTD.....
 The natural language understanding (NLU) paradigm also took off during this period.
 The work began with Winograd’s SHRDLU system, which simulated a robot embedded in a world of blocks.
 The system made it clear that parsing was well enough understood that the focus could begin to shift to semantics and discourse.
 A series of NLU programs was built focusing on conceptual knowledge such as scripts, plans, and goals.
 The logic-based and NLU paradigms were unified in systems that used predicate logic as a semantic representation, such as the LUNAR question answering system.
ORIGIN OF NLP: PHASE II CONTD.....
The discourse modeling paradigm focused on key areas in discourse.
 A number of researchers began to work on automatic reference resolution, and the Belief-Desire-Intention (BDI) framework for logic-based work on speech acts was developed.
ORIGIN OF NLP: PHASE III (1983-1993)
 The next decade saw the return of two important classes of models that had lost popularity in the late 1950s and early 1960s due to theoretical arguments against them, such as Chomsky’s review.
 The first class was finite state models, which began to receive attention again after work on finite state phonology, morphology, and syntax.
 The second class marked the return of empiricism, with the rise of probabilistic models throughout speech and language processing.
 The empirical direction now also focused on model evaluation, based on using held-out data, metrics for evaluation, and the comparison of performance on these metrics.
 Work in this phase was also done on natural language generation.
ORIGIN OF NLP PHASE IV: (1994-1999)
 By the last five years of the millennium, it was clear that the NLP field was undergoing major changes in a number of directions, and all those directions came together.
 Firstly, probabilistic and data-driven models had become quite standard throughout language processing. Evaluation mechanisms borrowed from speech recognition and information retrieval were employed.
 Secondly, increases in the speed and memory of computers allowed commercial exploitation of a number of application areas.
 The rise of the Web emphasized the need for language-based and multilingual information retrieval and extraction.
ORIGIN OF NLP: PHASE V (2000- PRESENT)
 In this phase there was a rise of machine learning techniques for NLP.
 The empiricist trend that began in the late 1990s accelerated at a great pace due to the following three trends:
 Firstly, large amounts of written and spoken material became widely available through various organizations such as the Linguistic Data Consortium (LDC), NIST, ELRA, FIRE, etc.
 Annotated collections such as the Penn Treebank, the Prague Dependency Treebank, and the Penn Discourse Treebank were made available with various morphological, syntactic, semantic, discourse and pragmatic annotations.
 These resources promoted casting the various complex problems of NLP as problems in supervised machine learning.
ORIGIN OF NLP: PHASE V (CONTD.....)
 Second, the increased focus on learning deepened the interplay with the statistical machine learning community.
 Techniques such as Support Vector Machines (SVMs), maximum entropy techniques, regression, and graphical Bayesian models became standard practice in NLP.
 Thirdly, the widespread availability of high-performance computing systems facilitated the training and deployment of various NLP systems.
 Near the end of the 2000s, largely unsupervised statistical approaches began to receive renewed attention, as the cost and difficulty of building annotated corpora for many problems became a limitation on the use of supervised learning.
 Deep learning is a class of machine learning algorithms that use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output of the previous layer as its input.
APPLICATIONS OF NLP-I
1) Part-of-speech tagging: also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context- i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.
 The POS tagger is often used as a preprocessor. Text indexing and retrieval use POS information, speech processing uses POS tags to decide pronunciation, and POS taggers are used for building tagged corpora.
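The task can be illustrated with the simplest data-driven baseline: tag each word with the tag it carried most often in a tagged training corpus. The tagged_corpus below is an invented toy sample, not a real annotated corpus, and the back-off choice of NOUN is an illustrative assumption.

```python
from collections import Counter, defaultdict

# A (made-up) tagged training sample: (word, tag) pairs.
tagged_corpus = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
                 ("the", "DET"), ("bark", "NOUN"), ("is", "VERB"),
                 ("rough", "ADJ")]

# Count how often each word received each tag.
tag_counts = defaultdict(Counter)
for word, t in tagged_corpus:
    tag_counts[word][t] += 1

def tag(word: str) -> str:
    """Most-frequent-tag baseline, with a crude back-off for unseen words."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return "NOUN"  # illustrative back-off guess

print([tag(w) for w in ["the", "dog", "barks"]])  # ['DET', 'NOUN', 'VERB']
```

Real taggers (HMM- or neural-network-based) also use context, which is what lets them pick VERB for bark in dogs bark even though bark was more often a noun in training.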
APPLICATIONS OF NLP-II
2) Word Sense Disambiguation (WSD): WSD is identifying which sense (i.e. meaning) of a word is used in a sentence, when the word has multiple meanings.
3) Named-entity recognition (NER) (also known as entity
identification, entity chunking and entity extraction) is a
subtask of information extraction that seeks to locate and
classify named entities in text into pre-defined categories such
as the names of persons, organizations, locations, expressions
of times, quantities, monetary values, percentages, etc.
o The WSD and NER systems are also used as a preprocessor in
a number of NLP applications such as machine translation,
information retrieval and extraction, etc.
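A minimal illustration of the NER task's input/output shape uses a hand-made gazetteer, i.e. a list of known names and their categories. Real NER systems learn statistical sequence models instead; the names and labels below are purely illustrative.

```python
import re

# Toy gazetteer: known names mapped to entity categories.
GAZETTEER = {"New Delhi": "LOCATION",
             "IBM": "ORGANIZATION",
             "Alan Turing": "PERSON"}

def find_entities(text: str):
    """Return (name, label, start_offset) triples found in the text."""
    entities = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            entities.append((name, label, m.start()))
    return sorted(entities, key=lambda e: e[2])

print(find_entities("Alan Turing never visited IBM in New Delhi."))
```

A gazetteer alone cannot handle unseen names or ambiguity (e.g. Washington as a person vs. a place), which is why statistical models dominate in practice.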
APPLICATIONS OF NLP-III
4) A spell checker (or spell check) is an application program that
flags words in a document that may not be spelled correctly.
Spell checkers may be stand-alone, capable of operating on a
block of text, or as part of a larger application, such as a word
processor, email client, electronic dictionary, or search engine.
5) Automatic text summarization is the process of shortening a
text document with software, in order to create a summary
with the major points of the original document.
o Text summarization systems are used in applications like entity
timelines, storylines of events, sentence compression,
summarization of user generated content.
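Candidate generation for the spell checker described in (4) can be sketched with the classic one-edit-distance idea (deletions, insertions, replacements and transpositions), in the spirit of Peter Norvig's well-known spell corrector. The LEXICON here is a tiny stand-in for a real dictionary.

```python
import string

# Tiny stand-in lexicon; a real spell checker ships a large dictionary.
LEXICON = {"the", "cat", "hat", "chat", "that"}

def edits1(word: str):
    """All strings one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    inserts = [a + c + b for a, b in splits for c in letters]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    return set(deletes + inserts + replaces + swaps)

def suggest(word: str):
    """Flag out-of-lexicon words and suggest nearby in-lexicon words."""
    if word in LEXICON:
        return [word]
    return sorted(edits1(word) & LEXICON)

print(suggest("cta"))  # ['cat'] - reachable by one transposition
```

Full spell checkers also rank candidates by word frequency and context, rather than returning all edit-distance-1 matches.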
APPLICATIONS OF NLP-IV
6) Machine translation is a sub-field of computational linguistics
that investigates the use of software to translate text or speech
from one language to another.
o Machine Translation systems are used in search engines, cross-
lingual information retrieval, social networking, military
applications, mobile applications, etc.
7) Document classification or document categorization is to
assign a document to one or more classes or categories.
o It is used in a number of applications like e-mail filtering, mail routing, spam filtering, news monitoring, selective dissemination of information to information consumers, automated indexing of scientific articles, automated population of hierarchical catalogues of Web resources, identification of document genre, authorship attribution, survey coding, and so on.
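Document classification, in the machine-learned style described earlier, can be sketched as a tiny multinomial Naive Bayes classifier with add-one smoothing. The four training "documents" and their spam/ham labels are invented for illustration; real systems train on large labelled collections.

```python
import math
from collections import Counter, defaultdict

# Invented labelled training documents.
train = [("win money now", "spam"), ("free money offer", "spam"),
         ("meeting agenda today", "ham"), ("project meeting notes", "ham")]

# Pool the words of each class.
class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].extend(text.split())

vocab = {w for words in class_docs.values() for w in words}
priors = {c: 0.5 for c in class_docs}  # balanced toy training set

def classify(text: str) -> str:
    """Pick the class with the highest log-probability under Naive Bayes."""
    scores = {}
    for c, words in class_docs.items():
        counts, total = Counter(words), len(words)
        score = math.log(priors[c])
        for w in text.split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("free money"))     # spam
print(classify("meeting notes"))  # ham
```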
APPLICATIONS OF NLP-V
8) Speech Synthesis/ Text-to-speech Synthesis: Speech synthesis is
the artificial production of human speech. A computer system used
for this purpose is called a speech computer or speech synthesizer.
A text-to-speech (TTS) system converts normal language text into
speech.  An intelligible text-to-speech program allows people
with visual impairments or reading disabilities to listen to written
words on a home computer.
9) Opinion Mining/ Sentiment Analysis: Sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic, or the overall contextual polarity or emotional reaction to a document, interaction, or event.
Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media content, and healthcare materials.
APPLICATIONS OF NLP-VI
10) Optical character recognition is the mechanical or electronic
conversion of images of typed, handwritten or printed text into
machine-encoded text, whether from a scanned document, a photo of a
document, a scene-photo (for example the text on signs and billboards
in a landscape photo) or from subtitle text superimposed on an image
(for example from a television broadcast).
It is widely used as a form of information entry from printed paper data
records like passport documents, invoices, bank statements,
computerized receipts, business cards, mail, printouts of static-data, or
any suitable documentation.
11) Question answering systems: question answering is concerned with building systems that automatically answer questions posed by humans in a natural language.
QA research attempts to deal with a wide range of question types, including fact, list, definition, how, why, hypothetical, semantically constrained, cross-lingual, open-ended, and closed-ended questions.
APPLICATIONS OF NLP-VII
12) Textual entailment (TE) in NLP is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text fragment.
Many natural language processing applications, like question answering (QA), information extraction (IE), (multi-document) summarization, and machine translation (MT) evaluation, need to recognize that a particular target meaning can be inferred from different text variants.
13) Topic segmentation and recognition: Given a chunk of text, separate it
into segments each of which is devoted to a topic, and identify the topic
of the segment.
 It can improve information retrieval or speech recognition significantly
(by indexing/recognizing documents more precisely or by giving the
specific part of a document corresponding to the query as a result).
It is also needed in topic detection and tracking systems and text
summarizing problems.
APPLICATIONS OF NLP-VIII
14) Relationship extraction: Given a chunk of text, identify the relationships
among named entities (e.g. who is married to whom).
Application domains where relationship extraction is useful include gene-
disease relationships, protein-protein interaction, etc.
15) Paraphrase Identification: Paraphrase detection is the task of examining two text entities (e.g. sentences) and determining whether they have the same meaning.
 Lexical level
 Example - solve and resolve
 Phrase level
 Example - look after and take care of
 Sentence level
 Example - The table was set up in the carriage shed and The table was laid under the cart-shed
 Pattern level
 Example - [X] considers [Y] and [X] takes [Y] into consideration
 Collocation level
 Example - (turn on, OBJ light) and (switch on, OBJ light)
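Sentence-level paraphrase detection is often bootstrapped from a simple word-overlap baseline such as Jaccard similarity; the 0.5 threshold below is an arbitrary illustrative choice, and real systems use much richer features.

```python
def jaccard(s1: str, s2: str) -> float:
    """Word-overlap (Jaccard) similarity between two sentences."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

def is_paraphrase(s1: str, s2: str, threshold: float = 0.5) -> bool:
    # Crude baseline: call it a paraphrase if enough words overlap.
    return jaccard(s1, s2) >= threshold

print(jaccard("the table was set up", "the table was laid out"))
```

Note how this baseline fails at the lexical level (solve vs. resolve share no tokens), which is why paraphrase systems also use synonym resources and learned similarity models.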
APPLICATIONS OF NLP-IX
 Paraphrase Identification is useful for:
 Machine Translation
 Simplify input sentences
 Alleviate data sparseness
 Question Answering
 Question reformulation
 Information Extraction
 IE pattern expansion
 Information Retrieval
 Query reformulation
 Summarization
 Sentence clustering
 Automatic evaluation
 Natural Language Generation
 Sentence rewriting
 Others
 Changing writing style
 Text simplification
 Identifying plagiarism
APPLICATIONS OF NLP-X
16) Automated essay scoring (AES) – the use of specialized
computer programs to assign grades to essays written in an
educational setting. It is a method of educational assessment
and an application of natural language processing. It can be
considered a problem of statistical classification.
17) Automatic image annotation – process by which a computer
system automatically assigns textual metadata in the form of
captioning or keywords to a digital image. The annotations
are used in image retrieval systems to organize and locate
images of interest from a database.
18) Automatic taxonomy induction – automated construction
of tree structures from a corpus. This may be applied to
building taxonomical classification systems for reading by
end users, such as web directories or subject outlines.
PROBLEMS / ISSUES IN NLP-I
1) Highly ambiguous at all levels: Languages are highly ambiguous, and this ambiguity occurs at each level of language processing.
Phonological ambiguity arises when words sound identical but have different meanings. For example:
Two = too = to
Wait = Weight
Word-level (semantic) ambiguity: words such as bank, diamond, and treat have different meanings in different contexts.
Syntactic ambiguity arises not from the range of meanings of single words, but from the relationships between the words and clauses of a sentence.
For example: The professor said on Monday he would give an exam; Visiting relatives can be boring.
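Word-level ambiguity is what WSD addresses. A simplified Lesk-style sketch picks the sense whose dictionary gloss shares the most words with the sentence; the glosses and sense labels for bank below are made up for illustration.

```python
# Toy sense inventory: (label, gloss) pairs for an ambiguous word.
SENSES = {
    "bank": [
        ("finance", "financial institution that accepts deposits of money"),
        ("geography", "sloping land beside a river"),
    ],
}

def disambiguate(word: str, sentence: str) -> str:
    """Pick the sense whose gloss overlaps most with the sentence context."""
    context = set(sentence.lower().split())

    def overlap(sense):
        return len(context & set(sense[1].split()))

    return max(SENSES[word], key=overlap)[0]

print(disambiguate("bank", "I deposited money at the bank"))  # finance
```

The original Lesk algorithm works the same way but over real dictionary glosses; modern WSD instead learns sense classifiers or uses contextual embeddings.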
PROBLEMS / ISSUES IN NLP-II
2) Idioms, metaphor: The idioms (set phrase), metaphor
(figure of speech) add more complexity to identify the
meaning of the written text.
For example, The old man finally kicked the bucket. This
sentence has nothing to do with the words kick and bucket
because it refers to English idiom which means ‘to die’.
3) Quantifier Resolution: The scope of quantifiers is another problem, and is often not clear while processing language.
Some of the quantifiers commonly used in English are a, all, any, at least, at most, either, every, more than, etc.
For example: Each table has at least two sponsors seated at it, and each sponsor is seated at exactly one table.
PROBLEMS / ISSUES IN NLP-III
4) Involves reasoning about the world:
Communication via language involves two brains- the brain of the speaker/writer and the brain of the hearer/reader.
Anything that is assumed to be known to the receiver is not encoded. The receiver possesses the knowledge and fills in the gaps while making an interpretation.
This implicit knowledge is encoded according to context, cultural knowledge, etc.
For example, to an Indian reader ‘Taj’ may mean a monument, a brand of tea, or a hotel; these associations may not hold for a non-Indian reader.
PROBLEMS / ISSUES IN NLP-IV
5) Dealing with emotions in language processing:
There is more to natural language processing than just the words.
In addition to intent and structure, there is also the sentiment or emotion involved in a statement.
People can be excited, nervous, angry or frustrated, and they speak differently and choose different words and key phrases when each of these emotions is present. Natural language processing algorithms must cut through this emotion to get to the point of a query.
But there is also predictive value in that emotion, and these algorithms can use the emotion to build context and predict questions that a user might want to ask.
PROBLEMS / ISSUES IN NLP-V
6) Finding referents of anaphora/cataphora:
Anaphora is the use of an expression that depends upon an antecedent expression; cataphora is the use of an expression that depends upon a postcedent expression.
For example, in the sentence Sally arrived, but nobody saw her, the pronoun her is an anaphor, referring back to the antecedent Sally.
In the sentence Before her arrival, nobody saw Sally, the pronoun her refers forward to the postcedent Sally.
Identifying these anaphora and cataphora is also a challenging task, as different languages have different sets of rules to express them.
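Why this is hard can be seen from even the simplest heuristic: link a pronoun to the most recent preceding name of compatible gender. The gender tables below are toy stand-ins, and the sketch deliberately fails on cataphora (e.g. Before her arrival, nobody saw Sally), since it only looks backwards.

```python
# Toy lexical resources; real systems use large gazetteers and
# learned features rather than hand-written tables.
PRONOUN_GENDER = {"he": "m", "him": "m", "she": "f", "her": "f"}
NAME_GENDER = {"Sally": "f", "John": "m"}

def resolve(tokens, pronoun_index):
    """Recency heuristic: nearest preceding name with a matching gender."""
    gender = PRONOUN_GENDER[tokens[pronoun_index].lower()]
    for i in range(pronoun_index - 1, -1, -1):
        if NAME_GENDER.get(tokens[i]) == gender:
            return tokens[i]
    return None  # no backward antecedent found (as with cataphora)

tokens = "Sally arrived but nobody saw her".split()
print(resolve(tokens, 5))  # Sally
```

Real coreference resolvers add syntactic constraints, semantic compatibility, and discourse structure on top of recency, precisely because of the failure cases this heuristic exposes.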
PROBLEMS / ISSUES IN NLP-VI
7) Representation of Knowledge: First-order logic (FOL) and knowledge representation systems find it difficult to represent some issues such as time and modality.
Representing and inferring world knowledge, commonsense knowledge in particular, is difficult.
The semantics of discourse segments is a difficult problem.
8) Challenges in Pragmatics: There are challenges in pragmatics as well. A simple declarative sentence stating a fact, It is sunny, is not only a statement of fact but also serves some communicative function.
The function may be to inform, to mislead about the fact or the speaker’s belief about the fact, to draw attention, to remind of a previously mentioned event or object related to the fact, etc.
So pragmatic interpretation is open-ended, and it is difficult to infer the speaker’s exact goals or plans from the sentence.
PROCESSING INDIAN LANGUAGES-I
 There are a number of differences between Indian languages and English, and thus there are differences in their processing. Some differences are:
1) In sentence structure, English uses subject-verb-object (SVO) word order whereas Hindi uses subject-object-verb (SOV) word order.
For example, in English we write the sentence as I eat apples (i.e. Subject (I), Verb (eat), Object (apples)).
But the corresponding sentence in Hindi is मैं सेब खाता हूँ, which is in Subject-Object-Verb order.
PROCESSING INDIAN LANGUAGES-II
2) Indian languages use post-position case markers (karakas) instead of prepositions.
In other words, prepositional words in English come before the noun/pronoun, but in Hindi these words come after the noun/pronoun, so they are postpositions rather than prepositions.
For example:
(1) यात्री मन्दिर की ओर जा रहे थे। (The travellers were going toward the temple.)
(2) पिताजी घर के अन्दर हैं। (Father is inside the house.)
(3) प्यास के मारे घोड़े का बुरा हाल था। (The horse was in a bad state because of thirst.)
PROCESSING INDIAN LANGUAGES-III
3) In Indian languages, verb complexes use a sequence of words rather than a single word as in English.
For example, the English verb form singing is expressed as गा रही है, गा रहा है, etc.
The auxiliary words in the sequence provide information about tense, aspect, modality, etc.
4) Indian languages have a relatively free word order, i.e. words can be moved within a sentence without changing its meaning.
For example, I like sweets and Sweets, I like are different in English, but in Hindi they are written as मुझे मिठाइयाँ पसंद है and मिठाइयाँ मुझे पसंद है, both of which mean the same.
PROCESSING INDIAN LANGUAGES-IV
5) Indian languages have a relatively large set of morphological variants.
English is a weakly inflectional language, but Indian languages have relatively complex morphology and are highly agglutinative and inflectional.
For example, languages like Marathi and Bengali have more than 20 variants for a single word.
6) Spelling standardization is more subtle in Indian languages than in English.
7) Unlike English, Indic scripts have a non-linear structure.
PROCESSING INDIAN LANGUAGES-V
8) Indian languages make extensive and productive use of complex predicates (CPs).
A complex predicate is a noun, a verb, an adjective or an adverb followed by a light verb (LV), the whole behaving as a single verbal unit.
Complex predicates are abundantly used in Hindi and other languages of the Indo-Aryan family.
(1) CP = noun + LV: उसने मुझे आशीर्वाद दिया (He blessed me).
(2) CP = adjective + LV: उसने मुझे प्रसन्न किया (He pleased me).
(3) CP = verb + LV: उसने किताब को फाड़ दिया (He tore the book).
ISSUES FOR RESEARCH IN NLP IN INDIA
 Lack of Annotated Corpora
 Lack of NLP Tools
 Lack of Standards
 Lack of Interaction Between Linguists and Computer Programmers
 Lack of Consolidated Efforts
 Lack of Education and Training Institutes