
INTRODUCTION TO NATURAL LANGUAGE PROCESSING (NLP)

Dr. Sukhnandan Kaur
TIET
CONTENTS
 NLP-Definition
 Origin of NLP
 Applications of NLP
 Challenges of NLP
 Processing Indian Languages
INTRODUCTION
 Language is the primary means of communication used by humans.
 It is the tool with which we express our ideas and emotions.
 Language shapes our thought, has a structure, and carries meaning.
 Representing ideas and thoughts is so natural that we hardly realize how we process knowledge.
 There must be some kind of representation (model) of the content of language in our mind that helps us represent it.
 Natural language processing is concerned with the development of computational models of aspects of human language processing.
INTRODUCTION CONTD….
 Building computational models with human language processing abilities requires knowledge of how humans acquire, store and process language.
 It also requires knowledge of the world and of language.
 These computational models are useful:
 in developing automated tools for language processing.
 in gaining a better understanding of human communication.
 Natural language processing is an interdisciplinary field that goes by many names, such as speech and language processing, human language technology, natural language processing, computational linguistics, and speech recognition and synthesis.
NLP FORMAL DEFINITIONS
 Natural language processing (NLP) is a field of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, and, in particular, with programming computers to fruitfully process large amounts of natural language data.
 Natural Language Processing is a field that covers computer
understanding and manipulation of human language.
 NLP is a way for computers to analyze, understand, and derive
meaning from human language in a smart and useful way.
KNOWLEDGE IN NATURAL LANGUAGE
PROCESSING
 NLP applications, like other conventional programs or data processing systems, require knowledge. But NLP applications also require knowledge of language.
 For instance, consider the UNIX wc program, which counts the total number of bytes, words, and lines in a text. When it counts the number of bytes, it is a simple data processing application.
 But when it is used to count the number of words in a file, it requires knowledge about what it means to be a word in a language, and thus becomes a language processing system.
 wc is a simple system with extremely limited and superficial knowledge of language, but advanced applications such as conversational agents (e.g. HAL), machine translation, and question answering require different levels of knowledge.
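The distinction can be sketched in a few lines of Python (a hypothetical illustration, not the actual wc implementation): counting bytes needs no linguistic knowledge, but counting words forces a decision about what counts as a word - for instance, whether can't is one token or two.

```python
import re

def count_bytes(text: str) -> int:
    # Pure data processing: no knowledge of language needed.
    return len(text.encode("utf-8"))

def count_words(text: str) -> int:
    # Even this simple regex embodies a linguistic decision:
    # an apostrophe is treated as word-internal, so "can't" is one word.
    return len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text))

print(count_words("I can't open John's door."))  # 5 words under this definition
```

Changing the regex changes the definition of a word, which is exactly the linguistic knowledge the slide refers to.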
KNOWLEDGE IN NLP (CONTD….)
 Any conversational agent or speech recognition and synthesis system must be able to recognize words from an audio signal and to generate an audio signal from words.
 These applications require knowledge about phonetics and phonology- how words are pronounced in terms of sequences of sounds and how each of these sounds is realized acoustically.
 These systems must also be able to produce and understand variations like I’m and can’t, as well as variations like pluralization, verb forms, etc. Producing and recognizing these variations of individual words requires knowledge about morphology- the study of words: the structure of words and the parts of words, such as stems, root words, prefixes, and suffixes.
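A toy suffix-stripping analyzer shows what minimal morphological knowledge looks like in code. This is only an illustration with an invented suffix list; real morphological analyzers (e.g. finite-state ones) are far more sophisticated.

```python
# Ordered list of suffixes to try; longer suffixes first so that
# "walking" strips "ing" rather than just "g".
SUFFIXES = ["ing", "ed", "es", "s"]

def strip_suffix(word: str):
    """Return a (stem, suffix) guess for an inflected word form."""
    for suf in SUFFIXES:
        # Require a reasonably long remaining stem so short words
        # like "sing" or "bed" are not mangled.
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)], suf
    return word, ""

print(strip_suffix("walking"))  # ('walk', 'ing')
print(strip_suffix("cats"))     # ('cat', 's')
```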
KNOWLEDGE IN NLP (CONTD….)
 Moving beyond words, these applications require structural knowledge to group together the words that constitute a response. The knowledge needed to group and order words is called syntactic knowledge.
 To answer questions or to understand text, various NLP applications need to have knowledge of semantics, i.e. meaning. Consider the question to be answered by a question answering system:
How much Chinese silk was exported to Western Europe by the end of the 18th century?
To answer this question, we need lexical semantics- the meaning of all the words (export, silk)- and compositional semantics (what constitutes Western Europe as opposed to Eastern or Southern Europe, and what end means in the context of the 18th century).
KNOWLEDGE IN NLP (CONTD….)
 Another kind of knowledge required by various NLP applications is pragmatic or dialogue knowledge- knowledge of the relationships of meaning to the goals and intentions of the speaker.
 For example, consider a question or request put to a conversational agent:
HAL, is John’s door open? or HAL, open John’s door.
Rather than simply replying No or No, I won’t open the door, HAL knows to be polite to the speaker and responds with the phrase I’m sorry, I can’t. This knowledge about the kinds of actions that speakers intend by their use of sentences is pragmatic or dialogue knowledge.
KNOWLEDGE IN NLP (CONTD….)
 Another form of pragmatic knowledge is discourse knowledge- knowledge about linguistic units larger than a sentence. For example, consider the question:
How many states were in the US that year?
To answer this question, the system must be able to interpret phrases like that year. This task of coreference resolution makes use of knowledge about how words like that, or pronouns like it or she, refer to previous parts of the discourse.
KNOWLEDGE IN NLP (CONTD….)
 To summarize, complex language processing applications require various kinds of knowledge:
 Phonetics and phonology- knowledge about linguistic sounds.
 Morphology- knowledge of the representation, structure and parts of words.
 Syntax- knowledge of the structural relationships between words.
 Semantics- knowledge of meaning.
 Pragmatics- knowledge of the relationships of meaning to the goals and intentions of the speaker.
 Discourse- knowledge about linguistic units larger than a sentence.
APPROACHES TO LANGUAGE
PROCESSING
 There have been two major approaches to natural language
processing: rationalist approach and empirical approach.
 Between about 1960 and 1985, most of linguistics, psychology,
artificial intelligence, and natural language processing was
completely dominated by a rationalist approach.
 A rationalist approach is characterized by the belief that a
significant part of the knowledge in the human mind is not
derived by the senses but is fixed in advance, presumably by
genetic inheritance.
 Within linguistics, this rationalist position has come to
dominate the field due to the widespread acceptance of
arguments by Noam Chomsky for an innate language faculty
(natural language processing capability).
APPROACHES TO LANGUAGE PROCESSING CONTD…
 Chomsky suggested that it is difficult to see how children can learn something as complex as a natural language from the limited input (of variable quality and interpretability) that they hear during their early years.
 Within artificial intelligence, rationalist beliefs can be seen as
supporting the attempt to create intelligent systems by hand
coding into them a lot of starting knowledge and reasoning
mechanisms, so as to duplicate what the human brain begins
with.
 An empiricist approach also begins by postulating some
cognitive abilities as present in the brain.
APPROACHES TO LANGUAGE PROCESSING CONTD…
 The empiricist approach assumes that a baby’s brain begins with general operations for association, pattern recognition, and generalization, and that these can be applied to the rich sensory input available to the child to learn the detailed structure of natural language.
 An empiricist approach to NLP suggests that we can learn the
complicated and extensive structure of language by specifying
an appropriate general language model, and then inducing the
values of parameters by applying statistical, pattern
recognition, and machine learning methods to a large amount
of language use.
STATISTICAL NATURAL LANGUAGE PROCESSING
 Statistical NLP focuses on corpus-driven methods that make use of supervised and unsupervised machine learning approaches and algorithms.
 Formerly, many language-processing tasks involved the direct hand coding of rules, which is not in general robust to natural language variation.
 The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples (a corpus, plural "corpora", is a set of documents, possibly with human or computer annotations).
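The corpus-driven idea can be shown in miniature: rather than hand-coding which words are likely, we estimate probabilities from counts in a corpus. The "corpus" below is a tiny invented sample purely for illustration.

```python
from collections import Counter

# Statistical inference from a (tiny) corpus: estimate unigram word
# probabilities from raw counts instead of writing rules by hand.
corpus = "the cat sat on the mat the cat ate".split()
counts = Counter(corpus)
total = sum(counts.values())

def p(word: str) -> float:
    """Maximum-likelihood estimate of P(word) from corpus counts."""
    return counts[word] / total

print(p("the"))  # 3/9: "the" occurred 3 times out of 9 tokens
```

With a larger annotated corpus, the same counting idea extends to tag probabilities, translation probabilities, and so on, which is what "learning rules from corpora" means in practice.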
STATISTICAL NLP CONTD……
 Systems based on machine-learning algorithms have many advantages over hand-produced rules:
 Less dependence on language-specific expertise.
 They automatically focus on the most common cases.
 They are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted).
 They can be made more accurate simply by supplying more input data.
ORIGIN OF NLP-PHASE I (1950S TO 1970S)
 The work of the first phase was focused on machine translation.
 Shortly after World War II ended came the Cold War with Soviet Russia.
 With the fear of nuclear war and Soviet spies, natural language processing research began, focusing on machine translation.
 The research began with the translation of the Russian language, both spoken and written, into English.
 This tool was considered vital to the United States government because, when fully developed, it would enable the translation of Russian text into English with a low chance of error and at speeds faster than humans.
 Because the government needed it, funding was readily available.
ORIGIN OF NLP-PHASE I CONTD.....
 Automatic translation from Russian to English, in a very rudimentary form and as a limited experiment, was exhibited in the IBM-Georgetown demonstration of 1954.
 The experiment converted more than sixty Russian sentences to English using the IBM-701 mainframe computer.
 The researchers claimed that the remaining problems of machine translation would be solved within the next three or four years, so things were looking good. However, real progress was much slower.
 In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence.
 For a machine to pass the Turing test, NLP was a major requirement.
ORIGIN OF NLP-PHASE I CONTD.....
 The United States’ National Research Council (NRC) founded the Automatic Language Processing Advisory Committee (ALPAC) in 1964.
 ALPAC was the committee assigned to evaluate the progress of NLP research.
 In 1966, after twelve years and twenty million dollars, ALPAC and the NRC halted research on machine translation because progress had slowed and machine translation had become more expensive than manual human translation.
ORIGIN OF NLP-PHASE I CONTD.....
 By the end of the 1950s, research in NLP had split into two paradigms: symbolic and stochastic.
 The symbolic paradigm followed two lines of research.
 The first was the work of Chomsky (Syntactic Structures) on formal language theory, and the work of many linguists and computer scientists on parsing algorithms.
 One of the earliest parsing systems was Zellig Harris’s Transformations and Discourse Analysis Project (TDAP) at the University of Pennsylvania.
 The second was the field of Artificial Intelligence. In 1956 the Dartmouth Conference took place at Dartmouth College, New Hampshire.
 At this conference John McCarthy coined the term ‘Artificial Intelligence’, and the participants held a two-month-long ‘brainstorming session’ on many topics related to AI.
ORIGIN OF NLP-PHASE I CONTD.....
 The major focus of the new field of AI was work on logic and reasoning, based on Newell and Simon’s work on the Logic Theorist and the General Problem Solver.
 Early natural language understanding systems, based on pattern matching and keyword search for reasoning and question answering, were built.
 The stochastic paradigm took hold mainly in departments of statistics and electrical engineering.
 By the late 1950s, Bayesian methods were being applied to optical character recognition and text recognition.
ORIGIN OF NLP-PHASE I CONTD.....
 The 1960s saw the rise of the first psychological models of human language processing, based on transformational grammar.
 The first online corpus, the Brown corpus of American English- a 1-million-word collection of samples from 500 written texts (newspapers, novels, non-fiction, academic prose, etc.)- was prepared at Brown University.
 An online Chinese dialect dictionary called DOC was also prepared.
ORIGIN OF NLP: PHASE II (1970-1983)
 The next period saw an explosion of research in NLP and in the number of research paradigms that are still dominant today.
 The stochastic paradigm played a huge role in the development of speech recognition algorithms, particularly through the use of Hidden Markov Models (HMMs) and noisy channel decoding.
 IBM’s Thomas J. Watson Research Center, Carnegie Mellon University, and AT&T’s Bell Labs were key centers for work on speech recognition and synthesis.
 The logic-based paradigm began with work based on Q-systems, transformational grammars, Definite Clause Grammars, and functional grammars.
ORIGIN OF NLP: PHASE II CONTD.....
 The natural language understanding (NLU) paradigm also took off during this period.
 The work began with Winograd’s SHRDLU system, which simulated a robot embedded in a world of blocks.
 The system made it clear that parsing was well enough understood that the focus could begin to shift to semantics and discourse.
 A series of NLU programs was built focusing on conceptual knowledge such as scripts, plans, and goals.
 The logic-based and NLU paradigms were unified in systems that used predicate logic as a semantic representation, such as the LUNAR question answering system.
ORIGIN OF NLP: PHASE II CONTD.....
The discourse modeling paradigm focused on key areas in discourse.
 A number of researchers began to work on automatic reference resolution, and the Belief-Desire-Intention (BDI) framework for logic-based work on speech acts was developed.
ORIGIN OF NLP: PHASE III (1983-1993)
 The next decade saw the return of two important classes of models that had lost popularity in the late 1950s and early 1960s due to theoretical arguments against them, such as Chomsky’s review.
 The first class was finite state models, which began to receive attention again after work on finite state phonology, morphology, and syntax.
 The second class marked the return of empiricism, with the rise of probabilistic models throughout speech and language processing.
 The empirical direction now also focused on model evaluation, based on using held-out data, metrics for evaluation, and the comparison of performance on these metrics.
 Work in this phase was also done on natural language generation.
ORIGIN OF NLP PHASE IV: (1994-1999)
 By the last five years of the millennium, it was clear that the NLP field was undergoing major changes in a number of directions, and all those directions came together.
 Firstly, probabilistic and data-driven models had become quite standard throughout language processing. Evaluation mechanisms borrowed from speech recognition and information retrieval were employed.
 Secondly, increases in the speed and memory of computers allowed commercial exploitation of a number of application areas.
 The rise of the Web emphasized the need for language-based and multilingual information retrieval and extraction.
ORIGIN OF NLP: PHASE V (2000- PRESENT)
 In this phase there was a rise of machine learning techniques for NLP.
 The empiricist trend that began in the late 1990s accelerated at a great pace due to the following three trends:
 Firstly, large amounts of written and spoken material became widely available through various organizations such as the Linguistic Data Consortium (LDC), NIST, ELRA, FIRE, etc.
 Annotated collections such as the Penn Treebank, the Prague Dependency Treebank, and the Penn Discourse Treebank were made available with various morphological, syntactic, semantic, discourse and pragmatic annotations.
 These resources promoted casting the various complex problems of NLP as problems in supervised machine learning.
ORIGIN OF NLP: PHASE V (CONTD.....)
 Second, the increased focus on learning deepened the interplay with the statistical machine learning community.
 Techniques such as Support Vector Machines (SVMs), maximum entropy techniques, regression, and graphical Bayesian models became standard practice in NLP.
 Thirdly, the widespread availability of high-performance computing systems facilitated the training and deployment of various NLP systems.
 Near the end of the 2000s, largely unsupervised statistical approaches began to receive renewed attention, as the cost and difficulty of building annotated corpora for many problems became a limitation on the use of supervised learning.
 Deep learning is a class of machine learning algorithms that use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output of the previous layer as its input.
APPLICATIONS OF NLP-I
1) Part-of-speech tagging: also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context- i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.
 The POS tagger is often used as a preprocessor. Text indexing and retrieval use POS information, speech processing uses POS tags to decide pronunciation, and POS taggers are used for building tagged corpora.
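The task can be illustrated with the simplest data-driven baseline: tag each word with the tag it carried most often in a tagged training corpus. The tagged_corpus below is an invented toy sample, not a real annotated corpus, and the back-off choice of NOUN is an illustrative assumption.

```python
from collections import Counter, defaultdict

# A (made-up) tagged training sample: (word, tag) pairs.
tagged_corpus = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
                 ("the", "DET"), ("bark", "NOUN"), ("is", "VERB"),
                 ("rough", "ADJ")]

# Count how often each word received each tag.
tag_counts = defaultdict(Counter)
for word, t in tagged_corpus:
    tag_counts[word][t] += 1

def tag(word: str) -> str:
    """Most-frequent-tag baseline, with a crude back-off for unseen words."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return "NOUN"  # illustrative back-off guess

print([tag(w) for w in ["the", "dog", "barks"]])  # ['DET', 'NOUN', 'VERB']
```

Real taggers (HMM- or neural-network-based) also use context, which is what lets them pick VERB for bark in dogs bark even though bark was more often a noun in training.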
APPLICATIONS OF NLP-II
2) Word Sense Disambiguation (WSD): WSD is identifying which sense (i.e. meaning) of a word is used in a sentence, when the word has multiple meanings.
3) Named-entity recognition (NER) (also known as entity
identification, entity chunking and entity extraction) is a
subtask of information extraction that seeks to locate and
classify named entities in text into pre-defined categories such
as the names of persons, organizations, locations, expressions
of times, quantities, monetary values, percentages, etc.
o The WSD and NER systems are also used as a preprocessor in
a number of NLP applications such as machine translation,
information retrieval and extraction, etc.
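A minimal illustration of the NER task's input/output shape uses a hand-made gazetteer, i.e. a list of known names and their categories. Real NER systems learn statistical sequence models instead; the names and labels below are purely illustrative.

```python
import re

# Toy gazetteer: known names mapped to entity categories.
GAZETTEER = {"New Delhi": "LOCATION",
             "IBM": "ORGANIZATION",
             "Alan Turing": "PERSON"}

def find_entities(text: str):
    """Return (name, label, start_offset) triples found in the text."""
    entities = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            entities.append((name, label, m.start()))
    return sorted(entities, key=lambda e: e[2])

print(find_entities("Alan Turing never visited IBM in New Delhi."))
```

A gazetteer alone cannot handle unseen names or ambiguity (e.g. Washington as a person vs. a place), which is why statistical models dominate in practice.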
APPLICATIONS OF NLP-III
4) A spell checker (or spell check) is an application program that
flags words in a document that may not be spelled correctly.
Spell checkers may be stand-alone, capable of operating on a
block of text, or as part of a larger application, such as a word
processor, email client, electronic dictionary, or search engine.
5) Automatic text summarization is the process of shortening a
text document with software, in order to create a summary
with the major points of the original document.
o Text summarization systems are used in applications like entity
timelines, storylines of events, sentence compression,
summarization of user generated content.
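Candidate generation for the spell checker described in (4) can be sketched with the classic one-edit-distance idea (deletions, insertions, replacements and transpositions), in the spirit of Peter Norvig's well-known spell corrector. The LEXICON here is a tiny stand-in for a real dictionary.

```python
import string

# Tiny stand-in lexicon; a real spell checker ships a large dictionary.
LEXICON = {"the", "cat", "hat", "chat", "that"}

def edits1(word: str):
    """All strings one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    inserts = [a + c + b for a, b in splits for c in letters]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    return set(deletes + inserts + replaces + swaps)

def suggest(word: str):
    """Flag out-of-lexicon words and suggest nearby in-lexicon words."""
    if word in LEXICON:
        return [word]
    return sorted(edits1(word) & LEXICON)

print(suggest("cta"))  # ['cat'] - reachable by one transposition
```

Full spell checkers also rank candidates by word frequency and context, rather than returning all edit-distance-1 matches.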
APPLICATIONS OF NLP-IV
6) Machine translation is a sub-field of computational linguistics
that investigates the use of software to translate text or speech
from one language to another.
o Machine Translation systems are used in search engines, cross-
lingual information retrieval, social networking, military
applications, mobile applications, etc.
7) Document classification or document categorization is to
assign a document to one or more classes or categories.
o It is used in a number of applications like e-mail filtering, mail routing, spam filtering, news monitoring, selective dissemination of information to information consumers, automated indexing of scientific articles, automated population of hierarchical catalogues of Web resources, identification of document genre, authorship attribution, survey coding, and so on.
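Document classification, in the machine-learned style described earlier, can be sketched as a tiny multinomial Naive Bayes classifier with add-one smoothing. The four training "documents" and their spam/ham labels are invented for illustration; real systems train on large labelled collections.

```python
import math
from collections import Counter, defaultdict

# Invented labelled training documents.
train = [("win money now", "spam"), ("free money offer", "spam"),
         ("meeting agenda today", "ham"), ("project meeting notes", "ham")]

# Pool the words of each class.
class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].extend(text.split())

vocab = {w for words in class_docs.values() for w in words}
priors = {c: 0.5 for c in class_docs}  # balanced toy training set

def classify(text: str) -> str:
    """Pick the class with the highest log-probability under Naive Bayes."""
    scores = {}
    for c, words in class_docs.items():
        counts, total = Counter(words), len(words)
        score = math.log(priors[c])
        for w in text.split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("free money"))     # spam
print(classify("meeting notes"))  # ham
```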
APPLICATIONS OF NLP-V
8) Speech Synthesis/ Text-to-speech Synthesis: Speech synthesis is
the artificial production of human speech. A computer system used
for this purpose is called a speech computer or speech synthesizer.
A text-to-speech (TTS) system converts normal language text into
speech.  An intelligible text-to-speech program allows people
with visual impairments or reading disabilities to listen to written
words on a home computer.
9) Opinion Mining/ Sentiment Analysis: Sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic, or the overall contextual polarity or emotional reaction to a document, interaction, or event.
Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media content, and healthcare materials.
APPLICATIONS OF NLP-VI
10) Optical character recognition is the mechanical or electronic
conversion of images of typed, handwritten or printed text into
machine-encoded text, whether from a scanned document, a photo of a
document, a scene-photo (for example the text on signs and billboards
in a landscape photo) or from subtitle text superimposed on an image
(for example from a television broadcast).
It is widely used as a form of information entry from printed paper data
records like passport documents, invoices, bank statements,
computerized receipts, business cards, mail, printouts of static-data, or
any suitable documentation.
11) Question answering systems: question answering is concerned with building systems that automatically answer questions posed by humans in a natural language.
QA research attempts to deal with a wide range of question types, including fact, list, definition, how, why, hypothetical, semantically constrained, cross-lingual, open-ended, and closed-ended questions.
APPLICATIONS OF NLP-VII
12) Textual entailment (TE) in NLP is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text fragment.
Many natural language processing applications, like question answering (QA), information extraction (IE), (multi-document) summarization, and machine translation (MT) evaluation, need to recognize that a particular target meaning can be inferred from different text variants.
13) Topic segmentation and recognition: Given a chunk of text, separate it
into segments each of which is devoted to a topic, and identify the topic
of the segment.
 It can improve information retrieval or speech recognition significantly
(by indexing/recognizing documents more precisely or by giving the
specific part of a document corresponding to the query as a result).
It is also needed in topic detection and tracking systems and text
summarizing problems.
APPLICATIONS OF NLP-VIII
14) Relationship extraction: Given a chunk of text, identify the relationships
among named entities (e.g. who is married to whom).
Application domains where relationship extraction is useful include gene-
disease relationships, protein-protein interaction, etc.
15) Paraphrase Identification: Paraphrase detection is the task of examining two text entities (e.g. sentences) and determining whether they have the same meaning.
 Lexical level
 Example - solve and resolve
 Phrase level
 Example - look after and take care of
 Sentence level
 Example - The table was set up in the carriage shed and The table was laid under the cart-shed
 Pattern level
 Example - [X] considers [Y] and [X] takes [Y] into consideration
 Collocation level
 Example - (turn on, OBJ light) and (switch on, OBJ light)
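Sentence-level paraphrase detection is often bootstrapped from a simple word-overlap baseline such as Jaccard similarity; the 0.5 threshold below is an arbitrary illustrative choice, and real systems use much richer features.

```python
def jaccard(s1: str, s2: str) -> float:
    """Word-overlap (Jaccard) similarity between two sentences."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

def is_paraphrase(s1: str, s2: str, threshold: float = 0.5) -> bool:
    # Crude baseline: call it a paraphrase if enough words overlap.
    return jaccard(s1, s2) >= threshold

print(jaccard("the table was set up", "the table was laid out"))
```

Note how this baseline fails at the lexical level (solve vs. resolve share no tokens), which is why paraphrase systems also use synonym resources and learned similarity models.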
APPLICATIONS OF NLP-IX
 Paraphrase Identification is useful for:
 Machine Translation
 Simplify input sentences
 Alleviate data sparseness
 Question Answering
 Question reformulation
 Information Extraction
 IE pattern expansion
 Information Retrieval
 Query reformulation
 Summarization
 Sentence clustering
 Automatic evaluation
 Natural Language Generation
 Sentence rewriting
 Others
 Changing writing style
 Text simplification
 Identifying plagiarism
APPLICATIONS OF NLP-X
16) Automated essay scoring (AES) – the use of specialized
computer programs to assign grades to essays written in an
educational setting. It is a method of educational assessment
and an application of natural language processing. It can be
considered a problem of statistical classification.
17) Automatic image annotation – process by which a computer
system automatically assigns textual metadata in the form of
captioning or keywords to a digital image. The annotations
are used in image retrieval systems to organize and locate
images of interest from a database.
18) Automatic taxonomy induction – automated construction
of tree structures from a corpus. This may be applied to
building taxonomical classification systems for reading by
end users, such as web directories or subject outlines.
PROBLEMS / ISSUES IN NLP-I
1) Highly ambiguous at all levels: Languages are highly ambiguous, and this ambiguity occurs at each level of language processing.
Phonological ambiguity arises when words sound identical but have different meanings. For example:
Two = too = to
Wait = Weight
Word-level (semantic) ambiguity: words such as bank, diamond, and treat have different meanings in different contexts.
Syntactic ambiguity arises not from the range of meanings of single words, but from the relationships between the words and clauses of a sentence.
For example: The professor said on Monday he would give an exam; Visiting relatives can be boring.
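Word-level ambiguity is what WSD addresses. A simplified Lesk-style sketch picks the sense whose dictionary gloss shares the most words with the sentence; the glosses and sense labels for bank below are made up for illustration.

```python
# Toy sense inventory: (label, gloss) pairs for an ambiguous word.
SENSES = {
    "bank": [
        ("finance", "financial institution that accepts deposits of money"),
        ("geography", "sloping land beside a river"),
    ],
}

def disambiguate(word: str, sentence: str) -> str:
    """Pick the sense whose gloss overlaps most with the sentence context."""
    context = set(sentence.lower().split())

    def overlap(sense):
        return len(context & set(sense[1].split()))

    return max(SENSES[word], key=overlap)[0]

print(disambiguate("bank", "I deposited money at the bank"))  # finance
```

The original Lesk algorithm works the same way but over real dictionary glosses; modern WSD instead learns sense classifiers or uses contextual embeddings.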
PROBLEMS / ISSUES IN NLP-II
2) Idioms, metaphor: The idioms (set phrase), metaphor
(figure of speech) add more complexity to identify the
meaning of the written text.
For example, The old man finally kicked the bucket. This
sentence has nothing to do with the words kick and bucket
because it refers to English idiom which means ‘to die’.
3) Quantifier Resolution: The scope of quantifiers is another problem, and is often not clear while processing language.
Some of the quantifiers commonly used in English are a, all, any, at least, at most, either, every, more than, etc.
For example: Each table has at least two sponsors seated at it, and each sponsor is seated at exactly one table.
PROBLEMS / ISSUES IN NLP-III
4) Involves reasoning about the world:
Communication via language involves two brains- the brain of the speaker/writer and the brain of the hearer/reader.
Anything that is assumed to be known to the receiver is not encoded. The receiver possesses the knowledge and fills in the gaps while making an interpretation.
This implicit knowledge is encoded according to context, cultural knowledge, etc.
For example, to an Indian reader ‘Taj’ may mean a monument, a brand of tea, or a hotel; these associations may not hold for a non-Indian reader.
PROBLEMS / ISSUES IN NLP-IV
5) Dealing with emotions in language processing:
There is more to natural language processing than just the words.
In addition to intent and structure, there is also the sentiment or emotion involved in a statement.
People can be excited, nervous, angry or frustrated, and they speak differently and choose different words and key phrases when each of these emotions is present. Natural language processing algorithms must cut through this emotion to get to the point of a query.
But there is also predictive value in that emotion, and these algorithms can use the emotion to build context and predict questions that a user might want to ask.
PROBLEMS / ISSUES IN NLP-V
6) Finding referents of anaphora/cataphora:
Anaphora is the use of an expression that depends upon an antecedent expression; cataphora is the use of an expression that depends upon a postcedent expression.
For example, in the sentence Sally arrived, but nobody saw her, the pronoun her is an anaphor, referring back to the antecedent Sally.
In the sentence Before her arrival, nobody saw Sally, the pronoun her refers forward to the postcedent Sally.
Identifying these anaphora and cataphora is also a challenging task, as different languages have different sets of rules to express them.
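Why this is hard can be seen from even the simplest heuristic: link a pronoun to the most recent preceding name of compatible gender. The gender tables below are toy stand-ins, and the sketch deliberately fails on cataphora (e.g. Before her arrival, nobody saw Sally), since it only looks backwards.

```python
# Toy lexical resources; real systems use large gazetteers and
# learned features rather than hand-written tables.
PRONOUN_GENDER = {"he": "m", "him": "m", "she": "f", "her": "f"}
NAME_GENDER = {"Sally": "f", "John": "m"}

def resolve(tokens, pronoun_index):
    """Recency heuristic: nearest preceding name with a matching gender."""
    gender = PRONOUN_GENDER[tokens[pronoun_index].lower()]
    for i in range(pronoun_index - 1, -1, -1):
        if NAME_GENDER.get(tokens[i]) == gender:
            return tokens[i]
    return None  # no backward antecedent found (as with cataphora)

tokens = "Sally arrived but nobody saw her".split()
print(resolve(tokens, 5))  # Sally
```

Real coreference resolvers add syntactic constraints, semantic compatibility, and discourse structure on top of recency, precisely because of the failure cases this heuristic exposes.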
PROBLEMS / ISSUES IN NLP-VI
7) Representation of Knowledge: First-order logic (FOL) and knowledge representation systems find it difficult to represent some issues such as time and modality.
Representing and inferring world knowledge, commonsense knowledge in particular, is difficult.
The semantics of discourse segments is a difficult problem.
8) Challenges in Pragmatics: There are challenges in pragmatics as well. A simple declarative sentence stating a fact, It is sunny, is not only a statement of fact but also serves some communicative function.
The function may be to inform, to mislead about the fact or the speaker’s belief about the fact, to draw attention, to remind of a previously mentioned event or object related to the fact, etc.
So pragmatic interpretation is open-ended, and it is difficult to infer the speaker’s exact goals or plans from the sentence.
PROCESSING INDIAN LANGUAGES-I
 There are a number of differences between Indian languages and English, and thus there are differences in their processing. Some differences are:
1) In sentence structure, English uses subject-verb-object (SVO) word order whereas Hindi uses subject-object-verb (SOV) word order.
For example, in English we write the sentence as I eat apples (i.e. Subject (I), Verb (eat), Object (apples)).
But the corresponding sentence in Hindi is मैं सेब खाता हूँ, which is in Subject-Object-Verb order.
PROCESSING INDIAN LANGUAGES-II
2) Indian languages use post-position case markers (karakas) instead of prepositions.
In other words, prepositional words in English come before the noun/pronoun, but in Hindi these words come after the noun/pronoun, so they are postpositions rather than prepositions.
For example:
(1) यात्री मन्दिर की ओर जा रहे थे। (The travellers were going toward the temple.)
(2) पिताजी घर के अन्दर हैं। (Father is inside the house.)
(3) प्यास के मारे घोड़े का बुरा हाल था। (The horse was in a bad state because of thirst.)
PROCESSING INDIAN LANGUAGES-III
3) In Indian languages, verb complexes use a sequence of words rather than a single word as in English.
For example, the English verb form singing is expressed as गा रही है, गा रहा है, etc.
The auxiliary words in the sequence provide information about tense, aspect, modality, etc.
4) Indian languages have a relatively free word order, i.e. words can be moved within a sentence without changing its meaning.
For example, I like sweets and Sweets, I like are different in English, but in Hindi they are written as मुझे मिठाइयाँ पसंद है and मिठाइयाँ मुझे पसंद है, both of which mean the same.
PROCESSING INDIAN LANGUAGES-IV
5) Indian languages have a relatively large set of morphological variants.
English is a weakly inflectional language, but Indian languages have relatively complex morphology and are highly agglutinative and inflectional.
For example, languages like Marathi and Bengali have more than 20 variants for a single word.
6) Spelling standardization is more subtle in Indian languages than in English.
7) Unlike English, Indic scripts have a non-linear structure.
PROCESSING INDIAN LANGUAGES-V
8) Indian languages make extensive and productive use of complex predicates (CPs).
A complex predicate is a noun, a verb, an adjective or an adverb followed by a light verb (LV), the whole behaving as a single verbal unit.
Complex predicates are abundantly used in Hindi and other languages of the Indo-Aryan family.
(1) CP = noun + LV: उसने मुझे आशीर्वाद दिया (He blessed me).
(2) CP = adjective + LV: उसने मुझे प्रसन्न किया (He pleased me).
(3) CP = verb + LV: उसने किताब को फाड़ दिया (He tore the book).
ISSUES FOR RESEARCH IN NLP IN INDIA
 Lack of Annotated Corpora
 Lack of NLP Tools
 Lack of Standards
 Lack of Interaction Between Linguists and Computer Programmers
 Lack of Consolidated Efforts
 Lack of Education and Training Institutes