
Natural Language Processing

CS 1462

Introduction
Computer Science
3rd semester - 1444
Dr. Fahman Saeed
[email protected]

Some slides borrowed from Carl Sable
What is NLP?

 NLP is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data
 It has no relation to neuro-linguistic programming
Some Applications of NLP

 Information retrieval
 Text categorization
 Grammar checking
 Automatic machine translation
 Question-answering
 Automatic summarization
Fields that Contributed to NLP

 Linguistics is the study of human (natural) languages, including both syntax and semantics
 Computer science provides data structures and algorithms for implementing NLP programs
Statistical vs. Symbolic Approaches

 As with AI in general, there has been debate as to whether NLP should rely primarily on a symbolic approach or a statistical approach
 The symbolic approach generally involves trying to understand how
humans process language, and developing algorithms that behave
similarly
 When I use the term computational linguistics, I am referring to
such symbolic approaches (some sources use the term
"computational linguistics" more generally)
 The statistical approach generally involves using machine
learning (ML) to learn from data
 The data is often provided in the form of manually labeled corpora
that can be used to train systems
 There are those who think that the two approaches should
ultimately complement each other
 However, it is not generally clear how to best go about doing that
A Very Brief History of NLP

 Historically, early successes in NLP involved rule-based, symbolic approaches; for example:
 ELIZA, a program created at MIT by Joseph Weizenbaum in the mid-1960s, used simple pattern-
matching techniques to imitate the responses of a Rogerian psychotherapist
 SHRDLU, a program created at MIT by Terry Winograd in the late-1960s, could carry on simple
conversations about the so-called "blocks world"
 The theories of Noam Chomsky, although always somewhat controversial,
dominated linguistics for decades, and helped lead to several algorithms used in
computational linguistics
 As with AI in general, statistical approaches eventually began to dominate NLP
 From the 1990s through the early 2000s, important ML approaches to NLP
included Bayesian approaches, hidden Markov models, support vector machines,
expectation maximization, etc.
 More recently, deep neural networks (including variations of recurrent neural
networks, convolutional neural networks, and more recently, transformers) have
come to dominate NLP
 Effective methods for computing useful word embeddings (and subword
embeddings) were also very important to the deep learning revolution in NLP
Programming-related Resources

 Applications in the field of NLP often make use of standard toolkits, libraries, and resources
 For example, the Natural Language Toolkit (NLTK) is a
popular open-source platform providing Python
implementations of important NLP-related algorithms
and access to available corpora
 For deep learning in NLP, two very popular frameworks
are PyTorch and TensorFlow (including Keras,
which provides a layer on top of TensorFlow)
 I would recommend using Python for your projects, and
either TensorFlow (with or without Keras) or PyTorch for
your third project
Tokenization

 Tokenization generally involves splitting natural language text into tokens or wordforms
 What should ultimately be treated as a word is not
trivial
 For some applications, words are kept in their
original wordforms
 Other applications may perform lemmatization or
stemming (the latter is simpler)
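The splitting itself can be sketched in a few lines of Python, assuming one particular convention (runs of word characters are tokens, and a few punctuation marks are kept as their own tokens); this is an illustration of the idea, not a full tokenizer:

```python
import re

def tokenize(text):
    """Split text into wordform tokens, keeping selected sentence
    punctuation as separate tokens (one possible convention)."""
    # \w+ matches runs of letters/digits/underscores; [.,!?;] keeps
    # these punctuation marks as their own tokens.
    return re.findall(r"\w+|[.,!?;]", text)

print(tokenize("The woodchuck slept."))   # ['The', 'woodchuck', 'slept', '.']
```

Other conventions (splitting possessive 's, lowercasing, and so on) would change the pattern accordingly.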
Lemmatization

 A lemma is the base form of a word (sometimes called the canonical form or dictionary form)
 Often, multiple wordforms (e.g., "sing", "sang", "sung", "singing")
share the same lemma
 The set of words that share the same lemma must all have the same
major part of speech (POS)
 We'll discuss POS in more detail in a later topic, but the notion also
comes up several times in this topic
 Algorithms to perform lemmatization can involve morphological
analysis of wordforms (we'll learn more about this later in the
topic)
 However, it is probably more common to use a resource; e.g., a very
popular resource for conventional NLP was WordNet
 WordNet had various other important uses as well in conventional
NLP
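The resource-lookup style of lemmatization can be sketched as a dictionary lookup; the tiny hand-built table below is purely hypothetical, standing in for a large resource such as WordNet:

```python
# Toy lookup-based lemmatizer; a real system would consult a large
# resource such as WordNet rather than this hand-built table.
LEMMAS = {
    "sing": "sing", "sang": "sing", "sung": "sing", "singing": "sing",
    "geese": "goose",
}

def lemmatize(wordform):
    # Fall back to the (lowercased) wordform itself when it is unknown.
    return LEMMAS.get(wordform.lower(), wordform.lower())

print(lemmatize("sang"))    # sing
print(lemmatize("Geese"))   # goose
```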
Stemming

 Stemming is simpler than lemmatization
 Stemming involves the use of a sequence of rules to convert a wordform to a simpler form
 For example, the Porter stemmer has been very popular in conventional NLP; sample rules include rewriting "ATIONAL" as "ATE" (e.g., "relational" becomes "relate") and "SSES" as "SS" (e.g., "caresses" becomes "caress")
 In theory, words that share the same lemma should map to the same stem; in practice, it sometimes works but sometimes does not
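The rule-sequence idea can be illustrated with a handful of suffix rules; note this is a deliberately tiny sketch, not the real Porter stemmer, which applies many more rules in ordered phases with conditions on the remaining stem:

```python
import re

# A tiny illustration of rule-based stemming (NOT the Porter stemmer).
# Rules are tried in order; the first match is applied.
RULES = [
    (r"sses$", "ss"),   # e.g., "caresses" -> "caress"
    (r"ies$",  "i"),    # e.g., "ponies"   -> "poni"
    (r"ing$",  ""),     # e.g., "hopping"  -> "hopp"
    (r"ed$",   ""),     # e.g., "walked"   -> "walk"
]

def stem(word):
    for pattern, replacement in RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

print(stem("caresses"), stem("ponies"), stem("walked"))
```

The "hopp" and "poni" outputs show why stems need not be real words, unlike lemmas.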
Stemming vs Lemmatization

Wordforms

 Note that what counts as a wordform in the first place varies between applications
 Some applications may just use whitespace to separate wordforms
 Some may strip punctuation
 Some may count certain punctuation (e.g., periods, questions marks,
etc.) as separate tokens
 Some applications may do more complicated splitting; e.g., some
split a possessive -s ('s) into a separate token
 Some applications may convert all letters to lower case to achieve
case insensitivity
 In some languages, words are not separated by spaces
 For example, in Chinese, words are composed of Hanzi characters
 Each generally represents a single morpheme and is pronounced as a
single syllable
Sentence Segmentation

 Another component of many NLP applications is sentence segmentation (also not trivial)
 It may seem intuitive to split a document into sentences first,
and then to tokenize sentences
 More often, the opposite occurs, since the result of
tokenization aids sentence segmentation
 One complication is that periods are also used for acronyms
(e.g., "U.S.A.", "m.p.h."), abbreviations (e.g., "Mr.", "Corp."),
and decimal points (e.g., "$50.25")
 Note that acronyms and some abbreviations ending in periods
can end sentences at times
 The process of tokenization, optionally followed by
lemmatization or stemming and/or sentence segmentation, is
often referred to as text normalization
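A deliberately naive splitter makes the difficulty concrete: breaking after ., !, or ? whenever an uppercase letter follows mishandles exactly the abbreviation cases described above, which is why real systems tokenize first and use abbreviation lists or learned models. This sketch is illustrative only:

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., !, or ? when followed by
    whitespace and an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(split_sentences("It works. Try it!"))
# The abbreviation "Dr." triggers a spurious break here:
print(split_sentences("I saw Dr. Smith. He waved!"))
```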
The Chomsky Hierarchy
 The Chomsky hierarchy defines four types of formal grammars that are
useful for various tasks
 These are unrestricted grammars (type 0), context-sensitive grammars (type
1), context-free grammars (type 2), and regular grammars (type 3)
 These are numbered from the most powerful / least restrictive (type 0) to the
least powerful / most restrictive (type 3)
 It is often useful (simpler and more efficient) to use the most restrictive type of
grammar that suits your purpose
 Regular grammars are generally powerful enough for tokenization
 There are various equivalent ways to define a specific instance of each type of
grammar that is part of the Chomsky hierarchy
 For example, when we talk about context-free grammars during Part II of the
course (on conventional computational linguistics), we will define them using
productions, a.k.a. rewrite rules
 Regular grammars can also be defined using rewrite rules; or they can be
defined using finite state automata; however, in this course, we will define
them with regular expressions
Regular Expressions

 A regular expression (RE) is a grammatical formalism useful for defining regular grammars
 Each regular expression is a formula in a special language that
specifies simple classes of strings (a string is just a sequence
of symbols)
 Regular expressions are case sensitive
 RE search requires a pattern that we want to search for and a
corpus of texts to search through
 Although the syntax might be a bit different, if you
understand how to use regular expressions, you won't have a
problem using them in Python or some other language
 Following the book, we will typically assume that an RE
search returns the first line of the document containing the
pattern
Simple Regular Expressions

 The simplest type of regular expression is just a sequence of one or more characters
 For example, to search for "woodchuck", you would just use the
following: /woodchuck/
 Brackets can be used to distinguish a disjunction of characters to
match
 For example, /[wW]oodchuck/ would search for the word starting
with a lowercase or capital 'w'
 A dash can help to specify a range
 For example, /[A-Za-z]/ searches for any uppercase or lowercase
letter
 If the first character within square brackets is the caret ('^'), this
means that we are specifying what the characters cannot be
 For example, /[^A-Z]/ matches any character except a capital
letter
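These bracket, range, and negation patterns can be tried directly in Python's `re` module, whose syntax accepts them unchanged:

```python
import re

# Bracketed disjunctions, ranges, and negation, as described above.
assert re.search(r"[wW]oodchuck", "Woodchuck!")   # upper or lower 'w'
assert re.search(r"[A-Za-z]", "42 apples")        # any letter
assert re.search(r"[^A-Z]", "ABCd")               # matches the 'd'
assert not re.search(r"[^A-Z]", "ABC")            # all capitals: no match
print("all patterns behaved as described")
```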
Special Characters

 A question mark ('?') can be used to match the preceding character or RE, or
nothing (i.e., zero or one instance of the preceding character or RE)
 For example, /woodchucks?/ matches the singular or plural of the word
"woodchuck"
 The Kleene star ('*') indicates zero or more occurrences of the previous character
or RE
 The Kleene + ('+') means one or more of the previous character or RE
 As an example of why these are useful, let's say you want to represent all strings
representing sheep sounds
 In other words, we want to represent the language consisting of the strings "baa!",
"baaa!", "baaaa!", "baaaaa!", etc.
 Two regular expressions defining this language are /baaa*!/ and /baa+!/
 This language is an example of a formal language, which is a set of strings
adhering to specific rules
 One very important special character is the period ('.'); this is a wildcard expression
that matches any single character (except an end-of-line character)
 For example, to find a line in which the word "aardvark" appears twice, you can use
/aardvark.*aardvark/
 To match an actual period, you can use "\." within an RE
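The sheep-language and wildcard examples above can be verified in Python's `re` module (using `fullmatch` to require the whole string to match):

```python
import re

assert re.fullmatch(r"woodchucks?", "woodchuck")   # '?' = zero or one 's'
assert re.fullmatch(r"baaa*!", "baa!")             # '*' = zero or more 'a'
assert re.fullmatch(r"baa+!", "baaaaa!")           # '+' = one or more 'a'
assert not re.fullmatch(r"baa+!", "ba!")           # needs at least two a's
# '.' matches any single character; '.*' bridges the two occurrences.
assert re.search(r"aardvark.*aardvark", "an aardvark saw an aardvark")
print("ok")
```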
Anchors

 Anchors are special characters that match particular places in a string
 For example, the caret ('^') and dollar sign ('$') can be used to
match the start of a line or end of a line, respectively
 Example: /^The/ matches the word "The" at the start of a line
 Example: / $/ matches a line ending with a space (the textbook
uses a character, '˽', to represent spaces)
 Example: /^The dog\.$/ matches a line that contains only the
phrase "The dog."
 Recall that the caret also has other meanings
 Two other anchors: \b matches a word boundary, and \B matches
any non-boundary position
 A "word" in this context is a sequence of letters, digits, and/or
underscores
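Anchor behavior can likewise be checked in Python's `re` module (note the raw strings: without them, '\b' would be a backspace character):

```python
import re

assert re.search(r"^The", "The dog barked")    # '^' anchors to line start
assert not re.search(r"^The", "See The dog")   # "The" not at the start
assert re.fullmatch(r"The dog\.", "The dog.")  # escaped literal period
assert re.search(r"\bthe\b", "in the end")     # whole word "the"
assert not re.search(r"\bthe\b", "other")      # "the" inside "other" fails
print("ok")
```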
Disjunction, Parentheses, and Precedence

 The ('|') character is called the disjunction operator, a.k.a. the pipe symbol
 For example, the pattern /cat|dog/ matches either the
string "cat" or the string "dog"
 You can use parentheses to help specify precedence
 For example, to search for the singular or plural of the
word "guppy", you can use /gupp(y|ies)/
 Unlike the | operator, the * and + operators apply by
default to a single character
 By putting the expression before these operators in
parentheses, you make the operator apply to the whole
thing
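The guppy example, and the effect of grouping on operator scope, can be demonstrated in Python's `re` module:

```python
import re

# Parentheses limit '|' to the alternation of suffixes.
assert re.fullmatch(r"gupp(y|ies)", "guppy")
assert re.fullmatch(r"gupp(y|ies)", "guppies")
assert not re.fullmatch(r"gupp(y|ies)", "gupp")
# Without parentheses, /guppy|ies/ means "guppy" OR "ies".
assert re.fullmatch(r"guppy|ies", "ies")
# Grouping also lets '+' apply to a whole sequence, not one character.
assert re.fullmatch(r"(na)+", "nanana")
print("ok")
```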
Precedence Hierarchy

 The operator precedence hierarchy for regular expressions, according to our textbook, is as follows:
1. Parentheses: ( )
2. Counters: * + ? { } (the curly braces can be used to specify ranges,
more on this soon)
3. Sequences and anchors (e.g., the, ^my, end$, etc.)
4. Disjunction: |
 The list of classes, with the specified order above, is
pretty typical
 Some sources list brackets [ ] as having even higher
precedence than parentheses ( ), and a backslash to
indicate special characters above that
RE Example

 The textbook steps through an example of searching for the word "the" in a text without any false positives or false negatives
 They speculate that we want any non-letter to count as a
word separator
 That is, to the immediate left of the 't' or to the
immediate right of the 'e', any non-letter may be present
 We also need to consider that "the" may occur at the start
or end of the line
 Finally, they allow the 't' to be capital or lowercase, but
the 'h' and 'e' must be lowercase
 They end up with the following expression:
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
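The final expression can be checked against a few lines in Python's `re` module, which accepts the pattern unchanged:

```python
import re

# The textbook's pattern for "the"/"The" as a whole word.
pattern = r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"

assert re.search(pattern, "the end")             # at the start of the line
assert re.search(pattern, "over the top")        # surrounded by spaces
assert re.search(pattern, "stop, The End")       # capital 'T' allowed
assert not re.search(pattern, "other theories")  # "the" only inside words
print("ok")
```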
Other RE Constructs

 Aliases for common sets include:
 \d (any digit)
 \D (a non-digit)
 \w (any alphanumeric or underscore)
 \W (non-alphanumeric)
 \s (any whitespace, e.g., space or tab)
 \S (non-whitespace)

 The characters \t and \n represent tab and newline
 Other special characters can be matched literally by preceding them with a backslash (e.g., /\./, /\*/, /\\/, etc.)
 Curly braces specify ranges of repetition:
 {n} means exactly n occurrences of the previous character or expression
 {n,m} means from n to m occurrences
 {n,} means at least n occurrences
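The aliases and counters behave the same way in Python's `re` module:

```python
import re

assert re.fullmatch(r"\d+", "2024")           # digits
assert re.fullmatch(r"\w+", "snake_case3")    # alphanumerics + underscore
assert re.search(r"\s", "a b")                # whitespace
assert re.fullmatch(r"a{3}", "aaa")           # exactly 3 occurrences
assert re.fullmatch(r"a{2,4}", "aaaa")        # 2 to 4 occurrences
assert not re.fullmatch(r"a{2,4}", "a")       # too few
assert re.fullmatch(r"ba{2,}!", "baaaa!")     # at least 2 occurrences
print("ok")
```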
Morphology

 Book: "Morphology is the study of the way words are built up from
smaller meaning-bearing units called morphemes."
 The current edition of the textbook very briefly discusses morphological
parsing, which is necessary for "sophisticated methods for lemmatization"
 Recall that in practice, a wordform can instead be looked up in an
appropriate resource to retrieve the lemma
 WordNet is an example of such a resource that was very popular in
conventional NLP
 To keep such resources current (e.g., by adding new words) involves a lot of
manual effort
 Also recall that we can avoid lemmatization all together by applying
stemming, which is much simpler (but doesn't always work as well)
 The current edition of the textbook dropped most of its discussion of
morphology
 We will discuss it in more detail than the book (but significantly less than I
used to); some of this content comes from the previous edition of the
textbook
Rules of Morphology

 Orthographic rules are general rules that deal with spelling and tell us how
to transform words; some examples are:
 To pluralize a noun ending in "y", change the "y" to an "i" and add "es" (e.g., "bunnies")
 A single consonant letter is often doubled before adding "-ing" or "-ed" suffixes (e.g.,
"begging", "begged")
 A "c" is often changed to "ck" when adding "-ing" and "-ed" (e.g., "picnicking", "picnicked")
 Morphological rules deal with exceptions; e.g., "fish" is its own plural,
"goose" becomes "geese"
 Morphological parsing uses both types of rules in order to break down a
word into its component morphemes
 A morpheme is the smallest part of the word that has a semantic meaning
 For example, given the wordform, "going", the parsed form can be
represented as: "VERB-go + GERUND-ing"
 Conventionally, morphological parsing sometimes played an important role
for POS tagging
 For morphologically complex languages (we'll discuss an example later), it
can also play an important role for web search
Stems and Affixes

 Two broad classes of morphemes are stems and affixes
 The stem is the main (i.e., the central, most important, or most
significant) morpheme of the word
 Affixes add additional meanings of various kinds
 Affixes can be further divided into prefixes, suffixes, infixes, and
circumfixes
 English debatably does not have circumfixes and proper English
probably does not have any infixes
 A word can have more than one affix; for example, the word
"unbelievably" has a stem ("believe"), a prefix ("un-"), and two
suffixes ("-able" and "-ly")
 English rarely stacks more than four or five affixes onto a word, but languages like Turkish have words with nine or ten
 We will discuss an example of a morphologically complex language
later
Stems and Affixes

 A prefix is an affix that attaches before its base, like inter- in international.
 A suffix is an affix that follows its base, like -s in cats.
 A circumfix is an affix that attaches around its base.
 An infix is an affix that attaches inside its base (the internal vowel changes in "run"/"ran" and "buy"/"bought" are related phenomena)
 A simultaneous affix is an affix that is realized at the same time as its base.
Combining Morphemes

 Four methods of combining morphemes to create words include inflection, derivation, compounding, and cliticization
 Inflection is the combination of a word stem with a grammatical morpheme,
usually resulting in a word of the same basic POS as the stem, usually filling some
syntactic function like agreement
 As mentioned earlier, we will discuss POS in more detail in a later topic
 Examples of inflection include pluralizing a noun or changing the tense of a verb
 Derivation is the combination of a word stem with a morpheme, usually resulting in a word of a different class, often with a meaning that is harder to predict
 Examples from the textbook (previous edition) include "appointee" from "appoint" and "clueless" from "clue"; an example from Wikipedia is "happiness" from "happy"
 Compounding is the combination of multiple word stems together; for example,
"doghouse"
 Cliticization is the combination of a word stem with a clitic
 A clitic is a morpheme that acts syntactically like a word but is reduced in form and
attached to another word
 An example is the English morpheme "'ve" in words such as "I've" (substituting for
"have") or the French definite article "l'" in words such as "l'opera" (substituting for
"le")
Inflection of English Nouns

 English nouns have only two kinds of inflections: an affix that makes the word plural and an affix that marks possessive
 Most nouns are pluralized by adding "s"
 The suffix "es" is added to most nouns ending in "s", "z", "sh",
"ch", and "x"
 Nouns ending in "y" preceded by a consonant change the "y"
to "i" and add "es"
 The possessive suffix usually just entails adding "'s" for
regular singular nouns or plural nouns not ending in "s" (e.g.,
"children's")
 Usually, you just add a lone apostrophe for plural nouns
ending in "s" (e.g., "llamas'") or names ending in "s" or "z"
(e.g., "Euripides' comedies")
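The regular pluralization rules above can be sketched as a short function; this covers only the regular cases listed here, and irregular nouns (e.g., "goose"/"geese") would need a lookup table on top of it:

```python
import re

def pluralize(noun):
    """Apply the regular English pluralization rules from this slide."""
    if re.search(r"[^aeiou]y$", noun):        # consonant + "y": "bunny" -> "bunnies"
        return noun[:-1] + "ies"
    if re.search(r"(s|z|sh|ch|x)$", noun):    # sibilant endings take "es"
        return noun + "es"
    return noun + "s"                         # default: just add "s"

print(pluralize("bunny"), pluralize("fox"), pluralize("llama"))
```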
Inflection of English Regular Verbs

 English has three kinds of verbs
 Main verbs (e.g., "eat", "sleep", "impeach")
 Modal verbs (e.g., "can", "will", "should")
 Primary verbs (e.g., "be", "have", "do")
 Regular verbs in English have four inflected forms; for example: "walk", "walks", "walking", "walked"
Inflection of English Irregular Verbs

 Irregular verbs typically have up to five different forms but can have as many as eight (e.g., the verb "be") or as few as three (e.g., "cut")
 The forms of "be" are: "be", "am", "is", "are", "was", "were", "been", "being"
 Other examples include "eat" (with five forms: "eat", "eats", "eating", "ate", "eaten") and "cut" (with only three: "cut", "cuts", "cutting")
Part-of-Speech Tagging

 Parts of speech (POS) are categories for words that indicate their syntactic
functions
 Parts of speech are also known as word classes, lexical tags, or syntactic
categories
 The traditional description, dating back to ancient grammarians, included eight parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article
Uses of POS

 Knowing the POS of a word gives you information about its neighbors
 Examples: possessive pronouns (e.g., "my", "her", "its") are likely to be followed by
nouns, while personal pronouns (e.g., "I", "you", "he") are likely to be followed by
verbs
 POS can tell us about how a word is pronounced (e.g., "content" as a noun or an
adjective)
 POS can also be useful for applications such as parsing, named entity recognition,
and coreference resolution
 Corpora that have been marked with parts of speech are useful for linguistic
research
 Part-of-speech tagging (a.k.a. POS tagging or sometimes just tagging) is the
automatic assignment of POS to words
 POS tagging is often an important first step before several other NLP applications
can be applied
 We will discuss the use of hidden Markov models (HMMs) for POS tagging later in
the topic
 There are also deep learning approaches for POS tagging, which tend to perform a
bit better (we'll learn about such methods later in the course)
Named-Entity Recognition

 Named-entity recognition (NER) is the task of recognizing and classifying important words and phrases, such as the names of people, institutions, countries, and cities, as well as expressions of time, money, and percentages
 Each recognized span is labeled with a class such as person, company, city, or currency
Example

 "Foreign minister spokesman Shen Guofang told Reuters"
 Here, "Shen Guofang" would be tagged as a person and "Reuters" as an organization
A Short Introduction to Arabic Natural Language Processing
Prof. Nizar Habash
Wishing you a fruitful educational experience
