
Natural Language Processing

CS 1462

Introduction
Computer Science
3rd semester - 1444
Dr. Fahman Saeed
[email protected]

Some slides borrowed from Carl Sable
What is NLP?

 NLP is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data
 It has no relation to neuro-linguistic programming
Some Applications of NLP

 Information retrieval
 Text categorization
 Grammar checking
 Automatic machine translation
 Question-answering
 Automatic summarization
Fields that Contributed to NLP

 Linguistics is the study of human (natural) languages, including both syntax and semantics
 Computer science provides data structures and algorithms for implementing NLP programs
Statistical vs. Symbolic Approaches

 As with AI in general, there has been debate as to whether NLP should rely primarily on a symbolic approach or a statistical approach
 The symbolic approach generally involves trying to understand how
humans process language, and developing algorithms that behave
similarly
 When I use the term computational linguistics, I am referring to
such symbolic approaches (some sources use the term
"computational linguistics" more generally)
 The statistical approach generally involves using machine
learning (ML) to learn from data
 The data is often provided in the form of manually labeled corpora
that can be used to train systems
 There are those who think that the two approaches should
ultimately complement each other
 However, it is not generally clear how to best go about doing that
A Very Brief History of NLP

 Historically, early successes in NLP involved rule-based, symbolic approaches; for example:
 ELIZA, a program created at MIT by Joseph Weizenbaum in the mid-1960s, used simple pattern-
matching techniques to imitate the responses of a Rogerian psychotherapist
 SHRDLU, a program created at MIT by Terry Winograd in the late-1960s, could carry on simple
conversations about the so-called "blocks world"
 The theories of Noam Chomsky, although always somewhat controversial,
dominated linguistics for decades, and helped lead to several algorithms used in
computational linguistics
 As with AI in general, statistical approaches eventually began to dominate NLP
 From the 1990s through the early 2000s, important ML approaches to NLP
included Bayesian approaches, hidden Markov models, support vector machines,
expectation maximization, etc.
 More recently, deep neural networks (including variations of recurrent neural
networks, convolutional neural networks, and more recently, transformers) have
come to dominate NLP
 Effective methods for computing useful word embeddings (and subword
embeddings) were also very important to the deep learning revolution in NLP
Programming-related Resources

 Applications in the field of NLP often make use of standard toolkits, libraries, and resources
 For example, the Natural Language Toolkit (NLTK) is a
popular open-source platform providing Python
implementations of important NLP-related algorithms
and access to available corpora
 For deep learning in NLP, two very popular frameworks
are PyTorch and TensorFlow (including Keras,
which provides a layer on top of TensorFlow)
 I would recommend using Python for your projects, and
either TensorFlow (with or without Keras) or PyTorch for
your third project
Tokenization

 Tokenization generally involves splitting natural language text into tokens or wordforms
 What should ultimately be treated as a word is not
trivial
 For some applications, words are kept in their
original wordforms
 Other applications may perform lemmatization or
stemming (the latter is simpler)
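The splitting itself can be sketched in a few lines of Python, assuming one particular convention (runs of word characters are tokens, and a few punctuation marks are kept as their own tokens); this is an illustration of the idea, not a full tokenizer:

```python
import re

def tokenize(text):
    """Split text into wordform tokens, keeping selected sentence
    punctuation as separate tokens (one possible convention)."""
    # \w+ matches runs of letters/digits/underscores; [.,!?;] keeps
    # these punctuation marks as their own tokens.
    return re.findall(r"\w+|[.,!?;]", text)

print(tokenize("The woodchuck slept."))   # ['The', 'woodchuck', 'slept', '.']
```

Other conventions (splitting possessive 's, lowercasing, and so on) would change the pattern accordingly.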
Lemmatization

 A lemma is the base form of a word (sometimes called the canonical form or dictionary form)
 Often, multiple wordforms (e.g., "sing", "sang", "sung", "singing")
share the same lemma
 The set of words that share the same lemma must all have the same
major part of speech (POS)
 We'll discuss POS in more detail in a later topic, but the notion also
comes up several times in this topic
 Algorithms to perform lemmatization can involve morphological
analysis of wordforms (we'll learn more about this later in the
topic)
 However, it is probably more common to use a resource; e.g., a very
popular resource for conventional NLP was WordNet
 WordNet had various other important uses as well in conventional
NLP
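The resource-lookup style of lemmatization can be sketched as a dictionary lookup; the tiny hand-built table below is purely hypothetical, standing in for a large resource such as WordNet:

```python
# Toy lookup-based lemmatizer; a real system would consult a large
# resource such as WordNet rather than this hand-built table.
LEMMAS = {
    "sing": "sing", "sang": "sing", "sung": "sing", "singing": "sing",
    "geese": "goose",
}

def lemmatize(wordform):
    # Fall back to the (lowercased) wordform itself when it is unknown.
    return LEMMAS.get(wordform.lower(), wordform.lower())

print(lemmatize("sang"))    # sing
print(lemmatize("Geese"))   # goose
```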
Stemming

 Stemming is simpler than lemmatization
 Stemming involves the use of a sequence of rules to convert a wordform to a simpler form
 For example, the Porter stemmer has been very popular in conventional NLP; sample rules include rewriting "ATIONAL" as "ATE" (e.g., "relational" becomes "relate") and "SSES" as "SS" (e.g., "caresses" becomes "caress")
 In theory, words that share the same lemma should map to the same stem; in practice, it sometimes works but sometimes does not
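The rule-sequence idea can be illustrated with a handful of suffix rules; note this is a deliberately tiny sketch, not the real Porter stemmer, which applies many more rules in ordered phases with conditions on the remaining stem:

```python
import re

# A tiny illustration of rule-based stemming (NOT the Porter stemmer).
# Rules are tried in order; the first match is applied.
RULES = [
    (r"sses$", "ss"),   # e.g., "caresses" -> "caress"
    (r"ies$",  "i"),    # e.g., "ponies"   -> "poni"
    (r"ing$",  ""),     # e.g., "hopping"  -> "hopp"
    (r"ed$",   ""),     # e.g., "walked"   -> "walk"
]

def stem(word):
    for pattern, replacement in RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

print(stem("caresses"), stem("ponies"), stem("walked"))
```

The "hopp" and "poni" outputs show why stems need not be real words, unlike lemmas.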
Stemming vs Lemmatization

Wordforms

 Note that what counts as a wordform in the first place varies between applications
 Some applications may just use whitespace to separate wordforms
 Some may strip punctuation
 Some may count certain punctuation (e.g., periods, questions marks,
etc.) as separate tokens
 Some applications may do more complicated splitting; e.g., some
split a possessive -s ('s) into a separate token
 Some applications may convert all letters to lower case to achieve
case insensitivity
 In some languages, words are not separated by spaces
 For example, in Chinese, words are composed of Hanzi characters
 Each generally represents a single morpheme and is pronounced as a
single syllable
Sentence Segmentation

 Another component of many NLP applications is sentence segmentation (also not trivial)
 It may seem intuitive to split a document into sentences first,
and then to tokenize sentences
 More often, the opposite occurs, since the result of
tokenization aids sentence segmentation
 One complication is that periods are also used for acronyms
(e.g., "U.S.A.", "m.p.h."), abbreviations (e.g., "Mr.", "Corp."),
and decimal points (e.g., "$50.25")
 Note that acronyms and some abbreviations ending in periods
can end sentences at times
 The process of tokenization, optionally followed by
lemmatization or stemming and/or sentence segmentation, is
often referred to as text normalization
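A deliberately naive splitter makes the difficulty concrete: breaking after ., !, or ? whenever an uppercase letter follows mishandles exactly the abbreviation cases described above, which is why real systems tokenize first and use abbreviation lists or learned models. This sketch is illustrative only:

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., !, or ? when followed by
    whitespace and an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(split_sentences("It works. Try it!"))
# The abbreviation "Dr." triggers a spurious break here:
print(split_sentences("I saw Dr. Smith. He waved!"))
```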
The Chomsky Hierarchy
 The Chomsky hierarchy defines four types of formal grammars that are
useful for various tasks
 These are unrestricted grammars (type 0), context-sensitive grammars (type
1), context-free grammars (type 2), and regular grammars (type 3)
 These are numbered from the most powerful / least restrictive (type 0) to the
least powerful / most restrictive (type 3)
 It is often useful (simpler and more efficient) to use the most restrictive type of
grammar that suits your purpose
 Regular grammars are generally powerful enough for tokenization
 There are various equivalent ways to define a specific instance of each type of
grammar that is part of the Chomsky hierarchy
 For example, when we talk about context-free grammars during Part II of the
course (on conventional computational linguistics), we will define them using
productions, a.k.a. rewrite rules
 Regular grammars can also be defined using rewrite rules; or they can be
defined using finite state automata; however, in this course, we will define
them with regular expressions
Regular Expressions

 A regular expression (RE) is a grammatical formalism useful for defining regular grammars
 Each regular expression is a formula in a special language that
specifies simple classes of strings (a string is just a sequence
of symbols)
 Regular expressions are case sensitive
 RE search requires a pattern that we want to search for and a
corpus of texts to search through
 Although the syntax might be a bit different, if you
understand how to use regular expressions, you won't have a
problem using them in Python or some other language
 Following the book, we will typically assume that an RE
search returns the first line of the document containing the
pattern
Simple Regular Expressions

 The simplest type of regular expression is just a sequence of one or more characters
 For example, to search for "woodchuck", you would just use the
following: /woodchuck/
 Brackets can be used to distinguish a disjunction of characters to
match
 For example, /[wW]oodchuck/ would search for the word starting
with a lowercase or capital 'w'
 A dash can help to specify a range
 For example, /[A-Za-z]/ searches for any uppercase or lowercase
letter
 If the first character within square brackets is the caret ('^'), this
means that we are specifying what the characters cannot be
 For example, /[^A-Z]/ matches any character except a capital
letter
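These bracket, range, and negation patterns can be tried directly in Python's `re` module, whose syntax accepts them unchanged:

```python
import re

# Bracketed disjunctions, ranges, and negation, as described above.
assert re.search(r"[wW]oodchuck", "Woodchuck!")   # upper or lower 'w'
assert re.search(r"[A-Za-z]", "42 apples")        # any letter
assert re.search(r"[^A-Z]", "ABCd")               # matches the 'd'
assert not re.search(r"[^A-Z]", "ABC")            # all capitals: no match
print("all patterns behaved as described")
```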
Special Characters

 A question mark ('?') can be used to match the preceding character or RE, or
nothing (i.e., zero or one instance of the preceding character or RE)
 For example, /woodchucks?/ matches the singular or plural of the word
"woodchuck"
 The Kleene star ('*') indicates zero or more occurrences of the previous character
or RE
 The Kleene + ('+') means one or more of the previous character or RE
 As an example of why these are useful, let's say you want to represent all strings
representing sheep sounds
 In other words, we want to represent the language consisting of the strings "baa!",
"baaa!", "baaaa!", "baaaaa!", etc.
 Two regular expressions defining this language are /baaa*!/ and /baa+!/
 This language is an example of a formal language, which is a set of strings
adhering to specific rules
 One very important special character is the period ('.'); this is a wildcard expression
that matches any single character (except an end-of-line character)
 For example, to find a line in which the word "aardvark" appears twice, you can use
/aardvark.*aardvark/
 To match an actual period, you can use "\." within an RE
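The sheep-language and wildcard examples above can be verified in Python's `re` module (using `fullmatch` to require the whole string to match):

```python
import re

assert re.fullmatch(r"woodchucks?", "woodchuck")   # '?' = zero or one 's'
assert re.fullmatch(r"baaa*!", "baa!")             # '*' = zero or more 'a'
assert re.fullmatch(r"baa+!", "baaaaa!")           # '+' = one or more 'a'
assert not re.fullmatch(r"baa+!", "ba!")           # needs at least two a's
# '.' matches any single character; '.*' bridges the two occurrences.
assert re.search(r"aardvark.*aardvark", "an aardvark saw an aardvark")
print("ok")
```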
Anchors

 Anchors are special characters that match particular places in a string
 For example, the caret ('^') and dollar sign ('$') can be used to
match the start of a line or end of a line, respectively
 Example: /^The/ matches the word "The" at the start of a line
 Example: / $/ matches a line ending with a space (the textbook
uses a character, '˽', to represent spaces)
 Example: /^The dog\.$/ matches a line that contains only the
phrase "The dog."
 Recall that the caret also has other meanings
 Two other anchors: \b matches a word boundary, and \B matches
any non-boundary position
 A "word" in this context is a sequence of letters, digits, and/or
underscores
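Anchor behavior can likewise be checked in Python's `re` module (note the raw strings: without them, '\b' would be a backspace character):

```python
import re

assert re.search(r"^The", "The dog barked")    # '^' anchors to line start
assert not re.search(r"^The", "See The dog")   # "The" not at the start
assert re.fullmatch(r"The dog\.", "The dog.")  # escaped literal period
assert re.search(r"\bthe\b", "in the end")     # whole word "the"
assert not re.search(r"\bthe\b", "other")      # "the" inside "other" fails
print("ok")
```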
Disjunction, Parentheses, and Precedence

 The ('|') character is called the disjunction operator, a.k.a. the pipe symbol
 For example, the pattern /cat|dog/ matches either the
string "cat" or the string "dog"
 You can use parentheses to help specify precedence
 For example, to search for the singular or plural of the
word "guppy", you can use /gupp(y|ies)/
 Unlike the | operator, the * and + operators apply by
default to a single character
 By putting the expression before these operators in
parentheses, you make the operator apply to the whole
thing
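The guppy example, and the effect of grouping on operator scope, can be demonstrated in Python's `re` module:

```python
import re

# Parentheses limit '|' to the alternation of suffixes.
assert re.fullmatch(r"gupp(y|ies)", "guppy")
assert re.fullmatch(r"gupp(y|ies)", "guppies")
assert not re.fullmatch(r"gupp(y|ies)", "gupp")
# Without parentheses, /guppy|ies/ means "guppy" OR "ies".
assert re.fullmatch(r"guppy|ies", "ies")
# Grouping also lets '+' apply to a whole sequence, not one character.
assert re.fullmatch(r"(na)+", "nanana")
print("ok")
```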
Precedence Hierarchy

 The operator precedence hierarchy for regular expressions, according to our textbook, is as follows:
1. Parentheses: ( )
2. Counters: * + ? { } (the curly braces can be used to specify ranges,
more on this soon)
3. Sequences and anchors (e.g., the, ^my, end$, etc.)
4. Disjunction: |
 The list of classes, with the specified order above, is
pretty typical
 Some sources list brackets [ ] as having even higher
precedence than parentheses ( ), and a backslash to
indicate special characters above that
RE Example

 The textbook steps through an example of searching for the word "the" in a text without any false positives or false negatives
 They speculate that we want any non-letter to count as a
word separator
 That is, to the immediate left of the 't' or to the
immediate right of the 'e', any non-letter may be present
 We also need to consider that "the" may occur at the start
or end of the line
 Finally, they allow the 't' to be capital or lowercase, but
the 'h' and 'e' must be lowercase
 They end up with the following expression:
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
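The final expression can be checked against a few lines in Python's `re` module, which accepts the pattern unchanged:

```python
import re

# The textbook's pattern for "the"/"The" as a whole word.
pattern = r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"

assert re.search(pattern, "the end")             # at the start of the line
assert re.search(pattern, "over the top")        # surrounded by spaces
assert re.search(pattern, "stop, The End")       # capital 'T' allowed
assert not re.search(pattern, "other theories")  # "the" only inside words
print("ok")
```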
Other RE Constructs

 Aliases for common sets include:
 \d (any digit)
 \D (a non-digit)
 \w (any alphanumeric or underscore)
 \W (non-alphanumeric)
 \s (any whitespace, e.g., space or tab)
 \S (non-whitespace)

 The characters \t and \n represent tab and newline
 Other special characters can be matched literally by preceding them with a backslash (e.g., /\./, /\*/, /\\/, etc.)
 Curly braces specify ranges of repetition:
 {n} means exactly n occurrences of the previous character or expression
 {n,m} means from n to m occurrences
 {n,} means at least n occurrences
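The aliases and counters behave the same way in Python's `re` module:

```python
import re

assert re.fullmatch(r"\d+", "2024")           # digits
assert re.fullmatch(r"\w+", "snake_case3")    # alphanumerics + underscore
assert re.search(r"\s", "a b")                # whitespace
assert re.fullmatch(r"a{3}", "aaa")           # exactly 3 occurrences
assert re.fullmatch(r"a{2,4}", "aaaa")        # 2 to 4 occurrences
assert not re.fullmatch(r"a{2,4}", "a")       # too few
assert re.fullmatch(r"ba{2,}!", "baaaa!")     # at least 2 occurrences
print("ok")
```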
Morphology

 Book: "Morphology is the study of the way words are built up from
smaller meaning-bearing units called morphemes."
 The current edition of the textbook very briefly discusses morphological
parsing, which is necessary for "sophisticated methods for lemmatization"
 Recall that in practice, a wordform can instead be looked up in an
appropriate resource to retrieve the lemma
 WordNet is an example of such a resource that was very popular in
conventional NLP
 To keep such resources current (e.g., by adding new words) involves a lot of
manual effort
 Also recall that we can avoid lemmatization all together by applying
stemming, which is much simpler (but doesn't always work as well)
 The current edition of the textbook dropped most of its discussion of
morphology
 We will discuss it in more detail than the book (but significantly less than I
used to); some of this content comes from the previous edition of the
textbook
Rules of Morphology

 Orthographic rules are general rules that deal with spelling and tell us how
to transform words; some examples are:
 To pluralize a noun ending in "y", change the "y" to an "i" and add "es" (e.g., "bunnies")
 A single consonant letter is often doubled before adding "-ing" or "-ed" suffixes (e.g.,
"begging", "begged")
 A "c" is often changed to "ck" when adding "-ing" and "-ed" (e.g., "picnicking", "picnicked")
 Morphological rules deal with exceptions; e.g., "fish" is its own plural,
"goose" becomes "geese"
 Morphological parsing uses both types of rules in order to break down a
word into its component morphemes
 A morpheme is the smallest part of the word that has a semantic meaning
 For example, given the wordform, "going", the parsed form can be
represented as: "VERB-go + GERUND-ing"
 Conventionally, morphological parsing sometimes played an important role
for POS tagging
 For morphologically complex languages (we'll discuss an example later), it
can also play an important role for web search
Stems and Affixes

 Two broad classes of morphemes are stems and affixes
 The stem is the main (i.e., the central, most important, or most
significant) morpheme of the word
 Affixes add additional meanings of various kinds
 Affixes can be further divided into prefixes, suffixes, infixes, and
circumfixes
 English debatably does not have circumfixes and proper English
probably does not have any infixes
 A word can have more than one affix; for example, the word
"unbelievably" has a stem ("believe"), a prefix ("un-"), and two
suffixes ("-able" and "-ly")
 English rarely stacks more than four or five affixes onto a word, but languages like Turkish have words with nine or ten
 We will discuss an example of a morphologically complex language
later
Stems and Affixes

 A prefix is an affix that attaches before its base, like inter- in international.
 A suffix is an affix that follows its base, like -s in cats.
 A circumfix is an affix that attaches around its base.
 An infix is an affix that attaches inside its base (the internal vowel changes in "run"/"ran" and "buy"/"bought" are related phenomena)
 A simultaneous affix is an affix that is realized at the same time as its base.
Combining Morphemes

 Four methods of combining morphemes to create words include inflection, derivation, compounding, and cliticization
 Inflection is the combination of a word stem with a grammatical morpheme,
usually resulting in a word of the same basic POS as the stem, usually filling some
syntactic function like agreement
 As mentioned earlier, we will discuss POS in more detail in a later topic
 Examples of inflection include pluralizing a noun or changing the tense of a verb
 Derivation is the combination of a word stem with a morpheme, usually resulting in a word of a different class, often with a meaning that is harder to predict
 Examples from the textbook (previous edition) include "appointee" from "appoint" and "clueless" from "clue"; an example from Wikipedia is "happiness" from "happy"
 Compounding is the combination of multiple word stems together; for example,
"doghouse"
 Cliticization is the combination of a word stem with a clitic
 A clitic is a morpheme that acts syntactically like a word but is reduced in form and
attached to another word
 An example is the English morpheme "'ve" in words such as "I've" (substituting for
"have") or the French definite article "l'" in words such as "l'opera" (substituting for
"le")
Inflection of English Nouns

 English nouns have only two kinds of inflections: an affix that makes the word plural and an affix that marks possessive
 Most nouns are pluralized by adding "s"
 The suffix "es" is added to most nouns ending in "s", "z", "sh",
"ch", and "x"
 Nouns ending in "y" preceded by a consonant change the "y"
to "i" and add "es"
 The possessive suffix usually just entails adding "'s" for
regular singular nouns or plural nouns not ending in "s" (e.g.,
"children's")
 Usually, you just add a lone apostrophe for plural nouns
ending in "s" (e.g., "llamas'") or names ending in "s" or "z"
(e.g., "Euripides' comedies")
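The regular pluralization rules above can be sketched as a short function; this covers only the regular cases listed here, and irregular nouns (e.g., "goose"/"geese") would need a lookup table on top of it:

```python
import re

def pluralize(noun):
    """Apply the regular English pluralization rules from this slide."""
    if re.search(r"[^aeiou]y$", noun):        # consonant + "y": "bunny" -> "bunnies"
        return noun[:-1] + "ies"
    if re.search(r"(s|z|sh|ch|x)$", noun):    # sibilant endings take "es"
        return noun + "es"
    return noun + "s"                         # default: just add "s"

print(pluralize("bunny"), pluralize("fox"), pluralize("llama"))
```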
Inflection of English Regular Verbs

 English has three kinds of verbs
 Main verbs (e.g., "eat", "sleep", "impeach")
 Modal verbs (e.g., "can", "will", "should")
 Primary verbs (e.g., "be", "have", "do")
 Regular verbs in English have four inflected forms; for example: "walk", "walks", "walking", "walked"
Inflection of English Irregular Verbs

 Irregular verbs typically have up to five different forms but can have as many as eight (e.g., the verb "be") or as few as three (e.g., "cut")
 The forms of "be" are: "be", "am", "is", "are", "was", "were", "been", "being"
 Other examples include "eat" (with five forms: "eat", "eats", "eating", "ate", "eaten") and "cut" (with only three: "cut", "cuts", "cutting")
Part-of-Speech Tagging

 Parts of speech (POS) are categories for words that indicate their syntactic
functions
 Parts of speech are also known as word classes, lexical tags, or syntactic
categories
 The traditional description, dating back to ancient grammarians, included eight parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article
Uses of POS

 Knowing the POS of a word gives you information about its neighbors
 Examples: possessive pronouns (e.g., "my", "her", "its") are likely to be followed by
nouns, while personal pronouns (e.g., "I", "you", "he") are likely to be followed by
verbs
 POS can tell us about how a word is pronounced (e.g., "content" as a noun or an
adjective)
 POS can also be useful for applications such as parsing, named entity recognition,
and coreference resolution
 Corpora that have been marked with parts of speech are useful for linguistic
research
 Part-of-speech tagging (a.k.a. POS tagging or sometimes just tagging) is the
automatic assignment of POS to words
 POS tagging is often an important first step before several other NLP applications
can be applied
 We will discuss the use of hidden Markov models (HMMs) for POS tagging later in
the topic
 There are also deep learning approaches for POS tagging, which tend to perform a
bit better (we'll learn about such methods later in the course)
Named-Entity Recognition

 Named-entity recognition (NER) is the task of recognizing and classifying important words and phrases, such as the names of people, institutions, countries, and cities, as well as expressions of time, money, and percentages
 Each recognized span is labeled with a class such as person, company, city, or currency
Example

 "Foreign minister spokesman Shen Guofang told Reuters"
 Here, "Shen Guofang" would be tagged as a person and "Reuters" as an organization
A Short Introduction to Arabic Natural Language Processing
Prof. Nizar Habash
Wishing you a fruitful educational experience
