Introduction to Natural
Language Processing (NLP)
Ababacar BA
EDC
Content
1. Objectives and Workplan
a. Definitions
b. Use cases
c. Objectives and Evaluations
d. Workspace setup
2. Traditional text analysis
a. Text processing fundamentals
b. Word/sentence embedding
c. Topic modeling & Document classification
d. Sentiment Analysis
3. More advanced NLP
a. Probabilistic models
b. Sequence models
c. Bonus: Attention models and speech to text
Timeline
SS1: Up to 2.a (Text processing fundamentals)
SS2: 2.a (Text processing fundamentals)
SS3: 2.b (Word/sentence embedding)
SS4: 2.c (Topic modeling & Document classification)
SS5: Up to 2.d (Sentiment Analysis)
SS6: Up to 3.b (Sequence models)
SS7: Up to 3.c (Attention models and speech to text)
SS8: Final project
Session 1
--
01
Objectives and
Workplan
Some definitions
● Text mining: Process of extracting information from textual data
● NLP: Field of AI gathering a set of techniques whose purpose is to understand,
generate, and translate human language, whether in written or spoken form
So what's the difference ?
NB: It's easier to deal with text data than audio data
Examples of
NLP Use Cases
● Text classification: Classify texts into predefined subjects.
Ex: online products, questions on Stack Overflow
● Automatic text summary: Given a large text, summarize it and return its main ideas.
Ex: executive summary for managers
● Chat bots: The ability to discuss with a machine (useful for customer advisors).
Ex: SNCF virtual assistant: https://bot.tgvinoui.sncf/
"ChatGPT: Optimizing
Language Models
for Dialogue"
"We’ve trained a model called
ChatGPT which interacts in a
conversational way. The
dialogue format makes it
possible for ChatGPT to answer
followup questions, admit its
mistakes, challenge incorrect
premises, and reject
inappropriate requests."
https://chat.openai.com/chat
Objectives, Evaluations &
rules
Objectives
• Get essential skills for getting information from text data
• Topic modeling, classification, sentiment analysis
• Apply pre-trained models on text for a specific task
Evaluations
• Small exercises during sessions + possible homeworks
• Final project
Rules
• AVOID CHATGPT or any other LLM: try by yourself first
• No foolish questions: we learn from mistakes
• Practice makes perfect
Workspace
Setup
• Prerequisites :
• Basic ML courses
• Python programming
• Materials : PC or laptop
• Software: Browser + Google account => Google Colab
https://colab.research.google.com/
02
Traditional text
analysis
Text processing fundamentals :
Handling text data in Python
● Structure of text data :
○ Strings + characters => Words / tokens => Sentences => Documents
● Decompose a text :
>> text = "The struggle to deliver on promises to provide Leopard 2 tanks for use
against Russian forces has exposed just how unprepared European militaries
are"
>> text.split(' ')
['The', 'struggle', 'to', 'deliver', 'on', 'promises', 'to', 'provide', 'Leopard', '2', 'tanks',
'for', 'use', 'against', 'Russian', 'forces', 'has', 'exposed', 'just', 'how', 'unprepared',
'European', 'militaries', 'are']
Text processing fundamentals :
Handling text data in Python
● Operations on string/words in Python (suppose "s" is a string value) :
○ len(s)
○ s.upper(); s.lower()
○ s.split(); s.strip()
○ s.find(); s.replace()
○ s.startswith(); s.endswith(); s.isupper(); s.isalpha(); s.isdigit()
● Practice : Take a sentence and use these functions to :
○ Split into words, lowercase first word and uppercase everything else
○ Count the number of words, words starting with "a", words ending with "e" and count digits
○ Find the positions of letter o's first and last occurrence
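A possible solution sketch for the practice above, using only the built-in string methods listed on this slide (the sample sentence is an arbitrary choice):

```python
# Sample sentence (any sentence works; results below depend on this choice).
s = "An apple a day keeps the doctor away since ages"

# Split into words, lowercase the first word, uppercase everything else.
words = s.split()
styled = [words[0].lower()] + [w.upper() for w in words[1:]]

# Count words, words starting with "a", words ending with "e", and digits.
n_words = len(words)
n_start_a = len([w for w in words if w.lower().startswith("a")])
n_end_e = len([w for w in words if w.lower().endswith("e")])
n_digits = len([c for c in s if c.isdigit()])

# Positions of the first and last occurrence of the letter "o".
first_o = s.find("o")
last_o = s.rfind("o")
```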
Text processing fundamentals :
Regular expressions
Regular Expression (Regex) : special sequence of characters that helps you match or
find other strings or sets of strings using a specialized syntax.
Regular expressions are used for searching, manipulating, and editing text based on
patterns.
For simplification: a regex is a description of a string using "metacharacters"
Most common regular expressions :
● "." : Matches any character except a newline.
● "^" : Matches the beginning of the string.
● "$" : Matches the end of the string.
● "*" : Matches 0 or more repetitions of the preceding pattern.
● "+" : Matches 1 or more repetitions of the preceding pattern.
● "?" : Matches 0 or 1 repetition of the preceding pattern.
● "[ ]": A character set that matches any one of the characters inside the brackets.
● "\d" : Matches any digit (equivalent to [0-9]).
● "\w" : Matches any word character (alphanumeric + underscore).
● "\s" : Matches any whitespace character (spaces, tabs, newlines).
Text processing fundamentals :
Regular expressions
Metacharacter Meaning Example
^ Starts with "^a"
$ Ends with "final$"
. Any character ".anguage"
* Zero or more occurrences ".*is in Paris"
[] A set of characters "[a-z]"; "[0-9]"
| Either, or "(B|b)ook"
● More REGEX in Python : https://docs.python.org/3/library/re.html
● Practice :
○ Give a regex to match "simple" emails
○ Write a regex expressing a timestamp in format "YYYY-MM-dd HH:mm:ss.SSS" => "\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}"
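The two practice patterns can be checked with Python's standard re module. The sample strings below are arbitrary, and the email regex is deliberately "simple" as the practice asks; real-world email validation is far more involved:

```python
import re

# Timestamp pattern from the practice: "YYYY-MM-dd HH:mm:ss.SSS".
ts_pattern = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}"

ts_ok = re.fullmatch(ts_pattern, "2024-01-31 12:05:59.123") is not None
ts_ko = re.fullmatch(ts_pattern, "2024-1-31 12:05:59") is not None  # malformed

# A "simple" email sketch: word characters, dots and dashes around "@".
email_pattern = r"[\w.-]+@[\w-]+\.\w+"
email_ok = re.fullmatch(email_pattern, "jane.doe@example.com") is not None
```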
Text processing fundamentals :
Regular expressions
Usage of REGEX : Find expressions, extract, replace
Useful functions in Python library re :
● re.findall : Returns a list containing all matches
● re.search : Returns a Match object if there is a match anywhere in the string
● re.split : Returns a list where the string has been split at each match
● re.sub : Replaces one or many matches with a string
More examples : https://studymachinelearning.com/regular-expression-for-natural-language-processing/
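A quick tour of the four re functions listed above, on a made-up sentence:

```python
import re

text = "AI is everywhere: AI in phones, AI in cars."

matches = re.findall(r"AI", text)   # every match as a list of strings
m = re.search(r"AI", text)          # Match object for the first occurrence
parts = re.split(r"AI", text)       # split the string at each match
replaced = re.sub(r"AI", "Artificial Intelligence", text)  # replace matches
```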
Practice : Copy a press article on Artificial Intelligence (Ex : https://digital-
strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence )
● Find all occurrences of the terms "Artificial Intelligence" and "AI"
● Print position of the first occurrence of "intelligence"
● Replace "Artificial Intelligence" by "AI" and split the text using the string "AI"
● Find all words starting with "a" and ending with "s"
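A sketch for this practice on a short stand-in text (to do the real exercise, paste the press article into the `article` variable yourself):

```python
import re

article = ("Artificial Intelligence is reshaping Europe. The AI Act aims to "
           "regulate artificial intelligence and its applications across areas.")

# All occurrences of "Artificial Intelligence" and "AI" (word-bounded,
# case-insensitive so lowercase mentions are caught too).
terms = re.findall(r"\bArtificial Intelligence\b|\bAI\b", article,
                   flags=re.IGNORECASE)

# Position of the first occurrence of "intelligence" (case-insensitive).
first = article.lower().find("intelligence")

# Replace "Artificial Intelligence" by "AI", then split on "AI".
unified = re.sub(r"Artificial Intelligence", "AI", article, flags=re.IGNORECASE)
chunks = unified.split("AI")

# Words starting with "a" and ending with "s".
a_words = re.findall(r"\b[Aa][a-z]*s\b", article)
```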
Session 2
--
Text processing fundamentals :
Tokenization
Definition :
Process of splitting a text or a sentence into smaller chunks (tokens).
Text processing fundamentals :
Tokenization
Practice :
With 3 Python frameworks (nltk, spacy, gensim), apply word and sentence
tokenization to the following text:
"The most visible advances have been in what’s called “natural language processing”
(NLP), the branch of AI focused on how computers can process language like humans
do. It has been used to write an article for The Guardian, and AI-authored blog
posts have gone viral — feats that weren’t possible a few years ago. AI even excels at
cognitive tasks like programming where it is able to generate programs for simple
video games from human instructions."
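nltk, spacy and gensim each ship ready-made tokenizers (e.g. nltk.word_tokenize, nltk.sent_tokenize). As a dependency-free sketch of the principle, a simple regex can already approximate word and sentence tokenization; note it will miss edge cases such as "Mr." or "U.S." that the libraries handle:

```python
import re

text = ("Natural language processing is a branch of AI. "
        "It has been used to write articles. AI even excels at programming.")

# Word tokens: runs of alphanumeric characters.
word_tokens = re.findall(r"\w+", text)

# Sentence tokens: split after terminal punctuation followed by whitespace.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text.strip())
```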
Text processing fundamentals :
Stopwords, special characters removal
Stopwords:
Words that do not add special meaning to the sentence. They can be connecting
terms (conjunctions), articles ...
List of stopwords using nltk in Python :
Practice : Take the previous text, apply tokenization and
remove stopwords using nltk.
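With nltk, the full English list comes from nltk.corpus.stopwords.words('english') after running nltk.download('stopwords'). The self-contained sketch below uses a tiny hand-picked subset of stopwords (my own choice) to illustrate the filtering step:

```python
# Illustrative stopword subset; nltk's real English list has ~180 entries.
stop_words = {"the", "a", "an", "of", "to", "in", "is", "it", "has", "been"}

# Tokens as produced by a tokenization step (taken from the ChatGPT quote).
tokens = ["the", "dialogue", "format", "makes", "it", "possible",
          "to", "answer", "followup", "questions"]

# Keep only tokens that are not stopwords (case-insensitive comparison).
filtered = [t for t in tokens if t.lower() not in stop_words]
```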
Text processing fundamentals :
Stopwords, special characters removal
Special characters:
Symbols used for punctuation, comparisons, etc.
List of special characters with Python :
Practice : Take a new text with special characters, apply tokenization and
remove stopwords and special characters.
Hint : look at the isalpha() method for filtering
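Following the isalpha() hint above: keeping only purely alphabetic tokens drops punctuation tokens and numbers in one pass. The token list is a made-up example; string.punctuation gives the standard set of special characters:

```python
import string

# Tokens as a tokenizer might produce them, punctuation included.
tokens = ["Hello", ",", "world", "!", "42", "it's", "great"]

# Keep only purely alphabetic tokens (drops punctuation, digits, and
# tokens containing an apostrophe).
alpha_only = [t for t in tokens if t.isalpha()]

# Alternative: drop only tokens that are a single special character.
punct = set(string.punctuation)
no_punct = [t for t in tokens if t not in punct]
```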
Text processing fundamentals :
Part Of Speech (POS) tagging
Definition
● POS tagging means identifying different components of a sentence using word
classes : nouns, verbs, adjectives, prepositions etc.
● To identify each class, underlying techniques are : using a dictionary or a
predictive model (a Markov chain, for example).
● Example of classes :
● CC: Coordinating conjunction
● CD: Cardinal digit
● DT: Determiner
● JJ: Adjective
● NN: Singular noun
● VBZ: Verb, 3rd person singular present
Text processing fundamentals :
Part Of Speech (POS) tagging
POS Tagging in Python (nltk)
Definition on a particular class :
Practice : Take a sentence and filter it to keep only verbs
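With nltk, nltk.pos_tag(tokens) returns (word, tag) pairs after downloading the 'averaged_perceptron_tagger' data. As a self-contained sketch, here is the dictionary-based approach mentioned on the previous slide, with a toy lexicon of my own, applied to the practice (keep only verbs):

```python
# Toy tag lexicon (illustrative only; real taggers also use context).
lexicon = {"the": "DT", "cat": "NN", "eats": "VBZ", "a": "DT",
           "small": "JJ", "mouse": "NN", "and": "CC", "runs": "VBZ"}

tokens = ["the", "cat", "eats", "a", "small", "mouse", "and", "runs"]

# Tag each token; default unknown words to NN as a crude fallback.
tagged = [(t, lexicon.get(t, "NN")) for t in tokens]

# Practice: keep only verbs (tags starting with "VB").
verbs = [t for t, tag in tagged if tag.startswith("VB")]
```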
Text processing fundamentals :
Named Entity Recognition (NER)
Definition
It goes beyond POS tagging : the objective is to recognize entities like names of
persons, organizations, locations, expressions of times, quantities, monetary values,
percentages
Use cases :
● Find person names in a document (authors, client complaints...)
● Companies mentioned in an article, their location
● Amounts in a contract or commercial document (to parse these documents
automatically)
Text processing fundamentals :
Named Entity Recognition (NER)
NER in Python (spacy)
Text processing fundamentals :
Named Entity Recognition (NER)
Practice : Take a rental advertisement from pap.fr (ex :
https://www.pap.fr/annonces/appartement-montmagny-95360-r443100759), and list
all the following items :
● Locations
● Organizations
● Prices
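spaCy's statistical NER (iterating over nlp(text).ents with a pretrained model such as en_core_web_sm) covers persons, organizations and locations; prices can also be extracted with a plain regex, as in this self-contained sketch for the last item (the ad text is invented):

```python
import re

# Stand-in rental ad; copy the real pap.fr listing text here yourself.
ad = "Bright flat near Montmagny station, rent 980 € per month, deposit 1 960 €."

# Prices: digits, optionally grouped in thousands by a space, then "€".
prices = re.findall(r"\d+(?: \d{3})*\s*€", ad)
```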
Text processing fundamentals :
Stemming and Lemmatization
Principle: Reduce a word to its root form
● Stemming => Depending on the type of stemmer, it returns a root common to
multiple variations of words from the same family. Most popular stemmers
: Porter, Snowball, and Lancaster
○ Example : "universal", "universities", "universes" can all be reduced down
to "univers" (which doesn't exist as a word)
● Lemmatization => Replace a word by its grammatical root (lemma).
○ Example : "universal", "universities", "universes" are reduced to "universal",
"university", "universe"
Text processing fundamentals :
Stemming and Lemmatization
Stemming and lemmatization in Python
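nltk exposes the stemmers named above (e.g. nltk.stem.PorterStemmer().stem(word)), and lemmatization via nltk's WordNetLemmatizer or spaCy's token.lemma_. As a dependency-free sketch in the same spirit, here is a naive suffix-stripping stemmer; the real Porter algorithm applies ordered, conditional rewrite rules rather than this single pass:

```python
def naive_stem(word):
    # Strip one common English suffix, longest candidates first, keeping
    # at least 3 characters of stem (illustrative only, not Porter's rules).
    for suffix in ("ities", "ves", "ing", "ies", "es", "al", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# The slide's example family all collapses to the non-word root "univers".
stems = [naive_stem(w) for w in ["universal", "universities", "universes"]]
```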
Assignment 1 Course materials :
https://1drv.ms/f/c/ab584826bfa6bf60/Eg2mh5dZmG9GhpqkpXrHrv
oBvSWsN0pge7neZT9LYACIFg?e=xo4rp4