Introduction to Natural
Language Processing (NLP)
Ababacar BA
EDC
Content
1. Objectives and Workplan
a. Definitions
b. Use cases
c. Objectives and Evaluations
d. Workspace setup
2. Traditional text analysis
a. Text processing fundamentals
b. Word/sentence embedding
c. Topic modeling & Document classification
d. Sentiment Analysis
3. More advanced NLP
a. Probabilistic models
b. Sequence models
c. Bonus: Attention models and speech to text
Timeline
SS1: Up to 2.a (Text processing fundamentals)
SS2: 2.a (Text processing fundamentals)
SS3: 2.b (Word/sentence embedding)
SS4: 2.c (Topic modeling & Document classification)
SS5: Up to 2.d (Sentiment Analysis)
SS6: Up to 3.b (Sequence models)
SS7: Up to 3.c (Attention models and speech to text)
SS8: Final project
Session 1
--
01
Objectives and
Workplan
Some definitions
● Text mining: Process of extracting information from textual data
● NLP: Field of AI gathering a set of techniques whose purpose is to understand,
generate, and translate human language, whether in written or spoken form
So what's the difference ?
NB: It's easier to deal with text data than audio data
Examples of
NLP Use Cases
● Text classification: Classify texts into predefined subjects.
Ex: online products, questions on Stack Overflow
● Automatic text summary: Given a large text, summarize it and return its main ideas.
Ex: executive summary for managers
● Chat bots: The ability to discuss with a machine (useful for customer advisors).
Ex: SNCF virtual assistant: https://bot.tgvinoui.sncf/
"ChatGPT: Optimizing
Language Models
for Dialogue"
"We’ve trained a model called
ChatGPT which interacts in a
conversational way. The
dialogue format makes it
possible for ChatGPT to answer
followup questions, admit its
mistakes, challenge incorrect
premises, and reject
inappropriate requests."
https://chat.openai.com/chat
Objectives, Evaluations &
rules
Objectives
• Get essential skills for getting information from text data
• Topic modeling, classification, sentiment analysis
• Apply pre-trained models on text for a specific task
Evaluations
• Small exercises during sessions + possible homeworks
• Final project
Rules
• AVOID CHATGPT or any other LLM: try by yourself first
• No foolish questions: we learn from mistakes
• Practice makes perfect
Workspace
Setup
• Prerequisites :
• Basic ML courses
• Python programming
• Materials : PC or laptop
• Software: Browser + Google account => Google Colab
https://colab.research.google.com/
02
Traditional text
analysis
Text processing fundamentals :
Handling text data in Python
● Structure of text data :
○ Strings + characters => Words / tokens => Sentences => Documents
● Decompose a text :
>> text = "The struggle to deliver on promises to provide Leopard 2 tanks for use
against Russian forces has exposed just how unprepared European militaries
are"
>> text.split(' ')
['The', 'struggle', 'to', 'deliver', 'on', 'promises', 'to', 'provide', 'Leopard', '2', 'tanks',
'for', 'use', 'against', 'Russian', 'forces', 'has', 'exposed', 'just', 'how', 'unprepared',
'European', 'militaries', 'are']
Text processing fundamentals :
Handling text data in Python
● Operations on string/words in Python (suppose "s" is a string value) :
○ len(s)
○ s.upper(); s.lower()
○ s.split(); s.strip()
○ s.find(); s.replace()
○ s.startswith(); s.endswith(); s.isupper(); s.isalpha(); s.isdigit()
● Practice : Take a sentence and use these functions to :
○ Split into words, lowercase first word and uppercase everything else
○ Count the number of words, words starting with "a", words ending with "e" and count digits
○ Find the positions of letter o's first and last occurrence
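A possible solution sketch for the practice above, using only the built-in string methods listed on this slide (the sample sentence is an arbitrary choice):

```python
# Sample sentence (any sentence works; results below depend on this choice).
s = "An apple a day keeps the doctor away since ages"

# Split into words, lowercase the first word, uppercase everything else.
words = s.split()
styled = [words[0].lower()] + [w.upper() for w in words[1:]]

# Count words, words starting with "a", words ending with "e", and digits.
n_words = len(words)
n_start_a = len([w for w in words if w.lower().startswith("a")])
n_end_e = len([w for w in words if w.lower().endswith("e")])
n_digits = len([c for c in s if c.isdigit()])

# Positions of the first and last occurrence of the letter "o".
first_o = s.find("o")
last_o = s.rfind("o")
```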
Text processing fundamentals :
Regular expressions
Regular Expression (Regex) : special sequence of characters that helps you match or
find other strings or sets of strings using a specialized syntax.
Regular expressions are used for searching, manipulating, and editing text based on
patterns.
For simplification: a regex is a description of a string using "metacharacters"
Most common regular expressions :
● "." : Matches any character except a newline.
● "^" : Matches the beginning of the string.
● "$" : Matches the end of the string.
● "*" : Matches 0 or more repetitions of the preceding pattern.
● "+" : Matches 1 or more repetitions of the preceding pattern.
● "?" : Matches 0 or 1 repetition of the preceding pattern.
● "[ ]": A character set that matches any one of the characters inside the brackets.
● "\d" : Matches any digit (equivalent to [0-9]).
● "\w" : Matches any word character (alphanumeric + underscore).
● "\s" : Matches any whitespace character (spaces, tabs, newlines).
Text processing fundamentals :
Regular expressions
Metacharacter Meaning Example
^ Starts with "^a"
$ Ends with "final$"
. Any character ".anguage"
* Zero or more occurrences ".*is in Paris"
[] A set of characters "[a-z]"; "[0-9]"
| Either, or "(B|b)ook"
● More REGEX in Python : https://docs.python.org/3/library/re.html
● Practice :
○ Give a regex to match "simple" emails
○ Write a regex expressing a timestamp in format "YYYY-MM-dd HH:mm:ss.SSS" => "\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}"
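The two practice patterns can be checked with Python's standard re module. The sample strings below are arbitrary, and the email regex is deliberately "simple" as the practice asks; real-world email validation is far more involved:

```python
import re

# Timestamp pattern from the practice: "YYYY-MM-dd HH:mm:ss.SSS".
ts_pattern = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}"

ts_ok = re.fullmatch(ts_pattern, "2024-01-31 12:05:59.123") is not None
ts_ko = re.fullmatch(ts_pattern, "2024-1-31 12:05:59") is not None  # malformed

# A "simple" email sketch: word characters, dots and dashes around "@".
email_pattern = r"[\w.-]+@[\w-]+\.\w+"
email_ok = re.fullmatch(email_pattern, "jane.doe@example.com") is not None
```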
Text processing fundamentals :
Regular expressions
Usage of REGEX : Find expressions, extract, replace
Useful functions in Python library re :
● re.findall : Returns a list containing all matches
● re.search : Returns a Match object if there is a match anywhere in the string
● re.split : Returns a list where the string has been split at each match
● re.sub : Replaces one or many matches with a string
More examples : https://studymachinelearning.com/regular-expression-for-natural-language-processing/
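A quick tour of the four re functions listed above, on a made-up sentence:

```python
import re

text = "AI is everywhere: AI in phones, AI in cars."

matches = re.findall(r"AI", text)   # every match as a list of strings
m = re.search(r"AI", text)          # Match object for the first occurrence
parts = re.split(r"AI", text)       # split the string at each match
replaced = re.sub(r"AI", "Artificial Intelligence", text)  # replace matches
```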
Practice : Copy a press article on Artificial Intelligence (Ex : https://digital-
strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence )
● Find all occurrences of the terms "Artificial Intelligence" and "AI"
● Print position of the first occurrence of "intelligence"
● Replace "Artificial Intelligence" by "AI" and split the text using the string "AI"
● Find all words starting with "a" and ending with "s"
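A sketch for this practice on a short stand-in text (to do the real exercise, paste the press article into the `article` variable yourself):

```python
import re

article = ("Artificial Intelligence is reshaping Europe. The AI Act aims to "
           "regulate artificial intelligence and its applications across areas.")

# All occurrences of "Artificial Intelligence" and "AI" (word-bounded,
# case-insensitive so lowercase mentions are caught too).
terms = re.findall(r"\bArtificial Intelligence\b|\bAI\b", article,
                   flags=re.IGNORECASE)

# Position of the first occurrence of "intelligence" (case-insensitive).
first = article.lower().find("intelligence")

# Replace "Artificial Intelligence" by "AI", then split on "AI".
unified = re.sub(r"Artificial Intelligence", "AI", article, flags=re.IGNORECASE)
chunks = unified.split("AI")

# Words starting with "a" and ending with "s".
a_words = re.findall(r"\b[Aa][a-z]*s\b", article)
```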
Session 2
--
Text processing fundamentals :
Tokenization
Definition :
Process of splitting a text or a sentence into smaller chunks (tokens).
Text processing fundamentals :
Tokenization
Practice :
With 3 Python frameworks (nltk, spacy, gensim), apply word and sentence
tokenization to the following text:
"The most visible advances have been in what’s called “natural language processing”
(NLP), the branch of AI focused on how computers can process language like humans
do. It has been used to write an article for The Guardian, and AI-authored blog
posts have gone viral — feats that weren’t possible a few years ago. AI even excels at
cognitive tasks like programming where it is able to generate programs for simple
video games from human instructions."
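nltk, spacy and gensim each ship ready-made tokenizers (e.g. nltk.word_tokenize, nltk.sent_tokenize). As a dependency-free sketch of the principle, a simple regex can already approximate word and sentence tokenization; note it will miss edge cases such as "Mr." or "U.S." that the libraries handle:

```python
import re

text = ("Natural language processing is a branch of AI. "
        "It has been used to write articles. AI even excels at programming.")

# Word tokens: runs of alphanumeric characters.
word_tokens = re.findall(r"\w+", text)

# Sentence tokens: split after terminal punctuation followed by whitespace.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text.strip())
```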
Text processing fundamentals :
Stopwords, special characters removal
Stopwords:
Words that do not add special meaning to the sentence. They can be connecting
terms (conjunctions), articles ...
List of stopwords using nltk in Python :
Practice : Take the previous text, apply tokenization and
remove stopwords using nltk.
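With nltk, the full English list comes from nltk.corpus.stopwords.words('english') after running nltk.download('stopwords'). The self-contained sketch below uses a tiny hand-picked subset of stopwords (my own choice) to illustrate the filtering step:

```python
# Illustrative stopword subset; nltk's real English list has ~180 entries.
stop_words = {"the", "a", "an", "of", "to", "in", "is", "it", "has", "been"}

# Tokens as produced by a tokenization step (taken from the ChatGPT quote).
tokens = ["the", "dialogue", "format", "makes", "it", "possible",
          "to", "answer", "followup", "questions"]

# Keep only tokens that are not stopwords (case-insensitive comparison).
filtered = [t for t in tokens if t.lower() not in stop_words]
```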
Text processing fundamentals :
Stopwords, special characters removal
Special characters:
Symbols used for punctuation, comparisons, etc.
List of special characters with Python :
Practice : Take a new text with special characters, apply tokenization and
remove stopwords and special characters.
Hint : look at the isalpha() method for filtering
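Following the isalpha() hint above: keeping only purely alphabetic tokens drops punctuation tokens and numbers in one pass. The token list is a made-up example; string.punctuation gives the standard set of special characters:

```python
import string

# Tokens as a tokenizer might produce them, punctuation included.
tokens = ["Hello", ",", "world", "!", "42", "it's", "great"]

# Keep only purely alphabetic tokens (drops punctuation, digits, and
# tokens containing an apostrophe).
alpha_only = [t for t in tokens if t.isalpha()]

# Alternative: drop only tokens that are a single special character.
punct = set(string.punctuation)
no_punct = [t for t in tokens if t not in punct]
```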
Text processing fundamentals :
Part Of Speech (POS) tagging
Definition
● POS tagging means identifying different components of a sentence using word
classes : nouns, verbs, adjectives, prepositions etc.
● To identify each class, underlying techniques are : using a dictionary or a
predictive model (a Markov chain, for example).
● Example of classes :
● CC: Coordinating conjunction
● CD: Cardinal digit
● DT: Determiner
● JJ: Adjective
● NN: Singular noun
● VBZ: Verb, 3rd person singular present
Text processing fundamentals :
Part Of Speech (POS) tagging
POS Tagging in Python (nltk)
Definition on a particular class :
Practice : Take a sentence and filter it to keep only verbs
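With nltk, nltk.pos_tag(tokens) returns (word, tag) pairs after downloading the 'averaged_perceptron_tagger' data. As a self-contained sketch, here is the dictionary-based approach mentioned on the previous slide, with a toy lexicon of my own, applied to the practice (keep only verbs):

```python
# Toy tag lexicon (illustrative only; real taggers also use context).
lexicon = {"the": "DT", "cat": "NN", "eats": "VBZ", "a": "DT",
           "small": "JJ", "mouse": "NN", "and": "CC", "runs": "VBZ"}

tokens = ["the", "cat", "eats", "a", "small", "mouse", "and", "runs"]

# Tag each token; default unknown words to NN as a crude fallback.
tagged = [(t, lexicon.get(t, "NN")) for t in tokens]

# Practice: keep only verbs (tags starting with "VB").
verbs = [t for t, tag in tagged if tag.startswith("VB")]
```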
Text processing fundamentals :
Named Entity Recognition (NER)
Definition
It goes beyond POS tagging : the objective is to recognize entities like names of
persons, organizations, locations, expressions of times, quantities, monetary values,
percentages
Use cases :
● Find person names in a document (authors, client complaints...)
● Companies mentioned in an article, their location
● Amounts in a contract or commercial document (to parse these documents
automatically)
Text processing fundamentals :
Named Entity Recognition (NER)
NER in Python (spacy)
Text processing fundamentals :
Named Entity Recognition (NER)
Practice : Take a rental advertisement from pap.fr (ex :
https://www.pap.fr/annonces/appartement-montmagny-95360-r443100759), and list
all the following items :
● Locations
● Organizations
● Prices
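spaCy's statistical NER (iterating over nlp(text).ents with a pretrained model such as en_core_web_sm) covers persons, organizations and locations; prices can also be extracted with a plain regex, as in this self-contained sketch for the last item (the ad text is invented):

```python
import re

# Stand-in rental ad; copy the real pap.fr listing text here yourself.
ad = "Bright flat near Montmagny station, rent 980 € per month, deposit 1 960 €."

# Prices: digits, optionally grouped in thousands by a space, then "€".
prices = re.findall(r"\d+(?: \d{3})*\s*€", ad)
```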
Text processing fundamentals :
Stemming and Lemmatization
Principle: Reduce a word to its root form
● Stemming => Depending on the type of stemmer, it returns a root common to
multiple variations of words from the same family. Most popular stemmers
: Porter, Snowball, and Lancaster
○ Example : "universal", "universities", "universes" can all be reduced down
to "univers" (which doesn't exist as a word)
● Lemmatization => Replace a word by its grammatical root (lemma).
○ Example : "universal", "universities", "universes" are reduced to "universal",
"university", "universe"
Text processing fundamentals :
Stemming and Lemmatization
Stemming and lemmatization in Python
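nltk exposes the stemmers named above (e.g. nltk.stem.PorterStemmer().stem(word)), and lemmatization via nltk's WordNetLemmatizer or spaCy's token.lemma_. As a dependency-free sketch in the same spirit, here is a naive suffix-stripping stemmer; the real Porter algorithm applies ordered, conditional rewrite rules rather than this single pass:

```python
def naive_stem(word):
    # Strip one common English suffix, longest candidates first, keeping
    # at least 3 characters of stem (illustrative only, not Porter's rules).
    for suffix in ("ities", "ves", "ing", "ies", "es", "al", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# The slide's example family all collapses to the non-word root "univers".
stems = [naive_stem(w) for w in ["universal", "universities", "universes"]]
```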
Assignment 1 Course materials :
https://1drv.ms/f/c/ab584826bfa6bf60/Eg2mh5dZmG9GhpqkpXrHrv
oBvSWsN0pge7neZT9LYACIFg?e=xo4rp4