Lect02
By Ivan Wong
Generic NLP pipeline
Data Acquisition
• Data is the heart of any ML system
• In an ideal setting, we’ll have the required datasets with
thousands—maybe even millions—of data points.
• Use a public dataset
• https://github.com/niderhoff/nlp-datasets
• https://datasetsearch.research.google.com/
• Scrape data
• We could find a source of relevant data on the internet—for example, a
consumer or discussion forum where people have posted queries (sales
or support).
Data Acquisition
• Product intervention
• In most industrial settings, AI models seldom exist by themselves.
They’re developed mostly to serve users via a feature or product.
• Data augmentation
• While instrumenting products is a great way to collect data, it
takes time.
• NLP has a bunch of techniques through which we can take a small
dataset and use some tricks to create more data.
Text Extraction and Cleanup
• Text extraction and cleanup refers to the process of extracting raw
text from the input data by removing all the other non-textual
information, such as markup, metadata, etc., and converting the
text to the required encoding format.
• Text extraction is a standard data-wrangling step, and we
don’t usually employ any NLP-specific techniques during
this process.
Text Extraction and Cleanup
• HTML Parsing and Cleanup
• Beautiful Soup and Scrapy
• Unicode Normalization
• Spelling Correction
• Bing Spell Check API
• https://pypi.org/project/pyspellchecker/
• System-Specific Error Correction
• OCR Error
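The slide names Beautiful Soup and Scrapy for HTML parsing; as a minimal illustration of the same idea using only the standard library, here is a sketch that strips tags (and skips `script`/`style` blocks) to recover the raw text. The names `TextExtractor` and `extract_text` are hypothetical, not from any library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content, skipping tags and script/style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(extract_text("<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"))
# Title Hello world
```

In practice, Beautiful Soup's `get_text()` does this more robustly and also handles malformed markup.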
Pre-Processing
• Here are some common pre-processing steps used in NLP
software:
• Preliminaries
• Sentence segmentation and word tokenization.
• Frequent steps
• Stop word removal, stemming and lemmatization, removing digits/punctuation,
lowercasing, etc.
• Other steps
• Normalization, language detection, code mixing, transliteration, etc.
• Advanced processing
• POS tagging, parsing, coreference resolution, etc.
Preliminaries
• Sentence segmentation
• Most NLP libraries come with some form of sentence and word splitting
implemented.
• A commonly used library is the Natural Language Toolkit (NLTK)
• Word tokenization
• Similar to sentence segmentation, word tokenization splits a sentence
into words.
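As a rough sketch of both preliminary steps, here is a naive rule-based segmenter and tokenizer in pure Python. Real pipelines would use NLTK's `sent_tokenize` and `word_tokenize`, which handle abbreviations and other edge cases these regexes ignore; the function names below are illustrative only.

```python
import re

def segment_sentences(text):
    # Naive rule: split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Split a sentence into word tokens and punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

sents = segment_sentences("NLP is fun. Tokenize me!")
print(sents)              # ['NLP is fun.', 'Tokenize me!']
print(tokenize(sents[0]))  # ['NLP', 'is', 'fun', '.']
```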
Stop Word Removal
• Some frequently used English words, such as a, an, the, of, in, etc., are not particularly useful for a classification task, as they carry no content of their own to help distinguish between the categories.
• Such words are called stop words and are typically (though not
always) removed from further analysis in such problem scenarios.
• There is no standard list of stop words for English, though.
• There are some popular lists (NLTK has one, for example), although what a
stop word is can vary depending on what we’re working on.
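A minimal sketch of stop word removal, using a tiny hand-picked list for illustration; in practice one would start from a larger list such as NLTK's `stopwords.words("english")` and adjust it for the task at hand.

```python
# A tiny illustrative stop-word list; real lists are much larger.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "are", "to"}

def remove_stop_words(tokens):
    # Case-insensitive filtering against the stop-word list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "sat", "in", "the", "hat"]))
# ['cat', 'sat', 'hat']
```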
Stemming and lemmatization
• Stemming refers to the process of removing suffixes and reducing
a word to some base form such that all different variants of that
word can be represented by the same form (e.g., “car” and “cars”
are both reduced to “car”).
• Porter Stemmer
• Lemmatization is the process of mapping all the different forms of
a word to its base word, or lemma.
• While this seems close to the definition of stemming, they are, in fact,
different.
• For example, the adjective “better,” when stemmed, remains the same. However, upon lemmatization, this should become “good.”
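To make the stemming/lemmatization contrast concrete, here is a toy suffix-stripping stemmer, nothing like the real Porter algorithm, which applies several ordered rule phases via NLTK's `PorterStemmer`. The function name is hypothetical.

```python
def naive_stem(word):
    # Toy suffix stripping: drop a known suffix if a stem of
    # at least 3 characters remains.
    for suffix in ("ing", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("cars"))    # car
print(naive_stem("better"))  # better -- stemming leaves it unchanged;
                             # a lemmatizer would map it to "good"
```

The second call shows exactly the point made above: no suffix rule turns “better” into “good”; that mapping needs a dictionary-backed lemmatizer such as NLTK's `WordNetLemmatizer`.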
Pre-processing
• Note that these are the more common pre-processing steps, but they’re by no means exhaustive. Depending on the nature of the data, some additional pre-processing steps may be important. Let’s take a look at a few of those steps.
Other Pre-Processing Steps
• Text normalization
• A word can be spelled in different ways, including shortened forms; a phone number can be written in different formats (e.g., with and without hyphens); names are sometimes in lowercase; and so on.
• Language detection
• Code mixing and transliteration
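As one narrow illustration of text normalization, here is a sketch that canonicalizes the slide's phone-number example so that different written forms compare equal. The function name is hypothetical; real normalization also covers spelling variants, casing, and more, and language detection is typically handled by a dedicated library.

```python
import re

def normalize_phone(raw):
    # Strip every non-digit character so all formats reduce
    # to the same canonical string.
    return re.sub(r"\D", "", raw)

print(normalize_phone("(123) 456-7890"))  # 1234567890
print(normalize_phone("123.456.7890"))    # 1234567890
```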
Advanced Processing
• Imagine we’re asked to develop a system to identify person and
organization names in our company’s collection of one million
documents.
• The common pre-processing steps we discussed earlier may not
be relevant in this context.
• Identifying names requires us to be able to do POS tagging, as
identifying proper nouns can be useful in identifying person and
organization names.
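A minimal sketch of the idea: once a POS tagger (e.g., `nltk.pos_tag`) has labeled the tokens, runs of consecutive proper nouns (tag `NNP`) become candidate person/organization names. The function name and the pre-tagged input below are illustrative assumptions.

```python
def candidate_names(tagged):
    """Group consecutive proper-noun (NNP) tokens into candidate names.
    `tagged` is a list of (word, POS) pairs, as produced by a tagger."""
    names, current = [], []
    for word, tag in tagged:
        if tag == "NNP":
            current.append(word)
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

tagged = [("Satya", "NNP"), ("Nadella", "NNP"), ("leads", "VBZ"),
          ("Microsoft", "NNP"), (".", ".")]
print(candidate_names(tagged))  # ['Satya Nadella', 'Microsoft']
```

Real named-entity recognition goes well beyond this heuristic, but it shows why POS tags are a useful signal for the task.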
Advanced Processing
• What we’ve seen so far in this section are some of the most
common pre-processing steps in a pipeline.
• They’re all available as pre-trained, usable models in different NLP
libraries.
• Apart from these, additional, customized pre-processing may be
necessary, depending on the application.
• For example, consider a case where we’re asked to mine the social media
sentiment on our product.
• We start by collecting data from, say, Twitter, and quickly realize there are
tweets that are not in English.
• In such cases, we may also need a language-detection step before doing
anything else.
Advanced Processing
• Advanced pre-processing steps on a blob of text
Advanced Processing
• Coreference Resolution
Feature Engineering
• When we use ML methods to perform our modeling step later,
we’ll still need a way to feed this pre-processed text into an ML
algorithm.
• Feature engineering refers to the set of methods that will
accomplish this task.
• It’s also referred to as feature extraction.
• The goal of feature engineering is to capture the characteristics of
the text into a numeric vector that can be understood by the ML
algorithms.
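A classic way to turn text into such a numeric vector is a bag-of-words representation. Here is a minimal pure-Python sketch; in practice scikit-learn's `CountVectorizer` does this. The function names are hypothetical.

```python
from collections import Counter

def build_vocab(corpus):
    # Map each distinct lowercase token to a fixed vector index.
    vocab = sorted({tok for doc in corpus for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}

def bow_vector(doc, vocab):
    # Count each known token; tokens outside the vocabulary are ignored.
    counts = Counter(doc.lower().split())
    return [counts.get(tok, 0) for tok in vocab]

corpus = ["dog bites man", "man bites dog"]
vocab = build_vocab(corpus)       # {'bites': 0, 'dog': 1, 'man': 2}
print(bow_vector("dog bites man", vocab))  # [1, 1, 1]
```

Note that both documents in the corpus get the same vector here: bag-of-words discards word order, which is one of its well-known limitations.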
Classical NLP/ML Pipeline