
NLP Pipeline

By Ivan Wong
Generic NLP pipeline
Data Acquisition
• Data is the heart of any ML system
• In an ideal setting, we’ll have the required datasets with
thousands—maybe even millions—of data points.
• Use a public dataset
• https://github.com/niderhoff/nlp-datasets
• https://datasetsearch.research.google.com/
• Scrape data
• We could find a source of relevant data on the internet—for example, a
consumer or discussion forum where people have posted queries (sales
or support).
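• For the scraping route, a minimal sketch with requests and Beautiful Soup is shown below; the forum URL and the post-body CSS class are hypothetical placeholders, so adapt them to the actual site's markup (and check its terms of service first):

import requests
from bs4 import BeautifulSoup

url = "https://forum.example.com/support"  # hypothetical forum page
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every post body on the page.
posts = [div.get_text(strip=True) for div in soup.find_all("div", class_="post-body")]
print(posts[:3])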
Data Acquisition
• Product intervention
• In most industrial settings, AI models seldom exist by themselves.
They’re developed mostly to serve users via a feature or product.
• Data augmentation
• While instrumenting products is a great way to collect data, it
takes time.
• NLP offers a number of techniques, such as synonym replacement and back translation, through which we can take a small dataset and create more data from it.
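• One common trick is synonym replacement. A minimal sketch using NLTK's WordNet (real augmentation pipelines are more careful about part of speech and context):

import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def augment(sentence):
    """Replace one randomly chosen word with a WordNet synonym, if any."""
    words = sentence.split()
    idx = random.randrange(len(words))
    synsets = wordnet.synsets(words[idx])
    if synsets:
        words[idx] = synsets[0].lemmas()[0].name().replace("_", " ")
    return " ".join(words)

print(augment("the delivery was quick and the support team was helpful"))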
Text Extraction and Cleanup
• Text extraction and cleanup refers to the process of extracting raw
text from the input data by removing all the other non-textual
information, such as markup, metadata, etc., and converting the
text to the required encoding format.
• Text extraction is a standard data-wrangling step, and we
don’t usually employ any NLP-specific techniques during
this process.
Text Extraction and Cleanup
• HTML Parsing and Cleanup
• Beautiful Soup and Scrapy
• Unicode Normalization
• Spelling Correction
• Bing Spell Check API
• https://pypi.org/project/pyspellchecker/
• System-Specific Error Correction
• OCR Error
Pre-Processing
• Here are some common pre-processing steps used in NLP
software:
• Preliminaries
• Sentence segmentation and word tokenization.
• Frequent steps
• Stop word removal, stemming and lemmatization, removing digits/punctuation,
lowercasing, etc.
• Other steps
• Normalization, language detection, code mixing, transliteration, etc.
• Advanced processing
• POS tagging, parsing, coreference resolution, etc.
Preliminaries
• Sentence segmentation
• Most NLP libraries come with some form of sentence and word splitting
implemented.
• A commonly used library is the Natural Language Toolkit (NLTK)
• Word tokenization
• Similar to sentence segmentation, word tokenization splits a sentence
into words.
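• A minimal sketch of both steps with NLTK:

import nltk
nltk.download("punkt", quiet=True)

text = "NLP is fun. It also takes work!"
sentences = nltk.sent_tokenize(text)                 # sentence segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]  # word tokenization
print(sentences)  # ['NLP is fun.', 'It also takes work!']
print(tokens)     # [['NLP', 'is', 'fun', '.'], ['It', 'also', 'takes', 'work', '!']]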
Stop Word Removal
• Some of the most frequently used words in English, such as a, an, the, of, in, etc., are not particularly useful for tasks like text classification, as they carry little content of their own to help separate one category of text from another.
• Such words are called stop words and are typically (though not
always) removed from further analysis in such problem scenarios.
• There is no standard list of stop words for English, though.
• There are some popular lists (NLTK has one, for example), although what a
stop word is can vary depending on what we’re working on.
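• A minimal sketch using NLTK's English stop word list:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

tokens = ["the", "price", "of", "the", "phone", "is", "too", "high"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['price', 'phone', 'high']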
Stemming and lemmatization
• Stemming refers to the process of removing suffixes and reducing
a word to some base form such that all different variants of that
word can be represented by the same form (e.g., “car” and “cars”
are both reduced to “car”).
• Porter Stemmer
• Lemmatization is the process of mapping all the different forms of
a word to its base word, or lemma.
• While this seems close to the definition of stemming, they are, in fact,
different.
• For example, the adjective “better,” when stemmed, remains the same. However, upon lemmatization, it becomes “good.”
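• A minimal sketch contrasting the Porter stemmer with NLTK's WordNet lemmatizer:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("cars"))                     # car
print(stemmer.stem("better"))                   # better (unchanged)
print(lemmatizer.lemmatize("better", pos="a"))  # good ("a" marks an adjective)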
Pre-processing
• Note that these are the more common pre-processing steps, but they're by no means exhaustive. Depending on the nature of the data, some additional pre-processing steps may be important. Let's take a look at a few of those steps.
Other Pre-Processing Steps
• Text normalization
• A word can be spelled in different ways, including in shortened forms, a
phone number can be written in different formats (e.g., with and without
hyphens), names are sometimes in lowercase, and so on.
• Language detection
• Code mixing (switching between languages within one text) and transliteration (writing one language in another language's script)
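• A minimal language-detection sketch using the langdetect package (an assumption here; langid or fastText's language-ID model would work too):

from langdetect import DetectorFactory, detect  # pip install langdetect

DetectorFactory.seed = 0  # langdetect is probabilistic; fix the seed for repeatable results
print(detect("This phone has a great camera"))               # 'en'
print(detect("Ce téléphone a un excellent appareil photo"))  # 'fr'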
Advanced Processing
• Imagine we’re asked to develop a system to identify person and
organization names in our company’s collection of one million
documents.
• The common pre-processing steps we discussed earlier may not
be relevant in this context.
• Identifying names requires POS tagging, since recognizing proper nouns helps in spotting person and organization names.
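• A minimal sketch: POS tagging with NLTK, then spaCy's pre-trained named-entity recognizer, which labels PERSON and ORG spans directly (assumes the en_core_web_sm model is installed):

import nltk
import spacy

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Satya Nadella leads Microsoft"
print(nltk.pos_tag(nltk.word_tokenize(text)))  # proper nouns are tagged NNP

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
print([(ent.text, ent.label_) for ent in nlp(text).ents])
# e.g., [('Satya Nadella', 'PERSON'), ('Microsoft', 'ORG')]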
Advanced Processing
• What we’ve seen so far in this section are some of the most
common pre-processing steps in a pipeline.
• They’re all available as pre-trained, usable models in different NLP
libraries.
• Apart from these, additional, customized pre-processing may be
necessary, depending on the application.
• For example, consider a case where we’re asked to mine the social media
sentiment on our product.
• We start by collecting data from, say, Twitter, and quickly realize there are
tweets that are not in English.
• In such cases, we may also need a language-detection step before doing
anything else.
Advanced Processing
• Advanced pre-processing steps on a blob of text (figure)
Advanced Processing
• Coreference resolution (figure): determining which mentions in a text refer to the same entity, e.g., linking a pronoun back to the name it stands for
Feature Engineering
• When we use ML methods to perform our modeling step later,
we’ll still need a way to feed this pre-processed text into an ML
algorithm.
• Feature engineering refers to the set of methods that will
accomplish this task.
• It’s also referred to as feature extraction.
• The goal of feature engineering is to capture the characteristics of
the text into a numeric vector that can be understood by the ML
algorithms.
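• A minimal sketch of one classic scheme, bag of words, using scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the phone is good", "the battery is bad"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # one count vector per document
print(vectorizer.get_feature_names_out())
print(X.toarray())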
Classical NLP/ML Pipeline
• Feature engineering is an integral step in any ML pipeline.
• Feature engineering steps convert the raw data into a format that can be consumed by a machine.
• These transformation functions are usually handcrafted in the classical ML pipeline, aligned with the task at hand.
DL Pipeline
• In the DL pipeline, the raw data (after pre-processing) is
directly fed to a model.
• The model is capable of “learning” features from the
data.
• Hence, these features are more in line with the task at
hand, so they generally give improved performance.
• But, since all these features are learned via model
parameters, the model loses interpretability.
• It’s very hard to explain a DL model’s prediction, which is
a disadvantage in a business-driven use case.
Modeling
• Early versions of a system often rely on simple heuristics, but we need a system that's easier to maintain as it matures.
• Further, as we collect more data, our ML model starts beating pure heuristics.
• At that point, a common practice is to combine heuristics directly or indirectly with the ML model, in one of two ways:
• Create a feature from the heuristic for your ML model
• For instance, in the email spam-classification example, we can add
features, such as the number of words from the blacklist in a given
email or the email bounce rate, to the ML model.
• Pre-process your input to the ML model
• For instance, if for certain words in an email, there’s a 99% chance that
it’s spam, then it’s best to classify that email as spam instead of sending
it to an ML model.
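• A minimal sketch of the first option for the spam example; the blacklist and the bounce-rate value are hypothetical placeholders:

BLACKLIST = {"lottery", "winner", "prize"}  # hypothetical spam word list

def heuristic_features(email_text, bounce_rate):
    """Turn spam heuristics into numeric features an ML model can consume."""
    tokens = email_text.lower().split()
    return {
        "blacklist_hits": sum(t in BLACKLIST for t in tokens),
        "bounce_rate": bounce_rate,  # e.g., taken from mail-server logs
    }

print(heuristic_features("You are a lottery winner", bounce_rate=0.42))
# {'blacklist_hits': 2, 'bounce_rate': 0.42}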
Modeling
• We have NLP service providers, such as Google Cloud Natural
Language [44], Amazon Comprehend [45], Microsoft Azure
Cognitive Services [46], and IBM Watson Natural Language
Understanding [47], which provide off-the-shelf APIs to solve
various NLP tasks.
• Once you’re comfortable that the task is feasible and conclude
that the off-the-shelf models give reasonable results, you can
move toward building custom ML models and improving them.
Evaluation
• A key step in the NLP pipeline is to measure how good the model we’ve
built is.
• Evaluations are of two types: intrinsic and extrinsic.
• Intrinsic focuses on intermediary objectives, while extrinsic focuses on
evaluating performance on the final objective.
• For example, consider a spam-classification system. The ML metric will be
precision and recall, while the business metric will be “the amount of time users
spent on a spam email.”
• Intrinsic evaluation will focus on measuring the system performance using
precision and recall.
• Extrinsic evaluation will focus on measuring the time a user wasted because a
spam email went to their inbox or a genuine email went to their spam folder.
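• A minimal sketch of the intrinsic metrics on hypothetical labels (1 = spam), using scikit-learn:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 0, 1, 0, 0]  # hypothetical ground truth
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical model output
print(precision_score(y_true, y_pred))  # 2/3 of the emails flagged as spam really were spam
print(recall_score(y_true, y_pred))     # 2/3 of the actual spam was caught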
Post-Modeling Phases
• Deployment entails plugging the NLP module into the broader
system.
• Once we’re happy with one final solution, it needs to be deployed in a
production environment as a part of a larger system.
• Monitoring for NLP projects and models has to be handled differently from a regular engineering project, as we need to ensure that the outputs our models produce each day continue to make sense.
• Model updating is necessary because once the model is deployed and we start gathering new data, we'll iterate on the model with this new data so that its predictions stay current.
