Lect02
By Ivan Wong
Generic NLP pipeline
Data Acquisition
• Data is the heart of any ML system
• In an ideal setting, we’ll have the required datasets with
thousands—maybe even millions—of data points.
• Use a public dataset
• https://github.com/niderhoff/nlp-datasets
• https://datasetsearch.research.google.com/
• Scrape data
• We could find a source of relevant data on the internet—for example, a
consumer or discussion forum where people have posted queries (sales
or support).
Data Acquisition
• Product intervention
• In most industrial settings, AI models seldom exist by themselves.
They’re developed mostly to serve users via a feature or product.
• Data augmentation
• While instrumenting products is a great way to collect data, it
takes time.
• NLP has a bunch of techniques through which we can take a small
dataset and use some tricks to create more data.
Text Extraction and Cleanup
• Text extraction and cleanup refers to the process of extracting raw
text from the input data by removing all the other non-textual
information, such as markup, metadata, etc., and converting the
text to the required encoding format.
• Text extraction is a standard data-wrangling step, and we
don’t usually employ any NLP-specific techniques during
this process.
Text Extraction and Cleanup
• HTML Parsing and Cleanup
• Beautiful Soup and Scrapy
• Unicode Normalization
• Spelling Correction
• Bing Spell Check API
• https://pypi.org/project/pyspellchecker/
• System-Specific Error Correction
• OCR Error
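The slide names Beautiful Soup and Scrapy for HTML parsing; as a minimal illustration of the same idea using only the standard library, here is a sketch that strips tags (and skips `script`/`style` blocks) to recover the raw text. The names `TextExtractor` and `extract_text` are hypothetical, not from any library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content, skipping tags and script/style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(extract_text("<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"))
# Title Hello world
```

In practice, Beautiful Soup's `get_text()` does this more robustly and also handles malformed markup.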
Pre-Processing
• Here are some common pre-processing steps used in NLP
software:
• Preliminaries
• Sentence segmentation and word tokenization.
• Frequent steps
• Stop word removal, stemming and lemmatization, removing digits/punctuation,
lowercasing, etc.
• Other steps
• Normalization, language detection, code mixing, transliteration, etc.
• Advanced processing
• POS tagging, parsing, coreference resolution, etc.
Preliminaries
• Sentence segmentation
• Most NLP libraries come with some form of sentence and word splitting
implemented.
• A commonly used library is the Natural Language Toolkit (NLTK)
• Word tokenization
• Similar to sentence segmentation, word tokenization splits a sentence
into words.
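As a rough sketch of both preliminary steps, here is a naive rule-based segmenter and tokenizer in pure Python. Real pipelines would use NLTK's `sent_tokenize` and `word_tokenize`, which handle abbreviations and other edge cases these regexes ignore; the function names below are illustrative only.

```python
import re

def segment_sentences(text):
    # Naive rule: split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Split a sentence into word tokens and punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

sents = segment_sentences("NLP is fun. Tokenize me!")
print(sents)              # ['NLP is fun.', 'Tokenize me!']
print(tokenize(sents[0]))  # ['NLP', 'is', 'fun', '.']
```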
Stop Word Removal
• Some frequently used English words, such as a, an, the, of, in, etc., are not particularly useful for a classification task, as they carry no content of their own to help distinguish between the categories.
• Such words are called stop words and are typically (though not
always) removed from further analysis in such problem scenarios.
• There is no standard list of stop words for English, though.
• There are some popular lists (NLTK has one, for example), although what a
stop word is can vary depending on what we’re working on.
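A minimal sketch of stop word removal, using a tiny hand-picked list for illustration; in practice one would start from a larger list such as NLTK's `stopwords.words("english")` and adjust it for the task at hand.

```python
# A tiny illustrative stop-word list; real lists are much larger.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "are", "to"}

def remove_stop_words(tokens):
    # Case-insensitive filtering against the stop-word list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "sat", "in", "the", "hat"]))
# ['cat', 'sat', 'hat']
```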
Stemming and lemmatization
• Stemming refers to the process of removing suffixes and reducing
a word to some base form such that all different variants of that
word can be represented by the same form (e.g., “car” and “cars”
are both reduced to “car”).
• Porter Stemmer
• Lemmatization is the process of mapping all the different forms of
a word to its base word, or lemma.
• While this seems close to the definition of stemming, they are, in fact,
different.
• For example, the adjective “better,” when stemmed, remains the same. However, upon lemmatization, this should become “good.”
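To make the stemming/lemmatization contrast concrete, here is a toy suffix-stripping stemmer, nothing like the real Porter algorithm, which applies several ordered rule phases via NLTK's `PorterStemmer`. The function name is hypothetical.

```python
def naive_stem(word):
    # Toy suffix stripping: drop a known suffix if a stem of
    # at least 3 characters remains.
    for suffix in ("ing", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("cars"))    # car
print(naive_stem("better"))  # better -- stemming leaves it unchanged;
                             # a lemmatizer would map it to "good"
```

The second call shows exactly the point made above: no suffix rule turns “better” into “good”; that mapping needs a dictionary-backed lemmatizer such as NLTK's `WordNetLemmatizer`.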
Pre-processing
• Note that these are the more common pre-processing steps, but they’re by no means exhaustive. Depending on the nature of the data, some additional pre-processing steps may be important. Let’s take a look at a few of those steps.
Other Pre-Processing Steps
• Text normalization
• A word can be spelled in different ways, including shortened forms; a phone number can be written in different formats (e.g., with and without hyphens); names are sometimes in lowercase; and so on.
• Language detection
• Code mixing and transliteration
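As one narrow illustration of text normalization, here is a sketch that canonicalizes the slide's phone-number example so that different written forms compare equal. The function name is hypothetical; real normalization also covers spelling variants, casing, and more, and language detection is typically handled by a dedicated library.

```python
import re

def normalize_phone(raw):
    # Strip every non-digit character so all formats reduce
    # to the same canonical string.
    return re.sub(r"\D", "", raw)

print(normalize_phone("(123) 456-7890"))  # 1234567890
print(normalize_phone("123.456.7890"))    # 1234567890
```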
Advanced Processing
• Imagine we’re asked to develop a system to identify person and
organization names in our company’s collection of one million
documents.
• The common pre-processing steps we discussed earlier may not
be relevant in this context.
• Identifying names requires us to be able to do POS tagging, as
identifying proper nouns can be useful in identifying person and
organization names.
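A minimal sketch of the idea: once a POS tagger (e.g., `nltk.pos_tag`) has labeled the tokens, runs of consecutive proper nouns (tag `NNP`) become candidate person/organization names. The function name and the pre-tagged input below are illustrative assumptions.

```python
def candidate_names(tagged):
    """Group consecutive proper-noun (NNP) tokens into candidate names.
    `tagged` is a list of (word, POS) pairs, as produced by a tagger."""
    names, current = [], []
    for word, tag in tagged:
        if tag == "NNP":
            current.append(word)
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

tagged = [("Satya", "NNP"), ("Nadella", "NNP"), ("leads", "VBZ"),
          ("Microsoft", "NNP"), (".", ".")]
print(candidate_names(tagged))  # ['Satya Nadella', 'Microsoft']
```

Real named-entity recognition goes well beyond this heuristic, but it shows why POS tags are a useful signal for the task.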
Advanced Processing
• What we’ve seen so far in this section are some of the most
common pre-processing steps in a pipeline.
• They’re all available as pre-trained, usable models in different NLP
libraries.
• Apart from these, additional, customized pre-processing may be
necessary, depending on the application.
• For example, consider a case where we’re asked to mine the social media
sentiment on our product.
• We start by collecting data from, say, Twitter, and quickly realize there are
tweets that are not in English.
• In such cases, we may also need a language-detection step before doing
anything else.
Advanced Processing
• Advanced pre-processing steps on a blob of text
Advanced Processing
• Coreference Resolution
Feature Engineering
• When we use ML methods to perform our modeling step later,
we’ll still need a way to feed this pre-processed text into an ML
algorithm.
• Feature engineering refers to the set of methods that will
accomplish this task.
• It’s also referred to as feature extraction.
• The goal of feature engineering is to capture the characteristics of
the text into a numeric vector that can be understood by the ML
algorithms.
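A classic way to turn text into such a numeric vector is a bag-of-words representation. Here is a minimal pure-Python sketch; in practice scikit-learn's `CountVectorizer` does this. The function names are hypothetical.

```python
from collections import Counter

def build_vocab(corpus):
    # Map each distinct lowercase token to a fixed vector index.
    vocab = sorted({tok for doc in corpus for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}

def bow_vector(doc, vocab):
    # Count each known token; tokens outside the vocabulary are ignored.
    counts = Counter(doc.lower().split())
    return [counts.get(tok, 0) for tok in vocab]

corpus = ["dog bites man", "man bites dog"]
vocab = build_vocab(corpus)       # {'bites': 0, 'dog': 1, 'man': 2}
print(bow_vector("dog bites man", vocab))  # [1, 1, 1]
```

Note that both documents in the corpus get the same vector here: bag-of-words discards word order, which is one of its well-known limitations.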
Classical NLP/ML Pipeline