Tutorial: Text Classification in Python Using spaCy
spaCy is a natural language processing library for Python designed particularly for production use, and it can help us build applications
that process massive volumes of text efficiently. First, let’s take a look at some
of the basic analytical tasks spaCy can handle.
Installing spaCy
We’ll need to install spaCy and its English-language model before proceeding
further. We can do this using the following command line commands:
pip install spacy
python -m spacy download en
We can also use spaCy in a Jupyter Notebook. It’s not one of the pre-installed
libraries that Jupyter includes by default, though, so we’ll need to run these
commands from the notebook to get spaCy installed in the correct Anaconda
directory. Note that we use ! in front of each command to let the Jupyter
notebook know that it should be read as a command line command.
!pip install spacy
!python -m spacy download en
Tokenization
For the examples that follow, we'll work with a short text string: "When learning data science, you shouldn't get discouraged! Challenges and setbacks aren't failures, they're just part of the journey." Tokenization is the process of breaking this text up into smaller pieces, called tokens. There are a couple of different ways we can approach this. The first is
called word tokenization, which means breaking up the text into individual
words. This is a critical step for many language processing applications, as they
often require input in the form of individual words rather than longer strings of
text.
In the code below, we’ll import spaCy and its English-language model, and tell it
that we’ll be doing our natural language processing using that model. Then we’ll
assign our text string to text. Using nlp(text), we’ll process that text
in spaCy and assign the result to a variable called my_doc.
At this point, our text has already been tokenized, but spaCy stores tokenized
text as a doc, and we’d like to look at it in list form, so we’ll create a for loop
that iterates through our doc, adding each word token it finds in our text string
to a list called token_list so that we can take a better look at how words are
tokenized.
# Word tokenization
from spacy.lang.en import English
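# (A minimal sketch completing this step, following the description above; the
#  example text string is reconstructed from the sentences quoted elsewhere in
#  this tutorial.)

# Load the English language class; this gives us a tokenizer
nlp = English()

text = ("When learning data science, you shouldn't get discouraged! "
        "Challenges and setbacks aren't failures, they're just part of the journey.")

# "nlp" object is used to create documents with linguistic annotations
my_doc = nlp(text)

# Create a list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
print(token_list)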
As we can see, spaCy produces a list that contains each token as a separate
item. Notice that it has recognized that contractions such as shouldn’t actually
represent two distinct words, and it has thus broken them down into two distinct
tokens.
First we need to load the language class. In the above example, we load the
English language class using English() and create an nlp object. The “nlp”
object is used to create documents with linguistic annotations and various NLP
properties. After creating the document, we create the token list.
If we want, we can also break the text into sentences rather than words. This is
called sentence tokenization. When performing sentence tokenization, the
tokenizer looks for specific characters that fall between sentences, like periods,
exclamation points, and newline characters. For sentence tokenization, we will
use a preprocessing pipeline because sentence preprocessing
using spaCy includes a tokenizer, a tagger, a parser and an entity recognizer
that we need to access to correctly identify what’s a sentence and what isn’t.
In the code below, spaCy tokenizes the text and creates a Doc object. This Doc
object uses our preprocessing pipeline’s components (tagger, parser, and entity
recognizer) to break the text down into components. From this pipeline we can
extract any component, but here we’re going to access sentence tokens using
the sentencizer component.
# sentence tokenization
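# (A sketch of this step, using the spaCy v2-style pipeline API that matches the
#  "spacy download en" command used earlier; "text" is the same example string
#  as above.)

# Load the English language class
nlp = English()

# Create the 'sentencizer' component and add it to the pipeline
sbd = nlp.create_pipe("sentencizer")
nlp.add_pipe(sbd)

# "nlp" object is used to create a document with linguistic annotations
doc = nlp(text)

# Create a list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)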
spaCy comes with a default list of stopwords: very common words that often carry
little useful information for analysis. The default list includes 312 total entries, and
each entry is a single word. It’s easy to see why many of these words wouldn’t
be useful for data analysis. Transition words like nevertheless, for example,
aren’t necessary for understanding the basic meaning of a sentence. And other
words like somebody are too vague to be of much use for NLP tasks.
If we wanted to, we could also create our own customized list of stopwords. But
for our purposes in this tutorial, the default list that spaCy provides will be fine.
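To take a quick look at that default list ourselves, here's a minimal sketch (the STOP_WORDS set lives in spacy.lang.en.stop_words):

# Inspecting spaCy's default stopword list
from spacy.lang.en.stop_words import STOP_WORDS

print("Number of stop words:", len(STOP_WORDS))
print("First ten stop words:", list(STOP_WORDS)[:10])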
Removing Stopwords from Our Data
Now that we’ve got our list of stopwords, let’s use it to remove the stopwords
from the text string we were working on in the previous section. Our text is
already stored in the variable text, so we don’t need to define that again.
Instead, we’ll create an empty list called filtered_sent and then iterate
through our doc variable to look at each tokenized word from our source
text. spaCy includes a bunch of helpful token attributes, and we’ll use one of
them called is_stop to identify words that aren’t in the stopword list and then
append them to our filtered_sent list.
from spacy.lang.en.stop_words import STOP_WORDS
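# (A sketch of the filtering step described above; "text" is the example string
#  from the tokenization section and "nlp" is the pipeline built there.)

# Create an empty list for the non-stopword tokens
filtered_sent = []

# "nlp" object is used to create a document with linguistic annotations
doc = nlp(text)

# Keep only the tokens that are not stopwords
for word in doc:
    if word.is_stop == False:
        filtered_sent.append(word)

print("Filtered Sentence:", filtered_sent)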
It’s not too difficult to see why removing stopwords can be helpful. It has
boiled our original text down to just a few words that give us a good idea of
what the sentences are discussing: learning data science, and not getting
discouraged by challenges and setbacks along that journey.
Lexicon Normalization
Lexicon normalization is another step in the text data cleaning process. In the
big picture, normalization converts high-dimensional features into lower-dimensional
features that are appropriate for machine learning models. For
our purposes here, we’re only going to look at lemmatization, a way of
processing words that reduces them to their roots.
Lemmatization
Lemmatization is a way of dealing with the fact that while words
like connect, connection, connecting, connected, etc. aren’t exactly the same,
they all have the same essential meaning: connect. The differences in spelling
have grammatical functions in spoken language, but for machine processing,
those differences can be confusing, so we need a way to change all the words
that are forms of the word connect into the word connect itself.
One method for doing this is called stemming. Stemming involves simply
lopping off easily-identified prefixes and suffixes to produce what’s often the
simplest version of a word. Connection, for example, would have the -ion suffix
removed and be correctly reduced to connect. This kind of simple stemming is
often all that’s needed, but lemmatization—which actually looks at words and
their roots (called lemmas) as described in the dictionary—is more precise (as
long as the words exist in the dictionary).
Since spaCy includes a built-in way to break a word down into its lemma, we can
simply use that for lemmatization. In the following very simple example, we’ll
use .lemma_ to produce the lemma for each word we’re analyzing.
# Implementing lemmatization
lem = nlp("run runs running runner")
# finding lemma for each word
for word in lem:
    print(word.text, word.lemma_)
run run
runs run
running run
runner runner
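Part of Speech (POS) Tagging
The paragraph below refers to part-of-speech tagging, which assigns a grammatical category (noun, verb, adjective, and so on) to each token. Here is a minimal sketch of how spaCy exposes POS tags; it assumes the full English model downloaded earlier with spacy download en, and the example sentence is just an illustration:

# POS tagging
import spacy

# Load the full English pipeline (tokenizer, tagger, parser, NER) downloaded earlier
nlp = spacy.load("en")

docs = nlp("All is well that ends well.")

for word in docs:
    print(word.text, word.pos_)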
Hooray! spaCy has correctly identified the part of speech for each word in this
sentence. Being able to identify parts of speech is useful in a variety of NLP-
related contexts, because it helps us more accurately understand input sentences
and construct more accurate output responses.
Entity Detection
Entity detection, also called entity recognition, is a more advanced form of
language processing that identifies important elements like places, people,
organizations, and languages within an input string of text. This is really helpful
for quickly extracting information from text, since you can quickly pick out
important topics or identify key sections of text.
Let’s try out some entity detection using a few paragraphs from this recent
article in the Washington Post. We’ll use .label_ to grab a label for each entity
that’s detected in the text, and then we’ll take a look at these entities in a more
visual format using spaCy’s displaCy visualizer.
# for visualization of entity detection, importing displacy from spacy:
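# (A sketch of this step; "article_text" is a placeholder standing in for the
#  quoted Washington Post paragraphs, and "nlp" is the full English pipeline
#  loaded in the POS tagging section.)
from spacy import displacy

nlp_article = nlp(article_text)

# List each detected entity together with its label
entities = [(ent.text, ent.label_) for ent in nlp_article.ents]
print(entities)

# Render the entities inline in a Jupyter notebook
displacy.render(nlp_article, style="ent", jupyter=True)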
[displaCy entity visualization: in the rendered article text, entities such as "four" (CARDINAL), "Mayor Bill de Blasio" (PERSON), "Tuesday" (DATE), and "6 months old" (DATE) are highlighted with their labels.]
Dependency Parsing
Dependency parsing is a language processing technique that allows us to
better determine the meaning of a sentence by analyzing how it’s constructed
to determine how the individual words relate to each other.
Consider, for example, the sentence “Bill throws the ball.” We have two nouns
(Bill and ball) and one verb (throws). But we can’t just look at these words
individually, or we may end up thinking that the ball is throwing Bill! To
understand the sentence correctly, we need to look at the word order and
sentence structure, not just the words and their parts of speech.
Doing this is quite complicated, but thankfully spaCy will take care of the work
for us! Below, let’s give spaCy another short sentence pulled from the news
headlines. Then we’ll use a spaCy feature called noun_chunks, which breaks the
input down into noun phrases and the words that describe them, and iterate through each
chunk in our source text, printing the chunk’s text, its root word, the root’s
dependency label, and the word that root is attached to.
docp = nlp("In pursuit of a wall, President Trump ran into one.")
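# (A sketch of the loop described above: print each noun chunk, its root word,
#  the root's dependency label, and the word that root is attached to.)
for chunk in docp.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)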
This output can be a little bit difficult to follow, but since we’ve already imported
the displaCy visualizer, we can use that to view a dependency diagram,
using style="dep", that’s much easier to understand:
displacy.render(docp, style="dep", jupyter=True)
[displaCy renders the dependency diagram for the sentence here.]
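Word Vector Representation
A word vector is a numeric, array-like representation of a word’s meaning, built so that words used in similar contexts end up with similar vectors. As a minimal sketch, we can look at the vector spaCy assigns to the word "mango"; this assumes a model that ships with word vectors, such as en_core_web_md, has been downloaded:

# Word vector representation
import spacy

# Load a model that includes word vectors (an assumption about the setup)
nlp = spacy.load("en_core_web_md")

mango = nlp("mango")
print(mango.vector.shape)
print(mango.vector)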
There’s no way that a human could look at that array and identify it as meaning
“mango,” but representing the word this way works well for machines, because
it allows us to represent both the word’s meaning and its “proximity” to other
similar words using the coordinates in the array.
Text Classification
Now that we’ve looked at some of the cool things spaCy can do in general, let’s
look at a bigger real-world application of some of these natural language
processing techniques: text classification. Quite often, we may find ourselves
with a set of text data that we’d like to classify according to some parameters
(perhaps the subject of each snippet, for example) and text classification is what
will help us to do this.
Here’s the big-picture view of what we want to do when classifying text: first,
we extract the features we want from our source text (and any tags or metadata
it came with), and then we feed our cleaned data into a machine learning
algorithm that does the classification for us.
Importing Libraries
We’ll start by importing the libraries we’ll need for this task. We’ve already
imported spaCy, but we’ll also want pandas and scikit-learn to help with our
analysis.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
Loading Data
Above, we have looked at some simple examples of text analysis with spaCy, but
now we’ll be working on some Logistic Regression Classification using scikit-
learn. To make this more realistic, we’re going to use a real-world data set: a collection of Amazon Alexa product reviews.
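First we’ll load the reviews into a pandas DataFrame and look at the first few records; the file path below is illustrative, and the dataset is tab-separated:

# Load the TSV file of Amazon Alexa reviews (the path is an assumption)
df_amazon = pd.read_csv("datasets/amazon_alexa.tsv", sep="\t")

# Top 5 records
df_amazon.head()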
   rating       date        variation  verified_reviews  feedback
0       5  31-Jul-18  Charcoal Fabric     Love my Echo!         1
1       5  31-Jul-18  Charcoal Fabric         Loved it!         1
...
4       5  31-Jul-18  Charcoal Fabric             Music         1
# shape of dataframe
df_amazon.shape
(3150, 5)
# View data information
df_amazon.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
rating 3150 non-null int64
date 3150 non-null object
variation 3150 non-null object
verified_reviews 3150 non-null object
feedback 3150 non-null int64
dtypes: int64(2), object(3)
memory usage: 123.1+ KB
# Feedback Value count
df_amazon.feedback.value_counts()
1 2893
0 257
Name: feedback, dtype: int64
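The vectorizers and the pipeline below rely on two helpers built along the way: a custom tokenizer function called spacy_tokenizer, which lemmatizes, lowercases, and strips out stopwords and punctuation, and a small cleaning transformer called predictors. Since that code isn’t shown above, here is a reconstruction sketch of what those helpers typically look like:

import string
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.base import TransformerMixin

punctuations = string.punctuation
stop_words = STOP_WORDS
parser = English()

# Custom tokenizer: lemmatize, lowercase, and drop stopwords and punctuation
# ("-PRON-" is spaCy v2's placeholder lemma for pronouns)
def spacy_tokenizer(sentence):
    tokens = parser(sentence)
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_
              for word in tokens]
    tokens = [word for word in tokens
              if word not in stop_words and word not in punctuations]
    return tokens

# Basic text cleaning applied before vectorization
def clean_text(text):
    return text.strip().lower()

# Custom transformer so the cleaning step can sit inside a scikit-learn Pipeline
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}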
Before we can feed the review text to a machine learning model, we need to
convert it into numerical features. One tool we can use for doing this is called
Bag of Words (BoW). BoW converts text into a matrix of the occurrence of words
within a given document. It focuses on
whether given words occurred or not in the document, and it generates a matrix
that we might see referred to as a BoW matrix or a document term matrix.
We can generate a BoW matrix for our text data by using scikit-
learn’s CountVectorizer. In the code below, we’re telling CountVectorizer to
use the custom spacy_tokenizer function we built as its tokenizer, and defining
the ngram range we want.
N-grams are combinations of adjacent words in a given text, where n is the
number of words included in each token. For example, in the sentence “Who
will win the football world cup in 2022?” unigrams would be a sequence of single
words such as “who”, “will”, “win” and so on. Bigrams would be a sequence of 2
contiguous words such as “who will”, “will win”, and so on. So
the ngram_range parameter we’ll use in the code below sets the lower and upper
bounds of our ngrams (we’ll be using unigrams). Then we’ll assign the
ngrams to bow_vector.
bow_vector = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1,1))
We’ll also want to look at the TF-IDF (Term Frequency-Inverse Document
Frequency) for our terms. This sounds complicated, but it’s simply a way of
normalizing our Bag of Words (BoW) by looking at each word’s frequency in
comparison to the document frequency. In other words, it’s a way of
representing how important a particular term is in the context of a given
document, based on how many times the term appears and how many other
documents that same term appears in. The higher the TF-IDF, the more
important that term is to that document.
We can represent this with the following mathematical equation:
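A common way to write it is:

idf(w) = log( total number of documents / number of documents containing word w )
tf-idf(w, d) = tf(w, d) * idf(w)

where tf(w, d) is how often word w appears in document d.

The pipe.fit call below also assumes a few setup steps that aren’t shown above: a TF-IDF vectorizer built with the same custom tokenizer, a train/test split of the review text and feedback labels, and a scikit-learn Pipeline chaining the cleaner, the vectorizer, and a logistic regression classifier. Here is a sketch of that setup, with variable names chosen to match the output shown below:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# TF-IDF vectorizer using the same custom tokenizer
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)

# Features are the review text; labels are the feedback column
X = df_amazon["verified_reviews"]
ylabels = df_amazon["feedback"]
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

# Pipeline: clean the text, vectorize it (BoW here), then classify
classifier = LogisticRegression(solver="liblinear")
pipe = Pipeline([("cleaner", predictors()),
                 ("vectorizer", bow_vector),
                 ("classifier", classifier)])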
# model generation
pipe.fit(X_train,y_train)
Pipeline(memory=None,
steps=[('cleaner', <__main__.predictors object at
0x00000254DA6F8940>), ('vectorizer', CountVectorizer(analyzer='word',
binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))])
from sklearn import metrics

# Predicting with the test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("Logistic Regression Accuracy:", metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:", metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:", metrics.recall_score(y_test, predicted))
Logistic Regression Accuracy: 0.9417989417989417
Logistic Regression Precision: 0.9528508771929824
Logistic Regression Recall: 0.9863791146424518
Additional resources:
Scikit-learn documentation
spaCy documentation
Dataquest’s Machine Learning Course on Linear Regression in Python; many other
machine learning courses are also available in our Data Scientist path.