NLP Programs
The NLTK module is a massive toolkit, designed to help you through the entire Natural Language Processing (NLP) workflow.
NLTK will aid you with everything from splitting paragraphs into sentences, splitting up words, and recognizing the part of speech of those words, to highlighting the main subjects and even helping your machine understand what the text is all about.
The easiest way to install the NLTK module is with pip.
For all users, that is done by opening up cmd.exe, bash, or whatever shell you use and typing:
pip install nltk
Next, we need to install some of the components for NLTK. Open Python via whatever means you normally do, and type:
import nltk
nltk.download()
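This opens the NLTK downloader GUI, from which you can install everything or pick individual packages. If you prefer to stay in the shell, you can also download just the pieces this tutorial uses by name (these package identifiers come from NLTK's standard data collection):

import nltk
nltk.download('punkt')      # tokenizer models used by sent_tokenize and word_tokenize
nltk.download('stopwords')  # the stop word lists used later in this tutorial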
Corpus - Body of text, singular. Corpora is the plural of this. Example: A collection of medical journals.
Lexicon - Words and their meanings. Example: English dictionary. Consider, however, that various fields
will have different lexicons. For example: To a financial investor, the first meaning for the word "Bull" is
someone who is confident about the market, as compared to the common English lexicon, where the first
meaning for the word "Bull" is an animal. As such, there is a special lexicon for financial investors,
doctors, children, mechanics, and so on.
Token - Each "entity" that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenized the sentences out of a paragraph.
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard."
print(sent_tokenize(EXAMPLE_TEXT))
The first step would likely be a simple .split('. '), splitting by a period followed by a space. Then maybe you would bring in some regular expressions to split by a period, a space, and then a capital letter.
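For comparison, here is a minimal sketch of that naive approach, run against the EXAMPLE_TEXT defined above, showing where it goes wrong:

# Naive attempt: split on a period followed by a space.
# "Mr." triggers a false split, and "today?" is never split at all.
print(EXAMPLE_TEXT.split('. '))
# ['Hello Mr', 'Smith, how are you doing today? The weather is great, and Python is awesome', 'The sky is pinkish-blue', "You shouldn't eat cardboard."]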
NLTK is going to save you a ton of time with this seemingly simple, yet surprisingly complex, operation. The sent_tokenize call above outputs the sentences, split up into a list of sentences, which you can do things like iterate through with a for loop.
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]
print(word_tokenize(EXAMPLE_TEXT))
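Assuming the tokenizer models are downloaded, the output should look something like this:

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']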
There are a few things to note here. First, notice that punctuation is treated as a separate token. Also, notice the
separation of the word "shouldn't" into "should" and "n't." Finally, notice that "pinkish-blue" is indeed treated like
the "one word" it was meant to be turned into. Pretty cool!
Stop words are common filler words (such as "the", "a", and "is") that carry little meaning on their own, so we often filter them out before analysis. You can do this easily by storing a list of words that you consider to be stop words. NLTK starts you off with a list of words that it considers to be stop words, which you can access via the NLTK corpus. To check the list of stop words, type the following commands in the Python shell:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))
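The exact contents vary a bit by NLTK version, but the English list begins roughly like this (truncated here):

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', ...]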
Here is how you might use the stop_words set to remove the stop words from your text:
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = []
for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)
Output:
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
Note that 'This' survives the filter: the stop word list is all lowercase, and the set membership check is case-sensitive.
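The same filter can also be written as a one-line list comprehension, which does exactly what the loop above does:

filtered_sentence = [w for w in word_tokens if w not in stop_words]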
The idea of stemming is a sort of normalizing method. Many variations of a word carry the same meaning, apart from tense. The reason we stem is to shorten the lookup and to normalize sentences.
Consider:
I was taking a ride in the car.
I was riding in the car.
One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.
First, we're going to grab and define our stemmer:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

for w in example_words:
    print(ps.stem(w))
Output:
python
python
python
python
pythonli
Now let's try stemming a typical sentence, rather than some words:
new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned
poorly at least once."
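To produce the output below, we tokenize the sentence into words and stem each token (word_tokenize and ps were defined above):

words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))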
Output:
It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc