NLP Programs

Natural Language Processing using the Natural Language Toolkit (NLTK) module with Python.


If Python is not installed, go to Python.org and download the latest version of Python if you are on Windows.
If you are on Mac or Linux, you should be able to install it through your package manager, for example apt-get install python3 on Debian-based systems.

The NLTK module is a massive toolkit, aimed at helping you with the entire Natural Language Processing (NLP)
methodology.

NLTK will aid you with everything from splitting paragraphs into sentences and splitting up words, to recognizing the part of speech of those words, highlighting the main subjects, and even helping your machine understand what the text is all about.

The easiest way to install the NLTK module is with pip.

For all users, that is done by opening up cmd.exe, bash, or whatever shell you use and typing:
pip install nltk
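
You can check that the install worked from the same shell (assuming python launches the interpreter you installed NLTK into):

python -c "import nltk; print(nltk.__version__)"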

Next, we need to install some of the components for NLTK. Open python via whatever means you normally do, and
type:

import nltk
nltk.download()
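
Running nltk.download() with no arguments opens an interactive downloader where you choose what to fetch. If you only want the pieces used in this document, you can instead download them directly by name, for example:

import nltk
nltk.download('punkt')      # tokenizer models used by sent_tokenize / word_tokenize
nltk.download('stopwords')  # the stop word lists used in section 2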

Some quick vocabulary:

 Corpus - Body of text, singular. Corpora is the plural of this. Example: A collection of medical journals.
 Lexicon - Words and their meanings. Example: English dictionary. Consider, however, that various fields
will have different lexicons. For example: To a financial investor, the first meaning for the word "Bull" is
someone who is confident about the market, as compared to the common English lexicon, where the first
meaning for the word "Bull" is an animal. As such, there is a special lexicon for financial investors,
doctors, children, mechanics, and so on.
 Token - Each "entity" that is a part of whatever was split up based on rules. For example, each word is a
token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize the
sentences out of a paragraph.

1) Tokenizing Words and Sentences with NLTK


 Tokenizing - Splitting sentences and words from the body of text.

from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))

Without NLTK, a first attempt would likely be a simple .split('. '), splitting by period followed by a space. Then maybe you
would bring in some regular expressions to split by period, space, and then a capital letter.

NLTK is going to go ahead and just save you a ton of time with this seemingly simple, yet very complex, operation.

The above code will output the sentences, split up into a list of sentences, which you can do things like iterate
through with a for loop.

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]

print(word_tokenize(EXAMPLE_TEXT))
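
The word-tokenized output looks like this:

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']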

There are a few things to note here. First, notice that punctuation is treated as a separate token. Also, notice the
separation of the word "shouldn't" into "should" and "n't". Finally, notice that "pinkish-blue" is indeed treated as
the "one word" it was meant to be. Pretty cool!

2) Stop words with NLTK


Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been
programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a
search query.
Here, we treat stop words as words that carry no useful meaning, and we want to remove them. In a more literal sense, stop words can also be words you stop on: for example, you might cease analysis immediately if you detect words that are commonly used sarcastically.

You can do this easily by storing a list of words that you consider to be stop words. NLTK starts you off with a bunch of
words that it considers to be stop words, which you can access via the NLTK corpus with:

from nltk.corpus import stopwords

To check the list of stopwords you can type the following commands in the python shell.

import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))
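
Depending on your NLTK version, the printed list begins along these lines:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', ...]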

Here is how you might use the stop_words set to remove the stop words from your text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# One-line version, using a list comprehension:
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# The same filter written out as a loop (this overwrites the line
# above with an identical result):
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Output:

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']

['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
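
Notice that 'This' survives the filter: the NLTK stop list is all lowercase and the membership test is case-sensitive. A common variation (a small tweak, not part of the original example) is to lowercase each token before checking it:

filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]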

3) Stemming words with NLTK

The idea of stemming is a sort of normalizing method. Many variations of words carry the same meaning, other than
when tense is involved.

The reason why we stem is to shorten the lookup, and normalize sentences.

Consider:
I was taking a ride in the car.
I was riding in the car.

One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1980.
First, we're going to grab and define our stemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

Now, let's choose some words with a similar stem, like:

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

Next, we can easily stem by doing something like:

for w in example_words:
    print(ps.stem(w))

Output:

python
python
python
python
pythonli

Note that "pythonly" stems to "pythonli" rather than "python": the Porter algorithm rewrites a trailing "y" to "i", and its stems are not guaranteed to be real English words (the same is true of "veri", "poorli", and "onc" in the sentence example below).

Now let's try stemming a typical sentence, rather than some words:

new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned
poorly at least once."

Output:

It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
