
Language Engineering

Prepared by: Abdelrahman M. Safwat

Section (4) – Mini Project


Idea

 We want to create a Python program that takes a file and checks whether the text indicates a positive sentiment or a negative sentiment.

2
Defining The Lists for Positive and Negative Words

 First, we need to define a list of positive and negative words to compare words from our text against.

positive_words = ["well", "good", "great", "like", "better", "enough", "happy", "love", "pleasure", "happiness"]

negative_words = ["miss", "poor", "doubt", "object", "sorry", "impossible", "afraid", "scarcely", "bad", "anxious"]

3
Opening and Reading The File

 Next, we need to open the text file and store its content in a variable. Make sure the text file is in the same directory as the Python script.

file = open("1.txt")
text = file.read()
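 As a side note, a with block closes the file automatically, and an explicit encoding avoids platform-dependent defaults; a minimal alternative sketch:

with open("1.txt", encoding="utf-8") as file:
    text = file.read()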

4
Tokenizing The Text

 Next, we need to tokenize all the text into words.


from nltk.tokenize import word_tokenize

words = word_tokenize(text)
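 If word_tokenize() raises a LookupError, the tokenizer models are missing; a one-time download (the exact package name can vary slightly between NLTK versions) fixes it:

import nltk
nltk.download("punkt")  # Punkt tokenizer models used by word_tokenize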

5
Checking if The Text contains
Positive or Negative Words

 Next, we need to loop through all the words and check if they’re included in either of the lists we created.

for word in words:
    if word in positive_words:
        print("The text is positive")
        break
    elif word in negative_words:
        print("The text is negative")
        break

6
Improving The Program

 You’ll notice that there’s a problem with the code on the previous slide: because it stops at the first match, this method would be inaccurate if the text contains a mix of both good and bad words.

7
Keeping Positive and Negative Scores

 To solve the problem we mentioned, we can keep score of how many positive and negative words there are in the text.

8
Keeping Positive and Negative Scores
(Cont.)

positive_score = 0
negative_score = 0

for word in words:
    if word in positive_words:
        positive_score += 1
    elif word in negative_words:
        negative_score += 1

if positive_score > negative_score:
    print("The text is positive")
else:
    print("The text is negative")

9
Improving The Program even Further

 You’ll notice that there’s still a problem with the code on the previous slide: some words might be positive even though we haven’t put them in our list.

10
Using Word Similarity

 What we’ll do is, for each word in our text, check how similar it is to each word in the positive and negative lists.
 To do so, we’ll also need to remove all irrelevant words.

11
Stop Words

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
Stop-word lists are commonly used in Natural Language Processing (NLP) to filter out words that occur so frequently that they carry very little useful information.

12
Example for Stop Word
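 A small illustrative sketch (the sentence below is made up for illustration): tokenize a sentence and drop the English stop words.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words("english"))
filtered = [w for w in word_tokenize(sentence) if w.lower() not in stop_words]
print(filtered)
# prints something like: ['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']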

13
Removing Stop Words

from nltk.corpus import stopwords

stop_words = stopwords.words("english")
filtered_words = []

for word in words:
    if word not in stop_words:
        filtered_words.append(word)
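 The stop-word list is NLTK corpus data and may need a one-time nltk.download("stopwords"). Also note that for this filtering to help, the similarity loops on the later slides should run over the filtered list; a minimal way to do that without changing the rest of the code is to reassign it:

words = filtered_words  # the similarity loops below will now skip stop words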

14
Using Word Similarity

 What we’ll do is, for each word in our text, check how similar it is to each word in the positive list and then again for the negative list.
 We’ll keep each similarity score in a list, and get the maximum score.

positive_score = 0
negative_score = 0
positive_similarity = []
negative_similarity = []

15
Using Word Similarity (Cont.)

from nltk.corpus import wordnet

for word in words:
    for positive_word in positive_words:
        word1 = wordnet.synsets(word)[0]
        word2 = wordnet.synsets(positive_word)[0]
        positive_similarity.append(word1.wup_similarity(word2))
    for negative_word in negative_words:
        word1 = wordnet.synsets(word)[0]
        word2 = wordnet.synsets(negative_word)[0]
        negative_similarity.append(word1.wup_similarity(word2))

positive_score += max(positive_similarity)
negative_score += max(negative_similarity)

16
Fixing Problems in Our Code

 You’ll notice that there’s an error complaining about a “None” value in our list.
 That comes from wup_similarity(), which returns None when two synsets can’t be compared, while max() expects numbers only.
 To fix this, we simply need to use the filter() function to remove all None values.

17
Fixing Problems in Our Code (Cont.)

for word in words:
    for positive_word in positive_words:
        word1 = wordnet.synsets(word)[0]
        word2 = wordnet.synsets(positive_word)[0]
        positive_similarity.append(word1.wup_similarity(word2))
    positive_similarity = list(filter(None, positive_similarity))
    for negative_word in negative_words:
        word1 = wordnet.synsets(word)[0]
        word2 = wordnet.synsets(negative_word)[0]
        negative_similarity.append(word1.wup_similarity(word2))
    negative_similarity = list(filter(None, negative_similarity))

positive_score += max(positive_similarity)
negative_score += max(negative_similarity)
18
Fixing Problems in Our Code (Cont.)

 You’ll notice that there’s another error, this time complaining about an invalid index.
 That’s because some of the words we’re comparing don’t have an entry in WordNet, so wordnet.synsets() returns an empty list.
 We can simply check first whether there are entries or not.

19
Fixing Problems in Our Code (Cont.)

for word in words:
    for positive_word in positive_words:
        if wordnet.synsets(word) and wordnet.synsets(positive_word):
            word1 = wordnet.synsets(word)[0]
            word2 = wordnet.synsets(positive_word)[0]
            positive_similarity.append(word1.wup_similarity(word2))
    positive_similarity = list(filter(None, positive_similarity))
    for negative_word in negative_words:
        if wordnet.synsets(word) and wordnet.synsets(negative_word):
            word1 = wordnet.synsets(word)[0]
            word2 = wordnet.synsets(negative_word)[0]
            negative_similarity.append(word1.wup_similarity(word2))
    negative_similarity = list(filter(None, negative_similarity))

positive_score += max(positive_similarity)
negative_score += max(negative_similarity)

20
Checking Our Results

if positive_score > negative_score:
    print("The text is positive")
else:
    print("The text is negative")

21
Wu-Palmer Similarity

The wup_similarity method is short for Wu-Palmer Similarity, which is a scoring method based on how similar the word senses are and on where the synsets occur relative to each other in the hypernym tree.
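 In its usual formulation, the score is wup(s1, s2) = 2 * depth(lcs) / (depth(s1) + depth(s2)), where lcs is the least common subsumer (the deepest ancestor the two synsets share), so the score lies between 0 and 1 and is higher for senses that share a deep common ancestor.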

22
Code #1: Introducing Synsets

from nltk.corpus import wordnet

syn1 = wordnet.synsets('hello')[0]
syn2 = wordnet.synsets('selling')[0]
print("hello name : ", syn1.name())
print("selling name : ", syn2.name())

 Output:
hello name : hello.n.01
selling name : selling.n.01

23
Code #2: Wu Similarity

syn1.wup_similarity(syn2)

Output:
0.26666666666666666
 “hello” and “selling” are apparently 27% similar!

24
Try it out yourself

 Code:
https://colab.research.google.com/drive/19sLiFnHyDzi1M99yRjeHlB7ekCSYXrmD

25
Task #1

 Use the mini project we did to loop through all the text files in a directory and print each document’s name and whether it contains positive or negative text (a starter sketch follows below).
 Extra: See if you can improve the mini project even further.
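 A possible starter sketch, reusing the simple word-count scoring from slide 9 (the folder name "texts" is just a placeholder, and classify() is our own helper, not something from the slides):

import os
from nltk.tokenize import word_tokenize

positive_words = ["well", "good", "great", "like", "better", "enough", "happy", "love", "pleasure", "happiness"]
negative_words = ["miss", "poor", "doubt", "object", "sorry", "impossible", "afraid", "scarcely", "bad", "anxious"]

def classify(text):
    # Count how many known positive and negative words the text contains.
    words = word_tokenize(text)
    positive_score = sum(1 for word in words if word in positive_words)
    negative_score = sum(1 for word in words if word in negative_words)
    return "positive" if positive_score > negative_score else "negative"

for filename in os.listdir("texts"):
    if filename.endswith(".txt"):
        with open(os.path.join("texts", filename), encoding="utf-8") as f:
            print(filename, "- The text is", classify(f.read()))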

26
Task #2

 Write a Python program to check the list of stop words for the Arabic language (a hint follows below).
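 A hint, assuming your NLTK version ships an Arabic stop-word list (recent releases do):

from nltk.corpus import stopwords

print(stopwords.fileids())  # languages with a stop-word list available
arabic_stop_words = stopwords.words("arabic")
print(len(arabic_stop_words), arabic_stop_words[:10])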

27
Thank you for your attention!

28
