Chapter 2


Bag-of-words

SENTIMENT ANALYSIS IN PYTHON

Violeta Misheva
Data Scientist
What is a bag-of-words (BOW)?

Describes the occurrence of words within a document or a collection of documents (corpus)

Builds a vocabulary of the words and a measure of their presence


Amazon product reviews


Sentiment analysis with BOW: Example
This is the best book ever. I loved the book and highly recommend it!!!

{'This': 1, 'is': 1, 'the': 2, 'best': 1, 'book': 2,
 'ever': 1, 'I': 1, 'loved': 1, 'and': 1, 'highly': 1,
 'recommend': 1, 'it': 1}

Word order and grammar rules are lost!
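The counts above can be reproduced with the standard library alone. The sketch below is an illustration, not the course's own code: it assumes a simple regular-expression tokenizer (`\w+`) that discards punctuation and keeps the original case, a choice made here only for the example.

```python
import re
from collections import Counter

review = "This is the best book ever. I loved the book and highly recommend it!!!"

# Assumed tokenizer: runs of word characters; punctuation is discarded
tokens = re.findall(r"\w+", review)

# A bag-of-words is a mapping from each word to the number of times it occurs
bow = Counter(tokens)

print(dict(bow))
```

The result records how often each word occurs but nothing about the order in which the words appeared.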


BOW end result
The output will look something like this:


CountVectorizer function
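import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# data is assumed to be a DataFrame with a text column named 'review'
# Keep only the 1000 most frequent tokens in the vocabulary
vect = CountVectorizer(max_features=1000)
vect.fit(data.review)
# Build the sparse document-term matrix
X = vect.transform(data.review)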


CountVectorizer output
X

<10000x1000 sparse matrix of type '<class 'numpy.int64'>'
    with 406668 stored elements in Compressed Sparse Row format>


Transforming the vectorizer
# Transform to an array
my_array = X.toarray()

# Transform back to a dataframe, assign column names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())


Let's practice!
Getting granular with n-grams
Context matters
I am happy, not sad.

I am sad, not happy.

Putting 'not' in front of a word (negation) is one example of how context matters.


Capturing context with a BOW
Unigrams: single tokens

Bigrams: pairs of tokens

Trigrams: triples of tokens

n-grams: sequences of n tokens


Capturing context with BOW
The weather today is wonderful.

Unigrams: {The, weather, today, is, wonderful}

Bigrams: {The weather, weather today, today is, is wonderful}

Trigrams: {The weather today, weather today is, today is wonderful}
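The sets above can be generated with a few lines of plain Python. This is a sketch under one assumption: the sentence is tokenized by stripping the final period and splitting on whitespace. The helper name `ngrams` is introduced here for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: whitespace split after removing the final period
tokens = "The weather today is wonderful.".rstrip(".").split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sentence of k tokens yields k - n + 1 n-grams, which is why the five-word sentence gives four bigrams and three trigrams.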


n-grams with the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# General form: CountVectorizer(ngram_range=(min_n, max_n))

# Only unigrams
vect = CountVectorizer(ngram_range=(1, 1))

# Uni- and bigrams
vect = CountVectorizer(ngram_range=(1, 2))
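The two "happy / sad" sentences from earlier make a convenient test of what `ngram_range` buys you. The following sketch compares a unigram-only and a bigram-only vectorizer on them; the variable names are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I am happy, not sad.", "I am sad, not happy."]

# Unigrams only: both sentences contain exactly the same words
uni = CountVectorizer(ngram_range=(1, 1)).fit_transform(sentences).toarray()

# Bigrams only: word pairs such as 'not sad' and 'not happy' differ
bi_vect = CountVectorizer(ngram_range=(2, 2))
bi = bi_vect.fit_transform(sentences).toarray()

print((uni[0] == uni[1]).all())  # are the unigram vectors identical?
print((bi[0] == bi[1]).all())    # are the bigram vectors identical?
print(sorted(bi_vect.vocabulary_))
```

With unigrams the two sentences are indistinguishable, while bigrams keep the negation attached to the word it modifies.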


What is the best n?
Longer sequences of tokens:
Result in more features

Can give machine learning models higher precision

Carry a risk of overfitting
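How quickly the feature count grows with n is easy to check empirically. The sketch below uses a hypothetical three-sentence corpus and counts the vocabulary size for increasing upper bounds of `ngram_range`; the actual numbers will depend entirely on your data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, for illustration only
corpus = [
    "the weather today is wonderful",
    "the weather yesterday was awful",
    "today is a wonderful day",
]

sizes = {}
for max_n in (1, 2, 3):
    vect = CountVectorizer(ngram_range=(1, max_n)).fit(corpus)
    sizes[max_n] = len(vect.vocabulary_)  # number of distinct features

print(sizes)
```

Many of the longer n-grams occur in only one document, which is one reason a larger n can lead a model to overfit.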


Specifying vocabulary size
CountVectorizer(max_features, max_df, min_df)

max_features: if specified, the vocabulary includes only that many of the most frequent words
If max_features = None, all words will be included

max_df: ignore terms with a document frequency higher than the specified threshold

If it is set to an integer, it is an absolute count; if a float, it is a proportion of documents

Default is 1.0, which means it does not ignore any terms

min_df: ignore terms with a document frequency lower than the specified threshold

If it is set to an integer, it is an absolute count; if a float, it is a proportion of documents

Default is 1, which means it does not ignore any terms
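The effect of the three parameters is easiest to see on a tiny corpus. This is a minimal sketch with a hypothetical four-document corpus and a helper, `vocab`, introduced here only to keep the comparison short.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus: 'good' appears in 3 of 4 documents, 'phone' in 2
corpus = ["good book", "good movie", "good phone", "bad phone"]

def vocab(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    return sorted(CountVectorizer(**kwargs).fit(corpus).vocabulary_)

print(vocab())                # all five words
print(vocab(max_features=2))  # the two most frequent words
print(vocab(min_df=2))        # words found in at least 2 documents
print(vocab(max_df=0.5))      # words found in at most 50% of documents
```

Note that `min_df` and `max_df` are based on the number of documents a term appears in, while `max_features` ranks terms by their total count across the corpus.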


Let's practice!
Build new features from text
Goal of the video

Goal: Enrich the existing dataset with features related to the text column (capturing the sentiment)


Product reviews data
reviews.head()


Features from the review column

How long is each review?

How many sentences does it contain?

What parts of speech are involved?

How many punctuation marks?


Tokenizing a string
from nltk import word_tokenize

anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'

word_tokenize(anna_k)

['Happy','families','are', 'all','alike',',',
'every','unhappy', 'family', 'is','unhappy','in',
'its','own','way','.']


Tokens from a column
# General form of list comprehension
[expression for item in iterable]

word_tokens = [word_tokenize(review) for review in reviews.review]

type(word_tokens)

list

type(word_tokens[0])

list


Tokens from a column
len_tokens = []

# Iterate over the word_tokens list
for i in range(len(word_tokens)):
    len_tokens.append(len(word_tokens[i]))

# Create a new feature for the length of each review
reviews['n_tokens'] = len_tokens
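The same feature can be built in a single list comprehension. The sketch below is self-contained, so it swaps in two things that are assumptions of this example: a hypothetical two-row `reviews` frame, and a regular-expression tokenizer (`simple_tokenize`) standing in for `word_tokenize` so that no NLTK data download is needed.

```python
import re
import pandas as pd

# Hypothetical reviews; the second is the sentence tokenized earlier
reviews = pd.DataFrame({"review": [
    "I loved the book!",
    "Happy families are all alike, every unhappy family is unhappy in its own way.",
]})

def simple_tokenize(text):
    """Assumed stand-in for word_tokenize: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Same feature as the loop above, written as a list comprehension
reviews["n_tokens"] = [len(simple_tokenize(r)) for r in reviews.review]

print(reviews)
```

On these simple sentences the stand-in splits text the same way as the `word_tokenize` output shown earlier, but the two tokenizers can differ on contractions and other edge cases.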


Dealing with punctuation
We did not address punctuation, but you can exclude it

A feature that measures the number of punctuation signs

A review with many punctuation signs could signal a very emotionally charged opinion
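A punctuation-count feature needs nothing beyond the standard library. This is a minimal sketch: the function name `count_punctuation` is introduced here, and it counts only the ASCII characters listed in `string.punctuation`.

```python
import string

def count_punctuation(text):
    """Count characters that appear in string.punctuation (ASCII punctuation only)."""
    return sum(ch in string.punctuation for ch in text)

calm = "This is the best book ever."
excited = "I loved the book and highly recommend it!!!"

print(count_punctuation(calm), count_punctuation(excited))
```

Applied to a DataFrame column, the same function could populate a new feature alongside the token count.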


Reviews with a feature for the length
reviews.head()


Let's practice!
Can you guess the language?
Language of a string in Python
from langdetect import detect_langs
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'

detect_langs(foreign)

[es:0.9999945352697024]


Language of a column
Problem: Detect the language of each of the strings and capture the most likely language in a new column

import pandas as pd
from langdetect import detect_langs

reviews = pd.read_csv('product_reviews.csv')

reviews.head()


Building a feature for the language
languages = []

for row in range(len(reviews)):
    # Column 1 is assumed to hold the review text
    languages.append(detect_langs(reviews.iloc[row, 1]))

languages
[it:0.9999982541301151],
[es:0.9999954153640488],
[es:0.7142833997345875, en:0.2857160465706441],
[es:0.9999942365605781],
[es:0.999997956049055] ...


Building a feature for the language
# Transform the first list to a string and split on a colon
str(languages[0]).split(':')
['[it', '0.9999982541301151]']

str(languages[0]).split(':')[0]
'[it'

str(languages[0]).split(':')[0][1:]
'it'


Building a feature for the language
languages = [str(lang).split(':')[0][1:] for lang in languages]

reviews['language'] = languages
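The string slicing can be checked without running the detector. The sketch below works on the printed form of three results copied from the output shown earlier, so `printed` is a hard-coded assumption rather than live `detect_langs` output.

```python
# String forms of detect_langs results, copied from the output shown earlier
printed = [
    "[it:0.9999982541301151]",
    "[es:0.9999954153640488]",
    "[es:0.7142833997345875, en:0.2857160465706441]",
]

# Keep the text before the first colon, then drop the leading '['
codes = [s.split(":")[0][1:] for s in printed]

print(codes)
```

Because only the text before the first colon is kept, a result with several candidates reduces to the first one listed, which in the output shown here is the most probable language.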


Let's practice!