2023 6th International Conference on Contemporary Computing and Informatics (IC3I)

A Small Comparative Study of Machine Learning Algorithms in the Detection of Fake Reviews of Amazon Products

Sikha Akshara Sulige Shiva Swapnika Kubireddy
Thatipally Arun VVS Lakshmi Kanthety
Dept. of Computer Science Engineering, B V Raju Institute of Technology, Narsapur, Medak, Telangana, India

2023 6th International Conference on Contemporary Computing and Informatics (IC3I) | 979-8-3503-0448-0/23/$31.00 ©2023 IEEE | DOI: 10.1109/IC3I59117.2023.10398096

Abstract—Nowadays the popularity of online shopping has increased, leading to a significant rise in the number of product reviews on platforms like Amazon. However, this increase has also attracted a growing number of fake reviews, which are designed to deceive consumers and manipulate product ratings. Detecting and filtering out these fake reviews is crucial for maintaining the integrity and reliability of online review systems. We propose a methodology to identify fake reviews on Amazon using sentiment analysis, the support vector machine algorithm, and the logistic regression algorithm. The objective is to show how well the suggested approaches work in separating fraudulent reviews from real Amazon reviews.

Index Terms—Fake review detection, Amazon books reviews, sentiment analysis, support vector machine, logistic regression, sentiment polarity, machine learning, online shopping.

I. INTRODUCTION

In this age of e-marketing, online reviews play a crucial role in shaping consumers' purchase decisions. However, with the increase in the popularity of online shopping platforms like Amazon, the issue of fake reviews has become a significant concern. Fake reviews can mislead potential buyers and compromise the integrity of the review system. Therefore, it is crucial to create efficient techniques for identifying and removing these bogus reviews. Sentiment analysis, a natural language processing technique that captures the sentiment or opinion expressed in a text, is one of the potential methods for fake review detection. Sentiment analysis can help uncover patterns in language that are indicative of fake reviews. To implement sentiment analysis for fake review detection on Amazon product reviews, we employ two powerful machine learning algorithms: Support Vector Machine (SVM) and Logistic Regression (LR). SVM is a supervised learning method that searches a high-dimensional space for an optimal hyperplane that separates instances of different classes. By learning from labeled training data, SVM can build a classification model capable of distinguishing between genuine and fake reviews based on their sentiment features. Similarly, Logistic Regression is a statistical model used to estimate the probability of a binary outcome. In the context of fake review detection, Logistic Regression can be trained to predict the likelihood of a review being fake by analyzing its sentiment features. The approach described in this work may be used to develop a robust and accurate system for detecting fake reviews on Amazon. The proposed approach holds the potential to enhance the reliability of the online review system, allowing consumers to make informed purchasing decisions while discouraging the increase of fake reviews. We will outline the methodology, describe the data set used for training and evaluation, and present the results of our experiments. The findings will provide valuable insights into the effectiveness of sentiment analysis coupled with SVM and Logistic Regression algorithms in detecting fake reviews on Amazon, contributing to the ongoing efforts to combat review manipulation and ensure the authenticity of Amazon books reviews.

II. LITERATURE SURVEY

The study [1] used three different data sets, including the Clothing, Shoes and Jewelry reviews dataset, and proposed Logistic Regression (LR), Decision Tree methods, as well as Naive Bayes (NB), Support Vector Machine (SVM), and text classification algorithms. Logistic Regression achieved higher accuracy than all other proposed models on the three data sets.

[2] Due to its context-dependent character and the usage of irony or contradictory terms to convey the opposite of the literal meaning, sarcasm identification can be a difficult task in natural language processing. To accurately identify sarcasm in product reviews, the article suggests employing a variety of machine learning methods, including supervised learning, natural language processing, or even deep learning models. Comparisons with other existing techniques and accuracy metrics may be used to assess the proposed methodology. They offered the following model


suggestions: Lemmatization, stemming, polarity identification, and data tokenization are all parts of the pre-processing of data. Then, features including inverse document frequency, word frequency, and n-grams are extracted. The Support Vector Machine (SVM), K Nearest Neighbors, and Random Forest classification methods are employed. The outcome is evaluated using the accuracy parameter. The polarity of the statement is then checked after the sarcasm detection process; if necessary, the polarities are labeled, and the sentiment is then examined.

[3] They suggested utilizing supervised learning and the Bidirectional Encoder Representations from Transformers (BERT) model to identify bogus hotel reviews. They fine-tuned the pre-trained BERT model using a publicly accessible data set of hotel reviews, and then trained a supervised learning algorithm to distinguish between real and false reviews. They used standard evaluation metrics to evaluate the performance.

[4] With the aid of language and rating properties, this study seeks to spot phony product reviews. To spot bogus reviews, they used advanced supervised machine learning algorithms, methods for processing natural language, and methods for sentiment and opinion mining. By providing the model with good training and using high-quality training and testing data, they were able to determine the veracity of a given review using the aforementioned strategies. The outcome of this experiment demonstrates that the approach is more accurate than the earlier effort.

[5] The article uses a variety of machine learning methods, such as deep learning models, natural language processing, and supervised or unsupervised learning, to automatically monitor and detect fraudulent reviews linked to certain products. The article examines user behavior patterns that can be used to discern between real and false reviews, as well as techniques for extracting pertinent elements from review content. It might also go into detail about how to spot questionable behaviors that may be signs of false reviews, such as spamming reviews or expressing biased opinions. To maintain the credibility and integrity of product review sites, the article also includes techniques for deleting or filtering out fraudulent reviews.

[7] The paper investigates the use of sentiment analysis and other machine learning approaches to identify bogus reviews. The idea of sentiment analysis is then introduced. This natural language processing (NLP) method is used to ascertain the sentiment or emotion portrayed in a text. It outlines typical traits that could be telltale signs of false reviews, like repetitious wording, overuse of superlatives, and a dearth of specific information or personal experiences. Support Vector Machines (SVM), Naive Bayes, and deep learning-based methods like Recurrent Neural Networks (RNNs) and Transformers are some of the machine learning algorithms for sentiment analysis that are compared in the article. The study shows that sentiment analysis employing machine learning approaches holds tremendous potential for efficiently identifying and combatting fraudulent reviews, helping customers make educated decisions and fostering trust in online review sites.

[11] The article focuses mainly on linguistic patterns, writing style, and sentiment indicators, retrieving the elements that were especially suited to the review material. To distinguish between authentic and false reviews, characteristics like the frequency of particular terms, sentiment scores, lexical richness, and punctuation usage are extracted from the review text. To model the relationship between the review-centric features and the authenticity labels, they used the binary classification algorithm logistic regression. The labeled dataset is used to train the algorithm to distinguish authentic reviews from false ones using the attributes that were extracted.

[13] To identify fraudulent internet reviews and assess the effectiveness of the two strategies on a data set containing hotel reviews, they introduced various semi-supervised and supervised text mining models. To enhance the performance of classification, SVM and Statistical Naive Bayes are utilized as classifiers in their research. They largely concentrated on content-based review techniques. They made use of word frequency, sentiment polarity, and review length. They found that the supervised Naive Bayes classifier had the highest accuracy when compared to other classifiers. The article discusses the advantages of using semi-supervised learning for scenarios where acquiring a large number of labeled fake reviews might be challenging. It could also explore the effectiveness of combining semi-supervised and supervised learning to improve the accuracy of fake review detection.

[14] They employed SVM, one of the most popular supervised learning algorithms, which is used for both classification and regression problems. However, it is most frequently applied to machine-learning classification problems. The effectiveness of the suggested model was evaluated using a dataset of 6186 Amazon reviews split into a training set (4948 reviews, about 80 percent) and a test set (1238 reviews). The model achieves 80.4 percent accuracy, 80.8 percent precision, 80 percent recall, and 80 percent F1-score. Comparing the proposed methodology to the previously discussed algorithm, accuracy has increased by 20 percent.

[15] The paper discusses the efficacy of applying NLP methods to identify crucial linguistic aspects as well as how neural networks may generalize and learn from vast volumes of review data. It also looks at the process's potential pitfalls, like data imbalance, noisy evaluations, or overfitting.

III. OBJECTIVES


• To analyze and evaluate different machine learning approaches to identify which ones are effective in distinguishing between genuine and fake reviews on Amazon's product platform.

• To train a machine learning model using labeled datasets consisting of both genuine and fake reviews.

• To categorize reviews as real or fraudulent using the results of the machine learning model.

• To assess the performance of the fake review detection system using appropriate evaluation metrics and measure its precision, recall, and overall accuracy.

IV. EXISTING WORK

Reviews are such a powerful tool that it is unsurprising that many unethical acts take place in the review industry. Various initiatives have been made to identify and fully comprehend these behaviors, which are generally referred to as opinion spam.

Jindal and Liu were the first to identify the issue of opinion spam. They analyzed internet reviews in three categories (false opinions, seller/brand-only reviews, and non-reviews), utilizing near-duplicate content as a sign of phony reviews. Another study that looked at the discovery of review-level spam used reviewer and review feature combinations with expressive text features. Other work used Amazon Mechanical Turk to create fabricated hotel evaluations, whereas Jindal and Liu used data from Amazon as their source and exploited content duplication as their source of truth. Both engaged in review-feature work. The significance of brands was briefly discussed by Jindal et al. and Li et al. Machine-learning-based sentiment classification is typically framed as supervised learning with three classes: negative, neutral, and positive. Reviews are frequently the source of testing and training data in the literature (Liu and Zhang, 2012). The detection and elimination of unfair reviews are of utmost importance (Jindal and Liu, 2008); Moraes et al. (2013) provided a method to categorize the textual reviews of a specific topic.

V. PROPOSED WORK

We first collect a representative data set of Amazon product reviews, including both genuine and fake reviews. We take the Amazon Book Reviews Small dataset from the website Kaggle. This dataset does not have labels. To obtain labels, we first preprocess the text data, for example by converting it to lowercase, and then use NLTK's sentiment analysis module to classify the sentiment of each review. We take the sentiment score of the preprocessed text and compare it with a threshold value. Initially we take the threshold value as 0: if the sentiment score is greater than the threshold, we treat the review as having positive sentiment and assign the label 1; if the sentiment score is less than the threshold, we treat it as having negative sentiment and assign the label 0. We then use this labelled dataset for implementing our methods. We describe the steps for cleaning and pre-processing the collected data, such as removing noise, standardizing formats, and handling missing values, and discuss several feature extraction methods, such as count vectorizer and TF-IDF. For the purpose of identifying false reviews in the obtained data sets, we suggest Support Vector Machine and Logistic Regression. For SVM, we describe the mathematical formulation and how it works, explain the process of training an SVM model using the pre-processed review data and sentiment features, and discuss techniques for hyperparameter tuning and model evaluation. For LR, we introduce LR as a statistical modeling technique for binary classification, explain the logistic function and how it relates to the probability of a review being fake or genuine, describe the process of training an LR model using the pre-processed review data and sentiment features, and discuss techniques for feature selection, regularization, and model evaluation. Finally, we present the results of applying the SVM and LR models to the data set.

Figure 1. Architecture Diagram

VI. METHODS

Data collection: Gather a data set of Amazon books reviews, including both genuine and fake reviews.

Pre-processing: Clean and pre-process the review data to remove any irrelevant information such as HTML tags, punctuation, and special characters. Also, convert the text to lowercase and remove stop words (commonly used words with little semantic value). Certain modifications must be applied to the data set to reduce the amount of data by deleting unnecessary data. Punctuation is treated as having no impact on text classification. Stop words can be articles, prepositions, conjunctions, and pronouns. Finally, perform stemming. Split the cleaned text into individual words or tokens using NLTK libraries, identify negations like "not" and "isn't", and expand abbreviations to their full forms. Tokenization is the process of breaking down either a whole paragraph or a single sentence into blocks of words called tokens.
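The pre-processing steps above, together with the threshold rule used for labelling in Section V, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the code used in the study: the stop-word list and contraction table are hypothetical and deliberately tiny, stemming is omitted, the sentiment score is assumed to be supplied by an external scorer such as NLTK's sentiment module, and a score exactly equal to the threshold is assigned label 0 because the text does not specify that case.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller
# list such as the one shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "it", "this", "is", "was"}

# Hypothetical contraction table: expanding these keeps the negation "not" visible.
CONTRACTIONS = {"isn't": "is not", "don't": "do not", "can't": "can not"}


def preprocess(review):
    """Clean one review and return its list of tokens."""
    text = re.sub(r"<[^>]+>", " ", review)     # drop HTML tags
    text = text.lower()                         # convert to lowercase
    for short, full in CONTRACTIONS.items():    # expand before apostrophes are stripped
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation, digits, special characters
    tokens = text.split()                       # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]


def label_from_score(score, threshold=0.0):
    """Return 1 (positive sentiment) if score exceeds the threshold, else 0.

    Assumption: a score equal to the threshold is labelled 0.
    """
    return 1 if score > threshold else 0
```

For example, `preprocess("<b>This</b> book ISN'T good!")` keeps the negation and returns the tokens `book`, `not`, `good`.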

Feature extraction: Feature extraction is the process of transforming raw data into numerical features while preserving the information in the original data set. We extract relevant features from the review text that can be used for sentiment analysis. Feature selection can significantly increase the classification accuracy. In this article we used the Term Frequency-Inverse Document Frequency (TF-IDF) method to extract features and lessen the weight of commonly used words that might not convey much emotional content. Count vectorizer is used to count how many times each word appears in the data in order to convert text into vectors of token counts. Count vectorizer generates a matrix in which each distinct word is represented by a column and each text sample is represented by a row. The value of each cell is the number of times the word occurs in that particular text sample.

Algorithms for sentiment classification: Separate the collection of labeled data into training and testing sets. Most of the time, we divide the data so that 70 percent is for training and 30 percent is for testing. Documents were classified as either positive, negative, or neutral using the sentiment classification algorithm. We used two well-known supervised classifiers in our investigation, namely the LR and SVM classifiers.

Support Vector Machine: The Support Vector Machine (SVM) is a supervised learning method that employs related learning algorithms to examine the datasets used for classification. SVM has emerged as one of the most extensively used supervised learning classifiers in recent years. SVM works by locating the hyperplane that divides the dataset into two classes, in this instance genuine and fake reviews. The signed distance between each feature vector and the hyperplane can be used to determine whether the review is genuine or fake. The review is regarded as genuine if the distance is positive and as fake if the distance is negative.

$$f(x) = \operatorname{sign}(w^{T}x + b) \tag{1}$$

In this equation, f(x) represents the predicted class label for a given review, x is the feature vector representing the review, w is the weight vector that the SVM algorithm learns during training, and b is the bias term. The sign function ensures that the predicted class label is either +1 or -1, indicating genuine or fake, respectively.

Figure 2. SVM Hyperplane

Logistic regression (LR): Fit a logistic regression model to the training data. The LR model learns the relationship between sentiment features and labels (genuine or fake). The testing set is used to assess the performance of the LR model once it has been trained on the training set. The model equation is

$$P(\text{fake}) = \operatorname{sigmoid}(w^{T}x + b)$$

where P(fake) represents the probability of a review being fake, w is the weight vector for the sentiment features, x is the feature vector representing the sentiment features of a review, and b is the bias term. The sigmoid function squashes the output to a value between 0 and 1, representing the probability of a review being fake.

Model Evaluation: Evaluate the trained SVM and LR models on the testing set using evaluation metrics such as accuracy and precision.

Figure 3. Datasets

VII. RESULTS

The data set that we used is the Amazon book reviews data set from Kaggle. We removed the stop words using NLTK libraries and tokenized the text data. Using the TF-IDF and count vectorizer approaches, we extracted and selected the features, using 70 percent of the data set for training and 30 percent for testing. We used the SVM and LR classifiers for training and testing. SVM gave an accuracy of 86 percent and LR gave an accuracy of 87 percent.

Figure 4. Dataset Schema
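To make the feature extraction and decision rules behind these results concrete, the following Python sketch builds a count matrix, applies a TF-IDF weighting, and evaluates the SVM rule of equation (1) and the logistic regression probability from Section VI for given parameters. It is an illustration only: the function names are hypothetical, the inverse document frequency is the plain unsmoothed form log(N/df), and the weights w and b are assumed to come from a training step that is not shown. Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalisation by default, so their values will differ. A score of exactly zero is mapped to +1, a case the text leaves open.

```python
import math
from collections import Counter


def count_matrix(docs):
    """Return (vocabulary, rows): one row of token counts per tokenized document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[tok] for tok in vocab])  # missing tokens count as 0
    return vocab, rows


def tfidf_matrix(rows):
    """Weight raw counts by an unsmoothed inverse document frequency, log(N / df)."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: the number of documents in which each term occurs.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in rows]


def linear_score(w, x, b):
    """Compute w^T x + b for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_predict(w, x, b):
    """Equation (1): +1 (genuine) or -1 (fake). Assumption: a zero score maps to +1."""
    return 1 if linear_score(w, x, b) >= 0 else -1


def lr_probability_fake(w, x, b):
    """Sigmoid of the linear score, read as P(fake)."""
    return 1.0 / (1.0 + math.exp(-linear_score(w, x, b)))
```

With the two toy documents `["great", "book"]` and `["bad", "book"]`, the word `book` occurs in every document and therefore receives a TF-IDF weight of zero, which is the down-weighting of common words described in Section VI.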

The following quantities are used for estimating precision, recall, and F-Measure:


• True Positive Reviews (TPR): the number of sentences in the testing data that the classification model correctly predicted as positive (fair positive reviews).

• False Positive Reviews (FPR): the number of sentences in the testing data that the classification model mistakenly predicted as positive (unfair positive reviews).

• True Negative Reviews (TNR): the number of sentences in the testing data that the classification model correctly predicted as negative (fair negative reviews).

• False Negative Reviews (FNR): the number of sentences in the testing data that the classification model mistakenly predicted as negative (unfair negative reviews).

• True Neutral Reviews: the number of sentences in the testing data that the classification model correctly predicted as neutral (fair neutral reviews).

• False Neutral Reviews: the number of sentences in the testing data that the classification model misclassified as neutral (unfair neutral reviews).

Figure 7. Confusion matrix obtained by LR

Here TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative counts defined above.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$

The F-Measure (F1 measure) expresses a balance between precision and recall:

$$F\text{-Measure} = \frac{2pr}{p + r} \tag{5}$$

where p is the precision and r is the recall.
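As a worked illustration of equations (2) to (5), the short Python function below computes the four metrics from confusion-matrix counts. The function name and the example counts are hypothetical and are not the values obtained in this study; returning 0.0 when a denominator is zero is a convention assumed here, not something the text specifies.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F-Measure from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # equation (2)
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # equation (3)
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # equation (4)
    # Equation (5): harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only to exercise the formulas.
example = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these illustrative counts the accuracy is 85/100 = 0.85, the precision is 40/45, the recall is 40/50 = 0.8, and the F-Measure is 80/95.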

Figure 5. Confusion Matrix

Figure 6. Confusion matrix obtained by SVM

Figure 8. Performance Metrics

Figure 9. Graph

VIII. CONCLUSION AND FUTURE WORK

In this research, we proposed two methods to analyze a dataset of Amazon product reviews. We used the Amazon Book Reviews Small dataset from Kaggle, which does not have labels. We therefore labelled the dataset using a sentiment analysis technique


and renamed the dataset as Amazon book labelled reviews, which consists of 48747 labelled reviews. On this labelled dataset we ran the machine learning algorithms and computed performance metrics such as recall, precision, F-Measure, and accuracy for each algorithm. Our experiments studied the accuracy of the sentiment classification algorithms and how to determine which algorithm is more accurate. Furthermore, we were able to detect fake positive reviews and fake negative reviews through the detection processes. Of the two algorithms used, Logistic Regression (LR) performs better, with an accuracy of 87 percent, compared with Support Vector Machine (SVM), which has an accuracy of 86 percent in detecting fake reviews of Amazon products. In the future we will aim to improve the accuracy of the proposed methodology by combining the two methods proposed here.

REFERENCES

[1] Elshrif Ibrahim and Abdelouahed Gherbi, "Unfair Reviews Detection on Amazon Reviews Using Sentiment Analysis with Supervised Learning Techniques", "Journal of Computer Science", Volume 14, Issue 5, Pages: 714-726, May.
[2] Elshrif Elmurngi and Abdelouahed Gherbi, "Detecting Fake Reviews through Sentiment Analysis Using Machine Learning Techniques", "The Sixth International Conference on Data Analytics", Pages: 65-72, June 2018.
[3] Viresh Gupta, Aayush Aggarwal and Tanmoy Chakraborty, "Detecting and Characterizing Extremist Reviewer Groups in Online Product Reviews", "IEEE", Pages: 1-10, May 2020.
[4] Prof. Vitthal.G, Kaushik, Mrunal.L, Prathamesh Jagtap, Srishti.D, "An Approach to Detect Fake Reviews based on Logistic Regression using Review-Centric Features", "IRJET", Volume 7, Issue 6, Pages: 2107–2112, June 2020.
[5] Ashwini.MC, Padma.M.C, "Efficiently Analysing and Detecting Fake Reviews Through Opinion Mining", "IJCSMC", Vol. 9, Issue 7, Pages: 97-108, July 2020.
[6] Mandala Vishal and Sindhu C, "Detection of Sarcasm on Amazon Product Reviews Using Machine Learning Algorithms Under Sentiment Analysis", "IEEE", Pages: 196-199, June 20, 2021.
[7] K.Venkateswara, P. Anil, R.J.V. Siddhartha, and Antony.T, "Fake Review Sentiment Analysis Using Natural Language Processing", "JES", Vol. 12, Issue 7, Pages: 314–318, July 2021.
[8] Pansy Nandwani, Rupali Verma, "A review on sentiment analysis and emotion detection from text", "Social Network Analysis and Mining, Springer", Pages: 119, August 2021.
[9] Riad Sonbol, Ghaida Rebdawi, and Nada Ghneim, "The Use of NLP-Based Text Representation Techniques to Support Requirement Engineering Tasks: A Systematic Mapping Review", "IEEE", Vol. 10, Pages: 63811-62830, June 2022.
[10] Amira Yousif, James Buckley, "Impact of Sentiment Analysis on Fake Review Detection", "A PREPRINT", Pages: 1-8, December 2022.
[11] Swarnajit Bhattacharya, "Monitoring and Removal of Fake Product Review Using Machine Learning (ML)", "Research Square", Pages: 1-11, April 2023.
[12] Avantika.T, Samiksha.W, Bhagyashri.T, "Online Fake Review Detection Using SVM", "IJRASET", Volume 11, Issue V, Pages: 1-5, May 2023.
[13] Abhijeet Rathore, Gayatri Bhadane, Ankita Jadhav, Kishor Dhale and Jayshree Muley, "Fake Reviews Detection Using NLP Model and Neural Network Model", "IJERT", Vol. 12, No. 5, Pages: 51–56, May 2023.
[14] Abhijeet Rathore, Gayatri Bhadane, Ankita Jadhav, Kishor Dhale, Prof. Jayshree Muley, "Fake reviews detection using machine learning model", "IJSRD", Vol. 11, No. 3, Pages: 280–283, 2023.
[15] "... online reviews using semi-supervised and supervised learning", "Turkish Journal of Computer and Mathematics Education", Volume 14, Issue 2, Pages: 107–118, 2023.
