A Small Comparative Study of Machine Learning Algorithms in The Detection of Fake Reviews of Amazon Products
Abstract—The growing popularity of online shopping has led to a significant rise in the number of product reviews on platforms like Amazon. However, this increase has also attracted a growing number of fake reviews, which are designed to deceive consumers and manipulate product ratings. Detecting and filtering out these fake reviews is crucial for maintaining the integrity and reliability of online review systems. We propose a methodology to identify fake reviews on Amazon using sentiment analysis, the support vector machine algorithm, and the logistic regression algorithm. The objective is to show how well the suggested approaches separate fraudulent reviews from real Amazon reviews.

Index Terms—Fake review detection, Amazon books reviews, sentiment analysis, support vector machine, logistic regression, sentiment polarity, machine learning, online shopping.

I. INTRODUCTION

In this age of e-marketing, online reviews play a crucial role in shaping consumers’ purchase decisions. However, with the increasing popularity of online shopping platforms like Amazon, fake reviews have become a significant concern. Fake reviews can mislead potential buyers and compromise the integrity of the review system. Therefore, it is crucial to create efficient techniques for identifying and removing these bogus reviews. Sentiment analysis, a natural language processing technique that captures the sentiment or opinion expressed in a text, is one of the potential methods for fake review detection. Sentiment analysis can help uncover patterns in language that are indicative of fake reviews. To implement sentiment analysis for fake review detection on Amazon product reviews, we employ two machine learning algorithms: Support Vector Machine (SVM) and Logistic Regression (LR). SVM is a supervised learning method that searches a high-dimensional space for an optimal hyperplane separating instances of different classes. By learning from labeled training data, SVM can build a classification model capable of distinguishing between genuine and fake reviews based on their sentiment features. Similarly, Logistic Regression is a statistical model used to estimate the probability of a binary outcome. In the context of fake review detection, Logistic Regression can be trained to predict the likelihood of a review being fake by analyzing its sentiment features. The approach taken in this work may be used to develop a robust and accurate system for detecting fake reviews on Amazon. The proposed approach holds the potential to enhance the reliability of the online review system, allowing consumers to make informed purchasing decisions while discouraging the spread of fake reviews. We outline the methodology, describe the data set used for training and evaluation, and present the results of our experiments. The findings provide insight into the effectiveness of sentiment analysis coupled with the SVM and Logistic Regression algorithms in detecting fake reviews on Amazon, contributing to the ongoing efforts to combat review manipulation and ensure the authenticity of Amazon books reviews.

II. LITERATURE SURVEY

The study in [1] used three different data sets (the Clothing, Shoes and Jewelry reviews dataset) and applied Logistic Regression (LR), Decision Tree, Naive Bayes (NB), and Support Vector Machine (SVM) methods, as well as text classification algorithms. Logistic Regression achieved higher accuracy than all the other models on the three data sets.

[2] Due to its context-dependent character and the usage of irony or contradictory terms to convey the opposite of the literal meaning, sarcasm identification can be a difficult task in natural language processing. To accurately identify sarcasm in product reviews, the article suggests employing a variety of machine learning methods, including supervised learning, natural language processing, or even deep learning models. Comparisons with other existing techniques and accuracy metrics may be used to assess the proposed methodology. They offered the following model
979-8-3503-0448-0/22/$31.00 ©2023 IEEE
Authorized licensed use limited to: Alliance College of Engineering and Design Bangalore. Downloaded on February 11,2024 at 15:25:22 UTC from IEEE Xplore. Restrictions apply.
2023 6th International Conference on Contemporary Computing and Informatics (IC3I)
suggestions: Lemmatization, stemming, polarity identification, and data tokenization are all parts of the pre-processing of data. Then, features including inverse document frequency, word frequency, and n-grams are extracted. The Support Vector Machine (SVM), K Nearest Neighbors, and Random Forest classification methods are employed. The outcomes are evaluated using accuracy as the metric. The polarity of the statement is then checked after the sarcasm detection process; if necessary, the polarities are labeled, and the sentiment is then examined.

[3] They suggested utilizing supervised learning and the Bidirectional Encoder Representations from Transformers (BERT) model to identify bogus hotel reviews. They fine-tuned the pre-trained BERT model using a publicly accessible data set of hotel reviews, and then trained a supervised learning algorithm to distinguish between real and false reviews. They used standard evaluation metrics to evaluate the performance.

[4] With the aid of language and rating properties, this study seeks to spot phony product reviews. To spot bogus reviews, they used supervised machine learning algorithms, natural language processing methods, and methods for sentiment and opinion mining. By providing the model with good training and using high-quality training and testing data, they were able to determine the veracity of a given review using the aforementioned strategies. The outcome of this experiment demonstrates that the approach is more accurate than earlier efforts.

[5] The article uses a variety of machine learning methods, such as deep learning models, natural language processing, and supervised or unsupervised learning, to automatically monitor and detect fraudulent reviews linked to certain products. The article examines user behavior patterns that can be used to discern between real and false reviews, as well as techniques for extracting pertinent elements from review content. It might also go into detail about how to spot questionable behaviors that may be signs of false reviews, such as spamming reviews or expressing biased opinions. To maintain the credibility and integrity of product review sites, the article also includes techniques for deleting or filtering out fraudulent reviews.

[7] The paper investigates the use of sentiment analysis and other machine learning approaches to identify bogus reviews. The idea of sentiment analysis is then introduced. This natural language processing (NLP) method is used to ascertain the sentiment or emotion portrayed in a text. It outlines typical traits that could be telltale signs of false reviews, like repetitious wording, overuse of superlatives, and a dearth of specific information or personal experiences. Support Vector Machines (SVM), Naive Bayes, and deep learning-based methods like Recurrent Neural Networks (RNNs) and Transformers are some of the machine learning algorithms for sentiment analysis that are compared in the article. The study shows that sentiment analysis employing machine learning approaches holds considerable potential for efficiently identifying and combating fraudulent reviews, helping customers make educated decisions and fostering trust in online review sites.

[11] The article focuses mainly on linguistic patterns, writing style, and sentiment indicators, extracting features especially suited to the review material. To distinguish between authentic and false reviews, characteristics like the frequency of particular terms, sentiment scores, lexical richness, and punctuation usage are extracted from the review text. To model the relationship between the review-centric features and the authenticity labels, they used the binary classification algorithm logistic regression. The labeled dataset is used to train the algorithm to distinguish authentic reviews from false ones using the extracted attributes.

[13] To identify fraudulent internet reviews and assess the effectiveness of the two strategies on a data set containing hotel reviews, they introduced various semi-supervised and supervised text mining models. To enhance classification performance, SVM and Statistical Naive Bayes are utilized as classifiers in their research. They concentrated largely on review-content-based techniques. They made use of word frequency, sentiment polarity, and review length. They found that the supervised Naive Bayes classifier had the highest accuracy when compared to other classifiers. The article discusses the advantages of using semi-supervised learning for scenarios where acquiring a large number of labeled fake reviews might be challenging. It could also explore the effectiveness of combining semi-supervised and supervised learning to improve the accuracy of fake review detection.

[14] They employed SVM, one of the most popular supervised learning algorithms, which is used for both classification and regression problems. However, it is most frequently applied to classification problems in machine learning. The effectiveness of the suggested model was evaluated using a dataset of 6186 Amazon reviews split into a training set (4948 reviews) and a test set (1238 reviews), which is roughly an 80/20 split. The model achieves 80.4 percent accuracy, 80.8 percent precision, 80 percent recall, and an 80 percent F1-score. Comparing the proposed methodology to the previously discussed algorithm, accuracy has increased by 20 percent.

[15] The paper discusses the efficacy of applying NLP methods to identify crucial linguistic aspects as well as how neural networks may generalize and learn from vast volumes of review data. It also looks at the process’s potential pitfalls, like data imbalance, noisy evaluations, or overfitting.

III. OBJECTIVES
• To analyze and evaluate different machine learning approaches to identify which ones are effective in distinguishing between genuine and fake reviews on Amazon’s product platform.

• Train a machine learning model using labeled datasets consisting of both genuine and fake reviews.

• Using the results of the machine learning model, categorize reviews as real or fraudulent.

• Assess the performance of the fake review detection system using appropriate evaluation metrics and measure its precision, recall, and overall accuracy.

IV. EXISTING WORK

Because reviews are such a powerful tool, it is unsurprising that many unethical practices take place in the review industry. Various efforts have been made to identify and fully understand these behaviors, which are generally referred to as opinion spam.

Jindal and Liu made the first effort to identify false reviews. They analyzed internet reviews in three categories (false opinions, seller/brand-only reviews, and non-reviews), using near-duplicate content as a sign of phony reviews, and were the first to identify the issue of opinion spam. Another study that looked at the discovery of review-level spam used reviewer and review feature combinations with expressive text features. The authors of one study used Amazon Mechanical Turk to create fabricated hotel evaluations, whereas Jindal and Liu used data from Amazon as their source and used content duplication as their ground truth. Both engaged in feature-based review work. The significance of brands was briefly discussed by Jindal et al. and Li et al. Machine-learning-based sentiment classification is commonly framed as a supervised learning problem with three classes: negative, neutral, and positive. Reviews are frequently the source of testing and training data in the literature (Liu and Zhang, 2012). The detection and elimination of unfair reviews are of utmost importance (Jindal and Liu, 2008). Moraes et al. (2013) provided a method to categorize the textual reviews of a specific topic.

considering the threshold value. Initially we take the threshold value as 0. If the sentiment score is greater than the threshold, we treat the review as having positive sentiment and assign the label 1; if the sentiment score is less than the threshold, we treat it as having negative sentiment and assign the label 0. We then use this labelled dataset to implement our methods. We describe the steps for cleaning and pre-processing the collected data, such as removing noise, standardizing formats, and handling missing values, and we discuss several feature extraction methods, such as count vectorizer and TF-IDF. To identify false reviews in the obtained data sets, we use Support Vector Machine and Logistic Regression. For SVM, we describe its mathematical formulation and how it works, explain the process of training an SVM model using the pre-processed review data and sentiment features, and discuss techniques for hyperparameter tuning and model evaluation. For LR, we introduce it as a statistical modeling technique for binary classification, explain the logistic function and how it relates to the probability of a review being fake or genuine, describe the process of training an LR model using the pre-processed review data and sentiment features, and discuss techniques for feature selection, regularization, and model evaluation. Finally, we present the results of applying the SVM and LR models to the data set.

Figure 1. Architecture Diagram

VI. METHODS

Data collection: Gathering a data set of Amazon books reviews, including both genuine and fake reviews.
• True Neutral Reviews (TNR): True Neutral Reviews are based on the testing data and are the number of phrases that the classification model accurately predicted as Neutral.

The F-Measure demonstrates an equilibrium between precision and recall:

F-Measure = (2pr)/(p + r)    (5)

where p is the precision and r is the recall.
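To make equation (5) concrete, the short Python sketch below computes the F-Measure from a precision and recall pair. The function name and the zero-division guard are illustrative assumptions rather than part of the paper. The example values are the precision (80.8 percent) and recall (80 percent) reported for the model surveyed in [14].

```python
def f_measure(p, r):
    """Harmonic mean of precision p and recall r, as in equation (5)."""
    if p + r == 0:
        return 0.0  # assumption: define the F-Measure as 0 when p = r = 0
    return 2 * p * r / (p + r)


# Precision and recall reported for the SVM model surveyed in [14].
f_reported = f_measure(0.808, 0.80)

# When precision and recall are equal, the F-Measure equals both.
f_equal = f_measure(0.8, 0.8)
```

With these inputs the F-Measure comes out at roughly 0.80, which agrees with the 80 percent F1-score quoted for that model.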
Figure 9. Graph
and renamed the dataset as amazon book labelled reviews, which consists of 48747 labelled reviews. On this labelled dataset we ran the machine learning algorithms and computed the performance metrics Recall, Precision, F-Measure, and Accuracy for each algorithm. Our experiments examined the accuracy of the sentiment classification algorithms in order to determine which algorithm is more accurate. Furthermore, we were able to detect fake positive reviews and fake negative reviews through the detection processes. Of the two algorithms used, Logistic Regression (LR) achieves higher accuracy than Support Vector Machine (SVM), which has an accuracy of 86 percent in detecting fake reviews of Amazon products. In future work we will aim to improve the accuracy of the proposed methodology by combining the two methods proposed here.

REFERENCES

[15] model", "IJSRD", Vol. 11, No. 3, Pages: 280–283, 2023.
online reviews using semi-supervised and supervised learning." Turkish Journal of Computer and Mathematics Education, Volume 14, Issue 2, Pages: 107–118, 2023.