
Sustainable Operations and Computers 3 (2022) 238–248


Analytics of machine learning-based algorithms for text classification


Sayar Ul Hassan a, Jameel Ahamed a,∗, Khaleel Ahmad a
a Department of Computer Science & Information Technology, Maulana Azad National Urdu University, Hyderabad, Telangana, India

Abstract

Text classification is the most vital area in natural language processing, in which text data is automatically sorted into a predefined set of classes. The applications of text classification are wide in commercial work, such as spam filtering, decision making, extracting information from raw data, and many other applications. Text classification is all the more significant for many enterprises since it eliminates the need for manual data classification, a more expensive and time-consuming mechanism. In this paper, a comparative analysis of text classification is done in which the efficiency of different machine learning algorithms on different datasets is analyzed and compared. Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), Logistic Regression (LR), Multinomial Naïve Bayes (MNB), and Random Forest (RF) are the machine learning-based algorithms used in this work. Two different datasets are used to make a comparative analysis of these algorithms. This paper further analyzes the machine learning techniques employed for text classification on the basis of performance metrics, viz accuracy, precision, recall, and F1-score. The results reveal that Logistic Regression and Support Vector Machine outperform the other models on the IMDB dataset, and k-NN outperforms the other models on the SPAM dataset, as per the results obtained from the proposed system.

1. Introduction

Nowadays, industries benefit greatly from developing automatic systems for extracting usable structured data from unstructured text sources. Researchers and industry professionals could perform reasonably easy queries to retrieve all information related to industrial work using a structured resource [1]. We can use these machine learning classifiers in the field of the environment. Suppose the data related to sustainable development and climate change is collected from different sources. In that case, different machine learning techniques can be applied to that data so that domain knowledge can be extracted from it. This will help in different fields, such as making decisions about the future, and will also give an idea of how we should sustainably use available resources. We can also make people aware of climate change problems and publish the resulting data on different platforms to raise awareness of climate change and sustainable development.

Text analysis is one of the important aspects of extracting the desired information. Text classification is classifying text into different classes based on the text domain. It is a fundamental process in natural language processing for which tools are available for classifying textual data. Automatic text classification has always been a critical application and research topic since the inception of digital documents. Today, text classification is much required due to the massive amount of text documents generated daily worldwide [2]. Textual analytics translates text into numbers, giving structured data and making it easier to spot trends. The more structured the data, the better the analysis will be, and eventually, the better the decisions [3]. Machine learning (ML) is employed for this purpose; it is a branch of artificial intelligence (AI) that allows computers to operate and learn even when they are not explicitly programmed [4]. In this study, different selected machine learning techniques are used for text classification. Besides these techniques, there are various approaches for text classification, but most of them cannot classify text data as accurately as machine learning techniques, which give more effective results [3]. Even though several efficient text classification approaches have been developed, text classification remains a difficult subject with a lot of room for improvement in terms of efficiency [5]. However, organizations and enterprises use text documents to keep track of their industrial and government services [4,6]. In the text classification system, the classifier is the main part; the classifier's performance quality is directly related to the efficiency and effect of text classification. Most of the classifiers are based on methods from information retrieval and the machine learning algorithms that are introduced for text classification purposes [7]. A good text classifier, though, would work efficiently for large training data sets with several features [8]. Because of the high dimensionality and existence of noise in features, it is crucial to choose only the most critical features in the case of text categorization [9].

A comparative analysis is done in this paper, based on text classification employing machine learning techniques on different datasets. The problem is that the manual process of classifying text data is tedious and very time-consuming [6]. Therefore, it is very important to automate the process and enhance data-driven decisions [10,11]. In this research, the machine learning algorithms are applied and compared for best performance on different datasets [12]. The documents in the text classification model are passed through different steps, viz (i) convert the main document into plain text, lowercase the full document, remove stop words and other words which are not useful, and reduce different inflected words to a single root word using stemming and lemmatization, and (ii) select the data for training and testing, build the classifier, and then deploy the classifier on different datasets [2,13,27].

∗ Corresponding author.
E-mail address: [email protected] (J. Ahamed).

https://doi.org/10.1016/j.susoc.2022.03.001
Received 24 July 2021; Received in revised form 20 February 2022; Accepted 25 March 2022
Available online 1 April 2022
2666-4127/© 2022 The Author(s). Published by Elsevier B.V. on behalf of KeAi Communications Co., Ltd. This is an open access article under the CC BY-NC-ND
license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Further, machine learning techniques can also be applied to classification problems to measure the perceptions about the COVID-19 pandemic and identify misconceptions among the population, which will help inform the public health organizations; better methods can then be created to educate the public and make them understand not to fall for these misconceptions [14]. Machine learning techniques can also be used to tackle the SARS-CoV-2 crisis in the fields of diagnosis, disease progression, and epidemiology [15].

Machine learning is one of the most important technical developments allowing Industry 4.0 to take hold in businesses and industries. The introduction of AI and machine learning to Industry 4.0 marks a significant shift for manufacturing organizations, potentially resulting in new business prospects and benefits such as increased productivity. Artificial Intelligence, Machine Learning, and Deep Learning have also been widely used in areas like healthcare, finance, and smart factories as part of Industry 4.0 [16-19].

2. Related work

Text classification is one of the main tasks in Natural Language Processing (NLP) [6,20]. Due to the fast growth of Internet applications, a huge increment in online texts leads to improved automated text mining classifiers that can automatically organize and classify documents. Many machine learning algorithms have been applied to build an automatic text classifier by training on a set of classified training documents [21,22,23]. Several text classification models have been built for Urdu, English, French, Chinese, and many other languages [24,25,26]. The support vector machine (SVM) is one of the supervised machine learning models that uses classification algorithms for two-group classification problems [28]. A number of text classifiers used in text mining are compared in this work [8]. Usually, supervised and unsupervised are the two categories of classifiers used for text classification. In text classification, where the task is to train on "unknown" NLP text [6], the most significant element of collecting information is automatically categorizing a batch of documents into categories (or classes, or subjects) from a specified set [29]. Different tools and methods are derived from the domain, which has several applications in text classification [30]. SVM has also been compared to other algorithms, and SVM outperforms the others in various studies [12,29]. Different classifiers are viewed and analyzed to determine which category a document belongs to. We can classify non-linear data using kernel functions in order to classify data with greater dimensions [7,31]. The Support Vector Machine gives high performance but lower recall, which is one of the limitations of using the Support Vector Machine [31].
Table 1
Systematic literature review.

Title: Efficient English Text Classification Using Selected Machine Learning Techniques [6]. Author (year): Xiaoyu Luo (2021). Methodologies: SVM, Naïve Bayes, Logistic Regression. Findings: Precision, recall, and F1-value are calculated for the evaluation of the classifiers; SVM outperforms the others on two datasets and Logistic Regression outperforms on one dataset.

Title: Restaurant Review Classification and Analysis [52]. Authors (year): Dhiraj Kumar, Gopesh, Avinash Choubey, Pratibha Singh (2020). Methodologies: Naïve Bayes, Multinomial Naïve Bayes, Logistic Regression. Findings: The Multinomial Naïve Bayes technique performs better than the other algorithms on the precision, recall, and F1-score evaluation metrics.

Title: Benchmark Performance of Machine Learning and Deep Learning Based Methodologies for Urdu Text Document Classification [53]. Authors (year): Muhammad Nabeel Asim, Muhammad Usman Ghani, Muhammad Ali Ibrahim, Waqar Mahmood, Sheraz Ahmad, Andreas Dengel (2020). Methodologies: Naïve Bayes, SVM, TF-IDF vector representation. Findings: SVM outperformed Naïve Bayes using TF-IDF.

Title: Text Classification Using Machine Learning Techniques [21]. Authors (year): Emmanouil K. Ikonomakis, Sotiris Kotsiantis, V. Tampakas (2019). Methodologies: Naïve Bayes, k-Nearest Neighbour, Support Vector Machine. Findings: Classification performance depends on the training text corpora; with a high-quality training corpus, performance will be better.

Title: Comparative Analysis of Machine Learning Algorithms on Different Datasets [4]. Authors (year): Kapil Sethi, Ankit Gupta, Gaurav Gupta, Varun Jaiswal (2017). Methodologies: Neural Network, k-Nearest Neighbor, Support Vector Machine. Findings: SVM outperforms the other algorithms, and the model is useful in medication, governmental issues, and other fields.

Title: A Study of Text Classification Natural Language Processing Algorithms for Indian Languages [54]. Authors (year): Jasleen Kaur, Jatinder Kumar R. Saini (2015). Methodologies: Naïve Bayes, SVM, Artificial Neural Network, N-gram. Findings: Supervised machine learning algorithms outperformed unsupervised ML algorithms for Indian languages.

Title: Urdu Word Sense Disambiguation Using Machine Learning Approach [55]. Authors (year): Muhammad Abid, Asad Habib, Jawab Shahid, Jawad Ashraf (2017). Methodologies: Bayes Net classifier, SVM, Decision Tree. Findings: Bayes Net outperforms the other algorithms.

Title: Text Classification Using Machine Learning Methods: A Survey [56]. Authors (year): Basant Agarwal, Namita Mithal (2016). Methodologies: Naïve Bayes, SVM, KNN, Decision Tree. Findings: SVM provides good performance for textual documents that belong to a particular category, but not for multiclass classification.


Fig. 1. Flow diagram of proposed model.

In k-Nearest Neighbor (k-NN), the majority voting method is used to classify the instance correctly [32,33,34]. Hence, this is a newly introduced method that uses less text data for testing and large text data for training, giving the best performance and effective results [33]. It is also evident that the main focus of the machine learning model is to learn automatically and improve the model's efficiency based on experience [35]. The proposed classification model comprises three major modules, viz preprocessing the raw data, employing machine learning, and a final model for classification [36]. This model learns from past data or experience to improve its performance [37]. With the introduction of digital documents, automatic text classification became an important area of research [38]. Among the machine learning techniques, the SVM classifier obtains better results in most classification applications, specifically for disease identification and face recognition [39]. Further, three machine learning algorithms, Random Forest, kNN, and Naïve Bayes, were applied to chronic kidney disease prediction, and the random forest was proven to have the best results [40,41]. Accurate predictions and better generalizations can be achieved using random sampling and ensemble strategies [11]. Machine learning techniques can also play a vital role in analyzing diseases from medical records in the medical sector. They also helped during the COVID-19 pandemic to detect different aspects such as perceptions and misconceptions among the general public [14]. Machine learning techniques are also used for drug discovery, which is increasing exponentially day by day; it is easy to analyze the previous data on drugs that have been utilized to predict the new requirements [42]. It is important for healthcare organizations and health officers to understand the view of the general public about what causes them anxiety, stress, and trauma, and then make better policies and better treatments based on the available data, using machine learning techniques [43]. Nowadays, any news or information spreads swiftly on many social networking sites, and there is no way of knowing whether it is authentic or true, even if we trust it. Most individuals are using this platform as a weapon to manipulate public opinion for political, religious, or other causes, but we can use machine learning algorithms to determine whether the news or information is authentic or deceptive propaganda [44,45]. Extracting the emotions, opinions, and attitudes from the text data available on social networking sites can also be done using machine learning algorithms [46,47]. Different data preprocessing techniques are also essential for models to give better results with good performance [48]. TF-IDF is a statistical measure for extracting meaningful information from text input, but it is ineffective for unbalanced distributions; however, an upgraded version of TF-IDF may be applied to improve the model's power and robustness [49]. There are a variety of additional ways of classifying text data, such as a caps-net based multitask learning architecture for text classification, but when compared to machine learning approaches on the same problem, the results are substantially better utilising machine learning techniques [50]. Following the release of new COVID-19 variants, the entire world is experiencing healthcare issues, making it harder to collect health care records to analyse current and future needs of the general public using various machine learning approaches [51]. Furthermore, the systematic literature review with regard to the effectiveness of machine learning for text classification is depicted in Table 1.

3. Proposed system and methodology

The methodologies used in this research work are based on the machine learning techniques viz Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), Logistic Regression (LR), Multinomial Naïve Bayes (MNB), and Random Forest (RF). The ML-based classification models are compared on different datasets in terms of the accuracy of each model.


Table 2
Summary of machine learning algorithms.

Support Vector Machine (SVM). Strengths: higher-order data can be classified by using kernel functions. Weaknesses: not suitable for large datasets; difficult to choose a kernel function. Applications: handwriting recognition, text and hypertext categorization, classification of images.

k-Nearest Neighbor (KNN). Strengths: robust to noisy data; easy to implement; effective results for a large amount of training data; takes no time to learn. Weaknesses: computation cost is high; finding the value of K is difficult. Applications: health care, segmentation, customer service, fraud detection.

Multinomial Naïve Bayes (MNB). Strengths: easy to implement; better results obtained for most classes; small amount of training data required. Weaknesses: probabilities are not accurate; interaction between features cannot be captured. Applications: real-time prediction, spam and ham filtering, sentiment analysis.

Logistic Regression (LR). Strengths: quick to train; works well for categorical data; simple parameter estimation; better for linear data. Weaknesses: not suited to non-linear data; requires a large sample size. Applications: medicine, text editing, hotel booking, financial forecasting.

Random Forest (RF). Strengths: better for large datasets; experimental method for detecting variable interaction. Weaknesses: complex for multiple-valued and uncertain attributes; requires more computational power. Applications: banking sector, healthcare sector, customer intelligence, marketing data.

Before developing the classification model, different techniques are used for preprocessing the input data, and the preprocessed data is then used for training and testing purposes [57,58]. A portion of the data is taken for training and the remainder is used for testing; how the data is divided between training and testing depends on the technique used in training [36,59]. The flow of this machine learning-based text classification is shown in Fig. 1.
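To make the flow in Fig. 1 concrete, the following is a minimal sketch of the two stages described above. The paper does not name its toolkit, so scikit-learn is assumed here, and the tiny in-memory corpus and the choice of Logistic Regression are placeholders rather than the authors' actual setup.

```python
# Minimal sketch of the Fig. 1 flow, assuming scikit-learn; the toy corpus
# and the chosen classifier stand in for the real IMDB/SPAM data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["a great movie", "terrible plot", "wonderful acting", "boring film"]
labels = ["pos", "neg", "pos", "neg"]

# (i) preprocessing + vectorization: lowercasing and stop-word removal are
# handled by the vectorizer; stemming/lemmatization would be added upstream.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(texts)

# (ii) split the data, train the classifier, and evaluate it.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```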
3.1. Machine learning techniques

As we all know, text data is increasing exponentially. Hence, it is not easy to classify it manually, so it is desired to find feasible ways to classify a large amount of data in a short period. The data generated after classification is called information, and this information is then used for future planning of business and industrial applications. In this work, different machine learning algorithms are proposed to be used for text classification, as mentioned in Table 2. It is necessary to determine which machine learning algorithm will provide high accuracy on which dataset type. This comparative analysis will examine the efficiency of various machine learning algorithms and then determine which algorithm is better for which type of data, as we know that different machine learning algorithms classify text data differently. As a result, determining which approach is suitable for a specific dataset type is critical. The detailed definitions of all the applied machine learning techniques are given in the next section.

3.1.1. Support vector machine

The Support Vector Machine is a machine learning technique that can be used for both regression and classification, but it is best suited to classification problems [20,22]. It can classify linear data with the help of the Maximum Margin Hyperplane (MMH), in which the distance is maximum between the data points called support vectors. The two parallel lines separating the data are called the positive and negative hyperplanes, and several of them can be drawn [60]. For non-linear data, a kernel function can be used to form the multi-dimensional hyperplane for classification. Multiple kernel functions are available for classification purposes; researchers have used kernel functions like the String Subsequence Kernel (SSK) and Approximating Kernels (AK), and these two kernels produce a classifier that can classify text data with high accuracy [61]. The support vector machine is computationally effective, with some limitations which reduce its performance for small datasets [31]. There are two types of data classification using SVM: (i) linear data classification and (ii) non-linear data classification.

(i) Linear data classification. To classify linear data using SVM, the Maximum Margin Hyperplane (MMH) is used to separate the two classes of data points; many hyperplanes can be drawn, and it is desired to find the one having the largest distance between the vector points that will accurately classify the data points, as shown in Fig. 2. As depicted in Fig. 2, there are positive and negative hyperplanes, where the positive hyperplane is drawn on the positive data point side and the negative hyperplane is drawn on the negative data point side [62]. It is better to draw the hyperplanes in such a way as to get the maximum margin between the positive and negative hyperplanes.

(ii) Non-linear data classification. The Support Vector Machine (SVM) can also classify non-linear data with the help of a kernel function. It transforms the data to higher dimensions to make the classification, as shown in Fig. 3. There are different types of kernel functions available that can be used for classification purposes [29]. This method has to find the proper kernel function to classify the data points appropriately. When the kernel function is used to classify the data points, it transforms one class of data to a higher dimension, and the decision surface is obtained to classify the data points.
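As an illustration of the two cases, the sketch below fits a linear-kernel and an RBF-kernel SVM on synthetic two-dimensional data; scikit-learn is assumed, and the data is invented purely for demonstration, not taken from the paper.

```python
# Illustrative sketch of linear vs. kernel-based SVM classification,
# assuming scikit-learn; the two-feature synthetic data is a placeholder.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_linear = (X[:, 0] + X[:, 1] > 0).astype(int)                # separable by a line
y_nonlinear = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circular boundary

# A linear kernel fits a maximum-margin hyperplane directly.
linear_svm = SVC(kernel="linear").fit(X, y_linear)

# An RBF kernel implicitly maps the points to a higher-dimensional space,
# where the circular boundary becomes approximately linearly separable.
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y_nonlinear)

print("linear-kernel accuracy:", linear_svm.score(X, y_linear))
print("RBF-kernel accuracy:", rbf_svm.score(X, y_nonlinear))
```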

3.1.2. K-nearest neighbors classifier (KNN)

The k-nearest neighbor algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems [57,63]. This algorithm finds the similarity between the available data and new data, and the new data is classified into the category with the most similarity [64]. The value of K is difficult to analyze, so the classification time of k-NN is longer [33]. It is also called a lazy learner algorithm because it does not learn from the training data immediately, but acts at the time of classification, as shown in Fig. 4 [27].
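A minimal sketch of this majority-voting behaviour follows, assuming scikit-learn; the toy messages and the value k = 3 are illustrative only and are not the paper's settings.

```python
# Sketch of k-NN majority voting over TF-IDF vectors, assuming scikit-learn;
# the tiny corpus and k value are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["win a free prize now", "meeting at noon",
               "free cash offer", "see you tomorrow"]
train_labels = ["spam", "ham", "spam", "ham"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)

# With k=3, each new message receives the majority label of its
# three nearest training vectors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, train_labels)
print(knn.predict(vec.transform(["free prize waiting"])))
```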

241
S.U. Hassan, J. Ahamed and K. Ahmad Sustainable Operations and Computers 3 (2022) 238–248

Fig. 2. Linear data classification using SVM.

Fig. 3. Non-linear data classification using SVM.

242
S.U. Hassan, J. Ahamed and K. Ahmad Sustainable Operations and Computers 3 (2022) 238–248

Fig. 4. Working of k-NN.

3.1.3. Multinomial Naïve Bayes (MNB)

The MNB classification algorithm is used to classify discrete features (e.g., word frequencies for text classification) [65]. The multinomial distribution requires integer feature counts, but in practice fractional values such as TF-IDF may also work [66]. The bag-of-words approach is used, in which each word constitutes a feature and word order is not important. Naïve Bayes is based on Bayes' rule of conditional probability [67], and MNB is mathematically defined by the equation below:

P(h|x) = P(x|h) P(h) / P(x)

where h is the hypothesis and x is the attribute.
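A small sketch of this formulation, assuming scikit-learn; the example documents are placeholders, and MultinomialNB internally applies the Bayes rule stated above to bag-of-words counts.

```python
# Sketch of Multinomial Naive Bayes over bag-of-words counts, assuming
# scikit-learn; the example documents are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills online", "project deadline tomorrow",
        "cheap offer online", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]

# Word order is ignored: each document becomes a vector of word counts.
vec = CountVectorizer()
X = vec.fit_transform(docs)

# MultinomialNB applies Bayes' rule, P(h|x) proportional to P(x|h)P(h),
# with a multinomial likelihood over the word counts.
mnb = MultinomialNB().fit(X, labels)
print(mnb.predict(vec.transform(["cheap online deal"])))
print(mnb.predict_proba(vec.transform(["cheap online deal"])))
```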
3.1.4. Logistic regression (LR)

Logistic regression is a supervised machine learning algorithm used for classification purposes. It is used when the data is binary, i.e., 0 and 1, meaning that the class is from one category or another. We can use two functions for binary values, viz the logistic function and the sigmoid function [10]. Logistic regression, also termed a classification algorithm, is shown in Fig. 5 [64]. Logistic regression can be classified based on the number of categories, as given below.

(i) Binomial: Only two types of values are possible in the target variable: "0" or "1", which can represent "loss" versus "win," "fail" versus "pass," "alive" versus "dead," etc.
(ii) Multinomial: Three or more types are possible in target variables that are not ordered (i.e., the types have no quantitative significance), like "virus A" versus "virus B" versus "virus C."
(iii) Ordinal: Ordered categories in a target variable; for example, an assessment score can be categorized as "very good," "good," "poor," and "very poor." Here, each category can be given a score like 0, 1, 2, 3, or vice versa.
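The binomial and multinomial cases listed above can be sketched as follows, assuming scikit-learn; the synthetic features and targets are placeholders, not the paper's data.

```python
# Sketch of logistic regression for binary and multinomial targets,
# assuming scikit-learn; the synthetic data is a placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))

# Binomial case: a 0/1 target modelled through the sigmoid (logistic) function.
y_binary = (X[:, 0] - X[:, 1] > 0).astype(int)
binary_model = LogisticRegression().fit(X, y_binary)

# Multinomial case: three unordered classes (e.g. "virus A"/"virus B"/"virus C");
# scikit-learn handles the multi-class setting automatically.
y_multi = np.digitize(X[:, 2], bins=[-0.5, 0.5])  # classes 0, 1 or 2
multi_model = LogisticRegression(max_iter=1000).fit(X, y_multi)

print(binary_model.predict_proba(X[:2]))
print(multi_model.predict(X[:2]))
```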
3.1.5. Random forest

Several classification algorithms exist, but the random forest (Fig. 6) is one of the best classification algorithms in machine learning. It can also be used as a regression technique but is mainly used for classification because of its diversity and simplicity. It is a combination of learning models, which improves the final result [60,63,68]. Many trees are combined to make a random forest in this machine learning technique, and we obtain higher accuracy if the trees are more uncorrelated [10]. Missing values can be filled using random forest [11]. Further, decision tree classifiers are popular for their outstanding performance, and since a random forest is a collection of decision trees, it becomes more robust and more powerful. A simple decision tree used for classification problems gives good results with high accuracy [69].

4. Results

This section looks at the results of the distinct machine learning algorithms that were applied to two separate datasets. Each algorithm was applied separately to determine its efficiency using various performance indicators such as accuracy, precision, recall, and F1-score. We first need to understand the basic building blocks of these evaluation measures: true positive, true negative, false positive, and false negative.

True positive: a positive class is correctly predicted by the model or classifier. It is represented by TP.
True negative: the model or classifier correctly predicts a negative class. It is represented by TN.
False positive: the model or classifier incorrectly predicts a positive class. It is represented by FP.
False negative: the model or classifier incorrectly predicts a negative class. It is represented by FN.

Accuracy: it is one of the evaluation measures of a machine learning model, indicating how accurately the classifier classifies the data. We calculate accuracy using this formula:

Accuracy = (TP + TN) / (P + N)
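As a quick numeric check of this formula, the snippet below computes accuracy from hypothetical confusion-matrix counts; the numbers are illustrative and are not taken from the paper's experiments.

```python
# Worked example of the accuracy formula with hypothetical counts
# (illustrative numbers, not results from the paper).
tp, tn, fp, fn = 90, 85, 10, 15      # true/false positives and negatives
p, n = tp + fn, tn + fp              # total actual positives and negatives

accuracy = (tp + tn) / (p + n)
print(accuracy)                      # (90 + 85) / (105 + 95) = 0.875
```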


Fig. 5. Logistic regression.

Fig. 6. Classification using random forest.


Fig. 7. Graphical representation of sentiment attributes.

Precision: it tells us how exact our model is (how many of the identified positive classes were correct). We calculate precision using this formula:

Precision = TP / (TP + FP)

Recall: it tells us how complete our model is (how many actual positives were identified correctly). We calculate recall using this formula:

Recall = TP / (TP + FN)

F1-score: the harmonic mean of precision and recall gives a balanced result of precision and recall. We calculate the F1-score using this formula:

F1-score = 2 * (Precision * Recall) / (Precision + Recall)

Table 3
Comparative analysis of ML algorithms.

Algorithm (Dataset) | Accuracy | Precision | Recall | F1-score
SVM (IMDB)  | 85.5 | 85 | 87 | 86
SVM (SPAM)  | 95.5 | 96 | 96 | 95
kNN (IMDB)  | 50.8 | 50 | 72 | 59
kNN (SPAM)  | 98.5 | 99 | 99 | 98
MNB (IMDB)  | 84.4 | 85 | 87 | 86
MNB (SPAM)  | 97.4 | 98 | 97 | 97
RF (IMDB)   | 74.9 | 72 | 81 | 77
RF (SPAM)   | 96.5 | 97 | 97 | 96
LR (IMDB)   | 85.8 | 85 | 87 | 86
LR (SPAM)   | 91.9 | 93 | 92 | 90

4.1. Datasets used

We have used two datasets in this work, collected from online repositories. The datasets are analyzed with the different machine learning algorithms to evaluate the efficiency of each algorithm. The description of the datasets is given in the subsections below.

4.1.1. IMDB dataset

This dataset contains reviews of movies available on the Internet, with 50000 records and two attributes: one is a review and the other is a sentiment, as shown in Fig. 7. This dataset has an equal number of positive and negative sentiments, so it is also called a balanced dataset, which means the data is not skewed.

4.1.2. SPAM dataset

This dataset contains ordinary SMS messages with two labels, ham and spam. It has 50572 records with two attributes, label and message, as shown in Fig. 8. In this dataset, we do not have an equal number of spam and ham labels, so it is also called an unbalanced dataset, which means the data is skewed.

For the IMDB dataset, which has an equal number of positive and negative sentiments, we first cleaned the data using different preprocessing steps such as removal of punctuation, stop words, and frequent words, stemming, and lemmatization [70]. After the preprocessing, the text is converted into vectors using the bag-of-words model and the term frequency-inverse document frequency (TF-IDF) model, and finally the algorithms' efficiency is evaluated using the different metric evaluation methods. In the Spam dataset, on the other hand, the number of ham records is greater than the number of spam records, so for this dataset it is compulsory to evaluate the classifiers using precision, recall, and F1-score. Table 3 presents the accuracy, precision, recall, and F1-score of the algorithms on the IMDB and Spam datasets. A graphical representation of the performance of these algorithms gives a clear view of which machine learning algorithm outperforms the others. On the IMDB dataset, SVM and Logistic Regression have 85.5% and 85.8% accuracies, respectively, as shown in Fig. 9.


Fig. 8. Graphical representations of label attributes.
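A sketch of the cleaning and vectorization steps described above follows, assuming scikit-learn; the two sample reviews stand in for IMDB records, and the crude suffix stripper only hints at proper stemming or lemmatization.

```python
# Sketch of the preprocessing and vectorization pipeline, assuming
# scikit-learn; the sample texts and the toy "stemmer" are placeholders.
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def clean(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)                        # remove punctuation
    words = [w for w in text.split() if len(w) > 2]             # drop very short tokens
    words = [w[:-1] if w.endswith("s") else w for w in words]   # toy "stemming"
    return " ".join(words)

reviews = ["This movie was wonderful!", "Worst plot, boring scenes..."]
cleaned = [clean(r) for r in reviews]

bow = CountVectorizer(stop_words="english").fit_transform(cleaned)    # bag of words
tfidf = TfidfVectorizer(stop_words="english").fit_transform(cleaned)  # TF-IDF weights
print(bow.toarray())
print(tfidf.toarray())
```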

Fig. 9. Accuracy of selected ML algorithms on IMDB dataset.


Fig. 10. Accuracy of selected ML algorithms on Spam dataset.
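Charts like Figs. 9 and 10 can be reproduced from Table 3; the sketch below assumes matplotlib, which the paper does not name, and uses only the accuracy values listed in Table 3.

```python
# Sketch of a grouped bar chart like Figs. 9 and 10, using the accuracy
# values from Table 3; matplotlib is assumed.
import matplotlib.pyplot as plt
import numpy as np

algorithms = ["SVM", "kNN", "MNB", "RF", "LR"]
imdb_acc = [85.5, 50.8, 84.4, 74.9, 85.8]
spam_acc = [95.5, 98.5, 97.4, 96.5, 91.9]

x = np.arange(len(algorithms))
plt.bar(x - 0.2, imdb_acc, width=0.4, label="IMDB")
plt.bar(x + 0.2, spam_acc, width=0.4, label="SPAM")
plt.xticks(x, algorithms)
plt.ylabel("Accuracy (%)")
plt.legend()
plt.show()
```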

On the Spam dataset, the k-nearest neighbor outperforms the other classifiers with an accuracy of 98.5%, while the remaining algorithms have broadly similar accuracies: the support vector machine reaches 95.5%, multinomial naïve Bayes 97.4%, and random forest 96.5%. Further, Logistic Regression has 91.9% accuracy, which is the least among all classifiers, as shown in Fig. 10. The results on the various machine learning performance metrics are given in Table 3.

5. Limitations and future work

In the future, this research can be expanded to include more algorithms with hyperparameter tuning and ensemble approaches. To support effective information discovery, the models can also be implemented with novel strategies for parameter optimization. In the area of text classification, streaming data processing has been rather underexplored and needs to be examined closely. As a result, if used correctly, ensemble and calibrated approaches will benefit text classification.

6. Conclusion

The most significant part of natural language processing is text classification, which automatically categorizes text data into a set of desirable categories. Machine learning-based techniques are essential for text classification. Therefore, this study uses five algorithms, Support Vector Machine, k-Nearest Neighbor, Logistic Regression, Multinomial Naïve Bayes, and Random Forest, and two datasets, IMDB and Spam. The results reveal that, out of the developed models, the k-NN model outperforms the other models on the Spam dataset with an accuracy of 98.5%. In contrast, the LR model surpasses the other models on the IMDB dataset with an accuracy of 85.8% using the proposed system.

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Sayar Ul Hassan: Conceptualization, Methodology, Software, Visualization. Jameel Ahamed: Data curation, Writing – original draft, Supervision. Khaleel Ahmad: Writing – review & editing.

References

[1] S.P. Nayat, L. Marti, C.B. Garcia, Text classification techniques in oil industry applications, Adv. Intell. Syst. Comput. 239 (January) (2014) v–vi, doi:10.1007/978-3-319-01854-6.
[2] M. Ikonomakis, S. Kotsiantis, V. Tampakas, Text classification using machine learning techniques, WSEAS Trans. Comput. 4 (8) (2005) 966–974, doi:10.11499/sicejl1962.38.456.
[3] A. Wilkinson, N. Wenger, L.R. Shugarman, Literature review on advance directives, US Department of Health and Human Services, Washington, DC, 2007.
[4] E. Uysal, A. Ozturk, Comparison of machine learning algorithms on different datasets, in: 26th IEEE Signal Processing and Communications Applications Conference, SIU 2018, No. ICIC 2017, 2018, pp. 1–4, doi:10.1109/SIU.2018.8404193.
[5] J. Wang, Y. Li, J. Shan, J. Bao, C. Zong, L. Zhao, Large-scale text classification using scope-based convolutional neural network–A deep learning approach, IEEE Access 7 (2019) 171548–171558, doi:10.1109/ACCESS.2019.2955924.
[6] X. Luo, Efficient English text classification using selected machine learning techniques, Alex. Eng. J. 60 (3) (2021) 3401–3409, doi:10.1016/j.aej.2021.02.009.
[7] L. Wei, B. Wei, B. Wang, Text classification using support vector machine with mixture of kernel, Journal of Software Engineering and Applications 5 (2012) 55.
[8] C.N. Kamath, S.S. Bukhari, A. Dengel, Comparative study between traditional machine learning and deep learning approaches for text classification, in: Proceedings of the ACM Symposium on Document Engineering 2018, 2018, pp. 1–11.
[9] M. Trivedi, S. Sharma, N. Soni, S. Nair, Comparison of text classification algorithms, International Journal of Engineering Research & Technology (IJERT) 4 (02) (2015).
[10] A. Mohi, U. Din, K. Syed, T. Rabani, Q. Rayees, Machine learning based approaches for detecting COVID-19 using clinical text data, Int. J. Inf. Technol. 12 (3) (2020) 731–739, doi:10.1007/s41870-020-00495-9.
[11] D. Mahesh Matta, M. Kumar Saraf, S. Memeti, Prediction of COVID-19 using machine learning techniques, 2020.
[12] C.C. Aggarwal, C.X. Zhai, Mining Text Data, 2013.
[13] A. Sarkar, S. Chatterjee, W. Das, D. Datta, Text classification using support vector machine, International Journal of Engineering Science Invention 4 (11) (2015) 33–37.
[14] M. Gupta, A. Bansal, B. Jain, J. Rochelle, A. Oak, M.S. Jalali, Whether the weather will help us weather the COVID-19 pandemic–Using machine learning to measure Twitter users' perceptions, Int. J. Med. Inform. 145 (2021) 104340, doi:10.1016/j.ijmedinf.2020.104340.


[15] H.B. Syeda, M. Syed, K.W. Sexton, S. Syed, S. Begum, F. Syed, ..., F. Yu Jr., Role of machine learning techniques to tackle the COVID-19 crisis: systematic review, JMIR Medical Informatics 9 (1) (2021) e23811.
[16] D. Nagar, S. Raghav, A. Bhardwaj, R. Kumar, P. Lata Singh, R. Sindhwani, Machine learning–Best way to sustain the supply chain in the era of Industry 4.0, Mater. Today Proc. 47 (2021) 3676–3682, doi:10.1016/j.matpr.2021.01.267.
[17] M.A. Kadampur, S. Al Riyaee, Skin cancer detection–Applying a deep learning based model driven architecture in the cloud for classifying dermal cell images, Inform. Med. Unlocked 18 (2020) 100282, doi:10.1016/j.imu.2019.100282.
[18] N.F. Hordri, S.S. Yuhaniz, N.F.M. Azmi, S.M. Shamsuddin, Handling class imbalance in credit card fraud using resampling methods, Int. J. Adv. Comput. Sci. Appl. 9 (11) (2018) 390–396, doi:10.14569/ijacsa.2018.091155.
[19] K. Crowston, F. Bolici, Impacts of machine learning on work, Proc. Annu. Hawaii Int. Conf. Syst. Sci. 2019-January (2019) 5961–5970, doi:10.24251/hicss.2019.719.
[20] B.S. Singh, S.A. Nayyar, A review paper on algorithms used for text classification, International Journal of Application or Innovation in Engineering & Management (IJAIEM) 2 (3) (2013).
[21] M. Ikonomakis, S. Kotsiantis, V. Tampakas, Text classification using machine learning techniques, WSEAS Transactions on Computers 4 (8) (2005) 966–974.
[22] A.I. Anik, S. Yeaser, A.I. Hossain, A. Chakrabarty, Player's performance prediction in ODI cricket using machine learning algorithms, in: 2018 4th International Conference on Electrical Engineering and Information & Communication Technology (iCEEiCT), IEEE, 2018, pp. 500–505.
[23] K. Nigam, A. McCallum, T.M. Mitchell, Semi-supervised text classification using EM, 2006.
[24] I. Rasheed, V. Gupta, H. Banka, C. Kumar, Urdu text classification: A comparative study using machine learning techniques, in: 2018 Thirteenth International Conference on Digital Information Management (ICDIM), IEEE, 2018, pp. 274–278.
[25] N. Aljedani, R. Alotaibi, M. Taileb, HMATC: Hierarchical multi-label Arabic text classification model using machine learning, Egyptian Informatics Journal 22 (3) (2021) 225–237.
[26] Y. Zhan, H. Chen, S.F. Zhang, M. Zheng, Chinese text categorization study based on feature weight learning, in: 2009 International Conference on Machine Learning and Cybernetics, 3, IEEE, 2009, pp. 1723–1726.
[27] J. Sreemathy, P.S. Balamurugan, An efficient text classification using KNN and Naive Bayesian, International Journal on Computer Science and Engineering 4 (3) (2012) 392.
[28] S. Mayor, B. Pant, Document classification using support vector machine, International Journal of Engineering Science and Technology 4 (4) (2012).
[29] F. Colas, P. Brazdil, Comparison of SVM and some older classification algorithms in text classification tasks, IFIP Int. Fed. Inf. Process. 217 (2006) 169–178, doi:10.1007/978-0-387-34747-9_18.
[30] S. Tong, D. Koller, Support vector machine active learning with applications to text classification, pp. 45–66, 2001.
[31] J. Shawe-Taylor, C. Watkins, Text classification using string kernels.
[32] B. Trstenjak, S. Mikac, D. Donko, KNN with TF-IDF based framework for text categorization, Procedia Eng. 69 (2014) 1356–1364, doi:10.1016/j.proeng.2014.03.129.
[33] L. Baoli, Y. Shiwen, L. Qin, An improved k-nearest neighbor algorithm, in: Proc. 20th Int. Conf. Comput. Process. Orient. Lang., 2003.
[34] E.M. Elnahrawy, Log-based chat room monitoring using text categorization–A comparative study, in: IASTED Int. Conf. Inf. Knowl. Shar. (IKS 2002), 2002.
[35] G. Khazal, A. Zamyatin, Feature engineering for Arabic text classification, J. Eng. Appl. Sci. 14 (7) (2019) 2292–2301, doi:10.36478/jeasci.2019.2292.2301.
[36] S. Vijayarani, M.N. Nithya, Efficient machine learning classifiers for automatic information classification, Int. J. Mod. Trends Eng. Res. (2015) 685–694.
[37] B. Agarwal, N. Mittal, Text classification using machine learning methods–A survey, in: Proceedings of the Second International Conference on Soft Computing for Problem Solving (SocProS 2012), December 28-30, 2012, Springer, New Delhi, 2014, pp. 701–709.
[38] S.H. Jambukia, V.K. Dabhi, H.B. Prajapati, ECG beat classification using machine learning techniques, Int. J. Biomed. Eng. Technol. 26 (1) (2018) 32–53, doi:10.1504/IJBET.2018.089255.
[39] "Machine learning applications based on SVM classification."
[40] Parul Sinha, Poonam Sinha, Comparative study of chronic kidney disease prediction using KNN and SVM, Int. J. Eng. Res. V4 (12) (2015) 608–612, doi:10.17577/ijertv4is120622.
[41] I. Ibrahim, A. Abdulazeez, The role of machine learning algorithms for diagnosing diseases, J. Appl. Sci. Technol. Trends 2 (01) (2021) 10–19, doi:10.38094/jastt20179.
[42] M. Elbadawi, S. Gaisford, A.W. Basit, Advanced machine-learning techniques in drug discovery, Drug Discovery Today 26 (3) (2021) 769–777.
[43] S.V. Praveen, R. Ittamalla, G. Deepak, Analyzing Indian general public's perspective on anxiety, stress and trauma during Covid-19–A machine learning study of 840,000 tweets, Diabetes & Metabolic Syndrome: Clinical Research & Reviews 15 (3) (2021) 667–671.
[44] A.M.U.D. Khanday, Q.R. Khan, S.T. Rabani, Detecting textual propaganda using machine learning techniques, Baghdad Sci. J. 18 (1) (2021) 199–209.
[45] A.M.U.D. Khanday, Q.R. Khan, S.T. Rabani, Identifying propaganda from online social networks during COVID-19 using machine learning techniques, International Journal of Information Technology 13 (1) (2021) 115–122.
[46] N. Yadav, O. Kudale, A. Rao, S. Gupta, A. Shitole, Twitter sentiment analysis using supervised machine learning, Lect. Notes Data Eng. Commun. Technol. 57 (2021) 631–642, doi:10.1007/978-981-15-9509-7_51.
[47] A.A.A. Ahmed, A. Aljabouh, P.K. Donepudi, M.S. Choi, Detecting fake news using machine learning: A systematic literature review, arXiv preprint arXiv:2102.04458, 2021.
[48] Y. HaCohen-Kerner, D. Miller, Y. Yigal, The influence of preprocessing on text classification using a bag-of-words representation, PLoS One 15 (5) (2020), doi:10.1371/journal.pone.0232525.
[49] Z. Jiang, B. Gao, Y. He, Y. Han, P. Doyle, Q. Zhu, Text classification using novel term weighting scheme-based improved TF-IDF for Internet media reports, Mathematical Problems in Engineering 2021 (2021).
[50] I.J. Jacob, Performance evaluation of caps-net based multitask learning architecture for text classification, Journal of Artificial Intelligence 2 (01) (2020) 1–10.
[51] F. Rustam, M. Khalid, W. Aslam, V. Rupapara, A. Mehmood, G.S. Choi, A performance comparison of supervised machine learning models for Covid-19 tweets sentiment analysis, PLoS One 16 (2) (2021) e0245909.
[52] D. Kumar, A.C. Gopesh, M.P. Singh, Restaurant review classification and analysis.
[53] M. Nabeel Asim, M. Usman Ghani, M.A. Ibrahim, S. Ahmad, W. Mahmood, A. Dengel, Benchmark performance of machine and deep learning based methodologies for Urdu text document classification, arXiv e-prints, arXiv-2003, 2020.
[54] J. Kaur, J.R. Saini, A study of text classification natural language processing algorithms for Indian languages, VNSGU J. Sci. Technol. 4 (1) (2015) 162–167.
[55] M. Abid, A. Habib, J. Ashraf, A. Shahid, Urdu word sense disambiguation using machine learning approach, Cluster Computing 21 (1) (2018) 515–522.
[56] S.K. Singh, N. Katal, S.G. Modani, Multi-objective optimization of PID controller for coupled-tank liquid-level control system using genetic algorithm, in: Proceedings of the Second International Conference on Soft Computing for Problem Solving (SocProS 2012), December 28-30, 2012, Springer, New Delhi, 2014, pp. 59–66.
[57] B. Maram, G. Padmapriya, A.R. Satish, A framework for performance analysis on machine learning algorithms using Covid-19 dataset, Adv. Math.: Sci. J. 9 (10) (2020) 8207–8215.
[58] A.I. Kadhim, An evaluation of preprocessing techniques for text classification, Int. J. Comput. Sci. Inf. Secur. 16 (6) (2018) 22–32.
[59] M.A. Rosid, A.S. Fitrani, I.R.I. Astutik, N.I. Mulloh, H.A. Gozali, Improving text preprocessing for student complaint document classification using Sastrawi, in: IOP Conference Series: Materials Science and Engineering, 874, IOP Publishing, 2020.
[60] R.I. Kurnia, Y.D. Tangkuman, A.S. Girsang, Classification of user comment using word2vec and SVM classifier, Int. J. Adv. Trends Comput. Sci. Eng. 9 (1) (2020) 643–648, doi:10.30534/ijatcse/2020/90912020.
[61] A. Balinsky, H. Balinsky, S. Simske, Rapid change detection and text mining, in: Proceedings of the 2nd Conference on Mathematics in Defence (IMA), Defence Academy, UK, 2011.
[62] A.I. Kadhim, Survey on supervised machine learning techniques for automatic text classification, Artificial Intelligence Review 52 (1) (2019) 273–292.
[63] Y. Zheng, An exploration on text classification with classical machine learning algorithm, in: 2019 Int. Conf. Mach. Learn. Big Data Bus. Intell., 2019, pp. 81–85, doi:10.1109/MLBDBI48998.2019.00023.
[64] R. Jindal, Techniques for text classification–Literature review and current trends, Webology 12 (2) (2015) 1–28.
[65] B. Lopez, X. Sumba, IMDb sentiment analysis, 2019, pp. 2–6.
[66] M. Usman, S. Ayub, Urdu text classification using majority voting, vol. 7, no. 8, pp. 265–273, 2016.
[67] M. Bilal, H. Israr, M. Shahid, A. Khan, Sentiment classification of Roman-Urdu opinions using Naïve Bayesian, Decision Tree and KNN classification techniques, J. King Saud Univ. Inf. Sci. 28 (3) (2015) 330–344, doi:10.1016/j.jksuci.2015.11.003.
[68] A. Aggarwal, J. Singh, K. Gupta, A review of different text categorization techniques, 7, 2018, pp. 11–15.
[69] B. Charbuty, A. Abdulazeez, Classification based on decision tree algorithm for machine learning, J. Appl. Sci. Technol. Trends 2 (01) (2021) 20–28, doi:10.38094/jastt20165.
[70] B. Mathiak, S. Eckstein, Five steps to text mining in biomedical literature, in: Proceedings of the European Workshop on Data Mining and Text Mining for Bioinformatics, 2004, pp. 47–50.
