(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 14, No. 6, 2023

Hate Speech Detection in Bahasa Indonesia: Challenges and Opportunities

Endang Wahyu Pamungkas¹, Divi Galih Prasetyo Putri², Azizah Fatmawati³

Informatics Engineering Department, Universitas Muhammadiyah Surakarta, Surakarta, Indonesia¹,³
Software Engineering Department, Vocational School, Universitas Gadjah Mada, Yogyakarta, Indonesia²
Social Informatics Research Center, Universitas Muhammadiyah Surakarta, Surakarta, Indonesia¹

Abstract—This study aims to provide an overview of the current research on detecting abusive language in Indonesian social media. The study examines existing datasets, methods, and challenges and opportunities in this field. The research found that most existing datasets for detecting abusive language were collected from social media platforms such as Twitter, Facebook, and Instagram, with Twitter being the most commonly used source. The study also found that hate speech is the most researched type of abusive language. Various models, including traditional machine learning and deep learning approaches, have been implemented for this task, with deep learning models showing more competitive results. However, the use of transformer-based models is less popular in Indonesian hate speech studies. The study also emphasizes the importance of exploring more diverse phenomena, such as Islamophobia and political hate speech. Additionally, the study suggests crowdsourcing as a potential solution for the annotation approach for labeling datasets. Furthermore, it encourages researchers to consider code-mixing issues in abusive language datasets in Indonesia, as doing so could improve the overall model performance for detecting abusive language in Indonesian data. The study also suggests that the lack of effective regulations and the anonymity afforded to users on most social networking sites, as well as the increasing number of Twitter users in Indonesia, have contributed to the rising prevalence of hate speech in Indonesian social media. The study also notes the importance of considering code-mixed language, out-of-vocabulary words, grammatical errors, and limited context when working with social media data.

Keywords—Abusive language; hate speech detection; machine learning; social media

I. INTRODUCTION

In this digital era, social media has become an important aspect of everyday life. Not only is it a source of information, but it is also a medium of entertainment, allowing people to share content and express their feelings about anything at any time. However, social media can also be a double-edged sword. On one hand, it can provide a medium for constructive and positive communication among its users. On the other hand, the freedom of expression afforded to social media users can also create serious problems, such as the increasing prevalence of hate speech on social media. This phenomenon is often attributed to the lack of effective regulations and the anonymity afforded to users on most social networking sites [1]. These characteristics make social media the perfect medium for individual abusive users or even hate groups to spread and reinforce their views. In fact, social media platforms even offer opportunities for violent actors to propagate their acts, potentially reaching a wider audience when their posts go viral [2]. Twitter is a popular social networking platform that provides convenient access to its users for online social interactions. The number of Twitter users has been steadily increasing, from around 100 million users in 2017 to almost 240 million in 2022. Previous studies have shown that hate speech is also a prominent challenge in the Twittersphere. Pamungkas et al. [3] conducted a study on hate speech towards women on Twitter in multiple languages, including Italian, Spanish, and English. Lingiardi et al. [4] have also explored other forms of hate speech targeted at specific groups on Twitter.

Automatically detecting hate speech from social media text is a challenging task. Several studies have been proposed to address hate speech in social media, mainly focusing on implementing machine learning models to automatically predict whether an utterance is hate speech or not. However, working with social media data is itself difficult. Social media data often contains valuable knowledge for information extraction tasks, but it is usually very noisy and full of informal language [5]. According to the study of Baldwin et al. [5], there are several properties of social media data, including: i) the presence of code-mixed language; ii) the presence of out-of-vocabulary words; and iii) grammatical errors. Social media data also usually has very limited context, which is an important issue for abusive language detection tasks because it is difficult to classify a text as abusive or not without context. Other important clues for abusive language detection tasks, such as facial expressions, gestures, and voice tones (which are recognized in face-to-face communication), are also absent in social media data. However, social media content has some signals that can be exploited to partially resolve the context of such texts, including emojis, emoticons, hashtags, URLs, and mentions. Some studies have also found that there are several issues that contribute to the difficulty of detecting hate speech in social media automatically, including the use of swear words [6], multidomain issues [7], [8], and multilingual issues [7], [8].

Similarly, hate speech phenomena also occur in Indonesian social media. According to Statista¹, the number of Twitter users in Indonesia has reached almost 240 million, ranking fifth among all countries in the world. Hate speech in Indonesia has been regulated by the government since 2008, as stated in the Law of Information and Electronic Transaction (UU ITE). The Kepolisian Republik Indonesia (Indonesian Police Department) has also issued further regulations, as hate speech has the potential to have dangerous effects, not only for the victims of hate speech but also for society as a whole. Interestingly, most instances of hate speech on Indonesian social media are triggered by political events, such as elections. Several studies have also been conducted to study the hate speech phenomena in Indonesian social media [9], [10]. Most studies have focused on the automatic detection of hate speech utterances from social media data. The study by Alfina et al. [11] was one of the early studies of hate speech detection in Indonesian social media, specifically on the Twitter platform. This work proposed a novel dataset gathered from Twitter and manually annotated with two labels: hate speech and not. Another study by Ibrohim and Budi [12] proposed a more fine-grained hate speech dataset, which not only contains a binary class (hate speech vs. not), but also is annotated based on several categories, including the hate speech target, category, and level of hatefulness. More recent studies on hate speech detection in Indonesia have focused on adopting more recent technologies, such as neural-based and transformer-based models [13], [14].

In this paper, we summarize the studies on hate speech detection, specifically on Indonesian social media. We provide an overview of research conducted in this area, giving a comprehensive view of the state-of-the-art and datasets centered on this area. Our main objective is to draw a conclusion on the state-of-the-art and to provide several possible opportunities for future work based on existing open problems. After the introduction, we discuss the existing studies on hate speech detection in Indonesian social media, focusing on the approaches adopted and the available language resources in Section II. An analysis of challenges and opportunities for this particular task in future work is discussed in Section III. Finally, Section IV presents conclusive remarks for this survey.

This work has been funded by the internal funding of Universitas Muhammadiyah Surakarta under Grant Number 110.28/A.3-III/LRI/VI/2022.
¹ https://www.statista.com/statistics/242606/number-of-active-twitter-users-in-selected-countries/

II. LITERATURE REVIEW

Similar to other languages, hate speech is becoming a relevant issue in Indonesian social media. Despite hate speech being regulated by the national constitution, Indonesian social media users still use abusive language to communicate and even attack other users, often because they can hide their identities using anonymous accounts. Several studies have been conducted to address hate speech in Indonesian. Some have proposed novel corpora containing manually annotated data gathered from social media platforms such as Twitter, Instagram, and YouTube. Others have focused on developing machine learning models to automatically classify given utterances as either abusive or not. A few studies have done both, proposing a novel hate speech dataset and building a machine learning model based on that dataset. In this section, we review hate speech studies in Indonesian social media, focusing on two main aspects: (i) what datasets are available for abusive language detection in Indonesia and (ii) what has been done so far in Indonesian abusive language detection studies. We collected relevant documents using Google Scholar by searching for the keywords 'hate speech detection in Indonesian' and 'abusive language detection in Indonesian', limited to the first five pages of results for each keyword and sorted by relevance, without a time filter. We also checked the cited documents and references on the first five pages of each search to find more relevant publications. Fig. 1 summarizes our approach to collecting relevant documents for our study.

Fig. 1. Document collection methodology.

A. What Datasets are Available for Abusive Language Detection in Indonesia?

In this subsection, we collect information about the available datasets for abusive language studies in Indonesia. Table I summarizes the information about the available datasets for hate speech detection studies specifically in Indonesian. We gathered this information from previous studies on hate speech detection in Indonesian, using the approaches outlined in Fig. 1. We found that the two most frequently used datasets in previous work are those from Alfina et al. [11] and Ibrohim et al. [15]. However, these datasets are still less commonly used compared to hate speech datasets in languages with more resources, such as English, Italian, and Spanish. This may be due to the lack of a hate speech detection shared task in Indonesia, which usually attracts more researchers

TABLE I. SUMMARIZATION OF AVAILABLE ABUSIVE LANGUAGE DATASET IN INDONESIAN

Topical Focus                     Sources                        Annotation     Entries  Available  Ref
Hate Speech                       Twitter                        Expert Manual  1,100    Yes        [11]
Hate Speech                       Twitter                        Expert Manual  13,169   Yes        [12]
Abusiveness                       Twitter                        Expert Manual  2,016    Yes        [15]
Abusiveness                       News Comments                  Expert Manual  3,184    Yes        [16]
Hate Speech                       News Comments                  Expert Manual  3,614    No         [16]
Hate Speech                       Twitter                        Expert Manual  4,002    No         [17]
Hate Speech                       Instagram                      Expert Manual  1,067    No         [18]
Hate Speech                       Instagram                      Expert Manual  13,194   No         [19]
Hate Speech                       Instagram                      Expert Manual  572      Yes        [20]
Hate Speech                       Instagram                      Expert Manual  1,012    No         [21]
Hate Speech and Cyberbullying     Twitter                        Automatic      83,752   No         [22]
Hate Speech                       Facebook                       Expert Manual  1,276    No         [23]
Hate Speech                       Twitter                        Expert Manual  35,623   Yes        [24]
Hate Speech                       Twitter                        Expert Manual  1,477    Yes        [25]
Hate Speech                       Multiple Social Media Sources  Expert Manual  2,273    No         [26]
Hate Speech                       Multiple Social Media Sources  Expert Manual  1,400    No         [27]
Abusive Language and Hate Speech  Twitter                        Expert Manual  5,656    Yes        [28]
Hate Speech                       Twitter                        Expert Manual  20,601   No         [29]
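
The counts by source and by availability can be recomputed directly from Table I. The snippet below is a minimal sketch in Python: the rows are transcribed from the table, while the tuple layout and the `summarize` helper are illustrative assumptions and not part of any surveyed study.

```python
from collections import Counter

# Rows transcribed from Table I:
# (topical focus, source, annotation, entries, publicly available, reference number)
TABLE_I = [
    ("Hate Speech", "Twitter", "Expert Manual", 1100, True, 11),
    ("Hate Speech", "Twitter", "Expert Manual", 13169, True, 12),
    ("Abusiveness", "Twitter", "Expert Manual", 2016, True, 15),
    ("Abusiveness", "News Comments", "Expert Manual", 3184, True, 16),
    ("Hate Speech", "News Comments", "Expert Manual", 3614, False, 16),
    ("Hate Speech", "Twitter", "Expert Manual", 4002, False, 17),
    ("Hate Speech", "Instagram", "Expert Manual", 1067, False, 18),
    ("Hate Speech", "Instagram", "Expert Manual", 13194, False, 19),
    ("Hate Speech", "Instagram", "Expert Manual", 572, True, 20),
    ("Hate Speech", "Instagram", "Expert Manual", 1012, False, 21),
    ("Hate Speech and Cyberbullying", "Twitter", "Automatic", 83752, False, 22),
    ("Hate Speech", "Facebook", "Expert Manual", 1276, False, 23),
    ("Hate Speech", "Twitter", "Expert Manual", 35623, True, 24),
    ("Hate Speech", "Twitter", "Expert Manual", 1477, True, 25),
    ("Hate Speech", "Multiple Social Media Sources", "Expert Manual", 2273, False, 26),
    ("Hate Speech", "Multiple Social Media Sources", "Expert Manual", 1400, False, 27),
    ("Abusive Language and Hate Speech", "Twitter", "Expert Manual", 5656, True, 28),
    ("Hate Speech", "Twitter", "Expert Manual", 20601, False, 29),
]


def summarize(rows):
    """Hypothetical helper: tally datasets by source and by public availability."""
    available = [row for row in rows if row[4]]
    return {
        "datasets": len(rows),
        "by_source": dict(Counter(row[1] for row in rows)),
        "available": len(available),
        "available_by_source": dict(Counter(row[1] for row in available)),
    }


summary = summarize(TABLE_I)
print(summary)
```

On these rows, Twitter accounts for 9 of the 18 datasets, and 8 datasets are marked as available, 6 of them sourced from Twitter.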

to use available datasets for developing the best systems. In this section, we will discuss the available datasets based on their topical focus, sources, annotation approach, number of instances, and availability.

• Topical Focus: As mentioned in a previous study by Pamungkas et al. [8], the topical focus of a dataset can be described as the specific abusive phenomena addressed, as well as the targets of the abusive behavior. We also agree that a hate speech dataset may cover more than one abusive phenomenon. Compared to the results obtained by Pamungkas et al. [8], most abusive language datasets in Indonesia only focus on two topical focuses: abusiveness and hate speech, which are the most general terms used in abusive language studies. Only one study, by Febriana et al. [22], includes the term “cyberbullying” to describe the abusive phenomena it deals with. Based on these results, we argue that there are still many specific abusive phenomena that need to be addressed in Indonesian abusive language studies, such as sexism, xenophobia, offensiveness, and Islamophobia.

• Sources: The source of a dataset refers to the media platforms from which the data was gathered. The different characteristics of each platform can also be variables that influence the treatment and difficulty of the hate speech detection task. According to our results presented in Table I, most abusive language datasets in Indonesian were gathered from Twitter. This may be due to the convenience of scraping tweet samples using the Twitter API, and because Twitter has less strict rules regarding data sharing for research purposes compared to other platforms. This result is consistent with a survey conducted by Pamungkas et al. [8]. Additionally, we also observed that some research used Instagram posts and comments on news posts to study abusive phenomena.

• Annotation Approach and Scheme: Based on our manual inspection of previous studies, we found that almost all of the proposed datasets were annotated by experts. This result differs from studies of abusive language in other languages, where crowdsourcing is also a popular method for annotating datasets. We also observed that most proposed abusive language datasets in Indonesia use binary labels, including an “abusive” class and a “not abusive” class. However, we also found studies that propose a finer-grained annotation schema, such as those implemented in [12], [24], [28].

• Availability: As presented in Table I, more than half of the datasets used for abusive language detection studies were not publicly available². We can observe that most of the publicly available datasets were gathered from Twitter. Meanwhile, datasets sourced from other social media platforms such as Facebook and Instagram are mostly not shared publicly. This finding is also consistent with the survey results obtained by [8], where the availability of the datasets can be influenced by the regulation of the social media platforms related to data sharing policies.

² The link to the dataset cannot be found in the article.

B. What has been Done so Far in Indonesian Abusive Language Detection Study?

In this subsection, we review the approaches adopted by previous studies to detect abusive language in Indonesian social media. We used a similar approach as presented in Fig. 1 to collect the available studies. We collected any publications found on Google Scholar using the defined keywords, “abusive language detection Indonesia” and “hate speech detection Indonesia”. We limited our query to the first five pages for each keyword and sorted results based on relevance, without a time filter. Furthermore, we also checked each document’s cited documents and references on the first five pages to find more relevant publications. Table II summarizes the available works in abusive language detection, specifically in Indonesian social media. We carefully reviewed each document to obtain the key information of each work. In this part, we focus on

TABLE II. SUMMARIZATION OF APPROACHES ADOPTED FOR HATE SPEECH DETECTION IN INDONESIAN

Model                      Approach                                                              Ref
Traditional Models         Using classical machine learning models such as SVM, Naive Bayes,     [11], [30], [17], [23], [12],
                           Decision Tree, Random Forest, Logistic Regression, K-nearest          [28], [27], [21], [31], [32],
                           Neighbours, Maximum Entropy, etc., coupled with several features,     [33], [34], [35], [36], [37],
                           including lexical and other structural features.                      [38], [39], [40]
Unsupervised Approach      Using data mining techniques such as clustering, classification,      [41], [42], [43]
                           and association, without a training process, to detect hate speech
                           instances. This approach is very beneficial when the training data
                           is limited.
Neural-based Models        Using neural-based models, either RNN-based variants such as LSTM     [26], [19], [13], [44], [45],
                           and GRU or CNN-based models, coupled with language representations    [46], [47], [48], [49], [50],
                           obtained either from available pretrained models or by self-training  [51], [52]
                           on the available training data.
Transformer-based Models   Using recent transformer-based architectures such as BERT, RoBERTa,   [53], [31], [14]
                           XLM, etc. Based on previous studies in the NLP area, these models
                           usually provide robust performance across different NLP tasks.
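
Two of the families in Table II can be illustrated in a few lines of Python: a lexicon-matching baseline that needs no training data, and a traditional multinomial Naive Bayes classifier over bag-of-words lexical features. This is a sketch under stated assumptions, not the method of any cited study: the toy texts, the labels, the two-word lexicon, and all function and class names are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a simple lexical feature)."""
    return TOKEN_RE.findall(text.lower())


def lexicon_match(text, lexicon):
    """Lexicon baseline: flag a text if any of its tokens is in the lexicon."""
    return any(token in lexicon for token in tokenize(text))


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # out-of-vocabulary tokens are ignored
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, not taken from any dataset in Table I.
TRAIN_TEXTS = [
    "kamu bodoh sekali",
    "dasar bodoh dan dungu",
    "terima kasih banyak",
    "selamat pagi semua",
]
TRAIN_LABELS = ["abusive", "abusive", "not_abusive", "not_abusive"]
LEXICON = {"bodoh", "dungu"}

model = NaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)
print(model.predict("dia bodoh"), lexicon_match("dia bodoh", LEXICON))
print(model.predict("selamat pagi"), lexicon_match("selamat pagi", LEXICON))
```

On this toy data both approaches agree: "dia bodoh" is flagged as abusive and "selamat pagi" is not. The lexicon baseline depends entirely on the coverage of its word list, whereas the classifier depends on labeled examples, which mirrors the trade-off between the unsupervised and traditional rows of the table.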

reviewing the adopted approach of each work to deal with the abusive language detection task. In particular, we focus on two main discussions: variants of the models and implemented approaches. Below, we elaborate on and compare the previous work in Indonesian abusive language studies, to gain insights for further development.

• Model Variant: A wide variety of classification models have been adopted for the abusive language detection task in Indonesian. Table II summarizes the published studies on this topic. Based on the results, we divided the proposed models into four different variants: traditional models, unsupervised models, neural-based models, and transformer-based models. We can observe that most previous works employed traditional models to deal with abusive language detection in Indonesia. Additionally, we also found a few studies that adopt an unsupervised approach, which does not require labeled data to detect abusive language. This is an interesting finding, as unsupervised models are not popularly used for detecting abusive language in more resource-rich languages, as observed by Pamungkas et al. [8]. Like traditional models, neural-based models are also popular for detecting abusive language. This is in line with the availability of Indonesian language models that have been proposed by several recent works. Lastly, we notice that the use of transformer-based models is still largely unexplored in Indonesian abusive language studies. Unlike Indonesian language models, studies focused on developing transformer models for the Indonesian language are limited. Most of the abusive language studies in Indonesia that exploit transformer-based models utilize multilingual transformers.

• Classification Models: A wide variety of classification models were used in this task. Starting with traditional classifiers, several models such as SVM, Naive Bayes, Decision Tree, Random Forest, Logistic Regression, KNN, and Maximum Entropy have been used for this classification task. These models were the most popular approach for detecting abusive language, specifically in Indonesian data. This may be due to the limited availability of resources in Indonesian, such as language models or labeled datasets. For the unsupervised approach, a few studies have proposed using lexicon-based and straightforward string matching approaches to detect abusive instances. Although lexicon-based approaches are unpopular in common text classification tasks, they are still reliable when annotated data is limited. In line with the trend in other natural language processing tasks, the use of neural-based models is also gaining more attention from NLP researchers in Indonesia. Some models such as LSTM, GRU, and CNN have been widely used to detect abusive language in Indonesian, either with or without pre-trained language representations. Lastly, the recent transformer-based technology is also starting to be used in the Indonesian research community. This may be due to the availability of multilingual transformer models such as BERT Multilingual, Multilingual GPT, and XLM RoBERTa. Some of these models were also used by a few studies [14], [53] for detecting abusive language in Indonesian.

III. CHALLENGES AND OPPORTUNITIES

The literature review and analysis presented in previous sections provide insights into the current state of the art of abusive language studies in Indonesian. Based on these analyses, we have observed several challenges in this task, which are summarized as follows:

• Limited Availability of Language Resources: The adopted approach for dealing with the task of abusive language detection in Indonesian is currently limited and lags behind studies in other, more resource-rich languages. Traditional models are the most popular approach for addressing this problem in Indonesia, while in other languages, more recent transformer-based models are commonly used to achieve state-of-the-art results. We believe that this discrepancy is likely related to the limited availability of language resources, including language corpora and language models. We also note that several recent studies have proposed transformer-based models, such as IndoBERT [54] and IndoBERTweet [55], but they are still limited in comparison to the transformer technologies available for other languages.

• Limited Exploration of Abusive Phenomena: Based on the abusive phenomena covered in the available datasets for abusive language detection studies, we
perceive that the range of abusive phenomena explored in Indonesian is still very limited. Studies in Indonesian have mostly focused on the detection of hate and abusive speech. Meanwhile, similar studies in other languages have been conducted with a broader coverage of abusive phenomena, which can include sexism, racism, misogyny, Islamophobia, and more. Some of these studies have also proposed finer-grained labels to capture more specific abusive phenomena, which is usually beneficial for differentiating the treatment for handling each phenomenon.

• Low Awareness of Reproducibility Aspect: Based on our review, we also notice that most of the published research in Indonesian abusive language studies does not make its code and datasets publicly available. This issue makes it difficult for other researchers to reproduce the results of previous works, which is important for better analysis of their own studies. Furthermore, reproducibility is an important aspect for maintaining continuity in research, specifically in the area of abusive language research.

• Limited Approach for Annotation Procedure: We observe that most studies used manual expert annotation procedures to label abusive language datasets. This approach is proven to be reliable for obtaining a high-quality dataset when the subjectivity of the annotation task is high. However, this approach is usually not feasible for annotating a large amount of data, as the annotation task becomes more labor-intensive and time-consuming. Sometimes, alternative annotation approaches such as crowdsourcing scenarios can provide a wider perspective, with a diverse demographic of annotators who have different backgrounds and views to evaluate the abusive instances.

• The Problem of Code-Mixed Languages: Geographically, Indonesia consists of several regions, each with its own local languages. According to recent reports, there are 718 local languages used by different regions and tribes in Indonesia. Indonesians tend to use a mix of their own local language and Bahasa Indonesia to communicate on social media platforms, such as Twitter. Related to this issue, we conducted a random check on some publicly available datasets. We found a lot of code-mixed instances in the checked datasets [28], [24], which are mostly written in a mixture of Indonesian and Javanese. As in other languages and other NLP tasks, the issue of code-mixing is still a prominent challenge that needs to be tackled.

Based on these challenges, we also point out several opportunities for future studies in this research direction, which are summarized below.

• Building Novel Language Resources in Indonesian: Our NLP research community should also focus on studying and developing language resources in Indonesian. These resources could include novel corpora for diverse tasks or recent language model technologies. The availability of more language resources could provide more opportunities for researchers in abusive language studies to explore more approaches to better detect abusive language in Indonesian.

• Expanding the Study Exploration into Other Abusive Phenomena: As mentioned in the challenges section, abusive language studies in Indonesian are still focused on a few phenomena, including hate speech and abusiveness. Based on our investigation, there are several abusive phenomena specific to Indonesia that could potentially become a focus for exploration, including Islamophobia and political hate speech. There are also other more general phenomena which have been studied in other languages, such as sexism, racism, xenophobia, homophobia, and more. A broader exploration into other abusive phenomena could open more opportunities for research collaboration between NLP researchers and researchers from other communities such as the humanities, psychology, gender studies, and social science.

• Exploring Other Annotation Approaches to Build Abusive Language Datasets: Most of the available abusive language datasets in Indonesian were built using expert annotation approaches. Crowdsourcing, for example, is an alternative worth considering for annotating abusive language datasets. It has the advantage of bringing in a diverse set of annotators with different backgrounds and identities, which can help to reduce bias in the dataset, itself an important issue in this area of study. In addition, crowdsourcing can be particularly useful when the dataset is large and complex, and would be too time-consuming for a single person to finish.

• Tackling the Problem of Code-Mixed Data: Code-mixing has become a prominent challenge in various NLP tasks in recent years. This problem may be due to current technology and platforms, which provide a multilingual environment. Similarly, Indonesians also tend to use a mix of their local languages and Bahasa Indonesia to communicate with others both in real life and on social media channels. Dealing with language shift in code-mixed data is a challenging task. Specifically in abusive language studies, several transfer learning approaches could be applied to this task.

IV. CONCLUSION

This survey presents a summary of research on detecting abusive language in Indonesian social media. It covers existing datasets that could be used for this research, including datasets from multiple platforms, types of abusive behavior, and languages. The survey also examines the methods that have been proposed for detecting abusive language in Indonesian social media. Finally, it discusses the challenges and opportunities in this area of research and provides suggestions for future development.

This study found that most of the existing datasets for detecting abusive language were collected from social media platforms like Twitter, Facebook, and Instagram, with Twitter being the most commonly used source. This may be because
it is easy to obtain samples from Twitter using its public API and because Twitter's policy on sharing data is less strict. The study also observed that hate speech is the most researched type of abusive language, ahead of other types such as abusiveness and cyberbullying.

A wide variety of models have been applied to the task of abusive language detection in Indonesia. However, most studies have relied on traditional models such as logistic regression, SVM, naive Bayes, and random forest. Several feature representations were used to train these models, including TF-IDF, bag of words, and word vectors obtained from pre-trained language representations. Overall, recent deep learning architectures have obtained more competitive results than the other models. Furthermore, we also observed that transformer-based models are still less widely used in Indonesian hate speech studies.

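To make the traditional pipeline mentioned above concrete, the following is a minimal sketch of a bag-of-words multinomial naive Bayes classifier written with the Python standard library only. It is not the implementation of any surveyed study: the toy sentences, the placeholder token `kata_kasar` (standing in for an abusive word), the function names, and the add-one smoothing value `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; real studies use richer preprocessing.
    return text.lower().split()


def train_naive_bayes(texts, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on bag-of-words counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {"log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_docs.items():
        model["log_prior"][label] = math.log(n_docs / len(labels))
        # Additive smoothing keeps unseen (word, class) pairs from having zero probability.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            word: math.log((word_counts[label][word] + alpha) / denom)
            for word in vocab
        }
    return model


def predict(model, text):
    """Return the label with the highest posterior log score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        log_lik = model["log_likelihood"][label]
        # Tokens outside the training vocabulary are ignored.
        scores[label] = log_prior + sum(
            log_lik[token] for token in tokenize(text) if token in log_lik
        )
    return max(scores, key=scores.get)


# Hypothetical toy data; "kata_kasar" is a neutral placeholder token.
train_texts = [
    "kamu kata_kasar sekali",
    "dasar kata_kasar",
    "selamat pagi semua",
    "terima kasih banyak",
]
train_labels = ["abusive", "abusive", "normal", "normal"]

model = train_naive_bayes(train_texts, train_labels)
print(predict(model, "dasar kamu kata_kasar"))
print(predict(model, "selamat pagi"))
```

Replacing the raw counts with TF-IDF weights, or the classifier with logistic regression or an SVM, changes only the feature and model steps; the overall structure of tokenizing, vectorizing, fitting, and predicting stays the same.
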
Finally, we have identified some recent challenges and opportunities for abusive language detection studies in Indonesian. We observe that the availability of more language resources in Indonesian is one of the factors contributing to the acceleration of research in this area. We also find that abusive language studies should explore more diverse phenomena beyond hate speech and abusiveness, such as islamophobia, political hate speech, and more general phenomena that are already widely studied in other languages, such as sexism and racism. Another suggestion concerns the annotation approach for labeling abusive language datasets, which mostly relies on manual expert annotation. We suggest exploring crowdsourcing scenarios, which could produce less biased and more comprehensive datasets. Finally, we also encourage researchers working in this area to consider the code-mixing issue in current abusive language datasets in Indonesia. We believe that dealing with the code-mixing issue could improve overall model performance for detecting abusive language in Indonesian data.

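As a rough illustration of how the amount of code-mixing in a dataset might be gauged, the sketch below tags each token against small per-language word lists and reports the share of tagged tokens that fall outside the dominant language of a text. The word lists, the language codes, and the function names are hypothetical and chosen only for this example; a real study would more likely use a trained token-level language identifier.

```python
from collections import Counter

# Hypothetical, tiny word lists used only for illustration.
LEXICONS = {
    "id": {"saya", "tidak", "suka", "film", "ini", "dan", "yang"},
    "en": {"i", "do", "not", "like", "this", "movie", "and"},
}


def tag_tokens(text):
    """Assign each token the first language whose word list contains it, else 'other'."""
    tags = []
    for token in text.lower().split():
        tag = next(
            (lang for lang, words in LEXICONS.items() if token in words),
            "other",
        )
        tags.append(tag)
    return tags


def mixing_ratio(text):
    """Share of language-tagged tokens that are outside the dominant language."""
    counts = Counter(tag for tag in tag_tokens(text) if tag != "other")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - max(counts.values()) / total


print(tag_tokens("saya tidak suka this movie"))
print(mixing_ratio("saya tidak suka this movie"))
print(mixing_ratio("saya tidak suka film ini"))
```

A ratio of zero means every tagged token belongs to one language, while larger values indicate heavier mixing; such a score could be used to flag texts that need language-aware preprocessing before classification.
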
ACKNOWLEDGMENT

This research is supported by internal funding from Universitas Muhammadiyah Surakarta.
[18] A. Briliani, B. Irawan, and C. Setianingsih, “Hate speech detection
R EFERENCES in indonesian language on instagram comment section using k-nearest
neighbor classification method,” in 2019 IEEE International Conference
[1] H. Rainie, J. Q. Anderson, and J. Albright, The future of free speech, on Internet of Things and Intelligence System (IoTaIS). IEEE, 2019,
trolls, anonymity and fake news online. Pew Research Center Wash- pp. 98–104.
ington, DC, 2017. [19] I. G. M. Putra and D. Nurjanah, “Hate speech detection in indonesian
[2] B. Mathew, N. Kumar, P. Goyal, A. Mukherjee et al., “Analyzing language instagram,” in 2020 International Conference on Advanced
the hate and counter speech accounts on Twitter,” arXiv preprint Computer Science and Information Systems (ICACSIS). IEEE, 2020,
arXiv:1812.02712, 2018. pp. 413–420.
[3] E. W. Pamungkas, V. Basile, and V. Patti, “Misogyny detection in [20] N. I. Pratiwi, I. Budi, and I. Alfina, “Hate speech detection on indone-
twitter: a multilingual and cross-domain study,” Information Processing sian instagram comments using fasttext approach,” in 2018 International
& Management, vol. 57, no. 6, p. 102360, 2020. Conference on Advanced Computer Science and Information Systems
[4] V. Lingiardi, N. Carone, G. Semeraro, C. Musto, M. D’Amico, and (ICACSIS). IEEE, 2018, pp. 447–450.
S. Brena, “Mapping twitter hate speech towards social and sexual [21] E. Erizal, B. Irawan, and C. Setianingsih, “Hate speech detection in
minorities: A lexicon-based approach to semantic content analysis,” indonesian language on instagram comment section using maximum
Behaviour & Information Technology, vol. 39, no. 7, pp. 711–721, 2020. entropy classification method,” in 2019 International Conference on
[5] T. Baldwin, P. Cook, M. Lui, A. MacKinlay, and L. Wang, “How Information and Communications Technology (ICOIACT). IEEE, 2019,
noisy social media text, how diffrnt social media sources?” in pp. 533–538.
Proceedings of the Sixth International Joint Conference on Natural [22] T. Febriana and A. Budiarto, “Twitter dataset for hate speech and
Language Processing. Nagoya, Japan: Asian Federation of Natural cyberbullying detection in indonesian language,” in 2019 International
Language Processing, Oct. 2013, pp. 356–364. [Online]. Available: Conference on Information Management and Technology (ICIMTech),
https://www.aclweb.org/anthology/I13-1041 vol. 1. IEEE, 2019, pp. 379–382.

www.ijacsa.thesai.org 1180 | P a g e
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 14, No. 6, 2023

[23] N. Aulia and I. Budi, “Hate speech detection on indonesian long [40] M. O. Ibrohim, M. A. Setiadi, and I. Budi, “Identification of hate
text documents using machine learning approach,” in Proceedings of speech and abusive language on indonesian twitter using the word2vec,
the 2019 5th International Conference on Computing and Artificial part of speech and emoji features,” in Proceedings of the International
Intelligence, 2019, pp. 164–169. Conference on Advanced Information Science and System, 2019, pp.
[24] A. D. Asti, I. Budi, and M. O. Ibrohim, “Multi-label classification for 1–5.
hate speech and abusive language in indonesian-local languages,” in [41] W. Darmalaksana, F. Irwansyah, H. Sugilar, D. Maylawati, W. Azis, and
2021 International Conference on Advanced Computer Science and A. Rahman, “Logical framework for hate speech detection on religion
Information Systems (ICACSIS). IEEE, 2021, pp. 1–6. issues in indonesia,” in IOP Conference Series: Materials Science and
[25] A. Muzakir, K. Adi, and R. Kusumaningrum, “Classification of hate Engineering, vol. 1098, no. 3. IOP Publishing, 2021, p. 032046.
speech language detection on social media: Preliminary study for [42] N. Kurniasih, L. A. Abdillah, I. K. Sudarsana, I. Yogantara, I. Astawa,
improvement,” in International Conference on Networking, Intelligent R. F. Nanuru, A. Miagina, J. O. Sabarua, M. Jamil, J. Tandisalla
Systems and Security. Springer, 2023, pp. 146–156. et al., “Prototype application hate speech detection website using
[26] T. L. Sutejo and D. P. Lestari, “Indonesia hate speech detection using string matching and searching algorithm,” International Journal of
deep learning,” in 2018 International Conference on Asian Language Engineering & Technology, vol. 7, no. 2.5, pp. 62–64, 2018.
Processing (IALP). IEEE, 2018, pp. 39–43. [43] M. Hayaty, S. Adi, and A. D. Hartanto, “Lexicon-based indonesian
[27] U. A. N. Rohmawati, S. W. Sihwi, and D. E. Cahyani, “Semar: An local language abusive words dictionary to detect hate speech in social
interface for indonesian hate speech detection using machine learning,” media,” Journal of Information Systems Engineering and Business
in 2018 International Seminar on Research of Information Technology Intelligence, vol. 6, no. 1, pp. 9–17, 2020.
and Intelligent Systems (ISRITI). IEEE, 2018, pp. 646–651. [44] J. Patihullah and E. Winarko, “Hate speech detection for indonesia
[28] S. D. A. Putri, M. O. Ibrohim, and I. Budi, “Abusive language and tweets using word embedding and gated recurrent unit,” IJCCS (Indone-
hate speech detection for javanese and sundanese languages in tweets: sian Journal of Computing and Cybernetics Systems), vol. 13, no. 1,
Dataset and preliminary study,” in 2021 11th International Workshop pp. 43–52, 2019.
on Computer Science and Engineering, WCSE 2021. International [45] S. S. Syam, B. Irawan, and C. Setianingsih, “Hate speech detection
Workshop on Computer Science and Engineering (WCSE), 2021, pp. on twitter using long short-term memory (lstm) method,” in 2019
461–465. 4th International Conference on Information Technology, Information
[29] F. Anistya, E. B. Setiawan et al., “Hate speech detection on twitter in Systems and Electrical Engineering (ICITISEE). IEEE, 2019, pp. 305–
indonesia with feature expansion using glove,” Jurnal RESTI (Rekayasa 310.
Sistem Dan Teknologi Informasi), vol. 5, no. 6, pp. 1044–1051, 2021. [46] I. Ghozali, K. R. Sungkono, R. Sarno, and R. Abdullah, “Synonym
[30] N. A. Setyadi, M. Nasrun, and C. Setianingsih, “Text analysis for based feature expansion for indonesian hate speech detection.” Inter-
hate speech detection using backpropagation neural network,” in 2018 national Journal of Electrical & Computer Engineering (2088-8708),
International Conference on Control, Electronics, Renewable Energy vol. 13, no. 1, 2023.
and Communications (ICCEREC). IEEE, 2018, pp. 159–165. [47] H. Imaduddin, S. Fauziati et al., “Word embedding comparison for
[31] A. D. Sanya and L. H. Suadaa, “Handling imbalanced dataset on indonesian language sentiment analysis,” in 2019 International Con-
hate speech detection in indonesian online news comments,” in 2022 ference of Artificial Intelligence and Information Technology (ICAIIT).
10th International Conference on Information and Communication IEEE, 2019, pp. 426–430.
Technology (ICoICT). IEEE, 2022, pp. 380–385. [48] A. Marpaung, R. Rismala, and H. Nurrahmi, “Hate speech detection in
[32] P. S. B. Ginting, B. Irawan, and C. Setianingsih, “Hate speech detection indonesian twitter texts using bidirectional gated recurrent unit,” in 2021
on twitter using multinomial logistic regression classification method,” 13th International Conference on Knowledge and Smart Technology
in 2019 IEEE International Conference on Internet of Things and (KST). IEEE, 2021, pp. 186–190.
Intelligence System (IoTaIS). IEEE, 2019, pp. 105–111. [49] G. B. Herwanto, A. M. Ningtyas, K. E. Nugraha, and I. N. P. Trisna,
[33] M. A. Ibrahim, S. Arifin, I. G. A. A. Yudistira, R. Nariswari, A. A. “Hate speech and abusive language classification using fasttext,” in
Abdillah, N. P. Murnaka, and P. W. Prasetyo, “An explainable ai model 2019 International Seminar on Research of Information Technology and
for hate speech detection on indonesian twitter,” CommIT (Communica- Intelligent Systems (ISRITI). IEEE, 2019, pp. 69–72.
tion and Information Technology) Journal, vol. 16, no. 2, pp. 175–182, [50] D. A. N. Taradhita and I. Darma Putra, “Hate speech classification
2022. in indonesian language tweets by using convolutional neural network.”
[34] D. A. Anggoro and D. Permatasari, “Performance comparison of the Journal of ICT Research & Applications, vol. 14, no. 3, 2021.
kernels of support vector machine algorithm for diabetes mellitus [51] E. Sazany and I. Budi, “Hate speech identification in text written
classification,” International Journal of Advanced Computer Science in indonesian with recurrent neural network,” in 2019 International
and Applications, vol. 14, no. 1, 2023. Conference on Advanced Computer Science and information Systems
[35] I. M. A. Niam, B. Irawan, C. Setianingsih, and B. P. Putra, “Hate speech (ICACSIS). IEEE, 2019, pp. 211–216.
detection using latent semantic analysis (lsa) method based on image,” [52] M. N. Ramadhan, I. Budi, A. B. Santoso, and R. R. Suryono, “Sexual
in 2018 International Conference on Control, Electronics, Renewable violence classification as hate speech using indonesian tweet,” in
Energy and Communications (ICCEREC). IEEE, 2018, pp. 166–171. 2022 International Symposium on Information Technology and Digital
[36] M. P. K. Dewi and E. B. Setiawan, “Feature expansion using word2vec Innovation (ISITDI). IEEE, 2022, pp. 114–120.
for hate speech detection on indonesian twitter with classification using [53] E. W. Pamungkas, V. Basile, and V. Patti, “A joint learning approach
svm and random forest,” JURNAL MEDIA INFORMATIKA BUDI- with knowledge injection for zero-shot cross-lingual hate speech detec-
DARMA, vol. 6, no. 2, pp. 979–988, 2022. tion,” Information Processing & Management, vol. 58, no. 4, p. 102544,
[37] E. Utami, A. F. Iskandar, S. Raharjo et al., “Multi-label classification 2021.
of indonesian hate speech detection using one-vs-all method,” in 2021 [54] F. Koto, A. Rahimi, J. H. Lau, and T. Baldwin, “Indolem and indobert: A
IEEE 5th International Conference on Information Technology, Infor- benchmark dataset and pre-trained language model for indonesian nlp,”
mation Systems and Electrical Engineering (ICITISEE). IEEE, 2021, in Proceedings of the 28th International Conference on Computational
pp. 78–82. Linguistics, 2020, pp. 757–770.
[38] D. Elisabeth, I. Budi, and M. O. Ibrohim, “Hate code detection in [55] F. Koto, J. H. Lau, and T. Baldwin, “Indobertweet: A pretrained
indonesian tweets using machine learning approach: A dataset and language model for indonesian twitter with effective domain-specific
preliminary study,” in 2020 8th International Conference on Information vocabulary initialization,” in Proceedings of the 2021 Conference on
and Communication Technology (ICoICT). IEEE, 2020, pp. 1–6. Empirical Methods in Natural Language Processing, 2021, pp. 10 660–
[39] S. Kurniawan and I. Budi, “Indonesian tweets hate speech target classi- 10 668.
fication using machine learning,” in 2020 Fifth International Conference
on Informatics and Computing (ICIC). IEEE, 2020, pp. 1–5.

www.ijacsa.thesai.org 1181 | P a g e

You might also like