
Natural Language Processing Journal 5 (2023) 100041

Contents lists available at ScienceDirect

Natural Language Processing Journal


journal homepage: www.elsevier.com/locate/nlp

Homophobia and transphobia detection for low-resourced languages in social media comments

Prasanna Kumar Kumaresan a, Rahul Ponnusamy a, Ruba Priyadharshini b, Paul Buitelaar a, Bharathi Raja Chakravarthi a,c,∗

a Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway, Ireland
b Department of Mathematics, Gandhigram Rural Institute-Deemed to be University, Tamil Nadu, India
c School of Computer Science, University of Galway, Ireland

ARTICLE INFO

Keywords:
Dataset creation
Under-resourced language
Machine learning
Transformer models
Cross-lingual transfer learning
Homophobia and transphobia
Performance analysis

ABSTRACT

People are increasingly sharing and expressing their emotions on online social media platforms such as Twitter, Facebook, and YouTube. An abusive, hateful, threatening, or discriminatory act that targets gay, lesbian, transgender, or bisexual individuals is called homophobia and transphobia, and detecting such acts on social media is called homophobia and transphobia detection. This task has recently gained interest among researchers. Identifying homophobic and transphobic content in under-resourced languages is a challenging task, and until now there have been no resources for categorizing such content in Malayalam and Hindi. This paper presents a new high-quality dataset for detecting homophobia and transphobia in the Malayalam and Hindi languages. Our dataset consists of 5,193 comments in Malayalam and 3,203 comments in Hindi. We also report experiments performed with traditional machine learning and transformer-based deep learning models on the Malayalam, Hindi, English, Tamil, and Tamil-English datasets.

1. Introduction

The evolution of the internet has led to social media platforms, review sites, and many more venues for online communication, and their usage has increased across languages globally. These platforms permit users to post and share material and express their views on anything at any time (Al-Hassan and Al-Dossari, 2021; Chakravarthi et al., 2022a). Due to the freedom of speech available on the internet, people express their thoughts and feelings on any specific topic, and the anonymity and social distance the internet provides allow individuals to manipulate the lives of others and abuse them. Abusive content has become more prevalent nowadays, mainly with the rise of social media, and it is a growing issue because of the large amount of user-generated content on the internet, especially on social media sites like Twitter, Facebook, and YouTube (Sai and Sharma, 2021). During the last decade, online offensive language has been identified as a global epidemic that has spread across social media platforms (Gao et al., 2020). For Lesbian, Gay, Bisexual, Transgender, and Other (LGBT+) vulnerable individuals, it is much more disturbing (Díaz-Torres et al., 2020).

There is a vast range of undesirable content, including racist, homophobic, transphobic, and sexist comments, as well as abuse and threats targeted at individuals or organizations (Zampieri et al., 2019; Chakravarthi, 2023a). These hateful posts and comments affect society both on a micro scale and at the global level by influencing people's views on important events like elections and protests. For the scope of this paper, we adopt a definition of homophobia and transphobia as "insulting, hurtful, derogatory, or obscene content directed from one person to another person". Within the natural language processing (NLP) community, the automation of homophobia and transphobia detection has made significant progress, accelerated by the organization of various shared tasks aimed at identifying homophobia and transphobia (Chakravarthi, 2023b; Kumaresan et al., 2022). Furthermore, there has been a proliferation of recent strategies for automatic homophobia and transphobia detection in social media text. However, working with social media text is challenging, as people use various languages, spellings, and words that may not be found in any standard dictionary (Mishra et al., 2021). An increasing number of researchers are working toward reducing derogatory comments on online platforms. Similarly, we strive to minimize negative opinions about homosexual and transgender individuals on online social media platforms.

∗ Corresponding author at: Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway, Ireland.
E-mail addresses: [email protected] (P.K. Kumaresan), [email protected] (R. Ponnusamy),
[email protected] (R. Priyadharshini), [email protected] (P. Buitelaar), [email protected] (B.R. Chakravarthi).

https://doi.org/10.1016/j.nlp.2023.100041
Received 1 July 2023; Received in revised form 1 November 2023; Accepted 20 November 2023

2949-7191/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).

Fig. 1. The stages of the dataset creation process.

As we know, for Malayalam and Hindi there is no online dataset for automatically detecting homophobia or transphobia. This is especially true for under-resourced languages like Malayalam and Hindi, where the culture itself treats LGBTQ+ issues as taboo, and even if a prominent social media figure raises a supportive voice, they are targeted and labeled as LGBTQ+ (Chakravarthi et al., 2021). The usage of forbidden terms simultaneously brands people as LGBTQ+ members, different, or defiant for breaking established customs, even if they are not LGBTQ+, and can be used to push them out of society, causing significant mental health concerns in vulnerable individuals (Chakravarthi et al., 2022a). Malayalam (ISO 639-3: mal), Hindi (ISO 639-3: hin), Tamil (ISO 639-3: tam), Kannada (ISO 639-3: kan), and Telugu (ISO 639-3: tel) are the five major literary languages of India. Each of the five languages is officially recognized by the Indian government as one of the 22 scheduled languages (Thamburaj and Rengganathan, 2015). Despite millions of people speaking these languages, the tools and resources for developing strong NLP applications for them are underdeveloped.

In this paper, our study involves the creation of a dataset for the Malayalam and Hindi languages. This dataset has specific distinguishing characteristics that set it apart from previous hate speech or offensive language identification datasets. First, we use chronological and organized discussion comments, which means annotators evaluate previous comments before labeling each post. Second, we differentiate between various types of LGBTQ+ abuse, such as derogatory and personally threatening words, and we also consider anti-discrimination and hope-filled sentences. Third, rather than employing crowd-sourced laborers, we rely on skilled volunteer annotators from the LGBTQ+ community who identify as LGBTQ+ or LGBTQ+ supporters, and we use guided meetings rather than a majority vote to decide on the final labels. These aspects work in combination to provide a high-quality dataset. We believe that our study will assist in enhancing the automatic identification of homophobic and transphobic comments. This dataset may also be used to eradicate such content or to generate informed textual responses. In the upcoming sections, we describe the dataset creation process and provide an experimental analysis investigating homophobia and transphobia detection in YouTube comments on Dravidian languages (see Fig. 2). We address the following research questions:

1. What are the most predictive features to distinguish between homophobic/transphobic content and non-anti-LGBT+ content in social media?
Differentiating between homophobic/transphobic content and non-anti-LGBT+ content on social media can be challenging. Some potentially predictive features include keywords and phrases associated with discrimination, sentiment and language indicating negativity and aggression, the context of the content, user behavior and engagement with anti-LGBT+ communities, hashtag sentiment and content, and visual content containing hate speech symbols.

2. How well do pre-trained transformer models classify text from user-generated social media that contains homophobia/transphobia?
We explore the effectiveness of pre-trained transformer models in classifying user-generated social media text that contains homophobia or transphobia. By training these models on labeled datasets and evaluating their performance, the study aims to understand their ability to accurately identify and classify harmful content. The findings can contribute to content moderation efforts, online safety measures, and the development of automated systems to detect discriminatory language, ultimately improving social media moderation techniques and addressing the challenges of promoting inclusive online communities. We provide a brief explanation of the results obtained from the transformer models in Section 4.3.

3. Is the knowledge about homophobia/transphobia learned from one language informative for predicting homophobic/transphobic content in other languages?
We investigate whether knowledge about homophobia/transphobia learned from one language is informative for predicting such content in other languages. The study employs cross-lingual transfer learning among Tamil, English, Malayalam, Tamil-English, and Hindi to examine the transferability of models trained in one language to predict homophobic/transphobic content in different languages. The findings can provide insights into the generalizability of knowledge about these issues across languages, enabling the development of more effective content moderation systems and fostering inclusive online communities on a broader scale. Section 4.4 briefly explains the cross-lingual approach.

Our main contributions to the paper are as follows:

• We have created a new dataset that has binary and fine-grained classes for homophobia and transphobia detection in Malayalam and Hindi.
• We have uncovered the highly predictive features within our machine learning models for five language settings, crucial for accurate outcome predictions.
• We employed pre-trained large language models to predict instances of homophobia and transphobia within social media comments.
• We conducted a cross-lingual analysis to predict instances of homophobia and transphobia in comments across different language settings.

2. Related work

In recent years, technological advancements in social computing, such as sentiment analysis (SA) and offensive language identification (OLI), have enabled the analysis of people's opinions and sentiments, supporting various sectors such as marketing and customer service (Chakravarthi et al., 2022b). Numerous researchers have examined the subject of automatically identifying offensive language and hate speech in social media networks. Several methods have been developed, ranging from traditional rule-based and machine learning approaches to advanced deep learning-based approaches (Risch and Krestel, 2018; Chakravarthi et al., 2023). The aim is to minimize the likelihood and severity of offensive information being transmitted via the Internet. The Internet enables us to communicate our views anonymously or offers the appearance of anonymity.


While conversing on social media, we frequently do not see the other person's face or expression. Such an atmosphere fosters opportunities for offending others and facilitates the dissemination of hostile content, as seen in the growing number of such posts (Arshad et al., 2023).

As a result of these efforts, numerous social media platforms, including Facebook and Twitter, currently use human, semi-automatic, and fully automated approaches to detect offensive language, with different degrees of success (Mandl et al., 2019). There are two traditional techniques for solving sentiment classification problems: lexicon-based approaches and machine-learning approaches (Habimana et al., 2020). Since 1966, as lexicons have grown in prominence in the field of sentiment classification, new lexicons such as WordNet (Fellbaum, 1998), WordNet-Affect (Strapparava et al., 2004), SentiNet (Poria et al., 2012), and SentiWordNet (Esuli and Sebastiani, 2006) have been widely used. Despite their popularity, classical machine learning and lexicon-based algorithms are inefficient when applied to user-generated content due to the dynamic nature of such data. Deep learning algorithms shine in this area due to their efficiency in adapting to dynamic user-generated content. GloVe (Pennington et al., 2014), Word2Vec (Mikolov et al., 2013), and fastText (Bojanowski et al., 2017) all have advantages and disadvantages in the context of transfer learning. Manual approaches rely on user complaints about offensive language or hate speech content, or on hiring employees whose job is to continuously monitor postings and delete or suspend accounts that include hate speech content. While this task is time-consuming, labor-expensive, and wasteful in and of itself, it is exacerbated when non-English content is involved, since there are insufficient resources to identify potentially damaging information. Moreover, homophobic and transphobic content is not taken seriously in regions where Malayalam and Hindi are spoken. For sentiment analysis, a Malayalam code-mixed dataset was released together with benchmark systems using a few machine learning models and deep neural networks such as CNN and BERT (Chakravarthi et al., 2020).

Race, politics, religion, and financial standing all contribute to a person's identity. While this diversity is to be celebrated, people's numerous identities can result in multiple levels of prejudice when combined. Therefore, an individual may be more likely to be victimized by hate crimes motivated by discrimination. All around the world, hatred is directed toward specific groups of individuals from varied origins (Faulkner and Bliuc, 2016): for example, racism based on anti-Semitic sentiment, anti-African American sentiment (primarily in the United States), anti-Indian and anti-Pakistani sentiment (in the United Kingdom), and anti-Arab and anti-Muslim sentiment (in much of Europe) (Zampieri et al., 2019). There is a wide range of hateful content on the Internet, including racist, homophobic, transphobic, and sexist comments, as well as insults and threats directed at individuals or organizations. It has become a big concern for online communities as a result of the proliferation of online material (Kumar et al., 2018).

Recent methods that heavily rely on deep learning require massive amounts of labeled data; nevertheless, the majority of this research only addresses English or a select few major languages (Malmasi and Zampieri, 2018). On the other hand, many languages do not have enough labeled undesirable-content data (Subramanian et al., 2022a). A number of researchers have recently implemented unacceptable speech identification in low-resource languages by making use of cross-lingual word embeddings (Santhiya et al., 2022; Ali et al., 2022; Priyadarshini et al., 2023). The first step in the process involves training a classification model to recognize unacceptable language using a high-resource language as the data source. Transfer learning is then used to move the trained model from a language with many available resources to a language with fewer resources. The process of finding a shared embedding space that can serve as a basis of categorization for both languages is the most important part of these approaches (Bigoulaeva et al., 2021).

In the discipline of sentiment analysis, the cross-lingual context has garnered considerable interest (Balamurali et al., 2012; Rasooli et al., 2018; Xu et al., 2022). The oldest and simplest technique for translating data from the source language to the target language was to use automatic tools: in the case of Brooke et al. (2009), English was translated into Spanish, and in the case of Demirtas and Pechenizkiy (2013), from English to Turkish. Other studies have recommended the use of a bilingual sentiment lexicon, also generated through machine translation, to facilitate the transfer of knowledge between languages (Mihalcea et al., 2007; Meng et al., 2012). Despite the drawbacks of machine translation, these studies determined that this straightforward approach is frequently highly effective.

Our study differs from prior studies in that we establish a taxonomy of homophobia and transphobia at several levels, examine hope speech and counter-speech, and present a dataset of Malayalam, English, Tamil, and Tamil-English code-mixed content. To our knowledge, this is the first work to create a dataset for homophobia and transphobia in Malayalam and Hindi, which are under-resourced languages.

3. Dataset

The data produced by social media users on platforms like YouTube, Twitter, and Facebook is rapidly increasing, and it may cause emotional damage to an individual or group. YouTube is growing increasingly popular throughout the Indian subcontinent due to the vast amount of content available on the internet, such as music, courses, product reviews, and trailers. Users can upload content to YouTube, and other individuals can comment on it. As a result, it hosts a large amount of user-generated content in languages with scarce resources. This also applies to LGBTQ+ people, who watch comparable videos and comment on the ones with which they identify. First, we collect the data that makes them uncomfortable, so that such activities can be stopped; with this data, we can train our model to identify such text. We also applied the same model settings used for the Malayalam and Hindi datasets to the Tamil, English, and Tamil-English datasets to analyze performance on homophobic and transphobic texts.

Malayalam (ISO 639-3: mal) is a Dravidian language with a high level of agglutination. It evolved in the last quarter of the ninth century A.D. (Sekhar, 1951). In the 16th century, the steep Western Ghats separated the dialect from the leading speech group, and it gradually developed into a separate language. The Ramacaritam is the earliest literary work composed in Malayalam. The language combines Tamil and Sanskrit and uses the Tamil Grantha script for transcribing Sanskrit and foreign words in Tamil Nadu. Malayalam has a total of 57 letters, comprising 13 vowels, 36 consonants, and five chillu letters, along with the anusvara, visarga, and chandrakkala signs (Kumar and Chandran, 2015).

Hindi (ISO 639-3: hin) is the prevalent language of India's Hindi belt, and it is written in the Devanagari script, which is also used for Sanskrit, Marathi, and Nepali. Hindi is an Indo-Aryan language spoken mainly in India. It is the fourth most widely spoken language in the world, with over 500 million speakers. However, the word order in Hindi and English differs, with Hindi using Subject-Object-Verb (SOV) order and English using Subject-Verb-Object (SVO) order (Meetei et al., 2019).

Tamil (ISO 639-3: tam) is the primary language of Tamil Nadu and Pondicherry, a state and a territory of India, and is a scheduled language under the Indian constitution. It is also recognized as one of Sri Lanka's and Singapore's official languages. A significant number of people speak Tamil in other south Indian states and territories: Kerala, Karnataka, Andhra Pradesh, and Telangana, as well as the Andaman & Nicobar Islands (a Union Territory) (Sakuntharaj and Mahesan, 2016, 2017, 2021; Thavareesan and Mahesan, 2019, 2020a,b, 2021). Tamil has 247 letters, including 18 consonants, 12 vowels, and 216 compound letters, followed by a unique character (Hewavitharana and Fernando, 2002).

Code-mixing is the combination of two or more languages at the article, paragraph, comment, phrase, sentence, word, or morpheme level. It is a distinct feature of speech or discourse in bilingual and multilingual communities (Barman et al., 2014; Chakravarthi, 2022b).


Fig. 2. Visualization of the overall proposed system.

A bilingual or multilingual speaker converts their speech into another language. Social media users write using the Roman alphabet, which increases the odds of code-mixing with a Roman-alphabet language (Subramanian et al., 2022b; Hande et al., 2022). When it comes to code-mixing tasks, most language pairings are under-resourced (Barman et al., 2014; Jose et al., 2020).

3.1. Data collection

As far as we know, for the Malayalam and Hindi languages, no data is available for automatically identifying homophobia and transphobia. We chose to gather data from users' comments on YouTube.1 We did not utilize comments from personal coming-out stories of LGBTQ+ people, since they included personal information. Instead, we gathered videos from famous YouTubers that explain LGBTQ+ topics, in the hope that people will be more welcoming. To guarantee that our dataset contains enough homophobic and transphobic insults, we started with YouTube prank videos titled "Gay Prank", "Transgender Prank", and "Legalizing Homosexuality". Some videos emphasized the positive aspects of being transgender, but the majority of the videos from news and popular networks portrayed transgender individuals as exploiters and instigators of disputes. Finding a YouTube video discussing LGBTQ+ problems in Malayalam and Hindi was challenging because the topic is still taboo, marriage equality is not legal, and homosexuality was unlawful in India until recently. We gathered all of these videos. The comments were collected using the YouTube Comment Scraper tool.2 We used these comments to create manually annotated datasets (see Fig. 1).

3.2. Annotation

3.2.1. Annotation process

After preprocessing and cleaning, we thoroughly inspected the text to verify that it includes no personally identifiable information. When developing training datasets for sensitive issues like ours, producing high-quality annotations is difficult. It is influenced by various factors, including annotators' cultural backgrounds and personal prejudices. It was challenging to recruit annotators, since the LGBTQ+ issue is taboo, and it was not easy to locate willing Malayalam and Hindi annotators. We found LGBTQ+ or LGBTQ+-ally volunteer annotators after a lengthy search. We trained them by presenting YouTube videos describing what LGBTQ+ means, followed by a video discussing whether being LGBTQ+ is normal or abnormal.3,4 We are responsible for upholding ethical principles and protecting vulnerable people's confidentiality and privacy. Data from social media is particularly sensitive when it comes to minorities like the LGBTQ+ community. Before sending the comments to annotators, we removed all user IDs, phone numbers, and addresses. We took considerable care to lower the risk of individual identification in the data by deleting personal information from the dataset. We used Google Forms for the annotation; likewise, we utilized the annotator's email address to ensure they could only annotate once. A maximum of 100 comments are permitted on a single Google Form. We contacted people from the college community for annotation, but we got few responses. All annotators are postgraduates. We gave them 30 min of orientation, with an explanation of the terms and with well-known YouTubers explaining LGBT+ people and their issues. Three people annotated each comment. A label was accepted if more than two annotators agreed on it; otherwise it was recorded as a disagreement. Whenever the annotators felt uncomfortable with the annotation, they were given the option to quit.

3.2.2. Taxonomy

Hateful comments toward homosexual and transgender people are known as homophobia and transphobia (Haaga, 1991). Transphobia is a problem faced by many transgender people, who frequently experience rejection and neglect in both gay and straight societies. Due to ignorance and hatred, many transgender people are prevented from coming out or identifying themselves, which further obscures them. We created a three-level hierarchical taxonomy, shown in Fig. 3. First, we establish a ternary division between content that is homophobic, transphobic, and non-anti-LGBT+. We then expanded this division of homophobic, transphobic, and non-anti-LGBT+ content into 3-class, 5-class, and 7-class schemes. We describe two categories each of homophobic and transphobic content, and three categories of non-anti-LGBT+ content:

1. Homophobic content is a type of gender-based abusive comment that targets those who identify as gay, lesbian, bisexual, queer, or gender non-conforming using negative labels or insulting language (Meyer, 2008; Poteat and Rivers, 2010). Lesbophobia, gayphobia, and biphobia are three families of phobias that target different groups; all three are covered under homophobia in this study.

• Homophobic derogation is the term used for an insult to, or lack of respect for, vulnerable LGBTQ people. It includes words that are explicitly derogatory and insulting, such as "homo", "dyke", "fag", and "lezza", as well as statements that imply hatred or anger toward LGBTQ people, like "no homo". Likewise, the words "gay" and "poof" are often used to characterize anything considered to be "uncool", "non-normative", or "unmasculine" (Thurlow, 2001). For example, "Gays are fit for nothing", "everybody hates gays", "Whom do you like boy or girl? How does that work?".
1 https://www.youtube.com/
2 https://github.com/ahmedshahriar/youtube-comment-scraper.git
3 https://www.youtube.com/watch?v=4XCErIq3U6U&t=3s
4 https://www.youtube.com/watch?v=XxrA4hx6EQQ


Fig. 3. Hierarchical taxonomy for the fine-grained labels.

• Homophobic threatening is content that supports, promotes, argues for, or incites violence against LGB people, and that implies intent to harm or cause harm. It is a form of "explicit" abuse. It might involve sexual violence like rape, harassment, or penetration, as well as non-sexual physical acts of violence like beating, injuring, or murdering. It can also involve invasions of privacy, like disclosing private information. For example, "I will you gay boy".

2. Transphobic content is a gender-based abusive comment that targets those who identify as transgender (having transitioned from male to female or female to male) using negative labels or insults. People who are transphobic can be straight or homosexual, and can be transphobic without being homophobic.

• Transphobic derogation is the phrase for negativity about transgender individuals. This content might be directly or indirectly abusive: negative stereotypes about the abilities of vulnerable transgender people, such as a lack of femininity or emotional control, or morally repugnant opinions about transgender people, such as claiming that they are somehow less than or unequal to women or men. For example, "ohh you are not a female anymore" and "Shame to be like you".

• Transphobic threatening is anything that discusses, promotes, stimulates, or plans unfavorable or harmful treatment of transgender people. It comprises both expressing the desire to take action against transgender people and stating preferences for how they should be treated, including using abusive words, using force physically or sexually, or invading their personal space. For example, "beat that transgender, no one is going to care about it" and "kill the transsyy, they are not fit for living in this world".

3. Non-anti-LGBTQ+ content is divided into three categories.

• Counter speech is a non-aggressive response that offers criticism through fact-based arguments. It might say, for instance, "What you said is unacceptable", "That's very homophobic", or "That's not how LGBT+ act, you're so incorrect", or it might give an opposing opinion that refutes the homophobia or transphobia.

• Hope speech promotes hope and courage, which positively impacts readers (Youssef and Luthans, 2007). We refer to it as motivational discourse on how people deal with challenging circumstances and prevail over them (Chakravarthi, 2022a; Snyder, 2002).

• None of the above is a category that is completely unrelated to hate against LGBT+ individuals. Comments that do not suit the above categories come under this category.

These taxonomies are inspired by the work done on misogyny by Guest et al. (2021) and Chakravarthi et al. (2022a).

3.2.3. Annotation evaluation

Inter-annotator agreement measures the degree to which annotators agree on their ratings. It is essential to ensure that the annotation scheme is consistent and that multiple raters can give the same label to the same comment. We utilize Krippendorff's alpha coefficient (Krippendorff, 1970) because it applies to any number of annotators. We utilized NLTK (the nltk.metrics.agreement module).5 We obtained a Krippendorff's alpha value of 0.72 for our dataset.
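To make the agreement check concrete, the following is a minimal sketch of computing Krippendorff's alpha with the nltk.metrics.agreement module cited above; the annotator IDs, comment IDs, and labels are hypothetical stand-ins for our real three-annotator records.

```python
from nltk.metrics.agreement import AnnotationTask

# Each record is (annotator_id, item_id, label). The values below are
# hypothetical stand-ins for the real annotation records.
data = [
    ("a1", "c1", "Homophobic"),  ("a2", "c1", "Homophobic"),  ("a3", "c1", "None"),
    ("a1", "c2", "Transphobic"), ("a2", "c2", "Transphobic"), ("a3", "c2", "Transphobic"),
    ("a1", "c3", "None"),        ("a2", "c3", "None"),        ("a3", "c3", "None"),
]

task = AnnotationTask(data=data)  # default nominal (binary) label distance
print(f"Krippendorff's alpha: {task.alpha():.2f}")  # the paper reports 0.72
```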
3.3. Corpus statistics

Table 1 provides an overview of the statistical characteristics of our datasets, including Malayalam and Hindi as well as several other languages. It gives the number of comments, tokens, and characters in the Malayalam, Hindi, English, Tamil, and Tamil-English datasets. Each comment in the datasets contains at least one sentence, and on average ten tokens are observed per sentence. The datasets have been categorized into three classes, five classes, and seven classes, allowing for taxonomy expansion. To present a more granular analysis, Table 3, Table 4, and Table 5 offer detailed statistics for the 3-class, 5-class, and 7-class taxonomies, respectively. These tables facilitate a comprehensive understanding of the dataset distribution across the various classification schemes. Additionally, Table 2 provides statistics for the data split employed during model creation; it shows the specific divisions of the dataset used for training, validation, and testing, providing valuable insight into the models' construction (see Figs. 4 and 5).

5 https://www.nltk.org/api/nltk.metrics.agreement.html

4. Experimental setup

4.1. Feature extraction

• Term Frequency and Inverse Document Frequency (TF-IDF) is a statistical measure that assesses how relevant a word is to a document in a collection of documents. It is computed by multiplying two metrics: how frequently a word appears in a document, and the inverse document frequency of the word across a set of documents. The intuition behind it is that if a word occurs many times in a document, we should boost its relevance, as it should be more meaningful than words that appear fewer times. This is done with the help of scikit-learn.6

• BERT Embedding (BE) (Devlin et al., 2019) is a word embedding that generates vectors based on the context of the phrase as well as the words of the sentence. BERT is a general language model that creates sub-word contextualized word embeddings. In contrast to static non-contextualized word embeddings, BERT captures short- and long-term contextual dependencies in the input text through bidirectional self-attention transformers. In the BERT embedding, the sentence is tokenized first, with the [CLS] token concatenated at the start and the [SEP] token at the end; a token embedding of size 768 is then created for each token.

• fastText (FT) (Grave et al., 2018) is a pre-trained vector model that learns sub-word information and allows users to construct representations for uncommon or out-of-vocabulary terms. It was trained on Common Crawl and Wikipedia data for 157 languages with Continuous Bag-Of-Words (CBOW) position weights in 300 dimensions. For a set of N documents, the classification variant minimizes the negative log-likelihood over the classes, −(1/N) Σ_{n=1}^{N} o_n log(f(Y X i_n)), with a decaying learning rate, where i_n is the normalized bag of words of the nth document, o_n its label, X and Y the weight matrices, and f the softmax function.

Fig. 4. Visualization of the number of comments in the datasets.

Table 1
In-depth statistics of the datasets.
Language No. of comments Number of tokens Number of characters
Malayalam 5,193 63,162 456,659
Hindi 3,203 56,796 306,488
English 4,946 116,015 632,221
Tamil 4,161 255,578 787,177
Tamil-English 6,034 88,303 628,077

Table 2
Split-wise statistics of the datasets.
Splits Mal Hin Eng Tam Tam-Eng
Train 3,114 2,305 3,164 2,662 3,861
Dev 866 641 792 666 966
Test 1,213 257 990 833 1,207
Total 5,193 3,203 4,946 4,161 6,034

Table 3
Class-wise data statistics of the 3 class taxonomy.
Classes Mal Hin Eng Tam Tam-Eng
Homophobic 806 57 276 723 465
Transphobic 290 108 13 233 184
Non-anti-LGBT+ content 4,097 3,038 4,657 3,205 5,385

Table 4
Class-wise data statistics of the 5 class taxonomy.
Classes Mal Hin Eng Tam Tam-Eng
Homophobic 802 57 276 723 465
Transphobic 300 108 13 233 184
Counter-speech 250 210 486 336 281
Hope-Speech 119 441 687 335 317
Non-anti-LGBT+ content 3,722 2,387 3,484 2,534 4,787

Table 5
Class-wise data statistics of the 7 class taxonomy.
Classes Mal Hin Eng Tam Tam-Eng
Homophobic-derogation 729 50 262 661 408
Transphobic-derogation 286 105 11 170 105
Homophobic-Threatening 84 7 14 62 57
Transphobic-Threatening 15 3 2 63 79
Counter-speech 258 210 486 336 281
Hope-Speech 120 441 687 335 317
None-of-the-above 3,701 2,387 3,484 2,534 4,787

4.2. ML models

After extracting the features, we trained machine learning models to classify the text using the sklearn package.7 The models used are explained one by one below; a combined usage sketch follows the list.

• Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification or regression tasks. It employs a technique called the kernel trick, which transforms the input data into a higher-dimensional space. In this transformed space, SVM finds an optimal decision boundary that maximally separates the different classes or predicts the continuous output. The sklearn package, a popular machine-learning library in Python, provides efficient implementations of SVM algorithms.

• Random Forest (RF) is an ensemble learning method that combines multiple decision trees to make more accurate and stable predictions. Each decision tree in the forest is built using a subset of the training data and a random selection of features. By introducing randomness in the tree construction process, random forest adds diversity to the models and reduces overfitting. The sklearn package includes an implementation of the random forest algorithm, making it easily accessible for machine learning tasks.

• Naive Bayes (NB) is a probabilistic model that is simple and fast for predicting the class of test data. It is based on Bayes' theorem and assumes that the features are conditionally independent given the class variable. Naive Bayes classifiers are particularly effective when the assumption of feature independence holds true. They perform well with categorical input variables compared to continuous variables, require less training data, and are computationally efficient. The sklearn package provides Naive Bayes implementations for various types of probability distributions.

• Decision Tree (DT) is a non-parametric model that estimates target values by following decision rules based on the attributes of input vectors. It resembles a tree structure, with internal nodes representing features, branches representing decision points based on those features, and leaf nodes representing outcomes or class labels. Decision trees are built recursively by selecting features that result in the maximum information gain, entropy reduction, or Gini index improvement at each node. The sklearn package offers decision tree algorithms with default parameter values such as 'gini' for the impurity criterion and a minimum sample split of 2.

• Logistic Regression (LR) is used as a classifier that predicts the output from linear combinations of the input; the logistic function is used to predict the probability of a specific class. The output of logistic regression is determined by the input as well as the corresponding system: the probability of the output p_o, the probability of the associated input p_i, and the probability of the system p_s are all taken into account, so the classifier predicts p_o with P(p_o | p_i, p_s) as a categorical output.

The models were trained using various feature sets with default parameter settings, such as C=1.0 and kernel='linear', for each dataset.

6 https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
7 https://scikit-learn.org
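To illustrate how the feature extractors of Section 4.1 feed the classifiers above, here is a minimal sketch pairing TF-IDF features with the five scikit-learn implementations; the tiny comment lists are hypothetical stand-ins for the real train/dev splits, and BERT-embedding or fastText feature matrices would be substituted for the vectorizer output in the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Hypothetical stand-ins for the real train/dev comments and labels.
train_texts = ["comment one", "comment two", "comment three", "comment four"]
train_labels = ["Non-anti-LGBT+", "Homophobic", "Non-anti-LGBT+", "Transphobic"]
dev_texts = ["comment five", "comment six"]
dev_labels = ["Non-anti-LGBT+", "Homophobic"]

# TF-IDF features; the same fitted vocabulary is reused for the dev split.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_dev = vectorizer.transform(dev_texts)

# The five classifiers with the default-style settings named above.
classifiers = {
    "LR": LogisticRegression(C=1.0),
    "NB": MultinomialNB(),
    "DT": DecisionTreeClassifier(criterion="gini", min_samples_split=2),
    "RF": RandomForestClassifier(),
    "SVM": SVC(C=1.0, kernel="linear"),
}

for name, clf in classifiers.items():
    clf.fit(X_train, train_labels)
    preds = clf.predict(X_dev)
    print(name, "macro F1:", f1_score(dev_labels, preds, average="macro"))
```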


Fig. 5. Visualization of the number of tokens and characters.

4.3. Transformer

• mBERT is Multilingual Bidirectional Encoder Representations from Transformers (Devlin et al., 2018). BERT is a semi-supervised language representation model that employs attention techniques: its encoder reads the complete sequence as input, rather than reading from a single direction, and uses contextual relations between the words. For classification, we employ BERT with a classification head and fine-tune all parameters in an end-to-end manner. Experiments were conducted using the Hugging Face library. Multilingual BERT was trained on 104 languages beyond English to acquire multilingual and cross-lingual representations better than other representations. We used an uncased version, since there is no case distinction in Malayalam.

• RoBERTa (Liu et al., 2019) is a Robustly Optimized BERT Pretraining Approach. Unlike BERT, it is not trained on the next-sentence prediction objective; instead, the language model is trained with the masked language modeling objective, larger mini-batches, and higher learning rates. RoBERTa outperforms the typical BERT baseline on downstream NLP tasks owing to these design choices. In our experiments, we utilized RoBERTa base and RoBERTa large.

• XLM-RoBERTa was proposed as an unsupervised cross-lingual representation approach that outperformed multilingual BERT (Conneau et al., 2020) on several cross-lingual benchmarks. It is a multilingual language model trained on 2.5 TB of filtered Common Crawl data. XLM-RoBERTa was fine-tuned for evaluation and inference on a range of downstream tasks using Wikipedia data in over 100 languages, and it has performed excellently in a number of multilingual NLP tasks. We used the base and large variants of the XLM-RoBERTa model. A fine-tuning sketch for these models follows the list.
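As a concrete illustration of how these transformer models are fine-tuned for our classification task, the following is a minimal sketch using the simple transformers library adopted in Section 4.5; the toy data frame and the integer label encoding are illustrative assumptions, not our actual files.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Hypothetical toy data; labels are integers encoding, for example,
# 0 = non-anti-LGBT+, 1 = homophobic, 2 = transphobic.
train_df = pd.DataFrame(
    {"text": ["comment one", "comment two", "comment three"], "labels": [0, 1, 2]}
)
dev_df = pd.DataFrame({"text": ["comment four"], "labels": [0]})

# Batch size 16 and three training epochs, as described in Section 4.5.
args = ClassificationArgs(num_train_epochs=3, train_batch_size=16)

model = ClassificationModel(
    "xlmroberta", "xlm-roberta-base",
    num_labels=3, args=args, use_cuda=False,  # set True when a GPU is available
)

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(dev_df)
print(result)
```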
4.4. Cross-lingual

The cross-lingual approach in machine learning and deep learning involves using techniques to process and understand text in multiple languages. It aims to develop models that can transfer knowledge and predictions across languages, even without explicit training in the target language. This approach is beneficial when resources are limited or when dealing with multilingual data. By leveraging similarities and patterns shared across languages, cross-lingual models provide valuable insights and predictions in diverse linguistic contexts. They enable the creation of shared representations that capture underlying linguistic structures. Applications include machine translation, sentiment analysis, and more. Cross-lingual techniques include multilingual embeddings, transfer learning, neural machine translation, and cross-lingual word alignment. These models bridge the gap between languages, facilitating knowledge exchange and enabling effective multilingual natural language processing.

4.5. Experiment setup

We completed the necessary data preparation steps, dividing the dataset into three sets: train (for model training), development (for model validation), and test (for model testing). Our main objective was to develop an automated system capable of detecting homophobic and transphobic content within these datasets. We aimed to categorize the content into three classes, five classes, and seven classes. To establish a baseline before exploring more advanced methods, we decided to experiment with various traditional machine learning models. However, since these models cannot directly process string data, we needed to convert the textual information into numerical vectors. To accomplish this, we employed three different methods for feature extraction: TF-IDF, BERT embeddings (BE), and fastText (FT). These methods are well-suited for extracting features from multilingual texts. After extracting the features, we trained several classifiers, including LR, NB, DT, RF, and SVM, using the scikit-learn library.

Next, we delved into the usage of selected transformer architectures with our dataset. Specifically, we utilized mBERT, RoBERTa (base and large), and XLM-RoBERTa (base and large) models. These pre-trained transformer models are specifically designed to handle multilingual text data and offer advantageous features for languages other than English.
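A minimal sketch of the cross-lingual setting of Section 4.4, under the same illustrative assumptions as above (hypothetical file names and a shared integer label encoding across languages): a model is fine-tuned on one language and evaluated directly on another language's test set.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Hypothetical file names; the label encoding must be shared across
# languages so that the classification head transfers directly.
source_train = pd.read_csv("tamil_train.csv")    # columns: text, labels
target_test = pd.read_csv("malayalam_test.csv")  # columns: text, labels

model = ClassificationModel(
    "xlmroberta", "xlm-roberta-base", num_labels=3,
    args=ClassificationArgs(num_train_epochs=3, train_batch_size=16),
)

# Fit on the source language only...
model.train_model(source_train)

# ...then evaluate zero-shot on the target language, relying on
# XLM-RoBERTa's multilingual pre-training to transfer the knowledge.
result, _, _ = model.eval_model(target_test)
print(result)
```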


For training and evaluation, we set the batch size to 16 and trained the models for three epochs. To improve performance, we made use of CUDA. The implementation of the transformer models was done using the simple transformers library.8 We acquired valuable information about these models from the Hugging Face9 website.

Throughout our experimentation, we faced challenges related to time constraints and computational power. To overcome these limitations, we utilized paid cloud GPUs from Google Colab Pro10 as well as free GPUs from Kaggle.11 This allowed us to accelerate our experiments and effectively handle the computational demands. In the following section, we present the evaluation results, showcasing the performance of each model on the Malayalam, Hindi, English, Tamil, and Tamil-English datasets.
5. Results and evaluation

In this section, we report the macro and weighted averages of precision (P), recall (R), and F1 score (F1) for our traditional as well as transformer-based models, which identify offensive comments/posts and further classify them into the classes of homophobia, transphobia, counter speech, hope speech, and offensive untargeted for all languages. TP, FP, FN, and TN are used for calculating the evaluation metrics.

Accuracy evaluates how frequently the classifier predicts correctly. Precision illustrates how many of the predicted positive events were actually positive (Chhetri et al., 2022). Recall describes how many actual positive events our model estimates correctly (Akosa, 2017). The F1-score is the harmonic mean of precision and recall; it accounts for both false positives and false negatives, and as a result it performs effectively on an unbalanced dataset (Chhetri et al., 2022). The macro average is a simple arithmetic mean of each measure across all classes. This method assigns equal weight to every class, making it an excellent choice for unbalanced data. The weighted average computes each measure for every class individually, but when adding them together it uses a weight depending on support (the number of true labels for each class); as a result, it favors the majority classes.

Table 6
Results for the Malayalam 3 class dataset.
Features Classifiers Acc MP MR MF1 WP WR WF1
TF-IDF LR 0.84 0.94 0.46 0.52 0.87 0.84 0.80
TF-IDF NB 0.85 0.94 0.49 0.55 0.87 0.85 0.80
TF-IDF DT 0.91 0.96 0.73 0.82 0.92 0.91 0.90
TF-IDF RF 0.91 0.96 0.73 0.81 0.91 0.91 0.89
TF-IDF SVM 0.91 0.96 0.73 0.81 0.91 0.91 0.89
BERT embeddings LR 0.88 0.95 0.63 0.72 0.90 0.88 0.86
BERT embeddings NB 0.85 0.94 0.49 0.55 0.87 0.85 0.80
BERT embeddings DT 0.91 0.96 0.73 0.82 0.92 0.91 0.90
BERT embeddings RF 0.91 0.96 0.73 0.82 0.92 0.91 0.90
BERT embeddings SVM 0.90 0.95 0.72 0.80 0.91 0.90 0.89
fastText LR 0.80 0.27 0.33 0.30 0.64 0.80 0.71
fastText NB 0.73 0.48 0.52 0.49 0.76 0.73 0.74
fastText DT 0.86 0.72 0.87 0.78 0.88 0.86 0.87
fastText RF 0.95 0.98 0.86 0.91 0.95 0.95 0.94
fastText SVM 0.81 0.91 0.37 0.37 0.84 0.81 0.74
RoBERTa Base 0.82 0.53 0.39 0.39 0.77 0.82 0.76
RoBERTa Large 0.80 0.27 0.33 0.30 0.64 0.80 0.71
mBERT uncased 0.83 0.48 0.44 0.45 0.77 0.83 0.79
XLM-RoBERTa Small 0.80 0.27 0.33 0.30 0.64 0.80 0.71
XLM-RoBERTa Large 0.80 0.27 0.33 0.30 0.64 0.80 0.71

Table 7
Results for the Malayalam 5 class dataset.
Features Classifiers Acc MP MR MF1 WP WR WF1
TF-IDF LR 0.79 0.54 0.32 0.36 0.76 0.79 0.73
TF-IDF NB 0.78 0.65 0.30 0.34 0.78 0.78 0.71
TF-IDF DT 0.83 0.73 0.44 0.50 0.82 0.83 0.80
TF-IDF RF 0.83 0.68 0.43 0.48 0.82 0.83 0.80
TF-IDF SVM 0.83 0.71 0.43 0.49 0.82 0.83 0.80
BERT embeddings LR 0.81 0.75 0.38 0.44 0.82 0.81 0.77
BERT embeddings NB 0.78 0.65 0.30 0.34 0.78 0.78 0.71
BERT embeddings DT 0.83 0.76 0.44 0.50 0.82 0.83 0.80
BERT embeddings RF 0.83 0.69 0.43 0.49 0.82 0.83 0.80
BERT embeddings SVM 0.83 0.74 0.42 0.48 0.83 0.83 0.79
fastText LR 0.73 0.15 0.20 0.17 0.53 0.73 0.62
fastText NB 0.53 0.34 0.35 0.29 0.67 0.53 0.57
fastText DT 0.70 0.43 0.51 0.45 0.75 0.70 0.72
fastText RF 0.88 0.70 0.53 0.57 0.86 0.88 0.85
fastText SVM 0.74 0.53 0.23 0.22 0.73 0.74 0.65
RoBERTa Base 0.73 0.15 0.20 0.17 0.53 0.73 0.61
RoBERTa Large 0.73 0.15 0.20 0.17 0.53 0.73 0.61
mBERT uncased 0.78 0.47 0.39 0.39 0.74 0.78 0.75
XLM-RoBERTa Small 0.73 0.15 0.20 0.17 0.53 0.73 0.62
XLM-RoBERTa Large 0.77 0.28 0.29 0.28 0.67 0.77 0.72

5.1. Machine learning and LLModels

We report the findings for 300 models: five classifiers with three sets of feature extractors, plus five deep classifiers, across the fifteen data settings. Tables 6, 7, and 8 show the performance of the ML and transformer models for Malayalam for 3, 5, and 7 classes, and the results for Hindi for 3, 5, and 7 classes are in Tables 9, 10, and 11. For Tamil, English, and Tamil-English, the results for 3 classes, 5 classes, and 7 classes are shown in Tables 12, 13, 14, 15, 16, 17, 18, 19, and 20, which give the classification performance of the machine learning and deep learning models for the 3-class, 5-class, and 7-class data settings. We utilized scikit-learn12 for getting the classification report and confusion matrix. Based on the macro F1 score, we identified the best models among those experimented with for all the languages, specifically Malayalam and Hindi.
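For concreteness, this is a minimal sketch of how the per-class reports behind the tables, including the macro and weighted averages, can be produced with the scikit-learn classification report cited in footnote 12; y_true and y_pred are hypothetical stand-ins for the gold and predicted labels.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical gold and predicted labels for a 3-class setting.
y_true = ["None", "Homophobic", "None", "Transphobic", "None"]
y_pred = ["None", "None", "None", "Transphobic", "None"]

# Reports per-class precision/recall/F1 plus the macro average
# (unweighted mean over classes) and the weighted average
# (support-weighted mean, which favors the majority class).
print(classification_report(y_true, y_pred, zero_division=0))
print(confusion_matrix(y_true, y_pred))
```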
For Malayalam, we observed that the random forest model with fastText features consistently outperformed the other models in terms of macro F1 scores on 3 classes, 5 classes, and 7 classes. This indicates that the random forest model was more effective in capturing the underlying patterns and nuances in the data compared to the transformer models. The random forest model's success can be attributed to its ability to handle non-linear relationships and complex interactions between features. It leverages an ensemble of decision trees to make predictions, allowing it to capture a wide range of possible feature interactions. In contrast, transformer models rely on self-attention mechanisms and sequential processing, which may not be as effective in capturing intricate relationships present in the data. Additionally, the random forest model's utilization of fastText features further enhanced its performance. FastText features are derived from sub-word embeddings, which can capture morphological and semantic information effectively. This linguistic richness likely contributed to the random forest model's superior performance, as it could better understand the complexities of the Malayalam language. While transformer models have gained prominence in various natural language processing tasks, these results highlight the importance of considering alternative machine learning models. In the case of Malayalam, the random forest model with fastText features emerged as a strong contender, demonstrating that traditional machine learning approaches can still deliver competitive performance in certain scenarios.

8 https://simpletransformers.ai
9 https://huggingface.co
10 https://colab.research.google.com/
11 https://www.kaggle.com/
12 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html


Table 8
Results for the Malayalam 7 class dataset.
Features Classifiers Acc MP MR MF1 WP WR WF1
TF-IDF LR 0.74 0.67 0.30 0.32 0.78 0.74 0.65
TF-IDF NB 0.73 0.60 0.21 0.24 0.75 0.73 0.64
TF-IDF DT 0.79 0.62 0.41 0.45 0.79 0.79 0.75
TF-IDF RF 0.79 0.64 0.41 0.44 0.80 0.79 0.74
TF-IDF SVM 0.79 0.65 0.41 0.44 0.80 0.79 0.74
BERT embeddings LR 0.77 0.65 0.37 0.40 0.79 0.77 0.70
BERT embeddings NB 0.73 0.60 0.21 0.24 0.75 0.73 0.64
BERT embeddings DT 0.79 0.62 0.41 0.45 0.79 0.79 0.80
BERT embeddings RF 0.79 0.64 0.41 0.44 0.80 0.79 0.74
BERT embeddings SVM 0.79 0.67 0.41 0.44 0.81 0.79 0.74
fastText LR 0.70 0.10 0.14 0.12 0.49 0.70 0.58
fastText NB 0.47 0.25 0.38 0.25 0.62 0.47 0.52
fastText DT 0.71 0.49 0.60 0.53 0.74 0.71 0.72
fastText RF 0.85 0.77 0.60 0.65 0.84 0.85 0.83
fastText SVM 0.71 0.53 0.21 0.22 0.72 0.71 0.60
RoBERTa Base 0.70 0.10 0.14 0.12 0.49 0.70 0.58
RoBERTa Large 0.70 0.10 0.14 0.12 0.49 0.70 0.58
mBERT uncased 0.77 0.49 0.33 0.36 0.74 0.77 0.73
XLM-RoBERTa Small 0.76 0.25 0.24 0.24 0.67 0.76 0.71
XLM-RoBERTa Large 0.70 0.10 0.14 0.12 0.49 0.70 0.58

Table 9
Results for the Hindi 3 class dataset.
Features Classifiers Acc MP MR MF1 WP WR WF1
TF-IDF LR 0.93 0.41 0.38 0.39 0.91 0.93 0.92
TF-IDF NB 0.95 0.32 0.33 0.32 0.90 0.95 0.92
TF-IDF DT 0.92 0.43 0.38 0.40 0.91 0.92 0.91
TF-IDF RF 0.95 0.32 0.33 0.32 0.90 0.95 0.92
TF-IDF SVM 0.95 0.32 0.33 0.32 0.90 0.95 0.92
BERT embeddings LR 0.95 0.32 0.33 0.32 0.90 0.95 0.92
BERT embeddings NB 0.21 0.34 0.37 0.15 0.90 0.21 0.31
BERT embeddings DT 0.90 0.42 0.42 0.41 0.91 0.90 0.90
BERT embeddings RF 0.95 0.32 0.33 0.32 0.90 0.95 0.92
BERT embeddings SVM 0.95 0.32 0.33 0.32 0.90 0.95 0.92
fastText LR 0.95 0.32 0.33 0.32 0.90 0.95 0.92
fastText NB 0.81 0.39 0.51 0.40 0.92 0.81 0.86
fastText DT 0.88 0.38 0.40 0.39 0.91 0.88 0.90
fastText RF 0.95 0.32 0.33 0.32 0.90 0.95 0.92
fastText SVM 0.95 0.32 0.33 0.32 0.90 0.95 0.92
RoBERTa Base 0.95 0.32 0.33 0.32 0.90 0.95 0.92
RoBERTa Large 0.95 0.32 0.33 0.32 0.90 0.95 0.92
mBERT uncased 0.95 0.32 0.33 0.32 0.90 0.95 0.92
XLM-RoBERTa Small 0.95 0.32 0.33 0.32 0.90 0.95 0.92
XLM-RoBERTa Large 0.95 0.32 0.33 0.32 0.90 0.95 0.92

Table 10
Results for the Hindi 5 class dataset.
Features Classifiers Acc MP MR MF1 WP WR WF1
TF-IDF LR 0.66 0.34 0.37 0.35 0.70 0.66 0.67
TF-IDF NB 0.74 0.15 0.20 0.17 0.55 0.74 0.63
TF-IDF DT 0.66 0.24 0.24 0.24 0.63 0.66 0.64
TF-IDF RF 0.74 0.28 0.21 0.19 0.62 0.74 0.64
TF-IDF SVM 0.76 0.41 0.26 0.26 0.70 0.76 0.70
BERT embeddings LR 0.73 0.22 0.22 0.20 0.60 0.73 0.65
BERT embeddings NB 0.08 0.24 0.23 0.08 0.69 0.08 0.08
BERT embeddings DT 0.63 0.28 0.28 0.28 0.65 0.63 0.64
BERT embeddings RF 0.74 0.30 0.23 0.22 0.64 0.74 0.66
BERT embeddings SVM 0.74 0.24 0.21 0.19 0.62 0.74 0.65
fastText LR 0.74 0.20 0.20 0.17 0.59 0.74 0.63
fastText NB 0.42 0.26 0.31 0.24 0.68 0.42 0.48
fastText DT 0.58 0.24 0.25 0.25 0.61 0.58 0.60
fastText RF 0.74 0.30 0.21 0.18 0.62 0.74 0.64
fastText SVM 0.74 0.15 0.20 0.17 0.55 0.74 0.63
RoBERTa Base 0.73 0.23 0.25 0.24 0.63 0.73 0.68
RoBERTa Large 0.74 0.15 0.20 0.17 0.55 0.74 0.63
mBERT uncased 0.76 0.25 0.29 0.27 0.67 0.76 0.71
XLM-RoBERTa Small 0.74 0.15 0.20 0.17 0.55 0.74 0.63
XLM-RoBERTa Large 0.74 0.15 0.20 0.17 0.55 0.74 0.63

Table 11
Results for the Hindi 7 class dataset.
Features Classifiers Acc MP MR MF1 WP WR WF1
TF-IDF LR 0.65 0.22 0.24 0.22 0.72 0.65 0.68
TF-IDF NB 0.72 0.12 0.17 0.14 0.52 0.72 0.61
TF-IDF DT 0.71 0.23 0.21 0.21 0.67 0.71 0.68
TF-IDF RF 0.73 0.32 0.17 0.15 0.71 0.73 0.62
TF-IDF SVM 0.76 0.28 0.20 0.20 0.70 0.76 0.68
BERT embeddings LR 0.73 0.21 0.19 0.19 0.64 0.73 0.66
BERT embeddings NB 0.08 0.22 0.17 0.07 0.79 0.08 0.10
BERT embeddings DT 0.61 0.18 0.18 0.18 0.63 0.61 0.62
BERT embeddings RF 0.73 0.36 0.19 0.20 0.67 0.73 0.64
BERT embeddings SVM 0.73 0.22 0.18 0.16 0.63 0.73 0.64
fastText LR 0.72 0.12 0.17 0.14 0.52 0.72 0.61
fastText NB 0.36 0.19 0.20 0.17 0.67 0.36 0.42
fastText DT 0.57 0.16 0.16 0.16 0.58 0.57 0.58
fastText RF 0.72 0.18 0.17 0.15 0.55 0.72 0.61
fastText SVM 0.72 0.12 0.17 0.14 0.52 0.72 0.61
RoBERTa Base 0.72 0.12 0.17 0.14 0.52 0.72 0.61
RoBERTa Large 0.72 0.12 0.17 0.14 0.52 0.72 0.61
mBERT uncased 0.75 0.22 0.22 0.22 0.66 0.75 0.69
XLM-RoBERTa Small 0.72 0.12 0.17 0.14 0.52 0.72 0.61
XLM-RoBERTa Large 0.72 0.12 0.17 0.14 0.52 0.72 0.61

likely played a crucial role in its success. By incorporating the semantic and syntactic knowledge learned by the transformer-based language model, the decision tree model was able to effectively capture the underlying patterns and nuances specific to Hindi. On the other hand, when considering the more complex classification scenarios of 5 classes and 7 classes in the Hindi language, the logistic regression model with TF-IDF features emerged as the best-performing model. Despite the popularity of transformer models in natural language processing tasks, the logistic regression model with TF-IDF features outperformed the transformer-based models in these multi-class settings. The logistic regression model, coupled with TF-IDF features, offers a simple yet effective approach for representing Hindi text: TF-IDF captures the importance of a word in a document by considering its frequency in that document as well as its rarity across the entire corpus. This model was able to effectively exploit the discriminative power of words in the Hindi language, leading to its superior performance compared to the transformer-based models. These findings highlight that while transformers have shown great potential in various NLP tasks, there are instances where traditional machine learning models, such as decision trees and logistic regression, can still yield exceptional results. The decision tree model's utilization of BERT embeddings and the logistic regression model's integration of TF-IDF features demonstrate the importance of considering alternative approaches and leveraging domain-specific knowledge to achieve optimal performance in different language tasks.
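To make this configuration concrete, the following minimal sketch reproduces a TF-IDF plus logistic regression pipeline with scikit-learn. The file names, column names, and hyperparameters (character n-gram range, class weighting) are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Minimal sketch of the TF-IDF + classical-classifier setup.
# train.csv / test.csv and the "text"/"label" columns are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

train = pd.read_csv("train.csv")   # columns: text, label
test = pd.read_csv("test.csv")

pipeline = Pipeline([
    # Character n-grams (an illustrative choice) are often robust to the
    # noisy spelling found in under-resourced, code-mixed comments.
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

pipeline.fit(train["text"], train["label"])
print(classification_report(test["label"], pipeline.predict(test["text"])))
```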
In the Tamil language task, we observed that the decision tree model with fastText features demonstrated strong performance in the 3 classes classification. The decision tree model's utilization of fastText features, which capture semantic information effectively, likely contributed to its success in capturing the underlying patterns and nuances specific to Tamil. However, when considering the more complex classification scenarios of 5 classes and 7 classes in Tamil, the random forest model with fastText features emerged as the top-performing model in terms of macro F1 scores. The random forest model's ensemble of decision trees, combined with the linguistic richness provided by fastText features, enabled it to effectively capture intricate relationships and achieve superior performance compared to other models, including the decision tree model.
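A comparable sketch for the fastText-plus-random-forest configuration is given below, using the fasttext Python package. The corpus file, embedding dimension, and forest size are assumptions; a pretrained Tamil model (e.g., cc.ta.300.bin loaded with fasttext.load_model) could be substituted for the unsupervised training step.

```python
# Sketch of the fastText-features + random forest configuration.
# tamil_comments.txt and the data containers are placeholder assumptions.
import fasttext
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Subword-aware skip-gram embeddings; subword units help with the noisy,
# romanised spelling that is common in code-mixed social media text.
ft = fasttext.train_unsupervised("tamil_comments.txt", model="skipgram", dim=100)

def embed(texts):
    # get_sentence_vector averages the word vectors over a comment;
    # newlines must be stripped because fastText treats them as EOS.
    return np.vstack([ft.get_sentence_vector(t.replace("\n", " ")) for t in texts])

# train_texts, train_labels, test_texts are assumed to be prepared elsewhere.
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
rf.fit(embed(train_texts), train_labels)
predictions = rf.predict(embed(test_texts))
```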


Table 12
Results for Tamil 3 class dataset.

Features          Classifiers          Acc   MP    MR    MF1   WP    WR    WF1
TF-IDF            LR                   0.82  0.91  0.41  0.43  0.84  0.82  0.77
                  NB                   0.80  0.62  0.39  0.40  0.77  0.80  0.74
                  DT                   0.85  0.79  0.55  0.61  0.84  0.85  0.83
                  RF                   0.86  0.82  0.55  0.62  0.86  0.86  0.84
                  SVM                  0.86  0.93  0.54  0.61  0.88  0.86  0.83
BERT embeddings   LR                   0.84  0.80  0.45  0.48  0.84  0.84  0.79
                  NB                   0.80  0.62  0.39  0.40  0.77  0.80  0.74
                  DT                   0.85  0.75  0.55  0.61  0.83  0.85  0.83
                  RF                   0.85  0.78  0.55  0.60  0.84  0.85  0.83
                  SVM                  0.84  0.92  0.45  0.49  0.85  0.84  0.79
fastText          LR                   0.79  0.26  0.33  0.29  0.62  0.79  0.70
                  NB                   0.67  0.49  0.60  0.51  0.79  0.67  0.71
                  DT                   0.82  0.65  0.67  0.66  0.84  0.82  0.83
                  RF                   0.91  0.95  0.68  0.76  0.92  0.91  0.90
                  SVM                  0.84  0.88  0.46  0.49  0.84  0.84  0.80
                  RoBERTa Base         0.79  0.26  0.33  0.29  0.62  0.79  0.70
                  RoBERTa Large        0.79  0.26  0.33  0.29  0.62  0.79  0.70
                  mBERT uncased        0.83  0.64  0.58  0.61  0.82  0.83  0.82
                  XLM-RoBERTa Small    0.81  0.46  0.47  0.46  0.76  0.81  0.78
                  XLM-RoBERTa Large    0.79  0.26  0.33  0.29  0.62  0.79  0.70

Table 14
Results for Tamil 7 class dataset.

Features          Classifiers          Acc   MP    MR    MF1   WP    WR    WF1
TF-IDF            LR                   0.65  0.43  0.19  0.20  0.65  0.65  0.55
                  NB                   0.63  0.51  0.20  0.21  0.65  0.63  0.53
                  DT                   0.74  0.82  0.45  0.54  0.76  0.74  0.71
                  RF                   0.73  0.80  0.43  0.52  0.76  0.73  0.70
                  SVM                  0.73  0.71  0.36  0.43  0.76  0.73  0.68
BERT embeddings   LR                   0.69  0.58  0.29  0.34  0.71  0.69  0.62
                  NB                   0.63  0.51  0.20  0.21  0.65  0.63  0.53
                  DT                   0.73  0.82  0.44  0.54  0.75  0.73  0.70
                  RF                   0.73  0.82  0.44  0.54  0.76  0.73  0.70
                  SVM                  0.69  0.63  0.27  0.32  0.72  0.69  0.62
fastText          LR                   0.61  0.09  0.14  0.20  0.37  0.61  0.46
                  NB                   0.58  0.47  0.51  0.46  0.67  0.58  0.60
                  DT                   0.72  0.62  0.61  0.61  0.73  0.72  0.72
                  RF                   0.83  0.96  0.60  0.72  0.86  0.83  0.81
                  SVM                  0.69  0.46  0.22  0.22  0.68  0.69  0.60
                  RoBERTa Base         0.61  0.81  0.14  0.11  0.37  0.61  0.46
                  RoBERTa Large        0.61  0.09  0.14  0.11  0.37  0.61  0.46
                  mBERT uncased        0.71  0.35  0.36  0.35  0.67  0.71  0.69
                  XLM-RoBERTa Small    0.01  0.01  0.03  0.01  0.00  0.01  0.01
                  XLM-RoBERTa Large    0.02  0.00  0.14  0.00  0.00  0.02  0.00

Table 13
Results for Tamil 5 class dataset.

Features          Classifiers          Acc   MP    MR    MF1   WP    WR    WF1
TF-IDF            LR                   0.67  0.76  0.28  0.30  0.70  0.67  0.58
                  NB                   0.66  0.81  0.28  0.30  0.73  0.66  0.56
                  DT                   0.74  0.74  0.50  0.57  0.74  0.74  0.72
                  RF                   0.75  0.76  0.51  0.58  0.76  0.75  0.72
                  SVM                  0.74  0.85  0.46  0.55  0.78  0.74  0.71
BERT embeddings   LR                   0.70  0.77  0.37  0.43  0.73  0.70  0.65
                  NB                   0.66  0.81  0.28  0.30  0.73  0.66  0.56
                  DT                   0.74  0.76  0.51  0.58  0.75  0.74  0.72
                  RF                   0.74  0.73  0.50  0.57  0.75  0.74  0.71
                  SVM                  0.70  0.85  0.35  0.40  0.75  0.70  0.63
fastText          LR                   0.62  0.12  0.20  0.15  0.38  0.62  0.47
                  NB                   0.55  0.43  0.51  0.45  0.65  0.55  0.58
                  DT                   0.75  0.62  0.66  0.64  0.76  0.75  0.75
                  RF                   0.85  0.95  0.67  0.77  0.88  0.85  0.84
                  SVM                  0.73  0.88  0.38  0.43  0.78  0.73  0.66
                  RoBERTa Base         0.62  0.12  0.20  0.15  0.38  0.62  0.47
                  RoBERTa Large        0.62  0.12  0.20  0.15  0.38  0.62  0.47
                  mBERT uncased        0.80  0.67  0.63  0.64  0.80  0.80  0.79
                  XLM-RoBERTa Small    0.08  0.04  0.13  0.06  0.02  0.08  0.03
                  XLM-RoBERTa Large    0.17  0.03  0.20  0.06  0.03  0.17  0.05

Table 15
Results for English 3 class dataset.

Features          Classifiers          Acc   MP    MR    MF1   WP    WR    WF1
TF-IDF            LR                   0.93  0.31  0.33  0.32  0.87  0.93  0.90
                  NB                   0.93  0.48  0.34  0.33  0.90  0.93  0.90
                  DT                   0.92  0.39  0.35  0.36  0.89  0.92  0.90
                  RF                   0.93  0.45  0.34  0.34  0.90  0.93  0.90
                  SVM                  0.93  0.48  0.34  0.33  0.90  0.93  0.90
BERT embeddings   LR                   0.93  0.64  0.34  0.33  0.93  0.93  0.90
                  NB                   0.93  0.48  0.34  0.33  0.90  0.93  0.90
                  DT                   0.92  0.41  0.36  0.37  0.89  0.92  0.91
                  RF                   0.93  0.41  0.34  0.34  0.89  0.93  0.90
                  SVM                  0.93  0.31  0.33  0.32  0.87  0.93  0.90
fastText          LR                   0.93  0.31  0.33  0.32  0.87  0.93  0.90
                  NB                   0.84  0.36  0.40  0.37  0.89  0.84  0.86
                  DT                   0.91  0.33  0.33  0.33  0.88  0.91  0.89
                  RF                   0.93  0.31  0.33  0.32  0.87  0.93  0.90
                  SVM                  0.93  0.31  0.33  0.32  0.87  0.93  0.90
                  RoBERTa Base         0.93  0.31  0.33  0.32  0.87  0.93  0.90
                  RoBERTa Large        0.90  0.30  0.33  0.32  0.81  0.90  0.85
                  mBERT uncased        0.93  0.31  0.33  0.32  0.87  0.93  0.90
                  XLM-RoBERTa Small    0.93  0.31  0.33  0.32  0.87  0.93  0.90
                  XLM-RoBERTa Large    0.93  0.31  0.33  0.32  0.87  0.93  0.90

Shifting to the English language datasets, we observed different model performances across the class configurations. In the 3 classes classification scenario, the decision tree model with BERT embeddings performed well. BERT embeddings capture contextual information, which likely contributed to the decision tree model's success in understanding the complexities of English text. For the 5 classes and 7 classes classification in English, mBERT (multilingual BERT) demonstrated strong performance. mBERT, being a transformer-based model pre-trained on multilingual text, was able to leverage its cross-lingual knowledge and contextual understanding to achieve competitive macro F1 scores across a wide range of classes, enabling it to outperform the other models in this setting.
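The "BERT embeddings" feature set referred to above can be obtained as sketched below: sentences are encoded with a multilingual BERT encoder, the token states are mean-pooled, and the resulting vectors are passed to a scikit-learn classifier. The checkpoint name, pooling strategy, and sequence length are illustrative assumptions.

```python
# Sketch of BERT embeddings as features for a classical classifier.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.tree import DecisionTreeClassifier

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
enc = AutoModel.from_pretrained("bert-base-multilingual-uncased").eval()

@torch.no_grad()
def embed(texts, batch_size=16):
    feats = []
    for i in range(0, len(texts), batch_size):
        batch = tok(texts[i:i + batch_size], padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
        hidden = enc(**batch).last_hidden_state        # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)   # exclude padding tokens
        feats.append((hidden * mask).sum(1) / mask.sum(1))  # mean pooling
    return torch.cat(feats).numpy()

# train_texts and train_labels are assumed to be prepared elsewhere.
dt = DecisionTreeClassifier(random_state=0)
dt.fit(embed(train_texts), train_labels)
```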
Moving to the Tamil-English datasets, we observed that different models excelled in different class configurations: the decision tree model with fastText features performed well in the 3 class scenario, the Naive Bayes model with fastText features showcased its strength in the 5 classes classification, and mBERT delivered a strong performance in the 7 classes classification scenario. These results further emphasize the importance of selecting models and features according to the specific language and class configuration. Overall, the success of decision tree, random forest, mBERT, and Naive Bayes models across these scenarios demonstrates the effectiveness of both traditional machine learning and transformer-based approaches in capturing the complexities of different languages and class configurations.

5.2. Cross-lingual analysis

This section showcases the cross-lingual approach across all languages, utilizing the best models from each class. The objective is to evaluate whether the models can accurately predict comments when applied to different languages.

Table 21 presents the results for all languages in three classes, showcasing the performance of the best models from each language when applied to the other languages. These best models were selected based on previous evaluations specific to each language. In the case of Malayalam, the TF-IDF with a decision tree model exhibited strong performance within its own language, achieving a macro F1 score of 0.82. However, when applied to the other languages in the same class, its performance decreased by approximately half. Nevertheless, the model still showed promise in predicting comments related to homophobia/transphobia, demonstrating its effectiveness within the Malayalam language. In comparison, the macro F1 scores for the other languages (Tamil, English, Hindi, and Tamil-English) were 0.32, 0.29, 0.35, and 0.35, respectively, indicating a decline in performance compared to the Malayalam language. The outcomes for the five-class and seven-class classifications are presented in Table 22 and Table 23, respectively.
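The protocol behind Tables 21-23 can be expressed in a few lines: the best model fitted for one language is applied, unchanged, to the held-out data of every language, and macro F1 over the shared 3/5/7 class label scheme keeps the numbers comparable. A minimal sketch, in which the language codes and the data containers are placeholders:

```python
# Sketch of the cross-lingual evaluation behind Tables 21-23; `best_models`
# and `test_sets` are placeholder containers assumed to be built elsewhere.
from sklearn.metrics import f1_score

LANGS = ["mal", "hin", "tam", "eng", "tam-eng"]

def cross_lingual_macro_f1(best_models, test_sets):
    """best_models[src]: fitted pipeline selected for language src;
    test_sets[tgt]: (texts, labels) for language tgt, same label scheme."""
    scores = {}
    for src in LANGS:                  # language the model was trained on
        for tgt in LANGS:              # language it is evaluated on
            texts, labels = test_sets[tgt]
            preds = best_models[src].predict(texts)
            scores[(src, tgt)] = f1_score(labels, preds, average="macro")
    return scores
```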


Table 16
Results for English 5 class dataset.

Features          Classifiers          Acc   MP    MR    MF1   WP    WR    WF1
TF-IDF            LR                   0.71  0.23  0.21  0.19  0.57  0.71  0.61
                  NB                   0.71  0.40  0.21  0.19  0.64  0.71  0.60
                  DT                   0.68  0.30  0.25  0.26  0.60  0.68  0.63
                  RF                   0.69  0.31  0.24  0.24  0.60  0.69  0.62
                  SVM                  0.70  0.28  0.22  0.21  0.58  0.70  0.61
BERT embeddings   LR                   0.71  0.41  0.23  0.22  0.63  0.71  0.62
                  NB                   0.71  0.40  0.21  0.19  0.64  0.71  0.60
                  DT                   0.68  0.30  0.25  0.25  0.61  0.68  0.63
                  RF                   0.69  0.30  0.24  0.25  0.60  0.69  0.63
                  SVM                  0.71  0.51  0.22  0.21  0.67  0.71  0.62
fastText          LR                   0.71  0.14  0.20  0.17  0.50  0.71  0.59
                  NB                   0.66  0.23  0.26  0.24  0.56  0.66  0.61
                  DT                   0.63  0.24  0.23  0.23  0.56  0.63  0.59
                  RF                   0.71  0.14  0.20  0.17  0.50  0.71  0.59
                  SVM                  0.71  0.14  0.20  0.17  0.50  0.71  0.59
                  RoBERTa Base         0.75  0.45  0.30  0.28  0.65  0.75  0.66
                  RoBERTa Large        0.71  0.14  0.20  0.17  0.50  0.71  0.59
                  mBERT uncased        0.72  0.41  0.37  0.38  0.69  0.72  0.70
                  XLM-RoBERTa Small    0.71  0.14  0.20  0.17  0.50  0.71  0.59
                  XLM-RoBERTa Large    0.71  0.14  0.20  0.17  0.50  0.71  0.59

Table 18
Results for Tamil-English 3 class dataset.

Features          Classifiers          Acc   MP    MR    MF1   WP    WR    WF1
TF-IDF            LR                   0.90  0.30  0.33  0.32  0.81  0.90  0.85
                  NB                   0.90  0.30  0.33  0.32  0.81  0.90  0.85
                  DT                   0.90  0.38  0.34  0.32  0.83  0.90  0.85
                  RF                   0.90  0.30  0.33  0.32  0.81  0.90  0.85
                  SVM                  0.90  0.30  0.33  0.32  0.81  0.90  0.85
BERT embeddings   LR                   0.90  0.30  0.33  0.32  0.81  0.90  0.85
                  NB                   0.90  0.30  0.33  0.32  0.81  0.90  0.85
                  DT                   0.90  0.30  0.33  0.32  0.81  0.90  0.85
                  RF                   0.90  0.30  0.33  0.32  0.81  0.90  0.85
                  SVM                  0.90  0.30  0.33  0.32  0.81  0.90  0.85
fastText          LR                   0.90  0.30  0.33  0.32  0.81  0.90  0.85
                  NB                   0.57  0.39  0.50  0.35  0.87  0.57  0.66
                  DT                   0.76  0.35  0.37  0.35  0.82  0.76  0.79
                  RF                   0.90  0.63  0.34  0.33  0.84  0.90  0.85
                  SVM                  0.90  0.30  0.33  0.32  0.81  0.90  0.85
                  RoBERTa Base         0.73  0.17  0.20  0.18  0.58  0.73  0.65
                  RoBERTa Large        0.70  0.10  0.14  0.12  0.49  0.70  0.57
                  mBERT uncased        0.68  0.28  0.23  0.24  0.64  0.68  0.66
                  XLM-RoBERTa Small    0.66  0.14  0.18  0.16  0.54  0.66  0.59
                  XLM-RoBERTa Large    0.70  0.10  0.14  0.12  0.49  0.70  0.57

Table 17
Results for English 7 class dataset.

Features          Classifiers          Acc   MP    MR    MF1   WP    WR    WF1
TF-IDF            LR                   0.71  0.19  0.16  0.15  0.59  0.71  0.61
                  NB                   0.70  0.32  0.15  0.13  0.61  0.70  0.58
                  DT                   0.68  0.25  0.19  0.20  0.61  0.68  0.63
                  RF                   0.69  0.27  0.18  0.18  0.62  0.69  0.62
                  SVM                  0.71  0.31  0.17  0.17  0.48  0.71  0.62
BERT embeddings   LR                   0.71  0.37  0.17  0.18  0.66  0.71  0.62
                  NB                   0.70  0.32  0.15  0.13  0.61  0.70  0.58
                  DT                   0.68  0.25  0.19  0.20  0.62  0.68  0.63
                  RF                   0.69  0.25  0.18  0.18  0.62  0.69  0.63
                  SVM                  0.71  0.32  0.16  0.15  0.68  0.71  0.61
fastText          LR                   0.70  0.10  0.14  0.12  0.49  0.70  0.57
                  NB                   0.66  0.16  0.17  0.16  0.54  0.66  0.59
                  DT                   0.61  0.16  0.15  0.15  0.54  0.61  0.57
                  RF                   0.70  0.10  0.14  0.12  0.49  0.70  0.57
                  SVM                  0.70  0.10  0.14  0.12  0.49  0.70  0.57
                  RoBERTa Base         0.73  0.17  0.20  0.18  0.58  0.73  0.65
                  RoBERTa Large        0.70  0.10  0.14  0.12  0.49  0.70  0.57
                  mBERT uncased        0.68  0.28  0.23  0.24  0.64  0.68  0.66
                  XLM-RoBERTa Small    0.66  0.14  0.18  0.16  0.54  0.66  0.59
                  XLM-RoBERTa Large    0.70  0.10  0.14  0.12  0.49  0.70  0.57

Table 19
Results for Tamil-English 5 class dataset.

Features          Classifiers          Acc   MP    MR    MF1   WP    WR    WF1
TF-IDF            LR                   0.80  0.32  0.21  0.20  0.69  0.80  0.72
                  NB                   0.80  0.16  0.20  0.18  0.64  0.80  0.71
                  DT                   0.80  0.43  0.23  0.24  0.71  0.80  0.73
                  RF                   0.80  0.48  0.23  0.22  0.72  0.80  0.72
                  SVM                  0.80  0.49  0.23  0.23  0.73  0.80  0.72
BERT embeddings   LR                   0.80  0.28  0.22  0.21  0.67  0.80  0.72
                  NB                   0.80  0.16  0.20  0.18  0.64  0.80  0.71
                  DT                   0.80  0.43  0.23  0.24  0.71  0.80  0.73
                  RF                   0.80  0.48  0.23  0.22  0.72  0.80  0.72
                  SVM                  0.80  0.51  0.22  0.22  0.73  0.80  0.72
fastText          LR                   0.80  0.16  0.20  0.18  0.64  0.80  0.71
                  NB                   0.52  0.32  0.39  0.29  0.76  0.52  0.59
                  DT                   0.58  0.23  0.24  0.23  0.66  0.58  0.61
                  RF                   0.80  0.36  0.23  0.23  0.70  0.80  0.72
                  SVM                  0.80  0.36  0.21  0.20  0.69  0.80  0.72
                  RoBERTa Base         0.80  0.20  0.21  0.20  0.66  0.80  0.72
                  RoBERTa Large        0.80  0.16  0.20  0.18  0.64  0.80  0.71
                  mBERT uncased        0.81  0.36  0.25  0.26  0.72  0.81  0.74
                  XLM-RoBERTa Small    0.80  0.26  0.24  0.24  0.67  0.80  0.73
                  XLM-RoBERTa Large    0.80  0.16  0.20  0.18  0.64  0.80  0.71

In both cases, the fastText with random forest model performed well within the same language, achieving a macro F1 score of 0.57 for the five-class classification and 0.65 for the seven-class classification. However, when applied to other languages, particularly Tamil, English, and Tamil-English, the model's performance was notably lower. This suggests that the model struggled to generalize its predictions across different languages. The performance in the seven-class classification was especially low, further highlighting the challenges faced in accurately classifying comments in a multilingual setting.
Overall, the results indicate that while certain models exhibit strong performance within their own languages, their effectiveness decreases when applied to other languages. This can be attributed to linguistic variations and the difficulty of capturing language-specific patterns and nuances. Improving cross-lingual classification performance remains a crucial area for further research and development; enhancements in handling language variation and code-mixed expressions will be essential to improve the generalizability of models and their overall performance in cross-lingual classification tasks.
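For reference, the Acc/MP/MR/MF1/WP/WR/WF1 columns reported throughout the result tables can be computed as sketched below. Macro scores give each class equal weight, while weighted scores weight classes by support, which is why a majority-class predictor can combine high accuracy and WF1 with a very low MF1; the zero_division handling is our assumption.

```python
# Sketch of how the per-table metric columns are derived with scikit-learn.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def table_row(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    # Macro (MP, MR, MF1): unweighted mean over classes, so the rare
    # homophobic/transphobic classes dominate the difficulty of the task.
    mp, mr, mf1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    # Weighted (WP, WR, WF1): mean weighted by class support, dominated
    # by the majority class, hence the uniformly high WF1 values above.
    wp, wr, wf1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return acc, mp, mr, mf1, wp, wr, wf1
```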
In the case of the Hindi language, the evaluation of the three classes revealed that the BERT embedding with the decision tree model performed well within its own language, achieving a macro average F1 score of 0.41. When the Hindi model was evaluated on the other languages, namely Malayalam, Tamil, English, and Tamil-English, it exhibited similar performance trends, with macro F1 scores of 0.32, 0.29, 0.33, and 0.35, respectively. These findings are presented in Table 21. Turning to the five-class evaluation for the Hindi language, Table 22 shows the performance of the TF-IDF with a logistic regression model. Within its own language, the model achieved a macro F1 score of 0.35. However, when applied to other languages, it failed to make accurate predictions for Tamil, English, and Tamil-English, and it performed relatively poorly on Malayalam, yielding a macro F1 score of 0.17. Additionally, Table 23 presents the evaluation of the seven-class classification for the Hindi language using the same TF-IDF with a logistic regression model. Within its own language, the model achieved a macro F1 score of 0.22, but its performance was significantly lower in the cross-lingual setting: the macro F1 scores for these cross-lingual predictions were 0.12, 0.11, 0.12, and 0.13, further emphasizing the challenges faced in accurately classifying comments across different languages.

Overall, these results indicate that while the evaluated models showcased reasonably good performance within the Hindi language, their effectiveness diminished when applied to other languages. These variations can be attributed to differences in language structure, vocabulary, and cultural nuances. Enhancing the cross-lingual capabilities of models and accounting for language-specific variations are crucial areas of research for improving the performance of multilingual classification tasks.


Table 20
Results for Tamil-English 7 class dataset.

Features          Classifiers          Acc   MP    MR    MF1   WP    WR    WF1
TF-IDF            LR                   0.80  0.26  0.15  0.15  0.69  0.80  0.71
                  NB                   0.79  0.11  0.14  0.13  0.63  0.79  0.70
                  DT                   0.80  0.37  0.16  0.16  0.75  0.80  0.72
                  RF                   0.80  0.37  0.16  0.16  0.75  0.80  0.72
                  SVM                  0.80  0.36  0.16  0.15  0.75  0.80  0.71
BERT embeddings   LR                   0.80  0.38  0.16  0.15  0.75  0.80  0.71
                  NB                   0.79  0.11  0.14  0.13  0.63  0.79  0.70
                  DT                   0.80  0.36  0.16  0.15  0.74  0.80  0.71
                  RF                   0.80  0.37  0.16  0.16  0.75  0.80  0.71
                  SVM                  0.80  0.38  0.16  0.15  0.75  0.80  0.71
fastText          LR                   0.79  0.11  0.14  0.13  0.63  0.79  0.70
                  NB                   0.52  0.22  0.27  0.20  0.74  0.52  0.60
                  DT                   0.54  0.17  0.18  0.17  0.67  0.54  0.59
                  RF                   0.79  0.24  0.15  0.15  0.68  0.79  0.71
                  SVM                  0.79  0.26  0.14  0.13  0.69  0.79  0.70
                  RoBERTa Base         0.78  0.14  0.19  0.16  0.66  0.78  0.71
                  RoBERTa Large        0.78  0.14  0.19  0.16  0.66  0.78  0.71
                  mBERT uncased        0.78  0.39  0.29  0.30  0.75  0.78  0.75
                  XLM-RoBERTa Small    0.79  0.15  0.18  0.16  0.66  0.79  0.72
                  XLM-RoBERTa Large    0.79  0.11  0.14  0.13  0.63  0.79  0.70
In the case of the Tamil language, the evaluation of the three classes demonstrated that the fastText with the decision tree model performed well within its own language, achieving a macro average F1 score of 0.66. When the Tamil model was evaluated on the other languages, namely Malayalam, Hindi, English, and Tamil-English, it exhibited similar performance trends, with macro F1 scores of 0.33, 0.30, 0.31, and 0.31, respectively. These findings are presented in Table 21. Moving on to the five-class evaluation for the Tamil language, Table 22 showcases the performance of the fastText with a random forest model. Within its own language, the model achieved a notable macro F1 score of 0.77. However, when applied to other languages, it struggled to make accurate predictions for Malayalam and Hindi, and it also demonstrated lower performance on the English and Tamil-English data, with macro F1 scores of 0.17 and 0.20, respectively. Furthermore, Table 23 presents the evaluation of the seven-class classification for the Tamil language using the same fastText with a random forest model. Within its own language, the model achieved a high macro F1 score of 0.72, indicating strong performance. However, when evaluated in a cross-lingual context, it did not provide accurate predictions for the other languages, primarily due to the presence of code-mixed texts.

Overall, these results indicate that the evaluated models showed promising performance within the Tamil language. However, when applied to other languages, their effectiveness varied, with certain languages posing greater challenges, and the presence of code-mixed texts further complicated the cross-lingual predictions. Addressing these challenges and developing models capable of handling code-mixed data is crucial for improving the performance of multilingual classification tasks.

The cross-lingual approach involving the English language encompassed predictions for the other languages, namely Malayalam, Hindi, Tamil, and Tamil-English. The chosen model, BERT embedding with a decision tree, achieved a macro average F1 score of 0.37 when evaluated within the English language, as shown in Table 21. Although this performance is relatively low compared to the other languages, it remained consistent when the model was applied to the aforementioned languages, suggesting that its ability to make accurate predictions is stable across different linguistic contexts. In the 5 classes classification shown in Table 22, the best-performing model within the same language was the mBERT uncased model. It achieved a macro F1 score of 0.38, slightly surpassing the performance of the BERT embedding with a decision tree model. However, when this model was applied to other languages, its performance notably declined: the resulting macro F1 scores for these cross-lingual predictions were 0.22, 0.19, 0.21, and 0.24, indicating a decrease in accuracy compared to its performance within the same language. Similarly, in Table 23 the seven-class classification exhibited patterns similar to the five-class classification, both within the same language and in cross-lingual predictions. The mBERT uncased model remained the best performer within the same language, achieving a macro F1 score of 0.24; when applied to other languages, its performance decreased to macro F1 scores of 0.18, 0.17, 0.15, and 0.21, further indicating a decline in accuracy.

Overall, these results highlight that while the models perform relatively well within their own languages, their performance diminishes when applied to other languages. This can be attributed to the inherent linguistic variations and the challenge of capturing language-specific nuances and patterns: discrepancies in vocabulary, grammar, and sentence structure across languages impact the models' ability to make accurate predictions. However, despite the decrease in performance, the models still exhibit a certain level of consistency in their cross-lingual predictions. Further research and improvements are necessary to enhance the models' ability to generalize across different languages and to improve their overall performance in cross-lingual classification tasks.

The cross-lingual approach applied to the Tamil-English language involved predictions for the other languages, namely Malayalam, Hindi, Tamil, and English. The chosen model, fastText with a decision tree, achieved a macro average F1 score of 0.35 within its own language. When applied to the other languages this performance was relatively lower, with macro F1 scores of 0.33, 0.31, 0.18, and 0.32; the results are shown in Table 21. Similar trends were observed across the mentioned languages, indicating consistent behavior in different linguistic contexts. In the five-class classification shown in Table 22, the best-performing model within the same language was fastText with Naive Bayes, which attained a macro F1 score of 0.29. However, when applied to other languages, the model's performance notably declined: the resulting macro F1 scores were 0.04, 0.01, 0.19, and 0.17, indicating a significant decrease in accuracy. This is particularly evident in the predictions from Tamil-English to Malayalam and Hindi, where the presence of code-mixed language, combining Tamil and English, poses challenges for accurate classification. In Table 23, the seven-class classification exhibited similar patterns to the five-class classification, both within the same language and in cross-lingual predictions. Here the mBERT uncased model emerged as the best performer within the same language, achieving a macro F1 score of 0.30; when applied to the other languages, its performance decreased to macro F1 scores of 0.14, 0.14, 0.11, and 0.13, further indicating a decline in accuracy.

These results highlight the challenges faced in the cross-lingual approach, particularly when dealing with code-mixed languages and diverse linguistic variation. The presence of code-mixing can introduce complexities that affect the model's ability to predict accurately, and the variations in vocabulary, grammar, and language structure across languages pose further difficulties for effective cross-lingual classification. Further research and improvements are required to address these challenges and enhance the models' ability to predict accurately in cross-lingual scenarios. Developing techniques to handle code-mixed language and to better capture linguistic nuances will be crucial for improving the performance of cross-lingual models in Tamil-English and other languages.


Table 21
Cross-lingual approach for the 3 classes. Each block shows the best model for one training language applied to the test sets of all five languages.

Trained on   Model           Tested on   ACC   MP    MR    MF1   WP    WR    WF1
Mal          TFIDF - DT      Mal         0.91  0.96  0.73  0.82  0.92  0.91  0.90
                             Hin         0.94  0.32  0.33  0.32  0.90  0.94  0.92
                             Tam         0.76  0.50  0.34  0.29  0.71  0.76  0.66
                             Eng         0.93  0.44  0.35  0.35  0.90  0.93  0.90
                             Ta-En       0.88  0.36  0.35  0.35  0.82  0.88  0.85
Hin          BE-DT           Mal         0.70  0.32  0.32  0.32  0.66  0.70  0.68
                             Hin         0.90  0.42  0.42  0.41  0.91  0.90  0.90
                             Tam         0.76  0.26  0.32  0.29  0.62  0.76  0.68
                             Eng         0.84  0.33  0.38  0.33  0.87  0.84  0.85
                             Ta-En       0.83  0.37  0.36  0.35  0.82  0.83  0.82
Tam          FT-DT           Mal         0.72  0.34  0.33  0.33  0.67  0.72  0.69
                             Hin         0.75  0.32  0.32  0.30  0.90  0.75  0.82
                             Tam         0.82  0.65  0.67  0.66  0.84  0.82  0.83
                             Eng         0.66  0.34  0.36  0.31  0.88  0.66  0.75
                             Ta-En       0.83  0.31  0.31  0.31  0.81  0.83  0.82
Eng          BE-DT           Mal         0.71  0.31  0.33  0.32  0.66  0.71  0.69
                             Hin         0.86  0.40  0.32  0.33  0.91  0.86  0.88
                             Tam         0.73  0.31  0.33  0.31  0.65  0.73  0.68
                             Eng         0.92  0.41  0.36  0.37  0.89  0.92  0.91
                             Ta-En       0.85  0.36  0.36  0.36  0.83  0.85  0.84
TamEng       FT-DT           Mal         0.65  0.33  0.33  0.33  0.66  0.65  0.66
                             Hin         0.84  0.32  0.33  0.31  0.90  0.84  0.87
                             Tam         0.22  0.34  0.27  0.18  0.70  0.22  0.26
                             Eng         0.90  0.32  0.33  0.32  0.87  0.90  0.89
                             Ta-En       0.76  0.35  0.37  0.35  0.82  0.76  0.79

Table 23
Cross-lingual approach for the 7 classes. Each block shows the best model for one training language applied to the test sets of all five languages.

Trained on   Model           Tested on   ACC   MP    MR    MF1   WP    WR    WF1
Mal          FT - RF         Mal         0.85  0.77  0.60  0.65  0.84  0.85  0.83
                             Hin         0.72  0.12  0.17  0.14  0.52  0.72  0.61
                             Tam         0.61  0.09  0.14  0.11  0.37  0.61  0.46
                             Eng         0.70  0.10  0.14  0.12  0.49  0.70  0.57
                             Ta-En       0.79  0.11  0.14  0.13  0.63  0.79  0.70
Hin          TF - LR         Mal         0.70  0.10  0.14  0.12  0.49  0.70  0.58
                             Hin         0.65  0.22  0.24  0.22  0.72  0.65  0.68
                             Tam         0.61  0.16  0.14  0.11  0.45  0.61  0.46
                             Eng         0.70  0.17  0.14  0.12  0.56  0.70  0.58
                             Ta-En       0.79  0.11  0.14  0.13  0.63  0.79  0.70
Tam          FT - RF         Mal         0.00  0.00  0.00  0.00  0.00  0.00  0.00
                             Hin         0.00  0.00  0.00  0.00  0.00  0.00  0.00
                             Tam         0.83  0.96  0.60  0.72  0.86  0.83  0.81
                             Eng         0.00  0.00  0.00  0.00  0.00  0.00  0.00
                             Ta-En       0.00  0.00  0.00  0.00  0.00  0.00  0.00
Eng          mBERT-uncased   Mal         0.65  0.19  0.20  0.18  0.56  0.65  0.59
                             Hin         0.66  0.23  0.18  0.17  0.65  0.66  0.62
                             Tam         0.58  0.16  0.16  0.15  0.46  0.58  0.50
                             Eng         0.68  0.28  0.23  0.24  0.64  0.68  0.66
                             Ta-En       0.67  0.21  0.23  0.21  0.70  0.67  0.68
TamEng       mBERT-uncased   Mal         0.69  0.15  0.15  0.14  0.54  0.69  0.60
                             Hin         0.72  0.12  0.16  0.14  0.54  0.72  0.61
                             Tam         0.61  0.09  0.14  0.11  0.37  0.61  0.46
                             Eng         0.69  0.14  0.15  0.13  0.50  0.69  0.58
                             Ta-En       0.78  0.39  0.29  0.30  0.75  0.78  0.75

Table 22
Cross-lingual approach for the 5 classes. Each block shows the best model for one training language applied to the test sets of all five languages.

Trained on   Model           Tested on   ACC   MP    MR    MF1   WP    WR    WF1
Mal          FT - RF         Mal         0.88  0.70  0.53  0.57  0.86  0.88  0.85
                             Hin         0.74  0.15  0.20  0.17  0.55  0.74  0.63
                             Tam         0.00  0.00  0.00  0.00  0.00  0.00  0.00
                             Eng         0.00  0.00  0.00  0.00  0.00  0.00  0.00
                             Ta-En       0.00  0.00  0.00  0.00  0.00  0.00  0.00
Hin          TF - LR         Mal         0.73  0.15  0.20  0.17  0.54  0.73  0.62
                             Hin         0.66  0.34  0.37  0.35  0.70  0.66  0.67
                             Tam         0.01  0.10  0.01  0.02  0.10  0.01  0.02
                             Eng         0.00  0.00  0.00  0.00  0.00  0.00  0.00
                             Ta-En       0.00  0.00  0.00  0.00  0.00  0.00  0.00
Tam          FT - RF         Mal         0.00  0.00  0.00  0.00  0.00  0.00  0.00
                             Hin         0.00  0.00  0.00  0.00  0.00  0.00  0.00
                             Tam         0.85  0.95  0.67  0.77  0.88  0.85  0.84
                             Eng         0.71  0.14  0.20  0.17  0.50  0.71  0.59
                             Ta-En       0.80  0.29  0.21  0.20  0.66  0.80  0.71
Eng          mBERT-uncased   Mal         0.63  0.21  0.26  0.22  0.58  0.63  0.60
                             Hin         0.66  0.21  0.20  0.19  0.60  0.66  0.62
                             Tam         0.57  0.29  0.23  0.21  0.50  0.57  0.48
                             Eng         0.72  0.41  0.37  0.38  0.69  0.72  0.70
                             Ta-En       0.61  0.29  0.32  0.24  0.71  0.61  0.64
TamEng       FT - NB         Mal         0.03  0.04  0.05  0.04  0.03  0.03  0.03
                             Hin         0.00  0.01  0.01  0.01  0.00  0.00  0.00
                             Tam         0.47  0.26  0.26  0.19  0.50  0.47  0.46
                             Eng         0.70  0.16  0.20  0.17  0.52  0.70  0.59
                             Ta-En       0.52  0.32  0.39  0.29  0.76  0.52  0.59
6. Error analysis

In today's complex world of machine learning, it is not enough to rely solely on evaluating models based on metrics. While metrics provide quantitative measurements of performance, they often cannot offer a deep understanding of why a model is making certain predictions. To bridge this gap, we turn to interpretability techniques such as LIME (Local Interpretable Model-agnostic Explanations).13 LIME is a powerful tool that allows us to delve into the inner workings of our models and gain insights into their decision-making processes. The technique is "model-agnostic", meaning it can be applied to any type of machine-learning model without requiring specific modifications. This versatility makes LIME a valuable asset in our pursuit of transparency and interpretability (Ribeiro et al., 2016).

13 https://lime-ml.readthedocs.io/en/latest/

The LIME values for the Malayalam text classification models are shown in Figs. 6 and 7, covering the 3, 5, and 7 class settings; Figs. 8 and 9 illustrate the corresponding LIME values for the Hindi models.

By utilizing LIME, we can examine individual instances and identify the features that have the most significant impact on the model's predictions. It provides a local explanation for each prediction by constructing a simpler, interpretable "local" model around the instance of interest. This local model approximates the behavior of the complex model within a small neighborhood of the instance, making it easier to understand why the model arrived at a particular classification. The insights gained from LIME can be invaluable for various purposes: they can help us identify potential biases in the model's decision-making process, uncover hidden patterns or relationships in the data, and detect areas where the model may be prone to making errors. Overall, LIME serves as a powerful tool in our model evaluation toolkit, enabling us to go beyond numerical metrics and gain a comprehensive understanding of how our models learn and make predictions.
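Explanations of the kind shown in Figs. 6-9 can be generated for any of our pipelines with a few lines of code. The sketch below assumes a fitted scikit-learn pipeline exposing predict_proba over raw strings; the class names and the number of features shown are illustrative.

```python
# Sketch of producing a LIME explanation for a single comment.
from lime.lime_text import LimeTextExplainer

# Illustrative 3-class label names; use the label set of the dataset at hand.
class_names = ["Non-anti-LGBT+ content", "Homophobic", "Transphobic"]
explainer = LimeTextExplainer(class_names=class_names)

# `pipeline` is any fitted model with predict_proba over raw text, e.g. the
# TF-IDF + logistic regression pipeline sketched earlier; `comment_text` is
# the comment to be explained.
exp = explainer.explain_instance(comment_text, pipeline.predict_proba,
                                 num_features=10, top_labels=1)
exp.save_to_file("lime_explanation.html")   # or inspect exp.as_list(label=...)
```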


Fig. 6. LIME values for Malayalam text classification with 3 and 5 classes.

Fig. 7. LIME values for Malayalam text classification with 7 classes.

Fig. 8. LIME values for Hindi text classification with 3 classes.

Fig. 9. LIME values for Hindi text classification with 5 and 7 classes.

7. Limitations and ethical consideration

To maintain the same level of agreement as new languages are added to our dataset, future annotators must have extensive knowledge of that language and must be trained on our annotation criteria through pilot annotations. This restricts the availability of crowdsourcing-like solutions, which increases the cost of resource development. Due to the absence of context for some comments, our dataset contains some incomprehensible statements; to annotate such comments, our annotators were required to imagine themselves in circumstances that did not actually exist. As a result of this deficiency, and given the limitations of neural networks in comprehending such linguistic events, model performance on these comments is inadequate to be helpful. Even though preventing the misuse of social media content originating from homophobia and transphobia is one of our primary objectives, the publication of this dataset may still provide the opportunity for such an instance to occur. We are also aware of the potential risks associated with making a dataset of homophobia and transphobia accessible to the public (such as using ours as a basis for the development of homophobic and transphobic chatbots). Nevertheless, we are confident that the proposed benchmark will produce more benefits than risks, and we continue to believe that efficient classifiers for this task are necessary to combat implicit and subtle forms of online homophobia and transphobia at a large scale and to prevent the proliferation of harmful online content. The scientific community is encouraged to conduct additional research on these topics as a consequence of our work, which aims to bring us one step closer to this objective. Using our annotated datasets, we believe it is worthwhile to investigate alternative solutions that may facilitate generalization despite these issues.

The majority of these issues are well-documented; however, because re-annotation of all relevant data would be prohibitively expensive, we believe there is value in investigating alternative solutions that may facilitate generalization for other languages. Moreover, the experiments reported in this study are in no way exhaustive, both in terms of the number of datasets and the variety of models tested. Some of the fine-grained label classes are too small to draw strong conclusions from; supplementing these underrepresented categories with more data would provide better insights into the generalizability of the various manifestations of hate speech.

8. Conclusion

In this study, we present a groundbreaking dataset consisting of expertly labeled instances of homophobic and transphobic content from Malayalam and Hindi YouTube comments, marking the first dataset specifically designed for these languages to detect and classify such content in low-resourced settings. We have also included datasets from previous years for the same task in three languages: Tamil, English, and Tamil-English. Through a comprehensive experimental study within a supervised classification framework, we explored various feature selection methodologies, including traditional machine learning techniques and transformer-based methods. We trained multiple classifiers on 3-class, 5-class, and 7-class datasets, evaluating their performance using classification reports and confusion matrices. Additionally, we investigated the cross-lingual approach by leveraging the best models from all languages for 3 classes, 5 classes, and 7 classes, assessing the transferability of knowledge and uncovering similarities and differences in homophobic and transphobic content. To enhance interpretability, we employed LIME values, offering insights into the discriminatory patterns and linguistic cues present in the identified comments. Overall, our study introduces a pioneering dataset, addresses the scarcity of resources in low-resourced languages, and provides valuable insights for combating hate speech and promoting inclusivity in Malayalam and Hindi YouTube discussions.
CRediT authorship contribution statement

Prasanna Kumar Kumaresan: Conceptualization, Methodology, Dataset, Software, Writing – original draft. Rahul Ponnusamy: Methodology, Software, Writing – original draft. Ruba Priyadharshini: Dataset, Writing – original draft. Paul Buitelaar: Writing – review & editing, Supervision. Bharathi Raja Chakravarthi: Conceptualization, Methodology, Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Acknowledgment

This work was conducted with the financial support of the Science Foundation Ireland Centre for Research Training in Artificial Intelligence under Grant No. 18/CRT/6223, supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289_P2 (Insight_2).

References

Akosa, J., 2017. Predictive accuracy: A misleading performance measure for highly imbalanced data. In: Proceedings of the SAS Global Forum, vol. 12, pp. 1–4.
Al-Hassan, A., Al-Dossari, H., 2021. Detection of hate speech in Arabic tweets using deep learning. Multimedia Syst. 1–12.
Ali, R., Farooq, U., Arshad, U., Shahzad, W., Beg, M.O., 2022. Hate speech detection on Twitter using transfer learning. Comput. Speech Lang. 74, 101365.
Arshad, M.U., Ali, R., Beg, M.O., Shahzad, W., 2023. Uhated: Hate speech detection in Urdu language using transfer learning. Lang. Resourc. Eval. 1–20.
Balamurali, A., Joshi, A., Bhattacharyya, P., 2012. Cross-lingual sentiment analysis for Indian languages using linked wordnets. In: Proceedings of COLING 2012: Posters. pp. 73–82.
Barman, U., Das, A., Wagner, J., Foster, J., 2014. Code mixing: A challenge for language identification in the language of social media. In: Proceedings of the First Workshop on Computational Approaches to Code Switching. pp. 13–23.
Bigoulaeva, I., Hangya, V., Fraser, A., 2021. Cross-lingual transfer learning for hate speech detection. In: Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion. Association for Computational Linguistics, Kyiv, pp. 15–25, URL https://aclanthology.org/2021.ltedi-1.3.
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T., 2017. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguistics 5, 135–146.
Brooke, J., Tofiloski, M., Taboada, M., 2009. Cross-linguistic sentiment analysis: From English to Spanish. In: Proceedings of the International Conference RANLP-2009. Association for Computational Linguistics, Borovets, Bulgaria, pp. 50–54, URL https://aclanthology.org/R09-1010.
Chakravarthi, B.R., 2022a. Hope speech detection in YouTube comments. Soc. Netw. Anal. Min. 12 (1), 75.
Chakravarthi, B.R., 2022b. Multilingual hope speech detection in English and Dravidian languages. Int. J. Data Sci. Anal. 14 (4), 389–406.
Chakravarthi, B.R., 2023a. Detection of homophobia and transphobia in YouTube comments. Int. J. Data Sci. Anal. http://dx.doi.org/10.1007/s41060-023-00400-0.
Chakravarthi, B.R., 2023b. Detection of homophobia and transphobia in YouTube comments. Int. J. Data Sci. Anal. 1–20.
Chakravarthi, B.R., Hande, A., Ponnusamy, R., Kumaresan, P.K., Priyadharshini, R., 2022a. How can we detect homophobia and transphobia? Experiments in a multilingual code-mixed setting for social media governance. Int. J. Inf. Manag. Data Insights 2 (2), 100119.
Chakravarthi, B.R., Jagadeeshan, M.B., Palanikumar, V., Priyadharshini, R., 2023. Offensive language identification in Dravidian languages using MPNet and CNN. Int. J. Inf. Manag. Data Insights 3 (1), 100151.
Chakravarthi, B.R., Jose, N., Suryawanshi, S., Sherly, E., McCrae, J.P., 2020. A sentiment analysis dataset for code-mixed Malayalam-English. In: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL). European Language Resources Association, Marseille, France, pp. 177–184, URL https://aclanthology.org/2020.sltu-1.25.
Chakravarthi, B.R., Priyadharshini, R., Muralidaran, V., Jose, N., Suryawanshi, S., Sherly, E., McCrae, J.P., 2022b. DravidianCodeMix: Sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text. Lang. Resourc. Eval. 56 (3), 765–806.
Chakravarthi, B.R., Priyadharshini, R., Ponnusamy, R., Kumaresan, P.K., Sampath, K., Thenmozhi, D., Thangasamy, S., Nallathambi, R., McCrae, J.P., 2021. Dataset for identification of homophobia and transophobia in multilingual YouTube comments. arXiv preprint arXiv:2109.00227.
Chhetri, T.R., Dehury, C.K., Lind, A., Srirama, S.N., Fensel, A., 2022. A combined system metrics approach to cloud service reliability using artificial intelligence. Big Data Cognit. Comput. 6 (1), 26.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., Stoyanov, V., 2020. Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, pp. 8440–8451. http://dx.doi.org/10.18653/v1/2020.acl-main.747, URL https://aclanthology.org/2020.acl-main.747.
Demirtas, E., Pechenizkiy, M., 2013. Cross-lingual polarity detection with machine translation. In: Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining. pp. 1–8.
Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Devlin, J., Chang, M., Lee, K., Toutanova, K., 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. ACL, pp. 4171–4186.
Díaz-Torres, M.J., Morán-Méndez, P.A., Villaseñor-Pineda, L., Montes, M., Aguilera, J., Meneses-Lerín, L., 2020. Automatic detection of offensive language in social media: Defining linguistic criteria to build a Mexican Spanish dataset. In: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying. pp. 132–136.
Esuli, A., Sebastiani, F., 2006. SENTIWORDNET: A publicly available lexical resource for opinion mining. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation. LREC'06, European Language Resources Association (ELRA), Genoa, Italy, URL http://www.lrec-conf.org/proceedings/lrec2006/pdf/384_pdf.pdf.
Faulkner, N., Bliuc, A.M., 2016. 'It's okay to be racist': Moral disengagement in online discussions of racist incidents in Australia. Ethnic Racial Stud. 39 (14), 2545–2563.
Fellbaum, C., 1998. WordNet: An Electronic Lexical Database and Some of Its Applications. MIT Press, Cambridge, MA.
Gao, Z., Yada, S., Wakamiya, S., Aramaki, E., 2020. Offensive language detection on video live streaming chat. In: Proceedings of the 28th International Conference on Computational Linguistics. pp. 1936–1940.
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T., 2018. Learning word vectors for 157 languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation. LREC 2018, European Language Resources Association (ELRA), Miyazaki, Japan, URL https://aclanthology.org/L18-1550.
Guest, E., Vidgen, B., Mittos, A., Sastry, N., Tyson, G., Margetts, H., 2021. An expert annotated dataset for the detection of online misogyny. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. pp. 1336–1350.
Haaga, D.A., 1991. "Homophobia"? J. Soc. Behav. Personality 6 (1), 171.
Habimana, O., Li, Y., Li, R., Gu, X., Yu, G., 2020. Sentiment analysis using deep learning approaches: An overview. Sci. China Inf. Sci. 63 (1), 1–36.
Hande, A., Hegde, S.U., Chakravarthi, B.R., 2022. Multi-task learning in under-resourced Dravidian languages. J. Data, Inf. Manag. 4 (2), 137–165.
Hewavitharana, S., Fernando, H., 2002. A two stage classification approach to Tamil handwriting recognition. Tamil Internet 2002, 118–124.
Jose, N., Chakravarthi, B.R., Suryawanshi, S., Sherly, E., McCrae, J.P., 2020. A survey of current datasets for code-switching research. In: 2020 6th International Conference on Advanced Computing and Communication Systems. ICACCS, IEEE, pp. 136–141.
Krippendorff, K., 1970. Estimating the reliability, systematic error and random error of interval data. Educ. Psychol. Meas. 30 (1), 61–70.
Kumar, M., Chandran, S., 2015. Handwritten Malayalam word recognition system using neural networks. Int. J. Eng. Res. Technol. (IJERT) 4 (4), 90–99.


Kumar, R., Ojha, A.K., Malmasi, S., Zampieri, M., 2018. Benchmarking aggression identification in social media. In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying. TRAC-2018, pp. 1–11.
Kumaresan, P.K., Ponnusamy, R., Sherly, E., Sivanesan, S., Chakravarthi, B.R., 2022. Transformer based hope speech comment classification in code-mixed text. In: International Conference on Speech and Language Technologies for Low-Resource Languages. Springer, pp. 120–137.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V., 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Malmasi, S., Zampieri, M., 2018. Challenges in discriminating profanity from hate speech. J. Exp. Theor. Artif. Intell. 30 (2), 187–202.
Mandl, T., Modha, S., Majumder, P., Patel, D., Dave, M., Mandlia, C., Patel, A., 2019. Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages. In: Proceedings of the 11th Forum for Information Retrieval Evaluation. pp. 14–17.
Meetei, L.S., Singh, T.D., Bandyopadhyay, S., 2019. WAT2019: English-Hindi translation on Hindi visual genome dataset. In: Proceedings of the 6th Workshop on Asian Translation. pp. 181–188.
Meng, X., Wei, F., Xu, G., Zhang, L., Liu, X., Zhou, M., Wang, H., 2012. Lost in translations? Building sentiment lexicons using context based machine translation. In: Proceedings of COLING 2012: Posters. The COLING 2012 Organizing Committee, Mumbai, India, pp. 829–838, URL https://aclanthology.org/C12-2081.
Meyer, E.J., 2008. Gendered harassment in secondary schools: Understanding teachers' (non)interventions. Gender Educ. 20 (6), 555–570.
Mihalcea, R., Banea, C., Wiebe, J., 2007. Learning multilingual subjective language via cross-lingual projections. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, pp. 976–983, URL https://aclanthology.org/P07-1123.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26.
Mishra, S., Prasad, S., Mishra, S., 2021. Exploring multi-task multi-lingual learning of transformer models for hate speech and offensive speech identification in social media. SN Comput. Sci. 2 (2), 1–19.
Pennington, J., Socher, R., Manning, C.D., 2014. GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. EMNLP, pp. 1532–1543.
Poria, S., Gelbukh, A., Cambria, E., Yang, P., Hussain, A., Durrani, T., 2012. Merging SenticNet and WordNet-Affect emotion lists for sentiment analysis. In: 2012 IEEE 11th International Conference on Signal Processing, Vol. 2. IEEE, pp. 1251–1255.
Poteat, V.P., Rivers, I., 2010. The use of homophobic language across bullying roles during adolescence. J. Appl. Dev. Psychol. 31 (2), 166–172.
Priyadarshini, I., Sahu, S., Kumar, R., 2023. A transfer learning approach for detecting offensive and hate speech on social media platforms. Multimedia Tools Appl. 1–27.
Rasooli, M.S., Farra, N., Radeva, A., Yu, T., McKeown, K., 2018. Cross-lingual sentiment transfer with limited resources. Mach. Transl. 32, 143–165.
Ribeiro, M.T., Singh, S., Guestrin, C., 2016. "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16, Association for Computing Machinery, New York, NY, USA, pp. 1135–1144. http://dx.doi.org/10.1145/2939672.2939778.
Risch, J., Krestel, R., 2018. Aggression identification using deep learning and data augmentation. In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying. TRAC-2018, pp. 150–158.
Sai, S., Sharma, Y., 2021. Towards offensive language identification for Dravidian languages. In: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. pp. 18–27.
Sakuntharaj, R., Mahesan, S., 2016. A novel hybrid approach to detect and correct spelling in Tamil text. In: 2016 IEEE International Conference on Information and Automation for Sustainability. ICIAfS, IEEE, pp. 1–6.
Sakuntharaj, R., Mahesan, S., 2017. Use of a novel hash-table for speeding-up suggestions for misspelt Tamil words. In: 2017 IEEE International Conference on Industrial and Information Systems. ICIIS, IEEE, pp. 1–5.
Sakuntharaj, R., Mahesan, S., 2021. Missing word detection and correction based on context of Tamil sentences using N-grams. In: 2021 10th International Conference on Information and Automation for Sustainability. ICIAfS, IEEE, pp. 42–47.
Santhiya, S., Jayadharshini, P., Kogilavani, S., 2022. Transfer learning based YouTube toxic comments identification. In: International Conference on Speech and Language Technologies for Low-Resource Languages. Springer, pp. 220–230.
Sekhar, A.C., 1951. Evolution of Malayalam. Bull. Deccan College Res. Inst. 12 (1/2), 1–216.
Snyder, C.R., 2002. Hope theory: Rainbows in the mind. Psychol. Inquiry 13 (4), 249–275.
Strapparava, C., Valitutti, A., et al., 2004. WordNet Affect: An affective extension of WordNet. In: LREC, Vol. 4. Lisbon, Portugal, pp. 1083–1086.
Subramanian, M., Chinnasamy, R., Kumaresan, P.K., Palanikumar, V., Mohan, M., Shanmugavadivel, K., 2022a. Development of multi-lingual models for detecting hope speech texts from social media comments. In: International Conference on Speech and Language Technologies for Low-Resource Languages. Springer, pp. 209–219.
Subramanian, M., Ponnusamy, R., Benhur, S., Shanmugavadivel, K., Ganesan, A., Ravi, D., Shanmugasundaram, G.K., Priyadharshini, R., Chakravarthi, B.R., 2022b. Offensive language detection in Tamil YouTube comments by adapters and cross-domain knowledge transfer. Comput. Speech Lang. 76, 101404.
Thamburaj, K.P., Rengganathan, V., 2015. A critical study of SPM Tamil literature exam paper. Asian J. Assess. Teaching Learn. 5, 13–24.
Thavareesan, S., Mahesan, S., 2019. Sentiment analysis in Tamil texts: A study on machine learning techniques and feature representation. In: 2019 14th Conference on Industrial and Information Systems. ICIIS, IEEE, pp. 320–325.
Thavareesan, S., Mahesan, S., 2020a. Sentiment lexicon expansion using word2vec and fastText for sentiment prediction in Tamil texts. In: 2020 Moratuwa Engineering Research Conference. MERCon, IEEE, pp. 272–276.
Thavareesan, S., Mahesan, S., 2020b. Word embedding-based part of speech tagging in Tamil texts. In: 2020 IEEE 15th International Conference on Industrial and Information Systems. ICIIS, IEEE, pp. 478–482.
Thavareesan, S., Mahesan, S., 2021. Sentiment analysis in Tamil texts using k-means and k-nearest neighbour. In: 2021 10th International Conference on Information and Automation for Sustainability. ICIAfS, pp. 48–53. http://dx.doi.org/10.1109/ICIAfS52090.2021.9605839.
Thurlow, C., 2001. Naming the "outsider within": Homophobic pejoratives and the verbal abuse of lesbian, gay and bisexual high-school pupils. J. Adolescence 24 (1), 25–38.
Xu, Y., Cao, H., Du, W., Wang, W., 2022. A survey of cross-lingual sentiment analysis: Methodologies, models and evaluations. Data Sci. Eng. 7 (3), 279–299.
Youssef, C.M., Luthans, F., 2007. Positive organizational behavior in the workplace: The impact of hope, optimism, and resilience. J. Manag. 33 (5), 774–800.
Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., Kumar, R., 2019. Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666.
