
International Journal of Artificial Intelligence in Education

https://doi.org/10.1007/s40593-023-00389-4

ARTICLE

Debiasing Education Algorithms

Jamiu Adekunle Idowu1

Accepted: 15 December 2023


© The Author(s) 2023

Abstract
This systematic literature review investigates the fairness of machine learning algorithms in educational settings, focusing on recent studies and their proposed solutions to address biases. Applications analyzed include student dropout prediction, performance prediction, forum post classification, and recommender systems. We identify common strategies, such as adjusting sample weights, bias attenuation methods, fairness through un/awareness, and adversarial learning. Commonly used metrics for fairness assessment include ABROCA, group difference in performance, and disparity metrics. The review underscores the need for context-specific approaches to ensure equitable treatment and reveals that most studies found no strict tradeoff between fairness and accuracy. We recommend evaluating fairness of data and features before algorithmic fairness to prevent algorithms from receiving discriminatory inputs, expanding the scope of education fairness studies beyond gender and race to include other demographic attributes, and assessing the impact of fair algorithms on end users, as human perceptions may not align with algorithmic fairness measures.

Keywords Debiasing education algorithms · Fairness metrics · Algorithmic fairness · Systematic literature review

Introduction

Motivation

In 2020, the UK’s decision to use an algorithm to predict A-level grades led to a public outcry. The algorithm disproportionately downgraded students from disadvantaged backgrounds while favoring those from affluent schools. This is just one of many cases of algorithms perpetuating existing biases.

* Jamiu Adekunle Idowu


[email protected]
1 Centre for Artificial Intelligence, Department of Computer Science, University College London, London, UK


For instance, an algorithm developed to identify at-risk students performed worse for African-American students (Hu & Rangwala, 2020). Similarly, the predictive algorithms built by Yu et al. (2020) for undergraduate course grades and average GPA recorded better performance for male students than for female students. In addition, a SpeechRater system for automated scoring of language assessment had lower accuracy for students with speech impairments (Loukina & Buzick, 2017). Indeed, for AI applications to have a positive impact on education, it is crucial that their design considers fairness at every step (Suresh & Guttag, 2021; Gusenbauer & Haddaway, 2020).
In response to these challenges, the Learning Analytics and AIED research community has increasingly focused on fairness, equity, and responsibility. Recent studies have begun to address these issues more directly. For instance, Patterson et al. (2023) proposed the adoption of the anticipation, inclusion, responsiveness, and reflexivity (AIRR) framework (originally introduced by Stilgoe et al., 2013) in Learning Analytics. The study further applied the framework to creating an early-alert retention system at James Madison University. Also, Rets et al. (2023) provided six practical recommendations to enable the ethical use of predictive learning analytics in distance education. These include active involvement of end users in designing and implementing LA tools, inclusion through consideration of diverse student needs, stakeholder engagement, and a clear plan for student support interventions. Also, previous studies attempted to eliminate bias from education algorithms by removing sensitive features. However, this ‘fairness by unawareness’ approach is not a foolproof solution, as bias can persist due to correlation with non-sensitive features (Jiang & Pardos, 2021) or the presence of proxy attributes (Pessach & Shmueli, 2022). For instance, brand associations like “Victoria’s Secret” may be highly correlated with gender, occupation/educational attainment may be correlated with socio-economic status, and geographic location may be highly correlated with race. In our review, we will examine the various bias mitigation strategies implemented in the existing literature.
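To make the proxy problem concrete, the short sketch below (not from any of the reviewed papers; the data and variable names are synthetic assumptions) shows how a retained feature can remain highly correlated with a dropped sensitive attribute, so a model trained without the attribute can still discriminate along it.

```python
# A minimal, synthetic illustration of why 'fairness by unawareness' can fail:
# a proxy feature still encodes the dropped sensitive attribute.
import numpy as np

rng = np.random.default_rng(0)
gender = rng.integers(0, 2, size=1000)            # sensitive attribute, excluded from the model
proxy = gender + rng.normal(0, 0.3, size=1000)    # e.g. a brand-affinity or location feature

print(np.corrcoef(gender, proxy)[0, 1])           # ≈ 0.86: the proxy leaks the attribute
```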
Meanwhile, a study by Gedrimiene et al. (2023) found accuracy to be a stronger predictor of students’ trust in an LA career recommender system than the students’ understanding of the origins of the recommendations (explainability). This aligns with Wang et al. (2022) (one of the papers reviewed in this SLR), which reported that in an online user study of more than 200 college students, participants preferred an original biased career recommender system over the debiased version of the system. Given these findings, this SLR will analyse the tradeoff between accuracy and fairness in education algorithms.
Moreover, to assess the fairness of algorithms, it is crucial to choose an appropriate metric, especially as studies have shown that it is impossible to satisfy multiple measures of fairness simultaneously; for instance, pursuing demographic parity will harm equalized odds (Baker & Hawn, 2021; Corbett-Davies et al., 2023). Therefore, it is important to analyse the fairness measures being adopted in the existing literature on education algorithmic fairness.
This systematic literature review aims to address the following objectives:

• Research Objective 1: To identify the key metrics for assessing the fairness of
education algorithms.


• Research Objective 2: To identify and analyze the bias mitigation strategies and
sensitive features used in educational algorithms.
• Research Objective 3: To investigate the tradeoff between fairness and accuracy
in educational algorithms.

Fairness in Machine Learning Algorithms

Fairness in machine learning has become an increasingly central topic, driven by the growing awareness of the societal implications of algorithmic decisions. While the design and training of algorithms can increase unfairness, biases often originate in the data itself, which captures historical prejudices, cultural stereotypes, or demographic disparities that in turn influence model behaviour (Yu et al., 2020; Jiang & Pardos, 2021).
Although the definition of fairness is a subject of debate, at the broadest level,
fairness of algorithms falls into two categories: individual fairness and group
fairness.
Individual Fairness emphasizes that similar individuals should receive similar
outcomes. Some common measures in this category include:

• Fairness Through Awareness: An algorithm is considered fair if it gives similar predictions to similar individuals (Dwork et al., 2012).
• Fairness Through Unawareness: Fairness is achieved if protected attributes like race, age, or gender are not explicitly incorporated in the decision-making process (Gajane & Pechenizkiy, 2017).
• Counterfactual Fairness: A decision is deemed fair for an individual if it remains the same in both the real and a counterfactual world where the individual belongs to a different demographic group (Kusner et al., 2017).

Group Fairness, on the other hand, focuses on ensuring equal statistical outcomes across distinct groups partitioned by protected attributes. Some widely used measures here are:

• Equal Opportunity: The probability of a person in a positive class being assigned to a positive outcome should be equal for both protected and unprotected group members (Hardt et al., 2016).
• Demographic Parity: The likelihood of a positive outcome should be the same regardless of whether the person is in the protected group (Dwork et al., 2012).
• Equalized Odds: The probabilities of true positives and false positives should be the same for both protected and unprotected groups (Hardt et al., 2016).
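As a concrete illustration of these group fairness notions (a minimal sketch, not taken from any of the reviewed papers; the variable names y_true, y_pred and group are assumptions), the gaps below would all be zero for a classifier that satisfies the corresponding definition:

```python
# Group-fairness gaps computed from binary labels, binary predictions and a binary protected attribute.
import numpy as np

def rates(y_true, y_pred):
    """Positive-prediction rate, TPR and FPR for one group's labels and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos_rate = y_pred.mean()              # P(Y_hat = 1)
    tpr = y_pred[y_true == 1].mean()      # P(Y_hat = 1 | Y = 1)
    fpr = y_pred[y_true == 0].mean()      # P(Y_hat = 1 | Y = 0)
    return pos_rate, tpr, fpr

def group_fairness_gaps(y_true, y_pred, group):
    """Gaps between the protected (group == 1) and unprotected (group == 0) members."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    p1, tpr1, fpr1 = rates(y_true[group == 1], y_pred[group == 1])
    p0, tpr0, fpr0 = rates(y_true[group == 0], y_pred[group == 0])
    return {
        "demographic_parity_gap": abs(p1 - p0),                          # equal positive-outcome likelihood
        "equal_opportunity_gap": abs(tpr1 - tpr0),                       # equal TPR for the positive class
        "equalized_odds_gap": max(abs(tpr1 - tpr0), abs(fpr1 - fpr0)),   # equal TPR and FPR
    }

print(group_fairness_gaps([1, 0, 1, 0, 1, 0], [1, 0, 0, 0, 1, 1], [1, 1, 1, 0, 0, 0]))
```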

Meanwhile, despite the advancement in algorithmic fairness research, defining fairness remains challenging, as many existing definitions may not always align with real-world applications (Gardner et al., 2019). For instance, while demographic parity requires positive prediction probabilities to be consistent across groups, it may lead to imbalances when legitimate reasons exist for discrepancies among groups (Gardner et al., 2019). Recognizing this, Yu et al. (2021) emphasized the importance of contextualizing fairness based on the application.

Method

This systematic literature review (SLR) collates, classifies and interprets the existing
literature that fits the pre-defined eligibility criteria of the review. The 2020 updated
PRISMA guideline is adopted for the review (Page et al., 2021). The stages involved
in this SLR are discussed as follows:

Stage 1: Data Sources and Search Strategy

Scopus was selected as the principal search system for this systematic review because it is extensive (70,000,000+ publications) and can efficiently perform searches with high precision, recall, and reproducibility (Gusenbauer & Haddaway, 2020). In addition, ACM Digital Library and IEEE Xplore were used as supplementary search systems because they are specifically focused on computer science (Gusenbauer & Haddaway, 2020). Table 1 provides more details on the data sources.
The keywords used for the search strategy are presented in Table 2.

Stage 2: Inclusion and Exclusion Criteria

This section outlines the inclusion and exclusion criteria for the SLR. We screened
studies in three steps: first by titles and keywords, then abstracts, and finally through
full-text reading. Studies that met the inclusion criteria were further evaluated in
Stage 3 using the quality assessment criteria.
Inclusion Criteria:

• Papers that directly address the issues of algorithmic bias in educational settings
or tools.
• Studies published from 2015 onwards.
• Research materials such as peer-reviewed journal articles, conference proceed-
ings, white papers, etc. all with clearly defined research questions.

Exclusion Criteria:

• Non-research materials like newsletters, magazines, posters, surveys, invited talks, panels, keynotes, and tutorials.
• Duplicate research papers or articles.

Implementation of Search Strategy:

We executed our search strategy across the three data sources using the queries specified in Table 2.

Table 1  Data sources for identifying relevant studies

Data source | Type | Years covered | Justification
Scopus | Abstract and citations database (journal articles, conference proceedings, etc.) | All years | Extensive publications, can be used as a principal search system, full index, bulk download, repeatable, etc. (Gusenbauer & Haddaway, 2020).
ACM | Full-text digital library | All years | Primary focus on computer science, can be used as a principal search system, full index, bulk download, repeatable, etc. (Gusenbauer & Haddaway, 2020).
IEEE Xplore | Digital library database (journals, conference papers, etc.) | All years | Focus on computer science, full index, bulk download, repeatable, etc. (Gusenbauer & Haddaway, 2020).

Table 2  Search keywords

Term | Keywords
Artificial Intelligence | “machine learning” OR “artificial intelligence” OR AI OR algorithm* OR automate* OR recommend*
AND
Fairness | bias* OR fair* OR ethic* OR trust* OR diverse* OR responsible
AND
Education | education, teacher, student, pupil, classroom, grading, curricula*

Our decision to focus on studies from 2015 onwards was validated upon examining the distribution of publications, as the literature indicated an uptick in relevant studies within this timeframe, as shown in Fig. 1 (for the principal search system, Scopus).

Stage 3: Quality Assessment

In this systematic literature review, we employed a rigorous quality assessment framework to ensure the validity and reliability of the studies included. This framework encompasses the four quality assessment questions listed below. Only papers that satisfied at least 3 of the 4 quality assessment questions were considered for further analysis.

QA1. Did the study have clearly defined research questions and methodology?
QA2. Did the study incorporate considerations of fairness during the algorithm
development process, ensuring non-discrimination based on factors like gender,
race, disability, socioeconomic status, etc.?
QA3. Did the study report the performance of the algorithm using suitable metrics
such as accuracy, F1 score, AUC-ROC, precision, recall, etc.?
QA4. Was the study based on a real-world educational dataset?

Fig. 1  Distribution of targeted studies by year in Scopus


Results

A total of 3424 targeted studies were identified: 2747 of these are from the principal search system (Scopus), 357 from ACM, and 320 from IEEE Xplore. Using EndNote, a reference management package, 426 duplicates were detected and removed, leaving 2998 studies whose titles and/or abstracts were screened.

By screening the studies’ titles and keywords, 2865 papers were sieved out, while the abstracts of the remaining 133 papers were fully read. At the end of the screening exercise, 86 papers were removed, leaving 47 papers as eligible for full-text screening. After full-text screening, 31 papers were dropped, leaving 16 papers. A consultation with experts yielded one additional paper, Counterfactual Fairness, a study that developed a framework for modeling fairness using causal inference and demonstrated the use of the framework with fair prediction of students’ success at law school (Kusner et al., 2017). In the end, 12 papers passed our quality assessment and became the eligible papers. Figure 2 presents a PRISMA flow chart of the study selection process. The data extraction table containing the study design, dataset, methods, evaluation metrics, results, conclusions, and limitations of the eligible papers is presented in Appendix Table 6.

Fig. 2  Flow diagram of study selection using PRISMA 2020 guideline


Discussion

Research Objective 1: Measuring the Fairness of Education Algorithms

According to S1 (Gardner et al., 2019), a necessary step towards correcting any unfairness in algorithms is to measure it first. However, there has been a challenge around finding the appropriate metric for measuring the fairness of education algorithms. Therefore, this section explores the metrics employed by the selected studies to measure fairness in education algorithms, as highlighted in Table 3.

1. ABROCA (S1, S2 & S6)

  As noted by Wang et al. (2022), most studies in learning analytics typically measure algorithmic fairness using statistical parity, equalised odds, and equalised opportunities. However, these methods are threshold dependent; for instance, equalised odds can only be achieved at points where the ROC curves cross, thereby failing to evaluate model discrimination at any other threshold (Gardner et al., 2019).

Table 3  Fairness metrics adopted by selected studies

Paper | Reference | Fairness metric(s) | Performance metric
S1 | Gardner et al. (2019) | ABROCA | AUC-ROC
S2 | Sha et al. (2022) | ABROCA | AUC
S3 | Wang et al. (2022) | UPAR | NDCG@K*
S4 | Yu et al. (2021) | Group Δ in performance | Accuracy & Recall
S5 | Jiang and Pardos (2021) | TPR & TNR | Accuracy
S6 | Sha et al. (2021) | ABROCA | Accuracy, AUC & F1
S7 | Yu et al. (2020) | FNR & FPR | Accuracy
S8 | Hu and Rangwala (2020) | Group Δ in performance | Accuracy
S9 | Lee and Kizilcec (2020) | Group Δ in performance; TPR & TNR | Accuracy and F1
S10 | Anderson et al. (2019) | Group Δ in performance; FNR & FPR | AUC-ROC
S11 | Loukina et al. (2019) | Group Δ in performance | Accuracy
S12 | Kusner et al. (2017) | Counterfactual Fairness | RMSE

* NDCG@K: Normalized discounted cumulative gain at K


Therefore, S1 (Gardner et al., 2019) developed a methodology to measure unfairness in predictive models using slicing analysis. This method involves evaluating the model performance across multiple categories of the data. As a case study, S1 applied the slicing approach to explore the gender-based differences in MOOC dropout prediction models. The study proposed a new fairness metric, ABROCA, the Absolute Between-ROC Area. This metric measures the absolute value of the area between the baseline group ROC curve ROCb and the comparison group(s) ROCc. The lower the ABROCA value, the less unfair the algorithm.

ABROCA = ∫₀¹ |ROCb(t) − ROCc(t)| dt

  Unlike fairness metrics like equalized odds, ABROCA is not threshold dependent. The metric reveals model discrimination between subgroups across all possible thresholds t. Also, as pointed out by S1, ABROCA evaluates model accuracy without focusing strictly on positive cases, making it relevant for learning analytics, where a prediction, such as a student potentially dropping out, is not inherently good or bad, but rather used to inform future support actions. Another advantage of the metric is its minimal computational requirement, as it can be calculated directly from the results of a prediction modeling task (S1). Indeed, some of the relevant studies identified in this SLR adopted ABROCA; while S6 (Sha et al., 2021) used ABROCA to assess the fairness of students’ forum post classification, S2 (Sha et al., 2022) used the metric to assess the fairness of student dropout prediction, student performance prediction, and forum post classification.
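For illustration, a minimal sketch of how ABROCA can be computed from a model’s scores is shown below. It is not the reviewed papers’ implementation: the variable names are assumptions, scikit-learn is used for the ROC curves, and interpolating both curves onto a shared grid before integrating is an implementation choice.

```python
# A minimal ABROCA sketch: area between the baseline and comparison groups' ROC curves.
import numpy as np
from sklearn.metrics import roc_curve

def abroca(y_true, y_score, group):
    y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
    # ROC curve for the baseline (group == 0) and comparison (group == 1) groups
    fpr_b, tpr_b, _ = roc_curve(y_true[group == 0], y_score[group == 0])
    fpr_c, tpr_c, _ = roc_curve(y_true[group == 1], y_score[group == 1])
    # Put both curves on a common grid so they can be subtracted pointwise
    grid = np.linspace(0.0, 1.0, 1001)
    diff = np.interp(grid, fpr_b, tpr_b) - np.interp(grid, fpr_c, tpr_c)
    # ABROCA = ∫ |ROC_b(t) − ROC_c(t)| dt; lower values indicate less discrimination
    return np.trapz(np.abs(diff), grid)
```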
2. Group Difference in Performance (S4, S8, S9, S10 & S11)
  Five of the selected studies measured fairness via group difference in performance. S4 (Yu et al., 2021) used the difference between the Accuracy, Recall and TNR of its AWARE model (i.e., sensitive features present as predictors) and its BLIND model (sensitive features absent). Similarly, S9 (Lee & Kizilcec, 2020) used demographic parity as one of its fairness metrics, while S10 (Anderson et al., 2019) measured the optimality of models via a comparison of the AUC-ROC gained or lost when separate models were built for each sub-population. Meanwhile, for S11 (Loukina et al., 2019), the approach adopted involved three dimensions of fairness, namely overall score accuracy, overall score difference, and conditional score difference.
3. True Positive Rate and True Negative Rate (S5 & S9)
  S5 (Jiang & Pardos, 2021) examined race group fairness by reporting true
positive rate (TPR), true negative rate (TNR), and accuracy according to equity
of opportunity and equity of odds. Similarly, S9 (Lee & Kizilcec, 2020) used
equality of opportunity and positive predictive parity.
4. False Positive Rate and False Negative Rate (S7 & S10)
  For S7 (Yu et al., 2020), performance parity across student subpopulations was the approach to fairness measurement. The study computed disparity metrics and tested the significance of the disparities using a one-sided two-proportion z-test. Higher ratios indicate a greater degree of unfairness. The disparity metrics are defined as follows:

Accuracy disparity = acc_ref / acc_g
FPR disparity = fpr_g / fpr_ref
FNR disparity = fnr_g / fnr_ref

where g is the disadvantaged student group and ref is the reference group.
  Also, S10 (Anderson et al., 2019) assessed the equity of graduation prediction models using false positive and false negative predictions.
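As a small illustration (a sketch, not S7’s code; the group names and metric values below are made up), these ratios can be computed directly from per-group accuracy, FPR and FNR:

```python
# A minimal sketch of S7-style disparity ratios; values above 1 indicate disparities
# against the disadvantaged group g relative to the reference group.
def disparity_ratios(metrics, g, ref):
    return {
        "accuracy_disparity": metrics[ref]["acc"] / metrics[g]["acc"],   # acc_ref / acc_g
        "fpr_disparity": metrics[g]["fpr"] / metrics[ref]["fpr"],        # fpr_g / fpr_ref
        "fnr_disparity": metrics[g]["fnr"] / metrics[ref]["fnr"],        # fnr_g / fnr_ref
    }

metrics = {  # illustrative per-group performance figures
    "reference":     {"acc": 0.82, "fpr": 0.10, "fnr": 0.12},
    "disadvantaged": {"acc": 0.74, "fpr": 0.16, "fnr": 0.20},
}
print(disparity_ratios(metrics, g="disadvantaged", ref="reference"))
```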
5. Others (S3 & S12)
  S3 (Wang et al., 2022) and S12 (Kusner et al., 2017) took unique approaches in assessing the fairness of a students’ career recommender system and of the prediction of success in law school, respectively. While S3 used non-parity unfairness (UPAR), S12 used counterfactual fairness. According to S3, non-parity unfairness, UPAR, computes the absolute difference of the average ratings between two groups of users:

UPAR = |E_g[y] − E_¬g[y]|

where E_g[y] is the average predicted score for one group (e.g. male) and E_¬g[y] is the average predicted score for the other group (e.g. female); the lower the UPAR, the fairer the system.
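A minimal sketch of this measure is shown below (not S3’s implementation; the scores and group labels are synthetic assumptions):

```python
# Non-parity unfairness (UPAR): absolute difference of the average predicted scores
# between two user groups; lower values mean a fairer system.
import numpy as np

def upar(scores, group):
    scores, group = np.asarray(scores, dtype=float), np.asarray(group)
    return abs(scores[group == 1].mean() - scores[group == 0].mean())

print(upar([0.9, 0.7, 0.8, 0.4, 0.5, 0.3], [1, 1, 1, 0, 0, 0]))  # ≈ 0.4
```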
  Meanwhile, S12, focusing on predicting success in law school, proposed Counterfactual Fairness. This approach is rooted in the idea that a predictor is considered counterfactually fair if, for any given context, altering a protected attribute (like race or gender) does not change the prediction outcome when all other causally unrelated attributes remain constant. Formally, let A denote the protected attributes, X the remaining attributes, and Y the desired output. Given a causal model represented by (U, V, F), where V is the union of A and X, S12 postulated the following criterion for predictors Ŷ of Y. A predictor Ŷ is said to be counterfactually fair if, for any specific context defined by X = x and A = a, the following condition holds:

P(Ŷ_{A←a}(U) = y | X = x, A = a) = P(Ŷ_{A←a′}(U) = y | X = x, A = a)

for every possible outcome y and for any alternative value a′ that the protected attribute A can assume.
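To make the abduction, action and prediction reasoning behind this criterion concrete, the toy sketch below (an assumption-laden illustration, not S12’s law school model) uses a simple additive causal model X = αA + U, recovers the latent U from an observation, regenerates X under the counterfactual A ← a′, and compares the two predictions:

```python
# Toy counterfactual-fairness check in an assumed additive causal model: A -> X <- U,
# with X = ALPHA * A + U. A predictor is counterfactually fair here if its output does
# not change when A is switched while the recovered latent U is held fixed.
ALPHA = 2.0  # assumed causal effect of the protected attribute A on the feature X

def counterfactual_x(x, a, a_prime):
    u = x - ALPHA * a                 # abduction: recover the latent background variable U
    return ALPHA * a_prime + u        # action + prediction: regenerate X under A <- a'

def is_counterfactually_fair(predict, x, a, a_prime, tol=1e-9):
    return abs(predict(x, a) - predict(counterfactual_x(x, a, a_prime), a_prime)) <= tol

def predict_unfair(x, a):
    return 0.5 * x                    # uses X directly, which is caused by A

def predict_fair(x, a):
    return 0.5 * (x - ALPHA * a)      # uses only the recovered U, so A has no influence

print(is_counterfactually_fair(predict_unfair, x=3.0, a=1, a_prime=0))  # False
print(is_counterfactually_fair(predict_fair,   x=3.0, a=1, a_prime=0))  # True
```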
  Clearly, this SLR reveals the diverse nature of metrics for measuring education algorithmic fairness, with common methods being ABROCA (S1, S2 & S6), true positive rate and true negative rate (S5 & S9), false positive rate, false negative rate, and disparity metrics (S7 & S10), as well as group difference in performance (S4, S8, S9, S10 & S11). Other methods include non-parity unfairness, UPAR (S3), and counterfactual fairness (S12).


Each method comes with its unique strengths and limitations, making it clear that there is no one-size-fits-all approach to fairness assessment. For instance, ABROCA is advantageous in scenarios where fairness needs to be evaluated independently of decision thresholds, offering a more holistic view of model discrimination. On the other hand, ABROCA might not be as effective in scenarios where specific decision thresholds are crucial, and this is exactly where measures like equalized odds do well. Also, counterfactual fairness, which ensures that predictions do not change when altering protected attributes like race or gender, is beneficial in contexts where direct discrimination needs to be explicitly addressed. However, this approach might be less suitable in situations where indirect bias or correlation with non-protected attributes is the primary concern.

In a nutshell, our review emphasizes the need for a nuanced understanding of research and application contexts, highlighting that the choice of fairness metric should be aligned with the specific goals, characteristics, and ethical considerations of each educational algorithm.

Research Objective 2: Sensitive Features and Bias Mitigation Strategies

Sensitive Features: Beyond Gender and Race

This review reveals that most of the studies in the existing literature on education algorithmic fairness primarily concentrate on gender and race as sensitive features. As shown in Table 4, ten papers (S1, S2, S3, S4, S6, S7, S8, S9, S10, & S12) of the 12 papers reviewed used gender as a sensitive feature, while seven papers (S4, S5, S7, S8, S9, S10, & S12) used race as a sensitive feature. For instance, S1 (Gardner et al., 2019) performed a slicing analysis along student gender alone, with the authors acknowledging the need for future work to consider multiple dimensions of student identity. Also, in its research on the effectiveness of class balancing techniques for fairness improvement, S2 (Sha et al., 2022) focused only on student gender groups, with the study noting as part of its limitations the need to consider other demographic attributes like first language and educational background. Similarly, the debiased college major recommender system by S3 (Wang et al., 2022) mitigates only gender bias. Meanwhile, the study acknowledged that gender bias is only one of the biases that can harm career choices, emphasizing the importance of addressing other types of biases in future studies.
Also, S5 (Jiang & Pardos, 2021) primarily focused on race and did not thoroughly analyze potential biases that may have been introduced by other protected attributes such as parental income. In a similar vein, S6 (Sha et al., 2021) concentrated only on gender and first language, while S8 (Hu & Rangwala, 2020) and S9 (Lee & Kizilcec, 2020) focused only on gender and race.

Notable exceptions are S4 (Yu et al., 2021) and S7 (Yu et al., 2020), which examined four sensitive features each: gender, race, first-generation student, and financial need. Meanwhile, only one paper, S11 (Loukina et al., 2019), considered disability. It is the same for native language, as only S6 (Sha et al., 2021) considered it as a sensitive feature.

Table 4  Sensitive features used in selected studies

Paper | Dataset | Sensitive features
S1 | Data from 44 MOOCs. 3,080,230 students in training & 1,111,591 in the test set | Sex
S2 | Moodle data, KDDCUP (2015) and OULA (Kuzilek et al., 2017) | Sex
S3 | Dataset with information from 16,619 anonymised Facebook users | Sex
S4 | 81,858 online course-taking records for 24,198 students and 874 unique courses | Sex, Race, Status*
S5 | 82,309 undergrads with 1.97 m enrollments and 10,430 courses from UC Berkeley | Race
S6 | 3,703 randomly-selected discussion posts by Monash University students | Sex, Language
S7 | 2,244 students enrolled in 10 online, STEM courses taught from 2016 to 2018 | Sex, Race, Status*
S8 | A 10-year dataset at George Mason University from 2009 to 2019, covering top five majors | Sex, Race
S9 | Admin data of 5,443 students from Fall 2014 to Spring 2019 from a US university | Sex, Race
S10 | Data covering 14,706 first-time in US college undergrads, 2006–2012 (inclusive) | Sex, Race
S11 | A corpus of 26,710 responses from 4,452 test-takers | Disability
S12 | Survey data covering 163 law schools and 21,790 law students in the US | Sex, Race

Status* means socio-economic status



Mitigating Bias in Education Algorithms

The papers included in this SLR leveraged various fairness strategies to debias education algorithms, as shown in Table 5.

S2, S4, S6 and S9 used class balancing techniques to improve fairness. In particular, S2 (Sha et al., 2022) examined the effectiveness of 11 class balancing techniques in improving the fairness of three representative prediction tasks: forum post classification, student dropout prediction, and student performance prediction. The research considered four under-sampling techniques: Tomek’s links, Near Miss, Edited Nearest Neighbour, and Condensed Nearest Neighbour; four over-sampling techniques: SMOTE, SMOTE (K-means), SMOTE (Borderline), and ADASYN; and three hybrid techniques (with different combinations of under-sampling and over-sampling techniques). Class balancing was performed on the training set to ensure gender parity. The study revealed that eight of the 11 class balancing techniques improved fairness in at least two of the three prediction tasks; in particular, TomekLink and SMOTE-TomekLink improved fairness across all three tasks. Meanwhile, compared to fairness improvement, accuracy was almost exclusively improved by over-sampling techniques.
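As an illustration of one of the hybrid techniques named above, the sketch below uses imbalanced-learn’s SMOTETomek on synthetic data; it is not S2’s pipeline, and whether the resampling target is the class label (as here) or the demographic attribute is a design choice.

```python
# A minimal sketch of hybrid class balancing (SMOTE over-sampling + Tomek-link cleaning)
# applied to the training split only; the test split is left untouched for evaluation.
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_bal, y_bal = SMOTETomek(random_state=42).fit_resample(X_train, y_train)
clf = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)
print(clf.score(X_test, y_test))  # fairness metrics (e.g. ABROCA) would be computed here too
```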
Similarly, to address the issue of class imbalance, S4 (Yu et al., 2021), with a focus on student drop-out prediction, and S9 (Lee & Kizilcec, 2020), with a focus on student performance prediction, adjusted the sample weights to be inversely proportional to class frequencies during the training stage, ensuring that the model fairly considered both majority and minority classes during the learning process. Also, to address algorithmic bias in students’ forum post classification, S6 (Sha et al., 2021) explored the viability of equal sampling for the observed demographic groups (gender and first-language background) in the model training process.
Table 5  Bias mitigation strategies adopted by reviewed papers

Paper | Bias mitigation strategy
S1 | Slicing analysis; proposed ABROCA for fairness measurement
S2 | Class balancing techniques
S3 | Vector projection-based bias attenuation method
S4 | Adjusted sample weights to address class imbalance
S5 | Fairness via unawareness, reweighing, adversarial learning, and inference strategy
S6 | Equal sampling across demographic groups
S7 | None
S8 | Created a model, MCCM, which is composed of two classifiers, each of which corresponds to a sensitive attribute
S9 | Set different threshold values for each subgroup to achieve equality of opportunity in the testing set
S10 | None
S11 | None
S12 | Counterfactual Fairness


(all posts after removing the testing data) and the equal training sample (equal num-
ber of posts randomly selected for each demographic group). To ensure comparable
results, the same testing data was used for evaluation. The study found that most
classifiers became fairer for the demographic groups when using equal sampling.
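A minimal sketch of the re-weighting idea used by S4 and S9 is shown below; the data is synthetic and the papers’ own pipelines differ, but scikit-learn’s balanced sample weights implement the inverse-class-frequency weighting described above (up to a constant factor).

```python
# Training-set samples are weighted inversely to their class frequency so that the
# minority class is not drowned out by the majority class during learning.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
weights = compute_sample_weight(class_weight="balanced", y=y)  # w_i ∝ n_samples / (n_classes * count(y_i))
model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)
```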
Meanwhile, to mitigate gender bias in a career recommender system, S3 (Wang et al., 2022) used a vector projection-based bias attenuation method. As noted by the study, traditional word embeddings, trained on large data, often inherit racial and gender biases. For instance, vector arithmetic on embeddings can produce biased analogies, such as “doctor − man + woman = nurse”. The user embeddings in the career recommender system faced a similar issue. Therefore, the study introduced a debiasing step: given the embedding of a user, pu, and a unit vector vB representing the global gender bias in the same embedding space, the study debiased pu by removing its projection on the gender bias vector vB. According to S3, this bias attenuation method goes beyond a simple “fairness through unawareness” technique, as it systematically removes bias related to both the sensitive feature (gender) and the proxy features.
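The core of this attenuation step amounts to one line of linear algebra, sketched below (an illustration, not S3’s code; how the bias direction vB is estimated, e.g. from differences of group centroids, is not shown):

```python
# Remove the component of a user embedding p_u that lies along the gender-bias direction v_B.
import numpy as np

def attenuate_bias(p_u, v_B):
    v_B = v_B / np.linalg.norm(v_B)          # make sure the bias direction is a unit vector
    return p_u - np.dot(p_u, v_B) * v_B      # subtract the projection of p_u onto v_B

p_u = np.array([0.4, -0.2, 0.7])             # illustrative user embedding
v_B = np.array([1.0, 0.0, 0.0])              # illustrative gender-bias direction
print(attenuate_bias(p_u, v_B))              # the result is orthogonal to v_B
```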
Noting that the decision to include or exclude sensitive features as predictors in models is contentious in the literature, S4 (Yu et al., 2021) set out to address the challenge by exploring the implications of including or excluding protected attributes in student dropout prediction. The study compared BLIND (fairness through unawareness) and AWARE (fairness through awareness) models. Fairness levels did not substantially differ between the models, regardless of whether the protected attributes were included or excluded (S4).
To minimize racial bias in student grade prediction, S5 (Jiang & Pardos, 2021) trialled several fairness strategies. The first fairness strategy used is “fairness by unawareness” (similar to S4), which was made the default (baseline). At the data construction stage, S5 assigned weights to different race and grade labels to address data imbalance. At the model training stage, the study used an adversarial learning strategy. Additional fairness strategies tried include the inference strategy and the “alone strategy”. The inference strategy involved adding the sensitive feature (race) to the input for training and removing it for prediction, while the alone strategy involved training the model separately on each race group. According to S5, using a weighted loss function to balance race reduced the TPR, TNR, and accuracy for all race groups except Pacific Islanders. For all three metrics used, no single strategy was always the best; the inference strategy recorded the most frequent best performance, but it was also the worst in terms of fairness. Meanwhile, the adversarial learning strategy achieved the fairest results for all metrics.
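The sketch below illustrates the general adversarial-learning idea in PyTorch (a hedged toy example, not S5’s LSTM grade-prediction model): an adversary tries to recover the sensitive attribute from the shared representation, and a gradient-reversal layer pushes the encoder to strip that information out.

```python
# Toy adversarial debiasing: the encoder is trained to predict the label while a
# gradient-reversal layer makes its representation uninformative about the attribute a.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reverse (and scale) the gradient

torch.manual_seed(0)
X = torch.randn(256, 8)                        # synthetic features
a = (X[:, 0] > 0).long()                       # synthetic sensitive attribute, leaks into X
y = ((X[:, 1] + 0.5 * X[:, 0]) > 0).long()     # synthetic prediction target

encoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
task_head = nn.Linear(16, 2)                   # predicts the target label y
adv_head = nn.Linear(16, 2)                    # tries to recover the sensitive attribute a
params = list(encoder.parameters()) + list(task_head.parameters()) + list(adv_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-2)
criterion = nn.CrossEntropyLoss()

for _ in range(200):
    optimizer.zero_grad()
    h = encoder(X)
    loss = criterion(task_head(h), y) + criterion(adv_head(GradReverse.apply(h, 1.0)), a)
    loss.backward()                            # reversed gradients flow back into the encoder
    optimizer.step()
```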
S8 and S12 took unique approaches as they proposed new frameworks for bias mitigation. Specifically, S12 (Kusner et al., 2017) proposed counterfactual fairness, as discussed in the “Research Objective 1: Measuring the Fairness of Education Algorithms” section. On the other hand, S8 (Hu & Rangwala, 2020) developed the multiple cooperative classifier model (MCCM) to improve individual fairness in student at-risk prediction. MCCM consists of two classifiers, each corresponding to a specific sensitive attribute value (e.g., male or female). When given an individual’s feature vector x_i, both classifiers receive the input. The output of the classifier corresponding to the individual’s sensitive attribute s_i provides the probability of being positive, while the output of the other classifier offers the probability of being positive if the individual’s sensitive attribute were different (i.e. 1 − s_i). The difference between the two outputs is measured using KL-divergence. Based on the assumption of metric-free individual fairness, a prediction is fair when the difference between the two classifiers is negligible. To improve fairness, the research included a term representing this KL-divergence in the model’s objective function as a fairness constraint. By minimizing this difference during the model training process and controlling the trade-off between accuracy and fairness with a hyperparameter λ, the model promotes fairness between different sensitive attributes. S8 compared the proposed MCCM to baseline models such as individually fair algorithms like Rawlsian Fairness, Learning Fair Representation, and Adversarial Learned Fair Representation, as well as an algorithm with no fairness constraint, Logistic Regression. The proposed MCCM was the best at mitigating gender bias in the student at-risk prediction case study. Though the model was designed to improve individual fairness, it also achieved group fairness, underscoring the high correlation between both. Meanwhile, the LR model was highly biased as there was no fairness constraint imposed on it (S8).
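A minimal sketch of such a KL-based fairness penalty is given below (an illustration of the idea, not S8’s implementation; the two classifiers are stand-ins represented only by their predicted probabilities):

```python
# Fairness-constrained objective: cross-entropy of the own-group classifier plus λ times
# the KL-divergence between the own-group and other-group output distributions.
import numpy as np

def bernoulli_kl(p, q, eps=1e-8):
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def mccm_objective(y_true, p_own, p_other, lam=1.0, eps=1e-8):
    p = np.clip(p_own, eps, 1 - eps)
    cross_entropy = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    fairness_penalty = np.mean(bernoulli_kl(p_own, p_other))   # small when both classifiers agree
    return cross_entropy + lam * fairness_penalty

y_true = np.array([1, 0, 1, 1, 0])               # labels
p_own = np.array([0.8, 0.3, 0.7, 0.9, 0.2])      # P(positive) from the classifier for s_i
p_other = np.array([0.6, 0.4, 0.7, 0.5, 0.3])    # P(positive) from the classifier for 1 - s_i
print(mccm_objective(y_true, p_own, p_other, lam=0.5))
```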
In its own research, S9 (Lee & Kizilcec, 2020) sought to mitigate racial and gender bias in a Random Forest model predicting student performance by adjusting the threshold values for each demographic subgroup. The study noted that optimizing the model to satisfy equality of opportunity perpetuates unfairness in terms of positive predictive parity and demographic parity, emphasizing that it is not possible to satisfy all notions of fairness simultaneously.
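The sketch below illustrates this kind of post-hoc adjustment (a toy version under assumed variable names, not S9’s code): a separate threshold is chosen per subgroup so that each group reaches the same target true positive rate on held-out data.

```python
# Per-group decision thresholds chosen so each subgroup attains (at least) a target TPR,
# i.e. an equality-of-opportunity style correction applied after training.
import numpy as np

def threshold_for_tpr(y_true, y_score, target_tpr):
    pos_scores = np.sort(y_score[y_true == 1])[::-1]   # scores of actual positives, descending
    k = int(np.ceil(target_tpr * len(pos_scores)))     # positives that must be predicted positive
    return pos_scores[max(k - 1, 0)]

def group_thresholds(y_true, y_score, group, target_tpr=0.8):
    y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
    return {g: threshold_for_tpr(y_true[group == g], y_score[group == g], target_tpr)
            for g in np.unique(group)}

# Predictions for a student are then made with: y_score >= thresholds[student_group]
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.6, 0.4, 0.8, 0.3, 0.5, 0.7, 0.6])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
print(group_thresholds(y_true, y_score, group, target_tpr=1.0))  # {'a': 0.6, 'b': 0.5}
```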
The variety of fairness strategies discussed in the reviewed studies, such as class balancing techniques and adjusting sample weights (S2, S4, S6 & S9), bias attenuation methods (S3), fairness through awareness/unawareness (S4 & S5), adversarial learning (S5), multiple cooperative classifier models (S8) and counterfactual fairness (S12), emphasizes the need for a context-specific approach. By considering the unique characteristics and desired outcomes of education algorithms, researchers can ensure that they serve students in an equitable manner, applying the most suitable strategy for each specific scenario.

Fairness of Features and Data Before Fairness of Algorithms

S2, S6, S7, & S8 revealed that it is critical to consider the fairness of the data and features used in educational algorithms before delving into algorithmic fairness itself. Meanwhile, limited work has been done in evaluating bias in data and feature engineering (S2). As noted by S7 (Yu et al., 2020) and S8 (Hu & Rangwala, 2020), biased data and features can contribute to the unfairness of predictive algorithms, as they may reflect historical prejudices, demographic inequalities, and cultural stereotypes. According to S2 (Sha et al., 2022), bias is inherent in most real-world education datasets. All three datasets, the Moodle dataset from Monash University, the XuetangX KDDCUP (2015) dataset, and the Open University Learning Analytics (OULA) dataset (Kuzilek et al., 2017), exhibited distribution and hardness bias. 60% of forum posts in the Moodle dataset were authored by female students, while over 68% of the KDDCUP records came from male students. OULA had the lowest distribution bias, with only a 9% difference in male and female sample sizes. However, S2 noted that using class-balancing techniques reduced the hardness and distribution bias of the datasets.
Another eligible paper, S6 (Sha et al., 2021), pointed out that extracting fairer features and evaluating feature fairness can prevent algorithms from receiving discriminatory inputs, especially as this information is difficult to detect later in model implementation.
To examine fairness as an attribute of data sources rather than algorithms, S7 (Yu et al., 2020) evaluated the fairness impact of features from three data sources: institutional data (female, transfer, low income, first-gen, URM, SAT total score and high school GPA were extracted as features), LMS data (total clicks, total clicks by category, total time, and total time by category), and survey data (effort regulation, time management, environment management, and self-efficacy). According to the study, combining institutional and LMS data led to the most accurate prediction of college success. On the other hand, features from the institutional data led to the most discriminatory behavior of the model. Meanwhile, features from the survey dataset recorded the lowest accuracy while also showing high bias.

Research Objective 3: Fairness‑Accuracy Tradeoffs

The trade-offs between fairness and accuracy in educational algorithms are a critical aspect of algorithm design (Pessach & Shmueli, 2022). These trade-offs have been investigated in various studies; however, the papers in this SLR (S1, S2, S3, S4 & S6) found no strict trade-off between model accuracy and fairness. Starting with S1 (Gardner et al., 2019), the study did not notice a strict tradeoff between the performance and discrimination of its MOOC dropout models. Also, S2 (Sha et al., 2022) found that fairness improvements were complementary with accuracy improvements when applying over-sampling class balancing techniques. Meanwhile, S3 (Wang et al., 2022) presented an interesting case. A debiased career recommender system was found to be more accurate and fairer than its original biased version when evaluated using measures of machine learning accuracy and fairness. However, when evaluated through an online user study of more than 200 college students, participants preferred the original biased system over the debiased system (S3). The findings of this study highlight that fair algorithms may not meet the expectations of end users, as they refuse to confirm existing human biases. In its research, S4 (Yu et al., 2021) found that including or excluding protected attributes in dropout prediction models did not significantly affect performance metrics like accuracy, recall, and TNR, nor did it lead to different levels of fairness. Similarly, in a study comparing machine learning and deep learning models in delivering fair predictions, S6 (Sha et al., 2021) found that utilizing techniques such as equal sampling can help reduce model unfairness without sacrificing classification performance.
Meanwhile, a study by S5 (Jiang & Pardos, 2021), which tried different fairness strategies, revealed that adversarial learning achieved the best group fairness without sacrificing much with respect to performance metrics like TPR, TNR, and accuracy. However, another strategy that included the sensitive attribute (race) most frequently scored the highest on the performance metrics, but it was also the worst in terms of fairness, indicating the existence of a fairness-accuracy tradeoff.

Conclusion and Future Research

In this systematic literature review, we analyzed twelve eligible studies on education algorithmic fairness. Our findings highlight the need for researchers to evaluate data and feature bias in the development of fair algorithms (S2, S6, S7, & S8). Despite its importance, limited studies currently address this aspect. Additionally, the current body of research on education algorithmic fairness predominantly focuses on gender and race. Meanwhile, in a review by Baker and Hawn (2021), there is available evidence of bias in other areas like military-connected status, disability, and socioeconomic status. Another takeaway from this SLR is that no one-size-fits-all solution exists for assessing the fairness of education algorithms, highlighting the importance of selecting metrics that effectively capture the nuances of one’s context or application. Similarly, our analysis indicates that there is no strict trade-off between fairness and accuracy in education algorithms, as the relationship between the two can be complex and context dependent.
Based on the gaps identified, we make the following recommendations for
future research to advance the state of the art:

1. Prioritize Fairness of Data and Features: As indicated by S2, S6, S7, & S8, it is critical to assess the fairness of the data and features used in educational algorithms before delving into algorithmic fairness itself. This is essential to prevent algorithms from receiving discriminatory inputs.
2. Broaden the Scope of Education Algorithmic Fairness Studies by incorporating demographic attributes such as socioeconomic status and disability when analysing sensitive features. This SLR revealed that most papers in the existing literature focus primarily on gender and race as sensitive features. Specifically, ten papers (S1, S2, S3, S4, S6, S7, S8, S9, S10, & S12) out of the 12 papers in this review used gender as a sensitive feature, while seven papers (S4, S5, S7, S8, S9, S10, & S12) used race as a sensitive feature. In contrast, only one paper each examined fairness related to native language (S6) and disability (S11). Notably, none of the eligible studies in this SLR examined military-connected status or age.
3. Consider End Users’ Perspectives: As reported by S3, the expectations of end users may not align with an algorithm’s fairness measures. Therefore, investigating user preferences can help bridge the gap between machine learning fairness and human perceptions.
4. Harmonize Fairness with Accuracy: When exploring novel debiasing procedures for algorithms, focus on taking advantage of the complementary nature of fairness and accuracy, rather than compromising one for the other, as the papers in our review (S1, S2, S3, S4 & S6) indicate there is no strict tradeoff between the two.

Appendix Table 6

Table 6   Data extracted from eligible papers
Ref Study design Dataset Methods Evaluation metrics Results Conclusions Limitations

S1 Gardner et al. Proposed - Raw data from - Slicing Analysis - ABROCA for Mean for 44 - ABROCA offers - Slicing was done
(2019) “ABROCA” a 44 MOOCs. - Feature extrac- fairness measure- MOOCs: more insights only along gender
new method − 3,080,230 tion ment = ABROCA into model as sensitive
1
for unfairness students in the - Hyperparameter ∫ 0 |ROCb − ROCc |dt LSTM = 0.0351 discrimination feature.
measurement training dataset optimization - AUC-ROC for LR = 0.0353 than equalized - Focused only on
in predictive − 1,111,591 − 5 ML models accuracy SVM = 0.0311 odds. measuring fair-
student models. students in the for experiment Naïve- - Model discrimi- ness, nothing on
ABROCA meas- test set. Bayes = 0.0177 nation is related mitigation.
ures the absolute CART = 0.0280 to gender and
value of the AUC​ course cur-
area between LSTM = 0.653 ricular.
the baseline LR = 0.669 - No strict tradeoff
group ROC SVM = 0.693 between model
curve ROCb Naïve- performance
(e.g. Male) and Bayes = 0.558 and discrimina-
the comparison CART = 0.715 tion
groups ROCc
(e.g. Female).
Replication study
International Journal of Artificial Intelligence in Education
Table 6  (continued)
Ref Study design Dataset Methods Evaluation metrics Results Conclusions Limitations
S2 Sha et al. (2022) Investigated the Three datasets − 22 features Distribution and Best results are 8 out of the 11 Addressed only
impact of 11 used: extracted for hardness bias stated: CBTs improved the data aspect
class balancing - Moodle dataset: KDDCUP drop- to measure data Hardness bias: fairness in at of algorithmic
techniques on 3,703 forum out prediction; bias. 0.001 for least two out of bias but bias may
fairness of three discussion posts CFIN model (i.e. AUC for predictive Forum post the three tasks also occur during
representative by Monash Uni- DNN + CNN). accuracy task, 0.014 Over-sampling annotation of a
prediction tasks: versity students − 112 features ABROCA for pre- for KDDCUP, techniques training sample,
Forum post classi- - KDDCUP extracted for dictive fairness and 0.01 for methods feature engineer-
fication, student (2015) dataset OULA perfor- OULA. improved both ing, and model
dropout predic- with 39 courses mance predic- AUC: 0.844, fairness and training.
tion and student and 180,785 tion; GridSearch 0.89, and 0.802 accuracy. Focused only on
performance students from and Random respectively for No tradeoff student gender
prediction. online learning Forest model Forum post, between fairness groups. Need
platform Xue- implemented. KDDCUP, and and accuracy. to explore other
TangX. - Forum post clas- OULA tasks. demographic
International Journal of Artificial Intelligence in Education

- Open Univer- sification task ABROCA: attributes.


sity Learning involves BERT 0.055, 0.019,
Analytics embedding of and 0.018
dataset with posts; CLSTM respectively for
24,806 students’ (CNN + LSTM) Forum post,
interactions in model imple- KDDCUP, and
the VLE on 22 mented using OULA tasks.
courses. Tensorflow.
80/20% Train/Test - Class balancing
split techniques used
All 3 datasets are include four
publicly avail- under-sampling,
able. four over-sam-
pling and three
combined sam-
pling techniques.

13
Table 6  (continued)
Ref Study design Dataset Methods Evaluation metrics Results Conclusions Limitations
S3 Wang et al. (2022) Studied human- - Dataset with - Constructed user - Normalized GenderDebiased: - ML evaluation − 80% of the

13
fair-AI interac- information embedding via discounted NDCCG@3: showed that the participants are
tion: from 16,619 Neural collabo- cumulative gain 0.005 gender-debiased not open to new
1. Case study – anonymised rative filtering at K Higher NDCCG@10: recommender is career suggestions
developed a Facebook users; (NCF) model NDCG@K 0.01 fairer and more as most are uni-
debiased recom- 60% female and for predicting means better NDCCG@20: accurate than versity students.
mender system 40% male, 1,380 the items a user accuracy 0.013 gender-aware - Focused only on
that mitigates unique college “likes”. - Non-parity UPAR = 1.1188 recommender applicants but
gender bias in majors, 143,303 - Then, trained a unfairness. Lower Mean t-test but online user career-related
college major unique items Logistic Regres- UPAR means acceptance study showed bias exist also in
recommenda- that a user can sion classifier fairer system. score: 0.279 that participants college admission
tions like on Face- to predict top - T-test for user GenderAware: prefer the biased decision makers.
2. Conducted book and 3.5 college majors acceptance study NDCCG@3: system.
an online user million + user- using the 0.0009 - For ML evalua-
study with over item pairs. embedding as NDCCG@10: tion, Gender-
200 student - Online user input features. 0.005 debiased recom-
participants. study with 202 - Vector NDCCG@20: mender is fairer
participants: projection-based 0.007 without any loss
48% females and bias attenua- UPAR = 1.1445 of prediction
49.5% males. tion method to Mean t-test accuracy.
Only 18.3% gender-debias acceptance
open to career the recommen- score: 0.372
suggestions dations.
- Train-test split:
70/30%
International Journal of Artificial Intelligence in Education
Table 6  (continued)
Ref Study design Dataset Methods Evaluation metrics Results Conclusions Limitations
S4 Yu et al. (2021) Investigated Sample contains Extracted 58 fea- - Accuracy GBT Including or Did not investigate
whether sensi- 81,858 online tures. AWARE - Recall Accuracy, Recall excluding the impact of
tive features course-taking experiment - True Negative and TNR protected attrib- counterfactual
should be records for included all Rate (TNR) slightly higher utes does not protected attrib-
included in 24,198 unique features while Measure fairness by 0.2, 0.1, generally affect utes on fairness.
model students and 874 BLIND experi- via Group differ- and 0.1% in fairness in terms
Case study: Col- unique courses; ment excluded ence in perfor- AWARE model of any metric in
lege drop-out and 564,104 four protected mance metric compared to any enrollment
prediction with residential attributes. BLIND for format.
or without course-taking LR and Gradient online students. Performance do
four sensi- records for Boosting were No difference not differ signif-
tive features 93,457 unique implemented. for residential. icantly between
(gender, URM, students and Train-Test split LR the BLIND
first-generation 2,877 unique done by cohort: Accuracy, and AWARE
student and courses. Training set has Recall and models.
International Journal of Artificial Intelligence in Education

high-financial 17,259 online TNR slightly GBT slightly


need). and 79,182 resi- lesser by 0.2, performed better
dential students 0.1, and 0.1% than LR.
while test set has in AWARE
6,939 online and compared to
14,275 residen- BLIND for
tial students. online students.
5-fold CV Grid- Difference not
Search, indicator statistically
variables for significant
missing values, (p < 0.1)
feature scaling,
and adjusted
sample weights
to address class
imbalance.

13
Table 6  (continued)
Ref Study design Dataset Methods Evaluation metrics Results Conclusions Limitations
S5 Jiang and Pardos Trial differ- Anonymised - LSTM model Accuracy, True Adversarial Using weight Focused on race;

13
(2021) ent fairness students’ course implemented. Positive Rate, learning strat- loss function did not analyze
strategies for a enrolment - Fairness Strate- and True Nega- egy has best to balance race bias that may be
grade prediction data from UC gies: tive Rate reported group fairness reduced the introduced by
algorithm with Berkeley. 82,309 1) Fairness according to as follows: TPR, TNR, and other sensitive
respect to race. undergraduates through una- equity of oppor- TPR: accuracy for features like gen-
Fairness strategies with a total of wareness as tunity and equity Range = 9.48 & all race groups der and parental
were applied 1.97 million baseline. of odds. STD = 3.8 except Pacific income.
at three stages enrollments and 2) Data construc- TNR: Islanders.
of the model 10,430 unique tion stage: Range = 8.76 & Inference strategy
pipeline: data courses. Added race and STD = 3.45 recorded the
construction, other features; Accuracy: best perfor-
model training, assigned weights Range = 1.5 & mance, but it
and inference to balance race STD = 0.62 is the worst in
(prediction). and grade labels. NB: If group terms of fair-
3) Model training fairness is fully ness.
stage: Adver- attained, Range Adversarial learn-
sarial learning and STD will ing strategy
4) Prediction be 0. achieved the
stage: inference fairest results for
strategy, i.e. all metrics.
remove sensitive
feature (race) for
prediction.
- Training, valida-
tion, and test
data are in the
ratio 13:1:1.
International Journal of Artificial Intelligence in Education
Table 6  (continued)
Ref Study design Dataset Methods Evaluation metrics Results Conclusions Limitations
S6 Sha et al. (2021) Assessed algorith- 3,703 randomly- - Evaluated four Accuracy, Cohen’s - CNN-LSTM - Traditional Focused only on
mic fairness in selected discus- traditional ML κ, AUC, and F1 has highest ML models two demographic
classification of sion posts made models: SVM, score for perfor- performance: provided fairer groups: gender
forum posts. by Monash Random Forests, mance evaluation 0.795 AUC results while DL and first language.
Fairness assessed University Logistic regres- ABROCA for fair- for original models yielded
along demo- students includ- sion, and Naïve ness evaluation. sample; 0.792 better perfor-
graphic groups ing 2,339 (63%) Bayes and two and 0.802 AUC mance.
(sex and English content-relevant DL models: for balanced - Strict perfor-
as first or second posts and 1,364 Bi-LSTM and sample along mance-for-fair-
language). (37%) content- CNN-LSTM. gender and ness trade- off is
irrelevant posts. - Text embeddings first language not evident.
Dataset is publicly generated using respectively. - Equal sampling
available. the tool Bert-as- - Best fairness in across demo-
service. original sam- graphic groups
- Trained clas- ple: ABROCA reduced model
International Journal of Artificial Intelligence in Education

sifiers with of 0.007 for unfairness.


two samples: gender, and
original sample 0.032 for lan-
and balanced guage group.
sample (along - In balanced
gender and first sample:
language). ABROCA
- Training, of 0.003 for
validation, and sex group
test set in ratio and 0.012 for
70:10:20. language.

13
Table 6  (continued)
Ref Study design Dataset Methods Evaluation metrics Results Conclusions Limitations
S7 Yu et al. (2020) Fairness and Sample students Institutional Accuracy, False The combination Survey data is Scope of the

13
Performance enrolled in 10 features positive rate of institutional neither fair nor feature sets was
evaluation of the online, introduc- included student (FPR), and LMS accurate. limited and not
utility of differ- tory STEM demographics False negative rate data gave Inclusion of representative of
ent data sources courses taught and academic (FNR). best accuracy institutional the full potential
in predicting from 2016 to achievement Fairness measured (∆ = 0.052, features seemed of different data
short and long- 2018 at a large, prior to college. via performance p < 0.001 for to produce the sources.
term student public research Click features parity. short term and most unfair
success. US university. were derived ∆ = 0.037, outcome.
Data sources Original dataset from the LMS p = 0.014 for All data sources
examined has 2,244 stu- data. long term). exhibit bias.
include LMS dents. After data Survey features Institutional data
data, institu- cleaning, the included four consistently
tional data and final sample size constructs of underestimates
survey data. was 2,093. self-regulated historically
learning skills disadvantaged
and self-efficacy student groups
from pre-course while LMS
surveys. data overesti-
SVM, LR, and mate some of
RF models were these groups
implemented. more often.
International Journal of Artificial Intelligence in Education
Table 6  (continued)
Ref Study design Dataset Methods Evaluation metrics Results Conclusions Limitations
S8 Hu and Rangwala Proposed Multiple A 10-year data at - Created a model, Accuracy, Dis- The proposed Though designed Focused only on
(2020) Cooperative George Mason MCCM which crimination and model MCCM for individual gender and race
Classifier Model University from is composed of Consistency. recorded the fairness, the pro- groups.
to improve indi- Fall 2009 to Fall two classifiers, best perfor- posed MCCM
vidual fairness 2019. Covering each of which mance in achieved both
in at-risk student top five majors. corresponds to a mitigating bias group fairness
prediction. A course is chosen sensitive attrib- in terms of dis- and indi-
only if at least ute, e.g., male or crimination. vidual fairness,
300 students female. because group
have taken it. - Used Rawlsian and individual
Fairness, LR, fairness are
Learning Fair highly cor-
Representation, related.
and Adversarial
Learned Fair
International Journal of Artificial Intelligence in Education

Representation
as baselines.

13
S9: Lee and Kizilcec (2020)
Study design: Predicted student success based on university administrative records and applied post-hoc adjustments to improve model fairness.
Dataset: Student-level administrative data spanning Fall 2014 to Spring 2019 was obtained from a US research university. The final processed data has 5,443 instances and 56 features. Duplicates, missing course grades, courses taken multiple times, and grades other than letter (A-F) or pass/fail grades were removed.
Methods: A Random Forest model was implemented. Each instance was weighted with the inverse of its label proportion to achieve label balance in the training set. An indicator value of -999 was imputed for missing standardized test scores. To improve fairness, a different threshold value was set for each subgroup such that equality of opportunity is achieved on the test set.
Evaluation metrics: Equality of opportunity, demographic parity, and positive predictive parity.
Results: Overall accuracy of the model on the test data is 0.73, with an F-score of 0.80. After correction for equality of opportunity via threshold adjustment for each group, the subgroup accuracy remains similar for both groups.
Conclusions: Optimizing the model to satisfy equality of opportunity perpetuates unfairness in terms of demographic parity and positive predictive parity for both gender and racial-ethnic groups. The model exhibits gender and racial bias in two out of the three fairness measures considered.
Limitations: Analysis focused on two binary protected attributes, racial-ethnicity and gender, alone.
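Two techniques in this entry lend themselves to short sketches: weighting each training instance by the inverse of its label proportion, and choosing a separate decision threshold per subgroup so that true positive rates (equality of opportunity) are equalized. The code below is an illustrative approximation of those ideas rather than Lee and Kizilcec's (2020) implementation; the target TPR and the threshold grid are assumptions.

import numpy as np

def inverse_label_weights(y):
    # Weight each instance by the inverse of its label's proportion in the training set.
    values, counts = np.unique(y, return_counts=True)
    proportion = dict(zip(values, counts / len(y)))
    return np.array([1.0 / proportion[label] for label in y])

def per_group_thresholds(y_true, y_score, group, target_tpr=0.80):
    # For each subgroup, pick the threshold whose TPR is closest to a shared target,
    # a simple post-hoc way of approximating equality of opportunity.
    grid = np.linspace(0.01, 0.99, 99)
    thresholds = {}
    for g in np.unique(group):
        positives = (group == g) & (y_true == 1)
        tprs = np.array([(y_score[positives] >= t).mean() for t in grid])
        thresholds[g] = grid[np.argmin(np.abs(tprs - target_tpr))]
    return thresholds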
S10: Anderson et al. (2019)
Study design: Assessed the equitability and optimality of graduation prediction models.
Dataset: Data from a public, US R1 research university. Includes 14,706 first-time-in-college undergraduates admitted in Fall semesters between 2006 and 2012 (inclusive).
Methods: The feature set covers academic performance, financial information, pre-admission information, and extra-curricular activities. Linear-kernel SVM, Decision Tree, Random Forest, Logistic Regression, and Stochastic Gradient Descent classifiers were implemented. Model parameters were selected via 5-fold CV grid search. Train-test split: 80:20.
Evaluation metrics: Equity of models measured via false positive and false negative predictions; AUC-ROC for performance.
Results: Decision Tree = 0.798, SVM = 0.805, Logistic Regression = 0.807, Random Forest = 0.800, SGD = 0.814.
Conclusions: No model, or population, saw a meaningful change in per-group performance when trained only on one population. Models are not perfectly equitable.
Limitations: Need to investigate whether the model implementation affects student outcomes, and whether the slight unfairness observed translates to real-world differences.
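The model-selection step described here (several classifiers, parameters chosen by 5-fold grid search, an 80:20 train-test split, AUC-ROC for performance) maps directly onto standard scikit-learn tooling. The snippet below is a sketch of that workflow on synthetic data; the parameter grids, the synthetic dataset, and the subset of classifiers shown are placeholders rather than the configuration used by Anderson et al. (2019), and the SGD loss name assumes a recent scikit-learn release.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}),
    "sgd": (SGDClassifier(loss="log_loss", random_state=0), {"alpha": [1e-4, 1e-3]}),
}

for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="roc_auc")   # 5-fold CV grid search
    search.fit(X_train, y_train)
    test_scores = search.best_estimator_.predict_proba(X_test)[:, 1]
    print(name, round(roc_auc_score(y_test, test_scores), 3))

The same loop extends to the linear-kernel SVM and decision tree used in the study; per-group false positive and false negative rates can then be computed on the held-out set to probe equity.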
S11: Loukina et al. (2019)
Study design: Explored multiple dimensions of fairness. Case study: simulated and real data to assess the impact of English language proficiency test-takers' native language on automated scores.
Dataset: Created a corpus with a uniform distribution of native languages by randomly sampling a similar number of test-takers for each version of the test. The final corpus included 26,710 responses from 4,452 test-takers (742 for each language).
Methods: Four different simulated models were implemented: a random model, a perfect model, an almost perfect model, and a metadata-based model. Train-test split: 75:25.
Evaluation metrics: Three dimensions of fairness: overall score accuracy, overall score difference, and conditional score difference.
Results: The speakers of all languages were disadvantaged by the META model. GER speakers were underscored while JPN speakers were overscored by the perfect model. On actual models, the model trained separately for each native language is most fair in terms of overall score differences.
Conclusions: Total fairness may not be achievable, and the different definitions of fairness may require different solutions. None of the three definitions of fairness is, in principle, more important than another.
Limitations: Considered human scores to be the true 'gold standard' measure of language proficiency; however, human scores are likely to contain a certain amount of error and bias.
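The 'overall score difference' dimension in this entry compares automated and human scores within each native-language group. One common operationalization, the standardized mean difference of machine minus human scores per group, is sketched below; whether this matches Loukina et al.'s (2019) exact formula is an assumption, and the function name is illustrative.

import numpy as np

def overall_score_difference(machine, human, group):
    # Standardized mean difference (machine - human) per group, scaled by the human-score SD.
    # Positive values suggest overscoring of a group, negative values underscoring.
    sd = np.std(human, ddof=1)
    return {g: (machine[group == g] - human[group == g]).mean() / sd for g in np.unique(group)}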
S12: Kusner et al. (2017)
Study design: Developed a framework for modeling fairness using tools from causal inference.
Dataset: Survey data covering 163 law schools and 21,790 law students in the United States, with information on LSAT score, GPA, etc., provided by the Law School Admission Council.
Methods: Counterfactual fairness defined as: a decision to an individual is fair if it is the same in the actual world and in a counterfactual world where the individual belonged to a different demographic group.
Evaluation metrics: RMSE for accuracy; counterfactual fairness for fairness.
Results: Full and unaware models are fair with respect to sex, as they have a very weak causal link between sex and GPA/LSAT.
Conclusions: The proposed counterfactual fairness framework takes protected attributes into account in modelling.
Limitations: Implemented only for one use case.
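Counterfactual fairness, as defined in this entry, can be checked empirically once counterfactual versions of each individual's features have been generated from a fitted causal model. The sketch below performs only that final comparison; constructing the counterfactual feature matrix is assumed to happen elsewhere and is the substantive part of Kusner et al.'s (2017) framework.

import numpy as np

def counterfactual_fairness_gap(model, X_factual, X_counterfactual):
    # Mean and maximum absolute change in prediction when each individual's protected
    # attribute is counterfactually flipped; a gap of zero on this sample is consistent
    # with counterfactual fairness (illustrative check, not the full causal procedure).
    gap = np.abs(model.predict(X_factual) - model.predict(X_counterfactual))
    return gap.mean(), gap.max()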

Funding The author gratefully acknowledges support from the Chevening Scholarship, the UK government's global scholarship programme, funded by the Foreign, Commonwealth and Development Office (FCDO).

Declarations
Competing Interests The author has no relevant financial or non-financial interests to disclose.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
Anderson, H., Boodhwani, A., & Baker, R. S. (2019). Assessing the Fairness of Graduation Predictions.
In EDM.
Baker, R. S., & Hawn, A. (2021). Algorithmic bias in education. International Journal of Artificial Intelligence in Education, 32. https://doi.org/10.1007/s40593-021-00285-9
Corbett-Davies, S., Gaebler, J., Nilforoshan, H., Shroff, R., & Goel, S. (2023). The measure and mismeas-
ure of fairness. Journal of Machine Learning Research, 24(312), 1–117.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. In Pro-
ceedings of the 3rd innovations in theoretical computer science conference (pp. 214–226).
Gajane, P., & Pechenizkiy, M. (2017). On formalizing fairness in prediction with machine learning. arXiv preprint arXiv:1710.03184. Retrieved from https://arxiv.org/abs/1710.03184
Gardner, J., Brooks, C., & Baker, R. (2019). Evaluating the fairness of predictive student models through slicing analysis. Proceedings of the 9th International Conference on Learning Analytics & Knowledge, 225–234. https://doi.org/10.1145/3303772.3303791
Gedrimiene, E., Celik, I., Mäkitalo, K., & Muukkonen, H. (2023). Transparency and trustworthiness in
user intentions to follow career recommendations from a learning analytics tool. Journal of Learn-
ing Analytics, 10(1), 54–70.
Gusenbauer, M., & Haddaway, N. R. (2020). Which academic search systems are suitable for systematic
reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other
resources. Research Synthesis Methods, 11(2), 181–217.
Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. In Proceed-
ings of the 30th International Conference on Neural Information Processing Systems (NIPS’16) (pp.
3323-3331). Curran Associates Inc.
Hu, Q., & Rangwala, H. (2020). Towards fair educational data mining: A case study on detecting at-risk
students. International Educational Data Mining Society.
Jiang, W., & Pardos, Z. A. (2021). Towards equity and algorithmic fairness in student grade prediction. In
Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (pp. 608–617).
Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17) (pp. 4069–4079). Curran Associates Inc.
Kuzilek, J., Hlosta, M., & Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4, 170171. https://doi.org/10.1038/sdata.2017.171
Lee, H., & Kizilcec, R. F. (2020). Evaluation of fairness trade-offs in predicting student success. arXiv preprint arXiv:2007.00088. Retrieved from https://arxiv.org/abs/2007.00088
Loukina, A., & Buzick, H. (2017). Use of automated scoring in spoken language assessments for test tak-
ers with speech impairments. ETS Research Report Series, 2017(1), 1–10.
Loukina, A., Madnani, N., & Zechner, K. (2019). The many dimensions of algorithmic fairness in educa-
tional applications. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Build-
ing Educational Applications (pp. 1–10).
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., & Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. International Journal of Surgery, 88.
Patterson, C., York, E., Maxham, D., Molina, R., & Mabrey, P. (2023). Applying a responsible innovation framework in developing an equitable early alert system: A case study. Journal of Learning Analytics, 10(1), 24–36.
Pessach, D., & Shmueli, E. (2022). A review on fairness in machine learning. ACM Computing Surveys
(CSUR), 55(3), 1–44.
Rets, I., Herodotou, C., & Gillespie, A. (2023). Six practical recommendations enabling ethical use of predictive learning analytics in distance education. Journal of Learning Analytics, 10(1), 149–167. https://doi.org/10.18608/jla.2023.7743
Sha, L., Raković, M., Das, A., Gašević, D., & Chen, G. (2022). Leveraging class balancing techniques to
alleviate algorithmic bias for predictive tasks in education. IEEE Transactions on Learning Tech-
nologies, 15(4), 481–492.
Sha, L., Rakovic, M., Whitelock-Wainwright, A., Carroll, D., Yew, V. M., Gasevic, D., & Chen, G.
(2021). Assessing algorithmic fairness in automatic classifiers of educational forum posts. In Arti-
ficial Intelligence in Education: 22nd International Conference, AIED 2021, Utrecht, The Nether-
lands, June 14–18, 2021, Proceedings, Part I 22 (pp. 381–394). Springer International Publishing.
Stilgoe, J., Owen, R., & Macnaghten, P. (2013). Developing a framework for responsible innovation. Research Policy, 42(9), 1568–1580. https://doi.org/10.1016/j.respol.2013.05.008
Suresh, H., & Guttag, J. (2021). A framework for understanding sources of harm throughout the machine
learning life cycle. In Equity and Access in Algorithms, Mechanisms, and Optimization (pp. 1–9).
Wang, C., Wang, K., Bian, A., Islam, R., Keya, K. N., Foulds, J., & Pan, S. (2022). Do Humans Prefer
Debiased AI Algorithms? A Case Study in Career Recommendation. In 27th International Confer-
ence on Intelligent User Interfaces (pp. 134–147).
Yu, R., Lee, H., & Kizilcec, R. F. (2021). Should college dropout prediction models include protected attributes? In Proceedings of the Eighth ACM Conference on Learning @ Scale (pp. 91–100).
Yu, R., Li, Q., Fischer, C., Doroudi, S., & Xu, D. (2020). Towards Accurate and Fair Prediction of Col-
lege Success: Evaluating Different Sources of Student Data. Proceedings of The 13th International
Conference on Educational Data Mining (EDM 2020), 292–301.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
