Debiasing Education Algorithms
Jamiu Adekunle Idowu
https://doi.org/10.1007/s40593-023-00389-4
Abstract
This systematic literature review investigates the fairness of machine learning algo-
rithms in educational settings, focusing on recent studies and their proposed solu-
tions to address biases. Applications analyzed include student dropout prediction,
performance prediction, forum post classification, and recommender systems.
We identify common strategies, such as adjusting sample weights, bias attenua-
tion methods, fairness through un/awareness, and adversarial learning. Commonly
used metrics for fairness assessment include ABROCA, group difference in perfor-
mance, and disparity metrics. The review underscores the need for context-specific
approaches to ensure equitable treatment and reveals that most studies found no
strict tradeoff between fairness and accuracy. We recommend evaluating fairness
of data and features before algorithmic fairness to prevent algorithms from receiv-
ing discriminatory inputs, expanding the scope of education fairness studies beyond
gender and race to include other demographic attributes, and assessing the impact of
fair algorithms on end users, as human perceptions may not align with algorithmic
fairness measures.
Introduction
Motivation
In 2020, the UK’s decision to use an algorithm to predict A-level grades led to a
public outcry. The algorithm disproportionately downgraded students from disad-
vantaged backgrounds while favoring those from affluent schools. This is just one
out of many cases of how algorithms perpetuate existing biases. Motivated by such cases, this review pursues the following research objectives:
• Research Objective 1: To identify the key metrics for assessing the fairness of
education algorithms.
• Research Objective 2: To identify and analyze the bias mitigation strategies and
sensitive features used in educational algorithms.
• Research Objective 3: To investigate the tradeoff between fairness and accuracy
in educational algorithms.
Fairness in machine learning has become an increasingly central topic, driven by the
growing awareness of the societal implications of algorithmic decisions. While the
design and training of algorithms can increase unfairness, biases often come from
the data itself capturing historical prejudices, cultural stereotypes, or demographic
disparities, which in turn influence model behaviour (Yu et al., 2020; Jiang & Par-
dos, 2021).
Although the definition of fairness is a subject of debate, at the broadest level, the fairness of algorithms falls into two categories: individual fairness and group fairness.
Individual fairness emphasizes that similar individuals should receive similar outcomes.
Group fairness, on the other hand, focuses on ensuring equal statistical outcomes across distinct groups partitioned by protected attributes.
Recognizing this, Yu et al. (2021) emphasized the importance of contextualizing fairness based on the application.
Method
This systematic literature review (SLR) collates, classifies and interprets the existing
literature that fits the pre-defined eligibility criteria of the review. The 2020 updated
PRISMA guideline is adopted for the review (Page et al., 2021). The stages involved
in this SLR are discussed as follows:
Scopus was selected as the principal search system for this systematic review because it is extensive (70,000,000+ publications) and can efficiently perform searches with high precision, recall, and reproducibility (Gusenbauer & Haddaway, 2020). In addition, ACM Digital Library and IEEE Xplore were used as supplementary search systems because they focus specifically on the computer science subject area (Gusenbauer & Haddaway, 2020). Table 1 provides more details on the data sources.
The keywords used for the search strategy are presented in Table 2.
This section outlines the inclusion and exclusion criteria for the SLR. We screened
studies in three steps: first by titles and keywords, then abstracts, and finally through
full-text reading. Studies that met the inclusion criteria were further evaluated in
Stage 3 using the quality assessment criteria.
Inclusion Criteria:
• Papers that directly address the issues of algorithmic bias in educational settings
or tools.
• Studies published from 2015 onwards.
• Research materials such as peer-reviewed journal articles, conference proceed-
ings, white papers, etc. all with clearly defined research questions.
Exclusion Criteria:
Table 1 Data sources for identifying relevant studies

Scopus (abstract and citations database: journal articles, conference proceedings, etc.). Years covered: all years. Justification: extensive publications, usable as a principal search system, full index, bulk download, repeatable, etc. (Gusenbauer & Haddaway, 2020).

ACM Digital Library (full-text digital library). Years covered: all years. Justification: primary focus on Computer Science, usable as a principal search system, full index, bulk download, repeatable, etc. (Gusenbauer & Haddaway, 2020).

IEEE Xplore (digital library database: journals, conference papers, etc.). Years covered: all years. Justification: focus on Computer Science, full index, bulk download, repeatable, etc. (Gusenbauer & Haddaway, 2020).
QA1. Did the study have clearly defined research questions and methodology?
QA2. Did the study incorporate considerations of fairness during the algorithm
development process, ensuring non-discrimination based on factors like gender,
race, disability, socioeconomic status, etc.?
QA3. Did the study report the performance of the algorithm using suitable metrics
such as accuracy, F1 score, AUC-ROC, precision, recall, etc.?
QA4. Was the study based on a real-world educational dataset?
Results
A total of 3424 targeted studies were identified. 2747 of these are from the prin-
cipal search system (Scopus), 357 from ACM, and 320 from IEEE Xplore. Using
EndNote, a reference management package, 426 duplicates were detected and
removed leaving 2998 studies whose titles and/or abstracts were screened.
By screening the studies' titles and keywords, 2865 papers were excluded, while the abstracts of the remaining 133 papers were read in full. At the end of this screening exercise, 86 papers were removed, leaving 47 papers eligible for full-text screening. After full-text screening, 31 papers were dropped, leaving 16 papers. A consultation with experts yielded one additional paper, Counterfactual Fairness, a study that developed a framework for modeling fairness using causal inference and demonstrated the framework on fair prediction of students' success at law school (Kusner et al., 2017). In the end, 12 papers passed our quality assessment and became the eligible papers.
Figure 2 presents a PRISMA flow chart of the study selection process. The data
extraction table containing the study design, dataset, methods, evaluation metrics,
results, conclusion, and limitations of the eligible papers is presented in Appen-
dix Table 6.
Discussion
Research Objective 1: Measuring the Fairness of Education Algorithms

1. ABROCA (S1, S2 & S6)

ABROCA = ∫₀¹ |ROC_b(t) − ROC_c(t)| dt
Unlike fairness metrics like equalized odds, ABROCA is not threshold depend-
ent. The metric reveals model discrimination between subgroups across all pos-
sible thresholds t. Also, as pointed out by S1, ABROCA evaluates model accu-
racy without focusing strictly on positive cases, making it relevant for learning
analytics where a prediction, such as a student potentially dropping out, is not
inherently good or bad, but rather used to inform future support actions. Another
advantage of the metric is its minimal computational requirement, as it can be cal-
culated directly from the results of a prediction modeling task (S1). Indeed, some
of the relevant studies identified in this SLR adopted ABROCA; while S6 (Sha
et al., 2021) used ABROCA to assess fairness of students’ forum post classifica-
tion, S2 (Sha et al., 2022) used the metric to assess fairness of student dropout
prediction, student performance prediction, and forum post classification.
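Since ABROCA can be computed directly from the outputs of any prediction model, a minimal sketch is given below. It assumes scikit-learn-style predicted probabilities and exactly two demographic groups; the function name, the interpolation grid and the trapezoidal integration are illustrative choices, not taken from the reviewed studies.

```python
import numpy as np
from sklearn.metrics import roc_curve

def abroca(y_true, y_score, group):
    """Absolute area between the ROC curves of two demographic groups.

    A value of 0 means the model ranks students identically well for both
    groups at every possible classification threshold.
    """
    y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
    g_b, g_c = np.unique(group)  # baseline and comparison group labels (two groups assumed)

    # ROC curve (FPR vs. TPR) for each group separately
    fpr_b, tpr_b, _ = roc_curve(y_true[group == g_b], y_score[group == g_b])
    fpr_c, tpr_c, _ = roc_curve(y_true[group == g_c], y_score[group == g_c])

    # Interpolate both curves onto a shared FPR grid so they can be subtracted
    grid = np.linspace(0.0, 1.0, 1001)
    diff = np.abs(np.interp(grid, fpr_b, tpr_b) - np.interp(grid, fpr_c, tpr_c))

    # Numerically integrate |ROC_b(t) - ROC_c(t)| over t in [0, 1]
    return np.trapz(diff, grid)
```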
2. Group Difference in Performance (S4, S8, S9, S10 & S11)
Several of the selected studies measured fairness via group differences in performance. S4 (Yu et al., 2021) used the difference between the Accuracy, Recall and
TNR of its AWARE model (i.e., sensitive features present as predictors) and the
BLIND model (sensitive features absent). Similarly, S9 (Lee & Kizilcec, 2020)
used demographic parity as one of its fairness metrics while S10 (Anderson
et al., 2019) measured the optimality of models via a comparison between the
AUC ROC gained or lost when separate models were built for each sub-popula-
tion. Meanwhile, for S11 (Loukina et al., 2019), the approach adopted involved three dimensions of fairness, namely overall score accuracy, overall score difference, and conditional score difference.
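A generic sketch of this family of metrics, per-group accuracy, recall and TNR together with the largest between-group gap, is shown below. It is only an illustration of the general idea; each of the studies above used its own specific variant (e.g., AWARE vs. BLIND differences in S4, demographic parity in S9).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def group_performance_gaps(y_true, y_pred, group):
    """Per-group accuracy, recall (TPR) and TNR, plus the largest gap per metric."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    per_group = {}
    for g in np.unique(group):
        m = group == g
        tn, fp, fn, tp = confusion_matrix(y_true[m], y_pred[m], labels=[0, 1]).ravel()
        per_group[g] = {
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
            "tnr": tn / (tn + fp) if (tn + fp) else float("nan"),
        }
    gaps = {
        metric: max(v[metric] for v in per_group.values())
        - min(v[metric] for v in per_group.values())
        for metric in ("accuracy", "recall", "tnr")
    }
    return per_group, gaps
```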
3. True Positive Rate and True Negative Rate (S5 & S9)
S5 (Jiang & Pardos, 2021) examined race group fairness by reporting true
positive rate (TPR), true negative rate (TNR), and accuracy according to equity
of opportunity and equity of odds. Similarly, S9 (Lee & Kizilcec, 2020) used
equality of opportunity and positive predictive parity.
4. False Positive Rate and False Negative Rate (S7 & S10)
For S7 (Yu et al., 2020), performance parity across student subpopulations was
the approach to fairness measurement. The study computed disparity metrics and
tested the significance of the disparities using a one-sided two-proportion z-test.
Higher ratios indicate a greater degree of unfairness. The disparity metrics are
defined as follows:
Accuracy disparity = acc_ref / acc_g
FPR disparity = fpr_g / fpr_ref
FNR disparity = fnr_g / fnr_ref
where g is the disadvantaged student group and ref is the reference group.
Also, S10 (Anderson et al., 2019) assessed the equity of graduation prediction models using false positive rates and false negative rates.
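The disparity ratios above reduce to a few lines of code; the sketch below also includes one plausible implementation of the significance test with statsmodels, applied to overall error rates. The choice of what to feed into the z-test is our reading of S7's description, not necessarily the authors' exact procedure.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

def rate_disparities(y_true, y_pred, group, disadvantaged, reference):
    """FPR/FNR/accuracy disparity ratios between a disadvantaged group g
    and a reference group ref, in the spirit of S7 (Yu et al., 2020)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))

    def rates(mask):
        yt, yp = y_true[mask], y_pred[mask]
        fp = np.sum((yp == 1) & (yt == 0)); tn = np.sum((yp == 0) & (yt == 0))
        fn = np.sum((yp == 0) & (yt == 1)); tp = np.sum((yp == 1) & (yt == 1))
        return {"acc": (tp + tn) / len(yt), "fpr": fp / (fp + tn), "fnr": fn / (fn + tp),
                "errors": fp + fn, "n": len(yt)}

    g, ref = rates(group == disadvantaged), rates(group == reference)
    disparities = {
        "accuracy_disparity": ref["acc"] / g["acc"],  # acc_ref / acc_g
        "fpr_disparity": g["fpr"] / ref["fpr"],       # fpr_g / fpr_ref
        "fnr_disparity": g["fnr"] / ref["fnr"],       # fnr_g / fnr_ref
    }
    # One-sided two-proportion z-test on overall error rates
    # (one plausible reading of the significance test described in S7).
    _, p_value = proportions_ztest(
        count=[g["errors"], ref["errors"]], nobs=[g["n"], ref["n"]], alternative="larger"
    )
    return disparities, p_value
```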
5. Others (S3 & S12)
S3 (Wang et al., 2022) and S12 (Kusner et al., 2017) took unique approaches to assessing the fairness of a students' career recommender system and the prediction of success in law school, respectively. While S3 used non-parity unfairness (UPAR), S12 used counterfactual fairness. According to S3, non-parity unfairness computes the absolute difference of the average ratings between two groups of users:

UPAR = |E_g[y] − E_¬g[y]|

where E_g[y] is the average predicted score for one group (e.g., male) and E_¬g[y] is the average predicted score for the other group (e.g., female); the lower the UPAR, the fairer the system.
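Because UPAR is simply the absolute difference of two group means of the predicted scores, it can be computed as follows (illustrative sketch; the function name is our own):

```python
import numpy as np

def non_parity_unfairness(y_score, group, g_label):
    """UPAR: absolute difference between the average predicted scores of the two
    user groups (lower is fairer), following the definition quoted from S3."""
    y_score, group = np.asarray(y_score, dtype=float), np.asarray(group)
    return abs(y_score[group == g_label].mean() - y_score[group != g_label].mean())
```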
Meanwhile, S12, focusing on predicting success in law school, proposed Coun-
terfactual Fairness. This approach is rooted in the idea that a predictor is con-
sidered counterfactually fair if, for any given context, altering a protected attrib-
ute (like race or gender) doesn’t change the prediction outcome when all other
causally unrelated attributes remain constant. Formally, let A denote the protected attributes, X the remaining attributes, and Y the desired output. Given a causal model represented by (U, V, F), where V is the union of A and X, S12 postulated the following criterion for predictors of Y. A predictor Ŷ is said to be counterfactually fair if, for any specific context defined by X = x and A = a, the following condition holds:

P(Ŷ_{A←a}(U) = y | X = x, A = a) = P(Ŷ_{A←a′}(U) = y | X = x, A = a)

for every possible outcome y and for any alternative value a′ that the protected attribute A can assume.
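A naive probe in this spirit flips the protected attribute while holding all other recorded features fixed and measures how much the predictions move. The sketch below only coincides with counterfactual fairness under the strong assumption that no other feature is causally downstream of the protected attribute, which is precisely why S12 works with an explicit causal model (U, V, F) rather than a simple flip test; the helper name and scikit-learn-style classifier interface are assumptions.

```python
import numpy as np

def attribute_flip_gap(model, X, a_column, a_value, a_prime):
    """Set the protected attribute column to a, then to a', with all other
    recorded features held fixed, and report the mean absolute change in the
    predicted probability. 0 means the flip never changes the prediction."""
    X = np.array(X, dtype=float, copy=True)
    X[:, a_column] = a_value
    p_a = model.predict_proba(X)[:, 1]
    X[:, a_column] = a_prime
    p_a_prime = model.predict_proba(X)[:, 1]
    return np.mean(np.abs(p_a - p_a_prime))
```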
Clearly, this SLR reveals the diverse nature of metrics for measuring education
algorithmic fairness with common methods being ABROCA (S1, S2 & S6), true
positive rate and true negative rate (S5 & S9), false positive rate, false negative
rate, and disparity metrics (S7 & S10) as well as group difference in perfor-
mance (S4, S8, S9, S10 & S11). Other methods include non-parity unfairness,
UPAR (S3) and counterfactual fairness (S12).
Each method comes with its unique strengths and limitations, making it clear that
there is no one-size-fits-all approach to fairness assessment. For instance, ABROCA
is advantageous in scenarios where fairness needs to be evaluated independent of
decision thresholds, offering a more holistic view of model discrimination. On the
other hand, ABROCA might not be as effective in scenarios where specific deci-
sion thresholds are crucial, and this is exactly where measures like equalized odds
do well. Also, counterfactual fairness, which ensures that predictions don’t change
when altering protected attributes like race or gender, is beneficial in contexts where
direct discrimination needs to be explicitly addressed. However, this approach might
be less suitable in situations where indirect bias or correlation with non-protected
attributes is the primary concern.
In a nutshell, our review emphasizes the need for a nuanced understanding of
research and application contexts highlighting that the choice of fairness metric
should be aligned with the specific goals, characteristics, and ethical considerations
of each educational algorithm.
This review reveals that most of the studies in the existing literature on education
algorithmic fairness primarily concentrate on gender and race as sensitive features.
As shown in Table 4, ten papers (S1, S2, S3, S4, S6, S7, S8, S9, S10, & S12) of
the 12 papers reviewed used gender as a sensitive feature while seven papers (S4, S5,
S7, S8, S9, S10, & S12) used race as a sensitive feature. For instance, S1 (Gard-
ner et al., 2019) performed a slicing analysis along student gender alone, with the
authors acknowledging the need for future work to consider multiple dimensions of
student identity. Also, in its research on the effectiveness of class balancing tech-
niques for fairness improvement, S2 (Sha et al., 2022) focused only on student gen-
der groups, with the study noting as part of its limitations, the need to consider other
demographic attributes like first language and educational background. Similarly,
the debiased college major recommender system by S3 (Wang et al., 2022) mitigates
only gender bias. Meanwhile, the study acknowledged that gender bias is only one of
the biases that can harm career choices, emphasizing the importance of addressing
other types of biases in future studies.
Also, S5 (Jiang & Pardos, 2021) primarily focused on race and did not thoroughly
analyze potential biases that may have been introduced by other protected attributes
such as parental income. In a similar vein, S6 (Sha et al., 2021) concentrated only on
gender and first language while S8 (Hu & Rangwala, 2020) and S9 (Lee & Kizilcec,
2020) focused only on gender and race.
Notable exceptions are S4 (Yu et al., 2021) and S7 (Yu et al., 2020), which exam-
ined four sensitive features each: gender, race, first-generation student, and financial
need. Meanwhile, only one paper S11 (Loukina et al., 2019) considered disability. It
is the same for native language, as only S6 (Sha et al., 2021) considered it as a sensi-
tive feature.
Table 4 Sensitive features used in selected studies (sensitive features: sex, race, status*, disability, language)

S1: Data from 44 MOOCs; 3,080,230 students in training and 1,111,591 in the test set. Sensitive features: sex.
S2: Moodle data, KDDCUP (2015) and OULA (Kuzilek et al., 2017). Sensitive features: sex.
S3: Dataset with information from 16,619 anonymised Facebook users. Sensitive features: sex.
S4: 81,858 online course-taking records for 24,198 students and 874 unique courses. Sensitive features: sex, race, status*.
S5: 82,309 undergrads with 1.97 m enrollments and 10,430 courses from UC-Berkeley. Sensitive features: race.
S6: 3,703 randomly-selected discussion posts by Monash University students. Sensitive features: sex, language.
S7: 2,244 students enrolled in 10 online STEM courses taught from 2016 to 2018. Sensitive features: sex, race, status*.
S8: A 10-year dataset at George Mason University from 2009 to 2019, covering the top five majors. Sensitive features: sex, race.
S9: Admin data of 5,443 students from Fall 2014 to Spring 2019 from a US university. Sensitive features: sex, race.
S10: Data covering 14,706 first-time-in-college US undergrads, 2006–2012 (inclusive). Sensitive features: sex, race.
S11: A corpus of 26,710 responses from 4,452 test-takers. Sensitive features: disability.
S12: Survey data covering 163 law schools and 21,790 law students in the US. Sensitive features: sex, race.
The papers included in this SLR leveraged various fairness strategies to debias edu-
cation algorithms as shown in Table 5.
S2, S4, S6 and S9 used class balancing techniques to improve fairness. In par-
ticular, S2 (Sha et al., 2022) examined the effectiveness of 11 class balancing tech-
niques in improving the fairness of three representative prediction tasks: forum post
classification, student dropout prediction, and student performance prediction. The
research considered four under-sampling techniques: Tomek’s links, Near Miss,
Edited Nearest Neighbour, and Condensed Nearest Neighbour; four over-sampling
techniques: SMOTE, SMOTE (K-means), SMOTE (Borderline), and ADASYN;
and four hybrid techniques (with different combinations of under-sampling and
over-sampling techniques). Class balancing was performed on the training set to
ensure gender parity. The study revealed that eight of the 11 class balancing tech-
niques improved fairness in at least two of the three prediction tasks. Particularly,
TomekLink and SMOTE-TomekLink improved fairness across all the three tasks.
Meanwhile, unlike fairness, accuracy was improved almost exclusively by the over-sampling techniques.
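The mechanics of the two techniques that improved fairness across all three tasks can be sketched with the imbalanced-learn library as follows. The data here are synthetic and the samplers are applied in the library's usual way of balancing a target variable; to reproduce the gender-parity setup described for S2, the demographic attribute would be passed as that target.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

# Toy imbalanced data standing in for a training split (hypothetical numbers).
X, target = make_classification(n_samples=2000, n_features=20,
                                weights=[0.7, 0.3], random_state=0)
print("before:", Counter(target))

# Under-sampling: remove Tomek links (borderline majority-class examples).
X_tl, t_tl = TomekLinks().fit_resample(X, target)
print("after Tomek links:", Counter(t_tl))

# Hybrid: SMOTE over-sampling followed by Tomek-link cleaning.
X_st, t_st = SMOTETomek(random_state=0).fit_resample(X, target)
print("after SMOTE-Tomek:", Counter(t_st))
```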
Similarly, to address the issue of class imbalance, S4 (Yu et al., 2021) with a
focus on student drop-out prediction and S9 (Lee & Kizilcec, 2020) with a focus on
student performance prediction adjusted the sample weights to be inversely propor-
tional to class frequencies during the training stage, ensuring that the model fairly
considered both majority and minority classes during the learning process. Also, to
address algorithmic bias in students’ forum post classification, S6 (Sha et al., 2021)
explored the viability of equal sampling for observed demographic groups (gender
and first-language background) in the model training process. The research trained
six classifiers with two different training data samples, the original training sample
(all posts after removing the testing data) and the equal training sample (equal num-
ber of posts randomly selected for each demographic group). To ensure comparable
results, the same testing data was used for evaluation. The study found that most
classifiers became fairer for the demographic groups when using equal sampling.
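A common scikit-learn recipe for the inverse-class-frequency weighting used by S4 and S9 is sketched below; the classifier choice and the toy label distribution are illustrative assumptions rather than the studies' actual pipelines.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical imbalanced training data: 900 negatives, 100 positives.
y_train = np.array([0] * 900 + [1] * 100)
X_train = np.random.default_rng(0).normal(size=(1000, 5))

# Weights inversely proportional to class frequency, as described for S4 and S9.
weights = compute_sample_weight(class_weight="balanced", y=y_train)
print(weights[y_train == 0][0], weights[y_train == 1][0])  # ~0.56 vs ~5.0

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)
```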
Meanwhile, to mitigate gender bias in a career recommender system, S3 (Wang
et al., 2022) used a vector projection-based bias attenuation method. As noted by the study, traditional word embeddings, trained on large corpora, often inherit racial and gender biases. For instance, vector arithmetic on embeddings can produce biased analogies, such as “doctor − man + woman = nurse”. The user embeddings in the
career recommender system faced a similar issue. Therefore, the study introduced
a debiasing step: given the embedding of a user, pu, and a unit vector vB represent-
ing the global gender bias in the same embedding space, the study debiased pu by
removing its projection on the gender bias vector vB. According to S3, this bias
attenuation method is beyond a simple “fairness through unawareness” technique, as
it systematically removes bias related to both the sensitive feature (gender) and the
proxy features.
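The projection step itself is a one-line operation; a sketch is given below. Estimating the gender-bias direction v_B from data is a separate task not covered here, and the function name is our own.

```python
import numpy as np

def debias_embedding(p_u, v_b):
    """Remove the component of a user embedding p_u that lies along the
    gender-bias direction v_b, as described for S3."""
    v_b = v_b / np.linalg.norm(v_b)        # ensure the bias direction is a unit vector
    return p_u - np.dot(p_u, v_b) * v_b    # p_u minus its projection onto v_b
```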
Noting that the decision to include or exclude sensitive features as predictors in
models is contentious in literature, S4 (Yu et al., 2021) set out to address the chal-
lenge by exploring the implication of including or excluding protected attributes in
student dropout prediction. The study compared BLIND (fairness through unaware-
ness) and AWARE models (fairness through awareness). Fairness levels did not sub-
stantially differ between the models, regardless of whether the protected attributes
were included or excluded (S4).
To minimize racial bias in student grade prediction, S5 (Jiang & Pardos, 2021)
trialled several fairness strategies. The first fairness strategy used is “fairness by una-
wareness” (similar to S4) which was made the default (baseline). At the data con-
struction stage, S5 assigned weights to different race and grade labels to address data
imbalance. At the model training stage, the study used adversarial learning strategy.
Additional fairness strategies tried include the inference strategy and “alone strat-
egy”. The inference strategy involved adding sensitive feature (race) to the input for
training and removing it for prediction while the alone strategy involved training the
model separately on each race group. According to S5, using a weighted loss function to balance race reduced the TPR, TNR, and accuracy for all race groups except Pacific Islanders. For all three metrics used, no single strategy was always the best; however, the inference strategy most frequently recorded the best performance while also being the worst in terms of fairness. Meanwhile, the adversarial learning strategy achieved the fairest results across all metrics.
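The adversarial strategy can be illustrated with a generic gradient-reversal setup, in which an auxiliary head tries to recover the protected attribute from the learned representation while the encoder is trained to defeat it. The PyTorch sketch below shows only this mechanism; S5's actual model is an LSTM grade predictor, and the class names, layer sizes and loss weighting here are illustrative assumptions.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients on the backward
    pass, so minimising the adversary's loss pushes the encoder to hide race."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AdversarialDebiaser(nn.Module):
    def __init__(self, n_features, n_grades, n_groups, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.predictor = nn.Linear(64, n_grades)   # main task: grade prediction
        self.adversary = nn.Linear(64, n_groups)   # tries to recover race from the representation

    def forward(self, x):
        h = self.encoder(x)
        return self.predictor(h), self.adversary(GradReverse.apply(h, self.lam))

def training_step(model, x, grade, race, optimiser):
    # One joint loss: task cross-entropy plus adversary cross-entropy. Because of
    # the gradient reversal, this trains the adversary to detect race while
    # training the encoder to remove race information from its representation.
    y_hat, a_hat = model(x)
    loss = nn.functional.cross_entropy(y_hat, grade) + nn.functional.cross_entropy(a_hat, race)
    optimiser.zero_grad(); loss.backward(); optimiser.step()
    return loss.item()
```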
S8 and S12 took unique approaches as they each proposed a new framework for bias
mitigation. Specifically, S12 (Kusner et al., 2017) proposed counterfactual fairness
as discussed in “Research Objective 1: Measuring the Fairness of Education Algo-
rithms” section. On the other hand, S8 (Hu & Rangwala, 2020), developed the mul-
tiple cooperative classifier model (MCCM) to improve individual fairness in student
at-risk prediction. MCCM consists of two classifiers, each corresponding to a spe-
cific sensitive attribute (e.g., male or female). When given an individual’s feature
vector xi , both classifiers receive the input. The output of the classifier correspond-
ing to the individual’s sensitive attribute si provides the probability of being positive,
while the output of the other classifier offers the probability of being positive if the
individual’s sensitive attribute were different (i.e. 1 − si ). The difference between the
two outputs is measured using KL-divergence. Based on the assumption of metric
free individual fairness, a prediction is fair when the difference between the two clas-
sifiers is negligible. To improve fairness, the research included a term that represents
the KL-divergence in the model’s objective function as fairness constraint. By mini-
mizing this difference during the model training process and controlling the trade-
off between accuracy and fairness with a hyperparameter λ, the model promotes
fairness between different sensitive attributes. S8 compared the proposed MCCM to
baseline models such as individually fair algorithms like Rawlsian Fairness, Learn-
ing Fair Representation, and Adversarial Learned Fair Representation as well as an
algorithm with no fairness constraint, Logistic Regression. The proposed MCCM performed best at mitigating gender bias in the student at-risk prediction case study. Though
the model was designed to improve individual fairness, it also achieved group fair-
ness, underscoring the high correlation between both. Meanwhile, the LR model
was highly biased as there was no fairness constraint imposed on it (S8).
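A simplified reading of that objective can be sketched as follows. The two linear classifiers, the Bernoulli form of the KL term and the single λ-weighted loss are our own simplifications of the description above, not the authors' implementation.

```python
import torch
from torch import nn

class MCCMSketch(nn.Module):
    """Two cooperating classifiers, one per value of a binary sensitive attribute."""
    def __init__(self, n_features):
        super().__init__()
        self.clf = nn.ModuleList([nn.Linear(n_features, 1) for _ in range(2)])

    def forward(self, x):
        # Probability of the positive class from each attribute-specific classifier.
        return [torch.sigmoid(c(x)).squeeze(-1) for c in self.clf]

def mccm_loss(model, x, y, s, lam=1.0):
    p0, p1 = model(x)                      # predictions under s = 0 and s = 1
    p_own = torch.where(s == 1, p1, p0)    # classifier matching each student's attribute
    p_other = torch.where(s == 1, p0, p1)  # classifier for the counter-attribute
    task = nn.functional.binary_cross_entropy(p_own, y.float())
    # KL divergence between the two Bernoulli outputs: a small KL means the
    # prediction barely depends on the sensitive attribute (individual fairness).
    eps = 1e-7
    kl = (p_own * torch.log((p_own + eps) / (p_other + eps))
          + (1 - p_own) * torch.log((1 - p_own + eps) / (1 - p_other + eps))).mean()
    return task + lam * kl                 # lambda trades accuracy against fairness
```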
In its own research, S9 (Lee & Kizilcec, 2020) sought to mitigate racial and
gender bias in a Random Forest model predicting student performance by adjusting
threshold values for each demographic sub-group. The study noted that optimizing
the model to satisfy equality of opportunity perpetuates unfairness in terms of posi-
tive predictive parity and demographic parity emphasizing that it is not possible to
satisfy all notions of fairness simultaneously.
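Post-hoc threshold adjustment of this kind can be sketched as follows; the target TPR and the quantile-based search are illustrative assumptions rather than S9's exact procedure.

```python
import numpy as np

def group_thresholds_for_equal_opportunity(y_true, y_score, group, target_tpr=0.8):
    """Pick a separate decision threshold per demographic group so that every
    group reaches (approximately) the same true positive rate, i.e. equality of
    opportunity via post-hoc threshold adjustment."""
    y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
    thresholds = {}
    for g in np.unique(group):
        pos_scores = y_score[(group == g) & (y_true == 1)]
        # Classifying scores >= this quantile as positive recovers ~target_tpr
        # of the group's true positives.
        thresholds[g] = np.quantile(pos_scores, 1 - target_tpr)
    return thresholds
```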
The variety of fairness strategies discussed in the reviewed studies, such as class
balancing techniques and adjusting sample weights (S2, S4, S6 & S9), bias attenua-
tion methods (S3), fairness through awareness/unawareness (S4 & S5), adversarial
learning (S5), multiple cooperative classifier models (S8) and counterfactual fair-
ness (S12), emphasizes the need for a context-specific approach. By considering the
unique characteristics and desired outcomes of education algorithms, researchers
can ensure that they serve students in an equitable manner, applying the most suit-
able strategy for each specific scenario.
S2, S6, S7, & S8 revealed that it is critical to consider the fairness of data and fea-
tures used in educational algorithms before delving into algorithmic fairness itself.
Meanwhile, limited work has been done in evaluating bias in data and feature engi-
neering (S2). As noted by S7 (Yu et al., 2020) and S8 (Hu & Rangwala, 2020),
biased data and features can contribute to the unfairness of predictive algorithms,
as they may reflect historical prejudices, demographic inequalities, and cultural
stereotypes. According to S2 (Sha et al., 2022), bias is inherent in most real-world
education datasets. All three datasets, the Moodle dataset from Monash University, the XueTangX KDDCUP (2015) dataset, and the Open University Learning Analytics (OULA) dataset (Kuzilek et al., 2017), exhibited distribution and hardness bias. 60% of forum posts in the Moodle dataset were authored by female students, while over 68% of the KDDCUP records came from male students. OULA had the lowest distribution bias, with only a 9% difference in male and female sample sizes. However, S2
noted that using class-balancing techniques reduced the hardness and distribution
bias of the datasets.
Another eligible paper, S6 (Sha et al., 2021), pointed out that extracting fairer
features and evaluating feature fairness can prevent algorithms from receiving dis-
criminatory inputs, especially as this information is difficult to detect later in model
implementation.
To examine fairness as an attribute of data sources rather than algorithms, S7
(Yu et al., 2020) evaluated the fairness impact of features from three data sources:
institutional data (female, transfer, low income, first-gen, URM, SAT total score and
High school GPA were extracted as features), LMS data (total clicks, total clicks
by category, total time, and total time by category), and survey data (effort regula-
tion, time management, environment management, and self-efficacy). According to
the study, combining institutional and LMS data led to the most accurate prediction of college success. On the other hand, features from the institutional data led to the most discriminatory model behavior. Meanwhile, features from the survey data recorded the lowest accuracy while also showing high bias.
The trade-offs between fairness and accuracy in educational algorithms are a criti-
cal aspect of algorithm design (Pessach & Shmueli, 2022). These trade-offs have
been investigated in various studies, however, papers (S1, S2, S3, S4 & S6) in this
SLR found no strict trade-off between model accuracy and fairness. Starting with S1
(Gardner et al., 2019), the study did not notice a strict tradeoff between the perfor-
mance and discrimination of its MOOC dropout models. Also, S2 (Sha et al., 2022)
found that fairness improvements were complementary with accuracy improvements
when applying over-sampling class balancing techniques. Meanwhile, S3 (Wang
et al., 2022) presented an interesting case. A debiased career recommender system
was found to be more accurate and fairer than its original biased version when eval-
uated using the measures of machine learning accuracy and fairness. However, when
evaluated through an online user study of more than 200 college students, partici-
pants preferred the original biased system over the debiased system (S3). The find-
ings of this study highlight that fair algorithms may not meet the expectations of end users precisely because they decline to confirm existing human biases. In its research, S4 (Yu et al.,
2021) found that including or excluding protected attributes in dropout prediction
models did not significantly affect performance metrics like accuracy, recall, and
TNR, neither did it lead to different levels of fairness. Similarly, in a study compar-
ing machine learning and deep learning models in delivering fair predictions, S6
(Sha et al., 2021) found that utilizing techniques such as equal sampling can help
reduce model unfairness without sacrificing classification performance.
Meanwhile, a study by S5 (Jiang & Pardos, 2021) which tried different fair-
ness strategies revealed that adversarial learning achieved the best group fair-
ness without sacrificing much with respect to performance metrics like TPR,
TNR, and accuracy. However, another strategy that included the sensitive attribute (race) most frequently scored the highest on the performance metrics but was also the worst in terms of fairness, indicating the existence of a fairness-accuracy tradeoff.
Recommendations

1. Prioritize Fairness of Data and Features: As indicated by S2, S6, S7, & S8,
it is critical to assess the fairness of data and features used in educational algo-
rithms before delving into algorithmic fairness itself. This is essential to prevent
algorithms from receiving discriminatory inputs.
2. Broaden the scope of education algorithmic fairness studies by incorporating
demographic attributes such as socioeconomic status and disability when analys-
ing sensitive features. This SLR revealed that most papers in the existing literature
focus primarily on gender and race as sensitive features. Specifically, ten papers
(S1, S2, S3, S4, S6, S7, S8, S9, S10, & S12) out of the 12 papers in this review
used gender as a sensitive feature while seven papers (S4, S5, S7, S8, S9, S10, &
S12) used race as a sensitive feature. In contrast, only one paper each examined
fairness related to native language (S6) and disability (S11). Notably, none of the
eligible studies in this SLR examined military-connected status or age.
3. Consider End User’s Perspective: As reported by S3, the expectations of end
users may not align with algorithmic fairness measures. Therefore, investigating
user preferences can help bridge the gap between machine learning fairness and
human perceptions.
4. Harmonize Fairness with Accuracy: When exploring novel debiasing proce-
dures for algorithms, focus on taking advantage of the complementary nature of
fairness and accuracy, rather than compromising one for another as papers (S1,
S2, S3, S4 & S6) in our review indicate there is no strict tradeoff between the two.
Appendix
Table 6 Data extracted from eligible papers

S1 Gardner et al. (2019)
Study design: Proposed ABROCA, a new method for unfairness measurement in predictive student models. ABROCA measures the absolute value of the area between the baseline group ROC curve ROC_b (e.g., male) and the comparison group ROC curve ROC_c (e.g., female). Replication study.
Dataset: Raw data from 44 MOOCs; 3,080,230 students in the training dataset and 1,111,591 students in the test set.
Methods: Slicing analysis; feature extraction; hyperparameter optimization; 5 ML models for the experiment.
Evaluation metrics: ABROCA = ∫₀¹ |ROC_b − ROC_c| dt for fairness measurement; AUC-ROC for accuracy.
Results: Mean ABROCA for the 44 MOOCs: LSTM = 0.0351, LR = 0.0353, SVM = 0.0311, Naïve Bayes = 0.0177, CART = 0.0280. Mean AUC: LSTM = 0.653, LR = 0.669, SVM = 0.693, Naïve Bayes = 0.558, CART = 0.715.
Conclusions: ABROCA offers more insights into model discrimination than equalized odds. Model discrimination is related to gender and course curricular area. No strict tradeoff between model performance and discrimination.
Limitations: Slicing was done only along gender as a sensitive feature. Focused only on measuring fairness, nothing on mitigation.

S2 Sha et al. (2022)
Study design: Investigated the impact of 11 class balancing techniques on the fairness of three representative prediction tasks: forum post classification, student dropout prediction and student performance prediction.
Dataset: Three datasets used: the Moodle dataset (3,703 forum discussion posts by Monash University students); the KDDCUP (2015) dataset with 39 courses and 180,785 students from the online learning platform XueTangX; and the Open University Learning Analytics (OULA) dataset (Kuzilek et al., 2017).
Methods: 22 features extracted for KDDCUP dropout prediction; CFIN model (i.e., DNN + CNN). 112 features extracted for OULA performance prediction; GridSearch and a Random Forest model implemented.
Evaluation metrics: Distribution and hardness bias to measure data bias; AUC for predictive accuracy; ABROCA for predictive fairness.
Results: Best results: hardness bias of 0.001 for the forum post task, 0.014 for KDDCUP, and 0.01 for OULA. AUC of 0.844, 0.89, and 0.802 respectively for the forum post, KDDCUP, and OULA tasks.
Conclusions: 8 out of the 11 CBTs improved fairness in at least two out of the three tasks. Over-sampling techniques improved both fairness and accuracy. No tradeoff between fairness and accuracy.
Limitations: Addressed only the data aspect of algorithmic bias, but bias may also occur during annotation of a training sample, feature engineering, and model training. Focused only on student gender groups; need to explore other demographic attributes.

S3 Wang et al. (2022)
Study design: Studied human-fair-AI interaction: (1) case study, developed a debiased recommender system that mitigates gender bias in college major recommendations; (2) conducted an online user study with over 200 student participants.
Dataset: Dataset with information from 16,619 anonymised Facebook users (60% female and 40% male), 1,380 unique college majors, 143,303 unique items that a user can like on Facebook and 3.5 million+ user-item pairs. Online user study with 202 participants: 48% female and 49.5% male; only 18.3% open to career suggestions.
Methods: Constructed user embeddings via a Neural Collaborative Filtering (NCF) model for predicting the items a user “likes”. Then trained a Logistic Regression classifier to predict top college majors using the embedding as input features. Vector projection-based bias attenuation method to gender-debias the recommendations. Train-test split: 70/30%.
Evaluation metrics: Normalized discounted cumulative gain at K (NDCG@K); higher NDCG@K means better accuracy. Non-parity unfairness (UPAR); lower UPAR means a fairer system. T-test for the user acceptance study.
Results: Gender-debiased: NDCG@3 = 0.005, NDCG@10 = 0.01, NDCG@20 = 0.013, UPAR = 1.1188, mean t-test acceptance score = 0.279. Gender-aware: NDCG@3 = 0.0009, NDCG@10 = 0.005, NDCG@20 = 0.007, UPAR = 1.1445, mean t-test acceptance score = 0.372.
Conclusions: ML evaluation showed that the gender-debiased recommender is fairer and more accurate than the gender-aware recommender, but the online user study showed that participants prefer the biased system. For ML evaluation, the gender-debiased recommender is fairer without any loss of prediction accuracy.
Limitations: 80% of the participants are not open to new career suggestions as most are university students. Focused only on applicants, but career-related bias also exists among college admission decision makers.

S4 Yu et al. (2021)
Study design: Investigated whether sensitive features should be included in the model. Case study: college drop-out prediction with or without four sensitive features (gender, URM, first-generation student and financial need).
Dataset: Sample contains 81,858 online course-taking records for 24,198 unique students and 874 unique courses, and 564,104 residential course-taking records for 93,457 unique students and 2,877 unique courses.
Methods: Extracted 58 features. The AWARE experiment included all features while the BLIND experiment excluded four protected attributes. LR and Gradient Boosting were implemented. Train-test split done by cohort.
Evaluation metrics: Accuracy, Recall, True Negative Rate (TNR). Fairness measured via group difference in performance metrics.
Results: GBT: Accuracy, Recall and TNR slightly higher, by 0.2, 0.1, and 0.1%, in the AWARE model compared to BLIND for online students; no difference for residential students. LR: Accuracy and Recall …
Conclusions: Including or excluding protected attributes does not generally affect fairness in terms of any metric in any enrollment format. Performance does not differ significantly between the BLIND and AWARE models.
Limitations: Did not investigate the impact of counterfactual protected attributes on fairness.

S5 Jiang and Pardos (2021)
Study design: Trialled different fairness strategies for a grade prediction algorithm with respect to race. Fairness strategies were applied at three stages of the model pipeline: data construction, model training, and inference (prediction).
Dataset: Anonymised students' course enrolment data from UC Berkeley: 82,309 undergraduates with a total of 1.97 million enrollments and 10,430 unique courses.
Methods: LSTM model implemented. Fairness strategies: (1) fairness through unawareness as baseline; (2) data construction stage: added race and other features, assigned weights to balance race and grade labels; (3) model training stage: adversarial learning; (4) prediction stage: inference strategy, i.e., remove the sensitive feature (race) for prediction. Training, validation, and test data are in the ratio 13:1:1.
Evaluation metrics: Accuracy, True Positive Rate, and True Negative Rate reported according to equity of opportunity and equity of odds.
Results: The adversarial learning strategy has the best group fairness, reported as follows: TPR range = 9.48 and STD = 3.8; TNR range = 8.76 and STD = 3.45; accuracy range = 1.5 and STD = 0.62. NB: if group fairness were fully attained, range and STD would be 0.
Conclusions: Using a weighted loss function to balance race reduced the TPR, TNR, and accuracy for all race groups except Pacific Islanders. The inference strategy recorded the best performance, but it is the worst in terms of fairness. The adversarial learning strategy achieved the fairest results for all metrics.
Limitations: Focused on race; did not analyze bias that may be introduced by other sensitive features like gender and parental income.

S6 Sha et al. (2021)
Study design: Assessed algorithmic fairness in the classification of forum posts. Fairness assessed along demographic groups (sex and English as first or second language).
Dataset: 3,703 randomly-selected discussion posts made by Monash University students, including 2,339 (63%) content-relevant posts and 1,364 (37%) content-irrelevant posts. Dataset is publicly available.
Methods: Evaluated four traditional ML models (SVM, Random Forest, Logistic Regression, and Naïve Bayes) and two DL models (Bi-LSTM and CNN-LSTM). Text embeddings generated using the tool Bert-as-service. Classifiers trained on both the original training sample and an equal (balanced) training sample.
Evaluation metrics: Accuracy, Cohen's κ, AUC, and F1 score for performance evaluation. ABROCA for fairness evaluation.
Results: CNN-LSTM has the highest performance: 0.795 AUC for the original sample; 0.792 and 0.802 AUC for the balanced samples along gender and first language respectively. Best fairness in the original sample: ABROCA …
Conclusions: Traditional ML models provided fairer results while DL models yielded better performance. A strict performance-for-fairness trade-off is not evident. Equal sampling across demographic groups reduced model unfairness.
Limitations: Focused only on two demographic groups: gender and first language.

S7 Yu et al. (2020)
Study design: Fairness and performance evaluation of the utility of different data sources in predicting short- and long-term student success. Data sources examined include LMS data, institutional data and survey data.
Dataset: Students enrolled in 10 online, introductory STEM courses taught from 2016 to 2018 at a large, public research US university. The original dataset has 2,244 students; after data cleaning, the final sample size was 2,093.
Methods: Institutional features included student demographics and academic achievement prior to college. Click features were derived from the LMS data. Survey features included four constructs of self-regulated learning skills and self-efficacy from pre-course surveys. SVM, LR, and RF models were implemented.
Evaluation metrics: Accuracy, false positive rate (FPR), false negative rate (FNR). Fairness measured via performance parity.
Results: The combination of institutional and LMS data gave the best accuracy (∆ = 0.052, p < 0.001 for short term and ∆ = 0.037, p = 0.014 for long term).
Conclusions: Survey data is neither fair nor accurate. Inclusion of institutional features seemed to produce the most unfair outcome. All data sources exhibit bias. Institutional data consistently underestimates historically disadvantaged student groups, while LMS data overestimate some of these groups more often.
Limitations: Scope of the feature sets was limited and not representative of the full potential of different data sources.

S8 Hu and Rangwala (2020)
Study design: Proposed the Multiple Cooperative Classifier Model (MCCM) to improve individual fairness in at-risk student prediction.
Dataset: Ten years of data at George Mason University, from Fall 2009 to Fall 2019, covering the top five majors. A course was chosen only if at least 300 students had taken it.
Methods: Created a model, MCCM, which is composed of two classifiers, each of which corresponds to a sensitive attribute (e.g., male or female). Used Rawlsian Fairness, LR, Learning Fair Representation, and Adversarial Learned Fair Representation as baselines.
Evaluation metrics: Accuracy, Discrimination and Consistency.
Results: The proposed MCCM recorded the best performance in mitigating bias in terms of discrimination.
Conclusions: Though designed for individual fairness, the proposed MCCM achieved both group fairness and individual fairness, because group and individual fairness are highly correlated.
Limitations: Focused only on gender and race groups.

S9 Lee and Kizilcec (2020)
Study design: Predicted student success based on university administrative records. Applied post-hoc adjustments to improve model fairness.
Dataset: Student-level administrative data spanning Fall 2014 to Spring 2019, obtained from a US research university. The final processed data has 5,443 instances and 56 features. Duplicates, missing course grades, courses taken multiple times, and grades other than letter (A-F) or pass/fail grades were removed.
Methods: Random Forest model implemented. Weighted each instance with the inverse of its label proportion to achieve label balance in the training set. An indicator value of -999 was imputed for missing standardized test scores. To improve fairness, different threshold values for each subgroup were set such that equality of opportunity is achieved on the testing set.
Evaluation metrics: Equality of opportunity, demographic parity and positive predictive parity.
Results: Overall accuracy of the model on the test data is 0.73 with an F-score of 0.80. After correction for equality of opportunity via threshold adjustment for each group, subgroup accuracy remains similar for both groups.
Conclusions: Optimizing the model to satisfy equality of opportunity perpetuates unfairness in terms of demographic parity and positive predictive parity for both gender and racial-ethnic groups. The model exhibits gender and racial bias in two out of the three fairness measures considered.
Limitations: Analysis focused on two binary protected attributes, racial-ethnicity and gender, alone.

S10 Anderson et al. (2019)
Study design: Assessed the equitability and optimality of graduation prediction models.
Dataset: Data from a public, US R1 research university. Includes 14,706 first-time-in-college undergrads admitted in Fall semesters between 2006–2012 (inclusive).
Methods: Feature set covers academic performance, financial information, pre-admission information, and extra-curricular activities. Linear-kernel SVM, Decision Trees, Random Forests, Logistic Regressions, and a Stochastic Gradient Descent classifier were implemented. Model parameters selected via 5-fold CV grid search. Train-test split: 80:20%.
Evaluation metrics: Equity of models measured via false positive and false negative predictions. AUC-ROC for performance.
Results: Decision Tree = 0.798, SVM = 0.805, Logistic Regression = 0.807, Random Forest = 0.800, SGD = 0.814.
Conclusions: No model, or population, saw a meaningful change in per-group performance when trained only on one population. Models are not perfectly equitable.
Limitations: Need to investigate if the model implementation affects student outcomes, and whether the slight unfairness observed translates to real-world differences.

S11 Loukina et al. (2019)
Study design: Explored multiple dimensions of fairness. Case study: simulated and real data to assess the impact of English language proficiency test-takers' native language on automated scores.
Dataset: Created a corpus with a uniform distribution of native languages by randomly sampling a similar number of test-takers for each version of the test. The final corpus included 26,710 responses from 4,452 test-takers (742 for each language).
Methods: Four different simulated models implemented: a random model, a perfect model, an almost perfect model, and a metadata-based model. Train-test split: 75:25%.
Evaluation metrics: Three dimensions of fairness: overall score accuracy; overall score difference; and conditional score difference.
Results: The speakers of all languages were disadvantaged by the META model. GER speakers were underscored while JPN speakers were overscored by the perfect model. On actual models, the model trained separately for each native language is most fair in terms of overall score differences.
Conclusions: Total fairness may not be achievable, and the different definitions of fairness may require different solutions. None of the three definitions of fairness is, in principle, more important than another.
Limitations: Considered human scores to be the true 'gold standard' measure of language proficiency; however, human scores are likely to contain a certain amount of error and bias.

S12 Kusner et al. (2017)
Study design: Developed a framework for modeling fairness using tools from causal inference.
Dataset: Survey data covering 163 law schools and 21,790 law students in the United States, with information on LSAT score, GPA, etc., provided by the Law School Admission Council.
Methods: Counterfactual fairness, defined as: a decision to an individual is fair if it is the same in the actual world and in a counterfactual world where the individual belonged to a different demographic group.
Evaluation metrics: RMSE for accuracy; counterfactual fairness for fairness.
Results: Full and unaware models are fair w.r.t. sex as they have a very weak causal link between sex and GPA/LSAT.
Conclusions: The proposed counterfactual fairness framework takes protected attributes into account in modelling.
Limitations: Implemented only for one use case.
Funding The author gratefully acknowledges support from Chevening Scholarship, the UK govern-
ment’s global scholarship programme, funded by the Foreign, Commonwealth, and Development Office
(FCDO).
Declarations
Competing Interests The author has no relevant financial or non-financial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permis-
sion directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/
licenses/by/4.0/.
References
Anderson, H., Boodhwani, A., & Baker, R. S. (2019). Assessing the Fairness of Graduation Predictions.
In EDM.
Baker, R. S., & Hawn, A. (2021). Algorithmic bias in education. International Journal of Artificial Intelligence in Education, 32. https://doi.org/10.1007/s40593-021-00285-9
Corbett-Davies, S., Gaebler, J., Nilforoshan, H., Shroff, R., & Goel, S. (2023). The measure and mismeas-
ure of fairness. Journal of Machine Learning Research, 24(312), 1–117.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. In Pro-
ceedings of the 3rd innovations in theoretical computer science conference (pp. 214–226).
Gajane, P., & Pechenizkiy, M. (2017). On formalizing fairness in prediction with machine learning. arXiv
preprint arXiv:1710.03184. Retrieved from https://arxiv.org/abs/1710.03184
Gardner, J., Brooks, C., & Baker, R. (2019). Evaluating the Fairness of Predictive Student models
through Slicing Analysis. Proceedings of the 9th International Conference on Learning Analytics &
Knowledge, 225–234. https://doi.org/10.1145/3303772.3303791.
Gedrimiene, E., Celik, I., Mäkitalo, K., & Muukkonen, H. (2023). Transparency and trustworthiness in
user intentions to follow career recommendations from a learning analytics tool. Journal of Learn-
ing Analytics, 10(1), 54–70.
Gusenbauer, M., & Haddaway, N. R. (2020). Which academic search systems are suitable for systematic
reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other
resources. Research Synthesis Methods, 11(2), 181–217.
Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. In Proceed-
ings of the 30th International Conference on Neural Information Processing Systems (NIPS’16) (pp.
3323-3331). Curran Associates Inc.
Hu, Q., & Rangwala, H. (2020). Towards fair educational data mining: A case study on detecting at-risk
students. International Educational Data Mining Society.
Jiang, W., & Pardos, Z. A. (2021). Towards equity and algorithmic fairness in student grade prediction. In
Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (pp. 608–617).
Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017).Counterfactual fairness. In Proceedings of the
31st International Conference on Neural Information Processing Systems (NIPS’17) (pp. 4069-
4079). Curran Associates Inc.
Kuzilek, J., Hlosta, M., & Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4, 170171. https://doi.org/10.1038/sdata.2017.171
Lee, H., & Kizilcec, R. F. (2020). Evaluation of fairness trade-offs in predicting student success. arXiv
preprint arXiv:2007.00088. Retrieved from https://arxiv.org/abs/2007.00088
Loukina, A., & Buzick, H. (2017). Use of automated scoring in spoken language assessments for test tak-
ers with speech impairments. ETS Research Report Series, 2017(1), 1–10.
Loukina, A., Madnani, N., & Zechner, K. (2019). The many dimensions of algorithmic fairness in educa-
tional applications. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Build-
ing Educational Applications (pp. 1–10).
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., & Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. International Journal of Surgery, 88.
Patterson, C., York, E., Maxham, D., Molina, R., & Mabrey, P. (2023). Applying a responsible innovation framework in developing an equitable early alert system: A case study. Journal of Learning Analytics, 10(1), 24–36.
Pessach, D., & Shmueli, E. (2022). A review on fairness in machine learning. ACM Computing Surveys
(CSUR), 55(3), 1–44.
Rets, I., Herodotou, C., & Gillespie, A. (2023). Six practical recommendations enabling ethical use of
predictive learning analytics in distance education. Journal of Learning Analytics, 10(1), 149–167.
https://doi.org/10.18608/jla.2023.7743
Sha, L., Raković, M., Das, A., Gašević, D., & Chen, G. (2022). Leveraging class balancing techniques to
alleviate algorithmic bias for predictive tasks in education. IEEE Transactions on Learning Tech-
nologies, 15(4), 481–492.
Sha, L., Rakovic, M., Whitelock-Wainwright, A., Carroll, D., Yew, V. M., Gasevic, D., & Chen, G.
(2021). Assessing algorithmic fairness in automatic classifiers of educational forum posts. In Arti-
ficial Intelligence in Education: 22nd International Conference, AIED 2021, Utrecht, The Nether-
lands, June 14–18, 2021, Proceedings, Part I 22 (pp. 381–394). Springer International Publishing.
Stilgoe, J., Owen, R., & Macnaghten, P. (2013). Developing a framework for responsible innovation.
Research Policy, 42(9), 1568–1580. https://doi.org/10.1016/j.respol.2013.05.008
Suresh, H., & Guttag, J. (2021). A framework for understanding sources of harm throughout the machine
learning life cycle. In Equity and Access in Algorithms, Mechanisms, and Optimization (pp. 1–9).
Wang, C., Wang, K., Bian, A., Islam, R., Keya, K. N., Foulds, J., & Pan, S. (2022). Do Humans Prefer
Debiased AI Algorithms? A Case Study in Career Recommendation. In 27th International Confer-
ence on Intelligent User Interfaces (pp. 134–147).
Yu, R., Lee, H., & Kizilcec, R. F. (2021). Should college dropout prediction models include protected
attributes? In Proceedings of the Eighth ACM Conference on Learning@ Scale (pp. 91–100).
Yu, R., Li, Q., Fischer, C., Doroudi, S., & Xu, D. (2020). Towards Accurate and Fair Prediction of Col-
lege Success: Evaluating Different Sources of Student Data. Proceedings of The 13th International
Conference on Educational Data Mining (EDM 2020), 292–301.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.