ables. The results showed some fairness problems, which were exacerbated in the studies where the AES was additionally trained only on a homogeneous group of students.

3 Data

The DARIUS corpus is a collection of 4,589 annotated argumentative texts written by 1,839 students from German high schools, spread across 114 classes in 33 different schools (Schaller et al., 2024). Essays that were off-topic, shorter than two sentences, empty, or contained names or other data relevant to data protection were removed beforehand. The final dataset consists of essays from two writing assignments focused on socio-scientific issues on the topics energy and automotive, containing 2,307 and 2,282 essays respectively. Students wrote a draft and a revision on one task, followed by an essay on the other task, resulting in up to 3 essays per student. An example text is listed in the Appendix. Students also provided demographic data voluntarily, a selection of which is listed in Table 1.

The dataset has been extensively annotated with information about argumentative structure on different levels of granularity. In the present study, we focus specifically on a subset of these annotations, namely content zone, major claim, position, and warrant. Out of the nine original annotation categories, we selected these because they reflect different parts of an argumentative text, e.g. structure and content, and are annotated on different levels of granularity (token level to whole texts). We used the demographic data to measure fairness with respect to gender, profile, school, cognitive ability (KFT), and languages, which are further explained after providing more details on the annotations in Section 3.1. A more extensive description can be found in the original paper (Schaller et al., 2024).

3.1 Annotations

Content zone: This annotation category breaks down the essays into their basic parts: the introduction, the body, and the conclusion. Each section can be as short as one sentence or span several sentences.

Major claim annotation: Central to the argumentative essence of the essays, the Major Claim annotation identifies the pivotal stance taken by the author on the discussed issue. In contrast to similar annotation efforts (Stab and Gurevych, 2014), we also include claims written not only in the opening paragraphs but also within the conclusion, offering a comprehensive view of the argumentative intent. Such claims form the basis for the author's further arguments and the direction of their reasoning.

Position annotation: This annotation extracts the essay's directional stance regarding the thematic issues presented in the writing tasks: whether the argumentation aligns with, diverges from, or remains ambiguous towards the positions debated within the tasks. This annotation is important for understanding the diversity of viewpoints and the critical engagement of students with the socio-scientific topics at hand.

Warrant annotation: A warrant is one out of five argumentative elements annotated in the dataset as part of the Toulmin's Argumentation Pattern (TAP) annotations, following the definitions by Riemeier et al. (2012). TAP describes a structural framework for constructing logical and compelling arguments by including a claim, providing supporting evidence (data), explaining the connection between the claim and data (warrant), and addressing counterarguments (rebuttal). For this study, we focus exemplarily on warrants because the use of warrants already indicates a higher level of argumentation skill (Osborne et al., 2016). TAP elements are not marked on the sentence level but on the token level, as a TAP sequence can cover a wide range from subordinate clauses to entire paragraphs.

3.2 Demographic and Psychological Data

We consider the following demographic variables:

Grade Grade indicates which grade level the student is in. The dataset was obtained for students between Grade 9 and Grade 12.

Gender The students could indicate their gender. Options were female, male, and diverse.

School The German school system differentiates between different forms of high school.

• Gemeinschaftsschule: non-academic track
• Gymnasium: academic track
• Berufsschule: vocational training

Profile The German school system allows students to choose a profile. The Natural Sciences profile, for example, has a focus on math and science, while the Social Sciences profile can have a focus on politics or ethics.

Languages The students could indicate the language that they speak at home.
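Because warrants (Section 3.1) are marked on the token level rather than the sentence level, the sequence-tagging view of the data can be sketched with a token-level BIO encoding. The BIO scheme and the toy sentence below are illustrative assumptions, not the corpus's exact format:

```python
# Sketch: a sub-sentence warrant span encoded as token-level BIO tags for
# sequence tagging. BIO is an illustrative choice, not necessarily the
# corpus's exact encoding.
tokens = ["Windparks", "liefern", "mehr", "Strom", ",", "weil", "der",
          "Wirkungsgrad", "hoeher", "ist"]
warrant_span = (5, 10)  # half-open token range covering "weil ... ist"

tags = ["O"] * len(tokens)
start, end = warrant_span
tags[start] = "B-WARRANT"
for i in range(start + 1, end):
    tags[i] = "I-WARRANT"
```

A span that starts or ends mid-sentence is unproblematic in this representation, which is why token-level tagging is needed for warrants but not for the sentence-aligned annotation types.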
Grade Level            Gender               Profile                       Language
Level    Students      Gender    Students   Profile            Students   Language     Students
9        423           Female    801        Natural Sciences   414        native       1265
10       346           Male      664        Social Sciences    255        non-native   576
11       547           Diverse   90         Sports             119
12       404           Missing   284        Linguistics        61
13       113                                Aesthetics         13
Missing  6                                  Missing            977
Table 1: Combined Overview: Grade Level, Gender, Profile, and Language of Students
KFT The Cognitive Abilities Test (Kognitiver Fähigkeitstest or KFT), developed by Heller and Perleth (2000), measures students' cognitive abilities through non-verbal figural analogies. These questions evaluate abstract reasoning and the ability to apply logical rules to visual information without linguistic content, making them useful for assessing individuals across different linguistic backgrounds. A typical problem displays a sequence of shapes that follow a certain transformation (e.g., rotation, reflection). The test-taker must identify and apply the same transformation to a new set of figures.

4 Method

In the following section, we describe the experimental setup for our evaluation study.

4.1 Classifiers

We experiment with a diverse set of classifiers to see performance and fairness differences between instances of different model architectures. Our machine learning goal is to predict certain spans in an essay text. For most of these spans, span boundaries align with sentence boundaries. Major claim annotations always consist of single sentences. The other annotation types, i.e. content zone and position annotations, may also span multiple sentences. Only warrant annotations do not necessarily align with sentence boundaries and can consist of segments on the sub-sentence level. Therefore, we make use of both sentence classification and sequence tagging approaches. For sentence classification, we use a Support Vector Machine (SVM) in standard configuration, provided by the scikit-learn Python package (Pedregosa et al., 2011), as an instance of shallow learning. The features utilized in the SVM classifier are the TF-IDF vectors of the most frequent 1- to 3-grams. We use a BERT-based sentence classifier (dbmdz/bert-base-german-cased) as an instance of deep learning and GPT-4 (OpenAI, 2024) to represent generative LLMs. For sequence tagging, we also use the BERT-based classifier and again prompt GPT-4, this time providing the whole essay as input.

4.2 Data Split

We use a fixed data split of 80% training data and 20% test data. From the training data, we used a subset of 60% as validation data to find the best epoch for deep learning and for prompt tuning for generative LLMs in pre-experiments; i.e., the whole training data set was used in the main experiments for training. As we were not interested in the overall best performance but rather in the intrinsic fairness differences between models, we did not further fine-tune any hyperparameters.

4.3 Performance and Fairness Evaluation

The evaluation of our classification results is motivated by the intended use of the classifiers to provide formative feedback to learners, e.g. in an online tutoring system. Although it might also be of interest to show the specific location of an argumentative element within a learner essay as feedback, our primary concern for this study is to determine whether certain argumentative elements are present in a text or not. Therefore, we first transform any classifier output into a binary decision on the document level indicating whether (at least one instance of) a certain argumentative element is present in an essay.

In our fairness evaluation, we follow the framework proposed by Loukina et al. (2019) and their implementation provided within the RSMTool software package (Madnani and Loukina, 2016). More precisely, we compute overall score accuracy (osa), overall score difference (osd), and conditional score difference (csd), where the first looks at squared errors (S − H)² and the latter two at actual errors S − H. In every case, a linear regression is fit with the error being the dependent variable and the
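The regression-based metrics can be illustrated with a small sketch. This is a simplified stand-in for the RSMTool implementation used in the paper: it reports the R² of a regression of the error on one-hot subgroup membership, omits csd's additional conditioning on the human score, and uses invented data.

```python
# Simplified sketch of osa/osd in the spirit of Loukina et al. (2019):
# regress prediction errors on subgroup membership and report how much
# error variance subgroup membership explains (R^2). Toy data; the paper
# uses RSMTool, whose csd additionally conditions on the human score.
import numpy as np
from sklearn.linear_model import LinearRegression

def group_r2(error, groups):
    """R^2 of a linear regression of the error on one-hot subgroup membership."""
    levels = sorted(set(groups))
    X = np.array([[float(g == lv) for lv in levels] for g in groups])
    return LinearRegression().fit(X, error).score(X, error)

S = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # system decisions (element present?)
H = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # human (gold) decisions
g = ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q4", "Q4"]  # e.g. KFT quartiles

err = S - H
osd = group_r2(err, g)       # overall score difference: actual errors S - H
osa = group_r2(err ** 2, g)  # overall score accuracy: squared errors (S - H)^2
```

A value near 0 means subgroup membership explains almost none of the error, which is the fair outcome the paper checks for.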
Label Model All Grades Gender Profile School Languages KFT
Introduction Shallow .63 [.35, .68] [.53, .67] [.58, .73] [.48, .68] [.60, .70] [.57, .67]
Deep .81 [.51, .85] [.76, .84] [.74, .83] [.69, .95] [.80, .85] [.75, .85]
LLM .60 [.50, .63] [.46, .62] [.55, .61] [.51, .77] [.59, .59] [.58, .61]
Conclusion Shallow .55 [.44, .71] [.50, .58] [.46, .55] [.46, .61] [.54, .55] [.52, .57]
Deep .70 [.64, .80] [.59, .74] [.63, .81] [.64, .78] [.64, .71] [.64, .78]
LLM .68 [.63, .76] [.68, .81] [.63, .67] [.58, .84] [.65, .68] [.61, .72]
Major Claim Shallow .68 [.62, .74] [.66, .74] [.49, .75] [.42, .81] [.66, .72] [.62, .72]
Deep .88 [.78, .92] [.87, .88] [.80, .95] [.81, .89] [.87, .88] [.84, .90]
LLM .75 [.68, .82] [.66, .81] [.63, .84] [.71, .91] [.71, .86] [.66, .86]
Position Shallow .41 [.34, .46] [.34, .53] [.16, .49] [.29, .56] [.36, .50] [.17, .58]
Deep .44 [.23, .56] [.36, .73] [.23, .61] [.28, .46] [.37, .59] [.27, .54]
LLM .32 [.13, .37] [.29, .54] [.29, .47] [.22, .60] [.31, .33] [.23, .37]
Warrant Shallow .43 [.32, .51] [.39, .51] [.38, .51] [.38, .47] [.39, .55] [.37, .52]
Deep .44 [.27, .53] [.38, .55] [.36, .68] [.36, .65] [.41, .52] [.25, .54]
LLM .00 [-.16, .09] [-.02, .32] [-.18, .02] [-.04, .14] [-.02, .07] [-.13, .08]
Table 2: Kappa values for the individual classifiers evaluated either on all test essays or on essays from a certain subgroup. We report the minimum and maximum values among the subgroups for each demographic variable.
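The shallow pipeline from Section 4.1 and the per-subgroup kappa evaluation behind Table 2 can be sketched as follows. The toy sentences, labels, subgroup assignment, and the `max_features` cut-off are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch: TF-IDF (1- to 3-grams) + SVM sentence classifier, scored
# with Cohen's kappa overall and per subgroup (as in Table 2). Toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_sents = [
    "zusammenfassend bin ich fuer windparks",
    "windparks liefern 40 gwh pro jahr",
    "abschliessend ueberwiegen die vorteile",
    "solarparks halten 30 jahre",
]
train_labels = [1, 0, 1, 0]  # 1 = conclusion sentence, 0 = other

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=10000),
    SVC(),  # "standard configuration"
)
clf.fit(train_sents, train_labels)

test_sents = [
    "zum schluss sollte man windparks foerdern",
    "der windpark stoesst weniger co2 aus",
    "abschliessend bin ich fuer windparks",
    "wasserkraftanlagen halten 80 jahre",
]
gold = [1, 0, 1, 0]
groups = ["Q1", "Q2", "Q1", "Q2"]  # e.g. KFT quartile of each essay's author

pred = clf.predict(test_sents)
overall = cohen_kappa_score(gold, pred)
per_group = {
    q: cohen_kappa_score(
        [y for y, g in zip(gold, groups) if g == q],
        [p for p, g in zip(pred, groups) if g == q],
    )
    for q in sorted(set(groups))
}
```

Evaluating the same predictions once overall and once per subgroup yields the "All" column and the subgroup ranges reported in Table 2.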
[Figure 1: Kappa per KFT subgroup (Q1-Q4) for models trained on low, high, and mixed KFT data; one panel per label; x-axis: KFT Subgroups, y-axis: Kappa.]
(low, high, and mixed) can perform best in different tasks, e.g. mixed deep in Introduction, high deep in Conclusion, or low shallow/mixed deep in Position. In terms of fairness, we still found no values above 0.1 (see Table 5).

When examining Figure 1, we can see that models differed in their performance when tested on different subgroups. For the Introduction, a shallow model trained on the data of the students in the highest KFT quartile (high shallow) performed better on the subgroup it was trained on (e.g. Quartile 4) than on the other subgroups, and the other way around (the low KFT model performed better on the subset with low KFT, e.g. Quartile 1). The mixed models had the lowest variance in performance.

There are exceptions in which a model performed better on a different subgroup than the one it was trained on; e.g., in (d) Position, all models except high shallow lost performance on Quartile 4. Furthermore, all combinations of algorithm and training data had comparably stable performance on (c) Major Claim.

In general, using training data from only one student group seemed to introduce a bias, disadvantaging other student groups. This finding underlines the need to include training data from a diverse range of students to ensure fairness and avoid skewed outcomes.

Label         Metric  KFT Model  Model    Grades  Gender  Profile  School  Language  KFT
Introduction  osa     high       Shallow   .011    .005   -.004     .003   -.001     .001
                                 Deep      .001    .003   -.002     .000   -.001     .001
                      low        Shallow   .001    .001    .007     .000    .008     .007
                                 Deep     -.003   -.001   -.003    -.002   -.001     .003
                      mixed      Shallow   .008   -.001   -.004     .003    .001     .001
                                 Deep     -.003   -.000   -.004    -.003   -.001     .000
              osd     high       Shallow   .003    .003   -.004     .000    .001    -.000
                                 Deep      .001   -.001    .005     .000   -.000     .001
                      low        Shallow   .002    .000    .003     .002   -.001     .006
                                 Deep     -.002    .007    .004     .002   -.001     .002
                      mixed      Shallow   .010   -.002    .000     .009   -.001     .000
                                 Deep     -.001    .000    .002     .006   -.001     .004
              csd     high       Shallow   .012    .017    .093     .016    .000    -.001
                                 Deep      .014    .007    .074     .007    .007     .006
                      low        Shallow   .018    .025    .053     .020    .005     .013
                                 Deep      .011    .008    .034    -.005    .002     .006
                      mixed      Shallow   .022    .017    .065     .022    .002     .009
                                 Deep      .009    .011    .015     .004   -.000     .010
Conclusion    osa     high       Shallow   .011    .001    .003     .004   -.001     .002
                                 Deep      .000   -.000    .002     .001   -.001     .004
                      low        Shallow   .010   -.000   -.004     .001    .002     .016
                                 Deep      .006   -.002    .000    -.001    .006    -.000
                      mixed      Shallow   .011    .001   -.003    -.002   -.001     .001
                                 Deep     -.001    .003   -.001    -.001   -.001    -.003
              osd     high       Shallow   .016   -.002    .005     .004   -.001    -.001
                                 Deep     -.004    .002   -.004    -.003   -.000    -.001
                      low        Shallow   .004   -.002   -.002     .000    .003     .012
                                 Deep      .005   -.002    .001    -.000   -.001    -.003
                      mixed      Shallow   .003   -.002   -.000     .001   -.001    -.003
                                 Deep     -.001   -.001    .003    -.002   -.001    -.002
              csd     high       Shallow   .010   -.009   -.025    -.007    .006     .010
                                 Deep      .004    .003   -.007    -.003   -.001     .004
                      low        Shallow   .001    .006   -.033     .006   -.000    -.001
                                 Deep      .004    .012    .042     .008    .001     .002
                      mixed      Shallow   .001   -.009   -.003    -.011    .004     .003
                                 Deep      .004    .006    .034     .004    .001     .007
Major Claim   osa     high       Shallow  -.001    .004   -.004     .008   -.001     .000
                                 Deep      .001   -.002   -.003     .004   -.001    -.003
                      low        Shallow  -.002    .003   -.001     .003   -.001    -.003
                                 Deep      .001   -.000    .002    -.001   -.000    -.002
                      mixed      Shallow   .000   -.001   -.001     .018    .000    -.001
                                 Deep      .006    .001    .003     .003    .001    -.002
              osd     high       Shallow  -.001    .000   -.002    -.003   -.000     .005
                                 Deep     -.002   -.000   -.004    -.003   -.000    -.002
                      low        Shallow   .004    .002    .002     .000   -.001     .000
                                 Deep      .003   -.000   -.004    -.002    .000    -.000
                      mixed      Shallow   .006   -.002   -.002    -.003   -.001     .003
                                 Deep      .002   -.001   -.001    -.001   -.001    -.001
              csd     high       Shallow  -.002   -.004    .032     .014   -.001     .005
                                 Deep     -.002    .006   -.010     .005    .001    -.002
                      low        Shallow   .002    .002    .043     .014   -.001    -.000
                                 Deep      .002   -.001   -.003    -.005   -.000    -.000
                      mixed      Shallow   .005    .000    .020     .021   -.000     .004
                                 Deep      .002    .000    .012     .004   -.001    -.001
Position      osa     high       Shallow   .003    .002    .020     .011    .014     .036
                                 Deep     -.002   -.002   -.002     .012    .007     .024
                      low        Shallow   .002   -.000    .007     .016    .000     .015
                                 Deep     -.001   -.001   -.000     .003    .001     .005
                      mixed      Shallow   .001    .003    .016     .008    .009     .026
                                 Deep     -.001   -.002    .020     .009    .002     .011
              osd     high       Shallow   .003    .002    .020     .011    .014     .036
                                 Deep      .000   -.000   -.001     .006    .001     .003
                      low        Shallow   .000   -.002    .005     .006   -.001     .006
                                 Deep     -.003   -.002   -.001     .005    .000     .003
                      mixed      Shallow   .002    .003    .017     .006    .010     .024
                                 Deep     -.002   -.002    .005     .001    .003     .002
              csd     high       Shallow  -.000   -.003    .015    -.003    .000    -.000
                                 Deep      .003   -.013    .096    -.014    .001     .004
                      low        Shallow  -.001    .030    .027     .039    .005     .013
                                 Deep      .001    .008    .017     .016    .001     .003
                      mixed      Shallow  -.001    .019    .041     .025   -.000     .001
                                 Deep      .002    .005    .059     .005   -.000     .004
Warrant       osa     high       Shallow   .010   -.001   -.000    -.002    .006     .004
                                 Deep      .003   -.000    .003     .007    .004     .015
                      low        Shallow   .014   -.002    .001    -.000    .008     .010
                                 Deep      .011   -.001    .012     .013    .008     .023
                      mixed      Shallow   .019   -.002    .013     .007    .004     .015
                                 Deep      .007    .001    .002     .001    .003     .009
              osd     high       Shallow   .005   -.002   -.002     .004   -.001     .001
                                 Deep      .000   -.000   -.003    -.002   -.001    -.001
                      low        Shallow   .003   -.001    .016     .011    .001     .013
                                 Deep      .005   -.002    .002     .001    .000     .001
                      mixed      Shallow   .012   -.002    .009     .011   -.000     .007
                                 Deep      .005   -.002   -.002    -.001   -.001     .002
              csd     high       Shallow   .002   -.003   -.047    -.002    .000     .003
                                 Deep      .009    .007   -.020     .002    .002    -.000
                      low        Shallow   .003   -.008   -.042    -.001   -.000     .003
                                 Deep     -.000   -.005   -.035    -.002   -.001    -.001
                      mixed      Shallow   .001   -.017   -.045    -.007    .000    -.001
                                 Deep      .000   -.011   -.014    -.008    .002     .001

Table 5: Fairness evaluation metrics of KFT classifiers and all subtypes.

6 Conclusion and Future Work

In our work, we provide three basic models (shallow learning models, deep learning models, and an LLM) trained on the annotations of the DARIUS corpus of learner texts in German. These models are ready to use in schools, for example, to create a feedback tool for training argumentative skills. The evaluation of model fairness showed that all models produced fair scores for all students, considering demographic and psychological differences among students. In a second experiment, we trained our models on subgroups of students, based on either low, high, or mixed cognitive abilities, to investigate the extent to which skewed training data leads to unfair AES system scores. Our results showed lower performance for students who were not in the training data, emphasizing the importance of including samples of the full range of users in the training data for AES, not only for demographic background variables but also for psychological aspects such as cognitive abilities. Failure to do so risks reducing the predictive accuracy of the algorithm for those who are not adequately represented. To mitigate the risk of students receiving unfair scores based on their demographic and psychological variables, we advocate that future AES systems incorporate the goal of fairness, in addition to accuracy, into their training data collection and algorithm optimization function, going beyond the current state of retrospective analysis of model fairness.

7 Limitations

This study encounters several limitations that have to be mentioned. One constraint is the small size of certain subgroups within the corpus, as seen in Table 1, e.g., students with specific family languages or with profiles like Linguistics or Aesthetics. The underrepresentation of those subgroups poses a challenge in drawing robust conclusions for these particular groups, potentially impacting the reliability and applicability of our outcomes to these populations.

Additionally, the comparatively homogeneous population in the state of Schleswig-Holstein in northern Germany restricts the generalizability of our findings. The demographic profile of Schleswig-Holstein may not reflect the diversity found in other regions or countries, potentially narrowing our study's insights.

In conclusion, while our study provides insights into fairness in the subgroups of the DARIUS corpus, these limitations underscore the necessity for a cautious interpretation of our findings and suggest areas for future research efforts to build upon and address these constraints.

8 Acknowledgements

This work was supported by the Deutsche Telekom Stiftung and partially conducted at "CATALPA - Center of Advanced Technology for Assisted Learning and Predictive Analytics" of the FernUniversität in Hagen, Germany.

References

in word vector evaluation methods. International Educational Data Mining Society.

Perpetual Baffour, Tor Saxberg, and Scott Crossley. 2023. Analyzing bias in large language model solutions for assisted writing feedback tools: Lessons from the feedback prize competition series. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 242–246.

Ryan S. Baker and Aaron Hawn. 2022. Algorithmic bias in education. International Journal of Artificial Intelligence in Education, 32:1052–1092.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.

Brent Bridgeman, Catherine Trapani, and Yigal Attali. 2009. Considering fairness and validity in evaluating automated scoring. In Annual Meeting of the National Council on Measurement in Education, San Diego, CA.

Scott A. Crossley, Perpetual Baffour, Yu Tian, Aigner Picou, Meg Benner, and Ulrich Boser. 2022. The persuasive essays for rating, selecting, and understanding argumentative and discourse elements (PERSUADE) corpus 1.0. Assessing Writing, 54:100667.

European Commission, Directorate-General for Education, Youth, Sport and Culture. 2022. Ethical guidelines on the use of artificial intelligence (AI) and data in teaching and learning for educators. Publications Office of the European Union.

Johanna Fleckenstein, Lucas W. Liebenow, and Jennifer Meyer. 2023. Automated feedback and writing: A multi-level meta-analysis of effects on students' performance. Frontiers in Artificial Intelligence, 6.

Government Equalities Office. 2013. Equality Act 2010: guidance. https://www.gov.uk/guidance/equality-act-2010-guidance. Accessed: 2023-09-21.

Steve Graham, Michael Hebert, and Karen Harris. 2015. Formative assessment and writing: A meta-analysis. The Elementary School Journal.

René F. Kizilcec and Hansol Lee. 2020. Algorithmic fairness in education. CoRR, abs/2007.05443.

Alexander Kwako, Yixin Wan, Jieyu Zhao, Kai-Wei Chang, Li Cai, and Mark Hansen. 2022. Using item response theory to measure gender and racial bias of a BERT-based automated English speech assessment system. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), pages 1–7, Seattle, Washington. Association for Computational Linguistics.

Alexander Kwako, Yixin Wan, Jieyu Zhao, Mark Hansen, Kai-Wei Chang, and Li Cai. 2023. Does BERT exacerbate gender or L1 biases in automated English speaking assessment? In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 668–681.

Lin Li, Lele Sha, Yuheng Li, Mladen Raković, Jia Rong, Srecko Joksimovic, Neil Selwyn, Dragan Gašević, and Guanliang Chen. 2023. Moral machines or tyranny of the majority? A systematic review on predictive bias in education. In LAK23: 13th International Learning Analytics and Knowledge Conference, pages 499–508.

Diane Litman, Haoran Zhang, Richard Correnti, Lindsay Matsumura, and Elaine L. Wang. 2021. A fairness evaluation of automated methods for scoring text evidence usage in writing. In International Conference on Artificial Intelligence in Education, pages 255–267, Cham. Springer International Publishing.

Anastassia Loukina, Nitin Madnani, and Klaus Zechner. 2019. The many dimensions of algorithmic fairness in educational applications. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 1–10.

Nitin Madnani and Anastassia Loukina. 2016. RSMTool: A collection of tools for building and evaluating automated scoring models. Journal of Open Source Software, 1(3).

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A survey on bias and fairness in machine learning. ACM Computing Surveys.

Shira Mitchell, Eric Potash, Solon Barocas, Alexander D'Amour, and Kristian Lum. 2021. Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of Statistics and its Application.

Jens Möller, Thorben Jansen, Johanna Fleckenstein, Nils Machts, Jennifer Meyer, and Raja Reble. 2022. Judgment accuracy of German student texts: Do teacher experience and content knowledge matter? Teaching and Teacher Education, 119:103879.

OpenAI. 2024. GPT-4 technical report.

Jonathan Osborne, Bryan Henderson, Anna Macpherson, Evan Szu, Andrew Wild, and Shi-Ying Yao. 2016. The development and validation of a learning progression for argumentation in science. Journal of Research in Science Teaching, 53.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Andreas Müller, Joel Nothman, Gilles Louppe, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Francesc Pedró, Miguel Subosa, Axel Rivas, and Paula Valverde. 2019. Artificial intelligence in education: Challenges and opportunities for sustainable development. Working Papers on Education Policy 7, UNESCO, France.

Tanja Riemeier, Claudia Aufschnaiter, Jan Fleischhauer, and Christian Rogge. 2012. Argumentationen von Schülern prozessbasiert analysieren: Ansatz, Vorgehen, Befunde und Implikationen. Zeitschrift für Didaktik der Naturwissenschaften, 18:141–180.

Nils-Jonathan Schaller, Andrea Horbach, Lars Höft, Yuning Ding, Jan L. Bahr, Jennifer Meyer, and Thorben Jansen. 2024. DARIUS: A comprehensive learner corpus for argument mining in German-language essays. OSF Preprints. Accepted for LREC-COLING 2024.

Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1501–1510, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Detlef Urhahne and Lisette Wijnia. 2021. A review on the accuracy of teacher judgments. Educational Research Review, 32.

Mark Warschauer, Michele Knobel, and Leeann Stone. 2004. Technology and equity in schooling: Deconstructing the digital divide. Educational Policy.

David M. Williamson, Xiaoming Xi, and F. Jay Breyer. 2012. A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31.

Kevin P. Yancey, Geoffrey T. LaFlair, Anthony Verardi, and Jill Burstein. 2023. Rating short L2 essays on the CEFR scale with GPT-4. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 576–584, Toronto, Canada. Association for Computational Linguistics.

Kaixun Yang, Mladen Raković, Yuyang Li, Quanlong Guan, Dragan Gašević, and Guanliang Chen. 2024. Unveiling the tapestry of automated essay scoring: A comprehensive investigation of accuracy, fairness, and generalizability. Proceedings of the AAAI Conference on Artificial Intelligence.
Jianhua Zhang and Lawrence Jun Zhang. 2023. Examining the relationship between English as a foreign language learners' cognitive abilities and L2 grit in predicting their writing performance. Learning and Instruction, 88.

A GPT prompts used in our experiments

Item          Description
Conclusion    Does this text have a concluding section, a summary? Answer with 1 for Yes or 0 for No.
Introduction  Does this text have an introduction? Answer with 1 for Yes or 0 for No.
Main Thesis   Is this text a main thesis, meaning a sentence in a text that takes a clear position? Answer with 1 for Yes or 0 for No.
Position      Does this text discuss all three positions of the task? Either cars that are powered by hydrogen, electricity, or e-fuels, or other task that involves hydroelectric power plants, solar power plants, and wind farms. If all three options are discussed, answer with 1, if not then 0.
Warrant       Do the arguments in the text have an explanation, meaning a more detailed explanation of the argument? If yes answer with 1, if not then 0.
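The 1/0 answers elicited by these prompts can be reduced to the document-level presence decision described in Section 4.3. A small sketch, with function names that are illustrative rather than taken from the paper's code:

```python
# Sketch: parse per-sentence 1/0 replies (as elicited by the prompts above)
# and reduce them to a document-level presence decision (Section 4.3).
def parse_binary_answer(reply: str) -> int:
    """Map a model reply like '1' or ' 0 ' to an int, rejecting anything else."""
    stripped = reply.strip()
    if stripped not in {"0", "1"}:
        raise ValueError(f"unexpected answer: {reply!r}")
    return int(stripped)

def element_present(decisions) -> int:
    """An element counts as present if at least one sentence was labeled 1."""
    return int(any(d == 1 for d in decisions))

replies = ["0", "1 ", "0"]  # e.g. per-sentence major-claim answers
decisions = [parse_binary_answer(r) for r in replies]
print(element_present(decisions))  # 1
```

Rejecting malformed replies explicitly, instead of silently coercing them, makes it visible when the model does not follow the 1/0 answer format the prompts request.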
Deutsch:
In Norddeutschland wird die Frage gestellt welche klimaneutrale Energiegewinnung gebaut werden soll, um eine Klimaneutralität zu erreichen. Zur Frage kommen Windparks, Solar und Wasserkraftanlagen. Ich finde, dass der Bau von Windparks gefördert werden soll. Mit 45% Wirkungsgrad sind diese schwächer als Wasserkraftanlagen und stärker als Solarparks. Obwohl der Wirkungsgrad mit 45% geringer ist als bei Wasserkraftanlagen, liefert ein Windpark mit 40 GWh pro Jahr mehr Strom als Solarpark und Wasserkraftanlage. Ebenfalls ist der Preis relativ zum Jahresertrag günstig mit 14 Millionen als Solarpark und Wasserkraftanlage. Ebenfalls muss man in Betracht ziehen, dass der Windpark weniger CO2 ausstoßt. Solarpark und Wasserkraftanlage stoßen 35000t und 12000t CO2 und der Windpark nur 8,800t. Jedoch muss man sagen, dass der Windpark nur eine Lebensdauer von 20 Jahren hat. Währenddessen halten Solarparks 30 Jahre und Wasserkraftanlage 80 Jahre. Auf der Ebene der Lokalemissionen besitz der Windpark die meisten Emission mit Hör-, Infraschall und Schattenwurft. Die Wasserkraftanlage wirft keinen Schattenwurf, aber hat trotzdem Hör- und Infraschall. Der Solarpark hat keinen Emissionen jeglicher Art. Zum Schluss komme ich, dass man Windparks fördern sollte, da die Vorteile die Nachteile überwiegen. Sie bieten günstig Strom und verursachen wenig Treibhausgasemissionen, aber man muss anmerken, dass ein Windpark keine hohe Lebensdauer hat, sodass diese öfters erneuert werden müssen, und dass Anwohner und Tiere von diesem belästigt werden können.

Englisch:
In northern Germany, the question is being asked as to which climate-neutral energy generation should be built in order to achieve climate neutrality. The options are wind farms, solar and hydropower plants. I think that the construction of wind farms should be promoted. At 45% efficiency, they are less efficient than hydropower plants and more efficient than solar parks. Although the efficiency of 45% is lower than that of hydropower plants, a wind farm with 40 GWh per year supplies more electricity than solar farms and hydropower plants. The price relative to the annual yield is also lower at 14 million than solar parks and hydroelectric power plants. It must also be taken into account that the wind farm emits less CO2. The solar park and hydropower plant emit 35,000 tons and 12,000 tons of CO2 respectively, while the wind park emits only 8,800 tons. However, it must be said that the wind farm only has a lifespan of 20 years. In contrast, solar parks last 30 years and hydroelectric power plants 80 years. On the level of local emissions, the wind farm has the most emissions with acoustic, infrasound and shadow flicker. The hydropower plant does not cast any shadows, but still has audible and infrasound emissions. The solar park has no emissions of any kind. In conclusion, I believe that wind farms should be promoted because the advantages outweigh the disadvantages. They provide cheap electricity and cause little greenhouse gas emissions, but it should be noted that a wind farm does not have a long lifespan, so they have to be renewed frequently, and that residents and animals can be disturbed by them.