ables. The results showed some fairness problems, which were exacerbated in the studies where the AES was additionally trained only on a homogeneous group of students.

3 Data

The DARIUS corpus is a collection of 4,589 annotated argumentative texts written by 1,839 students from German high schools, spread across 114 classes in 33 different schools (Schaller et al., 2024). Essays that were off-topic, shorter than two sentences, empty, or contained names or other data relevant to data protection were removed beforehand. The final dataset consists of essays from two writing assignments focused on socio-scientific issues on the topics energy and automotive, containing 2,307 and 2,282 essays respectively. Students wrote a draft and a revision on one task, followed by an essay on the other task, resulting in up to 3 essays per student. An example text is listed in the Appendix. Students also provided demographic data voluntarily, a selection of which is listed in Table 1.

The dataset has been extensively annotated with information about argumentative structure on different levels of granularity. In the present study, we focus specifically on a subset of these annotations, namely content zone, major claim, position, and warrant. Out of the nine original annotation categories, we selected these because they reflect different parts of an argumentative text, e.g. structure and content, and are annotated on different levels of granularity (token level to whole texts). We used the demographic data to measure fairness with respect to gender, profile, school, cognitive ability (KFT), and languages, which are further explained after providing more details on the annotations in Section 3.1. A more extensive description can be found in the original paper (Schaller et al., 2024).

3.1 Annotations

Content zone: This annotation category breaks down the essays into their basic parts: the introduction, the body, and the conclusion. Each section can be as short as one sentence or span several sentences.

Major claim annotation: Central to the argumentative essence of the essays, the Major Claim annotation identifies the pivotal stance taken by the author on the discussed issue. In contrast to similar annotation efforts (Stab and Gurevych, 2014), we also include claims written not only in the opening paragraphs but also within the conclusion, offering a comprehensive view of the argumentative intent. Such claims form the basis for the author's further arguments and the direction of their reasoning.

Position annotation: This annotation extracts the essay's directional stance regarding the thematic issues presented in the writing tasks: whether the argumentation aligns with, diverges from, or remains ambiguous towards the positions debated within the tasks. This annotation is important for understanding the diversity of viewpoints and the critical engagement of students with the socio-scientific topics at hand.

Warrant annotation: A warrant is one out of five argumentative elements annotated in the dataset as part of the Toulmin's Argumentation Pattern (TAP) annotations, following the definitions by Riemeier et al. (2012). TAP describes a structural framework for constructing logical and compelling arguments by including a claim, providing supporting evidence (data), explaining the connection between the claim and data (warrant), and addressing counterarguments (rebuttal). For this study, we focus exemplarily on warrants because the use of warrants already indicates a higher level of argumentation skill (Osborne et al., 2016). TAP elements are not marked on the sentence level but on the token level, as a TAP sequence can cover a wide range from subordinate clauses to entire paragraphs.

3.2 Demographic and Psychological Data

We consider the following demographic variables:

Grade Grade indicates which grade level the student is in. The dataset was obtained for students between Grade 9 and Grade 12.

Gender The students could indicate their gender. Options were female, male, and diverse.

School The German school system differentiates between different forms of high school.

• Gemeinschaftsschule: non-academic track
• Gymnasium: academic track
• Berufsschule: vocational training

Profile The German school system allows students to choose a profile. The Natural Sciences profile, for example, has a focus on math and science, while the Social Sciences profile can have a focus on politics or ethics.

Languages The students could indicate the language that they speak at home.
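Because warrants (Section 3.1) are marked on the token level rather than the sentence level, the sequence-tagging view of the data can be sketched with a token-level BIO encoding. The BIO scheme and the toy sentence below are illustrative assumptions, not the corpus's exact format:

```python
# Sketch: a sub-sentence warrant span encoded as token-level BIO tags for
# sequence tagging. BIO is an illustrative choice, not necessarily the
# corpus's exact encoding.
tokens = ["Windparks", "liefern", "mehr", "Strom", ",", "weil", "der",
          "Wirkungsgrad", "hoeher", "ist"]
warrant_span = (5, 10)  # half-open token range covering "weil ... ist"

tags = ["O"] * len(tokens)
start, end = warrant_span
tags[start] = "B-WARRANT"
for i in range(start + 1, end):
    tags[i] = "I-WARRANT"
```

A span that starts or ends mid-sentence is unproblematic in this representation, which is why token-level tagging is needed for warrants but not for the sentence-aligned annotation types.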
Grade Level            Gender               Profile                       Language
Level    Students      Gender    Students   Profile            Students   Language     Students
9        423           Female    801        Natural Sciences   414        native       1265
10       346           Male      664        Social Sciences    255        non-native   576
11       547           Diverse   90         Sports             119
12       404           Missing   284        Linguistics        61
13       113                                Aesthetics         13
Missing  6                                  Missing            977
Table 1: Combined Overview: Grade Level, Gender, Profile, and Language of Students
KFT The Cognitive Abilities Test (Kognitiver Fähigkeitstest or KFT), developed by Heller and Perleth (2000), measures students' cognitive abilities through non-verbal figural analogies. These questions evaluate abstract reasoning and the ability to apply logical rules to visual information without linguistic content, making them useful for assessing individuals across different linguistic backgrounds. A typical problem displays a sequence of shapes that follow a certain transformation (e.g., rotation, reflection). The test-taker must identify and apply the same transformation to a new set of figures.

4 Method

In the following section, we describe the experimental setup for our evaluation study.

4.1 Classifiers

We experiment with a diverse set of classifiers to see performance and fairness differences between instances of different model architectures. Our machine learning goal is to predict certain spans in an essay text. For most of these spans, span boundaries align with sentence boundaries. Major claim annotations always consist of single sentences. The other annotation types, i.e. content zone and position annotations, may also span multiple sentences. Only warrant annotations do not necessarily align with sentence boundaries and can consist of segments on the sub-sentence level. Therefore, we make use of both sentence classification and sequence tagging approaches. For sentence classification, we use a Support Vector Machine (SVM) in standard configuration, provided by the scikit-learn Python package (Pedregosa et al., 2011), as an instance of shallow learning. The features utilized in the SVM classifier are the TF-IDF vectors of the most frequent 1- to 3-grams. We use a BERT-based sentence classifier (dbmdz/bert-base-german-cased) as an instance of deep learning and GPT-4 (OpenAI, 2024) to represent generative LLMs. For sequence tagging, we also use the BERT-based classifier and again prompt GPT-4, this time providing the whole essay as input.

4.2 Data Split

We use a fixed data split of 80% training data and 20% test data. From the training data, we used a subset of 60% as validation data to find the best epoch for deep learning and for prompt tuning for generative LLMs in pre-experiments; i.e., the whole training data set was used in the main experiments for training. As we were not interested in the overall best performance but rather in the intrinsic fairness differences between models, we did not further fine-tune any hyperparameters.

4.3 Performance and Fairness Evaluation

The evaluation of our classification results is motivated by the intended use of the classifiers to provide formative feedback to learners, e.g. in an online tutoring system. Although it might also be of interest to show the specific location of an argumentative element within a learner essay as feedback, our primary concern for this study is to determine whether certain argumentative elements are present in a text or not. Therefore, we first transform any classifier output into a binary decision on the document level indicating whether (at least one instance of) a certain argumentative element is present in an essay.

In our fairness evaluation, we follow the framework proposed by Loukina et al. (2019) and their implementation provided within the RSMTool software package (Madnani and Loukina, 2016). More precisely, we compute overall score accuracy (osa), overall score difference (osd), and conditional score difference (csd), where the first looks at squared errors (S − H)² and the latter two at actual errors S − H. In every case, a linear regression is fit with the error being the dependent variable and the
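The regression-based metrics can be illustrated with a small sketch. This is a simplified stand-in for the RSMTool implementation used in the paper: it reports the R² of a regression of the error on one-hot subgroup membership, omits csd's additional conditioning on the human score, and uses invented data.

```python
# Simplified sketch of osa/osd in the spirit of Loukina et al. (2019):
# regress prediction errors on subgroup membership and report how much
# error variance subgroup membership explains (R^2). Toy data; the paper
# uses RSMTool, whose csd additionally conditions on the human score.
import numpy as np
from sklearn.linear_model import LinearRegression

def group_r2(error, groups):
    """R^2 of a linear regression of the error on one-hot subgroup membership."""
    levels = sorted(set(groups))
    X = np.array([[float(g == lv) for lv in levels] for g in groups])
    return LinearRegression().fit(X, error).score(X, error)

S = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # system decisions (element present?)
H = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # human (gold) decisions
g = ["Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q4", "Q4"]  # e.g. KFT quartiles

err = S - H
osd = group_r2(err, g)       # overall score difference: actual errors S - H
osa = group_r2(err ** 2, g)  # overall score accuracy: squared errors (S - H)^2
```

A value near 0 means subgroup membership explains almost none of the error, which is the fair outcome the paper checks for.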
Label Model All Grades Gender Profile School Languages KFT
Introduction Shallow .63 [.35, .68] [.53, .67] [.58, .73] [.48, .68] [.60, .70] [.57, .67]
Deep .81 [.51, .85] [.76, .84] [.74, .83] [.69, .95] [.80, .85] [.75, .85]
LLM .60 [.50, .63] [.46, .62] [.55, .61] [.51, .77] [.59, .59] [.58, .61]
Conclusion Shallow .55 [.44, .71] [.50, .58] [.46, .55] [.46, .61] [.54, .55] [.52, .57]
Deep .70 [.64, .80] [.59, .74] [.63, .81] [.64, .78] [.64, .71] [.64, .78]
LLM .68 [.63, .76] [.68, .81] [.63, .67] [.58, .84] [.65, .68] [.61, .72]
Major Claim Shallow .68 [.62, .74] [.66, .74] [.49, .75] [.42, .81] [.66, .72] [.62, .72]
Deep .88 [.78, .92] [.87, .88] [.80, .95] [.81, .89] [.87, .88] [.84, .90]
LLM .75 [.68, .82] [.66, .81] [.63, .84] [.71, .91] [.71, .86] [.66, .86]
Position Shallow .41 [.34, .46] [.34, .53] [.16, .49] [.29, .56] [.36, .50] [.17, .58]
Deep .44 [.23, .56] [.36, .73] [.23, .61] [.28, .46] [.37, .59] [.27, .54]
LLM .32 [.13, .37] [.29, .54] [.29, .47] [.22, .60] [.31, .33] [.23, .37]
Warrant Shallow .43 [.32, .51] [.39, .51] [.38, .51] [.38, .47] [.39, .55] [.37, .52]
Deep .44 [.27, .53] [.38, .55] [.36, .68] [.36, .65] [.41, .52] [.25, .54]
LLM .00 [-.16, .09] [-.02, .32] [-.18, .02] [-.04, .14] [-.02, .07] [-.13, .08]
Table 2: Kappa values for the individual classifiers evaluated either on all test essays or on essays from a certain subgroup. We report the minimum and maximum values among the subgroups for each demographic variable.
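The shallow pipeline from Section 4.1 and the per-subgroup kappa evaluation behind Table 2 can be sketched as follows. The toy sentences, labels, subgroup assignment, and the `max_features` cut-off are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch: TF-IDF (1- to 3-grams) + SVM sentence classifier, scored
# with Cohen's kappa overall and per subgroup (as in Table 2). Toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_sents = [
    "zusammenfassend bin ich fuer windparks",
    "windparks liefern 40 gwh pro jahr",
    "abschliessend ueberwiegen die vorteile",
    "solarparks halten 30 jahre",
]
train_labels = [1, 0, 1, 0]  # 1 = conclusion sentence, 0 = other

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=10000),
    SVC(),  # "standard configuration"
)
clf.fit(train_sents, train_labels)

test_sents = [
    "zum schluss sollte man windparks foerdern",
    "der windpark stoesst weniger co2 aus",
    "abschliessend bin ich fuer windparks",
    "wasserkraftanlagen halten 80 jahre",
]
gold = [1, 0, 1, 0]
groups = ["Q1", "Q2", "Q1", "Q2"]  # e.g. KFT quartile of each essay's author

pred = clf.predict(test_sents)
overall = cohen_kappa_score(gold, pred)
per_group = {
    q: cohen_kappa_score(
        [y for y, g in zip(gold, groups) if g == q],
        [p for p, g in zip(pred, groups) if g == q],
    )
    for q in sorted(set(groups))
}
```

Evaluating the same predictions once overall and once per subgroup yields the "All" column and the subgroup ranges reported in Table 2.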
[Figure 1: Kappa per KFT subgroup (Q1-Q4) for models trained on low, high, and mixed KFT data; one panel per label; x-axis: KFT Subgroups, y-axis: Kappa.]
(low, high, and mixed) can perform best in different tasks, e.g. mixed deep in Introduction, high deep in Conclusion, or low shallow/mixed deep in Position. In terms of fairness, we still found no values above 0.1 (see Table 5).

When examining Figure 1, we can see that models differed in their performance when tested on different subgroups. For the Introduction, a shallow model trained on the data of the students in the highest KFT quartile (high shallow) performed better on the subgroup it was trained on (e.g. Quartile 4) than on the other subgroups, and the other way around (the low KFT model performed better on the subset with low KFT, e.g. Quartile 1). The mixed models had the lowest variance in performance.

There are exceptions in which a model performed better on a different subgroup than the one it was trained on; e.g., in (d) Position, all models except high shallow lost performance on Quartile 4. Furthermore, all combinations of algorithm and training data had comparably stable performance on (c) Major Claim.

In general, using training data from only one student group seemed to introduce a bias, disadvantaging other student groups. This finding underlines the need to include training data from a diverse range of students to ensure fairness and avoid skewed outcomes.

Label         Metric  KFT Model  Model    Grades  Gender  Profile  School  Language  KFT
Introduction  osa     high       Shallow   .011    .005   -.004     .003   -.001     .001
                                 Deep      .001    .003   -.002     .000   -.001     .001
                      low        Shallow   .001    .001    .007     .000    .008     .007
                                 Deep     -.003   -.001   -.003    -.002   -.001     .003
                      mixed      Shallow   .008   -.001   -.004     .003    .001     .001
                                 Deep     -.003   -.000   -.004    -.003   -.001     .000
              osd     high       Shallow   .003    .003   -.004     .000    .001    -.000
                                 Deep      .001   -.001    .005     .000   -.000     .001
                      low        Shallow   .002    .000    .003     .002   -.001     .006
                                 Deep     -.002    .007    .004     .002   -.001     .002
                      mixed      Shallow   .010   -.002    .000     .009   -.001     .000
                                 Deep     -.001    .000    .002     .006   -.001     .004
              csd     high       Shallow   .012    .017    .093     .016    .000    -.001
                                 Deep      .014    .007    .074     .007    .007     .006
                      low        Shallow   .018    .025    .053     .020    .005     .013
                                 Deep      .011    .008    .034    -.005    .002     .006
                      mixed      Shallow   .022    .017    .065     .022    .002     .009
                                 Deep      .009    .011    .015     .004   -.000     .010
Conclusion    osa     high       Shallow   .011    .001    .003     .004   -.001     .002
                                 Deep      .000   -.000    .002     .001   -.001     .004
                      low        Shallow   .010   -.000   -.004     .001    .002     .016
                                 Deep      .006   -.002    .000    -.001    .006    -.000
                      mixed      Shallow   .011    .001   -.003    -.002   -.001     .001
                                 Deep     -.001    .003   -.001    -.001   -.001    -.003
              osd     high       Shallow   .016   -.002    .005     .004   -.001    -.001
                                 Deep     -.004    .002   -.004    -.003   -.000    -.001
                      low        Shallow   .004   -.002   -.002     .000    .003     .012
                                 Deep      .005   -.002    .001    -.000   -.001    -.003
                      mixed      Shallow   .003   -.002   -.000     .001   -.001    -.003
                                 Deep     -.001   -.001    .003    -.002   -.001    -.002
              csd     high       Shallow   .010   -.009   -.025    -.007    .006     .010
                                 Deep      .004    .003   -.007    -.003   -.001     .004
                      low        Shallow   .001    .006   -.033     .006   -.000    -.001
                                 Deep      .004    .012    .042     .008    .001     .002
                      mixed      Shallow   .001   -.009   -.003    -.011    .004     .003
                                 Deep      .004    .006    .034     .004    .001     .007
Major Claim   osa     high       Shallow  -.001    .004   -.004     .008   -.001     .000
                                 Deep      .001   -.002   -.003     .004   -.001    -.003
                      low        Shallow  -.002    .003   -.001     .003   -.001    -.003
                                 Deep      .001   -.000    .002    -.001   -.000    -.002
                      mixed      Shallow   .000   -.001   -.001     .018    .000    -.001
                                 Deep      .006    .001    .003     .003    .001    -.002
              osd     high       Shallow  -.001    .000   -.002    -.003   -.000     .005
                                 Deep     -.002   -.000   -.004    -.003   -.000    -.002
                      low        Shallow   .004    .002    .002     .000   -.001     .000
                                 Deep      .003   -.000   -.004    -.002    .000    -.000
                      mixed      Shallow   .006   -.002   -.002    -.003   -.001     .003
                                 Deep      .002   -.001   -.001    -.001   -.001    -.001
              csd     high       Shallow  -.002   -.004    .032     .014   -.001     .005
                                 Deep     -.002    .006   -.010     .005    .001    -.002
                      low        Shallow   .002    .002    .043     .014   -.001    -.000
                                 Deep      .002   -.001   -.003    -.005   -.000    -.000
                      mixed      Shallow   .005    .000    .020     .021   -.000     .004
                                 Deep      .002    .000    .012     .004   -.001    -.001
Position      osa     high       Shallow   .003    .002    .020     .011    .014     .036
                                 Deep     -.002   -.002   -.002     .012    .007     .024
                      low        Shallow   .002   -.000    .007     .016    .000     .015
                                 Deep     -.001   -.001   -.000     .003    .001     .005
                      mixed      Shallow   .001    .003    .016     .008    .009     .026
                                 Deep     -.001   -.002    .020     .009    .002     .011
              osd     high       Shallow   .003    .002    .020     .011    .014     .036
                                 Deep      .000   -.000   -.001     .006    .001     .003
                      low        Shallow   .000   -.002    .005     .006   -.001     .006
                                 Deep     -.003   -.002   -.001     .005    .000     .003
                      mixed      Shallow   .002    .003    .017     .006    .010     .024
                                 Deep     -.002   -.002    .005     .001    .003     .002
              csd     high       Shallow  -.000   -.003    .015    -.003    .000    -.000
                                 Deep      .003   -.013    .096    -.014    .001     .004
                      low        Shallow  -.001    .030    .027     .039    .005     .013
                                 Deep      .001    .008    .017     .016    .001     .003
                      mixed      Shallow  -.001    .019    .041     .025   -.000     .001
                                 Deep      .002    .005    .059     .005   -.000     .004
Warrant       osa     high       Shallow   .010   -.001   -.000    -.002    .006     .004
                                 Deep      .003   -.000    .003     .007    .004     .015
                      low        Shallow   .014   -.002    .001    -.000    .008     .010
                                 Deep      .011   -.001    .012     .013    .008     .023
                      mixed      Shallow   .019   -.002    .013     .007    .004     .015
                                 Deep      .007    .001    .002     .001    .003     .009
              osd     high       Shallow   .005   -.002   -.002     .004   -.001     .001
                                 Deep      .000   -.000   -.003    -.002   -.001    -.001
                      low        Shallow   .003   -.001    .016     .011    .001     .013
                                 Deep      .005   -.002    .002     .001    .000     .001
                      mixed      Shallow   .012   -.002    .009     .011   -.000     .007
                                 Deep      .005   -.002   -.002    -.001   -.001     .002
              csd     high       Shallow   .002   -.003   -.047    -.002    .000     .003
                                 Deep      .009    .007   -.020     .002    .002    -.000
                      low        Shallow   .003   -.008   -.042    -.001   -.000     .003
                                 Deep     -.000   -.005   -.035    -.002   -.001    -.001
                      mixed      Shallow   .001   -.017   -.045    -.007    .000    -.001
                                 Deep      .000   -.011   -.014    -.008    .002     .001

Table 5: Fairness evaluation metrics of KFT classifiers and all subtypes.

6 Conclusion and Future Work

In our work, we provide three basic models (shallow learning models, deep learning models, and an LLM) trained on the annotations of the DARIUS corpus of learner texts in German. These models are ready to use in schools, for example, to create a feedback tool for training argumentative skills. The evaluation of model fairness showed that all models produced fair scores for all students, considering demographic and psychological differences among students. In a second experiment, we trained our models on subgroups of students, based on either low, high, or mixed cognitive abilities, to investigate the extent to which skewed training data leads to unfair AES system scores. Our results showed lower performance for students who were not in the training data, emphasizing the importance of including samples of the full range of users in the training data for AES, not only for demographic background variables but also for psychological aspects such as cognitive abilities. Failure to do so risks reducing the predictive accuracy of the algorithm for those who are not adequately represented. To mitigate the risk of students receiving unfair scores based on their demographic and psychological variables, we advocate that future AES systems incorporate the goal of fairness, in addition to accuracy, into their training data collection and algorithm optimization function, going beyond the current state of retrospective analysis of model fairness.

7 Limitations

This study encounters several limitations that have to be mentioned. One constraint is the small size of certain subgroups within the corpus, as seen in Table 1, e.g., students with specific family languages or with profiles like Linguistics or Aesthetics. The underrepresentation of those subgroups poses a challenge in drawing robust conclusions for these particular groups, potentially impacting the reliability and applicability of our outcomes to these populations.

Additionally, the comparatively homogeneous population in the state of Schleswig-Holstein in northern Germany restricts the generalizability of our findings. The demographic profile of Schleswig-Holstein may not reflect the diversity found in other regions or countries, potentially narrowing our study's insights.

In conclusion, while our study provides insights into fairness in the subgroups of the DARIUS corpus, these limitations underscore the necessity for a cautious interpretation of our findings and suggest areas for future research efforts to build upon and address these constraints.

8 Acknowledgements

This work was supported by the Deutsche Telekom Stiftung and partially conducted at "CATALPA - Center of Advanced Technology for Assisted Learning and Predictive Analytics" of the FernUniversität in Hagen, Germany.

References

in word vector evaluation methods. International Educational Data Mining Society.

Perpetual Baffour, Tor Saxberg, and Scott Crossley. 2023. Analyzing bias in large language model solutions for assisted writing feedback tools: Lessons from the feedback prize competition series. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 242–246.

Ryan S. Baker and Aaron Hawn. 2022. Algorithmic bias in education. International Journal of Artificial Intelligence in Education, 32:1052–1092.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.

Brent Bridgeman, Catherine Trapani, and Yigal Attali. 2009. Considering fairness and validity in evaluating automated scoring. In Annual Meeting of the National Council on Measurement in Education, San Diego, CA.

Scott A. Crossley, Perpetual Baffour, Yu Tian, Aigner Picou, Meg Benner, and Ulrich Boser. 2022. The persuasive essays for rating, selecting, and understanding argumentative and discourse elements (PERSUADE) corpus 1.0. Assessing Writing, 54:100667.

European Commission, Directorate-General for Education, Youth, Sport and Culture. 2022. Ethical guidelines on the use of artificial intelligence (AI) and data in teaching and learning for educators. Publications Office of the European Union.

Johanna Fleckenstein, Lucas W. Liebenow, and Jennifer Meyer. 2023. Automated feedback and writing: A multi-level meta-analysis of effects on students' performance. Frontiers in Artificial Intelligence, 6.

Government Equalities Office. 2013. Equality Act 2010: guidance. https://www.gov.uk/guidance/equality-act-2010-guidance. Accessed: 2023-09-21.

Steve Graham, Michael Hebert, and Karen Harris. 2015. Formative assessment and writing: A meta-analysis. The Elementary School Journal.

René F. Kizilcec and Hansol Lee. 2020. Algorithmic fairness in education. CoRR, abs/2007.05443.

Alexander Kwako, Yixin Wan, Jieyu Zhao, Kai-Wei Chang, Li Cai, and Mark Hansen. 2022. Using item response theory to measure gender and racial bias of a BERT-based automated English speech assessment system. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), pages 1–7, Seattle, Washington. Association for Computational Linguistics.

Alexander Kwako, Yixin Wan, Jieyu Zhao, Mark Hansen, Kai-Wei Chang, and Li Cai. 2023. Does BERT exacerbate gender or L1 biases in automated English speaking assessment? In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 668–681.

Lin Li, Lele Sha, Yuheng Li, Mladen Raković, Jia Rong, Srecko Joksimovic, Neil Selwyn, Dragan Gašević, and Guanliang Chen. 2023. Moral machines or tyranny of the majority? A systematic review on predictive bias in education. In LAK23: 13th International Learning Analytics and Knowledge Conference, pages 499–508.

Diane Litman, Haoran Zhang, Richard Correnti, Lindsay Matsumura, and Elaine L. Wang. 2021. A fairness evaluation of automated methods for scoring text evidence usage in writing. In International Conference on Artificial Intelligence in Education, pages 255–267, Cham. Springer International Publishing.

Anastassia Loukina, Nitin Madnani, and Klaus Zechner. 2019. The many dimensions of algorithmic fairness in educational applications. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 1–10.

Nitin Madnani and Anastassia Loukina. 2016. RSMTool: A collection of tools for building and evaluating automated scoring models. Journal of Open Source Software, 1(3).

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A survey on bias and fairness in machine learning. ACM Computing Surveys.

Shira Mitchell, Eric Potash, Solon Barocas, Alexander D'Amour, and Kristian Lum. 2021. Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of Statistics and its Application.

Jens Möller, Thorben Jansen, Johanna Fleckenstein, Nils Machts, Jennifer Meyer, and Raja Reble. 2022. Judgment accuracy of German student texts: Do teacher experience and content knowledge matter? Teaching and Teacher Education, 119:103879.

OpenAI. 2024. GPT-4 technical report.

Jonathan Osborne, Bryan Henderson, Anna Macpherson, Evan Szu, Andrew Wild, and Shi-Ying Yao. 2016. The development and validation of a learning progression for argumentation in science. Journal of Research in Science Teaching, 53.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Andreas Müller, Joel Nothman, Gilles Louppe, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Francesc Pedró, Miguel Subosa, Axel Rivas, and Paula Valverde. 2019. Artificial intelligence in education: Challenges and opportunities for sustainable development. Working Papers on Education Policy 7, UNESCO, France.

Tanja Riemeier, Claudia Aufschnaiter, Jan Fleischhauer, and Christian Rogge. 2012. Argumentationen von Schülern prozessbasiert analysieren: Ansatz, Vorgehen, Befunde und Implikationen. Zeitschrift für Didaktik der Naturwissenschaften, 18:141–180.

Nils-Jonathan Schaller, Andrea Horbach, Lars Höft, Yuning Ding, Jan L. Bahr, Jennifer Meyer, and Thorben Jansen. 2024. DARIUS: A comprehensive learner corpus for argument mining in German-language essays. OSF Preprints. Accepted for LREC-COLING 2024.

Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1501–1510, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Detlef Urhahne and Lisette Wijnia. 2021. A review on the accuracy of teacher judgments. Educational Research Review, 32.

Mark Warschauer, Michele Knobel, and Leeann Stone. 2004. Technology and equity in schooling: Deconstructing the digital divide. Educational Policy.

David M. Williamson, Xiaoming Xi, and F. Jay Breyer. 2012. A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31.

Kevin P. Yancey, Geoffrey T. LaFlair, Anthony Verardi, and Jill Burstein. 2023. Rating short L2 essays on the CEFR scale with GPT-4. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 576–584, Toronto, Canada. Association for Computational Linguistics.

Kaixun Yang, Mladen Raković, Yuyang Li, Quanlong Guan, Dragan Gašević, and Guanliang Chen. 2024. Unveiling the tapestry of automated essay scoring: A comprehensive investigation of accuracy, fairness, and generalizability. Proceedings of the AAAI Conference on Artificial Intelligence.
Jianhua Zhang and Lawrence Jun Zhang. 2023. Examining the relationship between English as a foreign language learners' cognitive abilities and L2 grit in predicting their writing performance. Learning and Instruction, 88.

A GPT prompts used in our experiments

Item          Description
Conclusion    Does this text have a concluding section, a summary? Answer with 1 for Yes or 0 for No.
Introduction  Does this text have an introduction? Answer with 1 for Yes or 0 for No.
Main Thesis   Is this text a main thesis, meaning a sentence in a text that takes a clear position? Answer with 1 for Yes or 0 for No.
Position      Does this text discuss all three positions of the task? Either cars that are powered by hydrogen, electricity, or e-fuels, or other task that involves hydroelectric power plants, solar power plants, and wind farms. If all three options are discussed, answer with 1, if not then 0.
Warrant       Do the arguments in the text have an explanation, meaning a more detailed explanation of the argument? If yes answer with 1, if not then 0.
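The 1/0 answers elicited by these prompts can be reduced to the document-level presence decision described in Section 4.3. A small sketch, with function names that are illustrative rather than taken from the paper's code:

```python
# Sketch: parse per-sentence 1/0 replies (as elicited by the prompts above)
# and reduce them to a document-level presence decision (Section 4.3).
def parse_binary_answer(reply: str) -> int:
    """Map a model reply like '1' or ' 0 ' to an int, rejecting anything else."""
    stripped = reply.strip()
    if stripped not in {"0", "1"}:
        raise ValueError(f"unexpected answer: {reply!r}")
    return int(stripped)

def element_present(decisions) -> int:
    """An element counts as present if at least one sentence was labeled 1."""
    return int(any(d == 1 for d in decisions))

replies = ["0", "1 ", "0"]  # e.g. per-sentence major-claim answers
decisions = [parse_binary_answer(r) for r in replies]
print(element_present(decisions))  # 1
```

Rejecting malformed replies explicitly, instead of silently coercing them, makes it visible when the model does not follow the 1/0 answer format the prompts request.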
Deutsch:
In Norddeutschland wird die Frage gestellt welche klimaneutrale Energiegewinnung gebaut werden soll, um eine Klimaneutralität zu erreichen. Zur Frage kommen Windparks, Solar und Wasserkraftanlagen. Ich finde, dass der Bau von Windparks gefördert werden soll. Mit 45% Wirkungsgrad sind diese schwächer als Wasserkraftanlagen und stärker als Solarparks. Obwohl der Wirkungsgrad mit 45% geringer ist als bei Wasserkraftanlagen, liefert ein Windpark mit 40 GWh pro Jahr mehr Strom als Solarpark und Wasserkraftanlage. Ebenfalls ist der Preis relativ zum Jahresertrag günstig mit 14 Millionen als Solarpark und Wasserkraftanlage. Ebenfalls muss man in Betracht ziehen, dass der Windpark weniger CO2 ausstoßt. Solarpark und Wasserkraftanlage stoßen 35000t und 12000t CO2 und der Windpark nur 8,800t. Jedoch muss man sagen, dass der Windpark nur eine Lebensdauer von 20 Jahren hat. Währenddessen halten Solarparks 30 Jahre und Wasserkraftanlage 80 Jahre. Auf der Ebene der Lokalemissionen besitz der Windpark die meisten Emission mit Hör-, Infraschall und Schattenwurft. Die Wasserkraftanlage wirft keinen Schattenwurf, aber hat trotzdem Hör- und Infraschall. Der Solarpark hat keinen Emissionen jeglicher Art. Zum Schluss komme ich, dass man Windparks fördern sollte, da die Vorteile die Nachteile überwiegen. Sie bieten günstig Strom und verursachen wenig Treibhausgasemissionen, aber man muss anmerken, dass ein Windpark keine hohe Lebensdauer hat, sodass diese öfters erneuert werden müssen, und dass Anwohner und Tiere von diesem belästigt werden können.

Englisch:
In northern Germany, the question is being asked as to which climate-neutral energy generation should be built in order to achieve climate neutrality. The options are wind farms, solar and hydropower plants. I think that the construction of wind farms should be promoted. At 45% efficiency, they are less efficient than hydropower plants and more efficient than solar parks. Although the efficiency of 45% is lower than that of hydropower plants, a wind farm with 40 GWh per year supplies more electricity than solar farms and hydropower plants. The price relative to the annual yield is also lower at 14 million than solar parks and hydroelectric power plants. It must also be taken into account that the wind farm emits less CO2. The solar park and hydropower plant emit 35,000 tons and 12,000 tons of CO2 respectively, while the wind park emits only 8,800 tons. However, it must be said that the wind farm only has a lifespan of 20 years. In contrast, solar parks last 30 years and hydroelectric power plants 80 years. On the level of local emissions, the wind farm has the most emissions with acoustic, infrasound and shadow flicker. The hydropower plant does not cast any shadows, but still has audible and infrasound emissions. The solar park has no emissions of any kind. In conclusion, I believe that wind farms should be promoted because the advantages outweigh the disadvantages. They provide cheap electricity and cause little greenhouse gas emissions, but it should be noted that a wind farm does not have a long lifespan, so they have to be renewed frequently, and that residents and animals can be disturbed by them.