AI can be useful in certain tasks, but beware the risk of misleading students

Generative artificial intelligence (GenAI) tools based on large language models (LLMs), such as ChatGPT, are now firmly embedded in education ecosystems. For chemistry teachers, this presents both opportunity and risk. On the one hand, LLMs can generate worked examples or explanations on demand; on the other, chemistry relies heavily on precise reasoning and challenging representational fluency, which GenAI may not be able to handle (rsc.li/3ZtwkPZ). 

As education institutions grapple with how to respond to student use of GenAI, evidence is urgently needed to understand not just whether these tools can answer chemistry questions, but also how they reason, where they fail and what this means for teaching and assessment. 

In a 2025 study, researchers set out to evaluate the reliability and reasoning capabilities of ChatGPT in chemistry education (rsc.li/4sycGPT). Rather than focusing on routine factual recall, the study deliberately examined how AI systems respond to chemistry questions that require multi-step reasoning and conceptual understanding.

The researchers were particularly interested in whether ChatGPT can support meaningful learning in chemistry or whether its tendency to generate fluent, authoritative-sounding responses masks deeper weaknesses that could mislead students.

AI responses: confidently wrong

The study used a structured set of textbook-style exercises spanning acids and bases, atomic structure, chemical bonds, chemical reactions and instrumental laboratory techniques. However, the prompts were adversarial, meaning they were modified to include incorrect assumptions, misconceptions or incomplete information, such as chemically impossible calculations or misidentified chemical entities. Some prompts deliberately challenged ChatGPT to refine its answers in ambiguous or unsolvable scenarios.

All of the responses were reviewed by a chemistry expert, who evaluated whether the prompt successfully elicited an incorrect or misleading response. These prompts were classified to reveal recurring patterns in the models’ reasoning failures.

More than half of the adversarial prompts successfully exposed weaknesses or failures in ChatGPT’s responses, and these took several forms: some failures arose from built-in misconceptions; others appeared when the model was pushed to extend or refine its explanations. Many involved incomplete corrections. This raises concerns about students using ChatGPT uncritically.

Performance also varied markedly by topic. Adversarial prompts were most successful in areas such as chemical reactions and instrumental laboratory techniques. In contrast, topics like atomic structure proved more robust, with no successful adversarial attacks in the small sample tested. Acids and bases occupied a middle ground, with a mix of successful and unsuccessful attacks. 

A particularly important finding was the models’ tendency to maintain confidence even when wrong. Incorrect answers were often delivered with the same authoritative tone as correct ones, making it difficult for learners to distinguish reliable explanations from flawed reasoning. This raises concerns about uncritical student use, especially in unsupervised settings.

This study shows that GenAI can support learning in chemistry when students are explicitly taught to interrogate, rather than accept, its outputs.

Fraser Scott

S-Ş Uçar, I Lopez-Gazpio and J Lopez-Gazpio, Educ. Inf. Technol., 2025, doi.org/10.1007/s10639-024-13295-6

Teaching Tips

  • Practise using GenAI tools so you can familiarise yourself with them.
  • Use AI to support metacognition to surface misconceptions, not to provide model answers (rsc.li/4bHFDmP).
  • Encourage students to validate GenAI responses against trusted resources such as textbooks, mark schemes, data books or experimental evidence.
  • Design tasks that focus on reasoning by prompting students to explain why a response is correct or incorrect, rather than whether GenAI arrived at the ‘right’ answer.
  • Ask students to submit closely related prompts, then compare responses to see how small changes in question phrasing can influence AI output.
  • Discuss with students how confident language can mask conceptual errors, particularly in multi-step chemical reasoning.