Does AI have a role in the grading of student work?
My last post focused on the pros and cons of using generative AI for writing assessment tasks and marking criteria. This time, I'd like to explore another significant dimension of AI in assessment: its role in grading student work.
Research shows teachers are increasingly using AI tools to grade both low- and high-stakes assessments (Schwartz, 2025). As Flodén (2025) observes, new AI tools designed specifically for grading student work continue to emerge. With growing AI adoption in schools, this trend is likely to accelerate.
However, the integration of AI into grading is not without its challenges. While generative AI can assist in grading students' work, its reliability relative to human grading varies considerably. Ethical considerations, hallucinations and potential biases are significant concerns, as AI can perpetuate bias depending on the nature of its training data (Liu & Bridgeman, 2023). This post examines the suitability of these tools across a range of assessment contexts in secondary and tertiary education, highlighting both the opportunities and the limitations of AI-assisted grading.
Current research on AI grading capabilities
The emerging research on AI grading presents a complex picture. AI systems demonstrate variable reliability across different assessment contexts, with several notable findings:
Short answer responses show promising results. Henkel et al. (2023) found that GPT-4, with minimal prompt engineering, achieves grading performance nearly equivalent to expert human raters. Their study using reading comprehension questions from students in Ghana showed impressive performance metrics, suggesting potential for formative literacy assessment tasks.
For essays, the picture is more complex. Seßler et al. (2025) analysed the performance of various LLMs in evaluating student essays against ten pre-defined criteria. Their findings revealed that the models scored surface features such as grammar and structure more reliably than content-related criteria.
For university exams, Flodén (2025) found that without detailed grading instructions, ChatGPT could generate scores that appeared similar to human grading at first glance. However, deeper analysis revealed notable discrepancies between the AI-generated and human-assigned grades.
Why educators should maintain the final say
These findings highlight the importance of educators having the final say when assessing student work, especially when AI has been used in the evaluation process.
There are several important reasons for this:
1. Maintaining pedagogical insights
The grading process provides valuable insights that can inform future teaching decisions and personalised support strategies. By letting AI handle all analysis of student work, educators miss the opportunity to deeply understand students' strengths and weaknesses.
2. Technical limitations of AI
Large Language Models function as pattern-matching algorithms that generate text by tokenising input and predicting likely word sequences based on statistical patterns learned during training. This fundamental design can lead to "hallucinations" - plausible but factually incorrect information about student responses.
These inaccuracies stem from several sources: biased or insufficient training data in specialised domains, the probabilistic nature of text prediction without genuine comprehension, and attention mechanisms that incorrectly associate unrelated pieces of information.
These hallucinations have significant implications for the assessment of student work.
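The pattern-matching behaviour described above can be illustrated with a toy next-word predictor. This is a hypothetical sketch, not a real LLM: it simply returns the statistically most frequent continuation seen in its training text, with no notion of whether that continuation is true of the student's actual work.

```python
from collections import Counter, defaultdict

# Tiny "training corpus" of prior grading comments (invented for illustration).
corpus = (
    "the student answered the question correctly . "
    "the student answered the question incorrectly . "
    "the student answered the question correctly ."
).split()

# Count which word follows which (a bigram model - the crudest form
# of the statistical pattern-matching LLMs perform at scale).
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def predict(word: str) -> str:
    # Return the most frequent follower seen in training data,
    # regardless of whether it is accurate in the current case.
    return transitions[word].most_common(1)[0][0]

print(predict("question"))  # "correctly" - chosen by frequency, not by checking the work
```

The model answers "correctly" because that continuation was most common in its data, which is exactly how an LLM can produce a plausible but wrong claim about a student response.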
3. The potential for biased evaluations
AI can perpetuate bias and potentially cause harm through misinterpretation due to limitations in its training data. As Tonmoy et al. (2024, p. 1) explain:
"Unlike traditional AI systems focused on limited tasks, LLMs have been exposed to vast amounts of online text data during training. While this allows them to display impressive language fluency, it also means they are capable of extrapolating information from the biases in training data, misinterpreting ambiguous prompts, or modifying the information to align superficially with the input. This becomes hugely alarming when we rely on language generation capabilities for sensitive applications, such as summarizing medical records, customer support conversations, financial analysis reports, and providing erroneous legal advice. Small errors could lead to harm, revealing the LLMs' lack of actual comprehension despite advances in self-learning..."
Research has yet to fully demonstrate AI's capability to reliably assess large, diverse student populations across various task types and content domains.
4. Content vs. form assessment discrepancies
Research by Wetzler et al. (2024) in higher education found that AI was better at assessing language and structure than discipline-specific content. Seßler et al. (2025) confirmed this pattern, noting stronger AI performance on grammar and structural elements than on content mastery.
5. Proportional bias
Wetzler et al. (2024) also identified that AI can be more lenient at lower performance levels and stricter at higher levels, creating a proportional bias that could disadvantage high-achieving students.
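One practical way to surface this kind of proportional bias is to compare AI and human scores band by band. The sketch below uses invented (human, AI) score pairs purely for illustration; a positive mean gap in the low band and a negative one in the high band reproduces the lenient-then-strict pattern Wetzler et al. describe:

```python
# Hypothetical (human_score, ai_score) pairs on a 10-point scale.
pairs = [(3, 4), (4, 5), (5, 5), (8, 7), (9, 8), (10, 8)]

def mean_gap(score_pairs):
    # Average of (AI - human); > 0 means the AI is more lenient.
    return sum(ai - human for human, ai in score_pairs) / len(score_pairs)

low = [(h, a) for h, a in pairs if h <= 5]   # weaker responses
high = [(h, a) for h, a in pairs if h > 5]   # stronger responses

print(f"low band gap: {mean_gap(low):+.2f}, high band gap: {mean_gap(high):+.2f}")
```

Running this kind of check on a real moderation sample would show whether an AI tool systematically inflates weaker work and deflates stronger work before any grades are released.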
Features of effective AI grading systems
The literature indicates that achieving high accuracy and reliability in AI assessment can be challenging. However, an emerging consensus is forming around what makes an AI grading tool effective.
For further insights into establishing robust AI-driven grading systems at scale, I recommend the AWS Blog Post that details Benchmark Education's development process for their AI tool designed to assess open-ended student responses.
The path forward
It is my firm belief that educators should have the final say when assessing students' work, especially when AI has supported the evaluation process. AI should be viewed as an assistant rather than a replacement for teacher judgement - providing initial assessments, consistency checks, or second opinions that inform, but don't determine, final grades.
As Li et al. (2024) suggest in their implications for practice, assessors might consider adapting the technology as a grading aid within a human-in-the-loop process, using AI to increase consistency while allowing teachers to moderate and refine the feedback.
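A human-in-the-loop workflow of the kind Li et al. (2024) describe can be sketched in a few lines. Everything here is hypothetical: the `ai_suggest` placeholder, the rubric string and the scores are assumptions for illustration, not a specific product's API. The point of the structure is that no grade is finalised until a teacher confirms or overrides the AI draft.

```python
from dataclasses import dataclass

@dataclass
class Grade:
    score: int
    rationale: str
    finalised_by: str

def ai_suggest(response: str, rubric: str) -> Grade:
    # Placeholder for a call to any LLM grading tool; a real
    # implementation would send the response and rubric to a model.
    return Grade(score=7, rationale="Addresses most rubric criteria.",
                 finalised_by="ai-draft")

def human_in_the_loop(response: str, rubric: str, teacher_review) -> Grade:
    draft = ai_suggest(response, rubric)
    # The teacher always has the final say on the draft grade.
    return teacher_review(draft)

final = human_in_the_loop(
    "student essay text", "ten-criterion rubric",
    lambda d: Grade(d.score - 1, d.rationale + " Teacher adjusted.", "teacher"),
)
print(final.finalised_by)  # "teacher"
```

In this arrangement the AI supplies consistency and a first pass, while the recorded grade always carries a teacher's sign-off, which is the safeguard the research above argues for.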
The educational value of assessment extends beyond simply assigning grades. The process of reviewing student work provides teachers with crucial insights into learning progress, misconceptions, and areas where instruction needs adjustment. Delegating this entirely to AI would sacrifice a vital feedback loop in teaching and learning.
Research clearly demonstrates both the potential and limitations of AI in assessment. To maximise efficiency benefits, educators should strategically deploy AI marking tools alongside teacher-led assessment practices, whilst consistently evaluating their effectiveness and impact. Through this ongoing monitoring, schools can progressively refine their assessment approaches, ensuring the optimal integration of technological capabilities with teaching expertise.
What's your experience with AI grading tools? Do you see them as valuable assistants or potential replacements for human assessment?
Cheers, Rod
References:
AWS Public Sector Blog Team. (n.d.). Benchmark Education accelerates grading and boosts student feedback with generative AI on AWS. AWS Public Sector Blog. Retrieved April 10, 2025, from https://aws.amazon.com/blogs/publicsector/benchmark-education-accelerates-grading-and-boosts-student-feedback-with-generative-ai-on-aws/
Flodén, J. (2025). Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT. British Educational Research Journal, 51(1), 201-224.
Henkel, O., Hills, L., Roberts, B., & McGrane, J. (2023). Can LLMs Grade Short-Answer Reading Comprehension Questions: An Empirical Study with a Novel Dataset. arXiv preprint arXiv:2310.18373.
Li, J., Jangamreddy, N. K., Hisamoto, R., Bhansali, R., Dyda, A., Zaphir, L., & Glencross, M. (2024). AI-assisted marking: Functionality and limitations of ChatGPT in written assessment evaluation. Australasian Journal of Educational Technology, 40(4), 56-72.
Liu, D., & Bridgeman, A. (2023). Should we use generative artificial intelligence tools for marking and feedback? Teaching@Sydney. Retrieved from https://educational-innovation.sydney.edu.au/teaching@sydney/should-we-use-generative-artificial-intelligence-tools-for-marking-and-feedback/
Seßler, K., Fürstenberg, M., Bühler, B., & Kasneci, E. (2025, March). Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring. In Proceedings of the 15th International Learning Analytics and Knowledge Conference (pp. 462-472).
Schwartz, J. (2025, February). Is it ethical to use AI to grade? Education Week. https://www.edweek.org/technology/is-it-ethical-to-use-ai-to-grade/2025/02
Tonmoy, S. M., Zaman, S. M., Jain, V., Rani, A., Rawte, V., Chadha, A., & Das, A. (2024). A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313, 6.
Wetzler, E. L., Cassidy, K. S., Jones, M. J., Frazier, C. R., Korbut, N. A., Sims, C. M., ... & Wood, M. (2024). Grading the Graders: Comparing Generative AI and Human Assessment in Essay Evaluation. Teaching of Psychology, 00986283241282696.
#AIinEducation #StudentAssessment #EducationalTechnology #AIgrading #TeacherWorkload