Does AI have a role in the grading of student work?

My last post focused on the pros and cons of using generative AI for writing assessment tasks and marking criteria. This time, I'd like to explore another significant dimension of AI in assessment: its role in grading student work.

Research shows teachers are increasingly using AI tools to grade both low- and high-stakes assessments (Schwartz, 2025). As Flodén (2025) observes, we continue to see new AI tools emerge that are specifically designed for grading student work. With growing AI adoption in schools, the use of AI in the grading of student work is likely to accelerate.

However, the integration of AI into grading is not without its challenges. While generative AI has the potential to assist in grading students' work, its reliability compared with human grading varies considerably. Ethical considerations, hallucinations and potential biases are significant concerns, as AI can perpetuate bias depending on the nature of its training data (Liu & Bridgeman, 2023). This post examines the suitability of these tools across a range of assessment contexts in secondary and tertiary education, highlighting both the opportunities and the limitations of AI-assisted grading.

Current research on AI grading capabilities

The emerging research on AI grading presents a complex picture. AI systems demonstrate variable reliability across different assessment contexts, with several notable findings:

Short answer responses show promising results. Henkel et al. (2023) found that GPT-4, with minimal prompt engineering, achieves grading performance nearly equivalent to expert human raters. Their study using reading comprehension questions from students in Ghana showed impressive performance metrics, suggesting potential for formative literacy assessment tasks.

For essays, the picture is more complex. Seßler et al. (2025) analysed the performance of various LLMs in evaluating student essays according to ten pre-defined criteria. Their findings revealed that:

  • Closed-source models (particularly GPT-4 and o1) outperformed open-source alternatives
  • The o1 model showed the strongest correlation with teacher assessments in overall scoring (Spearman's r = .74)
  • AI performed better on language-related criteria than content evaluation
  • AI systems, when compared with teacher judgements, tended to be more generous with overall marks
  • Open-source systems (LLaMA 3 and Mixtral) showed minimal correlation with teacher ratings

For university exams, Flodén (2025) found that without detailed grading instructions, ChatGPT could generate scores that appeared similar to human grading at first glance. However, deeper analysis revealed that (see the sketch after this list for how such agreement metrics can be computed):

  • 70% of AI gradings were within 10% of teachers' gradings, but only 31% within 5%
  • The agreement on exact grades was only 30%
  • AI tended to produce a tighter range of scores than human evaluators
  • AI struggled most with questions closely tied to specific course content
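
To make these agreement metrics concrete, here is a minimal sketch of how AI and teacher grades might be compared. The scores below are invented placeholders rather than data from either study; the point is simply how Spearman's correlation, tolerance bands and exact-match rates are calculated in this kind of analysis.

```python
# Comparing hypothetical AI grades with teacher grades using the three
# agreement metrics reported above. All scores are invented placeholders.
from scipy.stats import spearmanr

teacher_scores = [72, 58, 85, 64, 90, 45, 77, 69]  # hypothetical marks out of 100
ai_scores = [75, 60, 82, 70, 88, 52, 77, 66]

# Rank-order agreement (Seßler et al. report Spearman's r for overall scores)
rho, _ = spearmanr(teacher_scores, ai_scores)

# Tolerance-based agreement (Flodén reports the shares within 10% and 5%);
# on a 100-point scale, the tolerance is expressed in marks.
def share_within(tolerance: int) -> float:
    hits = sum(abs(a - t) <= tolerance for a, t in zip(ai_scores, teacher_scores))
    return hits / len(teacher_scores)

exact = sum(a == t for a, t in zip(ai_scores, teacher_scores)) / len(teacher_scores)

print(f"Spearman's r: {rho:.2f}")
print(f"Within 10 marks: {share_within(10):.0%}, within 5 marks: {share_within(5):.0%}")
print(f"Exact agreement: {exact:.0%}")
```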

Why educators should maintain the final say

These findings highlight the importance of educators having the final say when assessing student work, especially when AI has been used in the evaluation process.  

There are several important reasons for this:

1. Maintaining pedagogical insights

The grading process provides valuable insights that can inform future teaching decisions and personalised support strategies. By letting AI handle all analysis of student work, educators miss the opportunity to deeply understand students' strengths and weaknesses.

2. Technical limitations of AI

Large Language Models function as pattern-matching algorithms that generate text by tokenising input and predicting likely word sequences based on statistical patterns learned during training. This fundamental design can lead to "hallucinations" - plausible but factually incorrect information about student responses.

These inaccuracies stem from several sources: biased or insufficient training data in specialised domains, the probabilistic nature of text prediction without genuine comprehension, and malfunctioning attention mechanisms within neural networks that incorrectly associate information.
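
To make this concrete, here is a toy sketch of next-token prediction. The vocabulary and logits are invented for illustration; the mechanism simply converts raw model scores into probabilities and samples from them, with nothing checking whether the chosen word is actually true of the student's response.

```python
# Toy next-token prediction: softmax over invented logits, then sampling.
# Nothing in this mechanism verifies factual accuracy - a plausible but
# wrong token can still be selected, which is one route to "hallucination".
import numpy as np

vocab = ["correct", "incorrect", "excellent", "banana"]
logits = np.array([2.1, 1.9, 1.2, -3.0])  # raw scores the model assigns each token

# Softmax converts scores into a probability distribution
probs = np.exp(logits) / np.exp(logits).sum()

# Sampling is probabilistic, not comprehension-based
rng = np.random.default_rng(0)
next_token = rng.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```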

These hallucinations have significant implications for the assessment of student work.

3. The potential for biased evaluations

AI can perpetuate bias and potentially cause harm through misinterpretation due to limitations in its training data. As Tonmoy et al. (2024, p. 1) explain:

"Unlike traditional AI systems focused on limited tasks, LLMs have been exposed to vast amounts of online text data during training. While this allows them to display impressive language fluency, it also means they are capable of extrapolating information from the biases in training data, misinterpreting ambiguous prompts, or modifying the information to align superficially with the input. This becomes hugely alarming when we rely on language generation capabilities for sensitive applications, such as summarizing medical records, customer support conversations, financial analysis reports, and providing erroneous legal advice. Small errors could lead to harm, revealing the LLMs' lack of actual comprehension despite advances in self-learning..."

Research has yet to fully demonstrate AI's capability to reliably assess large, diverse student populations across various task types and content domains.

4. Content vs. form assessment discrepancies

Research by Wetzler et al. (2024) in higher education found that AI was better at assessing language and structure than discipline-specific content. Seßler et al. (2025) confirmed this pattern, noting stronger AI performance on grammar and structural elements than on content mastery.

5. Proportional bias

Wetzler et al. (2024) also identified that AI can be more lenient at lower performance levels and stricter at higher levels, creating a proportional bias that could disadvantage high-achieving students.

Features of effective AI grading systems

The literature indicates that achieving high accuracy and reliability in AI assessment can be challenging. However, there is emerging consensus about the features of effective AI grading tools. These features include:

  • Comprehensive system instructions - AI assistants used for grading need well-structured system instructions that include: Role, Context, Instructions, Criteria, and Examples (RCICE). This structure ensures the AI understands the assessment context and requirements (see the sketch after this list).
  • Multiple-shot prompting - Providing the AI with multiple examples within a single prompt guides its output more effectively than one-shot or zero-shot prompting. This approach enhances the model's ability to generate appropriate evaluations by providing broader understanding of what sample student responses might look like.
  • Human-in-the-loop evaluation - Given the rapid advancement in AI technologies, regular quality assurance is essential. This means: 1) developing rubrics for each AI-generated output; 2) teachers scoring AI outputs against these rubrics; and 3) feeding this information back to improve the AI grading tool.
  • Knowledge grounding - AI tools need contextual knowledge about the educational domain and discipline (often implemented through Retrieval Augmented Generation - RAG - systems).
  • Well-defined rubrics - AI grading needs to be aligned with clear assessment criteria to ensure consistent results.
  • Discipline-specific calibration - AI grading systems require separate validation for different disciplines and question types.
  • Educator AI literacy - Educators need sufficient understanding of AI to properly evaluate whether grading systems meet the standards of best practice outlined above.
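
As a concrete illustration of the first two features, here is a minimal sketch of an RCICE-structured system prompt with multiple-shot examples, using the OpenAI chat API as one possible backend. The model name, rubric and worked examples are placeholders rather than a validated setup, and the output should be treated as a draft for teacher moderation, not a final grade.

```python
# A minimal sketch of an RCICE-structured grading assistant with
# multiple-shot examples. Model, rubric and examples are placeholders.
from openai import OpenAI

SYSTEM_PROMPT = """\
Role: You are an experienced Year 10 English teacher marking short responses.
Context: Students answered a comprehension question on the class novel.
Instructions: Score each response out of 5 against the criteria, then briefly justify the score.
Criteria: accuracy of content (0-2); use of textual evidence (0-2); clarity of expression (0-1).
Examples:
  Response: "The storm mirrors the narrator's grief." -> 5/5 (accurate, evidenced, clearly expressed)
  Response: "It was windy." -> 1/5 (literal reading only, no evidence)
"""

def draft_grade(response_text: str) -> str:
    """Return a draft grade and justification for teacher moderation."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Response: {response_text}"},
        ],
        temperature=0,  # reduce run-to-run variation in scores
    )
    return completion.choices[0].message.content

# Human-in-the-loop: the teacher reviews and moderates this draft before
# any grade is recorded.
print(draft_grade("The recurring fog suggests the town is hiding its past."))
```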

For further insights into establishing robust AI-driven grading systems at scale, I recommend the AWS blog post detailing Benchmark Education's development process for their AI tool designed to assess open-ended student responses.

The path forward

It is my firm belief that educators should have the final say when assessing students' work, especially when AI has supported the evaluation process. AI should be viewed as an assistant rather than a replacement for teacher judgement - providing initial assessments, consistency checks, or second opinions that inform, but don't determine, final grades.

As Li et al. (2024) suggest in their implications for practice, assessors might consider adapting the technology as a grading aid within a human-in-the-loop process, using AI to increase consistency while allowing teachers to moderate and refine the feedback.

The educational value of assessment extends beyond simply assigning grades. The process of reviewing student work provides teachers with crucial insights into learning progress, misconceptions, and areas where instruction needs adjustment. Delegating this entirely to AI would sacrifice a vital feedback loop in teaching and learning.

Research clearly demonstrates both the potential and limitations of AI in assessment. To maximise efficiency benefits, educators should strategically deploy AI marking tools alongside teacher-led assessment practices, whilst consistently evaluating their effectiveness and impact. Through this ongoing monitoring, schools can progressively refine their assessment approaches, ensuring the optimal integration of technological capabilities with teaching expertise.

What's your experience with AI grading tools? Do you see them as valuable assistants or potential replacements for human assessment?

Cheers, Rod


References:

AWS Public Sector Blog Team. (n.d.). Benchmark Education accelerates grading and boosts student feedback with generative AI on AWS. AWS Public Sector Blog. Retrieved April 10, 2025, from https://aws.amazon.com/blogs/publicsector/benchmark-education-accelerates-grading-and-boosts-student-feedback-with-generative-ai-on-aws/

Flodén, J. (2025). Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT. British Educational Research Journal, 51(1), 201-224.

Henkel, O., Hills, L., Roberts, B., & McGrane, J. (2023). Can LLMs Grade Short-Answer Reading Comprehension Questions: An Empirical Study with a Novel Dataset. arXiv preprint arXiv:2310.18373.

Li, J., Jangamreddy, N. K., Hisamoto, R., Bhansali, R., Dyda, A., Zaphir, L., & Glencross, M. (2024). AI-assisted marking: Functionality and limitations of ChatGPT in written assessment evaluation. Australasian Journal of Educational Technology, 40(4), 56-72.

Liu, D., & Bridgeman, A. (2023). Should we use generative artificial intelligence tools for marking and feedback? Teaching@Sydney. Retrieved from https://educational-innovation.sydney.edu.au/teaching@sydney/should-we-use-generative-artificial-intelligence-tools-for-marking-and-feedback/

Schwartz, J. (2025, February). Is it ethical to use AI to grade? Education Week. https://www.edweek.org/technology/is-it-ethical-to-use-ai-to-grade/2025/02

Seßler, K., Fürstenberg, M., Bühler, B., & Kasneci, E. (2025, March). Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring. In Proceedings of the 15th International Learning Analytics and Knowledge Conference (pp. 462-472).

Tonmoy, S. M., Zaman, S. M., Jain, V., Rani, A., Rawte, V., Chadha, A., & Das, A. (2024). A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313.

Wetzler, E. L., Cassidy, K. S., Jones, M. J., Frazier, C. R., Korbut, N. A., Sims, C. M., ... & Wood, M. (2024). Grading the Graders: Comparing Generative AI and Human Assessment in Essay Evaluation. Teaching of Psychology, 00986283241282696.

#AIinEducation #StudentAssessment #EducationalTechnology #AIgrading #TeacherWorkload

Natalie Patterson

Principal Coastal Engineer, Market Lead for Coastal and Waterfront Development at Haskoning


Thoughtful post, thanks Dr Rod

Ryan James Purdy

AI Governance & Compliance | Author and Advisor | Helping Senior Leaders Turn AI Policy into Classroom Practice


It's an interesting paradox: on the one hand, we want to streamline curriculum and rubrics to make it easier for AI to mark, thus freeing up more time for teachers. However, I'd argue that free time should be used to develop, implement, and evaluate AI-resistant pedagogy.

William Poutu

AI Agentic Systems Architect & AI Impact Strategist


We could eventually design a robust AI grading system capable of 98% accuracy in controlled contexts—especially if paired with curriculum reforms that embrace clear rubrics, competency-based education, and data-driven feedback. But for now, AI should be the assistant, not the final authority, especially in holistic learning environments.
