Does AI have a role in the grading of student work?

My last post focused on the pros and cons of using generative AI for writing assessment tasks and marking criteria. This time, I'd like to explore another significant dimension of AI in assessment: its role in grading student work.

Research shows teachers are increasingly using AI tools to grade both low- and high-stakes assessments (Schwartz, 2025). As Flodén (2025) observes, we continue to see new AI tools emerge that are specifically designed for grading student work. With growing AI adoption in schools, the use of AI in the grading of student work is likely to accelerate.

However, the integration of AI into grading is not without its challenges. While generative AI has the potential to assist in grading students' work, its reliability compared with human grading varies considerably. Ethical considerations, hallucinations and potential biases are significant concerns, as AI can perpetuate bias depending on the nature of its training data (Liu & Bridgeman, 2023). This post examines the suitability of these tools across a range of assessment contexts in secondary and tertiary education, highlighting both the opportunities and the limitations of AI-assisted grading.

Current research on AI grading capabilities

The emerging research on AI grading presents a complex picture. AI systems demonstrate variable reliability across different assessment contexts, with several notable findings:

Short answer responses show promising results. Henkel et al. (2023) found that GPT-4, with minimal prompt engineering, achieves grading performance nearly equivalent to expert human raters. Their study using reading comprehension questions from students in Ghana showed impressive performance metrics, suggesting potential for formative literacy assessment tasks.

For essays, the picture is more complex. Seßler et al. (2025) analysed the performance of various LLMs in evaluating student essays according to ten pre-defined criteria. Their findings revealed that:

  • Closed-source models (particularly GPT-4 and o1) outperformed open-source alternatives
  • The o1 model showed the strongest correlation with teacher assessments in overall scoring (Spearman's r = .74)
  • AI performed better on language-related criteria than content evaluation
  • AI systems, when compared with teacher judgements, tended to be more generous with overall marks
  • Open-source systems (LLaMA 3 and Mixtral) showed minimal correlation with teacher ratings

For university exams, Flodén (2025) found that without detailed grading instructions, ChatGPT could generate scores that appeared similar to human grading at first glance. However, deeper analysis revealed that (see the sketch after this list for how such agreement metrics can be computed):

  • 70% of AI gradings were within 10% of teachers' gradings, but only 31% within 5%
  • The agreement on exact grades was only 30%
  • AI tended to produce a tighter range of scores than human evaluators
  • AI struggled most with questions closely tied to specific course content
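
To make these agreement metrics concrete, here is a minimal sketch of how AI and teacher grades might be compared. The scores below are invented placeholders rather than data from either study; the point is simply how Spearman's correlation, tolerance bands and exact-match rates are calculated in this kind of analysis.

```python
# Comparing hypothetical AI grades with teacher grades using the three
# agreement metrics reported above. All scores are invented placeholders.
from scipy.stats import spearmanr

teacher_scores = [72, 58, 85, 64, 90, 45, 77, 69]  # hypothetical marks out of 100
ai_scores = [75, 60, 82, 70, 88, 52, 77, 66]

# Rank-order agreement (Seßler et al. report Spearman's r for overall scores)
rho, _ = spearmanr(teacher_scores, ai_scores)

# Tolerance-based agreement (Flodén reports the shares within 10% and 5%);
# on a 100-point scale, the tolerance is expressed in marks.
def share_within(tolerance: int) -> float:
    hits = sum(abs(a - t) <= tolerance for a, t in zip(ai_scores, teacher_scores))
    return hits / len(teacher_scores)

exact = sum(a == t for a, t in zip(ai_scores, teacher_scores)) / len(teacher_scores)

print(f"Spearman's r: {rho:.2f}")
print(f"Within 10 marks: {share_within(10):.0%}, within 5 marks: {share_within(5):.0%}")
print(f"Exact agreement: {exact:.0%}")
```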

Why educators should maintain the final say

These findings highlight the importance of educators having the final say when assessing student work, especially when AI has been used in the evaluation process.  

There are several important reasons for this:

1. Maintaining pedagogical insights

The grading process provides valuable insights that can inform future teaching decisions and personalised support strategies. By letting AI handle all analysis of student work, educators miss the opportunity to deeply understand students' strengths and weaknesses.

2. Technical limitations of AI

Large Language Models function as pattern-matching algorithms that generate text by tokenising input and predicting likely word sequences based on statistical patterns learned during training. This fundamental design can lead to "hallucinations" - plausible but factually incorrect information about student responses.

These inaccuracies stem from several sources: biased or insufficient training data in specialised domains, the probabilistic nature of text prediction without genuine comprehension, and malfunctioning attention mechanisms within neural networks that incorrectly associate information.
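
To make this concrete, here is a toy sketch of next-token prediction. The vocabulary and logits are invented for illustration; the mechanism simply converts raw model scores into probabilities and samples from them, with nothing checking whether the chosen word is actually true of the student's response.

```python
# Toy next-token prediction: softmax over invented logits, then sampling.
# Nothing in this mechanism verifies factual accuracy - a plausible but
# wrong token can still be selected, which is one route to "hallucination".
import numpy as np

vocab = ["correct", "incorrect", "excellent", "banana"]
logits = np.array([2.1, 1.9, 1.2, -3.0])  # raw scores the model assigns each token

# Softmax converts scores into a probability distribution
probs = np.exp(logits) / np.exp(logits).sum()

# Sampling is probabilistic, not comprehension-based
rng = np.random.default_rng(0)
next_token = rng.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```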

These hallucinations have significant implications for the assessment of student work.

3. The potential for biased evaluations

AI can perpetuate bias and potentially cause harm through misinterpretation due to limitations in its training data. As Tonmoy et al. (2024, p. 1) explain:

"Unlike traditional AI systems focused on limited tasks, LLMs have been exposed to vast amounts of online text data during training. While this allows them to display impressive language fluency, it also means they are capable of extrapolating information from the biases in training data, misinterpreting ambiguous prompts, or modifying the information to align superficially with the input. This becomes hugely alarming when we rely on language generation capabilities for sensitive applications, such as summarizing medical records, customer support conversations, financial analysis reports, and providing erroneous legal advice. Small errors could lead to harm, revealing the LLMs' lack of actual comprehension despite advances in self-learning..."

Research has yet to fully demonstrate AI's capability to reliably assess large, diverse student populations across various task types and content domains.

4. Content vs. form assessment discrepancies

Research by Wetzler et al. (2024) in higher education found that AI was better at assessing language and structure than discipline-specific content. Seßler et al. (2025) confirmed this pattern, noting stronger AI performance on grammar and structural elements than on content mastery.

5. Proportional bias

Wetzler et al. (2024) also identified that AI can be more lenient at lower performance levels and stricter at higher levels, creating a proportional bias that could disadvantage high-achieving students.

Features of effective AI grading systems

The literature indicates that achieving high accuracy and reliability in AI assessment can be challenging. However, there is emerging consensus about the features of effective AI grading tools. These features include:

  • Comprehensive system instructions - AI assistants used for grading need well-structured system instructions that include: Role, Context, Instructions, Criteria, and Examples (RCICE). This structure ensures the AI understands the assessment context and requirements (see the sketch after this list).
  • Multiple-shot prompting - Providing the AI with multiple examples within a single prompt guides its output more effectively than one-shot or zero-shot prompting. This approach enhances the model's ability to generate appropriate evaluations by providing broader understanding of what sample student responses might look like.
  • Human-in-the-loop evaluation - Given the rapid advancement in AI technologies, regular quality assurance is essential. This means: 1) developing rubrics for each AI-generated output; 2) teachers scoring AI outputs against these rubrics; and 3) feeding this information back to improve the AI grading tool.
  • Knowledge grounding - AI tools need contextual knowledge about the educational domain and discipline (often implemented through Retrieval Augmented Generation - RAG - systems).
  • Well-defined rubrics - AI grading needs to be aligned with clear assessment criteria to ensure consistent results.
  • Discipline-specific calibration - AI grading systems require separate validation for different disciplines and question types.
  • Educator AI literacy - Educators need sufficient understanding of AI to properly evaluate whether grading systems meet the standards of best practice outlined above.
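
As a concrete illustration of the first two features, here is a minimal sketch of an RCICE-structured system prompt with multiple-shot examples, using the OpenAI chat API as one possible backend. The model name, rubric and worked examples are placeholders rather than a validated setup, and the output should be treated as a draft for teacher moderation, not a final grade.

```python
# A minimal sketch of an RCICE-structured grading assistant with
# multiple-shot examples. Model, rubric and examples are placeholders.
from openai import OpenAI

SYSTEM_PROMPT = """\
Role: You are an experienced Year 10 English teacher marking short responses.
Context: Students answered a comprehension question on the class novel.
Instructions: Score each response out of 5 against the criteria, then briefly justify the score.
Criteria: accuracy of content (0-2); use of textual evidence (0-2); clarity of expression (0-1).
Examples:
  Response: "The storm mirrors the narrator's grief." -> 5/5 (accurate, evidenced, clearly expressed)
  Response: "It was windy." -> 1/5 (literal reading only, no evidence)
"""

def draft_grade(response_text: str) -> str:
    """Return a draft grade and justification for teacher moderation."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Response: {response_text}"},
        ],
        temperature=0,  # reduce run-to-run variation in scores
    )
    return completion.choices[0].message.content

# Human-in-the-loop: the teacher reviews and moderates this draft before
# any grade is recorded.
print(draft_grade("The recurring fog suggests the town is hiding its past."))
```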

For further insights into establishing robust AI-driven grading systems at scale, I recommend the AWS blog post detailing Benchmark Education's development process for their AI tool designed to assess open-ended student responses.

The path forward

It is my firm belief that educators should have the final say when assessing students' work, especially when AI has supported the evaluation process. AI should be viewed as an assistant rather than a replacement for teacher judgement - providing initial assessments, consistency checks, or second opinions that inform, but don't determine, final grades.

As Li et al. (2024) suggest in their implications for practice, assessors might consider adapting the technology as a grading aid within a human-in-the-loop process, using AI to increase consistency while allowing teachers to moderate and refine the feedback.

The educational value of assessment extends beyond simply assigning grades. The process of reviewing student work provides teachers with crucial insights into learning progress, misconceptions, and areas where instruction needs adjustment. Delegating this entirely to AI would sacrifice a vital feedback loop in teaching and learning.

Research clearly demonstrates both the potential and limitations of AI in assessment. To maximise efficiency benefits, educators should strategically deploy AI marking tools alongside teacher-led assessment practices, whilst consistently evaluating their effectiveness and impact. Through this ongoing monitoring, schools can progressively refine their assessment approaches, ensuring the optimal integration of technological capabilities with teaching expertise.

What's your experience with AI grading tools? Do you see them as valuable assistants or potential replacements for human assessment?

Cheers, Rod


References:

AWS Public Sector Blog Team. (n.d.). Benchmark Education accelerates grading and boosts student feedback with generative AI on AWS. AWS Public Sector Blog. Retrieved April 10, 2025, from https://aws.amazon.com/blogs/publicsector/benchmark-education-accelerates-grading-and-boosts-student-feedback-with-generative-ai-on-aws/

Flodén, J. (2025). Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT. British Educational Research Journal, 51(1), 201-224.

Henkel, O., Hills, L., Roberts, B., & McGrane, J. (2023). Can LLMs Grade Short-Answer Reading Comprehension Questions: An Empirical Study with a Novel Dataset. arXiv preprint arXiv:2310.18373.

Li, J., Jangamreddy, N. K., Hisamoto, R., Bhansali, R., Dyda, A., Zaphir, L., & Glencross, M. (2024). AI-assisted marking: Functionality and limitations of ChatGPT in written assessment evaluation. Australasian Journal of Educational Technology, 40(4), 56-72.

Liu, D., & Bridgeman, A. (2023). Should we use generative artificial intelligence tools for marking and feedback? Teaching@Sydney. Retrieved from https://educational-innovation.sydney.edu.au/teaching@sydney/should-we-use-generative-artificial-intelligence-tools-for-marking-and-feedback/

Schwartz, J. (2025, February). Is it ethical to use AI to grade? Education Week. https://www.edweek.org/technology/is-it-ethical-to-use-ai-to-grade/2025/02

Seßler, K., Fürstenberg, M., Bühler, B., & Kasneci, E. (2025, March). Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring. In Proceedings of the 15th International Learning Analytics and Knowledge Conference (pp. 462-472).

Tonmoy, S. M., Zaman, S. M., Jain, V., Rani, A., Rawte, V., Chadha, A., & Das, A. (2024). A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313.

Wetzler, E. L., Cassidy, K. S., Jones, M. J., Frazier, C. R., Korbut, N. A., Sims, C. M., ... & Wood, M. (2024). Grading the Graders: Comparing Generative AI and Human Assessment in Essay Evaluation. Teaching of Psychology, 00986283241282696.

#AIinEducation #StudentAssessment #EducationalTechnology #AIgrading #TeacherWorkload

Natalie Patterson

Principal Coastal Engineer, Market Lead for Coastal and Waterfront Development at Haskoning


Thoughtful post, thanks Dr Rod

Ryan James Purdy

AI Governance & Compliance | Author and Advisor | Helping Senior Leaders Turn AI Policy into Classroom Practice


It's an interesting paradox: on the one hand, we want to streamline curriculum and rubrics to make it easier for AI to mark, thus freeing up more time for teachers. However, I'd argue that free time should be used to develop, implement, and evaluate AI-resistant pedagogy.

William Poutu

AI Agentic Systems Architect & AI Impact Strategist


We could eventually design a robust AI grading system capable of 98% accuracy in controlled contexts—especially if paired with curriculum reforms that embrace clear rubrics, competency-based education, and data-driven feedback. But for now, AI should be the assistant, not the final authority, especially in holistic learning environments.
