The evaluator effect: Reliability challenges in AI-assisted assessment for business and educational technology

Francis Melaragni; Phyllis Baron

doi:10.30574/wjaets.2026.18.3.0182

Francis Melaragni* and Phyllis Baron

MCPHS University, 179 Longwood Ave., Boston, MA 02115

Research Article

World Journal of Advanced Engineering Technology and Sciences, 2026, 18(03), 497-514

DOI url: https://doi.org/10.30574/wjaets.2026.18.3.0182

Publication history

Received on 12 February 2026; revised on 25 March 2026; accepted on 27 March 2026

Abstract

As educational institutions increasingly adopt large language models (LLMs) for automated assessment, grading support, and instructional feedback, fundamental questions emerge about the reliability of these systems. This study investigates whether LLM evaluators reach consistent conclusions when assessing identical student-like outputs, a critical validity concern for evidence-based educational practice. Five leading LLMs generated responses to seven domain-diverse prompts simulating typical educational tasks. Two LLM evaluators (Claude Sonnet 4 and ChatGPT-4o) independently assessed each response using standardized rubric criteria common in higher education. Results revealed systematic instability: Claude Sonnet 4's share of criterion wins ranged from 22.9% to 74.3% depending solely on which AI system conducted the evaluation, a swing of 51.4 percentage points. This evaluator-dependent variance produced near-complete rank reversals, raising serious questions about the validity of AI-assisted assessment in educational contexts. The findings have immediate implications for (1) faculty adopting AI grading tools, (2) institutions procuring educational technology, (3) researchers using AI evaluators in learning analytics, and (4) administrators developing AI governance policies. We argue that educational technology stakeholders must implement multi-evaluator validation protocols, transparent reliability reporting, and careful pilot testing before deploying AI assessment systems. This study contributes to the growing literature on responsible AI integration in teaching and learning while highlighting both opportunities and risks for educational innovation.
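The swing reported in the abstract can be checked with a few lines of Python. This is only an illustrative sketch, not the authors' analysis code: the function name `evaluator_swing` and the dictionary keys are hypothetical, and because the abstract does not say which evaluator produced which figure, the two shares are given placeholder labels.

```python
def evaluator_swing(win_share_by_evaluator):
    """Return the spread (max minus min), in percentage points, of one
    model's criterion-win share across the evaluators that scored it."""
    shares = list(win_share_by_evaluator.values())
    return round(max(shares) - min(shares), 1)


# Win shares for Claude Sonnet 4's responses as reported in the abstract.
# The keys are placeholders (assumption): the abstract does not state
# which evaluator assigned which share.
claude_sonnet_4_shares = {"evaluator_a": 22.9, "evaluator_b": 74.3}

swing = evaluator_swing(claude_sonnet_4_shares)
print(f"Evaluator-dependent swing: {swing} percentage points")
```

A multi-evaluator validation protocol of the kind the abstract recommends could report this spread for every assessed system, so that a large value flags rankings that depend on the choice of evaluator.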

Keywords

Educational Technology Assessment; AI Reliability; Automated Grading; Learning Management Systems; Educational Technology Evaluation; Pedagogical Technology; Faculty Development

How to cite this article

Francis Melaragni and Phyllis Baron. The evaluator effect: Reliability challenges in AI-assisted assessment for business and educational technology. World Journal of Advanced Engineering Technology and Sciences, 2026, 18(03), 497-514. Article DOI: https://doi.org/10.30574/wjaets.2026.18.3.0182
