MCPHS University, 179 Longwood Ave. Boston, MA 02115
World Journal of Advanced Engineering Technology and Sciences, 2026, 18(03), 497-514
Article DOI: 10.30574/wjaets.2026.18.3.0182
Received on 12 February 2026; revised on 25 March 2026; accepted on 27 March 2026
As educational institutions increasingly adopt large language models (LLMs) for automated assessment, grading support, and instructional feedback, fundamental questions emerge about the reliability of these systems. This study investigates whether LLM evaluators reach consistent conclusions when assessing identical student-like outputs, a critical validity concern for evidence-based educational practice. Five leading LLMs generated responses to seven domain-diverse prompts simulating typical educational tasks. Two LLM evaluators (Claude Sonnet 4 and ChatGPT-4o) independently assessed each response using standardized rubric criteria common in higher education. Results revealed systematic instability: Claude Sonnet 4's share of criterion wins varied from 22.9% to 74.3% depending solely on which AI system conducted the evaluation, a swing of 51.4 percentage points. This evaluator-dependent variance produced near-complete rank reversals, raising serious questions about the validity of AI-assisted assessment in educational contexts. The findings have immediate implications for (1) faculty adopting AI grading tools, (2) institutions procuring educational technology, (3) researchers using AI evaluators in learning analytics, and (4) administrators developing AI governance policies. We argue that educational technology stakeholders must implement multi-evaluator validation protocols, transparent reliability reporting, and careful pilot testing before deploying AI assessment systems. This study contributes to the growing literature on responsible AI integration in teaching and learning while highlighting both opportunities and risks for educational innovation.
Educational Technology Assessment; AI Reliability; Automated Grading; Learning Management Systems; Educational Technology Evaluation; Pedagogical Technology; Faculty Development
Francis Melaragni and Phyllis Baron. The evaluator effect: Reliability challenges in AI-assisted assessment for business and educational technology. World Journal of Advanced Engineering Technology and Sciences, 2026, 18(03), 497-514. Article DOI: https://doi.org/10.30574/wjaets.2026.18.3.0182