ISLS 2026
ICLS Poster

#58: Item Calibration in Reading Assessment with LLMs: Human Alignment and the Masking Effect

Wed Jun 17, 4:15 PM–5:45 PM · Outdoors

This study evaluates large language models (LLMs) for calibrating item difficulty in grade 8 reading comprehension, using 30 NAEP items. LLM-calibrated difficulty estimates were compared with human benchmarks, and masking was used to manipulate how much input information the models received. The LLM-calibrated estimates did not differ significantly across human-derived difficulty levels, indicating misalignment with human performance. Masking systematically reduced accuracy, and the reduction was most pronounced for harder items as more of the input text was obscured. These findings raise concerns about the validity of LLM-based item calibration.
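
As an illustration of the masking manipulation mentioned in the abstract, the following is a minimal sketch of one way to obscure a chosen fraction of an input text before passing it to a model. The abstract does not specify the masking procedure, so everything here is an assumption made for illustration: the function name `mask_words`, the word-level granularity, the `[MASK]` placeholder, and the random selection of positions are all hypothetical and not taken from the study.

```python
import random


def mask_words(text, fraction, mask_token="[MASK]", seed=0):
    """Replace about `fraction` of the whitespace-separated words with `mask_token`.

    Hypothetical helper: word-level random masking is an assumption,
    not the procedure used in the study.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    words = text.split()
    n_mask = round(len(words) * fraction)
    # A seeded generator keeps the masked positions reproducible across runs.
    rng = random.Random(seed)
    masked_positions = set(rng.sample(range(len(words)), n_mask))
    return " ".join(
        mask_token if i in masked_positions else word
        for i, word in enumerate(words)
    )


if __name__ == "__main__":
    passage = "The river rose slowly through the night and covered the lower fields"
    # Increasing the fraction obscures more of the input text.
    for fraction in (0.0, 0.25, 0.5):
        print(fraction, mask_words(passage, fraction))
```

Under these assumptions, raising `fraction` corresponds to obscuring more of the input text; a real study might instead mask whole sentences, the passage, or the answer options.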

Speakers

  • Lingchen Kong — University of Florida

Authors

Lingchen Kong