ICLS Poster

#58: Item Calibration in Reading Assessment with LLMs: Human Alignment and the Masking Effect

Wed Jun 17, 4:15 PM–5:45 PM · Outdoors

Part of ICLS Posters - Wednesday Poster Session (In-Person)

Generative AI & Large Language Models · Assessment, Feedback & Formative Practices · AI in Education

This study evaluates large language models (LLMs) for calibrating item difficulty in grade 8 reading comprehension using 30 NAEP items. LLM-calibrated difficulty estimates were compared with human benchmarks, and masking was used to manipulate the input information available to the models. The LLM results showed no significant differences across human-derived difficulty levels, indicating misalignment with human performance. Masking systematically reduced accuracy as more input text was obscured, particularly for harder items. These findings raise concerns about the validity of LLM-based item calibration.

Speakers

Lingchen Kong — University of Florida

Authors

Lingchen Kong