#135: Small vs. Large Language Models: Comparing Language Models in Generating Productive Failure Math Problems
Large language models (LLMs) such as GPT-4o can generate high-quality math problems but are resource-intensive and difficult to fine-tune. Small language models (SLMs) like Phi-3 may provide a more practical alternative, as they can run locally and be adapted for specific tasks. In this study, we generated 60 Productive Failure (PF) math problems, half with GPT-4o and half with Phi-3. To ensure comparability, we controlled for grade level, Common Core standard, theme, interest, and prior knowledge. Trained raters evaluated the problems across five PF dimensions. GPT-4o (M = 13.38; SD = 1.88) outperformed Phi-3 (M = 10.67; SD = 2.73), with a large effect size (d = 1.16). These findings indicate that although GPT-4o currently generates more effective PF problems, Phi-3 shows promise. With further fine-tuning and task-specific adaptation, SLMs like Phi-3 may offer a sustainable, cost-effective solution for educational problem generation.
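The reported effect size can be checked from the summary statistics alone. The sketch below is a minimal illustration, not the authors' analysis code. It assumes Cohen's d was computed with the pooled standard deviation of two independent groups, and that each model contributed 30 rated problems (inferred from 60 problems split in half). The function name `cohens_d` is hypothetical.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# Means and SDs are taken from the abstract.
# ASSUMPTION: n = 30 per model (60 problems, half per model, all of them rated).
d = cohens_d(13.38, 1.88, 30, 10.67, 2.73, 30)
print(round(d, 2))  # 1.16
```

Under these assumptions the result matches the reported d = 1.16, which is above the conventional 0.8 threshold for a large effect.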
Speakers
- Seyedahmad Rahimi — Georgia Tech
Authors
Seyedahmad Rahimi, Salah Esmaeiligoujar, Deniz Ercan, Maryam Babaee, Ran Gao