Fine-tuned GPT-3.5 Performance for Explanatory Feedback
Table of Links
Abstract and 1 Introduction
2. Background
2.1 Effective Tutoring Practice
2.2 Feedback for Tutor Training
2.3 Sequence Labeling for Feedback Generation
2.4 Large Language Models in Education
3. Method
3.1 Dataset and 3.2 Sequence Labeling
3.3 GPT Facilitated Sequence Labeling
3.4 Metrics
4. Results
4.1 Results on RQ1
4.2 Results on RQ2
5. Discussion
6. Limitation and Future Works
7. Conclusion
8. Acknowledgments
9. References
APPENDIX
A. Lesson Principles
B. Input for Fine-Tuning GPT-3.5
C. Scatter Matrix of the Correlation on the Outcome-based Praise
D. Detailed Results of Fine-Tuned GPT-3.5 Model’s Performance
4.2 Results on RQ2
Building on the insights from RQ1, where the M-IoU was established as a viable proxy for assessing the quality of text highlighted by GPT models, we examined the potential of fine-tuning the GPT-3.5 model to improve its performance in identifying praise within tutor responses. Notably, our ability to fine-tune the GPT-4 model was constrained by access limitations. Consequently, our efforts concentrated on the GPT-3.5 model, whose performance is depicted in Figure 4; detailed results are reported in Appendix D.
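For readers interested in reproducing this setup, the sketch below illustrates one plausible way to package annotated tutor responses into OpenAI's chat-format JSONL and launch a GPT-3.5 fine-tuning job. The system instruction, the inline tag scheme, the file names, and the sample text are hypothetical placeholders, not the exact input described in Appendix B.

```python
# Minimal sketch of preparing fine-tuning data for GPT-3.5 (chat-format JSONL)
# and submitting a fine-tuning job with the OpenAI Python SDK (v1+).
# The prompt wording and <effort>/<outcome> tags are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_example(tutor_response: str, highlighted_response: str) -> dict:
    """Pack one annotated tutor response into a chat-format training example."""
    return {
        "messages": [
            {"role": "system",
             "content": "Highlight effort-based and outcome-based praise in the tutor's response."},
            {"role": "user", "content": tutor_response},
            {"role": "assistant", "content": highlighted_response},
        ]
    }

# Hypothetical annotated sample; a real training file needs many more examples.
samples = [
    ("Great job sticking with that problem until you solved it!",
     "<effort>Great job sticking with that problem until you solved it!</effort>"),
]

with open("train.jsonl", "w") as f:
    for tutor_response, highlighted in samples:
        f.write(json.dumps(to_example(tutor_response, highlighted)) + "\n")

training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id)
```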
In Figure 4, we present the model's performance, quantified by averaging the M-IoU scores obtained from five distinct random seeds across various training partitions (training sample sizes from 13 to 65). The error bars provide a visual representation of the variability in model performance, spanning its maximum to its minimum across these partitions. We simulated a low-resource scenario, characterized by a limited training dataset, and observed that the fine-tuned GPT-3.5 model maintained satisfactory performance under such constraints. Starting with only 13 training samples (10% of the full dataset), the model achieved an M-IoU score of approximately 0.5 for effort-based praise and 0.65 for outcome-based praise, performance on par with that achieved through the prompting method applied to the GPT models. As the training sample size increased, model performance generally improved, with a notable exception for outcome-based praise at 52 training samples. Expanding the dataset to 65 samples yielded an M-IoU score of roughly 0.6 for effort-based praise, surpassing the efficacy of the prompting method. Correspondingly, performance on outcome-based praise reached an M-IoU score of 0.75, approaching the quality of expert annotation, given that an M-IoU of 0.68 corresponds to a human rating score of 0.77.
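The sketch below shows one plausible way the reported numbers could be aggregated: a token-level intersection-over-union between predicted and gold highlighted spans (an approximation of the M-IoU defined in Section 3.4, not necessarily its exact formulation), averaged over five random seeds per training-set size, with the max-to-min spread across seeds used for error bars. All scores and token indices are hypothetical.

```python
# Aggregating span-highlighting quality across random seeds:
# token-level IoU per response, mean per seed, then mean/min/max across seeds.
from statistics import mean

def token_iou(pred_tokens: set[int], gold_tokens: set[int]) -> float:
    """IoU over token indices highlighted as praise; 1.0 when both are empty."""
    if not pred_tokens and not gold_tokens:
        return 1.0
    return len(pred_tokens & gold_tokens) / len(pred_tokens | gold_tokens)

def aggregate(scores_by_seed: dict[int, list[float]]) -> tuple[float, float, float]:
    """Average per-seed mean scores and report the spread across seeds."""
    per_seed_means = [mean(scores) for scores in scores_by_seed.values()]
    return mean(per_seed_means), min(per_seed_means), max(per_seed_means)

# Hypothetical gold annotations and per-seed predictions for two test responses,
# each represented as a set of highlighted token indices.
gold = [{0, 1, 2, 3}, {5, 6, 7}]
preds_by_seed = {
    0: [{0, 1, 2}, {5, 6}],
    1: [{0, 1, 2, 3}, {6, 7}],
    2: [{1, 2, 3}, {5, 6, 7}],
    3: [{0, 1}, {5, 6, 7, 8}],
    4: [{0, 1, 2, 3, 4}, {5, 7}],
}
scores_by_seed = {
    seed: [token_iou(p, g) for p, g in zip(preds, gold)]
    for seed, preds in preds_by_seed.items()
}
avg, lo, hi = aggregate(scores_by_seed)
print(f"mean M-IoU {avg:.2f} (range {lo:.2f}-{hi:.2f})")
```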
Motivated by these promising outcomes, we adopted the model with the best performance in highlighting effort-based praise as the foundation for our automated feedback system within tutor training programs. This decision is underpinned by the pivotal role of effort-based praise in educational feedback: it is the essence of effective praise, determining the appropriateness and impact of the tutor's feedback on student motivation and learning. The ability to accurately identify and underscore effort-based praise in tutors' responses is therefore crucial for improving the quality of educational feedback. In support of this initiative, a demo of our automated explanatory feedback system is accessible via the provided link[3], showcasing the application's potential to transform tutor training by emphasizing the significance of effort-based praise.
[3] The demo of our automated explanatory feedback can be found here https://edm24-effort-outcome.vercel.app/