Table of Links
Abstract and 1 Introduction
2. Background
2.1 Effective Tutoring Practice
2.2 Feedback for Tutor Training
2.3 Sequence Labeling for Feedback Generation
2.4 Large Language Models in Education
3. Method
3.1 Dataset and 3.2 Sequence Labeling
3.3 GPT Facilitated Sequence Labeling
3.4 Metrics
4. Results
4.1 Results on RQ1
4.2 Results on RQ2
5. Discussion
6. Limitations and Future Work
7. Conclusion
8. Acknowledgments
9. References
APPENDIX
A. Lesson Principles
B. Input for Fine-Tuning GPT-3.5
C. Scatter Matrix of the Correlation on the Outcome-Based Praise
D. Detailed Results of Fine-Tuned GPT-3.5 Model’s Performance
9. REFERENCES
[1] V. Aleven, O. Popescu, and K. R. Koedinger. Towards tutorial dialog to support self-explanation: Adding natural language understanding to a cognitive tutor. In Proceedings of Artificial Intelligence in Education, pages 246–255, 2001.
[2] J. E. Beck, K.-m. Chang, J. Mostow, and A. Corbett. Does help help? introducing the bayesian evaluation and assessment methodology. In Intelligent Tutoring Systems: 9th International Conference, ITS 2008, Montreal, Canada, June 23-27, 2008 Proceedings 9, pages 383–394. Springer, 2008.
[3] S. Bhat, H. A. Nguyen, S. Moore, J. Stamper, M. Sakr, and E. Nyberg. Towards automated generation and evaluation of questions in educational domains. In Proceedings of the 15th International Conference on Educational Data Mining, volume 701, 2022.
[4] A. Brandsen, S. Verberne, M. Wansleeben, and K. Lambers. Creating a dataset for named entity recognition in the archaeology domain. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4573–4577, 2020.
[5] A. P. Cavalcanti, A. Barbosa, R. Carvalho, F. Freitas, Y.-S. Tsai, D. Gašević, and R. F. Mello. Automatic feedback in online learning environments: A systematic literature review. Computers and Education: Artificial Intelligence, 2:100027, 2021.
[6] Y. Chen, L. Wu, Q. Zheng, R. Huang, J. Liu, L. Deng, J. Yu, Y. Qing, B. Dong, and P. Chen. A boundary regression model for nested named entity recognition. Cognitive Computation, 15(2):534–551, 2023.
[7] D. R. Chine, P. Chhabra, A. Adeniran, S. Gupta, and K. R. Koedinger. Development of scenario-based mentor lessons: an iterative design process for training at scale. In Proceedings of the Ninth ACM Conference on Learning@Scale, pages 469–471, 2022.
[8] D. R. Chine, P. Chhabra, A. Adeniran, J. Kopko, C. Tipper, S. Gupta, and K. R. Koedinger. Scenario-based training and on-the-job support for equitable mentoring. In The Learning Ideas Conference, pages 581–592. Springer, 2022.
[9] W. Dai, J. Lin, H. Jin, T. Li, Y.-S. Tsai, D. Gašević, and G. Chen. Can large language models provide feedback to students? A case study on chatgpt. In 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), pages 323–325. IEEE, 2023.
[10] W. Dai, Y.-S. Tsai, J. Lin, A. Aldino, H. Jin, T. Li, D. Gašević, and G. Chen. Assessing the proficiency of large language models in automatic feedback generation: An evaluation study. 2024.
[11] L. Deleger, Q. Li, T. Lingren, M. Kaiser, K. Molnar, L. Stoutenborough, M. Kouril, K. Marsolo, I. Solti, et al. Building gold standard corpora for medical natural language processing tasks. In AMIA Annual Symposium Proceedings, volume 2012, page 144. American Medical Informatics Association, 2012.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the NAACL-HLT, Volume 1, pages 4171–4186, 2019.
[13] J. Dietrichson, M. Bøg, T. Filges, and A.-M. Klint Jørgensen. Academic interventions for elementary and middle school students with low socioeconomic status: A systematic review and meta-analysis. Review of educational research, 87(2):243–282, 2017.
[14] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy. A survey of data augmentation approaches for nlp. In Findings of the ACL: ACL-IJCNLP 2021, pages 968–988, 2021.
[15] N. Gisev, J. S. Bell, and T. F. Chen. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Research in Social and Administrative Pharmacy, 9(3):330–338, 2013.
[16] C. Grouin, S. Rosset, P. Zweigenbaum, K. Fort, O. Galibert, and L. Quintard. Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview. In Proceedings of the 5th linguistic annotation workshop, pages 92–100, 2011.
[17] A. Gurung, S. Baral, M. P. Lee, A. C. Sales, A. Haim, K. P. Vanacore, A. A. McReynolds, H. Kreisberg, C. Heffernan, and N. T. Heffernan. How common are common wrong answers? Crowdsourcing remediation at scale. In Proceedings of the Tenth ACM Conference on Learning@Scale, pages 70–80, 2023.
[18] A. Gurung, S. Baral, K. P. Vanacore, A. A. McReynolds, H. Kreisberg, A. F. Botelho, S. T. Shaw, and N. T. Heffernan. Identification, exploration, and remediation: Can teachers predict common wrong answers? In LAK23: 13th International Learning Analytics and Knowledge Conference, pages 399–410, 2023.
[19] J. Guryan, J. Ludwig, M. P. Bhatt, P. J. Cook, J. M. Davis, K. Dodge, G. Farkas, R. G. Fryer Jr, S. Mayer, H. Pollack, et al. Not too late: Improving academic outcomes among adolescents. American Economic Review, 113(3):738–765, 2023.
[20] Z. F. Han, J. Lin, A. Gurung, D. R. Thomas, E. Chen, C. Borchers, S. Gupta, and K. R. Koedinger. Improving assessment of tutoring practices using retrieval-augmented generation, 2024.
[21] J. Hattie and H. Timperley. The power of feedback. Review of Educational Research, 2007.
[22] N. T. Heffernan and C. L. Heffernan. The assistments ecosystem: Building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. International Journal of Artificial Intelligence in Education, 24:470–497, 2014.
[23] M. Henderson, R. Ajjawi, D. Boud, and E. Molloy. The Impact of Feedback in Higher Education: Improving assessment outcomes for learners. Springer Nature, 2019.
[24] D. Hirunyasiri, D. R. Thomas, J. Lin, K. R. Koedinger, and V. Aleven. Comparative analysis of gpt-4 and human graders in evaluating praise given to students in synthetic dialogues. arXiv preprint arXiv:2307.02018, 2023.
[25] L. N. Jenkins, M. T. Floress, and W. Reinke. Rates and types of teacher praise: A review and future directions. Psychology in the Schools, 52(5):463–476, 2015.
[26] D. Jurafsky and J. H. Martin. Speech and Language Processing, 3rd edition, 2022.
[27] S. Kakarla, D. Thomas, J. Lin, S. Gupta, and K. R. Koedinger. Using large language models to assess tutors’ performance in reacting to students making math errors, 2024.
[28] K. S. Kalyan. A survey of gpt-3 family large language models including chatgpt and gpt-4. Natural Language Processing Journal, page 100048, 2023.
[29] M. L. Kamins and C. S. Dweck. Person versus process praise and criticism: implications for contingent self-worth and coping. Developmental psychology, 35(3):835, 1999.
[30] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al. Chatgpt for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274, 2023.
[31] M. Konkol and M. Konopík. Segment representations in named entity recognition. In International Conference on Text, Speech, and Dialogue, pages 61–70. Springer, 2015.
[32] M. A. Kraft and G. T. Falken. A blueprint for scaling tutoring and mentoring across public schools. AERA Open, 7:23328584211042858, 2021.
[33] E. Latif and X. Zhai. Fine-tuning chatgpt for automatic scoring. Computers and Education: Artificial Intelligence, page 100210, 2024.
[34] Z. Levonian, C. Li, W. Zhu, A. Gade, O. Henkel, M.-E. Postle, and W. Xing. Retrieval-augmented generation to improve math question-answering: Trade-offs between groundedness and human preference, 2023.
[35] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[36] J. Li, A. Sun, J. Han, and C. Li. A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 34(1):50–70, 2020.
[37] J. Lin, W. Dai, L.-A. Lim, Y.-S. Tsai, R. F. Mello, H. Khosravi, D. Gašević, and G. Chen. Learner-centred analytics of feedback content in higher education. In LAK23: 13th International Learning Analytics and Knowledge Conference, pages 100–110, 2023.
[38] J. Lin, Z. Han, D. R. Thomas, A. Gurung, S. Gupta, V. Aleven, and K. R. Koedinger. Leveraging large language models to enhance feedback provision in tutor training program. International Journal of Artificial Intelligence in Education, 2024.
[39] J. Lin, S. Singh, L. Sha, W. Tan, D. Lang, D. Gašević, and G. Chen. Is it a good move? Mining effective tutoring strategies from human–human tutorial dialogues. Future Generation Computer Systems, 127:194–207, 2022.
[40] J. Lin, D. R. Thomas, F. Han, S. Gupta, W. Tan, N. D. Nguyen, and K. R. Koedinger. Using large language models to provide explanatory feedback to human tutors. arXiv preprint arXiv:2306.15498, 2023.
[41] C. Liu, H. Fan, and J. Liu. Span-based nested named entity recognition with pretrained language model. In Database Systems for Advanced Applications: 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11–14, 2021, Proceedings, Part II 26, pages 620–628. Springer, 2021.
[42] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
[43] H. Luo, W. Tan, N. D. Nguyen, and L. Du. Re-weighting tokens: A simple and effective active learning strategy for named entity recognition. arXiv preprint arXiv:2311.00906, 2023.
[44] M. L. McHugh. Interrater reliability: the kappa statistic. Biochemia medica, 22(3):276–282, 2012.
[45] H. McNichols, W. Feng, J. Lee, A. Scarlatos, D. Smith, S. Woodhead, and A. Lan. Exploring automated distractor and feedback generation for math multiple-choice questions via in-context learning. arXiv preprint arXiv:2308.03234, 2023.
[46] N. D. Nguyen, W. Tan, L. Du, W. Buntine, R. Beare, and C. Chen. Auc maximization for low-resource named entity recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11):13389–13399, Jun. 2023.
[47] N. D. Nguyen, W. Tan, L. Du, W. Buntine, R. Beare, and C. Chen. Low-resource named entity recognition: Can one-vs-all auc maximization help? In 2023 IEEE International Conference on Data Mining (ICDM), pages 1241–1246. IEEE, 2023.
[48] A. Nickow, P. Oreopoulos, and V. Quan. The impressive effects of tutoring on prek-12 learning: A systematic review and meta-analysis of the experimental evidence. 2020.
[49] A. Pardo, K. Bartimote, S. B. Shum, S. Dawson, J. Gao, D. Gašević, S. Leichtweis, D. Liu, R. Martínez-Maldonado, N. Mirriahi, et al. Ontask: Delivering data-informed, personalized learning support actions. Journal of Learning Analytics, 5(3):235–249, 2018.
[50] T. Patikorn and N. T. Heffernan. Effectiveness of crowd-sourcing on-demand assistance from teachers in online learning platforms. In Proceedings of the Seventh ACM Conference on Learning@Scale, pages 115–124, 2020.
[51] C. Pornprasit and C. Tantithamthavorn. Gpt-3.5 for code review automation: How do few-shot learning, prompt design, and model fine-tuning impact their performance? arXiv preprint arXiv:2402.00905, 2024.
[52] J. Reich. Teaching drills: Advancing practice-based teacher education through short, low-stakes, high-frequency practice. Journal of Technology and Teacher Education, 30(2):217–228, 2022.
[53] T. Ryan, M. Henderson, K. Ryan, and G. Kennedy. Designing learner-centred text-based feedback: a rapid review and qualitative synthesis. Assessment & Evaluation in Higher Education, 46(6):894–912, 2021.
[54] A. Shrivastava and J. Heer. Iseql: Interactive sequence learning. In Proceedings of the 25th International Conference on Intelligent User Interfaces, pages 43–54, 2020.
[55] D. Thomas, X. Yang, S. Gupta, A. Adeniran, E. Mclaughlin, and K. Koedinger. When the tutor becomes the student: Design and evaluation of efficient scenario-based lessons for tutors. In LAK23: 13th International Learning Analytics and Knowledge Conference, pages 250–261, 2023.
[56] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022.