Preliminary Evaluation of an LLM-Based System for Grading and Providing Feedback on Short-Text Answers in Data Science Exercises
Cofini V.; Jobe T.; Letteri I.; Vittorini P.
2025-01-01
Abstract
Large Language Models (LLMs) have shown promise in automating feedback, enhancing accessibility, and customizing support. However, their integration into educational frameworks requires careful consideration of pedagogical effectiveness and accuracy. This paper describes an experimental setup, in the specific domain of data science, designed to assess the ability of LLMs to provide accurate and helpful feedback to students. The dataset used in our study was obtained from a self-learning platform for medical students taking data science courses at the University of L’Aquila (Italy). We found that the most effective approach involved tailoring prompts to the type of statistical test (normality tests or hypothesis tests). The accuracy of the LLM in providing KR feedback (i.e., right/wrong classification) was 0.93. The LLM returned adequate feedback explaining the mistake in more than 75% of cases, with more difficulty when the feedback concerned the interpretation of a hypothesis test, where it was adequate in only 71% of cases. In summary, these findings are consistent with the growing literature on the use of LLMs in statistics, reinforcing their potential in this area of research. Longitudinal monitoring would be necessary to track how model improvements affect performance on educational feedback tasks over time, as conclusions valid for current models may quickly become outdated as the technology evolves.


