Preliminary Evaluation of an LLM-Based System for Grading and Providing Feedback on Short-Text Answers in Data Science Exercises

Cofini V.; Jobe T.; Letteri I.; Vittorini P.
2025-01-01

Abstract

Large Language Models (LLMs) have shown promise in automating feedback, enhancing accessibility, and customizing support. However, their integration into educational frameworks requires careful consideration of pedagogical effectiveness and accuracy. This paper describes an experimental setup, in the specific domain of data science, aimed at assessing the ability of LLMs to provide accurate and helpful feedback to students. The dataset used in our study was obtained from a self-learning platform for medical students taking data science courses at the University of L’Aquila (Italy). We found that the most effective approach involved tailoring prompts to the type of statistical test (normality tests or hypothesis tests). The accuracy of the LLM in providing knowledge-of-results (KR) feedback (i.e., right/wrong classification) was 0.93. The LLM returned adequate feedback explaining the mistake in more than 75% of cases; it had more difficulty when the feedback concerned the interpretation of a hypothesis test, where it was adequate in only 71% of cases. In summary, these findings are consistent with the growing literature on the use of LLMs in statistics, reinforcing their potential in this area of research. Longitudinal monitoring would be necessary to track how model improvements affect performance on educational feedback tasks over time, as conclusions valid for current models may quickly become outdated as the technology evolves.
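As an illustration of the two measures reported in the abstract, the following minimal Python sketch computes the accuracy of right/wrong (KR) classification against a human grader and the share of adequate feedback per test type. It is not the authors' code: the function names, the record layout, and the toy data are assumptions made purely for illustration, and the numbers it produces are unrelated to the results of the study.

```python
from collections import defaultdict


def kr_accuracy(llm_labels, human_labels):
    """Fraction of answers where the LLM's right/wrong verdict matches the human grader's."""
    if len(llm_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("at least one graded answer is required")
    matches = sum(llm == human for llm, human in zip(llm_labels, human_labels))
    return matches / len(human_labels)


def adequacy_by_test_type(records):
    """Share of feedback judged adequate, grouped by test type.

    `records` is an iterable of (test_type, is_adequate) pairs; this layout
    is a hypothetical one chosen for the sketch.
    """
    totals = defaultdict(int)
    adequate = defaultdict(int)
    for test_type, is_adequate in records:
        totals[test_type] += 1
        adequate[test_type] += int(is_adequate)
    return {test_type: adequate[test_type] / totals[test_type] for test_type in totals}


# Toy data, invented for illustration only (True = answer judged correct).
llm_verdicts = [True, False, True, True]
human_verdicts = [True, False, False, True]

# Toy adequacy judgments, also invented for illustration only.
feedback_records = [
    ("normality", True),
    ("hypothesis", True),
    ("hypothesis", False),
    ("normality", True),
]

accuracy = kr_accuracy(llm_verdicts, human_verdicts)
adequacy = adequacy_by_test_type(feedback_records)
print(f"KR accuracy: {accuracy:.2f}")
for test_type, share in adequacy.items():
    print(f"Adequate feedback ({test_type}): {share:.0%}")
```

Grouping adequacy by test type mirrors the abstract's distinction between normality tests and hypothesis tests, which is where the reported difference in feedback quality appears.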
2025
9783032050694
9783032050700
Use this identifier to cite or link to this document: https://hdl.handle.net/11697/283559