Forecasting software runtime metrics: A comparative study of classical statistical, neural network, and foundation models

Traini L.;Cortellessa V.
2026-01-01

Abstract

Modern software applications generate a wide range of runtime metrics, which are vital to many quality assurance activities. These data are often recorded and aggregated as time series to observe patterns and trends of various runtime aspects over time. In this context, Time Series Forecasting (TSF) offers unique opportunities for predicting software runtime behavior and identifying potential anomalies. Although TSF models have been successfully applied in fields such as economics and climatology, their capabilities for forecasting software runtime metrics remain relatively underexplored. In this paper, we conduct a comprehensive empirical evaluation of 8 TSF models on 110 real-world software runtime metrics recorded over the course of about one year. Our evaluation encompasses three classical statistical models, three neural network models, and two time series foundation models. Results show that the foundation models achieve state-of-the-art performance on TSF of software runtime metrics, outperforming other models with strong statistical significance. Our findings indicate that foundation models, despite being trained exclusively on time series data from other domains, can effectively generalize to software runtime metrics in a zero-shot setting. This makes them a convenient plug-and-play solution for practitioners and researchers aiming to integrate TSF into their software quality assurance processes. Yet, their performance is not uniformly superior across all the time series, underscoring the absence of a “silver bullet” solution.

Use this identifier to cite or link to this document: https://hdl.handle.net/11697/284000