DQGen: Scalable Metadata-Driven Automation for Data Quality Validation in Data-Intensive Applications

Abughazala, Moamin; Muccini, Henry
2026-01-01

Abstract

Data-intensive applications (DIAs) increasingly rely on trustworthy, high-quality data to support reliable analytics, machine learning, and automated decision-making. However, ensuring data quality at scale remains a major challenge due to schema evolution, ingestion inconsistencies, and the manual effort required to define validation rules. This paper presents DQGen, a framework for automating the generation of data quality validation scripts tailored to complex, evolving datasets. By leveraging dataset metadata and systematically mapping data quality dimensions (e.g., completeness, uniqueness, validity) to Great Expectations (GE) rules, DQGen produces executable validation code adaptable to any schema. We evaluate DQGen on a real-world dataset from a large-scale Internet Service Provider (ISP), comprising over 26 million records across multiple relational tables. Results show that DQGen reduces validation setup time by over 90%, improves rule coverage and consistency, and enables continuous integration of data quality checks in batch or CI/CD workflows. The proposed framework contributes to the reliability and governance of modern DIAs by ensuring scalable, transparent, and automated validation.
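
The metadata-to-rule mapping described in the abstract can be illustrated with a minimal sketch. It is not DQGen's implementation: the metadata layout (`COLUMN_METADATA`), its keys, and the function `generate_expectations` are hypothetical names chosen for illustration. Only the expectation names are taken from Great Expectations, and the rules are emitted as plain dictionaries so the example runs without the library installed.

```python
from typing import Any

# Hypothetical column metadata; the actual metadata format used by DQGen
# is not specified in the abstract.
COLUMN_METADATA: list[dict[str, Any]] = [
    {"name": "customer_id", "nullable": False, "unique": True},
    {"name": "plan", "nullable": False, "allowed_values": ["basic", "premium"]},
    {"name": "monthly_fee", "nullable": True, "min": 0, "max": 500},
]


def generate_expectations(columns: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Map each metadata property to a rule for one data quality dimension."""
    rules: list[dict[str, Any]] = []
    for col in columns:
        name = col["name"]
        # Completeness: non-nullable columns must not contain nulls.
        if not col.get("nullable", True):
            rules.append({
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": name},
            })
        # Uniqueness: key-like columns must not contain duplicates.
        if col.get("unique", False):
            rules.append({
                "expectation_type": "expect_column_values_to_be_unique",
                "kwargs": {"column": name},
            })
        # Validity: categorical columns must stay within an allowed set.
        if "allowed_values" in col:
            rules.append({
                "expectation_type": "expect_column_values_to_be_in_set",
                "kwargs": {"column": name, "value_set": col["allowed_values"]},
            })
        # Validity: numeric columns must stay within a declared range.
        if "min" in col or "max" in col:
            rules.append({
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {
                    "column": name,
                    "min_value": col.get("min"),
                    "max_value": col.get("max"),
                },
            })
    return rules


if __name__ == "__main__":
    for rule in generate_expectations(COLUMN_METADATA):
        print(rule["expectation_type"], rule["kwargs"])
```

Because the rules are derived from metadata rather than written by hand, a schema change only requires regenerating them, which is the property that makes this style of validation suitable for batch or CI/CD workflows.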
ISBN: 9783032044020, 9783032044037
Use this identifier to cite or link to this document: https://hdl.handle.net/11697/284165