DQGen: Scalable Metadata-Driven Automation for Data Quality Validation in Data-Intensive Applications

Abughazala, Moamin; Muccini, Henry

doi:10.1007/978-3-032-04403-7_31

Data-intensive applications (DIAs) increasingly rely on trustworthy, high-quality data to support reliable analytics, machine learning, and automated decision-making. However, ensuring data quality at scale remains a major challenge due to schema evolution, ingestion inconsistencies, and the manual effort required to define validation rules. This paper presents DQGen, a framework for automating the generation of data quality validation scripts tailored to complex, evolving datasets. By leveraging dataset metadata and systematically mapping data quality dimensions (e.g., completeness, uniqueness, validity) to Great Expectations (GE) rules, DQGen produces executable validation code adaptable to any schema. We evaluate DQGen on a real-world dataset from a large-scale Internet Service Provider (ISP), comprising over 26 million records across multiple relational tables. Results show that DQGen reduces validation setup time by over 90%, improves rule coverage and consistency, and enables continuous integration of data quality checks in batch or CI/CD workflows. The proposed framework contributes to the reliability and governance of modern DIAs by ensuring scalable, transparent, and automated validation.
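
As an illustration of the metadata-to-rule mapping described in the abstract, the following is a minimal Python sketch. The metadata layout, the function names, the example table, and the particular dimension-to-expectation mapping are assumptions made for this example and are not taken from DQGen itself; only the expectation names correspond to existing Great Expectations expectations. The sketch emits plain rule dictionaries rather than calling the GE library, so it runs without any dependency.

```python
from typing import Any


def rules_for_column(col: dict[str, Any]) -> list[dict[str, Any]]:
    """Derive rule configs for one column from its (hypothetical) metadata."""
    name = col["name"]
    rules: list[dict[str, Any]] = []

    # Completeness: non-nullable columns must not contain nulls.
    if not col.get("nullable", True):
        rules.append({
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": name},
        })

    # Uniqueness: key-like columns must not contain duplicates.
    if col.get("unique", False):
        rules.append({
            "expectation_type": "expect_column_values_to_be_unique",
            "kwargs": {"column": name},
        })

    # Validity (categorical): values must come from a declared set.
    if "allowed_values" in col:
        rules.append({
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {"column": name, "value_set": list(col["allowed_values"])},
        })

    # Validity (numeric range): values must fall within declared bounds.
    if "min" in col or "max" in col:
        rules.append({
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {
                "column": name,
                "min_value": col.get("min"),
                "max_value": col.get("max"),
            },
        })

    return rules


def generate_suite(table_metadata: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten the per-column rules of one table into a single rule list."""
    return [
        rule
        for col in table_metadata["columns"]
        for rule in rules_for_column(col)
    ]


# Hypothetical metadata for an ISP-style table (illustrative values only).
customers = {
    "table": "customers",
    "columns": [
        {"name": "customer_id", "nullable": False, "unique": True},
        {"name": "plan", "allowed_values": ["basic", "premium"]},
        {"name": "monthly_fee", "min": 0},
    ],
}

suite = generate_suite(customers)
for rule in suite:
    print(rule["expectation_type"], rule["kwargs"])
```

Because the rules are derived from metadata alone, regenerating the list after a schema change only requires updated metadata, which is the property the abstract attributes to a metadata-driven approach.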

2026-01-01

Year: 2026
ISBN: 9783032044020, 9783032044037
Appears in categories: 4.1 Contribution in conference proceedings



Use this identifier to cite or link to this document: https://hdl.handle.net/11697/284165
