Data Quality: Testing and Validation
Poor data quality costs large organisations an average of $12.9 million per year (IBM). Data validation is no longer optional — it is an engineering discipline in its own right. Here are the essential techniques and tools.
- Poor data quality costs large organisations an average of $12.9M/year (IBM) — covering 6 dimensions: completeness, accuracy, consistency, uniqueness, validity, freshness
- Great Expectations: the Python reference library for data validation; defines assertions on data and auto-generates HTML validation reports for ETL pipelines
- dbt tests: native SQL pipeline validation with built-in generic tests (not_null, unique, accepted_values) plus custom SQL assertions, with results visible in dbt Cloud
- Soda Core: scans BigQuery, Snowflake, Postgres and more in real time; YAML DSL readable by both developers and business teams
The 6 Dimensions of Data Quality
- Completeness: No required field is missing
- Accuracy: Values correspond to reality
- Consistency: Same data = same values across systems
- Uniqueness: No unintentional duplicates
- Validity: Values conform to the expected format, range and domain
- Freshness: Data is up to date according to SLAs
Great Expectations — The Python Standard
Great Expectations is the reference Python library for data validation. It lets you define "expectations" (assertions on your data), integrate them into ETL pipelines, and automatically generate HTML validation reports (Data Docs).
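As a minimal sketch, an expectation suite might look like this. It uses the older pandas-style shortcut (recent GX releases route validation through a Data Context instead, and return types vary by version); the file and column names are hypothetical:

```python
import great_expectations as ge

# Wrap a CSV in a validation-aware DataFrame (hypothetical file and columns)
orders = ge.read_csv("orders.csv")

# Expectations: declarative assertions on the data
orders.expect_column_values_to_not_be_null("order_id")    # completeness
orders.expect_column_values_to_be_unique("order_id")      # uniqueness
orders.expect_column_values_to_be_between("amount", min_value=0)  # validity
orders.expect_column_values_to_be_in_set("status", ["placed", "shipped", "returned"])

# Evaluate all expectations; this result is what Data Docs renders as an HTML report
results = orders.validate()
print(results.success)
```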
dbt Tests — Validation in SQL Pipelines
If you use dbt for your SQL transformations, dbt tests are the native, elegant option: generic tests (not_null, unique, accepted_values, relationships) come out of the box, custom tests are written in plain SQL, and results are visible in dbt Cloud.
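As a sketch, generic tests are declared in a model's YAML properties file; the model and column names below are hypothetical:

```yaml
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```

Running `dbt test` compiles each test into a SQL query that fails if it returns any rows; a custom test is simply a SQL file under `tests/` following the same convention.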
Soda Core — The Data Quality Platform
Soda Core allows you to scan your datasets (BigQuery, Snowflake, Postgres...) and alert in real time when quality degrades. Its YAML DSL makes rules readable by everyone — developers and business teams alike.
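As an illustration, a SodaCL checks file for a hypothetical `orders` table could cover several of the dimensions above (table and column names are assumptions; see the SodaCL docs for the full syntax):

```yaml
# checks/orders.yml
checks for orders:
  - row_count > 0                      # volume: table is not empty
  - missing_count(customer_id) = 0     # completeness
  - duplicate_count(order_id) = 0      # uniqueness
  - invalid_percent(status) = 0:       # validity
      valid values: [placed, shipped, completed, returned]
  - freshness(updated_at) < 1d         # freshness SLA
```

A scan is typically triggered with something like `soda scan -d <data_source> -c configuration.yml checks/orders.yml`, which evaluates every check and reports pass/fail per rule.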
Multi-Layer Validation Strategy
- Ingestion: Validate schema and types as soon as data enters (Pydantic, Avro schemas); see the sketch after this list
- Transformation: Test the results of dbt/Spark transformations
- Storage: Check counts, distributions and missing values after loading
- Consumption: Alert on dashboards and reports if metrics move outside thresholds
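For the ingestion layer, a minimal Pydantic sketch (the OrderEvent model and its fields are hypothetical) can reject malformed records before they reach the warehouse:

```python
from datetime import datetime
from pydantic import BaseModel, Field, ValidationError

class OrderEvent(BaseModel):
    # Hypothetical schema for an incoming event
    order_id: str
    amount: float = Field(ge=0)   # validity: non-negative amounts
    status: str
    created_at: datetime          # coerced from ISO-8601 strings

def validate_record(raw: dict) -> OrderEvent | None:
    """Return a parsed event, or None so the caller can dead-letter it."""
    try:
        return OrderEvent(**raw)
    except ValidationError:
        return None
```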
Metrics to Monitor
Do not measure everything — focus on critical business metrics: null rate for required fields, duplicate rate, expected vs observed volumes, freshness (time of last update), and statistical distribution of numerical values (anomaly detection).
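As a sketch of how these metrics can be computed on a pandas DataFrame (column names are placeholders; a production pipeline would push the numbers to a metrics store and alert on thresholds):

```python
import pandas as pd

def quality_metrics(df: pd.DataFrame, required_cols: list[str],
                    key_col: str, updated_at_col: str) -> dict:
    """Compute a handful of critical quality metrics for one table."""
    now = pd.Timestamp.now(tz="UTC")
    last_update = df[updated_at_col].max()  # assumed tz-aware UTC timestamps
    return {
        "null_rate": df[required_cols].isna().mean().to_dict(),  # per-column null rate
        "duplicate_rate": float(df.duplicated(subset=[key_col]).mean()),
        "row_count": int(len(df)),                                # expected vs observed volume
        "freshness_hours": (now - last_update).total_seconds() / 3600,
    }
```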
Train in Data Engineering
Our Data Engineering track covers data quality, ETL pipelines and modern tools (dbt, Spark, Airflow).