Data Quality: Testing and Validation
Poor data quality costs large organisations an average of $12.9 million per year (IBM). Data validation is no longer optional — it is an engineering discipline in its own right. Here are the essential techniques and tools.
- Poor data quality costs large organisations an average of $12.9M/year (IBM) — covering 6 dimensions: completeness, accuracy, consistency, uniqueness, validity, freshness
- Great Expectations: the Python reference library for data validation; defines assertions on data and auto-generates HTML validation reports for ETL pipelines
- dbt tests: native SQL pipeline validation with built-in generic tests (not_null, unique, accepted_values) plus custom SQL assertions, with results visible in dbt Cloud
- Soda Core: scans BigQuery, Snowflake, Postgres and more in real time; YAML DSL readable by both developers and business teams
The 6 Dimensions of Data Quality
- Completeness: No required field is missing
- Accuracy: Values correspond to reality
- Consistency: Same data = same values across systems
- Uniqueness: No unintentional duplicates
- Validity: Values conform to the expected format, range and domain
- Freshness: Data is up to date according to SLAs
Great Expectations — The Python Standard
Great Expectations is the reference Python library for data validation. It lets you define "expectations" (assertions on your data), integrate them into ETL pipelines, and automatically generate HTML validation reports (Data Docs).
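As a minimal sketch, an expectation suite might look like this. It uses the older pandas-style shortcut (recent GX releases route validation through a Data Context instead, and return types vary by version); the file and column names are hypothetical:

```python
import great_expectations as ge

# Wrap a CSV in a validation-aware DataFrame (hypothetical file and columns)
orders = ge.read_csv("orders.csv")

# Expectations: declarative assertions on the data
orders.expect_column_values_to_not_be_null("order_id")    # completeness
orders.expect_column_values_to_be_unique("order_id")      # uniqueness
orders.expect_column_values_to_be_between("amount", min_value=0)  # validity
orders.expect_column_values_to_be_in_set("status", ["placed", "shipped", "returned"])

# Evaluate all expectations; this result is what Data Docs renders as an HTML report
results = orders.validate()
print(results.success)
```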
dbt Tests — Validation in SQL Pipelines
If you use dbt for your SQL transformations, dbt tests are the native, elegant option: generic tests (not_null, unique, accepted_values, relationships) come out of the box, custom tests are written in plain SQL, and results are visible in dbt Cloud.
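As a sketch, generic tests are declared in a model's YAML properties file; the model and column names below are hypothetical:

```yaml
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```

Running `dbt test` compiles each test into a SQL query that fails if it returns any rows; a custom test is simply a SQL file under `tests/` following the same convention.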
Soda Core — The Data Quality Platform
Soda Core allows you to scan your datasets (BigQuery, Snowflake, Postgres...) and alert in real time when quality degrades. Its YAML DSL makes rules readable by everyone — developers and business teams alike.
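As an illustration, a SodaCL checks file for a hypothetical `orders` table could cover several of the dimensions above (table and column names are assumptions; see the SodaCL docs for the full syntax):

```yaml
# checks/orders.yml
checks for orders:
  - row_count > 0                      # volume: table is not empty
  - missing_count(customer_id) = 0     # completeness
  - duplicate_count(order_id) = 0      # uniqueness
  - invalid_percent(status) = 0:       # validity
      valid values: [placed, shipped, completed, returned]
  - freshness(updated_at) < 1d         # freshness SLA
```

A scan is typically triggered with something like `soda scan -d <data_source> -c configuration.yml checks/orders.yml`, which evaluates every check and reports pass/fail per rule.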
Multi-Layer Validation Strategy
- Ingestion: Validate schema and types as soon as data enters (Pydantic, Avro schemas); see the sketch after this list
- Transformation: Test the results of dbt/Spark transformations
- Storage: Check counts, distributions and missing values after loading
- Consumption: Alert on dashboards and reports if metrics move outside thresholds
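For the ingestion layer, a minimal Pydantic sketch (the OrderEvent model and its fields are hypothetical) can reject malformed records before they reach the warehouse:

```python
from datetime import datetime
from pydantic import BaseModel, Field, ValidationError

class OrderEvent(BaseModel):
    # Hypothetical schema for an incoming event
    order_id: str
    amount: float = Field(ge=0)   # validity: non-negative amounts
    status: str
    created_at: datetime          # coerced from ISO-8601 strings

def validate_record(raw: dict) -> OrderEvent | None:
    """Return a parsed event, or None so the caller can dead-letter it."""
    try:
        return OrderEvent(**raw)
    except ValidationError:
        return None
```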
Metrics to Monitor
Do not measure everything — focus on critical business metrics: null rate for required fields, duplicate rate, expected vs observed volumes, freshness (time of last update), and statistical distribution of numerical values (anomaly detection).
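As a sketch of how these metrics can be computed on a pandas DataFrame (column names are placeholders; a production pipeline would push the numbers to a metrics store and alert on thresholds):

```python
import pandas as pd

def quality_metrics(df: pd.DataFrame, required_cols: list[str],
                    key_col: str, updated_at_col: str) -> dict:
    """Compute a handful of critical quality metrics for one table."""
    now = pd.Timestamp.now(tz="UTC")
    last_update = df[updated_at_col].max()  # assumed tz-aware UTC timestamps
    return {
        "null_rate": df[required_cols].isna().mean().to_dict(),  # per-column null rate
        "duplicate_rate": float(df.duplicated(subset=[key_col]).mean()),
        "row_count": int(len(df)),                                # expected vs observed volume
        "freshness_hours": (now - last_update).total_seconds() / 3600,
    }
```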
Train in Data Engineering
Our Data Engineering track covers data quality, ETL pipelines and modern tools (dbt, Spark, Airflow).