Data Quality: Testing and Validation
December 15, 2023 · ADC Team · 5 min read


Poor data quality costs large organisations an average of $12.9 million per year (IBM). Data validation is no longer optional — it is an engineering discipline in its own right. Here are the essential techniques and tools.

Key takeaways
  • Poor data quality costs large organisations an average of $12.9M/year (IBM); quality is assessed along 6 dimensions: completeness, accuracy, consistency, uniqueness, validity, freshness
  • Great Expectations: the de facto standard Python library for data validation; lets you define assertions on data inside ETL pipelines and auto-generates HTML validation reports
  • dbt tests: native SQL pipeline validation with built-in generic tests (not_null, unique, accepted_values) plus custom SQL assertions visible in dbt Cloud
  • Soda Core: scans BigQuery, Snowflake, Postgres and more in real time; YAML DSL readable by both developers and business teams

The 6 Dimensions of Data Quality

  • Completeness: No required field is missing
  • Accuracy: Values correspond to reality
  • Consistency: Same data = same values across systems
  • Uniqueness: No unintentional duplicates
  • Validity: Values respect the expected format, range and domain
  • Freshness: Data is up to date according to SLAs
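
To make these dimensions concrete, here is a minimal pandas sketch that computes one indicator per dimension. The orders.csv file, column names and thresholds are hypothetical; accuracy and consistency require an external reference system and are only noted in comments.

    import pandas as pd

    # Hypothetical orders dataset; column names and thresholds are assumptions.
    df = pd.read_csv("orders.csv", parse_dates=["updated_at"])

    # Completeness: share of rows where a required field is present
    completeness = 1 - df["customer_id"].isna().mean()

    # Validity: share of values inside the accepted domain
    validity = df["status"].isin(["pending", "paid", "refunded"]).mean()

    # Uniqueness: share of rows that are not duplicates of the business key
    uniqueness = 1 - df["order_id"].duplicated().mean()

    # Freshness: has the dataset been updated within the last 24 hours?
    freshness_ok = (pd.Timestamp.now() - df["updated_at"].max()) < pd.Timedelta("1D")

    # Accuracy and consistency need an external reference (source system, CRM...)
    # and are usually checked by reconciliation queries rather than a single scan.
    print(completeness, validity, uniqueness, freshness_ok)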

Great Expectations — The Python Standard

Great Expectations is the de facto standard Python library for data validation. It lets you define "expectations" (assertions on data), integrate them into ETL pipelines, and automatically generate HTML validation reports.
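
As a hedged illustration using the classic pandas-DataFrame API (entry points differ in recent GX releases; the file name, columns and thresholds are hypothetical):

    import pandas as pd
    import great_expectations as ge

    # Wrap a pandas DataFrame so expectation methods become available (classic API)
    orders = ge.from_pandas(pd.read_csv("orders.csv"))

    orders.expect_column_values_to_not_be_null("order_id")
    orders.expect_column_values_to_be_unique("order_id")
    orders.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)
    orders.expect_column_values_to_be_in_set("status", ["pending", "paid", "refunded"])

    # Run all expectations at once; the result object can feed the HTML reports (Data Docs)
    result = orders.validate()
    print(result.success)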

dbt Tests — Validation in SQL Pipelines

If you use dbt for your SQL transformations, dbt tests are the native, elegant option: built-in generic tests (not_null, unique, accepted_values, relationships) cover the common cases, and custom singular tests let you express any assertion in SQL. Results are visible in dbt Cloud.
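
As an illustration, here is what generic tests can look like in a models/schema.yml file. This is a minimal sketch: the orders model, its columns and the accepted values are hypothetical, and recent dbt versions also accept data_tests: as the key.

    # models/schema.yml -- model and column names are hypothetical
    version: 2

    models:
      - name: orders
        columns:
          - name: order_id
            tests:
              - not_null
              - unique
          - name: status
            tests:
              - accepted_values:
                  values: ['pending', 'paid', 'refunded']
          - name: customer_id
            tests:
              - relationships:
                  to: ref('customers')
                  field: customer_id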

Soda Core — The Data Quality Platform

Soda Core lets you scan your datasets (BigQuery, Snowflake, Postgres and more) and alerts in real time when quality degrades. Its YAML DSL makes rules readable by everyone, developers and business teams alike.
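
Beyond the CLI, a scan can also be triggered programmatically. Here is a minimal sketch assuming Soda Core 3.x, a data source named "warehouse" declared in configuration.yml, and a hypothetical orders table; the exact Python API may vary by version.

    from soda.scan import Scan  # pip install soda-core-postgres (or the adapter for your warehouse)

    scan = Scan()
    scan.set_data_source_name("warehouse")                 # data source declared in configuration.yml
    scan.add_configuration_yaml_file("configuration.yml")

    # SodaCL checks: the YAML DSL readable by developers and business teams alike
    scan.add_sodacl_yaml_str("""
    checks for orders:
      - row_count > 0
      - missing_count(customer_id) = 0
      - duplicate_count(order_id) = 0
      - freshness(updated_at) < 1d
    """)

    scan.execute()
    print(scan.get_logs_text())
    scan.assert_no_checks_fail()  # raise if any check failed, e.g. to fail an orchestrator task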

Multi-Layer Validation Strategy

  • Ingestion: Validate schema and types as soon as data enters (Pydantic, Avro schemas); see the sketch after this list
  • Transformation: Test the results of dbt/Spark transformations
  • Storage: Check counts, distributions and missing values after loading
  • Consumption: Alerts on dashboards and reports if metrics go outside thresholds
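
For the ingestion layer, here is a minimal Pydantic sketch (Pydantic v2 syntax; the Order model, its fields and constraints are hypothetical). Invalid rows are collected rather than silently dropped, so they can be routed to a dead-letter queue.

    from datetime import datetime
    from pydantic import BaseModel, Field, ValidationError

    class Order(BaseModel):
        order_id: str
        customer_id: str
        amount: float = Field(ge=0)                                 # validity: non-negative amounts
        status: str = Field(pattern=r"^(pending|paid|refunded)$")   # validity: domain check
        updated_at: datetime                                        # freshness is checked downstream

    def ingest(raw_records: list[dict]) -> tuple[list[Order], list[tuple[dict, list]]]:
        valid, rejected = [], []
        for record in raw_records:
            try:
                valid.append(Order(**record))                       # schema and type validation at entry
            except ValidationError as exc:
                rejected.append((record, exc.errors()))             # route bad rows to a dead-letter queue
        return valid, rejected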

Metrics to Monitor

Do not measure everything; focus on critical business metrics:

  • Null rate for required fields
  • Duplicate rate
  • Expected vs observed volumes
  • Freshness (time of last update)
  • Statistical distribution of numerical values (anomaly detection)
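
As a hedged sketch, these metrics can be computed on a pandas DataFrame as follows; the column names, the expected_rows parameter and the 3-sigma threshold are assumptions, not part of the article.

    import pandas as pd

    def quality_metrics(df: pd.DataFrame, expected_rows: int) -> dict:
        """Compute the critical quality metrics for a hypothetical orders table."""
        daily_volumes = df.groupby(df["updated_at"].dt.date).size()
        # z-score of the latest daily volume against history: naive anomaly detection
        z = (daily_volumes.iloc[-1] - daily_volumes.mean()) / daily_volumes.std()
        return {
            "null_rate_customer_id": df["customer_id"].isna().mean(),
            "duplicate_rate": df["order_id"].duplicated().mean(),
            "volume_ratio": len(df) / expected_rows,
            "freshness_hours": (pd.Timestamp.now() - df["updated_at"].max()).total_seconds() / 3600,
            "volume_anomaly": abs(z) > 3,
        }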

Train in Data Engineering

Our Data Engineering track covers data quality, ETL pipelines and modern tools (dbt, Spark, Airflow).

