Premise
As part of a large data platform modernization effort for a company with $4 billion in annual revenue, I was tasked with implementing automatic data quality checks. The larger data framework was entirely metadata-driven, with any arbitrary operations limited to Snowflake. The automatic data quality checks were to run on a set cadence and log metadata about data loaded during that time period. More specifically, the data quality checks accomplished the following goals:
- Produce metadata about source and target data on a set cadence to verify ETL completion
- Identify late-arriving data through discrepancies in that metadata
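The two goals above reduce to comparing source and target metadata for a load window. A minimal sketch of that comparison, with hypothetical names and a simple row-count check standing in for the actual metadata collected:

```python
def check_window(source_rows: int, target_rows: int) -> str:
    """Classify a load window by comparing source and target row counts.

    This is illustrative only; the real checks logged richer metadata
    to Snowflake rather than returning a label.
    """
    if target_rows == source_rows:
        return "complete"
    if target_rows < source_rows:
        # Fewer rows in the target: the ETL is incomplete, or data
        # arrived at the source after the load ran (late-arriving data)
        return "missing_rows"
    # More rows in the target than the source snapshot: possible duplicates
    return "extra_rows"

print(check_window(100, 100))  # complete
print(check_window(100, 90))   # missing_rows
```

In practice a discrepancy on one run that resolves on a later run is the signature of late-arriving data, which is why the checks ran on a cadence rather than once per load.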
Process
I created a metadata-driven process that performed metadata checks and wrote the results to Snowflake. These processes ran separately from the ETLs themselves. The first step of this engagement was to define the metadata that would ultimately trigger each job. This was accomplished by:
- Examining the current ETL processes
- Determining conceptually what in a metadata collection process could be parameterized
- Working backwards from a sample process with appropriate parameterization
From there, I used dynamic SQL construction to create metadata collection processes that were driven by that metadata.
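A minimal sketch of that SQL construction step, assuming a hypothetical metadata row with table coordinates and a load-window definition (the real metadata schema is not shown here):

```python
def build_count_sql(meta: dict) -> str:
    """Build a row-count query for one load window from a metadata row.

    All keys ('database', 'schema', 'table', 'load_ts_column',
    'window_start', 'window_end') are hypothetical placeholders.
    """
    return (
        f"SELECT COUNT(*) AS row_count "
        f"FROM {meta['database']}.{meta['schema']}.{meta['table']} "
        f"WHERE {meta['load_ts_column']} >= '{meta['window_start']}' "
        f"AND {meta['load_ts_column']} < '{meta['window_end']}'"
    )

# Example metadata row (hypothetical values)
meta = {
    "database": "RAW", "schema": "SALES", "table": "ORDERS",
    "load_ts_column": "LOAD_TS",
    "window_start": "2024-01-01", "window_end": "2024-01-02",
}
print(build_count_sql(meta))
```

In a production process the window values would be passed as bind variables and the identifiers validated against the metadata store rather than interpolated directly, since raw string formatting is vulnerable to injection.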
Results
- Created automated data quality checks for three data sources
- Identified late-arriving data in those three sources
Lessons Learned
- Document communications