Premise
As part of a large data platform modernization effort for a company with $4 billion in annual revenue, I was tasked with implementing automatic data quality checks. The larger data framework was entirely metadata-driven, with any arbitrary operations limited to Snowflake. The automatic data quality checks were to run on a set cadence and log metadata about data loaded during that time period. More specifically, the data quality checks accomplished the following goals:
- Produce metadata about source and target data on a set cadence to verify ETL completion
- Identify late-arriving data through discrepancies in that metadata
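The two goals above reduce to comparing source and target metadata for a load window. A minimal sketch of that comparison, with hypothetical names and a simple row-count check standing in for the actual metadata collected:

```python
def check_window(source_rows: int, target_rows: int) -> str:
    """Classify a load window by comparing source and target row counts.

    This is illustrative only; the real checks logged richer metadata
    to Snowflake rather than returning a label.
    """
    if target_rows == source_rows:
        return "complete"
    if target_rows < source_rows:
        # Fewer rows in the target: the ETL is incomplete, or data
        # arrived at the source after the load ran (late-arriving data)
        return "missing_rows"
    # More rows in the target than the source snapshot: possible duplicates
    return "extra_rows"

print(check_window(100, 100))  # complete
print(check_window(100, 90))   # missing_rows
```

In practice a discrepancy on one run that resolves on a later run is the signature of late-arriving data, which is why the checks ran on a cadence rather than once per load.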
Process
I created a metadata-driven process that performed metadata checks and wrote the results to Snowflake. These processes ran separately from the ETLs themselves. The first step of this engagement was to define the metadata that would ultimately trigger each job. This was accomplished by:
- Examining the current ETL processes
- Determining conceptually what in a metadata collection process could be parameterized
- Working backwards from a sample process with appropriate parameterization
From there, I used dynamic SQL construction to create metadata collection processes that were driven by that metadata.
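A minimal sketch of that SQL construction step, assuming a hypothetical metadata row with table coordinates and a load-window definition (the real metadata schema is not shown here):

```python
def build_count_sql(meta: dict) -> str:
    """Build a row-count query for one load window from a metadata row.

    All keys ('database', 'schema', 'table', 'load_ts_column',
    'window_start', 'window_end') are hypothetical placeholders.
    """
    return (
        f"SELECT COUNT(*) AS row_count "
        f"FROM {meta['database']}.{meta['schema']}.{meta['table']} "
        f"WHERE {meta['load_ts_column']} >= '{meta['window_start']}' "
        f"AND {meta['load_ts_column']} < '{meta['window_end']}'"
    )

# Example metadata row (hypothetical values)
meta = {
    "database": "RAW", "schema": "SALES", "table": "ORDERS",
    "load_ts_column": "LOAD_TS",
    "window_start": "2024-01-01", "window_end": "2024-01-02",
}
print(build_count_sql(meta))
```

In a production process the window values would be passed as bind variables and the identifiers validated against the metadata store rather than interpolated directly, since raw string formatting is vulnerable to injection.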
Results
- Created automated data quality checks for three data sources
- Identified late-arriving data in those three sources
Lessons Learned
- Document communications