Metadata Driven Orchestration POC

Friday, November 1, 2024

I designed and built a fully metadata-driven architecture that both saturates available compute resources and orchestrates jobs and bundles.

Premise

A client retained my team to explore alternate architectural patterns for a data platform built on StreamSets. The data platform relied on two concepts:

Jobs: Individual units of work executed within an AWS environment
Workflows: Compositions of jobs, each arranged as a directed acyclic graph
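To make the two concepts concrete, here is a minimal sketch of how a workflow might be described as metadata and checked for a valid run order. It is an illustration only, not the client implementation: the job names and the `execution_order` helper are hypothetical, and it uses plain Python (the standard-library `graphlib` module, Python 3.9+) rather than StreamSets.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical workflow metadata: each job maps to the set of jobs it depends on.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def execution_order(dependencies):
    """Return one valid job order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of args holds the jobs that form the cycle.
        raise ValueError(f"workflow is not acyclic: {exc.args[1]}") from exc


order = execution_order(workflow)
print(order)
```

Because the workflow is plain data, the same check can run before any job is dispatched, so a malformed definition fails early instead of stalling mid-run.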

Given an arbitrary number of compute nodes, our task was to create an implementation in StreamSets that tracks progress, executes workflows (and their jobs) according to an arbitrary priority, and saturates the available compute nodes.
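The scheduling problem itself can be sketched independently of StreamSets. The example below is a simplified simulation under stated assumptions, not the implementation delivered to the client: the `schedule` function, the workflow names, the convention that a lower number means higher priority, and the assumption that every job takes exactly one round are all hypothetical. It fills up to `node_count` nodes per round with the highest-priority jobs whose upstream jobs have finished.

```python
import heapq
from graphlib import TopologicalSorter


def schedule(workflows, node_count):
    """Simulate prioritized DAG scheduling in discrete rounds.

    workflows: {name: (priority, {job: set_of_upstream_jobs})}
    Lower priority numbers run first. Each job is assumed to take one round.
    Returns a list of rounds, each a list of (workflow, job) pairs.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")

    sorters = {}
    for name, (_, deps) in workflows.items():
        sorter = TopologicalSorter(deps)
        sorter.prepare()  # raises CycleError if the graph is not a DAG
        sorters[name] = sorter

    ready = []  # min-heap of (priority, workflow, job)

    def collect_ready(name):
        priority = workflows[name][0]
        for job in sorters[name].get_ready():
            heapq.heappush(ready, (priority, name, job))

    for name in sorters:
        collect_ready(name)

    rounds = []
    while ready:
        # Use every node that has a ready job, highest priority first.
        batch = [heapq.heappop(ready) for _ in range(min(node_count, len(ready)))]
        rounds.append([(name, job) for _, name, job in batch])
        # Marking jobs as done may unblock their downstream jobs.
        for _, name, job in batch:
            sorters[name].done(job)
        for name in {name for _, name, _ in batch}:
            collect_ready(name)
    return rounds


workflows = {
    "nightly": (2, {"a": set(), "b": {"a"}}),
    "urgent": (1, {"x": set(), "y": set(), "z": {"x", "y"}}),
}
rounds = schedule(workflows, node_count=2)
for number, batch in enumerate(rounds, start=1):
    print(number, batch)
```

With two nodes, the urgent workflow claims both nodes in the first round, and the lower-priority job `a` takes the node that would otherwise sit idile only once `z` is the sole ready urgent job. A real orchestrator would replace the fixed rounds with completion events from the executors and persist job state as it changes.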

Process

We implemented an orchestrator/executor architecture similar to what is now seen in Gruntwork. The specifics of this implementation are omitted due to privacy constraints.

Results

Proved that such an implementation is possible within StreamSets and AWS
Created new internal templates for StreamSets deployments using custom Jython stages
Provided best practices and recommendations for taking a similar approach outside StreamSets. The lessons learned would inform later metadata-driven architecture patterns at Axis

Frank Kovacs

I make data work

Senior data engineer and proven builder specializing in enterprise data platforms, AI deployments, GitOps and distributed computing.
