Metadata Driven Orchestration POC

Friday, November 1, 2024

I designed and built a fully metadata-driven architecture that both saturates available compute resources and orchestrates jobs and bundles.


Premise

A client retained my team to explore alternate architectural patterns for a data platform built on StreamSets. The data platform relied on two concepts:

  • Jobs: Individual units of work executed within an AWS environment
  • Workflows: Compositions of jobs arranged as a directed acyclic graph

Given an arbitrary number of compute nodes, our task was to create an implementation in StreamSets that tracks progress, executes workflows (and the jobs within them) in an arbitrary priority order, and saturates the available compute nodes.
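
To make the problem concrete, here is a minimal Python sketch of priority-ordered scheduling of a job DAG onto a fixed number of nodes. It is an illustration only, not the StreamSets implementation: the function name schedule, the shape of the deps and priority metadata, the job names, and the assumption that every job takes one unit of time are all hypothetical.

```python
import heapq


def schedule(deps, priority, num_nodes):
    """Simulate running a DAG of unit-length jobs on num_nodes nodes.

    deps:     job -> set of jobs that must finish first (assumed metadata
              shape; every job must appear as a key).
    priority: job -> number; a lower number runs first.
    Returns a list of ticks, each listing the jobs run together in that tick.
    """
    remaining = {job: set(parents) for job, parents in deps.items()}
    dependents = {job: [] for job in deps}
    for job, parents in remaining.items():
        for parent in parents:
            dependents[parent].append(job)

    # Jobs with no unmet dependencies are ready; the heap orders by priority.
    ready = [(priority[job], job) for job, parents in remaining.items() if not parents]
    heapq.heapify(ready)

    status = {job: "PENDING" for job in deps}  # simple progress tracking
    ticks = []
    while ready:
        # Saturate the nodes: take as many ready jobs as there are nodes.
        batch = [heapq.heappop(ready)[1] for _ in range(min(num_nodes, len(ready)))]
        for job in batch:
            status[job] = "DONE"
            for child in dependents[job]:
                remaining[child].discard(job)
                if not remaining[child]:
                    heapq.heappush(ready, (priority[child], child))
        ticks.append(batch)

    if any(state != "DONE" for state in status.values()):
        raise ValueError("cycle detected: workflow is not a DAG")
    return ticks


# Hypothetical workflow metadata.
deps = {
    "extract": set(),
    "clean": {"extract"},
    "audit": {"extract"},
    "load": {"clean", "audit"},
    "report": set(),
}
priority = {"extract": 0, "clean": 1, "audit": 2, "load": 3, "report": 5}

plan = schedule(deps, priority, num_nodes=2)
```

With two nodes, plan comes out as [["extract", "report"], ["clean", "audit"], ["load"]]: both nodes stay busy whenever at least two jobs are ready, and priority decides which ready jobs go first. A real orchestrator would also have to handle varying job durations, failures, and retries, which this sketch ignores.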

Process

We implemented an orchestrator / executor architecture similar to what is now seen in Gruntwork. The specifics of this implementation are omitted due to privacy constraints.

Results

  • Proved that such an implementation is possible within StreamSets and AWS
  • Created new internal templates for StreamSets deployment using custom Jython stages
  • Provided best practices and recommendations for taking a similar approach outside StreamSets. The lessons learned later informed metadata-driven architecture patterns at Axis