Premise
A client retained my team to explore alternative architectural patterns for a data platform built on StreamSets. The platform relied on two concepts:
- Jobs: Individual units of work executed within an AWS environment
- Workflow: Composition of jobs arranged as a directed acyclic graph
Given an arbitrary number of compute nodes, our task was to build an implementation in StreamSets that tracks progress, executes workflows (and their jobs) in an arbitrary priority order, and saturates the available compute nodes.
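To make the scheduling problem concrete, here is a minimal, generic sketch in Python of priority-aware dispatch over a job graph with a fixed number of nodes. It is not the client implementation and does not use StreamSets. The function name `schedule`, the job names, and the priority values are all hypothetical, and the sketch assumes every job takes one "round" to finish, which real workloads would not.

```python
import heapq
from collections import defaultdict


def schedule(jobs, deps, priority, num_nodes):
    """Simulate dispatching a workflow (a DAG of jobs) onto num_nodes nodes.

    jobs:      list of job names (hypothetical)
    deps:      dict mapping a job to the set of jobs it must wait for
    priority:  dict mapping a job to an int; lower values run first
    Returns a list of rounds; each round lists the jobs run in parallel.
    Assumption: every job occupies one node for exactly one round.
    """
    # Number of unfinished prerequisites per job.
    remaining = {job: len(deps.get(job, ())) for job in jobs}

    # Reverse edges: which jobs become closer to ready when a job finishes.
    dependents = defaultdict(list)
    for job, parents in deps.items():
        for parent in parents:
            dependents[parent].append(job)

    # Min-heap of ready jobs, ordered by (priority, name).
    ready = [(priority[job], job) for job, count in remaining.items() if count == 0]
    heapq.heapify(ready)

    rounds = []
    finished = 0
    while ready:
        # Fill as many nodes as there are ready jobs, highest priority first.
        batch = [heapq.heappop(ready)[1] for _ in range(min(num_nodes, len(ready)))]
        rounds.append(batch)
        for job in batch:
            finished += 1
            for child in dependents[job]:
                remaining[child] -= 1
                if remaining[child] == 0:
                    heapq.heappush(ready, (priority[child], child))

    if finished != len(remaining):
        raise ValueError("workflow contains a cycle, so it is not a DAG")
    return rounds


# Hypothetical workflow: two extracts feed a transform, which feeds two sinks.
jobs = ["extract_a", "extract_b", "transform", "load", "report"]
deps = {
    "transform": {"extract_a", "extract_b"},
    "load": {"transform"},
    "report": {"transform"},
}
priority = {"extract_a": 1, "extract_b": 2, "transform": 1, "load": 1, "report": 2}

for round_number, batch in enumerate(schedule(jobs, deps, priority, num_nodes=2), start=1):
    print(round_number, batch)
```

With two nodes, both extracts run together, the transform runs alone because nothing else is ready, and the two sinks then share the nodes. The idle node in the middle round shows why saturation depends on the shape of the graph as well as on the scheduler.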
Process
We implemented an orchestrator/executor architecture similar to what is now seen in Gruntwork. The specifics of this implementation are omitted due to privacy constraints.
Results
- Proved that such an implementation is possible within StreamSets and AWS
- Created new internal templates for StreamSets deployment using custom Jython stages
- Provided best practices and recommendations for applying a similar approach outside StreamSets. The lessons learned later informed metadata-driven architecture patterns at Axis