Custom Scala ETL Framework

Friday, November 1, 2024


I implemented a custom ETL framework in Scala, run on Databricks as the execution engine, for processing media data.


Premise

A media data startup that handles hundreds of gigabytes of data each day retained my team to build a new ETL framework for processing daily vendor data. The existing infrastructure cost thousands of dollars per day and was operationally inefficient in its error logging, failure handling, and processing speed. Specifically, the new ETL framework was to emphasize the following:

  • Fail-fast execution, aborting a bad run immediately to avoid incurring additional compute costs
  • A monadic pipeline built on higher-kinded types (see the sketch after this list)
  • Detailed logging of stage progress and failures
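
The client's implementation is private, but the shape of the design can be sketched. The minimal Scala sketch below, with hypothetical names throughout (MonadLike, Stage, Result, logged), shows the idea: each stage runs in some effect type F[_], stages compose through flatMap, and a failure short-circuits everything downstream while a logging wrapper records where the run stopped.

```scala
// Illustrative sketch only: MonadLike, Stage, and Result are hypothetical
// names, not the client framework's actual API.

// A tiny tagless-final algebra: any effect F[_] that can sequence
// computations can drive a pipeline.
trait MonadLike[F[_]] {
  def pure[A](a: A): F[A]
  def flatMap[A, B](fa: F[A])(f: A => F[B]): F[B]
}

final case class Stage[F[_], A, B](name: String, run: A => F[B]) {
  // Composition goes through flatMap, so if run(a) fails, the next stage
  // is never invoked: the pipeline fails fast by construction.
  def andThen[C](next: Stage[F, B, C])(implicit M: MonadLike[F]): Stage[F, A, C] =
    Stage(s"$name -> ${next.name}", a => M.flatMap(run(a))(next.run))
}

object EitherPipeline {
  // The simplest interpreter: Left short-circuits, Right continues.
  type Result[A] = Either[Throwable, A]

  implicit val resultMonad: MonadLike[Result] = new MonadLike[Result] {
    def pure[A](a: A): Result[A] = Right(a)
    def flatMap[A, B](fa: Result[A])(f: A => Result[B]): Result[B] = fa.flatMap(f)
  }

  // Logging wrapper: record when each stage starts and how it ends.
  def logged[A, B](stage: Stage[Result, A, B]): Stage[Result, A, B] =
    Stage(stage.name, a => {
      println(s"[start] ${stage.name}")
      val out = stage.run(a)
      out.fold(
        err => println(s"[fail]  ${stage.name}: ${err.getMessage}"),
        _   => println(s"[ok]    ${stage.name}")
      )
      out
    })
}
```

Chaining stages with andThen yields a single composite Stage whose run returns the first failure it encounters; nothing after the failing stage executes, which is what keeps a doomed run from accruing further compute time.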

Process

We implemented a Scala package that was installed as a library on a Databricks cluster and run through an orchestrator, along the lines of the sketch below. The specifics of the project are withheld to maintain client privacy.
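
For a sense of how such a package is typically wired into a Databricks job, here is a hypothetical entry point (the object name, mount path, and argument convention are illustrative assumptions, not the client's actual setup): the orchestrator triggers a JAR task with the vendor name as an argument, each step runs inside Either so the first failure aborts the run, and errors are rethrown so the job itself is marked failed.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Hypothetical JAR entry point for a Databricks job; the real package,
// paths, and orchestrator wiring are omitted for client privacy.
object EtlJob {
  def main(args: Array[String]): Unit = {
    val vendor = args.headOption.getOrElse(sys.error("usage: EtlJob <vendorName>"))
    // On Databricks, getOrCreate attaches to the cluster's existing session.
    val spark = SparkSession.builder().appName(s"etl-$vendor").getOrCreate()

    // Each step runs inside Either, so the first failure short-circuits
    // the rest of the run instead of burning further cluster time.
    val result: Either[Throwable, Long] = for {
      raw <- Try(spark.read.parquet(s"/mnt/vendor/$vendor/incoming")).toEither
      n   <- Try(raw.count()).toEither
    } yield n

    result.fold(
      err => throw new RuntimeException(s"ETL failed fast for vendor '$vendor'", err),
      n   => println(s"[$vendor] processed $n rows")
    )
  }
}
```

Rethrowing on failure is deliberate: it surfaces the error to the orchestrator, which can then alert or retry, instead of letting the job report success on a bad run.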

Results

  • Daily compute costs reduced by 70%

Lessons Learned

  • Utilize your project manager