Custom Scala ETL Framework

Friday, November 1, 2024


I implemented a custom ETL framework in Scala, run on Databricks as the execution engine, for processing media data.


Premise

A media data startup that handles hundreds of gigabytes of data each day retained my team to build a new ETL framework for processing daily vendor data. The existing infrastructure cost thousands of dollars per day and was operationally inefficient in its error logging, failure handling, and processing speed. Specifically, the new ETL framework was to emphasize the following:

  • Fail-fast execution, aborting a bad run immediately to avoid incurring additional compute costs
  • A monadic pipeline built on higher-kinded types (see the sketch after this list)
  • Detailed logging of stage progress and failures
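
The client's implementation is private, but the shape of the design can be sketched. The minimal Scala sketch below, with hypothetical names throughout (MonadLike, Stage, Result, logged), shows the idea: each stage runs in some effect type F[_], stages compose through flatMap, and a failure short-circuits everything downstream while a logging wrapper records where the run stopped.

```scala
// Illustrative sketch only: MonadLike, Stage, and Result are hypothetical
// names, not the client framework's actual API.

// A tiny tagless-final algebra: any effect F[_] that can sequence
// computations can drive a pipeline.
trait MonadLike[F[_]] {
  def pure[A](a: A): F[A]
  def flatMap[A, B](fa: F[A])(f: A => F[B]): F[B]
}

final case class Stage[F[_], A, B](name: String, run: A => F[B]) {
  // Composition goes through flatMap, so if run(a) fails, the next stage
  // is never invoked: the pipeline fails fast by construction.
  def andThen[C](next: Stage[F, B, C])(implicit M: MonadLike[F]): Stage[F, A, C] =
    Stage(s"$name -> ${next.name}", a => M.flatMap(run(a))(next.run))
}

object EitherPipeline {
  // The simplest interpreter: Left short-circuits, Right continues.
  type Result[A] = Either[Throwable, A]

  implicit val resultMonad: MonadLike[Result] = new MonadLike[Result] {
    def pure[A](a: A): Result[A] = Right(a)
    def flatMap[A, B](fa: Result[A])(f: A => Result[B]): Result[B] = fa.flatMap(f)
  }

  // Logging wrapper: record when each stage starts and how it ends.
  def logged[A, B](stage: Stage[Result, A, B]): Stage[Result, A, B] =
    Stage(stage.name, a => {
      println(s"[start] ${stage.name}")
      val out = stage.run(a)
      out.fold(
        err => println(s"[fail]  ${stage.name}: ${err.getMessage}"),
        _   => println(s"[ok]    ${stage.name}")
      )
      out
    })
}
```

Chaining stages with andThen yields a single composite Stage whose run returns the first failure it encounters; nothing after the failing stage executes, which is what keeps a doomed run from accruing further compute time.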

Process

We implemented a Scala package that was installed as a library on a Databricks cluster and run through an orchestrator, along the lines of the sketch below. The specifics of the project are withheld to maintain client privacy.
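
For a sense of how such a package is typically wired into a Databricks job, here is a hypothetical entry point (the object name, mount path, and argument convention are illustrative assumptions, not the client's actual setup): the orchestrator triggers a JAR task with the vendor name as an argument, each step runs inside Either so the first failure aborts the run, and errors are rethrown so the job itself is marked failed.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Hypothetical JAR entry point for a Databricks job; the real package,
// paths, and orchestrator wiring are omitted for client privacy.
object EtlJob {
  def main(args: Array[String]): Unit = {
    val vendor = args.headOption.getOrElse(sys.error("usage: EtlJob <vendorName>"))
    // On Databricks, getOrCreate attaches to the cluster's existing session.
    val spark = SparkSession.builder().appName(s"etl-$vendor").getOrCreate()

    // Each step runs inside Either, so the first failure short-circuits
    // the rest of the run instead of burning further cluster time.
    val result: Either[Throwable, Long] = for {
      raw <- Try(spark.read.parquet(s"/mnt/vendor/$vendor/incoming")).toEither
      n   <- Try(raw.count()).toEither
    } yield n

    result.fold(
      err => throw new RuntimeException(s"ETL failed fast for vendor '$vendor'", err),
      n   => println(s"[$vendor] processed $n rows")
    )
  }
}
```

Rethrowing on failure is deliberate: it surfaces the error to the orchestrator, which can then alert or retry, instead of letting the job report success on a bad run.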

Results

  • Daily compute costs reduced by 70%

Lessons Learned

  • Utilize your project manager