Premise
Internal stakeholders were testing an AI-powered procedure that transforms an arbitrary number of files. A limiting factor in the process was large, often duplicative, input bundles. I was tasked with creating a process that reduces the size of input bundles for faster processing. Success was to be measured by how much we could reduce the number of inputs with confidence. The stakeholders advised erring on the side of caution, as the consequences of erroneously leaving a redundant artifact in the input were less severe than those of excluding a needed one.
Process
I conducted user interviews with firm partners, tool creators, and end users. These interviews revealed that inputs were often clustered by business case yet rarely carried explicit tags for those groups. The hypothesis was that each cluster of arbitrary size N contained a single useful artifact and N-1 logical duplicates. After defining what counted as a logical duplicate, I created the following process:
- Cluster inputs based on metadata and content with an emphasis on content-based markers
- Tune the hyperparameter K, the number of clusters, to a value that minimizes within-cluster variation (sketched in the first code example after this list)
- Design the clustering to favor smaller, more similar clusters and route everything else into a large “spare sock drawer” cluster
- Produce the reduced input list: a representative artifact from each small cluster plus all items from the leftover cluster (see the selection sketch after this list)
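As an illustration of the first two steps, here is a minimal sketch of content-based feature extraction and K tuning, assuming text-like inputs and scikit-learn. The function names, the TF-IDF featurization, and the inertia sweep are assumptions for illustration, not the production package's API.

```python
# Sketch only: featurize file content and sweep K, recording within-cluster
# variation (inertia) so the caller can pick the elbow of the curve.
from pathlib import Path

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def load_corpus(input_dir: str) -> list[str]:
    """Read each input file as raw text (content-based markers)."""
    return [
        p.read_text(errors="ignore")
        for p in sorted(Path(input_dir).glob("*"))
        if p.is_file()
    ]


def tune_k(docs: list[str], k_values: range) -> dict[int, float]:
    """Fit K-means for each candidate K and return its inertia."""
    features = TfidfVectorizer(max_features=5000).fit_transform(docs)
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).inertia_
        for k in k_values
    }
```

And a sketch of the selection step, which keeps one representative per tight cluster and passes every “spare sock drawer” item through untouched. The cohesion threshold and helper names are likewise illustrative assumptions.

```python
# Sketch only: cohesive clusters are treated as duplicate groups (keep the
# most central file); loose clusters are the leftover drawer (keep everything).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances


def select_artifacts(features, paths, k: int, tightness: float = 0.15):
    """Return the reduced input list for a chosen K."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    keep = []
    for label in range(k):
        idx = np.where(km.labels_ == label)[0]
        center = km.cluster_centers_[label].reshape(1, -1)
        dists = cosine_distances(features[idx], center).ravel()
        if dists.mean() <= tightness:
            # Cohesive cluster: likely duplicates, keep the most central artifact.
            keep.append(paths[idx[dists.argmin()]])
        else:
            # "Spare sock drawer": keep all items to err on the side of caution.
            keep.extend(paths[i] for i in idx)
    return keep
```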
Results
The resulting process was a Python package with a command-line interface (sketched below) that both:
- Produced an optimal number of clusters K
- Produced the outputs of those clusters for a given value of K
Even while biasing toward smaller clusters, the process produced input bundles that were 30% smaller.
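A hypothetical shape of that command-line interface, assuming argparse with two subcommands that mirror the bullets above; the command and flag names are invented for illustration and are not the package's documented interface.

```python
# Sketch only: two subcommands, one to tune K and one to emit the reduced
# artifact list for a chosen K.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="bundle-dedupe")
    sub = parser.add_subparsers(dest="command", required=True)

    tune = sub.add_parser("tune-k", help="report within-cluster variation per K")
    tune.add_argument("input_dir")
    tune.add_argument("--k-max", type=int, default=20)

    run = sub.add_parser("cluster", help="emit the reduced artifact list for a given K")
    run.add_argument("input_dir")
    run.add_argument("--k", type=int, required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # dispatch to tune_k(...) or select_artifacts(...) goes here
```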
Skills Used
- Machine Learning: Clustering
- R: Exploratory Data Analysis
- Git: Version control to store artifacts
- Python: Final deliverable with CLI