Premise
Internal stakeholders were testing an AI-powered procedure that transforms an arbitrary number of files. A limiting factor in the process was large, often duplicative, input bundles. I was tasked with creating a process that reduces the size of input bundles for faster processing. Success was to be measured by how much we could reduce the number of inputs with confidence. The stakeholders advised erring on the side of caution, as the consequences of erroneously leaving a redundant artifact in the input were less severe than those of excluding a needed one.
Process
I conducted user interviews with firm partners, tool creators, and end users. These interviews revealed that inputs were often clustered by business case yet rarely carried explicit tags for those groups. The hypothesis was that each cluster of arbitrary size N contained a single useful artifact and N-1 logical duplicates. After defining what counted as a logical duplicate, I created the following process:
- Cluster inputs based on metadata and content with an emphasis on content-based markers
- Tune the hyperparameter K, the number of clusters, to a value that minimizes within-cluster variation (sketched in the first code example after this list)
- Design the clustering to favor smaller, more similar clusters and route everything else into a large “spare sock drawer” cluster
- Produce the reduced input list: a representative artifact from each small cluster plus all items from the leftover cluster (see the selection sketch after this list)
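As an illustration of the first two steps, here is a minimal sketch of content-based feature extraction and K tuning, assuming text-like inputs and scikit-learn. The function names, the TF-IDF featurization, and the inertia sweep are assumptions for illustration, not the production package's API.

```python
# Sketch only: featurize file content and sweep K, recording within-cluster
# variation (inertia) so the caller can pick the elbow of the curve.
from pathlib import Path

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def load_corpus(input_dir: str) -> list[str]:
    """Read each input file as raw text (content-based markers)."""
    return [
        p.read_text(errors="ignore")
        for p in sorted(Path(input_dir).glob("*"))
        if p.is_file()
    ]


def tune_k(docs: list[str], k_values: range) -> dict[int, float]:
    """Fit K-means for each candidate K and return its inertia."""
    features = TfidfVectorizer(max_features=5000).fit_transform(docs)
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).inertia_
        for k in k_values
    }
```

And a sketch of the selection step, which keeps one representative per tight cluster and passes every “spare sock drawer” item through untouched. The cohesion threshold and helper names are likewise illustrative assumptions.

```python
# Sketch only: cohesive clusters are treated as duplicate groups (keep the
# most central file); loose clusters are the leftover drawer (keep everything).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances


def select_artifacts(features, paths, k: int, tightness: float = 0.15):
    """Return the reduced input list for a chosen K."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    keep = []
    for label in range(k):
        idx = np.where(km.labels_ == label)[0]
        center = km.cluster_centers_[label].reshape(1, -1)
        dists = cosine_distances(features[idx], center).ravel()
        if dists.mean() <= tightness:
            # Cohesive cluster: likely duplicates, keep the most central artifact.
            keep.append(paths[idx[dists.argmin()]])
        else:
            # "Spare sock drawer": keep all items to err on the side of caution.
            keep.extend(paths[i] for i in idx)
    return keep
```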
Results
The resulting process was a Python package with a command-line interface (sketched below) that both:
- Produced an optimal number of clusters K
- Produced the outputs of those clusters for a given value of K
Even while biasing toward smaller clusters, the process produced input bundles that were 30% smaller.
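A hypothetical shape of that command-line interface, assuming argparse with two subcommands that mirror the bullets above; the command and flag names are invented for illustration and are not the package's documented interface.

```python
# Sketch only: two subcommands, one to tune K and one to emit the reduced
# artifact list for a chosen K.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="bundle-dedupe")
    sub = parser.add_subparsers(dest="command", required=True)

    tune = sub.add_parser("tune-k", help="report within-cluster variation per K")
    tune.add_argument("input_dir")
    tune.add_argument("--k-max", type=int, default=20)

    run = sub.add_parser("cluster", help="emit the reduced artifact list for a given K")
    run.add_argument("input_dir")
    run.add_argument("--k", type=int, required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # dispatch to tune_k(...) or select_artifacts(...) goes here
```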
Skills Used
- Machine Learning: Clustering
- R: Exploratory Data Analysis
- Git: Version control to store artifacts
- Python: Final deliverable with CLI