Jonathan Will

Projects and Research

View Full Portfolio

Resource-Efficient Large-Scale Data Processing

2020-2025

Here I present some of the main research I conducted between 2020 and 2025: methods for selecting suitable compute resources for large-scale data processing workloads. This work will conclude with a doctoral degree in late January 2026.

Research Problem

Distributed dataflow systems such as Apache Spark and Apache Flink enable data-parallel processing of large datasets on clusters. Yet, selecting computational resources for dataflow jobs that neither lead to bottlenecks nor to low resource utilization is often challenging, even for expert users such as data engineers. An efficient resource configuration optimizes for low execution cost and low execution duration simultaneously, an optimization problem that continues to attract considerable research attention.

Figure: Problem description
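
To make the trade-off concrete, here is a minimal Python sketch that compares a few hypothetical cluster configurations by execution duration and resulting execution cost; the machine types, prices, and runtimes are made up purely for illustration:

    # Hypothetical cluster configurations: (machine type, node count,
    # price per node-hour in USD, expected job runtime in hours).
    configurations = [
        ("small-memory", 8, 0.20, 3.0),
        ("large-memory", 8, 0.40, 1.4),
        ("large-memory", 16, 0.40, 0.9),
    ]

    for machine, nodes, price_per_hour, runtime_h in configurations:
        cost = nodes * price_per_hour * runtime_h  # total execution cost
        print(f"{nodes:2d}x {machine:<12}  runtime: {runtime_h:.1f} h  cost: ${cost:.2f}")

There is rarely a single best option: some configurations are cheaper, others faster, and a bottleneck such as too little memory can make a seemingly cheap configuration both slow and expensive.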

Main Results

The following are some of my main publications addressing this research problem.

Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs [PDF]

This paper provides a problem analysis of model-based performance prediction for dataflow jobs on different types of cloud infrastructure. It identifies how different aspects of the execution context influence job performance and draws conclusions for designing collaborative systems that share performance metrics and performance models for data processing jobs.

C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds [PDF]

This work identifies how different aspects of the execution context influence job performance and designs a collaborative system for sharing performance metrics and performance models for data processing jobs. It includes system specifications and generalized performance models, which are implemented and evaluated. By considering the broader execution context, these performance models achieve better prediction results than performance models from related work.
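
The following minimal Python sketch illustrates the general idea of learning from shared runtime metrics; the features, data points, and the choice of a gradient boosting model are assumptions for illustration, not the actual C3O models:

    from sklearn.ensemble import GradientBoostingRegressor

    # Shared records from previous executions (possibly contributed by other
    # users): [node count, memory per node in GB, dataset size in GB] and the
    # observed runtime in seconds.
    X = [
        [4, 16, 100], [8, 16, 100], [8, 32, 100],
        [4, 16, 200], [8, 16, 200], [16, 32, 400],
    ]
    y = [3600, 2000, 1700, 7000, 3900, 3600]

    model = GradientBoostingRegressor().fit(X, y)

    # Predict the runtime of a candidate configuration for a new execution.
    candidate = [[8, 32, 200]]
    print(f"predicted runtime: {model.predict(candidate)[0]:.0f} s")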

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud [PDF]

This paper addresses the data-sharing aspect, with a particular focus on increasing resource efficiency when storing and exchanging potentially large amounts of training data.
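
As a rough illustration of the underlying trade-off, the Python sketch below uniformly subsamples a set of shared training records before storing or exchanging them; the data and the simple subsampling strategy are placeholders, not the reduction technique from the paper:

    import random

    # Synthetic training records: (dataset size in GB, observed runtime in s).
    full_training_set = [(n, n * 450 + random.uniform(-50, 50)) for n in range(1, 201)]

    # Keep only every k-th record before sharing it with other users.
    k = 10
    reduced_training_set = full_training_set[::k]

    print(f"records kept: {len(reduced_training_set)} of {len(full_training_set)}")

A model trained on the reduced set would then be validated against held-out runs to confirm that prediction quality has not degraded noticeably.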

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing [PDF]

This work focuses on appropriate memory allocation, since it has been shown to typically have the greatest impact on resource efficiency. Systems like Spark use memory for caching parts of the dataset, while a lack of memory leads to more intensive use of the disk, slowing down processing. By conducting profiling runs on reduced hardware and on small samples of the input dataset, we can extrapolate the memory usage of the full execution. The derived resource configurations have shown high cost-efficiency compared to related work.
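
A minimal Python sketch of this kind of extrapolation, assuming memory usage grows roughly linearly with input size; the sample sizes and measurements are invented, and Crispy's actual model selection is more involved:

    import numpy as np

    # Profiling runs on small input samples:
    # (sample size in GB, measured peak memory usage in GB).
    sample_sizes = np.array([1.0, 2.0, 4.0, 8.0])
    peak_memory = np.array([1.4, 2.1, 3.6, 6.5])

    # Fit a linear model: memory ≈ slope * input_size + base_overhead.
    slope, base = np.polyfit(sample_sizes, peak_memory, deg=1)

    full_input_size = 500.0  # GB, size of the real dataset
    estimated_memory = slope * full_input_size + base
    print(f"estimated peak memory for the full input: {estimated_memory:.0f} GB")

The estimate can then guide the choice of machine types with enough memory to avoid spilling to disk, without grossly over-provisioning.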

Flora: Efficient Cloud Resource Selection for Big Data Processing via Job Classification [PDF]

This most recent work uses statistical methods in combination with infrastructure profiling to predict resource options' relative positions in the cost–duration space. It exploits the similarity of resource preferences among jobs with similar memory access patterns. The derived resource configurations have shown high resource efficiency, and the method features very low exploration overhead compared to related work. Keeping this resource selection overhead low is a central theme in my doctoral thesis, which will soon be published here.
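
To illustrate the classification idea in a few lines of Python: a new job is assigned to a category of jobs with similar memory access behavior, and a configuration that worked well for that category is reused. The categories, indicators, and machine types below are illustrative assumptions, not Flora's actual classification:

    # Best observed configurations per job category, e.g. from prior executions.
    best_config_by_category = {
        "compute-heavy": {"machine": "c5.2xlarge", "nodes": 8},
        "memory-heavy": {"machine": "r5.2xlarge", "nodes": 6},
        "io-heavy": {"machine": "m5.2xlarge", "nodes": 10},
    }

    def classify_job(dataset_passes: int, cache_fraction: float) -> str:
        """Assign a job to a category based on simple memory access indicators."""
        if cache_fraction > 0.5 or dataset_passes > 3:
            return "memory-heavy"
        if dataset_passes <= 1:
            return "io-heavy"
        return "compute-heavy"

    # New job: iterates over the dataset several times and caches much of it.
    category = classify_job(dataset_passes=5, cache_fraction=0.7)
    print(category, "->", best_config_by_category[category])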

A complete list of more than 20 papers I co-authored during my time in research is available on my Google Scholar profile.