Jonathan Will

Projects and Research

Collaborative Cluster Configuration

2020-2021

Here I present some of my main research from 2020 and 2021, including collaborative methods for selecting suitable compute resources for large-scale data processing.

Research Problem

Distributed dataflow systems such as Apache Spark and Apache Flink enable data-parallel processing of large datasets on clusters. Yet selecting appropriate computational resources for dataflow jobs, resources that lead neither to bottlenecks nor to low resource utilization, is often challenging, even for expert users such as data engineers. This is an optimization problem that many researchers around the world are currently working on.

Problem description
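To make the resource selection problem concrete, here is a minimal sketch of model-based configuration. It is not the method from the papers below. It assumes a deliberately simple runtime model, runtime = a + b / scale_out, fitted to a handful of made-up measurements, and then picks the smallest node count whose predicted runtime meets a target. The function names, the data, and the runtime target are all hypothetical.

```python
from typing import Iterable, Optional, Sequence, Tuple


def fit_runtime_model(scale_outs: Sequence[int],
                      runtimes: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of runtime = a + b / scale_out (an assumed model form)."""
    xs = [1.0 / s for s in scale_outs]  # regress runtime on 1 / scale_out
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(runtimes) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, runtimes))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def choose_scale_out(a: float, b: float,
                     candidates: Iterable[int],
                     max_runtime: float) -> Optional[int]:
    """Return the smallest candidate node count predicted to meet the target."""
    for s in sorted(candidates):
        if a + b / s <= max_runtime:
            return s
    return None  # no candidate is predicted to be fast enough


# Hypothetical measurements: (number of nodes, runtime in seconds).
observed_scale_outs = [2, 4, 8, 16]
observed_runtimes = [620.0, 320.0, 170.0, 95.0]

a, b = fit_runtime_model(observed_scale_outs, observed_runtimes)
best = choose_scale_out(a, b, range(2, 33), max_runtime=150.0)
print(f"fitted a={a:.1f}, b={b:.1f}, chosen scale-out={best}")
```

Picking the smallest node count that meets the target avoids both failure modes named above: too few nodes miss the target (a bottleneck), while more nodes than needed sit underutilized. Real jobs depend on much more than the node count, which is why the papers below consider the broader execution context.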

Main Results

These are my main first-author papers in this context:

Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs [PDF]

This paper analyzes the problem of model-based performance prediction for dataflow jobs on different types of cloud infrastructure. It identifies how different aspects of the execution context influence job performance and draws conclusions for designing collaborative systems that share performance metrics and performance models for data processing jobs.

C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds [PDF]

“C3O” continues the previous paper with a system implementation and evaluation. It includes further system specifications and generalized performance models. Because these models consider the broader execution context, they achieved better results than performance models from related work.

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud [PDF]

This paper addresses the data-sharing aspect, with a particular focus on increasing resource efficiency when storing and exchanging potentially large amounts of data.

A complete list of my co-authored papers is available on my Google Scholar profile.