This page summarizes my main research from 2020 and 2021, which centers on collaborative methods for selecting suitable compute resources for large-scale data processing.
Research Problem
Distributed dataflow systems such as Apache Spark and Apache Flink enable data-parallel processing of large datasets on clusters. Yet selecting computational resources for dataflow jobs that lead neither to bottlenecks nor to low resource utilization is often challenging, even for expert users such as data engineers. This is an optimization problem that many researchers around the world are currently working on.
Main Results
These are my main first-author papers in this context:
Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs [PDF]
This paper provides a problem analysis of model-based performance
prediction of dataflow jobs on different types of cloud infrastructures.
It identifies how different aspects of the execution context contribute
to job performance and draws conclusions for designing collaborative systems for sharing
performance metrics and performance models for data processing jobs.
C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds [PDF]
“C3O” continues the previous paper with a system implementation and
evaluation. It includes further system specifications and generalized
performance models. By considering the broader execution context, these
models have shown better results than performance models from
related work.
Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud [PDF]
This paper addresses the data-sharing aspect, with a particular
focus on increasing resource efficiency when storing and exchanging
potentially large amounts of training data.
A complete list of my co-authored papers is available on my Google Scholar profile.