PI: Volkan Vural, Kyle Marcus
Institution: University of California, San Diego
Our goal is to schedule the pieces of a workflow intelligently so that they take advantage of different cloud architectures. To start our investigation, we selected Nautilus, our Kubernetes cluster, as the platform on which to collect metrics for workflow jobs and to build a smart scheduler. Nautilus is a collection of GPU processing nodes within NSF CHASE-CI, a network of fast GPU appliances for machine learning and storage that is managed through Kubernetes on the high-speed Pacific Research Platform (PRP).
Our first task was to identify the metrics that describe how the resources allocated to a job are utilized. These metrics will then be used as inputs for training machine learning models that predict resource utilization. We identified 57 metrics that are pulled from Prometheus, a monitoring toolkit commonly used with Kubernetes. These metrics cover CPU, memory, disk, and network usage. In addition to these existing metrics, we created a script that runs a set of simple operations and measures the time they take to complete. These time measurements indicate the level of contention for the resources allocated to containers.
The metrics are collected at the container level, and a single job can run across multiple containers. We therefore also created aggregated container metrics that capture the overall resources used by a job that runs across multiple containers. We then designed a JSON schema to store the collected data in MongoDB.
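The contention measurement described above can be illustrated with a minimal sketch. This is not the project's actual script: the probe operations, the function names (cpu_probe, memory_probe, time_probe, collect_contention_sample), the workload sizes, and the output fields are all assumptions chosen for illustration. The idea is only that a fixed, simple workload should take longer when the container's resources are contended.

```python
import statistics
import time


def cpu_probe(n=200_000):
    """Hypothetical CPU-bound probe: a fixed amount of integer arithmetic."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def memory_probe(size=1_000_000):
    """Hypothetical memory probe: allocate a buffer and touch one byte per 4 KiB."""
    buf = bytearray(size)
    return sum(buf[::4096])


def time_probe(probe, repeats=5):
    """Run a probe several times and summarize its wall-clock duration in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        probe()
        samples.append(time.perf_counter() - start)
    return {"median_s": statistics.median(samples), "max_s": max(samples)}


def collect_contention_sample():
    """Build one record of probe timings (field names are illustrative only)."""
    return {
        "timestamp": time.time(),
        "cpu_probe": time_probe(cpu_probe),
        "memory_probe": time_probe(memory_probe),
    }


if __name__ == "__main__":
    print(collect_contention_sample())
```

Run periodically inside a container, such a probe would yield a time series in which rising durations for the same fixed workload suggest contention. The median is reported alongside the maximum so that a single slow run does not dominate the signal.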
We then designed an interface for creating and executing the workflows and jobs that will be used for data collection. The process is fully automated: it starts with the execution of a job and continues with data collection until the collected data is saved in MongoDB.
Finally, we designed an early prototype of a machine learning model that predicts resource utilization, using dummy datasets. We will update this model once real datasets have been collected.
As a next task, we plan to extract job metrics with Pandas in Python and turn the collected metric data into features for the machine learning model. We will also integrate the contention data collector script into our process. Once our infrastructure is finalized, we will run our data collection process on the PNNL cluster as well. Our final goal is to design a smart scheduler that, for a given list of jobs and resources, optimizes resource utilization and finishes the jobs in the minimum amount of time.
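As a rough illustration of the planned feature extraction, the sketch below rolls container-level samples up into one feature row per job with Pandas. The column names (job_id, container, cpu_cores, memory_mb), the sample values, and the choice of aggregates are assumptions made for this example; they are not the project's schema or its actual 57 metrics.

```python
import pandas as pd

# Hypothetical container-level samples; names and values are illustrative only.
samples = pd.DataFrame({
    "job_id": ["job-a", "job-a", "job-a", "job-a", "job-b", "job-b"],
    "container": ["c1", "c1", "c2", "c2", "c1", "c1"],
    "cpu_cores": [0.5, 0.7, 0.2, 0.4, 1.0, 1.2],
    "memory_mb": [100.0, 120.0, 80.0, 90.0, 500.0, 520.0],
})


def job_features(df):
    """Aggregate container-level samples into one feature row per job."""
    # Step 1: summarize each container over time (mean CPU, peak memory).
    per_container = df.groupby(["job_id", "container"]).agg(
        cpu_mean=("cpu_cores", "mean"),
        mem_peak=("memory_mb", "max"),
    )
    # Step 2: combine the containers that belong to the same job.
    # Summing per-container peaks gives an upper bound on the job's peak
    # memory, because the peaks need not occur at the same moment.
    per_job = per_container.groupby(level="job_id").agg(
        cpu_mean_total=("cpu_mean", "sum"),
        mem_peak_total=("mem_peak", "sum"),
    )
    per_job["n_containers"] = per_container.groupby(level="job_id").size()
    return per_job.reset_index()


features = job_features(samples)
print(features)
```

Aggregating in two steps keeps the container-level summaries available for inspection and makes the job-level combination rule (here, a sum) easy to swap for another one if a different notion of overall utilization is preferred.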