12 May 2023

Large-scale cluster management at Google with Borg

Reference papers:

[1] Large-scale cluster management at Google with Borg

Summary

In this paper, the authors describe a cluster-management system called Borg, that is responsible for scheduling tens of thousands of jobs running on Google’s warehouse scale data centers. The authors elaborate on the key design principles behind Borg and highlight how it has improved the efficiency and reliability of Google’s data centers. Some of the challenges faced in designing such a system are discussed by the authors such as large heterogeneous workloads, optimization of resource utilization, the need for high availability and low-latency. The authors describe the usage of Borg from a user’s perspective. Users can submit jobs on Borg using a declarative configuration language called BCL. Each job can be customized using a large number of parameters that define the job’s resource usage, priority etc. The system also provides multiple levels of UIs and logging to ensure that the users can debug their jobs. This design makes the system user-friendly and abstracts away many complexities involved in scheduling. The authors then introduce the overall architecture of the system which involves a master node called Borgmaster (replicated five times) and multiple worker processes running on cluster machines called Borglets. Borg categorizes requests as jobs that runs on a cell which is a collection of machines in a cluster. Each machine has a reserved set of resources to run jobs called alloc. Once a user submits a job to Borg, it is processed by the borgmaster which has two major components – scheduler and link shard. The scheduler keeps track of pending jobs and its job is to allocate the jobs to viable cells using scheduling algorithms. The link shard is the point of contact between the master and borglets that notify the borgmaster of the current status of a cell. Finally, the authors evaluate Borg using a trace of jobs from Google’s production clusters. They studied and reported the effects of various factors on the performance of the cluster, such as bucketing vs overhead, different resource estimation settings etc. They also highlighted Borg’s fault-tolerance and flexibility, showing that it was able to handle failures and adapt to changes in workload. The experiments and successful usage of Borg in real-world Google clusters show the applicability of the proposed system at scale.

Positive Points

  • Although Borg is a centralized system, the ideas incorporated in the system such as the master distributing workload to multiple shards, scheduler caching, feasibility classes to avoid repeated checks, and link shard compression techniques allow the system to perform efficiently at a large scale. These ideas are very helpful for developers working on internet-scale centralized systems.
  • The experiments carried out by the authors were very well thought and evaluated the system through the cell compaction metric which is much more sophisticated other than the most commonly used average CPU utilization. In particular, Fig. 13 showing the latency of job scheduling for a real-world trace was very insightful.
  • The scheduling algorithm is one of the most important components of the system as it directly relates to resource utilization. The authors spend sufficient time in explaining the trade-offs involved with different scheduling algorithms and carry out extensive evaluations to study the same. The scoring methodology for scheduling is very efficient and innovative that is built as a hybrid of worst-fit and best-fit policies.

Drawbacks

  • Although Borg is very scalable and abstracts away many complexities from the users there are a few things about the system that may not be feasible for certain usecases - each job shares it IP with the host which leads to resource contention within jobs, BCL has 230 different configs which can be very difficult to configure and lack of support for multi-job workflows. However, it is also important to note that many of these issues were tackled in Kuberenetes.
  • The authors could have explained certain components of their system in more technical detail. For example, it would have been beneficial for the readers if the authors describe how the various concurrent processes are implemented and discuss the concurrency challenges if any.
  • In large-scale systems like Borg, tail latencies are a major issue. From the paper, it is unclear how Borg deals with tail latencies while scheduling jobs across multiple cells. It would have been helpful for the readers if authors included a discussion about the effect of stragglers in Borg.

Research Questions

  1. If we were to implement a cluster-management service on a smaller scale (such as CMU clusters) what changes would be required in Borg to reduce overhead? – some ideas: less replicas of masters, more communication between master and workers.
  2. In what aspects is Yarn different from Borg?
  3. How does Borg deal with stragglers?

Tags: