Abhinand Jha

Large-scale cluster management at Google with Borg

Reference papers:

[1] Large-scale cluster management at Google with Borg

Summary

In this paper, the authors describe a cluster-management system called Borg, that is responsible for scheduling tens of thousands of jobs running on Google’s warehouse scale data centers. The authors elaborate on the key design principles behind Borg and highlight how it has improved the efficiency and reliability of Google’s data centers. Some of the challenges faced in designing such a system are discussed by the authors such as large heterogeneous workloads, optimization of resource utilization, the need for high availability and low-latency. The authors describe the usage of Borg from a user’s perspective. Users can submit jobs on Borg using a declarative configuration language called BCL. Each job can be customized using a large number of parameters that define the job’s resource usage, priority etc. The system also provides multiple levels of UIs and logging to ensure that the users can debug their jobs. This design makes the system user-friendly and abstracts away many complexities involved in scheduling. The authors then introduce the overall architecture of the system which involves a master node called Borgmaster (replicated five times) and multiple worker processes running on cluster machines called Borglets. Borg categorizes requests as jobs that runs on a cell which is a collection of machines in a cluster. Each machine has a reserved set of resources to run jobs called alloc. Once a user submits a job to Borg, it is processed by the borgmaster which has two major components – scheduler and link shard. The scheduler keeps track of pending jobs and its job is to allocate the jobs to viable cells using scheduling algorithms. The link shard is the point of contact between the master and borglets that notify the borgmaster of the current status of a cell. Finally, the authors evaluate Borg using a trace of jobs from Google’s production clusters. They studied and reported the effects of various factors on the performance of the cluster, such as bucketing vs overhead, different resource estimation settings etc. They also highlighted Borg’s fault-tolerance and flexibility, showing that it was able to handle failures and adapt to changes in workload. The experiments and successful usage of Borg in real-world Google clusters show the applicability of the proposed system at scale.

Positive Points

Drawbacks

Research Questions

  1. If we were to implement a cluster-management service on a smaller scale (such as CMU clusters) what changes would be required in Borg to reduce overhead? – some ideas: less replicas of masters, more communication between master and workers.
  2. In what aspects is Yarn different from Borg?
  3. How does Borg deal with stragglers?

<< Previous Post

|

Next Post >>

#Computer Science #System Design #Distributed Systems #Backend #File Systems #Software Engineering #Storage