<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blog on Abhinand Jha</title><link>https://abhinand20.github.io/blogs/</link><description>Recent content in Blog on Abhinand Jha</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 12 May 2023 00:00:00 +0000</lastBuildDate><atom:link href="https://abhinand20.github.io/blogs/index.xml" rel="self" type="application/rss+xml"/><item><title>Large-scale cluster management at Google with Borg</title><link>https://abhinand20.github.io/blogs/2023-05-12-borg/</link><pubDate>Fri, 12 May 2023 00:00:00 +0000</pubDate><guid>https://abhinand20.github.io/blogs/2023-05-12-borg/</guid><description>&lt;p>Reference papers:&lt;/p>
&lt;p>[1] &lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf">Large-scale cluster management at Google with Borg&lt;/a>&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>In this paper, the authors describe Borg, the cluster-management system responsible for scheduling hundreds of thousands of jobs across Google’s warehouse-scale data centers. The authors elaborate on the key design principles behind Borg and highlight how it has improved the efficiency and reliability of Google’s data centers. They also discuss the challenges in designing such a system, including supporting large heterogeneous workloads, optimizing resource utilization, and meeting requirements for high availability and low latency. The authors then describe Borg from a user’s perspective: users submit jobs using a declarative configuration language called BCL, and each job can be customized with a large number of parameters that define its resource requirements, priority, and so on. The system also provides multiple levels of UIs and logging so that users can debug their jobs. This design makes the system user-friendly and abstracts away many of the complexities involved in scheduling. The authors then introduce the overall architecture of the system, which consists of a master called the Borgmaster (replicated five times) and a worker process, the Borglet, running on each cluster machine. Work is submitted to Borg as jobs, each of which runs in a cell – a collection of machines managed as a unit. An alloc is a reserved set of resources on a machine in which a job’s tasks can run. Once a user submits a job, it is processed by the Borgmaster, which relies on two major components – the scheduler and the link shards. The scheduler scans the queue of pending jobs and assigns their tasks to feasible machines using scheduling algorithms, while the link shards act as the point of contact between the master and the Borglets, relaying each machine’s current state back to the Borgmaster. Finally, the authors evaluate Borg using a trace of jobs from Google’s production clusters. 
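&lt;p>As a rough illustration of the scheduling loop described above – a minimal sketch, not Borg’s actual algorithm – the following scans pending jobs in priority order and places each one on the feasible machine with the tightest fit; all names and data structures here are hypothetical:&lt;/p>

```python
# Hypothetical sketch of a Borg-style scheduling pass (not Borg's real code):
# scan pending jobs by priority, filter feasible machines, then pick a
# best-fit machine to pack resources tightly.

def schedule(pending_jobs, machines):
    """Assign each pending job to the feasible machine with the least slack."""
    placements = {}
    for job in sorted(pending_jobs, key=lambda j: -j["priority"]):
        # feasibility check: machine must have enough free CPU and RAM
        feasible = [m for m in machines
                    if m["free_cpu"] >= job["cpu"] and m["free_ram"] >= job["ram"]]
        if not feasible:
            continue  # job stays pending until resources free up
        # best-fit scoring: choose the machine left with the least slack
        best = min(feasible, key=lambda m: (m["free_cpu"] - job["cpu"],
                                            m["free_ram"] - job["ram"]))
        best["free_cpu"] -= job["cpu"]
        best["free_ram"] -= job["ram"]
        placements[job["name"]] = best["name"]
    return placements

machines = [{"name": "m1", "free_cpu": 4, "free_ram": 8},
            {"name": "m2", "free_cpu": 2, "free_ram": 4}]
jobs = [{"name": "web", "priority": 2, "cpu": 2, "ram": 4},
        {"name": "batch", "priority": 1, "cpu": 3, "ram": 6}]
print(schedule(jobs, machines))  # → {'web': 'm2', 'batch': 'm1'}
```

&lt;p>Real Borg also factors in priorities that allow preemption, task-startup latency, and many other signals; the sketch only shows the feasibility-then-scoring shape of the loop.&lt;/p>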
They studied and reported the effects of various factors on cluster performance, such as the bucketing of resource requests and different resource-estimation settings. They also highlighted Borg’s fault tolerance and flexibility, showing that it can handle failures and adapt to changes in workload. These experiments, together with Borg’s successful use in real-world Google clusters, demonstrate that the proposed system works at scale.&lt;/p></description></item><item><title>Xen and the Art of Virtualization</title><link>https://abhinand20.github.io/blogs/2023-04-12-xen-vm/</link><pubDate>Wed, 12 Apr 2023 00:00:00 +0000</pubDate><guid>https://abhinand20.github.io/blogs/2023-04-12-xen-vm/</guid><description>&lt;p>Reference papers:&lt;/p>
&lt;p>[1] &lt;a href="https://www.cl.cam.ac.uk/research/srg/netos/papers/2003-xensosp.pdf">Xen and the Art of Virtualization&lt;/a>&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Barham et al. introduce a Virtual Machine Monitor (VMM) called Xen. The main contribution of their approach is paravirtualization – modifying the guest OSes to make virtualization more efficient. The authors first introduce the concept of and need for virtualization and survey the existing approaches by which it is achieved. They then give an overview of the Xen architecture and how it virtualizes memory, CPU, and I/O. Paravirtualization lies at the heart of Xen: the guest OSes are modified so that they are aware of the presence of the hypervisor (without modifying the ABI, so that applications are unaffected). Some key features of Xen discussed by the authors include its support for dynamic VM management, support for multiple guest operating systems including Linux, Windows, and NetBSD, and a high-performance design that allows for efficient resource allocation and improved scalability. Finally, the authors perform extensive evaluations comparing Xen’s performance to that of native Linux and VMware. The evaluations cover several benchmarks that exercise various aspects of Xen’s performance, and the results show that even with virtualization, applications achieve close to native performance.&lt;/p></description></item><item><title>Distributed Consensus: Paxos</title><link>https://abhinand20.github.io/blogs/2022-05-12-paxos/</link><pubDate>Thu, 12 May 2022 00:00:00 +0000</pubDate><guid>https://abhinand20.github.io/blogs/2022-05-12-paxos/</guid><description>&lt;p>Reference papers:&lt;/p>
&lt;p>[1] &lt;a href="https://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/paper2-1.pdf">Paxos Made Live: An Engineering Perspective&lt;/a>&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Paxos is a well-known and well-studied consensus protocol for achieving fault tolerance in distributed systems. In this paper, Chandra et al. describe their experience implementing the Paxos algorithm to achieve fault-tolerant replication in a production environment. They show that even though the algorithm is well researched in the academic literature, implementing it in practice is quite an involved process. First, the authors explain the limitations of the previous fault-tolerant database used by Google’s lock service (Chubby), which motivated them to implement their own Paxos-based database as a replacement. They then introduce the Paxos algorithm itself to give readers the necessary background. Next, they highlight many practical challenges, and the solutions they developed, while implementing the protocol: handling disk corruption, dealing with stale data using master leases, detecting master turnover using epoch numbers, and so on. Optimizations such as Multi-Paxos and client-triggered snapshots make their system more robust and efficient. The authors also distill the lessons learned during the implementation process, including the importance of choosing the right abstractions, testing thoroughly, and carefully handling failure scenarios. The paper concludes by listing some open problems and encouraging the fault-tolerance research community to develop tools that help turn theoretical concepts into production-ready protocols. Overall, the paper serves as a valuable resource for engineers looking to implement Paxos in their own systems.&lt;/p></description></item><item><title>Live VM Migration</title><link>https://abhinand20.github.io/blogs/2022-03-10-live-migration/</link><pubDate>Thu, 10 Mar 2022 00:00:00 +0000</pubDate><guid>https://abhinand20.github.io/blogs/2022-03-10-live-migration/</guid><description>&lt;p>Reference papers:&lt;/p>
&lt;p>[1] &lt;a href="https://www.usenix.org/legacy/events/nsdi05/tech/full_papers/clark/clark.pdf">Live Migration of Virtual Machines&lt;/a>&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Clark et al. propose an approach to migrating live VMs between physical hosts with minimal downtime and minimal degradation in quality of service. Live migration is particularly helpful for load re-balancing, fault management, and server maintenance. The authors first describe the traditional approaches to migration, including stop-and-copy, demand-copy, pre-copy, and other hybrid methods, and adopt pre-copy as their approach because of its efficiency. They then describe the various design considerations behind their approach and provide solutions for migrating storage (via network-attached storage) and network connections between hosts with minimal downtime. The main idea is to iteratively pre-copy the VM’s memory pages to the destination host while the VM continues to run; once the majority of the pages have been transferred, the VM is briefly stopped and the final transfer of state is performed. This iterative copying results in very short downtime during migration. Finally, the authors extensively study the performance of their live-migration approach on benchmarks involving static-load web servers, dynamic content-generating servers, and interactive game servers. Optimizations discussed for their approach include adaptive rate-limiting, freeing page-cache pages, and other paravirtualization optimizations. The evaluation results on several benchmarks demonstrate the applicability of the authors’ live-migration approach.&lt;/p></description></item><item><title>Google File System (GFS) / Colossus</title><link>https://abhinand20.github.io/blogs/2022-02-10-gfs/</link><pubDate>Thu, 10 Feb 2022 00:00:00 +0000</pubDate><guid>https://abhinand20.github.io/blogs/2022-02-10-gfs/</guid><description>&lt;p>Reference papers:&lt;/p>
&lt;p>[1] &lt;a href="https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf">The Google File System&lt;/a>&lt;/p>
&lt;p>[2] &lt;a href="https://cacm.acm.org/magazines/2010/3/76283-gfs-evolution-on-fast-forward/fulltext">GFS: Evolution on Fast-Forward&lt;/a>&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>One of the main challenges in distributed systems is building scalable storage systems that can handle large amounts of data efficiently. The Google File System (GFS) proposed in [1] is a scalable distributed file system for large-scale data-intensive applications. Ghemawat et al. highlight some of the key design and performance considerations that went into developing GFS and emphasize that, although the system was built with Google’s unique setting in mind, its ideas are applicable to similar systems. The authors of [1] start by stating a few assumptions that influence their design decisions: system failures are the norm rather than the exception, most writes are appends, large chunk sizes are more efficient, and throughput matters more than low latency. They then describe the overall architecture of GFS, which comprises a single master and many distributed chunk servers. The master is a central repository for metadata about each file and the chunk servers responsible for it; the actual data is stored in 64 MB chunks distributed among the chunk servers. To prevent the master from becoming a bottleneck, the system is designed so that no data transfer involves the master; clients fetch data directly from the chunk servers. With fault tolerance and availability in mind, the system also employs a heartbeat mechanism for health checks, and each chunk is replicated on three chunk servers by default (configurable). The authors of [1] also discuss design aspects such as replication strategies, garbage collection, failure recovery through snapshots and checkpoints, and handling of concurrent writes. Finally, the authors provide experimental results on GFS performance in both a simulated environment and real Google clusters running GFS in production. 
The results show that the system handles reads and writes with reasonably good performance at large scale.
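&lt;p>A minimal sketch of the client-side arithmetic implied by the design above: with fixed 64 MB chunks, a byte offset in a file maps to a chunk index (which the client sends to the master to obtain a chunk handle and replica locations) and an offset within that chunk (used when reading from a chunkserver). The function name here is hypothetical:&lt;/p>

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size described in the paper

def chunk_location(offset: int) -> tuple[int, int]:
    """Map a byte offset within a file to (chunk index, offset inside chunk).

    In GFS, the client asks the master for the chunk handle and replica
    locations of the chunk index, then reads the data directly from a
    chunkserver at the within-chunk offset.
    """
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# byte 200,000,000 falls in the third chunk (index 2)
print(chunk_location(200_000_000))  # → (2, 65782272)
```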
In [2], Quinlan provides deeper insight into how GFS has evolved and adapted to the changing demands of large-scale systems. He specifically discusses issues faced by GFS, such as the master being a single point of failure, the number of files becoming a bottleneck, and the trend of applications shifting towards low-latency requirements. He also discusses the advantages of another system, BigTable, and how it builds on GFS to provide efficient access to smaller files. Although there were some hiccups in adapting GFS to the changing needs of latency-critical, client-facing applications, the overall impact of GFS has been positive.&lt;/p></description></item><item><title>Consistent Hashing: Chord</title><link>https://abhinand20.github.io/blogs/2022-02-05-chord/</link><pubDate>Sat, 05 Feb 2022 00:00:00 +0000</pubDate><guid>https://abhinand20.github.io/blogs/2022-02-05-chord/</guid><description>&lt;p>Reference paper:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf">Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>One of the fundamental problems in peer-to-peer (P2P) applications is efficiently locating the node responsible for a particular data item. Stoica et al. present Chord, a scalable and efficient distributed lookup protocol for P2P systems. The authors first give an overview of consistent hashing, which is widely used in such scenarios, and highlight its scalability issue: in a naive implementation, each node must maintain state about nearly every other node in the network. To overcome this limitation, Chord uses distributed hash-based indexing and routing so that it can handle a large number of nodes and lookups efficiently. In the proposed protocol, each node maintains a reference to its immediate successor and routing state for only a small subset of the other nodes, leading to much better scalability. Chord also handles node joins and departures without requiring significant changes to the network’s structure, which makes it robust in dynamic networks. The authors provide a thorough analysis of Chord’s performance and show that lookups complete in a number of steps that scales logarithmically with the number of Chord nodes. Lastly, the authors demonstrate Chord’s fault-tolerance capabilities through experiments.&lt;/p></description></item><item><title>Distributed systems: Stragglers</title><link>https://abhinand20.github.io/blogs/2022-01-24-tail-at-scale/</link><pubDate>Mon, 24 Jan 2022 00:00:00 +0000</pubDate><guid>https://abhinand20.github.io/blogs/2022-01-24-tail-at-scale/</guid><description>&lt;p>Reference paper:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/2408776.2408794?download=true">The Tail at Scale&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>One of the challenges in building large-scale systems is consistently maintaining low response latency. In their paper, Dean and Barroso highlight techniques for building distributed systems that achieve low overall latency despite their constituent components occasionally exhibiting high latency.
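&lt;p>A one-line calculation shows why this is hard at scale: if a request must wait for every one of N leaf servers and each leaf is independently slow with probability p, the whole request is slow with probability 1 − (1 − p)^N. The paper’s own example uses a 1-in-100 chance of a slow leaf fanned out to 100 leaves:&lt;/p>

```python
def p_slow_request(p_slow_leaf: float, fan_out: int) -> float:
    """Probability that a fan-out request is slow, assuming the root must
    wait for all leaves and each leaf is independently slow."""
    return 1 - (1 - p_slow_leaf) ** fan_out

# The paper's example: a 1% chance of a slow leaf, with a fan-out of 100,
# makes roughly 63% of requests slow overall.
print(round(p_slow_request(0.01, 100), 2))  # → 0.63
```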
The authors first list several causes of high tail latency – shared resources, queuing, daemons, garbage collection, and so on – and show that variability in the latency of individual components leads to high overall service latency because of the “fan-out” architecture generally adopted in large-scale systems. Taking inspiration from fault-tolerant systems, they focus on techniques that reduce latency hiccups regardless of root cause. Finally, the authors introduce techniques such as sending the same request to multiple replicas and taking only the first response (hedged requests), and allowing the servers handling a replicated request to exchange status updates so that redundant work can be cancelled (tied requests). Other techniques proposed by the authors, such as micro-partitioning, selective replication, and latency-induced probation, also lead to significant reductions in latency at larger scales.&lt;/p></description></item><item><title>Advanced topics in server design</title><link>https://abhinand20.github.io/blogs/2022-01-20-server-design-advanced/</link><pubDate>Thu, 20 Jan 2022 00:00:00 +0000</pubDate><guid>https://abhinand20.github.io/blogs/2022-01-20-server-design-advanced/</guid><description>&lt;p>Reference papers:&lt;/p>
&lt;p>[1] &lt;a href="https://www.usenix.org/legacy/publications/library/proceedings/usenix99/full_papers/banga/banga.pdf">A Scalable and Explicit Event Delivery Mechanism for UNIX&lt;/a>&lt;/p>
&lt;p>[2] &lt;a href="https://www.usenix.org/legacy/event/usenix04/tech/general/full_papers/brecht/brecht.pdf">accept()able Strategies for Improving Web Server Performance&lt;/a>&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Server performance depends on a number of factors, and designing high-performance, scalable servers requires careful reasoning about them. The authors of [1] and [2] argue that a server’s policy for accepting new client connections [2] and the OS event-notification mechanism [1] have a significant effect on performance. The authors of [1] show that web servers using the select (or poll) system call to check for events scale poorly with the event rate. To mitigate this scalability issue, they propose an explicit event-delivery mechanism in which the kernel maintains a queue of events and notifies the application once an event is ready. This scales better because the kernel performs work proportional to the number of events rather than the number of descriptors being monitored. The authors of [2] study another important aspect of server design – the policy for accepting new connections. Through their experiments, they argue that finding the right balance between accepting new connections and processing existing ones can improve web server performance. They experiment with three different servers, varying the number of consecutively accepted connections via an accept-limit parameter; the results show that well-tuned accept policies yield meaningful improvements over the baseline policy.&lt;/p></description></item><item><title>Flash: An efficient and portable Web Server</title><link>https://abhinand20.github.io/blogs/2022-01-16-server-design-basic/</link><pubDate>Sun, 16 Jan 2022 00:00:00 +0000</pubDate><guid>https://abhinand20.github.io/blogs/2022-01-16-server-design-basic/</guid><description>&lt;p>The reference paper can be found &lt;a href="https://www.usenix.org/legacy/events/usenix99/full_papers/pai/pai.pdf">here.&lt;/a>&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>In this paper, the authors propose Flash – an efficient and portable web server that uses an asymmetric multi-process event-driven (AMPED) architecture, a combination of the single-process event-driven (SPED) and multi-process/multi-threaded (MP/MT) architectures. Flash consists of a main process that handles HTTP requests in an event-driven fashion and several helper processes that perform blocking disk operations on its behalf. The authors give a brief overview of the different server architectures and benchmark Flash against Zeus (SPED) and Apache (MP/MT). They also introduce several important optimizations – in caching, byte alignment, file access via mmap, and checking for memory residency before invoking helper processes – that yield large performance gains. Most of these optimizations are fairly standard in present-day web servers, but at the time the paper was published, these ideas were quite novel.&lt;/p></description></item><item><title>The End-to-End Principle in system design</title><link>https://abhinand20.github.io/blogs/2022-01-14-end-to-end/</link><pubDate>Fri, 14 Jan 2022 00:00:00 +0000</pubDate><guid>https://abhinand20.github.io/blogs/2022-01-14-end-to-end/</guid><description>&lt;p>The reference paper can be found &lt;a href="https://pages.cs.wisc.edu/~bart/739/papers/end-to-end.pdf">here.&lt;/a>&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>In this paper, the authors present an argument in the spirit of Occam’s razor for designing systems. The central problem they address is deciding where to draw the boundaries between the functions of a distributed system, and where in the system to place those functions. The end-to-end argument aims to keep the lower levels of a system as simple as possible by implementing functions closer to the endpoints that use them. The authors do not propose the end-to-end argument as a strict rule, but as a set of rational principles to keep in mind while designing layered systems. This argument has contributed to the rapid growth of the Internet as we know it today (TCP/IP), and it also appears in several other applications such as end-to-end encryption and reliable file transfer.&lt;/p></description></item></channel></rss>