10 Feb 2022

Google File System (GFS) / Colossus

Reference papers:

[1] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. SOSP 2003.

[2] K. McKusick and S. Quinlan. GFS: Evolution on Fast-Forward. ACM Queue, 2009.

Summary

One of the main challenges in distributed systems is building scalable storage systems that can handle large amounts of data efficiently. The Google File System (GFS) proposed in [1] is a scalable distributed file system for large-scale, data-intensive applications. Ghemawat et al. highlight the key design and performance considerations that went into developing GFS, and emphasize that although the system was built with Google's unique setting in mind, the ideas are also applicable to similar systems. The authors of [1] start by stating a few assumptions that influence their design decisions: system failures are the norm rather than the exception, most writes are appends, larger chunk sizes are more efficient, and throughput matters more than low latency. They then describe the overall architecture of GFS, which consists of a single master and many distributed chunk servers. The master is a central repository that stores the metadata about each file and the chunk servers responsible for its chunks. The actual data is stored in 64 MB chunks distributed across the chunk servers. To prevent the master from becoming a bottleneck, the system is designed so that no data transfer involves the master; clients fetch data directly from the chunk servers. With fault tolerance and availability in mind, the system also employs a heartbeat mechanism for health checks, and each chunk is replicated three times by default (configurable).

The authors of [1] also discuss design aspects such as replication strategies, garbage collection, failure recovery through checkpoints and snapshots, and the handling of concurrent writes. Finally, they provide experimental results on GFS performance in both a controlled test environment and real Google clusters running GFS in production. The results provide convincing evidence that the system handles reads and writes with reasonably good performance at large scale.

In [2], Quinlan provides deeper insights into how GFS has evolved and adapted to the changing demands of large-scale systems. He specifically discusses issues faced by GFS such as the master being a single point of failure, the number of files becoming a bottleneck, and the trend of applications shifting towards low-latency requirements. He also discusses the advantages of another system, BigTable, and how it is able to use GFS to provide efficient access to smaller files. Although there were some hiccups in adapting GFS to the changing needs of latency-critical, client-facing applications, the overall impact of GFS has been positive.
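To make this division of labor concrete, the following Go sketch (type and field names are hypothetical, not taken from [1]) shows a metadata-only master: it maps a file name and byte offset to a chunk handle and the addresses of the chunk servers holding its replicas, and the client then fetches the bytes directly from one of those servers.

```go
package main

import "fmt"

// chunkSize mirrors the 64 MB chunk size described in [1].
const chunkSize = 64 * 1024 * 1024

// Master holds only metadata: for each file, the ordered list of chunk
// handles, and for each handle, the chunk servers storing a replica.
// (Illustrative structure, not the paper's data layout.)
type Master struct {
	fileChunks map[string][]string // file name -> chunk handles, in order
	chunkLocs  map[string][]string // chunk handle -> replica addresses
}

// Locate maps a byte offset within a file to a chunk handle and the
// replicas that hold it. No file data ever flows through the master.
func (m *Master) Locate(file string, offset int64) (string, []string, error) {
	handles, ok := m.fileChunks[file]
	idx := int(offset / chunkSize)
	if !ok || idx >= len(handles) {
		return "", nil, fmt.Errorf("no chunk at offset %d of %s", offset, file)
	}
	handle := handles[idx]
	return handle, m.chunkLocs[handle], nil
}

func main() {
	m := &Master{
		fileChunks: map[string][]string{"/logs/web-00": {"c-001", "c-002"}},
		chunkLocs:  map[string][]string{"c-001": {"cs1:7000", "cs2:7000", "cs3:7000"}},
	}
	// A client caches this answer and then reads the bytes directly
	// from one of the returned chunk servers.
	handle, replicas, _ := m.Locate("/logs/web-00", 10*1024*1024)
	fmt.Println(handle, replicas)
}
```

Keeping the master on this metadata-only path is what allows a single master to serve a large cluster without becoming a data-transfer bottleneck.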

Positive Points

  • The authors of [1] made the paper very accessible by clearly stating their assumptions, using easy-to-follow illustrations of their design, and providing metrics from real-world services run by Google. For example, the tables breaking down operations and bytes transferred on Google clusters were very insightful.
  • In [1], the authors also discussed potential pitfalls of some of their design choices and provided useful alternatives. For instance, in section 2.5 they highlighted the issue of hotspots and suggested the alternative of allowing clients to fetch data from each other.
  • In [1] the authors proposed a novel solution to the split-brain problem: a lease granted to a primary chunk server, with an expiration timeout that prevents multiple primaries from being assigned in case of a failure. This is an elegant solution to a difficult problem; a sketch of the mechanism follows this list.
  • In [2], the authors discuss, in retrospect, some limitations of GFS that were not evident from the initial design proposed in [1]. For example, in [1] the authors argue that the cost of adding extra memory to the master is a small price to pay for the advantages of storing the metadata in memory. However, the discussion in [2] provides a different perspective and helps in better understanding the drawbacks of some design choices introduced in [1].
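As a companion to the lease point above, here is a minimal Go sketch of the idea; the LeaseManager type is hypothetical, and only the 60-second initial lease timeout comes from [1]. The master refuses to grant a new primary lease for a chunk while an unexpired lease is held by another server, so a partitioned old primary can never coexist with a newly appointed one.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// leaseDuration follows the 60-second initial lease timeout mentioned in [1].
const leaseDuration = 60 * time.Second

type lease struct {
	primary string    // chunk server currently acting as primary
	expires time.Time // instant after which the lease is void
}

// LeaseManager is a hypothetical slice of the master's state that tracks
// which chunk server holds the primary lease for each chunk.
type LeaseManager struct {
	mu     sync.Mutex
	leases map[string]lease // chunk handle -> current lease
	now    func() time.Time // injectable clock, e.g. time.Now
}

// Grant hands the primary role for a chunk to the requesting server.
// If an unexpired lease is held by a different server, the request is
// refused, so two live primaries can never exist at the same time,
// even if the master has lost contact with the current primary.
func (lm *LeaseManager) Grant(chunk, server string) error {
	lm.mu.Lock()
	defer lm.mu.Unlock()
	if l, ok := lm.leases[chunk]; ok && l.primary != server && lm.now().Before(l.expires) {
		return errors.New("lease still held; wait for it to expire")
	}
	lm.leases[chunk] = lease{primary: server, expires: lm.now().Add(leaseDuration)}
	return nil
}

func main() {
	lm := &LeaseManager{leases: map[string]lease{}, now: time.Now}
	fmt.Println(lm.Grant("c-001", "cs1:7000")) // <nil>: cs1 becomes primary
	fmt.Println(lm.Grant("c-001", "cs2:7000")) // error: cs1's lease has not expired
}
```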

Drawbacks

  • The authors of [1] elaborate on several concepts involving fault tolerance and availability, but do not provide sufficient illustrations or pseudo-code for some of the more complicated mechanisms, such as recovery using snapshots and checkpoints. The garbage collection process could also have been explained in more detail, as it is an important aspect of a storage system.
  • In the fault-tolerance experiments in section 6.2.5 of [1], the authors discuss recovery times for chunk server failures, but do not consider the scenario in which the master fails. This scenario is of particular importance in a single-master system, and providing some analysis of the effects of a master failure would have been helpful.
  • In the related work section of [1], it would have been helpful to include a table or a concise summary of the different kinds of file system architectures, different kinds of workloads, and which architectures are and are not effective for each kind of workload.
  • In [2], the authors discuss the consistency issues that occur in GFS. In the design proposed in [1], the consistency model was relaxed to simplify the system; however, since consistency is generally an important requirement for storage systems, the authors should have made a more convincing case for this design choice.

Research Questions

  1. How can the GFS design be modified to support strong consistency? Mechanisms that could be introduced include two-phase commit (2PC) and routing all writes through the primary chunk server; a sketch of this idea follows this list.
  2. Active-active Masters: Can we use an active-active mechanism by having multiple Masters that balance the load amongst each other? What could be some potential drawbacks of this?
  3. GFS is not well suited to writes at random offsets and focuses primarily on appends. Given that SSDs are now readily available and support fast random writes, would it be possible to extend the architecture to support random writes and make it more general?
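For research question 1, the sketch below illustrates, under a hypothetical Replica interface that is not part of GFS, how two-phase commit over a chunk's replicas could enforce all-or-nothing writes: every replica must vote yes in a prepare phase before any replica makes the write visible.

```go
package main

import "fmt"

// Replica is a hypothetical interface a chunk server would expose if
// writes were coordinated with two-phase commit instead of GFS's
// relaxed consistency model.
type Replica interface {
	Prepare(txID string, data []byte) bool // stage the write, vote yes/no
	Commit(txID string)                    // make the staged write visible
	Abort(txID string)                     // discard the staged write
}

// twoPhaseWrite applies data to every replica of a chunk, or to none.
// The primary chunk server could act as coordinator; a single failed
// vote aborts the write everywhere, which is what buys the stronger
// consistency at the cost of extra round trips and blocking.
func twoPhaseWrite(txID string, data []byte, replicas []Replica) bool {
	for _, r := range replicas {
		if !r.Prepare(txID, data) {
			for _, rr := range replicas {
				rr.Abort(txID)
			}
			return false
		}
	}
	for _, r := range replicas {
		r.Commit(txID)
	}
	return true
}

// memReplica is a toy in-memory replica used only to exercise the sketch.
type memReplica struct {
	staged    map[string][]byte
	committed [][]byte
}

func (m *memReplica) Prepare(txID string, data []byte) bool {
	if m.staged == nil {
		m.staged = map[string][]byte{}
	}
	m.staged[txID] = data
	return true
}

func (m *memReplica) Commit(txID string) {
	m.committed = append(m.committed, m.staged[txID])
	delete(m.staged, txID)
}

func (m *memReplica) Abort(txID string) { delete(m.staged, txID) }

func main() {
	replicas := []Replica{&memReplica{}, &memReplica{}, &memReplica{}}
	ok := twoPhaseWrite("tx-1", []byte("record"), replicas)
	fmt.Println("committed on all three replicas:", ok)
}
```

The cost is visible even in this sketch: every write blocks on the slowest replica and takes two round trips, which is precisely the latency and throughput trade-off that the relaxed model in [1] avoids.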
