Abhinand Jha

Distributed systems: Stragglers

Reference paper:

Summary

One of the challenges in building large-scale systems is keeping response latency consistently low. Dean and Barroso, in their paper, highlight techniques for building distributed systems that have low overall latency even though their constituent components occasionally exhibit high latency. The authors first list several causes of high tail latency, such as shared resources, queuing, background daemons and garbage collection. They then show that variability in the latency of individual components leads to high overall service latency because of the "fan-out" architecture generally adopted in large-scale systems: a request that fans out to many servers is only as fast as the slowest one. The authors take inspiration from fault-tolerant systems and focus on techniques that mask latency hiccups regardless of their root cause. Finally, they introduce two request-level techniques to reduce tail latency: sending the same request to multiple replicas and using only the first response (hedged requests), and letting the servers that hold copies of a request exchange status updates so that the redundant copies can be cancelled once one server starts executing it (tied requests). Other techniques proposed by the authors, such as micro-partitioning, selective replication and latency-induced probation, also lead to a significant reduction in latency at larger scales.
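To make the fan-out effect concrete, here is a small back-of-the-envelope sketch. It assumes each leaf server is slow independently with the same probability; the function name and the 1% figure are illustrative choices, not code from the paper.

```python
def prob_slow_request(p_slow_leaf: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` leaf calls is slow.

    Assumption: leaves are slow independently, each with probability
    `p_slow_leaf`. The request must wait for every leaf, so one slow
    leaf is enough to make the whole request slow.
    """
    return 1.0 - (1.0 - p_slow_leaf) ** fan_out


if __name__ == "__main__":
    # Illustrative numbers: each leaf is slow 1% of the time.
    for n in (1, 10, 100, 1000):
        print(f"fan-out {n:>4}: {prob_slow_request(0.01, n):.1%} of requests hit a slow leaf")
```

With a fan-out of 100 and a 1% chance of a slow leaf, roughly 63% of requests end up waiting on at least one slow server, which is why a rare hiccup in one component becomes a common case for the service as a whole.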

Positive Points

Drawbacks

Research Questions

  1. How would latency-induced probation work in case of writing to a server on probation?
  2. A complex large-scale distributed system has a lot of moving parts. How can one identify issues that are specifically caused by latency variability? For example, some latency issues could instead be caused by faults or retries.
  3. One of the modern commercial databases that use hedged requests is Cassandra. It uses the percentile heuristic to send 3 requests for each request. This increases the load on the servers in exchange for better latency.
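
The percentile heuristic mentioned in the last item can be sketched roughly as follows. This is a minimal illustration of a hedged request with a delayed backup, not Cassandra's implementation: `query_replica`, the replica names and the latencies are hypothetical stand-ins, and `hedge_delay_s` is assumed to be set from something like the observed 95th-percentile latency, as the paper suggests.

```python
import asyncio
from typing import Awaitable, Callable


async def query_replica(name: str, latency_s: float) -> str:
    """Hypothetical stand-in for an RPC: sleeps, then returns the replica name."""
    await asyncio.sleep(latency_s)
    return name


async def hedged_request(
    primary: Callable[[], Awaitable[str]],
    backup: Callable[[], Awaitable[str]],
    hedge_delay_s: float,
) -> str:
    """Send to the primary; if it has not answered within `hedge_delay_s`,
    also send to the backup and return whichever answers first."""
    tasks = [asyncio.create_task(primary())]
    done, _ = await asyncio.wait(tasks, timeout=hedge_delay_s)
    if not done:
        # Primary is slower than the hedge delay: issue the backup request.
        tasks.append(asyncio.create_task(backup()))
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in tasks:
        task.cancel()  # no-op for the finished task, cancels the straggler
    return next(iter(done)).result()


if __name__ == "__main__":
    winner = asyncio.run(
        hedged_request(
            lambda: query_replica("replica-a", 0.5),   # straggling primary
            lambda: query_replica("replica-b", 0.01),  # healthy backup
            hedge_delay_s=0.05,
        )
    )
    print("answered by", winner)
```

Because the backup is only sent when the primary is already slower than the chosen percentile, the extra load stays small (a few percent of requests) while the slowest responses are cut short.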


#Computer Science #System Design #Distributed Systems #Backend #Networks #Software Engineering