Ellis Michael

Linearizable, Wait-free Reads of Replicated State Machines

13 Apr 2020

I recently read the Gryff paper [1], published this year at NSDI. As I was reading it, I noticed that the authors utilizeda trick for achieving linearizable reads of replicated data that has value beyond their specific system. Many people who’ve studied fault-tolerant replication have probably derived it on their own (as I did years ago), but I don’t recall ever seeing the technique described in an earlier paper.1 So, I thought I’d take the opportunity to describe the technique on its own as well as how it applies to any state machine replication (SMR) protocol, beyond the Gryff protocol described by Burke et al.

Any SMR protocol that guarantees linearizability [2] of updates provides the abstraction that the changes to the underlying state (called “commands”) are processed by a single, fault-tolerant service. Or equivalently, there is a logical shared log onto which clients’ commands are appended. In practice, this shared log is often explicitly constructed, and separate machines (called “replicas”) consume the log by executing the commands in log order. We know from the FLP impossibility result [3] that in an asynchronous system in which some replicas can fail, there will always be the possibility that the system ceases to make progress and will forever be prevented from appending commands to the shared log, even if one assumes that all messages that get sent are eventually delivered.

Does this then imply that reading the state of a replicated state machine is necessarily subject to the same limitations? Implementing read operations for an SMR protocol naïvely is straightforward. Reads can simple be treated as commands and can go through the normal replication protocol. Using this method, however, would subject reads to the impossibility result described above and the performance bottlenecks that come with some SMR protocols. However, it turns out that these downsides are not inherent. Instead, we can guarantee that responses to read requests can be returned in a wait-free manner [4]. That is, we can guarantee that these read operations will terminate in finite rounds of communication.

Firstly, if one only cares about serializability of reads, rather than linearizability, the solution is straightforward. Any client read the state of any replica in a single round of communication (sending the initial request to a majority to tolerate faults). As long as replicas have access to the state of the system after applying some prefix of the current shared log (as is almost always the case with SMR protocols), then a single replica’s response to the client will suffice.2 In essence, this weakening of linearizability to serializability means that the client might receive stale data. Whether or not this is acceptable depends on the application, but mere serializability can sometimes lead to unintuitive results.

To see how we will achieve wait-free, linearizable reads, we’ll take the multi-instance Paxos protocol [5] as a running example (using some of the terminology from Paxos Made Moderately Complex [6]). In this execution, we see a single client, sending a command to the system. The leader sends a message to the followers and waits for responses from a majority (including itself). Once this is done, the leader executes the command, replies to the client, and lets the followers know that a decision has been reached and that they can safely execute the command as well.

Suppose then, that after receives a response to its command but before any of the followers are notified of the decision, attempts to read the state of the replicated state machine. In principle, is allowed as many rounds of communication with a majority of replicas as is necessary. For now, let’s just consider a single round of communication. sends the query to all replicas and because we would like to tolerate the failure of any minority, it only waits for a response from a majority. Now, suppose that none of the responses it receives comes from the leader replica.

Because initiated its request after got a response, the state reads must reflect ’s command. However, as they respond to the ’s read, none of the followers yet knows that a decision has been reached about ’s command, even if they do know that there is a decision pending. It would seem that the only safe thing for to do in this scenario is to wait until a decision on the pending command is reached and to delay a response to the read until then. However, that would put us back in the realm where the FLP result applies and preclude any hopes of wait-freedom.

The solution is fairly straightforward. The problem is that the state updates are replied to before knowledge of their commitment has reached a majority. So, we will delay the response the client until that knowledge has reached a majority. The leader waits until a majority executes the command and acknowledges the decision before responding.3

We’re almost done. You might think that this completely resolves the issue, and in the setting we’ve described so far in which there are only two clients – one sending updates and the other reading the state – it does. However, there is a subtle problem that arises when there are multiple clients sending reads. In this example, is also attempting to read the state of the replicated state machine. Here, the result of ’s command will be visible to but not to . Since initiated its read after ’s already finished, this would be a violation of linearizability.

We solve this issue the same way the ABD register protocol [7] does. We add a write-back phase to the read protocol, in which the client ensures that a majority of the replicas are at least as up-to-date as the state the client learned about during the first phase of the read. It distributes any decisions which have not yet reached a majority and waits until there is some up-to-date majority before considering the read completed.

Of course, if the original majority the client read from all had the same set of decisions (or at least all returned the same result if only a subset of the state was read), then this second phase can be skipped entirely.

So, we have two strategies: (1) delaying the client-side completion of updates until they’re permanently committed at a majority and (2) reading using a two-round protocol, the second round of which ensures that a majority is up-to-date with the value read. Neither of these strategies is specific to Paxos and can be applied to SMR protocols broadly. However, they do present trade-offs, as Burke et al. describe.

References

  1. Burke, M., Cheng, A. and Lloyd, W. Gryff: Unifying Consensus and Shared Registers. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). Feb. 2020. 591–617.
  2. Herlihy, M.P. and Wing, J.M. Linearizabiliy: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems. 12(3):463–492. Jul. 1990.
  3. Fischer, M.J., Lynch, N.A. and Paterson, M.S. Impossibility of Distributed Consensus with One Faulty Process. J. ACM. 32(2):374–382. Apr. 1985.
  4. Herlihy, M. Wait-Free Synchronization. ACM Trans. Program. Lang. Syst. 13(1):124–149. Jan. 1991.
  5. Lamport, L. The Part-time Parliament. ACM Transactions on Computer Systems. 16(2):133–169. May 1998.
  6. Van Renesse, R. and Altinbuken, D. Paxos Made Moderately Complex. ACM Comput. Surv. 47(3):42:1–42:36. Feb. 2015.
  7. Attiya, H., Bar-Noy, A. and Dolev, D. Sharing Memory Robustly in Message-Passing Systems. J. ACM. 42(1):124–142. Jan. 1995.
  1. If this technique has been previously described in the literature, I would very much appreciate someone sending me a link! 

  2. Each response will need to be tagged with some sort of version number, and the client will have to cache responses locally to enforce the process order requirement of serializability. 

  3. The phase added to the update protocol could instead be initiated by the client, and the client could wait for acknowledgments to the decision before considering the update complete. In the Paxos context, having the leader drive this second phase to completion is simpler.