Deadlock Freedom: Beyond Liveness Properties

19 Jan 2023

Properties

In their classic paper, “Defining Liveness” [1], Alpern and Schneider show that all properties of distributed systems are the conjunction of a safety property and a liveness property. Informally, safety properties specify the “bad things” that should not happen while liveness properties specify the “good things” that must happen. More formally, safety properties are the properties which can always be falsified by finite traces and liveness properties are the properties which can only be falsified by infinite traces. If \(S\) is the set of program states, \(S^\omega\) is the set of infinite sequences of program states, and \(S^*\) is the set of finite sequences of program states then \(P\) is a safety property if

\[\forall \sigma \in S^\omega : \exists i \ge 0 : \forall \beta \in S^\omega : \sigma \not \in P \implies \sigma[..i]\beta \not \in P\]

and \(P\) is a liveness property if

\[\forall \alpha \in S^* : \exists \beta \in S^\omega : \alpha \beta \in P \ .\]

Here, \(\sigma[..i]\) is just the finite sequence with the first \(i\) elements of \(\sigma\) and \(\sigma[..i] \beta\) and \(\alpha \beta\) are concatenations of sequences.

Alpern and Schneider also give two different proofs that any property can be decomposed into a safety property and a liveness property, one based on topology and one based on automata theory [1, 2]. I haven’t yet defined what a “property” is, though. It’s implicit in the above definitions, but in this characterization, a property is just a set of (infinite) traces!

The Properties of Consensus

Properties are powerful and can be used to specify important things we want from our distributed systems. For example, the specification of the consensus problem can be expressed as the intersection of a safety property – all correct processes that decide a value decide the same value, and that value must have been some input value – and a liveness property – all correct (non-failing) processes eventually decide a value.

Sometimes, proving properties (especially liveness properties) of systems requires making additional assumptions. Consider the termination guarantees of a system like Paxos. We know that consensus in an asynchronous environment with the possibility of a single process failure cannot be solved (without randomization) [3]. In particular, protocols which guarantee safety cannot guarantee termination. However, people use protocols such as Paxos in practice. Are these practitioners just religious zealots in the Paxos cult, believing without any basis that Paxos will save them from the evils of impossibility results? Of course not! We can prove things about the liveness of Paxos; in particular, a Paxos implementation with a well-designed leader election mechanism can guarantee liveness during periods of sufficient synchrony.

How do we formalize what “periods of sufficient synchrony” are? One way people do this is through the use of failure detectors. If you equip each replica with a failure detector which is responsible for informing the replica when the current leader has failed, then we know that as long as the failure detectors behave at least as well as \(\diamond W\), the weakest failure detector that can solve consensus [4], (and as long as the network itself satisfies basic fairness criteria¹) then the system is guaranteed to eventually make progress. The way that you could disprove that Paxos (or an implementation of Paxos) satisfies this “\(\diamond W\)-termination” liveness property is by describing an infinite (fair) execution of Paxos in which the failure detectors satisfy the \(\diamond W\) guarantees, no more than \(f\) processes fail, and some correct process never decides a value.

Deadlock Freedom

While proving liveness properties of consensus protocols like Paxos is useful, it will always involve making additional assumptions about the well-behavedness of the environment. I claim, however, that there is something much more basic that we would like to say about Paxos and many other distributed protocols and implementations of those protocols that does not rely on any additional assumptions. I’ve been calling this property deadlock freedom.

Deadlock Freedom: The system cannot reach a state from which no terminating state is reachable.

Notice that deadlock freedom is neither a safety nor liveness property – it’s not a property at all! We can easily express it in a branching temporal logic such as CTL (e.g., \(\mathbf{AG}. \mathbf{EF}. termination\)), but deadlock freedom cannot be falsified with a single trace. Just because a system can fail to make progress forever doesn’t mean that it ever reaches a state from which progress is impossible. However, any liveness properties that you wish to prove about a system will imply deadlock freedom, unless additional assumptions you employ to prove liveness make the dead states unreachable.

You might think that deadlock freedom is actually a safety property. After all, deadlocks in a multi-threaded locking scheme can be detected by looking for cycles in the wait-for graph. If a locking system cannot reach a cyclically waiting state, then it is free of deadlocks.

“Deadlocks” in distributed systems, however, can be much more subtle. Imagine implementing a 500,000 line system that contains a dozen different roles that nodes can play. Proving (or even knowing) all of the ways in which such a system could get permanently stuck, were they to occur, might be extremely difficult. But expressing the ways in which the system can make progress might be comparatively easy. We would like to be able to express useful attributes of our systems (such as deadlock freedom) without needing to prove complicated theorems.

Deadlock freedom is also a useful attribute of systems to look for in model checkers. In DSLabs, while we do not check for deadlock freedom in general, we do check that student implementations can make progress from specific intermediate states.

Hyperproperties

While deadlock freedom is not a property, it is a hyperproperty. Hyperproperties, as defined by Clarkson and Schneider, are sets of sets of (infinite) traces [5]. A distributed system satisfies a hyperproperty if the complete set of possible executions it can produce is in the hyperproperty. In light of hyperproperties, what we have been just calling “properties” up until now are referred to as trace properties for clarity.

It turns out that there is an analogous safety and liveness decomposition for hyperproperties! Safety hyperproperties are the hyperproperties which can always be falsified by finite sets of finite traces,² and liveness hyperproperties are hyperproperties which cannot be falsified by finite sets of finite traces.³ (See the hyperproperties paper for a more full discussion of the definitions of hyperproperties [5].) Furthermore, every hyperproperty is the conjunction of a safety hyperproperty and a liveness hyperproperty.

Deadlock freedom is a liveness hyperproperty. It cannot be falsified with finite traces at all. A counter-example to deadlock freedom would have to show that there is some state that is reachable, and in all possible executions which reach that state, termination/progress is never achieved.

Hyperproperties were introduced to specify important security policies (such as secure information flow) and complex liveness requirements (e.g., SLAs such as mean response time). However, as we have seen, they can be much more basic. Deadlock freedom is a liveness hyperproperty which is broadly applicable and extremely desirable. While trace properties are a useful narrowing of hyperproperties because they can be more amenable to certain formal methods techniques such as verification and model checking, a broader understanding of the specification of distributed systems must include hyperproperties, which I do not think are as widely understood, hence the reason for this blog post.

References

Alpern, B. and Schneider, F.B. Defining liveness. Information Processing Letters. 21(4):181–185. 1985.
Alpern, B. and Schneider, F.B. Recognizing Safety and Liveness. Distributed Computing. 2(3):117–126. Sep. 1987.
Fischer, M.J., Lynch, N.A. and Paterson, M.S. Impossibility of Distributed Consensus with One Faulty Process. J. ACM. 32(2):374–382. Apr. 1985.
Chandra, T.D., Hadzilacos, V. and Toueg, S. The Weakest Failure Detector for Solving Consensus. jacm. 43(4):685–722. Jul. 1996.
Clarkson, M.R. and Schneider, F.B. Hyperproperties. Journal of Computer Security. 18(6):1157–1210. Sep. 2010.

The fairness criteria that are assumed vary, but one basic assumption that is usually necessary is that all processes either fail or take infinitely many steps. The FLP impossibility result assumes that all messages are eventually delivered (as long as the receiver has not failed), which is a particularly strong assumption but only broadens the applicability of the impossibility result. When proving the liveness of protocols, weaker assumptions about message delivery are often assumed. For example, a fair-lossy network only guarantees that if a message is sent infinitely many times (from the same sender to the same receiver), then it is delivered infinitely many times. ↩
That is, no complete set of executions which extends a counter-example set of finite traces will satisfy a safety hyperproperty. If we have two sets of (possibly infinite) traces \(A\) and \(B\), then \(A \le B\) if \(\forall a \in A : \exists b \in B : a \le b\) where \(a \le b\) means that \(a\) is a prefix of \(b\). A hyperproperty is a safety hyperproperty if, for every set of executions, \(T\) which does not satisfy the safety hyperproperty, there exists a finite set of finite traces \(C \le T\) such that for all \(T' \ge C\), \(T'\) also does not satisfy the hyperproperty. ↩
A hyperproperty is a liveness hyperproperty if, for all finite sets of finite traces \(C\), \(\exists T \ge C\) such that \(T\) satisfies the hyperproperty. ↩