Ellis Michael

2017
October
Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control.
Jialin Li, Ellis Michael, and Dan R. K. Ports.
Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP '17), Shanghai, China.
abstract pdf slides
Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols – introducing extensive coordination overhead. Our system, Eris, takes a different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity.
The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas in the normal case. It provides atomicity, consistency, and fault tolerance with less than 10% overhead – achieving throughput 3.6–35x higher and latency 72–80% lower than a conventional design on standard benchmarks.
Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control (Extended Version).
Jialin Li, Ellis Michael, and Dan R. K. Ports.
Technical Report UW-CSE-17-10-01, University of Washington CSE.
abstract pdf
Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols – introducing extensive coordination overhead. Our system, Eris, takes a different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity.
The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas in the normal case. It provides atomicity, consistency, and fault tolerance with less than 10% overhead – achieving throughput 3.6–35x higher and latency 72–80% lower than a conventional design on standard benchmarks.
Recovering Shared Objects Without Stable Storage.
Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres.
Proceedings of the 31st International Symposium on Distributed Computing (DISC '17), Vienna, Austria.
abstract pdf
This paper considers the problem of building fault-tolerant shared objects when processes can crash and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model matches the way many long-lived systems are built. We show that it presents new challenges, as operations that are recorded at a quorum may not persist after some of the processes in that quorum crash and then recover.
To address this problem, we introduce the notion of crash-consistent quorums, where no recoveries happen during the quorum responses. We show that relying on crash-consistent quorums enables a recovery procedure that can recover all operations that successfully finished. Crash-consistent quorums can be easily identified using a mechanism we term the crash vector, which tracks the causal relationship between crashes, recoveries, and other operations.
We apply crash-consistent quorums and crash vectors to build two storage primitives. We give a new algorithm for multi-writer, multi-reader atomic registers in the DCR model that guarantees safety under all conditions and termination under a natural condition. It improves on the best prior protocol for this problem by requiring fewer rounds, fewer nodes to participate in the quorum, and a less restrictive liveness condition. We also present a more efficient single-writer, single-reader atomic set―a virtual stable storage abstraction. It can be used to lift any existing algorithm from the traditional Crash-Recovery model to the DCR model. We examine a specific application, state machine replication, and show that existing diskless protocols can violate their correctness guarantees, while ours offers a general and correct solution.
August Recovering Shared Objects Without Stable Storage (Extended Version).
Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres.
Technical Report UW-CSE-17-08-01, University of Washington CSE.
abstract pdf
This paper considers the problem of building fault-tolerant shared objects when processes can crash and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model matches the way many long-lived systems are built. We show that it presents new challenges, as operations that are recorded at a quorum may not persist after some of the processes in that quorum crash and then recover.
To address this problem, we introduce the notion of crash-consistent quorums, where no recoveries happen during the quorum responses. We show that relying on crash-consistent quorums enables a recovery procedure that can recover all operations that successfully finished. Crash-consistent quorums can be easily identified using a mechanism we term the crash vector, which tracks the causal relationship between crashes, recoveries, and other operations.
We apply crash-consistent quorums and crash vectors to build two storage primitives. We give a new algorithm for multi-writer, multi-reader atomic registers in the DCR model that guarantees safety under all conditions and termination under a natural condition. It improves on the best prior protocol for this problem by requiring fewer rounds, fewer nodes to participate in the quorum, and a less restrictive liveness condition. We also present a more efficient single-writer, single-reader atomic set―a virtual stable storage abstraction. It can be used to lift any existing algorithm from the traditional Crash-Recovery model to the DCR model. We examine a specific application, state machine replication, and show that existing diskless protocols can violate their correctness guarantees, while ours offers a general and correct solution.
2016
November
Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering.
Jialin Li, Ellis Michael, Adriana Szekeres, Naveen Kr. Sharma, and Dan R. K. Ports.
Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), Savannah, GA, USA.
abstract pdf slides
Distributed applications use replication, implemented by protocols like Paxos, to ensure data availability and transparently mask server failures. This paper presents a new approach to achieving replication in the data center without the performance cost of traditional methods. Our work carefully divides replication responsibility between the network and protocol layers. The network orders requests but does not ensure reliable delivery – using a new primitive we call ordered unreliable multicast (OUM). Implementing this primitive can be achieved with near-zero-cost in the data center. Our new replication protocol, Network-Ordered Paxos (NOPaxos), exploits network ordering to provide strongly consistent replication without coordination. The resulting system not only outperforms both latency- and throughput-optimized protocols on their respective metrics, but also yields throughput within 2% and latency within 16 μs of an unreplicated system – providing replication without the performance cost.
September Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering (Extended Version).
Jialin Li, Ellis Michael, Adriana Szekeres, Naveen Kr. Sharma, and Dan R. K. Ports.
Technical Report UW-CSE-16-09-02, University of Washington CSE.
abstract pdf
Distributed applications use replication, implemented by protocols like Paxos, to ensure data availability and transparently mask server failures. This paper presents a new approach to achieving replication in the data center without the performance cost of traditional methods. Our work carefully divides replication responsibility between the network and protocol layers. The network orders requests but does not ensure reliable delivery – using a new primitive we call ordered unreliable multicast (OUM). Implementing this primitive can be achieved with near-zero-cost in the data center. Our new replication protocol, Network-Ordered Paxos (NOPaxos), exploits network ordering to provide strongly consistent replication without coordination. The resulting system not only outperforms both latency- and throughput-optimized protocols on their respective metrics, but also yields throughput within 2% and latency within 16 μs of an unreplicated system – providing replication without the performance cost.
August Providing Stable Storage for the Diskless Crash-Recovery Failure Model.
Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres.
Technical Report UW-CSE-16-08-02, University of Washington CSE.
abstract pdf
Many classic protocols in the fault tolerant distributed computing literature assume a Crash-Fail model in which processes either are up, or have crashed and are permanently down. While this model is useful, it does not fully capture the difficulties many real systems must contend with. In particular, real-world systems are long-lived and must have a recovery mechanism so that crashed processes can rejoin the system and restore its fault-tolerance. When processes are assumed to have access to stable storage that is persistent across failures, the Crash-Recovery model is trivial. However, because disk failures are common and because having a disk on a protocol's critical path is often performance concern, diskless recovery protocols are needed. While such protocols do exist in the state machine replication literature, several well-known protocols have flawed recovery mechanisms. We examine these errors to elucidate the problem of diskless recovery and present our own protocol for providing virtual stable storage, transforming any protocol in the Crash-Recovery with stable storage model into a protocol in the Diskless Crash-Recover model.
2015
May
Scaling Leader-Based Protocols for State Machine Replication.
Ellis Michael. Supervisor: Lorenzo Alvisi.
Undergraduate Honors Thesis, University of Texas at Austin.
abstract pdf
State machine replication is a technique used to guarantee the availability of a system even in the presence of faults. Agreement protocols are often used to implement state machine replication. However, the throughput of many agreement protocols, such as Paxos, does not scale with the number of machines available to the system. Systems whose throughput does scale often provide weaker consistency guarantees than state machine replication's linearizability. These weaker guarantees, such as eventual consistency, make the job of the application programmer much more difficult.
Kapritsos and Junqueira previously proposed a protocol which uses an agreement protocol as a primitive and does scale. In this thesis, I describe my novel implementation of this protocol as well as the optimizations that were necessary to achieve scalability. The results of my experiments with this system show modest but promising evidence that the throughput of this system can, indeed, scale linearly with the number of machines.