AWS is building a new storage backend for its Amazon S3 service, a paper presented at the 2021 Symposium on Operating Systems Principles (SOSP) this week reveals -- and it's built on a hefty 40,000 lines of Rust code.
Authored by 10 AWS researchers and three academics, the paper -- which won the best paper award at SOSP 2021 (pipping 348 submissions from 2078 authors to the punch) -- reveals in significant detail how the hyperscaler is "building a new key-value storage node called ShardStore" and some challenges in doing so.
ShardStore is under "ongoing feature development by a full-time engineering team" the paper notes -- but is alreading being deployed worldwide, and "currently stores hundreds of petabytes of customer data as part of a gradual rollout". The team working on it have already prevented 16 issues from reaching production, "including subtle crash consistency and concurrency problems" an earlier LinkedIn post by Amazon Science said.
ShardStore is a log-structured merge tree (LSM tree) but with shard data stored outside the tree to reduce write amplification, AWS Principal Applied Scientist Murat Demirbas (who is currently a on leave as a computer science and engineering professor at SUNY Buffalo) noted in his own detailed blog on the research paper.
The team developed and open-sourced a stateless model checker called Shuttle to run tests on the concurrent executions of ShardStore, noting that "even a relatively small test involves tens of thousands of atomic steps whose interleavings must be explored, and the largest tests involve over a million steps." (Shuttle, which is essentially a library for testing concurrent Rust code, can be found on GitHub here under an Apache 2.0 license).
AWS: ShardStore already being deployed
The core of S3 are storage node servers that persist object data on hard disks. These storage nodes are key-value stores that hold shards of object data, replicated by the control plane across multiple nodes for durability.
The paper details the techniques ShardStore uses to ensure resilience and rapid recovery from any incident across these nodes with "eleven nines" data durability (preservation of 99.999999999% of data) in the event of a crash -- while addressing some of the challenges around developing formally verified concurrent storage systems, lightweight stateless model checking and good debugging experience for concurrent test failure.
(Its developers "routinely run tens of millions of random test sequences before every ShardStore deployment" to ensure resilience, they noted in the research paper; using Rust tool Loom as well as Shuttle.)
"Recovering from a crash that loses an entire storage node's data creates large amounts of repair network traffic and IO load across the storage node fleet. Crash consistency also ensures that the storage node recovers to a safe state after a crash, and so does not exhibit unexpected behavior that may require manual operator intervention" the paper's authors note, detailing how they applied techniques like biasing arguments and failure injections towards potentially problematic cases and increase their ability to detect issues.
Follow The Stack on LinkedIn
"To achieve high performance, ShardStore combines a soft-updates crash consistency protocol, extensive concurrency, append-only IO and garbage collection to support diverse storage media, and other complicating factors" the paper notes, adding that the new approach is API-compatible with AWS's existing storage node software, and so requests can be served by either ShardStore or our existing key-value stores.
(S3 executives noted on Twitter that customers were already benefiting from ShardStore without precisely detailing how: we presume through improved speed and resiliency, but The Stack has asked for further details on the new storage backend for curious enterprise users downstream).
The paper focuses less on ShardStore as a "product" and more on how the team built lightweight formal methods (rigorous techniques for developing and verifying software; in this case the "correctness" of production storage nodes), including via updates to a dynamic analysis tool called Miri that's built into the Rust compiler.
"[Rust] ensures type and memory safety for the safe subset of the language. However, as a low-level system, ShardStore requires a small amount of unsafe Rust code, mostly for interacting with block devices. Unsafe Rust is exempt from some of Rust’s safety guarantees and introduces the risk of undefined behavior. To rule out this class of bugs, we... had to extend Miri to support basic concurrency primitives (threads and locks), and have upstreamed that work to the Rust compiler" the paper notes.