Amazon has open-sourced a new file client for S3 called Mountpoint for Amazon S3 that makes it “easy” for Linux-based applications to connect directly to Amazon S3 buckets and access objects using file APIs – not something that has always been easy or indeed possible for Linux-based large-scale analytics applications.
Mountpoint runs on your compute node of choice (perhaps unsurprisingly, Amazon gives the examples of an EC2 instance or Amazon EKS pod) and mounts an S3 bucket to a directory on your local file system.
That “easy” is Amazon’s term, and it sits in quotation marks because AWS openly admits that Mountpoint, which is written in Rust, doesn’t support writes yet and in the future “will only support sequential writes to new objects… [it’s an alpha release that is] not yet ready for use in production workloads, but we’d love for you to kick the tires and let us know what you think”, its developers said. “A lot of people will be inappropriately, but deliriously, happy about this”, added AWS’s head of developer relations for high-performance computing, Brendan Bouffler.
Mountpoint's developer team said they were releasing it because “many data lake customers use more domain-specific tools that don’t natively support S3’s object APIs and instead expect inputs and outputs to be files in a local file system”, giving the example of genomics research tools that read sequencing data from a file system, or machine learning training pipelines that store checkpoints (snapshots or dumps of your model) on the local file system.
S3 file client Mountpoint: Burst to terabits per second
To recap a little: S3 (Simple Storage Service) is an object storage service that is home to over 280 trillion objects from customers in every conceivable industry vertical and geography. As object storage, it uses a flat structure in which each object carries metadata and a unique identifier, making it easier to find among trillions of others.
File storage by contrast organises files into hierarchies with directories and folders, using strict file protocols. (Object storage is all at the same level: There are no folders or sub-directories like there are for file storage.)
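To make the flat-versus-hierarchical distinction concrete, here is a minimal Python sketch of how a flat set of object keys can be presented as folders by splitting on a delimiter. The key names and the `list_dir` helper are hypothetical and purely illustrative; this is a toy in-memory model that never talks to S3, and it is not Mountpoint's code.

```python
def list_dir(keys, prefix="", delimiter="/"):
    """Return (subdirectories, files) that sit directly under `prefix`.

    `keys` is a flat iterable of object keys held in memory.
    """
    subdirs, files = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the key is exactly the prefix; nothing to list
        head, sep, _tail = rest.partition(delimiter)
        if sep:
            # More path remains after the delimiter: show it as a folder.
            subdirs.add(head + delimiter)
        else:
            files.add(head)
    return sorted(subdirs), sorted(files)


# Hypothetical keys in one flat namespace.
KEYS = [
    "genomes/sample1.fastq",
    "genomes/sample2.fastq",
    "checkpoints/run1/epoch1.pt",
    "README.txt",
]

if __name__ == "__main__":
    print(list_dir(KEYS))
    print(list_dir(KEYS, "genomes/"))
```

The "folders" exist only in how the keys are grouped for display; the underlying store stays flat.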
“Mountpoint allows you to map S3 buckets or prefixes into your instance’s file system namespace, traverse the contents of your buckets as if they were local files, and achieve high throughput access to objects without worrying about performance tuning or provisioning. With Mountpoint, file operations map to GET and PUT operations against S3, allowing scalable file-based applications to burst to terabits per second of aggregate throughput, without any code changes”, said Amazon’s James Bornholt, Devabrat Kumar, and Andy Warfield.
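As a rough illustration of what "without any code changes" could look like from the application's side, the Python sketch below reads data using ordinary file calls only. The `count_lines` and `demo` helpers and the sample key are hypothetical, and a temporary directory stands in for a mounted bucket so the example runs anywhere; nothing here is Mountpoint's own API.

```python
import tempfile
from pathlib import Path


def count_lines(mount_dir, key):
    """Count lines in an object exposed as a file under `mount_dir`.

    Only standard file APIs are used; the code does not know or care
    whether the directory is a local disk or a mounted bucket.
    """
    path = Path(mount_dir) / key
    with path.open("rb") as f:
        return sum(1 for _ in f)  # read front to back, line by line


def demo():
    # Assumption: a temporary directory plays the role of the mount point.
    with tempfile.TemporaryDirectory() as mount_dir:
        sample = Path(mount_dir) / "genomes" / "sample1.fastq"
        sample.parent.mkdir(parents=True)
        sample.write_bytes(b"@read1\nACGT\n+\nFFFF\n")
        return count_lines(mount_dir, "genomes/sample1.fastq")


if __name__ == "__main__":
    print(demo())
```

On a real mount, the same `count_lines` call would simply be pointed at the directory where the bucket is mounted.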
"If you’re looking to run file-based applications that need to collaborate or coordinate on shared data across instances or users, we recommend AWS fully managed file services, such as Amazon FSx or Amazon Elastic File System (EFS). For example, Amazon FSx for Lustre enables customers to link a fully POSIX-compliant file system to an S3 bucket. But if you’re building a data lake application that needs to read large objects without using other file system features like locking or POSIX permissions, or write objects sequentially from a single node, Mountpoint makes direct access to S3 simple and efficient” they added in a March 14 blog about the release.
“Why has AWS created its own file system client, when many third-party clients that do this already exist? Examples include S3FS-FUSE which supports Linux, macOS and FreeBSD, the commercial ObjectiveFS system, and Rclone for Windows?” asks DevClass’s Tim Anderson in his write-up on the release. Kevin Miller, VP and GM for S3, told DevClass: “We took a look at all the connectors and decided it would be best to build a connector from the ground up. We are building it on top of the AWS Common Runtime, which is the library that underpins our SDKs … we are also writing in the Rust programming language which gets us benefits of type checking and other built-in quality features without sacrificing native code performance [and it benefits from] automated reasoning … for validating correctness of things like S3 strong consistency.”
Mountpoint’s semantics tenets
When thinking about the semantics Mountpoint for Amazon S3 will support, the repo emphasises three tenets:
- "We will not support semantics that cannot be implemented efficiently against S3's object APIs. We do not try to emulate operations like rename that would require many API calls to S3 to perform.
- "We present a common view of S3 object data through both file and object APIs. We eschew special emulations of POSIX file features (such as ownership and permissions) that have no close analog in S3's object APIs.
- "When these tenets cause us to diverge from POSIX semantics, we prefer to fail early and explicitly. We would rather cause applications to fail with IO errors than silently accept operations like setxattr that we will never successfully persist.
Have a play or kick the tyres here.