The AWS Data Ecosystem Grand Tour - Object Storage
Written by Alex Rasmussen on December 13, 2019
This article is part of a series. Here are the rest of the articles in that series:
- Where Your AWS Data Lives
- Block Storage
- Object Storage
- Relational Databases
- Data Warehouses
- Data Lakes
- Key/Value and "NoSQL" Stores
- Graph Databases
- Time-Series Databases
- Ledger Databases
- SQL on S3 and Federated Queries
- Streaming Data
- File Systems
- Data Ingestion
- Data Interfaces
- Training Data for Machine Learning
- Data Security
- Business Intelligence
In the last article, we talked about block storage, and in particular about Elastic Block Store (EBS). EBS volumes have a lot of great features, but they also have a few big limitations. First, EBS volumes can only be attached to one EC2 instance at a time, which means you can't share data on the volume with multiple instances at once. Second, an EBS volume has to be attached to an instance to be accessible, which is kind of a pain if you want to access the data from something that isn't an instance (which, as we'll see in the rest of this series, is quite common). Finally, EBS volumes have a limited size; when EBS was first introduced, 16 TiB for a single volume seemed hilariously large, but many organizations now have more than 16 TiB of data that they need to process as a logical unit. It's fairly common in some domains, like computational genomics or video processing, for individual files to be multiple terabytes in size.
So if you need to store a lot of files, potentially lots of really big files, in a way that's accessible for reading and writing for lots of instances at once, how do you do it? This is where Amazon Simple Storage Service (S3) comes in.
Object Storage with S3
S3 is an object store. It allows you to create buckets full of objects, where each object is uniquely named by its key. You can store any kind of binary data in S3 as an object, which is why S3 is sometimes referred to as a blob store (where "blob" is short for "Binary Large Object"). Users of S3 can read, write, and delete buckets and objects, list some or all of the objects within a bucket, tag objects with key-value pairs and ... that's basically it. This interface is extremely simple (hence the name) but it's also quite flexible and powerful.
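To make that interface concrete, here is a minimal sketch of a write, read, and delete round trip. The `put_object`, `get_object`, and `delete_object` calls use the keyword arguments that boto3's S3 client accepts; the bucket name, key, and the `InMemoryS3` stand-in are hypothetical and exist only so the example runs without an AWS account.

```python
import io


def store_and_fetch(s3, bucket, key, data):
    """Write an object, read it back, then delete it."""
    s3.put_object(Bucket=bucket, Key=key, Body=data)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.delete_object(Bucket=bucket, Key=key)
    return body


class InMemoryS3:
    """Hypothetical stand-in that mimics only the three calls used above."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = bytes(Body)
        return {}

    def get_object(self, Bucket, Key):
        # The real client also returns a dict whose "Body" is a readable stream.
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}

    def delete_object(self, Bucket, Key):
        self._objects.pop((Bucket, Key), None)
        return {}


fake = InMemoryS3()
result = store_and_fetch(fake, "example-bucket", "reports/2019/q4.csv", b"a,b\n1,2\n")
print(result)
```

With boto3 installed and credentials configured, passing `boto3.client("s3")` in place of `fake` should run the same three operations against a real bucket.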
S3 has a number of desirable features that set it apart from EBS. Individual objects in an S3 bucket can be up to 5 TB in size, and there's no practical limit to the number of objects you can store in a single bucket. Objects stored in S3 are, by default, extremely durable; AWS advertises 99.999999999% durability. This means that if you store 10,000,000 objects in an S3 bucket, you can expect to lose one object in that bucket about once every 10,000 years. Of course, this doesn't account for accidental or malicious object loss, but S3 also supports object versioning so that you can un-delete an object if you delete it by accident. S3 also allows you to set permissions at the bucket, object, or account level to control who can perform what operations on which buckets and objects. With the newly introduced S3 Access Points, you can even provide different sets of access to different groups in your organization. Buckets can be easily - some might argue too easily - exposed to the world either directly or through an HTTP server, although this is something you can disable account-wide if you want.
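The durability figure can be turned into an expected-loss estimate with a little arithmetic. This is a rough sketch that assumes the eleven nines are an annual, per-object survival probability and that object losses are independent; the function name is made up for illustration.

```python
from fractions import Fraction

# Assumption: 99.999999999% is an annual, per-object survival probability,
# and losses of different objects are independent of one another.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    annual_loss_probability = 1 - DURABILITY  # 1 in 10**11
    expected_losses_per_year = object_count * annual_loss_probability
    return 1 / expected_losses_per_year


print(years_per_lost_object(10_000_000))  # 10000
```

Exact fractions are used instead of floats because subtracting a number that close to 1 loses most of its precision in floating point.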
It's tempting to treat an S3 bucket as a file system. Some systems (including the AWS Console) even present a file-system-like interface to make browsing S3 buckets easier. Unfortunately, treating S3 like a file system can get you into trouble. Objects in an S3 bucket can't be grouped in the same way that files can be grouped into directories. The closest you can get to showing the files in a directory is listing all objects in a bucket whose keys begin with a certain string, which is what many systems with a file-system-like interface to S3 do. This works well enough if the number of objects in the bucket is small, but tends to become quite slow if you have more than a few hundred thousand objects in a bucket.
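The prefix trick is easy to see with a toy model. The sketch below works on a plain Python list of keys rather than a real bucket, and splits the keys under a prefix into "files" and "subdirectories", which is roughly what S3's list operation does when you give it a prefix and a delimiter. The function name and the sample keys are invented for the example.

```python
def list_directory(keys, prefix="", delimiter="/"):
    """Split the keys under `prefix` into direct 'files' and 'subdirectories'."""
    files, subdirs = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            # No delimiter after the prefix: the key looks like a file here.
            files.append(key)
        else:
            # Everything up to the next delimiter looks like a subdirectory.
            subdirs.add(prefix + rest[:cut + len(delimiter)])
    return files, sorted(subdirs)


keys = [
    "logs/2019/12/01.gz",
    "logs/2019/12/02.gz",
    "logs/2019/README.txt",
    "logs/2018/12/31.gz",
    "images/cat.png",
]
files, subdirs = list_directory(keys, prefix="logs/2019/")
print(files)    # ['logs/2019/README.txt']
print(subdirs)  # ['logs/2019/12/']
```

Notice that there is no directory object anywhere in this model: "logs/2019/12/" only exists because some keys happen to start with it.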
In much the same way that EBS provides different volume types for different use cases, S3 provides different storage classes for objects that, generally speaking, allow you to trade decreased storage cost for decreased object availability. Every object is assigned a storage class when it's written (S3 Standard unless you specify otherwise), and you can change an object's storage class on an object-by-object basis at any time if you want. You can also declare a lifecycle management policy that moves objects between storage classes based on time-based rules, e.g. moving objects to cheaper but less available storage after 30 days.
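As an illustration of the 30-day example, here is a sketch of a lifecycle configuration in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` call accepts. The rule ID, the `logs/` prefix, and the bucket name in the comment are hypothetical; the helper function is just a convenience for this example.

```python
def lifecycle_rule(rule_id, prefix, days, storage_class):
    """Build one rule that transitions matching objects after `days` days."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},  # only objects whose keys start with this
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


config = {"Rules": [lifecycle_rule("logs-to-ia", "logs/", 30, "STANDARD_IA")]}
print(config)

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```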
The most common storage class is S3 Standard. This storage class is designed to provide high performance with 99.99% availability, meaning you can expect less than an hour of downtime a year. Objects with this storage class achieve this level of availability by being replicated many times across multiple availability zones. There are so many replicas, in fact, that entire availability zones can become unavailable and the object can still be read and updated.
The Infrequent Access (IA) storage classes are cheaper than S3 Standard because objects with these classes are replicated across fewer locations. Both IA classes provide the same performance as S3 Standard, but their decreased replication results in decreased availability guarantees. S3 Standard-Infrequent Access (S3 Standard-IA) advertises 99.9% availability, or about nine hours of maximum downtime a year, and S3 One Zone-Infrequent Access (S3 One Zone-IA) advertises 99.5% availability, or about two days of maximum downtime a year.
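The downtime figures quoted for each storage class follow directly from the availability percentages. A quick sketch of the arithmetic, assuming a 365-day year and treating the advertised availability as a worst case:

```python
from fractions import Fraction

HOURS_PER_YEAR = 365 * 24  # 8,760; ignores leap years


def max_downtime_hours(availability_percent):
    """Hours per year a service may be down at the given availability."""
    unavailable = 1 - Fraction(availability_percent) / 100
    return float(unavailable * HOURS_PER_YEAR)


for name, availability in [
    ("S3 Standard", "99.99"),
    ("S3 Standard-IA", "99.9"),
    ("S3 One Zone-IA", "99.5"),
]:
    print(f"{name}: {max_downtime_hours(availability):.2f} hours/year")
```

That works out to a little under an hour for S3 Standard, a little under nine hours for S3 Standard-IA, and just under two days for S3 One Zone-IA.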
S3 also offers the Intelligent Tiering storage class. Objects stored in this storage class are stored in one of two tiers: a frequent access tier that's equivalent to S3 Standard, and an infrequent access tier that's equivalent to S3 Standard-IA. Objects begin their life in the frequent tier, and objects in the frequent tier are moved to the infrequent tier if they haven't been accessed for 30 days. If an object in the infrequent tier is accessed, it's automatically and transparently moved back into the frequent tier. This storage class advertises the same availability as S3 Standard-IA, 99.9%.
Storing objects in the Intelligent Tiering class can save you money if you don't know when or how frequently your objects will be accessed. That additional intelligence isn't free, however; AWS charges a small per-object monthly fee for the monitoring and automation that moves objects between tiers.
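The tiering rule itself is simple enough to model in a few lines. This is a toy simulation of the behavior described above, not AWS's implementation; the class name and the day-counter representation of time are invented for the example.

```python
IDLE_DAYS_BEFORE_DEMOTION = 30  # from the 30-day rule described above


class TieredObject:
    """Toy model: tracks only the day the object was last accessed."""

    def __init__(self, created_day=0):
        self.last_access_day = created_day

    def access(self, day):
        # Any access resets the idle clock, which moves the object
        # back to the frequent tier.
        self.last_access_day = day

    def tier(self, day):
        idle = day - self.last_access_day
        return "infrequent" if idle >= IDLE_DAYS_BEFORE_DEMOTION else "frequent"


obj = TieredObject(created_day=0)
print(obj.tier(10))  # frequent
print(obj.tier(45))  # infrequent
obj.access(45)
print(obj.tier(46))  # frequent
```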
Archival and Backup Storage with Glacier
Objects with the storage classes described above are designed to be readable at any time. For objects that serve as backups, archives, or regulatory audit trails, this always-on readability isn't always necessary. These types of objects are typically only read as part of recovery, and it's exceptionally important (operationally and sometimes legally) that they're never lost. The S3 Glacier storage classes (S3 Glacier and S3 Glacier Deep Archive) are designed for storing these kinds of objects as durably and cheaply as possible.
While objects stored with the Glacier storage classes can co-exist in the same bucket with objects that aren't, the two kinds of objects are treated differently enough that Glacier is often described as though it were its own distinct service. The Glacier API even refers to objects as archives and buckets as vaults, reflecting its focus on archival storage.
Archives in Glacier are much cheaper to store than objects in the Infrequent Access classes, but there's a major catch: you can't read archives on-demand. Instead, you have to retrieve the archive first, paying an extra fee for each GB you retrieve. Retrieval can take anywhere from several minutes to several hours, depending on how much you're willing to pay for retrieval. Archives in Glacier Deep Archive are even cheaper to store, but retrieval always takes on the order of hours.
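For objects that were stored through the S3 API with a Glacier storage class, retrieval is requested with a restore call. The sketch below builds the `RestoreRequest` dictionary that boto3's `restore_object` accepts; `Expedited`, `Standard`, and `Bulk` are the retrieval tier names that API uses, roughly from fastest and most expensive to slowest and cheapest. The helper function, bucket, and key are hypothetical.

```python
RETRIEVAL_TIERS = ("Expedited", "Standard", "Bulk")


def restore_request(days_available, tier="Standard"):
    """Build a RestoreRequest that keeps the restored copy for `days_available` days."""
    if tier not in RETRIEVAL_TIERS:
        raise ValueError(f"tier must be one of {RETRIEVAL_TIERS}, got {tier!r}")
    return {
        "Days": days_available,
        "GlacierJobParameters": {"Tier": tier},
    }


request = restore_request(7, tier="Bulk")
print(request)

# With boto3 installed and credentials configured, this could be submitted with:
#   boto3.client("s3").restore_object(
#       Bucket="example-bucket", Key="backups/2019.tar", RestoreRequest=request)
```

The restore call only starts the retrieval; the object becomes readable once the retrieval job finishes, and stays readable for the number of days requested.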
In addition to the expected create, retrieve, update, and delete operations on archives, the Glacier API also allows you to define what's called a vault lock policy, which can prevent anyone (even your organization's root account!) from deleting or modifying some or all of a vault's archives. This is particularly useful in heavily regulated environments like financial services or health care, where some kinds of data have to be retained unmodified for many years for compliance purposes.
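To give a sense of what such a policy looks like, here is a sketch of a vault lock policy that denies archive deletion until an archive is at least a year old. It follows the general shape of an IAM-style policy document with the `glacier:ArchiveAgeInDays` condition key; the account ID, region, vault name, and helper function are placeholders, and the exact policy you need will depend on your retention requirements.

```python
import json


def deny_early_delete_policy(vault_arn, minimum_age_days):
    """Build a policy that denies deleting archives younger than the minimum age."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "deny-delete-before-minimum-age",
                "Principal": "*",
                "Effect": "Deny",
                "Action": "glacier:DeleteArchive",
                "Resource": [vault_arn],
                "Condition": {
                    "NumericLessThan": {
                        "glacier:ArchiveAgeInDays": str(minimum_age_days)
                    }
                },
            }
        ],
    }


# Placeholder ARN: the region, account ID, and vault name are made up.
policy = deny_early_delete_policy(
    "arn:aws:glacier:us-west-2:123456789012:vaults/examplevault", 365
)
print(json.dumps(policy, indent=2))
```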
As mentioned above, if you use S3 you pay by the GB-month for the objects you store and you pay by the object if you're using the Intelligent Tiering storage class. You also pay for all requests made against S3 objects. Read and write operations are metered separately, and reads are generally less expensive than writes. You also pay typical AWS data transfer costs for all data transferred as part of an S3 read or write (see this article for more on data transfer pricing). If you tag objects, you pay a small fraction of a penny per tag. There are a host of other features (cross-region replication, transfer acceleration through AWS's edge network, etc.) that come with their own additional fees, mostly charging by the amount of data transferred or the number of objects impacted.
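Putting the main charges together, a back-of-the-envelope monthly estimate might look like the sketch below. Every price in `PLACEHOLDER_PRICES` is a made-up placeholder, not a real AWS rate; actual prices vary by region and storage class and change over time, so check the current pricing page before relying on any number.

```python
# Placeholder prices in dollars: NOT real AWS rates.
PLACEHOLDER_PRICES = {
    "storage_per_gb_month": 0.02,
    "per_1000_reads": 0.0004,
    "per_1000_writes": 0.005,
    "transfer_per_gb": 0.09,
}


def monthly_cost(stored_gb, read_requests, write_requests, transferred_gb, prices):
    """Sum storage, request, and data transfer charges for one month."""
    storage = stored_gb * prices["storage_per_gb_month"]
    reads = read_requests / 1000 * prices["per_1000_reads"]
    writes = write_requests / 1000 * prices["per_1000_writes"]
    transfer = transferred_gb * prices["transfer_per_gb"]
    return storage + reads + writes + transfer


total = monthly_cost(
    stored_gb=500,
    read_requests=1_000_000,
    write_requests=100_000,
    transferred_gb=50,
    prices=PLACEHOLDER_PRICES,
)
print(f"${total:.2f}")
```

Tagging, Intelligent Tiering monitoring, and the other add-on features mentioned above would each add their own line to this sum.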
Next Up: Relational Databases
In this article, we took a look at S3, one of the oldest and most widely used of AWS's data services. Next, we'll take a look at AWS's relational database offerings.