The AWS Data Ecosystem Grand Tour - Streaming Data
Written by Alex Rasmussen on February 3, 2020
This article is part of a series. Here are the rest of the articles in that series:
- Where Your AWS Data Lives
- Block Storage
- Object Storage
- Relational Databases
- Data Warehouses
- Data Lakes
- Key/Value and "NoSQL" Stores
- Graph Databases
- Time-Series Databases
- Ledger Databases
- SQL on S3 and Federated Queries
- Streaming Data
- File Systems
- Data Ingestion
- Data Interfaces
- Training Data for Machine Learning
- Data Security
- Business Intelligence
At the core of many of the data storage systems we've covered so far is some kind of immutable, ordered, append-only log that records each change that happens to the system's data. In relational databases, this log forms the source of truth for the state of the database and is used for things like crash recovery and replication. DynamoDB and Neptune expose their logs directly to users. These logs are one important example of a data stream.
Data streams are a pretty powerful abstraction. We won't discuss all the benefits of the abstraction here, but there are a couple of big ones. Using a stream of events instead of direct service-to-service communication to coordinate the execution of a multi-service task can allow some services to continue operating even when others are offline. Streams can also serve as a kind of buffer, allowing the system to temporarily receive data faster than it can process it without overloading. Replicating state using a stream of updates is also pretty easy to reason about. If two replicas have applied updates up to the same point in the stream, those two replicas are synchronized. Compared to trying to synchronize arbitrary state across replicas, this almost feels like cheating.
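That replication property can be sketched in a few lines of plain Python (no real streaming system involved): a replica's state is fully determined by how far it has read into a shared, append-only log, so checking synchronization reduces to comparing log positions.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    state: dict = field(default_factory=dict)
    position: int = 0  # index of the next log entry to apply

    def catch_up(self, log, up_to):
        """Apply log entries [position, up_to) in order."""
        for key, value in log[self.position:up_to]:
            self.state[key] = value
        self.position = up_to

# An append-only log of (key, value) updates.
log = [("a", 1), ("b", 2), ("a", 3)]

r1, r2 = Replica(), Replica()
r1.catch_up(log, 3)
r2.catch_up(log, 2)
r2.catch_up(log, 3)  # a lagging replica just replays the entries it missed

# Same position in the log implies identical state; no state diffing needed.
assert r1.position == r2.position and r1.state == r2.state
```

The "almost feels like cheating" part is that last assertion: the log's total order does all the hard coordination work for you.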
One thing you'll notice about these benefits is that they're predicated on three big assumptions: that streams are highly available, durable for long periods of time, and capable of handling a high volume of incoming data. Building a stream processing platform that provides these guarantees is hard. AWS recognizes both the power of the abstraction and the complexity of its implementation, and has several managed services available to users who want all the benefits without the hassles of maintaining a streaming data system.
The first such system is Amazon Kinesis Data Streams (KDS). KDS is a fully managed data streaming service that advertises being able to capture gigabytes of event data per second and retain that data durably for 24 hours (or 7 days if you pay more). Later in this series, we'll look at several services that can read from and write to KDS streams natively, but of course you can build your own consumers and producers as well.
When you create a KDS stream, you specify how many shards it should use. Each shard allows the stream to ingest 1MB per second or 1,000 events per second, whichever limit is reached first. By default, each shard also provides up to 2MB per second of total read bandwidth shared across the stream's consumers, although you can increase that to 2MB per second per consumer by upgrading to what KDS calls "enhanced fan-out".
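Those per-shard limits make shard sizing a simple back-of-the-envelope calculation. Here's an illustrative sketch (plain arithmetic, not an AWS API) that picks the smallest shard count satisfying both write limits:

```python
import math

# Per-shard write limits stated above: 1MB/s of ingest or
# 1,000 events/s, whichever is hit first.
SHARD_MB_PER_SEC = 1
SHARD_EVENTS_PER_SEC = 1000

def shards_needed(mb_per_sec: float, events_per_sec: float) -> int:
    """Smallest shard count that satisfies both per-shard write limits."""
    by_bandwidth = math.ceil(mb_per_sec / SHARD_MB_PER_SEC)
    by_rate = math.ceil(events_per_sec / SHARD_EVENTS_PER_SEC)
    return max(by_bandwidth, by_rate, 1)

# 5MB/s of moderately sized events: bandwidth is the binding constraint.
print(shards_needed(5, 2000))    # 5
# Only 0.5MB/s, but 4,500 tiny events/s: event rate dominates instead.
print(shards_needed(0.5, 4500))  # 5
```

Note that with lots of small events you can hit the 1,000-events-per-second limit long before you come anywhere near 1MB per second, so event size matters as much as total volume.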
When using KDS, you're charged per shard-hour for each shard that you use. The cost of each shard-hour increases if you choose to enable enhanced fan-out or increased retention. In addition, you're charged for each insert (or PUT) operation into the stream. PUTs are measured in PUT Payload Units (PPUs): each event consumes one PPU per 25KB of payload, rounded up.
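The rounding-up is worth noticing, since it means small events are billed as if they were 25KB. A quick illustrative calculation (the PPU size is from the pricing model above; the function is just arithmetic, not an AWS API):

```python
import math

# One PUT Payload Unit (PPU) covers each 25KB chunk of an event's
# payload, rounded up, so even a tiny event consumes a whole PPU.
PPU_SIZE_KB = 25

def ppus_for_event(payload_kb: float) -> int:
    """PPUs consumed by a single event of the given payload size."""
    return max(1, math.ceil(payload_kb / PPU_SIZE_KB))

print(ppus_for_event(5))    # 1  (a small event still consumes a full PPU)
print(ppus_for_event(26))   # 2  (just over 25KB rounds up to two)
print(ppus_for_event(100))  # 4
```

One practical consequence: batching many tiny events into a single PUT close to the 25KB boundary can meaningfully reduce your PPU bill.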
The most popular open-source data streaming service by far is Apache Kafka. Kafka and KDS are similar in many ways and provide similar interfaces and functionality. To attract existing Kafka users who don't want to move to Kinesis, AWS also provides a managed Kafka service in the form of Amazon Managed Streaming for Apache Kafka (MSK).
MSK, like Redshift or RDS, requires that you provision a set of Kafka nodes (called brokers), and offers several different broker instance sizes to choose from. When using MSK, you pay for the time that your brokers are running, and more powerful broker instances cost more. You're also charged by GB-hour for the data that the brokers store, and standard data transfer rates apply as well, although transfer between brokers isn't charged.
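Putting those two line items together, a rough monthly cost estimate for an MSK cluster might look like the sketch below. The rates here are hypothetical placeholders for illustration, not actual AWS prices:

```python
# A rough monthly-cost sketch combining MSK's two main line items:
# broker instance-hours and GB-hours of broker storage.
HOURS_PER_MONTH = 730  # common billing approximation (365 * 24 / 12)

def msk_monthly_cost(brokers: int,
                     broker_rate_per_hour: float,
                     storage_gb: float,
                     storage_rate_per_gb_hour: float) -> float:
    """Estimated monthly cost, ignoring data transfer charges."""
    instance_cost = brokers * broker_rate_per_hour * HOURS_PER_MONTH
    storage_cost = storage_gb * storage_rate_per_gb_hour * HOURS_PER_MONTH
    return instance_cost + storage_cost

# e.g. three brokers at a hypothetical $0.21/hour with 500GB of storage
# at a hypothetical $0.0001/GB-hour:
print(round(msk_monthly_cost(3, 0.21, 500, 0.0001), 2))
```

Unlike KDS's per-shard, per-PUT model, you pay for the brokers whether or not data is flowing, which is the usual trade-off between provisioned and fully managed capacity.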
Next: The Venerable File System
In this article, we talked about streaming data and introduced Kinesis Data Streams and Amazon MSK. We'll be seeing a lot more of data streams later in the series when we start to talk about data processing systems.
Next time, for our final data storage system of the series, we'll take a look at AWS's roster of file systems.
If you'd like to get notified when new articles in this series get written, please subscribe to the newsletter by entering your e-mail address in the form below. You can also subscribe to the blog's RSS feed. If you'd like to talk more about any of the topics covered in this series, please contact me.