The AWS Data Ecosystem Grand Tour - Processing

Written by Alex Rasmussen on March 10, 2020

This article is part of a series.


Differentiated heavy lifting.
(Photo by Ant Rozetsky on Unsplash)

There are tools in the AWS customer's toolkit for ingesting data into many different kinds of data storage systems, most of which offer sophisticated query interfaces. In a lot of cases, that functionality is enough to meet the needs of an organization. Sometimes, though, you need to do more general-purpose data processing. In this article, we'll look at some of the ways AWS helps you do that.

At the risk of oversimplifying, you can think of data processing as coming in two flavors: batch and streaming. Batch processing operates on a dataset all at once, either periodically or in response to some user demand. Streaming, on the other hand, processes each datum (usually associated with an event) as it arrives. The data processing task we've focused on most in this series is ETL, and we've seen that it can be performed in both batch and streaming contexts within AWS.
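The distinction can be sketched in a few lines of Python. This is purely illustrative (not any AWS API): the same word-count logic run once over a complete dataset, versus updated one event at a time.

```python
# Batch: process the whole dataset in one pass.
# Streaming: process each record as it arrives, maintaining a running result.
from collections import Counter

def batch_count(records):
    """Batch flavor: all the data is available up front."""
    return Counter(records)

def make_stream_counter():
    """Streaming flavor: returns a handler that updates state per event."""
    counts = Counter()
    def on_event(record):
        counts[record] += 1
        return dict(counts)  # the result so far, after this event
    return on_event

events = ["click", "view", "click"]
print(batch_count(events))        # one pass over everything
on_event = make_stream_counter()
for e in events:
    snapshot = on_event(e)        # result evolves as events stream in
print(snapshot)
```

Both produce the same final answer here; the difference is *when* results become available and how much data must be held at once.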

Batch and streaming have their relative strengths and weaknesses. We won't cover those here, since whole books have been written on the subject. Suffice it to say that in a large enough organization you'll have applications that are amenable to both kinds of processing, and AWS has solutions for each.

Batch Processing

If you've dealt with large-scale data processing in the last ten years, you've probably heard of Hadoop. Hadoop began its life at Yahoo as an open-source re-implementation of the GFS distributed file system and the MapReduce batch computation framework, two pieces of Google's early "Big Data" infrastructure. HDFS, Hadoop's GFS analog, has remained relatively unchanged in the intervening years, but the options available for batch processing in Hadoop have expanded far beyond MapReduce, with Spark largely replacing MapReduce as the dominant batch processing system.

Hadoop's explosive popularity led to the development of a bunch of other data systems that all worked within the Hadoop ecosystem. Almost every data system we've covered in this series has at least one equivalent that runs on Hadoop. Eventually, a resource manager called YARN was developed to allow this growing menagerie of services to share cluster resources more easily.

All of Hadoop's new functionality came at a cost, however. Maintaining and configuring Hadoop clusters is notoriously challenging, particularly because a Hadoop cluster is a collection of largely independent distributed systems that all have to work together. Especially for enterprises used to administering large, monolithic databases, the administrative and cognitive burden of keeping a Hadoop cluster up and running is high.

Amazon EMR is AWS's hosted version of Hadoop. AWS takes care of launching and configuring EMR clusters for you, removing much of the care-and-feeding burden that makes Hadoop such a chore to operate. If you've run a Hadoop cluster before, a lot of the interfaces will look like you'd expect, and if you have a data system that can be deployed on HDFS and YARN, it'll likely work in EMR.

This being an AWS service, there are a number of integrations with the rest of their ecosystem that make it compelling for existing AWS customers. S3 buckets are accessible from the EMR cluster natively via EMR File System (EMRFS). AWS makes it pretty clear that it would prefer you use EMRFS for persistent storage, recommending that HDFS only be used for ephemeral or temporary data. AWS has modified YARN to support EC2 spot instances, so processing jobs can acquire short-lived, cheap compute resources when a batch job starts and release them when the job is done. User management, authentication and authorization, logging, and monitoring are all handled with existing AWS tools like IAM and CloudWatch. EMR users also have access to EMR Notebooks, which are hosted Jupyter notebooks that come pre-configured to make running Spark jobs on an EMR cluster easier.
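To make that concrete, here's a hypothetical sketch of the request you'd hand to boto3's `run_job_flow` to launch a transient EMR cluster: one Spark step reading from S3 via EMRFS `s3://` paths, with spot instances for the worker nodes. The bucket, job names, and instance choices are placeholders, not anything from this article.

```python
# Sketch: build a run_job_flow request for a transient EMR cluster.
# All names, paths, and instance types below are hypothetical.
def build_job_flow(bucket):
    return {
        "Name": "nightly-batch-job",
        "ReleaseLabel": "emr-5.29.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1,
                 "Market": "ON_DEMAND"},
                {"Name": "workers", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 4,
                 "Market": "SPOT"},  # cheap, interruptible worker capacity
            ],
            # Transient cluster: terminate once the step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # The job script and its input live in S3, read via EMRFS.
                "Args": ["spark-submit", f"s3://{bucket}/jobs/etl.py"],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "LogUri": f"s3://{bucket}/emr-logs/",
    }

# To actually launch (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("emr").run_job_flow(**build_job_flow("my-bucket"))
```

Note the pattern the prose describes: spot capacity for the workers, S3 (not HDFS) as the persistent store, and a cluster that exists only as long as the job does.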

EMR runs on a cluster of specialized EC2 instances. In terms of pricing, you pay what you would have paid for those instances in EC2, plus a per-instance markup for EMR that varies by instance size. As with normal EC2, bigger instances will get you more compute power, but will also cost more.

Stream Processing

We've already covered Kinesis Data Streams (KDS), AWS's system for capturing and storing data streams, and Kinesis Data Firehose, the solution for ETL from KDS streams into other data stores. AWS's tool for more general-purpose processing on streams is called Kinesis Data Analytics. With Kinesis Data Analytics, you can build applications using either SQL queries or Java programs that process a KDS stream. These applications are built with a modified version of Apache Flink, a popular open-source system for stream processing. Flink applications consist of a graph of operators that process and transform the stream's events. In SQL mode, Flink compiles a SQL query into an operator graph for you, and in Java mode, you write the application and its operators yourself.

One particularly interesting feature of Flink is that it allows for stateful processing over streams, meaning that operators can remember information across multiple events rather than processing each event independently. This allows for much more sophisticated stream processing without having to use an external data store like a relational database or a key/value store to synchronize state. There are some restrictions to how and where this state can be used, of course, but we won't get into that here.
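A toy illustration of the idea (in Python rather than Flink's actual API): an operator that keeps per-key state across events, here a running average latency per service, with no external database involved.

```python
# Sketch of stateful stream processing: the operator itself remembers
# information between events instead of consulting an external store.
from collections import defaultdict

class RunningAverage:
    """Stateful operator: maintains a sum and a count per key."""
    def __init__(self):
        self.state = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]

    def on_event(self, key, value):
        s = self.state[key]
        s[0] += value
        s[1] += 1
        return s[0] / s[1]  # running average for this key so far

op = RunningAverage()
for service, latency_ms in [("api", 120), ("api", 80), ("db", 5)]:
    print(service, op.on_event(service, latency_ms))
```

In a real Flink application this state would be managed by the framework (checkpointed, partitioned by key across the cluster), which is exactly why the restrictions mentioned above exist.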

Kinesis Data Analytics comes with some common templates for operator graphs that do things like anomaly detection and streaming top-K. It can also attempt to infer the schema of a stream's data using a sample of records from the streaming source, which makes writing applications (particularly Java applications) a little less cumbersome.

Kinesis Data Analytics is serverless, and as with a lot of AWS's serverless offerings, you're charged based on the number of capacity units you use. Here, the capacity unit is the Kinesis Processing Unit (KPU), which provides 1 vCPU and 4 GB of memory; you're billed per KPU-hour. The system will automatically scale (consuming more KPU-hours in the process) to try to keep up with the stream as events arrive.

If you're running a Java application, you're charged extra for state storage and backups by the GB-month, as well as an additional KPU-hour per hour for orchestration. This additional orchestration surcharge is presumably because user-defined applications tend to put more load on Flink than automatically generated ones do.
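A quick back-of-the-envelope sketch of what that pricing model adds up to for a Java application. The rates below are placeholders, not actual AWS prices; check the current pricing page before budgeting anything.

```python
# Hypothetical rates, for illustration only.
KPU_HOUR_RATE = 0.11      # $/KPU-hour (placeholder)
STORAGE_GB_MONTH = 0.10   # $/GB-month of application state (placeholder)

def monthly_cost(avg_kpus, state_gb, hours=730):
    # Java-mode applications pay one extra KPU-hour per hour for
    # orchestration, plus state storage billed by the GB-month.
    compute = (avg_kpus + 1) * hours * KPU_HOUR_RATE
    storage = state_gb * STORAGE_GB_MONTH
    return compute + storage

# A job averaging 2 KPUs with 50 GB of state, over a ~730-hour month:
print(round(monthly_cost(avg_kpus=2, state_gb=50), 2))
```

The point of the sketch: the orchestration surcharge is a flat extra KPU, so it matters proportionally more for small applications than for large ones.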

Next: Interfaces

In this article, we took a look at some of the solutions for building more general-purpose data processing applications in AWS. Next, we'll look at some of the ways to connect applications to data in AWS that didn't quite fit into the previous topics.

If you'd like to get notified when new articles in this series get written, please subscribe to the newsletter by entering your e-mail address in the form below. You can also subscribe to the blog's RSS feed. If you'd like to talk more about any of the topics covered in this series, please contact me.



