The AWS Data Ecosystem Grand Tour - ETL
Written by Alex Rasmussen on February 14, 2020
This article is part of a series. Here are the rest of the articles in that series:
- Where Your AWS Data Lives
- Block Storage
- Object Storage
- Relational Databases
- Data Warehouses
- Data Lakes
- Key/Value and "NoSQL" Stores
- Graph Databases
- Time-Series Databases
- Ledger Databases
- SQL on S3 and Federated Queries
- Streaming Data
- File Systems
- Data Ingestion
- Data Interfaces
- Training Data for Machine Learning
- Data Security
- Business Intelligence
We talked a little about ETL (Extract, Transform, Load) back when we looked at data lakes. ETL processes take data from a source storage system, modify that data, and load it into a destination storage system (usually a data warehouse or a data lake). It's not the most glamorous task in the world, but it's critical to the success of any sufficiently large organization's data strategy. Writing robust ETL scripts fast enough to keep pace with demand for new data sources can be hard, though. AWS has a few managed ETL solutions that aim to make this process easier.
We've already discussed AWS Glue, with its data catalog and ETL code generator. AWS Data Pipeline also provides a managed ETL system, but it looks a lot closer to what you're used to if you've used tools like Airflow or Oozie before. You define a pipeline of transformation activities, either from a template or using a GUI tool called Data Pipeline Architect, and Data Pipeline handles running the pipeline on a schedule that you specify. This makes Data Pipeline significantly more flexible than Glue, but at the cost of more hands-on configuration.
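To make the "hands-on configuration" concrete, here's a minimal sketch of what a Data Pipeline definition looks like as pipeline objects, the structure you'd hand to the Data Pipeline API via boto3. The pipeline name, schedule, and activity wiring here are all illustrative, not taken from any real deployment:

```python
# A hypothetical daily pipeline: a Schedule object plus a copy activity
# that references it. Each object is a list of key/value fields;
# refValue points at another object's id.
pipeline_objects = [
    {
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "type", "stringValue": "Default"},
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "DailySchedule"},
        ],
    },
    {
        "id": "DailySchedule",
        "name": "DailySchedule",
        "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startDateTime", "stringValue": "2020-03-01T00:00:00"},
        ],
    },
    {
        "id": "CopyToRedshift",
        "name": "CopyToRedshift",
        "fields": [
            {"key": "type", "stringValue": "RedshiftCopyActivity"},
            {"key": "schedule", "refValue": "DailySchedule"},
            {"key": "input", "refValue": "S3InputNode"},
            {"key": "output", "refValue": "RedshiftTableNode"},
        ],
    },
]

# With boto3 you'd register and start the pipeline roughly like this
# (left commented out, since it needs AWS credentials and a region):
#
#   client = boto3.client("datapipeline")
#   created = client.create_pipeline(name="daily-etl", uniqueId="daily-etl-1")
#   client.put_pipeline_definition(
#       pipelineId=created["pipelineId"], pipelineObjects=pipeline_objects)
#   client.activate_pipeline(pipelineId=created["pipelineId"])
```

Even this toy definition shows the trade-off: you're wiring up schedules, inputs, and outputs yourself rather than letting Glue generate them.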
Data Pipeline's orchestration features also look a lot like those in Amazon Simple Workflow Service (SWF). Like SWF, Data Pipeline tracks activity execution and handles dependent scheduling and retries. Data Pipeline's ETL-specific functionality - like precondition checks on data and easy copying between data stores - sets it apart from SWF for ETL workloads.
Data Pipeline activities are split into two classes based on how frequently they run. High frequency activities, which execute more than once per day, are significantly more expensive than low frequency activities, which execute less than once per day. Activities can be executed in AWS or on-premises, but on-premises activities are significantly more expensive. Notably, you're charged for every pipeline you have declared in Data Pipeline, even if it's not running, and you still pay the normal amount for any other AWS resources that you use as part of a pipeline's execution.
Solutions like Glue and Data Pipeline are all well and good if the data you're ingesting is stored in something like a database or a file system, but what if you need to load streaming data? If you're running Kinesis Data Streams already, you can load data from that stream into several different data stores with Amazon Kinesis Data Firehose.
You can think of Kinesis Data Firehose as a set of pre-built data processors for Kinesis Data Streams that support some common ETL actions, like ingestion into AWS stores such as S3, Redshift, and Amazon Elasticsearch Service, as well as third-party systems like Splunk. It can also run transformations (defined as Lambda functions) over data as it's ingested, convert between data formats, and encrypt data as it's ingested. You can even configure it to load into multiple destinations at once.
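Those transformation Lambdas follow a simple contract: Firehose hands the function a batch of base64-encoded records, and the function returns each record with a result of `Ok`, `Dropped`, or `ProcessingFailed`. Here's a minimal sketch; the `message` field it uppercases is a made-up example payload, not anything Firehose requires:

```python
import base64
import json

def handler(event, context):
    """Minimal Kinesis Data Firehose transformation Lambda sketch.

    Each incoming record carries base64-encoded data; we decode it,
    apply an illustrative transform (uppercasing a hypothetical
    "message" field), and re-encode it with a result of "Ok".
    """
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["message"] = payload.get("message", "").upper()
        output.append({
            "recordId": record["recordId"],  # must echo the incoming recordId
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```

Records that fail to parse could instead be returned with `"result": "ProcessingFailed"`, and Firehose will route them to its configured error output rather than the destination.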
You're charged for Kinesis Data Firehose based on the volume of data you ingest. If you're using it to convert records to Parquet or ORC format, you're charged extra for conversion. If you process more than 500TB per month, your data ingestion costs start to go down a little bit. You can start to negotiate pricing with AWS once you go over 5PB (5000TB!) per month.
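The tiered structure is easy to sketch as a quick back-of-the-envelope calculator. The per-GB rates below are made-up placeholders; only the 500TB tier boundary comes from the discussion above, so check the AWS pricing page for real numbers:

```python
def estimate_firehose_cost(gb_ingested, rate_tier1=0.03, rate_tier2=0.025):
    """Illustrative tiered ingestion cost: a higher per-GB rate applies
    up to 500TB/month, and a (hypothetical) lower rate beyond that."""
    tier1_cap_gb = 500 * 1000  # 500TB expressed in GB (decimal units)
    tier1_gb = min(gb_ingested, tier1_cap_gb)
    tier2_gb = max(gb_ingested - tier1_cap_gb, 0)
    return tier1_gb * rate_tier1 + tier2_gb * rate_tier2
```

For example, with these placeholder rates, 600TB in a month would bill 500TB at the first-tier rate and the remaining 100TB at the cheaper second-tier rate.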
Next: Data Processing
In this article, we looked at a couple of ways that AWS makes it easier to get your data from one place to another in AWS, either as a batch or as it streams in.
Next time, we'll look at some ways that you can do more general kinds of data processing. This is where we finally get to talking about the Hadoop ecosystem, at least as it pertains to AWS.
If you'd like to get notified when new articles in this series get written, please subscribe to the newsletter by entering your e-mail address in the form below. You can also subscribe to the blog's RSS feed. If you'd like to talk more about any of the topics covered in this series, please contact me.