The AWS Data Ecosystem Grand Tour - Data Lakes
Written by Alex Rasmussen on January 9, 2020
This article is part of a series. Here are the rest of the articles in that series:
- Where Your AWS Data Lives
- Block Storage
- Object Storage
- Relational Databases
- Data Warehouses
- Data Lakes
- Key/Value and "NoSQL" Stores
- Graph Databases
- Time-Series Databases
- Ledger Databases
- SQL on S3 and Federated Queries
- Streaming Data
- File Systems
- Data Ingestion
- Data Interfaces
- Training Data for Machine Learning
- Data Security
- Business Intelligence
Despite their long history and wide adoption, data warehouses have some problems. They're expensive, and getting data into them often requires extracting data from various (sometimes non-relational) sources, storing it in some temporary staging area, massaging it into an appropriately relational form, and carefully loading it into the warehouse. This ETL (extract/transform/load) process is often expensive, slow, and time-consuming to maintain. These problems have been exacerbated in the last ten years or so as the volumes of data that organizations have to manage and analyze have become increasingly large and diverse.
Another way to provide a centralized source of an organization's data is to build a data lake. At its most basic, a data lake is just a place where a large collection of curated datasets can be stored, typically as a bunch of files. Unlike data warehouses, data lakes don't tend to impose any kind of structure (relational or otherwise) on the data they store, which allows for heterogeneously structured or unstructured data to live alongside highly structured data in the same storage system. This flexibility also makes the extract and load parts of ETL easier to do, since you could just copy data into the data lake from wherever the data originated without doing any transformation at all if you wanted to [1]. In an ideal world, this means that it's easy for anyone in your organization to contribute curated datasets to a central location that can then be easily and flexibly re-used by your whole organization.
The benefits of a data lake come with their associated drawbacks, of course. The flexibility and heterogeneity that the data lake allows make querying its data more difficult than it would be in a data warehouse. In particular, the location and schema of a particular dataset in the data lake are much harder to discover. The ease of ingestion in a data lake also poses problems for data quality and governance. If everyone is allowed to load data into the data lake willy-nilly, it's possible for the data lake to quickly become an uncurated, insecure, disorganized mess. This process is sometimes jokingly referred to as building a data swamp.
To overcome some of these deficiencies, you want to have some way of identifying what data is in the data lake, where that data is, and how that data is structured, if it's structured at all. You'll also want to make ingestion from other kinds of data stores into the data lake as easy and standardized as you can to avoid building a bunch of bespoke ETL scripts yourself, and you'll want to enforce some sensible access control policy to keep the data in the lake secure. Any data lake management solution worth its salt will have solutions for most or all of these issues. AWS Lake Formation, AWS's data lake management solution, is no exception.
Lake Formation manages data lakes stored in S3 buckets. Its APIs provide the only means of accessing data in the data lake, which allows Lake Formation to centrally enforce access control and reformat and re-partition the lake's data behind the scenes. Most of Lake Formation's other features are actually provided by another AWS service called AWS Glue, so we'll spend most of this article talking about AWS Glue's functionality.
AWS Glue is a managed ETL system and data catalog. As mentioned above, ETL is a process that moves data from one location and format into a different location and/or format, often doing other data cleaning steps along the way. A data catalog is meant to serve as a master record of your organization's datasets, including their locations, schemas, ownership, and other metadata. Glue has a set of crawlers that can find your various AWS datasets (in RDS, Aurora, S3, Redshift, and elsewhere) and store information about those datasets in the data catalog. These crawlers are optional, though; you can populate data in your data catalog manually or using some other system if you want. Glue uses the information in its data catalog to generate ETL scripts that load data from a source dataset into any number of target data systems - including Lake Formation - and record the new target dataset in the data catalog. Glue is "serverless", meaning that your jobs are executed on a cluster that AWS manages and you're billed for the amount of time on that cluster that you use.
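To make the idea of a data catalog a bit more concrete, here is a minimal in-memory sketch of the kind of record a catalog keeps. This is not Glue's API; the class names, field names, bucket paths, and team names are all hypothetical, and a real catalog would be a persistent, shared service rather than a Python dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one dataset; the data itself stays where it is."""
    name: str
    location: str   # e.g. an S3 URI or a database table identifier
    schema: dict    # column name -> type name
    owner: str
    metadata: dict = field(default_factory=dict)


class DataCatalog:
    """A toy, in-memory stand-in for a data catalog (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Registering the same name again overwrites the old entry.
        self._entries[entry.name] = entry

    def lookup(self, name):
        return self._entries[name]

    def find_by_column(self, column):
        # Discover every dataset whose schema contains the given column.
        return sorted(
            e.name for e in self._entries.values() if column in e.schema
        )


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="orders",
    location="s3://example-lake/curated/orders/",  # hypothetical bucket
    schema={"order_id": "bigint", "customer_id": "bigint", "total": "double"},
    owner="sales-eng",
))
catalog.register(CatalogEntry(
    name="customers",
    location="s3://example-lake/curated/customers/",  # hypothetical bucket
    schema={"customer_id": "bigint", "email": "string"},
    owner="crm-team",
))

print(catalog.lookup("orders").location)
print(catalog.find_by_column("customer_id"))
```

The point of the sketch is that the catalog answers "what data do we have, where is it, and what does it look like?" without touching the data. In this picture, a crawler is just something that calls `register` for you.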
Glue also includes some automated data cleaning tools. Most notably, it has FindMatches, a machine learning model that can be trained to recognize whether two records are semantic duplicates of one another. This is useful both to de-duplicate a single dataset and to match records across different datasets. It's also got some built-in logic for doing common cleaning operations like date format standardization.
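As a rough illustration of these two cleaning tasks, here is a small standard-library sketch. It is emphatically not how FindMatches works: FindMatches is a trained model, while this uses plain string similarity from `difflib` as a stand-in. The list of date formats, the helper names, and the 0.85 threshold are all assumptions chosen for the example.

```python
from datetime import datetime
from difflib import SequenceMatcher

# Formats we are willing to recognize; this list is an assumption.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y")


def standardize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize(record):
    """Lowercase and collapse whitespace so cosmetic differences vanish."""
    parts = []
    for key in sorted(record):
        parts.extend(str(record[key]).lower().split())
    return " ".join(parts)


def likely_duplicates(a, b, threshold=0.85):
    """Crude similarity check; the threshold is an arbitrary example value."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


rec_a = {"name": "Jon Smith", "city": "Seattle"}
rec_b = {"name": "John  Smith", "city": "seattle"}
rec_c = {"name": "Maria Garcia", "city": "Austin"}

print(standardize_date("01/09/2020"))
print(standardize_date("January 9, 2020"))
print(likely_duplicates(rec_a, rec_b))
print(likely_duplicates(rec_a, rec_c))
```

A string-similarity rule like this breaks down quickly on real data (nicknames, reordered fields, missing values), which is exactly the gap a trained matching model is meant to fill.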
You aren't charged for using Lake Formation, but you're charged for services like S3 and Glue that Lake Formation uses or manages, as well as for any ETL scripts you run.
Glue charges by the hour for both crawlers and ETL jobs. Since ETL jobs are serverless, you're charged in terms of Data Processing Units (DPUs). A DPU-hour roughly equates to an hour of time running on an instance with 4 vCPUs and 16 GB of RAM. Of course, standard data transfer and request charges for both source and target systems also apply.
If you choose to develop your ETL scripts interactively, you can provision a development endpoint so that you can iteratively test uncommitted scripts. Provisioning that development endpoint effectively reserves cluster capacity for this purpose, so you're charged (in DPU-hours) for the entire time that the development endpoint is provisioned. This can get expensive relatively quickly.
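The DPU-hour arithmetic is simple enough to sketch. The rate below is a placeholder and not a quoted AWS price, the function names are hypothetical, and the calculation ignores any minimum billing duration or rounding that the real service might apply.

```python
# Placeholder rate in dollars per DPU-hour; NOT a real quoted price.
ASSUMED_RATE_PER_DPU_HOUR = 0.50


def dpu_hours(dpus, runtime_minutes):
    """DPU-hours consumed by `dpus` units running for `runtime_minutes`."""
    return dpus * runtime_minutes / 60


def dpu_cost(dpus, runtime_minutes, rate=ASSUMED_RATE_PER_DPU_HOUR):
    """Cost of that usage at the given (assumed) rate."""
    return dpu_hours(dpus, runtime_minutes) * rate


# An ETL job that uses 10 DPUs for half an hour.
job_cost = dpu_cost(dpus=10, runtime_minutes=30)

# A development endpoint with 5 DPUs left provisioned for an 8-hour day.
# It is billed for the whole time, whether or not scripts are running.
endpoint_cost = dpu_cost(dpus=5, runtime_minutes=8 * 60)

print(f"job: {dpu_hours(10, 30)} DPU-hours, ${job_cost:.2f}")
print(f"endpoint: {dpu_hours(5, 8 * 60)} DPU-hours, ${endpoint_cost:.2f}")
```

Under these assumed numbers, one idle working day on the development endpoint costs as much as eight runs of the job, which is why forgetting to tear an endpoint down adds up.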
The data catalog portion of Glue charges a fee for storing and accessing information in the catalog, although the first million objects and the first million accesses every month are free.
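The free tier works like a simple allowance: only usage above the first million objects and the first million accesses in a month is billable. Here is a sketch of that calculation. The per-unit rates are caller-supplied placeholders, since actual prices and billing increments aren't given here, and the function names are hypothetical.

```python
FREE_OBJECTS_PER_MONTH = 1_000_000
FREE_ACCESSES_PER_MONTH = 1_000_000


def billable(used, free_allowance):
    """Usage above the free allowance; never negative."""
    return max(0, used - free_allowance)


def catalog_monthly_cost(objects_stored, accesses,
                         rate_per_object, rate_per_access):
    """Monthly catalog cost given assumed per-object and per-access rates."""
    extra_objects = billable(objects_stored, FREE_OBJECTS_PER_MONTH)
    extra_accesses = billable(accesses, FREE_ACCESSES_PER_MONTH)
    return extra_objects * rate_per_object + extra_accesses * rate_per_access


# 1.25 million objects stored, 400,000 accesses in the month.
print(billable(1_250_000, FREE_OBJECTS_PER_MONTH))
print(billable(400_000, FREE_ACCESSES_PER_MONTH))

# Entirely inside the free tier, so the cost is zero whatever the rates are.
print(catalog_monthly_cost(900_000, 400_000,
                           rate_per_object=0.00001,    # placeholder rate
                           rate_per_access=0.000001))  # placeholder rate
```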
Next: Operational Storage Without SQL
Now that we've looked at analytical stores in the form of data warehouses and data lakes, we'll spend the next article back in the realm of operational stores. Specifically, we'll be looking at some of the operational stores that aren't relational databases.
[1] This process of loading data before transforming it rather than afterwards is sometimes referred to as ELT, with load happening before transform. This term has also been used to describe the increasingly common practice of loading data into a data warehouse before transforming it, though, so it's not really specific to data lakes.