The AWS Data Ecosystem Grand Tour - Data Warehousing
Written by Alex Rasmussen on January 7, 2020
This article is part of a series. Here are the previous articles in that series:
When we looked at relational databases, I mentioned the distinction in the database world between online transaction processing (OLTP) and online analytical processing (OLAP) systems. OLTP systems are meant for the day-to-day operations of an organization or system. They contain the current state of the data and tend to handle a large number of relatively small and simple reads and writes to that data. The data in an OLTP system tends to be normalized to avoid data duplication and enforce data dependencies, which may make individual transactions more complex and slower to execute. The primary user of an OLTP database is usually the organization's customers, who interact with some piece of software that reads and writes to the database on their behalf.
OLAP systems, by contrast, are meant for analyzing the history of an organization's data over long periods of time in order to make strategic decisions. They contain the historical state of the organization's data, which makes them quite a bit larger on average than OLTP databases. Queries to OLAP databases are mostly either complex reads or periodic bulk writes. The data in an OLAP database is typically heavily de-normalized for speed and ease of use, and tables are often the result of combining data from multiple disparate sources together. OLAP database tend to be used internally by an organization's analysts and data scientists.
By far the most popular kind of OLAP database is the data warehouse, and companies like Teradata and Oracle have been selling data warehouses to large organizations for decades. AWS's data warehouse is called Amazon Redshift. Today, Redshift is one of AWS's most popular and lucrative services.
Redshift's primary selling points are highly performant queries at petabyte scale and rich integration with the rest of the AWS ecosystem. Some of Redshift's performance is achieved through well-known techniques like columnar storage, data compression, and materialized views. Redshift also has a few more exotic tools in its toolbox, like scheduling queries based on their predicted runtime and using a proprietary compression encoding for numeric and date formats.
To provide consistently fast performance, Redshift will add transient capacity to a cluster automatically if demand on the cluster exceeds the cluster's ability to serve it. This is referred to as Concurrency Scaling and is purchased with Concurrency Scaling credits. Each cluster gets an hour of free Concurrency Scaling credits per 24 hours in the same kind of token bucket fashion we saw when looking at I/O credits in EBS; after that, you're charged for each Concurrency Scaling credit you use.
Redshift is a natural landing point for a lot of data from elsewhere in the AWS ecosystem, and AWS clearly wants Redshift to sit at the center of an organization's day-to-day analytics activities. Redshift can interact with data in a data lake that's managed with AWS Lake Formation, either by loading data from the data lake, dumping data to it, or querying data in the lake directly via a feature called Redshift Spectrum. It can perform federated querying by joining in data from Aurora PostgreSQL or RDS PostgreSQL. It can also ingest data from AWS's streaming and bulk data systems, and can serve as a data source for report generation in QuickSight, AWS's data visualization tool. Redshift also has many of the same quality-of-life features that RDS does, including automated provisioning, automated backups and built-in fault tolerance.
A Redshift data warehouse runs on a cluster of nodes. As with many of AWS's other offerings, you choose the size, type, and quantity of nodes you want for your Redshift cluster and you pay for those nodes either hourly or upfront. Redshift nodes come in different flavors for different use cases. At time of writing, there are three such flavors: dense compute (DC), dense storage (DS), and RA3. DC nodes use SSDs for local storage and are designed to provide good query performance at relatively low cost for smaller (i.e. under 500 GB) warehouses. DS nodes use HDDs for local storage and provide significantly higher storage per dollar than DC nodes at the cost of decreased performance. RA3 nodes are for customers with large or high-demand data warehouses. They use large local SSDs as a high-speed local cache for S3 storage, and soon will be able to use a hardware-accelerated cache/sub-query processor called AQUA to make accessing and querying that S3 storage even faster.
Interestingly, RA3 nodes allow you to scale the size of the warehouse's storage independently of the number of cluster nodes you have. When using RA3 nodes, you're charged per GB-month for storage at roughly the same rate you'd be charged for S3. For all other node types, some amount of on-board storage comes with each node you use.
If you're using Redshift Spectrum to query data lakes in S3 directly, you pay by the TB of data scanned by those queries. As mentioned above, you're charged for Concurrency Scaling credits used in excess of your free credit allotment. Backups and data transfer are charged similarly to how they are with RDS: backups are free up to 100% of the cluster's size for active clusters, and standard network transfer pricing applies except for some kinds of transfers between Redshift and S3.
Regardless of configuration, Redshift isn't cheap. AWS claims to be the cheapest cloud data warehouse to operate, but you can still expect to pay thousands of dollars per year for even the most modest Redshift cluster. For larger organizations, that's not a big deal (and beats having to pay Oracle or Teradata for a license), but smaller organizations will want to think hard about whether they have large enough concurrency and throughput demands to justify Redshift's cost.
A Brief Aside on Storage Dis-aggregation
The dis-aggregation of compute and storage is prominently featured in both Redshift and Aurora. This broader trend in data systems design makes a lot of sense given the current state of server hardware. In the past, network bandwidth limitations meant that remote disks were slower than locally attached disks. Network speeds have been increasing faster than disk speeds lately, which increasingly means that the network is no longer the bottleneck and remote disks are almost as fast as local ones are. This trend increasingly allows compute nodes to treat local and remote disks as roughly the same thing, and hence to provision and manage storage nodes completely separately from compute nodes. This separation opens the door for some really interesting new architectures, gives customers the freedom to provision their storage much more flexibly, and lets operators like AWS utilize their storage more effectively. It will be interesting to see how this trend plays out in the years to come as the price and throughput of various hardware components continues to change.
Next: Jumping in the Data Lake
In this article, we looked at AWS's data warehouse solution. Next, we'll look at data lakes, another popular way of managing large amounts of organizational data, and examine AWS's data lake solution.
If you'd like to get notified when new articles in this series get written, please subscribe to the newsletter by entering your e-mail address in the form below. You can also subscribe to the blog's RSS feed. If you'd like to talk more about any of the topics covered in this series, please contact me.