The AWS Data Ecosystem Grand Tour - Relational Databases

Written by Alex Rasmussen on December 17, 2019

This article is part of a series.


Relational databases are an enormously popular way of storing large amounts of inter-related structured data. They've been around since the 1970s, but continue to serve as the heart of most data management setups. If you're building an application in AWS, some part of it is likely going to read from or write to a relational database. That database is managed by what's called a relational database management system, or RDBMS.

In this article, we'll look specifically at RDBMSs that are designed for online transaction processing (OLTP). OLTP systems are designed to support many concurrent users performing many relatively small read and write transactions against a database. This is in contrast to online analytical processing (OLAP) databases, which are designed for executing large, complex, read-mostly queries against large data sets. We'll cover OLAP databases when we talk about data warehouses later in this series.

Relational databases are powerful, but they can be quite difficult to operate. Upgrades, backups, and replication are notorious headaches for database administrators, as is diagnosing performance problems and tuning the many settings that affect performance. Every sufficiently large organization has individuals (and often whole teams) who spend much of their time on the care and maintenance of the organization's RDBMSs.

To alleviate this pain, AWS offers Amazon Relational Database Service (RDS). RDS provides a turn-key way of provisioning an RDBMS in the AWS cloud. Each RDS instance comes with built-in support for replication, performance monitoring, automatic backups, and managed upgrades. You can also configure two RDS instances in different Availability Zones (AZs) as a primary-backup pair for increased fault protection; if the primary instance fails, RDS will automatically initiate failover to the backup. RDS also supports read replication, so you can spin up additional read replicas to handle increased read traffic to the database as needed. As your storage begins to fill, the underlying storage volume can automatically resize itself up to a configured maximum. If you've ever had to implement these features yourself in an on-premises database, you know how much trouble they can be to set up and keep working.
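As a sketch of how these features map onto the RDS API, the snippet below builds the parameters for boto3's create_db_instance call. The identifier, instance class, and storage sizes are hypothetical, and the actual API call is left commented out since it requires AWS credentials.

```python
# Sketch: the RDS features described above map to parameters on the
# CreateDBInstance API. Identifiers and sizes here are hypothetical.

def build_rds_params(identifier, engine="postgres"):
    """Build the keyword arguments for boto3's create_db_instance call."""
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": engine,
        "DBInstanceClass": "db.t3.medium",
        "AllocatedStorage": 100,      # initial storage, in GiB
        "MaxAllocatedStorage": 500,   # enables storage autoscaling up to this cap
        "MultiAZ": True,              # primary-backup pair across Availability Zones
        "BackupRetentionPeriod": 7,   # automatic backups, retained for 7 days
    }

params = build_rds_params("example-db")

# With credentials configured, you would pass these to the API:
# import boto3
# boto3.client("rds").create_db_instance(
#     MasterUsername=..., MasterUserPassword=..., **params)
```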

How RDS implements all these features (and which features are available) depends on which RDBMS you're using, and RDS has quite a few RDBMSs to choose from.

Database Options

RDS has support for all the major relational database players you'd expect: popular open-source databases MySQL, MariaDB, and PostgreSQL are supported, as are enterprise staples Oracle and SQL Server. These RDBMSs operate pretty much how you'd expect if you've used them before. AWS has also built their own relational database called Amazon Aurora. Since Aurora is unique to AWS, we'll focus on it here.

Aurora is wire-compatible with both PostgreSQL and MySQL, but uses its own custom distributed storage engine. This storage engine is separate from the compute portion of the database, the part that handles things like query planning and transaction processing. This allows Aurora to do a lot of interesting things. For example, since Aurora's compute is decoupled from its storage, compute nodes can be added and removed without those nodes having to "catch up" to one another by synchronizing storage. This makes operations like adding read replicas or performing failover easier than they are in typical clustered databases.

Aurora's decoupled design also allows for a configuration called Amazon Aurora Serverless that scales the compute part of an Aurora database up or down automatically to meet changes in demand; it can even scale down to zero instances when the database is idle. It does this by serving database traffic from a shared multi-tenant pool of Aurora instances that it attaches or detaches to the database's storage as needed. If your database traffic is bursty or unpredictable, this might save you a lot of money versus running an Aurora instance 24/7.

I'm glossing over a lot of interesting details that you can read about if you enjoy nerding out about distributed databases like I do.

What About Hybrid Cloud?

By default, AWS will set up and host your database instances for you. If you can't host your database in the AWS cloud for compliance or security reasons, you can use Amazon RDS on VMware and get most of the benefits of RDS while running the database instances themselves within your own private VMware vSphere cluster. RDS on VMware is kind of an odd duck in the AWS data ecosystem. It seems intended more as an initial step in a cloud migration path for large or heavily regulated companies that are hesitant about moving their databases off-site. Despite this apparent bent toward large enterprises, RDS on VMware doesn't support Oracle's RDBMS; I'm guessing this is largely a legal or political issue rather than a technical one. Unsurprisingly, you can't run Aurora with RDS on VMware either, likely because replicating Aurora's storage architecture in an on-premises vSphere cluster was a non-starter.

Pricing

RDS database instances look a lot like generic EC2 instances, but they run on specialized instance types whose names all begin with "db." (e.g., db.t3.2xlarge). Bigger instances with more vCPUs, more RAM, and faster network speeds cost more. As with EC2, you can pay for an instance hourly or buy a single-year or multi-year reservation for a lower price.
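To make the on-demand versus reserved trade-off concrete, here's a rough break-even comparison. All of the rates below are invented for illustration; check the current RDS pricing pages for real numbers.

```python
# Sketch: comparing on-demand to reserved pricing for a database instance
# that runs all year. The rates below are hypothetical.

HOURS_PER_YEAR = 24 * 365  # 8760

def on_demand_cost(hourly_rate, hours):
    """Pay-as-you-go: just the hourly rate times hours used."""
    return hourly_rate * hours

def reserved_cost(upfront, effective_hourly, hours):
    """Reservation: an upfront payment plus a discounted hourly rate."""
    return upfront + effective_hourly * hours

# Hypothetical rates for a large db.* instance class:
od = on_demand_cost(0.58, HOURS_PER_YEAR)         # running 24/7 for a year
rv = reserved_cost(1500.0, 0.17, HOURS_PER_YEAR)  # one-year reservation

# If the instance runs most of the year, the reservation wins.
print(f"on-demand: ${od:,.2f}, reserved: ${rv:,.2f}")
```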

As with EC2 instances, you can pick the kind of storage your database instance uses. There are both general-purpose and provisioned-IOPS SSD storage options, priced similarly to their EBS equivalents. RDS also supports HDD storage for backward compatibility with older instances, but AWS strongly recommends using SSDs.

As you might expect, a multi-AZ deployment of any database instance effectively doubles your cost for both compute and storage, since you're deploying two copies of the instance in two different places.

Aurora Serverless is the exception to everything I mentioned above, since you're not managing the instances that run your database yourself. Instead, running an Aurora Serverless database consumes Aurora Capacity Units (ACUs). Each ACU represents about 2 GB of RAM and a proportional amount of compute and network capacity. Aurora Serverless instances are billed by the ACU-hour for however many ACUs they use in a billing period.
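As a rough sketch of how ACU-hour billing works out for bursty traffic, the snippet below prices a day in which the database sits idle most of the time; the per-ACU-hour rate is invented for illustration.

```python
# Sketch: Aurora Serverless bills by the ACU-hour; one ACU is roughly 2 GB
# of RAM plus proportional compute and network. The rate is hypothetical.

ACU_HOUR_RATE = 0.06  # hypothetical $/ACU-hour

def serverless_cost(acus_by_hour):
    """Sum the ACU-hours consumed over a period and price them."""
    return sum(acus_by_hour) * ACU_HOUR_RATE

# A bursty day: idle (0 ACUs) for 18 hours, scaling up during traffic spikes.
day = [0] * 18 + [4, 8, 8, 4, 2, 0]  # ACUs consumed in each of 24 hours

print(f"${serverless_cost(day):.2f} for the day")  # 26 ACU-hours consumed
```

With an always-on instance you'd pay for all 24 hours regardless; here the 18 idle hours cost nothing, which is the appeal for unpredictable workloads.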

If you're storing backups, you're only charged if the total space used for an instance's backups is more than the capacity of the instance itself. Backups are heavily compressed, so (according to AWS, at least) you're hardly ever going to see a charge for this. If you retain an instance's backup after it's terminated, you're charged by the GiB-month regardless of the backup's size.
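The backup billing rule above amounts to a simple threshold: backup storage is free up to the instance's own provisioned capacity, and you pay only for the overage. The per-GiB-month rate here is hypothetical.

```python
# Sketch of the backup billing rule for a live instance: backup storage is
# free up to the instance's provisioned size. The rate is hypothetical.

BACKUP_RATE_PER_GIB_MONTH = 0.095  # hypothetical $/GiB-month

def billable_backup_gib(instance_capacity_gib, total_backup_gib):
    """GiB of backup storage you actually pay for on a live instance."""
    return max(0, total_backup_gib - instance_capacity_gib)

# 100 GiB instance with 80 GiB of (compressed) backups: no charge.
assert billable_backup_gib(100, 80) == 0

# 100 GiB instance with 130 GiB of backups: pay for the 30 GiB overage.
overage = billable_backup_gib(100, 130)
print(f"${overage * BACKUP_RATE_PER_GIB_MONTH:.2f}/month")  # 30 GiB billed
```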

By default, database performance history is retained for seven days. If you want to keep a longer history, you pay an additional price per vCPU-month, so long-term monitoring of larger instances costs more. If you're querying performance history from an external tool via the API, you pay per thousand API calls.

As with the other services we're covering in this series, normal network transfer costs mostly apply, though inter-AZ transfer for replication in a multi-AZ setup is free.

Next: Entering the Warehouse

In this article, we took a look at AWS's relational database offerings. Next, we'll look at their data warehouse offering, Amazon Redshift.

If you'd like to get notified when new articles in this series get written, please subscribe to the newsletter by entering your e-mail address in the form below. You can also subscribe to the blog's RSS feed. If you'd like to talk more about any of the topics covered in this series, please contact me.



