The AWS Data Ecosystem Grand Tour - Time-Series Databases

Written by Alex Rasmussen on January 21, 2020

This article is part of a series.

Time waits for no-one. Everyone waits for Timestream. — (Photo by Eric Prouzet on Unsplash)

Last time, we looked at graph databases, which are meant to solve a specific class of problem: querying data based on the relationships between entities rather than the entities themselves. Nothing stops you from trying to use a relational database to solve that problem, but making those sorts of queries execute efficiently in a relational database is hard.

Time-series databases are another kind of system designed to solve a problem that relational databases aren't particularly good at solving. As you might expect from the name, time-series databases store and query time-series data. You can think of time-series data as a collection of timestamped measurements that arrive over time from one or more sources. Time-series data could be the price of a stock over time, the CPU load on a server, the temperature reading from a sensor in a power plant, or an event that's generated any time someone opens or closes a door. In all these datasets, you care about both the contents of the measurement and when the measurement was made.

Time-series data tends to be summarized rather than queried individually. For example, you might want to know the average CPU load on a server in five-minute intervals for the last day, but individual measurements of CPU load either don't matter to you or are too variable to draw any conclusions from.
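To make that kind of summary concrete, here's a minimal Python sketch of averaging readings into five-minute buckets. The function name `bucket_averages` and the sample readings are made up for illustration; a real time-series database would do this for you through its query language.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def bucket_averages(samples, bucket=timedelta(minutes=5)):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a dict mapping each bucket's start time to the mean of the
    values that fall inside that bucket.
    """
    width = bucket.total_seconds()
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in samples:
        # Round the timestamp down to the start of its bucket.
        start = ts.timestamp() // width * width
        sums[start] += value
        counts[start] += 1
    return {
        datetime.fromtimestamp(start, tz=timezone.utc): sums[start] / counts[start]
        for start in sorted(sums)
    }


# Made-up CPU load readings, for illustration only.
t0 = datetime(2020, 1, 21, 12, 0, tzinfo=timezone.utc)
cpu_load = [
    (t0 + timedelta(minutes=1), 0.20),
    (t0 + timedelta(minutes=4), 0.40),
    (t0 + timedelta(minutes=6), 0.90),
]
averages = bucket_averages(cpu_load)
```

The first two readings land in the bucket starting at 12:00 and are averaged together, while the third lands by itself in the bucket starting at 12:05.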

Since the data is time-specific, there may be a point past which you don't need to query or store the data anymore. For example, I might only query temperature sensor data from the past 24 hours, but I might need to store it for 24 months for policy reasons or to query every once in a while to look at historical trends. To keep your queries fast and your costs low, you might want some kind of lifecycle management system that moves data older than 24 hours to cheaper (and slower) storage and deletes data older than 24 months.
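The lifecycle rule in that example can be sketched as a simple age check. This is only an illustration: the function name, the action labels, and the choice to approximate 24 months as 730 days are all my own assumptions, not any particular system's API.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the example above. Approximating 24 months as
# 730 days is an assumption made to keep the sketch simple.
HOT_WINDOW = timedelta(hours=24)
RETENTION = timedelta(days=730)


def lifecycle_action(measured_at, now):
    """Decide what to do with a measurement based on its age."""
    age = now - measured_at
    if age > RETENTION:
        return "delete"
    if age > HOT_WINDOW:
        return "move_to_cold_storage"
    return "keep_in_fast_storage"


now = datetime(2020, 1, 21, tzinfo=timezone.utc)
recent = lifecycle_action(now - timedelta(hours=1), now)
last_month = lifecycle_action(now - timedelta(days=30), now)
ancient = lifecycle_action(now - timedelta(days=800), now)
```

A time-series database with built-in lifecycle management runs the equivalent of this check for you in the background, so you don't have to schedule it yourself.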

Another common characteristic of time-series data is that you're receiving it in close to real time from a lot of sources at once. If you want information on the price of one stock, for instance, you probably also want the price of all stocks. If you're running a power plant, you've likely got hundreds or thousands of sensors giving you measurements from various locations in the plant. This means that your database has to be highly available, because you won't be able to take those measurements again, and it has to be able to ingest large volumes of data from a lot of sources at high speed.

Nothing stops you from using a relational database, a non-relational distributed row store, or a document store to do this. In the process, though, you might find yourself reinventing a lot of functionality that existing time-series databases like InfluxDB or OpenTSDB have already implemented.

Amazon has been advertising a time-series database called Amazon Timestream for a while now, but it's currently still in preview. We don't know a ton about it, but we can make some assumptions based on its pricing page, and based on other time-series database systems.

We know that Timestream is serverless. Users are billed per million 1 KB writes, regardless of the rate at which data is written. This differs from some of AWS's other data services, which charge for some abstract capacity unit that determines how many writes the service can handle per second while retaining low latency. This deviation from the norm makes sense, though, given the high throughput requirements we discussed earlier.
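Here's a rough sketch of what volume-based billing looks like as arithmetic. Everything specific in it is an assumption on my part: the function name, the idea that each write is rounded up to a whole number of 1 KB units, treating 1 KB as 1024 bytes, and the price, which is a placeholder rather than Timestream's actual rate.

```python
import math


def estimate_write_cost(num_writes, bytes_per_write, price_per_million_units):
    """Estimate the cost of a batch of writes under per-volume billing.

    Assumption: each write is rounded up to a whole number of 1 KB
    (1024-byte) units. The real rounding rules may differ.
    """
    units_per_write = max(1, math.ceil(bytes_per_write / 1024))
    total_units = num_writes * units_per_write
    return total_units / 1_000_000 * price_per_million_units


# 1,000 hypothetical sensors, each writing a 200-byte reading once per
# second for a day. The price of 1.0 per million units is a placeholder.
writes_per_day = 1_000 * 24 * 60 * 60
daily_cost = estimate_write_cost(writes_per_day, 200, price_per_million_units=1.0)
```

Note that the rate never appears in the calculation. Writing those 86.4 million readings in a day or in an hour would cost the same, which is the point of billing by volume rather than by provisioned capacity.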

We can safely assume that Timestream has a query interface, although we don't know anything about it. Some time-series databases have a more SQL-like interface, while others opt for a more exotic domain-specific query language. I'm guessing that Timestream will opt for a more SQL-like approach since it will feel more natural for their customers, but I don't have any evidence to back that up. Users pay for each TB scanned by queries, and I assume that Timestream's query planner will try to minimize the amount of data scanned per query. Timestream also appears to have some kind of support for automatic data "rollups", time-series aggregations that are updated as new data arrives.
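To show what a rollup that's updated as new data arrives might look like, here's a minimal in-memory sketch. It is not how Timestream implements rollups (we don't know that yet), and the `Rollup` and `RollupBucket` names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RollupBucket:
    """Running aggregates for one time bucket."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    @property
    def mean(self):
        return self.total / self.count if self.count else None


class Rollup:
    """Maintains per-bucket aggregates incrementally as points arrive."""

    def __init__(self, bucket_seconds):
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(RollupBucket)

    def add(self, epoch_seconds, value):
        # Round the timestamp down to the start of its bucket, then
        # update that bucket's aggregates in place.
        start = epoch_seconds // self.bucket_seconds * self.bucket_seconds
        bucket = self.buckets[start]
        bucket.count += 1
        bucket.total += value
        bucket.minimum = min(bucket.minimum, value)
        bucket.maximum = max(bucket.maximum, value)


rollup = Rollup(bucket_seconds=300)
for ts, value in [(10, 1.0), (200, 3.0), (310, 5.0)]:
    rollup.add(ts, value)
```

Because each bucket keeps only running aggregates, a query for the average over a bucket reads one small record instead of rescanning every raw measurement, which matters when you're paying per TB scanned.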

We know that Timestream has three tiers of data storage: a memory store, an SSD store and a magnetic store. All three stores charge by the GB-hour. Data in all three stores can be queried, but querying data in a more performant store costs more. The difference in price between the stores is dramatic: the memory store is almost 100x the cost of the SSD store and 1000x the cost of the magnetic store per GB-month. Moving data between these stores is handled by a table data retention policy. This retention policy is time-centric, allowing users to specify how long data remains in a given store before being demoted to a lower storage tier. This is similar in some respects to what we saw in S3's intelligent tiering system, but might not be as flexible.

That's really all we know about Timestream at this point. It was first announced at re:Invent 2018, and I was a little surprised when it didn't come out of preview at re:Invent 2019. It will be interesting to see how much I got right when Timestream finally sees the light of day.

Next: Don't Call It a Blockchain

In the past two articles, we've looked at two kinds of databases that handle a specific kind of problem that more traditional relational databases aren't great at solving. Next time, we'll continue that trend by looking at ledger databases, a new technological solution to a problem as old as accounting.
