The AWS Data Ecosystem Grand Tour - Data Ingestion

Written by Alex Rasmussen on February 10, 2020

This article is part of a series.

Run the AWS pipes right into your building. — (Photo by Victor Garcia on Unsplash, cropped by me)

If you're moving your infrastructure to AWS, one of your biggest challenges is likely to be moving your existing data. You'll want to make sure that all your data is transferred and that it survived the journey into AWS intact. Transferring large volumes of data over the wide-area Internet to AWS can take a long time, and you'll never finish unless you can transfer data faster than it's being produced. If you're moving a database, you may need to keep that database running while you're doing the transfer, which complicates things still further. AWS recognizes that these challenges are daunting, especially to older or larger organizations, so they've developed solutions for each of them.

Moving Your Files

AWS DataSync is a software agent that sits inside your organization's network and facilitates the transfer of your data to S3 or EFS. It runs as a virtual machine, but it's configured and managed mostly through your AWS console. AWS claims that it uses an optimized parallel transfer protocol to make transfer to AWS faster than if you were to just use aws s3 sync. That increased transfer speed had better be worth the cost, though; transfer into AWS using DataSync costs $0.04 per GB transferred, which is a lot more than the $0 per GB that transfer into AWS normally costs.
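To get a rough feel for that trade-off, here is a minimal back-of-the-envelope sketch in Python. It uses the $0.04 per GB figure quoted above; the function names, the backlog size, and the transfer and growth rates are all hypothetical numbers chosen for illustration, not measurements of DataSync itself.

```python
DATASYNC_PRICE_PER_GB = 0.04  # USD per GB, the figure quoted in this article


def datasync_cost_usd(data_gb: float) -> float:
    """Cost of pushing data_gb gigabytes through DataSync at the quoted rate."""
    return data_gb * DATASYNC_PRICE_PER_GB


def days_to_catch_up(backlog_gb: float,
                     transfer_gb_per_day: float,
                     growth_gb_per_day: float) -> float:
    """Days until the existing backlog is drained.

    Returns infinity when new data arrives at least as fast as it can be
    transferred, i.e. the transfer never catches up.
    """
    net_rate = transfer_gb_per_day - growth_gb_per_day
    if net_rate <= 0:
        return float("inf")
    return backlog_gb / net_rate


if __name__ == "__main__":
    backlog_gb = 50_000  # hypothetical 50 TB backlog
    print(f"backlog cost: ${datasync_cost_usd(backlog_gb):,.2f}")
    print(f"days to catch up: {days_to_catch_up(backlog_gb, 2_000, 500):.1f}")
```

Note that the cost figure covers only the backlog. Any data produced while the transfer is running also has to go over the wire, and is billed at the same rate.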

If you've got a large enough volume of data, or your data is being generated quickly enough, DataSync may not work for you, either because it will never catch up or because it will cost way too much. If you want to get a large volume of data transferred quickly, the fastest way to do it is still physically shipping it from one place to another. AWS has a family of devices, called the AWS Snow Family, for doing just that. These devices come in three different sizes. The smallest is AWS Snowball. A Snowball is basically a ruggedized mini-PC full of hard drives that AWS ships to you. You connect the Snowball to your network, transfer your data onto it with the help of a specialized client, and send it back to AWS. It has an e-ink display on the front that displays a shipping label, so you can just drop it in the mail. When AWS receives the Snowball, they'll load the data into an S3 bucket for you.

Each Snowball "job" (shipping a Snowball, loading it with data, and sending it back) costs between $200 and $250 plus shipping, depending on whether you use their 50 or 80TB model. You can keep the Snowball for ten days without extra cost; each day beyond that costs $15. Transfer into S3 from a Snowball is free. You can use a Snowball to bulk copy data out of an S3 bucket as well, but transferring that data out will cost you. You'd better not lose the Snowball once you've received it, or else you'll pay a $7,500 device replacement fee.
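The pricing rules above are simple enough to sketch as a small calculator. This is only an illustration under stated assumptions: the prices are the ones quoted in this article, the mapping of $200 to the 50TB model and $250 to the 80TB model is my reading of that range, and the function name and shipping parameter are hypothetical.

```python
# Prices as quoted in this article. Assumption: the $200 fee belongs to the
# 50 TB model and the $250 fee to the 80 TB model.
JOB_FEE_USD = {50: 200, 80: 250}
FREE_DAYS = 10          # days you can keep the device at no extra charge
EXTRA_DAY_FEE_USD = 15  # charge for each day beyond the free period


def snowball_job_cost_usd(model_tb: int,
                          days_on_site: int,
                          shipping_usd: float = 0.0) -> float:
    """Estimate the cost of one Snowball job, excluding any S3 charges."""
    if model_tb not in JOB_FEE_USD:
        raise ValueError(f"unknown Snowball model: {model_tb} TB")
    extra_days = max(0, days_on_site - FREE_DAYS)
    return JOB_FEE_USD[model_tb] + extra_days * EXTRA_DAY_FEE_USD + shipping_usd


if __name__ == "__main__":
    # An 80 TB device kept for two weeks: four billable extra days.
    print(snowball_job_cost_usd(80, days_on_site=14))
```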

The next tier of device in the Snow family is AWS Snowball Edge. This device is a souped-up, rack-mountable version of the Snowball with embedded compute capability. These devices are designed for low-connectivity environments where you're collecting data and need to analyze or pre-process it on site. For instance, you might be on a container ship and need to collect data from the ship's various sensors, analyze that data as you're sailing, and transfer it to AWS for further analysis when you're in a port. Snowball Edge devices can run EC2 instances and execute Lambda functions in addition to storing a large amount of data, making them like little AWS data centers in a box.

Snowball Edge devices come in storage optimized and compute optimized variants; the storage optimized variant is cheaper and has more storage (it maxes out at about 80TB), but it has fewer vCPUs and less RAM than the compute optimized variant. The compute optimized variants can come with on-board GPUs for accelerating machine learning applications, but that'll cost extra. You can lease a Snowball Edge device by the day, or on a 1-year or 3-year term. As with many other AWS services that offer extended lease terms, you spend more money up front to pay less per day. You'll want to be even more careful with your Snowball Edge once you receive it, since it will cost upwards of $20,000 to replace if you break it.

For most of us, transferring data 40 to 80TB at a time with a Snowball or Snowball Edge is enough. Some organizations need more. A lot more. For them, there's the AWS Snowmobile, which is effectively up to 1250 Snowballs at a time shipped in a 45-foot ruggedized shipping container on the back of a truck. This allows for petabyte-scale transfer in a single "job", but is obviously not going to be cheap. AWS's public pricing information for Snowmobile is essentially "if you have to ask, you can't afford it".
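The arithmetic behind that comparison is easy to check. The sketch below assumes an 80TB Snowball, the 1250-Snowball equivalence mentioned above, and decimal units (1 PB = 1000 TB); the function names and the 500TB example dataset are made up for illustration.

```python
import math

SNOWBALL_TB = 80                 # capacity of the larger Snowball model
SNOWBALLS_PER_SNOWMOBILE = 1250  # equivalence quoted in this article
TB_PER_PB = 1000                 # decimal units


def snowballs_needed(dataset_tb: float, capacity_tb: int = SNOWBALL_TB) -> int:
    """Number of Snowball jobs needed to move dataset_tb terabytes."""
    return math.ceil(dataset_tb / capacity_tb)


def snowmobile_capacity_pb() -> float:
    """Snowmobile capacity implied by the Snowball equivalence, in petabytes."""
    return SNOWBALL_TB * SNOWBALLS_PER_SNOWMOBILE / TB_PER_PB


if __name__ == "__main__":
    print(snowballs_needed(500))     # Snowball jobs for a 500 TB dataset
    print(snowmobile_capacity_pb())  # implied Snowmobile capacity in PB
```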

Moving Your Database

AWS Database Migration Service (DMS) is a tool that migrates or replicates data between databases without downtime, as long as one or both of those databases reside in RDS or EC2. DMS runs inside AWS, and can migrate or replicate any database that it can open a connection to. If the two databases are of the same type (two PostgreSQL databases, for example), migration and synchronization are pretty straightforward. If they're of different types, things get a little trickier because you have to be able to translate schemas (and, if applicable, things like stored procedures and triggers) from the source database's type system to the destination's type system. AWS provides the AWS Schema Conversion Tool (SCT) to make this process easier. SCT scans the source database and tries to figure out how to do the conversion for you, reporting any incompatibilities or roadblocks it finds along the way.
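To make this a little more concrete, here is a hedged sketch of what describing a DMS migration might look like in Python. It only builds the arguments for a replication task that does a full load followed by ongoing replication; it doesn't call AWS. The helper names, the task identifier, and the ARNs are placeholders I've made up, and you'd need to have created the replication instance and both endpoints beforehand.

```python
import json


def selection_rule(schema: str, table: str = "%", rule_id: int = 1) -> dict:
    """A DMS table-mapping rule that includes matching tables ('%' is a wildcard)."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": str(rule_id),
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
    }


def replication_task_args(task_id: str,
                          source_endpoint_arn: str,
                          target_endpoint_arn: str,
                          replication_instance_arn: str,
                          schema: str) -> dict:
    """Keyword arguments describing a full-load-plus-replication task."""
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_endpoint_arn,
        "TargetEndpointArn": target_endpoint_arn,
        "ReplicationInstanceArn": replication_instance_arn,
        # Copy existing rows, then keep applying changes from the source.
        "MigrationType": "full-load-and-cdc",
        # DMS expects the table mappings as a JSON string.
        "TableMappings": json.dumps({"rules": [selection_rule(schema)]}),
    }


if __name__ == "__main__":
    # All identifiers below are placeholders, not real resources.
    args = replication_task_args(
        "example-task", "arn:source", "arn:target", "arn:instance", "public"
    )
    print(json.dumps(args, indent=2))
```

With real ARNs filled in, a dictionary like this could be passed to boto3 as `boto3.client("dms").create_replication_task(**args)`. Treat that as a starting point rather than a complete migration; a heterogeneous migration would still need the schema conversion work described above.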

Next: Extract, Transform, Load

In this article, we talked about various ways to get your data into AWS. Next time, we'll talk about a related topic: how to extract your data from one location, transform it, and put the transformed data somewhere else.
