The AWS Data Ecosystem Grand Tour - Data Interfaces

Written by Alex Rasmussen on March 12, 2020

This article is part of a series.

Happy plugs come from Washington. — (Photo by Paul Hanaoka on Unsplash, cropped by me)

Each of the data systems we've covered so far has some kind of API that can be used to query or access it. When developing applications or deploying systems, those APIs can sometimes be a bit too low-level. In this article, we'll look at some AWS services that put a higher-level interface on top of those data systems.

APIs for Offline and Real-Time

If you're developing an app that runs on a mobile device, sometimes that device is going to be in a place where Internet connectivity is spotty or non-existent. For some kinds of applications, you'd like your users to be able to keep using the app without disruption in those situations. For instance, suppose you're using a document collaboration app while you're on a train. If the train goes into a tunnel, you'd like to be able to keep working on the document while your Internet connection is disrupted and have your changes synchronized with your collaborators' changes when you're out of the tunnel and have connectivity again. This kind of functionality is tricky to implement, particularly when you have to resolve situations where two updates made at the same time conflict with one another.
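To make the offline half of the problem concrete, here's a toy sketch of a client that applies edits locally right away and queues them until connectivity comes back. This is not how any particular service implements it; `FakeServer`, `OfflineClient`, and their methods are hypothetical names invented for illustration, and the sketch deliberately ignores the conflict problem mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """Stands in for a remote document store (hypothetical)."""
    document: dict = field(default_factory=dict)

    def apply(self, key: str, value: str) -> None:
        self.document[key] = value


@dataclass
class OfflineClient:
    """Applies edits locally at once and queues them while offline."""
    server: FakeServer
    online: bool = True
    local: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)

    def edit(self, key: str, value: str) -> None:
        self.local[key] = value            # the user sees the edit immediately
        self.pending.append((key, value))  # remember it for the server
        if self.online:
            self.flush()

    def reconnect(self) -> None:
        self.online = True
        self.flush()

    def flush(self) -> None:
        # Replay queued edits in the order they were made.
        while self.pending:
            key, value = self.pending.pop(0)
            self.server.apply(key, value)


server = FakeServer()
client = OfflineClient(server)
client.online = False                  # the train enters the tunnel
client.edit("title", "Trip notes")
client.edit("body", "Drafted offline")
assert server.document == {}           # nothing has reached the server yet
client.reconnect()                     # out of the tunnel again
```

After `reconnect()`, the server holds both edits. The hard part, which this sketch skips, is what to do when a collaborator changed the same fields in the meantime.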

Another common application feature that's tricky to implement is real-time subscription. If you're using a chat app, for instance, you want to see other people's messages as soon as possible after they're sent. Having the app continuously ask the server "Hey, any new messages for me yet?" both drains the battery and can lead to a less-than-stellar user experience.
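The alternative to polling is to have the server push messages to whoever has subscribed. Here's a minimal in-process sketch of that idea; `ChatServer` and its methods are hypothetical, and a real service would deliver messages over a persistent network connection rather than a direct function call.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ChatServer:
    """Toy publish/subscribe hub: subscribers register a callback once."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, room: str, on_message: Callable[[str], None]) -> None:
        self._subscribers[room].append(on_message)

    def send(self, room: str, text: str) -> None:
        # Push the message to every subscriber of this room; nobody polls.
        for callback in self._subscribers[room]:
            callback(text)


chat = ChatServer()
inbox: List[str] = []
chat.subscribe("general", inbox.append)
chat.send("general", "hello")
chat.send("random", "nobody in this room is listening")
```

The client does no work between messages, which is where the battery savings come from.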

AWS AppSync aims to solve these problems. AppSync is a library and API generator that handles much of the trickiness behind offline operation and real-time notification for you. You can use AppSync to provide an API for things like DynamoDB tables, Elasticsearch indexes, and Aurora databases, as well as other HTTP APIs.

AppSync uses GraphQL as both its query language and data definition language. AppSync can leverage GraphQL's relatively rich description of the API's data model to automatically handle things like local storage and update synchronization. For some kinds of data stores, AppSync will also transparently maintain versioning metadata for each object, which means that it can detect when two updates to an object conflict. When a conflict occurs, the server can either reject the conflicting update, automatically merge the two conflicting versions according to some predefined rules, or invoke a Lambda function to perform some more sophisticated conflict resolution.
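Version-based conflict detection with those three outcomes can be sketched in a few lines. This is a toy model under my own assumptions, not AppSync's actual implementation: `Record`, `apply_update`, the strategy names, and the merge rule (union list fields, otherwise the incoming value wins) are all hypothetical, and the `resolver` callback merely stands in for a Lambda function.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ConflictError(Exception):
    """Raised when a stale update is rejected."""


@dataclass
class Record:
    data: dict
    version: int = 1


def auto_merge(stored: dict, incoming: dict) -> dict:
    """Toy merge rule: union list fields, otherwise the incoming value wins."""
    merged = dict(stored)
    for key, value in incoming.items():
        current = merged.get(key)
        if isinstance(current, list) and isinstance(value, list):
            merged[key] = current + [v for v in value if v not in current]
        else:
            merged[key] = value
    return merged


def apply_update(record: Record, incoming: dict, expected_version: int,
                 strategy: str = "reject",
                 resolver: Optional[Callable[[dict, dict], dict]] = None) -> Record:
    # The update conflicts if the writer last saw an older version.
    if expected_version == record.version:
        new_data = {**record.data, **incoming}
    elif strategy == "reject":
        raise ConflictError(f"expected v{expected_version}, found v{record.version}")
    elif strategy == "merge":
        new_data = auto_merge(record.data, incoming)
    elif strategy == "custom" and resolver is not None:
        new_data = resolver(record.data, incoming)  # stand-in for a Lambda
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return Record(new_data, record.version + 1)


doc = Record({"tags": ["a"]})
doc = apply_update(doc, {"tags": ["a", "b"]}, expected_version=1)  # no conflict
# A second writer still thinks the record is at version 1.
merged = apply_update(doc, {"tags": ["a", "c"]}, expected_version=1, strategy="merge")
```

With the default `"reject"` strategy the second write would raise `ConflictError` instead, leaving it up to the client to refetch and retry.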

AppSync charges per million queries or data modification operations, and per million real-time updates. Since real-time updates rely on persistent connections to the AppSync service, you also pay per million connection-minutes to the AppSync service. If you want to improve performance, you can also add cache nodes. Cache nodes come in a variety of sizes, and are roughly twice as expensive per hour to operate as a comparable EC2 instance would be.

Interfacing with Common Protocols

Sometimes, you want to give S3 or EBS support to an application that has no native interface to those systems. This is particularly common in organizations with a hybrid setup where some resources are on-premises and others are cloud-based. AWS Storage Gateway gets around that lack of support by interposing a translator of sorts between industry-standard storage protocols and S3 and/or EBS.

Storage Gateway consists of three component services. File Gateway provides an NFS or SMB interface to S3, allowing existing systems to read and write to S3 buckets as though they were more traditional file systems. Volume Gateway exposes EBS volumes via iSCSI, allowing anything that can mount an iSCSI volume to mount an EBS volume. Tape Gateway exposes S3 or Glacier using the iSCSI-VTL protocol, a popular protocol used by tape backup and recovery systems. Each service provides things like local caching and optimized transfer to its corresponding backend service.
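The appeal of File Gateway is that application code doesn't change at all. As a rough illustration, the sketch below writes and lists files using nothing but ordinary file I/O. The function names and the `/mnt/file-gateway` mount point are hypothetical, and a temporary directory stands in for the mounted share so the example runs anywhere.

```python
import tempfile
from pathlib import Path
from typing import List


def save_report(share_root: Path, name: str, text: str) -> Path:
    """Write a report with plain file I/O; no S3 client is involved."""
    target = share_root / "reports" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


def list_reports(share_root: Path) -> List[str]:
    """List report file names, again using only the file system."""
    return sorted(p.name for p in (share_root / "reports").glob("*.txt"))


# In a real deployment share_root might be Path("/mnt/file-gateway"),
# a hypothetical mount point for the gateway's NFS share.
with tempfile.TemporaryDirectory() as tmp:
    share = Path(tmp)
    save_report(share, "2020-03.txt", "monthly totals")
    names = list_reports(share)
```

If the directory were a File Gateway share, the expectation is that the written file would end up as an object in the backing S3 bucket without the application knowing or caring.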

The gateway service runs as an appliance. It can be deployed as an on-premises VM, an EC2 instance, or a rack-mounted hardware appliance. Appliances in EC2 cost a bit less than on-premises ones do. You pay for usage on the AWS service that the gateway is targeting, plus some amount for each GB written to the gateway itself. Data transfer from the gateway to the cloud is free, but transfer from the cloud to the gateway costs extra, so write-heavy workloads will be cheaper than read-heavy ones.

There's one more common organizational pattern that the Storage Gateway family doesn't cover. Transferring files back and forth with SFTP is still a pretty common pattern for inter-organizational data exchange, however depressing that may be for some of us. Hey, at least it's not run-of-the-mill insecure FTP or mailing hard drives back and forth, right? I've had customers or partners ask for both of those things.

If you need to expose your S3 bucket as an SFTP server, AWS Transfer for SFTP has you covered. This service allows you to use your organization's existing authentication mechanism to control access to the server, and integrates with Amazon Route 53 for DNS, so it can (in theory) act as a drop-in replacement for an existing SFTP server. This all comes at a steep price, though; the server itself costs $0.30 an hour as long as it's on (that's $2,628 per year for 24/7 operation), and you pay $0.04/GB for both uploads and downloads on top of that.
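The arithmetic behind that yearly figure is easy to reproduce. The sketch below uses the two rates quoted above as defaults; the function name and the transfer volumes in the example are made up for illustration, and current prices may differ.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of always-on operation


def annual_sftp_cost(hourly_rate: float = 0.30, per_gb_rate: float = 0.04,
                     gb_uploaded: float = 0.0, gb_downloaded: float = 0.0) -> float:
    """Estimate a year's cost in dollars for one always-on SFTP endpoint."""
    server_cost = hourly_rate * HOURS_PER_YEAR
    transfer_cost = per_gb_rate * (gb_uploaded + gb_downloaded)
    return round(server_cost + transfer_cost, 2)


idle_cost = annual_sftp_cost()  # endpoint only, no traffic
# Hypothetical workload: 500 GB uploaded and 500 GB downloaded per year.
busy_cost = annual_sftp_cost(gb_uploaded=500, gb_downloaded=500)
```

For modest transfer volumes the always-on endpoint dominates the bill, so the cost is roughly the same whether or not anybody uses it.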

Next: Building Training Data

In this article, we took a look at some of the interfaces to AWS data that haven't been covered in previous articles. Next time, we'll take a look at building training data for machine learning models with Amazon SageMaker Ground Truth.
