The AWS Data Ecosystem Grand Tour - Training Data for Machine Learning
Written by Alex Rasmussen on March 17, 2020
This article is part of a series. Here are the rest of the articles in that series:
- Where Your AWS Data Lives
- Block Storage
- Object Storage
- Relational Databases
- Data Warehouses
- Data Lakes
- Key/Value and "NoSQL" Stores
- Graph Databases
- Time-Series Databases
- Ledger Databases
- SQL on S3 and Federated Queries
- Streaming Data
- File Systems
- Data Ingestion
- Data Interfaces
- Training Data for Machine Learning
- Data Security
- Business Intelligence
Machine learning models are capable of pretty incredible things, but fundamentally they're good at picking up on patterns based on what they've already seen. If you train a model on data that's bad - if the data contains errors, class imbalance, or any one of numerous other kinds of bias - the trained model's inference will be correspondingly bad. Many machine learning projects fail because they don't have a good training dataset, and spending the money to construct one is hard to justify, especially when a high-performance model isn't guaranteed.
One of the things that makes good training data hard to collect is that it's often the result of manual labor. If you dig down far enough, especially in a model's early days, training data is almost always compiled by hand; read Vicki Boykis' excellent article on the subject for concrete examples of this. Humans are expensive, they get bored easily, and you have to double-check or triple-check their work because they make mistakes. Unfortunately, until we have true artificial intelligence (and don't hold your breath there), humans are the best option we've got.
Coordinating a bunch of humans who are labeling training data is complicated. Understandably, there's been a lot of interest in how to do a better job of building training datasets, and Amazon SageMaker Ground Truth is AWS's attempt.
SageMaker Ground Truth is the labeling part of AWS's SageMaker suite of services for building, training, and deploying machine learning models. Its job is to distribute labeling tasks to a group of human labelers and to record the labels those labelers assign. You can use your own labelers for this, hire a team of professional labelers, or farm the process out to Mechanical Turk, Amazon's crowdsourcing service.
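Programmatically, a labeling job is submitted through the SageMaker API's create_labeling_job call. Here's a rough sketch of what the request looks like using boto3; every ARN, bucket name, and path below is a placeholder, and a real job also needs task-type-specific settings I've only gestured at in comments:

```python
# Sketch of a SageMaker Ground Truth labeling job request.
# All ARNs, bucket names, and S3 paths are placeholders.
request = {
    "LabelingJobName": "animal-photo-labels",
    "LabelAttributeName": "animal",
    "InputConfig": {
        "DataSource": {
            # A manifest file listing the unlabeled objects to label.
            "S3DataSource": {"ManifestS3Uri": "s3://my-bucket/manifest.json"}
        }
    },
    # Where Ground Truth writes the resulting labels.
    "OutputConfig": {"S3OutputPath": "s3://my-bucket/labels/"},
    "RoleArn": "arn:aws:iam::123456789012:role/GroundTruthRole",
    "HumanTaskConfig": {
        # Points at your own workforce, a professional vendor team,
        # or the public Mechanical Turk workforce.
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:123456789012:"
                       "workteam/private-crowd/my-team",
        # The HTML template each labeler sees.
        "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/template.liquid"},
        "TaskTitle": "Label the animal in each image",
        "TaskDescription": "Choose the animal that appears in the photo",
        # Multiple labelers per object lets Ground Truth consolidate
        # (e.g. majority-vote) their answers.
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
        # A real request also specifies pre-processing and annotation
        # consolidation Lambdas for the chosen task type.
    },
}

# With real values filled in, you'd submit it like so:
#   import boto3
#   boto3.client("sagemaker").create_labeling_job(**request)
```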
For certain kinds of problems (image recognition or text classification, for instance), AWS already has models that are pretty good, but that will need additional training for your specific use case. In these cases, you can reduce labeling cost by using what SageMaker Ground Truth calls automatic labeling. The automatic labeling system uses a kind of machine learning called active learning to reduce the amount of training data that has to be labeled by hand.
The active learning process begins with a validation set, a random subset of unlabeled data that's manually labeled. The active learning system passes the validation set through AWS's existing model for the problem domain, and uses that model's performance to establish a confidence threshold. It then runs all the unlabeled data through the model, which generates a confidence score and an inferred label for each unlabeled datum. Data with confidence levels over the threshold are automatically labeled with their inferred labels, then both new and existing labeled data are used to train a new model. A random subset of the remaining unlabeled data is manually labeled to construct a new validation set, and the process repeats. The amount of labeled data increases with each iteration until all data is labeled or you tell the system to stop (perhaps because you ran out of money for manual labelers).
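The loop above can be sketched in a few lines of Python. This is a toy simulation, not the Ground Truth implementation: a random number stands in for the model's confidence score, the threshold is fixed rather than derived from validation performance, and the function and parameter names are mine:

```python
import random

random.seed(42)  # make the simulation repeatable

def simulate_active_learning(pool_size=10_000, validation_size=200, rounds=5):
    """Toy version of the automatic-labeling loop: manually label a
    validation set, auto-label everything the 'model' is confident
    about, retrain, and repeat."""
    unlabeled, labeled = pool_size, 0
    for _ in range(rounds):
        if unlabeled == 0:
            break
        # 1. Manually label a random validation set.
        batch = min(validation_size, unlabeled)
        unlabeled -= batch
        labeled += batch
        # 2. In the real system, validation performance sets the
        #    confidence threshold; here it's fixed for illustration.
        threshold = 0.9
        # 3. Run the model over the remaining pool and auto-label
        #    anything whose confidence clears the threshold. A random
        #    draw stands in for the model's confidence score.
        confident = sum(1 for _ in range(unlabeled)
                        if random.random() > threshold)
        unlabeled -= confident
        labeled += confident
        # 4. Retrain on all labels (old and new) and go again.
    return labeled, unlabeled

labeled, unlabeled = simulate_active_learning()
print(f"{labeled} labeled, {unlabeled} still unlabeled")
```

Even with a fixed threshold, the shape of the process is visible: each round converts a slice of the unlabeled pool into labels, with manual work bounded by the validation set size.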
As long as you have an initial model to bootstrap from, active learning can dramatically decrease the amount of manual labeling you have to do. Of course, you have to be really confident in that initial model to be able to trust the automatic labels, which is why AWS only offers automatic labeling for certain problem domains.
All this infrastructure doesn't come cheap. You're charged per labeled object, whether it's labeled automatically or manually. If you use Mechanical Turk or a professional labeling service, you pay an additional fee per object. If you've got a large dataset (tens of thousands of objects or more), be prepared to pay thousands of dollars for labels no matter which method you choose.
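A back-of-the-envelope estimate makes the scale concrete. The fees below are placeholders of my own choosing, not current AWS prices; check the SageMaker pricing page for real numbers:

```python
def ground_truth_cost(num_objects, service_fee_per_object,
                      workforce_fee_per_object=0.0):
    """Rough labeling cost: a per-object service fee plus an optional
    per-object workforce fee (e.g. Mechanical Turk or a vendor).
    Both fees are hypothetical placeholders, not real AWS prices."""
    return num_objects * (service_fee_per_object + workforce_fee_per_object)

# 50,000 images at a hypothetical $0.08 service fee plus a
# hypothetical $0.036 per-object Mechanical Turk fee:
print(f"${ground_truth_cost(50_000, 0.08, 0.036):,.2f}")  # → $5,800.00
```

Even modest per-object fees multiply out to thousands of dollars at tens of thousands of objects, which is exactly the regime the paragraph above warns about.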
Next: Snooping Out Sensitive Data
In this article, we looked at SageMaker Ground Truth and its facilities for producing labeled training data. Next time, we'll look at ways to detect and manage sensitive data in your organization's S3 buckets.
If you'd like to get notified when new articles in this series get written, please subscribe to the newsletter by entering your e-mail address in the form below. You can also subscribe to the blog's RSS feed. If you'd like to talk more about any of the topics covered in this series, please contact me.