The AWS Data Ecosystem Grand Tour - Search

Written by Alex Rasmussen on January 31, 2020

This article is part of a series.

A field full of haystacks. — (Photo by Jenelle Hayes on Unsplash)

All of the data systems that we've covered so far have some form of querying capability. These queries are more than sufficient for a lot of operational and analytical use cases, but they tend to perform poorly when you don't know exactly what you're looking for. This applies to both relational and document databases, but let's look at relational databases in particular.

One characteristic of relational queries that makes fuzzy queries hard is that you have to be able to tell the system the exact parameters of the data you're retrieving (e.g. the data's IDs, or the contents of some of its fields) in order to get the results that you want. If your query is more approximate (e.g. retrieve all records whose name field contains a word that is synonymous with "bicycle"), your query will fall short unless you've had the foresight to apply a lot of pre-processing and indexing to your data beforehand.
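To make that concrete, here's a small sketch using Python's built-in sqlite3 module. The products table and its rows are made up for illustration; the point is only that a substring match finds the literal word and nothing else.

```python
import sqlite3

# Hypothetical table and rows, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products (id, name) VALUES (?, ?)",
    [
        (1, "Red bicycle"),
        (2, "Mountain bike"),
        (3, "Kids' two-wheeler"),
        (4, "Garden hose"),
    ],
)


def names_containing(word):
    """Return product names that contain the given word as a substring."""
    rows = conn.execute(
        "SELECT name FROM products WHERE name LIKE ? ORDER BY id",
        (f"%{word}%",),
    )
    return [name for (name,) in rows]


exact_hits = names_containing("bicycle")
print(exact_hits)
```

The query only returns the row that literally contains "bicycle". "Mountain bike" and "Kids' two-wheeler" are arguably what the user wanted too, but the database has no way of knowing that unless you maintain a synonym table yourself and join against it.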

If a relational query returns multiple results, the order of those results can only be defined in the form of an ordering constraint in terms of the data's fields (e.g. ascending by age, descending by date). This can bury the most relevant results under a pile of irrelevant ones that happen to appear earlier in the sort order. You could derive a "relevance" field in your query and order by that, but deriving the value of that field can get complicated quickly.
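Here's a rough sketch of what deriving a relevance field by hand might look like, again with sqlite3. The articles table, its rows, and the weighting (a title match counts double) are all assumptions made up for this example, not a recommended scoring scheme.

```python
import sqlite3

# Hypothetical table and rows, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles ("
    "id INTEGER PRIMARY KEY, title TEXT, body TEXT, published TEXT)"
)
conn.executemany(
    "INSERT INTO articles (id, title, body, published) VALUES (?, ?, ?, ?)",
    [
        (1, "Quarterly report", "Sales figures. A search for growth continues.", "2020-01-30"),
        (2, "Search engine basics", "How search works and why search matters.", "2019-06-01"),
        (3, "Office party", "Cake and music.", "2020-01-15"),
    ],
)
params = {"pat": "%search%"}

# Ordering by a plain field: the newest match comes first, relevant or not.
by_date = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM articles "
        "WHERE title LIKE :pat OR body LIKE :pat "
        "ORDER BY published DESC",
        params,
    )
]

# Ordering by a derived score. In SQLite, LIKE evaluates to 0 or 1,
# so the expression below is 2 for a title match plus 1 for a body match.
by_relevance = [
    row[0]
    for row in conn.execute(
        "SELECT id, (title LIKE :pat) * 2 + (body LIKE :pat) AS relevance "
        "FROM articles "
        "WHERE title LIKE :pat OR body LIKE :pat "
        "ORDER BY relevance DESC, published DESC",
        params,
    )
]

print(by_date)
print(by_relevance)
```

Sorting by date puts the quarterly report first even though it only mentions "search" in passing. The derived score fixes that here, but it's already clumsy with one search term and two fields, and it doesn't count how often the term occurs at all.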

Search engines are designed to bridge these gaps in functionality. They build a data structure called an inverted index by analyzing each item (a document, in search engine terms) in a dataset to derive a corpus of terms and a list of which documents contain each term. Terms for text fields are derived by applying various natural language processing techniques like tokenization and stemming to make approximate queries more accurate. When the search engine is queried, it transforms the search query into a list of terms, looks up those terms in the index, and collates the lists of documents that contain each term into a unified list of search results. Those documents are then ordered by some relevance score that can vary according to the query. For example, you could rank a document's relevance based on the number of occurrences of the query's search terms, but rank a document higher if the search terms occur in the document's title.
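A toy version of this pipeline fits in a few dozen lines of Python. This is a minimal sketch, not how Elasticsearch or Solr actually work: the "stemmer" just strips a trailing "s", the score is a raw term count with an arbitrary boost for title matches, and the function names and sample documents are all invented for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Tokenize on non-alphanumerics, lowercase, and apply a toy stemmer."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Toy stemming: drop a trailing "s" from words longer than three letters.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def build_index(docs):
    """Map each term to the documents containing it, with per-field counts."""
    index = defaultdict(lambda: defaultdict(lambda: {"title": 0, "body": 0}))
    for doc_id, doc in docs.items():
        for field in ("title", "body"):
            for term in analyze(doc[field]):
                index[term][doc_id][field] += 1
    return index


def search(index, query, title_boost=3):
    """Return matching document IDs, highest relevance score first."""
    scores = defaultdict(int)
    for term in analyze(query):
        for doc_id, counts in index.get(term, {}).items():
            # Each body occurrence counts once; title occurrences count more.
            scores[doc_id] += counts["body"] + title_boost * counts["title"]
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical documents, invented for this example.
docs = {
    1: {"title": "Bicycle repair", "body": "Fix a flat tire on your bicycle."},
    2: {"title": "Commuting tips", "body": "Bicycles, buses, and trains compared. Bicycles win."},
    3: {"title": "Haystacks", "body": "A field full of hay."},
}

index = build_index(docs)
results = search(index, "bicycles")
print(results)
```

Because the query and the documents go through the same analysis step, a search for "bicycles" matches documents that say "bicycle" and vice versa. Document 1 ranks above document 2 even though document 2 mentions the term more often in its body, because the title match is weighted more heavily.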

AWS currently has two different search engine options, both based on popular open-source projects. Amazon Elasticsearch Service uses Elasticsearch, and Amazon CloudSearch uses Apache Solr. Elasticsearch seems to be getting a lot more attention from AWS these days than CloudSearch is. Both services are fully managed, meaning they handle much of the complexity of running the search engine, including the addition and removal of nodes from a cluster, backup and restore, and version upgrades.

Amazon Elasticsearch Service has a feature (currently in preview) called UltraWarm that can store indexes that are queried and updated infrequently in S3, allowing Elasticsearch to serve them at much lower cost. UltraWarm claims to use a "sophisticated caching solution" to make accessing indexes from S3 faster. We don't know much about what this cache is, but I suspect it may be the same solution that enables AQUA in Redshift under the hood.

Both Amazon Elasticsearch Service and Amazon CloudSearch run on clusters of instances like many of AWS's other services do. As usual, more powerful instances tend to cost more. Some classes of Elasticsearch Service instances will let you use local SSDs for storage, but most use EBS.

Pricing for these two services is pretty unsurprising given what we've seen so far. You pay for the instances that you use, either up-front or by the hour, and data transfer is billed in the usual way. Some operations (re-indexing, batch uploading) are billed based on the amount of data you're processing. Elasticsearch periodically snapshots its indexes; you get 14 days of those automated snapshots free, but any manual snapshots are stored in S3 and subject to S3 pricing.

There's another search engine in AWS's portfolio worth mentioning, although it's more domain-specific than either Elasticsearch Service or CloudSearch. Amazon Kendra is a search engine specifically tuned for searching over large collections of documents within enterprise intranets. If you've ever used internal knowledge base tools like Confluence before, you can think of Kendra as a supercharged version of those tools' search functionality. Kendra leans heavily on machine learning to determine how best to index documents and extract the right terms from a query. It also has a set of managed connectors that can scan documents and ingest them into the search engine automatically from a number of sources.

Kendra is packaged much more like a SaaS product than you're used to seeing in AWS, with a couple of different editions that allow for a certain number of indexed documents and queries per day and the option to purchase additional queries or storage. On top of that, Kendra charges a small amount for each document that its connectors index and an hourly rate while those connectors are running.

Next: Float Gently Down the Stream

In this article, we looked at AWS's search engine services. Next week, we'll cover services for handling event streams.
