< Back

Inside the Algolia Engine Part 2 — The Indexing Challenge of Instant Search

Search engine usage has evolved a lot in the last few years. Today’s users expect instant feedback without having to click on the “enter” button and wait for the page to load to get their results. They also expect the engine to tolerate typing mistakes on the fly without having to validate proposals of query re-phrasing, like the (in)famous “did you mean?”

Of course all these changes have pretty big consequences on the engine itself. In this post, we will concentrate on the way we handle indexing at Algolia, why we do it differently in order to maximize indexing speed and search performance and how we natively address these evolutions in search.

Introduction to indexing

The goal of the indexing process is to store your data in a specific data-structure that is optimized for search. Without indexing, we would need to scan all the documents in a database to detect which ones match a query. This would take forever and be highly inefficient, ultimately resulting in a terrible user experience. This is one of the reasons we continue to focus on optimizing indexing speed as part of our overall efforts to make search more efficient.

The main part of the indexing process is to create a data-structure that contains a mapping of each word to the associated list of documents containing this word. One mapping of the word to the list of documents is called an inverted list, with the name coming  from the inversion of the space: instead of having documents that contains words, we have now words with the list of documents containing each word. The concept is similar to the index you have at the end of a book, except all words are in the index.

There are a lot of algorithms to store those inverted lists in an efficient way while keeping good performance. Most of those techniques are pretty old and well described in the very good Managing Gigabytes book.

The creation of those inverted lists can be decomposed in two big parts:

  • For each document, we extract the list of words and build a hash-table that associates words to documents
  • When all documents are processed, we compute an on-disk binary data-structure containing the mapping of words to documents. This data-structure is the index we will use to process queries.

How search-as-you-type is usually implemented

As-you-type display of search results was introduced by Google in 2010 as “Google Instant” and later referred to as “Instant Search.” Users love this experience as it creates a seamless interaction with the data they are looking for! It removes the intermediate steps such as query validation, “did-you-mean” re-phrasing and decreases the time spent waiting for the results to show up! It also means indexing and search have to be redesigned because you don’t simply search for words anymore, but also for partially written words (prefixes) and it needs to be very fast!
There are four main different approaches engineers use today to build this type of instant search:

1) Autocomplete (also called Suggest)

This approach is the most commonly used and is based on a static data-structure on the side of the search engine. You can see it as a big dictionary in the form of a Trie (special type of Tree that associates string to an output, all the descendants of a node have a common prefix of the string).

Advantages:

  • Very fast response time (usually few milliseconds)
  • Scales well, can easily handle thousands of queries per second on one machine

Drawbacks:

  • This is not a search engine, you have to search for a sub-string on the pre-defined list of expressions in the dictionary. In terms of user-experience, it often leads to very frustrating “no results” or “bad results” because of this limitation
  • The relevance is not the same as the search engine itself, it is just a static dictionary. It works well if you propose “frequent queries,” but starts to have weird effects if you use it to search on the same records as the fully-featured search engine itself. The relevance will be different, of lesser quality and will lead to a lot of confusion.

2) Index prefixes

This approach is simply to build an inverted list for all prefixes of a word instead of just the word itself. For example instead of just having an inverted list for the word “engine”, you will also have one for “e”, “en”, “eng”, “engi” and “engin”. Those lists will usually contain a specific indicator that allows at query time to make the difference between a prefix and an exact word (both exact and prefix inverted lists will be queried with a preference for exact).

Advantages:

  • Fast resolution of prefixes as they are pre-computed
  • Ability to keep the relevance of the search-engine for all interaction with the engine, including as-you-type experience (user is never lost between two different ways to rank results)

Drawbacks:

  • Increase the time to publish results because the indexing process is far more expensive than without inverted list and consumes far more memory (inverted list are kept in memory)
  • Increase a lot the size of the index, which reduces a lot the likelihood of fitting in memory and so can hurt performance (even on a SSD)

3) ngrams computation

This approach is very similar to the indexing of all prefixes but it limits the computation of prefixes by allowing a minimum and a maximum number of letters in the prefix. The main advantage is to reduce the cost compared to the indexing of all prefixes, if you query a prefix that is bigger than the maximum number of letters, the query will be automatically be transformed into a query on several ngrams (queries with fewer letters than the minimum will not trigger the prefix search).

Advantages and drawbacks are the same as the indexing of all prefixes; it is just a small improvement to reduce a bit the cost of indexing by adding more computation in the search.

4) Search prefixes at search time

This approach keeps the indexing process pretty light by keeping only one inverted list per word but it requires to do all the processing at query-time. For example if you perform the query ‘a’, it will mean looking for all words that start by ‘a’ in a big dictionary and then perform a query with all those words in a big OR operand. This approach is very expensive and usually people that select this approach start the prefix search after a certain number of letters in the prefix to avoid having queries that are too expensive.

Advantages:

  • No indexing overhead for prefix searches
  • No compromises on the relevance of the search-engine

Drawbacks:

  • Produces very complex and slow queries
  • Only works if the number of documents and words is very small, it does not scale

Engineers working on search have no other choice than spending time to test those different approaches and select the one that seems the best for their use cases, unfortunately all of them have serious drawbacks.

A different approach

At the beginning of Algolia, we worked on a different product: an offline search engine SDK for mobile app developers. And we wanted this engine to provide a fast instant search experience on old devices like an iPhone 3G.

None of the standard approaches described above would have worked on such a low-end hardware, so we had to design a new way to do the indexing with very little RAM and CPU. All that while maintaining an instant search experience (one of the purpose of indexing being to speed up the search operations).

Having already worked on multiple data-structures for search engines prior to Algolia helped me a lot in designing a different way to solve the problem. The first inspiration came from a generational string B-tree I built several years ago. It was pretty similar to what the LevelDB fast key-value store developed by Google is doing (associate an arbitrary size value to an arbitrary size string).

First, I knew that for Algolia I wanted to have a generational data-structure.

Generational data-structure

Having a generational data-structure makes updating the data a lot faster. Typically, you’ll create two (or more) “generations” of your data-structure. A big one containing the “old” data, and a small one containing the “new” data.

Then, if you need to add/update/delete an object, you’ll update a smaller structure containing only the “new” information, which is much faster.

Merge of Generational data-structure

At some point (generally when your smallest generation reaches a certain size), you’ll need to merge several generations into one for a better efficiency. This operation is called a merge.

In order to have an efficient merge, each generation needs to be sorted, so that you can scan them in parallel while generating the result (merge of the parallel scan is done via a standard heap-merge algorithm).

Radix tree

For Algolia, I decided to represent the words that will be searchable as a radix tree (space-optimized representation of a Trie, below we look at what a radix tree is in practice).

First because a radix tree is sorted, which allowed us to have an efficient merge and to use multiple generations of the tree.

Second because it works very well to process the typo-tolerance in a very efficient way (more details on the typo-tolerance will come in another post).

It is easy to build a radix tree in memory before writing it on disk, but it consumes a lot of RAM. And I wanted to use the least amount of RAM possible, because:

  1. For the initial offline search SDK, we simply didn’t have enough RAM on an iPhone 3G.
  2. For our current SaaS engine, we want the RAM to be focused on making search fast (the indices are all stored on RAM for search), which means indexing should consume the minimum amount of RAM.

What’s great is that you can actually write a radix tree on the fly without having to build the tree in memory if you’re indexing a list of sorted words.

For example, let’s consider that the dataset we want to search on contains 5 objects, each having a single word. Each object will be represented by a value from 1 to 5.

Normally, multiple objects could have the same word, so the leaf would have a list of values. But for simplicity’s sake, we’ll take records containing different words.

Dictionary example mapping a word to an integer

Each node of our radix tree will contain a string.

If the node is a leaf, it has a value associated to the corresponding object having this word.

If the node is not a leaf, it can have an associated value, but it can also just be an internal node representing a shared prefix.

The radix-tree we’re going to build will look like this:

Radix-tree of the previous dictionary

Now, let’s build our radix-tree. Seems simple now enough, right? The naive approach is to build the complete data-structure in memory and then to dump it on disk. We do not want to use this approach because we want to keep as much memory as possible on our servers to hosts the indices of our users and provide a fast search! This is why we designed all the indexing to use as little memory as possible while optimizing speed of indexing.

Here’s the trick: we can build this data-structure on-disk directly from the list of words and with only a few kilobytes of memory. To do that, we’ll flush from the RAM the nodes as soon as possible, by building the tree on the fly.

The first step is to order alphanumerically the words of the documents we want to index (which is already done in our example).

Then, we’re going to build our tree using a depth-first search: since a radix tree is sorted in the alphanumeric order, it’ll be easy to flush nodes to the disk as soon as we don’t need them.

  1. 1. Add the word Fast
  2. 2. Add the word Faster
  3. 3. Add the word Test
  4. Step 3
  5. 4. Flush the branch “fast” to disk
  6. Step 4
  7. 5. Add the word toaster
  8. 6. Flush the branch “est”
  9. 7. Add the word toasting
  10. 8. Flush the branch “er”, then “ing”, “oast” and finally “t”
  11. This way to build a tree is very efficient because it only keeps in memory a very small number of nodes (in the worst case scenario you will have a number of nodes in memory that is equal to the longest key in the tree, in our case we always have fewer than 100 nodes in memory).Now that this is done, let’s focus on the next part: prefix search!

Prefix search

Remember that we wanted an instant search experience. This means that the engine must be able to retrieve results as-you-type, at each character.

If you type “f”, you want to retrieve the results “fast” and “faster”, so the engine must know that both these words contain the prefix “f”.

And we don’t want to calculate the association prefix-object for all the prefixes, which would take a lot of time and memory, but only when it’s really necessary.

The good news is that our radix tree can tell us precisely where we need to store prefixes: all the nodes in the tree which have children.

In our example, it is only necessary for the nodes “fast”, “t” and “toast”.

Radix-tree with prefix inverted list computed

You see what I just did right there? Yes! the prefixes can be computed on the fly too! We just need to remember what was in the last few branches that we flushed.

For example, if I just flushed the branch “est”, I’ll remember that the prefix “t” is associated to the object 3 (test). I’ll do the same with “toaster” (object 4) and “toasting” (object 5). And when we’re done processing all the words beginning with “t”, I’ll know that this node should contain the objects 3, 4 and 5.

A few numbers

This way to build prefixes has been built into our engine since 2012 and we index billions of jobs every day with this algorithm. We have measured the impact on a lot of different indices, the impact on search is pretty obvious as we always have the prefix inverted list when we do a query (and we’ll have a lot of RAM handy), but the impact on indexing time and on the size are less obvious. Here is a small summary of the impacts that are very small compared to the advantages it gives at query time:

Minimum Average Maximum
Impact of prefixes on index size +9% +21% +43%
Impact of prefixes on indexing speed +3% +12% +17%

This approach is significantly better than any of the other approaches presented before, it is actually optimal at query time and has a very reasonable impact on indexing performance!

This optimization of the prefixed indexing is just one of the numerous optimizations we have built in the Algolia engine. The reason we built it is because of our focus on performance and the fact that our engine was developed specifically to allow for the strong constraints of mobiles devices. We also had to think a bit out of the box, mainly by considering the word dictionary and inverted lists as one unique data-structure instead of two independent one.

I look forward to reading your thoughts and comments on this post and continuing to explain how our engine is implemented internally, transparency is part of our DNA! 🙂

We recommend to read the other posts of this series:

  • Hal

    Great post! For the “hash-table that associates words to documents”, what data structure do you use to store the documents? Bitmaps or roaring bitmaps? Something custom?

    • Nice to hear that you like the post! For the hash-table itself we are using Sparsehash, for the documents in the hashtable we are using a custom implementation (very compact linked list as we want to optimize the memory usage and the number of call to memory allocation for speed). The number of records for each word follows a Zipf’s law, using one memory allocated data-structure for each word would cost a lot in term of memory allocation and the over-head of the memory-allocation and the data-structure make a binary encoding useless. This is why we have this custom implementation of a linked list optimized for size in the hash-map entry (we allocate linked-list entries by big blocs in order to reduce memory allocations). In practice on a machine with 128GB of RAM, we use around 10GB for indexing, considering we build 6 different indices in parallel, the remaining 118GB are used to store indices in memory (we optimize everything to have 100% of the index in memory)

  • Shai Alon

    Fantastic Post Jullien.
    This brings me back to how amazed I was the first time I saw the Aglolia search in action, and how it changed my perception for the possibilities of “search as you type”… While I appreciate elasticsearch for it’s architecture and flexibility, I always wondered what was Algolia’s secret sauce that makes it outperform elastic in speed, relevance and simplicity (not having to deal with configuring prefixes and n-grams is a joy when using Algolia).
    Of all the architecture pieces you’ve posted over the years (including stackshare), this is the best as it really reveals the core of what makes the solution so elegantly amazing.

    Looking forward to more posts like this,
    Shai

    P.S.
    You’ve probably head this before, but it would be amazing if you release an open source “mini-algolia”…