Making the Internet Archive’s full text search faster 

(You can also find this article on medium)
This article describes how we made the full-text organic search faster at the Internet Archivewithout scaling horizontally — allowing our users to search in just a few seconds across our collection of 35 million documents containing books, magazine, newspapers, scientific papers, patents and much more.

By organic search we mean the “method for entering one or a plurality of search items in a single data string into a search engine. Organic search results are listings on search engine results pages that appear because of their relevance to the search terms.” (1).

The relevance should be scored on the search query matches for every document. In other words, if the search query has a perfect match in some of our documents, we want to return these documents; otherwise, we want to return the documents containing a part of the query or one of its subset.

Let’s consider some of our initial conditions:

  • Our corpus is very heterogeneous. It contains a lot of documents with content, context and language.
  • The size of our documents varies considerably, from a few kilobytes to several megabytes of text. We have  documents consisting of only a few pages and books with thousands of pages.
  • This index should also be agnostic about the kind, content, context and language of a document. For simplicity and for an easier connection with our documents ingestion pipeline, we need to keep everything in a single index.
  • At the Internet Archive we have thousands of new documents to index every day so the search must continue to work properly in a continuous indexing mode.
  • The search has to work as fast as possible, serving each search request without slowing down the execution of the other ones.

:: Cluster ElasticSearch

Our first idea was deploying a simple ElasticSearch cluster, relying on Lucene’s full-text search server features.

We built the cluster on KVM virtual machines on Intel hardware using Ubuntu 14.4  and ElasticSearch 2.3.4:

Number and type of node CPU RAM Storage
10 datanodes 14 CPU 45GB 6.4TB SSD consumer
3 masters 2 CPU 2GB
4 clients 22 CPU 34GB ram

The Java Virtual Machine (JVM) runs with 30GB: -Xms30g -Xmx30g

Our document structure is a bit more complicated, but to keep it simple let’s consider every document as consisting of only these fields: Title, Author, Body. All of these fields are type string and analyzed with a custom analyzer using the International Components for Unicode.

tokenizer normalizer folding
icu_tokenzer icu_normalizer icu_folding

We created an index of 40 shards with 1 replica, for a total footprint of ~14TB of data, with shards of ~175GB.

:: First results and first slow queries

At the beginning this index was not too bad. With the first 10 million documents loaded, we got good response time and performance. Our goal is to index and make the search available for all 35 million documents. We soon learned that the performance got worse when adding more documents.

We discovered that some queries were taking too many seconds to be executed, keeping our resources busy and slowing down all the cluster metrics, in particular the indexing latency and the search latency. Some queries were taking more than 30 seconds even when the cluster was not particularly busy.

As an example, a search for “the black cat” took longer than the more simple query “black cat”  because of the “the” in the query. Another good example was: “to be or not to be” — particularly slow because it is comprised of only high-frequency words. An even slower query was “to be or not to be that is the question,” which took over one minute.

:: The analysis

To better understand the problem, we monitored the cluster’s nodes while the slow query was running.  We used profile and hot-threads from the API to profile the queries and to monitor the threads execution on the cluster. We also used iostat to keep monitored the IO activities on the datanodes.

We discovered:

  • Lots of hot threads were about the highlighting phase. We need this feature because we don’t want to know only what documents contain the best matches, but we also want to show the results in their context with a snippet. This operation can be expensive, especially considering that we want to display snippets for every document returned in the results list.
  • The search for queries with high-frequency words was particularly expensive.
  • The CPU use was ok.
  • The ratio of JVM ram versus the Operative System ram was not well balanced, so the file system cache was unable to manage our big shards and documents.
  • The Garbage Collector (GC) pauses often and for too long (sometimes more than a minute) — this is probably because some of our documents contain very big values for the body field.
  • We discovered that the Solid State Disk (SSD) bandwidth was saturated during these expensive queries.

:: Modeling the problem

In this step we addressed the index settings and mappings, looking for ways to improve the performance without adding new nodes or hardware upgrades.

The ElasticSearch stack is big and complex so it is easy to get lost in an empiric quest for the best configuration. We therefore decided to follow a strict approach to the problem. To better analyze it we decided to experiment with a simpler deployment: building a single node cluster, with a single shard index, with the same shard size as one of our production shards.

A simple copy of one of our shards from production to the development context was not enough. We wanted to be sure that the distribution of the documents in the shard is normalized; the shard must contain documents with the same type distribution ratio as the production one, so we wrote a little script to extract the dump of the index in a set of small chunks containing random documents distribution (good enough for our purposes). Then we indexed some of these chunks to build a single shard of 180GB.

Running some benchmarks we focused only on 90% of the most popular queries, extracted form our production nginx log, compiling a list of ~10k queries well distributed for response time.

We use siege to run our benchmark tests with the flags: one single process, verbose output, don’t wait between searches, run for 5 minutes and load the search to siege from a file.

siege -c 1 -v -b -t 5 -f model.txt  

We ran this test several times with two different queries: one with the highlighting feature and one without.

Without Highlight 1.3 trans/sec
With Highlight 0.8 trans/sec

Analyzing the hot threads we discovered that our bottleneck was in the parsing of the lucene‘s inverted index.  

We have a lot of high-frequency words, some of them present in almost all of the 35M documents.

To simplify a bit: with our resources and our deployed cluster, we weren’t able to search and filter the inverted index fast enough because we have millions of matches to be filtered and checked against the query.

The combination of large documents (we index a whole book as a single document) and the need to support full phrase queries (and show the user the correct results) leads to a serious problem: when a phrase query includes a specific token present in a large portion of the documents, all of these candidate documents must be examined for scoring. The scoring process is very IO intensive because relative proximity data for the tokens in a document must be determined.

:: Exploring the solutions

–  Operational improvements:

  • Optimizing reads: The operating system, by default, caches file data that is read from disk. A typical read operation involves physical disk access to read the data from disk into the file system cache, and then to copy the data from the cache to the application buffer (2). Because our documents are big, having a bigger disk cache allows us to load documents more quickly. We added more RAM to the datanodes, for a total of 88GB, to have a better JVM/disk cache ratio.
    Number and type of node RAM
    10 datanodes 88GB
  • Optimizing writes: In a linux system, for performance reasons, written data goes into a cache before being sent to disk (called “dirty cache“). Write caches allow us to write to memory very quickly, but then we will have to pay the cost of writing out all the data to the discs, SSD in our case (3). To reduce the use of the SSD bandwidth we decided to reduce the size of this dirty cache, to make the writings smaller but more frequent. We set the dirty_bytes value from the default value 33.554.432 to 1.048.576 with:
    sudo sysctl -w vm.dirty_bytes=1048576
    By doing this, we got better performance from the disks, reducing the SSD saturation during writings drastically.

– The highlighting problem:
This was the easy one. Upon reading additional documentation and running some tests with different mappings we learned that making the highlighting faster requires storing the term vector  with the position offset payloads.

– The scoring problem and the Common Grams:
The first idea, that we discovered is the most popular in the industry, was: “let’s scale horizontally!“, that is “add new data nodes.” Adding new data nodes would reduce the size of a single shard and the activity for every data node (reducing the SSD bandwidth, distributing the work better, etc.).

This is usually the easy — and expensive — solution, but Internet Archive is a nonprofit and our budget for the project was already gone after our first setup.

We decided that the right direction to go was to find a smarter way to add more information to the inverted index. We want to make lucene’s life easier when it’s busy parsing all its data structures to retrieve the search results.

What if we were to encode proximity information at document index time? And then somehow encode that in the inverted index? This is what the Common Grams tokenizer was made for.

For the few thousand most common tokens in our index we push proximity data into the token space. This is done by generating tokens which represent two adjacent words instead of one when one of the words is a very common word. These two word tokens have a much lower document hit count and result in a dramatic reduction in the number of candidate documents which need to be scored when Lucene is looking for a phrase match.

Normally, for the indexing and the searching phase, a text or a query string is tokenized word by word. For example: "the quick and brown fox" produces the tokens: the - quick - and - brown - fox
These tokens are the keys stored in the inverted index. Using the common grams approach the high frequency words are not inserted in the inverted index alone, but also with their previous and following word. In this way the inverted index is richer, containing the relative positional information for the high-frequency words.
For our  example the tokenization become: the - the_quick - quick - quick_and - and_brown - brown - fox

Using luke to extract the most frequent words:
To implement this solution we needed to know the most high-frequency words in our corpus, so we used luke to explore our normalized shard to extract the list of the most frequent words in our corpus. We decided to pick the 5000 most frequent words:

Rank Frequency Word
1 817.695 and
2 810.060 of
3 773.635 in
4 753.855 a
5 735.514 to
49999 44.417 391
50000 44.416 remedy

We ran the siege benchmark in our test single node cluster and obtained great results:

Without Highlight 5 trans/sec
With Highlight using the Fast Term Vector 2.4 trans/sec

:: The new index

With this last step we had everything we needed to reindex the documents with a new custom analyzer using the icu tools we saw before, but this time we also added the common grams capabilities to the analyzer and the search_analyzer of our document fields. In particular we added a specialized filter in the common gram tokenization using our common words file. With these new features enabled the disk footprint is bigger so we decided to add more shards to the cluster, building the new index with 60 shards and 1 replica, for a total of 25 TB of data, ~210 GB for shard.
Check out the end of this document for an example of our new mapping configuration.

:: The final results

The performance of the new index looked very good immediately.
For example let’s consider the results of the query “to be or not to be“:

hits found max score execution time
Old Index 6.479.232 0.4427 20s 80ms
New Index 501.476 0.7738 2s 605ms

With the new index 95% of the queries are served in less than 10 seconds and the number of slower queries is marginal. There are less hits found because the new inverted index is more accurate. We have less noise and less poorly relevant results, with a higher max score and an execution time 10x faster for this particular query.

The team:

  • Giovanni Damiola – twitter
  • Samuel Stoller
  • Aaron Ximm

:: New index mapping

"doc_title": { 
    "type": "string", 
    "analyzer": "textIcu", 
    "analyzer": "textIcu"
"doc_author": { "type": "string", 
    "index": "analyzed", 
    "analyzer": "textIcu", 
"doc_body": { 
    "type": "string", 
    "index": "analyzed", 
    "analyzer": "textIcu", 
    "term_vector": "with_positions_offsets_payloads" 

"analyzer" : {
    "textIcu": {
        "type": "custom",
        "tokenizer": "icu_tokenizer",
        "char_filter": [ "icu_normalizer" ],
        "filter": [ "icu_folding", "common_grams" ]
    "textIcuSearch": {
        "type": "custom",
        "tokenizer": "icu_tokenizer",
        "char_filter": [ "icu_normalizer" ],
        "filter": [ "icu_folding", "common_grams_query" ]