Tagged: search Toggle Comment Threads | Keyboard Shortcuts

  • gio 8:58 pm on May 31, 2017 Permalink
    Tags: cluster, elasticsearch, full text search, fulltext, fulltextsearch, internet archive, internetarchive, lucene, search, tuning   

    Making the Internet Archive’s full text search faster 

    (You can also find this article on medium)
    This article describes how we made the full-text organic search faster at the Internet Archivewithout scaling horizontally — allowing our users to search in just a few seconds across our collection of 35 million documents containing books, magazine, newspapers, scientific papers, patents and much more.

    By organic search we mean the “method for entering one or a plurality of search items in a single data string into a search engine. Organic search results are listings on search engine results pages that appear because of their relevance to the search terms.” (1).

    The relevance should be scored on the search query matches for every document. In other words, if the search query has a perfect match in some of our documents, we want to return these documents; otherwise, we want to return the documents containing a part of the query or one of its subset.

    Let’s consider some of our initial conditions:

    • Our corpus is very heterogeneous. It contains a lot of documents with content, context and language.
    • The size of our documents varies considerably, from a few kilobytes to several megabytes of text. We have  documents consisting of only a few pages and books with thousands of pages.
    • This index should also be agnostic about the kind, content, context and language of a document. For simplicity and for an easier connection with our documents ingestion pipeline, we need to keep everything in a single index.
    • At the Internet Archive we have thousands of new documents to index every day so the search must continue to work properly in a continuous indexing mode.
    • The search has to work as fast as possible, serving each search request without slowing down the execution of the other ones.

    :: Cluster ElasticSearch

    Our first idea was deploying a simple ElasticSearch cluster, relying on Lucene’s full-text search server features.

    We built the cluster on KVM virtual machines on Intel hardware using Ubuntu 14.4  and ElasticSearch 2.3.4:

    Number and type of node CPU RAM Storage
    10 datanodes 14 CPU 45GB 6.4TB SSD consumer
    3 masters 2 CPU 2GB
    4 clients 22 CPU 34GB ram

    The Java Virtual Machine (JVM) runs with 30GB: -Xms30g -Xmx30g

    Our document structure is a bit more complicated, but to keep it simple let’s consider every document as consisting of only these fields: Title, Author, Body. All of these fields are type string and analyzed with a custom analyzer using the International Components for Unicode.

    tokenizer normalizer folding
    icu_tokenzer icu_normalizer icu_folding

    We created an index of 40 shards with 1 replica, for a total footprint of ~14TB of data, with shards of ~175GB.

    :: First results and first slow queries

    At the beginning this index was not too bad. With the first 10 million documents loaded, we got good response time and performance. Our goal is to index and make the search available for all 35 million documents. We soon learned that the performance got worse when adding more documents.

    We discovered that some queries were taking too many seconds to be executed, keeping our resources busy and slowing down all the cluster metrics, in particular the indexing latency and the search latency. Some queries were taking more than 30 seconds even when the cluster was not particularly busy.

    As an example, a search for “the black cat” took longer than the more simple query “black cat”  because of the “the” in the query. Another good example was: “to be or not to be” — particularly slow because it is comprised of only high-frequency words. An even slower query was “to be or not to be that is the question,” which took over one minute.

    :: The analysis

    To better understand the problem, we monitored the cluster’s nodes while the slow query was running.  We used profile and hot-threads from the API to profile the queries and to monitor the threads execution on the cluster. We also used iostat to keep monitored the IO activities on the datanodes.

    We discovered:

    • Lots of hot threads were about the highlighting phase. We need this feature because we don’t want to know only what documents contain the best matches, but we also want to show the results in their context with a snippet. This operation can be expensive, especially considering that we want to display snippets for every document returned in the results list.
    • The search for queries with high-frequency words was particularly expensive.
    • The CPU use was ok.
    • The ratio of JVM ram versus the Operative System ram was not well balanced, so the file system cache was unable to manage our big shards and documents.
    • The Garbage Collector (GC) pauses often and for too long (sometimes more than a minute) — this is probably because some of our documents contain very big values for the body field.
    • We discovered that the Solid State Disk (SSD) bandwidth was saturated during these expensive queries.

    :: Modeling the problem

    In this step we addressed the index settings and mappings, looking for ways to improve the performance without adding new nodes or hardware upgrades.

    The ElasticSearch stack is big and complex so it is easy to get lost in an empiric quest for the best configuration. We therefore decided to follow a strict approach to the problem. To better analyze it we decided to experiment with a simpler deployment: building a single node cluster, with a single shard index, with the same shard size as one of our production shards.

    A simple copy of one of our shards from production to the development context was not enough. We wanted to be sure that the distribution of the documents in the shard is normalized; the shard must contain documents with the same type distribution ratio as the production one, so we wrote a little script to extract the dump of the index in a set of small chunks containing random documents distribution (good enough for our purposes). Then we indexed some of these chunks to build a single shard of 180GB.

    Running some benchmarks we focused only on 90% of the most popular queries, extracted form our production nginx log, compiling a list of ~10k queries well distributed for response time.

    We use siege to run our benchmark tests with the flags: one single process, verbose output, don’t wait between searches, run for 5 minutes and load the search to siege from a file.

    siege -c 1 -v -b -t 5 -f model.txt  

    We ran this test several times with two different queries: one with the highlighting feature and one without.

    Without Highlight 1.3 trans/sec
    With Highlight 0.8 trans/sec

    Analyzing the hot threads we discovered that our bottleneck was in the parsing of the lucene‘s inverted index.  

    We have a lot of high-frequency words, some of them present in almost all of the 35M documents.

    To simplify a bit: with our resources and our deployed cluster, we weren’t able to search and filter the inverted index fast enough because we have millions of matches to be filtered and checked against the query.

    The combination of large documents (we index a whole book as a single document) and the need to support full phrase queries (and show the user the correct results) leads to a serious problem: when a phrase query includes a specific token present in a large portion of the documents, all of these candidate documents must be examined for scoring. The scoring process is very IO intensive because relative proximity data for the tokens in a document must be determined.

    :: Exploring the solutions

    –  Operational improvements:

    • Optimizing reads: The operating system, by default, caches file data that is read from disk. A typical read operation involves physical disk access to read the data from disk into the file system cache, and then to copy the data from the cache to the application buffer (2). Because our documents are big, having a bigger disk cache allows us to load documents more quickly. We added more RAM to the datanodes, for a total of 88GB, to have a better JVM/disk cache ratio.
      Number and type of node RAM
      10 datanodes 88GB
    • Optimizing writes: In a linux system, for performance reasons, written data goes into a cache before being sent to disk (called “dirty cache“). Write caches allow us to write to memory very quickly, but then we will have to pay the cost of writing out all the data to the discs, SSD in our case (3). To reduce the use of the SSD bandwidth we decided to reduce the size of this dirty cache, to make the writings smaller but more frequent. We set the dirty_bytes value from the default value 33.554.432 to 1.048.576 with:
      sudo sysctl -w vm.dirty_bytes=1048576
      By doing this, we got better performance from the disks, reducing the SSD saturation during writings drastically.

    – The highlighting problem:
    This was the easy one. Upon reading additional documentation and running some tests with different mappings we learned that making the highlighting faster requires storing the term vector  with the position offset payloads.

    – The scoring problem and the Common Grams:
    The first idea, that we discovered is the most popular in the industry, was: “let’s scale horizontally!“, that is “add new data nodes.” Adding new data nodes would reduce the size of a single shard and the activity for every data node (reducing the SSD bandwidth, distributing the work better, etc.).

    This is usually the easy — and expensive — solution, but Internet Archive is a nonprofit and our budget for the project was already gone after our first setup.

    We decided that the right direction to go was to find a smarter way to add more information to the inverted index. We want to make lucene’s life easier when it’s busy parsing all its data structures to retrieve the search results.

    What if we were to encode proximity information at document index time? And then somehow encode that in the inverted index? This is what the Common Grams tokenizer was made for.

    For the few thousand most common tokens in our index we push proximity data into the token space. This is done by generating tokens which represent two adjacent words instead of one when one of the words is a very common word. These two word tokens have a much lower document hit count and result in a dramatic reduction in the number of candidate documents which need to be scored when Lucene is looking for a phrase match.

    Normally, for the indexing and the searching phase, a text or a query string is tokenized word by word. For example: "the quick and brown fox" produces the tokens: the - quick - and - brown - fox
    These tokens are the keys stored in the inverted index. Using the common grams approach the high frequency words are not inserted in the inverted index alone, but also with their previous and following word. In this way the inverted index is richer, containing the relative positional information for the high-frequency words.
    For our  example the tokenization become: the - the_quick - quick - quick_and - and_brown - brown - fox

    Using luke to extract the most frequent words:
    To implement this solution we needed to know the most high-frequency words in our corpus, so we used luke to explore our normalized shard to extract the list of the most frequent words in our corpus. We decided to pick the 5000 most frequent words:

    Rank Frequency Word
    1 817.695 and
    2 810.060 of
    3 773.635 in
    4 753.855 a
    5 735.514 to
    49999 44.417 391
    50000 44.416 remedy

    We ran the siege benchmark in our test single node cluster and obtained great results:

    Without Highlight 5 trans/sec
    With Highlight using the Fast Term Vector 2.4 trans/sec

    :: The new index

    With this last step we had everything we needed to reindex the documents with a new custom analyzer using the icu tools we saw before, but this time we also added the common grams capabilities to the analyzer and the search_analyzer of our document fields. In particular we added a specialized filter in the common gram tokenization using our common words file. With these new features enabled the disk footprint is bigger so we decided to add more shards to the cluster, building the new index with 60 shards and 1 replica, for a total of 25 TB of data, ~210 GB for shard.
    Check out the end of this document for an example of our new mapping configuration.

    :: The final results

    The performance of the new index looked very good immediately.
    For example let’s consider the results of the query “to be or not to be“:

    hits found max score execution time
    Old Index 6.479.232 0.4427 20s 80ms
    New Index 501.476 0.7738 2s 605ms

    With the new index 95% of the queries are served in less than 10 seconds and the number of slower queries is marginal. There are less hits found because the new inverted index is more accurate. We have less noise and less poorly relevant results, with a higher max score and an execution time 10x faster for this particular query.

    The team:

    • Giovanni Damiola – twitter
    • Samuel Stoller
    • Aaron Ximm

    :: New index mapping

    "doc_title": { 
        "type": "string", 
        "analyzer": "textIcu", 
        "analyzer": "textIcu"
    "doc_author": { "type": "string", 
        "index": "analyzed", 
        "analyzer": "textIcu", 
    "doc_body": { 
        "type": "string", 
        "index": "analyzed", 
        "analyzer": "textIcu", 
        "term_vector": "with_positions_offsets_payloads" 
    "analyzer" : {
        "textIcu": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "char_filter": [ "icu_normalizer" ],
            "filter": [ "icu_folding", "common_grams" ]
        "textIcuSearch": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "char_filter": [ "icu_normalizer" ],
            "filter": [ "icu_folding", "common_grams_query" ] 


  • gio 10:08 pm on August 3, 2015 Permalink
    Tags: index, , reindexing, search,   

    OL: reindexing the Solr search 

    To reindex the OpenLibrary’s Solr search index making it consistent with the db data we can use the script:


    here you can find the source.

    Here same basic usage:

    /olsystem/bin/olenv python /opt/openlibrary/openlibrary/scripts/ol-solr-indexer.py --config /olsystem/etc/openlibrary.yml --bookmark ol-solr-indexer.bookmark --backward --days 2

    /olsystem/bin/olenv is the script to load the right virtualenv.
    --config /olsystem/etc/openlibrary.yml is the OpenLibrary yml configuration file.
    --bookmark ol-solr-indexer.bookmark is the location of the last scan timestamp (YYYY-MM-DD hh:mm:ss) bookmarked.
    --backward / --forward the direction to do the reindexing
    --days the number of days to reindex.

    the script can run in daemon mode. At the moment we still using the new-solr-updater for the partial updates…

  • gio 10:36 pm on July 17, 2015 Permalink
    Tags: , search,   

    SOLR: commits and optimize 

    • Commit: When you are indexing documents to solr none of the changes you are making will appear until you run the commit command. So timing when to run the commit command really depends on the speed at which you want the changes to appear on your site through the search engine. However it is a heavy operation and so should be done in batches not after every update.

      curl 'http://<SOLR_INSTANCE_URL>/update?commit=true'

    • Optimize: This is similar to a defrag command on a hard drive. It will reorganize the index into segments (increasing search speed) and remove any deleted (replaced) documents. Solr is a read only data store so every time you index a document it will mark the old document as deleted and then create a brand new document to replace the deleted one. Optimize will remove these deleted documents. You can see the search document vs. deleted document count by going to the Solr Statistics page and looking at the numDocs vs. maxDocs numbers. The difference between the two numbers is the amount of deleted (non-search able) documents in the index.

      curl 'http://<SOLR_INSTANCE_URL>/update?optimize=true'

      Also Optimize builds a whole NEW index from the old one and then switches to the new index when complete. Therefore the command requires double the space to perform the action. So you will need to make sure that the size of your index does not exceed %50 of your available hard drive space. (This is a rule of thumb, it usually needs less then %50 because of deleted documents)

    (source: http://stackoverflow.com/a/3737972)

  • gio 7:52 pm on July 14, 2015 Permalink
    Tags: , , search,   

    OL: Updating the Search Engine (notes by: Anand Chitipothu) 

    Open Library uses Apache Solr for providing search functionality to the website. The Solr instance maintains its an index of the all the items to be search in its data directory. This index is informally called as “search index”. All book, work and author records are stored in the same search engine.

    :: Updating the Search Engine

    Whenever a record is updated on Open Library, the corresponding entry in the search engine must be updated to get the uptodate results. Open Library has two different ways to update the search index.

    • The Manual Way

      The openlibrary/solr module provides a script called update_work.py for updating solr. Even though the script name is indicating only work, it can be used for updating even edition and author documents in solr.

      To update a one or more entries manually:
      WARNING: be sure to use the right openlibrary.yml file…

      $ python openlibrary/update_work.py --config openlibrary.yml /books/OL123M /works/OL234W /authors/OL45A

      By default, the script performs an commit to the Solr. Doing a commit ensures that the changes are flushed to the disk and available to search requests from now on. However, it is very expensive operation and takes more than 5 minutes (at the time of writing this).

      To update the documents without out committing them, add `–nocommit` flag.

      $ python openlibrary/update_work.py --config openlibrary.yml --nocommit /books/OL123M /works/OL234W /authors/OL45A
    • The Solr Updater

      There is a script scripts/new_solr_updater.py, which is run as a daemon process, listens to the edits happening to the database and updates the corresponding documents in Solr.

      Infobase, the system that handles the all the modifications to the system, maintains a log of all changes. It writes a JSON entry to a log file whenever something is modified in the database. It also provides an API to request these log entries.

      The solr updater script uses this to get new modifitions made after what it has last seen. It uses those entries to find which documents should be updated in the search engine.

      While this looks like a fair approach, the solr updater script can fail at a bad record or fail at an unexpected data. When this happen the solr updater dies and starts from the same point when it comes up and thus gets into an infinite loop.

      The current position of the log file consumed by the solr updated is maintained in a state file. The state file will be at /var/run/openlibrary/solr-update.offset, or any other path specified as argument to the solr updater script.

    :: The Updating Process

    To understand what is involved in updating a record in solr, lets restrict to work search and try to visualize a work record with bunch of editions.

    The work record should appear in search results, when any one the following terms are used in the search query.

    • title of the work
    • title of any of its editions (could be in other languages)
    • ISBN or any other ID of the editions
    • name of the the authors
    • (and some more)

    To get all this information, the solr document needs information from the following sources.

    • The work record from OL database
    • All the edition records belonged to that work
    • All the author records belonged to that work
    • Data about all the records which have been marked as redirect to any of the above records
    • IA metadata for each edition having a scan

    When updating multiple records at once, getting these individually might be too inefficient. So, some efforts have gone into it to make the process faster by making requests in batches whenever possible and directly take to the database to avoid middle layer overheads.

    The flow will be similar for author records as well.

  • gio 5:30 pm on July 9, 2015 Permalink
    Tags: , , search,   

    OL: Search with Solr 

    OpenLibrary is using Apache SOLR as search platform.

    Solr Server: http://solr:8983/solr/
     Solr Admin: http://solr:8984/solr/admin/
    • To READ/SEARCH an entry:

      curl http://solr:8983/solr/select?q=QUERY

      or using a browser:

    • To CREATE/UPDATE an entry:

      curl http://solr:8983/solr/update?commitWithin=10000 -H "Content-Type: text/xml" --data-binary '<add><doc><field name="edition_key">OL7649435M</field><field name="cover_i">405982</field><field name="isbn">9780671525323</field><field name="isbn">0671525328</field><field name="has_fulltext">False</field><field name="author_name">Jeff Noon</field><field name="seed">/books/OL7649435M</field><field name="seed">/works/OL8262577W</field><field name="seed">/authors/OL450487A</field><field name="author_key">OL450487A</field><field name="title">Vurt</field><field name="publish_date">March 1, 1995</field><field name="type">work</field><field name="ebook_count_i">0</field><field name="id_librarything">19214</field><field name="edition_count">1</field><field name="key">/works/OL8262577W</field><field name="id_goodreads">1420154</field><field name="publisher">Audioworks</field><field name="language">eng</field><field name="last_modified_i">1436385710</field><field name="cover_edition_key">OL7649435M</field><field name="publish_year">1995</field><field name="first_publish_year">1995</field><field name="author_facet">OL450487A Jeff Noon</field></doc></add>'
    • To DELETE an entry:

      curl -L 'http://solr:8983/solr/update?commitWithin=10000' -H "Content-Type: text/xml" --data-binary '<delete><query>key:/works/OL17071689W</query></delete>'

      or using a GET:

      curl -l 'http://solr:8983//solr/update?commitWithin=60000&stream.body=%3Cdelete%3E%3Cquery%3Ekey:/works/OL17058137W%3C/query%3E%3C/delete%3E'

      We are using commitWithin=10000 instead of commit=true because the solr server could be busy.

    The SOLR response should look like:

    <lst name="responseHeader"><int name="status">0</int><int name="QTime">70</int></lst>
Compose new post
Next post/Next comment
Previous post/Previous comment
Show/Hide comments
Go to top
Go to login
Show/Hide help
shift + esc