Updates from gio Toggle Comment Threads | Keyboard Shortcuts

  • gio 8:58 pm on May 31, 2017 Permalink
    Tags: cluster, elasticsearch, full text search, fulltext, fulltextsearch, internet archive, internetarchive, lucene, , tuning   

    Making the Internet Archive’s full text search faster 

    (You can also find this article on medium)
    This article describes how we made the full-text organic search faster at the Internet Archivewithout scaling horizontally — allowing our users to search in just a few seconds across our collection of 35 million documents containing books, magazine, newspapers, scientific papers, patents and much more.

    By organic search we mean the “method for entering one or a plurality of search items in a single data string into a search engine. Organic search results are listings on search engine results pages that appear because of their relevance to the search terms.” (1).

    The relevance should be scored on the search query matches for every document. In other words, if the search query has a perfect match in some of our documents, we want to return these documents; otherwise, we want to return the documents containing a part of the query or one of its subset.

    Let’s consider some of our initial conditions:

    • Our corpus is very heterogeneous. It contains a lot of documents with content, context and language.
    • The size of our documents varies considerably, from a few kilobytes to several megabytes of text. We have  documents consisting of only a few pages and books with thousands of pages.
    • This index should also be agnostic about the kind, content, context and language of a document. For simplicity and for an easier connection with our documents ingestion pipeline, we need to keep everything in a single index.
    • At the Internet Archive we have thousands of new documents to index every day so the search must continue to work properly in a continuous indexing mode.
    • The search has to work as fast as possible, serving each search request without slowing down the execution of the other ones.

    :: Cluster ElasticSearch

    Our first idea was deploying a simple ElasticSearch cluster, relying on Lucene’s full-text search server features.

    We built the cluster on KVM virtual machines on Intel hardware using Ubuntu 14.4  and ElasticSearch 2.3.4:

    Number and type of node CPU RAM Storage
    10 datanodes 14 CPU 45GB 6.4TB SSD consumer
    3 masters 2 CPU 2GB
    4 clients 22 CPU 34GB ram

    The Java Virtual Machine (JVM) runs with 30GB: -Xms30g -Xmx30g

    Our document structure is a bit more complicated, but to keep it simple let’s consider every document as consisting of only these fields: Title, Author, Body. All of these fields are type string and analyzed with a custom analyzer using the International Components for Unicode.

    tokenizer normalizer folding
    icu_tokenzer icu_normalizer icu_folding

    We created an index of 40 shards with 1 replica, for a total footprint of ~14TB of data, with shards of ~175GB.

    :: First results and first slow queries

    At the beginning this index was not too bad. With the first 10 million documents loaded, we got good response time and performance. Our goal is to index and make the search available for all 35 million documents. We soon learned that the performance got worse when adding more documents.

    We discovered that some queries were taking too many seconds to be executed, keeping our resources busy and slowing down all the cluster metrics, in particular the indexing latency and the search latency. Some queries were taking more than 30 seconds even when the cluster was not particularly busy.

    As an example, a search for “the black cat” took longer than the more simple query “black cat”  because of the “the” in the query. Another good example was: “to be or not to be” — particularly slow because it is comprised of only high-frequency words. An even slower query was “to be or not to be that is the question,” which took over one minute.

    :: The analysis

    To better understand the problem, we monitored the cluster’s nodes while the slow query was running.  We used profile and hot-threads from the API to profile the queries and to monitor the threads execution on the cluster. We also used iostat to keep monitored the IO activities on the datanodes.

    We discovered:

    • Lots of hot threads were about the highlighting phase. We need this feature because we don’t want to know only what documents contain the best matches, but we also want to show the results in their context with a snippet. This operation can be expensive, especially considering that we want to display snippets for every document returned in the results list.
    • The search for queries with high-frequency words was particularly expensive.
    • The CPU use was ok.
    • The ratio of JVM ram versus the Operative System ram was not well balanced, so the file system cache was unable to manage our big shards and documents.
    • The Garbage Collector (GC) pauses often and for too long (sometimes more than a minute) — this is probably because some of our documents contain very big values for the body field.
    • We discovered that the Solid State Disk (SSD) bandwidth was saturated during these expensive queries.

    :: Modeling the problem

    In this step we addressed the index settings and mappings, looking for ways to improve the performance without adding new nodes or hardware upgrades.

    The ElasticSearch stack is big and complex so it is easy to get lost in an empiric quest for the best configuration. We therefore decided to follow a strict approach to the problem. To better analyze it we decided to experiment with a simpler deployment: building a single node cluster, with a single shard index, with the same shard size as one of our production shards.

    A simple copy of one of our shards from production to the development context was not enough. We wanted to be sure that the distribution of the documents in the shard is normalized; the shard must contain documents with the same type distribution ratio as the production one, so we wrote a little script to extract the dump of the index in a set of small chunks containing random documents distribution (good enough for our purposes). Then we indexed some of these chunks to build a single shard of 180GB.

    Running some benchmarks we focused only on 90% of the most popular queries, extracted form our production nginx log, compiling a list of ~10k queries well distributed for response time.

    We use siege to run our benchmark tests with the flags: one single process, verbose output, don’t wait between searches, run for 5 minutes and load the search to siege from a file.

    siege -c 1 -v -b -t 5 -f model.txt  

    We ran this test several times with two different queries: one with the highlighting feature and one without.

    Without Highlight 1.3 trans/sec
    With Highlight 0.8 trans/sec

    Analyzing the hot threads we discovered that our bottleneck was in the parsing of the lucene‘s inverted index.  

    We have a lot of high-frequency words, some of them present in almost all of the 35M documents.

    To simplify a bit: with our resources and our deployed cluster, we weren’t able to search and filter the inverted index fast enough because we have millions of matches to be filtered and checked against the query.

    The combination of large documents (we index a whole book as a single document) and the need to support full phrase queries (and show the user the correct results) leads to a serious problem: when a phrase query includes a specific token present in a large portion of the documents, all of these candidate documents must be examined for scoring. The scoring process is very IO intensive because relative proximity data for the tokens in a document must be determined.

    :: Exploring the solutions

    –  Operational improvements:

    • Optimizing reads: The operating system, by default, caches file data that is read from disk. A typical read operation involves physical disk access to read the data from disk into the file system cache, and then to copy the data from the cache to the application buffer (2). Because our documents are big, having a bigger disk cache allows us to load documents more quickly. We added more RAM to the datanodes, for a total of 88GB, to have a better JVM/disk cache ratio.
      Number and type of node RAM
      10 datanodes 88GB
    • Optimizing writes: In a linux system, for performance reasons, written data goes into a cache before being sent to disk (called “dirty cache“). Write caches allow us to write to memory very quickly, but then we will have to pay the cost of writing out all the data to the discs, SSD in our case (3). To reduce the use of the SSD bandwidth we decided to reduce the size of this dirty cache, to make the writings smaller but more frequent. We set the dirty_bytes value from the default value 33.554.432 to 1.048.576 with:
      sudo sysctl -w vm.dirty_bytes=1048576
      By doing this, we got better performance from the disks, reducing the SSD saturation during writings drastically.

    – The highlighting problem:
    This was the easy one. Upon reading additional documentation and running some tests with different mappings we learned that making the highlighting faster requires storing the term vector  with the position offset payloads.

    – The scoring problem and the Common Grams:
    The first idea, that we discovered is the most popular in the industry, was: “let’s scale horizontally!“, that is “add new data nodes.” Adding new data nodes would reduce the size of a single shard and the activity for every data node (reducing the SSD bandwidth, distributing the work better, etc.).

    This is usually the easy — and expensive — solution, but Internet Archive is a nonprofit and our budget for the project was already gone after our first setup.

    We decided that the right direction to go was to find a smarter way to add more information to the inverted index. We want to make lucene’s life easier when it’s busy parsing all its data structures to retrieve the search results.

    What if we were to encode proximity information at document index time? And then somehow encode that in the inverted index? This is what the Common Grams tokenizer was made for.

    For the few thousand most common tokens in our index we push proximity data into the token space. This is done by generating tokens which represent two adjacent words instead of one when one of the words is a very common word. These two word tokens have a much lower document hit count and result in a dramatic reduction in the number of candidate documents which need to be scored when Lucene is looking for a phrase match.

    Normally, for the indexing and the searching phase, a text or a query string is tokenized word by word. For example: "the quick and brown fox" produces the tokens: the - quick - and - brown - fox
    These tokens are the keys stored in the inverted index. Using the common grams approach the high frequency words are not inserted in the inverted index alone, but also with their previous and following word. In this way the inverted index is richer, containing the relative positional information for the high-frequency words.
    For our  example the tokenization become: the - the_quick - quick - quick_and - and_brown - brown - fox

    Using luke to extract the most frequent words:
    To implement this solution we needed to know the most high-frequency words in our corpus, so we used luke to explore our normalized shard to extract the list of the most frequent words in our corpus. We decided to pick the 5000 most frequent words:

    Rank Frequency Word
    1 817.695 and
    2 810.060 of
    3 773.635 in
    4 753.855 a
    5 735.514 to
    49999 44.417 391
    50000 44.416 remedy

    We ran the siege benchmark in our test single node cluster and obtained great results:

    Without Highlight 5 trans/sec
    With Highlight using the Fast Term Vector 2.4 trans/sec

    :: The new index

    With this last step we had everything we needed to reindex the documents with a new custom analyzer using the icu tools we saw before, but this time we also added the common grams capabilities to the analyzer and the search_analyzer of our document fields. In particular we added a specialized filter in the common gram tokenization using our common words file. With these new features enabled the disk footprint is bigger so we decided to add more shards to the cluster, building the new index with 60 shards and 1 replica, for a total of 25 TB of data, ~210 GB for shard.
    Check out the end of this document for an example of our new mapping configuration.

    :: The final results

    The performance of the new index looked very good immediately.
    For example let’s consider the results of the query “to be or not to be“:

    hits found max score execution time
    Old Index 6.479.232 0.4427 20s 80ms
    New Index 501.476 0.7738 2s 605ms

    With the new index 95% of the queries are served in less than 10 seconds and the number of slower queries is marginal. There are less hits found because the new inverted index is more accurate. We have less noise and less poorly relevant results, with a higher max score and an execution time 10x faster for this particular query.

    The team:

    • Giovanni Damiola – twitter
    • Samuel Stoller
    • Aaron Ximm

    :: New index mapping

    "doc_title": { 
        "type": "string", 
        "analyzer": "textIcu", 
        "analyzer": "textIcu"
    "doc_author": { "type": "string", 
        "index": "analyzed", 
        "analyzer": "textIcu", 
    "doc_body": { 
        "type": "string", 
        "index": "analyzed", 
        "analyzer": "textIcu", 
        "term_vector": "with_positions_offsets_payloads" 
    "analyzer" : {
        "textIcu": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "char_filter": [ "icu_normalizer" ],
            "filter": [ "icu_folding", "common_grams" ]
        "textIcuSearch": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "char_filter": [ "icu_normalizer" ],
            "filter": [ "icu_folding", "common_grams_query" ] 


  • gio 12:53 am on February 15, 2017 Permalink
    Tags: docker, security   

    Docker: some tips about security 

    Risk Areas

    • Host security
    • Client <> Daemon communication
    • Daemon security
    • Communication with public/private registry
    • Images
    • Containers
    • Registry’s security

    Life cycle of an image

    _dockerfile_ ==BUILD==>> _image_ ==SPIN==> Container
                 maintaining images securely

    Dockerfile Security

    • Do not write secrets in your dockerfile. Use some secret management solution like Hashicorp Vault.
    • Create a USER to executes the processes in the image, otherwise the container will run as root.
    • Avoid to use the latest tag. Better use version pinnin to avoid cache issues.
    • Remove unnecessary setuid, setgid permissions.
    • Do not write any kind of update instructions. All the layers are cached and this not assure the execution of the update.
    • Download packages securely and do not download unnecessary packages. Reduce your attack surface.
    • Use COPY instead of ADD to reduce the attack surface. add-or-copy
    • Use HEALTHCHECK command. Reducing Deploy Risk With Docker’s New Health Check Instruction Test-drive Docker Healthcheck in 10 minutes
    • Use gosu instead of sudo wherever is possible gosu.
    • Try to restrict a image/container to one single service.

    Maintaining and Consuming Images

    • Docker Content Trust

      • provides authenticity, integrity and freshness guarantees.
      • takes some time to understand and prepare
    • Vulenrability free images:

      • tool selection: binary level analysis + hash based
      • Twistlock, Scalock, Nautilus to use only signed images, scan images, automatic container profiling.
      • aquasec
    • Except compatibility issues, all images and packages must be up-to-date.

    Advanced Security (production/enterprise zone)

    • Do not use Docker hub Images. High possibility of malicious images.
    • Maintain your own in-house registries.
    • Perform image optimization techniques.
    • Use commercial tools:
      • Image Lockdown
      • RBAC
    • Use file monitoring solutions to monitor any malicious changes in image layers.
    • Have separate patch, vulnerability management procedures for containers.
    • Customize CIS Docker benchmarks.

    Container Runtime

    Some Docker defaults to consider

    • Containers can consume entire memory causeing DOS.
    • Containers can communicate with each other leading to sniffing etc.
    • Containers are on the same bridge leading ARP spoofing, MITM, etc.
    • Containers have no fork limit causing fork bomb.
    • Containers run as root.
    • Docker daemon access users have effective root privileges


    • Namespaces: beware of non-namespaced kernel keyrings: SYS_TIME, etc. Do not share namespaces unless really needed.
    • Seccomp
    • LSM’s SELinux and Apparmor
    • Capabilities: do not use privileged containers and try to set flag for not acquiring any additional privileges.


    • Manideep 5:24 pm on July 14, 2017 Permalink

      Thanks for the credits 🙂

  • gio 10:47 pm on April 22, 2016 Permalink
    Tags: coverage, python, tests   

    Python Coverage 

    :: Install

        pip install coverage

    Command line usage:

    • run – Run a Python program and collect execution data.
    • report – Report coverage results.
    • html – Produce annotated HTML listings with coverage results.
    • xml – Produce an XML report with coverage results.
    • annotate – Annotate source files with coverage results.
    • erase – Erase previously collected coverage data.
    • combine – Combine together a number of data files.
    • debug – Get diagnostic information.
        $ coverage help
        $ coverage help run

    :: run

        $ coverage run my_program.py

    :: report

        $ coverage report -m
        Name                      Stmts   Miss  Cover   Missing
        my_program.py                20      4    80%   33-35, 39
        my_other_module.py           56      6    89%   17-23
        TOTAL                        76     10    87%

    : a nicer report

        $ coverage html

    ## Branch coverage
    In addition to the usual statement coverage, coverage.py also supports branch coverage measurement. Where a line in your program could jump to more than one next line, coverage.py tracks which of those destinations are actually visited, and flags lines that haven’t visited all of their possible destinations.

        $ coverage run --branch myprog.py

    (source: https://coverage.readthedocs.org)

  • gio 11:01 pm on March 10, 2016 Permalink  

    OL: anti-spam tools 

    We are addressing the spam problem with some volunteers help. In particular Charles is working hard on it and having great results. I need to contact him to have his updates.

    Here some notes from Anand, December 2014:
    I spent most of today fighting spam. Blocked hundreds of accounts and built some simple tools to help block spam.


    • Added “block and revert all edits” button to admin profile page
    • Added /admin/spamwords page to mark some words as spam https://openlibrary.org/admin/spamwords
    • Also a way to blacklist a domain from /admin/spamwords

    If the edit to a page contains any of the spam words or email of the user is from the blacklisted domains, the edit won’t be accepted. New registrations with emails from those domains are also not accepted.

    Measures taken:

    • I’ve added the common words found in the recent spam to the spam words
    • blacklisted mail.com as almost all of the spam was coming from that domain. This may stop some genuine people from registering and making edits.
    • blocked and reverted edits lot of accounts

    Hope this helps in reducing the spam to open library.

  • gio 7:17 pm on March 7, 2016 Permalink  

    OL: how to edit external ID Numbers editions 

    All the data like label, name, webside are accesible at:


    to edit it you need to edit the YML file:

  • gio 12:08 am on February 25, 2016 Permalink  

    GitLab 8.5 and the new CI 

    With GitLab 8.5 we are introducing some great new features, between them also a brand new CI service.

    To have a working CI follow this quick start.

    If you don’t know how to setup your runner or you need one: ask me. We have a VM with docker available to register new runners.

  • gio 11:35 pm on February 24, 2016 Permalink  

    Docker cookbook 

    :: list images

    sudo docker images

    :: list the containers

    sudo docker ps
    sudo docker ps -a

    :: start/stop a container

    sudo docker start CONTAINER

    (More …)

  • gio 11:53 pm on February 23, 2016 Permalink  

    GitLab: installing version 8.5.0 

    • One:
      sudo apt-get install curl openssh-server ca-certificates postfix
    • Two:
      ~$ curl -s https://packages.gitlab.com/install/repositories/gitlab/gitlab-ce/script.deb.sh | sudo bash
      ~$ sudo apt-get install gitlab-ce=8.5.0-ce.1
    • Three:
      sudo gitlab-ctl reconfigure
  • gio 8:51 pm on February 23, 2016 Permalink  

    GitLab: bakup and restore 

    WARNING: You can only restore a backup to exactly the same version of GitLab that you created it on.

      :: Create a backup of the GitLab system:

    • sudo gitlab-rake gitlab:backup:create

      the backup file will be created at /var/opt/gitlab/backups/
      or where specified in the gitlab config file.

      You can skip the backup for come components using:

      sudo gitlab-rake gitlab:backup:create SKIP=db,uploads
    • Please be informed that a backup does not store your configuration files. One reason for this is that your database contains encrypted information for two-factor authentication. Storing encrypted information along with its key in the same place defeats the purpose of using encryption in the first place!
      At the very minimum you should backup /etc/gitlab/gitlab-secrets.json (Omnibus) or /home/git/gitlab/.secret (source) to preserve your database encryption key.
      :: Restore a previously created backup:

    • We will assume that you have installed GitLab from an omnibus package and run
      sudo gitlab-ctl reconfigure

      at least once.

      Note: that you need to run gitlab-ctl reconfigure after changing gitlab-secrets.json.

      First make sure your backup tar file is in /var/opt/gitlab/backups (or wherever gitlab_rails[‘backup_path’] points to).

      sudo cp 1393513186_gitlab_backup.tar /var/opt/gitlab/backups/
    • Next, restore the backup by running the restore command. You need to specify the timestamp of the backup you are restoring.
      # Stop processes that are connected to the database
      sudo gitlab-ctl stop unicorn
      sudo gitlab-ctl stop sidekiq
      # This command will overwrite the contents of your GitLab database!
      sudo gitlab-rake gitlab:backup:restore BACKUP=1393513186
      # Start GitLab
      sudo gitlab-ctl start
      # Check GitLab
      sudo gitlab-rake gitlab:check SANITIZE=true

    source: https://gitlab.com/gitlab-org/gitlab-ce/blob/master/doc/raketasks/backup_restore.md

  • gio 9:59 pm on August 17, 2015 Permalink
    Tags: , , tips   

    Git and coding: tip and tricks 

    • Socks Proxy: To access our GitLab website from IPs outside the Internet Archive’s network you need a socks proxy. To do so you can:
      • You can open a sock proxy with:
        ssh -N -D <port> username@archive.org

        and configure manually your browser or your network to use the socks proxy.

      • if you are using OSX you can use this script to make the socks proxy setup easier.
    • Sublime Text: if you are using sublime text to develop your code, you will appreciate this how-to use sublime text over ssh.
    • Memory problem: if running git push you have this error: fatal: Out of memory, calloc failed, you can configure git to use only one thread for “packing”:
      git config --global pack.threads 1

      another “solution” is to remove the limit:

      ulimit -v unlimited
    • Git prompt:
      • If you’re a Bash user, you can tap into some of your shell’s features to make your experience with Git a lot friendlier. Git actually ships with plugins for several shells, but it’s not turned on by default. Take a look to: Git prompt [2].
      • If you’re using zsh, and also make use of oh-my-zsh, many themes include git in the prompt. It is recommended that you set git config –local oh-my-zsh.hide-dirty 1 within the petabox repo to prevent a slow prompt.
    • How to rename the author info for all the commits in a repo:
      • for a single commit:
        git commit --amend --author "New Author Name <email@address.com>"
      • for all the commits in a repo:
        git filter-branch --commit-filter 'if [ "$GIT_AUTHOR_NAME" = "Josh Lee" ];
          then export GIT_AUTHOR_NAME="Hobo Bob"; export GIT_AUTHOR_EMAIL=hobo@example.com;
        fi; git commit-tree "$@"'
    • kelsey 2:58 am on August 18, 2015 Permalink

      Here is some info I found the hard way on Mac & case-sensitive git repos http://kelsey.blog.archive.org/2015/08/18/mac-case-sensitive-git-repos/

    • pooh 5:34 am on August 18, 2015 Permalink

      i have a script in ~tracey/scripts/post-commit that one can copy to petabox/.git/hooks/ to make the “update the $Id….$ thing in the file” (mostly useful for deriver hackers now, etc.)

      traceypooh [3:25 PM]
      (it re-checks out file after commit, so your version now has the $Id….$ string updated in the file you just committed)

    • pooh 5:34 am on August 18, 2015 Permalink

      [petabox tree file last mod times]
      ~tracey/scripts/post-checkout is nice to add in petabox/.git/hooks/ when one first clones a tree (technically, clone the tree, then add this hook, then checkout a file — it will take a *long* time to run, then remove the hook). this will make all files have modtime of their last commit (which can be v. helpful to determine quick age/modtimes of files at a “ls” glance) for those who like that kind of thing

    • kelsey 10:31 pm on September 25, 2015 Permalink

      That pesky error that we ran into before

      % git clone git@git.archive.org:ia/petabox.git -v
      Cloning into 'petabox'...
      fatal: protocol error: bad line length character: No s

      Most likely means that the user is trying to clone a repo that they don’t have access to. Add the user to the project members, and you’re good to go

Compose new post
Next post/Next comment
Previous post/Previous comment
Show/Hide comments
Go to top
Go to login
Show/Hide help
shift + esc