Recent Updates Page 2 Toggle Comment Threads | Keyboard Shortcuts

  • gio 10:08 pm on August 3, 2015 Permalink | Reply
    Tags: index, , reindexing, ,   

    OL: reindexing the Solr search 

    To reindex the OpenLibrary’s Solr search index making it consistent with the db data we can use the script:

    /openlibrary/scripts/ol-solr-indexer.py

    here you can find the source.

    Here same basic usage:

    /olsystem/bin/olenv python /opt/openlibrary/openlibrary/scripts/ol-solr-indexer.py --config /olsystem/etc/openlibrary.yml --bookmark ol-solr-indexer.bookmark --backward --days 2

    where:
    /olsystem/bin/olenv is the script to load the right virtualenv.
    --config /olsystem/etc/openlibrary.yml is the OpenLibrary yml configuration file.
    --bookmark ol-solr-indexer.bookmark is the location of the last scan timestamp (YYYY-MM-DD hh:mm:ss) bookmarked.
    --backward / --forward the direction to do the reindexing
    --days the number of days to reindex.

    the script can run in daemon mode. At the moment we still using the new-solr-updater for the partial updates…

     
  • gio 10:36 pm on July 17, 2015 Permalink | Reply
    Tags: , ,   

    SOLR: commits and optimize 

    • Commit: When you are indexing documents to solr none of the changes you are making will appear until you run the commit command. So timing when to run the commit command really depends on the speed at which you want the changes to appear on your site through the search engine. However it is a heavy operation and so should be done in batches not after every update.

      curl 'http://<SOLR_INSTANCE_URL>/update?commit=true'

    • Optimize: This is similar to a defrag command on a hard drive. It will reorganize the index into segments (increasing search speed) and remove any deleted (replaced) documents. Solr is a read only data store so every time you index a document it will mark the old document as deleted and then create a brand new document to replace the deleted one. Optimize will remove these deleted documents. You can see the search document vs. deleted document count by going to the Solr Statistics page and looking at the numDocs vs. maxDocs numbers. The difference between the two numbers is the amount of deleted (non-search able) documents in the index.

      curl 'http://<SOLR_INSTANCE_URL>/update?optimize=true'

      Also Optimize builds a whole NEW index from the old one and then switches to the new index when complete. Therefore the command requires double the space to perform the action. So you will need to make sure that the size of your index does not exceed %50 of your available hard drive space. (This is a rule of thumb, it usually needs less then %50 because of deleted documents)

    (source: http://stackoverflow.com/a/3737972)

     
  • gio 7:08 pm on July 15, 2015 Permalink | Reply
    Tags: database, db, , , postgres, queries   

    OL: some useful queries 

    Open Library uses Postgres as database.
    All the OL’s entities are stored as things in the thing table.
    Every raw contains:

     id | key | type | latest_revision | created | last_modified 
    ----+-----+------+-----------------+---------+---------------

    Some useful types are: /type/author /type/work /type/edition /type/user

    openlibrary=# SELECT * FROM thing WHERE key='/type/author' OR key='/type/edition' OR key='/type/work' OR key='/type/user';
     
        id    |      key      | type | latest_revision |          created           |       last_modified        
    ----------+---------------+------+-----------------+----------------------------+----------------------------
     17872418 | /type/work    |    1 |              14 | 2008-08-18 22:51:38.685066 | 2010-08-09 23:37:25.678493
           22 | /type/user    |    1 |               5 | 2008-03-19 16:44:20.354477 | 2009-03-16 06:21:53.030443
           52 | /type/edition |    1 |              33 | 2008-03-19 16:44:24.216334 | 2009-09-22 10:44:06.178888
           58 | /type/author  |    1 |              11 | 2008-03-19 16:44:24.216334 | 2009-06-29 12:35:31.346997
    • Count the authors:
      openlibrary=# SELECT count(*) as count FROM thing WHERE type='58';
    • Count the works:
      openlibrary=# SELECT count(*) as count FROM thing WHERE type='17872418';
    • Count the editions:
      openlibrary=# SELECT count(*) as count FROM thing WHERE type='52';
    • Count the users:
      openlibrary=# SELECT count(*) as count FROM thing WHERE type='22';
     
  • gio 7:52 pm on July 14, 2015 Permalink | Reply
    Tags: , , ,   

    OL: Updating the Search Engine (notes by: Anand Chitipothu) 

    Open Library uses Apache Solr for providing search functionality to the website. The Solr instance maintains its an index of the all the items to be search in its data directory. This index is informally called as “search index”. All book, work and author records are stored in the same search engine.

    :: Updating the Search Engine

    Whenever a record is updated on Open Library, the corresponding entry in the search engine must be updated to get the uptodate results. Open Library has two different ways to update the search index.

    • The Manual Way

      The openlibrary/solr module provides a script called update_work.py for updating solr. Even though the script name is indicating only work, it can be used for updating even edition and author documents in solr.

      To update a one or more entries manually:
      WARNING: be sure to use the right openlibrary.yml file…

      $ python openlibrary/update_work.py --config openlibrary.yml /books/OL123M /works/OL234W /authors/OL45A

      By default, the script performs an commit to the Solr. Doing a commit ensures that the changes are flushed to the disk and available to search requests from now on. However, it is very expensive operation and takes more than 5 minutes (at the time of writing this).

      To update the documents without out committing them, add `–nocommit` flag.

      $ python openlibrary/update_work.py --config openlibrary.yml --nocommit /books/OL123M /works/OL234W /authors/OL45A
    • The Solr Updater

      There is a script scripts/new_solr_updater.py, which is run as a daemon process, listens to the edits happening to the database and updates the corresponding documents in Solr.

      Infobase, the system that handles the all the modifications to the system, maintains a log of all changes. It writes a JSON entry to a log file whenever something is modified in the database. It also provides an API to request these log entries.

      The solr updater script uses this to get new modifitions made after what it has last seen. It uses those entries to find which documents should be updated in the search engine.

      While this looks like a fair approach, the solr updater script can fail at a bad record or fail at an unexpected data. When this happen the solr updater dies and starts from the same point when it comes up and thus gets into an infinite loop.

      The current position of the log file consumed by the solr updated is maintained in a state file. The state file will be at /var/run/openlibrary/solr-update.offset, or any other path specified as argument to the solr updater script.

    :: The Updating Process

    To understand what is involved in updating a record in solr, lets restrict to work search and try to visualize a work record with bunch of editions.

    The work record should appear in search results, when any one the following terms are used in the search query.

    • title of the work
    • title of any of its editions (could be in other languages)
    • ISBN or any other ID of the editions
    • name of the the authors
    • (and some more)

    To get all this information, the solr document needs information from the following sources.

    • The work record from OL database
    • All the edition records belonged to that work
    • All the author records belonged to that work
    • Data about all the records which have been marked as redirect to any of the above records
    • IA metadata for each edition having a scan

    When updating multiple records at once, getting these individually might be too inefficient. So, some efforts have gone into it to make the process faster by making requests in batches whenever possible and directly take to the database to avoid middle layer overheads.

    The flow will be similar for author records as well.

     
  • gio 5:30 pm on July 9, 2015 Permalink | Reply
    Tags: , , ,   

    OL: Search with Solr 

    OpenLibrary is using Apache SOLR as search platform.

    Solr Server: http://solr:8983/solr/
     Solr Admin: http://solr:8984/solr/admin/
    
    • To READ/SEARCH an entry:

      curl http://solr:8983/solr/select?q=QUERY

      or using a browser:

      http://solr:8983/solr/select?q=QUERY&cache=false
    • To CREATE/UPDATE an entry:

      curl http://solr:8983/solr/update?commitWithin=10000 -H "Content-Type: text/xml" --data-binary '<add><doc><field name="edition_key">OL7649435M</field><field name="cover_i">405982</field><field name="isbn">9780671525323</field><field name="isbn">0671525328</field><field name="has_fulltext">False</field><field name="author_name">Jeff Noon</field><field name="seed">/books/OL7649435M</field><field name="seed">/works/OL8262577W</field><field name="seed">/authors/OL450487A</field><field name="author_key">OL450487A</field><field name="title">Vurt</field><field name="publish_date">March 1, 1995</field><field name="type">work</field><field name="ebook_count_i">0</field><field name="id_librarything">19214</field><field name="edition_count">1</field><field name="key">/works/OL8262577W</field><field name="id_goodreads">1420154</field><field name="publisher">Audioworks</field><field name="language">eng</field><field name="last_modified_i">1436385710</field><field name="cover_edition_key">OL7649435M</field><field name="publish_year">1995</field><field name="first_publish_year">1995</field><field name="author_facet">OL450487A Jeff Noon</field></doc></add>'
    • To DELETE an entry:

      curl -L 'http://solr:8983/solr/update?commitWithin=10000' -H "Content-Type: text/xml" --data-binary '<delete><query>key:/works/OL17071689W</query></delete>'

      or using a GET:

      curl -l 'http://solr:8983//solr/update?commitWithin=60000&stream.body=%3Cdelete%3E%3Cquery%3Ekey:/works/OL17058137W%3C/query%3E%3C/delete%3E'

      We are using commitWithin=10000 instead of commit=true because the solr server could be busy.

    The SOLR response should look like:

    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">70</int></lst>
    </response>
     
  • gio 5:37 pm on July 7, 2015 Permalink | Reply
    Tags: balance, haproxy, nginx, , , proxy   

    OL: using HAProxy as a better proxy balancer 

    To distribute the requests between the webnodes we decided to use an HAProxy between the NGINX webserver and the apps on the webnodes. HAProxy allow us to balance the requests between the webnodes with more granularity then NGINX.

    To check the HAProxy statistics report at the page http://openlibrary.org/admin?stats

    To do so we installed HAProxy server on ol-www1 on port 7072, with /etc/haproxy/haproxy.cfg:

    global
            log 127.0.0.1   daemon info
            maxconn 4096
            user haproxy
            group haproxy
            daemon
     
    defaults
            log     global
            mode    http
            option  httplog
            option  dontlognull
            retries 3
            option redispatch
            maxconn 2000
            contimeout      9000
            clitimeout      7200000
            srvtimeout      7200000
        stats uri     /admin?stats
        stats refresh 5s
        # these are added so the client ip comes through
        option httpclose
        option forwardfor
        option forceclose
     
     
    listen  ol-web-app 0.0.0.0:7072
            mode    http
            balance roundrobin
        option httpchk GET /
        timeout check 3000
        #option httpchk GET /solr/select?rows=0&q=*:*
        server  web1 ol-web1:7071 maxconn 23 check inter 5000 rise 2 fall 2
        server  web2 ol-web2:7071 maxconn 15 check inter 5000 rise 2 fall 2

    With this config file the requests will be distribuited between ol-web1 and ol-web2 in a numer of 23 and 15 requests. There is an asymmetry between the nodes because the different number of gnunicorn instances running on them.

    Then we changed the NGINX conf at /etc/nginx/sites-enabled/openlibrary.conf setting the upstream to the HAProxy server:

    upstream webnodes {
        server 127.0.0.1:7072;
    }
     
    ...
     
  • gio 10:44 pm on May 28, 2015 Permalink | Reply  

    TTScribe: basic infos 

     
  • gio 12:17 am on May 7, 2015 Permalink | Reply  

    GitLab: Continuous Integration at ci.archive.org 

    GitLab is shipped with a Continuous Integration Server, our instance is reachable at https://ci.archive.org.
    Adding a gitlab project it is possible to run jobs on specialized nodes called runners.
    At the moment we have only one runner, configured to execute the jobs in a docker container, with Ubuntu.14.04, python, php5. The docker image is build using this Dockerfile.

    for sys-admins:

     
  • gio 8:50 pm on May 5, 2015 Permalink | Reply  

    Git: how to split a repo 

    You can use this command to split out a dir from a repo

    git filter-branch --subdirectory-filter folder_to_split_out/ -- master

    If you want tor remove a file or a folder from a repo:

    git filter-branch --index-filter "git rm -r --cached --ignore-unmatch file_to_remove" --prune-empty
     
  • gio 11:30 pm on April 21, 2015 Permalink | Reply  

    CReP: IA dev workflow using git-svn 

    • WARNING:: if you are working on vm-home to avoid out-of-memory problems do:

      ulimit -v unlimited
    • Pull the last commits to the local master branch with:

      git checkout master
      git svn rebase
    • create a working branch

      git checkout -b devBranch
    • use this branch to develop your feature and fix the bugs
      and commit your work to the local git repository with

      git commit -m 'bla bla bla bla'
    • create a test branch like the master

      git checkout master
      git svn rebase
      git checkout -b sandMaster

      and merge all the commits from the devBranch branch in an atomic commit (using –squash)

      git merge --squash devBranch

      check what is going on

      git status
      git diff --staged

      if everything looks ok, commit it

      git commit
    • Now you can check if this code is working correctly
      to be sure that the commit to the SVN will work correctly…
      If everything is fine…
    • …let’s do the same commit to the master branch

      git checkout master
      git merge --squash devBranch
      git status
      git diff --staged
      git commit
      git svn rebase
    • Let’s do a dry-run and check the patches

      git svn dcommit --dry-run
      git diff-tree 25e084b618580d69f5b57d8f9c1ca37045cf65dd~1 25e084b618580d69f5b57d8f9c1ca37045cf65dd -p
    • If everything is correct it’s time to commit to the SVN remote repo
    • git svn dcommit
    • You can push the code using the pi interface
     
c
Compose new post
j
Next post/Next comment
k
Previous post/Previous comment
r
Reply
e
Edit
o
Show/Hide comments
t
Go to top
l
Go to login
h
Show/Hide help
shift + esc
Cancel