Tagged: openlibrary

  • gio 7:08 pm on July 15, 2015
    Tags: database, db, openlibrary, postgres, queries

    OL: some useful queries 

    Open Library uses Postgres as its database.
    All of OL’s entities are stored as things in the thing table.
    Every row contains:

     id | key | type | latest_revision | created | last_modified 

    Some useful types are: /type/author /type/work /type/edition /type/user

    openlibrary=# SELECT * FROM thing WHERE key='/type/author' OR key='/type/edition' OR key='/type/work' OR key='/type/user';
        id    |      key      | type | latest_revision |          created           |       last_modified        
     17872418 | /type/work    |    1 |              14 | 2008-08-18 22:51:38.685066 | 2010-08-09 23:37:25.678493
           22 | /type/user    |    1 |               5 | 2008-03-19 16:44:20.354477 | 2009-03-16 06:21:53.030443
           52 | /type/edition |    1 |              33 | 2008-03-19 16:44:24.216334 | 2009-09-22 10:44:06.178888
           58 | /type/author  |    1 |              11 | 2008-03-19 16:44:24.216334 | 2009-06-29 12:35:31.346997
    • Count the authors:
      openlibrary=# SELECT count(*) as count FROM thing WHERE type='58';
    • Count the works:
      openlibrary=# SELECT count(*) as count FROM thing WHERE type='17872418';
    • Count the editions:
      openlibrary=# SELECT count(*) as count FROM thing WHERE type='52';
    • Count the users:
      openlibrary=# SELECT count(*) as count FROM thing WHERE type='22';
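
    The type ids used above are specific to this database, so hardcoding them is fragile. Below is a minimal sketch of counting by type key instead, using a subquery on thing.key. It runs against an in-memory SQLite table as a stand-in for Postgres (an assumption made only so the example is self-contained); the sample rows and the count_by_type helper are hypothetical.

```python
import sqlite3

# Stand-in for the OL "thing" table, reduced to the columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thing (id INTEGER PRIMARY KEY, key TEXT, type INTEGER)")
conn.executemany(
    "INSERT INTO thing (id, key, type) VALUES (?, ?, ?)",
    [
        (1, "/type/type", 1),          # hypothetical sample rows
        (58, "/type/author", 1),
        (52, "/type/edition", 1),
        (1001, "/authors/OL1A", 58),
        (1002, "/authors/OL2A", 58),
        (1003, "/books/OL1M", 52),
    ],
)

def count_by_type(conn, type_key):
    """Count the things whose type is the thing identified by type_key."""
    sql = "SELECT count(*) FROM thing WHERE type = (SELECT id FROM thing WHERE key = ?)"
    return conn.execute(sql, (type_key,)).fetchone()[0]

print(count_by_type(conn, "/type/author"))
print(count_by_type(conn, "/type/edition"))
```

    Against Postgres the same query should work with a driver such as psycopg2, which uses %s instead of ? as the placeholder.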
  • gio 7:52 pm on July 14, 2015
    Tags: openlibrary

    OL: Updating the Search Engine (notes by: Anand Chitipothu) 

    Open Library uses Apache Solr to provide search functionality on the website. The Solr instance maintains an index of all the items to be searched in its data directory. This index is informally called the “search index”. All book, work and author records are stored in the same search engine.

    :: Updating the Search Engine

    Whenever a record is updated on Open Library, the corresponding entry in the search engine must be updated so that searches return up-to-date results. Open Library has two different ways to update the search index.

    • The Manual Way

      The openlibrary/solr module provides a script called update_work.py for updating Solr. Although the script name mentions only works, it can also update edition and author documents in Solr.

      To update one or more entries manually:
      WARNING: be sure to use the right openlibrary.yml file…

      $ python openlibrary/update_work.py --config openlibrary.yml /books/OL123M /works/OL234W /authors/OL45A

      By default, the script performs a commit to Solr. A commit ensures that the changes are flushed to disk and become available to search requests from then on. However, it is a very expensive operation and takes more than 5 minutes (at the time of writing).

      To update the documents without committing them, add the `--nocommit` flag.

      $ python openlibrary/update_work.py --config openlibrary.yml --nocommit /books/OL123M /works/OL234W /authors/OL45A
    • The Solr Updater

      The script scripts/new_solr_updater.py runs as a daemon process, listens to the edits happening in the database and updates the corresponding documents in Solr.

      Infobase, which handles all the modifications to the system, maintains a log of all changes. It writes a JSON entry to a log file whenever something is modified in the database. It also provides an API to request these log entries.

      The solr updater script uses this API to get the modifications made since the last entry it has seen. It uses those entries to find which documents should be updated in the search engine.

      While this looks like a fair approach, the solr updater script can fail on a bad record or on unexpected data. When this happens, the solr updater dies, restarts from the same point when it comes back up, and thus gets into an infinite loop.

      The current position in the log file consumed by the solr updater is maintained in a state file. The state file is at /var/run/openlibrary/solr-update.offset, or at any other path passed as an argument to the solr updater script.
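
      The state-file mechanism, and why a bad record leads to an endless restart loop, can be sketched as follows. This is a simplified illustration, not the code of new_solr_updater.py: the consume function, the integer offsets and the list-based log are all hypothetical, and recording a bad record and moving on is just one possible way to avoid the loop.

```python
import os
import tempfile

def read_offset(state_file):
    """Return the last saved position, or 0 if there is no state file yet."""
    try:
        with open(state_file) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0

def write_offset(state_file, offset):
    with open(state_file, "w") as f:
        f.write(str(offset))

def consume(log_entries, state_file, update_doc):
    """Process the entries after the saved offset, saving progress after each one."""
    failed = []
    for position in range(read_offset(state_file), len(log_entries)):
        try:
            update_doc(log_entries[position])
        except Exception as exc:
            # Without this handler the process would die here and, on restart,
            # resume at the same offset and hit the same record again.
            failed.append((position, repr(exc)))
        write_offset(state_file, position + 1)
    return failed

# Hypothetical log with one bad record in the middle.
state = os.path.join(tempfile.mkdtemp(), "solr-update.offset")
entries = ["/works/OL1W", None, "/authors/OL2A"]
updated = []

def update_doc(key):
    if key is None:
        raise ValueError("bad record")
    updated.append(key)

failed = consume(entries, state, update_doc)
print(updated, failed, read_offset(state))
```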

    :: The Updating Process

    To understand what is involved in updating a record in solr, let’s restrict ourselves to work search and try to visualize a work record with a bunch of editions.

    The work record should appear in search results when any one of the following terms is used in the search query.

    • title of the work
    • title of any of its editions (could be in other languages)
    • ISBN or any other ID of the editions
    • name of the authors
    • (and some more)

    To get all this information, the solr document needs information from the following sources.

    • The work record from OL database
    • All the edition records belonging to that work
    • All the author records belonging to that work
    • Data about all the records which have been marked as redirects to any of the above records
    • IA metadata for each edition having a scan

    When updating multiple records at once, getting these individually might be too inefficient. So some effort has gone into making the process faster by making requests in batches whenever possible and talking directly to the database to avoid middle-layer overhead.

    The flow will be similar for author records as well.
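
    The batching idea can be illustrated with a small sketch. It is not the actual update_work.py code: fetch_many is a hypothetical stand-in for a function that loads several records in one request, and the batch size of 100 is an arbitrary assumption.

```python
def batches(keys, size=100):
    """Yield successive lists of at most `size` keys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]

def load_records(keys, fetch_many, size=100):
    """Load all records with one fetch_many call per batch instead of one per key."""
    records = {}
    for batch in batches(keys, size):
        records.update(fetch_many(batch))
    return records

# Hypothetical fetcher that records how many keys each request carried.
calls = []

def fake_fetch_many(batch):
    calls.append(len(batch))
    return {key: {"key": key} for key in batch}

keys = ["/books/OL%dM" % i for i in range(250)]
records = load_records(keys, fake_fetch_many)
print(len(records), calls)
```

    In this example 250 keys are loaded with three requests instead of 250.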

  • gio 5:30 pm on July 9, 2015
    Tags: openlibrary

    OL: Search with Solr 

    Open Library uses Apache Solr as its search platform.

    Solr Server: http://solr:8983/solr/
     Solr Admin: http://solr:8984/solr/admin/
    • To READ/SEARCH an entry:

      curl http://solr:8983/solr/select?q=QUERY

      or open the same URL in a browser.

    • To CREATE/UPDATE an entry:

      curl http://solr:8983/solr/update?commitWithin=10000 -H "Content-Type: text/xml" --data-binary '<add><doc><field name="edition_key">OL7649435M</field><field name="cover_i">405982</field><field name="isbn">9780671525323</field><field name="isbn">0671525328</field><field name="has_fulltext">False</field><field name="author_name">Jeff Noon</field><field name="seed">/books/OL7649435M</field><field name="seed">/works/OL8262577W</field><field name="seed">/authors/OL450487A</field><field name="author_key">OL450487A</field><field name="title">Vurt</field><field name="publish_date">March 1, 1995</field><field name="type">work</field><field name="ebook_count_i">0</field><field name="id_librarything">19214</field><field name="edition_count">1</field><field name="key">/works/OL8262577W</field><field name="id_goodreads">1420154</field><field name="publisher">Audioworks</field><field name="language">eng</field><field name="last_modified_i">1436385710</field><field name="cover_edition_key">OL7649435M</field><field name="publish_year">1995</field><field name="first_publish_year">1995</field><field name="author_facet">OL450487A Jeff Noon</field></doc></add>'
    • To DELETE an entry:

      curl -L 'http://solr:8983/solr/update?commitWithin=10000' -H "Content-Type: text/xml" --data-binary '<delete><query>key:/works/OL17071689W</query></delete>'

      or using a GET:

      curl -L 'http://solr:8983/solr/update?commitWithin=60000&stream.body=%3Cdelete%3E%3Cquery%3Ekey:/works/OL17058137W%3C/query%3E%3C/delete%3E'

      We use commitWithin (a time limit in milliseconds, 10000 or 60000 in the examples above) instead of commit=true because the solr server could be busy.

    The SOLR response should look like:

    <lst name="responseHeader"><int name="status">0</int><int name="QTime">70</int></lst>
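
    Writing these XML payloads by hand is error-prone, so here is a minimal sketch that builds the add and delete bodies and reads the status from a response using only the Python standard library. The helper names (build_add_xml, build_delete_xml, solr_status) are hypothetical, and the sketch only builds and parses strings; sending them to Solr is still done with curl as shown above.

```python
import xml.etree.ElementTree as ET

def build_add_xml(doc):
    """Build an <add><doc>...</doc></add> body; list values become repeated fields."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")

def build_delete_xml(key):
    """Build a delete-by-query body for a single document key."""
    delete = ET.Element("delete")
    query = ET.SubElement(delete, "query")
    query.text = "key:" + key
    return ET.tostring(delete, encoding="unicode")

def solr_status(response_xml):
    """Return the integer status from a Solr XML response header, or None."""
    root = ET.fromstring(response_xml)
    for el in root.iter("int"):
        if el.get("name") == "status":
            return int(el.text)
    return None

print(build_add_xml({"key": "/works/OL8262577W", "title": "Vurt",
                     "isbn": ["9780671525323", "0671525328"]}))
print(build_delete_xml("/works/OL17071689W"))
print(solr_status('<lst name="responseHeader"><int name="status">0</int>'
                  '<int name="QTime">70</int></lst>'))
```

    A status of 0 means the request was accepted.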
  • gio 5:37 pm on July 7, 2015
    Tags: balance, haproxy, nginx, openlibrary, proxy

    OL: using HAProxy as a better proxy balancer 

    To distribute the requests between the webnodes we decided to put an HAProxy between the NGINX webserver and the apps on the webnodes. HAProxy allows us to balance the requests between the webnodes with more granularity than NGINX.

    The HAProxy statistics report is available at the page http://openlibrary.org/admin?stats

    We installed an HAProxy server on ol-www1 on port 7072, with this /etc/haproxy/haproxy.cfg:

    global
            log   daemon info
            maxconn 4096
            user haproxy
            group haproxy
    defaults
            log     global
            mode    http
            option  httplog
            option  dontlognull
            retries 3
            option redispatch
            maxconn 2000
            contimeout      9000
            clitimeout      7200000
            srvtimeout      7200000
            stats uri     /admin?stats
            stats refresh 5s
            # these are added so the client ip comes through
            option httpclose
            option forwardfor
            option forceclose
    listen  ol-web-app
            mode    http
            balance roundrobin
            option httpchk GET /
            timeout check 3000
            #option httpchk GET /solr/select?rows=0&q=*:*
            server  web1 ol-web1:7071 maxconn 23 check inter 5000 rise 2 fall 2
            server  web2 ol-web2:7071 maxconn 15 check inter 5000 rise 2 fall 2

    With this config file the requests are distributed between ol-web1 and ol-web2, with at most 23 and 15 concurrent connections respectively (the maxconn values). The nodes are asymmetric because a different number of gunicorn instances runs on each of them.
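
    To see how round-robin balancing interacts with the per-server maxconn limits, here is a deliberately simplified model. It is not HAProxy's actual scheduler: it only assumes that concurrent requests are handed out in turn, that a server already at its limit is skipped, and that requests nobody can take are queued. The distribute function is hypothetical.

```python
from itertools import cycle

def distribute(n_requests, servers):
    """servers: list of (name, maxconn). Returns (connections per server, queued)."""
    limits = dict(servers)
    assigned = {name: 0 for name, _ in servers}
    order = cycle([name for name, _ in servers])
    queued = 0
    for _ in range(n_requests):
        # Try each server once, starting from the next one in the rotation.
        for _ in range(len(servers)):
            name = next(order)
            if assigned[name] < limits[name]:
                assigned[name] += 1
                break
        else:
            # Every server is at its maxconn: the request has to wait.
            queued += 1
    return assigned, queued

servers = [("web1", 23), ("web2", 15)]
print(distribute(10, servers))
print(distribute(38, servers))
print(distribute(50, servers))
```

    Under light load both nodes get the same share; the 23/15 asymmetry only shows up once web2 reaches its limit.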

    Then we changed the NGINX conf at /etc/nginx/sites-enabled/openlibrary.conf setting the upstream to the HAProxy server:

    upstream webnodes {
  • gio 6:37 pm on April 1, 2015
    Tags: lending, openlibrary, waitinglists   

    OL: how to fix a common problem with the waitinglists 

    For the OL admins: the following solution is implemented with the Upload loan info button, on the Borrow – Administration page.

    The lending system is managed through Internet Archive.
    When there is a connection problem between IA and OL during a loan update request (returning or borrowing), the waiting list may get stuck in an undetermined state.

    To fix this situation we use the script scripts/update-loans.py.

    The script accepts three commands:

        if cmd == "update-loans":
        elif cmd == "update-waitinglists":
        elif cmd == "update-waitinglist":

    To update and fix a waitinglist we use the update-waitinglist command:

    python scripts/openlibrary-server openlibrary.yml runscript scripts/update-loans.py update-waitinglist <InternetArchiveItemId>

    :: If the script does not run correctly and you receive this error message:
    “Required security token not privided or didn’t match.”
    it means you don’t have the ia_ol_shared_key in your openlibrary.yml

    :: Do not run the script as root

  • gio 11:00 pm on March 11, 2015
    Tags: openlibrary   

    OL: how to generate the dump files, step-by-step 

    Here are the instructions for generating the ol_dump.txt.gz files.

    ol-home is the right place to do this.

    1 :: Dumping the data table from ol-db1
    this task requires around 1 hour to complete.

    giovanni@ol-home:/1/var/tmp$ psql -h ol-db1 -U openlibrary openlibrary -c "copy data to stdout" | gzip -c > data.txt.gz

    2 :: Activate the virtual environment /opt/openlibrary/venv

    giovanni@ol-home:/1/var/tmp$ source /opt/openlibrary/venv/bin/activate

    3 :: Generate the metadata table dump from archive db
    this task requires around 1 hour to complete.

    (venv)giovanni@ol-home:/1/var/tmp$ ARCHIVE_DB_PASSWORD=`/opt/.petabox/dbserver`
    (venv)giovanni@ol-home:/1/var/tmp$ python /opt/openlibrary/openlibrary/scripts/2012/dump-ia-items.py --host db-current --user archive --password $ARCHIVE_DB_PASSWORD --database archive | gzip -c > ia_metadata_dump_2015-03-11.txt.gz

    4 :: Generate the dump of all revisions of all documents.
    this task requires around 8 hours to complete.

    (venv)giovanni@ol-home:/1/var/tmp$ /opt/openlibrary/openlibrary/scripts/oldump.py cdump data.txt.gz 2015-03-11 | gzip -c > ol_cdump.txt.gz
    (venv)giovanni@ol-home:/1/var/tmp$ rm data.txt.gz

    5 :: Generate the dump of latest revisions of all documents.
    this task requires around 6 hours to complete.

    (venv)giovanni@ol-home:/1/var/tmp$ gzip -cd ol_cdump.txt.gz | python /opt/openlibrary/openlibrary/scripts/oldump.py sort --tmpdir /1/var/tmp | python /opt/openlibrary/openlibrary/scripts/oldump.py dump | gzip -c > ol_dump_2015-03-11.txt.gz
    (venv)giovanni@ol-home:/1/var/tmp$ rm -rf /1/var/tmp/oldumpsort
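
    Conceptually, step 5 reduces the dump of all revisions to the latest revision of each document. Here is a minimal sketch of that idea, not the oldump.py implementation: the (key, revision, json) tuple layout is an assumption made for illustration, and everything is sorted in memory, which would not scale to the real dump (presumably why the sort step above takes a --tmpdir).

```python
from itertools import groupby

def latest_revisions(rows):
    """rows: iterable of (key, revision, json_text) tuples, in any order.

    Yields one tuple per key: the one with the highest revision.
    """
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _key, group in groupby(ordered, key=lambda r: r[0]):
        *_, last = group  # rows are sorted by revision, so the last one is the latest
        yield last

# Hypothetical sample rows.
rows = [
    ("/works/OL1W", 1, '{"title": "first draft"}'),
    ("/works/OL1W", 3, '{"title": "final"}'),
    ("/authors/OL2A", 2, '{"name": "someone"}'),
    ("/works/OL1W", 2, '{"title": "second draft"}'),
]
for row in latest_revisions(rows):
    print(row)
```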

    6 :: Splitting the Dump into authors, editions, works, redirects

    (venv)giovanni@ol-home:/1/var/tmp$ gzip -cd ol_dump_2015-03-11.txt.gz | python /opt/openlibrary/openlibrary/scripts/oldump.py split --format ol_dump_%s_2015-03-11.txt.gz

    7 :: Generate the denormalized works Dump <<---- TO FIX: the script returns exceptions
    where each row contains a JSON document with the following fields:

    • work – The work documents
    • editions – List of editions that belong to this work
    • authors – All the authors of this work
    • ia – IA metadata for all the ia items referenced in the editions as a list
    • duplicates – dictionary of duplicates (key -> its duplicates) of work and edition docs mentioned above

    (venv)giovanni@ol-home:/1/var/tmp$ python /opt/openlibrary/openlibrary/scripts/2011/09/generate_deworks.py ol_dump_2015-03-11.txt.gz ia_metadata_dump_2015-03-11.txt.gz | gzip -c > ol_dump_deworks_2015-01-11.txt.gz
    (venv)giovanni@ol-home:/1/var/tmp$ ls
    ia_metadata_dump_2015-03-11.txt.gz  ol_dump_2015-03-11.txt.gz
    ol_dump_redirects_2015-03-11.txt.gz ol_dump_authors_2015-03-11.txt.gz
    ol_dump_deworks_2015-01-11.txt.gz   ol_dump_editions_2015-03-11.txt.gz
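
    Once the denormalized works dump is available, it could be read along these lines. This is a sketch under assumptions: it takes "each row contains a JSON document" to mean one JSON document per line, and it only relies on the field names listed in step 7. The iter_rows and summarize helpers and the sample file are hypothetical.

```python
import gzip
import json
import os
import tempfile

def iter_rows(path):
    """Yield one dict per non-empty line of a gzip-compressed dump file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def summarize(row):
    """Return the work key and the size of each field in a denormalized row."""
    work = row.get("work") or {}
    return {
        "key": work.get("key"),
        "editions": len(row.get("editions", [])),
        "authors": len(row.get("authors", [])),
        "ia": len(row.get("ia", [])),
        "duplicates": len(row.get("duplicates", {})),
    }

# Tiny hypothetical file so that the sketch runs on its own.
sample = {
    "work": {"key": "/works/OL1W"},
    "editions": [{"key": "/books/OL1M"}],
    "authors": [{"key": "/authors/OL1A"}],
    "ia": [],
    "duplicates": {},
}
path = os.path.join(tempfile.mkdtemp(), "sample_deworks.txt.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")

for row in iter_rows(path):
    print(summarize(row))
```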
  • gio 7:12 pm on February 26, 2015
    Tags: openlibrary   

    OL: infobase memory leaks fixed 

    We found a memory/threads leak on ol-home related to the fastcgi library used by the infobase process.


    Restarting infobase restores the number of threads.


    Together with Sam Stoller we used a gdb script to print the Python stack trace.

    Thread 1 (Thread 0x7fb5e3c5c740 (LWP 23820)):
    #3 Frame 0xc40860, for file /opt/openlibrary/venv/local/lib/python2.7/site-packages/flup/server/threadedserver.py, line 76, in run (self=<WSGIServer(_appLock=, multiprocess=False, _umask=None, roles=(1,), _hupReceived=False, _connectionClass=, _jobClass=, _threadPool=<ThreadPool(_workQueue=[], _lock=<_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=, _RLock__count=0) at remote 0x7fb5dcc2d5d0>, acquire=, _is_owned=, _release_save=, release=, _acquire_restore=, _Verbose__verbose=False, _Condition__waiters=[, , , , multiprocess=False, _umask=None, roles=(1,), _hupReceived=False, _connectionClass=, _jobClass=, _threadPool=<ThreadPool(_workQueue=[], _lock=<_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=, _RLock__count=0) at remote 0x7fb5dcc2d5d0>, acquire=, _is_owned=, _release_save=, release=, _acquire_restore=, _Verbose__verbose=False, _Condition__waiters=[, , , , addr=('', 7050),
        return flups.WSGIServer(func, multiplexed=True, bindAddress=addr).run()
    #18 Frame 0x7fb5e17cc830, for file /opt/openlibrary/venv/local/lib/python2.7/site-packages/web/wsgi.py, line 42, in runwsgi (func=, args=['7050'])
        return runfcgi(func, validaddr(args[0]))
    #21 Frame 0x7fb5dcc1ab00, for file /opt/openlibrary/venv/local/lib/python2.7/site-packages/web/application.py, line 313, in run (self=<application(fvars={'from_json': , 'things': , 'reindex': , 'get_data': , 'seq': , 'app': , 'echo': , 'load_config': , 'logreader': , 'new_key': , 'get_many': , 'to_int': , 'readlog': , 'web': , 'update_config': , 'setup_remoteip': , 'cache': , '__package__': 'infogami.infobase', 'write': , 'start':...(truncated)
        return wsgi.runwsgi(self.wsgifunc(*middleware))
    #25 Frame 0x7fb5dcc1e1f8, for file /opt/openlibrary/deploys/openlibrary/6b2cc05/infogami/infobase/server.py, line 615, in run ()
    #28 Frame 0x7fb5e18f3cc8, for file /opt/openlibrary/deploys/openlibrary/6b2cc05/infogami/infobase/server.py, line 639, in start (config_file='/olsystem/etc/infobase.yml', args=('fastcgi', '7050'))
    #33 Frame 0x7fb5e1c56230, for file /opt/openlibrary/openlibrary/scripts/infobase-server, line 32, in main (args=['/olsystem/etc/infobase.yml', 'fastcgi', '7050'], server=)
    #36 Frame 0x7fb5e3bb8208, for file /opt/openlibrary/openlibrary/scripts/infobase-server, line 61, in  ()

    It looks like there is a deadlock related to multiplexed=True of the wsgi server, as defined in /opt/openlibrary/venv/local/lib/python2.7/site-packages/web/wsgi.py line 17.

    We also found an interesting note about the flup multiplexed option.

    Anand Chitipothu fixed it by switching the flup fastcgi infobase server to multiplexed=False with the patch: https://github.com/internetarchive/openlibrary/pull/234

    “The web.py runfcgi is using multiplexed=True option for fastcgi server and that seem to cause some memory leaks. Using a variant of runfcgi that sets multiplexed=False.”
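
    The kind of change described in the quote can be sketched as follows. This is an illustration, not the code from the pull request: it mirrors the flups.WSGIServer(func, multiplexed=True, bindAddress=addr).run() call visible in the stack trace, with the flag flipped. The function name, the server_class parameter and the FakeServer stand-in are hypothetical, added so the sketch can run without flup installed.

```python
def runfcgi_not_multiplexed(func, addr=("localhost", 8000), server_class=None):
    """Run a WSGI app as a FastCGI server with multiplexing turned off."""
    if server_class is None:
        # Imported lazily so the sketch can be read and tested without flup.
        from flup.server.fcgi import WSGIServer as server_class
    return server_class(func, multiplexed=False, bindAddress=addr).run()

class FakeServer:
    """Stand-in that records how it was configured instead of serving."""

    def __init__(self, app, **options):
        self.app = app
        self.options = options

    def run(self):
        return self.options

def app(environ, start_response):
    """Trivial WSGI application used only for the demonstration."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

options = runfcgi_not_multiplexed(app, ("", 7050), server_class=FakeServer)
print(options)
```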

    This fixed the leak problem.


    • internetarchive 7:22 pm on February 26, 2015

      GREAT work, team. Thank you for hunting it down and getting it resolved!

  • gio 10:48 pm on February 25, 2015
    Tags: openlibrary   

    OL: import new books from IA – importer 

    To import new books from IA, add their IA identifiers using the OL page: https://openlibrary.org/admin/imports/add

    The manage-imports.py process must be running on ol-home:

    python scripts/manage-imports.py --config /olsystem/etc/openlibrary.yml import-all

    You can launch it with:

    sudo -u openlibrary /olsystem/bin/olenv HOME=/home/openlibrary OPENLIBRARY_RCFILE=/olsystem/etc/olrc-importbot python scripts/manage-imports.py --config /olsystem/etc/openlibrary.yml import-all >> /tmp/importer.log

    You can see the log at: /tmp/importer.log

  • gio 7:18 pm on February 19, 2015
    Tags: openlibrary   

    OL: stats graph in homepage 

    The statistics graphs on the OL homepage are generated by a script that calls the code in openlibrary.admin

  • gio 10:58 pm on February 11, 2015
    Tags: openlibrary   

    OL: recent activities summary 

    Primary results:

    • Open Library has a more reliable and stable infrastructure.
    • It’s easier doing activities like monitoring, system management and diagnosis.
    • The github community can come back to fix bugs and develop.

    I have to specially thank Raj, Sam, Anand and Andy for helping me in this process.


    == Cluster Reliability ==

    :- Tomcat configuration updated to better fit the OL infrastructure.
    :- Solr configuration updated to fit the other Archive solr instances.
    :- Added an haproxy to ol-solr2 helping tomcat to handle all the connections properly.
    :- Rebooted ol-solr2 on SSD.


    == Diagnostics ==

    :- Installed and configured Munin to produce some graphic reports.
    :- Coded a daemon and a munin plugin tracking the Response Time, and the Status Code response rates.
    :- Designed, coded, configured and installed a “one-page” dashboard that lets us better monitor the cluster status and find the main info related to OL: wiki, nagios, admin-center, lending stats, github, etc.
    :- Minor Nagios triggers updated, making the alarm more useful and effective.


    == Bug Fixing ==

    :- Found and debugged a memory/threads leak on ol-home, the problem seems related to the original infogami/webpy code. I informed Anand, and I hope he will have time to answer and tell us how to solve this issue ASAP.
    :- Found an outage issue related to ARP and the load balancer DNS. Sam and Andy are working on it.
    :- Found and fixed an important issue on the deployment process.
    :- Fixed some management scripts that were not working properly.
    :- Fixed the backup scripts that were not working properly.
    :- Updating the underestimated disk space for backups on ol-home.
    Andy and I will finish this week, and we’ll finally solve the annoying monthly DISK-FULL problem.
    :- Fixed the sitemaps generation process. We finally have an updated sitemap that works correctly. This solves some problems we had with the google-bot.
    :- Learned how to recover from an ACS4-related OL outage.
    :- Minor log-rotation, disk full problem solved on ol-solr2.
    :- Fixed the issues with the Vagrant developing instance.


    == Security Upgrade ==

    :- Removed the SSLv3 protocol support from nginx, solving the POODLE vulnerability.


    == Documentation ==

    :- Updated the wiki page with all the NEW documentation we wrote during these activities.
    :- Updated the wiki page with some old documentation not deprecated yet.


    == Github and Developers ==

    :- Merged some old pull requests from the community.
    :- General cleaning.
