Tagged: solr

  • gio 10:08 pm on August 3, 2015 Permalink
    Tags: index, reindexing, solr

    OL: reindexing the Solr search 

    To reindex OpenLibrary’s Solr search index so that it is consistent with the data in the database, we can use the script:

    /openlibrary/scripts/ol-solr-indexer.py

    here you can find the source.

    Here is some basic usage:

    /olsystem/bin/olenv python /opt/openlibrary/openlibrary/scripts/ol-solr-indexer.py --config /olsystem/etc/openlibrary.yml --bookmark ol-solr-indexer.bookmark --backward --days 2

    where:
    /olsystem/bin/olenv is the script to load the right virtualenv.
    --config /olsystem/etc/openlibrary.yml is the OpenLibrary yml configuration file.
    --bookmark ol-solr-indexer.bookmark is the file where the timestamp of the last scan (YYYY-MM-DD hh:mm:ss) is bookmarked.
    --backward / --forward is the direction in which to reindex.
    --days is the number of days to reindex.

    The script can also run in daemon mode. At the moment we are still using the new-solr-updater for the partial updates…
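
    As a rough illustration of how --bookmark, --backward / --forward and --days relate, the sketch below computes the time window a run would cover. It is not taken from ol-solr-indexer.py: the function name reindex_window is hypothetical, and the assumption that the window starts at the bookmarked timestamp is mine.

```python
from datetime import datetime, timedelta

# YYYY-MM-DD hh:mm:ss, the bookmark format described above
BOOKMARK_FORMAT = "%Y-%m-%d %H:%M:%S"


def reindex_window(bookmark, days, backward=True):
    """Return (start, end) datetimes for a hypothetical reindexing run.

    Assumption: the run covers `days` days anchored at the bookmarked
    timestamp, going back in time when `backward` is true.
    """
    anchor = datetime.strptime(bookmark.strip(), BOOKMARK_FORMAT)
    span = timedelta(days=days)
    if backward:
        return anchor - span, anchor
    return anchor, anchor + span


start, end = reindex_window("2015-08-03 22:08:00", days=2, backward=True)
print(start.strftime(BOOKMARK_FORMAT), "->", end.strftime(BOOKMARK_FORMAT))
```

    With --forward the same call would return a window that begins at the bookmark and ends two days later.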

     
  • gio 10:36 pm on July 17, 2015 Permalink
    Tags: solr

    SOLR: commits and optimize 

    • Commit: When you index documents into Solr, none of your changes appear in search results until you run the commit command. When to commit therefore depends on how quickly you need the changes to show up on your site through the search engine. A commit is a heavy operation, however, so it should be done in batches rather than after every update.

      curl 'http://<SOLR_INSTANCE_URL>/update?commit=true'

    • Optimize: This is similar to a defrag command on a hard drive. It merges the index into fewer segments (increasing search speed) and removes any deleted (replaced) documents. Index segments are never modified in place, so every time you reindex a document Solr marks the old document as deleted and writes a brand new document to replace it. Optimize removes these deleted documents. You can compare the searchable document count with the deleted document count by going to the Solr Statistics page and looking at the numDocs vs. maxDocs numbers. The difference between the two numbers is the number of deleted (non-searchable) documents in the index.

      curl 'http://<SOLR_INSTANCE_URL>/update?optimize=true'

      Optimize also builds a whole new index from the old one and switches to the new index when complete, so the command requires double the space to run. Make sure that the size of your index does not exceed 50% of your available hard drive space. (This is a rule of thumb; it usually needs less than 50% because deleted documents are dropped.)

    (source: http://stackoverflow.com/a/3737972)
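
    The arithmetic in the two bullets above can be sketched in a few lines. The helper names below (in_batches, deleted_docs, fits_optimize_rule) are hypothetical and only restate the rules of thumb from the text; they do not talk to Solr.

```python
def in_batches(items, size):
    """Yield successive chunks so that one commit can follow each chunk
    instead of each single update."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def deleted_docs(num_docs, max_docs):
    """Documents marked as deleted but still stored: maxDocs - numDocs."""
    return max_docs - num_docs


def fits_optimize_rule(index_bytes, disk_bytes):
    """Rule of thumb from the text: the index should not exceed 50% of
    the available disk space, because optimize writes a second copy."""
    return index_bytes * 2 <= disk_bytes


print(deleted_docs(num_docs=900, max_docs=1000))
print(fits_optimize_rule(index_bytes=40, disk_bytes=100))
print(list(in_batches(["a", "b", "c", "d", "e"], 2)))
```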

     
  • gio 7:52 pm on July 14, 2015 Permalink
    Tags: solr

    OL: Updating the Search Engine (notes by: Anand Chitipothu) 

    Open Library uses Apache Solr to provide search functionality on the website. The Solr instance maintains an index of all the items to be searched in its data directory. This index is informally called the “search index”. All book, work and author records are stored in the same search engine.

    :: Updating the Search Engine

    Whenever a record is updated on Open Library, the corresponding entry in the search engine must be updated so that search returns up-to-date results. Open Library has two different ways to update the search index.

    • The Manual Way

      The openlibrary/solr module provides a script called update_work.py for updating Solr. Although the name mentions only works, the script can also update edition and author documents in Solr.

      To update one or more entries manually:
      WARNING: be sure to use the right openlibrary.yml file…

      $ python openlibrary/solr/update_work.py --config openlibrary.yml /books/OL123M /works/OL234W /authors/OL45A

      By default, the script performs a commit to Solr. A commit ensures that the changes are flushed to disk and are available to search requests from then on. However, it is a very expensive operation and takes more than 5 minutes (at the time of writing this).

      To update the documents without committing them, add the `--nocommit` flag.

      $ python openlibrary/solr/update_work.py --config openlibrary.yml --nocommit /books/OL123M /works/OL234W /authors/OL45A
    • The Solr Updater

      The script scripts/new_solr_updater.py runs as a daemon process, listens for edits made to the database, and updates the corresponding documents in Solr.

      Infobase, the system that handles all modifications to the system, maintains a log of all changes. It writes a JSON entry to a log file whenever something is modified in the database. It also provides an API to request these log entries.

      The solr updater script uses this API to fetch the modifications made since the last one it has seen. It uses those entries to find which documents should be updated in the search engine.

      While this looks like a fair approach, the solr updater script can fail on a bad record or on unexpected data. When this happens, the solr updater dies, restarts from the same point when it comes back up, and thus gets into an infinite loop.

      The current position in the log consumed by the solr updater is maintained in a state file. The state file is at /var/run/openlibrary/solr-update.offset, or at any other path passed as an argument to the solr updater script.
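
      The failure mode described above can be reproduced with a small sketch of an offset-based consumer. This is not the code of new_solr_updater.py: the function names, the plain integer offset and the in-memory log list are assumptions made for illustration only.

```python
import json
import os
import tempfile


def read_offset(state_path):
    """Return the saved position, or 0 when no state file exists yet."""
    try:
        with open(state_path) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def write_offset(state_path, offset):
    with open(state_path, "w") as f:
        f.write(str(offset))


def consume(entries, state_path, update_solr):
    """Process entries after the saved offset, saving progress after each.

    If parsing or update_solr raises, the offset is not advanced, so the
    next run starts again at the same (failing) entry.
    """
    offset = read_offset(state_path)
    for entry in entries[offset:]:
        update_solr(json.loads(entry))
        offset += 1
        write_offset(state_path, offset)
    return offset


# Hypothetical log with one bad record in the middle.
log = ['{"key": "/works/OL1W"}', 'not json', '{"key": "/works/OL2W"}']
state = os.path.join(tempfile.mkdtemp(), "solr-update.offset")
seen = []
for attempt in range(2):
    try:
        consume(log, state, seen.append)
    except ValueError:
        pass  # the updater "dies"; on restart it resumes at the same entry
print(read_offset(state), seen)
```

      Both attempts stop at the second entry and the saved offset never moves past it, which is the loop the text describes.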

    :: The Updating Process

    To understand what is involved in updating a record in Solr, let's restrict ourselves to work search and visualize a work record with a bunch of editions.

    The work record should appear in search results when any one of the following terms is used in the search query.

    • title of the work
    • title of any of its editions (could be in other languages)
    • ISBN or any other ID of the editions
    • names of the authors
    • (and some more)

    To get all this information, the solr document needs information from the following sources.

    • The work record from OL database
    • All the edition records belonging to that work
    • All the author records belonging to that work
    • Data about all the records which have been marked as redirects to any of the above records
    • IA metadata for each edition having a scan

    When updating multiple records at once, fetching these individually might be too inefficient. So some effort has gone into making the process faster by making requests in batches whenever possible and by talking directly to the database to avoid middle-layer overhead.

    The flow will be similar for author records as well.
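
    To make the merge step more concrete, here is a minimal sketch that combines a work, its editions and its authors into one search document. The output field names follow the example Solr document used elsewhere in these notes (key, title, edition_count, edition_key, isbn, author_key, author_name); the function name build_work_doc and the shape of the input records (isbn_10, isbn_13, name) are assumptions, and redirects and IA metadata are left out.

```python
def build_work_doc(work, editions, authors):
    """Merge a work, its editions and its authors into one search document."""
    isbns = []
    for edition in editions:
        # Collect every ISBN so the work can be found by any edition's ISBN.
        isbns.extend(edition.get("isbn_10", []))
        isbns.extend(edition.get("isbn_13", []))
    return {
        "key": work["key"],
        "type": "work",
        "title": work["title"],
        "edition_count": len(editions),
        "edition_key": [e["key"].split("/")[-1] for e in editions],
        "isbn": isbns,
        "author_key": [a["key"].split("/")[-1] for a in authors],
        "author_name": [a["name"] for a in authors],
    }


doc = build_work_doc(
    {"key": "/works/OL8262577W", "title": "Vurt"},
    [{"key": "/books/OL7649435M",
      "isbn_10": ["0671525328"], "isbn_13": ["9780671525323"]}],
    [{"key": "/authors/OL450487A", "name": "Jeff Noon"}],
)
print(doc["edition_key"], doc["isbn"], doc["author_name"])
```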

     
  • gio 5:30 pm on July 9, 2015 Permalink
    Tags: solr

    OL: Search with Solr 

    OpenLibrary uses Apache Solr as its search platform.

    Solr Server: http://solr:8983/solr/
     Solr Admin: http://solr:8984/solr/admin/
    
    • To READ/SEARCH an entry:

      curl http://solr:8983/solr/select?q=QUERY

      or using a browser:

      http://solr:8983/solr/select?q=QUERY&cache=false
    • To CREATE/UPDATE an entry:

      curl 'http://solr:8983/solr/update?commitWithin=10000' -H "Content-Type: text/xml" --data-binary '<add><doc><field name="edition_key">OL7649435M</field><field name="cover_i">405982</field><field name="isbn">9780671525323</field><field name="isbn">0671525328</field><field name="has_fulltext">False</field><field name="author_name">Jeff Noon</field><field name="seed">/books/OL7649435M</field><field name="seed">/works/OL8262577W</field><field name="seed">/authors/OL450487A</field><field name="author_key">OL450487A</field><field name="title">Vurt</field><field name="publish_date">March 1, 1995</field><field name="type">work</field><field name="ebook_count_i">0</field><field name="id_librarything">19214</field><field name="edition_count">1</field><field name="key">/works/OL8262577W</field><field name="id_goodreads">1420154</field><field name="publisher">Audioworks</field><field name="language">eng</field><field name="last_modified_i">1436385710</field><field name="cover_edition_key">OL7649435M</field><field name="publish_year">1995</field><field name="first_publish_year">1995</field><field name="author_facet">OL450487A Jeff Noon</field></doc></add>'
    • To DELETE an entry:

      curl -L 'http://solr:8983/solr/update?commitWithin=10000' -H "Content-Type: text/xml" --data-binary '<delete><query>key:/works/OL17071689W</query></delete>'

      or using a GET:

      curl -L 'http://solr:8983/solr/update?commitWithin=60000&stream.body=%3Cdelete%3E%3Cquery%3Ekey:/works/OL17058137W%3C/query%3E%3C/delete%3E'

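      We use commitWithin (a delay in milliseconds, e.g. commitWithin=10000) instead of commit=true because the Solr server could be busy: commitWithin lets Solr schedule the commit within the given time instead of forcing one immediately.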

    The SOLR response should look like:

    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">70</int></lst>
    </response>
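
    For scripted use, the XML bodies and the response shown above can be handled with the Python standard library. The sketch below only builds and parses strings and sends nothing to Solr; the helper names build_add_xml and response_status are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize a dict as a Solr <add><doc>...</doc></add> message.

    List values become repeated <field> elements with the same name.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", {"name": name})
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


def response_status(xml_text):
    """Return the integer status from a Solr XML response (0 means success)."""
    root = ET.fromstring(xml_text)
    status = root.find("./lst[@name='responseHeader']/int[@name='status']")
    return int(status.text)


body = build_add_xml({
    "key": "/works/OL8262577W",
    "title": "Vurt",
    "isbn": ["9780671525323", "0671525328"],
})
print(body)

reply = """<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">70</int></lst>
</response>"""
print(response_status(reply))
```

    The string returned by build_add_xml is what would be passed to curl through --data-binary in the CREATE/UPDATE example.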
     