Updates from February, 2015

  • gio 7:12 pm on February 26, 2015

    OL: infobase memory leaks fixed 

    We found a memory/thread leak on ol-home, related to the fastcgi library used by the infobase process.

    [Graph: threads-week]

    As you can see, restarting infobase restores the number of threads.

    With Sam Stoller, we used a gdb script to print the Python stack trace.

    Thread 1 (Thread 0x7fb5e3c5c740 (LWP 23820)):
    #3 Frame 0xc40860, for file /opt/openlibrary/venv/local/lib/python2.7/site-packages/flup/server/threadedserver.py, line 76, in run (self=<WSGIServer(_appLock=, multiprocess=False, _umask=None, roles=(1,), _hupReceived=False, _connectionClass=, _jobClass=, _threadPool=<ThreadPool(_workQueue=[
    ], _lock=<_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=, _RLock__count=0) at remote 0x7fb5dcc2d5d0>, a
    cquire=, _is_owned=, _release_save=, release=, _acquire_restore=, _Verbose__verbose=False, _Condition__waiters=[, , , , multiprocess=False, _umask=None, roles=(1,), _hupReceived=False, _connectionClass=, _jobClass=, _threadPool=<ThreadPool(_workQueue=[],
     _lock=<_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=, _RLock__count=0) at remote 0x7fb5dcc2d5d0>, acq
    uire=, _is_owned=, _release_save=, release=, _acquire_restore=, _Verbose__verbose=False, _Condition__waiters=[, , , , addr=('0.0.0.0', 7050),
     flups=)
        return flups.WSGIServer(func, multiplexed=True, bindAddress=addr).run()
    #18 Frame 0x7fb5e17cc830, for file /opt/openlibrary/venv/local/lib/python2.7/site-packages/web/wsgi.py, line 42, in runwsgi (func=, args=['7050'])
        return runfcgi(func, validaddr(args[0]))
    #21 Frame 0x7fb5dcc1ab00, for file /opt/openlibrary/venv/local/lib/python2.7/site-packages/web/application.py, line 313, in run (self=<application(fvars={'from_json': , 'things': , 'reindex': , 'get_data': , 'seq': , 'app': , 'echo': , 'load_config': , 'logreader': , 'new_key': , 'get_many': , 'to_int': , 'readlog': , 'web': , 'update_config': , 'setup_remoteip': , 'cache': , '__package__': 'infogami.infobase', 'write': , 'start':...(truncated)
        return wsgi.runwsgi(self.wsgifunc(*middleware))
    #25 Frame 0x7fb5dcc1e1f8, for file /opt/openlibrary/deploys/openlibrary/6b2cc05/infogami/infobase/server.py, line 615, in run ()
        app.run()
    #28 Frame 0x7fb5e18f3cc8, for file /opt/openlibrary/deploys/openlibrary/6b2cc05/infogami/infobase/server.py, line 639, in start (config_file='/olsystem/etc/infobase.yml', args=('fastcgi', '7050'))
        run()
    #33 Frame 0x7fb5e1c56230, for file /opt/openlibrary/openlibrary/scripts/infobase-server, line 32, in main (args=['/olsystem/etc/infobase.yml', 'fastcgi', '7050'], server=)
        server.start(*args)
    #36 Frame 0x7fb5e3bb8208, for file /opt/openlibrary/openlibrary/scripts/infobase-server, line 61, in  ()
        main(sys.argv[1:])

    It looks like there is a deadlock related to the multiplexed=True option of the WSGI server, which is set in /opt/openlibrary/venv/local/lib/python2.7/site-packages/web/wsgi.py, line 17.

    We also found an interesting note about flup's multiplexed option.

    Anand Chitipothu fixed it by switching the flup fastcgi infobase server to multiplexed=False with the patch: https://github.com/internetarchive/openlibrary/pull/234

    “The web.py runfcgi is using multiplexed=True option for fastcgi server and that seem to cause some memory leaks. Using a variant of runfcgi that sets multiplexed=False.”
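A minimal sketch of what such a runfcgi variant could look like, modelled on the web.py call visible in the stack trace above. It is not the actual code from the pull request: the names runfcgi_unmultiplexed and server_factory are hypothetical, and the default address is an assumption.

```python
def runfcgi_unmultiplexed(func, addr=("localhost", 8000), server_factory=None):
    """Serve the WSGI callable `func` over FastCGI with multiplexing disabled.

    `server_factory` is a hypothetical hook that makes the function easy to
    exercise without flup; by default flup's fcgi WSGIServer is used.
    """
    if server_factory is None:
        # Imported lazily so this module loads even where flup is not installed.
        import flup.server.fcgi as flups
        server_factory = flups.WSGIServer
    # Same call as web.py's runfcgi, except multiplexed=False instead of True.
    return server_factory(func, multiplexed=False, bindAddress=addr).run()
```

With flup installed, infobase would call it as runfcgi_unmultiplexed(app.wsgifunc(), ("0.0.0.0", 7050)), matching the address seen in the trace.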

    This fixed the leak problem:

    [Graph: threads-day]

     
    • internetarchive 7:22 pm on February 26, 2015

      GREAT work, team. Thank you for hunting it down and getting it resolved!

  • gio 10:48 pm on February 25, 2015

    OL: import new books from IA – importer 

    To import new books from IA, add their IA identifiers using the OL page: https://openlibrary.org/admin/imports/add

    The manage-imports.py process must be running on ol-home:

     
    python scripts/manage-imports.py --config /olsystem/etc/openlibrary.yml import-all

    You can launch it with:

     
    sudo -u openlibrary /olsystem/bin/olenv HOME=/home/openlibrary OPENLIBRARY_RCFILE=/olsystem/etc/olrc-importbot python scripts/manage-imports.py --config /olsystem/etc/openlibrary.yml import-all >> /tmp/importer.log

    You can see the log at: /tmp/importer.log

     
  • gio 7:18 pm on February 19, 2015

    OL: stats graph in homepage 

    The statistics graphs on the OL homepage are generated by the script:
    /opt/openlibrary/openlibrary/scripts/store_counts.py
    It works by calling the code in openlibrary.admin:
    https://github.com/internetarchive/openlibrary/blob/master/openlibrary/admin/stats.py

     
  • gio 10:58 pm on February 11, 2015

    OL: recent activities summary 

    Primary results:

    • Open Library has a more reliable and stable infrastructure.
    • Activities like monitoring, system management and diagnosis are easier.
    • The GitHub community can come back to fix bugs and develop.

    I especially want to thank Raj, Sam, Anand and Andy for helping me in this process.

     

    == Cluster Reliability ==

    :- Tomcat configuration updated to better fit the OL infrastructure.
    :- Solr configuration updated to fit the other Archive solr instances.
    :- Added an haproxy to ol-solr2 helping tomcat to handle all the connections properly.
    :- Rebooted ol-solr2 on SSD.

     

    == Diagnostics ==

    :- Installed and configured Munin to produce some graphic reports.
    :- Coded a daemon and a munin plugin tracking the response time and the status code response rates.
    :- Designed, coded, configured and installed a “one-page” dashboard that lets us better monitor the cluster status and find the main info related to OL: wiki, nagios, admin-center, lending stats, github, etc.
    http://ol-home.us.archive.org:8088/dashboard/
    :- Updated some minor Nagios triggers, making the alarms more useful and effective.

     

    == Bug Fixing ==

    :- Found and debugged a memory/thread leak on ol-home; the problem seems related to the original infogami/webpy code. I informed Anand, and I hope he will have time to answer and tell us how to solve this issue ASAP.
    :- Found an outage issue related to ARP and the load balancer DNS. Sam and Andy are working on it.
    :- Found and fixed an important issue in the deployment process.
    :- Fixed some management scripts that were not working properly.
    :- Fixed the backup scripts that were not working properly.
    :- Updating the underestimated disk space for backups on ol-home.
    Andy and I will finish this week, and we’ll finally solve the annoying monthly DISK-FULL problem.
    :- Fixed the sitemap generation process. We finally have an updated sitemap that works correctly. This solves some problems we had with Googlebot.
    :- Learned how to recover from an ACS4-related OL outage.
    :- Solved a minor log-rotation, disk-full problem on ol-solr2.
    :- Fixed the issues with the Vagrant development instance.

     

    == Security Upgrade ==

    :- Removed the SSLv3 protocol support from nginx, solving the POODLE vulnerability.
    https://www.us-cert.gov/ncas/alerts/TA14-290A

     

    == Documentation ==

    :- Updated the wiki page with all the NEW documentation we wrote during these activities.
    https://wiki.archive.org/twiki/bin/view/OpenLibrary/WebHome
    :- Updated the wiki page with some old documentation not deprecated yet.

     

    == Github and Developers ==

    :- Merged some old pull requests from the community.
    :- General cleaning.

     
  • gio 6:48 pm on February 10, 2015

    OL: deploying the code 

    To deploy the OL code:

    rkumar@ol-www1:~$ sudo -s
    root@ol-www1:/home/rkumar$ su openlibrary
    openlibrary@ol-www1:/home/rkumar$ . /opt/openlibrary/venv/bin/activate 
    (venv)openlibrary@ol-www1:/home/rkumar$ pip install fabric
    (venv)openlibrary@ol-www1:/home/rkumar$ /olsystem/bin/deploy-code openlibrary

    This may not work, because the fabric we use is a very old version (cit. Anand).

    Try it from ol-home.

    Fix it using:

    sudo -u openlibrary rsync -av rsync://ol-home/opt/openlibrary/venv /opt/openlibrary/

    or:

    /olsystem/bin/olenv pip install -U fab==1.1.2
     
  • gio 6:00 pm on February 10, 2015

    OL: monitoring Nginx response time and response status codes with python and munin 

    To make Nginx log the response time properly, see the post “OL tracking app response time on nginx”.

    We expect to have the /var/log/nginx/access.log formatted as:

    '$seed$remote_addr $host $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_time'

    The log file lines look like:

    0.165.58.183 openlibrary.org - [10/Feb/2015:17:34:35 +0000] "GET /search?q=The+Law&subject_facet=Law&person_facet=R.+A.+Ramsay HTTP/1.1" 200 7159 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.660
    0.165.58.183 openlibrary.org - [10/Feb/2015:17:34:35 +0000] "GET /search?q=RICO&author_key=OL2687659A&subject_facet=In+library HTTP/1.1" 200 6895 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.684
    0.143.71.149 openlibrary.org - [10/Feb/2015:17:34:35 +0000] "GET /search?sort=old&subject_facet=Protected+DAISY&language=mul&publisher_facet=Nelson+Doubleday&publisher_facet=Macmillan+and+co.%2C+limited&publisher_facet=Franklin+Library HTTP/1.1" 200 5860 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.307
    0.103.218.37 openlibrary.org - [10/Feb/2015:17:34:35 +0000] "GET /works/OL118971W HTTP/1.0" 301 0 "https://openlibrary.org/" "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)" 0.057

    Using a Python script and a munin plugin we can plot these two graphs:

    • Nginx response time: the requests/second rate, split into response time ranges.
      [Graph: nginx_req_time-day]
    • Nginx response code: the requests/second rate for every response code class.
      [Graph: nginx_req_status_code-day]

    The script /usr/local/nglog/nglog.py runs in the background, reading access.log line by line and sampling the requests/second rate for the different response time ranges and response code classes. The values are saved in the files /tmp/nginx_request_status_stats and /tmp/nginx_request_time_stats.

    giovanni@ol-www1:/tmp$ cat /tmp/nginx_request_time_stats 
    T1.value 75.100000 
    T2.value 54.555556 
    T3.value 16.938889
    T4.value 1.711111 
    T5.value 2.933333
    giovanni@ol-www1:/tmp$
    giovanni@ol-www1:/tmp$ cat /tmp/nginx_request_status_stats 
    s2xx.value 123.600000 
    s3xx.value 19.755556 
    s4xx.value 7.883333 
    s5xx.value 0.000000
    giovanni@ol-www1:/tmp$

    When a status code 500 is found, the script logs the request to the file /var/log/nginx/nglog.log.
    The script is launched by the bash script /usr/local/nglog/nglogd.sh, which executes:

    tail -F /var/log/nginx/access.log| python /usr/local/nglog/nglog.py &
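The original nglog.py is not published, so the following is only a rough sketch of the counting logic described above, not the real script. The function names and the regular expression are hypothetical. The time ranges follow the labels of the nginx_req_time munin plugin, and the log layout is assumed to be the iacombined format shown earlier.

```python
import re

# Tail of an access.log line: "request" status bytes "referer" "agent" request_time
LINE_RE = re.compile(r'" (\d{3}) \d+ "[^"]*" "[^"]*" (\d+\.\d+)$')

TIME_KEYS = ["T1", "T2", "T3", "T4", "T5"]
STATUS_KEYS = ["s2xx", "s3xx", "s4xx", "s5xx"]


def time_bucket(seconds):
    """Map a request time to the ranges used by the munin plugin labels."""
    if seconds <= 0.5:
        return "T1"
    if seconds <= 1:
        return "T2"
    if seconds <= 5:
        return "T3"
    if seconds <= 10:
        return "T4"
    return "T5"


def count_lines(lines):
    """Count requests per response time range and per status code class."""
    times = dict.fromkeys(TIME_KEYS, 0)
    statuses = dict.fromkeys(STATUS_KEYS, 0)
    for line in lines:
        match = LINE_RE.search(line.rstrip())
        if match is None:
            continue  # skip lines that do not follow the expected format
        status, seconds = int(match.group(1)), float(match.group(2))
        times[time_bucket(seconds)] += 1
        key = "s%dxx" % (status // 100)
        if key in statuses:
            statuses[key] += 1
    return times, statuses


def format_rates(counts, interval):
    """Render counts as munin values, in requests per second over `interval` seconds."""
    return "".join("%s.value %f\n" % (name, counts[name] / interval)
                   for name in sorted(counts))
```

A daemon built on this would call count_lines on the lines read during each sampling interval and write format_rates(...) to the two files in /tmp.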

    To plot the results we use two ad-hoc munin plugins:
    /usr/share/munin/plugins/nginx_req_time

    #!/bin/sh
     
    case $1 in
       config)
            cat <<'EOM'
    graph_title NGINX response time
    graph_vlabel num_req /sec
     
    T1.label <= 0.5s
    T2.label >0.5 and <=1
    T3.label >1 and <=5
    T4.label >5 and <=10
    T5.label > 10
    EOM
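            exit 0;;
    esac

    # Assumption: the fetch step is not shown in the original post. Printing the
    # values saved by nglog.py is the usual way to complete a plugin like this.
    cat /tmp/nginx_request_time_stats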

    /usr/share/munin/plugins/nginx_req_status_code

    #!/bin/sh
     
    case $1 in
       config)
            cat <<'EOM'
    graph_title NGINX response status code
    graph_vlabel req/sec
     
    s2xx.label 2xx
    s3xx.label 3xx
    s4xx.label 4xx
    s5xx.label 5xx
    EOM
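            exit 0;;
    esac

    # Assumption: the fetch step is not shown in the original post. Printing the
    # values saved by nglog.py is the usual way to complete a plugin like this.
    cat /tmp/nginx_request_status_stats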
     
    • Oleg 7:55 pm on January 14, 2017

      Hi,
      tell me plz, where I can find a script “nglog/nglog.py” (GitHub)

    • gio 6:59 pm on May 23, 2017

      the script is not available online sorry. But it is just a Nginx’s log parser, with some counters.

  • gio 11:32 pm on February 9, 2015
    Tags: debug   

    Print a Python stacktrace of a running process 

    To print a Python stack trace of a running process you need two scripts.

    :: The ignore-errors.py

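    import gdb  # provided by gdb's embedded Python interpreter


    class IgnoreErrorsCommand(gdb.Command):
        """Execute a single command, ignoring all errors.
    Only one-line commands are supported.
    This is primarily useful in scripts."""

        def __init__(self):
            # Register the command under the name "ignore-errors".
            super(IgnoreErrorsCommand, self).__init__("ignore-errors",
                                                      gdb.COMMAND_OBSCURE,
                                                      # FIXME...
                                                      gdb.COMPLETE_COMMAND)

        def invoke(self, arg, from_tty):
            # Run the wrapped command and swallow any error it raises.
            try:
                gdb.execute(arg, from_tty)
            except Exception:
                pass


    # Instantiating the class registers the command with gdb.
    IgnoreErrorsCommand()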

    :: and the gdb script pygdb

    • Be sure gdb and python2.7-dbg are installed: apt-get install gdb python2.7-dbg
    # sudo apt-get install gdb python2.7-dbg
    # to run:
    # sudo gdb python
    # (gdb) bt_pid <pid>
    source /home/samuel/tmp/ignore-errors.py
     
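    # bt_pid <pid>: attach to the process, print the Python and native
    # backtraces of all threads ("t a a" is short for "thread apply all"),
    # then detach.
    define bt_pid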
        attach $arg0
        t a a ignore-errors py-bt
        t a a bt
        detach
    end

    To run this script and get the stack trace:

    giovanni@ol-home:~$ sudo gdb python
    (gdb) source pygdb
    (gdb) bt_pid <PID>
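The interactive session above could also be driven non-interactively with gdb's -batch and -ex options. This is only a sketch under assumptions: the helper names build_gdb_command and dump_stacktrace are hypothetical, and it presumes the pygdb script sits in the current directory and that the caller has the same privileges as the sudo session.

```python
import subprocess


def build_gdb_command(pid, script="pygdb", program="python"):
    """Build the gdb invocation that sources the script and runs bt_pid."""
    return ["gdb", program, "-batch",
            "-ex", "source %s" % script,
            "-ex", "bt_pid %d" % int(pid)]


def dump_stacktrace(pid, script="pygdb"):
    """Run gdb in batch mode and return its combined output as bytes.

    Attaching to another user's process needs root, as in the sudo session.
    """
    return subprocess.check_output(build_gdb_command(pid, script),
                                   stderr=subprocess.STDOUT)
```

For example, build_gdb_command(23820) gives the command line for the infobase process from the leak investigation.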
     
  • gio 11:00 pm on February 4, 2015

    OL: How to generate the sitemaps 

    First you need the latest ol_dump file.
    The sitemaps are generated on ol-home, using the dump file ol_dump.txt.gz as the source. To learn how to generate it, see the post “OL: how to generate the dump files”.

    :: Generate sitemaps on ol-home

    anand@ol-home:/1/var/tmp/sitemaps$ python /opt/openlibrary/openlibrary/scripts/2009/01/sitemaps/sitemap.py ../dumps/ol_dump_2015-01-31/ol_dump_2015-01-31.txt.gz
    ....
    Wed Feb  4 12:48:03 2015 writing sitemaps/sitemap_works_1700.xml.gz 39979
    Wed Feb  4 12:48:03 2015 writing sitemaps/sitemap_works_1701.xml.gz 39963
    Wed Feb  4 12:48:04 2015 writing sitemaps/sitemap_works_1702.xml.gz 39975
    Wed Feb  4 12:48:04 2015 writing sitemaps/sitemap_works_1703.xml.gz 39947
    Wed Feb  4 12:48:05 2015 writing sitemaps/sitemap_works_1704.xml.gz 39347
    Wed Feb  4 12:48:05 2015 writing sitemaps/sitemap_works_1705.xml.gz 39691
    Wed Feb  4 12:48:05 2015 writing sitemaps/sitemap_works_1706.xml.gz 38423
    Wed Feb  4 12:48:06 2015 writing sitemaps/sitemap_works_1707.xml.gz 30851
    Wed Feb  4 12:48:06 2015 writing sitemaps/sitemap_works_1708.xml.gz 3447
    Wed Feb  4 12:48:06 2015 writing sitemaps/siteindex.xml.gz 19891
    Wed Feb  4 12:48:06 2015 done
    anand@ol-home:~$
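To illustrate the kind of output listed in the log above, here is a rough sketch of writing a list of URLs as numbered, gzipped sitemap files. It is not the code of sitemap.py: the function names, the chunk size and the file naming pattern are assumptions. Only the XML namespace comes from the sitemaps.org protocol.

```python
import gzip
import os
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sitemap_xml(urls):
    """Render one sitemap document for the given URLs."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="%s">' % SITEMAP_NS]
    lines += ["  <url><loc>%s</loc></url>" % escape(url) for url in urls]
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def write_sitemaps(urls, directory, prefix="sitemap_works", size=10000):
    """Write gzipped sitemap files and return their paths in order."""
    paths = []
    for index, chunk in enumerate(chunks(urls, size)):
        path = os.path.join(directory, "%s_%04d.xml.gz" % (prefix, index))
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write(sitemap_xml(chunk))
        paths.append(path)
    return paths
```

The sitemap protocol allows at most 50,000 URLs per file, so any chunk size up to that limit would be valid.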

    :: Copy sitemaps to ol-www1

    anand@ol-www1:~$ sudo mkdir -p /1/var/lib/openlibrary/sitemaps
    anand@ol-www1:~$ sudo rsync -av rsync://ol-home/var_1/tmp/sitemaps/sitemaps /1/var/lib/openlibrary/sitemaps/
    ...
    sitemaps/sitemap_works_1707.xml.gz
    sitemaps/sitemap_works_1708.xml.gz
     
    sent 94,519 bytes  received 329,106,541 bytes  73,155,791.11 bytes/sec
    total size is 328,678,782  speedup is 1.00
    anand@ol-www1:~$

    :: Verify sitemaps are available

    anand@ol-www1:~$ curl -I https://openlibrary.org/static/sitemaps/siteindex.xml.gz
    HTTP/1.1 200 OK
    Server: nginx/1.1.19
    Date: Wed, 04 Feb 2015 16:48:50 GMT
    Content-Type: text/plain
    Content-Length: 14689
    Last-Modified: Wed, 04 Feb 2015 12:48:06 GMT
    Connection: keep-alive
    Accept-Ranges: bytes
    anand@ol-www1:~$
     
  • gio 8:21 pm on February 2, 2015

    OL: Sitemap generation 

    To generate the sitemap, run this command on ol-home:

    python sitemaps.py ol_dump_works_latest.txt.gz

    The sitemaps.py script is located at:

    /1/var/lib/openlibrary/deploy/openlibrary/openlibrary/data/sitemap.py

    The latest dump is available at: http://openlibrary.org/data/ol_dump_works_latest.txt.gz
    For more details, see https://openlibrary.org/developers/dumps.

    After the sitemap is generated, you need to place it in /1/var/lib/openlibrary/sitemaps,
    as defined in /olsystem/etc/nginx/sites-available/openlibrary.conf on ol-www1.

    
        location ~ ^/static/(docs|tour|sitemaps|jsondumps|images/shelfview|sampledump.txt.gz)(/.*)?$ {
            root /1/var/lib/openlibrary/sitemaps;
            autoindex on;
            rewrite ^/static/(.*)$ /$1 break;
        }
    

    Note: use /1/var/tmp on ol-home as the working directory; you can then rsync the result from rsync://ol-home/var_1/tmp/

     
  • gio 7:44 pm on February 2, 2015

    OL tracking app response time on nginx 

    To track the response time let’s edit

    /etc/nginx/nginx.conf

    adding the $request_time to the log_format line:

    
        log_format iacombined '$seed$remote_addr $host $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_time';
    

    Note: $request_time includes the time to complete sending the data to the client.

    To show the requests longer than 3 seconds:

    sudo tail -F /var/log/nginx/access.log -s 0.1  | perl -ne '$|=1; $_ =~ / ([^ ]+$)/; if ($1 > 3.0) {print;}'

    To show the requests longer than 3 seconds, hiding the 400 errors:

    sudo tail -F /var/log/nginx/access.log -s 0.1  | grep -v '"-" 40. 0' --line-buffered | perl -ne '$|=1; $_ =~ / ([^ ]+$)/; if ($1 > 3.0) {print;}'
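The same filter can be sketched in Python for use inside a script. This is an illustrative equivalent of the one-liners above, not something that runs on the servers: the names request_time, slow_requests and EMPTY_40X are hypothetical, and it assumes $request_time is the last space-separated field of each line, as in the log_format shown.

```python
import re

# Same idea as the grep -v pattern above: an empty request answered with a 40x and 0 bytes.
EMPTY_40X = re.compile(r'"-" 40\d 0')


def request_time(line):
    """Return the last space-separated field as a float, or None if it is not a number."""
    try:
        return float(line.rsplit(" ", 1)[-1])
    except ValueError:
        return None


def slow_requests(lines, threshold=3.0, hide=None):
    """Yield the log lines whose request time exceeds `threshold` seconds.

    `hide` is an optional compiled pattern; matching lines are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if hide is not None and hide.search(line):
            continue
        seconds = request_time(line)
        if seconds is not None and seconds > threshold:
            yield line
```

Fed from sys.stdin behind tail -F, slow_requests(sys.stdin, hide=EMPTY_40X) would print the same lines as the second command.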
     