OL: recent activities summary
Primary results:
- Open Library has a more reliable and stable infrastructure.
- Monitoring, system management and diagnosis are now easier.
- The GitHub community can come back to fix bugs and develop features.
Special thanks to Raj, Sam, Anand and Andy for helping me in this process.
== Cluster Reliability ==
:- Tomcat configuration updated to better fit the OL infrastructure.
:- Solr configuration updated to match the other Archive Solr instances.
:- Added HAProxy on ol-solr2 to help Tomcat handle all the connections properly.
:- Rebooted ol-solr2 on SSD.
== Diagnostics ==
:- Installed and configured Munin to produce graphical reports.
:- Wrote a daemon and a Munin plugin that track response time and HTTP status code rates.
:- Designed, coded, configured and installed a "one-page" dashboard that lets us monitor the cluster status and links to the main OL resources: wiki, Nagios, admin-center, lending stats, GitHub, etc.
http://ol-home.us.archive.org:8088/dashboard/
:- Updated some minor Nagios triggers, making the alarms more useful and effective.
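As an illustration of the Munin plugin item above, here is a minimal sketch of how status code rates and average response time could be computed and printed in Munin's plugin output format (`field.label` lines for `config`, `field.value` lines otherwise). This is not the plugin that was deployed: the function names, the field names (`http2xx` etc.), the graph category and the sample format are all assumptions made for the example.

```python
from collections import Counter

# Status classes reported on the graph (assumed grouping, not from the real plugin).
CLASSES = ("2xx", "3xx", "4xx", "5xx")


def summarize(samples):
    """Summarize (status_code, response_seconds) pairs.

    Returns (rates, avg_ms): rates maps each status class to its
    percentage of all requests, avg_ms is the mean response time in ms.
    """
    samples = list(samples)
    total = len(samples)
    counts = Counter(f"{code // 100}xx" for code, _ in samples)
    rates = {c: (100.0 * counts[c] / total if total else 0.0) for c in CLASSES}
    avg_ms = 1000.0 * sum(t for _, t in samples) / total if total else 0.0
    return rates, avg_ms


def munin_config():
    """Lines a Munin plugin prints when called with the 'config' argument."""
    lines = [
        "graph_title HTTP status code rates",
        "graph_vlabel % of requests",
        "graph_category openlibrary",  # hypothetical category name
    ]
    # Munin field names must not start with a digit, hence the 'http' prefix.
    lines += [f"http{c}.label {c}" for c in CLASSES]
    return lines


def munin_values(samples):
    """Lines a Munin plugin prints when called without arguments."""
    rates, _ = summarize(samples)
    return [f"http{c}.value {rates[c]:.2f}" for c in CLASSES]
```

In a setup like the one described, the daemon would collect the samples and the plugin would only format them. The average response time returned by `summarize` would normally go on a separate graph, since its unit differs from the percentages.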
== Bug Fixes ==
:- Found and debugged a memory/thread leak on ol-home; the problem seems related to the original infogami/webpy code. I informed Anand, and I hope he will have time to answer and tell us how to solve this issue ASAP.
:- Found an outage issue related to ARP and the load balancer DNS. Sam and Andy are working on it.
:- Found and fixed an important issue in the deployment process.
:- Fixed some management scripts that were not working properly.
:- Fixed the backup scripts that were not working properly.
:- Increasing the underestimated disk space for backups on ol-home.
Andy and I will finish this week, which should finally solve the annoying monthly DISK-FULL problem.
:- Fixed the sitemap generation process. We now have an updated sitemap that works correctly. This solves some problems we had with Googlebot.
:- Learned how to recover from an ACS4-related OL outage.
:- Solved a minor log-rotation disk-full problem on ol-solr2.
:- Fixed the issues with the Vagrant development instance.
== Security Upgrade ==
:- Removed SSLv3 protocol support from nginx, mitigating the POODLE vulnerability.
https://www.us-cert.gov/ncas/alerts/TA14-290A
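To keep SSLv3 from being re-enabled by accident, a configuration check along the following lines could be run against the nginx config files. It is a minimal sketch, not something that was part of this work: the function name is hypothetical, and it only inspects explicit `ssl_protocols` directives (it does not follow `include` files or account for nginx's built-in defaults).

```python
import re


def sslv3_enabled(nginx_conf_text):
    """Return True if any explicit ssl_protocols directive lists SSLv3."""
    for line in nginx_conf_text.splitlines():
        line = line.split("#", 1)[0]  # ignore trailing comments
        match = re.match(r"\s*ssl_protocols\s+([^;]+);", line)
        if match and "SSLv3" in match.group(1).split():
            return True
    return False
```

For example, `sslv3_enabled("ssl_protocols TLSv1 TLSv1.1 TLSv1.2;")` returns `False`, while a directive that still lists `SSLv3` returns `True`.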
== Documentation ==
:- Updated the wiki page with all the NEW documentation we wrote during these activities.
https://wiki.archive.org/twiki/bin/view/OpenLibrary/WebHome
:- Updated the wiki page with some older documentation that has not been deprecated yet.
== GitHub and Developers ==
:- Merged some old pull requests from the community.
:- General cleaning.