OL: recent activities summary

Primary results:

  • Open Library has a more reliable and stable infrastructure.
  • Monitoring, system management and diagnosis are now easier.
  • The GitHub community can come back to fix bugs and develop.

Special thanks to Raj, Sam, Anand and Andy for helping me in this process.


== Cluster Reliability ==

:- Tomcat configuration updated to better fit the OL infrastructure.
:- Solr configuration updated to match the other Archive Solr instances.
:- Added an HAProxy instance on ol-solr2 to help Tomcat handle all the connections properly.
:- Rebooted ol-solr2 on SSD.


== Diagnostics ==

:- Installed and configured Munin to produce graphical reports.
:- Wrote a daemon and a Munin plugin that track response time and status-code response rates.
:- Designed, coded, configured and installed a “one-page” dashboard that lets us better monitor the cluster status and find the main resources related to OL: wiki, nagios, admin-center, lending stats, github, etc.
:- Made minor updates to the Nagios triggers, making the alarms more useful and effective.
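
As an illustration of the response-time and status-code tracking mentioned above, here is a minimal sketch of a Munin plugin in Python. It is not the plugin actually deployed on OL: the function names, the `s2xx`-style field names, the graph labels and the hard-coded sample data are all assumptions made for the example. It relies only on the standard Munin plugin convention: when called with the `config` argument the plugin prints the graph description, otherwise it prints one `field.value` line per series.

```python
import sys
from collections import Counter

# Status-code classes reported by this sketch (hypothetical choice).
CLASSES = ("2xx", "3xx", "4xx", "5xx")


def status_class_rates(samples):
    """Return the fraction of responses in each status class.

    `samples` is a list of (status_code, response_seconds) tuples, e.g. as
    collected by a probing daemon (the collection step is not shown here).
    """
    total = len(samples)
    counts = Counter(f"{code // 100}xx" for code, _ in samples)
    return {c: (counts.get(c, 0) / total if total else 0.0) for c in CLASSES}


def mean_response_time(samples):
    """Return the average response time in seconds (0.0 if no samples)."""
    if not samples:
        return 0.0
    return sum(seconds for _, seconds in samples) / len(samples)


def munin_config():
    """Lines printed when Munin calls the plugin with the 'config' argument."""
    lines = [
        "graph_title Status code rates",
        "graph_vlabel fraction of responses",
        "graph_category openlibrary",  # hypothetical category name
    ]
    # Munin field names must not start with a digit, hence the 's' prefix.
    lines += [f"s{c}.label {c}" for c in CLASSES]
    return lines


def munin_values(samples):
    """Lines printed on a normal run: one 'field.value' line per series."""
    rates = status_class_rates(samples)
    return [f"s{c}.value {rates[c]:.4f}" for c in CLASSES]


if __name__ == "__main__":
    # Hypothetical samples; a real plugin would read what the daemon recorded.
    demo = [(200, 0.12), (200, 0.30), (404, 0.05), (500, 1.40)]
    if sys.argv[1:] == ["config"]:
        print("\n".join(munin_config()))
    else:
        print("\n".join(munin_values(demo)))
```

Response time would normally go on a separate graph (a second plugin, or a multigraph plugin), since its unit differs from the rate series; `mean_response_time` is included only to show the computation.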


== Bug Fixing ==

:- Found and debugged a memory/thread leak on ol-home; the problem seems related to the original infogami/webpy code. I informed Anand, and I hope he will have time to answer and tell us how to solve this issue ASAP.
:- Found an outage issue related to ARP and the load balancer DNS. Sam and Andy are working on it.
:- Found and fixed an important issue in the deployment process.
:- Fixed some management scripts that were not working properly.
:- Fixed the backup scripts that were not working properly.
:- Increasing the underestimated disk space for backups on ol-home. Andy and I will finish this week, which should finally solve the annoying monthly DISK-FULL problem.
:- Fixed the sitemap generation process. We now have an up-to-date sitemap that works correctly. This solves some problems we had with Googlebot.
:- Learned how to recover from an ACS4-related OL outage.
:- Solved a minor log-rotation (disk-full) problem on ol-solr2.
:- Fixed the issues with the Vagrant development instance.
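
Disk-full conditions like the ones listed above can be flagged early by a small Nagios-style check. The sketch below is only an illustration, not one of the scripts used on OL: the function names, the 80%/90% thresholds and the checked path are assumptions. It follows the usual Nagios plugin convention of exit codes 0 (OK), 1 (WARNING) and 2 (CRITICAL).

```python
import shutil
import sys

# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_usage(used_fraction, warn=0.80, crit=0.90):
    """Map a used-space fraction (0.0 to 1.0) to a Nagios state.

    The default thresholds are hypothetical and should be tuned per host.
    """
    if used_fraction >= crit:
        return CRITICAL
    if used_fraction >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=0.80, crit=0.90):
    """Return (state, message) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    state = classify_usage(used_fraction, warn, crit)
    message = f"DISK {STATE_NAMES[state]} - {path} is {used_fraction:.0%} full"
    return state, message


if __name__ == "__main__":
    # Optional first argument: the path to check (defaults to "/").
    target = sys.argv[1] if len(sys.argv) > 1 else "/"
    state, message = check_disk(target)
    print(message)
    sys.exit(state)
```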


== Security Upgrade ==

:- Removed SSLv3 protocol support from nginx, mitigating the POODLE vulnerability.


== Documentation ==

:- Updated the wiki page with all the NEW documentation we wrote during these activities.
:- Updated the wiki page with some old documentation that is not yet deprecated.


== Github and Developers ==

:- Merged some old pull requests from the community.
:- General cleaning.