OL: monitoring Nginx response time and response status codes with python and munin

To let Nginx to log properly the response time please see the blog article tracking app response time on nginx.

We expect to have the /var/log/nginx/access.log formatted as:

'$seed$remote_addr $host $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_time'

The log file lines look like:

0.165.58.183 openlibrary.org - [10/Feb/2015:17:34:35 +0000] "GET /search?q=The+Law&subject_facet=Law&person_facet=R.+A.+Ramsay HTTP/1.1" 200 7159 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.660
0.165.58.183 openlibrary.org - [10/Feb/2015:17:34:35 +0000] "GET /search?q=RICO&author_key=OL2687659A&subject_facet=In+library HTTP/1.1" 200 6895 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.684
0.143.71.149 openlibrary.org - [10/Feb/2015:17:34:35 +0000] "GET /search?sort=old&subject_facet=Protected+DAISY&language=mul&publisher_facet=Nelson+Doubleday&publisher_facet=Macmillan+and+co.%2C+limited&publisher_facet=Franklin+Library HTTP/1.1" 200 5860 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 0.307
0.103.218.37 openlibrary.org - [10/Feb/2015:17:34:35 +0000] "GET /works/OL118971W HTTP/1.0" 301 0 "https://openlibrary.org/" "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)" 0.057

Using a python script and a munin plugin we can plot this two graphs:

  • Nginx response time: plotting the requests/second rate splitting up in response time ranges.
    nginx_req_time-day
  • Nginx response code: plotting the requests?second rate for every response code class.
    nginx_req_status_code-day

This script /usr/local/nglog/nglog.py runs in background reading the access.log line by line, sampling the request/second rate for different response time ranges and for the different response code classes. The values are saved in the files /tmp/nginx_request_status_stats and /tmp/nginx_request_time_stats.

giovanni@ol-www1:/tmp$ cat /tmp/nginx_request_time_stats 
T1.value 75.100000 
T2.value 54.555556 
T3.value 16.938889
T4.value 1.711111 
T5.value 2.933333
giovanni@ol-www1:/tmp$
giovanni@ol-www1:/tmp$ cat /tmp/nginx_request_status_stats 
s2xx.value 123.600000 
s3xx.value 19.755556 
s4xx.value 7.883333 
s5xx.value 0.000000
giovanni@ol-www1:/tmp$

When a status code 500 is found the script logs the request in the file /var/log/nginx/nglog.log
The script is launched with the bash script /usr/local/nglog/nglogd.sh and it executes:

tail -F /var/log/nginx/access.log| python /usr/local/nglog/nglog.py &

To plot the results we use two ad-hoc munin plugins:
/usr/share/munin/plugins/nginx_req_time

#!/bin/sh
 
case $1 in
   config)
        cat <<'EOM'
graph_title NGINX response time
graph_vlabel num_req /sec
 
T1.label <= 0.5s
T2.label >0.5 and <=1
T3.label >1 and <=5
T4.label >5 and <=10
T5.label > 10
EOM
        exit 0;;
esac

/usr/share/munin/plugins/nginx_req_status_code

#!/bin/sh
 
case $1 in
   config)
        cat <<'EOM'
graph_title NGINX response status code
graph_vlabel req/sec
 
s2xx.label 2xx
s3xx.label 3xx
s4xx.label 4xx
s5xx.label 5xx
EOM
        exit 0;;
esac