29 April 2015

Monitoring Varnish for random crashes

I'm using Varnish to cache the frontend of a site that a client is busy promoting.  It does a great job of reducing requests to my backend but is prone to random crashes.  I normally get about two weeks of uptime on this particular server, which is significantly lower than other places that I've deployed Varnish.

I just don't have enough information to work with to try and solve why the random crash is occurring.  The system log shows that a child process doesn't respond to CLI and so is killed.  The child never seems to be able to be brought up again.

My /var/log/messages file looks like this:

 08:31:45 varnishd[7888]: Child (16669) not responding to CLI, killing it.  
 08:31:45 varnishd[7888]: Child (16669) died signal=3  
 08:31:45 varnishd[7888]: child (25675) Started  
 08:31:45 varnishd[7888]: Child (25675) said Child starts  
 08:31:45 varnishd[7888]: Child (25675) said SMF.s0 mmap'ed 1073741824 bytes of 1073741824  
 08:32:19 varnishd[7888]: Child (25675) not responding to CLI, killing it.  
 08:32:21 varnishd[7888]: Child (25675) not responding to CLI, killing it.  
 08:32:21 varnishd[7888]: Child (25675) died signal=3  

Which doesn't give me a lot to work with.  I couldn't find anything in the documentation about this sort of problem.  I don't want to uninstall Varnish so I decided to rather look for a way to monitor the process.

I first tried Monit but after about two weeks my site was down.  After sshing onto the box and restarting Varnish I checked the monit logs.  Although it was able to recognize that Varnish had crashed, it was not able to successfully bring it back up.

My Monit log looked like this:

 [BST Apr 23 09:07:24] error  : 'varnish' process is not running  
 [BST Apr 23 09:07:24] info   : 'varnish' trying to restart  
 [BST Apr 23 09:07:24] info   : 'varnish' start: /etc/init.d/varnish  
 [BST Apr 23 09:07:54] error  : 'varnish' failed to start  
 [BST Apr 23 09:08:54] error  : 'varnish' process is not running  
 [BST Apr 23 09:08:54] info   : 'varnish' trying to restart  
 [BST Apr 23 09:08:54] info   : 'varnish' start: /etc/init.d/varnish  
 [BST Apr 23 09:09:24] error  : 'varnish' failed to start  
 [BST Apr 23 09:10:24] error  : 'varnish' process is not running  
 [BST Apr 23 09:10:24] info   : 'varnish' trying to restart  
 [BST Apr 23 09:10:24] info   : 'varnish' start: /etc/init.d/varnish  
 [BST Apr 23 09:10:54] error  : 'varnish' failed to start  
 [BST Apr 23 09:11:54] error  : 'varnish' service restarted 3 times within 3 cycles(s) - unmonitor  

My problem sounded a lot like this one on ServerFault so I looked for another way to monitor the process other than using Monit.

Instead of using daemonize, supervisord, or another similar program I'm trying out a simple shell script that I found at http://blog.unixy.net/2010/05/dirty-varnish-monitoring-script/.  The author says it's dirty, and I suppose it is, but it has the advantage of being dead simple and easy to control.   I've set it up as a cron job to run every five minutes.  Hopefully this will be a more effective way to make sure that Varnish doesn't stay dead for very long.

In case the source file goes down I saved a copy as a Gist:

Tip