29 April 2015

Monitoring Varnish for random crashes

I'm using Varnish to cache the frontend of a site that a client is busy promoting.  It does a great job of reducing requests to my backend but is prone to random crashes.  I normally get about two weeks of uptime on this particular server, which is significantly less than on other servers where I've deployed Varnish.

I don't have enough information to work out why the crashes occur.  The system log shows that a child process stops responding to the CLI and is killed.  The replacement child never seems to come up properly either.

My /var/log/messages file looks like this:

 08:31:45 varnishd[7888]: Child (16669) not responding to CLI, killing it.  
 08:31:45 varnishd[7888]: Child (16669) died signal=3  
 08:31:45 varnishd[7888]: child (25675) Started  
 08:31:45 varnishd[7888]: Child (25675) said Child starts  
 08:31:45 varnishd[7888]: Child (25675) said SMF.s0 mmap'ed 1073741824 bytes of 1073741824  
 08:32:19 varnishd[7888]: Child (25675) not responding to CLI, killing it.  
 08:32:21 varnishd[7888]: Child (25675) not responding to CLI, killing it.  
 08:32:21 varnishd[7888]: Child (25675) died signal=3  

This doesn't give me a lot to work with.  I couldn't find anything in the documentation about this sort of problem.  I don't want to uninstall Varnish, so I decided to look for a way to monitor the process instead.

I first tried Monit, but after about two weeks my site was down.  After sshing onto the box and restarting Varnish I checked the Monit logs.  Although Monit recognized that Varnish had crashed, it was not able to bring it back up.

My Monit log looked like this:

 [BST Apr 23 09:07:24] error  : 'varnish' process is not running  
 [BST Apr 23 09:07:24] info   : 'varnish' trying to restart  
 [BST Apr 23 09:07:24] info   : 'varnish' start: /etc/init.d/varnish  
 [BST Apr 23 09:07:54] error  : 'varnish' failed to start  
 [BST Apr 23 09:08:54] error  : 'varnish' process is not running  
 [BST Apr 23 09:08:54] info   : 'varnish' trying to restart  
 [BST Apr 23 09:08:54] info   : 'varnish' start: /etc/init.d/varnish  
 [BST Apr 23 09:09:24] error  : 'varnish' failed to start  
 [BST Apr 23 09:10:24] error  : 'varnish' process is not running  
 [BST Apr 23 09:10:24] info   : 'varnish' trying to restart  
 [BST Apr 23 09:10:24] info   : 'varnish' start: /etc/init.d/varnish  
 [BST Apr 23 09:10:54] error  : 'varnish' failed to start  
 [BST Apr 23 09:11:54] error  : 'varnish' service restarted 3 times within 3 cycles(s) - unmonitor  

My problem sounded a lot like this one on ServerFault so I looked for another way to monitor the process other than using Monit.

Instead of using daemonize, supervisord, or another similar program I'm trying out a simple shell script that I found at http://blog.unixy.net/2010/05/dirty-varnish-monitoring-script/.  The author says it's dirty, and I suppose it is, but it has the advantage of being dead simple and easy to control.   I've set it up as a cron job to run every five minutes.  Hopefully this will be a more effective way to make sure that Varnish doesn't stay dead for very long.
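
For illustration, here is a minimal sketch of the same idea. It is not the script from the linked post. It assumes Varnish answers HTTP on 127.0.0.1:80 and that /etc/init.d/varnish accepts a restart argument (the Monit log above only shows the init script path). The function name check_and_restart and the --run flag are made up for this example.

```shell
#!/bin/sh
# Hypothetical watchdog sketch, not the script from the linked post.
# Assumptions: Varnish answers HTTP on VARNISH_URL, and the init script
# accepts "restart". Adjust both for your own box.
VARNISH_URL="${VARNISH_URL:-http://127.0.0.1:80/}"
PROBE_CMD="${PROBE_CMD:-curl -s -o /dev/null --max-time 10 $VARNISH_URL}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/varnish restart}"

# Run the probe; if it fails (connection refused or timed out), restart.
check_and_restart() {
    probe=$1
    restart=$2
    if $probe >/dev/null 2>&1; then
        return 0
    fi
    echo "$(date) varnish probe failed, restarting" >&2
    $restart
}

# Only act when called with --run, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    check_and_restart "$PROBE_CMD" "$RESTART_CMD"
fi
```

Saved as, say, /usr/local/bin/varnish-watchdog.sh (a hypothetical path), a crontab entry of `*/5 * * * * /usr/local/bin/varnish-watchdog.sh --run` would run it every five minutes. Probing over HTTP rather than checking for a pid means a hung child is caught as well as a dead one.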

In case the original post goes offline I saved a copy of the script as a Gist:


21 April 2015

Fixing when queue workers keep popping the same job off a queue

My (Laravel) project uses a queue system for importing because these jobs can take a fair amount of time (up to an hour) and I want to have them run asynchronously to prevent my users from having to sit and watch a spinning ball.

I created a cron job which runs my Laravel queue work command every 5 minutes.  PHP is not really the best language for long-running processes, which is why I chose to run a task periodically rather than listen all the time.  This introduces some latency (which I could cut down) but it is acceptable in my use case (imports happen once a month and are only run manually if an automated import fails).

The problem that I faced was that my queue listener kept popping the same job off the queue.  I didn't try running multiple listeners but I'm pretty confident weirdness would have resulted in that case as well.

Fixing the problem turned out to be a matter of configuring the visibility timeout of my queue.  I was using the SQS default of 30 seconds.  Amazon defines the visibility timeout as:
"The period of time that a message is invisible to the rest of your application after an application component gets it from the queue. During the visibility timeout, the component that received the message usually processes it, and then deletes it from the queue. This prevents multiple components from processing the same message."  Source: Amazon documentation
This concept is common to queues and goes by various names.  In Beanstalk the setting is called time-to-run and IronMQ refers to it as timeout.  If a job runs for longer than the visibility timeout, the message becomes visible again while the job is still in progress, and a worker will pop a job that is already being run in another process.  My imports run for up to an hour, far longer than the 30 second default, so each cron run picked up the job that was still running.
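
As a rough sketch of picking and applying a longer timeout: the helper below adds a safety margin to the longest expected job time and caps the result at 43200 seconds (12 hours), the largest visibility timeout SQS accepts. The function names and the 300 second default margin are my own choices for this example. The second function wraps the AWS CLI's set-queue-attributes command and assumes the CLI is installed and configured.

```shell
#!/bin/sh
# SQS allows a visibility timeout of at most 43200 seconds (12 hours).
SQS_MAX_TIMEOUT=43200

# Longest expected job time plus a safety margin, capped at the SQS maximum.
# Both arguments are in seconds; the 300 second default margin is arbitrary.
visibility_timeout() {
    job_seconds=$1
    margin_seconds=${2:-300}
    total=$((job_seconds + margin_seconds))
    if [ "$total" -gt "$SQS_MAX_TIMEOUT" ]; then
        total=$SQS_MAX_TIMEOUT
    fi
    echo "$total"
}

# Apply the timeout to a queue. The queue URL is whatever your queue uses.
set_queue_timeout() {
    queue_url=$1
    timeout=$2
    aws sqs set-queue-attributes \
        --queue-url "$queue_url" \
        --attributes "VisibilityTimeout=$timeout"
}
```

For hour-long imports, `set_queue_timeout "$QUEUE_URL" "$(visibility_timeout 3600)"` would set the timeout to 3900 seconds, where QUEUE_URL is a placeholder for your own queue's URL.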

17 April 2015

Setting up the admin server in HHVM 3.6.0 and Nginx

HHVM has a built-in admin server that has a lot of useful functions.  I found out about it in an old post on the HHVM blog.

Since then HHVM has moved towards using an ini file instead of a config.hdf file.

On a standard prebuilt HHVM on Ubuntu you should find the ini file in /etc/hhvm/php.ini

Facebook maintains a list of all the ini settings on Github.

It is into this file that we add two lines to enable the admin server:

 hhvm.admin_server.password = SecretPassword  
 hhvm.admin_server.port = 8888  

I then added a server to Nginx by creating this file: /etc/nginx/conf.d/admin.conf (by default Nginx includes all conf files in that directory):

 server {  
   # hhvm admin  
   listen 8889;  
   location ~ {  
     fastcgi_pass  127.0.0.1:8888;  
     include    fastcgi_params;  
   }  
 }  

Now I can run curl 'http://localhost:8889' from my shell on the box to get a list of commands.  Because I host this project with Amazon and have not added a security rule that opens the port, the admin server is not reachable from the outside world.  If you host elsewhere, check the firewall rules on your own server.

To run a command, add the password as a GET parameter:

 curl 'http://localhost:8889/check-health?auth=SecretPassword'  
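
If you call the admin server often, a small wrapper saves retyping the URL and keeps the password out of the command itself. This is only a sketch: the function names and the HHVM_ADMIN_BASE and HHVM_ADMIN_PASSWORD environment variables are invented for this example, and the password is placed in the URL as-is, so it would need URL-encoding if it contained special characters.

```shell
#!/bin/sh
# Assumed default: the Nginx port configured above.
HHVM_ADMIN_BASE="${HHVM_ADMIN_BASE:-http://localhost:8889}"

# Build the admin URL for a command such as check-health.
# The password is read from the HHVM_ADMIN_PASSWORD environment variable.
hhvm_admin_url() {
    echo "$HHVM_ADMIN_BASE/$1?auth=$HHVM_ADMIN_PASSWORD"
}

# Call the admin server and print its response.
hhvm_admin() {
    curl -s "$(hhvm_admin_url "$1")"
}
```

With HHVM_ADMIN_PASSWORD exported, `hhvm_admin check-health` makes the same request as the curl command above.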

Searching in a radius with Postgres

Postgres has two very useful extensions for this: earthdistance and PostGIS.  PostGIS is much more accurate but I found earthdistance to be very easy to use and accurate enough for my purpose (finding UK postcodes within a radius of a point).

To install it, first find your Postgres version and then install the matching contrib package.  On my Debian Mint dev box it looks like the snippet below.  My production database is on Amazon RDS, and you can skip this step in that environment.

 psql -V  
 sudo apt-get install postgresql-contrib postgresql-contrib-9.3  
 sudo service postgresql restart  

Having done that you should launch psql and run these two commands.  Make sure that you install cube first because it is a requirement of earthdistance.

 CREATE EXTENSION cube; CREATE EXTENSION earthdistance;  

Now that the extensions are installed you have access to all of the functions they provide.

If you want to check that they're working you can run SELECT earth(); as a quick way to test a function. It should return the earth's radius in meters (6378168).

Earthdistance treats the earth as a perfect sphere, whereas PostGIS models its shape more accurately. Since I'm working in a fairly local region (less than 20 statute miles) I felt that approximating the surface as a sphere was sufficiently accurate.

I had a table of postcodes and an example query looked like this:

 SELECT * FROM postcodes  
 WHERE earth_box(ll_to_earth(51.48782225571700000, -0.20728725565285000), 100) @> ll_to_earth(lat, lng);  

This took quite a long time (*cough* a minute *cough*) to run through the 1.7 million postcodes in my table. After a little searching I realized that I should index the calculation. With the index below in place, queries now take about 100ms, which is plenty fast enough on my dev box.

 For reference - using the Haversine formula in a query was taking around 30 seconds to process - even after constraining my query to a box before calculating radii.

 CREATE INDEX postcode_gis_index on postcodes USING gist(ll_to_earth(lat, lng));  

So now I can search for UK postcodes that are close to a point and I can do so sufficiently quickly that my user won't get bored of watching a loading spinner gif.
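
One refinement worth noting: earth_box returns a bounding box, so a few points near its corners can be slightly further away than the radius. The earthdistance documentation suggests adding a second check with earth_distance. Below is a sketch of a shell helper that prints such a query for a given point and radius in meters. The function name radius_query is made up for this example, and the arguments are pasted into the SQL unescaped, so it is only suitable for trusted numeric input.

```shell
#!/bin/sh
# Print a radius search against the postcodes table.
# Arguments: latitude, longitude, radius in meters (trusted numbers only).
radius_query() {
    lat=$1
    lng=$2
    radius_m=$3
    cat <<SQL
SELECT *
FROM postcodes
WHERE earth_box(ll_to_earth($lat, $lng), $radius_m) @> ll_to_earth(lat, lng)
  AND earth_distance(ll_to_earth($lat, $lng), ll_to_earth(lat, lng)) <= $radius_m;
SQL
}
```

Running `radius_query 51.4878 -0.2073 100 | psql mydb` would execute it, with mydb standing in for your database name. The earth_box condition is the one that can use the GiST index; earth_distance then only has to be computed for the rows inside the box.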