21 April 2015

Fixing when queue workers keep popping the same job off a queue

My (Laravel) project uses a queue system for importing because these jobs can take a fair amount of time (up to an hour) and I want to have them run asynchronously to prevent my users from having to sit and watch a spinning ball.

I created a cron job that runs my Laravel queue work command every 5 minutes.  PHP is not really the best language for long-running processes, which is why I elected to run a task periodically rather than listening all the time.  This introduces some latency (which I could cut down), but that is acceptable in my use case (imports happen once a month and are only run manually if an automated import fails).
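
The crontab entry is along these lines - the project path here is only an illustration and you'll want to adjust it for your own setup:

 # run the next job on the queue every 5 minutes (path is illustrative)  
 */5 * * * * php /var/www/myproject/artisan queue:work  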

The problem that I faced was that my queue listener kept popping the same job off the queue.  I didn't try running multiple listeners but I'm pretty confident weirdness would have resulted in that case as well.

Fixing the problem turned out to be a matter of configuring the visibility time of my queue.  I was using the SQS provided default of 30 seconds.  Amazon defines the visibility timeout as:
The period of time that a message is invisible to the rest of your application after an application component gets it from the queue. During the visibility timeout, the component that received the message usually processes it, and then deletes it from the queue. This prevents multiple components from processing the same message.  Source: Amazon documentation
This concept is common to queues and goes by various names: Beanstalkd calls the setting time-to-run, and IronMQ refers to it as timeout.  If a job's run time exceeds the queue's visibility timeout then workers will pop off a job that is still being run in another process.
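
In my case that meant raising the queue's visibility timeout well above the longest expected import.  As a rough sketch using the AWS CLI (the queue URL is a placeholder - you can also change the setting in the SQS console):

 # allow up to an hour before an unfinished job becomes visible to other workers again  
 aws sqs set-queue-attributes \
   --queue-url https://sqs.eu-west-1.amazonaws.com/123456789012/import-queue \
   --attributes VisibilityTimeout=3600  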

17 April 2015

Setting up the admin server in HHVM 3.6.0 and Nginx

Hiphop has a built-in admin server that has a lot of useful functions.  I found out about it on an old post on the Hiphop blog.

Since then, Hiphop has moved to using an ini file instead of a config.hdf file.

On a standard prebuilt HHVM on Ubuntu you should find the ini file at /etc/hhvm/php.ini.

Facebook maintains a list of all the ini settings on Github.

It is into this file that we add two lines to enable the admin server:

 hhvm.admin_server.password = SecretPassword  
 hhvm.admin_server.port = 8888  

I then added a server to Nginx by creating this file: /etc/nginx/conf.d/admin.conf (by default Nginx includes all conf files in that directory):

 server {  
   # hhvm admin  
   listen 8889;  
   location ~ {  
     fastcgi_pass  127.0.0.1:8888;  
     include    fastcgi_params;  
   }  
 }  

Now I can run curl 'http://localhost:8889' from my shell on the box to get a list of commands. Because I host this project with Amazon and have not added a security group rule for this port, the admin server is not reachable from the outside world.  You may want to check the firewall rules on your own server.

To run a command, add the password as a GET variable:

 curl 'http://localhost:8889/check-health?auth=SecretPassword'  

Searching in a radius with Postgres

Postgres has two very useful extensions - earthdistance and postgis.  PostGIS is much more accurate but I found earthdistance to be very easy to use and accurate enough for my purpose (finding UK postcodes within a radius of a point).

To install it, first find your Postgres version and then install the appropriate package.  On my Debian-based Mint dev box that looks like the snippet below. My production machine is an Amazon RDS instance, and you can skip this step in that environment.

 psql -V  
 sudo apt-get install postgresql-contrib postgresql-contrib-9.3  
 sudo service postgresql restart  

Having done that you should launch psql and run these two commands.  Make sure that you install cube first because it is a requirement of earthdistance.

 CREATE EXTENSION cube;  
 CREATE EXTENSION earthdistance;  

Now that the extensions are installed you have access to all of the functions they provide.

If you want to check that they're working you can run SELECT earth(); as a quick way to test a function. It should return the earth's radius in metres.
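
A slightly more useful sanity check is to measure the distance between two points: earth_distance returns metres, and ll_to_earth takes latitude then longitude.  The coordinates below are just arbitrary example points in London:

 SELECT earth_distance(ll_to_earth(51.48782, -0.20729), ll_to_earth(51.50735, -0.12776));  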

Earthdistance treats the earth as a perfect sphere but PostGIS is more accurate. Since I'm working in a fairly local region (less than 20 statute miles) I felt that approximating the surface as a sphere was sufficiently accurate.

I had a table of postcodes and an example query looked like this:

 SELECT * FROM postcodes  
 WHERE earth_box(ll_to_earth(51.48782225571700000, -0.20728725565285000), 100) @> ll_to_earth(lat, lng);  

This took quite a long time (*cough* a minute *cough*) to run through the 1.7 million postcodes in my table. After a little searching I realized that I should index the calculation with the result that queries now take about 100ms, which is plenty fast enough on my dev box.

 For reference - using the Haversine formula in a query was taking around 30 seconds to process - even after constraining my query to a box before calculating radii.

 CREATE INDEX postcode_gis_index on postcodes USING gist(ll_to_earth(lat, lng));  
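
One caveat: earth_box is a bounding-box test, so it can let through a few points just outside the circle.  If that matters, the usual approach is to pair it with an earth_distance comparison - here sketched with a rounded version of the point above and a 1km radius (the radius is in metres, and ll_to_earth takes latitude then longitude):

 SELECT * FROM postcodes  
 WHERE earth_box(ll_to_earth(51.48782, -0.20729), 1000) @> ll_to_earth(lat, lng)  
 AND earth_distance(ll_to_earth(51.48782, -0.20729), ll_to_earth(lat, lng)) < 1000;  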

So now I can search for UK postcodes that are close to a point and I can do so sufficiently quickly that my user won't get bored of watching a loading spinner gif.

30 March 2015

Ignoring duplicate inserts with Postgres when processing a batch

I'm busy on a project which involves importing fairly large datasets of about 3.3 GB at a time.  I have to read a CSV file, process each line, and generate a number of database records from the results of that process.

Users are expected to be able to rerun batches and there is overlap between different datasets.  For example: the dataset of "last year" overlaps with the dataset of "all time".  This means that we need an elegant way to handle duplicate inserts.

Checking whether a record already exists (by primary key) before inserting is fine until the row count in the table gets significant.  At just over 2 million records it was taking my development machine 30 seconds to process 10,000 records, and this number steadily increased as the row count grew.


I had to find a better way to do this and happened across the option of using a database rule to ignore duplicates.  With the rule in place there is a marked improvement in performance, as I no longer need to search the database for an existing record before each insert.
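
A minimal sketch of such a rule, assuming a table called import_records with an integer primary key id (the names are hypothetical, not my actual schema):

 -- silently drop any insert whose primary key already exists  
 CREATE RULE ignore_duplicate_inserts AS  
   ON INSERT TO import_records  
   WHERE EXISTS (SELECT 1 FROM import_records WHERE id = NEW.id)  
   DO INSTEAD NOTHING;  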

17 March 2015

Adding info to Laravel logs

I am coding a queue worker that handles some pretty large (2 GB+) datasets, so I wanted some details in my logs that vanilla Laravel didn't offer.

Reading the documentation at http://laravel.com/docs/4.2/errors wasn't much help until I twigged that I could manipulate the log object returned by Log::getMonolog();.

Here is an example of adding memory usage to Laravel logs.

In app/start/global.php make the following changes:

 Log::useFiles(storage_path().'/logs/laravel.log');  
 $log = Log::getMonolog();  
 // append the script's current memory usage to every log record  
 $log->pushProcessor(new Monolog\Processor\MemoryUsageProcessor);  
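
Monolog ships with a number of other processors that can be pushed in exactly the same way - for example the IntrospectionProcessor, which adds the file, line and class that produced each record (I haven't needed it in this project, so treat this as an illustration):

 $log->pushProcessor(new Monolog\Processor\IntrospectionProcessor);  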

You'll find the Monolog documentation on the repo.

12 March 2015

Support for Postgres broken in HHVM 3.6.0

On my desktop machine I run my package upgrades every day.  The other day my Hiphop version got updated to 3.6.0 and suddenly my Postgres support died.

Running Hiphop gave a symbol not found error in the postgres.so file ( undefined symbol: _ZTIN4HPHP11PDOResourceE\n ) exactly like the issue reported on the driver repository (here).

I tried to recompile the postgres driver against Hiphop 3.6.0 but hit a number of problems, mostly, it seems, to do with hhvm-pgsql-master/pdo_pgsql_statement.cpp.

The fix for the incompatibility was, unfortunately, rolling back to my previous version of Hiphop.  To roll back on Mint/Ubuntu:

  1. Run cat /etc/*-release to get your release information
  2. Download the appropriate package for your distro from http://dl.hhvm.com/ubuntu/pool/main/h/hhvm/
  3. Remove your 3.6.0 installation of hhvm: sudo apt-get remove hhvm
  4. Install the package you downloaded: sudo dpkg -i <deb package>
After that everything should be installed properly and you can start up hhvm without a problem.
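
If, like me, you upgrade packages regularly, it may also be worth holding the hhvm package so that the next upgrade doesn't pull 3.6.0 straight back in:

 sudo apt-mark hold hhvm  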




22 February 2015

Fixing puppet "Exiting; no certificate found and waitforcert is disabled" error

While debugging and setting up Puppet I am still running the agent and master from the CLI in --no-daemonize mode.  I kept getting an error on my agent - "Exiting; no certificate found and waitforcert is disabled".

The fix was quite simple and a little embarrassing.  I had forgotten to run my puppet master with root privileges, which meant that it was unable to write incoming certificate requests to disk.  That's the embarrassing part; once I looked at my shell prompt and noticed the issue, fixing it was quite simple.

Firstly I got the puppet ssl path by running the command puppet agent --configprint ssldir

Then I removed that directory so that my agent no longer had any certificates or requests.

On my master side I cleaned the old certificate by running puppet cert clean --all (this would remove all my agent certificates but for now I have just the one so it's quicker than tagging it).

I started my agent up with the command puppet agent --test which regenerated the certificate and sent the request to my puppet master.  Because my puppet master was now running with root privileges (*cough*) it was able to write to its ssl directory and store the request.

I could then sign the request on my puppet master by running puppet cert sign --all
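
For reference, the whole dance looks roughly like this (the ssl directory will vary, so use whatever your configprint output shows):

 # on the agent  
 puppet agent --configprint ssldir   # prints something like /var/lib/puppet/ssl  
 rm -r /var/lib/puppet/ssl           # remove old certificates and requests  

 # on the master (running with root privileges)  
 puppet cert clean --all  

 # on the agent - regenerate the certificate and send the signing request  
 puppet agent --test  

 # on the master - sign the pending request  
 puppet cert sign --all  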

When running normally the puppet master will run as the puppet user so I'm not overly worried about running it as root in CLI while I debug it.