27 May 2015

Allowing the whole wide world to read your S3 bucket

This is a bucket policy that you can use to allow the whole world the ability to read files in your s3 bucket.

You might want to do this if you're serving static web content from S3 and don't need the fine grained control that the Amazon documentation details.

You will still need to set up permissions on the bucket but this policy will let people read the files you're storing on S3.

26 May 2015

Storing large values in Memcached with PHP

Memcached saved my users a minute per query
I'm working on a business intelligence tool that requires as an intermediary calculation a list of UK postcodes that are within a radius of a user supplied postcode.

It currently takes about 7 seconds to query my Postgres database to get this list out.  Unfortunately I need to do this several times as part of a goal seeking function so I need to greatly improve this lookup speed.

I'm already using the Postgres earthdistance module and have properly indexed my table so I realized that I needed to look for a caching solution.

Memcached places limits on the size of the value you can store.  The default setup is 1meg and I'm reluctant to change this because it adds to the deployment burden.  My result sets were sometimes up to 4 megs large - searching on a 20 mile radius in London yields a lot of postcodes!

My idea was to split the large piece of data into several smaller pieces and to place an index referencing the pieces as the value for the key we're trying to store.

I decided to make use of PHP's gzcompress() function to reduce the size of the element because I felt that the time I spend compressing the data is still going to be drastically less than running the query and I want to try my best to avoid cache evictions.

I'm currently using Laravel so the code snippets below use the facades made available by Laravel.  I think the code is readable enough to extend to other PHP environments and I think the approach could also be ported to other languages.


20 May 2015

Using Fail2Ban to protect a Varnished site from scrapers

I'm using Varnish to cache a fairly busy property site.  Varnish works like a bomb for normal users and has greatly improved our page load speed.

For bots that are scraping the site, presumably to add the property listings to their own site, though the cache is next to useless since the bots are sequentially trawling through the whole site.

I decided to use fail2ban to block IP's who hit the site too often.

The first step in doing so was to enable a disk based access log for Varnish so that fail2ban will have something to work with.

This means setting up varnishncsa.  Add this to your /etc/rc.local file:

 varnishncsa -a -w /var/log/varnish/access.log -D -P /var/run/varnishncsa.pid  

This starts up varnishncsa in daemon mode and appends Varnish access attempts to /var/log/varnish/access.log

Now edit or create /etc/logrotate.d/varnish and make an entry to rotate this access log:

  /var/log/varnish/*log {   
     create 640 http log   
       /bin/kill -USR1 `cat /var/run/varnishncsa.pid 2>/dev/null` 2> /dev/null || true   

Install fail2ban. On Ubuntu you can apt-get install fail2ban

Edit /etc/fail2ban/jail.conf and add a block like this:

 enabled = true  
 port = http,https  
 filter = http-get-dos  
 logpath = /var/log/varnish/access.log  
 maxretry = 300  
 findtime = 300  
 #ban for 5 minutes  
 bantime = 600  
 action = iptables[name=HTTP, port=http, protocol=tcp]  

This means that if a person has 300 (maxretry) requests in 300 (findtime) seconds then a ban of 600 (bantime) seconds is applied.

 We need to create the filter in /etc/fail2ban/filter.d/http-get-dos.conf to create the pattern to match the jail:

 # Fail2Ban configuration file  
 # Author: http://www.go2linux.org  
 # Option: failregex  
 # Note: This regex will match any GET entry in your logs, so basically all valid and not valid entries are a match.  
 # You should set up in the jail.conf file, the maxretry and findtime carefully in order to avoid false positives.  
 failregex = ^<HOST>.*"GET  
 # Option: ignoreregex  
 # Notes.: regex to ignore. If this regex matches, the line is ignored.  
 # Values: TEXT  
 ignoreregex =  

Now lets test the regex against the log file so that we can see if it is correctly picking up the IP addresses of the visitors:

fail2ban-regex /var/log/varnish/access.log /etc/fail2ban/filter.d/http-get-dos.conf 

You should see a list of IP addresses and times followed by summary statistics.

When you restart fail2ban your scraper protection should be up and running.

14 May 2015

Solving Doctrine - A new entity was found through the relationship

There are so many different problems that people have with the Doctrine error message:

 exception 'Doctrine\ORM\ORMInvalidArgumentException' with message 'A new entity was found through the relationship 'App\Lib\Domain\Datalayer\UnicodeLookups#lookupStatus' that was not configured to cascade persist operations for entity:  

Searching through the various online sources was a bit of a nightmare.  The best documentation I found was at http://www.krueckeberg.org/ where there were a number of clearly explained examples of various associations.

More useful information about association ownership was in the Doctrine manual, but I found a more succinct explanation in the answer to this question on StackOverflow.

Now I understood better about associations and ownership and was able to identify exactly what sort I was using and the syntax that was required. I was implementing a uni-directional many to one relationship, which is supposedly one of the most simple to map.

I had used the Doctrine reverse engineering tool to generate the stubs of my model.  I was expecting it to be able to handle this particular relationship - the tool warns that it does not properly map all relationships but this particular one actually works out of the box.

 {project root}/vendor/doctrine/orm/bin/doctrine orm:convert-mapping --from-database yml --namespace App\\Lib\\Domain\\Datalayer\\ .  
 {project root}/vendor/doctrine/orm/bin/doctrine orm:generate-entities --generate-methods=true --generate-annotations=true --regenerate-entities=true ../../../  
 {project root}/vendor/doctrine/orm/bin/doctrine orm:generate-proxies ../Proxies  

Just as an aside to explain the paths and namespaces : I'm using Laravel 5 and put my domain model into app/Lib/Domain.  I'm implementing a form of the repository design pattern so I have the following directory structure:


The mappings are not used at runtime but are used to generate the entities.

So my generated class looked like this:

 namespace App\Lib\Domain\Datalayer;  
  * Lookups  
  * @ORM\Table(name="lookups", indexes={@ORM\Index(name="IDX_4CEC819D037A087", columns={"lookup_status_id"}), @ORM\Index(name="IDX_4CEC8194D39DE23", columns={"camera_event_id"})})  
  * @ORM\Entity  
 class Lookups  
    * @var \App\Lib\Domain\Datalayer\LookupStatuses  
    * @ORM\ManyToOne(targetEntity="App\Lib\Domain\Datalayer\LookupStatuses", inversedBy="lookups", cascade={"persist"})  
    * @ORM\JoinColumns({  
    *  @ORM\JoinColumn(name="lookup_status_id", referencedColumnName="id")  
    * })  
   private $lookupStatus;  

My use case was:

  1. Find a lookup status from the table
  2. Call setLookupStatus on my lookup class
  3. Persist and flush the lookup class
  4. Error
After carefully reviewing all the documentation I linked above, and a great deal more, I realized that step 1 was the issue.  Because I was getting the object out of cache Doctrine thought it was new.  

Of course I couldn't persist the object as the help message suggested (it already existed so I got a key violation error).  The mappings were actually correct.

So my advice for resolving this error message is to read through the documentation I linked above carefully and then make sure that Doctrine is actually aware of the entity you're using.  You may need to persist it (if its new) or otherwise make sure its not cached.