
Using Fail2Ban to protect a Varnished site from scrapers

I'm using Varnish to cache a fairly busy property site.  Varnish works like a bomb for normal users and has greatly improved our page load speed.

For bots that are scraping the site, presumably to add the property listings to their own sites, the cache is next to useless since they sequentially trawl through the whole site.

I decided to use fail2ban to block IPs that hit the site too often.

The first step was to enable a disk-based access log for Varnish so that fail2ban would have something to work with.

This means setting up varnishncsa.  Add this to your /etc/rc.local file:

 varnishncsa -a -w /var/log/varnish/access.log -D -P /var/run/varnishncsa.pid  

This starts varnishncsa in daemon mode and appends each Varnish request to /var/log/varnish/access.log.
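Once it is running you can confirm that requests are being written by tailing the log (this just watches the NCSA-format file that varnishncsa produces):

 tail -f /var/log/varnish/access.log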

Now edit or create /etc/logrotate.d/varnish and make an entry to rotate this access log:

 /var/log/varnish/*log {
     create 640 http log
     compress
     postrotate
         /bin/kill -USR1 `cat /var/run/varnishncsa.pid 2>/dev/null` 2> /dev/null || true
     endscript
 }
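If you want to check the rotation config without waiting for the next rotation cycle, logrotate has a debug mode that does a dry run and simply prints what it would do:

 logrotate -d /etc/logrotate.d/varnish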

Install fail2ban.  On Ubuntu you can simply run apt-get install fail2ban.

Edit /etc/fail2ban/jail.conf and add a block like this:

 [http-get-dos]  
 enabled = true  
 port = http,https  
 filter = http-get-dos  
 logpath = /var/log/varnish/access.log  
 maxretry = 300  
 findtime = 300  
 # ban for 10 minutes  
 bantime = 600  
 action = iptables[name=HTTP, port=http, protocol=tcp]  

This means that if an IP makes 300 (maxretry) requests within 300 (findtime) seconds then a ban of 600 (bantime) seconds is applied.
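If you lock yourself out while testing (easy to do if you temporarily lower maxretry), you can lift a ban by hand with fail2ban-client; the IP below is just a placeholder:

 fail2ban-client set http-get-dos unbanip 203.0.113.10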

We need to create the filter in /etc/fail2ban/filter.d/http-get-dos.conf, which defines the pattern that the jail matches against:

 # Fail2Ban configuration file  
 #  
 # Author: http://www.go2linux.org  
 #  
 [Definition]  
 # Option: failregex  
 # Note: This regex will match any GET entry in your logs, so both valid and invalid requests are a match.  
 # You should therefore set maxretry and findtime carefully in jail.conf to avoid false positives.  
 failregex = ^<HOST>.*"GET  
 # Option: ignoreregex  
 # Notes.: regex to ignore. If this regex matches, the line is ignored.  
 # Values: TEXT  
 #  
 ignoreregex =  

Now let's test the regex against the log file so that we can see whether it is correctly picking up the IP addresses of visitors:

fail2ban-regex /var/log/varnish/access.log /etc/fail2ban/filter.d/http-get-dos.conf 

You should see a list of IP addresses and times followed by summary statistics.

When you restart fail2ban your scraper protection should be up and running.
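On Ubuntu the restart is typically just the service command (adjust for your init system), and you can then confirm the jail is active and see any currently banned IPs with fail2ban-client:

 service fail2ban restart
 fail2ban-client status http-get-dos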
