2004-03-18

Created by tmoertel.

Blocking referer spam

One of the nice (or bad) things about many modern wiki and blog systems is that they will automatically generate a "referrer" link when somebody visits your site by way of another site. Referrer links point back to the sites from which your readers came, making it easy for you and your readers to explore those (presumably) related sites.

Like most things on the Internet, referrer links are subject to abuse by spammers. I noticed this when looking at the "people came here from" section at the bottom of the start page. That's where SnipSnap places its automatically generated referral links. I saw the expected referrals from Slashdot, Kuro5hin, and the like, but I also saw referrals from obviously bogus places like "optinpr.com" and "web-promotion.net", among others. Looking up these suspect domains on Google confirmed their bogosity.

Time for action.

Taking action

(First, I should note that although the dictionary spells it "referrer", in the HTTP world, it's "referer". You'll see both spellings used below.)

The first thing I did was Google for information on what others were doing about referrer spam. Most folks were doing the sensible thing, which is using Apache's mod_rewrite to send back 403 Forbidden responses to requests whose HTTP_REFERER matched known-spammer URLs.

A typical mod_rewrite configuration looks something like this:

RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http://(www\.)?spamdomain1.*$ [OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?spamdomain2.*$ [OR]
# and so on
RewriteCond %{HTTP_REFERER} ^http://(www\.)?spamdomainN.*$
RewriteRule .* - [F,L]

That's a pretty good start, but we can make some improvements.

Refining our solution

First, I think the regexps used above are too specific. The only thing that is constant in the referrer spam I get is the domain part:

  • http://www.spamhost.com/
  • http://www.spamhost.com/blah
  • http://spamhost.com
  • http://deals.spamhost.com/

Why not just match on the domain part? Second, why not put the evil domains in a configuration file that is easy to change?
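To make the "too specific" point concrete, here is a small shell sketch that counts how many of the four referrer forms above each style of pattern catches. It uses grep -E as a stand-in for mod_rewrite's regexp engine (an assumption, though one that should hold for patterns this simple), and count_matches is a made-up helper name.

```shell
# The four referrer forms listed above.
urls='http://www.spamhost.com/
http://www.spamhost.com/blah
http://spamhost.com
http://deals.spamhost.com/'

# count_matches PATTERN: how many of the URLs an extended regexp matches.
count_matches() {
    printf '%s\n' "$urls" | grep -Ec -- "$1"
}

count_matches '^http://(www\.)?spamhost\.com'           # 3: misses deals.spamhost.com
count_matches '^http://([^/]*\.)?spamhost\.com(/|$)'    # 4: keyed on the domain only
```

The second pattern works, but writing one like it per spammer gets tedious quickly, which is one more reason to pull the domain out once and look it up in a list.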

Following the lead taken by the "Referer-based Deflector" example from A Users Guide to URL Rewriting with the Apache Webserver, we can do just that. The idea is to use the RewriteMap directive to create a mapping file that lists URLs of "bad guys." Only in our implementation, we'll list the bad guys' domains instead of URLs:

##
##  refererdom.deny
##

attorneyslawyerslawfirms.com  -
cpads.com                     -
jeschke.com                   -
optinpr.com                   -
textbanker.com                -
travelnow.com                 -
web-promotion.net             -

# and so on

(Note that the trailing dashes are required because technically this is a map file.) The nice thing about this configuration file is that, if we change it, Apache will automatically re-read it and apply our new rules. Thus, for example, if we learn of a new spammer using the ispamyou.com domain, we can block him just like this:

$ echo 'ispamyou.com -' >> /path/to/refererdom.deny
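If you add domains often, a small wrapper can keep the map free of duplicate keys. This is only a sketch: deny_domain is a hypothetical helper, and the temporary file stands in for the real refererdom.deny.

```shell
# Hypothetical helper: add a domain to the deny map unless it is already a key.
deny_domain() {
    map=$1 domain=$2
    # awk exits 0 only if some line's first field equals the domain.
    if ! awk -v d="$domain" '$1 == d { hit = 1 } END { exit !hit }' "$map" 2>/dev/null; then
        printf '%s -\n' "$domain" >> "$map"
    fi
}

map=$(mktemp)                     # stand-in for /path/to/refererdom.deny
deny_domain "$map" ispamyou.com
deny_domain "$map" ispamyou.com   # second call is a no-op
cat "$map"                        # ispamyou.com -
rm -f "$map"
```

Comparing on the first field rather than the whole line means entries padded with extra spaces before the dash are still recognized.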

Now, with this file in place, all that's left to do is filter incoming requests' referrers against it. That seems straightforward: When we receive a request, we can take a look at the associated HTTP_REFERER, extract the domain part from it, and look it up in the refererdom.deny file. If we have a match, we'll Forbid the request.

It turns out that it's tricky to extract the domain part from the URL. I couldn't see any way to extract it using mod_rewrite itself. (If you know how to do it, post a comment.) Luckily, I can do this easily in Perl:

#!/usr/bin/perl -wlp

# url-to-domain.pl
# TGM 2004-04-18
#
# Returns the domain part of URLs:
#
#   http://www.mydomain.com/blah  ->  mydomain.com
#   http://www.mydomain.com       ->  mydomain.com

BEGIN { $| = 1 }

s{^http://}{};                  # strip http:// prefix
s{/.*}{};                       # strip pathname after hostname
s{^.*\.([^.]+\.[^.]+)$}{$1};    # convert hostname into domain

Once we have this Perl program, it's easy to tie it into Apache using another RewriteMap. Putting it all together, we arrive at this final bit of Apache configuration directives:

# These Apache rules prevent referer spam

RewriteMap  domain        prg:/path/to/url-to-domain.pl
RewriteMap  referer-deny  txt:/path/to/refererdom.deny

RewriteCond %{HTTP_REFERER} !=""
RewriteCond ${referer-deny:${domain:%{HTTP_REFERER}}|NOT-FOUND} !=NOT-FOUND
RewriteRule ^/.* - [F,L]
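To see what the two conditions and the rule decide for a given request, the logic can be mimicked outside Apache. This is a rough shell model, not how mod_rewrite works internally: referer_status is a hypothetical name, sed stands in for the Perl script, and awk stands in for the txt: map lookup.

```shell
# Sketch only: mimics the decision the rewrite rules make, outside Apache.
# referer_status REFERER MAPFILE prints the status the rules would produce.
referer_status() {
    referer=$1 map=$2

    # First condition: an empty Referer header never triggers the rule.
    [ -n "$referer" ] || { echo 200; return; }

    # Same three steps as url-to-domain.pl: scheme, path, then last two labels.
    domain=$(printf '%s\n' "$referer" |
        sed -E -e 's#^http://##' -e 's#/.*##' -e 's#^.*\.([^.]+\.[^.]+)$#\1#')

    # Second condition: forbid when the domain is a key in the deny map.
    if awk -v d="$domain" '$1 == d { hit = 1 } END { exit !hit }' "$map"; then
        echo 403
    else
        echo 200
    fi
}

map=$(mktemp)                               # stand-in for refererdom.deny
printf '%s -\n' optinpr.com web-promotion.net > "$map"

referer_status ''                          "$map"   # 200
referer_status 'http://optinpr.com/'       "$map"   # 403
referer_status 'http://www.optinpr.com/x'  "$map"   # 403
referer_status 'http://slashdot.org/'      "$map"   # 200
rm -f "$map"
```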

With these rules loaded, let's give our solution a try using the GET program from Perl's LWP bundle to generate requests:

$ GET -ds http://community.moertel.com/
200 OK

$ GET -ds -H 'Referer: http://optinpr.com/' http://community.moertel.com/
403 Forbidden

Looks like we have a winner: Normal requests go through just fine, but requests associated with referral spam are forbidden.

Happy spam blocking!

community.moertel.com | Copyright © 2003–07 Moertel Consulting