Blocking referer spam 
One of the nice (or bad) things about many modern wiki and blog
systems is that they will automatically generate a "referrer" link when
somebody visits your site by way of another site. Referrer links point
back to the sites from which your readers came, making it easy for you
and your readers to explore those (presumably) related sites.
Like most things on the Internet, referrer links are subject to abuse
by spammers. I noticed this when looking at the "people came here
from" section at the bottom of the start page. That's where SnipSnap
places its automatically generated referral links. I saw the expected
referrals from Slashdot, Kuro5hin, and the like, but I also saw
referrals from obviously bogus places like "optinpr.com" and
"web-promotion.net", among others. Looking up these suspect domains
on Google confirmed their bogosity.
Time for action.
Taking action
(First, I should note that although the dictionary spells it
"referrer", in the HTTP world, it's "referer". You'll see both
spellings used below.)
The first thing I did was search Google for information on what
others were doing about referrer spam. Most folks were doing the
sensible thing, which is using Apache's mod_rewrite to send back
403 Forbidden responses to requests whose HTTP_REFERER matched
known-spammer URLs. A few folks have used that approach.
A typical mod_rewrite configuration looks something like this:
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http://(www\.)?spamdomain1.*$ [OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?spamdomain2.*$ [OR]
# and so on
RewriteCond %{HTTP_REFERER} ^http://(www\.)?spamdomainN.*$
RewriteRule .* - [F,L]

That's a pretty good start, but we can make some improvements.
Refining our solution
First, I think the regexps used above are too specific. The only
thing that is constant in the referrer spam I get is the domain part:
- http://www.spamhost.com/
- http://www.spamhost.com/blah
- http://spamhost.com
- http://deals.spamhost.com/
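A quick way to confirm that the domain really is the only constant is to run all four shapes past a single domain-anchored pattern. The following is a rough shell sketch for experimenting at the command line, not something taken from anyone's Apache configuration; the pattern and the variable names are my own illustration.

```shell
# The four referrer shapes listed above.
urls='http://www.spamhost.com/
http://www.spamhost.com/blah
http://spamhost.com
http://deals.spamhost.com/'

# One pattern anchored on the domain: an optional run of subdomain
# labels, then the domain itself, then either a path or end of string.
pattern='^http://([^/]*\.)?spamhost\.com(/|$)'

# Count how many of the example URLs the single pattern matches.
matched=$(printf '%s\n' "$urls" | grep -Ec "$pattern")
echo "$matched of 4 matched"
```

The trailing `(/|$)` and the required dot before the domain keep look-alikes such as `notspamhost.com` from matching.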
Why not just match on the domain part? Second, why not put the evil
domains in a configuration file that is easy to change?
Following the lead taken by the "Referer-based Deflector" example from
A Users Guide to URL Rewriting with the Apache Webserver, we can
do just that.
The idea is to use the RewriteMap directive to create a mapping file
that lists URLs of "bad guys." Only in our implementation, we'll
list the bad guys' domains instead of URLs:
##
## refererdom.deny
##

attorneyslawyerslawfirms.com -
cpads.com -
jeschke.com -
optinpr.com -
textbanker.com -
travelnow.com -
web-promotion.net -
# and so on
(Note that the trailing dashes are required because technically this is a map file, and each key in a map file needs a value; the dash is just a placeholder.)
The nice thing about this configuration file is that, if we change it,
Apache will automatically re-read it and apply our new rules. Thus,
for example, if we learn of a new spammer using the ispamyou.com
domain, we can block him just like this:
$ echo 'ispamyou.com -' >> /path/to/refererdom.deny
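Because the map is just a text file, a small shell helper can keep it free of duplicate entries. The sketch below is a convenience of my own, not part of the original setup: `deny_domain` is a made-up name, and it assumes the `domain -` line format shown above.

```shell
# deny_domain FILE DOMAIN: hypothetical helper that appends "DOMAIN -"
# to the map file unless a line with that exact key is already there.
deny_domain() {
    file=$1
    domain=$2
    # awk compares the first whitespace-separated field (the map key)
    # as a literal string, so dots in the domain are not wildcards.
    # A missing file makes awk fail, which falls through to the append.
    if awk -v d="$domain" '$1 == d { found = 1 } END { exit !found }' \
        "$file" 2>/dev/null; then
        echo "already listed: $domain"
    else
        echo "$domain -" >> "$file" && echo "added: $domain"
    fi
}
```

With the placeholder path from above, a call would look like `deny_domain /path/to/refererdom.deny ispamyou.com`; running it a second time leaves the file unchanged.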
Now, with this file in place, all that's left to do is filter incoming
requests' referrers against it. That seems straightforward: When we
receive a request, we can take a look at the associated HTTP_REFERER,
extract the domain part from it, and look it up in the
refererdom.deny file. If we have a match, we'll Forbid the
request.
It turns out that it's tricky to extract the domain part from the URL.
I couldn't see any way to extract it using mod_rewrite itself. (If
you know how to do it, post a comment.) Luckily, I can do this easily
in Perl:
#!/usr/bin/perl -wlp

# url-to-domain.pl
# TGM 2004-04-18
#
# Returns the domain part of URLs:
#
# http://www.mydomain.com/blah -> mydomain.com
# http://www.mydomain.com -> mydomain.com

BEGIN { $| = 1 }

s{^http://}{}; # strip http:// prefix
s{/.*}{}; # strip pathname after hostname
s{^.*\.([^.]+\.[^.]+)$}{$1}; # convert hostname into domain

Once we have this Perl program, it's easy to tie it into Apache
using another RewriteMap. Putting it all together, we arrive at this
final bit of Apache configuration directives:
# These Apache rules prevent referer spam

RewriteMap domain prg:/path/to/url-to-domain.pl
RewriteMap referer-deny txt:/path/to/refererdom.deny

RewriteCond %{HTTP_REFERER} !=""
RewriteCond ${referer-deny:${domain:%{HTTP_REFERER}}|NOT-FOUND} !=NOT-FOUND
RewriteRule ^/.* - [F,L]

With these rules loaded, let's give our solution a try using the GET
program from Perl's LWP bundle to generate requests:
$ GET -ds http://community.moertel.com/
200 OK

$ GET -ds -H 'Referer: http://optinpr.com/' http://community.moertel.com/
403 Forbidden
Looks like we have a winner: Normal requests go through just fine,
but requests associated with referral spam are forbidden.
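If you'd rather check a referrer against the deny list without sending a live request, the same decision can be dry-run from the shell. This is only a sketch of the logic, not what Apache executes: `referer_is_denied` is a hypothetical name, the sed expressions imitate the Perl substitutions above (with `https://` also accepted, which is my addition), and the deny file is assumed to use the `domain -` format.

```shell
# referer_is_denied FILE REFERER: hypothetical dry run of the check.
# Exits 0 (would be forbidden) if the referrer's domain is a key in
# the map file, and non-zero otherwise, including for an empty referrer.
referer_is_denied() {
    file=$1
    referer=$2
    [ -n "$referer" ] || return 1

    # Strip the scheme and the path, then keep the last two hostname
    # labels. Accepting https:// is an assumption beyond the Perl script.
    domain=$(printf '%s\n' "$referer" | sed -E \
        -e 's#^https?://##' \
        -e 's#/.*##' \
        -e 's#^.*\.([^.]+\.[^.]+)$#\1#')

    # Look the domain up as a literal key, skipping comment lines.
    awk -v d="$domain" '!/^#/ && $1 == d { hit = 1 } END { exit !hit }' "$file"
}
```

For example, `referer_is_denied /path/to/refererdom.deny 'http://optinpr.com/' && echo forbidden` should print `forbidden` for any domain listed in the file. Like the Perl script, the two-label rule is a simplification and would treat every `something.co.uk` host as the domain `co.uk`.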
Happy spam blocking!