<rdf:RDF
    xmlns:s='http://snipsnap.org/rdf/snip-schema#'
    xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
    xml:base='http://community.moertel.com/ss/rdf'>
    <s:Snip rdf:about='http://community.moertel.com/ss/rdf#2004-03-18'
         s:name='2004-03-18'
         s:cUser='tmoertel'
         s:oUser='tmoertel'
         s:mUser='tmoertel'>
        <s:content>1 Blocking referer spam {anchor:Blocking referer spam}&#xA;One of the nice (or bad) things about many modern wiki and blog&#xA;systems is that they will automatically generate a &quot;referrer&quot; link when&#xA;somebody visits your site by way of another site.  Referrer links point&#xA;back to the sites from which your readers came, making it easy for you&#xA;and your readers to explore those (presumably) related sites.&#xA;&#xA;Like most things on the Internet, referrer links are subject to abuse&#xA;by spammers.  I noticed this when looking at the &quot;people came here&#xA;from&quot; section at the bottom of the [start] page.  That&apos;s where&#xA;{link:SnipSnap|http://snipsnap.org/} places its automatically&#xA;generated referral links.  I saw the expected referrals from&#xA;{link:Slashdot|http://slashdot.org/},&#xA;{link:Kuro5hin|http://kuro5hin.org/}, and the like, but I also saw&#xA;referrals from obviously bogus places like &quot;optinpr.com&quot; and&#xA;&quot;web-promotion.net&quot;, among others.  Looking up these suspect domains&#xA;on {link:Google|http://google.com/} confirmed their bogosity.&#xA;&#xA;Time for action.&#xA;&#xA;1.1 Taking action&#xA;&#xA;(First, I should note that although the dictionary spells it&#xA;&quot;referrer&quot;, in the HTTP world, it&apos;s &quot;referer&quot;.  You&apos;ll see both&#xA;spellings used below.)&#xA;&#xA;The first thing I did was Google for information on what&#xA;others were doing about referrer spam.  Most folks were doing the&#xA;sensible thing, which is using&#xA;{link:Apache|http://httpd.apache.org/}&apos;s&#xA;{link:mod_rewrite|http://httpd.apache.org/docs-2.0/mod/mod_rewrite.html}&#xA;to send back __403 Forbidden__ responses to requests when the&#xA;HTTP_REFERER matched known-spammer URLs.  
Here are a few folks who&#xA;have used that approach:&#xA;&#xA;- http://vigilant.tv/article/3416&#xA;- http://www.spywareinfo.com/articles/referer_spam/&#xA;- http://www.joemaller.com/refererspam.shtml&#xA;&#xA;A typical mod_rewrite configuration looks something like this:&#xA;&#xA;{code:none}&#xA;RewriteEngine On&#xA;RewriteCond %{HTTP_REFERER} ^http\://(www\.)?spamdomain1.*$ [OR]&#xA;RewriteCond %{HTTP_REFERER} ^http\://(www\.)?spamdomain2.*$ [OR]&#xA;\# and so on&#xA;RewriteCond %{HTTP_REFERER} ^http\://(www\.)?spamdomainN.*$&#xA;RewriteRule .* - [F,L]&#xA;{code}&#xA;&#xA;That&apos;s a pretty good start, but we can make some improvements.&#xA;&#xA;1.1 Refining our solution&#xA;&#xA;First, I think the regexps used above are too specific.  The only&#xA;thing that is constant in the referrer spam I get is the domain part:&#xA;&#xA;- http\://www.spamhost.com/&#xA;- http\://www.spamhost.com/blah&#xA;- http\://spamhost.com&#xA;- http\://deals.spamhost.com/&#xA;&#xA;Why not just match on the domain part?  Second, why not put the evil&#xA;domains in a configuration file that is easy to change?&#xA;&#xA;Following the lead taken by the&#xA;{link:&quot;Referer-based Deflector&quot;|http://www.engelschall.com/pw/apache/rewriteguide/\#ToC42}&#xA;example from {link:A Users Guide to URL Rewriting with the Apache Webserver|http://www.engelschall.com/pw/apache/rewriteguide/}, we can&#xA;do just that.&#xA;&#xA;The idea is to use the __RewriteMap__ directive to create a mapping file&#xA;that lists URLs of &quot;bad guys.&quot;  Only in our implementation, we&apos;ll&#xA;list the bad guys&apos; domains instead of URLs:&#xA;&#xA;{code}&#xA;\#\#&#xA;\#\#  refererdom.deny&#xA;\#\#&#xA;&#xA;attorneyslawyerslawfirms.com -&#xA;cpads.com -&#xA;jeschke.com -&#xA;optinpr.com -&#xA;textbanker.com -&#xA;travelnow.com -&#xA;web-promotion.net -&#xA;&#xA;\# and so on&#xA;&#xA;{code}&#xA;&#xA;&#xA;(Note that the trailing dashes are required because technically this is a map file.)  
The nice thing about this configuration file is that, if we change it,&#xA;Apache will automatically re-read it and apply our new rules.  Thus,&#xA;for example, if we learn of a new spammer using the __ispamyou.com__&#xA;domain, we can block him just like this:&#xA;&#xA;{code}&#xA;$ echo &apos;ispamyou.com -&apos; &gt;&gt; /path/to/refererdom.deny&#xA;{code}&#xA;&#xA;Now, with this file in place, all that&apos;s left to do is filter incoming&#xA;requests&apos; referrers against it.  That seems straightforward: When we&#xA;receive a request, we can take a look at the associated HTTP_REFERER,&#xA;extract the domain part from it, and look it up in the&#xA;__refererdom.deny__ file.  If we have a match, we&apos;ll Forbid the&#xA;request.&#xA;&#xA;It turns out that it&apos;s tricky to extract the domain part from the URL.&#xA;I couldn&apos;t see any way to extract it using mod_rewrite itself.  (If&#xA;you know how to do it, post a comment.)  Luckily, I ~~can~~ do this easily&#xA;in Perl:&#xA;&#xA;{code:none}&#xA;\#!/usr/bin/perl -wlp&#xA;&#xA;\# url-to-domain.pl&#xA;\# TGM 2004-04-18&#xA;\#&#xA;\# Returns the domain part of URLs:&#xA;\#&#xA;\# http\://www.mydomain.com/blah -&gt; mydomain.com&#xA;\# http\://www.mydomain.com      -&gt; mydomain.com&#xA;&#xA;BEGIN { $| = 1 }&#xA;&#xA;s{^http\://}{};               \# strip http:// prefix&#xA;s{/.*}{};                    \# strip pathname after hostname&#xA;s{^.*\\\\.([^.]+\\\\.[^.]+)$}\{$1}; \# convert hostname into domain&#xA;{code}&#xA;&#xA;Once we have this Perl program, it&apos;s easy to tie it into Apache&#xA;using another __RewriteMap__.  
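
If you'd like to sanity-check the three substitutions before wiring the script into Apache, here is a rough shell sketch.  It is not part of my setup: the url_to_domain function is a made-up helper that mirrors the Perl substitutions with sed, and it assumes a sed that accepts -E for extended regular expressions.

```shell
# Hypothetical helper (an illustration only, not part of the Apache setup):
# applies the same three substitutions as url-to-domain.pl, using sed -E.
url_to_domain() {
    printf '%s\n' "$1" | sed -E \
        -e 's,^http://,,' \
        -e 's,/.*,,' \
        -e 's,^.*\.([^.]+\.[^.]+)$,\1,'
}

# Run it over the sample referrer URLs listed earlier in this entry.
for url in \
    'http://www.spamhost.com/' \
    'http://www.spamhost.com/blah' \
    'http://spamhost.com' \
    'http://deals.spamhost.com/'
do
    url_to_domain "$url"
done
```

Each of the four sample URLs should come out as spamhost.com.  Note that this approach keeps only the last two dot-separated labels of the hostname, so a host under a two-part suffix (say, something.example.co.uk) would be reduced to co.uk; entries in the deny file need to be written with that in mind.
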
Putting it all together, we arrive at this final set of Apache configuration directives:&#xA;&#xA;{code:none}&#xA;\# These Apache rules prevent referer spam&#xA;&#xA;RewriteMap domain       prg:/path/to/url-to-domain.pl&#xA;RewriteMap referer-deny txt:/path/to/refererdom.deny&#xA;&#xA;RewriteCond %{HTTP_REFERER} !=&quot;&quot;&#xA;RewriteCond ${referer-deny:${domain:%{HTTP_REFERER}}|NOT-FOUND} !=NOT-FOUND&#xA;RewriteRule  ^\/.*  -  [F,L]&#xA;{code}&#xA;&#xA;With these rules loaded, let&apos;s give our solution a try using the GET program from Perl&apos;s LWP bundle to generate requests:&#xA;&#xA;{code}&#xA;$ __GET -ds http\://community.moertel.com/__&#xA;200 OK&#xA;&#xA;$ __GET -ds -H &apos;Referer: http\://optinpr.com/&apos; http\://community.moertel.com/__&#xA;403 Forbidden&#xA;{code}&#xA;&#xA;Looks like we have a winner:  Normal requests go through just fine,&#xA;but requests associated with referral spam are forbidden.&#xA;&#xA;Happy spam blocking!</s:content>
        <s:mTime>2004-03-26 00:28:46.316</s:mTime>
        <s:cTime>2004-03-18 18:52:31.22</s:cTime>
        <s:comments
             rdf:type='http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag'/>
        <s:snipLinks>
            <rdf:Bag>
                <rdf:li rdf:resource='#snipsnap-index'/>
                <rdf:li rdf:resource='#tmoertel'/>
                <rdf:li rdf:resource='#snipsnap-search'/>
                <rdf:li rdf:resource='http://community.moertel.com/ss/rdf#'/>
                <rdf:li rdf:resource='http://community.moertel.com/ss/rdf#space/2004-03-18'/>
                <rdf:li rdf:resource='http://community.moertel.com/ss/rdf#start/2004-06-18/2'/>
                <rdf:li rdf:resource='http://community.moertel.com/ss/rdf#start/'/>
                <rdf:li rdf:resource='http://community.moertel.com/ss/rdf#start/2004-06-18/1'/>
            </rdf:Bag>
        </s:snipLinks>
        <s:attachments
             rdf:type='http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag'/>
    </s:Snip>
</rdf:RDF>
