<rdf:RDF
    xmlns:s='http://snipsnap.org/rdf/snip-schema#'
    xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
    xml:base='http://community.moertel.com/ss/rdf'>
    <s:Snip rdf:about='http://community.moertel.com/ss/rdf#start/2004-06-18/2'
         s:name='start/2004-06-18/2'
         s:cUser='tmoertel'
         s:oUser='tmoertel'
         s:mUser='tmoertel'>
        <s:content>1 Spamsitive closure? {anchor:Spamsitive closure?}&#xA;One of the nice things about SnipSnap, the wiki+blog system that runs&#xA;this community site, is that it automagically creates backlinks to&#xA;sites that link to this site.  These backlinks make it easy for people&#xA;browsing this site to find related information on the other sites,&#xA;since presumably that&apos;s why the other sites linked to this site in the&#xA;first place.&#xA;&#xA;Unfortunately, spammers have corrupted this useful feature in order to&#xA;gain so-called &quot;Google juice&quot; &amp;#8211; higher rankings in searches.&#xA;What the spammers do is use HTTP robots that hit sites like mine&#xA;hundreds and often thousands of times with forged HTTP referrer&#xA;information.  The forged information makes it appear that users have&#xA;visited my site via links from the spammers&apos; sites.  The spammers&#xA;hope that my web site will then create automatic backlinks that point&#xA;their sites, and that these backlinks will earn them Google juice.&#xA;&#xA;As I wrote in my [2004-03-18] and {link:2004-04-21|http://community.moertel.com/ss/space/start/2004-04-21/1|img=none} entries, I have written&#xA;some tiny tools that help me block this kind of abuse and, in the&#xA;event some garbage spills through my defenses, clean out the bogus backlinks.&#xA;Every morning, I get an email that contains a summary of new sites&#xA;that have (supposedly) referred users to this site.  It&apos;s easy to spot&#xA;the spammy ones and add them to my blacklist and filter list.&#xA;&#xA;However, every once in a while, a spammer tries something different,&#xA;and lots of new spammy referrals show up.  Then it takes a while,&#xA;maybe fifteen minutes, to clean up the mess.  Recently, for example, a&#xA;spammer who has control of an army of zombie hosts (probably Win32&#xA;home cpmputers exploited by one of the many Windows-based worms) used them&#xA;to referrer-spamvertise a dozen or so porn sites.  Rather than clean&#xA;this spam out the database by hand, I decided to use the weight of the spammer&apos;s&#xA;own attack against him.&#xA;&#xA;__Transitive closure.__ The idea behind transitive closure is this.&#xA;If you have a directed graph ~~G~~, you can compute the transitive&#xA;closure of ~~G~~ by creating a new graph in which every pair of&#xA;vertices ~~u, v~~ is connected by an edge if any path exists between&#xA;the same vertices in ~~G~~.  In other words, you&apos;re finding all the other&#xA;vertices that you can possibly reach from each vertex and connecting&#xA;them directly.  The reason this is useful is because it makes later analyses of the graph easier.  If you want to know what vertices are reachable from ~~u~~ you don&apos;t have to hunt for them, those nodes are now ~~u~~&apos;s immediate neighbors.  &#xA;&#xA;Now, what&apos;s this got to do with referrer spam?  A lot.  Create a graph&#xA;that represents the relationships between IP addresses and referrers&#xA;in your web server&apos;s access log, compute its transitive closure, and&#xA;then take a look at a few of the vertices that represent known spammy&#xA;IP addresses or spamvertised web sites.  Guess what all the edges from&#xA;those vertices point to?  That&apos;s right &amp;#8211; other, related spammy IP&#xA;addresses and web sites.&#xA;&#xA;In short, transitive closure gives us an&#xA;easy way to blacklist ~~all~~ of the IP addresses and web sites&#xA;belonging to a spammer, if we can provide a just few spammy addresses or&#xA;sites to get started.  (Note: You could also do this &amp;#8211; probably more efficiently &amp;#8211; by finding&#xA;strongly-connected components that contain the target vertices.)&#xA;&#xA;I wrote a small program to do just this.  (Actually, as an optimization, it computes closure only for the subset of the graph that can be reached from the seeds.)  Then I fed the program a single&#xA;spammy site that I knew was spamvertised by the zombie-controlling spammer.  After a &#xA;few seconds, the program output 27 domains and 465 IP addresses,&#xA;all under the control of the spammer.  These went right into&#xA;the blacklist and filter list, and that was that.&#xA;</s:content>
        <s:mTime>2004-06-18 21:10:57.832</s:mTime>
        <s:cTime>2004-06-18 21:10:57.832</s:cTime>
        <s:comments
             rdf:type='http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag'/>
        <s:snipLinks>
            <rdf:Bag>
                <rdf:li rdf:resource='#snipsnap-index'/>
                <rdf:li rdf:resource='#tmoertel'/>
                <rdf:li rdf:resource='http://community.moertel.com/ss/rdf#'/>
                <rdf:li rdf:resource='#snipsnap-search'/>
                <rdf:li rdf:resource='http://community.moertel.com/ss/rdf#start/2004-04-21/1'/>
                <rdf:li rdf:resource='http://community.moertel.com/ss/rdf#space/start/2004-06-18/2'/>
                <rdf:li rdf:resource='http://community.moertel.com/ss/rdf#2004-03-18'/>
            </rdf:Bag>
        </s:snipLinks>
        <s:attachments
             rdf:type='http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag'/>
    </s:Snip>
</rdf:RDF>
