Spamsitive closure? 
One of the nice things about SnipSnap, the wiki+blog system that runs
this community site, is that it automagically creates backlinks to
sites that link to this site. These backlinks make it easy for people
browsing this site to find related information on the other sites,
since presumably that's why the other sites linked to this site in the
first place.
Unfortunately, spammers have corrupted this useful feature in order to
gain so-called "Google juice" – higher rankings in searches.
What the spammers do is use HTTP robots that hit sites like mine
hundreds and often thousands of times with forged HTTP referrer
information. The forged information makes it appear that users have
visited my site via links from the spammers' sites. The spammers
hope that my web site will then create automatic backlinks that point
their sites, and that these backlinks will earn them Google juice.
As I wrote in my
2004-03-18 and
2004-04-21 entries, I have written
some tiny tools that help me block this kind of abuse and, in the
event some garbage spills through my defenses, clean out the bogus backlinks.
Every morning, I get an email that contains a summary of new sites
that have (supposedly) referred users to this site. It's easy to spot
the spammy ones and add them to my blacklist and filter list.
However, every once in a while, a spammer tries something different,
and lots of new spammy referrals show up. Then it takes a while,
maybe fifteen minutes, to clean up the mess. Recently, for example, a
spammer who has control of an army of zombie hosts (probably Win32
home cpmputers exploited by one of the many Windows-based worms) used them
to referrer-spamvertise a dozen or so porn sites. Rather than clean
this spam out the database by hand, I decided to use the weight of the spammer's
own attack against him.
Transitive closure. The idea behind transitive closure is this.
If you have a directed graph
G, you can compute the transitive
closure of
G by creating a new graph in which every pair of
vertices
u, v is connected by an edge if any path exists between
the same vertices in
G. In other words, you're finding all the other
vertices that you can possibly reach from each vertex and connecting
them directly. The reason this is useful is because it makes later analyses of the graph easier. If you want to know what vertices are reachable from
u you don't have to hunt for them, those nodes are now
u's immediate neighbors.
Now, what's this got to do with referrer spam? A lot. Create a graph
that represents the relationships between IP addresses and referrers
in your web server's access log, compute its transitive closure, and
then take a look at a few of the vertices that represent known spammy
IP addresses or spamvertised web sites. Guess what all the edges from
those vertices point to? That's right – other, related spammy IP
addresses and web sites.
In short, transitive closure gives us an
easy way to blacklist
all of the IP addresses and web sites
belonging to a spammer, if we can provide a just few spammy addresses or
sites to get started. (Note: You could also do this – probably more efficiently – by finding
strongly-connected components that contain the target vertices.)
I wrote a small program to do just this. (Actually, as an optimization, it computes closure only for the subset of the graph that can be reached from the seeds.) Then I fed the program a single
spammy site that I knew was spamvertised by the zombie-controlling spammer. After a
few seconds, the program output 27 domains and 465 IP addresses,
all under the control of the spammer. These went right into
the blacklist and filter list, and that was that.