Good stuff for programming geeks
[ start | index | login or register ]
start > 2004-06-18 > 1

Start/2004-06-18/1

Created by tmoertel. Last edited by tmoertel 941 days ago. Viewed 953 times. #2
[diff] [history] [edit] [rdf]
labels
attachments

Spamsitive closure?

One of the nice things about SnipSnap, the wiki+blog system that runs this community site, is that it automagically creates backlinks to sites that link to this site. These backlinks make it easy for people browsing this site to find related information on the other sites, since presumably that's why the other sites linked to this site in the first place.

Unfortunately, spammers have corrupted this useful feature in order to gain so-called "Google juice" – higher rankings in searches. What the spammers do is use HTTP robots that hit sites like mine hundreds and often thousands of times with forged HTTP referrer information. The forged information makes it appear that users have visited my site via links from the spammers' sites. The spammers hope that my web site will then create automatic backlinks that point their sites, and that these backlinks will earn them Google juice.

As I wrote in my 2004-03-18 and 2004-04-21 entries, I have written some tiny tools that help me block this kind of abuse and, in the event some garbage spills through my defenses, clean out the bogus backlinks. Every morning, I get an email that contains a summary of new sites that have (supposedly) referred users to this site. It's easy to spot the spammy ones and add them to my blacklist and filter list.

However, every once in a while, a spammer tries something different, and lots of new spammy referrals show up. Then it takes a while, maybe fifteen minutes, to clean up the mess. Recently, for example, a spammer who has control of an army of zombie hosts (probably Win32 home cpmputers exploited by one of the many Windows-based worms) used them to referrer-spamvertise a dozen or so porn sites. Rather than clean this spam out the database by hand, I decided to use the weight of the spammer's own attack against him.

Transitive closure. The idea behind transitive closure is this. If you have a directed graph G, you can compute the transitive closure of G by creating a new graph in which every pair of vertices u, v is connected by an edge if any path exists between the same vertices in G. In other words, you're finding all the other vertices that you can possibly reach from each vertex and connecting them directly. The reason this is useful is because it makes later analyses of the graph easier. If you want to know what vertices are reachable from u you don't have to hunt for them, those nodes are now u's immediate neighbors.

Now, what's this got to do with referrer spam? A lot. Create a graph that represents the relationships between IP addresses and referrers in your web server's access log, compute its transitive closure, and then take a look at a few of the vertices that represent known spammy IP addresses or spamvertised web sites. Guess what all the edges from those vertices point to? That's right – other, related spammy IP addresses and web sites.

In short, transitive closure gives us an easy way to blacklist all of the IP addresses and web sites belonging to a spammer, if we can provide a just few spammy addresses or sites to get started. (Note: You could also do this – probably more efficiently – by finding connected components that contain the target vertices.)

I wrote a small program to do just this. (Actually, as an optimization, it computes closure only for the subset of the graph that can be reached from the seeds.) Then I fed the program a single spammy site that I knew was spamvertised by the zombie-controlling spammer. After a few seconds, the program output 27 domains and 465 IP addresses, all under the control of the spammer. These went right into the blacklist and filter list, and that was that.

Please login to post a comment.
community.moertel.com | Copyright © 2003–07 Moertel Consulting