Cleaning referrer spam out of SnipSnap 
In an earlier entry I posted some code that I was using, along with Apache rewrite rules, to block referrer spam. On occasion, however, some spam referrals do get through, and they end up in SnipSnap's database of backlinks. That means the spammy backlinks are likely to end up displayed on the site, where they give Google juice (search-ranking credit) to the spammers.
To solve that problem, I wrote the following Perl script to clean spammy links from SnipSnap's database. To use it, I export my SnipSnap database as an XML file, run the file through the script (giving the script a regex that matches the spammy URLs), and then import the result back into SnipSnap. The whole round trip takes about a minute.
Here's the script:
#!/usr/bin/perl

use warnings;
use strict;

my $target_pattern = shift;

unless ($target_pattern) {
    require File::Basename;
    my $cmd = File::Basename::basename($0);
    print STDERR "Usage: $cmd target-regexp [Input.snip]\n";
    exit 1;
}

undef $/;  # slurp mode
my $snip = <>;

$snip =~ s{(?<=<backLinks>)([^<]+)}{scrub($1)}ges;

print $snip;

sub scrub {
    join '|', grep {!/$target_pattern/o} split /\|/, $_[0];
}

=head1 NAME

clean-snipsnap-backlinks.pl

=head1 SYNOPSIS

B<clean-snipsnap-backlinks.pl> I<target-regex> I<SnipSnapDb.snip>
E<gt> I<out.snip>

=head1 DESCRIPTION

Removes backlinks that match the I<target-regex> from the input
SnipSnap database (in Snip format) and prints the cleaned-up database
to standard output.

This filter is useful for removing spam and porn backlinks that
spammers create via web crawlers that provide bogus "referer"
information in HTTP requests.

=head1 LICENSE

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

=head1 AUTHOR

Tom Moertel
http://community.moertel.com/
2004-04-11
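
To see what the substitution does without touching a real export, here is a small self-contained sketch that runs the same search-and-scrub step on a made-up fragment. The sample XML, the URLs in it, and the pattern 'casino|pills' are all invented for illustration; a real SnipSnap export may lay out its backlink entries differently, so treat this as a demonstration of the technique rather than of the actual file format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pattern and sample data, made up for illustration only.
my $target_pattern = 'casino|pills';

my $snip = '<snip><name>start</name>'
         . '<backLinks>http://example.org/blog|http://cheap-pills.example/|'
         . 'http://example.net/wiki|http://casino.example/win</backLinks>'
         . '</snip>';

# Same substitution as the script: rewrite only the text that follows
# an opening <backLinks> tag, up to the next "<".
(my $cleaned = $snip) =~ s{(?<=<backLinks>)([^<]+)}{scrub($1)}ge;

print "$cleaned\n";

# Split the pipe-separated list, drop entries matching the pattern, rejoin.
sub scrub {
    return join '|', grep { !/$target_pattern/ } split /\|/, $_[0];
}
```

The two entries that match the pattern are dropped and the other two are rejoined with '|'. The lookbehind keeps the <backLinks> tag itself out of the match, so only the list between the tags is rewritten and the rest of the document passes through unchanged.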