
Start/2004-04-21/1

Created by tmoertel. Last edited by tmoertel 1570 days ago. Viewed 10768 times. #2

Cleaning referrer spam out of SnipSnap

In an earlier entry I posted some code that I was using, along with Apache Rewrite rules, to block referrer spam. On occasion, however, some spam referrals do get through, and they end up in SnipSnap's database of backlinks. That means the spammy backlinks are likely to end up displayed on the site and give Google juice to the spammers.

To solve that problem, I wrote the following Perl script to clean spammy links from SnipSnap's database. To use it, I just export my SnipSnap database as an XML file, run the file through the script (giving the script a regex that matches the spammy URLs), and then import the result back into SnipSnap. The whole round trip takes about a minute.

Here's the script:

#!/usr/bin/perl

use warnings;
use strict;

my $target_pattern = shift;

unless ($target_pattern) {
    require File::Basename;
    my $cmd = File::Basename::basename($0);
    print STDERR "Usage: $cmd target-regexp [Input.snip]\n";
    exit 1;
}

undef $/;  # slurp mode
my $snip = <>;

# Replace the text that follows each <backLinks> tag with its scrubbed version.
$snip =~ s{(?<=<backLinks>)([^<]+)}{scrub($1)}ges;

print $snip;

sub scrub {
    join '|', grep {!/$target_pattern/o} split /\|/, $_[0];
}

=head1 NAME

clean-snipsnap-backlinks.pl

=head1 SYNOPSIS

B<clean-snipsnap-backlinks.pl> I<target-regex> I<SnipSnapDb.snip> E<gt> I<out.snip>

=head1 DESCRIPTION

Removes backlinks that match the I<target-regex> from the input SnipSnap database (in Snip format) and prints the cleaned-up database to standard output.

This filter is useful for removing spam and porn backlinks that spammers create via web crawlers that provide bogus "referer" information in HTTP requests.

=head1 LICENSE

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

=head1 AUTHOR

Tom Moertel http://community.moertel.com/ 2004-04-11
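
To make the effect concrete, here is a minimal sketch of the same substitution applied to a tiny in-memory string instead of a real export. The sample XML, the URLs, and the pattern are all made up for illustration; the only things taken from the script are the <backLinks> element name and the assumption that its entries are separated by "|". The sketch also precompiles the pattern with qr// rather than relying on the /o modifier, which is a swapped-in choice and not what the script above does.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical pattern: stands in for the regex given on the command line.
my $target_pattern = qr/casino|pills/;

# Hypothetical fragment: a real SnipSnap export has far more structure.
my $snip = '<snip><name>start</name>'
         . '<backLinks>http://example.org/a|http://spam.example/casino|http://example.net/b</backLinks>'
         . '</snip>';

# Keep only the |-separated entries that do NOT match the pattern.
sub scrub {
    my ($links) = @_;
    return join '|', grep { !/$target_pattern/ } split /\|/, $links;
}

# The lookbehind anchors the match just after the opening tag, and
# [^<]+ stops at the closing tag, so only the link list is rewritten.
(my $cleaned = $snip) =~ s{(?<=<backLinks>)([^<]+)}{scrub($1)}ge;

print "$cleaned\n";
```

With these made-up inputs, the entry containing "casino" is dropped and the other two backlinks are left in their original order. If every entry in an element matches, scrub returns an empty string and the element is left empty rather than removed.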

community.moertel.com | Copyright © 2003–07 Moertel Consulting