gordon.dewis.ca - Random musings from Gordon


Spam crawling ‘bots and AntiLeech

August 13, 2008 @ 01:46 By: gordon Category: Meta, WordPress

I just happened to be looking through my blog’s logs and noticed that a ‘bot had crawled through numerous pages in a very short period of time:

72.3.137.83 - - [13/Aug/2008:06:01:45 +0100] "GET /2008/02/01/ HTTP/1.0" 200 35253 "-" "ISC Systems iRc Search 2.1"
72.3.137.83 - - [13/Aug/2008:06:01:49 +0100] "GET /2008/02/05 HTTP/1.0" 301 84 "-" "ISC Systems iRc Search 2.1"
72.3.137.83 - - [13/Aug/2008:06:01:52 +0100] "GET /2008/02/05/ HTTP/1.0" 200 35375 "-" "ISC Systems iRc Search 2.1"
72.3.137.83 - - [13/Aug/2008:06:01:56 +0100] "GET /2008/02/06 HTTP/1.0" 301 84 "-" "ISC Systems iRc Search 2.1"
72.3.137.83 - - [13/Aug/2008:06:01:59 +0100] "GET /2008/02/06/ HTTP/1.0" 200 33373 "-" "ISC Systems iRc Search 2.1"
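Spotting a burst like that by eye is easy when you stumble across it, but it can be automated. Here’s a rough sketch in Python that tallies requests per user agent in an Apache combined-format log; the filename and the threshold of 20 hits are arbitrary assumptions, not anything I actually run:

import re
from collections import defaultdict

# Matches Apache's combined log format, as in the excerpt above:
# IP, identd, user, [timestamp], "request", status, bytes, "referer", "user agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

hits = defaultdict(int)  # user agent -> request count

with open("access.log") as f:  # assumed filename
    for line in f:
        m = LINE_RE.match(line)
        if m:
            ip, timestamp, agent = m.groups()
            hits[agent] += 1

# Anything making a suspiciously large number of requests is worth a closer look.
for agent, count in sorted(hits.items(), key=lambda kv: -kv[1]):
    if count > 20:
        print(f"{count:5d}  {agent}")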

The “ISC Systems iRc Search 2.1” user agent caught my interest, so I did a little research with Google.  As I suspected, this user agent is associated with a web crawler that harvests email addresses for spammers.  I use the AntiLeech plugin to battle content thieves and the like, so I added the user agent to its blacklist.
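The idea behind the blacklist is simple enough to sketch. This is only an illustration in Python; the pattern list and the function are hypothetical, not AntiLeech’s actual code, which is PHP:

import re

# Hypothetical blacklist; AntiLeech maintains its own list.
BLACKLIST = [
    r"ISC Systems iRc Search",  # the harvester that crawled my archives
]

def is_blacklisted(user_agent: str) -> bool:
    """True if the user agent is empty or matches a blacklisted pattern."""
    if not user_agent.strip():
        return True  # an empty user agent is suspicious in itself
    return any(re.search(p, user_agent) for p in BLACKLIST)

print(is_blacklisted("ISC Systems iRc Search 2.1"))  # True
print(is_blacklisted("Mozilla/5.0 (Windows; U)"))    # False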

But how to tell if AntiLeech is actually working?

I came across Bots vs Browsers, where one can learn all about the various ‘bots that are out there.  There are also tools to help simulate various types of connections to a webserver.  Using their User Agent Track Test tool, I hit my blog with a user agent string of “ISC Systems iRc Search 2.1” and looked at the output.  I was happy to discover that my content was replaced by comments telling people to visit my blog if they wanted to see the content.  That’s just the thing for dealing with content thieves who republish content from someone else’s blog to generate traffic for their own sites without the original author’s permission:

AntiLeech output example

Cool.
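You can run the same test yourself without a third-party tool. Here’s a sketch using Python’s standard library; the URL is a placeholder, so point it at one of your own pages:

import urllib.request

# Pretend to be the harvester and see what the blog serves it.
req = urllib.request.Request(
    "http://example.com/2008/02/01/",  # placeholder URL
    headers={"User-Agent": "ISC Systems iRc Search 2.1"},
)
with urllib.request.urlopen(req) as resp:
    body = resp.read().decode("utf-8", errors="replace")

# If AntiLeech is working, the real content has been replaced.
print(body[:500])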

A tip of the hat to Owen, author of the AntiLeech plugin, and another tip of the hat to Bots vs Browsers.  Also, thanks to Johann Burkard, whose blog entry provided the link to Bots vs Browsers.

6 Responses to “Spam crawling ‘bots and AntiLeech”


  1. Paul Tomblin says:

    So what other user agents do you block?

  2. Johann says:

    Paul,

    for starters, I would recommend blocking generic HTTP libraries like libwww-perl and Python-urllib, anything starting with Java or Jakarta, empty user agents, and some others I mentioned in the last link to my site.

  3. gordon says:

    I don’t have a very extensive list, actually, Paul. AntiLeech also allows you to block based on IP address, and I do have a few entries in place there. As far as I know, it hasn’t been a huge problem, and the entries I’ve added have mostly been in response to incidents I happened to notice. But that’s kind of like closing the barn door after the horses have escaped.

  4. As great as Antileech is, and I do recommend it highly, it has limitations when it comes to RSS scraping. You might want to look at the Copyfeed plugin as a means to increase your protection or, failing that, use the digital fingerprint plugin to track RSS use.

    It’s just something to consider. Personally, I’m just happy to see another great link for Antileech!

  5. gordon says:

    I’ve heard of the digital fingerprint plugin, but I’m not familiar with Copyfeed. Thanks for the tips!

  6. Owen says:

    One thing that the Antileech plugin does that I’m not sure the others do is embed a unique image into the feed for the requestor. When they display that image on their site, it makes a call back to your site and Antileech is able to track that they are displaying your content. If the site shows up in Antileech’s list, then it’s definitely something to be concerned about.

    Glad you’re enjoying the plugin.
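The callback-image trick Owen describes can be illustrated with a short sketch. Everything here is hypothetical (the names, the URL, all of it); AntiLeech itself is a PHP plugin and its internals differ:

import uuid

feed_tokens = {}  # token -> (requestor IP, user agent)

def tag_feed(feed_html: str, ip: str, user_agent: str) -> str:
    """Append a unique 1x1 tracking image to the feed served to one requestor."""
    token = uuid.uuid4().hex
    feed_tokens[token] = (ip, user_agent)
    beacon = f'<img src="http://example.com/beacon/{token}.gif" width="1" height="1" alt="" />'
    return feed_html + beacon

def on_beacon_hit(token: str, referer: str) -> None:
    """Handle a request for the tracking image; the Referer header exposes
    the page that is republishing the feed content."""
    if token in feed_tokens:
        ip, agent = feed_tokens[token]
        print(f"feed fetched by {ip} ({agent}) is displayed at {referer}")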


