Apache Lounge Forum Index -> Apache
Topic: Help Blocking Abusive Crawlers
Phylum



Joined: 19 Jun 2013
Posts: 2

Posted: Thu 20 Jun '13 0:35    Post subject: Help Blocking Abusive Crawlers

Some of my sites are getting slammed by bots. According to awstats, bots are consuming an inordinate amount of bandwidth each month, sometimes as much as 43GB. In the past 6 months, the ‘Google’ and ‘bot*’ crawlers have each consumed nearly 17GB per month (so 34GB/month combined).
I’d like (read: I need!) some help blocking them via .htaccess and/or even robots.txt.

Some numbers from awstats (11 different robots in total; top entries below):
Robot                                     Hits         Bandwidth   Last visit
Unknown robot (identified by 'bot*')      11,331+457   13.37 GB    18 Jun 2013 - 12:13
Googlebot                                 3,963+74     15.87 GB    13 Jun 2013 - 23:42
Unknown robot (identified by 'robot')     3,738+27     5.81 GB     14 Jun 2013 - 00:03
Unknown robot (identified by '*bot')      2,311+417    2.94 GB     18 Jun 2013 - 11:12
Unknown robot (identified by 'spider')    803+110      600.61 MB   13 Jun 2013 - 23:58


My current robots.txt
Code:
User-agent: Googlebot
Disallow: /

User-agent: robot
Disallow: /

User-agent: bot*
Disallow: /

User-agent: *bot
Disallow: /

User-agent: crawl
Disallow: /

User-agent: spider
Disallow: /

User-agent: Yahoo Slurp
Disallow: /

User-agent: discovery
Disallow: /

User-agent: MSN-Media
Disallow: /


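A note on the robots.txt above: under the robots exclusion standard, a compliant crawler matches the User-agent line against its own token as a case-insensitive substring; wildcards are not supported there, and only a bare * means "all robots". So entries like bot* or *bot most likely match nothing, and abusive bots tend to ignore robots.txt entirely. A minimal sketch of the same intent without wildcards (the agent tokens shown are illustrative):

Code:
# Well-behaved crawlers match their own token as a substring
User-agent: Googlebot
Disallow: /

User-agent: Slurp
Disallow: /

# Catch-all for every other compliant robot
User-agent: *
Disallow: /
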

My current .htaccess
Code:
ErrorDocument 503 "Site disabled for crawling"
RewriteEngine on
Options +FollowSymlinks

RewriteCond %{HTTP_USER_AGENT} ^.*(Googlebot|robot|crawl|spider|Yahoo\ Slurp|discovery|MSNBot-Media|bot\*|\*bot|\*bot\*|Google).*$ [NC]
RewriteRule . - [F,L]

SetEnvIfNoCase ^User-Agent$ .*(Googlebot|robot|crawl|spider|discovery|MSNBot-Media|bot\*|bot\*|\*bot\*|Google) unwanted_bot
Deny from env=unwanted_bot


Any help is greatly appreciated
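
For reference, a minimal cleaned-up sketch of the same .htaccess intent. The escaped patterns bot\* and \*bot\* above match the literal characters "bot*" rather than any agent containing "bot", so a plain alternation is usually what is wanted. The agent list is illustrative, and on Apache 2.4 the Deny directive needs mod_access_compat:

Code:
RewriteEngine on
Options +FollowSymlinks

# Any request whose User-Agent contains one of these substrings gets a 403
RewriteCond %{HTTP_USER_AGENT} (googlebot|robot|crawl|spider|slurp|discovery|msnbot-media) [NC]
RewriteRule ^ - [F,L]

# Equivalent 2.2-style environment-variable approach
SetEnvIfNoCase User-Agent (googlebot|robot|crawl|spider|slurp|discovery|msnbot-media) unwanted_bot
Deny from env=unwanted_bot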
Steffen
Moderator


Joined: 15 Oct 2005
Posts: 3094
Location: Hilversum, NL, EU

Posted: Fri 21 Jun '13 19:11

Yep, we all have to live with it. There are dozens and dozens of crawlers around with (changing) user agents, and new ones can pop up daily. You lose the fight.
DnvrSysEngr



Joined: 15 Apr 2012
Posts: 226
Location: Denver, CO USA

Posted: Fri 21 Jun '13 19:29

I have found that using the GeoIP module (mod_geoip) and blocking certain countries in my httpd.conf really seems to help.

From the original post, it sounds like you are trying to block the requests before they ever get to your Apache server?
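
A minimal sketch of what such a mod_geoip setup in httpd.conf typically looks like; the database path and the blocked country codes here are illustrative assumptions only:

Code:
# Enable MaxMind GeoIP lookups (legacy GeoIP country database)
<IfModule mod_geoip.c>
    GeoIPEnable On
    GeoIPDBFile /usr/local/share/GeoIP/GeoIP.dat
</IfModule>

# mod_geoip exposes GEOIP_COUNTRY_CODE as an environment variable
SetEnvIf GEOIP_COUNTRY_CODE CN BlockCountry
SetEnvIf GEOIP_COUNTRY_CODE RU BlockCountry

# 2.2-style access control (needs mod_access_compat on 2.4)
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=BlockCountry
</Directory>

Note that this still rejects the requests inside Apache; keeping them from reaching the server at all would need a firewall or something upstream.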
Phylum



Joined: 19 Jun 2013
Posts: 2

Posted: Fri 28 Jun '13 4:56

Thanks for the responses, all. I'll consider trying my hand at getting mod_geoip configured correctly.

So, what you're saying is that no matter what I do in robots.txt and/or .htaccess, there's nothing I can do to block a spider/crawler/bot with a matching user agent?

If so, seems kinda silly.
DnvrSysEngr



Joined: 15 Apr 2012
Posts: 226
Location: Denver, CO USA

Posted: Fri 28 Jun '13 6:32

I may have gotten lucky, but I finally got Yandex, Baidu, and a few other annoying crawlers blocked with a combination of robots.txt and "bad bots" deny statements in my httpd.conf.
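
A minimal sketch of that kind of combination, assuming a 2.2-style httpd.conf; the agent tokens are the commonly published ones and the paths are illustrative (Yandex and Baidu generally honour robots.txt, so the deny rules act as a backstop):

Code:
# robots.txt
User-agent: Yandex
Disallow: /

User-agent: Baiduspider
Disallow: /

# httpd.conf: "bad bots" backstop via User-Agent matching
BrowserMatchNoCase (yandex|baiduspider) bad_bot
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>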
Xing
Moderator


Joined: 26 Oct 2005
Posts: 49

Posted: Fri 28 Jun '13 10:10

To stop the annoying ones we have the following in 2.4. Google, Bing and the other needed ones are not denied; we want to be indexed by them. I agree with Steffen above that it is a hard fight.

<Directory />
..
..
..
<RequireAll>
Require all granted
Require expr %{HTTP_USER_AGENT} !~ /LinkFinder/i
Require expr %{HTTP_USER_AGENT} !~ /GSLFbot/i
Require expr %{HTTP_USER_AGENT} !~ /sistrix/i
Require expr %{HTTP_USER_AGENT} !~ /zooms/i
Require expr %{HTTP_USER_AGENT} !~ /majesti/i
Require expr %{HTTP_USER_AGENT} !~ /omgili/i
Require expr %{HTTP_USER_AGENT} !~ /ows 98/i
Require expr %{HTTP_USER_AGENT} !~ /extrabot/i
Require expr %{HTTP_USER_AGENT} !~ /ahrefs/i
Require expr %{HTTP_USER_AGENT} !~ /Java/i
Require expr %{HTTP_USER_AGENT} !~ /youtech/i
Require expr %{HTTP_USER_AGENT} !~ /seokicks/i
Require expr %{HTTP_USER_AGENT} !~ /Seznam/i
Require expr %{HTTP_USER_AGENT} !~ /esri/i
Require expr %{HTTP_USER_AGENT} !~ /warebay/i
Require expr %{HTTP_USER_AGENT} !~ /libwww/i
Require expr %{HTTP_USER_AGENT} !~ /Solomo/i
Require expr %{HTTP_USER_AGENT} !~ /WWWC/i
Require expr %{HTTP_USER_AGENT} !~ /ip-web/i
Require expr %{HTTP_USER_AGENT} !~ /panopta/i
Require expr %{HTTP_USER_AGENT} !~ /curl/i
Require expr %{HTTP_USER_AGENT} !~ /Wget/i
Require expr %{HTTP_USER_AGENT} !~ /Spider/i
Require expr %{HTTP_USER_AGENT} !~ /ntegrome/i
Require expr %{HTTP_USER_AGENT} !~ /andwatch/i
Require expr %{HTTP_USER_AGENT} !~ /SearchBot/i
Require expr %{HTTP_USER_AGENT} !~ /spinn3/i
Require expr %{HTTP_USER_AGENT} !~ /BLEX/i
</RequireAll>
</Directory>
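
As a side note, that per-agent list can usually be collapsed into a single expression, which is easier to maintain; a sketch with an abbreviated alternation:

Code:
<RequireAll>
    Require all granted
    Require expr %{HTTP_USER_AGENT} !~ /(LinkFinder|GSLFbot|sistrix|ahrefs|seokicks|libwww|curl|Wget|Spider|BLEX)/i
</RequireAll>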


