Topic: Help Blocking Abusive Crawlers
Phylum
Joined: 19 Jun 2013 Posts: 2
Posted: Thu 20 Jun '13 0:35    Post subject: Help Blocking Abusive Crawlers
Some of my sites are getting slammed by bots. According to awstats, bots are consuming an inordinate amount of bandwidth each month, sometimes as much as 43 GB. In the past 6 months, the ‘Google’ and ‘bot*’ crawlers have each consumed nearly 17 GB per month (so about 34 GB/month between them).
I’d like (read: I need!) some help blocking them via .htaccess and/or even robots.txt.
Some numbers from awstats:
11 different robots*                      Hits         Bandwidth   Last visit
Unknown robot (identified by 'bot*')      11,331+457   13.37 GB    18 Jun 2013 - 12:13
Googlebot                                 3,963+74     15.87 GB    13 Jun 2013 - 23:42
Unknown robot (identified by 'robot')     3,738+27     5.81 GB     14 Jun 2013 - 00:03
Unknown robot (identified by '*bot')      2,311+417    2.94 GB     18 Jun 2013 - 11:12
Unknown robot (identified by 'spider')    803+110      600.61 MB   13 Jun 2013 - 23:58
My current robots.txt:
Code:
User-agent: Googlebot
Disallow: /
User-agent: robot
Disallow: /
User-agent: bot*
Disallow: /
User-agent: *bot
Disallow: /
User-agent: crawl
Disallow: /
User-agent: spider
Disallow: /
User-agent: Yahoo Slurp
Disallow: /
User-agent: discovery
Disallow: /
User-agent: MSN-Media
Disallow: /
My current .htaccess:
Code:
ErrorDocument 503 "Site disabled for crawling"
RewriteEngine on
Options +FollowSymlinks
RewriteCond %{HTTP_USER_AGENT} ^.*(Googlebot|robot|crawl|spider|Yahoo\ Slurp|discovery|MSNBot-Media|bot\*|\*bot|\*bot\*Google).*$ [NC]
RewriteRule . - [F,L]
SetEnvIfNoCase ^User-Agent$ .*(Googlebot|robot|crawl|spider|discovery|MSNBot-Media|bot\*|bot\*|\*bot\*|Google) unwanted_bot
Deny from env=unwanted_bot
Any help is greatly appreciated.
Steffen Moderator
Joined: 15 Oct 2005 Posts: 3094 Location: Hilversum, NL, EU
Posted: Fri 21 Jun '13 19:11
Yep, we all have to live with it. There are dozens and dozens of crawlers around with (changing) user agents, and new ones can pop up daily. You lose the fight.
DnvrSysEngr
Joined: 15 Apr 2012 Posts: 226 Location: Denver, CO USA
Posted: Fri 21 Jun '13 19:29
I have found that using the GeoIP module (mod_geoip) and blocking certain countries in my httpd.conf really seems to help.
From the original post, it sounds like you are trying to block the requests before they ever get to your Apache server?
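For reference, a minimal sketch of the kind of mod_geoip country blocking mentioned above, for httpd.conf; the database path, the directory, and the country codes are only placeholders, not my actual setup:
Code:
<IfModule mod_geoip.c>
    GeoIPEnable On
    # Path to the MaxMind country database -- adjust to your install
    GeoIPDBFile /usr/local/share/GeoIP/GeoIP.dat

    # mod_geoip exposes the visitor's country as GEOIP_COUNTRY_CODE;
    # flag the countries you want to refuse (codes are examples only)
    SetEnvIf GEOIP_COUNTRY_CODE CN block_country
    SetEnvIf GEOIP_COUNTRY_CODE RU block_country

    <Directory "/var/www/htdocs">
        Order Allow,Deny
        Allow from all
        Deny from env=block_country
    </Directory>
</IfModule>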
Phylum
Joined: 19 Jun 2013 Posts: 2
Posted: Fri 28 Jun '13 4:56
Thanks for the responses, all. I'll consider trying my hand at getting mod_geoip configured right.
So, what you're saying is that no matter what I do in robots.txt and/or .htaccess, there's nothing I can do to block a spider/crawler/bot with a matching user agent?
If so, that seems kinda silly.
DnvrSysEngr
Joined: 15 Apr 2012 Posts: 226 Location: Denver, CO USA
Posted: Fri 28 Jun '13 6:32
I may have gotten lucky, but I finally got Yandex, Baidu, and a few other annoying crawlers blocked with a combination of robots.txt and "badbots" deny statements in my httpd.conf.
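For anyone reading along, a rough sketch of that kind of combination in httpd.conf; the bot names, the directory, and the variable name are illustrative, not the actual config:
Code:
# Flag unwanted crawlers by substrings of their User-Agent
SetEnvIfNoCase User-Agent "Yandex"      bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "Sogou"       bad_bot

<Directory "/var/www/htdocs">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>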
Xing Moderator
Joined: 26 Oct 2005 Posts: 49
Posted: Fri 28 Jun '13 10:10
To stop the annoying ones we have the following in 2.4. Google, Bing and other needed ones are not denied; we want to be indexed by them. I agree with Steffen above that it is hard to fight.
<Directory />
..
..
..
<RequireAll>
Require all granted
Require expr %{HTTP_USER_AGENT} !~ /LinkFinder/i
Require expr %{HTTP_USER_AGENT} !~ /GSLFbot/i
Require expr %{HTTP_USER_AGENT} !~ /sistrix/i
Require expr %{HTTP_USER_AGENT} !~ /zooms/i
Require expr %{HTTP_USER_AGENT} !~ /majesti/i
Require expr %{HTTP_USER_AGENT} !~ /omgili/i
Require expr %{HTTP_USER_AGENT} !~ /ows 98/i
Require expr %{HTTP_USER_AGENT} !~ /extrabot/i
Require expr %{HTTP_USER_AGENT} !~ /ahrefs/i
Require expr %{HTTP_USER_AGENT} !~ /Java/i
Require expr %{HTTP_USER_AGENT} !~ /youtech/i
Require expr %{HTTP_USER_AGENT} !~ /seokicks/i
Require expr %{HTTP_USER_AGENT} !~ /Seznam/i
Require expr %{HTTP_USER_AGENT} !~ /esri/i
Require expr %{HTTP_USER_AGENT} !~ /warebay/i
Require expr %{HTTP_USER_AGENT} !~ /libwww/i
Require expr %{HTTP_USER_AGENT} !~ /Solomo/i
Require expr %{HTTP_USER_AGENT} !~ /WWWC/i
Require expr %{HTTP_USER_AGENT} !~ /ip-web/i
Require expr %{HTTP_USER_AGENT} !~ /panopta/i
Require expr %{HTTP_USER_AGENT} !~ /curl/i
Require expr %{HTTP_USER_AGENT} !~ /Wget/i
Require expr %{HTTP_USER_AGENT} !~ /Spider/i
Require expr %{HTTP_USER_AGENT} !~ /ntegrome/i
Require expr %{HTTP_USER_AGENT} !~ /andwatch/i
Require expr %{HTTP_USER_AGENT} !~ /SearchBot/i
Require expr %{HTTP_USER_AGENT} !~ /spinn3/i
Require expr %{HTTP_USER_AGENT} !~ /BLEX/i
</RequireAll>
</Directory>
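In other words: inside <RequireAll> every Require line has to pass, so "Require all granted" admits normal visitors, while a User-Agent that matches any of the patterns makes its "Require expr ... !~ ..." line fail, and the whole block then refuses the request with a 403.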
X