Apache Lounge Forum Index -> Apache
Topic: Help Blocking Abusive Crawlers
Phylum



Joined: 19 Jun 2013
Posts: 2

Posted: Thu 20 Jun '13 0:35    Post subject: Help Blocking Abusive Crawlers

Some of my sites are getting slammed by bots. According to awstats, bots are consuming an inordinate amount of bandwidth each month, sometimes as much as 43GB. In the past 6 months, the ‘Google’ and ‘bot*’ crawlers have each consumed nearly 17GB per month (so 34GB/month combined).
I’d like (read: I need!) some help blocking them via .htaccess and/or even robots.txt.

Some numbers from awstats (11 different robots in total; top entries below):
Robot                                     Hits         Bandwidth   Last visit
Unknown robot (identified by 'bot*')      11,331+457   13.37 GB    18 Jun 2013 - 12:13
Googlebot                                 3,963+74     15.87 GB    13 Jun 2013 - 23:42
Unknown robot (identified by 'robot')     3,738+27     5.81 GB     14 Jun 2013 - 00:03
Unknown robot (identified by '*bot')      2,311+417    2.94 GB     18 Jun 2013 - 11:12
Unknown robot (identified by 'spider')    803+110      600.61 MB   13 Jun 2013 - 23:58


My current robots.txt
Code:
User-agent: Googlebot
Disallow: /

User-agent: robot
Disallow: /

User-agent: bot*
Disallow: /

User-agent: *bot
Disallow: /

User-agent: crawl
Disallow: /

User-agent: spider
Disallow: /

User-agent: Yahoo Slurp
Disallow: /

User-agent: discovery
Disallow: /

User-agent: MSN-Media
Disallow: /


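A note on the robots.txt above: under the robots exclusion standard, a compliant crawler matches the User-agent line against its own token as a case-insensitive substring; wildcards are not supported there, and only a bare * means "all robots". So entries like bot* or *bot most likely match nothing, and abusive bots tend to ignore robots.txt entirely. A minimal sketch of the same intent without wildcards (the agent tokens shown are illustrative):

Code:
# Well-behaved crawlers match their own token as a substring
User-agent: Googlebot
Disallow: /

User-agent: Slurp
Disallow: /

# Catch-all for every other compliant robot
User-agent: *
Disallow: /
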

My current .htaccess
Code:
ErrorDocument 503 "Site disabled for crawling"
RewriteEngine on
Options +FollowSymlinks

RewriteCond %{HTTP_USER_AGENT} ^.*(Googlebot|robot|crawl|spider|Yahoo\ Slurp|discovery|MSNBot-Media|bot\*|\*bot|\*bot\*|Google).*$ [NC]
RewriteRule . - [F,L]

SetEnvIfNoCase ^User-Agent$ .*(Googlebot|robot|crawl|spider|discovery|MSNBot-Media|bot\*|bot\*|\*bot\*|Google) unwanted_bot
Deny from env=unwanted_bot


Any help is greatly appreciated
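
For reference, a minimal cleaned-up sketch of the same .htaccess intent. The escaped patterns bot\* and \*bot\* above match the literal characters "bot*" rather than any agent containing "bot", so a plain alternation is usually what is wanted. The agent list is illustrative, and on Apache 2.4 the Deny directive needs mod_access_compat:

Code:
RewriteEngine on
Options +FollowSymlinks

# Any request whose User-Agent contains one of these substrings gets a 403
RewriteCond %{HTTP_USER_AGENT} (googlebot|robot|crawl|spider|slurp|discovery|msnbot-media) [NC]
RewriteRule ^ - [F,L]

# Equivalent 2.2-style environment-variable approach
SetEnvIfNoCase User-Agent (googlebot|robot|crawl|spider|slurp|discovery|msnbot-media) unwanted_bot
Deny from env=unwanted_bot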
Steffen
Moderator


Joined: 15 Oct 2005
Posts: 3094
Location: Hilversum, NL, EU

Posted: Fri 21 Jun '13 19:11

Yep, we all have to live with it. There are dozens and dozens of crawlers around with (changing) user agents, and new ones can pop up daily. You lose the fight.
DnvrSysEngr



Joined: 15 Apr 2012
Posts: 226
Location: Denver, CO USA

Posted: Fri 21 Jun '13 19:29

I have found that using the GeoIP module (mod_geoip) and blocking certain countries in my httpd.conf really seems to help.

From the original post, it sounds like you are trying to block the requests before they ever get to your Apache server?
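
A minimal sketch of what such a mod_geoip setup in httpd.conf typically looks like; the database path and the blocked country codes here are illustrative assumptions only:

Code:
# Enable MaxMind GeoIP lookups (legacy GeoIP country database)
<IfModule mod_geoip.c>
    GeoIPEnable On
    GeoIPDBFile /usr/local/share/GeoIP/GeoIP.dat
</IfModule>

# mod_geoip exposes GEOIP_COUNTRY_CODE as an environment variable
SetEnvIf GEOIP_COUNTRY_CODE CN BlockCountry
SetEnvIf GEOIP_COUNTRY_CODE RU BlockCountry

# 2.2-style access control (needs mod_access_compat on 2.4)
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=BlockCountry
</Directory>

Note that this still rejects the requests inside Apache; keeping them from reaching the server at all would need a firewall or something upstream.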
Phylum



Joined: 19 Jun 2013
Posts: 2

Posted: Fri 28 Jun '13 4:56

Thanks for the responses, all. I'll consider trying my hand at getting mod_geoip configured correctly.

So, what you're saying is that no matter what I do in robots.txt and/or .htaccess, there's nothing I can do to block a spider/crawler/bot with a matching user agent?

If so, seems kinda silly.
DnvrSysEngr



Joined: 15 Apr 2012
Posts: 226
Location: Denver, CO USA

Posted: Fri 28 Jun '13 6:32

I may have gotten lucky, but I finally got Yandex, Baidu, and a few other annoying crawlers blocked with a combination of robots.txt and "bad bots" deny statements in my httpd.conf.
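
A minimal sketch of that kind of combination, assuming a 2.2-style httpd.conf; the agent tokens are the commonly published ones and the paths are illustrative (Yandex and Baidu generally honour robots.txt, so the deny rules act as a backstop):

Code:
# robots.txt
User-agent: Yandex
Disallow: /

User-agent: Baiduspider
Disallow: /

# httpd.conf: "bad bots" backstop via User-Agent matching
BrowserMatchNoCase (yandex|baiduspider) bad_bot
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>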
Xing
Moderator


Joined: 26 Oct 2005
Posts: 49

Posted: Fri 28 Jun '13 10:10

To stop the annoying ones we have the following in 2.4. Google, Bing and the other needed ones are not denied; we want to be indexed by them. I agree with Steffen above that it is a hard fight.

<Directory />
..
..
..
<RequireAll>
Require all granted
Require expr %{HTTP_USER_AGENT} !~ /LinkFinder/i
Require expr %{HTTP_USER_AGENT} !~ /GSLFbot/i
Require expr %{HTTP_USER_AGENT} !~ /sistrix/i
Require expr %{HTTP_USER_AGENT} !~ /zooms/i
Require expr %{HTTP_USER_AGENT} !~ /majesti/i
Require expr %{HTTP_USER_AGENT} !~ /omgili/i
Require expr %{HTTP_USER_AGENT} !~ /ows 98/i
Require expr %{HTTP_USER_AGENT} !~ /extrabot/i
Require expr %{HTTP_USER_AGENT} !~ /ahrefs/i
Require expr %{HTTP_USER_AGENT} !~ /Java/i
Require expr %{HTTP_USER_AGENT} !~ /youtech/i
Require expr %{HTTP_USER_AGENT} !~ /seokicks/i
Require expr %{HTTP_USER_AGENT} !~ /Seznam/i
Require expr %{HTTP_USER_AGENT} !~ /esri/i
Require expr %{HTTP_USER_AGENT} !~ /warebay/i
Require expr %{HTTP_USER_AGENT} !~ /libwww/i
Require expr %{HTTP_USER_AGENT} !~ /Solomo/i
Require expr %{HTTP_USER_AGENT} !~ /WWWC/i
Require expr %{HTTP_USER_AGENT} !~ /ip-web/i
Require expr %{HTTP_USER_AGENT} !~ /panopta/i
Require expr %{HTTP_USER_AGENT} !~ /curl/i
Require expr %{HTTP_USER_AGENT} !~ /Wget/i
Require expr %{HTTP_USER_AGENT} !~ /Spider/i
Require expr %{HTTP_USER_AGENT} !~ /ntegrome/i
Require expr %{HTTP_USER_AGENT} !~ /andwatch/i
Require expr %{HTTP_USER_AGENT} !~ /SearchBot/i
Require expr %{HTTP_USER_AGENT} !~ /spinn3/i
Require expr %{HTTP_USER_AGENT} !~ /BLEX/i
</RequireAll>
</Directory>
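
As a side note, that per-agent list can usually be collapsed into a single expression, which is easier to maintain; a sketch with an abbreviated alternation:

Code:
<RequireAll>
    Require all granted
    Require expr %{HTTP_USER_AGENT} !~ /(LinkFinder|GSLFbot|sistrix|ahrefs|seokicks|libwww|curl|Wget|Spider|BLEX)/i
</RequireAll>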


