Apache Lounge
Forum Index -> Apache
Topic: htaccess - blocking Agents containing X, but allowing XY
Author
syncmaster913n



Joined: 23 Aug 2015
Posts: 2
Location: Poland, Warsaw

PostPosted: Sun 23 Aug '15 11:50    Post subject: htaccess - blocking Agents containing X, but allowing XY

Hello everyone,

First of all, my websites are on a managed VPS hosting account with one of the better providers available. I don't know the exact Apache version or operating system version on my "box," but I do know that they keep everything very much up to date. I'm guessing this information isn't necessary for my particular problem anyway, but if I'm wrong, I'll open a support ticket with my host, ask them for the details, and update this post. Hope that's OK.

Goal:

I am trying to block certain bots/crawlers from having any sort of access to my website, while allowing other ones. Many of the bots in question do not obey robots.txt, but they almost always have the word "bot" or "spider" in their user-agent. Therefore, what I'm trying to do is:

Use htaccess to block all user-agents containing either "bot" or "spider" anywhere in the string (without regard to capitalization), while simultaneously granting access to just a handful of specific bots whose names also include the word "bot" (Googlebot and bingbot, to be exact).

My solution:

Here is what I have done:

Code:

BrowserMatchNoCase bot bad_bot
BrowserMatchNoCase spider bad_bot
BrowserMatchNoCase google good_bot
BrowserMatchNoCase bing good_bot
Order Deny,Allow
Deny from env=bad_bot
Allow from env=good_bot


I have tested this on a dummy site set up only for testing purposes, and it seems to work: using a Chrome browser extension that lets me put any words I want in my User-Agent, I changed my agent name to various things and then tried to access my website:

- bot: could not access site, raw logs return 403
- BoT: could not access site, raw logs return 403
- spider: could not access site, raw logs return 403
- SPIDer: could not access site, raw logs return 403
- GooglebOt: full website access, raw logs return 200
- BINGbot: full website access, raw logs return 200

It would appear from the above that my method is working, but I'm still afraid of actually using the above htaccess code on my live site.

Question:

So I guess my question is: should the htaccess rules described above actually work as I want them to, and do they make sense? Is there any reason why my user-agent "spoofing" test might have produced a false positive? Basically, I'd like to hear what those of you who are intimately familiar with how htaccess works think about the above, in the hope of anticipating any problems these rules might cause down the road.


P.S. I have asked my host the exact same question above and they seem to think that the code above is indeed appropriate for the job. I'd still like a "second opinion" though, if possible.

Thank you for your time,
Mark
Steffen
Moderator


Joined: 15 Oct 2005
Posts: 3092
Location: Hilversum, NL, EU

PostPosted: Sun 23 Aug '15 12:26    Post subject:

Looks like you use Apache 2.2.

Not necessary to have good and bad bots, only bad bots.

I had in the 2.2 old days:
Code:
SetEnvIf User-Agent archiver noc
SetEnvIf User-Agent Fetch noc
SetEnvIf User-Agent DTS noc
SetEnvIf User-Agent slurp noc
SetEnvIf User-Agent Baid noc
SetEnvIf User-Agent Indy noc
SetEnvIf User-Agent NPBot noc
SetEnvIf User-Agent turn noc
SetEnvIf User-Agent grub noc
SetEnvIf User-Agent ZyBorg noc
SetEnvIf User-Agent Scheduled noc
SetEnvIf User-Agent QuepasaCreep noc
...
....


deny from env=noc
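
For completeness, the usual 2.2 access-control stanza around that line looks something like the following sketch (the bare `deny from env=noc` above also works on its own, since the default `Order Deny,Allow` permits everything that is not explicitly denied):

```apache
# Deny requests flagged with the "noc" variable, allow everyone else
Order Allow,Deny
Allow from all
Deny from env=noc
```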



For 2.4 have a look at https://www.apachelounge.com/viewtopic.php?t=5438
syncmaster913n



Joined: 23 Aug 2015
Posts: 2
Location: Poland, Warsaw

PostPosted: Sun 23 Aug '15 12:43    Post subject:

Hi Steffen,

Thank you for the reply.

The problem is that I don't know the exact names of the bots I want to block, because they change their names every once in a while, and new crawlers also appear from time to time, so I would have to keep monitoring them and adding their names to the htaccess file constantly, which would be too much work. The only thing those bots have in common is that their user-agent includes either the word "bot" or the word "spider", so I need to block based on those two words while making an exception for Googlebot and bingbot.

The rules you provided in your message seem appropriate if one has a list of all the bots to block, but I don't have such a list. This is why I have good_bot and bad_bot.

I hope I'm making myself clear. Considering the above, do you think my htaccess rules are appropriate?
covener



Joined: 23 Nov 2008
Posts: 59

PostPosted: Sun 23 Aug '15 19:26    Post subject:

syncmaster913n wrote:
Hi Steffen,
The rules you provided in your message seem appropriate if one has a list of all the bots to block, but I don't have such a list. This is why I have good_bot and bad_bot.


His point was that there's no point to list "good bots" because 'Order deny,allow' already allows everyone by default.

edit: After rereading your lengthy first post: your htaccess doesn't do what you want. You need to do one of:
- use !bad_bot to clear the flag again for the good bots
- write more robust regexes defining bad_bot
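
The first option can be sketched in 2.2 syntax like this (untested; the patterns are illustrative). A later `BrowserMatchNoCase` with a leading `!` on the variable name unsets it again, so the whitelist lines must come after the blacklist lines:

```apache
# Flag any User-Agent containing "bot" or "spider" (case-insensitive)
BrowserMatchNoCase "bot|spider" bad_bot
# Then clear the flag again for the crawlers that should get through
BrowserMatchNoCase "googlebot" !bad_bot
BrowserMatchNoCase "bingbot" !bad_bot

# Deny only the requests still flagged as bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```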
James Blond
Moderator


Joined: 19 Jan 2006
Posts: 7373
Location: Germany, Next to Hamburg

PostPosted: Wed 26 Aug '15 11:08    Post subject:

From the 2.4 docs [1] it seems to be easier:

Code:
Require expr %{HTTP_USER_AGENT} != 'BadBot'



[1] http://httpd.apache.org/docs/2.4/howto/access.html#env
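
Adapted to the original question, a 2.4 sketch (untested; the regex patterns are illustrative) could combine both tests in one expression: allow the request unless the agent matches bot/spider without also matching one of the whitelisted names:

```apache
<RequireAll>
    Require all granted
    # Pass if the agent contains neither "bot" nor "spider",
    # or if it is one of the whitelisted crawlers
    Require expr "%{HTTP_USER_AGENT} !~ /(bot|spider)/i || %{HTTP_USER_AGENT} =~ /(googlebot|bingbot)/i"
</RequireAll>
```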