Topic: htaccess - blocking Agents containing X, but allowing XY
syncmaster913n
Joined: 23 Aug 2015 Posts: 2 Location: Poland, Warsaw
Posted: Sun 23 Aug '15 11:50 Post subject: htaccess - blocking Agents containing X, but allowing XY
Hello everyone,
First of all, my websites are on a high-quality managed VPS hosting account with one of the better providers available, and while I don't know the exact Apache version or operating system on my "box," I do know that they keep everything very much up to date. I'm guessing this information isn't necessary for my particular problem anyway, but if I'm wrong, I'll open a support ticket with my host to ask for the details and then update this post. Hope that's ok.
Goal:
I am trying to block certain bots/crawlers from having any sort of access to my website, while allowing other ones. Many of the bots in question do not obey robots.txt, but they almost always have either "bot" or "spider" in their user-agent. Therefore, what I'm trying to do is:
Use .htaccess to block all user-agents containing either "bot" or "spider" anywhere in the string (without regard to capitalization), while simultaneously granting access to just a handful of specific bots whose names also include the word "bot" (Googlebot and bingbot, to be exact).
My solution:
Here is what I have done:
Code:
BrowserMatchNoCase bot bad_bot
BrowserMatchNoCase spider bad_bot
BrowserMatchNoCase google good_bot
BrowserMatchNoCase bing good_bot
Order Deny,Allow
Deny from env=bad_bot
Allow from env=good_bot
I have tested this on a dummy site set up purely for testing, and it seems to work: using a Chrome browser extension that lets me put any words I want into my User-Agent, I changed my agent string to various values and then tried to access my website:
- bot: could not access site, raw logs return 403
- BoT: could not access site, raw logs return 403
- spider: could not access site, raw logs return 403
- SPIDer: could not access site, raw logs return 403
- GooglebOt: full website access, raw logs return 200
- BINGbot: full website access, raw logs return 200
It would appear from the above that my method is working, but I'm still wary of actually using this .htaccess code on my live site.
Question:
So I guess my question is: should the .htaccess rules described above actually work the way I want them to? Do they make sense? Is there any reason my user-agent "spoofing" test might have produced only a false positive? Basically, I'd just like to hear what those of you who are intimately familiar with how .htaccess works think of the above, in the hope of anticipating any future problems these rules might cause.
P.S. I have asked my host the exact same question and they seem to think the code above is indeed appropriate for the job. I'd still like a "second opinion" though, if possible.
Thank you for your time,
Mark
Steffen Moderator
Joined: 15 Oct 2005 Posts: 3092 Location: Hilversum, NL, EU
Posted: Sun 23 Aug '15 12:26
Looks like you are using Apache 2.2.
It's not necessary to have good and bad bots, only bad bots.
In the old 2.2 days I had:
Code:
SetEnvIf User-Agent archiver noc
SetEnvIf User-Agent Fetch noc
SetEnvIf User-Agent DTS noc
SetEnvIf User-Agent slurp noc
SetEnvIf User-Agent Baid noc
SetEnvIf User-Agent Indy noc
SetEnvIf User-Agent NPBot noc
SetEnvIf User-Agent turn noc
SetEnvIf User-Agent grub noc
SetEnvIf User-Agent ZyBorg noc
SetEnvIf User-Agent Scheduled noc
SetEnvIf User-Agent QuepasaCreep noc
...
deny from env=noc
For 2.4, have a look at https://www.apachelounge.com/viewtopic.php?t=5438
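(For reference, a minimal sketch of the same deny-list idea in 2.4 syntax, assuming mod_setenvif and mod_authz_core are loaded; "badcrawler" is just a placeholder pattern, and the linked topic has the details:)
Code:
# Flag unwanted agents, as in the 2.2 list above
BrowserMatchNoCase badcrawler bad_bot

# 2.4-style access control: allow everyone except flagged requests
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>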
syncmaster913n
Joined: 23 Aug 2015 Posts: 2 Location: Poland, Warsaw
Posted: Sun 23 Aug '15 12:43
Hi Steffen,
Thank you for the reply.
The problem is that I don't know the exact names of the bots I want to block: they change their names every once in a while, and new crawlers appear from time to time, so I would have to keep monitoring them and adding their names to the .htaccess file constantly, which would be too much work. The only thing those bots have in common is that their user-agent includes either the word "bot" or the word "spider," so I need to block based on those two words, while making an exception so that Googlebot and bingbot are not blocked.
The rules you provided seem appropriate if I had a list of all the bots I want to block, but I don't have such a list. This is why I have good_bot and bad_bot.
I hope I'm managing to make myself understood. Considering the above, do you think my .htaccess rules are appropriate?
covener
Joined: 23 Nov 2008 Posts: 59
Posted: Sun 23 Aug '15 19:26
syncmaster913n wrote:
Hi Steffen,
The rules you provided seem appropriate if I had a list of all the bots I want to block, but I don't have such a list. This is why I have good_bot and bad_bot.
His point was that there's no point to list "good bots" because 'Order deny,allow' already allows everyone by default.
edit: After rereading your lengthy first post: your htaccess doesn't do what you want. You need to do one of the following (sketched below):
- use !badbot
- write more robust regexes defining badbot
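Presumably the concern is that the good_bot rules match "google" and "bing" anywhere in the User-Agent, and under Order Deny,Allow a matching Allow always overrides a Deny, so any bot can get itself whitelisted simply by putting one of those substrings in its agent string.
A minimal, untested sketch of the first option, using mod_setenvif's "!variable" syntax to clear the flag again for the crawlers to be let through (2.2 syntax; the googlebot/bingbot patterns are assumptions, adjust to taste):
Code:
# Flag any User-Agent containing "bot" or "spider", case-insensitively
BrowserMatchNoCase bot bad_bot
BrowserMatchNoCase spider bad_bot
# ...then clear the flag again for the crawlers we want to allow
BrowserMatchNoCase googlebot !bad_bot
BrowserMatchNoCase bingbot !bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot
The second option could be a single PCRE with a negative lookahead (also untested):
Code:
# Match "bot" or "spider" only when the UA contains neither googlebot nor bingbot
BrowserMatchNoCase "^(?!.*(googlebot|bingbot)).*(bot|spider)" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot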