Topic: robots.txt and security
Danll
Joined: 02 Aug 2013 Posts: 49 Location: USA, Houston
Posted: Mon 14 Jul '14 19:11 Post subject: robots.txt and security
It occurs to me that the robots.txt file, which is intended, in part, to tell search engines NOT to catalog certain files, is publicly available (which is how the search engines get it).
So anyone can download my robots.txt file and determine which files I don't want advertised on search engines. Then, I suppose, they can go get those files.
In other words, it's trivial for people to figure out exactly which files I don't want widely advertised. If someone knows where your site is, they know what you don't want the search engines to catalog.
That seems a bit amusing, and it pretty much ensures that the robots.txt file has nothing to do with what one might consider "security".
glsmith Moderator
Joined: 16 Oct 2007 Posts: 2268 Location: Sun Diego, USA
Posted: Mon 14 Jul '14 20:03 Post subject:
No, nor is security its intention. Security by obscurity has never been good security to begin with.
Danll
Joined: 02 Aug 2013 Posts: 49 Location: USA, Houston
Posted: Mon 14 Jul '14 21:04 Post subject:
Yes, "security by obscurity" (I like that phrase!) is a poor construct.
Of course, it is kind of funny that by telling search engines what you don't want people to see, you're telling people what you don't want them to see.
What you're really telling search engines is that you don't want people to be directed to certain things. But by doing so, you're advertising what those things are and where people can find them.
glsmith Moderator
Joined: 16 Oct 2007 Posts: 2268 Location: Sun Diego, USA
Posted: Tue 15 Jul '14 1:20 Post subject:
Come to think of it, we're in the 21st century, and this exact concern is probably what brought about some changes to search engines. For instance, say you have the subdirectories "this", "that" and "other". All you want indexed is "this", "other" and your main index page. Today, many of the major search engines accept this:

Code:
User-agent: *
Allow: /this/
Allow: /other/
Allow: /index.html
Disallow: /

You'll notice I'm not giving away the fact that the subdirectory "that" even exists in this example.
Search engines that have not adapted to this just don't index you, period. Looking at a day in the life of my server, it looks like googlebot, msnbot and Yandex have adapted. I do not see any of them accessing anything that was not specified with Allow.
James Blond Moderator
Joined: 19 Jan 2006 Posts: 7373 Location: Germany, Next to Hamburg
Posted: Tue 15 Jul '14 10:38 Post subject:
It is a good strategy to block Google from folders containing images that shall not be indexed. All other stuff, like downloads, folders, etc., should be protected via the Apache config and/or your software/code.
Often bad people find your stuff by taking a look at robots.txt.
With the many files a CMS has, I wouldn't go with glsmith's strategy.
Danll
Joined: 02 Aug 2013 Posts: 49 Location: USA, Houston
Posted: Tue 15 Jul '14 20:20 Post subject:
glsmith wrote:
You'll notice I'm not giving away the fact that the subdirectory "that" even exists in this example.

That's a VERY good point. Basically, blacklist everything, and then whitelist just what you want everyone to see. Thanks!
glsmith Moderator
Joined: 16 Oct 2007 Posts: 2268 Location: Sun Diego, USA
Posted: Wed 16 Jul '14 3:37 Post subject:
The last line of James' comment did get me thinking, especially since most CMSes use search-engine-friendly URLs. With a little config ingenuity, one could have a dynamic robots.txt file.
http://funwithrobots.linkpc.net:88
Hit the robots.txt file in your browser and see what it gives you. If you have Firefox and the User Agent Switcher extension (or similar), go again telling the server you are Googlebot. You will certainly see something different. Find the prize hidden inside.
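One way a dynamic robots.txt like that might be wired up is with mod_rewrite, serving a different static file depending on the requesting User-Agent. This is only a sketch, assuming mod_rewrite is enabled; the file names robots-bots.txt and robots-humans.txt are hypothetical, not from the demo site above:

```apache
# Serve a different robots.txt depending on who is asking.
# Known crawlers get the whitelist file; everyone else gets the locked-down one.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|msnbot|yandex) [NC]
RewriteRule ^robots\.txt$ /robots-bots.txt [L]
RewriteRule ^robots\.txt$ /robots-humans.txt [L]
```

The RewriteCond applies only to the first RewriteRule, so recognized bots stop at robots-bots.txt and every other client falls through to robots-humans.txt.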
Of course, there is still no real security here, since anyone can do the same. If you must completely secure something, don't put it on the server at all.
If it must be available, at least password-protect it behind https. No spider will index it then.
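For that last suggestion, a minimal sketch of password-protecting a directory with HTTP basic auth in the Apache config (the paths /var/www/private and /etc/apache2/.htpasswd are placeholders; put this inside your SSL vhost so the credentials only ever travel over https):

```apache
<Directory "/var/www/private">
    # Prompt for a username/password from the htpasswd file.
    AuthType Basic
    AuthName "Restricted area"
    AuthUserFile /etc/apache2/.htpasswd
    Require valid-user
</Directory>
```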
James Blond Moderator
Joined: 19 Jan 2006 Posts: 7373 Location: Germany, Next to Hamburg
Posted: Wed 16 Jul '14 11:04 Post subject:
Well, a dynamic robots.txt may cause your Google ranking to sink, if not get you kicked out entirely: to check for cloaking, the Googlebot sometimes comes without a Google IP or the Googlebot User-Agent.
BUT with a CMS it would be possible to create a white list. That is an idea I like: disallow everything and have a white list from the CMS.
--- EDIT ---
Since it is more or less a modified sitemap.
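A CMS-driven whitelist like that could be generated along these lines. This is a sketch only; the function name and the list of public paths are made up for illustration, not taken from any particular CMS:

```python
def build_robots_txt(public_paths):
    """Build a whitelist robots.txt: allow only the given paths, disallow everything else."""
    lines = ["User-agent: *"]
    # One Allow line per page the CMS marks as public.
    lines += ["Allow: {}".format(path) for path in public_paths]
    # Everything not explicitly allowed stays hidden.
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"

# Example: only these CMS pages should be crawlable.
print(build_robots_txt(["/this/", "/other/", "/index.html"]))
```

The CMS would regenerate this whenever a page's visibility changes, which is what makes it behave like a modified sitemap.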