Apache Lounge
Forum Index -> Apache
Topic: robots.txt and security
Author
Danll



Joined: 02 Aug 2013
Posts: 49
Location: USA, Houston

PostPosted: Mon 14 Jul '14 19:11    Post subject: robots.txt and security

It occurs to me that the robots.txt file, which is intended, in part, to tell search engines NOT to catalog certain files, is publicly available (which is how the search engines get it).

So anyone can download my robots.txt file and determine which files I don't want advertised on search engines. Then, I suppose, they can go and get those files.

In other words, it's trivial for people to figure out exactly which files I don't want widely advertised. So that means that, if someone knows where your site is, they know what you don't want the search engines to catalog.

That does seem a bit amusing, and it pretty much guarantees that the robots.txt file has nothing to do with what one might consider "security".
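To see just how trivial that harvesting is, here is a minimal Python sketch that pulls the Disallow entries out of a robots.txt body (the file content below is made up for the example):

```python
# Extract the paths a site asks crawlers NOT to index.
# The robots.txt content here is a made-up example.
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /private-reports/
Disallow: /old-backups/
"""

hidden_paths = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("disallow:")
]
print(hidden_paths)  # → ['/admin/', '/private-reports/', '/old-backups/']
```

Everything after Disallow comes out as a ready-made list of the paths the site owner would rather not advertise.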
glsmith
Moderator


Joined: 16 Oct 2007
Posts: 2268
Location: Sun Diego, USA

PostPosted: Mon 14 Jul '14 20:03

No, nor is security its intention. Security by obscurity has never been good security to begin with.
Danll



Joined: 02 Aug 2013
Posts: 49
Location: USA, Houston

PostPosted: Mon 14 Jul '14 21:04

Yes, "security by obscurity" (I like that phrase!) is a poor construct.

Of course, it is kind of funny that by telling search engines what you don't want people to see, you're telling people what you don't want them to see.

What you're really telling search engines is that you don't want people to be directed to certain things. But by doing so, you're advertising what those certain things are, and where people can find them.
glsmith
Moderator


Joined: 16 Oct 2007
Posts: 2268
Location: Sun Diego, USA

PostPosted: Tue 15 Jul '14 1:20

Come to think of it, we're in the 21st century, and this exact concern is probably what brought about some changes to search engines. For instance, say you have the subdirectories "this", "that" and "other", and all you want indexed is "this", "other" and your main index page. Today many of the major search engines accept this:

Code:
User-agent: *
Allow: /this/
Allow: /other/
Allow: /index.html
Disallow: /


You'll notice I'm not giving away the fact that the subdirectory "that" even exists in this example.

Search engines that have not adapted to this just don't index you, period. Looking at a day in the life of my server, it appears googlebot, msnbot and Yandex have adapted: I do not see any of them accessing things that were not specified with Allow.
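As a quick sanity check on how a compliant parser treats that allow-list, here is a sketch using Python's stdlib robotparser (the page paths are hypothetical, and real crawlers may interpret the rules differently):

```python
from urllib.robotparser import RobotFileParser

# Feed the allow-list robots.txt from the example above into Python's
# stdlib parser and probe it with some hypothetical paths.
rules = """\
User-agent: *
Allow: /this/
Allow: /other/
Allow: /index.html
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "/this/page.html"))    # True  - explicitly allowed
print(rp.can_fetch("*", "/that/secret.html"))  # False - falls through to Disallow: /
print(rp.can_fetch("*", "/index.html"))        # True
```

Note that "/that/" is never mentioned in the rules at all; it is denied only because everything not whitelisted falls through to the final Disallow.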
James Blond
Moderator


Joined: 19 Jan 2006
Posts: 7373
Location: Germany, Next to Hamburg

PostPosted: Tue 15 Jul '14 10:38

It is a good strategy to block Google from folders containing images which shall not be indexed. All other stuff, like downloads and folders, should be protected via the Apache config and/or your software / code.
Often bad people find your stuff by taking a look at robots.txt.
With the many files from a CMS, I wouldn't follow glsmith's strategy.
Danll



Joined: 02 Aug 2013
Posts: 49
Location: USA, Houston

PostPosted: Tue 15 Jul '14 20:20

glsmith wrote:


You'll notice I'm not giving away the fact that the subdirectory "that" even exists in this example.



That's a VERY good point. Basically blacklist everything, and then whitelist just what you want everyone to see. Thanks!
glsmith
Moderator


Joined: 16 Oct 2007
Posts: 2268
Location: Sun Diego, USA

PostPosted: Wed 16 Jul '14 3:37

The last line of James' comment got me thinking, especially since most CMSes use search-engine-friendly URLs. With a little config ingenuity, one could have a dynamic robots.txt file.

http://funwithrobots.linkpc.net:88

Hit the robots.txt file in your browser and see what it gives you. If you have Firefox and the User Agent Switcher extension (or similar), go again telling the server you are Googlebot. You will certainly see something different. Find the prize hidden inside.

Of course, there is still no real security here, as anyone can do the same. If you must completely secure something, don't put it on the server at all.

Password-protect it behind HTTPS at least, if it must be available. No spider will index it then. :-)
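One hedged sketch of how such a dynamic robots.txt could be wired up in Apache is a mod_rewrite rule keyed on the User-Agent; the file names here are hypothetical, and this is not necessarily how the demo site above does it:

```apache
# Serve a different robots.txt depending on who is asking.
# robots-googlebot.txt and robots-default.txt are hypothetical
# files in the document root.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^robots\.txt$ /robots-googlebot.txt [L]
RewriteRule ^robots\.txt$ /robots-default.txt [L]
```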
James Blond
Moderator


Joined: 19 Jan 2006
Posts: 7373
Location: Germany, Next to Hamburg

PostPosted: Wed 16 Jul '14 11:04

Well, a dynamic robots.txt can cause a drop in Google ranking, if not getting you kicked out of the index altogether. Sometimes the Googlebot arrives without the Google IP or the Googlebot User-Agent, precisely to check for cloaking.

BUT with the CMS it would be possible to create a whitelist. That is an idea I like: disallow everything and have the CMS generate the whitelist.

--- EDIT ---
Since it is, more or less, a modified sitemap.
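A minimal sketch of that idea in Python, assuming the CMS can hand back its list of public paths (the helper name and the path list are made up for illustration):

```python
# Build an allow-list robots.txt from a CMS's list of public pages.
# public_paths stands in for whatever the CMS actually returns.
public_paths = ["/articles/", "/downloads/", "/index.html"]

def build_robots_txt(paths):
    lines = ["User-agent: *"]
    lines += [f"Allow: {p}" for p in paths]
    lines.append("Disallow: /")  # everything not whitelisted stays unnamed
    return "\n".join(lines) + "\n"

print(build_robots_txt(public_paths))
```

The CMS regenerates this whenever a page is published or unpublished, so the whitelist tracks the sitemap without ever naming the private paths.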

