Robots.txt: how to exclude sub-directories correctly?

fablau

Hello here,

I am trying to figure out the correct way to tell SEs to crawl this:

http://www.mysite.com/directory/

But not this:

http://www.mysite.com/directory/sub-directory/

or this:

http://www.mysite.com/directory/sub-directory2/sub-directory/...

But since I have thousands of sub-directories with almost infinite combinations, I can't list them all in a manageable way with definitions like these:

disallow: /directory/sub-directory/

disallow: /directory/sub-directory2/

disallow: /directory/sub-directory/sub-directory/

disallow: /directory/sub-directory2/subdirectory/

etc...

I would end up having thousands of definitions to disallow all the possible sub-directory combinations.

So, is the following a correct, better, and shorter way to define what I want?

allow: /directory/$

disallow: /directory/*

Would the above work?

Any thoughts are very welcome! Thank you in advance.

Best,

Fab.

MickEdwards

I mentioned both. You add a meta robots to noindex and remove from the sitemap.

sjunaidali

But Google is still free to index a link/page even if it is not included in the XML sitemap.

MickEdwards

Install the Yoast WordPress SEO plugin and use it to restrict what is indexed and what is allowed in the sitemap.

sjunaidali

I am using WordPress with the Enfold theme (ThemeForest).

I want some files to be accessible to Google, but they should not be indexed.

Here is an example: http://prntscr.com/h8918o

I have currently blocked some JS directories/files using robots.txt (check screenshot).

But because of this I am not able to pass the Mobile-Friendly Test on Google: http://prntscr.com/h8925z (check screenshot)

Is it possible to allow access but use something like a noindex tag in the robots.txt file? Or is there any other way around this?

fablau

Yes, everything looks good. Webmaster Tools gave me the expected results with the following directives:

allow: /directory/$

disallow: /directory/*

Which allows this URL:

http://www.mysite.com/directory/

But doesn't allow the following one:

http://www.mysite.com/directory/sub-directory2/...

This page also gives an update similar to mine:

https://support.google.com/webmasters/answer/156449?hl=en

I think I am good! Thanks
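
For anyone who wants to sanity-check the same rules offline, here is a minimal Python sketch of the matching behaviour Google documents: `*` matches any run of characters, a trailing `$` anchors the end of the path, the longest matching pattern wins, and an allow rule wins a tie. The helper names `pattern_to_regex` and `is_allowed` are made up for this illustration, and other crawlers may resolve conflicts differently, so treat it as a rough model and not as a replacement for the Webmaster Tools tester.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex (hypothetical helper)."""
    # A trailing '$' means the pattern must match up to the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any sequence of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    Assumes Google-style precedence: the longest matching pattern wins,
    and an allow rule wins when two matching patterns have equal length.
    A path that matches no rule is allowed.
    """
    best_len = -1
    best_allow = True
    for directive, pattern in rules:
        # re.match anchors at the start of the path, like robots.txt prefixes.
        if pattern_to_regex(pattern).match(path):
            allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len = len(pattern)
                best_allow = allow
    return best_allow


rules = [("allow", "/directory/$"), ("disallow", "/directory/*")]

print(is_allowed("/directory/", rules))                # True
print(is_allowed("/directory/sub-directory/", rules))  # False
print(is_allowed("/directory/sub-directory2/sub-directory/", rules))  # False

# With the disallow rule alone, the /directory/ index itself is blocked too.
print(is_allowed("/directory/", [("disallow", "/directory/*")]))  # False
```

Under these assumptions the allow line is what keeps the /directory/ index crawlable, because the disallow pattern on its own also matches /directory/.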

fablau

Thank you Michael, it is my understanding then that my idea of doing this:

allow: /directory/$

disallow: /directory/*

should work just fine. I will test it within Google Webmaster Tools and let you know if any problems arise.

In the meantime, if anyone else has more ideas about all this and can confirm it, that would be great!

Thank you again.

MickEdwards

I've always stuck to Disallow and followed -

"This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:"

http://www.robotstxt.org/robotstxt.html

Going by https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt this seems contradictory:

| /* | equivalent to / | equivalent to / | Equivalent to "/" -- the trailing wildcard is ignored. |

I think this post will be very useful for you - http://moz.com/community/q/allow-or-disallow-first-in-robots-txt

fablau

Thank you Michael,

Google and other SEs actually recognize the "allow:" command:

https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

The fact is: if I don't specify that, how can I be sure that the following single command:

disallow: /directory/*

doesn't prevent SEs from spidering the /directory/ index, as I'd like them to?

MickEdwards

As long as you don't have directories somewhere under /* that you want indexed, I think that will work. There is no allow, so you don't need the first line, just:

disallow: /directory/*

You can test it out here: https://support.google.com/webmasters/answer/156449?rd=1
