Best practices for robots.txt -- allow one page but not the others?

nicole.healthline

We have a page like domain.com/searchhere, but its results are being crawled (and shouldn't be). The result URLs look like domain.com/searchhere?query1. If I block /searchhere? in robots.txt, will that also block crawlers from the single page /searchhere? I still want that page to be indexed.

What is the recommended best practice for this?

RyanKent

SEOmoz used to use Google Search for the site. I am confident Google has a solid method for keeping their own results clean.

It appears SEOmoz recently changed their search widget. If you examine the URL you shared, notice none of the search results actually appear in the HTML of the page. For example, load the view-source URL and perform a find (CTRL+F) for "testing" which is the subject of the search. There are no results. Since the results are not in the page's HTML, they would not get indexed.
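To illustrate the point above, here is a minimal Python sketch that splits the search URL from this thread into the part a server receives and the part that stays in the browser. Everything after `#` is a URL fragment, which is not sent in the HTTP request, so the results have to be filled in client-side. The function name `describe_search_url` is hypothetical, and reading `stq` as the search term is an assumption based on the URL alone.

```python
from urllib.parse import urlsplit, parse_qs


def describe_search_url(url: str) -> dict:
    """Split a URL into what the server sees and what stays in the browser."""
    parts = urlsplit(url)
    # Only the path (plus any ?query) is part of the HTTP request.
    sent_to_server = parts.path + (f"?{parts.query}" if parts.query else "")
    # The fragment (after '#') is handled client-side, e.g. by JavaScript.
    fragment_params = parse_qs(parts.fragment)
    return {"sent_to_server": sent_to_server, "fragment_params": fragment_params}


info = describe_search_url(
    "http://www.seomoz.org/pages/search_results#stq=testing&stp=1"
)
print(info["sent_to_server"])
print(info["fragment_params"])
```

Under this reading, every search on that page requests the same server-side URL, which would be consistent with the results being absent from the page's HTML source.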

RyanKent

If Google is viewing the search result pages as soft 404s, then yes, adding the noindex tag should resolve the problem.

nicole.healthline

And, because Google can currently crawl these search result pages, a number of soft 404 pages are popping up. Would adding a noindex tag to these pages fix the issue?

nicole.healthline

Thanks for the links and help.

How does SEOmoz keep search results from being indexed? They don't block search results with robots.txt, and it doesn't appear that they add the noindex tag to the search result pages (e.g. view-source:http://www.seomoz.org/pages/search_results#stq=testing&stp=1).

john4math

Yeah, but Ryan's answer is the best one if you can go that route.

RyanKent

Hi Michelle,

The concept of crawl efficiency is highly misunderstood. Are all your site's pages being indexed? Are new content and changes indexed in a timely manner? If so, that would indicate your site is being crawled efficiently.

Regarding the link you shared, you are on the right track but need to dig a bit deeper. On the page you shared, find the discussion related to robots.txt. There is a link which will lead you to the following page:

https://developers.google.com/webmasters/control-crawl-index/docs/faq#h01

There you will find a more detailed explanation along with several examples of when not to use robots.txt.

robots.txt: Use it if crawling of your content is causing issues on your server. For example, you may want to disallow crawling of infinite calendar scripts. You should not use the robots.txt to block private content (use server-side authentication instead), or handle canonicalization (see our Help Center). If you must be certain that a URL is not indexed, use the robots meta tag or X-Robots-Tag HTTP header instead.
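As a rough illustration of the X-Robots-Tag option mentioned in that quote, the Python sketch below builds response headers for a page and adds the header only for search result pages. The function name `response_headers` and the `is_search_result` flag are hypothetical; how you attach headers depends on your server or framework.

```python
def response_headers(is_search_result: bool) -> dict[str, str]:
    """Build HTTP response headers, asking crawlers not to index result pages."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if is_search_result:
        # Header equivalent of <meta name="robots" content="noindex">.
        headers["X-Robots-Tag"] = "noindex"
    return headers


print(response_headers(is_search_result=True))
print(response_headers(is_search_result=False))
```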

SEOmoz offers a great guide on this topic as well: http://www.seomoz.org/learn-seo/robotstxt

If you desire to go beyond the basic Google and SEOmoz explanation and learn more about this topic, my favorite article related to robots.txt, written by Lindsay, can be found here: http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions

nicole.healthline

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769

nicole.healthline

Hi Ryan,

Wouldn't that cause issues with crawl efficiency?

Also, webmaster guidelines say "Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines."

nicole.healthline

Thank you. Are you sure about that?

anhvietprotocol

What about using the "Canonical URL" tag?

You can put this code: in /searchhere? page.

RyanKent

The best practice would be to add the noindex tag to the search result pages but not the /searchhere page.

Typically speaking, the best robots.txt file is a blank one. The file should only be used as a last resort with respect to blocking content.
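One way to picture this suggestion is the Python sketch below, which decides per request whether to emit a noindex meta tag. It assumes the search page lives at `/searchhere` and that any query string on that path means a result page, as in the original question. The function name `robots_meta_for` is hypothetical, and the returned tag would go in the page's `<head>`.

```python
from urllib.parse import urlsplit

NOINDEX_TAG = '<meta name="robots" content="noindex">'


def robots_meta_for(url: str, search_path: str = "/searchhere") -> str:
    """Return a noindex meta tag for search result URLs, else an empty string."""
    parts = urlsplit(url)
    # Result pages share the search page's path but carry a query string.
    if parts.path == search_path and parts.query:
        return NOINDEX_TAG
    return ""


print(repr(robots_meta_for("http://domain.com/searchhere")))
print(repr(robots_meta_for("http://domain.com/searchhere?query1")))
```

For the tag to have any effect, crawlers must be able to fetch the result pages, so those URLs should not also be disallowed in robots.txt.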

john4math

What you outlined sounds to me like it should work. Disallowing /searchhere? shouldn't disallow the top-level search page at /searchhere, but should disallow all the search result pages with queries after the ?.
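The reasoning can be checked with a small Python sketch of prefix matching, which is the basic way a Disallow rule is compared against a URL's path and query. This is a deliberately simplified model, not a full robots.txt parser: it ignores Allow rules, user-agent groups, and the `*` and `$` wildcards that major crawlers also support. The function name `is_disallowed` is hypothetical.

```python
def is_disallowed(path_and_query: str, disallow_rules: list[str]) -> bool:
    """Return True if any non-empty Disallow rule is a prefix of the URL."""
    # An empty Disallow value blocks nothing, so skip empty rules.
    return any(rule and path_and_query.startswith(rule) for rule in disallow_rules)


rules = ["/searchhere?"]
for candidate in ("/searchhere", "/searchhere?query1"):
    print(candidate, "blocked" if is_disallowed(candidate, rules) else "allowed")
```

Because `/searchhere` is shorter than the rule `/searchhere?`, the rule cannot be a prefix of it, so only the URLs that carry a query string match.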
