How to block "print" pages from indexing

dreadmichael

I have a fairly large FAQ section and every article has a "print" button. Unfortunately, this is creating a page for every article which is muddying up the index - especially on my own site using Google Custom Search.

Can you recommend a way to block this from happening?

Example Article:

http://www.knottyboy.com/lore/idx.php/11/183/Maintenance-of-Mature-Locks-6-months-/article/How-do-I-get-sand-out-of-my-dreads.html

Example "Print" page:

http://www.knottyboy.com/lore/article.php?id=052&action=print

NakulGoyal

Donnie, I agree. However, we had the same problem on a website, and here's what we did: we used the canonical tag.

Over a period of 3-4 weeks, all those print pages disappeared from the SERP. Now if I take a print URL and do a cache: for that page, it shows me the web version of that page.

So yes, I agree the question was about blocking the pages from getting indexed. There's no real recipe here; it's about getting the right solution. Before the canonical tag, robots.txt was the only solution. But now with canonical available (provided one has the time and resources to implement it vs. adding one line of text to robots.txt), you can effectively 301 the pages and not have to stop/restrict the spiders from crawling them.

Absolutely no offence to your solution in any way. Both are indeed workable solutions. The best part is that your robots.txt solution takes 30 seconds to implement since you provided the actual disallow code :), so it's better.

dreadmichael

Thanks Jennifer, will do! So much good information.

Dr-Pete

Sorry, but I have to jump in - do NOT use all of those signals simultaneously. You'll make a mess, and they'll interfere with each other. You can try Robots.txt or NOINDEX on the page level - my experience suggests NOINDEX is much more effective.

Also, do not nofollow the links yet - you'll block the crawl, and then the page-level cues (like NOINDEX) won't work. You can nofollow later. This is a common mistake and it will keep your fixes from working.

jennita

Josh, please read my and Dr. Pete's comments below. Don't nofollow the links, but do use the meta noindex,follow on the page.

Dr-Pete

Rel-canonical, in practice, does essentially de-index the non-canonical version. Technically, it's not a de-indexation method, but it works that way.

dreadmichael

You are right Donnie. I've "good answered" you too.

I've gone ahead and updated my robots.txt file. As soon as I am able, I will use noindex on the page, nofollow on the links, and rel=canonical.

This is just what I needed, a quick fix until I can make a more permanent solution.

SEODinosaur

You're welcome : )

SEODinosaur

Although you are correct... there is still more than one way to skin a chicken.

SEODinosaur

But the spiders still crawl the page and read the canonical link; with robots.txt, the spiders will not.

SEODinosaur

Yes, but rel=canonical does not block a page; it only tells Google which page to follow out of two pages. The question was how to block, not how to tell Google which link to follow. I believe you gave credit to the wrong answer.

http://en.wikipedia.org/wiki/Canonical_link_element

This is not fair. lol

Dr-Pete

I have to agree with Jen - Robots.txt isn't great for getting indexed pages out. It's good for prevention, but tends to be unreliable as a cure. META NOINDEX is probably more reliable.

One trick - DON'T nofollow the print links, at least not yet. You need Google to crawl and read the NOINDEX tags. Once the ?print pages are de-indexed, you could nofollow the links, too.
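
A minimal sketch of that sequencing, for illustration only. The anchor markup is an assumption (the site's actual print button markup isn't shown in this thread); the URL is the example print URL from the question:

```html
<!-- While the print pages are still indexed: leave the link crawlable,
     so Google can reach the print page and read its NOINDEX tag -->
<a href="article.php?id=052&amp;action=print">Print</a>

<!-- Later, once the print pages have dropped out of the index (optional) -->
<a href="article.php?id=052&amp;action=print" rel="nofollow">Print</a>
```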

NakulGoyal

Yes, it's strongly recommended. It should be fairly simple to populate this tag with the "full" URL of the article based on the article ID. This approach will not only help you get rid of the duplicate content issue; a canonical tag also essentially works like a 301 redirect. So from the search engines' perspective, you are 301'ing your print pages to the real web URLs without redirecting the actual users who are browsing the print pages.

dreadmichael

Ya it is actually really useful. Unfortunately they are out of business now - so I'm hacking it on my own.

I will take your advice. I've shamefully never used rel=canonical before - so now is a good time to start.

jennita

True, but using robots.txt does not keep them out of the index. Only using "noindex" will do that.

dreadmichael

Thanks Donnie. Much appreciated!

NakulGoyal

I actually remember Lore from a while ago. It's an interesting, easy to use FAQ CMS.

Anyways, I would also recommend implementing canonical tags for any possible duplicate content issues. So whether it's the print or the web version, each one of them will contain a canonical tag pointing to the web URL of that article in the <head> section of the page.

<link rel="canonical" href="http://www.knottyboy.com/lore/idx.php/11/183/Maintenance-of-Mature-Locks-6-months-/article/How-do-I-get-sand-out-of-my-dreads.html" />

SEODinosaur

http://www.seomoz.org/learn-seo/robotstxt

SEODinosaur

Try this:

User-agent: *
Disallow: /*&action=print

SEODinosaur

There's more than one way to skin a chicken.

jennita

Rather than using robots.txt, I'd add a noindex,follow tag to the page instead. This code goes into the <head> tag of each print page, and it will ensure that the pages don't get indexed but that the links are followed.
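
For reference, the noindex,follow tag is the standard robots meta tag. A minimal sketch of how it might sit in the head of a print page; the title text is a placeholder, and the rest of the print template is assumed rather than taken from the actual site:

```html
<head>
  <title>Printable version</title>
  <!-- Keep this page out of the index, but let crawlers follow its links -->
  <meta name="robots" content="noindex,follow">
</head>
```

The tag only takes effect if crawlers can fetch the page, so the print URLs should not be disallowed in robots.txt while relying on it.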

dreadmichael

That would be great. Do you mind giving me an example?

SEODinosaur

You can block every page that ends in action=print in robots.txt.
