Google Indexing Duplicate URLs: Ignoring Robots & Canonical Tags
-
Hi Moz Community,
We have the following robots.txt rule, which should prevent URLs with tracking parameters from being indexed.
Disallow: /*?
We have noticed that Google has started indexing pages that use tracking parameters. Example below.
These pages are identified as duplicate content yet have the correct canonical tags:
Because various affiliate feeds link to our site, we effectively have duplicate versions of every page, created by the tracking query. Google seems willing to index these, ignoring both the robots rules and the canonical tags.
Can anyone shed any light on the situation?
-
Google's multi-layered multi-algorithm system has come a long way in being able to "figure it all out", yet at the same time, falls far short of always successfully "getting it right".
Robots.txt files are no longer an absolute directive. They're now "just another signal", as are canonical tags, meta robots instructions, and Google's own Webmaster Tools URL Parameters system.
Because of this, it's critical to be consistent across all signals. If you've got the robots.txt file set to not index pages, but also have inbound links from affiliates, that's a prime example of where inbound link signals can override the robots.txt file's instruction if the links are not nofollowed.
While they technically SHOULD not index them after discovering them off-site (because the destination says "index this other version"), that's part of their confused multilayered system.
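To make the robots.txt side of this concrete, here is a minimal sketch of how a Google-style wildcard rule such as `Disallow: /*?` matches URLs. The matcher is written by hand for illustration, and the paths are hypothetical, not taken from the site. It only shows which URLs the rule blocks from being crawled. A blocked URL can still be discovered through affiliate links, and because the page is never fetched, the canonical tag on it cannot be read.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is matched literally, starting at the beginning
    of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path_and_query: str, disallow_pattern: str) -> bool:
    """Return True if the path (including any query string) matches the rule."""
    return robots_pattern_to_regex(disallow_pattern).match(path_and_query) is not None


# Hypothetical paths, for illustration only.
RULE = "/*?"
print(is_disallowed("/sofas/?ec=feed1", RULE))  # True: contains a query string
print(is_disallowed("/sofas/", RULE))           # False: no "?" in the path
```

Note that the rule matches every URL with a query string, not only the tracking parameter, so any parameterised URL you do want crawled would need its own `Allow` rule.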
I have a question though - from what limited information you've provided, this example is based on a url parameter of ?ec=
When I search Google using site:http://www.oakfurnitureland.co.uk/ inurl:ec
I see only three such pages that are "fully" indexed. All the rest (over 1,000 additional URLs) are in the Google system, but every one of them has a meta description of "A description for this result is not available because of this site's robots.txt - learn more."
What that means is they are NOT fully indexing those pages - there is no worry to be had about duplicate content for those. Google is simply tracking that those URLs exist.
So - is that the only URL parameter you're worried about? If so, it's not a major problem on your site. Except for those few exceptions, Google is doing what you need them to do with those.
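For the few exceptions, it may help to confirm which URL each tracking variant should declare as its canonical. A minimal sketch, assuming `ec` is the only tracking parameter (as in the example above) and using a hypothetical example.com URL rather than a real page:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "ec" is the only tracking parameter; add others as needed.
TRACKING_PARAMS = {"ec"}


def canonical_url(url: str) -> str:
    """Return the URL with tracking parameters (and any fragment) removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical URL, for illustration only.
print(canonical_url("http://www.example.com/sofas/?ec=feed1&page=2"))
```

If the canonical tag on a tracking URL does not match the output of a function like this, that is one inconsistent signal worth fixing first.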