Google Indexing Duplicate URLs: Ignoring Robots & Canonical Tags
-
Hi Moz Community,
We have the following robots.txt rule, which should prevent URLs with tracking parameters from being indexed:
Disallow: /*?
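For reference, here is a minimal sketch of how a Google-style wildcard rule such as this one is matched against a URL's path and query string. It assumes the documented Google behaviour where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function names and the sample paths are hypothetical, chosen only for illustration.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern into a regex.

    '*' matches any sequence of characters; a trailing '$' anchors the end.
    Everything else is matched literally, starting at the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for the wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path_and_query: str, pattern: str = "/*?") -> bool:
    """Return True if the path (plus query string) matches the Disallow pattern."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


if __name__ == "__main__":
    # Hypothetical paths, not taken from the site in question.
    print(is_disallowed("/sofas?ec=123"))  # True: the URL contains a '?'
    print(is_disallowed("/sofas"))         # False: no query string
```

Under that assumption, the rule matches any URL that contains a `?`, which covers every tracking-parameter URL but also any other parameterized page.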
We have noticed that Google has started indexing pages that use tracking parameters. Example below.
These pages are identified as duplicate content yet have the correct canonical tags:
With various affiliate feeds available for our site, we effectively have duplicate versions of every page due to the tracking query that Google seems to be willing to index, ignoring both robots rules & canonical tags.
Can anyone shed any light on the situation?
-
Google's multi-layered multi-algorithm system has come a long way in being able to "figure it all out", yet at the same time, falls far short of always successfully "getting it right".
Robots.txt files are no longer an absolute directive. They're now "just another signal", as are canonical tags, meta robots instructions, and Google's own URL Parameters system in Webmaster Tools.
Because of this, it's critical to be consistent across all signals. If you've got the robots.txt file set to keep pages out of the index, but affiliates also link to those pages, that's a prime example of where inbound link signals can override the robots.txt instruction if the links aren't nofollowed.
While Google technically SHOULD NOT index those pages after discovering them off-site (because the destination says "index this other version"), the fact that it sometimes does is part of its confused multilayered system.
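To make the "index this other version" signal concrete, here is a minimal sketch of how the canonical target for a tracked URL could be derived by dropping the tracking parameter and keeping everything else. It assumes the tracking parameter is named `ec`, as in the example discussed in this thread. The `TRACKING_PARAMS` set, the function name, and the sample URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: 'ec' is the affiliate tracking parameter; extend this set as needed.
TRACKING_PARAMS = {"ec"}


def canonical_url(url: str) -> str:
    """Return the URL with tracking parameters removed and the fragment dropped."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    # Hypothetical URLs for illustration only.
    print(canonical_url("http://www.example.com/sofas?ec=aff1"))
    print(canonical_url("http://www.example.com/sofas?ec=aff1&page=2"))
```

Every affiliate variant of a page should then carry the same canonical URL, which is the kind of consistency across signals described above.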
I have a question, though. From the limited information you've provided, this example is based on a URL parameter of ?ec=
When I search Google using site:http://www.oakfurnitureland.co.uk/ inurl:ec
I see only three such pages that are "fully" indexed. All the rest (over 1,000 additional URLs) are in the Google system; however, every one of them has a meta description of "A description for this result is not available because of this site's robots.txt - learn more."
What that means is they are NOT fully indexing those pages - there is no worry to be had about duplicate content for those. Google is simply tracking that those URLs exist.
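If you want to repeat that check across a larger set of results, the split can be done mechanically. The sketch below assumes you have already collected (URL, description) pairs from the search results by whatever means; the function name and the sample data are hypothetical, and only the snippet text comes from the results quoted above.

```python
from typing import Iterable, List, Tuple

# Snippet Google shows for URLs it knows about but could not crawl.
ROBOTS_BLOCKED_SNIPPET = (
    "A description for this result is not available because of this site's robots.txt"
)


def split_results(results: Iterable[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Separate fully indexed URLs from those listed only as blocked URLs."""
    fully_indexed: List[str] = []
    url_only: List[str] = []
    for url, description in results:
        if description.startswith(ROBOTS_BLOCKED_SNIPPET):
            url_only.append(url)
        else:
            fully_indexed.append(url)
    return fully_indexed, url_only


if __name__ == "__main__":
    # Hypothetical sample data, not real search results.
    sample = [
        ("http://www.example.com/sofas?ec=aff1", "Browse our range of sofas."),
        ("http://www.example.com/beds?ec=aff2", ROBOTS_BLOCKED_SNIPPET + " - learn more."),
    ]
    fully, blocked = split_results(sample)
    print(len(fully), len(blocked))
```

Only the URLs in the first list are candidates for a duplicate content problem.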
So - is that the only URL parameter you're worried about? If so, it's not a major problem on your site. Except for those few exceptions, Google is doing what you need them to do with those.