Large site with faceted navigation using rel=canonical, but Google still has issues
-
First off, I just wanted to mention I did post this on one other forum so I hope that is not completely against the rules here or anything. Just trying to get an idea from some of the pros at both sources. Hope this is received well. Now for the question.....
"Googlebot found an extremely high number of URLs on your site:"
Gotta love these messages in GWT. Anyway, I wanted to get some other opinions here so if anyone has experienced something similar or has any recommendations I would love to hear them.
First off, the site is very large and uses faceted navigation to help visitors sift through results. For many months now I have had rel=canonical in place so that each page URL created by the faceted nav filters points back to the main category page. However, I still get these damn messages from Google every month or so saying that they found too many pages on the site. My main concern, obviously, is crawler time being wasted on all these pages when I am doing what Google asks in these instances: telling them to ignore these URLs and find the content on page x.
So at this point I am thinking about possibly using robots.txt file to handle these, but wanted to see what others around here thought before I dive into this arduous task. Plus I am a little ticked off that Google is not following a standard they helped bring to the table.
Thanks for those who take the time to respond in advance.
-
Yes, that's a different situation. You're now talking about pagination, where, quite rightly, a canonical to the parent page should not be used.
For faceted/filtered navigation it seems like canonical usage is indeed the right way to go about it, given Peter's experience just mentioned above, and the article you linked to that says, "...(in part because Google only indexes the content on the canonical page, so any content from the rest of the pages in the series would be ignored)."
-
As for my situation, it worked out quite nicely; I just wasn't patient enough. After about 2 months the issue corrected itself for the most part and I was able to get about a million "waste" pages out of the index. This is a very large site, so losing a million pages in a handful of categories helped me gain in a whole lot of other areas and spread the crawler around to more places that were important for us.
I also spent some time doing some restructuring of internal linking from some of our more authoritative pages that I believe also assisted with this, but in my case rel="canonical" worked out pretty nicely. Just took some time and patience.
-
I should actually add that Google doesn't condone using rel-canonical back to the main search page or page 1. They allow canonical to a "View All" or a complex mix of rel-canonical and rel=prev/next. If you use rel-canonical on too many non-identical pages, they could ignore it (although I don't often find that to be true).
Vanessa Fox just did a write-up on Google's approach:
http://searchengineland.com/implementing-pagination-attributes-correctly-for-google-114970
I have to be honest, though - I'm not a fan of Google's approach. It's incredibly complicated, easy to screw up, doesn't seem to work in all cases, and doesn't work on Bing. This is a very complex issue and really depends on the site in question. Adam Audette did a good write-up:
http://searchengineland.com/five-step-strategy-for-solving-seo-pagination-problems-95494
-
Thanks Dr Pete,
Yes I've used meta no-index on pages that are simply not useful in any way shape or form for Google to find.
I would be hesitant to noindex the filters in question, but it sounds promising that you are backing the canonical approach and that there is a latency in reporting. Our PA and DA are extremely high and we get crawled daily, so I'm curious about your measurement tip (inurl), which is a good one!
Many thanks.
Simon
-
I'm working on a couple of cases now, and it is extremely tricky. Google often doesn't re-crawl/re-cache deeper pages for weeks or months, so getting the canonical to work can be a long process. Still, it is generally a very effective tag and can happen quickly.
I agree with others that Robots.txt isn't a good bet. It also tends to work badly with pages that are already indexed. It's good for keeping things out of the index (especially whole folders, for example), but once 1000s of pages are indexed, Robots.txt often won't clean them up.
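To make the robots.txt trade-off concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rule set, the `/categoryname/` folder, and the helper name `is_crawlable` are assumptions for illustration only, and the standard-library parser does simple prefix matching, so it is not a stand-in for how Googlebot itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block everything under the category folder.
# That covers the faceted URLs but not the category page itself.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /categoryname/",
]


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the hypothetical rules above allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/categoryname",
        "http://www.example.com/categoryname/a",
    ):
        print(url, is_crawlable(url))
```

A rule like this stops the crawl of the faceted URLs, which also means a crawler that obeys it never gets to read a canonical tag on those pages.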
Another option is META NOINDEX, but it depends on the nature of the facets.
A couple of things to check:
(1) Using site: with inurl:, monitor the faceted navigation pages in the Google index. Are the numbers gradually dropping? That's what you want to see - the GWT error may not update very often. Keep in mind that these numbers can be unreliable, so monitor them daily over a few weeks.
(2) Are there other URLs you're missing? On a large e-commerce site, it's entirely possible this wasn't the only problem.
(3) Did you cut the crawl paths? A common problem is that people canonical, 301-redirect, or NOINDEX, but then nofollow or otherwise cut links to those duplicates. Sounds like a good idea, except that the canonical tag has to be crawled to work. I see this a lot, actually.
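For checks (1) and (3), a rough sketch of two helpers, assuming Python and only the standard library. `index_query` just builds the site:/inurl: search string to paste into Google, and `find_canonical` reads the rel=canonical target out of a page's HTML so you can confirm the tag is actually being served on a faceted URL. Both function names and the sample page are made up for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


def index_query(domain: str, url_fragment: str) -> str:
    """Build a site:/inurl: query string for manual index checks."""
    return f"site:{domain} inurl:{url_fragment}"


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated values
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


if __name__ == "__main__":
    sample_page = (
        "<html><head>"
        '<link rel="canonical" href="http://www.example.com/categoryname">'
        "</head><body></body></html>"
    )
    print(index_query("www.example.com", "categoryname/"))
    print(find_canonical(sample_page))
```

Fetching the live pages is left out on purpose; feed `find_canonical` whatever HTML your own crawler or browser saves.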
-
Did you find a solution for this? I have exactly the same issue and have implemented the rel canonical in exactly the same way.
The issue you are trying to address is improving crawl bandwidth/equity by not letting Google crawl these faceted pages.
I am thinking of Ajax loading in these pages to the parent category page and/or adding nofollow to the links. But the pages have already been indexed, so I wonder if nofollow will have any effect.
Have you had any progress? Any further ideas?
-
Because rel canonical does nothing more than give credit to the chosen page and avoid duplicate content. It does not tell the search engine to stop indexing or to redirect. As far as finding the links goes, it has no effect.
-
thx
-
OK, sorry I was thinking too many pages, not links.
Using noindex will not stop PR flowing; the search engine will still follow the links.
-
Yeah that is why I am not real excited about using robots.txt or even a no index in this instance. They are not session ids, but more like:
www.example.com/categoryname/a
www.example.com/categoryname/b
www.example.com/categoryname/c
etc
which would show all products that start with those letters. There are a lot of other filters too, such as color, size, etc, but the bottom line is I point all those back to just www.example.com/categoryname using rel canonical and am not understanding why it isn't working properly.
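The mapping described above can be sketched roughly like this in Python. It assumes the category is always the first path segment and that everything after it (plus any query string) is a filter; `canonical_for` and `canonical_link_tag` are hypothetical helper names, not anything from the site's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_for(url: str) -> str:
    """Return the category URL that a faceted URL should point back to.

    Assumption: the first path segment is the category and any further
    segments (letter, color, size, ...) are facet filters.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    category_path = "/" + segments[0] if segments else "/"
    # Query string and fragment are dropped as well.
    return urlunsplit((parts.scheme, parts.netloc, category_path, "", ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link> element to place in the <head> of the faceted page."""
    return f'<link rel="canonical" href="{canonical_for(url)}">'


if __name__ == "__main__":
    for faceted in (
        "http://www.example.com/categoryname/a",
        "http://www.example.com/categoryname/b",
    ):
        print(faceted, "->", canonical_link_tag(faceted))
```

Every faceted URL in a category ends up emitting the same tag, which is the behavior being described here.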
-
There are a large number of URLs like this because of the way the faceted navigation works. I have considered noindex, but I am somewhat concerned, as we do get links to some of these URLs and would like to maintain some of that link juice. The warning shows up in Google Webmaster Tools when Googlebot finds a large number of URLs. The rest of the message reads like this:
"Googlebot encountered extremely large numbers of links on your site. This may indicate a problem with your site's URL structure. Googlebot may unnecessarily be crawling a large number of distinct URLs that point to identical or similar content, or crawling parts of your site that are not intended to be crawled by Googlebot. As a result Googlebot may consume much more bandwidth than necessary, or may be unable to completely index all of the content on your site."
rel canonical should fix this, but apparently it does not.
-
Check how you are getting these pages.
Robots.txt is not an ideal solution. If Google finds links to these pages in other places, the URLs can still be picked up.
Normally print pages won't have link value, so you can noindex them.
If there are pages with session IDs or campaign codes, use canonical if they have link value. Otherwise noindex will be good.
-
The rel canonical will stop you getting duplicate content flags, but there is still a large number of pages; it's not going to hide them.
I have never seen this warning. How many pages are we talking about? Either the number is very, very high, or they are confusing the crawler. You may need to noindex them.