Total Indexed 1.5M vs 83k submitted by sitemap. What?
-
We recently took a close look at one of our content sites' sitemap and cut out a lot of cruft that had accumulated, such as the .php, .xml, and .htm versions of each page. We also moved images out into a separate image sitemap.
The sitemap contains 83,000+ URLs for Google to crawl (generated partly with the Yoast WordPress plugin).
In Webmaster Tools, the Index Status section shows a total index of 1.5 million pages for this site.
With our sitemap coming in at 83k and Google indexing 1.5 million pages, is this a sign of a CMS gone rogue? Does it indicate we could be pumping out error pages, empty templates, or junk pages and feeding them to Googlebot?
I would love to hear what you guys think. Is this normal? Is this something to be concerned about? Should our total index more closely match our sitemap page count?
-
As well as the parameters already mentioned, you may have heaps of duplicate categories, tags, etc. What I would also do is search Google with queries like site:www.example.com/directory/ or site:www.example.com/category/directory/directory/ so you tightly narrow down the results, then switch to 100 results per page and manually look for clues.
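If the sitemap is handy, one way to decide which directories are worth probing with those narrowed site: queries is to count the sitemap's URLs per top-level directory first. A minimal sketch in Python; the sitemap fragment and domain here are made up for illustration (in practice you'd read your real sitemap.xml):

```python
from collections import Counter
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

# Toy sitemap fragment standing in for a real sitemap.xml.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/category/widgets/</loc></url>
  <url><loc>http://www.example.com/category/gadgets/</loc></url>
  <url><loc>http://www.example.com/blog/post-1/</loc></url>
</urlset>"""

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_per_directory(sitemap_xml):
    """Count sitemap URLs by their first path segment."""
    root = ET.fromstring(sitemap_xml)
    counts = Counter()
    for loc in root.iter(NS + "loc"):
        path = urlparse(loc.text.strip()).path
        segments = [s for s in path.split("/") if s]
        counts[segments[0] if segments else "/"] += 1
    return counts

print(urls_per_directory(SITEMAP))  # e.g. Counter({'category': 2, 'blog': 1})
```

Directories that Google returns far more results for than the sitemap lists are the ones to dig into manually.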
-
If you have 1.5 million pages and you think your sitemap is comprehensive at 83,000 then yes, your CMS is needlessly generating pages. It's usually not a big deal from a ranking standpoint, but it can make other important issues hard to detect. I would clean it up, but that's a business call you'll have to make.
The first step is diagnosing where the URLs are coming from. What you do next will depend on what you find, but here is the best advice I can give without knowing what types of extraneous URLs you have and how Google is treating them:
First, I'd start with WMT > Crawl > URL Parameters. Quite often your CMS will generate parameterized URLs, and Google usually knows how to handle them. If there are a lot of URL parameters, search Google for them and see whether those pages are exactly the same as other pages. If they are, make sure you have canonical tags in place pointing them to the main version. There's more you can do with parameters, but it depends on what you find, so I won't go into more detail. As a general rule, though, a CMS should not generate a page unless it is uniquely useful as a differentiated landing page or a page for people to link to.
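To get a rough feel for how many indexed URLs are just parameter variations of the same page, you can group a URL sample (from a crawl or server log) by path with the query string stripped. A quick sketch with hypothetical URLs:

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

def group_by_canonical_path(urls):
    """Group URLs by scheme+host+path, ignoring query string and fragment."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        key = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        groups[key].append(url)
    return groups

# Hypothetical URLs pulled from a crawl or server log.
sample = [
    "http://www.example.com/widgets/?sort=price",
    "http://www.example.com/widgets/?sort=name&page=2",
    "http://www.example.com/widgets/",
    "http://www.example.com/about/",
]
groups = group_by_canonical_path(sample)
for page, variants in groups.items():
    if len(variants) > 1:
        # Pages with many variants are candidates for a canonical tag.
        print(page, "->", len(variants), "variants")
```

If a handful of paths account for most of the variants, canonical tags on those templates will cover most of the bloat.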
Also check for parameters in your analytics program. They could be skewing your pageview data, depending on how you report. There's a post on fixing that in GA here:
http://blog.crazyegg.com/2013/03/29/remove-url-parameters-from-google-analytics-reports/
Next, I'd look at the "Advanced" tab in WMT > Google Index > Index Status. Are there a lot of URLs removed? If so, check on those pages and see why they're removed and why they exist.
I would also run a crawl with Xenu or Screaming Frog to make sure crawlers are finding a reasonable number of pages and not getting stuck in crawl loops (crawling variations of a page endlessly). These kinds of issues can prevent new pages from being indexed on time because Google wastes time (your crawl budget) running in circles.
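One cheap heuristic for spotting crawl loops in a crawler export is to flag URLs whose paths repeat the same segment over and over, which is a common symptom of a relative-link bug. A rough sketch, with made-up URLs:

```python
from urllib.parse import urlparse

def looks_like_crawl_loop(url, max_repeats=2):
    """Flag URLs where any path segment appears more than max_repeats times."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(segments.count(s) > max_repeats for s in set(segments))

crawled = [
    "http://www.example.com/blog/page/2/",                 # normal pagination
    "http://www.example.com/blog/blog/blog/blog/page/2/",  # relative-link bug
]
for url in crawled:
    print(url, looks_like_crawl_loop(url))
```

Running something like this over the full list of crawled URLs will surface loop patterns much faster than eyeballing the export.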
-
Rob,
Your sitemap is only a hint to Google about the URLs on your domain. It does not limit Google to crawling or indexing only the URLs listed in it, nor is it a directive telling Google to remove URLs it has already crawled from the index. As stated in GWT, use **robots.txt** to specify how search engines should crawl your site, or request **removal** of URLs from Google's search results with the URL removal tool in Google Webmaster Tools under the "Google Index" link.
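If you do go the robots.txt route, it's worth sanity-checking what a rule actually blocks before deploying it. Python's standard library ships a parser for exactly this; a quick sketch with a made-up rule set:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules blocking junk pages under /product/.
rules = """
User-agent: *
Disallow: /product/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "http://www.example.com/product/123"))     # False
print(parser.can_fetch("Googlebot", "http://www.example.com/category/shoes"))  # True
```

Remember that robots.txt only stops crawling, not indexing; already-indexed URLs need the removal tool (or noindex) to drop out of the index.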