How can I best find out which URLs from large sitemaps aren't indexed?
-
I have about a dozen sitemaps with a total of just over 300,000 URLs in them. These have been carefully created to include only the content that I feel is above a certain threshold.
However, Google reports that it has indexed only 230,000 of these URLs. Now I'm wondering how best to work out which URLs haven't been indexed. Webmaster Tools (WMT) shows no errors related to these pages.
I could obviously start checking them manually, but surely there's a better way?
-
There's no obvious function for this in Webmaster Tools, but having looked around, there's this option:
http://www.aspfree.com/c/a/BrainDump/Extracting-Google-Indexed-Web-Site-Pages-Using-MS-Excel/
But Google only displays the first 1,000 URLs for a site: query, so you would need to adapt it and run it many times over narrower parts of the site. From the looks of it, there's no easy way.
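One way to picture "adapting it lots of times" is to split the sitemap URLs by path prefix and see how many narrower site: queries that would take. This is only a rough sketch: the function name `site_queries`, the `depth` parameter, and the example.com URLs are made up for illustration, and it just builds the query strings and counts. It doesn't query Google.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_queries(urls, depth=1):
    """Group URLs by host plus the first `depth` path segments.

    Returns a dict mapping a "site:host/prefix" query string to the number
    of input URLs that fall under it. (Hypothetical helper, not a Google API.)
    """
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        # Keep only the non-empty leading path segments.
        segments = [s for s in parts.path.split("/") if s][:depth]
        prefix = parts.netloc + "/" + "/".join(segments)
        counts[prefix] += 1
    return {f"site:{prefix}": n for prefix, n in counts.items()}


if __name__ == "__main__":
    example = [
        "http://www.example.com/blog/a",
        "http://www.example.com/blog/b",
        "http://www.example.com/shop/x/1",
    ]
    for query, n in site_queries(example).items():
        # Any prefix with more than 1,000 URLs would need a larger depth.
        print(query, n)
```

Any prefix that still holds more than 1,000 URLs would need splitting again with a larger `depth`, which is why this gets tedious for 300,000 URLs.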
There may be a tool out there that is similar to Xenu but also checks index status in Google. I've never needed one, so I'm not aware of any, but chances are something exists.
Good luck!
-
Any ideas on how to go about exporting the indexed URLs?
-
Hi Peter,
I'd try exporting both the indexed URLs and the actual URLs into an Excel file and then removing the duplicates.
You would need to look into it, but I'm sure there's a way of matching the two lists and removing the URLs that appear in both, which would leave the ones that aren't indexed.
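If Excel struggles with 300,000 rows, the same matching can be sketched in a few lines of Python. This assumes you have already got the indexed URLs into a plain list somehow (one URL per entry), which is the hard part and isn't solved here. The function names `sitemap_urls` and `not_indexed` are made up for the example. The comparison is an exact string match, so a trailing slash or an http/https difference will count as a different URL.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the set of <loc> values found in one sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def not_indexed(sitemap_set, indexed_urls):
    """Return the sitemap URLs that are missing from the indexed list, sorted."""
    indexed = {u.strip() for u in indexed_urls if u.strip()}
    return sorted(sitemap_set - indexed)


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://www.example.com/a</loc></url>"
        "<url><loc>http://www.example.com/b</loc></url>"
        "</urlset>"
    )
    # Pretend only /a turned up in the exported list of indexed URLs.
    print(not_indexed(sitemap_urls(sample), ["http://www.example.com/a"]))
```

With a dozen sitemaps you would call `sitemap_urls` on each file and union the resulting sets before comparing.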
Other than that I wouldn't know.
Ben