Indexed pages
-
Just started a site audit and trying to determine the number of pages on a client site and whether there are more pages being indexed than actually exist. I've used four tools and got four very different answers...
- Google Search Console: 237 indexed pages
- Google search using site command: 468 results
- MOZ site crawl: 1013 unique URLs
- Screaming Frog: 183 page titles, 187 URIs (note this is a free licence, but should cut off at 500)
Can anyone shed any light on why they differ so much? And where lies the truth?
-
Another option is if the site uses a CMS. If so, then you can create a sitemap for content pages/posts etc,.
Personally, I'm with Krzysztof Furtak on SF. Screaming Frog rocks. It'll find most pages, except perhaps Orphan pages as it wouldn't be able to find a link to crawl to discover the page.
If it's really important to get as many pages as possible, I'd do the following (I've put an Astrix (*) next to ones that some people may think are a tad extreme)
- Run a Screaming Frog crawl
- Grab a sitemap from your CMS
- Check any server-based analytics (AWSTATS etc)
- Check your access_log file & parse out URLs in there**(*)**
- site: queries, with & without www, and also using * as a subdomain (use something like Moz's toolbar to export)
- As Krzysztof suggests, Scrapebox would extract data too, but be careful scraping, you may get an IP slap.(*)
- Export crawl data from Moz & a tool such as Deep Crawl
- Throw the pages from all into Excel and de-dupe.
- Once you have a de-duped list, as an optional last step, go back to Screaming Frog and enter list mode (I have the paid version, not sure if it's possible with the free one) and run a crawl over all the de-duped URLs to get status codes etc
If you're going to do this sort of thing a fair bit - buy a Screaming Frog license, it's an awesome tool and can be useful in a multitude of situations.
-
The site: command is handy for asking Google what pages it knows about, however if Muzzmoz wants to know the number of pages on a site, you'd need more than this.
Also, re: your different ways or querying, I like to use:
site:*.domain.com - This can show other subdomains too, that may otherwise be missed
-
Ok so check with site something under 1000 pages and go to the last results page. You'll see that there'll be different number (in almost all cases).
-
I Will Always Prefer To Check Manually Using Site Command Because, site: operator, which will show us how many pages Google currently has indexed for the domain.
There Will Be Difference Between Index status in search console and current index as search console update the data after few days.
The number of indexed URLs is almost always significantly smaller than the number of crawled URLs, because Total indexed excludes URLs identified as duplicates, non-canonical or those that contain a meta no index tag.
Also, Check For Index(Preferred) Version Of Your Site
For E.g-
You can check More About this Here - https://support.google.com/webmasters/answer/2642366?hl=en
-
Hi
Most accurate number is from screaming frog (if you have less than 500 pages or paid version if more than 500).
Google indexes what it wants and if good enough to show in google index. If some pages are similar, got quality issues, blocked by robots etc then it won't show all. BTW don't think number in GSC or google index is good, check it manually because there can be 468 but in fact 200 only.
Moz can have "historical" pages that now don't exists or don't care about quality issues.
The truth is in screaming frog - most accurate number. If you used google user agent then number is the max that can appear in google index. If screaming frog user agent with turned off robots then you'll see bigger number (but google won't show it because of blocks).
If you want to check what's indexed then use tool like scrapebox. First get all urls (maybe without images if you don't care), then check indexed with sb. What's not indexed, can have some issues.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Post Site Migration - thousands of indexed pages, 4 months after
Hi all, Believe me. I think I've already tried and googled for every possible question that I have. This one is very frustrating – I have the following old domain – fancydiamonds dot net. We built a new site – Leibish dot com and done everything by the book: Individual 301 redirects for all the pages. Change of address via the GWT. Trying to maintain and improve the old optimization and hierarchy. 4 months after the site migration – we still have to gain back more than 50% of our original organic traffic (17,000 vs. 35,500-50,000 The thing that strikes me the most that you can still find 2400 indexed pages on Google (they all have 301 redirects). And more than this – if you'll search for the old domain name on Google – fancydiamonds dot net you'll find the old domain! Something is not right here, but I have no explanation why these pages still exist. Any help will be highly appreciated. Thanks!
Technical SEO | | skifr0 -
GWT Images Indexing
Hi guys! How does normally take to get Google to index the images within the sitemap? I recently submitted a new, up to date sitemap and most of the pages have been indexed already, but no images have. Any reason for that? Cheers
Technical SEO | | PremioOscar0 -
Can up a page
I do my best to optimize the on-page parameters for my page www.lkeria.com/AADL-logement-Algerie.php for the kw "aadl" but i can't understand what Ii'm doing wrong (i desapear 2 mounths ago). The page is optimize (title, description, h1, h2 etc.) few links with different ancers, but google put a spamy site www[dot]aadl[dot]biz in top 3 ratheer my page. Can you give me some advice to fix this issue? What I am doing wrong? Tanks in advance
Technical SEO | | lkeria0 -
How to Stop Google from Indexing Old Pages
We moved from a .php site to a java site on April 10th. It's almost 2 months later and Google continues to crawl old pages that no longer exist (225,430 Not Found Errors to be exact). These pages no longer exist on the site and there are no internal or external links pointing to these pages. Google has crawled the site since the go live, but continues to try and crawl these pages. What are my next steps?
Technical SEO | | rhoadesjohn0 -
Instant Indexing
I've been working on a site for a while now, methodically building content and building trust and authority. Lately I've noticed that anything I publish there appears to be instantly indexed by Google, which surprises me. I haven't had this happen before so I'm curious. I'd be interested to hear the experience of others.
Technical SEO | | waynekolenchuk0 -
Top pages give " page not found"
A lot of my top pages point to images in a gallery on my site. When I click on the url under the name of the jpg file I get an error page not found. For instance this link: http://www.fastingfotografie.nl/architectuur-landschap/single-gallery/10162327 Is this a problem? Thanks. Thomas. JkLej.png
Technical SEO | | thomasfasting0 -
Duplicate Page Content Lists the same page twice?
When checking my crawl diagnostics this morning I see that I have the error Duplicate page content. It lists the exact same url twice though and I don't understand how to fix this. It's also listed under duplicate page title. Personal Assistant | Virtual Assistant | Charlotte, NC http://charlottepersonalassistant.com/110 Personal Assistant | Virtual Assistant | Charlotte, NC http://charlottepersonalassistant.com/110 Does this have anything to do with a 301 redirect here? Why does it have http;// twice? Thanks all! | http://www.charlottepersonalassistant.com/ | http://http://charlottepersonalassistant.com/ |
Technical SEO | | eidna220 -
Non-www home page indexed, but www for rest of site
Hi there, grateful for any ideas on why this is happening: http://www.google.co.uk/search?q=site:www.vitispr.com vs http://www.google.co.uk/search?q=site:vitispr.com Google seems to be indexing and caching vitispr.com for our home page but the www. versions for everything else. As you can see the second query finds the home page. Any ideas why that might be? Other info that might be relevant: non-www etc. are all 301'd to www versions. moved domains/urls etc. around in March of this year and for a week or we were redirecting to the non-www version webmaster tools says 'www' preferred Thanks!
Technical SEO | | JaspalX0