Automated XML Sitemap for a BIG site
-
Hi,
I would like to set up an automated sitemap for my site, but it has more than a million pages. It would need to be a sitemap index split by the different parts of the site (e.g. news, video), and I'll want a news sitemap and a video sitemap as well (of course). Does anyone have a recommended way of building this, and how often would you recommend updating it? For news and video, I would like it to be pretty immediate if possible, but the static pages don't need to be updated as often.
Thanks!
-
Another good reference:
http://googlewebmastercentral.blogspot.com/2014/10/best-practices-for-xml-sitemaps-rssatom.html
that points to how to ping:
http://www.sitemaps.org/protocol.html#submit_ping
specific search engine examples:
-
Excellent. Thank you! How would you ping Google when a sitemap is updated?
-
Yes, split them out. You will need an index sitemap, which is a sitemap that links to other sitemaps.
https://support.google.com/webmasters/answer/75712?vid=1-635768989722115177-4024498483&rd=1
Any single sitemap can list up to 50,000 URLs and can be no larger than 50MB uncompressed.
https://support.google.com/webmasters/answer/35738?hl=en&vid=1-635768989722115177-4024498483
Therefore, you could have an index sitemap with links to up to 50,000 other sitemaps. Each of those sitemaps could contain links to 50,000 URLs on your site.
If my math is right, that would be a max of 2,500,000,000 URLs if you have 50,000 sitemaps of 50,000 URLs each.
(Interesting side note: Google allows up to 500 index sitemaps, so 2,500,000,000 pages x 500 = 1,250,000,000,000 URLs that you can submit to Google via sitemaps.)
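For a sense of scale on the original question, here is the same arithmetic as a minimal Python sketch. The two limits are the ones quoted above; the function names are just made up for illustration.

```python
# Limits quoted in this thread (per sitemap file and per sitemap index).
MAX_URLS_PER_SITEMAP = 50_000
MAX_SITEMAPS_PER_INDEX = 50_000


def sitemaps_needed(total_urls: int) -> int:
    """Number of sitemap files needed to list total_urls URLs (rounded up)."""
    return -(-total_urls // MAX_URLS_PER_SITEMAP)  # integer ceiling division


def index_capacity() -> int:
    """Maximum number of URLs reachable through one full sitemap index."""
    return MAX_URLS_PER_SITEMAP * MAX_SITEMAPS_PER_INDEX
```

By this count a site with 1,000,000 pages needs at least 20 sitemap files, which is far below what one index can hold.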
How you divide up your content into sitemaps depends on how you organize the pages on your site, so you are on the right track in breaking out the sitemaps by type of content. Depending on how big any one section of the site is, you may need more than one sitemap for that type, e.g. articlesitemap1.xml, articlesitemap2.xml, etc. You get the idea.
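A rough sketch of that splitting in Python, using only the standard library. It assumes you already have the list of URLs for one section in memory; the function names and the example.com addresses are hypothetical, and a real setup for 1M+ pages would more likely stream URLs from a database than hold them in a list.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit quoted above


def build_sitemap(urls):
    """Return the XML text of one sitemap file listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def build_section(prefix, urls, size=MAX_URLS):
    """Split one section's URLs into numbered files: prefix1.xml, prefix2.xml, ..."""
    files = {}
    for n, start in enumerate(range(0, len(urls), size), start=1):
        files[f"{prefix}{n}.xml"] = build_sitemap(urls[start:start + size])
    return files


def build_index(sitemap_urls):
    """Return the XML text of an index sitemap pointing at the given sitemap URLs."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = url
    return ET.tostring(index, encoding="unicode")
```

You would call build_section once per content type (articles, news, video), write each file out, and then pass the full public URLs of those files to build_index.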
It is recommended that you ping Google every time a page in a sitemap is updated so Google will come back and recrawl the sitemap. I don't run any sites with 1M URLs, but I do run several in the tens of thousands. We break them up by type and ping whenever we update a page in that group. You need to consider your crawl budget with Google, in that it may not crawl all 1M pages in your sitemaps very often. So for a group of pages split across articlesitemap1.xml, articlesitemap2.xml and articlesitemap3.xml, you may want to always add your newest URLs to the most recently created sitemap (i.e. articlesitemap3.xml). That way you are generally pinging Google about an update to a single sitemap in the group rather than all three.
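Here is a small Python sketch of the "newest URLs go into the newest sitemap" idea, plus building a ping URL in the `<searchengine_URL>/ping?sitemap=sitemap_url` form described on the sitemaps.org page linked earlier in this thread. The function names are hypothetical and the search engine base URL is a parameter you would have to supply yourself; check each search engine's current documentation before relying on pings, since support for them has changed over time.

```python
from urllib.parse import quote


def add_url(sitemaps, url, limit=50_000):
    """Append url to the newest sitemap, starting a new one when it is full.

    sitemaps is a list of URL lists, one per file (articlesitemap1.xml first).
    Returns the 1-based number of the only file that changed.
    """
    if not sitemaps or len(sitemaps[-1]) >= limit:
        sitemaps.append([])
    sitemaps[-1].append(url)
    return len(sitemaps)


def ping_url(search_engine_base, sitemap_url):
    """Build a ping URL in the sitemaps.org form, URL-encoding the sitemap address.

    search_engine_base is an assumption: supply whatever base URL the
    search engine documents for pings.
    """
    return f"{search_engine_base.rstrip('/')}/ping?sitemap={quote(sitemap_url, safe='')}"
```

Because add_url returns the number of the one file it touched, you only need to regenerate and ping that single sitemap after adding a page.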
My other thought is that, in addition to pinging Google only about the sitemaps that you have updated, you return a 304 server response for all sitemaps that have not been updated. 304 means "not modified" since the last visit. One of your challenges will be your crawl budget with Google, so why make them recrawl a sitemap they have already crawled? You may want to consider a 304 on any URL on your site that has not changed since the last time Google visited.
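The 304 decision usually comes down to comparing the crawler's If-Modified-Since request header with the time the sitemap last changed. A minimal Python sketch of that check, not tied to any particular web server or framework; the function name is hypothetical, and it assumes you track a timezone-aware UTC "last modified" time per sitemap.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def sitemap_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a sitemap request.

    last_modified must be a timezone-aware datetime in UTC.
    if_modified_since is the raw request header value, if the client sent one.
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers
    return 200, headers
```

A 304 carries no body, so the crawler reuses the copy it already has and the crawl budget goes to sitemaps that actually changed.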
All of that said, as I mentioned above, I have not worked at the scale of 1M+ pages and would defer to others on the best way to approach it. The general thought process would be the same, though: figure out the best way to use your sitemaps to manage your crawl budget with Google. Small side note: if you have 1M+ pages and any of them come from things like sorting parameters, duplicate content or printer-friendly pages, you may want to just noindex them regardless, leave them out of the sitemap, and not allow Google to crawl them to start with.