Sitemap Help!
-
Hi Guys,
Quick question regarding sitemaps. I am currently working on a huge site that has masses of pages.
I am looking to create a sitemap. How would you guys do this? I have looked at some tools, but they say they will only do up to roughly 30,000 pages. The site is so large it would be impossible to do this myself... any suggestions?
Also, how do I find out how many of my site's pages are actually indexed and how many are not?
Thank You all
Wayne
-
The problem I have with CMS-side sitemap generators is that they often pull entries from whatever pages currently exist and whatever links those pages contain. If links point to pages that are no longer there, as often happens with dynamic content, you'll be imposing 404s on yourself like crazy.
Just something to watch out for, but it's probably your best solution.
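One way to guard against that is to check that each URL still resolves before it goes into the sitemap. Here's a minimal sketch of the idea, assuming your generator has already produced an array of candidate URLs ($candidateUrls is a hypothetical name, not part of any particular CMS):
=== CODE STARTS HERE ===
<?php
// $candidateUrls is a hypothetical array produced by your sitemap generator.
$candidateUrls = array('http://www.yourdomain.com/page1', 'http://www.yourdomain.com/page2');

$liveUrls = array();
foreach ($candidateUrls as $url) {
    $headers = @get_headers($url);                      // fetch only the response headers
    if ($headers && strpos($headers[0], ' 200 ') !== false) {
        $liveUrls[] = $url;                             // keep only pages that return 200 OK
    }
}
// $liveUrls can now go into the sitemap without imposing 404s on yourself.
?>
=== CODE ENDS HERE ===
On a huge site you'd want to throttle or batch those requests, but the principle is the same.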
-
Hi! With this file, you can create a Google-friendly sitemap for any given folder almost automatically. No limits on the number of files. Please note that the code is courtesy of @frkandris, who generously helped me out when I had a similar problem. I hope it will be as helpful to you as it was to me.
- Copy/paste the code below into a text editor.
- Edit the beginning of the file: where you see seomoz.com, put your own domain name.
- Save the file as getsitemap.php and FTP it to the appropriate folder.
- Open the full URL in your browser: http://www.yourdomain.com/getsitemap.php
- The moment you do, a sitemap.xml will be generated in that folder.
- Refresh your FTP client and download the sitemap. Make further changes to it if you wish.
=== CODE STARTS HERE ===
<?php
define('DIRBASE', './');
define('URLBASE', 'http://www.seomoz.com/');

$isoLastModifiedSite = "";
$newLine = "\n";
$indent = "  ";

$xmlHeader   = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>$newLine";
$urlsetOpen  = "<urlset xmlns=\"http://www.google.com/schemas/sitemap/0.84\""
             . " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\""
             . " xsi:schemaLocation=\"http://www.google.com/schemas/sitemap/0.84"
             . " http://www.google.com/schemas/sitemap/0.84/sitemap.xsd\">$newLine";
$urlsetValue = "";
$urlsetClose = "</urlset>$newLine";

// Escape a URL so it is safe to place inside the XML
function makeUrlString($urlString) {
    return htmlentities($urlString, ENT_QUOTES, 'UTF-8');
}

// Convert a "Y-m-d H:i:s" date to an ISO 8601 timestamp
function makeIso8601TimeStamp($dateTime) {
    if (!$dateTime) {
        $dateTime = date('Y-m-d H:i:s');
    }
    if (is_numeric(substr($dateTime, 11, 1))) {
        $isoTS = substr($dateTime, 0, 10) . "T" . substr($dateTime, 11, 8) . "+00:00";
    } else {
        $isoTS = substr($dateTime, 0, 10);
    }
    return $isoTS;
}

// Build a single <url> entry for the sitemap
function makeUrlTag($url, $modifiedDateTime, $changeFrequency, $priority) {
    global $newLine, $indent, $isoLastModifiedSite;

    $urlTag  = "$indent<url>$newLine";
    $urlTag .= "$indent$indent<loc>" . makeUrlString($url) . "</loc>$newLine";
    if ($modifiedDateTime) {
        $urlTag .= "$indent$indent<lastmod>" . makeIso8601TimeStamp($modifiedDateTime) . "</lastmod>$newLine";
        if (!$isoLastModifiedSite) {
            // remember the last modification date of the whole site
            $isoLastModifiedSite = makeIso8601TimeStamp($modifiedDateTime);
        }
    }
    if ($changeFrequency) {
        $urlTag .= "$indent$indent<changefreq>" . $changeFrequency . "</changefreq>$newLine";
    }
    if ($priority) {
        $urlTag .= "$indent$indent<priority>" . $priority . "</priority>$newLine";
    }
    $urlTag .= "$indent</url>$newLine";
    return $urlTag;
}

// Recursively scan $base and collect every file and directory path
function rscandir($base = '', &$data = array()) {
    $array = array_diff(scandir($base), array('.', '..')); // drop . and ..
    foreach ($array as $value) {
        if (is_dir($base . $value)) {
            $data[] = $base . $value . '/';
            $data = rscandir($base . $value . '/', $data); // recurse into subdirectory
        } elseif (is_file($base . $value)) {
            $data[] = $base . $value;
        }
    }
    return $data;
}

// Turn a local file path into the matching public URL
function kill_base($t) {
    return URLBASE . substr($t, strlen(DIRBASE));
}

$dir = rscandir(DIRBASE);
$a   = array_map("kill_base", $dir);

foreach ($a as $key => $pageUrl) {
    $pageLastModified    = date("Y-m-d", filemtime($dir[$key]));
    $pageChangeFrequency = "monthly";
    $pagePriority        = 0.8;
    $urlsetValue .= makeUrlTag($pageUrl, $pageLastModified, $pageChangeFrequency, $pagePriority);
}

file_put_contents('sitemap.xml', $xmlHeader . $urlsetOpen . $urlsetValue . $urlsetClose);
?>
=== CODE ENDS HERE ===
-
HTML sitemaps are good for users; having 100,000 links on a page, though, not so much.
If you can (and certainly with a site this large), add video and image sitemaps too; they'll help Google get around your site.
-
Is there any way I can see pages that have not been indexed?
Not that I can tell, and using site: isn't going to be feasible on a site this large, I guess.
Is it more beneficial to include various sitemaps or just the one?
Well, the max file size is 50,000 URLs or 10 MB uncompressed (you can gzip them), so if you have more than 50,000 URLs you'll have to split them up.
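If it helps, here's a rough sketch of the splitting (my own illustration, not tested on your setup), assuming you've already collected every page URL into an array called $allUrls:
=== CODE STARTS HERE ===
<?php
// $allUrls is a placeholder for the full list of page URLs you've collected.
$chunks = array_chunk($allUrls, 50000);                 // max 50,000 URLs per sitemap file
foreach ($chunks as $i => $chunk) {
    $xml  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    $xml .= "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";
    foreach ($chunk as $url) {
        $xml .= "  <url><loc>" . htmlentities($url, ENT_QUOTES, 'UTF-8') . "</loc></url>\n";
    }
    $xml .= "</urlset>\n";
    // gzip each part; search engines accept .xml.gz sitemap files
    file_put_contents('sitemap' . ($i + 1) . '.xml.gz', gzencode($xml));
}
?>
=== CODE ENDS HERE ===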
-
Is there any way I can see pages that have not been indexed?
Is it more beneficial to include various sitemaps or just the one?
Thanks for your help!!
-
Thanks for your help.
Do you feel it is important to have HTML and video sitemaps as well? How does this make a difference?
-
How big we talking?
Probably best grabbing something server-side if your CMS can't do it. Check out http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators - I know Google says they've not tested any (and neither have I), but they must have looked at them at some point.
Secondly, you'll need to know how to submit multiple sitemap parts and how to break them up.
Looking at it, Amazon seems to cap theirs at 50,000 and eBay at 40,000, so I think you'll be fine with numbers around there.
Here's how to set up multiple sitemaps in the same directory - http://googlewebmastercentral.blogspot.com/2006/10/multiple-sitemaps-in-same-directory.html
Once you've submitted your sitemaps, Webmaster Tools will tell you how many URLs you've submitted vs. how many they've indexed.
-
Hey,
I'm assuming you mean XML sitemaps here. You can create a sitemap index file, which essentially lists a number of sitemap files in one place (a sitemap of sitemaps, if that makes sense). See http://www.google.com/support/webmasters/bin/answer.py?answer=71453
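For reference, the index file itself is just a small XML file along these lines (the domain and filenames are placeholders; see the sitemaps.org protocol for the full spec):
=== CODE STARTS HERE ===
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.yourdomain.com/sitemap1.xml.gz</loc>
    <lastmod>2011-01-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.yourdomain.com/sitemap2.xml.gz</loc>
  </sitemap>
</sitemapindex>
=== CODE ENDS HERE ===
You submit only the index file; the engines then fetch each sitemap it lists.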
There are automatic sitemap generators out there. If your site has categories with thousands of pages, I'd split them up and have a sitemap per category.
DD
-
To extract URLs, you can use Xenu's Link Sleuth. Then you must make a hierarchy of sitemaps so that they are all efficiently crawled by Google.