What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for - exclude query parameters which just result in duplicate content, system files, etc. That may help to bring the amount down.
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write. In 100 lines of code you can probably code one. The question is of course: what do you want out of the crawl?
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Is it best to 301 redirect or use canonical Url when consolidating two pages?
I have build several pages (A and B) with high quantity content. Page A is aged and gets lots of organic traffic, ranks for lots of valuable keywords, and has only internal links to this page. Page B is newer (6 months) and gets little traffic, ranks for no keywords, but has terrific content and many high value external links. As Page A and B are related to a similar theme, I was going to merge content from page B onto page A, but don't know which would be the best approach for handling the links going to page B. For the purposes of keep as much link equity as possible, is it best to us a 301 redirect from B to A or use a canonical URL from B to A?
Intermediate & Advanced SEO | | Cutopia0 -
Building a product clients will integrate into their sites: What is the best way to utilize my clients' unique domain names?
I'm designing a hosted product my clients will integrate into their websites, their end users would access it via my clients' customer-facing websites. It is a product my clients pay for which provides a service to their end users, who would have to login to my product via a link provided by my clients. Most clients would choose to incorporate this link prominently on their home page and site nav.
Intermediate & Advanced SEO | | emzeegee
All clients will be in the same vertical market, so their sites will be keyword rich and related to my site.
Many may even be .org and ,edus The way I see it, there are three main ways I could set this up within the product.
I want to know which is most beneficial, or if I'm missing anything. 1: They set up a subdomain at their domain that serves content from my domain product.theirdomain.com would render content from mydomain.com's database.
product.theirdomain.com could have footer and/or other no-follow links to mydomain.com with target keywords The risk I see here is having hundreds of sites with the same target keyword linking back to my domain.
This may be the worst option, as I'm not sure about if the nofollow will help, because I know Google considers this kind of link to be a link scheme: https://support.google.com/webmasters/answer/66356?hl=en 2: They link to a subdomain on mydomain.com from their nav/site
Their nav would include an actual link to product.mydomain.com/theircompanyname
Each client would have a different "theircompanyname" link.
They would decide and/or create their link method (graphic, presence of alt tag, text, what text, etc).
I would have no control aside from requiring them to link to that url on my server. 3: They link to a subdirectory on mydomain.com from their nav/site
Their nav would include an actual link to mydomain.com/product/theircompanyname
Each client would have a different "theircompanyname" link.
They would decide and/or create their link method (graphic, presence of alt tag, text, what text, etc).
I would have no control aside from requiring them to link to that url on my server. In all scenarios, my marketing content would be set up around mydomain.com both as static content and a blog directory, all with SEO attractive url slugs. I'm leaning towards option 3, but would like input!0 -
301 from old site to new one , Should I point to home page or sub category page ?
Hey Seo Experts, I have a small website ranking for few terms like cabinets sale, buy etc . However what i have now decided is to launch a New website with more different products like living room furniture, wardrobes etc . Out of all these categories on new website Cabinets is one of the SubCategory . Now I do not want to have 2 websites . So wanted to 301 from small cabinets website to newly created website. Some of the doubts I have at the moment is ? 1 Should I REDIRECT 301 to sub category (i,e cabinets) which is purely related to Cabinets or Do a Redirect to HOME PAGE . As I also need more Authority to home page as well , as this is relatively new website ? 2 Second question related to this. If you have multiple sub domains does it divide the total authority & TF.Or it is just Ok to have multiple Sub domains if needed ? Any advice appreciated !! Thanks .
Intermediate & Advanced SEO | | aus00070 -
What to do when your home page an index for a series of pages.
I have created an index stack. My home page is http://www.southernwhitewater.com The home page is the index itself and the 1st page http://www.southernwhitewater.com/nz-adventure-tours-whitewater-river-rafting-hunting-fishing My home page (if your look at it through moz bat for chrome bar} incorporates all the pages in the index. Is this Bad? I would prefer to index each page separately. As per my site index in the footer What is the best way to optimize all these pages individually and still have the customers arrive at the top to a picture. rel= canonical? Any help would be great!! http://www.southernwhitewater.com
Intermediate & Advanced SEO | | VelocityWebsites0 -
Dfferent url of some other site is shown by Google in cace copy of our site's page
Hi, When i check cached copy of url of my site http://goo.gl/BZw2Zz , the url in cache copy shown by Google is of some other third party site. Why is Google showing third party url in our site's cached url. Did any of you guys faced any such issue. Regards,
Intermediate & Advanced SEO | | vivekrathore0 -
Webmaster Tools HTML Improvements Page Blank / Site Not Ranking Well
I have an ecommerce site that is not ranking well currently. It has about 1,000 pages indexed in Google but very few appear to be ranking. I normally find issues in Webmaster Tools HTML Improvements but for some reason it does not see a problem with the site. There are problems, trust me. Moz shows many issues. Google nothing! There is a problem somewhere but I am not seeing it. Why are HTML Improvements blank and the site not ranking? Am I in the dreaded sandbox? Any ideas? Sean We didn't detect any content issues with your site. As we crawl your site, we check it to detect any potential issues with content on your pages, including duplicate, missing, or problematic title tags or meta descriptions. These issues won't prevent your site from appearing in Google search results, but paying attention to them can provide Google with more information and even help drive traffic to your site. For example, title and meta description text can appear in search results, and useful, descriptive text is more likely to be clicked on by users.
Intermediate & Advanced SEO | | optin0 -
What to do about similar product pages on major retail site
Hi all, I have a dilemma and I'm hoping the community can guide me in the right direction. We're working with a major retailer on launching a local deals section of their website (what I'll call the "local site"). The company has 55 million products for one brand, and 37 million for another. The main site (I'll call it the ".com version") is fairly well SEO'd with flat architecture, clean URLs, microdata, canonical tag, good product descriptions, etc. If you were looking for a refrigerator, you would use the faceted navigation and go from department > category > sub-category > product detail page. The local site's purpose is to "localize" all of the store inventory and have weekly offers and pricing specials. We will use a similar architecture as .com, except it will be under a /local/city-state/... sub-folder. Ideally, if you're looking for a refrigerator in San Antonio, Texas, then the local page should prove to be more relevant than the .com generic refrigerator pages. (the local pages have the addresses of all local stores in the footer and use the location microdata as well - the difference will be the prices.) MY QUESTION IS THIS: If we pull the exact same product pages/descriptions from the .com database for use in the local site, are we creating a duplicate content problem that will hurt the rest of the site? I don't think I can canonicalize to the .com generic product page - I actually want those local pages to show up at the top. Obviously, we don't want to copy product descriptions across root domains, but how is it handled across the SAME root domain? Ideally, it would be great if we had a listing from both the .com and the /local pages in the SERPs. What do you all think? Ryan
Intermediate & Advanced SEO | | RyanKelly0 -
Pages Titles in SERPs - Wordpress Site
In Google SERPs we have several websites (built in wordpress) who's pages are being displayed without using the page title - is this google ignoring the page title or is there a problem in our code - also if this is google is it still taking notice of the page title to determine what content is on the page?I have read several articles on this but wondered if someone can advise - I can provide the URL if required.Also I wanted to 100% that our robots.txt is behaving its self.
Intermediate & Advanced SEO | | JohnW-UK0