What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site so large that Xenu and Screaming Frog both crash somewhere past 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for: URLs whose query parameters only produce duplicate content, system files, and so on. That may help bring the number of pages down.
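To make the exclusion idea concrete, here is a rough Python sketch of a URL filter that a crawler could run on every link before queuing it. The parameter names and file extensions are made-up examples, not taken from any particular site or tool, so they would need adjusting for the site being crawled.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical examples; replace with the parameters and file types
# that only create duplicates or noise on the site being crawled.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}
SKIPPED_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".pdf", ".zip")


def normalize(url):
    """Return a canonical form of url, or None if it should be skipped."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None
    if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
        return None
    # Drop parameters that only produce duplicates, and sort the rest so
    # that ?a=1&b=2 and ?b=2&a=1 collapse into a single URL.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Drop the fragment as well: it is never sent to the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )
```

With these example settings, `normalize("http://Example.com/shoes?sort=price&color=red#top")` returns `http://example.com/shoes?color=red`, and a stylesheet URL returns `None`.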
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
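For the basic fields mentioned here, Python's standard library may be enough on its own. The sketch below pulls the title, the meta description, and the first `<h1>` out of one page. The class and function names are invented for illustration, and the choice of `<h1>` as the "HTML element" is an assumption, since the post does not say which elements are needed.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collects the <title>, the meta description, and the first <h1>."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "h1": ""}
        self._current = None  # name of the tag whose text is being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.fields["description"] = (attrs.get("content") or "").strip()
        elif tag in ("title", "h1") and not self.fields[tag]:
            # Only capture the first non-empty occurrence of each tag.
            self._current = tag

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self.fields[tag] = self.fields[tag].strip()
            self._current = None


def extract_fields(html):
    """Parse one HTML document and return its fields as a dict."""
    parser = PageFields()
    parser.feed(html)
    return parser.fields
```

Because the parser works on one page at a time and keeps nothing afterwards, memory use per page stays small no matter how many pages are crawled.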
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write; you can probably code one in about 100 lines. The real question is what you want out of the crawl.
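As a rough illustration of the "100 lines" claim, here is a minimal single-host crawler in Python using only the standard library. It is a sketch, not a production tool: the names `crawl`, `download`, and `LinkAndTitle` are made up, it ignores robots.txt, rate limiting, and retries, and it assumes pages decode as UTF-8. It writes each row to CSV as soon as the page is parsed and remembers visited URLs as 16-byte digests, which is one way to keep memory use low on very large sites.

```python
import csv
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class LinkAndTitle(HTMLParser):
    """Collects href values and the page title from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links, self.title, self._in_title = [], "", False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def download(url):
    """Fetch a page as text (assumes UTF-8; no retries or robots.txt checks)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def digest(url):
    """Short fixed-size fingerprint of a URL for the seen-set."""
    return hashlib.blake2b(url.encode(), digest_size=16).digest()


def crawl(start_url, out_file, fetch=download, max_pages=1000):
    """Breadth-first crawl of one host, writing one CSV row per page."""
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {digest(start_url)}
    writer = csv.writer(out_file)
    writer.writerow(["url", "title"])
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        parser = LinkAndTitle()
        parser.feed(html)
        writer.writerow([url, parser.title.strip()])  # streamed, not buffered
        pages += 1
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # resolve and drop #fragment
            if urlsplit(link).netloc == host and digest(link) not in seen:
                seen.add(digest(link))
                queue.append(link)
    return pages
```

Calling `crawl("http://example.com/", open("pages.csv", "w", newline=""))` would write the results to a file. The queue still holds full URLs in memory, so for a crawl of millions of pages it would probably make sense to move it to disk, for example into a SQLite table.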