I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog but because of some technical issue with their site (i can't find what is causing it) i'm able to crawl only the first page of each category (ex. http://zeri.info/sport/) and then it will go to crawl each page of their archive (hundreds of thousands of pages) but it won't crawl the links inside these pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and or the way your render archived pages.
Example: First archive page of Aktualehttp://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds the date
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
If I check all the different section & the number of articles listed in each archive I get approx. 1200 pages - add some additional pages linked on these pages and you get to the 2K pages you mentioned.
There seems to be no possibility to reach the previously published content without executing a search - which Screaming Frog can't do. It's quite possible that this is causing issues for Google bot as well so I would try to fix this.
If you really want to crawl the full site in the mean time - add another rule in url rewriting - this time selecting 'regex replace' -
add regex: from=2016-06-01
replace regex from=2010-01-01 (replace by the earliest date of publishing)This way - the system will call url http://zeri.info/arkiva/?from=2010-06-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one - listing all the articles instead of only the june articles.
Hope this helps.
Dirk
-
I can't make it work. After removing 'fromkey' parameter i was able to crawl 1.7k and it stopped there. The site has more than 400k pages so .. something must be wrong
I want to crawl only the root domain without subdomains and all i can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great it worked. Just a small note - if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider to exclude the formkey parameter in the Search Console (Crawl > URL parameters)
DIrk
-
Dirk, thanks a lot.
I just added "formkey" to be removed as a parameter and it seems to be working. I crawled 1k pages until now and i'm going to monitor how it goes.
The site has more than 400k pages so the process to crawl them all will take time (and i'm going to have to crawl each sector so i can create sitemaps for them).
Thanks again
Gjergj -
In the menu 'url rewriting' you can simply put the parameters the site should ignore (like date, formkey,..). I removed the formkey parameter and I checked the pages of the archive in Screaming Frog.
It is clearly able to detect all the internal links on the page - so will crawl them.
How are you certain that the pages below are not crawled - could you give a specific example of page that should be crawled but isn't?
Dirk
-
I've tried changing settings to respect noindex & canonical .. it will stop crawling the archive pages but still it won't crawl the links inside those pages. (i've added NOINDEX, FOLLOW in all archive pagination pages)
What do you mean by rewriting the url to ignore the formkey? How do you think it should be.
Gjergji
-
It think Screaming Frog is going nuts on the formkey value in the url which is constantly changing when changing pages.
Could you modify the settings of the spider to respect noindex & respect canonical - looks like this is solving the issue.
Alternatively you could rewrite the url to ignore the formkey (remove parameter)
Dirk
-
Hi Logan
I've tried going back to default configuration but it didn't help .. still i don't believe Screaming Frog is to blame, i think there is something wrong with the way the site has been developed (they are using a custom CMS) .. but i can't find the reason why this is happening. As soon as i find the solution then i can ask the guys who developed this site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)Screaming Frog will crawl the homepage and categories ... but after it goes to the archive it won't crawl the articles inside archive, instead it will only crawl the pages (pagination) of that archive.
Thanks again.
-
Try going to File > Default Conif > Clear Default Configuration. This happens to me sometimes as well as I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters - just tried to crawl the site & it seems to work just fine?
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Organic search traffic has dropped by 35% since 18 September, we don't know why.
Organic traffic to our website has dropped 35% since 18 September 2017 to date. From 1 January to 18 September 2017 organic traffic was up by just under 1% over all (Google up by 1.32%). Paid search traffic over the same time has remained steady. There is nothing we can think of that we've done that has caused the drop. We had an issue with Google page speed test failing when running a test but we resolved this issue on 20 November and in that time we've seen an even greater drop (44% in the last week). The drop is seen across the 3 main search engines, not just Google, which points toward something we've done, but as mentioned, we can't think of any significant change we made in September that would have such negative effects. There is little difference across devices. Is anyone aware of a significant event in September in the search engine world that may have influenced our organic traffic? Any help gratefully received.
Technical SEO | | imaterus0 -
Can I canonical the same page?
I have a site where I have 500+ Page listing pages and I would like to rel=canonical them to the master page. Example: http://www.example.com//articles?p=18 OR http://www.example.com/articles?p=65 I plan on adding this to the section from of the page template so it goes to all pages - When I do this, I will also add the canonical to the page I am directing the canonical. Is this a bad thing? Or allowed?
Technical SEO | | JoshKimber0 -
Web page is showing up on Google but doesn't show when it was cached, so is it indexed?
Hey everyone So I created a new page on a WordPress website, it was live for a few hours till I changed my mind & switched it back to a draft. Just out of curiosity I did the Site:www.example.com/Example search on Google to see if it had been indexed & apparently it had but when I click on cached to see what time it got indexed at exactly it's showing me an error. So does this mean it is indexed or not?
Technical SEO | | conversiontactics0 -
How can I best find out which URLs from large sitemaps aren't indexed?
I have about a dozen sitemaps with a total of just over 300,000 urls in them. These have been carefully created to only select the content that I feel is above a certain threshold. However, Google says they have only indexed 230,000 of these urls. Now I'm wondering, how can I best go about working out which URLs they haven't indexed? No errors are showing in WMT related to these pages. I can obviously manually start hitting it, but surely there's a better way?
Technical SEO | | rango0 -
Google doesn't rank the best page of our content for keywords. How to fix that?
Hello, We have a strange issue, which I think is due to legacy. Generally, we are a job board for students in France: http://jobetudiant.net (jobetudiant == studentjob in french) We rank quite well (2nd or 3rd) on "Job etudiant <city>", with the right page (the one that lists all job offers in that city). So this is great.</city> Now, for some reason, Google systematically puts another of our pages in front of that: the page that lists the jobs offers in the 'region' of that city. For example, check this page. the first link is a competitor, the 3rd is the "right" link (the job offers in annecy), but the 2nd link is the list of jobs in Haute Savoie (which is the 'departement'- equiv. to county) in which Annecy is... that's annoying. Is there a way to indicate Google that the 3rd page makes more sense for this search? Thanks
Technical SEO | | jgenesto0 -
Two companies merge: website A redirect 301 to website B. Problems?
Hi, last december the company I work for and another company merged. The website of company A was taken offline and the home page was 302 redirected to a page on website B. This page had information about the merger and the consequences for customers. The deeper pages of website A were 301 redirected to similar pages on website B. After a while, the traffic from the redirected home page decreased and we thought it was time to change the redirect from a 302 into a 301 redirect to the home page. Because there are still a lot of links to the home page of website A and we wanted to preserve the link juice. Two weeks ago we changed the 302 redirect from website A into a 301 redirect to the home page of website B. Last week the Google webmaster tools account of website B showed the links from the 301 redirected website A. The total amount of links doubled and the top anchor text is the name of company A instead of company B. This, off course, could trigger an alarm at Google. Because we got a lot of new links with a different anchor text. A tactic used by spammers/black-hats. I am a bit worried that our change will be penalized by Google. But our change is legit. It is to the advantage of our customers to find us if they search for the name of company A or click on a link to website A. We didn´t change the change of address of domain A in Google webmaster tools yet. Is it a good idea to change the change of address of domain A into domain B? Are there other precautions we can take?
Technical SEO | | NN-online0 -
When I look in OpenSiteExplorer, it says that it hasn't followed any of my internal links... WTF?
When I look at my site in open site explorer, it says it has followed 1 internal link.. I know that opensiteexplorer has updated, because my domain rank has changed since the last time i looked. I figured it just hadn't fully crawled and indexed my links, but considering I have had an account here for over a month, and the site is over 7 months old, I doubt that is the case. My site is www.ontracparts.com can anybody help me here?? Thanks, Tyler
Technical SEO | | TylerAbernethy0 -
Google Has Indexed Most of My Site, why won't Bing?
We've got 600K+ pages indexed by Google and have submitted our same sitemap.xml's to Bing, but have only seen 100-200 pages get indexed by Bing. Is this fairly typical? Is there anything further we can do to increase indexation on Bing?
Technical SEO | | jamesti0