Crawl and Indexation Error - Googlebot can't/doesn't access specific folders on microsites
-
Hi,
This is my first time posting here. I am looking for feedback on an indexation issue we have with a client, and on possible next steps or items I may have overlooked.
To give some background, our client operates a website for the core brand and also a number of microsites based on specific business units, so you have corewebsite.com along with bu1.corewebsite.com, bu2.corewebsite.com.
The content structure isn't ideal, as each microsite follows a structure of bu1.corewebsite.com/bu1/home.aspx, bu2.corewebsite.com/bu2/home.aspx and so on.
In addition to this, each microsite has duplicate folders from the other microsites. So bu1.corewebsite.com has the indexable folder bu1.corewebsite.com/bu1/home.aspx but also bu1.corewebsite.com/bu2/home.aspx, and likewise bu2.corewebsite.com has bu2.corewebsite.com/bu2/home.aspx but also bu2.corewebsite.com/bu1/home.aspx. There are 5 different business units, so you have this duplicate content scenario for all microsites.
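To put a number on the duplication, here is a rough sketch that enumerates every microsite/folder combination for the home page. It assumes every microsite exposes every business unit's folder, and the names bu3 to bu5 are placeholders (only bu1 to bu4 appear in this post).

```python
from itertools import product

# Placeholder names: the post says there are 5 business units but does not list them all.
BUSINESS_UNITS = ["bu1", "bu2", "bu3", "bu4", "bu5"]


def home_urls(units):
    """Return (url, is_duplicate) for every microsite/folder combination."""
    rows = []
    for host_unit, folder_unit in product(units, repeat=2):
        url = f"{host_unit}.corewebsite.com/{folder_unit}/home.aspx"
        # A folder served from another unit's subdomain is the duplicate copy.
        rows.append((url, host_unit != folder_unit))
    return rows


rows = home_urls(BUSINESS_UNITS)
duplicates = [url for url, is_dup in rows if is_dup]
print(len(rows), "home page URLs,", len(duplicates), "of them duplicates")
```

Under that assumption only 5 of the 25 home page URLs are the intended ones, and the same ratio would apply to every other page in those folders.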
This situation is being addressed in the medium term development roadmap and will be rectified in the next iteration of the site but that is still a ways out.
The issue
About 6 weeks ago we noticed a drop-off in search rankings for two of our microsites (bu1.corewebsite.com and bu2.corewebsite.com). Over a period of 2-3 weeks pretty much all our terms dropped out of the rankings and search visibility dropped to essentially 0. I can see that pages from the websites are still indexed, but oddly it is the duplicate content pages: bu1.corewebsite.com/bu3/home.aspx or bu1.corewebsite.com/bu4/home.aspx are still indexed, and similarly on the bu2.corewebsite.com microsite bu2.corewebsite.com/bu3/home.aspx and bu4.corewebsite.com/bu3/home.aspx are indexed, but no pages from the BU1 or BU2 content directories seem to be indexed under their own microsites.
Logging into Webmaster Tools, I can see the error "Google couldn't crawl your site because we were unable to access your site's robots.txt file." This was a bit odd as there was no robots.txt in the root directory, but I got some weird results when I checked the BU1/BU2 microsites in the technicalseo.com robots.txt tool.
Also, because there is a redirect from bu1.corewebsite.com/ to bu1.corewebsite.com/bu4.aspx, I thought there could be something there, so we removed the redirect and added a basic robots.txt to the root directory for both microsites.
After this we saw a small pickup in site visibility: a few terms popped into our Moz campaign rankings but dropped out again pretty quickly. The error message in GSC also persisted.
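For anyone wanting to reproduce the robots.txt check, below is a minimal sketch of what such a test could look like. It requests a URL with Googlebot's user agent string and sorts the result into rough buckets. The buckets reflect my understanding of Google's documented behaviour (a 4xx other than 429 is treated as "no robots.txt, no restrictions", while 5xx, 429 or an unreachable file can pause crawling), and the function names are my own. Sending the user agent from our own machine is not the same as a real Googlebot request, since a firewall may treat Google's IP ranges differently.

```python
import urllib.error
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch_status(url, timeout=10.0):
    """Return the final HTTP status for url, or None on timeout/connection failure."""
    request = urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
    try:
        # urlopen follows redirects, so this is the status after any redirect chain.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        # Covers URLError, DNS failures, refused connections and socket timeouts.
        return None


def classify_robots_status(status):
    """Bucket a robots.txt fetch result (assumed interpretation, see lead-in)."""
    if status is None:
        return "unreachable"
    if 200 <= status < 300:
        return "ok"
    if status == 429 or status >= 500:
        return "server_error"
    if 400 <= status < 500:
        return "client_error"
    return "other"
```

Calling `classify_robots_status(fetch_status("https://bu1.corewebsite.com/robots.txt"))` for each microsite would show whether the file is simply missing ("client_error", normally harmless) or failing in a way that matches the GSC message ("server_error" or "unreachable").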
Steps taken so far after that
- In Google Search Console, I confirmed there are no manual actions against the microsites.
- Confirmed there are no instances of noindex on any of the pages for BU1/BU2
- A number of the main links from the root domain to microsites BU1/BU2 have a rel="noopener noreferrer" attribute, but we looked into this and found it has no impact on indexation
- Looking into this issue, we saw some people had similar issues when using Cloudflare, but our client doesn't use this service
- Using an HTTP response/redirect header checker tool, we noticed a timeout when trying to mimic Googlebot accessing the site
- Following on from point 5, we got hold of a week of server logs from the client. I can see Googlebot successfully pinging the site and not getting 500 response codes from the server, but I couldn't see any instance of it requesting microsite BU1/BU2 content
So it seems to me that the issue could be something server side, but I'm at a bit of a loss as to what next steps to take.
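The log check in the last point can be sketched roughly as follows: tally Googlebot requests per top-level folder and status code. This assumes the logs are in Apache/nginx "combined" format, which may well not be the case for an .aspx site (IIS writes W3C extended logs, so the pattern would need adapting). The function name is my own, and matching on the user agent alone does not prove a request really came from Google.

```python
import re
from collections import Counter

# Assumed format: Apache/nginx "combined" access log lines.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits_by_folder(lines):
    """Count requests whose user agent mentions Googlebot, keyed by (folder, status)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "googlebot" not in match["agent"].lower():
            continue
        path = match["path"].split("?", 1)[0]  # drop any query string
        # First path segment, e.g. "bu1" for /bu1/home.aspx; "/" for the root.
        folder = path.strip("/").split("/", 1)[0].lower() or "/"
        hits[(folder, int(match["status"]))] += 1
    return hits
```

Run per microsite log file, the absence of any ("bu1", ...) or ("bu2", ...) keys would confirm that Googlebot never requested those folders, and a ("robots.txt", 5xx) entry would tie back to the GSC error.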
Any advice at all is much appreciated!
-
Hello ImpericMedia,
If you can share the site with me (private message is OK) I'll look into it. If you don't want to do that, here are some things I would look at:
1. If you have verified that the robots.txt file is not blocking the pages you want indexed, and the pages are still not indexed (or are indexed with a message about the robots.txt file), you should check for a robots noindex meta tag on the page. If the source code looks strange you may have to use the Chrome Inspect tool to see the fully rendered page.
2. If there are no blocking robots meta tags on the page, you should check the HTTP response for an X-Robots-Tag header.
3. If there is no X-Robots-Tag header, it's probably because of the duplicate content and spammy-seeming subdomain setup.
Sorry about the wait. If you include the site URL next time, it will get other community members curious enough to check it out.
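Checks 1 and 2 can be scripted along these lines. This is only a sketch with helper names of my own: it looks at the raw HTML, so a tag injected by JavaScript would be missed (hence the Chrome Inspect suggestion above), and it flags any noindex in the header even when it is addressed to a different crawler.

```python
import re
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() in ("robots", "googlebot"):
            self.directives.append(attributes.get("content", ""))


def _blocks_indexing(directive_value):
    # "none" is shorthand for "noindex, nofollow".
    tokens = re.split(r"[,\s:]+", directive_value.lower())
    return "noindex" in tokens or "none" in tokens


def meta_blocks_indexing(html):
    """True if the raw HTML carries a robots meta tag that blocks indexing."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any(_blocks_indexing(value) for value in parser.directives)


def header_blocks_indexing(headers):
    """True if a mapping of response headers has a blocking X-Robots-Tag."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and _blocks_indexing(value):
            return True
    return False
```

Both functions take data you have already fetched (the page source as a string, the response headers as a dict), so they work with whatever HTTP client or header checker you prefer.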
I hope this helps. If not, feel free to message me.