Posts made by LesleyPaone
-
RE: How come only 2 pages of my 16 page infographic are being crawled by Moz?
What are the URLs of the two pages that are being crawled? Essentially, you don't have 16 pages, you have one page; I cannot see how it could be construed any other way. In WordPress you might have created 16 different pages, but all of the content loads at once on one page. To get 16 different pages you would have to lazy load the content and also use anchors or hashbangs. You can get the full spec here, https://developers.google.com/webmasters/ajax-crawling/docs/specification but as long as the pages are not accessible with a hashbang or a query string, I think everything is going to be seen as one page.
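For reference, under that spec a crawlable AJAX page maps each hashbang URL to an _escaped_fragment_ URL that serves a static snapshot of that section (the URLs here are made up):
Pretty URL the browser shows:  http://www.example.com/infographic#!page-2
URL the crawler fetches:       http://www.example.com/infographic?_escaped_fragment_=page-2
Each of the 16 sections would need its own hashbang URL and snapshot to be seen as a separate page.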
-
RE: XML Sitemap Questions For Big Site
I guess the way I was explaining it was for scalability on a large site. You have to remember that a site like Facebook or Twitter, with hundreds of millions of users, still has the limitation of only 50k records per sitemap file. So if they are running sitemaps, they have hundreds of them.
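For reference, the standard way to tie hundreds of files together is a sitemap index file that lists each individual sitemap; a minimal sketch (the domain and file names are made up):
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.example.com/sitemap-pages.xml</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap-users-1.xml</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap-users-2.xml</loc></sitemap>
</sitemapindex>
An index file can itself list up to 50k sitemaps, so a site that size may need several indexes.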
-
RE: Alexa found many problems with my web site. Who would I go to to get them repaired
If I were in your situation, I would disregard what Alexa says and run a crawl on the site with Moz. From there, hire a developer to fix the issues if you cannot fix them yourself; the next crawl that Moz does on your site will let you know if they have been fixed.
-
RE: XML Sitemap Questions For Big Site
If it were me and someone were asking me to design a system like that, I would design it in a few parts.
First I would create an application that handled the sitemap minus the profiles: just your TOS, sign-up pages, and whatever static pages like that.
Then I would design a system that handled the actual profiles. It would be pretty complex and resource intensive as the site grew, but the main idea flows like this:
Start generation, grab the user record with id 1 in the database, check to see if it is indexable (move to the next record if not), see what pages are connected to it, write them to the XML file, then loop back and start over with record #2.
There are a few constraints you have to work within; you need to keep track of the number of records in a file so you know when to start another file, because you can only have 50k records in one file. A rough sketch of that loop is below.
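Here is a minimal PHP sketch of that loop, assuming a PDO connection; the credentials, table, and column names are all hypothetical:
<?php
// Rough sketch of the profile sitemap loop described above.
// The PDO credentials, table, and column names are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=site', 'dbuser', 'dbpass');

$limit   = 50000; // hard cap of 50k records per sitemap file
$count   = 0;
$fileNum = 1;

function openSitemap($num) {
    $fh = fopen("sitemap-users-{$num}.xml", 'w');
    fwrite($fh, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    fwrite($fh, "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
    return $fh;
}

$fh = openSitemap($fileNum);
foreach ($pdo->query('SELECT slug, indexable FROM users ORDER BY id') as $row) {
    if (!$row['indexable']) {
        continue; // skip profiles that should not be indexed
    }
    if ($count === $limit) { // hit the 50k cap, roll over to the next file
        fwrite($fh, "</urlset>\n");
        fclose($fh);
        $fh = openSitemap(++$fileNum);
        $count = 0;
    }
    fwrite($fh, '  <url><loc>http://www.example.com/profile/'
        . htmlspecialchars($row['slug']) . "</loc></url>\n");
    $count++;
}
fwrite($fh, "</urlset>\n");
fclose($fh);
The important part is the rollover check at the cap; everything else is just writing the urlset entries.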
The way I would handle the whole process for a large site is this: sync the required tables via a weekly or daily cron to another instance (server). Call the PHP script (because that is what I use) that creates the first sitemap for the normal site-wide pages. At the end of that sitemap, put the location of the user profile sitemap, then at the end of the script, execute the user profile sitemap generating script. At the end of each sitemap, put the location of the next sitemap file, because as you grow it might take 2-10,000 sitemap files.
One thing I would make sure to do is get a list of crawler IP addresses and set up an allow / deny rule in your .htaccess. That way you can make the sitemaps visible only to the search engines.
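A sketch of that .htaccess rule, assuming Apache 2.2-style directives; the range shown is one published Googlebot block, and you would add every crawler range you want to allow:
# sitemap files readable only by whitelisted crawler IPs
<FilesMatch "^sitemap.*\.xml$">
    Order Deny,Allow
    Deny from all
    Allow from 66.249.64.0/19
</FilesMatch>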
-
RE: Alexa found many problems with my web site. Who would I go to to get them repaired
A web developer usually. What problems did it find? Some of those services make a big deal out of things that are not actually problems.
-
RE: Why is Envato.com structured like it is.
Envato is a big company with a lot of web assets, so it is really for the branding of each one and for ease of use. Some people that use ThemeForest might have never heard of CodeCanyon before, and vice versa. They also have the size to support this kind of layout; they literally have tens of thousands of pages for each site.
You also have to consider that during the development of these different sites, SEO was different as well. An exact match in domain names did more than it does now, so having theme, code, and so on in the domain name helped back then too.
-
Does anyone know of an api for on site SEO?
I have searched for one, but really cannot find one that fits my needs. I am looking at making an on-site grader / service that will check pages and point out SEO problems. One that I have found that I like is seorch.eu, but they do not have an API. I do not want to reinvent the wheel if I do not have to. The API does not have to be free; it does not even have to be an API, it can be a self-hosted application too.
-
RE: Direct traffic is up 2100% (due to a bot/crawler I believe)
If you are running WordPress, also check what page or pages are being accessed. I have had bots nail my wp-login like that before. If that is the case, harden your installation; one thing I have found that stopped it was setting a deny rule in the .htaccess on wp-login / wp-admin.
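A minimal sketch of that deny rule for the .htaccess in the WordPress root; the allowed address is a placeholder for your own IP:
# block everyone but yourself from the login script
<Files wp-login.php>
    Order Deny,Allow
    Deny from all
    Allow from 203.0.113.10
</Files>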
-
RE: Pinterest and cloud hosting
You could consider using a different CDN setup. You could use something like MaxCDN and host the images both locally and in the cloud. That way you would also have direct access to the images on your server for pinning and other social media purposes.
-
RE: Mozlocal for ecommerce?
I would think it would be worth it. Even if you are selling globally, having good local citation backlinks will not hurt your site; it will only help it.
-
RE: How do I stop someone from hijacking my site?
I would definitely make sure not to 301 redirect any of the bad pages. You might get penalized for that.
Google won't count the 404s against you, but I would change them to a 410 for those pages. Then they will drop out of your GWT and the Google index quicker.
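If the site is on Apache, a sketch of how to serve the 410s with mod_alias (the paths are made-up examples):
# answer 410 Gone for the hijacked URLs
Redirect gone /hijacked-page.html
RedirectMatch gone ^/hijacked-directory/.*$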
-
RE: Just discover link not show
This might answer your question, http://moz.com/community/q/opensite-explorer-and-just-discover-links
-
RE: Analytics options for website builder
You should look into Piwik. You can set up user accounts where each user can see stats on just their account, but then you can see the stats on all accounts together.
-
RE: Any recommended affordable content writer or services?
I have used Text Broker in the past to good effect. If you have a Raven Tools account, Text Broker even hooks into it, so you can order content from inside Raven.
-
RE: Moz Crawl Showing Duplicate Content But It's Not?!
I am pretty sure it does match a % of the text. We talked about that the other day over in this thread, http://moz.com/community/q/crawl-showing-duplicate-content-but-no-duplicate-content
-
RE: Strange strategy from a competitor. Is this "Google Friendly"?
Is there a chance that you could have found their dev site? Look at the source and the robots.txt; is it set to noindex and to disallow?
edit: Actually, in looking it up, it is something that Salesforce is doing. I think it would be considered bad; it's duplicated content. Another one that is hosted on the same server is
which is also
It looks like Salesforce is copying the websites for some reason.
-
RE: Magento Help - Server Reset
In a situation like that, my first guess would be that someone changed the document root in Apache but did not restart the server. I am not a Magento person, so you might want to have what I am saying checked out, because I don't know where things are located. What I would do is look in the databases on the server and see if there is one that holds the most recent customer information. Then I would look at the settings file Magento uses to specify which database the application uses, and see if they are the same. If not, look around on the server for another directory that holds the most recent version of the site.
If someone changed the document root and did not restart, it would not take effect until the server was rebooted, thus changing the whole configuration of the site. One place to look to be sure is the Apache logs. The log will have the complete system path of the resources accessed; see if there was a change at the time of the reboot.
-
RE: ECommerce search results to noindex?
Yes, most platforms already disallow anything from the search in the robots.txt file. You might check and see if this is done already; if not, I would do it. You can also add a noindex meta tag to the pages as well. Just be careful that it is only on the search pages.
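A sketch of both pieces, assuming the search lives under a /search path (adjust to your platform's URL structure). In robots.txt:
User-agent: *
Disallow: /search
And in the head of the search template only:
<meta name="robots" content="noindex, follow">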
-
RE: Opensite explorer and Just discover links
I do not know definitively, but I am guessing that since they are between updates (it was supposed to happen on the 18th and it's happening on the 22nd now) they might have scraped that data to use for the update. But that is just my best guess.
-
RE: ECWID Ecommerce Sites. No Custom URLS?
I would suggest Prestashop, though I admit I am partial; I develop exclusively with it and I am one of their moderators. Magento is also good. I think the biggest considerations are what the store does in business, how many products it has, and what features they need that are not in the default package of the e-commerce program. All platforms have good features, but every one has features the others do not.
-
RE: How Additional Characters and Numbers in URL affect SEO
This is a pretty good explanation, http://moz.com/learn/seo/url
-
RE: Checking Do-follow using Moz Bar
If you are using the new bar, on the left hand side click the marker next to the page with the magnifying glass. From there you can highlight links.
-
RE: ECWID Ecommerce Sites. No Custom URLS?
Honestly, I don't know. I really thought development had stopped on it a couple of years ago, but apparently it hasn't. I don't know if they have the ability to change it, since it would be a huge core modification for them.
-
RE: ECWID Ecommerce Sites. No Custom URLS?
I would consider it important, but unfortunately Ecwid does not support it because of how the software uses AJAX to build pages. The best tip I can give is to try to rewrite the URLs manually and see if the software still works.
-
RE: Xml sitemap only shows up sometimes (magento)
It very well could. It seems like the allowed generation time is too short for that many products. One thing I would try (I am not an expert with Magento) is to check what your PHP execution time limit is. If it is 30 seconds or 1 minute, I would try increasing it to something like 5 or 10 minutes and see if the regeneration works then. The script could be timing out.
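A sketch of the two usual ways to raise it (600 seconds = 10 minutes). In php.ini:
max_execution_time = 600
Or just for the generation script at runtime:
<?php
ini_set('max_execution_time', '600');
set_time_limit(600);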
-
RE: Xml sitemap only shows up sometimes (magento)
Can you verify if the file is actually on the server? It sounds like the sitemap generator you are using might have a problem. How many products are in the shop? How long does it run when you generate one?
-
RE: Help! Tracking Conversion Source of Specific Users: Possible? How?
I have played around in my spare time with a product called segment.io, and it lets you track people around your website and port the information out to different sources. You can set it up to track where they come from; if they create an account, their name or email address; things like that; and also what they purchase. You should take a look at it. Here is a link to their docs dealing with tracking people, https://segment.io/docs/tutorials/quickstart-analytics.js/
One thing I don't think you are going to be able to do is track the search term, though. With the whole keyword "(not provided)" thing, it is kind of hard.
-
RE: Canonical Tag on All Pages
Oh OK, I see what you mean. What it is actually saying is "this page you are looking at is the one true source". It basically tells the search engines which page a piece of content belongs to, so that if the content is found on another page, they know which URL should get the credit.
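For reference, the tag itself goes in the head of the page, pointing at the URL that should get the credit (the URL here is a made-up example):
<link rel="canonical" href="http://www.example.com/products/blue-widget/" />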
-
RE: Intra-linking to pages with a different Canonical url ?
Categorization is a difficult issue for ecommerce sites. You want to have enough categories so people can easily find products, but at the same time you do not want to risk a duplicate content penalty, or a thin content penalty, by being over-categorized.
I think with the canonical URL pointing to the non-filtered page it will be next to impossible to rank the page. If you are looking for click-throughs from other sources it should work, but from the search engines I don't think it will.
The way I handled the issue with one of my clients might be something you could do, and it would provide better results. The ecommerce platform I use has searches and filters as well. What I did with their shop was create a module that would handle some searches and filters differently. Basically, they could enter a term in the module, say "christmas", and any searches or filters on the site for that exact match word would rewrite the term as a pseudo category. It is kind of hard to explain the logic, but it created a stub page, which in your case would be something like /drivetrain/road-cassettes, where the canonical URL could be set to the page itself and the meta description, title, and on-page category description could all be different. But the page was not accessible through any category structure or any other way, just through the search and filters. Basically what you are creating is a landing page catered to search terms.
If this is done sparingly, there is no issue with it; if you go through and make 10k pages like this, you might get a penalty. What I did was take the top terms that had been searched on the site and used them as the list of what landing pages to make.
-
RE: Canonical Tag on All Pages
Why would you think a page would not need one? It is hard to tell from the example you gave what you meant, but I take the stance that every page needs one.
-
RE: Meta Abstract & Revisit
The abstract tag is not supported by any major search engine, but that does not mean it is useless. The use cases I have seen for the abstract tag are internal. One case I have seen is organizing the pages in a custom-rolled CMS application; the abstract tag there lets you store a different short description and organize by it in the backend.
The other major use case I have seen for the abstract tag is internal search. I have seen it in the professional and medical fields before. The reason being that they want to show one short snippet in regular search engines but a different short snippet in their own site search. This is helpful when you have terms that are not searched by the general public but that people using your site search do know. It lets you better target your pages to your own search engine while keeping your real meta tags targeted to a general audience.
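A sketch of what that two-audience setup looks like in a page head (the copy is made up):
<meta name="description" content="Over-the-counter heartburn remedies you can order today.">
<meta name="abstract" content="OTC antacid and H2-antagonist products for GERD symptom relief.">
The description feeds the public search engines, while the abstract feeds the internal one.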
-
RE: How do I stop someone from hijacking my site?
Interesting, it sounds like they had a long-term plan. Good thing you found out about it in time.
I think the plan sounds good. I would definitely get off FrontPage; it has been out of development for a while, so there are vulnerabilities that have not been patched in years.
You might look into a CMS like WordPress or concrete5 to make transitioning the site easier for you. Then you would only have to learn minimal HTML / CSS and could focus more on the content.
Good luck to ya.
-
RE: Help with Schema.org on Ecommerce Products
As far as I know it cannot currently be done with schema markup. The only allowance in the product standard that Google uses is for having multiple sellers selling the same product. From everything I have tested and seen, you cannot have multiple products on the same page, and I would think that products with color or size variations would count as multiple products, since they may have a different UPC or MPN on them.
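For reference, here is the multiple-sellers case that is allowed, sketched as microdata (the product and prices are made up):
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Mountain Bike</span>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">499.00</span> <meta itemprop="priceCurrency" content="USD"> from Seller A
  </div>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">479.00</span> <meta itemprop="priceCurrency" content="USD"> from Seller B
  </div>
</div>
Notice it is still one Product with multiple Offers; there is no equivalent for variations-as-separate-products.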
-
RE: Robots.txt on refinements
I don't know if you have those taco commercials where you live, the ones with the little girl who says "Why not both!", but I would do just that. It would not hurt, and it would make you sleep better at night.
Oh, here is a link to the commercial, https://www.youtube.com/watch?v=vqgSO8_cRio
-
RE: Database driven content producing false duplicate content errors
I agree with zenstorageunits about using rel=canonical, but one thing I would like to point out is that Moz does not create false errors. It is a simple crawler, not like Google. Google will actually try to follow links that people have used before and that show up in your analytics; Moz uses no logic like that, it just jumps from page to page. If it is picking up a page with a query string like that, then it is linked somewhere on your site. I would find the links and take them off.
-
RE: Switched to HTTPS, now Google ALWAYS changes Page Title & Meta in SERPs
Could you possibly share a link so we could check it out? I cannot think of anything off the top of my head that would cause it.
-
RE: How do I stop someone from hijacking my site?
Good that the site is clean. What Sucuri and programs like that do is analyze the site (your real site, not the made-up pages) for malicious code, so all the public-facing files should be intact. Also, GWT is Google Webmaster Tools; if you passed the Sucuri check you should be fine there.
Since it sounds like you are on shared or managed hosting, I would send a support email to your host and let them know about the issue. They might be able to see where someone got in and when it happened; it is worth a shot at least.
What platform are you running on your site? Is it a CMS or a custom platform?
More than likely the reason that Moz never detected the pages is because it is a crawler. It starts with your home page and follows every linked page on your site; if the pages were "orphaned", as it sounds like they were, the crawler would never have picked them up.
-
RE: How do I stop someone from hijacking my site?
The first thing you should do is change all of your passwords. Then you need to examine the server logs to see how they got in. I would check the FTP access logs first; hopefully you have logging turned on. In those logs I would search for lines that are not from your IP address. If you are on a static IP and you have had the same IP for a while, it should be a lot easier; you will be looking for the other IP address. If you cannot find that the server was accessed from another IP address through FTP, then the next option is to look at the code; there might be an exploit in your site that allowed it. One thing I would do is look at the files that were added via FTP; they will hold a timestamp. You can try to cross-reference that time and day with the FTP log. If there is activity at those times (remember, your server might not be set to your time zone), then start looking through the site for a "connector" file. It would have been the first file they created; it is basically a bot file that can create other files on your server. If you can find that, check its timestamp against the log.
If you have a restorable version of the site, I would consider restoring it. I would also see if your site is labeled as having malware on it; you can use GWT and Sucuri to do that.
As for possibly hurting your rankings: yes, definitely. I would get the issue cleaned up, see if there is a pattern in the bad files so you can redirect them to a 404 page and block them in the robots.txt file as well. I would also check GWT and see if you have a penalty. But I would do all of this ASAP if I were you.
-
RE: HTTP & HTTPS
To my mind the answer is complicated and depends a lot on your site structure and how your CMS or website handles HTTPS.
Case in point: if you go to an HTTPS page on your site and all of the links pointing out from that page are to HTTPS pages, but those pages are then redirected to HTTP pages, I would consider that bad. The way some sites are written, links just use the same protocol as the current page no matter what, so you would have to redirect the links on the page to the non-secure pages; I would use a 302 in that instance. At the same time, if it is possible, I would see if this can be corrected, because it is not proper. The same happens a lot going from non-secure to secure with some platforms as well.
This site has an example of what I mean: http://junglejumparoo.com/product/jungle-jumparoo/ If you add the product to the cart, then hover over the checkout button, it is HTTP, but clicking it redirects you to HTTPS. Then on the next page, if you mouse over the menu, you will notice that all of the links are HTTPS now. That is handling SSL incorrectly, and in that case I would use a 302 redirect, I guess.
Doing the redirect is a quick fix, but for the long game I would try to fix the issue itself and not rely on a redirect.
-
RE: Moz shows page has no meta description but it does
It might be that Moz's crawler is not picking up the description because of the way you have it written. The standard way to write a meta description is like this:
name="description" content="Buy the robust and reliable Canon imageRUNNER 2525 from Copyfaxes. Learn more about the Canon 2525 before you buy." />
You have yours written like this.
http-equiv="description" content="Buy the robust and reliable Canon imageRUNNER 2525 from Copyfaxes. Learn more about the Canon 2525 before you buy." />
Using http-equiv tells the browser to treat the value as if it came from an HTTP response header, which is incorrect for a description. You should rewrite your descriptions to be like the first type. Search engines might be able to decipher the second form even though it is semantically wrong, but I would not risk it.
-
RE: International Local Directories
Not so much a cheat sheet, but have you tried whitespark.ca? They are a local directory search service, and it is very reasonable for the time it saves you.
-
RE: How should I handle URL's created by an internal search engine?
I am guessing that you are using a system that templates pages and adds a query string after the search, something like search.php?caws+cars. I would set a noindex, nofollow in the head of all of the pages that use the search template, and I would also add the search pages to the robots.txt so they are disregarded. They will start dropping out of the results pages in about a week or so.
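A sketch of both pieces, assuming a PHP template and the search.php path above; the $isSearchPage flag is hypothetical:
<?php if ($isSearchPage): ?>
  <meta name="robots" content="noindex, nofollow">
<?php endif; ?>
And in robots.txt:
User-agent: *
Disallow: /search.php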
-
RE: How is it possible to create unique content (never blogged or discussed before) content on common topics? Is it practically possible?
It is totally possible with just about every topic. Write interesting articles that people want to read; it does not matter if someone else has written an article on the subject, make it your own article and put your own spin on it. It is kind of like vampire stories: people keep writing them with new spins on them.
It really does not matter if the topic has been covered before; what you want is original content. Don't copy other articles. Get ideas from them, but use your own wording. I am sure there are thousands of articles about places to visit in Seattle, but maybe your article mixes together places that are not found in any one other article.
Sometimes it can be difficult, but every topic can be written about; you just have to find a focus for what you want to write about.
-
RE: Does Installing Google Tag Manager Compromise Server Security?
Alan,
I would have to say they don't know what they are talking about. mod_security (mod_sec) is, in a sense, like an IP blacklist; if no one ever updates it, it is pretty ineffective in terms of security. I would imagine that InMotion is running a configuration that they have been running for 5 years with no updates. mod_sec is an old module; there really was a time when it was more useful, but Apache has been updated, and PHP too, to be pretty secure by themselves.
On another note, I develop pretty much exclusively in Prestashop, and Prestashop is a partner with InMotion Hosting. Inside Prestashop is a setting to disable mod_sec on InMotion's servers, and they don't seem to have an issue with that. Here is a screenshot of it, http://screencast.com/t/gDqO9a8axf
I would think you can safely disable it, but at the same time I would still install a WordPress security plugin just to keep WordPress safe; it has a lot of security holes.
-
RE: Images not being indexed by Google
Part of it could be. I do not know how you do it on CloudFront, but one thing I would suggest is to set the canonical URL via an HTTP header. Search engines look at the headers of files other than web pages to know where the real file is.
Here is the screenshot of one of your images headers.
http://screencast.com/t/05B0nNec
Here is one from my site; see, the Link in the header is the canonical URL.
http://screencast.com/t/TyXsTmpuscGZ
I would figure out how to do this on CloudFront, and I think things will get indexed.
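For reference, the response header itself looks like this, and on a self-hosted Apache server you could attach it with mod_headers; the domain and file name are placeholders, and how you attach it on CloudFront depends on your origin setup:
Link: <http://www.example.com/images/photo.jpg>; rel="canonical"
# .htaccess sketch for one file, mod_headers
<Files "photo.jpg">
    Header add Link "<http://www.example.com/images/photo.jpg>; rel=\"canonical\""
</Files>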
-
RE: Images not being indexed by Google
I searched a lot of your images, and most of the ones I searched were indexed. The ones that were not seemed to be that way because so many other sites were already using them. Also, I have a feeling your image sitemap is not correct, or you are hotlinking other sites' images; a lot of the image URLs in it did not point to your domain. You can do a reverse image search in Google Images and search by URL to see if your images are being indexed.
-
RE: Does Installing Google Tag Manager Compromise Server Security?
There is an inherent risk in everything you do; putting a webpage up at all puts you at some risk of being hacked. As for GTM, the risk is very low, but the burden is all on your shoulders. If someone gains access to your GTM, they can execute malicious code on your site, yes. But the only way they are going to gain access to the account is through bad security practices by whoever has or sets the passwords. If you use a weak password, someone might guess it, or if you use open, publicly accessible networks, someone can grab it that way. I would suggest turning two-factor authentication on in your Gmail account and following good password practices: don't use the same password for any other service, make a strong password, don't email the password to other people, things like that.
As for mod_sec, in my opinion it is more of a problem in most cases than it is a help any more. A lot of web applications need it totally disabled, or major parts of it disabled, to run correctly. Also, if no one is actively monitoring it and adding to it, it is pretty much useless.
Here is a great comic on setting your password to a strong one. http://xkcd.com/936/
-
RE: Negative SEO campaign just started against my site. What do I do?
If it were me, this is what I would do. I would export all of your backlinks from Google Webmaster Tools. You can do this by going to Webmaster Tools -> Links to your site -> All domains and clicking to export the table.
From there, if possible, I would go through the list and make a master list of "good" links. Take that list of good links, copy the column into a text file, and save it as a reference point of good quality links. Then every day I would download the latest links from Webmaster Tools and put them in a text file as well. Then I would use Beyond Compare to find the new links to the site by comparing the "master good link file" to the daily generated files. If you have time, go through the new links and see if any of them are legitimate, good quality links. If there are too many, I would just start disavowing them all until the attack is over. A sketch of the disavow file format is below.
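For reference, the disavow file you upload to Google is plain text, one entry per line (the domains here are made up):
# spammy domains that appeared since the attack started
domain:spammy-example-1.com
domain:spammy-example-2.net
# a single bad page can also be listed by URL
http://bad-example.org/links/page.html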
I would also think about signing up for other services that have crawlers, like Ahrefs, because more than likely one crawler will not find all of the links, and from experience GWT will not show you every link either.
-
RE: Which version of the homepage on the sitemap?
Put the version that your site redirects to. If both can be accessed without a redirection, I would redirect index.html so the file name and extension do not show.
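A sketch of that redirect for Apache with mod_rewrite, assuming index.html is the directory index file:
RewriteEngine On
# send /index.html (in any directory) to the bare directory URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)index\.html\ HTTP
RewriteRule ^(.*)index\.html$ /$1 [R=301,L]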
-
RE: Structured Data dropped suddenly
Hey, just to follow up: it looks like everything I am looking at has bounced back, http://screencast.com/t/s3hlsnMrS I hope yours has too.