Steps you can take to ensure your content is indexed and registered to your site before a scraper gets to it?

Qasim_IMG

Hi,

A client's site has a significant amount of original content that has been blatantly copied and pasted onto various competitor and article sites.

I'm working with the client to rejig lots of this content and to publish new content.

What steps would you recommend taking when the new, updated site is launched to ensure Google clearly attributes the content to the client's site first?

One thing I will be doing is submitting a new XML + HTML sitemap.

Thank you

AlanBleiweiss

There are no "best practices" established for the tag's usage at this point. On the one hand, it could technically be used on every page; on the other, it should arguably only be used when the page is an article, blog post, or other piece of an individual person's writing.

Qasim_IMG

Thanks Alan.

Guess there's no magic trick that will give you 100% attribution.

Regarding this tag, do you recommend I add it to EVERY page of the client's website, including the homepage? So even the usual about us, contact, etc. pages?

Cheers

Hash

AlanBleiweiss

Google continually tries to find new ways to encourage solutions for helping them understand intent, relevance, ownership and authority. It's why Schema.org finally hit this year. None of their previous attempts have been good enough, and each has served a specific individual purpose.

So with Schema, the theory is there's a new, unified framework that can grow and evolve, without having to come up with individual solutions.

The "original source" concept was supposed to address the scraper issue, and there's been some value in that, though it's far from perfect. A good scraper script can find the tag and either strip it out or replace its contents.

rel="author" is yet one more thing that can be used in the overall mix, though Schema.org takes authorship and publisher identity to a whole new, complex, and so far confused level :-).

Since Schema.org is most likely not going to be widely adopted until at least early next year, Google is encouraging use of the rel="author" tag as the primary method for assigning authorship at this point, and will continue to support it even as Schema rolls out.

So if you're looking at a best practices solution, yes, rel="author" is advisable. Until it's not.
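For anyone who wants to see what that looks like in practice, here is a minimal Python sketch that builds a byline link carrying rel="author". The function name `author_byline`, the author name, and the URL are all hypothetical placeholders, and pointing the link at an author page on the same site is an assumption on my part, not something stated above.

```python
from html import escape


def author_byline(author_name, author_page_url):
    """Return an HTML byline whose link carries rel="author".

    Both arguments are illustrative placeholders; escape() keeps
    special characters from breaking the markup.
    """
    href = escape(author_page_url, quote=True)
    return f'By <a rel="author" href="{href}">{escape(author_name)}</a>'


if __name__ == "__main__":
    print(author_byline("Jane Doe", "https://example.com/about/jane"))
```

You would drop the returned string into the article template only, or into every template, depending on which way you go on the "every page" question.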

EGOL

Thanks Alan... I am surprised to learn about this "original source" information. There must not have been a lot of talk about it when it was released or I would have seen it.

Google recently started encouraging people to use the rel="author" attribute. I am going to use that on my site... now I am wondering if I should be using "original source" too.

Are you recommending rel="author"?

Also, reading that full post, I see there is a section added at the end recommending rel="canonical".

AlanBleiweiss

Always have a sitemap.xml file with all the URLs you want indexed included in it. Right after publishing, submit the sitemap.xml file (or files if there are tens of thousands of pages) through Google Webmaster Tools and Bing Webmaster Tools. Include the Meta "original-source" tag in your page headers.
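As a rough illustration of the sitemap step, here is a minimal Python sketch that turns a list of URLs into one or more sitemap.xml documents. The function name `build_sitemaps` and the example URLs are hypothetical; the 50,000-URL cap per file is the limit in the sitemaps.org protocol. Submitting the resulting files through the webmaster tools is still a manual step and is not shown.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemaps.org protocol


def build_sitemaps(urls, max_urls=MAX_URLS_PER_FILE):
    """Return a list of sitemap XML strings, one per chunk of URLs."""
    documents = []
    for start in range(0, len(urls), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + max_urls]:
            entry = ET.SubElement(urlset, "url")
            # ElementTree escapes characters such as "&" when serializing
            ET.SubElement(entry, "loc").text = url
        body = ET.tostring(urlset, encoding="unicode")
        documents.append('<?xml version="1.0" encoding="UTF-8"?>\n' + body)
    return documents


if __name__ == "__main__":
    # Hypothetical URLs; replace with the pages you want indexed.
    pages = ["https://example.com/", "https://example.com/articles/new-post"]
    for number, xml_text in enumerate(build_sitemaps(pages), start=1):
        with open(f"sitemap-{number}.xml", "w", encoding="utf-8") as handle:
            handle.write(xml_text)
```

Regenerating the files as part of the publish step means the sitemap is ready to submit the moment new content goes live.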

Include a Copyright line at the bottom of each page with the site or company name, and have that link to the home page.
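To make the two page-level suggestions concrete, here is a small Python sketch that produces the "original-source" meta tag for the page header and a copyright line that links to the home page. The function names, site name, year, and URLs are hypothetical placeholders, and using the page's own URL as the tag's content is my assumption about how the tag is meant to be filled in.

```python
from html import escape


def original_source_tag(page_url):
    """Return a meta tag naming page_url as the original source."""
    return f'<meta name="original-source" content="{escape(page_url, quote=True)}">'


def copyright_footer(site_name, home_url, year):
    """Return a copyright line whose site name links to the home page."""
    href = escape(home_url, quote=True)
    return f'<p>&copy; {year} <a href="{href}">{escape(site_name)}</a></p>'


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(original_source_tag("https://example.com/articles/new-post"))
    print(copyright_footer("Example Ltd", "https://example.com/", 2011))
```

The first string belongs inside the page's head section and the second in the shared footer template, so every page picks them up automatically.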

This does not guarantee with 100% certainty that you'll get proper attribution; however, these are the best steps you can take in that regard.
