There are quite a few duplicate content misconceptions circulating in the SEO community.
Even though Google's Matt Cutts has said a lot about the exaggerated fear some people have regarding a few lines of duplicate content on their sites, many still do not understand what content duplication is, or whether their site is at risk.
So, let's tackle certain tricky questions that concern duplicate content and put some common myths to rest.
Myth 1. Duplicate content is 'same text on multiple pages'
In truth: Not exactly
Website owners who are not so well-versed in web design think that the only way to produce duplicate content is to purposefully replicate a piece of text on multiple pages.
What they don't realize is that some of their site's pages may be accessible via multiple URLs (which can happen for various reasons), which automatically leads to content duplication.
That is, ideally, each piece of content should have only one URL associated with it. In reality, though, it happens quite often that a page ends up with multiple URLs pointing to it.
Now, the problem is that search engines are not always capable of matching duplicate URLs to their respective pages. And, even when they manage to do it, they have to decide which version of the page to show in the search results (they normally pick one version and filter out the rest).
Hence, if there are pages on your site that have multiple URLs pointing to them, this means you have an internal duplicate content problem you need to take care of!
Why does URL duplication occur? Most often, it's because of tracking parameters that get appended to the URL, or filtering options that let users re-arrange items on a site.
To avoid these problems, use canonical tags, an XML sitemap, a robots.txt file or other means that aid the canonicalization process. You can find more information on how to tackle these structure issues in our guide to SEO-friendly URL architecture.
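To make the canonical-tag fix concrete, here's a minimal sketch (the domain and paths are hypothetical):

```html
<!-- Placed in the <head> of every duplicate variant, e.g.
     https://example.com/shoes?utm_source=newsletter or
     https://example.com/shoes?sort=price, this tells search
     engines to consolidate them under one preferred URL: -->
<link rel="canonical" href="https://example.com/shoes" />
```

Google treats the canonical tag as a strong hint rather than a directive, but in most cases it's enough to get the preferred URL shown in search results.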
Myth 2. One should block crawlers' access to duplicate pages
In truth: Not always
It's a widespread opinion that, when you have duplicate URLs on a site, you should block the duplicates from being crawled with robots.txt.
Although this would sometimes save search engines computing resources (for example, if you have an ecommerce site with tons of filtering options), this is not what Google recommends:
"Google does not recommend blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages.
A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects."
Again, this may not be the best option if your site has a really large number of duplicate URLs, but the general best practice is to use rel="canonical" to specify a preferred URL for each group of duplicates you have.
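Where a duplicate URL serves no purpose of its own, a 301 redirect is often cleaner than a canonical tag. As a hypothetical sketch (assuming an Apache server with mod_rewrite enabled), here's how the classic www/non-www duplication could be resolved in .htaccess:

```apache
# Hypothetical sketch: permanently redirect "www" duplicates
# to the preferred non-www host with a 301 status code.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://example.com/$1 [R=301,L]
```

Unlike a canonical tag, a 301 also carries human visitors to the preferred URL, so it's the better choice when the duplicate never needs to be seen at all.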
Myth 3. Legal info/disclaimer across multiple pages isn't allowed
In truth: It is allowed
Some SEO folks believe that having even a small amount of duplicate content on your site can lead to a penalty. In an overwhelming number of cases, however, it can't.
Matt Cutts recently made a video, in which he said that having a Terms and Conditions template or a Disclaimer message across all pages of your site won't get you penalized.
As Cutts explained:
"If it's really required, I wouldn't stress about that… Unless the content that you have is spammy or keyword-stuffed, then an algorithm or a person might take action."
For example, Patient.co.uk, a UK-based medical information resource, has the same disclaimer on many of its pages.
And, the site is still in good standing with Google and ranks well for a number of queries.
At the same time, Google still advises one to keep the amount of text in that repeated message to a minimum.
Myth 4. Duplicate content penalty doesn't exist
In truth: It does exist
Because Google frequently states that it can handle duplicate pages just fine most of the time, some SEOs have come to believe there's no such thing as a duplicate content penalty at all.
(By the way, I trust you know Google's Panda update wasn't all about dupe content – it was about poor user experience).
Now, although Google rarely penalizes sites for duplicate content (usually such sites are pure spam), it could easily dish out a penalty to a site that:
- Has nothing but scraped content
- Scrapes images, auto-translates pages, or uses automated tools to spin content prior to publication
- Purposefully creates pages with nearly identical content to rank them for various locations/keywords
As Google themselves say, "mostly, [duplicate content] is not deceptive in origin." Which implies they can recognize deceptive duplication when they do see it.
For example, imagine you have a tanning salon. If you publish the same description (e.g., Welcome to the paradise of eternal spring! Spa Éternel is a place where you find…) on 10 different pages in order to rank for 10 different locations, that'll make your site a candidate for a duplicate content penalty.
At the same time, 25-30% of the Web is duplicate content (because people quote other people, etc.). After all, the Web is a big echo chamber, and the same information gets shared on it a lot.
Myth 5. Google can tell the original content creator
In truth: Not always
There's been a lot of discussion on the Web about whether Google is able to tell the original creator of a piece of content.
Some people say Google relies on publication date to track the authentic author, but multiple instances of hijacked search results (a scraper site outranking the original) disprove that.
Dan Petrovic of Dejan SEO once ran several convincing experiments, which established that, when a scraper page has higher PageRank than the original page, it's likely to outrank the authentic page in search.
Plus, there are many other grey-area situations where Google is hesitant about which version of a page to display in search results.
So, according to Dan Petrovic, there are certain signals you can send Google to let it know you're the original author. These are:
- Claim your Google Authorship
- Specify canonical URLs
- Share a newly published piece on Google+, etc.
Myth 6. Syndicated content is duplicate content
In truth: Not necessarily
There are two types of site that syndicate Web content:
Type 1. Legitimate news sites/information hubs that sometimes feature previously published content. They often provide original commentary and analysis of the piece they cover. Such sites always credit the original content creator.
Type 2. Content syndication sites that produce no content of their own. They scrape content off multiple websites (often it is imagery) and give no credit to the original content creators whatsoever.
So, if your site belongs to the 1st type and you have syndicated content on it, you have nothing to worry about. If you are type 2, getting a penalty is just a matter of time.
However, there may be hidden dangers even in legitimate content syndication. As per Google:
"If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you'd prefer."
And, this is what Google suggests you do to make sure the search engine value of the content you syndicate isn't lost:
- Ask the other site to link back to your original creation
- Ask them to keep the syndicated copy out of the index (e.g. with a noindex meta tag)
(this way the copy won't outrank you, but will still provide value for readers).
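In markup, those two requests could look like this (the URLs are hypothetical); a cross-domain canonical is an alternative to a plain link back, and a noindex meta tag keeps the copy readable without letting it compete in search:

```html
<!-- Option 1: the syndicating site points search engines
     at your original article via a cross-domain canonical -->
<link rel="canonical" href="https://yoursite.com/original-article" />

<!-- Option 2: the syndicated copy stays readable for visitors
     but is kept out of the search index entirely -->
<meta name="robots" content="noindex" />
```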
Myth 7. Translated copy on regional site isn't duplicate content
In truth: Sometimes it is
You may think that translating the copy from your English-language site and publishing it on a regional domain/subdomain is never a problem.
Well, sometimes it is.
It may seem improbable that Google would be able to tell a page is a duplicate when the content is in another language. But, believe it or not, Google may well be capable of it if:
- You translated it with an automatic tool and just dumped it on your site
(in which case it would qualify as automatically generated content)
- You copied your English-language content without change to the regional site
So, when creating a foreign-language site for your business, tailor its content to the segment of users you are trying to reach with it. Most likely, they will want a slightly different message than the one you have for English-speaking audiences.
By the way, of the three common regional site setups (a country-code top-level domain like example.de, a subdomain like de.example.com, or a subdirectory like example.com/de/), which is the best to use?
According to Google, using a country-code top-level domain is the best practice of the three.
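One way to tell Google explicitly that your regional pages are alternates rather than duplicates is hreflang annotations. A minimal sketch, with hypothetical URLs:

```html
<!-- In the <head> of both versions; each page lists all
     alternates, including itself: -->
<link rel="alternate" hreflang="en" href="https://example.com/" />
<link rel="alternate" hreflang="de" href="https://example.de/" />
<!-- Fallback for users whose language isn't covered: -->
<link rel="alternate" hreflang="x-default" href="https://example.com/" />
```

Note that hreflang annotations must be reciprocal: if the English page points to the German one, the German page has to point back, or Google may ignore the annotations.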
As has been mentioned throughout the article, it's usually easy to tell purposefully generated duplicate content from accidental duplication.
However, as Google is a machine and can't review each page in its index by hand, there are some best practices to follow to make sure you never get penalized for content duplication:
- Set canonical URLs (or create 301 redirects) for pages accessible via multiple paths
- Keep the amount of text in your cross-site template (if you have one) to a minimum
- Claim authorship of the content you create
- Do not use automated tools to create or translate web pages
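And since an XML sitemap is one more canonicalization hint Google accepts, make sure yours lists only the preferred URLs. A minimal sketch, with hypothetical URLs:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- List only the canonical version of each page; never include
     parameter-laden duplicate URLs here. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/shoes</loc></url>
</urlset>
```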
This is it! Do you have other duplicate content questions or insights? Let us know – leave a comment!