How Google Chooses Canonical Page
A key element of SEO is canonical pages. These are the pages that Google crawls most often, and the pages that appear most often in the search engine results. URLs with duplicated or near-identical content will generally be left out of the results. It is therefore essential to understand how Google chooses a canonical page, and that is what we want to take a look at here.
If you are too busy to read the following, here is the TL;DR version of the steps in identifying a canonical page:
- First of all, there are at least twenty different signals weighted to help identify the canonical page.
- Second, the page content is crawled and the documents are indexed.
- Google will cluster all duplicates together. It excludes the navigation and footer from the checksum calculation, so only the "centrepiece" is left for analysis.
- Next, Google reduces the content to a hash, or checksum, and then compares the checksums. If the content is unique, its checksum (a compact string derived from the content) is unique too. It acts as a fingerprint.
- Google catches both near-duplicates and exact duplicates.
- The signals involved in detection include, but are not limited to, PageRank, content, HTTPS/HTTP, sitemap inclusion, 301 redirects and, critically, the rel=canonical attribute.
- Google uses machine learning to calculate the weights rather than do it with human input.
- However, the signals are weighted differently. For example, redirects have a heavier weight than the HTTP/HTTPS URL signal.
How Google Chooses Canonical Page
There are several different methods that Google uses to identify the canonical page on your website. Some of these methods are unknown (Google really doesn't want you to game their system), but they have spoken extensively about others. Here, we want to talk to you a little bit about the major techniques involved in how Google chooses a canonical page.
Google likes to listen to the website owner. As a website owner, you have some say in which pages Google classes as canonical on your website, for example through rel=canonical. There are at least twenty different signals that Google looks at here, and it weighs them up to get an idea of which pages the site owner wants to be canonical. However, do bear in mind that Google has the final say. You can do everything right, but if Google decides that another page is canonical, then that is what you have to go by.
Some of the signals include:
- A rel=canonical link element in the page's head
- Internal linking on the website
- Sitemap links
- HTTPS instead of HTTP
- Which URL Google thinks looks better in the searches
Pay attention to these signals, and Google may end up choosing the page you want as the canonical URL.
We will discuss this more at the end of this page, where we look at how all of these signals are analyzed.
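To check which canonical URL a page is actually declaring, you can read it straight out of the HTML. The following is a minimal sketch using only Python's standard library; the class name, function name and sample page are made up for illustration, and it says nothing about how Google itself parses pages.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collects href values from <link rel="canonical"> elements."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def declared_canonicals(html):
    """Return every canonical URL declared in the given HTML."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonicals


# Hypothetical page used only for illustration.
sample_page = """
<html><head>
  <title>Blue widgets</title>
  <link rel="canonical" href="https://example.com/widgets/blue">
</head><body><p>Blue widgets.</p></body></html>
"""

print(declared_canonicals(sample_page))
```

If the list comes back empty, or with more than one URL, the page is sending Google a missing or mixed signal.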
They Crawl and Index all the Pages
Before Google does anything else, it needs to crawl and index the pages on your website, because it needs to see which pages are duplicated. All of the duplicated content will be clustered together. Don't worry: the header and the footer will not be counted as 'duplicate' content. Google only focuses on the main part of the page.
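To get a feel for what "only the main part of the page" means, here is a rough sketch that drops the text inside navigation, header and footer elements and keeps the rest. The list of tags to skip is our own assumption, and Google's real centrepiece detection is certainly more sophisticated than this.

```python
from html.parser import HTMLParser

# Assumption: boilerplate lives inside these elements. Real pages are messier.
BOILERPLATE_TAGS = {"nav", "header", "footer", "script", "style"}


class CentrepieceExtractor(HTMLParser):
    """Keeps text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside boilerplate
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def centrepiece_text(html):
    """Return the page text with navigation, header and footer removed."""
    extractor = CentrepieceExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)


# Two hypothetical pages: same main content, different navigation and footer.
page_a = ("<body><nav>Home | Shop</nav><main><h1>Blue widgets</h1>"
          "<p>Our best widget.</p></main><footer>(c) Example</footer></body>")
page_b = ("<body><nav>Home | Shop | Blog</nav><main><h1>Blue widgets</h1>"
          "<p>Our best widget.</p></main><footer>Contact us</footer></body>")

print(centrepiece_text(page_a) == centrepiece_text(page_b))
```

Once the navigation and footer are stripped, the two pages are left with the same text, which is why they would count as duplicates of each other.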
Analyze the Content
Before Google can cluster all of those duplicates together, it will need to analyze the content that it has scanned.
Google will go through the pages on your website and analyze the content. It won't actually 'read' the content on the page to check for duplicates, though. Instead, it assigns a hash (a checksum) to each page. You can think of this as a unique identifier, similar to a fingerprint. Identical content gets an identical identifier, and very similar content gets a very similar one.
Google does not read the actual content on the page at this point, because that would be too demanding on its system resources. The hash is the best way for it to work out which content is duplicated.
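As a small illustration of the fingerprint idea, the sketch below reduces a piece of text to a checksum with Python's hashlib. The choice of SHA-256 and the normalization step are our assumptions; Google has not said which hash function it uses.

```python
import hashlib
import re


def content_checksum(text):
    """Return a hex digest that acts as a fingerprint for the text."""
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not change the fingerprint. This normalization is an assumption.
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


print(content_checksum("Blue  widgets\n") == content_checksum("blue widgets"))
print(content_checksum("Blue widgets") == content_checksum("Red widgets"))
```

Comparing two short checksums is far cheaper than comparing two full pages. Note that a cryptographic hash like this one only matches exact duplicates, since changing a single word produces a completely different digest. Catching near-duplicates needs a fingerprint that keeps similar content close together.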
Cluster Duplicates Together
The next step is to cluster these duplicates together.
This is done through a series of checksum calculations. Without getting too much into the technical side of things, Google is essentially comparing the hashes against one another for similarities. It is looking for content that is precisely the same, and for content that is so close that it is virtually the same.
All the content that Google identifies as duplicate will be clustered together, and then Google can move on to the final stage.
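The clustering step can be sketched as follows. Since Google's actual similarity measure is not public, this example swaps in a common textbook technique: it breaks each page into overlapping three-word "shingles" and groups pages whose shingle sets overlap heavily (Jaccard similarity). The 0.8 threshold, the function names and the sample pages are all invented for illustration.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_duplicates(pages, threshold=0.8):
    """Greedily group URLs whose main content meets the similarity threshold.

    pages maps URL -> main content text. The threshold is a made-up value.
    """
    clusters = []  # each entry: (shingles of the first member, [urls])
    for url, text in pages.items():
        page_shingles = shingles(text)
        for representative, urls in clusters:
            if jaccard(page_shingles, representative) >= threshold:
                urls.append(url)
                break
        else:
            clusters.append((page_shingles, [url]))
    return [urls for _, urls in clusters]


# Hypothetical pages: an exact duplicate, a near-duplicate, and a unique page.
widget_text = ("Our blue widget is the best widget for small gardens "
               "and large gardens alike")
pages = {
    "https://example.com/widgets/blue": widget_text,
    "https://example.com/widgets/blue?ref=home": widget_text,
    "https://example.com/widgets/blue/print": widget_text + " today",
    "https://example.com/about": "We are a family business that has sold "
                                 "garden tools since 1990",
}

for cluster in cluster_duplicates(pages):
    print(cluster)
```

Here the three widget URLs end up in one cluster and the about page sits in a cluster of its own, so only the widget cluster needs a canonical page chosen for it.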
Identifying the Canonical Page
The final step will involve Google choosing the canonical page. As we said before, there are several different signals that Google will use to see which page is canonical. It will likely listen to the choice of the site owner if it is clear they have put a lot of effort into signalling which pages should be canonical. But, as we also said, it does not have to listen at all.
Google gives each signal a weighting. Some are more important than others for determining which page is canonical. For example, a redirect carries more weight than whether a URL uses HTTPS or HTTP. Google's documentation describes both redirects and rel=canonical as strong signals, and sitemap inclusion as a weak one.
Google uses machine learning to periodically adjust the weighting of each of these signals. This is to ensure that the best canonical pages are always shown in the search engines. As the weighting changes, different pages may become canonical. Therefore, it is important for the webmaster to stay on top of things here.
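To make the idea of weighted signals concrete, here is a toy scoring sketch. Every number in it is invented: Google does not publish its weights, and it learns them with machine learning rather than setting them by hand. The only thing taken from the text is that a redirect outweighs the HTTPS/HTTP signal. The signal names and URLs are hypothetical too.

```python
# Hypothetical weights for illustration only; these are not Google's values.
SIGNAL_WEIGHTS = {
    "redirect_target": 5.0,
    "rel_canonical_target": 5.0,
    "https": 1.0,
    "in_sitemap": 0.5,
}


def canonical_score(signals):
    """Sum the weights of every signal that is present for a URL."""
    return sum(SIGNAL_WEIGHTS[name]
               for name, present in signals.items() if present)


def pick_canonical(cluster):
    """Return the URL with the highest score in a duplicate cluster."""
    return max(cluster, key=lambda url: canonical_score(cluster[url]))


# One cluster of duplicate URLs and the signals observed for each.
cluster = {
    "http://example.com/widgets/blue": {
        "redirect_target": False, "rel_canonical_target": False,
        "https": False, "in_sitemap": False,
    },
    "https://example.com/widgets/blue": {
        "redirect_target": True, "rel_canonical_target": True,
        "https": True, "in_sitemap": True,
    },
    "https://example.com/widgets/blue?ref=home": {
        "redirect_target": False, "rel_canonical_target": False,
        "https": True, "in_sitemap": False,
    },
}

print(pick_canonical(cluster))
```

The URL that collects the most heavily weighted signals wins. If the weights shift, the winner can shift with them, which is why it pays to point all of your signals at the same URL.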
This is a very basic introduction to how Google chooses a canonical page. However, to be honest, outside of ensuring that you send the right signals to Google about what you want as the canonical page, you do not have to think about this process too much. Just send the right signals, and Google should make the right decision.