How Google Chooses Canonical Page

July 31, 2023
Marketing Two Cents

Canonical pages are a key element of SEO. These are the pages that Google crawls most often and shows most often in the search results. When several URLs carry duplicate or near-identical content, Google picks one of them as the canonical page and generally leaves the others out of the results. It is therefore essential to understand how Google chooses a canonical page, and that is what we want to take a look at here.

If you are too busy to read the rest, here is the TL;DR version of the steps Google takes to identify the canonical page:

  1. First of all, at least twenty different signals are weighted to help identify the canonical page.
  2. Second, the page content is crawled and the documents are indexed.
  3. Google clusters all duplicates together. It excludes the navigation and footer from the checksum calculation, so only the "centrepiece" is left for analysis.
  4. Next, Google reduces the content to a checksum and compares the checksums. A checksum is basically a hash of the content (the content with spaces and symbols stripped out). If the content is unique, its hash is unique too, so it acts as a fingerprint.
  5. Google catches both near-duplicates and exact duplicates.
  6. The signals involved include, but are not limited to, PageRank, content, HTTPS/HTTP, sitemap inclusion, 301 redirects and a critical one: the rel=canonical attribute.
  7. Google uses machine learning to calculate the weights rather than setting them by hand.
  8. The signals are weighted differently, however. For example, redirects carry more weight than the HTTP/HTTPS URL signal.

How Google Chooses Canonical Page

Google uses several methods to identify the canonical page on your website. Some of these methods are unknown (Google really doesn't want you to game its system), but it has spoken extensively about others. Here, we want to talk you through the major techniques involved in how Google chooses a canonical page.

Signals

First of all, Google likes to listen to the website owner. As a website owner, you have some say in which page Google treats as canonical, for example through rel=canonical. Google looks at at least twenty different signals here and weighs them up to get an idea of which page the site owner wants to be canonical. However, do bear in mind that Google has the final say. You can do everything right, but if Google decides that another page is canonical, then that is what you have to go by.

Some of the signals include:

  • A rel=canonical link tag in the <head> of the page
  • Internal linking on the website
  • Sitemap links
  • HTTPS instead of HTTP
  • Which URL Google thinks looks better in the search results

Pay attention to these signals, and Google may end up choosing the page you want as the canonical URL.
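
To make the rel=canonical signal concrete, here is a minimal sketch of how you could check which canonical URL a page declares. It uses only Python's standard library; the class name CanonicalFinder, the function find_canonical and the example.com URL are made up for illustration, and this is an audit helper for your own pages, not a description of how Google reads the tag.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel can hold several space-separated values, so split before checking
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])


def find_canonical(html):
    """Return the list of canonical URLs declared in the given HTML string."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonicals


# Example page; the URL is a placeholder
page = """
<html><head>
  <title>Blue widgets</title>
  <link rel="canonical" href="https://example.com/widgets/blue">
</head><body>...</body></html>
"""
print(find_canonical(page))
```

If the list comes back empty, or with more than one entry, the page is not sending a clear rel=canonical signal.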

We will come back to these signals at the end of this page, where we look at how they are all weighed up.

They Crawl and Index all the Pages

Before Google can compare anything, it needs to crawl and index the pages on your website. It does this because it needs to see which pages are duplicated. All of the duplicated content will be clustered together. Don't worry: the header and the footer will not be regarded as 'duplicate' content. Google will only focus on the main part of the page.
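
As a rough illustration of "only the main part of the page counts", the sketch below drops the text inside header, nav and footer elements and keeps the rest. The list of skipped tags and the names CentrepieceExtractor and centrepiece_text are assumptions made for this example; how Google actually detects the centrepiece is not public.

```python
from html.parser import HTMLParser

# Assumption: these tags mark boilerplate. Real pages are rarely this tidy.
SKIPPED_TAGS = {"header", "nav", "footer", "script", "style"}


class CentrepieceExtractor(HTMLParser):
    """Keeps text that is not nested inside any of the skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def centrepiece_text(html):
    """Return the page text with header, navigation and footer removed."""
    parser = CentrepieceExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page_a = "<header>Site menu</header><main><p>Blue widgets are great.</p></main><footer>Footer A</footer>"
page_b = "<nav>Other menu</nav><main><p>Blue widgets are great.</p></main><footer>Footer B</footer>"
print(centrepiece_text(page_a) == centrepiece_text(page_b))
```

The two example pages have different menus and footers, but the extracted centrepiece is the same, which is why they would count as duplicates.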

Analyze the Content

Before Google can cluster all of those duplicates together, it will need to analyze the content that it has scanned.

Google will go through the pages on your website and analyse the content, although it won't actually 'read' the content on the page in order to check for duplicates. Instead, it assigns a hash (a checksum) to the main content of each page. You can think of this as an identifier, similar to a fingerprint: pages with the same content end up with the same fingerprint.

Google does not compare the full text of every page at this point, because that would be too demanding on its system resources. Comparing hashes is a much cheaper way to work out which content is duplicated.
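
The fingerprint idea can be sketched in a few lines. The example below strips spaces and symbols, lowercases the text and hashes what is left. The choice of SHA-256 and the exact normalisation are assumptions for illustration; Google has not published which hash function it uses.

```python
import hashlib
import re


def content_checksum(text):
    """Return a fingerprint of the text with spaces, symbols and case ignored."""
    # Remove every character that is not a letter or digit, then lowercase
    normalized = re.sub(r"[\W_]+", "", text.lower())
    # SHA-256 is an arbitrary choice here, not Google's confirmed algorithm
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


same_1 = content_checksum("Blue widgets, are great!")
same_2 = content_checksum("blue   widgets are GREAT")
different = content_checksum("Red widgets are great")

print(same_1 == same_2)     # the two variants collapse to one fingerprint
print(same_1 == different)  # different wording gives a different fingerprint
```

Comparing two short fingerprints is far cheaper than comparing two full pages, which is the point of this step.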

Cluster Duplicates Together

The next step is to cluster these duplicates together.

This is done through a series of checksum calculations. Without getting too much into the technical side of things, Google is essentially comparing the hashes against one another. It is looking for content that is exactly the same, and for content that is so close that it is virtually the same.

All the content that Google identifies as a duplicate will be clustered together, and then Google can move on to the final stage.
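
Here is a small sketch of the clustering step under stated assumptions. Exact duplicates are grouped by checksum, as described above. For near-duplicates, the source does not say which technique Google uses, so the example swaps in a common textbook approach (Jaccard similarity over word shingles) purely to show the idea. All function names and URLs are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def checksum(text):
    """Fingerprint of the text with spaces, symbols and case ignored."""
    normalized = re.sub(r"[\W_]+", "", text.lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_exact_duplicates(pages):
    """Group URLs whose content has the same checksum. pages maps URL -> text."""
    clusters = defaultdict(list)
    for url, text in pages.items():
        clusters[checksum(text)].append(url)
    return list(clusters.values())


def shingles(text, size=3):
    """Set of overlapping word sequences of the given length."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def jaccard(text_a, text_b):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


pages = {
    "http://example.com/a": "Blue widgets are great.",
    "https://example.com/a": "Blue widgets are great",
    "https://example.com/b": "Red gadgets are fine.",
}
print(cluster_exact_duplicates(pages))
print(jaccard("the quick brown fox jumps over the lazy dog",
              "the quick brown fox jumps over the lazy cat"))
```

In practice you would pick a similarity threshold above which two pages count as near-duplicates; any such threshold is a tuning choice, not a published Google value.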

Identifying the Canonical Page

The final step is for Google to choose the canonical page. As we said before, there are several different signals that Google uses to decide which page is canonical. It will likely listen to the choice of the site owner if it is clear that they have put effort into signalling which pages should be canonical. But, as we also said, it does not have to listen at all.

Google gives each signal a weighting. Some are more important than others for determining which page is canonical. For example, a redirect carries more weight than whether the URL uses HTTPS or HTTP, and rel=canonical is also among the more important signals.

Google uses machine learning to periodically adjust the weighting of each of these signals. This is to ensure that the best canonical pages are always shown in the search results. As the weighting changes, different pages may become canonical. Therefore, it is important for the webmaster to stay on top of things here.
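
The weighting idea can be pictured as a scored vote within one cluster of duplicates. The sketch below is only that: the signal names and every number in SIGNAL_WEIGHTS are invented for illustration, since the real weights are learned by Google and are not public. The only ordering taken from this article is that a redirect outweighs HTTPS.

```python
# Hypothetical weights, invented for this example. Google's real weights are
# learned by machine learning and are not published.
SIGNAL_WEIGHTS = {
    "redirect_target": 4.0,       # other URLs redirect to this one
    "rel_canonical_target": 4.0,  # other URLs name this one in rel=canonical
    "in_sitemap": 1.5,            # URL is listed in the sitemap
    "https": 1.0,                 # URL uses HTTPS rather than HTTP
}


def canonical_score(signals):
    """Add up the weights of every signal that is present for a URL."""
    return sum(SIGNAL_WEIGHTS[name] for name, present in signals.items() if present)


def choose_canonical(cluster):
    """Return the URL with the highest score. cluster maps URL -> signal flags."""
    return max(cluster, key=lambda url: canonical_score(cluster[url]))


cluster = {
    "http://example.com/a": {
        "redirect_target": False, "rel_canonical_target": False,
        "in_sitemap": True, "https": False,
    },
    "https://example.com/a": {
        "redirect_target": True, "rel_canonical_target": True,
        "in_sitemap": True, "https": True,
    },
}
print(choose_canonical(cluster))
```

Because the score is a sum, a page that wins on several weaker signals can still lose to one that wins on a strong signal, and if the weights are adjusted the winner can change.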

Conclusion

This is a very basic introduction to how Google chooses a canonical page. To be honest, though, outside of ensuring that you send the right signals to Google about what you want as the canonical page, you do not have to think about this process too much. Just send the right signals, and Google should make the right decision.

Source (1, 2)

Hongxin (Dean) Long

Dean Long is a Sydney-based performance marketing and communication professional with expertise in paid search, paid social, affiliate, and digital advertising. He holds a Bachelor's degree in Information Systems and Management and is also a distinguished MBA graduate from Western Sydney University.
