4eck Media

International SEO: When Google delivers unexpected results with correct HREFLANG

August 30, 2024

For multilingual websites and international SEO, implementing the hreflang attribute correctly is often a challenge. The attribute is intended to help search engines select the correct language and country version of a page for users. However, even a correct implementation can produce unexpected results when several country versions exist for the same language, as is common with Germany, Austria and Switzerland, or with the UK, the USA and Australia.

Search Console then reports that the canonical URL declared by the user was not used for indexing and that Google selected a different URL as the canonical one. An example (the URL was anonymized using the browser dev tools):

Declared and selected canonical URL

Even separating the content onto country-code top-level domains (.de and .at) does not help, because the problem occurs there as well. To continue the fictitious example:

Canonical assignment with top-level domains

 

The mechanism behind the problem

This unexpected behavior of Google can occur when the search engine crawler indexes pages from different countries that are very similar in content. An example: Google first crawls the German (DE) version of a page, calculates the so-called hash value and indexes the page with the DE URL as the canonical version. If Google then crawls the Austrian (AT) version of the page, it calculates the hash value again and recognizes that this is identical to the DE version that has already been indexed. Instead of indexing the AT URL separately, Google adds it to the existing document.

Hashing is a technique that search engines use to index the content of websites more efficiently and to recognize duplicates. A hash value is a compact fingerprint that a hash function computes from the content of a web page. With a conventional hash, two pages with the same value almost certainly have identical content. With a similarity-preserving hash such as Simhash, equal or nearly equal values indicate that the content is identical or very similar. Google researchers have published work on using Simhash to detect near-duplicates during web crawling, so it is reasonable to assume that it plays a role here.

SimHash for comparing content and recognizing duplicate files

Simhash is a special technique that Google uses to efficiently analyze and compare content, especially with regard to detecting duplicate content. Simhash makes it possible to quickly check large amounts of data for similarities without having to fully compare every single piece of content.

Simhash works by extracting certain characteristics (features) from a text, such as words, phrases or structures, optionally with a weight for each. Each feature is hashed to a bit string of fixed length, and the weighted bits are summed per position into a vector. The sign of each vector component then yields one bit of a short bit sequence (the Simhash). The trick here is that similar texts lead to similar Simhash values, which enables quick identification of similar or almost identical content.
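The steps described above can be sketched in a few lines of Python. This is a minimal illustration and not Google's implementation: the choice of three-word shingles as features, the equal weighting of every feature, the use of MD5 as the per-feature hash and the function name `simhash` are all assumptions made for the example.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a Simhash fingerprint of `text` as an integer with `bits` bits."""
    # Features: overlapping three-word shingles (an assumption for this sketch).
    words = re.findall(r"\w+", text.lower())
    shingles = [" ".join(words[i:i + 3]) for i in range(max(len(words) - 2, 1))]

    # One signed counter per bit position; every feature has weight 1 here.
    counters = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        feature_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counters[i] += 1 if (feature_hash >> i) & 1 else -1

    # A bit of the fingerprint is 1 wherever the counter is positive.
    fingerprint = 0
    for i, counter in enumerate(counters):
        if counter > 0:
            fingerprint |= 1 << i
    return fingerprint


de_text = "Our sofa Oslo is available in grey and blue. Delivery within Germany takes five days."
at_text = "Our sofa Oslo is available in grey and blue. Delivery within Austria takes five days."

print(f"{simhash(de_text):064b}")
print(f"{simhash(at_text):064b}")
```

Because most shingles of the two sample texts are identical, most counters end up with the same sign, so the two fingerprints can be expected to agree in most bit positions.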

Google uses Simhash for duplicate content detection, i.e. to quickly and efficiently recognize duplicate or very similar content on different websites. If two pages have a similar Simhash value, Google can use this as an indicator that the content is very similar and may only include one version of the page in the search index.

Instead of performing a full text analysis for each crawl, Google can use Simhash to quickly check whether the content of a page is significantly different from other content. This saves resources and speeds up the crawling and indexing process.

An important aspect of Simhash is the ability to recognize similar content based on the Hamming distance. The Hamming distance measures the number of bit positions in which two Simhash values differ. A small Hamming distance between two Simhash values means that the corresponding contents are very similar. Unlike cryptographic hashing methods, where even a small change in content results in a completely different hash value (the avalanche effect), Simhash produces similar hash values for similar content, making it particularly useful for recognizing nearly identical content.
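The Hamming distance itself is simple to compute. The following sketch assumes fingerprints are stored as non-negative Python integers. The helper names and the threshold of 3 bits are illustrative assumptions, not values confirmed for Google's systems.

```python
import hashlib


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp_a: int, fp_b: int, max_distance: int = 3) -> bool:
    """Treat two fingerprints as near-duplicates below an assumed threshold."""
    return hamming_distance(fp_a, fp_b) <= max_distance


def sha256_int(text: str) -> int:
    """A cryptographic hash as an integer, for contrast with Simhash."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")


print(hamming_distance(0b1011, 0b1001))  # 1: only the second-lowest bit differs

# With a cryptographic hash, changing one word flips roughly half of the bits,
# so the distance says nothing about how similar the two texts are.
print(hamming_distance(sha256_int("Delivery within Germany"),
                       sha256_int("Delivery within Austria")))
```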

Pages with similar Simhash values could therefore rank lower because they are considered duplicate content or less original, as Google wants to display unique and high-quality content in the search results.

By the way: SimHash can also be used by operators of very large websites that work a lot with text spinning and programmatic SEO in content creation to automatically compare content and measure the degree of duplicate content.

HREFLANG and canonical URL

Why hreflang sometimes doesn't work

Google uses SimHash, presumably alongside other methods, to analyze the content of pages and determine whether pages are duplicates. If two pages, such as the DE and AT versions, have the same or nearly the same SimHash value, Google considers them duplicates and selects one canonical URL, regardless of the hreflang tags. Like html lang attributes and canonical tags, hreflang tags are hints and not binding for Google, unlike noindex tags or the instructions in robots.txt. Google therefore does what it considers best for the user experience. Which URL becomes the canonical one depends on various factors, such as the number of inbound links, positive user signals and other criteria. This canonical URL is displayed as such in the Google Search Console (GSC) and is assigned all the traffic.

However, this does not necessarily mean that users are always shown this canonical URL in the search results. Rather, Google displays the URL that is most relevant to the user - which doesn't always have to be the canonical URL. Yes, I know, it sounds complex or even contradictory.

If there are minor differences between pages - like a small module on one page that is missing from the other - this can be enough for Google to treat them as individual content. However, even a minimal change in how the search engine evaluates pages, or a change in page content or design, can cause Google to start treating the pages as distinct or to merge them.

Sometimes Google's behavior is truly remarkable. I personally know of an example of an online store for furniture where the website is available as .at for Austrian customers and as .de for German customers. According to Search Console, Google selected the .at URL instead of the .de URL as the canonical URL, although the .de URL was defined as the canonical URL for the German site by the website.

This case became particularly strange when the furniture store took the product offline on the .at page and the page returned 404.

Furniture store with 404 page

The .at URL is still in the index and leads to a 404 page, while the German page still lists the product but is not in the index.

Furniture website 404 page in the index

The .de page still has the page content, but the product is sold out. Nevertheless, it is astonishing that Google keeps the 404 page (.at) in the index while the German counterpart (.de), whose content is still available, is not indexed.

Product page in the furniture store

How the problem affects the GSC data

The Google behavior described above, in which different country versions of a page are combined and one canonical URL is selected, has a direct impact on the validity of the data in the Google Search Console (GSC). The performance data displayed in the GSC does not always reflect what is actually shown in the search results.

If Google recognizes different country versions of a page as duplicates and selects one of them as the canonical URL, the entire traffic that the different versions receive is attributed to the canonical URL in the GSC. This means that the GSC may only show traffic data for the DE version of a page, even though users in Austria have actually seen the AT version. This distorts the data and makes it difficult to correctly measure the success of individual country versions.

Ranking data can also be influenced by this behavior. For example, if Google determines the AT URL as canonical, but still shows the DE version to users in Germany, the ranking data in the GSC can also be misleading. This discrepancy arises because the GSC data is based on the canonical URL, while the actual search results may show a different URL.

In the GSC, it looks as if only one URL is performing, while in reality several URLs are displayed in different countries. This makes it difficult to analyze and understand performance at a country level, especially when companies specifically apply different content or optimizations for different markets.

This behavior by Google requires SEO teams to adapt their analysis approaches. Instead of relying solely on the data displayed in the GSC, external tools could be used to check which URL is actually displayed in the search results in different countries. It may also be necessary to take a closer look at the filtering options in the GSC, for example by analyzing the data by country and by country directory, to get a clearer picture of the actual performance in the respective markets. Regex queries in the Search Console can also be helpful in this case. I will be writing a separate article on this soon.
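As a rough illustration of the directory-based analysis, the following sketch groups clicks from an exported performance report by country directory with a regular expression. The rows are invented sample data, and the directory names, the function name `clicks_by_directory` and the (URL, clicks) row layout are assumptions. A real export would have to be loaded from a file first.

```python
import re
from collections import defaultdict

# Matches the country directory at the start of the URL path, e.g. /de-AT/.
LOCALE_DIR = re.compile(r"^https?://[^/]+/(de-(?:de|at|ch))/", re.IGNORECASE)


def clicks_by_directory(rows):
    """Sum clicks per country directory; rows are (url, clicks) pairs."""
    totals = defaultdict(int)
    for url, clicks in rows:
        match = LOCALE_DIR.match(url)
        key = match.group(1).lower() if match else "other"
        totals[key] += clicks
    return dict(totals)


# Invented sample rows, for illustration only.
rows = [
    ("https://example.com/de-DE/sofa", 120),
    ("https://example.com/de-AT/sofa", 15),
    ("https://example.com/de-CH/sofa", 30),
    ("https://example.com/de-DE/tisch", 80),
    ("https://example.com/impressum", 5),
]

print(clicks_by_directory(rows))
# {'de-de': 200, 'de-at': 15, 'de-ch': 30, 'other': 5}
```

If one directory shows almost no clicks although the market is active, that can be a hint that its traffic is being attributed to the canonical URL of another country version.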

Strategies for the solution

It would be best if Google treated each language/country variant as its own canonical URL. As we have seen, this is not guaranteed for variants in the same language. One possible solution is to question the need for country variants and check whether a single version for linguistically identical markets makes more sense. This can be particularly helpful where the differences between countries are minimal. Germany (euros) and Switzerland (francs) at least differ in currency. Where there is no such difference, it makes little sense to create an additional /de-AT/ directory for Austria, also priced in euros, in which all content is duplicated.

The extent to which Google restricts the indexing of pages can be checked very easily with an inurl: query in Google Search. An example in the following screenshot: the sitemap of the website contains 644 pages for /de-CH/ and a further 1312 for /de-DE/. Considerably fewer are actually in the search index. In addition, this website has a further directory for /de-AT/. It can be assumed that many pages are not in the index and that many more have been merged into one canonical URL for the German language.

Indexing comparison

Some SEO tools also flag duplicate content after a crawl. For this reason, where the country versions do not differ in content, offer, delivery or other specific characteristics, it is better to rely on language directories and merge the three directories for Germany, Austria and Switzerland into /de/.

Another - and probably the best - strategy for getting the desired URLs canonicalized correctly is to differentiate the content more strongly. Even if the language is the same, differences in the metadata, the main content or the user guidance can help to convince the Google algorithm that the content is independent and assigned to a specific region. For example, you could work with variables that welcome website visitors from the respective country: "Welcome from <variable>, nice to have you here." The <variable> is then filled in accordingly: Austria for de-at, Germany for de-de, Switzerland for de-ch and so on. The same goes for "... our delivery terms for <variable>". The more differentiated the content, the more individual the hash value, and the more likely Google is to determine each page as its own canonical URL.
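A minimal sketch of the variable approach could look like this. The mapping, the function name `localized_snippets` and the exact wording of the snippets are assumptions for illustration. In practice this logic would live in the template system of the CMS or shop.

```python
# Country name per locale directory (hypothetical mapping for this example).
COUNTRY_NAMES = {
    "de-de": "Germany",
    "de-at": "Austria",
    "de-ch": "Switzerland",
}


def localized_snippets(locale: str) -> dict:
    """Return country-specific text snippets for one locale."""
    country = COUNTRY_NAMES[locale.lower()]
    return {
        "greeting": f"Welcome from {country}, nice to have you here.",
        "delivery": f"Our delivery terms for {country}",
    }


print(localized_snippets("de-AT")["greeting"])
# Welcome from Austria, nice to have you here.
```

A single swapped word changes only a few features of a long page, so snippets like these are most effective when combined with further country-specific content such as metadata, prices or delivery information.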

A third approach could also be to use hreflang tags in sitemaps, as this leads to a clearer assignment by Google in some cases. We ourselves also display our hreflang tags in the sitemap and limit ourselves to the pure language variant without the country code.

HREFLANG in Sitemap
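To illustrate the sitemap approach, the following sketch generates a sitemap in which every URL entry lists all of its language alternates, including itself, as xhtml:link elements. The URLs are placeholders, the function name `build_sitemap` is an assumption, and the example uses pure language codes without a country code, as described above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize the sitemap namespace as the default and XHTML with the xhtml: prefix.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_sitemap(pages):
    """pages: list of dicts mapping an hreflang value to the URL of that variant."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for alternates in pages:
        # One <url> entry per variant, each listing every alternate.
        for url in alternates.values():
            url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = url
            for lang, href in alternates.items():
                ET.SubElement(
                    url_el,
                    f"{{{XHTML_NS}}}link",
                    {"rel": "alternate", "hreflang": lang, "href": href},
                )
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs for one page that exists in German and English.
print(build_sitemap([
    {"de": "https://example.com/de/", "en": "https://example.com/en/"},
]))
```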

In my opinion, the most effective means remains to differentiate the content as much as possible in order to send Google clear signals that these are different pages.