by Astrid Kramer, first published in German
For those who thought, “Nice, but Google would be cooler,” after last year’s spectacular Yandex Leak, this year has met their expectations. The Yandex Leak of 2023 primarily exposed source code and ranking factors, while the Google API Leak includes detailed documentation about the API structure and specific algorithms. Nevertheless, intriguing insights can be derived from this year’s leak, just as from its Russian counterpart.
The beginnings read like a thriller: On May 5, 2024, SEO legend and SparkToro founder Rand Fishkin receives an email from an anonymous source claiming to have access to a large leak of API documentation from Google’s search department. After the initial contact with the anonymous source, later revealed to be Erfan Azimi, Fishkin verifies the documents with the help of SEO experts like Mike King from iPullRank.
The leaked documents comprise over 2,500 pages and contain information about more than 14,000 attributes (a list of attributes can be found at dixonjones.com) associated with the Google API. These documents were apparently publicly accessible between March and May 2024 and were then removed from the GitHub platform. During this time, the documents were indexed by third parties and thus remain accessible, despite Google removing them.
Contradictions to Public Statements
The leaked information contradicts many public statements from Google about how the search engine works. For example, the documents show that click data influences ranking signals, subdomains are evaluated separately, and the age of a domain is a ranking factor—all points that Google has denied in the past.
As Aleyda Solis nicely points out in the Majestic Podcast on the subject, it’s not about pointing fingers at Google and accusing the company of making false statements or throwing up “smoke screens,” as is often said in the SEO industry. Corporate spokespeople are often prevented from laying all the facts on the table for internal reasons, and that is understandable. Rather, these contradictions show us how important it is to approach Google’s official statements with a healthy dose of skepticism. Younger SEOs in particular, who have grown up with a wealth of freely available information about SEO provided by Google, may be more inclined to take that knowledge at face value and not question or test it sufficiently.
Important Insights from the Leak
Some of the most important revelations from the leaked documents include:
1. Use of Click Signals: Google has repeatedly denied using click signals (e.g., clicks on search results) to evaluate websites. However, in his testimony during the DOJ antitrust trial, Pandu Nayak revealed the existence of the ranking systems NavBoost and Glue, confirming long-standing doubts about Google’s official statements. NavBoost, a system that has existed since around 2005, uses click-based measurements to promote or demote rankings in web search. NavBoost collected data first from the Google Toolbar and later from Chrome to evaluate search queries. It uses the number of searches for a specific keyword, the number of clicks on a search result, and the duration of those clicks to improve search results.
Despite numerous indications and patents pointing to the use of click data to alter search results, Google has repeatedly denied using clicks directly in rankings. Rand Fishkin himself has demonstrated the effect at conferences and webinars by asking his audience to click on specific results and thereby nudging their rankings. These clicks always had an effect, but never a lasting one. The leaked documents show that qualitative click data plays a crucial role. For example, the date of the “last good click” on a document is recorded, indicating that traffic loss over time can affect a page’s ranking. Long clicks, which indicate a successful search session, are also recorded, although Google has publicly downplayed the importance of “dwell time.”
The documents also show that users are considered voters, with their clicks counted as votes. The system segments the data by country and device and records which result had the longest click during a session.
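None of Google’s actual code is part of the leak, so purely as an illustration, here is a minimal Python sketch of how the signals described above could be aggregated in principle. Every name in it (Click, dwell_seconds, LONG_CLICK_THRESHOLD, summarize_clicks) is my own assumption, not something taken from the documents; the sketch only shows click counts, long clicks, a “last good click” date, and segmentation by country and device.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

# Hypothetical click record; field names are illustrative, not from the leak.
@dataclass
class Click:
    query: str
    url: str
    country: str
    device: str          # e.g. "mobile" or "desktop"
    dwell_seconds: float
    timestamp: datetime

LONG_CLICK_THRESHOLD = 60  # assumed cutoff for a "long" (satisfied) click

def summarize_clicks(clicks: list[Click]) -> dict:
    """Aggregate the kinds of signals described in the leak: click counts,
    long clicks, and the date of the last good click, per country/device."""
    segments = defaultdict(
        lambda: {"clicks": 0, "long_clicks": 0, "last_good_click": None}
    )
    for c in clicks:
        seg = segments[(c.url, c.country, c.device)]
        seg["clicks"] += 1
        if c.dwell_seconds >= LONG_CLICK_THRESHOLD:
            seg["long_clicks"] += 1
            if seg["last_good_click"] is None or c.timestamp > seg["last_good_click"]:
                seg["last_good_click"] = c.timestamp
    return dict(segments)
```

The point of the sketch is only that such signals are cheap to compute at scale once click logs exist, which makes the leak’s descriptions plausible.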
2. Different Evaluation of Page Levels: NavBoost is mentioned 84 times in the leaked documents and is named in five modules. There are indications that evaluations are conducted at the subdomain, root domain, and URL levels, suggesting that Google treats different levels of a website differently. The documents also show how data from this system might have influenced the Panda algorithm.
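To make “different levels” concrete, here is a tiny, hypothetical Python helper that derives the URL, subdomain, and root-domain level from a single address. The function name and the naive two-label root-domain heuristic are my own simplifications, not anything from the leaked documents.

```python
from urllib.parse import urlsplit

def levels(url: str) -> dict:
    """Return the three evaluation levels mentioned above for one URL.
    The root-domain logic is a naive two-label heuristic; a real system
    would consult the public suffix list."""
    host = urlsplit(url).hostname or ""
    root = ".".join(host.split(".")[-2:])
    return {"url": url, "subdomain": host, "root_domain": root}

print(levels("https://blog.example.com/post/google-leak"))
# {'url': '...', 'subdomain': 'blog.example.com', 'root_domain': 'example.com'}
```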
3. Cookie and Browser Data: Google uses cookie histories and data from logged-in Chrome users to evaluate quality. This information plays a central role in detecting and preventing manual and automated manipulation of search results.
Cookies are small text files stored by websites on users’ devices, containing information about users’ interactions with the website, such as visited pages, clicks, and search queries. Browser data includes additional information collected by the browser itself, such as dwell time on a page, scroll depth, and visit frequency. Google uses this data to get a comprehensive picture of user interaction with websites. This includes both behavior on individual pages and overall navigation patterns across the web. By analyzing cookie and browser data, Google can identify unusual or suspicious patterns indicating manual or automated manipulations. For example, a high number of short visits or repeated clicks on specific elements might be considered signs of click spam.
Google considers user behavior on websites as an indicator of content quality and relevance. Long dwell times and deep scrolling indicate that users find the content valuable and relevant. This data helps Google identify high-quality content and rank it higher in search results. Logged-in Chrome users provide particularly valuable data, as Google can track their search history and interactions across devices.
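As a rough illustration only: a heuristic like the following captures the kind of pattern described above, where many extremely short visits or repeated clicks on the same element would look suspicious. All names and thresholds are my own assumptions and are not taken from the documents.

```python
from collections import Counter

# Hypothetical session events: (page, element_clicked, seconds_on_page)
def looks_like_click_spam(events, short_visit_s=3, max_repeats=10) -> bool:
    """Flag a session as suspicious when more than half of its visits are
    extremely short, or when one element is clicked unusually often.
    Thresholds are arbitrary illustrative assumptions."""
    short_visits = sum(1 for _, _, secs in events if secs < short_visit_s)
    repeated = max(Counter(el for _, el, _ in events).values(), default=0)
    return short_visits > len(events) / 2 or repeated > max_repeats
```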
4. Geographical Data: NavBoost considers geographical differences in click data and adjusts search results accordingly.
Considering geographical differences involves several aspects (a small illustrative sketch follows the list below):
– Regional Preferences: Users from different regions may have different preferences for certain types of content. NavBoost adjusts search results by prioritizing websites particularly relevant to the respective region.
– Linguistic Differences: Different regions speak different languages, and even within the same language, regional differences in language use may exist. NavBoost recognizes these differences and ensures that search results are linguistically and culturally relevant.
– Device-Specific Differences: The way users interact with search engines may vary by region and device type. For example, users in urban areas might use mobile devices more often, while users in rural areas might use desktops. NavBoost considers these differences and optimizes search results accordingly.
– Local Events and Trends: Geographical differences also include considering local events and trends. NavBoost can consider current events, such as local holidays or important news, and adjust search results accordingly, ensuring users receive relevant and current information.
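To make the idea of geographic and device segmentation tangible, here is a small, hypothetical Python sketch. The field names and the per-segment click-through-rate calculation are my own illustration of slicing click data by country and device; they do not reproduce anything from the leaked documents.

```python
from collections import defaultdict

# Hypothetical log rows: (query, url, country, device, clicked: bool)
def ctr_by_segment(rows):
    """Compute a click-through rate per (query, url, country, device)
    segment, illustrating the kind of geographic/device slicing
    described above."""
    stats = defaultdict(lambda: [0, 0])  # [impressions, clicks]
    for query, url, country, device, clicked in rows:
        key = (query, url, country, device)
        stats[key][0] += 1
        stats[key][1] += int(clicked)
    return {key: clicks / imps for key, (imps, clicks) in stats.items()}
```

Once click data is segmented this way, a result that performs well for mobile users in one country can be treated differently from the same result elsewhere, which is exactly the behavior attributed to NavBoost.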
5. Whitelists During the Pandemic and Elections: During the COVID-19 pandemic and democratic elections, Google used whitelists to ensure trusted websites ranked higher in search results. This practice was introduced to minimize the spread of misinformation and harmful content.
In times of crises like the pandemic and during important political events, the risk is particularly high that false information spreads quickly and has serious consequences. Using whitelists allowed Google to guarantee that only websites deemed reliable and credible appeared in the top search results.
This meant that government websites, renowned health organizations like WHO and CDC, and established news sources were preferred when users searched for information about COVID-19. Similarly, during elections, Google ensured that official election information and trusted news sources were prominently displayed to prevent the spread of misinformation about election processes and results.
What Does This Mean for Your SEO?
What exactly can I derive from these insights for my SEO strategy? Let’s go through the individual points in detail:
1. Use of Click Signals: The confirmation that Google uses click data to evaluate search results means that user behavior plays a significant role. Long clicks, which indicate a successful search session, are rated positively. Websites should therefore aim to create content that captivates users and keeps them on the page longer. Internal links to other relevant content should be placed in a user-friendly way. Anything that scares users away on their first visit (excessively long loading times, intrusive pop-ups, etc.) should be avoided.
So: nothing really new, but it’s a good reminder of how important happy and loyal users are to your web presence and how closely UX and SEO are intertwined. And last but not least: we should never underestimate how smart Google has become at evaluating user signals.
2. Different Evaluation of Page Levels: We remember the Panda update, when low-quality content on subdomains or individual URLs hurt entire domains. The leak once again shows how important a well-maintained overall presence is in the evaluation of websites. Especially in light of Google’s currently hotly debated Site Reputation Abuse measures, you should not turn a blind eye to individual subdomains or tolerate quality deficits there.
3. Cookie and Browser Data: Google uses cookie histories and browser data to evaluate the quality of websites and detect manipulation. Websites should therefore encourage positive and authentic user behavior. Avoid the typical clickbait on social media, cold-email campaigns that lead to disappointment and user drop-off, and sloppy SEA campaigns whose landing pages don’t match the ad. Besides the monetary losses, such actions produce conspicuous click patterns that should be avoided.
4. Geographical Data: Considering geographical differences in click data means that websites should provide locally relevant content. This can include creating content in different languages and/or dialects and adapting to local trends and events. A clean technical implementation of, for example, the hreflang tag or the language declaration, as well as using the correct language in the URL, also matters (a short sketch follows this point).
In organic link building and online PR, regional particularities and target groups should be considered. On a related note: it’s possible that Google completely ignores regionally and thematically irrelevant links.
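For the hreflang point, here is a minimal sketch of what a clean implementation can look like. The URLs and the ALTERNATES mapping are placeholders; the generated link tags follow the standard hreflang pattern and should appear on every language version of the page, including a self-reference and an x-default fallback.

```python
# Hypothetical mapping of language/region codes to localized URLs.
ALTERNATES = {
    "de-DE": "https://www.example.com/de/",
    "en-US": "https://www.example.com/en/",
    "x-default": "https://www.example.com/",
}

def hreflang_tags(alternates: dict) -> str:
    """Render the <link rel="alternate" hreflang="..."> tags that every
    language version of the page should carry."""
    return "\n".join(
        f'<link rel="alternate" hreflang="{lang}" href="{url}" />'
        for lang, url in alternates.items()
    )

print(hreflang_tags(ALTERNATES))
```

Whether the tags are rendered server-side like this or maintained by hand, the key is that all language versions reference each other consistently.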
5. Whitelists During the Pandemic and Elections: Using whitelists to prioritize trusted websites in times of crisis shows how important it is to be recognized as a trusted source. Websites should work to strengthen their authority and credibility by providing high-quality content, being cited by reputable sources, and building a transparent and trustworthy online presence.
In keyword analysis and optimization, you should not neglect your own brand keywords. We have long known that brands have clear ranking advantages. The stronger a brand is, the more trustworthy it is. This also means that search traffic driven by brand names feeds into the perception of the brand itself.
In summary, if you have always done product-driven SEO, focused on the user and user experience, let qualified authors speak, and shared regionally relevant information, you don’t need to change your SEO strategy.
The leak shows once again which factors should be valued and how well Google understands contexts. If you are still trying to create a matching landing page for each keyword in 2024, you should thoroughly rethink your SEO activities. However, if you have understood the concept of entities and incorporated it into your SEO, you don’t need to worry about the Google API Leak.
Google has responded to the leak, urging the public to remain calm. The company emphasizes that the leaked documents may contain out-of-context, outdated, or incomplete information and that not all described factors are actually used in the ranking algorithms.
“We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information,” a spokesperson told The Register. “We’ve shared extensive information about how Search works and the types of factors that our systems weigh while also working to protect the integrity of our results from manipulation.”
Is this another smoke screen? We don’t know. But we can confidently say: high-quality, long-term SEO pays off, while manipulation fizzles out quickly or even harms your ranking success.