How does Google
crawl the web?
A study based on the analysis of 413 million
pages crawled by Botify and 6 billion
Googlebot requests.
Today’s Presenters
Annabelle Bouard, Search Data Strategist @botify.com
Dimitri Brunel, Search Data Strategist @botify.com
@botify - #BotifyWebinar
Today’s Agenda
Goal: Better understand Google's behavior with a scientific approach.
● Methodology & Definitions
● The 1st study of its kind based on real customer data globally
● We put REAL figures on what SEOs have known but could not prove
● Insights from our study
● How can you go further?
Methodology & Definitions
Improved SEO Methodology
Scale up the dataset from a single website to a full set of websites from different industries.
More Precise Analysis
Add a scientific approach to the empirical one for detailed, data-backed insights.
Real Insights
Confirm or invalidate assumed Google behaviors and discover new ones.
Share a Belief
Be more efficient in our SEO in order to continually improve Googlebot’s efficiency and user experience.
Definitions for today’s session
● Compliant URL: a URL that meets the following criteria:
○ HTTP 200 Status Code
○ Self-Referential Canonical
○ No noindex Tag
○ HTML Content Type
● Crawl Ratio: percentage of compliant pages (indexable pages) crawled by Google in 30 days.
● Crawl Frequency: average number of times a website’s URL was crawled by Google in 30 days.
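The three definitions above can be sketched in a few lines of Python. This is only an illustration, not Botify's implementation: the `Page` fields, the function names and the sample URLs are hypothetical, a missing canonical tag is assumed to count as self-referential, and Crawl Frequency is assumed to be averaged over the URLs Google requested at least once.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Page:
    url: str
    status: int               # HTTP status code returned for the URL
    canonical: Optional[str]  # canonical target, or None when no tag is present
    noindex: bool             # True when a noindex directive is present
    content_type: str         # e.g. "text/html; charset=utf-8"


def is_compliant(page: Page) -> bool:
    """Apply the four criteria from the definition to one page."""
    return (
        page.status == 200
        # Assumption: a missing canonical tag counts as self-referential.
        and page.canonical in (None, page.url)
        and not page.noindex
        and page.content_type.lower().startswith("text/html")
    )


def crawl_ratio(pages: List[Page], googlebot_hits: Dict[str, int]) -> float:
    """Percentage of compliant pages requested by Googlebot at least once."""
    compliant = [p for p in pages if is_compliant(p)]
    if not compliant:
        return 0.0
    crawled = sum(1 for p in compliant if googlebot_hits.get(p.url, 0) > 0)
    return 100.0 * crawled / len(compliant)


def crawl_frequency(googlebot_hits: Dict[str, int]) -> float:
    """Average number of Googlebot requests per crawled URL (assumed reading)."""
    counts = [n for n in googlebot_hits.values() if n > 0]
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical sample: two compliant pages, two non-compliant pages.
pages = [
    Page("https://example.com/", 200, "https://example.com/", False, "text/html"),
    Page("https://example.com/a", 200, None, False, "text/html; charset=utf-8"),
    Page("https://example.com/b", 200, "https://example.com/", False, "text/html"),
    Page("https://example.com/c", 404, None, False, "text/html"),
]
# Googlebot requests per URL over the 30-day window (hypothetical counts).
hits = {"https://example.com/": 3, "https://example.com/c": 1}

ratio = crawl_ratio(pages, hits)
frequency = crawl_frequency(hits)
```

In this sample only one of the two compliant pages was requested by Googlebot, so the Crawl Ratio is 50%, while the Crawl Frequency averages the requests over both crawled URLs.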
The 1st study of its kind based on real customer data globally*
* Anonymized data
We Looked at a Massive Amount of Data
● 270 websites that fall in one of the following industries: Retail, Publisher, Classifieds.
● 413M pages analyzed. Data from Botify Analytics and from log files for these websites.
● 6.2B Googlebot requests analyzed. 30 days of web server log files for each website.
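As a rough illustration of how Googlebot requests can be counted from web server log files, here is a minimal sketch. It assumes logs in the common "combined" access log format and identifies Googlebot by user-agent string only, which can be spoofed (a production pipeline would also verify the requesting IP). The regular expression, the function name and the sample lines are hypothetical and are not taken from the study.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed layout: combined access log format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines: Iterable[str]) -> Counter:
    """Count requests per path whose user agent mentions Googlebot."""
    hits: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return hits


# Hypothetical log lines: two Googlebot requests and one browser request.
sample_lines = [
    '66.249.66.1 - - [10/Oct/2018:13:55:36 +0000] "GET /a HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [11/Oct/2018:08:12:01 +0000] "GET /a HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [11/Oct/2018:09:00:00 +0000] "GET /b HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/62.0"',
]

sample_hits = googlebot_hits(sample_lines)
```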
We also looked at websites of all shapes and sizes
[Charts: Industries Analyzed; Dataset by Website Size (in pages)]
And we looked at a lot of different metrics
Type of Website
● Industry
● Size
Types of Pages
● Non-Compliant Pages
● Orphan Pages
Structure & Content
● PageRank
● Page Depth
● Load Times
● No. of Outlinks
● Template weight
● Content Size
We put REAL figures on what SEOs have known but could not prove
How Does Google Crawl the Web?
Website size matters
A website’s size is one of the most important factors impacting Google’s crawl.
KPIs are impacted by website size
The larger the website, the greater the impact: Orphan Pages, Load Time, % of Words vs. Template
Some KPIs, like the number of orphan pages, the load time, or the percentage of words vs. template, have almost no impact on small websites but have a huge impact on big websites.
Huge impact regardless of size: PageRank, Depth, Content Size
Other KPIs, like PageRank dilution, depth, or, surprisingly, content size, have a big impact on Google’s crawl, regardless of website size.
Insights from our study
What elements really impact Google’s crawl?
WEBSITE
1. Industry ✘
2. Size ✓
TYPE OF PAGES
3. Non-Compliant Pages ✓
4. Orphan Pages ✓
STRUCTURAL KPIS
5. PageRank ✓
6. Depth ✓
7. Load Times ✓
8. No. of Outlinks ✘
9. Template Weight ✓
10. Content Size ✓
Website Size & Industry
#1 - Industry
Industries: Classifieds, Publisher, Retailer
Expected Results
● Similar Crawl Rate
● Different Crawl Frequency depending on industry
CRAWL RATIO AND ACTIVE PAGES RATIO BY INDUSTRY
Confirmation (from our experience and from the analysis of the dataset)
Google’s crawl ratio is not influenced by industry
● Googlebot crawls the web impartially.
● Googlebot crawls impartially regardless of industry.
● Publishers tend to have more active pages (in %).
What's your crawl ratio and active pages ratio?
CRAWL FREQUENCY BY INDUSTRY
New learnings! (from the analysis of the dataset)
Publishers are crawled 45% more frequently than other industries
● Publishers might be crawled more frequently because of fresher and higher quality content.
What's your crawl frequency?
#2 - Website Size
Website size segments: < 10K pages, > 10K pages, > 100K pages, > 1 million pages
Expected Results
● Decreasing Crawl Ratio
● Adaptive Crawl Frequency
CRAWL RATIO AND ACTIVE PAGES RATIO BY WEBSITE SIZE
Confirmation (from our experience and from the analysis of the dataset)
Websites with >1M pages: Crawl Ratio is significantly lower than average
● More pages means more difficulties for Googlebot.
● More pages means fewer active pages in the SERPs (in %).
● Small websites are better crawled by Google but still not entirely.
● Large websites have a harder time using Crawl Budget efficiently.
CRAWL FREQUENCY BY WEBSITE SIZE
Confirmation (from the analysis of the dataset)
Pages on large websites with >1M pages are crawled less frequently than average
● Large websites tend to have more long-tail pages that will be crawled by Google less frequently.
● Good news: this can be influenced with crawl budget optimization.
Type of Pages
#3 - Non-Compliant Pages
At last, a composite indicator. A page is non-compliant when one or more of the following apply:
● HTTP status code other than 200
● NoIndex status
● Canonical tag not set to self
● Content type other than text/HTML
Risk
● Poor indexability from a technical point of view.
● Negative signals for web spiders.
Expected Results
● Lowest Crawl Ratio
● Low Indexation
● Lower Crawl Frequency
COMPLIANT PAGES CRAWLED BY BOTIFY vs. NON-COMPLIANT PAGES CRAWLED BY BOTIFY (413M pages crawled)
Confirmation
37% of pages crawled by Botify were non-compliant pages
From the analysis of the Dataset
● The proportion of Non-Compliant pages (37%) is still too high compared with the goal of total indexability (100% of compliant pages).
● The overall average shows that SEO managers still have room for improvement.
From our Experience
● We see that many websites still face this problem, usually because of:
○ Extensive use of Noindex,
○ Server Errors,
○ Canonical annotations.
CRAWLED COMPLIANT PAGES vs. CRAWLED NON-COMPLIANT PAGES
Confirmation (from our experience and from the analysis of the dataset)
Google is wasting (at least) 16% of its time and resources crawling non-compliant pages
● Google is wasting time crawling Non-Compliant Pages.
● As most websites have a huge proportion of Non-Compliant Pages, Google is on average wasting 16% of its time crawling these useless pages, when it could focus on valuable pages for SEO traffic.
CRAWL RATIO vs. % OF NON-COMPLIANT PAGES CRAWLED BY GOOGLE
Confirmation
The higher the share of Non-Compliant pages crawled, the lower the Crawl Ratio
From our Experience
● We expect that having more Non-Compliant Pages crawled by Google will have a negative impact on the Crawl Ratio of Compliant Pages.
From the analysis of the Dataset
● When the proportion of Non-Compliant Pages crawled by Google increases, the Crawl Ratio decreases.
LESS THAN 100K PAGES vs. MORE THAN 100K PAGES
Confirmation (from the analysis of the dataset)
For large websites over 100K pages, Crawl Ratio is strongly impacted by the number of Non-Compliant pages
● Low impact on small websites but huge impact on medium ones.
What's your compliant pages ratio?
Globally and among pages crawled by Google?
#4 - Orphan Pages
Pages:
● that are outside of the website structure,
● that we did not discover,
● that Google crawled,
● that receive crawl budget.
[Diagram: pages crawled by Botify; pages crawled by Google; pages crawled by Google AND Botify]
Expected Results
● Cannibalization of crawl budget
● Lowering the crawl ratio of the site structure
CRAWL VOLUME ON PAGES IN STRUCTURE vs. CRAWL VOLUME ON ORPHAN PAGES
Confirmation
Orphan pages represent 26% of Google’s crawl
From the analysis of the Dataset
● On average, orphan pages steal ¼ of the crawl.
From our Experience
● We always see many orphan URLs.
● Common reasons:
○ Old implementations, technical regressions,
○ No DNS cleaning, etc.
CRAWL RATIO vs. % OF ORPHAN PAGES CRAWLED BY GOOGLE
Confirmation (from our experience and from the analysis of the dataset)
Too many orphan pages negatively impact the way Google crawls your site
● These pages tend to cannibalize precious crawl budget and lower the Crawl Ratio of pages in the structure, which no longer benefit from 100% of the crawl budget.
● As the percentage of Orphan Pages increases, the Crawl Ratios should be negatively impacted.
● Few orphans = higher crawl ratio; more orphans = lower crawl ratio.
Let’s dig deeper into the data, once again!
LESS THAN 100K PAGES vs. MORE THAN 100K PAGES
New learnings!
Especially for large websites where Crawl Ratio is badly impacted
From our Experience
● Crawl budget cannibalization whatever the size of the website.
From the analysis of the Dataset
● This is true on Large and Very Large websites only.
What proportion of Google's crawl goes to orphan pages on your site?
54.57M / (2.86M + 54.57M) = 95%
% among number of URLs crawled by Botify, here 10M
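The arithmetic on this slide can be reproduced with a short sketch. It assumes that 2.86M is the crawl volume on pages in the structure and 54.57M the crawl volume on orphan pages. The function names are hypothetical, and the set difference is just one simple way to express "crawled by Google but not found by Botify".

```python
from typing import Set


def orphan_crawl_share(structure_crawls: float, orphan_crawls: float) -> float:
    """Share of Google's crawl volume that lands on orphan pages."""
    total = structure_crawls + orphan_crawls
    return orphan_crawls / total if total else 0.0


def orphan_urls(google_urls: Set[str], botify_urls: Set[str]) -> Set[str]:
    """URLs Google requested that the site crawl did not discover."""
    return google_urls - botify_urls


# Figures from the slide (assumed meaning: structure vs. orphan crawl volume).
share = orphan_crawl_share(2.86e6, 54.57e6)
print(f"{share:.0%}")

# Hypothetical URL sets for the set-based view.
orphans = orphan_urls({"/a", "/old-page"}, {"/a", "/b"})
```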
Website Structure
#5 - Internal PageRank
● How popularity is distributed within the website's internal structure
● A strong crawl signal supposed to guide Googlebot(s)
Expected Results
● Diluting the Internal PageRank on Non-Compliant Pages should negatively impact Google’s Crawl Ratio on Compliant Pages
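To make the dilution idea concrete, here is a minimal sketch that runs a simplified, textbook-style PageRank over an internal link graph and measures the share of rank held by compliant pages. This is not Botify's Internal PageRank formula. The damping factor, the function names and the tiny example graphs are assumptions for illustration only.

```python
from typing import Dict, List, Set


def internal_pagerank(links: Dict[str, List[str]],
                      damping: float = 0.85,
                      iterations: int = 50) -> Dict[str, float]:
    """Simplified PageRank by power iteration over internal links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outlinks: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


def compliant_pagerank_share(rank: Dict[str, float], compliant: Set[str]) -> float:
    """Fraction of total internal PageRank held by compliant pages."""
    total = sum(rank.values())
    return sum(r for p, r in rank.items() if p in compliant) / total


# Hypothetical site where the home page links to a non-compliant filter page.
diluted = internal_pagerank({"/": ["/a", "/filter"], "/a": ["/"], "/filter": ["/"]})
# Same site without the link to the non-compliant page.
focused = internal_pagerank({"/": ["/a"], "/a": ["/"]})

share_diluted = compliant_pagerank_share(diluted, {"/", "/a"})
share_focused = compliant_pagerank_share(focused, {"/", "/a"})
```

In this toy comparison, removing the link to the non-compliant page leaves all of the internal PageRank on compliant pages.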
CRAWL RATIO vs. % OF INTERNAL PAGERANK SPREAD ACROSS COMPLIANT PAGES
Confirmation (from our experience and from the analysis of the dataset)
The more the internal PageRank is focused on compliant pages, the better the Crawl Ratio
● If Compliant Pages get more Internal PageRank, their Crawl Ratio should improve.
● Higher crawl ratio → Optimize your linking!
Pro Tips:
● Don’t waste PR with Nofollow and Noindex tags.
● Crawl Ratio ⇔ opportunity to improve your links.
What's your internal PR on compliant and non-compliant pages?
(bottom of page)
#6 - Depth
Minimum number of clicks to reach the page from the Home page
[Diagram labels: # Folders Depth; # Clicks from the Home Page]
Expected Results
● Slow down the Crawl
● Lower Crawl Ratio
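Depth as defined here, the minimum number of clicks from the home page, corresponds to a breadth-first search over the internal link graph. The sketch below shows that idea on a hypothetical five-page site. The function name and the URLs are illustrative only.

```python
from collections import deque
from typing import Dict, List


def click_depths(links: Dict[str, List[str]], home: str) -> Dict[str, int]:
    """Minimum number of clicks from the home page to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical internal link graph.
site = {
    "/": ["/category", "/about"],
    "/category": ["/product-1", "/product-2"],
    "/about": ["/product-2"],
    "/product-1": ["/"],
}

depths = click_depths(site, "/")
average_depth = sum(depths.values()) / len(depths)
```

Pages that are never reached from the home page do not appear in the result, which is one way orphan pages show up in a structure crawl.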
CRAWL RATIO vs. AVG. DEPTH IN ANY WEBSITE STRUCTURE
Confirmation (from our experience and from the analysis of the dataset)
The depth of a page greatly impacts its chances of being crawled by Google
● We've known for ages that depth is an SEO / UX problem:
○ Catalog size,
○ Efficient faceted navigation,
○ Structure pruning to remove useless pages.
● Websites with a higher average Depth should be less crawled by Google.
What's your average page depth for compliant pages?
#7 - Load Times
From a web crawler's “point of view”, we consider:
● The time to first byte (web server responsiveness), plus
● The time to download the page's HTML source (last byte).
Expected Results
● Idle Crawl
● Huge Impact on Crawl Ratio
CRAWL RATIO vs. LOAD TIMES IN MILLISECONDS
Can we really believe that Load Times don’t impact Google’s crawl?
From our Experience
● With higher average Load Times, the Crawl Ratios should decrease.
From the analysis of the Dataset
● When looking at websites of all sizes, Load Times don’t seem to have any significant impact on Google’s crawl.
[Chart annotations: “Disturbing fact here”; “Your target”]
Can we dig deeper into the data to clarify?
LESS THAN 10K PAGES (limited impact) vs. MORE THAN 10K PAGES (dramatic impact)
New learnings!
For large websites, Load Times are definitely impacting Google’s crawl
From our Experience
● With higher average Load Times, the Crawl Ratios should decrease.
From the analysis of the Dataset
● Small websites ⇔ Low impact of Load Times
● Big websites ⇔ Huge impact of Load Times
What's your average load time for compliant pages?
#8 - Number of Internal Outlinks
Outlinks counted: either Follow or NoFollow, to a Compliant page or to a Non-Compliant page
Expected Results
● Quantity is not Quality
● Impact on Crawl Ratio when too many
CRAWL RATIO vs. NO. OF OUTLINKS PER PAGE; CRAWL RATIO vs. NO. OF OUTLINKS TO NON-COMPLIANT PAGES
No Confirmation
Too many Outlinks don't really impact Google’s crawl
From our Experience
● Fewer outlinks ⇔ Better Crawl Ratio
● Bad outlinks ⇔ Slightly decrease the Crawl Ratio
From the analysis of the Dataset
● Google’s crawl doesn’t seem to be very impacted by the number of Outlinks.
Page Content
#9 - Percentage of Content
Percentage of Content: REAL Content vs. TEMPLATE Content
Expected Results
● A low percentage of “real” content often means a heavier page
● More difficult for Google to crawl
LESS THAN 10K PAGES (limited impact) vs. MORE THAN 1M PAGES (significant impact)
Confirmation and new learnings! (from the analysis of the dataset)
Heavy templates have a huge impact on large websites’ Crawl Ratio
● Small websites => Limited impact of the % of Content vs. Template
● Big websites => Huge impact of the % of Content vs. Template
Potential reason:
Pages with a low % of Content vs. Template tend to be heavier and therefore have slow Load Times. We have already seen that large websites are highly impacted by Load Times.
What's your average template weight?
#10 - Content Size
The number of words in the page, excluding template.
Expected Results
● The more content on average, the more crawled
● Yet a limited impact on Google’s crawl
CRAWL RATIO vs. CONTENT SIZE (in words)
Confirmation and new learnings!
Content Size has a very big impact on Google’s crawl
From our Experience
● Websites with more content should be more crawled by Google, but we do not expect a very high impact on Google’s crawl.
From the analysis of the Dataset
● Content Size impacts Google’s crawl.
● Content Size is more impactful on Google’s crawl than we were expecting.
New learnings! (from the analysis of the dataset)
…whatever the size of the website
● Content Size positively impacts Google’s crawl for websites of any size.
[Charts: LESS THAN 100K PAGES; BETWEEN 100K AND 1M PAGES; MORE THAN 1M PAGES. Annotations: “Some impact”, “Positive impact”, “Great impact”]
What's your average content size?
How can you go further?
Use custom charts
After looking at overall, site-wide figures:
● Look at pages by type of SEO role (expected traffic)
○ Product pages (ecommerce), article pages (publishing), ads (classifieds)
○ Category pages (lists of products / articles / ads)
→ Break down by segment in the report
● Use the “advanced selector” feature to explore your data
○ Break down by any relevant dimension (content size, depth, linking…)
○ Combine 2 dimensions (segment + content size…)
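Outside of the Botify report, the same kind of breakdown can be approximated on exported data. The sketch below groups compliant pages by segment and computes a crawl ratio per segment. The record layout `(url, segment, crawled_by_google)`, the function name and the sample rows are hypothetical and do not describe a Botify export format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per compliant page: (url, segment, crawled_by_google).
PageRecord = Tuple[str, str, bool]


def crawl_ratio_by_segment(records: Iterable[PageRecord]) -> Dict[str, float]:
    """Percentage of pages crawled by Google, per segment."""
    totals: Dict[str, int] = defaultdict(int)
    crawled: Dict[str, int] = defaultdict(int)
    for _url, segment, was_crawled in records:
        totals[segment] += 1
        if was_crawled:
            crawled[segment] += 1
    return {seg: 100.0 * crawled[seg] / totals[seg] for seg in totals}


# Hypothetical sample: three product pages and two category pages.
records = [
    ("/product/1", "product", True),
    ("/product/2", "product", False),
    ("/product/3", "product", False),
    ("/category/shoes", "category", True),
    ("/category/bags", "category", True),
]

ratios = crawl_ratio_by_segment(records)
```

To combine two dimensions, the grouping key could be a tuple such as (segment, content-size bucket) instead of the segment alone.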
What's the impact of load time on Google's crawl?
What's the impact of content size on Google's crawl?
How to combine with a secondary dimension?
Avg. load time for compliant pages by segment?
Key Takeaways
#1 Website Size dramatically impacts Google’s Crawl Ratio.
#2 Even small websites are not crawled at 100% by Google. Crawl Budget matters.
#3 Content size matters: Build high-quality unique content
#4 Orphans Volume matters: Like water, don’t waste crawl budget!
#5 Structure Depth matters: Don’t be afraid, prune useless branches!
Thank you for your attention
Get in touch!
hello@botify.com
Questions?
How Does Google Crawl the Web?

  • 1. How does Google crawl the web? A study based on the analysis of 413 million pages crawled by Botify and 6 billion Googlebot requests.
  • 2. Annabelle Bouard Search Data Strategist @botify.com Today’s Presenter @botify - #BotifyWebinar Dimitri Brunel Search Data Strategist @botify.com
  • 3. Today’s Agenda Goal: Better understand Google's behavior with a scientific approach. ● Methodology & Definitions ● The 1st study of its kind based on real customer data globally ● We put REAL figures on what SEOs have known but could not prove ● Insights from our study ● How can you go further? @botify - #BotifyWebinar
  • 5. Scale-up the dataset from a single website to a full set of websites from different industries. Add a scientific approach to the empiric one for detailed and data-backed insights. Improved SEO Methodology More Precise Analysis Confirm or Invalidate Google’s behavior and Discover new ones. Real Insights Be more efficient in our SEO in order to continually improve Googlebot’s efficiency and user experience. Share a Belief @botify - #BotifyWebinar
  • 6. Definitions for today’s session URLs that meet the following criteria: ● HTTP 200 Status Code ● Self Referential Canonical ● No noindex Tag ● HTML Content Type Percentage of compliant pages (indexable pages) crawled by Google in 30 days. Average number of times a website’s URL was crawled by Google in 30 days. Compliant URL Crawl Ratio Crawl Frequency @botify - #BotifyWebinar
  • 7. The 1st study of its kind based on real customer data globally * Anonymized data@botify - #BotifyWebinar
  • 8. We Looked at a Massive Amount of Data 270 413M 6.2B Websites that fall in one of the following industries: Retail Publisher Classifieds Pages analyzed. Data from Botify Analytics and from log files for these websites. Googlebot requests analyzed. 30 days of web server log files for each website. @botify - #BotifyWebinar
  • 9. We also looked at websites of all shapes and sizes Industries Analyzed Dataset by Website Size (in pages) @botify - #BotifyWebinar
  • 10. And we looked at a lot of different metrics ● Industry ● Size ● Non-Compliant Pages ● Orphan Pages ● PageRank ● Page Depth ● Load Times ● No. of Outlinks ● Template weight ● Content Size Type of Website Types of Pages Structure & Content @botify - #BotifyWebinar
  • 11. We put REAL figures on what SEOs have known but could not prove @botify - #BotifyWebinar
  • 13. A website’s size is one of the most important factors impacting Google’s crawl. Website size matters @botify - #BotifyWebinar
  • 14. KPIs are impacted by website size Some KPIs like the number of orphan pages, the load time, or the percentage of words vs. template, have almost no impact on small websites but have a huge impact on big websites. Orphan Pages Load Time % of Words vs. Template The larger the website, the greater the impact: PageRank Depth Content Size Huge impact regardless of size: Other KPIs like the PageRank dilution, depth, or surprisingly content size, have a big impact on Google’s crawl, regardless of website size. @botify - #BotifyWebinar
  • 15. Insights from our study @botify - #BotifyWebinar
  • 16. What elements really impact Google’s crawl? WEBSITE 1. Industry ✘ 2. Size ✓ STRUCTURAL KPIS 5. PageRank ✓ 6. Depth ✓ 7. Load Times ✓ 8. No. of Outlinks ✘ 9. Template Weight ✓ 10. Content Size ✓ TYPE OF PAGES 3. Non-Compliant Pages ✓ 4. Orphan Pages ✓ @botify - #BotifyWebinar
  • 18. #1 - Industry Expected Results Similar Crawl Rate Different Crawl Frequency depending on Industries CLASSIFIEDS PUBLISHERRETAILER @botify - #BotifyWebinar
  • 19. CRAWL RATIO AND ACTIVE PAGES RATIO BY INDUSTRY ● Googlebot impartially crawls on the web. ● Googlebot crawls impartially regardless of industry. ● Publishers tend to have more active pages (in %). Google’s crawl ratio is not influenced by industry From our Experience Confirmation From the analysis of the Dataset @botify - #BotifyWebinar
  • 20. What's your crawl ratio and active pages ratio? @botify - #BotifyWebinar
  • 21. CRAWL FREQUENCY BY INDUSTRY Publishers are crawled 45% more frequently than other industries Publishers might be crawled more frequently because of fresher and higher quality content. From the analysis of the Dataset New learnings! @botify - #BotifyWebinar
  • 22. What's your crawl frequency? @botify - #BotifyWebinar
  • 23. What's your crawl frequency? @botify - #BotifyWebinar
  • 24. 2# - Website Size Expected Results Decreasing Crawl Ratio Adaptative Crawl Frequency > 10K PAGES > 1 MILLIONS PAGES > 100K PAGES < 10K PAGES @botify - #BotifyWebinar
  • 25. Confirmation CRAWL RATIO AND ACTIVE PAGES RATIO BY WEBSITE SIZE From our Experience From the analysis of the Dataset Websites with >1M pages: Crawl Ratio is significantly lower than average ● More pages means more difficulties for Googlebot. ● More pages means fewer active pages in the SERPs (in %). ● Small websites are better crawled by Google but still not entirely. ● Large websites have a harder time using Crawl Budget efficiently. @botify - #BotifyWebinar
  • 26. Confirmation CRAWL FREQUENCY BY WEBSITE SIZE From the analysis of the Dataset Pages on large websites with >1M pages are less frequently crawled than the average Large websites tend to have more long tail pages that will be crawled by Google less frequently. Good news: this can be influenced with crawl budget optimization. @botify - #BotifyWebinar
  • 27. Type of Pages @botify - #BotifyWebinar
  • 28. 3# - Non-Compliant Pages and or Canonical tag set not to self Not Text/HTML Content ● Poor indexability from a technical point of vue. ● Negative signals for web spiders. Risk Expected Results Lowest Crawl Ratio Low Indexation Lower Crawl Frequency At last a composite indicator NoIndex Status HTTP codes other than 200 Status Code @botify - #BotifyWebinar
  • 29. COMPLIANT PAGES CRAWLED BY BOTIFY vs. NON-COMPLIANT PAGES CRAWLED BY BOTIFY (413M pages crawled). Confirmation: 37% of pages crawled by Botify were non-compliant pages. ● The proportion of Non-Compliant pages (37%) is still too high compared with the goal of total indexability (100% compliant pages). ● The overall average shows that SEO managers still have room for improvement. ● From our experience, many websites still face this problem, usually because of: ○ Extensive use of Noindex, ○ Server Errors, ○ Canonical annotations. (From our Experience; From the analysis of the Dataset)
  • 30. CRAWLED COMPLIANT PAGES vs. CRAWLED NON-COMPLIANT PAGES. Confirmation: Google is wasting (at least) 16% of its time and resources crawling non-compliant pages. ● As most websites have a huge proportion of Non-Compliant Pages, Google wastes on average 16% of its time crawling these useless pages when it could focus on pages that are valuable for SEO traffic. ● Google is wasting time crawling Non-Compliant Pages. (From our Experience; From the analysis of the Dataset)
  • 31. CRAWL RATIO vs. % OF NON-COMPLIANT PAGES CRAWLED BY GOOGLE. Confirmation: The higher the share of Non-Compliant pages crawled, the lower the Crawl Ratio. ● When the proportion of Non-Compliant Pages crawled by Google increases, the Crawl Ratio decreases. ● We expect that having more Non-Compliant Pages crawled by Google will have a negative impact on the Crawl Ratio of Compliant Pages. (From our Experience; From the analysis of the Dataset)
  • 32. LESS THAN 100K PAGES vs. MORE THAN 100K PAGES. Confirmation: For large websites over 100K pages, Crawl Ratio is strongly impacted by the number of Non-Compliant pages. ● Low impact on small websites but huge on medium ones. (From the analysis of the Dataset)
  • 33. What's your compliant pages ratio? Globally and among pages crawled by Google?
  • 34. #4 - Orphan Pages. Pages: ● that are outside of the website structure, ● that we did not discover, ● that Google crawled, ● that receive crawl budget. Expected Results: Cannibalization of crawl budget, lowering the crawl ratio of the site structure. (Diagram: Crawled by BOTIFY; Crawled by GOOGLE; Crawled by Google AND Botify)
  • 35. CRAWL VOLUME ON PAGES IN STRUCTURE vs. CRAWL VOLUME ON ORPHAN PAGES. Confirmation: Orphan pages represent 26% of Google’s crawl. ● On average, Orphan Pages steal ¼ of the crawl. ● From our experience, we always see many Orphan URLs. ● Common reasons: ○ Old implementations, technical regressions, ○ No DNS cleaning, etc. (From our Experience; From the analysis of the Dataset)
  • 36. CRAWL RATIO vs. % OF ORPHAN PAGES CRAWLED BY GOOGLE. Confirmation: Too many orphan pages negatively impact the way Google crawls your site. ● These pages tend to cannibalize precious crawl budget and lower the Crawl Ratio of pages in the structure, which no longer benefit from 100% of the crawl budget. ● As the percentage of Orphan Pages increases, the Crawl Ratios should be negatively impacted. Let’s dig deeper into the data, once again! Few orphans = Higher crawl ratio. More orphans = Lower crawl ratio. (From our Experience; From the analysis of the Dataset)
  • 37. LESS THAN 100K PAGES vs. MORE THAN 100K PAGES. New learnings! ● This is true on Large and Very Large websites only. ● Crawl budget cannibalization whatever the size of the website. Especially for large websites, where Crawl Ratio is badly impacted. (From our Experience; From the analysis of the Dataset)
  • 38. What proportion of Google's crawl goes to orphan pages on your site? 54.57M / (2.86M + 54.57M) = 95%. % among number of URLs crawled by Botify (here 10M).
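The arithmetic on this slide can be reproduced with a short Python sketch. The function and variable names are hypothetical, and reading 54.57M as Googlebot's crawl volume on orphan pages and 2.86M as its crawl volume on pages in the structure is an assumption based on the formula's layout, not something the slide states.

```python
def orphan_urls(google_crawled: set[str], botify_crawled: set[str]) -> set[str]:
    """URLs seen in Googlebot logs but not found by the site crawl."""
    return google_crawled - botify_crawled


def orphan_crawl_share(orphan_crawled: float, in_structure_crawled: float) -> float:
    """Share of Google's crawl that goes to orphan pages (0.0 if no crawl)."""
    total = orphan_crawled + in_structure_crawled
    if total == 0:
        return 0.0
    return orphan_crawled / total


# Figures from the slide; which one is "orphan" is an assumption (see above).
share = orphan_crawl_share(54.57e6, 2.86e6)
print(f"{share:.0%}")  # prints "95%"
```

Comparing this share with the 26% average reported in the study shows how far a given site sits from the dataset's norm.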
  • 39. Website Structure
  • 40. #5 - Internal PageRank: how popularity is distributed within the website's internal structure. A strong crawl signal that is supposed to guide Googlebot(s). Expected Results: Diluting the Internal PageRank on Non-Compliant Pages should negatively impact Google’s Crawl Ratio on Compliant Pages.
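To make "how popularity is distributed within the website's internal structure" concrete, here is a minimal Python sketch using the classic PageRank power iteration on an internal link graph. It is not Botify's Internal PageRank formula, which the deck does not describe; the damping factor of 0.85, the iteration count, and all function names are assumptions chosen for illustration.

```python
def internal_pagerank(links: dict[str, list[str]],
                      damping: float = 0.85,
                      iterations: int = 50) -> dict[str, float]:
    """Classic PageRank by power iteration over an internal link graph.

    `links` maps each page to the pages it links to. Ranks sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outlinks: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def compliant_pagerank_share(rank: dict[str, float], compliant: set[str]) -> float:
    """Fraction of total internal PageRank held by compliant pages."""
    total = sum(rank.values())
    return sum(r for p, r in rank.items() if p in compliant) / total
```

A low `compliant_pagerank_share` would indicate that internal links send much of the site's popularity to non-compliant pages, which is the dilution the slide warns about.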
  • 41. CRAWL RATIO vs. % OF INTERNAL PAGERANK SPREAD ACROSS COMPLIANT PAGES. Confirmation: The more the internal PageRank is focused on compliant pages, the better the Crawl Ratio. ● If Compliant Pages get more Internal PageRank, their Crawl Ratio should improve. Pro Tips: ● Don’t waste PR with Nofollow and Noindex tags. ● Crawl Ratio ⇔ opportunity to improve your links. Higher crawl ratio → Optimize your linking! (From our Experience; From the analysis of the Dataset)
  • 42. What's your internal PR on compliant and non-compliant pages? (bottom of page)
  • 43. #6 - Depth: minimum number of clicks to reach the page from the Home page. # Folders Depth; # Clicks from the Home Page. Expected Results: Slow down the Crawl, Lower Crawl Ratio.
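Depth defined as the minimum number of clicks from the home page is a shortest-path distance, which a breadth-first search over the internal link graph computes directly. The sketch below is a minimal illustration in Python; the function names and the dictionary-based link graph are assumptions, not part of the original deck.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Minimum number of clicks from `home` to every reachable page (BFS)."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                # First visit in BFS order is the shortest click path.
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def average_depth(depths: dict[str, int]) -> float:
    """Average click depth over all reachable pages."""
    return sum(depths.values()) / len(depths)
```

Pages that never appear in the result are unreachable from the home page, so they have no depth in the structure at all.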
  • 44. CRAWL RATIO vs. AVG. DEPTH IN ANY WEBSITE STRUCTURE. Confirmation: The depth of a page greatly impacts its chances of being crawled by Google. ● We've known for ages that depth is an SEO / UX problem: ○ Catalog size, ○ Efficient faceted navigation, ○ Structure pruning to remove useless pages. ● Websites with a higher average Depth should be less crawled by Google. (Chart axis: Avg. Depth) (From our Experience; From the analysis of the Dataset)
  • 45. What's your average page depth for compliant pages?
  • 46. #7 - Load Times. From a web crawler's “point of view”, we consider: the Time to first byte (web server responsiveness) + the Time to download the page HTML source (last byte). Expected Results: Idle Crawl, Huge Impact on Crawl Ratio.
  • 47. CRAWL RATIO vs. LOAD TIMES IN MILLISECONDS. Can we really believe that Load Times don’t impact Google’s crawl? ● When looking at websites of all sizes, Load Times don’t seem to have any significant impact on Google’s crawl. ● With higher average Load Times, the Crawl Ratios should decrease. Can we dig deeper into the data to clarify? (Chart annotations: Disturbing fact here; Your target) (From our Experience; From the analysis of the Dataset)
  • 48. LESS THAN 10K PAGES (Limited impact) vs. MORE THAN 10K PAGES (Dramatic impact). New learnings! For large websites, Load Times are definitely impacting Google’s crawl. ● Small websites ⇔ Low impact of Load Times ● Big websites ⇔ Huge impact of Load Times ● With higher average Load Times, the Crawl Ratios should decrease. (From our Experience; From the analysis of the Dataset)
  • 49. What's your average load time for compliant pages?
  • 50. #8 - Number of Internal Outlinks: either Follow or NoFollow, to a Compliant page or to a Non-Compliant page. Expected Results: Quantity is not Quality; impact on Crawl Ratio when there are too many.
  • 51. CRAWL RATIO vs. NO. OF OUTLINKS PER PAGE; CRAWL RATIO vs. NO. OF OUTLINKS TO NON-COMPLIANT PAGES. No Confirmation: Too many Outlinks don't really impact Google’s crawl. ● Google’s crawl doesn’t seem to be very impacted by the number of Outlinks. ● Fewer outlinks ⇔ Better Crawl Ratio ● Bad outlinks ⇔ Slightly decrease the Crawl Ratio (From our Experience; From the analysis of the Dataset)
  • 52. Page Content
  • 53. #9 - Percentage of Content: REAL Content vs. TEMPLATE Content. Expected Results: A low percentage of “real” content often means a heavier page, which is more difficult for Google to crawl.
  • 54. LESS THAN 10K PAGES (Limited impact) vs. MORE THAN 1M PAGES (Significant impact). New learnings! Confirmation: Heavy templates have a huge impact on large websites’ Crawl Ratio. ● Small websites => Limited impact of the % of Content vs. Template ● Big websites => Huge impact of the % of Content vs. Template. Potential reason: Pages with a low % of Content vs. Template tend to be heavier and therefore have slow Load Times. We have already seen that large websites are highly impacted by Load Times. (From the analysis of the Dataset)
  • 55. What's your average template weight?
  • 56. #10 - Content Size: the number of words in the page, excluding template. Expected Results: The more content on average, the more crawled, yet a limited impact on Google's crawl.
  • 57. CRAWL RATIO vs. CONTENT SIZE (in words). New learnings! Confirmation: Content Size has a very big impact on Google’s crawl. ● Content Size impacts Google’s crawl. ● Websites with more content should be crawled more by Google, but we do not expect a very high impact on Google’s crawl. ● Content Size is more impactful on Google’s crawl than we were expecting. (From our Experience; From the analysis of the Dataset)
  • 58. …whatever the size of the website. New learnings! ● Content Size positively impacts Google’s crawl for websites of any size. Panels: LESS THAN 100K PAGES, BETWEEN 100K AND 1M PAGES, MORE THAN 1M PAGES. Annotations: Some impact, Positive impact, Great impact. (From the analysis of the Dataset)
  • 59. What's your average content size?
  • 60. How can you go further? Use custom charts
  • 61. How can you go further? After looking at overall, site-wide figures: ● Look at pages by type of SEO role (expected traffic): ○ Product pages (ecommerce), article pages (publishing), ads (classifieds) ○ Category pages (lists of products / articles / ads) → Breakdown by segment in the report. ● Use the “advanced selector” feature to explore your data: ○ Breakdown by any relevant dimension (content size, depth, linking…) ○ Combine 2 dimensions (segment + content size…)
  • 62. What's the impact of load time on Google's crawl?
  • 64. What's the impact of content size on Google's crawl?
  • 65. How to combine with a secondary dimension?
  • 66. Avg. load time for compliant pages by segment?
  • 68. #1 Website Size dramatically impacts Google’s Crawl Ratio. #2 Even small websites are not crawled at 100% by Google. Crawl Budget matters.
  • 69. #3 Content Size matters: Build high-quality unique content. #4 Orphan Volume matters: Like water, don’t waste crawl budget! #5 Structure Depth matters: Don’t be afraid, prune useless branches!
  • 70. Thank you for your attention. Questions? Get in touch! hello@botify.com