Crawl Optimization
How to optimize to increase crawl budget
What is a Crawler?
A crawler is a program used by search engines to collect data from the internet.
When a crawler visits a website, it reads the website's content (i.e. the text) and
stores it in a database. It also stores all the external and internal links found on
the website. The crawler visits the stored links at a later point in time, which is
how it moves from one website to the next. Through this process the crawler captures
and indexes every website that is linked from at least one other website.
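A minimal sketch of that fetch-store-follow loop, assuming the third-party requests and beautifulsoup4 packages and using https://example.com as a placeholder start URL; a real crawler adds politeness delays, robots.txt handling, and far more robust parsing.

```python
# Minimal illustrative crawler: fetch a page, store its text and links,
# then follow the stored links later. example.com is a placeholder start URL.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    index = {}                      # the stored content: URL -> page text
    queue = deque([start_url])      # links stored for a later visit
    seen = {start_url}

    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                # skip pages that fail to load
        if response.status_code != 200:
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        index[url] = soup.get_text(" ", strip=True)   # store the page text

        # Store every internal and external link for a later visit.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

if __name__ == "__main__":
    pages = crawl("https://example.com")
    print(f"Indexed {len(pages)} pages")
```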
What Crawl Budget Means for Googlebot?
Prioritizing what to crawl, when, and how much resource the server hosting the site can allocate to crawling is
more important for bigger sites, or for those that auto-generate pages based on URL parameters.
Even a magic SEO wand will not get a web page
to rank if the page has not been indexed.
Making sure web pages can be indexed is key during an SEO audit.
Taking Crawl Rate and Crawl Demand together, we
define Crawl Budget as the number of URLs Googlebot
can and wants to crawl.
Crawl rate limit
Googlebot is designed to be a good citizen of the web. Crawling is its main
priority, while making sure it doesn't degrade the experience of users visiting the
site. We call this the "crawl rate limit," which limits the maximum fetching rate for
a given site.
Simply put, this represents the number of simultaneous parallel connections
Googlebot may use to crawl the site, as well as the time it has to wait between the
fetches. The crawl rate can go up and down based on a couple of factors:
● Crawl health
● Limit set in Search Console
Crawl health: if the site responds really quickly for a while, the limit goes up, meaning more connections
can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot
crawls less.
[Charts: crawl errors and page loading time]
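Googlebot's actual adjustment logic is not public, but the behaviour described above can be sketched roughly as follows; the thresholds and step sizes are invented purely for illustration.

```python
# Illustrative crawl-rate controller (not Googlebot's real algorithm).
# The delay between fetches shrinks while the site is fast and healthy,
# and grows when responses are slow or return server errors.
class CrawlRateController:
    def __init__(self, delay=2.0, min_delay=0.5, max_delay=30.0):
        self.delay = delay          # seconds to wait between fetches
        self.min_delay = min_delay
        self.max_delay = max_delay

    def record(self, status_code, response_time):
        if status_code >= 500 or response_time > 2.0:
            # Site is struggling: crawl less (back off sharply).
            self.delay = min(self.delay * 2, self.max_delay)
        elif response_time < 0.5:
            # Site is responding quickly: crawl a bit more.
            self.delay = max(self.delay * 0.8, self.min_delay)
        return self.delay

controller = CrawlRateController()
print(controller.record(200, 0.2))   # fast, healthy response -> delay decreases
print(controller.record(503, 3.0))   # slow server error     -> delay increases
```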
Limit set in Search Console: website owners can reduce Googlebot's crawling of their site. Note
that setting higher limits doesn't automatically increase crawling.
To limit the crawl rate:
1. On the Search Console Home page, click the site that you want.
2. Click the gear icon, then click Site Settings.
3. In the Crawl rate section, select the option you want and then limit the crawl rate as desired.
Crawl demand
Even if the crawl rate limit isn't reached, if there's no demand from indexing, there
will be low activity from Googlebot. The two factors that play a significant role in
determining crawl demand are:
● Popularity
● Staleness
Popularity: URLs that are more popular on the Internet tend to be crawled
more often to keep them fresher in Google's index.
1. PageRank dilution
2. Low-value SEO pages
3. Page depth (a quick depth and orphan check is sketched after this list)
4. Internal linking
5. Orphaned pages
6. Sitemaps
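Depth, internal linking, and orphaned pages can all be checked with a breadth-first walk of the internal link graph. A rough sketch, assuming you already have a page-to-internal-links mapping from a crawl; the site_links data below is invented.

```python
# Sketch: compute click depth from the homepage and find orphaned pages,
# given a mapping of page URL -> internal links found on that page.
from collections import deque

site_links = {
    "/": ["/category", "/about"],
    "/category": ["/product-1", "/product-2"],
    "/product-1": ["/category"],
    "/product-2": [],
    "/about": [],
    "/old-landing-page": ["/"],      # nothing links TO it -> orphaned
}

def crawl_depths(links, start="/"):
    """Breadth-first search from the homepage; depth = number of clicks."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

depths = crawl_depths(site_links)
orphans = [page for page in site_links if page not in depths]

print("Click depth per page:", depths)
print("Orphaned pages (unreachable from the homepage):", orphans)
```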
[Diagrams: PageRank flow vs. low-value pages; PageRank dilution through thin content and through depth; internal linking; orphan pages; sitemaps]
Staleness: Google's systems attempt to prevent URLs from becoming stale in the index.
[Diagram: change detection → index refresh]
Factors affecting crawl budget
According to Google's analysis, having many low-value-add URLs can negatively affect a site's crawling and indexing.
Google found that low-value-add URLs fall into these categories, in order of significance:
● Faceted navigation and session identifiers
● On-site duplicate content
● Soft error pages
● Hacked pages
● Infinite spaces and proxies
● Low quality and spam content
Faceted navigation
and session identifiers
Faceted navigation, such as filtering by color or
price range, can be helpful for your visitors, but
it’s often not search-friendly since it creates
many combinations of URLs with duplicative
content. With duplicative URLs, search engines
may not crawl new or updated unique content
as quickly, and/or they may not index a page
accurately because indexing signals are diluted
between the duplicate versions.
Selecting filters with faceted navigation can create many URL combinations,
such as
http://www.example.com/category.php?category=gummy-candies&price=5-10&price=over-10
On the left is the potential user navigation on the site (i.e., the click
path); on the right are the pages accessed.
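One common mitigation, alongside rel=canonical and parameter settings, is to generate only one canonical form of each filter combination: drop parameters that don't change the content and fix the parameter order. A rough sketch; the sessionid and sort parameter names are assumptions, not a standard.

```python
# Sketch: normalize faceted-navigation URLs so that the same filter
# combination always produces one canonical URL.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

IGNORED_PARAMS = {"sessionid", "sort"}   # parameters that don't change content

def canonicalize(url):
    parts = urlparse(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    params.sort()                         # stable parameter order
    return urlunparse(parts._replace(query=urlencode(params)))

a = "http://www.example.com/category.php?price=5-10&category=gummy-candies&sessionid=abc123"
b = "http://www.example.com/category.php?category=gummy-candies&price=5-10"
print(canonicalize(a) == canonicalize(b))   # True: both map to one URL
```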
On-site duplicate content
Soft error pages
Soft error pages are also known as "soft" or "crypto" 404s.
A "soft 404" occurs when a web
server responds with a 200 OK HTTP
response code for a page that
doesn't exist rather than the
appropriate 404 Not Found. Soft
404s can limit a site's crawl
coverage by search engines because
these duplicate URLs may be
crawled instead of pages with unique
content.
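A quick way to test whether a site serves soft 404s is to request a URL that cannot exist and inspect the status code. A sketch assuming the third-party requests package; the probe path is deliberately nonsense.

```python
# Sketch: detect soft 404s by requesting a path that should not exist.
# A healthy server answers 404; a "soft 404" answers 200 with an error-looking page.
import uuid

import requests

def check_soft_404(base_url):
    probe = f"{base_url.rstrip('/')}/this-page-should-not-exist-{uuid.uuid4().hex}"
    response = requests.get(probe, timeout=10, allow_redirects=True)
    if response.status_code == 404:
        return "OK: server returns a real 404 for missing pages"
    return f"Possible soft 404: missing page returned HTTP {response.status_code}"

print(check_soft_404("https://example.com"))
```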
To infinity and beyond? No!
When Googlebot crawls the web,
it often finds what we call an
"infinite space". These are very
large numbers of links that
usually provide little or no new
content for Googlebot to index. If
this happens on your site,
crawling those URLs may use
unnecessary bandwidth, and
could result in Googlebot failing
to completely index the real
content on your site.
Optimize your crawling & indexing
Remove user-specific details from
URLs.
URL parameters that don't change the content
of the page—like session IDs or sort order—can
be removed from the URL and put into a cookie.
By putting this information in a cookie and 301
redirecting to a "clean" URL, you retain the
information and reduce the number of URLs
pointing to that same content.
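A hedged sketch of that pattern using Flask; the route and the sessionid parameter name are assumptions made for illustration.

```python
# Sketch: move a non-content parameter (here a session ID) into a cookie
# and 301-redirect to the clean URL, so search engines see one URL per page.
from flask import Flask, redirect, request, url_for

app = Flask(__name__)

@app.route("/category")
def category():
    session_id = request.args.get("sessionid")
    if session_id:
        # Rebuild the URL without the session parameter...
        clean_args = {k: v for k, v in request.args.items() if k != "sessionid"}
        response = redirect(url_for("category", **clean_args), code=301)
        # ...and keep the session value in a cookie instead of the URL.
        response.set_cookie("sessionid", session_id)
        return response
    return "Category page content"

if __name__ == "__main__":
    app.run()
```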
Rein in infinite spaces.
Do you have a calendar that links to an infinite
number of past or future dates (each with their
own unique URL)? Do you have paginated data
that returns a status code of 200 when you add
&page=3563 to the URL, even if there aren't
that many pages of data? If so, you have an
infinite crawl space on your website, and
crawlers could be wasting their (and your!)
bandwidth trying to crawl it all. Consider these
tips for reining in infinite spaces.
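For the pagination case above, one fix is to stop answering 200 for pages beyond the real last page. A rough Flask sketch with an invented catalogue and page size:

```python
# Sketch: return a real 404 for out-of-range pagination instead of an
# empty page with HTTP 200, so crawlers stop exploring the infinite space.
from flask import Flask, abort, request

app = Flask(__name__)

ITEMS = [f"item-{i}" for i in range(1, 101)]   # pretend catalogue: 100 items
PAGE_SIZE = 10

@app.route("/products")
def products():
    page = request.args.get("page", default=1, type=int)
    last_page = (len(ITEMS) + PAGE_SIZE - 1) // PAGE_SIZE
    if page < 1 or page > last_page:
        abort(404)                 # /products?page=3563 is now a real 404
    start = (page - 1) * PAGE_SIZE
    return f"Page {page}: " + ", ".join(ITEMS[start:start + PAGE_SIZE])

if __name__ == "__main__":
    app.run()
```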
Disallow actions Googlebot can't
perform.
Using your robots.txt file, you can disallow crawling
of login pages, contact forms, shopping carts, and
other pages whose sole functionality is something
that a crawler can't perform. (Crawlers are
notoriously cheap and shy, so they don't usually
"Add to cart" or "Contact us.") This lets crawlers
spend more of their time crawling content that they
can actually do something with.
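A robots.txt sketch along those lines; the paths are placeholders for the site's real login, contact, and cart URLs, and robots.txt rules control crawling rather than indexing.

```
# Illustrative robots.txt: keep crawlers out of pages whose only
# function is something a crawler can't perform. Paths are placeholders.
User-agent: *
Disallow: /login
Disallow: /contact
Disallow: /cart/
Disallow: /checkout/
```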
One man, one vote. One URL, one set
of content.
In an ideal world, there's a one-to-one pairing
between URL and content: each URL leads to a
unique piece of content, and each piece of
content can only be accessed via one URL. The
closer you can get to this ideal, the more
streamlined your site will be for crawling and
indexing. If your CMS or current site setup
makes this difficult, you can use the
rel=canonical element to indicate the preferred
URL for a particular piece of content.
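For example, a filtered variant of the earlier category page could declare its preferred URL in the page head like this (reusing the example.com URLs from above):

```html
<!-- Served on a filtered variant such as
     /category.php?category=gummy-candies&price=5-10 -->
<link rel="canonical" href="http://www.example.com/category.php?category=gummy-candies" />
```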
Thank You !!!
-Syed Faraz