Crawl Optimization
How to optimize to increase crawl budget
What is a Crawler?
A crawler is a program used by search engines to collect data from the internet.
When a crawler visits a website, it reads the website's content (i.e. the text) and
stores it in a database. It also stores all the external and internal links found on
the website. The crawler visits the stored links at a later point in time, which is
how it moves from one website to the next. Through this process the crawler captures
and indexes every website that is linked from at least one other website.
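A minimal sketch of that fetch-store-follow loop, assuming the third-party requests and beautifulsoup4 packages and using https://example.com as a placeholder start URL; a real crawler adds politeness delays, robots.txt handling, and far more robust parsing.

```python
# Minimal illustrative crawler: fetch a page, store its text and links,
# then follow the stored links later. example.com is a placeholder start URL.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    index = {}                      # the stored content: URL -> page text
    queue = deque([start_url])      # links stored for a later visit
    seen = {start_url}

    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                # skip pages that fail to load
        if response.status_code != 200:
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        index[url] = soup.get_text(" ", strip=True)   # store the page text

        # Store every internal and external link for a later visit.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

if __name__ == "__main__":
    pages = crawl("https://example.com")
    print(f"Indexed {len(pages)} pages")
```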
What Crawl Budget Means for Googlebot?
Prioritizing what to crawl, when, and how much resource the server hosting the site can allocate to crawling is
more important for bigger sites, or for those that auto-generate pages based on URL parameters.
Even a magic SEO wand will not get a web page
to rank if the page has not been indexed.
Making sure web pages can be indexed is key during an SEO audit.
Taking Crawl Rate and Crawl Demand together, we
define Crawl Budget as the number of URLs Googlebot
can and wants to crawl.
Crawl rate limit
Googlebot is designed to be a good citizen of the web. Crawling is its main
priority, while making sure it doesn't degrade the experience of users visiting the
site. We call this the "crawl rate limit," which limits the maximum fetching rate for
a given site.
Simply put, this represents the number of simultaneous parallel connections
Googlebot may use to crawl the site, as well as the time it has to wait between the
fetches. The crawl rate can go up and down based on a couple of factors:
● Crawl health
● Limit set in Search Console
Crawl health: if the site responds really quickly for a while, the limit goes up, meaning more connections
can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot
crawls less.
[Charts: crawl errors and page loading time]
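Googlebot's actual adjustment logic is not public, but the behaviour described above can be sketched roughly as follows; the thresholds and step sizes are invented purely for illustration.

```python
# Illustrative crawl-rate controller (not Googlebot's real algorithm).
# The delay between fetches shrinks while the site is fast and healthy,
# and grows when responses are slow or return server errors.
class CrawlRateController:
    def __init__(self, delay=2.0, min_delay=0.5, max_delay=30.0):
        self.delay = delay          # seconds to wait between fetches
        self.min_delay = min_delay
        self.max_delay = max_delay

    def record(self, status_code, response_time):
        if status_code >= 500 or response_time > 2.0:
            # Site is struggling: crawl less (back off sharply).
            self.delay = min(self.delay * 2, self.max_delay)
        elif response_time < 0.5:
            # Site is responding quickly: crawl a bit more.
            self.delay = max(self.delay * 0.8, self.min_delay)
        return self.delay

controller = CrawlRateController()
print(controller.record(200, 0.2))   # fast, healthy response -> delay decreases
print(controller.record(503, 3.0))   # slow server error     -> delay increases
```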
Limit set in Search Console: website owners can reduce Googlebot's crawling of their site. Note
that setting higher limits doesn't automatically increase crawling.
To limit the crawl rate:
1. On the Search Console Home page, click the site that you want.
2. Click the gear icon, then click Site Settings.
3. In the Crawl rate section, select the option you want and then limit the crawl rate as desired.
Crawl demand
Even if the crawl rate limit isn't reached, if there's no demand from indexing, there
will be low activity from Googlebot. The two factors that play a significant role in
determining crawl demand are:
● Popularity
● Staleness
Popularity: URLs that are more popular on the Internet tend to be crawled
more often to keep them fresher in Google's index.
1. PageRank dilution
2. Low-value SEO pages
3. Page depth (a quick depth and orphan check is sketched after this list)
4. Internal linking
5. Orphaned pages
6. Sitemaps
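Depth, internal linking, and orphaned pages can all be checked with a breadth-first walk of the internal link graph. A rough sketch, assuming you already have a page-to-internal-links mapping from a crawl; the site_links data below is invented.

```python
# Sketch: compute click depth from the homepage and find orphaned pages,
# given a mapping of page URL -> internal links found on that page.
from collections import deque

site_links = {
    "/": ["/category", "/about"],
    "/category": ["/product-1", "/product-2"],
    "/product-1": ["/category"],
    "/product-2": [],
    "/about": [],
    "/old-landing-page": ["/"],      # nothing links TO it -> orphaned
}

def crawl_depths(links, start="/"):
    """Breadth-first search from the homepage; depth = number of clicks."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

depths = crawl_depths(site_links)
orphans = [page for page in site_links if page not in depths]

print("Click depth per page:", depths)
print("Orphaned pages (unreachable from the homepage):", orphans)
```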
[Diagrams: PageRank flow vs. low-value pages; PageRank dilution through thin content and through depth; internal linking; orphan pages; sitemaps]
Staleness: Google's systems attempt to prevent URLs from becoming stale in the index.
[Diagram: change detection → index refresh]
Factors affecting crawl budget
According to Google's analysis, having many low-value-add URLs can negatively affect a site's crawling and indexing.
Google found that low-value-add URLs fall into these categories, in order of significance:
● Faceted navigation and session identifiers
● On-site duplicate content
● Soft error pages
● Hacked pages
● Infinite spaces and proxies
● Low quality and spam content
Faceted navigation
and session identifiers
Faceted navigation, such as filtering by color or
price range, can be helpful for your visitors, but
it’s often not search-friendly since it creates
many combinations of URLs with duplicative
content. With duplicative URLs, search engines
may not crawl new or updated unique content
as quickly, and/or they may not index a page
accurately because indexing signals are diluted
between the duplicate versions.
Selecting filters with faceted navigation can create many URL combinations,
such as
http://www.example.com/category.php?category=gummy-candies&price=5-10&price=over-10
On the left is the potential user navigation on the site (i.e., the click
path); on the right are the pages accessed.
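One common mitigation, alongside rel=canonical and parameter settings, is to generate only one canonical form of each filter combination: drop parameters that don't change the content and fix the parameter order. A rough sketch; the sessionid and sort parameter names are assumptions, not a standard.

```python
# Sketch: normalize faceted-navigation URLs so that the same filter
# combination always produces one canonical URL.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

IGNORED_PARAMS = {"sessionid", "sort"}   # parameters that don't change content

def canonicalize(url):
    parts = urlparse(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    params.sort()                         # stable parameter order
    return urlunparse(parts._replace(query=urlencode(params)))

a = "http://www.example.com/category.php?price=5-10&category=gummy-candies&sessionid=abc123"
b = "http://www.example.com/category.php?category=gummy-candies&price=5-10"
print(canonicalize(a) == canonicalize(b))   # True: both map to one URL
```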
On-site duplicate content
Soft error pages
Soft error pages are also known as "soft" or "crypto" 404s.
A "soft 404" occurs when a web
server responds with a 200 OK HTTP
response code for a page that
doesn't exist rather than the
appropriate 404 Not Found. Soft
404s can limit a site's crawl
coverage by search engines because
these duplicate URLs may be
crawled instead of pages with unique
content.
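A quick way to test whether a site serves soft 404s is to request a URL that cannot exist and inspect the status code. A sketch assuming the third-party requests package; the probe path is deliberately nonsense.

```python
# Sketch: detect soft 404s by requesting a path that should not exist.
# A healthy server answers 404; a "soft 404" answers 200 with an error-looking page.
import uuid

import requests

def check_soft_404(base_url):
    probe = f"{base_url.rstrip('/')}/this-page-should-not-exist-{uuid.uuid4().hex}"
    response = requests.get(probe, timeout=10, allow_redirects=True)
    if response.status_code == 404:
        return "OK: server returns a real 404 for missing pages"
    return f"Possible soft 404: missing page returned HTTP {response.status_code}"

print(check_soft_404("https://example.com"))
```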
To infinity and beyond? No!
When Googlebot crawls the web,
it often finds what we call an
"infinite space". These are very
large numbers of links that
usually provide little or no new
content for Googlebot to index. If
this happens on your site,
crawling those URLs may use
unnecessary bandwidth, and
could result in Googlebot failing
to completely index the real
content on your site.
Optimize your crawling & indexing
Remove user-specific details from
URLs.
URL parameters that don't change the content
of the page—like session IDs or sort order—can
be removed from the URL and put into a cookie.
By putting this information in a cookie and 301
redirecting to a "clean" URL, you retain the
information and reduce the number of URLs
pointing to that same content.
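A hedged sketch of that pattern using Flask; the route and the sessionid parameter name are assumptions made for illustration.

```python
# Sketch: move a non-content parameter (here a session ID) into a cookie
# and 301-redirect to the clean URL, so search engines see one URL per page.
from flask import Flask, redirect, request, url_for

app = Flask(__name__)

@app.route("/category")
def category():
    session_id = request.args.get("sessionid")
    if session_id:
        # Rebuild the URL without the session parameter...
        clean_args = {k: v for k, v in request.args.items() if k != "sessionid"}
        response = redirect(url_for("category", **clean_args), code=301)
        # ...and keep the session value in a cookie instead of the URL.
        response.set_cookie("sessionid", session_id)
        return response
    return "Category page content"

if __name__ == "__main__":
    app.run()
```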
Rein in infinite spaces.
Do you have a calendar that links to an infinite
number of past or future dates (each with their
own unique URL)? Do you have paginated data
that returns a status code of 200 when you add
&page=3563 to the URL, even if there aren't
that many pages of data? If so, you have an
infinite crawl space on your website, and
crawlers could be wasting their (and your!)
bandwidth trying to crawl it all. Consider these
tips for reining in infinite spaces.
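For the pagination case above, one fix is to stop answering 200 for pages beyond the real last page. A rough Flask sketch with an invented catalogue and page size:

```python
# Sketch: return a real 404 for out-of-range pagination instead of an
# empty page with HTTP 200, so crawlers stop exploring the infinite space.
from flask import Flask, abort, request

app = Flask(__name__)

ITEMS = [f"item-{i}" for i in range(1, 101)]   # pretend catalogue: 100 items
PAGE_SIZE = 10

@app.route("/products")
def products():
    page = request.args.get("page", default=1, type=int)
    last_page = (len(ITEMS) + PAGE_SIZE - 1) // PAGE_SIZE
    if page < 1 or page > last_page:
        abort(404)                 # /products?page=3563 is now a real 404
    start = (page - 1) * PAGE_SIZE
    return f"Page {page}: " + ", ".join(ITEMS[start:start + PAGE_SIZE])

if __name__ == "__main__":
    app.run()
```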
Disallow actions Googlebot can't
perform.
Using your robots.txt file, you can disallow crawling
of login pages, contact forms, shopping carts, and
other pages whose sole functionality is something
that a crawler can't perform. (Crawlers are
notoriously cheap and shy, so they don't usually
"Add to cart" or "Contact us.") This lets crawlers
spend more of their time crawling content that they
can actually do something with.
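A robots.txt sketch along those lines; the paths are placeholders for the site's real login, contact, and cart URLs, and robots.txt rules control crawling rather than indexing.

```
# Illustrative robots.txt: keep crawlers out of pages whose only
# function is something a crawler can't perform. Paths are placeholders.
User-agent: *
Disallow: /login
Disallow: /contact
Disallow: /cart/
Disallow: /checkout/
```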
One man, one vote. One URL, one set
of content.
In an ideal world, there's a one-to-one pairing
between URL and content: each URL leads to a
unique piece of content, and each piece of
content can only be accessed via one URL. The
closer you can get to this ideal, the more
streamlined your site will be for crawling and
indexing. If your CMS or current site setup
makes this difficult, you can use the
rel=canonical element to indicate the preferred
URL for a particular piece of content.
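For example, a filtered variant of the earlier category page could declare its preferred URL in the page head like this (reusing the example.com URLs from above):

```html
<!-- Served on a filtered variant such as
     /category.php?category=gummy-candies&price=5-10 -->
<link rel="canonical" href="http://www.example.com/category.php?category=gummy-candies" />
```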
Thank You !!!
-Syed Faraz