Checking Google Index status at scale with Node.js
Jose Luis Hernando
@jlhernando #BrightonSEO
Senior Technical SEO Consultant
Today’s agenda
1. Why it’s important to know your website’s indexing status
2. The challenge to extract this data
3. Getting the data with Node.js – Live Demo!
4. Using this data for your SEO strategy
Why is it important?
Reason #1
Not in the Index => Not in the SERPs
Icons from Google, Flaticon & Sitecheckerpro
Why is it important?
Reason #2
Google evaluates site quality based on indexed pages
High Quality Pages: Category Pages, Editorial Pages, Canonical Product Pages
+
Low Quality Pages: Uncontrolled Faceted Navigation URLs, Unsupervised User Generated Content, Indexable Non-Canonical URLs
Sources:
Google Only Can Judge Site Quality Based On Pages They Index – Barry Schwartz (Search Engine Roundtable)
English Google Webmaster Central office-hours hangout – Google Webmasters YouTube Channel
Why is it important?
Reason #3
Inefficient use of Google’s resources
https://website.com/category-one/
HTML CSS JS
/category-one/?color=red
/category-one/?color=blue
/category-one/?color=red&blue
…
∞
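To get a feel for how quickly uncontrolled facets multiply, here is a rough sketch. It assumes each facet is either not applied or set to exactly one value; the facet names and values are made up for illustration. Sites that allow multi-select or reordered parameters grow even faster.

```javascript
// Count parameterised URL variants of a single category page, assuming each
// facet is either absent or set to exactly one of its values.
function countFacetUrls(facets) {
  const combinations = Object.values(facets).reduce(
    (total, values) => total * (values.length + 1), // +1 for "facet not applied"
    1
  );
  return combinations - 1; // exclude the clean, unfiltered category URL
}

// Hypothetical facets for /category-one/ (not taken from the talk)
const facets = {
  color: ['red', 'blue', 'green'],
  size: ['s', 'm', 'l', 'xl'],
  brand: ['acme', 'globex'],
};

console.log(countFacetUrls(facets)); // (3+1) * (4+1) * (2+1) - 1 = 59
```

Three small facets already produce 59 crawlable variants of one page; every extra facet multiplies that number again.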
How much of your site is Googlebot crawling?
Site size (URLs)    Avg. Crawl Ratio (%)    Avg. Active Ratio (%)
1-10k               71.7                    45.3
10k-100k            54.3                    30.2
100k-1M             41.7                    15.1
1M+                 34.4                    10.1
Crawl Ratio: Percentage of pages crawled by Google in 30 days
Active Ratio: Percentage of pages that have generated at least one organic visit in 30 days.
Source: How Does Google Crawl the Web? – (Annabelle Bouard & Dimitri Brunel – Botify)
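If you want to compute the same two ratios for your own site, a minimal sketch could look like this. It assumes you already have three URL lists for the same 30-day window; the function and parameter names are hypothetical and this is not Botify's implementation.

```javascript
// Crawl Ratio and Active Ratio for one site over a 30-day window.
// siteUrls:    every URL known from your own crawl (hypothetical input)
// crawledUrls: URLs requested by Googlebot according to your server logs
// activeUrls:  URLs with at least one organic visit according to analytics
function crawlAndActiveRatio(siteUrls, crawledUrls, activeUrls) {
  const site = new Set(siteUrls);
  const crawled = new Set(crawledUrls);
  const active = new Set(activeUrls);

  let crawledCount = 0;
  let activeCount = 0;
  for (const url of site) {
    if (crawled.has(url)) crawledCount += 1;
    if (active.has(url)) activeCount += 1;
  }

  // Percentage of the site's URLs, rounded to one decimal place
  const pct = (n) => (site.size === 0 ? 0 : Math.round((n / site.size) * 1000) / 10);
  return { crawlRatio: pct(crawledCount), activeRatio: pct(activeCount) };
}
```

URLs that appear in logs or analytics but not in `siteUrls` are ignored, so both ratios stay relative to the site structure you crawled.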
The challenge: extracting this data
• Googlebot’s crawling behaviour doesn’t determine indexing status
• You rely on partial and sometimes inaccurate data points:
  • site: & inurl: operators
  • GSC Indexing reports:
    • URL Inspection Tool (< 200 URLs / day)
    • Coverage Reports (< 1,000 rows / report)
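To put those limits in perspective, here is a quick back-of-the-envelope sketch. It treats the 200 URLs / day and 1,000 rows / report figures above as hard caps and uses the 111,772-URL sitemap from use case #1 as the example size; the function names are made up.

```javascript
// Days needed to check every URL one by one at a fixed daily quota.
function daysToInspect(urlCount, dailyQuota = 200) {
  return Math.ceil(urlCount / dailyQuota);
}

// Share of a URL set that fits in one exported report, as a percentage
// rounded to one decimal place.
function reportCoverage(urlCount, rowLimit = 1000) {
  return Math.round((Math.min(rowLimit, urlCount) / urlCount) * 1000) / 10;
}

console.log(daysToInspect(111772)); // 559 days at 200 URLs per day
console.log(reportCoverage(111772)); // 0.9 (% of URLs visible in a 1,000-row report)
```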
Proxy metrics != Accurate data
If you can’t find it, build it
{Live demo}
bit.ly/google-index-checker-script
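The speaker notes for this demo mention that concurrent requests are capped at 5 on the ScraperAPI free plan. Below is a minimal sketch of that kind of cap, not the actual Builtvisible script: `checkAll` and `checkFn` are hypothetical names, and `checkFn` stands in for whatever request you make to find out whether a single URL is indexed.

```javascript
// Run checkFn over every URL with at most `concurrency` requests in flight.
// checkFn is an assumption: any async function that resolves to true/false
// for a single URL. Failures are recorded instead of aborting the whole run.
async function checkAll(urls, checkFn, concurrency = 5) {
  const results = new Array(urls.length);
  let next = 0; // index of the next URL to pick up

  async function worker() {
    while (next < urls.length) {
      const i = next++;
      try {
        results[i] = { url: urls[i], indexed: await checkFn(urls[i]) };
      } catch (err) {
        results[i] = { url: urls[i], indexed: null, error: String(err) };
      }
    }
  }

  // Start a fixed pool of workers; each pulls URLs until none are left.
  const poolSize = Math.min(concurrency, urls.length);
  await Promise.all(Array.from({ length: poolSize }, () => worker()));
  return results;
}
```

Results come back in the same order as the input list, and URLs whose check failed keep `indexed: null` so they can be retried in a later run.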
Quick FYI
Using the following method goes against Google’s Terms of Service,
as it automatically requests search queries from Google Search
Our script outperforms every other method available
How can you use Google index data?
• Identify inefficient use of crawl budget
• Error Prioritisation
• Identify holes in your architecture
Check for pages from your site that should be indexed but are not.
Find pages that should not be indexed but are indexed.
Detect pages that used to exist and now return an error (4xx) but are still indexed.
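The three checks above can be sketched as one small classifier. This assumes you have joined the index status with your own crawl data; the field names (`shouldBeIndexed`, `statusCode`, `indexed`) and the labels are hypothetical.

```javascript
// Label one URL according to the three checks on this slide.
// page.shouldBeIndexed: your own intent for the URL (true/false)
// page.statusCode:      HTTP status from your crawl
// page.indexed:         result of the index check
function classify(page) {
  const isClientError = page.statusCode >= 400 && page.statusCode < 500;
  if (page.indexed && isClientError) return 'error-still-indexed';
  if (page.shouldBeIndexed && !page.indexed) return 'should-be-indexed-but-is-not';
  if (!page.shouldBeIndexed && page.indexed) return 'should-not-be-indexed-but-is';
  return 'ok';
}

// Count how many URLs fall into each label.
function summarise(pages) {
  const counts = {};
  for (const page of pages) {
    const label = classify(page);
    counts[label] = (counts[label] || 0) + 1;
  }
  return counts;
}
```

The 4xx check runs first so that a dead page that is still indexed is reported as an error rather than as a generic unwanted URL.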
Use case #1
Sitemap Health Check
How many URLs from your XML sitemap are indexed?
Sitemaps = 111,772 URLs
• 200 Status Code – 81,688 (80% Indexed; chart values: 74,223 Indexed, 7,465 Not Indexed)
• 404 Status Code – 29,969 (21% Indexed; chart values: 6,268 Indexed, 23,701 Not Indexed)
• 301 Status Code – 365 (4% Indexed; chart values: 16 Indexed, 349 Not Indexed)
Charts: Google Index Status of 2xx, 4xx and 3xx URLs from Sitemap (Indexed vs Not Indexed)
Inspired by Data Secrets of the Index Coverage Report – AJ Kohn
Sitemap Health Check
Next Steps
1) Identify if these URLs are important to your site’s bottom line
2) Check if a pool of these URLs have issues on GSC’s
Index Coverage Report
3) Choose a tactic to improve the visibility of these URLs
4) Isolate the relevant URLs and modify the existing sitemap or create a
new-sitemap.xml to monitor progress
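The sitemap health check itself boils down to grouping URLs by status code class and computing the share that is indexed. A minimal sketch, assuming each row carries a `status` from your crawler and an `indexed` flag from the index check (both field names are hypothetical):

```javascript
// Percentage of indexed URLs, rounded to a whole number.
function pctIndexed(indexed, total) {
  return total === 0 ? 0 : Math.round((indexed / total) * 100);
}

// Group rows into 2xx / 3xx / 4xx buckets and report the indexed share.
function indexedShareByStatus(rows) {
  const groups = new Map();
  for (const { status, indexed } of rows) {
    const key = `${Math.floor(status / 100)}xx`;
    const group = groups.get(key) || { total: 0, indexed: 0 };
    group.total += 1;
    if (indexed) group.indexed += 1;
    groups.set(key, group);
  }
  const report = {};
  for (const [key, group] of groups) {
    report[key] = { ...group, pctIndexed: pctIndexed(group.indexed, group.total) };
  }
  return report;
}

// The 4xx and 3xx counts from the slides reproduce the quoted percentages:
console.log(pctIndexed(6268, 29969)); // 21
console.log(pctIndexed(16, 365)); // 4
```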
Use case #2
Log File Analysis Plus+
How many URLs with Googlebot hits are indexed?
• ~160k Googlebot hits to non-canonical URLs (/Uppercase/ vs /lowercase/)
• Identified if non-canonical URLs were indexed
• Identified if the referenced canonical URLs were indexed
Chart: Indexed Non-Canonical URLs Requested by Googlebot (Undisclosed Client): 35.8% Indexed, 64.2% Not Indexed
Log File Analysis+
Next Steps
1) Identify if the canonical tag is correctly placed
2) Identify if the root cause is internal linking, external linking or other
3) Consider redirecting non-canonical URLs to canonical URLs
4) Create a new-sitemap.xml with problematic URLs to encourage Googlebot to revisit those URLs and for monitoring purposes
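For this use case, the cross-check between log data and index status can be sketched as follows. It assumes, as in the /Uppercase/ vs /lowercase/ example, that the canonical version of a URL is simply its path in lowercase; the function names and the `indexedUrls` set are hypothetical inputs you would build from your logs and the index check.

```javascript
// Return the lowercase-path version of a URL, or null if the path is
// already lowercase (i.e. the URL is treated as canonical).
// Assumption: canonical = same URL with a lowercased path.
function canonicalOf(url) {
  const parsed = new URL(url);
  const lowerPath = parsed.pathname.toLowerCase();
  if (lowerPath === parsed.pathname) return null;
  parsed.pathname = lowerPath;
  return parsed.href;
}

// For every non-canonical URL hit by Googlebot, record whether it and its
// canonical counterpart are indexed.
function nonCanonicalReport(googlebotUrls, indexedUrls) {
  const rows = [];
  for (const url of new Set(googlebotUrls)) {
    const canonical = canonicalOf(url);
    if (canonical === null) continue;
    rows.push({
      url,
      canonical,
      urlIndexed: indexedUrls.has(url),
      canonicalIndexed: indexedUrls.has(canonical),
    });
  }
  return rows;
}
```

Rows where `urlIndexed` is true are the candidates for redirects, and the whole list can feed the monitoring sitemap from step 4.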
Other use cases
Inform your SEO strategy
• Check Real-time indexing (News sites, Offer sites, Job Boards)
• Check uncontrolled faceted navigation (Crawl budget optimisation)
• Check inactive product/category URLs (Site architecture improvements)
• Check old 4xx that are live now & haven't been deindexed yet (Recover organic opportunities)
Further reading
https://bit.ly/google-index-checks
https://bit.ly/gsc-index-coverage
The Google Index Checker script has opened a door
to get useful, actionable data at scale for your sites
Use it, and act on it.
Thank you.
builtvisible.com
Jose Luis Hernando
Senior Technical SEO Consultant
@jlhernando
Sources & additional reading
How does Google crawl the web – Annabelle Bouard & Dimitri Brunel (Botify)
English Google Webmaster Central office-hours hangout – Google Webmasters YouTube Channel
Google Only Can Judge Site Quality Based On Pages They Index – Barry Schwartz (Search Engine Roundtable)
Data Secrets of the Index Coverage Report – Blind Five Year Old (AJ Kohn)
How Google Search Works – Google Documentation
How Search organises information – Google Documentation
Our new search index: Caffeine – Carrie Grimes
When indexing goes wrong: how Google Search recovered from indexing issues & lessons learned since – Vincent Courson, Google Search Outreach
How Search Engines Work: Crawling, Indexing & Ranking – Moz
(Please) Stop Using Unsafe Characters in URLs – Jeff Starr

Checking Google Index Status at Scale using Node.js - Jose Hernando - BrightonSEO Oct 2020


Editor's Notes

  • #2: Technical SEO Consultant at Builtvisible. Builtvisible is a digital marketing agency focusing exclusively on organic performance. We are specialists in technical SEO, content strategy, digital PR and analytics, and we deal primarily with medium and large-scale sites targeting both national and global audiences online.
  • #4: If you’re not in Google’s index, you will not appear in Google’s SERPs. To appear in search results, Google has to discover, crawl, render and index your website’s pages. Only once you’re in the index are you eligible to appear in SERPs and acquire users through organic search. If you don’t know which pages are indexed, you don’t know which pages can acquire users organically.
  • #5: These are pages that you’ve probably spent lots of time customising to serve users. They will be evaluated in the same way as indexable low-quality pages: uncontrolled faceted navigation, UGC and non-canonicals.
  • #6: If you have an e-commerce site with uncontrolled faceted navigation, Googlebot has to download each of those pages (and their resources), crawling and rendering them to evaluate whether they contain valuable information for future user queries. Since this is not controlled, it can go on ad infinitum, wasting Google’s resources on URLs that are very likely not as valuable as others in your site architecture.
  • #7: A key step in the indexing pipeline is crawling. In order for Google to index your site, it needs to crawl it. But how much of your site is Googlebot crawling? According to a study from Botify using 270 sites with different architecture sizes, certainly not all of it. In this graph there are two important concepts: Crawl Ratio and Active Ratio (explain). If you are dealing with a site that has fewer than 10k URLs, Google is crawling on average 71% of your site and only 45% of its pages get organic clicks. As the size of a website increases, the rate at which Googlebot crawls it declines more and more, to the point where, if your site has more than 1M URLs, Googlebot crawls on average only 34% of your site and only 10% of its pages get clicks from organic search.
  • #8: Challenges: 1) Even if you are lucky enough to have access to your logs on a regular basis, Googlebot’s crawling behaviour doesn’t determine indexing status. You cannot guarantee that URLs which have not received clicks from Google Search are actually part of Google’s index.
  • #9: 2) If you don’t have access to server logs, you have even less data, and hence you rely on information that Google provides through: a) site: & inurl: operators: a rough estimate for site-wide numbers and often inaccurate information for individual URLs. b) Google Search Console reports: the URL Inspection Tool (great, but you hit the quota limit after 200 URLs, so automating it is a bit pointless) and the Coverage/Sitemap Coverage reports (great, but GSC only allows 1,000 rows of data per report).
  • #12: Download our Google Index Checker script from GitHub, developed by our Senior Developer Alvaro Fernandez. Download/update Node.js. The script relies on ScraperAPI to get information from Google Search; it is super easy to use and you can sign up for free to get the API key. Concurrent requests are limited to 5, the ScraperAPI Free Plan maximum, but Al has built a function to automatically adapt concurrency to the tier plan limit. Unlimited number of URLs. Perfect for clean URLs, but it can also process parameterised URLs, case-sensitive URLs, international encoded characters and reserved/unreserved symbols. Recycling feature. Nice overview of the index status check when it finishes.
  • #16–#18: Download your XML sitemap(s) using your preferred crawler (SF, DC, OC, SB), get your list of URLs, create a urls.csv file and add it to the Google Index Checker. Once it’s finished, you will get a CSV file with your results and you can find out how much of your sitemap is indexed. In this example I’ve taken argos.co.uk because it is a large e-commerce site with a mix of normal URLs and URLs with unsafe characters.
  • #20: We found ~160k non-canonical category pages with a significant number of Googlebot requests. The problem was that the non-canonical URLs contained an uppercase character which wasn’t supposed to be there. Firstly, we wanted to identify whether these pages were indexed. Secondly, we wanted to know whether the non-canonical URLs were being indexed instead of the canonicals. In the end, we found that approximately 36% of the non-canonical URLs were indexed instead of their canonicals.
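If you would rather not use a crawler for the urls.csv step mentioned in the sitemap notes, a rough sketch of turning sitemap XML into that file's contents is below. The exact format the Google Index Checker expects is an assumption here (one URL per line, no header row), and the regex approach only handles plain `<loc>` entries, not CDATA sections or sitemap index files.

```javascript
// Decode the five predefined XML entities that can appear inside <loc>.
function decodeXmlEntities(text) {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" is not decoded twice
}

// Quote a value for CSV only when it contains a comma, quote or newline.
function csvField(value) {
  return /[",\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Extract unique <loc> URLs from sitemap XML and return CSV text,
// one URL per line (assumed layout for urls.csv).
function sitemapToCsv(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  for (const match of xml.matchAll(locPattern)) {
    urls.push(decodeXmlEntities(match[1]));
  }
  return [...new Set(urls)].map(csvField).join('\n');
}
```

Write the returned string to urls.csv with whatever file API you prefer. Keeping parameterised URLs and unsafe characters intact matters here, since those are exactly the URLs the notes call out.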