GET INSIDE THE SEARCH ENGINE CRAWLER HEAD
Filip Podstavec
Filip Podstavec - Get inside the head of a crawler
Dr.
Supermodel gynecologist
PSYCHOLOGISTS
https://orig14.deviantart.net/9386/f/2015/100/e/8/mixels_cc__1_dizzy_robot_by_supercoco142-d8p7kph.png
Goal:
FASTEST
MOST ACCESSIBLE
RICHEST
Why?
Benefits
Higher crawl rate
More indexed pages
Make website faster
Cleaner architecture
Fixed error pages
[Chart: Organic traffic boost, January 2016 to January 2017, with the start of crawl budget optimization marked]
Higher crawl rate ≠ Direct ranking signal
How?
Make diagnosis
Medical History
12.6.2010 | Filip Podstavec | sick | height: 183cm | 120/60
17.9.2010 | Filip Podstavec | sick (again) | height: 184cm | 120/60
06.4.2014 | Filip Podstavec | diarrhea | height: 184cm | 120/60
Filip Podstavec | aneurysm |
0
155.62.100.122 - - [25/Aug/2017:08:22:55 -0400] "GET /category/mktfest/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
IP (WHO): 155.62.100.122
Timestamp (WHEN): [25/Aug/2017:08:22:55 -0400]
Method and requested URL (WHAT): GET /category/mktfest/ HTTP/1.1
Status code (SUCCESSFULLY OR NOT): 200
User-agent (DETAILS): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
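The five fields above can be pulled out of each log line with a regular expression. What follows is a minimal sketch, assuming the access log uses the layout of the sample line on the slide; the names LOG_PATTERN, SAMPLE_LINE and parse_line are illustrative and not from the talk, and the optional response-size field is an assumption added because many server configurations log it.

```python
import re
from datetime import datetime

# Pattern for the log layout shown on the slide. The response size after the
# status code is optional here because the slide's sample line omits it.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})'
    r'(?: (?P<size>\d+|-))?'
    r' "(?P<referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"'
)

SAMPLE_LINE = (
    '155.62.100.122 - - [25/Aug/2017:08:22:55 -0400] '
    '"GET /category/mktfest/ HTTP/1.1" 200 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)


def parse_line(line):
    """Return a dict with the WHO/WHEN/WHAT/status/agent fields, or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return record


record = parse_line(SAMPLE_LINE)
print(record["ip"], record["url"], record["status"])
```

Lines that do not fit the pattern return None, so they can be counted and inspected separately instead of being silently dropped.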
“The biggest source of website URLs”
Web Logs
The most accurate source about crawl budget
What is the crawl budget?
Crawl budget is...
How does Googlebot allocate the crawl budget for your website?
How much time it allocates: Authority * Amount of acquired new info
How you use that time: Speed
How you take care of future crawling: Internal linking
“People choose the paths that grant them the greatest rewards for the least amount of effort.”
Dr. House / David Shore
“Crawlers choose the paths that grant them the greatest rewards for the least amount of effort.”
Filip Podstavec
Do I have logs?
Yes, but I don’t know about that
No
Yes
mktfest_com.log (146.55 GB)
podstavec_cz.log (11.07 GB)
Tools for log analysis
Static vs. Real-time
Static
< 50 MB (Small Files)
50 MB - 10 GB (Medium Files)
Google BigQuery, OpenRefine, Screaming Frog Log Analyzer
> 10 GB (Large Files)
Google BigQuery
Real-time
Real-time logs
= 20 minutes
= 20$ / m
+
Server
Bit.ly/mktfestfilip
What to check in logs:
(diagnosis)
#1 Which search engine robots crawl my website, and how often?
Bot requests in last 7 days
[Chart legend: Googlebot, Bingbot, Slurp, SeznamBot, Other]
#1 Bot requests
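A per-bot request count like the one charted here can be approximated by matching user-agent strings. This is a rough sketch: the substring tokens are assumptions based on the bot names in the chart legend, the sample user agents are made up for illustration, and "Other" here simply means "none of the four tokens matched".

```python
from collections import Counter

# Lower-case substrings assumed to identify each bot's user agent.
BOT_TOKENS = {
    "Googlebot": "googlebot",
    "Bingbot": "bingbot",
    "Slurp": "slurp",
    "SeznamBot": "seznambot",
}


def classify_bot(user_agent):
    """Map a user-agent string to one of the chart's buckets."""
    ua = user_agent.lower()
    for name, token in BOT_TOKENS.items():
        if token in ua:
            return name
    return "Other"


def count_bot_requests(user_agents):
    """Count requests per bucket for an iterable of user-agent strings."""
    return Counter(classify_bot(ua) for ua in user_agents)


# Hypothetical user agents, one per logged request.
sample_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Googlebot-Image/1.0",
    "Mozilla/5.0 (compatible; bingbot/2.0)",
    "Mozilla/5.0 (compatible; SeznamBot/3.2)",
    "curl/7.58.0",
]
print(count_bot_requests(sample_agents))
```

User agents are easy to fake, so these counts are a first look rather than proof of who is crawling.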
DESKTOP : MOBILE : IMAGE
Desktop: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mobile: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Image: Googlebot-Image/1.0
#2 Googlebot agents
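To split Googlebot hits into the desktop, mobile and image agents, a simple check on the user-agent string may be enough. The sketch below relies only on the three user agents shown on the slide; the function name googlebot_type and the "Mobile" substring rule are assumptions for illustration.

```python
def googlebot_type(user_agent):
    """Return 'image', 'mobile', 'desktop', or None for non-Googlebot agents."""
    if "Googlebot-Image" in user_agent:
        return "image"
    if "Googlebot" not in user_agent:
        return None
    # The mobile agent on the slide carries a "Mobile Safari" token.
    if "Mobile" in user_agent:
        return "mobile"
    return "desktop"


DESKTOP_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
IMAGE_UA = "Googlebot-Image/1.0"

for ua in (DESKTOP_UA, MOBILE_UA, IMAGE_UA):
    print(googlebot_type(ua))
```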
#3 Which pages do they visit with the highest frequency?
Most visited URLs by Googlebot
#3 Most visited URLs
Most crawled URL?
.txt
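A "most visited URLs" ranking can be built by counting requested URLs per bot. This sketch assumes the log lines were already parsed into dicts with "url" and "user_agent" keys; those key names and the sample records are hypothetical.

```python
from collections import Counter


def most_crawled_urls(records, bot_token="Googlebot", top_n=10):
    """Return the top_n (url, hits) pairs for requests whose agent contains bot_token."""
    hits = Counter(
        record["url"] for record in records if bot_token in record["user_agent"]
    )
    return hits.most_common(top_n)


# Made-up records, only to show the shape of the input and output.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_records = [
    {"url": "/robots.txt", "user_agent": GOOGLEBOT},
    {"url": "/robots.txt", "user_agent": GOOGLEBOT},
    {"url": "/robots.txt", "user_agent": GOOGLEBOT},
    {"url": "/category/mktfest/", "user_agent": GOOGLEBOT},
    {"url": "/category/mktfest/", "user_agent": GOOGLEBOT},
    {"url": "/", "user_agent": GOOGLEBOT},
    {"url": "/", "user_agent": "curl/7.58.0"},
]
print(most_crawled_urls(sample_records, top_n=2))
```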
#4 Which status codes does Googlebot get, and how many of each?
#4 Pie chart of Googlebot status codes
200
404
301
302
500
Where/when/why do they request error pages?
Googlebot hits with status code higher than 399
#5 Googlebot errors
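Both the status-code pie chart and the error list come from the same status field. A small sketch, assuming the records were already filtered to Googlebot and parsed into dicts with "url" and "status" keys (the key names and sample data are hypothetical):

```python
from collections import Counter


def status_shares(records):
    """Return each status code's share of all requests (0.0 to 1.0)."""
    counts = Counter(record["status"] for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: count / total for status, count in counts.items()}


def error_hits(records):
    """Return the records with a status code higher than 399."""
    return [record for record in records if record["status"] > 399]


# Made-up Googlebot records for illustration.
sample_records = [
    {"url": "/", "status": 200},
    {"url": "/category/mktfest/", "status": 200},
    {"url": "/old-page/", "status": 301},
    {"url": "/missing/", "status": 404},
]
print(status_shares(sample_records))
print(error_hits(sample_records))
```

Grouping the error hits by URL or by hour is then a matter of swapping the key used in the counter.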
#6 User errors
Is somebody crawling my website?
61.8% of your traffic is bots!
#7 IP Requests
#8 Googlebot IPs
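Counting requests per IP shows who hits the site most, and a DNS check can tell real Googlebot IPs from impostors. The check sketched below is the reverse-then-forward DNS lookup that Google documents for verifying Googlebot; it is an assumption that this is what the slide refers to. The lookup parameters are hypothetical hooks so the logic can be tried without network access.

```python
import socket
from collections import Counter


def requests_per_ip(records):
    """Count requests per client IP; records are dicts with an 'ip' key."""
    return Counter(record["ip"] for record in records)


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-resolve back."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = socket.gethostbyname
    try:
        host = reverse_lookup(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # The hostname must resolve back to the same IP address.
        return forward_lookup(host) == ip
    except OSError:
        # Lookup failures (socket.herror, socket.gaierror) count as unverified.
        return False


# Offline demonstration with stand-in resolvers (values are illustrative).
fake_reverse = lambda addr: "crawl-66-249-66-1.googlebot.com"
fake_forward = lambda host: "66.249.66.1"
print(is_verified_googlebot("66.249.66.1", fake_reverse, fake_forward))
```

Called without the two lookup arguments, the function performs real DNS queries, which is slow for large logs; caching results per IP is a sensible next step.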
What can you do?
(content point of view)
Content
is
What can you do?
(technical point of view)
503 Service Unavailable
“Be in touch with somebody responsible for infrastructure!”
Filip Podstavec
Communicate with your sysadmin
https://twitter.com/dimensionmedia/status/877513185238151168
DISALLOW
“Not all of your URLs should be crawled or indexed”
Links
URL1 URL2
URL3
URL4
URL5
<a href="a/">
/a/ /a/a/ /a/a/a/ /a/a/a/a/
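The ever-deeper paths above are what a crawler gets when it resolves a relative href against each new page URL. A short sketch with the standard library, assuming every generated URL returns a page that contains the same relative link; the host www.example.com and the function name are placeholders.

```python
from urllib.parse import urljoin


def follow_relative_link(start_url, href, depth):
    """Resolve the same href repeatedly, as a crawler following it would."""
    urls = []
    current = start_url
    for _ in range(depth):
        current = urljoin(current, href)
        urls.append(current)
    return urls


# Relative href: every hop nests one level deeper.
print(follow_relative_link("http://www.example.com/", "a/", 4))

# Root-relative href: every hop resolves to the same URL.
print(follow_relative_link("http://www.example.com/", "/a/", 4))
```

The leading slash is the only difference between the two calls, which is why this kind of crawl trap is easy to introduce by accident.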
Sanitize the pagination
Rel="prev"
<link rel="prev" href="http://www.example.com/page1" />
Rel="next"
<link rel="next" href="http://www.example.com/page3" />
Noindex,follow
<meta name="robots" content="noindex, follow">
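The tags above could be generated per page in a template helper. This is only a sketch: the function name, the URL pattern, and the choice to make "noindex, follow" an opt-in flag for pages after the first are assumptions, since the slides list the three techniques without saying how they are combined.

```python
def pagination_tags(page, last_page,
                    url_pattern="http://www.example.com/page{n}",
                    noindex_after_first=False):
    """Return the <head> tags for one page of a paginated series."""
    tags = []
    if page > 1:
        prev_url = url_pattern.format(n=page - 1)
        tags.append(f'<link rel="prev" href="{prev_url}" />')
    if page < last_page:
        next_url = url_pattern.format(n=page + 1)
        tags.append(f'<link rel="next" href="{next_url}" />')
    if noindex_after_first and page > 1:
        tags.append('<meta name="robots" content="noindex, follow">')
    return tags


# Page 2 of 3 reproduces the two link tags from the slide.
for tag in pagination_tags(2, 3):
    print(tag)
```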
Sanitize the filter combinations
http://edition.cnn.com/videos/tech/2016/08/26/black-hole-breakthrough-lee-pkg.cnn
10 filters, 25 variants each
95 367 431 640 625 variants
3.8 Years
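The variant count follows from simple arithmetic: with 10 filters and 25 variants per filter, the number of distinct filter URLs is 25 to the power of 10. A quick check, assuming exactly one variant is selected in every filter (the function name is illustrative):

```python
def filter_combinations(filters, variants_per_filter):
    """Number of URLs when one variant is picked in each filter."""
    return variants_per_filter ** filters


total = filter_combinations(10, 25)
print(f"{total:,}")  # 95,367,431,640,625
```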
How to fix that?
Create rules like:
Disallow filters without search volume (price, etc.)
Example: Eshop.tld/notebooks/price-0-1000/
Robots.txt block: Disallow: */price
Disallow more than one filter used from the same segment
Example: Eshop.tld/notebook/acer,apple,lenovo/
Robots.txt block: Disallow: /*,*,*
Delete filters that users do not use
Use a pseudo checkbox
Example: <input><label><a>Select link</a></label>
Example URL: https://notebooky.heureka.cz/
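Before deploying wildcard rules like these, it may help to test them against real paths from the logs. The sketch below is a simplified matcher for the `*` and `$` wildcards that Google's robots.txt handling supports; it ignores Allow rules, rule precedence and user-agent groups, and the function names are illustrative. Python's built-in urllib.robotparser is not used because it does not interpret these wildcards.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, disallow_patterns):
    """True if the path matches any Disallow pattern from its beginning."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)


rules = ["*/price", "/*,*,*"]
for path in ("/notebooks/price-0-1000/",
             "/notebook/acer,apple,lenovo/",
             "/notebook/acer,apple/",
             "/notebook/acer/"):
    print(path, is_disallowed(path, rules))
```

As written, `/*,*,*` needs at least two commas in the path, so in this sketch a URL with exactly two comma-separated values is not matched.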
Avoid thin content crawling
Disallow: /directory/
<meta name="robots" content="noindex">
Sometimes speed matters
Google PageSpeed Insights
Google Lighthouse
https://varvy.com/
Keep your sitemap clean
Are you happy?
Thank you!
Filip Podstavec
THE MAIN CONSTRUCTOR OF MARKETING MINER
FILIP@MARKETINGMINER.COM
