What I learned from analysing thousands of robots.txt files | BrightonSEO 2020

What I learned from analysing thousands of robots.txt files
samgipson
# brightonSEO
2020

Of all things...why robots.txt?

2nd July 2019.

Google Webmasters blog post

Robots Exclusion Checker

How many top performing sites still use unsupported or incorrect rules?

What are the most common mistakes within robots.txt?

Robots.txt: The history

Based on the Robots Exclusion Protocol (REP)

Millions of sites use a robots.txt file

Despite not being an official internet standard

Control the content crawlers can and can’t access

It’s hugely powerful.
Mistakes can cost you big.

Did you guess the year?

In 2019 Google submitted a revised REP draft to try to make it an official standard
FACT

Robots.txt: The basics

User-agent: *
Disallow: /checkout/

User-agent: *
{field}

User-agent: googlebot

{value}

User-agent: *
Disallow: /checkout/

User-agent: *
{directive or rule}

User-agent: *
{path}

User-agent: *
{group}

User-agent: *
Disallow: /basket/
{group A}
{group B}
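
The anatomy above (field, value, directive, path, group) can be sketched in a few lines of Python. This is a minimal illustration, not how any particular crawler parses the file; the function name `parse_groups` and the dictionary layout are assumptions made for this example.

```python
def parse_groups(text):
    """Split robots.txt text into groups of user-agents and their rules."""
    groups = []
    current = None
    expecting_agents = False  # True while reading consecutive User-agent lines
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()  # field names are case insensitive
        if field == "user-agent":
            # A User-agent line after a rule starts a new group.
            if current is None or not expecting_agents:
                current = {"agents": [], "rules": []}
                groups.append(current)
                expecting_agents = True
            current["agents"].append(value.lower())
        elif field in ("allow", "disallow") and current is not None:
            current["rules"].append((field, value))
            expecting_agents = False
    return groups


sample = """User-agent: *
Disallow: /checkout/

User-agent: googlebot
User-agent: bingbot
Disallow: /basket/
"""
groups = parse_groups(sample)
for group in groups:
    print(group["agents"], group["rules"])
```

Fields other than `user-agent`, `allow` and `disallow` are simply skipped here, which is a simplification chosen to keep the sketch short.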

robots.txt controls crawling, not indexation
FACT

Here’s where it got confusing...

Google used to support unofficial directives

HTML <head>
<meta name="robots" content="noindex, nofollow">

HTTP header
X-Robots-Tag: googlebot: noindex, nofollow

robots.txt
Noindex: /checkout/
Nofollow: /checkout/

Webmasters / SEOs realised that noindex: worked

My analysis

STEP ONE

Identified top traffic driving sites across a range of sectors

Automotive
Computing
Cooking/Recipes
Electronics
Fashion
Gambling
Hardware

Health/Medical
Insurance
Jobs
News
Real Estate
Telecoms
Travel

STEP TWO

Extracted the robots.txt files for 40,000 unique domains

STEP THREE

Noindex:
Nofollow:
Crawl-delay:

<field>
<value>
<directive>
<path>
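
A check along these lines could be sketched as follows. The talk does not say how the files were actually processed, so this is only a hypothetical illustration: the function name `find_unsupported_rules` and the sample file are assumptions, and the set of flagged fields is taken from the three listed above.

```python
# Fields the talk lists as unsupported in robots.txt.
UNSUPPORTED_FIELDS = {"noindex", "nofollow", "crawl-delay"}


def find_unsupported_rules(text):
    """Return (line_number, field) pairs for unsupported fields in robots.txt text."""
    hits = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if ":" not in line:
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field in UNSUPPORTED_FIELDS:
            hits.append((number, field))
    return hits


# Hypothetical file contents used only for this example.
sample = """User-agent: *
Disallow: /checkout/
Noindex: /checkout/
Crawl-delay: 10
"""
print(find_unsupported_rules(sample))
```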

Results: Unsupported rules

Out of the 40,000 sites analysed, 0.5% used unsupported rules

Nofollow:
1 Gambling
40,000 domains analysed

Crawl-delay:
2,600

Crawl-delay:
Real Estate
Hardware/DIY
Fashion
2,600

Noindex:
220

Noindex:
220
Retail
Finance
Jobs
Health

Brands using outdated rules

Results: Basic Mistakes

Issue 1
<field> name spelt incorrectly

<field> name is case insensitive
FACT

This is ok:
User-Agent
user-agent
USER-AGENT
UsEr-AgEnt

This ISN’T:
useragent
user agent
er-agent
ser-agent
user-agnet
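
A simple way to catch misspellings like these is to compare each field name, case-insensitively, against a list of known fields and suggest the closest match. The sketch below uses Python's standard `difflib` module; the `KNOWN_FIELDS` list and the function name `check_field` are assumptions for illustration, not a definitive list of what any crawler accepts.

```python
import difflib

# Assumed list of accepted field names for this example.
KNOWN_FIELDS = ["user-agent", "allow", "disallow", "sitemap"]


def check_field(field):
    """Return (is_valid, suggestion) for a robots.txt field name."""
    normalised = field.strip().lower()  # case does not matter
    if normalised in KNOWN_FIELDS:
        return True, None
    # Suggest the most similar known field, if any is close enough.
    matches = difflib.get_close_matches(normalised, KNOWN_FIELDS, n=1)
    return False, (matches[0] if matches else None)


for name in ["UsEr-AgEnt", "useragent", "user-agnet", "dissalow"]:
    print(name, check_field(name))
```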

<field> name errors
30
Telecoms

Issue 2
Incorrect user-agent <value>

User-agent <value> is case insensitive
FACT

This is ok:
Googlebot
googlebot
GOOGLEBOT
Bingbot
bingbot

This is a grey area:
Googlebotrandomtext
Google bot
goglebot
Google

Issue 3
Incorrect directives

<directives> are case insensitive
FACT

This is ok:
allow:
ALLOW:
Allow:
disallow:
DISALLOW:
Disallow:

This ISN’T:
dissalow:
dissallow:
disallo:
disalow:
allw:

<directive> errors
18
All

Issue 4
Invalid <path> format

URL <path> should start with a /
FACT

This is ok:
Disallow: /*?delivery_type
Disallow: *?delivery_type

This ISN’T:
Disallow: .js
Disallow: .css
Disallow: WebResource.axd
Disallow: ScriptResource.axd
Disallow: js/
Disallow: http://site.com/page
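
The OK and not-OK examples above suggest a very small validity check: a path value should begin with `/`, or with the `*` wildcard as in the second OK example. The sketch below assumes that rule; treating an empty value as valid is an additional assumption of this example, and `is_valid_path` is a hypothetical helper name.

```python
def is_valid_path(path):
    """Rough check that a robots.txt path value has a usable format."""
    path = path.strip()
    # Assumption: an empty value (e.g. "Disallow:") is accepted as valid.
    if path == "":
        return True
    # Paths are relative to the site root, optionally starting with a wildcard.
    return path.startswith(("/", "*"))


for value in ["/*?delivery_type", "*?delivery_type", ".js", "js/", "http://site.com/page"]:
    print(value, is_valid_path(value))
```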

Incorrect <path>
231
Equal spread across sectors

Brands using incorrect <path>

URL <path> IS case sensitive
FACT

Additional takeaways

A specific user-agent overrules a catch-all
FACT

User-agent: *

User-agent: *
Disallow: /another-folder/
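
The "specific overrules catch-all" behaviour can be sketched as a group lookup: use the group that names the crawler if one exists, otherwise fall back to the `*` group. This is a simplified illustration with exact, case-insensitive name matching; `select_group`, the group layout and the `/folder/` path are assumptions for this example, and real crawlers may match names more loosely or merge several groups.

```python
def select_group(groups, crawler):
    """Pick the group a crawler should obey: its own group, else the catch-all."""
    crawler = crawler.lower()
    catch_all = None
    for group in groups:
        if crawler in group["agents"]:
            return group  # a specific group wins outright
        if "*" in group["agents"] and catch_all is None:
            catch_all = group
    return catch_all


# Hypothetical groups; only /another-folder/ comes from the slide above.
example_groups = [
    {"agents": ["*"], "rules": [("disallow", "/folder/")]},
    {"agents": ["googlebot"], "rules": [("disallow", "/another-folder/")]},
]
print(select_group(example_groups, "Googlebot")["rules"])
print(select_group(example_groups, "bingbot")["rules"])
```

In this sketch the crawler named in a specific group ignores the catch-all rules entirely, so anything that should apply to it has to be repeated in its own group.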

The order of <directives> doesn’t matter for most bots
FACT

Specificity (length) of the matching rule wins
FACT

https://example.com/page
disallow: /
allow: /p

Conflict?
Least restrictive WINS
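
The two rules above (longest matching rule wins; on a tie, the least restrictive wins) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `is_allowed` and `pattern_to_regex` are hypothetical names, specificity is measured as the raw length of the path pattern, and only the `*` and trailing `$` wildcards are handled.

```python
import re


def pattern_to_regex(path_pattern):
    """Translate a robots.txt path pattern (with * and trailing $) to a regex."""
    anchored = path_pattern.endswith("$")
    core = path_pattern[:-1] if anchored else path_pattern
    body = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(url_path, rules):
    """Apply (directive, path) rules: longest match wins, allow wins ties."""
    best_length = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, pattern in rules:
        if not pattern or pattern_to_regex(pattern).match(url_path) is None:
            continue
        is_allow = directive.lower() == "allow"
        longer = len(pattern) > best_length
        tie_for_allow = len(pattern) == best_length and is_allow
        if longer or tie_for_allow:
            best_length = len(pattern)
            allowed = is_allow
    return allowed


rules = [("disallow", "/"), ("allow", "/p")]
print(is_allowed("/page", rules))   # "/p" is longer than "/", so allow wins
print(is_allowed("/other", rules))  # only "/" matches, so disallow applies
```

The order of the rules in the list does not affect the result, which matches the earlier point that directive order doesn't matter for most bots.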

You can group user-agents together
FACT

User-agent: bingbot

Summary

Google are pushing for REP to become an Internet standard

We should all be pushing for a best practice robots.txt

Avoid Google having to make allowances for inaccuracies

Who knows… they may suddenly stop

Get the basics right. Many big brands aren’t.

Dig deeper.

Nail it.

Further reading/resources

Tools
Chrome Extension: Robots Exclusion Checker
samgipson.com/robots/
Ayima: Robots.txt Parser
ayima.com/robots/
Google’s Webmaster Robots.txt Testing Tool
google.com/webmasters/tools/robots-testing-tool
Google’s C++ robots.txt parser and matcher
github.com/google/robotstxt

Articles
ContentKing: Robots.txt for SEO
contentkingapp.com/academy/robotstxt/
Builtvisible: An SEO Guide to Robots.txt
builtvisible.com/wildcards-in-robots-txt/
Original Robots.txt Draft (1996)
robotstxt.org/norobots-rfc.txt
Google’s Robot Exclusion Protocol Draft (2019)
ietf.org/archive/id/draft-rep-wg-topic-00.txt

Thank you.
@samgipson
samgipson
samgipson.com
