Prepare for walls of text.
Server Logs
After Excel Fails
@ohgm
About Me
• Former Senior Technical Consultant @ builtvisible.
• Now Freelance Technical SEO Consultant.
• @ohgm on Twitter.
• ohgm.co.uk for my webzone.
What I’d like to do today
1. Talk about access logs.
2. Show you some command line tools.
3. Show you some ways to apply these tools to
common scenarios.
4. Sit back down.
This talk is on the first significant difficulty spike in
server log analysis – having too much information.
Assumptions
1. Your client is retaining their logs.
2. You don’t have access to your client’s
server.
What is an Access.log?
ohgm.co.uk 162.158.93.95 - - [11/Apr/2016:10:14:20 +0100] "GET /wmt-crawl-representative-url-transfer-link-equity/ HTTP/1.1" 200 7976 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)" 162.158.93.95
ohgm.co.uk 108.162.219.171 - - [11/Apr/2016:10:15:07 +0100] "GET /feed/ HTTP/1.1" 200 136953 "-" "Flamingo_SearchEngine (+http://www.flamingosearch.com/bot)" 108.162.219.171
ohgm.co.uk 108.162.219.176 - - [11/Apr/2016:10:22:54 +0100] "GET /wayback-machine-seo HTTP/1.1" 200 9079 "http://www.traackr.com/" "Traackr.com" 108.162.219.176
ohgm.co.uk 173.245.55.114 - - [11/Apr/2016:10:23:35 +0100] "GET /author/ohgm/ HTTP/1.1" 301 20 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 173.245.55.114
ohgm.co.uk 173.245.55.123 - - [11/Apr/2016:10:23:42 +0100] "GET / HTTP/1.1" 200 6812 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 173.245.55.123
ohgm.co.uk[1] 173.245.55.123[2] - - [11/Apr/2016:10:23:42 +0100][3] "GET[4] /please[5] HTTP/1.1[6]" 200[7] 6812[8] "-"[9] "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"[10] 173.245.55.123[11]
1. The host responding to the request.
2. The IP that serviced the request.
3. The date and time of the request.
4. The HTTP method: GET, POST, PUT, HEAD, or DELETE.
5. The resource requested.
6. The HTTP version {HTTP/1.0|HTTP/1.1|HTTP/2}.
7. The server response.
8. The download size.
9. The referring URL.
10. The reported User-Agent.
11. The IP that made the request.
Configurations vary substantially.
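For reference, the format above resembles Apache's stock 'combined' LogFormat with a virtual host prefix and a trailing client IP added – an assumption, so check the server's own config. The stock definition looks like this:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined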
Why SEOs Like Them.
There is a lack of overlap between server logs and
crawl simulation tools.
Access logs show what’s being accessed rather than
what’s simply accessible.
We find correlation between crawl efficiency
improvements and organic performance. Access logs
are one of the best tools for identifying crawl waste.
Why ‘Excel Fails’?
Microsoft Excel currently supports
1,048,576 rows of data.
There are no plans to increase this.
Agency Scenario
Your manager has sold a Server Log Analysis
project, requesting 1 month of access logs
from the client, a UK high street retailer.
You receive 15 access_log.gz files, totalling
17.6GB. Excel won’t open any of them. You
don’t know it yet, but they are unfiltered.
Good Luck.
We also load balance on 6 servers.
“Just use a sample.”
“How do I even get a sample?”
Command Line Tools
Advantages of Command Line Tools.
• They’re fast.
• They’re not in the cloud.
• The main limit is you, not a development queue.
Disadvantages of Command Line Tools.
• They’re scary at first.
• You can delete your computer.
• Don’t delete your computer.
Installation
If you’re on Mac, you’re ready.
If you’re on Linux, you’re ready.
If you’re on Windows, you probably aren’t ready*.
*Unless ‘Ubuntu on Windows’ becomes part of the non-developer release.
1. Windows Update > Update Settings > Advanced > Get Insider Preview Builds.
2. Install Build 14316 or greater.
3. Enable ‘Windows Subsystem for Linux (Beta)’.
4. Open cmd and type ‘bash’.
5. Type ‘y’ and hit enter at the prompt.
1. Windows Update > Update Settings > Advanced > Get Insider Preview Builds.
2. Install Build 14316 or greater.
3. Enable ‘Windows Subsystem for Linux (Beta)’.
4. Open cmd and type ‘bash’.
5. Type ‘y’ and hit enter at the prompt.
No thanks.
Install GNU ON WINDOWS (or Cygwin) instead.
Installation Done
Getting Started
• Navigate to the folder containing the downloaded files.
• Open your chosen terminal (cmd, terminal, or bash).
CTRL+SHIFT+Right-click inside a folder is an alternative method.
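For example, assuming the logs were saved to a logs folder inside Downloads (the path is illustrative):
~$ cd ~/Downloads/logs
~$ ls
“Move into the folder, then list its contents to check the files are all there.”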
~$ type-commands-here
Then hit enter.
~$ echo hello.
hello.
The Title of The Talk Was a Lie and We’re Going to Try to Use Excel Anyway. Sorry.
Sorry about the walls of text.
Server Logs
Until Excel Fails
@ohgm
Combining Files.
Combine Multiple Log Files
We navigate to a folder containing all our server logs,
open the terminal, and type:
~$ cat *.log >> combined.log
“Take every .log file in the folder and append each to
combined.log”
But they gave me files in lots of
different folders.
Combine Multiple Files in Multiple Folders
~$ find . -name '*.log' -exec cat {} + >> combined.log
“Search the current folder and all subfolders for filenames ending in ‘.log’. Append the contents of these files to a new file called combined.log.”
(Use gfind in GOW.)
They’re compressed.
Multiple times.
Combine Multiple Files in Multiple Folders, Some of Which Are Compressed
~$ find . -name '*.gz' -exec gzip -dk {} + && find . -name '*.log' -exec cat {} + >> combined.log
“Find all the files with the .gz extension beneath the current folder. Decompress them all, keeping the originals. Once finished, find all the .log files and append them to a new combined.log file.”
Preview Huge Files with less
less streams the contents of a file to the terminal
without loading the whole file into memory.
~$ less combined.log
You can use less to review access logs without
crippling your machine.
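Most systems with gzip also ship zless, which previews compressed files the same way:
~$ zless access_log.gz
“Stream a gzipped log to the terminal without decompressing it first.”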
RTFM (Read The Friendly Manual)
If at any time you get stuck:
~$ toolname --help
or
~$ man toolname
or
Google what you are trying to do.
The --help (often -h) flag will usually give you what you need to know.
‘man’ (manual) tends to be much more in depth. Both are read from the
command line.
We now have
one large file.
UA Filtering: Googlebot
Our combined access logs are in a single file:
combined.log – 16.4GB
Too large to open in Excel.
Too large to open in Notepad.
Examining it with less?
It’s too full of filthy human data.
We need to cut it down to Googlebot.
grep
grep is a tool that extracts lines from text files based on a regular expression. Using grep is pretty simple:
~$ grep [options] [pattern] [file]
…
~$ grep 'Googlebot' combined.log
“Give me all the lines containing Googlebot in combined.log”
Press Enter.
We forgot to store it somewhere.
Filtering Scenario: Googlebot
So we’ll store this output to a new file using ‘>>’:
~$ grep 'Googlebot' combined.log >> googlebot.log
“Append all lines in combined.log that contain
Googlebot into a new file, googlebot.log”
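Note the difference between ‘>’ and ‘>>’: a single ‘>’ overwrites the target file, while ‘>>’ appends to it. If you re-run a filter, overwrite rather than stack duplicates:
~$ grep 'Googlebot' combined.log > googlebot.log
“Replace the contents of googlebot.log with the matching lines.”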
grep
Like other tools, grep has a number of optional argument flags. The count flag ‘-c’ can provide a useful summary for direct questions:
~$ grep [options] [pattern] [file]
…
~$ grep -c "POST /wp-login" april.log
“Show me the count of login attempts on ohgm.co.uk in April.”
Filtering Scenario – Googlebot
You can’t just verify Googlebot by name.
Apparently some people aren’t honest on the internet.
IP Filtering
Filtering Scenario – IP Ranges
Start End
64.233.160.0 64.233.191.255
66.102.0.0 66.102.15.255
66.249.64.0 66.249.95.255
72.14.192.0 72.14.255.255
74.125.0.0 74.125.255.255
209.85.128.0 209.85.255.255
216.239.32.0 216.239.63.255
If we were masochistic, we could write a regular expression to capture all of these…
Filtering Scenario – IP Ranges
The -E flag lets grep use Extended Regular Expressions.
~$ grep -E "((\b(64)\.233\.(1([6-8][0-9]|9[0-1])))|(\b(66)\.102\.([0-9]|1[0-5]))|(\b(66)\.249\.(6[4-9]|[7-8][0-9]|9[0-5]))|(\b(72)\.14\.(1(9[2-9])|2([0-4][0-9]|5[0-5])))|(\b(74)\.125\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(209\.85\.(1(2[8-9]|[3-9][0-9])|2([0-4][0-9]|5[0-5])))|(216\.239\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" GbotUA.log > GbotIP.log
This shouldn’t work, but it does*.
*WOMM – Works On My Machine.
Filtering Scenario – Impostors
The -v flag inverts the grep query to find impostors:
~$ grep -vE "((\b(64)\.233\.(1([6-8][0-9]|9[0-1])))|(\b(66)\.102\.([0-9]|1[0-5]))|(\b(66)\.249\.(6[4-9]|[7-8][0-9]|9[0-5]))|(\b(72)\.14\.(1(9[2-9])|2([0-4][0-9]|5[0-5])))|(\b(74)\.125\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(209\.85\.(1(2[8-9]|[3-9][0-9])|2([0-4][0-9]|5[0-5])))|(216\.239\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" GbotUA.log > Impostors.log
“Give me every request that claims to be Googlebot but doesn’t come from these IP ranges. Put them in an impostors file.”
Filtering Scenario – Verifying Googlebot
• Disclaimer: don’t blindly use awful regex (check
with Regexr) or IP ranges, especially if you’re
analysing logs for a site using IP detection for
international SEO purposes. Read more about
Googlebot’s Geo-distributed Crawling here first.
• Use the correct reverse DNS > forward DNS lookup
when it’s important to be right. This can be
automated.
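A sketch of that two-step check with host, using the example IP and hostname from Google’s own documentation:
~$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
~$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
“The reverse lookup must end in googlebot.com or google.com, and the forward lookup on that hostname must return the original IP.”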
Filtering Scenario – IP Ranges
Anyone cloaking today
will have a good list.
You might find them at the bar.
“I Just Want A Sample.”
The file is still too big.
I Want A Sample
The sort and split utilities do what you’d expect:
~$ sort -R combined.log | split -l 1048576
“Randomly sort the lines in the combined.log. split
the output of this command into multiple files (up to)
1048576 lines long.”
shuf is easier, but isn’t available by default on OSX/GOW.
A pipe ‘|’ takes the output of one
command as the input of another.
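Where shuf is available (GNU coreutils), one command pulls a random sample directly:
~$ shuf -n 1000000 combined.log > sample.log
“Save one million randomly chosen lines from combined.log to sample.log.”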
“I Just Want it in Excel.”
I Just Want it in Excel.
Use wc to check it has fewer than 1,048,576 rows.
~$ wc -l sample.log
“Count the number of lines in sample.log.”
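If the count comes back too high, head can trim the sample to an Excel-safe size (leaving a row spare for headers):
~$ head -n 1048575 sample.log > excel-safe.log
“Keep only the first 1,048,575 lines of sample.log.”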
The Title of The Talk Wasn’t a Lie And We Aren’t Going to Use Excel And Are Going to Answer Questions Just Using The Command Line I Hope That’s OK. Sorry.
Asking Useful Questions
Formulating Questions
Start with a basic hypothesis. Decide what needs to be done if it is true, false, or indeterminate before you get the data.
“Google is ignoring robots.txt” may not be action-guiding, whilst “Googlebot is ignoring Search Console parameter restrictions” just might be.
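As a sketch, a hypothesis like the latter can be tested in one line (the parameter name here is hypothetical):
~$ grep -c 'sessionid=' googlebot.log
“Count Googlebot requests for URLs containing the restricted parameter.”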
Formulating Questions
Some things just
aren’t very useful to
know.
Example Questions
How deep is Googlebot crawling?
Where is the wasted crawl? What proportion of
requests are currently wasted?
Where is Googlebot POSTing? (see the sketch after this list)
What are the most popular non-200/304 resources?
How many unique resources are being crawled?
Which is the more popular form of product page?
Which sitemap pages aren’t being crawled?
Always pivot with other data.
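A sketch for the POST question above, using only the tools covered so far:
~$ grep '"POST' googlebot.log | less
“Page through every POST request Googlebot made.”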
Getting Useful Answers
AWK
AWK is a programming language focused on text
manipulation.
We are going to use it to print some columns from
our log files. That’s it.
Logs are space-separated by default.
awk uses spaces to delimit column numbers.
~$ awk '{print $col_number1, $col_number2}' [file]
ohgm.co.uk[1] 173.245.55.123[2] -[3] -[4] [11/Apr/2016:10:23:42[5] +0100][6] "GET[7] /[8] HTTP/1.1"[9] 200[10] 6812[11] "-"[12] "Mozilla/5.0[13] (compatible;[14] Googlebot/2.1;[15] +http://www.google.com/bot.html)"[16] 173.245.55.123
AWK
~$ awk '{print $8, $10}' Googlebot.log >> Gbot_responses.txt
“Output the requested resource and server response of
Googlebot.log to Gbot_responses.txt.”
/ 200
/robots.txt 304
/robots.txt 500
/amazing-blog-post 200
/forgotten-blog-post 404
/forbidden-blog-post 403
/ 200
Tailor the command to the access log format in use.
uniq
uniq takes text as an input and returns unique lines. It only compares adjacent lines, so sort the input first.
uniq -c returns these lines prefixed with a count.
uniq -d returns only repeated lines.
uniq -u returns only non-repeated lines.
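A minimal sketch combining awk and uniq, assuming the googlebot.log we made earlier:
~$ awk '{print $10}' googlebot.log | sort | uniq -c | sort -nr
“Print every response code, group and count identical ones, and list the most common first.”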
AWK
Like grep, awk also matches patterns, using /pattern/.
~$ awk '/bingbot/ {print $10}' combined.log | sort | uniq -c
“Look for lines containing bingbot in the unfiltered logs and print their server response codes. Sort these, then deduplicate and return a summary.”
216663 - 302
109232 - 200
18395 - 301
2568 - 404
2147 - 304
274 - 500
261 - 403
Example Use:
Site Migrations
Ultimate Guide to Site Migrations
Get a big list of old URLs.
301 redirect them once to the
right places.
Make sure they get crawled.
Site Migrations
“We want a list of all URLs requested by Googlebot in
our pre-migration dataset, sorted by popularity
(number of requests).”
e.g.
/ 49587
/index.html 25169
/robots.txt 23334
/home 19417
Site Migrations
~$ awk '/Googlebot/ {print $7}' combined.log | sort | uniq -c | sort -nr >> unique_requests.txt
“Take all access log requests and filter to Googlebot. Extract the requested resources. Sort and deduplicate these with a count, then sort by count in descending order.”
Site Migrations – Encouraging Crawl
“We want our migration to switch as
quickly as possible.”
Get the list of redirects (URI stems) you want Google
to crawl into a file.
grep can use this file as the match criteria (lines
matching this OR this OR this).
Site Migrations – Encouraging Crawl
We want the URLs Google has not yet crawled.
~$ grep -f wishlist.txt postmigration.log | awk '/Googlebot/ {print $8}' | sort | uniq >> wishlist-hits.txt
“Filter the post-migration log for lines that match wishlist.txt. Return the resources requested by Googlebot. Sort, deduplicate, and save.”
Site Migrations – Encouraging Crawl
~$ cat wishlist-hits.txt wishlist.txt | sort | uniq -u >> uncrawled.txt
“Read the contents of both files. Save wishlist entries that don’t appear in the access logs.”
Tip: use an indexing service like linklicious to encourage
crawling the uncrawled.
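An alternative sketch with comm, which compares two sorted files directly:
~$ comm -23 <(sort wishlist.txt) <(sort wishlist-hits.txt) > uncrawled.txt
“Print only the lines unique to wishlist.txt – the redirects Googlebot hasn’t requested yet.”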
Taking This Further
Keep Learning
Unix Utilities.
Learn SQL.
Also
These Techniques Apply to Other SEO Activities.
Enterprise Link Audits.
Enterprise Keyword Research.
Enterprise Spamming.
Thanks.
Oliver Mason
Technical SEO Consultant
Twitter: @ohgm
Email: ohgm@ohgm.co.uk
Resources
GOW: https://github.com/bmatzelle/gow
Cygwin: http://cygwin.com/install.html
Command Line Crash Course: http://cli.learncodethehardway.org/book/
Shameless links to my own stuff:
http://ohgm.co.uk/filter-server-logs-to-googlebot/
http://ohgm.co.uk/watch-googlebot-crawling/
http://ohgm.co.uk/preserve-link-equity-with-file-aliasing/
http://ohgm.co.uk/wayback-machine-seo/
Tools Used in this Talk
grep
sort
split
shuf
find
uniq
awk
wc
cowsay