First steps at parsing and analyzing
web server log files at scale
Elias Dabbas
@eliasdabbas
Raw log file
Harvard Dataverse, ecommerce site (zanbil.ir)


3.3 GB, ~1.3M lines
Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1
Parse and convert to DataFrame/Table
• Loading and parsing the whole file into memory probably won’t work (or scale)

• Log files are usually not big, they’re huge

• Sequentially parse chunks of lines, save to another efficient format (parquet), combine
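The "combine" step can be as simple as concatenating the part files. A minimal sketch, assuming the logs_to_df function shown later has already written its parquet chunks into a directory (the 'parsed_logs' name is illustrative, not from the deck):

from pathlib import Path

import pandas as pd

# read every parquet chunk produced by the chunked parser and
# concatenate them into one DataFrame
parts_dir = Path('parsed_logs')  # hypothetical output_dir passed to logs_to_df
logs_df = pd.concat(
    (pd.read_parquet(part) for part in sorted(parts_dir.glob('*.parquet'))),
    ignore_index=True,
)
print(logs_df.shape)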
• File ingestion gets even faster after saving the DataFrame to a single optimized file, which is also more convenient to store
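A sketch of that consolidation, continuing from the combined logs_df above (the file name is an assumption):

import pandas as pd

# write the combined DataFrame once as a single optimized parquet file ...
logs_df.to_parquet('access_logs.parquet', index=False)

# ... and later sessions load that one file directly
logs_df = pd.read_parquet('access_logs.parquet')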
• Convert to more efficient data types

• Faster writing and reading time
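A hedged sketch of such conversions on the parsed columns; the target dtypes below are plausible choices for this data, not necessarily the exact ones used in the deck:

import pandas as pd

# assuming logs_df holds the string columns produced by combined_regex
logs_df = logs_df.assign(
    datetime=pd.to_datetime(logs_df['datetime'],
                            format='%d/%b/%Y:%H:%M:%S %z'),
    status=logs_df['status'].astype('category'),
    method=logs_df['method'].astype('category'),
    size=pd.to_numeric(logs_df['size'].replace('-', '0'), downcast='integer'),
)
print(logs_df.dtypes)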
• Magic provided by:

• Pandas

• Apache Arrow Project

• Apache Parquet Project
Model Name: MacBook Pro
Model Identifier: MacBookPro16,4
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Number of Processors: 1
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 32 GB
logs_to_df function
Assumes common (or combined) log format
Can be extended to other formats
import re

import pandas as pd


def logs_to_df(logfile, output_dir, errors_file):
    with open(logfile) as source_file:
        linenumber = 0
        parsed_lines = []
        for line in source_file:
            try:
                log_line = re.findall(combined_regex, line)[0]
                parsed_lines.append(log_line)
            except Exception as e:
                # keep unparseable lines and their errors for later inspection
                with open(errors_file, 'at') as errfile:
                    print((line, str(e)), file=errfile)
                continue
            linenumber += 1
            if linenumber % 250_000 == 0:
                # flush every 250k parsed lines to its own parquet file
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
        else:
            # for-else: runs once the loop ends; save any remaining lines
            if parsed_lines:
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
)

# column names matching the named groups in combined_regex
columns = ['client', 'userid', 'datetime', 'method', 'request',
           'status', 'size', 'referrer', 'useragent']
Regular Expressions Cookbook
by Jan Goyvaerts, Steven Levithan
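Putting it together, a usage sketch with illustrative file and directory names (not from the deck):

import os

# the raw access log from the dataset above, plus an output directory for chunks
os.makedirs('parsed_logs', exist_ok=True)
logs_to_df(logfile='access.log',
           output_dir='parsed_logs',
           errors_file='parse_errors.txt')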
Thank you
