First steps at parsing and analyzing
web server log files at scale
Elias Dabbas
@eliasdabbas
Raw log file
Harvard Dataverse, ecommerce site (zanbil.ir)


3.3 GB, ~1.3M lines
Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1
Parse and convert to DataFrame/Table
• Loading and parsing the whole file into memory probably won’t work (or scale)

• Log files are usually not big, they’re huge

• Sequentially parse chunks of lines, save to another efficient format (parquet), combine
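The "combine" step can be as simple as concatenating the part files. A minimal sketch, assuming the logs_to_df function shown later has already written its parquet chunks into a directory (the 'parsed_logs' name is illustrative, not from the deck):

from pathlib import Path

import pandas as pd

# read every parquet chunk produced by the chunked parser and
# concatenate them into one DataFrame
parts_dir = Path('parsed_logs')  # hypothetical output_dir passed to logs_to_df
logs_df = pd.concat(
    (pd.read_parquet(part) for part in sorted(parts_dir.glob('*.parquet'))),
    ignore_index=True,
)
print(logs_df.shape)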
• File ingestion gets even faster after saving the DataFrame to a single optimized file, which is also more convenient to store
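A sketch of that consolidation, continuing from the combined logs_df above (the file name is an assumption):

import pandas as pd

# write the combined DataFrame once as a single optimized parquet file ...
logs_df.to_parquet('access_logs.parquet', index=False)

# ... and later sessions load that one file directly
logs_df = pd.read_parquet('access_logs.parquet')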
• Convert to more efficient data types

• Faster writing and reading time
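A hedged sketch of such conversions on the parsed columns; the target dtypes below are plausible choices for this data, not necessarily the exact ones used in the deck:

import pandas as pd

# assuming logs_df holds the string columns produced by combined_regex
logs_df = logs_df.assign(
    datetime=pd.to_datetime(logs_df['datetime'],
                            format='%d/%b/%Y:%H:%M:%S %z'),
    status=logs_df['status'].astype('category'),
    method=logs_df['method'].astype('category'),
    size=pd.to_numeric(logs_df['size'].replace('-', '0'), downcast='integer'),
)
print(logs_df.dtypes)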
• Magic provided by:

• Pandas

• Apache Arrow Project

• Apache Parquet Project
Model Name: MacBook Pro
Model Identifier: MacBookPro16,4
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Number of Processors: 1
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 32 GB
logs_to_df function
Assumes common (or combined) log format
Can be extended to other formats
import re

import pandas as pd


def logs_to_df(logfile, output_dir, errors_file):
    with open(logfile) as source_file:
        linenumber = 0
        parsed_lines = []
        for line in source_file:
            try:
                log_line = re.findall(combined_regex, line)[0]
                parsed_lines.append(log_line)
            except Exception as e:
                # keep unparseable lines and their errors for later inspection
                with open(errors_file, 'at') as errfile:
                    print((line, str(e)), file=errfile)
                continue
            linenumber += 1
            if linenumber % 250_000 == 0:
                # flush every 250k parsed lines to its own parquet file
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
        else:
            # for-else: runs once the loop ends; save any remaining lines
            if parsed_lines:
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
)

# column names matching the named groups in combined_regex
columns = ['client', 'userid', 'datetime', 'method', 'request',
           'status', 'size', 'referrer', 'useragent']
Regular Expressions Cookbook
by Jan Goyvaerts, Steven Levithan
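Putting it together, a usage sketch with illustrative file and directory names (not from the deck):

import os

# the raw access log from the dataset above, plus an output directory for chunks
os.makedirs('parsed_logs', exist_ok=True)
logs_to_df(logfile='access.log',
           output_dir='parsed_logs',
           errors_file='parse_errors.txt')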
Thank you
