Session ID: 1495
Remember to complete your evaluation for this session within the app!
Build a DataWarehouse for your (alert!) logs
With Python, AWS Athena and AWS Glue
Wednesday, April 25, 2018
Prepared by: Maxym Kharchenko, Sr. Database Engineer, Amazon.com
whoami
• Sr Database Engineer @amazon.com Big Data Technologies team
• Developer <-> DBA
• OCM, ACE Associate, AWS Developer (all “alumni”)
• I have stickers!
Agenda
• Why query (alert) logs with SQL
• How to query (alert) logs with SQL
• How to make it easy and efficient with AWS Athena and Glue
• Demo
Logs are the best operational data
about your system
Logs are great at simple “tactical” questions
“Why did my query fail at 17:17 yesterday?”
Sun Feb 11 17:17:04 2018
ORA-01115: IO error reading block from file (block # )
ORA-01110: data file 16: '/ora02/database/mydb/tbs12mydb_01.dbf'
“Why am I missing today’s partition?”
Thu Jan 11 11:40:55 2018
Errors in file /logs/mydb/trace/mydb-36_j005_38530.trc:
ORA-12012: error on auto execute of job "PART_ADMIN"."CREATE_PARTITION"
ORA-00028: your session has been killed
mydb
alert.log
But not so great when questions get “broader”
“Did the last patch solve our problem?”
> grep ORA-28 alert.log
opiodr aborting process unknown ospid (3411) as a result of ORA-28
opiodr aborting process unknown ospid (65973) as a result of ORA-28
opiodr aborting process unknown ospid (56719) as a result of ORA-28
opiodr aborting process unknown ospid (129663) as a result of ORA-28
opiodr aborting process unknown ospid (11260) as a result of ORA-28
opiodr aborting process unknown ospid (22534) as a result of ORA-28
mydb
alert.log
Or when analyzing multiple logs
“What is the timeline of the latest cluster lockup issue?”
Wed May 24 11:17:10 2017
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Wed May 24 11:17:17 2017
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Wed May 24 11:17:28 2017
Post SMON to start 1st pass IR Fix write in gcs resources
Reconfiguration complete
Or when correlating data across different logs
“Are we seeing more node crashes because of
- Disk malfunctions?
- ASM issues?
- Network disconnects?”
> grep "WARNING: inbound connection timed out" alert*.log
> grep "corrupted block" asm*.log
> grep -P "failed|error|critical" kern*.log
> grep -P "long wait|error|disconnect" tnsping*.log
Or when looking for trends
“Has the rate of network disconnects increased over the last 6 months?”
“What databases have the highest archived log switch rate?”
“Do we see more problems in specific datacenter locations?”
“Are there times of the day with almost no user activity?”
Logs are not exactly easy to query
(in bulk)
If only there was a simpler way
to query all my logs …
SELECT trunc(event_time, 'DD'), db, count(1) AS errors
FROM "all my logs"
WHERE event_time > sysdate - interval '90' day
  AND (
    message LIKE '%ORA-00028%'
    OR
    message LIKE '%ORA-28%'
  )
GROUP BY trunc(event_time, 'DD'), db
ORDER BY 1,2
/
How to query
(application, db, …) logs
with SQL
Is it even possible
to query “unstructured text” with SQL ?
SQL Engines!
A “table” is just a:
• Linux “directory”
• HDFS “folder”
• Cloud storage “folder”
Log files (aka: “text”) → “Table”?
How to make logs “queryable”
1. Structur-ize
2. Table-ize
3. Transform and Compact-ize
Step 1: Structur-ize
“Raw” logs (e.g. alert_db.log) → “Structured” (e.g. JSON) logs
Step 1: Find “structure” in logs
Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.426272] sd 2:0:1:169: [sdgfs] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185022.726345] rport-2:0-22: blocked FC remote port
Jan 11 20:30:59 host-12 kernel: [185076.513763] sd 2:0:13:0: [sdd] Write Protect is off
Thu Jan 11 17:15:54 2018
Thread 32 advanced to log sequence 34018 (LGWR switch)
Current log# 251 seq# 34018 mem# 0: +DG1/mydb-1/onlinelog/group_12.384.931698439
Thu Jan 11 17:16:25 2018
Unable to create archive log file '+DG1'
ARC1: Error 19504 Creating archive log file to '+DG1'
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance mydb-1 - Archival Error
ORA-16038: log 12 sequence# 34017 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 254 thread 32: '+DG1/mydb-1/onlinelog/group_12.593.933491557'
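The alert log excerpt above has a different structure from the syslog lines: the timestamp sits on its own line and every line until the next timestamp belongs to the same event. A minimal sketch of how that could be parsed, assuming the "Thu Jan 11 17:16:25 2018" timestamp layout shown on this slide and an English locale (the function name parse_alert_log is hypothetical, not from the talk):

```python
import re
from datetime import datetime

# A line that holds only a timestamp, e.g. "Thu Jan 11 17:16:25 2018"
TIMESTAMP_LINE = re.compile(r"^\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_alert_log(lines):
    """Yield one {"event_time", "message"} dict per timestamped block."""
    event_time, message = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if TIMESTAMP_LINE.match(line):
            # A new timestamp closes the previous block, if it had any text
            if event_time and message:
                yield {"event_time": event_time, "message": "\n".join(message)}
            parsed = datetime.strptime(line, TIMESTAMP_FORMAT)
            event_time, message = parsed.strftime("%Y-%m-%d %H:%M:%S.000"), []
        elif event_time and line.strip():
            # Non-empty lines after a timestamp belong to the current block
            message.append(line)
    # Flush the last block at end of input
    if event_time and message:
        yield {"event_time": event_time, "message": "\n".join(message)}
```

Feeding it sys.stdin and printing json.dumps(record) for each result would give one JSON object per alert log event. Lines that appear before the first timestamp are dropped in this sketch.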
Step 1: Make log structure explicit
#!/usr/bin/env python3
import json
import re
import sys

# Line format: <timestamp> <message>
# e.g. Jan 11 20:30:59 kernel: [185012.404818] sd 2:0:1:168: [sdgfp]
LINE_FORMAT = re.compile(r"^(\w+\s+\d+\s+\d+:\d+:\d+)\s+(.*)$")

for line in sys.stdin:
    matched = LINE_FORMAT.match(line)
    if matched:
        # CSV alternative: print(",".join(matched.groups()))
        # Emit one JSON object per log line: timestamp and message as separate fields
        print(json.dumps(
            dict(zip(("event_time", "message"), matched.groups()))
        ))
Step 1: Make log structure explicit
Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.426272] sd 2:0:1:169: [sdgfs] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185022.726345] rport-2:0-22: blocked FC remote port
Jan 11 20:30:59 host-12 kernel: [185076.513763] sd 2:0:13:0: [sdd] Write Protect is off
{
"message": "host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk",
"event_time": "Jan 11 20:30:59"
}
{
"message": "host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk",
"event_time": "Jan 11 20:30:59"
}
Step 1: Make log structure explicit
Jan 11 20:30:59 host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185012.426272] sd 2:0:1:169: [sdgfs] Attached SCSI disk
Jan 11 20:30:59 host-12 kernel: [185022.726345] rport-2:0-22: blocked FC remote port
Jan 11 20:30:59 host-12 kernel: [185076.513763] sd 2:0:13:0: [sdd] Write Protect is off
{
"message": "host-12 kernel: [185012.404818] sd 2:0:1:168: [sdgfp] Attached SCSI disk",
"event_time": "2018-01-11 20:30:59.000"
}
{
"message": "host-12 kernel: [185012.425995] sd 2:0:1:167: [sdgfn] Attached SCSI disk",
"event_time": "2018-01-11 20:30:59.000"
}
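The output above carries a normalized timestamp ("2018-01-11 20:30:59.000") instead of the raw syslog one. One way to get there, as a sketch: syslog timestamps have no year, so the year has to come from somewhere else (the file's modification time, the file name, or the current date). Here it is simply passed in, which is an assumption, and the function name normalize_syslog_time is hypothetical.

```python
from datetime import datetime


def normalize_syslog_time(text, year):
    """Turn 'Jan 11 20:30:59' plus a year into '2018-01-11 20:30:59.000'."""
    # Parsing the year together with the rest keeps Feb 29 valid in leap years
    parsed = datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")
    # Syslog has second resolution, so the millisecond part is always .000
    return parsed.strftime("%Y-%m-%d %H:%M:%S.000")
```

Around a year boundary a fixed year gives wrong results (a December line read in January), so a real parser would need an extra rule for that case.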
Step 2: Table-ize
“Structured” (e.g. JSON) logs → Table “directory”
CREATE TABLE … → Table “Metadata”
Step 2: Create table and “ingest” data
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
  `event_time` timestamp,
  `message` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://databucket/mydb/mytable/'
;
> cp log*.json /data/mydb/mytable
> hadoop fs -put log*.json hdfs:///data/mydb/mytable
> aws s3 cp . s3://databucket/mydb/mytable/ --recursive --exclude "*" --include "log*.json"
Step 3: Transform (into final form)
• Rollup
• Aggregations
• Materializing complex joins
• Partitioning
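Two of the transforms listed above, partitioning and aggregation, can be sketched in plain Python on the structured records from Step 1. This is only an illustration under assumptions: the partition key name dt and both function names are hypothetical, and in the pipeline described in this talk the real work would be done by SQL or a Glue job.

```python
from collections import Counter, defaultdict


def partition_by_day(records):
    """Group records under Hive-style prefixes such as 'dt=2018-01-11/'."""
    partitions = defaultdict(list)
    for record in records:
        # event_time is expected as 'YYYY-MM-DD HH:MM:SS.fff'
        day = record["event_time"][:10]
        partitions[f"dt={day}/"].append(record)
    return dict(partitions)


def daily_error_rollup(records, needle="ORA-"):
    """Count messages containing `needle` per day, sorted by day."""
    counts = Counter(
        record["event_time"][:10]
        for record in records
        if needle in record["message"]
    )
    return sorted(counts.items())
```

With one prefix per day, a query restricted to a date range only has to read the matching prefixes instead of scanning every file.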
Step 3: Compact-ize
“Scan all files!”
Open data formats
TSV
• Text based
• Row-oriented
• Some compression
• Limited filtering
• Easy to make
PARQUET
• Binary
• Columnar
• Really good compression
• Advanced filtering
• More difficult to make
Step 3: Transform and Compact-ize
JSON logs → PARQUET logs
• Format Transform
• SQL Transform
Step 4: Query
PARQUET logs
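Once the logs are tables, the "broader" questions from the start of the talk become ordinary GROUP BY queries. The sketch below runs one locally with SQLite, used here purely as a stand-in for Athena (a swapped-in engine, not the one used in the talk). The table name logs and the function name are hypothetical; in the usage note, the messages come from the earlier slides but the timestamps are made up.

```python
import sqlite3


def errors_per_day(rows):
    """Count killed-session errors per day and database.

    `rows` is an iterable of (event_time, db, message) tuples, with
    event_time formatted as 'YYYY-MM-DD HH:MM:SS.fff'.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (event_time TEXT, db TEXT, message TEXT)")
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)
    query = """
        SELECT substr(event_time, 1, 10) AS day, db, count(*) AS errors
        FROM logs
        WHERE message LIKE '%ORA-00028%' OR message LIKE '%ORA-28%'
        GROUP BY substr(event_time, 1, 10), db
        ORDER BY 1, 2
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

For example, passing three rows for mydb (an ORA-00028 on 2018-01-11, an unrelated ORA-16038 the same day, and an "as a result of ORA-28" line on 2018-01-12) returns one error for each of the two days.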
The SQL-on-logs pipeline
“Raw” logs → Structured logs → Staging table(s) → Final table(s)
Step 5: Make it simple with AWS
“Raw” logs → Structured logs → “Staging” S3 bucket → AWS Glue → “Final” S3 bucket → AWS Athena
AWS Athena
• “Query data in S3 using SQL”
• Serverless Presto cluster
• Rich SQL
• Supports multiple open data formats
• Fast, interactive performance
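Athena can also be driven from Python rather than the console. A minimal sketch, assuming boto3 is installed and credentials are configured: the client is passed in (for instance boto3.client("athena")) so the function itself has no AWS dependency, and the function name, the database name and the output location in the usage note are placeholders, not values from the talk.

```python
import time

# Athena query states after which polling can stop
FINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start a query and poll until it finishes; return (query_id, state)."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in FINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)
```

A call could look like run_athena_query(boto3.client("athena"), "SELECT count(*) FROM mytable", "mydb", "s3://my-results-bucket/athena/"), where the results bucket is hypothetical. The sketch polls forever if the query never reaches a final state, so production code would add a timeout.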
AWS Glue
• “Prepare and load data (ETL!)”
• Serverless Apache Spark
• Crawlers: “data discovery” and automatic catalog maintenance
• Job scheduling
• Integrated with many data “sources” and “sinks”
• ETL script generation (or BYO)
Demo time
Extending
SQL-on-logs pipeline
Pre-parse logs in the cloud
“Raw” logs → S3: “Raw” logs → Lambda: to_json() → S3: “Staging” (JSON) → Glue: to_parquet() → S3: “Final” (Parquet) → Athena
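The Lambda to_json() step could look roughly like the sketch below: an S3 "object created" event triggers the function, which reads the raw log, applies the Step 1 line format, and writes JSON lines to the staging bucket. Everything named here is an assumption for illustration (the bucket name, the ".json" key suffix, the function names); the S3 client is passed in so the logic can run without AWS.

```python
import json
import re
from urllib.parse import unquote_plus

# Same line format as Step 1: <timestamp> <message>
LINE_FORMAT = re.compile(r"^(\w+\s+\d+\s+\d+:\d+:\d+)\s+(.*)$")
STAGING_BUCKET = "my-staging-bucket"  # hypothetical bucket name


def to_json(text):
    """Convert raw syslog-style text into newline-delimited JSON."""
    out = []
    for line in text.splitlines():
        matched = LINE_FORMAT.match(line)
        if matched:
            record = dict(zip(("event_time", "message"), matched.groups()))
            out.append(json.dumps(record))
    return "\n".join(out)


def handle_s3_event(event, s3, staging_bucket=STAGING_BUCKET):
    """Convert every object named in an S3 event; return the staged keys."""
    written = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record["s3"]["object"]["key"])
        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        staged_key = key + ".json"
        s3.put_object(Bucket=staging_bucket, Key=staged_key,
                      Body=to_json(raw).encode("utf-8"))
        written.append(staged_key)
    return written
```

Inside Lambda, the entry point would be a thin wrapper along the lines of lambda_handler(event, context) calling handle_s3_event(event, boto3.client("s3")). Reading the whole object into memory is fine for modest log files; very large logs would need streaming.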
Build materialized views
“Raw” logs → S3: “Raw” logs → Lambda: to_json() → S3: “Staging” (JSON) → Glue: to_parquet() → S3: “Final” (Parquet) → Athena
Added step: Glue: make_mview()
Use different SQL front-ends
“Raw” logs → S3: “Raw” logs → Lambda: to_json() → S3: “Staging” (JSON) → Glue: to_parquet() → S3: “Final” (Parquet) → Athena
Added steps: to_redshift() → Redshift, to_oracle() → RDS ORACLE
Thank you!
maxym@amazon.com
