Amazon Athena Workshop
26 January 2017
Agenda
Facilities & Organization
1. Wi-Fi: DaHouseGuest, Pass: JustDoit!
2. Feedback Form: goo.gl/T9BZvy
3. Labs: github.com/doitintl/athena-workshop
4. Q & A
5. Breaks: 11:30 | 13:00 - 13:45 | 15:00
About us
Vadim Solovey
CTO
Shahar Frank
Software Engineering Lead
Workshop Agenda
● Module 1
○ Introduction to AWS Athena
○ Demo
● Module 2
○ Interacting with AWS Athena
○ Lab 2
● Module 3
○ Supported Formats and SerDes
○ Lab 3
● Module 4
○ Partitioning Data
○ Lab 4
● Module 5
○ Converting to columnar formats
○ Lab 5
● Module 6
○ Athena Security
● Module 7
○ Service Limits
● Module 8
○ Comparison to Google BigQuery
○ Demo
[1] AWS Athena
[1] Introduction
Understanding Purpose & Use-Cases
[1] Challenges
Organizations struggle to analyze their data without heavy up-front investment and long deployment times:
● Significant amount of effort required to analyze data on S3
● Users often have access to only aggregated data sets
● Managing Hadoop or data warehouse requires expertise
[1] Introducing AWS Athena
Athena is an interactive query service that makes it easy to
analyze data directly in Amazon S3 using standard SQL
[1] AWS Athena Overview
Easy to use
1. Log in to the console
2. Create a table (either by following a wizard or by typing a Hive DDL statement)
3. Start querying
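A first session can be as short as two statements (a minimal sketch; the table, columns and bucket here are placeholders, not part of the labs):

-- step 2: define a table over files already in S3 (Hive DDL)
CREATE EXTERNAL TABLE my_logs (
  request_time string,
  status int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/logs/';

-- step 3: query it immediately; there is no loading step
SELECT status, count(*) FROM my_logs GROUP BY status;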
[1] AWS Athena is Highly Available
High Availability Features
● You connect to a service endpoint or log into a console
● Athena uses warm compute pools across multiple availability zones
● Your data is in Amazon S3 which has 99.999999999% durability
[1] Querying Data Directly from Amazon S3
Direct access to your data without hassles
● No loading of data
● No ETL required
● No additional storage required
● Query of data in raw format
[1] Use ANSI SQL
Use of skills you probably already have
● Start with writing Standard ANSI SQL syntax
● Support for complex joins, nested queries & window functions (see the sketch after this list)
● Support for complex data types (arrays, structs)
● Support for partitioning of data by any key:
○ e.g. date, time, custom keys
○ Or customer-year-month-day-hour
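For instance, window functions and nested queries combine as follows (a sketch against the mydb.yellow_trips table defined in Module 3):

-- latest trip per vendor: a window function inside a nested query
SELECT vendor_id, pickup_datetime
FROM (
  SELECT vendor_id, pickup_datetime,
         row_number() OVER (PARTITION BY vendor_id
                            ORDER BY pickup_datetime DESC) AS rn
  FROM mydb.yellow_trips
) t
WHERE rn = 1;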
[1] AWS Athena Overview
Amazon Athena is a serverless way to query your data that lives on S3 using SQL
Features:
● Serverless with zero spin-up time and transparent upgrades
● Data can be stored in CSV, JSON, ORC, Parquet and even Apache web logs format
○ AVRO (coming soon)
● Compression is supported out of the box
● Queries cost $5 per terabyte of data scanned with a 10 MB minimum per query
Additional Information:
● Not a general purpose database
● Usually used by Data Analysts to run interactive queries over large datasets
● Currently available in us-east-1 (Northern Virginia) and us-west-2 (Oregon)
[1] Underlying Technologies
Presto (originating from Facebook)
● Used for SQL queries
● In-memory distributed query engine, ANSI SQL-compatible with extensions
Hive (originating from Hadoop project)
● Used for DDL functionality
● Complex data types
● Multitude of formats
● Supports data partitioning
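In practice the split looks like this (a sketch; table and bucket names are placeholders):

-- Hive provides the DDL layer:
CREATE EXTERNAL TABLE clicks (user_id string, ts timestamp)
PARTITIONED BY (dt string)
LOCATION 's3://my-bucket/clicks/';

-- Presto executes the SQL queries:
SELECT user_id, count(*) FROM clicks WHERE dt = '2017-01-26' GROUP BY user_id;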
[1] Presto vs. Hive Architecture
[1] Use Cases
Athena complements Amazon Redshift and Amazon EMR
AWS Athena
[2] Interacting with AWS Athena
Develop, Execute and Visualize Queries
[2] Interacting with AWS Athena
Amazon Athena is a serverless way to query your data that lives on S3 using SQL
Web User Interface:
● Run queries and examine results
● Manage databases and tables
● Save queries and share them across the organization for re-use
● Query History
JDBC Driver:
● Programmatic way to access AWS Athena
○ SQL Workbench, JetBrains DataGrip, sqlline
○ Your own app
AWS QuickSight:
● Visualize Athena data with charts, pivots and dashboards.
Hands On
Lab 2
Interacting with AWS Athena
Data Formats
[3] Supported Formats and SerDes
Efficient Data Storage
[3] Data and Compression Formats
The data formats presently supported are:
● CSV
● TSV
● Parquet (Snappy is default compression)
● ORC (Zlib is default compression)
● JSON
● Apache Web Server logs (RegexSerDe)
● Custom Delimiters
Compression Formats
● Currently, Snappy, Zlib, and GZIP are the supported compression formats.
● LZO is not supported as of today
[3] CSV Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://nyc-yellow-trips/csv/';
[3] Parquet Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
STORED AS PARQUET
LOCATION 's3://nyc-yellow-trips/parquet/'
tblproperties ("parquet.compress"="SNAPPY");
[3] ORC Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
STORED AS ORC
LOCATION 's3://nyc-yellow-trips/orc/'
tblproperties ("orc.compress"="ZLIB");
[3] RegEx Serde (Apache Log Example)
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
Date DATE, Time STRING, Location STRING,
Bytes INT, RequestIP STRING, Method STRING,
Host STRING, Uri STRING, Status INT, Referrer STRING,
os STRING, Browser STRING, BrowserVersion STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^
]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^
]+)s+[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$")
LOCATION 's3://athena-examples/cloudfront/plaintext/';
[3] Comparing Formats
PARQUET
● Columnar format
● Schema segregation into footer
● Column major format
● All data is pushed to the leaf
● Integrated compression and indexes
● Support for predicate pushdown
ORC
● Apache Top Level Project
● Schema segregation into footer
● Column major format with stripes
● Integrated compression, indexes and stats
● Support for predicate pushdown
[3] Comparing Formats
[3] Converting to Parquet or ORC format
● You can use Hive CTAS to convert data:
CREATE TABLE new_key_value_store
STORED AS PARQUET
AS SELECT c1, c2, c3, .., cN FROM noncolumnar_table
SORT BY key;
● You can also use Spark to convert the files to Parquet or ORC
● 20 lines of PySpark code running on EMR [1]
○ Converts 1TB of text data into 130GB of Parquet with Snappy compression
○ Approx. cost is $5
[1] https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
[3] Pay By the Query ($5 per TB scanned)
● You pay for the amount of data scanned
● Ways to save on cost (see the sketch after the table below):
○ Compress
○ Convert to columnar format
○ Use partitioning
● Free: DDL queries, failed queries
Dataset                  Size on S3  Query Runtime  Data Scanned  Cost
Logs stored as CSV       1TB         237s           1.15TB        $5.75
Logs stored as PARQUET   130GB       5.13s          2.69GB        $0.013
Savings                  87% less    34x faster     99% less      99.7% cheaper
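Column pruning compounds these savings: with a columnar format, Athena reads only the columns a query references, so naming columns instead of SELECT * directly cuts the bytes scanned (a sketch against the Parquet table from the earlier example):

-- reads just two columns instead of the whole row
SELECT vendor_id, pickup_datetime
FROM mydb.yellow_trips
WHERE pickup_datetime >= timestamp '2016-01-01 00:00:00';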
Hands On
Lab 3
Formats & SerDes
AWS Athena
[4] Partitioning Data
To improve performance and reduce cost
[4] Partitioning Data
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving
performance and reducing cost
Benefits of Data Partitioning:
● Partitions limit the scope of data being scanned during the query
● Improves performance
● Reduces query cost
● You can partition your data by any key
Common Practice:
● Based on time, often leading with a multi-level partitioning scheme
○ YEAR -> MONTH -> DAY -> HOUR
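A table partitioned this way could be declared as follows (a minimal sketch; names are placeholders):

CREATE EXTERNAL TABLE events (
  event_id string,
  payload string
)
PARTITIONED BY (year string, month string, day string, hour string)
LOCATION 's3://my-bucket/events/';

-- naming the partition keys restricts the scan to matching prefixes
SELECT count(*) FROM events
WHERE year = '2017' AND month = '01' AND day = '26';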
[4] Data already partitioned and stored on S3
$ aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
CREATE EXTERNAL TABLE impressions (
... ...)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;
-- load partitions into Athena
MSCK REPAIR TABLE impressions;

-- run a sample query
SELECT dt, impressionid FROM impressions WHERE dt < '2009-04-12-14-00' AND dt >= '2009-04-12-13-00';
[4] Data is not partitioned
aws s3 ls s3://athena-examples/elb/plaintext/ --recursive
2016-11-23 17:54:46 11789573 elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46 8776899 elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46 9309800 elb/plaintext/2015/01/01/part-r-00002-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 9412570 elb/plaintext/2015/01/01/part-r-00003-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 10725938 elb/plaintext/2015/01/01/part-r-00004-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46 9439710 elb/plaintext/2015/01/01/part-r-00005-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 0 elb/plaintext/2015/01/01_$folder$
2016-11-23 17:54:47 9012723 elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 7571816 elb/plaintext/2015/01/02/part-r-00007-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 9673393 elb/plaintext/2015/01/02/part-r-00008-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:48 11979218 elb/plaintext/2015/01/02/part-r-00009-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:48 9546833 elb/plaintext/2015/01/02/part-r-00010-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
ALTER TABLE elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='01') LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/';
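Once the partition is registered, filtering on the partition keys limits the scan to that prefix (a sketch; the table's data columns are omitted):

-- touches only s3://athena-examples/elb/plaintext/2015/01/01/
SELECT count(*) FROM elb_logs_raw_native_part
WHERE year = '2015' AND month = '01' AND day = '01';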
[5] AWS Athena
[5] Converting to Columnar Formats
Apache Parquet & ORC
[5] Converting to Columnar Formats (batch data)
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats such as Apache Parquet or ORC.
The process for converting to columnar formats using an EMR cluster is as follows:
● Create an EMR cluster with Hive installed.
● In the step section of the cluster create statement, you can specify a script stored in Amazon S3,
which points to your input data and creates output data in the columnar format in an Amazon S3
location. In this example, the cluster auto-terminates.
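The Hive script such a step runs boils down to a source table, a Parquet target table, and an INSERT OVERWRITE between them (a minimal sketch, not the exact script from the AWS walkthrough; all names are placeholders):

-- source: raw text files already on S3
CREATE EXTERNAL TABLE raw_logs (line string)
LOCATION 's3://my-bucket/raw/';

-- target: the same data, columnar and Snappy-compressed
CREATE EXTERNAL TABLE logs_parquet (line string)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');

-- rewrite the data in columnar form
INSERT OVERWRITE TABLE logs_parquet
SELECT line FROM raw_logs;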
[5] Converting to Columnar Formats (streaming data)
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats such as Apache Parquet or ORC.
The process for converting to columnar formats using an EMR cluster is as follows:
● Create an EMR cluster with Spark
● Run Spark Streaming Job reading the data from Kinesis Stream and writing Parquet files on S3
AWS Athena
[6] Athena Security
Authorization and Access
[6] Athena Security
Amazon offers three ways to control data access:
● AWS Identity and Access Management policies
● Access Control Lists
● Amazon S3 bucket policies
Users are in control of who can access data on S3. It's possible to fine-tune security so that different
people see different sets of data, and to grant access to other users' data.
AWS Athena
[7] Service Limits
Know your limits and mitigate the risk
[7] Service Limits
You can request a limit increase by contacting AWS Support.
● Currently, you can submit one query at a time and run up to five concurrent queries per account.
● Query timeout: 30 minutes
● Number of databases: 100
● Tables: 100 per database
● Number of partitions: 20k per table
● You may encounter a limit for Amazon S3 buckets per account, which is 100.
[7] Known Limitations
The following are known limitations in Amazon Athena:
● User-defined functions (UDF or UDAFs) are not supported.
● Stored procedures are not supported.
● Currently, Athena does not support any transactions found in Hive or Presto. For a full list of
keywords not supported, see Unsupported DDL.
● LZO is not supported. Use Snappy instead.
[7] Avoid Surprises
Use backticks if table names begin with an underscore. For example:
CREATE TABLE myUnderScoreTable (
`_id` string,
`_index` string,
...
For the LOCATION clause, use a trailing slash.
USE
s3://path_to_bucket/
DO NOT USE
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
AWS Athena
[8] Comparing to Google BigQuery
Know your limits and mitigate the risk
Google BigQuery
• Serverless Analytical Columnar Database based on Google Dremel
• Data:
• Native Tables
• External Tables (*SV, JSON, AVRO files stored in Google Cloud Storage bucket)
• Ingestion:
• File Imports
• Streaming API (up to 100K records/sec per table)
• Federated Tables (files in bucket, Bigtable table or Google Spreadsheet)
• ANSI SQL 2011
• Priced at $5/TB of scanned data + storage + streaming (if used)
• Cost Optimization - partitioning, limit queried columns, 24-hour cache, cold data.
Summary
Feature \ Product       AWS Athena                    Google BigQuery
Data Formats            *SV, JSON, PARQUET/z, ORC/z   External (*SV, JSON, AVRO) / Native
ANSI SQL Support        Yes*                          Yes*
DDL Support             Only CREATE/ALTER/DROP        CREATE/UPDATE/DELETE (w/ quotas)
Underlying Technology   FB Presto                     Google Dremel
Caching                 No                            Yes
Cold Data Pricing       S3 Lifecycle Policy           50% discount after 90 days of inactivity
User Defined Functions  No                            Yes
Data Partitioning       On Any Key                    By DAY
Pricing                 $5/TB (scanned) plus S3 ops   $5/TB (scanned) less cached data
Test Drive Summary
Query Type           AWS Athena (GB/time)   Google BigQuery (GB/time)   t.diff %
[1] LOOKUP           48MB (4.1s)            130GB (2.0s)                -51%
[2] LOOKUP & AGGR    331MB (4.35s)          13.4GB (2.7s)               -48%
[3] GROUP/ORDER BY   5.74GB (8.85s)         8.26GB (5.4s)               -27%
[4] TEXT FUNCTIONS   606MB (11.3s)          13.6GB (2.4s)               -470%
[5] JSON FUNCTIONS   29MB (17.8s)           63.9GB (8.9s)               -100%
[6] REGEX FUNCTIONS  (1.3s)                 5.45GB (1.9s)               +31%
[7] FEDERATED DATA   133GB (19.4s)          133GB (36.4s)               +47%
What does Athena do better than BigQuery?
Advantages:
• Can be faster than BigQuery, especially with federated/external tables
• Ability to use regex to define a schema (query files without needing to change the format)
• Can be faster and cheaper than BigQuery when using a partitioned/columnar format
• Tables can be partitioned on any column
Issues:
• It’s not easy to convert data between formats
• Doesn’t support DML, i.e. no INSERT/UPDATE/DELETE
• No built-in ingestion
What does BigQuery do better than Athena?
• It has native table support giving it better performance and more features
• It’s easy to manipulate data, insert/update records and write query results back to a table
• Querying native tables is very fast
• Easy to convert non-columnar formats into a native table for columnar queries
• Supports UDFs, which Athena does not (they may come to Athena in the future)
• Supports nested tables (nested and repeated fields)
Remember to complete
your evaluations ;-)
https://goo.gl/T9BZvy