Glue
Let's Get Stuck In!
Introduction
to Containers
Chris Taylor
- Worked with SQL Server since 2001
- MCSE – Data Platform
- SQLNE PASS Chapter Group Leader
- SQLRelay Organiser
- Cricket/Football Coaching
Agenda
• Session Aim
• The Problem
• What is AWS Glue?
• Use Cases
• Demos
• Costs
• Q&A
Not on the Agenda
• Comparison with other cloud offerings
Session Aim
• An understanding of the issues faced with ETL development
• Learn by example
• Enough of a taste to get the Glue bug and start experimenting!
The Problem
“[ETL] … consumes 70 percent of the resources needed for implementation and maintenance of a typical data warehouse”
R. Kimball and J. Caserta. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley, 2004.
The Problem
70% of ETL Jobs are hand-coded
with no use of ETL Tools
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
Why hand-code?
• Flexible
• Powerful
• Unit test
• Deploy with other code
• You know your dev tools
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
Involves a lot of effort
• Data formats change
• Source/target schemas change
• You add sources
• Data volume grows
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
What is AWS Glue?
• Fully managed ETL service
• Serverless
• Automates the undifferentiated heavy lifting of ETL
• Discover, Develop, Deploy
• For developers, by developers
Components
• Data Catalog
  • Hive Metastore compatible
  • Crawlers automatically extract metadata and create tables
  • Integrated with Amazon Athena and Amazon Redshift Spectrum
• Job Authoring
  • Auto-generates ETL code
  • Built on open frameworks – Python and Spark
  • Developer-centric
• Job Execution
  • Runs jobs on a serverless Spark platform
  • Provides flexible scheduling
  • Handles dependency resolution, monitoring and alerting
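The Data Catalog and crawler component above can be driven from code as well as from the console. The following is a minimal sketch, assuming a boto3 Glue client is passed in; the crawler name, IAM role ARN, database name and S3 path are all hypothetical placeholders, and the table listing only reads the first page of results.

```python
from typing import Any, Dict, List


def crawl_s3_prefix(glue: Any, name: str, role_arn: str,
                    database: str, s3_path: str) -> None:
    """Create a crawler over one S3 prefix and start it.

    `glue` is expected to behave like boto3.client("glue").
    All names and paths are supplied by the caller (hypothetical here).
    """
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=name)


def list_table_names(glue: Any, database: str) -> List[str]:
    """Return the table names the catalog holds for a database (first page only)."""
    response: Dict[str, Any] = glue.get_tables(DatabaseName=database)
    return [table["Name"] for table in response.get("TableList", [])]
```

With real credentials this would be called as `crawl_s3_prefix(boto3.client("glue"), ...)`. The crawler runs asynchronously, so the tables only appear in `list_table_names` once the crawl has finished.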
Use Cases?
Understand your data
https://aws.amazon.com/glue/
Query your data lake on Amazon S3
https://aws.amazon.com/glue/
Build event driven ETL pipelines
https://aws.amazon.com/glue/
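One common way to build an event-driven pipeline is an AWS Lambda function that starts a Glue job when a file lands in S3. This is a rough sketch under that assumption, not necessarily what the demo uses; the job name `demo-etl-job` and the `--source_path` job argument are hypothetical.

```python
from typing import Any, Dict, Optional
from urllib.parse import unquote_plus

JOB_NAME = "demo-etl-job"  # hypothetical Glue job name


def lambda_handler(event: Dict[str, Any], context: Any,
                   glue: Optional[Any] = None) -> Dict[str, str]:
    """Start a Glue job run for the first S3 object in the event."""
    if glue is None:
        # Imported lazily so the handler can be exercised with a stub client.
        import boto3
        glue = boto3.client("glue")

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["object"]["key"])

    response = glue.start_job_run(
        JobName=JOB_NAME,
        # Hypothetical job argument; Glue job argument names start with "--".
        Arguments={"--source_path": f"s3://{bucket}/{key}"},
    )
    return {"JobRunId": response["JobRunId"]}
```

The optional `glue` parameter is only there so the function can be tried locally with a fake client; Lambda itself calls the handler with `event` and `context` only.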
DEMO
Costs
• https://aws.amazon.com/free/
• 1 million objects stored in the AWS Glue Data Catalog**
• 1 million requests made per month to the AWS Glue Data Catalog**
** These free tier offers do not automatically expire at the end of your 12-month AWS Free Tier term, but are available to both existing and new AWS customers indefinitely.
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
Costs
• DPU (Data Processing Unit)
• Compute-based usage:
  • ETL jobs, development endpoints, and crawlers: $0.44 per DPU-Hour
  • Billed in 1-minute increments
  • 10-minute minimum
  • A single DPU = 4 vCPU and 16 GB of memory
• Data Catalog usage:
  • Data Catalog storage:
    • Free for the first million objects stored
    • $1 per 100,000 objects, per month, stored above 1M
  • Data Catalog requests:
    • Free for the first million requests per month
    • $1 per million requests above 1M
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
Costs Example #1
• ETL job
  • Ran for 10 minutes on a 6 DPU environment
  • The price of 1 DPU-Hour in US East (N. Virginia) is $0.44
  • Cost for this job run = 6 DPUs * (10/60) hour * $0.44 per DPU-Hour = $0.44
• Development endpoint
  • Active for 24 minutes
  • Each development endpoint is provisioned with 5 DPUs
  • Cost to use the development endpoint = 5 DPUs * (24/60) hour * $0.44 per DPU-Hour = $0.88
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
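The arithmetic in Example #1 can be checked with a few lines of Python. This is a sketch that assumes the figures quoted on the slides ($0.44 per DPU-Hour, 1-minute increments, 10-minute minimum); the function name is illustrative and current AWS pricing may differ.

```python
import math
from decimal import Decimal

PRICE_PER_DPU_HOUR = Decimal("0.44")  # US East (N. Virginia), as quoted on the slide
MINIMUM_MINUTES = 10                  # 10-minute minimum, as quoted on the slide


def dpu_cost(dpus: int, minutes: float,
             minimum_minutes: int = MINIMUM_MINUTES) -> Decimal:
    """Cost in USD of running `dpus` DPUs for `minutes` minutes.

    Usage is rounded up to whole minutes, then the minimum is applied.
    """
    billed_minutes = max(math.ceil(minutes), minimum_minutes)
    return PRICE_PER_DPU_HOUR * dpus * billed_minutes / 60


print(dpu_cost(6, 10))  # ETL job: 6 DPUs for 10 minutes
print(dpu_cost(5, 24))  # development endpoint: 5 DPUs for 24 minutes
```

Because of the minimum, a 3-minute job on 6 DPUs is billed the same as the 10-minute job in the example.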
Costs Example #2
• Store 1 million tables in your Data Catalog in a given month and make 1 million requests to access these tables.
  • You pay $0 for using the Data Catalog.
  • You are covered under the Data Catalog free tier.
• Your requests double to 2 million.
  • You pay only for the one million requests above the free tier, which is $1.
• If you use crawlers to find new tables and they run for 30 minutes using 2 DPUs, you pay 2 DPUs * (30/60) hour * $0.44 per DPU-Hour = $0.44.
• Your total monthly bill = $0 + $1 + $0.44 = $1.44.
https://www.slideshare.net/AmazonWebServices/bda311-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=2
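Example #2 combines three charges: catalog storage, catalog requests and crawler compute. The sketch below adds them up using the rates quoted on the slides. It assumes usage above the free tier is charged pro rata (AWS may round to whole billing units) and the function names are illustrative.

```python
from decimal import Decimal

FREE_OBJECTS = 1_000_000             # free tier: objects stored
FREE_REQUESTS = 1_000_000            # free tier: requests per month
PRICE_PER_100K_OBJECTS = Decimal("1")
PRICE_PER_MILLION_REQUESTS = Decimal("1")
PRICE_PER_DPU_HOUR = Decimal("0.44")


def catalog_storage_cost(objects_stored: int) -> Decimal:
    """Monthly storage charge for objects above the free tier."""
    billable = max(objects_stored - FREE_OBJECTS, 0)
    return PRICE_PER_100K_OBJECTS * billable / 100_000


def catalog_request_cost(requests: int) -> Decimal:
    """Monthly request charge for requests above the free tier."""
    billable = max(requests - FREE_REQUESTS, 0)
    return PRICE_PER_MILLION_REQUESTS * billable / 1_000_000


def monthly_bill(objects_stored: int, requests: int,
                 crawler_dpus: int, crawler_minutes: int) -> Decimal:
    """Total of catalog storage, catalog requests and crawler compute.

    The 10-minute compute minimum is not modelled here.
    """
    crawler = PRICE_PER_DPU_HOUR * crawler_dpus * crawler_minutes / 60
    return (catalog_storage_cost(objects_stored)
            + catalog_request_cost(requests)
            + crawler)


# 1M tables, 2M requests, crawlers using 2 DPUs for 30 minutes
print(monthly_bill(1_000_000, 2_000_000, 2, 30))
```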
Why can’t I just use Data Pipeline?
Glue
• Discovering unstructured data
• Managed ETL service
• Runs on a serverless Apache Spark environment
• Takes a data-first approach
• Provides an integrated data catalog / metadata
• Querying via Amazon Athena and Amazon Redshift Spectrum
• ETL jobs are Scala or Python based
Data Pipeline
• Simple data replication tasks
• Managed orchestration service
• Greater flexibility (environment, access and compute resources)
• Launches compute resources in your account, allowing you direct access to the Amazon EC2 instances or Amazon EMR clusters
• Runs on a different engine (Hive, Pig)
Conclusion
Good
• Fully managed ETL
• Serverless
• Crawlers for discovering and relationalizing semi-structured / unstructured data
• Development endpoints
Not so good
• Complex costing
• 10-minute minimum job run
• Development endpoints £££££££
• AWS documentation is lacking
• Multiple files in folder (Athena)
• Complex non-scheduled automation
• None for crawlers!
Summary
• Session Aim
• The Problem
• What is AWS Glue?
• Use Cases
• Demos
• Costs
Questions?
Contact
Links
• http://aws.amazon.com/documentation/glue
• https://www.slideshare.net/search/slideshow?searchfrom=header&q=aws+glue
• https://www.slideshare.net/AmazonWebServices/building-serverless-etl-pipelines-with-aws-glue-aws-summit-sydney-2018?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=6
• https://www.slideshare.net/MichaelRainey3/going-serverless-an-introduction-to-aws-glue?qid=b3da6acd-c11b-4576-8f40-88906fb6c3f3&v=&b=&from_search=5
• https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
• https://gluent.com/access-catalog-query-enterprise-data-gluent-cloud-sync-aws-glue/
Best Practices and Questions
• https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
• https://aws.amazon.com/glue/faqs/
• https://www.accenture.com/us-en/blogs/blogs-kalyani-sayyed-amazon-glue-etl

