Automate all your EMR related activities
Eitan Sela - System Architect
eitan.sela@weissbeerger.com
$ whoami
• "Hands-On" system Architect with more than 17 years of
experience with billing, banking, information security (DLP) and
Cloud IoT/Big Data applications.
• Big Data specialist – Hadoop, Spark, Hive and EMR on AWS.
• Work with a wide range of AWS services, especially on serverless
projects.
• Java development, scalability, performance and stabilization
expert.
• Alexa skills developer.
• Love to share my experience in lectures and meetups.
What to expect from this session
• WeissBeerger use case – Aggregating raw orders and IoT data.
• Amazon EMR basics.
• Implementing ETLs with Spark.
• Submitting work to a Cluster.
• Provisioning scheduled transient EMR clusters for ETL jobs.
• Our new Slack chatbot for EMR, using Amazon Lex!
WeissBeerger use case – Aggregating raw orders and IoT data
Solution
• WeissBeerger bridges the gap between breweries, bars and
customers.
Benefits for the brewery
• Consumption Analytics.
• Dynamic Promotions.
• Beer Quality.
• Value creation.
• Beer penetration.
Benefits for the bar
• Real Time Consumption Tracking.
• Waste Reduction.
• Sales Growth.
How does it work?
• IoT (Pouring) – Beverage Analytics Hub.
• Point of sales – POS Vendors, via REST API, S3, DB, etc.
Aggregating raw point of sales orders and IoT data
Amazon EMR basics
Amazon EMR - Easily Run and Scale Apache Hadoop, Spark, HBase, Presto,
Hive, and other Big Data Frameworks
Amazon EMR – Create Cluster – Software and Steps
Amazon EMR – Create Cluster – Hardware
Amazon EMR – General Cluster Settings
Amazon EMR – Security
Implementing ETLs with Spark
Problem – Huge queries from MySQL to aggregative tables in Redshift
Solution - Implementing all ETLs with PySpark
New data pipeline using EMR PySpark jobs – ELT rather than ETL
Submitting work to a Cluster
Launching Applications with spark-submit
./bin/spark-submit \
  --jars jar1.jar,jar2.jar \
  --py-files path/to/my/pymodule1.py,path/to/my/pymodule2.py \
  my_program.py arg1 arg2
• The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
• It can use all of Spark’s supported cluster managers through a
uniform interface so you don’t have to configure your application
specially for each one.
EMR Steps - Submit Work to a Cluster
• You can submit work to a cluster by adding steps or by interactively
submitting Hadoop jobs to the master node.
• You can add steps to a cluster using the AWS Management Console,
the AWS CLI, or the Amazon EMR API.
• You can add steps during cluster creation or to a running cluster.
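As a rough sketch of the API route, the snippet below builds a step that wraps a spark-submit command and adds it to a running cluster with boto3's add_job_flow_steps. The step name and S3 path are placeholders, and running spark-submit through command-runner.jar in cluster deploy mode is an assumption about the setup, not necessarily how the jobs in this talk are submitted.

```python
def build_spark_step(name, script_uri, script_args=(), py_files=()):
    """Return an EMR step definition that runs spark-submit via command-runner.jar."""
    args = ["spark-submit", "--deploy-mode", "cluster"]
    if py_files:
        # --py-files takes one comma-separated value with no spaces.
        args += ["--py-files", ",".join(py_files)]
    args.append(script_uri)
    args.extend(str(a) for a in script_args)
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def submit_step(cluster_id, step):
    """Add one step to a running cluster and return its step ID."""
    import boto3  # imported lazily so the builder works without AWS access

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


step = build_spark_step(
    "orders-aggregation",                      # hypothetical step name
    "s3://example-bucket/jobs/my_program.py",  # hypothetical script location
    script_args=["arg1", "arg2"],
)
print(step["HadoopJarStep"]["Args"])
```

The same step dictionary can also be passed in the Steps list when the cluster is created, which matches the "during cluster creation" option above.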
EMR Steps – Job Types
• Custom Jar.
• Streaming Program.
• Spark App.
• Hive Program.
EMR Steps - Lifecycle
• Pending
• Cancelled (by user or API request)
• Running
• Completed / Failed.
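The lifecycle above can be followed programmatically. The sketch below polls describe_step until a step reaches a terminal state. Note that the EMR API reports a few more states than the slide lists (CANCEL_PENDING and INTERRUPTED). The function and parameter names are illustrative, and the EMR client is passed in so the logic can be exercised without AWS access.

```python
import time

# Step states as reported by the EMR API.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}
ACTIVE_STATES = {"PENDING", "CANCEL_PENDING", "RUNNING"}


def is_terminal(state):
    """True when a step will not change state any more."""
    return state in TERMINAL_STATES


def wait_for_step(emr, cluster_id, step_id, poll_seconds=30, sleep=time.sleep):
    """Poll describe_step until the step is terminal; return the final state."""
    while True:
        response = emr.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response["Step"]["Status"]["State"]
        if is_terminal(state):
            return state
        sleep(poll_seconds)
```

With a real client this would be called as wait_for_step(boto3.client("emr"), cluster_id, step_id).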
EMR Steps – Add jobs to a running cluster
EMR Steps – View logs
WeissBeerger’s Spark ETL jobs submitted to EMR Cluster
Provisioning scheduled transient EMR clusters for ETL jobs
Requirements
• Run the ETL using Spark on an EMR cluster every hour, covering
data from one month back.
• Input: MySQL or Hive (stg).
• Output: Hive (stg) or Redshift.
• Storage should be separated from the compute, so EMR clusters
should be transient.
• Multiple clusters should be able to run together.
• Fully automated and monitored.
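One way to meet the "transient" requirement is to create each cluster with its steps attached and let it terminate once they finish. The sketch below builds such a request for boto3's run_job_flow. The release label, instance types, instance count and bucket names are placeholder assumptions, the role names are the EMR defaults, and the hourly trigger itself is left out.

```python
def build_transient_cluster_request(name, log_uri, steps, release_label="emr-5.12.0"):
    """Build run_job_flow arguments for a cluster that terminates after its steps."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m4.large",  # placeholder sizing
            "SlaveInstanceType": "m4.large",   # placeholder sizing
            "InstanceCount": 3,
            # False makes the cluster transient: it shuts down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def launch(request):
    """Create the cluster and return its ID."""
    import boto3  # imported lazily so the builder works without AWS access

    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


etl_step = {
    "Name": "hourly-etl",  # hypothetical step
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py"],
    },
}
request = build_transient_cluster_request(
    "hourly-etl-cluster", "s3://example-bucket/emr-logs/", [etl_step]
)
```

Because each call to launch creates an independent cluster, several clusters can run side by side, and input and output stay in MySQL, Hive tables backed by external storage, or Redshift rather than on the cluster.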
Automate all your EMR related activities
Passing Spark Job steps parameters to Lambda input
• We created a simple JSON document with all the parameters required to add a step to an EMR cluster.
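The slides do not show the JSON itself, so the shape below is a guess for illustration only: every field name (cluster_id, step_name, script, args) is hypothetical. The handler turns that input into an EMR step and adds it with add_job_flow_steps. The client can be injected for local testing.

```python
import json

# Hypothetical Lambda input; the real field names are not given in the slides.
EXAMPLE_EVENT = json.loads("""
{
  "cluster_id": "j-EXAMPLE12345",
  "step_name": "orders-aggregation",
  "script": "s3://example-bucket/jobs/orders_aggregation.py",
  "args": ["--days-back", "30"]
}
""")


def event_to_step(event):
    """Translate the JSON input into an EMR step definition."""
    args = ["spark-submit", event["script"]] + list(event.get("args", []))
    return {
        "Name": event["step_name"],
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def lambda_handler(event, context, emr=None):
    """Add the step described by the event to the target cluster."""
    if emr is None:
        import boto3  # only needed when running against AWS

        emr = boto3.client("emr")
    response = emr.add_job_flow_steps(
        JobFlowId=event["cluster_id"], Steps=[event_to_step(event)]
    )
    return {"step_ids": response["StepIds"]}
```

Keeping the job parameters in the event rather than in the function code means one Lambda can serve every job; only the JSON changes.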
Monitoring EMR Steps with Lambda and Datadog
• We created a Lambda to sample all running EMR clusters for failed steps.
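A minimal sketch of such a check, assuming it only needs to look at clusters in the RUNNING or WAITING state: it lists those clusters, asks each one for steps in the FAILED state, and returns what it finds. Pagination is ignored for brevity, and forwarding the result to Datadog is left out because the slides do not show how that is done.

```python
def find_failed_steps(emr):
    """Return one record per failed step on every active cluster."""
    failed = []
    clusters = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])["Clusters"]
    for cluster in clusters:
        steps = emr.list_steps(ClusterId=cluster["Id"], StepStates=["FAILED"])["Steps"]
        for step in steps:
            failed.append(
                {
                    "cluster_id": cluster["Id"],
                    "cluster_name": cluster["Name"],
                    "step_id": step["Id"],
                    "step_name": step["Name"],
                }
            )
    return failed


def lambda_handler(event, context, emr=None):
    """Scheduled entry point: report how many failed steps were found."""
    if emr is None:
        import boto3  # only needed when running against AWS

        emr = boto3.client("emr")
    failed = find_failed_steps(emr)
    return {"failed_step_count": len(failed), "failed_steps": failed}
```

The returned count is the natural value to publish as a metric or to alert on.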
As more developers are developing PySpark Jobs…
Our new Slack chatbot for EMR, using Amazon Lex
Amazon Lex
• Conversational interfaces for your applications.
• Powered by the same deep learning technologies as Alexa.
• Amazon Lex provides the advanced deep learning functionalities of
automatic speech recognition (ASR) for converting speech to text,
and natural language understanding (NLU).
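Once Lex has recognized an intent, it can hand the request to a Lambda function for fulfillment. The sketch below assumes the original Lex (V1) event and response format, and a hypothetical intent named GetClusterStatus with a slot named ClusterName. Neither name comes from the talk, and a real bot would query EMR instead of returning a fixed message.

```python
def close(message, fulfilled=True):
    """Build a Lex V1 'Close' response that ends the conversation turn."""
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled" if fulfilled else "Failed",
            "message": {"contentType": "PlainText", "content": message},
        }
    }


def lambda_handler(event, context):
    """Route a Lex V1 fulfillment event by intent name."""
    intent = event["currentIntent"]["name"]
    slots = event["currentIntent"].get("slots") or {}
    if intent == "GetClusterStatus":  # hypothetical intent name
        cluster = slots.get("ClusterName") or "unknown"  # hypothetical slot name
        return close("Looking up the status of cluster {}.".format(cluster))
    return close("Sorry, I cannot handle {} yet.".format(intent), fulfilled=False)
```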
Amazon Lex - Use cases - Call Center Bots
awsbot - the chatbot that helps you manage AWS resources
awsbot - Demo
awsbot - Demo - EMR Cluster is ready
Q & A
We Are Hiring!
Senior Data Scientist
Senior Designer (UI/UX)
Senior Full Stack Developer
Java Developer
Senior Manual QA
Director of Ops
BI Analyst
Data Management Analyst
Customer Success Manager
Senior BI Analyst