Confidential
End to End Pipelines Using Apache Spark/Livy/Airflow
An integrated solution to batch data processing
Rikin Tanna and Karunasri Maram
Capital One Auto Finance
February 12, 2020
Agenda
1. Problem Statement
2. Integrated Solution, Brief Overview
3. Explanation of Components
• Apache Spark
• Apache Livy
• Apache Airflow
4. Integrated Solution, Fully Explained
5. Demo
Problem Statement
Need: Batch data processing on a schedule
Solution Requirements
End to End Data Pipeline
• Scalable: handle jobs with growing data sets
• Parallel Execution: ability to run multiple jobs in parallel
• Open-Source Support: active contributions to the components used, to stay efficient
• Dynamic: generation of the pipeline on demand to support varying characteristics
• Dependency Enabled: support ordering of tasks based on dependencies
A fully integrated big data pipeline… with just 3 components!
• Apache Spark
• Unified data analytics engine for large-scale data processing
• Served on EMR cluster
• Apache Livy
• REST Interface to enable easy interaction with Apache Spark
• Served on master node of EMR cluster
• Apache Airflow
• Workflow management system (WMS) to schedule, trigger, and monitor workflows on a single compute resource
• Served on single compute instance, with metadata DB on
separate RDS instance
Solution: Brief
What is Airflow?
An open source platform to programmatically author, schedule, and monitor workflows
Dynamic: Airflow pipelines are configured as code, allowing for dynamic pipeline generation as DAGs
Extensible: easily extend the library and usability by creating your own operators and executors
General-Purpose: Airflow is written in Python, and all pipelines are configured in Python
Accessible: rich UI allows non-technical users to monitor workflows
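The "configured as code" point above can be sketched as a minimal DAG definition. This uses the Airflow 1.10-era API (matching this deck's timeframe); the DAG id, schedule, and task callables are illustrative assumptions, not part of the deck:

```python
# Minimal Airflow DAG sketch (Airflow 1.x API; names and schedule are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    # placeholder for pulling a batch data set
    print("pull batch data")

def load():
    # placeholder for writing results
    print("write results")

dag = DAG(
    dag_id="example_batch_pipeline",
    start_date=datetime(2020, 2, 12),
    schedule_interval="@daily",  # run once per day
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

extract_task >> load_task  # load depends on extract (dependency-enabled ordering)
```

Because the file is plain Python, a loop or config lookup can generate tasks dynamically, which is what "dynamic pipeline generation" refers to.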
Why Airflow?
Comparison of common open source workflow management systems (Airflow, Luigi, Oozie, Azkaban), evaluated on: dynamic pipelines, rich interactive UI, general-purpose usability, scalability, dependency management, and maturity/support.
Apache Airflow Architecture
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
● Metadata Database
○ stores information necessary for scheduler/executor
and webserver
○ task state, DAG definitions, log locations, etc
● Scheduler/Executor
○ process that uses DAG definitions and task states to push
tasks onto queue for execution
● Workers
○ process(es) that execute the logic of the tasks
● Webserver
○ process that renders web UI, interacting with metadata
database to allow user monitoring and interaction with
workflows
How to Get Started with Apache Airflow
1. Install Python
2. “pip install apache-airflow”
a. install from PyPI using pip
b. AIRFLOW_HOME = ~/airflow (default)
3. “airflow initdb”
a. initialize database
4. “airflow webserver -p 8080”
a. start web server, default port is 8080
5. “airflow scheduler”
a. start scheduler (also starts executor processes)
6. visit localhost:8080 in browser and enable the example DAGs
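The steps above can be run as the following shell session. These are the Airflow 1.x-era commands used in this deck; in Airflow 2.x, `airflow initdb` became `airflow db init`:

```shell
# Install Airflow from PyPI; AIRFLOW_HOME defaults to ~/airflow
pip install apache-airflow

# Initialize the metadata database (SQLite by default)
airflow initdb

# Start the web server on the default port
airflow webserver -p 8080

# In a separate terminal, start the scheduler (also starts executor processes)
airflow scheduler

# Then visit http://localhost:8080 and enable the example DAGs
```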
Deeper Understanding
1. Connect to database (using Datagrip or DBeaver)
and view tables. See how the data is altered as
workflows execute and changes are made
2. Dig into the source code (https://github.com/apache/airflow) and view actions triggered by scheduler CLI commands
Why Spark?
Comparison of common open source big data processing systems (Spark, Flink, Storm), evaluated on: streaming, batch/interactive/iterative processing, general-purpose usability, scalability, product maturity, and community support.
Apache Spark Architecture
What is Livy? Why do we need it?
Why Livy?
Comparison of common open source Spark interfaces (Livy, spark-jobserver, Mist, Apache Toree), evaluated on: streaming jobs, batch jobs, general-purpose usability, high availability, support for major languages (Scala/Java/Python), and dependency handling (no code changes required).
A REST Service for Spark Jobs
○ Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface.
○ It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, and Spark context management, all via a simple REST interface or an RPC client library.
Sample Request
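As a hedged sketch of such a request, a batch job submission to Livy's `POST /batches` endpoint could be built like this. The bucket path, job arguments, and endpoint host are assumptions, not values from this deck:

```python
import json

# Livy listens on port 8998 by default; on this architecture it would run on
# the EMR master node. The host here is a placeholder.
LIVY_URL = "http://localhost:8998"

def build_batch_payload(app_file, class_name=None, args=None):
    """Build the JSON body for Livy's POST /batches endpoint.

    app_file:   path to the job artifact (.py file or .jar), e.g. on S3/HDFS
    class_name: main class, required for JVM (jar) jobs
    args:       command-line arguments passed to the job
    """
    payload = {"file": app_file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = args
    return payload

# Hypothetical PySpark job location and arguments.
payload = build_batch_payload(
    "s3://my-bucket/jobs/etl_job.py",
    args=["--date", "2020-02-12"],
)
print(json.dumps(payload))

# To actually submit (requires the `requests` package and a running Livy server):
# import requests
# resp = requests.post(f"{LIVY_URL}/batches", json=payload,
#                      headers={"Content-Type": "application/json"})
# batch = resp.json()  # contains the batch "id" and "state" to poll
```

Airflow tasks can then poll `GET /batches/{id}` until the job reaches a terminal state, which is how the orchestration layer tracks Spark job completion.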
Summary of Components
• Apache Airflow
• Workflow management: schedules, monitors, and triggers workflows
• Characteristics: dynamic, dependency enabled, open-source
• Apache Livy
• REST interface to interact with Apache Spark
• Apache Spark
• Big-data processing: platform to execute large-scale data processing
• Characteristics: parallel jobs, scalable, open-source
Current Solution
Failure Resiliency
● Current weakness
○ The current solution lacks resiliency in Airflow (single EC2 instance)
○ Solution: containerize Airflow, deploy it on a pod with a separate worker pod, and distribute tasks using an external queue
● Livy
○ Supports session recovery using ZooKeeper: it reconnects to the existing session even if Livy fails while executing a job
● Spark
○ Failed tasks can be re-launched in parallel on the other nodes in the cluster, distributing recomputation across many nodes and recovering from failures quickly
Thank you!
rikin.tanna@capitalone.com
https://www.linkedin.com/in/rikin-tanna/