Confidential
End to End Pipelines Using Apache Spark/Livy/Airflow
An integrated solution to batch data processing
Rikin Tanna and Karunasri Maram
Capital One Auto Finance
February 12, 2020
Agenda
1. Problem Statement
2. Integrated Solution, Brief Overview
3. Explanation of Components
• Apache Spark
• Apache Livy
• Apache Airflow
4. Integrated Solution, Fully Explained
5. Demo
Problem Statement
Need: Batch data processing on a schedule
Solution Requirements
End to End Data Pipeline
• Scalable: handle jobs with growing data sets
• Parallel Execution: ability to run multiple jobs in parallel
• Open-Source Support: active contributions to the components used, to stay efficient
• Dynamic: generation of the pipeline on demand to support varying characteristics
• Dependency Enabled: support ordering of tasks based on dependencies
A fully integrated big data pipeline… with just 3 components!
• Apache Spark
• Unified data analytics engine for large-scale data processing
• Served on EMR cluster
• Apache Livy
• REST Interface to enable easy interaction with Apache Spark
• Served on master node of EMR cluster
• Apache Airflow
• Workflow management system (WMS) to schedule, trigger, and monitor workflows on a single compute resource
• Served on single compute instance, with metadata DB on
separate RDS instance
Solution: Brief
What is Airflow?
An open source platform to programmatically author, schedule, and monitor workflows
Dynamic: Airflow pipelines are configured as code, allowing for dynamic pipeline generation as DAGs
Extensible: easily extend the library and usability by creating your own operators and executors
General-Purpose: Airflow is written in Python, and all pipelines are configured in Python
Accessible: rich UI allows non-technical users to monitor workflows
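The "configured as code" point above can be sketched as a minimal DAG definition. This uses the Airflow 1.10-era API (matching this deck's timeframe); the DAG id, schedule, and task callables are illustrative assumptions, not part of the deck:

```python
# Minimal Airflow DAG sketch (Airflow 1.x API; names and schedule are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    # placeholder for pulling a batch data set
    print("pull batch data")

def load():
    # placeholder for writing results
    print("write results")

dag = DAG(
    dag_id="example_batch_pipeline",
    start_date=datetime(2020, 2, 12),
    schedule_interval="@daily",  # run once per day
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

extract_task >> load_task  # load depends on extract (dependency-enabled ordering)
```

Because the file is plain Python, a loop or config lookup can generate tasks dynamically, which is what "dynamic pipeline generation" refers to.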
Why Airflow?
Comparison of common open source workflow management systems (Airflow, Luigi, Oozie, Azkaban), evaluated on: dynamic pipelines, rich interactive UI, general-purpose usability, scalability, dependency management, and maturity/support.
Apache Airflow Architecture
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
● Metadata Database
○ stores information necessary for scheduler/executor
and webserver
○ task state, DAG definitions, log locations, etc
● Scheduler/Executor
○ process that uses DAG definitions and task states to push
tasks onto queue for execution
● Workers
○ process(es) that execute the logic of the tasks
● Webserver
○ process that renders web UI, interacting with metadata
database to allow user monitoring and interaction with
workflows
How to Get Started with Apache Airflow
1. Install Python
2. “pip install apache-airflow”
a. install from PyPI using pip
b. AIRFLOW_HOME = ~/airflow (default)
3. “airflow initdb”
a. initialize database
4. “airflow webserver -p 8080”
a. start web server, default port is 8080
5. “airflow scheduler”
a. start scheduler (also starts executor processes)
6. visit localhost:8080 in browser and enable the example DAGs
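The steps above can be run as the following shell session. These are the Airflow 1.x-era commands used in this deck; in Airflow 2.x, `airflow initdb` became `airflow db init`:

```shell
# Install Airflow from PyPI; AIRFLOW_HOME defaults to ~/airflow
pip install apache-airflow

# Initialize the metadata database (SQLite by default)
airflow initdb

# Start the web server on the default port
airflow webserver -p 8080

# In a separate terminal, start the scheduler (also starts executor processes)
airflow scheduler

# Then visit http://localhost:8080 and enable the example DAGs
```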
Deeper Understanding
1. Connect to database (using Datagrip or DBeaver)
and view tables. See how the data is altered as
workflows execute and changes are made
2. Dig into the source code (https://github.com/apache/airflow) and view actions triggered by scheduler CLI commands
Why Spark?
Comparison of common open source big data processing systems (Spark, Flink, Storm), evaluated on: streaming, batch/interactive/iterative processing, general-purpose usability, scalability, product maturity, and community support.
Apache Spark Architecture
What is Livy? Why do we need it?
Why Livy?
Comparison of common open source Spark interfaces (Livy, spark-jobserver, Mist, Apache Toree), evaluated on: streaming jobs, batch jobs, general-purpose usability, high availability, support for major languages (Scala/Java/Python), and dependency handling (no code changes required).
A REST Service for Spark Jobs
○ Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface.
○ It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, and Spark context management, all via a simple REST interface or an RPC client library.
Sample Request
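As a hedged sketch of such a request, a batch job submission to Livy's `POST /batches` endpoint could be built like this. The bucket path, job arguments, and endpoint host are assumptions, not values from this deck:

```python
import json

# Livy listens on port 8998 by default; on this architecture it would run on
# the EMR master node. The host here is a placeholder.
LIVY_URL = "http://localhost:8998"

def build_batch_payload(app_file, class_name=None, args=None):
    """Build the JSON body for Livy's POST /batches endpoint.

    app_file:   path to the job artifact (.py file or .jar), e.g. on S3/HDFS
    class_name: main class, required for JVM (jar) jobs
    args:       command-line arguments passed to the job
    """
    payload = {"file": app_file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = args
    return payload

# Hypothetical PySpark job location and arguments.
payload = build_batch_payload(
    "s3://my-bucket/jobs/etl_job.py",
    args=["--date", "2020-02-12"],
)
print(json.dumps(payload))

# To actually submit (requires the `requests` package and a running Livy server):
# import requests
# resp = requests.post(f"{LIVY_URL}/batches", json=payload,
#                      headers={"Content-Type": "application/json"})
# batch = resp.json()  # contains the batch "id" and "state" to poll
```

Airflow tasks can then poll `GET /batches/{id}` until the job reaches a terminal state, which is how the orchestration layer tracks Spark job completion.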
Summary of Components
• Apache Airflow
• Workflow management: schedules, monitors, and triggers workflows
• Characteristics: dynamic, dependency enabled, open-source
• Apache Livy
• REST interface to interact with Apache Spark
• Apache Spark
• Big-data processing: platform to execute large-scale data processing
• Characteristics: parallel jobs, scalable, open-source
Current Solution
Failure Resiliency
● Current weakness
○ The current solution lacks resiliency in Airflow (single EC2 instance)
○ Solution: containerize Airflow, deploy it on a pod with a separate worker pod, and distribute tasks using an external queue
● Livy
○ Supports session recovery using ZooKeeper: it reconnects to the existing session even if Livy fails while executing a job
● Spark
○ Failed tasks can be re-launched in parallel on the other nodes in the cluster, distributing recomputation across many nodes and recovering from failures quickly
Thank you!
rikin.tanna@capitalone.com
https://www.linkedin.com/in/rikin-tanna/