Copyright © 2017 The Nielsen Company (US), LLC. Confidential and proprietary. Do not distribute.
The Big
Web
Theory
Copyright © 2018 The Nielsen Company (US), LLC. Confidential and proprietary. Do not distribute.
AIRFLOW
Who are we?
● Tal Sharon - Software Architect
● Aviel Buskila - DevOps Tech Lead
● Max Peres - Data Engineer
What will you learn today?
• Airflow and how it solved our problems
• How you can deploy Airflow to production
• Best practices for annoying data problems
Nielsen Marketing Cloud
● eXelate - acquired by Nielsen in 2015
● Marketing data cloud service
● Creating targeting profiles
● VERY BIG DATA
● Machine learning
Nielsen Marketing Cloud - main challenge
How many unique users of a certain profile can we
reach?
e.g. campaign for young women who love tech
What we had...
Data pipeline UI
Multiple workflow parameters are missing: duration, successes & failures, ..
And this is how we had to configure it...
Not all configurations are visible
The workflow is hidden
We wanted something else...
● Configuration and workflow visibility
● Monitoring and statistics
● Share common configuration/code between our workflows
● Ability to have only 1 concurrent execution of a workflow
The Alternatives
Tool               | Community       | Main Purpose          | Flow Definition | UI      | Auto Scheduling | Smart Scheduling
Airflow (Apache)   | Very Active     | General Purpose       | Python          | Rich    | V               | V
Luigi (Spotify)    | Active          | General Purpose       | Python          | Limited | X               | X
Oozie (Apache)     | Active          | Hadoop Job Scheduling | XML             | Limited | V               | X
Azkaban (LinkedIn) | Not very active | Hadoop Job Scheduling | Custom DSL      | Rich    | V               | X
What is Airflow?
‘A platform to programmatically author, schedule, and monitor workflows’
Each workflow is described by a DAG (Directed Acyclic Graph), which is built from:
● Operators: determine what actually gets done
● Sensors: operators that wait for a condition (for example, a job finishing) and report success/failure
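To make the "directed acyclic graph" idea concrete, here is a minimal sketch that uses only the Python standard library. It is not Airflow code: the task names are hypothetical, and the dictionary simply maps each task to the tasks it depends on. It shows that a valid DAG always has a run order, and that a cycle is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks (not from the talk); each key maps to the tasks it depends on.
dependencies = {
    "create_cluster": set(),
    "submit_step": {"create_cluster"},
    "wait_for_step": {"submit_step"},   # plays the role of a sensor
    "publish_report": {"wait_for_step"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


def is_valid_dag(deps):
    """True when the dependency graph has no cycles."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


order = execution_order(dependencies)
print(order)
```

Because every task here has exactly one upstream task, the order is fully determined; with branching dependencies several orders can be valid.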
Airflow overview
Webserver
Accepts HTTP requests and lets the user interact with Airflow. It provides the ability to act on the DAG status (pause, unpause, trigger).
Scheduler
Monitors the DAGs and periodically inspects tasks to see if they can be triggered.
Worker
Daemons that actually execute the logic of tasks.
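The scheduler's job of inspecting tasks "to see if they can be triggered" can be sketched as a single pass over task states. This is a toy model under stated assumptions, not Airflow's scheduler: the task names, the state strings, and the `ready_tasks` helper are all hypothetical.

```python
def ready_tasks(upstream, state):
    """Return pending tasks whose upstream tasks have all succeeded, sorted by name."""
    return sorted(
        task
        for task, parents in upstream.items()
        if state[task] == "pending" and all(state[p] == "success" for p in parents)
    )


# Hypothetical three-step pipeline: each task lists its upstream tasks.
upstream = {"extract": [], "transform": ["extract"], "load": ["transform"]}
state = {"extract": "success", "transform": "pending", "load": "pending"}

first_pass = ready_tasks(upstream, state)

# Pretend a worker finished "transform", then inspect again.
state["transform"] = "success"
second_pass = ready_tasks(upstream, state)

print(first_pass, second_pass)
```

A real scheduler repeats this kind of inspection periodically and hands the ready tasks to workers.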
Airflow UI
Taking it to Production
Challenges
1. Code dependency management
2. Automated deployment from dev to prod
3. Long running tasks infrastructure
4. Scaling Airflow workers
5. Deploy to Kubernetes
Code Dependency management - DAG Structure
DAG Dependency
DAG Dependency
DAG Helpers
DAG File
Code Dependency management - DAG Build
Automated deployment from dev to prod
Package artifact with a version and push
Deploy a DAG file with a specific version
Function as a service (FaaS)
FaaS is a category of cloud computing services that provides a platform allowing customers to develop, run, and manage application functionalities without the complexity of building and maintaining the infrastructure typically associated with developing and launching an app.
Source: https://en.wikipedia.org/wiki/Function_as_a_service
OpenFaaS - Long running tasks infrastructure
OpenFaaS (Functions as a Service) is a framework for building Serverless
functions with Docker and Kubernetes which has first-class support for
metrics. Any process can be packaged as a function enabling you to consume
a range of web events without repetitive boiler-plate coding.
Airflow and OpenFaaS
Open source donations
Scaling Airflow workers
Airflow has 3 types of executors:
1. Sequential Executor - One task & one worker
2. Local Executor - Parallel tasks & one worker
3. Celery Executor - Parallel tasks & multiple workers
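The difference between running one task at a time and running tasks in parallel can be illustrated with the standard library alone. This sketch does not use Airflow's executor classes: `run_task` and the task names are hypothetical, and a thread pool in one process only approximates the "parallel tasks, one worker" case. Distributing tasks across several worker machines, as the Celery Executor does, is not shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(name):
    """Stand-in for real work: sleep briefly, then report completion."""
    time.sleep(0.1)
    return f"{name} done"


tasks = ["task_a", "task_b", "task_c", "task_d"]


def run_sequential(names):
    """One task at a time, in the order given."""
    return [run_task(name) for name in names]


def run_parallel(names, max_workers=4):
    """Several tasks at once inside a single process; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, names))


sequential_results = run_sequential(tasks)
parallel_results = run_parallel(tasks)
print(sequential_results == parallel_results)
```

Both functions return the same results; the parallel version simply finishes sooner when the tasks spend their time waiting.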
Celery executor setup
Deployment to Kubernetes
So what did we achieve?
✓ Dependency management
✓ Automated deployment from dev to prod
✓ Long running tasks infrastructure
✓ Scalable Airflow workers
✓ Deploy to Kubernetes
Annoying
Problems
Annoying Data Maintenance Problems
✓ Troubleshooting broken pipelines
✓ Rerunning parts of a pipeline
Classical Cron Scheduling
Troubleshooting broken pipelines
Understanding a Pipeline (DAG)
Sub Pipeline (sub-DAG)
Create_EMR_Cluster Create_EMR_Step Step_Sensor
Rerunning a Task
Create_EMR_Cluster Create_EMR_Step Step_Sensor
Failed Task
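Rerunning only the broken part of a pipeline amounts to selecting the failed task plus everything downstream of it, while leaving already successful upstream tasks alone. The sketch below models that selection for the three tasks named on the slide. It is a toy model rather than Airflow code, and which task failed is an assumption made for illustration.

```python
from collections import deque

# Task chain from the slide: each task maps to its direct downstream tasks.
downstream = {
    "Create_EMR_Cluster": ["Create_EMR_Step"],
    "Create_EMR_Step": ["Step_Sensor"],
    "Step_Sensor": [],
}


def tasks_to_rerun(graph, failed_task):
    """Return the failed task followed by every task downstream of it."""
    selected = [failed_task]
    queue = deque([failed_task])
    while queue:
        current = queue.popleft()
        for child in graph[current]:
            if child not in selected:
                selected.append(child)
                queue.append(child)
    return selected


# Assumption for illustration: the middle task is the one that failed.
rerun = tasks_to_rerun(downstream, "Create_EMR_Step")
print(rerun)
```

In this example `Create_EMR_Cluster` is not selected, so its earlier successful run is kept.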
Cool Features
Scaling Out
Message Exchange (Xcom)
Allows sharing data between tasks
(example: configs)
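The idea behind this kind of message exchange can be sketched as a small key-value store shared between tasks. This is a toy model and not Airflow's XCom implementation: `MessageStore`, the task functions, and the `region` value are all hypothetical. In Airflow itself, the task instance exposes `xcom_push` and `xcom_pull` for this purpose.

```python
class MessageStore:
    """Toy stand-in for a cross-task message store, keyed by (task_id, key)."""

    def __init__(self):
        self._values = {}

    def push(self, task_id, key, value):
        self._values[(task_id, key)] = value

    def pull(self, task_id, key, default=None):
        return self._values.get((task_id, key), default)


def load_config(store):
    """Upstream task: publishes a small config value for later tasks."""
    store.push("load_config", "region", "us")


def run_job(store):
    """Downstream task: reads the value published by load_config."""
    region = store.pull("load_config", "region")
    return f"running job for region {region}"


store = MessageStore()
load_config(store)
message = run_job(store)
print(message)
```

This pattern suits small values such as configs; large datasets are usually passed by reference, for example as a file location.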
Sample DAG
Summary
● Maintenance becomes a BREEZE...
● Recovery is a NO-BRAINER!
● Lots of cool & convenient features
● Amazing tool for data pipelines
● Open source community
● Cool UI & API
AIRFLOW
Makes your life EASIER!!!
Questions?
Come and join us!
https://www.comeet.co/jobs/nielsen/33.000
Thank you!
BACKUP SLIDES
{
"Data-In-Dpu_us": {
"bucket": "xl8emr-assets",
"file": "data-in/config/dpu_airflow.json"
},
"Data-In-Dpu_eu": {
"bucket": "xl8emr-assets-eu",
"file": "data-in/config/dpu_airflow.json"
},
"Data-In-Dpu_ap": {
"bucket": "xl8emr-assets-eu",
"file": "data-in/config/dpu_airflow.json"
}
}
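A mapping like the one above can be read with a few lines of Python to find where each workflow's configuration lives. This is a sketch under stated assumptions: the `config_location` helper is hypothetical, and treating `bucket` and `file` as an S3 bucket and object key (hence the `s3://` prefix) is an assumption, not something the slide states. The JSON is copied from the slide unchanged.

```python
import json

# Copied verbatim from the slide above.
CONFIG_JSON = """
{
  "Data-In-Dpu_us": {
    "bucket": "xl8emr-assets",
    "file": "data-in/config/dpu_airflow.json"
  },
  "Data-In-Dpu_eu": {
    "bucket": "xl8emr-assets-eu",
    "file": "data-in/config/dpu_airflow.json"
  },
  "Data-In-Dpu_ap": {
    "bucket": "xl8emr-assets-eu",
    "file": "data-in/config/dpu_airflow.json"
  }
}
"""


def config_location(config, workflow_name):
    """Build a location string for one workflow (assumes S3-style bucket/key)."""
    entry = config[workflow_name]
    return f"s3://{entry['bucket']}/{entry['file']}"


config = json.loads(CONFIG_JSON)
for name in config:
    print(name, "->", config_location(config, name))
```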

From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Marketing Cloud