Learning and Development
2019
Apache Airflow
For Data Engineering
- Sampath Kumar, Principal Engineer
Agenda
The goal of this session is to provide an overview of Airflow's capabilities and how to use it
for data engineering.
● Introduction (DE, Airflow)
● Key Concepts (Key Words, Demo)
● Architecture (System Design)
● Features (Challenges, Logging, Analytics)
● Challenges & Recommendations
● Q & A
DATA ENGINEERING
Data engineering is the aspect of data science that focuses on practical applications of data
collection and analysis. Data engineers are tasked with
● Designing, building, testing, integrating, managing, and optimizing data from a variety of sources
● Building the infrastructure and architecture that enable data generation
● Building free-flowing data pipelines, their primary focus, by combining a variety of big data technologies that enable real-time analytics
● Writing complex queries to ensure that data is easily accessible
AIRFLOW
● Open-source workflow automation and scheduling system that can be used to author and manage
your data pipelines.
● It started at Airbnb in October 2014 as a solution to manage the company's increasingly complex
workflows.
● License: Apache License 2.0
● Written in: Python
● Operating system: Linux, macOS (on Microsoft Windows, typically via WSL or Docker)
● Stable release at the time of this session: 1.10.5 (August 30, 2019)
CRON
Cron: the name derives from the Greek word chronos (time). It is a utility for Unix-like
systems that schedules jobs based on time.
Airflow is cron on steroids: it allows you to schedule tasks to run, run them in a
particular order, and monitor / manage all of your tasks.
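To make the comparison concrete, here is a minimal sketch of the only thing cron itself decides: whether a five-field expression matches a given minute. The helper names `field_matches` and `cron_matches` are hypothetical, and the sketch assumes a reduced syntax (`*`, single numbers, comma lists and ranges, with 0 meaning Sunday); step values such as `*/15` are left out. Ordering between jobs and monitoring, which Airflow adds, have no place in this model.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Return True if one cron field ('*', '5', '1,3,5' or '1-5') accepts value."""
    for part in field.split(","):
        if part == "*":
            return True
        if "-" in part:
            low, high = part.split("-")
            if int(low) <= value <= int(high):
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if the five-field cron expression fires at the given minute."""
    minute, hour, day_of_month, month, day_of_week = expression.split()
    # Python counts Monday as 0; cron counts Sunday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    time_ok = (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(month, moment.month)
    )
    dom_ok = field_matches(day_of_month, moment.day)
    dow_ok = field_matches(day_of_week, cron_weekday)
    # Classic cron rule: if both day fields are restricted, either one may match.
    if day_of_month != "*" and day_of_week != "*":
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok
    return time_ok and day_ok


if __name__ == "__main__":
    # "30 6 * * 1-5" means 06:30 on weekdays; 2 December 2019 was a Monday.
    print(cron_matches("30 6 * * 1-5", datetime(2019, 12, 2, 6, 30)))
```

Airflow accepts cron expressions of this kind as a DAG schedule, and then layers task ordering, retries and monitoring on top.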
KEY CONCEPTS
● DAGs
○ Directed acyclic graphs that represent a workflow of tasks
● Tasks
○ Operators - Bash, Python, SSH, HTTP, MySQL, SparkSubmit, Sensors (S3), Docker, Hive, Slack, ...
● Hooks
○ Interfaces to external services (AWS, GCP, databases, email, ...); their credentials are stored in Connections
● Variables & XCom
○ For sharing global values (Variables) and inter-task communication (XCom)
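The DAG and XCom ideas can be sketched in plain Python without Airflow. This is an illustration of the concepts only, not Airflow's API: the task names, the `run` helper and the `xcom` dictionary are assumptions made for the example. The standard-library `graphlib` module (Python 3.9+) derives a valid execution order from the dependencies and rejects graphs that contain a cycle.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Each key is a task; its value is the set of tasks that must finish first.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Hypothetical task bodies. Each one receives the values earlier tasks
# produced, which is the role XCom plays between Airflow tasks.
callables = {
    "extract": lambda xcom: [3, 1, 2],
    "transform": lambda xcom: sorted(xcom["extract"]),
    "load": lambda xcom: len(xcom["transform"]),
}


def execution_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """A graph with a cycle is not a DAG and cannot be scheduled."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False


def run(deps, tasks):
    """Run every task in dependency order, sharing results through a dict."""
    xcom = {}
    for task_id in execution_order(deps):
        xcom[task_id] = tasks[task_id](xcom)
    return xcom


if __name__ == "__main__":
    print(execution_order(dependencies))
    print(run(dependencies, callables))
```

In Airflow the same structure is declared with operators and dependency arrows between them, and the scheduler decides when each task actually runs.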
Sample - DAG
Diagram: Start, S3Sensor, Build VM, End
Goal: Whenever user files arrive in an AWS S3 bucket, start a high-spec Docker
VM and process the job. Notify the job status before the job starts and after it finishes.
The solution must be cost-efficient and auto-scale when required.
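The sensor step in this workflow follows a polling pattern: check a condition, sleep, and give up after a timeout. The sketch below shows that pattern in plain Python. `wait_for` and `make_fake_bucket` are hypothetical helpers written for this example, and the fake bucket merely pretends that a file shows up on the third check. Airflow sensors expose comparable `poke_interval` and `timeout` settings.

```python
import time


def wait_for(condition, poke_interval=1.0, timeout=10.0):
    """Poll condition() until it returns True or the timeout elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


def make_fake_bucket(appears_on_poke):
    """Build a stand-in for 'is the file in the bucket yet?' (hypothetical)."""
    calls = {"count": 0}

    def file_has_arrived():
        calls["count"] += 1
        return calls["count"] >= appears_on_poke

    return file_has_arrived


if __name__ == "__main__":
    arrived = wait_for(make_fake_bucket(3), poke_interval=0.1, timeout=5.0)
    print("file arrived" if arrived else "timed out")
```

Only once the sensor succeeds would the downstream steps (building the VM, processing the job, sending notifications) be triggered.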
DEMO - UI
Sample - DAG
A data engineering pipeline in which an S3 sensor detects the arrival of an
input file, followed by several validation checks and then a load into Elasticsearch,
which is used for serving clients.
Diagram: Start (S3Sensor), Schema Validation, Inputs Validation, Data Validation, Load to Elasticsearch, End, Failures
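The control flow of this pipeline can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the three checks are placeholder rules, a plain list stands in for the Elasticsearch index, and routing every failed check to a single "failures" step is one reading of the diagram. In Airflow each step would be its own task.

```python
def validate_schema(record):
    """Placeholder rule: the record must carry the expected fields."""
    return {"id", "name"} <= record.keys()


def validate_inputs(record):
    """Placeholder rule: the id must be an integer."""
    return isinstance(record["id"], int)


def validate_data(record):
    """Placeholder rule: the name must not be blank."""
    return bool(str(record["name"]).strip())


CHECKS = [
    ("schema_validation", validate_schema),
    ("inputs_validation", validate_inputs),
    ("data_validation", validate_data),
]


def run_pipeline(record, index):
    """Run the checks in order and return the list of steps visited.

    The first failed check diverts the run to 'failures' and nothing is loaded.
    """
    path = ["start"]
    for name, check in CHECKS:
        path.append(name)
        if not check(record):
            path.append("failures")
            return path
    index.append(record)  # stand-in for the load into the search index
    path += ["load", "end"]
    return path


if __name__ == "__main__":
    index = []
    print(run_pipeline({"id": 1, "name": "report.csv"}, index))
    print(run_pipeline({"id": "x", "name": "report.csv"}, index))
```

Running the schema check first lets the later checks assume the fields exist, which is why the order of the validation steps matters.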
ARCHITECTURE
● MetaDB
● Message Broker
● Airflow Webserver
● Airflow Scheduler
● Airflow Workers
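How these components cooperate can be sketched with standard-library threads. This is a toy model under stated assumptions, not Airflow code: a `queue.Queue` stands in for the message broker, a dictionary for the metadata database, the loop that enqueues tasks for the scheduler, and the function name `run_cluster` is hypothetical. The webserver's role would be to read the same metadata and display it.

```python
import queue
import threading


def run_cluster(task_ids, worker_count=2):
    """Queue tasks, let workers execute them, and return the recorded states."""
    broker = queue.Queue()   # stand-in for the message broker
    metadata = {}            # stand-in for the metadata DB
    lock = threading.Lock()

    def worker():
        while True:
            task_id = broker.get()
            if task_id is None:          # shutdown signal
                broker.task_done()
                return
            with lock:
                metadata[task_id] = "running"
            # The actual work of the task would happen here.
            with lock:
                metadata[task_id] = "success"
            broker.task_done()

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in workers:
        thread.start()

    # Scheduler role: record each task as queued, then hand it to the broker.
    for task_id in task_ids:
        with lock:
            metadata[task_id] = "queued"
        broker.put(task_id)

    broker.join()                        # wait until every task is processed
    for _ in workers:
        broker.put(None)                 # tell each worker to stop
    for thread in workers:
        thread.join()
    return metadata


if __name__ == "__main__":
    print(run_cluster(["sense_file", "validate", "load"], worker_count=2))
```

Adding workers only means starting more consumers of the same queue, which is why the worker tier scales horizontally.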
FEATURES
● Airflow Workers are horizontally scalable
● Distributed task execution - Celery, backed by a message broker such as RabbitMQ or Redis
● Airflow Integrations - GCP, Azure, AWS, Qubole & Databricks
● Hooks, Connections & Pools - environment (dev/test/prod) friendly
● DAG - dynamic sub-DAGs & branching
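The idea behind Pools, capping how many tasks may use a shared resource at once, can be illustrated with a semaphore. This is a plain-Python sketch of the concept rather than Airflow's implementation; `run_with_pool` is a hypothetical name and the short sleep stands in for real work against the limited resource.

```python
import threading
import time


def run_with_pool(task_count, slots):
    """Run task_count tasks with at most `slots` active at once.

    Returns the highest number of tasks observed running simultaneously.
    """
    pool = threading.BoundedSemaphore(slots)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def task():
        with pool:                       # wait for a free slot in the pool
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.01)             # stand-in for work on the resource
            with lock:
                state["active"] -= 1

    threads = [threading.Thread(target=task) for _ in range(task_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state["peak"]


if __name__ == "__main__":
    print("peak concurrency:", run_with_pool(task_count=8, slots=2))
```

Changing the slot count per environment (for example, fewer slots against a small test database) is one way a pool keeps the same DAG usable across dev, test and prod.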
ANALYTICS
● Part of being productive with data is having the right weapons to profile the data you are working with.
● Airflow provides a simple query interface to write SQL and get results quickly, and a charting application letting you visualize data.
LOGGING
● Logs are visible from the UI
CHALLENGES
● Airflow is not Apache NiFi or Apache Spark
○ It is not a data routing or data transformation system.
● Airflow Workers
○ require identical access (network, authentication & authorisation)
○ require similar hardware capabilities
● Airflow Webserver & Scheduler
○ not scalable (workaround - supervisor, Docker health checks)
RECOMMENDATIONS
Airflow is best suited where
● Agility is important
● Portability
● Segregation between compute and workflow management
● Scalability on demand
● Pool/Connection management
● Job Analytics
● Logging visibility
Q&A
Open for discussion.
REFERENCES
● Roles of Data Engineer - https://inlovewithcode.wordpress.com/2019/05/15/roles-of-data-engineer-required-skill/
● System Design - https://inlovewithcode.wordpress.com/system-design/
● Airflow Tutorials - https://airflow-tutorial.readthedocs.io/en/latest/airflow-intro.html
● Airflow Documentation - https://airflow.apache.org/docs/stable/
● Airflow Dockers - https://towardsdatascience.com/getting-started-with-airflow-using-docker-cd8b44dbff98
Thank you
