July 2020
Tao Feng | @feng-tao | Engineer, Lyft Data Platform
Blog: go.lyft.com/airflowblog
Airflow @ Lyft
Who
● Engineer at Lyft Data Platform and Tools
● Apache Airflow PMC and Committer
● Working on different data products (Airflow, Amundsen, etc.)
● Previously at LinkedIn, Oracle
Agenda
• Data Platform @ Lyft
• Airflow Customization @ Lyft
• Current Focus For Airflow @ Lyft
• Summary
Data Platform @ Lyft
About Lyft
MISSION: Improve people's lives with the world's best transportation
Lyft’s data analytics platform architecture
[Architecture diagram; labels: Backend Services; Mobile app; PubSub; Events; Batch ETL; Presto, Hive Client, and BI Tools]
Airflow main use cases @ Lyft
Airflow usage @ Lyft
● Two Clusters
● Celery Executors
Airflow Customization @ Lyft
Airflow customization @ Lyft
• UI auditing
• DAG dependency graph
Airflow customization @ Lyft
• Extra links for the task instance UI panel:
‒ Hive query log
‒ Dr. Elephant report for performance tuning
‒ Hive job analysis dashboard
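As a rough illustration of the idea, the sketch below builds one URL per link name from a task instance's identity. It uses only the standard library; the base URLs, the `LINK_TEMPLATES` mapping, and the `extra_links` function are hypothetical, and the slides do not say how Lyft wired these links into the UI. Airflow's operator extra links feature is the usual hook for this kind of customization.

```python
from datetime import datetime
from urllib.parse import urlencode

# Hypothetical base URLs: the slides do not give the real endpoints.
LINK_TEMPLATES = {
    "Hive query log": "https://hive-logs.example.com/query",
    "Dr. Elephant report": "https://dr-elephant.example.com/search",
    "Hive job analysis": "https://dashboards.example.com/hive-jobs",
}


def extra_links(dag_id: str, task_id: str, execution_date: datetime) -> dict:
    """Return {link name: URL} for a single task instance."""
    # The same identifying parameters are appended to every base URL.
    params = urlencode({
        "dag_id": dag_id,
        "task_id": task_id,
        "execution_date": execution_date.isoformat(),
    })
    return {name: f"{base}?{params}" for name, base in LINK_TEMPLATES.items()}
```

Calling `extra_links("etl", "load", datetime(2020, 7, 1))` yields three URLs, one per entry in the mapping, each carrying the DAG id, task id, and execution date as query parameters.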
Airflow customization @ Lyft
• Amundsen is an open-source data discovery portal.
• It is integrated with Airflow to show the task and table lineage.
• It is currently used by 18+ companies.
Current Focus For Airflow @ Lyft
ETL Expiration System
• Many ETLs are poorly maintained and have no clear owner.
• Built an ETL expiration system to:
‒ Disable DAGs with expired TTLs (the DAG owner needs to renew the TTL every six months).
‒ Disable DAGs that produce unused datasets.
‒ Disable DAGs that have been failing for a long time.
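The three rules can be sketched as a small policy check. This is a minimal illustration under stated assumptions, not Lyft's implementation: the `DagRecord` fields, the 182-day approximation of "six months", the 30-day reading of "a long time", and the dataset read counter are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional

# Assumed thresholds: the slides only say "every six months" and "a long time".
TTL = timedelta(days=182)
MAX_FAILING = timedelta(days=30)


@dataclass
class DagRecord:
    dag_id: str
    ttl_renewed_on: date           # last time the owner renewed the TTL
    dataset_reads: int             # assumed usage signal for the produced datasets
    failing_since: Optional[date]  # None when the DAG is currently healthy


def expiration_reasons(dag: DagRecord, today: date) -> List[str]:
    """Return the rules a DAG violates; an empty list means keep it enabled."""
    reasons = []
    if today - dag.ttl_renewed_on > TTL:
        reasons.append("ttl_expired")
    if dag.dataset_reads == 0:
        reasons.append("unused_dataset")
    if dag.failing_since is not None and today - dag.failing_since > MAX_FAILING:
        reasons.append("long_failing")
    return reasons


def dags_to_disable(dags: List[DagRecord], today: date) -> Dict[str, List[str]]:
    """Map each DAG that should be disabled to the reasons for disabling it."""
    result = {}
    for dag in dags:
        reasons = expiration_reasons(dag, today)
        if reasons:
            result[dag.dag_id] = reasons
    return result
```

Returning the reasons rather than a bare boolean makes it possible to tell the owner why a DAG was disabled, which matters when the underlying problem is unclear ownership.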
PY2 -> PY3
• Built a dashboard to understand PY3 issues.
‒ Most issues are related to string encoding or comparisons between strings and integers.
• DAG loading time is higher in PY3 than in PY2.
‒ Cherry-picked a few performance improvement patches from upstream.
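The two issue classes named above can be reproduced in a few lines of Python 3. The snippet is a generic illustration, not code from Lyft's DAGs; the variable names and the `safe_less_than` helper are made up for the example.

```python
# Python 3 behaviour; under Python 2 both comparisons below behaved differently.

# 1. String encoding: bytes and str are distinct types in Python 3.
raw = b"ride_id"                       # e.g. bytes read from a file or socket
text = raw.decode("utf-8")             # an explicit decode is required
bytes_equals_str = (raw == "ride_id")  # False: bytes never equal str

# 2. String/integer comparison: ordering str against int raises TypeError
#    in Python 3, whereas Python 2 allowed it and returned an arbitrary order.
try:
    "10" < 5
    mixed_comparison_error = None
except TypeError as exc:
    mixed_comparison_error = type(exc).__name__


def safe_less_than(value, threshold: int) -> bool:
    """Coerce a possibly-string value to int before comparing."""
    return int(value) < threshold
```

Fixes of this kind are mechanical once located: decode at the boundary where bytes enter the program, and convert to a common type before comparing.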
Airflow Upgrade
• Leverage new features:
‒ DAG serialization
‒ RBAC
‒ Data Lineage
‒ Performance Improvements
• Current status:
‒ Built a new multi-tenant cluster to onboard new use cases.
‒ Finishing PY3 upgrade for legacy DAGs.
‒ Converting the existing legacy mono DAG repo into another tenant on the new cluster.
Summary
• Covered the Lyft data platform in general
• Discussed Airflow customization at Lyft
• Discussed current Airflow work at Lyft
Acknowledgement
• Members who maintain Airflow at Lyft
‒ Andrew Stahlman
‒ Bhanu Renukuntla
‒ Chao-han Tsai (committer)
‒ Jinhyuk Chang
‒ Junda Yang
‒ Max Payton
‒ Sherry Zhao
‒ Shenghu Yang (EM)
‒ Tao Feng (committer)
• Thanks to Maxime for his guidance
Tao Feng | @feng-tao
Blog at go.lyft.com/airflowblog

Airflow at lyft for Airflow summit 2020 conference
