Using Airflow to speed up development of data-intensive tools
Blaine Elliott
Data Engineer @ One Medical
Twitter: @blainee
Airflow Summit
July 10th, 2020
Intro...
Purpose of this talk?
● To demonstrate how Airflow can help you build new tools
● Inspire others to do the same
Who am I?
● Data Engineer @ One Medical
● Formerly @ LinkedIn, Chegg, MySpace
Proprietary and Confidential - One Medical
What are we going to cover in this talk?
● A tool to detect data anomalies
● The architecture of this tool, and how the tool communicates with Airflow
● How Airflow decreased the cost to develop this tool
Airflow Summit - Summer 2020
Setting up the problem...
● At One Medical, we consume and create a lot of data
● We want to find bad data before it’s passed on to analysts
● We’re lazy engineers
Feature requirements for our Data Anomaly Detector (“DAD Tool”)
● Needs to detect abnormal data
● Can scale to thousands of tables and columns
● Cost to develop the tool is minimized
What is needed to make this work?
● The ability to do statistical analysis
● Storage to persist data & test results
● UI/UX to manage the tool, create tests, & analyze results
● Database interoperability (authentication, communication)
● The ability to run thousands of tests per day
● Must be secure (must pass a security audit)
The Data Anomaly Detector (“DAD Tool”)
Airflow steps…
1. Create dynamic DAGs
2. Tell Airflow to run our DAGs
3. Process the DAGs
4. Send results to the DAD Tool
Airflow Integration (4 steps)
1. Send SQL to Airflow, as a text file on S3
2. Send request to Airflow to process DAGs
3. Process DAGs
4. Send test results to DAD Tool, as a pickled file on S3
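The four steps above can be sketched as a file-based round trip. This is only an illustration, not the DAD Tool's code: a local temporary directory stands in for the S3 bucket, an in-memory SQLite database stands in for the warehouse, and the names `submit_test`, `process_pending`, `load_result`, and the `sql/` and `results/` prefixes are all hypothetical. In the real setup, steps 2 and 3 would be handled by Airflow rather than a plain function call.

```python
import pickle
import sqlite3
import tempfile
from pathlib import Path


def submit_test(store, test_id, sql):
    """Step 1: the tool writes the test's SQL as a text file in the store."""
    path = store / "sql" / f"{test_id}.sql"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(sql)
    return path


def process_pending(store, run_query):
    """Steps 2-4: run every submitted SQL file and pickle its result rows."""
    results_dir = store / "results"
    results_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for sql_file in sorted((store / "sql").glob("*.sql")):
        rows = run_query(sql_file.read_text())
        with open(results_dir / f"{sql_file.stem}.pkl", "wb") as fh:
            pickle.dump({"test_id": sql_file.stem, "rows": rows}, fh)
        processed.append(sql_file.stem)
    return processed


def load_result(store, test_id):
    """The tool reads the pickled result back from the store."""
    with open(store / "results" / f"{test_id}.pkl", "rb") as fh:
        return pickle.load(fh)


def demo():
    # Stand-in database with a few made-up readings.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (systolic_blood_pressure REAL)")
    conn.executemany("INSERT INTO patients VALUES (?)", [(118,), (122,), (131,)])
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)  # stand-in for the S3 bucket
        submit_test(store, "bp_count", "SELECT COUNT(*) FROM patients")
        processed = process_pending(store, lambda sql: conn.execute(sql).fetchall())
        return processed, load_result(store, "bp_count")


if __name__ == "__main__":
    print(demo())
```

Because both sides only read and write files, neither needs to know how the other is implemented; that is the property the S3 hand-off gives the tool.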
Anatomy of a test
1. User defines a test
Ex: all values in a time series must be within X σ of the mean.
2. User applies the test to a column
Ex: using our new test, set the threshold to 3σ and use the table patients with the column systolic_blood_pressure for the most recent 90 days.
3. The DAD Tool + Airflow process all the things
4. User analyzes results in the DAD Tool UI
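The σ test from this slide can be sketched in a few lines of Python. This is an assumption-laden illustration: the talk does not say whether the DAD Tool computes the statistic in SQL or in Python, or whether it uses the population or sample standard deviation. The function name `sigma_outliers` and the readings are made up; the population standard deviation is used here.

```python
from statistics import mean, pstdev


def sigma_outliers(values, threshold=3.0):
    """Return (index, value) pairs more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]


# Hypothetical systolic_blood_pressure readings: 29 typical values and one bad one.
readings = [120] * 29 + [300]
print(sigma_outliers(readings))  # [(29, 300)]
```

Note that the outlier itself inflates the mean and the standard deviation, so on short series a single bad value may not cross a 3σ threshold.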
Requirements Review
● Needs to detect abnormal data
● Can scale to thousands of tables and columns
● Cost to develop the tool is minimized
Conclusions
● The complexity of Airflow is hidden from users
● Using Airflow for part of the backend processing of the DAD Tool significantly decreased development time
● Because Airflow was already actively used at One Medical, desirable features already available in Airflow could be made available to the DAD Tool
● Time that would have been spent rebuilding features Airflow already provides was repurposed to improve the DAD Tool
List of Airflow features that enable the DAD Tool
● No need to manage database authentication
● Databases configured in Airflow are immediately available to the DAD Tool
● Parallelism is managed by Airflow
● Throttling is managed by Airflow
● Since Airflow already passed our security audit, minimal effort was needed to get approved to leverage Airflow in the DAD Tool
Answers to common questions
Q. Why not use XCom?
A. Using S3 (or any other object store) is stateful, fault tolerant, and avoids any limit on how much data is transferred.
Q. Is the DAD Tool open source?
A. Not currently, but I am working towards that goal.
Thank you
Blaine Elliott
Sr Data Engineer @ One Medical
Twitter: @blainee
Airflow Summit
July 10th, 2020
