Derek Gorthy
Senior Software Development Engineer
Yuan Feng
Software Development Engineer
Empowering Zillow’s Developers with Self-Service ETL
Who We Are
Zillow Offers Data Engineering Team @ Zillow
Derek Gorthy, Senior Software Development Engineer, Big Data
Yuan Feng, Software Development Engineer, Big Data
Agenda
● How We Think About Self-Service ETL
● Core Components
● Self-Service ETL in Action at Zillow
○ Zetlas
○ Zagger
● Next Steps and Takeaways
Zillow
About Zillow
● Reimagining real estate to make it easier to unlock life’s next chapter
● Offer customers an on-demand experience for selling, buying, renting and financing with transparency and nearly seamless end-to-end service
● Most-visited real estate website in the United States*
* As of Q4-2020
How We Think About Self-Service ETL
Zagger Integrations
[Diagram: Zagger Pipeline Utilities Package and Zagger Managed Service. Labels: User Interaction, Integrations, Execution, Zetlas, DQ Module, API, Parser 1 ... Parser N, Airflow Renderer ... Kafka Renderer]
What Is Self-Service ETL?
[Diagram: User Interaction, Configuration File, an unspecified step marked "?", Pipeline]
How We Think About Self-Service ETL
[Diagram: User Interaction, Interpret, Pipeline Metadata, Render, Pipeline, with a Configuration File; stages are labeled on a scale from Opinionated to Unopinionated]
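The stages named on this slide can be illustrated with a small Python sketch. This is not Zillow's implementation: the config layout, the `Task` and `PipelineMetadata` classes, and the `interpret` and `render` functions are hypothetical names chosen for the example. It also assumes that the interpreter is the opinionated, user-facing step and that the pipeline metadata is a tool-agnostic intermediate form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One step of a pipeline (hypothetical structure)."""
    name: str
    sql: str
    depends_on: tuple = ()


@dataclass(frozen=True)
class PipelineMetadata:
    """Tool-agnostic description of a pipeline (assumed intermediate form)."""
    name: str
    schedule: str
    tasks: tuple


def interpret(config: dict) -> PipelineMetadata:
    """Turn one specific configuration layout into pipeline metadata."""
    tasks = tuple(
        Task(
            name=t["name"],
            sql=t["sql"],
            depends_on=tuple(t.get("depends_on", [])),
        )
        for t in config["tasks"]
    )
    return PipelineMetadata(
        name=config["name"],
        schedule=config.get("schedule", "@daily"),
        tasks=tasks,
    )


def render(metadata: PipelineMetadata) -> str:
    """Render metadata as a plain-text plan (stand-in for a real renderer)."""
    lines = [f"pipeline {metadata.name} [{metadata.schedule}]"]
    for task in metadata.tasks:
        upstream = ", ".join(task.depends_on) or "none"
        lines.append(f"  task {task.name} (after: {upstream})")
    return "\n".join(lines)


# Hypothetical configuration file contents, already loaded into a dict.
config = {
    "name": "daily_sales",
    "tasks": [
        {"name": "extract", "sql": "SELECT * FROM raw.sales"},
        {
            "name": "aggregate",
            "sql": "SELECT region, SUM(amount) FROM staged.sales GROUP BY region",
            "depends_on": ["extract"],
        },
    ],
}

plan = render(interpret(config))
print(plan)
```

Because `render` only sees `PipelineMetadata`, a different interpreter (for example, one fed by a UI form instead of a file) could reuse the same renderer unchanged, which is the separation the slide appears to describe.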
Core Components
User Interaction
Interpret User Input
Pipeline Metadata
Render Pipeline
Data Pipeline & Shared Integrations
Self-Service ETL in Action at Zillow
Applied Self-Service ETL - Zetlas
Motivation
● Modernized and reliable self-service tool to automate SQL based workflows
● No coding experience needed to create ETL workflows
Features
● UI-driven
● Rapid prototyping and deployment
● Job monitoring/alerting
● Automated validation
● Integration with multiple internal services
● Scalable and expandable
Target Users
● Data scientists
● Data analysts
Zetlas UX Design
Applied Self-Service ETL - Zagger
Motivation
● Provide a developer-friendly abstraction from ETL tools
● Create a service that automates data engineering ancillary tasks
● Create common processing patterns for fast pipeline development
Features
● Integrates with Terraform
● Exposes create/delete endpoints for other access patterns
● Allows for custom interpreter creation
● Integration with multiple internal services
Target Users
● Data engineers
● Data producer teams
Zagger Integrations
[Diagram: Zagger Pipeline Utilities Package and Zagger Managed Service. Labels: User Interaction, Integrations, Execution, Zetlas, DQ Module, API, Parser 1 ... Parser N, Airflow Renderer ... Kafka Renderer]
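The diagram's "Parser 1 ... Parser N" and "Airflow Renderer ... Kafka Renderer" labels suggest a pluggable design in which any parser can be paired with any renderer. The Python sketch below shows one way such a registry could look. It is an assumption-laden illustration, not Zagger's code: the `PARSERS` and `RENDERERS` registries, the decorator names, and the `build` function are hypothetical, and the renderers return placeholder strings rather than calling any real Airflow or Kafka API.

```python
import json
from typing import Callable, Dict

# Hypothetical registries: input format -> parser, target tool -> renderer.
PARSERS: Dict[str, Callable[[str], dict]] = {}
RENDERERS: Dict[str, Callable[[dict], str]] = {}


def parser(fmt: str):
    """Register a function that turns raw user input into metadata."""
    def register(func):
        PARSERS[fmt] = func
        return func
    return register


def renderer(target: str):
    """Register a function that turns metadata into a tool-specific artifact."""
    def register(func):
        RENDERERS[target] = func
        return func
    return register


@parser("json")
def parse_json(text: str) -> dict:
    return json.loads(text)


@parser("keyvalue")
def parse_key_value(text: str) -> dict:
    pairs = (line.split("=", 1) for line in text.splitlines() if line.strip())
    return {key.strip(): value.strip() for key, value in pairs}


@renderer("airflow")
def render_airflow(meta: dict) -> str:
    # Placeholder output only; a real renderer would emit a DAG definition.
    return f"airflow dag_id={meta['name']} source={meta['source']}"


@renderer("kafka")
def render_kafka(meta: dict) -> str:
    # Placeholder output only; a real renderer would emit consumer/producer config.
    return f"kafka topic={meta['name']} source={meta['source']}"


def build(text: str, fmt: str, target: str) -> str:
    """Pick a parser and a renderer by name and chain them."""
    return RENDERERS[target](PARSERS[fmt](text))


print(build('{"name": "orders", "source": "raw.orders"}', "json", "airflow"))
print(build("name=clicks\nsource=web.events", "keyvalue", "kafka"))
```

Adding a new input format or execution target then means registering one more function, without touching the existing parsers or renderers.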
Next Steps and Takeaways
Development Timeline (2019, 2020, 2021)
● Pipeler shared Spark processing library development
● Zetlas official launch in Zillow
● Zagger Managed Service and Pipeline Utilities Package library
● User Growth for Zagger and Zetlas
● ZETL retirement
● Zetlas and Zagger backend unification
Takeaways
● UI must be designed to meet the needs of its users
● Self-service ETL isn’t just for non-data engineers
● Modular platform design allows capabilities to be developed piecemeal
● Abstraction from tool-specific implementation gives flexibility
More From Zillow
Democratizing Data Quality Through a Centralized Platform
5/27 @ 3:15 PM PST
Scaling AutoML-Driven Anomaly Detection With Luminaire
5/27 @ 5:00 PM PST
Questions?
Thank you!
https://www.zillow.com/careers/
