Data testing
Gleb Mezhanskiy
CEO & Co-founder @ Datafold
Ex-Lyft Data PM
Agenda
1. Principles of data testing
2. Embedding testing in production & development workflows
3. Types of tests, pros/cons & tools
3 principles of effective data testing
1. Embed testing in existing workflows
2. Automate everything
3. Cut the noise
Data testing in production
Goal: catch issues as early and upstream as possible
Run ETL batch -> Run tests -> Tests pass?
YES -> Publish new data
NO -> Notify owners -> Investigate / Fix
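The production flow above can be sketched as a small gate around the publish step. This is a minimal illustration, not the API of any particular orchestrator: `run_batch_with_gate`, `no_null_user_ids`, and the `publish` / `notify` callbacks are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# A test takes the staged rows and returns (passed, message).
Test = Callable[[List[Row]], Tuple[bool, str]]


def run_batch_with_gate(
    batch: List[Row],
    tests: List[Test],
    publish: Callable[[List[Row]], None],
    notify: Callable[[List[str]], None],
) -> bool:
    """Run every test on the staged batch and publish only if all pass."""
    failures = []
    for test in tests:
        passed, message = test(batch)
        if not passed:
            failures.append(message)
    if failures:
        notify(failures)  # NO branch: owners investigate, data stays unpublished
        return False
    publish(batch)        # YES branch: new data becomes visible downstream
    return True


def no_null_user_ids(rows: List[Row]) -> Tuple[bool, str]:
    """Hypothetical test: every row must carry a user_id."""
    bad = sum(1 for r in rows if r.get("user_id") is None)
    return bad == 0, f"{bad} rows with null user_id"


# In-memory stand-ins for a real publish target and alert channel.
published: List[Row] = []
alerts: List[str] = []

good_batch = [{"user_id": 1}, {"user_id": 2}]
bad_batch = [{"user_id": 3}, {"user_id": None}]

first_ok = run_batch_with_gate(good_batch, [no_null_user_ids], published.extend, alerts.extend)
second_ok = run_batch_with_gate(bad_batch, [no_null_user_ids], published.extend, alerts.extend)
```

The point of the design is that tests run against staged data, so a failing batch never reaches consumers.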
What to test in production
Assertions: checking hard rules
Metric monitoring: detecting anomalies in metrics
When to use assertions
Value-level checks
> assert `email` is of xxx@yyy.com format
Integrity checks
> assert `user_id` is unique and not-null
Balance checks
> assert SUM(source.revenue) = SUM(target.revenue)
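As a rough illustration, the three kinds of assertion on this slide could be written in plain Python as follows. The function names, the simplified email pattern, and the `tolerance` parameter are assumptions made for this sketch; in practice such checks would usually live in a tool like those listed on the next slide.

```python
import re

# Deliberately loose pattern for "xxx@yyy.com"-shaped values (an assumption,
# not a full RFC 5322 validator).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_email_format(rows, column="email"):
    """Value-level check: return the values that do not look like an email."""
    return [r[column] for r in rows if not EMAIL_RE.match(r[column] or "")]


def check_unique_not_null(rows, column="user_id"):
    """Integrity check: return (null_count, sorted list of duplicated values)."""
    nulls = 0
    seen, dupes = set(), set()
    for r in rows:
        value = r[column]
        if value is None:
            nulls += 1
        elif value in seen:
            dupes.add(value)
        else:
            seen.add(value)
    return nulls, sorted(dupes)


def check_balance(source_rows, target_rows, column="revenue", tolerance=0.01):
    """Balance check: SUM(source) should equal SUM(target).

    The tolerance is an addition for floating-point sums; the slide states
    strict equality.
    """
    source_total = sum(r[column] for r in source_rows)
    target_total = sum(r[column] for r in target_rows)
    return abs(source_total - target_total) <= tolerance


source = [
    {"user_id": 1, "email": "a@example.com", "revenue": 10.0},
    {"user_id": 2, "email": "not-an-email", "revenue": 5.5},
    {"user_id": 2, "email": "c@example.com", "revenue": 4.5},
]
target = [{"revenue": 20.0}]

bad_emails = check_email_format(source)        # one malformed value
integrity = check_unique_not_null(source)      # user_id 2 appears twice
balanced = check_balance(source, target)       # totals match
```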
Tools for running assertions
Embedded in ETL tools
> dbt for SQL
> Dagster for general ETL
Standalone
> great_expectations for SQL
> deequ for Spark
When to use metric monitoring
Hard rules don’t work for metrics because of natural variance, trend and seasonality!
Answer: apply a bit of ML!
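To make the seasonality point concrete, here is a deliberately simple stand-in for a forecasting model: it compares a new daily value with earlier values from the same day of the week, using the median and the median absolute deviation (MAD). It is not what Prophet or Datafold do internally, and it ignores trend; `is_anomaly`, `period`, and `threshold` are hypothetical names and defaults chosen for this sketch.

```python
from statistics import median


def is_anomaly(history, value, period=7, threshold=4.0):
    """Flag `value` if it is far from past values at the same seasonal position.

    history: past observations, oldest first, one per day.
    value:   the new observation that would follow history[-1].
    """
    # Values exactly 1, 2, 3, ... periods before the new observation.
    peers = history[-period::-period]
    if len(peers) < 3:
        return False  # not enough seasonal history to judge
    center = median(peers)
    mad = median([abs(p - center) for p in peers])
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    if scale == 0:
        return value != center
    return abs(value - center) / scale > threshold


# Four weeks of a metric that is ~100 on weekdays and ~40 on weekends.
history = (
    [100, 102, 98, 101, 99, 40, 42]
    + [103, 100, 99, 104, 97, 41, 39]
    + [98, 101, 102, 100, 101, 38, 43]
    + [101, 99, 100, 102, 98, 42, 40]
)

normal_monday = is_anomaly(history, 103)  # close to previous Mondays
broken_monday = is_anomaly(history, 40)   # normal for a weekend, not a Monday
```

A hard rule such as "value must be between 30 and 110" would accept 40 on a Monday, because that value is perfectly normal on weekends. The seasonal baseline flags it.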
Tools for metric monitoring
> Prophet by Facebook
> Datafold Alerting
Data testing in development
Goal: do no harm, prevent breaking things that work
Build & Backfill -> Run tests -> Tests pass?
YES -> Code review -> Deploy
NO -> Investigate / Fix
How to test in development
Assertions: checking hard rules, just like in production!
Data diff: visualizing changes in data
Data diff = git diff for data
Compares values between Production and Development
...and distributions
Remember: automate!
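The idea of a data diff can be sketched in a few lines of Python: compare the production and development versions of a table row by row on a primary key, then compare simple distribution statistics per column. This is an illustrative toy, not how the diff tools on the next slide are implemented; `diff_tables`, `summarize`, and the `id` key are assumptions.

```python
from statistics import mean


def diff_tables(prod, dev, key="id"):
    """Value-level diff: which keys were added, removed, or changed."""
    prod_by_key = {r[key]: r for r in prod}
    dev_by_key = {r[key]: r for r in dev}
    added = sorted(dev_by_key.keys() - prod_by_key.keys())
    removed = sorted(prod_by_key.keys() - dev_by_key.keys())
    changed = {}
    for k in prod_by_key.keys() & dev_by_key.keys():
        p, d = prod_by_key[k], dev_by_key[k]
        # Map each differing column to its (production, development) values.
        cols = {c: (p.get(c), d.get(c)) for c in p.keys() | d.keys() if p.get(c) != d.get(c)}
        if cols:
            changed[k] = cols
    return {"added": added, "removed": removed, "changed": changed}


def summarize(rows, column):
    """Distribution-level view of one numeric column."""
    values = [r[column] for r in rows if r[column] is not None]
    return {"count": len(values), "min": min(values), "max": max(values), "mean": mean(values)}


prod = [{"id": 1, "revenue": 10.0}, {"id": 2, "revenue": 20.0}, {"id": 3, "revenue": 30.0}]
dev = [{"id": 1, "revenue": 10.0}, {"id": 2, "revenue": 25.0}, {"id": 4, "revenue": 5.0}]

row_diff = diff_tables(prod, dev)
prod_stats = summarize(prod, "revenue")
dev_stats = summarize(dev, "revenue")
```

Running a comparison like this on every commit or PR, rather than by hand, is what the "automate" reminder is about.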
Diff tools
> dbt-audit-helper
> BigDiffy by Spotify
> Datafold Diff
Bottom line
                                 Development             Production
What changes in between tests    Source code             Data
Goal                             Prevent regressions     Learn about issues ASAP
Frequency                        On every commit / PR    On every new data batch
Trigger                          GitHub/GitLab + CI      ETL orchestrators
Methods                          Assertions, Data diff   Assertions, Metric monitoring
