Data Asserts
Defensive Data Science
Tommy Guy
Microsoft
Observation: Complexity In Pipeline
Our pipeline: DATA!!! in; Insight! Direction! Strategy! out.
Our pipeline in reality: bugs tend to compound.
How do Engineers Manage Complexity?
Encapsulate: create functions/classes/subsystems
with clear APIs. This helps isolate complexity
Integration Tests: ensure that the components interact
correctly. This helps identify breaking changes.
Data introduces a few complications
Pipelines take many upstream dependencies
Researcher use cases are frequently unknown and
unanticipated by data providers.
Pushing requirements upstream to all producers is
Sisyphean.
We are not talking about data pipeline tests
The data pipeline teams ask:
Are all rows that are produced stored?
• Counter fields to ensure no dropped rows
• Sentinel events to measure join fidelity
Are availability SLAs being met?
• Progressive server-client merging
Data Scientists Require Semantic Correctness
Does this field mean what I think it does?
How do Data Scientists identify potential
errors?
• Some follow-on fact is absurd: if [potential conclusion], then we must have 3 billion OneDrive users…
• … which leads to investigation: … because my user table doesn’t have a primary key …
• … which finds a broader problem: … so I should aggregate by user.
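The missing-primary-key example can be turned into an explicit check. The sketch below is only an illustration in plain Python: the table, the column names, and the helper `assert_primary_key` are hypothetical and not taken from the talk. It shows how a row count overstates users when the key repeats, and how an assert surfaces that before any conclusion is drawn.

```python
from collections import Counter


def assert_primary_key(rows, key):
    """Raise AssertionError if `key` does not uniquely identify each row."""
    counts = Counter(row[key] for row in rows)
    duplicates = {value: n for value, n in counts.items() if n > 1}
    assert not duplicates, f"{key} is not a primary key; duplicated values: {duplicates}"


# Hypothetical user table in which one user appears twice.
user_table = [
    {"user_id": "u1", "tenant": "t1"},
    {"user_id": "u2", "tenant": "t1"},
    {"user_id": "u1", "tenant": "t1"},
]

# Counting rows overstates the number of users; aggregating by user does not.
naive_user_count = len(user_table)
distinct_user_count = len({row["user_id"] for row in user_table})

try:
    assert_primary_key(user_table, "user_id")
    key_is_unique = True
except AssertionError:
    key_is_unique = False
```

Run at the start of a job, a check like this fails loudly instead of letting an inflated count flow into a later result.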
What are your Assumptions?
If I conclude “Users who upload files to OneDrive are XXX% more likely
to buy Office if they also sent mail through Mobile Outlook”, I’m
making many silent assumptions:
Assumptions by field:
User Id
• Logged and PII-encrypted similarly in Outlook and OneDrive
• Correctly logging timestamp for Office purchase
• User Id isn’t empty or missing
OneDrive activity
• Wasn’t automated traffic [identified by a certain flag].
Email Activity
• Mobile client identifiers are correct.
All
• Any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners.
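Some of these silent assumptions can be written down as checks that run before the analysis. The following is a rough sketch under stated assumptions: the field names `user_id` and `is_automated`, the sample events, and the 0.5 overlap threshold are all invented for illustration. A low overlap between the two sources is only a hint that ids may be logged or encrypted differently, not proof.

```python
def non_automated(events, flag="is_automated"):
    """Drop events whose automation flag is set (hypothetical flag name)."""
    return [e for e in events if not e.get(flag, False)]


def assert_ids_present(events, key="user_id"):
    """Every event must carry a non-empty user id."""
    missing = [e for e in events if not e.get(key)]
    assert not missing, f"{len(missing)} events have an empty or missing {key}"


def id_overlap(left, right, key="user_id"):
    """Fraction of distinct ids in `left` that also appear in `right`."""
    left_ids = {e[key] for e in left}
    right_ids = {e[key] for e in right}
    return len(left_ids & right_ids) / len(left_ids) if left_ids else 0.0


# Hypothetical sample events from the two sources.
onedrive = [
    {"user_id": "a1", "is_automated": False},
    {"user_id": "b2", "is_automated": True},
    {"user_id": "c3", "is_automated": False},
]
outlook_mobile = [{"user_id": "a1"}, {"user_id": "c3"}, {"user_id": "d4"}]

human_uploads = non_automated(onedrive)
assert_ids_present(human_uploads)
assert_ids_present(outlook_mobile)

# The 0.5 threshold is arbitrary; a real one would come from historical data.
overlap = id_overlap(human_uploads, outlook_mobile)
assert overlap >= 0.5, "ids barely overlap: are they encrypted the same way?"
```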
What are your Sanity Checks?
• If a column “OfficeId” is really a user id, it has certain known properties:

Assumption | Why does it matter?
Never null/empty | Nulls and empties cause job-breaking data skew issues.
Users are 1:* with Tenants | Logical constraint: a violation is a sign you are missing something.
Very high cardinality | If this isn’t true, it’s unlikely that it’s a user-id.
All rows in event data join to it | Otherwise, your data is incomplete.
Matches a certain regex | Sanity check: if this isn’t true, it’s unlikely that it’s a user-id.

• Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.
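The properties listed for “OfficeId” can be re-checked on every run rather than only at setup. Below is a minimal sketch in plain Python, not the speaker's implementation: the id regex, the `min_distinct` threshold, and the sample rows are assumptions made up for the example. The 1:* tenant constraint is left out because the slide does not say which side is the "one".

```python
import re

# Hypothetical id format; the real regex is not given in the slides.
USER_ID_PATTERN = re.compile(r"[0-9a-f]{8}")


def check_user_id_column(users, events, key="OfficeId", min_distinct=3):
    """Return a dict mapping each assumption name to True if it holds."""
    ids = [row.get(key) for row in users]
    present = [i for i in ids if i]          # drops None and empty strings
    known = set(present)
    return {
        "never_null_or_empty": len(present) == len(ids),
        "very_high_cardinality": len(known) >= min_distinct,
        "all_event_rows_join": all(e.get(key) in known for e in events),
        "matches_regex": all(
            USER_ID_PATTERN.fullmatch(i) is not None for i in present
        ),
    }


# Hypothetical data: one empty id, and one event that joins to no user.
users = [
    {"OfficeId": "0a1b2c3d"},
    {"OfficeId": "deadbeef"},
    {"OfficeId": "12345678"},
    {"OfficeId": ""},
]
events = [{"OfficeId": "0a1b2c3d"}, {"OfficeId": "ffffffff"}]

results = check_user_id_column(users, events)
failed = [name for name, ok in results.items() if not ok]
```

Returning named results instead of raising on the first failure lets a job report every broken assumption at once.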
Data Asserts: Defensive Data Science
Data Asserts: Maintain Quality
Data Asserts: Clear Trust Boundaries
These should match!
Data Asserts in Production: A few Observations
• Most of the analysis-impacting assertion failures we’ve seen were actually errors in our assumptions, not errors in the pipeline.
• Good tests beget good code: we’ve had to modularize our code in order to produce testable chunks that get re-used in pipelines.
• Data Asserts is the backbone of data provenance. A data conclusion can link directly to all of the assumptions we made about the input.
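The provenance point can be illustrated with a small record that keeps each checked assumption next to the conclusion it supports. This is a sketch only: the `Conclusion` class, its methods, and the sample rows are hypothetical and do not describe the Data Asserts system presented in the talk.

```python
class Conclusion:
    """A stated result plus the record of every assumption checked for it."""

    def __init__(self, statement):
        self.statement = statement
        self.assumptions = []  # list of (description, passed) tuples

    def check(self, description, condition):
        """Record an assumption and whether it held; return the outcome."""
        passed = bool(condition)
        self.assumptions.append((description, passed))
        return passed

    def is_supported(self):
        """True only if every recorded assumption held."""
        return all(passed for _, passed in self.assumptions)

    def report(self):
        """Render the conclusion followed by one line per assumption."""
        lines = [self.statement]
        for description, passed in self.assumptions:
            lines.append(f"  [{'ok' if passed else 'FAILED'}] {description}")
        return "\n".join(lines)


# Hypothetical input rows.
rows = [
    {"user_id": "u1", "is_automated": False},
    {"user_id": "u2", "is_automated": False},
]

c = Conclusion("Hypothetical conclusion about OneDrive uploaders")
c.check("user_id is never empty", all(r["user_id"] for r in rows))
c.check("no automated traffic remains", not any(r["is_automated"] for r in rows))
c.check("user_id is unique", len({r["user_id"] for r in rows}) == len(rows))
print(c.report())
```

Anyone reading the conclusion later can see which input assumptions were tested and whether they held at the time.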

DataEngConf SF16 - Data Asserts: Defensive Data Science
