DataEngConf SF16 - Data Asserts: Defensive Data Science

Data Asserts
Defensive Data Science
Tommy Guy
Microsoft

Observation: Complexity In Pipeline

Our pipeline:
DATA!!!
Insight! Direction! Strategy!

Our pipeline in reality: bugs tend to compound
DATA!!!

How do Engineers Manage Complexity?
Encapsulate: create functions/classes/subsystems
with clear APIs. This helps isolate complexity.
Integration Tests: ensure that the components interact
correctly. This helps identify breaking changes.

Data introduces a few complications
Pipelines take many upstream dependencies
Researcher use cases are frequently unknown and
unanticipated by data providers.
Pushing requirements upstream to all producers is
Sisyphean.

We are not talking about data pipeline tests
Data pipeline teams ask:
Are all rows that are produced stored?
• Counter fields to ensure no dropped rows
• Sentinel events to measure join fidelity
Are availability SLAs being met?
• Progressive server-client merging
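For contrast with the semantic checks discussed next, here is a minimal sketch of the "counter fields" idea. It assumes each producer stamps its rows with a contiguous integer counter; the talk does not describe the actual mechanism, and the function name `find_dropped_counters` is hypothetical.

```python
def find_dropped_counters(counters):
    """Return counter values missing from a stream assumed to be contiguous."""
    seen = set(counters)
    if not seen:
        return []
    # Any integer between the smallest and largest counter that never
    # arrived is treated as a dropped row.
    return [c for c in range(min(seen), max(seen) + 1) if c not in seen]


# Hypothetical counters received from one producer.
received = [1, 2, 3, 5, 6, 9]
missing = find_dropped_counters(received)
```

A check like this answers "did every row arrive?" but says nothing about whether a field means what the analyst thinks it means.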

Data Scientists Require Semantic Correctness
Does this field mean what I think it does?

How do Data Scientists identify potential errors?
Some follow-on fact is absurd…
… which leads to investigation …
… which finds a broader problem
If [potential conclusion], then we must have 3 billion
OneDrive users…
… because my user table doesn’t have a primary key …
… so I should aggregate by user.
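The absurd user count in the example above could be caught before any analysis by asserting that the user table really is keyed by user. A minimal sketch, assuming rows are plain dictionaries and the key column is called `user_id` (both are illustrative choices, not details from the talk):

```python
from collections import Counter


def duplicated_keys(rows, key):
    """Return {key value: row count} for every key value that appears more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


# Hypothetical user table with no primary key: u1 appears twice.
users = [
    {"user_id": "u1", "plan": "free"},
    {"user_id": "u2", "plan": "paid"},
    {"user_id": "u1", "plan": "paid"},
]

dupes = duplicated_keys(users, "user_id")
row_count = len(users)
distinct_users = len({row["user_id"] for row in users})
```

Counting rows overstates the user count whenever `dupes` is non-empty, which is why the example ends with aggregating by user first.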

What are your Assumptions?
If I conclude “Users who upload files to OneDrive are XXX% more likely
to buy Office if they also sent mail through Mobile Outlook”, I’m
making many silent assumptions:
Field              Assumptions
User Id            • Logged and PII-encrypted similarly in Outlook and OneDrive
                   • Correctly logging timestamp for Office purchase
                   • User Id isn’t empty or missing
OneDrive activity  • Wasn’t automated traffic [identified by a certain flag]
Email activity     • Mobile client identifiers are correct
All                • Any upstream changes to OneDrive, Office, or Exchange
                     data have been communicated to pipeline owners

What are your Sanity Checks?
• If a column “OfficeId” is really a user id, it has certain known properties:
Assumption                         Why does it matter?
Never null/empty                   Nulls cause job-breaking data skew issues.
Users are 1:* with Tenants         Logical constraint: a violation is a sign you are missing something.
Very high cardinality              If this isn’t true, it’s unlikely that it’s a user-id.
All rows in event data join to it  Otherwise, your data is incomplete.
Matches a certain regex            Sanity check: if this isn’t true, it’s unlikely that it’s a user-id.
• Observation: these sorts of checks take place when the pipeline is set
up, but they may not be re-checked very often.
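The properties listed above can be written down as executable checks and re-run on every pipeline execution rather than only at setup. This is a minimal sketch, not the speaker's implementation: the function name, the `min_distinct` threshold, and the id pattern are all hypothetical, and the tenant constraint is read here as "each user belongs to exactly one tenant".

```python
import re


def check_user_id_column(user_rows, event_user_ids, pattern, min_distinct=2):
    """Return a list of failed assumptions; an empty list means all passed.

    user_rows: list of (user_id, tenant_id) tuples from the user table.
    event_user_ids: user ids seen in the event data.
    pattern, min_distinct: illustrative values, not taken from the talk.
    """
    failures = []
    ids = [uid for uid, _ in user_rows]

    # Never null/empty.
    if any(uid is None or uid == "" for uid in ids):
        failures.append("null or empty user id")

    # Very high cardinality (a crude distinct-count threshold).
    present = [uid for uid in ids if uid]
    if len(set(present)) < min_distinct:
        failures.append("cardinality too low for a user id")

    # Matches a certain regex.
    regex = re.compile(pattern)
    if any(regex.fullmatch(uid) is None for uid in present):
        failures.append("user id does not match expected pattern")

    # Tenant constraint: no user may map to more than one tenant.
    tenants_per_user = {}
    for uid, tenant in user_rows:
        if uid:
            tenants_per_user.setdefault(uid, set()).add(tenant)
    if any(len(tenants) > 1 for tenants in tenants_per_user.values()):
        failures.append("user mapped to more than one tenant")

    # All rows in event data join to the user table.
    if not set(event_user_ids) <= set(present):
        failures.append("event rows that do not join to the user table")

    return failures


# Hypothetical data: ids are assumed to look like "u-" plus three digits.
good_rows = [("u-001", "t1"), ("u-002", "t1"), ("u-003", "t2")]
bad_rows = [("u-001", "t1"), ("", "t1"), ("u-001", "t2"), ("bogus", "t2")]

good_failures = check_user_id_column(good_rows, ["u-001", "u-003"], r"u-\d{3}")
bad_failures = check_user_id_column(bad_rows, ["u-001", "u-999"], r"u-\d{3}")
```

Returning the full list of failures, instead of raising on the first one, lets a single run report every broken assumption at once.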

Data Asserts: Defensive Data Science

Data Asserts: Maintain Quality

Data Asserts: Clear Trust Boundaries

These should match!

Data Asserts in Production: A Few Observations
• Most of the analysis-impacting assertion failures we’ve seen were
actually errors in our assumptions, not errors in the pipeline.
• Good tests beget good code: we’ve had to modularize our code in
order to produce testable chunks that get re-used in pipelines.
• Data Asserts are the backbone of data provenance. A data conclusion
can link directly to all of the assumptions we made about its input.
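One way to picture the provenance point is a conclusion object that records every assertion evaluated on its input. The sketch below is only an illustration under assumed names (`Conclusion`, `AssertResult`, `check`, `is_supported`); the talk does not show how the production system stores these links.

```python
from dataclasses import dataclass, field


@dataclass
class AssertResult:
    name: str
    passed: bool


@dataclass
class Conclusion:
    statement: str
    assumptions: list = field(default_factory=list)

    def check(self, name, predicate, data):
        """Evaluate one assumption on the data and record the outcome."""
        result = AssertResult(name, bool(predicate(data)))
        self.assumptions.append(result)
        return result.passed

    def is_supported(self):
        """True only if every recorded assumption held."""
        return all(a.passed for a in self.assumptions)


# Hypothetical input rows and conclusion text.
rows = [{"user_id": "u1"}, {"user_id": "u2"}]
conclusion = Conclusion("hypothetical finding about OneDrive uploads")
conclusion.check("user_id never empty",
                 lambda rs: all(r["user_id"] for r in rs), rows)
conclusion.check("user_id unique",
                 lambda rs: len({r["user_id"] for r in rs}) == len(rs), rows)
recorded = [a.name for a in conclusion.assumptions]
```

Anyone reading the conclusion later can inspect `conclusion.assumptions` to see exactly which input properties it depended on and whether each one held.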
