Validating Data at Scale 
Spenser Skates 
CEO at Amplitude
Doing things at scale is noisy
• Code is supposed to run the same way every time. But if you run the same loop a million times on a million different machines, how confident are you that it will always behave the same?
Data from phones is noisier
• Running on tens of thousands of different platforms with hundreds of thousands of different software configurations on hundreds of millions of phones
• Platforms have the craziest settings
How data can get messed up
• HTTP requests get mangled in transit
• Phone might not get the acknowledgement from the server
• People’s clocks are off
• People are running weird versions of Android
• Memory/disk corruption
• Gamma ray events
You can’t trust data from the client
Problem: Data gets mangled in transit
• Parameters from POST requests get dropped
• Within a parameter, a chunk of the data may never reach the server
Solution: Checksumming
• Send a checksum that’s a function of all the fields
• If the checksum is wrong or missing, you know that you haven’t received all the data. Tell the phone the upload wasn’t successful
• The phone will then attempt to re-upload the data
Problem: Client sends the same data twice
• How does the phone know that the server has received the data, so that it doesn’t upload the same piece of data twice? It gets an acknowledgement back
• How does the server know that the phone has received the acknowledgement? It doesn’t!
• Equivalent to the two generals problem
• For 5% of the requests that the server receives successfully, the acknowledgement fails to reach the phone
• That means all counts are inflated by about 5%!
Solution: Deduplication
• Your system must be idempotent at the event level: it must be able to receive an event it has received before without changing its state
• Create a unique key for every event that is sent
• When you see an event, check your list of keys; if the key is already present, discard the event
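
A minimal in-memory sketch of event-level idempotence is shown below. The `EventStore` class, the `event_id` and `type` fields, and the use of a client-generated UUID as the unique key are assumptions for illustration; the slides do not say how the key is built, and a production system would need a durable, shared key store rather than a Python set.

```python
import uuid


class EventStore:
    """Counts events per type while ignoring events it has already seen."""

    def __init__(self):
        self.seen_keys = set()  # keys of every event ingested so far
        self.counts = {}        # event type -> number of distinct events

    def ingest(self, event: dict) -> bool:
        """Return True if the event was new, False if it was a duplicate."""
        key = event["event_id"]  # hypothetical unique key set by the client
        if key in self.seen_keys:
            return False  # duplicate: discard, state is unchanged
        self.seen_keys.add(key)
        self.counts[event["type"]] = self.counts.get(event["type"], 0) + 1
        return True


store = EventStore()
event = {"event_id": str(uuid.uuid4()), "type": "app_open"}

store.ingest(event)  # first upload
store.ingest(event)  # retry after a lost acknowledgement
print(store.counts)  # the retried event is counted once
```

Because the key is created once on the phone and travels with every retry, a re-upload looks identical to the original and is dropped.
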
Problem: Clocks are off
• Phones are often offline, so an analytics SDK needs to cache data locally before uploading, including the time the event occurred
• But people’s clocks are often off, occasionally by years!
• We can’t simply use the upload time as the event timestamp: 5% of data is uploaded more than 24 hours after the event happened
Solution: Get an estimate of the actual time an event was logged
• Timestamp the upload from the phone
• For each event, let’s compare:
  • The difference between the phone event timestamp and the server upload time
  • The difference between the phone upload timestamp and the server upload time
Solution: Get an estimate of the actual time an event was logged
• For each event timestamp, subtract the difference between the phone’s upload time and the server’s upload time
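
The correction above can be written as a one-line formula. The sketch below assumes all timestamps are in seconds, that the phone's clock error stays constant between the event and the upload, and that network delay is negligible; the function and parameter names are hypothetical.

```python
def estimate_event_time(event_ts: float,
                        phone_upload_ts: float,
                        server_upload_ts: float) -> float:
    """Estimate when an event really happened, in server time.

    clock_offset is how far the phone's clock runs ahead of the server's
    (negative if it runs behind), measured at upload time.
    """
    clock_offset = phone_upload_ts - server_upload_ts
    return event_ts - clock_offset


# A phone whose clock runs one hour (3600 s) fast:
# the event truly happened at t=1000 but was logged as 4600,
# and the upload truly happened at t=2000 but the phone stamped it 5600.
corrected = estimate_event_time(event_ts=4600,
                                phone_upload_ts=5600,
                                server_upload_ts=2000)
print(corrected)  # 1000
```

The same formula handles a clock that is off by years, since only the difference between the two phone timestamps is trusted.
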
Other Problems
• People are running weird versions of Android
• MD5 library
• Memory/disk corruption
• Gamma ray events
Clean Data
Questions? 
Always happy to talk about analytics problems! 
spenser@amplitude.com 
blog.amplitude.com 
twitter: @amplitudemobile 
MOBILE ANALYTICS FOR DECISION MAKERS

