Utilizing Human Data Validation
for KPI Analysis
and Machine Learning
Dan Morris
Radius Intelligence
Overview and Key Takeaways
• Data science problems @ Radius

• Human validation: costs and benefits

• Sampling and experimentation for multiple consumers

• Positive feedback cycles in production
Radius - B2B Predictive Marketing
Radius - Data Engineering
MLlib
Why Human Validation?
• Business firmographic data is a difficult data problem

• Our sources face the same challenges that we do

• Each source must be considered a “proposal”

• Independent Human Validation is (the closest thing to) ground truth
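
To make the "proposal" idea concrete, here is a minimal sketch that scores each source's proposed values against human-validated values. The source names, business ids, phone numbers, and the `source_accuracy` helper are all hypothetical illustrations, not Radius's actual pipeline.

```python
from collections import defaultdict


def source_accuracy(proposals, validated):
    """Score each source against human-validated values.

    proposals: iterable of (source, business_id, value) tuples
    validated: dict mapping business_id -> human-validated value
    Returns a dict mapping source -> fraction of scored proposals that match.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for source, business_id, value in proposals:
        if business_id not in validated:
            continue  # no ground truth for this business, so it cannot be scored
        totals[source] += 1
        if value == validated[business_id]:
            hits[source] += 1
    return {source: hits[source] / totals[source] for source in totals}


# Hypothetical data: two sources proposing phone numbers for three businesses.
proposals = [
    ("source_a", "biz1", "555-0100"),
    ("source_b", "biz1", "555-0199"),
    ("source_a", "biz2", "555-0111"),
    ("source_b", "biz2", "555-0111"),
    ("source_b", "biz3", "555-0123"),  # biz3 was never validated
]
validated = {"biz1": "555-0100", "biz2": "555-0111"}

accuracy = source_accuracy(proposals, validated)
```

Only businesses with a validated value contribute to a source's score, so unvalidated proposals neither help nor hurt it.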
Degrees of Human Validation
• Prompted Validation

• Research Validation
• Research Curation
[Chart: Value vs. Cost for PV, RV, and RC]
Prompted Validation
Example Assignment:
Verify the Phone Number and
Address of a Business
Research Validation
Example Assignment:
Determine if a Business
Belongs to a Chain / Franchise
Business Name: Bob’s Donuts
Address: 123 Main St.
Website: www.bobsdonuts.biz
Industry: Limited Service Restaurants
Is Chain: (Y / N / U)
Chain Type: (Local / Regional / National)
Research Curation
Example Assignment:
Where is the Headquarters of
this Company Located?
Company Name: Bob’s Donuts Inc.
Website: www.bobsdonuts.biz
Has many locations: (Y / N / U)
HQ Location: ???
Source of information: ???
Human Validation: Benefits
• Ground Truth
• Supervised ML
• Internal Metrics
• Competitive Analysis

• Our customers are humans, too!
Human Validation: Costs
• Money
• Time

• Us and Them
Cost: Money
• Validated data costs more than aggregated data
• Validation + Data Science > Pure Validation
Cost: Time
• Automated experimental framework
• Shift bottleneck to validation teams

• Parallelized validation improves turnaround time
• Be mindful of differences in teams / validators
• Decay / Obsolescence of validations
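
One possible way to account for decay is to down-weight older validations. The sketch below assumes an exponential decay with a hypothetical 180-day half-life; both the decay model and the half-life are illustrative assumptions, not something stated in the talk.

```python
from datetime import date, timedelta


def validation_weight(validated_on, today, half_life_days=180):
    """Weight a validation by its age, halving every `half_life_days` days.

    The exponential form and the 180-day default are assumptions chosen
    for illustration only.
    """
    age_days = (today - validated_on).days
    return 0.5 ** (age_days / half_life_days)


validated_on = date(2016, 1, 1)
fresh = validation_weight(validated_on, validated_on)
aged = validation_weight(validated_on, validated_on + timedelta(days=180))
```

A validation made today gets weight 1.0, and one that is a full half-life old gets weight 0.5.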
Cost: Us and Them
Clearly communicate expectations and interpretations
Uses for Validated Data
• KPI Analysis
• ML Training Sets
• Spot Hypothesis Validation
Challenge: minimize number of validations
while meeting all downstream needs
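
As a rough illustration of the savings, the sketch below uses the standard sample-size formula for estimating a proportion, n = z² p(1 - p) / e², for the KPI consumer, and assumes that a single random sample can be reused by every consumer. The ML training and hypothesis-check counts are hypothetical, and the reuse assumption only holds when the consumers' sampling requirements are compatible.

```python
import math


def kpi_sample_size(margin, z=1.96, p=0.5):
    """Validations needed to estimate a proportion within +/- `margin`.

    Uses n = z^2 * p * (1 - p) / margin^2, with p = 0.5 as the
    worst case and z = 1.96 for roughly 95% confidence.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


# Hypothetical per-consumer requirements (number of validated records).
needs = {
    "kpi_analysis": kpi_sample_size(0.05),
    "ml_training": 1000,
    "hypothesis_check": 200,
}

shared = max(needs.values())    # one sample reused by every consumer
separate = sum(needs.values())  # one independent sample per consumer
```

Under these assumptions the shared design needs only as many validations as the most demanding consumer, rather than the sum of all three.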
Multiple-Consumer Sampling
Standalone vs. Chain Experiment
1 value per 1 location == Easy Sampling!
Multiple-Consumer Sampling
Phone Accuracy Experiment
(0, 1, 2, 3, …) values to 1 location == Difficult Sampling.
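
One way to approach the many-values case, sketched under the assumption that the location (not the phone value) is the unit of interest: sample locations uniformly, then send every candidate value for each sampled location to validation. Sampling phone values directly would over-represent locations with many candidates and miss locations with none. The data and the `sample_locations` helper are hypothetical.

```python
import random


def sample_locations(phones_by_location, k, seed=0):
    """Uniformly sample k locations, keeping all candidate phones for each."""
    rng = random.Random(seed)
    location_ids = sorted(phones_by_location)  # stable order for reproducibility
    chosen = rng.sample(location_ids, k)
    return {loc: phones_by_location[loc] for loc in chosen}


# Hypothetical candidates: a location may have 0, 1, or several phone values.
phones_by_location = {
    "loc1": [],
    "loc2": ["555-0101"],
    "loc3": ["555-0102", "555-0103", "555-0104"],
    "loc4": ["555-0105", "555-0106"],
}

sample = sample_locations(phones_by_location, k=2)

# One validation task per (location, candidate phone) pair in the sample.
tasks = [(loc, phone) for loc, phones in sample.items() for phone in phones]
```

The number of validation tasks then varies with the sample drawn, which is part of what makes budgeting this experiment harder than the one-value-per-location case.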
Basic Production Pipeline
Positive Feedback Production Cycle
THANKS!
email me: dan.morris@radius.com
stalk me: @djsensei
connect me: linkedin.com/in/danielepmorris
work with me: radius.com/jobs
