By	Greg	Makowski
Predictive	Model	and	Record	Description	
Using Segmented	Sensitivity	Analysis	(SSA)
Cloud+Data NEXT	Conference,	Santa	Clara	Convention	Center
http://www.cdnextcon.com/
Sunday,	July	16,	2017
Benefits
Describe	the	most	important	data	inputs	to	a	model	
• What	is	driving	the	forecast?
• Good	Communication	is	a	Competitive	Advantage
During	model	building	– use	to	improve	the	model
Use	to	detect	data	drift	– when	model	refresh	is	
needed
For	each	record,	what	are	reasons	for	the	forecast?
“3	Reasons	Why	Data	Scientist	Remains	
the	Top	Job	in	America”	– Infoworld 4/14/17
In 2015: 11k to 19k Data Scientists existed
Now: on LinkedIn, 13.7k OPEN POSITIONS (89% more positions in 2 years)
Reason	#1:	There’s	a	shortage	of	talent
• “Business	leaders	are	after	professionals	who	can	not	only	understand	
the	numbers,	but	also	communicate	their	findings	effectively.”
Reason #2: Organizations face challenges in organizing data
• “Data	preparation	accounts	for	80%	of	the	work	of	Data	Scientists”
Reason	#3:	Need	for	DS	is	no	longer	restricted	to	tech	giants
http://www.infoworld.com/article/3190008/big-data/3-reasons-why-data-scientist-remains-the-top-job-in-america.html#tk.drr_mlt
Algorithm	Design	Objectives
1. Describe	the	model	in	terms	of	variables	understandable	
to	the	target	audience
2. Be independent of the algorithm (e.g. Neural Net, SVM,
Extreme Gradient Boosting, Random Forests…)
3. Support describing an arbitrary ensemble of models
4. Pick up non-linearities in the variables
5. Pick up interaction effects
6. Understand the model system in a very local way
[Figure: plot of z (target) against x]
Set	Client	Expectations
I	understand	completely	how	a	bicycle	works….
However,	I	still	drive	a	car	to	work
A	certain	level	of	detail	is	NOT	needed
Do	you	find	out	why	the	automotive	engineer	picked	X	mm	
for	the	diameter	of	the	cylinders?
You	can	learn	enough	detail	to	let	the	model	drive	your	
business
Sensitivity	Analysis
(OAT)	One	At	a	Time
https://en.wikipedia.org/wiki/Sensitivity_analysis
[Diagram: (S) Source fields → Arbitrarily Complex Data Mining System → Target field, measuring the delta in forecast]
For source fields with binned ranges, sensitivity tells you the importance of the range, e.g. “low”, … “high”
Can put sensitivity values in Pivot Tables or Cluster
Record-level “reason codes” can be extracted from the most important bins that apply to the given record
Present record N to the model S times, each time with one input 5% bigger (fixed input delta)
Record the delta change in output, S times per record
Aggregate: average(abs(delta)), the target change per input field delta
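The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: `oat_sensitivity`, `toy_model` and the three example records are hypothetical names and data, and `toy_model` merely stands in for the "arbitrarily complex data mining system".

```python
from statistics import mean

def oat_sensitivity(predict, records, delta=0.05):
    """Return (per-record deltas, mean |delta| per source field)."""
    n_fields = len(records[0])
    deltas = []  # one row per record, one column per source field
    for rec in records:
        base = predict(rec)
        row = []
        for j in range(n_fields):
            bumped = list(rec)
            bumped[j] = rec[j] * (1.0 + delta)   # one input 5% bigger
            row.append(predict(bumped) - base)   # delta in forecast
        deltas.append(row)
    # Aggregate: average(abs(delta)) per input field
    importance = [mean(abs(row[j]) for row in deltas) for j in range(n_fields)]
    return deltas, importance

# Hypothetical stand-in for any trained model or ensemble
def toy_model(x):
    return 2.0 * x[0] + x[1] * x[1] - 0.5 * x[0] * x[2]

records = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5], [0.5, 3.0, 1.0]]
deltas, importance = oat_sensitivity(toy_model, records)
ranking = sorted(range(len(importance)), key=lambda j: -importance[j])
print("importance per field:", importance)
print("fields ranked by importance:", ranking)
```

Because the perturbation is multiplicative, an input whose value is exactly 0 produces no change; an additive step may be a better choice for such fields.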
5	Example	Sensitivity	Records
Intermediate	Table	of	Sensitivities	/rec	/var
[Table columns: Forecasted Target Variable, then Delta 1, Delta 2, … Delta N: changes from the target variable after multiplying each input by 1.05, One At a Time (OAT)]
Both	Positive	and	Negative	Effects
Changes	within	Variable	Range	(Neural	Net	model	3)
Example	Raw	Values	for	Top	12	Variables
Standard Deviation can be another ranking metric
Abs = (Total Width over neg and pos)
Both	Positive	and	Negative	Effects
Changes	within	Variable	Range	(Neural	Net	model	3)
Avg(negative values) by variable
Avg(positive values) by variable
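One way to compute these per-variable summaries from a table of signed deltas is sketched below. The function name `summarize_deltas` and the small delta table are hypothetical, and reading "Abs" as the distance between the average positive and the average negative delta is an interpretation of the slide, not a confirmed definition.

```python
from statistics import mean, pstdev

def summarize_deltas(deltas):
    """Per variable: avg negative delta, avg positive delta, width, std dev."""
    summary = []
    for j in range(len(deltas[0])):
        col = [row[j] for row in deltas]
        neg = [d for d in col if d < 0]
        pos = [d for d in col if d > 0]
        avg_neg = mean(neg) if neg else 0.0
        avg_pos = mean(pos) if pos else 0.0
        summary.append({
            "var": j,
            "avg_neg": avg_neg,
            "avg_pos": avg_pos,
            "abs_width": avg_pos - avg_neg,  # assumed meaning of "total width"
            "std": pstdev(col),              # alternative ranking metric
        })
    # Rank variables by total width, widest first
    return sorted(summary, key=lambda s: -s["abs_width"])

# Hypothetical deltas: 3 records (rows) by 2 variables (columns)
deltas = [[0.2, -0.1], [-0.4, -0.3], [0.6, -0.2]]
table = summarize_deltas(deltas)
for row in table:
    print(row)
```

Keeping the sign matters here: a variable whose effect is positive for some records and negative for others can look unimportant if the signed deltas are simply averaged.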
Data Mining Project Overview
Define business objectives and project plan during the Knowledge Discovery Workshop
Select the “Analysis Universe” data
• Include holdout verification data
Repeat through model loop (1-3 times, ~2 weeks each)
• Exploratory Data Analysis (EDA)
• Transformation (Preprocessing)
• Build Model – dozens or 100’s of models (Data Mining)
• Evaluate and explain the model – use business metric
Score or deploy the model on “Forecast Universe”
Track results, refresh or rebuild model, subdivide or refine as needed
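[Timeline figure labels: Analysis past, Example future, Scoring past, Forecasted future, Reference Date]
Days per sprint (columns: sprints 1 to 3; rows: the four model loop steps)
EDA             2  1  1
Transformation  5  4  4
Build Model     2  4  3
Evaluate        1  1  2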
https://www.csd.uwo.ca/faculty/ling/cs435/fayyad.pdf
From Data Mining to Knowledge Discovery in Databases, 1996
During	the	Data	Mining	Project
at	the	End	of	the	First	Sprint
Sprint	1:		basic	data	preprocessing	and	clean	up
At	the	end	(before	Sprint	2)	
• Perform	Sensitivity	Analysis	to	rank	variables
Sprint	2,	start
• Now	have	quantitative	feedback	on	the	most	important	variables
• Start	working	on	more	detailed	knowledge	representation
• Check	variable	interactions
“More data beats clever algorithms,
But BETTER DATA beats more data”
- Peter Norvig
Director of Research at Google
Fellow of Association for the Advancement of Artificial Intelligence
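The "check variable interactions" step at the start of Sprint 2 can be approximated by perturbing two inputs at a time. The sketch below uses a standard second-order finite difference, which is a substituted technique and not necessarily the one used in the talk; `interaction_effect`, `toy_model` and the example record are hypothetical.

```python
def interaction_effect(predict, rec, i, j, delta=0.05):
    """Second-order difference: zero when fields i and j act additively."""
    def bumped(*fields):
        x = list(rec)
        for f in fields:
            x[f] = rec[f] * (1.0 + delta)  # same 5% step as the OAT analysis
        return predict(x)
    # f(i and j bumped) - f(i bumped) - f(j bumped) + f(unchanged)
    return bumped(i, j) - bumped(i) - bumped(j) + bumped()

# Hypothetical stand-in model: only fields 0 and 2 interact
def toy_model(x):
    return 2.0 * x[0] + x[1] * x[1] - 0.5 * x[0] * x[2]

rec = [1.0, 2.0, 3.0]
pairs = {(i, j): interaction_effect(toy_model, rec, i, j)
         for i in range(3) for j in range(i + 1, 3)}
for pair, effect in pairs.items():
    print(pair, round(effect, 6))
```

Averaging the absolute effect over many records, as in the one-at-a-time case, would give a ranking of variable pairs to inspect first.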
Higher	Level	Detectors	
Illustrated	as	rules,	but	typically	functions	for	a	continuous	score
“Higher Level” or compound detectors
–Group one of many to an overall behavior issue (using NLP tags)
if (hide communications identity with email alias) or
   (hide communication subject with code phrase) then
   hiding_comm on date_time X = 0.2
–Group many low level alerts in a short time
if (20 <= failed login attempts) and (5 minutes <= time window) then
   Possible password guessing = 0.7
else if (5 <= failed login attempts) and (3 minutes <= time window) then
   Possible password guessing = 0.3
–Compare different levels of context (possibly from different source systems)
if (4 <= sum(over=week, event=hiding_comm)) and    # sum smaller detector over time
   (3 <= comm network size(hiding_comm)) and       # network analysis
   (manager not in(network(hiding_comm))) then     # reporting hierarchy
   escalating comm secrecy = 0.8                   # threshold distance increases score
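The rules above translate directly into small scoring functions. The sketch below keeps the slide's thresholds and scores; the function and parameter names are hypothetical, and `password_guessing_score` is an assumed example of the "continuous score" form, with a linear ramp chosen only for illustration.

```python
def hiding_comm(uses_email_alias, uses_code_phrase):
    """Either way of hiding a communication raises the same low-level alert."""
    return 0.2 if (uses_email_alias or uses_code_phrase) else 0.0

def password_guessing(failed_logins, window_minutes):
    """Rule form. The stricter rule is tested first so it is not shadowed."""
    if failed_logins >= 20 and window_minutes >= 5:
        return 0.7
    if failed_logins >= 5 and window_minutes >= 3:
        return 0.3
    return 0.0

def password_guessing_score(failed_logins, lo=5, hi=20):
    """Continuous form (assumption): ramps from 0.3 at lo to 0.7 at hi."""
    if failed_logins < lo:
        return 0.0
    frac = min(1.0, (failed_logins - lo) / (hi - lo))
    return 0.3 + 0.4 * frac

def escalating_comm_secrecy(weekly_hiding_events, network_size, manager_in_network):
    """Compound detector combining counts over time, network size and hierarchy."""
    if weekly_hiding_events >= 4 and network_size >= 3 and not manager_in_network:
        return 0.8
    return 0.0

print(password_guessing(20, 5), password_guessing(8, 3))
print(password_guessing_score(12.5))
print(escalating_comm_secrecy(4, 3, manager_in_network=False))
```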
Analogy
• Defense attorney
debating plausible
innocence
• Prosecuting attorney
debating guilt
• Detectors seeing the
plausible “best case”
(to reduce false alerts)
• Other detectors seeing
the “worst case” in
each record
Accurate
General
Want	to	Capture	COMPLEX	Interactions
All this complex variation is incredibly helpful!!!
Capture	“Data	Drift”	Over	Time
Behavior	Changes (pricing,	competition)
[Figure: Training Data vs. Current Scoring Data]
Think about what you want the model to be general on, capture behavior VARIETY:
• satellite images only during afternoon
• Christmas or vacation spending spikes
The best model is limited by fitting the TRAINING data surface
Do you have a large enough sample by behavior pocket?
“Non-Stationary Data” DOES change from training to scoring time
Tracking Model Drift
MODEL DRIFT DETECTOR in N dimensions
• Change in distribution of most important input fields
Diagnose CAUSES, what is changing, how much…
Out of the top 25% of the most important input fields…
Which had the largest change?
[Figure: z (target) plotted against x for TRAINING DATA and SCORING DATA. The distribution of important variable X (where Y=15) changes from one peak to two]
Capture	“Data	Drift”	Over	Time
Behavior	Changes (pricing,	competition)
Use	“Training	Data”	as	the	baseline	
• Create	20	equal	frequency	bins	of	the	forecast	variable	(5.0%	/	bin)
• Save	the	original,	Training,	bin	thresholds
Check the scored data over time (e.g. daily, monthly)
Use the Chi-Square or KS statistic to measure the slow changes
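The baseline-and-compare procedure above can be sketched with the standard library alone. All function names and the simulated data are hypothetical; the KS value is computed on the 20 binned counts, so it only approximates the exact two-sample KS statistic, and the chi-square helper assumes every training bin is non-empty.

```python
import random
from bisect import bisect_right

def equal_frequency_thresholds(values, n_bins=20):
    """Cut points giving about 1/n_bins of the training values per bin."""
    s = sorted(values)
    return [s[(k * len(s)) // n_bins] for k in range(1, n_bins)]

def bin_counts(values, thresholds):
    counts = [0] * (len(thresholds) + 1)
    for v in values:
        counts[bisect_right(thresholds, v)] += 1
    return counts

def chi_square(train_counts, scored_counts):
    """Chi-square of scored counts against the training proportions."""
    n_train, n_scored = sum(train_counts), sum(scored_counts)
    stat = 0.0
    for t, s in zip(train_counts, scored_counts):
        expected = n_scored * t / n_train  # assumes t > 0 for every bin
        stat += (s - expected) ** 2 / expected
    return stat

def ks_statistic(train_counts, scored_counts):
    """Largest gap between the two cumulative bin distributions."""
    n_train, n_scored = sum(train_counts), sum(scored_counts)
    cum_t = cum_s = 0
    worst = 0.0
    for t, s in zip(train_counts, scored_counts):
        cum_t += t
        cum_s += s
        worst = max(worst, abs(cum_t / n_train - cum_s / n_scored))
    return worst

# Simulated forecast variable: training baseline, a stable batch, a drifted batch
random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(2000)]
stable = [random.gauss(0.0, 1.0) for _ in range(2000)]
drifted = [random.gauss(0.5, 1.0) for _ in range(2000)]

cuts = equal_frequency_thresholds(train)   # save the original Training thresholds
train_counts = bin_counts(train, cuts)
for name, batch in [("stable", stable), ("drifted", drifted)]:
    counts = bin_counts(batch, cuts)
    print(name, round(chi_square(train_counts, counts), 1),
          round(ks_statistic(train_counts, counts), 3))
```

Running the same comparison on each of the most important input fields, and sorting by the statistic, is one way to answer "which had the largest change?".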
Description Per	Record
Use	Segments	of	Variable	Ranges
• Reason	codes	are	specific	to	the	model	and	record
• Ranked predictive fields     record 1      record 2
                               Mr. Smith     Mrs. Jones
max_late_payment_120d          0             0
max_late_payment_90d           1             0
bankrupt_in_last_5_yrs         1             0
max_late_payment_60d           0             1
• Mr.	Smith’s	reason	codes	include:
max_late_payment_90d 1
bankrupt_in_last_5_yrs 1
Description	Per	Record
Need “reasons” that apply to some people (records) but not others
A given variable has some value for everybody
Need “sub-ranges” that only apply to some people, e.g.
• Very	Low,		Low,		Medium,		High,		Very	High
• Create	5	“bins”,	with	a	roughly	equal	number	of	records	per	bin
• Focus	on	the	sub-ranges	or	bins	that	have	the	highest	sensitivity
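The binning and ranking described above can be sketched as follows. This is an illustrative reading of the slides, not the author's code: the field names `income` and `debt`, the `toy_model`, and all function names are hypothetical, and the deltas are the one-at-a-time 5% changes described earlier.

```python
from bisect import bisect_right
from statistics import mean

LABELS = ["Very Low", "Low", "Medium", "High", "Very High"]

def thresholds(values, n_bins=5):
    """Cut points for bins with a roughly equal number of records."""
    s = sorted(values)
    return [s[(k * len(s)) // n_bins] for k in range(1, n_bins)]

def segment_sensitivity(records, deltas, n_bins=5):
    """Mean |delta| for each (field, bin) segment, plus the cut points."""
    n_fields = len(records[0])
    cuts = [thresholds([r[j] for r in records], n_bins) for j in range(n_fields)]
    grouped = {}
    for rec, row in zip(records, deltas):
        for j in range(n_fields):
            b = bisect_right(cuts[j], rec[j])
            grouped.setdefault((j, b), []).append(abs(row[j]))
    sens = {seg: mean(vals) for seg, vals in grouped.items()}
    return cuts, sens

def reason_codes(rec, cuts, sens, names, top=2):
    """The record's own segments, most sensitive first."""
    segs = [(j, bisect_right(cuts[j], rec[j])) for j in range(len(rec))]
    segs.sort(key=lambda seg: -sens.get(seg, 0.0))
    return [f"{names[j]} is {LABELS[b]}" for j, b in segs[:top]]

# Hypothetical model and data
def toy_model(x):
    return x[0] ** 2 + 20.0 * x[1]

names = ["income", "debt"]
records = [[float(i), float(10 - i)] for i in range(1, 11)]
deltas = [[toy_model([r[0] * 1.05, r[1]]) - toy_model(r),
           toy_model([r[0], r[1] * 1.05]) - toy_model(r)] for r in records]

cuts, sens = segment_sensitivity(records, deltas)
print(reason_codes([10.0, 0.0], cuts, sens, names, top=1))
print(reason_codes([1.0, 9.0], cuts, sens, names, top=1))
```

The two example records receive different leading reasons because each is scored only against the sub-ranges it actually falls in.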
Questions?
Greg_Makowski@yahoo.com
5. Model Training Demo/Lab with HMEQ (Home Equity) Data
Line of credit loan application, using existing home as loan equity.
5,960 records

COLUMN    DATA ROLE     DESCRIPTION
rec_ID    Key           Record ID or key field, for each line of credit loan or person
BAD       Target        After 1 year, loan went into default (=1, 20%) vs. still being paid (=0)
CLAGE     Applicant     Credit Line Age, in months (for another credit line)
CLNO      Applicant     Credit Line Number
DEBTINC   Applicant     Debt to Income ratio
DELINQ    Applicant     Number of delinquent credit lines
DEROG     Applicant     Number of major derogatory reports
JOB       Applicant     Job, 6 occupation categories
LOAN      Loan applic   Requested loan amount
MORTDUE   Property      Amount due on existing mortgage
NINQ      Applicant     Number of recent credit inquiries
REASON    Loan applic   “DebtCon” = debt consolidation, “HomeImp” = home improvement
VALUE     Property      Value of current property
YOJ       Applicant     Years on present job

https://inclass.kaggle.com/c/pred-411-2016-04-u2-bonus-hmeq/data?heloc.csv
Rules	or	Queries	to	Detectors
Simple	Example
Select 1 as detect_prospect                             (result field has 0 or 1 values)
where (.6 < recency) and
      (.7 < frequency) and
      (.3 < time)
Select recency + frequency + time as detect_prospect    (has 100’s of values in the [0..1] range)
where (.6 < recency) and
      (.7 < frequency) and
      (.3 < time)
Develop “fuzzy” detectors, result in [0..1]
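The same two detectors can be written as plain functions. In this sketch the function names are hypothetical, and the fuzzy version averages the three inputs instead of summing them, which is an assumption made so that the result stays within [0..1].

```python
def detect_prospect_binary(recency, frequency, time):
    """Rule form: the result is only ever 0 or 1."""
    return 1 if (0.6 < recency and 0.7 < frequency and 0.3 < time) else 0

def detect_prospect_fuzzy(recency, frequency, time):
    """Graded form: same filter, but the score reflects how strong the match is."""
    if 0.6 < recency and 0.7 < frequency and 0.3 < time:
        return (recency + frequency + time) / 3.0  # averaging is an assumption
    return 0.0

print(detect_prospect_binary(0.9, 0.8, 0.4))
print(detect_prospect_fuzzy(0.9, 0.8, 0.4))
print(detect_prospect_fuzzy(0.65, 0.75, 0.35))
```

Two prospects that both pass the filter now receive different scores, which gives a downstream model more to work with than a single 0/1 flag.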
Compound	Detectors	
Implemented	as	a	Lookup	Table				(in	this	case,	same	for	all	people)
• This illustrates the process of creating a detector
• Let’s not debate now about specific values
• Don’t need perfection
• Dozens of reasonable detectors are powerful
• If a user is failing login attempts over more applications, that is more suspicious (virus intrusion?)
• Joe failed logging in over 3 applications, 8 times in 5 minutes
→ failed_log_risk = 0.6
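A lookup-table detector of this kind might look like the sketch below. Only Joe's case (3 applications, 8 failures in 5 minutes, risk 0.6) comes from the slide; every other cell value, the bucket edges, and the names `APP_EDGES`, `FAIL_EDGES` and `RISK_TABLE` are hypothetical placeholders, and the failure counts are assumed to be taken over a 5-minute window.

```python
from bisect import bisect_right

APP_EDGES = [2, 3]     # buckets: 1 application, 2 applications, 3 or more
FAIL_EDGES = [5, 20]   # buckets: fewer than 5, 5 to 19, 20 or more failures

# Rows: application bucket. Columns: failure bucket.
# Hypothetical values, except row 2 / column 1 which matches Joe's 0.6.
RISK_TABLE = [
    [0.0, 0.3, 0.7],
    [0.0, 0.4, 0.8],
    [0.1, 0.6, 0.9],
]

def failed_log_risk(n_applications, n_failures):
    """Look up the risk score for a user's failed logins in the window."""
    row = bisect_right(APP_EDGES, n_applications)
    col = bisect_right(FAIL_EDGES, n_failures)
    return RISK_TABLE[row][col]

print("Joe:", failed_log_risk(3, 8))
print("one application, 8 failures:", failed_log_risk(1, 8))
```

The table is the same for all people here; a per-user or per-role table would only change how `RISK_TABLE` is selected, not the lookup itself.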
