Data Augmentation and
Disaggregation
Neal Fultz
nfultz@system1.com
July 26, 2017
https://goo.gl/6uQrss
● Many data sets are only available in aggregated form
○ Precluding use of stock statistics / ML directly.
● Data augmentation, a classic tool from Bayesian computation,
can be applied more generally.
○ Disaggregating across and within observations
Executive Summary
Part 1: Motivating Example
A Data Set
n Price
42 2.406
33 2.283
10 2.114
10 2.815
2 1.691
1 2.033
1 2.061
1 0.133
1 0.627
● Like to use Price ~ LN(μ, σ²)
○ Lognormal has nice interpretation as random walk of ± %
○ Also won’t go negative
○ Common Alternatives: Exponential, Gamma
● Estimate both parameters for later use
● Actually, we want to do so for 10k items
Modeling Price
Log-normal Recap
● If Y ~ N(μ, σ²), X = exp(Y) ~ LN(μ, σ²)
● E(X) = exp(μ + σ²/2)
● Var(X) = [exp(σ²) - 1] exp(2μ + σ²)
● Standard estimators:
○ MoM - uses log of mean of X
○ MLE - uses mean of log X
Log-normal Recap
● Method of Moments
○ s² = ln(Σ X² / N) - 2 ln(Σ X / N)
○ m = ln(Σ X / N) - s²/2
● Maximum Likelihood
○ m = Σ ln X / N
○ s² = Σ (ln X - m)² / N
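Both estimators are easy to sketch directly (a stdlib-only Python sketch; function and variable names are mine):

```python
import math

def lognormal_mom(xs):
    """Method of moments: matches the raw first and second moments of X."""
    n = len(xs)
    m1 = sum(xs) / n                 # Σ X / N
    m2 = sum(x * x for x in xs) / n  # Σ X² / N
    s2 = math.log(m2) - 2 * math.log(m1)
    m = math.log(m1) - s2 / 2
    return m, s2

def lognormal_mle(xs):
    """Maximum likelihood: mean and variance of log X."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    m = sum(logs) / n
    s2 = sum((l - m) ** 2 for l in logs) / n
    return m, s2
```

Note the asymmetry: MoM takes the log of averages, MLE averages the logs. By Jensen's inequality these differ, which is exactly what goes wrong on aggregated data below.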
Estimation v0.1
What if we just ignore n, and plug in hourly averages to our estimators?
=> Gives equal weight to (n=1, $=0.133) as to (n=42, $=2.406)
=> Everything biased towards the small obs
Estimation v0.2
What if we just plug in weighted sample averages?
● Method of Moments:
○ m = 0.342, s² = 0.996
○ Expected Value: exp(.342 + .996/2) = 2.32
● Maximum Likelihood:
○ m = 0.811, s² = 0.105
○ Expected Value: exp(.811 + .105/2) = 2.37
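The weighted MLE line can be reproduced straight from the table (a sketch; the list literal transcribes the (n, price) rows of the data set):

```python
import math

# (n, group-average price) rows from the data set
data = [(42, 2.406), (33, 2.283), (10, 2.114), (10, 2.815),
        (2, 1.691), (1, 2.033), (1, 2.061), (1, 0.133), (1, 0.627)]

N = sum(n for n, _ in data)  # 101 underlying prices
# weighted MLE: treat each group average as if it were n identical prices
m = sum(n * math.log(x) for n, x in data) / N
s2 = sum(n * (math.log(x) - m) ** 2 for n, x in data) / N
ev = math.exp(m + s2 / 2)
# m ≈ 0.811, s2 ≈ 0.10, expected value ≈ 2.37
```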
Are these trustworthy?
To check if these make sense:
● Simulate from both estimates as ground truth
● Apply both estimators
● Inspect bias
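A minimal version of this check (a stdlib-only sketch; it takes the MoM estimate as ground truth per the slide, uses the group sizes from the data set, and puts the v0.2 weighted MLE for m under test):

```python
import math
import random

def weighted_mle_m(weights, group_means):
    """The v0.2 plug-in: weighted mean of the logs of the group averages."""
    N = sum(weights)
    return sum(n * math.log(g) for n, g in zip(weights, group_means)) / N

def mean_recovered_m(mu, s2, weights, n_rep=200, seed=1):
    """Simulate group averages from LN(mu, s2) ground truth, re-estimate m
    each time, and return the average estimate across replications."""
    rng = random.Random(seed)
    sigma = math.sqrt(s2)
    total = 0.0
    for _ in range(n_rep):
        means = [sum(rng.lognormvariate(mu, sigma) for _ in range(n)) / n
                 for n in weights]
        total += weighted_mle_m(weights, means)
    return total / n_rep

weights = [42, 33, 10, 10, 2, 1, 1, 1, 1]
m_hat = mean_recovered_m(0.342, 0.996, weights)
# m_hat comes back well above the true mu = 0.342: the plug-in is biased
```

The bias has a clean explanation: the log of an average exceeds the average of logs (AM-GM), so every aggregated group drags the estimate of m upward.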
Why are these not working?
● Many distributions are additive
○ N(0,1) + N(1,1) => N(1,2)
○ Pois(4) + Pois(5) => Pois(9)
● Log Normal is not!
○ So (n=42, $=2.406) is not LN, even if individual prices are
○ It is in fact a marginal distribution
■ contains 41 integrals :(
● What about CLT?
○ Even if (n=42) is approx N, (n=10) and (n=2) are probably not
A Data Set
n Price
42 2.406
33 2.283
10 2.114
10 2.815
2 1.691
1 2.033
1 2.061
1 0.133
1 0.627
Part 1
Main Points
Violate iid at your own risk!
● Do NOT plug and chug
● Do NOT expect weights will fix your problem
● Do NOT use predictive models
● Do NOT use multi-armed bandits
Get better, unaggregated data!
… but if you can’t ...
Part 2: Data Augmentation
A Data Set
n Price
42 2.406
33 2.283
10 2.114
10 2.815
2 1.691
1 2.033
1 2.061
1 0.133
1 0.627
Long format….
id Group Price
1 1 2.406
2 1 2.406
3 1 2.406
... ... ...
96 5 1.691
97 5 1.691
98 6 2.033
99 7 2.061
100 8 0.133
101 9 0.627
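Expanding the aggregate rows into long format is mechanical (a sketch; each missing price starts at its group average, to be replaced later by MCMC draws consistent with the group totals):

```python
def to_long(agg):
    """Expand (n, avg price) rows into one row per underlying price."""
    rows, uid = [], 0
    for group, (n, price) in enumerate(agg, start=1):
        for _ in range(n):
            uid += 1
            rows.append({"id": uid, "group": group, "price": price})
    return rows

agg = [(42, 2.406), (33, 2.283), (10, 2.114), (10, 2.815),
       (2, 1.691), (1, 2.033), (1, 2.061), (1, 0.133), (1, 0.627)]
long_rows = to_long(agg)  # 101 rows, ids 1..101, groups 1..9
```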
Estimation
● MCMC using stock methods, e.g. Metropolis-Hastings
● MH requires:
○ Target Distribution / probability model
○ State transition functions / proposal distributions
● MH outputs:
○ Numerical samples from target distribution
Proposal Distribution
● Transitions on m and s² - easy
● Transitions on the T missing prices?
○ hourly constraints on total $
■ Don’t want to propose out-of-bounds
○ Option 1 - draw from a Dirichlet,
■ use that to disaggregate, transition whole hours at once
■ Big steps => lots of rejections
○ Option 2 - pairwise transitions within group
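Option 2 can be sketched as follows (assumptions mine: prices must stay positive, and delta is drawn from a symmetric Gaussian so the proposal needs no Hastings correction):

```python
import random

def pairwise_proposal(prices, i, j, scale=0.1, rng=random):
    """Shift an amount delta from price i to price j within one group, so
    the group total (hence the observed average) is exactly preserved."""
    delta = rng.gauss(0, scale)
    if prices[i] - delta <= 0 or prices[j] + delta <= 0:
        return None  # out of bounds -> MH treats this as a rejection
    out = list(prices)
    out[i] -= delta
    out[j] += delta
    return out
```

Rejecting out-of-bounds moves keeps the chain inside the constraint set; a smaller `scale` trades slower mixing for fewer rejections, mirroring the Dirichlet-vs-pairwise trade-off above.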
Part 2
Main Points
Switching from aggregated to long format shows
aggregation can be thought of as a form of missing data.
However, group averages => constraints on the missing data.
In our example data, 97/101 points are missing,
but we can still get reasonable estimates via MCMC
Part 3: Disaggregation
A Data Set
n Price
42 2.406
33 2.283
10 2.114
10 2.815
2 1.691
1 2.033
1 2.061
1 0.133
1 0.627
Additional Challenges
What if aggregation is over multiple heterogeneous groups, and we need
to split the money between the groups (“disaggregate”)?
Do we know the split a priori?
What if we don’t?
A Grouped Data Set
Known Groups
Desktop Mobile Price
38 4 2.406
27 6 2.283
2 8 2.114
6 4 2.815
0 2 1.691
0 1 2.033
1 0 2.061
1 0 0.133
0 1 0.627
Common Heuristics
● Linear disaggregation
○ Weighted averages by another name
○ Doesn’t account for variation in other columns
● Iterative Proportional Fitting
○ If you have subtotals in all dimensions
○ Alternates disaggregating by rows/columns
Desktop Mobile Price
38 4 2.406
27 6 2.283
2 8 2.114
6 4 2.815
0 2 1.691
0 1 2.033
1 0 2.061
1 0 0.133
0 1 0.627
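For contrast with the Bayesian approach, a generic IPF sketch (toy marginals of my own choosing; assumes row and column totals are consistent):

```python
def ipf(seed, row_totals, col_totals, n_iter=50):
    """Iterative Proportional Fitting: alternately rescale rows and columns
    of a seed matrix until its margins match the given totals."""
    m = [list(row) for row in seed]
    for _ in range(n_iter):
        for i, rt in enumerate(row_totals):      # fit row sums
            s = sum(m[i])
            if s > 0:
                m[i] = [x * rt / s for x in m[i]]
        for j, ct in enumerate(col_totals):      # fit column sums
            s = sum(row[j] for row in m)
            if s > 0:
                for row in m:
                    row[j] *= ct / s
    return m
```

Note IPF only ever sees the margins; unlike the Bayesian model, it cannot borrow strength from the price column.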
Long format….
id Group mobile Price
1 1 1 2.406
2 1 0 2.406
3 1 1 2.406
... ... ... ...
96 5 1 1.691
97 5 1 1.691
98 6 1 2.033
99 7 0 2.061
100 8 0 0.133
101 9 1 0.627
A Grouped Data Set
Unknown Groups
n Prime Sub Price
42 ? ? 2.406
33 ? ? 2.283
10 ? ? 2.114
10 ? ? 2.815
2 ? ? 1.691
1 ? ? 2.033
1 ? ? 2.061
1 ? ? 0.133
1 ? ? 0.627
A Grouped Data Set
Unknown Groups
n Prime Sub Price
42 30 12 2.406
33 23 9 2.283
10 7 3 2.114
10 8 2 2.815
2 2 0 1.691
1 1 0 2.033
1 1 0 2.061
1 0 1 0.133
1 0 1 0.627
Part 3
Main Points
By extending the previous model, we can deal with
“heterogeneous aggregates”.
If the grouping variable is known, solve like a regression problem.
If not known / latent, solve it like a mixture problem.
Either way, going Bayes lets you borrow strength between aggregates,
which disaggregation heuristics are not good at.
Questions?