A Notional Framework for a Theory of Data Systems
Maggie Johnson
Joint with members of the ToDS subgroup
of the SAMSI CLIM Remote Sensing Working Group
Workshop on Remote Sensing, Uncertainty Quantification,
and a Theory of Data Systems
February 12, 2018
M. Johnson Remote Sensing Workshop February 12, 2018 1 / 22
Motivation
Motivation for this workshop:
...data must be brought together in some way ... but moving data to a
central location for analysis is tedious at best and impossible at worst.
Some (remote) data reduction is almost certainly necessary, but how
much? What are the consequences for inference?
...how to navigate the trade-space between computational, transmission
and infrastructure costs versus uncertainty (a.k.a. “statistical costs”) in
the estimates or inferences that are ultimately produced.
In other words, can we integrate the design of data systems with the design of
statistical methodology to balance the various tradeoffs in these costs?
Can this be formulated as a well-specified optimization problem?
What are the costs?
1 Computational
Number of operations, memory, time, etc.
2 Statistical
variance, prediction error, etc.
3 Transmission/Data Movement
bandwidth, latency, money, privacy, etc.
4 System Infrastructure/Design
data storage, types of connections, compute resources, etc.
5 . . .
From the Software/System Architect’s Perspective
In designing a data system, architects consider the infrastructural costs and how
the design of the data system affects how data can be manipulated and moved
throughout the system:
how to stage data across servers?
where to build connections, and how fast do they need to be?
how to deploy compute resources?
which services on which machines?
privacy?
From the Statistician’s/Data Scientist’s Perspective
In designing a statistical analysis, statisticians/data scientists are familiar with the
ideas of balancing the tradeoffs between the quality of a statistical analysis and
the computational costs of that analysis.
how much data, which data, where to move data?
which methodology?
what are the tradeoffs in efficiency of estimators/quality of inference
(uncertainty)?
Statistical analyses of distributed data depend on how data can be accessed,
computational resources, etc. (i.e., the design of the data system).
A Theory of Data Systems
The simultaneous optimization of the data system architecture and the statistical
methodology balancing the tradeoffs in costs, for a given data analysis objective.
In theory, in order to do this we need to:
1 be able to quantify all of the various costs of performing data analysis in a
distributed setting
Many of the costs are very difficult to quantify
2 solve a highly complex, constrained, multi-objective optimization problem
competing objectives
3 choose a solution with costs we are willing to accept from a set of Pareto
optimal solutions
i.e., “choose your battles”
Illustration with a Toy Example
Data System Setup
$J$ servers, each with $N_j$ observations ($j = 1, \ldots, J$)
Assume only the user has computational resources
Cost to access the $j$th server is $a_j$, and the cost to move a data value from server $j$ to the user is $b_j$
$n_j$ is the number of observations downloaded from server $j$ to the user
Data Analysis Objective
The statistical objective is to perform inference on the population mean from data
distributed across J servers, with the following statistical properties
Let $Y_{ij}$ be the $i$th observation on server $j$; assume $E(Y_{ij}) = \mu$, $\mathrm{Var}(Y_{ij}) = 1$
Correlation between two observations on the same server is $\phi$
Correlation between an observation on server $j$ and one on server $k$ is $\rho^{|j-k|}$
$\phi$ and $\rho$ are assumed known
Goal is to perform inference on $\mu$ using the sample mean
$$\bar{Y}_n = \Big( \sum_{j=1}^{J} n_j \Big)^{-1} \sum_{j=1}^{J} \sum_{i=1}^{n_j} Y_{ij}$$
computed from $n = \{n_1, \ldots, n_J\}$ observations as the estimator.
The Costs
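1 Statistical Cost (squared error loss → minimize variance):
$$C_{st}(n) = \mathrm{Var}(\bar{Y}_n) = N_n^{-2} \sum_{j=1}^{J} \Big[ n_j + \phi (n_j^2 - n_j) + \sum_{k \neq j} n_j n_k \rho^{|j-k|} \Big]$$
where $N_n = \sum_{j=1}^{J} n_j$ is the total number of downloaded observations.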
Given (assumed known) φ and ρ, the statistical cost depends only on the
amount of data downloaded from each server.
2 Infrastructure/Design Cost:
$$C_{ds}(a, b) = \sum_{j=1}^{J} \big( a_j^{-1} + b_j^{-0.5} \big)$$
Meant to penalize small $a_j$ and $b_j$ (i.e., it is expensive to build a faster
connection)
Idea is that more resources should be allocated to servers where we need to
download more data.
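The two cost functions above can be evaluated directly. The following is a minimal Python sketch, not the authors' code (the deck itself uses R); the function names `statistical_cost` and `design_cost` are hypothetical. The variance follows from summing all pairwise covariances: each observation contributes variance 1, each same-server pair contributes $\phi$, and each cross-server pair contributes $\rho^{|j-k|}$.

```python
def statistical_cost(n, phi, rho):
    """Variance of the pooled sample mean for download counts n = [n_1, ..., n_J].

    Assumes unit variances, within-server correlation phi, and
    between-server correlation rho ** |j - k| (the toy example's model).
    """
    total = sum(n)
    if total == 0:
        raise ValueError("at least one observation must be downloaded")
    acc = 0.0
    for j, nj in enumerate(n):
        # variances plus within-server covariances
        acc += nj + phi * (nj * nj - nj)
        # between-server covariances with every other server k != j
        for k, nk in enumerate(n):
            if k != j:
                acc += nj * nk * rho ** abs(j - k)
    return acc / total ** 2


def design_cost(a, b):
    """Infrastructure cost: small access/transfer costs are expensive to build."""
    return sum(1.0 / aj + bj ** -0.5 for aj, bj in zip(a, b))


# Downloading everything in the setting phi = 0.5, rho = 0.1, N_j = 100, J = 5.
full_download = statistical_cost([100] * 5, phi=0.5, rho=0.1)
print(round(full_download, 4))  # 0.1356
```

Because $\phi = 0.5$ keeps the within-server term near 0.5 per server regardless of sample size, the variance cannot be driven to zero by downloading more data from the same servers.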
The Costs
3 Data Movement & Computation Cost:
Define data movement costs for $n = \{n_1, \ldots, n_J\}$ observations as
$$\sum_{j=1}^{J} \big( a_j I(n_j > 0) + b_j n_j \big)$$
Computational complexity is $O\big( \sum_{j=1}^{J} n_j \big)$
Combine both into a cost function for data movement and computation:
$$C_c(a, b, n) = \sum_{j=1}^{J} \big( a_j I(n_j > 0) + b_j n_j \big) + \sum_{j=1}^{J} n_j$$
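As a rough illustration of the combined cost, the sketch below splits it into its three parts: a one-time access charge for each server that is actually used, a per-value transfer charge, and a computation term proportional to the number of values processed. The function name and the input values are hypothetical.

```python
def movement_compute_cost(a, b, n):
    """Data movement plus computation cost for access costs a,
    per-value transfer costs b, and download counts n."""
    # access cost is paid only for servers with n_j > 0 (the indicator I(n_j > 0))
    access = sum(aj for aj, nj in zip(a, n) if nj > 0)
    # transfer cost grows linearly with the number of values moved
    transfer = sum(bj * nj for bj, nj in zip(b, n))
    # computing the sample mean is linear in the number of values
    compute = sum(n)
    return access + transfer + compute


# Two servers; nothing is downloaded from the first, so its access cost is skipped.
print(movement_compute_cost([10, 20], [2, 3], [0, 4]))  # 20 + 12 + 4 = 36
```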
Multiobjective Optimization
The optimal distributed analysis for the toy example is a solution that jointly
minimizes the costs associated with the statistical analysis and the data system
infrastructure.
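$$\begin{aligned}
\underset{n,\, a,\, b}{\text{minimize}} \quad & C_{ds}(a, b),\; C_{st}(n),\; C_c(a, b, n) \\
\text{subject to} \quad & a_j \in (c, d) \\
& b_j \in (e, f) \\
& n_j \in \mathbb{N} \\
& n_j \le N_j
\end{aligned}$$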
For the toy example, this optimization is feasible.
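With several objectives there is generally no single minimizer; the solution is the set of designs that no other design beats on every cost at once. A minimal sketch of that idea is a brute-force non-dominance filter over candidate cost vectors. This is not the solver used in the deck, and the candidate triples below are made-up numbers ordered as (design cost, statistical cost, movement cost).

```python
def dominates(p, q):
    """True if cost vector p is no worse than q everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


def pareto_front(points):
    """Keep the cost vectors that no other candidate dominates (all costs minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative candidates only; the third is worse than the first on two costs
# and equal on the other, so it is dropped.
candidates = [(1, 3, 2), (2, 2, 2), (2, 3, 3)]
print(pareto_front(candidates))  # [(1, 3, 2), (2, 2, 2)]
```

The filter is quadratic in the number of candidates, which is fine for a small grid of designs but would not scale to a fine discretization of all of $n$, $a$ and $b$.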
The Pareto Front
Let $\phi = 0.5$, $\rho = 0.1$, $N_j = 100$, $J = 5$, $a_j \in (1, 50)$, $b_j \in (1, 20)$. Using the R
package nloptr:
“Choosing your Battles”
Suppose we wish to keep computational/data movement costs low (e.g. < 2000).
High statistical accuracy ($\mathrm{Var}(\bar{Y}_n) = 0.13$)
Trades off with expensive data system design ($C_{ds} = 5$)
“Choosing your Battles”
Suppose we wish to keep computational/data movement costs low (e.g. < 2000).
Cheap data system design (e.g. $C_{ds} < 2$)
Trades off with reduced statistical accuracy ($\mathrm{Var}(\bar{Y}_n) = 0.14$)
Effect of the Statistical Properties of the Data
Let $\phi = 0.5$, $\rho = 0.4$; recall that the correlation between two servers $j$ and $k$ is $\rho^{|j-k|}$.
It is more efficient to sample from servers far away from each other
More resources are then focused on these servers
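A short worked case shows why distance helps. Assume, purely for illustration, that $m$ observations are downloaded from each of two servers $j$ and $k$ and none from the others; the choice $m = 100$ is an assumption, not a value from the slides. Substituting into the variance formula for the sample mean gives:

```latex
\begin{aligned}
\mathrm{Var}(\bar{Y}_n)
  &= \frac{2m + 2\phi\,(m^2 - m) + 2 m^2 \rho^{|j-k|}}{(2m)^2}
   = \frac{1 + \phi\,(m - 1) + m\,\rho^{|j-k|}}{2m}, \\[4pt]
|j-k| = 1:\quad
  &\frac{1 + 0.5 \cdot 99 + 100 \cdot 0.4}{200} = \frac{90.5}{200} = 0.4525, \\[4pt]
|j-k| = 4:\quad
  &\frac{1 + 0.5 \cdot 99 + 100 \cdot 0.4^{4}}{200} = \frac{53.06}{200} = 0.2653 .
\end{aligned}
```

For $0 < \rho < 1$ the term $m\,\rho^{|j-k|}$ shrinks as the servers move apart, so the same download budget yields a smaller variance when it is spent on distant servers.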
Alternative Formulations of the Optimization Problem
Knowledge/decisions about the acceptable tradeoffs in costs can reduce the
optimization problem.
If all costs are equally important, combine the costs into a single objective
function
There are multiple subproblems given prior decisions on acceptable costs.
For example, in the toy example, set $C_c < 2000$ as a constraint rather than
including the cost in the objective function
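The constrained reformulation can be sketched as follows: fix the data system, treat the movement and computation cost as a budget, and minimize only the variance. The Python sketch below uses exhaustive search over a coarse grid of download counts in place of a numerical optimizer. The fixed values of `a` and `b`, the grid `levels`, and all function names are assumptions made for illustration.

```python
from itertools import product


def variance(n, phi, rho):
    """Statistical cost C_st(n) of the sample mean under the toy correlation model."""
    total = sum(n)
    acc = sum(nj + phi * (nj * nj - nj) for nj in n)
    acc += sum(nj * nk * rho ** abs(j - k)
               for j, nj in enumerate(n)
               for k, nk in enumerate(n) if j != k)
    return acc / total ** 2


def movement_cost(a, b, n):
    """Combined cost C_c(a, b, n): access, transfer, and computation."""
    return sum((aj if nj > 0 else 0) + bj * nj + nj
               for aj, bj, nj in zip(a, b, n))


def best_under_budget(a, b, phi, rho, levels, budget):
    """Search the grid of download counts for the lowest variance with C_c < budget."""
    best = None
    for n in product(levels, repeat=len(a)):
        if sum(n) == 0 or movement_cost(a, b, n) >= budget:
            continue  # skip the empty design and anything over budget
        v = variance(n, phi, rho)
        if best is None or v < best[0]:
            best = (v, n)
    return best


# Hypothetical fixed data system: equal access and transfer costs on J = 5 servers.
a = [10.0] * 5
b = [5.0] * 5
best_var, best_n = best_under_budget(
    a, b, phi=0.5, rho=0.1, levels=(0, 25, 50, 75, 100), budget=2000
)
print(best_n, round(best_var, 4))
```

Turning one objective into a constraint reduces the problem to a single-objective search, at the price of committing to the budget before seeing the full set of tradeoffs.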
The Difference between Theory and Practice
“Good Enough” Solution
In practice, it may only be realistic to obtain a solution which achieves
“acceptable” costs.
The statistical design may depend on unknown properties of the data (e.g.
unknown φ and ρ)
Not feasible to build the data system at every iteration of an optimization
algorithm
Quantifying statistical performance may in itself be computationally
demanding, and/or require data movement
Multiple analysis objectives
The Difference between Theory and Practice
“Good Enough” Solution
In practice, one might iterate between the data system design and the statistical
design, to find a “good enough” solution.
1 Start with a distributed data system design, learn about the data
2 Given preliminary knowledge of statistical properties of the data, update the
data system architecture
3 Given the new data system architecture, update the statistical design
4 . . .
DAWN provides a potential framework to simulate this procedure.
A Few Next Steps
For the toy example:
Allow connections between servers and distributed computation
Analysis objectives beyond inference about the mean
Incorporation of more realistic costs/constraints
Maximum likelihood estimation of the parameters of a covariance function
Multilayer networks as a framework for organizing the optimization problem
Closing Thoughts
To continue to scale data analyses to the ever-growing size of data, we
need to be able to exploit distributed data system architecture.
Requires understanding and accounting for the tradeoffs in the costs
associated with distributed data analysis and inferential quality for both data
system design and the design of the data analysis.
A Theory of Data Systems requires collaboration between statisticians, computer
scientists, data system architects, software engineers, and more.
Understanding realistic costs for all aspects of distributed data analysis
requires expert knowledge in each area.
Thank You!
