Simulation Informatics!
Analyzing Large Datasets
from Scientific Simulations

DAVID F. GLEICH, PURDUE UNIVERSITY COMPUTER SCIENCE DEPARTMENT
PAUL G. CONSTANTINE, STANFORD UNIVERSITY
JOE RUTHRUFF & JEREMY TEMPLETON, SANDIA NATIONAL LABS





David Gleich · Purdue CS&E Seminar
This talk is a story …




How I learned to stop
worrying and love the
simulation!




I asked …!
Can we do UQ on
PageRank?




PageRank by Google

The Model
1. Follow edges uniformly with probability α, and
2. randomly jump with probability 1 − α; we'll assume everywhere is
   equally likely.

[Figure: a six-node directed graph illustrating the random surfer]

The places we find the surfer most often are important pages.

David F. Gleich (Sandia) · PageRank intro
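The model above fits in a few lines of numpy; here is a minimal power-iteration sketch on a hypothetical six-node graph (the edge list is made up for illustration, not read off the slide's figure):

```python
import numpy as np

def pagerank(P, alpha=0.85, v=None, tol=1e-10):
    """Power iteration for x = alpha*P*x + (1-alpha)*v,
    where P is column-stochastic: column j spreads node j's
    rank uniformly over its out-edges."""
    n = P.shape[0]
    v = np.ones(n) / n if v is None else v
    x = v.copy()
    for _ in range(1000):
        x_new = alpha * (P @ x) + (1 - alpha) * v
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# A toy six-node graph (hypothetical edges).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (3, 4), (4, 5), (5, 3)]
n = 6
P = np.zeros((n, n))
for i, j in edges:
    P[j, i] = 1.0          # column-stochastic convention: P[to, from]
P /= P.sum(axis=0)         # normalize each column by out-degree
x = pagerank(P)
```

The returned `x` is a probability vector; its largest entries mark the "important pages."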
Random alpha PageRank (RAPr),
or PageRank meets UQ

    (I − A P) x(A) = (1 − A) v

Model PageRank as the random variable x(A) — the sensitivity to the
jump parameter is examined and understood — and look at

    E[x(A)] and Std[x(A)].

Explored in Constantine and Gleich, WAW 2007; and
Constantine and Gleich, J. Internet Mathematics 2011.
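A Monte Carlo sketch of E[x(A)] and Std[x(A)], assuming A follows a Beta distribution on (0, 1); the Beta(16, 4) parameters and the random graph here are illustrative stand-ins, not the settings from the papers:

```python
import numpy as np

def pagerank(P, alpha, v):
    # direct linear solve of (I - alpha*P) x = (1 - alpha) v
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * P, (1 - alpha) * v)

rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n))
P /= P.sum(axis=0)                    # a random column-stochastic matrix
v = np.ones(n) / n

alphas = rng.beta(16, 4, size=200)    # samples of the random alpha A
samples = np.array([pagerank(P, a, v) for a in alphas])
Ex = samples.mean(axis=0)             # estimate of E[x(A)]
Sx = samples.std(axis=0)              # estimate of Std[x(A)]
```

The 1/√N rate in the table below is exactly the cost of this approach: each extra digit of accuracy needs 100× more PageRank solves.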
Random alpha PageRank has
a rigorous convergence theory.

Method                 Conv.      Work Required           What is N?
Monte Carlo            1/√N       N PageRank systems      number of samples from A
Path Damping           r^(N+2)    N + 1 matrix-vector     terms of Neumann series
(without Std[x(A)])               products
Gaussian Quadrature    r^(2N)     N PageRank systems      number of quadrature points

Here r is a parameter from the Beta(a, b, l, r) distribution for A.
Working with
PageRank showed us
how to treat UQ more
generally …




We studied
parameterized
matrices.

    A(s) x(s) = b(s)

Parameterized solution: solve

    A(s_1) x(s_1) = b(s_1), ..., A(s_N) x(s_N) = b(s_N),

or the spectral Galerkin system

    A_N(s_1) x_N(s_1) = b_N(s_1), ...

Discretized PDE with explicit parameters.

Constantine, Gleich, and Iaccarino. Spectral Methods for
Parameterized Matrix Equations, SIMAX, 2010.

Constantine, Gleich, and Iaccarino. A factorization of the spectral
Galerkin system for parameterized matrix equations: derivation and
applications, SISC 2011.
How to compute the Galerkin solution in a weakly intrusive manner.
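For concreteness, a sample-and-solve sketch for A(s)x(s) = b(s) with a made-up affine family A(s) = A0 + s·A1 (the spectral Galerkin machinery in the papers replaces this brute-force loop; everything here is a toy):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
A0 = 4.0 * np.eye(n)                        # hypothetical mean operator
A1 = 0.1 * rng.standard_normal((n, n))      # hypothetical parameter direction
b = rng.standard_normal(n)

def solve_at(s):
    # one "run": solve the parameterized system at a fixed s
    return np.linalg.solve(A0 + s * A1, b)

s_pts = np.linspace(-1.0, 1.0, 11)          # design points s_1, ..., s_N
X = np.column_stack([solve_at(s) for s in s_pts])   # solution database
```

The columns of `X` are exactly the kind of precomputed-solution database the rest of the talk builds interpolants from.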
Simulation!
The Third Pillar of Science
21st Century Science in a nutshell!
    Experiments are not practical or feasible.
    Simulate things instead.
But do we trust the simulations?!

We’re trying!
    Model Fidelity
    Verification & Validation (V&V)
    Uncertainty Quantification (UQ)




The message
Insight and confidence
requires multiple runs.




The problem
A simulation run ain’t cheap!




Another problem
It’s very hard to “modify”
current codes.




Large scale nonlinear, time
dependent heat transfer problem
                    10^5 nodes
                    10^3 time steps
                    30 minutes on 16 cores

                    
                    Questions
                    What is the probability of failure? 
                    Which input values cause failure?




It’s time to ask "
What can science
learn from Google?"
"
                      
- Wired Magazine (2008)




"We can throw the numbers
into the biggest computing
clusters the world has ever
seen and let statistical
algorithms find patterns
where science cannot."
- Wired (again)

21st Century Science in a nutshell?
    Simulations are too expensive.
    Let data provide a surrogate.
Our approach!
Construct an interpolating
reduced order model from a
budget-constrained ensemble of
runs for uncertainty and
optimization studies.




That is, we store the runs
Supercomputer → Data computing cluster → Engineer

Each multi-day HPC simulation generates gigabytes of data.
A data cluster can hold hundreds or thousands of old simulations …
… enabling engineers to query and analyze months of simulation data
for statistical studies and uncertainty quantification,
and build the interpolant from the pre-computed data.
The Database

Input parameters s → time history of simulation f:
    s1 -> f1
    s2 -> f2
    ...
    sk -> fk

The simulation as a vector

           [ q(x1, t1, s) ]
           [      ...     ]
           [ q(xn, t1, s) ]    A single simulation; each
           [ q(x1, t2, s) ]    block holds the whole mesh
    f(s) = [      ...     ]    at one time step.
           [ q(xn, t2, s) ]
           [      ...     ]
           [ q(xn, tk, s) ]

    X = [ f(s1) f(s2) ... f(sp) ]    The database as a matrix
The interpolant

Motivation!
Let the data give you the basis.

    X = [ f(s1) f(s2) ... f(sp) ]

This idea was inspired by the success of other reduced order models
like POD, and Paul's residual minimizing idea.

Then find the right combination

            r
    f(s) ≈  Σ  u_j α_j(s)
           j=1

where the u_j are the left singular vectors from X!
Why the SVD?!
Let's study a simple case.

        [ g(x1, s1)  g(x1, s2)  ...  g(x1, sp) ]
    X = [ g(x2, s1)     ...     ...     ...    ]
        [    ...        ...     ...     ...    ]
        [ g(xm, s1)     ...     ...  g(xm, sp) ]

      = U Σ V^T,

                    r                      r
    g(xi, sj)  =    Σ  U_iℓ σ_ℓ V_jℓ  =    Σ  u_ℓ(xi) σ_ℓ v_ℓ(sj)     split x and s
                   ℓ=1                    ℓ=1

For a general parameter s,

                r                                p
    g(xi, s) =  Σ  u_ℓ(xi) σ_ℓ v_ℓ(s),  v_ℓ(s) ≈ Σ  v_ℓ(sj) φ_j^(ℓ)(s)
               ℓ=1                              j=1

Treat each right singular vector as samples of the unknown basis
functions. Interpolate v any way you wish.
Method summary



Compute SVD of X!
Compute interpolant of right singular vectors
Approximate a new value of f(s)!




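The three steps above, sketched end-to-end on a synthetic one-parameter database (the model g, the grid sizes, and the choice of `np.interp` for "interpolate any way you wish" are all illustrative assumptions):

```python
import numpy as np

# Step 0: a database X whose columns are runs at parameters s_pts.
s_pts = np.linspace(0.0, 1.0, 20)
xs = np.linspace(0.0, 1.0, 200)                    # the "mesh"
X = np.sin(2 * np.pi * np.outer(xs, 1.0 + s_pts))  # toy model g(x, s)

# Step 1: compute the SVD of X.
U, sig, Vt = np.linalg.svd(X, full_matrices=False)
r = len(s_pts)      # keep every mode here; in practice truncate

# Steps 2-3: interpolate each right singular vector at the new s,
# then recombine with the left singular vectors.
def f_hat(s):
    alpha = np.array([sig[l] * np.interp(s, s_pts, Vt[l]) for l in range(r)])
    return U[:, :r] @ alpha

approx = f_hat(0.52)    # a "run" we never computed
```

At any of the training parameters, `f_hat` reproduces the stored run exactly; between them it gives the interpolated reduced order model.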
A quiz!
Which section would you rather
try and interpolate, A or B?

[Figure: two sections of a plotted curve, labeled A and B]
How predictable is a
singular vector?

Folk Theorem (O'Leary 2011)
The singular vectors of a matrix of "smooth" data
become more oscillatory as the index increases.

Implication!
The gradient of the singular vectors increases as
the index increases.

    v1(s), v2(s), ..., vt(s)      vt+1(s), ..., vr(s)
          Predictable                 Unpredictable
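The folk theorem is easy to observe numerically; a small check on a smooth synthetic kernel, counting sign changes of each right singular vector as a proxy for oscillation (the kernel is a made-up example):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100)
s = np.linspace(0.0, 1.0, 50)
X = 1.0 / (1.0 + np.outer(x, s))        # smooth data g(x, s) = 1/(1 + x s)

U, sig, Vt = np.linalg.svd(X, full_matrices=False)

def sign_changes(v):
    # number of times the vector crosses zero
    return int(np.sum(np.diff(np.sign(v)) != 0))

oscillation = [sign_changes(Vt[k]) for k in range(5)]
```

The count grows with the index, which is exactly why the later right singular vectors are the hard ones to interpolate.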
A refined method with
an error model

Don't even try to interpolate the unpredictable modes.

           t(s)                        r
    f(s) ≈  Σ  u_j α_j(s)    +         Σ      u_j σ_j η_j,    η_j ~ N(0, 1)
           j=1                     j=t(s)+1
          Predictable              Unpredictable

                       (     r                  )
    Variance[f] = diag (     Σ      σ_j² u_j u_j^T )
                       ( j=t(s)+1               )

But now, how to choose t(s)?
Our current approach to
choosing the predictability

t(s) is the largest τ such that

     1     τ
    ---    Σ  σ_i || ∂v_i/∂s ||  <  threshold
    σ_1   i=1
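A sketch of this cutoff rule on a synthetic database, using a finite-difference gradient and the max-abs norm for ||∂v_i/∂s|| (the norm, the threshold value, and the toy data are assumptions; the slide leaves them unspecified):

```python
import numpy as np

s_pts = np.linspace(0.0, 1.0, 30)
x_mesh = np.linspace(0.0, 1.0, 120)
X = np.cos(np.outer(x_mesh, 1.0 + 3.0 * s_pts))    # hypothetical database
U, sig, Vt = np.linalg.svd(X, full_matrices=False)

def predictability_cutoff(sig, Vt, s_pts, threshold):
    # largest tau with (1/sigma_1) * sum_{i<=tau} sigma_i * ||dv_i/ds|| < threshold
    total, tau = 0.0, 0
    for i in range(len(sig)):
        grad = np.gradient(Vt[i], s_pts)           # finite-difference dv_i/ds
        total += sig[i] * np.max(np.abs(grad)) / sig[0]
        if total < threshold:
            tau = i + 1
        else:
            break
    return tau

t = predictability_cutoff(sig, Vt, s_pts, threshold=5.0)
```

Raising the threshold can only keep more modes, which matches the intent: a looser error budget treats more singular vectors as predictable.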
An experimental test case

                                A heat equation
                                problem
                                
                                Two parameters
                                that control the
                                material properties




Experiments

20-point Latin hypercube sample
Our Reduced Order Model

[Figure: the reduced order model prediction, annotated where the
error is the worst, compared against the truth]
A Large Scale Example




Nonlinear heat transfer model
80k nodes, 300 time-steps
104 basis runs
SVD of 24m x 104 data matrix
 500x reduction in wall clock time
(100x including the SVD)




PART 2!





Tall-and-skinny
QR (and SVD)!
on MapReduce


Quick review of QR

Let A be an m-by-n real matrix with m ≥ n. Then A = QR, where
    Q is m-by-n orthogonal (Q^T Q = I), and
    R is n-by-n upper triangular.

    A  =  Q [ R ]
            [ 0 ]

Using QR for regression
The solution of min ||Ax − b|| is given by the solution of Rx = Q^T b.

QR is block normalization
"Normalize" a vector usually generalizes to computing R in the QR.

David Gleich (Sandia) · MapReduce 2011
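The regression recipe above in numpy, on synthetic data (the coefficients and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 4))                   # tall-and-skinny
coef = np.array([1.0, -2.0, 0.5, 3.0])
b = A @ coef + 0.01 * rng.standard_normal(500)      # noisy observations

Q, R = np.linalg.qr(A)              # thin QR: Q is 500x4, R is 4x4
x = np.linalg.solve(R, Q.T @ b)     # min ||Ax - b|| via R x = Q^T b
```

The triangular solve recovers the least-squares solution without ever forming A^T A, which is the numerical point of using QR here.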
Intro to MapReduce
Originated at Google for indexing web pages and computing PageRank.

The idea Bring the computations to the data.
Express algorithms in data-local operations.
Implement one type of communication: shuffle.
Shuffle moves all data with the same key to the same reducer.

[Figure: maps feed a shuffle, which feeds reducers]

Data scalable
Fault-tolerance by design
    Input stored in triplicate
    Map output persisted to disk before shuffle
    Reduce input/output on disk
Mesh point variance in MapReduce

[Figure: three runs, each with time steps T=1, T=2, T=3, stored as
separate records]
Mesh point variance in MapReduce

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same
   reducer.
3. Reducers just compute a numerical variance.

Bring the computations to the data!
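A pure-Python sketch of steps 1-3 (the function names, the in-memory shuffle, and the toy records are all made up for illustration; a real job would run under Hadoop streaming or hadoopy):

```python
from collections import defaultdict

def mapper(run_id, timestep_record):
    # timestep_record: list of (mesh_point_id, value) for one time step
    for point, value in timestep_record:
        yield point, value              # step 1: key by mesh point

def shuffle(pairs):
    # step 2: group all values with the same key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(point, values):
    # step 3: a plain numerical (population) variance
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return point, var

# Three runs, two mesh points each (toy data).
records = [("run1", [(0, 1.0), (1, 2.0)]),
           ("run2", [(0, 3.0), (1, 2.0)]),
           ("run3", [(0, 5.0), (1, 2.0)])]
pairs = [kv for rid, rec in records for kv in mapper(rid, rec)]
variances = dict(reducer(p, vs) for p, vs in shuffle(pairs))
```

No run ever sees another run's data until the shuffle groups values by mesh point, which is the "bring the computations to the data" pattern in miniature.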
Communication avoiding TSQR
(Demmel et al. 2008)

First, do QR factorizations of each local matrix A_i.
Second, compute a QR factorization of the new "R".

Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
Fully serial TSQR
(Demmel et al. 2008)

Compute QR of A_1, read A_2, update QR, …
Tall-and-skinny matrix
storage in MapReduce

Key is an arbitrary row-id.
Value is the array for a row.
Each submatrix A_i is an input split.
Algorithm
Data    Rows of a matrix
Map     QR factorization of rows
Reduce  QR factorization of rows

Mapper 1        A1, A2 --qr--> Q2 R2
Serial TSQR     R2, A3 --qr--> Q3 R3
                R3, A4 --qr--> Q4 R4    emit R4

Mapper 2        A5, A6 --qr--> Q6 R6
Serial TSQR     R6, A7 --qr--> Q7 R7
                R7, A8 --qr--> Q8 R8    emit R8

Reducer 1       R4, R8 --qr--> Q R      emit R
Serial TSQR
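The tree above can be checked in memory with numpy; this is an in-memory stand-in for the MapReduce job (block counts and sizes are arbitrary), and the final R agrees with a direct QR of A up to row signs:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4000, 10))
blocks = np.split(A, 8)                       # A1 ... A8

def local_R(rows):
    return np.linalg.qr(rows, mode='r')       # keep only the R factor

# Map phase: each "mapper" folds its four blocks into a single R.
R_map1 = local_R(np.vstack([local_R(b) for b in blocks[:4]]))
R_map2 = local_R(np.vstack([local_R(b) for b in blocks[4:]]))

# Reduce phase: one more QR of the stacked mapper outputs.
R = local_R(np.vstack([R_map1, R_map2]))
```

Only the small n-by-n R factors ever cross the map/reduce boundary, which is where the communication savings come from.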
Key Limitations
Computes only R and not Q

Can get Q via Q = AR+ with another MR iteration. "
  (we currently use this for computing the SVD) 
Dubious numerical stability; iterative refinement helps.

Working on better ways to compute Q "
(with Austin Benson, Jim Demmel)




Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
 CS&E Seminar
Too much data? Too many maps? Add an iteration!

Iteration 1: mappers 1-1 … 1-4 run serial TSQR on A1 … A4 and emit
R1 … R4; a shuffle groups them into S(1); reducers 1-1 … 1-3 run
serial TSQR and emit R2,1 … R2,3.

Iteration 2: an identity map and a shuffle group those outputs into
S(2); reducer 2-1 runs serial TSQR and emits the final R.
mrtsqr – summary of parameters

Blocksize  How many rows to read before computing a QR factorization,
           expressed as a multiple of the number of columns (see paper).

Splitsize  The size of each local matrix A_i.

Reduction tree  The number of reducers and iterations to use.
Varying splitsize and the tree

Data: synthetic.

Cols.   Iters.   Split (MB)   Maps   Secs.
50      1        64           8000   388
–       –        256          2000   184
–       –        512          1000   149
–       2        64           8000   425
–       –        256          2000   220
–       –        512          1000   191
1000    1        512          1000   666
–       2        64           6000   590
–       –        256          2000   432
–       –        512          1000   337

Increasing split size improves performance (it accounts for Hadoop
data movement). Increasing iterations helps for problems with many
columns. (1000 columns with a 64-MB split size overloaded the single
reducer.)
MapReduce TSQR summary

MapReduce is great for TSQR!
Data  A tall-and-skinny (TS) matrix, stored by rows
Map  QR factorization of local rows
Reduce  QR factorization of local rows

Demmel et al. showed that this construction works to compute a QR
factorization with minimal communication.

Input  500,000,000-by-100 matrix
Each record  1-by-100 row
HDFS size  423.3 GB
Time to compute the norm of each column  161 sec.
Time to compute R in qr(A)  387 sec.

                                                                     45
On a 64-node Hadoop cluster with 4x2TB disks, one Core i7-920, and
12 GB RAM per node.
                               David Gleich · Purdue  CS&E Seminar
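The map/reduce construction above can be checked numerically with a small NumPy sketch (not the Hadoop code; `tsqr_r` and the block count are illustrative): QR each block of rows, stack the local R factors, and QR the stack. The result agrees with the R factor of a direct factorization, up to row signs.

```python
import numpy as np

def tsqr_r(A, nblocks=4):
    """Two-stage TSQR: a local QR per row block (the map step), then a
    QR of the stacked local R factors (the reduce step)."""
    blocks = np.array_split(A, nblocks, axis=0)
    local_rs = [np.linalg.qr(B, mode="r") for B in blocks]  # map step
    return np.linalg.qr(np.vstack(local_rs), mode="r")      # reduce step
```

Both the two-stage R and the direct R satisfy R^T R = A^T A, so they coincide up to the signs of their rows.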
Our vision!

To enable analysts and engineers to hypothesize from data computations
instead of expensive HPC computations.

Paul G. Constantine

Sandia!
Jeremy Templeton
Joe Ruthruff

… and you? …

                                                                     46
                               David Gleich · Purdue  CS&E Seminar

