4th International Summer School
    Achievements and Applications of Contemporary
    Informatics, Mathematics and Physics
    National University of Technology of the Ukraine
    Kiev, Ukraine, August 5-16, 2009



                               Regression Theory
                                      with
                          Additive Models and CMARS


                             Gerhard-Wilhelm Weber *
           Inci Batmaz, Gülser Köksal, Fatma Yerlikaya, Pakize Taylan **,
                      Elcin Kartal, Efsun Kürüm, Ayse Özmen

                         Institute of Applied Mathematics,
                  Middle East Technical University, Ankara, Turkey

      *  Faculty of Economics, Management and Law, University of Siegen, Germany
         Center for Research on Optimization and Control, University of Aveiro, Portugal
      ** Department of Mathematics, Dicle University, Turkey
Content

 •   Introduction, Motivation
 •   Regression
 •   Additive Models
 •   MARS
 •   PRSS for MARS
 •   CQP for MARS
 •   Tikhonov Regularization for MARS
 •   Numerical Experience and Comparison
 •   Research Extensions
 •   Conclusion
Introduction

 Learning from data has become very important
 in every field of science and technology, e.g., in

 •   financial sector,
 •   quality improvement in manufacturing,
 •   computational biology,
 •   medicine and
 •   engineering.

 Learning enables estimation and prediction.

 Regression is mainly based on the problems and methods of

 • least squares estimation,
 • maximum likelihood estimation
 • and classification.

 New tools for data analysis, based on nonparametric regression and smoothing:

 • additive (and multiplicative) models.
Introduction



               CART


                      vs.




                            MARS
Introduction

   Additive (and multiplicative) models (studied at IAM, METU):

• spline regression in additive models,
• spline regression in generalized additive models,

• MARS:
  piecewise linear (per dimension) regression in multiplicative models,

• spline regression for stochastic differential equations
   via additive and nonlinear models.
Regression:                 a Motivation


 One of the motivations of this research has been the approximation of financial data points
 (x,y), e.g., coming from

 •   the stock market,
 •   credit rating,
 •   economic factors,
 •   company properties.

 For example, to estimate the probability of default of a particular credit:
 • one of the last three kinds of data points above is used.

 There are different approaches for estimating the probability of a default.
 • Regression models (binary choice) are one of them.
 • For example, we assume that the dependent variable Y,
    with Y = 1 (“default”) or Y = 0 (“no default”), satisfies
                                     Y = F(X ) +ε ,

 X : vector of independent variable(s) (input) such as credit rating.
Regression:                 a Motivation


 •   Estimation for the default probability P,

                                  P = E [ F ( X ) + ε ] = F ( X ).

 •   Also, this estimation can be done via the following linear regression:

                                     Y = α + βΤ X + ε .
 •   An estimate for the default probability of a corporate bond can be obtained:

                                            P = α + βΤ X ;

 α and β are unknown parameters. They can be estimated via linear regression
 methods or maximum likelihood estimation. In many important cases, these just mean
 least squares estimation.
Regression

 Input vector X = ( X_1, X_2, ..., X_m )^T and output variable Y ;

 linear regression :

          Y = E( Y | X_1, ..., X_m ) + ε = β_0 + Σ_{j=1}^m X_j β_j + ε

 •    E(Y | X) is linear (...) and

 •    β = ( β_0, β_1, ..., β_m )^T minimizes

          RSS(β) := Σ_{i=1}^N ( y_i − x_i^T β )²

 or

          RSS(β) = ( y − Xβ )^T ( y − Xβ ) ,      β̂ = ( X^T X )^{-1} X^T y ,      Cov(β̂) = ( X^T X )^{-1} σ² .
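 •   As a quick illustration (a minimal sketch, not from the original slides; the data are hypothetical), the
     least-squares estimate β̂ = (XᵀX)⁻¹Xᵀy and its covariance can be computed directly:

```python
import numpy as np

# Hypothetical data: N = 50 observations, m = 3 inputs, intercept column prepended.
rng = np.random.default_rng(0)
N, m = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, m))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)

# Least-squares estimate beta_hat = (X^T X)^{-1} X^T y
# (lstsq is numerically preferable to forming the inverse explicitly).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Covariance estimate Cov(beta_hat) = (X^T X)^{-1} * sigma^2,
# with sigma^2 estimated from the residual sum of squares.
resid = y - X @ beta_hat
sigma2 = resid @ resid / (N - (m + 1))
cov_beta = np.linalg.inv(X.T @ X) * sigma2
print(beta_hat)
```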
Regression, Additive Models

 In the input space:



 •   Classical understanding:

     additive separation of variables

 •   New interpretation:

     separation of clusters and corresponding enumeration
Regression, Additive Models


 (A)        E( Y_i | x_i1, x_i2, ..., x_im ) = β_0 + Σ_{j=1}^m f_j( x_ij )

             f_j are estimated by a smoothing on a single coordinate.

             Standard convention at x_ij :    E( f_j( x_ij ) ) = 0 .


 •     Backfitting algorithm (Gauss-Seidel algorithm).
 •     This procedure depends on the partial residual against x_ij :

                      r_ij = y_i − β̂_0 − Σ_{k≠j} f_k( x_ik ) .
Regression, Additive Models


 •   Estimate each smooth function by holding all the other ones fixed.


      initialization:       β̂_0 := ave( y_i | i = 1, ..., N ),      f̂_j( x_ij ) ≡ 0   ∀ i, j

      cycle  j = 1, ..., m, 1, ..., m, 1, ...

                      r_ij = y_i − β̂_0 − Σ_{k≠j} f̂_k( x_ik ) ,     i = 1, ..., N

              f̂_j is updated by smoothing the partial residuals r_ij (i = 1, ..., N) against x_ij

      until the functions almost do not change.


 •   Convergence (condition)
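 •   A minimal sketch of this backfitting cycle (not the authors' code; the smoother S_j is a simple
     running-mean placeholder and the data are hypothetical):

```python
import numpy as np

def moving_average_smoother(x, r, window=7):
    """Placeholder smoother S_j: running mean of the partial residuals r, sorted by x."""
    order = np.argsort(x)
    smoothed = np.convolve(r[order], np.ones(window) / window, mode="same")
    out = np.empty_like(r)
    out[order] = smoothed
    return out

def backfit(X, y, n_cycles=20):
    """Backfitting (Gauss-Seidel) for the additive model y = beta0 + sum_j f_j(x_j) + eps."""
    N, m = X.shape
    beta0 = y.mean()                       # initialization: beta0 := ave(y_i)
    f = np.zeros((N, m))                   # f_j(x_ij) = 0 for all i, j
    for _ in range(n_cycles):
        for j in range(m):
            r_j = y - beta0 - f[:, [k for k in range(m) if k != j]].sum(axis=1)
            f[:, j] = moving_average_smoother(X[:, j], r_j)
            f[:, j] -= f[:, j].mean()      # convention E(f_j) = 0
    return beta0, f

# Hypothetical data
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = 1.0 + np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)
beta0, f = backfit(X, y)
```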
Regression, Additive Models


 •   Convergence of the backfitting:    T̂ f = f ,   where

          T̂_j : IR^{Nm} → IR^{Nm} ,
          ( f_1, ..., f_m )^T  ↦  ( f_1, ..., f_{j−1}, S_j( y − Σ_{k≠j} f_k ), f_{j+1}, ..., f_m )^T .

 •   Full cycle:   T̂ = T̂_m T̂_{m−1} ... T̂_1 ;  then, T̂^l corresponds to l full cycles.

 •   Backfitting always converges if all smoothers are symmetric and all eigenvalues of T̂
     are either +1 or in the interior of the unit ball: | λ | < 1 .
Regression, Generalized Additive Models

•   To extend the additive model to a wide range of distribution families:
    generalized additive models (GAM):




              G( µ(X) ) = ψ(X) = β_0 + Σ_{j=1}^m f_j( X_j ) ,        θ := ( β_0, f_1, ..., f_m )^T ,


 •   f_j are unspecified,   G : link function;

 •   f_j : elements of a finite-dimensional space consisting, e.g., of splines;


 •   spline orders (or degrees):
               suitably chosen, depending on the density and variation properties
               of the corresponding data in x and y components, respectively.

 •   The problem of specifying θ becomes a finite-dimensional parameter estimation problem.
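 •   As a small illustration of the link function G (a sketch, not from the slides): for a binary response
     such as default / no default, a logit link maps the additive predictor ψ(X) to a probability µ(X) ∈ (0, 1):

```python
import numpy as np

def logit(mu):
    """Link G(mu) = log(mu / (1 - mu)); G(mu(X)) equals the additive predictor psi(X)."""
    return np.log(mu / (1.0 - mu))

def inverse_logit(psi):
    """mu(X) = G^{-1}(psi(X)) = 1 / (1 + exp(-psi))."""
    return 1.0 / (1.0 + np.exp(-psi))

# Hypothetical additive predictor psi(X) = beta0 + f1(X1) + f2(X2)
X1, X2 = 0.3, -1.2
psi = -0.5 + np.sin(X1) + 0.25 * X2
print(inverse_logit(psi))   # estimated probability, e.g. of default
```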
Regression, Generalized Additive Models,
Splines
 •    Let x_0, ..., x_N be N + 1 distinct knots of [a, b] with  a = x_0 < x_1 < ... < x_N = b.

 •    A function g_k(x) on the interval [a, b] is a spline of degree k relative to the
      knots x_j if

      (1)   f_k |_[x_j, x_{j+1}]  ∈ IP_k   (polynomial of degree ≤ k ;  j = 0, ..., N − 1 ),

      (2)   f_k ∈ C^{k−1}[ a, b ] .

      The space of splines g_k of degree k on [a, b] relative to the N + 1 distinct knots
      is called ℘_k ;  then,   dim ℘_k = N + k .

 •    In practice, a spline is represented by a different polynomial on each subinterval, and
      for this reason there can be a discontinuity in its k-th derivative at the internal knots
      x_1, ..., x_{N−1}.
Regression, Generalized Additive Models,
Splines

•   To characterize a spline of degree k, each piece  f_{k,j} := f_k |_[x_j, x_{j+1}]  can be represented by

          f_{k,j}(x) = Σ_{i=0}^k g_ij ( x − x_j )^i ,   if x ∈ [ x_j, x_{j+1} ] ;

    (k + 1) N coefficients g_ij are to be determined.

•   To hold:    f_{k,j−1}^{(l)}( x_j ) = f_{k,j}^{(l)}( x_j )   ( j = 1, ..., N − 1;  l = 0, ..., k − 1 ),

    there are k ( N − 1 ) conditions, and the remaining number of degrees of freedom is

          (k + 1) N − k ( N − 1 ) = k + N .
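•   One way to make the count k + N concrete is the truncated power basis of ℘_k (a sketch under the
    slide's notation; the choice of basis is an assumption for illustration, not the parameterization used
    later for MARS):

```python
import numpy as np

def truncated_power_basis(x, knots, k=3):
    """Columns: 1, x, ..., x^k, (x - t_1)_+^k, ..., (x - t_{N-1})_+^k  (one term per internal knot)."""
    x = np.asarray(x, dtype=float)
    cols = [x**i for i in range(k + 1)]                       # polynomial part: k + 1 columns
    cols += [np.maximum(x - t, 0.0)**k for t in knots[1:-1]]  # N - 1 internal-knot columns
    return np.column_stack(cols)

knots = np.linspace(0.0, 1.0, 6)        # N + 1 = 6 knots, so N = 5 and dim = k + N = 8
B = truncated_power_basis(np.linspace(0, 1, 50), knots)
print(B.shape)                           # (50, 8)
```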
Clustering for Generalized Additive Models

 •      Financial markets have different kinds of trading activities.
        These activities work with

 •      short-, mid- or long-term horizons

 •      from days and weeks to months and years.

 •      These data can sometimes be problematic to use in the models,

        e.g.,
        given a longer horizon, data are sometimes recorded less frequently,
        but at other times the measurements are highly frequent.

 •          The structure of the data may have particular properties:

 i.           larger variability,
 ii.          outliers,
 iii.         some data do not have any meaning.
Clustering for Generalized Additive Models
Clustering for Generalized Additive Models

  •   data variation:




  •      for the sake of simplicity:
                                       Nj ≡ N   for each interval   Ij
Clustering for Generalized Additive Models

 •   Density:

      Given intervals I_1, ..., I_m , the density of the input data in the j-th interval is

                  D_j := ( number of points x_ij in I_j ) / ( length of I_j ) .

 •   Variation
     If over the interval I_j the data are ( x_1j, y_1j ), ..., ( x_Nj, y_Nj ) :

                  V_j := Σ_{i=1}^{N−1} | y_{i+1,j} − y_{i,j} | .

 •   If this value is big, the curvature of any approximating curve could be big at many
     data points:

     occurrence of outliers,
     instability of the model.
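 •   A small sketch of these two quantities for hypothetical data (assuming the variation is taken with
     absolute differences, as reconstructed above; the index of the next slide is simply their product):

```python
import numpy as np

def interval_density_variation(x, y, a, b):
    """Density D_j and variation V_j of the data falling into one interval I_j = [a, b)."""
    inside = (x >= a) & (x < b)
    D = inside.sum() / (b - a)                       # points per unit length
    y_in = y[inside][np.argsort(x[inside])]          # responses ordered along the interval
    V = np.abs(np.diff(y_in)).sum()                  # sum of |y_{i+1,j} - y_{i,j}|
    return D, V

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 3.0, size=120)
y = np.sin(3 * x) + rng.normal(scale=0.05, size=120)
for a, b in [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]:    # intervals I_1, I_2, I_3
    D, V = interval_density_variation(x, y, a, b)
    print((a, b), D, V, D * V)
```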
Clustering for Generalized Additive Models


 •    I_1, ..., I_p (or Q_1, ..., Q_m ) are the intervals (or cubes) according to which the data are grouped.
 •    For I_j (or cube Q_j ), the associated index of data variation is

                                   Ind_j := D_j · V_j
      or
                                   Ind_j := d_j( D_j ) · v_j( V_j ) .

 •   In fact, from both the viewpoints of data fitting and complexity (or stability),

 o   cases with a high variation distributed over a very long interval are much
     less problematic than cases with a high variation over a short interval;

 o   oscillation,
 o   curvature,
 o   up to nonsmoothness,


 o   penalty!
Regression, Additive Models

 •   The additive model can be fit to data, given observations ( y_i, x_i ) (i = 1, 2, ..., N ).


 •   Penalized residual sum of squares (PRSS):

     PRSS( β_0, f_1, ..., f_m ) :=  Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^m f_j( x_ij ) )²  +  Σ_{j=1}^m µ_j ∫_a^b ( f_j''( t_j ) )² dt_j

 •   µj ≥ 0      (smoothing parameters, tradeoff)

 •   large values of µ j yield smoother curves,
     smaller ones result in more fluctuation.

 •   A new estimation method for the additive model, based on CQP:
Regression, Additive Models

          min_{t, β_0, f}    t ,

          subject to     Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^m f_j( x_ij ) )²  ≤ t² ,   t ≥ 0,

                         ∫ ( f_j''( t_j ) )² dt_j  ≤ M_j     ( j = 1, 2, ..., m ).


 •   The functions f_j are splines:      f_j(x) = Σ_{l=1}^{d_j} θ_l^j h_l^j(x).

 •   Then, we get

          min_{t, β_0, θ}    t ,

          subject to     ‖ W( β_0, θ ) ‖_2²  ≤ t² ,   t ≥ 0,

                         ‖ V_j( β_0, θ ) ‖_2²  ≤ M_j     ( j = 1, ..., m ).
Regression, Additive Models




                              http://144.122.137.55/gweber/
MARS Multivariate Adaptive Regression Spline

 •   To estimate general functions of high-dimensional arguments.

 •   An adaptive procedure.

 •   A nonparametric regression procedure.

 •   No specific assumption about the underlying functional relationship
     between the dependent and independent variables.

 •   Ability to estimate the contributions of the basis functions so that both
     the additive and the interactive effects of the predictors are allowed to
     determine the response variable.

 •   Uses expansions in piecewise linear basis functions of the form

                   c⁺( x, τ ) = [ +( x − τ ) ]₊ ,     c⁻( x, τ ) = [ −( x − τ ) ]₊ ,     where  [q]₊ := max{0, q} .
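 •   A tiny sketch of such a reflected pair (hypothetical knot τ; not the authors' code):

```python
import numpy as np

def hinge(x, tau, sign=+1):
    """c^+(x, tau) = [+(x - tau)]_+ for sign=+1, c^-(x, tau) = [-(x - tau)]_+ for sign=-1."""
    return np.maximum(sign * (x - tau), 0.0)

x = np.linspace(-2.0, 2.0, 9)
tau = 0.5
print(hinge(x, tau, +1))   # zero left of tau, linear right of tau
print(hinge(x, tau, -1))   # linear left of tau, zero right of tau
```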
MARS


                             [Figure: basic elements in the regression with MARS — scattered data (x, y)
                              fitted by the reflected pair c⁻(x,τ) = [−(x−τ)]₊ and c⁺(x,τ) = [+(x−τ)]₊ at the knot τ.]


•   Let us consider     Y = f (X ) + ε,                      X = ( X 1 , X 2 ,..., X p )Τ

•   The goal is to construct reflected pairs for each input           X j ( j = 1, 2,..., p ).
MARS


•   Set of basis functions:


              ℘ := { ( X_j − τ )₊ , ( τ − X_j )₊  |  τ ∈ { x̃_1,j , x̃_2,j , ..., x̃_N,j } ,  j ∈ {1, 2, ..., p} }

•   Thus, f(X) can be represented by

                                        Y = θ_0 + Σ_{m=1}^M θ_m ψ_m(X) + ε .

•   ψ_m (m = 1, 2, ..., M) are basis functions from ℘ or products of two or more such
    functions; interaction basis functions are created by multiplying an existing basis
    function with a truncated linear function involving a new variable.

•   Provided the observations are represented by the data ( x_i, y_i ) (i = 1, 2, ..., N ) :

                                     ψ_m(x) := Π_{j=1}^{K_m} [ s_{κ_j^m} · ( x_{κ_j^m} − τ_{κ_j^m} ) ]₊ .
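•   A sketch of how such basis functions and their products could be assembled into a design matrix
    (hypothetical knots and interaction terms; this illustrates only the expansion itself, not the MARS
    forward-stepwise search):

```python
import numpy as np

def hinge(x, tau, sign=+1):
    """Truncated linear function [ sign * (x - tau) ]_+ ."""
    return np.maximum(sign * (x - tau), 0.0)

def basis_product(X, terms):
    """psi_m(x) = prod_j [ s_j * (x_{kappa_j} - tau_j) ]_+ ;  terms = [(column, tau, sign), ...]."""
    out = np.ones(X.shape[0])
    for col, tau, sign in terms:
        out *= hinge(X[:, col], tau, sign)
    return out

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
# Hypothetical basis: an intercept, one main-effect hinge, and one two-way interaction.
basis_terms = [
    [],                                   # psi_0 = 1 (intercept)
    [(1, -0.2, +1)],                      # [ (X_2 + 0.2) ]_+
    [(1, -0.2, +1), (2, 0.4, -1)],        # [ (X_2 + 0.2) ]_+ * [ -(X_3 - 0.4) ]_+
]
design = np.column_stack([basis_product(X, t) for t in basis_terms])
print(design.shape)                       # (30, 3): one column per basis function
```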
MARS

•     Two subalgorithms:

(i)   Forward stepwise algorithm:

•     Search for the basis functions.
•     Minimization of some “lack of fit” criterion.
•     The process stops when a user-specified value M max is reached.

•     Overfitting.
      So a backward deletion procedure is applied
      by decreasing the complexity of the model
      without degrading the fit to the data.

(ii)  Backward stepwise algorithm:
MARS

•   At each stage, remove from the model the basis function whose removal causes the smallest
    increase in the residual squared error; this produces an optimally estimated model f̂_α
    for each number of terms α.

•   α is related to the complexity of the estimation.
•   To estimate the optimal value of α, generalized cross-validation is used:

                   Σ_{i=1}^N ( y_i − f̂_α( x_i ) )²
     GCV :=       ──────────────────────────────────          with   M(α) := u + d K ,
                        ( 1 − M(α) / N )²

                                             N := number of samples
                                             u := number of independent basis functions
                                             K := number of knots selected by the forward stepwise algorithm
                                             d := cost of optimal basis
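•   A minimal sketch of this GCV computation (values are hypothetical; the effective-parameter count
    M(α) = u + dK and the cost d are assumed to be given):

```python
import numpy as np

def gcv(y, y_hat, u, K, d=3.0):
    """GCV as on the slide: RSS / (1 - M(alpha)/N)^2 with M(alpha) = u + d*K.
    (Some references include an additional 1/N factor in front of the RSS.)"""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    N = len(y)
    rss = np.sum((y - y_hat) ** 2)
    M_alpha = u + d * K
    return rss / (1.0 - M_alpha / N) ** 2

# Hypothetical fitted values for a model with u = 4 basis functions and K = 3 knots.
rng = np.random.default_rng(4)
y = rng.normal(size=30)
y_hat = y + rng.normal(scale=0.2, size=30)
print(gcv(y, y_hat, u=4, K=3))
```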


•   Alternative:
PRSS for MARS

       PRSS := Σ_{i=1}^N ( y_i − f( x_i ) )²  +  Σ_{m=1}^{M_max} λ_m  Σ_{|α|=1,2; α=(α_1,α_2)}  Σ_{r<s; r,s ∈ V(m)}  ∫ θ_m² [ D^α_{r,s} ψ_m( t^m ) ]² d t^m ,

     where

      V(m) := { κ_j^m | j = 1, 2, ..., K_m } ,
      t^m := ( t_m1, t_m2, ..., t_mK_m )^T ,
      α = ( α_1, α_2 ) ,   |α| := α_1 + α_2 ,   α_1, α_2 ∈ {0, 1} ,
      D^α_{r,s} ψ_m( t^m ) := ∂^{|α|} ψ_m / ( ∂^{α_1} t_r^m ∂^{α_2} t_s^m ) ( t^m ) .

 •   Tradeoff between accuracy and complexity.
 •   Penalty parameters λm .
Knot Selection
Grid Selection
Grid Selection




CQP and Tikhonov Regularization for MARS

     ψ( d_i ) := ( 1, ψ_1( x_i^1 ), ..., ψ_M( x_i^M ), ψ_{M+1}( x_i^{M+1} ), ..., ψ_{M_max}( x_i^{M_max} ) )^T ,

     d_i := ( x_i^1, x_i^2, ..., x_i^M, x_i^{M+1}, x_i^{M+2}, ..., x_i^{M_max} )^T ,

     θ := ( θ_0, θ_1, ..., θ_{M_max} )^T ,

     ( σ_{κ_j} )_{j ∈ {1,2,...,K_m}} ∈ {0, 1, 2, ..., N+1}^{K_m} ,

     x̂_i^m := ( x_{l^{κ_1}_{σ_{κ_1}}, κ_1} , x_{l^{κ_2}_{σ_{κ_2}}, κ_2} , ..., x_{l^{κ_{K_m}}_{σ_{κ_{K_m}}}, κ_{K_m}} ) ,

     Δx̂_i^m := Π_{j=1}^{K_m} ( x_{l^{κ_j}_{σ_{κ_j}+1}, κ_j} − x_{l^{κ_j}_{σ_{κ_j}}, κ_j} ) ,

     L_im := [ Σ_{|α|=1,2; α=(α_1,α_2)}  Σ_{r<s; r,s ∈ V(m)}  [ D^α_{r,s} ψ_m( x̂_i^m ) ]² Δx̂_i^m ]^{1/2} ,

     ψ(d) := ( ψ( d_1 ), ..., ψ( d_N ) )^T .

      L is an ( M_max + 1 ) × ( M_max + 1 ) matrix.
CQP and Tikhonov Regularization for MARS

•   For a short representation, we can rewrite the approximate relation as

          PRSS = ‖ y − ψ(d) θ ‖_2²  +  Σ_{m=1}^{M_max} λ_m  Σ_{i=1}^{(N+1)^{K_m}}  L_im² θ_m² .


•   In case of the same penalty parameter λ = λ_m (=: φ²) for all m, this becomes

          PRSS = ‖ y − ψ(d) θ ‖_2²  +  λ ‖ L θ ‖_2² .


                                                             Tikhonov regularization
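•   For a fixed λ, the minimizer of ‖ y − ψ(d) θ ‖_2² + λ ‖ L θ ‖_2² has the usual Tikhonov (ridge-type)
    closed form; a sketch with hypothetical stand-ins for ψ(d), L and y (not the authors' code):

```python
import numpy as np

def tikhonov_solution(Psi, L, y, lam):
    """theta_lambda = argmin ||y - Psi theta||_2^2 + lam ||L theta||_2^2
                    = (Psi^T Psi + lam L^T L)^{-1} Psi^T y."""
    A = Psi.T @ Psi + lam * (L.T @ L)
    return np.linalg.solve(A, Psi.T @ y)

# Hypothetical problem data (stand-ins for psi(d), L and y of the slides).
rng = np.random.default_rng(5)
Psi = rng.normal(size=(30, 6))
L = np.diag([0.0, 1.8, 0.75, 0.94, 2.2, 0.39])   # diagonal penalty matrix, loosely modeled on the L shown later
y = rng.normal(size=30)

for lam in [0.01, 0.1, 1.0, 10.0]:
    theta = tikhonov_solution(Psi, L, y, lam)
    print(lam, np.linalg.norm(y - Psi @ theta), np.linalg.norm(L @ theta))
```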
CQP for MARS

•   Conic quadratic programming:


                    min_{t, θ}   t ,

                    subject to     ‖ ψ(d) θ − y ‖_2 ≤ t ,
                                   ‖ L θ ‖_2 ≤ M .


    In general :   min_x  c^T x ,   subject to   ‖ D_i x − d_i ‖_2 ≤ p_i^T x − q_i   ( i = 1, 2, ..., k ).
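•   This second-order cone program can be handed to a conic solver (the slides use MOSEK); below is a
    sketch using the CVXPY modelling layer with hypothetical ψ(d), y, L and bound M — an illustration, not
    the authors' MATLAB/MOSEK code, and it assumes the cvxpy package is available:

```python
import numpy as np
import cvxpy as cp

# Hypothetical problem data (stand-ins for psi(d), y, L of the slides).
rng = np.random.default_rng(6)
Psi = rng.normal(size=(30, 6))
y = rng.normal(size=30)
L = np.diag([0.0, 1.8, 0.75, 0.94, 2.2, 0.39])
M_bound = 0.5

theta = cp.Variable(6)
t = cp.Variable()
constraints = [
    cp.norm(Psi @ theta - y, 2) <= t,      # || psi(d) theta - y ||_2 <= t
    cp.norm(L @ theta, 2) <= M_bound,      # || L theta ||_2 <= M
]
prob = cp.Problem(cp.Minimize(t), constraints)
prob.solve()                               # solver="MOSEK" could be passed if installed
print(prob.status, t.value, np.linalg.norm(L @ theta.value))
```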
CQP for MARS


 •  Moreover, ( t, θ, χ, η, ω_1, ω_2 ) is a primal-dual optimal solution if and only if


              [ 0_N    ψ(d)           ] [ t ]   [ −y ]
        χ  := [                       ] [   ] + [    ] ,
              [ 1      0ᵀ_{M_max+1}   ] [ θ ]   [  0 ]

              [ 0_{M_max+1}    L              ] [ t ]   [ 0_{M_max+1} ]
        η  := [                               ] [   ] + [             ] ,
              [ 0              0ᵀ_{M_max+1}   ] [ θ ]   [ M           ]

        [ 0ᵀ_N      1             ]        [ 0ᵀ_{M_max+1}   1             ]        [ 1           ]
        [                         ] ω_1 +  [                              ] ω_2 =  [             ] ,
        [ ψ(d)ᵀ     0_{M_max+1}   ]        [ Lᵀ             0_{M_max+1}   ]        [ 0_{M_max+1} ]

        ω_1ᵀ χ = 0 ,   ω_2ᵀ η = 0 ,

        ω_1 ∈ L^{N+1} ,   ω_2 ∈ L^{M_max+2} ,

        χ ∈ L^{N+1} ,   η ∈ L^{M_max+2} .
CQP for MARS

•   CQPs belong to the class of well-structured convex problems.

•   Interior Point Methods.

•   Better complexity bounds.

•   Better practical performance.




C-MARS
Numerical Experience and Comparison

•     We had the following data:
X1   1,5554   1,5326    -0,1823   0,1627    0,5687    0,1706    0,2041    -0,1823    -0,82     -0,7234   0,4446   -0,3291   -1,5583   1,2706    1,7555

X2   0,1849   1,1538    0,7586    -1,5363   1,906     0,3761    1,3323    -0,0064    -1,7275   1,141     0,3761   0,5673    -0,1976   0,7586     0,1849

X3   1,264    1,2023    -1,0995   0,8529    1,3051    -0,3802   -0,7913    0,1336    0,2363    -1,0995   -0,0719 -0,894     -1,0995   0,9557     1,5722

X4   1,2843   1,0175    -0,9676   0,7408    1,0635    -0,506    -0,7937   -0,0564     0,0455   -0,9676 -0,2482    -0,8557    -0,9676   0,8707    1,7339

X5 -0,7109     0,1777    0,1422    0,0355   3,2699    0,3554    -0,1777    1,5283    -0,0711    0,3554 0,8886      0,4621   -0,9241   -0,9241   -0,0711

Y    0,67      0,9047   -0,197    -1,0108    0,1616    0,2984   -0,6039    0,8823 -1,6832        0,9531 -0,3208    0,0507   -0,3916   0,44       0,263

X1   0,0474   -0,8713   -0,2158   0,2179    1,5426    -1,16      0,9857    0,6752    0,5402     -1,4528 1,9349    -0,8299   -0,681    0,7304    -1,1305

X2   0,9498   -0,1976   -1,7275   -0,9626   1,3323    -0,9626    0,1849   -1,345     1,3323     -0,0064 0,1849     0,3761   -1,345    -0,7713   -0,0064

X3   0,0308   -0,6885    1,0584   0,5446    0,5446    -0,483     0,4419    1,264      0,0308   -1,3051 2,086      -0,5857   -0,2775   1,5722    -1,3051

X4   0,1543   -0,7278    1,0046   0,3752    0,3752    -0,5839    0,2613    1,2843    -0,1543   -1,0635 2,5631     -0,6578   -0,4241    1,7339   -1,0635

X5   1,1018   0,6753    -0,391    -0,2843   1,4217    0,4621    -0,8175    0,7819     0,2488    1,5283 -0,1777    -1,7771   0,4621    -1,0307   0,3554

Y    1,1477   -0,3916   -0,4624   -1,0993   2,8639    -1,0285    0,1923    -0,7631     2,05     1,0238 0,9177     -1,2055   -0,3208   -0,5862   -0,6216
Numerical Experience and Comparison

•     We constructed model functions for these data using the Salford MARS software, where we
      selected the maximum number of basis elements: M_max = 5. Then,


             Model 1 : ω = 1                   Model 2 : ω = 2
             BF1 = max{0, X 2 + 1.728};        BF1 = max{0, X 2 + 1.728};
             Y = -1.081 + 0.626 * BF1          BF2 = max{0, X 5 - 0.462}* BF1
                                               Y = -1.073 + 0.499* BF1 + 0.656 * BF2


                            best model >>> Model 3 : ω = 3
                             BF1 = max{0, X 2 + 1.728};
                             BF2 = max{0, X 5 - 0.462} * BF1
                             BF4 = max{0, X 3 + 0.586} * BF1
                             Y = -1.176 + 0.422 * BF1 + 0.597 * BF2 + 0.236 * BF4
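
•   The fitted models are easy to evaluate directly; for instance, Model 3 above translates into the following
    small function (a sketch; x is a standardized input vector with components X1, ..., X5 as in the data table):

```python
import numpy as np

def model3(x):
    """Salford MARS Model 3 (omega = 3) from the slide, evaluated at x = (X1, ..., X5)."""
    X2, X3, X5 = x[1], x[2], x[4]
    bf1 = max(0.0, X2 + 1.728)
    bf2 = max(0.0, X5 - 0.462) * bf1
    bf4 = max(0.0, X3 + 0.586) * bf1
    return -1.176 + 0.422 * bf1 + 0.597 * bf2 + 0.236 * bf4

# First observation of the data table (X1, ..., X5):
x1 = np.array([1.5554, 0.1849, 1.264, 1.2843, -0.7109])
print(model3(x1))   # compare with the observed Y = 0.67
```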
Numerical Experience and Comparison
 •      and, finally,

 Model 4 : ω = 4
 BF1 = max{0, X 2 + 1.728}
     BF2 = max{0, X 5 - 0.462} * BF1
     BF3 = max{0, 0.462 - X 5 } * BF1
     BF4 = max{0, X 3 + 0.586} * BF1
     Y = -1.242 + 0.555 * BF1 + 0.484 * BF2 - 0.093 * BF3
           + 0.226 * BF4
     Model 5 : ω = 5
     BF1 = max{0, X 2 + 1.728};
     BF2 = max{0, X 5 - 0.462} * BF1
     BF3 = max{0, 0.462 - X 5 } * BF1
     BF4 = max{0, X 3 + 0.586} * BF1
     BF5 = max{0, - 0.586 - X 3 } * BF1
     Y = -1.248 + 0.487 * BF1 + 0.486 * BF2 - 0.118 * BF3 + 0.282 * BF4 + 0.263 * BF5
Numerical Experience and Comparison

•   Then, we considered a large model with five basis functions; writing a MATLAB code, we found:


                    0    0        0        0        0         0   
                    0 1.8419       0       0        0          0 
                                                                  
                    0    0      0.7514     0        0          0 
                 L=                                               
                    0    0         0     0.9373     0          0 
                    0    0        0        0      2.1996       0 
                                                                  
                    0    0        0        0         0      0.3905


•   We constructed models using different values for     M   in the optimization problem,
    which was solved by MOSEK (CQP).

•   Our algorithm always constructs a model with 5 parameters;
    in the case of Salford MARS, there are 1, 2, 3, 4 or 5 parameters.
Numerical Experience and Comparison


                    RESULTS OF SALFORD MARS



       ω    RSS           z = √RSS      t = ‖Lθ‖₂   GCV

      1   17.6425         4.2003       1.1531     0.771
      2   11.1870         3.3447       1.0430     0.613
      3   7.7824          2.7897       1.0368     0.550

      4   6.6126          2.5715       1.1967     0.626

      5   6.2961          2.5092       1.1600     0.840
Numerical Experience and Comparison
                    RESULTS OF OUR APPROACH


        M       ω   z = √RSS   t = ‖Lθ‖₂    M       ω   z = √RSS    t = ‖Lθ‖₂

      0.05     5    5.16894     0.05     0.2940   5     4.2024     0.2940
       0.1     5   4.959342     0.1      0.2945   5     4.2006     0.2945
      0.15     5   4.755559     0.15     0.295    5     4.1988     0.2950
       0.2     5   4.557617     0.2       0.3     5    4.180557     0.3
      0.25     5   4.365811     0.25      0.35    5    4.002338     0.35
      0.265    5    4.3095     0.2650     0.4     5    3.831675     0.4
      0.275    5    4.2723     0.2750     0.45    5    3.669118     0.45
      0.285    5    4.2354     0.2850     0.5     5    3.515233     0.5
     0.2865    5    4.2299     0.2865     0.55    5    3.370588     0.55
     0.2875    5    4.2262     0.2875    0.552    5     3.3650     0.5520
     0.2885    5    4.2226     0.2885    0.555    5     3.3567     0.5550
     0.2895    5    4.2189     0.2895    0.558    5     3.3483     0.558
     0.28965   5    4.2183     0.2897    0.560    5     3.3428     0.5600
     0.28975   5    4.2180     0.2897    0.561    5     3.3401     0.5610
     0.28985   5    4.2176     0.2899    0.562    5     3.3373     0.5620
     0.28995   5    4.2172     0.2899    0.565    5     3.3291     0.5650
Numerical Experience and Comparison
        M     ω   z = √RSS   t = ‖Lθ‖₂     M     ω   z = √RSS    t = ‖Lθ‖₂

     0.575   5    3.3019     0.5750    0.96   5     2.5968      0.96
     0.585   5    3.2751     0.5850    0.97   5     2.5880      0.97
     0.595   5    3.2488     0.5950    0.98   5     2.5797      0.98
      0.6    5   3.235746     0.6      0.99   5     2.5718      0.99
      0.65    5   3.111253     0.65      1     5    2.564459      1

       0.7    5   2.997622     0.7       2     5    2.509165    1.16009
      0.75    5   2.895324     0.75     2.1    5    2.509165    1.16009
      0.8    5   2.804764     0.8      2.2    5    2.509165    1.16009
     0.805   5    2.7964     0.8050    2.3    5    2.509165    1.16007
     0.810   5    2.7881     0.8100    2.4    5    2.509165    1.16008
     0.820   5    2.7719     0.8200    2.5    5    2.509165    1.16001
     0.830   5    2.7562     0.8300    2.6    5    2.509165    1.16007
     0.840   5    2.7410     0.8400    2.7    5    2.509165    1.16007
     0.85    5   2.726261     0.85     2.8    5    2.509165    1.16009
      0.9    5   2.660023     0.9      2.9    5    2.509165    1.16009
     0.95    5    2.60612     0.95      3     5    2.509165    1.16009
                                        4     5    2.509165   1.160084
Numerical Experience and Comparison

 •   We drew L-curves:

     [Figure: two L-curve panels — ‖ψ(d)θ − y‖₂ (vertical axis, ≈ 2.5 to 5.5) plotted against
      ‖Lθ‖₂ (horizontal axis, 0 to 1.4).]


•   Conclusion: Based on the L-curve criterion and for the given data, our solution is better
    than the Salford MARS solution.
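
•   Such an L-curve can be produced directly from the (t, z) columns of the tables above; a small matplotlib
    sketch (the values below are rounded entries taken from the result tables, used purely for illustration):

```python
import matplotlib.pyplot as plt

# ( ||L theta||_2 , ||psi(d) theta - y||_2 ) pairs, rounded from the result tables above.
t_vals = [0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9, 1.16]
z_vals = [5.17, 4.96, 4.56, 4.18, 3.52, 3.00, 2.66, 2.51]

plt.plot(t_vals, z_vals, "o-")
plt.xlabel("||L theta||_2")
plt.ylabel("||psi(d) theta - y||_2")
plt.title("L-curve")
plt.show()
```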
Numerical Experience and Comparison

   •   All test data sets are also compared according to performance
       measures such as MSE, MAE, correlation coefficient, R², PRESS,
       Mallows' Cp, etc.
   •   These measures are based on the average of nine values (one for
       each fold and each replication).
Numerical Experience and Comparison

 Please find much more numerical experience and comparison in

 Yerlikaya, Fatma,

 A New Contribution to Nonlinear Robust Regression and Classification with
 MARS and Its Application to Data Mining for Quality Control in Manufacturing,

 M.Sc. thesis, Institute of Applied Mathematics, METU, Ankara, 2008.
Piecewise Linear Functions - Stock Market




                                  figures generated by
                                  Erik Kropat
Forward Stepwise Algorithm Revisited




               high complexity
Forward Stepwise Algorithm Revisited
Forward Stepwise Algorithm Revisited
Forward Stepwise Algorithm Revisited
Forward Stepwise Algorithm Revisited
Regularization & Uncertainty Robust Optimization




                                     Laurent El Ghaoui
Regularization & Uncertainty Robust Optimization
References
•   Aster, A., Borchers, B., and Thurber, C., Parameter Estimation and Inverse Problems, Academic Press,
    2004.
•   Breiman, L., Friedman, J. H., Olshen, R., and Stone, C., Classification and Regression Trees, Belmont, CA:
    Wadsworth Int. Group, 1984.
•   Craven, P., and Wahba, G., Smoothing noisy data with spline functions: estimating the correct degree of
    smoothing by the method of generalized cross-validation, Numerische Mathematik 31 (1979) 377-403.
•   Friedman, J.H., Multivariate adaptive regression splines, The Annals of Statistics 19, 1 (1991) 1-141.
•   Hansen, P.C., Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion,
    SIAM, Philadelphia, 1998.
•   Hastie, T., Tibshirani, R., and Friedman, J.H., The Elements of Statistical Learning, Springer Verlag, NY, 2001.
•   MOSEK software, http://www.mosek.com/ .
•   Myers, R.H., and Montgomery, D.C., Response Surface Methodology: Process and Product
    Optimization Using Designed Experiments, New York: Wiley (2002).
•    Nemirovski, A., Lectures on Modern Convex Optimization, Israel Institute of Technology (2002),
     http://iew3.technion.ac.il/Labs/Opt/LN/Final.pdf.
•    Nesterov, Y.E., and Nemirovskii, A.S., Interior Point Methods in Convex Programming, SIAM, 1993.
•    Taylan, P., Weber, G.-W., and Beck, A., New approaches to regression by generalized additive models and
     continuous optimization for modern applications in finance, science and technology, Optimization, 56, 5–6,
     October–December (2007) 675–698.
•    Taylan, P., Weber, G.-W., and Yerlikaya, F., Continuous optimization applied in MARS for modern
     applications in finance, science and technology, in ISI Proceedings of 20th Mini-EURO Conference
     Continuous Optimization and Knowledge-Based Technologies, Neringa, Lithuania, May 20-23, 2008.

More Related Content

PDF
Prediction of Financial Processes
PDF
Mesh Processing Course : Multiresolution
PDF
修士論文発表会
PDF
Further Advanced Methods from Mathematical Optimization
PDF
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
PDF
Prévision de consommation électrique avec adaptive GAM
PDF
Intraguild mutualism
Prediction of Financial Processes
Mesh Processing Course : Multiresolution
修士論文発表会
Further Advanced Methods from Mathematical Optimization
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Prévision de consommation électrique avec adaptive GAM
Intraguild mutualism

What's hot (20)

PDF
11.application of matrix algebra to multivariate data using standardize scores
PDF
Application of matrix algebra to multivariate data using standardize scores
PDF
rinko2011-agh
PDF
rinko2010
KEY
Tprimal agh
PDF
Paris2012 session1
PDF
Signal Processing Course : Convex Optimization
PDF
Bouguet's MatLab Camera Calibration Toolbox for Stereo Camera
PDF
Slides euria-1
PDF
YSC 2013
PDF
Geodesic Method in Computer Vision and Graphics
PDF
Slides euria-2
PDF
Numerical solution of poisson’s equation
PDF
Tele3113 wk1wed
PDF
Signal Processing Course : Inverse Problems Regularization
PDF
Scatter diagrams and correlation and simple linear regresssion
PDF
Rouviere
PDF
E028047054
PDF
Analysis of monitoring of connection between
11.application of matrix algebra to multivariate data using standardize scores
Application of matrix algebra to multivariate data using standardize scores
rinko2011-agh
rinko2010
Tprimal agh
Paris2012 session1
Signal Processing Course : Convex Optimization
Bouguet's MatLab Camera Calibration Toolbox for Stereo Camera
Slides euria-1
YSC 2013
Geodesic Method in Computer Vision and Graphics
Slides euria-2
Numerical solution of poisson’s equation
Tele3113 wk1wed
Signal Processing Course : Inverse Problems Regularization
Scatter diagrams and correlation and simple linear regresssion
Rouviere
E028047054
Analysis of monitoring of connection between
Ad

Viewers also liked (9)

PDF
First Steps In Hypnosis
PPTX
Hypnosis in psychotherapy and hypnosis as psicotherapy
PPTX
Capitalizingon Innovation Hagiu
PDF
Clinical use of hypnosis
PDF
Master Self hypnosis Now
KEY
Hypnosis Slides
PPTX
Hypnosis theory and practice
PPTX
gastroenteritis
PPTX
Gastroenteritis ppt
First Steps In Hypnosis
Hypnosis in psychotherapy and hypnosis as psicotherapy
Capitalizingon Innovation Hagiu
Clinical use of hypnosis
Master Self hypnosis Now
Hypnosis Slides
Hypnosis theory and practice
gastroenteritis
Gastroenteritis ppt
Ad

Similar to Regression Theory (20)

PDF
Parameter Estimation in Stochastic Differential Equations by Continuous Optim...
PDF
Scientific Computing with Python Webinar 9/18/2009:Curve Fitting
PDF
Curve fitting
PDF
Classification Theory
PDF
Bayesian Methods for Machine Learning
PDF
icml2004 tutorial on bayesian methods for machine learning
PDF
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
PDF
Neural Networks
PDF
Applied numerical methods lec8
PDF
Matrix Computations in Machine Learning
PDF
On Foundations of Parameter Estimation for Generalized Partial Linear Models ...
PDF
Cross-Validation
PDF
Introduction to Machine Learning
PDF
1 - Linear Regression
PPT
fghdfh
PDF
Parameter Estimation for Semiparametric Models with CMARS and Its Applications
PDF
Image Processing
PDF
Introduction to the theory of optimization
PDF
Engr 371 final exam april 1999
PDF
ma112011id535
Parameter Estimation in Stochastic Differential Equations by Continuous Optim...
Scientific Computing with Python Webinar 9/18/2009:Curve Fitting
Curve fitting
Classification Theory
Bayesian Methods for Machine Learning
icml2004 tutorial on bayesian methods for machine learning
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Neural Networks
Applied numerical methods lec8
Matrix Computations in Machine Learning
On Foundations of Parameter Estimation for Generalized Partial Linear Models ...
Cross-Validation
Introduction to Machine Learning
1 - Linear Regression
fghdfh
Parameter Estimation for Semiparametric Models with CMARS and Its Applications
Image Processing
Introduction to the theory of optimization
Engr 371 final exam april 1999
ma112011id535

More from SSA KPI (20)

PDF
Germany presentation
PDF
Grand challenges in energy
PDF
Engineering role in sustainability
PDF
Consensus and interaction on a long term strategy for sustainable development
PDF
Competences in sustainability in engineering education
PDF
Introducatio SD for enginers
PPT
DAAD-10.11.2011
PDF
Talking with money
PDF
'Green' startup investment
PDF
From Huygens odd sympathy to the energy Huygens' extraction from the sea waves
PDF
Dynamics of dice games
PPT
Energy Security Costs
PPT
Naturally Occurring Radioactivity (NOR) in natural and anthropic environments
PDF
Advanced energy technology for sustainable development. Part 5
PDF
Advanced energy technology for sustainable development. Part 4
PDF
Advanced energy technology for sustainable development. Part 3
PDF
Advanced energy technology for sustainable development. Part 2
PDF
Advanced energy technology for sustainable development. Part 1
PPT
Fluorescent proteins in current biology
PPTX
Neurotransmitter systems of the brain and their functions
Germany presentation
Grand challenges in energy
Engineering role in sustainability
Consensus and interaction on a long term strategy for sustainable development
Competences in sustainability in engineering education
Introducatio SD for enginers
DAAD-10.11.2011
Talking with money
'Green' startup investment
From Huygens odd sympathy to the energy Huygens' extraction from the sea waves
Dynamics of dice games
Energy Security Costs
Naturally Occurring Radioactivity (NOR) in natural and anthropic environments
Advanced energy technology for sustainable development. Part 5
Advanced energy technology for sustainable development. Part 4
Advanced energy technology for sustainable development. Part 3
Advanced energy technology for sustainable development. Part 2
Advanced energy technology for sustainable development. Part 1
Fluorescent proteins in current biology
Neurotransmitter systems of the brain and their functions

Recently uploaded (20)

PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
Pharma ospi slides which help in ospi learning
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Pre independence Education in Inndia.pdf
PPTX
Cell Types and Its function , kingdom of life
PDF
Classroom Observation Tools for Teachers
PDF
Basic Mud Logging Guide for educational purpose
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PPTX
Week 4 Term 3 Study Techniques revisited.pptx
PDF
RMMM.pdf make it easy to upload and study
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PPTX
Cell Structure & Organelles in detailed.
PPTX
PPH.pptx obstetrics and gynecology in nursing
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Pharma ospi slides which help in ospi learning
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
2.FourierTransform-ShortQuestionswithAnswers.pdf
Pre independence Education in Inndia.pdf
Cell Types and Its function , kingdom of life
Classroom Observation Tools for Teachers
Basic Mud Logging Guide for educational purpose
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
STATICS OF THE RIGID BODIES Hibbelers.pdf
Module 4: Burden of Disease Tutorial Slides S2 2025
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
Week 4 Term 3 Study Techniques revisited.pptx
RMMM.pdf make it easy to upload and study
Microbial diseases, their pathogenesis and prophylaxis
Cell Structure & Organelles in detailed.
PPH.pptx obstetrics and gynecology in nursing
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx

Regression Theory

  • 1. 4th International Summer School Achievements and Applications of Contemporary Informatics, Mathematics and Physics National University of Technology of the Ukraine Kiev, Ukraine, August 5-16, 2009 Motivatio Regression Theory with Additive Models and CMARS n Gerhard- Gerhard-Wilhelm Weber * Inci Batmaz , Gülser Köksal , Fatma Yerlikaya , Pakize Taylan * *, Elcin Kartal , Efsun Kürüm , Ayse Özmen Institute of Applied Mathematics, Middle East Technical University, Ankara, Turkey * Faculty of Economics, Management and Law, University of Siegen, Germany Center for Research on Optimization and Control, University of Aveiro, Portugal * * Department of Mathematics, Dicle University, Turkey
  • 2. Content • Introduction, Motivation • Regression • Additive Models • MARS • PRSS for MARS • CQP for MARS • Tikhonov Regularization for MARS • Numerical Experience and Comparison • Research Extensions • Conclusion
  • 3. Introduction learning from data has become very important in every field of science and technology, e.g., in • financial sector, • quality improvent in manufacturing, • computational biology, • medicine and • engineering. Learning enables for doing estimation and prediction. Regression is mainly based on the problems and methods of • least squares estimation, • maximum likelihood estimation • and classification. New tools for data analysis, based on nonparametric regression and smoothing: • additive (and multiplicative) models.
  • 4. Introduction CART vs. MARS
  • 5. Introduction Additive (and multiplicative) models (studied at IAM, METU): • spline regression in additive models, • spline regression in generalized addive models, • MARS: piecewise linear (per dimension) regression in multiplicative models, • spline regression for stochastic differential equations via additive and nonlinear models.
  • 6. Regression: a Motivation One of motivations of this research has been the approximation of financial data points (x,y), e.g., coming from • the stock market, • credit rating, • economic factor, • company properties. For example, to estimate the probability of a default of a particular credit: • It is used one of the latest three data points above. There are different approaches for estimating the probability of a default. • Regression models (binary choice) are one of them. • For example, we assume that we have the dependent variable Y with Y = 1 (“default”) or Y = 0 (“no default”) satisfies Y = F(X ) +ε , X : vector of independent variable(s) (input) such as credit rating.
  • 7. Regression: a Motivation • Estimation for the default probability P, P = E [ F ( X ) + ε ] = F ( X ). • Also, this estimation can be done via following linear regression Y = α + βΤ X + ε . • An estimate for the default probability of a corparate bond can be obtained: P = α + βΤ X ; α and β are unknown parameters. They can be estimated via linear regression methods or maximum likelihood estimation. In many important cases, these just mean least squares estimation.
  • 8. Regression X = ( X1 , X 2 ,..., X m ) and output variable Y ; T Input vector linear regression : m Y = E (Y X 1 ,..., X m ) + ε = β 0 + ∑ X j β j + ε j =1 • E(Y | X) is linear (...) and β = ( β 0 , β1 ,..., β m ) which minimizes T • 2 ( ) N RSS ( β ) := ∑ yi − x β T i i =1 or RSS ( β ) = (Y − Xβ ) (Y − Xβ ) β = (XT X ) XT y, T −1 ˆ ( ) −1 Cov( β) = X T X ˆ σ2
  • 9. Regression, Additive Models In the input space: • Classical understanding: additive separation of variables • New interpretation: separation of clusters and corresponding enumeration
  • 10. Regression, Additive Models E Yi xi1 , xi 2 ,..., xi m = β 0 + ∑ f j ( xij ) ( ) m (A) j =1 f j are estimated by a smoothing on a single coordinate. Standard convention at xij : ( ) E f j ( xij ) = 0 . • Backfitting algorithm (Gauss-Seidel algorithm). • This procedure depends on the partial residual against xij : rij = yi − β 0 − ∑ f k ( xik ) . ˆ k≠ j
  • 11. Regression, Additive Models • Estimating each smooth function by holding all the other ones fixed. initialization: β 0 := ave( yi | i = 1,..., N ), ˆ f j ( xij ) ≡ 0, ˆ ∀i, j cycle j = 1,..., m,1,..., m,1,..., m rij = yi − β 0 − ∑ f k ( xik ) , ˆ ˆ i = 1,..., N k≠ j ˆ f j is updated by smoothing the partial residuals m rij = yi − β 0 − ∑ f k ( xik ) (i = 1,..., N ) against ˆ ˆ x ij k≠ j until the functions almost do not change. • Convergence (condition)
  • 12. Regression, Additive Models • Convergence of the backfitting, ˆ  f  Tf = f ,    .   .     .  ( ) I  T j : IR Nm → IR Nm ˆ a  S j −∑ k ≠ j f k     .   .     .   fm    • Full cycle: T = Tm Tm-1...T1 ; then, Tl corresponds to l full cycles. ˆ ˆ ˆ ˆ ˆ • ˆ Always converges if all smoother are symetric and all eigenvalues of T are either +1 or in the interior of the unit ball: | λ |< 1 .
  • 13. Regression, Generalized Additive Models • To extend the additive model to a wide range of distribution families: generalized additive models (GAM): G ( µ ( X ) ) = ψ ( X ) = β0 + ∑ f j ( X j ), m i =1 θ := ( β 0 , f1 ,..., f m ) , T • f j are unspecified, G : link function; • f j : elements of a finite dimensional space consisting, e.g., of splines; • spline orders (or degrees): suitably choosen, depending on the density and variation properties of the corresponding data in x and y components, respectively. • problem of specifying θ becomes a finite dimensional parameter estimation problem.
  • 14. Regression, Generalized Additive Models, Splines • x0 ,..., xN be N + 1 distinct knots of [a, b] and a = x0 < x1 < ... < xN = b • The function g k (x) on the interval [a, b] is a spline of degree k relative to the knots x j . • If (1) fk x ,x ∈ IPk (polynomial of degree ≤ k ; j = 0,..., N − 1 ),  j j +1  j+  (2) f k ∈ C k −1 [ a, b] , the space of splines g k on [a, b] is called ℘k and relative to the N + 1 distinct knots; then, dim℘k = N + k. . • In practice, a spline is represented by a different polynomial on each subinterval and for this reason there could be a discontinuity in its kth derivative at the internal knots x1 ,..., xN −1.
• 15. Regression, Generalized Additive Models, Splines
• To characterize a spline of degree $k$: each piece $f_{k,j} := f_k|_{[x_j, x_{j+1}]}$ can be represented by
$f_{k,j}(x) = \sum_{i=0}^{k} g_{ij} (x - x_j)^i$ if $x \in [x_j, x_{j+1}]$;
there are $(k + 1)N$ coefficients $g_{ij}$ to be determined.
• To hold: $f_{k,j-1}^{(l)}(x_j) = f_{k,j}^{(l)}(x_j)$ $(j = 1, \ldots, N-1;\ l = 0, \ldots, k-1)$,
there are $k(N - 1)$ conditions, and the remaining degrees of freedom are $(k + 1)N - k(N - 1) = k + N$.
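As a quick sanity check of this dimension count, the following sketch reproduces $\dim \wp_k = N + k$ via a clamped B-spline knot vector (the B-spline convention and the example values are assumptions of this illustration, not taken from the slides):

```python
import numpy as np

# dim = N + k for splines of degree k on N subintervals,
# counted via a clamped B-spline knot vector (k extra copies of each boundary knot)
a, b, N, k = 0.0, 1.0, 6, 3                   # e.g. cubic splines on 6 subintervals
knots = np.linspace(a, b, N + 1)              # x_0 < x_1 < ... < x_N
t = np.concatenate([[a] * k, knots, [b] * k]) # full (clamped) knot vector
n_basis = len(t) - (k + 1)                    # number of B-spline basis functions
assert n_basis == N + k                       # matches dim = N + k stated above
print(n_basis)                                # -> 9
```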
• 16. Clustering for Generalized Additive Models
• Financial markets have different kinds of trading activities. These activities work with
• short-, mid- or long-term horizons,
• from days and weeks to months and years.
• Such data can sometimes be problematic when used in the models: e.g., over a longer horizon the data are sometimes recorded less frequently, and at other times measured highly frequently.
• The structure of the data may have particular properties:
i. larger variability,
ii. outliers,
iii. some data points that do not carry any meaning.
  • 17. Clustering for Generalized Additive Models
• 18. Clustering for Generalized Additive Models
• Data variation.
• For the sake of simplicity: $N_j \equiv N$ for each interval $I_j$.
• 19. Clustering for Generalized Additive Models
• Density: given intervals $I_1, \ldots, I_m$, the density of the input data in the $j$-th interval is
$D_j := \dfrac{\text{number of points } x_{ij} \text{ in } I_j}{\text{length of } I_j}$.
• Variation: if over the interval $I_j$ the data are $(x_{1j}, y_{1j}), \ldots, (x_{Nj}, y_{Nj})$, then
$V_j := \sum_{i=1}^{N-1} \left| y_{i+1,j} - y_{ij} \right|$.
• If this value is big at many data points, the curvature of any approximating curve could be big: occurrence of outliers, instability of the model.
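A small sketch of how $D_j$ and $V_j$ could be computed for one interval (NumPy; the half-open interval convention and the ordering of the responses along x are assumptions of this illustration):

```python
import numpy as np

def density_and_variation(x, y, interval):
    """Ingredients of the data-variation index for one interval I_j."""
    lo, hi = interval
    mask = (x >= lo) & (x < hi)
    D = mask.sum() / (hi - lo)            # number of points in I_j / length of I_j
    y_in = y[mask][np.argsort(x[mask])]   # responses ordered along x within I_j
    V = np.abs(np.diff(y_in)).sum()       # sum of |y_{i+1,j} - y_{i,j}|
    return D, V

# index of data variation for I_j (simple product form):
# D_j, V_j = density_and_variation(x, y, (a_j, b_j)); Ind_j = D_j * V_j
```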
• 20. Clustering for Generalized Additive Models
• $I_1, \ldots, I_p$ (or $Q_1, \ldots, Q_m$): intervals (or cubes) according to which the data are grouped.
• For interval $I_j$ (cube $Q_j$), the associated index of data variation is
$\operatorname{Ind}_j := D_j \cdot V_j$ or $\operatorname{Ind}_j := d_j(D_j) \cdot v_j(V_j)$.
• In fact, from both the viewpoints of data fitting and complexity (or stability),
o cases with a high variation distributed over a very long interval are much less problematic than cases with a high variation over a short interval;
o oscillation,
o curvature,
o up to nonsmoothness,
o penalty!
• 21. Regression, Additive Models
• The additive model can be fit to data. Given observations $(y_i, x_i)$ $(i = 1, 2, \ldots, N)$:
• penalized sum of squares PRSS
$PRSS(\beta_0, f_1, \ldots, f_m) := \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{m} f_j(x_{ij}) \right)^2 + \sum_{j=1}^{m} \mu_j \int_a^b \left[ f_j''(t_j) \right]^2 dt_j$,
• $\mu_j \geq 0$ (smoothing parameters, tradeoff);
• large values of $\mu_j$ yield smoother curves, smaller ones result in more fluctuation.
• New estimation methods for the additive model with CQP:
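As an aside before the CQP reformulation on the next slide, a sketch of evaluating this PRSS objective; it assumes the fitted $f_j$ are available both at the data points and on uniform grids, and approximating the curvature integral by second differences is an assumption of the sketch:

```python
import numpy as np

def prss(y, beta0, F, mu, grids, f_on_grid):
    """
    Penalized residual sum of squares for an additive model.
    F[i, j] = f_j(x_ij); grids[j] is a uniform grid; f_on_grid[j] = f_j on that grid.
    The integral of (f_j'')^2 is approximated by second differences on the grid.
    """
    fit = ((y - beta0 - F.sum(axis=1)) ** 2).sum()
    penalty = 0.0
    for j, (t, fj) in enumerate(zip(grids, f_on_grid)):
        h = t[1] - t[0]
        f2 = np.diff(fj, 2) / h ** 2              # approximate f_j'' on the grid
        penalty += mu[j] * (f2 ** 2).sum() * h    # approximate curvature integral
    return fit + penalty
```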
• 22. Regression, Additive Models
$\min_{t,\, \beta_0,\, f} \ t$,
subject to $\sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{m} f_j(x_{ij}) \right)^2 \leq t^2$, $t \geq 0$,
$\int \left[ f_j''(t_j) \right]^2 dt_j \leq M_j$ $(j = 1, 2, \ldots, m)$.
• The functions $f_j$ are splines: $f_j(x) = \sum_{l=1}^{d_j} \theta_l^j h_l^j(x)$.
• Then we get
$\min_{t,\, \beta_0,\, \theta} \ t$,
subject to $\left\| W(\beta_0, \theta) \right\|_2^2 \leq t^2$, $t \geq 0$,
$\left\| V_j(\beta_0, \theta) \right\|_2^2 \leq M_j$ $(j = 1, \ldots, m)$.
  • 23. Regression, Additive Models http://144.122.137.55/gweber/
• 24. MARS (Multivariate Adaptive Regression Splines)
• To estimate general functions of high-dimensional arguments.
• An adaptive procedure.
• A nonparametric regression procedure.
• No specific assumption about the underlying functional relationship between the dependent and independent variables.
• Ability to estimate the contributions of the basis functions so that both the additive and the interactive effects of the predictors are allowed to determine the response variable.
• Uses expansions in piecewise linear basis functions of the form
$c^+(x, \tau) = [ +(x - \tau) ]_+$, $c^-(x, \tau) = [ -(x - \tau) ]_+$, where $[q]_+ := \max\{0, q\}$.
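The reflected pair of truncated linear functions can be coded directly; a minimal sketch (names are illustrative):

```python
import numpy as np

def c_plus(x, tau):
    """c+(x, tau) = [ +(x - tau) ]_+"""
    return np.maximum(0.0, x - tau)

def c_minus(x, tau):
    """c-(x, tau) = [ -(x - tau) ]_+"""
    return np.maximum(0.0, tau - x)
```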
• 25. MARS
[Figure: basic elements in the regression with MARS; a reflected pair $c^-(x, \tau) = [-(x - \tau)]_+$ and $c^+(x, \tau) = [+(x - \tau)]_+$ with knot $\tau$, plotted over scattered data.]
• Let us consider $Y = f(X) + \varepsilon$, $X = (X_1, X_2, \ldots, X_p)^T$.
• The goal is to construct reflected pairs for each input $X_j$ $(j = 1, 2, \ldots, p)$.
• 28. MARS
• Set of basis functions:
$\wp := \left\{ (X_j - \tau)_+,\ (\tau - X_j)_+ \ \middle|\ \tau \in \{ x_{1,j}, x_{2,j}, \ldots, x_{N,j} \},\ j \in \{1, 2, \ldots, p\} \right\}$.
• Thus, $f(X)$ can be represented by
$Y = \theta_0 + \sum_{m=1}^{M} \theta_m \psi_m(X) + \varepsilon$.
• $\psi_m$ $(m = 1, 2, \ldots, M)$ are basis functions from $\wp$ or products of two or more such functions; interaction basis functions are created by multiplying an existing basis function with a truncated linear function involving a new variable.
• Provided the observations represented by the data $(x_i, y_i)$ $(i = 1, 2, \ldots, N)$:
$\psi_m(x) := \prod_{j=1}^{K_m} \left[ s_{\kappa_j^m} \cdot \left( x_{\kappa_j^m} - \tau_{\kappa_j^m} \right) \right]_+$.
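A sketch of evaluating one product basis function $\psi_m$ from its variables $\kappa_j^m$, signs $s_{\kappa_j^m}$ and knots $\tau_{\kappa_j^m}$; the example call mirrors BF2 of the numerical section below, but the zero-based variable indices are an assumption of this illustration:

```python
def psi_m(x, variables, signs, knots):
    """
    Product basis function psi_m(x) = prod_j [ s_j * (x_{kappa_j} - tau_j) ]_+ ,
    with variables[j] = kappa_j^m, signs[j] = s_j in {+1, -1}, knots[j] = tau_j.
    """
    value = 1.0
    for kappa, s, tau in zip(variables, signs, knots):
        value *= max(0.0, s * (x[kappa] - tau))
    return value

# example: max{0, X2 + 1.728} * max{0, X5 - 0.462} with X2, X5 at indices 1 and 4
# psi_m(x, variables=[1, 4], signs=[+1, +1], knots=[-1.728, 0.462])
```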
• 29. MARS
• Two subalgorithms:
(i) Forward stepwise algorithm:
• Search for the basis functions.
• Minimization of some "lack of fit" criterion.
• The process stops when a user-specified value $M_{\max}$ is reached.
• Overfitting. So a backward deletion procedure is applied, decreasing the complexity of the model without degrading the fit to the data.
(ii) Backward stepwise algorithm:
• 30. MARS
• Remove from the model, at each stage, the basis function that contributes the smallest increase in the residual squared error, producing an optimally estimated model $\hat{f}_\alpha$ with respect to each number of terms, called $\alpha$.
• $\alpha$ is related to the complexity of the estimation.
• To estimate the optimal value of $\alpha$, generalized cross-validation is used:
$GCV := \dfrac{\frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{f}_\alpha(x_i) \right)^2}{\left( 1 - M(\alpha)/N \right)^2}$, with $M(\alpha) := u + dK$,
where $N$ := number of samples, $u$ := number of independent basis functions, $K$ := number of knots selected by the forward stepwise algorithm, $d$ := cost of optimal basis.
• Alternative:
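Before turning to the alternative on the next slide, a small sketch of this GCV formula; the default cost d near 3 is a common choice and an assumption here, not taken from the slides:

```python
import numpy as np

def gcv(y, y_hat, u, K, d=3.0):
    """
    GCV = (1/N) * sum (y_i - f_hat(x_i))^2 / (1 - M(alpha)/N)^2,
    with M(alpha) = u + d*K  (u: independent basis functions, K: knots, d: cost).
    """
    N = len(y)
    M_alpha = u + d * K
    return ((y - y_hat) ** 2).sum() / N / (1.0 - M_alpha / N) ** 2
```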
• 31. PRSS for MARS
$PRSS := \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 + \sum_{m=1}^{M_{\max}} \lambda_m \sum_{\substack{|\alpha| = 1, 2 \\ \alpha = (\alpha_1, \alpha_2)}} \ \sum_{\substack{r < s \\ r, s \in V(m)}} \int \theta_m^2 \left[ D_{r,s}^{\alpha} \psi_m(t^m) \right]^2 d t^m$,
where
$V(m) := \{ \kappa_j^m \mid j = 1, 2, \ldots, K_m \}$, $t^m := (t_{m_1}, t_{m_2}, \ldots, t_{m_{K_m}})^T$,
$|\alpha| := \alpha_1 + \alpha_2$ with $\alpha_1, \alpha_2 \in \{0, 1\}$, and
$D_{r,s}^{\alpha} \psi_m(t^m) := \dfrac{\partial^{|\alpha|} \psi_m}{\partial^{\alpha_1} t_r^m \, \partial^{\alpha_2} t_s^m}(t^m)$.
• Tradeoff between both accuracy and complexity.
• Penalty parameters $\lambda_m$.
• 34. Grid Selection
• 35. CQP and Tikhonov Regularization for MARS
$\psi(d_i) := \left( 1, \psi_1(x_i^1), \ldots, \psi_M(x_i^M), \psi_{M+1}(x_i^{M+1}), \ldots, \psi_{M_{\max}}(x_i^{M_{\max}}) \right)^T$,
$d_i := \left( x_i^1, x_i^2, \ldots, x_i^{M_{\max}} \right)^T$, $\theta := (\theta_0, \theta_1, \ldots, \theta_{M_{\max}})^T$,
$\psi(d) := \left( \psi(d_1), \ldots, \psi(d_N) \right)^T$.
• The penalty integrals are discretized on grid points $\hat{x}_i^m$ built from the input data, with $\Delta \hat{x}_i^m$ denoting the product of the grid spacings in the $K_m$ coordinates entering $\psi_m$, which gives
$L_{im} := \left[ \sum_{\substack{|\alpha| = 1, 2 \\ \alpha = (\alpha_1, \alpha_2)}} \ \sum_{\substack{r < s \\ r, s \in V(m)}} \left( D_{r,s}^{\alpha} \psi_m(\hat{x}_i^m) \right)^2 \Delta \hat{x}_i^m \right]^{1/2}$.
• $L$ is an $(M_{\max} + 1) \times (M_{\max} + 1)$ matrix.
• 36. CQP and Tikhonov Regularization for MARS
• For a short representation, we can rewrite the approximate relation as
$PRSS = \left\| y - \psi(d)\theta \right\|_2^2 + \sum_{m=1}^{M_{\max}} \lambda_m \sum_{i=1}^{(N+1)^{K_m}} \left[ L_{im} \theta_m \right]^2$.
• In case of the same penalty parameter $\lambda = \lambda_m \ (=: \varphi^2)$ for all $m$, then:
$PRSS = \left\| y - \psi(d)\theta \right\|_2^2 + \lambda \left\| L\theta \right\|_2^2$ (Tikhonov regularization).
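With equal penalty parameters the problem is an ordinary Tikhonov-regularized least-squares problem; a minimal sketch via the stacked least-squares system (illustrative only, not the authors' implementation):

```python
import numpy as np

def tikhonov(Psi, y, L, lam):
    """
    Minimize ||y - Psi @ theta||_2^2 + lam * ||L @ theta||_2^2
    through the stacked system [Psi; sqrt(lam) * L] theta ~ [y; 0].
    """
    A = np.vstack([Psi, np.sqrt(lam) * L])
    b = np.concatenate([y, np.zeros(L.shape[0])])
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta
```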
• 37. CQP for MARS
• Conic quadratic programming:
$\min_{t, \theta} \ t$, subject to $\left\| \psi(d)\theta - y \right\|_2 \leq t$, $\left\| L\theta \right\|_2 \leq M$.
• In general:
$\min_{x} \ c^T x$, subject to $\left\| D_i x - d_i \right\|_2 \leq p_i^T x - q_i$ $(i = 1, 2, \ldots, k)$.
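The slides solve this CQP with MOSEK; purely as an illustration, the same problem can be sketched with CVXPY (an assumption of this rewrite, not the authors' code), which can hand the cone program to MOSEK if it is installed:

```python
import cvxpy as cp

def cmars_cqp(Psi, y, L, M_bound):
    """Solve:  min t  s.t.  ||Psi theta - y||_2 <= t,  ||L theta||_2 <= M_bound."""
    theta = cp.Variable(Psi.shape[1])
    t = cp.Variable()
    constraints = [cp.norm(Psi @ theta - y, 2) <= t,
                   cp.norm(L @ theta, 2) <= M_bound]
    cp.Problem(cp.Minimize(t), constraints).solve()  # solver="MOSEK" if available
    return theta.value, t.value
```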
• 39. CQP for MARS
• Moreover, $(t, \theta, \chi, \eta, \omega_1, \omega_2)$ is a primal-dual optimal solution if and only if
$\chi := \begin{pmatrix} 0_N & \psi(d) \\ 1 & 0_{M_{\max}+1}^T \end{pmatrix} \begin{pmatrix} t \\ \theta \end{pmatrix} + \begin{pmatrix} -y \\ 0 \end{pmatrix}$,
$\eta := \begin{pmatrix} 0_{M_{\max}+1} & L \\ 0 & 0_{M_{\max}+1}^T \end{pmatrix} \begin{pmatrix} t \\ \theta \end{pmatrix} + \begin{pmatrix} 0_{M_{\max}+1} \\ M \end{pmatrix}$,
$\begin{pmatrix} 0_N^T & 1 \\ \psi(d)^T & 0_{M_{\max}+1} \end{pmatrix} \omega_1 + \begin{pmatrix} 0_{M_{\max}+1}^T & 0 \\ L^T & 0_{M_{\max}+1} \end{pmatrix} \omega_2 = \begin{pmatrix} 1 \\ 0_{M_{\max}+1} \end{pmatrix}$,
$\omega_1^T \chi = 0$, $\omega_2^T \eta = 0$,
$\omega_1 \in L^{N+1}$, $\omega_2 \in L^{M_{\max}+2}$, $\chi \in L^{N+1}$, $\eta \in L^{M_{\max}+2}$.
• 40. CQP for MARS
• CQPs belong to the well-structured convex problems.
• Interior point methods.
• Better complexity bounds.
• Better practical performance.
• C-MARS
• 41. Numerical Experience and Comparison
• We had the following data (observations 1-15):
X1: 1.5554, 1.5326, -0.1823, 0.1627, 0.5687, 0.1706, 0.2041, -0.1823, -0.82, -0.7234, 0.4446, -0.3291, -1.5583, 1.2706, 1.7555
X2: 0.1849, 1.1538, 0.7586, -1.5363, 1.906, 0.3761, 1.3323, -0.0064, -1.7275, 1.141, 0.3761, 0.5673, -0.1976, 0.7586, 0.1849
X3: 1.264, 1.2023, -1.0995, 0.8529, 1.3051, -0.3802, -0.7913, 0.1336, 0.2363, -1.0995, -0.0719, -0.894, -1.0995, 0.9557, 1.5722
X4: 1.2843, 1.0175, -0.9676, 0.7408, 1.0635, -0.506, -0.7937, -0.0564, 0.0455, -0.9676, -0.2482, -0.8557, -0.9676, 0.8707, 1.7339
X5: -0.7109, 0.1777, 0.1422, 0.0355, 3.2699, 0.3554, -0.1777, 1.5283, -0.0711, 0.3554, 0.8886, 0.4621, -0.9241, -0.9241, -0.0711
Y: 0.67, 0.9047, -0.197, -1.0108, 0.1616, 0.2984, -0.6039, 0.8823, -1.6832, 0.9531, -0.3208, 0.0507, -0.3916, 0.44, 0.263
• Observations 16-30:
X1: 0.0474, -0.8713, -0.2158, 0.2179, 1.5426, -1.16, 0.9857, 0.6752, 0.5402, -1.4528, 1.9349, -0.8299, -0.681, 0.7304, -1.1305
X2: 0.9498, -0.1976, -1.7275, -0.9626, 1.3323, -0.9626, 0.1849, -1.345, 1.3323, -0.0064, 0.1849, 0.3761, -1.345, -0.7713, -0.0064
X3: 0.0308, -0.6885, 1.0584, 0.5446, 0.5446, -0.483, 0.4419, 1.264, 0.0308, -1.3051, 2.086, -0.5857, -0.2775, 1.5722, -1.3051
X4: 0.1543, -0.7278, 1.0046, 0.3752, 0.3752, -0.5839, 0.2613, 1.2843, -0.1543, -1.0635, 2.5631, -0.6578, -0.4241, 1.7339, -1.0635
X5: 1.1018, 0.6753, -0.391, -0.2843, 1.4217, 0.4621, -0.8175, 0.7819, 0.2488, 1.5283, -0.1777, -1.7771, 0.4621, -1.0307, 0.3554
Y: 1.1477, -0.3916, -0.4624, -1.0993, 2.8639, -1.0285, 0.1923, -0.7631, 2.05, 1.0238, 0.9177, -1.2055, -0.3208, -0.5862, -0.6216
• 42. Numerical Experience and Comparison
• We constructed model functions for these data using the MARS software (Salford), where we selected the maximum number of basis elements $M_{\max} = 5$. Then:
Model 1 (ω = 1):
BF1 = max{0, X2 + 1.728};
Y = -1.081 + 0.626 · BF1
Model 2 (ω = 2):
BF1 = max{0, X2 + 1.728};
BF2 = max{0, X5 - 0.462} · BF1;
Y = -1.073 + 0.499 · BF1 + 0.656 · BF2
Model 3 (ω = 3, best model):
BF1 = max{0, X2 + 1.728};
BF2 = max{0, X5 - 0.462} · BF1;
BF4 = max{0, X3 + 0.586} · BF1;
Y = -1.176 + 0.422 · BF1 + 0.597 · BF2 + 0.236 · BF4
• 43. Numerical Experience and Comparison
• and, finally,
Model 4 (ω = 4):
BF1 = max{0, X2 + 1.728};
BF2 = max{0, X5 - 0.462} · BF1;
BF3 = max{0, 0.462 - X5} · BF1;
BF4 = max{0, X3 + 0.586} · BF1;
Y = -1.242 + 0.555 · BF1 + 0.484 · BF2 - 0.093 · BF3 + 0.226 · BF4
Model 5 (ω = 5):
BF1 = max{0, X2 + 1.728};
BF2 = max{0, X5 - 0.462} · BF1;
BF3 = max{0, 0.462 - X5} · BF1;
BF4 = max{0, X3 + 0.586} · BF1;
BF5 = max{0, -0.586 - X3} · BF1;
Y = -1.248 + 0.487 · BF1 + 0.486 · BF2 - 0.118 · BF3 + 0.282 · BF4 + 0.263 · BF5
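For illustration, the printed Model 3 (the best model above) can be evaluated on new standardized inputs as follows; this is a sketch assembled from the slide, not the Salford implementation:

```python
import numpy as np

def model3_predict(X2, X3, X5):
    """Prediction of Model 3 from its printed basis functions and coefficients."""
    BF1 = np.maximum(0.0, X2 + 1.728)
    BF2 = np.maximum(0.0, X5 - 0.462) * BF1
    BF4 = np.maximum(0.0, X3 + 0.586) * BF1
    return -1.176 + 0.422 * BF1 + 0.597 * BF2 + 0.236 * BF4
```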
• 44. Numerical Experience and Comparison
• Then, we considered a large model with five basis functions; we found (writing a MATLAB code):
$L = \operatorname{diag}\left( 0,\ 1.8419,\ 0.7514,\ 0.9373,\ 2.1996,\ 0.3905 \right)$.
• We constructed models using different values for $M$ in the optimization problem, which was solved by MOSEK (CQP).
• Our algorithm always constructs a model with 5 parameters; in the case of Salford MARS, there are 1, 2, 3, 4 or 5 parameters.
• 45. Numerical Experience and Comparison
RESULTS OF SALFORD MARS
ω | RSS     | z = √RSS | t = ||Lθ||₂ | GCV
1 | 17.6425 | 4.2003   | 1.1531      | 0.771
2 | 11.1870 | 3.3447   | 1.0430      | 0.613
3 |  7.7824 | 2.7897   | 1.0368      | 0.550
4 |  6.6126 | 2.5715   | 1.1967      | 0.626
5 |  6.2961 | 2.5092   | 1.1600      | 0.840
• 46. Numerical Experience and Comparison
RESULTS OF OUR APPROACH
M       | ω | z = √RSS | t = ||Lθ||₂ || M      | ω | z = √RSS | t = ||Lθ||₂
0.05    | 5 | 5.16894  | 0.05        || 0.2940 | 5 | 4.2024   | 0.2940
0.1     | 5 | 4.959342 | 0.1         || 0.2945 | 5 | 4.2006   | 0.2945
0.15    | 5 | 4.755559 | 0.15        || 0.295  | 5 | 4.1988   | 0.2950
0.2     | 5 | 4.557617 | 0.2         || 0.3    | 5 | 4.180557 | 0.3
0.25    | 5 | 4.365811 | 0.25        || 0.35   | 5 | 4.002338 | 0.35
0.265   | 5 | 4.3095   | 0.2650      || 0.4    | 5 | 3.831675 | 0.4
0.275   | 5 | 4.2723   | 0.2750      || 0.45   | 5 | 3.669118 | 0.45
0.285   | 5 | 4.2354   | 0.2850      || 0.5    | 5 | 3.515233 | 0.5
0.2865  | 5 | 4.2299   | 0.2865      || 0.55   | 5 | 3.370588 | 0.55
0.2875  | 5 | 4.2262   | 0.2875      || 0.552  | 5 | 3.3650   | 0.5520
0.2885  | 5 | 4.2226   | 0.2885      || 0.555  | 5 | 3.3567   | 0.5550
0.2895  | 5 | 4.2189   | 0.2895      || 0.558  | 5 | 3.3483   | 0.558
0.28965 | 5 | 4.2183   | 0.2897      || 0.560  | 5 | 3.3428   | 0.5600
0.28975 | 5 | 4.2180   | 0.2897      || 0.561  | 5 | 3.3401   | 0.5610
0.28985 | 5 | 4.2176   | 0.2899      || 0.562  | 5 | 3.3373   | 0.5620
0.28995 | 5 | 4.2172   | 0.2899      || 0.565  | 5 | 3.3291   | 0.5650
• 47. Numerical Experience and Comparison
M     | ω | z = √RSS | t = ||Lθ||₂ || M    | ω | z = √RSS | t = ||Lθ||₂
0.575 | 5 | 3.3019   | 0.5750      || 0.96 | 5 | 2.5968   | 0.96
0.585 | 5 | 3.2751   | 0.5850      || 0.97 | 5 | 2.5880   | 0.97
0.595 | 5 | 3.2488   | 0.5950      || 0.98 | 5 | 2.5797   | 0.98
0.6   | 5 | 3.235746 | 0.6         || 0.99 | 5 | 2.5718   | 0.99
0.65  | 5 | 3.111253 | 0.65        || 1    | 5 | 2.564459 | 1
0.7   | 5 | 2.997622 | 0.7         || 2    | 5 | 2.509165 | 1.16009
0.75  | 5 | 2.895324 | 0.75        || 2.1  | 5 | 2.509165 | 1.16009
0.8   | 5 | 2.804764 | 0.8         || 2.2  | 5 | 2.509165 | 1.16009
0.805 | 5 | 2.7964   | 0.8050      || 2.3  | 5 | 2.509165 | 1.16007
0.810 | 5 | 2.7881   | 0.8100      || 2.4  | 5 | 2.509165 | 1.16008
0.820 | 5 | 2.7719   | 0.8200      || 2.5  | 5 | 2.509165 | 1.16001
0.830 | 5 | 2.7562   | 0.8300      || 2.6  | 5 | 2.509165 | 1.16007
0.840 | 5 | 2.7410   | 0.8400      || 2.7  | 5 | 2.509165 | 1.16007
0.85  | 5 | 2.726261 | 0.85        || 2.8  | 5 | 2.509165 | 1.16009
0.9   | 5 | 2.660023 | 0.9         || 2.9  | 5 | 2.509165 | 1.16009
0.95  | 5 | 2.60612  | 0.95        || 3    | 5 | 2.509165 | 1.16009
      |   |          |             || 4    | 5 | 2.509165 | 1.160084
• 48. Numerical Experience and Comparison
• We drew L-curves: $\left\| \psi(d)\theta - y \right\|_2$ plotted against $\left\| L\theta \right\|_2$. [Figure: two L-curve panels.]
• Conclusion: Based on the L-curve criterion and for the given data, our solution is better than the Salford MARS solution.
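A sketch of how such an L-curve can be traced; here the curve is parametrized through the equivalent Tikhonov form rather than by the bound M of the CQP, which is a simplification of this illustration:

```python
import numpy as np

def l_curve(Psi, y, L, lambdas):
    """Trace the L-curve: (||L theta||_2, ||Psi theta - y||_2) over a grid of lambda."""
    points = []
    for lam in lambdas:
        A = np.vstack([Psi, np.sqrt(lam) * L])
        b = np.concatenate([y, np.zeros(L.shape[0])])
        theta, *_ = np.linalg.lstsq(A, b, rcond=None)
        points.append((np.linalg.norm(L @ theta), np.linalg.norm(Psi @ theta - y)))
    return np.array(points)

# pts = l_curve(Psi, y, L, np.logspace(-3, 3, 50))
# import matplotlib.pyplot as plt
# plt.plot(pts[:, 0], pts[:, 1], "o-"); plt.xlabel("||L theta||_2"); plt.ylabel("||Psi theta - y||_2")
```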
• 49. Numerical Experience and Comparison
• All test data sets are also compared according to performance measures such as MSE, MAE, correlation coefficient, R², PRESS, Mallows' Cp, etc.
• These measures are based on the average of nine values (one for each fold and each replication).
• 50. Numerical Experience and Comparison
Please find much more numerical experience and comparison in: Yerlikaya, Fatma, A New Contribution to Nonlinear Robust Regression and Classification with MARS and Its Application to Data Mining for Quality Control in Manufacturing, M.Sc. thesis, Institute of Applied Mathematics, METU, Ankara, 2008.
• 51. Piecewise Linear Functions - Stock Market (figures generated by Erik Kropat)
• 52. Forward Stepwise Algorithm Revisited (high complexity)
• 57. Regularization & Uncertainty, Robust Optimization (Laurent El Ghaoui)
  • 58. Regularization & Uncertainty Robust Optimization
• 59. References
• Aster, A., Borchers, B., and Thurber, C., Parameter Estimation and Inverse Problems, Academic Press, 2004.
• Breiman, L., Friedman, J.H., Olshen, R., and Stone, C., Classification and Regression Trees, Belmont, CA: Wadsworth Int. Group, 1984.
• Craven, P., and Wahba, G., Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation, Numerische Mathematik 31 (1979) 377-403.
• Friedman, J.H., Multivariate adaptive regression splines, The Annals of Statistics 19, 1 (1991) 1-141.
• Hansen, P.C., Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion, SIAM, Philadelphia, 1998.
• Hastie, T., Tibshirani, R., and Friedman, J.H., The Elements of Statistical Learning, Springer Verlag, NY, 2001.
• MOSEK software, http://www.mosek.com/.
• Myers, R.H., and Montgomery, D.C., Response Surface Methodology: Process and Product Optimization Using Designed Experiments, New York: Wiley, 2002.
• Nemirovski, A., Lectures on Modern Convex Optimization, Israel Institute of Technology (2002), http://iew3.technion.ac.il/Labs/Opt/LN/Final.pdf.
• Nesterov, Y.E., and Nemirovskii, A.S., Interior Point Methods in Convex Programming, SIAM, 1993.
• Taylan, P., Weber, G.-W., and Beck, A., New approaches to regression by generalized additive models and continuous optimization for modern applications in finance, science and technology, Optimization 56, 5-6 (2007) 675-698.
• Taylan, P., Weber, G.-W., and Yerlikaya, F., Continuous optimization applied in MARS for modern applications in finance, science and technology, in: ISI Proceedings of 20th Mini-EURO Conference "Continuous Optimization and Knowledge-Based Technologies", Neringa, Lithuania, May 20-23, 2008.