The Impact of Smoothness on Model Class
Selection in Nonlinear System Identification:
An Application of Derivatives in the RKHS
Y. Bhujwalla, V. Laurain, M. Gilson
6th July 2016
yusuf-michael.bhujwalla@univ-lorraine.fr
Introduction
The Data-Generating System
Measured data : $\mathcal{D}_N = \{(u_1, y_1), (u_2, y_2), \ldots, (u_N, y_N)\}$.
Describes $\mathcal{S}_o$, an unknown nonlinear system with function $f_o : \mathcal{X} \to \mathbb{R}$,
$$\mathcal{S}_o : \quad y_{o,k} = f_o(x_k), \qquad y_k = y_{o,k} + e_{o,k},$$
where $x_k = [\, y_{k-1} \cdots y_{k-n_a} \;\; u_k \cdots u_{k-n_b} \,]^\top \in \mathcal{X} = \mathbb{R}^{n_a + n_b + 1}$.
Parametric Models
· Nθ low (fixed)
→ Physically interpretable ✓
· Choice of basis functions?
→ Combinatorially hard problem ✗
Nonparametric Models (such as kernel methods)
· Nθ high (∼ data)
→ Not interpretable ✗
· Can define a general model class.
→ Flexibility ✓
[FIGURE: a kernel-based estimate of yo, built from kernel sections kx over the input space.]
Outline
1 Kernel Methods in Nonlinear Identification
2 The Kernel Selection Problem
3 Smoothness in the RKHS
4 Simulation Examples
1. Kernel Methods in Nonlinear Identification
Reproducing Kernel Hilbert Spaces
Hilbert Spaces
H is a space over a class of functions f : X → R, equipped with :
· a norm, ∥f∥H ;
· an inner product, ⟨f, g⟩H.
In system identification, H ⇔ model class.
Reproducing Kernels
H has a unique, associated kernel function, K : X × X → R, spanning the space H.
The Reproducing Property states that f(x) can be explicitly represented as an
infinite sum in terms of the kernel function :
$$f(x) = \langle f, K_x \rangle_{\mathcal{H}} = \sum_{i=1}^{\infty} \alpha_i K(x_i, x)$$
1. Kernel Methods in Nonlinear Identification
Identification in the RKHS
For $\hat{f} \in \mathcal{H}$ close to $f_o$, $\hat{f}$ should reflect the observations :
$$\hat{f} = \arg\min_{f} \{\, V(f) = \mathcal{L}(x, y, f(x)) \,\}$$
However, this has infinitely many solutions ⇒ add a constraint to the model :
$$\hat{f} = \arg\min_{f} \{\, V(f) = \mathcal{L}(x, y, f(x)) + g(\|f\|_{\mathcal{H}}) \,\}$$
For such cost functions, f(x) can be reduced to a finite sum over the observations :
$$f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x), \qquad \alpha \in \mathbb{R}^N$$
· The Representer Theorem (Schölkopf, Herbrich and Smola, 2001).
1. Kernel Methods in Nonlinear Identification
A Widely-Used Example
As an example, minimise the squared error :
$$\mathcal{L}(x, y, f(x)) = \|y - f(x)\|_2^2,$$
and use regularisation to avoid overparameterisation :
$$g(\|f\|_{\mathcal{H}}) = \lambda \|f\|_{\mathcal{H}}^2.$$
Giving :
$$V_f : \; V(f) = \|y - f(x)\|_2^2 + \lambda_f \|f\|_{\mathcal{H}}^2 \;\; \Rightarrow \;\; \alpha_f = (K + \lambda_f I)^{-1} y$$
· The solution depends on I. the kernel K, and II. the regularisation parameter λf.
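As an illustration, here is a minimal numerical sketch of this closed-form estimator (kernel ridge regression), assuming a 1-D input and a Gaussian RBF kernel (introduced in the next section). This is not the authors' code; the data, σ and λf values are hypothetical.

```python
import numpy as np

def rbf_kernel(x, z, sigma):
    # Gram matrix of the 1-D Gaussian RBF kernel K(xi, x) = exp(-|x - xi|^2 / sigma^2).
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / sigma ** 2)

def fit_krr(x, y, sigma, lam):
    # Closed-form solution alpha_f = (K + lambda_f * I)^(-1) y.
    K = rbf_kernel(x, x, sigma)
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def predict_krr(alpha, x_train, x_new, sigma):
    # Representer sum f(x) = sum_i alpha_i K(xi, x).
    return rbf_kernel(x_new, x_train, sigma) @ alpha

# Hypothetical usage on a noisy 1-D switching signal (values illustrative):
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.where(x > 0, 10.0, -10.0) + 3.0 * rng.standard_normal(200)
alpha_f = fit_krr(x, y, sigma=0.1, lam=1.0)
f_hat = predict_krr(alpha_f, x, np.linspace(-1, 1, 101), sigma=0.1)
```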
2. The Kernel Selection Problem
Choosing a Kernel Function
K defines the model class.
Let X = R, and K be the Gaussian RBF kernel :
$$K(x_i, x) = \exp\!\left( - \frac{\|x - x_i\|^2}{\sigma^2} \right).$$
Width (σ) defines smoothness of
the kernel function.
Hence σ determines the model
class !
Other kernels have different
hyperparameters, but they will still
influence H.
[FIGURE: kernel sections Kx on [−1, 1] for two widths, σ1 (narrow) and σ2 > σ1 (wide).]
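For instance, the two kernel sections in the figure could be reproduced as follows (σ values are illustrative, reusing rbf_kernel from the sketch above):

```python
import numpy as np

xs = np.linspace(-1, 1, 401)
xc = np.zeros(1)                                   # kernel centre at x = 0
k_narrow = rbf_kernel(xs, xc, sigma=0.1).ravel()   # sigma1: flexible, rough model class
k_wide = rbf_kernel(xs, xc, sigma=0.5).ravel()     # sigma2 > sigma1: smooth, rigid model class
```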
2. The Kernel Selection Problem
Implications of the Hyperparameter Selection
Estimation of a 1D switching signal using $V_f = \|y - f(x)\|_2^2 + \lambda_f \|f\|_{\mathcal{H}}^2$.
· Many observations (N = 10³).
· uk ∼ U(−1, 1).
· Significant noise disturbances (SNR = 5 dB).
· Two hyperparameters : I. σ and II. λ.
[FIGURE: estimation of the 1D switching signal fo(uk) for different hyperparameter values.]
2. The Kernel Selection Problem
Implications of the Hyperparameter Selection
[FIGURE: 2 × 2 grid of estimates (ˆfMEAN ± ˆfSD against yo) for SMALL/LARGE λ and SMALL/LARGE σ, annotated with the trade-off between SMOOTHNESS and FLEXIBILITY.]
2. The Kernel Selection Problem
Summary
$V_f : \; V(f) = \|y - f(x)\|_2^2 + \lambda_f \|f\|_{\mathcal{H}}^2.$
The kernel framework is very effective :
· flexible,
· well-understood.
However, the choice of kernel is often compromised (e.g. by noise).
⇒ Trade-off between flexibility and smoothness.
So, why regularise over ∥f∥H . . .
. . . when smoothness is often a more interesting property to control?
⇒ A desirable property in many models.
⇒ Characterises many systems.
3. Smoothness in the RKHS
Regularisation Using Derivatives
Proposition
Replace the functional regularisation :
$$V_f : \; V(f) = \|y - f(x)\|_2^2 + \lambda_f \|f\|_{\mathcal{H}}^2,$$
with a smoothness-enforcing regularisation :
$$V_D : \; V(f) = \|y - f(x)\|_2^2 + \lambda_D \|Df\|_{\mathcal{H}}^2.$$
Now :
· smoothness is controlled directly by the regularisation, and
· the kernel hyperparameter is removed from the optimisation problem.
3. Smoothness in the RKHS
Derivatives in the RKHS
For f ∈ H, Df ∈ H (Zhou, 2008).
Hence, a derivative reproducing property can be defined :
$$Df(x) = \langle f, DK_x \rangle_{\mathcal{H}}$$
The Representer Theorem
The representer $f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x)$ requires $g(\|f\|_{\mathcal{H}})$ to be a monotonically increasing function of $\|f\|_{\mathcal{H}}$.
Clearly, $\|Df\|_{\mathcal{H}}$ is not such a function of $\|f\|_{\mathcal{H}}$ ⇒ the representer is suboptimal for $V_D$.
However, if the system is well-excited, $f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x)$ can still be used,
and it loosely preserves the bias-variance properties of $V_f$ :
$$\lim_{\lambda \to \infty} f(x) = 0, \quad \forall x \in \mathbb{R}.$$
3. Smoothness in the RKHS
Derivatives in the RKHS
A Closed-Form Solution
Using the derivative reproducing property, ∥Df∥H can be defined :
$$\|Df\|_{\mathcal{H}}^2 = \alpha^\top D^{(1,1)}K \, \alpha, \qquad \text{where} \quad D^{(1,1)}K(x_i, x_j) = \frac{\partial^2 K(x_i, x_j)}{\partial x_j \, \partial x_i},$$
permitting a closed-form solution :
$$\alpha_D = \left( K^\top K + \lambda_D D^{(1,1)}K \right)^{-1} K^\top y.$$
Compare with $V_f \Rightarrow \alpha_f = (K + \lambda_f I)^{-1} y$.
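To make this concrete, below is a minimal sketch of the derivative-regularised estimator, again assuming the 1-D Gaussian RBF kernel (for which the cross-derivative of K has a simple closed form) and reusing rbf_kernel from the earlier sketch. It is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def d11_rbf_kernel(x, sigma):
    # D^(1,1)K(xi, xj) = d^2 K / (dxj dxi) for the 1-D Gaussian RBF kernel.
    # Differentiating K = exp(-r^2/sigma^2), with r = xi - xj, gives
    # (2/sigma^2 - 4*r^2/sigma^4) * K.
    r = x[:, None] - x[None, :]
    K = np.exp(-r ** 2 / sigma ** 2)
    return (2.0 / sigma ** 2 - 4.0 * r ** 2 / sigma ** 4) * K

def fit_krr_deriv(x, y, sigma, lam_d):
    # Closed-form solution alpha_D = (K^T K + lambda_D * D^(1,1)K)^(-1) K^T y.
    K = rbf_kernel(x, x, sigma)
    D11 = d11_rbf_kernel(x, sigma)
    return np.linalg.solve(K.T @ K + lam_d * D11, K.T @ y)

# Hypothetical usage, with a small kernel as in Example 1 (sigma = 0.01 there):
# alpha_d = fit_krr_deriv(x, y, sigma=0.01, lam_d=1.0)
```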
4. Simulation Examples
Example 1 : Effect of the Regularisation
Estimation of a 1D switching signal using Vf and VD.
· Many observations (N = 10³).
· uk ∼ U(−1, 1).
· Significant noise disturbances (SNR = 5 dB).
· Gaussian RBF kernel, with σ = 0.01.
· Varying levels of regularisation (through λf, λD).
[FIGURE: estimation of the 1D switching signal fo(uk) for different λ values.]
4. Simulation Examples
Example 1 : Effect of the Regularisation
Estimates under Vf (regularising R(f), left) and VD (regularising R(Df), right), shown as ˆfMEAN ± ˆfSD against yo, for five levels of regularisation :
⇒ Negligible regularisation (very small λf, λD). [FIGURE]
⇒ Light regularisation (small λf, λD). [FIGURE]
⇒ Moderate regularisation. [FIGURE]
⇒ Heavy regularisation (large λf, λD). [FIGURE]
⇒ Excessive regularisation (very large λf, λD). [FIGURE]
4. Simulation Examples
Example 2 : 1D Structural Selection
Identification of two unknown systems (X = [−1, 1], SNR = 10 dB, N = 10³).
· Vf : λ, σ optimised using cross-validation.
· VD : λ optimised using cross-validation, σ set based on the data.
[FIGURE: the two target functions over uk — S¹o : smooth f¹o(uk) ; S²o : nonsmooth f²o(uk).]
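As a sketch of the cross-validation step, a simple hold-out search over λ for VD might look like this (the grid and split are illustrative, reusing fit_krr_deriv and predict_krr from the sketches above):

```python
import numpy as np

def select_lambda(x_tr, y_tr, x_val, y_val, sigma,
                  grid=(1e-3, 1e-2, 1e-1, 1.0, 10.0)):
    # Pick the lambda_D with the smallest validation mean-squared error.
    errs = []
    for lam in grid:
        alpha = fit_krr_deriv(x_tr, y_tr, sigma, lam)
        f_val = predict_krr(alpha, x_tr, x_val, sigma)
        errs.append(np.mean((y_val - f_val) ** 2))
    return grid[int(np.argmin(errs))]
```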
4. Simulation Examples
Example 2 : Smooth S¹o
Using a small kernel, VD can reconstruct a smooth function.
This is not feasible using Vf, which needs the kernel's smoothing effect.
[FIGURE: Vf (R(f), left) vs VD (R(Df), right) — ˆfMEAN ± ˆfSD with kernel sections kx.]
4. Simulation Examples
Example 2 : Nonsmooth S²o
Using a small kernel, VD can detect the structural nonlinearity.
However, Vf is too smooth, as σ must counteract the noise.
[FIGURE: Vf (R(f), left) vs VD (R(Df), right) — ˆfMEAN ± ˆfSD with kernel sections kx.]
Conclusions
RKHS in Nonlinear Identification
Flexible framework : attractive for nonlinear identification.
Smoothness controlled by both the kernel function and the regularisation (σ and λf).
⇒ Constrained kernel function.
Derivatives in the RKHS
Smoothness controlled by the regularisation alone (λD).
⇒ Simpler steering of the smoothness.
Simpler hyperparameter optimisation (just λD) and increased model flexibility.
⇒ Through use of a smaller kernel (small σ).
However, relies on a suboptimal representer.
⇒ Nonetheless, promising results have been obtained.
A. Bibliography
Alternative Smoothness-Enforcing Optimisation Schemes
Sobolev Spaces (Wahba, 1990 ; Pillonetto et al, 2014)
$$\|f\|_{\mathcal{H}_k} = \sum_{i=0}^{m} \int_{\mathcal{X}} \left( \frac{d^i f(x)}{dx^i} \right)^2 dx$$
Identification using derivative observations (Zhou, 2008 ; Rosasco et al, 2010)
$$V_{\mathrm{obs}}(f) = \|y - f(x)\|_2^2 + \gamma_1 \left\| \frac{dy}{dx} - \frac{df(x)}{dx} \right\|_2^2 + \cdots + \gamma_m \left\| \frac{d^m y}{dx^m} - \frac{d^m f(x)}{dx^m} \right\|_2^2 + \lambda \|f\|_{\mathcal{H}}$$
Regularisation using derivatives (Rosasco et al, 2010 ; Lauer, Le and Bloch, 2012 ; Duijkers et al, 2014)
$$V_D(f) = \|y - f(x)\|_2^2 + \lambda \|D^m f\|_p.$$
A. Bibliography
Literature Review
Kernel Methods in Machine Learning and System Identification
· Kernel methods in system identification, machine learning and function estimation : A survey, G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao and L. Ljung, 2014.
· Learning with Kernels, B. Schölkopf, R. Herbrich and A. J. Smola, 2002.
· Gaussian Processes for Machine Learning, C. Rasmussen and C. Williams, 2006.
Reproducing Kernel Hilbert Spaces
· Theory of Reproducing Kernels, N. Aronszajn, 1950.
· A Generalized Representer Theorem, B. Schölkopf, R. Herbrich and A. J. Smola, 2001.
· Derivative reproducing properties for kernel methods in learning theory, D. Zhou, 2008.
B. Example 2 : 1D Structural Selection
S¹o : Smooth
[FIGURE: Vf (R(f), left) vs VD (R(Df), right) — ˆfMEAN ± ˆfSD with kernel sections kx.]
S²o : Nonsmooth
[FIGURE: Vf (R(f), left) vs VD (R(Df), right) — ˆfMEAN ± ˆfSD with kernel sections kx.]
C. Applicability of the Representer
Kernel Density
Applicability of the representer depends on the kernel density, i.e. the ratio of
observations to the kernel width :
[FIGURE: three representer fits ˆf with kernel sections Kx, at kernel densities ρk = 0.6, 0.6 and 0.4.]
Desirable to ensure σ ≈ max(∆x) (where ∆x is the spacing between adjacent
observations).
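As a sketch, this guideline could be applied to the sampled inputs as follows — one plausible reading of "σ set based on the data" from Example 2, not necessarily the authors' exact rule:

```python
import numpy as np

def sigma_from_data(x):
    # Set sigma to roughly the largest gap between adjacent (sorted) observations,
    # following the guideline sigma ~ max(delta x) above.
    return np.max(np.diff(np.sort(x)))
```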