Lecture 4: kernels and associated functions
Stéphane Canu
stephane.canu@litislab.eu
Sao Paulo 2014
March 4, 2014
Plan
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
Introducing non linearities through the feature map

f(x) = Σ_{j=1}^d xj wj + b = Σ_{i=1}^n αi (xi⊤x) + b

a point t = (t1, t2) ∈ IR² is mapped to x = φ(t) = (t1, t1², t2, t2², t1t2) ∈ IR⁵:
the model is linear in x ∈ IR⁵ and quadratic in t ∈ IR²

The feature map
φ : IR² −→ IR⁵
    t −→ φ(t) = x,   so that xi⊤x = φ(ti)⊤φ(t)
Introducing non linearities through the feature map
A. Lorena & A. de Carvalho, Uma Introducão às Support Vector Machines, 2007
Non linear case: dictionary vs. kernel

in the non linear case, use a dictionary of functions φj(x), j = 1, ..., p, with possibly p = ∞,
for instance polynomials, wavelets...

f(x) = Σ_{j=1}^p wj φj(x)   with   wj = Σ_{i=1}^n αi yi φj(xi)

so that

f(x) = Σ_{i=1}^n αi yi Σ_{j=1}^p φj(xi)φj(x),   where the inner sum is the kernel k(xi, x)

even if p ≥ n (or p = ∞), so what, since only k(xi, x) = Σ_{j=1}^p φj(xi)φj(x) is needed
closed form kernel: the quadratic kernel

The quadratic dictionary in IR^d:
Φ : IR^d → IR^p, p = 1 + d + d(d+1)/2
    s → Φ(s) = (1, s1, s2, ..., sd, s1², s2², ..., sd², ..., si sj, ...)

in this case
Φ(s)⊤Φ(t) = 1 + s1t1 + s2t2 + ... + sd td + s1²t1² + ... + sd²td² + ... + si sj ti tj + ...

The quadratic kernel: s, t ∈ IR^d, k(s, t) = (s⊤t + 1)² = 1 + 2 s⊤t + (s⊤t)²
computes the dot product of the reweighted dictionary:
Φ : IR^d → IR^p, p = 1 + d + d(d+1)/2
    s → Φ(s) = (1, √2 s1, √2 s2, ..., √2 sd, s1², s2², ..., sd², ..., √2 si sj, ...)

p = 1 + d + d(d+1)/2 multiplications vs. d + 1: use the kernel to save computation
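To make the computational point concrete, here is a small Python/NumPy sketch (not part of the original slides): it builds the reweighted quadratic dictionary explicitly and checks that its dot product coincides with the closed form (s⊤t + 1)². For d = 5 the explicit map already has p = 21 components, while the closed form only needs the d + 1 operations of s⊤t plus one square.

import numpy as np

def quad_features(s):
    """Explicit reweighted quadratic feature map, p = 1 + d + d(d+1)/2 components."""
    d = s.shape[0]
    feats = [1.0]
    feats += list(np.sqrt(2.0) * s)                       # √2 si
    feats += list(s ** 2)                                 # si²
    feats += [np.sqrt(2.0) * s[i] * s[j]                  # √2 si sj, i < j
              for i in range(d) for j in range(i + 1, d)]
    return np.array(feats)

def quad_kernel(s, t):
    """Closed-form quadratic kernel: (s.t + 1)^2."""
    return (s @ t + 1.0) ** 2

rng = np.random.default_rng(0)
s, t = rng.standard_normal(5), rng.standard_normal(5)
print(quad_features(s) @ quad_features(t))   # explicit dictionary gives the same value...
print(quad_kernel(s, t))                     # ...as the closed form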
kernel: features through pairwise comparisons

instead of storing the n × p feature matrix Φ (one row φ(x) per example, e.g. a bag of words
for a text x), work with the n × n matrix of pairwise comparisons

k(xi, xj) = Σ_{ℓ=1}^p φℓ(xi)φℓ(xj)

K: the matrix of pairwise comparisons (O(n²))
Kernel machine

kernel as a dictionary
f(x) = Σ_{i=1}^n αi k(x, xi)

αi: the influence of example i, depends on yi
k(x, xi): the kernel, does NOT depend on yi

Definition (Kernel)
Let X be a non empty set (the input space).
A kernel is a function k from X × X to IR.
k : X × X −→ IR
    s, t −→ k(s, t)

semi-parametric version: given the family qj(x), j = 1, ..., p
f(x) = Σ_{i=1}^n αi k(x, xi) + Σ_{j=1}^p βj qj(x)
Kernel Machine

Definition (Kernel machines)
A_{(xi, yi)i=1,n}(x) = ψ( Σ_{i=1}^n αi k(x, xi) + Σ_{j=1}^p βj qj(x) )
α and β: parameters to be estimated.

Examples
A(x) = Σ_{i=1}^n αi (x − xi)³₊ + β0 + β1 x                          splines
A(x) = sign( Σ_{i∈I} αi exp(−‖x − xi‖²/b) + β0 )                    SVM
IP(y|x) = (1/Z) exp( Σ_{i∈I} αi 1I{y=yi} (x⊤xi + b)² )              exponential family
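As an illustration of how such a machine is evaluated (a sketch, not from the slides: the Gaussian kernel matches the SVM example above, but the coefficients αi and β0 below are arbitrary placeholders rather than fitted values):

import numpy as np

def gaussian_kernel(s, t, b=1.0):
    """k(s, t) = exp(-||s - t||^2 / b), the radial kernel of the SVM example."""
    return np.exp(-np.sum((s - t) ** 2) / b)

def kernel_machine(x, X_train, alpha, beta0=0.0, b=1.0):
    """f(x) = sum_i alpha_i k(x, x_i) + beta0 ; sign(f) gives the SVM-style decision."""
    return sum(a * gaussian_kernel(x, xi, b) for a, xi in zip(alpha, X_train)) + beta0

rng = np.random.default_rng(0)
X_train = rng.standard_normal((10, 2))      # 10 training points in IR^2
alpha = rng.standard_normal(10)             # placeholder coefficients (not fitted)
x_new = np.zeros(2)
print(np.sign(kernel_machine(x_new, X_train, alpha)))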
Plan
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
In the beginning was the kernel...

Definition (Kernel)
a function of two variables k from X × X to IR

Definition (Positive kernel)
A kernel k(s, t) on X is said to be positive
  if it is symmetric: k(s, t) = k(t, s)
  and if for any finite positive integer n:
  ∀{αi}i=1,n ∈ IR, ∀{xi}i=1,n ∈ X,   Σ_{i=1}^n Σ_{j=1}^n αi αj k(xi, xj) ≥ 0

it is strictly positive if, for any (αi)i=1,n not all zero,
  Σ_{i=1}^n Σ_{j=1}^n αi αj k(xi, xj) > 0
Examples of positive kernels

the linear kernel: s, t ∈ IR^d, k(s, t) = s⊤t
  symmetric: s⊤t = t⊤s
  positive:
  Σ_{i=1}^n Σ_{j=1}^n αi αj k(xi, xj) = Σ_{i=1}^n Σ_{j=1}^n αi αj xi⊤xj
                                      = (Σ_{i=1}^n αi xi)⊤(Σ_{j=1}^n αj xj) = ‖ Σ_{i=1}^n αi xi ‖²

the product kernel: k(s, t) = g(s)g(t) for some g : IR^d → IR
  symmetric by construction
  positive:
  Σ_{i=1}^n Σ_{j=1}^n αi αj k(xi, xj) = Σ_{i=1}^n Σ_{j=1}^n αi αj g(xi)g(xj)
                                      = (Σ_{i=1}^n αi g(xi)) (Σ_{j=1}^n αj g(xj)) = (Σ_{i=1}^n αi g(xi))²

k is positive ⇔ (its square root exists) ⇔ k(s, t) = ⟨φs, φt⟩

J.P. Vert, 2006
Example: finite kernel

let φj, j = 1, ..., p be a finite dictionary of functions from X to IR (polynomials, wavelets...)

the feature map and linear kernel
feature map:
Φ : X → IR^p
    s → Φ(s) = (φ1(s), ..., φp(s))

Linear kernel in the feature space:
k(s, t) = (φ1(s), ..., φp(s))⊤(φ1(t), ..., φp(t))

e.g. the quadratic kernel: s, t ∈ IR^d, k(s, t) = (s⊤t + b)²
feature map:
Φ : IR^d → IR^p, p = 1 + d + d(d+1)/2
    s → Φ(s) = (1, √2 s1, ..., √2 sj, ..., √2 sd, s1², ..., sj², ..., sd², ..., √2 si sj, ...)
Positive definite Kernel (PDK) algebra (closure)

if k1(s, t) and k2(s, t) are two positive kernels
  PDK form a convex cone: ∀a1 ∈ IR+,  a1 k1(s, t) + k2(s, t) is a PDK
  the product kernel k1(s, t)k2(s, t) is a PDK

proofs
  by linearity:
  Σ_{i=1}^n Σ_{j=1}^n αi αj (a1 k1(i, j) + k2(i, j)) = a1 Σ_{i=1}^n Σ_{j=1}^n αi αj k1(i, j) + Σ_{i=1}^n Σ_{j=1}^n αi αj k2(i, j)

  assuming ∃ψ s.t. k1(s, t) = ψ(s)ψ(t):
  Σ_{i=1}^n Σ_{j=1}^n αi αj k1(xi, xj)k2(xi, xj) = Σ_{i=1}^n Σ_{j=1}^n αi αj ψ(xi)ψ(xj) k2(xi, xj)
                                                 = Σ_{i=1}^n Σ_{j=1}^n (αi ψ(xi)) (αj ψ(xj)) k2(xi, xj) ≥ 0

N. Cristianini and J. Shawe-Taylor, Kernel Methods for Pattern Analysis, 2004
Kernel engineering: building PDK

if k(s, t) is a PDK:
  for any polynomial φ with positive coefficients from IR to IR, φ(k(s, t)) is a PDK
  if Ψ is a function from IR^d to IR^d, k(Ψ(s), Ψ(t)) is a PDK
  if ϕ from IR^d to IR+ has its minimum in 0, then k(s, t) = ϕ(s + t) − ϕ(s − t) is a PDK
  the convolution of two positive kernels is a positive kernel: K1 ⋆ K2

Example: the Gaussian kernel is a PDK
exp(−‖s − t‖²) = exp(−‖s‖² − ‖t‖² + 2 s⊤t)
               = exp(−‖s‖²) exp(−‖t‖²) exp(2 s⊤t)
  s⊤t is a PDK and exp is the limit of a series expansion with positive coefficients, so exp(2 s⊤t) is a PDK
  exp(−‖s‖²) exp(−‖t‖²) is a PDK as a product kernel
  the product of two PDKs is a PDK
an attempt at classifying PD kernels

stationary kernels (also called translation invariant): k(s, t) = ks(s − t)
  radial (isotropic)     gaussian: exp(−r²/b), r = ‖s − t‖
  with compact support   c.s. Matérn: max(0, 1 − r/b)^κ (r/b)^k Bk(r/b),  κ ≥ (d + 1)/2

locally stationary kernels: k(s, t) = k1(s + t) ks(s − t)
  where k1 is a non negative function and ks a radial kernel

non stationary (projective) kernels: k(s, t) = kp(s⊤t)

separable kernels: k(s, t) = k1(s)k2(t) with k1 and k2 PDK
  in this case K = k1 k2⊤ where k1 = (k1(x1), ..., k1(xn))⊤

MG Genton, Classes of Kernels for Machine Learning: A Statistics Perspective, JMLR, 2002
some examples of PD kernels...

type        name         k(s, t)
radial      gaussian     exp(−r²/b),  r = ‖s − t‖
radial      laplacian    exp(−r/b)
radial      rational     1 − r²/(r² + b)
radial      loc. gauss.  max(0, 1 − r/(3b))^d exp(−r²/b)
non stat.   χ²           exp(−r/b),  r = Σ_k (sk − tk)²/(sk + tk)
projective  polynomial   (s⊤t)^p
projective  affine       (s⊤t + b)^p
projective  cosine       s⊤t / (‖s‖ ‖t‖)
projective  correlation  exp( s⊤t/(‖s‖ ‖t‖) − b )

Most of these kernels depend on a quantity b called the bandwidth
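A few of the kernels from the table, written out in Python as a sketch (the formulas follow the table above; the function names and default bandwidths are ad hoc choices, not a reference implementation):

import numpy as np

def gaussian(s, t, b=1.0):
    r2 = np.sum((s - t) ** 2)
    return np.exp(-r2 / b)                       # radial: exp(-r^2/b)

def laplacian(s, t, b=1.0):
    r = np.linalg.norm(s - t)
    return np.exp(-r / b)                        # radial: exp(-r/b)

def polynomial(s, t, p=2):
    return (s @ t) ** p                          # projective: (s.t)^p

def affine(s, t, b=1.0, p=2):
    return (s @ t + b) ** p                      # projective: (s.t + b)^p

def cosine(s, t):
    return (s @ t) / (np.linalg.norm(s) * np.linalg.norm(t))   # projective

s, t = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(gaussian(s, t), laplacian(s, t), polynomial(s, t), affine(s, t), cosine(s, t))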
the importance of the Kernel bandwidth

for the affine Kernel: Bandwidth = bias
k(s, t) = (s⊤t + b)^p = b^p (s⊤t/b + 1)^p

for the gaussian Kernel: Bandwidth = influence zone
k(s, t) = (1/Z) exp(−‖s − t‖²/(2σ²)),   b = 2σ²

Illustration: 1d density estimation with b = 1 and with b = 2
  + data (x1, x2, ..., xn)
  – Parzen estimate IP(x) = (1/Z) Σ_{i=1}^n k(x, xi)
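The illustration can be reproduced with a short sketch (assuming a Gaussian kernel and a synthetic two-bump sample; this is illustrative code, not the script behind the original figure):

import numpy as np
import matplotlib.pyplot as plt

def parzen(x_grid, data, b=1.0):
    """Parzen estimate IP(x) = (1/Z) sum_i k(x, x_i) with a gaussian kernel of bandwidth b."""
    k = np.exp(-(x_grid[:, None] - data[None, :]) ** 2 / b)   # n_grid x n kernel values
    dens = k.sum(axis=1)
    return dens / np.trapz(dens, x_grid)                      # normalise so it integrates to 1

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 0.5, 50)])
xs = np.linspace(-6, 6, 400)
for b in (1.0, 2.0):                                          # the two bandwidths of the illustration
    plt.plot(xs, parzen(xs, data, b), label=f"b = {b}")
plt.plot(data, np.zeros_like(data), "+", label="data")
plt.legend(); plt.show()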
kernels for objects and structures

kernels on histograms and probability distributions

kernels on strings
  spectral string kernel: k(s, t) = Σ_u φu(s)φu(t), using subsequences u
  similarities by alignments: k(s, t) = Σ_π exp(β(s, t, π))

kernels on graphs
  the pseudo inverse of the (regularized) graph Laplacian L = D − A
    (A the adjacency matrix, D the degree matrix)
  diffusion kernels: (1/Z(b)) exp(bL)
  subgraph kernels by convolution (using random walks)

and kernels on HMM, automata, dynamical systems...

Shawe-Taylor & Cristianini’s book, 2004; JP Vert, 2006
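As one concrete case, here is a minimal sketch of the spectral string kernel restricted to contiguous substrings of a fixed length p (a common simplification, often called the p-spectrum kernel; the choice p = 2 and the example strings are arbitrary):

from collections import Counter

def spectrum_features(s, p=2):
    """phi_u(s): number of occurrences of each length-p substring u of s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p=2):
    """k(s, t) = sum_u phi_u(s) phi_u(t), summing over shared substrings only."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    return sum(cnt * ft[u] for u, cnt in fs.items())

print(spectrum_kernel("gattaca", "tacata", p=2))   # counts shared 2-grams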
Multiple kernel
M. Cuturi, Positive Definite Kernels in Machine Learning, 2009
Gram matrix

Definition (Gram matrix)
let k(s, t) be a positive kernel on X and (xi)i=1,n a sequence on X. The Gram matrix is the
square matrix K of dimension n with general term Kij = k(xi, xj).

practical trick to check kernel positivity:
K is positive ⇔ its eigenvalues λi are positive: if K ui = λi ui, i = 1, ..., n, then
ui⊤K ui = λi ui⊤ui = λi

matrix K is the one to be used
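The practical trick translates directly into code (a sketch with a Gaussian kernel; the tolerance 1e-10 is an arbitrary numerical slack for round-off):

import numpy as np

def gram_matrix(X, k):
    """K_ij = k(x_i, x_j) for a sample X of n points."""
    n = X.shape[0]
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

gauss = lambda s, t, b=2.0: np.exp(-np.sum((s - t) ** 2) / b)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
K = gram_matrix(X, gauss)
eigvals = np.linalg.eigvalsh(K)            # K is symmetric, use the symmetric solver
print(eigvals.min() >= -1e-10)             # all eigenvalues are (numerically) non negative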
Examples of Gram matrices with different bandwidths
[figure: raw data and the Gram matrices obtained for b = 2, b = .5 and b = 10]
different points of view about kernels

kernel and scalar product
k(s, t) = ⟨φ(s), φ(t)⟩_H

kernel and distance
d(s, t)² = k(s, s) + k(t, t) − 2k(s, t)

kernel and covariance: a positive matrix is a covariance matrix
IP(f) = (1/Z) exp(−½ (f − f0)⊤K⁻¹(f − f0))
if f0 = 0 and f = Kα, then IP(α) = (1/Z) exp(−½ α⊤Kα)

kernel and regularity (Green's function)
k(s, t) = P*P δ_{s−t} for some operator P (e.g. some differential operator)
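The covariance point of view can be illustrated by sampling functions from the Gaussian prior IP(f) with the Gram matrix of a Gaussian kernel as covariance (a sketch, not from the slides; the grid, bandwidth and jitter term are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 5, 200)
b = 0.5
K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / b)        # Gram matrix on the grid
K += 1e-8 * np.eye(len(xs))                              # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
for f in samples:                                        # each f is one draw from N(0, K)
    plt.plot(xs, f)
plt.title("samples f ~ N(0, K): the kernel controls their regularity")
plt.show()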
Let’s summarize

positive kernels
  there are a lot of them
  they can be rather complex
  2 classes: radial / projective
  the bandwidth matters (more than the kernel itself)
  the Gram matrix summarizes the pairwise comparisons
Roadmap
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
From kernel to functions

H0 = { f | mf < ∞; fj ∈ IR; tj ∈ X, f(x) = Σ_{j=1}^{mf} fj k(x, tj) }

let us define the bilinear form (for g(x) = Σ_{i=1}^{mg} gi k(x, si)):

∀f, g ∈ H0,   ⟨f, g⟩_{H0} = Σ_{j=1}^{mf} Σ_{i=1}^{mg} fj gi k(tj, si)

Evaluation functional: ∀x ∈ X,   f(x) = ⟨f(•), k(x, •)⟩_{H0}

from k to H
for any positive kernel, a hypothesis set can be constructed: H, the completion of H0 for this metric
RKHS

Definition (reproducing kernel Hilbert space (RKHS))
a Hilbert space H endowed with the inner product ⟨•, •⟩_H is said to be with reproducing kernel
if there exists a positive kernel k such that
  ∀s ∈ X, k(•, s) ∈ H
  ∀f ∈ H, f(s) = ⟨f(•), k(s, •)⟩_H

Beware: f = f(•) is a function while f(s) is the real value of f at point s

positive kernel ⇔ RKHS
  any function in H is pointwise defined
  the kernel defines the inner product
  it defines the regularity (smoothness) of the hypothesis set

Exercise: let f(•) = Σ_{i=1}^n αi k(•, xi). Show that ‖f‖²_H = α⊤Kα
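A small numerical sketch of this construction (not from the slides; a Gaussian kernel is chosen for illustration): functions of H0 are stored as coefficients and centers, the bilinear form above is implemented literally, and both the reproducing property and the identity of the exercise are checked on random data.

import numpy as np

gauss = lambda s, t, b=1.0: np.exp(-np.sum((s - t) ** 2) / b)

def evaluate(coefs, centers, x):
    """Direct evaluation f(x) = sum_j f_j k(x, t_j)."""
    return sum(c * gauss(x, t) for c, t in zip(coefs, centers))

def inner(coefs_f, centers_f, coefs_g, centers_g):
    """<f, g>_H0 = sum_j sum_i f_j g_i k(t_j, s_i)."""
    return sum(cf * cg * gauss(tf, sg)
               for cf, tf in zip(coefs_f, centers_f)
               for cg, sg in zip(coefs_g, centers_g))

rng = np.random.default_rng(0)
centers = rng.standard_normal((5, 2))
coefs = rng.standard_normal(5)
x = np.array([0.3, -0.7])

# reproducing property: <f(.), k(x, .)>_H0 equals f(x)
print(inner(coefs, centers, [1.0], [x]), evaluate(coefs, centers, x))

# and the exercise: ||f||^2_H equals alpha^T K alpha
K = np.array([[gauss(ti, tj) for tj in centers] for ti in centers])
print(inner(coefs, centers, coefs, centers), coefs @ K @ coefs)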
Other kernels (what really matters)

finite kernels
  k(s, t) = (φ1(s), ..., φp(s))⊤(φ1(t), ..., φp(t))

Mercer kernels
  positive on a compact set ⇔ k(s, t) = Σ_{j=1}^p λj φj(s)φj(t)

positive kernels
  positive semi-definite

conditionally positive (for some functions pj)
  ∀{xi}i=1,n, ∀αi with Σ_{i=1}^n αi pj(xi) = 0, j = 1, ..., p:   Σ_{i=1}^n Σ_{j=1}^n αi αj k(xi, xj) ≥ 0

symmetric non positive
  k(s, t) = tanh(s⊤t + α0)

non symmetric – non positive
  the key property ∇Jt(f) = k(t, •) still holds

C. Ong et al., ICML, 2004
The kernel map

observation: x = (x1, . . . , xj, . . . , xd)
  f(x) = w⊤x = ⟨w, x⟩_{IR^d}

feature map: x −→ Φ(x) = (φ1(x), . . . , φj(x), . . . , φp(x)),   Φ : IR^d −→ IR^p
  f(x) = w⊤Φ(x) = ⟨w, Φ(x)⟩_{IR^p}

kernel dictionary: x −→ k(x) = (k(x, x1), . . . , k(x, xi), . . . , k(x, xn)),   k : IR^d −→ IR^n
  f(x) = Σ_{i=1}^n αi k(x, xi) = ⟨α, k(x)⟩_{IR^n}

kernel map: x −→ k(•, x)   (p = ∞)
  f(x) = ⟨f(•), k(•, x)⟩_H
Roadmap
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
Functional differentiation in RKHS

Let J be a functional
J : H → IR
    f → J(f)
examples: J1(f) = ‖f‖²_H,  J2(f) = f(x)

directional derivative of J in direction g at point f
dJ(f, g) = lim_{ε→0} ( J(f + εg) − J(f) ) / ε

Gradient ∇J(f)
∇J : H → H
     f → ∇J(f)   such that   dJ(f, g) = ⟨∇J(f), g⟩_H

exercise: find ∇J1(f) and ∇J2(f)
Hint
dJ(f, g) = (d/dε) J(f + εg) |_{ε=0}
Solution

dJ1(f, g) = lim_{ε→0} ( ‖f + εg‖² − ‖f‖² ) / ε
          = lim_{ε→0} ( ‖f‖² + ε²‖g‖² + 2ε⟨f, g⟩_H − ‖f‖² ) / ε
          = lim_{ε→0} ( ε‖g‖² + 2⟨f, g⟩_H )
          = ⟨2f, g⟩_H
⇔ ∇J1(f) = 2f

dJ2(f, g) = lim_{ε→0} ( f(x) + εg(x) − f(x) ) / ε
          = g(x)
          = ⟨k(x, •), g⟩_H
⇔ ∇J2(f) = k(x, •)

Minimize_{f∈H} J(f) ⇔ ∀g ∈ H, dJ(f, g) = 0 ⇔ ∇J(f) = 0
Subdifferential in a RKHS H

Definition (Subgradient)
a subgradient of J : H −→ IR at f0 is any function g ∈ H such that
∀f ∈ V(f0) (a neighborhood of f0),  J(f) ≥ J(f0) + ⟨g, f − f0⟩_H

Definition (Subdifferential)
∂J(f), the subdifferential of J at f, is the set of all subgradients of J at f.

examples in H = IR:
  J3(x) = |x|             ∂J3(0) = {g ∈ IR | −1 ≤ g ≤ 1}
  J4(x) = max(0, 1 − x)   ∂J4(1) = {g ∈ IR | −1 ≤ g ≤ 0}

Theorem (Chain rule for linear Subdifferential)
Let T be a linear operator H −→ IR and ϕ a function from IR to IR.
If J(f) = ϕ(Tf), then ∂J(f) = {T*g | g ∈ ∂ϕ(Tf)}, where T* denotes T's adjoint operator
example of subdifferential in H

the evaluation operator and its adjoint
T : H −→ IR^n,    f −→ Tf = (f(x1), . . . , f(xn))
T* : IR^n −→ H,   α −→ T*α = Σ_{i=1}^n αi k(•, xi)

build the adjoint from ⟨Tf, α⟩_{IR^n} = ⟨f, T*α⟩_H :

⟨Tf, α⟩_{IR^n} = Σ_{i=1}^n f(xi) αi
               = Σ_{i=1}^n ⟨f(•), k(•, xi)⟩_H αi
               = ⟨ f(•), Σ_{i=1}^n αi k(•, xi) ⟩_H,   the second factor being T*α

TT* : IR^n −→ IR^n,   α −→ TT*α = ( Σ_{j=1}^n αj k(xj, xi) )i=1,n = Kα

Example of subdifferentials
for a given x,  J5(f) = |f(x)|:             ∂J5(f0) = { g(•) = αk(•, x) ; −1 ≤ α ≤ 1 }
for a given x,  J6(f) = max(0, 1 − f(x)):   ∂J6(f1) = { g(•) = αk(•, x) ; −1 ≤ α ≤ 0 }
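These operators can be checked numerically (a sketch, not from the slides, with a Gaussian kernel; the points, centers and coefficients are random): the adjoint identity ⟨Tf, α⟩_{IR^n} = ⟨f, T*α⟩_H and TT*α = Kα both hold up to round-off.

import numpy as np

gauss = lambda s, t, b=1.0: np.exp(-np.sum((s - t) ** 2) / b)

def evaluate(coefs, centers, x):
    """Evaluate the function sum_j coefs_j k(., centers_j) at the point x."""
    return sum(c * gauss(x, t) for c, t in zip(coefs, centers))

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))               # the points x_1, ..., x_n defining T
centers = rng.standard_normal((4, 2))         # centers t_j of a function f in H
beta = rng.standard_normal(4)                 # coefficients of f
alpha = rng.standard_normal(6)

K = np.array([[gauss(xi, xj) for xj in X] for xi in X])    # Gram matrix K

# <Tf, alpha>_{IR^n} versus <f, T* alpha>_H = sum_{j,i} beta_j alpha_i k(t_j, x_i)
Tf = np.array([evaluate(beta, centers, xi) for xi in X])
lhs = Tf @ alpha
rhs = sum(b * a * gauss(t, x) for b, t in zip(beta, centers) for a, x in zip(alpha, X))
print(np.isclose(lhs, rhs))

# T* alpha is the function sum_i alpha_i k(., x_i); applying T to it gives TT* alpha = K alpha
TTalpha = np.array([evaluate(alpha, X, xi) for xi in X])
print(np.allclose(TTalpha, K @ alpha))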
Optimal conditions

Theorem (Fermat optimality criterion)
When J(f) is convex, f is a stationary point of the problem min_{f∈H} J(f)
if and only if 0 ∈ ∂J(f)

[figure: a convex functional J and its subdifferential ∂J(f) at the minimum]

exercise (from Obozinski): for a given y ∈ IR, solve
min_{x∈IR} ½(x − y)² + λ|x|
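A numerical companion to the exercise (not from the slides): applying Fermat's rule, 0 ∈ (x − y) + λ ∂|x|, leads to the well-known soft-thresholding operator, which the sketch below compares against a brute-force grid search.

import numpy as np

def soft_threshold(y, lam):
    """Candidate minimiser of 1/2 (x - y)^2 + lam |x| obtained from Fermat's rule."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def grid_argmin(y, lam):
    """Brute-force check: minimise the objective on a fine grid."""
    xs = np.linspace(-10, 10, 200001)
    return xs[np.argmin(0.5 * (xs - y) ** 2 + lam * np.abs(xs))]

for y, lam in [(3.0, 1.0), (0.4, 1.0), (-2.0, 0.5)]:
    print(y, lam, soft_threshold(y, lam), grid_argmin(y, lam))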
Let’s summarize

positive kernels ⇔ RKHS H ⇔ regularity measured by ‖f‖²_H
the key property ∇Jt(f) = k(t, •) holds, and not only for positive kernels
f(xi) exists (pointwise defined functions)
universal consistency in RKHS
the Gram matrix summarizes the pairwise comparisons
