Lecture 4: kernels and associated functions
Stéphane Canu
stephane.canu@litislab.eu
Sao Paulo 2014
March 4, 2014
Plan
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
Introducing non linearities through the feature map

f(x) = Σ_{j=1}^d xj wj + b = Σ_{i=1}^n αi (xi⊤x) + b

a point t = (t1, t2) ∈ IR² is mapped to x = φ(t) = (t1, t1², t2, t2², t1t2) ∈ IR⁵:
the model is linear in x ∈ IR⁵ and quadratic in t ∈ IR²

The feature map
φ : IR² −→ IR⁵
    t −→ φ(t) = x,   so that xi⊤x = φ(ti)⊤φ(t)
Introducing non linearities through the feature map
A. Lorena & A. de Carvalho, Uma Introducão às Support Vector Machines, 2007
Non linear case: dictionary vs. kernel

in the non linear case, use a dictionary of functions φj(x), j = 1, ..., p, with possibly p = ∞,
for instance polynomials, wavelets...

f(x) = Σ_{j=1}^p wj φj(x)   with   wj = Σ_{i=1}^n αi yi φj(xi)

so that

f(x) = Σ_{i=1}^n αi yi Σ_{j=1}^p φj(xi)φj(x),   where the inner sum is the kernel k(xi, x)

even if p ≥ n (or p = ∞), so what, since only k(xi, x) = Σ_{j=1}^p φj(xi)φj(x) is needed
closed form kernel: the quadratic kernel

The quadratic dictionary in IR^d:
Φ : IR^d → IR^p, p = 1 + d + d(d+1)/2
    s → Φ(s) = (1, s1, s2, ..., sd, s1², s2², ..., sd², ..., si sj, ...)

in this case
Φ(s)⊤Φ(t) = 1 + s1t1 + s2t2 + ... + sd td + s1²t1² + ... + sd²td² + ... + si sj ti tj + ...

The quadratic kernel: s, t ∈ IR^d, k(s, t) = (s⊤t + 1)² = 1 + 2 s⊤t + (s⊤t)²
computes the dot product of the reweighted dictionary:
Φ : IR^d → IR^p, p = 1 + d + d(d+1)/2
    s → Φ(s) = (1, √2 s1, √2 s2, ..., √2 sd, s1², s2², ..., sd², ..., √2 si sj, ...)

p = 1 + d + d(d+1)/2 multiplications vs. d + 1: use the kernel to save computation
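To make the computational point concrete, here is a small Python/NumPy sketch (not part of the original slides): it builds the reweighted quadratic dictionary explicitly and checks that its dot product coincides with the closed form (s⊤t + 1)². For d = 5 the explicit map already has p = 21 components, while the closed form only needs the d + 1 operations of s⊤t plus one square.

import numpy as np

def quad_features(s):
    """Explicit reweighted quadratic feature map, p = 1 + d + d(d+1)/2 components."""
    d = s.shape[0]
    feats = [1.0]
    feats += list(np.sqrt(2.0) * s)                       # √2 si
    feats += list(s ** 2)                                 # si²
    feats += [np.sqrt(2.0) * s[i] * s[j]                  # √2 si sj, i < j
              for i in range(d) for j in range(i + 1, d)]
    return np.array(feats)

def quad_kernel(s, t):
    """Closed-form quadratic kernel: (s.t + 1)^2."""
    return (s @ t + 1.0) ** 2

rng = np.random.default_rng(0)
s, t = rng.standard_normal(5), rng.standard_normal(5)
print(quad_features(s) @ quad_features(t))   # explicit dictionary gives the same value...
print(quad_kernel(s, t))                     # ...as the closed form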
kernel: features through pairwise comparisons

instead of storing the n × p feature matrix Φ (one row φ(x) per example, e.g. a bag of words
for a text x), work with the n × n matrix of pairwise comparisons

k(xi, xj) = Σ_{ℓ=1}^p φℓ(xi)φℓ(xj)

K: the matrix of pairwise comparisons (O(n²))
Kernel machine

kernel as a dictionary
f(x) = Σ_{i=1}^n αi k(x, xi)

αi: the influence of example i, depends on yi
k(x, xi): the kernel, does NOT depend on yi

Definition (Kernel)
Let X be a non empty set (the input space).
A kernel is a function k from X × X to IR.
k : X × X −→ IR
    s, t −→ k(s, t)

semi-parametric version: given the family qj(x), j = 1, ..., p
f(x) = Σ_{i=1}^n αi k(x, xi) + Σ_{j=1}^p βj qj(x)
Kernel Machine

Definition (Kernel machines)
A_{(xi, yi)i=1,n}(x) = ψ( Σ_{i=1}^n αi k(x, xi) + Σ_{j=1}^p βj qj(x) )
α and β: parameters to be estimated.

Examples
A(x) = Σ_{i=1}^n αi (x − xi)³₊ + β0 + β1 x                          splines
A(x) = sign( Σ_{i∈I} αi exp(−‖x − xi‖²/b) + β0 )                    SVM
IP(y|x) = (1/Z) exp( Σ_{i∈I} αi 1I{y=yi} (x⊤xi + b)² )              exponential family
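As an illustration of how such a machine is evaluated (a sketch, not from the slides: the Gaussian kernel matches the SVM example above, but the coefficients αi and β0 below are arbitrary placeholders rather than fitted values):

import numpy as np

def gaussian_kernel(s, t, b=1.0):
    """k(s, t) = exp(-||s - t||^2 / b), the radial kernel of the SVM example."""
    return np.exp(-np.sum((s - t) ** 2) / b)

def kernel_machine(x, X_train, alpha, beta0=0.0, b=1.0):
    """f(x) = sum_i alpha_i k(x, x_i) + beta0 ; sign(f) gives the SVM-style decision."""
    return sum(a * gaussian_kernel(x, xi, b) for a, xi in zip(alpha, X_train)) + beta0

rng = np.random.default_rng(0)
X_train = rng.standard_normal((10, 2))      # 10 training points in IR^2
alpha = rng.standard_normal(10)             # placeholder coefficients (not fitted)
x_new = np.zeros(2)
print(np.sign(kernel_machine(x_new, X_train, alpha)))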
Plan
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
In the beginning was the kernel...

Definition (Kernel)
a function of two variables k from X × X to IR

Definition (Positive kernel)
A kernel k(s, t) on X is said to be positive
  if it is symmetric: k(s, t) = k(t, s)
  and if for any finite positive integer n:
  ∀{αi}i=1,n ∈ IR, ∀{xi}i=1,n ∈ X,   Σ_{i=1}^n Σ_{j=1}^n αi αj k(xi, xj) ≥ 0

it is strictly positive if, for any (αi)i=1,n not all zero,
  Σ_{i=1}^n Σ_{j=1}^n αi αj k(xi, xj) > 0
Examples of positive kernels

the linear kernel: s, t ∈ IR^d, k(s, t) = s⊤t
  symmetric: s⊤t = t⊤s
  positive:
  Σ_{i=1}^n Σ_{j=1}^n αi αj k(xi, xj) = Σ_{i=1}^n Σ_{j=1}^n αi αj xi⊤xj
                                      = (Σ_{i=1}^n αi xi)⊤(Σ_{j=1}^n αj xj) = ‖ Σ_{i=1}^n αi xi ‖²

the product kernel: k(s, t) = g(s)g(t) for some g : IR^d → IR
  symmetric by construction
  positive:
  Σ_{i=1}^n Σ_{j=1}^n αi αj k(xi, xj) = Σ_{i=1}^n Σ_{j=1}^n αi αj g(xi)g(xj)
                                      = (Σ_{i=1}^n αi g(xi)) (Σ_{j=1}^n αj g(xj)) = (Σ_{i=1}^n αi g(xi))²

k is positive ⇔ (its square root exists) ⇔ k(s, t) = ⟨φs, φt⟩

J.P. Vert, 2006
Example: finite kernel

let φj, j = 1, ..., p be a finite dictionary of functions from X to IR (polynomials, wavelets...)

the feature map and linear kernel
feature map:
Φ : X → IR^p
    s → Φ(s) = (φ1(s), ..., φp(s))

Linear kernel in the feature space:
k(s, t) = (φ1(s), ..., φp(s))⊤(φ1(t), ..., φp(t))

e.g. the quadratic kernel: s, t ∈ IR^d, k(s, t) = (s⊤t + b)²
feature map:
Φ : IR^d → IR^p, p = 1 + d + d(d+1)/2
    s → Φ(s) = (1, √2 s1, ..., √2 sj, ..., √2 sd, s1², ..., sj², ..., sd², ..., √2 si sj, ...)
Positive definite Kernel (PDK) algebra (closure)

if k1(s, t) and k2(s, t) are two positive kernels
  PDK form a convex cone: ∀a1 ∈ IR+,  a1 k1(s, t) + k2(s, t) is a PDK
  the product kernel k1(s, t)k2(s, t) is a PDK

proofs
  by linearity:
  Σ_{i=1}^n Σ_{j=1}^n αi αj (a1 k1(i, j) + k2(i, j)) = a1 Σ_{i=1}^n Σ_{j=1}^n αi αj k1(i, j) + Σ_{i=1}^n Σ_{j=1}^n αi αj k2(i, j)

  assuming ∃ψ s.t. k1(s, t) = ψ(s)ψ(t):
  Σ_{i=1}^n Σ_{j=1}^n αi αj k1(xi, xj)k2(xi, xj) = Σ_{i=1}^n Σ_{j=1}^n αi αj ψ(xi)ψ(xj) k2(xi, xj)
                                                 = Σ_{i=1}^n Σ_{j=1}^n (αi ψ(xi)) (αj ψ(xj)) k2(xi, xj) ≥ 0

N. Cristianini and J. Shawe-Taylor, Kernel Methods for Pattern Analysis, 2004
Kernel engineering: building PDK

if k(s, t) is a PDK:
  for any polynomial φ with positive coefficients from IR to IR, φ(k(s, t)) is a PDK
  if Ψ is a function from IR^d to IR^d, k(Ψ(s), Ψ(t)) is a PDK
  if ϕ from IR^d to IR+ has its minimum in 0, then k(s, t) = ϕ(s + t) − ϕ(s − t) is a PDK
  the convolution of two positive kernels is a positive kernel: K1 ⋆ K2

Example: the Gaussian kernel is a PDK
exp(−‖s − t‖²) = exp(−‖s‖² − ‖t‖² + 2 s⊤t)
               = exp(−‖s‖²) exp(−‖t‖²) exp(2 s⊤t)
  s⊤t is a PDK and exp is the limit of a series expansion with positive coefficients, so exp(2 s⊤t) is a PDK
  exp(−‖s‖²) exp(−‖t‖²) is a PDK as a product kernel
  the product of two PDKs is a PDK
an attempt at classifying PD kernels

stationary kernels (also called translation invariant): k(s, t) = ks(s − t)
  radial (isotropic)     gaussian: exp(−r²/b), r = ‖s − t‖
  with compact support   c.s. Matérn: max(0, 1 − r/b)^κ (r/b)^k Bk(r/b),  κ ≥ (d + 1)/2

locally stationary kernels: k(s, t) = k1(s + t) ks(s − t)
  where k1 is a non negative function and ks a radial kernel

non stationary (projective) kernels: k(s, t) = kp(s⊤t)

separable kernels: k(s, t) = k1(s)k2(t) with k1 and k2 PDK
  in this case K = k1 k2⊤ where k1 = (k1(x1), ..., k1(xn))⊤

MG Genton, Classes of Kernels for Machine Learning: A Statistics Perspective, JMLR, 2002
some examples of PD kernels...

type        name         k(s, t)
radial      gaussian     exp(−r²/b),  r = ‖s − t‖
radial      laplacian    exp(−r/b)
radial      rational     1 − r²/(r² + b)
radial      loc. gauss.  max(0, 1 − r/(3b))^d exp(−r²/b)
non stat.   χ²           exp(−r/b),  r = Σ_k (sk − tk)²/(sk + tk)
projective  polynomial   (s⊤t)^p
projective  affine       (s⊤t + b)^p
projective  cosine       s⊤t / (‖s‖ ‖t‖)
projective  correlation  exp( s⊤t/(‖s‖ ‖t‖) − b )

Most of these kernels depend on a quantity b called the bandwidth
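A few of the kernels from the table, written out in Python as a sketch (the formulas follow the table above; the function names and default bandwidths are ad hoc choices, not a reference implementation):

import numpy as np

def gaussian(s, t, b=1.0):
    r2 = np.sum((s - t) ** 2)
    return np.exp(-r2 / b)                       # radial: exp(-r^2/b)

def laplacian(s, t, b=1.0):
    r = np.linalg.norm(s - t)
    return np.exp(-r / b)                        # radial: exp(-r/b)

def polynomial(s, t, p=2):
    return (s @ t) ** p                          # projective: (s.t)^p

def affine(s, t, b=1.0, p=2):
    return (s @ t + b) ** p                      # projective: (s.t + b)^p

def cosine(s, t):
    return (s @ t) / (np.linalg.norm(s) * np.linalg.norm(t))   # projective

s, t = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(gaussian(s, t), laplacian(s, t), polynomial(s, t), affine(s, t), cosine(s, t))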
the importance of the Kernel bandwidth

for the affine Kernel: Bandwidth = bias
k(s, t) = (s⊤t + b)^p = b^p (s⊤t/b + 1)^p

for the gaussian Kernel: Bandwidth = influence zone
k(s, t) = (1/Z) exp(−‖s − t‖²/(2σ²)),   b = 2σ²

Illustration: 1d density estimation with b = 1 and with b = 2
  + data (x1, x2, ..., xn)
  – Parzen estimate IP(x) = (1/Z) Σ_{i=1}^n k(x, xi)
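The illustration can be reproduced with a short sketch (assuming a Gaussian kernel and a synthetic two-bump sample; this is illustrative code, not the script behind the original figure):

import numpy as np
import matplotlib.pyplot as plt

def parzen(x_grid, data, b=1.0):
    """Parzen estimate IP(x) = (1/Z) sum_i k(x, x_i) with a gaussian kernel of bandwidth b."""
    k = np.exp(-(x_grid[:, None] - data[None, :]) ** 2 / b)   # n_grid x n kernel values
    dens = k.sum(axis=1)
    return dens / np.trapz(dens, x_grid)                      # normalise so it integrates to 1

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 0.5, 50)])
xs = np.linspace(-6, 6, 400)
for b in (1.0, 2.0):                                          # the two bandwidths of the illustration
    plt.plot(xs, parzen(xs, data, b), label=f"b = {b}")
plt.plot(data, np.zeros_like(data), "+", label="data")
plt.legend(); plt.show()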
kernels for objects and structures

kernels on histograms and probability distributions

kernels on strings
  spectral string kernel: k(s, t) = Σ_u φu(s)φu(t), using subsequences u
  similarities by alignments: k(s, t) = Σ_π exp(β(s, t, π))

kernels on graphs
  the pseudo inverse of the (regularized) graph Laplacian L = D − A
    (A the adjacency matrix, D the degree matrix)
  diffusion kernels: (1/Z(b)) exp(bL)
  subgraph kernels by convolution (using random walks)

and kernels on HMM, automata, dynamical systems...

Shawe-Taylor & Cristianini’s book, 2004; JP Vert, 2006
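As one concrete case, here is a minimal sketch of the spectral string kernel restricted to contiguous substrings of a fixed length p (a common simplification, often called the p-spectrum kernel; the choice p = 2 and the example strings are arbitrary):

from collections import Counter

def spectrum_features(s, p=2):
    """phi_u(s): number of occurrences of each length-p substring u of s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p=2):
    """k(s, t) = sum_u phi_u(s) phi_u(t), summing over shared substrings only."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    return sum(cnt * ft[u] for u, cnt in fs.items())

print(spectrum_kernel("gattaca", "tacata", p=2))   # counts shared 2-grams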
Multiple kernel
M. Cuturi, Positive Definite Kernels in Machine Learning, 2009
Gram matrix

Definition (Gram matrix)
let k(s, t) be a positive kernel on X and (xi)i=1,n a sequence on X. The Gram matrix is the
square matrix K of dimension n with general term Kij = k(xi, xj).

practical trick to check kernel positivity:
K is positive ⇔ its eigenvalues λi are positive: if K ui = λi ui, i = 1, ..., n, then
ui⊤K ui = λi ui⊤ui = λi

matrix K is the one to be used
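The practical trick translates directly into code (a sketch with a Gaussian kernel; the tolerance 1e-10 is an arbitrary numerical slack for round-off):

import numpy as np

def gram_matrix(X, k):
    """K_ij = k(x_i, x_j) for a sample X of n points."""
    n = X.shape[0]
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

gauss = lambda s, t, b=2.0: np.exp(-np.sum((s - t) ** 2) / b)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
K = gram_matrix(X, gauss)
eigvals = np.linalg.eigvalsh(K)            # K is symmetric, use the symmetric solver
print(eigvals.min() >= -1e-10)             # all eigenvalues are (numerically) non negative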
Examples of Gram matrices with different bandwidths
[figure: raw data and the Gram matrices obtained for b = 2, b = .5 and b = 10]
different points of view about kernels

kernel and scalar product
k(s, t) = ⟨φ(s), φ(t)⟩_H

kernel and distance
d(s, t)² = k(s, s) + k(t, t) − 2k(s, t)

kernel and covariance: a positive matrix is a covariance matrix
IP(f) = (1/Z) exp(−½ (f − f0)⊤K⁻¹(f − f0))
if f0 = 0 and f = Kα, then IP(α) = (1/Z) exp(−½ α⊤Kα)

kernel and regularity (Green's function)
k(s, t) = P*P δ_{s−t} for some operator P (e.g. some differential operator)
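The covariance point of view can be illustrated by sampling functions from the Gaussian prior IP(f) with the Gram matrix of a Gaussian kernel as covariance (a sketch, not from the slides; the grid, bandwidth and jitter term are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 5, 200)
b = 0.5
K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / b)        # Gram matrix on the grid
K += 1e-8 * np.eye(len(xs))                              # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
for f in samples:                                        # each f is one draw from N(0, K)
    plt.plot(xs, f)
plt.title("samples f ~ N(0, K): the kernel controls their regularity")
plt.show()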
Let’s summarize

positive kernels
  there are a lot of them
  they can be rather complex
  2 classes: radial / projective
  the bandwidth matters (more than the kernel itself)
  the Gram matrix summarizes the pairwise comparisons
Roadmap
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
From kernel to functions

H0 = { f | mf < ∞; fj ∈ IR; tj ∈ X, f(x) = Σ_{j=1}^{mf} fj k(x, tj) }

let us define the bilinear form (for g(x) = Σ_{i=1}^{mg} gi k(x, si)):

∀f, g ∈ H0,   ⟨f, g⟩_{H0} = Σ_{j=1}^{mf} Σ_{i=1}^{mg} fj gi k(tj, si)

Evaluation functional: ∀x ∈ X,   f(x) = ⟨f(•), k(x, •)⟩_{H0}

from k to H
for any positive kernel, a hypothesis set can be constructed: H, the completion of H0 for this metric
RKHS

Definition (reproducing kernel Hilbert space (RKHS))
a Hilbert space H endowed with the inner product ⟨•, •⟩_H is said to be with reproducing kernel
if there exists a positive kernel k such that
  ∀s ∈ X, k(•, s) ∈ H
  ∀f ∈ H, f(s) = ⟨f(•), k(s, •)⟩_H

Beware: f = f(•) is a function while f(s) is the real value of f at point s

positive kernel ⇔ RKHS
  any function in H is pointwise defined
  the kernel defines the inner product
  it defines the regularity (smoothness) of the hypothesis set

Exercise: let f(•) = Σ_{i=1}^n αi k(•, xi). Show that ‖f‖²_H = α⊤Kα
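A small numerical sketch of this construction (not from the slides; a Gaussian kernel is chosen for illustration): functions of H0 are stored as coefficients and centers, the bilinear form above is implemented literally, and both the reproducing property and the identity of the exercise are checked on random data.

import numpy as np

gauss = lambda s, t, b=1.0: np.exp(-np.sum((s - t) ** 2) / b)

def evaluate(coefs, centers, x):
    """Direct evaluation f(x) = sum_j f_j k(x, t_j)."""
    return sum(c * gauss(x, t) for c, t in zip(coefs, centers))

def inner(coefs_f, centers_f, coefs_g, centers_g):
    """<f, g>_H0 = sum_j sum_i f_j g_i k(t_j, s_i)."""
    return sum(cf * cg * gauss(tf, sg)
               for cf, tf in zip(coefs_f, centers_f)
               for cg, sg in zip(coefs_g, centers_g))

rng = np.random.default_rng(0)
centers = rng.standard_normal((5, 2))
coefs = rng.standard_normal(5)
x = np.array([0.3, -0.7])

# reproducing property: <f(.), k(x, .)>_H0 equals f(x)
print(inner(coefs, centers, [1.0], [x]), evaluate(coefs, centers, x))

# and the exercise: ||f||^2_H equals alpha^T K alpha
K = np.array([[gauss(ti, tj) for tj in centers] for ti in centers])
print(inner(coefs, centers, coefs, centers), coefs @ K @ coefs)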
Other kernels (what really matters)

finite kernels
  k(s, t) = (φ1(s), ..., φp(s))⊤(φ1(t), ..., φp(t))

Mercer kernels
  positive on a compact set ⇔ k(s, t) = Σ_{j=1}^p λj φj(s)φj(t)

positive kernels
  positive semi-definite

conditionally positive (for some functions pj)
  ∀{xi}i=1,n, ∀αi with Σ_{i=1}^n αi pj(xi) = 0, j = 1, ..., p:   Σ_{i=1}^n Σ_{j=1}^n αi αj k(xi, xj) ≥ 0

symmetric non positive
  k(s, t) = tanh(s⊤t + α0)

non symmetric – non positive
  the key property ∇Jt(f) = k(t, •) still holds

C. Ong et al., ICML, 2004
The kernel map

observation: x = (x1, . . . , xj, . . . , xd)
  f(x) = w⊤x = ⟨w, x⟩_{IR^d}

feature map: x −→ Φ(x) = (φ1(x), . . . , φj(x), . . . , φp(x)),   Φ : IR^d −→ IR^p
  f(x) = w⊤Φ(x) = ⟨w, Φ(x)⟩_{IR^p}

kernel dictionary: x −→ k(x) = (k(x, x1), . . . , k(x, xi), . . . , k(x, xn)),   k : IR^d −→ IR^n
  f(x) = Σ_{i=1}^n αi k(x, xi) = ⟨α, k(x)⟩_{IR^n}

kernel map: x −→ k(•, x)   (p = ∞)
  f(x) = ⟨f(•), k(•, x)⟩_H
Roadmap
1 Statistical learning and kernels
Kernel machines
Kernels
Kernel and hypothesis set
Functional differentiation in RKHS
Functional differentiation in RKHS

Let J be a functional
J : H → IR
    f → J(f)
examples: J1(f) = ‖f‖²_H,  J2(f) = f(x)

directional derivative of J in direction g at point f
dJ(f, g) = lim_{ε→0} ( J(f + εg) − J(f) ) / ε

Gradient ∇J(f)
∇J : H → H
     f → ∇J(f)   such that   dJ(f, g) = ⟨∇J(f), g⟩_H

exercise: find ∇J1(f) and ∇J2(f)
Hint
dJ(f, g) = (d/dε) J(f + εg) |_{ε=0}
Solution

dJ1(f, g) = lim_{ε→0} ( ‖f + εg‖² − ‖f‖² ) / ε
          = lim_{ε→0} ( ‖f‖² + ε²‖g‖² + 2ε⟨f, g⟩_H − ‖f‖² ) / ε
          = lim_{ε→0} ( ε‖g‖² + 2⟨f, g⟩_H )
          = ⟨2f, g⟩_H
⇔ ∇J1(f) = 2f

dJ2(f, g) = lim_{ε→0} ( f(x) + εg(x) − f(x) ) / ε
          = g(x)
          = ⟨k(x, •), g⟩_H
⇔ ∇J2(f) = k(x, •)

Minimize_{f∈H} J(f) ⇔ ∀g ∈ H, dJ(f, g) = 0 ⇔ ∇J(f) = 0
Subdifferential in a RKHS H

Definition (Subgradient)
a subgradient of J : H −→ IR at f0 is any function g ∈ H such that
∀f ∈ V(f0) (a neighborhood of f0),  J(f) ≥ J(f0) + ⟨g, f − f0⟩_H

Definition (Subdifferential)
∂J(f), the subdifferential of J at f, is the set of all subgradients of J at f.

examples in H = IR:
  J3(x) = |x|             ∂J3(0) = {g ∈ IR | −1 ≤ g ≤ 1}
  J4(x) = max(0, 1 − x)   ∂J4(1) = {g ∈ IR | −1 ≤ g ≤ 0}

Theorem (Chain rule for linear Subdifferential)
Let T be a linear operator H −→ IR and ϕ a function from IR to IR.
If J(f) = ϕ(Tf), then ∂J(f) = {T*g | g ∈ ∂ϕ(Tf)}, where T* denotes T's adjoint operator
example of subdifferential in H

the evaluation operator and its adjoint
T : H −→ IR^n,    f −→ Tf = (f(x1), . . . , f(xn))
T* : IR^n −→ H,   α −→ T*α = Σ_{i=1}^n αi k(•, xi)

build the adjoint from ⟨Tf, α⟩_{IR^n} = ⟨f, T*α⟩_H :

⟨Tf, α⟩_{IR^n} = Σ_{i=1}^n f(xi) αi
               = Σ_{i=1}^n ⟨f(•), k(•, xi)⟩_H αi
               = ⟨ f(•), Σ_{i=1}^n αi k(•, xi) ⟩_H,   the second factor being T*α

TT* : IR^n −→ IR^n,   α −→ TT*α = ( Σ_{j=1}^n αj k(xj, xi) )i=1,n = Kα

Example of subdifferentials
for a given x,  J5(f) = |f(x)|:             ∂J5(f0) = { g(•) = αk(•, x) ; −1 ≤ α ≤ 1 }
for a given x,  J6(f) = max(0, 1 − f(x)):   ∂J6(f1) = { g(•) = αk(•, x) ; −1 ≤ α ≤ 0 }
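These operators can be checked numerically (a sketch, not from the slides, with a Gaussian kernel; the points, centers and coefficients are random): the adjoint identity ⟨Tf, α⟩_{IR^n} = ⟨f, T*α⟩_H and TT*α = Kα both hold up to round-off.

import numpy as np

gauss = lambda s, t, b=1.0: np.exp(-np.sum((s - t) ** 2) / b)

def evaluate(coefs, centers, x):
    """Evaluate the function sum_j coefs_j k(., centers_j) at the point x."""
    return sum(c * gauss(x, t) for c, t in zip(coefs, centers))

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))               # the points x_1, ..., x_n defining T
centers = rng.standard_normal((4, 2))         # centers t_j of a function f in H
beta = rng.standard_normal(4)                 # coefficients of f
alpha = rng.standard_normal(6)

K = np.array([[gauss(xi, xj) for xj in X] for xi in X])    # Gram matrix K

# <Tf, alpha>_{IR^n} versus <f, T* alpha>_H = sum_{j,i} beta_j alpha_i k(t_j, x_i)
Tf = np.array([evaluate(beta, centers, xi) for xi in X])
lhs = Tf @ alpha
rhs = sum(b * a * gauss(t, x) for b, t in zip(beta, centers) for a, x in zip(alpha, X))
print(np.isclose(lhs, rhs))

# T* alpha is the function sum_i alpha_i k(., x_i); applying T to it gives TT* alpha = K alpha
TTalpha = np.array([evaluate(alpha, X, xi) for xi in X])
print(np.allclose(TTalpha, K @ alpha))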
Optimal conditions

Theorem (Fermat optimality criterion)
When J(f) is convex, f is a stationary point of the problem min_{f∈H} J(f)
if and only if 0 ∈ ∂J(f)

[figure: a convex functional J and its subdifferential ∂J(f) at the minimum]

exercise (from Obozinski): for a given y ∈ IR, solve
min_{x∈IR} ½(x − y)² + λ|x|
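A numerical companion to the exercise (not from the slides): applying Fermat's rule, 0 ∈ (x − y) + λ ∂|x|, leads to the well-known soft-thresholding operator, which the sketch below compares against a brute-force grid search.

import numpy as np

def soft_threshold(y, lam):
    """Candidate minimiser of 1/2 (x - y)^2 + lam |x| obtained from Fermat's rule."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def grid_argmin(y, lam):
    """Brute-force check: minimise the objective on a fine grid."""
    xs = np.linspace(-10, 10, 200001)
    return xs[np.argmin(0.5 * (xs - y) ** 2 + lam * np.abs(xs))]

for y, lam in [(3.0, 1.0), (0.4, 1.0), (-2.0, 0.5)]:
    print(y, lam, soft_threshold(y, lam), grid_argmin(y, lam))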
Let’s summarize

positive kernels ⇔ RKHS H ⇔ regularity measured by ‖f‖²_H
the key property ∇Jt(f) = k(t, •) holds, and not only for positive kernels
f(xi) exists (pointwise defined functions)
universal consistency in RKHS
the Gram matrix summarizes the pairwise comparisons
