ELEG 867 - Compressive Sensing and Sparse Signal Representations
Gonzalo R. Arce
Department of Electrical and Computer Engineering
University of Delaware
Fall 2011
Outline
Introduction and Motivation
Vector Spaces and the Nyquist-Shannon Sampling Theorem
Sparsity and the ℓ1 Norm
Sparse Signal Representation
Compressed Sensing encompasses exciting and surprising
developments in signal processing resulting from sparse
representations.
It is about the interplay between sparsity and signal recovery. Roots
trace back to †
Mathematics and harmonic analysis
Physical sciences and geophysics
Vision
Optimization and computational tools
This course describes this fascinating topic and the tools needed in its
applications.
†
D. Donoho, ”Scanning the Technology,” Proceedings of the IEEE. Vol. 98, No. 6,
June 2010
Shannon-Nyquist Sampling Theorem
The Shannon-Nyquist Theorem: sampling frequency of an analog
signal must be greater than twice the highest frequency of the signal in
order to perfectly reconstruct the original signal from the sampled
version.
Theorem
If a function f(t) contains no frequencies higher than W cps, it is completely determined by giving its ordinates at a series of points spaced 1/(2W) seconds apart.†
† C. E. Shannon. ”Communication in the presence of noise.” Proceedings of the IRE, Vol. 37, no.1, pp.10-21, Jan.1949.
H. Nyquist. ”Certain topics in telegraph transmission theory.” Trans. AIEE, vol.47, pp.617-644, Apr.1928.
Traditional signal sampling and signal compression.
Nyquist sampling rate gives exact reconstruction.
Pessimistic for some types of signals!
Sampling and Compression
Transform data and keep important coefficients.
[Figures: original image and its biorthogonal spline wavelet transform]
Sampling and Compression
A lot of work just to throw away the majority of the data!
e.g., JPEG 2000 lossy compression: a digital camera captures millions of pixels, but the picture is encoded in only a few hundred kilobytes.
[Figures: original image and its wavelet transform]
Problem: Recent applications require a very large number of samples:
Higher resolution in medical imaging devices, cameras, etc.
Spectral imaging, confocal microscopy, radar arrays, etc.
[Figures: medical imaging; spectral imaging data cube with spatial axes x, y and spectral axis λ]
Sampling and Compressive Sensing
Donoho†, Candès‡, Romberg, and Tao discovered important results on the minimum amount of data needed to reconstruct a signal.
Compressive Sensing (CS) unifies sensing and compression into a
single task
Minimum number of samples to reconstruct a signal depends on
its sparsity rather than its bandwidth.
† D. Donoho, "Compressed Sensing," IEEE Trans. on Information Theory, Vol. 52, No. 4, pp. 1289-1306, Apr. 2006.
‡ E. Candès, J. Romberg and T. Tao, "Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information," IEEE Trans. on Information Theory, Vol. 52, No. 2, pp. 489-509, Feb. 2006.
Vector Spaces and the Nyquist-Shannon Sampling
Theorem
Vector space: set of vectors H satisfying the following axioms:
Associativity property: v1 + (v2 + v3) = (v1 + v2) + v3.
Commutativity property: v1 + v2 = v2 + v1.
Identity element: ∃0 ∈ H, such that v + 0 = v, ∀v ∈ H.
Inverse element: ∀v ∈ H, then ∃ − v ∈ H, such that v + (−v) = 0.
Distributivity of a scalar over vector addition: s(v1 + v2) = sv1 + sv2.
Distributivity over scalar addition: (s1 + s2)v = s1v + s2v.
Associativity of scalar multiplication: s1(s2v) = (s1s2)v.
Identity element of scalar multiplication: ∃ a scalar 1 such that 1v = v.
Norms: A norm ‖·‖ on the vector space H satisfies:
∀x ∈ H, ‖x‖ ≥ 0, and ‖x‖ = 0 ⇔ x = 0.
∀α ∈ C, ‖αx‖ = |α|‖x‖ (homogeneity).
∀x, y ∈ H, ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).
Examples of norms:
H is the space R^n, with norm ‖x‖ℓp = (Σ_{k=1}^n |xk|^p)^{1/p}, for p ≥ 1.
In R^2, consider the unit ball Bp = {x : ‖x‖ℓp = 1}, p ≥ 1.
The unit ball is the set of all points (x1, x2) that satisfy:
|x1| + |x2| = 1, for B1.
x1^2 + x2^2 = 1, for B2.
max{|x1|, |x2|} = 1, for B∞.
In R^n, ‖x‖ℓ1 = Σ_{k=1}^n |xk| is a norm since it satisfies:
∀x ∈ R^n, ‖x‖ℓ1 = Σ_{k=1}^n |xk| ≥ 0. Also, Σ_{k=1}^n |xk| = 0 if and only if xk = 0, ∀k.
∀α ∈ C, ‖αx‖ℓ1 = Σ_{k=1}^n |αxk| = |α| Σ_{k=1}^n |xk| = |α|‖x‖ℓ1.
∀x, y ∈ R^n,
‖x + y‖ℓ1 = Σ_{k=1}^n |xk + yk|
≤ Σ_{k=1}^n (|xk| + |yk|)   (|·| is a convex function)
= Σ_{k=1}^n |xk| + Σ_{k=1}^n |yk|
= ‖x‖ℓ1 + ‖y‖ℓ1.
In R^n, ‖x‖ℓp = (Σ_{k=1}^n |xk|^p)^{1/p} with p = 0.5 is not a norm:
∀x ∈ R^n, ‖x‖ℓ0.5 = (Σ_{k=1}^n |xk|^{1/2})^2 ≥ 0. Also, (Σ_{k=1}^n |xk|^{1/2})^2 = 0 if and only if xk = 0, ∀k.
∀α ∈ C, ‖αx‖ℓ0.5 = (Σ_{k=1}^n |αxk|^{1/2})^2 = (|α|^{1/2} Σ_{k=1}^n |xk|^{1/2})^2 = |α|‖x‖ℓ0.5.
However, the triangle inequality fails. Take x and y with disjoint supports, e.g., x = (1, 0, . . . , 0) and y = (0, 1, 0, . . . , 0), so that |xk + yk|^{1/2} = |xk|^{1/2} + |yk|^{1/2} for every k. Then
‖x + y‖ℓ0.5 = (Σ_{k=1}^n |xk + yk|^{1/2})^2 = (Σ_{k=1}^n |xk|^{1/2} + Σ_{k=1}^n |yk|^{1/2})^2
= (Σ_{k=1}^n |xk|^{1/2})^2 + (Σ_{k=1}^n |yk|^{1/2})^2 + 2 (Σ_{k=1}^n |xk|^{1/2}) (Σ_{k=1}^n |yk|^{1/2})
> ‖x‖ℓ0.5 + ‖y‖ℓ0.5   (the triangle inequality is not satisfied).
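A two-line MATLAB check of this counterexample, using the disjoint-support pair above:

% The p = 0.5 quasi-norm violates the triangle inequality: check with x = (1,0), y = (0,1)
p  = 0.5;  x = [1 0];  y = [0 1];
np = @(v) sum(abs(v).^p)^(1/p);     % (sum_k |v_k|^p)^(1/p)
[np(x+y), np(x)+np(y)]              % prints 4 and 2, so the value for x+y exceeds the sum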
Other Examples of Norms:
Operator norm: H is the space of m × n matrices A,
‖A‖ = σmax(A) = maximum singular value of A.
Frobenius norm: H is the space of m × n matrices A,
‖A‖F = (Σ_{i,j} A_{i,j}^2)^{1/2} = (Σ_k σk^2)^{1/2}.
Normed vector spaces: vector spaces H satisfying the norm properties.
Examples of normed vector spaces:
ℓ2(R) (also known as ℓ2
or Euclidean space): the vector space R
satisfying the properties of the ℓ2-norm.
ℓ∞(R): the vector space R satisfying the properties of the
ℓ∞-norm.
Inner Products
An inner product ⟨·, ·⟩ on H satisfies, ∀x, y, z ∈ H and α ∈ C:
⟨x, y⟩ = ⟨y, x⟩*
⟨αx, y⟩ = α⟨x, y⟩
⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩
⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 ⇔ x = 0
An inner product induces a norm on H: ‖x‖ = √⟨x, x⟩.
In ℓ2(R), for instance, the inner product is given by:
⟨x, y⟩ = ∫_{−∞}^{∞} x(t) y*(t) dt.   (1)
⟨x, x⟩ = ∫_{−∞}^{∞} x(t) x*(t) dt = ‖x‖ℓ2^2.   (2)
Hilbert Spaces
A vector space H that satisfies the inner product properties is known as
Hilbert space.
Examples of Hilbert spaces:
The Euclidean space R^n with the dot product as inner product: ⟨x, y⟩ = Σ_{i=1}^n xi yi.
The space of real-valued, finite-variance, zero-mean random variables, with ⟨x, y⟩ = E[xy].
The space of m × n matrices, with ⟨A, B⟩tr = trace(AᵀB).
Definitions
Orthogonality: two signals x, y are orthogonal if ⟨x, y⟩ = 0.
Orthonormal basis: a basis of a vector space is orthonormal if its vectors are mutually orthogonal and have unit norm.
Orthonormal sequence: {βn}n∈Z is an orthonormal sequence if ‖βn‖ = 1, ∀n, and ⟨βn, βm⟩ = 0, ∀n ≠ m.
Example:
Fourier series: {βn}n∈Z = {e^{j2πnt}}n∈Z is an orthobasis for ℓ2([0,1]), since:
‖βn‖ℓ2 = 1
⟨βn, βm⟩ = 0, for n ≠ m
Definitions
Cauchy-Schwarz Inequality: |⟨x, y⟩| ≤ ‖x‖‖y‖.
For the Euclidean space H = R^n:
|⟨x, y⟩| = |Σ_i xi yi| ≤ √(Σ_i xi^2) √(Σ_i yi^2) = ‖x‖ℓ2 ‖y‖ℓ2.
For the space of real-valued, finite-variance, zero-mean random variables:
|⟨x, y⟩| = |E[xy]| ≤ √(E[x^2]) √(E[y^2]) = ‖x‖‖y‖.
Shannon-Nyquist Sampling Theorem
Sampling of a bandlimited signal.
Let f̂(w) be the Fourier transform of f(t). Let the space of bandlimited signals be
B_{π/T} = {f(t) ∈ ℓ2(R) s.t. f̂(w) = 0, ∀|w| > π/T}.
Define
hT(t) = √T sin(πt/T) / (πt)  ↔  ĥT(w) = √T if |w| ≤ π/T, and 0 if |w| > π/T.
By the shift property of the Fourier transform,
hT(t − nT) ↔ √T e^{−jwnT}.
Using the Parseval theorem,
∫_{−∞}^{∞} f(t) g*(t) dt = (1/2π) ∫_{−∞}^{∞} f̂(w) ĝ*(w) dw,
note that {hT(t − nT)}n∈Z is an orthobasis for the bandlimited signals f(t) in B_{π/T}:
∫_{−∞}^{∞} hT(t) hT(t − nT) dt = (1/2π) ∫_{−π/T}^{π/T} T e^{jwnT} dw
= (1/(2jπn)) e^{jwnT} |_{−π/T}^{π/T}
= (1/(2jπn)) (e^{jπn} − e^{−jπn})
= 0, ∀n ∈ Z, n ≠ 0,
while for n = 0 the integral equals 1.
The signals f(t) in B_{π/T} can be expressed in terms of this orthobasis:
f(t) = Σ_{n∈Z} ⟨f(t), hT(t − nT)⟩ hT(t − nT).   (3)
Using the inner product definition in (2) and the Parseval theorem, the coefficients of the signal expansion in this orthobasis are
⟨f(t), hT(t − nT)⟩ = (1/2π) ∫_{−π/T}^{π/T} f̂(w) √T e^{jwnT} dw = √T f(nT).   (4)
Replacing (4) in (3), the signals f(t) in B_{π/T} can then be expressed in terms of a sequence:
f(t) = √T Σ_{n∈Z} f(nT) hT(t − nT),   (5)
where the coefficients f(nT) of the sequence are samples of f(t).
Nyquist-Shannon-Kotelnikov Theorem
If a signal f(t) contains only frequencies satisfying |w| < π/T, the signal is completely determined by a series of samples spaced T seconds apart.
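A small MATLAB sketch of this reconstruction: since hT(t) = sinc(t/T)/√T, equation (5) reduces to f(t) = Σ_{n} f(nT) sinc((t − nT)/T). The test signal and the truncation to 101 samples are arbitrary choices, and sinc is written inline to avoid a toolbox dependence.

% Reconstruction from samples, f(t) = sum_n f(nT) sinc((t - nT)/T)  (equivalent to eq. (5))
T  = 0.05;  n = -50:50;                                  % sampling period and truncated sample indices
f  = @(t) sin(2*pi*3*t) + 0.5*cos(2*pi*7*t);             % bandlimited test signal (7 Hz < 1/(2T) = 10 Hz)
s  = @(u) (u==0) + (u~=0).*sin(pi*u)./(pi*u + (u==0));   % sinc without toolbox dependence
t  = linspace(-1, 1, 2000);
fr = zeros(size(t));
for k = 1:numel(n)
    fr = fr + f(n(k)*T) * s((t - n(k)*T)/T);             % add one shifted interpolation kernel per sample
end
max(abs(fr - f(t)))                                      % small; truncating the sum limits accuracy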
Sparsity
Signal sparsity is critical to CS.
It plays roughly the same role in CS that bandwidth plays in Shannon-Nyquist theory.
A signal x ∈ R^N is S-sparse in the basis Ψ if x can be represented as a linear combination of S vectors of Ψ, x = Ψα, with S ≪ N.
[Figure: x = Ψα, where α has at most S non-zero components]
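As a small illustration of this definition, a minimal MATLAB sketch; the random orthonormal basis stands in for Ψ and is our choice, not from the slides:

% Construct an S-sparse signal x = Psi*alpha (random orthonormal basis as a stand-in for Psi)
N = 256;  S = 10;
[Psi, ~] = qr(randn(N));               % random orthonormal basis
alpha = zeros(N, 1);
alpha(randperm(N, S)) = randn(S, 1);   % only S non-zero coefficients
x = Psi * alpha;                       % x is S-sparse in the basis Psi, with S << N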
The ℓ1 Norm and Sparsity
The ℓ0 norm is defined by ‖x‖0 = #{i : x(i) ≠ 0}.
The sparsity of x is measured by its number of non-zero elements.
The ℓ1 norm is defined by ‖x‖1 = Σ_i |x(i)|.
The ℓ1 norm has two key properties:
Robust data fitting
Sparsity-inducing norm
The ℓ2 norm is defined by ‖x‖2 = (Σ_i |x(i)|^2)^{1/2}.
The ℓ2 norm is not effective in measuring the sparsity of x.
Why ℓ1 Norm Promotes Sparsity?
Given two N-dimensional signals:
x1 = (1, 0, . . . , 0) → "spike" signal
x2 = (1/√N, 1/√N, . . . , 1/√N) → "comb" signal
x1 and x2 have the same ℓ2 norm: ‖x1‖2 = 1 and ‖x2‖2 = 1.
However, ‖x1‖1 = 1 and ‖x2‖1 = √N.
[Figure: the spike signal x1 and the comb signal x2]
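A quick MATLAB check of this comparison (N = 100 is an arbitrary choice):

% Compare the l1 and l2 norms of the spike and comb signals
N  = 100;
x1 = [1; zeros(N-1, 1)];        % spike
x2 = ones(N, 1) / sqrt(N);      % comb
[norm(x1, 2), norm(x2, 2)]      % both 1: the l2 norm cannot tell them apart
[norm(x1, 1), norm(x2, 1)]      % 1 versus sqrt(N) = 10: the l1 norm favors the sparse signal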
ℓ1 Norm in Regression
Linear regression is widely used in science and engineering.
Given A ∈ R^{m×n} and b ∈ R^m, with m > n:
Find x s.t. b = Ax (overdetermined).
ℓ1 Norm Regression
Two approaches:
Minimize the ℓ2 norm of the residuals:
min_{x∈R^n} ‖b − Ax‖2
The ℓ2 norm penalizes large residuals.
Minimize the ℓ1 norm of the residuals:
min_{x∈R^n} ‖b − Ax‖1
The ℓ1 norm puts much more weight on small residuals.
Matlab Code
min_{x∈R^n} ‖Ax − b‖2
A = randn(500,150);
b = randn(500,1);
x = inv(A'*A)*A'*b;   % least squares solution (equivalently, x = A\b)

min_{x∈R^n} ‖Ax − b‖1
A = randn(500,150);
b = randn(500,1);
x = medrec(b,A,max(A'*b),0,100,1e-5);   % l1 solver used in the course (not a MATLAB built-in)
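Since medrec is not a standard MATLAB function, a hedged alternative for the same ℓ1 residual problem is iteratively reweighted least squares; this is a sketch, not the course's solver, and the iteration count and smoothing constant are arbitrary:

% Approximate min_x ||A*x - b||_1 by iteratively reweighted least squares (IRLS sketch)
A = randn(500,150);  b = randn(500,1);
x = A\b;                                     % initialize with the least squares solution
for k = 1:50                                 % fixed number of iterations (arbitrary)
    w = 1 ./ max(abs(b - A*x), 1e-6);        % weights ~ 1/|residual|, floored to avoid division by zero
    W = spdiags(w, 0, numel(b), numel(b));   % diagonal weight matrix
    x = (A'*W*A) \ (A'*W*b);                 % weighted least squares update
end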
ℓ1 Norm Regression
m = 500, n = 150. A = randn(m, n) and b = randn(m, 1)
[Figure: histograms of the ℓ2 residuals and the ℓ1 residuals]
ℓ1 Norm in Regression
Given A ∈ R^{m×n} and b ∈ R^m, with m < n:
Find x s.t. b = Ax (underdetermined).
ℓ1 Norm Regression
Two approaches:
Minimize the ℓ2 norm of x:
min_{x∈R^n} ‖x‖2 subject to Ax = b
Minimize the ℓ1 norm of x:
min_{x∈R^n} ‖x‖1 subject to Ax = b
Matlab Code
min_{x∈R^n} ‖x‖2 subject to Ax = b
A = randn(150,500);
b = randn(150,1);
C = eye(500);                 % penalize all 500 entries of x
d = zeros(500,1);
x = lsqlin(C,d,[],[],A,b);    % minimum l2-norm solution (equivalently, x = pinv(A)*b)
In general, for
min_{x∈R^n} f(x) subject to Ax = b:
x = fmincon(@(x) f(x), zeros(500,1), [],[], A, b, [],[],[], options);
where f(x) is a convex function.
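For the ℓ1 problem above, one standard option (our addition, not from the slides) is to recast it as a linear program and solve it with linprog from the same Optimization Toolbox, splitting x = u − v with u, v ≥ 0:

% min ||x||_1 s.t. Ax = b as a linear program in nonnegative variables u, v (sketch)
A = randn(150,500);  b = randn(150,1);  n = size(A,2);
f   = ones(2*n, 1);               % objective: sum(u) + sum(v) equals ||x||_1 at the optimum
Aeq = [A, -A];                    % equality constraint A*(u - v) = b
lb  = zeros(2*n, 1);              % nonnegativity of u and v
z   = linprog(f, [], [], Aeq, b, lb, []);
x   = z(1:n) - z(n+1:end);        % recover x; this solution is typically sparse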
ℓ1 Norm Regression
[Figure: histograms of the entries of the ℓ2 solution and of the ℓ1 solution]
ℓ1 Norm Regression
Consider N observation pairs (xi, bi) modeled in a linear fashion:
bi = A xi + c + Ui,  i = 1, 2, . . . , N,   (6)
A: unknown slope of the fitting line.
c: intercept.
Ui: unobservable errors.
The Least Absolute Deviation (LAD) regression minimizes
F1(A, c) = Σ_{i=1}^N |bi − A xi − c|.   (7)
[Figure: the LAD cost Σ_{i=1}^N |bi − A xi − c| and the sample lines c = −xi A + bi in the (A, c) parameter space]
ℓ1 Norm in Estimation
Location Estimation in Gaussian Noise
Let x1, x2, · · · , xN be i.i.d. Gaussian with a constant but unknown mean β. The Maximum Likelihood estimate of location is the value β̂ which maximizes the likelihood function
f(x1, x2, · · · , xN; β) = Π_{i=1}^N f(xi − β)
= Π_{i=1}^N (1/(√(2π) σ)) e^{−(xi−β)^2/(2σ^2)}   (8)
= (1/(2πσ^2))^{N/2} e^{−Σ_{i=1}^N (xi−β)^2/(2σ^2)}.
ℓ1 Norm in Estimation
The ML estimate β̂ minimizes the least squares sum
β̂ML = arg min_β Σ_{i=1}^N (xi − β)^2,   (9)
which results in the sample mean
β̂ML = (1/N) Σ_{i=1}^N xi.   (10)
ℓ1 Norm in Estimation
Location Estimation in Generalized Gaussian Noise
If the xi's obey a generalized Gaussian distribution, the likelihood function is
f(x1, x2, · · · , xN; β) = Π_{i=1}^N fγ(xi − β)
= Π_{i=1}^N C e^{−|xi−β|^γ/σ}
= C^N e^{−Σ_{i=1}^N |xi−β|^γ/σ},   (11)
where C is a normalizing constant and γ is the dispersion parameter.
ℓ1 Norm in Estimation
Maximizing the likelihood function is equivalent to
β̂ML = arg min_β Σ_{i=1}^N |xi − β|^γ.
[Figure: the cost function over β for samples x1, . . . , x5, for γ = 0.5, 1, and 2]
ℓ1 Norm in Estimation
For N odd there is an integer k such that the slopes of the cost function over the intervals (x(k−1), x(k)] and (x(k), x(k+1)] are negative and positive, respectively. Thus
β̂ML = arg min_β Σ_{i=1}^N |xi − β|
= x((N+1)/2) for N odd, or any value in [x(N/2), x(N/2+1)] for N even
= MEDIAN(x1, x2, · · · , xN).   (12)
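A small numerical check of the two special cases γ = 2 and γ = 1, on an arbitrary sample; the elementwise x − beta relies on MATLAB implicit expansion (R2016b or newer):

% Grid search over beta for sum_i |x_i - beta|^gamma, with gamma = 2 and gamma = 1
x    = [0.3; 1.2; 2.0; 4.5; 7.1];              % arbitrary sample
beta = linspace(min(x), max(x), 10001);
c2   = sum(abs(x - beta).^2, 1);               % gamma = 2 cost
c1   = sum(abs(x - beta),    1);               % gamma = 1 cost
[~, i2] = min(c2);  [~, i1] = min(c1);
[beta(i2), mean(x); beta(i1), median(x)]       % gamma = 2 gives the mean, gamma = 1 the median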
ℓ1 Norm Regression
ML Estimate of Location for the Generalized Gaussian
Here the samples have a common location parameter β but different scale parameters σi. The ML estimate of location minimizes
Gp(β) = Σ_{i=1}^N (1/σi^p) |xi − β|^p.   (13)
For the Gaussian distribution (p = 2), the ML estimate reduces to the weighted mean
β̂ = arg min_β Σ_{i=1}^N (1/σi^2) (xi − β)^2 = (Σ_{i=1}^N Wi xi) / (Σ_{i=1}^N Wi),   (14)
where Wi = 1/σi^2 > 0.
For the Laplacian distribution (p = 1), the ML estimate minimizes
G1(β) = Σ_{i=1}^N (1/σi) |xi − β|,   (15)
where Wi ≜ 1/σi > 0. G1(β) is piecewise linear and convex. The weighted median output is defined as
β̂ = arg min_β Σ_{i=1}^N Wi |xi − β|
= MEDIAN[W1♦x1, W2♦x2, · · · , WN♦xN],
where Wi > 0 and ♦ is the replication operator, Wi♦xi = xi, xi, · · · , xi (Wi times).
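A minimal MATLAB sketch of such a weighted median for positive, not necessarily integer, weights; the function name wmedian is ours, and the sorting/half-weight rule returns one valid minimizer:

function m = wmedian(x, w)
% Weighted median: a minimizer of sum_i w_i*|x_i - m| over m, for weights w_i > 0
x = x(:);  w = w(:);
[xs, idx] = sort(x);               % sort the samples
cw = cumsum(w(idx));               % cumulative weight in sorted order
k  = find(cw >= sum(w)/2, 1);      % smallest index reaching half of the total weight
m  = xs(k);
end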
ℓ1 Norm Regression
Next, consider N observation pairs (xi, bi):
bi = A xi + c + Ui,  i = 1, 2, . . . , N,   (16)
A: unknown slope of the fitting line.
c: intercept.
Ui: unobservable errors.
The L1 or Least Absolute Deviation (LAD) regression minimizes
F1(A, c) = Σ_{i=1}^N |bi − A xi − c|.   (17)
Sample space: bi = A xi + c
1. Each sample pair (xi, bi) represents a point in the plane.
2. The solution is a line with slope A* and intercept c*.
3. If this line goes through some sample pair (xi, bi), then the equation bi = A* xi + c* is satisfied.
Parameter space: c = −xi A + bi
1. The solution (A*, c*) is a point.
2. The sample pair (xi, bi) defines a line with slope −xi and intercept bi.
3. When c* = −xi A* + bi holds, the point (A*, c*) lies on the line defined by (−xi, bi).
[Figure: parameter space (A, c) with the solution point (A*, c*)]
Set A = A0; the objective function then becomes a one-parameter function of c:
F(c) = Σ_{i=1}^N |(bi − A0 xi) − c|,   (18)
where the terms bi − A0 xi play the role of observations. The parameter c* is the Maximum Likelihood estimator of location for c. It can be obtained by
c* = MED(bi − A0 xi) |_{i=1}^N.   (19)
Set c = c0; the objective function reduces to
F(A) = Σ_{i=1}^N |bi − c0 − A xi|
= Σ_{i=1}^N |xi| · |(bi − c0)/xi − A|.   (20)
The parameter A* can be seen as the ML estimator of location for A, and can be calculated as the weighted median
A* = MED( |xi| ⋄ (bi − c0)/xi ) |_{i=1}^N.   (21)
A simple and intuitive way of solving the LAD regression problem (a MATLAB sketch follows the steps) is:
1. Set k = 0. Find an initial value A0 for A, such as the Least Squares (LS) solution.
2. Set k = k + 1 and obtain a new estimate of c for a fixed Ak−1 using ck = MED(bi − Ak−1 xi) |_{i=1}^N.
3. Obtain a new estimate of A for a fixed ck using Ak = MED( |xi| ⋄ (bi − ck)/xi ) |_{i=1}^N.
4. Once Ak and ck do not deviate from Ak−1 and ck−1 beyond a tolerance, end the iteration. Otherwise, go back to step 2.
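A MATLAB sketch of this iteration on synthetic data, assuming the wmedian helper sketched earlier; the data, the outlier pattern, and the tolerance are arbitrary choices:

% LAD line fit b ~ A*x + c by alternating (weighted) medians, following steps 1-4
N = 200;  x = randn(N,1);
b = 2*x + 1 + randn(N,1);  b(1:20) = b(1:20) + 15;      % synthetic data with gross outliers
th = [x, ones(N,1)] \ b;  A = th(1);                     % step 1: least squares initialization
c = median(b - A*x);
for k = 1:100                                            % steps 2-4
    A_old = A;  c_old = c;
    c = median(b - A*x);                                 % location estimate of c
    A = wmedian((b - c)./x, abs(x));                     % weighted median over the slopes (b_i - c)/x_i
    if max(abs(A - A_old), abs(c - c_old)) < 1e-6, break; end
end
[A, c]                                                   % robust slope and intercept estimates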
[Figure: parameter space (A, c)]
Signal Representation
A sparse signal x ∈ R^N can be represented by a linear combination of the basis vectors of an orthogonal representation matrix Ψ:
x(t) = Σ_i αi ψi(t)
Sparse Signal Representation
Active development of effective signal representations in the 1990s:
Fourier
Wavelet
Curvelet
There is no universal best representation
Best representation = sparsest
Wavelets
A wavelet is a "small wave" with finite energy that allows the analysis of transient or time-varying phenomena.
Figure: Daubechies (D20) Wavelet example
A signal x(t) can be represented in terms of its wavelet coefficients as
x(t) = Σ_{j∈Z} Σ_{n∈Z} ⟨x, Ψj,n⟩ Ψj,n(t),
where:
Ψj,n are the wavelets, which form an orthogonal basis.
⟨x, Ψj,n⟩ are the wavelet coefficients.
Wavelets are vectors of an orthogonal basis formed by shifting and dilating a mother wavelet Ψ(t):
Ψj,n(t) = 2^{−j/2} Ψ(2^{−j} t − n), ∀j, n ∈ Z,
where j is the scale parameter and n is the location parameter.
Examples of wavelet expansion functions:
[Figures: Haar wavelet, Daubechies wavelet, and Symlet wavelet]
Daubechies Wavelet
Daubechies wavelets are continuous and smooth wavelets.
The mother wavelet is defined by means of a scaling function.
A Daubechies wavelet Ψ(t) has p vanishing moments if
∫_{−∞}^{∞} t^k Ψ(t) dt = 0, for 0 ≤ k < p.
The smoothness of the scaling and wavelet functions increases as the number of vanishing moments increases.
Examples of Daubechies wavelets:
(a) Daubechies scaling and wavelet functions with 2 vanishing moments.
(b) Daubechies scaling and wavelet functions with 6 vanishing moments.
(c) Daubechies scaling and wavelet functions with 10 vanishing moments.
Examples of Wavelet decompositions
Examples of Wavelet decompositions
Other examples: original signals
Noisy signals
Denoising using wavelet approximation
Denoising using wavelet approximation
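As a sketch of the idea behind these denoising examples (not the slides' exact processing): soft-threshold the detail coefficients of a wavelet decomposition and reconstruct. This assumes the Wavelet Toolbox; the wavelet 'db4', the 5 levels, and the noise level are arbitrary choices.

% Wavelet denoising by soft-thresholding detail coefficients (requires the Wavelet Toolbox)
t = linspace(0, 1, 1024);
x = sin(2*pi*5*t) + (t > 0.5);            % piecewise-smooth test signal
y = x + 0.2*randn(size(x));               % noisy observation, sigma = 0.2
[c, l] = wavedec(y, 5, 'db4');            % 5-level Daubechies wavelet decomposition
thr = 0.2*sqrt(2*log(numel(y)));          % universal threshold (noise level assumed known)
d = c;  k = l(1)+1:numel(c);              % indices of the detail coefficients
d(k) = sign(c(k)) .* max(abs(c(k)) - thr, 0);   % soft thresholding
xhat = waverec(d, l, 'db4');              % denoised reconstruction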
Sampling and Compression
[Figures: JPEG and JPEG2000 compressed images]
Sparse Signal Representation
Different representations are best for different applications.
Fourier Dictionary → For oscillatory phenomena
Wavelet Dictionary → For images with isolated singularities
Curvelet Dictionary → For images with contours and edges
This motivates overcomplete signal representation ‡
‡ S. Mallat and Z. Zhang, "Matching Pursuits with Time-Frequency Dictionaries," IEEE Trans. on Signal Proc., Vol. 41, pp. 3397-3415, 1993.
Sparse Signal Representation
Overcomplete dictionary representation
Different bases merged into a combined dictionary
Ψ = [Ψ1, Ψ2, ..., ΨN]
Representation of x in an overcomplete dictionary:
x = Σ_i αi ψi, with the sparsest α
Basis Pursuit (BP)
Basis Pursuit → find the sparsest approximation of x:
min_α ‖α‖1 s.t. x = Ψα,
where ‖α‖1 = Σ_i |αi|.
BP decomposes a signal into a superposition of dictionary elements having the smallest ℓ1 norm among all such decompositions.†
†
D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory, 47:2845-2862, 2001.
Compressible Signals
In most applications:
Signals are not perfectly sparse, but only a few coefficients concentrate most of the energy.
Most of the transform coefficients are negligible.
Compressible signals can be approximated by an S-sparse signal:
- There is a transform vector αS with only S terms such that ‖αS − α‖2 is small.
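A minimal numerical illustration of this approximation, assuming coefficients whose sorted magnitudes decay like 1/n (our choice, anticipating the next slide):

% Best S-term approximation error ||alpha_S - alpha||_2 for coefficients with 1/n decay
N = 1000;  S = 50;
alpha  = (1 ./ (1:N)') .* sign(randn(N,1));     % compressible coefficients: magnitudes ~ 1/n
[~, idx] = sort(abs(alpha), 'descend');
alphaS = zeros(N, 1);
alphaS(idx(1:S)) = alpha(idx(1:S));             % keep only the S largest-magnitude terms
norm(alphaS - alpha) / norm(alpha)              % small relative error: alpha is well approximated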
Compressible Signals
Wavelet coefficients of natural scenes exhibit a (1/n) decay†.
[Figures: 1-megapixel image, its wavelet coefficients, and the sorted wavelet coefficients]
† E. J. Candès and J. Romberg, "Sparsity and Incoherence in Compressive Sampling," Inverse Problems, Vol. 23, pp. 969-985, 2007.
Examples of Compressible Signals
Bat echolocation: time signal and time-frequency representation.
Confocal microscopy: 3D image and 3D wavelet coefficients.
Ultra-wideband signaling: amplitude [mV] versus time [ps].