CHAPTER III
Neural Networks as Associative Memory
One of the primary functions of the brain is associative memory. We associate faces with names and letters with sounds, and we can recognize people even if they are wearing sunglasses or have grown somewhat older.
Associative memories can be implemented using either feedforward or recurrent neural networks. Such associative neural networks are used to associate one set of vectors with another set of vectors, say input and output patterns. The aim of an associative memory is to produce the associated output pattern whenever one of the input patterns is applied to the neural network. The input pattern may be applied to the network either as an input or as an initial state, and the output pattern is observed at the outputs of some of the neurons constituting the network. According to the way the network handles errors in the input pattern, associative memories are classified as interpolative or accretive. In an interpolative memory, some deviation from the desired output pattern is allowed when noise is added to the related input pattern. In an accretive memory, however, the output is desired to be exactly the same as the associated output pattern, even if the input pattern is noisy. Another classification is based on the relation between the input and output patterns: a memory in which the associated input and output patterns differ is called a heteroassociative memory, while it is called an autoassociative memory if they are the same.
In this chapter, first the basic definitions related to associative memory are given, and then it is explained how neural networks can be made linear associators so as to perform as interpolative memories. Next, it is explained how the Hopfield network can be used as an autoassociative memory, and then the Bi-directional Associative Memory (BAM) network, which is designed to operate as a heteroassociative memory, is introduced.
3.1. Associative Memory
In an associative memory, we store a set of patterns µk, k=1..K, so that the network responds by producing whichever of the stored patterns most closely resembles the one presented to the network.
Here we need a measure for defining the resemblance of patterns. For this purpose the norms that were introduced in Section 2.1 may be used. While the Euclidean distance is convenient for continuous-valued pattern vectors, the Hamming distance, which gives the number of mismatched components, is more appropriate for patterns with binary or bipolar entries.
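As an illustration of these two measures, a minimal sketch in Python (assuming NumPy and purely illustrative example vectors, which are not part of the text) could look as follows:

import numpy as np

# Euclidean distance for continuous-valued patterns
u = np.array([0.2, -1.3, 0.7])
v = np.array([0.1, -1.0, 0.9])
euclidean = np.linalg.norm(u - v)

# Hamming distance for bipolar patterns: the number of mismatched components
a = np.array([1, -1,  1, 1, -1])
b = np.array([1,  1, -1, 1, -1])
hamming = int(np.sum(a != b))        # 2 mismatched entries in this example

print(euclidean, hamming)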
Suppose that the stored patterns, which are called exemplars or memory elements, are
in the form of pairs of associations, µk=(uk,yk), uk∈RN, yk∈RM, k=1..K. According to
the mapping ϕ: RN→RM that they implement, we distinguish the following types of
associative memories:
• Interpolative associative memory: when u=ur is presented to the memory, it responds by producing yr of the stored association. However, if u differs from ur by an amount ε, that is, if u=ur+ε is presented to the memory, then the response differs from yr by some amount εr. Therefore in interpolative associative memory we have
ϕ(ur + ε) = yr + εr  such that  ε = 0 ⇒ εr = 0,  r = 1..K      (3.1.1)
• Accretive associative memory: when u is presented to the memory, it responds by
producing yr of the stored association such that ur is the one closest to u among uk,
k=1..K, that is,
ϕ(u) = yr  such that  || ur − u || = mink || uk − u ||,  k = 1..K      (3.1.2)
The accretive associative memory in the form given above is called heteroassociative
memory. However if the stored exemplars are in a special form such that the desired
patterns and the input patterns are the same, that is yk=uk for k=1..K, then it is called
autoassociative memory. In such a memory, whenever u is presented, it responds with ur, the one closest to u among uk, k=1..K, that is,
ϕ(u) = ur  such that  || ur − u || = mink || uk − u ||,  k = 1..K      (3.1.3)
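A minimal sketch of such accretive recall (illustrative only; the exemplars, the function name and the use of NumPy are assumptions, not part of the text) simply returns the stored output whose input exemplar is closest, in Hamming distance, to the presented bipolar pattern:

import numpy as np

def accretive_recall(u, U, Y):
    """Return yr where ur is the stored input closest to u in Hamming distance.

    U: K x N matrix of bipolar input exemplars (one exemplar per row)
    Y: K x M matrix of the associated output patterns (one per row)
    """
    distances = np.sum(U != u, axis=1)   # Hamming distance to each exemplar
    r = np.argmin(distances)             # index of the closest exemplar
    return Y[r]

# Hypothetical exemplars: two input/output association pairs
U = np.array([[ 1, -1,  1, -1],
              [-1, -1,  1,  1]])
Y = np.array([[ 1,  1],
              [-1,  1]])

noisy = np.array([1, -1, -1, -1])        # u1 with its third entry flipped
print(accretive_recall(noisy, U, Y))     # -> [1 1], i.e. y1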
While interpolative memories can be implemented by using feed-forward neural
networks, it is more appropriate to use recurrent networks as accretive memories.
The advantage of using recurrent networks as associative memory is their
convergence to one of a finite number of stable states when started at some initial
state. The basic goals are:
• to be able to store as many exemplars as we need, each corresponding to a different stable state of the network,
• to have no other stable states,
• to have the stable state that the network converges to be the one closest to the applied pattern.
The problems that we face are:
• the capacity of the network is restricted,
• depending on the number and properties of the patterns to be stored, some of the exemplars may not be stable states,
• some spurious stable states different from the exemplars may arise by themselves,
• the converged stable state may be other than the one closest to the applied pattern.
One way of using recurrent neural networks as associative memory is to fix the
external input of the network and present the input pattern ur to the system by setting
x(0)=ur. If we relax such a network, then it will converge to the attractor x* for which
x(0) is within the basin of attraction, as explained in Section 2.7. If we are able to place
each µk as an attractor of the network by proper choice of the connection weights,
then we expect the network to relax to the attractor x*=µr that is related to the initial
state x(0)=ur. For a good performance of the network, we need the network to
converge only to one of the stored patterns µk, k=1...K. Unfortunately, some initial
states may converge to spurious states, which are the undesired attractors of the
network representing none of the stored patterns. Spurious states may arise by
themselves depending on the model used and the patterns stored. The capacity of the
neural associative memories is restricted by the size of the networks. If we increment
the number of stored patterns for a fixed size neural network, spurious states arise
inevitably. Sometimes, the network may converge not to a spurious state, but to a
memory pattern that is not so close to the presented pattern. What we expect for feasible operation is that, at least for the stored memory patterns themselves, if any of them is
presented to the network by setting x(0)=µk, then the network should remain converged at x*=µk (Figure 3.1).
Figure 3.1 In associative memory each memory element is assigned to an attractor
A second way to use recurrent networks as associative memory is to present the input pattern ur to the system as an external input. This can be done by setting θ=ur, where θ is the threshold vector whose ith component corresponds to the threshold of neuron i. After setting x(0) to some fixed value, we relax the network and wait until it converges to an attractor x*. For a good performance of the network, we desire the network to have a single attractor x*=µk for each stored input pattern uk, so that the network converges to this attractor independently of its initial state. Another solution to the problem is to have predetermined initial values, chosen so that they lie within the basin of attraction of µk whenever uk is applied. We will consider this kind of network in more detail in Chapter 7, where we examine how such recurrent networks are trained.
3.2 Linear Associators as Interpolative Memory
It is quite easy to implement an interpolative associative memory when the set of input memory elements {uk} constitutes an orthonormal set of vectors, that is
ui · uj =  1   if i = j
           0   if i ≠ j      (3.2.1)
Using the Kronecker delta, we write simply
ui · uj = δij      (3.2.2)
The mapping function ϕ(u) defined below may be used to establish an interpolative
associative memory:
ϕ(u) = WTu      (3.2.3)
where T denotes transpose and
W = Σk uk × yk      (3.2.4)
Here the symbol × is used to denote the outer product of u∈RN and y∈RM, which is defined as
uk × yk = uk (yk)T = (yk (uk)T)T ,      (3.2.5)
resulting in a matrix of size N by M.
By defining matrices [Haykin 94]:
U=[u1 u2.. uk.. uK] (3.2.6)
and
Y=[y1 y2.. yk.. yK] (3.2.7)
the weight matrix can be formulated as
WT = YUT      (3.2.8)
If the network is going to be used as autoassociative memory we have Y=U so,
WT = UUT      (3.2.9)
For a function ϕ(u) to constitute an interpolative associative memory, it should satisfy
the condition
ϕ(ur)=yr for r=1..K (3.2.10)
We can check it simply as
ϕ(ur) = WTur      (3.2.11)
which is
WTur = YUTur      (3.2.12)
Since the set {uk} is orthonormal, we have
YUTur = Σk yk δkr = yr      (3.2.13)
which results in
ϕ(ur) = YUTur = yr      (3.2.14)
as we desired.
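The construction above can be sketched directly in NumPy (the patterns below are hypothetical; standard basis vectors are chosen so that the orthonormality condition (3.2.1) holds exactly):

import numpy as np

# Hypothetical orthonormal input patterns (columns of U) and desired outputs (columns of Y)
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])         # u1, u2 in R^3, orthonormal
Y = np.array([[ 1.0, -1.0],
              [ 2.0,  0.5]])       # y1, y2 in R^2

# Weight matrix as a sum of outer products, Eq. (3.2.4): W = sum_k uk (yk)^T
W = sum(np.outer(U[:, k], Y[:, k]) for k in range(U.shape[1]))   # equals U @ Y.T

# Recall, Eq. (3.2.3): phi(u) = W^T u reproduces yr for each stored ur
for r in range(U.shape[1]):
    print(W.T @ U[:, r], "should equal", Y[:, r])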
Furthermore, if an input pattern u=ur+ε different from the stored patterns is applied as input to the network, we obtain
ϕ(u) = WT(ur + ε) = WTur + WTε      (3.2.15)
Using equations (3.2.12) and (3.2.13) results in
ϕ(u) = yr + WTε      (3.2.16)
Therefore, we have
ϕ(u) = yr + εr      (3.2.17)
in the required form, where
εr = WTε      (3.2.18)
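Continuing the hypothetical setup of the previous sketch, presenting a perturbed pattern ur+ε returns yr plus the propagated deviation WTε, exactly as in Eqs. (3.2.15)-(3.2.18):

eps = np.array([0.05, -0.02, 0.10])   # a small perturbation of u1
response = W.T @ (U[:, 0] + eps)      # = y1 + W^T eps
print(response - Y[:, 0])             # the output deviation eps_r
print(W.T @ eps)                      # identical, by linearity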
Such a memory can be implemented as shown in Figure 3.2 by using M neurons, each having N inputs. The connection weights of neuron i are given by Wi, the ith column vector of matrix W. Here each neuron has a linear output transfer function f(a)=a. When a stored pattern uk is applied as input to the network, the desired value yk is observed at the output of the network as:
xk = WTuk      (3.2.19)
Figure 3.2 Linear Associator
Until now, we have investigated the use of the linear mapping YUT as an associative memory, which works well when the input patterns are orthonormal. When the input patterns are not orthonormal, the linear associator cannot map some input patterns to the desired output patterns without error. In the following we will investigate
the conditions necessary to minimize the output error for the exemplar patterns. That
is, for a given set of exemplars µk=(uk,yk), uk∈RN, yk∈RM, k=1.. K, our purpose is to
find a linear mapping A* among A: RN→RM such that:
A* = arg minA Σk || yk − A uk ||      (3.2.20)
where || · || is chosen as the Euclidean norm.
The problem may be reformulated by using the matrices U and Y [Haykin 94]:
A* = arg minA || Y − AU ||      (3.2.21)
The pseudoinverse method [Kohonen 76], based on least-squares estimation, provides a solution to the problem, in which A* is determined as:
A* = YU+      (3.2.22)
where U+ is the pseudoinverse of U.
The pseudoinverse U+ is a matrix satisfying the condition:
U+U = 1      (3.2.23)
where 1 is the identity matrix. A perfect match is obtained by using A* = YU+, since
A*U = YU+U = Y      (3.2.24)
resulting in no error due to the fact
|| Y - A*U|| = 0 (3.2.25)
If the input patterns are linearly independent, that is, none of them can be obtained as a linear combination of the others, then a matrix U+ satisfying Eq. (3.2.23) can be obtained by applying the formula [Golub and Van Loan 89, Haykin 94]
U+ = (UTU)−1UT      (3.2.26)
Notice that for the input patterns, which are the columns of the matrix U, to be linearly independent, the number of columns should not be more than the number of rows, that is K ≤ N; otherwise UTU will be singular and no inverse will exist. The condition K ≤ N means that the number of entries constituting the patterns restricts the capacity of the memory. At most N patterns can be stored in such a memory.
This memory can be implemented by a neural network for which WT=YU+ . The
desired value yk appears at the output of the network as xk when uk is applied as
input to the network:
xk = WTuk      (3.2.27)
as explained in the previous section.
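As a sketch of this pseudoinverse solution (hypothetical, non-orthonormal but linearly independent patterns; NumPy's np.linalg.pinv computes U+, which coincides with Eq. (3.2.26) when UTU is invertible):

import numpy as np

# Hypothetical exemplars: the two input columns are linearly independent
# but not orthonormal (K = 2 <= N = 3).
U = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])          # u1, u2 in R^3
Y = np.array([[1.0, -1.0],
              [0.0,  2.0]])         # y1, y2 in R^2

U_pinv = np.linalg.pinv(U)          # U+ ; here U+ U = 1 as in Eq. (3.2.23)
A_star = Y @ U_pinv                 # A* = Y U+, Eq. (3.2.22)

print(np.allclose(A_star @ U, Y))   # True: exact recall, ||Y - A*U|| = 0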
Notice that for the special case of orthonormal patterns that we examined previously
in this section, we have
UTU = 1      (3.2.28)
so that the pseudoinverse takes the form
U+ = UT      (3.2.29)
and therefore
WT = YUT      (3.2.30)
as we have derived previously.
3.3. Hopfield Autoassociative Memory
In Section 2.9 we examined the continuous-input, continuous-time Hopfield network. In this section we will investigate how the Hopfield network can be used as an autoassociative memory. For this purpose, some modifications are made so that it works in a discrete state space and in discrete time. When the discrete Hopfield network was introduced as an associative memory in [Hopfield 82], it attracted great attention. In [Hopfield 84] it is shown that many important characteristics of the discrete and continuous deterministic models are closely related (Figure 3.3).
Figure 3.3 Hopfield Associative Memory
Note that whenever the patterns to be stored in the Hopfield network are from the N-dimensional bipolar space, that is uk∈{-1,1}N, k=1..K, whose points constitute the corners of a hypercube, it is convenient to have the stable states of the network on the corners of this hypercube. For this purpose refer to the output transfer function given by Eq. (2.9.3) and to Figure 2.7 for different values of the gain. If we let the output transfer function of the neurons in the network have a very high gain, in the extreme case
f(a) = limκ→∞ tanh(κa)      (3.3.1)
we obtain
f(a) = sign(a) =   1   for a > 0
                   0   for a = 0
                  −1   for a < 0      (3.3.2)
Furthermore, note that the second term of the energy function given by Eq. (2.9.7), which we repeat here for convenience:
E = −(1/2) Σi=1..N Σj=1..N wji xi xj + Σi=1..N (1/Ri) ∫0→xi f−1(x) dx − Σi=1..N θi xi      (3.3.3)
approaches zero. Therefore the stable states of the network correspond to the local minima of the function:
E = −(1/2) Σi Σj wji xi xj − Σi θi xi      (3.3.4)
so that they lie on the corners of the hypercube as explained previously.
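A short sketch of this energy function evaluated at a corner of the hypercube (hypothetical weights, thresholds and state; NumPy assumed):

import numpy as np

def hopfield_energy(x, W, theta):
    """Energy of Eq. (3.3.4): E = -1/2 sum_i sum_j w_ji x_i x_j - sum_i theta_i x_i."""
    return -0.5 * x @ W @ x - theta @ x

# Hypothetical 3-neuron network: symmetric weights, zero diagonal, zero thresholds
W = np.array([[ 0.0, 1.0, -1.0],
              [ 1.0, 0.0,  0.5],
              [-1.0, 0.5,  0.0]])
theta = np.zeros(3)

x = np.array([1.0, 1.0, -1.0])       # a corner of the bipolar hypercube
print(hopfield_energy(x, W, theta))  # -> -1.5 for this particular choice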
In Section 2.10, we derived the discrete-time approximation of the continuous-time Hopfield network described in Section 2.9. In this section, however, we investigate a special case of the Hopfield network in which the stable states are forced to take discrete values in the bipolar state space. Knowing in advance that the local minima of the energy function should take place at the corners of the N-dimensional hypercube, we can get rid of the slow convergence problem caused by a small value of η.
For this purpose, a discrete state excitation [Hopfield 82] of the network is used, as given in the following:
xi(k+1) = f(ai(k)) =   1       for ai(k) > 0
                       xi(k)   for ai(k) = 0
                      −1       for ai(k) < 0      (3.3.5)
where ai(k) is defined in the usual manner:
ai(k) = Σj wji xj(k) + θi      (3.3.6)
The processing elements of the network are updated one at a time, in such a way that all of the processing elements are updated at the same average rate.
Note that for any vector x having bipolar entries, that is xi∈{-1,1}, we obtain the vector itself if we apply the function defined by Eq. (3.3.5) to it, that is
f(x)=x (3.3.7)
Here f is used to denote the vector function obtained by applying the scalar function f to each entry.
For stability of the discrete Hopfield network, it is further required that wii=0, in addition to the symmetry constraint wij=wji.
In order to use the discrete Hopfield network as an autoassociative memory, its weights are fixed to
WT = UUT      (3.3.8)
where U is the input pattern matrix as defined in Eq. (3.2.6). Remember that in autoassociative memory we have Y=U, where Y is the matrix of desired output patterns as defined in Eq. (3.2.7). For the stability of the network, the diagonal entries of W are set to 0, that is wii=0, i=1..N.
If all the states of the network are to be updated at once, then the next state of the
system can be represented in the form
x(k+1)=f(WTx(k)) (3.3.9)
For the special case in which the exemplars are orthonormal, due to the facts indicated by Eqs. (3.3.7) and (3.2.13) we have
f(WTur)=f(ur)=ur (3.3.10)
which means that each exemplar is a stable state of the network. Whenever the initial state is set to one of the exemplars, the system remains there. However, if the initial state is set to some arbitrary input, then the network converges to one of the stored exemplars, depending on the basin of attraction in which x(0) lies.
However, in general, the input patterns are not orthonormal, so there is no guarantee that each exemplar corresponds to a stable state. Therefore the problems that we mentioned in Section 3.1 arise. The capacity of the Hopfield net is less than 0.138N patterns, where N is the number of units in the network [Lippmann 89].
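The whole storage and recall procedure can be sketched as follows (hypothetical bipolar exemplars; the function names, the zero external input and the fixed sweep order are assumptions made only for illustration):

import numpy as np

def hopfield_store(U):
    """Weight matrix W^T = U U^T of Eq. (3.3.8) with the diagonal set to zero."""
    W = U @ U.T
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, x0, theta=None, max_sweeps=100):
    """Asynchronous updates, Eqs. (3.3.5)-(3.3.6): one neuron at a time until stable."""
    x = x0.astype(float).copy()
    theta = np.zeros(len(x)) if theta is None else theta
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(x)):                 # a fixed sweep order, one possible schedule
            a = W[:, i] @ x + theta[i]          # a_i = sum_j w_ji x_j + theta_i
            if a > 0 and x[i] != 1:
                x[i], changed = 1, True
            elif a < 0 and x[i] != -1:
                x[i], changed = -1, True        # a == 0 leaves x_i unchanged
        if not changed:                         # no neuron changed: a stable state
            break
    return x

# Two hypothetical bipolar exemplars stored as the columns of U (N = 6, K = 2)
U = np.array([[ 1, -1],
              [ 1,  1],
              [-1,  1],
              [ 1, -1],
              [-1, -1],
              [ 1,  1]], dtype=float)
W = hopfield_store(U)

noisy = np.array([1, 1, -1, 1, -1, -1])   # first exemplar with its last bit flipped
print(hopfield_recall(W, noisy))          # settles back at the first exemplar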
In the following we will show that the energy function never increases as the states of the processing elements are changed one by one. Notice that:
∆E = E(x(k+1)) − E(x(k))
   = −(1/2) Σi Σj wji xj(k+1) xi(k+1) − Σi θi xi(k+1) + (1/2) Σi Σj wji xj(k) xi(k) + Σi θi xi(k)      (3.3.11)
Assume that the neuron that just changes state at step k is neuron p. Therefore xp(k+1)
is determined by equation 3.3.5 and for all the other neurons we have xi(k+1)=xi(k), i≠
p. Furthermore we have wpp=0. Hence,
∆E = −(xp(k+1) − xp(k)) ( Σj wjp xj(k) + θp )      (3.3.12)
that is,
∆E = −(xp(k+1) − xp(k)) ap(k)      (3.3.13)
Notice that if the value of xp remains the same, then xp(k+1)=xp(k), so ∆E=0. If they are not the same, then it is either the case that xp(k)= -1 and xp(k+1)=1 due to the fact that ap(k)>0,
or that xp(k)=1 and xp(k+1)= -1 due to the fact that ap(k)<0. In either case, if xp(k+1)≠xp(k), the change is in a direction for which ∆E<0. Therefore, for the discrete Hopfield network we have
∆E ≤ 0 (3.3.14)
Because at each state change the energy function decreases by at least some fixed minimum amount, and because the energy function is bounded, it reaches a minimum value in a finite number of state changes. So the Hopfield network converges to a stable state in finite time, in contrast to the asymptotic convergence of the continuous Hopfield network. The schedule in which only one unit of the discrete Hopfield network is updated at a time is called asynchronous update. The other approach, in which all the units are updated at once, is called synchronous update. Although convergence is guaranteed with the asynchronous update mechanism, synchronous update may result in a cycle of length two.
It should be noted that, because of the close relation between the discrete and continuous models, the continuous deterministic model implies the possibility of implementing the discrete network in actual hardware. However, the discrete model is often implemented through computer simulations because of its simplicity.
Exercise: Explain how we can use the Hopfield network as an autoassociative memory if the states are not from the bipolar space {-1,1}N but from the binary space {0,1}N.
3.4. Bi-directional Associative Memory
The Bi-directional Associative Memory (BAM) introduced in [Kosko 88] is a recurrent network (Figure 3.4) designed to work as a heteroassociative memory [Nielsen 90]. The BAM network consists of two sets of neurons whose outputs are represented by the vectors x∈RM and v∈RN respectively, having activations defined by the pair of equations:
daxi/dt = (1/αi) ( −axi + Σj=1..N wji f(avj) + θi ),   for i = 1..M      (3.4.1)
davj/dt = (1/βj) ( −avj + Σi=1..M wij f(axi) + φj ),   for j = 1..N      (3.4.2)
where αi, βj, θi, φj are positive constants for all i=1..M, j=1..N, f is a sigmoid function and W=[wij] is any N×M real matrix.
The stability of the BAM network can be proved easily by applying the Cohen-Grossberg theorem to the state vector z∈RN+M defined as
zi = xi for i ≤ M,   zi = vi−M for M < i ≤ M+N      (3.4.3)
that is, z is obtained by concatenating x and v.
Figure 3.4: Bi-directional Associative Memory
Since the BAM is a special case of the network defined by the Cohen-Grossberg theorem, it has a Lyapunov energy function, which is given in the following:
E(x, v) = − Σi=1..M Σj=1..N wij f(axi) f(avj) + Σi=1..M αi ∫0→axi a f′(a) da
          + Σj=1..N βj ∫0→avj b f′(b) db − Σi=1..M θi f(axi) − Σj=1..N φj f(avj)      (3.4.4)
The discrete BAM model is defined in a manner similar to the discrete Hopfield network. The output functions are chosen to be f(a)=sign(a) and the states are excited as:
xi(k+1) = f(axi(k)) =   1       for axi(k) > 0
                        xi(k)   for axi(k) = 0
                       −1       for axi(k) < 0      (3.4.5)
vj(k+1) = f(avj(k)) =   1       for avj(k) > 0
                        vj(k)   for avj(k) = 0
                       −1       for avj(k) < 0      (3.4.6)
where
axi = Σj=1..N wji f(avj) + θi,   for i = 1..M      (3.4.7)
and
avj = Σi=1..M wij f(axi) + φj,   for j = 1..N      (3.4.8)
or, in compact matrix notation,
x(k+1)=f (WTv(k)) (3.4.9)
and
v(k+1)=f(Wx(k)). (3.4.10)
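A minimal sketch of discrete BAM recall under these equations (hypothetical bipolar pairs; WT = YUT as in Eq. (3.4.13) below, and the two layers are updated alternately until neither changes, which is one possible schedule):

import numpy as np

def bipolar_update(a, prev):
    """f of Eqs. (3.4.5)-(3.4.6): sign(a), keeping the previous state when a == 0."""
    return np.where(a > 0, 1.0, np.where(a < 0, -1.0, prev))

def bam_recall(W, v0, max_steps=100):
    """Alternate the layer updates of Eqs. (3.4.9)-(3.4.10) until both layers settle.

    W : N x M weight matrix chosen so that W^T = Y U^T (Eq. 3.4.13)
    v0: bipolar input pattern presented to the v layer (length N)
    """
    v = v0.astype(float).copy()
    x = np.ones(W.shape[1])                    # arbitrary initial x-layer state
    for _ in range(max_steps):
        x_new = bipolar_update(W.T @ v, x)     # x(k+1) = f(W^T v(k))
        v_new = bipolar_update(W @ x_new, v)   # v(k+1) = f(W x(k+1))
        if np.array_equal(x_new, x) and np.array_equal(v_new, v):
            break                              # a stable pair (ur, yr) reached
        x, v = x_new, v_new
    return v, x

# Hypothetical bipolar association pairs: columns of U (inputs) and Y (outputs)
U = np.array([[ 1, -1],
              [ 1,  1],
              [-1,  1],
              [-1, -1]], dtype=float)          # uk in {-1,1}^4
Y = np.array([[ 1, -1],
              [-1,  1],
              [ 1,  1]], dtype=float)          # yk in {-1,1}^3
W = U @ Y.T                                    # so that W^T = Y U^T

v_final, x_final = bam_recall(W, U[:, 0])
print(x_final)                                 # recalls y1 = [ 1 -1  1]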
In the discrete BAM, the energy function becomes
E(x, v) = − Σi=1..M Σj=1..N wij f(axi) f(avj) − Σi=1..M θi f(axi) − Σj=1..N φj f(avj)      (3.4.11)
satisfying the condition
∆E ≤ 0 (3.4.12)
which implies the stability of the system.
The weights of the BAM are determined by the equation
WT=YUT (3.4.13)
For the special case of orthonormal input and output patterns we have
f(WTur)= f(YUTur)=f(yr)=yr (3.4.14)
and
f(Wyr)= f(UYTyr)=f(ur)=ur (3.4.15)
indicating that the exemplars are stable states of the network. Whenever the initial state is set to one of the exemplars, the system remains there. For arbitrary initial states the
network converges to one of the stored exemplars, depending on the basin of
attraction in which x(0) lies.
For input patterns that are not orthonormal, the network behaves as explained for the Hopfield network.
Exercise: Compare the special case U=Y of BAM with Hopfield Autoassociative
memory.