UNIT-II
INFORMATION THEORY & SOURCE CODING
 Uncertainty, Information and Entropy
 Source Coding
 Shannon-Fano and Huffman Coding
 Discrete Memoryless Channel
 Probability relations in Channel
 Priori & Posteriori Entropies
 Mutual Information
 Channel Capacity
 Binary Symmetric Channel, Binary Erasure Channel
 Cascaded Channels
 Information Rate
 Shannon-Hartley Theorem – Shannon Bound
UNIT-II
Information Theory and Source Coding
INTRODUCTION
Communication systems are basically meant to transfer information from one location to
another. Information theory is a branch of probability theory which can be applied to the study of
communication systems. The communication of information is statistical in nature, and the main
aim of information theory is to study simple, idealized statistical models of communication.
In 1948 Claude E. Shannon, known as the "Father of Information Theory", published a treatise
on the mathematical theory of communication in which he established basic theoretical bounds for
the performance of communication systems. Shannon's theory is based on probabilistic models
for information sources and communication channels.
Measure of Information:
In order to know and compare the “information content” of various messages produced
by an information source, a measure is necessary to quantitatively know the information content.
Consider the following three statements:
1. Brazil defeats India in football. This statement is almost certain to be true, so its
probability is close to 1 and the outcome is very much predictable. Hence one gets almost no
information from it.
2. Japan defeats India in football. This statement is also likely to be true, but with less
certainty than the earlier case (probability of, say, 0.8 or 0.9, but not 1). This news therefore
gives us some information, even though it may be scanty.
3. India defeats Brazil in football. This is a great piece of information. Its probability is very
small (say 0.01 or 0.02), but if it happens it should be transmitted at once, because
people cannot predict this result. So this news contains a lot of information.
Thus we see that there should be some sort of inverse relationship between the probability of
an event and the amount of information associated with it. The more the probability of an event,
the less is the amount of information associated with it and vice-versa.
 Average Information / Self-Information
The amount of information associated with an event x_i of probability p(x_i) is

    I(x_i) = log_b [1 / p(x_i)] = - log_b p(x_i)

where b is the base of the logarithm. If b = 2 the units are called BITS, if b = 10 the units are
called HARTLEYS or DECITS, and if b = e the units are called NATS.
Properties of Information:
1. Information is non-negative: I(x_i) ≥ 0.
2. I(x_i) = 0 if p(x_i) = 1.
3. I(x_i) > I(x_j) if p(x_i) < p(x_j).
4. For independent events, the overall information is the sum of the individual informations,
   i.e. I(x_i, x_j) = I(x_i) + I(x_j).
Table 2.1 Conversion of Information Units
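To make the unit conversion concrete, here is a small illustrative Python snippet (an added sketch, not part of the original notes) that evaluates the self-information of an event in bits, nats and Hartleys simply by changing the base of the logarithm.

```python
import math

def self_information(p, base=2):
    """Self-information I(x) = log_base(1/p) of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return math.log(1.0 / p, base)

p = 0.25
print(self_information(p, 2))          # 2.0 bits
print(self_information(p, math.e))     # ~1.386 nats
print(self_information(p, 10))         # ~0.602 Hartleys (decits)
```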
Sources:
A source produces signals, and since signals may be either continuous-time or discrete-time
in nature, so are the sources. Continuous-time signals such as speech or video signals are
essentially band-limited: the spectra of speech signals have very little power beyond about 4 kHz,
and the spectra of video signals are mostly confined to about 6 MHz.
These information-bearing signals, being band-limited, can therefore be modeled by the samples of
band-limited random processes, i.e., as discrete random variables. Since the samples may have a
continuum of values, we shall restrict our sources to emit discrete random variables X_i which take
only a discrete set of values, are statistically independent and are identically distributed. Such
sources are generally referred to as Discrete Memoryless Sources (DMS), in the sense that they
emit discrete-time random processes that take only discrete amplitudes, in which all X_i are
generated independently and with the same distribution.
Entropy:
A communication system is not meant to deal with a single message only but with all
possible messages. Messages produced by information sources consist of sequences of symbols
that correspond to the message. From the receiving point of view the entire message is treated as
a single unit, while the communication system has to deal with the individual symbols. Hence, although
the instantaneous information flow corresponding to individual messages from the source may be
erratic, we may describe the source in terms of the average information (statistical average) per
individual message, known as the "entropy" of the source.
The average information per individual message can now be calculated in the following manner.
Let there be M different messages m_1, m_2, ..., m_M with respective probabilities of
occurrence p_1, p_2, ..., p_M. Assume that in a long time interval L messages have been
generated, with L very large so that L >> M. The interval then contains
p_1 L messages of type m_1,
p_2 L messages of type m_2,
...
p_M L messages of type m_M.
The amount of self-information in message m_1 is log(1/p_1),
the amount of self-information in message m_2 is log(1/p_2),
...
the amount of self-information in message m_M is log(1/p_M).
Hence the p_1 L messages of type m_1 contain p_1 L log(1/p_1) bits of information,
the p_2 L messages of type m_2 contain p_2 L log(1/p_2) bits of information,
...
the p_M L messages of type m_M contain p_M L log(1/p_M) bits of information.
The total amount of information in all L messages is therefore

    I_total = p_1 L log(1/p_1) + p_2 L log(1/p_2) + ... + p_M L log(1/p_M)

The average information per message (or) entropy is

    H(M) = I_total / L = p_1 log(1/p_1) + p_2 log(1/p_2) + ... + p_M log(1/p_M)
Entropy of source,

    H(M) = Σ_{k=1}^{M} p_k log_2 (1/p_k) = - Σ_{k=1}^{M} p_k log_2 p_k   bits/message

The quantity H(M) represents the average information per message emitted by a DMS with source
alphabet of size M and is called the "entropy" of the source; it should be read as "entropy of the source M". It
is a measure of uncertainty: the probability distribution that generates maximum uncertainty
has maximum entropy.
For a binary source (M = 2), the entropy is

    H = p_1 log_2 (1/p_1) + p_2 log_2 (1/p_2)

Let p_1 = p; then p_2 = 1 - p_1 = 1 - p. Hence

    H = p log_2 (1/p) + (1 - p) log_2 [1/(1 - p)]

The condition for maximum entropy is found by differentiating this expression with respect
to p and equating it to zero:

    dH/dp = log_2 (1 - p) - log_2 p = 0
    i.e.  log p = log (1 - p)
    i.e.  p = 0.5

At p = 0.5, either a maximum or a minimum is present. If the second derivative of H is positive,
there is a minimum, and if it is negative, there is a maximum:

    d²H/dp² = -(1/ln 2) [ 1/p + 1/(1 - p) ] < 0

Hence H has a maximum at p = 0.5, and the maximum value is

    H_max = H |_{p = 0.5} = 1 bit/message

In general, for a DMS,

    H(M) = Σ_{k=1}^{M} p_k log_2 (1/p_k) = - Σ_{k=1}^{M} p_k log_2 p_k
Fig 2.1 Entropy of DMS
Let us examine H under different cases for M=2
Case I: p1=0.01, p2= 0.99, H= 0.08
Case II: p1= 0.4, p2 = 0.6, H= 0.97
Case III: p1 = 0.5, p2 = 0.5, H = 1.00
In case I, either the message m1 with p1 = 0.01 or the message m2 with p2 = 0.99 will occur,
but most of the time m2 will occur; the uncertainty is therefore small. In case II it is difficult to
guess because the probabilities are nearly equal, so the uncertainty is larger. In case III it is
extremely difficult to guess because the probabilities are equal, so the uncertainty is maximum.
Thus the entropy is small when the uncertainty is small and large when the uncertainty is large,
and we can say that entropy is a measure of uncertainty.
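The three cases above are easy to check numerically. The short Python sketch below (an illustration added here, not part of the original notes) computes H for each probability pair.

```python
import math

def entropy(probs, base=2):
    """H = -sum p*log(p); terms with p = 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# The three M = 2 cases discussed above
for p1, p2 in [(0.01, 0.99), (0.4, 0.6), (0.5, 0.5)]:
    print(f"p1={p1}, p2={p2}: H = {entropy([p1, p2]):.2f} bits/message")
# H is largest (1 bit) for the equiprobable case, in agreement with Fig 2.1.
```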
Properties of Entropy:
1. The entropy of a source is always a non-negative quantity, i.e. H(M) ≥ 0.
Proof:

    H(M) = Σ_{k=1}^{M} p_k log_2 (1/p_k)

Since 0 ≤ p_k ≤ 1, we have log_2 (1/p_k) ≥ 0, and therefore H(M) ≥ 0.
2. The entropy of a sure event and of an impossible event is zero.
3. The maximum value of the entropy of a source is H_max(M) = log_2 M.
Proof: The entropy is maximum when all the messages are equally likely, i.e.
p_1 = p_2 = ... = p_M = 1/M. Then

    H_max(M) = Σ_{k=1}^{M} p_k log_2 (1/p_k) = Σ_{k=1}^{M} (1/M) log_2 M = log_2 M   bits/message

4. The upper and lower bounds of the entropy are therefore

    0 ≤ H(·) ≤ log_2 M

where (·) indicates the source, say X, Y, Z, M, ...
Rate of Information: If a message source generates messages at the rate of r messages per second, the
rate of information R is defined as the average number of bits of information per second:

    R = rH   bits/sec

where
r - rate at which messages are generated per second,
H - average information (or) entropy of the source.
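As a quick numerical illustration of R = rH (it reproduces the numbers of Example 2.1, worked out just below), the following Python sketch may be used; it is an added illustration, not part of the original notes.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
r = 16                              # messages generated per second
H = entropy(probs)                  # bits per message
R = r * H                           # information rate in bits per second
print(H, R)                         # 1.9375 (= 31/16) and 31.0
```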
Example 2.1 An event has six possible outcomes with the probabilities, p1=1/2, p2=1/4, p3=
1/8, p4=1/16, p5=1/32 and p6=1/32. Find the entropy of the system and rate of information if
there are 16 outcomes per second.
Sol: Entropy,

    H = Σ_{k=1}^{6} p_k log_2 (1/p_k)
      = (1/2) log_2 2 + (1/4) log_2 4 + (1/8) log_2 8 + (1/16) log_2 16 + (1/32) log_2 32 + (1/32) log_2 32
      = 31/16 bits/message

Now r = 16 outcomes/sec, so the information rate is

    R = rH = 16 × (31/16) = 31 bits/sec
Extended Entropy
Suppose a DMS X emits symbols from an alphabet of size M. If we group the output into blocks
of n symbols each, we may regard a new source, called the extended source X^n, as
emitting such blocks as its symbols. The alphabet size of the extended source is M^n, since that is
the number of distinct blocks that can be formed from the alphabet of the original source. The
entropy of the extended source X^n is n times the entropy of X:

    H(X^n) = n H(X)
Proof: This is easily shown by representing a message sequence from the output of the order-n
extension of a binary source as (i_1, i_2, ..., i_n), where each i_k can take on one of two states with
probability p_{i_k}. The entropy of the order-n extension of the source is

    H(X^n) = - Σ_{i_1=1}^{2} Σ_{i_2=1}^{2} ... Σ_{i_n=1}^{2} p_{i_1} p_{i_2} ... p_{i_n} log_2 (p_{i_1} p_{i_2} ... p_{i_n})

           = - Σ_{i_1=1}^{2} ... Σ_{i_n=1}^{2} p_{i_1} ... p_{i_n} [ log_2 p_{i_1} + log_2 p_{i_2} + ... + log_2 p_{i_n} ]

           = - ( Σ_{i_2=1}^{2} ... Σ_{i_n=1}^{2} p_{i_2} ... p_{i_n} ) Σ_{i_1=1}^{2} p_{i_1} log_2 p_{i_1}
             - ...
             - ( Σ_{i_1=1}^{2} ... Σ_{i_{n-1}=1}^{2} p_{i_1} ... p_{i_{n-1}} ) Σ_{i_n=1}^{2} p_{i_n} log_2 p_{i_n}

Since all of the terms in parentheses are equal to 1, we have

    H(X^n) = - Σ_{k=1}^{n} Σ_{i_k=1}^{2} p_{i_k} log_2 p_{i_k} = Σ_{k=1}^{n} H(X)

    H(X^n) = n H(X)
Example 2.2 Consider a discrete source {S_0, S_1} with probabilities p(S_0) = 1/4 and p(S_1) = 3/4. Find
the entropy of the source and the entropy of its third-order extension (n = 3).
Sol:

    H(X) = Σ_{i=0}^{1} p_i log_2 (1/p_i) = (1/4) log_2 4 + (3/4) log_2 (4/3) = 0.811 bits/symbol

Extended source (blocks of three symbols):

    X_0 = S_0 S_0 S_0,  p(X_0) = (1/4)(1/4)(1/4) = 1/64
    X_1 = S_0 S_0 S_1,  p(X_1) = (1/4)(1/4)(3/4) = 3/64
    X_2 = S_0 S_1 S_0,  p(X_2) = (1/4)(3/4)(1/4) = 3/64
    X_3 = S_0 S_1 S_1,  p(X_3) = (1/4)(3/4)(3/4) = 9/64
    X_4 = S_1 S_0 S_0,  p(X_4) = (3/4)(1/4)(1/4) = 3/64
    X_5 = S_1 S_0 S_1,  p(X_5) = (3/4)(1/4)(3/4) = 9/64
    X_6 = S_1 S_1 S_0,  p(X_6) = (3/4)(3/4)(1/4) = 9/64
    X_7 = S_1 S_1 S_1,  p(X_7) = (3/4)(3/4)(3/4) = 27/64

    H(X^3) = - Σ_{i=0}^{7} p(X_i) log_2 p(X_i)
           = (1/64) log_2 64 + 3·(3/64) log_2 (64/3) + 3·(9/64) log_2 (64/9) + (27/64) log_2 (64/27)
           = 2.433 bits per extended-source symbol

(or)  H(X^3) = n H(X) = 3 × 0.811 = 2.433 bits per extended-source symbol
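The result H(X^3) = 3 H(X) can be verified by enumerating all blocks of the extended source, as in the illustrative Python sketch below (not part of the original notes). It assumes block probabilities are products of symbol probabilities, which is exactly the DMS assumption.

```python
import math
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def extension_probs(symbol_probs, n):
    """Probabilities of all M**n blocks of the n-th extension of a DMS."""
    return [math.prod(block) for block in product(symbol_probs, repeat=n)]

p = [1/4, 3/4]                      # source of Example 2.2
H1 = entropy(p)
H3 = entropy(extension_probs(p, 3))
print(H1, H3, 3 * H1)               # ~0.811, ~2.433, ~2.433  (H(X^3) = 3 H(X))
```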
Joint Entropy
Let X be a R.V. with m possible values x_1, x_2, ..., x_m and probabilities p(x_1), p(x_2), ..., p(x_m).
Let Y be a R.V. with n possible values y_1, y_2, ..., y_n and probabilities p(y_1), p(y_2), ..., p(y_n),
and let their sample spaces be as shown in Figure 2.2.
Fig 2.2 Finite Discrete Sample Spaces
Since X has m possible values and Y has n possible values, a total of mn outcomes are
possible. Any event x_i of X may occur in conjunction with any event y_j of Y. Hence the
complete set of events in the product space S is

    {XY} = [ x_1 y_1   x_1 y_2   ...   x_1 y_n ]
           [ x_2 y_1   x_2 y_2   ...   x_2 y_n ]
           [   ...       ...             ...   ]
           [ x_m y_1   x_m y_2   ...   x_m y_n ]

Thus we have three sets of complete probability distributions:

    P(X) = {p(x_i)},   P(Y) = {p(y_j)},   P(X,Y) = {p(x_i, y_j)}
With these three complete probability distributions, we can associate three entropies as follows:

    H(X) = - Σ_{i=1}^{m} p(x_i) log_2 p(x_i),   where  p(x_i) = Σ_{j=1}^{n} p(x_i, y_j)

    H(Y) = - Σ_{j=1}^{n} p(y_j) log_2 p(y_j),   where  p(y_j) = Σ_{i=1}^{m} p(x_i, y_j)

The entropies H(X) and H(Y) are called the marginal entropies of X and Y respectively. The joint
entropy of X and Y is given as

    H(X,Y) = - Σ_{i=1}^{m} Σ_{j=1}^{n} p(x_i, y_j) log_2 p(x_i, y_j)
or
    H(Y,X) = - Σ_{j=1}^{n} Σ_{i=1}^{m} p(y_j, x_i) log_2 p(y_j, x_i)

The amount of information associated with the joint event X = x_i and Y = y_j is

    I(X = x_i, Y = y_j) = - log_2 p(x_i, y_j)

Properties:
1. If X and Y are two R.V.s and they are dependent on each other, then H(X,Y) ≠ H(X) + H(Y).
2. If X and Y are two R.V.s and they are independent of each other, then H(X,Y) = H(X) + H(Y).
Conditional Entropies
Let X and Y be two R.V.s. The conditional uncertainty remaining in X given Y and the
conditional uncertainty remaining in Y given X are given by the expressions:

    H(X|Y) = - Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 p(x_i | y_k)

    H(Y|X) = - Σ_{j=1}^{m} Σ_{k=1}^{n} p(x_j, y_k) log_2 p(y_k | x_j)
Proof: The conditional probability p(X|Y) is given by

    p(X|Y) = p(X,Y) / p(Y)

We know that y_k may occur in conjunction with x_1, x_2, ..., x_m. Thus the associated
conditional probability distribution is

    p(X|y_k) = { p(x_1|y_k), p(x_2|y_k), ..., p(x_m|y_k) }
             = { p(x_1,y_k)/p(y_k), p(x_2,y_k)/p(y_k), ..., p(x_m,y_k)/p(y_k) }

We also know that p(x_1,y_k) + p(x_2,y_k) + ... + p(x_m,y_k) = p(y_k). Therefore the sum of the
elements of this set is unity, and the probability distribution is complete. Hence the
entropy associated with it is

    H(X|y_k) = - Σ_{i=1}^{m} [p(x_i,y_k)/p(y_k)] log_2 [p(x_i,y_k)/p(y_k)]
             = - Σ_{i=1}^{m} p(x_i|y_k) log_2 p(x_i|y_k)

Averaging this conditional entropy over all values of y_k gives a measure of the
average conditional entropy:

    H(X|Y) = <H(X|y_k)> = Σ_{k=1}^{n} p(y_k) H(X|y_k)
           = - Σ_{k=1}^{n} Σ_{i=1}^{m} p(y_k) p(x_i|y_k) log_2 p(x_i|y_k)
           = - Σ_{k=1}^{n} Σ_{i=1}^{m} p(x_i, y_k) log_2 p(x_i|y_k)

Similarly, averaging over the x_j gives H(Y|X).
H(X|Y) is a measure of our average uncertainty about the transmitted symbol after we have
received a symbol; it is sometimes called the equivocation. The function H(Y|X) is also called
an equivocation (it gives a measure of error, or noise) and is the average uncertainty of the received
symbol given that X was transmitted:

    H(X|Y) = - Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 p(x_i | y_k)

    H(Y|X) = - Σ_{j=1}^{m} Σ_{k=1}^{n} p(x_j, y_k) log_2 p(y_k | x_j)
Thus, in all there are five entropies associated with a two-dimensional probability
distribution. They are H(X), H(Y), H(X, Y), H(X/Y) and H(Y/X). Let X represent a transmitter
and Y be a receiver, then the five entropies can be interpreted as follows:
H(X): Average information per character at the transmitter, or entropy of the transmitter.
H(Y): Average information per character at the receiver, or entropy of the receiver.
H(X, Y): Average information per pair of the transmitted and received characters, or average
uncertainty of the communication system as a whole.
H(X/Y): A measure of information about the transmitter, where it is known that Y is received.
H(Y/X): A measure of information about the receiver, where it is known that X is transmitted.
Chain Rule:  H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
Proof:

    H(X,Y) = - Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 p(x_i, y_k)
           = - Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [ p(x_i|y_k) p(y_k) ]
           = - Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) [ log_2 p(x_i|y_k) + log_2 p(y_k) ]
           = - Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 p(x_i|y_k) - Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 p(y_k)
           = H(X|Y) - Σ_{k=1}^{n} [ Σ_{i=1}^{m} p(x_i, y_k) ] log_2 p(y_k)
           = H(X|Y) - Σ_{k=1}^{n} p(y_k) log_2 p(y_k)

Hence,

    H(X,Y) = H(X|Y) + H(Y)

Similarly we can show

    H(X,Y) = H(Y|X) + H(X)
Example 2.3 A discrete source transmits messages x_1, x_2, x_3 with
probabilities 0.3, 0.4 and 0.3. The source is connected to the channel
as shown in the figure. Calculate all the entropies.
Sol: From the given channel, the conditional probability matrix P(Y|X) is

               y_1    y_2    y_3
    P(Y|X) = x_1 [ 0.8    0.2    0   ]
             x_2 [ 0      1      0   ]
             x_3 [ 0      0.3    0.7 ]

Also given,

    P(X) = [ 0.3  0.4  0.3 ]

The joint probability matrix P(X,Y) is obtained by multiplying the rows of P(Y|X) by
p(x_1), p(x_2), p(x_3), i.e. by 0.3, 0.4 and 0.3 respectively, giving

    P(X,Y) = [ 0.8×0.3   0.2×0.3   0       ]   [ 0.24   0.06   0    ]
             [ 0         1×0.4     0       ] = [ 0      0.40   0    ]
             [ 0         0.3×0.3   0.7×0.3 ]   [ 0      0.09   0.21 ]

The probabilities p(y_1), p(y_2), p(y_3) are obtained by adding the columns of P(X,Y), giving

    p(y_1) = 0.24,   p(y_2) = 0.06 + 0.4 + 0.09 = 0.55,   p(y_3) = 0.21

The conditional probability matrix P(X|Y) is obtained by dividing the columns of P(X,Y) by
p(y_1), p(y_2), p(y_3) respectively, giving

               y_1    y_2      y_3
    P(X|Y) = x_1 [ 1      0.109    0 ]
             x_2 [ 0      0.727    0 ]
             x_3 [ 0      0.164    1 ]

The entropies are calculated as follows:
(i)  H(X) = - Σ_{j=1}^{3} p(x_j) log_2 p(x_j)
          = -(0.3 log_2 0.3 + 0.4 log_2 0.4 + 0.3 log_2 0.3) = 1.571 bits/message
(ii) H(Y) = - Σ_{k=1}^{3} p(y_k) log_2 p(y_k)
          = -(0.24 log_2 0.24 + 0.55 log_2 0.55 + 0.21 log_2 0.21) = 1.441 bits/message
(iii) H(X,Y) = - Σ_{j=1}^{3} Σ_{k=1}^{3} p(x_j, y_k) log_2 p(x_j, y_k)
          = -(0.24 log_2 0.24 + 0.06 log_2 0.06 + 0.4 log_2 0.4 + 0.09 log_2 0.09 + 0.21 log_2 0.21)
          = 2.053 bits/message
(iv) H(X|Y) = - Σ_{j=1}^{3} Σ_{k=1}^{3} p(x_j, y_k) log_2 p(x_j | y_k)
          = -(0.24 log_2 1 + 0.06 log_2 0.109 + 0.4 log_2 0.727 + 0.09 log_2 0.164 + 0.21 log_2 1)
          = 0.612 bits/message
(v) H(Y|X) = - Σ_{j=1}^{3} Σ_{k=1}^{3} p(x_j, y_k) log_2 p(y_k | x_j)
          = -(0.24 log_2 0.8 + 0.06 log_2 0.2 + 0.4 log_2 1 + 0.09 log_2 0.3 + 0.21 log_2 0.7)
          = 0.482 bits/message
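All five entropies of Example 2.3 follow mechanically from P(X) and P(Y|X). The NumPy sketch below is an added illustration (array names such as PYX are ours, not from the notes); it reproduces the numbers above, using the chain rule H(X,Y) = H(Y) + H(X|Y) to obtain the conditional entropies.

```python
import numpy as np

def H(p):
    """Entropy of a probability array, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Px  = np.array([0.3, 0.4, 0.3])                       # source probabilities
PYX = np.array([[0.8, 0.2, 0.0],                      # channel matrix P(Y|X)
                [0.0, 1.0, 0.0],
                [0.0, 0.3, 0.7]])

PXY = Px[:, None] * PYX          # joint matrix p(x_i, y_j)
Py  = PXY.sum(axis=0)            # output probabilities

HX, HY, HXY = H(Px), H(Py), H(PXY)
HX_given_Y = HXY - HY            # chain rule: H(X,Y) = H(Y) + H(X|Y)
HY_given_X = HXY - HX
print(HX, HY, HXY, HX_given_Y, HY_given_X)   # ~1.571, 1.441, 2.053, 0.612, 0.482
```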
SOURCE CODING
A discrete source produces symbols which are to be represented in such a way that it would
be possible to transmit them over a given channel. The source output, which is in the form of a
sequence of source symbols, must be converted into a sequence of what are called the ‘code
elements’. This process is called ‘encoding’ and the device, or system which performs this
encoding is called an encoder. The encoder assigns a unique sequence of code elements, called a
‘codeword’, for representing each source symbol.
Fig 2.2 Source Coding Encoding Process
The objective of source coding is to remove or reduce the redundancy in the source
output so as to give an efficient representation of the message information produced by the source,
using as small a number of bits as possible.
Classification of Codes:
Fig 2.3 Classification of Codes
Block Codes: A block code is a code that maps each of the symbols of the source onto a fixed
sequence of bits. These fixed sequences of bits are called codewords. The codewords defining a
block code may or may not have equal number of bits.
Non-Singular Code (Distinct Codes): A block code is said to be non-singular if and only if all
the codewords are distinct and easily distinguishable from one another else are called singular
(non-distinct) codewords.
Fixed-Length Codes: This is the simplest method: each symbol of a discrete source is encoded into a
block of bits, where every block consists of the same number of bits, say m bits.
Variable-Length Codes: The variable-length code is one whose codeword length is not the same
for all source symbols i.e length of the codewords are different for each and every message
symbols.
Prefix-Free (Instantaneous) Codes: Any sequence consisting of the initial part of the codeword
is called a prefix of the codeword. In a prefix-free code, no codeword is a prefix of another
codeword. The decoding of a prefix-free code can be accomplished as soon as the codeword
representing a source symbol is fully received. Prefix-free codes are also called
instantaneous codes. (or) A uniquely decodable code is said to be “instantaneous” if it is
possible to recognize the end of any code word in any received sequence without reference to the
succeeding symbols.
Uniquely Decodable Codes: A distinct code must be uniquely decodable, in that for each sequence
of source symbols, there is a corresponding codeword that is different from a codeword
corresponding to any other sequence of source symbols. A sufficient condition, but not a necessary
one, for a code to be uniquely decodable is to be prefix-free. In other words, all
prefix-free codes are uniquely decodable, but the converse is not true.
The two basic requirements to be met by any source encoder are:
1. Minimum average length of a codeword for a given set of source alphabet {X} and the
source symbol probability set {p(xi)}.
2. Unique decipherability of the encoded sequence.
Codes having above two properties are called ‘Optimal Codes’.
Optimal Codes: An instantaneous code is said to be optimal code if it has minimum average length
L for a source with a given probability assignment for the source symbols.
The classification of codes is best illustrated by an example, as shown in the table below, where a source
of size 4 has been encoded with the binary symbols 0 and 1.
Table 2.2 Illustration of classification of codes

    x_i    Code1   Code2   Code3   Code4   Code5   Code6
    x_1    00      00      0       0       0       1
    x_2    01      01      1       10      01      01
    x_3    00      10      00      110     011     001
    x_4    11      11      11      111     0111    0001
Code-1 and Code-2 are fixed-length codes with length 2; all the other codes are variable-length
codes. All codes except Code-1 are distinct (non-singular) codes. Codes 2, 4 and 6 are prefix-free
codes. Code-3 of Table 2.2 is not uniquely decodable: consider the binary data 1001, which
can be decoded as x_2 x_3 x_2 or as x_2 x_1 x_1 x_2. Prefix-free codes are uniquely decodable,
but the prefix-free condition is not necessary: Code-5 does not satisfy the prefix-free condition, yet
it is still uniquely decodable.
Average Codeword Length:
Let X be a DMS with finite entropy H(X) and alphabet {x_0, x_1, ..., x_{M-1}}, with corresponding
probabilities of occurrence P(x_i), i = 0, 1, ..., M-1. Let the binary codeword assigned to symbol
x_i by the encoder have length n_i, measured in bits. The average codeword length L per source
symbol is given by

    L = Σ_{i=0}^{M-1} P(x_i) n_i   bits/message

For fast transmission of any message from the source to the destination (sink) it is necessary that the
average length of the code be as small as possible.
Code Efficiency: It is defined as the ratio of the minimum possible value of the average codeword length
to the actual average codeword length of the symbols used in the source encoding process:

    η = L_min / L

Let H(X) be the entropy of the source in bits/message, and let log_2 D be the maximum average
information that each code letter can carry, where D is the size of the encoder alphabet. Then

    L_min = H(X) / log_2 D   letters/message

Hence the coding efficiency is

    % η = [ H(X) / (L log_2 D) ] × 100

or, for a binary encoder alphabet (D = 2),

    η = H(X) / L

Code Redundancy: The redundancy of the code is

    γ = 1 - η

Kraft's Inequality: A necessary and sufficient condition for the existence of an instantaneous
code is

    Σ_{i=0}^{M-1} D^(-n_i) ≤ 1

where D is the size of the encoder alphabet. For a binary alphabet, D = 2, and Kraft's inequality becomes

    Σ_{i=0}^{M-1} 2^(-n_i) ≤ 1

The inequality only assures the existence of an instantaneously decodable code with codeword lengths
that satisfy it. It does not show how to obtain the codewords, nor does it say that any code satisfying
the inequality is automatically uniquely decodable.
SOURCE CODING FOR DISCRETE MEMORYLESS SOURCES
The objective of source coding is to remove redundancy and make the data more compact; the
ultimate limit for data compaction without loss of any information is the entropy of the source.
A few prefix-free coding schemes for discrete memoryless sources, called compact coding schemes,
are described below. They are suboptimum in the sense that they give, on the average, longer
codewords than the optimum value, which is H(X), the entropy of the source. These codes are
uniquely decipherable, but not necessarily optimal.
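A small illustrative sketch of these definitions (not part of the original notes): it checks Kraft's inequality for a set of codeword lengths and evaluates the average length and the binary coding efficiency. The numbers anticipate Example 2.4 below.

```python
import math

def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality, sum of D**(-n_i)."""
    return sum(D ** (-n) for n in lengths)

def avg_length(probs, lengths):
    return sum(p * n for p, n in zip(probs, lengths))

def efficiency(probs, lengths, D=2):
    """eta = H(X) / (L * log2 D); for binary codes this is H(X) / L."""
    H = -sum(p * math.log2(p) for p in probs if p > 0)
    return H / (avg_length(probs, lengths) * math.log2(D))

probs, lengths = [0.5, 0.25, 0.125, 0.125], [1, 2, 3, 3]
print(kraft_sum(lengths))              # 1.0  -> an instantaneous code can exist
print(avg_length(probs, lengths))      # 1.75 bits/message
print(efficiency(probs, lengths))      # 1.0  (redundancy gamma = 1 - eta = 0)
```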
Shannon-Fano Coding:
Steps:
1. Write down the message or source symbols in the order of decreasing probabilities.
2. Draw a line after say the k-th symbol such that the total probability of the symbols above
the line and below the line are approximately equal, i.e., divide the source symbols into
two groups of almost equal probability.
3. Assign to each symbol above the line a binary value ‘0’ and to each symbol below the
line a binary value ‘1’.
4. Repeat steps 2 and 3 until each subgroup contains only one symbol.
5. When that stage is reached, the coding is complete.
Example 2.4: Consider a source that emits independent symbols A, B, C, D with probabilities of
occurrence P(A) = 0.5, P(B) = 0.25, P(C) = 0.125 and P(D) = 0.125.
Table 2.3 Shannon-Fano coding procedure

    Symbol   Probability   I   II   III   Codeword   Length (n_i)
    A        0.5           0   -    -     0          1
    B        0.25          1   0    -     10         2
    C        0.125         1   1    0     110        3
    D        0.125         1   1    1     111        3

(i) Average length,

    L = Σ_{i=0}^{M-1} P(x_i) n_i = 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 1.750 bits/symbol

(ii) Entropy,

    H(S) = - Σ_{i=0}^{M-1} P(x_i) log_2 P(x_i)
         = -(0.5 log_2 0.5 + 0.25 log_2 0.25 + 0.125 log_2 0.125 + 0.125 log_2 0.125)
         = 1.750 bits/symbol

(iii) Efficiency,

    % η = [H(S)/L] × 100 = (1.750/1.750) × 100 = 100 %

(iv) Redundancy,  γ = 1 - η = 1 - 1 = 0  (no redundancy left).
The Shannon-Fano method is sometimes ambiguous, because more than one equally valid
partitioning scheme may be available. Moreover, as M increases the method becomes less
suitable, since forming groups of approximately equal probability becomes difficult and leaves
little choice.
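A possible implementation sketch of the Shannon-Fano partitioning procedure is given below (an added illustration, not part of the original notes). The rule used for choosing the split point, the one that makes the two group probabilities as equal as possible, is one reasonable reading of step 2; other tie-breaking choices can produce different, equally valid codes.

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability). Returns dict symbol -> codeword."""
    symbols = sorted(symbols, key=lambda sp: sp[1], reverse=True)
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(p for _, p in group)
        running, best_k, best_diff = 0.0, 1, float("inf")
        for k in range(1, len(group)):            # find the most balanced split
            running += group[k - 1][1]
            diff = abs(2 * running - total)       # |upper total - lower total|
            if diff < best_diff:
                best_diff, best_k = diff, k
        upper, lower = group[:best_k], group[best_k:]
        for s, _ in upper:
            codes[s] += "0"                       # '0' above the line
        for s, _ in lower:
            codes[s] += "1"                       # '1' below the line
        split(upper)
        split(lower)

    split(symbols)
    return codes

print(shannon_fano([("A", 0.5), ("B", 0.25), ("C", 0.125), ("D", 0.125)]))
# {'A': '0', 'B': '10', 'C': '110', 'D': '111'}  -- matches Table 2.3
```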
Huffman Coding
Huffman code leads to the lowest possible value of L for a given M, resulting in
maximum efficiency (or) minimum redundancy. Hence, it is also known as minimum redundancy
code (or) optimum code. Huffman code is a prefix-free code, which can thus be decoded
instantaneously and uniquely. The average length of a Huffman code is the same as the entropy
(i.e., maximum efficiency), if the probability of every symbol produced by the source is an integer
power of 1/2. Modified versions of Huffman coding are employed in fax machines and scanners
as well as in the assignments of area and country codes for long-distance telephone calls in most
parts of the world.
Steps:
1. Write down the message or source probabilities in decreasing order.
2. Assign a binary value 0 and binary value 1 to the last two symbols of lowest probability.
This forms stage-1.
3. Combine the last two symbols into one new symbol with probability equal to sum of
probabilities of the two original symbols. List the probabilities of original symbols except
the last two and the new symbol in decreasing order. This forms stage-II.
4. The process of step 3 is repeated until we are left with only two symbols, to which 0 and 1 are
assigned. This forms the last stage.
5. The codeword for each original source symbol is then obtained by tracing the sequence of 0s and
1s backwards from the last stage to the original source symbol.
Example 2.5: Apply the Huffman coding procedure to the following message ensemble:

    X    = [ x_1    x_2    x_3    x_4    x_5    x_6    x_7  ]
    p(X) = [ 0.40   0.20   0.12   0.08   0.08   0.08   0.04 ]

Combining the two least probable symbols at each stage gives the successive reduced ensembles

    Stage 0:   0.40  0.20  0.12  0.08  0.08  0.08  0.04
    Stage I:   0.40  0.20  0.12  0.12  0.08  0.08
    Stage II:  0.40  0.20  0.16  0.12  0.12
    Stage III: 0.40  0.24  0.20  0.16
    Stage IV:  0.40  0.36  0.24
    Stage V:   0.60  0.40

and, tracing back the 0/1 assignments, the codewords:

    Message   Codeword   Length (n_i)
    x_1       1          1
    x_2       000        3
    x_3       011        3
    x_4       0010       4
    x_5       0011       4
    x_6       0100       4
    x_7       0101       4

(i) Average length,

    L = Σ_{i=0}^{M-1} P(x_i) n_i
      = 0.4×1 + 0.2×3 + 0.12×3 + 0.08×4 + 0.08×4 + 0.08×4 + 0.04×4
      = 2.48 bits/message

(ii) Entropy,

    H(X) = - Σ_{i=0}^{M-1} P(x_i) log_2 P(x_i)
         = -(0.4 log_2 0.4 + 0.2 log_2 0.2 + 0.12 log_2 0.12 + 3×0.08 log_2 0.08 + 0.04 log_2 0.04)
         = 2.42 bits/message

(iii) Efficiency,

    % η = [H(X)/L] × 100 = (2.42/2.48) × 100 = 97.6 %

(iv) Redundancy,  γ = 1 - η = 1 - 0.976 = 0.024 = 2.4 %
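For comparison, a standard heap-based binary Huffman construction is sketched below (an illustration added to these notes). Tie-breaking may produce codewords different from the hand-built table above, but the average codeword length is necessarily the same optimal value.

```python
import heapq
import itertools

def huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> binary codeword."""
    counter = itertools.count()                 # tie-breaker for equal probabilities
    heap = [(p, next(counter), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)         # two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

probs = {"x1": 0.4, "x2": 0.2, "x3": 0.12, "x4": 0.08,
         "x5": 0.08, "x6": 0.08, "x7": 0.04}
codes = huffman(probs)
L = sum(probs[s] * len(codes[s]) for s in probs)
print(codes)
print(L)          # 2.48 bits/message, as computed above
```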
Example 2.6 Compute the Huffman source code for the message symbols below in two ways,
compute the variance of the codeword length for each, and comment on the result.

    S      = [ S_0   S_1   S_2   S_3   S_4 ]
    P(S_i) = [ 0.4   0.2   0.2   0.1   0.1 ]

Method-1: each combined probability is placed as high as possible among the already existing
probabilities of the same value.

    Message   Codeword
    S_0       00
    S_1       10
    S_2       11
    S_3       010
    S_4       011

Method-2: each combined probability is placed as low as possible among the already existing
probabilities of the same value.

    Message   Codeword
    S_0       1
    S_1       01
    S_2       000
    S_3       0010
    S_4       0011

Table 2.4 Comparison of Methods

    Message   Probability   Method-1             Method-2
    Symbol    P(S_i)        Codeword   Length    Codeword   Length
    S_0       0.4           00         2         1          1
    S_1       0.2           10         2         01         2
    S_2       0.2           11         2         000        3
    S_3       0.1           010        3         0010       4
    S_4       0.1           011        3         0011       4

Calculation of the average lengths:

    L_1 = 0.4×2 + 0.2×2 + 0.2×2 + 0.1×3 + 0.1×3 = 2.2 bits/message-symbol
    L_2 = 0.4×1 + 0.2×2 + 0.2×3 + 0.1×4 + 0.1×4 = 2.2 bits/message-symbol
Even though both methods yield the same average length, they differ in the variance of the
codeword length.
Variance of the codeword length:

    σ² = Σ_{k=1}^{N} p_k (n_k - L)²

where p_1, p_2, ..., p_N are the source probabilities, n_1, n_2, ..., n_N are the lengths of the codewords
and L is the average codeword length.
Variance of Method-1:

    σ_1² = 0.4(2-2.2)² + 0.2(2-2.2)² + 0.2(2-2.2)² + 0.1(3-2.2)² + 0.1(3-2.2)² = 0.160

Variance of Method-2:

    σ_2² = 0.4(1-2.2)² + 0.2(2-2.2)² + 0.2(3-2.2)² + 0.1(4-2.2)² + 0.1(4-2.2)² = 1.360

From the above results it is seen that σ_1² < σ_2².
Comment: When the combined probability of the two lowest symbols is placed as high as
possible, the resulting Huffman code has a significantly smaller variance of codeword length
than when it is placed as low as possible. So, when the combined probability of two
symbols becomes equal to the probability of a symbol already in the list, it should be placed as high
as possible.
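The two variances can be checked with the short sketch below (illustrative, not part of the original notes); it confirms that both codes have the same average length but very different variances.

```python
def avg_length(probs, lengths):
    return sum(p * n for p, n in zip(probs, lengths))

def variance(probs, lengths):
    L = avg_length(probs, lengths)
    return sum(p * (n - L) ** 2 for p, n in zip(probs, lengths))

probs   = [0.4, 0.2, 0.2, 0.1, 0.1]
method1 = [2, 2, 2, 3, 3]       # codewords 00, 10, 11, 010, 011
method2 = [1, 2, 3, 4, 4]       # codewords 1, 01, 000, 0010, 0011
for lengths in (method1, method2):
    print(avg_length(probs, lengths), variance(probs, lengths))
# both give L = 2.2 bits, but the variances differ: 0.16 versus 1.36
```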
Huffman Code can be applied to an M-ary source as well. The algorithm is,
(a) Rearrange the symbols in the order of decreasing probability.
(b) The last M symbols are combined into one symbol.
(c) Steps a and b are repeated till the set reduces to M.
(d) Each of these reduced symbols is now assigned one of the numbers 0, 1, ..., M-1 as
the first digit of its codeword.
(e) Now retrace and assign the numbers 0,1,……..,M-1 to the second digit for the M symbols
that were combined in the previous step.
(f) This is repeated till the original symbol set is reached.
For an M-ary code, it should be ensured that the number of terms combined in the last
reduction is equal to M. This happens when the total number of source symbols equals M + k(M-1),
where k is a non-negative integer, because each reduction reduces the number of symbols by M-1. When the source
alphabet size is not of the form M + k(M-1), one must add the requisite number of dummy symbols
with zero probability.
Although Huffman code is an optimal code in the sense that it gives minimum average codeword
length for a discrete memoryless source with a given source statistics, it suffers from the following
two disadvantages in practical applications.
1. It depends heavily on the source statistics; a priori knowledge of the probabilities of
occurrence of the source symbols is a must.
2. As most of the sources that we come across in practice are not memoryless, (i.e., the
probability of occurrence of a symbol is not independent of which symbols have preceded
it) and since Huffman coding takes into account only the individual symbol probabilities,
use of this coding in practical applications does not yield good compression.
Mutual Information (Trans-Information / Transferred Information)
None of the entropies discussed so far quantifies the information lost in the channel, which is
needed to characterize the complete communication system. To incorporate this missing part, the
concept of mutual information is needed.
Consider a source modeled as a random variable X that can take any one of the M values x_1, x_2, ..., x_m.
Let the output of the channel be a random variable Y that can take any one of the N possible values
y_1, y_2, ..., y_n. If the channel is ideal, then N = M and x_i → y_k with i = k. Suppose the source symbol x_i
is transmitted and the channel output is y_k. Then two probabilities are of concern:
Priori Probability: The probability p(x_i) of x_i being transmitted relates to our 'state of
knowledge' at the destination, before x_i is transmitted, regarding which symbol from the source
alphabet will be transmitted. In other words, prior to the reception of a message, the state of
knowledge at the receiver about a transmitted signal x_i is the probability that x_i would be selected
for transmission by the transmitter. This is the a priori probability, p(x_i).
Posteriori Probability: If x_i is transmitted and y_k is received at the destination, our 'state
of knowledge' at the destination regarding which symbol from the source alphabet has been
transmitted is represented by the probability p(x_i | y_k). In other words, after the reception of the
symbol y_k at the receiver, the state of knowledge about the same transmitted signal x_i is the
conditional probability p(x_i | y_k); this is the a posteriori probability.
Under ideal conditions, p(x_i | y_k) = 1 for one particular x_i and 0 for the rest.
Unfortunately, owing to the presence of noise and other channel imperfections, p(x_i | y_k)
will not be one for one x_i and zero for the rest. Instead, it will be large (but not equal to one) for
some x_i and small but not exactly zero for the rest. Because of this, the uncertainty at the
destination regarding which x_i was actually transmitted will not be completely removed even after
the channel output y_k is observed. The uncertainty persisting after receiving y_k will, however, be
less than it was before y_k was received, indicating that some amount of information has
been transferred from the source to the destination.
The uncertainty about x_i before y_k is received is -log_2 p(x_i), and the uncertainty after
receiving y_k is -log_2 p(x_i | y_k).
The information gained about the symbol x_i by the reception of y_k is the net reduction in
uncertainty and is known as the mutual information I(x_i, y_k).
    I(x_i, y_k) = initial uncertainty - final uncertainty
                = -log_2 p(x_i) - [ -log_2 p(x_i | y_k) ]
                = log_2 [ p(x_i | y_k) / p(x_i) ] = log_2 [ p(x_i, y_k) / (p(x_i) p(y_k)) ]
                = log_2 [ p(y_k | x_i) / p(y_k) ] = I(y_k, x_i)

I(x_i, y_k) represents the difference in the amount of information at the destination
after and before the reception of the symbol y_k consequent to the transmission of
the symbol x_i, and it is called the 'mutual information' of the channel between the
transmitted symbol x_i and the received symbol y_k.
The average of the mutual information over all pairs of transmitted and received
symbols is the (average) mutual information of the channel:

    I(X, Y) = <I(x_i, y_k)> = Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) I(x_i, y_k)
            = Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [ p(x_i | y_k) / p(x_i) ]
            = Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) [ log_2 p(x_i | y_k) - log_2 p(x_i) ]
            = - Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 p(x_i) + Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 p(x_i | y_k)
            = - Σ_{i=1}^{m} p(x_i) log_2 p(x_i) - H(X|Y)
            = H(X) - H(X|Y)

Avg. Mutual Information (or) Trans-information:

    I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

i.e. when an average information H(X) (or H(Y)) is transmitted over the channel, an average amount
of information equal to H(X|Y) (or H(Y|X)) is lost in the channel due to symbol transitions
caused by noise. The balance amount of information received at the receiver with respect
to an observed output symbol is the mutual information.
Conclusions:
1. I(X, Y) is the average mutual information and indicates a measure of the information transferred
through the channel. It is also known as the transferred information or transinformation of
the channel.
2. The equation

    I(X, Y) = H(X) - H(X|Y)

states that the transferred information is equal to the average source information minus the
average uncertainty that still remains about the messages. In other words, H(X|Y) is the average
additional information needed at the receiver after reception in order to completely specify the
message sent. Thus H(X|Y) gives the information lost in the channel; this is also known as equivocation.
3. The equation

    I(X, Y) = H(Y) - H(Y|X)

states that the transferred information is equal to the receiver information minus that part of the
receiver entropy which is not information about the source. Thus H(Y|X) is a measure of the noise
or error due to the channel.
Properties of Mutual Information:
1. The mutual information of a channel is symmetric i.e I(X;Y)=I(Y;X)
Proof: we know that,
H(X,Y) = H(X) + H(Y/X)=H(Y) + H(X/Y)
Therefore, H(X) +H(Y / X) = H(Y) + H(X / Y)
H(X) – H(X / Y) = H(Y) – H (Y / X)
Hence, I(X;Y) = I(Y;X)
2. The mutual information is non-negative, i.e. I(X;Y) ≥ 0.
Proof: We know that

    I(X, Y) = Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [ p(x_i | y_k) / p(x_i) ]

so that

    -I(X, Y) = Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [ p(x_i) / p(x_i | y_k) ]
             = Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [ p(x_i) p(y_k) / p(x_i, y_k) ]
             = (1/log_e 2) Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_e [ p(x_i) p(y_k) / p(x_i, y_k) ]

But we know that log_e x ≤ (x - 1), so

    -I(X, Y) ≤ (1/log_e 2) Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) [ p(x_i) p(y_k) / p(x_i, y_k) - 1 ]
             = (1/log_e 2) [ Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i) p(y_k) - Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) ]

But Σ_i Σ_k p(x_i) p(y_k) = Σ_i Σ_k p(x_i, y_k) = 1, so

    -I(X, Y) ≤ 0,   and hence   I(X, Y) ≥ 0

i.e. even on a noisy channel, by observing the output of the channel we cannot, on the average,
lose any information. At most the mutual information may be zero, i.e. we do not gain any
information by observing the output; this happens when the input and output symbols of the
channel are statistically independent.
3. The mutual information I(X;Y) of a channel is related to the marginal entropies H(X),
H(Y) and the joint entropy H(X,Y) by I(X;Y) = H(X) + H(Y) - H(X,Y).
Proof: We know that

    H(X,Y) = Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [ 1 / p(x_i, y_k) ]
           = Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [ p(x_i) p(y_k) / p(x_i, y_k) ]
             + Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [ 1 / (p(x_i) p(y_k)) ]

The first term is

    Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [ p(x_i) p(y_k) / p(x_i, y_k) ] = -I(X;Y)

and the second term splits as

    Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [1/p(x_i)] + Σ_{i=1}^{m} Σ_{k=1}^{n} p(x_i, y_k) log_2 [1/p(y_k)]
        = Σ_{i=1}^{m} p(x_i) log_2 [1/p(x_i)] + Σ_{k=1}^{n} p(y_k) log_2 [1/p(y_k)] = H(X) + H(Y)

Therefore H(X,Y) = -I(X;Y) + H(X) + H(Y), i.e.

    I(X;Y) = H(X) + H(Y) - H(X,Y)

Fig 2.4 Relation between Entropies and Mutual Information
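Property 3 gives a convenient way to compute I(X;Y) directly from a joint probability matrix. The NumPy sketch below is an added illustration (the matrix is the one from Example 2.3); it evaluates I(X;Y) = H(X) + H(Y) - H(X,Y).

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(PXY):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint probability matrix PXY."""
    PX = PXY.sum(axis=1)
    PY = PXY.sum(axis=0)
    return entropy(PX) + entropy(PY) - entropy(PXY)

# Joint matrix of Example 2.3 (rows x_i, columns y_j)
PXY = np.array([[0.24, 0.06, 0.00],
                [0.00, 0.40, 0.00],
                [0.00, 0.09, 0.21]])
print(mutual_information(PXY))    # ~0.96 bits: H(X) - H(X|Y) = 1.571 - 0.612
```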
Discrete Memoryless Channels (DMC)
A communication channel is defined as the path or medium through which symbols flow to
the receiver. A DMC is a channel whose input is a discrete random variable X that may take any one
of the M possible values x_i, i = 0, 1, ..., (M-1), and whose output is a discrete random variable Y that
may take any one of the L possible values y_j, j = 0, 1, 2, ..., (L-1). It is called a
discrete memoryless channel because the input and output are discrete random variables with a
finite alphabet each, and it is memoryless because its present output symbol depends
only on the present input symbol and not on any of the previous input symbols.
The transition probabilities, ML in number, are generally given by a matrix called the "channel
matrix" of size M × L; M and L need not be equal.
Channel Matrix:

    P(Y|X) = [ p(y_0|x_0)       p(y_1|x_0)       ...   p(y_{L-1}|x_0)     ]
             [ p(y_0|x_1)       p(y_1|x_1)       ...   p(y_{L-1}|x_1)     ]
             [    ...              ...                     ...            ]
             [ p(y_0|x_{M-1})   p(y_1|x_{M-1})   ...   p(y_{L-1}|x_{M-1}) ]

In the matrix of transition probabilities, P(y_j | x_i) represents the probability of the
random variable Y taking the value y_j given that the random variable X has taken the value x_i, i.e.,
the probability of receiving symbol y_j when the symbol x_i is transmitted (because of noise in the
channel):

    P(y_j | x_i) = P(Y = y_j | X = x_i),   where i = 0, 1, 2, ..., (M-1) and j = 0, 1, 2, ..., (L-1)

If channel coding is done then M < L, but if two input symbols lead to the same output symbol
then M > L. If the channel were ideal, M = L and y_j = x_j for every j.
Properties of the Transition Matrix / Channel Matrix:
1. Each row of the transition matrix corresponds to a particular fixed input symbol of the
channel.
2. Each column of the transition matrix corresponds to a certain fixed channel output symbol.
3. The sum of the probabilities along any particular row of the transition matrix is equal to
one, i.e.

    Σ_{j=0}^{L-1} P(y_j | x_i) = 1   for every i

4. If the input symbol probabilities p(x_i) are known, then the joint distribution is
determined as p(x_i, y_j) = P(y_j | x_i) p(x_i), and the marginal distribution p(y_j) is

    p(y_j) = Σ_{i=0}^{M-1} P(y_j | x_i) p(x_i),   j = 0, 1, 2, ..., (L-1)

5. If the input probabilities are written as a row matrix, P(Y) = P(X) P(Y|X).
6. If the input probabilities are written as a diagonal matrix, P(X,Y) = P(X)_d P(Y|X).
Types of Channels:
(1) Symmetric/Uniform Channel
A channel is said to be symmetric (or uniform) if the second and subsequent
rows of the channel matrix contain the same elements as the first row, but in a different
order. For example,

               y_1    y_2    y_3
    P(Y|X) = x_1 [ 1/2    1/3    1/6 ]
             x_2 [ 1/3    1/6    1/2 ]
             x_3 [ 1/6    1/2    1/3 ]
(2) Lossless Channel
A channel represented by a channel matrix with one and only one non-zero
element in every column is defined as a lossless channel.
A lossless channel has two properties:
(i) Each column contains only one non-zero element, the other elements in that column being zeros.
(ii) The sum of all elements in any row is equal to unity.

               y_1    y_2    y_3    y_4    y_5
    P(Y|X) = x_1 [ 3/4    1/4    0      0      0 ]
             x_2 [ 0      0      1/3    2/3    0 ]
             x_3 [ 0      0      0      0      1 ]
(3) Deterministic Channel
A channel represented by a channel matrix with one and only one non-zero
element in every row is defined as a deterministic channel.
A deterministic channel has two properties:
(i) Each row contains only one non-zero element, the other elements in that row being zeros.
(ii) The sum of all elements in any row is equal to unity (so the single non-zero element is 1).

               y_1    y_2    y_3
    P(Y|X) = x_1 [ 1      0      0 ]
             x_2 [ 1      0      0 ]
             x_3 [ 0      1      0 ]
             x_4 [ 0      1      0 ]
             x_5 [ 0      0      1 ]

At the transmitter stage itself one can determine which symbol will be received at the
receiver, and hence the channel is called a deterministic channel.
(4) Binary Symmetric Channel (BSC)
A BSC has two inputs (x_0 = 0 and x_1 = 1) and two outputs (y_0 = 0 and y_1 = 1).
The channel is symmetric because the probability of receiving 1 when 0 is sent is the same as
the probability of receiving 0 when 1 is sent. This common transition probability is denoted by p:

               y_0     y_1
    P(Y|X) = x_0 [ 1-p     p   ]
             x_1 [ p       1-p ]
(5) Binary Erasure Channel (BEC)
A BEC has two inputs (x_0 = 0 and x_1 = 1) and three outputs (y_0 = 0, y_1 = erasure, y_2 = 1).
Due to noise, it may not be possible to identify an output symbol as one or the other of the input
symbols. In that case it is erased, i.e., ignored, and a request is sent to the transmitter to retransmit.
That is why it is called a binary erasure channel; the middle output symbol indicates that the output
has been erased.

               y_0     y_1    y_2
    P(Y|X) = x_0 [ 1-p     p      0   ]
             x_1 [ 0       p      1-p ]
(6) Noise-free Channel
In this channel there is a one-to-one correspondence between input and output, i.e. each
input symbol is received as one and only one output symbol. There is no loss
of information in transmission, and the numbers of source and destination symbols are the same,
n = m.
The joint probability matrix P(X,Y) is of the diagonal form

    P(X,Y) = [ p(x_1,y_1)   0            ...   0          ]
             [ 0            p(x_2,y_2)   ...   0          ]
             [ ...                              ...       ]
             [ 0            0            ...   p(x_m,y_m) ]

and the channel probability matrices P(Y|X) and P(X|Y) are unit (identity) matrices:

    P(Y|X) = P(X|Y) = [ 1  0  ...  0 ]
                      [ 0  1  ...  0 ]
                      [ ...          ]
                      [ 0  0  ...  1 ]

It can be seen that p(x_j, y_k) = 0 for j ≠ k, and

    H(X,Y) = - Σ_{j=1}^{m} Σ_{k=1}^{m} p(x_j, y_k) log_2 p(x_j, y_k) = - Σ_{j=1}^{m} p(x_j, y_j) log_2 p(x_j, y_j)

From the joint probability matrix it can be seen that p(x_j, y_j) = p(x_j) = p(y_j),
so the joint entropy becomes H(X,Y) = H(X) = H(Y),
and the conditional entropies are H(Y|X) = H(X|Y) = 0 (every conditional probability involved is 1
and log_2 1 = 0).
Therefore the mutual information of a noise-free channel is

    I(X;Y) = H(X) - H(X|Y) = H(X) = H(Y) = H(X,Y)
Channel with Independent Input and Output
In these channels there is no correlation between the input and output symbols. Consider first
the channel shown in Figure 2.5(a), whose joint probability matrix (JPM) is

               y_1    y_2    ...   y_n
    P(X,Y) = x_1 [ p_1    p_1    ...   p_1 ]
             x_2 [ p_2    p_2    ...   p_2 ]
             ... [ ...                     ]
             x_m [ p_m    p_m    ...   p_m ]

Fig 2.5(a)
It can be seen from the JPM that

    p(x_j) = n p_j,                          j = 1, 2, ..., m
    p(y_k) = Σ_{j=1}^{m} p_j = 1/n,          k = 1, 2, ..., n
    p(x_j, y_k) = p_j

Hence p(x_j, y_k) = p(x_j) p(y_k), which shows that x_j and y_k are independent for all j and k, i.e.
the input and output are independent for the channel of Fig 2.5(a). Consequently,

    p(x_j | y_k) = p(x_j, y_k) / p(y_k) = p(x_j)   and   p(y_k | x_j) = p(x_j, y_k) / p(x_j) = p(y_k)

Now,

    H(Y|X) = - Σ_{j=1}^{m} Σ_{k=1}^{n} p(x_j, y_k) log_2 p(y_k | x_j)
           = - Σ_{j=1}^{m} Σ_{k=1}^{n} p(x_j) p(y_k) log_2 p(y_k)
           = - Σ_{k=1}^{n} [ Σ_{j=1}^{m} p(x_j) ] p(y_k) log_2 p(y_k)
           = - Σ_{k=1}^{n} p(y_k) log_2 p(y_k)

Thus H(Y|X) = H(Y), and similarly it can be shown that H(X|Y) = H(X).
Hence, for the channel shown in Fig 2.5(a), we have

    I(X;Y) = H(X) - H(X|Y) = 0
    I(X;Y) = H(Y) - H(Y|X) = 0
Now consider the channel shown in Figure 2.5(b), whose joint probability matrix is

               y_1    y_2    ...   y_n
    P(X,Y) = x_1 [ p_1    p_2    ...   p_n ]
             x_2 [ p_1    p_2    ...   p_n ]
             ... [ ...                     ]
             x_m [ p_1    p_2    ...   p_n ]

It can be seen that

    p(x_j) = Σ_{k=1}^{n} p_k = 1/m,          j = 1, 2, ..., m
    p(y_k) = m p_k,                          k = 1, 2, ..., n
    p(x_j, y_k) = p_k

Hence p(x_j, y_k) = p(x_j) p(y_k), which shows that x_j and y_k are independent for all j and k, i.e.
the input and output are independent for the channel shown in Fig 2.5(b).
Following the same procedure it can be shown that I(X;Y) = 0.
Hence it can be said that, in the case of a channel with independent input and output,
no information is transmitted through the channel, i.e. I(X;Y) = 0, as shown in Figures 2.5(a)
and 2.5(b).
** A channel with independent input and output has a JPM satisfying at least one of the following
conditions:
(a) Each row consists of the same element.
(b) Each column consists of the same element.
CHANNEL CAPACITY
(A) Channel Capacity per Symbol (C)
The channel capacity of a discrete memoryless channel, commonly denoted by
C, is defined as the maximum mutual information I(X;Y) in any single use of the
channel (i.e., signaling interval), where the maximization is over all possible
input probability distributions {P(x_i)} on X:

    C = max I(X;Y) = max [ H(X) - H(X|Y) ]   bits/symbol

(B) Channel Capacity per Second (C_s)
If r symbols are being transmitted per second, then the maximum rate of transmission of
information per second is rC. This is the channel capacity per second, denoted by C_s:

    C_s = r C   binits/sec

The capacity is a measure of the number of bits that are delivered to the output per
channel use. The maximization is with respect to the source probabilities, since the transition
probabilities are fixed by the channel. The channel capacity is therefore a function of only the
channel transition probabilities P(y_j | x_i), since the maximization process eliminates the
dependence on the source probabilities.
The transmission efficiency (or channel efficiency) is defined as

    η = (actual transinformation) / (maximum transinformation) = I(X;Y) / max I(X;Y) = I(X;Y) / C = I_s(X;Y) / C_s

where I_s(X;Y) = r I(X;Y) is the transinformation per second.
The redundancy of the channel is defined as

    R_ch = 1 - η = [C - I(X;Y)] / C = [C_s - I_s(X;Y)] / C_s
Estimation of channel capacities for special channels:
(A) Lossless Channel: For a lossless channel, H(X|Y) = 0 and I(X;Y) = H(X). Thus the
mutual information is equal to the input (source) entropy, and no source information is lost in
transmission. The channel capacity per symbol is

    C = max_{P(x_i)} H(X) = log_2 m

where m is the number of symbols in X.
(B) Deterministic Channel: For a deterministic channel, H(Y|X) = 0 for all input distributions
{P(x_i)} and I(X;Y) = H(Y). Thus the information transfer is equal to the output entropy. The
channel capacity per symbol is

    C = max_{P(x_i)} H(Y) = log_2 n

where n is the number of symbols in Y.
(C) Noise-Free Channel: Since a noise-free channel is both lossless and deterministic, we
have I(X;Y) = H(X) = H(Y). Thus the channel capacity per symbol is

    C = max I(X;Y) = log_2 m = log_2 n
(D) Symmetric Channel: A symmetric channel is defined as one for which
(i) H(Y|x_j) is independent of j, i.e. the entropy corresponding to each row of P(Y|X) is the same;
(ii) Σ_{j=1}^{m} P(y_k|x_j) is independent of k, i.e. the sum of each column of P(Y|X) is the same.
For a symmetric channel,

    I(X;Y) = H(Y) - H(Y|X)

    H(Y|X) = Σ_{j=1}^{m} H(Y|x_j) P(x_j) = A Σ_{j=1}^{m} P(x_j) = A

where A = H(Y|x_j) is independent of j and Σ_{j=1}^{m} P(x_j) = 1.
Hence I(X;Y) = H(Y) - A, and the capacity of the symmetric channel is

    C = max I(X;Y) = max [ H(Y) - A ] = max H(Y) - A

    C = log_2 n - A,   since max H(Y) = log_2 n

where n is the total number of receiver (output) symbols.
(E) Binary Symmetric Channel: Consider the binary symmetric channel shown in the figure below.
Let the source alphabet consist of two symbols: x_1 with probability P(x_1) = ω and x_2
with probability P(x_2) = 1 - ω. The destination alphabet is {y_1, y_2}.
This model represents a binary transmission in which errors are statistically independent
and the error probabilities are the same for both symbols, so the average error probability
per symbol is

    P_e = P(x_1) P(y_2|x_1) + P(x_2) P(y_1|x_2) = ω p + (1 - ω) p = p

So p is the error probability of the symbols in the BSC.
Calculation of I(X;Y):
We know that

    P(Y|X) = [ 1-p    p   ]
             [ p      1-p ]

The joint probability matrix is calculated as follows.
Method-1:

    P(X,Y) = [ p(x_1,y_1)   p(x_1,y_2) ]
             [ p(x_2,y_1)   p(x_2,y_2) ]

    p(x_1, y_1) = P(x_1) P(y_1|x_1) = ω(1-p)
    p(x_1, y_2) = P(x_1) P(y_2|x_1) = ω p
    p(x_2, y_1) = P(x_2) P(y_1|x_2) = (1-ω) p
    p(x_2, y_2) = P(x_2) P(y_2|x_2) = (1-ω)(1-p)

    P(X,Y) = [ ω(1-p)    ω p        ]
             [ (1-ω)p    (1-ω)(1-p) ]

Method-2:  P(X,Y) = P(X)_d P(Y|X), where

    P(X)_d = [ P(x_1)   0      ] = [ ω    0   ]
             [ 0        P(x_2) ]   [ 0    1-ω ]

    P(X,Y) = [ ω    0   ] [ 1-p   p   ]  =  [ ω(1-p)    ω p        ]
             [ 0    1-ω ] [ p     1-p ]     [ (1-ω)p    (1-ω)(1-p) ]

We know that I(X;Y) = H(Y) - H(Y|X). The noise entropy H(Y|X) is calculated as

    H(Y|X) = Σ_{j=1}^{2} Σ_{k=1}^{2} p(x_j, y_k) log_2 [ 1 / p(y_k|x_j) ]
           = ω(1-p) log_2 [1/(1-p)] + ω p log_2 (1/p) + (1-ω) p log_2 (1/p) + (1-ω)(1-p) log_2 [1/(1-p)]
           = p log_2 (1/p) + (1-p) log_2 [1/(1-p)]

i.e.

    H(Y|X) = - p log_2 p - (1-p) log_2 (1-p)
Calculation of the Channel Capacity: The channel capacity is found by maximizing I(X;Y)
with respect to the input symbol probabilities:

    C = max I(X;Y) = max [ H(Y) + p log_2 p + (1-p) log_2 (1-p) ]

i.e. we have to maximize H(Y), since the term [ p log_2 p + (1-p) log_2 (1-p) ] is independent of the
input probabilities.

    I(X;Y) = H(Y) + p log_2 p + (1-p) log_2 (1-p)

    H(Y) = p(y_1) log_2 [1/p(y_1)] + p(y_2) log_2 [1/p(y_2)]

    p(y_1) = p(x_1) p(y_1|x_1) + p(x_2) p(y_1|x_2) = ω(1-p) + (1-ω)p = ω + p - 2ωp
    p(y_2) = p(x_1) p(y_2|x_1) + p(x_2) p(y_2|x_2) = ωp + (1-ω)(1-p) = 1 - ω - p + 2ωp

or p(y_2) = 1 - p(y_1).
Method-2 for calculating the output probabilities: consider the JPM

    P(X,Y) = [ ω(1-p)    ω p        ]
             [ (1-ω)p    (1-ω)(1-p) ]

The sum of each column gives the corresponding output probability:

    p(y_1) = p(x_1, y_1) + p(x_2, y_1) = ω(1-p) + (1-ω)p = ω + p - 2ωp
    p(y_2) = p(x_1, y_2) + p(x_2, y_2) = ωp + (1-ω)(1-p) = 1 - ω - p + 2ωp
The output entropy H(Y) attains its maximum when the input symbols are equiprobable,
i.e. p(x_1) = ω = 0.5 and p(x_2) = 1 - ω = 0.5. The output probabilities are then

    p(y_1) = ω + p - 2ωp = 0.5
    p(y_2) = 1 - ω - p + 2ωp = 0.5

    H(Y)_max = p(y_1) log_2 [1/p(y_1)] + p(y_2) log_2 [1/p(y_2)] = 0.5 log_2 2 + 0.5 log_2 2 = 1 bit/symbol

Hence,

    C_BSC = max I(X;Y) = max[H(Y)] + p log_2 p + (1-p) log_2 (1-p) = 1 - H(p)

where H(p) = - p log_2 p - (1-p) log_2 (1-p).
The variation of the channel capacity of a BSC with the transition probability p is shown in Figure 2.6.
Fig 2.6 channel-capacity variation with transition probability, p
Conclusions:
1. When the channel is noise free, i.e if p = 0 or 1, the channel output is completely
determined by the channel input, and the capacity is 1 bit per symbol. At this value of p,
the entropy function H (p) attains its minimum value of zero.
2. When the conditional probability of error p is equal to 0.5 due to channel noise, an input
symbol yields either output symbol with equal probability, and the capacity is zero,
whereas the entropy function H(p) attains its maximum value of unity. In such a case the
channel is said to be useless, in the sense that the channel input and output are
statistically independent.
    C_BSC = 1 + p log_2 p + (1-p) log_2 (1-p)

    C_BSC = 1 - H(p)
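The closed-form C_BSC = 1 - H(p) is easily tabulated, as in the illustrative sketch below (not part of the original notes); the values reproduce the behaviour plotted in Fig 2.6.

```python
import numpy as np

def Hb(p):
    """Binary entropy function H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_capacity(p):
    return 1.0 - Hb(p)

for p in (0.0, 0.1, 0.25, 0.5, 0.9, 1.0):
    print(p, bsc_capacity(p))
# C = 1 bit/symbol at p = 0 or 1, and C = 0 at p = 0.5, as in Fig 2.6
```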
(F) Binary Erasure Channel: Consider the binary erasure channel shown in the figure below.
Let the source alphabet consist of two symbols: x_1 with probability P(x_1) = ω and x_2 with
probability P(x_2) = 1 - ω. The destination alphabet is {y_1, y_2, y_3}, where y_2 is the erasure output.
Calculation of I(X;Y):

    P(Y|X) = [ P(y_1|x_1)   P(y_2|x_1)   P(y_3|x_1) ]  =  [ 1-p   p   0   ]
             [ P(y_1|x_2)   P(y_2|x_2)   P(y_3|x_2) ]     [ 0     p   1-p ]

    P(Y) = P(X) P(Y|X),   where P(X) = [ ω   1-ω ], so

    P(Y) = [ ω   1-ω ] [ 1-p   p   0   ]  =  [ ω(1-p)   p   (1-ω)(1-p) ]
                       [ 0     p   1-p ]
         = [ P(y_1)   P(y_2)   P(y_3) ]

Method-2:

    P(y_1) = P(x_1) P(y_1|x_1) = ω(1-p)
    P(y_2) = P(x_1) P(y_2|x_1) + P(x_2) P(y_2|x_2) = ωp + (1-ω)p = p
    P(y_3) = P(x_2) P(y_3|x_2) = (1-ω)(1-p)
The joint probability matrix (JPM) is P(X,Y) = P(X)_d P(Y|X), where

    P(X)_d = [ P(x_1)   0      ]  =  [ ω   0   ]
             [ 0        P(x_2) ]     [ 0   1-ω ]

    P(X,Y) = [ ω   0   ] [ 1-p   p   0   ]
             [ 0   1-ω ] [ 0     p   1-p ]

           = [ p(x_1,y_1)   p(x_1,y_2)   p(x_1,y_3) ]  =  [ ω(1-p)   ωp       0          ]
             [ p(x_2,y_1)   p(x_2,y_2)   p(x_2,y_3) ]     [ 0        (1-ω)p   (1-ω)(1-p) ]
We know that I(X;Y) = H(Y) - H(Y/X)

H(Y/X) = -Σⱼ Σₖ p(xₖ, yⱼ) log₂ p(yⱼ/xₖ)

       = -Σₖ Σⱼ p(xₖ) p(yⱼ/xₖ) log₂ p(yⱼ/xₖ)     (k = 1, 2; j = 1, 2, 3)

       = -[ p(x₁) p(y₁/x₁) log₂ p(y₁/x₁) + p(x₁) p(y₂/x₁) log₂ p(y₂/x₁)
            + p(x₂) p(y₂/x₂) log₂ p(y₂/x₂) + p(x₂) p(y₃/x₂) log₂ p(y₃/x₂) ]

       = -[ α(1 - p) log₂ (1 - p) + αp log₂ p + (1 - α)p log₂ p + (1 - α)(1 - p) log₂ (1 - p) ]

       = -[ p log₂ p + (1 - p) log₂ (1 - p) ]

H(Y) = -Σⱼ P(yⱼ) log₂ p(yⱼ)

     = -[ P(y₁) log₂ p(y₁) + P(y₂) log₂ p(y₂) + P(y₃) log₂ p(y₃) ]

     = -[ α(1 - p) log₂ {α(1 - p)} + p log₂ p + (1 - α)(1 - p) log₂ {(1 - α)(1 - p)} ]

     = -[ (1 - p){α log₂ α + (1 - α) log₂ (1 - α)} + p log₂ p + (1 - p) log₂ (1 - p) ]

Therefore,

I(X;Y) = H(Y) - H(Y/X)

       = -(1 - p)[ α log₂ α + (1 - α) log₂ (1 - α) ]

       = (1 - p) H(X)

C_BEC = max I(X;Y) = max[(1 - p) H(X)]

C_BEC = (1 - p) max H(X)
The output entropy is maximized when the input entropy is maximized, i.e., when the input symbols are equiprobable: p(x₁) = α = 0.5 and p(x₂) = (1 - α) = 0.5.

max H(X) = -α log₂ α - (1 - α) log₂ (1 - α) = -0.5 log₂ 0.5 - 0.5 log₂ 0.5 = 1 bit/symbol

Therefore the channel capacity of the BEC is given as

C_BEC = 1 - p
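As a quick numerical check (an illustrative sketch in Python, not part of the original notes; the values of α and p are arbitrary examples), the relation I(X;Y) = (1 - p)H(X) for the BEC can be verified directly:

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def bec_mutual_information(alpha, p):
    """I(X;Y) for a BEC with input probabilities (alpha, 1-alpha) and erasure probability p."""
    # Joint probabilities p(x, y) for the outputs (y1, y2 = erasure, y3)
    joint = [[alpha * (1 - p), alpha * p, 0.0],
             [0.0, (1 - alpha) * p, (1 - alpha) * (1 - p)]]
    py = [sum(col) for col in zip(*joint)]   # output probabilities
    h_y = entropy(py)
    h_y_given_x = entropy([p, 1 - p])        # same conditional entropy for both inputs
    return h_y - h_y_given_x

alpha, p = 0.5, 0.2
print(bec_mutual_information(alpha, p))            # ≈ 0.8
print((1 - p) * entropy([alpha, 1 - alpha]))       # (1 - p) * H(alpha) ≈ 0.8
```

With α = 0.5 the mutual information equals 1 - p, the BEC capacity derived above.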
Binary Channels:
It is easy to analyze a BSC, but in practice we come across binary channels with non-symmetric structures. A binary channel is shown in Figure 2.7.
The channel matrix is given as

[D] = P(Y/X) = [ P₁₁   P₁₂ ]
               [ P₂₁   P₂₂ ]

To find the channel capacity of the binary channel, a method was suggested by Dr. S. Muroga.
Define the auxiliary variables Q₁ and Q₂, found from the matrix equations

P₁₁Q₁ + P₁₂Q₂ = P₁₁ log₂ P₁₁ + P₁₂ log₂ P₁₂
P₂₁Q₁ + P₂₂Q₂ = P₂₁ log₂ P₂₁ + P₂₂ log₂ P₂₂

Solving the above two expressions gives the values of Q₁ and Q₂.
The channel capacity is found by

C = log₂ (2^Q₁ + 2^Q₂) bits/message-symbol
In general, for a channel with an m × m channel matrix

P(Y/X) = [ P₁₁   P₁₂   …   P₁ₘ ]
         [ P₂₁   P₂₂   …   P₂ₘ ]
         [  ⋮     ⋮          ⋮  ]
         [ Pₘ₁   Pₘ₂   …   Pₘₘ ]

the auxiliary variables Q₁, Q₂, …, Qₘ can be obtained by solving the equations obtained from the matrix:

P₁₁Q₁ + P₁₂Q₂ + … + P₁ₘQₘ = P₁₁ log₂ P₁₁ + P₁₂ log₂ P₁₂ + … + P₁ₘ log₂ P₁ₘ
P₂₁Q₁ + P₂₂Q₂ + … + P₂ₘQₘ = P₂₁ log₂ P₂₁ + P₂₂ log₂ P₂₂ + … + P₂ₘ log₂ P₂ₘ
⋮
Pₘ₁Q₁ + Pₘ₂Q₂ + … + PₘₘQₘ = Pₘ₁ log₂ Pₘ₁ + Pₘ₂ log₂ Pₘ₂ + … + Pₘₘ log₂ Pₘₘ

The channel capacity is found by

C = log₂ (2^Q₁ + 2^Q₂ + … + 2^Qₘ) bits/message-symbol
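The linear system above is straightforward to solve numerically. The following Python sketch (an illustration, not part of the original notes) applies Muroga's method to a 2 × 2 channel; the example matrix is the one that reappears in drill problem 6 later in this unit:

```python
import math

def muroga_capacity_2x2(P):
    """Capacity of a 2x2 channel matrix P = [[P11, P12], [P21, P22]]
    using Muroga's auxiliary variables Q1 and Q2."""
    def rhs(row):
        return sum(p * math.log2(p) for p in row if p > 0)

    (p11, p12), (p21, p22) = P
    b1, b2 = rhs(P[0]), rhs(P[1])
    det = p11 * p22 - p12 * p21              # Cramer's rule for the 2x2 linear system
    q1 = (b1 * p22 - p12 * b2) / det
    q2 = (p11 * b2 - p21 * b1) / det
    return math.log2(2 ** q1 + 2 ** q2)      # C = log2(2^Q1 + 2^Q2)

print(muroga_capacity_2x2([[0.8, 0.2], [0.3, 0.7]]))
# ≈ 0.19 bit/message-symbol (drill problem 6 below rounds this to 0.2)
```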
Cascaded Channels:
Let us consider two channels connected in cascade as shown in Figure 2.8. When the information is transmitted from X to Y through channel-I, there will be a loss of information due to the noise in channel-I, and the mutual information at the output of channel-I is

I(X, Y) = H(Y) - H(Y/X)
When I(X, Y) is passed through channel-II, there will be further loss of information and the
mutual information at the output of channel-II is
I(X, Z) = H(Z) - H(Z/X)
It is obvious that I(X, Z) < I(X, Y).
Let us consider the case of two cascaded identical BSC channels as shown in Figure 2.9.
The messages from X₁ reach Z₁ in two ways: x₁ → y₁ → z₁ and x₁ → y₂ → z₁. The respective path probabilities are p·p and q·q (with q = 1 - p).
Hence, p² + q² = (p + q)² - 2pq = 1 - 2pq
Similarly, the messages from X₁ reach Z₂ in two ways: x₁ → y₁ → z₂ and x₁ → y₂ → z₂. The respective path probabilities are p·q and q·p.
Hence, pq + qp = 2pq
The channel matrix of the cascaded channel is

P(Z/X) = [ 1 - 2pq    2pq      ]
         [ 2pq        1 - 2pq  ]

The channel capacity of the cascaded channel is C = 1 - H(2pq).
For 0 < q < 0.5, 2pq is always greater than q; hence the channel capacity of two cascaded BSCs is less than that of a single BSC.
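As a numerical illustration (a sketch, not part of the original notes; the value of q is an arbitrary example), the capacities of a single BSC and of two cascaded identical BSCs can be compared directly:

```python
import math

def H(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

q = 0.1                          # transition (error) probability of each BSC, p = 1 - q
p = 1 - q
c_single = 1 - H(q)              # capacity of one BSC
c_cascade = 1 - H(2 * p * q)     # capacity of two cascaded identical BSCs
print(c_single, c_cascade)       # ≈ 0.531 vs ≈ 0.320: cascading reduces the capacity
```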
Shannon’s Theorem on Channel Capacity (Shannon’s Second Theorem)
It states that it is possible to devise a means whereby a communication system will transmit information with an arbitrarily small probability of error, provided that the information rate Rt ≤ Cs, the channel capacity.
Explanation: Given a source of M equally likely messages, with M >> 1, which is generating information at a rate Rt, and given a channel with capacity Cs, then if Rt ≤ Cs there exists a coding technique which enables transmission over the channel with as small a probability of error as desired, even in the presence of noise in the channel. But if Rt > Cs, then reliable transmission of information is not possible; the errors cannot be controlled by any coding technique, and the probability of error in receiving the message becomes close to unity.
Shannon’s Channel Coding Theorem
Given a discrete memoryless source with an entropy of H(S) bits per symbol emitting
symbols at the rate of (1/Ts) symbols per second, and given a discrete memoryless channel with a
capacity of C bits per symbol and through which the symbols are transmitted at the rate of (1/Tc)
symbols per second, it is possible to construct a channel code which would make it possible to
transmit the source symbols through the channel and be reconstructed with arbitrarily small
probability of error, if and only if

H(S)/Ts ≤ C/Tc
Shannon’s theorem merely states that codes do exist that enable us to transmit information
over a given channel with arbitrarily small probability of error as long as the rate of transmission
(in bits/sec) is less than the channel capacity (in bits/sec) of the channel. It does not tell us anything
about how such codes may be constructed.
Differential Entropy and Mutual Information for Continuous Random
Ensembles
Differential Entropies:
Consider a continuous random variable X with probability density function (pdf) fX(x); then the average amount of information per sample value of x(t) is

h(X) = -∫ fX(x) log₂ fX(x) dx   bits/sample

where the integral extends from -∞ to ∞. The term h(X) is called the differential entropy of X.
Similarly, for a random variable Y with pdf fY(y),

h(Y) = -∫ fY(y) log₂ fY(y) dy   bits/sample
(a) Uniform Distribution: Consider a random variable X uniformly distributed over the interval (0, a). The probability density function of X is

fX(x) = 1/a for 0 < x < a, and 0 otherwise

Then,

h(X) = -∫ fX(x) log₂ fX(x) dx = -∫₀ᵃ (1/a) log₂ (1/a) dx = log₂ a   bits/sample

Remark: if a < 1, then log₂ a is negative. This shows that, unlike H(X) for discrete sources, the differential entropy h(X) of a continuous source can be negative.
(b) Gaussian Distribution: Consider a random variable X that is Gaussian distributed with mean μ and variance σ². Its pdf must satisfy the two conditions:

∫ fX(x) dx = 1   (1)

∫ (x - μ)² fX(x) dx = σ²   (2)
In order to maximize the differential entropy subject to the constraints of eq. (1) and eq. (2), we make use of Lagrange multipliers:

g[fX(x), λ₁, λ₂] = h(X) + λ₁[ ∫ fX(x) dx - 1 ] + λ₂[ ∫ (x - μ)² fX(x) dx - σ² ]   (3)

where λ₁, λ₂ are the Lagrange multipliers. The maximization of h(X) then requires

∂g[fX(x), λ₁, λ₂] / ∂fX(x) = -log₂ fX(x) - log₂ e + λ₁ + λ₂(x - μ)² = 0

Solving the above expression gives

fX(x) = exp[ -1 + λ₁/log₂ e + λ₂(x - μ)²/log₂ e ]   (4)

If eq. (4) is substituted in eq. (1) and eq. (2), the integrals will not converge unless λ₂ is negative.
Let

α = exp[ -1 + λ₁/log₂ e ]   and   β²(x - μ)² = -λ₂(x - μ)²/log₂ e

so that eq. (4) can be rewritten as

fX(x) = α e^(-β²(x - μ)²)   (5)
Substituting eq. (5) in eq. (1), we have

∫ α e^(-β²(x - μ)²) dx = 1

Let β(x - μ) = y; solving, we get

α√π / β = 1   (6)

Substituting eq. (5) in eq. (2), we have

∫ α (x - μ)² e^(-β²(x - μ)²) dx = σ²

and solving,

α√π / (2β³) = σ²   (7)
Dividing eq. (6) by eq. (7),

2β² = 1/σ², i.e. β = 1/(√2 σ)

and substituting β in eq. (6) we get

α = 1/(σ√(2π))

Now, substituting β and α in eq. (5), we get

fX(x) = [1/(σ√(2π))] e^(-(x - μ)²/(2σ²))
Hence, the differential entropy h(X) of the r.v. X has its maximum value when X is Gaussian, and the maximum value is obtained as follows:

h(X) = -∫ fX(x) log₂ fX(x) dx

But

log₂ fX(x) = log₂ e · logₑ fX(x)

           = log₂ e · logₑ{ [1/(σ√(2π))] e^(-(x - μ)²/(2σ²)) }

           = log₂ e [ logₑ(1/(σ√(2π))) - (x - μ)²/(2σ²) ]

           = -log₂ e [ logₑ(σ√(2π)) + (x - μ)²/(2σ²) ]

Therefore,

-∫ fX(x) log₂ fX(x) dx = log₂ e ∫ fX(x) logₑ(σ√(2π)) dx + (log₂ e / 2σ²) ∫ (x - μ)² fX(x) dx

                       = log₂ √(2πσ²) ∫ fX(x) dx + (log₂ e / 2σ²) ∫ (x - μ)² fX(x) dx

Using eq. (1) and eq. (2),

h(X) = log₂ √(2πσ²) + (1/2) log₂ e

     = (1/2) log₂ (2πσ²) + (1/2) log₂ e

h(X) = (1/2) log₂ (2πeσ²) = log₂ (σ√(2πe))
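As a numerical sanity check (an illustrative Python sketch, not part of the original notes; σ = 2 is an arbitrary example), the closed form h(X) = (1/2) log₂(2πeσ²) can be compared with a direct numerical evaluation of -∫ fX(x) log₂ fX(x) dx:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def differential_entropy_numeric(mu, sigma, span=10.0, steps=100000):
    """Midpoint Riemann-sum estimate of -integral f(x)*log2 f(x) dx over mu +/- span*sigma."""
    a, b = mu - span * sigma, mu + span * sigma
    dx = (b - a) / steps
    total = 0.0
    for i in range(steps):
        f = gaussian_pdf(a + (i + 0.5) * dx, mu, sigma)
        if f > 0:
            total -= f * math.log2(f) * dx
    return total

sigma = 2.0
print(0.5 * math.log2(2 * math.pi * math.e * sigma ** 2))   # ≈ 3.047 bits/sample
print(differential_entropy_numeric(0.0, sigma))             # ≈ 3.047 bits/sample
```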
Mutual Information: Consider a pair of continuous random variables X and Y; then

I(X;Y) = ∫∫ fX,Y(x, y) log₂ [ fX(x/y) / fX(x) ] dx dy

where fX(x/y) is the conditional density function of X given that Y = y, and fX,Y(x, y) is the joint probability density function.

The conditional differential entropy of the continuous r.v. X given the continuous r.v. Y is

h(X/Y) = ∫∫ fX,Y(x, y) log₂ [ 1 / fX(x/y) ] dx dy

The conditional differential entropy of the continuous r.v. Y given the continuous r.v. X is

h(Y/X) = ∫∫ fX,Y(x, y) log₂ [ 1 / fY(y/x) ] dx dy

Properties:
1. I(X;Y) = I(Y;X)
2. I(X;Y) ≥ 0
3. I(X;Y) = h(X) - h(X/Y) = h(Y) - h(Y/X)
Shannon-Hartley Law / Shannon's Information-Capacity Theorem
Shannon's information capacity theorem is also known as Shannon's third theorem, the Shannon-Hartley theorem, or the Gaussian channel capacity theorem. If the channel bandwidth B is fixed and the output is a band-limited signal completely characterized by its periodic sample values taken at the Nyquist rate of 2B samples/sec, then the channel capacity C (bits/sec) of the AWGN channel is given by

C = B log₂ (1 + S/N) bits/second
It deals with the channel capacity of a band limited and power limited continuous channel which
is corrupted by additive white Gaussian noise of zero mean. It is one of the important results of
information theory as it shows the relationship between the channel capacity and the three key
parameters—channel bandwidth, average transmitted power and the power spectral density of the
white noise on the channel.
Proof: The input to the channel is in the form of a sequence of the samples of a zero-mean Gaussian
distributed stationary random process x(t) band limited to B Hz. These samples are taken uniformly
exactly at the Nyquist rate of 2B samples per second. It is assumed that the average signal power
is constrained to be S. Since the channel is band limited to B Hz, the received signal as well as
noise are also band limited to B Hz. If the input sample is a continuous zero-mean Gaussian random
variable, X, the corresponding output sample is a Gaussian random variable Y given by
Y = X + N   (1)

where N is a sample of one realization of a zero-mean Gaussian noise process which is band-limited to B Hz.
Since the signal and noise are statistically independent, the average power of Y is

E[Y²] = E[X²] + E[N²]   (2)

where E[·] is the mathematical expectation.
The mutual information of the channel is given by

I(X;Y) = h(Y) - h(Y/X) bits/sample   (3)
The channel capacity C is the maximum value of this mutual information under the constraint that the average signal power has to be equal to S:

C = max over fX(x) of { I(X;Y) : E[X²] = S }   bits/sample   (4)
The conditional differential entropy h(Y/X) is expressed as

h(Y/X) = ∫∫ fX,Y(x, y) log₂ [ 1 / fY(y/x) ] dx dy

       = ∫∫ fX(x) fY(y/x) log₂ [ 1 / fY(y/x) ] dx dy

       = ∫ fX(x) dx ∫ fY(y/x) log₂ [ 1 / fY(y/x) ] dy

∴ h(Y/X) = ∫ fY(y/x) log₂ [ 1 / fY(y/x) ] dy, since ∫ fX(x) dx = 1   (5)
But y = x + n.
If fN(n) denotes the probability density function of the noise sample N, then

fY(y/x) = fN(y - x)

h(Y/X) = ∫ fN(y - x) log₂ [ 1 / fN(y - x) ] dy

Let y - x = z; then

h(Y/X) = ∫ fN(z) log₂ [ 1 / fN(z) ] dz = h(N)   (6)

∴ I(X;Y) = h(Y) - h(N)
h(Y) will be maximum, for a given variance of Y, only when Y is Gaussian distributed, which happens if and only if X, the input random variable, is Gaussian. We have assumed that the input signal x(t) is a zero-mean Gaussian distributed stationary random process with an average power of S.
Since X and N are statistically independent, the total average power of Y is

E[Y²] = E[X²] + E[N²] = S + N   (7)
The maximum value of the differential entropy h(Y) is

h(Y)max = (1/2) log₂ [2πe(S + N)]

∴ I(X;Y)max = h(Y)max - h(N) = (1/2) log₂ [2πe(S + N)] - h(N)   (8)

Since the channel noise is Gaussian, h(N) is given by

h(N) = (1/2) log₂ (2πeN)

∴ I(X;Y)max = (1/2) log₂ [2πe(S + N)] - (1/2) log₂ (2πeN)

            = (1/2) log₂ [ 2πe(S + N) / (2πeN) ] = (1/2) log₂ (1 + S/N) bits/sample
Since 2B samples are transmitted per second over the channel, and assuming these samples of x(t), taken at the Nyquist rate, to be statistically independent,

C = 2B · I(X;Y)max = B log₂ (1 + S/N) bits/second

which is the channel capacity of the band-limited Gaussian channel, where
B = channel bandwidth in Hz
S = avg. signal power in watts
N = noise power in watts
If η/2 is the two-sided power spectral density of the noise, then N = ηB.
Bandwidth and S/N trade-off: If the bandwidth tends to infinity, the channel capacity does not become infinite, because as the bandwidth increases the noise power also increases. Thus, for a fixed signal power and in the presence of white Gaussian noise, the channel capacity approaches an upper limit, called the "Shannon limit", as the bandwidth is increased to infinity.

C = B log₂ (1 + S/N) bits/second

With N = ηB,

C = B log₂ (1 + S/(ηB)) bits/sec

  = (S/η)(ηB/S) log₂ (1 + S/(ηB))

  = (S/η) log₂ (1 + S/(ηB))^(ηB/S)

We know that lim (x→0) (1 + x)^(1/x) = e.

If x = S/(ηB), then x → 0 when B → ∞, and

lim (B→∞) (1 + S/(ηB))^(ηB/S) = e
The Shannon limit on channel capacity is therefore

C∞ = lim (B→∞) C = (S/η) log₂ e = 1.44 S/η

Thus, for an AWGN channel, the channel capacity C, instead of going to infinity, approaches this limit as we go on increasing the bandwidth. The capacity can be made infinite only by increasing the signal power S to infinity. For finite signal and noise powers, the channel capacity always remains finite.
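A small numerical sketch (illustrative only, not part of the original notes; the signal power and noise PSD values are assumed examples) showing how C = B log₂(1 + S/(ηB)) saturates at the Shannon limit 1.44 S/η as B grows:

```python
import math

S, eta = 1e-3, 1e-9          # example values: signal power 1 mW, noise PSD 1 nW/Hz

def capacity(B):
    """C = B*log2(1 + S/(eta*B)) for an AWGN channel of bandwidth B."""
    return B * math.log2(1 + S / (eta * B))

for B in (1e4, 1e5, 1e6, 1e7, 1e8):
    print(f"B = {B:10.0f} Hz   C = {capacity(B) / 1e6:6.3f} Mbit/s")

print("Shannon limit:", 1.44 * S / eta / 1e6, "Mbit/s")   # ≈ 1.44 Mbit/s
```

The printed capacities rise from about 0.07 Mbit/s at B = 10 kHz towards roughly 1.44 Mbit/s, never exceeding the Shannon limit.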
Let us consider the trade-off between the bandwidth and the S/N ratio.

Let S/N = 15 and B = 5 kHz.

Then C = B log₂ (1 + S/N) = 5 log₂ (16) = 20 kbits/sec

If S/N = 31, the bandwidth required for the same channel capacity follows from 20 = B log₂ (32); therefore B = 4 kHz.

Since N = ηB, the noise power falls along with the bandwidth, so the required signal powers are S₁ = 15 × η × 5000 and S₂ = 31 × η × 4000. A 20% reduction in the bandwidth (5 kHz to 4 kHz) therefore requires about a 65% increase in the signal power. Thus, to decrease the bandwidth, the signal power has to be increased. Similarly, to decrease the signal power, the bandwidth must be increased.
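The arithmetic of this trade-off is easy to reproduce (an illustrative sketch, not from the original notes):

```python
import math

eta = 1.0                        # noise PSD (arbitrary units); only the ratio matters
B1, snr1 = 5e3, 15
C = B1 * math.log2(1 + snr1)     # 20 kbit/s

snr2 = 31
B2 = C / math.log2(1 + snr2)     # bandwidth needed for the same capacity: 4 kHz

S1, S2 = snr1 * eta * B1, snr2 * eta * B2
print(B2, (S2 - S1) / S1)        # 4000.0 Hz and ≈ 0.65, i.e. about a 65 % power increase
```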
Ideal System: An ideal system is defined as one that transmits data at a bit rate that is equal to the
channel capacity C, in bits per second.
Suppose the energy per bit is Eb. Then the average power transmitted is

S = Eb·C

and, with N = ηB,

S/N = Eb·C / (ηB)

C = B log₂ (1 + Eb·C/(ηB))

Rearranging,

Eb/η = (2^(C/B) - 1)/(C/B) = B(2^(C/B) - 1)/C

where C/B is the bandwidth efficiency.

But 2^(C/B) = e^((C/B) ln 2)

If B >> C, then 2^(C/B) - 1 ≈ (C/B) ln 2

∴ Eb/η ≈ [(C/B) ln 2] / (C/B) = ln 2, i.e. (Eb/η)min = ln 2 ≈ -1.6 dB
Replacing Rb with C, a plot of (Eb/η) versus (Rb/B) is shown in Figure 2.10.
From the curve, Rb = C (the ideal system) is the boundary between two distinct regions. The region Rb < C is the one in which it is possible to make the probability of error arbitrarily small, and it is the one which is of interest in practice. The other region, Rb > C, is the one in which it is not possible to achieve arbitrarily small probabilities of error.
From the figure, it is evident that when Rb/B is large, a very large value of Eb/η is needed to keep Rb < C, i.e., a large average power is needed. Suppose the bit rate Rb is fixed and the channel bandwidth is increased, so that B >> Rb. In this case, operation in the region Rb < C requires only that Eb/η be slightly greater than the Shannon limit, i.e., -1.6 dB. The corresponding signal power required is

S = η Rb ln 2 watts

This represents the minimum signal power required for operation in the Rb < C region. For power-limited operation, therefore, the bandwidth should be very large compared to the bit rate, so that only this minimum signal power is required.
But suppose the system has to operate with Rb >> B. Then, from the figure, it is clear that the Eb/η necessary for operation in the Rb < C region is quite large. The system is then said to be operating in the bandwidth-limited condition.
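The behaviour of the curve in Figure 2.10 can be sketched numerically (an illustration, not part of the original notes), showing the required Eb/η approaching the -1.6 dB Shannon bound as the bandwidth efficiency C/B becomes small:

```python
import math

def eb_over_eta_db(c_over_b):
    """Required Eb/eta (in dB) for an ideal system with bandwidth efficiency C/B."""
    ratio = (2 ** c_over_b - 1) / c_over_b
    return 10 * math.log10(ratio)

for c_over_b in (0.01, 0.1, 1.0, 2.0, 4.0, 8.0):
    print(f"C/B = {c_over_b:5.2f}   Eb/eta = {eb_over_eta_db(c_over_b):6.2f} dB")

print(10 * math.log10(math.log(2)))   # Shannon bound ≈ -1.59 dB as C/B -> 0
```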
Drill Problems
1. One internet service provider (ISP) gives dial-up connections at 56kbps. Assume that the
telephone connection provides a usable bandwidth of 3.5 kHz. What is the minimum
SNR required to support this?
Sol: The minimum SNR required is the one for which the channel capacity just equals the bit rate, i.e., Rb = C (ideal system), so C = 56 kbps.

SNR = 2^(C/B) - 1

Given B = 3.5 kHz,

SNR = 2^(56/3.5) - 1 = 2^16 - 1 = 65,535 ≈ 48.16 dB
2. A Gaussian channel has 1 MHz bandwidth. Calculate the channel capacity if the signal power to noise spectral density ratio S/η is 10⁵ Hz. Also, find the maximum information rate.
Sol: Given B = 1 MHz, S/η = 10⁵ Hz

C = B log₂ (1 + S/(ηB)) = 10⁶ log₂ (1 + 10⁵/10⁶) ≈ 1.38 × 10⁵ = 138,000 bits/sec

Maximum information rate, Rmax = 1.44 S/η = 1.44 × 10⁵ = 144,000 bits/sec
3. A black and white TV picture consists of 525 lines of picture information. Assume that each line consists of 525 picture elements (pixels) and that each element can have 256 brightness levels. Pictures are repeated at a rate of 30 frames/sec. Calculate the average rate of information conveyed by a TV set to a viewer.
Sol: Total number of pixels in one frame = 525 × 525 = 2,75,625 pixels.
Total number of different frames possible = (256)^2,75,625 frames.
Assuming that all these frames occur with equal probability, the maximum information content per frame is

H(S) = log₂ (256)^2,75,625 = 2,75,625 × 8 = 22.05 × 10⁵ bits/frame

Therefore, the average rate of information conveyed by a TV set is

R = r H(S) = 30 × 22.05 × 10⁵ = 661.5 × 10⁵ bits/sec ≈ 66.15 × 10⁶ bits/sec
4. For the given channel matrix, compute the mutual information with P(x₁) = P(x₂) = 0.5:

P(Y/X) = [ 2/3   1/3   0   ]
         [ 0     1/6   5/6 ]

Sol: Mutual information I(X;Y) = H(Y) - H(Y/X).
The JPM is found by multiplying the 1st row of the given channel matrix by P(x₁) and the 2nd row by P(x₂):

P(X, Y) = [ 1/3   1/6    0    ]
          [ 0     1/12   5/12 ]

H(Y/X) = -Σᵢ Σₖ p(xᵢ, yₖ) log₂ p(yₖ/xᵢ)

       = (1/3) log₂ (3/2) + (1/6) log₂ 3 + (1/12) log₂ 6 + (5/12) log₂ (6/5)

       = 0.7842 bits/symbol

P(y₁) = 1/3, P(y₂) = 1/6 + 1/12 = 1/4, P(y₃) = 5/12

H(Y) = -Σₖ p(yₖ) log₂ p(yₖ) = (1/3) log₂ 3 + (1/4) log₂ 4 + (5/12) log₂ (12/5)

     = 1.5546 bits/symbol

Mutual Information, I(X, Y) = 1.5546 - 0.7842 = 0.7704 bits/symbol.
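This bookkeeping is easy to automate. The Python sketch below (illustrative, not part of the original notes) recomputes the result of problem 4 directly from its joint probability matrix; the same function applies to the JPMs of problems 5 and 6 that follow.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(jpm):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), computed from a joint probability matrix
    whose rows correspond to x-values and columns to y-values."""
    px = [sum(row) for row in jpm]
    py = [sum(col) for col in zip(*jpm)]
    hxy = entropy([p for row in jpm for p in row])
    return entropy(px) + entropy(py) - hxy

jpm4 = [[1/3, 1/6, 0], [0, 1/12, 5/12]]
print(mutual_information(jpm4))   # ≈ 0.7704 bits/symbol, as computed above
```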
5. Find the mutual information for the channel shown in the figure.
Sol: From the figure, the JPM is

P(X, Y) = [ 0.25   0.25 ]
          [ 0.15   0.15 ]
          [ 0.1    0.1  ]

P(x₁) = 0.25 + 0.25 = 0.5
P(x₂) = 0.15 + 0.15 = 0.3
P(x₃) = 0.1 + 0.1 = 0.2

P(y₁) = 0.25 + 0.15 + 0.1 = 0.5
P(y₂) = 0.25 + 0.15 + 0.1 = 0.5

H(X) = -Σⱼ p(xⱼ) log₂ p(xⱼ) = -(0.5 log₂ 0.5 + 0.3 log₂ 0.3 + 0.2 log₂ 0.2) = 1.485 bits/message

H(Y) = -Σₖ p(yₖ) log₂ p(yₖ) = -(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit/message

H(X, Y) = -Σⱼ Σₖ p(xⱼ, yₖ) log₂ p(xⱼ, yₖ)
        = -(2 × 0.25 log₂ 0.25 + 2 × 0.15 log₂ 0.15 + 2 × 0.1 log₂ 0.1) = 2.485 bits/message

I(X, Y) = H(X) + H(Y) - H(X, Y) = 1.485 + 1 - 2.485 = 0
6. Find the mutual information and channel capacity of the channel shown in the figure, given P(x₁) = 0.6 and P(x₂) = 0.4.
Sol: The channel matrix,

[D] = P(Y/X) = [ 0.8   0.2 ]
               [ 0.3   0.7 ]

The JPM,

P(X, Y) = [ 0.48   0.12 ]
          [ 0.12   0.28 ]

P(y₁) = 0.48 + 0.12 = 0.6
P(y₂) = 0.12 + 0.28 = 0.4

P(X/Y) = [ 0.48/0.6   0.12/0.4 ]   =   [ 0.8   0.3 ]
         [ 0.12/0.6   0.28/0.4 ]       [ 0.2   0.7 ]

H(X) = -Σⱼ p(xⱼ) log₂ p(xⱼ) = -(0.6 log₂ 0.6 + 0.4 log₂ 0.4) = 0.971 bits/message

H(X/Y) = -Σⱼ Σₖ p(xⱼ, yₖ) log₂ p(xⱼ/yₖ)
       = -(0.48 log₂ 0.8 + 0.12 log₂ 0.3 + 0.12 log₂ 0.2 + 0.28 log₂ 0.7) = 0.786 bits/message

Mutual information, I(X, Y) = H(X) - H(X/Y) = 0.971 - 0.786 = 0.185 bits/message

Finding the channel capacity by Muroga's method:

P₁₁ = 0.8, P₁₂ = 0.2, P₂₁ = 0.3, P₂₂ = 0.7

0.8 Q₁ + 0.2 Q₂ = 0.8 log₂ 0.8 + 0.2 log₂ 0.2
0.3 Q₁ + 0.7 Q₂ = 0.3 log₂ 0.3 + 0.7 log₂ 0.7

Solving for Q₁ and Q₂ yields Q₁ = -0.6568, Q₂ = -0.9764.

The channel capacity,

C = log₂ (2^Q₁ + 2^Q₂) = log₂ (2^(-0.6568) + 2^(-0.9764)) = log₂ (0.634 + 0.508) ≈ 0.2 bit/message
  • 24. The average of the mutual information for a pair of symbols over all transmitted and received symbol pairs is the mutual information.   ( , ) , i k I X Y I x y    1 1 ( , ) , m n i k i k i k p x y I x y     2 1 1 ( ) ( , )log ( ) m n i k i k i k i p x y p x y p x       2 2 1 1 ( , ) log ( ) log ( ) m n i k i k i i k p x y p x y p x      2 2 1 1 1 1 ( , )log ( ) ( , )log ( ) m n m n i k i i k i k i k i k p x y p x p x y p x y                   2 1 1 ( , ) log ( ) / m n i k i i k p x y p x H X Y                2 1 ( )log ( ) / m i i i p x p x H X Y      Avg. Mutual Information (or) Trans-information i.e. when an avg. information H(X) or H(Y) is transmitted over the channel, an avg. amount of information equal to H(X/Y) or H(Y/X) is lost in the channel due to intersymbol conversion which is due to noise. The balance amount of information received at the receiver with respect to an observed output symbol is the mutual information. Conclusions: 1. I(X, Y) is an average mutual information, indicates measure of the information transferred through the channel. It is also known as transferred information or transinformation of the channel. 2. The equation ( , ) ( ) ( / ) I X Y H X H X Y   , states that the transferred information is equal to the average source information minus the average uncertainty that still remains about the messages. In other words, H(X/Y) is the average additional information needed at the receiver after reception in order to completely specify the message sent. Thus, H(X/Y) gives the information lost in the channel. This is also known as equivocation. ( , ) ( ) ( / ) I X Y H X H X Y   ( , ) H(Y) H(Y/ X) I X Y  
  • 25. 3. The equation ( , ) H(Y) H(Y/ X) I X Y   , states that the transferred information is equal to the receiver information minus that part of the receiver entropy which is not the information about the source. Thus, H(Y/X) is a measure of noise or error due to the channel. Properties of Mutual Information: 1. The mutual information of a channel is symmetric i.e I(X;Y)=I(Y;X) Proof: we know that, H(X,Y) = H(X) + H(Y/X)=H(Y) + H(X/Y) Therefore, H(X) +H(Y / X) = H(Y) + H(X / Y) H(X) – H(X / Y) = H(Y) – H (Y / X) Hence, I(X;Y) = I(Y;X) 2. The mutual information is non-negative i.e I(X;Y )≥ 0. Proof: we know that, 2 1 1 ( ) ( , ) ( , )log ( ) m n i k i k i k i p x y I X Y p x y p x     2 1 1 ( ) ( , )log ( ) m n i i k i k i k p x p x y p x y     2 1 1 ( ) ( ) ( , ) ( , )log ( , ) m n i k i k i k i k p x p y I X Y p x y p x y            1 1 ( ) ( ) 1 ( , )log log 2 ( , ) m n i k i k e i k e i k p x p y p x y p x y                     But we know that, log ( 1) e x x   1 1 1 ( ) ( ) 1 ( , ) ( , ) log 2 ( , ) m n i k i k i k e i k p x p y I X Y p x y p x y                         1 1 1 ( ) ( ) ( , ) log 2 m n i i k k i k e p x p y p x y                But, 1 1 1 1 1 ( ) ( ) ( , ) m n m n i i k k i i k k p x p y p x y         ( , ) 0 I X Y   Hence, ( , ) 0 I X Y  i.e even on a noisy channel, by observing the output of the channel, on the average we cannot lose any information. At the most, the mutual information may be zero, i.e., we do not gain any information by observing the output, and this happens when the input and output symbols of the channel are statistically independent.
  • 26. 3. The mutual information I(X;Y) of a channel is related to the marginal entropies H(X) , H(Y) and H(X,Y) as I(X;Y) = H(X) + H(Y) – H(X,Y) Proof: we know that, 2 1 1 1 (X,Y) ( , )log ( ) m n i k i k i k H p x y p x y     2 2 1 1 1 1 ( ) ( ) 1 ( , )log ( , )log ( , ) ( ) ( ) m n m n i k i k i k i k i k i k i k p x p y p x y p x y p x y p x p y                     2 1 1 1 ( ; ) ( , )log ( ) ( ) m n i k i k i k I X Y p x y p x p y             2 2 1 1 1 1 1 1 ( ; ) ( , )log ( , )log ( ) ( ) m n m n i k i k i k i k i k I X Y p x y p x y p x p y                       2 2 1 1 1 1 1 1 ( ; ) log ( , ) ( , ) log ( ) ( ) m n m n i k i k i k i k i k I X Y p x y p x y p x p y                                     2 2 1 1 1 1 ( ; ) ( )log ( )log ( ) ( ) m n i k i k i k I X Y p x p y p x p y                     ( ; ) ( ) ( ) I X Y H X H Y     Fig 2.4 Relation between Entropies and Mutual Information ( , ) H(X) H(Y) H(X,Y) I X Y    
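As a numerical illustration of these three relations, the sketch below (illustrative only, using an assumed joint probability matrix) computes I(X;Y) as H(X) − H(X/Y), as H(Y) − H(Y/X) and as H(X) + H(Y) − H(X,Y), and confirms that all three agree.

```python
from math import log2

# Assumed joint probability matrix p(x_i, y_k); rows are inputs x, columns outputs y.
pxy = [[0.30, 0.10],
       [0.05, 0.25],
       [0.10, 0.20]]

px = [sum(row) for row in pxy]                     # marginals p(x_i)
py = [sum(col) for col in zip(*pxy)]               # marginals p(y_k)

H = lambda dist: -sum(p * log2(p) for p in dist if p > 0)
HX, HY = H(px), H(py)
HXY = H([p for row in pxy for p in row])           # joint entropy H(X,Y)
HX_given_Y = HXY - HY                              # chain rule: H(X,Y) = H(Y) + H(X/Y)
HY_given_X = HXY - HX                              # chain rule: H(X,Y) = H(X) + H(Y/X)

I1 = HX - HX_given_Y                               # I = H(X) - H(X/Y)
I2 = HY - HY_given_X                               # I = H(Y) - H(Y/X)
I3 = HX + HY - HXY                                 # I = H(X) + H(Y) - H(X,Y)
print(f"I(X;Y) = {I1:.4f} = {I2:.4f} = {I3:.4f} bits/symbol")
```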
  • 27. Discrete Memoryless Channels (DMC) A communication channel is defined as the path or media through which symbols flow to the receiver end. A DMC is one which has discrete random variables, X may take any one of the M possible values i x , i = 0, 1, …, (M – 1) as its input and the random variable Y may take any one of the L possible values, j y , j = 0, 1, 2, …,(L – 1) as its output respectively. It is called a Discrete Memoryless Channel because the input and output are discrete random variables with a finite set of alphabet for each, and it is memoryless because its present output symbol depends only on the present input symbol but not on any of the previous input symbols. The transitional probabilities ML in number is generally given by a matrix called “Channel Matrix” of sizeM L  . M & L need not be equal. Channel Matrix:                   0 0 1 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 P L L j i M M L M Y p y x p y x p y x p y x p y x p y x Y y X x X p y x p y x p y x                              In the matrix of transitional probabilities,   j i P y x represents the probability of the random variable Y taking the value j y given that random variable X has taken the value i x i.e., the probability of receiving symbol j y when the symbol i x is transmitted (because of noise in the channel).   P j i j i P y x Y y X x        Where 0,1,2, ,( 1) i M   , 0,1,2, ,(L 1) j   If channel coding is done then M < L, but if two input symbols lead to the same output symbol then M > L. If the channel were to be ideal, M = L and j y = j x for every j.
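The way the channel matrix acts on a source distribution can be seen with a few lines of matrix arithmetic. The sketch below (illustrative only) uses an assumed 2 × 3 transition matrix and input distribution; the relations it evaluates are formalized in the properties listed next.

```python
import numpy as np

# Assumed transition matrix P(Y/X): rows are inputs x_i, columns are outputs y_j.
PYgX = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
PX = np.array([0.4, 0.6])                  # assumed input probabilities p(x_i)

assert np.allclose(PYgX.sum(axis=1), 1.0)  # each row of the channel matrix sums to 1

PY  = PX @ PYgX                            # P(Y) = P(X) P(Y/X)
PXY = np.diag(PX) @ PYgX                   # P(X,Y) = P(X)_d P(Y/X)

print("P(Y)   =", PY)                      # marginal output probabilities
print("P(X,Y) =\n", PXY)                   # joint probabilities p(x_i, y_j)
assert np.allclose(PXY.sum(), 1.0)
```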
  • 28. Properties of Transistion Matrix/Channel Matrix: 1. Each row of the transition matrix corresponds to a particular fixed input symbol to the channel. 2. Each column of the transition matrix corresponds to a certain fixed channel output symbol. 3. The sum of the probabilities along any particular row of the transition matrix is equal to one,   1 0 . 1 L j i j i e P y x i      4. If the input symbol probabilities are known i.e ( ) i p x then the joint distribution is determined as   ( , ) ( ) i j j i i p x y p y x p x  and the marginal distribution (y ) j p is,   1 0 (y ) ( ); 0,1,2, ,(L 1) M j j i i i p P y x p x j       5. If the input probability is a row matrix, P(Y) = P(X) P(Y /X) 6. If the input probability is a diagonal matrix, P(X,Y) = P(X)d P(Y/ X) Types of Channels: (1) Symmetric/Uniform Channel A channel is said to be symmetric (or) Uniform channel if the second and subsequent rows of the channel matrix contains the same elements as that of the first row, but in different order.   1 2 3 1 2 3 1 1 1 2 3 6 1 1 1 / 3 6 2 1 1 1 6 2 3 y y y x P Y X x x                (2) Lossless Channel A channel represented by a channel matrix with one and only one non-zero element in “every column” is defined as “Lossless channel”. Noiseless channel has two properties: (i) Each column only one element with other elements in that column being zeros. (ii) Sum of all elements in any row is equal to unity.   1 2 3 4 5 1 2 3 3 1 0 0 0 4 4 1 2 / 0 0 0 3 3 0 0 0 0 1 y y y y y x P Y X x x             
  • 29. (3) Deterministic Channel A channel represented by a channel matrix with one and only one non-zero element in “every row” is defined as “deterministic channel”. A deterministic channel has two properties: (i) Each row must contain only one element with other elements in that column being zeros. (iii) Sum of all elements in any row is equal to unity.   1 2 3 1 2 3 4 5 1 0 0 1 0 0 / 0 1 0 0 1 0 0 0 1 y y y x x P Y X x x x                  At the transmitter stage itself on can determine which symbol is to be received at the receiver and hence the channel is called deterministic channel. (4) Binary Symmetric Channel(BSC) A BSC consists of two inputs   0 1 0& 1 x x   and two outputs  0 1 0& y 1 y   . The channel is symmetric because the probability of receiving 1 when 0 is sent is same as the probability of receiving 0 when 1. This common transition probability is denoted by p.   0 1 0 1 1 / 1 y y x p p P Y X x p p          (5) Binary Erasure Channel(BEC) A BEC consists of two inputs   0 1 0& 1 x x   and three outputs  0 1 2 0,y & 1 y y    . Due to noise, it may not be possible to identify the output symbol as one or the other of the input symbols. In that case, it is erased, i.e., ignored and a request is sent to the transmitter to retransmit. That is why it is called a binary erasure channel, indicated by symbol that output is erased.
  • 30.   0 1 2 0 1 1 0 / 0 1 y y y x p p P Y X x p p          (6) Noise-free Channel In this channel there is a one-to-one correspondence between input and output i.e each input symbol is received as one and only one output symbol. In this channel there is no loss of information in transition. The number of source and destination symbols are same, n = m. The joint probability matrix P(X, Y) is of the diagonal form.   1 1 2 2 ( , ) 0 0 0 ( , ) 0 ( , ) 0 0 ( , ) m m p x y p x y P X Y p x y              And the channel probability matrices [P(Y/X)] and [P(X/Y)] are unity-diagonal matrices.     1 0 0 0 0 0 1 0 0 0 ( / ) ( / ) 0 0 0 0 1 P Y X P X Y               It can be seen that, ( , ) 0 j k p x y for j k   And, 2 1 1 ( , ) ( , )log ( , ) m m j k j k j k H X Y P x y p x y     2 1 ( , )log ( , ) m j j j j j P x y p x y    From joint probability matrix it can be seen that, ( , ) ( ) ( ) j j j j p x y p x p y   So the joint entropy becomes, H(X,Y) H(X) H(Y)   And the conditional entropies, H(Y/ X) H(X/ Y) (1log1) 0 m     Therefore, the mutual information for Noise-free channel is ( ; ) ( ) ( / ) ( ) ( ) ( ,Y) I X Y H X H X Y H X H Y H X     
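The channel types above can be recognized mechanically from the transition matrix. The helper functions below are an illustrative sketch, not from the original text; the test matrices are the symmetric, deterministic and lossless examples given above.

```python
import numpy as np

def is_symmetric(P):
    """Every row contains the same elements as the first row, in some order."""
    first = np.sort(P[0])
    return all(np.allclose(np.sort(row), first) for row in P)

def is_lossless(P):
    """Exactly one non-zero element in every column."""
    return all(np.count_nonzero(P[:, j]) == 1 for j in range(P.shape[1]))

def is_deterministic(P):
    """Exactly one non-zero element (equal to one) in every row."""
    return all(np.count_nonzero(row) == 1 and np.isclose(row.max(), 1.0) for row in P)

symmetric = np.array([[1/2, 1/3, 1/6],
                      [1/6, 1/2, 1/3],
                      [1/3, 1/6, 1/2]])          # rows are permutations of each other
deterministic = np.array([[1, 0, 0],
                          [1, 0, 0],
                          [0, 1, 0],
                          [0, 1, 0],
                          [0, 0, 1]])
lossless = np.array([[3/4, 1/4, 0, 0, 0],
                     [0, 0, 1/3, 2/3, 0],
                     [0, 0, 0, 0, 1]])

print("symmetric matrix is symmetric       :", is_symmetric(symmetric))
print("deterministic matrix is deterministic:", is_deterministic(deterministic))
print("lossless matrix is lossless          :", is_lossless(lossless))
```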
  • 31. Channel with Independent Input and Output In these channels there is no correlation between the input and output symbols. Let us consider the channel shown in Figure 2.5(a). The joint probability matrix (JPM) is given as,   1 2 1 1 1 1 2 2 2 2 ( , ) n m m m m y y y x p p p x p p p P X Y x p p p              It can be seen from JPM that Fig 2.5(a) ( ) , j=1,2,.......,m j j p x np for  1 1 ( ) , 1,2,...., m k j j p y p k n n      And, ( , ) j k j p x y p  Hence, ( , ) ( ) ( ) j k j k p x y p x p y  , shows that & j k x y are independent for all j and k , i.e input and output are independent for the channel shown in Fig 2.5(a). Hence, we get,   ( , ) ( ) ( ) ( ) j k j j k j k p x y p x or p x y p x p y   And   ( , ) ( ) ( ) ( ) j k k k j k j p x y p y or p y x p y p x   Now,   2 1 1 H(Y/ X) ( , )log m n j k k j j k P x y p y x     2 1 1 ( ) ( )log ( ) m n j k k j k p x p y p y     2 1 1 ( ) ( )log ( ) n m j k k k j p x p y p y             2 1 ( )log ( ) n k k k p y p y    Thus, H(Y/ X) ( ) H Y  Similarly, it can be shown ( / ) ( ) H X Y H X  Hence, for the channel shown in Fig 2.5(a), we have, ( ; ) ( ) ( / ) 0 ( ; ) ( ) ( / ) 0 I X Y H X H X Y I X Y H Y H Y X      
  • 32. Now consider another channel shown in Figure 2.5(b) along with joint probability matrix,   1 2 1 1 2 2 1 2 1 2 ( , ) n n n m n y y y x p p p x p p p P X Y x p p p              It can be seen that, 1 1 ( ) , j 1,2,...., n j k k p x p n m      ( ) , k=1,2,.......,n k k p y mp for  ( , ) j k k p x y p  Hence, ( , ) ( ) ( ) j k j k p x y p x p y  , shows that & j k x y are independent for all j and k , i.e input and output are independent for the channel shown in Fig 2.5(b). Following the same procedure it cab ne shown that ( ; ) 0 I X Y  . Hence it can be said that in the case of channel with an independent input and output, “No information is transmitted through the channel” i.e. ( ; ) 0 I X Y  as shown in Figures 2.5(a) and 2.5(b). ** A channel with an independent input and output with JPM satisfies atleast one of the following conditions: (a) Each row consists of the same element. (b) Each column consists of the same element. CHANNEL CAPACITY (A)Channel capacity per Symbol (C) The channel capacity of a discrete memoryless channel, commonly denoted by C, is defined as the maximum mutual information I(X;Y) in any single use of the channel (i.e., signaling interval), where the maximization is over all possible input probability distributions   ( ) i P x on X. bits/Symbol (B) Channel Capacity per Second Cs If r symbols are being transmitted per second, then the maximum rate of transmission of information per second is rC. This is the channel capacity binits per second and is denoted by Cs (binits/sec).     max ( ; ) max ( ) ( / ) C I X Y H X H X Y   
  • 33. The capacity is a measure of the number of bits that are delivered to the output with that channel use. The maximization is with respect to the source probabilities, since the transition probabilities are fixed by the channel. However, the channel capacity is a function of only the channel transition probabilities j i y P x       , since the maximization process eliminates the dependence on the source probabilities. The transmission efficiency or channel efficiency is defined as sin max sin actual tran formation imum tran formation     ( ; ) max ( ; ) I X Y I X Y  The redundancy of the channel is defined as Estimation of channel capacities for special channels: (A)Lossless Channel: For a lossless Channel H(X / Y) = 0 and I(X; Y) = H(X). Thus the mutual information is equal to the input (source) entropy and no source is lost in transmission. The channel capacity per symbol is given by   2 ( ) max (X) log i P x C H m   Where m is number of symbols in X. (B) Deterministic Channel: For deterministic channel H(Y / X) = 0 for all input distributions   ( ) i P x and I(X; Y) = H(Y). Thus the information transfer is equal to output entropy. The channel capacity per symbol is given as   2 ( ) max (Y) log i P x C H n   Where n is number of symbols in Y. / sec s C r C binits   ( ; ) ( ; ) s I X Y I X Y C C    ( ; ) ( ; ) 1 s s C I X Y C I X Y R C C       
  • 34. (C) Noise-Free Channel: Since a noise-free channel is both noiseless and deterministic, we have ( ; ) (X) H(Y) I X Y H   . Thus the channel capacity per symbol is given as 2 2 C max ( ; ) log log I X Y m n    (D)Symmetric Channel: A symmetric channel is defined as one for which (i)   j H Y x is independent of j; i.e the entropy corresponding to each row of   ( / ) P Y X is same. (ii)   1 m k j j P y x   is independent of k, i.e the sum of all columns of   ( / ) P Y X is the same. For symmetric channel, ( ; ) (Y) H(Y/ X) I X Y H       1 ( ) m j j j H Y H Y x P x       1 ( ) m j j H Y A P x     Where,   j A H Y x  is independent of j, also   1 1 m j j P x    Hence, ( ; ) H(Y) A I X Y   The capacity of the symmetric channel is,   C max ( ; ) max H(Y) A I X Y      max ( ) H Y A   2 log C n A   ,   2 sin max ( ) log ce H Y n  Where n is the total number of receiver (output) symbols. (E) Binary Symmetric Channel: consider the binary symmetric channel as shown in below fig.
  • 35. Let the source alphabet consists of two symbols 1 x with probability, 1 ( ) P x   and 2 x with probability 2 ( ) (1 ) P x    as shown in above figure. The destination alphabet is 1 2 y & y . This model represents a binary transmission in which errors are statistically independent and the error probabilities are same for both the symbols, so the average error probability per symbol is.     1 2 1 2 1 2 ( ) ( ) e P P x P y x P x P y x   (1 ) p p      p  So p is the error probability of the symbols in BSC. Calculation of I (X; Y): we know that,   1 / 1 p p P Y X p p          The joint probability matrix is calculated as, Method-1   1 1 1 2 2 1 2 2 ( , ) ( , ) ( , ) ( , ) ( , ) p x y p x y P X Y p x y p x y                1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 2 2 2 2 2 ( , ) ( ) (1 ) ( , ) ( ) ( , ) ( ) (1 ) ( , ) ( ) (1 )(1 ) p x y P x P y x p p x y P x P y x p p x y P x P y x p p x y P x P y x p                   (1 ) ( , ) (1 ) (1 )(1 ) p p P X Y p p                 Method-2       ( , ) ( ) / d P X Y P X P Y X  Where   1 2 ( ) 0 0 ( ) 0 ( ) 0 (1 ) d P x P X P x                    0 1 ( , ) 0 (1 ) 1 p p P X Y p p                   Simplifying the above matrix,   1 1 1 2 2 1 2 2 ( , ) ( , ) (1 ) ( , ) ( , ) ( , ) (1 ) (1 )(1 ) p x y p x y p p P X Y p x y p x y p p                       
  • 36. We know that ( ; ) ( ) ( / ) I X Y H Y H Y X   The noise entropy H(Y / X) is calculated as,   2 1 1 1 ( / ) ( , )log n m j k j k j k H Y X p x y p y x         2 2 2 1 1 1 p( ) log k j k k j j k x p y x p y x                              1 1 1 2 1 2 1 2 1 1 2 1 2 1 2 2 2 2 2 2 1 2 2 2 1 1 ( )p log ( )p log 1 1 ( )p log ( )p log p x y x p x y x p y x p y x p x y x p x y x p y x p y x     2 2 2 2 1 1 1 1 (1 )log log (1 ) log (1 )(1 )log (1 p) (1 p) p p p p p p               2 2 1 1 log (1 )log (1 p) p p p       2 2 log (1 )log (1 ) p p p p        2 2 ( / ) log (1 )log (1 ) H Y X p p p p      Calculation of Channel Capacity: The channel capacity is found by maximizing the I(X : Y) with respect to input symbols.   2 2 C max ( ; ) max ( ) log (1 )log (1 ) I X Y H Y p p p p         2 2 max ( ) log (1 )log (1 ) H Y p p p p      i.e we have to maximize the H(Y) since   2 2 log (1 )log (1 ) p p p p    is independent of input values. 2 2 ( ; ) ( ) log (1 )log (1 ) I X Y H Y p p p p      
  • 37.     1 2 2 2 1 2 1 1 H(Y) (y )log (y )log p p p y p y       1 1 1 1 2 1 2 (y ) ( )p ( )p p p x y x p x y x   (1 ) (1 ) p p       2 p p          2 1 2 1 2 2 2 (y ) ( )p ( )p p p x y x p x y x   (1 )(1 ) p p       1 2 p p       Or 2 1 (y ) 1 (y ) p p     1 2 p p       1 2 p p       Method-2 for calculating output probabilities. Consider JPM,   1 1 1 2 2 1 2 2 ( , ) ( , ) (1 ) ( , ) ( , ) ( , ) (1 ) (1 )(1 ) p x y p x y p p P X Y p x y p x y p p                       The sum of each column values produces output probabilities. 1 1 1 2 1 (y ) ( , ) ( , ) p p x y p x y   (1 ) (1 ) p p       2 p p      2 1 2 2 2 (y ) ( , ) ( , ) p p x y p x y   (1 )(1 ) p p       1 2 p p       Since output entropy attains a maximum when the input symbols are equiprobable, i.e. 1 ( ) 0.5 p x    and 2 ( ) (1 ) 0.5 p x     then output probabilities are 1 (y ) 2 0.5 p p p       2 (y ) 1 2 0.5 p p p            1 2 2 2 max 1 2 1 1 H(Y) (y )log (y )log p p p y p y   2 2 1 1 0.5log 0.5log 0.5 0.5   max H(Y) 1 bit/symbol  
  • 38. Hence,

C_BSC = max I(X;Y) = max[H(Y)] + p log2 p + (1 − p) log2 (1 − p)
      = 1 + p log2 p + (1 − p) log2 (1 − p)
      = 1 − H(p)

where H(p) = −p log2 p − (1 − p) log2 (1 − p) is the binary entropy function. The variation of the channel capacity of a BSC with the transition probability p is shown in Figure 2.6.

Fig 2.6 Channel-capacity variation with transition probability p

Conclusions:
1. When the channel is noise free, i.e. if p = 0 or 1, the channel output is completely determined by the channel input and the capacity is 1 bit per symbol. At these values of p the entropy function H(p) attains its minimum value of zero.
2. When the conditional probability of error p equals 0.5 due to channel noise, an input symbol yields either output symbol with equal probability and the capacity is zero, whereas the entropy function H(p) attains its maximum value of unity. In such a case the channel is said to be useless, in the sense that the channel input and output assume statistically independent structures.

C_BSC = 1 + p log2 p + (1 − p) log2 (1 − p)
C_BSC = 1 − H(p)
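A quick numerical check of C_BSC = 1 − H(p) (illustrative sketch, not part of the original text): the capacity is obtained both from the closed form and by brute-force maximization of I(X;Y) = H(Y) − H(p) over the input probability α = P(x1).

```python
from math import log2

def Hb(p):
    """Binary entropy function H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def I_bsc(alpha, p):
    """Mutual information of a BSC with P(x1) = alpha and transition probability p."""
    py1 = alpha * (1 - p) + (1 - alpha) * p          # P(y1)
    return Hb(py1) - Hb(p)                           # I = H(Y) - H(Y/X)

p = 0.1
closed_form = 1 - Hb(p)
brute_force = max(I_bsc(a / 1000, p) for a in range(1001))
print(f"C (closed form) = {closed_form:.4f},  C (maximized over alpha) = {brute_force:.4f}")
# Both give about 0.531 bits/symbol; the maximum occurs at alpha = 0.5.
```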
  • 39. (F) Binary Erasure Channel: consider the binary erasure channel as shown in below fig. Let the source alphabet consists of two symbols 1 x with probability, 1 ( ) P x   and 2 x with probability 2 ( ) (1 ) P x    as shown in above figure. The destination alphabet is 1 2 3 y , & y y . Calculation of I(X;Y)               1 1 2 1 3 1 1 2 2 2 3 2 1 0 / 0 1 P y x P y x P y x p p P Y X P y x P y x P y x p p                       ( ) ( ) / P Y P X P Y X      Where ,       1 2 ( ) 1 ( ) ( ) P X P x P x          1 0 ( ) 1 0 1 p p P Y p p               (1 ) (1 )(1 ) p p p         1 2 3 (y ) (y ) (y ) P P P  Method-2   1 1 1 1 (y ) ( ) (1 ) P P x P y x p         2 1 2 1 2 2 2 (y ) ( ) ( ) ( ) P P P x P y x P x P y x     (1 ) p p      p    3 2 3 2 (y ) ( ) (1 )(1 ) P P x P y x p      The joint probability Matrix (JPM)       ( , ) ( ) / d P X Y P X P Y X  Where,   1 2 ( ) 0 0 ( ) 0 ( ) 0 (1 ) d P x P X P x                    0 1 0 ( , ) 0 (1 ) 0 1 p p P X Y p p                  
  • 40.   1 1 1 2 1 3 2 1 2 2 2 3 ( , ) ( , ) ( , ) (1 ) 0 ( , ) ( , ) ( , ) ( , ) 0 (1 ) (1 )(1 ) p x y p x y p x y p p P X Y p x y p x y p x y p p                        We know that ( ; ) ( ) ( / ) I X Y H Y H Y X     2 1 1 ( / ) ( , )log n m j k j k j k H Y X p x y p y x          2 3 2 1 1 p( ) log k j k j k k j x p y x p y x                            1 1 1 2 1 1 1 2 1 2 2 1 2 2 2 2 2 2 2 3 2 2 3 2 ( )p log ( )p log ( )p log ( )p log p x y x p y x p x y x p y x p x y x p y x p x y x p y x      2 2 2 2 (1 )log (1 p) log (1 ) log (1 )(1 )log (1 p) p p p p p p               2 2 log (1 )log (1 p) p p p     3 2 1 ( ) ( )log ( ) j j j H Y P y p y      1 2 1 2 2 2 3 2 3 ( )log ( ) ( )log ( ) ( )log ( ) P y p y P y p y P y p y       2 2 2 (1 )log (1 ) log (1 )(1 )log (1 )(1 ) p p p p p p                 2 2 2 2 (1 ) log (1 )log (1 ) log (1 )log (1 ) p p p p p                   ( ; ) ( ) ( / ) I X Y H Y H Y X      2 2 (1 ) log (1 )log (1 ) p           (1 ) ( ) p H X     max ( ; ) max (1 ) ( ) BEC C I X Y p H X      (1 )max ( ) BEC C p H X   The maximum value of output entropy will be maximized if input entropy is maximized when input symbols are equiprobable, 1 ( ) 0.5 p x    and 2 ( ) (1 ) 0.5 p x       2 2 ( ) log (1 )log (1 ) H X          2 2 0.5log 0.5 (0.5)log (0.5) 1 bit/symbol     Therefore the channel capacity of BEC is given as, 1 BEC C p  
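The result C_BEC = 1 − p can be confirmed in the same way. The sketch below (illustrative only) computes I(X;Y) = H(Y) − H(Y/X) for the BEC from first principles and checks that it equals (1 − p)H(X), whose maximum over the input distribution is 1 − p.

```python
from math import log2

def H(dist):
    return -sum(q * log2(q) for q in dist if q > 0)

def I_bec(alpha, p):
    """I(X;Y) for a BEC with P(x1) = alpha, computed as H(Y) - H(Y/X)."""
    py = [alpha * (1 - p), p, (1 - alpha) * (1 - p)]   # outputs y1, erasure, y3
    HYgX = H([1 - p, p])                               # H(Y/X) = H(p) for either input
    return H(py) - HYgX

p = 0.2
for alpha in (0.3, 0.5, 0.7):
    print(f"alpha = {alpha}: I = {I_bec(alpha, p):.4f}, "
          f"(1-p)*H(X) = {(1 - p) * H([alpha, 1 - alpha]):.4f}")
capacity = max(I_bec(a / 1000, p) for a in range(1001))
print(f"capacity = {capacity:.3f} bits/symbol, 1 - p = {1 - p:.3f}")
```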
  • 41. Binary Channels: While a BSC is easy to analyze, in practice we come across binary channels with non-symmetric structures. A binary channel is shown in Figure 2.7. Its channel matrix is given as

[D] = [P(Y/X)] = | P11  P12 |
                 | P21  P22 |

To find the channel capacity of such a binary channel, a method suggested by Dr. S. Muroga can be used. Auxiliary variables Q1 and Q2 are defined by the pair of equations

P11 Q1 + P12 Q2 = P11 log2 P11 + P12 log2 P12
P21 Q1 + P22 Q2 = P21 log2 P21 + P22 log2 P22

Solving these two equations gives the values of Q1 and Q2, and the channel capacity is found from

C = log2 (2^Q1 + 2^Q2)  bits/message-symbol

In general, for an m × m channel with channel matrix

[P(Y/X)] = | P11  P12  ...  P1m |
           | P21  P22  ...  P2m |
           | ...                |
           | Pm1  Pm2  ...  Pmm |

the auxiliary variables Q1, Q2, ..., Qm are obtained by solving the set of equations

P11 Q1 + P12 Q2 + ... + P1m Qm = P11 log2 P11 + P12 log2 P12 + ... + P1m log2 P1m
P21 Q1 + P22 Q2 + ... + P2m Qm = P21 log2 P21 + P22 log2 P22 + ... + P2m log2 P2m
...
Pm1 Q1 + Pm2 Q2 + ... + Pmm Qm = Pm1 log2 Pm1 + Pm2 log2 Pm2 + ... + Pmm log2 Pmm

and the channel capacity is found from

C = log2 (2^Q1 + 2^Q2 + ... + 2^Qm)  bits/message-symbol
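A small solver for Muroga's method is sketched below (illustrative only, not part of the original text): it sets up the linear system for the auxiliary variables Q_i and evaluates C = log2(Σ 2^Q_i). The numbers shown reproduce the binary-channel drill problem worked at the end of this unit (C ≈ 0.19 bit).

```python
import numpy as np

def muroga_capacity(P):
    """Channel capacity of a square channel matrix P by Muroga's method."""
    P = np.asarray(P, dtype=float)
    # Right-hand side: sum_j P_ij * log2 P_ij for each row (0*log 0 taken as 0)
    rhs = np.where(P > 0, P * np.log2(np.where(P > 0, P, 1)), 0).sum(axis=1)
    Q = np.linalg.solve(P, rhs)              # solve P Q = rhs
    C = np.log2(np.sum(2.0 ** Q))
    return Q, C

Q, C = muroga_capacity([[0.8, 0.2],
                        [0.3, 0.7]])
print("Q =", np.round(Q, 4), " C =", round(C, 3), "bits/message-symbol")
# Expected: Q is approximately [-0.657, -0.976] and C is approximately 0.19.
```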
  • 42. Cascaded Channels: Let us consider two channels connected in cascade as shown in Figure 2.8. When information is transmitted from X to Y through channel-I, there is a loss of information due to the noise in channel-I, and the mutual information at the output of channel-I is

I(X, Y) = H(Y) − H(Y/X)

When this output is passed through channel-II, there is a further loss of information, and the mutual information at the output of channel-II is

I(X, Z) = H(Z) − H(Z/X)

with I(X, Z) ≤ I(X, Y); cascading can never increase the information received about X.

Let us consider the case of two cascaded identical BSCs, as shown in Figure 2.9, each with probability p of correct transmission and error probability q = 1 − p. The message from x1 reaches z1 in two ways: x1 → y1 → z1 and x1 → y2 → z1. The respective path probabilities are p·p and q·q. Hence

p' = p^2 + q^2 = (p + q)^2 − 2pq = 1 − 2pq

Similarly, the message from x1 reaches z2 in two ways: x1 → y1 → z2 and x1 → y2 → z2. The respective path probabilities are p·q and q·p. Hence

q' = pq + qp = 2pq

The channel matrix of the cascaded channel is therefore

[P(Z/X)] = | 1 − 2pq    2pq     |
           | 2pq        1 − 2pq |

i.e., the cascade is again a BSC, with transition (error) probability 2pq, and its channel capacity is

C = 1 − H(2pq)

For 0 < q < 0.5, 2pq = 2q(1 − q) is always greater than q while still not exceeding 0.5, so H(2pq) > H(q); hence the channel capacity of two cascaded BSCs is less than that of a single BSC.
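The cascade result can be verified by direct matrix multiplication (illustrative sketch, not part of the original text): squaring the channel matrix of a BSC yields a BSC with error probability 2pq, and the two capacities show the loss caused by cascading.

```python
import numpy as np
from math import log2

def Hb(x):
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

q = 0.1                                   # error probability of each BSC
p = 1 - q
bsc = np.array([[p, q],
                [q, p]])

cascade = bsc @ bsc                       # overall P(Z/X) of two cascaded BSCs
print(cascade)                            # [[1-2pq, 2pq], [2pq, 1-2pq]]
print("2pq =", 2 * p * q)
print("C of one BSC       :", round(1 - Hb(q), 4), "bits/symbol")
print("C of cascaded BSCs :", round(1 - Hb(2 * p * q), 4), "bits/symbol")
```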
  • 43. For 0.5>q>0, 2pq is always greater than q. hence the channel capacity of two cascaded BSC’s is less than single BSC. Shannon’s Theorem on Channel Capacity (Shannon’s Second Theorem) It states that it is possible to device a means where by a communication system will transmit information with an arbitrarily small probability of error provided that the information rate, t s R C  the channel capacity. Explanation: Given a source M equally likely messages, with M >>1 which is generating information at a rate Rt. Given a channel capacity Cs, then if t s R C  , there exists a coding technique which enables transmission over the channel with as small a probability of error as possible, even in the presence of noise in the channel. But if t s R C  , then reliable transmission of information is not possible without errors and cannot be controlled by any coding technique and the probability error receiving the correct message becomes close to unity. Shannon’s Channel Coding Theorem Given a discrete memoryless source with an entropy of H(S) bits per symbol emitting symbols at the rate of (1/Ts) symbols per second, and given a discrete memoryless channel with a capacity of C bits per symbol and through which the symbols are transmitted at the rate of (1/Tc) symbols per second, it is possible to construct a channel code which would make it possible to transmit the source symbols through the channel and be reconstructed with arbitrarily small probability of error, if and only if ( ) s c H S C T T  Shannon’s theorem merely states that codes do exist that enable us to transmit information over a given channel with arbitrarily small probability of error as long as the rate of transmission (in bits/sec) is less than the channel capacity (in bits/sec) of the channel. It does not tell us anything about how such codes may be constructed. Differential Entropy and Mutual Information for Continuous Random Ensembles Differential Entropies: Consider a continuous random variable X with the probability density function (pdf) ( ) X f x , then the average amount of information measured per sample value of x(t) is 2 ( ) ( )log ( ) bits/sample X X h X f x f x dx      The term h(X) as the differential entropy of X. Similarly, with random variable Y with pdf (y) Y f , then 2 (Y) (y)log (y) bits/sample Y Y h f f dx      (a) Uniform Distribution: Consider a random variable X uniformly distributed over the interval (0, a). The probability density function of X is
  • 44. 1 0 ( ) 0 X x a a f x         Then, 2 ( ) ( )log ( ) bits/sample X X h X f x f x dx      2 0 1 1 log a dx a a   2 log a  Remark: if a<1, the log a will be negative, this shows that unlike in the case of H(X) for discrete sources, h(X) of continuous sources can be negative. (b) Gaussian Distribution: Consider a random variable X Gaussian distributed with 2  as variance and  as mean should satisfy the two conditions: (x) 1 X f dx     (1) 2 2 ( ) (x) X x f dx        (2) In order to maximize differential entropy with respect to eq(1) and eq(2), we make use of Lagrange multipliers.   2 2 1 2 1 2 ( ), , ( ) (x) 1 ( ) (x) X X X g f x h X f dx x f dx                               (3) Where 1 2 ,   are Lagrange multipliers. Then the maximization of h(X) requires   1 2 2 2 2 1 2 ( ), , log (x) log ( ) 0 (x) X X X g f x f e x f               Solving above expression gives 2 1 2 2 2 ( ) (x) exp 1 log log X x f e e               (4) If eq(4 )is substituted in eq(1) and eq(2) , the integrals will not converge, unless 2  is negative. Let 1 2 exp 1 log e            and 2 2 2 ( ) log x e      Eq(4) re-written as, 2 2 ( ) (x) x X f e       (5) Substituting eq(5) in eq(1) we have, 2 2 ( ) 1 x e dx          Let ( ) x y     and solving, we get 1           (6)
  • 45. Substituting eq (5) in eq(2) we have , 2 2 2 ( ) 2 ( ) x x e dx             and solving 2 3 2            (7) Dividing eq (6) and eq (7) one by the other, 1 2    And substitution β in eq (6) we get 1 2    Now substituting β and α in eq (5) we get 2 2 ( ) 2 1 (x) 2 x X f e       Hence, the differential entropy h(X) of r.v.X has maximum value when X is Gaussian and the maximum value is given as,   2 ( ) (x) log ( ) X X h X f f x dx      But,   2 2 log ( ) log log ( ) X e X f x e f x  2 2 ( ) 2 2 1 log log 2 x e e e                   2 2 2 1 ( ) log log 2 2 e x e                     2 2 2 2 ( ) log log 2 2 e x e                2 2 2 2 2 ( ) (x)log log 2 (x) log 2 X e X x f e dx f edx             2 2 2 2 2 1 log 2 (x) log ( ) (x) 2 X X f dx e x f dx             2 2 2 1 ( ) log 2 log 2 h X e     2 2 2 1 1 log 2 log 2 2 e    2 2 2 1 ( ) log (2 ) log ( 2 ) 2 h X e e      
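The closed form h(X) = (1/2) log2(2πeσ^2) can be checked by numerically integrating −∫ f(x) log2 f(x) dx for an assumed variance; the sketch below (illustrative only) uses a simple Riemann sum.

```python
from math import pi, e, exp, log2, sqrt

sigma = 2.0                                    # assumed standard deviation
f = lambda x: exp(-x * x / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# Riemann-sum approximation of the differential entropy over [-10*sigma, 10*sigma]
dx = 1e-3
xs = [k * dx for k in range(int(-10 * sigma / dx), int(10 * sigma / dx))]
h_numeric = -sum(f(x) * log2(f(x)) * dx for x in xs)

h_closed = 0.5 * log2(2 * pi * e * sigma ** 2)
print(f"numerical h(X) = {h_numeric:.4f} bits/sample")
print(f"closed form    = {h_closed:.4f} bits/sample")
```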
  • 46. Mutual Information: Consider a pair of random variables X and Y then, , 2 ( / ) ( ; ) ( , )log (x) X X Y X f x y I X Y f x y dxdy f              Where ( / ) X f x y is the conditional density function of X given that Y= y and f(x,y) is the joint probability density function. The conditional differential entropy of continuous r.v.X given continuous r.v.Y is , 2 1 (X/ Y) ( , )log ( / ) X Y X h f x y dxdy f x y              The conditional differential entropy of continuous r.v.Y given continuous r.v.X is , 2 1 (Y/ X) ( , )log (y/ x) X Y X h f x y dxdy f              Properties: 1. ( ; ) (Y;X) 2. ( ; ) 0 3. ( ; ) ( ) ( / ) ( ) ( / ) I X Y I I X Y I X Y h X h X Y h Y h Y X       Shannon-Hartley Law /Shannon’s Information –Capacity Theorem Shannon’s information capacity theorem is also known as Shannon’s third theorem or Shannon-Hartley theorem or Gaussian Channel capacity theorem. If a channel bandwidth B is fixed and output is also band-limited signal completely characterized by its periodic sample values taken at Nyquist rate 2B samples/sec. then the channel capacity C (bits/sec) of AWGN channel is given by 2 log 1 bits/second S C B N         It deals with the channel capacity of a band limited and power limited continuous channel which is corrupted by additive white Gaussian noise of zero mean. It is one of the important results of information theory as it shows the relationship between the channel capacity and the three key parameters—channel bandwidth, average transmitted power and the power spectral density of the white noise on the channel. Proof: The input to the channel is in the form of a sequence of the samples of a zero-mean Gaussian distributed stationary random process x(t) band limited to B Hz. These samples are taken uniformly exactly at the Nyquist rate of 2B samples per second. It is assumed that the average signal power is constrained to be S. Since the channel is band limited to B Hz, the received signal as well as noise are also band limited to B Hz. If the input sample is a continuous zero-mean Gaussian random variable, X, the corresponding output sample is a Gaussian random variable Y given by Y X N   (1) where N is a sample of one realization of a zero-mean Gaussian noise process which is bandlimited to B Hz
  • 47. Since the signal and noise are statistically independent, the average power of Y 2 2 2 [ ] [X ] [N ] E Y E E   (2) Where E[ ]  is the mathematical expectation. The mutual information of the channel is given by I(X;Y) h(Y) h(Y/ X)   bits/sample (3) The channel capacity C is the maximum value of this mutual information under the constraint that the average signal power has to be equal to S 2 ( ) C ( ; ) : ( ) max X f x I X Y E X S I       bits/sample (4) The conditional differential entropy ( / ) h Y X is expressed as, , 1 ( / ) ( , )log ( / ) X Y Y h Y X f x y dxdy f y x              1 ( ) ( / )log ( / ) X Y Y f x f y x dxdy f y x              1 ( ) ( / )log ( / ) X Y Y f x dx f y x dy f y x                    1 ( / ) ( / )log , ( / ) Y Y h Y X f y x dy f y x           sin ( ) 1 X ce f x dx     (5) But, y x n   If ( ) N f n denotes the probability density function of the noise sample N, then ( / ) ( ) Y N f y x f y x   1 ( / ) ( )log ( ) N N h Y X f y x dy f y x             Let y x z   , then 1 ( / ) (z)log ( ) (z) N N h Y X f dy h N f            (6) I(X;Y) (Y) (N) h h    h(Y) will be maximum, for a given variance of Y, only when Y is Gaussian distributed if and only if X, the input random variable is Gaussian. We have assumed that the input signal x (t) is a zero- mean Gaussian distributed stationary random process with an average power of S. Since X and Y are statistically independent, the total average power of Y is 2 2 2 [ ] [X ] [N ] S N E Y E E     (7) The maximum value of differential entropy h(Y) is   2 max 1 ( ) log 2 ( ) 2 h Y e S N   
  • 48. But,   2 max max 1 I(X;Y) ( ) (N) log 2 ( ) ( ) 2 h Y h e S N h N       (8) Since the channel noise is Gaussian, h(N) is given by, 2 1 ( ) log 2 2 h N eN      2 2 max 1 1 I(X;Y) log 2 ( ) log 2 2 2 e S N eN      2 2 1 2 ( ) 1 log log 1 / 2 2 2 e S N S bits sample eN N                   Since 2B samples are transmitted per second over the channel, assuming these samples of x(t) taken at Nyquist rate to be statistically independent max 2 ( : ) C B I X Y  Channel Capacity of band-limited Gaussian Channel Where B= channel bandwidth in Hz S= avg. signal power in watts N= Noise power in watts If 2  is the two sided power spectral density of noise then, N B   Bandwidth and S/N –tradeoff: If bandwidth becomes ‘∞’ , the channel capacity does not becomes infinity because an increase in bandwidth, the noise power also increases. Thus, for a fixed signal power, and in the presence of white Gaussian noise, the channel capacity approaches an upper limit called “Shannon Limit” with bandwidth increased to infinite. 2 log 1 S C B B          2 log 1 bits/second S C B N         2 log 1 bits/sec S C B B         
  • 49.
C = B log2(1 + S/(ηB)) = (S/η) (ηB/S) log2(1 + S/(ηB)) = (S/η) log2 (1 + S/(ηB))^(ηB/S)

We know that lim (1 + x)^(1/x) = e as x → 0. Putting x = S/(ηB), x → 0 as B → ∞, so

lim (1 + S/(ηB))^(ηB/S) = e as B → ∞

and the Shannon limit on channel capacity is

lim C = (S/η) log2 e = 1.44 S/η as B → ∞

Thus, for an AWGN channel, the channel capacity C, instead of going to infinity, approaches a limit as we go on increasing the bandwidth. The capacity can be made infinite only by increasing the signal power S to infinity. For finite signal and noise powers, the channel capacity always remains finite.

Let us consider the trade-off between the bandwidth and the S/N ratio:
Let S/N = 15 and B = 5 kHz. Then

C = B log2(1 + S/N) = 5 log2(16) = 20 kbits/sec

If S/N = 31, the bandwidth for the same channel capacity is obtained from 20 = B log2(32), therefore B = 4 kHz. Since the noise power N = ηB falls in proportion to the bandwidth, the required signal power rises from 15 × 5000η to 31 × 4000η; a 20% reduction in the bandwidth (5 kHz to 4 kHz) therefore requires about a 65% increase in the signal power. Thus, to decrease the bandwidth the signal power has to be increased. Similarly, to decrease the signal power, the bandwidth must be increased.

Ideal System: An ideal system is defined as one that transmits data at a bit rate equal to the channel capacity C, in bits per second. Suppose the energy per bit is Eb. Then the average power transmitted is

S = Eb C

so that S/N = Eb C / (ηB).
  • 50. Hence

C/B = log2 (1 + Eb C / (ηB)) = log2 (1 + (Eb/η)(C/B))

so that

Eb/η = (2^(C/B) − 1) / (C/B)

where C/B is the bandwidth efficiency. But 2^(C/B) = e^((C/B) ln 2). If B >> C, then

2^(C/B) ≈ 1 + (C/B) ln 2

and therefore

Eb/η → ln 2 = 0.693, i.e. 10 log10(ln 2) ≈ −1.6 dB

Replacing C by the bit rate Rb, a plot of Eb/η versus Rb/B is shown in Figure 2.10.
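Before examining the curve of Figure 2.10, the relations derived in this section can be reproduced numerically. The sketch below is illustrative only (the noise power spectral density value is an arbitrary assumption): it evaluates the Shannon-Hartley capacity, shows the approach to the Shannon limit 1.44 S/η as the bandwidth grows, reproduces the bandwidth versus signal-power trade-off example, and tabulates Eb/η = (2^(C/B) − 1)/(C/B) down to the −1.6 dB limit.

```python
from math import log2, log10

def capacity(B, S, eta):
    """Shannon-Hartley capacity of an AWGN channel with N = eta * B."""
    return B * log2(1 + S / (eta * B))

eta = 1e-9                                   # assumed noise PSD in W/Hz (arbitrary)
S = 15 * eta * 5e3                           # signal power giving S/N = 15 at B = 5 kHz

# Trade-off example: the same 20 kbit/s capacity at B = 4 kHz needs S/N = 31
print("C at B = 5 kHz:", round(capacity(5e3, S, eta)), "bits/s")
S_new = 31 * eta * 4e3
print("required power increase:", round(100 * (S_new / S - 1)), "%")   # about 65 %

# Approach to the Shannon limit 1.44 * S / eta as B -> infinity
for B in (1e4, 1e6, 1e8):
    print(f"B = {B:.0e} Hz -> C = {capacity(B, S, eta):,.0f} bits/s")
print("Shannon limit 1.44*S/eta =", round(1.44 * S / eta), "bits/s")

# Eb/eta required at the capacity boundary Rb = C, versus bandwidth efficiency C/B
for r in (0.01, 0.1, 1, 2, 4, 8):
    eb_over_eta = (2 ** r - 1) / r
    print(f"C/B = {r:>5}: Eb/eta = {10 * log10(eb_over_eta):6.2f} dB")
# As C/B -> 0 this approaches ln 2, i.e. about -1.6 dB (the Shannon limit).
```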
  • 51. From the curve, Rb = C (ideal system) is the boundary between two distinct regions. The region Rb < C is the one in which it is possible to make the probability of error arbitrarily small, and it is the region of interest in practice. In the other region, Rb > C, it is not possible to achieve arbitrarily small probabilities of error. From the figure it is evident that when Rb/B is large, a very large value of Eb/η is needed to keep Rb < C, i.e., a large average power is needed.

Suppose the bit rate Rb is fixed and the channel bandwidth is increased so that B >> Rb. In this case, operation in the region Rb < C requires only that Eb/η be slightly greater than the Shannon limit of −1.6 dB. The corresponding signal power required is

S = η Rb ln 2 watts

This represents the minimum signal power required for operation in the Rb < C region. For power-limited operation, therefore, the bandwidth should be very large compared to the bit rate, so that only this minimum signal power is required. But suppose the system has to operate with Rb >> B. Then, from the figure, it is clear that the Eb/η necessary for operation in the Rb < C region is quite large, and the system is said to be operating in the bandwidth-limited condition.

Drill Problems
1. One internet service provider (ISP) gives dial-up connections at 56 kbps. Assume that the telephone connection provides a usable bandwidth of 3.5 kHz. What is the minimum SNR required to support this?
Sol: The minimum SNR is the one at which the channel capacity just equals the required rate, i.e. Rb = C (ideal), so C = 56 kbps and B = 3.5 kHz.
SNR = 2^(C/B) − 1 = 2^(56/3.5) − 1 = 2^16 − 1 = 65,535 ≈ 48.16 dB

2. A Gaussian channel has 1 MHz bandwidth. Calculate the channel capacity if the signal power to noise spectral density ratio S/η is 10^5 Hz. Also, find the maximum information rate.
Sol: Given B = 1 MHz and S/η = 10^5 Hz,
C = B log2(1 + S/(ηB)) = 10^6 log2(1 + 10^5/10^6) = 10^6 log2(1.1) ≈ 1.375 × 10^5 bits/sec
Maximum information rate (as B → ∞), R_max = 1.44 S/η = 1.44 × 10^5 = 144,000 bits/sec
  • 52. 3. A black and white TV picture consists of 525 lines of picture information. Assume that each line consists of 525 picture elements (pixels) and that each element can have 256 brightness levels. Pictures are repeated at a rate of 30 frames/sec. Calculate the average rate of information conveyed by a TV set to a viewer.
Sol: Total number of pixels in one frame = 525 × 525 = 2,75,625 pixels.
Total number of different frames possible = (256)^2,75,625 frames.
Assuming all these frames occur with equal probability, the maximum information content per frame is
H(S) = log2 (256)^2,75,625 = 2,75,625 × 8 = 22.05 × 10^5 bits/frame
Therefore, the average rate of information conveyed by a TV set to a viewer is
R = r H(S) = 30 × 22.05 × 10^5 = 661.5 × 10^5 bits/sec

4. For the given channel matrix, compute the mutual information with P(x1) = P(x2) = 0.5

P(Y/X) = | 2/3  1/3  0   |
         | 0    1/6  5/6 |

Sol: Mutual information I(X;Y) = H(Y) − H(Y/X)
The JPM is found by multiplying the 1st row of the given channel matrix by P(x1) and the 2nd row by P(x2):

P(X,Y) = | 1/3  1/6   0    |
         | 0    1/12  5/12 |

H(Y/X) = −Σ Σ p(xi, yk) log2 p(yk/xi)
       = (1/3) log2(3/2) + (1/6) log2 3 + (1/12) log2 6 + (5/12) log2(6/5)
       = 0.7842 bits/symbol

P(y1) = 1/3, P(y2) = 1/6 + 1/12 = 1/4, P(y3) = 5/12

H(Y) = −Σ p(yk) log2 p(yk) = (1/3) log2 3 + (1/4) log2 4 + (5/12) log2(12/5) = 1.5546 bits/symbol

Mutual information, I(X;Y) = 1.5546 − 0.7842 = 0.7704 bits/symbol.
  • 53. 5. Find the mutual information as shown in figure. Sol: From figure, JPM 0.25 0.25 (X,Y) 0.15 0.15 0.1 0.1 P            1 2 3 ( ) 0.25 0.25 0.5 ( ) 0.15 0.15 0.3 ( ) 0.1 0.1 0.2 P x P x P x          1 2 (y ) 0.25 0.15 0.1 0.5 (y ) 0.25 0.15 0.1 0.5 P P         3 2 1 (X) (x )log (x ) j j j H p p      2 2 2 0.5log 0.5 0.3log 0.3 0.2log 0.2 1.485 / bits message      2 2 1 (Y) (y )log (y ) k k k H p p      2 2 0.5log 0.5 0.5log 0.5 1 / bits message     3 2 2 1 1 ( , ) ( , )log ( , ) j k j k j k H X Y p x y p x y       2 2 2 2 2 2 0.25log 0.25 0.25log 0.25 0.15log 0.15 0.15log 0.15 0.1log 0.1 0.1log 0.1 2.485 / bits message         ( , ) ( ) ( ) ( , ) I X Y H X H Y H X Y    1.485 1 2.485 0    
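A generic check of this result (illustrative sketch, not part of the original text): computing I(X;Y) = H(X) + H(Y) − H(X,Y) directly from the joint probability matrix confirms that no information is transferred, as expected for a channel whose joint matrix has the same element repeated along each row, i.e. the channel with independent input and output discussed earlier.

```python
from math import log2

def mutual_information(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint probability matrix."""
    H = lambda dist: -sum(p * log2(p) for p in dist if p > 0)
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return H(px) + H(py) - H([p for row in pxy for p in row])

pxy = [[0.25, 0.25],
       [0.15, 0.15],
       [0.10, 0.10]]
print(round(mutual_information(pxy), 6), "bits/message")   # 0.0
```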
  • 54. 6. Find the mutual information and channel capacity of the channel shown in figure. Given 1 2 ( ) 0.6, ( ) 0.4 P x P x   Sol: The channel matrix,   0.8 0.2 (Y/ X) 0.3 0.7 D P         The JPM,   0.48 0.12 (X,Y) 0.12 0.28 P        1 2 (y ) 0.48 0.12 0.6 (y ) 0.12 0.28 0.4 P P         0.48/ 0.6 0.12 / 0.4 0.8 0.3 (X/ Y) 0.12 / 0.6 0.28/ 0.4 0.2 0.7 P               2 2 1 (X) (x )log (x ) j j j H p p      2 2 0.6log 0.6 0.4log 0.4 0.971 / bits message     2 2 2 1 1 ( / ) ( , )log ( / ) j k j k j k H X Y p x y p x y       2 2 2 2 0.48log 0.8 0.12log 0.3 0.12log 0.2 0.28log 0.7 0.786 / bits message       Mutual information, ( , ) ( ) ( / ) I X Y H X H X Y   0.971 0.786 0.185 / bits message    Finding channel capacity by MURGOAs Method, 11 12 21 22 0.8, 0.2, 0.3 0.7 P P P P     1 2 2 2 2 2 0.8log 0.8 0.2log 0.2 0.8 0.2 0.3log 0.3 0.7log 0.7 0.3 0.7 Q Q                      Upon solving for Q1 and Q2 yields, 1 2 0.6568, 0.9764 Q Q     The channel capacity, 1 2 2 log 2 2 Q Q C         0.6568 0.9764 2 2 log 2 2 log 0.633 0.513 0.2 bit/message           