4. What is Data Compression?
• Data can be text, image, audio, or video information.
• Data compression is about storing and sending a smaller number of bits.
• Data compression is the art or science of
representing information in a compact form.
• We create these compact representations by
identifying and using structures that exist in the
data.
7. Need for Data Compression
• Less storage space
• Less transmission time
• Reduced Bandwidth.
• During compression, data is converted into a coded form, which adds a degree of security to the information.
9. Applications
• In Satellite Communication
• In Radar
• In Mobile Communication
• In Digital Television
• In Banking Software
• Medical applications, and so on.
11. Data Compression Techniques
1. Lossless Compression Techniques:
No loss of information
The original data can be recovered exactly from the compressed data
Used when loss of information is not tolerable
Lower compression ratio
Fidelity and quality are high
Highly robust and reliable
No distortion (exact reconstruction)
Rate is higher
Used for text and image compression
Lossless algorithms: Huffman coding, arithmetic coding
Applications: medical, satellite, banking, etc.
12. Data Compression Techniques
2. Lossy Compression Techniques:
Some loss of information
The original data cannot be recovered exactly from the compressed data
Higher compression ratio
Used when some loss of information is tolerable
Fidelity and quality are lower
Less robust and reliable
Distortion is higher
Rate is lower
Used for audio and video compression
Lossy algorithms: MPEG, JPEG, etc.
Applications: telephony, mobile communication, TV, etc.
13. Measure of performance
A compression algorithm can be evaluated in different
ways:
1. Relative complexity of the algorithm
2. Memory required to implement the algorithm
3. How fast the algorithm performs on a given machine
4. Amount of compression
5. How closely the reconstruction resembles the original data
14. Measure of performance cont..
1. Compression Ratio:
CR = (size of the original data) / (size of the compressed data)
If CR < 1, there is expansion instead of compression.
This is known as negative compression.
For useful compression, the compression ratio is greater than 1.
15. Measures of Performance
2. Compression Factor:
• The compression factor is the inverse of the compression ratio.
• CF = (size of the compressed data) / (size of the original data) = 1 / CR
• For useful compression, the compression factor is less than 1.
• If CF > 1, there is expansion instead of compression.
• Thus, the smaller the factor, the better the compression.
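A minimal sketch of these two measures in Python, using hypothetical file sizes:

```python
def compression_ratio(original_size, compressed_size):
    # CR = original / compressed; CR > 1 means compression, CR < 1 means expansion
    return original_size / compressed_size

def compression_factor(original_size, compressed_size):
    # CF = compressed / original = 1 / CR; the smaller, the better the compression
    return compressed_size / original_size

# Hypothetical example: a 65,536-byte image stored in 16,384 bytes
print(compression_ratio(65536, 16384))   # 4.0  -> a 4:1 compression ratio
print(compression_factor(65536, 16384))  # 0.25
```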
16. Measure of performance cont..
3. Rate:
• The average number of bits required to represent a single sample is called the rate.
• E.g., in image compression the rate is measured in bits/pixel.
4. Distortion:
• The difference between the original data and the reconstructed data is called distortion.
• E.g., distortion arises in lossy compression.
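Distortion is often reported numerically as the mean squared error (MSE) between original and reconstructed samples; a minimal sketch using hypothetical sample values:

```python
def mse_distortion(original, reconstructed):
    """Mean squared error between original and reconstructed samples,
    a common numerical measure of the distortion introduced by lossy compression."""
    assert len(original) == len(reconstructed)
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

# Hypothetical 8-bit samples before and after lossy coding
x  = [100, 102, 105, 110]
xr = [100, 103, 104, 111]
print(mse_distortion(x, xr))   # 0.75
```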
17. Measure of performance cont..
5. Speed:
• The speed of compression can be measured in cycles/byte.
• It is especially important when compression is performed by dedicated hardware.
6. Fidelity and Quality:
• These indicate how closely the reconstructed data resemble the original data.
• When fidelity and quality are high, the difference between the reconstructed and original data is small.
• E.g., lossless compression (image/text).
18. Lossless Compression
• Lossless compression techniques, as their name implies,
involve no loss of information.
• If data have been losslessly compressed, the original data
can be recovered exactly from the compressed data.
• Lossless compression is generally used for applications
that cannot tolerate any difference between the original
and reconstructed data.
• Text compression is an important area for lossless compression. It is very important that the reconstruction is identical to the original text, as very small differences can result in statements with very different meanings.
19. Classification of Lossless Compression Techniques
Lossless techniques are classified into static, adaptive (or dynamic),
and hybrid.
• In a static method the mapping from the set of messages to the set of
code words is fixed before transmission begins, so that a given message is
represented by the same codeword every time it appears in the message
being encoded.
• Static coding requires two passes: one pass to compute probabilities (or
frequencies) and determine the mapping, and a second pass to encode.
• Examples: Static Huffman Coding
• In an adaptive method the mapping from the set of messages to the set
of code words changes over time.
• All of the adaptive methods are one-pass methods; only one scan of the
message is required.
• Examples: LZ77, LZ78, LZW, and Adaptive Huffman Coding
• An algorithm may also be a hybrid, neither completely static nor
completely dynamic.
21. Lossy Compression
• Lossy compression techniques involve some loss of
information, and data that have been compressed using
lossy techniques generally cannot be recovered or
reconstructed exactly.
• In return for accepting this distortion in the reconstruction,
we can generally obtain much higher compression ratios
than is possible with lossless compression.
• In many applications, this lack of exact reconstruction is not
a problem. For example, when storing or transmitting
speech, the exact value of each sample of speech is not
necessary.
• Depending on the quality required of the reconstructed
speech, varying amounts of loss of information about the
value of each sample can be tolerated.
22. Compression Utilities and Formats
• Compression tool examples:
winzip, pkzip, compress, gzip
• General compression formats:
.zip, .gz
• Common image compression formats:
JPEG, JPEG 2000, BMP, GIF, PCX, PNG, TGA, TIFF, WMP
• Common audio (sound) compression formats:
MPEG-1 Layer III (known as MP3), RealAudio (RA, RAM, RP), AU,
Vorbis, WMA, AIFF, WAVE, G.729a
• Common video (sound and image) compression formats:
MPEG-1, MPEG-2, MPEG-4, DivX, QuickTime (MOV), RealVideo (RM),
Windows Media Video (WMV), Video for Windows (AVI), Flash Video (FLV)
23. Modeling and Coding
A compression technique that works well for compressing text may not work well for compressing images.
The development of a data compression algorithm for a particular kind of data can be divided into two phases:
1. Modeling 2. Coding
Modeling:
• In this phase we try to extract information about any redundancy that exists in the data and describe that redundancy in the form of a model.
• A description of the model and a description of how the data differ from the model are encoded, generally using a binary alphabet.
• The difference between the data and the model is often referred to as the residual.
• We obtain compression by transmitting or storing the parameters of the model and the residual sequence.
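A minimal sketch of the model + residual idea, assuming a simple hypothetical "previous-sample" predictive model: each sample is predicted by the one before it, and only the (typically small) residual would need to be coded.

```python
# Hypothetical slowly varying data
data = [21, 22, 24, 25, 25, 27, 28]

# Model: predict each sample by the previous one (first prediction is 0)
prediction = [0] + data[:-1]
residual = [x - p for x, p in zip(data, prediction)]
print(residual)   # [21, 1, 2, 1, 0, 2, 1] -- small values, cheaper to code

# The decoder rebuilds the original data exactly from the residual alone
rebuilt, prev = [], 0
for r in residual:
    prev = prev + r
    rebuilt.append(prev)
assert rebuilt == data
```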
24. Modeling and Coding
Coding:
• In this phase the modeled data (the model parameters and the residual) are encoded into binary sequences, with the choice of technique depending on the simplicity and the quality required.
• Coding techniques commonly used for data compression include Huffman coding, Shannon-Fano coding, and arithmetic coding.
25. Different MODELS
• Physical Models
• Probability Models
• Markov Models
• Composite Source Model
Different Coding
• Uniquely Decodable Codes
• Prefix Codes
• Huffman Codes
• Shannon-Fano Codes
26. Different Types of Models
1. Physical Model:
• If we know something about the physics of the data generation process, we can use that information to construct the model; such a model is called a physical model.
• E.g., in speech applications, knowledge about the physics of speech production is used to construct a mathematical model for the sampled speech process.
• Sampled speech is then encoded using this model.
• In practice, the physics of data generation is usually too complicated to model exactly, so physical models are rarely used.
27. Different Types of Models cont..
2. Probability Model:
• This model is used when little is known about the statistics of the source.
• In this model, a probability is assigned to each letter that the source can generate.
• It assumes that each letter generated by the source is independent of every other letter, which is why it is also called the "ignorance model".
• It is useful for text and image compression.
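A minimal sketch of estimating such a probability ("ignorance") model from a hypothetical sample of source output:

```python
from collections import Counter

# Hypothetical sample of the source output
text = "this is an example of a probability model"

counts = Counter(text)
total = sum(counts.values())
# Independent per-letter probabilities estimated as relative frequencies
probs = {letter: n / total for letter, n in counts.items()}
print(round(probs["a"], 3), round(probs[" "], 3))   # relative frequencies of 'a' and space
```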
28. Different Types of Models cont..
3. Markov Model:
• In this model, the present sample depends on the previous samples (the state of the process).
• This model is useful for lossless data compression.
• A specific type of Markov process is used, called a discrete-time Markov process.
• Let {Xn} be the sequence of observations. It follows a kth-order Markov model if
  P(Xn | Xn-1, Xn-2, …, Xn-k) = P(Xn | Xn-1, Xn-2, …, Xn-k, Xn-k-1, …)
• This means that knowledge of the past k symbols is equivalent to knowledge of the entire past history of the process.
• Xn-1, Xn-2, …, Xn-k are called the state of the process.
• The first-order Markov model is written as
  P(Xn | Xn-1) = P(Xn | Xn-1, Xn-2, …)
• Markov models are more complex but give more accurate results.
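A minimal sketch of estimating a first-order (k = 1) Markov model from a hypothetical sample text, approximating P(Xn | Xn-1) by relative frequencies:

```python
from collections import defaultdict, Counter

# Hypothetical sample text
text = "abracadabra"

# Count how often each letter follows each previous letter
pair_counts = defaultdict(Counter)
for prev, cur in zip(text, text[1:]):
    pair_counts[prev][cur] += 1

# Conditional probabilities P(cur | prev) as relative frequencies
cond_probs = {
    prev: {cur: n / sum(nexts.values()) for cur, n in nexts.items()}
    for prev, nexts in pair_counts.items()
}
print(cond_probs["a"])   # {'b': 0.5, 'c': 0.25, 'd': 0.25}
```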
29. Different Types of Models cont..
4. Composite source Model:
• In many applications it is not easy to describe the source using a single model.
• In such cases a composite source model is useful.
• It combines a number of different sources Si, each with its own model Mi.
• A switch selects which source is active at any given time.
[Figure: Source 1 … Source n feeding a switch that selects the active source]
30. • The first phase is usually referred to as modeling.
• In this phase we try to extract information about any
redundancy that exists in the data and describe the
redundancy in the form of a model.
• The second phase is called coding. A description of
the model and a “description” of how the data differ
from the model are encoded, generally using a binary
alphabet.
• The difference between the data and the model is often referred to as the residual.
31. Coding
When we talk about coding we mean the assignment of binary
sequences to elements of an alphabet. The set of binary sequences
is called a code, and the individual members of the set are called
code words. An alphabet is a collection of symbols called letters.
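The prefix codes listed earlier are codes in which no codeword is a prefix of another, so any concatenation of codewords can be decoded unambiguously. A minimal check of this property, using hypothetical codewords:

```python
def is_prefix_code(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

print(is_prefix_code(["0", "10", "110", "111"]))   # True  -> a prefix code
print(is_prefix_code(["0", "01", "11"]))           # False -> '0' is a prefix of '01'
```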
34. Shannon-Fano Coding
• In the field of data compression, Shannon-Fano coding, named after Claude Shannon and Robert Fano, is a technique for constructing a prefix code based on a set of symbols and their probabilities (estimated or measured).
• The technique was proposed in Shannon's "A Mathematical Theory of Communication", his 1948 article introducing the field of information theory.
• It is suboptimal in the sense that it does not achieve the lowest possible expected codeword length, as Huffman coding does; however, unlike Huffman coding, it does guarantee that all codeword lengths are within one bit of their theoretical ideal.
35. Basic Technique
In Shannon–Fano coding, the symbols are arranged in order
from most probable to least probable, and then divided into
two sets whose total probabilities are as close as possible to
being equal.
All symbols then have the first digits of their codes assigned;
symbols in the first set receive "0" and symbols in the
second set receive "1".
As long as any sets with more than one member remain, the
same process is repeated on those sets, to determine
successive digits of their codes.
When a set has been reduced to one symbol this means the
symbol's code is complete and will not form the prefix of
any other symbol's code.
38. Shannon-Fano Coding Example

Message  Prob   Step 1  Step 2  Step 3  Code  Code length
x1       0.40   0       0               00    2
x2       0.19   0       1               01    2
x3       0.16   1       0               10    2
x4       0.15   1       1       0       110   3
x5       0.10   1       1       1       111   3

1) Step 1 split: x1 and x2 form the upper part (UP) = 0.4 + 0.19 = 0.59;
   x3, x4 and x5 form the lower part (LP) = 0.16 + 0.15 + 0.1 = 0.41.
   UP and LP are as close to equally probable as the ordering allows.
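A minimal sketch of the Shannon-Fano procedure described above, applied to the five messages in the table: the symbol set is repeatedly split into near-equiprobable halves, with the upper part receiving a 0 and the lower part a 1.

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability) pairs.
    Returns {symbol: codeword} built by recursive splitting."""
    code = {}

    def split(items, prefix):
        if len(items) == 1:
            code[items[0][0]] = prefix or "0"
            return
        total = sum(p for _, p in items)
        best_diff, cut = float("inf"), 1
        # Choose the cut that makes the two parts as close to equiprobable as possible
        for i in range(1, len(items)):
            upper = sum(p for _, p in items[:i])
            diff = abs(total - 2 * upper)
            if diff < best_diff:
                best_diff, cut = diff, i
        split(items[:cut], prefix + "0")   # upper part gets a 0
        split(items[cut:], prefix + "1")   # lower part gets a 1

    split(sorted(symbols, key=lambda sp: -sp[1]), "")
    return code

msgs = [("x1", 0.4), ("x2", 0.19), ("x3", 0.16), ("x4", 0.15), ("x5", 0.1)]
print(shannon_fano(msgs))
# {'x1': '00', 'x2': '01', 'x3': '10', 'x4': '110', 'x5': '111'}
```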
50. The Huffman Coding Algorithm
This technique was developed by David Huffman as part of a
class assignment; the class was the first ever in the area of
information theory and was taught by Robert Fano at MIT.
The codes generated using this procedure are called Huffman codes. These codes are prefix codes and are optimum for a given model (set of probabilities).
The Huffman procedure is based on two observations regarding optimum prefix codes:
1. In an optimum code, symbols that occur more frequently (have a higher probability of occurrence) will have shorter codewords than symbols that occur less frequently.
2. In an optimum code, the two symbols that occur least frequently will have codewords of the same length.
51. Design of a Huffman code
Let us design a Huffman code for a source that puts out letters from an alphabet A = {a1, a2, a3, a4, a5} with P(a1) = P(a3) = 0.2, P(a2) = 0.4, and P(a4) = P(a5) = 0.1. The entropy of this source is 2.122 bits/symbol.
To design the Huffman code, we first sort the letters in descending order of probability, as shown in Table 3.1. Here c(ai) denotes the codeword for ai.
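A minimal sketch of the Huffman procedure for this source, using a binary heap. Tie-breaking among equal probabilities can yield different but equally optimal codes, so only the average length (2.2 bits/symbol, against the 2.122 bits/symbol entropy) is fixed.

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code from {symbol: probability}.
    Returns {symbol: codeword}."""
    # Each heap entry: (probability, tie_breaker, {symbol: partial codeword})
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # least probable subtree
        p2, _, c2 = heapq.heappop(heap)   # second least probable subtree
        # Merge the two subtrees, prepending 0 to one and 1 to the other
        merged = {s: "0" + cw for s, cw in c1.items()}
        merged.update({s: "1" + cw for s, cw in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)      # codewords may differ with tie-breaking, but remain optimal
print(avg_len)   # 2.2 bits/symbol
```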