Compression Fundamentals
Topics today…
- Why Compression?
- Information Theory Basics
- Classification of Compression Algorithms
- Data Compression Model
- Compression Performance
Why Compression?
- Digital representation of analog signals requires huge storage.
- A high-quality audio signal requires about 1.5 megabits/sec.
- A low-resolution movie (30 frames per second, 640 x 480 pixels per frame, 24 bits per pixel) requires roughly 210 megabits per second, i.e., about 95 gigabytes per hour!
- Transferring such files through the limited bandwidth of available networks is challenging.
Why Compression? Table 1: Uncompressed source data rates

Source                        | Bit rate for uncompressed source (approximate)
Telephony (200-3400 Hz)       | 8000 samples/sec x 12 bits/sample = 96 kbps
Wideband audio (20-20000 Hz)  | 44100 samples/sec x 2 channels x 16 bits/sample = 1.412 Mbps
Images                        | 512x512 pixel color image x 24 bits/pixel = 6.3 Mbits/image
Video                         | 640x480 pixel color image x 24 bits/pixel x 30 images/sec = 221 Mbps (a 650-megabyte CD stores only about 23.5 seconds of such video!)
HDTV                          | 1280x720 pixel color image x 60 images/sec x 24 bits/pixel = 1.3 Gbps
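The table's arithmetic is easy to check. A minimal Python sketch (the helper name and the 650 MB CD figure are illustrative assumptions, not from any standard library):

```python
def uncompressed_rate_bps(samples_per_sec, bits_per_sample, channels=1):
    """Raw bit rate of an uncompressed sampled source, in bits/second."""
    return samples_per_sec * bits_per_sample * channels

print(uncompressed_rate_bps(8000, 12) / 1e3, "kbps")               # telephony: 96.0
print(uncompressed_rate_bps(44100, 16, channels=2) / 1e6, "Mbps")  # wideband audio: 1.4112

video_bps = uncompressed_rate_bps(640 * 480 * 30, 24)              # pixels/sec x bits/pixel
print(video_bps / 1e6, "Mbps")                                     # video: ~221
print(650 * 8e6 / video_bps, "seconds on a 650 MB CD")             # ~23.5
```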
The compression problem
- Efficient digital representation of a source.
- Data compression is the representation of the source in digital form with as few bits as possible while maintaining an acceptable loss in fidelity.
- The source can be data, still images, speech, audio, video, or whatever signal needs to be stored and transmitted.
Synonyms for Data Compression
- Signal compression, signal coding
- Source coding, and source coding with a fidelity criterion (in information theory)
- Noiseless and noisy source coding (lossless and lossy compression); "noise" here refers to reconstruction noise
- Bandwidth compression, redundancy removal (more dated terminology from the 1980s)
Types of Data Compression Problem
- Distortion-rate problem: given a constraint on transmitted data rate or storage capacity, compress the source at or below this rate but at the highest fidelity possible. Examples: voice mail, video conferencing, digital cellular.
- Rate-distortion problem: given a constraint on fidelity, achieve it with as few bits as possible. Example: CD-quality audio.
Information Theory Basics
- The representation of data is the combination of information and redundancy.
- Data compression is essentially a redundancy-reduction technique.
- A data compression scheme can be broadly divided into two phases: modeling and coding.
Information Theory Basics [2]
- In the modeling phase, information about redundancy is analyzed and represented as a model. This can be done by observing the empirical distribution of the symbols the source generates.
- In the coding phase, the difference between the actual data and the model is coded.
Discrete Memoryless Model
- A source is discrete memoryless if it generates symbols that are statistically independent of one another.
- It is described by the source alphabet A = {a_1, a_2, a_3, ..., a_n} and the associated probabilities P = (p(a_1), p(a_2), p(a_3), ..., p(a_n)).
- The amount of information content for a source symbol a_i is
  I(a_i) = log2 (1 / p(a_i)) = -log2 p(a_i)
- The base-2 logarithm means the information content is expressed in bits; higher-probability symbols are coded with fewer bits.
Discrete Memoryless Model [2]
- Averaging the information content over all symbols, we get the entropy E as follows:
  E = sum_i p(a_i) I(a_i) = -sum_i p(a_i) log2 p(a_i)  bits per symbol
- Hence, entropy is the expected length of a binary code over all the symbols.
- Estimation of the entropy depends on the observations and the assumptions made about the structure of the source symbols.
Noiseless source coding theorem The  Noiseless Source Coding Theorem  states that any source can be losslessly encoded with a code whose average number of bits per source symbol is arbitrarily close to, but not less than, the source entropy  E  in bits by coding infinitely long extensions of the source.
Entropy Reduction
- Consider a discrete memoryless source with source alphabet A1 = {α, β, γ, δ} and probabilities p(α) = 0.65, p(β) = 0.20, p(γ) = 0.10, p(δ) = 0.05.
- The entropy of this source is
  E = -(0.65 log2 0.65 + 0.20 log2 0.20 + 0.10 log2 0.10 + 0.05 log2 0.05) = 1.42 bits per symbol
- A data source of 2000 such symbols can therefore be represented using 2000 x 1.42 = 2840 bits.
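This entropy value is easy to verify numerically. A minimal Python sketch (the slide's 2840 bits comes from rounding E to 1.42 before multiplying):

```python
import math

def entropy(probs):
    """Shannon entropy in bits/symbol: E = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

A1 = {"alpha": 0.65, "beta": 0.20, "gamma": 0.10, "delta": 0.05}
E = entropy(A1.values())
print(f"E = {E:.4f} bits/symbol")                   # E = 1.4165
print(f"2000 symbols need >= {2000 * E:.0f} bits")  # ~2833 bits
```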
Entropy Reduction [2]
- Now assume we know something about the structure of the sequence.
- Alphabet A2 = {0, 1, 2, 3}, sequence D = 0 1 1 2 3 3 3 3 3 3 3 3 3 2 2 2 3 3 3 3
- p(0) = 0.05, p(1) = 0.10, p(2) = 0.20, and p(3) = 0.65, so E = 1.42 bits per symbol, as before.
- To exploit the correlation between consecutive samples, we attempt to reduce it by taking the difference r_i = s_i - s_{i-1} for each sample s_i.
Entropy Reduction [3]
- Now D = 0 1 0 1 1 0 0 0 0 0 0 0 0 -1 0 0 1 0 0 0 over the alphabet A2 = {-1, 0, 1}
- p(-1) = 0.05, p(1) = 0.20, and p(0) = 0.75, giving E = 0.992 bits per symbol
- With an appropriate entropy coding technique, this lower entropy translates into more compression.
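A minimal sketch of this differencing step, computing the empirical entropy before and after (sequence and probabilities taken from the slides):

```python
import math
from collections import Counter

def empirical_entropy(seq):
    """Entropy in bits/symbol estimated from observed frequencies."""
    counts, n = Counter(seq), len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

D = [0, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3]
# Differencing r_i = s_i - s_{i-1}; the first sample is kept as-is.
R = [D[0]] + [b - a for a, b in zip(D, D[1:])]

print(f"H(D) = {empirical_entropy(D):.3f} bits/symbol")  # 1.417
print(f"H(R) = {empirical_entropy(R):.3f} bits/symbol")  # 0.992
```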
Unique Decipherability
- Consider the following table of candidate codes (the original codeword table was an image lost in export; the codewords shown here are a representative set chosen so the properties discussed next hold):

  Symbol | Code A | Code B | Code C
  α      | 00     | 0      | 0
  β      | 01     | 10     | 1
  γ      | 10     | 110    | 01
  δ      | 11     | 111    | 11

- Symbols are encoded with codes A, B, and C. Consider the string S = ααγαβαδ.
Unique Decipherability [2]
- Deciphering C_A(S) and C_B(S) is unambiguous, and we recover the string S.
- C_C(S) is ambiguous and thus not uniquely decipherable.
- Fixed-length codes are always uniquely decipherable; not all variable-length codes are.
Unique Decipherability [3]
- A sufficient condition for unique decipherability is the prefix property: no codeword in the code set forms the prefix of another, distinct codeword (a minimal check is sketched below).
- Popular variable-length coding techniques: Shannon-Fano coding, Huffman coding, Elias coding, arithmetic coding.
- Fixed-length codes can be treated as a special case of uniquely decipherable variable-length codes.
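A small Python sketch of the prefix-property check, using the illustrative codes from the table above:

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of another distinct codeword."""
    words = sorted(codewords)  # after sorting, any prefix violation appears between adjacent words
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

code_A = ["00", "01", "10", "11"]   # fixed-length: trivially prefix-free
code_B = ["0", "10", "110", "111"]  # variable-length, prefix-free
code_C = ["0", "1", "01", "11"]     # "0" is a prefix of "01": not uniquely decipherable

for name, code in [("A", code_A), ("B", code_B), ("C", code_C)]:
    print(name, is_prefix_free(code))  # A True, B True, C False
```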
Classification of compression algorithms
[Diagram from the slide: input data D -> compression (coder) -> c(D) -> storage/transmission -> decompression (decoder) -> reconstructed data D'; the coder and decoder together form a CODEC.]
Classification of compression algorithms [2]
- Data compression is a method that takes input data D and generates a shorter representation of the data, c(D), with fewer bits than D.
- The reverse process is called decompression; it takes the compressed data c(D) and generates or reconstructs the data D'.
- The compression (coding) and decompression (decoding) systems together are sometimes called a "CODEC."
Classification of compression algorithms [3]
- If the reconstructed data D' is an exact replica of the original data D, we call the algorithms applied to compress D and decompress c(D) lossless; otherwise the algorithms are lossy.
- Text, scientific data, and medical images are some of the applications that require lossless compression.
- Compression can be static or dynamic, depending on the coding scheme used. (A lossless round trip is sketched below.)
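The lossless round trip D -> c(D) -> D' can be demonstrated with zlib from Python's standard library (the input string is chosen arbitrarily):

```python
import zlib

D = b"abracadabra " * 100        # highly redundant input
cD = zlib.compress(D, level=9)   # compression: c(D)
D_prime = zlib.decompress(cD)    # decompression: reconstruct D'

assert D_prime == D              # lossless: D' is an exact replica of D
print(f"{len(D)} bytes -> {len(cD)} bytes")
```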
Data compression model
A data compression system mainly consists of three major steps:
- removal or reduction of data redundancy
- reduction in entropy
- entropy encoding
Data compression model: REDUCTION IN DATA REDUNDANCY
- Removal or reduction of data redundancy is typically achieved by transforming the original data from one form or representation to another.
- Popular transformation techniques are the Discrete Cosine Transform (DCT), the Discrete Wavelet Transform (DWT), etc.
- This step leads to the reduction of entropy.
- For lossless compression, this transformation is completely reversible.
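A hedged sketch of why such transforms help, using SciPy's DCT-II on a correlated signal (the signal and sizes are arbitrary): the transform packs most of the signal energy into a few coefficients, which the later quantization and entropy-coding steps can exploit.

```python
import numpy as np
from scipy.fft import dct

n = 64
# A smooth, correlated signal, as in a typical image row or audio frame
x = np.cos(2 * np.pi * np.arange(n) / n) \
    + 0.1 * np.random.default_rng(0).standard_normal(n)

X = dct(x, norm="ortho")  # orthonormal DCT-II of the signal

# Energy compaction: a handful of coefficients carry almost all the energy
sorted_energy = np.sort(X ** 2)[::-1]
share = sorted_energy[:8].sum() / sorted_energy.sum()
print(f"Top 8 of {n} DCT coefficients hold {share:.1%} of the energy")
```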
Data compression model: REDUCTION IN ENTROPY
- A non-reversible process, achieved by dropping insignificant information in the transformed data (lossy!).
- It is done by quantization techniques; the amount of quantization dictates the quality of the reconstructed data.
- The entropy of the quantized data is lower than that of the original, hence more compression. (A uniform-quantizer sketch follows.)
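A minimal sketch of uniform quantization, the simplest such technique (the step size is chosen arbitrarily); the reconstruction error is bounded by half the step size:

```python
import numpy as np

def quantize(x, step):
    """Uniform quantization: map each value to the nearest multiple of `step`."""
    return np.round(x / step).astype(int)

def dequantize(q, step):
    return q * step

x = np.array([0.93, -0.11, 2.47, 0.02, -1.58])
q = quantize(x, step=0.5)        # [ 2  0  5  0 -3]: small ints, lower entropy
x_hat = dequantize(q, step=0.5)  # [ 1.0  0.0  2.5  0.0 -1.5]
print("max error:", np.max(np.abs(x - x_hat)))  # <= step/2 = 0.25
```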
Compression Performance
The performance measures of data compression algorithms can be looked at from different perspectives, depending on the application requirements:
- amount of compression achieved
- objective and subjective quality of the reconstructed data
- relative complexity of the algorithm
- speed of execution, etc.
Compression Performance: AMOUNT OF COMPRESSION ACHIEVED
- Compression ratio: the ratio of the number of bits needed to represent the original data to the number of bits needed to represent the compressed data.
- The compression ratio achievable by a lossless compression scheme is totally input-data dependent: sources with less redundancy have higher entropy and are therefore harder to compress (demonstrated below).
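This input dependence is easy to see with zlib (the data sizes are arbitrary): a highly redundant source compresses well, while high-entropy random bytes do not:

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Original size divided by compressed size."""
    return len(data) / len(zlib.compress(data, level=9))

redundant = b"ab" * 5000          # low entropy: very redundant
random_bytes = os.urandom(10000)  # high entropy: essentially no redundancy

print(f"redundant:    {compression_ratio(redundant):.1f}:1")     # large ratio
print(f"random bytes: {compression_ratio(random_bytes):.2f}:1")  # ~1:1 or slightly worse
```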
Compression Performance: SUBJECTIVE QUALITY METRIC
- MOS, the mean observers score or mean opinion score, is a common measure.
- A statistically significant number of observers is randomly chosen to evaluate the visual quality of the reconstructed images.
- Each observer assigns a numeric score to each reconstructed image based on his or her perception of its quality, say within a range 1-5, with 5 being the highest quality and 1 the worst.
- The MOS is the average of these scores.
Compression Performance: OBJECTIVE QUALITY METRIC
- Common quality metrics are root-mean-squared error (RMSE), signal-to-noise ratio (SNR), and peak signal-to-noise ratio (PSNR).
- If I is an M x N image and I' is the corresponding reconstructed image after compression and decompression, the RMSE is calculated by
  RMSE = sqrt( (1 / (M*N)) * sum over i,j of (I(i,j) - I'(i,j))^2 )
- The SNR in decibel units (dB) is expressed as
  SNR = 20 log10 ( sqrt( (1 / (M*N)) * sum over i,j of I(i,j)^2 ) / RMSE )
  and, for B-bit images, PSNR = 20 log10 ( (2^B - 1) / RMSE ).
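A short NumPy sketch of these metrics (the test images here are synthetic placeholders, not real compression output):

```python
import numpy as np

def rmse(I, I_rec):
    return np.sqrt(np.mean((I.astype(float) - I_rec.astype(float)) ** 2))

def psnr(I, I_rec, peak=255.0):
    """PSNR in dB for images whose maximum pixel value is `peak`."""
    return 20 * np.log10(peak / rmse(I, I_rec))

rng = np.random.default_rng(0)
I = rng.integers(0, 256, size=(512, 512))                 # stand-in "original" image
I_rec = np.clip(I + rng.normal(0, 2.0, I.shape), 0, 255)  # mildly distorted copy

print(f"RMSE = {rmse(I, I_rec):.2f}, PSNR = {psnr(I, I_rec):.1f} dB")
```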
Compression Performance: CODING DELAY AND COMPLEXITY
- Coding delay is a performance measure for compression algorithms where interactive encoding and decoding is the requirement (e.g., interactive video teleconferencing, online image browsing, real-time voice communication).
- The more complex the compression algorithm, the greater the coding delay; compression system designers therefore often use a less sophisticated algorithm for such systems.
Compression Performance: CODING DELAY AND COMPLEXITY [2]
- Coding complexity is a performance measure considered where the computational requirement to implement the codec is an important criterion.
- MOPS (millions of operations per second) and MIPS (millions of instructions per second) are often used to measure compression performance on a specific computing engine's architecture.
References
- Chapter 1 of JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures by Tinku Acharya and Ping-Sing Tsai, John Wiley & Sons
- http://discovery.bits-pilani.ac.in/discipline/csis/vimal/course%2006-07%20Second/MMC/Lectures/cf.doc
- Chapter 1 of Digital Compression for Multimedia: Principles & Standards by Jerry D. Gibson
