THIRD EDITION
Introduction to
Data Compression
The Morgan Kaufmann Series in Multimedia Information and Systems
Series Editor, Edward A. Fox, Virginia Polytechnic University
Introduction to Data Compression, Third Edition
Khalid Sayood
Understanding Digital Libraries, Second Edition
Michael Lesk
Bioinformatics: Managing Scientific Data
Zoe Lacroix and Terence Critchlow
How to Build a Digital Library
Ian H. Witten and David Bainbridge
Digital Watermarking
Ingemar J. Cox, Matthew L. Miller, and Jeffrey A. Bloom
Readings in Multimedia Computing and Networking
Edited by Kevin Jeffay and HongJiang Zhang
Introduction to Data Compression, Second Edition
Khalid Sayood
Multimedia Servers: Applications, Environments, and Design
Dinkar Sitaram and Asit Dan
Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition
Ian H. Witten, Alistair Moffat, and Timothy C. Bell
Digital Compression for Multimedia: Principles and Standards
Jerry D. Gibson, Toby Berger, Tom Lookabaugh, Dave Lindbergh, and Richard L. Baker
Readings in Information Retrieval
Edited by Karen Sparck Jones and Peter Willett
THIRD EDITION
Introduction to
Data Compression
Khalid Sayood
University of Nebraska
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier
Senior Acquisitions Editor Rick Adams
Publishing Services Manager Simon Crump
Assistant Editor Rachel Roumeliotis
Cover Design Cate Barr
Composition Integra Software Services Pvt. Ltd.
Copyeditor Jessika Bella Mura
Proofreader Jacqui Brownstein
Indexer Northwind Editorial Services
Interior printer Maple Vail Book Manufacturing Group
Cover printer Phoenix Color
Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111
This book is printed on acid-free paper.
©2006 by Elsevier Inc. All rights reserved.
Designations used by companies to distinguish their products are often claimed as trademarks or registered
trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names
appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies
for more complete information regarding trademarks and registration.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by
any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written
permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford,
UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.co.uk. You may
also complete your request on-line via the Elsevier homepage (http://elsevier.com) by selecting “Customer
Support” and then “Obtaining Permissions.”
Library of Congress Cataloging-in-Publication Data
Sayood, Khalid.
Introduction to data compression/Khalid Sayood.—3rd ed.
p. cm.
Includes bibliographical references and index.
ISBN-13: 978-0-12-620862-7
ISBN-10: 0-12-620862-X
1. Data compression (Telecommunication) 2. Coding theory. I. Title
TK5102.92.S39 2005
005.74′6—dc22
2005052759
ISBN 13: 978-0-12-620862-7
ISBN 10: 0-12-620862-X
For information on all Morgan Kaufmann publications,
visit our Web site at www.mkp.com or www.books.elsevier.com
Printed in the United States of America
05 06 07 08 09 5 4 3 2 1
Working together to grow
libraries in developing countries
www.elsevier.com | www.bookaid.org | www.sabre.org
To Füsun
Contents
Preface xvii
1 Introduction 1
1.1 Compression Techniques 3
1.1.1 Lossless Compression 4
1.1.2 Lossy Compression 5
1.1.3 Measures of Performance 5
1.2 Modeling and Coding 6
1.3 Summary 10
1.4 Projects and Problems 11
2 Mathematical Preliminaries for Lossless Compression 13
2.1 Overview 13
2.2 A Brief Introduction to Information Theory 13
2.2.1 Derivation of Average Information  18
2.3 Models 23
2.3.1 Physical Models 23
2.3.2 Probability Models 23
2.3.3 Markov Models 24
2.3.4 Composite Source Model 27
2.4 Coding 27
2.4.1 Uniquely Decodable Codes 28
2.4.2 Prefix Codes 31
2.4.3 The Kraft-McMillan Inequality  32
2.5 Algorithmic Information Theory 35
2.6 Minimum Description Length Principle 36
2.7 Summary 37
2.8 Projects and Problems 38
3 Huffman Coding 41
3.1 Overview 41
3.2 The Huffman Coding Algorithm 41
3.2.1 Minimum Variance Huffman Codes 46
3.2.2 Optimality of Huffman Codes  48
3.2.3 Length of Huffman Codes  49
3.2.4 Extended Huffman Codes  51
3.3 Nonbinary Huffman Codes  55
3.4 Adaptive Huffman Coding 58
3.4.1 Update Procedure 59
3.4.2 Encoding Procedure 62
3.4.3 Decoding Procedure 63
3.5 Golomb Codes 65
3.6 Rice Codes 67
3.6.1 CCSDS Recommendation for Lossless Compression 67
3.7 Tunstall Codes 69
3.8 Applications of Huffman Coding 72
3.8.1 Lossless Image Compression 72
3.8.2 Text Compression 74
3.8.3 Audio Compression 75
3.9 Summary 77
3.10 Projects and Problems 77
4 Arithmetic Coding 81
4.1 Overview 81
4.2 Introduction 81
4.3 Coding a Sequence 83
4.3.1 Generating a Tag 84
4.3.2 Deciphering the Tag 91
4.4 Generating a Binary Code 92
4.4.1 Uniqueness and Efficiency of the Arithmetic Code 93
4.4.2 Algorithm Implementation 96
4.4.3 Integer Implementation 102
4.5 Comparison of Huffman and Arithmetic Coding 109
4.6 Adaptive Arithmetic Coding 112
4.7 Applications 112
4.8 Summary 113
4.9 Projects and Problems 114
5 Dictionary Techniques 117
5.1 Overview 117
5.2 Introduction 117
5.3 Static Dictionary 118
5.3.1 Digram Coding 119
5.4 Adaptive Dictionary 121
5.4.1 The LZ77 Approach 121
5.4.2 The LZ78 Approach 125
5.5 Applications 133
5.5.1 File Compression—UNIX compress 133
5.5.2 Image Compression—The Graphics Interchange Format (GIF) 133
5.5.3 Image Compression—Portable Network Graphics (PNG) 134
5.5.4 Compression over Modems—V.42 bis 136
5.6 Summary 138
5.7 Projects and Problems 139
6 Context-Based Compression 141
6.1 Overview 141
6.2 Introduction 141
6.3 Prediction with Partial Match (ppm) 143
6.3.1 The Basic Algorithm 143
6.3.2 The Escape Symbol 149
6.3.3 Length of Context 150
6.3.4 The Exclusion Principle 151
6.4 The Burrows-Wheeler Transform 152
6.4.1 Move-to-Front Coding 156
6.5 Associative Coder of Buyanovsky (ACB) 157
6.6 Dynamic Markov Compression 158
6.7 Summary 160
6.8 Projects and Problems 161
7 Lossless Image Compression 163
7.1 Overview 163
7.2 Introduction 163
7.2.1 The Old JPEG Standard 164
7.3 CALIC 166
7.4 JPEG-LS 170
7.5 Multiresolution Approaches 172
7.5.1 Progressive Image Transmission 173
7.6 Facsimile Encoding 178
7.6.1 Run-Length Coding 179
7.6.2 CCITT Group 3 and 4—Recommendations T.4 and T.6 180
7.6.3 JBIG 183
7.6.4 JBIG2—T.88 189
7.7 MRC—T.44 190
7.8 Summary 193
7.9 Projects and Problems 193
8 Mathematical Preliminaries for Lossy Coding 195
8.1 Overview 195
8.2 Introduction 195
8.3 Distortion Criteria 197
8.3.1 The Human Visual System 199
8.3.2 Auditory Perception 200
8.4 Information Theory Revisited  201
8.4.1 Conditional Entropy 202
8.4.2 Average Mutual Information 204
8.4.3 Differential Entropy 205
8.5 Rate Distortion Theory  208
8.6 Models 215
8.6.1 Probability Models 216
8.6.2 Linear System Models 218
8.6.3 Physical Models 223
8.7 Summary 224
8.8 Projects and Problems 224
9 Scalar Quantization 227
9.1 Overview 227
9.2 Introduction 227
9.3 The Quantization Problem 228
9.4 Uniform Quantizer 233
9.5 Adaptive Quantization 244
9.5.1 Forward Adaptive Quantization 244
9.5.2 Backward Adaptive Quantization 246
9.6 Nonuniform Quantization 253
9.6.1 pdf-Optimized Quantization 253
9.6.2 Companded Quantization 257
9.7 Entropy-Coded Quantization 264
9.7.1 Entropy Coding of Lloyd-Max Quantizer Outputs 265
9.7.2 Entropy-Constrained Quantization  265
9.7.3 High-Rate Optimum Quantization  266
9.8 Summary 269
9.9 Projects and Problems 270
10 Vector Quantization 273
10.1 Overview 273
10.2 Introduction 273
10.3 Advantages of Vector Quantization over Scalar Quantization 276
10.4 The Linde-Buzo-Gray Algorithm 282
10.4.1 Initializing the LBG Algorithm 287
10.4.2 The Empty Cell Problem 294
10.4.3 Use of LBG for Image Compression 294
10.5 Tree-Structured Vector Quantizers 299
10.5.1 Design of Tree-Structured Vector Quantizers 302
10.5.2 Pruned Tree-Structured Vector Quantizers 303
10.6 Structured Vector Quantizers 303
10.6.1 Pyramid Vector Quantization 305
10.6.2 Polar and Spherical Vector Quantizers 306
10.6.3 Lattice Vector Quantizers 307
10.7 Variations on the Theme 311
10.7.1 Gain-Shape Vector Quantization 311
10.7.2 Mean-Removed Vector Quantization 312
10.7.3 Classified Vector Quantization 313
10.7.4 Multistage Vector Quantization 313
10.7.5 Adaptive Vector Quantization 315
10.8 Trellis-Coded Quantization 316
10.9 Summary 321
10.10 Projects and Problems 322
11 Differential Encoding 325
11.1 Overview 325
11.2 Introduction 325
11.3 The Basic Algorithm 328
11.4 Prediction in DPCM 332
11.5 Adaptive DPCM 337
11.5.1 Adaptive Quantization in DPCM 338
11.5.2 Adaptive Prediction in DPCM 339
11.6 Delta Modulation 342
11.6.1 Constant Factor Adaptive Delta Modulation (CFDM) 343
11.6.2 Continuously Variable Slope Delta Modulation 345
11.7 Speech Coding 345
11.7.1 G.726 347
11.8 Image Coding 349
11.9 Summary 351
11.10 Projects and Problems 352
12 Mathematical Preliminaries for Transforms, Subbands, and Wavelets 355
12.1 Overview 355
12.2 Introduction 355
12.3 Vector Spaces 356
12.3.1 Dot or Inner Product 357
12.3.2 Vector Space 357
12.3.3 Subspace 359
12.3.4 Basis 360
12.3.5 Inner Product—Formal Definition 361
12.3.6 Orthogonal and Orthonormal Sets 361
12.4 Fourier Series 362
12.5 Fourier Transform 365
12.5.1 Parseval’s Theorem 366
12.5.2 Modulation Property 366
12.5.3 Convolution Theorem 367
12.6 Linear Systems 368
12.6.1 Time Invariance 368
12.6.2 Transfer Function 368
12.6.3 Impulse Response 369
12.6.4 Filter 371
12.7 Sampling 372
12.7.1 Ideal Sampling—Frequency Domain View 373
12.7.2 Ideal Sampling—Time Domain View 375
12.8 Discrete Fourier Transform 376
12.9 Z-Transform 378
12.9.1 Tabular Method 381
12.9.2 Partial Fraction Expansion 382
12.9.3 Long Division 386
12.9.4 Z-Transform Properties 387
12.9.5 Discrete Convolution 387
12.10 Summary 389
12.11 Projects and Problems 390
13 Transform Coding 391
13.1 Overview 391
13.2 Introduction 391
13.3 The Transform 396
13.4 Transforms of Interest 400
13.4.1 Karhunen-Loéve Transform 401
13.4.2 Discrete Cosine Transform 402
13.4.3 Discrete Sine Transform 404
13.4.4 Discrete Walsh-Hadamard Transform 404
13.5 Quantization and Coding of Transform Coefficients 407
13.6 Application to Image Compression—JPEG 410
13.6.1 The Transform 410
13.6.2 Quantization 411
13.6.3 Coding 413
13.7 Application to Audio Compression—the MDCT 416
13.8 Summary 419
13.9 Projects and Problems 421
14 Subband Coding 423
14.1 Overview 423
14.2 Introduction 423
14.3 Filters 428
14.3.1 Some Filters Used in Subband Coding 432
14.4 The Basic Subband Coding Algorithm 436
14.4.1 Analysis 436
14.4.2 Quantization and Coding 437
14.4.3 Synthesis 437
14.5 Design of Filter Banks  438
14.5.1 Downsampling  440
14.5.2 Upsampling  443
14.6 Perfect Reconstruction Using Two-Channel Filter Banks  444
14.6.1 Two-Channel PR Quadrature Mirror Filters  447
14.6.2 Power Symmetric FIR Filters  449
14.7 M-Band QMF Filter Banks  451
14.8 The Polyphase Decomposition  454
14.9 Bit Allocation 459
14.10 Application to Speech Coding—G.722 461
14.11 Application to Audio Coding—MPEG Audio 462
14.12 Application to Image Compression 463
14.12.1 Decomposing an Image 465
14.12.2 Coding the Subbands 467
14.13 Summary 470
14.14 Projects and Problems 471
15 Wavelet-Based Compression 473
15.1 Overview 473
15.2 Introduction 473
15.3 Wavelets 476
15.4 Multiresolution Analysis and the Scaling Function 480
15.5 Implementation Using Filters 486
15.5.1 Scaling and Wavelet Coefficients 488
15.5.2 Families of Wavelets 491
15.6 Image Compression 494
15.7 Embedded Zerotree Coder 497
15.8 Set Partitioning in Hierarchical Trees 505
15.9 JPEG 2000 512
15.10 Summary 513
15.11 Projects and Problems 513
16 Audio Coding 515
16.1 Overview 515
16.2 Introduction 515
16.2.1 Spectral Masking 517
16.2.2 Temporal Masking 517
16.2.3 Psychoacoustic Model 518
16.3 MPEG Audio Coding 519
16.3.1 Layer I Coding 520
16.3.2 Layer II Coding 521
16.3.3 Layer III Coding—mp3 522
16.4 MPEG Advanced Audio Coding 527
16.4.1 MPEG-2 AAC 527
16.4.2 MPEG-4 AAC 532
16.5 Dolby AC3 (Dolby Digital) 533
16.5.1 Bit Allocation 534
16.6 Other Standards 535
16.7 Summary 536
17 Analysis/Synthesis and Analysis by Synthesis Schemes 537
17.1 Overview 537
17.2 Introduction 537
17.3 Speech Compression 539
17.3.1 The Channel Vocoder 539
17.3.2 The Linear Predictive Coder (Government Standard LPC-10) 542
17.3.3 Code Excited Linear Predicton (CELP) 549
17.3.4 Sinusoidal Coders 552
17.3.5 Mixed Excitation Linear Prediction (MELP) 555
17.4 Wideband Speech Compression—ITU-T G.722.2 558
17.5 Image Compression 559
17.5.1 Fractal Compression 560
17.6 Summary 568
17.7 Projects and Problems 569
18 Video Compression 571
18.1 Overview 571
18.2 Introduction 571
18.3 Motion Compensation 573
18.4 Video Signal Representation 576
18.5 ITU-T Recommendation H.261 582
18.5.1 Motion Compensation 583
18.5.2 The Loop Filter 584
18.5.3 The Transform 586
18.5.4 Quantization and Coding 586
18.5.5 Rate Control 588
18.6 Model-Based Coding 588
18.7 Asymmetric Applications 590
18.8 The MPEG-1 Video Standard 591
18.9 The MPEG-2 Video Standard—H.262 594
18.9.1 The Grand Alliance HDTV Proposal 597
18.10 ITU-T Recommendation H.263 598
18.10.1 Unrestricted Motion Vector Mode 600
18.10.2 Syntax-Based Arithmetic Coding Mode 600
18.10.3 Advanced Prediction Mode 600
18.10.4 PB-frames and Improved PB-frames Mode 600
18.10.5 Advanced Intra Coding Mode 600
18.10.6 Deblocking Filter Mode 601
18.10.7 Reference Picture Selection Mode 601
18.10.8 Temporal, SNR, and Spatial Scalability Mode 601
18.10.9 Reference Picture Resampling 601
18.10.10 Reduced-Resolution Update Mode 602
18.10.11 Alternative Inter VLC Mode 602
18.10.12 Modified Quantization Mode 602
18.10.13 Enhanced Reference Picture Selection Mode 603
18.11 ITU-T Recommendation H.264, MPEG-4 Part 10, Advanced Video
Coding 603
18.11.1 Motion-Compensated Prediction 604
18.11.2 The Transform 605
18.11.3 Intra Prediction 605
18.11.4 Quantization 606
18.11.5 Coding 608
18.12 MPEG-4 Part 2 609
18.13 Packet Video 610
18.14 ATM Networks 610
18.14.1 Compression Issues in ATM Networks 611
18.14.2 Compression Algorithms for Packet Video 612
18.15 Summary 613
18.16 Projects and Problems 614
A Probability and Random Processes 615
A.1 Probability 615
A.1.1 Frequency of Occurrence 615
A.1.2 A Measure of Belief 616
A.1.3 The Axiomatic Approach 618
A.2 Random Variables 620
A.3 Distribution Functions 621
A.4 Expectation 623
A.4.1 Mean 624
A.4.2 Second Moment 625
A.4.3 Variance 625
A.5 Types of Distribution 625
A.5.1 Uniform Distribution 625
A.5.2 Gaussian Distribution 626
A.5.3 Laplacian Distribution 626
A.5.4 Gamma Distribution 626
A.6 Stochastic Process 626
A.7 Projects and Problems 629
B A Brief Review of Matrix Concepts 631
B.1 A Matrix 631
B.2 Matrix Operations 632
C The Root Lattices 637
Bibliography 639
Index 655
Preface
Within the last decade the use of data compression has become ubiquitous. From mp3 players
whose headphones seem to adorn the ears of most young (and some not so young) people, to
cell phones, to DVDs, to digital television, data compression is an integral part of almost all
information technology. This incorporation of compression into more and more of our lives
also points to a certain degree of maturation of the technology. This maturity is reflected in
the fact that there are fewer differences between this and the previous edition of this book
than there were between the second and first editions. In the second edition we had added
new techniques that had been developed since the first edition of this book came out. In this
edition our purpose is more to include some important topics, such as audio compression,
that had not been adequately covered in the second edition. During this time the field has
not entirely stood still and we have tried to include information about new developments.
We have added a new chapter on audio compression (including a description of the mp3
algorithm). We have added information on new standards such as the new video coding
standard and the new facsimile standard. We have reorganized some of the material in the
book, collecting together various lossless image compression techniques and standards into
a single chapter, and we have updated a number of chapters, adding information that perhaps
should have been there from the beginning.
All this has yet again enlarged the book. However, the intent remains the same: to provide
an introduction to the art or science of data compression. There is a tutorial description
of most of the popular compression techniques followed by a description of how these
techniques are used for image, speech, text, audio, and video compression.
Given the pace of developments in this area, there are bound to be new ones that are
not reflected in this book. In order to keep you informed of these developments, we will
periodically provide updates at http://www.mkp.com.
Audience
If you are designing hardware or software implementations of compression algorithms, or
need to interact with individuals engaged in such design, or are involved in development
of multimedia applications and have some background in either electrical or computer
engineering, or computer science, this book should be useful to you. We have included a
large number of examples to aid in self-study. We have also included discussion of various
multimedia standards. The intent here is not to provide all the details that may be required
to implement a standard but to provide information that will help you follow and understand
the standards documents.
Course Use
The impetus for writing this book came from the need for a self-contained book that could
be used at the senior/graduate level for a course in data compression in either electrical
engineering, computer engineering, or computer science departments. There are problems
and project ideas after most of the chapters. A solutions manual is available from the
publisher. Also at http://sensin.unl.edu/idc/index.html we provide links to various course
homepages, which can be a valuable source of project ideas and support material.
The material in this book is too much for a one semester course. However, with judicious
use of the starred sections, this book can be tailored to fit a number of compression
courses that emphasize various aspects of compression. If the course emphasis is on lossless
compression, the instructor could cover most of the sections in the first seven chapters. Then,
to give a taste of lossy compression, the instructor could cover Sections 1–5 of Chapter 9,
followed by Chapter 13 and its description of JPEG, and Chapter 18, which describes video
compression approaches used in multimedia communications. If the class interest is more
attuned to audio compression, then instead of Chapters 13 and 18, the instructor could cover
Chapters 14 and 16. If the latter option is taken, depending on the background of the students
in the class, Chapter 12 may be assigned as background reading. If the emphasis is to be on
lossy compression, the instructor could cover Chapter 2, the first two sections of Chapter
3, Sections 4 and 6 of Chapter 4 (with a cursory overview of Sections 2 and 3), Chapter 8,
selected parts of Chapter 9, and Chapters 10 through 15. At this point, depending on the time
available and the interests of the instructor and the students, portions of the remaining three
chapters can be covered. I have always found it useful to assign a term project in which the
students can follow their own interests as a means of covering material that is not covered
in class but is of interest to the student.
Approach
In this book, we cover both lossless and lossy compression techniques with applications to
image, speech, text, audio, and video compression. The various lossless and lossy coding
techniques are introduced with just enough theory to tie things together. The necessary
theory is introduced just before we need it. Therefore, there are three mathematical prelim-
inaries chapters. In each of these chapters, we present the mathematical material needed to
understand and appreciate the techniques that follow.
Although this book is an introductory text, the word introduction may have a different
meaning for different audiences. We have tried to accommodate the needs of different
audiences by taking a dual-track approach. Wherever we felt there was material that could
enhance the understanding of the subject being discussed but could still be skipped without
seriously hindering your understanding of the technique, we marked those sections with a
star (★). If you are primarily interested in understanding how the various techniques function,
especially if you are using this book for self-study, we recommend you skip the starred
sections, at least in a first reading. Readers who require a slightly more theoretical approach
should use the starred sections. Except for the starred sections, we have tried to keep the
mathematics to a minimum.
Learning from This Book
I have found that it is easier for me to understand things if I can see examples. Therefore, I
have relied heavily on examples to explain concepts. You may find it useful to spend more
time with the examples if you have difficulty with some of the concepts.
Compression is still largely an art and to gain proficiency in an art we need to get a “feel”
for the process. We have included software implementations for most of the techniques
discussed in this book, along with a large number of data sets. The software and data sets
can be obtained from ftp://ftp.mkp.com/pub/Sayood/. The programs are written in C and have
been tested on a number of platforms. The programs should run under most flavors of UNIX
machines and, with some slight modifications, under other operating systems as well. More
detailed information is contained in the README file in the pub/Sayood directory.
You are strongly encouraged to use and modify these programs to work with your
favorite data in order to understand some of the issues involved in compression. A useful and
achievable goal should be the development of your own compression package by the time
you have worked through this book. This would also be a good way to learn the trade-offs
involved in different approaches. We have tried to give comparisons of techniques wherever
possible; however, different types of data have their own idiosyncrasies. The best way to
know which scheme to use in any given situation is to try them.
Content and Organization
The organization of the chapters is as follows: We introduce the mathematical preliminaries
necessary for understanding lossless compression in Chapter 2; Chapters 3 and 4 are devoted
to coding algorithms, including Huffman coding, arithmetic coding, Golomb-Rice codes,
and Tunstall codes. Chapters 5 and 6 describe many of the popular lossless compression
schemes along with their applications. The schemes include LZW, ppm, BWT, and DMC,
among others. In Chapter 7 we describe a number of lossless image compression algorithms
and their applications in a number of international standards. The standards include the JBIG
standards and various facsimile standards.
Chapter 8 is devoted to providing the mathematical preliminaries for lossy compression.
Quantization is at the heart of most lossy compression schemes. Chapters 9 and 10 are
devoted to the study of quantization. Chapter 9 deals with scalar quantization, and Chapter
10 deals with vector quantization. Chapter 11 deals with differential encoding techniques,
in particular differential pulse code modulation (DPCM) and delta modulation. Included in
this chapter is a discussion of the CCITT G.726 standard.
Chapter 12 is our third mathematical preliminaries chapter. The goal of this chapter is to
provide the mathematical foundation necessary to understand some aspects of the transform,
subband, and wavelet-based techniques that are described in the next three chapters. As in
the case of the previous mathematical preliminaries chapters, not all material covered is
necessary for everyone. We describe the JPEG standard in Chapter 13, the CCITT G.722
international standard in Chapter 14, and EZW, SPIHT, and JPEG 2000 in Chapter 15.
Chapter 16 is devoted to audio compression. We describe the various MPEG audio
compression schemes in this chapter including the scheme popularly known as mp3.
Chapter 17 covers techniques in which the data to be compressed are analyzed, and a
model for the generation of the data is transmitted to the receiver. The receiver uses this
model to synthesize the data. These analysis/synthesis and analysis by synthesis schemes
include linear predictive schemes used for low-rate speech coding and the fractal compres-
sion technique. We describe the federal government LPC-10 standard. Code-excited linear
prediction (CELP) is a popular example of an analysis by synthesis scheme. We also discuss
three CELP-based standards, the federal standard 1016, the CCITT G.728 international stan-
dard, and the relatively new wideband speech compression standard G.722.2. We have also
included a discussion of the mixed excitation linear prediction (MELP) technique, which is
the new federal standard for speech coding at 2.4 kbps.
Chapter 18 deals with video coding. We describe popular video coding techniques via
description of various international standards, including H.261, H.264, and the various MPEG
standards.
A Personal View
For me, data compression is more than a manipulation of numbers; it is the process of
discovering structures that exist in the data. In the 11th century, the poet Omar Khayyam
wrote
The moving finger writes, and having writ,
moves on; not all thy piety nor wit,
shall lure it back to cancel half a line,
nor all thy tears wash out a word of it.
(The Rubaiyat of Omar Khayyam)
To explain these few lines would take volumes. They tap into a common human expe-
rience so that in our mind’s eye, we can reconstruct what the poet was trying to convey
centuries ago. To understand the words we not only need to know the language, we also
need to have a model of reality that is close to that of the poet. The genius of the poet lies
in identifying a model of reality that is so much a part of our humanity that centuries later
and in widely diverse cultures, these few words can evoke volumes.
Data compression is much more limited in its aspirations, and it may be presumptuous to
mention it in the same breath as poetry. But there is much that is similar to both endeavors.
Data compression involves identifying models for the many different types of structures
that exist in different types of data and then using these models, perhaps along with the
perceptual framework in which these data will be used, to obtain a compact representation
of the data. These structures can be in the form of patterns that we can recognize simply
by plotting the data, or they might be statistical structures that require a more mathematical
approach to comprehend.
In The Long Dark Teatime of the Soul by Douglas Adams, the protagonist finds that he
can enter Valhalla (a rather shoddy one) if he tilts his head in a certain way. Appreciating
the structures that exist in data sometimes requires us to tilt our heads in a certain way. There
are an infinite number of ways we can tilt our head and, in order not to get a pain in the
neck (carrying our analogy to absurd limits), it would be nice to know some of the ways that
will generally lead to a profitable result. One of the objectives of this book is to provide you
with a frame of reference that can be used for further exploration. I hope this exploration
will provide as much enjoyment for you as it has given to me.
Acknowledgments
It has been a lot of fun writing this book. My task has been made considerably easier and
the end product considerably better because of the help I have received. Acknowledging that
help is itself a pleasure.
The first edition benefitted from the careful and detailed criticism of Roy Hoffman from
IBM, Glen Langdon from the University of California at Santa Cruz, Debra Lelewer from
California Polytechnic State University, Eve Riskin from the University of Washington,
Ibrahim Sezan from Kodak, and Peter Swaszek from the University of Rhode Island. They
provided detailed comments on all or most of the first edition. Nasir Memon from Polytechnic
University, Victor Ramamoorthy then at S3, Grant Davidson at Dolby Corporation, Hakan
Caglar, who was then at TÜBITAK in Istanbul, and Allen Gersho from the University of
California at Santa Barbara reviewed parts of the manuscript.
For the second edition Steve Tate at the University of North Texas, Sheila Horan at
New Mexico State University, Edouard Lamboray at Oerlikon Contraves Group, Steven
Pigeon at the University of Montreal, and Jesse Olvera at Raytheon Systems reviewed the
entire manuscript. Emin Anarım of Boğaziçi University and Hakan Çağlar helped me with
the development of the chapter on wavelets. Mark Fowler provided extensive comments on
Chapters 12–15, correcting mistakes of both commission and omission. Tim James, Devajani
Khataniar, and Lance Pérez also read and critiqued parts of the new material in the second
edition. Chloeann Nelson, along with trying to stop me from splitting infinitives, also tried
to make the first two editions of the book more user-friendly.
Since the appearance of the first edition, various readers have sent me their comments
and critiques. I am grateful to all who sent me comments and suggestions. I am especially
grateful to Roberto Lopez-Hernandez, Dirk vom Stein, Christopher A. Larrieu, Ren Yih
Wu, Humberto D’Ochoa, Roderick Mills, Mark Elston, and Jeerasuda Keesorth for pointing
out errors and suggesting improvements to the book. I am also grateful to the various
instructors who have sent me their critiques. In particular I would like to thank Bruce
Bomar from the University of Tennessee, Mark Fowler from SUNY Binghamton, Paul Amer
from the University of Delaware, K.R. Rao from the University of Texas at Arlington,
Ralph Wilkerson from the University of Missouri–Rolla, Adam Drozdek from Duquesne
University, Ed Hong and Richard Ladner from the University of Washington, Lars Nyland
from the Colorado School of Mines, Mario Kovac from the University of Zagreb, and Pierre
Jouvelet from the Ecole Superieure des Mines de Paris.
Frazer Williams and Mike Hoffman, from my department at the University of Nebraska,
provided reviews for the first edition of the book. Mike read the new chapters in the second
and third edition in their raw form and provided me with critiques that led to major rewrites.
His insights were always helpful and the book carries more of his imprint than he is perhaps
aware of. It is nice to have friends of his intellectual caliber and generosity. Rob Maher
at Montana State University provided me with an extensive critique of the new chapter on
audio compression pointing out errors in my thinking and gently suggesting corrections. I
thank him for his expertise, his time, and his courtesy.
Rick Adams, Rachel Roumeliotis, and Simon Crump at Morgan Kaufmann had the task
of actually getting the book out. This included the unenviable task of getting me to meet
deadlines. Vytas Statulevicius helped me with LaTex problems that were driving me up the
wall.
Most of the examples in this book were generated in a lab set up by Andy Hadenfeldt.
James Nau helped me extricate myself out of numerous software puddles giving freely of
his time. In my times of panic, he was always just an email or voice mail away.
I would like to thank the various “models” for the data sets that accompany this book
and were used as examples. The individuals in the images are Sinan Sayood, Sena Sayood,
and Elif Sevuktekin. The female voice belongs to Pat Masek.
This book reflects what I have learned over the years. I have been very fortunate in the
teachers I have had. David Farden, now at North Dakota State University, introduced me
to the area of digital communication. Norm Griswold at Texas A&M University introduced
me to the area of data compression. Jerry Gibson, now at the University of California at Santa
Barbara, was my Ph.D. advisor and helped me get started on my professional career. The
world may not thank him for that, but I certainly do.
I have also learned a lot from my students at the University of Nebraska and Boğaziçi
University. Their interest and curiosity forced me to learn and kept me in touch with the
broad field that is data compression today. I learned at least as much from them as they
learned from me.
Much of this learning would not have been possible but for the support I received from
NASA. The late Warner Miller and Pen-Shu Yeh at the Goddard Space Flight Center and
Wayne Whyte at the Lewis Research Center were a source of support and ideas. I am truly
grateful for their helpful guidance, trust, and friendship.
Our two boys, Sena and Sinan, graciously forgave my evenings and weekends at work.
They were tiny (witness the images) when I first started writing this book. Soon I will
have to look up when talking to them. “The book” has been their (sometimes unwanted)
companion through all these years. For their graciousness and for always being such perfect
joys, I thank them.
Above all the person most responsible for the existence of this book is my partner and
closest friend Füsun. Her support and her friendship give me the freedom to do things I
would not otherwise even consider. She centers my universe and, as with every significant
endeavor that I have undertaken since I met her, this book is at least as much hers as it is
mine.
1
Introduction
In the last decade we have been witnessing a transformation—some call it
a revolution—in the way we communicate, and the process is still under
a revolution—in the way we communicate, and the process is still under
way. This transformation includes the ever-present, ever-growing Internet; the
explosive development of mobile communications; and the ever-increasing
importance of video communication. Data compression is one of the enabling
technologies for each of these aspects of the multimedia revolution. It would not be practical
to put images, let alone audio and video, on websites if it were not for data compression
algorithms. Cellular phones would not be able to provide communication with increasing
clarity were it not for compression. The advent of digital TV would not be possible without
compression. Data compression, which for a long time was the domain of a relatively small
group of engineers and scientists, is now ubiquitous. Make a long-distance call and you
are using compression. Use your modem, or your fax machine, and you will benefit from
compression. Listen to music on your mp3 player or watch a DVD and you are being
entertained courtesy of compression.
So, what is data compression, and why do we need it? Most of you have heard of JPEG
and MPEG, which are standards for representing images, video, and audio. Data compression
algorithms are used in these standards to reduce the number of bits required to represent
an image or a video sequence or music. In brief, data compression is the art or science
of representing information in a compact form. We create these compact representations
by identifying and using structures that exist in the data. Data can be characters in a text
file, numbers that are samples of speech or image waveforms, or sequences of numbers
that are generated by other processes. The reason we need data compression is that more
and more of the information that we generate and use is in digital form—in the form
of numbers represented by bytes of data. And the number of bytes required to represent
multimedia data can be huge. For example, in order to digitally represent 1 second of
video without compression (using the CCIR 601 format), we need more than 20 megabytes,
or 160 megabits. If we consider the number of seconds in a movie, we can easily see
why we would need compression. To represent 2 minutes of uncompressed CD-quality
music (44,100 samples per second, 16 bits per sample) requires more than 84 million bits.
Downloading music from a website at these rates would take a long time.
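The arithmetic behind that figure is simple: 2 minutes × 60 seconds/minute × 44,100 samples/second × 16 bits/sample = 84,672,000 bits for a single channel; a stereo recording would need twice as many.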
As human activity has a greater and greater impact on our environment, there is an ever-
increasing need for more information about our environment, how it functions, and what we
are doing to it. Various space agencies from around the world, including the European Space
Agency (ESA), the National Aeronautics and Space Administration (NASA), the Canadian Space
Agency (CSA), and the Japanese Space Agency (STA), are collaborating on a program to
monitor global change that will generate half a terabyte of data per day when they are fully
operational. Compare this to the 130 terabytes of data currently stored at the EROS data
center in South Dakota, which is the largest archive for land mass data in the world.
Given the explosive growth of data that needs to be transmitted and stored, why not
focus on developing better transmission and storage technologies? This is happening, but it
is not enough. There have been significant advances that permit larger and larger volumes of
information to be stored and transmitted without using compression, including CD-ROMs,
optical fibers, Asymmetric Digital Subscriber Lines (ADSL), and cable modems. However,
while it is true that both storage and transmission capacities are steadily increasing with
new technological innovations, as a corollary to Parkinson’s First Law,1
it seems that the
need for mass storage and transmission increases at least twice as fast as storage and
transmission capacities improve. Then there are situations in which capacity has not increased
significantly. For example, the amount of information we can transmit over the airwaves
will always be limited by the characteristics of the atmosphere.
An early example of data compression is Morse code, developed by Samuel Morse in
the mid-19th century. Letters sent by telegraph are encoded with dots and dashes. Morse
noticed that certain letters occurred more often than others. In order to reduce the average
time required to send a message, he assigned shorter sequences to letters that occur more
frequently, such as e (·) and a (· −), and longer sequences to letters that occur less frequently,
such as q (− − · −) and j (· − − −). This idea of using shorter codes for more frequently
occurring characters is used in Huffman coding, which we will describe in Chapter 3.
Where Morse code uses the frequency of occurrence of single characters, a widely used
form of Braille code, which was also developed in the mid-19th century, uses the frequency
of occurrence of words to provide compression [1]. In Braille coding, 2 ×3 arrays of dots
are used to represent text. Different letters can be represented depending on whether the dots
are raised or flat. In Grade 1 Braille, each array of six dots represents a single character.
However, given six dots with two positions for each dot, we can obtain 2⁶, or 64, different
combinations. If we use 26 of these for the different letters, we have 38 combinations left. In
Grade 2 Braille, some of these leftover combinations are used to represent words that occur
frequently, such as “and” and “for.” One of the combinations is used as a special symbol
indicating that the symbol that follows is a word and not a character, thus allowing a large
number of words to be represented by two arrays of dots. These modifications, along with
contractions of some of the words, result in an average reduction in space, or compression,
of about 20% [1].
1
Parkinson’s First Law: “Work expands so as to fill the time available,” in Parkinson’s Law and Other Studies in
Administration, by Cyril Northcote Parkinson, Ballantine Books, New York, 1957.
Statistical structure is being used to provide compression in these examples, but that
is not the only kind of structure that exists in the data. There are many other kinds of
structures existing in data of different types that can be exploited for compression. Consider
speech. When we speak, the physical construction of our voice box dictates the kinds of
sounds that we can produce. That is, the mechanics of speech production impose a structure
on speech. Therefore, instead of transmitting the speech itself, we could send information
about the conformation of the voice box, which could be used by the receiver to synthesize
the speech. An adequate amount of information about the conformation of the voice box
can be represented much more compactly than the numbers that are the sampled values of
speech. Therefore, we get compression. This compression approach is being used currently
in a number of applications, including transmission of speech over mobile radios and the
synthetic voice in toys that speak. An early version of this compression approach, called
the vocoder (voice coder), was developed by Homer Dudley at Bell Laboratories in 1936.
The vocoder was demonstrated at the New York World’s Fair in 1939, where it was a
major attraction. We will revisit the vocoder and this approach to compression of speech in
Chapter 17.
These are only a few of the many different types of structures that can be used to obtain
compression. The structure in the data is not the only thing that can be exploited to obtain
compression. We can also make use of the characteristics of the user of the data. Many times,
for example, when transmitting or storing speech and images, the data are intended to be
perceived by a human, and humans have limited perceptual abilities. For example, we cannot
hear the very high frequency sounds that dogs can hear. If something is represented in the
data that cannot be perceived by the user, is there any point in preserving that information?
The answer often is “no.” Therefore, we can make use of the perceptual limitations of
humans to obtain compression by discarding irrelevant information. This approach is used
in a number of compression schemes that we will visit in Chapters 13, 14, and 16.
Before we embark on our study of data compression techniques, let’s take a general look
at the area and define some of the key terms and concepts we will be using in the rest of
the book.
1.1 Compression Techniques
When we speak of a compression technique or compression algorithm,2
we are actually
referring to two algorithms. There is the compression algorithm that takes an input x and
generates a representation x_c that requires fewer bits, and there is a reconstruction algorithm
that operates on the compressed representation x_c to generate the reconstruction y. These
operations are shown schematically in Figure 1.1. We will follow convention and refer
to both the compression and reconstruction algorithms together to mean the compression
algorithm.
2
The word algorithm comes from the name of an early 9th-century Arab mathematician, Al-Khwarizmi, who
wrote a treatise entitled The Compendious Book on Calculation by al-jabr and al-muqabala, in which he explored
(among other things) the solution of various linear and quadratic equations via rules or an “algorithm.” This approach
became known as the method of Al-Khwarizmi. The name was changed to algoritmi in Latin, from which we get the word
algorithm. The name of the treatise also gave us the word algebra [2].
FIGURE 1.1 Compression and reconstruction. (The compression algorithm maps the original data x to the compressed representation x_c; the reconstruction algorithm maps x_c to the reconstruction y.)
Based on the requirements of reconstruction, data compression schemes can be divided
into two broad classes: lossless compression schemes, in which the reconstruction y is identical
to the original x, and lossy compression schemes, which generally provide much higher
compression than lossless compression but allow y to be different from x.
1.1.1 Lossless Compression
Lossless compression techniques, as their name implies, involve no loss of information. If
data have been losslessly compressed, the original data can be recovered exactly from the
compressed data. Lossless compression is generally used for applications that cannot tolerate
any difference between the original and reconstructed data.
Text compression is an important area for lossless compression. It is very important that
the reconstruction is identical to the text original, as very small differences can result in
statements with very different meanings. Consider the sentences “Do not send money” and
“Do now send money.” A similar argument holds for computer files and for certain types of
data such as bank records.
If data of any kind are to be processed or “enhanced” later to yield more information, it is
important that the integrity be preserved. For example, suppose we compressed a radiological
image in a lossy fashion, and the difference between the reconstruction y and the original
x was visually undetectable. If this image was later enhanced, the previously undetectable
differences may cause the appearance of artifacts that could seriously mislead the radiologist.
Because the price for this kind of mishap may be a human life, it makes sense to be very
careful about using a compression scheme that generates a reconstruction that is different
from the original.
Data obtained from satellites often are processed later to obtain different numerical
indicators of vegetation, deforestation, and so on. If the reconstructed data are not identical
to the original data, processing may result in “enhancement” of the differences. It may not
be possible to go back and obtain the same data over again. Therefore, it is not advisable to
allow for any differences to appear in the compression process.
There are many situations that require compression where we want the reconstruction to
be identical to the original. There are also a number of situations in which it is possible to
relax this requirement in order to get more compression. In these situations we look to lossy
compression techniques.
1.1.2 Lossy Compression
Lossy compression techniques involve some loss of information, and data that have been
compressed using lossy techniques generally cannot be recovered or reconstructed exactly.
In return for accepting this distortion in the reconstruction, we can generally obtain much
higher compression ratios than is possible with lossless compression.
In many applications, this lack of exact reconstruction is not a problem. For example,
when storing or transmitting speech, the exact value of each sample of speech is not
necessary. Depending on the quality required of the reconstructed speech, varying amounts
of loss of information about the value of each sample can be tolerated. If the quality of
the reconstructed speech is to be similar to that heard on the telephone, a significant loss
of information can be tolerated. However, if the reconstructed speech needs to be of the
quality heard on a compact disc, the amount of information loss that can be tolerated is much
lower.
Similarly, when viewing a reconstruction of a video sequence, the fact that the reconstruc-
tion is different from the original is generally not important as long as the differences do not
result in annoying artifacts. Thus, video is generally compressed using lossy compression.
Once we have developed a data compression scheme, we need to be able to measure its
performance. Because of the number of different areas of application, different terms have
been developed to describe and measure the performance.
1.1.3 Measures of Performance
A compression algorithm can be evaluated in a number of different ways. We could measure
the relative complexity of the algorithm, the memory required to implement the algorithm,
how fast the algorithm performs on a given machine, the amount of compression, and how
closely the reconstruction resembles the original. In this book we will mainly be concerned
with the last two criteria. Let us take each one in turn.
A very logical way of measuring how well a compression algorithm compresses a given
set of data is to look at the ratio of the number of bits required to represent the data before
compression to the number of bits required to represent the data after compression. This
ratio is called the compression ratio. Suppose storing an image made up of a square array of
256×256 pixels requires 65,536 bytes. The image is compressed and the compressed version
requires 16,384 bytes. We would say that the compression ratio is 4:1. We can also represent
the compression ratio by expressing the reduction in the amount of data required as a
percentage of the size of the original data. In this particular example the compression ratio
calculated in this manner would be 75%.
Another way of reporting compression performance is to provide the average number
of bits required to represent a single sample. This is generally referred to as the rate. For
example, in the case of the compressed image described above, if we assume 8 bits per byte
(or pixel), the average number of bits per pixel in the compressed representation is 2. Thus,
we would say that the rate is 2 bits per pixel.
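These two figures of merit are easy to compute directly. The short C program below is an illustrative sketch (it is not part of the book's software distribution); it uses the numbers from the 256 × 256 image example above to reproduce the compression ratio, the percentage reduction, and the rate in bits per pixel.

#include <stdio.h>

int main(void)
{
    double original_bytes   = 65536.0;   /* 256 x 256 pixels, 8 bits per pixel */
    double compressed_bytes = 16384.0;   /* size after compression             */
    double samples          = 256.0 * 256.0;

    double ratio     = original_bytes / compressed_bytes;                   /* 4:1        */
    double reduction = 100.0 * (1.0 - compressed_bytes / original_bytes);   /* 75%        */
    double rate      = 8.0 * compressed_bytes / samples;                    /* bits/pixel */

    printf("compression ratio : %.0f:1\n", ratio);
    printf("size reduction    : %.0f%%\n", reduction);
    printf("rate              : %.0f bits per pixel\n", rate);
    return 0;
}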
In lossy compression, the reconstruction differs from the original data. Therefore, in
order to determine the efficiency of a compression algorithm, we have to have some way
of quantifying the difference. The difference between the original and the reconstruction is
often called the distortion. (We will describe several measures of distortion in Chapter 8.)
Lossy techniques are generally used for the compression of data that originate as analog
signals, such as speech and video. In compression of speech and video, the final arbiter of
quality is human. Because human responses are difficult to model mathematically, many
approximate measures of distortion are used to determine the quality of the reconstructed
waveforms. We will discuss this topic in more detail in Chapter 8.
Other terms that are also used when talking about differences between the reconstruction
and the original are fidelity and quality. When we say that the fidelity or quality of a
reconstruction is high, we mean that the difference between the reconstruction and the original
is small. Whether this difference is a mathematical difference or a perceptual difference
should be evident from the context.
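As a concrete illustration of a mathematical distortion measure, the sketch below computes the mean squared error (MSE) between an original sequence and its reconstruction. MSE is one of the measures discussed in Chapter 8; the sample values here are made up purely for illustration.

#include <stdio.h>

/* Mean squared error between an original sequence x and a reconstruction y. */
double mse(const double *x, const double *y, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double d = x[i] - y[i];
        sum += d * d;
    }
    return sum / n;
}

int main(void)
{
    double original[]       = { 9, 11, 11, 11, 14, 13 };
    double reconstruction[] = { 9, 10, 11, 12, 13, 14 };
    printf("MSE = %f\n", mse(original, reconstruction, 6));
    return 0;
}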
1.2 Modeling and Coding
While reconstruction requirements may force the decision of whether a compression scheme
is to be lossy or lossless, the exact compression scheme we use will depend on a number of
different factors. Some of the most important factors are the characteristics of the data that
need to be compressed. A compression technique that will work well for the compression
of text may not work well for compressing images. Each application presents a different set
of challenges.
There is a saying attributed to Bobby Knight, the basketball coach at Texas Tech
University: “If the only tool you have is a hammer, you approach every problem as if it were
a nail.” Our intention in this book is to provide you with a large number of tools that you
can use to solve the particular data compression problem. It should be remembered that data
compression, if it is a science at all, is an experimental science. The approach that works
best for a particular application will depend to a large extent on the redundancies inherent
in the data.
The development of data compression algorithms for a variety of data can be divided
into two phases. The first phase is usually referred to as modeling. In this phase we try to
extract information about any redundancy that exists in the data and describe the redundancy
in the form of a model. The second phase is called coding. A description of the model
and a “description” of how the data differ from the model are encoded, generally using a
binary alphabet. The difference between the data and the model is often referred to as the
residual. In the following three examples we will look at three different ways that data can
be modeled. We will then use the model to obtain compression.
Example 1.2.1:
Consider the following sequence of numbers x_1, x_2, x_3, …:
9 11 11 11 14 13 15 17 16 17 20 21
If we were to transmit or store the binary representations of these numbers, we would need
to use 5 bits per sample. However, by exploiting the structure in the data, we can represent
the sequence using fewer bits. If we plot these data as shown in Figure 1.2, we see that the
data seem to fall on a straight line. A model for the data could therefore be a straight line
given by the equation

x̂_n = n + 8,   n = 1, 2, …

FIGURE 1.2 A sequence of data values.
Thus, the structure in the data can be characterized by an equation. To make use of
this structure, let’s examine the difference between the data and the model. The difference
(or residual) is given by the sequence

e_n = x_n − x̂_n :   0 1 0 −1 1 −1 0 1 −1 −1 1 1

The residual sequence consists of only three values, −1, 0, and 1. If we assign a code of 00
to −1, a code of 01 to 0, and a code of 10 to 1, we need to use 2 bits to represent each
element of the residual sequence. Therefore, we can obtain compression by transmitting or
storing the parameters of the model and the residual sequence. The encoding can be exact
if the required compression is to be lossless, or approximate if the compression can be
lossy.
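A short C sketch of this example follows. It is an illustration, not part of the book's accompanying software: it applies the straight-line model x̂_n = n + 8 to the data and prints the residual sequence listed above.

#include <stdio.h>

int main(void)
{
    int x[] = { 9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21 };
    int n_samples = (int)(sizeof x / sizeof x[0]);

    for (int n = 1; n <= n_samples; n++) {
        int x_hat    = n + 8;            /* model prediction x_hat_n = n + 8 */
        int residual = x[n - 1] - x_hat; /* e_n = x_n - x_hat_n              */
        printf("%2d ", residual);        /* prints 0 1 0 -1 1 -1 0 1 -1 -1 1 1 */
    }
    printf("\n");
    return 0;
}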
The type of structure or redundancy that existed in these data follows a simple law. Once
we recognize this law, we can make use of the structure to predict the value of each element
in the sequence and then encode the residual. Structure of this type is only one of many
types of structure. Consider the following example.
Example 1.2.2:
Consider the following sequence of numbers:
27 28 29 28 26 27 29 28 30 32 34 36 38
The sequence is plotted in Figure 1.3.
FIGURE 1.3 A sequence of data values.
The sequence does not seem to follow a simple law as in the previous case. However,
each value is close to the previous value. Suppose we send the first value, then in place of
subsequent values we send the difference between it and the previous value. The sequence
of transmitted values would be
27 1 1 −1 −2 1 2 −1 2 2 2 2 2
Like the previous example, the number of distinct values has been reduced. Fewer bits are
required to represent each number and compression is achieved. The decoder adds each
received value to the previous decoded value to obtain the reconstruction corresponding
to the received value. Techniques that use the past values of a sequence to predict the
current value and then encode the error in prediction, or residual, are called predictive coding
schemes. We will discuss lossless predictive compression schemes in Chapter 7 and lossy
predictive coding schemes in Chapter 11.
Assuming both encoder and decoder know the model being used, we would still have to
send the value of the first element of the sequence. 
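The following C sketch (again illustrative, not from the book's software) implements this simple predictive scheme: the encoder sends the first value followed by successive differences, and the decoder accumulates the received values to recover the original sequence exactly.

#include <stdio.h>

int main(void)
{
    int x[] = { 27, 28, 29, 28, 26, 27, 29, 28, 30, 32, 34, 36, 38 };
    int n = (int)(sizeof x / sizeof x[0]);
    int transmitted[13], reconstructed[13];

    /* encoder: first sample as-is, then differences from the previous sample */
    transmitted[0] = x[0];
    for (int i = 1; i < n; i++)
        transmitted[i] = x[i] - x[i - 1];

    /* decoder: add each received value to the previous reconstruction */
    reconstructed[0] = transmitted[0];
    for (int i = 1; i < n; i++)
        reconstructed[i] = reconstructed[i - 1] + transmitted[i];

    for (int i = 0; i < n; i++)
        printf("%d ", transmitted[i]);   /* 27 1 1 -1 -2 1 2 -1 2 2 2 2 2 */
    printf("\n");
    return 0;
}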
A very different type of redundancy is statistical in nature. Often we will encounter
sources that generate some symbols more often than others. In these situations, it will be
advantageous to assign binary codes of different lengths to different symbols.
Example 1.2.3:
Suppose we have the following sequence:
a␢barayaran␢array␢ran␢far␢faar␢faaar␢away

which is typical of all sequences generated by a source (here the symbol ␢ stands for the space
between words). Notice that the sequence is made
up of eight different symbols. In order to represent eight symbols, we need to use 3 bits per
symbol. Suppose instead we used the code shown in Table 1.1. Notice that we have assigned
a codeword with only a single bit to the symbol that occurs most often, and correspondingly
longer codewords to symbols that occur less often. If we substitute the codes for each
symbol, we will use 106 bits to encode the entire sequence. As there are 41 symbols in
the sequence, this works out to approximately 2.58 bits per symbol. This means we have
obtained a compression ratio of 1.16:1. We will study how to use statistical redundancy of
this sort in Chapters 3 and 4.
T A B L E 1 . 1 A code with codewords of varying length.
a    1
␣    001
b    01100
f    0100
n    0111
r    000
w    01101
y    0101

When dealing with text, along with statistical redundancy, we also see redundancy in
the form of words that repeat often. We can take advantage of this form of redundancy by
constructing a list of these words and then representing them by their position in the list. This
type of compression scheme is called a dictionary compression scheme. We will study these
schemes in Chapter 5.
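As a toy illustration of the dictionary idea (the word list and the encoding format here are invented for the example; real dictionary coders are covered in Chapter 5):

```python
# A toy illustration of the dictionary idea: frequently occurring words are
# replaced by their index in an agreed-upon list.

dictionary = ["the", "of", "and", "to", "that"]   # assumed shared word list

def encode(words):
    return [("idx", dictionary.index(w)) if w in dictionary else ("raw", w)
            for w in words]

def decode(tokens):
    return [dictionary[v] if kind == "idx" else v for kind, v in tokens]

text = "the dog that belongs to the friend".split()
tokens = encode(text)
assert decode(tokens) == text
```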
Often the structure or redundancy in the data becomes more evident when we look at
groups of symbols. We will look at compression schemes that take advantage of this in
Chapters 4 and 10.
Finally, there will be situations in which it is easier to take advantage of the structure if
we decompose the data into a number of components. We can then study each component
separately and use a model appropriate to that component. We will look at such schemes in
Chapters 13, 14, and 15.
There are a number of different ways to characterize data. Different characterizations
will lead to different compression schemes. We will study these compression schemes in
the upcoming chapters, and use a number of examples that should help us understand the
relationship between the characterization and the compression scheme.
With the increasing use of compression, there has also been an increasing need for
standards. Standards allow products developed by different vendors to communicate. Thus,
we can compress something with products from one vendor and reconstruct it using the
products of a different vendor. The different international standards organizations have
responded to this need, and a number of standards for various compression applications have
been approved. We will discuss these standards as applications of the various compression
techniques.
Finally, compression is still largely an art, and to gain proficiency in an art you need to
get a feel for the process. To help, we have developed software implementations of most of
the techniques discussed in this book, and also provided the data sets used for developing the
examples in this book. Details on how to obtain these programs and data sets are provided
in the Preface. You should use these programs on your favorite data or on the data sets
provided in order to understand some of the issues involved in compression. We would also
encourage you to write your own software implementations of some of these techniques,
as very often the best way to understand how an algorithm works is to implement the
algorithm.
1 . 3 Summary
In this chapter we have introduced the subject of data compression. We have provided
some motivation for why we need data compression and defined some of the terminology
we will need in this book. Additional terminology will be introduced as needed. We have
briefly introduced the two major types of compression algorithms: lossless compression
and lossy compression. Lossless compression is used for applications that require an exact
reconstruction of the original data, while lossy compression is used when the user can
tolerate some differences between the original and reconstructed representations of the data.
An important element in the design of data compression algorithms is the modeling of the
data. We have briefly looked at how modeling can help us in obtaining more compact
representations of the data. We have described some of the different ways we can view the
data in order to model it. The more ways we have of looking at the data, the more successful
we will be in developing compression schemes that take full advantage of the structures in
the data.
1 . 4 Projects and Problems
1. Use the compression utility on your computer to compress different files. Study the
effect of the original file size and file type on the ratio of compressed file size to
original file size. (A small script for measuring compression ratios is sketched after these problems.)
2. Take a few paragraphs of text from a popular magazine and compress them by remov-
ing all words that are not essential for comprehension. For example, in the sentence
“This is the dog that belongs to my friend,” we can remove the words is, the, that, and
to and still convey the same meaning. Let the ratio of the words removed to the total
number of words in the original text be the measure of redundancy in the text. Repeat
the experiment using paragraphs from a technical journal. Can you make any quanti-
tative statements about the redundancy in the text obtained from different sources?
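For Problem 1, one possible starting point is the short Python script below. It uses the zlib module (a DEFLATE implementation) simply because it ships with Python; the compression utility on your computer may use a different algorithm, so the ratios you observe will differ.

```python
# Report the compression ratio of each file named on the command line,
# using zlib (DEFLATE) at its highest compression level.

import sys
import zlib

for name in sys.argv[1:]:
    with open(name, "rb") as f:
        data = f.read()
    compressed = zlib.compress(data, 9)
    ratio = len(data) / len(compressed)
    print(f"{name}: {len(data)} -> {len(compressed)} bytes, ratio {ratio:.2f}:1")
```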
2
Mathematical Preliminaries for
Lossless Compression
2 . 1 Overview
The treatment of data compression in this book is not very mathematical. (For a
more mathematical treatment of some of the topics covered in this book,
see [3, 4, 5, 6].) However, we do need some mathematical preliminaries to
appreciate the compression techniques we will discuss. Compression schemes
can be divided into two classes, lossy and lossless. Lossy compression schemes
involve the loss of some information, and data that have been compressed using a lossy
scheme generally cannot be recovered exactly. Lossless schemes compress the data without
loss of information, and the original data can be recovered exactly from the compressed data.
In this chapter, some of the ideas in information theory that provide the framework for the
development of lossless data compression schemes are briefly reviewed. We will also look
at some ways to model the data that lead to efficient coding schemes. We have assumed
some knowledge of probability concepts (see Appendix A for a brief review of probability
and random processes).
2 . 2 A Brief Introduction to Information Theory
Although the idea of a quantitative measure of information has been around for a while, the
person who pulled everything together into what is now called information theory was Claude
Elwood Shannon [7], an electrical engineer at Bell Labs. Shannon defined a quantity called
self-information. Suppose we have an event A, which is a set of outcomes of some random
experiment. If P(A) is the probability that the event A will occur, then the self-information
associated with A is given by
i(A) = log_b [1/P(A)] = −log_b P(A)     (2.1)
Note that we have not specified the base of the log function. We will discuss this in more
detail later in the chapter. The use of the logarithm to obtain a measure of information
was not an arbitrary choice as we shall see later in this chapter. But first let’s see if the
use of a logarithm in this context makes sense from an intuitive point of view. Recall
that log 1 = 0, and −log x increases as x decreases from one to zero. Therefore, if the
probability of an event is low, the amount of self-information associated with it is high; if
the probability of an event is high, the information associated with it is low. Even if we
ignore the mathematical definition of information and simply use the definition we use in
everyday language, this makes some intuitive sense. The barking of a dog during a burglary
is a high-probability event and, therefore, does not contain too much information. However,
if the dog did not bark during a burglary, this is a low-probability event and contains a lot of
information. (Obviously, Sherlock Holmes understood information theory!)1
Although this
equivalence of the mathematical and semantic definitions of information holds true most of
the time, it does not hold all of the time. For example, a totally random string of letters
will contain more information (in the mathematical sense) than a well-thought-out treatise
on information theory.
Another property of this mathematical definition of information that makes intuitive
sense is that the information obtained from the occurrence of two independent events is the
sum of the information obtained from the occurrence of the individual events. Suppose A
and B are two independent events. The self-information associated with the occurrence of
both event A and event B is, by Equation (2.1),
i(AB) = log_b [1/P(AB)]
As A and B are independent,
P(AB) = P(A)P(B)
and
i(AB) = log_b [1/(P(A)P(B))] = log_b [1/P(A)] + log_b [1/P(B)] = i(A) + i(B)
The unit of information depends on the base of the log. If we use log base 2, the unit is bits;
if we use log base e, the unit is nats; and if we use log base 10, the unit is hartleys.
1
Silver Blaze by Arthur Conan Doyle.
Note that to calculate the information in bits, we need to take the logarithm base 2 of
the probabilities. Because this probably does not appear on your calculator, let’s review
logarithms briefly. Recall that
log_b x = a
means that
b^a = x
Therefore, if we want to take the log base 2 of x,
log_2 x = a  ⇒  2^a = x
we want to find the value of a. We can take the natural log (log base e) or log base 10 of
both sides (which do appear on your calculator). Then
ln(2^a) = ln x  ⇒  a ln 2 = ln x
and
a = ln x / ln 2
E x a m p l e 2 . 2 . 1:
Let H and T be the outcomes of flipping a coin. If the coin is fair, then
P(H) = P(T) = 1/2
and
i(H) = i(T) = 1 bit
If the coin is not fair, then we would expect the information associated with each event to
be different. Suppose
P(H) = 1/8,   P(T) = 7/8
Then
i(H) = 3 bits,   i(T) = 0.193 bits
At least mathematically, the occurrence of a head conveys much more information than
the occurrence of a tail. As we shall see later, this has certain consequences for how the
information conveyed by these outcomes should be encoded. 
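The self-information values in Example 2.2.1 are easy to check numerically; the following lines are only a quick illustration using base-2 logarithms.

```python
# Self-information in bits: i(A) = -log2 P(A) (equivalently ln P(A) / ln 2).

from math import log2

def self_information(p):
    return -log2(p)

print(self_information(1 / 2))   # 1.0 bit for a fair coin
print(self_information(1 / 8))   # 3.0 bits
print(self_information(7 / 8))   # about 0.193 bits
```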
If we have a set of independent events A_i, which are sets of outcomes of some experiment,
such that
⋃ A_i = S
where S is the sample space, then the average self-information associated with the random
experiment is given by
H = Σ_i P(A_i) i(A_i) = −Σ_i P(A_i) log_b P(A_i)
This quantity is called the entropy associated with the experiment. One of the many con-
tributions of Shannon was that he showed that if the experiment is a source that puts out
symbols A_i from a set 𝒜, then the entropy is a measure of the average number of binary
symbols needed to code the output of the source. Shannon showed that the best that a lossless
compression scheme can do is to encode the output of a source with an average number of
bits equal to the entropy of the source.
The set of symbols 𝒜 is often called the alphabet for the source, and the symbols are
referred to as letters. For a general source 𝒮 with alphabet 𝒜 = {1, 2, ..., m} that generates
a sequence {X_1, X_2, ...}, the entropy is given by
H = lim_{n→∞} (1/n) G_n     (2.2)
where
G_n = −Σ_{i_1=1}^{m} Σ_{i_2=1}^{m} ··· Σ_{i_n=1}^{m} P(X_1 = i_1, X_2 = i_2, ..., X_n = i_n) log P(X_1 = i_1, X_2 = i_2, ..., X_n = i_n)
and {X_1, X_2, ..., X_n} is a sequence of length n from the source. We will talk more about the
reason for the limit in Equation (2.2) later in the chapter. If each element in the sequence is
independent and identically distributed (iid), then we can show that
G_n = −n Σ_{i_1=1}^{m} P(X_1 = i_1) log P(X_1 = i_1)     (2.3)
and the equation for the entropy becomes
H(𝒮) = −Σ P(X_1) log P(X_1)     (2.4)
For most sources Equations (2.2) and (2.4) are not identical. If we need to distinguish
between the two, we will call the quantity computed in (2.4) the first-order entropy of the
source, while the quantity in (2.2) will be referred to as the entropy of the source.
In general, it is not possible to know the entropy for a physical source, so we have to
estimate the entropy. The estimate of the entropy depends on our assumptions about the
structure of the source sequence.
Consider the following sequence:
1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10
Assuming the frequency of occurrence of each number is reflected accurately in the number
of times it appears in the sequence, we can estimate the probability of occurrence of each
symbol as follows:
P(1) = P(6) = P(7) = P(10) = 1/16
P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = 2/16
Assuming the sequence is iid, the entropy for this sequence is the same as the first-order
entropy as defined in (2.4). The entropy can then be calculated as
H = −Σ_{i=1}^{10} P(i) log_2 P(i)
With our stated assumptions, the entropy for this source is 3.25 bits. This means that the
best scheme we could find for coding this sequence could only code it at 3.25 bits/sample.
However, if we assume that there was sample-to-sample correlation between the samples
and we remove the correlation by taking differences of neighboring sample values, we arrive
at the residual sequence
1 1 1 −1 1 1 1 −1 1 1 1 1 1 −1 1 1
This sequence is constructed using only two values with probabilities P(1) = 13/16 and
P(−1) = 3/16. The entropy in this case is 0.70 bits per symbol. Of course, knowing only this
sequence would not be enough for the receiver to reconstruct the original sequence. The
receiver must also know the process by which this sequence was generated from the original
sequence. The process depends on our assumptions about the structure of the sequence.
These assumptions are called the model for the sequence. In this case, the model for the
sequence is
x_n = x_{n−1} + r_n
where xn is the nth element of the original sequence and rn is the nth element of the residual
sequence. This model is called a static model because its parameters do not change with n.
A model whose parameters change or adapt with n to the changing characteristics of the
data is called an adaptive model.
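The two entropy estimates discussed above can be reproduced with a short sketch; the first_order_entropy function below simply applies Equation (2.4) to relative frequencies and is illustrative only.

```python
# Estimate the first-order entropy of the raw sequence and of the residuals
# produced by the model x_n = x_{n-1} + r_n.

from collections import Counter
from math import log2

def first_order_entropy(seq):
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

x = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
r = [x[0]] + [x[i] - x[i - 1] for i in range(1, len(x))]

print(first_order_entropy(x))   # 3.25 bits/symbol
print(first_order_entropy(r))   # about 0.70 bits/symbol
```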
Basically, we see that knowing something about the structure of the data can help to
“reduce the entropy.” We have put “reduce the entropy” in quotes because the entropy of
the source is a measure of the amount of information generated by the source. As long
as the information generated by the source is preserved (in whatever representation), the
entropy remains the same. What we are reducing is our estimate of the entropy. The “actual”
structure of the data in practice is generally unknowable, but anything we can learn about
the data can help us to estimate the actual source entropy. Theoretically, as seen in Equation
(2.2), we accomplish this in our definition of the entropy by picking larger and larger blocks
of data to calculate the probability over, letting the size of the block go to infinity.
Consider the following contrived sequence:
1 2 1 2 3 3 3 3 1 2 3 3 3 3 1 2 3 3 1 2
Obviously, there is some structure to this data. However, if we look at it one symbol at a
time, the structure is difficult to extract. Consider the probabilities: P(1) = P(2) = 1/4 and
P(3) = 1/2. The entropy is 1.5 bits/symbol. This particular sequence consists of 20 symbols;
therefore, the total number of bits required to represent this sequence is 30. Now let's take
the same sequence and look at it in blocks of two. Obviously, there are only two symbols,
1 2 and 3 3. The probabilities are P(1 2) = 1/2 and P(3 3) = 1/2, and the entropy is 1 bit/symbol.
As there are 10 such symbols in the sequence, we need a total of 10 bits to represent the
entire sequence—a reduction of a factor of three. The theory says we can always extract the
structure of the data by taking larger and larger block sizes; in practice, there are limitations
to this approach. To avoid these limitations, we try to obtain an accurate model for the
data and code the source with respect to the model. In Section 2.3, we describe some of
the models commonly used in lossless compression algorithms. But before we do that, let’s
make a slight detour and see a more rigorous development of the expression for average
information. While the explanation is interesting, it is not really necessary for understanding
much of what we will study in this book and can be skipped.
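Before the detour, here is an illustration of the block-size idea on the 20-symbol sequence above; the helper function is an invented convenience, not part of the text.

```python
# Estimate entropy per original symbol when the sequence is viewed one symbol
# at a time and in non-overlapping blocks of two (assumes the sequence length
# is a multiple of the block size).

from collections import Counter
from math import log2

def entropy_per_symbol(seq, block=1):
    blocks = [tuple(seq[i:i + block]) for i in range(0, len(seq), block)]
    counts = Counter(blocks)
    n = len(blocks)
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    return h / block            # bits per original symbol

s = [1, 2, 1, 2, 3, 3, 3, 3, 1, 2, 3, 3, 3, 3, 1, 2, 3, 3, 1, 2]
print(entropy_per_symbol(s, 1))   # 1.5 bits/symbol -> 30 bits in all
print(entropy_per_symbol(s, 2))   # 0.5 bits/symbol (1 bit per block) -> 10 bits
```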
2 . 2 . 1 D e r i v a t i o n o f A v e r a g e I n f o r m a t i o n 
We start with the properties we want in our measure of average information. We will then
show that requiring these properties in the information measure leads inexorably to the
particular definition of average information, or entropy, that we have provided earlier.
Given a set of independent events A_1, A_2, ..., A_n with probabilities p_i = P(A_i), we desire
the following properties in the measure of average information H:
1. We want H to be a continuous function of the probabilities pi. That is, a small change
in pi should only cause a small change in the average information.
2. If all events are equally likely, that is, pi = 1/n for all i, then H should be a mono-
tonically increasing function of n. The more possible outcomes there are, the more
information should be contained in the occurrence of any particular outcome.
3. Suppose we divide the possible outcomes into a number of groups. We indicate the
occurrence of a particular event by first indicating the group it belongs to, then indi-
cating which particular member of the group it is. Thus, we get some information first
by knowing which group the event belongs to and then we get additional information
by learning which particular event (from the events in the group) has occurred. The
information associated with indicating the outcome in multiple stages should not be
any different than the information associated with indicating the outcome in a single
stage.
For example, suppose we have an experiment with three outcomes A1, A2, and A3,
with corresponding probabilities p1, p2, and p3. The average information associated
with this experiment is simply a function of the probabilities:
H = H(p_1, p_2, p_3)
Let's group the three outcomes into two groups
B_1 = {A_1},   B_2 = {A_2, A_3}
The probabilities of the events B_i are given by
q_1 = P(B_1) = p_1,   q_2 = P(B_2) = p_2 + p_3
If we indicate the occurrence of an event Ai by first declaring which group the event
belongs to and then declaring which event occurred, the total amount of average
information would be given by
H = H(q_1, q_2) + q_1 H(p_1/q_1) + q_2 H(p_2/q_2, p_3/q_2)
We require that the average information computed either way be the same.
In his classic paper, Shannon showed that the only way all these conditions could be
satisfied was if
H = −K Σ p_i log p_i
where K is an arbitrary positive constant. Let’s review his proof as it appears in the appendix
of his paper [7].
Suppose we have an experiment with n = k^m equally likely outcomes. The average
information H(1/n, 1/n, ..., 1/n) associated with this experiment is a function of n. In other
words,
H(1/n, 1/n, ..., 1/n) = A(n)
We can indicate the occurrence of an event from k^m events by a series of m choices from
k equally likely possibilities. For example, consider the case of k = 2 and m = 3. There are
eight equally likely events; therefore, H(1/8, 1/8, ..., 1/8) = A(8).
We can indicate occurrence of any particular event as shown in Figure 2.1. In this
case, we have a sequence of three selections. Each selection is between two equally likely
possibilities. Therefore,
H(1/8, 1/8, ..., 1/8) = A(8)
= H(1/2, 1/2) + 1/2 [H(1/2, 1/2) + 1/2 H(1/2, 1/2) + 1/2 H(1/2, 1/2)]
  + 1/2 [H(1/2, 1/2) + 1/2 H(1/2, 1/2) + 1/2 H(1/2, 1/2)]     (2.5)
= 3 H(1/2, 1/2)
= 3 A(2)
In other words,
A(8) = 3 A(2)
(The rather odd way of writing the left-hand side of Equation (2.5) is to show how the terms
correspond to the branches of the tree shown in Figure 2.1.) We can generalize this for the
case of n = k^m as
A(n) = A(k^m) = m A(k)
F I G U R E 2. 1 A possible way of identifying the occurrence of an event: a sequence of first, second, and third selections.
Similarly, for j^l choices,
A(j^l) = l A(j)
We can pick l arbitrarily large (more on this later) and then choose m so that
k^m ≤ j^l ≤ k^(m+1)
Taking logarithms of all terms, we get
m log k ≤ l log j ≤ (m+1) log k
Now divide through by l log k to get
m/l ≤ (log j)/(log k) ≤ m/l + 1/l
Recall that we picked l arbitrarily large. If l is arbitrarily large, then 1/l is arbitrarily small.
This means that the upper and lower bounds of (log j)/(log k) can be made arbitrarily close to m/l
by picking l arbitrarily large. Another way of saying this is
| m/l − (log j)/(log k) | < ε
where ε can be made arbitrarily small. We will use this fact to find an expression for A(n)
and hence for H(1/n, ..., 1/n).
To do this we use our second requirement that H(1/n, ..., 1/n) be a monotonically increasing
function of n. As
H(1/n, ..., 1/n) = A(n)
this means that A(n) is a monotonically increasing function of n. If
k^m ≤ j^l ≤ k^(m+1)
then in order to satisfy our second requirement
A(k^m) ≤ A(j^l) ≤ A(k^(m+1))
or
m A(k) ≤ l A(j) ≤ (m+1) A(k)
Dividing through by l A(k), we get
m/l ≤ A(j)/A(k) ≤ m/l + 1/l
Using the same arguments as before, we get
| m/l − A(j)/A(k) | < ε
where ε can be made arbitrarily small.
Now A(j)/A(k) is at most a distance of ε away from m/l, and (log j)/(log k) is at most a distance
of ε away from m/l. Therefore, A(j)/A(k) is at most a distance of 2ε away from (log j)/(log k):
| A(j)/A(k) − (log j)/(log k) | < 2ε
We can pick ε to be arbitrarily small, and j and k are arbitrary. The only way this inequality
can be satisfied for arbitrarily small ε and arbitrary j and k is for A(j) = K log j, where K
is an arbitrary constant. In other words,
H = K log n
Up to this point we have only looked at equally likely events. We now make the transition
to the more general case of an experiment with outcomes that are not equally likely. We do
that by considering an experiment with Σ n_i equally likely outcomes that are grouped in n
unequal groups of size n_i with rational probabilities (if the probabilities are not rational, we
approximate them with rational probabilities and use the continuity requirement):
p_i = n_i / (Σ_{j=1}^{n} n_j)
Given that we have Σ n_i equally likely events, from the development above we have
H = K log (Σ_j n_j)     (2.6)
If we indicate an outcome by first indicating which of the n groups it belongs to, and second
indicating which member of the group it is, then by our earlier development the average
information H is given by
H = H(p_1, p_2, ..., p_n) + p_1 H(1/n_1, ..., 1/n_1) + ··· + p_n H(1/n_n, ..., 1/n_n)     (2.7)
  = H(p_1, p_2, ..., p_n) + p_1 K log n_1 + p_2 K log n_2 + ··· + p_n K log n_n     (2.8)
  = H(p_1, p_2, ..., p_n) + K Σ_{i=1}^{n} p_i log n_i     (2.9)
Equating the expressions in Equations (2.6) and (2.9), we obtain
K log (Σ_j n_j) = H(p_1, p_2, ..., p_n) + K Σ_{i=1}^{n} p_i log n_i
or
H(p_1, p_2, ..., p_n) = K log (Σ_j n_j) − K Σ_{i=1}^{n} p_i log n_i
= −K [ Σ_{i=1}^{n} p_i log n_i − log (Σ_{j=1}^{n} n_j) ]
= −K [ Σ_{i=1}^{n} p_i log n_i − log (Σ_{j=1}^{n} n_j) Σ_{i=1}^{n} p_i ]     (2.10)
= −K [ Σ_{i=1}^{n} p_i log n_i − Σ_{i=1}^{n} p_i log (Σ_{j=1}^{n} n_j) ]
= −K Σ_{i=1}^{n} p_i [ log n_i − log (Σ_{j=1}^{n} n_j) ]
= −K Σ_{i=1}^{n} p_i log [ n_i / (Σ_{j=1}^{n} n_j) ]     (2.11)
= −K Σ p_i log p_i     (2.12)
where, in Equation (2.10), we have used the fact that Σ_{i=1}^{n} p_i = 1. By convention we pick K
to be 1, and we have the formula
H = −Σ p_i log p_i
Note that this formula is a natural outcome of the requirements we imposed in the
beginning. It was not artificially forced in any way. Therein lies the beauty of information
theory. Like the laws of physics, its laws are intrinsic in the nature of things. Mathematics
is simply a tool to express these relationships.
2 . 3 Models
As we saw in Section 2.2, having a good model for the data can be useful in estimating the
entropy of the source. As we will see in later chapters, good models for sources lead to more
efficient compression algorithms. In general, in order to develop techniques that manipulate
data using mathematical operations, we need to have a mathematical model for the data.
Obviously, the better the model (i.e., the closer the model matches the aspects of reality that
are of interest to us), the more likely it is that we will come up with a satisfactory technique.
There are several approaches to building mathematical models.
2 . 3 . 1 P h y s i c a l M o d e l s
If we know something about the physics of the data generation process, we can use that
information to construct a model. For example, in speech-related applications, knowledge
about the physics of speech production can be used to construct a mathematical model for
the sampled speech process. Sampled speech can then be encoded using this model. We will
discuss speech production models in more detail in Chapter 8.
Models for certain telemetry data can also be obtained through knowledge of the under-
lying process. For example, if residential electrical meter readings at hourly intervals were
to be coded, knowledge about the living habits of the populace could be used to determine
when electricity usage would be high and when the usage would be low. Then instead of
the actual readings, the difference (residual) between the actual readings and those predicted
by the model could be coded.
In general, however, the physics of data generation is simply too complicated to under-
stand, let alone use to develop a model. Where the physics of the problem is too complicated,
we can obtain a model based on empirical observation of the statistics of the data.
2 . 3 . 2 P r o b a b i l i t y M o d e l s
The simplest statistical model for the source is to assume that each letter that is generated by
the source is independent of every other letter, and each occurs with the same probability.
We could call this the ignorance model, as it would generally be useful only when we know
nothing about the source. (Of course, that really might be true, in which case we have a rather
unfortunate name for the model!) The next step up in complexity is to keep the indepen-
dence assumption, but remove the equal probability assumption and assign a probability of
occurrence to each letter in the alphabet. For a source that generates letters from an alphabet
𝒜 = {a_1, a_2, ..., a_M}, we can have a probability model 𝒫 = {P(a_1), P(a_2), ..., P(a_M)}.
Given a probability model (and the independence assumption), we can compute the
entropy of the source using Equation (2.4). As we will see in the following chapters, using
the probability model we can also construct some very efficient codes to represent the letters
in 𝒜. Of course, these codes are only efficient if our mathematical assumptions are in accord
with reality.
If the assumption of independence does not fit with our observation of the data, we can
generally find better compression schemes if we discard this assumption. When we discard
the independence assumption, we have to come up with a way to describe the dependence
of elements of the data sequence on each other.
2 . 3 . 3 M a r k o v M o d e l s
One of the most popular ways of representing dependence in the data is through the use of
Markov models, named after the Russian mathematician Andrei Andreyevich Markov (1856–
1922). For models used in lossless compression, we use a specific type of Markov process
called a discrete time Markov chain. Let {x_n} be a sequence of observations. This sequence
is said to follow a kth-order Markov model if
P(x_n | x_{n−1}, ..., x_{n−k}) = P(x_n | x_{n−1}, ..., x_{n−k}, ...)     (2.13)
In other words, knowledge of the past k symbols is equivalent to the knowledge of the entire
past history of the process. The values taken on by the set {x_{n−1}, ..., x_{n−k}} are called the
states of the process. If the size of the source alphabet is l, then the number of states is l^k.
The most commonly used Markov model is the first-order Markov model, for which
P(x_n | x_{n−1}) = P(x_n | x_{n−1}, x_{n−2}, x_{n−3}, ...)     (2.14)
Equations (2.13) and (2.14) indicate the existence of dependence between samples. However,
they do not describe the form of the dependence. We can develop different first-order Markov
models depending on our assumption about the form of the dependence between samples.
If we assumed that the dependence was introduced in a linear manner, we could view
the data sequence as the output of a linear filter driven by white noise. The output of such
a filter can be given by the difference equation
x_n = ρ x_{n−1} + ε_n     (2.15)
where ε_n is a white noise process. This model is often used when developing coding
algorithms for speech and images.
The use of the Markov model does not require the assumption of linearity. For example,
consider a binary image. The image has only two types of pixels, white pixels and black
pixels. We know that the appearance of a white pixel as the next observation depends,
to some extent, on whether the current pixel is white or black. Therefore, we can model
the pixel process as a discrete time Markov chain. Define two states Sw and Sb (Sw would
correspond to the case where the current pixel is a white pixel, and Sb corresponds to the
case where the current pixel is a black pixel). We define the transition probabilities P(w|b)
and P(b|w), and the probability of being in each state P(S_w) and P(S_b). The Markov model
can then be represented by the state diagram shown in Figure 2.2.
The entropy of a finite state process with states Si is simply the average value of the
entropy at each state:
H = Σ_{i=1}^{M} P(S_i) H(S_i)     (2.16)
F I G U R E 2. 2 A two-state Markov model for binary images: states S_w and S_b with transition probabilities P(w|w), P(b|w), P(w|b), and P(b|b).
For our particular example of a binary image
H(S_w) = −P(b|w) log P(b|w) − P(w|w) log P(w|w)
where P(w|w) = 1 − P(b|w). H(S_b) can be calculated in a similar manner.
E x a m p l e 2 . 3 . 1: M a r k o v m o d e l
To see the effect of modeling on the estimate of entropy, let us calculate the entropy for a
binary image, first using a simple probability model and then using the finite state model
described above. Let us assume the following values for the various probabilities:
P(S_w) = 30/31,   P(S_b) = 1/31
P(w|w) = 0.99,   P(b|w) = 0.01,   P(b|b) = 0.7,   P(w|b) = 0.3
Then the entropy using a probability model and the iid assumption is
H = −(30/31) log(30/31) − (1/31) log(1/31) = 0.206 bits
Now using the Markov model
H(S_b) = −0.3 log 0.3 − 0.7 log 0.7 = 0.881 bits
and
H(S_w) = −0.01 log 0.01 − 0.99 log 0.99 = 0.081 bits
which, using Equation (2.16), results in an entropy for the Markov model of 0.107 bits,
about a half of the entropy obtained using the iid assumption. 
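The two estimates in Example 2.3.1 can be verified with a few lines of Python; this assumes the iid estimate is computed from the state probabilities 30/31 and 1/31, as in the example.

```python
# Reproducing the two entropy estimates of Example 2.3.1 (logs are base 2).

from math import log2

def binary_entropy(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

P_Sw, P_Sb = 30 / 31, 1 / 31            # state (pixel) probabilities
P_b_given_w, P_w_given_b = 0.01, 0.3    # transition probabilities

H_iid = binary_entropy(P_Sb)            # about 0.206 bits
H_Sw = binary_entropy(P_b_given_w)      # about 0.081 bits
H_Sb = binary_entropy(P_w_given_b)      # about 0.881 bits
H_markov = P_Sw * H_Sw + P_Sb * H_Sb    # about 0.107 bits, Equation (2.16)
print(H_iid, H_markov)
```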
M a r k o v M o d e l s i n T e x t C o m p r e s s i o n
As expected, Markov models are particularly useful in text compression, where the prob-
ability of the next letter is heavily influenced by the preceding letters. In fact, the use
of Markov models for written English appears in the original work of Shannon [7]. In
current text compression literature, the kth-order Markov models are more widely known
as finite context models, with the word context being used for what we have earlier defined
as state.
Consider the word preceding. Suppose we have already processed precedin and are going
to encode the next letter. If we take no account of the context and treat each letter as a
surprise, the probability of the letter g occurring is relatively low. If we use a first-order
Markov model or single-letter context (that is, we look at the probability model given n),
we can see that the probability of g would increase substantially. As we increase the context
size (go from n to in to din and so on), the probability of the alphabet becomes more and
more skewed, which results in lower entropy.
Shannon used a second-order model for English text consisting of the 26 letters and one
space to obtain an entropy of 3.1 bits/letter [8]. Using a model where the output symbols
were words rather than letters brought down the entropy to 2.4 bits/letter. Shannon then used
predictions generated by people (rather than statistical models) to estimate the upper and
lower bounds on the entropy of the second order model. For the case where the subjects knew
the 100 previous letters, he estimated these bounds to be 1.3 and 0.6 bits/letter, respectively.
The longer the context, the better its predictive value. However, if we were to store the
probability model with respect to all contexts of a given length, the number of contexts
would grow exponentially with the length of context. Furthermore, given that the source
imposes some structure on its output, many of these contexts may correspond to strings
that would never occur in practice. Consider a context model of order four (the context is
determined by the last four symbols). If we take an alphabet size of 95, the possible number
of contexts is 95^4—more than 81 million!
This problem is further exacerbated by the fact that different realizations of the source
output may vary considerably in terms of repeating patterns. Therefore, context modeling
in text compression schemes tends to be an adaptive strategy in which the probabilities for
different symbols in the different contexts are updated as they are encountered. However,
this means that we will often encounter symbols that have not been encountered before for
any of the given contexts (this is known as the zero frequency problem). The larger the
context, the more often this will happen. This problem could be resolved by sending a code
to indicate that the following symbol was being encountered for the first time, followed by
a prearranged code for that symbol. This would significantly increase the length of the code
for the symbol on its first occurrence (in the given context). However, if this situation did not
occur too often, the overhead associated with such occurrences would be small compared to
the total number of bits used to encode the output of the source. Unfortunately, in context-
based encoding, the zero frequency problem is encountered often enough for overhead to be
a problem, especially for longer contexts. Solutions to this problem are presented by the ppm
(prediction with partial match) algorithm and its variants (described in detail in Chapter 6).
Briefly, the ppm algorithms first attempt to find if the symbol to be encoded has a
nonzero probability with respect to the maximum context length. If this is so, the symbol is
encoded and transmitted. If not, an escape symbol is transmitted, the context size is reduced
by one, and the process is repeated. This procedure is repeated until a context is found
with respect to which the symbol has a nonzero probability. To guarantee that this process
converges, a null context is always included with respect to which all symbols have equal
probability. Initially, only the shorter contexts are likely to be used. However, as more and
more of the source output is processed, the longer contexts, which offer better prediction,
will be used more often. The probability of the escape symbol can be computed in a number
of different ways leading to different implementations [1].
F I G U R E 2. 3 A composite source: Source 1, Source 2, ..., Source n feeding a switch.
The use of Markov models in text compression is a rich and active area of research. We
describe some of these approaches in Chapter 6 (for more details, see [1]).
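To give a feel for the escape mechanism described above, here is a deliberately simplified sketch; it only chooses the context a symbol would be coded in and updates counts. It is not the ppm algorithm of Chapter 6, which couples this search with arithmetic coding, escape probabilities, and exclusions.

```python
# A highly simplified sketch of the escape-and-shorten-context idea.

from collections import defaultdict, Counter

MAX_ORDER = 3
counts = defaultdict(Counter)       # context (tuple of symbols) -> symbol counts

def coding_context(history, symbol):
    # Try the longest allowed context first; on a zero count an escape would be
    # sent and the context shortened by one, down to the empty (null) context.
    for order in range(min(MAX_ORDER, len(history)), -1, -1):
        ctx = tuple(history[len(history) - order:])
        if counts[ctx][symbol] > 0:
            return ctx
    return None                     # never seen in any context

def update(history, symbol):
    # Update the counts for every context length up to MAX_ORDER.
    for order in range(min(MAX_ORDER, len(history)) + 1):
        ctx = tuple(history[len(history) - order:])
        counts[ctx][symbol] += 1

history = []
for ch in "this is the thesis":
    ctx = coding_context(history, ch)   # context the symbol would be coded in
    update(history, ch)
    history.append(ch)
print(dict(counts[()]))                 # symbol counts in the null context
```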
2 . 3 . 4 C o m p o s i t e S o u r c e M o d e l
In many applications, it is not easy to use a single model to describe the source. In such cases,
we can define a composite source, which can be viewed as a combination or composition of
several sources, with only one source being active at any given time. A composite source
can be represented as a number of individual sources 𝒮_i, each with its own model ℳ_i, and
a switch that selects a source 𝒮_i with probability P_i (as shown in Figure 2.3). This is an
exceptionally rich model and can be used to describe some very complicated processes. We
will describe this model in more detail when we need it.
2 . 4 Coding
When we talk about coding in this chapter (and through most of this book), we mean the
assignment of binary sequences to elements of an alphabet. The set of binary sequences is
called a code, and the individual members of the set are called codewords. An alphabet is a
collection of symbols called letters. For example, the alphabet used in writing most books
consists of the 26 lowercase letters, 26 uppercase letters, and a variety of punctuation marks.
In the terminology used in this book, a comma is a letter. The ASCII code for the letter a
is 1000011, the letter A is coded as 1000001, and the letter “,” is coded as 0011010. Notice
that the ASCII code uses the same number of bits to represent each symbol. Such a code
is called a fixed-length code. If we want to reduce the number of bits required to represent
different messages, we need to use a different number of bits to represent different symbols.
If we use fewer bits to represent symbols that occur more often, on the average we would
use fewer bits per symbol. The average number of bits per symbol is often called the rate
of the code. The idea of using fewer bits to represent symbols that occur more often is the
same idea that is used in Morse code: the codewords for letters that occur more frequently
are shorter than for letters that occur less frequently. For example, the codeword for E is ·,
while the codeword for Z is − − · · [9].
2 . 4 . 1 U n i q u e l y D e c o d a b l e C o d e s
The average length of the code is not the only important point in designing a “good”
code. Consider the following example adapted from [10]. Suppose our source alphabet
consists of four letters a_1, a_2, a_3, and a_4, with probabilities P(a_1) = 1/2, P(a_2) = 1/4, and
P(a_3) = P(a_4) = 1/8. The entropy for this source is 1.75 bits/symbol. Consider the codes for
this source in Table 2.1.
The average length l for each code is given by
l = Σ_{i=1}^{4} P(a_i) n(a_i)
where n(a_i) is the number of bits in the codeword for letter a_i and the average length is given
in bits/symbol. Based on the average length, Code 1 appears to be the best code. However,
to be useful, a code should have the ability to transfer information in an unambiguous
manner. This is obviously not the case with Code 1. Both a1 and a2 have been assigned the
codeword 0. When a 0 is received, there is no way to know whether an a1 was transmitted
or an a2. We would like each symbol to be assigned a unique codeword.
At first glance Code 2 does not seem to have the problem of ambiguity; each symbol is
assigned a distinct codeword. However, suppose we want to encode the sequence a2 a1 a1.
Using Code 2, we would encode this with the binary string 100. However, when the string
100 is received at the decoder, there are several ways in which the decoder can decode this
string. The string 100 can be decoded as a2 a1 a1, or as a2 a3. This means that once a
sequence is encoded with Code 2, the original sequence cannot be recovered with certainty.
In general, this is not a desirable property for a code. We would like unique decodability from
the code; that is, any given sequence of codewords can be decoded in one, and only one, way.
We have already seen that Code 1 and Code 2 are not uniquely decodable. How about
Code 3? Notice that the first three codewords all end in a 0. In fact, a 0 always denotes the
termination of a codeword. The final codeword contains no 0s and is 3 bits long. Because
all other codewords have fewer than three 1s and terminate in a 0, the only way we can get
three 1s in a row is as a code for a4. The decoding rule is simple. Accumulate bits until you
get a 0 or until you have three 1s. There is no ambiguity in this rule, and it is reasonably
T A B L E 2 . 1 Four different codes for a four-letter alphabet.
Letters    Probability    Code 1    Code 2    Code 3    Code 4
a_1        0.5            0         0         0         0
a_2        0.25           0         1         10        01
a_3        0.125          1         00        110       011
a_4        0.125          10        11        111       0111
Average length            1.125     1.25      1.75      1.875
easy to see that this code is uniquely decodable. With Code 4 we have an even simpler
condition. Each codeword starts with a 0, and the only time we see a 0 is in the beginning
of a codeword. Therefore, the decoding rule is accumulate bits until you see a 0. The bit
before the 0 is the last bit of the previous codeword.
There is a slight difference between Code 3 and Code 4. In the case of Code 3, the
decoder knows the moment a code is complete. In Code 4, we have to wait till the beginning
of the next codeword before we know that the current codeword is complete. Because of this
property, Code 3 is called an instantaneous code. Although Code 4 is not an instantaneous
code, it is almost that.
While this property of instantaneous or near-instantaneous decoding is a nice property to
have, it is not a requirement for unique decodability. Consider the code shown in Table 2.2.
Let’s decode the string 011111111111111111. In this string, the first codeword is either 0
corresponding to a1 or 01 corresponding to a2. We cannot tell which one until we have
decoded the whole string. Starting with the assumption that the first codeword corresponds
to a1, the next eight pairs of bits are decoded as a3. However, after decoding eight a3s, we
are left with a single (dangling) 1 that does not correspond to any codeword. On the other
hand, if we assume the first codeword corresponds to a2, we can decode the next 16 bits as
a sequence of eight a3s, and we do not have any bits left over. The string can be uniquely
decoded. In fact, Code 5, while it is certainly not instantaneous, is uniquely decodable.
We have been looking at small codes with four letters or less. Even with these, it is not
immediately evident whether the code is uniquely decodable or not. In deciding whether
larger codes are uniquely decodable, a systematic procedure would be useful. Actually, we
should include a caveat with that last statement. Later in this chapter we will include a class
of variable-length codes that are always uniquely decodable, so a test for unique decodability
may not be that necessary. You might wish to skip the following discussion for now, and
come back to it when you find it necessary.
Before we describe the procedure for deciding whether a code is uniquely decodable, let’s
take another look at our last example. We found that we had an incorrect decoding because
we were left with a binary string (1) that was not a codeword. If this had not happened, we
would have had two valid decodings. For example, consider the code shown in Table 2.3. Let’s
T A B L E 2 . 2 Code 5.
Letter Codeword
a1 0
a2 01
a3 11
T A B L E 2 . 3 Code 6.
Letter Codeword
a1 0
a2 01
a3 10
encode the sequence a1 followed by eight a3s using this code. The coded sequence is
01010101010101010. The first bit is the codeword for a1. However, we can also decode it
as the first bit of the codeword for a2. If we use this (incorrect) decoding, we decode the
next seven pairs of bits as the codewords for a2. After decoding seven a2s, we are left with
a single 0 that we decode as a1. Thus, the incorrect decoding is also a valid decoding, and
this code is not uniquely decodable.
A T e s t f o r U n i q u e D e c o d a b i l i t y 
In the previous examples, in the case of the uniquely decodable code, the binary string left
over after we had gone through an incorrect decoding was not a codeword. In the case of the
code that was not uniquely decodable, in the incorrect decoding what was left was a valid
codeword. Based on whether the dangling suffix is a codeword or not, we get the following
test [11, 12].
We start with some definitions. Suppose we have two binary codewords a and b, where
a is k bits long, b is n bits long, and k < n. If the first k bits of b are identical to a, then a is
called a prefix of b. The last n−k bits of b are called the dangling suffix [11]. For example,
if a = 010 and b = 01011, then a is a prefix of b and the dangling suffix is 11.
Construct a list of all the codewords. Examine all pairs of codewords to see if any
codeword is a prefix of another codeword. Whenever you find such a pair, add the dangling
suffix to the list unless you have added the same dangling suffix to the list in a previous
iteration. Now repeat the procedure using this larger list. Continue in this fashion until one
of the following two things happens:
1. You get a dangling suffix that is a codeword.
2. There are no more unique dangling suffixes.
If you get the first outcome, the code is not uniquely decodable. However, if you get the
second outcome, the code is uniquely decodable.
Let’s see how this procedure works with a couple of examples.
E x a m p l e 2 . 4 . 1:
Consider Code 5. First list the codewords
{0, 01, 11}
The codeword 0 is a prefix for the codeword 01. The dangling suffix is 1. There are no
other pairs for which one element of the pair is the prefix of the other. Let us augment the
codeword list with the dangling suffix.
{0, 01, 11, 1}
Comparing the elements of this list, we find 0 is a prefix of 01 with a dangling suffix of 1. But
we have already included 1 in our list. Also, 1 is a prefix of 11. This gives us a dangling suffix
of 1, which is already in the list. There are no other pairs that would generate a dangling suffix,
so we cannot augment the list any further. Therefore, Code 5 is uniquely decodable.
E x a m p l e 2 . 4 . 2:
Consider Code 6. First list the codewords
{0, 01, 10}
The codeword 0 is a prefix for the codeword 01. The dangling suffix is 1. There are no other
pairs for which one element of the pair is the prefix of the other. Augmenting the codeword
list with 1, we obtain the list
{0, 01, 10, 1}
In this list, 1 is a prefix for 10. The dangling suffix for this pair is 0, which is the codeword
for a1. Therefore, Code 6 is not uniquely decodable. 
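The dangling-suffix test lends itself to a direct implementation. The sketch below follows the spirit of the procedure described above (keep generating dangling suffixes; stop when one of them equals a codeword or when no new suffixes appear); it is an illustration, not an optimized or fully general tool.

```python
# Unique decodability test based on dangling suffixes.

def dangling_suffixes(strings, codewords):
    out = set()
    for a in strings:
        for b in codewords:
            if a != b:
                if b.startswith(a):
                    out.add(b[len(a):])   # a is a prefix of b
                if a.startswith(b):
                    out.add(a[len(b):])   # b is a prefix of a
    return out

def uniquely_decodable(codewords):
    codewords = set(codewords)
    suffixes = set()
    while True:
        new = dangling_suffixes(codewords | suffixes, codewords) - suffixes
        new.discard("")
        if new & codewords:
            return False          # a dangling suffix is a codeword
        if not new:
            return True           # no new dangling suffixes can be generated
        suffixes |= new

print(uniquely_decodable(["0", "01", "11"]))   # Code 5: True
print(uniquely_decodable(["0", "01", "10"]))   # Code 6: False
```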
2 . 4 . 2 P r e f i x C o d e s
The test for unique decodability requires examining the dangling suffixes initially generated
by codeword pairs in which one codeword is the prefix of the other. If the dangling suffix
is itself a codeword, then the code is not uniquely decodable. One type of code in which we
will never face the possibility of a dangling suffix being a codeword is a code in which no
codeword is a prefix of the other. In this case, the set of dangling suffixes is the null set, and
we do not have to worry about finding a dangling suffix that is identical to a codeword. A
code in which no codeword is a prefix to another codeword is called a prefix code. A simple
way to check if a code is a prefix code is to draw the rooted binary tree corresponding to
the code. Draw a tree that starts from a single node (the root node) and has a maximum of
two possible branches at each node. One of these branches corresponds to a 1 and the other
branch corresponds to a 0. In this book, we will adopt the convention that when we draw
a tree with the root node at the top, the left branch corresponds to a 0 and the right branch
corresponds to a 1. Using this convention, we can draw the binary tree for Code 2, Code 3,
and Code 4 as shown in Figure 2.4.
Note that apart from the root node, the trees have two kinds of nodes—nodes that give
rise to other nodes and nodes that do not. The first kind of nodes are called internal nodes,
and the second kind are called external nodes or leaves. In a prefix code, the codewords are
only associated with the external nodes. A code that is not a prefix code, such as Code 4, will
have codewords associated with internal nodes. The code for any symbol can be obtained
F I G U R E 2. 4 Binary trees for three different codes (Code 2, Code 3, and Code 4).
by traversing the tree from the root to the external node corresponding to that symbol. Each
branch on the way contributes a bit to the codeword: a 0 for each left branch and a 1 for
each right branch.
It is nice to have a class of codes whose members are so clearly uniquely decodable.
However, are we losing something if we restrict ourselves to prefix codes? Could it be that
if we do not restrict ourselves to prefix codes, we can find shorter codes? Fortunately for us
the answer is no. For any nonprefix uniquely decodable code, we can always find a prefix
code with the same codeword lengths. We prove this in the next section.
2 . 4 . 3 T h e K r a f t - M c M i l l a n I n e q u a l i t y 
The particular result we look at in this section consists of two parts. The first part provides
a necessary condition on the codeword lengths of uniquely decodable codes. The second
part shows that we can always find a prefix code that satisfies this necessary condition.
Therefore, if we have a uniquely decodable code that is not a prefix code, we can always
find a prefix code with the same codeword lengths.
T h e o r e m Let 𝒞 be a code with N codewords with lengths l_1, l_2, ..., l_N. If 𝒞 is uniquely
decodable, then
K(𝒞) = Σ_{i=1}^{N} 2^{−l_i} ≤ 1
This inequality is known as the Kraft-McMillan inequality.
P r o o f The proof works by looking at the nth power of K(𝒞). If K(𝒞) is greater than
one, then K(𝒞)^n should grow exponentially with n. If it does not grow exponentially with
n, then this is proof that Σ_{i=1}^{N} 2^{−l_i} ≤ 1.
Let n be an arbitrary integer. Then
[ Σ_{i=1}^{N} 2^{−l_i} ]^n = [ Σ_{i_1=1}^{N} 2^{−l_{i_1}} ] [ Σ_{i_2=1}^{N} 2^{−l_{i_2}} ] ··· [ Σ_{i_n=1}^{N} 2^{−l_{i_n}} ]     (2.17)
= Σ_{i_1=1}^{N} Σ_{i_2=1}^{N} ··· Σ_{i_n=1}^{N} 2^{−(l_{i_1} + l_{i_2} + ··· + l_{i_n})}     (2.18)
The exponent l_{i_1} + l_{i_2} + ··· + l_{i_n} is simply the length of n codewords from the code 𝒞. The
smallest value that this exponent can take is greater than or equal to n, which would be the
case if all codewords were 1 bit long. If
l = max{l_1, l_2, ..., l_N}
then the largest value that the exponent can take is less than or equal to nl. Therefore, we
can write this summation as
K(𝒞)^n = Σ_{k=n}^{nl} A_k 2^{−k}
2.4 Coding 33
where Ak is the number of combinations of n codewords that have a combined length of
k. Let’s take a look at the size of this coefficient. The number of possible distinct binary
sequences of length k is 2k
. If this code is uniquely decodable, then each sequence can
represent one and only one sequence of codewords. Therefore, the number of possible
combinations of codewords whose combined length is k cannot be greater than 2k
. In other
words,
Ak ≤ 2k

This means that
Kn
=
nl

k=n
Ak2−k
≤
nl

k=n
2k
2−k
= nl−n+1 (2.19)
But if K(𝒞) is greater than one, it will grow exponentially with n, while nl − n + 1 can
only grow linearly. So if K(𝒞) is greater than one, we can always find an n large enough
that the inequality (2.19) is violated. Therefore, for a uniquely decodable code 𝒞, K(𝒞) is
less than or equal to one.
This part of the Kraft-McMillan inequality provides a necessary condition for uniquely
decodable codes. That is, if a code is uniquely decodable, the codeword lengths have to
satisfy the inequality. The second part of this result is that if we have a set of codeword
lengths that satisfy the inequality, we can always find a prefix code with those codeword
lengths. The proof of this assertion presented here is adapted from [6].
T h e o r e m Given a set of integers l_1, l_2, ..., l_N that satisfy the inequality
Σ_{i=1}^{N} 2^{−l_i} ≤ 1
we can always find a prefix code with codeword lengths l_1, l_2, ..., l_N.
P r o o f We will prove this assertion by developing a procedure for constructing a prefix
code with codeword lengths l_1, l_2, ..., l_N that satisfy the given inequality.
Without loss of generality, we can assume that
l_1 ≤ l_2 ≤ ··· ≤ l_N
Define a sequence of numbers w_1, w_2, ..., w_N as follows:
w_1 = 0
w_j = Σ_{i=1}^{j−1} 2^{l_j − l_i},   j > 1
The binary representation of w_j for j > 1 would take up ⌊log_2 w_j⌋ + 1 bits. We will use
this binary representation to construct a prefix code. We first note that the number of bits
in the binary representation of w_j is less than or equal to l_j; equivalently, w_j + 1 ≤ 2^{l_j}.
This is obviously true for w_1. For j > 1,
log_2 (w_j + 1) = log_2 ( Σ_{i=1}^{j−1} 2^{l_j − l_i} + 1 )
= log_2 ( 2^{l_j} [ Σ_{i=1}^{j−1} 2^{−l_i} + 2^{−l_j} ] )
= l_j + log_2 ( Σ_{i=1}^{j} 2^{−l_i} )
≤ l_j
The last inequality results from the hypothesis of the theorem that Σ_{i=1}^{N} 2^{−l_i} ≤ 1, which
implies that Σ_{i=1}^{j} 2^{−l_i} ≤ 1. As the logarithm of a number less than or equal to one is not
positive, l_j + log_2 ( Σ_{i=1}^{j} 2^{−l_i} ) has to be less than or equal to l_j.
Using the binary representation of w_j, we can devise a binary code in the following manner:
If ⌊log_2 w_j⌋ + 1 = l_j, then the jth codeword c_j is the binary representation of w_j. If
⌊log_2 w_j⌋ + 1 < l_j, then c_j is the binary representation of w_j with l_j − ⌊log_2 w_j⌋ − 1 zeros
appended to the left, so that c_j is always l_j bits long. This is certainly a code, but is it a prefix
code? If we can show that the code 𝒞 = {c_1, c_2, ..., c_N} is a prefix code, then we will have
proved the theorem by construction.
Suppose that our claim is not true. Then for some j < k, c_j is a prefix of c_k. This means
that the l_j most significant bits of the l_k-bit representation of w_k form the l_j-bit representation
of w_j. Therefore, if we right-shift the binary representation of w_k by l_k − l_j bits, we should
get the binary representation for w_j. We can write this as
w_j = ⌊ w_k / 2^{l_k − l_j} ⌋
However,
w_k = Σ_{i=1}^{k−1} 2^{l_k − l_i}
Therefore,
w_k / 2^{l_k − l_j} = Σ_{i=1}^{k−1} 2^{l_j − l_i}
= w_j + Σ_{i=j}^{k−1} 2^{l_j − l_i}
= w_j + 2^0 + Σ_{i=j+1}^{k−1} 2^{l_j − l_i}
≥ w_j + 1     (2.20)
That is, the smallest value for ⌊ w_k / 2^{l_k − l_j} ⌋ is w_j + 1. This contradicts the requirement for c_j
being the prefix of c_k. Therefore, c_j cannot be the prefix for c_k. As j and k were arbitrary, this
means that no codeword is a prefix of another codeword, and the code 𝒞 is a prefix code.
Therefore, if we have a uniquely decodable code, the codeword lengths have to satisfy
the Kraft-McMillan inequality. And, given codeword lengths that satisfy the Kraft-McMillan
inequality, we can always find a prefix code with those codeword lengths. Thus, by restricting
ourselves to prefix codes, we are not in danger of overlooking nonprefix uniquely decodable
codes that have a shorter average length.
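The constructive half of the result can be turned into a short procedure: check the inequality and, if it holds, build the codewords from the w_j values defined in the proof, writing each w_j in exactly l_j bits. The sketch below is illustrative only; sorting the lengths reproduces the "without loss of generality" assumption of the proof.

```python
# Check the Kraft-McMillan inequality and construct a prefix code from lengths.

def kraft_sum(lengths):
    return sum(2 ** -l for l in lengths)

def prefix_code(lengths):
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate the Kraft-McMillan inequality")
    lengths = sorted(lengths)                            # l_1 <= l_2 <= ... <= l_N
    code = []
    for j, lj in enumerate(lengths):
        wj = sum(2 ** (lj - li) for li in lengths[:j])   # w_1 = 0, w_j = sum 2^(l_j - l_i)
        code.append(format(wj, f"0{lj}b"))               # w_j written in l_j bits
    return code

print(kraft_sum([1, 2, 3, 3]))       # 1.0
print(prefix_code([1, 2, 3, 3]))     # ['0', '10', '110', '111']
print(prefix_code([2, 2, 3]))        # ['00', '01', '100']
```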
2 . 5 Algorithmic Information Theory
The theory of information described in the previous sections is intuitively satisfying and
has useful applications. However, when dealing with real world data, it does have some
theoretical difficulties. Suppose you were given the task of developing a compression scheme
for use with a specific set of documents. We can view the entire set as a single long
string. You could develop models for the data. Based on these models you could calculate
probabilities using the relative frequency approach. These probabilities could then be used
to obtain an estimate of the entropy and thus an estimate of the amount of compression
available. All is well except for a fly in the “ointment.” The string you have been given is
fixed. There is nothing probabilistic about it. There is no abstract source that will generate
different sets of documentation at different times. So how can we talk about the entropies
without pretending that reality is somehow different from what it actually is? Unfortunately,
it is not clear that we can. Our definition of entropy requires the existence of an abstract
source. Our estimate of the entropy is still useful. It will give us a very good idea of how
much compression we can get. So, practically speaking, information theory comes through.
However, theoretically it seems there is some pretending involved. Algorithmic information
theory is a different way of looking at information that has not been as useful in practice
(and therefore we will not be looking at it a whole lot) but it gets around this theoretical
problem. At the heart of algorithmic information theory is a measure called Kolmogorov
complexity. This measure, while it bears the name of one person, was actually discovered
independently by three people: R. Solomonoff, who was exploring machine learning; the
Russian mathematician A.N. Kolmogorov; and G. Chaitin, who was in high school when he
came up with this idea.
The Kolmogorov complexity K(x) of a sequence x is the size of the program needed
to generate x. In this size we include all inputs that might be needed by the program.
We do not specify the programming language because it is always possible to translate
a program in one language to a program in another language at fixed cost. If x was a
sequence of all ones, a highly compressible sequence, the program would simply be a print
statement in a loop. On the other extreme, if x were a random sequence with no structure,
then the only program that could generate it would contain the sequence itself. The size
of the program would be slightly larger than the sequence itself. Thus, there is a clear
correspondence between the size of the smallest program that can generate a sequence and
the amount of compression that can be obtained. Kolmogorov complexity seems to be the
ideal measure to use in data compression. The problem is we do not know of any systematic
way of computing or closely approximating Kolmogorov complexity. Clearly, any program
that can generate a particular sequence is an upper bound for the Kolmogorov complexity
of the sequence. However, we have no way of determining a lower bound. Thus, while
the notion of Kolmogorov complexity is more satisfying theoretically than the notion of
entropy when compressing sequences, in practice it is not yet as helpful. However, given
the active interest in these ideas it is quite possible that they will result in more practical
applications.
2 . 6 Minimum Description Length Principle
One of the more practical offshoots of Kolmogorov complexity is the minimum description
length (MDL) principle. The first discoverer of Kolmogorov complexity, Ray Solomonoff,
viewed the concept of a program that would generate a sequence as a way of modeling the
data. Independent from Solomonoff but inspired nonetheless by the ideas of Kolmogorov
complexity, Jorma Rissanen in 1978 [13] developed the modeling approach commonly
known as MDL.
F I G U R E 2. 5 An example to illustrate the MDL principle.
Let M_j be a model from a set of models ℳ that attempt to characterize the structure in a sequence x. Let D_{M_j} be the number of bits required to describe the model M_j. For example, if the set of models ℳ can be represented by a (possibly variable) number of coefficients, then the description of M_j would include the number of coefficients and the value of each coefficient. Let R_{M_j}(x) be the number of bits required to represent x with respect to the model M_j. The minimum description length would be given by

$$\min_{j}\left(D_{M_j} + R_{M_j}(x)\right)$$
Consider the example shown in Figure 2.5, where the X’s represent data values. Suppose the set of models ℳ is the set of kth-order polynomials. We have also sketched two polynomials that could be used to model the data. Clearly, the higher-order polynomial does a much “better” job of modeling the data in the sense that the model exactly describes the data. To describe the higher-order polynomial, we need to specify the value of each coefficient. The coefficients have to be exact if the polynomial is to model the data exactly, which requires a large number of bits. The quadratic model, on the other hand, does not fit any
of the data values. However, its description is very simple and the data values are either
+1 or −1 away from the quadratic. So we could exactly represent the data by sending the
coefficients of the quadratic (1, 0) and 1 bit per data value to indicate whether each data
value is +1 or −1 away from the quadratic. In this case, from a compression point of view,
using the worse model actually gives better compression.
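To make the two-part bookkeeping concrete, the following Python sketch (not from the book) tallies the description lengths for the two models of Figure 2.5. The bit budgets, such as 32 bits per exactly specified coefficient, are illustrative assumptions rather than prescribed values.

# Illustrative two-part MDL bookkeeping for the scenario of Figure 2.5.
# The bit budgets are assumptions made for the comparison, not canonical values.
n_points = 10
coeff_bits = 32                              # assumed cost of one exactly specified coefficient

# Model A: a high-order polynomial that passes through every data point.
exact_model_bits = coeff_bits * n_points     # D_M: one coefficient per data point
exact_data_bits = 0                          # R_M(x): the model reproduces the data exactly

# Model B: the simple quadratic; every point is +1 or -1 away from the curve.
quad_model_bits = coeff_bits * 2             # D_M: the two coefficients (1, 0)
quad_data_bits = 1 * n_points                # R_M(x): one sign bit per data value

print("exact polynomial:", exact_model_bits + exact_data_bits, "bits")     # 320
print("quadratic + residuals:", quad_model_bits + quad_data_bits, "bits")  # 74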
2 . 7 Summary
In this chapter we learned some of the basic definitions of information theory. This was
a rather brief visit, and we will revisit the subject in Chapter 8. However, the coverage
in this chapter will be sufficient to take us through the next four chapters. The concepts
introduced in this chapter allow us to estimate the number of bits we need to represent the
output of a source given the probability model for the source. The process of assigning a
binary representation to the output of a source is called coding. We have introduced the
concepts of unique decodability and prefix codes, which we will use in the next two chapters
when we describe various coding algorithms. We also looked, rather briefly, at different
approaches to modeling. If we need to understand a model in more depth later in the book,
we will devote more attention to it at that time. However, for the most part, the coverage of
modeling in this chapter will be sufficient to understand methods described in the next four
chapters.
F u r t h e r R e a d i n g
1. A very readable book on information theory and its applications in a number of
fields is Symbols, Signals, and Noise—The Nature and Process of Communications,
by J.R. Pierce [14].
2. Another good introductory source for the material in this chapter is Chapter 6 of
Coding and Information Theory, by R.W. Hamming [9].
3. Various models for text compression are described very nicely and in more detail in
Text Compression, by T.C. Bell, J.G. Cleary, and I.H. Witten [1].
4. For a more thorough and detailed account of information theory, the following books
are especially recommended (the first two are my personal favorites): Information The-
ory, by R.B. Ash [15]; Transmission of Information, by R.M. Fano [16]; Information
Theory and Reliable Communication, by R.G. Gallagher [11]; Entropy and Informa-
tion Theory, by R.M. Gray [17]; Elements of Information Theory, by T.M. Cover and
J.A. Thomas [3]; and The Theory of Information and Coding, by R.J. McEliece [6].
5. Kolmogorov complexity is addressed in detail in An Introduction to Kolmogorov
Complexity and Its Applications, by M. Li and P. Vitanyi [18].
6. A very readable overview of Kolmogorov complexity in the context of lossless
compression can be found in the chapter Complexity Measures, by S.R. Tate [19].
7. Various aspects of the minimum description length principle are discussed in
Advances in Minimum Description Length edited by P. Grunwald, I.J. Myung, and
M.A. Pitt [20]. Included in this book is a very nice introduction to the minimum
description length principle by Peter Grunwald [21].
2 . 8 Projects and Problems
1. Suppose X is a random variable that takes on values from an M-letter alphabet. Show that 0 ≤ H(X) ≤ log2 M.
2. Show that for the case where the elements of an observed sequence are iid, the entropy
is equal to the first-order entropy.
3. Given an alphabet A = {a1, a2, a3, a4}, find the first-order entropy in the following cases:
(a) P(a1) = P(a2) = P(a3) = P(a4) = 1/4.
(b) P(a1) = 1/2, P(a2) = 1/4, P(a3) = P(a4) = 1/8.
(c) P(a1) = 0.505, P(a2) = 1/4, P(a3) = 1/8, and P(a4) = 0.12.
4. Suppose we have a source with a probability model P = {p0, p1, ..., pm} and entropy H(P). Suppose we have another source with probability model Q = {q0, q1, ..., qm} and entropy H(Q), where

q_i = p_i,  i = 0, 1, ..., j−2, j+1, ..., m

and

q_j = q_{j−1} = (p_j + p_{j−1}) / 2.

How is H(Q) related to H(P) (greater, equal, or less)? Prove your answer.
5. There are several image and speech files among the accompanying data sets.
(a) Write a program to compute the first-order entropy of some of the image and
speech files.
(b) Pick one of the image files and compute its second-order entropy. Comment on
the difference between the first- and second-order entropies.
(c) Compute the entropy of the differences between neighboring pixels for the image
you used in part (b). Comment on what you discover.
6. Conduct an experiment to see how well a model can describe a source.
(a) Write a program that randomly selects letters from the 26-letter alphabet {a, b, ..., z} and forms four-letter words. Form 100 such words and see how many of these words make sense.
(b) Among the accompanying data sets is a file called 4letter.words, which contains a list of four-letter words. Using this file, obtain a probability model for the alphabet. Now repeat part (a) generating the words using the probability model. To pick letters according to a probability model, construct the cumulative density function (cdf) F_X(x) (see Appendix A for the definition of cdf). Using a uniform pseudorandom number generator to generate a value r, where 0 ≤ r < 1, pick the letter x_k if F_X(x_{k−1}) ≤ r < F_X(x_k). Compare your results with those of part (a).
(c) Repeat (b) using a single-letter context.
(d) Repeat (b) using a two-letter context.
7. Determine whether the following codes are uniquely decodable:
(a) {0, 01, 11, 111}
(b) {0, 01, 110, 111}
(c) {0, 10, 110, 111}
(d) {1, 10, 110, 111}
8. Using a text file, compute the probabilities of each letter p_i.
(a) Assume that we need a codeword of length ⌈log2(1/p_i)⌉ to encode the letter i. Determine the number of bits needed to encode the file.
(b) Compute the conditional probabilities P(i/j) of a letter i given that the previous letter is j. Assume that we need ⌈log2(1/P(i/j))⌉ bits to represent a letter i that follows a letter j. Determine the number of bits needed to encode the file.
3
Huffman Coding
3 . 1 Overview
In this chapter we describe a very popular coding algorithm called the Huffman
coding algorithm. We first present a procedure for building Huffman codes
when the probability model for the source is known, then a procedure for
building codes when the source statistics are unknown. We also describe a
few techniques for code design that are in some sense similar to the Huffman
coding approach. Finally, we give some examples of using the Huffman code for image
compression, audio compression, and text compression.
3 . 2 The Huffman Coding Algorithm
This technique was developed by David Huffman as part of a class assignment; the class was
the first ever in the area of information theory and was taught by Robert Fano at MIT [22].
The codes generated using this technique or procedure are called Huffman codes. These
codes are prefix codes and are optimum for a given model (set of probabilities).
The Huffman procedure is based on two observations regarding optimum prefix codes.
1. In an optimum code, symbols that occur more frequently (have a higher probability
of occurrence) will have shorter codewords than symbols that occur less frequently.
2. In an optimum code, the two symbols that occur least frequently will have the same length.
It is easy to see that the first observation is correct. If symbols that occur more often had
codewords that were longer than the codewords for symbols that occurred less often, the
average number of bits per symbol would be larger than if the conditions were reversed.
Therefore, a code that assigns longer codewords to symbols that occur more frequently
cannot be optimum.
To see why the second observation holds true, consider the following situation. Suppose
an optimum code 𝒞 exists in which the two codewords corresponding to the two least
probable symbols do not have the same length. Suppose the longer codeword is k bits longer
than the shorter codeword. Because this is a prefix code, the shorter codeword cannot be
a prefix of the longer codeword. This means that even if we drop the last k bits of the
longer codeword, the two codewords would still be distinct. As these codewords correspond
to the least probable symbols in the alphabet, no other codeword can be longer than these
codewords; therefore, there is no danger that the shortened codeword would become the
prefix of some other codeword. Furthermore, by dropping these k bits we obtain a new code
that has a shorter average length than 𝒞. But this violates our initial contention that 𝒞 is an
optimal code. Therefore, for an optimal code the second observation also holds true.
The Huffman procedure is obtained by adding a simple requirement to these two obser-
vations. This requirement is that the codewords corresponding to the two lowest probability
symbols differ only in the last bit. That is, if γ and δ are the two least probable symbols in an alphabet, and the codeword for γ is m∗0, then the codeword for δ will be m∗1. Here m is a string of 1s and 0s, and ∗ denotes concatenation.
This requirement does not violate our two observations and leads to a very simple
encoding procedure. We describe this procedure with the help of the following example.
E x a m p l e 3 . 2 . 1: D e s i g n o f a H u f f m a n c o d e
Let us design a Huffman code for a source that puts out letters from an alphabet A = {a1, a2, a3, a4, a5} with P(a1) = P(a3) = 0.2, P(a2) = 0.4, and P(a4) = P(a5) = 0.1. The entropy for this source is 2.122 bits/symbol. To design the Huffman code, we first sort the letters in a descending probability order as shown in Table 3.1. Here c(ai) denotes the codeword for ai.
T A B L E 3 . 1 The initial five-letter alphabet.

Letter    Probability    Codeword
a2        0.4            c(a2)
a1        0.2            c(a1)
a3        0.2            c(a3)
a4        0.1            c(a4)
a5        0.1            c(a5)
The two symbols with the lowest probability are a4 and a5. Therefore, we can assign their codewords as

c(a4) = α1 ∗ 0
c(a5) = α1 ∗ 1

where α1 is a binary string, and ∗ denotes concatenation.
We now define a new alphabet A′ with a four-letter alphabet a1, a2, a3, a′4, where a′4 is composed of a4 and a5 and has a probability P(a′4) = P(a4) + P(a5) = 0.2. We sort this new alphabet in descending order to obtain Table 3.2.
T A B L E 3 . 2 The reduced four-letter alphabet.

Letter    Probability    Codeword
a2        0.4            c(a2)
a1        0.2            c(a1)
a3        0.2            c(a3)
a′4       0.2            α1
In this alphabet, a3 and a′4 are the two letters at the bottom of the sorted list. We assign their codewords as

c(a3) = α2 ∗ 0
c(a′4) = α2 ∗ 1

but c(a′4) = α1. Therefore,

α1 = α2 ∗ 1

which means that

c(a4) = α2 ∗ 10
c(a5) = α2 ∗ 11
At this stage, we again define a new alphabet A″ that consists of three letters a1, a2, a″3, where a″3 is composed of a3 and a′4 and has a probability P(a″3) = P(a3) + P(a′4) = 0.4. We sort this new alphabet in descending order to obtain Table 3.3.
T A B L E 3 . 3 The reduced three-letter alphabet.

Letter    Probability    Codeword
a2        0.4            c(a2)
a″3       0.4            α2
a1        0.2            c(a1)
In this case, the two least probable symbols are a1 and a″3. Therefore,

c(a″3) = α3 ∗ 0
c(a1) = α3 ∗ 1
But c(a″3) = α2. Therefore,

α2 = α3 ∗ 0

which means that

c(a3) = α3 ∗ 00
c(a4) = α3 ∗ 010
c(a5) = α3 ∗ 011
Again we define a new alphabet, this time with only two letters a‴3 and a2. Here a‴3 is composed of the letters a″3 and a1 and has probability P(a‴3) = P(a″3) + P(a1) = 0.6. We now have Table 3.4.
T A B L E 3 . 4 The reduced two-letter alphabet.

Letter    Probability    Codeword
a‴3       0.6            α3
a2        0.4            c(a2)
As we have only two letters, the codeword assignment is straightforward:

c(a‴3) = 0
c(a2) = 1

which means that α3 = 0, which in turn means that

c(a1) = 01
c(a3) = 000
c(a4) = 0010
c(a5) = 0011
T A B L E 3 . 5 Huffman code for the original
five-letter alphabet.
Letter Probability Codeword
a2 0.4 1
a1 0.2 01
a3 0.2 000
a4 0.1 0010
a5 0.1 0011
F I G U R E 3 . 1 The Huffman encoding procedure. The symbol probabilities are listed in parentheses.
and the Huffman code is given by Table 3.5. The procedure can be summarized as shown
in Figure 3.1. 
The average length for this code is

l = 0.4 × 1 + 0.2 × 2 + 0.2 × 3 + 0.1 × 4 + 0.1 × 4 = 2.2 bits/symbol
A measure of the efficiency of this code is its redundancy—the difference between the
entropy and the average length. In this case, the redundancy is 0.078 bits/symbol. The
redundancy is zero when the probabilities are negative powers of two.
An alternative way of building a Huffman code is to use the fact that the Huffman
code, by virtue of being a prefix code, can be represented as a binary tree in which the
external nodes or leaves correspond to the symbols. The Huffman code for any symbol can
be obtained by traversing the tree from the root node to the leaf corresponding to the symbol,
adding a 0 to the codeword every time the traversal takes us over an upper branch and a
1 every time the traversal takes us over a lower branch.
We build the binary tree starting at the leaf nodes. We know that the codewords for the
two symbols with smallest probabilities are identical except for the last bit. This means that
the traversal from the root to the leaves corresponding to these two symbols must be the same
except for the last step. This in turn means that the leaves corresponding to the two symbols
with the lowest probabilities are offspring of the same node. Once we have connected the
leaves corresponding to the symbols with the lowest probabilities to a single node, we treat
this node as a symbol of a reduced alphabet. The probability of this symbol is the sum of
the probabilities of its offspring. We can now sort the nodes corresponding to the reduced
alphabet and apply the same rule to generate a parent node for the nodes corresponding to the
two symbols in the reduced alphabet with lowest probabilities. Continuing in this manner,
we end up with a single node, which is the root node. To obtain the code for each symbol,
we traverse the tree from the root to each leaf node, assigning a 0 to the upper branch and a
1 to the lower branch. This procedure as applied to the alphabet of Example 3.2.1 is shown
in Figure 3.2. Notice the similarity between Figures 3.1 and 3.2. This is not surprising, as
they are a result of viewing the same procedure in two different ways.
F I G U R E 3 . 2 Building the binary Huffman tree.
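The tree-building view translates almost directly into code. The following Python sketch is not from the book; it is a minimal heap-based construction, and because ties between equal probabilities are broken arbitrarily it reproduces the optimal codeword lengths of Example 3.2.1 but not necessarily the exact bit patterns of Table 3.5.

import heapq

def huffman_code(probs):
    """Build a binary Huffman code for a {symbol: probability} model by
    repeatedly merging the two least probable subtrees; 0 labels one branch
    and 1 the other. A sketch, not library code."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                      # unique tie-breaker for the heap
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # least probable subtree
        p1, _, code1 = heapq.heappop(heap)   # second least probable subtree
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
print(code)
print(sum(probs[s] * len(code[s]) for s in probs))   # ≈ 2.2 bits/symbol, as in Example 3.2.1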
3 . 2 . 1 M i n i m u m V a r i a n c e H u f f m a n C o d e s
By performing the sorting procedure in a slightly different manner, we could have found a
different Huffman code. In the first re-sort, we could place a′4 higher in the list, as shown in Table 3.6.
Now combine a1 and a3 into a′1, which has a probability of 0.4. Sorting the alphabet a2, a′4, a′1 and putting a′1 as far up the list as possible, we get Table 3.7. Finally, by combining a2 and a′4 and re-sorting, we get Table 3.8. If we go through the unbundling procedure, we get the codewords in Table 3.9. The procedure is summarized in Figure 3.3. The average length of the code is
l = 0.4 × 2 + 0.2 × 2 + 0.2 × 2 + 0.1 × 3 + 0.1 × 3 = 2.2 bits/symbol
The two codes are identical in terms of their redundancy. However, the variance of the
length of the codewords is significantly different. This can be clearly seen from Figure 3.4.
T A B L E 3 . 6 Reduced four-letter alphabet.

Letter    Probability    Codeword
a2        0.4            c(a2)
a′4       0.2            α1
a1        0.2            c(a1)
a3        0.2            c(a3)
T A B L E 3 . 7 Reduced three-letter alphabet.

Letter    Probability    Codeword
a′1       0.4            α2
a2        0.4            c(a2)
a′4       0.2            α1
T A B L E 3 . 8 Reduced two-letter alphabet.

Letter    Probability    Codeword
a″2       0.6            α3
a′1       0.4            α2
T A B L E 3 . 9 Minimum variance Huffman code.
Letter Probability Codeword
a1 0.2 10
a2 0.4 00
a3 0.2 11
a4 0.1 010
a5 0.1 011
F I G U R E 3 . 3 The minimum variance Huffman encoding procedure.
F I G U R E 3 . 4 Two Huffman trees corresponding to the same probabilities.
Remember that in many applications, although you might be using a variable-length code,
the available transmission rate is generally fixed. For example, if we were going to transmit
symbols from the alphabet we have been using at 10,000 symbols per second, we might ask
for transmission capacity of 22,000 bits per second. This means that during each second the
channel expects to receive 22,000 bits, no more and no less. As the bit generation rate will
vary around 22,000 bits per second, the output of the source coder is generally fed into a
buffer. The purpose of the buffer is to smooth out the variations in the bit generation rate.
However, the buffer has to be of finite size, and the greater the variance in the codewords, the
more difficult the buffer design problem becomes. Suppose that the source we are discussing
generates a string of a4s and a5s for several seconds. If we are using the first code, this
means that we will be generating bits at a rate of 40,000 bits per second. For each second,
the buffer has to store 18,000 bits. On the other hand, if we use the second code, we would
be generating 30,000 bits per second, and the buffer would have to store 8000 bits for every
second this condition persisted. If we have a string of a2s instead of a string of a4s and a5s,
the first code would result in the generation of 10,000 bits per second. Remember that the
channel will still be expecting 22,000 bits every second, so somehow we will have to make
up a deficit of 12,000 bits per second. The same situation using the second code would lead
to a deficit of 2000 bits per second. Thus, it seems reasonable to elect to use the second
code instead of the first. To obtain the Huffman code with minimum variance, we always
put the combined letter as high in the list as possible.
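A short calculation (a sketch, not from the book) makes the contrast explicit: the two codes of Tables 3.5 and 3.9 have the same average length, but their codeword-length variances differ by almost an order of magnitude.

# Codeword lengths of the two Huffman codes, paired with the probabilities
# of Example 3.2.1 (symbols ordered a2, a1, a3, a4, a5).
probs  = [0.4, 0.2, 0.2, 0.1, 0.1]
code_1 = [1, 2, 3, 4, 4]                    # Table 3.5
code_2 = [2, 2, 2, 3, 3]                    # Table 3.9 (minimum variance)

def mean_and_variance(lengths, probs):
    mean = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - mean) ** 2 for p, l in zip(probs, lengths))
    return mean, var

print(mean_and_variance(code_1, probs))     # ≈ (2.2, 1.36)
print(mean_and_variance(code_2, probs))     # ≈ (2.2, 0.16)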
3 . 2 . 2 O p t i m a l i t y o f H u f f m a n C o d e s 
The optimality of Huffman codes can be proven rather simply by first writing down the
necessary conditions that an optimal code has to satisfy and then showing that satisfying
these conditions necessarily leads to designing a Huffman code. The proof we present here
is based on the proof shown in [16] and is obtained for the binary case (for a more general
proof, see [16]).
The necessary conditions for an optimal variable-length binary code are as follows:
• Condition 1: Given any two letters aj and ak, if P(aj) ≥ P(ak), then lj ≤ lk, where lj is the number of bits in the codeword for aj.
• Condition 2: The two least probable letters have codewords with the same maximum length lm.
We have provided the justification for these two conditions in the opening sections of this
chapter.
• Condition 3: In the tree corresponding to the optimum code, there must be two branches stemming from each intermediate node.
If there were any intermediate node with only one branch coming from that node, we could
remove it without affecting the decipherability of the code while reducing its average length.
• Condition 4: Suppose we change an intermediate node into a leaf node by combining all the leaves descending from it into a composite word of a reduced alphabet. Then, if the original tree was optimal for the original alphabet, the reduced tree is optimal for the reduced alphabet.
If this condition were not satisfied, we could find a code with smaller average code length
for the reduced alphabet and then simply expand the composite word again to get a new
code tree that would have a shorter average length than our original “optimum” tree. This
would contradict our statement about the optimality of the original tree.
In order to satisfy conditions 1, 2, and 3, the two least probable letters would have to be
assigned codewords of maximum length lm. Furthermore, the leaves corresponding to these
letters arise from the same intermediate node. This is the same as saying that the codewords
for these letters are identical except for the last bit. Consider the common prefix as the
codeword for the composite letter of a reduced alphabet. Since the code for the reduced
alphabet needs to be optimum for the code of the original alphabet to be optimum, we
follow the same procedure again. To satisfy the necessary conditions, the procedure needs
to be iterated until we have a reduced alphabet of size one. But this is exactly the Huffman
procedure. Therefore, the necessary conditions above, which are all satisfied by the Huffman
procedure, are also sufficient conditions.
3 . 2 . 3 L e n g t h o f H u f f m a n C o d e s 
We have said that the Huffman coding procedure generates an optimum code, but we have
not said what the average length of an optimum code is. The length of any code will depend
on a number of things, including the size of the alphabet and the probabilities of individual
letters. In this section we will show that the optimal code for a source S, hence the Huffman code for the source S, has an average code length l bounded below by the entropy and bounded above by the entropy plus 1 bit. In other words,

$$H(S) \le \bar{l} < H(S) + 1 \qquad (3.1)$$
In order for us to do this, we will need to use the Kraft-McMillan inequality introduced
in Chapter 2. Recall that the first part of this result, due to McMillan, states that if we have a uniquely decodable code 𝒞 with K codewords of lengths {l_i}, i = 1, ..., K, then the following inequality holds:

$$\sum_{i=1}^{K} 2^{-l_i} \le 1 \qquad (3.2)$$
E x a m p l e 3 . 2 . 2:
Examining the code generated in Example 3.2.1 (Table 3.5), the lengths of the codewords are {1, 2, 3, 4, 4}. Substituting these values into the left-hand side of Equation (3.2), we get

$$2^{-1} + 2^{-2} + 2^{-3} + 2^{-4} + 2^{-4} = 1$$

which satisfies the Kraft-McMillan inequality.
If we use the minimum variance code (Table 3.9), the lengths of the codewords are {2, 2, 2, 3, 3}. Substituting these values into the left-hand side of Equation (3.2), we get

$$2^{-2} + 2^{-2} + 2^{-2} + 2^{-3} + 2^{-3} = 1$$
which again satisfies the inequality.
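The check performed in Example 3.2.2 is easy to automate; a small sketch (not from the book):

def kraft_sum(lengths):
    # Left-hand side of the Kraft-McMillan inequality (3.2).
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 4, 4]))   # 1.0, the code of Table 3.5
print(kraft_sum([2, 2, 2, 3, 3]))   # 1.0, the code of Table 3.9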
The second part of this result, due to Kraft, states that if we have a sequence of positive integers {l_i}, i = 1, ..., K, which satisfies (3.2), then there exists a uniquely decodable code whose codeword lengths are given by the sequence {l_i}.
Using this result, we will now show the following:
1. The average codeword length l of an optimal code for a source S is greater than or equal to H(S).
2. The average codeword length l of an optimal code for a source S is strictly less than H(S) + 1.
For a source S with alphabet A = {a1, a2, ..., aK} and probability model {P(a1), P(a2), ..., P(aK)}, the average codeword length is given by

$$\bar{l} = \sum_{i=1}^{K} P(a_i)\, l_i$$
Therefore, we can write the difference between the entropy of the source H(S) and the average length as

$$
\begin{aligned}
H(S) - \bar{l} &= -\sum_{i=1}^{K} P(a_i)\log_2 P(a_i) - \sum_{i=1}^{K} P(a_i)\, l_i \\
&= \sum_{i=1}^{K} P(a_i)\left[\log_2\!\left(\frac{1}{P(a_i)}\right) - l_i\right] \\
&= \sum_{i=1}^{K} P(a_i)\left[\log_2\!\left(\frac{1}{P(a_i)}\right) - \log_2 2^{l_i}\right] \\
&= \sum_{i=1}^{K} P(a_i)\log_2\!\left(\frac{2^{-l_i}}{P(a_i)}\right) \\
&\le \log_2\!\left[\sum_{i=1}^{K} 2^{-l_i}\right]
\end{aligned}
$$
The last inequality is obtained using Jensen’s inequality, which states that if f(x) is a concave (convex cap, convex ∩) function, then E[f(X)] ≤ f(E[X]). The log function is a concave function.
As the code is an optimal code, $\sum_{i=1}^{K} 2^{-l_i} \le 1$; therefore,

$$H(S) - \bar{l} \le 0 \qquad (3.3)$$
We will prove the upper bound by showing that there exists a uniquely decodable code with average codeword length less than H(S) + 1. Therefore, if we have an optimal code, this code must have an average length that is less than or equal to H(S) + 1.
Given a source, alphabet, and probability model as before, define

$$l_i = \left\lceil \log_2 \frac{1}{P(a_i)} \right\rceil$$
where ⌈x⌉ is the smallest integer greater than or equal to x. For example, ⌈3.3⌉ = 4 and ⌈5⌉ = 5. Therefore,

$$\lceil x \rceil = x + \epsilon, \quad \text{where } 0 \le \epsilon < 1$$

Therefore,

$$\log_2 \frac{1}{P(a_i)} \le l_i < \log_2 \frac{1}{P(a_i)} + 1 \qquad (3.4)$$
From the left inequality of (3.4) we can see that

$$2^{-l_i} \le P(a_i)$$

Therefore,

$$\sum_{i=1}^{K} 2^{-l_i} \le \sum_{i=1}^{K} P(a_i) = 1$$
and by the Kraft-McMillan inequality there exists a uniquely decodable code with codeword lengths {l_i}. The average length of this code can be upper-bounded by using the right inequality of (3.4):

$$\bar{l} = \sum_{i=1}^{K} P(a_i)\, l_i < \sum_{i=1}^{K} P(a_i)\left[\log_2 \frac{1}{P(a_i)} + 1\right]$$

or

$$\bar{l} < H(S) + 1$$
We can see from the way the upper bound was derived that this is a rather loose upper
bound. In fact, it can be shown that if pmax is the largest probability in the probability model, then for pmax ≥ 0.5, the upper bound for the Huffman code is H(S) + pmax, while for pmax < 0.5, the upper bound is H(S) + pmax + 0.086. Obviously, this is a much tighter bound
than the one we derived above. The derivation of this bound takes some time (see [23] for
details).
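As a numerical sanity check (a sketch, not part of the original text), both the loose bound (3.1) and the tighter pmax bound can be verified for the source and code of Example 3.2.1:

from math import log2

probs = [0.4, 0.2, 0.2, 0.1, 0.1]
lengths = [1, 2, 3, 4, 4]                             # Huffman code of Table 3.5

entropy = -sum(p * log2(p) for p in probs)            # ≈ 2.122 bits/symbol
avg_len = sum(p * l for p, l in zip(probs, lengths))  # 2.2 bits/symbol
p_max = max(probs)

print(entropy <= avg_len < entropy + 1)               # loose bound (3.1): True
print(avg_len <= entropy + p_max + 0.086)             # tighter bound for p_max < 0.5: True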
3 . 2 . 4 E x t e n d e d H u f f m a n C o d e s 
In applications where the alphabet size is large, pmax is generally quite small, and the amount
of deviation from the entropy, especially in terms of a percentage of the rate, is quite small.
However, in cases where the alphabet is small and the probability of occurrence of the
different letters is skewed, the value of pmax can be quite large and the Huffman code can
become rather inefficient when compared to the entropy.
E x a m p l e 3 . 2 . 3:
Consider a source that puts out iid letters from the alphabet A = {a1, a2, a3} with the probability model P(a1) = 0.8, P(a2) = 0.02, and P(a3) = 0.18. The entropy for this source
is 0.816 bits/symbol. A Huffman code for this source is shown in Table 3.10.
T A B L E 3 . 10 Huffman code for the alphabet A.

Letter    Codeword
a1        0
a2        11
a3        10
The average length for this code is 1.2 bits/symbol. The difference between the average
code length and the entropy, or the redundancy, for this code is 0.384 bits/symbol, which is
47% of the entropy. This means that to code this sequence we would need 47% more bits
than the minimum required. 
We can sometimes reduce the coding rate by blocking more than one symbol together.
To see how this can happen, consider a source S that emits a sequence of letters from an alphabet A = {a1, a2, ..., am}. Each element of the sequence is generated independently of the other elements in the sequence. The entropy for this source is given by

$$H(S) = -\sum_{i=1}^{m} P(a_i)\log_2 P(a_i)$$
We know that we can generate a Huffman code for this source with rate R such that

$$H(S) \le R < H(S) + 1 \qquad (3.5)$$
We have used the looser bound here; the same argument can be made with the tighter
bound. Notice that we have used “rate R” to denote the number of bits per symbol. This is
a standard convention in the data compression literature. However, in the communication
literature, the word “rate” often refers to the number of bits per second.
Suppose we now encode the sequence by generating one codeword for every n symbols. As there are m^n combinations of n symbols, we will need m^n codewords in our Huffman code. We could generate this code by viewing the m^n symbols as letters of an extended alphabet

$$A^{(n)} = \{\underbrace{a_1 a_1 \cdots a_1}_{n\ \text{times}},\; a_1 a_1 \cdots a_2,\; \ldots,\; a_1 a_1 \cdots a_m,\; a_1 a_1 \cdots a_2 a_1,\; \ldots,\; a_m a_m \cdots a_m\}$$

from a source S^{(n)}. Let us denote the rate for the new source as R^{(n)}. Then we know that

$$H(S^{(n)}) \le R^{(n)} < H(S^{(n)}) + 1 \qquad (3.6)$$
R^{(n)} is the number of bits required to code n symbols. Therefore, the number of bits required per symbol, R, is given by

$$R = \frac{1}{n} R^{(n)}$$

The number of bits per symbol can be bounded as

$$\frac{H(S^{(n)})}{n} \le R < \frac{H(S^{(n)})}{n} + \frac{1}{n}$$
In order to compare this to (3.5), and see the advantage we get from encoding symbols in blocks instead of one at a time, we need to express H(S^{(n)}) in terms of H(S). This turns out to be a relatively easy (although somewhat messy) thing to do.
$$
\begin{aligned}
H(S^{(n)}) = &-\sum_{i_1=1}^{m}\sum_{i_2=1}^{m}\cdots\sum_{i_n=1}^{m} P(a_{i_1},a_{i_2},\ldots,a_{i_n})\,\log\!\big[P(a_{i_1},a_{i_2},\ldots,a_{i_n})\big] \\
= &-\sum_{i_1=1}^{m}\sum_{i_2=1}^{m}\cdots\sum_{i_n=1}^{m} P(a_{i_1})P(a_{i_2})\cdots P(a_{i_n})\,\log\!\big[P(a_{i_1})P(a_{i_2})\cdots P(a_{i_n})\big] \\
= &-\sum_{i_1=1}^{m}\sum_{i_2=1}^{m}\cdots\sum_{i_n=1}^{m} P(a_{i_1})P(a_{i_2})\cdots P(a_{i_n})\left[\sum_{j=1}^{n}\log P(a_{i_j})\right] \\
= &-\sum_{i_1=1}^{m} P(a_{i_1})\log P(a_{i_1})\left\{\sum_{i_2=1}^{m}\cdots\sum_{i_n=1}^{m} P(a_{i_2})\cdots P(a_{i_n})\right\} \\
&-\sum_{i_2=1}^{m} P(a_{i_2})\log P(a_{i_2})\left\{\sum_{i_1=1}^{m}\sum_{i_3=1}^{m}\cdots\sum_{i_n=1}^{m} P(a_{i_1})P(a_{i_3})\cdots P(a_{i_n})\right\} \\
&\;\;\vdots \\
&-\sum_{i_n=1}^{m} P(a_{i_n})\log P(a_{i_n})\left\{\sum_{i_1=1}^{m}\sum_{i_2=1}^{m}\cdots\sum_{i_{n-1}=1}^{m} P(a_{i_1})P(a_{i_2})\cdots P(a_{i_{n-1}})\right\}
\end{aligned}
$$

The n−1 summations in braces in each term sum to one. Therefore,

$$H(S^{(n)}) = -\sum_{i_1=1}^{m} P(a_{i_1})\log P(a_{i_1}) - \sum_{i_2=1}^{m} P(a_{i_2})\log P(a_{i_2}) - \cdots - \sum_{i_n=1}^{m} P(a_{i_n})\log P(a_{i_n}) = nH(S)$$

and we can write (3.6) as

$$H(S) \le R \le H(S) + \frac{1}{n} \qquad (3.7)$$
Comparing this to (3.5), we can see that by encoding the output of the source in longer
blocks of symbols we are guaranteed a rate closer to the entropy. Note that all we are talking
about here is a bound or guarantee about the rate. As we have seen in the previous chapter,
there are a number of situations in which we can achieve a rate equal to the entropy with a
block length of one!
E x a m p l e 3 . 2 . 4:
For the source described in the previous example, instead of generating a codeword for
every symbol, we will generate a codeword for every two symbols. If we look at the source
sequence two at a time, the number of possible symbol pairs, or size of the extended alphabet,
is 3^2 = 9. The extended alphabet, probability model, and Huffman code for this example are
shown in Table 3.11.
T A B L E 3 . 11 The extended alphabet and corresponding Huffman code.

Letter    Probability    Code
a1a1      0.64           0
a1a2      0.016          10101
a1a3      0.144          11
a2a1      0.016          101000
a2a2      0.0004         10100101
a2a3      0.0036         1010011
a3a1      0.1440         100
a3a2      0.0036         10100100
a3a3      0.0324         1011
The average codeword length for this extended code is 1.7228 bits/symbol. However,
each symbol in the extended alphabet corresponds to two symbols from the original alphabet.
Therefore, in terms of the original alphabet, the average codeword length is 1.7228/2 = 0.8614 bits/symbol. This redundancy is about 0.045 bits/symbol, which is only about 5.5%
of the entropy. 
We see that by coding blocks of symbols together we can reduce the redundancy of
Huffman codes. In the previous example, two symbols were blocked together to obtain a
rate reasonably close to the entropy. Blocking two symbols together means the alphabet
size goes from m to m^2, where m was the size of the initial alphabet. In this case, m was
three, so the size of the extended alphabet was nine. This size is not an excessive burden
for most applications. However, if the probabilities of the symbols were more unbalanced,
then it would require blocking many more symbols together before the redundancy lowered
to acceptable levels. As we block more and more symbols together, the size of the alphabet
grows exponentially, and the Huffman coding scheme becomes impractical. Under these
conditions, we need to look at techniques other than Huffman coding. One approach that is
very useful in these conditions is arithmetic coding. We will discuss this technique in some
detail in the next chapter.
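The effect of blocking can be checked numerically. The sketch below (not from the book) reads the codeword lengths off Table 3.11 and reproduces the per-symbol rate quoted in Example 3.2.4.

from math import log2

# Single-letter model of Example 3.2.3 and the pair code of Table 3.11.
p = {"a1": 0.8, "a2": 0.02, "a3": 0.18}
pair_code_len = {                                 # codeword lengths from Table 3.11
    ("a1", "a1"): 1, ("a1", "a2"): 5, ("a1", "a3"): 2,
    ("a2", "a1"): 6, ("a2", "a2"): 8, ("a2", "a3"): 7,
    ("a3", "a1"): 3, ("a3", "a2"): 8, ("a3", "a3"): 4,
}

entropy = -sum(q * log2(q) for q in p.values())   # ≈ 0.816 bits/symbol
pair_rate = sum(p[a] * p[b] * l for (a, b), l in pair_code_len.items())
print(entropy, pair_rate / 2)                     # ≈ 0.816 vs ≈ 0.861 bits per original symbol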
3 . 3 Nonbinary Huffman Codes 
The binary Huffman coding procedure can be easily extended to the nonbinary case where the
code elements come from an m-ary alphabet, and m is not equal to two. Recall that we obtained
the Huffman algorithm based on the observations that in an optimum binary prefix code
1. symbols that occur more frequently (have a higher probability of occurrence) will
have shorter codewords than symbols that occur less frequently, and
2. the two symbols that occur least frequently will have the same length,
and the requirement that the two symbols with the lowest probability differ only in the last
position.
We can obtain a nonbinary Huffman code in almost exactly the same way. The obvious
thing to do would be to modify the second observation to read: “The m symbols that occur
least frequently will have the same length,” and also modify the additional requirement to
read “The m symbols with the lowest probability differ only in the last position.”
However, we run into a small problem with this approach. Consider the design of a
ternary Huffman code for a source with a six-letter alphabet. Using the rules described
above, we would first combine the three letters with the lowest probability into a composite
letter. This would give us a reduced alphabet with four letters. However, combining the three
letters with lowest probability from this alphabet would result in a further reduced alphabet
consisting of only two letters. We have three values to assign and only two letters. Instead
of combining three letters at the beginning, we could have combined two letters. This would
result in a reduced alphabet of size five. If we combined three letters from this alphabet, we
would end up with a final reduced alphabet size of three. Finally, we could combine two
letters in the second step, which would again result in a final reduced alphabet of size three.
Which alternative should we choose?
Recall that the symbols with lowest probability will have the longest codeword. Fur-
thermore, all the symbols that we combine together into a composite symbol will have
codewords of the same length. This means that all letters we combine together at the very
first stage will have codewords that have the same length, and these codewords will be the
longest of all the codewords. This being the case, if at some stage we are allowed to combine
less than m symbols, the logical place to do this would be in the very first stage.
In the general case of an m-ary code and an M-letter alphabet, how many letters should
we combine in the first phase? Let m′ be the number of letters that are combined in the first phase. Then m′ is the number between two and m which is equal to M modulo (m − 1).
E x a m p l e 3 . 3 . 1:
Generate a ternary Huffman code for a source with a six-letter alphabet and a probability
model P(a1) = P(a3) = P(a4) = 0.2, P(a5) = 0.25, P(a6) = 0.1, and P(a2) = 0.05. In this case m = 3, therefore m′ is either 2 or 3.

6 mod 2 = 0,  2 mod 2 = 0,  3 mod 2 = 1
Since 6 mod 2 = 2 mod 2, m′ = 2. Sorting the symbols in probability order results in Table 3.12.
T A B L E 3 . 12 Sorted six-letter alphabet.

Letter    Probability    Codeword
a5        0.25           c(a5)
a1        0.20           c(a1)
a3        0.20           c(a3)
a4        0.20           c(a4)
a6        0.10           c(a6)
a2        0.05           c(a2)
As m′ is 2, we can assign the codewords of the two symbols with lowest probability as

c(a6) = α1 ∗ 0
c(a2) = α1 ∗ 1

where α1 is a ternary string and ∗ denotes concatenation. The reduced alphabet is shown in Table 3.13.
T A B L E 3 . 13 Reduced five-letter alphabet.

Letter    Probability    Codeword
a5        0.25           c(a5)
a1        0.20           c(a1)
a3        0.20           c(a3)
a4        0.20           c(a4)
a′6       0.15           α1
Now we combine the three letters with the lowest probability into a composite letter a′3 and assign their codewords as

c(a3) = α2 ∗ 0
c(a4) = α2 ∗ 1
c(a′6) = α2 ∗ 2

But c(a′6) = α1. Therefore,

α1 = α2 ∗ 2
which means that

c(a6) = α2 ∗ 20
c(a2) = α2 ∗ 21

Sorting the reduced alphabet, we have Table 3.14. Thus, α2 = 0, c(a5) = 1, and c(a1) = 2. Substituting for α2, we get the codeword assignments in Table 3.15.
T A B L E 3 . 14 Reduced three-letter alphabet.

Letter    Probability    Codeword
a′3       0.45           α2
a5        0.25           c(a5)
a1        0.20           c(a1)
T A B L E 3 . 15 Ternary code for six-letter alphabet.

Letter    Probability    Codeword
a1        0.20           2
a2        0.05           021
a3        0.20           00
a4        0.20           01
a5        0.25           1
a6        0.10           020
The tree corresponding to this code is shown in Figure 3.5. Notice that at the lowest
level of the tree we have only two codewords. If we had combined three letters at the first
step, and combined two letters at a later step, the lowest level would have contained three
codewords and a longer average code length would result (see Problem 7).
F I G U R E 3 . 5 Code tree for the nonbinary Huffman code.
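The rule for the size of the first merge is easy to compute. The helper below is a sketch (not from the book) that simply searches the range [2, m] for the value congruent to M modulo (m − 1).

def first_merge_size(M, m):
    # Number of letters m' to combine in the first step of an m-ary Huffman
    # code for an M-letter alphabet: the value in [2, m] that is congruent
    # to M modulo (m - 1), restating the rule given in the text.
    for m_prime in range(2, m + 1):
        if m_prime % (m - 1) == M % (m - 1):
            return m_prime
    return m   # not reached for m >= 2

print(first_merge_size(6, 3))   # 2, as in Example 3.3.1
print(first_merge_size(6, 4))   # 3: a quaternary code would combine three letters first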
3 . 4 Adaptive Huffman Coding
Huffman coding requires knowledge of the probabilities of the source sequence. If this
knowledge is not available, Huffman coding becomes a two-pass procedure: the statistics are
collected in the first pass, and the source is encoded in the second pass. In order to convert
this algorithm into a one-pass procedure, Faller [24] and Gallagher [23] independently
developed adaptive algorithms to construct the Huffman code based on the statistics of the
symbols already encountered. These were later improved by Knuth [25] and Vitter [26].
Theoretically, if we wanted to encode the (k+1)-th symbol using the statistics of the first
k symbols, we could recompute the code using the Huffman coding procedure each time a
symbol is transmitted. However, this would not be a very practical approach due to the large
amount of computation involved—hence, the adaptive Huffman coding procedures.
The Huffman code can be described in terms of a binary tree similar to the ones shown
in Figure 3.4. The squares denote the external nodes or leaves and correspond to the symbols
in the source alphabet. The codeword for a symbol can be obtained by traversing the tree
from the root to the leaf corresponding to the symbol, where 0 corresponds to a left branch
and 1 corresponds to a right branch. In order to describe how the adaptive Huffman code
works, we add two other parameters to the binary tree: the weight of each leaf, which is
written as a number inside the node, and a node number. The weight of each external node is
simply the number of times the symbol corresponding to the leaf has been encountered. The
weight of each internal node is the sum of the weights of its offspring. The node number yi
is a unique number assigned to each internal and external node. If we have an alphabet of size n, then the 2n − 1 internal and external nodes can be numbered as y1, ..., y2n−1 such that if xj is the weight of node yj, we have x1 ≤ x2 ≤ ··· ≤ x2n−1. Furthermore, the nodes y2j−1 and y2j are offspring of the same parent node, or siblings, for 1 ≤ j < n, and the node number for the parent node is greater than y2j−1 and y2j. These last two characteristics are
called the sibling property, and any tree that possesses this property is a Huffman tree [23].
In the adaptive Huffman coding procedure, neither transmitter nor receiver knows any-
thing about the statistics of the source sequence at the start of transmission. The tree at both
the transmitter and the receiver consists of a single node that corresponds to all symbols not
yet transmitted (NYT) and has a weight of zero. As transmission progresses, nodes corre-
sponding to symbols transmitted will be added to the tree, and the tree is reconfigured using
an update procedure. Before the beginning of transmission, a fixed code for each symbol is
agreed upon between transmitter and receiver. A simple (short) code is as follows:
If the source has an alphabet {a1, a2, ..., am} of size m, then pick e and r such that m = 2^e + r and 0 ≤ r < 2^e. The letter ak is encoded as the (e + 1)-bit binary representation of k − 1, if 1 ≤ k ≤ 2r; else, ak is encoded as the e-bit binary representation of k − r − 1. For example, suppose m = 26, then e = 4, and r = 10. The symbol a1 is encoded as 00000, the symbol a2 is encoded as 00001, and the symbol a22 is encoded as 1011.
When a symbol is encountered for the first time, the code for the NYT node is transmitted,
followed by the fixed code for the symbol. A node for the symbol is then created, and the
symbol is taken out of the NYT list.
Both transmitter and receiver start with the same tree structure. The updating procedure
used by both transmitter and receiver is identical. Therefore, the encoding and decoding
processes remain synchronized.
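The previously agreed-upon fixed code takes only a few lines. The sketch below (not from the book; the function name and interface are illustrative) follows the description above.

def fixed_code(k, m):
    # Fixed code used for a symbol's first appearance: with m = 2**e + r and
    # 0 <= r < 2**e, symbol a_k is sent as the (e+1)-bit value k-1 if k <= 2r,
    # and as the e-bit value k-r-1 otherwise.
    e = m.bit_length() - 1          # largest e with 2**e <= m
    r = m - 2 ** e
    if k <= 2 * r:
        return format(k - 1, "0{}b".format(e + 1))
    return format(k - r - 1, "0{}b".format(e))

# m = 26 gives e = 4 and r = 10:  a1 -> 00000, a2 -> 00001, a22 -> 1011.
print(fixed_code(1, 26), fixed_code(2, 26), fixed_code(22, 26))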
3 . 4 . 1 U p d a t e P r o c e d u r e
The update procedure requires that the nodes be in a fixed order. This ordering is preserved
by numbering the nodes. The largest node number is given to the root of the tree, and the
smallest number is assigned to the NYT node. The numbers from the NYT node to the root
of the tree are assigned in increasing order from left to right, and from lower level to upper
level. The set of nodes with the same weight makes up a block. Figure 3.6 is a flowchart of
the updating procedure.
F I G U R E 3 . 6 Update procedure for the adaptive Huffman coding algorithm.
The function of the update procedure is to preserve the sibling property. In order that the
update procedures at the transmitter and receiver both operate with the same information, the
tree at the transmitter is updated after each symbol is encoded, and the tree at the receiver
is updated after each symbol is decoded. The procedure operates as follows:
After a symbol has been encoded or decoded, the external node corresponding to the
symbol is examined to see if it has the largest node number in its block. If the external
node does not have the largest node number, it is exchanged with the node that has the
largest node number in the block, as long as the node with the higher number is not the
parent of the node being updated. The weight of the external node is then incremented. If
we did not exchange the nodes before the weight of the node is incremented, it is very
likely that the ordering required by the sibling property would be destroyed. Once we have
incremented the weight of the node, we have adapted the Huffman tree at that level. We
then turn our attention to the next level by examining the parent node of the node whose
weight was incremented to see if it has the largest number in its block. If it does not, it is
exchanged with the node with the largest number in the block. Again, an exception to this is
when the node with the higher node number is the parent of the node under consideration.
Once an exchange has taken place (or it has been determined that there is no need for
an exchange), the weight of the parent node is incremented. We then proceed to a new
parent node and the process is repeated. This process continues until the root of the tree is
reached.
If the symbol to be encoded or decoded has occurred for the first time, a new external
node is assigned to the symbol and a new NYT node is appended to the tree. Both the new
external node and the new NYT node are offspring of the old NYT node. We increment
the weight of the new external node by one. As the old NYT node is the parent of the new
external node, we increment its weight by one and then go on to update all the other nodes
until we reach the root of the tree.
E x a m p l e 3 . 4 . 1: U p d a t e p r o c e d u r e
Assume we are encoding the message [a a r d v a r k], where our alphabet consists of the
26 lowercase letters of the English alphabet.
The updating process is shown in Figure 3.7. We begin with only the NYT node. The
total number of nodes in this tree will be 2×26−1 = 51, so we start numbering backwards
from 51 with the number of the root node being 51. The first letter to be transmitted is a.
As a does not yet exist in the tree, we send a binary code 00000 for a and then add a to
the tree. The NYT node gives birth to a new NYT node and a terminal node corresponding
to a. The weight of the terminal node will be higher than the NYT node, so we assign
the number 49 to the NYT node and 50 to the terminal node corresponding to the letter
a. The second letter to be transmitted is also a. This time the transmitted code is 1. The
node corresponding to a has the highest number (if we do not consider its parent), so we
do not need to swap nodes. The next letter to be transmitted is r. This letter does not have
a corresponding node on the tree, so we send the codeword for the NYT node, which is 0
followed by the index of r, which is 10001. The NYT node gives birth to a new NYT node
and an external node corresponding to r. Again, no update is required. The next letter to
be transmitted is d, which is also being sent for the first time. We again send the code for
F I G U R E 3. 7 Adaptive Huffman tree after [ a a r d v ] is processed.
the NYT node, which is now 00 followed by the index for d, which is 00011. The NYT
node again gives birth to two new nodes. However, an update is still not required. This
changes with the transmission of the next letter, v, which has also not yet been encountered.
Nodes 43 and 44 are added to the tree, with 44 as the terminal node corresponding to v. We
examine the grandparent node of v (node 47) to see if it has the largest number in its block.
As it does not, we swap it with node 48, which has the largest number in its block. We then
increment node 48 and move to its parent, which is node 49. In the block containing node
49, the largest number belongs to node 50. Therefore, we swap nodes 49 and 50 and then
increment node 50. We then move to the parent node of node 50, which is node 51. As this
is the root node, all we do is increment node 51.
3 . 4 . 2 E n c o d i n g P r o c e d u r e
The flowchart for the encoding procedure is shown in Figure 3.8. Initially, the tree at both
the encoder and decoder consists of a single node, the NYT node. Therefore, the codeword
for the very first symbol that appears is a previously agreed-upon fixed code. After the very
first symbol, whenever we have to encode a symbol that is being encountered for the first
time, we send the code for the NYT node, followed by the previously agreed-upon fixed
code for the symbol. The code for the NYT node is obtained by traversing the Huffman tree
from the root to the NYT node. This alerts the receiver to the fact that the symbol whose
code follows does not as yet have a node in the Huffman tree. If a symbol to be encoded
has a corresponding node in the tree, then the code for the symbol is generated by traversing
the tree from the root to the external node corresponding to the symbol.
F I G U R E 3. 8 Flowchart of the encoding procedure.
To see how the coding operation functions, we use the same example that was used to
demonstrate the update procedure.
E x a m p l e 3 . 4 . 2: E n c o d i n g p r o c e d u r e
In Example 3.4.1 we used an alphabet consisting of 26 letters. In order to obtain our prearranged code, we have to find e and r such that 2^e + r = 26, where 0 ≤ r < 2^e. It is easy to see that the values of e = 4 and r = 10 satisfy this requirement.
The first symbol encoded is the letter a. As a is the first letter of the alphabet, k = 1.
As 1 is less than 20, a is encoded as the 5-bit binary representation of k − 1, or 0, which
is 00000. The Huffman tree is then updated as shown in the figure. The NYT node gives
birth to an external node corresponding to the element a and a new NYT node. As a has
occurred once, the external node corresponding to a has a weight of one. The weight of
the NYT node is zero. The internal node also has a weight of one, as its weight is the sum
of the weights of its offspring. The next symbol is again a. As we have an external node
corresponding to symbol a, we simply traverse the tree from the root node to the external
node corresponding to a in order to find the codeword. This traversal consists of a single
right branch. Therefore, the Huffman code for the symbol a is 1.
After the code for a has been transmitted, the weight of the external node corresponding
to a is incremented, as is the weight of its parent. The third symbol to be transmitted is r.
As this is the first appearance of this symbol, we send the code for the NYT node followed
by the previously arranged binary representation for r. If we traverse the tree from the root
to the NYT node, we get a code of 0 for the NYT node. The letter r is the 18th letter of
the alphabet; therefore, the binary representation of r is 10001. The code for the symbol r
becomes 010001. The tree is again updated as shown in the figure, and the coding process
continues with symbol d. Using the same procedure for d, the code for the NYT node,
which is now 00, is sent, followed by the index for d, resulting in the codeword 0000011.
The next symbol v is the 22nd symbol in the alphabet. As this is greater than 20, we send
the code for the NYT node followed by the 4-bit binary representation of 22−10−1 = 11.
The code for the NYT node at this stage is 000, and the 4-bit binary representation of 11
is 1011; therefore, v is encoded as 0001011. The next symbol is a, for which the code is 0,
and the encoding proceeds. 
3 . 4 . 3 D e c o d i n g P r o c e d u r e
The flowchart for the decoding procedure is shown in Figure 3.9. As we read in the received
binary string, we traverse the tree in a manner identical to that used in the encoding procedure.
Once a leaf is encountered, the symbol corresponding to that leaf is decoded. If the leaf
is the NYT node, then we check the next e bits to see if the resulting number is less than
r. If it is less than r, we read in another bit to complete the code for the symbol. The
index for the symbol is obtained by adding one to the decimal number corresponding to
the e- or e + 1-bit binary string. Once the symbol has been decoded, the tree is updated
and the next received bit is used to start another traversal down the tree. To see how this
procedure works, let us decode the binary string generated in the previous example.
F I G U R E 3. 9 Flowchart of the decoding procedure.
E x a m p l e 3 . 4 . 3: D e c o d i n g p r o c e d u r e
The binary string generated by the encoding procedure is
000001010001000001100010110
Initially, the decoder tree consists only of the NYT node. Therefore, the first symbol to be
decoded must be obtained from the NYT list. We read in the first 4 bits, 0000, as the value
of e is four. The 4 bits 0000 correspond to the decimal value of 0. As this is less than the
value of r, which is 10, we read in one more bit for the entire code of 00000. Adding one
to the decimal value corresponding to this binary string, we get the index of the received
symbol as 1. This is the index for a; therefore, the first letter is decoded as a. The tree is
now updated as shown in Figure 3.7. The next bit in the string is 1. This traces a path from
the root node to the external node corresponding to a. We decode the symbol a and update
the tree. In this case, the update consists only of incrementing the weight of the external
node corresponding to a. The next bit is a 0, which traces a path from the root to the NYT
node. The next 4 bits, 1000, correspond to the decimal number 8, which is less than 10, so
we read in one more bit to get the 5-bit word 10001. The decimal equivalent of this 5-bit
word plus one is 18, which is the index for r. We decode the symbol r and then update the
tree. The next 2 bits, 00, again trace a path to the NYT node. We read the next 4 bits, 0001.
Since this corresponds to the decimal number 1, which is less than 10, we read another bit
to get the 5-bit word 00011. To get the index of the received symbol in the NYT list, we add
one to the decimal value of this 5-bit word. The value of the index is 4, which corresponds
to the symbol d. Continuing in this fashion, we decode the sequence aardva. 
Although the Huffman coding algorithm is one of the best-known variable-length coding
algorithms, there are some other lesser-known algorithms that can be very useful in cer-
tain situations. In particular, the Golomb-Rice codes and the Tunstall codes are becoming
increasingly popular. We describe these codes in the following sections.
3 . 5 Golomb Codes
The Golomb-Rice codes belong to a family of codes designed to encode integers with the
assumption that the larger an integer, the lower its probability of occurrence. The simplest
code for this situation is the unary code. The unary code for a positive integer n is simply
n 1s followed by a 0. Thus, the code for 4 is 11110, and the code for 7 is 11111110. The
unary code is the same as the Huffman code for the semi-infinite alphabet {1, 2, 3, ...} with probability model

$$P(k) = \frac{1}{2^k}$$
Because the Huffman code is optimal, the unary code is also optimal for this probability model.
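In code, the unary encoder is a one-liner (a sketch, not from the book):

def unary(n):
    # Unary code: n 1s followed by a 0, so 4 -> 11110 and 7 -> 11111110.
    return "1" * n + "0"

print(unary(4), unary(7))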
Although the unary code is optimal in very restricted conditions, we can see that it is
certainly very simple to implement. One step higher in complexity are a number of coding
schemes that split the integer into two parts, representing one part with a unary code and
the other part with a different code. An example of such a code is the Golomb code. Other
examples can be found in [27].
The Golomb code is described in a succinct paper [28] by Solomon Golomb, which
begins “Secret Agent 00111 is back at the Casino again, playing a game of chance, while
the fate of mankind hangs in the balance.” Agent 00111 requires a code to represent runs of
success in a roulette game, and Golomb provides it! The Golomb code is actually a family
of codes parameterized by an integer m > 0. In the Golomb code with parameter m, we represent an integer n > 0 using two numbers q and r, where

$$q = \left\lfloor \frac{n}{m} \right\rfloor \quad \text{and} \quad r = n - qm$$

⌊x⌋ is the integer part of x. In other words, q is the quotient and r is the remainder when n is divided by m. The quotient q can take on values 0, 1, 2, ... and is represented by the unary code of q. The remainder r can take on the values 0, 1, 2, ..., m − 1. If m is a power of two, we use the log2 m-bit binary representation of r. If m is not a power of two, we could still use ⌈log2 m⌉ bits, where ⌈x⌉ is the smallest integer greater than or equal to x. We can reduce the number of bits required if we use the ⌊log2 m⌋-bit binary representation of r for the first 2^⌈log2 m⌉ − m values, and the ⌈log2 m⌉-bit binary representation of r + 2^⌈log2 m⌉ − m for the rest of the values.
E x a m p l e 3 . 5 . 1: G o l o m b c o d e
Let’s design a Golomb code for m = 5. As ⌈log2 5⌉ = 3 and ⌊log2 5⌋ = 2, the first 8 − 5 = 3 values of r (that is, r = 0, 1, 2) will be represented by the 2-bit binary representation of r, and the next two values (that is, r = 3, 4) will be represented by the 3-bit representation of r + 3. The quotient q is always represented by the unary code for q.
Thus, the codeword for 3 is 0110, and the codeword for 21 is 1111001. The codewords for
n = 0, …, 15 are shown in Table 3.16.
T A B L E 3 . 16 Golomb code for m = 5.
n q r Codeword n q r Codeword
0 0 0 000 8 1 3 10110
1 0 1 001 9 1 4 10111
2 0 2 010 10 2 0 11000
3 0 3 0110 11 2 1 11001
4 0 4 0111 12 2 2 11010
5 1 0 1000 13 2 3 110110
6 1 1 1001 14 2 4 110111
7 1 2 1010 15 3 0 111000
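A compact encoder following the description above (a sketch, not from the book; the handling of the truncated binary part mirrors the m = 5 construction of Example 3.5.1):

def golomb(n, m):
    # Golomb code with parameter m for an integer n >= 0: unary code of the
    # quotient q = n // m followed by a truncated binary code of the
    # remainder r = n - q*m.
    q, r = divmod(n, m)
    unary_q = "1" * q + "0"
    c = m.bit_length() if m & (m - 1) else m.bit_length() - 1   # ceil(log2 m)
    cutoff = 2 ** c - m                  # this many remainders get c-1 bits
    if c == 0:                           # m == 1: the code is purely unary
        tail = ""
    elif r < cutoff:
        tail = format(r, "0{}b".format(c - 1))
    else:
        tail = format(r + cutoff, "0{}b".format(c))
    return unary_q + tail

print(golomb(3, 5), golomb(21, 5))       # 0110 and 1111001, as in Example 3.5.1
print([golomb(n, 5) for n in range(6)])  # matches the first rows of Table 3.16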
It can be shown that the Golomb code is optimal for the probability model

$$P(n) = p^{n-1} q, \qquad q = 1 - p$$

when

$$m = \left\lceil -\frac{1}{\log_2 p} \right\rceil$$
3 . 6 Rice Codes
The Rice code was originally developed by Robert F. Rice (he called it the Rice machine)
[29, 30] and later extended by Pen-Shu Yeh and Warner Miller [31]. The Rice code can be
viewed as an adaptive Golomb code. In the Rice code, a sequence of nonnegative integers
(which might have been obtained from the preprocessing of other data) is divided into blocks
of J integers apiece. Each block is then coded using one of several options, most of which
are a form of Golomb codes. Each block is encoded with each of these options, and the
option resulting in the least number of coded bits is selected. The particular option used is
indicated by an identifier attached to the code for each block.
The easiest way to understand the Rice code is to examine one of its implementations.
We will study the implementation of the Rice code in the recommendation for lossless
compression from the Consultative Committee on Space Data Standards (CCSDS).
3 . 6 . 1 C C S D S R e c o m m e n d a t i o n f o r L o s s l e s s
C o m p r e s s i o n
As an application of the Rice algorithm, let’s briefly look at the algorithm for lossless data
compression recommended by CCSDS. The algorithm consists of a preprocessor (the mod-
eling step) and a binary coder (coding step). The preprocessor removes correlation from the
input and generates a sequence of nonnegative integers. This sequence has the property that
smaller values are more probable than larger values. The binary coder generates a bitstream
to represent the integer sequence. The binary coder is our main focus at this point.
The preprocessor functions as follows: Given a sequence yi, for each yi we generate a
prediction ŷi. A simple way to generate a prediction would be to take the previous value of
the sequence to be a prediction of the current value of the sequence:
ŷi = yi−1
We will look at more sophisticated ways of generating a prediction in Chapter 7. We then
generate a sequence whose elements are the difference between yi and its predicted value ŷi:
di = yi −ŷi
The di value will have a small magnitude when our prediction is good and a large value
when it is not. Assuming an accurate modeling of the data, the former situation is more
likely than the latter. Let ymax and ymin be the largest and smallest values that the sequence
yi takes on. It is reasonable to assume that the value of ŷi will be confined to the range
[ymin, ymax]. Define

    Ti = min{ymax − ŷi, ŷi − ymin}    (3.8)
The sequence di can be converted into a sequence of nonnegative integers xi using
the following mapping:

         ⎧ 2di           0 ≤ di ≤ Ti
    xi = ⎨ 2|di| − 1     −Ti ≤ di < 0        (3.9)
         ⎩ Ti + |di|     otherwise.
The value of xi will be small whenever the magnitude of di is small. Therefore, the value
of xi will be small with higher probability. The sequence xi is divided into segments with
each segment being further divided into blocks of size J. It is recommended by CCSDS that
J have a value of 16. Each block is then coded using one of the following options. The
coded block is transmitted along with an identifier that indicates which particular option was
used.
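The preprocessing just described (prediction by the previous value, followed by the mapping
of Equation (3.9)) can be sketched in a few lines of Python. The function name and the
zero prediction for the first sample are illustrative choices, not part of the CCSDS text.

    # A minimal sketch of the modeling step: predict each sample by the previous one
    # and map the residual d_i to a nonnegative integer x_i using Equation (3.9).
    def ccsds_preprocess(y, y_min, y_max):
        x = []
        prev = 0                              # assumed prediction for the first sample
        for sample in y:
            d = sample - prev                 # residual d_i = y_i - yhat_i
            T = min(y_max - prev, prev - y_min)
            if 0 <= d <= T:
                x.append(2 * d)
            elif -T <= d < 0:
                x.append(2 * abs(d) - 1)
            else:
                x.append(T + abs(d))
            prev = sample                     # yhat_{i+1} = y_i
        return x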
 Fundamental sequence: This is a unary code. A number n is represented by a
sequence of n 0s followed by a 1 (or a sequence of n 1s followed by a 0).
 Split sample options: These options consist of a set of codes indexed by a parameter
m. The code for a k-bit number n using the mth split sample option consists of the
m least significant bits of n followed by a unary code representing the k − m most
significant bits. For example, suppose we wanted to encode the 8-bit number 23 using
the third split sample option. The 8-bit representation of 23 is 00010111. The three
least significant bits are 111. The remaining bits (00010) correspond to the number 2,
which has a unary code 001. Therefore, the code for 23 using the third split sample
option is 111001 (a short coding sketch follows this list). Notice that different values
of m will be preferable for different values of xi, with higher values of m used for
higher-entropy sequences.
 Second extension option: The second extension option is useful for sequences with
low entropy—when, in general, many of the values of xi will be zero. In the second
extension option the sequence is divided into consecutive pairs of samples. Each pair
is used to obtain an index γ using the following transformation:

    γ = (1/2)(xi + xi+1)(xi + xi+1 + 1) + xi+1    (3.10)

and the value of γ is encoded using a unary code. The value of γ is an index to a
lookup table with each value of γ corresponding to a pair of values (xi, xi+1).
 Zero block option: The zero block option is used when one or more of the blocks of
xi are zero—generally when we have long sequences of yi that have the same value. In
this case the number of zero blocks is transmitted using the code shown in Table 3.17.
The ROS code is used when the last five or more blocks in a segment are all zero.
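As promised above, here is a minimal sketch of the split sample option (Python; the function
name is illustrative). It follows the convention, stated in the fundamental sequence option,
that the unary code for n is n 0s followed by a 1.

    def split_sample_encode(n, m, k=8):
        """Encode a k-bit number n with the m-th split sample option."""
        low = format(n, "0{}b".format(k))[-m:]    # m least significant bits of n
        high = n >> m                             # value of the k - m most significant bits
        return low + "0" * high + "1"             # LSBs followed by the unary code of that value

    # split_sample_encode(23, 3) -> '111001', matching the example in the text.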
The Rice code has been used in several space applications, and variations of the Rice
code have been proposed for a number of different applications.
T A B L E 3 . 17 Code used for zero block option.
Number of All-Zero Blocks Codeword
1 1
2 01
3 001
4 0001
5 000001
6 0000001
63 000···01 (63 0s followed by a 1)
ROS 00001
3 . 7 Tunstall Codes
Most of the variable-length codes that we look at in this book encode letters from the source
alphabet using codewords with varying numbers of bits: codewords with fewer bits for
letters that occur more frequently and codewords with more bits for letters that occur less
frequently. The Tunstall code is an important exception. In the Tunstall code, all codewords
are of equal length. However, each codeword represents a different number of letters. An
example of a 2-bit Tunstall code for an alphabet 𝒜 = {A, B} is shown in Table 3.18. The
main advantage of a Tunstall code is that errors in codewords do not propagate, unlike other
variable-length codes, such as Huffman codes, in which an error in one codeword will cause
a series of errors to occur.
E x a m p l e 3 . 7 . 1:
Let’s encode the sequence AAABAABAABAABAAA using the code in Table 3.18. Starting
at the left, we can see that the string AAA occurs in our codebook and has a code of
00. We then code B as 11, AAB as 01, and so on. We finally end up with coded string
001101010100. 
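One way to implement the parsing used in the example is a greedy longest-match against the
codebook; for a codebook built by the Tunstall procedure the parse is unambiguous, so the
greedy choice reproduces it. A minimal sketch (Python, illustrative names; the codebook is
Table 3.18):

    CODEBOOK = {"AAA": "00", "AAB": "01", "AB": "10", "B": "11"}

    def tunstall_encode(seq, codebook):
        out, i = [], 0
        max_len = max(len(s) for s in codebook)
        while i < len(seq):
            for l in range(max_len, 0, -1):            # try the longest entry first
                if seq[i:i + l] in codebook:
                    out.append(codebook[seq[i:i + l]])
                    i += l
                    break
            else:
                raise ValueError("sequence cannot be parsed with this codebook")
        return "".join(out)

    # tunstall_encode("AAABAABAABAABAAA", CODEBOOK) -> '001101010100'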
T A B L E 3 . 18 A 2-bit Tunstall code.
Sequence Codeword
AAA 00
AAB 01
AB 10
B 11
T A B L E 3 . 19 A 2-bit (non-Tunstall) code.
Sequence Codeword
AAA 00
ABA 01
AB 10
B 11
The design of a code that has a fixed codeword length but a variable number of symbols
per codeword should satisfy the following conditions:
1. We should be able to parse a source output sequence into sequences of symbols that
appear in the codebook.
2. We should maximize the average number of source symbols represented by each
codeword.
In order to understand what we mean by the first condition, consider the code shown in
Table 3.19. Let’s encode the same sequence AAABAABAABAABAAA as in the previous
example using the code in Table 3.19. We first encode AAA with the code 00. We then
encode B with 11. The next three symbols are AAB. However, there are no codewords
corresponding to this sequence of symbols. Thus, this sequence is unencodable using this
particular code—not a desirable situation.
Tunstall [32] gives a simple algorithm that fulfills these conditions. The algorithm is as
follows:
Suppose we want an n-bit Tunstall code for a source that generates iid letters from an
alphabet of size N. The number of codewords is 2^n. We start with the N letters of the
source alphabet in our codebook. Remove the entry in the codebook that has the highest
probability and add the N strings obtained by concatenating this letter with every letter
in the alphabet (including itself). This will increase the size of the codebook from N to
N + (N − 1). The probabilities of the new entries will be the product of the probabilities of
the letters concatenated to form the new entry. Now look through the N + (N − 1) entries
in the codebook and find the entry that has the highest probability, keeping in mind that the
entry with the highest probability may be a concatenation of symbols. Each time we perform
this operation we increase the size of the codebook by N − 1. Therefore, this operation can
be performed K times, where

    N + K(N − 1) ≤ 2^n
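A direct transcription of this procedure into code might look as follows (a minimal sketch in
Python with illustrative names; it is not the book's software).

    def tunstall_codebook(probs, n):
        """probs: dict mapping each letter to its probability; n: codeword length in bits."""
        book = dict(probs)                         # start with the N single letters
        while len(book) + len(probs) - 1 <= 2 ** n:
            best = max(book, key=book.get)         # entry with the highest probability
            p = book.pop(best)
            for a, pa in probs.items():            # extend it by every letter of the alphabet
                book[best + a] = p * pa
        # assign fixed-length n-bit codewords to the final entries
        return {seq: format(i, "0{}b".format(n)) for i, seq in enumerate(sorted(book))}

    # For Example 3.7.2, tunstall_codebook({"A": 0.6, "B": 0.3, "C": 0.1}, 3) produces the
    # seven entries of Table 3.22 (B, C, AB, AC, AAA, AAB, AAC), though the particular
    # assignment of 3-bit codewords to entries is arbitrary.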
E x a m p l e 3 . 7 . 2: T u n s t a l l c o d e s
Let us design a 3-bit Tunstall code for a memoryless source with the following alphabet:

    𝒜 = {A, B, C}
    P(A) = 0.6,  P(B) = 0.3,  P(C) = 0.1
T A B L E 3 . 20 Source alphabet and
associated probabilities.
Letter Probability
A 0.60
B 0.30
C 0.10
T A B L E 3 . 21 The codebook after
one iteration.
Sequence Probability
B 0.30
C 0.10
AA 0.36
AB 0.18
AC 0.06
T A B L E 3 . 22 A 3-bit Tunstall code.
Sequence Codeword
B 000
C 001
AB 010
AC 011
AAA 100
AAB 101
AAC 110
We start out with the codebook and associated probabilities shown in Table 3.20. Since
the letter A has the highest probability, we remove it from the list and add all two-letter
strings beginning with A as shown in Table 3.21. After one iteration we have 5 entries in
our codebook. Going through one more iteration will increase the size of the codebook by 2,
and we will have 7 entries, which is still less than the final codebook size. Going through
another iteration after that would bring the codebook size to 9, which is greater than the
maximum size of 8. Therefore, we will go through just one more iteration. Looking through
the entries in Table 3.21, the entry with the highest probability is AA. Therefore, at the next
step we remove AA and add all extensions of AA as shown in Table 3.22. The final 3-bit
Tunstall code is shown in Table 3.22.
3 . 8 Applications of Huffman Coding
In this section we describe some applications of Huffman coding. As we progress through the
book, we will describe more applications, since Huffman coding is often used in conjunction
with other coding techniques.
3 . 8 . 1 L o s s l e s s I m a g e C o m p r e s s i o n
A simple application of Huffman coding to image compression would be to generate a
Huffman code for the set of values that any pixel may take. For monochrome images, this
set usually consists of integers from 0 to 255. Examples of such images are contained in the
accompanying data sets. The four that we will use in the examples in this book are shown
in Figure 3.10.
F I G U R E 3. 10 Test images.
T A B L E 3 . 23 Compression using Huffman codes on pixel values.
Image Name Bits/Pixel Total Size (bytes) Compression Ratio
Sena 7.01 57,504 1.14
Sensin 7.49 61,430 1.07
Earth 4.94 40,534 1.62
Omaha 7.12 58,374 1.12
We will make use of one of the programs from the accompanying software (see Preface)
to generate a Huffman code for each image, and then encode the image using the Huffman
code. The results for the four images in Figure 3.10 are shown in Table 3.23. The Huffman
code is stored along with the compressed image as the code will be required by the decoder
to reconstruct the image.
The original (uncompressed) image representation uses 8 bits/pixel. The image consists
of 256 rows of 256 pixels, so the uncompressed representation uses 65,536 bytes. The
compression ratio is simply the ratio of the number of bytes in the uncompressed represen-
tation to the number of bytes in the compressed representation. The number of bytes in the
compressed representation includes the number of bytes needed to store the Huffman code.
Notice that the compression ratio is different for different images. This can cause some
problems in certain applications where it is necessary to know in advance how many bytes
will be needed to represent a particular data set.
The results in Table 3.23 are somewhat disappointing because we get a reduction of only
about 1/2 to 1 bit/pixel after compression. For some applications this reduction is acceptable.
For example, if we were storing thousands of images in an archive, a reduction of 1 bit/pixel
saves many megabytes in disk space. However, we can do better. Recall that when we first
talked about compression, we said that the first step for any compression algorithm was to
model the data so as to make use of the structure in the data. In this case, we have made
absolutely no use of the structure in the data.
From a visual inspection of the test images, we can clearly see that the pixels in an
image are heavily correlated with their neighbors. We could represent this structure with the
crude model x̂n = xn−1. The residual would be the difference between neighboring pixels.
If we carry out this differencing operation and use the Huffman coder on the residuals, the
results are as shown in Table 3.24. As we can see, using the structure in the data resulted in
substantial improvement.
T A B L E 3 . 24 Compression using Huffman codes on pixel difference values.
Image Name Bits/Pixel Total Size (bytes) Compression Ratio
Sena 4.02 32,968 1.99
Sensin 4.70 38,541 1.70
Earth 4.13 33,880 1.93
Omaha 6.42 52,643 1.24
T A B L E 3 . 25 Compression using adaptive Huffman codes on pixel difference
values.
Image Name Bits/Pixel Total Size (bytes) Compression Ratio
Sena 3.93 32,261 2.03
Sensin 4.63 37,896 1.73
Earth 4.82 39,504 1.66
Omaha 6.39 52,321 1.25
The results in Tables 3.23 and 3.24 were obtained using a two-pass system, in which
the statistics were collected in the first pass and a Huffman table was generated. Instead
of using a two-pass system, we could have used a one-pass adaptive Huffman coder. The
results for this are given in Table 3.25.
Notice that there is little difference between the performance of the adaptive Huffman
code and the two-pass Huffman coder. In addition, the fact that the adaptive Huffman
coder can be used as an on-line or real-time coder makes the adaptive Huffman coder a
more attractive option in many applications. However, the adaptive Huffman coder is more
vulnerable to errors and may also be more difficult to implement. In the end, the particular
application will determine which approach is more suitable.
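As a rough illustration of the differencing model x̂n = xn−1 discussed above, the sketch below
(Python; not the book's software) computes the residuals for an image given as a flat list of
pixel values and uses the first-order entropy of the residuals as an estimate of the bits/pixel a
Huffman coder on the differences could achieve (compare the entries of Table 3.24).

    from collections import Counter
    from math import log2

    def pixel_differences(pixels):
        """Residuals for the model xhat_n = x_{n-1}; the first pixel is kept as is."""
        return [p - q for p, q in zip(pixels, [0] + list(pixels[:-1]))]

    def entropy_bits_per_symbol(symbols):
        counts = Counter(symbols)
        total = len(symbols)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    # entropy_bits_per_symbol(pixel_differences(pixels)) estimates the bits/pixel of the
    # two-pass scheme on the difference image (the Huffman code itself adds a small overhead).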
3 . 8 . 2 T e x t C o m p r e s s i o n
Text compression seems natural for Huffman coding. In text, we have a discrete alphabet
that, in a given class, has relatively stationary probabilities. For example, the probability
model for a particular novel will not differ significantly from the probability model for
another novel. Similarly, the probability model for a set of FORTRAN programs is not going
to be much different than the probability model for a different set of FORTRAN programs.
The probabilities in Table 3.26 are the probabilities of the 26 letters (upper- and lowercase)
obtained for the U.S. Constitution and are representative of English text. The probabilities
in Table 3.27 were obtained by counting the frequency of occurrences of letters in an earlier
version of this chapter. While the two documents are substantially different, the two sets of
probabilities are very much alike.
We encoded the earlier version of this chapter using Huffman codes that were created
using the probabilities of occurrence obtained from the chapter. The file size dropped from
about 70,000 bytes to about 43,000 bytes with Huffman coding.
While this reduction in file size is useful, we could have obtained better compression if
we first removed the structure existing in the form of correlation between the symbols in
the file. Obviously, there is a substantial amount of correlation in this text. For example,
Huf is always followed by fman! Unfortunately, this correlation is not amenable to simple
numerical models, as was the case for the image files. However, there are other somewhat
more complex techniques that can be used to remove the correlation in text files. We will
look more closely at these in Chapters 5 and 6.
T A B L E 3 . 26 Probabilities of occurrence of the
letters in the English alphabet in the
U.S. Constitution.
Letter Probability Letter Probability
A 0.057305 N 0.056035
B 0.014876 O 0.058215
C 0.025775 P 0.021034
D 0.026811 Q 0.000973
E 0.112578 R 0.048819
F 0.022875 S 0.060289
G 0.009523 T 0.078085
H 0.042915 U 0.018474
I 0.053475 V 0.009882
J 0.002031 W 0.007576
K 0.001016 X 0.002264
L 0.031403 Y 0.011702
M 0.015892 Z 0.001502
T A B L E 3 . 27 Probabilities of occurrence of the letters
in the English alphabet in this chapter.
Letter Probability Letter Probability
A 0.049855 N 0.048039
B 0.016100 O 0.050642
C 0.025835 P 0.015007
D 0.030232 Q 0.001509
E 0.097434 R 0.040492
F 0.019754 S 0.042657
G 0.012053 T 0.061142
H 0.035723 U 0.015794
I 0.048783 V 0.004988
J 0.000394 W 0.012207
K 0.002450 X 0.003413
L 0.025835 Y 0.008466
M 0.016494 Z 0.001050
3 . 8 . 3 A u d i o C o m p r e s s i o n
Another class of data that is very suitable for compression is CD-quality audio data. The
audio signal for each stereo channel is sampled at 44.1 kHz, and each sample is represented
by 16 bits. This means that the amount of data stored on one CD is enormous. If we
want to transmit this data, the amount of channel capacity required would be significant.
Compression is definitely useful in this case. In Table 3.28 we show for a variety of audio
material the file size, the entropy, the estimated compressed file size if a Huffman coder is
used, and the resulting compression ratio.
T A B L E 3 . 28 Huffman coding of 16-bit CD-quality audio.
Original Entropy Estimated Compressed Compression
File Name File Size (bytes) (bits) File Size (bytes) Ratio
Mozart 939,862 12.8 725,420 1.30
Cohn 402,442 13.8 349,300 1.15
Mir 884,020 13.7 759,540 1.16
The three segments used in this example represent a wide variety of audio material, from
a symphonic piece by Mozart to a folk rock piece by Cohn. Even though the material is
varied, Huffman coding can lead to some reduction in the capacity required to transmit this
material.
Note that we have only provided the estimated compressed file sizes. The estimated
file size in bits was obtained by multiplying the entropy by the number of samples in the
file. We used this approach because the samples of 16-bit audio can take on 65,536 distinct
values, and therefore the Huffman coder would require 65,536 distinct (variable-length)
codewords. In most applications, a codebook of this size would not be practical. There
is a way of handling large alphabets, called recursive indexing, that we will describe in
Chapter 9. There is also some recent work [14] on using a Huffman tree in which leaves
represent sets of symbols with the same probability. The codeword consists of a prefix that
specifies the set followed by a suffix that specifies the symbol within the set. This approach
can accommodate relatively large alphabets.
As with the other applications, we can obtain an increase in compression if we first
remove the structure from the data. Audio data can be modeled numerically. In later chapters
we will examine more sophisticated modeling approaches. For now, let us use the very
simple model that was used in the image-coding example; that is, each sample has the
same value as the previous sample. Using this model we obtain the difference sequence.
The entropy of the difference sequence is shown in Table 3.29.
Note that there is a further reduction in the file size: the compressed file sizes are about
60% of the original files. Further reductions can be obtained by using more sophisticated
models.
Many of the lossless audio compression schemes, including FLAC (Free Lossless
Audio Codec), Apple’s ALAC or ALE, Shorten [33], Monkey’s Audio, and the proposed
(as of now) MPEG-4 ALS [34] algorithms, use a linear predictive model to remove some of
T A B L E 3 . 29 Huffman coding of differences of 16-bit CD-quality audio.
Original Entropy Estimated Compressed Compression
File Name File Size (bytes) of Differences (bits) File Size (bytes) Ratio
Mozart 939,862 9.7 569,792 1.65
Cohn 402,442 10.4 261,590 1.54
Mir 884,020 10.9 602,240 1.47
the structure from the audio sequence and then use Rice coding to encode the residuals. Most
others, such as AudioPak [35] and OggSquish, use Huffman coding to encode the residuals.
3 . 9 Summary
In this chapter we began our exploration of data compression techniques with a description
of the Huffman coding technique and several other related techniques. The Huffman coding
technique and its variants are some of the most commonly used coding approaches. We will
encounter modified versions of Huffman codes when we look at compression techniques
for text, image, and video. In this chapter we described how to design Huffman codes and
discussed some of the issues related to Huffman codes. We also described how adaptive
Huffman codes work and looked briefly at some of the places where Huffman codes are
used. We will see more of these in future chapters.
To explore further applications of Huffman coding, you can use the programs
huff_enc, huff_dec, and adap_huff to generate your own Huffman codes for your
favorite applications.
F u r t h e r R e a d i n g
1. A detailed and very accessible overview of Huffman codes is provided in “Huffman
Codes,” by S. Pigeon [36], in Lossless Compression Handbook.
2. Details about nonbinary Huffman codes and a much more theoretical and rigor-
ous description of variable-length codes can be found in The Theory of Informa-
tion and Coding, volume 3 of Encyclopedia of Mathematics and Its Applications, by
R.J. McEliece [6].
3. The tutorial article “Data Compression” in the September 1987 issue of ACM Com-
puting Surveys, by D.A. Lelewer and D.S. Hirschberg [37], along with other material,
provides a very nice brief coverage of the material in this chapter.
4. A somewhat different approach to describing Huffman codes can be found in Data
Compression—Methods and Theory, by J.A. Storer [38].
5. A more theoretical but very readable account of variable-length coding can be found
in Elements of Information Theory, by T.M. Cover and J.A. Thomas [3].
6. Although the book Coding and Information Theory, by R.W. Hamming [9], is mostly
about channel coding, Huffman codes are described in some detail in Chapter 4.
3 . 10 Projects and Problems
1. The probabilities in Tables 3.26 and 3.27 were obtained using the program
countalpha from the accompanying software. Use this program to compare prob-
abilities for different types of text, C programs, messages on Usenet, and so on.
Comment on any differences you might see and describe how you would tailor your
compression strategy for each type of text.
2. Use the programs huff_enc and huff_dec to do the following (in each case use
the codebook generated by the image being compressed):
(a) Code the Sena, Sinan, and Omaha images.
(b) Write a program to take the difference between adjoining pixels, and then use
huff_enc to code the difference images.
(c) Repeat (a) and (b) using adap_huff.
Report the resulting file sizes for each of these experiments and comment on the
differences.
3. Using the programs huff_enc and huff_dec, code the Bookshelf1 and Sena
images using the codebook generated by the Sinan image. Compare the results with
the case where the codebook was generated by the image being compressed.
4. A source emits letters from an alphabet 𝒜 = {a1, a2, a3, a4, a5} with probabilities
P(a1) = 0.15, P(a2) = 0.04, P(a3) = 0.26, P(a4) = 0.05, and P(a5) = 0.50.
(a) Calculate the entropy of this source.
(b) Find a Huffman code for this source.
(c) Find the average length of the code in (b) and its redundancy.
5. For an alphabet 𝒜 = {a1, a2, a3, a4} with probabilities P(a1) = 0.1, P(a2) = 0.3,
P(a3) = 0.25, and P(a4) = 0.35, find a Huffman code
(a) using the first procedure outlined in this chapter, and
(b) using the minimum variance procedure.
Comment on the difference in the Huffman codes.
6. In many communication applications, it is desirable that the number of 1s and 0s
transmitted over the channel are about the same. However, if we look at Huffman
codes, many of them seem to have many more 1s than 0s or vice versa. Does this
mean that Huffman coding will lead to inefficient channel usage? For the Huffman
code obtained in Problem 3, find the probability that a 0 will be transmitted over the
channel. What does this probability say about the question posed above?
7. For the source in Example 3.3.1, generate a ternary code by combining three letters in
the first and second steps and two letters in the third step. Compare with the ternary
code obtained in the example.
8. In Example 3.4.1 we have shown how the tree develops when the sequence a a r d v
is transmitted. Continue this example with the next letters in the sequence, a r k.
9. The Monte Carlo approach is often used for studying problems that are difficult to
solve analytically. Let’s use this approach to study the problem of buffering when
using variable-length codes. We will simulate the situation in Example 3.2.1, and
study the time to overflow and underflow as a function of the buffer size. In our
program, we will need a random number generator, a set of seeds to initialize the
random number generator, a counter B to simulate the buffer occupancy, a counter T
to keep track of the time, and a value N, which is the size of the buffer. Input to the
buffer is simulated by using the random number generator to select a letter from our
alphabet. The counter B is then incremented by the length of the codeword for the
letter. The output to the buffer is simulated by decrementing B by 2 except when T is
divisible by 5. For values of T divisible by 5, decrement B by 3 instead of 2 (why?).
Keep incrementing T, each time simulating an input and an output, until either B ≥ N,
corresponding to a buffer overflow, or B  0, corresponding to a buffer underflow.
When either of these events happens, record what happened and when, and restart the
simulation with a new seed. Do this with at least 100 seeds.
Perform this simulation for a number of buffer sizes (N = 100, 1000, 10000), and the
two Huffman codes obtained for the source in Example 3.2.1. Describe your results
in a report.
10. While the variance of lengths is an important consideration when choosing between
two Huffman codes that have the same average lengths, it is not the only consideration.
Another consideration is the ability to recover from errors in the channel. In this
problem we will explore the effect of error on two equivalent Huffman codes.
(a) For the source and Huffman code of Example 3.2.1 (Table 3.5), encode the
sequence
a2 a1 a3 a2 a1 a2
Suppose there was an error in the channel and the first bit was received as a 0
instead of a 1. Decode the received sequence of bits. How many characters are
received in error before the first correctly decoded character?
(b) Repeat using the code in Table 3.9.
(c) Repeat parts (a) and (b) with the error in the third bit.
11. (This problem was suggested by P.F. Swaszek.)
(a) For a binary source with probabilities P(0) = 0.9, P(1) = 0.1, design a Huffman
code for the source obtained by blocking m bits together, m = 1, 2, …, 8. Plot
the average lengths versus m. Comment on your result.
(b) Repeat for P(0) = 0.99, P(1) = 0.01.
You can use the program huff_enc to generate the Huffman codes.
12. Encode the following sequence of 16 values using the Rice code with J = 8 and one
split sample option.
32 33 35 39 37 38 39 40 40 40 40 39 40 40 41 40
For prediction use the previous value in the sequence
ŷi = yi−1
and assume a prediction of zero for the first element of the sequence.
13. For an alphabet 𝒜 = {a1, a2, a3} with probabilities P(a1) = 0.7, P(a2) = 0.2, P(a3) =
0.1, design a 3-bit Tunstall code.
14. Write a program for encoding images using the Rice algorithm. Use eight options,
including the fundamental sequence, five split sample options, and the two low-entropy
options. Use J = 16. For prediction use either the pixel to the left or the pixel above.
Encode the Sena image using your program. Compare your results with the results
obtained by Huffman coding the differences between pixels.
4
Arithmetic Coding
4 . 1 Overview
In the previous chapter we saw one approach to generating variable-length
codes. In this chapter we see another, increasingly popular, method of gen-
erating variable-length codes called arithmetic coding. Arithmetic coding is
especially useful when dealing with sources with small alphabets, such as
binary sources, and alphabets with highly skewed probabilities. It is also a very
useful approach when, for various reasons, the modeling and coding aspects of lossless com-
pression are to be kept separate. In this chapter, we look at the basic ideas behind arithmetic
coding, study some of the properties of arithmetic codes, and describe an implementation.
4 . 2 Introduction
In the last chapter we studied the Huffman coding method, which guarantees a coding rate
R within 1 bit of the entropy H. Recall that the coding rate is the average number of bits
used to represent a symbol from a source and, for a given probability model, the entropy is
the lowest rate at which the source can be coded. We can tighten this bound somewhat. It
has been shown [23] that the Huffman algorithm will generate a code whose rate is within
pmax + 0.086 of the entropy, where pmax is the probability of the most frequently occurring
symbol. We noted in the last chapter that, in applications where the alphabet size is large,
pmax is generally quite small, and the amount of deviation from the entropy, especially in
terms of a percentage of the rate, is quite small. However, in cases where the alphabet is
small and the probability of occurrence of the different letters is skewed, the value of pmax
can be quite large and the Huffman code can become rather inefficient when compared to
the entropy. One way to avoid this problem is to block more than one symbol together and
generate an extended Huffman code. Unfortunately, this approach does not always work.
E x a m p l e 4 . 2 . 1:
Consider a source that puts out independent, identically distributed (iid) letters from the
alphabet 𝒜 = {a1, a2, a3} with the probability model P(a1) = 0.95, P(a2) = 0.02, and
P(a3) = 0.03. The entropy for this source is 0.335 bits/symbol. A Huffman code for this
source is given in Table 4.1.
T A B L E 4 . 1 Huffman code for
three-letter alphabet.
Letter Codeword
a1 0
a2 11
a3 10
The average length for this code is 1.05 bits/symbol. The difference between the average
code length and the entropy, or the redundancy, for this code is 0.715 bits/symbol, which is
213% of the entropy. This means that to code this sequence we would need more than twice
the number of bits promised by the entropy.
Recall Example 3.2.4. Here also we can group the symbols in blocks of two. The extended
alphabet, probability model, and code can be obtained as shown in Table 4.2. The average
rate for the extended alphabet is 1.222 bits/symbol, which in terms of the original alphabet is
0.611 bits/symbol. As the entropy of the source is 0.335 bits/symbol, the additional rate over
the entropy is still about 72% of the entropy! By continuing to block symbols together, we
find that the redundancy drops to acceptable values when we block eight symbols together.
The corresponding alphabet size for this level of blocking is 6561! A code of this size
is impractical for a number of reasons. Storage of a code like this requires memory that
may not be available for many applications. While it may be possible to design reasonably
efficient encoders, decoding a Huffman code of this size would be a highly inefficient and
time-consuming procedure. Finally, if there were some perturbation in the statistics, and
some of the assumed probabilities changed slightly, this would have a major impact on the
efficiency of the code.
T A B L E 4 . 2 Huffman code for extended alphabet.
Letter Probability Code
a1a1 0.9025 0
a1a2 0.0190 111
a1a3 0.0285 100
a2a1 0.0190 1101
a2a2 0.0004 110011
a2a3 0.0006 110001
a3a1 0.0285 101
a3a2 0.0006 110010
a3a3 0.0009 110000
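The rates quoted above are easy to check from Table 4.2 (a quick computation in Python):

    # Probabilities and codeword lengths taken from Table 4.2.
    table_4_2 = {
        "a1a1": (0.9025, 1), "a1a2": (0.0190, 3), "a1a3": (0.0285, 3),
        "a2a1": (0.0190, 4), "a2a2": (0.0004, 6), "a2a3": (0.0006, 6),
        "a3a1": (0.0285, 3), "a3a2": (0.0006, 6), "a3a3": (0.0009, 6),
    }
    avg_per_pair = sum(p * l for p, l in table_4_2.values())
    print(avg_per_pair, avg_per_pair / 2)   # roughly 1.22 bits per pair, 0.61 bits per original symbol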
We can see that it is more efficient to generate codewords for groups or sequences of
symbols rather than generating a separate codeword for each symbol in a sequence. However,
this approach becomes impractical when we try to obtain Huffman codes for long sequences
of symbols. In order to find the Huffman codeword for a particular sequence of length m,
we need codewords for all possible sequences of length m. This fact causes an exponential
growth in the size of the codebook. We need a way of assigning codewords to particular
sequences without having to generate codes for all sequences of that length. The arithmetic
coding technique fulfills this requirement.
In arithmetic coding a unique identifier or tag is generated for the sequence to be
encoded. This tag corresponds to a binary fraction, which becomes the binary code for the
sequence. In practice the generation of the tag and the binary code are the same process.
However, the arithmetic coding approach is easier to understand if we conceptually divide
the approach into two phases. In the first phase a unique identifier or tag is generated for a
given sequence of symbols. This tag is then given a unique binary code. A unique arithmetic
code can be generated for a sequence of length m without the need for generating codewords
for all sequences of length m. This is unlike the situation for Huffman codes. In order to
generate a Huffman code for a sequence of length m, where the code is not a concatenation
of the codewords for the individual symbols, we need to obtain the Huffman codes for all
sequences of length m.
4 . 3 Coding a Sequence
In order to distinguish a sequence of symbols from another sequence of symbols we need
to tag it with a unique identifier. One possible set of tags for representing sequences of
symbols are the numbers in the unit interval [0, 1). Because the number of numbers in the
unit interval is infinite, it should be possible to assign a unique tag to each distinct sequence
of symbols. In order to do this we need a function that will map sequences of symbols into
the unit interval. A function that maps random variables, and sequences of random variables,
into the unit interval is the cumulative distribution function (cdf) of the random variable
associated with the source. This is the function we will use in developing the arithmetic
code. (If you are not familiar with random variables and cumulative distribution functions,
or need to refresh your memory, you may wish to look at Appendix A.)
The use of the cumulative distribution function to generate a binary code for a sequence
has a rather interesting history. Shannon, in his original 1948 paper [7], mentioned an
approach using the cumulative distribution function when describing what is now known as
the Shannon-Fano code. Peter Elias, another member of Fano’s first information theory class
at MIT (this class also included Huffman), came up with a recursive implementation for this
idea. However, he never published it, and we only know about it through a mention in a 1963
book on information theory by Abramson [39]. Abramson described this coding approach in
a note to a chapter. In another book on information theory by Jelinek [40] in 1968, the idea of
arithmetic coding is further developed, this time in an appendix, as an example of variable-
length coding. Modern arithmetic coding owes its birth to the independent discoveries in
1976 of Pasco [41] and Rissanen [42] that the problem of finite precision could be resolved.
Finally, several papers appeared that provided practical arithmetic coding algorithms, the
most well known of which is the paper by Rissanen and Langdon [43].
Before we begin our development of the arithmetic code, we need to establish some
notation. Recall that a random variable maps the outcomes, or sets of outcomes, of an
experiment to values on the real number line. For example, in a coin-tossing experiment, the
random variable could map a head to zero and a tail to one (or it could map a head to 23675
and a tail to −192). To use this technique, we need to map the source symbols or letters to
numbers. For convenience, in the discussion in this chapter we will use the mapping
    X(ai) = i,    ai ∈ 𝒜    (4.1)

where 𝒜 = {a1, a2, …, am} is the alphabet for a discrete source and X is a random variable.
This mapping means that given a probability model for the source, we also have a
probability density function for the random variable

    P(X = i) = P(ai)

and the cumulative distribution function can be defined as

    FX(i) = Σ (k=1 to i) P(X = k)

Notice that for each symbol ai with a nonzero probability we have a distinct value of FX(i).
We will use this fact in what follows to develop the arithmetic code. Our development may
be more detailed than what you are looking for, at least on the first reading. If so, skip or
skim Sections 4.3.1–4.4.1 and go directly to Section 4.4.2.
4 . 3 . 1 G e n e r a t i n g a T a g
The procedure for generating the tag works by reducing the size of the interval in which the
tag resides as more and more elements of the sequence are received.
We start out by first dividing the unit interval into subintervals of the form
[FX(i−1), FX(i)), i = 1, …, m. Because the minimum value of the cdf is zero and the
maximum value is one, this exactly partitions the unit interval. We associate the subin-
terval [FX(i − 1), FX(i)) with the symbol ai. The appearance of the first symbol in the
sequence restricts the interval containing the tag to one of these subintervals. Suppose the
first symbol was ak. Then the interval containing the tag value will be the subinterval
[FX(k−1), FX(k)). This subinterval is now partitioned in exactly the same proportions as
the original interval. That is, the jth interval corresponding to the symbol aj is given by
[FX(k−1) + FX(j−1)(FX(k) − FX(k−1)), FX(k−1) + FX(j)(FX(k) − FX(k−1))). So if
the second symbol in the sequence is aj, then the interval containing the tag value becomes
[FX(k−1) + FX(j−1)(FX(k) − FX(k−1)), FX(k−1) + FX(j)(FX(k) − FX(k−1))). Each
succeeding symbol causes the tag to be restricted to a subinterval that is further partitioned
in the same proportions. This process can be more clearly understood through an example.
E x a m p l e 4 . 3 . 1:
Consider a three-letter alphabet 𝒜 = {a1, a2, a3} with P(a1) = 0.7, P(a2) = 0.1, and P(a3) =
0.2. Using the mapping of Equation (4.1), FX(1) = 0.7, FX(2) = 0.8, and FX(3) = 1. This
partitions the unit interval as shown in Figure 4.1.
[Figure 4.1 shows four number lines: the unit interval and its successive restrictions
[0, 1) → [0, 0.7) → [0.49, 0.56) → [0.546, 0.56), each interval subdivided in the
proportions 0.7, 0.1, 0.2 corresponding to a1, a2, a3.]
F I G U R E 4. 1 Restricting the interval containing the tag for the input sequence a1 a2 a3 … .
The partition in which the tag resides depends on the first symbol of the sequence being
encoded. For example, if the first symbol is a1, the tag lies in the interval [0.0, 0.7); if the
first symbol is a2, the tag lies in the interval [0.7, 0.8); and if the first symbol is a3, the
tag lies in the interval [0.8, 1.0). Once the interval containing the tag has been determined,
the rest of the unit interval is discarded, and this restricted interval is again divided in the
same proportions as the original interval. Suppose the first symbol was a1. The tag would be
contained in the subinterval [0.0, 0.7). This subinterval is then subdivided in exactly the same
proportions as the original interval, yielding the subintervals [0.0, 0.49), [0.49, 0.56), and
[0.56, 0.7). The first partition as before corresponds to the symbol a1, the second partition
corresponds to the symbol a2, and the third partition [0.56, 0.7) corresponds to the symbol
a3. Suppose the second symbol in the sequence is a2. The tag value is then restricted to
lie in the interval [0.49, 0.56). We now partition this interval in the same proportion as
the original interval to obtain the subintervals [0.49, 0.539) corresponding to the symbol
a1, [0.539, 0.546) corresponding to the symbol a2, and [0.546, 0.56) corresponding to the
symbol a3. If the third symbol is a3, the tag will be restricted to the interval [0.546, 0.56),
which can then be subdivided further. This process is described graphically in Figure 4.1.
Notice that the appearance of each new symbol restricts the tag to a subinterval that is
disjoint from any other subinterval that may have been generated using this process. For
the sequence beginning with a1 a2 a3, by the time the third symbol a3 is received,
the tag has been restricted to the subinterval [0.546, 0.56). If the third symbol had been a1
instead of a3, the tag would have resided in the subinterval [0.49, 0.539), which is disjoint
from the subinterval [0.546, 0.56). Even if the two sequences are identical from this point
on (one starting with a1 a2 a3 and the other beginning with a1 a2 a1), the tag interval for
the two sequences will always be disjoint.
As we can see, the interval in which the tag for a particular sequence resides is disjoint
from all intervals in which the tag for any other sequence may reside. As such, any member
of this interval can be used as a tag. One popular choice is the lower limit of the interval;
another possibility is the midpoint of the interval. For the moment, let’s use the midpoint of
the interval as the tag.
In order to see how the tag generation procedure works mathematically, we start with
sequences of length one. Suppose we have a source that puts out symbols from some
alphabet 𝒜 = {a1, a2, …, am}. We can map the symbols ai to real numbers i. Define
T̄X(ai) as

    T̄X(ai) = Σ (k=1 to i−1) P(X = k) + (1/2) P(X = i)    (4.2)
            = FX(i − 1) + (1/2) P(X = i)                  (4.3)

For each ai, T̄X(ai) will have a unique value. This value can be used as a unique tag for ai.
E x a m p l e 4 . 3 . 2:
Consider a simple dice-throwing experiment with a fair die. The outcomes of a roll of the
die can be mapped into the numbers 1, 2, …, 6. For a fair die

    P(X = k) = 1/6    for k = 1, 2, …, 6.

Therefore, using (4.3) we can find the tag for X = 2 as

    T̄X(2) = P(X = 1) + (1/2) P(X = 2) = 1/6 + 1/12 = 0.25

and the tag for X = 5 as

    T̄X(5) = Σ (k=1 to 4) P(X = k) + (1/2) P(X = 5) = 0.75

The tags for all other outcomes are shown in Table 4.3.
T A B L E 4 . 3 Tags for outcomes in a
dice-throwing experiment.
Outcome Tag
1 0.0833
3 0.4166
4 0.5833
6 0.9166
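A quick check of Equation (4.3) for the fair-die example (Python; a one-off sketch, not the
book's software):

    def tag(i, pmf):
        """T_X(i) = F_X(i-1) + P(X = i)/2, with pmf[0] = P(X = 1)."""
        return sum(pmf[:i - 1]) + pmf[i - 1] / 2

    die = [1 / 6] * 6
    print([round(tag(i, die), 4) for i in range(1, 7)])
    # [0.0833, 0.25, 0.4167, 0.5833, 0.75, 0.9167]  (Table 4.3 truncates the repeating decimals)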
As we can see from the example above, giving a unique tag to a sequence of length one
is an easy task. This approach can be extended to longer sequences by imposing an order
on the sequences. We need an ordering on the sequences because we will assign a tag to a
particular sequence xi as

    T̄X^(m)(xi) = Σ (y < xi) P(y) + (1/2) P(xi)    (4.4)

where y < x means that y precedes x in the ordering, and the superscript denotes the length
of the sequence.
An easy ordering to use is lexicographic ordering. In lexicographic ordering, the ordering
of letters in an alphabet induces an ordering on the words constructed from this alphabet. The
ordering of words in a dictionary is a good (maybe the original) example of lexicographic
ordering. Dictionary order is sometimes used as a synonym for lexicographic order.
E x a m p l e 4 . 3 . 3:
We can extend Example 4.3.2 so that the sequence consists of two rolls of a die. Using the
ordering scheme described above, the outcomes (in order) would be 11, 12, 13, …, 66. The
tags can then be generated using Equation (4.4). For example, the tag for the sequence 13
would be

    T̄X(13) = P(x = 11) + P(x = 12) + (1/2) P(x = 13)    (4.5)
            = 1/36 + 1/36 + (1/2)(1/36)                   (4.6)
            = 5/72                                         (4.7)

Notice that to generate the tag for 13 we did not have to generate a tag for every other
possible message. However, based on Equation (4.4) and Example 4.3.3, we need to know
the probability of every sequence that is “less than” the sequence for which the tag is being
generated. The requirement that the probability of all sequences of a given length be explicitly
calculated can be as prohibitive as the requirement that we have codewords for all sequences
of a given length. Fortunately, we shall see that to compute a tag for a given sequence of
symbols, all we need is the probability of individual symbols, or the probability model.
Recall that, given our construction, the interval containing the tag value for a given
sequence is disjoint from the intervals containing the tag values of all other sequences. This
means that any value in this interval would be a unique identifier for xi. Therefore, to fulfill
our initial objective of uniquely identifying each sequence, it would be sufficient to compute
the upper and lower limits of the interval containing the tag and select any value in that
interval. The upper and lower limits can be computed recursively as shown in the following
example.
E x a m p l e 4 . 3 . 4:
We will use the alphabet of Example 4.3.2 and find the upper and lower limits of the
interval containing the tag for the sequence 3 2 2. Assume that we are observing 3 2 2 in a
sequential manner; that is, first we see 3, then 2, and then 2 again. After each observation we
will compute the upper and lower limits of the interval containing the tag of the sequence
observed to that point. We will denote the upper limit by u^(n) and the lower limit by l^(n),
where n denotes the length of the sequence.
We first observe 3. Therefore,

    u^(1) = FX(3),    l^(1) = FX(2)

We then observe 2 and the sequence is x = 32. Therefore,

    u^(2) = FX^(2)(32),    l^(2) = FX^(2)(31)

We can compute these values as follows:

    FX^(2)(32) = P(x = 11) + P(x = 12) + ··· + P(x = 16)
               + P(x = 21) + P(x = 22) + ··· + P(x = 26)
               + P(x = 31) + P(x = 32)

But,

    Σ (i=1 to 6) P(x = ki) = Σ (i=1 to 6) P(x1 = k, x2 = i) = P(x1 = k)

where x = x1x2. Therefore,

    FX^(2)(32) = P(x1 = 1) + P(x1 = 2) + P(x = 31) + P(x = 32)
               = FX(2) + P(x = 31) + P(x = 32)

However, assuming each roll of the dice is independent of the others,

    P(x = 31) = P(x1 = 3) P(x2 = 1)

and

    P(x = 32) = P(x1 = 3) P(x2 = 2)

Therefore,

    P(x = 31) + P(x = 32) = P(x1 = 3) [P(x2 = 1) + P(x2 = 2)]
                          = P(x1 = 3) FX(2)

Noting that

    P(x1 = 3) = FX(3) − FX(2)

we can write

    P(x = 31) + P(x = 32) = [FX(3) − FX(2)] FX(2)

and

    FX^(2)(32) = FX(2) + [FX(3) − FX(2)] FX(2)

We can also write this as

    u^(2) = l^(1) + (u^(1) − l^(1)) FX(2)

We can similarly show that

    FX^(2)(31) = FX(2) + [FX(3) − FX(2)] FX(1)

or

    l^(2) = l^(1) + (u^(1) − l^(1)) FX(1)

The third element of the observed sequence is 2, and the sequence is x = 322. The upper
and lower limits of the interval containing the tag for this sequence are

    u^(3) = FX^(3)(322),    l^(3) = FX^(3)(321)

Using the same approach as above we find that

    FX^(3)(322) = FX^(2)(31) + [FX^(2)(32) − FX^(2)(31)] FX(2)    (4.8)
    FX^(3)(321) = FX^(2)(31) + [FX^(2)(32) − FX^(2)(31)] FX(1)

or

    u^(3) = l^(2) + (u^(2) − l^(2)) FX(2)
    l^(3) = l^(2) + (u^(2) − l^(2)) FX(1)
In general, we can show that for any sequence x = (x1 x2 … xn)

    l^(n) = l^(n−1) + (u^(n−1) − l^(n−1)) FX(xn − 1)    (4.9)
    u^(n) = l^(n−1) + (u^(n−1) − l^(n−1)) FX(xn)        (4.10)
Notice that throughout this process we did not explicitly need to compute any joint
probabilities.
If we are using the midpoint of the interval for the tag, then

    T̄X(x) = (u^(n) + l^(n)) / 2
Therefore, the tag for any sequence can be computed in a sequential fashion. The only
information required by the tag generation procedure is the cdf of the source, which can be
obtained directly from the probability model.
E x a m p l e 4 . 3 . 5: G e n e r a t i n g a t a g
Consider the source in Example 3.2.4. Define the random variable X(ai) = i. Suppose we
wish to encode the sequence 1 3 2 1. From the probability model we know that

    FX(k) = 0 for k ≤ 0,  FX(1) = 0.8,  FX(2) = 0.82,  FX(3) = 1,  FX(k) = 1 for k > 3.

We can use Equations (4.9) and (4.10) sequentially to determine the lower and upper limits
of the interval containing the tag. Initializing u^(0) to 1, and l^(0) to 0, the first element of the
sequence, 1, results in the following update:

    l^(1) = 0 + (1 − 0) × 0 = 0
    u^(1) = 0 + (1 − 0) × 0.8 = 0.8

That is, the tag is contained in the interval [0, 0.8). The second element of the sequence is 3.
Using the update equations we get

    l^(2) = 0 + (0.8 − 0) FX(2) = 0.8 × 0.82 = 0.656
    u^(2) = 0 + (0.8 − 0) FX(3) = 0.8 × 1.0 = 0.8

Therefore, the interval containing the tag for the sequence 1 3 is [0.656, 0.8). The third
element, 2, results in the following update equations:

    l^(3) = 0.656 + (0.8 − 0.656) FX(1) = 0.656 + 0.144 × 0.8 = 0.7712
    u^(3) = 0.656 + (0.8 − 0.656) FX(2) = 0.656 + 0.144 × 0.82 = 0.77408

and the interval for the tag is [0.7712, 0.77408). Continuing with the last element, the upper
and lower limits of the interval containing the tag are

    l^(4) = 0.7712 + (0.77408 − 0.7712) FX(0) = 0.7712 + 0.00288 × 0.0 = 0.7712
    u^(4) = 0.7712 + (0.77408 − 0.7712) FX(1) = 0.7712 + 0.00288 × 0.8 = 0.773504

and the tag for the sequence 1 3 2 1 can be generated as

    T̄X(1321) = (0.7712 + 0.773504) / 2 = 0.772352
Notice that each succeeding interval is contained in the preceding interval. If we examine
the equations used to generate the intervals, we see that this will always be the case. This
property will be used to decipher the tag. An undesirable consequence of this process is
that the intervals get smaller and smaller and require higher precision as the sequence gets
longer. To combat this problem, a rescaling strategy needs to be adopted. In Section 4.4.2,
we will describe a simple rescaling approach that takes care of this problem.
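The recursion in Equations (4.9) and (4.10) translates directly into a few lines of code. The
sketch below (Python, illustrative names) reproduces the limits computed in Example 4.3.5.

    def tag_interval(sequence, cdf):
        """cdf[k] stands for F_X(k), with cdf[0] = 0."""
        low, high = 0.0, 1.0
        for s in sequence:
            width = high - low
            low, high = low + width * cdf[s - 1], low + width * cdf[s]
        return low, high

    cdf = [0.0, 0.8, 0.82, 1.0]             # F_X(0), F_X(1), F_X(2), F_X(3)
    low, high = tag_interval([1, 3, 2, 1], cdf)
    print(low, high, (low + high) / 2)      # approximately 0.7712, 0.773504, and the tag 0.772352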
4 . 3 . 2 D e c i p h e r i n g t h e T a g
We have spent a considerable amount of time showing how a sequence can be assigned a
unique tag, given a minimal amount of information. However, the tag is useless unless we
can also decipher it with minimal computational cost. Fortunately, deciphering the tag is as
simple as generating it. We can see this most easily through an example.
E x a m p l e 4 . 3 . 6: D e c i p h e r i n g a t a g
Given the tag obtained in Example 4.3.5, let’s try to obtain the sequence represented by
the tag. We will try to mimic the encoder in order to do the decoding. The tag value is
0772352. The interval containing this tag value is a subset of every interval obtained in the
encoding process. Our decoding strategy will be to decode the elements in the sequence in
such a way that the upper and lower limits uk
and lk
will always contain the tag value for
each k. We start with l0
= 0 and u0
= 1. After decoding the first element of the sequence
x1, the upper and lower limits become
l1
= 0+1−0FXx1 −1 = FXx1 −1
u1
= 0+1−0FXx1 = FXx1
In other words, the interval containing the tag is FXx1 −1FXx1. We need to find the
value of x1 for which 0772352 lies in the interval FXx1 −1FXx1. If we pick x1 = 1,
the interval is 008. If we pick x1 = 2, the interval is 08082, and if we pick x1 = 3,
the interval is 08210. As 0772352 lies in the interval 0008 , we choose x1 = 1. We
now repeat this procedure for the second element x2, using the updated values of l1
and
u1
:
l2
= 0+08−0FXx2 −1 = 08FXx2 −1
u2
= 0+08−0FXx2 = 08FXx2
If we pick x2 = 1, the updated interval is 0064, which does not contain the tag. Therefore,
x2 cannot be 1. If we pick x2 = 2, the updated interval is 0640656, which also does not
contain the tag. If we pick x2 = 3, the updated interval is 065608, which does contain
the tag value of 0772352. Therefore, the second element in the sequence is 3. Knowing
the second element of the sequence, we can update the values of l2
and u2
and find the
element x3, which will give us an interval containing the tag:
l3
= 0656+08−0656FXx3 −1 = 0656+0144×FXx3 −1
u3
= 0656+08−0656FXx3 = 0656+0144×FXx3
92 4 A R I T H M E T I C C O D I N G
However, the expressions for l3
and u3
are cumbersome in this form. To make the
comparisons more easily, we could subtract the value of l2
from both the limits and the tag.
That is, we find the value of x3 for which the interval 0144×FXx3 −1, 0144×FXx3
contains 0772352−0656 = 0116352. Or, we could make this even simpler and divide the
residual tag value of 0116352 by 0144 to get 0808, and find the value of x3 for which
0808 falls in the interval FXx3 − 1FXx3. We can see that the only value of x3 for
which this is possible is 2. Substituting 2 for x3 in the update equations, we can update the
values of l3
and u3
. We can now find the element x4 by computing the upper and lower
limits as
l4
= 07712+077408−07712FXx4 −1 = 07712+000288×FXx4 −1
u4
= 07712+077408−01152FXx4 = 07712+000288×FXx4
Again we can subtract l3
from the tag to get 0772352 − 07712 = 0001152 and find
the value of x4 for which the interval 000288 × FXx4 − 1, 000288 × FXx4 contains
0001152. To make the comparisons simpler, we can divide the residual value of the tag by
000288 to get 04, and find the value of x4 for which 04 is contained in FXx4 −1FXx4.
We can see that the value is x4 = 1, and we have decoded the entire sequence. Note that we
knew the length of the sequence beforehand and, therefore, we knew when to stop. 
From the example above, we can deduce an algorithm that can decipher the tag.
1. Initialize l^(0) = 0 and u^(0) = 1.
2. For each k find t* = (tag − l^(k−1)) / (u^(k−1) − l^(k−1)).
3. Find the value of xk for which FX(xk − 1) ≤ t* < FX(xk).
4. Update u^(k) and l^(k).
5. Continue until the entire sequence has been decoded.
There are two ways to know when the entire sequence has been decoded. The decoder may
know the length of the sequence, in which case the deciphering process is stopped when
that many symbols have been obtained. The second way to know if the entire sequence has
been decoded is that a particular symbol is denoted as an end-of-transmission symbol. The
decoding of this symbol would bring the decoding process to a close.
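The deciphering algorithm is equally short in code. A minimal sketch (Python, illustrative
names), assuming the decoder knows how many symbols to decode:

    def decode_tag(tag, cdf, num_symbols):
        """cdf[k] stands for F_X(k), with cdf[0] = 0."""
        low, high = 0.0, 1.0
        out = []
        for _ in range(num_symbols):
            t = (tag - low) / (high - low)      # step 2: rescale the tag
            x = next(k for k in range(1, len(cdf)) if cdf[k - 1] <= t < cdf[k])
            out.append(x)                       # step 3: pick the symbol
            width = high - low                  # step 4: update the limits
            low, high = low + width * cdf[x - 1], low + width * cdf[x]
        return out

    cdf = [0.0, 0.8, 0.82, 1.0]
    print(decode_tag(0.772352, cdf, 4))         # [1, 3, 2, 1], as in Example 4.3.6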
4 . 4 Generating a Binary Code
Using the algorithm described in the previous section, we can obtain a tag for a given
sequence x. However, the binary code for the sequence is what we really want to know. We
want to find a binary code that will represent the sequence x in a unique and efficient manner.
We have said that the tag forms a unique representation for the sequence. This means that
the binary representation of the tag forms a unique binary code for the sequence. However,
we have placed no restrictions on what values in the unit interval the tag can take. The binary
representation of some of these values would be infinitely long, in which case, although the
code is unique, it may not be efficient. To make the code efficient, the binary representation
has to be truncated. But if we truncate the representation, is the resulting code still unique?
Finally, is the resulting code efficient? How far or how close is the average number of bits
per symbol from the entropy? We will examine all these questions in the next section.
Even if we show the code to be unique and efficient, the method described to this
point is highly impractical. In Section 4.4.2, we will describe a more practical algorithm for
generating the arithmetic code for a sequence. We will give an integer implementation of
this algorithm in Section 4.4.3.
4 . 4 . 1 U n i q u e n e s s a n d E f f i c i e n c y o f t h e A r i t h m e t i c
C o d e
T̄X(x) is a number in the interval [0, 1). A binary code for T̄X(x) can be obtained by taking
the binary representation of this number and truncating it to l(x) = ⌈log 1/P(x)⌉ + 1 bits.
E x a m p l e 4 . 4 . 1:
Consider a source that generates letters from an alphabet of size four,

    𝒜 = {a1, a2, a3, a4}

with probabilities

    P(a1) = 1/2,  P(a2) = 1/4,  P(a3) = 1/8,  P(a4) = 1/8.

A binary code for this source can be generated as shown in Table 4.4. The quantity T̄X(x) is
obtained using Equation (4.3). The binary representation of T̄X(x) is truncated to ⌈log 1/P(x)⌉ + 1
bits to obtain the binary code.
T A B L E 4 . 4 A binary code for a four-letter alphabet.
Symbol   FX     T̄X      In Binary   ⌈log 1/P(x)⌉ + 1   Code
1        .5     .25     .010        2                  01
2        .75    .625    .101        3                  101
3        .875   .8125   .1101       4                  1101
4        1.0    .9375   .1111       4                  1111
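The construction in Example 4.4.1 is easy to reproduce in code: compute the tag from
Equation (4.3), then keep the first ⌈log 1/P(x)⌉ + 1 bits of its binary expansion. A minimal
sketch (Python, illustrative names):

    from math import ceil, log2

    def arithmetic_codeword(i, pmf):
        tag = sum(pmf[:i - 1]) + pmf[i - 1] / 2        # Equation (4.3)
        length = ceil(log2(1 / pmf[i - 1])) + 1        # l(x) = ceil(log 1/P(x)) + 1
        bits = ""
        for _ in range(length):                        # truncated binary expansion of the tag
            tag *= 2
            bits += "1" if tag >= 1 else "0"
            tag -= int(tag)
        return bits

    pmf = [1/2, 1/4, 1/8, 1/8]
    print([arithmetic_codeword(i, pmf) for i in range(1, 5)])   # ['01', '101', '1101', '1111']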

We will show that a code obtained in this fashion is a uniquely decodable code. We first
show that this code is unique, and then we will show that it is uniquely decodable.
Recall that while we have been using T̄X(x) as the tag for a sequence x, any number
in the interval [FX(x − 1), FX(x)) would be a unique identifier. Therefore, to show that the
code ⌊T̄X(x)⌋_l(x) is unique, all we need to do is show that it is contained in the interval
[FX(x − 1), FX(x)). Because we are truncating the binary representation of T̄X(x) to obtain
⌊T̄X(x)⌋_l(x), ⌊T̄X(x)⌋_l(x) is less than or equal to T̄X(x). More specifically,

    0 ≤ T̄X(x) − ⌊T̄X(x)⌋_l(x) < 1/2^l(x)    (4.11)

As T̄X(x) is strictly less than FX(x),

    ⌊T̄X(x)⌋_l(x) < FX(x)

To show that ⌊T̄X(x)⌋_l(x) ≥ FX(x − 1), note that

    1/2^l(x) = 1/2^(⌈log 1/P(x)⌉ + 1) ≤ 1/2^(log 1/P(x) + 1) = 1/(2 · 1/P(x)) = P(x)/2

From (4.3) we have

    P(x)/2 = T̄X(x) − FX(x − 1)

Therefore,

    T̄X(x) − FX(x − 1) ≥ 1/2^l(x)    (4.12)

Combining (4.11) and (4.12), we have

    ⌊T̄X(x)⌋_l(x) ≥ FX(x − 1)    (4.13)

Therefore, the code ⌊T̄X(x)⌋_l(x) is a unique representation of T̄X(x).
To show that this code is uniquely decodable, we will show that the code is a prefix
code; that is, no codeword is a prefix of another codeword. Because a prefix code is always
uniquely decodable, by showing that an arithmetic code is a prefix code, we automatically
show that it is uniquely decodable. Given a number a in the interval [0, 1) with an n-bit
binary representation [b1 b2 … bn], for any other number b to have a binary representation
with [b1 b2 … bn] as the prefix, b has to lie in the interval [a, a + 1/2^n). (See Problem 1.)
If x and y are two distinct sequences, we know that ⌊T̄X(x)⌋_l(x) and ⌊T̄X(y)⌋_l(y) lie in
two disjoint intervals, [FX(x − 1), FX(x)) and [FX(y − 1), FX(y)). Therefore, if we can show
that for any sequence x, the interval [⌊T̄X(x)⌋_l(x), ⌊T̄X(x)⌋_l(x) + 1/2^l(x)) lies entirely within the
interval [FX(x − 1), FX(x)), this will mean that the code for one sequence cannot be the
prefix for the code for another sequence.
We have already shown that ⌊T̄X(x)⌋_l(x) ≥ FX(x − 1). Therefore, all we need to do is
show that

    FX(x) − ⌊T̄X(x)⌋_l(x) ≥ 1/2^l(x)

This is true because

    FX(x) − ⌊T̄X(x)⌋_l(x) ≥ FX(x) − T̄X(x) = P(x)/2 ≥ 1/2^l(x)

This code is prefix free, and by taking the binary representation of T̄X(x) and truncating it
to l(x) = ⌈log 1/P(x)⌉ + 1 bits, we obtain a uniquely decodable code.
Although the code is uniquely decodable, how efficient is it? We have shown that the
number of bits l(x) required to represent FX(x) with enough accuracy such that the code for
different values of x are distinct is

    l(x) = ⌈log 1/P(x)⌉ + 1
Remember that l(x) is the number of bits required to encode the entire sequence x. So, the
average length of an arithmetic code for a sequence of length m is given by

    lA^(m) = Σ P(x) l(x)                          (4.14)
           = Σ P(x) [⌈log 1/P(x)⌉ + 1]            (4.15)
           < Σ P(x) [log 1/P(x) + 1 + 1]          (4.16)
           = −Σ P(x) log P(x) + 2 Σ P(x)          (4.17)
           = H(X^(m)) + 2                         (4.18)

Given that the average length is always greater than the entropy, the bounds on lA^(m) are

    H(X^(m)) ≤ lA^(m) < H(X^(m)) + 2

The length per symbol, lA, or rate of the arithmetic code is lA^(m)/m. Therefore, the bounds
on lA are

    H(X^(m))/m ≤ lA < H(X^(m))/m + 2/m    (4.19)

We have shown in Chapter 3 that for iid sources

    H(X^(m)) = mH(X)    (4.20)

Therefore,

    H(X) ≤ lA < H(X) + 2/m    (4.21)
By increasing the length of the sequence, we can guarantee a rate as close to the entropy as
we desire.
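As a quick illustration, the four-letter source of Example 4.4.1 has an entropy of 1.75 bits/symbol; by Equation (4.21), arithmetic coding of blocks of $m = 100$ symbols guarantees a rate below $1.75 + 2/100 = 1.77$ bits/symbol, and $m = 1000$ brings the guarantee to within 0.002 bits of the entropy.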
4.4.2 Algorithm Implementation
In Section 4.3.1 we developed a recursive algorithm for the boundaries of the interval
containing the tag for the sequence being encoded as

$$l^{(n)} = l^{(n-1)} + (u^{(n-1)} - l^{(n-1)}) F_X(x_n - 1) \qquad (4.22)$$
$$u^{(n)} = l^{(n-1)} + (u^{(n-1)} - l^{(n-1)}) F_X(x_n) \qquad (4.23)$$

where $x_n$ is the value of the random variable corresponding to the $n$th observed symbol, $l^{(n)}$
is the lower limit of the tag interval at the $n$th iteration, and $u^{(n)}$ is the upper limit of the tag
interval at the $n$th iteration.
Before we can implement this algorithm, there is one major problem we have to resolve.
Recall that the rationale for using numbers in the interval $[0, 1)$ as a tag was that there are
an infinite number of numbers in this interval. However, in practice the number of numbers
that can be uniquely represented on a machine is limited by the maximum number of digits
(or bits) we can use for representing the number. Consider the values of $l^{(n)}$ and $u^{(n)}$ in
Example 4.3.5. As $n$ gets larger, these values come closer and closer together. This means
that in order to represent all the subintervals uniquely we need increasing precision as the
length of the sequence increases. In a system with finite precision, the two values are bound
to converge, and we will lose all information about the sequence from the point at which
the two values converged. To avoid this situation, we need to rescale the interval. However,
we have to do it in a way that will preserve the information that is being transmitted. We
would also like to perform the encoding incrementally—that is, to transmit portions of the
code as the sequence is being observed, rather than wait until the entire sequence has been
observed before transmitting the first bit. The algorithm we describe in this section takes
care of the problems of synchronized rescaling and incremental encoding.
As the interval becomes narrower, we have three possibilities:
1. The interval is entirely confined to the lower half of the unit interval $[0, 0.5)$.
2. The interval is entirely confined to the upper half of the unit interval $[0.5, 1.0)$.
3. The interval straddles the midpoint of the unit interval.

We will look at the third case a little later in this section. First, let us examine the first two
cases. Once the interval is confined to either the upper or lower half of the unit interval, it
is forever confined to that half of the unit interval. The most significant bit of the binary
representation of all numbers in the interval $[0, 0.5)$ is 0, and the most significant bit of the
binary representation of all numbers in the interval $[0.5, 1)$ is 1. Therefore, once the interval
gets restricted to either the upper or lower half of the unit interval, the most significant bit of
the tag is fully determined. Therefore, without waiting to see what the rest of the sequence
looks like, we can indicate to the decoder whether the tag is confined to the upper or lower
half of the unit interval by sending a 1 for the upper half and a 0 for the lower half. The bit
that we send is also the first bit of the tag.
Once the encoder and decoder know which half contains the tag, we can ignore the half
of the unit interval not containing the tag and concentrate on the half containing the tag.
As our arithmetic is of finite precision, we can do this best by mapping the half interval
containing the tag to the full $[0, 1)$ interval. The mappings required are

$$E_1 : [0, 0.5) \rightarrow [0, 1); \qquad E_1(x) = 2x \qquad (4.24)$$
$$E_2 : [0.5, 1) \rightarrow [0, 1); \qquad E_2(x) = 2(x - 0.5) \qquad (4.25)$$
As soon as we perform either of these mappings, we lose all information about the most
significant bit. However, this should not matter because we have already sent that bit to the
decoder. We can now continue with this process, generating another bit of the tag every time
the tag interval is restricted to either half of the unit interval. This process of generating the
bits of the tag without waiting to see the entire sequence is called incremental encoding.
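The rescaling-and-transmit loop is compact enough to sketch in a few lines of Python. The fragment below is only an illustration of the $E_1$/$E_2$ logic just described (the $E_3$ mapping for intervals straddling the midpoint is treated later in this section), and the function name and cdf list are ours, not the book's. Fed the model of Example 4.3.5, it emits the six bits 1 1 0 0 0 1 generated step by step in Example 4.4.2 below.

    def encode_with_scaling(sequence, cdf):
        """Floating-point tag generation with E1/E2 rescaling.
        cdf[k] = F_X(k), with cdf[0] = 0.0; symbols are 1-based indices."""
        low, high = 0.0, 1.0
        bits = []
        for x in sequence:
            span = high - low
            high = low + span * cdf[x]        # high must be updated before low
            low = low + span * cdf[x - 1]     # (both use the old value of low)
            while True:
                if high <= 0.5:               # E1: interval in the lower half
                    bits.append(0)
                    low, high = 2 * low, 2 * high
                elif low >= 0.5:              # E2: interval in the upper half
                    bits.append(1)
                    low, high = 2 * (low - 0.5), 2 * (high - 0.5)
                else:
                    break
        return bits, (low, high)

    cdf = [0.0, 0.8, 0.82, 1.0]               # P(a1)=0.8, P(a2)=0.02, P(a3)=0.18
    print(encode_with_scaling([1, 3, 2, 1], cdf))   # bits [1, 1, 0, 0, 0, 1]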
Example 4.4.2: Tag generation with scaling

Let's revisit Example 4.3.5. Recall that we wish to encode the sequence 1 3 2 1. The
probability model for the source is $P(a_1) = 0.8$, $P(a_2) = 0.02$, $P(a_3) = 0.18$. Initializing $u^{(0)}$
to 1, and $l^{(0)}$ to 0, the first element of the sequence, 1, results in the following update:

$$l^{(1)} = 0 + (1 - 0)(0) = 0$$
$$u^{(1)} = 0 + (1 - 0)(0.8) = 0.8$$

The interval $[0, 0.8)$ is not confined to either the upper or lower half of the unit interval, so
we proceed.

The second element of the sequence is 3. This results in the update

$$l^{(2)} = 0 + (0.8 - 0) F_X(2) = 0.8 \times 0.82 = 0.656$$
$$u^{(2)} = 0 + (0.8 - 0) F_X(3) = 0.8 \times 1.0 = 0.8$$

The interval $[0.656, 0.8)$ is contained entirely in the upper half of the unit interval, so we
send the binary code 1 and rescale:

$$l^{(2)} = 2 \times (0.656 - 0.5) = 0.312$$
$$u^{(2)} = 2 \times (0.8 - 0.5) = 0.6$$

The third element, 2, results in the following update equations:

$$l^{(3)} = 0.312 + (0.6 - 0.312) F_X(1) = 0.312 + 0.288 \times 0.8 = 0.5424$$
$$u^{(3)} = 0.312 + (0.6 - 0.312) F_X(2) = 0.312 + 0.288 \times 0.82 = 0.54816$$

The interval for the tag is $[0.5424, 0.54816)$, which is contained entirely in the upper half
of the unit interval. We transmit a 1 and go through another rescaling:

$$l^{(3)} = 2 \times (0.5424 - 0.5) = 0.0848$$
$$u^{(3)} = 2 \times (0.54816 - 0.5) = 0.09632$$

This interval is contained entirely in the lower half of the unit interval, so we send a 0 and
use the $E_1$ mapping to rescale:

$$l^{(3)} = 2 \times 0.0848 = 0.1696$$
$$u^{(3)} = 2 \times 0.09632 = 0.19264$$

The interval is still contained entirely in the lower half of the unit interval, so we send
another 0 and go through another rescaling:

$$l^{(3)} = 2 \times 0.1696 = 0.3392$$
$$u^{(3)} = 2 \times 0.19264 = 0.38528$$

Because the interval containing the tag remains in the lower half of the unit interval, we
send another 0 and rescale one more time:

$$l^{(3)} = 2 \times 0.3392 = 0.6784$$
$$u^{(3)} = 2 \times 0.38528 = 0.77056$$

Now the interval containing the tag is contained entirely in the upper half of the unit interval.
Therefore, we transmit a 1 and rescale using the $E_2$ mapping:

$$l^{(3)} = 2 \times (0.6784 - 0.5) = 0.3568$$
$$u^{(3)} = 2 \times (0.77056 - 0.5) = 0.54112$$

At each stage we are transmitting the most significant bit that is the same in both the
upper and lower limit of the tag interval. If the most significant bits in the upper and
lower limit are the same, then the value of this bit will be identical to the most significant
bit of the tag. Therefore, by sending the most significant bits of the upper and lower endpoint
of the tag whenever they are identical, we are actually sending the binary representation of
the tag. The rescaling operations can be viewed as left shifts, which make the second most
significant bit the most significant bit.

Continuing with the last element, the upper and lower limits of the interval containing
the tag are

$$l^{(4)} = 0.3568 + (0.54112 - 0.3568) F_X(0) = 0.3568 + 0.18432 \times 0.0 = 0.3568$$
$$u^{(4)} = 0.3568 + (0.54112 - 0.3568) F_X(1) = 0.3568 + 0.18432 \times 0.8 = 0.504256$$

At this point, if we wished to stop encoding, all we need to do is inform the receiver of
the final status of the tag value. We can do so by sending the binary representation of any
value in the final tag interval. Generally, this value is taken to be $l^{(n)}$. In this particular
example, it is convenient to use the value of 0.5. The binary representation of 0.5 is $.10\ldots$.
Thus, we would transmit a 1 followed by as many 0s as required by the word length of the
implementation being used.
Notice that the tag interval size at this stage is approximately 64 times the size it was
when we were using the unmodified algorithm. Therefore, this technique solves the finite
precision problem. As we shall soon see, the bits that we have been sending with each
mapping constitute the tag itself, which satisfies our desire for incremental encoding. The
binary sequence generated during the encoding process in the previous example is 1100011.
We could simply treat this as the binary expansion of the tag. The binary number .1100011
corresponds to the decimal number 0.7734375. Looking back to Example 4.3.5, notice that
this number lies within the final tag interval. Therefore, we could use this to decode the
sequence.
However, we would like to do incremental decoding as well as incremental encoding.
This raises three questions:
1. How do we start decoding?
2. How do we continue decoding?
3. How do we stop decoding?
The second question is the easiest to answer. Once we have started decoding, all we have to
do is mimic the encoder algorithm. That is, once we have started decoding, we know how
to continue decoding. To begin the decoding process, we need to have enough information
to decode the first symbol unambiguously. In order to guarantee unambiguous decoding, the
number of bits received should point to an interval smaller than the smallest tag interval.
Based on the smallest tag interval, we can determine how many bits we need before we start
the decoding procedure. We will demonstrate this procedure in Example 4.4.4. First let’s
look at other aspects of decoding using the message from Example 4.4.2.
Example 4.4.3:

We will use a word length of 6 for this example. Note that because we are dealing with
real numbers this word length may not be sufficient for a different sequence. As in the
encoder, we start with initializing $u^{(0)}$ to 1 and $l^{(0)}$ to 0. The sequence of received bits is
$1100011000\ldots0$. The first 6 bits correspond to a tag value of 0.765625, which means that
the first element of the sequence is 1, resulting in the following update:

$$l^{(1)} = 0 + (1 - 0)(0) = 0$$
$$u^{(1)} = 0 + (1 - 0)(0.8) = 0.8$$

The interval $[0, 0.8)$ is not confined to either the upper or lower half of the unit interval,
so we proceed. The tag 0.765625 lies in the top 18% of the interval $[0, 0.8)$; therefore, the
second element of the sequence is 3. Updating the tag interval we get

$$l^{(2)} = 0 + (0.8 - 0) F_X(2) = 0.8 \times 0.82 = 0.656$$
$$u^{(2)} = 0 + (0.8 - 0) F_X(3) = 0.8 \times 1.0 = 0.8$$

The interval $[0.656, 0.8)$ is contained entirely in the upper half of the unit interval. At
the encoder, we sent the bit 1 and rescaled. At the decoder, we will shift 1 out of the receive
buffer and move the next bit in to make up the 6 bits in the tag. We will also update the tag
interval, resulting in

$$l^{(2)} = 2 \times (0.656 - 0.5) = 0.312$$
$$u^{(2)} = 2 \times (0.8 - 0.5) = 0.6$$

while shifting a bit to give us a tag of 0.546875. When we compare this value with the
tag interval, we can see that this value lies in the 80–82% range of the tag interval, so we
decode the next element of the sequence as 2. We can then update the equations for the tag
interval as

$$l^{(3)} = 0.312 + (0.6 - 0.312) F_X(1) = 0.312 + 0.288 \times 0.8 = 0.5424$$
$$u^{(3)} = 0.312 + (0.6 - 0.312) F_X(2) = 0.312 + 0.288 \times 0.82 = 0.54816$$

As the tag interval is now contained entirely in the upper half of the unit interval, we
rescale using $E_2$ to obtain

$$l^{(3)} = 2 \times (0.5424 - 0.5) = 0.0848$$
$$u^{(3)} = 2 \times (0.54816 - 0.5) = 0.09632$$

We also shift out a bit from the tag and shift in the next bit. The tag is now 000110. The
interval is contained entirely in the lower half of the unit interval. Therefore, we apply $E_1$
and shift another bit. The lower and upper limits of the tag interval become

$$l^{(3)} = 2 \times 0.0848 = 0.1696$$
$$u^{(3)} = 2 \times 0.09632 = 0.19264$$

and the tag becomes 001100. The interval is still contained entirely in the lower half of
the unit interval, so we shift out another 0 to get a tag of 011000 and go through another
rescaling:

$$l^{(3)} = 2 \times 0.1696 = 0.3392$$
$$u^{(3)} = 2 \times 0.19264 = 0.38528$$

Because the interval containing the tag remains in the lower half of the unit interval, we
shift out another 0 from the tag to get 110000 and rescale one more time:

$$l^{(3)} = 2 \times 0.3392 = 0.6784$$
$$u^{(3)} = 2 \times 0.38528 = 0.77056$$

Now the interval containing the tag is contained entirely in the upper half of the unit
interval. Therefore, we shift out a 1 from the tag and rescale using the $E_2$ mapping:

$$l^{(3)} = 2 \times (0.6784 - 0.5) = 0.3568$$
$$u^{(3)} = 2 \times (0.77056 - 0.5) = 0.54112$$

Now we compare the tag value to the tag interval to decode our final element. The tag
is 100000, which corresponds to 0.5. This value lies in the first 80% of the interval, so we
decode this element as 1.
If the tag interval is entirely contained in the upper or lower half of the unit interval,
the scaling procedure described will prevent the interval from continually shrinking. Now
we consider the case where the diminishing tag interval straddles the midpoint of the unit
interval. As our trigger for rescaling, we check to see if the tag interval is contained in the
interval $[0.25, 0.75)$. This will happen when $l^{(n)}$ is greater than 0.25 and $u^{(n)}$ is less than
0.75. When this happens, we double the tag interval using the following mapping:

$$E_3 : [0.25, 0.75) \rightarrow [0, 1); \qquad E_3(x) = 2(x - 0.25) \qquad (4.26)$$
We have used a 1 to transmit information about an E2 mapping, and a 0 to transmit
information about an E1 mapping. How do we transfer information about an E3 mapping
to the decoder? We use a somewhat different strategy in this case. At the time of the E3
mapping, we do not send any information to the decoder; instead, we simply record the fact
that we have used the E3 mapping at the encoder. Suppose that after this, the tag interval gets
confined to the upper half of the unit interval. At this point we would use an E2 mapping
and send a 1 to the receiver. Note that the tag interval at this stage is at least twice what it
would have been if we had not used the E3 mapping. Furthermore, the upper limit of the
tag interval would have been less than 075. Therefore, if the E3 mapping had not taken
place right before the E2 mapping, the tag interval would have been contained entirely in
the lower half of the unit interval. At this point we would have used an E1 mapping and
transmitted a 0 to the receiver. In fact, the effect of the earlier E3 mapping can be mimicked
at the decoder by following the E2 mapping with an E1 mapping. At the encoder, right after
we send a 1 to announce the E2 mapping, we send a 0 to help the decoder track the changes
in the tag interval at the decoder. If the first rescaling after the E3 mapping happens to be an
E1 mapping, we do exactly the opposite. That is, we follow the 0 announcing an E1 mapping
with a 1 to mimic the effect of the E3 mapping at the encoder.
What happens if we have to go through a series of E3 mappings at the encoder? We
simply keep track of the number of E3 mappings and then send that many bits of the opposite
variety after the first E1 or E2 mapping. If we went through three E3 mappings at the encoder,
followed by an E2 mapping, we would transmit a 1 followed by three 0s. On the other hand,
if we went through an E1 mapping after the E3 mappings, we would transmit a 0 followed
by three 1s. Since the decoder mimics the encoder, the E3 mappings are also applied at the
decoder when the tag interval is contained in the interval 025075.
4.4.3 Integer Implementation
We have described a floating-point implementation of arithmetic coding. Let us now repeat
the procedure using integer arithmetic and generate the binary code in the process.
Encoder Implementation

The first thing we have to do is decide on the word length to be used. Given a word length
of $m$, we map the important values in the $[0, 1)$ interval to the range of $2^m$ binary words.
The point 0 gets mapped to

$$\underbrace{00 \ldots 0}_{m \text{ times}},$$

1 gets mapped to

$$\underbrace{11 \ldots 1}_{m \text{ times}},$$

and the value of 0.5 gets mapped to

$$1\underbrace{00 \ldots 0}_{m-1 \text{ times}}.$$

The update equations remain almost the same as Equations (4.9) and (4.10). As we are going
to do integer arithmetic, we need to replace $F_X(x)$ in these equations.

Define $n_j$ as the number of times the symbol $j$ occurs in a sequence of length
Total Count. Then $F_X(k)$ can be estimated by

$$F_X(k) = \frac{\sum_{i=1}^{k} n_i}{\text{Total Count}}. \qquad (4.27)$$
If we now define

$$\text{Cum\_Count}(k) = \sum_{i=1}^{k} n_i,$$

we can write Equations (4.9) and (4.10) as

$$l^{(n)} = l^{(n-1)} + \left\lfloor \frac{(u^{(n-1)} - l^{(n-1)} + 1) \times \text{Cum\_Count}(x_n - 1)}{\text{Total Count}} \right\rfloor \qquad (4.28)$$

$$u^{(n)} = l^{(n-1)} + \left\lfloor \frac{(u^{(n-1)} - l^{(n-1)} + 1) \times \text{Cum\_Count}(x_n)}{\text{Total Count}} \right\rfloor - 1 \qquad (4.29)$$

where $x_n$ is the $n$th symbol to be encoded, $\lfloor x \rfloor$ is the largest integer less than or equal
to $x$, and where the addition and subtraction of one is to handle the effects of the integer
arithmetic.
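As a small illustration of the integer arithmetic (using values that will reappear in Example 4.4.4 below, with $m = 8$ so the full range is 0–255), one update step of Equations (4.28) and (4.29) can be written as follows; the variable names are ours.

    cum_count = [0, 40, 41, 50]     # Cum_Count(0..3); Total Count = 50
    total = cum_count[-1]
    l, u = 0, 255                   # the full 8-bit range
    x = 1                           # encode the symbol 1
    width = u - l + 1
    u_new = l + (width * cum_count[x]) // total - 1   # -> 203
    l_new = l + (width * cum_count[x - 1]) // total   # -> 0
    print(l_new, u_new)             # 0 203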
Because of the way we mapped the endpoints and the halfway points of the unit interval,
when both $l^{(n)}$ and $u^{(n)}$ are in either the upper half or lower half of the interval, the leading
bit of $u^{(n)}$ and $l^{(n)}$ will be the same. If the leading or most significant bit (MSB) is 1, then
the tag interval is contained entirely in the upper half of the $[00\ldots0, 11\ldots1]$ interval. If
the MSB is 0, then the tag interval is contained entirely in the lower half. Applying the $E_1$
and $E_2$ mappings is a simple matter. All we do is shift out the MSB and then shift in a 1
into the integer code for $u^{(n)}$ and a 0 into the code for $l^{(n)}$. For example, suppose $m$ was 6,
$u^{(n)}$ was 54, and $l^{(n)}$ was 33. The binary representations of $u^{(n)}$ and $l^{(n)}$ are 110110 and
100001, respectively. Notice that the MSB for both endpoints is 1. Following the procedure
above, we would shift out (and transmit or store) the 1, and shift in 1 for $u^{(n)}$ and 0 for $l^{(n)}$,
obtaining the new value for $u^{(n)}$ as 101101, or 45, and a new value for $l^{(n)}$ as 000010, or 2.
This is equivalent to performing the $E_2$ mapping. We can see how the $E_1$ mapping would
also be performed using the same operation.
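In code, the shift-out/shift-in operation is just a left shift masked to $m$ bits. A tiny illustration (our own) of the numbers just discussed, with $m = 6$, $u^{(n)} = 54$, and $l^{(n)} = 33$:

    m = 6
    mask = (1 << m) - 1
    u, l = 54, 33                   # 110110 and 100001; both MSBs are 1
    sent_bit = u >> (m - 1)         # shift out the common MSB (here 1)
    u = ((u << 1) | 1) & mask       # shift in a 1 for u -> 101101 = 45
    l = (l << 1) & mask             # shift in a 0 for l -> 000010 = 2
    print(sent_bit, l, u)           # 1 2 45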
To see if the $E_3$ mapping needs to be performed, we monitor the second most significant
bit of $u^{(n)}$ and $l^{(n)}$. When the second most significant bit of $u^{(n)}$ is 0 and the second most
significant bit of $l^{(n)}$ is 1, this means that the tag interval lies in the middle half of the
$[00\ldots0, 11\ldots1]$ interval. To implement the $E_3$ mapping, we complement the second most
significant bit in $u^{(n)}$ and $l^{(n)}$, and shift left, shifting in a 1 in $u^{(n)}$ and a 0 in $l^{(n)}$. We also
keep track of the number of $E_3$ mappings in Scale3.
We can summarize the encoding algorithm using the following pseudocode:

    Initialize l and u.
    Get symbol.
    l ← l + ⌊(u − l + 1) × Cum_Count(x − 1) / Total Count⌋
    u ← l + ⌊(u − l + 1) × Cum_Count(x) / Total Count⌋ − 1
    while(MSB of u and l are both equal to b or E3 condition holds)
        if(MSB of u and l are both equal to b)
        {
            send b
            shift l to the left by 1 bit and shift 0 into LSB
            shift u to the left by 1 bit and shift 1 into LSB
            while(Scale3 > 0)
            {
                send complement of b
                decrement Scale3
            }
        }
        if(E3 condition holds)
        {
            shift l to the left by 1 bit and shift 0 into LSB
            shift u to the left by 1 bit and shift 1 into LSB
            complement (new) MSB of l and u
            increment Scale3
        }
To see how all this functions together, let’s look at an example.
Example 4.4.4:

We will encode the sequence 1 3 2 1 with parameters shown in Table 4.5. First we need to
select the word length $m$. Note that Cum_Count(1) and Cum_Count(2) differ by only 1.
Recall that the values of Cum_Count will get translated to the endpoints of the subintervals.
We want to make sure that the value we select for the word length will allow enough range
for it to be possible to represent the smallest difference between the endpoints of intervals.
We always rescale whenever the interval gets small. In order to make sure that the endpoints
of the intervals always remain distinct, we need to make sure that all values in the range
from 0 to Total_Count, which is the same as Cum_Count(3), are uniquely represented in
the smallest range an interval under consideration can be without triggering a rescaling. The
interval is smallest without triggering a rescaling when $l^{(n)}$ is just below the midpoint of the
interval and $u^{(n)}$ is at three-quarters of the interval, or when $u^{(n)}$ is right at the midpoint of
the interval and $l^{(n)}$ is just below a quarter of the interval. That is, the smallest the interval
$[l^{(n)}, u^{(n)}]$ can be is one-quarter of the total available range of $2^m$ values. Thus, $m$ should be
large enough to accommodate uniquely the set of values between 0 and Total_Count.

TABLE 4.5 Values of some of the parameters for arithmetic coding example.

    Count(1) = 40          Cum_Count(0) = 0      Scale3 = 0
    Count(2) = 1           Cum_Count(1) = 40
    Count(3) = 9           Cum_Count(2) = 41
    Total_Count = 50       Cum_Count(3) = 50

For this example, this means that the total interval range has to be greater than 200.
A value of $m = 8$ satisfies this requirement.

With this value of $m$ we have

$$l^{(0)} = 0 = (00000000)_2 \qquad (4.30)$$
$$u^{(0)} = 255 = (11111111)_2 \qquad (4.31)$$

where $(\cdot)_2$ is the binary representation of a number.
The first element of the sequence to be encoded is 1. Using Equations (4.28) and (4.29),

$$l^{(1)} = 0 + \left\lfloor \frac{256 \times \text{Cum\_Count}(0)}{50} \right\rfloor = 0 = (00000000)_2 \qquad (4.32)$$
$$u^{(1)} = 0 + \left\lfloor \frac{256 \times \text{Cum\_Count}(1)}{50} \right\rfloor - 1 = 203 = (11001011)_2 \qquad (4.33)$$

The next element of the sequence is 3.

$$l^{(2)} = 0 + \left\lfloor \frac{204 \times \text{Cum\_Count}(2)}{50} \right\rfloor = 167 = (10100111)_2 \qquad (4.34)$$
$$u^{(2)} = 0 + \left\lfloor \frac{204 \times \text{Cum\_Count}(3)}{50} \right\rfloor - 1 = 203 = (11001011)_2 \qquad (4.35)$$

The MSBs of $l^{(2)}$ and $u^{(2)}$ are both 1. Therefore, we shift this value out and send it to the
decoder. All other bits are shifted left by 1 bit, giving

$$l^{(2)} = (01001110)_2 = 78 \qquad (4.36)$$
$$u^{(2)} = (10010111)_2 = 151 \qquad (4.37)$$

Notice that while the MSBs of the limits are different, the second MSB of the upper limit is
0, while the second MSB of the lower limit is 1. This is the condition for the $E_3$ mapping.
We complement the second MSB of both limits and shift 1 bit to the left, shifting in a 0 as
the LSB of $l^{(2)}$ and a 1 as the LSB of $u^{(2)}$. This gives us

$$l^{(2)} = (00011100)_2 = 28 \qquad (4.38)$$
$$u^{(2)} = (10101111)_2 = 175 \qquad (4.39)$$

We also increment Scale3 to a value of 1.

The next element in the sequence is 2. Updating the limits, we have

$$l^{(3)} = 28 + \left\lfloor \frac{148 \times \text{Cum\_Count}(1)}{50} \right\rfloor = 146 = (10010010)_2 \qquad (4.40)$$
$$u^{(3)} = 28 + \left\lfloor \frac{148 \times \text{Cum\_Count}(2)}{50} \right\rfloor - 1 = 148 = (10010100)_2 \qquad (4.41)$$

The two MSBs are identical, so we shift out a 1 and shift left by 1 bit:

$$l^{(3)} = (00100100)_2 = 36 \qquad (4.42)$$
$$u^{(3)} = (00101001)_2 = 41 \qquad (4.43)$$

As Scale3 is 1, we transmit a 0 and decrement Scale3 to 0. The MSBs of the upper and
lower limits are both 0, so we shift out and transmit 0:

$$l^{(3)} = (01001000)_2 = 72 \qquad (4.44)$$
$$u^{(3)} = (01010011)_2 = 83 \qquad (4.45)$$

Both MSBs are again 0, so we shift out and transmit 0:

$$l^{(3)} = (10010000)_2 = 144 \qquad (4.46)$$
$$u^{(3)} = (10100111)_2 = 167 \qquad (4.47)$$

Now both MSBs are 1, so we shift out and transmit a 1. The limits become

$$l^{(3)} = (00100000)_2 = 32 \qquad (4.48)$$
$$u^{(3)} = (01001111)_2 = 79 \qquad (4.49)$$

Once again the MSBs are the same. This time we shift out and transmit a 0.

$$l^{(3)} = (01000000)_2 = 64 \qquad (4.50)$$
$$u^{(3)} = (10011111)_2 = 159 \qquad (4.51)$$

Now the MSBs are different. However, the second MSB for the lower limit is 1 while the
second MSB for the upper limit is 0. This is the condition for the $E_3$ mapping. Applying the
$E_3$ mapping by complementing the second MSB and shifting 1 bit to the left, we get

$$l^{(3)} = (00000000)_2 = 0 \qquad (4.52)$$
$$u^{(3)} = (10111111)_2 = 191 \qquad (4.53)$$

We also increment Scale3 to 1.

The next element in the sequence to be encoded is 1. Therefore,

$$l^{(4)} = 0 + \left\lfloor \frac{192 \times \text{Cum\_Count}(0)}{50} \right\rfloor = 0 = (00000000)_2 \qquad (4.54)$$
$$u^{(4)} = 0 + \left\lfloor \frac{192 \times \text{Cum\_Count}(1)}{50} \right\rfloor - 1 = 152 = (10011000)_2 \qquad (4.55)$$

The encoding continues in this fashion. To this point we have generated the binary sequence
1100010. If we wished to terminate the encoding at this point, we have to send the current
status of the tag. This can be done by sending the value of the lower limit $l^{(4)}$. As $l^{(4)}$ is
0, we will end up sending eight 0s. However, Scale3 at this point is 1. Therefore, after we
send the first 0 from the value of $l^{(4)}$, we need to send a 1 before sending the remaining
seven 0s. The final transmitted sequence is 1100010010000000.
Decoder Implementation
Once we have the encoder implementation, the decoder implementation is easy to describe.
As mentioned earlier, once we have started decoding all we have to do is mimic the encoder
algorithm. Let us first describe the decoder algorithm using pseudocode and then study its
implementation using Example 4.4.5.
Decoder Algorithm

    Initialize l and u.
    Read the first m bits of the received bitstream into tag t.
    k = 0
    while( ⌊((t − l + 1) × Total Count − 1) / (u − l + 1)⌋ ≥ Cum_Count(k) )
        k ← k + 1
    decode symbol x.
    l ← l + ⌊(u − l + 1) × Cum_Count(x − 1) / Total Count⌋
    u ← l + ⌊(u − l + 1) × Cum_Count(x) / Total Count⌋ − 1
    while(MSB of u and l are both equal to b or E3 condition holds)
        if(MSB of u and l are both equal to b)
        {
            shift l to the left by 1 bit and shift 0 into LSB
            shift u to the left by 1 bit and shift 1 into LSB
            shift t to the left by 1 bit and read next bit from received bitstream into LSB
        }
        if(E3 condition holds)
        {
            shift l to the left by 1 bit and shift 0 into LSB
            shift u to the left by 1 bit and shift 1 into LSB
            shift t to the left by 1 bit and read next bit from received bitstream into LSB
            complement (new) MSB of l, u, and t
        }
Example 4.4.5:

After encoding the sequence in Example 4.4.4, we ended up with the following binary
sequence: 1100010010000000. Treating this as the received sequence and using the param-
eters from Table 4.5, let us decode this sequence. Using the same word length, eight, we
read in the first 8 bits of the received sequence to form the tag $t$:

$$t = (11000100)_2 = 196$$

We initialize the lower and upper limits as

$$l = (00000000)_2 = 0$$
$$u = (11111111)_2 = 255$$
To begin decoding, we compute

$$\left\lfloor \frac{(t - l + 1) \times \text{Total Count} - 1}{u - l + 1} \right\rfloor = \left\lfloor \frac{197 \times 50 - 1}{255 - 0 + 1} \right\rfloor = 38$$

and compare this value to

$$\text{Cum\_Count} = \begin{bmatrix} 0 \\ 40 \\ 41 \\ 50 \end{bmatrix}$$

Since

$$0 \le 38 < 40,$$

we decode the first symbol as 1. Once we have decoded a symbol, we update the lower and
upper limits:

$$l = 0 + \left\lfloor \frac{256 \times \text{Cum\_Count}(0)}{\text{Total Count}} \right\rfloor = 0 + \left\lfloor 256 \times \frac{0}{50} \right\rfloor = 0$$
$$u = 0 + \left\lfloor \frac{256 \times \text{Cum\_Count}(1)}{\text{Total Count}} \right\rfloor - 1 = 0 + \left\lfloor 256 \times \frac{40}{50} \right\rfloor - 1 = 203$$

or

$$l = (00000000)_2$$
$$u = (11001011)_2$$

The MSBs of the limits are different and the $E_3$ condition does not hold. Therefore, we
continue decoding without modifying the tag value. To obtain the next symbol, we compare

$$\left\lfloor \frac{(t - l + 1) \times \text{Total Count} - 1}{u - l + 1} \right\rfloor,$$

which is 48, against the Cum_Count array:

$$\text{Cum\_Count}(2) \le 48 < \text{Cum\_Count}(3).$$

Therefore, we decode 3 and update the limits:

$$l = 0 + \left\lfloor \frac{204 \times \text{Cum\_Count}(2)}{\text{Total Count}} \right\rfloor = 0 + \left\lfloor 204 \times \frac{41}{50} \right\rfloor = 167 = (10100111)_2$$
$$u = 0 + \left\lfloor \frac{204 \times \text{Cum\_Count}(3)}{\text{Total Count}} \right\rfloor - 1 = 0 + \left\lfloor 204 \times \frac{50}{50} \right\rfloor - 1 = 203 = (11001011)_2$$
As the MSB of $u$ and $l$ are the same, we shift the MSB out and read in a 0 for the LSB of
$l$ and a 1 for the LSB of $u$. We mimic this action for the tag as well, shifting the MSB out
and reading in the next bit from the received bitstream as the LSB:

$$l = (01001110)_2$$
$$u = (10010111)_2$$
$$t = (10001001)_2$$

Examining $l$ and $u$ we can see we have an $E_3$ condition. Therefore, for $l$, $u$, and $t$, we
shift the MSB out, complement the new MSB, and read in a 0 as the LSB of $l$, a 1 as the
LSB of $u$, and the next bit in the received bitstream as the LSB of $t$. We now have

$$l = (00011100)_2 = 28$$
$$u = (10101111)_2 = 175$$
$$t = (10010010)_2 = 146$$

To decode the next symbol, we compute

$$\left\lfloor \frac{(t - l + 1) \times \text{Total Count} - 1}{u - l + 1} \right\rfloor = 40.$$

Since $40 \le 40 < 41$, we decode 2.

Updating the limits using this decoded symbol, we get

$$l = 28 + \left\lfloor \frac{(175 - 28 + 1) \times 40}{50} \right\rfloor = 146 = (10010010)_2$$
$$u = 28 + \left\lfloor \frac{(175 - 28 + 1) \times 41}{50} \right\rfloor - 1 = 148 = (10010100)_2$$
We can see that we have quite a few bits to shift out. However, notice that the lower limit
l has the same value as the tag t. Furthermore, the remaining received sequence consists
entirely of 0s. Therefore, we will be performing identical operations on numbers that are the
same, resulting in identical numbers. This will result in the final decoded symbol being 1.
We knew this was the final symbol to be decoded because only four symbols had been
encoded. In practice this information has to be conveyed to the decoder.
4.5 Comparison of Huffman and Arithmetic Coding
We have described a new coding scheme that, although more complicated than Huffman
coding, allows us to code sequences of symbols. How well this coding scheme works depends
on how it is used. Let’s first try to use this code for encoding sources for which we know
the Huffman code.
Looking at Example 4.4.1, the average length for this code is

$$l = 2 \times 0.5 + 3 \times 0.25 + 4 \times 0.125 + 4 \times 0.125 \qquad (4.56)$$
$$= 2.75 \text{ bits/symbol} \qquad (4.57)$$
Recall from Section 2.4 that the entropy of this source was 1.75 bits/symbol and the Huffman
code achieved this entropy. Obviously, arithmetic coding is not a good idea if you are going
to encode your message one symbol at a time. Let’s repeat the example with messages
consisting of two symbols. (Note that we are only doing this to demonstrate a point. In
practice, we would not code sequences this short using an arithmetic code.)
Example 4.5.1:

If we encode two symbols at a time, the resulting code is shown in Table 4.6.

TABLE 4.6 Arithmetic code for two-symbol sequences.

    Message   $P(\mathbf{x})$   $\bar{T}_X(\mathbf{x})$   $\bar{T}_X(\mathbf{x})$ in Binary   $\lceil \log \frac{1}{P(\mathbf{x})} \rceil + 1$   Code
    11        .25       .125        .001        3     001
    12        .125      .3125       .0101       4     0101
    13        .0625     .40625      .01101      5     01101
    14        .0625     .46875      .01111      5     01111
    21        .125      .5625       .1001       4     1001
    22        .0625     .65625      .10101      5     10101
    23        .03125    .703125     .101101     6     101101
    24        .03125    .734375     .101111     6     101111
    31        .0625     .78125      .11001      5     11001
    32        .03125    .828125     .110101     6     110101
    33        .015625   .8515625    .1101101    7     1101101
    34        .015625   .8671875    .1101111    7     1101111
    41        .0625     .90625      .11101      5     11101
    42        .03125    .953125     .111101     6     111101
    43        .015625   .9765625    .1111101    7     1111101
    44        .015625   .9921875    .1111111    7     1111111

The average length per message is 4.5 bits. Therefore, using two symbols at a time we
get a rate of 2.25 bits/symbol (certainly better than 2.75 bits/symbol, but still not as good
as the best rate of 1.75 bits/symbol). However, we see that as we increase the number of
symbols per message, our results get better and better.
How many samples do we have to group together to make the arithmetic coding scheme
perform better than the Huffman coding scheme? We can get some idea by looking at the
bounds on the coding rate.
Recall that the bounds on the average length $l_A$ of the arithmetic code are

$$H(X) \le l_A \le H(X) + \frac{2}{m}.$$

It does not take many symbols in a sequence before the coding rate for the arithmetic code
becomes quite close to the entropy. However, recall that for Huffman codes, if we block $m$
symbols together, the coding rate is

$$H(X) \le l_H \le H(X) + \frac{1}{m}.$$
The advantage seems to lie with the Huffman code, although the advantage decreases
with increasing m. However, remember that to generate a codeword for a sequence of length
m, using the Huffman procedure requires building the entire code for all possible sequences
of length $m$. If the original alphabet size was $k$, then the size of the codebook would be $k^m$.
Taking relatively reasonable values of $k = 16$ and $m = 20$ gives a codebook size of $16^{20}$! This
is obviously not a viable option. For the arithmetic coding procedure, we do not need to build
the entire codebook. Instead, we simply obtain the code for the tag corresponding to a given
sequence. Therefore, it is entirely feasible to code sequences of length 20 or much more. In
practice, we can make m large for the arithmetic coder and not for the Huffman coder. This
means that for most sources we can get rates closer to the entropy using arithmetic coding
than by using Huffman coding. The exceptions are sources whose probabilities are powers
of two. In these cases, the single-letter Huffman code achieves the entropy, and we cannot
do any better with arithmetic coding, no matter how long a sequence we pick.
The amount of gain also depends on the source. Recall that for Huffman codes we are
guaranteed to obtain rates within $0.086 + p_{\max}$ of the entropy, where $p_{\max}$ is the probability
of the most probable letter in the alphabet. If the alphabet size is relatively large and the
probabilities are not too skewed, the maximum probability pmax is generally small. In these
cases, the advantage of arithmetic coding over Huffman coding is small, and it might not be
worth the extra complexity to use arithmetic coding rather than Huffman coding. However,
there are many sources, such as facsimile, in which the alphabet size is small, and the
probabilities are highly unbalanced. In these cases, the use of arithmetic coding is generally
worth the added complexity.
Another major advantage of arithmetic coding is that it is easy to implement a system with
multiple arithmetic codes. This may seem contradictory, as we have claimed that arithmetic
coding is more complex than Huffman coding. However, it is the computational machinery
that causes the increase in complexity. Once we have the computational machinery to
implement one arithmetic code, all we need to implement more than a single arithmetic code
is the availability of more probability tables. If the alphabet size of the source is small, as in
the case of a binary source, there is very little added complexity indeed. In fact, as we shall see
in the next section, it is possible to develop multiplication-free arithmetic coders that are quite
simple to implement (nonbinary multiplication-free arithmetic coders are described in [44]).
Finally, it is much easier to adapt arithmetic codes to changing input statistics. All we
need to do is estimate the probabilities of the input alphabet. This can be done by keeping a
count of the letters as they are coded. There is no need to preserve a tree, as with adaptive
Huffman codes. Furthermore, there is no need to generate a code a priori, as in the case of
Huffman coding. This property allows us to separate the modeling and coding procedures
in a manner that is not very feasible with Huffman coding. This separation permits greater
flexibility in the design of compression systems, which can be used to great advantage.
4.6 Adaptive Arithmetic Coding
We have seen how to construct arithmetic coders when the distribution of the source, in the
form of cumulative counts, is available. In many applications such counts are not available
a priori. It is a relatively simple task to modify the algorithms discussed so that the coder
learns the distribution as the coding progresses. A straightforward implementation is to start
out with a count of 1 for each letter in the alphabet. We need a count of at least 1 for each
symbol, because if we do not we will have no way of encoding the symbol when it is first
encountered. This assumes that we know nothing about the distribution of the source. If we
do know something about the distribution of the source, we can let the initial counts reflect
our knowledge.
After coding is initiated, the count for each letter encountered is incremented after
that letter has been encoded. The cumulative count table is updated accordingly. It is very
important that the updating take place after the encoding; otherwise the decoder will not
be using the same cumulative count table as the encoder to perform the decoding. At the
decoder, the count and cumulative count tables are updated after each letter is decoded.
In the case of the static arithmetic code, we picked the size of the word based on Total
Count, the total number of symbols to be encoded. In the adaptive case, we may not know
ahead of time what the total number of symbols is going to be. In this case we have to pick
the word length independent of the total count. However, given a word length $m$ we know
that we can only accommodate a total count of $2^{m-2}$ or less. Therefore, during the encoding and
decoding processes, when the total count approaches $2^{m-2}$ we have to go through a rescaling,
or renormalization, operation. A simple rescaling operation is to divide all counts by 2,
rounding up the result so that no count gets rescaled to zero. This periodic rescaling can
have an added benefit in that the count table better reflects the local statistics of the source.
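A sketch of the count update for an adaptive coder (our own illustration, not the book's program): every letter starts with a count of 1, the count of a letter is incremented after that letter has been coded, and when the total approaches the limit imposed by the word length the counts are halved, rounding up so that no count becomes zero.

    def update_model(counts, symbol, m):
        """Update adaptive counts after coding `symbol`; the encoder and
        decoder must call this at the same point so their tables agree."""
        counts[symbol] += 1
        if sum(counts) >= (1 << (m - 2)):                 # total count limit 2^(m-2)
            counts[:] = [(c + 1) // 2 for c in counts]    # halve, rounding up
        return counts

    counts = [1] * 256          # one initial count per letter of a byte alphabet
    counts = update_model(counts, 65, m=16)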
4.7 Applications
Arithmetic coding is used in a variety of lossless and lossy compression applications.
It is a part of many international standards. In the area of multimedia there are a few
principal organizations that develop standards. The International Standards Organization
(ISO) and the International Electrotechnical Commission (IEC) are industry groups that work
on multimedia standards, while the International Telecommunications Union (ITU), which
is part of the United Nations, works on multimedia standards on behalf of the member states
of the United Nations. Quite often these institutions work together to create international
standards. In later chapters we will be looking at a number of these standards, and we will
see how arithmetic coding is used in image compression, audio compression, and video
compression standards.
For now let us look at the lossless compression example from the previous chapter.
TABLE 4.7 Compression using adaptive arithmetic coding of pixel values.

                              Total Size   Compression Ratio   Compression Ratio
    Image Name   Bits/Pixel   (bytes)      (arithmetic)        (Huffman)
    Sena         6.52         53,431       1.23                1.16
    Sensin       7.12         58,306       1.12                1.27
    Earth        4.67         38,248       1.71                1.67
    Omaha        6.84         56,061       1.17                1.14

TABLE 4.8 Compression using adaptive arithmetic coding of pixel differences.

                              Total Size   Compression Ratio   Compression Ratio
    Image Name   Bits/Pixel   (bytes)      (arithmetic)        (Huffman)
    Sena         3.89         31,847       2.06                2.08
    Sensin       4.56         37,387       1.75                1.73
    Earth        3.92         32,137       2.04                2.04
    Omaha        6.27         51,393       1.28                1.26

In Tables 4.7 and 4.8, we show the results of using adaptive arithmetic coding to
encode the same test images that were previously encoded using Huffman coding. We have
included the compression ratios obtained using Huffman code from the previous chapter
for comparison. Comparing these values to those obtained in the previous chapter, we can
see very little change. The reason is that because the alphabet size for the images is quite
large, the value of $p_{\max}$ is quite small, and the Huffman coder performs very close to the
entropy.
As we mentioned before, a major advantage of arithmetic coding over Huffman coding
is the ability to separate the modeling and coding aspects of the compression approach. In
terms of image coding, this allows us to use a number of different models that take advantage
of local properties. For example, we could use different decorrelation strategies in regions
of the image that are quasi-constant and will, therefore, have differences that are small, and
in regions where there is a lot of activity, causing the presence of larger difference values.
4.8 Summary
In this chapter we introduced the basic ideas behind arithmetic coding. We have shown
that the arithmetic code is a uniquely decodable code that provides a rate close to the
entropy for long stationary sequences. This ability to encode sequences directly instead of
as a concatenation of the codes for the elements of the sequence makes this approach more
efficient than Huffman coding for alphabets with highly skewed probabilities. We have
looked in some detail at the implementation of the arithmetic coding approach.
The arithmetic coding results in this chapter were obtained by using the program provided
by Witten, Neal, and Cleary [45]. This code can be used (with some modifications) for
exploring different aspects of arithmetic coding (see problems).
Further Reading
1. The book Text Compression, by T.C. Bell, J.G. Cleary, and I.H. Witten [1], contains
a very readable section on arithmetic coding, complete with pseudocode and C code.
2. A thorough treatment of various aspects of arithmetic coding can be found in the
excellent chapter Arithmetic Coding, by Amir Said [46] in the Lossless Compression
Handbook.
3. There is an excellent tutorial article by G.G. Langdon, Jr. [47] in the March 1984
issue of the IBM Journal of Research and Development.
4. The separate model and code paradigm is explored in a precise manner in the context
of arithmetic coding in a paper by J.J. Rissanen and G.G. Langdon [48].
5. The separation of modeling and coding is exploited in a very nice manner in an early
paper by G.G. Langdon and J.J. Rissanen [49].
6. Various models for text compression that can be used effectively with arithmetic
coding are described by T.C. Bell, I.H. Witten, and J.G. Cleary [50] in an article in
the ACM Computing Surveys.
7. The coder used in the JBIG algorithm is a descendant of the Q coder, described in
some detail in several papers [51, 52, 53] in the November 1988 issue of the IBM
Journal of Research and Development.
4.9 Projects and Problems
1. Given a number $a$ in the interval $[0, 1)$ with an $n$-bit binary representation $[b_1 b_2 \ldots b_n]$,
   show that for any other number $b$ to have a binary representation with $[b_1 b_2 \ldots b_n]$ as
   the prefix, $b$ has to lie in the interval $[a, a + \frac{1}{2^n})$.
2. The binary arithmetic coding approach specified in the JBIG standard can be used for
coding gray-scale images via bit plane encoding. In bit plane encoding, we combine
the most significant bits for each pixel into one bit plane, the next most significant
bits into another bit plane, and so on. Use the function extrctbp to obtain eight
bit planes for the sena.img and omaha.img test images, and encode them using
arithmetic coding. Use the low-resolution contexts shown in Figure 7.11.
3. Bit plane encoding is more effective when the pixels are encoded using a Gray
   code. The Gray code assigns numerically adjacent values binary codes that differ by
   only 1 bit. To convert from the standard binary code $b_0 b_1 b_2 \ldots b_7$ to the Gray code
   $g_0 g_1 g_2 \ldots g_7$, we can use the equations

   $$g_0 = b_0$$
   $$g_k = b_k \oplus b_{k-1}$$

   Convert the test images sena.img and omaha.img to a Gray code representation,
   and bit plane encode. Compare with the results for the non-Gray-coded representation.
TABLE 4.9 Probability model for Problems 5 and 6.

    Letter   Probability
    a_1      .2
    a_2      .3
    a_3      .5

TABLE 4.10 Frequency counts for Problem 7.

    Letter   Count
    a        37
    b        38
    c        25
4. In Example 4.4.4, repeat the encoding using m = 6. Comment on your results.
5. Given the probability model in Table 4.9, find the real-valued tag for the sequence
   $a_1\ a_1\ a_3\ a_2\ a_3\ a_1$.
6. For the probability model in Table 4.9, decode a sequence of length 10 with the tag
   0.63215699.
7. Given the frequency counts shown in Table 4.10:
(a) What is the word length required for unambiguous encoding?
(b) Find the binary code for the sequence abacabb.
(c) Decode the code you obtained to verify that your encoding was correct.
8. Generate a binary sequence of length L with $P(0) = 0.8$, and use the arithmetic coding
algorithm to encode it. Plot the difference of the rate in bits/symbol and the entropy
as a function of L. Comment on the effect of L on the rate.
5
Dictionary Techniques
5.1 Overview
In the previous two chapters we looked at coding techniques that assume a
source that generates a sequence of independent symbols. As most sources are
correlated to start with, the coding step is generally preceded by a decorrelation
step. In this chapter we will look at techniques that incorporate the structure in
the data in order to increase the amount of compression. These techniques—
both static and adaptive (or dynamic)—build a list of commonly occurring patterns and
encode these patterns by transmitting their index in the list. They are most useful with sources
that generate a relatively small number of patterns quite frequently, such as text sources and
computer commands. We discuss applications to text compression, modem communications,
and image compression.
5.2 Introduction
In many applications, the output of the source consists of recurring patterns. A classic
example is a text source in which certain patterns or words recur constantly. Also, there are
certain patterns that simply do not occur, or if they do, occur with great rarity. For example,
we can be reasonably sure that the word Limpopo¹ occurs in a very small fraction of the
text sources in existence.
A very reasonable approach to encoding such sources is to keep a list, or dictionary,
of frequently occurring patterns. When these patterns appear in the source output, they are
encoded with a reference to the dictionary. If the pattern does not appear in the dictionary,
then it can be encoded using some other, less efficient, method. In effect we are splitting
¹ “How the Elephant Got Its Trunk” in Just So Stories by Rudyard Kipling.
the input into two classes, frequently occurring patterns and infrequently occurring patterns.
For this technique to be effective, the class of frequently occurring patterns, and hence the
size of the dictionary, must be much smaller than the number of all possible patterns.
Suppose we have a particular text that consists of four-character words, three characters
from the 26 lowercase letters of the English alphabet followed by a punctuation mark.
Suppose our source alphabet consists of the 26 lowercase letters of the English alphabet and
the punctuation marks comma, period, exclamation mark, question mark, semicolon, and
colon. In other words, the size of the input alphabet is 32. If we were to encode the text
source one character at a time, treating each character as an equally likely event, we would
need 5 bits per character. Treating all $32^4$ ($= 2^{20} = 1{,}048{,}576$) four-character patterns as
equally likely, we have a code that assigns 20 bits to each four-character pattern. Let us now
put the 256 most likely four-character patterns into a dictionary. The transmission scheme
works as follows: Whenever we want to send a pattern that exists in the dictionary, we will
send a 1-bit flag, say, a 0, followed by an 8-bit index corresponding to the entry in the
dictionary. If the pattern is not in the dictionary, we will send a 1 followed by the 20-bit
encoding of the pattern. If the pattern we encounter is not in the dictionary, we will actually
use more bits than in the original scheme, 21 instead of 20. But if it is in the dictionary, we
will send only 9 bits. The utility of our scheme will depend on the percentage of the words
we encounter that are in the dictionary. We can get an idea about the utility of our scheme
by calculating the average number of bits per pattern. If the probability of encountering a
pattern from the dictionary is $p$, then the average number of bits per pattern $R$ is given by

$$R = 9p + 21(1 - p) = 21 - 12p. \qquad (5.1)$$

For our scheme to be useful, $R$ should have a value less than 20. This happens when
$p \ge 0.084$. This does not seem like a very large number. However, note that if all patterns
were occurring in an equally likely manner, the probability of encountering a pattern from
the dictionary would be less than 0.00025!
We do not simply want a coding scheme that performs slightly better than the simple-
minded approach of coding each pattern as equally likely; we would like to improve the
performance as much as possible. In order for this to happen, p should be as large as possible.
This means that we should carefully select patterns that are most likely to occur as entries
in the dictionary. To do this, we have to have a pretty good idea about the structure of the
source output. If we do not have information of this sort available to us prior to the encoding
of a particular source output, we need to acquire this information somehow when we are
encoding. If we feel we have sufficient prior knowledge, we can use a static approach; if not,
we can take an adaptive approach. We will look at both these approaches in this chapter.
5.3 Static Dictionary
Choosing a static dictionary technique is most appropriate when considerable prior knowl-
edge about the source is available. This technique is especially suitable for use in specific
applications. For example, if the task were to compress the student records at a university, a
static dictionary approach may be the best. This is because we know ahead of time that cer-
tain words such as “Name” and “Student ID” are going to appear in almost all of the records.
Other words such as “Sophomore,” “credits,” and so on will occur quite often. Depending
on the location of the university, certain digits in social security numbers are more likely to
occur. For example, in Nebraska most student ID numbers begin with the digits 505. In fact,
most entries will be of a recurring nature. In this situation, it is highly efficient to design a
compression scheme based on a static dictionary containing the recurring patterns. Similarly,
there could be a number of other situations in which an application-specific or data-specific
static-dictionary-based coding scheme would be the most efficient. It should be noted that
these schemes would work well only for the applications and data they were designed for.
If these schemes were to be used with different applications, they may cause an expansion
of the data instead of compression.
A static dictionary technique that is less specific to a single application is digram coding.
We describe this in the next section.
5.3.1 Digram Coding
One of the more common forms of static dictionary coding is digram coding. In this form
of coding, the dictionary consists of all letters of the source alphabet followed by as many
pairs of letters, called digrams, as can be accommodated by the dictionary. For example,
suppose we were to construct a dictionary of size 256 for digram coding of all printable
ASCII characters. The first 95 entries of the dictionary would be the 95 printable ASCII
characters. The remaining 161 entries would be the most frequently used pairs of characters.
The digram encoder reads a two-character input and searches the dictionary to see if this
input exists in the dictionary. If it does, the corresponding index is encoded and transmitted.
If it does not, the first character of the pair is encoded. The second character in the pair
then becomes the first character of the next digram. The encoder reads another character to
complete the digram, and the search procedure is repeated.
Example 5.3.1:
Suppose we have a source with a five-letter alphabet $\mathcal{A} = \{a, b, c, d, r\}$. Based on knowledge
about the source, we build the dictionary shown in Table 5.1.
TABLE 5.1 A sample dictionary.
Code Entry Code Entry
000 a 100 r
001 b 101 ab
010 c 110 ac
011 d 111 ad
Suppose we wish to encode the sequence
abracadabra
The encoder reads the first two characters ab and checks to see if this pair of letters exists
in the dictionary. It does and is encoded using the codeword 101. The encoder then reads
the next two characters ra and checks to see if this pair occurs in the dictionary. It does not,
so the encoder sends out the code for r, which is 100, then reads in one more character, c,
to make the two-character pattern ac. This does exist in the dictionary and is encoded as
110. Continuing in this fashion, the remainder of the sequence is coded. The output string
for the given input sequence is 101100110111101100000.
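The digram encoder is only a few lines of code. The sketch below is our own illustration, with the dictionary of Table 5.1 written as a Python mapping; it reproduces the output string obtained above for abracadabra.

    def digram_encode(text, dictionary):
        """Encode text with a digram dictionary: try the next two characters
        first; if the pair is absent, emit the code for the single character."""
        out = []
        i = 0
        while i < len(text):
            pair = text[i:i + 2]
            if len(pair) == 2 and pair in dictionary:
                out.append(dictionary[pair])
                i += 2
            else:
                out.append(dictionary[text[i]])
                i += 1
        return "".join(out)

    codes = {"a": "000", "b": "001", "c": "010", "d": "011", "r": "100",
             "ab": "101", "ac": "110", "ad": "111"}       # Table 5.1
    print(digram_encode("abracadabra", codes))   # 101100110111101100000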
TABLE 5.2 Thirty most frequently occurring pairs of characters
          in a 41,364-character-long LaTeX document.

    Pair   Count        Pair   Count
    e/b    1128         ar     314
    /bt     838         at     313
    /b/b    823         /bw    309
    th      817         te     296
    he      712         /bs    295
    in      512         d/b    272
    s/b     494         /bo    266
    er      433         io     257
    /ba     425         co     256
    t/b     401         re     247
    en      392         /b$    246
    on      385         r/b    239
    n/b     353         di     230
    ti      322         ic     229
    /bi     317         ct     226

TABLE 5.3 Thirty most frequently occurring pairs of characters in a
          collection of C programs containing 64,983 characters.

    Pair   Count        Pair   Count
    /b/b   5728         st     442
    nl/b   1471         le     440
    nl     1133         ut     440
    in      985         f      416
    nt      739         ar     381
    = /b    687         or     374
    /bi     662         r/b    373
    t/b     615         en     371
    /b =    612         er     358
            558         ri     357
    /b      554         at     352
    nlnl    506         pr     351
    /bf     505         te     349
    e/b     500         an     348
    /b*     444         lo     347
A list of the 30 most frequently occurring pairs of characters in an earlier version of this
chapter is shown in Table 5.2. For comparison, the 30 most frequently occurring pairs of
characters in a set of C programs is shown in Table 5.3.
In these tables, /b corresponds to a space and nl corresponds to a new line. Notice how
different the two tables are. It is easy to see that a dictionary designed for compressing LaTeX
documents would not work very well when compressing C programs. However, generally
we want techniques that will be able to compress a variety of source outputs. If we wanted
to compress computer files, we do not want to change techniques based on the content of
the file. Rather, we would like the technique to adapt to the characteristics of the source
output. We discuss adaptive-dictionary-based techniques in the next section.
5.4 Adaptive Dictionary
Most adaptive-dictionary-based techniques have their roots in two landmark papers by
Jacob Ziv and Abraham Lempel in 1977 [54] and 1978 [55]. These papers provide two
different approaches to adaptively building dictionaries, and each approach has given rise
to a number of variations. The approaches based on the 1977 paper are said to belong to
the LZ77 family (also known as LZ1), while the approaches based on the 1978 paper are
said to belong to the LZ78, or LZ2, family. The transposition of the initials is a historical
accident and is a convention we will observe in this book. In the following sections, we first
describe an implementation of each approach followed by some of the more well-known
variations.
5.4.1 The LZ77 Approach
In the LZ77 approach, the dictionary is simply a portion of the previously encoded sequence.
The encoder examines the input sequence through a sliding window as shown in Figure 5.1.
The window consists of two parts, a search buffer that contains a portion of the recently
encoded sequence, and a look-ahead buffer that contains the next portion of the sequence to
be encoded. In Figure 5.1, the search buffer contains eight symbols, while the look-ahead
buffer contains seven symbols. In practice, the sizes of the buffers are significantly larger;
however, for the purpose of explanation, we will keep the buffer sizes small.
To encode the sequence in the look-ahead buffer, the encoder moves a search pointer back
through the search buffer until it encounters a match to the first symbol in the look-ahead
FIGURE 5.1 Encoding using the LZ77 approach. (The figure shows the sliding window over the sequence: a search buffer of eight symbols containing the match pointer, followed by a look-ahead buffer of seven symbols.)
buffer. The distance of the pointer from the look-ahead buffer is called the offset. The encoder
then examines the symbols following the symbol at the pointer location to see if they match
consecutive symbols in the look-ahead buffer. The number of consecutive symbols in the
search buffer that match consecutive symbols in the look-ahead buffer, starting with the
first symbol, is called the length of the match. The encoder searches the search buffer for
the longest match. Once the longest match has been found, the encoder encodes it with a
triple $\langle o, l, c \rangle$, where $o$ is the offset, $l$ is the length of the match, and $c$ is the codeword
corresponding to the symbol in the look-ahead buffer that follows the match. For example,
in Figure 5.1 the pointer is pointing to the beginning of the longest match. The offset $o$
in this case is 7, the length of the match $l$ is 4, and the third element of the triple is the
code for the symbol in the look-ahead buffer that follows the match.
The reason for sending the third element in the triple is to take care of the situation
where no match for the symbol in the look-ahead buffer can be found in the search buffer.
In this case, the offset and match-length values are set to 0, and the third element of the
triple is the code for the symbol itself.
If the size of the search buffer is $S$, the size of the window (search and look-ahead
buffers) is $W$, and the size of the source alphabet is $A$, then the number of bits needed to code
the triple using fixed-length codes is $\lceil \log_2 S \rceil + \lceil \log_2 W \rceil + \lceil \log_2 A \rceil$. Notice that the second
term is $\lceil \log_2 W \rceil$, not $\lceil \log_2 S \rceil$. The reason for this is that the length of the match can actually
exceed the length of the search buffer. We will see how this happens in Example 5.4.1.
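The heart of an LZ77 encoder is the search for the longest match. A sketch of that search is given below (our own illustration, not the algorithm's original code); note that the comparison is allowed to run past the start of the look-ahead buffer, which is how the match length can exceed the offset. For the sequence of Example 5.4.1 below, with the look-ahead buffer rarrad starting at position 13 and a search buffer of seven symbols, it returns the offset 3 and length 5 used in the example's last triple.

    def longest_match(data, pos, search_size, lookahead_size):
        """Find the longest match for data[pos:pos+lookahead_size] that starts
        in the search buffer; the match itself may extend past `pos`.
        Returns (offset, length), with offset counted back from `pos`."""
        end = min(pos + lookahead_size, len(data))
        best_offset, best_length = 0, 0
        for start in range(max(0, pos - search_size), pos):
            length = 0
            while pos + length < end and data[start + length] == data[pos + length]:
                length += 1
            if length > best_length:
                best_offset, best_length = pos - start, length
        return best_offset, best_length

    print(longest_match("cabracadabrarrarrad", 13, 7, 6))   # (3, 5)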
In the following example, we will look at three different possibilities that may be
encountered during the coding process:
1. There is no match for the next character to be encoded in the window.
2. There is a match.
3. The matched string extends inside the look-ahead buffer.
Example 5.4.1: The LZ77 approach
Suppose the sequence to be encoded is
cabracadabrarrarrad
Suppose the length of the window is 13, the size of the look-ahead buffer is six, and the
current condition is as follows:
cabraca dabrar
with dabrar in the look-ahead buffer. We look back in the already encoded portion of the
window to find a match for d. As we can see, there is no match, so we transmit the triple
$\langle 0, 0, C(d) \rangle$. The first two elements of the triple show that there is no match to d in the
search buffer, while $C(d)$ is the code for the character d. This seems like a wasteful way to
encode a single character, and we will have more to say about this later.
For now, let’s continue with the encoding process. As we have encoded a single character,
we move the window by one character. Now the contents of the buffer are
abracad abrarr
with abrarr in the look-ahead buffer. Looking back from the current location, we find a
match to a at an offset of two. The length of this match is one. Looking further back, we
have another match for a at an offset of four; again the length of the match is one. Looking
back even further in the window, we have a third match for a at an offset of seven. However,
this time the length of the match is four (see Figure 5.2). So we encode the string abra with
the triple ⟨7, 4, C(r)⟩, and move the window forward by five characters. The window now
contains the following characters:
adabrar rarrad
Now the look-ahead buffer contains the string rarrad. Looking back in the window, we find
a match for r at an offset of one and a match length of one, and a second match at an offset
of three with a match length of what at first appears to be three. It turns out we can use a
match length of five instead of three.
Figure 5.2 The encoding process. [Figure shows the search pointer at the start of the match abra in the search buffer, with offset o = 7 and match length l = 4.]
Why this is so will become clearer when we decode the sequence. To see how the
decoding works, let us assume that we have decoded the sequence cabraca and we receive
the triples ⟨0, 0, C(d)⟩, ⟨7, 4, C(r)⟩, and ⟨3, 5, C(d)⟩. The first triple is easy to decode; there
was no match within the previously decoded string, and the next symbol is d. The decoded
string is now cabracad. The first element of the next triple tells the decoder to move the
copy pointer back seven characters, and copy four characters from that point. The decoding
process works as shown in Figure 5.3.
Finally, let’s see how the triple ⟨3, 5, C(d)⟩ gets decoded. We move back three characters
and start copying. The first three characters we copy are rar. The copy pointer moves once
again, as shown in Figure 5.4, to copy the recently copied character r. Similarly, we copy
the next character a. Even though we started copying only three characters back, we end
up decoding five characters. Notice that the match only has to start in the search buffer; it
can extend into the look-ahead buffer. In fact, if the last character in the look-ahead buffer
Figure 5.3 Decoding of the triple ⟨7, 4, C(r)⟩. [Figure shows the decoder moving the copy pointer back seven characters and then copying four characters one at a time (Copy 1 through Copy 4), before decoding C(r).]
Figure 5.4 Decoding the triple ⟨3, 5, C(d)⟩. [Figure shows the decoder moving the copy pointer back three characters and copying five characters (Copy 1 through Copy 5), the copy extending into the portion being decoded, before decoding C(d).]
had been r instead of d, followed by several more repetitions of rar, the entire sequence of
repeated rars could have been encoded with a single triple. 
As we can see, the LZ77 scheme is a very simple adaptive scheme that requires no
prior knowledge of the source and seems to require no assumptions about the characteristics
of the source. The authors of this algorithm showed that asymptotically the performance
of this algorithm approached the best that could be obtained by using a scheme that had
full knowledge about the statistics of the source. While this may be true asymptotically, in
practice there are a number of ways of improving the performance of the LZ77 algorithm
as described here. Furthermore, by using the recent portions of the sequence, there is an
assumption of sorts being used here—that is, that patterns recur “close” together. As we
shall see, in LZ78 the authors removed this “assumption” and came up with an entirely
different adaptive-dictionary-based scheme. Before we get to that, let us look at the different
variations of the LZ77 algorithm.
Variations on the LZ77 Theme
There are a number of ways that the LZ77 scheme can be made more efficient, and most
of these have appeared in the literature. Many of the improvements deal with the efficient
encoding of the triples. In the description of the LZ77 algorithm, we assumed that the
triples were encoded using a fixed-length code. However, if we were willing to accept
more complexity, we could encode the triples using variable-length codes. As we saw in
earlier chapters, these codes can be adaptive or, if we were willing to use a two-pass
algorithm, they can be semiadaptive. Popular compression packages, such as PKZip, Zip,
LHarc, PNG, gzip, and ARJ, all use an LZ77-based algorithm followed by a variable-length
coder.
Other variations on the LZ77 algorithm include varying the size of the search and look-
ahead buffers. To make the search buffer large requires the development of more effective
search strategies. Such strategies can be implemented more effectively if the contents of the
search buffer are stored in a manner conducive to fast searches.
The simplest modification to the LZ77 algorithm, and one that is used by most variations
of the LZ77 algorithm, is to eliminate the situation where we use a triple to encode a single
character. Use of a triple is highly inefficient, especially if a large number of characters
occur infrequently. The modification to get rid of this inefficiency is simply the addition of
a flag bit, to indicate whether what follows is the codeword for a single symbol. By using
this flag bit we also get rid of the necessity for the third element of the triple. Now all
we need to do is to send a pair of values corresponding to the offset and length of match.
This modification to the LZ77 algorithm is referred to as LZSS [56, 57].
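The flag-bit idea can be sketched under the same assumptions as the earlier sketch (a hypothetical function name, toy buffer sizes, and an assumed minimum match length below which a pointer is not considered worthwhile):

def lzss_encode(sequence, search_size=7, lookahead_size=6, min_match=2):
    """Minimal LZSS-style sketch: a 1-bit flag chooses between a bare literal
    (flag 0) and an (offset, length) pair (flag 1); the third element of the
    LZ77 triple is no longer needed."""
    out, pos = [], 0
    while pos < len(sequence):
        best_offset, best_length = 0, 0
        limit = min(lookahead_size, len(sequence) - pos)
        for offset in range(1, min(search_size, pos) + 1):
            length = 0
            while (length < limit and
                   sequence[pos - offset + length] == sequence[pos + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = offset, length
        if best_length >= min_match:
            out.append((1, best_offset, best_length))  # flag 1: pointer to a match
            pos += best_length
        else:
            out.append((0, sequence[pos]))             # flag 0: a single literal symbol
            pos += 1
    return out

print(lzss_encode("cabracadabrarrarrad"))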
5.4.2 The LZ78 Approach
The LZ77 approach implicitly assumes that like patterns will occur close together. It makes
use of this structure by using the recent past of the sequence as the dictionary for encoding.
Figure 5.5 The Achilles’ heel of LZ77. [Figure shows a sequence with period nine (the symbols a through i repeated) spanning the search buffer and the look-ahead buffer.]
However, this means that any pattern that recurs over a period longer than that covered
by the coder window will not be captured. The worst-case situation would be where the
sequence to be encoded was periodic with a period longer than the search buffer. Consider
Figure 5.5.
This is a periodic sequence with a period of nine. If the search buffer had been just one
symbol longer, this sequence could have been significantly compressed. As it stands, none
of the new symbols will have a match in the search buffer and will have to be represented
by separate codewords. As this involves sending along overhead (a 1-bit flag for LZSS and
a triple for the original LZ77 algorithm), the net result will be an expansion rather than a
compression.
Although this is an extreme situation, there are less drastic circumstances in which the
finite view of the past would be a drawback. The LZ78 algorithm solves this problem by
dropping the reliance on the search buffer and keeping an explicit dictionary. This dictionary
has to be built at both the encoder and decoder, and care must be taken that the dictionaries
are built in an identical manner. The inputs are coded as a double ⟨i, c⟩, with i being an
index corresponding to the dictionary entry that was the longest match to the input, and c
being the code for the character in the input following the matched portion of the input. As
in the case of LZ77, the index value of 0 is used in the case of no match. This double then
becomes the newest entry in the dictionary. Thus, each new entry into the dictionary is one
new symbol concatenated with an existing dictionary entry. To see how the LZ78 algorithm
works, consider the following example.
Example 5.4.2: The LZ78 approach
Let us encode the following sequence using the LZ78 approach:
wabba/bwabba/bwabba/bwabba/bwoo/bwoo/bwoo²
where /b stands for space. Initially, the dictionary is empty, so the first few symbols encountered
are encoded with the index value set to 0. The first three encoder outputs are ⟨0, C(w)⟩,
⟨0, C(a)⟩, ⟨0, C(b)⟩, and the dictionary looks like Table 5.4.
The fourth symbol is a b, which is the third entry in the dictionary. If we append the
next symbol, we would get the pattern ba, which is not in the dictionary, so we encode
these two symbols as ⟨3, C(a)⟩, and add the pattern ba as the fourth entry in the dictionary.
Continuing in this fashion, the encoder output and the dictionary develop as in Table 5.5.
Notice that the entries in the dictionary generally keep getting longer, and if this particular
² “The Monster Song” from Sesame Street.
Table 5.4 The initial dictionary.
Index Entry
1 w
2 a
3 b
Table 5.5 Development of dictionary.
Encoder Output    Index    Entry
⟨0, C(w)⟩           1       w
⟨0, C(a)⟩           2       a
⟨0, C(b)⟩           3       b
⟨3, C(a)⟩           4       ba
⟨0, C(/b)⟩          5       /b
⟨1, C(a)⟩           6       wa
⟨3, C(b)⟩           7       bb
⟨2, C(/b)⟩          8       a/b
⟨6, C(b)⟩           9       wab
⟨4, C(/b)⟩         10       ba/b
⟨9, C(b)⟩          11       wabb
⟨8, C(w)⟩          12       a/bw
⟨0, C(o)⟩          13       o
⟨13, C(/b)⟩        14       o/b
⟨1, C(o)⟩          15       wo
⟨14, C(w)⟩         16       o/bw
⟨13, C(o)⟩         17       oo
sentence was repeated often, as it is in the song, after a while the entire sentence would be
an entry in the dictionary. 
While the LZ78 algorithm has the ability to capture patterns and hold them indefinitely, it
also has a rather serious drawback. As seen from the example, the dictionary keeps growing
without bound. In a practical situation, we would have to stop the growth of the dictionary at
some stage, and then either prune it back or treat the encoding as a fixed dictionary scheme.
We will discuss some possible approaches when we study applications of dictionary coding.
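Before turning to the variations, here is a minimal Python sketch of the LZ78 encoder just described. The sketch is illustrative only; the function name is ours, and the underscore character stands in for the space symbol /b.

def lz78_encode(sequence):
    """Minimal LZ78 sketch: emit doubles (index, character), where index points
    to the longest dictionary entry matching the current input (0 if none), and
    that entry concatenated with the character becomes a new dictionary entry."""
    dictionary = {}          # pattern -> index
    output = []
    pattern = ""
    for ch in sequence:
        if pattern + ch in dictionary:
            pattern += ch    # keep extending the match
        else:
            output.append((dictionary.get(pattern, 0), ch))
            dictionary[pattern + ch] = len(dictionary) + 1
            pattern = ""
    if pattern:              # flush a trailing match that has no following character
        output.append((dictionary[pattern], ""))
    return output

# '_' is used here in place of the /b (space) symbol.
print(lz78_encode("wabba_wabba_wabba_wabba_woo_woo_woo"))

Run on the sequence of Example 5.4.2, this sketch reproduces the doubles listed in Table 5.5.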
Variations on the LZ78 Theme—The LZW Algorithm
There are a number of ways the LZ78 algorithm can be modified, and as is the case with the
LZ77 algorithm, anything that can be modified probably has been. The most well-known
modification, one that initially sparked much of the interest in the LZ algorithms, is a
modification by Terry Welch known as LZW [58]. Welch proposed a technique for removing
the necessity of encoding the second element of the pair ⟨i, c⟩. That is, the encoder would
only send the index to the dictionary. In order to do this, the dictionary has to be primed with
all the letters of the source alphabet. The input to the encoder is accumulated in a pattern
p as long as p is contained in the dictionary. If the addition of another letter a results in a
pattern p ∗ a (∗ denotes concatenation) that is not in the dictionary, then the index of p is
transmitted to the receiver, the pattern p∗a is added to the dictionary, and we start another
pattern with the letter a. The LZW algorithm is best understood with an example. In the
following two examples, we will look at the encoder and decoder operations for the same
sequence used to explain the LZ78 algorithm.
Example 5.4.3: The LZW algorithm—encoding
We will use the sequence previously used to demonstrate the LZ78 algorithm as our input
sequence:
wabba/bwabba/bwabba/bwabba/bwoo/bwoo/bwoo
Assuming that the alphabet for the source is {/b, a, b, o, w}, the LZW dictionary initially
looks like Table 5.6.
Table 5.6 Initial LZW dictionary.
Index   Entry
1       /b
2       a
3       b
4       o
5       w
The encoder first encounters the letter w. This “pattern” is in the dictionary so we
concatenate the next letter to it, forming the pattern wa. This pattern is not in the dictionary,
so we encode w with its dictionary index 5, add the pattern wa to the dictionary as the sixth
element of the dictionary, and begin a new pattern starting with the letter a. As a is in the
dictionary, we concatenate the next element b to form the pattern ab. This pattern is not in
the dictionary, so we encode a with its dictionary index value 2, add the pattern ab to the
dictionary as the seventh element of the dictionary, and start constructing a new pattern with
the letter b. We continue in this manner, constructing two-letter patterns, until we reach
the letter w in the second wabba. At this point the output of the encoder consists entirely
of indices from the initial dictionary: 5 2 3 3 2 1. The dictionary at this point looks like
Table 5.7. (The 12th entry in the dictionary is still under construction.) The next symbol in
the sequence is a. Concatenating this to w, we get the pattern wa. This pattern already exists
in the dictionary (item 6), so we read the next symbol, which is b. Concatenating this to
wa, we get the pattern wab. This pattern does not exist in the dictionary, so we include it as
the 12th entry in the dictionary and start a new pattern with the symbol b. We also encode
wa with its index value of 6. Notice that after a series of two-letter entries, we now have
a three-letter entry. As the encoding progresses, the length of the entries keeps increasing.
The longer entries in the dictionary indicate that the dictionary is capturing more of the
structure in the sequence. The dictionary at the end of the encoding process is shown in
Table 5.8. Notice that the 12th through the 19th entries are all either three or four letters in
length. Then we encounter the pattern woo for the first time and we drop back to two-letter
patterns for three more entries, after which we go back to entries of increasing length.
Table 5.7 Constructing the 12th entry of the LZW dictionary.
Index   Entry
1       /b
2       a
3       b
4       o
5       w
6       wa
7       ab
8       bb
9       ba
10      a/b
11      /bw
12      w…
Table 5.8 The LZW dictionary for encoding wabba/bwabba/bwabba/bwabba/bwoo/bwoo/bwoo.
Index   Entry     Index   Entry
1       /b        14      a/bw
2       a         15      wabb
3       b         16      ba/b
4       o         17      /bwa
5       w         18      abb
6       wa        19      ba/bw
7       ab        20      wo
8       bb        21      oo
9       ba        22      o/b
10      a/b       23      /bwo
11      /bw       24      oo/b
12      wab       25      /bwoo
13      bba
The encoder output sequence is 5 2 3 3 2 1 6 8 10 12 9 11 7 16 5 4 4 11 21 23 4.
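Before turning to decoding, the encoding loop walked through above can be sketched compactly. This is an illustrative sketch only; the function name and the use of an underscore for /b are ours.

def lzw_encode(sequence, alphabet):
    """Minimal LZW sketch: the dictionary is primed with the source alphabet,
    only indices are transmitted, and the pattern p*a is added to the dictionary
    whenever it is not already present."""
    dictionary = {ch: i for i, ch in enumerate(alphabet, start=1)}
    output = []
    p = ""
    for a in sequence:
        if p + a in dictionary:
            p += a                          # keep growing the pattern
        else:
            output.append(dictionary[p])    # send the index of p
            dictionary[p + a] = len(dictionary) + 1
            p = a                           # start a new pattern with a
    output.append(dictionary[p])            # flush the final pattern
    return output

# '_' stands for the /b (space) symbol; the alphabet order matches Table 5.6.
print(lzw_encode("wabba_wabba_wabba_wabba_woo_woo_woo", "_abow"))

The printed indices are exactly the encoder output sequence given above, and the dictionary built internally matches Table 5.8.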
Example 5.4.4: The LZW algorithm—decoding
In this example we will take the encoder output from the previous example and decode it
using the LZW algorithm. The encoder output sequence in the previous example was
5 2 3 3 2 1 6 8 10 12 9 11 7 16 5 4 4 11 21 23 4
This becomes the decoder input sequence. The decoder starts with the same initial dictionary
as the encoder (Table 5.6).
The index value 5 corresponds to the letter w, so we decode w as the first element of
our sequence. At the same time, in order to mimic the dictionary construction procedure
of the encoder, we begin construction of the next element of the dictionary. We start with
the letter w. This pattern exists in the dictionary, so we do not add it to the dictionary
and continue with the decoding process. The next decoder input is 2, which is the index
corresponding to the letter a. We decode an a and concatenate it with our current pattern to
form the pattern wa. As this does not exist in the dictionary, we add it as the sixth element
of the dictionary and start a new pattern beginning with the letter a. The next four inputs
3 3 2 1 correspond to the letters bba/b and generate the dictionary entries ab, bb, ba, and
a/b. The dictionary now looks like Table 5.9, where the 11th entry is under construction.
Table 5.9 Constructing the 11th entry of the LZW dictionary while decoding.
Index   Entry
1       /b
2       a
3       b
4       o
5       w
6       wa
7       ab
8       bb
9       ba
10      a/b
11      /b
The next input is 6, which is the index of the pattern wa. Therefore, we decode a w
and an a. We first concatenate w to the existing pattern, which is /b, and form the pattern
/bw. As /bw does not exist in the dictionary, it becomes the 11th entry. The new pattern now
starts with the letter w. We had previously decoded the letter a, which we now concatenate
to w to obtain the pattern wa. This pattern is contained in the dictionary, so we decode the
next input, which is 8. This corresponds to the entry bb in the dictionary. We decode the
first b and concatenate it to the pattern wa to get the pattern wab. This pattern does not exist
in the dictionary, so we add it as the 12th entry in the dictionary and start a new pattern
with the letter b. Decoding the second b and concatenating it to the new pattern, we get
the pattern bb. This pattern exists in the dictionary, so we decode the next element in the
sequence of encoder outputs. Continuing in this fashion, we can decode the entire sequence.
Notice that the dictionary being constructed by the decoder is identical to that constructed
by the encoder. 
There is one particular situation in which the method of decoding the LZW algorithm
described above breaks down. Suppose we had a source with an alphabet A = {a, b}, and
we were to encode the sequence beginning with abababab…. The encoding process is
still the same. We begin with the initial dictionary shown in Table 5.10 and end up with the
final dictionary shown in Table 5.11.
The transmitted sequence is 1 2 3 5…. This looks like a relatively straightforward
sequence to decode. However, when we try to do so, we run into a snag. Let us go through
the decoding process and see what happens.
We begin with the same initial dictionary as the encoder (Table 5.10). The first two
elements in the received sequence 1 2 3 5 are decoded as a and b, giving rise to the third
dictionary entry ab, and the beginning of the next pattern to be entered in the dictionary, b.
The dictionary at this point is shown in Table 5.12.
Table 5.10 Initial dictionary for abababab….
Index Entry
1 a
2 b
Table 5.11 Final dictionary for abababab….
Index Entry
1 a
2 b
3 ab
4 ba
5 aba
6 abab
7 b
Table 5.12 Constructing the fourth entry of the dictionary while decoding.
Index Entry
1 a
2 b
3 ab
4 b
Table 5.13 Constructing the fifth entry (stage one).
Index Entry
1 a
2 b
3 ab
4 ba
5 a
Table 5.14 Constructing the fifth entry (stage two).
Index Entry
1 a
2 b
3 ab
4 ba
5 ab
The next input to the decoder is 3. This corresponds to the dictionary entry ab. Decoding
each in turn, we first concatenate a to the pattern under construction to get ba. This pattern
is not contained in the dictionary, so we add this to the dictionary (keep in mind, we have
not used the b from ab yet), which now looks like Table 5.13.
The new entry starts with the letter a. We have only used the first letter from the pair
ab. Therefore, we now concatenate b to a to obtain the pattern ab. This pattern is contained
in the dictionary, so we continue with the decoding process. The dictionary at this stage
looks like Table 5.14.
The first four entries in the dictionary are complete, while the fifth entry is still under
construction. However, the very next input to the decoder is 5, which corresponds to the
incomplete entry! How do we decode an index for which we do not as yet have a complete
dictionary entry?
The situation is actually not as bad as it looks. (Of course, if it were, we would not now
be studying LZW.) While we may not have a fifth entry for the dictionary, we do have the
beginnings of the fifth entry, which is ab…. Let us, for the moment, pretend that we do
indeed have the fifth entry and continue with the decoding process. If we had a fifth entry,
the first two letters of the entry would be a and b. Concatenating a to the partial new entry
we get the pattern aba. This pattern is not contained in the dictionary, so we add this to
our dictionary, which now looks like Table 5.15. Notice that we now have the fifth entry in
the dictionary, which is aba. We have already decoded the ab portion of aba. We can now
decode the last letter a and continue on our merry way.
This means that the LZW decoder has to contain an exception handler to handle the
special case of decoding an index that does not have a corresponding complete entry in the
decoder dictionary.
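A minimal sketch of such a decoder, including this exception handler, is shown below. The function name and the underscore notation for /b are again ours.

def lzw_decode(indices, alphabet):
    """Minimal LZW decoder sketch, including the special case in which the
    received index refers to the entry currently under construction (the
    abab... situation): the missing entry must be p + p[0]."""
    dictionary = {i: ch for i, ch in enumerate(alphabet, start=1)}
    p = dictionary[indices[0]]
    decoded = [p]
    for index in indices[1:]:
        if index in dictionary:
            entry = dictionary[index]
        else:
            entry = p + p[0]                            # exception: incomplete entry
        decoded.append(entry)
        dictionary[len(dictionary) + 1] = p + entry[0]  # complete the pending entry
        p = entry
    return "".join(decoded)

print(lzw_decode([5, 2, 3, 3, 2, 1, 6, 8, 10, 12, 9, 11, 7, 16, 5, 4, 4, 11, 21, 23, 4], "_abow"))
print(lzw_decode([1, 2, 3, 5], "ab"))   # exercises the special case for index 5

The first call recovers the sequence of Example 5.4.4; the second shows the exception at work: when the index 5 arrives before its entry is complete, the decoder forms the entry as the current pattern concatenated with its own first letter.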
Table 5.15 Completion of the fifth entry.
Index Entry
1 a
2 b
3 ab
4 ba
5 aba
6 a
5.5 Applications
Since the publication of Terry Welch’s article [58], there has been a steadily increasing
number of applications that use some variant of the LZ78 algorithm. Among the LZ78
variants, by far the most popular is the LZW algorithm. In this section we describe two of
the best-known applications of LZW: GIF and V.42 bis. While the LZW algorithm was
initially the algorithm of choice, patent concerns have led to increasing use of the LZ77
algorithm. The most popular implementation of the LZ77 algorithm is the deflate algorithm
initially designed by Phil Katz. It is part of the popular zlib library developed by Jean-loup
Gailly and Mark Adler. Jean-loup Gailly also used deflate in the widely used gzip algorithm.
The deflate algorithm is also used in PNG which we describe below.
5.5.1 File Compression—UNIX compress
The UNIX compress command is one of the earlier applications of LZW. The size of the
dictionary is adaptive. We start with a dictionary of size 512. This means that the transmitted
codewords are 9 bits long. Once the dictionary has filled up, the size of the dictionary is
doubled to 1024 entries. The codewords transmitted at this point have 10 bits. The size of
the dictionary is progressively doubled as it fills up. In this way, during the earlier part of
the coding process when the strings in the dictionary are not very long, the codewords used
to encode them also have fewer bits. The maximum size of the codeword, bmax, can be set
by the user to between 9 and 16, with 16 bits being the default. Once the dictionary contains
2^bmax entries, compress becomes a static dictionary coding technique. At this point the
algorithm monitors the compression ratio. If the compression ratio falls below a threshold,
the dictionary is flushed, and the dictionary building process is restarted. This way, the
dictionary always reflects the local characteristics of the source.
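The codeword-width policy just described can be summarized with a short sketch. This is illustrative only (it is not the actual compress source code, and the function name is ours):

def codeword_width(dictionary_size, b_max=16):
    """Sketch of the adaptive codeword width used by compress-style coders:
    start at 9 bits for a 512-entry dictionary and add one bit each time the
    dictionary doubles, up to a user-selectable maximum b_max (9 to 16)."""
    bits = 9
    while (1 << bits) < dictionary_size and bits < b_max:
        bits += 1
    return bits

# 512 entries -> 9 bits, 1024 -> 10 bits, and so on, capped at b_max bits.
print(codeword_width(512), codeword_width(1024), codeword_width(70000))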
5.5.2 Image Compression—The Graphics Interchange Format (GIF)
The Graphics Interchange Format (GIF) was developed by Compuserve Information Service
to encode graphical images. It is another implementation of the LZW algorithm and is very
similar to the compress command. The compressed image is stored with the first byte
Table 5.16 Comparison of GIF with arithmetic coding.
Image     GIF       Arithmetic Coding of Pixel Values    Arithmetic Coding of Pixel Differences
Sena      51,085    53,431                               31,847
Sensin    60,649    58,306                               37,126
Earth     34,276    38,248                               32,137
Omaha     61,580    56,061                               51,393
being the minimum number of bits b per pixel in the original image. For the images we
have been using as examples, this would be eight. The binary number 2^b is defined to be the
clear code. This code is used to reset all compression and decompression parameters to a
start-up state. The initial size of the dictionary is 2^(b+1). When this fills up, the dictionary size
is doubled, as was done in the compress algorithm, until the maximum dictionary size
of 4096 is reached. At this point the compression algorithm behaves like a static dictionary
algorithm. The codewords from the LZW algorithm are stored in blocks of characters. The
characters are 8 bits long, and the maximum block size is 255. Each block is preceded by a
header that contains the block size. The block is terminated by a block terminator consisting
of eight 0s. The end of the compressed image is denoted by an end-of-information code with
a value of 2^b + 1. This codeword should appear before the block terminator.
GIF has become quite popular for encoding all kinds of images, both computer-generated
and “natural” images. While GIF works well with computer-generated graphical images, and
pseudocolor or color-mapped images, it is generally not the most efficient way to losslessly
compress images of natural scenes, photographs, satellite images, and so on. In Table 5.16
we give the file sizes for the GIF-encoded test images. For comparison, we also include the
file sizes for arithmetic coding the original images and arithmetic coding the differences.
Notice that even if we account for the extra overhead in the GIF files, for these images
GIF barely holds its own even with simple arithmetic coding of the original pixels. While
this might seem odd at first, if we examine the image on a pixel level, we see that there are
very few repetitive patterns compared to a text source. Some images, like the Earth image,
contain large regions of constant values. In the dictionary coding approach, these regions
become single entries in the dictionary. Therefore, for images like these, the straightforward
dictionary coding approach does hold its own. However, for most other images, it would
probably be preferable to perform some preprocessing to obtain a sequence more amenable
to dictionary coding. The PNG standard described next takes advantage of the fact that
in natural images the pixel-to-pixel variation is generally small to develop an appropriate
preprocessor. We will also revisit this subject in Chapter 7.
5.5.3 Image Compression—Portable Network Graphics (PNG)
The PNG standard is one of the first standards to be collaboratively developed over the
Internet. The impetus for it was an announcement in December 1994 by Unisys (which had
acquired the patent for LZW from Sperry) and CompuServe that they would start charging
royalties to authors of software that included support for GIF. The announcement resulted in
an uproar in the segment of the compression community that formed the core of the Usenet
group comp.compression. The community decided that a patent-free replacement for GIF
should be developed, and within three months PNG was born. (For a more detailed history
of PNG as well as software and much more, go to the PNG website maintained by Greg
Roelofs, http://www.libpng.org/pub/png/.)
Unlike GIF, the compression algorithm used in PNG is based on LZ77. In particular,
it is based on the deflate [59] implementation of LZ77. This implementation allows for
match lengths of between 3 and 258. At each step the encoder examines three bytes. If it
cannot find a match of at least three bytes it puts out the first byte and examines the next
three bytes. So, at each step it either puts out the value of a single byte, or literal, or the pair
⟨match length, offset⟩. The alphabets of the literal and match length are combined to
form an alphabet of size 286 (indexed by 0–285). The indices 0–255 represent literal
bytes and the index 256 is an end-of-block symbol. The remaining 29 indices represent
codes for ranges of lengths between 3 and 258, as shown in Table 5.17. The table shows
the index, the number of selector bits to follow the index, and the lengths represented by
the index and selector bits. For example, the index 277 represents the range of lengths from
67 to 82. To specify which of the sixteen values has actually occurred, the code is followed
by four selector bits.
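The mapping from a match length to its index and selector bits can be sketched directly from Table 5.17. The function name and the list layout below are ours; this is only an illustration of the table, not library code.

# (first length in range, number of selector bits) for indices 257..285, as in Table 5.17.
LENGTH_BASES = [
    (3, 0), (4, 0), (5, 0), (6, 0), (7, 0), (8, 0), (9, 0), (10, 0),
    (11, 1), (13, 1), (15, 1), (17, 1),
    (19, 2), (23, 2), (27, 2), (31, 2),
    (35, 3), (43, 3), (51, 3), (59, 3),
    (67, 4), (83, 4), (99, 4), (115, 4),
    (131, 5), (163, 5), (195, 5), (227, 5),
    (258, 0),
]

def length_code(match_length):
    """Map a match length (3..258) to (index, selector bits, selector value)."""
    assert 3 <= match_length <= 258
    for i, (base, extra) in enumerate(LENGTH_BASES):
        next_base = LENGTH_BASES[i + 1][0] if i + 1 < len(LENGTH_BASES) else 259
        if base <= match_length < next_base:
            return 257 + i, extra, match_length - base

print(length_code(70))   # -> (277, 4, 3): 70 = 67 + 3, inside the range 67-82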
The index values are represented using a Huffman code. The Huffman code is specified
in Table 5.18.
The offset can take on values between 1 and 32,768. These values are divided into
30 ranges. The thirty range values are encoded using a Huffman code (different from the
Huffman code for the literal and length values) and the code is followed by a number of
selector bits to specify the particular distance within the range.
We have mentioned earlier that in natural images there is not a great deal of repetition of
sequences of pixel values. However, pixel values that are spatially close also tend to have
values that are similar. The PNG standard makes use of this structure by estimating the
value of a pixel based on its causal neighbors and subtracting this estimate from the pixel.
The difference modulo 256 is then encoded in place of the original pixel. There are four
different ways of getting the estimate (five if you include no estimation), and PNG allows
Table 5.17 Codes for representations of match length [59].
Index # of selector bits Length Index # of selector bits Length Index # of selector bits Length
257 0 3 267 1 15,16 277 4 67–82
258 0 4 268 1 17,18 278 4 83–98
259 0 5 269 2 19–22 279 4 99–114
260 0 6 270 2 23–26 280 4 115–130
261 0 7 271 2 27–30 281 5 131–162
262 0 8 272 2 31–34 282 5 163–194
263 0 9 273 3 35–42 283 5 195–226
264 0 10 274 3 43–50 284 5 227–257
265 1 11, 12 275 3 51–58 285 0 258
266 1 13, 14 276 3 59–66
Table 5.18 Huffman codes for the match length alphabet [59].
Index Ranges # of bits Binary Codes
0–143 8 00110000 through
10111111
144–255 9 110010000 through
111111111
256–279 7 0000000 through
0010111
280–287 8 11000000 through
11000111
Table 5.19 Comparison of PNG with GIF and arithmetic coding.
Image     PNG       GIF       Arithmetic Coding of Pixel Values    Arithmetic Coding of Pixel Differences
Sena 31,577 51,085 53,431 31,847
Sensin 34,488 60,649 58,306 37,126
Earth 26,995 34,276 38,248 32,137
Omaha 50,185 61,580 56,061 51,393
the use of a different method of estimation for each row. The first way is to use the pixel
from the row above as the estimate. The second method is to use the pixel to the left as the
estimate. The third method uses the average of the pixel above and the pixel to the left. The
final method is a bit more complex. An initial estimate of the pixel is first made by adding
the pixel to the left and the pixel above and subtracting the pixel to the upper left. Then the
pixel that is closest to the initial estimate (upper, left, or upper left) is taken as the estimate.
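The four estimation methods can be sketched as follows. The function name and the method numbering (1 through 4, with 0 for no estimation) are ours; the tie-breaking order in the final method follows the PNG specification's preference for the left neighbor, a detail not spelled out in the description above.

def png_predict(left, above, upper_left, method):
    """Sketch of the PNG estimation methods described above. The caller encodes
    (pixel - estimate) mod 256 in place of the original pixel value."""
    if method == 0:
        return 0
    if method == 1:
        return above                      # pixel from the row above
    if method == 2:
        return left                       # pixel to the left
    if method == 3:
        return (left + above) // 2        # average of the two
    # Method 4: form an initial estimate, then pick the neighbor closest to it.
    initial = left + above - upper_left
    return min([left, above, upper_left], key=lambda p: abs(initial - p))

# Example with assumed neighbor values:
pixel, left, above, upper_left = 120, 118, 121, 119
estimate = png_predict(left, above, upper_left, method=4)
print(estimate, (pixel - estimate) % 256)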
A comparison of the performance of PNG and GIF on our standard image set is shown in
Table 5.19. The PNG method clearly outperforms GIF.
5.5.4 Compression over Modems—V.42 bis
The ITU-T Recommendation V.42 bis is a compression standard devised for use over a
telephone network along with error-correcting procedures described in CCITT Recommen-
dation V.42. This algorithm is used in modems connecting computers to remote users. The
algorithm described in this recommendation operates in two modes, a transparent mode and
a compressed mode. In the transparent mode, the data are transmitted in uncompressed form,
while in the compressed mode an LZW algorithm is used to provide compression.
The reason for the existence of two modes is that at times the data being transmitted do
not have repetitive structure and therefore cannot be compressed using the LZW algorithm.
In this case, the use of a compression algorithm may even result in expansion. In these
situations, it is better to send the data in an uncompressed form. A random data stream
would cause the dictionary to grow without any long patterns as elements of the dictionary.
This means that most of the time the transmitted codeword would represent a single letter
from the source alphabet. As the dictionary size is much larger than the source alphabet
size, the number of bits required to represent an element in the dictionary is much more than
the number of bits required to represent a source letter. Therefore, if we tried to compress
a sequence that does not contain repeating patterns, we would end up with more bits to
transmit than if we had not performed any compression. Data without repetitive structure are
often encountered when a previously compressed file is transferred over the telephone lines.
The V.42 bis recommendation suggests periodic testing of the output of the compression
algorithm to see if data expansion is taking place. The exact nature of the test is not specified
in the recommendation.
In the compressed mode, the system uses LZW compression with a variable-size dictio-
nary. The initial dictionary size is negotiated at the time a link is established between the
transmitter and receiver. The V.42 bis recommendation suggests a value of 2048 for the
dictionary size. It specifies that the minimum size of the dictionary is to be 512. Suppose
the initial negotiations result in a dictionary size of 512. This means that our codewords that
are indices into the dictionary will be 9 bits long. Actually, the entire 512 indices do not
correspond to input strings; three entries in the dictionary are reserved for control codewords.
These codewords in the compressed mode are shown in Table 5.20.
When the number of entries in the dictionary exceeds a prearranged threshold C3, the
encoder sends the STEPUP control code, and the codeword size is incremented by 1 bit.
At the same time, the threshold C3 is also doubled. When all available dictionary entries
are filled, the algorithm initiates a reuse procedure. The location of the first string entry in
the dictionary is maintained in a variable N5. Starting from N5, a counter C1 is incremented
until it finds a dictionary entry that is not a prefix to any other dictionary entry. The fact
that this entry is not a prefix to another dictionary entry means that this pattern has not been
encountered since it was created. Furthermore, because of the way it was located, among
patterns of this kind this pattern has been around the longest. This reuse procedure enables
the algorithm to prune the dictionary of strings that may have been encountered in the past
but have not been encountered recently, on a continual basis. In this way the dictionary is
always matched to the current source statistics.
To reduce the effect of errors, the CCITT recommends setting a maximum string length.
This maximum length is negotiated at link setup. The CCITT recommends a range of 6–250,
with a default value of 6.
The V.42 bis recommendation avoids the need for an exception handler for the case
where the decoder receives a codeword corresponding to an incomplete entry by forbidding
the use of the last entry in the dictionary. Instead of transmitting the codeword corresponding
to the last entry, the recommendation requires the sending of the codewords corresponding
Table 5.20 Control codewords in compressed mode.
Codeword Name Description
0 ETM Enter transparent mode
1 FLUSH Flush data
2 STEPUP Increment codeword size
to the constituents of the last entry. In the example used to demonstrate this quirk of the
LZW algorithm, instead of transmitting the codeword 5, the V.42 bis recommendation would
have forced us to send the codewords 3 and 1.
5.6 Summary
In this chapter we have introduced techniques that keep a dictionary of recurring patterns
and transmit the index of those patterns instead of the patterns themselves in order to achieve
compression. There are a number of ways the dictionary can be constructed.
 In applications where certain patterns consistently recur, we can build application-
specific static dictionaries. Care should be taken not to use these dictionaries outside
their area of intended application. Otherwise, we may end up with data expansion
instead of data compression.
 The dictionary can be the source output itself. This is the approach used by the LZ77
algorithm. When using this algorithm, there is an implicit assumption that recurrence
of a pattern is a local phenomenon.
 This assumption is removed in the LZ78 approach, which dynamically constructs a
dictionary from patterns observed in the source output.
Dictionary-based algorithms are being used to compress all kinds of data; however, care
should be taken with their use. This approach is most useful when structural constraints
restrict the frequently occurring patterns to a small subset of all possible patterns. This is
the case with text, as well as computer-to-computer communication.
Further Reading
1. Text Compression, by T.C. Bell, J.G. Cleary, and I.H. Witten [1], provides an excellent
exposition of dictionary-based coding techniques.
2. The Data Compression Book, by M. Nelson and J.-L. Gailly [60], also does a good
job of describing the Ziv-Lempel algorithms. There is also a very nice description of
some of the software implementation aspects.
3. Data Compression, by G. Held and T.R. Marshall [61], contains a description of
digram coding under the name “diatomic coding.” The book also includes BASIC
programs that help in the design of dictionaries.
4. The PNG algorithm is described in a very accessible manner in “PNG Lossless
Compression,” by G. Roelofs [62] in the Lossless Compression Handbook.
5. A more in-depth look at dictionary compression is provided in “Dictionary-Based Data
Compression: An Algorithmic Perspective,” by S.C. Şahinalp and N.M. Rajpoot [63]
in the Lossless Compression Handbook.
5.7 Projects and Problems
1. To study the effect of dictionary size on the efficiency of a static dictionary technique,
we can modify Equation (5.1) so that it gives the rate as a function of both p and
the dictionary size M. Plot the rate as a function of p for different values of M, and
discuss the trade-offs involved in selecting larger or smaller values of M.
2. Design and implement a digram coder for text files of interest to you.
(a) Study the effect of the dictionary size, and the size of the text file being encoded
on the amount of compression.
(b) Use the digram coder on files that are not similar to the ones you used to design
the digram coder. How much does this affect your compression?
3. Given an initial dictionary consisting of the letters a, b, r, y, /b, encode the following
message using the LZW algorithm: a/bbar/barray/bby/bbarrayar/bbay.
4. A sequence is encoded using the LZW algorithm and the initial dictionary shown in
Table 5.21.
Table 5.21 Initial dictionary for Problem 4.
Index   Entry
1       a
2       /b
3       h
4       i
5       s
6       t
(a) The output of the LZW encoder is the following sequence:
6 3 4 5 2 3 1 6 2 9 11 16 12 14 4 20 10 8 23 13
Decode this sequence.
(b) Encode the decoded sequence using the same initial dictionary. Does your answer
match the sequence given above?
5. A sequence is encoded using the LZW algorithm and the initial dictionary shown in
Table 5.22.
(a) The output of the LZW encoder is the following sequence:
3 1 4 6 8 4 2 1 2 5 10 6 11 13 6
Decode this sequence.
Table 5.22 Initial dictionary for Problem 5.
Index   Entry
1       a
2       /b
3       r
4       t
(b) Encode the decoded sequence using the same initial dictionary. Does your answer
match the sequence given above?
6. Encode the following sequence using the LZ77 algorithm:
barrayar/bbar/bby/bbarrayar/bbay
Assume you have a window size of 30 with a look-ahead buffer of size 15. Furthermore,
assume that C(a) = 1, C(b) = 2, C(/b) = 3, C(r) = 4, and C(y) = 5.
7. A sequence is encoded using the LZ77 algorithm. Given that C(a) = 1, C(/b) = 2,
C(r) = 3, and C(t) = 4, decode the following sequence of triples:
⟨0, 0, 3⟩ ⟨0, 0, 1⟩ ⟨0, 0, 4⟩ ⟨2, 8, 2⟩ ⟨3, 1, 2⟩ ⟨0, 0, 3⟩ ⟨6, 4, 4⟩ ⟨9, 5, 4⟩
Assume that the size of the window is 20 and the size of the look-ahead buffer is 10.
Encode the decoded sequence and make sure you get the same sequence of triples.
8. Given the following primed dictionary and the received sequence below, build an
LZW dictionary and decode the transmitted sequence.
Received Sequence: 4, 5, 3, 1, 2, 8, 2, 7, 9, 7, 4
Decoded Sequence:
Initial dictionary:
Index   Entry
1       S
2       /b
3       I
4       T
5       H
6
Context-Based Compression
6.1 Overview
In this chapter we present a number of techniques that use minimal prior
assumptions about the statistics of the data. Instead they use the context of the
data being encoded and the past history of the data to provide more efficient
compression. We will look at a number of schemes that are principally used
for the compression of text. These schemes use the context in which the data
occurs in different ways.
6.2 Introduction
In Chapters 3 and 4 we learned that we get more compression when the message that is
being coded has a more skewed set of probabilities. By “skewed” we mean that certain
symbols occur with much higher probability than others in the sequence to be encoded. So
it makes sense to look for ways to represent the message that would result in greater skew.
One very effective way to do so is to look at the probability of occurrence of a letter in the
context in which it occurs. That is, we do not look at each symbol in a sequence as if it
had just happened out of the blue. Instead, we examine the history of the sequence before
determining the likely probabilities of different values that the symbol can take.
In the case of English text, Shannon [8] showed the role of context in two very interesting
experiments. In the first, a portion of text was selected and a subject (possibly his wife,
Mary Shannon) was asked to guess each letter. If she guessed correctly, she was told that
she was correct and moved on to the next letter. If she guessed incorrectly, she was told
the correct answer and again moved on to the next letter. Here is a result from one of these
experiments. Here the dashes represent the letters that were correctly guessed.
Actual Text T H E R O O M W A S N O T V E R Y L I G H T A S M A L L O B L O N G
Subject Performance _ _ _ _ R O O _ _ _ _ _ _ N O T _ V _ _ _ _ _ I _ _ _ _ _ _ S M _ _ _ _ O B L _ _ _
Notice that there is a good chance that the subject will guess the letter, especially if the
letter is at the end of a word or if the word is clear from the context. If we now represent
the original sequence by the subject performance, we would get a very different set of
probabilities for the values that each element of the sequence takes on. The probabilities are
definitely much more skewed in the second row: the “letter” _ occurs with high probability.
If a mathematical twin of the subject were available at the other end, we could send the
“reduced” sentence in the second row and have the twin go through the same guessing
process to come up with the original sequence.
In the second experiment, the subject was allowed to continue guessing until she had
guessed the correct letter and the number of guesses required to correctly predict the letter
was noted. Again, most of the time the subject guessed correctly, resulting in 1 being the
most probable number. The existence of a mathematical twin at the receiving end would
allow this skewed sequence to represent the original sequence to the receiver. Shannon used
his experiments to come up with upper and lower bounds for the English alphabet (1.3 bits
per letter and 0.6 bits per letter, respectively).
The difficulty with using these experiments is that the human subject was much better
at predicting the next letter in a sequence than any mathematical predictor we can develop.
Grammar is hypothesized to be innate to humans [64], in which case development of a
predictor as efficient as a human for language is not possible in the near future. However,
the experiments do provide an approach to compression that is useful for compression of all
types of sequences, not simply language representations.
If a sequence of symbols being encoded does not consist of independent occurrences
of the symbols, then the knowledge of which symbols have occurred in the neighborhood
of the symbol being encoded will give us a much better idea of the value of the symbol
being encoded. If we know the context in which a symbol occurs we can guess with a much
greater likelihood of success what the value of the symbol is. This is just another way of
saying that, given the context, some symbols will occur with much higher probability than
others. That is, the probability distribution given the context is more skewed. If the context
is known to both encoder and decoder, we can use this skewed distribution to perform the
encoding, thus increasing the level of compression. The decoder can use its knowledge of
the context to determine the distribution to be used for decoding. If we can somehow group
like contexts together, it is quite likely that the symbols following these contexts will be the
same, allowing for the use of some very simple and efficient compression strategies. We
can see that the context can play an important role in enhancing compression, and in this
chapter we will look at several different ways of using the context.
Consider the encoding of the word probability. Suppose we have already encoded the
first four letters, and we want to code the fifth letter, a. If we ignore the first four letters,
the probability of the letter a is about 0.06. If we use the information that the previous letter
is b, this reduces the probability of several letters such as q and z occurring and boosts
the probability of an a occurring. In this example, b would be the first-order context for
a, ob would be the second-order context for a, and so on. Using more letters to define the
context in which a occurs, or higher-order contexts, will generally increase the probability
of the occurrence of a in this example, and hence reduce the number of bits required to
encode its occurrence. Therefore, what we would like to do is to encode each letter using
the probability of its occurrence with respect to a context of high order.
If we want to have probabilities with respect to all possible high-order contexts, this
might be an overwhelming amount of information. Consider an alphabet of size M. The
number of first-order contexts is M, the number of second-order contexts is M², and so on.
Therefore, if we wanted to encode a sequence from an alphabet of size 256 using contexts
of order 5, we would need 256⁵, or about 1.09951 × 10¹², probability distributions! This is
not a practical alternative. A set of algorithms that resolve this problem in a very simple and
elegant way is based on the prediction with partial match (ppm) approach. We will describe
this in the next section.
6.3 Prediction with Partial Match (ppm)
The best-known context-based algorithm is the ppm algorithm, first proposed by Cleary and
Witten [65] in 1984. It has not been as popular as the various Ziv-Lempel-based algorithms
mainly because of the faster execution speeds of the latter algorithms. Lately, with the
development of more efficient variants, ppm-based algorithms are becoming increasingly
more popular.
The idea of the ppm algorithm is elegantly simple. We would like to use large contexts to
determine the probability of the symbol being encoded. However, the use of large contexts
would require us to estimate and store an extremely large number of conditional probabilities,
which might not be feasible. Instead of estimating these probabilities ahead of time, we can
reduce the burden by estimating the probabilities as the coding proceeds. This way we only
need to store those contexts that have occurred in the sequence being encoded. This is a
much smaller number than the number of all possible contexts. While this mitigates the
problem of storage, it also means that, especially at the beginning of an encoding, we will
need to code letters that have not occurred previously in this context. In order to handle
this situation, the source coder alphabet always contains an escape symbol, which is used to
signal that the letter to be encoded has not been seen in this context.
6.3.1 The Basic Algorithm
The basic algorithm initially attempts to use the largest context. The size of the largest
context is predetermined. If the symbol to be encoded has not previously been encountered in
this context, an escape symbol is encoded and the algorithm attempts to use the next smaller
context. If the symbol has not occurred in this context either, the size of the context is further
reduced. This process continues until either we obtain a context that has previously been
encountered with this symbol, or we arrive at the conclusion that the symbol has not been
encountered previously in any context. In this case, we use a probability of 1/M to encode
the symbol, where M is the size of the source alphabet. For example, when coding the a
of probability, we would first attempt to see if the string proba has previously occurred—
that is, if a had previously occurred in the context of prob. If not, we would encode an
escape and see if a had occurred in the context of rob. If the string roba had not occurred
previously, we would again send an escape symbol and try the context ob. Continuing in
this manner, we would try the context b, and failing that, we would see if the letter a (with a
zero-order context) had occurred previously. If a was being encountered for the first time,
we would use a model in which all letters occur with equal probability to encode a. This
equiprobable model is sometimes referred to as the context of order −1.
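The context-fallback logic just described can be sketched as follows. This is illustrative only; the function name, the assumed layout of the count tables in models, and the list of (context, event) pairs returned for a hypothetical arithmetic coder are all ours.

def ppm_encode_symbol(symbol, history, models, max_order=2):
    """Sketch of the ppm fallback: try the longest context first, emit an escape
    whenever the symbol is unseen in a previously seen context, and fall back to
    shorter contexts, ending with the equiprobable order -1 model.

    models[k] maps a length-k context string to a dict of symbol counts."""
    events = []
    for order in range(max_order, 0, -1):
        if len(history) < order:
            continue
        context = history[-order:]
        counts = models.get(order, {}).get(context, {})
        if symbol in counts:
            events.append((context, symbol))
            return events
        if counts:                            # context seen before, symbol not seen in it
            events.append((context, "Esc"))
    zero_counts = models.get(0, {}).get("", {})
    if symbol in zero_counts:
        events.append(("", symbol))
    else:
        if zero_counts:
            events.append(("", "Esc"))
        events.append(("order -1", symbol))   # coded with probability 1/M
    return events

# Counts after encoding "this_is_" (underscore stands for /b); an assumed toy model.
models = {
    2: {"s_": {"i": 1}},
    1: {"_": {"i": 1}, "t": {"h": 1}, "h": {"i": 1}, "i": {"s": 2}, "s": {"_": 1}},
    0: {"": {"t": 1, "h": 1, "i": 2, "s": 2, "_": 1}},
}
print(ppm_encode_symbol("t", "this_is_", models))
# -> [('s_', 'Esc'), ('_', 'Esc'), ('', 't')], matching the three encodings for t in Example 6.3.1.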
As the development of the probabilities with respect to each context is an adaptive
process, each time a symbol is encountered, the count corresponding to that symbol is
updated. The number of counts to be assigned to the escape symbol is not obvious, and a
number of different approaches have been used. One approach used by Cleary and Witten is
to give the escape symbol a count of one, thus inflating the total count by one. Cleary and
Witten call this method of assigning counts Method A, and the resulting algorithm ppma.
We will describe some of the other ways of assigning counts to the escape symbol later in
this section.
Before we delve into some of the details, let’s work through an example to see how
all this works together. As we will be using arithmetic coding to encode the symbols, you
might wish to refresh your memory of the arithmetic coding algorithms.
Example 6.3.1:
Let’s encode the sequence
this/bis/bthe/btithe
Assuming we have already encoded the initial seven characters this/bis, the various counts
and Cum_Count arrays to be used in the arithmetic coding of the symbols are shown in
Tables 6.1–6.4. In this example, we are assuming that the longest context length is two. This
is a rather small value and is used here to keep the size of the example reasonably small.
A more common value for the longest context length is five.
We will assume that the word length for arithmetic coding is six. Thus, l = 000000 and
u = 111111. As this/bis has already been encoded, the next letter to be encoded is /b. The
second-order context for this letter is is. Looking at Table 6.4, we can see that the letter /b
Table 6.1 Count array for −1 order context.
Letter   Count   Cum_Count
t        1       1
h        1       2
i        1       3
s        1       4
e        1       5
/b       1       6
Total Count      6
Table 6.2 Count array for zero-order context.
Letter   Count   Cum_Count
t        1       1
h        1       2
i        2       4
s        2       6
/b       1       7
Esc      1       8
Total Count      8
Table 6.3 Count array for first-order contexts.
Context   Letter   Count   Cum_Count
t         h        1       1
          Esc      1       2
          Total Count      2
h         i        1       1
          Esc      1       2
          Total Count      2
i         s        2       2
          Esc      1       3
          Total Count      3
/b        i        1       1
          Esc      1       2
          Total Count      2
s         /b       1       1
          Esc      1       2
          Total Count      2
is the first letter in this context with a Cum_Count value of 1. As the Total_Count in this
case is 2, the update equations for the lower and upper limits are
l = 0 + ⌊(63 − 0 + 1) × 0/2⌋ = 0 = 000000
u = 0 + ⌊(63 − 0 + 1) × 1/2⌋ − 1 = 31 = 011111
Table 6.4 Count array for second-order contexts.
Context   Letter   Count   Cum_Count
th        i        1       1
          Esc      1       2
          Total Count      2
hi        s        1       1
          Esc      1       2
          Total Count      2
is        /b       1       1
          Esc      1       2
          Total Count      2
s/b       i        1       1
          Esc      1       2
          Total Count      2
/bi       s        1       1
          Esc      1       2
          Total Count      2
As the MSBs of both l and u are the same, we shift that bit out, shift a 0 into the LSB of l,
and a 1 into the LSB of u. The transmitted sequence, lower limit, and upper limit after the
update are
Transmitted sequence: 0
l: 000000
u: 111111
We also update the counts in Tables 6.2–6.4.
The next letter to be encoded in the sequence is t. The second-order context is s/b.
Looking at Table 6.4, we can see that t has not appeared before in this context. We therefore
encode an escape symbol. Using the counts listed in Table 6.4, we update the lower and
upper limits:
l = 0 + ⌊(63 − 0 + 1) × 1/2⌋ = 32 = 100000
u = 0 + ⌊(63 − 0 + 1) × 2/2⌋ − 1 = 63 = 111111
Again, the MSBs of l and u are the same, so we shift the bit out and shift 0 into the LSB of
l, and 1 into u, restoring l to a value of 0 and u to a value of 63. The transmitted sequence
is now 01. After transmitting the escape, we look at the first-order context of t, which is /b.
Looking at Table 6.3, we can see that t has not previously occurred in this context. To let
the decoder know this, we transmit another escape. Updating the limits, we get
l = 0 + ⌊(63 − 0 + 1) × 1/2⌋ = 32 = 100000
u = 0 + ⌊(63 − 0 + 1) × 2/2⌋ − 1 = 63 = 111111
As the MSBs of l and u are the same, we shift the MSB out and shift 0 into the LSB of l
and 1 into the LSB of u. The transmitted sequence is now 011. Having escaped out of the
first-order contexts, we examine Table 6.5, the updated version of Table 6.2, to see if we
can encode t using a zero-order context. Indeed we can, and using the Cum_Count array,
we can update l and u:
l = 0 + ⌊(63 − 0 + 1) × 0/9⌋ = 0 = 000000
u = 0 + ⌊(63 − 0 + 1) × 1/9⌋ − 1 = 6 = 000110
Table 6.5 Updated count array for zero-order context.
Letter   Count   Cum_Count
t        1       1
h        1       2
i        2       4
s        2       6
/b       2       8
Esc      1       9
Total Count      9
The three most significant bits of both l and u are the same, so we shift them out. After the
update we get
Transmitted sequence: 011000
l: 000000
u: 110111
The next letter to be encoded is h. The second-order context /bt has not occurred previ-
ously, so we move directly to the first-order context t. The letter h has occurred previously
in this context, so we update l and u and obtain
Transmitted sequence: 0110000
l: 000000
u: 110101
Table 6.6 Count array for zero-order context.
Letter   Count   Cum_Count
t        2       2
h        2       4
i        2       6
s        2       8
/b       2       10
Esc      1       11
Total Count      11
Table 6.7 Count array for first-order contexts.
Context   Letter   Count   Cum_Count
t         h        2       2
          Esc      1       3
          Total Count      3
h         i        1       1
          Esc      1       2
          Total Count      2
i         s        2       2
          Esc      1       3
          Total Count      3
/b        i        1       1
          t        1       2
          Esc      1       3
          Total Count      3
s         /b       2       2
          Esc      1       3
          Total Count      3
Table 6.8 Count array for second-order contexts.
Context   Letter   Count   Cum_Count
th        i        1       1
          Esc      1       2
          Total Count      2
hi        s        1       1
          Esc      1       2
          Total Count      2
is        /b       2       2
          Esc      1       3
          Total Count      3
s/b       i        1       1
          t        1       2
          Esc      1       3
          Total Count      3
/bi       s        1       1
          Esc      1       2
          Total Count      2
/bt       h        1       1
          Esc      1       2
          Total Count      2
The method of encoding should now be clear. At this point the various counts are as
shown in Tables 6.6–6.8. 
Now that we have an idea of how the ppm algorithm works, let’s examine some of the
variations.
6.3.2 The Escape Symbol
In our example we used a count of one for the escape symbol, thus inflating the total count in
each context by one. Cleary and Witten call this Method A, and the corresponding algorithm
is referred to as ppma. There is really no obvious justification for assigning a count of one
to the escape symbol. For that matter, there is no obvious method of assigning counts to the
escape symbol. There have been various methods reported in the literature.
Another method described by Cleary and Witten is to reduce the counts of each symbol
by one and assign these counts to the escape symbol. For example, suppose in a given
Table 6.9 Counts using Method A.
Context Symbol Count
prob a 10
l 9
o 3
Esc 1
Total Count 23
Table 6.10 Counts using Method B.
Context Symbol Count
prob a 9
l 8
o 2
Esc 3
Total Count 22
sequence a occurs 10 times in the context of prob, l occurs 9 times, and o occurs 3 times
in the same context (e.g., problem, proboscis, etc.). In Method A we assign a count of one
to the escape symbol, resulting in a total count of 23, which is one more than the number of
times prob has occurred. The situation is shown in Table 6.9.
In this second method, known as Method B, we reduce the count of each of the symbols
a, l, and o by one and give the escape symbol a count of three, resulting in the counts shown
in Table 6.10.
The reasoning behind this approach is that if in a particular context more symbols can
occur, there is a greater likelihood that there is a symbol in this context that has not occurred
before. This increases the likelihood that the escape symbol will be used. Therefore, we
should assign a higher probability to the escape symbol.
A variant of Method B, appropriately named Method C, was proposed by Moffat [66].
In Method C, the count assigned to the escape symbol is the number of symbols that have
occurred in that context. In this respect, Method C is similar to Method B. The difference
comes in the fact that, instead of “robbing” this from the counts of individual symbols, the
total count is inflated by this amount. This situation is shown in Table 6.11.
While there is some variation in the performance depending on the characteristics of the
data being encoded, of the three methods for assigning counts to the escape symbol, on the
average, Method C seems to provide the best performance.
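The three assignments can be compared side by side with a small sketch (the function name is ours; the counts are those of Tables 6.9 through 6.11):

def escape_counts(symbol_counts, method):
    """Sketch of the three escape-count assignments discussed above, applied to a
    raw dict of symbol counts for one context, e.g. {'a': 10, 'l': 9, 'o': 3}."""
    counts = dict(symbol_counts)
    if method == "A":                 # ppma: the escape always gets a count of one
        counts["Esc"] = 1
    elif method == "B":               # rob one count from each symbol
        counts = {s: c - 1 for s, c in counts.items()}
        counts["Esc"] = len(symbol_counts)
    elif method == "C":               # ppmc: escape count = number of distinct symbols
        counts["Esc"] = len(symbol_counts)
    return counts, sum(counts.values())

for m in "ABC":
    print(m, escape_counts({"a": 10, "l": 9, "o": 3}, m))
# Totals of 23, 22, and 25, matching Tables 6.9, 6.10, and 6.11.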
6.3.3 Length of Context
It would seem that as far as the maximum length of the contexts is concerned, more is
better. However, this is not necessarily true. A longer maximum length will usually result
TABLE 6.11 Counts using Method C.

    Context      Symbol   Count
    prob         a        10
                 l        9
                 o        3
                 Esc      3
    Total Count           25
in a higher probability if the symbol to be encoded has a nonzero count with respect to
that context. However, a long maximum length also means a higher probability of long
sequences of escapes, which in turn can increase the number of bits used to encode the
sequence. If we plot the compression performance versus maximum context length, we
see an initial sharp increase in performance until some value of the maximum length,
followed by a steady drop as the maximum length is further increased. The value at which
we see a downturn in performance changes depending on the characteristics of the source
sequence.
An alternative to the policy of a fixed maximum length is used in the algorithm ppm* [67].
This algorithm uses the fact that long contexts that give only a single prediction are seldom
followed by a new symbol. If mike has always been followed by y in the past, it will probably
not be followed by /b the next time it is encountered. Contexts that are always followed by
the same symbol are called deterministic contexts. The ppm* algorithm first looks for the
longest deterministic context. If the symbol to be encoded does not occur in that context, an
escape symbol is encoded and the algorithm defaults to the maximum context length. This
approach seems to provide a small but significant amount of improvement over the basic
algorithm. Currently, the best variant of the ppm* algorithm is the ppmz algorithm by Charles
Bloom. Details of the ppmz algorithm as well as implementations of the algorithm can be
found at http://www.cbloom.com/src/ppmz.html.
6.3.4 The Exclusion Principle
The basic idea behind arithmetic coding is the division of the unit interval into subintervals,
each of which represents a particular letter. The smaller the subinterval, the more bits are
required to distinguish it from other subintervals. If we can reduce the number of symbols
to be represented, the number of subintervals goes down as well. This in turn means that
the sizes of the subintervals increase, leading to a reduction in the number of bits required
for encoding. The exclusion principle used in ppm provides this kind of reduction in rate.
Suppose we have been compressing a text sequence and come upon the sequence proba,
and suppose we are trying to encode the letter a. Suppose also that the state of the two-letter
context ob and the one-letter context b are as shown in Table 6.12.
First we attempt to encode a with the two-letter context. As a does not occur in this
context, we issue an escape symbol and reduce the size of the context. Looking at the table
for the one-letter context b, we see that a does occur in this context with a count of 4 out of a
total possible count of 21. Notice that other letters in this context include l and o. However,
TABLE 6.12 Counts for exclusion example.

    Context      Symbol   Count
    ob           l        10
                 o        3
                 Esc      2
    Total Count           15
    b            l        5
                 o        3
                 a        4
                 r        2
                 e        2
                 Esc      5
    Total Count           21
TABLE 6.13 Modified table used for exclusion example.

    Context      Symbol   Count
    b            a        4
                 r        2
                 e        2
                 Esc      3
    Total Count           11
by sending the escape symbol in the context of ob, we have already signalled to the decoder
that the symbol being encoded is not any of the letters that have previously been encountered
in the context of ob. Therefore, we can increase the size of the subinterval corresponding
to a by temporarily removing l and o from the table. Instead of using Table 6.12, we use
Table 6.13 to encode a. This exclusion of symbols from contexts on a temporary basis can
result in cumulatively significant savings in terms of rate.
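As a rough check on the benefit, the short calculation below compares the number of bits an arithmetic coder would need for a in the one-letter context b with and without exclusion, using the counts of Tables 6.12 and 6.13 and assuming the coder comes close to the ideal -log2(probability) bits.

from math import log2

# Probability of 'a' in context b without exclusion (Table 6.12)
p_plain = 4 / 21
# Probability of 'a' after temporarily excluding l and o, which the escape
# from the ob context has already ruled out (Table 6.13)
p_excl = 4 / 11

print(f"without exclusion: {-log2(p_plain):.2f} bits")   # about 2.39 bits
print(f"with exclusion:    {-log2(p_excl):.2f} bits")    # about 1.46 bits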
You may have noticed that we keep talking about small but significant savings. In lossless
compression schemes, there is usually a basic principle, such as the idea of prediction with
partial match, followed by a host of relatively small modifications. The importance of these
modifications should not be underestimated because often together they provide the margin
of compression that makes a particular scheme competitive.
6.4 The Burrows-Wheeler Transform
The Burrows-Wheeler Transform (BWT) algorithm also uses the context of the symbol
being encoded, but in a very different way, for lossless compression. The transform that
is a major part of this algorithm was developed by Wheeler in 1983. However, the BWT
compression algorithm, which uses this transform, saw the light of day in 1994 [68]. Unlike
most of the previous algorithms we have looked at, the BWT algorithm requires that the
entire sequence to be coded be available to the encoder before the coding takes place. Also,
unlike most of the previous algorithms, the decoding procedure is not immediately evident
once we know the encoding procedure. We will first describe the encoding procedure. If
it is not clear how this particular encoding can be reversed, bear with us and we will get
to it.
The algorithm can be summarized as follows. Given a sequence of length N, we create
N −1 other sequences where each of these N −1 sequences is a cyclic shift of the original
sequence. These N sequences are arranged in lexicographic order. The encoder then transmits
the sequence of length N created by taking the last letter of each sorted, cyclically shifted,
sequence. This sequence of last letters L, and the position of the original sequence in the
sorted list, are coded and sent to the decoder. As we shall see, this information is sufficient
to recover the original sequence.
We start with a sequence of length N and end with a representation that contains
N + 1 elements. However, this sequence has a structure that makes it highly amenable to
compression. In particular we will use a method of coding called move-to-front (mtf), which
is particularly effective on the type of structure exhibited by the sequence L.
Before we describe the mtf approach, let us work through an example to generate the L
sequence.
Example 6.4.1:
Let’s encode the sequence

    this/bis/bthe
We start with all the cyclic permutations of this sequence. As there are a total of 11 characters,
there are 11 permutations, shown in Table 6.14.
TABLE 6.14 Permutations of this/bis/bthe.

    0   t h i s /b i s /b t h e
    1   h i s /b i s /b t h e t
    2   i s /b i s /b t h e t h
    3   s /b i s /b t h e t h i
    4   /b i s /b t h e t h i s
    5   i s /b t h e t h i s /b
    6   s /b t h e t h i s /b i
    7   /b t h e t h i s /b i s
    8   t h e t h i s /b i s /b
    9   h e t h i s /b i s /b t
    10  e t h i s /b i s /b t h
TABLE 6.15 Sequences sorted into lexicographic order.

    0   /b i s /b t h e t h i s
    1   /b t h e t h i s /b i s
    2   e t h i s /b i s /b t h
    3   h e t h i s /b i s /b t
    4   h i s /b i s /b t h e t
    5   i s /b i s /b t h e t h
    6   i s /b t h e t h i s /b
    7   s /b i s /b t h e t h i
    8   s /b t h e t h i s /b i
    9   t h e t h i s /b i s /b
    10  t h i s /b i s /b t h e
Now let’s sort these sequences in lexicographic (dictionary) order (Table 6.15). The
sequence of last letters L in this case is

    L = sshtth/bii/be
Notice how like letters have come together. If we had a longer sequence of letters, the runs
of like letters would have been even longer. The mtf algorithm, which we will describe later,
takes advantage of these runs.
The original sequence appears as sequence number 10 in the sorted list, so the encoding
of the sequence consists of the sequence L and the index value 10. 
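A direct, if inefficient, rendering of this encoding step is sketched below; practical BWT implementations use suffix sorting rather than materializing all the rotations, and the underscore standing in for /b as well as the function name are our own choices.

# Sketch of the forward Burrows-Wheeler Transform by brute force: build all
# cyclic shifts, sort them, and keep the last column L together with the
# position of the original sequence in the sorted list.
def bwt_encode(s):
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))   # N cyclic shifts, sorted
    L = "".join(row[-1] for row in rotations)              # sequence of last letters
    index = rotations.index(s)                              # row of the original
    return L, index

L, index = bwt_encode("this_is_the")    # '_' stands in for the /b symbol
print(L, index)                          # sshtth_ii_e 10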
Now that we have an encoding of the sequence, let’s see how we can decode the
original sequence by using the sequence L and the index to the original sequence in the
sorted list. The important thing to note is that all the elements of the initial sequence are
contained in L. We just need to figure out the permutation that will let us recover the original
sequence.
The first step in obtaining the permutation is to generate the sequence F consisting of
the first element of each row. That is simple to do because we lexicographically ordered the
sequences. Therefore, the sequence F is simply the sequence L in lexicographic order. In
our example this means that F is given as

    F = /b/behhiisstt
We can use L and F to generate the original sequence. Look at Table 6.15 containing
the cyclically shifted sequences sorted in lexicographic order. Because each row is a cyclical
shift, the letter in the first column of any row is the letter appearing after the last column
in the row in the original sequence. If we know that the original sequence is in the kth
row, then we can begin unraveling the original sequence starting with the kth
element
of F.
Example 6.4.2:
In our example

    F = [ /b  /b  e  h  h  i  i  s  s  t  t ]
    L = [ s   s   h  t  t  h  /b i  i  /b e ]
The original sequence is sequence number 10, so the first letter of the original sequence is
F10 = t. To find the letter following t we look for t in the array L. There are two t’s in L.
Which should we use? The t in F that we are working with is the lower of the two t’s, so we
pick the lower of the two t’s in L. This is L4. Therefore, the next letter in our reconstructed
sequence is F4 = h. The reconstructed sequence to this point is th. To find the next letter,
we look for h in the L array. Again there are two h’s. The h at F4 is the lower of the two
h’s in F, so we pick the lower of the two h’s in L. This is the fifth element of L, so the
next element in our decoded sequence is F5 = i. The decoded sequence to this point is thi.
The process continues as depicted in Figure 6.1 to generate the original sequence.
FIGURE 6.1 Decoding process (the F and L columns, indexed 0 through 10).
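The decoding procedure just described can be written compactly. The sketch below follows the rank-matching rule from the example: the kth occurrence of a letter in F corresponds to the kth occurrence of that letter in L. The names and the underscore notation are again our own.

from collections import defaultdict

# Sketch of Burrows-Wheeler decoding from L and the index of the original row.
def bwt_decode(L, index):
    n = len(L)
    F = sorted(L)                        # first column: L in lexicographic order
    occ = defaultdict(list)              # positions of each letter in L, in order
    for j, ch in enumerate(L):
        occ[ch].append(j)
    seen = defaultdict(int)
    rank = []                            # rank[i]: which occurrence of F[i] this is
    for ch in F:
        rank.append(seen[ch])
        seen[ch] += 1
    out, k = [], index
    for _ in range(n):
        out.append(F[k])
        # the same letter occurrence sits in L at occ[F[k]][rank[k]];
        # the letter that follows it in the original sequence is F at that row
        k = occ[F[k]][rank[k]]
    return "".join(out)

print(bwt_decode("sshtth_ii_e", 10))     # this_is_the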
Why go through all this trouble? After all, we are going from a sequence of length N to
another sequence of length N plus an index value. It appears that we are actually causing
expansion instead of compression. The answer is that the sequence L can be compressed
much more efficiently than the original sequence. Even in our small example we have runs
of like symbols. This will happen a lot more when N is large. Consider a large sample of
text that has been cyclically shifted and sorted. Consider all the rows of the sorted matrix
beginning with he/b. With high probability he/b would be preceded by t. Therefore, in L we
would get a long run of ts.
6.4.1 Move-to-Front Coding
A coding scheme that takes advantage of long runs of identical symbols is the move-to-front
(mtf) coding. In this coding scheme, we start with some initial listing of the source alphabet.
The symbol at the top of the list is assigned the number 0, the next one is assigned the
number 1, and so on. The first time a particular symbol occurs, the number corresponding
to its place in the list is transmitted. Then it is moved to the top of the list. If we have a
run of this symbol, we transmit a sequence of 0s. This way, long runs of identical symbols
get transformed into a large number of 0s. Applying this technique to our example does not
produce very impressive results due to the small size of the sequence, but we can see how
the technique functions.
Example 6.4.3:
Let’s encode L = sshtth/bii/be. Let’s assume that the source alphabet is given by

    A = {/b, e, h, i, s, t}
We start out with the assignment
    0   1   2   3   4   5
    /b  e   h   i   s   t

The first element of L is s, which gets encoded as a 4. We then move s to the top of the
list, which gives us

    0   1   2   3   4   5
    s   /b  e   h   i   t
The next s is encoded as 0. Because s is already at the top of the list, we do not need to
make any changes. The next letter is h, which we encode as 3. We then move h to the top
of the list:
    0   1   2   3   4   5
    h   s   /b  e   i   t
The next letter is t, which gets encoded as 5. Moving t to the top of the list, we get
    0   1   2   3   4   5
    t   h   s   /b  e   i
The next letter is also a t, so that gets encoded as a 0.
Continuing in this fashion, we get the sequence
4 0 3 5 0 1 3 5 0 1 5
As we warned, the results are not too impressive with this small sequence, but we can see
how we would get large numbers of 0s and small values if the sequence to be encoded was
longer. 
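A direct rendering of move-to-front coding is sketched below; it reproduces the sequence 4 0 3 5 0 1 3 5 0 1 5 from this example. The list handling and names are our own, and '_' again stands in for /b.

# Sketch of move-to-front coding and decoding.
def mtf_encode(sequence, alphabet):
    table = list(alphabet)               # working list; index 0 is the "top"
    codes = []
    for symbol in sequence:
        i = table.index(symbol)
        codes.append(i)
        table.pop(i)                     # move the symbol to the top of the list
        table.insert(0, symbol)
    return codes

def mtf_decode(codes, alphabet):
    table = list(alphabet)
    out = []
    for i in codes:
        symbol = table[i]
        out.append(symbol)
        table.pop(i)
        table.insert(0, symbol)
    return "".join(out)

L = "sshtth_ii_e"
codes = mtf_encode(L, "_ehist")
print(codes)                             # [4, 0, 3, 5, 0, 1, 3, 5, 0, 1, 5]
print(mtf_decode(codes, "_ehist") == L)  # True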
6.5 Associative Coder of Buyanovsky (ACB)
A different approach to using contexts for compression is employed by the eponymous
compression utility developed by George Buyanovsky. The details of this very efficient coder
are not well known; however, the way the context is used is interesting and we will briefly
describe this aspect of ACB. More detailed descriptions are available in [69] and [70]. The
ACB coder develops a sorted dictionary of all encountered contexts. In this it is similar to
other context based encoders. However, it also keeps track of the contents of these contexts.
The content of a context is what appears after the context. In a traditional left-to-right reading
of text, the contexts are unbounded to the left and the contents to the right (to the limits
of text that has already been encoded). When encoding the coder searches for the longest
match to the current context reading right to left. This again is not an unusual thing to do.
What is interesting is what the coder does after the best match is found. Instead of simply
examining the content corresponding to the best matched context, the coder also examines
the contents of the contexts in the neighborhood of the best matched context. Fenwick [69]
describes this process as first finding an anchor point then searching the contents of the
neighboring contexts for the best match. The location of the anchor point is known to both
the encoder and the decoder. The location of the best content match is signalled to the
decoder by encoding the offset of the context of this content from the anchor point. We
have not specified what we mean by “best” match. The coder takes the utilitarian approach
that the best match is the one that ends up providing the most compression. Thus, a longer
match farther away from the anchor may not be as advantageous as a shorter match closer
to the anchor because of the number of bits required to encode the offset. The length of the match
is also sent to the decoder.
The interesting aspect of this scheme is that it moves away from the idea of exactly
matching the past. It provides a much richer environment and flexibility to enhance the
compression and will, hopefully, provide a fruitful avenue for further research.
6.6 Dynamic Markov Compression
Quite often the probabilities of the value that the next symbol in a sequence takes on depend
not only on the current value but on the past values as well. The ppm scheme relies on
this longer-range correlation. The ppm scheme, in some sense, reflects the application, that
is, text compression, for which it is most used. Dynamic Markov compression (DMC),
introduced by Cormack and Horspool [71], uses a more general framework to take advantage
of relationships and correlations, or contexts, that extend beyond a single symbol.
Consider the sequence of pixels in a scanned document. The sequence consists of runs
of black and white pixels. If we represent black by 0 and white by 1, we have runs of 0s
and 1s. If the current value is 0, the probability that the next value is 0 is higher than if
the current value was 1. The fact that we have two different sets of probabilities is reflected
in the two-state model shown in Figure 6.2. Consider state A. The probability of the next
value being 1 changes depending on whether we reached state A from state B or from state
A itself. We can have the model reflect this by cloning state A, as shown in Figure 6.3, to
create state A′. Now if we see a white pixel after a run of black pixels, we go to state A′.
The probability that the next value will be 1 is very high in this state. This way, when we
estimate probabilities for the next pixel value, we take into account not only the value of
the current pixel but also the value of the previous pixel.
This process can be continued as long as we wish to take into account longer and longer
histories. “As long as we wish” is a rather vague statement when it comes to implementing
the algorithm. In fact, we have been rather vague about a number of implementation issues.
We will attempt to rectify the situation.
There are a number of issues that need to be addressed in order to implement this
algorithm:
1. What is the initial number of states?
2. How do we estimate probabilities?
FIGURE 6.2 A two-state model for binary sequences.
FIGURE 6.3 A three-state model obtained by cloning.
3. How do we decide when a state needs to be cloned?
4. What do we do when the number of states becomes too large?
Let’s answer each question in turn.
We can start the encoding process with a single state with two self-loops for 0 and 1.
This state can be cloned to two and then a higher number of states. In practice it has been
found that, depending on the particular application, it is more efficient to start with a larger
number of states than one.
The probabilities from a given state can be estimated by simply counting the number of
times a 0 or a 1 occurs in that state divided by the number of times the particular state is
occupied. For example, if in state V the number of times a 0 occurs is denoted by nV
0 and
the number of times a 1 occurs is denoted by nV
1 , then
P0V =
nV
0
nV
0 +nV
1
P1V =
nV
1
nV
0 +nV
1

What if a 1 has never previously occurred in this state? This approach would assign a
probability of zero to the occurrence of a 1. This means that there will be no subinterval
assigned to the possibility of a 1 occurring, and when it does occur, we will not be able to
represent it. In order to avoid this, instead of counting from zero, we start the count of 1s
and 0s with a small number c and estimate the probabilities as
    P(0|V) = (n0(V) + c) / (n0(V) + n1(V) + 2c)
    P(1|V) = (n1(V) + c) / (n0(V) + n1(V) + 2c)
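The estimate is straightforward to state in code. The sketch below keeps per-state counts and returns the smoothed probabilities; c is a tunable constant, and the 0.2 used here is only a placeholder, not a value prescribed by DMC.

# Sketch of the smoothed per-state probability estimate used in DMC.
# counts[state] = [number of 0s seen, number of 1s seen] in that state.
def bit_probabilities(counts, state, c=0.2):
    n0, n1 = counts[state]
    total = n0 + n1 + 2 * c
    return (n0 + c) / total, (n1 + c) / total

counts = {"A": [12, 3]}                  # example: twelve 0s and three 1s seen in state A
print(bit_probabilities(counts, "A"))    # roughly (0.79, 0.21)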
FIGURE 6.4 The cloning process (state C, which is entered from both A and B, is split into C and C′).
Whenever we have two branches leading to a state, it can be cloned. And, theoretically,
cloning is never harmful. By cloning we are providing additional information to the encoder.
This might not reduce the rate, but it should never result in an increase in the rate. However,
cloning does increase the complexity of the coding, and hence the decoding, process. In
order to control the increase in the number of states, we should only perform cloning when
there is a reasonable expectation of reduction in rate. We can do this by making sure that
both paths leading to the state being considered for cloning are used often enough. Consider
the situation shown in Figure 6.4. Suppose the current state is A and the next state is C. As
there are two paths entering C, C is a candidate for cloning. Cormack and Horspool suggest
that C be cloned if n0(A) > T1 and n0(B) > T2, where T1 and T2 are threshold values set by the
user. If there are more than three paths leading to a candidate for cloning, then we check
that both the number of transitions from the current state is greater than T1 and the number
of transitions from all other states to the candidate state is greater than T2.
Finally, what do we do when, for practical reasons, we cannot accommodate any more
states? A simple solution is to restart the algorithm. In order to make sure that we do not
start from ground zero every time, we can train the initial state configuration using a certain
number of past inputs.
6.7 Summary
The context in which a symbol occurs can be very informative about the value that the
symbol takes on. If this context is known to the decoder then this information need not be
encoded: it can be inferred by the decoder. In this chapter we have looked at several creative
ways in which the knowledge of the context can be used to provide compression.
Further Reading
1. The basic ppm algorithm is described in detail in Text Compression, by T.C. Bell,
J.G. Cleary, and I.H. Witten [1].
2. For an excellent description of Burrows-Wheeler Coding, including methods of imple-
mentation and improvements to the basic algorithm, see “Burrows-Wheeler Compres-
sion,” by P. Fenwick [72] in Lossless Compression Handbook.
3. The ACB algorithm is described in “Symbol Ranking and ACB Compression,” by
P. Fenwick [69] in the Lossless Compression Handbook, and in Data Compression:
The Complete Reference by D. Salomon [70]. The chapter by Fenwick also explores
compression schemes based on Shannon’s experiments.
6.8 Projects and Problems
1. Decode the bitstream generated in Example 6.3.1. Assume you have already decoded
   this/bis and Tables 6.1–6.4 are available to you.

2. Given the sequence the/bbeta/bcat/bate/bthe/bceta/bhat:
   (a) Encode the sequence using the ppma algorithm and an adaptive arithmetic coder.
       Assume a six-letter alphabet {h, e, t, a, c, /b}.
   (b) Decode the encoded sequence.

3. Given the sequence eta/bceta/band/bbeta/bceta:
   (a) Encode using the Burrows-Wheeler transform and move-to-front coding.
   (b) Decode the encoded sequence.

4. A sequence is encoded using the Burrows-Wheeler transform. Given L = elbkkee,
   and index = 5 (we start counting from 1, not 0), find the original sequence.
7
Lossless Image Compression
7.1 Overview
In this chapter we examine a number of schemes used for lossless compression
of images. We will look at schemes for compression of grayscale and color
images as well as schemes for compression of binary images. Among these
schemes are several that are a part of international standards.
7.2 Introduction
In the previous chapters we have focused on compression techniques. Although some of
them may apply to some preferred applications, the focus has been on the technique rather
than on the application. However, there are certain techniques for which it is impossible
to separate the technique from the application. This is because the techniques rely upon
the properties or characteristics of the application. Therefore, we have several chapters in
this book that focus on particular applications. In this chapter we will examine techniques
specifically geared toward lossless image compression. Later chapters will examine speech,
audio, and video compression.
In the previous chapters we have seen that a more skewed set of probabilities for the
message being encoded results in better compression. In Chapter 6 we saw how the use of
context to obtain a skewed set of probabilities can be especially effective when encoding
text. We can also transform the sequence (in an invertible fashion) into another sequence
that has the desired property in other ways. For example, consider the following sequence:
1 2 5 7 2 −2 0 −5 −3 −1 1 −2 −7 −4 −2 1 3 4
If we consider this sample to be fairly typical of the sequence, we can see that the probability
of any given number being in the range from −7 to 7 is about the same. If we were to
encode this sequence using a Huffman or arithmetic code, we would use almost 4 bits per
symbol.
Instead of encoding this sequence directly, we could do the following: add two to the
previous number in the sequence and send the difference between the current element in the
sequence and this predicted value. The transmitted sequence would be
1 −1 1 0 −7 −6 0 −7 0 0 0 −5 −7 1 0 1 0 −1
This method uses a rule (add two) and the history (value of the previous symbol) to generate
the new sequence. If the rule by which this residual sequence was generated is known to
the decoder, it can recover the original sequence from the residual sequence. The length
of the residual sequence is the same as the original sequence. However, notice that the
residual sequence is much more likely to contain 0s, 1s, and −1s than other values. That is,
the probability of 0, 1, and −1 will be significantly higher than the probabilities of other
numbers. This, in turn, means that the entropy of the residual sequence will be low and,
therefore, provide more compression.
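A minimal sketch of this residual computation and its inverse is shown below. The first element is predicted as zero here, which is one reasonable convention; the text does not spell out how the first sample is handled, so treat that choice as an assumption.

# Sketch of the "add two to the previous value" prediction rule and its inverse.
def residuals(x, first_pred=0):
    out, prev = [], None
    for value in x:
        pred = first_pred if prev is None else prev + 2
        out.append(value - pred)
        prev = value
    return out

def reconstruct(r, first_pred=0):
    out, prev = [], None
    for res in r:
        pred = first_pred if prev is None else prev + 2
        prev = pred + res
        out.append(prev)
    return out

seq = [1, 2, 5, 7, 2, -2, 0, -5, -3, -1, 1, -2, -7, -4, -2, 1, 3, 4]
r = residuals(seq)
print(r)                         # values clustered around zero
print(reconstruct(r) == seq)     # True: the rule is invertible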
We used a particular method of prediction in this example (add two to the previous
element of the sequence) that was specific to this sequence. In order to get the best possible
performance, we need to find the prediction approach that is best suited to the particular
data we are dealing with. We will look at several prediction schemes used for lossless image
compression in the following sections.
7.2.1 The Old JPEG Standard
The Joint Photographic Experts Group (JPEG) is a joint ISO/ITU committee responsible
for developing standards for continuous-tone still-picture coding. The more famous standard
produced by this group is the lossy image compression standard. However, at the time of the
creation of the famous JPEG standard, the committee also created a lossless standard [73].
At this time the standard is more or less obsolete, having been overtaken by the much more
efficient JPEG-LS standard described later in this chapter. However, the old JPEG standard
is still useful as a first step into examining predictive coding in images.
The old JPEG lossless still compression standard [73] provides eight different predictive
schemes from which the user can select. The first scheme makes no prediction. The next
seven are listed below. Three of the seven are one-dimensional predictors, and four are
two-dimensional prediction schemes. Here, I(i, j) is the (i, j)th pixel of the original image,
and Î(i, j) is the predicted value for the (i, j)th pixel.

    1   Î(i, j) = I(i − 1, j)                                        (7.1)
    2   Î(i, j) = I(i, j − 1)                                        (7.2)
    3   Î(i, j) = I(i − 1, j − 1)                                    (7.3)
    4   Î(i, j) = I(i, j − 1) + I(i − 1, j) − I(i − 1, j − 1)        (7.4)
    5   Î(i, j) = I(i, j − 1) + (I(i − 1, j) − I(i − 1, j − 1))/2    (7.5)
    6   Î(i, j) = I(i − 1, j) + (I(i, j − 1) − I(i − 1, j − 1))/2    (7.6)
    7   Î(i, j) = (I(i, j − 1) + I(i − 1, j))/2                      (7.7)
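The seven predictors translate directly into code. The sketch below applies a chosen mode to a grayscale image stored as a 2-D list, treating out-of-range neighbors as zero; the actual standard specifies particular boundary conventions, so this is only an illustration, and the names are ours.

# Sketch of the old-JPEG lossless predictors of Equations (7.1)-(7.7).
# N, W, and NW denote the pixels above, to the left, and above-left of (i, j).
def jpeg_predict(img, i, j, mode):
    def pix(r, c):
        return img[r][c] if r >= 0 and c >= 0 else 0   # simplistic boundary rule
    N, W, NW = pix(i - 1, j), pix(i, j - 1), pix(i - 1, j - 1)
    return {
        1: N,
        2: W,
        3: NW,
        4: W + N - NW,
        5: W + (N - NW) // 2,
        6: N + (W - NW) // 2,
        7: (W + N) // 2,
    }[mode]

def residual_image(img, mode):
    return [[img[i][j] - jpeg_predict(img, i, j, mode)
             for j in range(len(img[0]))]
            for i in range(len(img))]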
Different images can have different structures that can be best exploited by one of these
eight modes of prediction. If compression is performed in a nonreal-time environment—for
example, for the purposes of archiving—all eight modes of prediction can be tried and the
one that gives the most compression is used. The mode used to perform the prediction can
be stored in a 3-bit header along with the compressed file. We encoded our four test images
using the various JPEG modes. The residual images were encoded using adaptive arithmetic
coding. The results are shown in Table 7.1.
The best results—that is, the smallest compressed file sizes—are indicated in bold in
the table. From these results we can see that a different JPEG predictor is the best for the
different images. In Table 7.2, we compare the best JPEG results with the file sizes obtained
using GIF and PNG. Note that PNG also uses predictive coding with four possible predictors,
where each row of the image can be encoded using a different predictor. The PNG approach
is described in Chapter 5.
Even if we take into account the overhead associated with GIF, from this comparison
we can see that the predictive approaches are generally better suited to lossless image
compression than the dictionary-based approach when the images are “natural” gray-scale
images. The situation is different when the images are graphic images or pseudocolor images.
A possible exception could be the Earth image. The best compressed file size using the
second JPEG mode and adaptive arithmetic coding is 32,137 bytes, compared to 34,276
bytes using GIF. The difference between the file sizes is not significant. We can see the
reason by looking at the Earth image. Note that a significant portion of the image is the
TABLE 7.1 Compressed file size in bytes of the residual images obtained using the various JPEG prediction modes.

    Image    JPEG 0   JPEG 1   JPEG 2   JPEG 3   JPEG 4   JPEG 5   JPEG 6   JPEG 7
    Sena     53,431   37,220   31,559   38,261   31,055   29,742   33,063   32,179
    Sensin   58,306   41,298   37,126   43,445   32,429   33,463   35,965   36,428
    Earth    38,248   32,295   32,137   34,089   33,570   33,057   33,072   32,672
    Omaha    56,061   48,818   51,283   53,909   53,771   53,520   52,542   52,189
TABLE 7.2 Comparison of the file sizes obtained using JPEG lossless compression, GIF, and PNG.

    Image    Best JPEG   GIF      PNG
    Sena     31,055      51,085   31,577
    Sensin   32,429      60,649   34,488
    Earth    32,137      34,276   26,995
    Omaha    48,818      61,341   50,185
background, which is of a constant value. In dictionary coding, this would result in some
very long entries that would provide significant compression. We can see that if the ratio of
background to foreground were just a little different in this image, the dictionary method in
GIF might have outperformed the JPEG approach. The PNG approach, which allows the use
of a different predictor (or no predictor) on each row prior to dictionary coding, significantly
outperforms both GIF and JPEG on this image.
7.3 CALIC
The Context Adaptive Lossless Image Compression (CALIC) scheme, which came into
being in response to a call for proposal for a new lossless image compression scheme in
1994 [74, 75], uses both context and prediction of the pixel values. The CALIC scheme
actually functions in two modes, one for gray-scale images and another for bi-level images.
In this section, we will concentrate on the compression of gray-scale images.
In an image, a given pixel generally has a value close to one of its neighbors. Which
neighbor has the closest value depends on the local structure of the image. Depending on
whether there is a horizontal or vertical edge in the neighborhood of the pixel being encoded,
the pixel above, or the pixel to the left, or some weighted average of neighboring pixels may
give the best prediction. How close the prediction is to the pixel being encoded depends
on the surrounding texture. In a region of the image with a great deal of variability, the
prediction is likely to be further from the pixel being encoded than in the regions with less
variability.
In order to take into account all these factors, the algorithm has to make a determination
of the environment of the pixel to be encoded. The only information that can be used to
make this determination has to be available to both encoder and decoder.
Let’s take up the question of the presence of vertical or horizontal edges in the neigh-
borhood of the pixel being encoded. To help our discussion, we will refer to Figure 7.1. In
this figure, the pixel to be encoded has been marked with an X. The pixel above is called
the north pixel, the pixel to the left is the west pixel, and so on. Note that when pixel X is
being encoded, all the other marked pixels (N, W, NW, NE, WW, NN, and NNE) are
available to both encoder and decoder.
              NN   NNE
         NW   N    NE
    WW   W    X

FIGURE 7.1 Labeling the neighbors of pixel X.
We can get an idea of what kinds of boundaries may or may not be in the neighborhood
of X by computing
    dh = |W − WW| + |N − NW| + |NE − N|
    dv = |W − NW| + |N − NN| + |NE − NNE|
The relative values of dh and dv are used to obtain the initial prediction of the pixel X.
This initial prediction is then refined by taking other factors into account. If the value of dh
is much higher than the value of dv, this will mean there is a large amount of horizontal
variation, and it would be better to pick N to be the initial prediction. If, on the other hand,
dv is much larger than dh, this would mean that there is a large amount of vertical variation,
and the initial prediction is taken to be W. If the differences are more moderate or smaller,
the predicted value is a weighted average of the neighboring pixels.
The exact algorithm used by CALIC to form the initial prediction is given by the
following pseudocode:
if dh − dv > 80
    X̂ ← N
else if dv − dh > 80
    X̂ ← W
else
{
    X̂ ← (N + W)/2 + (NE − NW)/4
    if dh − dv > 32
        X̂ ← (X̂ + N)/2
    else if dv − dh > 32
        X̂ ← (X̂ + W)/2
    else if dh − dv > 8
        X̂ ← (3X̂ + N)/4
    else if dv − dh > 8
        X̂ ← (3X̂ + W)/4
}
Using the information about whether the pixel values are changing by large or small
amounts in the vertical or horizontal direction in the neighborhood of the pixel being encoded
provides a good initial prediction. In order to refine this prediction, we need some information
about the interrelationships of the pixels in the neighborhood. Using this information, we
can generate an offset or refinement to our initial prediction. We quantify the information
about the neighborhood by first forming the vector
    [N, W, NW, NE, NN, WW, 2N − NN, 2W − WW]
We then compare each component of this vector with our initial prediction X̂. If the value
of the component is less than the prediction, we replace the value with a 1; otherwise
we replace it with a 0. Thus, we end up with an eight-component binary vector. If each
component of the binary vector was independent, we would end up with 256 possible vectors.
However, because of the dependence of various components, we actually have 144 possible
configurations. We also compute a quantity that incorporates the vertical and horizontal
variations and the previous error in prediction by

    δ = dh + dv + 2|N − N̂|                               (7.8)

where N̂ is the predicted value of N. This range of values of δ is divided into four intervals,
each being represented by 2 bits. These four possibilities, along with the 144 texture
descriptors, create 144 × 4 = 576 contexts for X. As the encoding proceeds, we keep track
of how much prediction error is generated in each context and offset our initial prediction
by that amount. This results in the final predicted value.
Once the prediction is obtained, the difference between the pixel value and the prediction
(the prediction error, or residual) has to be encoded. While the prediction process outlined
above removes a lot of the structure that was in the original sequence, there is still some
structure left in the residual sequence. We can take advantage of some of this structure by
coding the residual in terms of its context. The context of the residual is taken to be the
value of δ defined in Equation (7.8). In order to reduce the complexity of the encoding,
rather than using the actual value as the context, CALIC uses the range of values in which
δ lies as the context. Thus:

    0 ≤ δ < q1   ⇒  Context 1
    q1 ≤ δ < q2  ⇒  Context 2
    q2 ≤ δ < q3  ⇒  Context 3
    q3 ≤ δ < q4  ⇒  Context 4
    q4 ≤ δ < q5  ⇒  Context 5
    q5 ≤ δ < q6  ⇒  Context 6
    q6 ≤ δ < q7  ⇒  Context 7
    q7 ≤ δ < q8  ⇒  Context 8
The values of q1–q8 can be prescribed by the user.
If the original pixel values lie between 0 and M −1, the differences or prediction residuals
will lie between −(M − 1) and M − 1. Even though most of the differences will have a
magnitude close to zero, for arithmetic coding we still have to assign a count to all possible
symbols. This means a reduction in the size of the intervals assigned to values that do occur,
which in turn means using a larger number of bits to represent these values. The CALIC
algorithm attempts to resolve this problem in a number of ways. Let’s describe these using
an example.
Consider the sequence
xn 0 7 4 3 5 2 1 7
We can see that all the numbers lie between 0 and 7, a range of values that would require
3 bits to represent. Now suppose we predict a sequence element by the previous element in
the sequence. The sequence of differences
rn = xn −xn−1
is given by
rn 0 7 −3 −1 2 −3 −1 6
If we were given this sequence, we could easily recover the original sequence by using
xn = xn−1 +rn
However, the prediction residual values rn lie in the [−7, 7] range. That is, the alphabet
required to represent these values is almost twice the size of the original alphabet. However,
if we look closely we can see that the value of rn actually lies between −xn−1 and 7−xn−1.
The smallest value that rn can take on occurs when xn has a value of 0, in which case rn
will have a value of −xn−1. The largest value that rn can take on occurs when xn is 7, in
which case rn has a value of 7−xn−1. In other words, given a particular value for xn−1, the
number of different values that rn can take on is the same as the number of values that xn
can take on. Generalizing from this, we can see that if a pixel takes on values between 0
and M −1, then given a predicted value X̂, the difference X −X̂ will take on values in the
range −X̂ to M − 1 − X̂. We can use this fact to map the difference values into the range
[0, M − 1], using the following mapping:

    0          →  0
    1          →  1
    −1         →  2
    2          →  3
    ⋮
    −X̂         →  2X̂
    X̂ + 1      →  2X̂ + 1
    X̂ + 2      →  2X̂ + 2
    ⋮
    M − 1 − X̂  →  M − 1
where we have assumed that X̂ ≤ (M − 1)/2.
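A sketch of this remapping and its inverse is given below, under the stated assumption that X̂ ≤ (M − 1)/2 (CALIC handles the symmetric case analogously, which is omitted here); the function names are ours.

# Sketch of CALIC's residual remapping for the case xhat <= (M - 1) / 2.
# The residual r = X - xhat lies in [-xhat, M - 1 - xhat]; the mapped value
# lies in [0, M - 1].
def remap(r, xhat):
    if 0 <= r <= xhat:
        return 2 * r - 1 if r > 0 else 0     # 0 -> 0, 1 -> 1, 2 -> 3, ...
    if r < 0:
        return -2 * r                        # -1 -> 2, -2 -> 4, ..., -xhat -> 2*xhat
    return xhat + r                          # xhat + 1 -> 2*xhat + 1, ..., up to M - 1

def unmap(m, xhat):
    if m <= 2 * xhat:
        return -(m // 2) if m % 2 == 0 else (m + 1) // 2
    return m - xhat

M, xhat = 8, 3
print([remap(r, xhat) for r in range(-xhat, M - xhat)])      # a permutation of 0..7
print(all(unmap(remap(r, xhat), xhat) == r
          for r in range(-xhat, M - xhat)))                  # True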
Another approach used by CALIC to reduce the size of its alphabet is to use a modifi-
cation of a technique called recursive indexing [76]. Recursive indexing is a technique for
representing a large range of numbers using only a small set. It is easiest to explain using an
example. Suppose we want to represent positive integers using only the integers between 0
and 7—that is, a representation alphabet of size 8. Recursive indexing works as follows: If
the number to be represented lies between 0 and 6, we simply represent it by that number.
If the number to be represented is greater than or equal to 7, we first send the number 7,
subtract 7 from the original number, and repeat the process. We keep repeating the process
until the remainder is a number between 0 and 6. Thus, for example, 9 would be represented
by 7 followed by a 2, and 17 would be represented by two 7s followed by a 3. The decoder,
when it sees a number between 0 and 6, would decode it at its face value, and when it
saw 7, would keep accumulating the values until a value between 0 and 6 was received.
This method of representation followed by entropy coding has been shown to be optimal for
sequences that follow a geometric distribution [77].
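The recursive indexing rule is easy to express in code; the sketch below uses a representation alphabet of size 8 as in the example, and the function names are ours.

# Sketch of recursive indexing with representation alphabet {0, 1, ..., q}.
def recursive_index(value, q=7):
    symbols = []
    while value >= q:           # send q and reduce until the remainder is < q
        symbols.append(q)
        value -= q
    symbols.append(value)
    return symbols

def recursive_deindex(symbols):
    return sum(symbols)         # the q's accumulate; the final symbol (< q) ends the value

print(recursive_index(9))               # [7, 2]
print(recursive_index(17))              # [7, 7, 3]
print(recursive_deindex([7, 7, 3]))     # 17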
In CALIC, the representation alphabet is different for different coding contexts. For each
coding context k, we use an alphabet Ak = {0, 1, ..., Nk}. Furthermore, if the residual occurs
in context k, then the first number that is transmitted is coded with respect to context k; if
further recursion is needed, we use the (k + 1) context.
We can summarize the CALIC algorithm as follows:
1. Find initial prediction X̂.
2. Compute prediction context.
3. Refine prediction by removing the estimate of the bias in that context.
4. Update bias estimate.
5. Obtain the residual and remap it so the residual values lie between 0 and M −1, where
M is the size of the initial alphabet.
6. Find the coding context k.
7. Code the residual using the coding context.
All these components working together have kept CALIC as the state of the art in lossless
image compression. However, we can get almost as good a performance if we simplify some
of the more involved aspects of CALIC. We study such a scheme in the next section.
7.4 JPEG-LS
The JPEG-LS standard looks more like CALIC than the old JPEG standard. When the initial
proposals for the new lossless compression standard were compared, CALIC was rated first
in six of the seven categories of images tested. Motivated by some aspects of CALIC, a team
from Hewlett-Packard proposed a much simpler predictive coder, under the name LOCO-I
(for low complexity), that still performed close to CALIC [78].
As in CALIC, the standard has both a lossless and a lossy mode. We will not describe
the lossy coding procedures.
The initial prediction is obtained using the following algorithm:
if NW ≥ max(W, N)
    X̂ = min(W, N)
else
{
    if NW ≤ min(W, N)
        X̂ = max(W, N)
    else
        X̂ = W + N − NW
}
This prediction approach is a variation of Median Adaptive Prediction [79], in which the
predicted value is the median of the N, W, and NW pixels. The initial prediction is then
refined using the average value of the prediction error in that particular context.
The contexts in JPEG-LS also reflect the local variations in pixel values. However, they
are computed differently from CALIC. First, measures of differences D1, D2, and D3 are
computed as follows:
D1 = NE −N
D2 = N −NW
D3 = NW −W
The values of these differences define a three-component context vector Q. The components
of Q (Q1, Q2, and Q3) are defined by the following mappings:
    Di ≤ −T3        ⇒  Qi = −4
    −T3 < Di ≤ −T2  ⇒  Qi = −3
    −T2 < Di ≤ −T1  ⇒  Qi = −2
    −T1 < Di < 0    ⇒  Qi = −1
    Di = 0          ⇒  Qi = 0
    0 < Di ≤ T1     ⇒  Qi = 1
    T1 < Di ≤ T2    ⇒  Qi = 2
    T2 < Di ≤ T3    ⇒  Qi = 3
    T3 < Di         ⇒  Qi = 4                             (7.9)
where T1, T2, and T3 are positive coefficients that can be defined by the user. Given nine
possible values for each component of the context vector, this results in 9 × 9 × 9 = 729
possible contexts. In order to simplify the coding process, the number of contexts is reduced
by replacing any context vector Q whose first nonzero element is negative by −Q. Whenever
TABLE 7.3 Comparison of the file sizes obtained using the new and old JPEG lossless compression standards and CALIC.

    Image    Old JPEG   New JPEG   CALIC
    Sena     31,055     27,339     26,433
    Sensin   32,429     30,344     29,213
    Earth    32,137     26,088     25,280
    Omaha    48,818     50,765     48,249
this happens, a variable SIGN is also set to −1; otherwise, it is set to +1. This reduces the
number of contexts to 365. The vector Q is then mapped into a number between 0 and 364.
(The standard does not specify the particular mapping to use.)
The variable SIGN is used in the prediction refinement step. The correction is first
multiplied by SIGN and then added to the initial prediction.
The prediction error rn is mapped into an interval that is the same size as the range
occupied by the original pixel values. The mapping used in JPEG-LS is as follows:
    rn < −M/2  ⇒  rn ← rn + M
    rn ≥ M/2   ⇒  rn ← rn − M
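Two of the pieces described above are easy to sketch: the median-edge-detection initial prediction and the reduction of the prediction error into a range of size M. This is only an illustration of those two steps, not a complete JPEG-LS coder, and the pixel values in the example are made up.

# Sketch: JPEG-LS style initial prediction and prediction-error reduction.
def med_predict(W, N, NW):
    if NW >= max(W, N):
        return min(W, N)
    if NW <= min(W, N):
        return max(W, N)
    return W + N - NW

def reduce_error(r, M):
    if r < -M // 2:
        return r + M
    if r >= M // 2:
        return r - M
    return r

W, N, NW, X = 100, 104, 102, 103     # an 8-bit example, so M = 256
pred = med_predict(W, N, NW)         # no edge detected: 100 + 104 - 102 = 102
print(pred, reduce_error(X - pred, 256))   # 102 1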
Finally, the prediction errors are encoded using adaptively selected codes based on
Golomb codes, which have also been shown to be optimal for sequences with a geometric
distribution. In Table 7.3 we compare the performance of the old and new JPEG standards and
CALIC. The results for the new JPEG scheme were obtained using a software implementation
courtesy of HP.
We can see that for most of the images the new JPEG standard performs very close
to CALIC and outperforms the old standard by 6% to 18%. The only case where the
performance is not as good is for the Omaha image. While the performance improvement in
these examples may not be very impressive, we should keep in mind that for the old JPEG
we are picking the best result out of eight. In practice, this would mean trying all eight JPEG
predictors and picking the best. On the other hand, both CALIC and the new JPEG standard
are single-pass algorithms. Furthermore, because of the ability of both CALIC and the new
standard to function in multiple modes, both perform very well on compound documents,
which may contain images along with text.
7.5 Multiresolution Approaches
Our final predictive image compression scheme is perhaps not as competitive as the other
schemes. However, it is an interesting algorithm because it approaches the problem from a
slightly different point of view.
FIGURE 7.2 The HINT scheme for hierarchical prediction.
Multiresolution models generate representations of an image with varying spatial reso-
lution. This usually results in a pyramidlike representation of the image, with each layer of
the pyramid serving as a prediction model for the layer immediately below.
One of the more popular of these techniques is known as HINT (Hierarchical INTerpola-
tion) [80]. The specific steps involved in HINT are as follows. First, residuals corresponding
to the pixels on the coarsest grid in Figure 7.2 are obtained using linear prediction and transmitted.
Then, the intermediate pixels are estimated by linear interpolation, and the error in
estimation is then transmitted. Then, the pixels labeled X are estimated from the previously
transmitted pixels, and the estimation error is transmitted. Finally, the pixels labeled ∗ and then • are estimated from
known neighbors, and the errors are transmitted. The reconstruction process proceeds in a
similar manner.
One use of a multiresolution approach is in progressive image transmission. We describe
this application in the next section.
7.5.1 Progressive Image Transmission
The last few years have seen a very rapid increase in the amount of information stored as
images, especially remotely sensed images (such as images from weather and other satellites)
and medical images (such as CAT scans, magnetic resonance images, and mammograms).
It is not enough to have information. We also need to make these images accessible to
individuals who can make use of them. There are many issues involved with making large
amounts of information accessible to a large number of people. In this section we will look
at one particular issue—transmitting these images to remote users. (For a more general look
at the problem of managing large amounts of information, see [81].)
Suppose a user wants to browse through a number of images in a remote database.
The user is connected to the database via a 56 kbits per second (kbps) modem. Suppose the
images are of size 1024 ×1024, and on the average users have to look through 30 images
before finding the image they are looking for. If these images were monochrome with 8 bits
per pixel, this process would take close to an hour and 15 minutes, which is not very practical.
Even if we compressed these images before transmission, lossless compression on average
gives us about a two-to-one compression. This would only cut the transmission in half, which
still makes the approach cumbersome. A better alternative is to send an approximation of
each image first, which does not require too many bits but still is sufficiently accurate to
give users an idea of what the image looks like. If users find the image to be of interest, they
can request a further refinement of the approximation, or the complete image. This approach
is called progressive image transmission.
Example 7.5.1:
A simple progressive transmission scheme is to divide the image into blocks and then send
a representative pixel for the block. The receiver replaces each pixel in the block with the
representative value. In this example, the representative value is the value of the pixel in the
top-left corner. Depending on the size of the block, the amount of data that would need to be
transmitted could be substantially reduced. For example, to transmit a 1024×1024 image at
8 bits per pixel over a 56 kbps line takes about two and a half minutes. Using a block size
of 8 × 8, and using the top-left pixel in each block as the representative value, means we
approximate the 1024 ×1024 image with a 128 ×128 subsampled image. Using 8 bits per
pixel and a 56 kbps line, the time required to transmit this approximation to the image takes
less than two and a half seconds. Assuming that this approximation was sufficient to let the
user decide whether a particular image was the desired image, the time required now to look
through 30 images becomes a minute and a half instead of the hour and a quarter mentioned
earlier. If the approximation using a block size of 8×8 does not provide enough resolution
to make a decision, the user can ask for a refinement. The transmitter can then divide the
8×8 block into four 4×4 blocks. The pixel at the upper-left corner of the upper-left block
was already transmitted as the representative pixel for the 8 ×8 block, so we need to send
three more pixels for the other three 4×4 blocks. This takes about seven seconds, so even
if the user had to request a finer approximation every third image, this would only increase
the total search time by a little more than a minute. To see what these approximations look
like, we have taken the Sena image and encoded it using different block sizes. The results
are shown in Figure 7.3. The lowest-resolution image, shown in the top left, is a 32 × 32
image. The top-right image is a 64×64 image. The bottom-left image is a 128×128 image,
and the bottom-right image is the 256×256 original.
Notice that even with a block size of 8 the image is clearly recognizable as a person.
Therefore, if the user was looking for a house, they would probably skip over this image
after seeing the first approximation. If the user was looking for a picture of a person, they
could still make decisions based on the second approximation.
Finally, when an image is built line by line, the eye tends to follow the scan line. With
the progressive transmission approach, the user gets a more global view of the image very
early in the image formation process. Consider the images in Figure 7.4. The images on the
left are the 8×8, 4×4, and 2×2 approximations of the Sena image. On the right, we show
FIGURE 7.3 Sena image coded using different block sizes for progressive transmission. Top row: block size 8 × 8 and block size 4 × 4. Bottom row: block size 2 × 2 and original image.
how much of the image we would see in the same amount of time if we used the standard
line-by-line raster scan order. 
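The transmission-time arithmetic in this example is easy to reproduce; the sketch below assumes the stated 56 kbps line and 8 bits per pixel and ignores any protocol overhead.

# Sketch: transmission times for the progressive transmission example,
# assuming a 56 kbps line, 8 bits per pixel, and no protocol overhead.
def seconds(pixels, bits_per_pixel=8, rate_bps=56_000):
    return pixels * bits_per_pixel / rate_bps

full   = seconds(1024 * 1024)       # full image: about 150 s
coarse = seconds(128 * 128)         # 8 x 8 block approximation: about 2.3 s
refine = seconds(128 * 128 * 3)     # three extra pixels per 8 x 8 block: about 7 s
print(f"{full:.0f} s  {coarse:.1f} s  {refine:.1f} s")
print(f"30 coarse previews: {30 * coarse:.0f} s")    # a bit over a minute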
We would like the first approximations that we transmit to use as few bits as possible
yet be accurate enough to allow the user to make a decision to accept or reject the image
with a certain degree of confidence. As these approximations are lossy, many progressive
transmission schemes use well-known lossy compression schemes in the first pass.
FIGURE 7.4 Comparison between the received image using progressive transmission and using the standard raster scan order.
The more popular lossy compression schemes, such as transform coding, tend to require
a significant amount of computation. As the decoders for most progressive transmission
schemes have to function on a wide variety of platforms, they are generally implemented in
software and need to be simple and fast. This requirement has led to the development of a
number of progressive transmission schemes that do not use lossy compression schemes for
their initial approximations. Most of these schemes have a form similar to the one described
in Example 7.5.1, and they are generally referred to as pyramid schemes because of the
manner in which the approximations are generated and the image is reconstructed.
When we use the pyramid form, we still have a number of ways to generate the
approximations. One of the problems with the simple approach described in Example 7.5.1 is
that if the pixel values vary a lot within a block, the “representative” value may not be very
representative. To prevent this from happening, we could represent the block by some sort of
an average or composite value. For example, suppose we start out with a 512×512 image.
We first divide the image into 2 × 2 blocks and compute the integer value of the average
of each block [82, 83]. The integer values of the averages would constitute the penultimate
approximation. The approximation to be transmitted prior to that can be obtained by taking
the average of 2×2 averages and so on, as shown in Figure 7.5.
Using the simple technique in Example 7.5.1, we ended up transmitting the same number
of values as the original number of pixels. However, when we use the mean of the pixels
as our approximation, after we have transmitted the mean values at each level, we still have
to transmit the actual pixel values. The reason is that when we take the integer part of
the average we end up throwing away information that cannot be retrieved. To avoid this
problem of data expansion, we can transmit the sum of the values in the 2 ×2 block. Then
we only need to transmit three more values to recover the original four values. With this
approach, although we would be transmitting the same number of values as the number of
pixels in the image, we might still end up sending more bits because representing all possible
FIGURE 7.5 The pyramid structure for progressive transmission.
values of the sum would require transmitting 2 more bits than was required for the original
value. For example, if the pixels in the image can take on values between 0 and 255, which
can be represented by 8 bits, their sum will take on values between 0 and 1024, which would
require 10 bits. If we are allowed to use entropy coding, we can remove the problem of data
expansion by using the fact that the neighboring values in each approximation are heavily
correlated, as are values in different levels of the pyramid. This means that differences
between these values can be efficiently encoded using entropy coding. By doing so, we end
up getting compression instead of expansion.
Instead of taking the arithmetic average, we could also form some sort of weighted
average. The general procedure would be similar to that described above. (For one of the
more well-known weighted average techniques, see [84].)
The representative value does not have to be an average. We could use the pixel values
in the approximation at the lower levels of the pyramid as indices into a lookup table. The
lookup table can be designed to preserve important information such as edges. The problem
with this approach would be the size of the lookup table. If we were using 2 ×2 blocks of
8-bit values, the lookup table would have 2³² values, which is too large for most applications.
The size of the table could be reduced if the number of bits per pixel was lower or if, instead
of taking 2×2 blocks, we used rectangular blocks of size 2×1 and 1×2 [85].
Finally, we do not have to build the pyramid one layer at a time. After sending the
lowest-resolution approximations, we can use some measure of information contained in a
block to decide whether it should be transmitted [86]. One possible measure could be the
difference between the largest and smallest intensity values in the block. Another might be
to look at the maximum number of similar pixels in a block. Using an information measure
to guide the progressive transmission of images allows the user to see portions of the image
first that are visually more significant.
7.6 Facsimile Encoding
One of the earliest applications of lossless compression in the modern era has been the
compression of facsimile, or fax. In facsimile transmission, a page is scanned and converted
into a sequence of black or white pixels. The requirements of how fast the facsimile of an
A4 document (210×297 mm) must be transmitted have changed over the last two decades.
The CCITT (now ITU-T) has issued a number of recommendations based on the speed
requirements at a given time. The CCITT classifies the apparatus for facsimile transmission
into four groups. Although several considerations are used in this classification, if we only
consider the time to transmit an A4-size document over phone lines, the four groups can be
described as follows:

• Group 1: This apparatus is capable of transmitting an A4-size document in about six
  minutes over phone lines using an analog scheme. The apparatus is standardized in
  recommendation T.2.

• Group 2: This apparatus is capable of transmitting an A4-size document over phone
  lines in about three minutes. A Group 2 apparatus also uses an analog scheme and,
  therefore, does not use data compression. The apparatus is standardized in
  recommendation T.3.

• Group 3: This apparatus uses a digitized binary representation of the facsimile.
  Because it is a digital scheme, it can and does use data compression and is capable of
  transmitting an A4-size document in about a minute. The apparatus is standardized in
  recommendation T.4.

• Group 4: This apparatus has the same speed requirement as Group 3. The apparatus
  is standardized in recommendations T.6, T.503, T.521, and T.563.
With the arrival of the Internet, facsimile transmission has changed as well. Given the
wide range of rates and “apparatus” used for digital communication, it makes sense to focus
more on protocols than on apparatus. The newer recommendations from the ITU provide
standards for compression that are more or less independent of apparatus.
Later in this chapter, we will look at the compression schemes described in the ITU-T
recommendations T.4, T.6, T.82 (JBIG), T.88 (JBIG2), and T.44 (MRC). We begin with a
look at an earlier technique for facsimile called run-length coding, which still survives as
part of the T.4 recommendation.
7.6.1 Run-Length Coding
The model that gives rise to run-length coding is the Capon model [87], a two-state Markov
model with states Sw and Sb (Sw corresponds to the case where the pixel that has just been
encoded is a white pixel, and Sb corresponds to the case where the pixel that has just been
encoded is a black pixel). The transition probabilities P(w|b) and P(b|w), and the probability
of being in each state P(Sw) and P(Sb), completely specify this model. For facsimile images,
P(w|w) and P(w|b) are generally significantly higher than P(b|w) and P(b|b). The Markov
model is represented by the state diagram shown in Figure 7.6.
The entropy of a finite state process with states Si is given by Equation (2.16). Recall
that in Example 2.3.1, the entropy using a probability model and the iid assumption was
significantly more than the entropy using the Markov model.
FIGURE 7.6 The Capon model for binary images: a two-state Markov model with states Sw and Sb, self-transition probabilities P(w|w) and P(b|b), and transition probabilities P(b|w) and P(w|b).
Let us try to interpret what the model says about the structure of the data. The highly
skewed nature of the probabilities P(b|w) and P(w|w), and to a lesser extent P(w|b) and
P(b|b), says that once a pixel takes on a particular color (black or white), it is highly likely
that the following pixels will also be of the same color. So, rather than code the color of each
pixel separately, we can simply code the length of the runs of each color. For example, if
we had 190 white pixels followed by 30 black pixels, followed by another 210 white pixels,
instead of coding the 430 pixels individually, we would code the sequence 190, 30, 210,
along with an indication of the color of the first string of pixels. Coding the lengths of runs
instead of coding individual values is called run-length coding.
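
As a concrete illustration, the sketch below converts a row of binary pixels into the alternating run lengths described above. It is only a sketch: the function name, the 0-for-white convention, and the test row are our own choices, not part of any standard.

#include <stdio.h>

/* Convert a row of pixels (0 = white, 1 = black) into alternating run
   lengths, starting with a white run (which may have length zero). */
int run_lengths(const unsigned char *row, int width, int *runs)
{
    int nruns = 0;
    int color = 0;                  /* the first run is always white */
    int i = 0;
    while (i < width) {
        int len = 0;
        while (i < width && row[i] == color) {
            len++;
            i++;
        }
        runs[nruns++] = len;        /* zero is only possible for the first run */
        color = 1 - color;          /* alternate white and black runs */
    }
    return nruns;
}

int main(void)
{
    unsigned char row[16] = {0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0};
    int runs[17];
    int n = run_lengths(row, 16, runs);
    for (int j = 0; j < n; j++)
        printf("%d ", runs[j]);     /* prints: 1 2 3 3 7 */
    printf("\n");
    return 0;
}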
7.6.2 CCITT Group 3 and 4 - Recommendations T.4 and T.6
The recommendations for Group 3 facsimile include two coding schemes. One is a one-
dimensional scheme in which the coding on each line is performed independently of any
other line. The other is two-dimensional; the coding of one line is performed using the
line-to-line correlations.
The one-dimensional coding scheme is a run-length coding scheme in which each line
is represented as a series of alternating white runs and black runs. The first run is always
a white run. If the first pixel is a black pixel, then we assume that we have a white run of
length zero.
Runs of different lengths occur with different probabilities; therefore, they are coded
using a variable-length code. The approach taken in the CCITT standards T.4 and T.6 is to
use a Huffman code to encode the run lengths. However, the number of possible lengths
of runs is extremely large, and it is simply not feasible to build a codebook that large.
Therefore, instead of generating a Huffman code for each run length rl, the run length is
expressed in the form
rl = 64×m+t for t = 0, 1, …, 63, and m = 1, 2, …, 27. (7.10)
When we have to represent a run length rl, instead of finding a code for rl, we use the
corresponding codes for m and t. The codes for t are called the terminating codes, and the
codes for m are called the make-up codes. If rl < 64, we only need to use a terminating
code. Otherwise, both a make-up code and a terminating code are used. For the range of m
and t given here, we can represent lengths of up to 1728, which is the number of pixels per line
in an A4-size document. However, if the document is wider, the recommendations provide
for those with an optional set of 13 codes. Except for the optional codes, there are separate
codes for black and white run lengths. This coding scheme is generally referred to as a
modified Huffman (MH) scheme.
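
In code, the decomposition in Equation (7.10) is just a division and a remainder. The sketch below uses our own function name and leaves out the actual T.4 codeword tables, which would map the two parts to the make-up and terminating codewords (with separate tables for black and white runs).

/* Split a run length into its make-up part (multiples of 64) and its
   terminating part (0 to 63), as in Equation (7.10).  The caller would
   then look up the corresponding T.4 codewords, which are not shown.  */
void mh_split(int run_length, int *m, int *t)
{
    *m = run_length / 64;   /* 0 means no make-up code is needed */
    *t = run_length % 64;   /* a terminating code is always sent */
}

For example, a run of 200 pixels would be sent as the make-up code for 192 (m = 3) followed by the terminating code for 8.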
In the two-dimensional scheme, instead of reporting the run lengths, which in terms of
our Markov model is the length of time we remain in one state, we report the transition
times when we move from one state to another state. Look at Figure 7.7. We can encode this
in two ways. We can say that the first row consists of a sequence of runs 0, 2, 3, 3, 8, and
the second row consists of runs of length 0, 1, 8, 3, 4 (notice the first runs of length zero).
Or, we can encode the location of the pixel values that occur at a transition from white to
FIGURE 7.7 Two rows of an image. The transition pixels are marked with a dot.
black or black to white. The first pixel is an imaginary white pixel assumed to be to the left
of the first actual pixel. Therefore, if we were to code transition locations, we would encode
the first row as 1, 3, 6, 9 and the second row as 1, 2, 10, 13.
Generally, rows of a facsimile image are heavily correlated. Therefore, it would be easier
to code the transition points with reference to the previous line than to code each one in
terms of its absolute location, or even its distance from the previous transition point. This
is the basic idea behind the recommended two-dimensional coding scheme. This scheme
is a modification of a two-dimensional coding scheme called the Relative Element Address
Designate (READ) code [88, 89] and is often referred to as Modified READ (MR). The
READ code was the Japanese proposal to the CCITT for the Group 3 standard.
To understand the two-dimensional coding scheme, we need some definitions.
a0: This is the last pixel whose value is known to both encoder and decoder. At the
beginning of encoding each line, a0 refers to an imaginary white pixel to the left of
the first actual pixel. While it is often a transition pixel, it does not have to be.
a1: This is the first transition pixel to the right of a0. By definition its color should be the
opposite of a0. The location of this pixel is known only to the encoder.
a2: This is the second transition pixel to the right of a0. Its color should be the opposite of
a1, which means it has the same color as a0. The location of this pixel is also known
only to the encoder.
b1: This is the first transition pixel on the line above the line currently being encoded to
the right of a0 whose color is the opposite of a0. As the line above is known to both
encoder and decoder, as is the value of a0, the location of b1 is also known to both
encoder and decoder.
b2: This is the first transition pixel to the right of b1 in the line above the line currently
being encoded.
For the pixels in Figure 7.7, if the second row is the one being currently encoded, and if
we have encoded the pixels up to the second pixel, the assignment of the different pixels
is shown in Figure 7.8. The pixel assignments for a slightly different arrangement of black
and white pixels are shown in Figure 7.9.
If b1 and b2 lie between a0 and a1, we call the coding mode used the pass mode. The
transmitter informs the receiver about the situation by sending the code 0001. Upon receipt
of this code, the receiver knows that from the location of a0 to the pixel right below b2,
all pixels are of the same color. If this had not been true, we would have encountered a
transition pixel. As the first transition pixel to the right of a0 is a1, and as b2 occurs before
a1, no transitions have occurred and all pixels from a0 to right below b2 are the same color.
At this time, the last pixel known to both the transmitter and receiver is the pixel below b2.
FIGURE 7.8 Two rows of an image with the pixel assignments a0, a1, a2 on the current row and b1, b2 on the row above. The transition pixels are marked with a dot.
FIGURE 7.9 Two rows of an image with a slightly different arrangement of black and white pixels, showing the assignments a0, a1, a2 and b1, b2. The transition pixels are marked with a dot.
Therefore, this now becomes the new a0, and we find the new positions of b1 and b2 by
examining the row above the one being encoded and continue with the encoding process.
If a1 is detected before b2 by the encoder, we do one of two things. If the distance
between a1 and b1 (the number of pixels from a1 to right under b1) is less than or equal to
three, then we send the location of a1 with respect to b1, move a0 to a1, and continue with
the coding process. This coding mode is called the vertical mode. If the distance between a1
and b1 is large, we essentially revert to the one-dimensional technique and send the distances
between a0 and a1, and a1 and a2, using the modified Huffman code. Let us look at exactly
how this is accomplished.
In the vertical mode, if the distance between a1 and b1 is zero (that is, a1 is exactly
under b1), we send the code 1. If the a1 is to the right of b1 by one pixel (as in Figure 7.9),
we send the code 011. If a1 is to the right of b1 by two or three pixels, we send the codes
000011 or 0000011, respectively. If a1 is to the left of b1 by one, two, or three pixels, we
send the codes 010, 000010, or 0000010, respectively.
In the horizontal mode, we first send the code 001 to inform the receiver about the mode,
and then send the modified Huffman codewords corresponding to the run length from a0 to
a1, and a1 to a2.
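
The decision among the three modes can be summarized as follows. The function and type names are ours, only the quantities needed for the decision are passed in, and the actual codeword emission is indicated only in the comments.

/* Decide the T.4 two-dimensional coding mode at the current position.
   a1 is the first transition pixel to the right of a0 on the current
   line; b1 and b2 are the transition pixels on the line above, as
   defined in the text.  Positions are pixel indices within the line.  */
typedef enum { PASS_MODE, VERTICAL_MODE, HORIZONTAL_MODE } mr_mode;

mr_mode choose_mode(int a1, int b1, int b2)
{
    if (b2 < a1)                        /* b1 and b2 lie before a1       */
        return PASS_MODE;               /* emit 0001, move a0 under b2   */
    if (a1 - b1 >= -3 && a1 - b1 <= 3)  /* a1 within three pixels of b1  */
        return VERTICAL_MODE;           /* emit 1, 011, 010, 000011, ... */
    return HORIZONTAL_MODE;             /* emit 001 + MH codes for the   */
                                        /* runs a0..a1 and a1..a2        */
}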
As the encoding of a line in the two-dimensional algorithm is based on the previous
line, an error in one line could conceivably propagate to all other lines in the transmission.
To prevent this from happening, the T.4 recommendations contain the requirement that after
each line is coded with the one-dimensional algorithm, at most K − 1 lines will be coded
using the two-dimensional algorithm. For standard vertical resolution, K = 2, and for high
resolution, K = 4.
The Group 4 encoding algorithm, as standardized in CCITT recommendation T.6, is
identical to the two-dimensional encoding algorithm in recommendation T.4. The main
difference between T.6 and T.4 from the compression point of view is that T.6 does not
have a one-dimensional coding algorithm, which means that the restriction described in
the previous paragraph is also not present. This slight modification of the modified READ
algorithm has earned the name modified modified READ (MMR)!
7.6.3 JBIG
Many bi-level images have a lot of local structure. Consider a digitized page of text. In large
portions of the image we will encounter white pixels with a probability approaching 1.
In other parts of the image there will be a high probability of encountering a black pixel. We
can make a reasonable guess of the situation for a particular pixel by looking at values of
the pixels in the neighborhood of the pixel being encoded. For example, if the pixels in the
neighborhood of the pixel being encoded are mostly white, then there is a high probability
that the pixel to be encoded is also white. On the other hand, if most of the pixels in the
neighborhood are black, there is a high probability that the pixel being encoded is also
black. Each case gives us a skewed probability—a situation ideally suited for arithmetic
coding. If we treat each case separately, using a different arithmetic coder for each of the
two situations, we should be able to obtain improvement over the case where we use the
same arithmetic coder for all pixels. Consider the following example.
Suppose the probability of encountering a black pixel is 0.2 and the probability of
encountering a white pixel is 0.8. The entropy for this source is given by

H = −0.2 log2 0.2 − 0.8 log2 0.8 = 0.722    (7.11)

If we use a single arithmetic coder to encode this source, we will get an average bit rate
close to 0.722 bits per pixel. Now suppose, based on the neighborhood of the pixels, that
we can divide the pixels into two sets, one comprising 80% of the pixels and the other
20%. In the first set, the probability of encountering a white pixel is 0.95, and in the second
set the probability of encountering a black pixel is 0.7. The entropy of these sets is 0.286
and 0.881, respectively. If we used two different arithmetic coders for the two sets with
frequency tables matched to the probabilities, we would get rates close to 0.286 bits per
pixel about 80% of the time and close to 0.881 bits per pixel about 20% of the time. The
average rate would be about 0.405 bits per pixel, which is almost half the rate required if
we used a single arithmetic coder. If we use only those pixels in the neighborhood that had
already been transmitted to the receiver to make our decision about which arithmetic coder
to use, the decoder can keep track of which encoder was used to encode a particular pixel.
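
The numbers in this example are easy to verify; the short program below simply recomputes the entropies and the average rate (it assumes a C99 compiler and linking with the math library).

#include <math.h>
#include <stdio.h>

/* Binary entropy in bits: H(p) = -p log2(p) - (1-p) log2(1-p) */
static double h2(double p)
{
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p);
}

int main(void)
{
    double h_single = h2(0.2);              /* one coder for every pixel   */
    double h_set1   = h2(0.95);             /* mostly-white neighborhoods  */
    double h_set2   = h2(0.7);              /* mostly-black neighborhoods  */
    double h_split  = 0.8 * h_set1 + 0.2 * h_set2;

    printf("single coder: %.3f bits/pixel\n", h_single);   /* 0.722 */
    printf("two coders  : %.3f bits/pixel\n", h_split);    /* 0.405 */
    return 0;
}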
As we have mentioned before, the arithmetic coding approach is particularly amenable
to the use of multiple coders. All coders use the same computational machinery, with each
coder using a different set of probabilities. The JBIG algorithm makes full use of this feature
of arithmetic coding. Instead of checking to see if most of the pixels in the neighborhood are
white or black, the JBIG encoder uses the pattern of pixels in the neighborhood, or context,
to decide which set of probabilities to use in encoding a particular pixel. If the neighborhood
consists of 10 pixels, with each pixel capable of taking on two different values, the number of
FIGURE 7.10 (a) Three-line and (b) two-line neighborhoods. The pixel being coded is marked X, the context pixels are marked O, and A marks the floating (adaptive) pixel.
FIGURE 7.11 (a) Three-line and (b) two-line contexts, with the template pixels replaced by their 0 and 1 values.
possible patterns is 1024. The JBIG coder uses 1024 to 4096 coders, depending on whether
a low- or high-resolution layer is being encoded.
For the low-resolution layer, the JBIG encoder uses one of the two different neighbor-
hoods shown in Figure 7.10. The pixel to be coded is marked X, while the pixels to be used
for templates are marked O or A. The A and O pixels are previously encoded pixels and are
available to both encoder and decoder. The A pixel can be thought of as a floating member
of the neighborhood. Its placement is dependent on the input being encoded. Suppose the
image has vertical lines 30 pixels apart. The A pixel would be placed 30 pixels to the left
of the pixel being encoded. The A pixel can be moved around to capture any structure that
might exist in the image. This is especially useful in halftone images in which the A pixels
are used to capture the periodic structure. The location and movement of the A pixel are
transmitted to the decoder as side information.
In Figure 7.11, the symbols in the neighborhoods have been replaced by 0s and 1s.
We take 0 to correspond to white pixels, while 1 corresponds to black pixels. The pixel
to be encoded is enclosed by the heavy box. The pattern of 0s and 1s is interpreted as a
binary number, which is used as an index to the set of probabilities. The context in the case
of the three-line neighborhood (reading left to right, top to bottom) is 0001000110, which
corresponds to an index of 70. For the two-line neighborhood, the context is 0011100001,
or 225. Since there are 10 bits in these templates, we will have 1024 different arithmetic
coders.
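
A sketch of how such a template can be packed into a context index is shown below. The left-to-right, top-to-bottom bit ordering matches the reading order used in the text, but the function name and the array interface are our own; the actual template positions are fixed by the recommendation.

/* Pack the 10 template pixels (each 0 or 1), read left to right and top
   to bottom, into a context index in the range 0 to 1023.              */
int context_index(const int bits[10])
{
    int ctx = 0;
    for (int i = 0; i < 10; i++)
        ctx = (ctx << 1) | (bits[i] & 1);
    return ctx;
}

/* For the three-line context of Figure 7.11(a), the bits 0,0,0,1,0,0,0,1,1,0
   give context_index(...) == 70, matching the index quoted in the text.   */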
In the JBIG standard, the 1024 arithmetic coders are a variation of the arithmetic coder
known as the QM coder. The QM coder is a modification of an adaptive binary arithmetic
coder called the Q coder [51, 52, 53], which in turn is an extension of another binary adaptive
arithmetic coder called the skew coder [90].
In our description of arithmetic coding, we updated the tag interval by updating the
endpoints of the interval, u^(n) and l^(n). We could just as well have kept track of one endpoint
and the size of the interval. This is the approach adopted in the QM coder, which tracks the
lower end of the tag interval l^(n) and the size of the interval A^(n), where

A^(n) = u^(n) − l^(n)    (7.12)

The tag for a sequence is the binary representation of l^(n).
We can obtain the update equation for A^(n) by subtracting Equation (4.9) from Equation
(4.10) and making this substitution:

A^(n) = A^(n−1) [F_X(x_n) − F_X(x_n − 1)]    (7.13)
      = A^(n−1) P(x_n)    (7.14)

Substituting A^(n) for u^(n) − l^(n) in Equation (4.9), we get the update equation for l^(n):

l^(n) = l^(n−1) + A^(n−1) F_X(x_n − 1)    (7.15)
Instead of dealing directly with the 0s and 1s put out by the source, the QM coder
maps them into a More Probable Symbol (MPS) and Less Probable Symbol (LPS). If 0
represents black pixels and 1 represents white pixels, then in a mostly black image 0 will
be the MPS, whereas in an image with mostly white regions 1 will be the MPS. Denoting
the probability of occurrence of the LPS for the context C by q_c and mapping the MPS
to the lower subinterval, the occurrence of an MPS symbol results in the following update
equations:

l^(n) = l^(n−1)    (7.16)
A^(n) = A^(n−1) (1 − q_c)    (7.17)

while the occurrence of an LPS symbol results in the following update equations:

l^(n) = l^(n−1) + A^(n−1) (1 − q_c)    (7.18)
A^(n) = A^(n−1) q_c    (7.19)
Until this point, the QM coder looks very much like the arithmetic coder described earlier
in this chapter. To make the implementation simpler, the JBIG committee recommended
several deviations from the standard arithmetic coding algorithm. The update equations
involve multiplications, which are expensive in both hardware and software. In the QM
coder, the multiplications are avoided by assuming that A^(n) has a value close to 1, so that
multiplication by A^(n) can be approximated by multiplication by 1. Therefore, the update
equations become

For MPS:
l^(n) = l^(n−1)    (7.20)
A^(n) = 1 − q_c    (7.21)

For LPS:
l^(n) = l^(n−1) + (1 − q_c)    (7.22)
A^(n) = q_c    (7.23)
In order not to violate the assumption on A^(n), whenever the value of A^(n) drops below
0.75, the QM coder goes through a series of rescalings until the value of A^(n) is greater than
or equal to 0.75. The rescalings take the form of repeated doubling, which corresponds to
a left shift in the binary representation of A^(n). To keep all parameters in sync, the same
scaling is also applied to l^(n). The bits shifted out of the buffer containing the value of l^(n)
make up the encoder output. Looking at the update equations for the QM coder, we can see
that a rescaling will occur every time an LPS occurs. Occurrence of an MPS may or may
not result in a rescale, depending on the value of A^(n).
The probability q_c of the LPS for context C is updated each time a rescaling takes place
and the context C is active. An ordered list of values for q_c is listed in a table. Every time
a rescaling occurs, the value of q_c is changed to the next lower or next higher value in
the table, depending on whether the rescaling was caused by the occurrence of an LPS or
an MPS.
In a nonstationary situation, the symbol assigned to LPS may actually occur more often
than the symbol assigned to MPS. This condition is detected when q_c > A^(n) − q_c. In this
situation, the assignments are reversed; the symbol assigned the LPS label is assigned the
MPS label and vice versa. The test is conducted every time a rescaling takes place.
The decoder for the QM coder operates in much the same way as the decoder described
in this chapter, mimicking the encoder operation.
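
The approximate update equations (7.20)-(7.23) and the rescaling rule can be sketched in integer arithmetic as follows. The choice of scaling 1.0 to 0x10000, the structure layout, and the omission of carry propagation and byte output are simplifications of ours; they are not the register conventions of the actual QM coder, and the conditional exchange of the MPS and LPS labels is not shown.

#include <stdint.h>

#define ONE    0x10000u              /* fixed-point representation of 1.0  */
#define THRESH 0xC000u               /* fixed-point representation of 0.75 */

typedef struct {
    uint32_t a;                      /* interval size A                    */
    uint32_t low;                    /* lower end of the tag interval l    */
} qm_state;

/* qc is the LPS probability for the active context, scaled by ONE
   (0 < qc <= ONE/2).  is_mps is nonzero when the coded symbol is the MPS. */
static void qm_update(qm_state *s, int is_mps, uint32_t qc)
{
    if (is_mps) {
        s->a = ONE - qc;             /* Equation (7.21); l is unchanged    */
    } else {
        s->low += ONE - qc;          /* Equation (7.22)                    */
        s->a = qc;                   /* Equation (7.23)                    */
    }
    /* Rescale by repeated doubling until A >= 0.75.  The bits shifted out
       of the top of `low` would form the encoder output (omitted here).   */
    while (s->a < THRESH) {
        s->a <<= 1;
        s->low <<= 1;
    }
}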
Progressive Transmission
In some applications we may not always need to view an image at full resolution. For
example, if we are looking at the layout of a page, we may not need to know what each
word or letter on the page is. The JBIG standard allows for the generation of progressively
lower-resolution images. If the user is interested in some gross patterns in the image
(for example, if they were interested in seeing if there were any figures on a particular page)
they could request a lower-resolution image, which could be transmitted using fewer
bits. Once the lower-resolution image was available, the user could decide whether a
higher-resolution image was necessary. The JBIG specification recommends generating one
lower-resolution pixel for each 2 × 2 block in the higher-resolution image. The number of
lower-resolution images (called layers) is not specified by JBIG.
A straightforward method for generating lower-resolution images is to replace every
2×2 block of pixels with the average value of the four pixels, thus reducing the resolution
by two in both the horizontal and vertical directions. This approach works well as long as
three of the four pixels are either black or white. However, when we have two pixels of
each kind, we run into trouble; consistently replacing the four pixels with either a white
or black pixel causes a severe loss of detail, and randomly replacing with a black or white
pixel introduces a considerable amount of noise into the image [81].
Instead of simply taking the average of every 2×2 block, the JBIG specification provides
a table-based method for resolution reduction. The table is indexed by the neighboring pixels
shown in Figure 7.12, in which the circles represent the lower-resolution layer pixels and
the squares represent the higher-resolution layer pixels.
Each pixel contributes a bit to the index. The table is formed by computing the expression

4e + 2(b + d + f + h) + (a + c + g + i) − 3(B + C) − A
FIGURE 7.12 Pixels used to determine the value of a lower-level pixel: the 3×3 block of higher-resolution pixels a, b, c, d, e, f, g, h, i (with e at the center) and the neighboring lower-resolution pixels A, B, C of the pixel X being computed.
If the value of this expression is greater than 4.5, the pixel X is tentatively declared to be 1.
The table has certain exceptions to this rule to reduce the amount of edge smearing, generally
encountered in a filtering operation. There are also exceptions that preserve periodic patterns
and dither patterns.
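
Ignoring the exception entries, the basic rule can be written directly. The variable names follow Figure 7.12; this is only a sketch of the expression given above, not the full JBIG resolution-reduction table.

/* Tentative value of the lower-resolution pixel X.  a..i are the nine
   higher-resolution pixels of Figure 7.12 (e at the center); A, B, C are
   previously computed lower-resolution neighbors.  All inputs are 0 or 1.
   The exceptions for edges, periodic patterns, and dither are not handled. */
int reduce_pixel(int a, int b, int c, int d, int e, int f,
                 int g, int h, int i, int A, int B, int C)
{
    int expr = 4*e + 2*(b + d + f + h) + (a + c + g + i) - 3*(B + C) - A;
    return expr > 4;   /* integer valued, so "greater than 4.5" means >= 5 */
}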
As the lower-resolution layers are obtained from the higher-resolution images, we can use
them when encoding the higher-resolution images. The JBIG specification makes use of the
lower-resolution images when encoding the higher-resolution images by using the pixels of
the lower-resolution images as part of the context for encoding the higher-resolution images.
The contexts used for coding the lowest-resolution layer are those shown in Figure 7.10.
The contexts used in coding the higher-resolution layer are shown in Figure 7.13.
Ten pixels are used in each context. If we include the 2 bits required to indicate which
context template is being used, 12 bits will be used to indicate the context. This means that
we can have 4096 different contexts.
Comparison of MH, MR, MMR, and JBIG
In this section we have seen three old facsimile coding algorithms: modified Huffman,
modified READ, and modified modified READ. Before we proceed to the more modern
techniques found in T.88 and T.44, we compare the performance of these algorithms with
the earliest of the modern techniques, namely JBIG. We described the JBIG algorithm as
an application of arithmetic coding in Chapter 4. This algorithm has been standardized in
ITU-T recommendation T.82. As we might expect, the JBIG algorithm performs better than
the MMR algorithm, which performs better than the MR algorithm, which in turn performs
better than the MH algorithm. The level of complexity also follows the same trend, although
we could argue that MMR is actually less complex than MR.
A comparison of the schemes for some facsimile sources is shown in Table 7.4. The
modified READ algorithm was used with K = 4, while the JBIG algorithm was used with
an adaptive three-line template and adaptive arithmetic coder to obtain the results in this
table. As we go from the one-dimensional MH coder to the two-dimensional MMR coder,
we get a factor of two reduction in file size for the sparse text sources. We get even more
reduction when we use an adaptive coder and an adaptive model, as is true for the JBIG
coder. When we come to the dense text, the advantage of the two-dimensional MMR over
the one-dimensional MH is not as significant, as the amount of two-dimensional correlation
becomes substantially less.
FIGURE 7.13 Contexts used in the coding of higher-resolution layers (four templates, (a)-(d), each built from previously coded pixels marked O, an adaptive pixel marked A, and the pixel being coded marked ?).
TABLE 7.4 Comparison of binary image coding schemes. Data from [91].

Source description   Original size (pixels)   MH (bytes)   MR (bytes)   MMR (bytes)   JBIG (bytes)
Letter               4352×3072                  20,605       14,290        8,531          6,682
Sparse text          4352×3072                  26,155       16,676        9,956          7,696
Dense text           4352×3072                 135,705      105,684       92,100         70,703
The compression schemes specified in T.4 and T.6 break down when we try to use them
to encode halftone images. In halftone images, gray levels are represented using binary pixel
patterns. A gray level closer to black would be represented by a pattern that contains more
black pixels, while a gray level closer to white would be represented by a pattern with fewer
black pixels. Thus, the model that was used to develop the compression schemes specified
in T.4 and T.6 is not valid for halftone images. The JBIG algorithm, with its adaptive
model and coder, suffers from no such drawbacks and performs well for halftone images
also [91].
7.6.4 JBIG2 - T.88
The JBIG2 standard was approved in February of 2000. Besides facsimile transmission,
the standard is also intended for document storage, archiving, wireless transmission, print
spooling, and coding of images on the Web. The standard provides specifications only
for the decoder, leaving the encoder design open. This means that the encoder design can
be constantly refined, subject only to compatibility with the decoder specifications. This
situation also allows for lossy compression, because the encoder can incorporate lossy
transformations to the data that enhance the level of compression.
The compression algorithm in JBIG provides excellent compression of a generic bi-level
image. The compression algorithm proposed for JBIG2 uses the same arithmetic coding
scheme as JBIG. However, it takes advantage of the fact that a significant number of bi-level
images contain structure that can be used to enhance the compression performance. A large
percentage of bi-level images consist of text on some background, while another significant
percentage of bi-level images are or contain halftone images. The JBIG2 approach allows
the encoder to select the compression technique that would provide the best performance for
the type of data. To do so, the encoder divides the page to be compressed into three types
of regions called symbol regions, halftone regions, and generic regions. The symbol regions
are those containing text data, the halftone regions are those containing halftone images, and
the generic regions are all the regions that do not fit into either category.
The partitioning information has to be supplied to the decoder. The decoder requires that
all information provided to it be organized in segments that are made up of a segment header,
a data header, and segment data. The page information segment contains information about
the page including the size and resolution. The decoder uses this information to set up the
page buffer. It then decodes the various regions using the appropriate decoding procedure
and places the different regions in the appropriate location.
Generic Decoding Procedures
There are two procedures used for decoding the generic regions: the generic region decod-
ing procedure and the generic refinement region decoding procedure. The generic region
decoding procedure uses either the MMR technique used in the Group 3 and Group 4 fax
standards or a variation of the technique used to encode the lowest-resolution layer in the
JBIG recommendation. We describe the operation of the MMR algorithm in Chapter 6. The
latter procedure is described as follows.
The second generic region decoding procedure is a procedure called typical prediction.
In a bi-level image, a line of pixels is often identical to the line above. In typical prediction,
if the current line is the same as the line above, a bit flag called LNTP_n is set to 0, and the
line is not transmitted. If the line is not the same, the flag is set to 1, and the line is coded
using the contexts currently used for the low-resolution layer in JBIG. The value of LNTP_n
is encoded by generating another bit, SLNTP_n, according to the rule

SLNTP_n = !(LNTP_n ⊕ LNTP_{n−1})

which is treated as a virtual pixel to the left of each row. If the decoder decodes an LNTP
value of 0, it copies the line above. If it decodes an LNTP value of 1, the following bits
in the segment data are decoded using an arithmetic decoder and the contexts described
previously.
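
On the encoder side the flag computation is a single comparison and an exclusive-or; the sketch below uses our own function and variable names, and assumes the caller keeps the previous line and the previous value of LNTP.

#include <string.h>

/* Typical prediction: LNTP_n is 1 when line n differs from line n-1.
   The transmitted bit is SLNTP_n = !(LNTP_n XOR LNTP_{n-1}), coded as a
   virtual pixel at the start of the row.                                */
int compute_slntp(const unsigned char *line, const unsigned char *prev_line,
                  int width, int *lntp_prev)
{
    int lntp  = (memcmp(line, prev_line, width) != 0);
    int slntp = !(lntp ^ *lntp_prev);
    *lntp_prev = lntp;          /* remember for the next row             */
    return slntp;               /* the decoder recovers LNTP_n and simply
                                   copies the previous line when it is 0 */
}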
The generic refinement decoding procedure assumes the existence of a reference layer
and decodes the segment data with reference to this layer. The standard leaves open the
specification of the reference layer.
Symbol Region Decoding
The symbol region decoding procedure is a dictionary-based decoding procedure. The symbol
region segment is decoded with the help of a symbol dictionary contained in the symbol
dictionary segment. The data in the symbol region segment contains the location where
a symbol is to be placed, as well as the index to an entry in the symbol dictionary. The
symbol dictionary consists of a set of bitmaps and is decoded using the generic decoding
procedures. Note that because JBIG2 allows for lossy compression, the symbols do not
have to exactly match the symbols in the original document. This feature can significantly
increase the compression performance when the original document contains noise that may
preclude exact matches with the symbols in the dictionary.
Halftone Region Decoding
The halftone region decoding procedure is also a dictionary-based decoding procedure. The
halftone region segment is decoded with the help of a halftone dictionary contained in the
halftone dictionary segment. The halftone dictionary segment is decoded using the generic
decoding procedures. The data in the halftone region segment consists of the location of the
halftone region and indices to the halftone dictionary. The dictionary is a set of fixed-size
halftone patterns. As in the case of the symbol region, if lossy compression is allowed,
the halftone patterns do not have to exactly match the patterns in the original document.
By allowing for nonexact matches, the dictionary can be kept small, resulting in higher
compression.
7.7 MRC - T.44
With the rapid advance of technology for document production, documents have changed
in appearance. Where a document used to be a set of black and white printed pages, now
documents contain multicolored text as well as color images. To deal with this new type
of document, the ITU-T developed the recommendation T.44 for Mixed Raster Content
(MRC). This recommendation takes the approach of separating the document into ele-
ments that can be compressed using available techniques. Thus, it is more an approach of
partitioning a document image than a compression technique. The compression strategies
employed here are borrowed from previous standards such as JPEG (T.81), JBIG (T.82), and
even T.6.
The T.44 recommendation divides a page into slices where the width of the slice is equal
to the width of the entire page. The height of the slice is variable. In the base mode, each
FIGURE 7.14 Ruby's birthday invitation (a page containing a picture of a cake and several lines of multicolored text).
FIGURE 7.15 The background layer (the area containing only white pixels is not coded or sent).
slice is represented by three layers: a background layer, a foreground layer, and a mask
layer. These layers are used to effectively represent three basic data types: color images
(which may be continuous tone or color mapped), bi-level data, and multilevel (multicolor)
data. The multilevel image data is put in the background layer, and the mask and foreground
layers are used to represent the bi-level and multilevel nonimage data. To work through the
various definitions, let us use the document shown in Figure 7.14 as an example. We have
divided the document into two slices. The top slice contains the picture of the cake and two
lines of writing in two “colors.” Notice that the heights of the two slices are not the same and
the complexity of the information contained in the two slices is not the same. The top slice
contains multicolored text and a continuous tone image whereas the bottom slice contains
only bi-level text. Let us take the upper slice first and see how to divide it into the three
layers. We will discuss how to code these layers later. The background layer consists of the
cake and nothing else. The default color for the background layer is white (though this can
be changed). Therefore, we do not need to send the left half of this layer, which contains
only white pixels.
FIGURE 7.16 The mask layer.
FIGURE 7.17 The foreground layer (the unused area is not coded or sent).
The mask layer (Figure 7.16) consists of a bi-level representation of the textual infor-
mation, while the foreground layer contains the colors used in the text. To reassemble the
slice we begin with the background layer. We then add to it pixels from the foreground
layer using the mask layer as the guide. Wherever the mask layer pixel is black (1) we pick
the corresponding pixel from the foreground layer. Wherever the mask pixel is white (0) we
use the pixel from the background layer. Because of its role in selecting pixels, the mask
layer is also known as the selector layer. During transmission the mask layer is transmitted
first, followed by the background and the foreground layers. During the rendering process
the background layer is rendered first.
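
The reassembly rule itself is very simple. The sketch below uses our own names and assumes the three layers have already been decoded to a common pixel grid, which glosses over the fact that the layers may be coded at different resolutions.

/* Reassemble one slice: wherever the mask (selector) pixel is 1 the
   foreground pixel is used, otherwise the background pixel is used.   */
void reassemble_slice(const unsigned char *mask,
                      const unsigned char *foreground,
                      const unsigned char *background,
                      unsigned char *out, int npixels)
{
    for (int n = 0; n < npixels; n++)
        out[n] = mask[n] ? foreground[n] : background[n];
}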
When we look at the lower slice we notice that it contains only bi-level information. In
this case we only need the mask layer because the other two layers would be superfluous.
In order to deal with this kind of situation, the standard defines three different kinds of
stripes. Three-layer stripes (3LS) contain all three layers and are useful when there is both
image and textual data in the stripe. Two-layer stripes (2LS) only contain two layers, with the
third set to a constant value. This kind of stripe would be useful when encoding a stripe with
multicolored text and no images, or a stripe with images and bi-level text or line drawings.
The third kind of stripe is a one-layer stripe (1LS) which would be used when a stripe
contains only bi-level text or line art, or only continuous tone images.
Once the document has been partitioned it can be compressed. Notice that the types
of data we have after partitioning are continuous tone images, bi-level information, and
multilevel regions. We already have efficient standards for compressing these types of
data. For the mask layer containing bi-level information, the recommendation suggests that
one of several approaches can be used, including modified Huffman or modified READ
(as described in recommendation T.4), MMR (as described in recommendation T.6), or JBIG
(recommendation T.82). The encoder includes information in the datastream about which
algorithm has been used. For the continuous tone images and the multilevel regions contained
in the foreground and background layers, the recommendation suggests the use of the JPEG
standard (recommendation T.81) or the JBIG standard. The header for each slice contains
information about which algorithm is used for compression.
7.8 Summary
In this chapter we have examined a number of ways to compress images. All these approaches
exploit the fact that pixels in an image are generally highly correlated with their neighbors.
This correlation can be used to predict the actual value of the current pixel. The prediction
error can then be encoded and transmitted. Where the correlation is especially high, as in
the case of bi-level images, long stretches of pixels can be encoded together using their
similarity with previous rows. Finally, by identifying different components of an image that
have common characteristics, an image can be partitioned and each partition encoded using
the algorithm best suited to it.
Further Reading
1. A detailed survey of lossless image compression techniques can be found in “Lossless
Image Compression” by K.P. Subbalakshmi. This chapter appears in the Lossless
Compression Handbook, Academic Press, 2003.
2. For a detailed description of the LOCO-I and JPEG-LS compression algorithm, see
“The LOCO-I Lossless Image Compression Algorithm: Principles and Standardiza-
tion into JPEG-LS,” Hewlett-Packard Laboratories Technical Report HPL-98-193,
November 1998 [92].
3. The JBIG and JBIG2 standards are described in a very accessible manner in “Lossless
Bilevel Image Compression,” by M.W. Hoffman. This chapter appears in the Lossless
Compression Handbook, Academic Press, 2003.
4. The area of lossless image compression is a very active one, and new schemes are
being published all the time. These articles appear in a number of journals, including
Journal of Electronic Imaging, Optical Engineering, IEEE Transactions on Image
Processing, IEEE Transactions on Communications, Communications of the ACM,
IEEE Transactions on Computers, and Image Communication, among others.
7.9 Projects and Problems
1. Encode the binary image shown in Figure 7.18 using the modified Huffman scheme.
2. Encode the binary image shown in Figure 7.18 using the modified READ scheme.
3. Encode the binary image shown in Figure 7.18 using the modified modified READ
scheme.
FIGURE 7.18 An 8 × 16 binary image.
4. Suppose we want to transmit a 512×512, 8-bits-per-pixel image over a 9600 bits per
second line.
(a) If we were to transmit this image using raster scan order, after 15 seconds how
many rows of the image will the user have received? To what fraction of the
image does this correspond?
(b) If we were to transmit the image using the method of Example 7.5.1, how long
would it take the user to receive the first approximation? How long would it take
to receive the first two approximations?
5. An implementation of the progressive transmission example (Example 7.5.1) is included
in the programs accompanying this book. The program is called prog_tran1.c. Using
this program as a template, experiment with different ways of generating approximations
(you could use various types of weighted averages) and comment on the qualitative
differences (or lack thereof) with using various schemes. Try different block sizes and
comment on the practical effects in terms of quality and rate.
6. The program jpegll_enc.c generates the residual image for the different JPEG
prediction modes, while the program jpegll_dec.c reconstructs the original image
from the residual image. The output of the encoder program can be used as the input
to the public domain arithmetic coding program mentioned in Chapter 4 and the
Huffman coding programs mentioned in Chapter 3. Study the performance of different
combinations of prediction mode and entropy coder using three images of your choice.
Account for any differences you see.
7. Extend jpegll_enc.c and jpegll_dec.c with an additional prediction mode—
be creative! Compare the performance of your predictor with the JPEG predictors.
8. Implement the portions of the CALIC algorithm described in this chapter. Encode the
Sena image using your implementation.
8 Mathematical Preliminaries for Lossy Coding

8.1 Overview
Before we discussed lossless compression, we presented some of the mathemat-
ical background necessary for understanding and appreciating the compression
schemes that followed. We will try to do the same here for lossy compres-
sion schemes. In lossless compression schemes, rate is the general concern.
With lossy compression schemes, the loss of information associated with such
schemes is also a concern. We will look at different ways of assessing the impact of the
loss of information. We will also briefly revisit the subject of information theory, mainly
to get an understanding of the part of the theory that deals with the trade-offs involved
in reducing the rate, or number of bits per sample, at the expense of the introduction of
distortion in the decoded information. This aspect of information theory is also known as
rate distortion theory. We will also look at some of the models used in the development of
lossy compression schemes.
8.2 Introduction
This chapter will provide some mathematical background that is necessary for discussing
lossy compression techniques. Most of the material covered in this chapter is common to
many of the compression techniques described in the later chapters. Material that is specific
to a particular technique is described in the chapter in which the technique is presented. Some
of the material presented in this chapter is not essential for understanding the techniques
described in this book. However, to follow some of the literature in this area, familiarity
with these topics is necessary. We have marked these sections with a ⋆. If you are primarily
interested in the techniques, you may wish to skip these sections, at least on first reading.
On the other hand, if you wish to delve more deeply into these topics, we have included
a list of resources at the end of this chapter that provide a more mathematically rigorous
treatment of this material.
When we were looking at lossless compression, one thing we never had to worry about
was how the reconstructed sequence would differ from the original sequence. By definition,
the reconstruction of a losslessly compressed sequence is identical to the original sequence.
However, there is only a limited amount of compression that can be obtained with lossless
compression. There is a floor (a hard one) defined by the entropy of the source, below which
we cannot drive the size of the compressed sequence. As long as we wish to preserve all of
the information in the source, the entropy, like the speed of light, is a fundamental limit.
The limited amount of compression available from using lossless compression schemes
may be acceptable in several circumstances. The storage or transmission resources available
to us may be sufficient to handle our data requirements after lossless compression. Or the
possible consequences of a loss of information may be much more expensive than the cost
of additional storage and/or transmission resources. This would be the case with the storage
and archiving of bank records; an error in the records could turn out to be much more
expensive than the cost of buying additional storage media.
If neither of these conditions hold—that is, resources are limited and we do not require
absolute integrity—we can improve the amount of compression by accepting a certain degree
of loss during the compression process. Performance measures are necessary to determine
the efficiency of our lossy compression schemes. For the lossless compression schemes we
essentially used only the rate as the performance measure. That would not be feasible for
lossy compression. If rate were the only criterion for lossy compression schemes, where loss
of information is permitted, the best lossy compression scheme would be simply to throw
away all the data! Therefore, we need some additional performance measure, such as some
measure of the difference between the original and reconstructed data, which we will refer
to as the distortion in the reconstructed data. In the next section, we will look at some of the
more well-known measures of difference and discuss their advantages and shortcomings.
In the best of all possible worlds we would like to incur the minimum amount of
distortion while compressing to the lowest rate possible. Obviously, there is a trade-off
between minimizing the rate and keeping the distortion small. The extreme cases are when
we transmit no information, in which case the rate is zero, or keep all the information, in
which case the distortion is zero. The rate for a discrete source is simply the entropy. The
study of the situations between these two extremes is called rate distortion theory. In this
chapter we will take a brief look at some important concepts related to this theory.
Finally, we need to expand the dictionary of models available for our use, for several
reasons. First, because we are now able to introduce distortion, we need to determine how
to add distortion intelligently. For this, we often need to look at the sources somewhat
differently than we have done previously. Another reason is that we will be looking at
compression schemes for sources that are analog in nature, even though we have treated
them as discrete sources in the past. We need models that more precisely describe the true
nature of these sources. We will describe several different models that are widely used in
the development of lossy compression algorithms.
We will use the block diagram and notation used in Figure 8.1 throughout our dis-
cussions. The output of the source is modeled as a random variable X. The source coder
FIGURE 8.1 Block diagram of a generic compression scheme: Source → Source encoder → Channel → Source decoder → User, with X the source output, Xc the compressed representation, X̂c the output of the channel, and Y the reconstruction delivered to the user.
takes the source output and produces the compressed representation Xc. The channel block
represents all transformations the compressed representation undergoes before the source is
reconstructed. Usually, we will take the channel to be the identity mapping, which means
Xc = X̂c. The source decoder takes the compressed representation and produces a recon-
struction of the source output for the user.
8.3 Distortion Criteria
How do we measure the closeness or fidelity of a reconstructed source sequence to the
original? The answer frequently depends on what is being compressed and who is doing
the answering. Suppose we were to compress and then reconstruct an image. If the image
is a work of art and the resulting reconstruction is to be part of a book on art, the best
way to find out how much distortion was introduced and in what manner is to ask a person
familiar with the work to look at the image and provide an opinion. If the image is that of
a house and is to be used in an advertisement, the best way to evaluate the quality of the
reconstruction is probably to ask a real estate agent. However, if the image is from a satellite
and is to be processed by a machine to obtain information about the objects in the image,
the best measure of fidelity is to see how the introduced distortion affects the functioning of
the machine. Similarly, if we were to compress and then reconstruct an audio segment, the
judgment of how close the reconstructed sequence is to the original depends on the type of
material being examined as well as the manner in which the judging is done. An audiophile
is much more likely to perceive distortion in the reconstructed sequence, and distortion is
much more likely to be noticed in a musical piece than in a politician’s speech.
In the best of all worlds we would always use the end user of a particular source
output to assess quality and provide the feedback required for the design. In practice this
is not often possible, especially when the end user is a human, because it is difficult to
incorporate the human response into mathematical design procedures. Also, there is difficulty
in objectively reporting the results. The people asked to assess one person’s design may
be more easygoing than the people who were asked to assess another person’s design.
Even though the reconstructed output using one person’s design is rated “excellent” and the
reconstructed output using the other person’s design is only rated “acceptable,” switching
observers may change the ratings. We could reduce this kind of bias by recruiting a large
number of observers in the hope that the various biases will cancel each other out. This is
often the option used, especially in the final stages of the design of compression systems.
However, the rather cumbersome nature of this process is limiting. We generally need a
more practical method for looking at how close the reconstructed signal is to the original.
A natural thing to do when looking at the fidelity of a reconstructed sequence is to
look at the differences between the original and reconstructed values—in other words, the
distortion introduced in the compression process. Two popular measures of distortion or
difference between the original and reconstructed sequences are the squared error measure
and the absolute difference measure. These are called difference distortion measures. If {x_n}
is the source output and {y_n} is the reconstructed sequence, then the squared error measure
is given by

d(x, y) = (x − y)^2    (8.1)

and the absolute difference measure is given by

d(x, y) = |x − y|    (8.2)
In general, it is difficult to examine the difference on a term-by-term basis. Therefore,
a number of average measures are used to summarize the information in the difference
sequence. The most often used average measure is the average of the squared error measure.
This is called the mean squared error (mse) and is often represented by the symbol σ^2 or σ_d^2:

σ^2 = (1/N) Σ_{n=1}^{N} (x_n − y_n)^2    (8.3)
If we are interested in the size of the error relative to the signal, we can find the ratio of
the average squared value of the source output and the mse. This is called the signal-to-noise
ratio (SNR):

SNR = σ_x^2 / σ_d^2    (8.4)

where σ_x^2 is the average squared value of the source output, or signal, and σ_d^2 is the mse. The
SNR is often measured on a logarithmic scale and the units of measurement are decibels
(abbreviated to dB):

SNR (dB) = 10 log_10 (σ_x^2 / σ_d^2)    (8.5)
Sometimes we are more interested in the size of the error relative to the peak value of
the signal x_peak than in the size of the error relative to the average squared value of the
signal. This ratio is called the peak-signal-to-noise-ratio (PSNR) and is given by

PSNR (dB) = 10 log_10 (x_peak^2 / σ_d^2)    (8.6)
Another difference distortion measure that is used quite often, although not as often as
the mse, is the average of the absolute difference, or
d_1 = (1/N) Σ_{n=1}^{N} |x_n − y_n|    (8.7)
This measure seems especially useful for evaluating image compression algorithms.
In some applications, the distortion is not perceptible as long as it is below some
threshold. In these situations we might be interested in the maximum value of the error
magnitude,
d_∞ = max_n |x_n − y_n|    (8.8)
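
These difference distortion measures are straightforward to compute; the sketch below implements the mse, SNR, and PSNR of Equations (8.3), (8.5), and (8.6). The function names are ours, and the code assumes linking with the math library.

#include <math.h>

/* Mean squared error, Equation (8.3) */
double mse(const double *x, const double *y, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double e = x[i] - y[i];
        sum += e * e;
    }
    return sum / n;
}

/* SNR in dB, Equation (8.5): average squared signal value over the mse */
double snr_db(const double *x, const double *y, int n)
{
    double sig = 0.0;
    for (int i = 0; i < n; i++)
        sig += x[i] * x[i];
    sig /= n;
    return 10.0 * log10(sig / mse(x, y, n));
}

/* PSNR in dB, Equation (8.6), for a known peak value such as 255 */
double psnr_db(const double *x, const double *y, int n, double peak)
{
    return 10.0 * log10(peak * peak / mse(x, y, n));
}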
We have looked at two approaches to measuring the fidelity of a reconstruction. The
first method involving humans may provide a very accurate measure of perceptible fidelity,
but it is not practical and not useful in mathematical design approaches. The second is
mathematically tractable, but it usually does not provide a very accurate indication of the
perceptible fidelity of the reconstruction. A middle ground is to find a mathematical model for
human perception, transform both the source output and the reconstruction to this perceptual
space, and then measure the difference in the perceptual space. For example, suppose we
could find a transformation that represented the actions performed by the human visual
system (HVS) on the light intensity impinging on the retina before it is “perceived” by the
cortex. We could then apply this transformation to both x and y and examine the difference between the transformed values. There
are two problems with this approach. First, the process of human perception is very difficult
to model, and accurate models of perception are yet to be discovered. Second, even if we
could find a mathematical model for perception, the odds are that it would be so complex
that it would be mathematically intractable.
In spite of these disheartening prospects, the study of perception mechanisms is still
important from the perspective of design and analysis of compression systems. Even if we
cannot obtain a transformation that accurately models perception, we can learn something
about the properties of perception that may come in handy in the design of compression
systems. In the following, we will look at some of the properties of the human visual system
and the perception of sound. Our review will be far from thorough, but the intent here is to
present some properties that will be useful in later chapters when we talk about compression
of images, video, speech, and audio.
8.3.1 The Human Visual System
The eye is a globe-shaped object with a lens in the front that focuses objects onto the retina
in the back of the eye. The retina contains two kinds of receptors, called rods and cones.
The rods are more sensitive to light than cones, and in low light most of our vision is due
to the operation of rods. There are three kinds of cones, each of which are most sensitive at
different wavelengths of the visible spectrum. The peak sensitivities of the cones are in the
red, blue, and green regions of the visible spectrum [93]. The cones are mostly concentrated
in a very small area of the retina called the fovea. Although the rods are more numerous
than the cones, the cones provide better resolution because they are more closely packed in
the fovea. The muscles of the eye move the eyeball, positioning the image of the object on
FIGURE 8.2 A model of monochromatic vision: the light source output passes through a spatial low-pass filter followed by a logarithmic nonlinearity.
the fovea. This becomes a drawback in low light. One way to improve what you see in low
light is to focus to one side of the object. This way the object is imaged on the rods, which
are more sensitive to light.
The eye is sensitive to light over an enormously large range of intensities; the upper end
of the range is about 10^10 times the lower end of the range. However, at a given instant
we cannot perceive the entire range of brightness. Instead, the eye adapts to an average
brightness level. The range of brightness levels that the eye can perceive at any given instant
is much smaller than the total range it is capable of perceiving.
If we illuminate a screen with a certain intensity I and shine a spot on it with different
intensity, the spot becomes visible when the difference in intensity is ΔI. This is called
the just noticeable difference (jnd). The ratio ΔI/I is known as the Weber fraction or Weber
ratio. This ratio is known to be constant at about 0.02 over a wide range of intensities in the
absence of background illumination. However, if the background illumination is changed, the
range over which the Weber ratio remains constant becomes relatively small. The constant
range is centered around the intensity level to which the eye adapts.
If ΔI/I is constant, then we can infer that the sensitivity of the eye to intensity is a
logarithmic function (d(log I) = dI/I). Thus, we can model the eye as a receptor whose
output goes to a logarithmic nonlinearity. We also know that the eye acts as a spatial low-
pass filter [94, 95]. Putting all of this information together, we can develop a model for
monochromatic vision, shown in Figure 8.2.
How does this description of the human visual system relate to coding schemes? Notice
that the mind does not perceive everything the eye sees. We can use this knowledge to
design compression systems such that the distortion introduced by our lossy compression
scheme is not noticeable.
8.3.2 Auditory Perception
The ear is divided into three parts, creatively named the outer ear, the middle ear, and the
inner ear. The outer ear consists of the structure that directs the sound waves, or pressure
waves, to the tympanic membrane, or eardrum. This membrane separates the outer ear from
the middle ear. The middle ear is an air-filled cavity containing three small bones that provide
coupling between the tympanic membrane and the oval window, which leads into the inner
ear. The tympanic membrane and the bones convert the pressure waves in the air to acoustical
vibrations. The inner ear contains, among other things, a snail-shaped passage called the
cochlea that contains the transducers that convert the acoustical vibrations to nerve impulses.
The human ear can hear sounds from approximately 20 Hz to 20 kHz, a 1000:1 range
of frequencies. The range decreases with age; older people are usually unable to hear the
higher frequencies. As in vision, auditory perception has several nonlinear components. One
is that loudness is a function not only of the sound level, but also of the frequency. Thus, for
example, a pure 1 kHz tone presented at a 20 dB intensity level will have the same apparent
loudness as a 50 Hz tone presented at a 50 dB intensity level. By plotting the amplitude of
tones at different frequencies that sound equally loud, we get a series of curves called the
Fletcher-Munson curves [96].
Another very interesting audio phenomenon is that of masking, where one sound blocks
out or masks the perception of another sound. The fact that one sound can drown out another
seems reasonable. What is not so intuitive about masking is that if we were to try to mask a
pure tone with noise, only the noise in a small frequency range around the tone being masked
contributes to the masking. This range of frequencies is called the critical band. For most
frequencies, when the noise just masks the tone, the ratio of the power of the tone divided by
the power of the noise in the critical band is a constant [97]. The width of the critical band
varies with frequency. This fact has led to the modeling of auditory perception as a bank
of band-pass filters. There are a number of other, more complicated masking phenomena
that also lend support to this theory (see [97, 98] for more information). The limitations of
auditory perception play a major role in the design of audio compression algorithms. We
will delve further into these limitations when we discuss audio compression in Chapter 16.
8.4 Information Theory Revisited
In order to study the trade-offs between rate and the distortion of lossy compression schemes,
we would like to have rate defined explicitly as a function of the distortion for a given
distortion measure. Unfortunately, this is generally not possible, and we have to go about it
in a more roundabout way. Before we head down this path, we need a few more concepts
from information theory.
In Chapter 2, when we talked about information, we were referring to letters from a
single alphabet. In the case of lossy compression, we have to deal with two alphabets, the
source alphabet and the reconstruction alphabet. These two alphabets are generally different
from each other.
Example 8.4.1:
A simple lossy compression approach is to drop a certain number of the least significant
bits from the source output. We might use such a scheme between a source that generates
monochrome images at 8 bits per pixel and a user whose display facility can display only 64
different shades of gray. We could drop the two least significant bits from each pixel before
transmitting the image to the user. There are other methods we can use in this situation that
are much more effective, but this is certainly simple.
Suppose our source output consists of 4-bit words {0, 1, 2, ..., 15}. The source encoder
encodes each value by shifting out the least significant bit. The output alphabet for the source
coder is {0, 1, 2, ..., 7}. At the receiver we cannot recover the original value exactly. However,
we can get an approximation by shifting in a 0 as the least significant bit, or in other words,
multiplying the source encoder output by two. Thus, the reconstruction alphabet is {0, 2, 4, ..., 14},
and the source and reconstruction do not take values from the same alphabet.
As the source and reconstruction alphabets can be distinct, we need to be able to talk
about the information relationships between two random variables that take on values from
two different alphabets.
8.4.1 Conditional Entropy
Let X be a random variable that takes values from the source alphabet 𝒳 = {x_0, x_1, ..., x_{N-1}}.
Let Y be a random variable that takes on values from the reconstruction alphabet 𝒴 = {y_0, y_1, ..., y_{M-1}}.
From Chapter 2 we know that the entropy of the source and the recon-
struction are given by

H(X) = -\sum_{i=0}^{N-1} P(x_i) \log_2 P(x_i)

and

H(Y) = -\sum_{j=0}^{M-1} P(y_j) \log_2 P(y_j)
A measure of the relationship between two random variables is the conditional entropy
(the average value of the conditional self-information). Recall that the self-information for
an event A was defined as

i(A) = \log \frac{1}{P(A)} = -\log P(A)

In a similar manner, the conditional self-information of an event A, given that another event
B has occurred, can be defined as

i(A|B) = \log \frac{1}{P(A|B)} = -\log P(A|B)
Suppose B is the event “Frazer has not drunk anything in two days,” and A is the event
“Frazer is thirsty.” Then P(A|B) should be close to one, which means that the conditional
self-information i(A|B) would be close to zero. This makes sense from an intuitive point
of view as well. If we know that Frazer has not drunk anything in two days, then the
statement that Frazer is thirsty would not be at all surprising to us and would contain very
little information.
As in the case of self-information, we are generally interested in the average value of
the conditional self-information. This average value is called the conditional entropy. The
conditional entropies of the source and reconstruction alphabets are given as
H(X|Y) = -\sum_{i=0}^{N-1} \sum_{j=0}^{M-1} P(x_i|y_j) P(y_j) \log_2 P(x_i|y_j)   (8.9)
and

H(Y|X) = -\sum_{i=0}^{N-1} \sum_{j=0}^{M-1} P(x_i|y_j) P(y_j) \log_2 P(y_j|x_i)   (8.10)
The conditional entropy H(X|Y) can be interpreted as the amount of uncertainty remaining
about the random variable X, or the source output, given that we know what value the
reconstruction Y took. The additional knowledge of Y should reduce the uncertainty about
X, and we can show that

H(X|Y) \le H(X)   (8.11)

(see Problem 5).
Example 8.4.2:
Suppose we have the 4-bits-per-symbol source and compression scheme described in Exam-
ple 8.4.1. Assume that the source is equally likely to select any letter from its alphabet. Let
us calculate the various entropies for this source and compression scheme.
As the source outputs are all equally likely, P(X = i) = 1/16 for all i ∈ {0, 1, 2, ..., 15},
and therefore

H(X) = -\sum_{i} \frac{1}{16} \log \frac{1}{16} = \log 16 = 4 \text{ bits}   (8.12)
We can calculate the probabilities of the reconstruction alphabet:

P(Y = j) = P(X = j) + P(X = j+1) = \frac{1}{16} + \frac{1}{16} = \frac{1}{8}   (8.13)
Therefore, H(Y) = 3 bits. To calculate the conditional entropy H(X|Y), we need the condi-
tional probabilities P(x_i|y_j). From our construction of the source encoder, we see that

P(X = i|Y = j) = \begin{cases} \frac{1}{2} & \text{if } i = j \text{ or } i = j+1, \text{ for } j = 0, 2, 4, \ldots, 14 \\ 0 & \text{otherwise.} \end{cases}   (8.14)
Substituting this in the expression for H(X|Y) in Equation (8.9), we get

H(X|Y) = -\sum_{i} \sum_{j} P(X = i|Y = j) P(Y = j) \log P(X = i|Y = j)

       = -\sum_{j} \big[ P(X = j|Y = j) P(Y = j) \log P(X = j|Y = j)
              + P(X = j+1|Y = j) P(Y = j) \log P(X = j+1|Y = j) \big]

       = -8 \left[ \frac{1}{2} \cdot \frac{1}{8} \log \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{8} \log \frac{1}{2} \right]   (8.15)

       = 1   (8.16)
Let us compare this answer to what we would have intuitively expected the uncertainty to
be, based on our knowledge of the compression scheme. With the coding scheme described
here, knowledge of Y means that we know the first 3 bits of the input X. The only thing
about the input that we are uncertain about is the value of the last bit. In other words, if
we know the value of the reconstruction, our uncertainty about the source output is 1 bit.
Therefore, at least in this case, our intuition matches the mathematical definition.
To obtain H(Y|X), we need the conditional probabilities P(y_j|x_i). From our knowledge
of the compression scheme, we see that

P(Y = j|X = i) = \begin{cases} 1 & \text{if } i = j \text{ or } i = j+1, \text{ for } j = 0, 2, 4, \ldots, 14 \\ 0 & \text{otherwise.} \end{cases}   (8.17)

If we substitute these values into Equation (8.10), we get H(Y|X) = 0 bits (note that
0 log 0 = 0). This also makes sense. For the compression scheme described here, if we know
the source output, we know 4 bits, the first 3 of which are the reconstruction. Therefore,
in this example, knowledge of the source output at a specific time completely specifies the
corresponding reconstruction.
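To make the bookkeeping above concrete, the short sketch below (not from the text) builds the joint pmf implied by the drop-the-LSB scheme of Examples 8.4.1 and 8.4.2 and recomputes the entropies numerically, using the chain rule H(X|Y) = H(X, Y) − H(Y), which is equivalent to Equation (8.9).

```python
import math

# Joint pmf P(X = i, Y = j) for the drop-the-LSB scheme:
# X is uniform over {0, ..., 15}; Y is X with its least significant bit cleared.
joint = {}
for i in range(16):
    j = (i >> 1) << 1                  # shift out the LSB, then multiply by two
    joint[(i, j)] = 1.0 / 16.0

def entropy(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

px, py = {}, {}
for (i, j), p in joint.items():
    px[i] = px.get(i, 0.0) + p
    py[j] = py.get(j, 0.0) + p

H_X = entropy(px)                        # 4 bits
H_Y = entropy(py)                        # 3 bits
H_XY = entropy(joint)                    # joint entropy H(X, Y)
print(H_X, H_Y, H_XY - H_Y, H_XY - H_X)  # 4.0 3.0 1.0 0.0 -> H(X|Y) = 1, H(Y|X) = 0
```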
8.4.2 Average Mutual Information
We make use of one more quantity that relates the uncertainty or entropy of two random
variables. This quantity is called the mutual information and is defined as

i(x_k; y_j) = \log \left[ \frac{P(x_k|y_j)}{P(x_k)} \right]   (8.18)
We will use the average value of this quantity, appropriately called the average mutual
information, which is given by
I(X;Y) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} P(x_i, y_j) \log \left[ \frac{P(x_i|y_j)}{P(x_i)} \right]   (8.19)

       = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} P(x_i|y_j) P(y_j) \log \left[ \frac{P(x_i|y_j)}{P(x_i)} \right]   (8.20)
We can write the average mutual information in terms of the entropy and the conditional
entropy by expanding the argument of the logarithm in Equation (8.20).
I(X;Y) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} P(x_i, y_j) \log \left[ \frac{P(x_i|y_j)}{P(x_i)} \right]   (8.21)

       = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} P(x_i, y_j) \log P(x_i|y_j) - \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} P(x_i, y_j) \log P(x_i)   (8.22)

       = H(X) - H(X|Y)   (8.23)

where the second term in Equation (8.22) is H(X), and the first term is -H(X|Y). Thus, the
average mutual information is the entropy of the source minus the uncertainty that remains
about the source output after the reconstructed value has been received. The average mutual
information can also be written as

I(X;Y) = H(Y) - H(Y|X) = I(Y;X)   (8.24)
Example 8.4.3:
For the source coder of Example 8.4.2, H(X) = 4 bits, and H(X|Y) = 1 bit. Therefore, using
Equation (8.23), the average mutual information I(X;Y) is 3 bits. If we wish to use Equation
(8.24) to compute I(X;Y), we would need H(Y) and H(Y|X), which from Example 8.4.2
are 3 and 0, respectively. Thus, the value of I(X;Y) still works out to be 3 bits.
8.4.3 Differential Entropy
Up to this point we have assumed that the source picks its outputs from a discrete alphabet.
When we study lossy compression techniques, we will see that for many sources of interest
to us this assumption is not true. In this section, we will extend some of the information
theoretic concepts defined for discrete random variables to the case of random variables with
continuous distributions.
Unfortunately, we run into trouble from the very beginning. Recall that the first quantity
we defined was self-information, which was given by log(1/P(x_i)), where P(x_i) is the probability
that the random variable will take on the value x_i. For a random variable with a continuous
distribution, this probability is zero. Therefore, if the random variable has a continuous
distribution, the “self-information” associated with any value is infinity.
If we do not have the concept of self-information, how do we go about defining entropy,
which is the average value of the self-information? We know that many continuous functions
can be written as limiting cases of their discretized version. We will try to take this route
in order to define the entropy of a continuous random variable X with probability density
function (pdf) f_X(x).
While the random variable X cannot generally take on a particular value with nonzero
probability, it can take on a value in an interval with nonzero probability. Therefore, let us
divide the range of the random variable into intervals of size Δ. Then, by the mean value
theorem, in each interval [(i-1)Δ, iΔ), there exists a number x_i such that

f_X(x_i) \Delta = \int_{(i-1)\Delta}^{i\Delta} f_X(x) dx   (8.25)
Let us define a discrete random variable X_d with pdf

P(X_d = x_i) = f_X(x_i) \Delta   (8.26)

Then we can obtain the entropy of this random variable as

H(X_d) = -\sum_{i=-\infty}^{\infty} P(x_i) \log P(x_i)   (8.27)

       = -\sum_{i=-\infty}^{\infty} f_X(x_i) \Delta \log (f_X(x_i) \Delta)   (8.28)
       = -\sum_{i=-\infty}^{\infty} f_X(x_i) \Delta \log f_X(x_i) - \sum_{i=-\infty}^{\infty} f_X(x_i) \Delta \log \Delta   (8.29)

       = -\sum_{i=-\infty}^{\infty} f_X(x_i) \log f_X(x_i) \Delta - \log \Delta   (8.30)
Taking the limit as Δ → 0 of Equation (8.30), the first term goes to -\int_{-\infty}^{\infty} f_X(x) \log f_X(x) dx,
which looks like the analog to our definition of entropy for discrete sources. However, the
second term is -\log Δ, which goes to plus infinity when Δ goes to zero. It seems there is
not an analog to entropy as defined for discrete sources. However, the first term in the limit
serves some functions similar to those served by entropy in the discrete case and is a useful
function in its own right. We call this term the differential entropy of a continuous source
and denote it by h(X).
Example 8.4.4:
Suppose we have a random variable X that is uniformly distributed in the interval (a, b).
The differential entropy of this random variable is given by

h(X) = -\int_{-\infty}^{\infty} f_X(x) \log f_X(x) dx   (8.31)

     = -\int_{a}^{b} \frac{1}{b-a} \log \frac{1}{b-a} dx   (8.32)

     = \log (b-a)   (8.33)

Notice that when (b - a) is less than one, the differential entropy will become negative—in
contrast to the entropy, which never takes on negative values.
Later in this chapter, we will find particular use for the differential entropy of the
Gaussian source.
Example 8.4.5:
Suppose we have a random variable X that has a Gaussian pdf,

f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)   (8.34)

The differential entropy is given by

h(X) = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \right] dx   (8.35)

     = -\log \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} f_X(x) dx + \int_{-\infty}^{\infty} \frac{(x-\mu)^2}{2\sigma^2} \log e \, f_X(x) dx   (8.36)

     = \frac{1}{2} \log (2\pi\sigma^2) + \frac{1}{2} \log e   (8.37)

     = \frac{1}{2} \log (2\pi e \sigma^2)   (8.38)
Thus, the differential entropy of a Gaussian random variable is an increasing function of its
variance.
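As a quick numerical check of Equation (8.38), the sketch below (not from the text) compares the closed-form differential entropy of a Gaussian with a crude Riemann-sum approximation of -\int f_X(x) \log_2 f_X(x) dx; the variance, integration range, and step count are arbitrary choices.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def differential_entropy_numeric(pdf, lo, hi, n=200000):
    """Approximate h(X) = -integral of f log2 f with a midpoint Riemann sum."""
    dx = (hi - lo) / n
    h = 0.0
    for k in range(n):
        f = pdf(lo + (k + 0.5) * dx)
        if f > 0:
            h -= f * math.log2(f) * dx
    return h

sigma2 = 4.0
closed_form = 0.5 * math.log2(2 * math.pi * math.e * sigma2)     # Equation (8.38)
numeric = differential_entropy_numeric(lambda x: gaussian_pdf(x, 0.0, sigma2), -40.0, 40.0)
print(closed_form, numeric)   # both approximately 3.05 bits
```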
The differential entropy for the Gaussian distribution has the added distinction that it is
larger than the differential entropy for any other continuously distributed random variable
with the same variance. That is, for any random variable X with variance σ²,

h(X) \le \frac{1}{2} \log (2\pi e \sigma^2)   (8.39)
The proof of this statement depends on the fact that for any two continuous distributions
f_X(x) and g_X(x),

-\int_{-\infty}^{\infty} f_X(x) \log f_X(x) dx \le -\int_{-\infty}^{\infty} f_X(x) \log g_X(x) dx   (8.40)
We will not prove Equation (8.40) here, but you may refer to [99] for a simple proof. To
obtain Equation (8.39), we substitute the expression for the Gaussian distribution for g_X(x).
Noting that the left-hand side of Equation (8.40) is simply the differential entropy of the
random variable X, we have

h(X) \le -\int_{-\infty}^{\infty} f_X(x) \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \right] dx

     = \frac{1}{2} \log (2\pi\sigma^2) + \log e \int_{-\infty}^{\infty} f_X(x) \frac{(x-\mu)^2}{2\sigma^2} dx

     = \frac{1}{2} \log (2\pi\sigma^2) + \frac{\log e}{2\sigma^2} \int_{-\infty}^{\infty} f_X(x) (x-\mu)^2 dx

     = \frac{1}{2} \log (2\pi e \sigma^2)   (8.41)
We seem to be striking out with continuous random variables. There is no analog for
self-information and really none for entropy either. However, the situation improves when
we look for an analog for the average mutual information. Let us define the random variable
Yd in a manner similar to the random variable Xd, as the discretized version of a continuous
valued random variable Y. Then we can show (see Problem 4)
H(X_d|Y_d) = -\sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} f_{X|Y}(x_i|y_j) f_Y(y_j) \Delta^2 \log f_{X|Y}(x_i|y_j) - \log \Delta   (8.42)
Therefore, the average mutual information for the discretized random variables is given by
I(X_d;Y_d) = H(X_d) - H(X_d|Y_d)   (8.43)

           = -\sum_{i=-\infty}^{\infty} f_X(x_i) \Delta \log f_X(x_i)   (8.44)

             - \left( -\sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} f_{X|Y}(x_i|y_j) f_Y(y_j) \Delta^2 \log f_{X|Y}(x_i|y_j) \right)   (8.45)

Notice that the two log Δ terms in the expressions for H(X_d) and H(X_d|Y_d) cancel each other out,
and as long as h(X) and h(X|Y) are not equal to infinity, when we take the limit as Δ → 0
of I(X_d;Y_d) we get

I(X;Y) = h(X) - h(X|Y)   (8.46)
The average mutual information in the continuous case can be obtained as a limiting case of
the average mutual information for the discrete case and has the same physical significance.
We have gone through a lot of mathematics in this section. But the information will be
used immediately to define the rate distortion function for a random source.
8.5 Rate Distortion Theory
Rate distortion theory is concerned with the trade-offs between distortion and rate in lossy
compression schemes. Rate is defined as the average number of bits used to represent
each sample value. One way of representing the trade-offs is via a rate distortion function
R(D). The rate distortion function R(D) specifies the lowest rate at which the output of a
source can be encoded while keeping the distortion less than or equal to D. On our way to
mathematically defining the rate distortion function, let us look at the rate and distortion for
some different lossy compression schemes.
In Example 8.4.2, knowledge of the value of the input at time k completely specifies the
reconstructed value at time k. In this situation,

P(y_j|x_i) = \begin{cases} 1 & \text{for some } j = j_i \\ 0 & \text{otherwise.} \end{cases}   (8.47)

Therefore,

D = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} P(y_j|x_i) P(x_i) d(x_i, y_j)   (8.48)

  = \sum_{i=0}^{N-1} P(x_i) d(x_i, y_{j_i})   (8.49)

where we used the fact that P(x_i, y_j) = P(y_j|x_i) P(x_i) in Equation (8.48). The rate for this
source coder is the output entropy H(Y) of the source decoder. If this were always the
case, the task of obtaining a rate distortion function would be relatively simple. Given a
distortion constraint D∗, we could look at all encoders with distortion less than D∗ and
pick the one with the lowest output entropy. This entropy would be the rate corresponding
to the distortion D∗. However, the requirement that knowledge of the input at time k
completely specifies the reconstruction at time k is very restrictive, and there are many
efficient compression techniques that would have to be excluded under this requirement.
Consider the following example.
Example 8.5.1:
With a data sequence that consists of height and weight measurements, obviously height and
weight are quite heavily correlated. In fact, after studying a long sequence of data, we find
that if we plot the height along the x axis and the weight along the y axis, the data points
cluster along the line y = 2.5x. In order to take advantage of this correlation, we devise the
following compression scheme. For a given pair of height and weight measurements, we
find the orthogonal projection onto the y = 2.5x line as shown in Figure 8.3. The point on
this line can be represented by its distance from the origin, rounded to the nearest integer. Thus, we
encode a pair of values into a single value. At the time of reconstruction, we simply map
this value back into a pair of height and weight measurements.
For instance, suppose somebody is 72 inches tall and weighs 200 pounds (point A in
Figure 8.3). This corresponds to a point at a distance of 212 along the y = 2.5x line. The
reconstructed values of the height and weight corresponding to this value are 79 and 197.
Notice that the reconstructed values differ from the original values. Suppose we now have
FIGURE 8.3 Compression scheme for encoding height-weight pairs (height in inches on the x axis, weight in pounds on the y axis, with points A and B projected onto the line y = 2.5x).
another individual who is also 72 inches tall but weighs 190 pounds (point B in Figure 8.3).
The source coder output for this pair would be 203, and the reconstructed values for height
and weight are 75 and 188, respectively. Notice that while the height value in both cases was
the same, the reconstructed value is different. The reason for this is that the reconstructed
value for the height depends on the weight. Thus, for this particular source coder, we
do not have a conditional probability density function P(y_j|x_i) of the form shown in
Equation (8.47).
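The following sketch (not part of the text) reproduces the arithmetic of this example under the stated assumption that the data cluster along y = 2.5x: the encoder sends the rounded distance of the orthogonal projection from the origin, and the decoder maps that distance back to a point on the line.

```python
import math

SLOPE = 2.5
NORM = math.sqrt(1.0 + SLOPE ** 2)          # length of the direction vector (1, 2.5)

def encode(height, weight):
    """Orthogonal projection onto y = 2.5x, transmitted as the rounded distance from the origin."""
    return round((height + SLOPE * weight) / NORM)

def decode(code):
    """Map the transmitted distance back to a (height, weight) point on the line."""
    height = code / NORM
    return round(height), round(SLOPE * height)

print(encode(72, 200), decode(encode(72, 200)))   # 212 (79, 197), point A in Figure 8.3
print(encode(72, 190), decode(encode(72, 190)))   # 203 (75, 188), point B in Figure 8.3
```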
Let us examine the distortion for this scheme a little more closely. As the conditional
probability for this scheme is not of the form of Equation (8.47), we can no longer write the
distortion in the form of Equation (8.49). Recall that the general form of the distortion is

D = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} d(x_i, y_j) P(x_i) P(y_j|x_i)   (8.50)
Each term in the summation consists of three factors: the distortion measure d(x_i, y_j), the
source density P(x_i), and the conditional probability P(y_j|x_i). The distortion measure is a
measure of closeness of the original and reconstructed versions of the signal and is generally
determined by the particular application. The source probabilities are solely determined by
the source. The third factor, the set of conditional probabilities, can be seen as a description
of the compression scheme.
Therefore, for a given source with some pdf P(x_i) and a specified distortion measure
d(·,·), the distortion is a function only of the conditional probabilities {P(y_j|x_i)}; that is,

D = D(\{P(y_j|x_i)\})   (8.51)
Therefore, we can write the constraint that the distortion D be less than some value D∗ as a
requirement that the conditional probabilities for the compression scheme belong to a set of
conditional probabilities Γ that have the property that

\Gamma = \{ P(y_j|x_i) \text{ such that } D(\{P(y_j|x_i)\}) \le D^* \}   (8.52)
Once we know the set of compression schemes to which we have to confine ourselves,
we can start to look at the rate of these schemes. In Example 8.4.2, the rate was the entropy
of Y. However, that was a result of the fact that the conditional probability describing that
particular source coder took on only the values 0 and 1. Consider the following trivial
situation.
Example 8.5.2:
Suppose we have the same source as in Example 8.4.2 and the same reconstruction alphabet.
Suppose the distortion measure is

d(x_i, y_j) = (x_i - y_j)^2

and D∗ = 225. One compression scheme that satisfies the distortion constraint randomly
maps the input to any one of the outputs; that is,

P(y_j|x_i) = \frac{1}{8} \quad \text{for } i = 0, 1, \ldots, 15 \text{ and } j = 0, 2, \ldots, 14

We can see that this conditional probability assignment satisfies the distortion constraint. As
each of the eight reconstruction values is equally likely, H(Y) is 3 bits. However, we are not
transmitting any information. We could get exactly the same results by transmitting 0 bits
and randomly picking Y at the receiver.
Therefore, the entropy of the reconstruction H(Y) cannot be a measure of the rate. In
his 1959 paper on source coding [100], Shannon showed that the minimum rate for a given
distortion is given by

R(D) = \min_{\{P(y_j|x_i)\} \in \Gamma} I(X;Y)   (8.53)

To prove this is beyond the scope of this book. (Further information can be found in [3]
and [4].) However, we can at least convince ourselves that defining the rate as an average
mutual information gives sensible answers when used for the examples shown here. Consider
Example 8.4.2. The average mutual information in this case is 3 bits, which is what we said
the rate was. In fact, notice that whenever the conditional probabilities are constrained to be
of the form of Equation (8.47),

H(Y|X) = 0

then

I(X;Y) = H(Y)

which had been our measure of rate.
In Example 8.5.2, the average mutual information is 0 bits, which accords with our
intuitive feeling of what the rate should be. Again, whenever

H(Y|X) = H(Y)

that is, knowledge of the source gives us no knowledge of the reconstruction,

I(X;Y) = 0
which seems entirely reasonable. We should not have to transmit any bits when we are not
sending any information.
At least for the examples here, it seems that the average mutual information does
represent the rate. However, earlier we had said that the average mutual information between
the source output and the reconstruction is a measure of the information conveyed by the
reconstruction about the source output. Why are we then looking for compression schemes
that minimize this value? To understand this, we have to remember that the process of finding
the performance of the optimum compression scheme had two parts. In the first part we
specified the desired distortion. The entire set of conditional probabilities over which the
average mutual information is minimized satisfies the distortion constraint. Therefore, we
can leave the question of distortion, or fidelity, aside and concentrate on minimizing the rate.
Finally, how do we find the rate distortion function? There are two ways: one is a
computational approach developed by Arimoto [101] and Blahut [102]. While the derivation
of the algorithm is beyond the scope of this book, the algorithm itself is relatively simple.
The other approach is to find a lower bound for the average mutual information and then
show that we can achieve this bound. We use this approach to find the rate distortion
functions for two important sources.
Example 8.5.3: Rate distortion function for the binary source
Suppose we have a source alphabet {0, 1}, with P(0) = p. The reconstruction alphabet is
also binary. Given the distortion measure

d(x_i, y_j) = x_i \oplus y_j   (8.54)

where ⊕ is modulo 2 addition, let us find the rate distortion function. Assume for the moment
that p ≤ 1/2. For D > p, an encoding scheme that would satisfy the distortion criterion would
be not to transmit anything and fix Y = 1. So for D ≥ p,

R(D) = 0   (8.55)

We will find the rate distortion function for the distortion range 0 ≤ D < p.
Find a lower bound for the average mutual information:

I(X;Y) = H(X) - H(X|Y)   (8.56)

       = H(X) - H(X \oplus Y|Y)   (8.57)

       \ge H(X) - H(X \oplus Y) \quad \text{from Equation (8.11).}   (8.58)

In the second step we have used the fact that if we know Y, then knowing X we can obtain
X ⊕ Y and vice versa, as X ⊕ Y ⊕ Y = X.
Let us look at the terms on the right-hand side of Equation (8.58):

H(X) = -p \log_2 p - (1-p) \log_2 (1-p) = H_b(p)   (8.59)

where H_b(p) is called the binary entropy function and is plotted in Figure 8.4. Note that
H_b(p) = H_b(1-p).
Given that H(X) is completely specified by the source probabilities, our task now is
to find the conditional probabilities P(x_i|y_j) such that H(X ⊕ Y) is maximized while
the average distortion E[d(x_i, y_j)] ≤ D. H(X ⊕ Y) is simply the binary entropy function
H_b(P(X ⊕ Y = 1)), where

P(X \oplus Y = 1) = P(X = 0, Y = 1) + P(X = 1, Y = 0)   (8.60)
FIGURE 8.4 The binary entropy function H_b(p), plotted against p from 0 to 1.
Therefore, to maximize H(X ⊕ Y), we would want P(X ⊕ Y = 1) to be as close as possible
to one-half. However, the selection of P(X ⊕ Y = 1) also has to satisfy the distortion constraint.
The distortion is given by

E[d(x_i, y_j)] = 0 \times P(X = 0, Y = 0) + 1 \times P(X = 0, Y = 1)
               + 1 \times P(X = 1, Y = 0) + 0 \times P(X = 1, Y = 1)
             = P(X = 0, Y = 1) + P(X = 1, Y = 0)
             = P(Y = 1|X = 0) p + P(Y = 0|X = 1)(1 - p)   (8.61)

But this is simply the probability that X ⊕ Y = 1. Therefore, the maximum value that
P(X ⊕ Y = 1) can have is D. Our assumptions were that D < p and p ≤ 1/2, which means that
D < 1/2. Therefore, P(X ⊕ Y = 1) is closest to 1/2 while being less than or equal to D when
P(X ⊕ Y = 1) = D. Therefore,

I(X;Y) \ge H_b(p) - H_b(D)   (8.62)

We can show that for P(X = 0|Y = 1) = P(X = 1|Y = 0) = D, this bound is achieved.
That is, if P(X = 0|Y = 1) = P(X = 1|Y = 0) = D, then

I(X;Y) = H_b(p) - H_b(D)   (8.63)

Therefore, for D < p and p ≤ 1/2,

R(D) = H_b(p) - H_b(D)   (8.64)
Finally, if p > 1/2, then we simply switch the roles of p and 1 - p. Putting all this together,
the rate distortion function for a binary source is

R(D) = \begin{cases} H_b(p) - H_b(D) & \text{for } D < \min\{p, 1-p\} \\ 0 & \text{otherwise.} \end{cases}   (8.65)
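As a small illustration (not from the text), the sketch below evaluates Equation (8.65) for a few distortion values; the source probability p = 0.8 is an arbitrary choice, matching the source used in Problem 2.

```python
import math

def hb(p):
    """Binary entropy function H_b(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def rd_binary(D, p):
    """Rate distortion function of Equation (8.65) for a binary source with P(0) = p."""
    if D < min(p, 1.0 - p):
        return hb(p) - hb(D)
    return 0.0

for D in (0.0, 0.05, 0.1, 0.2):
    print(D, round(rd_binary(D, p=0.8), 3))   # 0.722, 0.436, 0.253, 0.0
```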
Example 8.5.4: Rate distortion function for the Gaussian source
Suppose we have a continuous amplitude source that has a zero mean Gaussian pdf with
variance σ². If our distortion measure is given by

d(x, y) = (x - y)^2   (8.66)

our distortion constraint is given by

E[(X - Y)^2] \le D   (8.67)

Our approach to finding the rate distortion function will be the same as in the previous
example; that is, find a lower bound for I(X;Y) given a distortion constraint, and then show
that this lower bound can be achieved.
First we find the rate distortion function for D < σ².

I(X;Y) = h(X) - h(X|Y)   (8.68)

       = h(X) - h(X - Y|Y)   (8.69)

       \ge h(X) - h(X - Y)   (8.70)

In order to minimize the right-hand side of Equation (8.70), we have to maximize the second
term subject to the constraint given by Equation (8.67). This term is maximized if X - Y is
Gaussian, and the constraint can be satisfied if E[(X - Y)^2] = D. Therefore, h(X - Y) is the
differential entropy of a Gaussian random variable with variance D, and the lower bound
becomes

I(X;Y) \ge \frac{1}{2} \log (2\pi e \sigma^2) - \frac{1}{2} \log (2\pi e D)   (8.71)

        = \frac{1}{2} \log \frac{\sigma^2}{D}   (8.72)

This average mutual information can be achieved if Y is zero mean Gaussian with variance
σ² - D, and

f_{X|Y}(x|y) = \frac{1}{\sqrt{2\pi D}} \exp\left( -\frac{(x-y)^2}{2D} \right)   (8.73)

For D > σ², if we set Y = 0, then

I(X;Y) = 0   (8.74)
and

E[(X - Y)^2] = \sigma^2 < D   (8.75)

Therefore, the rate distortion function for the Gaussian source can be written as

R(D) = \begin{cases} \frac{1}{2} \log \frac{\sigma^2}{D} & \text{for } D < \sigma^2 \\ 0 & \text{for } D \ge \sigma^2. \end{cases}   (8.76)
Like the differential entropy for the Gaussian source, the rate distortion function for the
Gaussian source also has the distinction of being larger than the rate distortion function for
any other source with a continuous distribution and the same variance. This is especially
valuable because for many sources it can be very difficult to calculate the rate distortion
function. In these situations, it is helpful to have an upper bound for the rate distortion
function. It would be very nice if we also had a lower bound for the rate distortion function
of a continuous random variable. Shannon described such a bound in his 1948 paper [7],
and it is appropriately called the Shannon lower bound. We will simply state the bound here
without derivation (for more information, see [4]).
The Shannon lower bound for a random variable X and the magnitude error criterion

d(x, y) = |x - y|   (8.77)

is given by

R_{SLB}(D) = h(X) - \log (2eD)   (8.78)

If we used the squared error criterion, the Shannon lower bound is given by

R_{SLB}(D) = h(X) - \frac{1}{2} \log (2\pi e D)   (8.79)
In this section we have defined the rate distortion function and obtained the rate distortion
function for two important sources. We have also obtained upper and lower bounds on the
rate distortion function for an arbitrary iid source. These functions and bounds are especially
useful when we want to know if it is possible to design compression schemes to provide a
specified rate and distortion given a particular source. They are also useful in determining the
amount of performance improvement that we could obtain by designing a better compression
scheme. In these ways the rate distortion function plays the same role for lossy compression
that entropy plays for lossless compression.
8.6 Models
As in the case of lossless compression, models play an important role in the design of lossy
compression algorithms; there are a variety of approaches available. The set of models we
can draw on for lossy compression is much wider than the set of models we studied for
216 8 L O S S Y C O D I N G
lossless compression. We will look at some of these models in this section. What is presented
here is by no means an exhaustive list of models. Our only intent is to describe those models
that will be useful in the following chapters.
8.6.1 Probability Models
An important method for characterizing a particular source is through the use of probability
models. As we shall see later, knowledge of the probability model is important for the design
of a number of compression schemes.
Probability models used for the design and analysis of lossy compression schemes
differ from those used in the design and analysis of lossless compression schemes. When
developing models in the lossless case, we tried for an exact match. The probability of each
symbol was estimated as part of the modeling process. When modeling sources in order
to design or analyze lossy compression schemes, we look more to the general rather than
exact correspondence. The reasons are more pragmatic than theoretical. Certain probability
distribution functions are more analytically tractable than others, and we try to match the
distribution of the source with one of these “nice” distributions.
Uniform, Gaussian, Laplacian, and Gamma distributions are four probability models
commonly used in the design and analysis of lossy compression systems:
• Uniform Distribution: As for lossless compression, this is again our ignorance model.
If we do not know anything about the distribution of the source output, except possibly
the range of values, we can use the uniform distribution to model the source. The
probability density function for a random variable uniformly distributed between
a and b is

f_X(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{otherwise.} \end{cases}   (8.80)
• Gaussian Distribution: The Gaussian distribution is one of the most commonly used
probability models for two reasons: it is mathematically tractable and, by virtue of the
central limit theorem, it can be argued that in the limit the distribution of interest goes
to a Gaussian distribution. The probability density function for a random variable with
a Gaussian distribution and mean μ and variance σ² is

f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)   (8.81)
• Laplacian Distribution: Many sources that we deal with have distributions that are
quite peaked at zero. For example, speech consists mainly of silence. Therefore,
samples of speech will be zero or close to zero with high probability. Image pixels
themselves do not have any attraction to small values. However, there is a high degree
of correlation among pixels. Therefore, a large number of the pixel-to-pixel differences
will have values close to zero. In these situations, a Gaussian distribution is not a very
close match to the data. A closer match is the Laplacian distribution, which is peaked
at zero. The distribution function for a zero mean random variable with Laplacian
distribution and variance σ² is

f_X(x) = \frac{1}{\sqrt{2\sigma^2}} \exp\left( -\frac{\sqrt{2}|x|}{\sigma} \right)   (8.82)
• Gamma Distribution: A distribution that is even more peaked, though considerably
less tractable, than the Laplacian distribution is the Gamma distribution. The distribu-
tion function for a Gamma distributed random variable with zero mean and variance
σ² is given by

f_X(x) = \frac{\sqrt[4]{3}}{\sqrt{8\pi\sigma|x|}} \exp\left( -\frac{\sqrt{3}|x|}{2\sigma} \right)   (8.83)
The shapes of these four distributions, assuming a mean of zero and a variance of one, are
shown in Figure 8.5.
One way of obtaining the estimate of the distribution of a particular source is to divide
the range of outputs into “bins” or intervals I_k. We can then find the number of values n_k
that fall into each interval. A plot of n_k/n_T, where n_T is the total number of source outputs
being considered, should give us some idea of what the input distribution looks like. Be
aware that this is a rather crude method and can at times be misleading. For example, if we
were not careful in our selection of the source output, we might end up modeling some local
peculiarities of the source. If the bins are too large, we might effectively filter out some
important properties of the source. If the bin sizes are too small, we may miss out on some
of the gross behavior of the source.
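A minimal sketch of this binning procedure (not from the text); the Laplacian-like test source, sample count, bin count, and range are all arbitrary choices.

```python
import random

def estimate_pdf(samples, num_bins, lo, hi):
    """Crude estimate of the source distribution: relative counts n_k / n_T over equal-width bins I_k."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in samples:
        k = int((x - lo) / width)
        if 0 <= k < num_bins:
            counts[k] += 1
    n_total = len(samples)
    return [c / n_total for c in counts]

# A unit-variance Laplacian source, simulated as the difference of two exponentials.
random.seed(0)
samples = [random.expovariate(2 ** 0.5) - random.expovariate(2 ** 0.5) for _ in range(10000)]
print(estimate_pdf(samples, num_bins=10, lo=-5.0, hi=5.0))
```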
FIGURE 8.5 Uniform, Gaussian, Laplacian, and Gamma distributions (zero mean, unit variance).
Once we have decided on some candidate distributions, we can select between them
using a number of sophisticated tests. These tests are beyond the scope of this book but are
described in [103].
Many of the sources that we deal with when we design lossy compression schemes have
a great deal of structure in the form of sample-to-sample dependencies. The probability
models described here capture none of these dependencies. Fortunately, we have a lot of
models that can capture most of this structure. We describe some of these models in the
next section.
8.6.2 Linear System Models
A large class of processes can be modeled in the form of the following difference equation:

x_n = \sum_{i=1}^{N} a_i x_{n-i} + \sum_{j=1}^{M} b_j \epsilon_{n-j} + \epsilon_n   (8.84)

where {x_n} are samples of the process we wish to model, and {ε_n} is a white noise sequence.
We will assume throughout this book that we are dealing with real valued samples. Recall that
a zero-mean wide-sense-stationary white noise sequence {ε_n} is a sequence with autocorrelation
function

R_{\epsilon\epsilon}(k) = \begin{cases} \sigma_\epsilon^2 & \text{for } k = 0 \\ 0 & \text{otherwise.} \end{cases}   (8.85)
In digital signal-processing terminology, Equation (8.84) represents the output of a linear
discrete time invariant filter with N poles and M zeros. In the statistical literature, this model
is called an autoregressive moving average model of order (N,M), or an ARMA (N,M)
model. The autoregressive label is because of the first summation in Equation (8.84), while
the second summation gives us the moving average portion of the name.
If all the b_j were zero in Equation (8.84), only the autoregressive part of the ARMA
model would remain:

x_n = \sum_{i=1}^{N} a_i x_{n-i} + \epsilon_n   (8.86)
This model is called an Nth-order autoregressive model and is denoted by AR(N). In digital
signal-processing terminology, this is an all pole filter. The AR(N) model is the most popular
of all the linear models, especially in speech compression, where it arises as a natural
consequence of the speech production model. We will look at it a bit more closely.
First notice that for the AR(N) process, knowing all the past history of the process gives
no more information than knowing the last N samples of the process; that is,

P(x_n|x_{n-1}, x_{n-2}, \ldots) = P(x_n|x_{n-1}, x_{n-2}, \ldots, x_{n-N})   (8.87)
which means that the AR(N) process is a Markov model of order N.
The autocorrelation function of a process can tell us a lot about the sample-to-sample
behavior of a sequence. A slowly decaying autocorrelation function indicates a high sample-
to-sample correlation, while a fast decaying autocorrelation denotes low sample-to-sample
correlation. In the case of no sample-to-sample correlation, such as white noise, the auto-
correlation function is zero for lags greater than zero, as seen in Equation (8.85). The
autocorrelation function for the AR(N) process can be obtained as follows:
R_{xx}(k) = E[x_n x_{n-k}]   (8.88)

         = E\left[ \left( \sum_{i=1}^{N} a_i x_{n-i} + \epsilon_n \right) x_{n-k} \right]   (8.89)

         = E\left[ \sum_{i=1}^{N} a_i x_{n-i} x_{n-k} \right] + E[\epsilon_n x_{n-k}]   (8.90)

         = \begin{cases} \sum_{i=1}^{N} a_i R_{xx}(k-i) & \text{for } k > 0 \\ \sum_{i=1}^{N} a_i R_{xx}(i) + \sigma_\epsilon^2 & \text{for } k = 0. \end{cases}   (8.91)
Example 8.6.1:
Suppose we have an AR(3) process. Let us write out the equations for the autocorrelation
coefficient for lags 1, 2, 3:

R_{xx}(1) = a_1 R_{xx}(0) + a_2 R_{xx}(1) + a_3 R_{xx}(2)
R_{xx}(2) = a_1 R_{xx}(1) + a_2 R_{xx}(0) + a_3 R_{xx}(1)
R_{xx}(3) = a_1 R_{xx}(2) + a_2 R_{xx}(1) + a_3 R_{xx}(0)

If we know the values of the autocorrelation function R_{xx}(k), for k = 0, 1, 2, 3, we can use
this set of equations to find the AR(3) coefficients {a_1, a_2, a_3}. On the other hand, if we
know the model coefficients and σ_ε², we can use the above equations along with the equation
for R_{xx}(0) to find the first four autocorrelation coefficients. All the other autocorrelation
values can be obtained by using Equation (8.91).
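As a sketch of how these equations might be solved in practice (not from the text), the snippet below sets up the same linear system for an AR(3) model and solves it for the coefficients; the autocorrelation values are made-up, illustrative numbers.

```python
import numpy as np

def ar_coefficients(R):
    """Solve the lag-1..N equations above for the AR(N) coefficients, given R_xx(0..N)."""
    N = len(R) - 1
    # Row k encodes R_xx(k) = sum_i a_i R_xx(k - i), using R_xx(-m) = R_xx(m).
    A = np.array([[R[abs(k - i)] for i in range(1, N + 1)] for k in range(1, N + 1)])
    b = np.array(R[1:N + 1])
    return np.linalg.solve(A, b)

# Hypothetical autocorrelation values R_xx(0), ..., R_xx(3) of an AR(3) process.
R = [1.0, 0.8, 0.55, 0.35]
print(ar_coefficients(R))   # the fitted a_1, a_2, a_3
```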
To see how the autocorrelation function is related to the temporal behavior of the
sequence, let us look at the behavior of a simple AR(1) source.
Example 8.6.2:
An AR(1) source is defined by the equation

x_n = a_1 x_{n-1} + \epsilon_n   (8.92)

The autocorrelation function for this source (see Problem 8) is given by

R_{xx}(k) = \frac{1}{1 - a_1^2} a_1^k \sigma_\epsilon^2   (8.93)
From this we can see that the autocorrelation will decay more slowly for larger values
of a1. Remember that the value of a1 in this case is an indicator of how closely the current
FIGURE 8.6 Autocorrelation function R(k) of an AR(1) process for a_1 = 0.6 and a_1 = 0.99.
sample is related to the previous sample. The autocorrelation function is plotted for two
values of a1 in Figure 8.6. Notice that for a1 close to 1, the autocorrelation function decays
extremely slowly. As the value of a1 moves farther away from 1, the autocorrelation function
decays much faster.
Sample waveforms for a_1 = 0.99 and a_1 = 0.6 are shown in Figures 8.7 and 8.8. Notice
the slower variations in the waveform for the process with a higher value of a_1. Because
FIGURE 8.7 Sample function of an AR(1) process with a_1 = 0.99.
FIGURE 8.8 Sample function of an AR(1) process with a_1 = 0.6.
the waveform in Figure 8.7 varies more slowly than the waveform in Figure 8.8, samples of
this waveform are much more likely to be close in value than the samples of the waveform
of Figure 8.8.
Let’s look at what happens when the AR(1) coefficient is negative. The sample wave-
forms are plotted in Figures 8.9 and 8.10. The sample-to-sample variation in these waveforms
FIGURE 8.9 Sample function of an AR(1) process with a_1 = −0.99.
FIGURE 8.10 Sample function of an AR(1) process with a_1 = −0.6.
FIGURE 8.11 Autocorrelation function R(k) of an AR(1) process with two negative values of a_1 (a_1 = −0.99 and a_1 = −0.6).
is much higher than in the waveforms shown in Figures 8.7 and 8.8. However, if we were
to look at the variation in magnitude, we can see that the higher value of a1 results in
magnitude values that are closer together.
This behavior is also reflected in the autocorrelation function, shown in Figure 8.11, as
we might expect from looking at Equation (8.93).
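A small simulation sketch of this behavior (not from the text): it generates AR(1) sample functions for the same four values of a_1 and compares the biased sample autocorrelation with Equation (8.93), assuming unit-variance Gaussian white noise.

```python
import random

def ar1_samples(a1, n, noise_std=1.0, seed=0):
    """Generate n samples of x_n = a1 * x_{n-1} + eps_n with Gaussian white noise eps_n."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = a1 * x + rng.gauss(0.0, noise_std)
        out.append(x)
    return out

def sample_autocorrelation(x, k):
    """Biased sample estimate of R_xx(k)."""
    return sum(x[i] * x[i - k] for i in range(k, len(x))) / len(x)

for a1 in (0.99, 0.6, -0.99, -0.6):
    x = ar1_samples(a1, 20000)
    estimate = [sample_autocorrelation(x, k) for k in range(4)]
    theory = [a1 ** k / (1 - a1 ** 2) for k in range(4)]     # Equation (8.93) with sigma_eps^2 = 1
    print(a1, [round(v, 2) for v in estimate], [round(v, 2) for v in theory])
```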
In Equation (8.84), instead of setting all the b_j coefficients to zero, if we set all the
a_i coefficients to zero, we are left with the moving average part of the ARMA process:

x_n = \sum_{j=1}^{M} b_j \epsilon_{n-j} + \epsilon_n   (8.94)

This process is called an Mth-order moving average process. This is a weighted average of
the current and M past samples. Because of the form of this process, it is most useful when
modeling slowly varying processes.
8.6.3 Physical Models
Physical models are based on the physics of the source output production. The physics are
generally complicated and not amenable to a reasonable mathematical approximation. An
exception to this rule is speech generation.
Speech Production
There has been a significant amount of research conducted in the area of speech production
[104], and volumes have been written about it. We will try to summarize some of the
pertinent aspects in this section.
Speech is produced by forcing air first through an elastic opening, the vocal cords, and
then through cylindrical tubes with nonuniform diameter (the laryngeal, oral, nasal, and
pharynx passages), and finally through cavities with changing boundaries such as the mouth
and the nasal cavity. Everything past the vocal cords is generally referred to as the vocal
tract. The first action generates the sound, which is then modulated into speech as it traverses
through the vocal tract.
We will often be talking about filters in the coming chapters. We will try to describe
filters more precisely at that time. For our purposes at present, a filter is a system that has
an input and an output, and a rule for converting the input to the output, which we will call
the transfer function. If we think of speech as the output of a filter, the sound generated by
the air rushing past the vocal cords can be viewed as the input, while the rule for converting
the input to the output is governed by the shape and physics of the vocal tract.
The output depends on the input and the transfer function. Let’s look at each in turn.
There are several different forms of input that can be generated by different conformations
of the vocal cords and the associated cartilages. If the vocal cords are stretched shut and we
force air through, the vocal cords vibrate, providing a periodic input. If a small aperture is left
open, the input resembles white noise. By opening an aperture at different locations along
the vocal cords, we can produce a white-noise–like input with certain dominant frequencies
that depend on the location of the opening. The vocal tract can be modeled as a series of
tubes of unequal diameter. If we now examine how an acoustic wave travels through this
series of tubes, we find that the mathematical model that best describes this process is an
autoregressive model. We will often encounter the autoregressive model when we discuss
speech compression algorithms.
8.7 Summary
In this chapter we have looked at a variety of topics that will be useful to us when we
study various lossy compression techniques, including distortion and its measurement, some
new concepts from information theory, average mutual information and its connection to
the rate of a compression scheme, and the rate distortion function. We have also briefly
looked at some of the properties of the human visual system and the auditory system—
most importantly, visual and auditory masking. The masking phenomena allow us to incur
distortion in such a way that the distortion is not perceptible to the human observer. We also
presented a model for speech production.
Further Reading
There are a number of excellent books available that delve more deeply in the area of
information theory:
1. Information Theory, by R.B. Ash [15].
2. Information Transmission, by R.M. Fano [16].
3. Information Theory and Reliable Communication, by R.G. Gallagher [11].
4. Entropy and Information Theory, by R.M. Gray [17].
5. Elements of Information Theory, by T.M. Cover and J.A. Thomas [3].
6. The Theory of Information and Coding, by R.J. McEliece [6].
The subject of rate distortion theory is discussed in very clear terms in Rate Distortion
Theory, by T. Berger [4].
For an introduction to the concepts behind speech perception, see Voice and Speech
Processing, by T. Parsons [105].
8.8 Projects and Problems
1. Although SNR is a widely used measure of distortion, it often does not correlate with
perceptual quality. In order to see this we conduct the following experiment. Using
one of the images provided, generate two “reconstructed” images. For one of the
reconstructions add a value of 10 to each pixel. For the other reconstruction, randomly
add either +10 or −10 to each pixel.
(a) What is the SNR for each of the reconstructions? Do the relative values reflect
the difference in the perceptual quality?
(b) Devise a mathematical measure that will better reflect the difference in perceptual
quality for this particular case.
2. Consider the following lossy compression scheme for binary sequences. We divide
the binary sequence into blocks of size M. For each block we count the number
8.8 Projects and Problems 225
of 0s. If this number is greater than or equal to M/2, we send a 0; otherwise, we
send a 1.
(a) If the sequence is random with P(0) = 0.8, compute the rate and distortion
(use Equation (8.54)) for M = 1, 2, 4, 8, 16. Compare your results with the rate
distortion function for binary sources.
(b) Repeat assuming that the output of the encoder is encoded at a rate equal to the
entropy of the output.
3. Write a program to implement the compression scheme described in the previous
problem.
(a) Generate a random binary sequence with P(0) = 0.8, and compare your simula-
tion results with the analytical results.
(b) Generate a binary first-order Markov sequence with P(0|0) = 0.9, and P(1|1) =
0.9. Encode it using your program. Discuss and comment on your results.
4. Show that

H(X_d|Y_d) = -\sum_{j=-\infty}^{\infty} \sum_{i=-\infty}^{\infty} f_{X|Y}(x_i|y_j) f_Y(y_j) \Delta^2 \log f_{X|Y}(x_i|y_j) - \log \Delta   (8.95)
5. For two random variables X and Y, show that

H(X|Y) \le H(X)

with equality if X is independent of Y.
Hint: E[\log f(x)] \le \log E[f(x)] (Jensen’s inequality).
6. Given two random variables X and Y, show that I(X;Y) = I(Y;X).
7. For a binary source with P(0) = p, P(X = 0|Y = 1) = P(X = 1|Y = 0) = D, and
distortion measure

d(x_i, y_j) = x_i \oplus y_j

show that

I(X;Y) = H_b(p) - H_b(D)   (8.96)
8. Find the autocorrelation function in terms of the model coefficients and σ_ε² for
(a) an AR(1) process,
(b) an MA(1) process, and
(c) an AR(2) process.
9
Scalar Quantization
9.1 Overview
In this chapter we begin our study of quantization, one of the simplest and most
general ideas in lossy compression. We will look at scalar quantization in this
chapter and continue with vector quantization in the next chapter. First, the
general quantization problem is stated, then various solutions are examined,
starting with the simpler solutions, which require the most assumptions, and
proceeding to more complex solutions that require fewer assumptions. We describe uniform
quantization with fixed-length codewords, first assuming a uniform source, then a source
with a known probability density function (pdf) that is not necessarily uniform, and finally
a source with unknown or changing statistics. We then look at pdf-optimized nonuniform
quantization, followed by companded quantization. Finally, we return to the more general
statement of the quantizer design problem and study entropy-coded quantization.
9.2 Introduction
In many lossy compression applications we are required to represent each source output
using one of a small number of codewords. The number of possible distinct source output
values is generally much larger than the number of codewords available to represent them.
The process of representing a large—possibly infinite—set of values with a much smaller
set is called quantization.
Consider a source that generates numbers between −10.0 and 10.0. A simple quantization
scheme would be to represent each output of the source with the integer value closest to it.
(If the source output is equally close to two integers, we will randomly pick one of them.)
For example, if the source output is 2.47, we would represent it as 2, and if the source output
is 3.1415926, we would represent it as 3.
This approach reduces the size of the alphabet required to represent the source output; the
infinite number of values between −10.0 and 10.0 are represented with a set that contains
only 21 values (−10, −9, ..., 10). At the same time we have also forever lost the
original value of the source output. If we are told that the reconstruction value is 3, we
cannot tell whether the source output was 2.95, 3.16, 3.057932, or any other of an infinite
set of values. In other words, we have lost some information. This loss of information is the
reason for the use of the word “lossy” in many lossy compression schemes.
The set of inputs and outputs of a quantizer can be scalars or vectors. If they are
scalars, we call the quantizers scalar quantizers. If they are vectors, we call the quantizers
vector quantizers. We will study scalar quantizers in this chapter and vector quantizers in
Chapter 10.
9.3 The Quantization Problem
Quantization is a very simple process. However, the design of the quantizer has a significant
impact on the amount of compression obtained and loss incurred in a lossy compression
scheme. Therefore, we will devote a lot of attention to issues related to the design of
quantizers.
In practice, the quantizer consists of two mappings: an encoder mapping and a decoder
mapping. The encoder divides the range of values that the source generates into a number of
intervals. Each interval is represented by a distinct codeword. The encoder represents all the
source outputs that fall into a particular interval by the codeword representing that interval.
As there could be many—possibly infinitely many—distinct sample values that can fall in
any given interval, the encoder mapping is irreversible. Knowing the code only tells us the
interval to which the sample value belongs. It does not tell us which of the many values
in the interval is the actual sample value. When the sample value comes from an analog
source, the encoder is called an analog-to-digital (A/D) converter.
The encoder mapping for a quantizer with eight reconstruction values is shown in
Figure 9.1. For this encoder, all samples with values between −1 and 0 would be assigned
the code 011. All values between 0 and 1.0 would be assigned the codeword 100, and so on.
On the two boundaries, all inputs with values greater than 3.0 would be assigned the code 111,
and all inputs with values less than −3.0 would be assigned the code 000. Thus, any input
FIGURE 9.1 Mapping for a 3-bit encoder: decision boundaries at −3.0, −2.0, −1.0, 0, 1.0, 2.0, 3.0, with the codes 000 through 111 assigned to the intervals from most negative to most positive.
Input Codes    Output
000            −3.5
001            −2.5
010            −1.5
011            −0.5
100             0.5
101             1.5
110             2.5
111             3.5

FIGURE 9.2 Mapping for a 3-bit D/A converter.
that we receive will be assigned a codeword depending on the interval in which it falls. As
we are using 3 bits to represent each value, we refer to this quantizer as a 3-bit quantizer.
For every codeword generated by the encoder, the decoder generates a reconstruction
value. Because a codeword represents an entire interval, and there is no way of knowing
which value in the interval was actually generated by the source, the decoder puts out a
value that, in some sense, best represents all the values in the interval. Later, we will see
how to use information we may have about the distribution of the input in the interval to
obtain a representative value. For now, we simply use the midpoint of the interval as the
representative value generated by the decoder. If the reconstruction is analog, the decoder is
often referred to as a digital-to-analog (D/A) converter. A decoder mapping corresponding
to the 3-bit encoder shown in Figure 9.1 is shown in Figure 9.2.
Example 9.3.1:
Suppose a sinusoid 4 cos(2πt) was sampled every 0.05 second. The sample was digitized
using the A/D mapping shown in Figure 9.1 and reconstructed using the D/A mapping
shown in Figure 9.2. The first few inputs, codewords, and reconstruction values are given
in Table 9.1. Notice the first two samples in Table 9.1. Although the two input values
are distinct, they both fall into the same interval in the quantizer. The encoder, therefore,
represents both inputs with the same codeword, which in turn leads to identical reconstruction
values.

TABLE 9.1 Digitizing a sine wave.

t      4 cos(2πt)   A/D Output   D/A Output   Error
0.05   3.804        111          3.5           0.304
0.10   3.236        111          3.5          −0.264
0.15   2.351        110          2.5          −0.149
0.20   1.236        101          1.5          −0.264
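The mappings of Figures 9.1 and 9.2 are easy to simulate. The sketch below (not from the text) reproduces Table 9.1; the clamping of inputs beyond ±3.0 to the outer codewords follows the description of the encoder given earlier.

```python
import math

def adc_3bit(x):
    """Encoder of Figure 9.1: step size 1, decision boundaries at -3.0, -2.0, ..., 3.0."""
    interval = int(math.floor(x)) + 4          # interval index; 0 corresponds to values below -3.0
    interval = max(0, min(7, interval))        # outer intervals absorb values beyond +-3.0
    return format(interval, '03b')

def dac_3bit(code):
    """Decoder of Figure 9.2: output the midpoint associated with the codeword."""
    return int(code, 2) - 4 + 0.5

for t in (0.05, 0.10, 0.15, 0.20):
    x = 4.0 * math.cos(2.0 * math.pi * t)
    code = adc_3bit(x)
    y = dac_3bit(code)
    print(t, round(x, 3), code, y, round(x - y, 3))   # matches Table 9.1
```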
Construction of the intervals (their location, etc.) can be viewed as part of the design of
the encoder. Selection of reconstruction values is part of the design of the decoder. However,
the fidelity of the reconstruction depends on both the intervals and the reconstruction values.
Therefore, when designing or analyzing encoders and decoders, it is reasonable to view them
as a pair. We call this encoder-decoder pair a quantizer. The quantizer mapping for the 3-bit
encoder-decoder pair shown in Figures 9.1 and 9.2 can be represented by the input-output
map shown in Figure 9.3. The quantizer accepts sample values, and depending on the interval
in which the sample values fall, it provides an output codeword and a representation value.
Using the map of Figure 9.3, we can see that an input to the quantizer of 1.7 will result in
an output of 1.5, and an input of −0.3 will result in an output of −0.5.
From Figures 9.1–9.3 we can see that we need to know how to divide the input range
into intervals, assign binary codes to these intervals, and find representation or output values
for these intervals in order to specify a quantizer. We need to do all of this while satisfying
distortion and rate criteria. In this chapter we will define distortion to be the average squared
difference between the quantizer input and output. We call this the mean squared quantization
error (msqe) and denote it by σ_q². The rate of the quantizer is the average number of bits
FIGURE 9.3 Quantizer input-output map.
required to represent a single quantizer output. We would like to get the lowest distortion
for a given rate, or the lowest rate for a given distortion.
Let us pose the design problem in precise terms. Suppose we have an input modeled by
a random variable X with pdf f_X(x). If we wished to quantize this source using a quan-
tizer with M intervals, we would have to specify M + 1 endpoints for the intervals, and a
representative value for each of the M intervals. The endpoints of the intervals are known
as decision boundaries, while the representative values are called reconstruction levels. We
will often model discrete sources with continuous distributions. For example, the difference
between neighboring pixels is often modeled using a Laplacian distribution even though the
differences can only take on a limited number of discrete values. Discrete processes are
modeled with continuous distributions because it can simplify the design process consider-
ably, and the resulting designs perform well in spite of the incorrect assumption. Several
of the continuous distributions used to model source outputs are unbounded—that is, the
range of values is infinite. In these cases, the first and last endpoints are generally chosen
to be ±∞.
Let us denote the decision boundaries by \{b_i\}_{i=0}^{M}, the reconstruction levels by \{y_i\}_{i=1}^{M},
and the quantization operation by Q(·). Then

Q(x) = y_i \quad \text{iff} \quad b_{i-1} < x \le b_i   (9.1)
The mean squared quantization error is then given by

\sigma_q^2 = \int_{-\infty}^{\infty} (x - Q(x))^2 f_X(x) dx   (9.2)

          = \sum_{i=1}^{M} \int_{b_{i-1}}^{b_i} (x - y_i)^2 f_X(x) dx   (9.3)
The difference between the quantizer input x and output y = Q(x), besides being referred
to as the quantization error, is also called the quantizer distortion or quantization noise. But
the word “noise” is somewhat of a misnomer. Generally, when we talk about noise we mean
a process external to the source process. Because of the manner in which the quantization
error is generated, it is dependent on the source process and, therefore, cannot be regarded as
external to the source process. One reason for the use of the word “noise” in this context is
that from time to time we will find it useful to model the quantization process as an additive
noise process as shown in Figure 9.4.
If we use fixed-length codewords to represent the quantizer output, then the size of the
output alphabet immediately specifies the rate. If the number of quantizer outputs is M, then
the rate is given by

R = \lceil \log_2 M \rceil   (9.4)

For example, if M = 8, then R = 3. In this case, we can pose the quantizer design problem
as follows:
Given an input pdf f_X(x) and the number of levels M in the quantizer, find the
decision boundaries \{b_i\} and the reconstruction levels \{y_i\} so as to minimize the
mean squared quantization error given by Equation (9.3).
FIGURE 9.4 Additive noise model of a quantizer: the quantizer output is the quantizer input plus the quantization noise.
TABLE 9.2 Codeword assignment for an eight-level quantizer.

y_1   1110
y_2   1100
y_3   100
y_4   00
y_5   01
y_6   101
y_7   1101
y_8   1111
However, if we are allowed to use variable-length codes, such as Huffman codes or
arithmetic codes, along with the size of the alphabet, the selection of the decision boundaries
will also affect the rate of the quantizer. Consider the codeword assignment for the output
of an eight-level quantizer shown in Table 9.2.
According to this codeword assignment, if the output y4 occurs, we use 2 bits to encode
it, while if the output y1 occurs, we need 4 bits to encode it. Obviously, the rate will depend
on how often we have to encode y4 versus how often we have to encode y1. In other words,
the rate will depend on the probability of occurrence of the outputs. If l_i is the length of the
codeword corresponding to the output y_i, and P(y_i) is the probability of occurrence of y_i,
then the rate is given by

R = \sum_{i=1}^{M} l_i P(y_i)    (9.5)
However, the probabilities P(y_i) depend on the decision boundaries b_i. For example,
the probability of y_i occurring is given by

P(y_i) = \int_{b_{i-1}}^{b_i} f_X(x) \, dx
Therefore, the rate R is a function of the decision boundaries and is given by the expression
R = \sum_{i=1}^{M} l_i \int_{b_{i-1}}^{b_i} f_X(x) \, dx    (9.6)
From this discussion and Equations (9.3) and (9.6), we see that for a given source
input, the partitions we select and the representation for those partitions will determine the
distortion incurred during the quantization process. The partitions we select and the binary
codes for the partitions will determine the rate for the quantizer. Thus, the problems of
finding the optimum partitions, codes, and representation levels are all linked. In light of
this information, we can restate our problem statement:
Given a distortion constraint

\sigma_q^2 \le D^*    (9.7)
find the decision boundaries, reconstruction levels, and binary codes that mini-
mize the rate given by Equation (9.6), while satisfying Equation (9.7).
Or, given a rate constraint

R \le R^*    (9.8)
find the decision boundaries, reconstruction levels, and binary codes that mini-
mize the distortion given by Equation (9.3), while satisfying Equation (9.8).
This problem statement of quantizer design, while more general than our initial statement,
is substantially more complex. Fortunately, in practice there are situations in which we can
simplify the problem. We often use fixed-length codewords to encode the quantizer output.
In this case, the rate is simply the number of bits used to encode each output, and we can
use our initial statement of the quantizer design problem. We start our study of quantizer
design by looking at this simpler version of the problem, and later use what we have learned
in this process to attack the more complex version.
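To make the two performance measures concrete, the short sketch below numerically evaluates the msqe of Equation (9.3) and the rate of Equation (9.6) for a quantizer specified by its decision boundaries, reconstruction levels, and codeword lengths. It is only an illustration: the Laplacian pdf, the particular boundaries and levels, the truncation of the outer intervals at ±8, and the codeword lengths are all assumptions chosen for the example, not values from the text.

import numpy as np

def msqe_and_rate(b, y, lengths, pdf, grid=100_000):
    # Numerically evaluate Eq. (9.3) (msqe) and Eq. (9.6) (rate) by the midpoint rule.
    # b: decision boundaries b_0..b_M, y: reconstruction levels y_1..y_M,
    # lengths: codeword length l_i for each level, pdf: callable f_X(x).
    msqe, rate = 0.0, 0.0
    for i in range(len(y)):
        lo, hi = b[i], b[i + 1]
        dx = (hi - lo) / grid
        x = lo + (np.arange(grid) + 0.5) * dx
        fx = pdf(x)
        msqe += np.sum((x - y[i]) ** 2 * fx) * dx   # contribution of interval i to Eq. (9.3)
        rate += lengths[i] * np.sum(fx) * dx        # l_i * P(y_i), as in Eq. (9.6)
    return msqe, rate

# Assumed 4-level example for a unit-variance Laplacian source,
# f_X(x) = (1/sqrt(2)) exp(-sqrt(2)|x|); the outer boundaries truncate the tails at +/-8.
lap = lambda x: np.exp(-np.sqrt(2) * np.abs(x)) / np.sqrt(2)
b = [-8.0, -1.0, 0.0, 1.0, 8.0]
y = [-1.8, -0.4, 0.4, 1.8]
lengths = [3, 1, 2, 3]          # a hypothetical variable-length code
print(msqe_and_rate(b, y, lengths, lap))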
9 . 4 Uniform Quantizer
The simplest type of quantizer is the uniform quantizer. All intervals are the same size in the
uniform quantizer, except possibly for the two outer intervals. In other words, the decision
boundaries are spaced evenly. The reconstruction values are also spaced evenly, with the
same spacing as the decision boundaries; in the inner intervals, they are the midpoints of
the intervals. This constant spacing is usually referred to as the step size and is denoted by
Δ. The quantizer shown in Figure 9.3 is a uniform quantizer with Δ = 1. It does not have
zero as one of its representation levels. Such a quantizer is called a midrise quantizer. An
alternative uniform quantizer could be the one shown in Figure 9.5. This is called a midtread
quantizer. As the midtread quantizer has zero as one of its output levels, it is especially
useful in situations where it is important that the zero value be represented—for example,
F I G U R E 9. 5 A midtread quantizer (output levels 0, ±1.0, ±2.0, ±3.0, with decision boundaries at ±0.5, ±1.5, ±2.5, ±3.5).
control systems in which it is important to represent a zero value accurately, and audio
coding schemes in which we need to represent silence periods. Notice that the midtread
quantizer has only seven intervals or levels. That means that if we were using a fixed-length
3-bit code, we would have one codeword left over.
Usually, we use a midrise quantizer if the number of levels is even and a midtread
quantizer if the number of levels is odd. For the remainder of this chapter, unless we
specifically mention otherwise, we will assume that we are dealing with midrise quantizers.
We will also generally assume that the input distribution is symmetric around the origin
and the quantizer is also symmetric. (The optimal minimum mean squared error quantizer
for a symmetric distribution need not be symmetric [106].) Given all these assumptions, the
design of a uniform quantizer consists of finding the step size that minimizes the distortion
for a given input process and number of decision levels.
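As a small illustration of the midrise/midtread distinction, here is a minimal sketch of both quantizers for a given step size Δ. The function names, the clipping of the outer intervals, and the test values are my own choices for the example, not notation from the text.

import numpy as np

def midrise(x, delta, M):
    # Uniform midrise quantizer: M even, output levels at odd multiples of delta/2.
    k = np.floor(x / delta)                   # index of the interval containing x
    k = np.clip(k, -M // 2, M // 2 - 1)       # keep the index inside the M levels
    return (k + 0.5) * delta

def midtread(x, delta, M):
    # Uniform midtread quantizer: M odd, zero is one of the output levels.
    k = np.round(x / delta)
    k = np.clip(k, -(M - 1) // 2, (M - 1) // 2)
    return k * delta

x = np.array([-1.2, -0.2, 0.0, 0.4, 2.7])
print(midrise(x, 1.0, 8))     # levels ..., -0.5, 0.5, 1.5, ...
print(midtread(x, 1.0, 7))    # levels ..., -1.0, 0.0, 1.0, ...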
U n i f o r m Q u a n t i z a t i o n o f a U n i f o r m l y D i s t r i b u t e d S o u r c e
We start our study of quantizer design with the simplest of all cases: design of a uniform
quantizer for a uniformly distributed source. Suppose we want to design an M-level uniform
quantizer for an input that is uniformly distributed in the interval [−X_max, X_max]. This means
we need to divide the [−X_max, X_max] interval into M equally sized intervals. In this case, the
step size Δ is given by

\Delta = \frac{2 X_{max}}{M}    (9.9)
The distortion in this case becomes

\sigma_q^2 = 2 \sum_{i=1}^{M/2} \int_{(i-1)\Delta}^{i\Delta} \left( x - \frac{2i-1}{2}\Delta \right)^2 \frac{1}{2X_{max}} \, dx    (9.10)

If we evaluate this integral (after some suffering), we find that the msqe is Δ²/12.
The same result can be more easily obtained if we examine the behavior of the quantization
error q given by

q = x - Q(x)    (9.11)

In Figure 9.6 we plot the quantization error versus the input signal for an eight-level
uniform quantizer, with an input that lies in the interval [−X_max, X_max]. Notice that the
quantization error lies in the interval [−Δ/2, Δ/2]. As the input is uniform, it is not difficult to
establish that the quantization error is also uniform over this interval. Thus, the mean squared
quantization error is the second moment of a random variable uniformly distributed in the
interval [−Δ/2, Δ/2]:

\sigma_q^2 = \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} q^2 \, dq    (9.12)

           = \frac{\Delta^2}{12}    (9.13)
Let us also calculate the signal-to-noise ratio for this case. The signal variance σ_s² for
a uniform random variable, which takes on values in the interval [−X_max, X_max], is (2X_max)²/12.
F I G U R E 9. 6 Quantization error x − Q(x) for a uniform midrise quantizer with a uniformly distributed input; the error oscillates between −Δ/2 and Δ/2 over the input range [−X_max, X_max].
The value of the step size Δ is related to X_max and the number of levels M by

\Delta = \frac{2 X_{max}}{M}

For the case where we use a fixed-length code, with each codeword being made up of n
bits, the number of codewords or the number of reconstruction levels M is 2^n. Combining
all this, we have

SNR (dB) = 10 \log_{10} \left( \frac{\sigma_s^2}{\sigma_q^2} \right)    (9.14)

         = 10 \log_{10} \left( \frac{(2X_{max})^2}{12} \cdot \frac{12}{\Delta^2} \right)    (9.15)

         = 10 \log_{10} \left( \frac{(2X_{max})^2}{12} \cdot \frac{12}{\left( \frac{2X_{max}}{M} \right)^2} \right)    (9.16)

         = 10 \log_{10}(M^2) = 20 \log_{10}(2^n) = 6.02 n \text{ dB}    (9.17)
This equation says that for every additional bit in the quantizer, we get an increase in the
signal-to-noise ratio of 6.02 dB. This is a well-known result and is often used to get an
indication of the maximum gain available if we increase the rate. However, remember that
we obtained this result under some assumptions about the input. If the assumptions are not
true, this result will not hold true either.
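The 6.02n dB result is easy to check by simulation. The sketch below (my own construction, with an assumed uniform input on [−1, 1]) measures the SNR of an n-bit uniform midrise quantizer and compares it with 6.02n.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 1_000_000)         # uniformly distributed input, Xmax = 1

for n in range(1, 9):
    M = 2 ** n
    delta = 2.0 / M                            # Eq. (9.9): 2 * Xmax / M
    k = np.clip(np.floor(x / delta), -M // 2, M // 2 - 1)
    q = (k + 0.5) * delta                      # midrise reconstruction
    snr = 10 * np.log10(np.mean(x ** 2) / np.mean((x - q) ** 2))
    print(f"n = {n}: measured SNR = {snr:5.2f} dB, 6.02n = {6.02 * n:5.2f} dB")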
E x a m p l e 9 . 4 . 1: I m a g e c o m p r e s s i o n
A probability model for the variations of pixels in an image is almost impossible to obtain
because of the great variety of images available. A common approach is to declare the pixel
values to be uniformly distributed between 0 and 2^b − 1, where b is the number of bits per
pixel. For most of the images we deal with, the number of bits per pixel is 8; therefore, the
pixel values would be assumed to vary uniformly between 0 and 255. Let us quantize our
test image Sena using a uniform quantizer.
If we wanted to use only 1 bit per pixel, we would divide the range [0, 255] into two
intervals, [0, 127] and [128, 255]. The first interval would be represented by the value 64,
the midpoint of the first interval; the pixels in the second interval would be represented by
the pixel value 192, the midpoint of the second interval. In other words, the boundary values
are {0, 128, 255}, while the reconstruction values are {64, 192}. The quantized image is
shown in Figure 9.7. As expected, almost all the details in the image have disappeared. If we
were to use a 2-bit quantizer, with boundary values {0, 64, 128, 192, 255} and reconstruction
levels {32, 96, 160, 224}, we get considerably more detail. The level of detail increases as
the number of bits increases until at 6 bits per pixel, the reconstructed image is indistinguishable
from the original, at least to a casual observer. The 1-, 2-, and 3-bit images are shown in
Figure 9.7.
F I G U R E 9. 7 Top left: original Sena image; top right: 1 bit/pixel image; bottom
left: 2 bits/pixel; bottom right: 3 bits/pixel.
Looking at the lower-rate images, we notice a couple of things. First, the lower-rate
images are darker than the original, and the lowest-rate reconstructions are the darkest. The
reason for this is that the quantization process usually results in scaling down of the dynamic
range of the input. For example, in the 1-bit-per-pixel reproduction, the highest pixel value
is 192, as opposed to 255 for the original image. As higher gray values represent lighter
shades, there is a corresponding darkening of the reconstruction. The other thing to notice
in the low-rate reconstruction is that wherever there were smooth changes in gray values
there are now abrupt transitions. This is especially evident in the face and neck area, where
gradual shading has been transformed to blotchy regions of constant values. This is because
a range of values is being mapped to the same value, as was the case for the first two samples
of the sinusoid in Example 9.3.1. For obvious reasons, this effect is called contouring. The
perceptual effect of contouring can be reduced by a procedure called dithering [107].
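A uniform pixel quantizer of the kind used on the Sena image takes only a few lines. The sketch below is an illustration, not the code used to generate Figure 9.7; the small test array stands in for an 8-bit grayscale image, and the reconstruction value is taken as the (integer) midpoint of each interval.

import numpy as np

def quantize_pixels(img, k):
    # Uniformly quantize 8-bit pixel values to 2**k levels; each interval of
    # width 256 / 2**k is represented by (approximately) its midpoint.
    step = 256 // (2 ** k)
    return (img // step) * step + step // 2

img = np.array([[0, 100, 150, 255]])      # assumed stand-in for an 8-bit image
print(quantize_pixels(img, 1))            # 1 bit/pixel: [0,127] -> 64, [128,255] -> 192
print(quantize_pixels(img, 3))            # 3 bits/pixel: 8 levels of width 32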
U n i f o r m Q u a n t i z a t i o n o f N o n u n i f o r m S o u r c e s
Quite often the sources we deal with do not have a uniform distribution; however, we still
want the simplicity of a uniform quantizer. In these cases, even if the sources are bounded,
simply dividing the range of the input by the number of quantization levels does not produce
a very good design.
E x a m p l e 9 . 4 . 2:
Suppose our input fell within the interval [−1, 1] with probability 0.95, and fell in the
intervals [−100, −1), (1, 100] with probability 0.05. Suppose we wanted to design an eight-
level uniform quantizer. If we followed the procedure of the previous section, the step size
would be 25. This means that inputs in the [−1, 0) interval would be represented by the
value −12.5, and inputs in the interval [0, 1) would be represented by the value 12.5. The
maximum quantization error that can be incurred is 12.5. However, at least 95% of the
time, the minimum error that will be incurred is 11.5. Obviously, this is not a very good
design. A much better approach would be to use a smaller step size, which would result in
better representation of the values in the [−1, 1] interval, even if it meant a larger maximum
error. Suppose we pick a step size of 0.3. In this case, the maximum quantization error goes
from 12.5 to 98.95. However, 95% of the time the quantization error will be less than 0.15.
Therefore, the average distortion, or msqe, for this quantizer would be substantially less than
the msqe for the first quantizer.
We can see that when the distribution is no longer uniform, it is not a good idea to
obtain the step size by simply dividing the range of the input by the number of levels. This
approach becomes totally impractical when we model our sources with distributions that are
unbounded, such as the Gaussian distribution. Therefore, we include the pdf of the source
in the design process.
Our objective is to find the step size that, for a given value of M, will minimize the
distortion. The simplest way to do this is to write the distortion as a function of the step size,
and then minimize this function. An expression for the distortion, or msqe, for an M-level
uniform quantizer as a function of the step size can be found by replacing the b_i's and y_i's in
Equation (9.3) with functions of Δ. As we are dealing with a symmetric condition, we need
only compute the distortion for positive values of x; the distortion for negative values of x
will be the same.
From Figure 9.8, we see that the decision boundaries are integral multiples of Δ, and the
representation level for the interval [(k−1)Δ, kΔ) is simply ((2k−1)/2)Δ. Therefore, the expression
for msqe becomes

\sigma_q^2 = 2 \sum_{i=1}^{\frac{M}{2}-1} \int_{(i-1)\Delta}^{i\Delta} \left( x - \frac{2i-1}{2}\Delta \right)^2 f_X(x) \, dx
           + 2 \int_{\left( \frac{M}{2}-1 \right)\Delta}^{\infty} \left( x - \frac{M-1}{2}\Delta \right)^2 f_X(x) \, dx    (9.18)
F I G U R E 9. 8 A uniform midrise quantizer (output levels ±Δ/2, ±3Δ/2, ±5Δ/2, ±7Δ/2, with decision boundaries at integer multiples of Δ).
To find the optimal value of Δ, we simply take a derivative of this equation and set it
equal to zero [108] (see Problem 1):

\frac{d\sigma_q^2}{d\Delta} = - \sum_{i=1}^{\frac{M}{2}-1} (2i-1) \int_{(i-1)\Delta}^{i\Delta} \left( x - \frac{2i-1}{2}\Delta \right) f_X(x) \, dx
                              - (M-1) \int_{\left( \frac{M}{2}-1 \right)\Delta}^{\infty} \left( x - \frac{M-1}{2}\Delta \right) f_X(x) \, dx = 0    (9.19)
This is a rather messy-looking expression, but given the pdf f_X(x), it is easy to solve using
any one of a number of numerical techniques (see Problem 2). In Table 9.3, we list step sizes
found by solving (9.19) for nine different alphabet sizes and three different distributions.
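One simple numerical technique for solving this design problem is to evaluate the msqe of Equation (9.18) on a grid of candidate step sizes and pick the minimum. The sketch below does this for an eight-level quantizer and a unit-variance Gaussian pdf; it is an illustration of the procedure (the grid, the tail truncation, and the integration scheme are assumptions), not the program used to generate Table 9.3.

import numpy as np

def msqe_uniform(delta, M, pdf, tail=12.0, grid=20_000):
    # Eq. (9.18): granular terms over the inner intervals plus the overload term,
    # with the infinite upper limit truncated at 'tail'.
    def integral(lo, hi, rep):
        dx = (hi - lo) / grid
        x = lo + (np.arange(grid) + 0.5) * dx
        return np.sum((x - rep) ** 2 * pdf(x)) * dx
    total = 0.0
    for i in range(1, M // 2):                                    # granular intervals
        total += 2 * integral((i - 1) * delta, i * delta, (2 * i - 1) * delta / 2)
    total += 2 * integral((M // 2 - 1) * delta, tail, (M - 1) * delta / 2)   # overload
    return total

gauss = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
M = 8
deltas = np.linspace(0.1, 2.0, 400)
best = deltas[int(np.argmin([msqe_uniform(d, M, gauss) for d in deltas]))]
print(f"optimum step size for M = {M}, Gaussian input: about {best:.3f}")
# compare with the Gaussian entry for alphabet size 8 in Table 9.3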
Before we discuss the results in Table 9.3, let’s take a look at the quantization noise
for the case of nonuniform sources. Nonuniform sources are often modeled by pdfs with
unbounded support. That is, there is a nonzero probability of getting an unbounded input.
In practical situations, we are not going to get inputs that are unbounded, but often it is very
convenient to model the source process with an unbounded distribution. The classic example
of this is measurement error, which is often modeled as having a Gaussian distribution,
T A B L E 9 . 3 Optimum step size and SNR for uniform quantizers for different
distributions and alphabet sizes [108, 109].
Alphabet Uniform Gaussian Laplacian
Size Step Size SNR Step Size SNR Step Size SNR
2 1.732 6.02 1.596 4.40 1.414 3.00
4 0.866 12.04 0.9957 9.24 1.0873 7.05
6 0.577 15.58 0.7334 12.18 0.8707 9.56
8 0.433 18.06 0.5860 14.27 0.7309 11.39
10 0.346 20.02 0.4908 15.90 0.6334 12.81
12 0.289 21.60 0.4238 17.25 0.5613 13.98
14 0.247 22.94 0.3739 18.37 0.5055 14.98
16 0.217 24.08 0.3352 19.36 0.4609 15.84
32 0.108 30.10 0.1881 24.56 0.2799 20.46
F I G U R E 9. 9 Quantization error x − Q(x) for a uniform midrise quantizer: the granular noise is bounded by ±Δ/2 in the inner intervals, while the overload noise in the outer intervals is unbounded.
even when the measurement error is known to be bounded. If the input is unbounded, the
quantization error is no longer bounded either. The quantization error as a function of input
is shown in Figure 9.9. We can see that in the inner intervals the error is still bounded by
Δ/2; however, the quantization error in the outer intervals is unbounded. These two types of
quantization errors are given different names. The bounded error is called granular error
or granular noise, while the unbounded error is called overload error or overload noise. In
the expression for the msqe in Equation (9.18), the first term represents the granular noise,
while the second term represents the overload noise. The probability that the input will fall
into the overload region is called the overload probability (Figure 9.10).
F I G U R E 9. 10 Overload and granular regions for a 3-bit uniform quantizer: the granular region covers [−4Δ, 4Δ], and the overload regions lie beyond ±4Δ.
The nonuniform sources we deal with have probability density functions that are generally
peaked at zero and decay as we move away from the origin. Therefore, the overload
probability is generally much smaller than the probability of the input falling in the granular
region. As we see from Equation (9.19), an increase in the size of the step size will result in
an increase in the value of (M/2 − 1)Δ, which in turn will result in a decrease in the overload
probability and the second term in Equation (9.19). However, an increase in the step size
will also increase the granular noise, which is the first term in Equation (9.19). The design
process for the uniform quantizer is a balancing of these two effects. An important parameter
that describes this trade-off is the loading factor f_l, defined as the ratio of the maximum
value the input can take in the granular region to the standard deviation. A common value
of the loading factor is 4. This is also referred to as 4σ loading.
Recall that when quantizing an input with a uniform distribution, the SNR and bit rate
are related by Equation (9.17), which says that for each bit increase in the rate there is an
increase of 6.02 dB in the SNR. In Table 9.3, along with the step sizes, we have also listed
the SNR obtained when a million input values with the appropriate pdf are quantized using
the indicated quantizer.
From this table, we can see that, although the SNR for the uniform distribution follows
the rule of a 6.02 dB increase in the signal-to-noise ratio for each additional bit, this is
not true for the other distributions. Remember that we made some assumptions when we
obtained the 6.02n dB rule that are only valid for the uniform distribution. Notice that the more
peaked a distribution is (that is, the further away from uniform it is), the more it seems to
vary from the 6.02 dB rule.
We also said that the selection of Δ is a balance between the overload and granular
errors. The Laplacian distribution has more of its probability mass away from the origin in
its tails than the Gaussian distribution. This means that for the same step size and number of
levels there is a higher probability of being in the overload region if the input has a Laplacian
distribution than if the input has a Gaussian distribution. The uniform distribution is the
extreme case, where the overload probability is zero. For the same number of levels, if we
increase the step size, the size of the overload region (and hence the overload probability) is
reduced at the expense of granular noise. Therefore, for a given number of levels, if we were
picking the step size to balance the effects of the granular and overload noise, distributions
that have heavier tails will tend to have larger step sizes. This effect can be seen in Table 9.3.
For example, for eight levels the step size for the uniform quantizer is 0.433. The step size
for the Gaussian quantizer is larger (0.586), while the step size for the Laplacian quantizer
is larger still (0.7309).
M i s m a t c h E f f e c t s
We have seen that for a result to hold, the assumptions we used to obtain the result have
to hold. When we obtain the optimum step size for a particular uniform quantizer using
Equation (9.19), we make some assumptions about the statistics of the source. We assume
a certain distribution and certain parameters of the distribution. What happens when our
assumptions do not hold? Let’s try to answer this question empirically.
We will look at two types of mismatches. The first is when the assumed distribution
type matches the actual distribution type, but the variance of the input is different from the
assumed variance. The second mismatch is when the actual distribution type is different
from the distribution type assumed when obtaining the value of the step size. Throughout
our discussion, we will assume that the mean of the input distribution is zero.
In Figure 9.11, we have plotted the signal-to-noise ratio as a function of the ratio of
the actual to assumed variance of a 4-bit Gaussian uniform quantizer, with a Gaussian
F I G U R E 9. 11 Effect of variance mismatch on the performance of a 4-bit uniform quantizer (SNR in dB versus the ratio of input variance to design variance).
input. (To see the effect under different conditions, see Problem 5.) Remember that for a
distribution with zero mean, the variance is given by σ_x² = E[X²], which is also a measure
of the power in the signal X. As we can see from the figure, the signal-to-noise ratio is
maximum when the input signal variance matches the variance assumed when designing the
quantizer. From the plot we also see that there is an asymmetry; the SNR is considerably
worse when the input variance is lower than the assumed variance. This is because the
SNR is a ratio of the input variance and the mean squared quantization error. When the
input variance is smaller than the assumed variance, the mean squared quantization error
actually drops because there is less overload noise. However, because the input variance is
low, the ratio is small. When the input variance is higher than the assumed variance, the
msqe increases substantially, but because the input power is also increasing, the ratio does
not decrease as dramatically. To see this more clearly, we have plotted the mean squared
error versus the signal variance separately in Figure 9.12. We can see from these figures
that the decrease in signal-to-noise ratio does not always correlate directly with an increase
in msqe.
The second kind of mismatch is where the input distribution does not match the distribu-
tion assumed when designing the quantizer. In Table 9.4 we have listed the SNR when inputs
with different distributions are quantized using several different eight-level quantizers. The
quantizers were designed assuming a particular input distribution.
Notice that as we go from left to right in the table, the designed step size becomes
progressively larger than the “correct” step size. This is similar to the situation where the
input variance is smaller than the assumed variance. As we can see when we have a mismatch
that results in a smaller step size relative to the optimum step size, there is a greater drop in
performance than when the quantizer step size is larger than its optimum value.
F I G U R E 9. 12 The msqe as a function of variance mismatch with a 4-bit uniform quantizer (msqe versus the ratio of input variance to design variance).
T A B L E 9 . 4 Demonstration of the effect of mismatch using eight-level quantizers (dB).
Input Uniform Gaussian Laplacian Gamma
Distribution Quantizer Quantizer Quantizer Quantizer
Uniform 18.06 15.56 13.29 12.41
Gaussian 12.40 14.27 13.37 12.73
Laplacian 8.80 10.79 11.39 11.28
Gamma 6.98 8.06 8.64 8.76
9 . 5 Adaptive Quantization
One way to deal with the mismatch problem is to adapt the quantizer to the statistics of the
input. Several things might change in the input relative to the assumed statistics, including
the mean, the variance, and the pdf. The strategy for handling each of these variations can
be different, though certainly not exclusive. If more than one aspect of the input statistics
changes, it is possible to combine the strategies for handling each case separately. If the
mean of the input is changing with time, the best strategy is to use some form of differential
encoding (discussed in some detail in Chapter 11). For changes in the other statistics, the
common approach is to adapt the quantizer parameters to the input statistics.
There are two main approaches to adapting the quantizer parameters: an off-line or
forward adaptive approach, and an on-line or backward adaptive approach. In forward
adaptive quantization, the source output is divided into blocks of data. Each block is analyzed
before quantization, and the quantizer parameters are set accordingly. The settings of the
quantizer are then transmitted to the receiver as side information. In backward adaptive
quantization, the adaptation is performed based on the quantizer output. As this is available
to both transmitter and receiver, there is no need for side information.
9 . 5 . 1 F o r w a r d A d a p t i v e Q u a n t i z a t i o n
Let us first look at approaches for adapting to changes in input variance using the forward
adaptive approach. This approach necessitates a delay of at least the amount of time required
to process a block of data. The insertion of side information in the transmitted data stream
may also require the resolution of some synchronization problems. The size of the block
of data processed also affects a number of other things. If the size of the block is too
large, then the adaptation process may not capture the changes taking place in the input
statistics. Furthermore, large block sizes mean more delay, which may not be tolerable in
certain applications. On the other hand, small block sizes mean that the side information
has to be transmitted more often, which in turn means the amount of overhead per sample
increases. The selection of the block size is a trade-off between the increase in side infor-
mation necessitated by small block sizes and the loss of fidelity due to large block sizes
(see Problem 7).
The variance estimation procedure is rather simple. At time n we use a block of N future
samples to compute an estimate of the variance

\hat{\sigma}_q^2 = \frac{1}{N} \sum_{i=0}^{N-1} x_{n+i}^2    (9.20)
Note that we are assuming that our input has a mean of zero. The variance information also
needs to be quantized so that it can be transmitted to the receiver. Usually, the number of
bits used to quantize the value of the variance is significantly larger than the number of bits
used to quantize the sample values.
E x a m p l e 9 . 5 . 1:
In Figure 9.13 we show a segment of speech quantized using a fixed 3-bit quantizer. The
step size of the quantizer was adjusted based on the statistics of the entire sequence. The
sequence was the testm.raw sequence from the sample data sets, consisting of about
4000 samples of a male speaker saying the word “test.” The speech signal was sampled at
8000 samples per second and digitized using a 16-bit A/D.
F I G U R E 9. 13 Original 16-bit speech and compressed 3-bit speech sequences (samples 180 to 320 of the original and reconstructed waveforms).
We can see from the figure that, as in the case of the example of the sinusoid earlier
in this chapter, there is a considerable loss in amplitude resolution. Sample values that are
close together have been quantized to the same value.
The same sequence quantized with a forward adaptive quantizer is shown in Figure 9.14.
For this example, we divided the input into blocks of 128 samples. Before quantizing the
samples in a block, the standard deviation for the samples in the block was obtained. This
value was quantized using an 8-bit quantizer and used by both the transmitter and receiver.
F I G U R E 9. 14 Original 16-bit speech sequence and the sequence obtained using an eight-level forward adaptive quantizer (samples 180 to 320).
The samples in the block were then normalized using this value of the standard deviation.
Notice that the reconstruction follows the input much more closely, though there seems to
be room for improvement, especially in the latter half of the displayed samples.
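A forward adaptive (block) quantizer of the kind used in this example can be sketched as follows. The block length, the 3-bit midrise quantizer, and the use of the block standard deviation as a scale factor follow the description above; the synthetic input, the 4σ loading assumption, and the helper names are my own choices for illustration.

import numpy as np

def midrise(x, delta, M):
    k = np.clip(np.floor(x / delta), -M // 2, M // 2 - 1)
    return (k + 0.5) * delta

def forward_adaptive(x, block=128, bits=3):
    # Quantize each block with a 2**bits-level midrise quantizer whose step is
    # scaled by the block's standard deviation (the side information).
    M = 2 ** bits
    out = np.empty_like(x, dtype=float)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        sigma = seg.std() + 1e-12              # side information for this block
        delta = 2 * 4 * sigma / M              # assume a 4-sigma loading factor
        out[start:start + block] = midrise(seg, delta, M)
    return out

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 4000) * np.linspace(0.2, 2.0, 4000)   # slowly varying variance
y = forward_adaptive(x)
print("SNR (dB):", 10 * np.log10(np.mean(x ** 2) / np.mean((x - y) ** 2)))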
E x a m p l e 9 . 5 . 2:
In Example 9.4.1, we used a uniform quantizer with the assumption that the input is uniformly
distributed. Let us refine this source model a bit and say that while the source is uniformly
distributed over different regions, the range of the input changes. In a forward adaptive
quantization scheme, we would obtain the minimum and maximum values for each block of
data, which would be transmitted as side information. In Figure 9.15, we see the Sena image
quantized with a block size of 8×8 using 3-bit forward adaptive uniform quantization. The
side information consists of the minimum and maximum values in each block, which require
8 bits each. Therefore, the overhead in this case is 16
8×8
or 0.25 bits per pixel, which is quite
small compared to the number of bits per sample used by the quantizer.
The resulting image is hardly distinguishable from the original. Certainly at higher rates,
forward adaptive quantization seems to be a very good alternative.
9 . 5 . 2 B a c k w a r d A d a p t i v e Q u a n t i z a t i o n
In backward adaptive quantization, only the past quantized samples are available for use in
adapting the quantizer. The values of the input are only known to the encoder; therefore,
this information cannot be used to adapt the quantizer. How can we get information about
mismatch simply by examining the output of the quantizer without knowing what the input
was? If we studied the output of the quantizer for a long period of time, we could get some
idea about mismatch from the distribution of output values. If the quantizer step size is
F I G U R E 9. 15 Sena image quantized to 3.25 bits per pixel using forward
adaptive quantization.
well matched to the input, the probability that an input to the quantizer would land in a
particular interval would be consistent with the pdf assumed for the input. However, if the
actual pdf differs from the assumed pdf, the number of times the input falls in the different
quantization intervals will be inconsistent with the assumed pdf. If Δ is smaller than what
it should be, the input will fall in the outer levels of the quantizer an excessive number of
times. On the other hand, if Δ is larger than it should be for a particular source, the input
will fall in the inner levels an excessive number of times. Therefore, it seems that we should
observe the output of the quantizer for a long period of time, then expand the quantizer step
size if the input falls in the outer levels an excessive number of times, and contract the step
size if the input falls in the inner levels an excessive number of times.
Nuggehally S. Jayant at Bell Labs showed that we did not need to observe the quantizer
output over a long period of time [110]. In fact, we could adjust the quantizer step size
after observing a single output. Jayant named this quantization approach “quantization with
one word memory.” The quantizer is better known as the Jayant quantizer. The idea behind
the Jayant quantizer is very simple. If the input falls in the outer levels, the step size needs
to be expanded, and if the input falls in the inner quantizer levels, the step size needs to
be reduced. The expansions and contractions should be done in such a way that once the
quantizer is matched to the input, the product of the expansions and contractions is unity.
The expansion and contraction of the step size is accomplished in the Jayant quantizer
by assigning a multiplier M_k to each interval. If the (n−1)th input falls in the kth interval,
the step size to be used for the nth input is obtained by multiplying the step size used for the
(n−1)th input with M_k. The multiplier values for the inner levels in the quantizer are less
than one, and the multiplier values for the outer levels of the quantizer are greater than one.
Therefore, if an input falls into the inner levels, the quantizer used to quantize the next input
will have a smaller step size. Similarly, if an input falls into the outer levels, the step size
will be multiplied with a value greater than one, and the next input will be quantized using
a larger step size. Notice that the step size for the current input is modified based on the
previous quantizer output. The previous quantizer output is available to both the transmitter
and receiver, so there is no need to send any additional information to inform the receiver
about the adaptation. Mathematically, the adaptation process can be represented as
\Delta_n = M_{l(n-1)} \Delta_{n-1}    (9.21)

where l(n−1) is the quantization interval at time n−1.
In Figure 9.16 we show a 3-bit uniform quantizer. We have eight intervals represented by
the different quantizer outputs. However, the multipliers for symmetric intervals are identical
because of symmetry:
M_0 = M_4, M_1 = M_5, M_2 = M_6, M_3 = M_7
Therefore, we only need four multipliers. To see how the adaptation proceeds, let us work
through a simple example using this quantizer.
F I G U R E 9. 16 Output levels for the Jayant quantizer: a 3-bit midrise quantizer with output levels ±Δ/2, ±3Δ/2, ±5Δ/2, ±7Δ/2, labeled 0-3 for positive inputs and 4-7 for negative inputs.
E x a m p l e 9 . 5 . 3: J a y a n t q u a n t i z e r
For the quantizer in Figure 9.16, suppose the multiplier values are M_0 = M_4 = 0.8, M_1 =
M_5 = 0.9, M_2 = M_6 = 1.0, M_3 = M_7 = 1.2; the initial value of the step size, Δ_0, is 0.5; and the
sequence to be quantized is 0.1, −0.2, 0.2, 0.1, −0.3, 0.1, 0.2, 0.5, 0.9, 1.5, ... When the
first input is received, the quantizer step size is 0.5. Therefore, the input falls into level 0,
and the output value is 0.25, resulting in an error of 0.15. As this input fell into the quantizer
level 0, the new step size Δ_1 is M_0 × Δ_0 = 0.8 × 0.5 = 0.4. The next input is −0.2, which
falls into level 4. As the step size at this time is 0.4, the output is −0.2. To update, we
multiply the current step size with M_4. Continuing in this fashion, we get the sequence of
step sizes and outputs shown in Table 9.5.
T A B L E 9 . 5 Operation of a Jayant quantizer.

n     Δ_n       Input   Output Level   Output     Error      Update Equation
0     0.5        0.1        0           0.25       0.15      Δ_1 = M_0 × Δ_0
1     0.4       −0.2        4          −0.2        0.0       Δ_2 = M_4 × Δ_1
2     0.32       0.2        0           0.16      −0.04      Δ_3 = M_0 × Δ_2
3     0.256      0.1        0           0.128      0.028     Δ_4 = M_0 × Δ_3
4     0.2048    −0.3        5          −0.3072    −0.0072    Δ_5 = M_5 × Δ_4
5     0.1843     0.1        0           0.0922    −0.0078    Δ_6 = M_0 × Δ_5
6     0.1475     0.2        1           0.2212     0.0212    Δ_7 = M_1 × Δ_6
7     0.1328     0.5        3           0.4646    −0.0354    Δ_8 = M_3 × Δ_7
8     0.1594     0.9        3           0.5578    −0.3422    Δ_9 = M_3 × Δ_8
9     0.1913     1.5        3           0.6696    −0.8304    Δ_10 = M_3 × Δ_9
10    0.2296     1.0        3           0.8036    −0.1964    Δ_11 = M_3 × Δ_10
11    0.2755     0.9        3           0.9643     0.0643    Δ_12 = M_3 × Δ_11
Notice how the quantizer adapts to the input. In the beginning of the sequence, the
input values are mostly small, and the quantizer step size becomes progressively smaller,
providing better and better estimates of the input. At the end of the sample sequence, the
input values are large and the step size becomes progressively bigger. However, the size
of the error is quite large during the transition. This means that if the input was changing
rapidly, which would happen if we had a high-frequency input, such transition situations
would be much more likely to occur, and the quantizer would not function very well.
However, in cases where the statistics of the input change slowly, the quantizer could adapt
to the input. As most natural sources such as speech and images tend to be correlated, their
values do not change drastically from sample to sample. Even when some of this structure
is removed through some transformation, the residual structure is generally enough for the
Jayant quantizer (or some variation of it) to function quite effectively.
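The adaptation rule of Equation (9.21) translates almost directly into code. The sketch below walks the input of Example 9.5.3 through a 3-bit midrise Jayant quantizer and reproduces the step sizes, outputs, and errors of Table 9.5; the function reports the magnitude level 0-3, and levels 4-7 in the table are the corresponding negative levels.

def jayant(inputs, multipliers, delta0, M=8):
    # 3-bit midrise Jayant quantizer. multipliers[k] is M_k for the magnitude
    # levels 0..M/2-1; the symmetric negative levels (4..7 in Figure 9.16)
    # share the same multipliers (M_0 = M_4, and so on).
    delta = delta0
    rows = []
    for x in inputs:
        level = min(int(abs(x) / delta), M // 2 - 1)   # magnitude level 0..3
        sign = 1.0 if x >= 0 else -1.0
        y = sign * (level + 0.5) * delta               # reconstruction value
        rows.append((delta, level, y, y - x))
        delta = multipliers[level] * delta             # Eq. (9.21)
    return rows

mult = [0.8, 0.9, 1.0, 1.2]                            # M_0..M_3 of Example 9.5.3
seq = [0.1, -0.2, 0.2, 0.1, -0.3, 0.1, 0.2, 0.5, 0.9, 1.5, 1.0, 0.9]
for n, (d, lvl, y, err) in enumerate(jayant(seq, mult, 0.5)):
    print(f"n={n:2d}  delta={d:.4f}  level={lvl}  output={y:7.4f}  error={err:7.4f}")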
The step size in the initial part of the sequence in this example is progressively getting
smaller. We can easily conceive of situations where the input values would be small for a
long period. Such a situation could occur during a silence period in speech-encoding systems,
or while encoding a dark background in image-encoding systems. If the step size continues
to shrink for an extended period of time, in a finite precision system it would result in a
value of zero. This would be catastrophic, effectively replacing the quantizer with a zero
output device. Usually, a minimum value Δ_min is defined, and the step size is not allowed to
go below this value to prevent this from happening. Similarly, if we get a sequence of large
values, the step size could increase to a point that, when we started getting smaller values,
the quantizer would not be able to adapt fast enough. To prevent this from happening, a
maximum value Δ_max is defined, and the step size is not allowed to increase beyond this
value.
The adaptivity of the Jayant quantizer depends on the values of the multipliers. The
further the multiplier values are from unity, the more adaptive the quantizer. However, if the
adaptation algorithm reacts too fast, this could lead to instability. So how do we go about
selecting the multipliers?
First of all, we know that the multipliers corresponding to the inner levels are less than one,
and the multipliers for the outer levels are greater than one. If the input process is stationary
and P_k represents the probability of being in quantizer interval k (generally estimated by
using a fixed quantizer for the input data), then we can impose a stability criterion for the
Jayant quantizer based on our requirement that once the quantizer is matched to the input,
the product of the expansions and contractions is equal to unity. That is, if n_k is the number
of times the input falls in the kth interval,

\prod_{k=0}^{M} M_k^{n_k} = 1    (9.22)

Taking the Nth root of both sides (where N is the total number of inputs) we obtain

\prod_{k=0}^{M} M_k^{\frac{n_k}{N}} = 1

or

\prod_{k=0}^{M} M_k^{P_k} = 1    (9.23)

where we have assumed that P_k = n_k/N.
There are an infinite number of multiplier values that would satisfy Equation (9.23). One
way to restrict this number is to impose some structure on the multipliers by requiring them
to be of the form

M_k = \gamma^{l_k}    (9.24)

where γ is a number greater than one and l_k takes on only integer values [111, 112]. If we
substitute this expression for M_k into Equation (9.23), we get

\prod_{k=0}^{M} \gamma^{l_k P_k} = 1    (9.25)
which implies that

\sum_{k=0}^{M} l_k P_k = 0    (9.26)
The final step is the selection of γ, which involves a significant amount of creativity. The
value we pick for γ determines how fast the quantizer will respond to changing statistics.
A large value of γ will result in faster adaptation, while a smaller value of γ will result in
greater stability.
E x a m p l e 9 . 5 . 4:
Suppose we have to obtain the multiplier functions for a 2-bit quantizer with input probabilities
P_0 = 0.8, P_1 = 0.2. First, note that the multiplier value for the inner level has to be
less than 1. Therefore, l_0 is less than 0. If we pick l_0 = −1 and l_1 = 4, this would satisfy
Equation (9.26), while making M_0 less than 1 and M_1 greater than 1. Finally, we need to
pick a value for γ.
In Figure 9.17 we see the effect of using different values of γ in a rather extreme
example. The input is a square wave that switches between 0 and 1 every 30 samples. The
input is quantized using a 2-bit Jayant quantizer. We have used l_0 = −1 and l_1 = 2. Notice
what happens when the input switches from 0 to 1. At first the input falls in the outer level
of the quantizer, and the step size increases. This process continues until Δ is just greater
than 1. If γ is close to 1, Δ has been increasing quite slowly and should have a value close
to 1 right before its value increases to greater than 1. Therefore, the output at this point is
close to 1.5. When Δ becomes greater than 1, the input falls in the inner level, and if γ is
close to 1, the output suddenly drops to about 0.5. The step size now decreases until it is just
F I G U R E 9. 17 Effect of γ on the performance of the Jayant quantizer (amplitude versus time for the quantized square wave).
below 1, and the process repeats, causing the “ringing” seen in Figure 9.17. As γ increases,
the quantizer adapts more rapidly, and the magnitude of the ringing effect decreases. The
reason for the decrease is that right before the value of Δ increases above 1, its value is
much smaller than 1, and subsequently the output value is much smaller than 1.5. When Δ
increases beyond 1, it may increase by a significant amount, so the inner level may be much
greater than 0.5. These two effects together compress the ringing phenomenon. Looking at
this phenomenon, we can see that it may have been better to have two adaptive strategies,
one for when the input is changing rapidly, as in the case of the transitions between 0 and 1,
and one for when the input is constant, or nearly so. We will explore this approach further
when we describe the quantizer used in the CCITT standard G.726.
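The multiplier construction of Equations (9.24)-(9.26) is easy to experiment with. The short sketch below uses the probabilities and integers of Example 9.5.4 (P_0 = 0.8, P_1 = 0.2, l_0 = −1, l_1 = 4) together with two assumed values of γ, and confirms that the stability condition of Equation (9.23) holds.

# Multipliers of the form M_k = gamma**l_k, Eq. (9.24); stability requires
# sum(l_k * P_k) == 0, Eq. (9.26), which makes prod(M_k**P_k) == 1, Eq. (9.23).
P = [0.8, 0.2]          # probabilities of the inner and outer levels
l = [-1, 4]             # integer exponents chosen so that sum(l_k * P_k) = 0
print("sum l_k P_k =", sum(lk * pk for lk, pk in zip(l, P)))

for gamma in (1.05, 1.3):
    M = [gamma ** lk for lk in l]
    stability = 1.0
    for Mk, pk in zip(M, P):
        stability *= Mk ** pk
    print(f"gamma = {gamma}: M_0 = {M[0]:.3f}, M_1 = {M[1]:.3f}, "
          f"prod M_k^P_k = {stability:.6f}")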
When selecting multipliers for a Jayant quantizer, the best quantizers expand more rapidly
than they contract. This makes sense when we consider that, when the input falls into the
outer levels of the quantizer, it is incurring overload error, which is essentially unbounded.
This situation needs to be mitigated with dispatch. On the other hand, when the input falls in
the inner levels, the noise incurred is granular noise, which is bounded and therefore may be
more tolerable. Finally, the discussion of the Jayant quantizer was motivated by the need for
robustness in the face of changing input statistics. Let us repeat the earlier experiment with
changing input variance and distributions and see the performance of the Jayant quantizer
compared to the pdf-optimized quantizer. The results for these experiments are presented in
Figure 9.18.
Notice how flat the performance curve is. While the performance of the Jayant quantizer
is much better than the nonadaptive uniform quantizer over a wide range of input variances,
at the point where the input variance and design variance agree, the performance of the
nonadaptive quantizer is significantly better than the performance of the Jayant quantizer.
F I G U R E 9. 18 Performance of the Jayant quantizer for different input variances (SNR in dB versus the ratio of input variance to design variance).
This means that if we know the input statistics and we are reasonably certain that the input
statistics will not change over time, it is better to design for those statistics than to design
an adaptive system.
9 . 6 Nonuniform Quantization
As we can see from Figure 9.10, if the input distribution has more mass near the origin,
the input is more likely to fall in the inner levels of the quantizer. Recall that in lossless
compression, in order to minimize the average number of bits per input symbol, we assigned
shorter codewords to symbols that occurred with higher probability and longer codewords
to symbols that occurred with lower probability. In an analogous fashion, in order to
decrease the average distortion, we can try to approximate the input better in regions of high
probability, perhaps at the cost of worse approximations in regions of lower probability. We
can do this by making the quantization intervals smaller in those regions that have more
probability mass. If the source distribution is like the distribution shown in Figure 9.10, we
would have smaller intervals near the origin. If we wanted to keep the number of intervals
constant, this would mean we would have larger intervals away from the origin. A quantizer
that has nonuniform intervals is called a nonuniform quantizer. An example of a nonuniform
quantizer is shown in Figure 9.19.
Notice that the intervals closer to zero are smaller. Hence the maximum value that the
quantizer error can take on is also smaller, resulting in a better approximation. We pay for
this improvement in accuracy at lower input levels by incurring larger errors when the input
falls in the outer intervals. However, as the probability of getting smaller input values is
much higher than getting larger signal values, on the average the distortion will be lower
than if we had a uniform quantizer. While a nonuniform quantizer provides lower average
distortion, the design of nonuniform quantizers is also somewhat more complex. However,
the basic idea is quite straightforward: find the decision boundaries and reconstruction levels
that minimize the mean squared quantization error. We look at the design of nonuniform
quantizers in more detail in the following sections.
9 . 6 . 1 p d f - O p t i m i z e d Q u a n t i z a t i o n
A direct approach for locating the best nonuniform quantizer, if we have a probability model
for the source, is to find the b_i and y_i that minimize Equation (9.3). Setting the derivative
of Equation (9.3) with respect to y_j to zero, and solving for y_j, we get

y_j = \frac{\int_{b_{j-1}}^{b_j} x f_X(x) \, dx}{\int_{b_{j-1}}^{b_j} f_X(x) \, dx}    (9.27)

The output point for each quantization interval is the centroid of the probability mass in
that interval. Taking the derivative with respect to b_j and setting it equal to zero, we get an
expression for b_j as

b_j = \frac{y_{j+1} + y_j}{2}    (9.28)
F I G U R E 9. 19 A nonuniform midrise quantizer (output levels y_1, ..., y_8 and decision boundaries b_1, ..., b_7, with smaller intervals near the origin).
The decision boundary is simply the midpoint of the two neighboring reconstruction
levels. Solving these two equations will give us the values for the reconstruction levels and
decision boundaries that minimize the mean squared quantization error. Unfortunately, to
solve for yj, we need the values of bj and bj−1, and to solve for bj, we need the values
of yj+1 and yj. In a 1960 paper, Joel Max [108] showed how to solve the two equations
iteratively. The same approach was described by Stuart P. Lloyd in a 1957 internal Bell
Labs memorandum. Generally, credit goes to whomever publishes first, but in this case,
because much of the early work in quantization was done at Bell Labs, Lloyd’s work was
given due credit and the algorithm became known as the Lloyd-Max algorithm. However,
the story does not end (begin?) there. Allen Gersho [113] points out that the same algorithm
was published by Lukaszewicz and Steinhaus in a Polish journal in 1955 [114]! Lloyd’s
paper remained unpublished until 1982, when it was finally published in a special issue of
the IEEE Transactions on Information Theory devoted to quantization [115].
To see how this algorithm works, let us apply it to a specific situation. Suppose
we want to design an M-level symmetric midrise quantizer. To define our symbols, we
will use Figure 9.20. From the figure, we see that in order to design this quantizer,
we need to obtain the reconstruction levels {y_1, y_2, ..., y_{M/2}} and the decision boundaries
{b_1, b_2, ..., b_{M/2−1}}. The reconstruction levels {y_{−1}, y_{−2}, ..., y_{−M/2}} and the decision
boundaries {b_{−1}, b_{−2}, ..., b_{−(M/2−1)}} can be obtained through symmetry, the decision boundary b_0
is zero, and the decision boundary b_{M/2} is simply the largest value the input can take on
(for unbounded inputs this would be ∞).

Let us set j equal to 1 in Equation (9.27):

y_1 = \frac{\int_{b_0}^{b_1} x f_X(x) \, dx}{\int_{b_0}^{b_1} f_X(x) \, dx}    (9.29)
F I G U R E 9. 20 A nonuniform midrise quantizer, with decision boundaries b_{−3}, ..., b_3 and reconstruction levels y_{−4}, ..., y_4 used to define the notation.
As b_0 is known to be 0, we have two unknowns in this equation, b_1 and y_1. We make a
guess at y_1 and later we will try to refine this guess. Using this guess in Equation (9.29),
we numerically find the value of b_1 that satisfies Equation (9.29). Setting j equal to 1 in
Equation (9.28), and rearranging things slightly, we get

y_2 = 2 b_1 - y_1    (9.30)

from which we can compute y_2. This value of y_2 can then be used in Equation (9.27) with
j = 2 to find b_2, which in turn can be used to find y_3. We continue this process, until we
obtain values for {y_1, y_2, ..., y_{M/2}} and {b_1, b_2, ..., b_{M/2−1}}. Note that the accuracy of all
the values obtained to this point depends on the quality of our initial estimate of y_1. We
can check this by noting that y_{M/2} is the centroid of the probability mass of the interval
[b_{M/2−1}, b_{M/2}]. We know b_{M/2} from our knowledge of the data. Therefore, we can compute the
integral

y_{M/2} = \frac{\int_{b_{M/2-1}}^{b_{M/2}} x f_X(x) \, dx}{\int_{b_{M/2-1}}^{b_{M/2}} f_X(x) \, dx}    (9.31)

and compare it with the previously computed value of y_{M/2}. If the difference is less than some
tolerance threshold, we can stop. Otherwise, we adjust the estimate of y_1 in the direction
indicated by the sign of the difference and repeat the procedure.
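The design can also be carried out numerically. The sketch below implements the closely related fixed-point form of the Lloyd-Max algorithm, alternating the centroid condition (9.27) and the midpoint condition (9.28), rather than the search over y_1 described above; the Gaussian input, the initial guess, the stand-in for infinity, and the iteration count are assumptions made for the example.

import math

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)     # Gaussian pdf
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))            # Gaussian cdf

def centroid(a, b):
    # E[X | a < X < b] for a unit-variance Gaussian; Eq. (9.27) for this pdf.
    return (phi(a) - phi(b)) / (Phi(b) - Phi(a))

def lloyd_max_gaussian(M, iters=500):
    # Symmetric midrise quantizer for N(0,1); returns the positive b_i and y_i.
    half = M // 2
    y = [(i + 0.5) * 4.0 / half for i in range(half)]   # initial guess for y_1..y_{M/2}
    big = 20.0                                          # stand-in for +infinity
    for _ in range(iters):
        b = [0.0] + [(y[i] + y[i + 1]) / 2 for i in range(half - 1)] + [big]  # Eq. (9.28)
        y = [centroid(b[i], b[i + 1]) for i in range(half)]                   # Eq. (9.27)
    return b[:-1], y

b, y = lloyd_max_gaussian(8)
print("boundaries:", [round(v, 4) for v in b])   # compare with the Gaussian b_i column of Table 9.6
print("levels:    ", [round(v, 4) for v in y])   # compare with the Gaussian y_i column of Table 9.6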
Decision boundaries and reconstruction levels for various distributions and number of
levels generated using this procedure are shown in Table 9.6. Notice that the distributions
that have heavier tails also have larger outer step sizes. However, these same quantizers have
smaller inner step sizes because they are more heavily peaked. The SNR for these quantizers
is also listed in the table. Comparing these values with those for the pdf-optimized uniform
quantizers, we can see a significant improvement, especially for distributions further away
from the uniform distribution. Both uniform and nonuniform pdf-optimized, or Lloyd-Max,
T A B L E 9 . 6 Quantizer boundary and reconstruction levels for nonuniform
Gaussian and Laplacian quantizers.
Gaussian Laplacian
Levels bi yi SNR bi yi SNR
4 0.0 0.4528 0.0 0.4196
0.9816 1.510 9.3 dB 1.1269 1.8340 7.54 dB
6 0.0 0.3177 0.0 0.2998
0.6589 1.0 0.7195 1.1393
1.447 1.894 12.41 dB 1.8464 2.5535 10.51 dB
8 0.0 0.2451 0.0 0.2334
0.7560 0.6812 0.5332 0.8330
1.050 1.3440 1.2527 1.6725
1.748 2.1520 14.62 dB 2.3796 3.0867 12.64 dB
quantizers have a number of interesting properties. We list these properties here (their proofs
can be found in [116, 117, 118]):

Property 1: The mean values of the input and output of a Lloyd-Max quantizer are equal.

Property 2: For a given Lloyd-Max quantizer, the variance of the output is always
less than or equal to the variance of the input.

Property 3: The mean squared quantization error for a Lloyd-Max quantizer is
given by

\sigma_q^2 = \sigma_x^2 - \sum_{j=1}^{M} y_j^2 \, P[b_{j-1} \le X < b_j]    (9.32)

where σ_x² is the variance of the quantizer input, and the second term on the right-hand
side is the second moment of the output (or variance if the input is zero mean).

Property 4: Let N be the random variable corresponding to the quantization error.
Then for a given Lloyd-Max quantizer,

E[XN] = -\sigma_q^2    (9.33)

Property 5: For a given Lloyd-Max quantizer, the quantizer output and the quantization
noise are orthogonal:

E[Q(X) N \mid b_0, b_1, \ldots, b_M] = 0    (9.34)
M i s m a t c h E f f e c t s
As in the case of uniform quantizers, the pdf-optimized nonuniform quantizers also have
problems when the assumptions underlying their design are violated. In Figure 9.21 we show
the effects of variance mismatch on a 4-bit Laplacian nonuniform quantizer.
This mismatch effect is a serious problem because in most communication systems the
input variance can change considerably over time. A common example of this is the telephone
system. Different people speak with differing amounts of loudness into the telephone. The
quantizer used in the telephone system needs to be quite robust to the wide range of input
variances in order to provide satisfactory service.
One solution to this problem is the use of adaptive quantization to match the quantizer
to the changing input characteristics. We have already looked at adaptive quantization for
the uniform quantizer. Generalizing the uniform adaptive quantizer to the nonuniform case
is relatively straightforward, and we leave that as a practice exercise (see Problem 8).
A somewhat different approach is to use a nonlinear mapping to flatten the performance
curve shown in Figure 9.21. In order to study this approach, we need to view the nonuniform
quantizer in a slightly different manner.
9 . 6 . 2 C o m p a n d e d Q u a n t i z a t i o n
Instead of making the step size small, we could make the interval in which the input lies
with high probability large—that is, expand the region in which the input lands with high
F I G U R E 9. 21 Effect of mismatch on nonuniform quantization (SNR in dB versus the ratio of input variance to assumed variance).
probability in proportion to the probability with which the input lands in this region. This
is the idea behind companded quantization. This quantization approach can be represented
by the block diagram shown in Figure 9.22. The input is first mapped through a compressor
function. This function “stretches” the high-probability regions close to the origin, and cor-
respondingly “compresses” the low-probability regions away from the origin. Thus, regions
close to the origin in the input to the compressor occupy a greater fraction of the total
region covered by the compressor. If the output of the compressor function is quantized
using a uniform quantizer, and the quantized value transformed via an expander function,
F I G U R E 9. 22 Block diagram for log companded quantization: the input passes through a compressor, then a uniform quantizer, then an expander.
the overall effect is the same as using a nonuniform quantizer. To see this, we devise a
simple compander and see how the process functions.
E x a m p l e 9 . 6 . 1:
Suppose we have a source that can be modeled as a random variable taking values in the
interval [−4, 4] with more probability mass near the origin than away from it. We want to
quantize this using the quantizer of Figure 9.3. Let us try to flatten out this distribution using
the following compander, and then compare the companded quantization with straightforward
uniform quantization. The compressor characteristic we will use is given by the following
equation:

c(x) = \begin{cases} 2x & -1 \le x \le 1 \\ \frac{2x}{3} + \frac{4}{3} & x > 1 \\ \frac{2x}{3} - \frac{4}{3} & x < -1 \end{cases}    (9.35)

The mapping is shown graphically in Figure 9.23. The inverse mapping is given by

c^{-1}(x) = \begin{cases} \frac{x}{2} & -2 \le x \le 2 \\ \frac{3x}{2} - 2 & x > 2 \\ \frac{3x}{2} + 2 & x < -2 \end{cases}    (9.36)

The inverse mapping is shown graphically in Figure 9.24.
F I G U R E 9. 23 Compressor mapping c(x) over the interval [−4, 4].
F I G U R E 9. 24 Expander mapping c^{-1}(x) over the interval [−4, 4].
Let’s see how using these mappings affects the quantization error both near and far
from the origin. Suppose we had an input of 0.9. If we quantize directly with the uniform
quantizer, we get an output of 0.5, resulting in a quantization error of 0.4. If we use the
companded quantizer, we first use the compressor mapping, mapping the input value of 0.9
to 1.8. Quantizing this with the same uniform quantizer results in an output of 1.5, with
an apparent error of 0.3. The expander then maps this to the final reconstruction value of
0.75, which is 0.15 away from the input. Comparing 0.15 with 0.4, we can see that relative
to the input we get a substantial reduction in the quantization error. In fact, for all values
in the interval [−1, 1], we will not get any increase in the quantization error, and for most
values we will get a decrease in the quantization error (see Problem 6 at the end of this
chapter). Of course, this will not be true for the values outside the [−1, 1] interval. Suppose
we have an input of 2.7. If we quantized this directly with the uniform quantizer, we would
get an output of 2.5, with a corresponding error of 0.2. Applying the compressor mapping,
the value of 2.7 would be mapped to 3.13, resulting in a quantized value of 3.5. Mapping
this back through the expander, we get a reconstructed value of 3.25, which differs from the
input by 0.55.
As we can see, the companded quantizer effectively works like a nonuniform quantizer
with smaller quantization intervals in the interval [−1, 1] and larger quantization intervals
outside this interval. What is the effective input-output map of this quantizer? Notice that
all inputs in the interval [0, 0.5] get mapped into the interval [0, 1], for which the quantizer
output is 0.5, which in turn corresponds to the reconstruction value of 0.25. Essentially,
all values in the interval [0, 0.5] are represented by the value 0.25. Similarly, all values in
F I G U R E 9. 25 Nonuniform companded quantizer (effective input-output map over the interval [−4, 4]).
the interval [0.5, 1] are represented by the value 0.75, and so on. The effective quantizer
input-output map is shown in Figure 9.25.
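The numbers in this example are easy to verify. The sketch below implements the compressor (9.35), the expander (9.36), and an eight-level uniform quantizer with Δ = 1 standing in for the quantizer of Figure 9.3, and reproduces the reconstruction values 0.75 and 3.25 for the inputs 0.9 and 2.7.

import math

def compress(x):                         # Eq. (9.35)
    if abs(x) <= 1:
        return 2 * x
    return 2 * x / 3 + (4 / 3 if x > 0 else -4 / 3)

def expand(x):                           # Eq. (9.36)
    if abs(x) <= 2:
        return x / 2
    return 3 * x / 2 - (2 if x > 0 else -2)

def uniform_midrise(x, delta=1.0, M=8):
    # Eight-level uniform midrise quantizer with step size 1 (outputs ±0.5, ..., ±3.5).
    k = max(-M // 2, min(M // 2 - 1, math.floor(x / delta)))
    return (k + 0.5) * delta

for x in (0.9, 2.7):
    direct = uniform_midrise(x)
    companded = expand(uniform_midrise(compress(x)))
    print(f"input {x}: direct -> {direct:.2f}, companded -> {companded:.2f}")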
If we bound the source output by some value xmax, any nonuniform quantizer can always
be represented as a companding quantizer. Let us see how we can use this fact to come up
with quantizers that are robust to mismatch. First we need to look at some of the properties
of high-rate quantizers, or quantizers with a large number of levels.
Define

\Delta_k = b_k - b_{k-1}    (9.37)

If the number of levels is high, then the size of each quantization interval will be small, and
we can assume that the pdf of the input f_X(x) is essentially constant in each quantization
interval. Then

f_X(x) = f_X(y_k)   if b_{k-1} \le x < b_k    (9.38)
Using this we can rewrite Equation (9.3) as

\sigma_q^2 = \sum_{i=1}^{M} f_X(y_i) \int_{b_{i-1}}^{b_i} (x - y_i)^2 \, dx    (9.39)

           = \frac{1}{12} \sum_{i=1}^{M} f_X(y_i) \Delta_i^3    (9.40)
F I G U R E 9. 26 A compressor function c(x): within the kth interval of width Δ_k, c(x) rises from c(b_{k−1}) to c(b_k), and the output range is bounded by x_max.
Armed with this result, let us return to companded quantization. Let c(x) be a companding
characteristic for a symmetric quantizer, and let c'(x) be the derivative of the compressor
characteristic with respect to x. If the rate of the quantizer is high, that is, if there are a
large number of levels, then within the kth interval, the compressor characteristic can be
approximated by a straight line segment (see Figure 9.26), and we can write

c'(y_k) = \frac{c(b_k) - c(b_{k-1})}{\Delta_k}    (9.41)

From Figure 9.26 we can also see that c(b_k) − c(b_{k−1}) is the step size of a uniform M-level
quantizer. Therefore,

c(b_k) - c(b_{k-1}) = \frac{2 x_{max}}{M}    (9.42)

Substituting this into Equation (9.41) and solving for Δ_k, we get

\Delta_k = \frac{2 x_{max}}{M c'(y_k)}    (9.43)
Finally, substituting this expression for Δ_k into Equation (9.40), we get the following
relationship between the quantizer distortion, the pdf of the input, and the compressor
characteristic:

\sigma_q^2 = \frac{1}{12} \sum_{i=1}^{M} f_X(y_i) \left( \frac{2 x_{max}}{M c'(y_i)} \right)^3
           = \frac{x_{max}^2}{3M^2} \sum_{i=1}^{M} \frac{f_X(y_i)}{[c'(y_i)]^2} \cdot \frac{2 x_{max}}{M c'(y_i)}
           = \frac{x_{max}^2}{3M^2} \sum_{i=1}^{M} \frac{f_X(y_i)}{[c'(y_i)]^2} \Delta_i    (9.44)

which for small Δ_i can be written as

\sigma_q^2 = \frac{x_{max}^2}{3M^2} \int_{-x_{max}}^{x_{max}} \frac{f_X(x)}{[c'(x)]^2} \, dx    (9.45)
This is a famous result, known as the Bennett integral after its discoverer, W.R. Bennett
[119], and it has been widely used to analyze quantizers. We can see from this integral that
the quantizer distortion is dependent on the pdf of the source sequence. However, it also
tells us how to get rid of this dependence. Define
$$c'(x) = \frac{x_{\max}}{\alpha\,|x|} \qquad (9.46)$$
where $\alpha$ is a constant. From the Bennett integral we get
$$\sigma_q^2 = \frac{x_{\max}^2}{3M^2} \cdot \frac{\alpha^2}{x_{\max}^2} \int_{-x_{\max}}^{x_{\max}} x^2 f_X(x)\, dx \qquad (9.47)$$
$$\phantom{\sigma_q^2} = \frac{\alpha^2}{3M^2}\,\sigma_x^2 \qquad (9.48)$$
where
$$\sigma_x^2 = \int_{-x_{\max}}^{x_{\max}} x^2 f_X(x)\, dx \qquad (9.49)$$
Substituting the expression for $\sigma_q^2$ into the expression for SNR, we get
$$\mathrm{SNR} = 10\log_{10}\frac{\sigma_x^2}{\sigma_q^2} \qquad (9.50)$$
$$\phantom{\mathrm{SNR}} = 10\log_{10}(3M^2) - 20\log_{10}\alpha \qquad (9.51)$$
which is independent of the input pdf. This means that if we use a compressor characteristic
whose derivative satisfies Equation (9.46), then regardless of the input variance, the signal-
to-noise ratio will remain constant. This is an impressive result. However, we do need some
caveats.
Notice that we are not saying that the mean squared quantization error is independent
of the quantizer input. It is not, as is clear from Equation (9.48). Remember also that this
result is valid as long as the underlying assumptions are valid. When the input variance is
very small, our assumption about the pdf being constant over the quantization interval is
no longer valid, and when the variance of the input is very large, our assumption about the
input being bounded by xmax may no longer hold.
With fair warning, let us look at the resulting compressor characteristic. We can obtain
the compressor characteristic by integrating Equation (9.46):
$$c(x) = x_{\max} + \frac{x_{\max}}{\alpha}\log\frac{|x|}{x_{\max}} \qquad (9.52)$$
where the constant of integration has been chosen so that $c(x_{\max}) = x_{\max}$. The only
problem with this compressor characteristic is that it becomes
very large for small x. Therefore, in practice we approximate this characteristic with a
function that is linear around the origin and logarithmic away from it.
Two companding characteristics that are widely used today are $\mu$-law companding and
A-law companding. The $\mu$-law compressor function is given by
$$c(x) = x_{\max}\,\frac{\ln\!\left(1 + \mu\dfrac{|x|}{x_{\max}}\right)}{\ln(1+\mu)}\,\mathrm{sgn}(x) \qquad (9.53)$$
The expander function is given by
$$c^{-1}(x) = \frac{x_{\max}}{\mu}\left[(1+\mu)^{|x|/x_{\max}} - 1\right]\mathrm{sgn}(x) \qquad (9.54)$$
This companding characteristic with $\mu = 255$ is used in the telephone systems in North
America and Japan. The rest of the world uses the A-law characteristic, which is given by
$$c(x) = \begin{cases} \dfrac{A|x|}{1+\ln A}\,\mathrm{sgn}(x) & 0 \le \dfrac{|x|}{x_{\max}} \le \dfrac{1}{A} \\[2ex] \dfrac{x_{\max}\left(1+\ln\dfrac{A|x|}{x_{\max}}\right)}{1+\ln A}\,\mathrm{sgn}(x) & \dfrac{1}{A} \le \dfrac{|x|}{x_{\max}} \le 1 \end{cases} \qquad (9.55)$$
and
$$c^{-1}(x) = \begin{cases} \dfrac{|x|}{A}\,(1+\ln A)\,\mathrm{sgn}(x) & 0 \le \dfrac{|x|}{x_{\max}} \le \dfrac{1}{1+\ln A} \\[2ex] \dfrac{x_{\max}}{A}\exp\!\left[\dfrac{|x|}{x_{\max}}(1+\ln A) - 1\right]\mathrm{sgn}(x) & \dfrac{1}{1+\ln A} \le \dfrac{|x|}{x_{\max}} \le 1 \end{cases} \qquad (9.56)$$
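To make the companded quantization procedure concrete, here is a minimal Python sketch (our own illustration, not part of the software accompanying the book) that compresses with Equation (9.53), quantizes with a uniform midrise quantizer, and expands with Equation (9.54). The number of levels M = 16 and the test inputs are arbitrary choices.

```python
import numpy as np

def mu_law_compress(x, x_max=1.0, mu=255.0):
    # Equation (9.53): logarithmic compression of the magnitude, sign preserved
    return x_max * np.log1p(mu * np.abs(x) / x_max) / np.log1p(mu) * np.sign(x)

def mu_law_expand(y, x_max=1.0, mu=255.0):
    # Equation (9.54): inverse of the compressor characteristic
    return (x_max / mu) * ((1.0 + mu) ** (np.abs(y) / x_max) - 1.0) * np.sign(y)

def uniform_quantize(x, x_max=1.0, M=16):
    # M-level uniform midrise quantizer on [-x_max, x_max]
    delta = 2.0 * x_max / M
    idx = np.clip(np.floor(x / delta), -M // 2, M // 2 - 1)
    return (idx + 0.5) * delta

def companded_quantize(x, x_max=1.0, mu=255.0, M=16):
    # Compress, quantize uniformly, then expand
    y = uniform_quantize(mu_law_compress(x, x_max, mu), x_max, M)
    return mu_law_expand(y, x_max, mu)

x = np.array([-0.8, 0.01, 0.05, 0.3, 0.9])
print(companded_quantize(x))   # companded reconstructions
print(uniform_quantize(x))     # direct uniform quantization, for comparison
```

Comparing the two printed lines shows the effect described above: small-magnitude inputs are reconstructed with much finer effective steps than in the direct uniform quantizer, at the cost of coarser steps near the ends of the range.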
9 . 7 Entropy-Coded Quantization
In Section 9.3 we mentioned three tasks: selection of boundary values, selection of recon-
struction levels, and selection of codewords. Up to this point we have talked about accom-
plishment of the first two tasks, with the performance measure being the mean squared
quantization error. In this section we will look at accomplishing the third task, assigning
codewords to the quantization intervals. Recall that this becomes an issue when we use
variable-length codes. In this section we will be looking at that situation, with the rate
being the performance measure.
We can take two approaches to the variable-length coding of quantizer outputs. We
can redesign the quantizer by taking into account the fact that the selection of the deci-
sion boundaries will affect the rate, or we can keep the design of the quantizer the same
(i.e., Lloyd-Max quantization) and simply entropy-code the quantizer output. Since the latter
approach is by far the simpler one, let’s look at it first.
9 . 7 . 1 E n t r o p y C o d i n g o f L l o y d - M a x Q u a n t i z e r
O u t p u t s
The process of trying to find the optimum quantizer for a given number of levels and rate
is a rather difficult task. An easier approach to incorporating entropy coding is to design
a quantizer that minimizes the msqe, that is, a Lloyd-Max quantizer, then entropy-code its
output.
In Table 9.7 we list the output entropies of uniform and nonuniform Lloyd-Max quan-
tizers. Notice that while the difference in rate for lower levels is relatively small, for a
larger number of levels, there can be a substantial difference between the fixed-rate and
entropy-coded cases. For example, for 32 levels a fixed-rate quantizer would require 5 bits
per sample. However, the entropy of a 32-level uniform quantizer for the Laplacian case
is 3.779 bits per sample, which is more than 1 bit less. Notice that the difference between
the fixed rate and the uniform quantizer entropy is generally greater than the difference
between the fixed rate and the entropy of the output of the nonuniform quantizer. This is
because the nonuniform quantizers have smaller step sizes in high-probability regions and
larger step sizes in low-probability regions. This brings the probability of an input falling
into a low-probability region and the probability of an input falling in a high-probability
region closer together. This, in turn, raises the output entropy of the nonuniform quantizer
with respect to the uniform quantizer. Finally, the closer the distribution is to being uniform,
the less difference in the rates. Thus, the difference in rates is much less for the quantizer
for the Gaussian source than the quantizer for the Laplacian source.
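The entries in Table 9.7 can be checked numerically. The sketch below (a Python illustration under our own assumptions, not the book's software) computes the output entropy of an M-level uniform midrise quantizer for a unit-variance Laplacian source by integrating the pdf over each quantization interval, in the spirit of Equations (9.57) and (9.58) in the next section; the step size is left as a parameter, and using the optimum step sizes from Table 9.3 should approximately reproduce the Laplacian "Uniform" column.

```python
import numpy as np

def laplacian_cdf(x, sigma=1.0):
    # CDF of a zero-mean Laplacian with variance sigma^2 (scale b = sigma/sqrt(2))
    b = sigma / np.sqrt(2.0)
    tail = 0.5 * np.exp(-np.abs(x) / b)
    return np.where(x < 0, tail, 1.0 - tail)

def uniform_quantizer_entropy(M, delta, sigma=1.0):
    # Inner decision boundaries of an M-level uniform midrise quantizer; the two
    # outer intervals extend to +/- infinity so all probability is accounted for.
    inner = delta * np.arange(-(M // 2) + 1, M // 2)
    boundaries = np.concatenate(([-np.inf], inner, [np.inf]))
    P = np.diff(laplacian_cdf(boundaries, sigma))   # interval probabilities
    P = P[P > 0]
    return -np.sum(P * np.log2(P))                  # output entropy in bits/sample

# Example call with an arbitrary step size; substitute the optimum step size
# from Table 9.3 to compare with Table 9.7.
print(uniform_quantizer_entropy(M=32, delta=0.25))
```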
9 . 7 . 2 E n t r o p y - C o n s t r a i n e d Q u a n t i z a t i o n 
Although entropy coding the Lloyd-Max quantizer output is certainly simple, it is easy to
see that we could probably do better if we take a fresh look at the problem of quantizer
T A B L E 9 . 7 Output entropies in bits per sample for minimum mean squared error quantizers.

Number of     Gaussian                  Laplacian
Levels        Uniform    Nonuniform     Uniform    Nonuniform
 4            1.904      1.911          1.751      1.728
 6            2.409      2.442          2.127      2.207
 8            2.759      2.824          2.394      2.479
16            3.602      3.765          3.063      3.473
32            4.449      4.730          3.779      4.427
design, this time with the entropy as a measure of rate rather than the alphabet size. The
entropy of the quantizer output is given by
$$H(Q) = -\sum_{i=1}^{M} P_i \log_2 P_i \qquad (9.57)$$
where $P_i$ is the probability of the input to the quantizer falling in the $i$th quantization interval
and is given by
$$P_i = \int_{b_{i-1}}^{b_i} f_X(x)\, dx \qquad (9.58)$$
Notice that the selection of the representation values yj has no effect on the rate.
This means that we can select the representation values solely to minimize the distortion.
However, the selection of the boundary values affects both the rate and the distortion.
Initially, we found the reconstruction levels and decision boundaries that minimized the
distortion, while keeping the rate fixed by fixing the quantizer alphabet size and assuming
fixed-rate coding. In an analogous fashion, we can now keep the entropy fixed and try to
minimize the distortion. Or, more formally:
For a given $R_o$, find the decision boundaries $b_j$ that minimize $\sigma_q^2$ given by
Equation (9.3), subject to $H(Q) \le R_o$.
The solution to this problem involves the solution of the following $M-1$ nonlinear
equations [120]:
$$\ln\frac{P_{k+1}}{P_k} = \lambda\,(y_{k+1} - y_k)(y_{k+1} + y_k - 2b_k) \qquad (9.59)$$
where $\lambda$ is adjusted to obtain the desired rate, and the reconstruction levels are obtained
using Equation (9.27). A generalization of the method used to obtain the minimum mean
squared error quantizers can be used to obtain solutions for this equation [121]. The process
of finding optimum entropy-constrained quantizers looks complex. Fortunately, at higher
rates we can show that the optimal quantizer is a uniform quantizer, simplifying the problem.
Furthermore, while these results are derived for the high-rate case, it has been shown that
the results also hold for lower rates [121].
9 . 7 . 3 H i g h - R a t e O p t i m u m Q u a n t i z a t i o n 
At high rates, the design of optimum quantizers becomes simple, at least in theory. Gish
and Pierce’s work [122] says that at high rates the optimum entropy-coded quantizer is a
uniform quantizer. Recall that any nonuniform quantizer can be represented by a compander
and a uniform quantizer. Let us try to find the optimum compressor function at high rates
that minimizes the entropy for a given distortion. Using the calculus of variations approach,
we will construct the functional
$$J = H(Q) + \lambda\,\sigma_q^2 \qquad (9.60)$$
then find the compressor characteristic to minimize it.
For the distortion $\sigma_q^2$, we will use the Bennett integral shown in Equation (9.45). The
quantizer entropy is given by Equation (9.57). For high rates, we can assume (as we
did before) that the pdf $f_X(x)$ is constant over each quantization interval $\Delta_i$, and we can
replace Equation (9.58) by
$$P_i = f_X(y_i)\,\Delta_i \qquad (9.61)$$
Substituting this into Equation (9.57), we get
$$H(Q) = -\sum_i f_X(y_i)\,\Delta_i \log\!\big[f_X(y_i)\,\Delta_i\big] \qquad (9.62)$$
$$\phantom{H(Q)} = -\sum_i f_X(y_i)\log\!\big[f_X(y_i)\big]\Delta_i - \sum_i f_X(y_i)\log\!\big[\Delta_i\big]\Delta_i \qquad (9.63)$$
$$\phantom{H(Q)} = -\sum_i f_X(y_i)\log\!\big[f_X(y_i)\big]\Delta_i - \sum_i f_X(y_i)\log\!\left[\frac{2x_{\max}/M}{c'(y_i)}\right]\Delta_i \qquad (9.64)$$
where we have used Equation (9.43) for $\Delta_i$. For small $\Delta_i$ we can write this as
$$H(Q) = -\int f_X(x)\log f_X(x)\, dx - \int f_X(x)\log\!\left[\frac{2x_{\max}/M}{c'(x)}\right] dx \qquad (9.65)$$
$$\phantom{H(Q)} = -\int f_X(x)\log f_X(x)\, dx - \log\frac{2x_{\max}}{M} + \int f_X(x)\log c'(x)\, dx \qquad (9.66)$$
where the first term is the differential entropy of the source $h(X)$. Let's define $g = c'(x)$.
Then substituting the value of $H(Q)$ into Equation (9.60) and differentiating with respect to
$g$, we get
$$\int f_X(x)\left[g^{-1} - 2\lambda \frac{x_{\max}^2}{3M^2}\, g^{-3}\right] dx = 0 \qquad (9.67)$$
This equation is satisfied if the integrand is zero, which gives us
$$g = \sqrt{\frac{2\lambda}{3}}\,\frac{x_{\max}}{M} = K \ \ \text{(a constant)} \qquad (9.68)$$
Therefore,
$$c'(x) = K \qquad (9.69)$$
and
$$c(x) = Kx + \gamma \qquad (9.70)$$
If we now use the boundary conditions $c(0) = 0$ and $c(x_{\max}) = x_{\max}$, we get $c(x) = x$,
which is the compressor characteristic for a uniform quantizer. Thus, at high rates the
optimum quantizer is a uniform quantizer.
Substituting this expression for the optimum compressor function in the Bennett integral,
we get an expression for the distortion for the optimum quantizer:
$$\sigma_q^2 = \frac{x_{\max}^2}{3M^2} \qquad (9.71)$$
Substituting the expression for $c(x)$ in Equation (9.66), we get the expression for the entropy
of the optimum quantizer:
$$H(Q) = h(X) - \log\frac{2x_{\max}}{M} \qquad (9.72)$$
Note that while this result provides us with an easy method for designing optimum
quantizers, our derivation is only valid if the source pdf is entirely contained in the interval
$[-x_{\max}, x_{\max}]$, and if the step size is small enough that we can reasonably assume the pdf
to be constant over a quantization interval. Generally, these conditions can only be satisfied
if we have an extremely large number of quantization intervals. While theoretically this is
not much of a problem, most of these reconstruction levels will be rarely used. In practice,
as mentioned in Chapter 3, entropy coding a source with a large output alphabet is very
problematic. One way we can get around this is through the use of a technique called
recursive indexing.
Recursive indexing is a mapping of a countable set to a collection of sequences of
symbols from another set with finite size [76]. Given a countable set $A = \{a_0, a_1, \ldots\}$ and
a finite set $B = \{b_0, b_1, \ldots, b_M\}$ of size $M+1$, we can represent any element in $A$ by a
sequence of elements in $B$ in the following manner:
1. Take the index $i$ of element $a_i$ of $A$.
2. Find the quotient $m$ and remainder $r$ of the index $i$ such that
   $$i = mM + r$$
3. Generate the sequence: $\underbrace{b_M b_M \cdots b_M}_{m\ \text{times}}\, b_r$.
B is called the representation set. We can see that given any element in A we will have a
unique sequence from B representing it. Furthermore, no representative sequence is a prefix
of any other sequence. Therefore, recursive indexing can be viewed as a trivial, uniquely
decodable prefix code. The inverse mapping is given by
$$\underbrace{b_M b_M \cdots b_M}_{m\ \text{times}}\, b_r \;\rightarrow\; a_{mM+r}$$
Since it is one-to-one, if it is used at the output of the quantizer to convert the index sequence
of the quantizer output into the sequence of the recursive indices, the former can be recovered
without error from the latter. Furthermore, when the size M + 1 of the representation set
B is chosen appropriately, in effect we can achieve the reduction in the size of the output
alphabets that are used for entropy coding.
E x a m p l e 9 . 7 . 1:
Suppose we want to represent the set of nonnegative integers $A = \{0, 1, 2, \ldots\}$ with the rep-
resentation set $B = \{0, 1, 2, 3, 4, 5\}$. Then the value 12 would be represented by the sequence
5, 5, 2, and the value 16 would be represented by the sequence 5, 5, 5, 1. Whenever the
decoder sees the value 5, it simply adds on the next value until the next value is smaller than 5.
For example, the sequence 3, 5, 1, 2, 5, 5, 1, 5, 0 would be decoded as 3, 6, 2, 11, 5. 
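The mapping and its inverse are simple enough to state in a few lines of Python; the sketch below (our own, with hypothetical function names) uses M = 5 so that the representation set is B = {0, 1, 2, 3, 4, 5}, as in the example.

```python
def ri_encode(i, M=5):
    # Represent index i by m copies of the largest symbol M followed by the remainder r
    m, r = divmod(i, M)
    return [M] * m + [r]

def ri_decode(symbols, M=5):
    # Accumulate runs of M; any smaller symbol terminates the current index
    values, total = [], 0
    for s in symbols:
        if s == M:
            total += M
        else:
            values.append(total + s)
            total = 0
    return values

print(ri_encode(12))                            # [5, 5, 2]
print(ri_encode(16))                            # [5, 5, 5, 1]
print(ri_decode([3, 5, 1, 2, 5, 5, 1, 5, 0]))   # [3, 6, 2, 11, 5]
```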
Recursive indexing is applicable to any representation of a large set by a small set. One
way of applying recursive indexing to the problem of quantization is as follows: For a given
step size $\Delta > 0$ and a positive integer $K$, define $x_l$ and $x_h$ as follows:
$$x_l = -\Delta\left\lfloor\frac{K-1}{2}\right\rfloor$$
$$x_h = x_l + (K-1)\Delta$$
where $\lfloor x \rfloor$ is the largest integer not exceeding $x$. We define a recursively indexed quantizer
of size $K$ to be a uniform quantizer with step size $\Delta$ and with $x_l$ and $x_h$ being its smallest and
largest output levels. ($Q$ defined this way also has 0 as its output level.) The quantization
rule $Q$, for a given input value $x$, is as follows:
1. If $x$ falls in the interval $[x_l + \frac{\Delta}{2},\, x_h - \frac{\Delta}{2}]$, then $Q(x)$ is the nearest output level.
2. If $x$ is greater than $x_h - \frac{\Delta}{2}$, see if $x_1 = x - x_h \in [x_l + \frac{\Delta}{2},\, x_h - \frac{\Delta}{2}]$. If so, $Q(x) = (x_h, Q(x_1))$.
If not, form $x_2 = x - 2x_h$ and do the same as for $x_1$. This process continues
until for some $m$, $x_m = x - mx_h$ falls in $[x_l + \frac{\Delta}{2},\, x_h - \frac{\Delta}{2}]$, which will be quantized into
$$Q(x) = (\underbrace{x_h, x_h, \ldots, x_h}_{m\ \text{times}},\, Q(x_m)) \qquad (9.73)$$
3. If $x$ is smaller than $x_l + \frac{\Delta}{2}$, a similar procedure to the above is used; that is, form
$x_m = x - mx_l$ (that is, repeatedly add $|x_l|$) so that it falls in $[x_l + \frac{\Delta}{2},\, x_h - \frac{\Delta}{2}]$, and quantize it to
$(\underbrace{x_l, x_l, \ldots, x_l}_{m\ \text{times}},\, Q(x_m))$.
In summary, the quantizer operates in two modes: one when the input falls in the range
$[x_l, x_h]$, the other when it falls outside of the specified range. The recursive nature in the
second mode gives it the name.
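The two-mode rule above can be sketched as follows (a Python illustration under our own choice of Δ = 0.5 and K = 5; the decoder simply adds up the elements of each representation sequence).

```python
def ri_quantize(x, delta, K):
    # Recursively indexed quantizer: a uniform quantizer with output levels
    # x_l, x_l + delta, ..., x_h, applied recursively to out-of-range inputs.
    x_l = -delta * ((K - 1) // 2)
    x_h = x_l + (K - 1) * delta
    out = []
    while x > x_h - delta / 2:            # mode 2: peel off x_h until x is in range
        out.append(x_h)
        x -= x_h
    while x < x_l + delta / 2:            # mode 3: peel off x_l until x is in range
        out.append(x_l)
        x -= x_l
    out.append(delta * round(x / delta))  # nearest output level inside the range
    return out

print(ri_quantize(2.9, delta=0.5, K=5))   # [1.0, 1.0, 1.0, 0.0]; the decoder sums these to 3.0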
We pay for the advantage of encoding a larger set by a smaller set in several ways. If we
get a large input to our quantizer, the representation sequence may end up being intolerably
large. We also get an increase in the rate. If $H(Q)$ is the entropy of the quantizer output,
and $\gamma$ is the average number of representation symbols per input symbol, then the minimum
rate for the recursively indexed quantizer is $\gamma H(Q)$.
In practice, neither cost is too large. We can avoid the problem of intolerably large
sequences by adopting some simple strategies for representing these sequences, and the
value of $\gamma$ is quite close to one for reasonable values of $M$. For Laplacian and Gaussian
quantizers, a typical value for M would be 15 [76].
9 . 8 Summary
The area of quantization is a well-researched area and much is known about the subject. In
this chapter, we looked at the design and performance of uniform and nonuniform quantizers
for a variety of sources, and how the performance is affected when the assumptions used
in the design process are not correct. When the source statistics are not well known or
change with time, we can use an adaptive strategy. One of the more popular approaches to
adaptive quantization is the Jayant quantizer. We also looked at the issues involved with
entropy-coded quantization.
F u r t h e r R e a d i n g
With an area as broad as quantization, we had to keep some of the coverage rather cursory.
However, there is a wealth of information on quantization available in the published literature.
The following sources are especially useful for a general understanding of the area:
1. A very thorough coverage of quantization can be found in Digital Coding of Wave-
forms, by N.S. Jayant and P. Noll [123].
2. The paper “Quantization,” by A. Gersho, in IEEE Communication Magazine, Septem-
ber 1977 [113], provides an excellent tutorial coverage of many of the topics listed
here.
3. The original paper by J. Max, “Quantization for Minimum Distortion,” IRE Transac-
tions on Information Theory [108], contains a very accessible description of the design
of pdf-optimized quantizers.
4. A thorough study of the effects of mismatch is provided by W. Mauersberger in [124].
9 . 9 Projects and Problems
1. Show that the derivative of the distortion expression in Equation (9.18) results in the
expression in Equation (9.19). You will have to use a result called Leibnitz’s rule
and the idea of a telescoping series. Leibnitz's rule states that if $a(t)$ and $b(t)$ are
monotonic, then
$$\frac{\partial}{\partial t}\int_{a(t)}^{b(t)} f(x,t)\, dx = \int_{a(t)}^{b(t)} \frac{\partial f(x,t)}{\partial t}\, dx + f(b(t),t)\,\frac{\partial b(t)}{\partial t} - f(a(t),t)\,\frac{\partial a(t)}{\partial t} \qquad (9.74)$$
2. Use the program falspos to solve Equation (9.19) numerically for the Gaussian and
Laplacian distributions. You may have to modify the function func in order to do this.
3. Design a 3-bit uniform quantizer (specify the decision boundaries and representa-
tion levels) for a source with a Laplacian pdf, with a mean of 3 and a variance of 4.
4. The pixel values in the Sena image are not really distributed uniformly. Obtain a
histogram of the image (you can use the hist_image routine), and using the fact
that the quantized image should be as good an approximation as possible for the
original, design 1-, 2-, and 3-bit quantizers for this image. Compare these with the
results displayed in Figure 9.7. (For better comparison, you can reproduce the results
in the book using the program uquan_img.)
5. Use the program misuquan to study the effect of mismatch between the input and
assumed variances. How do these effects change with the quantizer alphabet size and
the distribution type?
6. For the companding quantizer of Example 9.6.1, what are the outputs for the
following inputs: $-0.8,\ 1.2,\ 0.5,\ 0.6,\ 3.2,\ -0.3$? Compare your results with the case
when the input is directly quantized with a uniform quantizer with the same number
of levels. Comment on your results.
7. Use the test images Sena and Bookshelf1 to study the trade-offs involved in the
selection of block sizes in the forward adaptive quantization scheme described in
Example 9.5.2. Compare this with a more traditional forward adaptive scheme in
which the variance is estimated and transmitted. The variance information should be
transmitted using a uniform quantizer with differing number of bits.
8. Generalize the Jayant quantizer to the nonuniform case. Assume that the input is from a
known distribution with unknown variance. Simulate the performance of this quantizer
over the same range of ratio of variances as we have done for the uniform case. Com-
pare your results to the fixed nonuniform quantizer and the adaptive uniform quantizer.
To get a start on your program, you may wish to use misnuq.c and juquan.c.
9. Let's look at the rate distortion performance of the various quantizers.
(a) Plot the rate-distortion function $R(D)$ for a Gaussian source with mean zero and
variance $\sigma_X^2 = 2$.
(b) Assuming fixed length codewords, compute the rate and distortion for 1, 2, and
3 bit pdf-optimized nonuniform quantizers. Also, assume that $X$ is a Gaussian
random variable with mean zero and $\sigma_X^2 = 2$. Plot these values on the same
graph with x's.
(c) For the 2 and 3 bit quantizers, compute the rate and distortion assuming that the
quantizer outputs are entropy coded. Plot these on the graph with o’s.
10
Vector Quantization
10 . 1 Overview
By grouping source outputs together and encoding them as a single block, we can
obtain efficient lossy as well as lossless compression algorithms. Many of the
lossless compression algorithms that we looked at took advantage of this fact.
We can do the same with quantization. In this chapter, several quantization
techniques that operate on blocks of data are described. We can view these
blocks as vectors, hence the name “vector quantization.” We will describe several different
approaches to vector quantization. We will explore how to design vector quantizers and how
these quantizers can be used for compression.
10 . 2 Introduction
In the last chapter, we looked at different ways of quantizing the output of a source. In all
cases the quantizer inputs were scalar values, and each quantizer codeword represented a
single sample of the source output. In Chapter 2 we saw that, by taking longer and longer
sequences of input samples, it is possible to extract the structure in the source coder output.
In Chapter 4 we saw that, even when the input is random, encoding sequences of samples
instead of encoding individual samples separately provides a more efficient code. Encoding
sequences of samples is more advantageous in the lossy compression framework as well.
By “advantageous” we mean a lower distortion for a given rate, or a lower rate for a given
distortion. As in the previous chapter, by “rate” we mean the average number of bits per
input sample, and the measures of distortion will generally be the mean squared error and
the signal-to-noise ratio.
The idea that encoding sequences of outputs can provide an advantage over the encoding
of individual samples was first put forward by Shannon, and the basic results in information
theory were all proved by taking longer and longer sequences of inputs. This indicates that
a quantization strategy that works with sequences or blocks of output would provide some
improvement in performance over scalar quantization. In other words, we wish to generate
a representative set of sequences. Given a source output sequence, we would represent it
with one of the elements of the representative set.
In vector quantization we group the source output into blocks or vectors. For example,
we can treat L consecutive samples of speech as the components of an L-dimensional
vector. Or, we can take a block of L pixels from an image and treat each pixel value as
a component of a vector of size or dimension L. This vector of source outputs forms the
input to the vector quantizer. At both the encoder and decoder of the vector quantizer, we
have a set of L-dimensional vectors called the codebook of the vector quantizer. The vectors
in this codebook, known as code-vectors, are selected to be representative of the vectors
we generate from the source output. Each code-vector is assigned a binary index. At the
encoder, the input vector is compared to each code-vector in order to find the code-vector
closest to the input vector. The elements of this code-vector are the quantized values of the
source output. In order to inform the decoder about which code-vector was found to be the
closest to the input vector, we transmit or store the binary index of the code-vector. Because
the decoder has exactly the same codebook, it can retrieve the code-vector given its binary
index. A pictorial representation of this process is shown in Figure 10.1.
Although the encoder may have to perform a considerable amount of computations in
order to find the closest reproduction vector to the vector of source outputs, the decoding
consists of a table lookup. This makes vector quantization a very attractive encoding scheme
for applications in which the resources available for decoding are considerably less than
the resources available for encoding. For example, in multimedia applications, considerable
F I G U R E 10. 1 The vector quantization procedure.
computational resources may be available for the encoding operation. However, if the
decoding is to be done in software, the amount of computational resources available to the
decoder may be quite limited.
Even though vector quantization is a relatively new area, it has developed very rapidly,
and now even some of the subspecialties are broad areas of research. In this chapter we will
try to introduce you to as much of this fascinating area as we can. If your appetite is whetted
by what is available here and you wish to explore further, there is an excellent book by
Gersho and Gray [5] devoted to the subject of vector quantization.
Our approach in this chapter is as follows: First, we try to answer the question of why we
would want to use vector quantization over scalar quantization. There are several answers
to this question, each illustrated through examples. In our discussion, we assume that you
are familiar with the material in Chapter 9. We will then turn to one of the most important
elements in the design of a vector quantizer, the generation of the codebook. While there are
a number of ways of obtaining the vector quantizer codebook, most of them are based on one
particular approach, popularly known as the Linde-Buzo-Gray (LBG) algorithm. We devote
a considerable amount of time in describing some of the details of this algorithm. Our intent
here is to provide you with enough information so that you can write your own programs for
design of vector quantizer codebooks. In the software accompanying this book, we have also
included programs for designing codebooks that are based on the descriptions in this chapter.
If you are not currently thinking of implementing vector quantization routines, you may wish
to skip these sections (Sections 10.4.1 and 10.4.2). We follow our discussion of the LBG
algorithm with some examples of image compression using codebooks designed with this
algorithm, and then with a brief sampling of the many different kinds of vector quantizers.
Finally, we describe another quantization strategy, called trellis-coded quantization (TCQ),
which, though different in implementation from the vector quantizers, also makes use of the
advantage to be gained from operating on sequences.
Before we begin our discussion of vector quantization, let us define some of the ter-
minology we will be using. The amount of compression will be described in terms of the
rate, which will be measured in bits per sample. Suppose we have a codebook of size K,
and the input vector is of dimension L. In order to inform the decoder of which code-vector
was selected, we need to use $\lceil \log_2 K \rceil$ bits. For example, if the codebook contained 256
code-vectors, we would need 8 bits to specify which of the 256 code-vectors had been
selected at the encoder. Thus, the number of bits per vector is $\lceil \log_2 K \rceil$ bits. As each code-
vector contains the reconstruction values for $L$ source output samples, the number of bits
per sample would be $\frac{\lceil \log_2 K \rceil}{L}$. Thus, the rate for an $L$-dimensional vector quantizer with a
codebook of size $K$ is $\frac{\lceil \log_2 K \rceil}{L}$. As our measure of distortion we will use the mean squared
error. When we say that in a codebook $\mathcal{C}$, containing the $K$ code-vectors $Y_i$, the input
vector $X$ is closest to $Y_j$, we will mean that
$$\lVert X - Y_j \rVert^2 \le \lVert X - Y_i \rVert^2 \qquad \text{for all } Y_i \in \mathcal{C} \qquad (10.1)$$
where $X = (x_1, x_2, \cdots, x_L)$ and
$$\lVert X \rVert^2 = \sum_{i=1}^{L} x_i^2 \qquad (10.2)$$
The term sample will always refer to a scalar value. Thus, when we are discussing
compression of images, a sample refers to a single pixel. Finally, the output points of the
quantizer are often referred to as levels. Thus, when we wish to refer to a quantizer with K
output points or code-vectors, we may refer to it as a K-level quantizer.
10 . 3 Advantages of Vector Quantization over
Scalar Quantization
For a given rate (in bits per sample), use of vector quantization results in a lower distortion
than when scalar quantization is used at the same rate, for several reasons. In this section we
will explore these reasons with examples (for a more theoretical explanation, see [3, 4, 17]).
If the source output is correlated, vectors of source output values will tend to fall in
clusters. By selecting the quantizer output points to lie in these clusters, we have a more
accurate representation of the source output. Consider the following example.
E x a m p l e 10 . 3 . 1:
In Example 8.5.1, we introduced a source that generates the height and weight of individuals.
Suppose the height of these individuals varied uniformly between 40 and 80 inches, and the
weight varied uniformly between 40 and 240 pounds. Suppose we were allowed a total of 6
bits to represent each pair of values. We could use 3 bits to quantize the height and 3 bits to
quantize the weight. Thus, the weight range between 40 and 240 pounds would be divided
into eight intervals of equal width of 25 and with reconstruction values 52.5, 77.5, ..., 227.5.
Similarly, the height range between 40 and 80 inches can be divided into eight intervals of
width five, with reconstruction levels 42.5, 47.5, ..., 77.5. When we look at the representation
of height and weight separately, this approach seems reasonable. But let’s look at this
quantization scheme in two dimensions. We will plot the height values along the x-axis
and the weight values along the y-axis. Note that we are not changing anything in the
quantization process. The height values are still being quantized to the same eight different
values, as are the weight values. The two-dimensional representation of these two quantizers
is shown in Figure 10.2.
From the figure we can see that we effectively have a quantizer output for a person who
is 80 inches (6 feet 8 inches) tall and weighs 40 pounds, as well as a quantizer output for
an individual whose height is 42 inches but weighs more than 200 pounds. Obviously, these
outputs will never be used, as is the case for many of the other outputs. A more sensible
approach would be to use a quantizer like the one shown in Figure 10.3, where we take
account of the fact that the height and weight are correlated. This quantizer has exactly the
same number of output points as the quantizer in Figure 10.2; however, the output points are
clustered in the area occupied by the input. Using this quantizer, we can no longer quantize
the height and weight separately. We have to consider them as the coordinates of a point
in two dimensions in order to find the closest quantizer output point. However, this method
provides a much finer quantization of the input.
F I G U R E 10. 2 The height/weight scalar quantizers when viewed in two dimensions.
Note that we have not said how we would obtain the locations of the quantizer outputs
shown in Figure 10.3. These output points make up the codebook of the vector quantizer,
and we will be looking at codebook design in some detail later in this chapter. 
We can see from this example that, as in lossless compression, looking at longer
sequences of inputs brings out the structure in the source output. This structure can then be
used to provide more efficient representations.
We can easily see how structure in the form of correlation between source outputs
can make it more efficient to look at sequences of source outputs rather than looking at
each sample separately. However, the vector quantizer is also more efficient than the scalar
quantizer when the source output values are not correlated. The reason for this is actually
F I G U R E 10. 3 The height-weight vector quantizer.
quite simple. As we look at longer and longer sequences of source outputs, we are afforded
more flexibility in terms of our design. This flexibility in turn allows us to match the design
of the quantizer to the source characteristics. Consider the following example.
E x a m p l e 10 . 3 . 2:
Suppose we have to design a uniform quantizer with eight output values for a Laplacian
input. Using the information from Table 9.3 in Chapter 9, we would obtain the quantizer
shown in Figure 10.4, where $\Delta$ is equal to 0.7309. As the input has a Laplacian distribution,
the probability of the source output falling in the different quantization intervals is not the
same. For example, the probability that the input will fall in the interval $[0, \Delta)$ is 0.3242,
while the probability that a source output will fall in the interval $[3\Delta, \infty)$ is 0.0225. Let's
look at how this quantizer will quantize two consecutive source outputs. As we did in the
previous example, let’s plot the first sample along the x-axis and the second sample along
the y-axis. We can represent this two-dimensional view of the quantization process as shown
in Figure 10.5. Note that, as in the previous example, we have not changed the quantization
process; we are simply representing it differently. The first quantizer input, which we have
represented in the figure as x1, is quantized to the same eight possible output values as
before. The same is true for the second quantizer input, which we have represented in the
F I G U R E 10. 4 Two representations of an eight-level scalar quantizer.
figure as x2. This two-dimensional representation allows us to examine the quantization
process in a slightly different manner. Each filled-in circle in the figure represents a sequence
of two quantizer outputs. For example, the top rightmost circle represents the two quantizer
outputs that would be obtained if we had two consecutive source outputs with a value
greater than $3\Delta$. We computed the probability of a single source output greater than $3\Delta$ to
be 0.0225. The probability of two consecutive source outputs greater than 2.193 is simply
$0.0225 \times 0.0225 = 0.0005$, which is quite small. Given that we do not use this output point
very often, we could simply place it somewhere else where it would be of more use. Let
us move this output point to the origin, as shown in Figure 10.6. We have now modified
the quantization process. Now if we get two consecutive source outputs with values greater
than $3\Delta$, the quantizer output corresponding to the second source output may not be the
same as the first source output.
F I G U R E 10. 5 Input-output map for consecutive quantization of two inputs using an eight-level scalar quantizer.
If we compare the rate distortion performance of the two vector quantizers, the SNR
for the first vector quantizer is 11.44 dB, which agrees with the result in Chapter 9 for
the uniform quantizer with a Laplacian input. The SNR for the modified vector quantizer,
however, is 11.73 dB, an increase of about 0.3 dB. Recall that the SNR is a measure of the
average squared value of the source output samples and the mean squared error. As the
average squared value of the source output is the same in both cases, an increase in SNR
means a decrease in the mean squared error. Whether this increase in SNR is significant will
depend on the particular application. What is important here is that by treating the source
output in groups of two we could effect a positive change with only a minor modification.
We could argue that this modification is really not that minor since the uniform characteristic
of the original quantizer has been destroyed. However, if we begin with a nonuniform
quantizer and modify it in a similar way, we get similar results.
Could we do something similar with the scalar quantizer? If we move the output point at
$7\Delta/2$ to the origin, the SNR drops from 11.44 dB to 10.8 dB. What is it that permits us to make
F I G U R E 10. 6 Modified two-dimensional vector quantizer.
modifications in the vector case, but not in the scalar case? This advantage is caused by the
added flexibility we get by viewing the quantization process in higher dimensions. Consider
the effect of moving the output point from $7\Delta/2$ to the origin in terms of two consecutive
inputs. This one change in one dimension corresponds to moving 15 output points in two
dimensions. Thus, modifications at the scalar quantizer level are gross modifications when
viewed from the point of view of the vector quantizer. Remember that in this example we
have only looked at two-dimensional vector quantizers. As we block the input into larger
and larger blocks or vectors, these higher dimensions provide even greater flexibility and
the promise of further gains to be made. 
In Figure 10.6, notice how the quantization regions have changed for the outputs around
the origin, as well as for the two neighbors of the output point that were moved. The
decision boundaries between the reconstruction levels can no longer be described as easily
as in the case for the scalar quantizer. However, if we know the distortion measure, simply
knowing the output points gives us sufficient information to implement the quantization
process. Instead of defining the quantization rule in terms of the decision boundary, we can
define the quantization rule as follows:
$$Q(X) = Y_j \quad \text{iff} \quad d(X, Y_j) < d(X, Y_i)\ \ \forall\, i \ne j \qquad (10.3)$$
For the case where the input $X$ is equidistant from two output points, we can use a simple
tie-breaking rule such as “use the output point with the smaller index.” The quantization
regions $V_j$ can then be defined as
$$V_j = \{X : d(X, Y_j) < d(X, Y_i)\ \ \forall\, i \ne j\} \qquad (10.4)$$
Thus, the quantizer is completely defined by the output points and a distortion measure.
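In code, the rule of Equation (10.3) with the squared-error distortion is just a nearest-neighbor search; the sketch below (our own Python illustration, using an arbitrary small height-weight codebook) also shows that decoding is a table lookup.

```python
import numpy as np

def vq_encode(X, codebook):
    # Return the index of the code-vector closest to X in squared error.
    # np.argmin returns the smallest index in case of a tie, which matches the
    # "use the output point with the smaller index" rule.
    d = np.sum((codebook - X) ** 2, axis=1)
    return int(np.argmin(d))

def vq_decode(index, codebook):
    # Decoding is just a table lookup.
    return codebook[index]

codebook = np.array([[44, 41], [58, 102], [64, 117], [69, 168]], dtype=float)
x = np.array([62, 114], dtype=float)
j = vq_encode(x, codebook)
print(j, vq_decode(j, codebook))   # index 2, reconstruction (64, 117)
```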
From a multidimensional point of view, using a scalar quantizer for each input restricts
the output points to a rectangular grid. Observing several source output values at once allows
us to move the output points around. Another way of looking at this is that in one dimension
the quantization intervals are restricted to be intervals, and the only parameter that we can
manipulate is the size of these intervals. When we divide the input into vectors of some
length n, the quantization regions are no longer restricted to be rectangles or squares. We
have the freedom to divide the range of the inputs in an infinite number of ways.
These examples have shown two ways in which the vector quantizer can be used to
improve performance. In the first case, we exploited the sample-to-sample dependence of
the input. In the second case, there was no sample-to-sample dependence; the samples were
independent. However, looking at two samples together still improved performance.
These two examples can be used to motivate two somewhat different approaches toward
vector quantization. One approach is a pattern-matching approach, similar to the process
used in Example 10.3.1, while the other approach deals with the quantization of random
inputs. We will look at both of these approaches in this chapter.
10 . 4 The Linde-Buzo-Gray Algorithm
In Example 10.3.1 we saw that one way of exploiting the structure in the source output is
to place the quantizer output points where the source output (blocked into vectors) are most
likely to congregate. The set of quantizer output points is called the codebook of the quantizer,
and the process of placing these output points is often referred to as codebook design. When
we group the source output in two-dimensional vectors, as in the case of Example 10.3.1,
we might be able to obtain a good codebook design by plotting a representative set of source
output points and then visually locate where the quantizer output points should be. However,
this approach to codebook design breaks down when we design higher-dimensional vector
quantizers. Consider designing the codebook for a 16-dimensional quantizer. Obviously, a
visual placement approach will not work in this case. We need an automatic procedure for
locating where the source outputs are clustered.
This is a familiar problem in the field of pattern recognition. It is no surprise, there-
fore, that the most popular approach to designing vector quantizers is a clustering pro-
cedure known as the k-means algorithm, which was developed for pattern recognition
applications.
The k-means algorithm functions as follows: Given a large set of output vectors from
the source, known as the training set, and an initial set of k representative patterns, assign
each element of the training set to the closest representative pattern. After an element is
assigned, the representative pattern is updated by computing the centroid of the training set
vectors assigned to it. When the assignment process is complete, we will have k groups of
vectors clustered around each of the output points.
Stuart Lloyd [115] used this approach to generate the pdf-optimized scalar quantizer,
except that instead of using a training set, he assumed that the distribution was known. The
Lloyd algorithm functions as follows:
1. Start with an initial set of reconstruction values $\{y_i^{(0)}\}_{i=1}^{M}$. Set $k = 0$, $D^{(0)} = 0$. Select
threshold $\epsilon$.
2. Find decision boundaries
$$b_j^{(k)} = \frac{y_{j+1}^{(k)} + y_j^{(k)}}{2} \qquad j = 1, 2, \ldots, M-1$$
3. Compute the distortion
$$D^{(k)} = \sum_{i=1}^{M} \int_{b_{i-1}^{(k)}}^{b_i^{(k)}} (x - y_i)^2 f_X(x)\, dx$$
4. If $D^{(k)} - D^{(k-1)} < \epsilon$, stop; otherwise, continue.
5. $k = k+1$. Compute new reconstruction values
$$y_j^{(k)} = \frac{\displaystyle\int_{b_{j-1}^{(k-1)}}^{b_j^{(k-1)}} x\, f_X(x)\, dx}{\displaystyle\int_{b_{j-1}^{(k-1)}}^{b_j^{(k-1)}} f_X(x)\, dx}$$
Go to Step 2.
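A minimal numerical sketch of this iteration (our own Python illustration, not the book's implementation): the integrals are replaced by sums over a dense grid, and a zero-mean, unit-variance Gaussian pdf with M = 8 levels is used purely as an example.

```python
import numpy as np

def lloyd(pdf, M=8, x_range=(-5.0, 5.0), eps=1e-6, grid=100001):
    # Lloyd iteration for a known pdf, with the integrals approximated by sums.
    x = np.linspace(x_range[0], x_range[1], grid)
    dx = x[1] - x[0]
    fx = pdf(x)
    y = np.linspace(x_range[0] / 2, x_range[1] / 2, M)   # initial reconstruction values
    D_prev = np.inf
    while True:
        b = np.concatenate(([x_range[0]], (y[1:] + y[:-1]) / 2, [x_range[1]]))   # Step 2
        idx = np.clip(np.searchsorted(b, x, side='right') - 1, 0, M - 1)
        D = np.sum((x - y[idx]) ** 2 * fx) * dx                                  # Step 3
        if abs(D_prev - D) < eps:                                                # Step 4
            return y, D
        D_prev = D
        for j in range(M):                                                       # Step 5
            mask = idx == j
            if fx[mask].sum() > 0:
                y[j] = np.sum(x[mask] * fx[mask]) / np.sum(fx[mask])

gauss = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
y, D = lloyd(gauss)
print(np.round(y, 3), round(D, 4))
```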
Linde, Buzo, and Gray generalized this algorithm to the case where the inputs are no
longer scalars [125]. For the case where the distribution is known, the algorithm looks very
much like the Lloyd algorithm described above.
1. Start with an initial set of reconstruction values $\{Y_i^{(0)}\}_{i=1}^{M}$. Set $k = 0$, $D^{(0)} = 0$. Select
threshold $\epsilon$.
2. Find quantization regions
$$V_i^{(k)} = \{X : d(X, Y_i) < d(X, Y_j)\ \ \forall\, j \ne i\} \qquad i = 1, 2, \ldots, M$$
3. Compute the distortion
$$D^{(k)} = \sum_{i=1}^{M} \int_{V_i^{(k)}} \lVert X - Y_i^{(k)} \rVert^2 f_X(X)\, dX$$
4. If $\dfrac{D^{(k)} - D^{(k-1)}}{D^{(k)}} < \epsilon$, stop; otherwise, continue.
5. $k = k+1$. Find new reconstruction values $\{Y_i^{(k)}\}_{i=1}^{M}$ that are the centroids of $\{V_i^{(k-1)}\}$.
Go to Step 2.
This algorithm is not very practical because the integrals required to compute the distor-
tions and centroids are over odd-shaped regions in n dimensions, where n is the dimension
of the input vectors. Generally, these integrals are extremely difficult to compute, making
this particular algorithm more of an academic interest.
Of more practical interest is the algorithm for the case where we have a training set
available. In this case, the algorithm looks very much like the k-means algorithm.
1. Start with an initial set of reconstruction values $\{Y_i^{(0)}\}_{i=1}^{M}$ and a set of training vectors
$\{X_n\}_{n=1}^{N}$. Set $k = 0$, $D^{(0)} = 0$. Select threshold $\epsilon$.
2. The quantization regions $\{V_i^{(k)}\}_{i=1}^{M}$ are given by
$$V_i^{(k)} = \{X_n : d(X_n, Y_i) < d(X_n, Y_j)\ \ \forall\, j \ne i\} \qquad i = 1, 2, \ldots, M$$
We assume that none of the quantization regions are empty. (Later we will deal with
the case where $V_i^{(k)}$ is empty for some $i$ and $k$.)
3. Compute the average distortion $D^{(k)}$ between the training vectors and the representative
reconstruction value.
4. If $\dfrac{D^{(k)} - D^{(k-1)}}{D^{(k)}} < \epsilon$, stop; otherwise, continue.
5. $k = k+1$. Find new reconstruction values $\{Y_i^{(k)}\}_{i=1}^{M}$ that are the average value of the
elements of each of the quantization regions $V_i^{(k-1)}$. Go to Step 2.
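A minimal Python sketch of this training-set iteration (our own illustration, not the software accompanying the book): the distortion is measured as the average squared Euclidean distance between a training vector and its assigned code-vector, which is the measure used in Example 10.4.1 below.

```python
import numpy as np

def lbg(training, codebook, eps=0.001):
    # Training-set LBG (Steps 1-5 above): assign each training vector to its nearest
    # code-vector, then replace each code-vector by the centroid of its region.
    codebook = codebook.astype(float).copy()
    D_prev = None
    while True:
        d = np.sum((training[:, None, :] - codebook[None, :, :]) ** 2, axis=2)
        idx = np.argmin(d, axis=1)                      # Step 2: quantization regions
        D = np.mean(d[np.arange(len(training)), idx])   # Step 3: average distortion
        if D_prev is not None and (D_prev - D) / D < eps:
            return codebook, D                          # Step 4: stop
        D_prev = D
        for i in range(len(codebook)):                  # Step 5: new reconstruction values
            members = training[idx == i]
            if len(members) > 0:    # empty regions are discussed in Section 10.4.2
                codebook[i] = members.mean(axis=0)

# Height-weight training set and initial output points (Tables 10.1 and 10.2 below)
training = np.array([[72, 180], [65, 120], [59, 119], [64, 150], [65, 162], [57, 88],
                     [72, 175], [44, 41], [62, 114], [60, 110], [56, 91], [70, 172]], float)
initial = np.array([[45, 50], [75, 117], [45, 117], [80, 180]], float)
codebook, D = lbg(training, initial)
print(np.round(codebook), round(D, 2))   # should match Example 10.4.1: final distortion 60.17
```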
This algorithm forms the basis of most vector quantizer designs. It is popularly known as
the Linde-Buzo-Gray or LBG algorithm, or the generalized Lloyd algorithm (GLA) [125].
Although the paper of Linde, Buzo, and Gray [125] is a starting point for most of the work
on vector quantization, the latter algorithm had been used several years prior by Edward E.
Hilbert at the NASA Jet Propulsion Laboratories in Pasadena, California. Hilbert’s starting
point was the idea of clustering, and although he arrived at the same algorithm as described
above, he called it the cluster compression algorithm [126].
In order to see how this algorithm functions, consider the following example of a two-
dimensional vector quantizer codebook design.
E x a m p l e 10 . 4 . 1:
Suppose our training set consists of the height and weight values shown in Table 10.1. The
initial set of output points is shown in Table 10.2. (For ease of presentation, we will always
round the coordinates of the output points to the nearest integer.) The inputs, outputs, and
quantization regions are shown in Figure 10.7.
10.4 The Linde-Buzo-Gray Algorithm 285
T A B L E 10 . 1 Training set for designing
vector quantizer codebook.
Height Weight
72 180
65 120
59 119
64 150
65 162
57 88
72 175
44 41
62 114
60 110
56 91
70 172
T A B L E 10 . 2 Initial set of output points
for codebook design.
Height Weight
45 50
75 117
45 117
80 180
The input (44, 41) has been assigned to the first output point; the inputs (56, 91), (57,
88), (59, 119), and (60, 110) have been assigned to the second output point; the inputs
(62, 114), and (65, 120) have been assigned to the third output; and the five remaining
vectors from the training set have been assigned to the fourth output. The distortion for
this assignment is 387.25. We now find the new output points. There is only one vector in
the first quantization region, so the first output point is (44, 41). The average of the four
vectors in the second quantization region (rounded up) is the vector (58, 102), which is the
new second output point. In a similar manner, we can compute the third and fourth output
points as (64, 117) and (69, 168). The new output points and the corresponding quantization
regions are shown in Figure 10.8. From Figure 10.8, we can see that, while the training
vectors that were initially part of the first and fourth quantization regions are still in the same
quantization regions, the training vectors (59, 119) and (60, 110), which were in quantization
region 2, are now in quantization region 3. The distortion corresponding to this assignment
of training vectors to quantization regions is 89, considerably less than the original 387.25.
Given the new assignments, we can obtain a new set of output points. The first and fourth
output points do not change because the training vectors in the corresponding regions have
not changed. However, the training vectors in regions 2 and 3 have changed. Recomputing
the output points for these regions, we get (57, 90) and (62, 116). The final form of the
F I G U R E 10. 7 Initial state of the vector quantizer.
F I G U R E 10. 8 The vector quantizer after one iteration.
F I G U R E 10. 9 Final state of the vector quantizer.
quantizer is shown in Figure 10.9. The distortion corresponding to the final assignments
is 60.17. 
The LBG algorithm is conceptually simple, and as we shall see later, the resulting vector
quantizer is remarkably effective in the compression of a wide variety of inputs, both by
itself and in conjunction with other schemes. In the next two sections we will look at some
of the details of the codebook design process. While these details are important to consider
when designing codebooks, they are not necessary for the understanding of the quantization
process. If you are not currently interested in these details, you may wish to proceed directly
to Section 10.4.3.
10 . 4 . 1 I n i t i a l i z i n g t h e L B G A l g o r i t h m
The LBG algorithm guarantees that the distortion from one iteration to the next will not
increase. However, there is no guarantee that the procedure will converge to the optimal
solution. The solution to which the algorithm converges is heavily dependent on the initial
conditions. For example, if our initial set of output points in Example 10.4.1 had been those
T A B L E 10 . 3 An alternate initial
set of output points.
Height Weight
75 50
75 117
75 127
80 180
T A B L E 10 . 4 Final codebook obtained
using the alternative
initial codebook.
Height Weight
44 41
60 107
64 150
70 172
shown in Table 10.3 instead of the set in Table 10.2, by using the LBG algorithm we would
get the final codebook shown in Table 10.4.
The resulting quantization regions and their membership are shown in Figure 10.10.
This is a very different quantizer than the one we had previously obtained. Given this
heavy dependence on initial conditions, the selection of the initial codebook is a matter of
some importance. We will look at some of the better-known methods of initialization in the
following section.
Linde, Buzo, and Gray described a technique in their original paper [125] called the
splitting technique for initializing the design algorithm. In this technique, we begin by
designing a vector quantizer with a single output point; in other words, a codebook of size
one, or a one-level vector quantizer. With a one-element codebook, the quantization region
is the entire input space, and the output point is the average value of the entire training
set. From this output point, the initial codebook for a two-level vector quantizer can be
obtained by including the output point for the one-level quantizer and a second output point
obtained by adding a fixed perturbation vector $\epsilon$. We then use the LBG algorithm to obtain
the two-level vector quantizer. Once the algorithm has converged, the two codebook vectors
are used to obtain the initial codebook of a four-level vector quantizer. This initial four-level
codebook consists of the two codebook vectors from the final codebook of the two-level
vector quantizer and another two vectors obtained by adding $\epsilon$ to the two codebook vectors.
The LBG algorithm can then be used until this four-level quantizer converges. In this manner
we keep doubling the number of levels until we reach the desired number of levels. By
including the final codebook of the previous stage at each “splitting,” we guarantee that the
codebook after splitting will be at least as good as the codebook prior to splitting.
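The splitting procedure can be sketched on top of the lbg() routine from the previous section (again our own Python illustration; eps_vec plays the role of the perturbation vector ε, and the loop assumes the desired codebook size is a power of two).

```python
import numpy as np

def design_by_splitting(training, M, eps_vec, lbg_fn):
    # Start with the one-level codebook (the centroid of the training set), then
    # repeatedly perturb every code-vector and re-run LBG until M levels are reached.
    codebook = training.mean(axis=0, keepdims=True)
    while len(codebook) < M:
        # keep the previous code-vectors and add perturbed copies
        codebook = np.vstack([codebook, codebook + eps_vec])
        codebook, _ = lbg_fn(training, codebook)
    return codebook

# As in Example 10.4.2: a fixed perturbation of (10, 10) on the Table 10.1 training set.
# codebook = design_by_splitting(training, M=4, eps_vec=np.array([10.0, 10.0]), lbg_fn=lbg)
```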
F I G U R E 10. 10 Final state of the vector quantizer.
E x a m p l e 10 . 4 . 2:
Let’s revisit Example 10.4.1. This time, instead of using the initial codewords used in
Example 10.4.1, we will use the splitting technique. For the perturbations, we will use a
fixed vector $\epsilon = (10, 10)$. The perturbation vector is usually selected randomly; however,
for purposes of explanation it is more useful to use a fixed perturbation vector.
We begin with a single-level codebook. The codeword is simply the average value of
the training set. The progression of codebooks is shown in Table 10.5.
The perturbed vectors are used to initialize the LBG design of a two-level vector quan-
tizer. The resulting two-level vector quantizer is shown in Figure 10.11. The resulting
distortion is 468.58. These two vectors are perturbed to get the initial output points for
the four-level design. Using the LBG algorithm, the final quantizer obtained is shown in
Figure 10.12. The distortion is 156.17. The average distortion for the training set for this
quantizer using the splitting algorithm is higher than the average distortion obtained pre-
viously. However, because the sample size used in this example is rather small, this is no
indication of relative merit.
T A B L E 10 . 5 Progression of codebooks using
splitting.
Codebook Height Weight
One-level 62 127
Initial two-level 62 127
72 137
Final two-level 58 98
69 168
Initial four-level 58 98
68 108
69 168
79 178
Final four-level 52 73
62 116
65 156
71 176
F I G U R E 10. 11 Two-level vector quantizer using splitting approach.
F I G U R E 10. 12 Final design using the splitting approach.
If the desired number of levels is not a power of two, then in the last step, instead of
generating two initial points from each of the output points of the vector quantizer designed
previously, we can perturb as many vectors as necessary to obtain the desired number of
vectors. For example, if we needed an eleven-level vector quantizer, we would generate a
one-level vector quantizer first, then a two-level, then a four-level, and then an eight-level
vector quantizer. At this stage, we would perturb only three of the eight vectors to get the
eleven initial output points of the eleven-level vector quantizer. The three points should be
those with the largest number of training set vectors, or the largest distortion.
The approach used by Hilbert [126] to obtain the initial output points of the vector
quantizer was to pick the output points randomly from the training set. This approach
guarantees that, in the initial stages, there will always be at least one vector from the training
set in each quantization region. However, we can still get different codebooks if we use
different subsets of the training set as our initial codebook.
E x a m p l e 10 . 4 . 3:
Using the training set of Example 10.4.1, we selected different vectors of the training set
as the initial codebook. The results are summarized in Table 10.6. If we pick the codebook
labeled “Initial Codebook 1,” we obtain the codebook labeled “Final Codebook 1.” This
T A B L E 10 . 6 Effect of using different subsets of the
training sequence as the initial codebook.
Codebook Height Weight
Initial Codebook 1 72 180
72 175
65 120
59 119
Final Codebook 1 71 176
65 156
62 116
52 73
Initial Codebook 2 65 120
44 41
59 119
57 88
Final Codebook 2 69 168
44 41
62 116
57 90
codebook is identical to the one obtained using the split algorithm. The set labeled “Initial
Codebook 2” results in the codebook labeled “Final Codebook 2.” This codebook is identical
to the quantizer we obtained in Example 10.4.1. In fact, most of the other selections result
in one of these two quantizers. 
Notice that by picking different subsets of the input as our initial codebook, we can
generate different vector quantizers. A good approach to codebook design is to initialize the
codebook randomly several times, and pick the one that generates the least distortion in the
training set from the resulting quantizers.
In 1989, Equitz [127] introduced a method for generating the initial codebook called
the pairwise nearest neighbor (PNN) algorithm. In the PNN algorithm, we start with as
many clusters as there are training vectors and end with the initial codebook. At each stage,
we combine the two closest vectors into a single cluster and replace the two vectors by
their mean. The idea is to merge those clusters that would result in the smallest increase
in distortion. Equitz showed that when we combine two clusters $C_i$ and $C_j$, the increase in
distortion is
$$\frac{n_i n_j}{n_i + n_j}\,\lVert Y_i - Y_j \rVert^2 \qquad (10.5)$$
where $n_i$ is the number of elements in the cluster $C_i$, and $Y_i$ is the corresponding output
point. In the PNN algorithm, we combine clusters that cause the smallest increase in the
distortion.
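A brute-force Python sketch of the PNN initialization (our own illustration): each cluster is kept as a centroid and a count, and the pair with the smallest merge cost from Equation (10.5) is combined until the desired number of initial output points remains.

```python
import numpy as np

def pnn_initial_codebook(training, M):
    # Start with one cluster per training vector and greedily merge the pair whose
    # combination causes the smallest increase in distortion, Equation (10.5).
    centroids = [v.astype(float) for v in training]
    counts = [1] * len(training)
    while len(centroids) > M:
        best = None
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                cost = (counts[i] * counts[j] / (counts[i] + counts[j])
                        * np.sum((centroids[i] - centroids[j]) ** 2))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        n = counts[i] + counts[j]
        centroids[i] = (counts[i] * centroids[i] + counts[j] * centroids[j]) / n
        counts[i] = n
        del centroids[j], counts[j]
    return np.array(centroids)

# Example: reduce the Table 10.1 training set to four initial output points,
# then hand the result to the LBG routine.
# initial = pnn_initial_codebook(training, M=4)
```

This naive search is quadratic in the number of remaining clusters at every step, which is exactly the cost the fast PNN variant mentioned below tries to avoid.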
F I G U R E 10. 13 Obtaining initial output points using the PNN approach.
E x a m p l e 10 . 4 . 4:
Using the PNN algorithm, we combine the elements in the training set as shown in
Figure 10.13. At each step we combine the two clusters that are closest in the sense of
Equation (10.5). If we use these values to initialize the LBG algorithm, we get a vector
quantizer with output points (70, 172), (60, 107), (44, 41), (64, 150), and a distortion
of 104.08. 
Although it was a relatively easy task to generate the initial codebook using the PNN
algorithm in Example 10.4.4, we can see that, as the size of the training set increases, this
procedure becomes progressively more time-consuming. In order to avoid this cost, we can
use a fast PNN algorithm that does not attempt to find the absolute smallest cost at each
step (see [127] for details).
Finally, a simple initial codebook is the set of output points from the corresponding scalar
quantizers. In the beginning of this chapter we saw how scalar quantization of a sequence
of inputs can be viewed as vector quantization using a rectangular vector quantizer. We can
use this rectangular vector quantizer as the initial set of outputs.
E x a m p l e 10 . 4 . 5:
Return once again to the quantization of the height-weight data set. If we assume that the
weights are uniformly distributed between 40 and 180, then a two-level scalar quantizer would
have reconstruction values 75 and 145. Similarly, if we assume that the heights are uniformly
distributed between 40 and 80, the reconstruction values would be 50 and 70. The initial
reconstruction values for the vector quantizer are (50, 75), (50, 145), (70, 75), and (70, 145).
The final design for this initial set is the same as the one obtained in Example 10.4.1 with
a distortion of 60.17.
We have looked at four different ways of initializing the LBG algorithm. Each has its own
advantages and drawbacks. The PNN initialization has been shown to result in better designs,
producing a lower distortion for a given rate than the splitting approach [127]. However,
the procedure for obtaining the initial codebook is much more involved and complex. We
cannot make any general claims regarding the superiority of any one of these initialization
techniques. Even the PNN approach cannot be proven to be optimal. In practice, if we are
dealing with a wide variety of inputs, the effect of using different initialization techniques
appears to be insignificant.
10.4.2 The Empty Cell Problem
Let’s take a closer look at the progression of the design in Example 10.4.5. When we assign
the inputs to the initial output points, no input point gets assigned to the output point at
(70, 75). This is a problem because in order to update an output point, we need to take the
average value of the input vectors. Obviously, some strategy is needed. The strategy that we
actually used in Example 10.4.5 was not to update the output point if there were no inputs
in the quantization region associated with it. This strategy seems to have worked in this
particular example; however, there is a danger that we will end up with an output point that
is never used. A common approach to avoid this is to remove an output point that has no
inputs associated with it, and replace it with a point from the quantization region with the
most output points. This can be done by selecting a point at random from the region with the
highest population of training vectors, or the highest associated distortion. A more systematic
approach is to design a two-level quantizer for the training vectors in the most heavily
populated quantization region. This approach is computationally expensive and provides no
significant improvement over the simpler approach. In the program accompanying this book,
we have used the first approach. (To compare the two approaches, see Problem 3.)
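A minimal sketch of the first strategy, under the assumption that the assignments of training vectors to output points are available from the previous LBG iteration; the function name and the random choice of a replacement vector are our own.

    import numpy as np

    def fix_empty_cells(codebook, training, assignments):
        # assignments[k] is the index of the output point closest to training[k].
        counts = np.bincount(assignments, minlength=len(codebook))
        rng = np.random.default_rng()
        for empty in np.where(counts == 0)[0]:
            crowded = int(np.argmax(counts))      # most heavily populated region
            pick = int(rng.choice(np.where(assignments == crowded)[0]))
            codebook[empty] = training[pick]      # reuse one of its training vectors
            assignments[pick] = empty
            counts[crowded] -= 1
            counts[empty] += 1
        return codebook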
10.4.3 Use of LBG for Image Compression
One application for which the vector quantizer described in this section has been extremely
popular is image compression. For image compression, the vector is formed by taking blocks
of pixels of size N ×M and treating them as an L = NM dimensional vector. Generally, we
take N = M. Instead of forming vectors in this manner, we could form the vector by taking
L pixels in a row of the image. However, this does not allow us to take advantage of the two-
dimensional correlations in the image. Recall that correlation between the samples provides
the clustering of the input, and the LBG algorithm takes advantage of this clustering.
Example 10.4.6:
Let us quantize the Sinan image shown in Figure 10.14 using a 16-dimensional quantizer.
The input vectors are constructed using 4 × 4 blocks of pixels. The codebook was trained
on the Sinan image.
The results of the quantization using codebooks of size 16, 64, 256, and 1024 are shown
in Figure 10.15. The rates and compression ratios are summarized in Table 10.7. To see how
these quantities were calculated, recall that if we have K vectors in a codebook, we need
FIGURE 10.14  Original Sinan image.
log2 K bits to inform the receiver which of the K vectors is the quantizer output. This
quantity is listed in the second column of Table 10.7 for the different values of K. If the
vectors are of dimension L, this means that we have used log2 K bits to send the quantized
value of L pixels. Therefore, the rate in bits per pixel is (log2 K)/L. (We have assumed that the
codebook is available to both transmitter and receiver, and therefore we do not have to use
any bits to transmit the codebook from the transmitter to the receiver.) This quantity is listed
in the third column of Table 10.7. Finally, the compression ratio, given in the last column of
Table 10.7, is the ratio of the number of bits per pixel in the original image to the number
of bits per pixel in the compressed image. The Sinan image was digitized using 8 bits per
pixel. Using this information and the rate after compression, we can obtain the compression
ratios.
Looking at the images, we see that reconstruction using a codebook of size 1024 is
very close to the original. At the other end, the image obtained using a codebook with 16
reconstruction vectors contains a lot of visible artifacts. The utility of each reconstruction
depends on the demands of the particular application. 
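As a quick check of the numbers in Table 10.7, the following sketch recomputes the rate and compression ratio for each codebook size, assuming 16-dimensional (4 × 4) vectors and an original depth of 8 bits per pixel.

    import math

    L = 16                 # vector dimension (4 x 4 blocks of pixels)
    original_bpp = 8       # the Sinan image is digitized at 8 bits per pixel

    for K in (16, 64, 256, 1024):
        bits_per_vector = math.log2(K)       # bits needed to select a codeword
        rate = bits_per_vector / L           # bits per pixel
        ratio = original_bpp / rate          # compression ratio
        print(f"K={K:5d}  {bits_per_vector:3.0f} bits  {rate:.3f} bpp  {ratio:.2f}:1")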
In this example, we used codebooks trained on the image itself. Generally, this is not
the preferred approach because the receiver has to have the same codebook in order to
reconstruct the image. Either the codebook must be transmitted along with the image, or
the receiver must have the same training image so that it can generate an identical codebook.
This is impractical because, if the receiver already has the image in question, much better
compression can be obtained by simply sending the name of the image to the receiver.
Sending the codebook with the image is not unreasonable. However, the transmission of
FIGURE 10.15  Top left: codebook size 16; top right: codebook size 64; bottom left: codebook size 256; bottom right: codebook size 1024.
TABLE 10.7  Summary of compression measures for image compression example.

Codebook Size       Bits Needed to
(# of codewords)    Select a Codeword    Bits per Pixel    Compression Ratio
16                   4                    0.25              32:1
64                   6                    0.375             21.33:1
256                  8                    0.50              16:1
1024                10                    0.625             12.8:1
TABLE 10.8  Overhead in bits per pixel for codebooks of different sizes.

Codebook Size K    Overhead in Bits per Pixel
16                 0.03125
64                 0.125
256                0.5
1024               2.0
the codebook is overhead that could be avoided if a more generic codebook, one that is
available to both transmitter and receiver, were to be used.
In order to compute the overhead, we need to calculate the number of bits required
to transmit the codebook to the receiver. If each codeword in the codebook is a vector
with L elements and if we use B bits to represent each element, then in order to transmit
the codebook of a K-level quantizer we need B × L × K bits. In our example, B = 8 and
L = 16. Therefore, we need K ×128 bits to transmit the codebook. As our image consists of
256×256 pixels, the overhead in bits per pixel is 128K/65536. The overhead for different
values of K is summarized in Table 10.8. We can see that while the overhead for a codebook
of size 16 seems reasonable, the overhead for a codebook of size 1024 is over three times
the rate required for quantization.
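The entries of Table 10.8 follow directly from the B × L × K bits needed for the codebook spread over the 256 × 256 pixels of the image; a small sketch of the calculation:

    B, L = 8, 16            # bits per component, vector dimension
    pixels = 256 * 256      # number of pixels in the Sinan image

    for K in (16, 64, 256, 1024):
        overhead_bits = B * L * K           # bits needed to transmit the codebook
        print(f"K={K:5d}  overhead = {overhead_bits / pixels:.5f} bits per pixel")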
Given the excessive amount of overhead required for sending the codebook along with
the vector quantized image, there has been substantial interest in the design of codebooks
that are more generic in nature and, therefore, can be used to quantize a number of images.
To investigate the issues that might arise, we quantized the Sinan image using four different
codebooks generated by the Sena, Sensin, Earth, and Omaha images. The results are shown
in Figure 10.16.
As expected, the reconstructed images from this approach are not of the same quality as
when the codebook is generated from the image to be quantized. However, this is only true
as long as the overhead required for storage or transmission of the codebook is ignored. If we
include the extra rate required to encode and transmit the codebook of output points, using
the codebook generated by the image to be quantized seems unrealistic. Although using
the codebook generated by another image to perform the quantization may be realistic, the
quality of the reconstructions is quite poor. Later in this chapter we will take a closer look
at the subject of vector quantization of images and consider a variety of ways to improve
this performance.
You may have noticed that the bit rates for the vector quantizers used in the examples are
quite low. The reason is that the size of the codebook increases exponentially with the rate.
Suppose we want to encode a source using R bits per sample; that is, the average number of
bits per sample in the compressed source output is R. By “sample” we mean a scalar element
of the source output sequence. If we wanted to use an L-dimensional quantizer, we would
group L samples together into vectors. This means that we would have RL bits available
to represent each vector. With RL bits, we can represent 2^(RL) different output vectors. In
other words, the size of the codebook for an L-dimensional, R-bits-per-sample quantizer is
2^(RL). From Table 10.7, we can see that when we quantize an image using 0.25 bits per
pixel and 16-dimensional quantizers, we have 16 × 0.25 = 4 bits available to represent each
FIGURE 10.16  Sinan image quantized at the rate of 0.5 bits per pixel. The images used to obtain the codebook were (clockwise from top left) Sensin, Sena, Earth, Omaha.
vector. Hence, the size of the codebook is 2^4 = 16. The quantity RL is often called the rate
dimension product. Note that the size of the codebook grows exponentially with this product.
Consider the problems. The codebook size for a 16-dimensional, 2-bits-per-sample vector
quantizer would be 2^(16×2)! (If the source output was originally represented using 8 bits per
sample, a rate of 2 bits per sample for the compressed source corresponds to a compression
ratio of 4:1.) This large size causes problems both with storage and with the quantization
process. To store 2^32 sixteen-dimensional vectors, assuming that we can store each component
of the vector in a single byte, requires 2^32 × 16 bytes—approximately 64 gigabytes of
storage. Furthermore, to quantize a single input vector would require over four billion vector
comparisons to find the closest output point. Obviously, neither the storage requirements
nor the computational requirements are realistic. Because of this problem, most vector
quantization applications operate at low bit rates. In many applications, such as low-rate
speech coding, we want to operate at very low rates; therefore, this is not a drawback.
However, for applications such as high-quality video coding, which requires higher rates,
this is definitely a problem.
There are several approaches to solving these problems. Each entails the introduction
of some structure in the codebook and/or the quantization process. While the introduction
of structure mitigates some of the storage and computational problems, there is generally a
trade-off in terms of the distortion performance. We will look at some of these approaches
in the following sections.
10.5 Tree-Structured Vector Quantizers
One way we can introduce structure is to organize our codebook in such a way that it is easy
to pick which part contains the desired output vector. Consider the two-dimensional vector
quantizer shown in Figure 10.17. Note that the output points in each quadrant are the mirror
image of the output points in neighboring quadrants. Given an input to this vector quantizer,
we can reduce the number of comparisons necessary for finding the closest output point by
using the sign on the components of the input. The sign on the components of the input
vector will tell us in which quadrant the input lies. Because all the quadrants are mirror
images of the neighboring quadrants, the closest output point to a given input will lie in the
same quadrant as the input itself. Therefore, we only need to compare the input to the output
points that lie in the same quadrant, thus reducing the number of required comparisons by
a factor of four. This approach can be extended to L dimensions, where the signs on the L
components of the input vector can tell us in which of the 2^L hyperquadrants the input lies, which in turn would reduce the number of comparisons by 2^L.
This approach works well when the output points are distributed in a symmetrical manner.
However, it breaks down as the distribution of the output points becomes less symmetrical.
FIGURE 10.17  A symmetrical vector quantizer in two dimensions.

Example 10.5.1:
Consider the vector quantizer shown in Figure 10.18. This is different from the output points
in Figure 10.17; we have dropped the mirror image requirement of the previous example.
The output points are shown as filled circles, and the input point is the X. It is obvious
from the figure that while the input is in the first quadrant, the closest output point is in the
fourth quadrant. However, the quantization approach described above will force the input to
be represented by an output in the first quadrant.
FIGURE 10.18  Breakdown of the method using the quadrant approach.
The situation gets worse as we lose more and more of the symmetry. Consider the
situation in Figure 10.19. In this quantizer, not only will we get an incorrect output point
when the input is close to the boundaries of the first quadrant, but also there is no significant
reduction in the amount of computation required.
FIGURE 10.19  Breakdown of the method using the quadrant approach.
Most of the output points are in the first quadrant. Therefore, whenever the input falls
in the first quadrant, which it will do quite often if the quantizer design is reflective of the
distribution of the input, knowing that it is in the first quadrant does not lead to a great
reduction in the number of comparisons.
FIGURE 10.20  Division of output points into two groups.
The idea of using the L-dimensional equivalents of quadrants to partition the output points
in order to reduce the computational load can be extended to nonsymmetrical situations, like
those shown in Figure 10.19, in the following manner. Divide the set of output points into
two groups, group0 and group1, and assign to each group a test vector such that output
points in each group are closer to the test vector assigned to that group than to the test vector
assigned to the other group (Figure 10.20). Label the two test vectors 0 and 1. When we
get an input vector, we compare it against the test vectors. Depending on the outcome, the
input is compared to the output points associated with the test vector closest to the input.
After these two comparisons, we can discard half of the output points. Comparison with the
test vectors takes the place of looking at the signs of the components to decide which set of
output points to discard from contention. If the total number of output points is K, with this
approach we have to make K/2 + 2 comparisons instead of K comparisons.
This process can be continued by splitting the output points in each group into two
groups and assigning a test vector to the subgroups. So group0 would be split into group00
and group01, with associated test vectors labeled 00 and 01, and group1 would be split into
group10 and group11, with associated test vectors labeled 10 and 11. Suppose the result of
the first set of comparisons was that the output point would be searched for in group1. The
input would be compared to the test vectors 10 and 11. If the input was closer to the test
vector 10, then the output points in group11 would be discarded, and the input would be
compared to the output points in group10. We can continue the procedure by successively
dividing each group of output points into two, until finally, if the number of output points
is a power of two, the last set of groups would consist of single points. The number of
comparisons required to obtain the final output point would be 2 log2 K instead of K. Thus,
for a codebook of size 4096 we would need 24 vector comparisons instead of 4096 vector
comparisons.
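A minimal sketch of this tree-structured search, assuming the test vectors and output points are stored in a single dictionary keyed by binary label strings (the data layout and function name are our own):

    import numpy as np

    def tsvq_encode(x, nodes, depth):
        # nodes maps a binary label to a vector: test vectors for partial labels,
        # codebook output points for labels of length 'depth'.
        label = ""
        while len(label) < depth:
            d0 = np.sum((x - nodes[label + "0"]) ** 2)
            d1 = np.sum((x - nodes[label + "1"]) ** 2)
            label += "0" if d0 <= d1 else "1"   # keep the half closer to the input
        return label, nodes[label]              # binary codeword and output point

For a codebook of K = 2^depth output points, this makes 2 log2 K vector comparisons, and the returned label is the binary codeword for the selected output point.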
This is a remarkable decrease in computational complexity. However, we pay for this
decrease in two ways. The first penalty is a possible increase in distortion. It is possible at
some stage that the input is closer to one test vector while at the same time being closest to an
output belonging to the rejected group. This is similar to the situation shown in Figure 10.18.
The other penalty is an increase in storage requirements. Now we not only have to store the
output points from the vector quantizer codebook, we also must store the test vectors. This
means almost a doubling of the storage requirement.
FIGURE 10.21  Decision tree for quantization.
The comparisons that must be made at each step are shown in Figure 10.21. The label
inside each node is the label of the test vector that we compare the input against. This tree
of decisions is what gives tree-structured vector quantizers (TSVQ) their name. Notice also
that, as we are progressing down a tree, we are also building a binary string. As the leaves
of the tree are the output points, by the time we reach a particular leaf or, in other words,
select a particular output point, we have obtained the binary codeword corresponding to that
output point.
This process of building the binary codeword as we progress through the series of
decisions required to find the final output can result in some other interesting properties of
tree-structured vector quantizers. For instance, even if a partial codeword is transmitted, we
can still get an approximation of the input vector. In Figure 10.21, if the quantized value
was the codebook vector 5, the binary codeword would be 011. However, if only the first
two bits 01 were received by the decoder, the input can be approximated by the test vector
labeled 01.
10.5.1 Design of Tree-Structured Vector Quantizers
In the last section we saw how we could reduce the computational complexity of the design
process by imposing a tree structure on the vector quantizer. Rather than imposing this
structure after the vector quantizer has been designed, it makes sense to design the vector
quantizer within the framework of the tree structure. We can do this by a slight modification
of the splitting design approach proposed by Linde et al. [125].
We start the design process in a manner identical to the splitting technique. First, obtain
the average of all the training vectors, perturb it to obtain a second vector, and use these
vectors to form a two-level vector quantizer. Let us label these two vectors 0 and 1, and the
groups of training set vectors that would be quantized to each of these two vectors group0
and group1. We will later use these vectors as test vectors. We perturb these output points
to get the initial vectors for a four-level vector quantizer. At this point, the design procedure
for the tree-structured vector quantizer deviates from the splitting technique. Instead of
using the entire training set to design a four-level vector quantizer, we use the training
set vectors in group0 to design a two-level vector quantizer with output points labeled 00
and 01. We use the training set vectors in group1 to design a two-level vector quantizer
with output points labeled 10 and 11. We also split the training set vectors in group0 and
group1 into two groups each. The vectors in group0 are split, based on their proximity to
the vectors labeled 00 and 01, into group00 and group01, and the vectors in group1 are
divided in a like manner into the groups group10 and group11. The vectors labeled 00, 01,
10, and 11 will act as test vectors at this level. To get an eight-level quantizer, we use the
training set vectors in each of the four groups to obtain four two-level vector quantizers. We
continue in this manner until we have the required number of output points. Notice that in
the process of obtaining the output points, we have also obtained the test vectors required
for the quantization process.
10.5.2 Pruned Tree-Structured Vector Quantizers
Once we have built a tree-structured codebook, we can sometimes improve its rate distortion
performance by removing carefully selected subgroups. Removal of a subgroup, referred to
as pruning, will reduce the size of the codebook and hence the rate. It may also result in an
increase in distortion. Therefore, the objective of the pruning is to remove those subgroups
that will result in the best trade-off of rate and distortion. Chou, Lookabaugh, and Gray [128]
have developed an optimal pruning algorithm called the generalized BFOS algorithm. The
name of the algorithm derives from the fact that it is an extension of an algorithm originally
developed by Breiman, Friedman, Olshen, and Stone [129] for classification applications.
(See [128] and [5] for description and discussion of the algorithm.)
Pruning output points from the codebook has the unfortunate effect of removing the
structure that was previously used to generate the binary codeword corresponding to the
output points. If we used the structure to generate the binary codewords, the pruning would
cause the codewords to be of variable length. As the variable-length codes would correspond
to the leaves of a binary tree, this code would be a prefix code and, therefore, certainly
usable. However, it would not require a large increase in complexity to assign fixed-length
codewords to the output points using another method. This increase in complexity is generally
offset by the improvement in performance that results from the pruning [130].
10.6 Structured Vector Quantizers

The tree-structured vector quantizer solves the complexity problem, but exacerbates the storage
problem. We now take an entirely different tack and develop vector quantizers that do not
have these storage problems; however, we pay for this relief in other ways.
Example 10.3.1 was our motivation for the quantizer obtained by the LBG algorithm.
This example showed that the correlation between samples of the output of a source leads
to clustering. This clustering is exploited by the LBG algorithm by placing output points at
the location of these clusters. However, in Example 10.3.2, we saw that even when there
is no correlation between samples, there is a kind of probabilistic structure that becomes
more evident as we group the random inputs of a source into larger and larger blocks or
vectors.
In Example 10.3.2, we changed the position of the output point in the top-right corner.
All four corner points have the same probability, so we could have chosen any of these
points. In the case of the two-dimensional Laplacian distribution in Example 10.3.2, all
points that lie on the contour described by |x_1| + |x_2| = constant have equal probability. These
are called contours of constant probability. For spherically symmetrical distributions like
the Gaussian distribution, the contours of constant probability are circles in two dimensions,
spheres in three dimensions, and hyperspheres in higher dimensions.
We mentioned in Example 10.3.2 that the points away from the origin have very little
probability mass associated with them. Based on what we have said about the contours of
constant probability, we can be a little more specific and say that the points on constant
probability contours farther away from the origin have very little probability mass associated
with them. Therefore, we can get rid of all of the points outside some contour of constant
probability without incurring much of a distortion penalty. In addition as the number of
reconstruction points is reduced, there is a decrease in rate, thus improving the rate distortion
performance.
Example 10.6.1:
Let us design a two-dimensional uniform quantizer by keeping only the output points in the
quantizer of Example 10.3.2 that lie on or within the contour of constant probability given
by |x_1| + |x_2| = 5Δ. If we count all the points that are retained, we get 60 points. This is
close enough to 64 that we can compare it with the eight-level uniform scalar quantizer. If
we simulate this quantization scheme with a Laplacian input, and the same step size as the
scalar quantizer, that is, Δ = 0.7309, we get an SNR of 12.22 dB. Comparing this to the
11.44 dB obtained with the scalar quantizer, we see that there is a definite improvement. We
can get slightly more improvement in performance if we modify the step size. 
Notice that the improvement in the previous example is obtained only by restricting the
outer boundary of the quantizer. Unlike Example 10.3.2, we did not change the shape of
any of the inner quantization regions. This gain is referred to in the quantization literature
as boundary gain. In terms of the description of quantization noise in Chapter 8, we
reduced the overload error by reducing the overload probability, without a commensurate
increase in the granular noise. In Figure 10.22, we have marked the 12 output points that
belonged to the original 64-level quantizer, but do not belong to the 60-level quantizer, by
drawing circles around them. Removal of these points results in an increase in overload
probability. We also marked the eight output points that belong to the 60-level quantizer,
but were not part of the original 64-level quantizer, by drawing squares around them.
Adding these points results in a decrease in the overload probability. If we calculate the
increases and decreases (Problem 5), we find that the net result is a decrease in overload
probability. This overload probability is further reduced as the dimension of the vector is
increased.
FIGURE 10.22  Contours of constant probability.

10.6.1 Pyramid Vector Quantization
As the dimension of the input vector increases, something interesting happens. Suppose we
are quantizing a random variable X with pdf f_X(x) and differential entropy h(X). Suppose
we block samples of this random variable into a vector X. A result of Shannon’s, called the
asymptotic equipartition property (AEP), states that for sufficiently large L and arbitrarily
small ε,

$$\left|\frac{\log f_X(\mathbf{X})}{L} + h(X)\right| \le \epsilon \qquad (10.6)$$
for all but a set of vectors with a vanishingly small probability [7]. This means that almost
all the L-dimensional vectors will lie on a contour of constant probability given by

$$\frac{\log f_X(\mathbf{X})}{L} = -h(X) \qquad (10.7)$$
Given that this is the case, Sakrison [131] suggested that an optimum manner to encode
the source would be to distribute 2^(RL)
points uniformly in this region. Fischer [132] used this
insight to design a vector quantizer called the pyramid vector quantizer for the Laplacian
source that looks quite similar to the quantizer described in Example 10.6.1. The vector
quantizer consists of points of the rectangular quantizer that fall on the hyperpyramid given by

$$\sum_{i=1}^{L} |x_i| = C$$
where C is a constant depending on the variance of the input. Shannon’s result is asymp-
totic, and for realistic values of L, the input vector is generally not localized to a single
hyperpyramid.
For this case, Fischer first finds the distance

$$r = \sum_{i=1}^{L} |x_i|$$
This value is quantized and transmitted to the receiver. The input is normalized by this gain
term and quantized using a single hyperpyramid. The quantization process for the shape term
consists of two stages: finding the output point on the hyperpyramid closest to the scaled
input, and finding a binary codeword for this output point. (See [132] for details about the
quantization and coding process.) This approach is quite successful, and for a rate of 3 bits
per sample and a vector dimension of 16, we get an SNR value of 16.32 dB. If we increase
the vector dimension to 64, we get an SNR value of 17.03. Compared to the SNR obtained
from using a nonuniform scalar quantizer, this is an improvement of more than 4 dB.
Notice that in this approach we separated the input vector into a gain term and a pattern
or shape term. Quantizers of this form are called gain-shape vector quantizers, or product
code vector quantizers [133].
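A minimal sketch of the gain-shape split used by the pyramid vector quantizer; quantize_gain and quantize_on_pyramid are hypothetical routines standing in for the scalar gain quantizer and for the pyramid search and coding described in [132].

    import numpy as np

    def pyramid_vq_encode(x, quantize_gain, quantize_on_pyramid):
        # Gain term: r is the sum of the absolute values of the components.
        r = float(np.sum(np.abs(x)))
        r_hat = max(quantize_gain(r), 1e-12)          # quantized gain (guard against zero)
        shape = np.asarray(x, dtype=float) / r_hat    # normalize by the gain term
        shape_index = quantize_on_pyramid(shape)      # closest point on the hyperpyramid
        return r_hat, shape_index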
10.6.2 Polar and Spherical Vector Quantizers
For the Gaussian distribution, the contours of constant probability are circles in two dimen-
sions and spheres and hyperspheres in three and higher dimensions. In two dimensions, we
can quantize the input vector by first transforming it into polar coordinates r and θ:

$$r = \sqrt{x_1^2 + x_2^2} \qquad (10.8)$$

and

$$\theta = \tan^{-1}\frac{x_2}{x_1} \qquad (10.9)$$
r and θ can then be either quantized independently [134], or we can use the quantized value
of r as an index to a quantizer for θ [135]. The former is known as a polar quantizer; the
latter, an unrestricted polar quantizer. The advantage to quantizing r and θ independently is
one of simplicity. The quantizers for r and θ are independent scalar quantizers. However, the
performance of the polar quantizers is not significantly higher than that of scalar quantization
of the components of the two-dimensional vector. The unrestricted polar quantizer has a
more complex implementation, as the quantization of θ depends on the quantization of
r. However, the performance is also somewhat better than the polar quantizer. The polar
quantizer can be extended to three or more dimensions [136].
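A minimal sketch of the restricted (independent) polar quantizer in two dimensions, assuming scalar quantizer callables quantize_r and quantize_theta are supplied by the user:

    import math

    def polar_quantize(x1, x2, quantize_r, quantize_theta):
        # Transform to polar coordinates (Equations 10.8 and 10.9);
        # atan2 is used so that the angle lands in the correct quadrant.
        r = math.hypot(x1, x2)
        theta = math.atan2(x2, x1)
        # Quantize r and theta with independent scalar quantizers.
        r_hat, theta_hat = quantize_r(r), quantize_theta(theta)
        # Reconstruction in Cartesian coordinates.
        return r_hat * math.cos(theta_hat), r_hat * math.sin(theta_hat)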
10.6.3 Lattice Vector Quantizers
Recall that quantization error is composed of two kinds of error, overload error and granular
error. The overload error is determined by the location of the quantization regions furthest
from the origin, or the boundary. We have seen how we can design vector quantizers to
reduce the overload probability and thus the overload error. We called this the boundary
gain of vector quantization. In scalar quantization, the granular error was determined by the
size of the quantization interval. In vector quantization, the granular error is affected by the
size and shape of the quantization interval.
Consider the square and circular quantization regions shown in Figure 10.23. We show
only the quantization region at the origin. These quantization regions need to be distributed
in a regular manner over the space of source outputs. However, for now, let us simply
consider the quantization region at the origin. Let’s assume they both have the same area
so that we can compare them. This way it would require the same number of quantization
regions to cover a given area. That is, we will be comparing two quantization regions of the
same “size.” To have an area of one, the square has to have sides of length one. As the area
of a circle is given by πr², the radius of the circle is 1/√π. The maximum quantization error
possible with the square quantization region is when the input is at one of the four corners of
the square. In this case, the error is 1/√2, or about 0.707. For the circular quantization region,
the maximum error occurs when the input falls on the boundary of the circle. In this case,
the error is 1/√π, or about 0.56. Thus, the maximum granular error is larger for the square
region than the circular region.
In general, we are more concerned with the average squared error than the maximum
error. If we compute the average squared error for the square region, we obtain

$$\int_{Square} \lVert \mathbf{X} \rVert^2 \, d\mathbf{X} = 0.1\overline{6}$$

FIGURE 10.23  Possible quantization regions.
For the circle, we obtain

$$\int_{Circle} \lVert \mathbf{X} \rVert^2 \, d\mathbf{X} = 0.159$$
Thus, the circular region would introduce less granular error than the square region.
Our choice seems to be clear; we will use the circle as the quantization region. Unfor-
tunately, a basic requirement for the quantizer is that for every possible input vector there
should be a unique output vector. In order to satisfy this requirement and have a quantizer
with sufficient structure that can be used to reduce the storage space, a union of translates
of the quantization region should cover the output space of the source. In other words, the
quantization region should tile space. A two-dimensional region can be tiled by squares, but
it cannot be tiled by circles. If we tried to tile the space with circles, we would either get
overlaps or holes.
Apart from squares, other shapes that tile space include rectangles and hexagons. It
turns out that the best shape to pick for a quantization region in two dimensions is a
hexagon [137].
In two dimensions, it is relatively easy to find the shapes that tile space, then select the
one that gives the smallest amount of granular error. However, when we start looking at
higher dimensions, it is difficult, if not impossible, to visualize different shapes, let alone find
which ones tile space. An easy way out of this dilemma is to remember that a quantizer can
be completely defined by its output points. In order for this quantizer to possess structure,
these points should be spaced in some regular manner.
Regular arrangements of output points in space are called lattices. Mathematically, we
can define a lattice as follows:
Let a_1, a_2, ..., a_L be L independent L-dimensional vectors. Then the set

$$\mathcal{L} = \left\{\mathbf{x} : \mathbf{x} = \sum_{i=1}^{L} u_i \mathbf{a}_i \right\} \qquad (10.10)$$

is a lattice if the u_i are all integers.
When a subset of lattice points is used as the output points of a vector quantizer, the
quantizer is known as a lattice vector quantizer. From this definition, the pyramid vector
quantizer described earlier can be viewed as a lattice vector quantizer. Basing a quantizer
on a lattice solves the storage problem. As any lattice point can be regenerated if we know
the basis set, there is no need to store the output points. Further, the highly structured nature
of lattices makes finding the closest output point to an input relatively simple. Note that
what we give up when we use lattice vector quantizers is the clustering property of LBG
quantizers.
Let’s take a look at a few examples of lattices in two dimensions. If we pick a_1 = (1, 0)
and a_2 = (0, 1), we obtain the integer lattice—the lattice that contains all points in two
dimensions whose coordinates are integers.
FIGURE 10.24  The D2 lattice.
If we pick a_1 = (1, 1) and a_2 = (1, −1), we get the lattice shown in Figure 10.24. This
lattice has a rather interesting property. Any point in the lattice is given by n a_1 + m a_2, where
n and m are integers. But

$$n a_1 + m a_2 = \begin{pmatrix} n+m \\ n-m \end{pmatrix}$$

and the sum of the coordinates is n + m + n − m = 2n, which is even for all n. Therefore, all
points in this lattice have an even coordinate sum. Lattices with these properties are called
D lattices.
Finally, if a_1 = (1, 0) and a_2 = (−1/2, √3/2), we get the hexagonal lattice shown in
Figure 10.25. This is an example of an A lattice.
There are a large number of lattices that can be used to obtain lattice vector quantizers.
In fact, given a dimension L, there are an infinite number of possible sets of L independent
vectors. Among these, we would like to pick the lattice that produces the greatest reduction
in granular noise. When comparing the square and circle as candidates for quantization
regions, we used the integral over the shape of ||X||^2. This is simply the second moment of
the shape. The shape with the smallest second moment for a given volume is known to be
the circle in two dimensions and the sphere and hypersphere in higher dimensions [138].
Unfortunately, circles and spheres cannot tile space; either there will be overlap or there will
FIGURE 10.25  The A2 lattice.
be holes. As the ideal case is unattainable, we can try to approximate it. We can look for
ways of arranging spheres so that they cover space with minimal overlap [139], or look for
ways of packing spheres with the least amount of space left over [138]. The centers of these
spheres can then be used as the output points. The quantization regions will not be spheres,
but they may be close approximations to spheres.
The problems of sphere covering and sphere packing are widely studied in a number
of different areas. Lattices discovered in these studies have also been useful as vector
quantizers [138]. Some of these lattices, such as the A2 and D2 lattices described earlier, are
based on the root systems of Lie algebras [140]. The study of Lie algebras is beyond the
scope of this book; however, we have included a brief discussion of the root systems and
how to obtain the corresponding lattices in Appendix C.
One of the nice things about root lattices is that we can use their structural properties
to obtain fast quantization algorithms. For example, consider building a quantizer based on
the D2 lattice. Because of the way in which we described the D2 lattice, the size of the
lattice is fixed. We can change the size by picking the basis vectors as (Δ, Δ) and (Δ, −Δ),
instead of (1, 1) and (1, −1). We can have exactly the same effect by dividing each input
by Δ before quantization, and then multiplying the reconstruction values by Δ. Suppose we
pick the latter approach and divide the components of the input vector by Δ. If we wanted
to find the closest lattice point to the input, all we need to do is find the closest integer to
each coordinate of the scaled input. If the sum of these integers is even, we have a lattice
point. If not, find the coordinate that incurred the largest distortion during conversion to an
integer and then find the next closest integer. The sum of coordinates of this new vector
differs from the sum of coordinates of the previous vector by one. Therefore, if the sum of
coordinates of the previous vector was odd, the sum of the coordinates of the current vector
will be even, and we have the closest lattice point to the input.
Example 10.6.2:
Suppose the input vector is given by (2.3, 1.9). Rounding each coefficient to the nearest
integer, we get the vector (2, 2). The sum of the coordinates is even; therefore, this is the
closest lattice point to the input.
Suppose the input was (3.4, 1.8). Rounding the components to the nearest integer, we
get (3, 2). The sum of the components is 5, which is odd. The differences between the
components of the input vector and the nearest integer are 0.4 and 0.2. The largest difference
was incurred by the first component, so we round it up to the next closest integer, and the
resulting vector is (4, 2). The sum of the coordinates is 6, which is even; therefore, this is
the closest lattice point. 
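A minimal sketch of this nearest-point rule for the D lattice, following the two-step procedure just described; the function name is our own.

    import numpy as np

    def nearest_d_lattice_point(x):
        # Step 1: round each coordinate to the nearest integer.
        x = np.asarray(x, dtype=float)
        y = np.rint(x)
        if int(np.sum(y)) % 2 != 0:
            # Step 2: odd coordinate sum, so re-round the coordinate that incurred
            # the largest rounding error; this changes the parity of the sum to even.
            errors = x - y
            k = int(np.argmax(np.abs(errors)))
            y[k] += 1.0 if errors[k] > 0 else -1.0
        return y

    # nearest_d_lattice_point([2.3, 1.9]) returns [2., 2.]
    # nearest_d_lattice_point([3.4, 1.8]) returns [4., 2.]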
Many of the lattices have similar properties that can be used to develop fast algorithms
for finding the closest output point to a given input [141, 140].
To review our coverage of lattice vector quantization, overload error can be reduced by
careful selection of the boundary, and we can reduce the granular noise by selection of the
lattice. The lattice also provides us with a way to avoid storage problems. Finally, we can
use the structural properties of the lattice to find the closest lattice point to a given input.
Now we need two things: to know how to find the closest output point (remember, not
all lattice points are output points), and to find a way of assigning a binary codeword to the
output point and recovering the output point from the binary codeword. This can be done by
again making use of the specific structures of the lattices. While the procedures necessary
are simple, explanations of the procedures are lengthy and involved (see [142] and [140]
for details).
10.7 Variations on the Theme
Because of its capability to provide high compression with relatively low distortion, vector
quantization has been one of the more popular lossy compression techniques over the last
decade in such diverse areas as video compression and low-rate speech compression. During
this period, several people have come up with variations on the basic vector quantization
approach. We briefly look at a few of the more well-known variations here, but this is by
no means an exhaustive list. For more information, see [5] and [143].
10.7.1 Gain-Shape Vector Quantization
In some applications such as speech, the dynamic range of the input is quite large. One
effect of this is that, in order to be able to represent the various vectors from the source,
we need a very large codebook. This requirement can be reduced by normalizing the
source output vectors, then quantizing the normalized vector and the normalization factor
separately [144, 133]. In this way, the variation due to the dynamic range is represented by
the normalization factor or gain, while the vector quantizer is free to do what it does best,
which is to capture the structure in the source output. Vector quantizers that function in this
manner are called gain-shape vector quantizers. The pyramid quantizer discussed earlier is
an example of a gain-shape vector quantizer.
10.7.2 Mean-Removed Vector Quantization
If we were to generate a codebook from an image, differing amounts of background illumi-
nation would result in vastly different codebooks. This effect can be significantly reduced if
we remove the mean from each vector before quantization. The mean and the mean-removed
vector can then be quantized separately. The mean can be quantized using a scalar quantiza-
tion scheme, while the mean-removed vector can be quantized using a vector quantizer. Of
course, if this strategy is used, the vector quantizer should be designed using mean-removed
vectors as well.
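A minimal sketch of mean-removed encoding and decoding, assuming a scalar quantizer for the mean and a vector quantizer trained on mean-removed vectors are supplied as callables (the names are our own):

    import numpy as np

    def mean_removed_encode(block, quantize_mean, vq_encode):
        # Split the block into its mean and a mean-removed residual vector.
        x = np.asarray(block, dtype=float).ravel()
        mean = float(np.mean(x))
        # Quantize the mean with a scalar quantizer and the residual with a VQ
        # whose codebook was designed from mean-removed training vectors.
        return quantize_mean(mean), vq_encode(x - mean)

    def mean_removed_decode(mean_hat, codeword, vq_codebook, shape):
        # Reconstruct by adding the quantized mean back to the codebook entry.
        return (vq_codebook[codeword] + mean_hat).reshape(shape)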
Example 10.7.1:
Let us encode the Sinan image using a codebook generated by the Sena image, as we did in
Figure 10.16. However, this time we will use a mean-removed vector quantizer. The result
is shown in Figure 10.26. For comparison we have also included the reconstructed image
from Figure 10.16. Notice the annoying blotches on the shoulder have disappeared. How-
ever, the reconstructed image also suffers from more blockiness. The blockiness increases
because adding the mean back into each block accentuates the discontinuity at the block
boundaries.
FIGURE 10.26  Left: Reconstructed image using mean-removed vector quantization and the Sena image as the training set. Right: LBG vector quantization with the Sena image as the training set.
Each approach has its advantages and disadvantages. Which approach we use in a
particular application depends very much on the application.
10.7.3 Classified Vector Quantization
We can sometimes divide the source output into separate classes with different spatial
properties. In these cases, it can be very beneficial to design separate vector quantizers for
the different classes. This approach, referred to as classified vector quantization, is especially
useful in image compression, where edges and nonedge regions form two distinct classes.
We can separate the training set into vectors that contain edges and vectors that do not.
A separate vector quantizer can be developed for each class. During the encoding process,
the vector is first tested to see if it contains an edge. A simple way to do this is to check
the variance of the pixels in the vector. A large variance will indicate the presence of an
edge. More sophisticated techniques for edge detection can also be used. Once the vector
is classified, the corresponding codebook can be used to quantize the vector. The encoder
transmits both the label for the codebook used and the label for the vector in the codebook
[145].
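A minimal sketch of the classification-plus-quantization step, assuming edge and nonedge codebooks have already been designed and using a hypothetical variance threshold edge_threshold:

    import numpy as np

    def classified_vq_encode(block, edge_codebook, nonedge_codebook, edge_threshold):
        x = np.asarray(block, dtype=float).ravel()
        # A large pixel variance is taken as an indication of an edge.
        is_edge = bool(np.var(x) > edge_threshold)
        codebook = edge_codebook if is_edge else nonedge_codebook
        # Full search within the selected codebook.
        index = int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))
        return int(is_edge), index    # codebook label and codeword label are both sent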
A slight variation of this strategy is to use different kinds of quantizers for the different
classes of vectors. For example, if certain classes of source outputs require quantization
at a higher rate than is possible using LBG vector quantizers, we can use lattice vector
quantizers. An example of this approach can be found in [146].
10.7.4 Multistage Vector Quantization
Multistage vector quantization [147] is an approach that reduces both the encoding com-
plexity and the memory requirements for vector quantization, especially at high rates. In
this approach, the input is quantized in several stages. In the first stage, a low-rate vector
quantizer is used to generate a coarse approximation of the input. This coarse approximation,
in the form of the label of the output point of the vector quantizer, is transmitted to the
receiver. The error between the original input and the coarse representation is quantized by
the second-stage quantizer, and the label of the output point is transmitted to the receiver. In
this manner, the input to the nth-stage vector quantizer is the difference between the original
input and the reconstruction obtained from the outputs of the preceding n − 1 stages. The
difference between the input to a quantizer and the reconstruction value is often called the
residual, and the multistage vector quantizers are also known as residual vector quantizers
[148]. The reconstructed vector is the sum of the output points of each of the stages. Suppose
we have a three-stage vector quantizer, with the three quantizers represented by Q_1, Q_2, and
Q_3. Then for a given input X, we find

$$\begin{aligned}
Y_1 &= Q_1(X) \\
Y_2 &= Q_2\big(X - Q_1(X)\big) \\
Y_3 &= Q_3\big(X - Q_1(X) - Q_2(X - Q_1(X))\big)
\end{aligned} \qquad (10.11)$$

The reconstruction X̂ is given by

$$\hat{X} = Y_1 + Y_2 + Y_3 \qquad (10.12)$$
This process is shown in Figure 10.27.
FIGURE 10.27  A three-stage vector quantizer.
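A minimal sketch of the encoder of Equations (10.11) and (10.12), assuming each stage is supplied as a callable that returns the reconstruction (output point) for its input:

    import numpy as np

    def multistage_encode(x, stages):
        # stages is a list of quantizer callables Q1, Q2, ..., QK.
        x = np.asarray(x, dtype=float)
        outputs, residual = [], x
        for Q in stages:
            y = Q(residual)           # quantize what the earlier stages left behind
            outputs.append(y)
            residual = residual - y   # pass the new residual to the next stage
        x_hat = np.sum(outputs, axis=0)   # reconstruction is the sum of stage outputs
        return outputs, x_hat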
If we have K stages, and the codebook size of the nth-stage vector quantizer is Ln,
then the effective size of the overall codebook is L1 ×L2 ×···×LK. However, we need to
store only L1 + L2 + ··· + LK vectors, which is also the number of comparisons required.
Suppose we have a five-stage vector quantizer, each with a codebook size of 32, meaning
that we would have to store 160 codewords. This would provide an effective codebook size
of 32^5 = 33,554,432. The computational savings are also of the same order.
This approach allows us to use vector quantization at much higher rates than we could
otherwise. However, at rates at which it is feasible to use LBG vector quantizers, the
performance of the multistage vector quantizers is generally lower than the LBG vector
quantizers [5]. The reason for this is that after the first few stages, much of the structure
used by the vector quantizer has been removed, and the vector quantization advantage that
depends on this structure is not available. Details on the design of residual vector quantizers
can be found in [148, 149].
There may be some vector inputs that can be well represented by fewer stages than
others. A multistage vector quantizer with a variable number of stages can be implemented
by extending the idea of recursively indexed scalar quantization to vectors. It is not possible
to do this directly because there are some fundamental differences between scalar and vector
quantizers. The input to a scalar quantizer is assumed to be iid. On the other hand, the vector
quantizer can be viewed as a pattern-matching algorithm [150]. The input is assumed to be
one of a number of different patterns. The scalar quantizer is used after the redundancy has
been removed from the source sequence, while the vector quantizer takes advantage of the
redundancy in the data.
With these differences in mind, the recursively indexed vector quantizer (RIVQ) can
be described as a two-stage process. The first stage performs the normal pattern-matching
function, while the second stage recursively quantizes the residual if the magnitude of the
residual is greater than some prespecified threshold. The codebook of the second stage is
ordered so that the magnitude of the codebook entries is a nondecreasing function of its
index. We then choose an index I that will determine the mode in which the RIVQ operates.
The quantization rule Q, for a given input value X, is as follows:

- Quantize X with the first-stage quantizer Q_1.
- If the residual X − Q_1(X) is below a specified threshold, then Q_1(X) is the nearest output level.
- Otherwise, generate X_1 = X − Q_1(X) and quantize using the second-stage quantizer Q_2. Check if the index J_1 of the output is below the index I. If so,

  Q(X) = Q_1(X) + Q_2(X_1)

  If not, form

  X_2 = X_1 − Q(X_1)

  and do the same for X_2 as we did for X_1.

This process is repeated until, for some m, the index J_m falls below the index I, in which case X will be quantized to

  Q(X) = Q_1(X) + Q_2(X_1) + ··· + Q_2(X_m)
Thus, the RIVQ operates in two modes: when the index J of the quantized input falls below
a given index I and when the index J falls above the index I.
Details on the design and performance of the recursively indexed vector quantizer can
be found in [151, 152].
10.7.5 Adaptive Vector Quantization
While LBG vector quantizers function by using the structure in the source output, this
reliance on the use of the structure can also be a drawback when the characteristics of the
source change over time. For situations like these, we would like to have the quantizer adapt
to the changes in the source output.
For mean-removed and gain-shape vector quantizers, we can adapt the scalar aspect of
the quantizer, that is, the quantization of the mean or the gain using the techniques discussed
in the previous chapter. In this section, we look at a few approaches to adapting the codebook
of the vector quantizer to changes in the characteristics of the input.
One way of adapting the codebook to changing input characteristics is to start with a
very large codebook designed to accommodate a wide range of source characteristics [153].
This large codebook can be ordered in some manner known to both transmitter and receiver.
Given a sequence of input vectors to be quantized, the encoder can select a subset of the
larger codebook to be used. Information about which vectors from the large codebook were
used can be transmitted as a binary string. For example, if the large codebook contained 10
vectors, and the encoder was to use the second, third, fifth, and ninth vectors, we would
send the binary string 0110100010, with a 1 representing the position of the codeword used
in the large codebook. This approach permits the use of a small codebook that is matched
to the local behavior of the source.
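A minimal sketch of this selection signalling; the zero-based indexing is an implementation choice of ours.

    def codebook_selection_bits(large_codebook_size, selected_indices):
        # One bit per entry of the large, ordered codebook: 1 if the entry belongs
        # to the small codebook for the coming interval, 0 otherwise.
        selected = set(selected_indices)
        return "".join("1" if i in selected else "0" for i in range(large_codebook_size))

    # Selecting the second, third, fifth, and ninth of 10 vectors (indices 1, 2, 4, 8)
    # reproduces the string 0110100010 from the text.
    print(codebook_selection_bits(10, [1, 2, 4, 8]))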
This approach can be used with particular effectiveness with the recursively indexed
vector quantizer [151]. Recall that in the recursively indexed vector quantizer, the quantized
output is always within a prescribed distance of the inputs, determined by the index I. This
means that the set of output values of the RIVQ can be viewed as an accurate representation
of the inputs and their statistics. Therefore, we can treat a subset of the output set of the
previous intervals as our large codebook. We can then use the method described in [153] to
inform the receiver of which elements of the previous outputs form the codebook for the next
interval. This method (while not the most efficient) is quite simple. Suppose an output set,
in order of first appearance, is {p, a, q, s, l, t, r}, and the desired codebook for the interval to
be encoded is {a, q, l, r}. Then we would transmit the binary string 0110101 to the receiver.
The 1s correspond to the letters in the output set, which would be elements of the desired
codebook. We select the subset for the current interval by finding the closest vectors from
our collection of past outputs to the input vectors of the current set. This means that there is
an inherent delay of one interval imposed by this approach. The overhead required to send
the codebook selection is M/N, where M is the number of vectors in the output set and N
is the interval size.
Another approach to updating the codebook is to check the distortion incurred while
quantizing each input vector. Whenever this distortion is above some specified threshold,
a different higher-rate mechanism is used to encode the input. The higher-rate mechanism
might be the scalar quantization of each component, or the use of a high-rate lattice vector
quantizer. This quantized representation of the input is transmitted to the receiver and, at the
same time, added to both the encoder and decoder codebooks. In order to keep the size of the
codebook the same, an entry must be discarded when a new vector is added to the codebook.
Selecting an entry to discard is handled in a number of different ways. Variations of this
approach have been used for speech coding, image coding, and video coding (see [154, 155,
156, 157, 158] for more details).
10.8 Trellis-Coded Quantization
Finally, we look at a quantization scheme that appears to be somewhat different from other
vector quantization schemes. In fact, some may argue that it is not a vector quantizer at all.
However, the trellis-coded quantization (TCQ) algorithm gets its performance advantage by
exploiting the statistical structure exploited by the lattice vector quantizer. Therefore, we
can argue that it should be classified as a vector quantizer.
The trellis-coded quantization algorithm was inspired by the appearance of a revolution-
ary concept in modulation called trellis-coded modulation (TCM). The TCQ algorithm and
its entropy-constrained variants provide some of the best performance when encoding ran-
dom sources. This quantizer can be viewed as a vector quantizer with very large dimension,
but a restricted set of values for the components of the vectors.
Like a vector quantizer, the TCQ quantizes sequences of source outputs. Each element
of a sequence is quantized using 2^R reconstruction levels selected from a set of 2^(R+1)
reconstruction levels, where R is the number of bits per sample used by a trellis-coded
quantizer. The 2^R-element subsets are predefined; which particular subset is used is based
on the reconstruction level used to quantize the previous quantizer input. However, the TCQ
algorithm allows us to postpone a decision on which reconstruction level to use until we
can look at a sequence of decisions. This way we can select the sequence of decisions that
gives us the lowest amount of average distortion.
Let’s take the case of a 2-bit quantizer. As described above, this means that we will
need 2^3, or 8, reconstruction levels. Let’s label these reconstruction levels as shown in
Figure 10.28. The set of reconstruction levels is partitioned into two subsets: one consisting
FIGURE 10.28  Reconstruction levels for a 2-bit trellis-coded quantizer.
of the reconstruction values labeled Q0,i and Q2,i, and the remainder comprising the second
set. We use the first set to perform the quantization if the previous quantization level was
one labeled Q0,i or Q1,i; otherwise, we use the second set. Because the current reconstructed
value defines the subset that can be used to perform the quantization on the next input,
sometimes it may be advantageous to actually accept more distortion than necessary for
the current sample in order to have less distortion in the next quantization step. In fact, at
times it may be advantageous to accept poor quantization for several samples so that several
samples down the line the quantization can result in less distortion. If you have followed
this reasoning, you can see how we might be able to get lower overall distortion by looking
at the quantization of an entire sequence of source outputs. The problem with delaying a
decision is that the number of choices increases exponentially with each sample. In the 2-bit
example, for the first sample we have four choices; for each of these four choices we have
four choices for the second sample. For each of these 16 choices we have four choices for the
third sample, and so on. Luckily, there is a technique that can be used to keep this explosive
growth of choices under control. The technique, called the Viterbi algorithm [159], is widely
used in error control coding.
In order to explain how the Viterbi algorithm works, we will need to formalize some of
what we have been discussing. The sequence of choices can be viewed in terms of a state
diagram. Let’s suppose we have four states: S0, S1, S2, and S3. We will say we are in state
Sk if we use the reconstruction levels Qk,1 or Qk,2. Thus, if we use the reconstruction levels
Q0,i, we are in state S0. We have said that we use the elements of Set #1 if the previous
quantization levels were Q0,i or Q1,i. As Set #1 consists of the quantization levels Q0,i and
Q2,i, this means that we can go from states S0 and S1 to states S0 and S2. Similarly, from
states S2 and S3 we can only go to states S1 and S3. The state diagram can be drawn as
shown in Figure 10.29.
F I G U R E 10. 29 State diagram for the selection process.
Let’s suppose we go through two sequences of choices that converge to the same state,
after which both sequences are identical. This means that the sequence of choices that had
incurred a higher distortion at the time the two sequences converged will have a higher
distortion from then on. In the end we will select the sequence of choices that results in the
lowest distortion; therefore, there is no point in continuing to keep track of a sequence that
we will discard anyway. This means that whenever two sequences of choices converge, we
can discard one of them. How often does this happen? In order to see this, let’s introduce
time into our state diagram. The state diagram with the element of time introduced into it
is called a trellis diagram. The trellis for this particular example is shown in Figure 10.30.
At each time instant, we can go from one state to two other states. And, at each step we
F I G U R E 10. 30 Trellis diagram for the selection process.
F I G U R E 10. 31 Trellis diagram for the selection process with binary labels for the
state transitions.
have two sequences that converge to each state. If we discard one of the two sequences that
converge to each state, we can see that, no matter how long a sequence of decisions we use,
we will always end up with four sequences.
Notice that, assuming the initial state is known to the decoder, any path through this
particular trellis can be described to the decoder using 1 bit per sample. From each state we
can only go to two other states. In Figure 10.31, we have marked the branches with the bits
used to signal that transition. Given that each state corresponds to two quantization levels,
specifying the quantization level for each sample would require an additional bit, resulting
in a total of 2 bits per sample. Let’s see how all this works together in an example.
E x a m p l e 10 . 8 . 1:
Using the quantizer whose quantization levels are shown in Figure 10.32, we will quantize the
sequence of values 0.2, 1.6, 2.3. For the distortion measure we will use the sum of absolute
differences. If we simply used the quantization levels marked as Set #1 in Figure 10.28,
we would quantize 0.2 to the reconstruction value 0.5, for a distortion of 0.3. The second
sample value of 1.6 would be quantized to 2.5, and the third sample value of 2.3 would also
be quantized to 2.5, resulting in a total distortion of 1.4. If we used Set #2 to quantize these
values, we would end up with a total distortion of 1.6. Let’s see how much distortion results
when using the TCQ algorithm.
We start by quantizing the first sample using the two quantization levels Q0,1 and Q0,2.
The reconstruction level Q0,2, or 0.5, is closer and results in an absolute difference of 0.3.
We mark this on the first node corresponding to S0. We then quantize the first sample using
F I G U R E 10. 32 Reconstruction levels for a 2-bit trellis-coded quantizer:
Q0,1 = −3.5, Q1,1 = −2.5, Q2,1 = −1.5, Q3,1 = −0.5, Q0,2 = 0.5, Q1,2 = 1.5, Q2,2 = 2.5, Q3,2 = 3.5.
F I G U R E 10. 33 Quantizing the first sample.
Q1,1 and Q1,2. The closest reconstruction value is Q1,2, or 1.5, which results in a distortion
value of 1.3. We mark the first node corresponding to S1. Continuing in this manner, we get
a distortion value of 1.7 when we use the reconstruction levels corresponding to state S2 and
a distortion value of 0.7 when we use the reconstruction levels corresponding to state S3. At
this point the trellis looks like Figure 10.33. Now we move on to the second sample. Let’s
first quantize the second sample value of 1.6 using the quantization levels associated with
state S0. The reconstruction levels associated with state S0 are −3.5 and 0.5. The closest
value to 1.6 is 0.5. This results in an absolute difference for the second sample of 1.1. We
can reach S0 from S0 and from S1. If we accept the first sample reconstruction corresponding
to S0, we will end up with an accumulated distortion of 1.4. If we accept the reconstruction
corresponding to state S1, we get an accumulated distortion of 2.4. Since the accumulated
distortion is less if we accept the transition from state S0, we do so and discard the transition
from state S1. Continuing in this fashion for the remaining states, we end up with the situation
depicted in Figure 10.34. The sequence of decisions that have been terminated are shown by
an X on the branch corresponding to the particular transition. The accumulated distortion is
listed at each node. Repeating this procedure for the third sample value of 2.3, we obtain the
F I G U R E 10. 34 Quantizing the second sample. The surviving accumulated distortions
are 1.4 at S0, 0.8 at S1, 1.2 at S2, and 2.6 at S3.
F I G U R E 10. 35 Quantizing the third sample. The surviving accumulated distortions
are 2.6 at S0, 2.0 at S1, 1.0 at S2, and 2.4 at S3.
trellis shown in Figure 10.35. If we wanted to terminate the algorithm at this time, we could
pick the sequence of decisions with the smallest accumulated distortion. In this particular
example, the sequence would be S3, S1, S2. The accumulated distortion is 1.0, which is less
than what we would have obtained using either Set #1 or Set #2. 
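The bookkeeping in this example is exactly the Viterbi recursion: keep one surviving path per
state, extend each survivor by the best reconstruction level of the destination state, and discard
the costlier of the two paths entering each state. The following C sketch (ours, not part of the
book's accompanying software) reproduces the search of Example 10.8.1; the reconstruction levels
and allowed transitions are those of Figures 10.29 and 10.32, while the array and variable names
are our own.

/* Sketch of the 2-bit TCQ search of Example 10.8.1 using the Viterbi algorithm
 * and absolute error. Reconstruction levels follow Figure 10.32. */
#include <stdio.h>
#include <math.h>

#define NSTATES 4
#define MAXLEN  100

/* Q[k][j]: the two reconstruction levels associated with state Sk */
static const double Q[NSTATES][2] = {
    { -3.5, 0.5 },   /* S0 */
    { -2.5, 1.5 },   /* S1 */
    { -1.5, 2.5 },   /* S2 */
    { -0.5, 3.5 }    /* S3 */
};
/* Allowed predecessors: S0 and S2 are reached from {S0,S1},
 * S1 and S3 are reached from {S2,S3} (Figure 10.29). */
static const int pred[NSTATES][2] = { {0,1}, {2,3}, {0,1}, {2,3} };

int main(void)
{
    double x[] = { 0.2, 1.6, 2.3 };
    int n = 3, t, s, j;
    double cost[MAXLEN][NSTATES];
    int    from[MAXLEN][NSTATES];              /* survivor bookkeeping */

    /* initialize with the best level in each state for the first sample */
    for (s = 0; s < NSTATES; s++) {
        double d0 = fabs(x[0] - Q[s][0]), d1 = fabs(x[0] - Q[s][1]);
        cost[0][s] = d0 < d1 ? d0 : d1;
        from[0][s] = -1;
    }
    /* Viterbi recursion: keep one survivor path per state */
    for (t = 1; t < n; t++) {
        for (s = 0; s < NSTATES; s++) {
            double d0 = fabs(x[t] - Q[s][0]), d1 = fabs(x[t] - Q[s][1]);
            double d  = d0 < d1 ? d0 : d1;     /* best level in state s */
            double best = 1e30;
            for (j = 0; j < 2; j++) {
                int p = pred[s][j];
                if (cost[t-1][p] + d < best) {
                    best = cost[t-1][p] + d;
                    from[t][s] = p;
                }
            }
            cost[t][s] = best;
        }
    }
    /* trace back the lowest-distortion path */
    int bests = 0;
    for (s = 1; s < NSTATES; s++)
        if (cost[n-1][s] < cost[n-1][bests]) bests = s;
    printf("total distortion = %.2f, final state = S%d\n", cost[n-1][bests], bests);
    for (t = n - 1; t >= 0; t--) {
        printf("time %d: state S%d\n", t, bests);
        if (from[t][bests] >= 0) bests = from[t][bests];
    }
    return 0;
}

Running it on the sequence 0.2, 1.6, 2.3 should report a total distortion of 1.0 and trace back
through S3, S1, S2, matching the example.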
10 . 9 Summary
In this chapter we introduced the technique of vector quantization. We have seen how we can
make use of the structure exhibited by groups, or vectors, of values to obtain compression.
Because there are different kinds of structure in different kinds of data, there are a number
of different ways to design vector quantizers. Because data from many sources, when
viewed as vectors, tend to form clusters, we can design quantizers that essentially consist of
representations of these clusters. We also described aspects of the design of vector quantizers
and looked at some applications. Recent literature in this area is substantial, and we have
barely skimmed the surface of the large number of interesting variations of this technique.
F u r t h e r R e a d i n g
The subject of vector quantization is dealt with extensively in the book Vector Quantization and
Signal Compression, by A. Gersho and R.M. Gray [5]. There is also an excellent collection of
papers called Vector Quantization, edited by H. Abut and published by IEEE Press [143].
There are a number of excellent tutorial articles on this subject:
1. “Vector Quantization,” by R.M. Gray, in the April 1984 issue of IEEE Acoustics,
Speech, and Signal Processing Magazine [160].
2. “Vector Quantization: A Pattern Matching Technique for Speech Coding,” by
A. Gersho and V. Cuperman, in the December 1983 issue of IEEE Communications
Magazine [150].
3. “Vector Quantization in Speech Coding,” by J. Makhoul, S. Roucos, and H. Gish, in
the November 1985 issue of the Proceedings of the IEEE [161].
4. “Vector Quantization,” by P.F. Swaszek, in Communications and Networks, edited by
I.F. Blake and H.V. Poor [162].
5. A survey of various image-coding applications of vector quantization can be found in
“Image Coding Using Vector Quantization: A Review,” by N.M. Nasrabadi and R.A.
King, in the August 1988 issue of the IEEE Transactions on Communications [163].
6. A thorough review of lattice vector quantization can be found in “Lattice Quantiza-
tion,” by J.D. Gibson and K. Sayood, in Advances in Electronics and Electron Physics
[140].
The area of vector quantization is an active one, and new techniques that use vector
quantization are continually being developed. The journals that report work in this area
include IEEE Transactions on Information Theory, IEEE Transactions on Communications,
IEEE Transactions on Signal Processing, and IEEE Transactions on Image Processing,
among others.
10 . 10 Projects and Problems
1. In Example 10.3.2 we increased the SNR by about 0.3 dB by moving the top-left
output point to the origin. What would happen if we moved the output points at the
four corners to the positions (±Δ, 0), (0, ±Δ)? As in the example, assume the input
has a Laplacian distribution with mean zero and variance one, and Δ = 0.7309. You
can obtain the answer analytically or through simulation.
2. For the quantizer of the previous problem, rather than moving the output points to
(±Δ, 0) and (0, ±Δ), we could have moved them to other positions that might have
provided a larger increase in SNR. Write a program to test different (reasonable)
possibilities and report on the best and worst cases.
3. In the program trainvq.c the empty cell problem is resolved by replacing the vector
with no associated training set vectors with a training set vector from the quantization
region with the largest number of vectors. In this problem, we will investigate some
possible alternatives.
Generate a sequence of pseudorandom numbers with a triangular distribution between
0 and 2. (You can obtain a random number with a triangular distribution by adding
two uniformly distributed random numbers.) Design an eight-level, two-dimensional
vector quantizer with the initial codebook shown in Table 10.9.
(a) Use the trainvq program to generate a codebook with 10,000 random numbers
as the training set. Comment on the final codebook you obtain. Plot the elements
of the codebook and discuss why they ended up where they did.
(b) Modify the program so that the empty cell vector is replaced with a vector from
the quantization region with the largest distortion. Comment on any changes in
T A B L E 10 . 9 Initial codebook for
Problem 3.
(1, 1)
(1, 2)
(1, 0.5)
(0.5, 1)
(0.5, 0.5)
(1.5, 1)
(2, 5)
(3, 3)
the distortion (or lack of change). Is the final codebook different from the one
you obtained earlier?
(c) Modify the program so that whenever an empty cell problem arises, a two-level
quantizer is designed for the quantization region with the largest number of
output points. Comment on any differences in the codebook and distortion from
the previous two cases.
4. Generate a 16-dimensional codebook of size 64 for the Sena image. Construct the
vector as a 4 × 4 block of pixels, an 8 × 2 block of pixels, and a 16 × 1 block of
pixels. Comment on the differences in the mean squared errors and the quality of
the reconstructed images. You can use the program trvqsp_img to obtain the
codebooks.
5. In Example 10.6.1 we designed a 60-level two-dimensional quantizer by taking the
two-dimensional representation of an 8-level scalar quantizer, removing 12 output
points from the 64 output points, and adding 8 points in other locations. Assume the
input is Laplacian with zero mean and unit variance, and Δ = 0.7309.
(a) Calculate the increase in the probability of overload by the removal of the 12
points from the original 64.
(b) Calculate the decrease in overload probability when we added the 8 new points
to the remaining 52 points.
6. In this problem we will compare the performance of a 16-dimensional pyramid vector
quantizer and a 16-dimensional LBG vector quantizer for two different sources. In
each case the codebook for the pyramid vector quantizer consists of 272 elements:
- 32 vectors with 1 element equal to ±Δ, and the other 15 equal to zero, and
- 240 vectors with 2 elements equal to ±Δ and the other 14 equal to zero.
The value of should be adjusted to give the best performance. The codebook for the
LBG vector quantizer will be obtained by using the program trvqsp_img on the
source output. You will have to modify trvqsp_img slightly to give you a codebook
that is not a power of two.
(a) Use the two quantizers to quantize a sequence of 10,000 zero mean unit variance
Laplacian random numbers. Using either the mean squared error or the SNR as
a measure of performance, compare the performance of the two quantizers.
(b) Use the two quantizers to quantize the Sinan image. Compare the two quantizers
using either the mean squared error or the SNR and the reconstructed image.
Compare the difference between the performance of the two quantizers with the
difference when the input was random.
11
Differential Encoding
11 . 1 Overview
S
ources such as speech and images have a great deal of correlation from sample
to sample. We can use this fact to predict each sample based on its past
and only encode and transmit the differences between the prediction and the
sample value. Differential encoding schemes are built around this premise.
Because the prediction techniques are rather simple, these schemes are much
easier to implement than other compression schemes. In this chapter, we will look at
various components of differential encoding schemes and study how they are used to encode
sources—in particular, speech. We will also look at a widely used international differential
encoding standard for speech encoding.
11 . 2 Introduction
In the last chapter we looked at vector quantization—a rather complex scheme requiring
a significant amount of computational resources—as one way of taking advantage of the
structure in the data to perform lossy compression. In this chapter, we look at a different
approach that uses the structure in the source output in a slightly different manner, resulting
in a significantly less complex system.
When we design a quantizer for a given source, the size of the quantization interval
depends on the variance of the input. If we assume the input is uniformly distributed, the
variance depends on the dynamic range of the input. In turn, the size of the quantization
interval determines the amount of quantization noise incurred during the quantization process.
In many sources of interest, the sampled source output xn does not change a great deal
from one sample to the next. This means that both the dynamic range and the variance of
the sequence of differences dn = xn −xn−1 are significantly smaller than that of the source
output sequence. Furthermore, for correlated sources the distribution of dn is highly peaked
at zero. We made use of this skew, and resulting loss in entropy, for the lossless compression
of images in Chapter 7. Given the relationship between the variance of the quantizer input
and the incurred quantization error, it is also useful, in terms of lossy compression, to look
at ways to encode the difference from one sample to the next rather than encoding the
actual sample value. Techniques that transmit information by encoding differences are called
differential encoding techniques.
E x a m p l e 11 . 2 . 1:
Consider the half cycle of a sinusoid shown in Figure 11.1 that has been sampled at the rate
of 30 samples per cycle. The value of the sinusoid ranges between 1 and −1. If we wanted
to quantize the sinusoid using a uniform four-level quantizer, we would use a step size of
0.5, which would result in quantization errors in the range [−0.25, 0.25]. If we take the
sample-to-sample differences (excluding the first sample), the differences lie in the range
[−0.2, 0.2]. To quantize this range of values with a four-level quantizer requires a step size
of 0.1, which results in quantization noise in the range [−0.05, 0.05].
F I G U R E 11. 1 Sinusoid and sample-to-sample differences. 
The sinusoidal signal in the previous example is somewhat contrived. However, if we
look at some of the real-world sources that we want to encode, we see that the dynamic
range that contains most of the differences is significantly smaller than the dynamic range
of the source output.
E x a m p l e 11 . 2 . 2:
Figure 11.2 is the histogram of the Sinan image. Notice that the pixel values vary over
almost the entire range of 0 to 255. To represent these values exactly, we need 8 bits per
F I G U R E 11. 2 Histogram of the Sinan image.
F I G U R E 11. 3 Histogram of pixel-to-pixel differences of the Sinan image.
pixel. To represent these values in a lossy manner to within an error in the least significant
bit, we need 7 bits per pixel. Figure 11.3 is the histogram of the differences.
More than 99% of the difference values lie in the range −31 to 31. Therefore, if we were
willing to accept distortion in the least significant bit, for more than 99% of the difference
values we need 5 bits per pixel rather than 7. In fact, if we were willing to have a small
percentage of the differences with a larger error, we could get by with 4 bits for each
difference value.
In both examples, we have shown that the dynamic range of the differences between
samples is substantially less than the dynamic range of the source output. In the following
sections we describe encoding schemes that take advantage of this fact to provide improved
compression performance.
11 . 3 The Basic Algorithm
Although it takes fewer bits to encode differences than it takes to encode the original pixel,
we have not said whether it is possible to recover an acceptable reproduction of the original
sequence from the quantized difference value. When we were looking at lossless compression
schemes, we found that if we encoded and transmitted the first value of a sequence, followed
by the encoding of the differences between samples, we could losslessly recover the original
sequence. Unfortunately, a strictly analogous situation does not exist for lossy compression.
E x a m p l e 11 . 3 . 1:
Suppose a source puts out the sequence
6.2  9.7  13.2  5.9  8  7.4  4.2  1.8
We could generate the following sequence by taking the difference between samples
(assume that the first sample value is zero):
6.2  3.5  3.5  −7.3  2.1  −0.6  −3.2  −2.4
If we losslessly encoded these values, we could recover the original sequence at the receiver
by adding back the difference values. For example, to obtain the second reconstructed value,
we add the difference 3.5 to the first received value 6.2 to obtain a value of 9.7. The third
reconstructed value can be obtained by adding the received difference value of 3.5 to the
second reconstructed value of 9.7, resulting in a value of 13.2, which is the same as the
third value in the original sequence. Thus, by adding the nth received difference value to
the (n−1)th reconstruction value, we can recover the original sequence exactly.
Now let us look at what happens if these difference values are encoded using a lossy
scheme. Suppose we had a seven-level quantizer with output values −6, −4, −2, 0, 2, 4, 6.
The quantized sequence would be
6 4 4 −6 2 0 −4 −2
If we follow the same procedure for reconstruction as we did for the lossless compression
scheme, we get the sequence
6 10 14 8 10 10 6 4
The difference or error between the original sequence and the reconstructed sequence is
0.2  −0.3  −0.8  −2.1  −2  −2.6  −1.8  −2.2
Notice that initially the magnitudes of the error are quite small (0.2, 0.3). As the reconstruc-
tion progresses, the magnitudes of the error become significantly larger (2.6, 1.8, 2.2). 
To see what is happening, consider a sequence {x_n}. A difference sequence {d_n} is
generated by taking the differences d_n = x_n − x_{n−1}. This difference sequence is quantized
to obtain the sequence {d̂_n}:

    d̂_n = Q[d_n] = d_n + q_n

where q_n is the quantization error. At the receiver, the reconstructed sequence {x̂_n} is obtained
by adding d̂_n to the previous reconstructed value x̂_{n−1}:

    x̂_n = x̂_{n−1} + d̂_n
Let us assume that both transmitter and receiver start with the same value x_0, that is,
x̂_0 = x_0. Follow the quantization and reconstruction process for the first few samples:

    d_1 = x_1 − x_0                                            (11.1)
    d̂_1 = Q[d_1] = d_1 + q_1                                   (11.2)
    x̂_1 = x_0 + d̂_1 = x_0 + d_1 + q_1 = x_1 + q_1              (11.3)
    d_2 = x_2 − x_1                                            (11.4)
    d̂_2 = Q[d_2] = d_2 + q_2                                   (11.5)
    x̂_2 = x̂_1 + d̂_2 = x_1 + q_1 + d_2 + q_2                    (11.6)
        = x_2 + q_1 + q_2                                      (11.7)

Continuing this process, at the nth iteration we get

    x̂_n = x_n + Σ_{k=1}^{n} q_k                                (11.8)
We can see that the quantization error accumulates as the process continues. Theoretically,
if the quantization error process is zero mean, the errors will cancel each other out in the
long run. In practice, often long before that can happen, the finite precision of the machines
causes the reconstructed value to overflow.
Notice that the encoder and decoder are operating with different pieces of information.
The encoder generates the difference sequence based on the original sample values, while
the decoder adds back the quantized difference onto a distorted version of the original signal.
We can solve this problem by forcing both encoder and decoder to use the same information
during the differencing and reconstruction operations. The only information available to the
receiver about the sequence {x_n} is the reconstructed sequence {x̂_n}. As this information
is also available to the transmitter, we can modify the differencing operation to use the
reconstructed value of the previous sample, instead of the previous sample itself, that is,

    d_n = x_n − x̂_{n−1}                                        (11.9)
Using this new differencing operation, let’s repeat our examination of the quantization
and reconstruction process. We again assume that x̂_0 = x_0.

    d_1 = x_1 − x_0                                            (11.10)
    d̂_1 = Q[d_1] = d_1 + q_1                                   (11.11)
    x̂_1 = x_0 + d̂_1 = x_0 + d_1 + q_1 = x_1 + q_1              (11.12)
    d_2 = x_2 − x̂_1                                            (11.13)
    d̂_2 = Q[d_2] = d_2 + q_2                                   (11.14)
    x̂_2 = x̂_1 + d̂_2 = x̂_1 + d_2 + q_2                          (11.15)
        = x_2 + q_2                                            (11.16)

At the nth iteration we have

    x̂_n = x_n + q_n                                            (11.17)
and there is no accumulation of the quantization noise. In fact, the quantization noise in the
nth reconstructed sequence is the quantization noise incurred by the quantization of the nth
difference. The quantization error for the difference sequence is substantially less than the
quantization error for the original sequence. Therefore, this procedure leads to an overall
reduction of the quantization error. If we are satisfied with the quantization error for a given
number of bits per sample, then we can use fewer bits with a differential encoding procedure
to attain the same distortion.
E x a m p l e 11 . 3 . 2:
Let us try to quantize and then reconstruct the sinusoid of Example 11.2.1 using the two
different differencing approaches. Using the first approach, we get a dynamic range of
F I G U R E 11. 4 Sinusoid and reconstructions.
differences from −0.2 to 0.2. Therefore, we use a quantizer step size of 0.1. In the second
approach, the differences lie in the range [−0.4, 0.4]. In order to cover this range, we use a
step size in the quantizer of 0.2. The reconstructed signals are shown in Figure 11.4.
Notice in the first case that the reconstruction diverges from the signal as we process
more and more of the signal. Although the second differencing approach uses a larger step
size, this approach provides a more accurate representation of the input. 
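The difference between the two approaches comes down to which value the difference is taken
against. The sketch below (ours, not part of the accompanying software) runs both approaches on
the sequence of Example 11.3.1 with the seven-level quantizer used there, assuming, as in that
example, that the first sample is differenced against zero. Only the closed-loop version, which
differences against the previous reconstruction as in (11.9), keeps the error from accumulating.

/* Open-loop differencing (against the previous original sample) versus the
 * closed-loop DPCM differencing of Equation (11.9). Data and quantizer follow
 * Example 11.3.1. */
#include <stdio.h>
#include <math.h>

/* seven-level quantizer with output values {-6,-4,-2,0,2,4,6} */
static double quantize(double d)
{
    double q = 2.0 * floor(d / 2.0 + 0.5);   /* round to the nearest multiple of 2 */
    if (q >  6.0) q =  6.0;
    if (q < -6.0) q = -6.0;
    return q;
}

int main(void)
{
    double x[] = { 6.2, 9.7, 13.2, 5.9, 8.0, 7.4, 4.2, 1.8 };
    int n = 8, i;
    double recon_open = 0.0, recon_closed = 0.0;   /* assume x_0 = x̂_0 = 0 */

    for (i = 0; i < n; i++) {
        /* open loop: difference against the previous original sample */
        double d_open = quantize(x[i] - (i ? x[i-1] : 0.0));
        recon_open += d_open;
        /* closed loop: difference against the previous reconstruction */
        double d_closed = quantize(x[i] - recon_closed);
        recon_closed += d_closed;
        printf("x=%5.1f  open-loop error=%6.2f  closed-loop error=%6.2f\n",
               x[i], x[i] - recon_open, x[i] - recon_closed);
    }
    return 0;
}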
A block diagram of the differential encoding system as we have described it to this point
is shown in Figure 11.5. We have drawn a dotted box around the portion of the encoder that
mimics the decoder. The encoder must mimic the decoder in order to obtain a copy of the
reconstructed sample used to generate the next difference.
We would like our difference value to be as small as possible. For this to happen,
given the system we have described to this point, x̂n−1 should be as close to xn as possible.
However, x̂n−1 is the reconstructed value of xn−1; therefore, we would like x̂n−1 to be
close to xn−1. Unless xn−1 is always very close to xn, some function of past values of the
reconstructed sequence can often provide a better prediction of xn. We will look at some
of these predictor functions later in this chapter. For now, let’s modify Figure 11.5 and
replace the delay block with a predictor block to obtain our basic differential encoding
system as shown in Figure 11.6. The output of the predictor is the prediction sequence pn
given by
    p_n = f(x̂_{n−1}, x̂_{n−2}, …, x̂_0)                          (11.18)
This basic differential encoding system is known as the differential pulse code modulation
(DPCM) system. The DPCM system was developed at Bell Laboratories a few years after
World War II [164]. It is most popular as a speech-encoding system and is widely used in
telephone communications.
As we can see from Figure 11.6, the DPCM system consists of two major components,
the predictor and the quantizer. The study of DPCM is basically the study of these two
components. In the following sections, we will look at various predictor and quantizer
designs and see how they function together in a differential encoding system.
F I G U R E 11. 5 A simple differential encoding system.
F I G U R E 11. 6 The basic algorithm.
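In code, the structure of Figure 11.6 is a short feedback loop. The following C sketch is ours,
with a placeholder uniform quantizer and a fixed linear predictor standing in for the blocks
marked Q and P; the essential point is that the prediction is formed from reconstructed values,
so the encoder carries its own copy of the decoder.

/* Minimal sketch of the DPCM loop of Figure 11.6 (ours, not the book's code).
 * The quantizer is a placeholder uniform midrise quantizer and the predictor a
 * fixed linear combination of past reconstructions; how to choose the
 * coefficients is the subject of Section 11.4. */
#define N     3            /* predictor order (placeholder)        */
#define STEP  0.5          /* uniform quantizer step (placeholder) */

static double a[N] = { 0.9, 0.0, 0.0 };        /* placeholder coefficients */

static double predict(const double *hist)      /* p_n as in (11.18) */
{
    double p = 0.0;
    for (int i = 0; i < N; i++) p += a[i] * hist[i];
    return p;
}

static double quantize(double d)               /* placeholder midrise quantizer */
{
    return STEP * (int)(d / STEP) + (d >= 0.0 ? STEP / 2 : -STEP / 2);
}

/* Encode x[0..n-1]. In a real system the quantizer labels would be transmitted;
 * here we keep the reconstruction in xhat[] to show the feedback structure. */
void dpcm_encode(const double *x, double *xhat, int n)
{
    double hist[N] = { 0.0 };          /* past reconstructed values x̂_{n-1}, ... */
    for (int k = 0; k < n; k++) {
        double p    = predict(hist);   /* prediction from reconstructed values */
        double d    = x[k] - p;        /* difference d_n = x_n - p_n           */
        double dhat = quantize(d);     /* quantized difference d̂_n             */
        xhat[k]     = p + dhat;        /* the decoder forms exactly this value */
        for (int i = N - 1; i > 0; i--) hist[i] = hist[i-1];
        hist[0] = xhat[k];             /* the encoder mimics the decoder       */
    }
}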
11 . 4 Prediction in DPCM
Differential encoding systems like DPCM gain their advantage by the reduction in the
variance and dynamic range of the difference sequence. How much the variance is reduced
depends on how well the predictor can predict the next symbol based on the past reconstructed
symbols. In this section we will mathematically formulate the prediction problem. The
analytical solution to this problem will give us one of the more widely used approaches
to the design of the predictor. In order to follow this development, some familiarity with
the mathematical concepts of expectation and correlation is needed. These concepts are
described in Appendix A.
Define σ_d², the variance of the difference sequence, as

    σ_d² = E[(x_n − p_n)²]                                     (11.19)

where E[·] is the expectation operator. As the predictor outputs p_n are given by (11.18), the
design of a good predictor is essentially the selection of the function f(·) that minimizes σ_d².
One problem with this formulation is that x̂_n is given by

    x̂_n = x_n + q_n

and q_n depends on the variance of d_n. Thus, by picking f(·), we affect σ_d², which in turn
affects the reconstruction x̂_n, which then affects the selection of f(·). This coupling makes an
explicit solution extremely difficult for even the most well-behaved source [165]. As most
real sources are far from well behaved, the problem becomes computationally intractable in
most applications.
We can avoid this problem by making an assumption known as the fine quantization
assumption. We assume that quantizer step sizes are so small that we can replace x̂_n by x_n,
and therefore

    p_n = f(x_{n−1}, x_{n−2}, …, x_0)                           (11.20)
Once the function f(·) has been found, we can use it with the reconstructed values x̂_n
to obtain p_n. If we now assume that the output of the source is a stationary process,
from the study of random processes [166], we know that the function that minimizes σ_d²
is the conditional expectation E[x_n | x_{n−1}, x_{n−2}, …, x_0]. Unfortunately, the assumption of
stationarity is generally not true, and even if it were, finding this conditional expectation
requires the knowledge of nth-order conditional probabilities, which would generally not be
available.
Given the difficulty of finding the best solution, in many applications we simplify the
problem by restricting the predictor function to be linear. That is, the prediction p_n is given by

    p_n = Σ_{i=1}^{N} a_i x̂_{n−i}                               (11.21)

The value of N specifies the order of the predictor. Using the fine quantization assumption,
we can now write the predictor design problem as follows: Find the a_i so as to minimize σ_d²,

    σ_d² = E[(x_n − Σ_{i=1}^{N} a_i x_{n−i})²]                  (11.22)
where we assume that the source sequence is a realization of a real valued wide sense
stationary process. Take the derivative of σ_d² with respect to each of the a_i and set this equal
to zero. We get N equations and N unknowns:

    ∂σ_d²/∂a_1 = −2E[(x_n − Σ_{i=1}^{N} a_i x_{n−i}) x_{n−1}] = 0      (11.23)
    ∂σ_d²/∂a_2 = −2E[(x_n − Σ_{i=1}^{N} a_i x_{n−i}) x_{n−2}] = 0      (11.24)
       ⋮
    ∂σ_d²/∂a_N = −2E[(x_n − Σ_{i=1}^{N} a_i x_{n−i}) x_{n−N}] = 0      (11.25)
Taking the expectations, we can rewrite these equations as

    Σ_{i=1}^{N} a_i R_xx(i−1) = R_xx(1)                         (11.26)
    Σ_{i=1}^{N} a_i R_xx(i−2) = R_xx(2)                         (11.27)
       ⋮
    Σ_{i=1}^{N} a_i R_xx(i−N) = R_xx(N)                         (11.28)

where R_xx(k) is the autocorrelation function of {x_n}:

    R_xx(k) = E[x_n x_{n+k}]                                    (11.29)
We can write these equations in matrix form as

    RA = P                                                      (11.30)

where

    R = [ R_xx(0)      R_xx(1)      R_xx(2)    ···  R_xx(N−1)
          R_xx(1)      R_xx(0)      R_xx(1)    ···  R_xx(N−2)
          R_xx(2)      R_xx(1)      R_xx(0)    ···  R_xx(N−3)
            ⋮             ⋮            ⋮        ⋱      ⋮
          R_xx(N−1)    R_xx(N−2)    R_xx(N−3)  ···  R_xx(0)  ]   (11.31)

    A = [ a_1  a_2  a_3  ···  a_N ]^T                            (11.32)

    P = [ R_xx(1)  R_xx(2)  R_xx(3)  ···  R_xx(N) ]^T            (11.33)

where we have used the fact that R_xx(−k) = R_xx(k) for real valued wide sense stationary
processes. These equations are referred to as the discrete form of the Wiener-Hopf equations.
If we know the autocorrelation values R_xx(k) for k = 0, 1, …, N, then we can find the
predictor coefficients as

    A = R^−1 P                                                   (11.34)
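Because R is Toeplitz, (11.30) can be solved without forming R^−1 explicitly; the Levinson-Durbin
recursion is the usual choice. The book does not prescribe a particular solver, so the following
C sketch is only one way of obtaining the coefficients from the autocorrelation values
R_xx(0), …, R_xx(N).

/* Levinson-Durbin recursion: given autocorrelations r[0..N], solve the
 * Wiener-Hopf equations (11.30) for the predictor coefficients a[1..N].
 * The caller supplies a[] with room for N+1 entries; a[0] is unused. */
int wiener_hopf_solve(const double *r, double *a, int N)
{
    double err = r[0];                 /* prediction error energy so far */
    double tmp[64];
    if (N >= 64 || err <= 0.0) return -1;

    for (int i = 1; i <= N; i++) {
        /* reflection (parcor) coefficient */
        double acc = r[i];
        for (int j = 1; j < i; j++) acc -= a[j] * r[i - j];
        double k = acc / err;

        /* update coefficients a[1..i] */
        for (int j = 1; j < i; j++) tmp[j] = a[j] - k * a[i - j];
        for (int j = 1; j < i; j++) a[j] = tmp[j];
        a[i] = k;

        err *= (1.0 - k * k);          /* new prediction error energy */
        if (err <= 0.0) return -1;
    }
    return 0;                          /* a[1..N] now minimize sigma_d^2 */
}

Since the recursion solves the same set of equations as (11.34), feeding in the autocorrelations
estimated with (11.35) for the speech segment of Example 11.4.1 below should reproduce
coefficients close to the ones quoted there.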
E x a m p l e 11 . 4 . 1:
For the speech sequence shown in Figure 11.7, let us find predictors of orders one, two, and
three and examine their performance. We begin by estimating the autocorrelation values from
the data. Given M data points, we use the following average to find the value for R_xx(k):

    R_xx(k) = (1/(M−k)) Σ_{i=1}^{M−k} x_i x_{i+k}               (11.35)

Using these autocorrelation values, we obtain the following coefficients for the three dif-
ferent predictors. For N = 1, the predictor coefficient is a1 = 0.66; for N = 2, the coefficients
are a1 = 0.596, a2 = 0.096; and for N = 3, the coefficients are a1 = 0.577, a2 = −0.025, and
a3 = 0.204. We used these coefficients to generate the residual sequence. In order to see the
reduction in variance, we computed the ratio of the source output variance to the variance of
F I G U R E 11. 7 A segment of speech: a male speaker saying the word “test.”
the residual sequence. For comparison, we also computed this ratio for the case where the
residual sequence is obtained by taking the difference of neighboring samples. The sample-to-
sample differences resulted in a ratio of 1.63. Compared to this, the ratio of the input variance
to the variance of the residuals from the first-order predictor was 2.04. With a second-order
predictor, this ratio rose to 3.37, and with a third-order predictor, the ratio was 6.28.
The residual sequence for the third-order predictor is shown in Figure 11.8. Notice that
although there has been a reduction in the dynamic range, there is still substantial structure
F I G U R E 11. 8 The residual sequence using a third-order predictor.
in the residual sequence, especially in the range of samples from about the 700th sample
to the 2000th sample. We will look at ways of removing this structure when we discuss
speech coding.
Let us now introduce a quantizer into the loop and look at the performance of the DPCM
system. For simplicity, we will use a uniform quantizer. If we look at the histogram of the
residual sequence, we find that it is highly peaked. Therefore, we will assume that the input
to the quantizer will be Laplacian. We will also adjust the step size of the quantizer based on
the variance of the residual. The step sizes provided in Chapter 9 are based on the assumption
that the quantizer input has a unit variance. It is easy to show that when the variance
differs from unity, the optimal step size can be obtained by multiplying the step size for a
variance of one with the standard deviation of the input. Using this approach for a four-level
Laplacian quantizer, we obtain step sizes of 0.75, 0.59, and 0.43 for the first-, second-,
and third-order predictors, and step sizes of 0.3, 0.4, and 0.5 for an eight-level Laplacian
quantizer. We measure the performance using two different measures, the signal-to-noise
ratio (SNR) and the signal-to-prediction-error ratio. These are defined as follows:
    SNR(dB) = 10 log10 [ Σ_{i=1}^{M} x_i² / Σ_{i=1}^{M} (x_i − x̂_i)² ]      (11.36)

    SPER(dB) = 10 log10 [ Σ_{i=1}^{M} x_i² / Σ_{i=1}^{M} (x_i − p_i)² ]      (11.37)
The results are tabulated in Table 11.1. For comparison we have also included the
results when no prediction is used; that is, we directly quantize the input. Notice the large
difference between using a first-order predictor and a second-order predictor, and then
the relatively minor increase when going from a second-order predictor to a third-order
predictor. This is fairly typical when using a fixed quantizer.
Finally, let’s take a look at the reconstructed speech signal. The speech coded using
a third-order predictor and an eight-level quantizer is shown in Figure 11.9. Although the
reconstructed sequence looks like the original, notice that there is significant distortion in
areas where the source output values are small. This is because in these regions the input
to the quantizer is close to zero. Because the quantizer does not have a zero output level,
T A B L E 11 . 1 Performance of DPCM system
with different predictors and
quantizers.
Quantizer Predictor Order SNR (dB) SPER (dB)
Four-level None 2.43 0
1 3.37 2.65
2 8.35 5.9
3 8.74 6.1
Eight-level None 3.65 0
1 3.87 2.74
2 9.81 6.37
3 10.16 6.71
F I G U R E 11. 9 The reconstructed sequence using a third-order predictor and an
eight-level uniform quantizer.
the output of the quantizer flips between the two inner levels. If we listened to this signal,
we would hear a hissing sound in the reconstructed signal.
The speech signal used to generate this example is contained among the data sets
accompanying this book in the file testm.raw. The function readau.c can be used to
read the file. You are encouraged to reproduce the results in this example and listen to the
resulting reconstructions. 
If we look at the speech sequence in Figure 11.7, we can see that there are several
distinct segments of speech. Between sample number 700 and sample number 2000, the
speech looks periodic. Between sample number 2200 and sample number 3500, the speech
is low amplitude and noiselike. Given the distinctly different characteristics in these two
regions, it would make sense to use different approaches to encode these segments. Some
approaches to dealing with these issues are specific to speech coding, and we will discuss
these approaches when we specifically discuss encoding speech using DPCM. However, the
problem is also much more widespread than when encoding speech. A general response to
the nonstationarity of the input is the use of adaptation in prediction. We will look at some
of these approaches in the next section.
11 . 5 Adaptive DPCM
As DPCM consists of two main components, the quantizer and the predictor, making DPCM
adaptive means making the quantizer and the predictor adaptive. Recall that we can adapt
a system based on its input or output. The former approach is called forward adaptation;
the latter, backward adaptation. In the case of forward adaptation, the parameters of the
system are updated based on the input to the encoder, which is not available to the decoder.
Therefore, the updated parameters have to be sent to the decoder as side information. In the
case of backward adaptation, the adaptation is based on the output of the encoder. As this
output is also available to the decoder, there is no need for transmission of side information.
In cases where the predictor is adaptive, especially when it is backward adaptive, we
generally use adaptive quantizers (forward or backward). The reason for this is that the
backward adaptive predictor is adapted based on the quantized outputs. If for some reason
the predictor does not adapt properly at some point, this results in predictions that are far
from the input, and the residuals will be large. In a fixed quantizer, these large residuals will
tend to fall in the overload regions with consequently unbounded quantization errors. The
reconstructed values with these large errors will then be used to adapt the predictor, which
will result in the predictor moving further and further from the input.
The same constraint is not present for quantization, and we can have adaptive quantization
with fixed predictors.
11 . 5 . 1 A d a p t i v e Q u a n t i z a t i o n i n D P C M
In forward adaptive quantization, the input is divided into blocks. The quantizer parameters
are estimated for each block. These parameters are transmitted to the receiver as side
information. In DPCM, the quantizer is in a feedback loop, which means that the input to
the quantizer is not conveniently available in a form that can be used for forward adaptive
quantization. Therefore, most DPCM systems use backward adaptive quantization.
The backward adaptive quantization used in DPCM systems is basically a variation of
the backward adaptive Jayant quantizer described in Chapter 9. In Chapter 9, the Jayant
algorithm was used to adapt the quantizer to a stationary input. In DPCM, the algorithm is
used to adapt the quantizer to the local behavior of nonstationary inputs. Consider the speech
segment shown in Figure 11.7 and the residual sequence shown in Figure 11.8. Obviously,
the quantizer used around the 3000th sample should not be the same quantizer that was used
around the 1000th sample. The Jayant algorithm provides an effective approach to adapting
the quantizer to the variations in the input characteristics.
E x a m p l e 11 . 5 . 1:
Let’s encode the speech sample shown in Figure 11.7 using a DPCM system with a backward
adaptive quantizer. We will use a third-order predictor and an eight-level quantizer. We will
also use the following multipliers [110]:
M0 = 0.90,  M1 = 0.90,  M2 = 1.25,  M3 = 1.75
The results are shown in Figure 11.10. Notice the region at the beginning of the speech
sample and between the 3000th and 3500th sample, where the DPCM system with the
fixed quantizer had problems. Because the step size of the adaptive quantizer can become
quite small, these regions have been nicely reproduced. However, right after this region,
the speech output has a larger spike than the reconstructed waveform. This is an indication
that the quantizer is not expanding rapidly enough. This can be remedied by increasing the
F I G U R E 11. 10 The reconstructed sequence using a third-order predictor and an
eight-level Jayant quantizer.
value of M3. The program used to generate this example is dpcm_aqb. You can use this
program to study the behavior of the system for different configurations. 
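The adaptation used in this example is, at its core, a table of multipliers indexed by the
magnitude of the quantizer output. The sketch below is ours, not the code of dpcm_aqb; it shows
an eight-level Jayant quantizer with the multipliers quoted above, with the step-size limits
chosen arbitrarily.

/* Backward-adaptive (Jayant) quantizer for the difference sequence, using the
 * multipliers of Example 11.5.1. The step-size clamps are our own choices. */
#include <math.h>

#define LEVELS 8                       /* eight-level (3-bit) midrise quantizer */

static const double M[LEVELS/2] = { 0.90, 0.90, 1.25, 1.75 };   /* M0..M3 */

static double step = 0.5;              /* current step size, arbitrary start */
static const double step_min = 1e-4, step_max = 10.0;

/* Quantize one difference d; returns the reconstruction and adapts the step. */
double jayant_quantize(double d)
{
    /* index of the output magnitude: 0 is the inner level, 3 the outer level */
    int idx = (int)(fabs(d) / step);
    if (idx > LEVELS/2 - 1) idx = LEVELS/2 - 1;

    double dhat = (idx + 0.5) * step * (d < 0.0 ? -1.0 : 1.0);

    /* backward adaptation: expand after outer levels, contract after inner ones */
    step *= M[idx];
    if (step < step_min) step = step_min;
    if (step > step_max) step = step_max;
    return dhat;
}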
11 . 5 . 2 A d a p t i v e P r e d i c t i o n i n D P C M
The equations used to obtain the predictor coefficients were derived based on the assumption
of stationarity. However, we see from Figure 11.7 that this assumption is not true. In the
speech segment shown in Figure 11.7, different segments have different characteristics. This
is true for most sources we deal with; while the source output may be locally stationary over
any significant length of the output, the statistics may vary considerably. In this situation, it
is better to adapt the predictor to match the local statistics. This adaptation can be forward
adaptive or backward adaptive.
D P C M w i t h F o r w a r d A d a p t i v e P r e d i c t i o n ( D P C M - A P F )
In forward adaptive prediction, the input is divided into segments or blocks. In speech coding
this block consists of about 16 ms of speech. At a sampling rate of 8000 samples per second,
this corresponds to 128 samples per block [123, 167]. In image coding, we use an 8 × 8
block [168].
The autocorrelation coefficients are computed for each block. The predictor coefficients
are obtained from the autocorrelation coefficients and quantized using a relatively high-rate
quantizer. If the coefficient values are to be quantized directly, we need to use at least
12 bits per coefficient [123]. This number can be reduced considerably if we represent
the predictor coefficients in terms of parcor coefficients; we will describe how to obtain
the parcor coefficients in Chapter 17. For now, let’s assume that the coefficients can be
transmitted with an expenditure of about 6 bits per coefficient.
In order to estimate the autocorrelation for each block, we generally assume that the sam-
ple values outside each block are zero. Therefore, for a block length of M, the autocorrelation
function for the lth block would be estimated by
    R_xx^(l)(k) = (1/(M−k)) Σ_{i=(l−1)M+1}^{lM−k} x_i x_{i+k}         (11.38)

for k positive, or

    R_xx^(l)(k) = (1/(M+k)) Σ_{i=(l−1)M+1−k}^{lM} x_i x_{i+k}         (11.39)

for k negative. Notice that R_xx^(l)(k) = R_xx^(l)(−k), which agrees with our initial assumption.
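A direct C rendering of (11.38) is shown below (our sketch); treating the samples outside the
block as zero simply means the sum never reaches outside the block.

/* Estimate R_xx(k), k >= 0, for the l-th block of length M using (11.38), with
 * samples outside the block taken to be zero. x[] is the whole input, indexed
 * from 0; blocks are numbered l = 1, 2, ... */
double block_autocorr(const double *x, int M, int l, int k)
{
    double sum = 0.0;
    int start = (l - 1) * M;               /* first sample of block l (0-based) */
    for (int i = 0; i < M - k; i++)
        sum += x[start + i] * x[start + i + k];
    return sum / (M - k);
}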
D P C M w i t h B a c k w a r d A d a p t i v e P r e d i c t i o n ( D P C M - A P B )
Forward adaptive prediction requires that we buffer the input. This introduces delay in the
transmission of the speech. As the amount of buffering is small, the use of forward adaptive
prediction when there is only one encoder and decoder is not a big problem. However,
in the case of speech, the connection between two parties may be several links, each of
which may consist of a DPCM encoder and decoder. In such tandem links, the amount of
delay can become large enough to be a nuisance. Furthermore, the need to transmit side
information makes the system more complex. In order to avoid these problems, we can adapt
the predictor based on the output of the encoder, which is also available to the decoder. The
adaptation is done in a sequential manner [169, 167].
In our derivation of the optimum predictor coefficients, we took the derivative of the
statistical average of the squared prediction error or residual sequence. In order to do this,
we had to assume that the input process was stationary. Let us now remove that assumption
and try to figure out how to adapt the predictor to the input algebraically. To keep matters
simple, we will start with a first-order predictor and then generalize the result to higher
orders.
For a first-order predictor, the value of the residual squared at time n would be given by
    d_n² = (x_n − a_1 x̂_{n−1})²                                 (11.40)

If we could plot the value of d_n² against a1, we would get a graph similar to the one shown
in Figure 11.11. Let’s take a look at the derivative of d_n² as a function of whether the current
value of a1 is to the left or right of the optimal value of a1—that is, the value of a1 for
which d_n² is minimum. When a1 is to the left of the optimal value, the derivative is negative.
Furthermore, the derivative will have a larger magnitude when a1 is further away from the
optimal value. If we were asked to adapt a1, we would add to the current value of a1. The
amount to add would be large if a1 was far from the optimal value, and small if a1 was
close to the optimal value. If the current value was to the right of the optimal value, the
derivative would be positive, and we would subtract some amount from a1 to adapt it. The
F I G U R E 11. 11 A plot of the residual squared versus the predictor coefficient.
amount to subtract would be larger if we were further from the optimal, and as before, the
derivative would have a larger magnitude if a1 were further from the optimal value.
At any given time, in order to adapt the coefficient at time n + 1, we add an amount
proportional to the magnitude of the derivative with a sign that is opposite to that of the
derivative of d_n² at time n:

    a_1^(n+1) = a_1^(n) − α ∂d_n²/∂a_1                           (11.41)

where α is some proportionality constant.

    ∂d_n²/∂a_1 = −2(x_n − a_1 x̂_{n−1}) x̂_{n−1}                   (11.42)
               = −2 d_n x̂_{n−1}                                  (11.43)

Substituting this into (11.41), we get

    a_1^(n+1) = a_1^(n) + α d_n x̂_{n−1}                          (11.44)

where we have absorbed the 2 into α. The residual value d_n is available only to the encoder.
Therefore, in order for both the encoder and decoder to use the same algorithm, we replace
d_n by d̂_n in (11.44) to obtain

    a_1^(n+1) = a_1^(n) + α d̂_n x̂_{n−1}                          (11.45)

Extending this adaptation equation for a first-order predictor to an Nth-order predictor
is relatively easy. The equation for the squared prediction error is given by

    d_n² = (x_n − Σ_{i=1}^{N} a_i x̂_{n−i})²                      (11.46)
Taking the derivative with respect to a_j will give us the adaptation equation for the jth
predictor coefficient:

    a_j^(n+1) = a_j^(n) + α d̂_n x̂_{n−j}                          (11.47)

We can combine all N equations in vector form to get

    A^(n+1) = A^(n) + α d̂_n X̂_{n−1}                              (11.48)

where

    X̂_n = [ x̂_n  x̂_{n−1}  ···  x̂_{n−N+1} ]^T                     (11.49)
This particular adaptation algorithm is called the least mean squared (LMS) algorithm [170].
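Inside the DPCM loop, the LMS update is one multiply-add per coefficient. The sketch below is
ours; the value of the adaptation constant is a design choice that the derivation above leaves
open.

/* Backward-adaptive (LMS) coefficient update of Equations (11.47)-(11.48), as
 * it would sit inside the DPCM loop sketched earlier. xhat_hist[j] holds
 * x̂_{n-1-j}, and dhat is the quantized difference d̂_n, which is available to
 * both encoder and decoder. The adaptation constant MU is our choice. */
#define NP 3
#define MU 0.01

static double a[NP];                   /* predictor coefficients, start at zero */

void lms_update(const double *xhat_hist, double dhat)
{
    for (int j = 0; j < NP; j++)
        a[j] += MU * dhat * xhat_hist[j];     /* a_j <- a_j + mu d̂_n x̂_{n-j} */
}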
11 . 6 Delta Modulation
A very simple form of DPCM that has been widely used in a number of speech-coding
applications is the delta modulator (DM). The DM can be viewed as a DPCM system with
a 1-bit (two-level) quantizer. With a two-level quantizer with output values ±Δ, we can
only represent a sample-to-sample difference of Δ. If, for a given source sequence, the
sample-to-sample difference is often very different from Δ, then we may incur substantial
distortion. One way to limit the difference is to sample more often. In Figure 11.12 we see
a signal that has been sampled at two different rates. The lower-rate samples are shown
by open circles, while the higher-rate samples are represented by +. It is apparent that the
lower-rate samples are further apart in value.
The rate at which a signal is sampled is governed by the highest frequency component of
a signal. If the highest frequency component in a signal is W, then in order to obtain an exact
reconstruction of the signal, we need to sample it at least at twice the highest frequency, or
2W. In systems that use delta modulation, we usually sample the signal at much more than
twice the highest frequency. If Fs is the sampling frequency, then the ratio of Fs to 2W can
range from almost 1 to almost 100 [123]. The higher sampling rates are used for high-quality
A/D converters, while the lower rates are more common for low-rate speech coders.
F I G U R E 11. 12 A signal sampled at two different rates.
F I G U R E 11. 13 A source output sampled and coded using delta modulation (the
granular and slope overload regions are marked).
If we look at a block diagram of a delta modulation system, we see that, while the block
diagram of the encoder is identical to that of the DPCM system, the standard DPCM decoder
is followed by a filter. The reason for the existence of the filter is evident from Figure 11.13,
where we show a source output and the unfiltered reconstruction. The samples of the source
output are represented by the filled circles. As the source is sampled at several times the
highest frequency, the staircase shape of the reconstructed signal results in distortion in
frequency bands outside the band of frequencies occupied by the signal. The filter can be
used to remove these spurious frequencies.
The reconstruction shown in Figure 11.13 was obtained with a delta modulator using a
fixed quantizer. Delta modulation systems that use a fixed step size are often referred to as
linear delta modulators. Notice that the reconstructed signal shows one of two behaviors.
In regions where the source output is relatively constant, the output alternates up or down
by Δ; these regions are called the granular regions. In the regions where the source output
rises or falls fast, the reconstructed output cannot keep up; these regions are called the slope
overload regions. If we want to reduce the granular error, we need to make the step size
Δ small. However, this will make it more difficult for the reconstruction to follow rapid
changes in the input. In other words, it will result in an increase in the overload error. To
avoid the overload condition, we need to make the step size large so that the reconstruction
can quickly catch up with rapid changes in the input. However, this will increase the granular
error.
One way to avoid this impasse is to adapt the step size to the characteristics of the
input, as shown in Figure 11.14. In quasi-constant regions, make the step size small in order
to reduce the granular error. In regions of rapid change, increase the step size in order to
reduce overload error. There are various ways of adapting the delta modulator to the local
characteristics of the source output. We describe two of the more popular ways here.
11 . 6 . 1 C o n s t a n t F a c t o r A d a p t i v e D e l t a M o d u l a t i o n
( C F D M )
The objective of adaptive delta modulation is clear: increase the step size in overload regions
and decrease it in granular regions. The problem lies in knowing when the system is in each
of these regions. Looking at Figure 11.13, we see that in the granular region the output of
F I G U R E 11. 14 A source output sampled and coded using adaptive delta
modulation (the granular and slope overload regions are marked).
the quantizer changes sign with almost every input sample; in the overload region, the sign
of the quantizer output is the same for a string of input samples. Therefore, we can define
an overload or granular condition based on whether the output of the quantizer has been
changing signs. A very simple system [171] uses a history of one sample to decide whether
the system is in overload or granular condition and whether to expand or contract the step
size. If s_n denotes the sign of the quantizer output d̂_n,

    s_n = {  1   if d̂_n > 0
            −1   if d̂_n < 0                                     (11.50)

the adaptation logic is given by

    Δ_n = { M1 Δ_{n−1}   if s_n = s_{n−1}
            M2 Δ_{n−1}   if s_n ≠ s_{n−1}                        (11.51)

where M1 = 1/M2 = M ≥ 1. In general, M < 2.
By increasing the memory, we can improve the response of the CFDM system. For
example, if we looked at two past samples, we could decide that the system was moving
from overload to granular condition if the sign had been the same for the past two samples
and then changed with the current sample:
    s_n ≠ s_{n−1} = s_{n−2}                                      (11.52)
In this case it would be reasonable to assume that the step size had been expanding previously
and, therefore, needed a sharp contraction. If
    s_n = s_{n−1} ≠ s_{n−2}                                      (11.53)
then it would mean that the system was probably entering the overload region, while
    s_n = s_{n−1} = s_{n−2}                                      (11.54)
would mean the system was in overload and the step size should be expanded rapidly.
For the encoding of speech, the following multipliers Mi are recommended by [172] for
a CFDM system with two-sample memory:
    s_n ≠ s_{n−1} = s_{n−2}:   M1 = 0.4                          (11.55)
    s_n ≠ s_{n−1} ≠ s_{n−2}:   M2 = 0.9                          (11.56)
    s_n = s_{n−1} ≠ s_{n−2}:   M3 = 1.5                          (11.57)
    s_n = s_{n−1} = s_{n−2}:   M4 = 2.0                          (11.58)
The amount of memory can be increased further with a concurrent increase in complexity.
The space shuttle used a delta modulator with a memory of seven [173].
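A one-sample-memory CFDM encoder following (11.50) and (11.51) can be written in a few lines.
The sketch below is ours; the value of M and the initial step size are illustrative, not values
taken from [171].

/* Constant factor adaptive delta modulation (CFDM) with one-sample memory,
 * following (11.50)-(11.51). */
#define M_EXPAND 1.5                       /* M1 = M; M2 = 1/M */

void cfdm_encode(const double *x, int *bits, double *xhat, int n)
{
    double step = 0.1, recon = 0.0;
    int s_prev = 1;
    for (int k = 0; k < n; k++) {
        int s = (x[k] - recon >= 0.0) ? 1 : -1;     /* sign of d̂_n        */
        step *= (s == s_prev) ? M_EXPAND             /* overload: expand   */
                              : 1.0 / M_EXPAND;      /* granular: contract */
        recon  += s * step;                          /* staircase output   */
        bits[k] = (s > 0);                           /* 1 bit per sample   */
        xhat[k] = recon;
        s_prev  = s;
    }
}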
11 . 6 . 2 C o n t i n u o u s l y V a r i a b l e S l o p e D e l t a
M o d u l a t i o n
The CFDM systems described use a rapid adaptation scheme. For low-rate speech coding,
it is more pleasing if the adaptation is over a longer period of time. This slower adaptation
results in a decrease in the granular error and generally an increase in overload error. Delta
modulation systems that adapt over longer periods of time are referred to as syllabically
companded. A popular class of syllabically companded delta modulation systems is the
continuously variable slope delta modulation (CVSD) systems.
The adaptation logic used in CVSD systems is as follows [123]:

    Δ_n = β Δ_{n−1} + α_n Δ_0                                    (11.59)

where β is a number less than but close to one, and α_n is equal to one if J of the last K
quantizer outputs were of the same sign. That is, we look in a window of length K to obtain
the behavior of the source output. If this condition is not satisfied, then α_n is equal to zero.
Standard values for J and K are J = 3 and K = 3.
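The CVSD update of (11.59) then looks as follows in C (our sketch; the values of β and Δ0 are
illustrative).

/* CVSD step-size update of Equation (11.59) with J = K = 3. signs[] holds the
 * signs (+1 or -1) of the last three quantizer outputs. */
#define K_WIN 3

static const double beta = 0.95, delta0 = 0.01;

double cvsd_step_update(double step_prev, const int signs[K_WIN])
{
    /* alpha_n = 1 if J of the last K outputs have the same sign;
     * with J = K = 3 that means all three agree. */
    int alpha = (signs[0] == signs[1] && signs[1] == signs[2]) ? 1 : 0;
    return beta * step_prev + alpha * delta0;        /* Equation (11.59) */
}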
11 . 7 Speech Coding
Differential encoding schemes are immensely popular for speech encoding. They are used in
the telephone system, voice messaging, and multimedia applications, among others. Adaptive
DPCM is a part of several international standards (ITU-T G.721, ITU G.723, ITU G.726,
ITU-T G.722), which we will look at here and in later chapters.
Before we do that, let’s take a look at one issue specific to speech coding. In Figure 11.7,
we see that there is a segment of speech that looks highly periodic. We can see this periodicity
if we plot the autocorrelation function of the speech segment (Figure 11.15).
The autocorrelation peaks at a lag value of 47 and multiples of 47. This indicates a
periodicity of 47 samples. This period is called the pitch period. The predictor we originally
designed did not take advantage of this periodicity, as the largest predictor was a third-order
predictor, and this periodic structure takes 47 samples to show up. We can take advantage
of this periodicity by constructing an outer prediction loop around the basic DPCM structure
F I G U R E 11. 15 Autocorrelation function for test.snd.
F I G U R E 11. 16 The DPCM structure with a pitch predictor.
as shown in Figure 11.16. This can be a simple single coefficient predictor of the form
b x̂_{n−τ}, where τ is the pitch period. Using this system on testm.raw, we get the residual
sequence shown in Figure 11.17. Notice the decrease in amplitude in the periodic portion of
the speech.
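The pitch period itself can be found the same way it was observed here: as the lag of the
largest peak of the autocorrelation function. The following C sketch is ours; the search range
is a parameter we introduce, since the text does not specify one.

/* Find the pitch period as the lag of the largest autocorrelation peak within
 * a search range. For the segment in Figure 11.7 the peak is at a lag of 47
 * samples. */
int pitch_period(const double *x, int n, int min_lag, int max_lag)
{
    int best_lag = min_lag;
    double best_val = -1e30;
    for (int lag = min_lag; lag <= max_lag; lag++) {
        double r = 0.0;
        for (int i = 0; i + lag < n; i++)
            r += x[i] * x[i + lag];
        r /= (n - lag);                    /* same normalization as (11.35) */
        if (r > best_val) { best_val = r; best_lag = lag; }
    }
    return best_lag;                       /* pitch period in samples */
}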
Finally, remember that we have been using mean squared error as the distortion measure
in all of our discussions. However, perceptual tests do not always correlate with the mean
squared error. The level of distortion we perceive is often related to the level of the speech
signal. In regions where the speech signal is of higher amplitude, we have a harder time
perceiving the distortion, but the same amount of distortion in a different frequency band
might be very perceptible. We can take advantage of this by shaping the quantization error
so that most of the error lies in the region where the signal has a higher amplitude. This
variation of DPCM is called noise feedback coding (NFC) (see [123] for details).
F I G U R E 11. 17 The residual sequence using the DPCM system with a pitch
predictor.
11 . 7 . 1 G . 7 2 6
The International Telecommunications Union has published recommendations for a standard
ADPCM system, including recommendations G.721, G.723, and G.726. G.726 supersedes
G.721 and G.723. In this section we will describe the G.726 recommendation for ADPCM
systems at rates of 40, 32, 24, and 16 kbits per second.
T h e Q u a n t i z e r
The recommendation assumes that the speech output is sampled at the rate of 8000 samples
per second, so the rates of 40, 32, 24, and 16 kbits per second translate to 5 bits per sample,
4 bits per sample, 3 bits per sample, and 2 bits per sample. Comparing this to the PCM
rate of 8 bits per sample, this would mean compression ratios of 1.6:1, 2:1, 2.67:1, and 4:1.
Except for the 16 kbits per second system, the number of levels in the quantizer is 2^(n_b) − 1,
where n_b is the number of bits per sample. Thus, the number of levels in the quantizer is
odd, which means that for the higher rates we use a midtread quantizer.
The quantizer is a backward adaptive quantizer with an adaptation algorithm that is simi-
lar to the Jayant quantizer. The recommendation describes the adaptation of the quantization
interval in terms of the adaptation of a scale factor. The input d_k is normalized by a scale
factor α_k. This normalized value is quantized, and the normalization removed by multiplying
with α_k. In this way the quantizer is kept fixed and α_k is adapted to the input. Therefore,
for example, instead of expanding the step size, we would increase the value of α_k.
The fixed quantizer is a nonuniform midtread quantizer. The recommendation describes
the quantization boundaries and reconstruction values in terms of the log of the scaled input.
The input-output characteristics for the 24 kbit system are shown in Table 11.2. An output
value of −∞ in the table corresponds to a reconstruction value of 0.
T A B L E 11 . 2 Recommended input-output
characteristics of the quantizer for
24-kbits-per-second operation.
Input Range: log2(|d_k|/α_k)      Label I_k      Output: log2(|d_k|/α_k)
[2.58, ∞)                             3              2.91
[1.70, 2.58)                          2              2.13
[0.06, 1.70)                          1              1.05
(−∞, 0.06)                            0              −∞
The adaptation algorithm is described in terms of the logarithm of the scale factor

    y(k) = log2 α_k                                              (11.60)

The adaptation of the scale factor or its log y(k) depends on whether the input is speech
or speechlike, where the sample-to-sample difference can fluctuate considerably, or whether
the input is voice-band data, which might be generated by a modem, where the sample-
to-sample fluctuation is quite small. In order to handle both these situations, the scale
factor is composed of two values, a locked slow scale factor for when the sample-to-sample
differences are quite small, and an unlocked value for when the input is more dynamic:

    y(k) = a_l(k) y_u(k−1) + [1 − a_l(k)] y_l(k−1)               (11.61)

The value of a_l(k) depends on the variance of the input. It will be close to one for speech
inputs and close to zero for tones and voice-band data.
The unlocked scale factor is adapted using the Jayant algorithm with one slight modifi-
cation. If we were to use the Jayant algorithm, the unlocked scale factor could be adapted as

    α_u(k) = α_{k−1} M(I_{k−1})                                  (11.62)

where M(·) is the multiplier. In terms of logarithms, this becomes

    y_u(k) = y(k−1) + log M(I_{k−1})                             (11.63)

The modification consists of introducing some memory into the adaptive process so that the
encoder and decoder converge following transmission errors:

    y_u(k) = (1 − ε) y(k−1) + ε W(I_{k−1})                       (11.64)

where W(·) = log M(·), and ε = 2^−5.
The locked scale factor is obtained from the unlocked scale factor through

    y_l(k) = (1 − β) y_l(k−1) + β y_u(k),    β = 2^−6            (11.65)
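To make the scale factor adaptation concrete, here is a minimal Python sketch of the two-part (locked/unlocked) update of Equations (11.61), (11.64), and (11.65). The multiplier table W_TABLE and the way $a_l(k)$ would be obtained from the input are placeholders of our own, not values taken from the recommendation.

```python
# A sketch of the G.726-style two-part scale factor adaptation.
# W_TABLE and the choice of a_l are illustrative assumptions only.
W_TABLE = [-0.75, 1.13]   # hypothetical log-domain multipliers W(|I|)

EPS = 2 ** -5             # epsilon in Eq. (11.64)
GAMMA = 2 ** -6           # gamma in Eq. (11.65)

def update_scale_factor(y_u, y_l, a_l, label):
    """One step of the locked/unlocked scale factor adaptation."""
    # Combined log scale factor, Eq. (11.61)
    y = a_l * y_u + (1.0 - a_l) * y_l
    # Unlocked update with leakage toward the multiplier table, Eq. (11.64)
    y_u_new = (1.0 - EPS) * y + EPS * W_TABLE[abs(label)]
    # Locked scale factor follows the unlocked one slowly, Eq. (11.65)
    y_l_new = (1.0 - GAMMA) * y_l + GAMMA * y_u_new
    alpha = 2.0 ** y      # the scale factor itself, Eq. (11.60)
    return y_u_new, y_l_new, alpha
```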
T h e P r e d i c t o r
The recommended predictor is a backward adaptive predictor that uses a linear combination
of the past two reconstructed values as well as the six past quantized differences to generate
the prediction
$p_k = \sum_{i=1}^{2} a_i^{(k-1)} \hat{x}_{k-i} + \sum_{i=1}^{6} b_i^{(k-1)} \hat{d}_{k-i}$  (11.66)
The set of predictor coefficients is updated using a simplified form of the LMS algorithm:
$a_1^{(k)} = (1 - 2^{-8})\, a_1^{(k-1)} + 3 \times 2^{-8}\, \mathrm{sgn}(z_k)\, \mathrm{sgn}(z_{k-1})$  (11.67)
$a_2^{(k)} = (1 - 2^{-7})\, a_2^{(k-1)} + 2^{-7}\left[\mathrm{sgn}(z_k)\,\mathrm{sgn}(z_{k-2}) - f\!\left(a_1^{(k-1)}\right)\mathrm{sgn}(z_k)\,\mathrm{sgn}(z_{k-1})\right]$  (11.68)
where
$z_k = \hat{d}_k + \sum_{i=1}^{6} b_i^{(k-1)} \hat{d}_{k-i}$  (11.69)
$f(\beta) = \begin{cases} 4\beta & |\beta| \le \frac{1}{2} \\ 2\,\mathrm{sgn}(\beta) & |\beta| > \frac{1}{2} \end{cases}$  (11.70)
The coefficients $b_i$ are updated using the following equation:
$b_i^{(k)} = (1 - 2^{-8})\, b_i^{(k-1)} + 2^{-7}\, \mathrm{sgn}(\hat{d}_k)\, \mathrm{sgn}(\hat{d}_{k-i})$  (11.71)
Notice that in the adaptive algorithms we have replaced products of reconstructed values
and products of quantizer outputs with products of their signs. This is computationally
much simpler and does not lead to any significant degradation of the adaptation process.
Furthermore, the values of the coefficients are selected such that multiplication with these
coefficients can be accomplished using shifts and adds. The predictor coefficients are all set
to zero when the input moves from tones to speech.
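As a small illustration of this sign-sign style of adaptation, here is a Python sketch of the update of the six $b_i$ coefficients in Equation (11.71); the function and argument names are our own, and the bookkeeping of past quantized differences is simplified.

```python
def sgn(x):
    """Sign function; zero is treated as positive here (an assumption)."""
    return 1.0 if x >= 0 else -1.0

def update_b(b, d_hat):
    """Sign-sign update of the six coefficients, Eq. (11.71).

    b     : list of 6 current coefficients b_i^{(k-1)}
    d_hat : list of 7 quantized differences, d_hat[0] = current d_k,
            d_hat[i] = d_{k-i}
    """
    leak = 1.0 - 2 ** -8   # leakage keeps the coefficients bounded
    step = 2 ** -7
    return [leak * b[i] + step * sgn(d_hat[0]) * sgn(d_hat[i + 1])
            for i in range(6)]
```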
11 . 8 Image Coding
We saw in Chapter 7 that differential encoding provided an efficient approach to the lossless
compression of images. The case for using differential encoding in the lossy compression of
images has not been made as clearly. In the early days of image compression, both differential
encoding and transform coding were popular forms of lossy image compression. At the
current time differential encoding has a much more restricted role as part of other compression
strategies. Several currently popular approaches to image compression decompose the image
into lower and higher frequency components. As low-frequency signals have high sample-to-
sample correlation, several schemes use differential encoding to compress the low-frequency
components. We will see this use of differential encoding when we look at subband- and
wavelet-based compression schemes and, to a lesser extent, when we study transform coding.
For now let us look at the performance of a couple of stand-alone differential image com-
pression schemes. We will compare the performance of these schemes with the performance
of the JPEG compression standard.
Consider a simple differential encoding scheme in which the predictor $p(j,k)$ for the pixel in the $j$th row and the $k$th column is given by
$p(j,k) = \begin{cases} \hat{x}(j, k-1) & k > 0 \\ \hat{x}(j-1, k) & k = 0,\ j > 0 \\ 128 & j = 0,\ k = 0 \end{cases}$
where $\hat{x}(j,k)$ is the reconstructed pixel in the $j$th row and $k$th column. We use this predictor
in conjunction with a fixed four-level uniform quantizer and code the quantizer output
using an arithmetic coder. The coding rate for the compressed image is approximately
1 bit per pixel. We compare this reconstructed image with a JPEG-coded image at the
same rate in Figure 11.18. The signal-to-noise ratio for the differentially encoded image is
22.33 dB (PSNR 31.42 dB) and for the JPEG-encoded image is 32.52 dB (PSNR 41.60 dB),
a difference of more than 10 dB!
However, this is an extremely simple system compared to the JPEG standard, which has
been fine-tuned for encoding images. Let’s make our differential encoding system slightly
more complicated by replacing the uniform quantizer with a recursively indexed quantizer
and the predictor by a somewhat more complicated predictor. For each pixel (except for the
boundary pixels) we compute the following three values:
$p_1 = 0.5 \times \hat{x}(j-1, k) + 0.5 \times \hat{x}(j, k-1)$  (11.72)
$p_2 = 0.5 \times \hat{x}(j-1, k-1) + 0.5 \times \hat{x}(j, k-1)$
$p_3 = 0.5 \times \hat{x}(j-1, k-1) + 0.5 \times \hat{x}(j-1, k)$
F I G U R E 11. 18 Left: Reconstructed image using differential encoding at 1 bit per
pixel. Right: Reconstructed image using JPEG at 1 bit per pixel.
F I G U R E 11. 19 Left: Reconstructed image using differential encoding at 1 bit per
pixel using median predictor and recursively indexed quantizer.
Right: Reconstructed image using JPEG at 1 bit per pixel.
then obtain the predicted value as
$p(j,k) = \mathrm{median}(p_1, p_2, p_3)$
For the boundary pixels we use the simple prediction scheme. At a coding rate of 1
bit per pixel, we obtain the image shown in Figure 11.19. For reference we show it next
to the JPEG-coded image at the same rate. The signal-to-noise ratio for this reconstruction
is 29.20 dB (PSNR 38.28 dB). We have made up two-thirds of the difference using some
relatively minor modifications. We can see that it might be feasible to develop differential
encoding schemes that are competitive with other image compression techniques. Therefore,
it makes sense not to dismiss differential encoding out of hand when we need to develop
image compression systems.
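A compact Python sketch of this median-predictor DPCM loop is shown below. It is only an illustration of the prediction structure: the four-level quantizer is a simple fixed uniform quantizer standing in for the recursively indexed quantizer, the step size is an arbitrary choice, and the entropy coding stage is omitted.

```python
import numpy as np

def quantize(e, step=16.0, levels=4):
    """Fixed uniform quantizer on the prediction error (an assumed
    stand-in for the recursively indexed quantizer in the text)."""
    half = levels // 2
    idx = int(np.clip(np.round(e / step), -half, half - 1))
    return idx * step

def dpcm_median(image):
    """DPCM with the median predictor of Eq. (11.72).

    image is a 2-D numpy array; the returned array is the reconstruction
    the decoder would also produce, since prediction uses reconstructed
    pixels only.
    """
    rows, cols = image.shape
    recon = np.zeros((rows, cols), dtype=float)
    for j in range(rows):
        for k in range(cols):
            if j == 0 and k == 0:
                pred = 128.0                      # corner pixel
            elif j == 0:
                pred = recon[j, k - 1]            # simple boundary scheme
            elif k == 0:
                pred = recon[j - 1, k]
            else:
                p1 = 0.5 * recon[j - 1, k] + 0.5 * recon[j, k - 1]
                p2 = 0.5 * recon[j - 1, k - 1] + 0.5 * recon[j, k - 1]
                p3 = 0.5 * recon[j - 1, k - 1] + 0.5 * recon[j - 1, k]
                pred = np.median([p1, p2, p3])
            recon[j, k] = pred + quantize(image[j, k] - pred)
    return recon
```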
11 . 9 Summary
In this chapter we described some of the more well-known differential encoding techniques.
Although differential encoding does not provide compression as high as vector quantiza-
tion, it is very simple to implement. This approach is especially suited to the encoding
of speech, where it has found broad application. The DPCM system consists of two main
components, the quantizer and the predictor. We spent a considerable amount of time dis-
cussing the quantizer in Chapter 9, so most of the discussion in this chapter focused on
the predictor. We have seen different ways of making the predictor adaptive, and looked at
some of the improvements to be obtained from source-specific modifications to the predictor
design.
F u r t h e r R e a d i n g
1. Digital Coding of Waveforms, by N.S. Jayant and P. Noll [123], contains some very
detailed and highly informative chapters on differential encoding.
2. “Adaptive Prediction in Speech Differential Encoding Systems,” by J.D. Gibson [167],
is a comprehensive treatment of the subject of adaptive prediction.
3. A real-time video coding system based on DPCM has been developed by NASA.
Details can be found in [174].
11 . 10 Projects and Problems
1. Generate an AR(1) process using the relationship
$x_n = 0.9 \times x_{n-1} + \epsilon_n$
where $\epsilon_n$ is the output of a Gaussian random number generator (this is option 2 in rangen).
(a) Encode this sequence using a DPCM system with a one-tap predictor with pre-
dictor coefficient 0.9 and a three-level Gaussian quantizer. Compute the variance
of the prediction error. How does this compare with the variance of the input?
How does the variance of the prediction error compare with the variance of the $\epsilon_n$ sequence?
(b) Repeat using predictor coefficient values of 0.5, 0.6, 0.7, 0.8, and 1.0. Comment
on the results.
2. Generate an AR(5) process using the following coefficients: 1.381, 0.6, 0.367, −0.7, 0.359.
(a) Encode this with a DPCM system with a 3-bit Gaussian nonuniform quantizer and
a first-, second-, third-, fourth-, and fifth-order predictor. Obtain these predictors
by solving (11.30). For each case compute the variance of the prediction error
and the SNR in dB. Comment on your results.
(b) Repeat using a 3-bit Jayant quantizer.
3. DPCM can also be used for encoding images. Encode the Sinan image using a one-tap
predictor of the form
$\hat{x}_{i,j} = a \times x_{i,j-1}$
and a 2-bit quantizer. Experiment with quantizers designed for different distributions.
Comment on your results.
4. Repeat the image-coding experiment of the previous problem using a Jayant quantizer.
5. DPCM-encode the Sinan, Elif, and bookshelf1 images using a one-tap predictor and a
four-level quantizer followed by a Huffman coder. Repeat using a five-level quantizer.
Compute the SNR for each case, and compare the rate distortion performances.
11.10 Projects and Problems 353
6. We want to DPCM-encode images using a two-tap predictor of the form
$\hat{x}_{i,j} = a \times x_{i,j-1} + b \times x_{i-1,j}$
and a four-level quantizer followed by a Huffman coder. Find the equations we need
to solve to obtain coefficients a and b that minimize the mean squared error.
7. (a) DPCM-encode the Sinan, Elif, and bookshelf1 images using a two-tap predictor
and a four-level quantizer followed by a Huffman coder.
(b) Repeat using a five-level quantizer. Compute the SNR and rate (in bits per pixel)
for each case.
(c) Compare the rate distortion performances obtained using the one-tap and two-tap
predictors.
12
Mathematical Preliminaries for
Transforms, Subbands, and
Wavelets
12 . 1 Overview
In this chapter we will review some of the mathematical background necessary
for the study of transforms, subbands, and wavelets. The topics include Fourier
series, Fourier transforms, and their discrete counterparts. We will also look
at sampling and briefly review some linear system concepts.
12 . 2 Introduction
The roots of many of the techniques we will study can be found in the mathematical
literature. Therefore, in order to understand the techniques, we will need some mathematical
background. Our approach in general will be to introduce the mathematical tools just prior
to when they are needed. However, there is a certain amount of background that is required
for most of what we will be looking at. In this chapter we will present only that material that
is a common background to all the techniques we will be studying. Our approach will be
rather utilitarian; more sophisticated coverage of these topics can be found in [175]. We will
be introducing a rather large number of concepts, many of which depend on each other.
In order to make it easier for you to find a particular concept, we will identify the paragraph
in which the concept is first introduced.
We will begin our coverage with a brief introduction to the concept of vector spaces, and
in particular the concept of the inner product. We will use these concepts in our description
of Fourier series and Fourier transforms. Next is a brief overview of linear systems, then
a look at the issues involved in sampling a function. Finally, we will revisit the Fourier
concepts in the context of sampled functions and provide a brief introduction to Z-transforms.
Throughout, we will try to get a physical feel for the various concepts.
12 . 3 Vector Spaces
The techniques we will be using to obtain compression will involve manipulations and
decompositions of (sampled) functions of time. In order to do this we need some sort of
mathematical framework. This framework is provided through the concept of vector spaces.
We are very familiar with vectors in two- or three-dimensional space. An example of a
vector in two-dimensional space is shown in Figure 12.1. This vector can be represented in
a number of different ways: we can represent it in terms of its magnitude and direction, or
we can represent it as a weighted sum of the unit vectors in the x and y directions, or we
can represent it as an array whose components are the coefficients of the unit vectors. Thus,
the vector v in Figure 12.1 has a magnitude of 5 and an angle of 36.86 degrees,
$\mathbf{v} = 4\mathbf{u}_x + 3\mathbf{u}_y$
and
$\mathbf{v} = \begin{bmatrix} 4 \\ 3 \end{bmatrix}.$
We can view the second representation as a decomposition of v into simpler building blocks, namely, the basis vectors. The nice thing about this is that any vector in two dimensions can be decomposed in exactly the same way. Given a particular vector a and a
F I G U R E 12. 1 A vector.
basis set (more on this later), decomposition means finding the coefficients with which to
weight the unit vectors of the basis set. In our simple example it is easy to see what these
coefficients should be. However, we will encounter situations where it is not a trivial task
to find the coefficients that constitute the decomposition of the vector. We therefore need
some machinery to extract these coefficients. The particular machinery we will use here is
called the dot product or the inner product.
12 . 3 . 1 D o t o r I n n e r P r o d u c t
Given two vectors a and b such that
$\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$
the inner product between a and b is defined as
$\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2.$
Two vectors are said to be orthogonal if their inner product is zero. A set of vectors is
said to be orthogonal if each vector in the set is orthogonal to every other vector in the set.
The inner product between a vector and a unit vector from an orthogonal basis set will give
us the coefficient corresponding to that unit vector. It is easy to see that this is indeed so.
We can write $\mathbf{u}_x$ and $\mathbf{u}_y$ as
$\mathbf{u}_x = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad \mathbf{u}_y = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$
These are obviously orthogonal. Therefore, the coefficient $a_1$ can be obtained by
$\mathbf{a} \cdot \mathbf{u}_x = a_1 \times 1 + a_2 \times 0 = a_1$
and the coefficient of $\mathbf{u}_y$ can be obtained by
$\mathbf{a} \cdot \mathbf{u}_y = a_1 \times 0 + a_2 \times 1 = a_2.$
The inner product between two vectors is in some sense a measure of how “similar” they
are, but we have to be a bit careful in how we define “similarity.” For example, consider
the vectors in Figure 12.2. The vector a is closer to ux than to uy. Therefore a ·ux will be
greater than a ·uy. The reverse is true for b.
12 . 3 . 2 V e c t o r S p a c e
In order to handle not just two- or three-dimensional vectors but general sequences and
functions of interest to us, we need to generalize these concepts. Let us begin with a more
general definition of vectors and the concept of a vector space.
A vector space consists of a set of elements called vectors that have the operations
of vector addition and scalar multiplication defined on them. Furthermore, the
results of these operations are also elements of the vector space.
F I G U R E 12. 2 Example of different vectors.
By vector addition of two vectors, we mean the vector obtained by the pointwise addition
of the components of the two vectors. For example, given two vectors a and b:
$\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$  (12.1)
the vector addition of these two vectors is given as
$\mathbf{a} + \mathbf{b} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \\ a_3 + b_3 \end{bmatrix}$  (12.2)
By scalar multiplication, we mean the multiplication of a vector with a real or complex
number. For this set of elements to be a vector space it has to satisfy certain axioms.
Suppose V is a vector space; $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$ are vectors; and $\alpha$ and $\beta$ are scalars. Then the following axioms are satisfied:
1. $\mathbf{x} + \mathbf{y} = \mathbf{y} + \mathbf{x}$ (commutativity).
2. $(\mathbf{x} + \mathbf{y}) + \mathbf{z} = \mathbf{x} + (\mathbf{y} + \mathbf{z})$ and $\alpha(\beta\mathbf{x}) = (\alpha\beta)\mathbf{x}$ (associativity).
3. There exists an element $\theta$ in V such that $\mathbf{x} + \theta = \mathbf{x}$ for all $\mathbf{x}$ in V. $\theta$ is called the additive identity.
4. $\alpha(\mathbf{x} + \mathbf{y}) = \alpha\mathbf{x} + \alpha\mathbf{y}$, and $(\alpha + \beta)\mathbf{x} = \alpha\mathbf{x} + \beta\mathbf{x}$ (distributivity).
5. $1 \cdot \mathbf{x} = \mathbf{x}$, and $0 \cdot \mathbf{x} = \theta$.
6. For every $\mathbf{x}$ in V, there exists a $(-\mathbf{x})$ such that $\mathbf{x} + (-\mathbf{x}) = \theta$.
A simple example of a vector space is the set of real numbers. In this set zero is
the additive identity. We can easily verify that the set of real numbers with the standard
operations of addition and multiplication obeys the axioms stated above. See if you can verify
that the set of real numbers is a vector space. One of the advantages of this exercise is to
emphasize the fact that a vector is more than a line with an arrow at its end.
E x a m p l e 12 . 3 . 1:
Another example of a vector space that is of more practical interest to us is the set of all functions $f(t)$ with finite energy. That is,
$\int_{-\infty}^{\infty} |f(t)|^2\, dt < \infty.$  (12.3)
Let's see if this set constitutes a vector space. If we define addition as pointwise addition and scalar multiplication in the usual manner, the set of functions $f(t)$ obviously satisfies axioms 1, 2, and 4.
• If $f(t)$ and $g(t)$ are functions with finite energy, and $\alpha$ is a scalar, then the functions $f(t) + g(t)$ and $\alpha f(t)$ also have finite energy.
• If $f(t)$ and $g(t)$ are functions with finite energy, then $f(t) + g(t) = g(t) + f(t)$ (axiom 1).
• If $f(t)$, $g(t)$, and $h(t)$ are functions with finite energy, and $\alpha$ and $\beta$ are scalars, then $[f(t) + g(t)] + h(t) = f(t) + [g(t) + h(t)]$ and $\alpha(\beta f(t)) = (\alpha\beta) f(t)$ (axiom 2).
• If $f(t)$ and $g(t)$ are functions with finite energy, and $\alpha$ and $\beta$ are scalars, then $\alpha[f(t) + g(t)] = \alpha f(t) + \alpha g(t)$ and $(\alpha + \beta) f(t) = \alpha f(t) + \beta f(t)$ (axiom 4).
Let us define the additive identity function $\theta(t)$ as the function that is identically zero for all t. This function satisfies the requirement of finite energy, and we can see that axioms 3 and 5 are also satisfied. Finally, if a function $f(t)$ has finite energy, then from Equation (12.3), the function $-f(t)$ also has finite energy, and axiom 6 is satisfied. Therefore, the set of all functions with finite energy constitutes a vector space. This space is denoted by $L_2$.
12 . 3 . 3 S u b s p a c e
A subspace S of a vector space V is a subset of V whose members satisfy all the axioms of the vector space and has the additional property that if $\mathbf{x}$ and $\mathbf{y}$ are in S, and $\alpha$ is a scalar, then $\mathbf{x} + \mathbf{y}$ and $\alpha\mathbf{x}$ are also in S.
E x a m p l e 12 . 3 . 2:
Consider the set S of continuous bounded functions on the interval [0, 1]. Then S is a
subspace of the vector space L2.
12 . 3 . 4 B a s i s
One way we can generate a subspace is by taking linear combinations of a set of vectors. If
this set of vectors is linearly independent, then the set is called a basis for the subspace.
A set of vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots$ is said to be linearly independent if no vector of the set can be written as a linear combination of the other vectors in the set.
A direct consequence of this definition is the following theorem:
T h e o r e m A set of vectors $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ is linearly independent if and only if the expression $\sum_{i=1}^{N} \alpha_i \mathbf{x}_i = \theta$ implies that $\alpha_i = 0$ for all $i = 1, 2, \ldots, N$.
P r o o f The proof of this theorem can be found in most books on linear algebra [175].
The set of vectors formed by all possible linear combinations of vectors from a linearly
independent set X forms a vector space (Problem 1). The set X is said to be the basis for
this vector space. The basis set contains the smallest number of linearly independent vectors
required to represent each element of the vector space. More than one set can be the basis
for a given space.
E x a m p l e 12 . 3 . 3:
Consider the vector space consisting of vectors $[a\ \ b]^T$, where a and b are real numbers. Then the set
$X = \left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\}$
forms a basis for this space, as does the set
$X = \left\{ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right\}.$
In fact, any two vectors that are not scalar multiples of each other form a basis for this space.
The number of basis vectors required to generate the space is called the dimension of
the vector space. In the previous example the dimension of the vector space is two. The
dimension of the space of all continuous functions on the interval $[0, 1]$ is infinity.
Given a particular basis, we can find a representation with respect to this basis for any
vector in the space.
E x a m p l e 12 . 3 . 4:
If $\mathbf{a} = [3\ \ 4]^T$, then
$\mathbf{a} = 3\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 4\begin{bmatrix} 0 \\ 1 \end{bmatrix}$
and
$\mathbf{a} = 4\begin{bmatrix} 1 \\ 1 \end{bmatrix} + (-1)\begin{bmatrix} 1 \\ 0 \end{bmatrix}$
so the representation of $\mathbf{a}$ with respect to the first basis set is (3, 4), and the representation of $\mathbf{a}$ with respect to the second basis set is (4, −1).
In the beginning of this section we had described a mathematical machinery for finding
the components of a vector that involved taking the dot product or inner product of the
vector to be decomposed with basis vectors. In order to use the same machinery in more
abstract vector spaces we need to generalize the notion of inner product.
12 . 3 . 5 I n n e r P r o d u c t — F o r m a l D e f i n i t i o n
An inner product between two vectors $\mathbf{x}$ and $\mathbf{y}$, denoted by $\langle \mathbf{x}, \mathbf{y} \rangle$, associates a scalar value with each pair of vectors. The inner product satisfies the following axioms:
1. $\langle \mathbf{x}, \mathbf{y} \rangle = \langle \mathbf{y}, \mathbf{x} \rangle^*$, where $^*$ denotes complex conjugate.
2. $\langle \mathbf{x} + \mathbf{y}, \mathbf{z} \rangle = \langle \mathbf{x}, \mathbf{z} \rangle + \langle \mathbf{y}, \mathbf{z} \rangle$.
3. $\langle \alpha\mathbf{x}, \mathbf{y} \rangle = \alpha\langle \mathbf{x}, \mathbf{y} \rangle$.
4. $\langle \mathbf{x}, \mathbf{x} \rangle \ge 0$, with equality if and only if $\mathbf{x} = \theta$. The quantity $\sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$, denoted by $\|\mathbf{x}\|$, is called the norm of $\mathbf{x}$ and is analogous to our usual concept of distance.
12 . 3 . 6 O r t h o g o n a l a n d O r t h o n o r m a l S e t s
As in the case of Euclidean space, two vectors are said to be orthogonal if their inner product
is zero. If we select our basis set to be orthogonal (that is, each vector is orthogonal to every
other vector in the set) and further require that the norm of each vector be one (that is, the
basis vectors are unit vectors), such a basis set is called an orthonormal basis set. Given an
orthonormal basis, it is easy to find the representation of any vector in the space in terms
of the basis vectors using the inner product. Suppose we have a vector space $S_N$ with an orthonormal basis set $\{\mathbf{x}_i\}_{i=1}^{N}$. Given a vector $\mathbf{y}$ in the space $S_N$, by definition of the basis set we can write $\mathbf{y}$ as a linear combination of the vectors $\mathbf{x}_i$:
$\mathbf{y} = \sum_{i=1}^{N} \alpha_i \mathbf{x}_i.$
To find the coefficient $\alpha_k$, we find the inner product of both sides of this equation with $\mathbf{x}_k$:
$\langle \mathbf{y}, \mathbf{x}_k \rangle = \sum_{i=1}^{N} \alpha_i \langle \mathbf{x}_i, \mathbf{x}_k \rangle$
Because of orthonormality,
$\langle \mathbf{x}_i, \mathbf{x}_k \rangle = \begin{cases} 1 & i = k \\ 0 & i \ne k \end{cases}$
and
$\langle \mathbf{y}, \mathbf{x}_k \rangle = \alpha_k.$
By repeating this with each $\mathbf{x}_i$, we can get all the coefficients $\alpha_i$. Note that in order to use this machinery, the basis set has to be orthonormal.
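As a small numerical illustration of this machinery (not taken from the text), the following Python sketch recovers the coefficients of a vector with respect to an orthonormal basis by taking inner products and then reconstructs the vector from them:

```python
import numpy as np

# An orthonormal basis for R^2 (rotated unit vectors); any orthonormal
# set would work equally well.
x1 = np.array([1.0, 1.0]) / np.sqrt(2)
x2 = np.array([1.0, -1.0]) / np.sqrt(2)

y = np.array([3.0, 4.0])

# alpha_k = <y, x_k>: the inner product extracts each coefficient.
alpha1 = np.dot(y, x1)
alpha2 = np.dot(y, x2)

# Reconstruction from the coefficients recovers y exactly.
assert np.allclose(alpha1 * x1 + alpha2 * x2, y)
```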
We now have sufficient information in hand to begin looking at some of the well-known
techniques for representing functions of time. This was somewhat of a crash course in vector
spaces, and you might, with some justification, be feeling somewhat dazed. Basically, the
important ideas that we would like you to remember are the following:
 Vectors are not simply points in two- or three-dimensional space. In fact, functions of
time can be viewed as elements in a vector space.
 Collections of vectors that satisfy certain axioms make up a vector space.
 All members of a vector space can be represented as linear, or weighted, combinations
of the basis vectors (keep in mind that you can have many different basis sets for the
same space). If the basis vectors have unit magnitude and are orthogonal, they are
known as an orthonormal basis set.
 If a basis set is orthonormal, the weights, or coefficients, can be obtained by taking
the inner product of the vector with the corresponding basis vector.
In the next section we use these concepts to show how we can represent periodic functions
as linear combinations of sines and cosines.
12 . 4 Fourier Series
The representation of periodic functions in terms of a series of sines and cosines was
discovered by Jean Baptiste Joseph Fourier. Although he came up with this idea in order to
help him solve equations describing heat diffusion, this work has since become indispensable
in the analysis and design of systems. The work was awarded the grand prize for mathematics
in 1812 and has been called one of the most revolutionary contributions of the last century.
A very readable account of the life of Fourier and the impact of his discovery can be found
in [176].
Fourier showed that any periodic function, no matter how awkward looking, could be
represented as the sum of smooth, well-behaved sines and cosines. Given a periodic function
$f(t)$ with period T,
$f(t) = f(t + nT), \qquad n = \pm 1, \pm 2, \ldots$
we can write $f(t)$ as
$f(t) = a_0 + \sum_{n=1}^{\infty} a_n \cos n\omega_0 t + \sum_{n=1}^{\infty} b_n \sin n\omega_0 t, \qquad \omega_0 = \frac{2\pi}{T}.$  (12.4)
This form is called the trigonometric Fourier series representation of ft.
A more useful form of the Fourier series representation from our point of view is the
exponential form of the Fourier series:
$f(t) = \sum_{n=-\infty}^{\infty} c_n e^{jn\omega_0 t}.$  (12.5)
We can easily move between the exponential and trigonometric representations by using Euler's identity
$e^{j\phi} = \cos\phi + j\sin\phi$
where $j = \sqrt{-1}$.
In the terminology of the previous section, all periodic functions with period T form a vector space. The complex exponential functions $\{e^{jn\omega_0 t}\}_{n=-\infty}^{\infty}$ constitute a basis for this space. The parameters $\{c_n\}_{n=-\infty}^{\infty}$ are the representation of a given function $f(t)$ with respect to this basis set. Therefore, by using different values of $\{c_n\}_{n=-\infty}^{\infty}$, we can build different periodic functions. If we wanted to inform somebody what a particular periodic function looked like, we could send the values of $\{c_n\}_{n=-\infty}^{\infty}$ and they could synthesize the function.
We would like to see if this basis set is orthonormal. If it is, we want to be able to
obtain the coefficients that make up the Fourier representation using the approach described
in the previous section. In order to do all this, we need a definition of the inner product
on this vector space. If $f(t)$ and $g(t)$ are elements of this vector space, the inner product is defined as
$\langle f(t), g(t) \rangle = \frac{1}{T}\int_{t_0}^{t_0 + T} f(t)\, g(t)^*\, dt$  (12.6)
where $t_0$ is an arbitrary constant and $^*$ denotes complex conjugate. For convenience we will take $t_0$ to be zero.
Using this inner product definition, let us check to see if the basis set is orthonormal.
$\langle e^{jn\omega_0 t}, e^{jm\omega_0 t} \rangle = \frac{1}{T}\int_0^T e^{jn\omega_0 t}\, e^{-jm\omega_0 t}\, dt$  (12.7)
$= \frac{1}{T}\int_0^T e^{j(n-m)\omega_0 t}\, dt$  (12.8)
When n = m, Equation (12.7) becomes the norm of the basis vector, which is clearly one. When $n \ne m$, let us define $k = n - m$. Then
$\langle e^{jn\omega_0 t}, e^{jm\omega_0 t} \rangle = \frac{1}{T}\int_0^T e^{jk\omega_0 t}\, dt$  (12.9)
$= \frac{1}{jk\omega_0 T}\left(e^{jk\omega_0 T} - 1\right)$  (12.10)
$= \frac{1}{jk\omega_0 T}\left(e^{jk2\pi} - 1\right)$  (12.11)
$= 0$  (12.12)
where we have used the facts that $\omega_0 = \frac{2\pi}{T}$ and
$e^{jk2\pi} = \cos 2k\pi + j\sin 2k\pi = 1.$
Thus, the basis set is orthonormal.
Using this fact, we can find the coefficient $c_n$ by taking the inner product of $f(t)$ with the basis vector $e^{jn\omega_0 t}$:
$c_n = \langle f(t), e^{jn\omega_0 t} \rangle = \frac{1}{T}\int_0^T f(t)\, e^{-jn\omega_0 t}\, dt$  (12.13)
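As a quick numerical check of Equation (12.13) (an illustration, not part of the text), the sketch below approximates the Fourier series coefficients of a square wave by numerical integration; for this signal the exact coefficients are $(1 - (-1)^n)/(jn\pi)$:

```python
import numpy as np

T = 1.0                       # period
w0 = 2 * np.pi / T
t = np.linspace(0.0, T, 10000, endpoint=False)
f = np.sign(np.sin(w0 * t))   # a square wave with period T

def c(n):
    """Approximate c_n of Eq. (12.13) by numerical integration."""
    return np.mean(f * np.exp(-1j * n * w0 * t))

# c_1 should be close to 2/(j*pi), i.e. about -0.6366j.
print(c(1), 2 / (1j * np.pi))
```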
What do we gain from obtaining the Fourier representation $\{c_n\}_{n=-\infty}^{\infty}$ of a function $f(t)$? Before we answer this question, let us examine the context in which we generally use Fourier analysis. We start with some signal generated by a source. If we wish to look at how this signal changes its amplitude over a period of time (or space), we represent it as a function of time $f(t)$ (or a function of space $f(x)$). Thus, $f(t)$ (or $f(x)$) is a representation of the signal that brings out how this signal varies in time (or space). The sequence $\{c_n\}_{n=-\infty}^{\infty}$ is a different representation of the same signal. However, this representation brings out a different aspect of the signal. The basis functions are sinusoids that differ from each other in how fast they fluctuate in a given time interval. The basis vector $e^{j2\omega_0 t}$ fluctuates twice as fast as the basis vector $e^{j\omega_0 t}$. The coefficients of the basis vectors $\{c_n\}_{n=-\infty}^{\infty}$ give us a measure of the different amounts of fluctuation present in the signal. Fluctuation of this sort is usually measured in terms of frequency. A frequency of 1 Hz denotes the completion of one period in one second, a frequency of 2 Hz denotes the completion of two cycles in one second, and so on. Thus, the coefficients $\{c_n\}_{n=-\infty}^{\infty}$ provide us with a frequency profile of the signal: how much of the signal changes at the rate of $\frac{\omega_0}{2\pi}$ Hz, how much of the signal changes at the rate of $\frac{2\omega_0}{2\pi}$ Hz, and so on. This information cannot be obtained by looking at the time representation $f(t)$. On the other hand, the $\{c_n\}_{n=-\infty}^{\infty}$ representation tells us little about how the signal changes with time. Each representation emphasizes a different aspect of the signal. The ability to view the same signal in different ways helps us to better understand the nature of the signal, and thus develop tools for manipulation of the signal. Later, when we talk about wavelets, we will look at representations that provide information about both the time profile and the frequency profile of the signal.
The Fourier series provides us with a frequency representation of periodic signals.
However, many of the signals we will be dealing with are not periodic. Fortunately, the
Fourier series concepts can be extended to nonperiodic signals.
12 . 5 Fourier Transform
Consider the function $f(t)$ shown in Figure 12.3. Let us define a function $f_P(t)$ as
$f_P(t) = \sum_{n=-\infty}^{\infty} f(t - nT)$  (12.14)
where $T > t_1$. This function, which is obviously periodic ($f_P(t + T) = f_P(t)$), is called the periodic extension of the function $f(t)$. Because the function $f_P(t)$ is periodic, we can define a Fourier series expansion for it:
$c_n = \frac{1}{T}\int_{-T/2}^{T/2} f_P(t)\, e^{-jn\omega_0 t}\, dt$  (12.15)
$f_P(t) = \sum_{n=-\infty}^{\infty} c_n e^{jn\omega_0 t}.$  (12.16)
Define
$C(n\Delta\omega) = c_n T$
and
$\Delta\omega = \omega_0$
and let us slightly rewrite the Fourier series equations:
$C(n\Delta\omega) = \int_{-T/2}^{T/2} f_P(t)\, e^{-jn\Delta\omega t}\, dt$  (12.17)
$f_P(t) = \sum_{n=-\infty}^{\infty} \frac{C(n\Delta\omega)}{T}\, e^{jn\Delta\omega t}.$  (12.18)
We can recover $f(t)$ from $f_P(t)$ by taking the limit of $f_P(t)$ as T goes to infinity. Because $\Delta\omega = \omega_0 = \frac{2\pi}{T}$, this is the same as taking the limit as $\Delta\omega$ goes to zero. As $\Delta\omega$ goes to zero, $n\Delta\omega$ goes to a continuous variable $\omega$. Therefore,
$\lim_{T\to\infty,\,\Delta\omega\to 0}\int_{-T/2}^{T/2} f_P(t)\, e^{-jn\Delta\omega t}\, dt = \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt$  (12.19)
F I G U R E 12. 3 A function of time.
From the right-hand side, we can see that the resulting function is a function only of $\omega$. We call this function the Fourier transform of $f(t)$, and we will denote it by $F(\omega)$. To recover $f(t)$ from $F(\omega)$, we apply the same limits to Equation (12.18):
$f(t) = \lim_{T\to\infty} f_P(t) = \lim_{T\to\infty,\,\Delta\omega\to 0} \sum_{n=-\infty}^{\infty} \frac{C(n\Delta\omega)\,\Delta\omega}{2\pi}\, e^{jn\Delta\omega t}$  (12.20)
$= \frac{1}{2\pi}\int_{-\infty}^{\infty} F(\omega)\, e^{j\omega t}\, d\omega$  (12.21)
The equation
$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt$  (12.22)
is generally called the Fourier transform. The function $F(\omega)$ tells us how the signal fluctuates at different frequencies. The equation
$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F(\omega)\, e^{j\omega t}\, d\omega$  (12.23)
is called the inverse Fourier transform, and it shows us how we can construct a signal using components that fluctuate at different frequencies. We will denote the operation of the Fourier transform by the symbol $\mathcal{F}$. Thus, in the preceding, $F(\omega) = \mathcal{F}[f(t)]$.
There are several important properties of the Fourier transform, three of which will be
of particular use to us. We state them here and leave the proof to the problems (Problems 2,
3, and 4).
12 . 5 . 1 P a r s e v a l ’ s T h e o r e m
The Fourier transform is an energy-preserving transform; that is, the total energy when we
look at the time representation of the signal is the same as the total energy when we look
at the frequency representation of the signal. This makes sense because the total energy is
a physical property of the signal and should not change when we look at it using different
representations. Mathematically, this is stated as
$\int_{-\infty}^{\infty} |f(t)|^2\, dt = \frac{1}{2\pi}\int_{-\infty}^{\infty} |F(\omega)|^2\, d\omega$  (12.24)
The $\frac{1}{2\pi}$ factor is a result of using units of radians ($\omega$) for frequency instead of Hertz ($f$). If we substitute $\omega = 2\pi f$ in Equation (12.24), the $2\pi$ factor will go away. This property applies to any vector space representation obtained using an orthonormal basis set.
12 . 5 . 2 M o d u l a t i o n P r o p e r t y
If $f(t)$ has the Fourier transform $F(\omega)$, then the Fourier transform of $f(t)e^{j\omega_0 t}$ is $F(\omega - \omega_0)$. That is, multiplication with a complex exponential in the time domain corresponds to a shift in the frequency domain. As a sinusoid can be written as a sum of complex exponentials, multiplication of $f(t)$ by a sinusoid will also correspond to shifts of $F(\omega)$. For example,
$\cos\omega_0 t = \frac{e^{j\omega_0 t} + e^{-j\omega_0 t}}{2}.$
Therefore,
$\mathcal{F}[f(t)\cos\omega_0 t] = \frac{1}{2}\left[F(\omega - \omega_0) + F(\omega + \omega_0)\right].$
12 . 5 . 3 C o n v o l u t i o n T h e o r e m
When we examine the relationships between the input and output of linear systems, we will encounter integrals of the following forms:
$f(t) = \int_{-\infty}^{\infty} f_1(\tau)\, f_2(t - \tau)\, d\tau$
or
$f(t) = \int_{-\infty}^{\infty} f_1(t - \tau)\, f_2(\tau)\, d\tau$
These are called convolution integrals. The convolution operation is often denoted as
$f(t) = f_1(t) \otimes f_2(t)$
The convolution theorem states that if $F(\omega) = \mathcal{F}[f(t)] = \mathcal{F}[f_1(t) \otimes f_2(t)]$, $F_1(\omega) = \mathcal{F}[f_1(t)]$, and $F_2(\omega) = \mathcal{F}[f_2(t)]$, then
$F(\omega) = F_1(\omega)\, F_2(\omega).$
We can also go in the other direction. If
$F(\omega) = F_1(\omega) \otimes F_2(\omega) = \int F_1(\sigma)\, F_2(\omega - \sigma)\, d\sigma$
then
$f(t) = f_1(t)\, f_2(t).$
As mentioned earlier, this property of the Fourier transform is important because the
convolution integral relates the input and output of linear systems, which brings us to one
of the major reasons for the popularity of the Fourier transform. We have claimed that the
Fourier series and Fourier transform provide us with an alternative frequency profile of a
signal. Although sinusoids are not the only basis set that can provide us with a frequency
profile, they do, however, have an important property that helps us study linear systems,
which we describe in the next section.
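The discrete analogue of the convolution theorem is easy to verify numerically. The sketch below (an illustration, not from the text) checks that the circular convolution of two sequences equals the inverse DFT of the product of their DFTs:

```python
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.standard_normal(64)
f2 = rng.standard_normal(64)

# Circular convolution computed directly in the "time" domain.
n = len(f1)
direct = np.array([sum(f1[m] * f2[(k - m) % n] for m in range(n))
                   for k in range(n)])

# Convolution theorem: multiply the transforms, then invert.
via_fft = np.fft.ifft(np.fft.fft(f1) * np.fft.fft(f2)).real

assert np.allclose(direct, via_fft)
```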
12 . 6 Linear Systems
A linear system is a system that has the following two properties:
• Homogeneity: Suppose we have a linear system L with input $f(t)$ and output $g(t)$:
$g(t) = L[f(t)].$
If we have two inputs, $f_1(t)$ and $f_2(t)$, with corresponding outputs, $g_1(t)$ and $g_2(t)$, then the output of the sum of the two inputs is simply the sum of the two outputs:
$L[f_1(t) + f_2(t)] = g_1(t) + g_2(t).$
• Scaling: Given a linear system L with input $f(t)$ and output $g(t)$, if we multiply the input with a scalar $\alpha$, then the output will be multiplied by the same scalar:
$L[\alpha f(t)] = \alpha L[f(t)] = \alpha g(t).$
The two properties together are referred to as superposition.
12 . 6 . 1 T i m e I n v a r i a n c e
Of specific interest to us are linear systems that are time invariant. A time-invariant system
has the property that the shape of the response of this system does not depend on the time
at which the input was applied. If the response of a linear system L to an input $f(t)$ is $g(t)$,
$L[f(t)] = g(t)$
and we delay the input by some interval $t_0$, then if L is a time-invariant system, the output will be $g(t)$ delayed by the same amount:
$L[f(t - t_0)] = g(t - t_0).$  (12.25)
12 . 6 . 2 T r a n s f e r F u n c t i o n
Linear time-invariant systems have a very interesting (and useful) response when the input is a sinusoid. If the input to a linear system is a sinusoid of a certain frequency $\omega_0$, then the output is also a sinusoid of the same frequency that has been scaled and delayed; that is,
$L[\cos\omega_0 t] = a\cos(\omega_0 t - t_d)$
or in terms of the complex exponential
$L[e^{j\omega_0 t}] = a\, e^{j\omega_0(t - t_d)}.$
Thus, given a linear system, we can characterize its response to sinusoids of a particular frequency by a pair of parameters, the gain $a$ and the delay $t_d$. In general, we use the phase $\phi = \omega_0 t_d$ in place of the delay. The parameters $a$ and $\phi$ will generally be a function of the frequency, so in order to characterize the system for all frequencies, we will need a pair of functions $a(\omega)$ and $\phi(\omega)$. As the Fourier transform allows us to express the signal as coefficients of sinusoids, given an input $f(t)$, all we need to do is, for each frequency $\omega$, multiply the Fourier transform of $f(t)$ with some $a(\omega)e^{j\phi(\omega)}$, where $a(\omega)$ and $\phi(\omega)$ are the gain and phase terms of the linear system for that particular frequency.
This pair of functions $a(\omega)$ and $\phi(\omega)$ constitutes the transfer function of the linear time-invariant system $H(\omega)$:
$H(\omega) = |H(\omega)|\, e^{j\phi(\omega)}$
where $|H(\omega)| = a(\omega)$.
Because of the specific way in which a linear system responds to a sinusoidal input, given a linear system with transfer function $H(\omega)$, input $f(t)$, and output $g(t)$, the Fourier transforms of the input and output, $F(\omega)$ and $G(\omega)$, are related by
$G(\omega) = H(\omega)\, F(\omega).$
Using the convolution theorem, $f(t)$ and $g(t)$ are related by
$g(t) = \int_{-\infty}^{\infty} f(\tau)\, h(t - \tau)\, d\tau$
or
$g(t) = \int_{-\infty}^{\infty} f(t - \tau)\, h(\tau)\, d\tau$
where $H(\omega)$ is the Fourier transform of $h(t)$.
12 . 6 . 3 I m p u l s e R e s p o n s e
To see what $h(t)$ is, let us look at the input-output relationship of a linear time-invariant system from a different point of view. Let us suppose we have a linear system L with input $f(t)$. We can obtain a staircase approximation $f_S(t)$ to the function $f(t)$, as shown in Figure 12.4:
$f_S(t) = \sum_n f(n\Delta t)\, \mathrm{rect}\!\left(\frac{t - n\Delta t}{\Delta t}\right)$  (12.26)
where
$\mathrm{rect}\!\left(\frac{t}{T}\right) = \begin{cases} 1 & |t| < \frac{T}{2} \\ 0 & \text{otherwise} \end{cases}$  (12.27)
The response of the linear system can be written as
$L[f_S(t)] = L\!\left[\sum_n f(n\Delta t)\, \mathrm{rect}\!\left(\frac{t - n\Delta t}{\Delta t}\right)\right]$  (12.28)
$= L\!\left[\sum_n f(n\Delta t)\, \frac{\mathrm{rect}\!\left(\frac{t - n\Delta t}{\Delta t}\right)}{\Delta t}\,\Delta t\right].$  (12.29)
F I G U R E 12. 4 A function of time.
For a given value of $\Delta t$, we can use the superposition property of linear systems to obtain
$L[f_S(t)] = \sum_n f(n\Delta t)\, L\!\left[\frac{\mathrm{rect}\!\left(\frac{t - n\Delta t}{\Delta t}\right)}{\Delta t}\right]\Delta t$  (12.30)
If we now take the limit as $\Delta t$ goes to zero in this equation, on the left-hand side $f_S(t)$ will go to $f(t)$. To see what happens on the right-hand side of the equation, first let's look at the effect of this limit on the function $\mathrm{rect}\left(\frac{t}{\Delta t}\right)/\Delta t$. As $\Delta t$ goes to zero, this function becomes narrower and taller. However, at all times the integral of this function is equal to one. The limit of this function as $\Delta t$ goes to zero is called the Dirac delta function, or impulse function, and is denoted by $\delta(t)$:
$\lim_{\Delta t \to 0} \frac{\mathrm{rect}\!\left(\frac{t - n\Delta t}{\Delta t}\right)}{\Delta t} = \delta(t - n\Delta t)$  (12.31)
Therefore,
$L[f(t)] = \lim_{\Delta t \to 0} L[f_S(t)] = \int f(\tau)\, L[\delta(t - \tau)]\, d\tau.$  (12.32)
Denote the response of the system L to an impulse, or the impulse response, by $h(t)$:
$h(t) = L[\delta(t)].$  (12.33)
Then, if the system is also time invariant,
$L[f(t)] = \int f(\tau)\, h(t - \tau)\, d\tau.$  (12.34)
Using the convolution theorem, we can see that the Fourier transform of the impulse response $h(t)$ is the transfer function $H(\omega)$.
The Dirac delta function is an interesting function. In fact, it is not clear that it is a
function at all. It has an integral that is clearly one, but at the only point where it is not zero,
it is undefined! One property of the delta function that makes it very useful is the sifting property:
$\int_{t_1}^{t_2} f(t)\,\delta(t - t_0)\, dt = \begin{cases} f(t_0) & t_1 \le t_0 \le t_2 \\ 0 & \text{otherwise} \end{cases}$  (12.35)
12 . 6 . 4 F i l t e r
The linear systems of most interest to us will be systems that permit certain frequency
components of the signal to pass through, while attenuating all other components of the
signal. Such systems are called filters. If the filter allows only frequency components below
a certain frequency W Hz to pass through, the filter is called a low-pass filter. The transfer
function of an ideal low-pass filter is given by
H =

e−j
  2W
0 otherwise
(12.36)
This filter is said to have a bandwidth of W Hz. The magnitude of this filter is shown in
Figure 12.5. A low-pass filter will produce a smoothed version of the signal by blocking
higher-frequency components that correspond to fast variations in the signal.
A filter that attenuates the frequency components below a certain frequency W and
allows the frequency components above this frequency to pass through is called a high-pass
filter. A high-pass filter will remove slowly changing trends from the signal. Finally, a filter that lets through a range of frequencies between two specified frequencies, say, $W_1$ and $W_2$, is called a band-pass filter. The bandwidth of this filter is said to be $W_2 - W_1$ Hz.
The magnitude of the transfer functions of an ideal high-pass filter and an ideal band-pass
filter with bandwidth W are shown in Figure 12.6. In all the ideal filter characteristics,
there is a sharp transition between the passband of the filter (the range of frequencies
that are not attenuated) and the stopband of the filter (those frequency intervals where the
signal is completely attenuated). Real filters do not have such sharp transitions, or cutoffs.
F I G U R E 12. 5 Magnitude of the transfer function of an ideal low-pass filter.
F I G U R E 12. 6 Magnitudes of the transfer functions of ideal high-pass (left) and
ideal band-pass (right) filters.
F I G U R E 12. 7 Magnitude of the transfer functions of a realistic low-pass filter.
The magnitude characteristics of a more realistic low-pass filter are shown in Figure 12.7.
Notice the more gentle rolloff. But when the cutoff between stopband and passband is not
sharp, how do we define the bandwidth? There are several different ways of defining the
bandwidth. The most common way is to define the frequency at which the magnitude of the transfer function is $1/\sqrt{2}$ of its maximum value (or the magnitude squared is $1/2$ of its maximum value) as the cutoff frequency.
12 . 7 Sampling
In 1928 Harry Nyquist at Bell Laboratories showed that if we have a signal whose Fourier
transform is zero above some frequency W Hz, it can be accurately represented using 2W
equally spaced samples per second. This very important result, known as the sampling
theorem, is at the heart of our ability to transmit analog waveforms such as speech and video
using digital means. There are several ways to prove this result. We will use the results
presented in the previous section to do so.
12 . 7 . 1 I d e a l S a m p l i n g — F r e q u e n c y D o m a i n V i e w
Let us suppose we have a function $f(t)$ with Fourier transform $F(\omega)$, shown in Figure 12.8, which is zero for $|\omega|$ greater than $2\pi W$. Define the periodic extension of $F(\omega)$ as
$F_P(\omega) = \sum_{n=-\infty}^{\infty} F(\omega - n\sigma_0), \qquad \sigma_0 = 4\pi W.$  (12.37)
The periodic extension is shown in Figure 12.9. As $F_P(\omega)$ is periodic, we can express it in terms of a Fourier series expansion:
$F_P(\omega) = \sum_{n=-\infty}^{\infty} c_n e^{-jn\frac{\omega}{2W}}.$  (12.38)
The coefficients of the expansion $\{c_n\}_{n=-\infty}^{\infty}$ are then given by
$c_n = \frac{1}{4\pi W}\int_{-2\pi W}^{2\pi W} F_P(\omega)\, e^{jn\frac{\omega}{2W}}\, d\omega$  (12.39)
However, in the interval $(-2\pi W, 2\pi W)$, $F(\omega)$ is identical to $F_P(\omega)$; therefore,
$c_n = \frac{1}{4\pi W}\int_{-2\pi W}^{2\pi W} F(\omega)\, e^{jn\frac{\omega}{2W}}\, d\omega$  (12.40)
F I G U R E 12. 8 A function F().
F I G U R E 12. 9 The periodic extension FP ().
The function $F(\omega)$ is zero outside the interval $(-2\pi W, 2\pi W)$, so we can extend the limits to infinity without changing the result:
$c_n = \frac{1}{2W}\left[\frac{1}{2\pi}\int_{-\infty}^{\infty} F(\omega)\, e^{jn\frac{\omega}{2W}}\, d\omega\right].$  (12.41)
The expression in brackets is simply the inverse Fourier transform evaluated at $t = \frac{n}{2W}$; therefore,
$c_n = \frac{1}{2W}\, f\!\left(\frac{n}{2W}\right).$  (12.42)
Knowing $\{c_n\}_{n=-\infty}^{\infty}$ and the value of W, we can reconstruct $F_P(\omega)$. Because $F_P(\omega)$ and $F(\omega)$ are identical in the interval $(-2\pi W, 2\pi W)$, knowing $\{c_n\}_{n=-\infty}^{\infty}$, we can also reconstruct $F(\omega)$ in this interval. But the $c_n$ are simply the samples of $f(t)$ taken every $\frac{1}{2W}$ seconds, and $F(\omega)$ is zero outside this interval. Therefore, given the samples of a function $f(t)$ obtained at a rate of 2W samples per second, we should be able to exactly reconstruct the function $f(t)$.
Let us see how we can do this:
$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F(\omega)\, e^{j\omega t}\, d\omega$  (12.43)
$= \frac{1}{2\pi}\int_{-2\pi W}^{2\pi W} F(\omega)\, e^{j\omega t}\, d\omega$  (12.44)
$= \frac{1}{2\pi}\int_{-2\pi W}^{2\pi W} F_P(\omega)\, e^{j\omega t}\, d\omega$  (12.45)
$= \frac{1}{2\pi}\int_{-2\pi W}^{2\pi W} \sum_{n=-\infty}^{\infty} c_n e^{-jn\frac{\omega}{2W}}\, e^{j\omega t}\, d\omega$  (12.46)
$= \frac{1}{2\pi}\sum_{n=-\infty}^{\infty} c_n \int_{-2\pi W}^{2\pi W} e^{j\omega\left(t - \frac{n}{2W}\right)}\, d\omega$  (12.47)
Evaluating the integral and substituting for $c_n$ from Equation (12.42), we obtain
$f(t) = \sum_{n=-\infty}^{\infty} f\!\left(\frac{n}{2W}\right)\, \mathrm{Sinc}\!\left[2W\!\left(t - \frac{n}{2W}\right)\right]$  (12.48)
where
$\mathrm{Sinc}(x) = \frac{\sin(\pi x)}{\pi x}.$  (12.49)
Thus, given samples of $f(t)$ taken every $\frac{1}{2W}$ seconds, or, in other words, samples of $f(t)$ obtained at a rate of 2W samples per second, we can reconstruct $f(t)$ by interpolating between the samples using the Sinc function.
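The sketch below (an illustration under stated assumptions, not code from the text) samples a band-limited test signal at 2W samples per second and reconstructs intermediate values with the Sinc interpolation of Equation (12.48); the infinite sum is truncated to a finite window of samples, so the match degrades near the edges of that window.

```python
import numpy as np

W = 4.0                      # the signal contains no frequencies above W Hz
fs = 2 * W                   # sampling rate of 2W samples per second
n = np.arange(-64, 64)       # a finite window of sample indices
t_n = n / fs

def f(t):
    # A band-limited test signal: two sinusoids below W Hz.
    return np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.cos(2 * np.pi * 3.0 * t)

samples = f(t_n)

def reconstruct(t):
    """Sinc interpolation, Eq. (12.48), truncated to the stored samples.
    np.sinc(x) = sin(pi*x)/(pi*x), matching the definition in Eq. (12.49)."""
    return np.sum(samples * np.sinc(2 * W * (t - t_n)))

# The reconstruction should closely match the original signal.
print(f(0.13), reconstruct(0.13))
```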
12 . 7 . 2 I d e a l S a m p l i n g — T i m e D o m a i n V i e w
Let us look at this process from a slightly different point of view, starting with the sampling
operation. Mathematically, we can represent the sampling operation by multiplying the
function $f(t)$ with a train of impulses to obtain the sampled function $f_S(t)$:
$f_S(t) = f(t)\sum_{n=-\infty}^{\infty}\delta(t - nT), \qquad T < \frac{1}{2W}.$  (12.50)
To obtain the Fourier transform of the sampled function, we use the convolution theorem:
$\mathcal{F}\!\left[f(t)\sum_{n=-\infty}^{\infty}\delta(t - nT)\right] = \mathcal{F}[f(t)] \otimes \mathcal{F}\!\left[\sum_{n=-\infty}^{\infty}\delta(t - nT)\right].$  (12.51)
Let us denote the Fourier transform of $f(t)$ by $F(\omega)$. The Fourier transform of a train of impulses in the time domain is a train of impulses in the frequency domain (Problem 5):
$\mathcal{F}\!\left[\sum_{n=-\infty}^{\infty}\delta(t - nT)\right] = \sigma_0\sum_{n=-\infty}^{\infty}\delta(\omega - n\sigma_0), \qquad \sigma_0 = \frac{2\pi}{T}.$  (12.52)
Thus, the Fourier transform of $f_S(t)$ is
$F_S(\omega) = F(\omega) \otimes \sum_{n=-\infty}^{\infty}\delta(\omega - n\sigma_0)$  (12.53)
$= \sum_{n=-\infty}^{\infty} F(\omega) \otimes \delta(\omega - n\sigma_0)$  (12.54)
$= \sum_{n=-\infty}^{\infty} F(\omega - n\sigma_0)$  (12.55)
where the last equality is due to the sifting property of the delta function.
Pictorially, for $F(\omega)$ as shown in Figure 12.8, $F_S(\omega)$ is shown in Figure 12.10. Note that if T is less than $\frac{1}{2W}$, $\sigma_0$ is greater than $4\pi W$, and as long as $\sigma_0$ is greater than $4\pi W$, we can recover $F(\omega)$ by passing $F_S(\omega)$ through an ideal low-pass filter with bandwidth W Hz ($2\pi W$ radians).
What happens if we do sample at a rate less than 2W samples per second (that is, $\sigma_0$ is less than $4\pi W$)? Again we can see the results most easily in a pictorial fashion. The result
F I G U R E 12. 10 Fourier transform of the sampled function.
F I G U R E 12. 11 Effect of sampling at a rate less than 2W samples per second.
F I G U R E 12. 12 Aliased reconstruction.
for $\sigma_0$ equal to $3\pi W$ is shown in Figure 12.11. Filtering this signal through an ideal low-pass filter, we get the distorted signal shown in Figure 12.12. Therefore, if $\sigma_0$ is less than $4\pi W$, we can no longer recover the signal $f(t)$ from its samples. This distortion is known as aliasing. In order to prevent aliasing, it is useful to filter the signal prior to sampling using a low-pass filter with a bandwidth less than half the sampling frequency.
Once we have the samples of a signal, sometimes the actual times they were sampled at are not important. In these situations we can normalize the sampling frequency to unity. This means that the highest frequency component in the signal is at 0.5 Hz, or $\pi$ radians. Thus, when dealing with sampled signals, we will often talk about frequency ranges of $-\pi$ to $\pi$.
12 . 8 Discrete Fourier Transform
The procedures that we gave for obtaining the Fourier series and transform were based on the
assumption that the signal we were examining could be represented as a continuous function
of time. However, for the applications that we will be interested in, we will primarily be
dealing with samples of a signal. To obtain the Fourier transform of nonperiodic signals, we
started from the Fourier series and modified it to take into account the nonperiodic nature of
the signal. To obtain the discrete Fourier transform (DFT), we again start from the Fourier
series. We begin with the Fourier series representation of a sampled function, the discrete
Fourier series.
Recall that the Fourier series coefficients of a periodic function $f(t)$ with period T are given by
$c_k = \frac{1}{T}\int_0^T f(t)\, e^{-jk\omega_0 t}\, dt$  (12.56)
Suppose instead of a continuous function, we have a function sampled N times during each period T. We can obtain the coefficients of the Fourier series representation of this sampled function as
$F_k = \frac{1}{T}\int_0^T f(t)\sum_{n=0}^{N-1}\delta\!\left(t - \frac{n}{N}T\right) e^{-jk\omega_0 t}\, dt$  (12.57)
$= \frac{1}{T}\sum_{n=0}^{N-1} f\!\left(\frac{n}{N}T\right) e^{-j\frac{2\pi kn}{N}}$  (12.58)
where we have used the fact that $\omega_0 = \frac{2\pi}{T}$, and we have replaced $c_k$ by $F_k$. Taking T = 1 for convenience and defining
$f_n = f\!\left(\frac{n}{N}\right),$
we get the coefficients for the discrete Fourier series (DFS) representation:
$F_k = \sum_{n=0}^{N-1} f_n e^{-j\frac{2\pi kn}{N}}.$  (12.59)
Notice that the sequence of coefficients $\{F_k\}$ is periodic with period N.
The Fourier series representation was given by
$f(t) = \sum_{k=-\infty}^{\infty} c_k e^{jk\omega_0 t}.$  (12.60)
Evaluating this for $t = \frac{n}{N}T$, we get
$f_n = f\!\left(\frac{n}{N}T\right) = \sum_{k=-\infty}^{\infty} c_k e^{j\frac{2\pi kn}{N}}.$  (12.61)
Let us write this in a slightly different form:
$f_n = \sum_{k=0}^{N-1}\sum_{l=-\infty}^{\infty} c_{k+lN}\, e^{j\frac{2\pi n(k+lN)}{N}}$  (12.62)
but
$e^{j\frac{2\pi n(k+lN)}{N}} = e^{j\frac{2\pi kn}{N}}\, e^{j2\pi nl}$  (12.63)
$= e^{j\frac{2\pi kn}{N}}.$  (12.64)
Therefore,
$f_n = \sum_{k=0}^{N-1} e^{j\frac{2\pi kn}{N}}\sum_{l=-\infty}^{\infty} c_{k+lN}.$  (12.65)
Define
$\bar{c}_k = \sum_{l=-\infty}^{\infty} c_{k+lN}.$  (12.66)
Clearly, $\bar{c}_k$ is periodic with period N. In fact, we can show that $\bar{c}_k = \frac{1}{N} F_k$ and
$f_n = \frac{1}{N}\sum_{k=0}^{N-1} F_k\, e^{j\frac{2\pi kn}{N}}.$  (12.67)
Obtaining the discrete Fourier transform from the discrete Fourier series is simply a
matter of interpretation. We are generally interested in the discrete Fourier transform of
a finite-length sequence. If we assume that the finite-length sequence is one period of a
periodic sequence, then we can use the DFS equations to represent this sequence. The only
difference is that the expressions are only valid for one “period” of the “periodic” sequence.
The DFT is an especially powerful tool because of the existence of a fast algorithm,
called appropriately the fast Fourier transform (FFT), that can be used to compute it.
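As a quick illustration (not from the text), the DFT of Equation (12.59) can be evaluated directly from its definition or, much more efficiently, with the FFT; numpy's np.fft.fft uses the same $e^{-j2\pi kn/N}$ convention used here:

```python
import numpy as np

N = 8
f = np.arange(N, dtype=float)          # an arbitrary length-8 sequence

# Direct evaluation of Eq. (12.59).
k = np.arange(N).reshape(-1, 1)
n = np.arange(N).reshape(1, -1)
F_direct = (f * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)

# The FFT computes the same coefficients in O(N log N) operations.
F_fft = np.fft.fft(f)

assert np.allclose(F_direct, F_fft)
```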
12 . 9 Z-Transform
In the previous section we saw how to extend the Fourier series for use with sampled functions. We can also do the same with the Fourier transform. Recall that the Fourier transform was given by the equation
$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt$  (12.68)
Replacing $f(t)$ with its sampled version, we get
$F(\omega) = \int_{-\infty}^{\infty} f(t)\sum_{n=-\infty}^{\infty}\delta(t - nT)\, e^{-j\omega t}\, dt$  (12.69)
$= \sum_{n=-\infty}^{\infty} f_n e^{-j\omega nT}$  (12.70)
where $f_n = f(nT)$. This is called the discrete time Fourier transform. The Z-transform of the sequence $f_n$ is a generalization of the discrete time Fourier transform and is given by
$F(z) = \sum_{n=-\infty}^{\infty} f_n z^{-n}$  (12.71)
where
$z = e^{\sigma T + j\omega T}.$  (12.72)
Notice that if we let $\sigma$ equal zero, we get the original expression for the Fourier transform of a discrete time sequence. We denote the Z-transform of a sequence by
$F(z) = \mathcal{Z}[f_n].$
We can express this another way. Notice that the magnitude of z is given by
$|z| = e^{\sigma T}.$
Thus, when $\sigma$ equals zero, the magnitude of z is one. Because z is a complex number, the magnitude of z is equal to one on the unit circle in the complex plane. Therefore, we can say that the Fourier transform of a sequence can be obtained by evaluating the Z-transform of the sequence on the unit circle. Notice that the Fourier transform thus obtained will be periodic, which we expect because we are dealing with a sampled function. Further, if we assume T to be one, $\omega$ varies from $-\pi$ to $\pi$, which corresponds to a frequency range of $-0.5$ to 0.5 Hz. This makes sense because, by the sampling theorem, if the sampling rate is one sample per second, the highest frequency component that can be recovered is 0.5 Hz.
For the Z-transform to exist—in other words, for the power series to converge—we need to have
$\sum_{n=-\infty}^{\infty} |f_n z^{-n}| < \infty.$
Whether this inequality holds will depend on the sequence itself and the value of z. The values of z for which the series converges are called the region of convergence of the Z-transform. From our earlier discussion, we can see that for the Fourier transform of the sequence to exist, the region of convergence should include the unit circle. Let us look at a simple example.
E x a m p l e 12 . 9 . 1:
Given the sequence
$f_n = a^n u[n]$
where $u[n]$ is the unit step function
$u[n] = \begin{cases} 1 & n \ge 0 \\ 0 & n < 0 \end{cases}$  (12.73)
the Z-transform is given by
$F(z) = \sum_{n=0}^{\infty} a^n z^{-n}$  (12.74)
$= \sum_{n=0}^{\infty} (az^{-1})^n.$  (12.75)
This is simply the sum of a geometric series. As we confront this kind of sum quite often, let us briefly digress and obtain the formula for the sum of a geometric series.
Suppose we have a sum
$S_{m,n} = \sum_{k=m}^{n} x^k = x^m + x^{m+1} + \cdots + x^n$  (12.76)
then
$x S_{m,n} = x^{m+1} + x^{m+2} + \cdots + x^{n+1}.$  (12.77)
Subtracting Equation (12.77) from Equation (12.76), we get
$(1 - x) S_{m,n} = x^m - x^{n+1}$
and
$S_{m,n} = \frac{x^m - x^{n+1}}{1 - x}.$
If the upper limit of the sum is infinity, we take the limit as n goes to infinity. This limit exists only when $|x| < 1$.
Using this formula, we get the Z-transform of the $f_n$ sequence as
$F(z) = \frac{1}{1 - az^{-1}}, \qquad |az^{-1}| < 1$  (12.78)
$= \frac{z}{z - a}, \qquad |z| > |a|.$  (12.79)
In this example the region of convergence is the region $|z| > |a|$. For the Fourier transform to exist, we need to include the unit circle in the region of convergence. In order for this to happen, $|a|$ has to be less than one.
Using this example, we can get some other Z-transforms that will be useful to us.
E x a m p l e 12 . 9 . 2:
In the previous example we found that
$\sum_{n=0}^{\infty} a^n z^{-n} = \frac{z}{z - a}, \qquad |z| > |a|$  (12.80)
If we take the derivative of both sides of the equation with respect to a, we get
$\sum_{n=0}^{\infty} n a^{n-1} z^{-n} = \frac{z}{(z - a)^2}, \qquad |z| > |a|$  (12.81)
Thus,
$\mathcal{Z}\left[n a^{n-1} u[n]\right] = \frac{z}{(z - a)^2}, \qquad |z| > |a|.$
If we differentiate Equation (12.80) m times, we get
$\sum_{n=0}^{\infty} n(n-1)\cdots(n-m+1)\, a^{n-m} z^{-n} = \frac{m!\, z}{(z - a)^{m+1}}.$
In other words,
$\mathcal{Z}\left[\binom{n}{m} a^{n-m} u[n]\right] = \frac{z}{(z - a)^{m+1}}.$  (12.82)
In these examples the Z-transform is a ratio of polynomials in z. For sequences of interest to us, this will generally be the case, and the Z-transform will be of the form
$F(z) = \frac{N(z)}{D(z)}.$
The values of z for which F(z) is zero are called the zeros of F(z); the values for which F(z) is infinity are called the poles of F(z). For finite values of z, the poles will occur at the roots of the polynomial D(z).
The inverse Z-transform is formally given by the contour integral
$\frac{1}{2\pi j}\oint_C F(z)\, z^{n-1}\, dz$
where the integral is over the counterclockwise contour C, and C lies in the region of convergence. This integral can be difficult to evaluate directly; therefore, in most cases we use alternative methods for finding the inverse Z-transform.
12 . 9 . 1 T a b u l a r M e t h o d
The inverse Z-transform has been tabulated for a number of interesting cases (see Table 12.1). If we can write F(z) as a sum of these functions
$F(z) = \sum_i \alpha_i F_i(z)$
then the inverse Z-transform is given by
$f_n = \sum_i \alpha_i f_{i,n}$
where $F_i(z) = \mathcal{Z}[f_{i,n}]$.

T A B L E 12 . 1 Some Z-transform pairs.

$f_n$                    $F(z)$
$a^n u[n]$               $\frac{z}{z - a}$
$nT\, u[n]$              $\frac{T z^{-1}}{(1 - z^{-1})^2}$
$\sin(n\omega T)$        $\frac{\sin(\omega T)\, z^{-1}}{1 - 2\cos(\omega T)\, z^{-1} + z^{-2}}$
$\cos(n\omega T)$        $\frac{1 - \cos(\omega T)\, z^{-1}}{1 - 2\cos(\omega T)\, z^{-1} + z^{-2}}$
E x a m p l e 12 . 9 . 3:
$F(z) = \frac{z}{z - 0.5} + \frac{2z}{z - 0.3}$
From our earlier example we know the inverse Z-transform of $z/(z - a)$. Using that, the inverse Z-transform of F(z) is
$f_n = (0.5)^n u[n] + 2\,(0.3)^n u[n].$
12 . 9 . 2 P a r t i a l F r a c t i o n E x p a n s i o n
In order to use the tabular method, we need to be able to decompose the function of interest
to us as a sum of simpler terms. The partial fraction expansion approach does exactly that
when the function is a ratio of polynomials in z.
Suppose Fz can be written as a ratio of polynomials Nz and Dz. For the moment
let us assume that the degree of Dz is greater than the degree of Nz, and that all the
roots of Dz are distinct (distinct roots are referred to as simple roots); that is,
$F(z) = \frac{N(z)}{(z - z_1)(z - z_2)\cdots(z - z_L)}.$  (12.83)
Then we can write F(z)/z as
$\frac{F(z)}{z} = \sum_{i=1}^{L}\frac{A_i}{z - z_i}.$  (12.84)
If we can find the coefficients $A_i$, then we can write F(z) as
$F(z) = \sum_{i=1}^{L}\frac{A_i z}{z - z_i}$
and the inverse Z-transform will be given by
$f_n = \sum_{i=1}^{L} A_i z_i^n u[n].$
The question then becomes one of finding the values of the coefficients $A_i$. This can be simply done as follows: Suppose we want to find the coefficient $A_k$. Multiply both sides of Equation (12.84) by $(z - z_k)$. Simplifying this we obtain
$\frac{F(z)(z - z_k)}{z} = \sum_{i=1}^{L}\frac{A_i (z - z_k)}{z - z_i}$  (12.85)
$= A_k + \sum_{\substack{i=1 \\ i \ne k}}^{L}\frac{A_i (z - z_k)}{z - z_i}.$  (12.86)
Evaluating this equation at $z = z_k$, all the terms in the summation go to zero and
$A_k = \left.\frac{F(z)(z - z_k)}{z}\right|_{z = z_k}.$  (12.87)
E x a m p l e 12 . 9 . 4:
Let us use the partial fraction expansion method to find the inverse Z-transform of
$F(z) = \frac{6z^2 - 9z}{z^2 - 2.5z + 1}.$
Then
$\frac{F(z)}{z} = \frac{1}{z}\,\frac{6z^2 - 9z}{z^2 - 2.5z + 1}$  (12.88)
$= \frac{6z - 9}{(z - 0.5)(z - 2)}.$  (12.89)
We want to write F(z)/z in the form
$\frac{F(z)}{z} = \frac{A_1}{z - 0.5} + \frac{A_2}{z - 2}.$
Using the approach described above, we obtain
$A_1 = \left.\frac{(6z - 9)(z - 0.5)}{(z - 0.5)(z - 2)}\right|_{z = 0.5}$  (12.90)
$= 4$  (12.91)
$A_2 = \left.\frac{(6z - 9)(z - 2)}{(z - 0.5)(z - 2)}\right|_{z = 2}$  (12.92)
$= 2$  (12.93)
Therefore,
$F(z) = \frac{4z}{z - 0.5} + \frac{2z}{z - 2}$
and
$f_n = \left[4\,(0.5)^n + 2\,(2)^n\right] u[n].$
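The residue formula of Equation (12.87) is easy to evaluate numerically when the roots are simple. The Python sketch below (an illustration, not from the text) reproduces the coefficients of Example 12.9.4:

```python
import numpy as np

num = np.poly1d([6, -9, 0])        # N(z) = 6z^2 - 9z
den = np.poly1d([1, -2.5, 1])      # D(z) = z^2 - 2.5z + 1

poles = den.roots                  # simple roots: 2.0 and 0.5

def residue(zk):
    """A_k = F(z)(z - z_k)/z evaluated at z = z_k, Eq. (12.87).

    For a simple root, this equals N(z_k) divided by z_k times the
    product of (z_k - z_i) over the remaining roots.
    """
    others = [p for p in poles if not np.isclose(p, zk)]
    return num(zk) / (zk * np.prod([zk - p for p in others]))

for zk in poles:
    print(zk, residue(zk))         # prints A = 2 at z = 2 and A = 4 at z = 0.5
```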
The procedure becomes slightly more complicated when we have repeated roots of D(z). Suppose we have a function
$F(z) = \frac{N(z)}{(z - z_1)(z - z_2)^2}.$
The partial fraction expansion of this function is
$\frac{F(z)}{z} = \frac{A_1}{z - z_1} + \frac{A_2}{z - z_2} + \frac{A_3}{(z - z_2)^2}.$
The values of $A_1$ and $A_3$ can be found as shown previously:
$A_1 = \left.\frac{F(z)(z - z_1)}{z}\right|_{z = z_1}$  (12.94)
$A_3 = \left.\frac{F(z)(z - z_2)^2}{z}\right|_{z = z_2}.$  (12.95)
However, we run into problems when we try to evaluate $A_2$. Let's see what happens when we multiply both sides by $(z - z_2)$:
$\frac{F(z)(z - z_2)}{z} = \frac{A_1(z - z_2)}{z - z_1} + A_2 + \frac{A_3}{z - z_2}.$  (12.96)
If we now evaluate this equation at $z = z_2$, the third term on the right-hand side becomes undefined. In order to avoid this problem, we first multiply both sides by $(z - z_2)^2$ and take the derivative with respect to z prior to evaluating the equation at $z = z_2$:
$\frac{F(z)(z - z_2)^2}{z} = \frac{A_1(z - z_2)^2}{z - z_1} + A_2(z - z_2) + A_3$  (12.97)
Taking the derivative of both sides with respect to z, we get
$\frac{d}{dz}\left[\frac{F(z)(z - z_2)^2}{z}\right] = \frac{2A_1(z - z_2)(z - z_1) - A_1(z - z_2)^2}{(z - z_1)^2} + A_2$  (12.98)
If we now evaluate the expression at $z = z_2$, we get
$A_2 = \left.\frac{d}{dz}\frac{F(z)(z - z_2)^2}{z}\right|_{z = z_2}.$  (12.99)
Generalizing this approach, we can show that if D(z) has a root of order m at some $z_k$, that portion of the partial fraction expansion can be written as
$\frac{F(z)}{z} = \frac{A_1}{z - z_k} + \frac{A_2}{(z - z_k)^2} + \cdots + \frac{A_m}{(z - z_k)^m}$  (12.100)
and the lth coefficient can be obtained as
$A_l = \left.\frac{1}{(m - l)!}\frac{d^{m-l}}{dz^{m-l}}\left[\frac{F(z)(z - z_k)^m}{z}\right]\right|_{z = z_k}.$  (12.101)
Finally, let us drop the requirement that the degree of D(z) be greater than or equal to the degree of N(z). When the degree of N(z) is greater than the degree of D(z), we can simply divide N(z) by D(z) to obtain
$F(z) = \frac{N(z)}{D(z)} = Q(z) + \frac{R(z)}{D(z)}$  (12.102)
where Q(z) is the quotient and R(z) is the remainder of the division operation. Clearly, R(z) will have degree less than D(z).
To see how all this works together, consider the following example.
E x a m p l e 12 . 9 . 5:
Let us find the inverse Z-transform of the function
Fz =
2z4
+1
2z3 −5z2 +4z−1
 (12.103)
The degree of the numerator is greater than the degree of the denominator, so we divide
once to obtain
Fz = z+
5z3
−4z2
+z+1
2z3 −5z2 +4z−1
 (12.104)
The inverse Z-transform of z is n−1, where n is the discrete delta function defined as
n =

1 n = 0
0 otherwise.
(12.105)
Let us call the remaining ratio of polynomials F1z. We find the roots of the denominator
of F1z as
F1z =
5z3
−4z2
+z+1
2z−05z−12
 (12.106)
386 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S
Then

\frac{F_1(z)}{z} = \frac{5z^3 - 4z^2 + z + 1}{2z(z - 0.5)(z - 1)^2}   (12.107)

= \frac{A_1}{z} + \frac{A_2}{z - 0.5} + \frac{A_3}{z - 1} + \frac{A_4}{(z - 1)^2}.   (12.108)

Then

A_1 = \left. \frac{5z^3 - 4z^2 + z + 1}{2(z - 0.5)(z - 1)^2} \right|_{z = 0} = -1   (12.109)

A_2 = \left. \frac{5z^3 - 4z^2 + z + 1}{2z(z - 1)^2} \right|_{z = 0.5} = 4.5   (12.110)

A_4 = \left. \frac{5z^3 - 4z^2 + z + 1}{2z(z - 0.5)} \right|_{z = 1} = 3.   (12.111)
To find A_3, we take the derivative with respect to z, then set z = 1:

A_3 = \left. \frac{d}{dz}\left[ \frac{5z^3 - 4z^2 + z + 1}{2z(z - 0.5)} \right] \right|_{z = 1} = -1.   (12.112)
Therefore,

F_1(z) = -1 + \frac{4.5z}{z - 0.5} - \frac{z}{z - 1} + \frac{3z}{(z - 1)^2}   (12.113)

and

f_{1,n} = -\delta_n + 4.5(0.5)^n u[n] - u[n] + 3n\, u[n]   (12.114)

and

f_n = \delta_{n+1} - \delta_n + 4.5(0.5)^n u[n] + (3n - 1) u[n].   (12.115)
12.9.3 Long Division

If we could write F(z) as a power series, then from the Z-transform expression the coefficients
of z^{-n} would be the sequence values f_n.

Example 12.9.6:

Let's find the inverse Z-transform of

F(z) = \frac{z}{z - a}.
Dividing the numerator by the denominator we get the following:

            1 + a z^{-1} + a^2 z^{-2} + ...
  z - a ) z
          z - a
              a
              a - a^2 z^{-1}
                  a^2 z^{-1}
                  ...

Thus, the quotient is

1 + a z^{-1} + a^2 z^{-2} + \cdots = \sum_{n=0}^{\infty} a^n z^{-n}.

We can easily see that the sequence for which F(z) is the Z-transform is

f_n = a^n u[n].
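The same long division can be carried out numerically. The sketch below (a minimal Python illustration of ours, not code from the book) generates the power-series coefficients of 1/(1 - a z^{-1}) by synthetic division and checks them against a^n.

def power_series_division(num, den, terms):
    """Divide two polynomials in z^{-1} (coefficient lists, constant term first)
    and return the first `terms` coefficients of the quotient power series."""
    num = list(num) + [0.0] * (terms + len(den))   # working copy of the remainder
    out = []
    for n in range(terms):
        c = num[n] / den[0]                        # next quotient coefficient
        out.append(c)
        for k, d in enumerate(den):                # subtract c * den * z^{-n}
            num[n + k] -= c * d
    return out

a = 0.8
# F(z) = z/(z - a) = 1/(1 - a z^{-1}): numerator [1], denominator [1, -a]
coeffs = power_series_division([1.0], [1.0, -a], 6)
print(coeffs)                                      # [1.0, 0.8, 0.64, 0.512, ...]
print([a**n for n in range(6)])                    # matches f_n = a^n u[n]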
12.9.4 Z-Transform Properties

Analogous to the continuous linear systems, we can define the transfer function of a discrete
linear system as a function of z that relates the Z-transform of the input to the Z-transform of
the output. Let \{f_n\}_{n=-\infty}^{\infty} be the input to a discrete linear time-invariant system, and \{g_n\}_{n=-\infty}^{\infty}
be the output. If F(z) is the Z-transform of the input sequence, and G(z) is the Z-transform
of the output sequence, then these are related to each other by

G(z) = H(z) F(z)   (12.116)

and H(z) is the transfer function of the discrete linear time-invariant system.
If the input sequence \{f_n\}_{n=-\infty}^{\infty} had a Z-transform of one, then G(z) would be equal to
H(z). It is an easy matter to find the requisite sequence:

F(z) = \sum_{n=-\infty}^{\infty} f_n z^{-n} = 1 \;\Rightarrow\; f_n = \begin{cases} 1 & n = 0 \\ 0 & \text{otherwise.} \end{cases}   (12.117)

This particular sequence is called the discrete delta function. The response of the system
to the discrete delta function is called the impulse response of the system. Obviously, the
transfer function H(z) is the Z-transform of the impulse response.
12.9.5 Discrete Convolution
In the continuous time case, the output of the linear time-invariant system was a convolution
of the input with the impulse response. Does the analogy hold in the discrete case? We can
check this out easily by explicitly writing out the Z-transforms in Equation (12.116). For
simplicity let us assume the sequences are all one-sided; that is, they are only nonzero for
nonnegative values of the subscript:

\sum_{n=0}^{\infty} g_n z^{-n} = \left( \sum_{n=0}^{\infty} h_n z^{-n} \right) \left( \sum_{m=0}^{\infty} f_m z^{-m} \right).   (12.118)

Equating like powers of z:

g_0 = h_0 f_0
g_1 = f_0 h_1 + f_1 h_0
g_2 = f_0 h_2 + f_1 h_1 + f_2 h_0
...
g_n = \sum_{m=0}^{n} f_m h_{n-m}
Thus, the output sequence is a result of the discrete convolution of the input sequence with
the impulse response.
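A quick numerical check of this relationship is shown below (our own illustrative Python, with arbitrary one-sided sequences): the coefficients obtained by multiplying the two power series in z^{-1} are exactly the discrete convolution of the two sequences.

import numpy as np

f = np.array([1.0, 2.0, 0.5, -1.0])      # arbitrary one-sided input sequence
h = np.array([0.5, 0.25, 0.125])         # arbitrary impulse response

# g_n = sum_{m=0}^{n} f_m h_{n-m}, written out directly
N = len(f) + len(h) - 1
g = np.zeros(N)
for n in range(N):
    for m in range(len(f)):
        if 0 <= n - m < len(h):
            g[n] += f[m] * h[n - m]

# Multiplying the polynomials in z^{-1} (np.convolve) gives the same coefficients
assert np.allclose(g, np.convolve(f, h))
print(g)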
Most of the discrete linear systems we will be dealing with will be made up of delay
elements, and their input-output relations can be written as constant coefficient difference
equations. For example, for the system shown in Figure 12.13, the input-output relationship
can be written in the form of the following difference equation:
g_k = a_0 f_k + a_1 f_{k-1} + a_2 f_{k-2} + b_1 g_{k-1} + b_2 g_{k-2}.   (12.119)
The transfer function of this system can be easily found by using the shifting theorem.
The shifting theorem states that if the Z-transform of a sequence f_n is F(z), then the
Z-transform of the sequence shifted by some integer number of samples n_0 is z^{-n_0} F(z).
FIGURE 12.13 A discrete system built from delay elements, with feedforward coefficients a_0, a_1, a_2 and feedback coefficients b_1, b_2.
The theorem is easy to prove. Suppose we have a sequence f_n with Z-transform F(z).
Let us look at the Z-transform of the sequence f_{n-n_0}:

\mathcal{Z}[f_{n-n_0}] = \sum_{n=-\infty}^{\infty} f_{n-n_0} z^{-n}   (12.120)

= \sum_{m=-\infty}^{\infty} f_m z^{-(m+n_0)}   (12.121)

= z^{-n_0} \sum_{m=-\infty}^{\infty} f_m z^{-m}   (12.122)

= z^{-n_0} F(z).   (12.123)
Assuming G(z) is the Z-transform of \{g_n\} and F(z) is the Z-transform of \{f_n\}, we can take
the Z-transform of both sides of the difference equation (12.119):

G(z) = a_0 F(z) + a_1 z^{-1} F(z) + a_2 z^{-2} F(z) + b_1 z^{-1} G(z) + b_2 z^{-2} G(z)   (12.124)

from which we get the relationship between G(z) and F(z) as

G(z) = \frac{a_0 + a_1 z^{-1} + a_2 z^{-2}}{1 - b_1 z^{-1} - b_2 z^{-2}} F(z).   (12.125)

By definition the transfer function H(z) is therefore

H(z) = \frac{G(z)}{F(z)}   (12.126)

= \frac{a_0 + a_1 z^{-1} + a_2 z^{-2}}{1 - b_1 z^{-1} - b_2 z^{-2}}.   (12.127)
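The correspondence between the difference equation (12.119) and the transfer function (12.127) can be checked numerically. In the sketch below (illustrative Python; the coefficient values are arbitrary assumptions, not from the book) the impulse response obtained by running the recursion directly matches the one produced by scipy.signal.lfilter with numerator [a0, a1, a2] and denominator [1, -b1, -b2].

import numpy as np
from scipy.signal import lfilter

a0, a1, a2 = 0.5, 0.3, 0.2     # feedforward coefficients (arbitrary example values)
b1, b2 = 0.4, -0.1             # feedback coefficients (arbitrary example values)

n_samples = 20
f = np.zeros(n_samples); f[0] = 1.0           # discrete delta function as input

# Run g_k = a0 f_k + a1 f_{k-1} + a2 f_{k-2} + b1 g_{k-1} + b2 g_{k-2} directly
g = np.zeros(n_samples)
for k in range(n_samples):
    g[k] = a0 * f[k]
    if k >= 1: g[k] += a1 * f[k - 1] + b1 * g[k - 1]
    if k >= 2: g[k] += a2 * f[k - 2] + b2 * g[k - 2]

# The same system expressed through H(z) of Equation (12.127)
h = lfilter([a0, a1, a2], [1.0, -b1, -b2], f)
assert np.allclose(g, h)
print(g[:6])                                   # impulse response h_n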
12.10 Summary
In this chapter we have reviewed some of the mathematical tools we will be using throughout
the remainder of this book. We started with a review of vector space concepts, followed by a
look at a number of ways we can represent a signal, including the Fourier series, the Fourier
transform, the discrete Fourier series, the discrete Fourier transform, and the Z-transform.
We also looked at the operation of sampling and the conditions necessary for the recovery
of the continuous representation of the signal from its samples.
Further Reading
1. There are a large number of books that provide a much more detailed look at the
concepts described in this chapter. A nice one is Signal Processing and Linear Systems,
by B.P. Lathi [177].
2. For a thorough treatment of the fast Fourier transform (FFT), see Numerical Recipes
in C, by W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.J. Flannery [178].
12.11 Projects and Problems
1. Let X be a set of N linearly independent vectors, and let V be the collection of vectors
obtained using all linear combinations of the vectors in X.
(a) Show that given any two vectors in V, the sum of these vectors is also an element
of V.
(b) Show that V contains an additive identity.
(c) Show that for every x in V, there exists a −x in V such that their sum is the
additive identity.
2. Prove Parseval’s theorem for the Fourier transform.
3. Prove the modulation property of the Fourier transform.
4. Prove the convolution theorem for the Fourier transform.
5. Show that the Fourier transform of a train of impulses in the time domain is a train
   of impulses in the frequency domain:

   \mathcal{F}\left[ \sum_{n=-\infty}^{\infty} \delta(t - nT) \right] = \sigma_0 \sum_{n=-\infty}^{\infty} \delta(\omega - n\sigma_0), \qquad \sigma_0 = \frac{2\pi}{T}.   (12.128)
6. Find the Z-transform for the following sequences:

   (a) h_n = 2^{-n} u[n], where u[n] is the unit step function.
   (b) h_n = n 2^{-n} 3^{-n} u[n].
   (c) h_n = (n 2^{-n} + 0.6^n) u[n].

7. Given the following input-output relationship:

   y_n = 0.6 y_{n-1} + 0.5 x_n + 0.2 x_{n-1}

   (a) Find the transfer function H(z).
   (b) Find the impulse response h_n.

8. Find the inverse Z-transform of the following:

   (a) H(z) = \frac{5}{z - 2}.
   (b) H(z) = \frac{z}{z^2 - 0.25}.
   (c) H(z) = \frac{z}{z - 0.5}.
13
Transform Coding
13.1 Overview

In this chapter we will describe a technique in which the source output is
decomposed, or transformed, into components that are then coded according
to their individual characteristics. We will then look at a number of different
transforms, including the popular discrete cosine transform, and discuss the
issues of quantization and coding of the transformed coefficients. This chapter
concludes with a description of the baseline sequential JPEG image-coding algorithm and
some of the issues involved with transform coding of audio signals.
13.2 Introduction
In the previous chapter we developed a number of tools that can be used to transform a
given sequence into different representations. If we take a sequence of inputs and transform
them into another sequence in which most of the information is contained in only a few
elements, we can then encode and transmit those elements, along with their location in
the new sequence, resulting in data compression. In our discussion, we will use the terms
“variance” and “information” interchangeably. The justification for this is shown in the
results in Chapter 7. For example, recall that for a Gaussian source the differential entropy is
given as \frac{1}{2}\log(2\pi e \sigma^2). Thus, an increase in the variance results in an increase in the entropy,
which is a measure of the information contained in the source output.
To begin our discussion of transform coding, consider the following example.
Example 13.2.1:
Let’s revisit Example 8.5.1. In Example 8.5.1, we studied the encoding of the output of a
source that consisted of a sequence of pairs of numbers. Each pair of numbers corresponds
to the height and weight of an individual. In particular, let’s look at the sequence of outputs
shown in Table 13.1.
If we look at the height and weight as the coordinates of a point in two-dimensional
space, the sequence can be shown graphically as in Figure 13.1. Notice that the output
TABLE 13.1 Original sequence.
Height Weight
65 170
75 188
60 150
70 170
56 130
80 203
68 160
50 110
40 80
50 153
69 148
62 140
76 164
64 120
FIGURE 13.1 Source output sequence.
values tend to cluster around the line y = 2.5x. We can rotate this set of values by the
transformation

\theta = A\mathbf{x}   (13.1)

where \mathbf{x} is the two-dimensional source output vector

\mathbf{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}   (13.2)

x_0 corresponds to height and x_1 corresponds to weight, A is the rotation matrix

A = \begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix}   (13.3)

\phi is the angle between the x-axis and the y = 2.5x line, and

\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}   (13.4)

is the rotated or transformed set of values. For this particular case the matrix A is

A = \begin{bmatrix} 0.37139068 & 0.92847669 \\ -0.92847669 & 0.37139068 \end{bmatrix}   (13.5)
and the transformed sequence (rounded to the nearest integer) is shown in Table 13.2. (For a
brief review of matrix concepts, see Appendix B.)
Notice that for each pair of values, almost all the energy is compacted into the first
element of the pair, while the second element of the pair is significantly smaller. If we plot
this sequence in pairs, we get the result shown in Figure 13.2. Note that we have rotated the
original values by an angle of approximately 68 degrees (arctan 2.5).
TABLE 13.2 Transformed sequence.
First Coordinate Second Coordinate
182 3
202 0
162 0
184 −2
141 −4
218 1
174 −4
121 −6
90 −7
161 10
163 −9
153 −6
181 −9
135 −15
FIGURE 13.2 The transformed sequence.
Suppose we set all the second elements of the transformation to zero, that is, the second
coordinates of the sequence shown in Table 13.2. This reduces the number of elements that
need to be encoded by half. What is the effect of throwing away half the elements of the
sequence? We can find that out by taking the inverse transform of the reduced sequence.
The inverse transform consists of reversing the rotation. We can do this by multiplying the
blocks of two of the transformed sequences with the second element in each block set to
zero with the matrix
A^{-1} = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}   (13.6)
and obtain the reconstructed sequence shown in Table 13.3. Comparing this to the original
sequence in Table 13.1, we see that, even though we transmitted only half the number of
elements present in the original sequence, this “reconstructed” sequence is very close to the
original. The reason there is so little error introduced in the sequence xn is that for this
TABLE 13.3 Reconstructed sequence.
Height Weight
68 169
75 188
60 150
68 171
53 131
81 203
65 162
45 112
34 84
60 150
61 151
57 142
67 168
50 125
particular transformation the error introduced into the \{x_n\} sequence is equal to the error
introduced into the \{\theta_n\} sequence. That is,

\sum_{i=0}^{N-1} (x_i - \hat{x}_i)^2 = \sum_{i=0}^{N-1} (\theta_i - \hat{\theta}_i)^2   (13.7)

where \{\hat{x}_n\} is the reconstructed sequence, and

\hat{\theta}_i = \begin{cases} \theta_i & i = 0, 2, 4, \ldots \\ 0 & \text{otherwise} \end{cases}   (13.8)

(see Problem 1). The error introduced in the \{\theta_n\} sequence is the sum of squares of the \theta_i s
that are set to zero. The magnitudes of these elements are quite small, and therefore the total
error introduced into the reconstructed sequence is quite small also.
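The whole example can be reproduced in a few lines. The Python sketch below (our illustration, using the data of Table 13.1) rotates the height-weight pairs by the angle arctan 2.5, discards the second coordinate, inverts the rotation, and reports the resulting error.

import numpy as np

height = np.array([65, 75, 60, 70, 56, 80, 68, 50, 40, 50, 69, 62, 76, 64], float)
weight = np.array([170, 188, 150, 170, 130, 203, 160, 110, 80, 153, 148, 140, 164, 120], float)
x = np.vstack([height, weight])                  # each column is one source output pair

phi = np.arctan(2.5)                             # angle of the y = 2.5x line, about 68 degrees
A = np.array([[np.cos(phi),  np.sin(phi)],
              [-np.sin(phi), np.cos(phi)]])      # rotation matrix of Equation (13.3)

theta = A @ x                                    # forward transform (13.1)
theta_hat = theta.copy()
theta_hat[1, :] = 0.0                            # discard the second coordinate of each pair

x_hat = A.T @ theta_hat                          # inverse rotation; for a rotation A^{-1} = A^T
print(np.round(theta[0, :2]), np.round(theta[1, :2]))            # compare with Table 13.2
print(np.sum((x - x_hat)**2), np.sum((theta - theta_hat)**2))    # equal, as in (13.7)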
We could reduce the number of samples we needed to code because most of the infor-
mation contained in each pair of values was put into one element of each pair. As the other
element of the pair contained very little information, we could discard it without a significant
effect on the fidelity of the reconstructed sequence. The transform in this case acted on pairs
of values; therefore, the maximum reduction in the number of significant samples was a
factor of two. We can extend this idea to longer blocks of data. By compacting most of the
information in a source output sequence into a few elements of the transformed sequence
using a reversible transform, and then discarding the elements of the sequence that do not
contain much information, we can get a large amount of compression. This is the basic idea
behind transform coding.
In Example 13.2.1 we have presented a geometric view of the transform process. We
can also examine the transform process in terms of the changes in statistics between the
original and transformed sequences. It can be shown that we can get the maximum amount
of compaction if we use a transform that decorrelates the input sequence; that is, the sample-
to-sample correlation of the transformed sequence is zero. The first transform to provide
decorrelation for discrete data was presented by Hotelling [179] in the Journal of Educational
Psychology in 1933. He called his approach the method of principal components. The
analogous transform for continuous functions was obtained by Karhunen [180] and Loéve
[181]. This decorrelation approach was first utilized for compression, in what we now call
transform coding, by Kramer and Mathews [182], and Huang and Schultheiss [183].
Transform coding consists of three steps. First, the data sequence xn is divided into
blocks of size N. Each block is mapped into a transform sequence n using a reversible
mapping in a manner similar to that described in Example 13.2.1. As shown in the example,
different elements of each block of the transformed sequence generally have different statis-
tical properties. In Example 13.2.1, most of the energy of the block of two input values was
contained in the first element of the block of two transformed values, while very little of
the energy was contained in the second element. This meant that the second element of each
block of the transformed sequence would have a small magnitude, while the magnitude of
the first element could vary considerably depending on the magnitude of the elements in the
input block. The second step consists of quantizing the transformed sequence. The quantiza-
tion strategy used will depend on three main factors: the desired average bit rate, the statistics
of the various elements of the transformed sequence, and the effect of distortion in the trans-
formed coefficients on the reconstructed sequence. In Example 13.2.1, we could take all the
bits available to us and use them to quantize the first coefficient. In more complex situations,
the strategy used may be very different. In fact, we may use different techniques, such as
differential encoding and vector quantization [118], to encode the different coefficients.
Finally, the quantized value needs to be encoded using some binary encoding technique.
The binary coding may be as simple as using a fixed-length code or as complex as a
combination of run-length coding and Huffman or arithmetic coding. We will see an example
of the latter when we describe the JPEG algorithm.
The various quantization and binary coding techniques have been described at some
length in previous chapters, so we will spend the next section describing various transforms.
We will then discuss quantization and coding strategies in the context of these transforms.
13.3 The Transform
All the transforms we deal with will be linear transforms; that is, we can get the sequence
\{\theta_n\} from the sequence \{x_n\} as

\theta_n = \sum_{i=0}^{N-1} x_i a_{n,i}.   (13.9)

This is referred to as the forward transform. For the transforms that we will be considering,
a major difference between the transformed sequence \{\theta_n\} and the original sequence \{x_n\}
is that the characteristics of the elements of the \theta sequence are determined by their position
within the sequence. For example, in Example 13.2.1 the first element of each pair of the
transformed sequence was more likely to have a large magnitude compared to the second
element. In general, we cannot make such statements about the source output sequence
\{x_n\}. A measure of the differing characteristics of the different elements of the transformed
sequence \{\theta_n\} is the variance \sigma^2_{\theta_n} of each element. These variances will strongly influence
how we encode the transformed sequence. The size of the block N is dictated by practical
considerations. In general, the complexity of the transform grows more than linearly with N.
Therefore, beyond a certain value of N, the computational costs overwhelm any marginal
improvements that might be obtained by increasing N. Furthermore, in most real sources
the statistical characteristics of the source output can change abruptly. For example, when
we go from a silence period to a voiced period in speech, the statistics change drastically.
Similarly, in images, the statistical characteristics of a smooth region of the image can be
very different from the statistical characteristics of a busy region of the image. If N is
large, the probability that the statistical characteristics change significantly within a block
increases. This generally results in a larger number of the transform coefficients with large
values, which in turn leads to a reduction in the compression ratio.
The original sequence \{x_n\} can be recovered from the transformed sequence \{\theta_n\} via the
inverse transform:

x_n = \sum_{i=0}^{N-1} \theta_i b_{n,i}.   (13.10)
The transforms can be written in matrix form as

\theta = A\mathbf{x}   (13.11)
\mathbf{x} = B\theta   (13.12)

where A and B are N \times N matrices and the (i, j)th element of the matrices is given by

[A]_{i,j} = a_{i,j}   (13.13)
[B]_{i,j} = b_{i,j}.   (13.14)

The forward and inverse transform matrices A and B are inverses of each other; that is,
AB = BA = I, where I is the identity matrix.
Equations (13.9) and (13.10) deal with the transform coding of one-dimensional
sequences, such as sampled speech and audio sequences. However, transform coding is one
of the most popular methods used for image compression. In order to take advantage of
the two-dimensional nature of dependencies in images, we need to look at two-dimensional
transforms.
Let X_{i,j} be the (i, j)th pixel in an image. A general linear two-dimensional transform for
a block of size N \times N is given as

\theta_{k,l} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{i,j}\, a_{i,j,k,l}.   (13.15)
All two-dimensional transforms in use today are separable transforms; that is, we can take
the transform of a two-dimensional block by first taking the transform along one dimension,
then repeating the operation along the other direction. In terms of matrices, this involves first
taking the (one-dimensional) transform of the rows, and then taking the column-by-column
transform of the resulting matrix. We can also reverse the order of the operations, first taking
the transform of the columns, and then taking the row-by-row transform of the resulting
matrix. The transform operation can be represented as

\theta_{k,l} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_{k,i}\, X_{i,j}\, a_{l,j}   (13.16)

which in matrix terminology would be given by

\Theta = A X A^T.   (13.17)

The inverse transform is given as

X = B \Theta B^T.   (13.18)
All the transforms we deal with will be orthonormal transforms. An orthonormal trans-
form has the property that the inverse of the transform matrix is simply its transpose because
the rows of the transform matrix form an orthonormal basis set:
B = A^{-1} = A^T.   (13.19)

For an orthonormal transform, the inverse transform will be given as

X = A^T \Theta A.   (13.20)
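A separable two-dimensional transform is therefore just two matrix multiplications. The sketch below (illustrative Python; the orthonormal matrix is generated at random via a QR factorization, purely as an assumption for the demonstration) applies Equation (13.17) and recovers the block with Equation (13.20).

import numpy as np

N = 8
rng = np.random.default_rng(0)

# Any orthonormal A will do for the demonstration; QR of a random matrix gives one
A, _ = np.linalg.qr(rng.standard_normal((N, N)))
X = rng.standard_normal((N, N))                  # an arbitrary N x N image block

Theta = A @ X @ A.T                              # forward transform, Equation (13.17)
X_rec = A.T @ Theta @ A                          # inverse transform, Equation (13.20)

assert np.allclose(X, X_rec)                           # perfect reconstruction
assert np.allclose(np.sum(Theta**2), np.sum(X**2))     # energy is preserved by an orthonormal transform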
Orthonormal transforms are energy preserving; that is, the sum of the squares of the
transformed sequence is the same as the sum of the squares of the original sequence. We
can see this most easily in the case of the one-dimensional transform:

\sum_{i=0}^{N-1} \theta_i^2 = \theta^T \theta   (13.21)

= (A\mathbf{x})^T (A\mathbf{x})   (13.22)

= \mathbf{x}^T A^T A \mathbf{x}.   (13.23)

If A is an orthonormal transform, A^T A = A^{-1} A = I, then

\mathbf{x}^T A^T A \mathbf{x} = \mathbf{x}^T \mathbf{x}   (13.24)

= \sum_{n=0}^{N-1} x_n^2   (13.25)

and

\sum_{i=0}^{N-1} \theta_i^2 = \sum_{n=0}^{N-1} x_n^2.   (13.26)
The efficacy of a transform depends on how much energy compaction is provided by the
transform. One way of measuring the amount of energy compaction afforded by a particular
orthonormal transform is to take the ratio of the arithmetic mean of the variances of the
transform coefficients to their geometric mean [123]. This ratio is also referred to as the
transform coding gain G_{TC}:

G_{TC} = \frac{\frac{1}{N} \sum_{i=0}^{N-1} \sigma^2_{\theta_i}}{\left( \prod_{i=0}^{N-1} \sigma^2_{\theta_i} \right)^{1/N}}   (13.27)

where \sigma^2_{\theta_i} is the variance of the ith coefficient \theta_i.
Transforms can be interpreted in several ways. We have already mentioned a geometric
interpretation and a statistical interpretation. We can also interpret them as a decomposition
of the signal in terms of a basis set. For example, suppose we have a two-dimensional
orthonormal transform A. The inverse transform can be written as

\begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} a_{00} & a_{10} \\ a_{01} & a_{11} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} = \theta_0 \begin{bmatrix} a_{00} \\ a_{01} \end{bmatrix} + \theta_1 \begin{bmatrix} a_{10} \\ a_{11} \end{bmatrix}   (13.28)
We can see that the transformed values are actually the coefficients of an expansion of the
input sequence in terms of the rows of the transform matrix. The rows of the transform
matrix are often referred to as the basis vectors for the transform because they form an
orthonormal basis set, and the elements of the transformed sequence are often called the
transform coefficients. By characterizing the basis vectors in physical terms we can get a
physical interpretation of the transform coefficients.
Example 13.3.1:
Consider the following transform matrix:

A = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}   (13.29)
We can verify that this is indeed an orthonormal transform.
Notice that the first row of the matrix would correspond to a “low-pass” signal (no change
from one component to the next), while the second row would correspond to a “high-pass”
signal. Thus, if we tried to express a sequence in which each element has the same value
in terms of these two rows, the second coefficient should be zero. Suppose the original
sequence is (\alpha, \alpha). Then

\begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} \alpha \\ \alpha \end{bmatrix} = \begin{bmatrix} \sqrt{2}\,\alpha \\ 0 \end{bmatrix}.   (13.30)

The "low-pass" coefficient has a value of \sqrt{2}\,\alpha, while the "high-pass" coefficient has a
value of 0. The "low-pass" and "high-pass" coefficients are generally referred to as the
low-frequency and high-frequency coefficients.
Let us take two sequences in which the components are not the same and the degree of
variation is different. Consider the two sequences (3, 1) and (3, -1). In the first sequence
the second element differs from the first by 2; in the second sequence, the magnitude of the
difference is 4. We could say that the second sequence is more "high pass" than the first
sequence. The transform coefficients for the two sequences are (2\sqrt{2}, \sqrt{2}) and (\sqrt{2}, 2\sqrt{2}),
respectively. Notice that the high-frequency coefficient for the sequence in which we see
a larger change is twice that of the high-frequency coefficient for the sequence with less
change. Thus, the two coefficients do seem to behave like the outputs of a low-pass filter
and a high-pass filter.
Finally, notice that in every case the sum of the squares of the original sequence is the
same as the sum of the squares of the transform coefficients; that is, the transform is energy
preserving, as it must be, since A is orthonormal. 
We can interpret one-dimensional transforms as an expansion in terms of the rows of the
transform matrix. Similarly, we can interpret two-dimensional transforms as expansions in
terms of matrices that are formed by the outer product of the rows of the transform matrix.
Recall that the outer product is given by
\mathbf{x}\mathbf{x}^T = \begin{bmatrix}
x_0 x_0 & x_0 x_1 & \cdots & x_0 x_{N-1} \\
x_1 x_0 & x_1 x_1 & \cdots & x_1 x_{N-1} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N-1} x_0 & x_{N-1} x_1 & \cdots & x_{N-1} x_{N-1}
\end{bmatrix}   (13.31)
To see this more clearly, let us use the transform introduced in Example 13.3.1 for a
two-dimensional transform.
Example 13.3.2:
For an N \times N transform A, let \alpha_{i,j} be the outer product of the ith and jth rows:

\alpha_{i,j} = \begin{bmatrix} a_{i,0} \\ a_{i,1} \\ \vdots \\ a_{i,N-1} \end{bmatrix} \begin{bmatrix} a_{j,0} & a_{j,1} & \cdots & a_{j,N-1} \end{bmatrix}   (13.32)

= \begin{bmatrix}
a_{i,0} a_{j,0} & a_{i,0} a_{j,1} & \cdots & a_{i,0} a_{j,N-1} \\
a_{i,1} a_{j,0} & a_{i,1} a_{j,1} & \cdots & a_{i,1} a_{j,N-1} \\
\vdots & \vdots & \ddots & \vdots \\
a_{i,N-1} a_{j,0} & a_{i,N-1} a_{j,1} & \cdots & a_{i,N-1} a_{j,N-1}
\end{bmatrix}   (13.33)
For the transform of Example 13.3.1, the outer products are

\alpha_{0,0} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad \alpha_{0,1} = \frac{1}{2} \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}   (13.34)

\alpha_{1,0} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix}, \qquad \alpha_{1,1} = \frac{1}{2} \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}   (13.35)
From (13.20), the inverse transform is given by

\begin{bmatrix} x_{00} & x_{01} \\ x_{10} & x_{11} \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} \theta_{00} & \theta_{01} \\ \theta_{10} & \theta_{11} \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}   (13.36)

= \frac{1}{2} \begin{bmatrix} \theta_{00} + \theta_{01} + \theta_{10} + \theta_{11} & \theta_{00} - \theta_{01} + \theta_{10} - \theta_{11} \\ \theta_{00} + \theta_{01} - \theta_{10} - \theta_{11} & \theta_{00} - \theta_{01} - \theta_{10} + \theta_{11} \end{bmatrix}   (13.37)

= \theta_{00}\,\alpha_{0,0} + \theta_{01}\,\alpha_{0,1} + \theta_{10}\,\alpha_{1,0} + \theta_{11}\,\alpha_{1,1}.   (13.38)

The transform values \theta_{ij} can be viewed as the coefficients of the expansion of \mathbf{x} in terms of
the matrices \alpha_{i,j}. The matrices \alpha_{i,j} are known as the basis matrices.
For historical reasons, the coefficient \theta_{00}, corresponding to the basis matrix \alpha_{0,0}, is called
the DC coefficient, while the coefficients corresponding to the other basis matrices are called
AC coefficients. DC stands for direct current, which is current that does not change with
time. AC stands for alternating current, which does change with time. Notice that all the
elements of the basis matrix \alpha_{0,0} are the same, hence the DC designation.
In the following section we will look at some of the variety of transforms available to
us, then at some of the issues involved in quantization and coding. Finally, we will describe
in detail two applications, one for image coding and one for audio coding.
13.4 Transforms of Interest
In Example 13.2.1, we constructed a transform that was specific to the data. In practice,
it is generally not feasible to construct a transform for the specific situation, for several
reasons. Unless the characteristics of the source output are stationary over a long interval,
the transform needs to be recomputed often, and it is generally burdensome to compute a
transform for every different set of data. Furthermore, the overhead required to transmit
the transform itself might negate any compression gains. Both of these problems become
especially acute when the size of the transform is large. However, there are times when
we want to find out the best we can do with transform coding. In these situations, we
can use data-dependent transforms to obtain an idea of the best performance available. The
best-known data-dependent transform is the discrete Karhunen-Loéve transform (KLT). We
will describe this transform in the next section.
13.4.1 Karhunen-Loéve Transform
The rows of the discrete Karhunen-Loéve transform [184], also known as the Hotelling
transform, consist of the eigenvectors of the autocorrelation matrix. The autocorrelation
matrix for a random process X is a matrix whose (i, j)th element [R]_{i,j} is given by

[R]_{i,j} = E[X_n X_{n+i-j}].   (13.39)
We can show [123] that a transform constructed in this manner will minimize the geometric
mean of the variance of the transform coefficients. Hence, the Karhunen-Loéve transform
provides the largest transform coding gain of any transform coding method.
If the source output being compressed is nonstationary, the autocorrelation function will
change with time. Thus, the autocorrelation matrix will change with time, and the KLT will
have to be recomputed. For a transform of any reasonable size, this is a significant amount
of computation. Furthermore, as the autocorrelation is computed based on the source output,
it is not available to the receiver. Therefore, either the autocorrelation or the transform itself
has to be sent to the receiver. The overhead can be significant and remove any advantages
to using the optimum transform. However, in applications where the statistics change slowly
and the transform size can be kept small, the KLT can be of practical use [185].
Example 13.4.1:
Let us see how to obtain the KLT transform of size two for an arbitrary input sequence. The
autocorrelation matrix of size two for a stationary process is

R = \begin{bmatrix} R_{xx}(0) & R_{xx}(1) \\ R_{xx}(1) & R_{xx}(0) \end{bmatrix}   (13.40)

Solving the equation |\lambda I - R| = 0, we get the two eigenvalues \lambda_1 = R_{xx}(0) + R_{xx}(1), and
\lambda_2 = R_{xx}(0) - R_{xx}(1). The corresponding eigenvectors are

V_1 = \begin{bmatrix} \alpha \\ \alpha \end{bmatrix} \qquad V_2 = \begin{bmatrix} \beta \\ -\beta \end{bmatrix}   (13.41)
where \alpha and \beta are arbitrary constants. If we now impose the orthonormality condition, which
requires the vectors to have a magnitude of 1, we get

\alpha = \beta = \frac{1}{\sqrt{2}}

and the transform matrix K is

K = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.   (13.42)

Notice that this matrix is not dependent on the values of R_{xx}(0) and R_{xx}(1). This is only true
of the 2 \times 2 KLT. The transform matrices of higher order are functions of the autocorrelation
values.
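For larger blocks the KLT has to be computed numerically from the autocorrelation values. A minimal sketch is given below (our own Python, with an assumed first-order autocorrelation model R_xx(k) = rho^k purely for illustration): the rows of the KLT are the eigenvectors of the autocorrelation matrix, and for N = 2 they reduce, up to sign, to the matrix K of Equation (13.42) regardless of the correlation.

import numpy as np

def klt(autocorr):
    """Return the KLT matrix whose rows are the eigenvectors of the
    autocorrelation matrix R, ordered by decreasing eigenvalue."""
    N = len(autocorr)
    R = np.array([[autocorr[abs(i - j)] for j in range(N)] for i in range(N)])
    eigvals, eigvecs = np.linalg.eigh(R)          # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]             # largest variance first
    return eigvecs[:, order].T                    # rows are the basis vectors

rho = 0.9                                         # assumed correlation coefficient
print(klt([1.0, rho]))                            # rows are +/-[1, 1]/sqrt(2) and +/-[1, -1]/sqrt(2)
print(klt([rho**k for k in range(8)]))            # an 8x8 KLT depends on the autocorrelation values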
Although the Karhunen-Loéve transform maximizes the transform coding gain as defined
by (13.27), it is not practical in most circumstances. Therefore, we need transforms that do
not depend on the data being transformed. We describe some of the more popular transforms
in the following sections.
13.4.2 Discrete Cosine Transform
The discrete cosine transform (DCT) gets its name from the fact that the rows of the N \times N
transform matrix C are obtained as a function of cosines:

[C]_{i,j} = \begin{cases}
\sqrt{\frac{1}{N}} \cos\frac{(2j+1)i\pi}{2N} & i = 0,\; j = 0, 1, \ldots, N-1 \\
\sqrt{\frac{2}{N}} \cos\frac{(2j+1)i\pi}{2N} & i = 1, 2, \ldots, N-1,\; j = 0, 1, \ldots, N-1.
\end{cases}   (13.43)
The rows of the transform matrix are shown in graphical form in Figure 13.3. Notice how
the amount of variation increases as we progress down the rows; that is, the frequency of
the rows increases as we go from top to bottom.
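Equation (13.43) translates directly into code. The sketch below (illustrative Python, not from the book) builds the N x N DCT matrix, confirms that it is orthonormal, and shows that a constant sequence is represented entirely by the i = 0 (DC) row.

import numpy as np

def dct_matrix(N):
    """DCT matrix defined by Equation (13.43)."""
    C = np.zeros((N, N))
    for i in range(N):
        scale = np.sqrt(1.0 / N) if i == 0 else np.sqrt(2.0 / N)
        for j in range(N):
            C[i, j] = scale * np.cos((2 * j + 1) * i * np.pi / (2 * N))
    return C

C = dct_matrix(8)
assert np.allclose(C @ C.T, np.eye(8))        # rows form an orthonormal basis
print(np.round(C @ np.ones(8), 4))            # only the DC coefficient is nonzero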
The outer products of the rows are shown in Figure 13.4. Notice that the basis matrices
show increased variation as we go from the top-left matrix, corresponding to the \theta_{00}
coefficient, to the bottom-right matrix, corresponding to the \theta_{(N-1)(N-1)} coefficient.
The DCT is closely related to the discrete Fourier transform (DFT) mentioned in
Chapter 11, and in fact can be obtained from the DFT. However, in terms of compression,
the DCT performs better than the DFT.
To see why, recall that when we find the Fourier coefficients for a sequence of length N,
we assume that the sequence is periodic with period N. If the original sequence is as shown
in Figure 13.5a, the DFT assumes that the sequence outside the interval of interest behaves
in the manner shown in Figure 13.5b. This introduces sharp discontinuities, at the beginning
and the end of the sequence. In order to represent these sharp discontinuities, the DFT
needs nonzero coefficients for the high-frequency components. Because these components
are needed only at the two endpoints of the sequence, their effect needs to be canceled out at
other points in the sequence. Thus, the DFT adjusts other coefficients accordingly. When we
discard the high-frequency coefficients (which should not have been there anyway) during
FIGURE 13.3 Basis set for the discrete cosine transform. The numbers in the circles correspond to the row of the transform matrix.
the compression process, the coefficients that were canceling out the high-frequency effect
in other parts of the sequence result in the introduction of additional distortion.
The DCT can be obtained using the DFT by mirroring the original N-point sequence to
obtain a 2N-point sequence, as shown in Figure 13.6b. The DCT is simply the first N points
of the resulting 2N-point DFT. When we take the DFT of the 2N-point mirrored sequence,
we again have to assume periodicity. However, as we can see from Figure 13.6c, this does
not introduce any sharp discontinuities at the edges.
The DCT is substantially better at energy compaction for most correlated sources when
compared to the DFT [123]. In fact, for Markov sources with high correlation coefficient \rho,

\rho = \frac{E[x_n x_{n+1}]}{E[x_n^2]}   (13.44)

the compaction ability of the DCT is very close to that of the KLT. As many sources can
be modeled as Markov sources with high values for \rho, this superior compaction ability has
made the DCT the most popular transform. It is a part of many international standards,
including JPEG, MPEG, and CCITT H.261, among others.
FIGURE 13.4 The basis matrices for the DCT.
13.4.3 Discrete Sine Transform
The discrete sine transform (DST) is a complementary transform to the DCT. Where the
DCT provides performance close to the optimum KLT when the correlation coefficient \rho
is large, the DST performs close to the optimum KLT in terms of compaction when the
magnitude of \rho is small. Because of this property, it is often used as the complementary
transform to DCT in image [186] and audio [187] coding applications.
The elements of the transform matrix for an N \times N DST are

[S]_{i,j} = \sqrt{\frac{2}{N+1}} \sin\frac{(i+1)(j+1)\pi}{N+1}, \qquad i, j = 0, 1, \ldots, N-1.   (13.45)
13.4.4 Discrete Walsh-Hadamard Transform
A transform that is especially simple to implement is the discrete Walsh-Hadamard transform
(DWHT). The DWHT transform matrices are rearrangements of discrete Hadamard matrices,
which are of particular importance in coding theory [188]. A Hadamard matrix of order N
is defined as an N ×N matrix H, with the property that HHT
= NI, where I is the N ×N
FIGURE 13.5 Taking the discrete Fourier transform of a sequence.

FIGURE 13.6 Taking the discrete cosine transform of a sequence.
identity matrix. Hadamard matrices whose dimensions are a power of two can be constructed
in the following manner:
H_{2N} = \begin{bmatrix} H_N & H_N \\ H_N & -H_N \end{bmatrix}   (13.46)

with H_1 = [1]. Therefore,

H_2 = \begin{bmatrix} H_1 & H_1 \\ H_1 & -H_1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}   (13.47)

H_4 = \begin{bmatrix} H_2 & H_2 \\ H_2 & -H_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}   (13.48)
H_8 = \begin{bmatrix} H_4 & H_4 \\ H_4 & -H_4 \end{bmatrix} = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\
1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\
1 & -1 & -1 & 1 & -1 & 1 & 1 & -1
\end{bmatrix}   (13.49)
The DWHT transform matrix H can be obtained from the Hadamard matrix by multiplying
it by a normalizing factor so that HH^T = I instead of NI, and by reordering the rows in
increasing sequency order. The sequency of a row is half the number of sign changes in that
row. In H_8 the first row has sequency 0, the second row has sequency 7/2, the third row has
sequency 3/2, and so on. Normalization involves multiplying the matrix by \frac{1}{\sqrt{N}}. Reordering
the H_8 matrix in increasing sequency order, we get
H = \frac{1}{\sqrt{8}} \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\
1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 \\
1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1
\end{bmatrix}   (13.50)
Because the matrix without the scaling factor consists of ±1, the transform operation
consists simply of addition and subtraction. For this reason, this transform is useful in
situations where minimizing the amount of computations is very important. However, the
amount of energy compaction obtained with this transform is substantially less than the
compaction obtained by the use of the DCT. Therefore, where sufficient computational
power is available, DCT is the transform of choice.
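The construction described above is easy to automate. The following Python sketch (ours, for illustration) builds H_N recursively from Equation (13.46), reorders the rows by sequency, normalizes by 1/sqrt(N), and verifies that the result is orthonormal.

import numpy as np

def hadamard(N):
    """Hadamard matrix of order N (N a power of two), built from Eq. (13.46)."""
    H = np.array([[1.0]])
    while H.shape[0] < N:
        H = np.block([[H, H], [H, -H]])
    return H

def dwht_matrix(N):
    """Normalized Walsh-Hadamard transform matrix in increasing sequency order."""
    H = hadamard(N)
    sequency = (np.abs(np.diff(np.sign(H), axis=1)) > 0).sum(axis=1) / 2.0
    H = H[np.argsort(sequency), :]           # reorder rows by sequency
    return H / np.sqrt(N)                    # normalize so that H H^T = I

H = dwht_matrix(8)
assert np.allclose(H @ H.T, np.eye(8))
print(H * np.sqrt(8))                        # compare with the matrix in Eq. (13.50)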
13.5 Quantization and Coding of Transform Coefficients
If the amount of information conveyed by each coefficient is different, it makes sense to
assign differing numbers of bits to the different coefficients. There are two approaches to
assigning bits. One approach relies on the average properties of the transform coefficients,
while the other approach assigns bits as needed by individual transform coefficients.
In the first approach, we first obtain an estimate of the variances of the transform
coefficients. These estimates can be used by one of two algorithms to assign the number
of bits used to quantize each of the coefficients. We assume that the relative variance of
the coefficients corresponds to the amount of information contained in each coefficient.
Thus, coefficients with higher variance are assigned more bits than coefficients with smaller
variance.
Let us find an expression for the distortion, then find the bit allocation that minimizes
the distortion. To perform the minimization we will use the method of Lagrange [189]. If
the average number of bits per sample to be used by the transform coding system is R, and
the average number of bits per sample used by the kth coefficient is Rk, then
R = \frac{1}{M} \sum_{k=1}^{M} R_k   (13.51)

where M is the number of transform coefficients. The reconstruction error variance for the
kth quantizer, \sigma^2_{r_k}, is related to the kth quantizer input variance \sigma^2_{\theta_k} by the following:

\sigma^2_{r_k} = \alpha_k 2^{-2R_k} \sigma^2_{\theta_k}   (13.52)

where \alpha_k is a factor that depends on the input distribution and the quantizer.
The total reconstruction error is given by

\sigma^2_r = \sum_{k=1}^{M} \alpha_k 2^{-2R_k} \sigma^2_{\theta_k}.   (13.53)
The objective of the bit allocation procedure is to find R_k to minimize (13.53) subject to
the constraint of (13.51). If we assume that \alpha_k is a constant \alpha for all k, we can set up the
minimization problem in terms of Lagrange multipliers as

J = \alpha \sum_{k=1}^{M} 2^{-2R_k} \sigma^2_{\theta_k} - \lambda \left( R - \frac{1}{M} \sum_{k=1}^{M} R_k \right).   (13.54)
Taking the derivative of J with respect to R_k and setting it equal to zero, we can obtain this
expression for R_k:

R_k = \frac{1}{2} \log_2\left( 2\alpha \ln 2\, \sigma^2_{\theta_k} \right) - \frac{1}{2} \log_2 \lambda.   (13.55)
Substituting this expression for R_k in (13.51), we get a value for \lambda:

\lambda = \left[ \prod_{k=1}^{M} 2\alpha \ln 2\, \sigma^2_{\theta_k} \right]^{\frac{1}{M}} 2^{-2R}.   (13.56)
Substituting this expression for \lambda in (13.55), we finally obtain the individual bit allocations:

R_k = R + \frac{1}{2} \log_2 \frac{\sigma^2_{\theta_k}}{\left[ \prod_{k=1}^{M} \sigma^2_{\theta_k} \right]^{\frac{1}{M}}}.   (13.57)
Although these values of Rk will minimize (13.53), they are not guaranteed to be integers, or
even positive. The standard approach at this point is to set the negative Rks to zero. This will
increase the average bit rate above the constraint. Therefore, the nonzero Rks are uniformly
reduced until the average rate is equal to R.
The second algorithm that uses estimates of the variance is a recursive algorithm and
functions as follows:
1. Compute \sigma^2_{\theta_k} for each coefficient.

2. Set R_k = 0 for all k and set R_b = MR, where R_b is the total number of bits available
   for distribution.

3. Sort the variances \{\sigma^2_{\theta_k}\}. Suppose \sigma^2_{\theta_l} is the maximum.

4. Increment R_l by 1, and divide \sigma^2_{\theta_l} by 2.

5. Decrement R_b by 1. If R_b = 0, then stop; otherwise, go to 3.
If we follow this procedure, we end up allocating more bits to the coefficients with higher
variance.
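The recursive procedure is only a few lines of code. Below is a minimal Python sketch (ours; the variance values are arbitrary assumptions chosen for the illustration) that implements the five steps and returns the number of bits assigned to each coefficient.

import numpy as np

def recursive_bit_allocation(variances, R):
    """Allocate M*R bits by repeatedly giving one bit to the coefficient with
    the currently largest variance and halving that variance."""
    var = np.asarray(variances, dtype=float).copy()
    M = len(var)
    bits = np.zeros(M, dtype=int)
    Rb = M * R                                   # total bit budget
    while Rb > 0:
        l = int(np.argmax(var))                  # coefficient with maximum variance
        bits[l] += 1
        var[l] /= 2.0
        Rb -= 1
    return bits

variances = [100.0, 40.0, 10.0, 4.0, 2.0, 1.0, 0.5, 0.2]   # assumed coefficient variances
print(recursive_bit_allocation(variances, R=2))             # more bits go to the large variances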
This form of bit allocation is called zonal sampling. The reason for this name can be
seen from the example of a bit allocation map for the 8 × 8 DCT of an image shown in
Table 13.4. Notice that there is a zone of coefficients that roughly comprises the right lower
diagonal of the bit map that has been assigned zero bits. In other words, these coefficients are
to be discarded. The advantage to this approach is its simplicity. Once the bit allocation has
been obtained, every coefficient at a particular location is always quantized using the same
number of bits. The disadvantage is that, because the bit allocations are performed based
on average value, variations that occur on the local level are not reconstructed properly.
For example, consider an image of an object with sharp edges in front of a relatively plain
background. The number of pixels that occur on edges is quite small compared to the total
number of pixels. Therefore, if we allocate bits based on average variances, the coefficients
that are important for representing edges (the high-frequency coefficients) will get few or
TABLE 13.4 Bit allocation map for an 8 × 8 transform.
8 7 5 3 1 1 0 0
7 5 3 2 1 0 0 0
4 3 2 1 1 0 0 0
3 3 2 1 1 0 0 0
2 1 1 1 0 0 0 0
1 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
no bits assigned to them. This means that the reconstructed image will not contain a very
good representation of the edges.
This problem can be avoided by using a different approach to bit allocation known as
threshold coding [190, 93, 191]. In this approach, which coefficient to keep and which
to discard is not decided a priori. In the simplest form of threshold coding, we specify a
threshold value. Coefficients with magnitude below this threshold are discarded, while the
other coefficients are quantized and transmitted. The information about which coefficients
have been retained is sent to the receiver as side information. A simple approach described
by Pratt [93] is to code the first coefficient on each line regardless of the magnitude. After
this, when we encounter a coefficient with a magnitude above the threshold value, we send
two codewords: one for the quantized value of the coefficient, and one for the count of the
number of coefficients since the last coefficient with magnitude greater than the threshold.
For the two-dimensional case, the block size is usually small, and each “line” of the transform
is very short. Thus, this approach would be quite expensive. Chen and Pratt [191] suggest
scanning the block of transformed coefficients in a zigzag fashion, as shown in Figure 13.7.
If we scan an 8×8 block of quantized transform coefficients in this manner, we will find
that in general a large section of the tail end of the scan will consist of zeros. This is because
FIGURE 13.7 The zigzag scanning pattern for an 8 × 8 transform.
generally the higher-order coefficients have smaller amplitude. This is reflected in the bit
allocation table shown in Table 13.4. As we shall see later, if we use midtread quantizers
(quantizers with a zero output level), combined with the fact that the step sizes for the
higher-order coefficients are generally chosen to be quite large, this means that many of
these coefficients will be quantized to zero. Therefore, there is a high probability that after
a few coefficients along the zigzag scan, all coefficients will be zero. In this situation, Chen
and Pratt suggest the transmission of a special end-of-block (EOB) symbol. Upon reception
of the EOB signal, the receiver would automatically set all remaining coefficients along the
zigzag scan to zero.
The algorithm developed by the Joint Photographic Experts Group (JPEG), described in
the next section, uses a rather clever variation of this approach.
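The zigzag scan of Figure 13.7 and the end-of-block idea are easy to express in code. The following Python sketch (ours, for illustration) generates the scan order for an 8 x 8 block and shows how a block of quantized AC labels collapses to a short list of (zero-run, value) pairs terminated by an EOB marker.

import numpy as np

def zigzag_order(N=8):
    """Scan positions (i, j) of an N x N block in zigzag order."""
    order = []
    for s in range(2 * N - 1):                 # s = i + j indexes the diagonals
        ij = range(max(0, s - N + 1), min(N, s + 1))
        rows = ij if s % 2 == 1 else reversed(list(ij))
        order.extend((i, s - i) for i in rows)
    return order

def run_level_with_eob(labels):
    """(zero-run, value) pairs of the AC labels along the zigzag scan, ending with 'EOB'."""
    scanned = [labels[i, j] for i, j in zigzag_order(labels.shape[0])][1:]  # skip the DC label
    while scanned and scanned[-1] == 0:        # trailing zeros are covered by EOB
        scanned.pop()
    pairs, run = [], 0
    for v in scanned:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v)); run = 0
    return pairs + ["EOB"]

labels = np.zeros((8, 8), dtype=int)
labels[0, 0], labels[0, 1], labels[1, 0], labels[2, 0] = 2, 1, -9, 3   # cf. Table 13.8
print(run_level_with_eob(labels))              # [(0, 1), (0, -9), (0, 3), 'EOB']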
13.6 Application to Image Compression—JPEG
The JPEG standard is one of the most widely known standards for lossy image compression.
It is a result of the collaboration of the International Standards Organization (ISO), which
is a private organization, and what was the CCITT (now ITU-T), a part of the United
Nations. The approach recommended by JPEG is a transform coding approach using the
DCT. The approach is a modification of the scheme proposed by Chen and Pratt [191]. In
this section we will briefly describe the baseline JPEG algorithm. In order to illustrate the
various components of the algorithm, we will use an 8×8 block of the Sena image, shown
in Table 13.5. For more details, see [10].
13.6.1 The Transform
The transform used in the JPEG scheme is the DCT transform described earlier. The input
image is first "level shifted" by 2^{P-1}; that is, we subtract 2^{P-1} from each pixel value, where
P is the number of bits used to represent each pixel. Thus, if we are dealing with 8-bit images
whose pixels take on values between 0 and 255, we would subtract 128 from each pixel so
that the value of the pixel varies between −128 and 127. The image is divided into blocks
of size 8×8, which are then transformed using an 8×8 forward DCT. If any dimension of
the image is not a multiple of eight, the encoder replicates the last column or row until the
TABLE 13.5 An 8 × 8 block from the Sena image.
124 125 122 120 122 119 117 118
121 121 120 119 119 120 120 118
126 124 123 122 121 121 120 120
124 124 125 125 126 125 124 124
127 127 128 129 130 128 127 125
143 142 143 142 140 139 139 139
150 148 152 152 152 152 150 151
156 159 158 155 158 158 157 156
TABLE 13.6 The DCT coefficients corresponding to the block of data from the Sena image after level shift.

   39.88     6.56   -2.24    1.22   -0.37   -1.08    0.79    1.13
 -102.43     4.56    2.26    1.12    0.35   -0.63   -1.05   -0.48
   37.77     1.31    1.77    0.25   -1.50   -2.21   -0.10    0.23
   -5.67     2.24   -1.32   -0.81    1.41    0.22   -0.13    0.17
   -3.37    -0.74   -1.75    0.77   -0.62   -2.65   -1.30    0.76
    5.98    -0.13   -0.45   -0.77    1.99   -0.26    1.46    0.00
    3.97     5.52    2.39   -0.55   -0.05   -0.84   -0.52   -0.13
   -3.43     0.51   -1.07    0.87    0.96    0.09    0.33    0.01
final size is a multiple of eight. These additional rows or columns are removed during the
decoding process. If we take the 8 × 8 block of pixels shown in Table 13.5, subtract 128
from it, and take the DCT of this level-shifted block, we obtain the DCT coefficients shown
in Table 13.6. Notice that the lower-frequency coefficients in the top-left corner of the table
have larger values than the higher-frequency coefficients. This is generally the case, except
for situations in which there is substantial activity in the image block.
13.6.2 Quantization
The JPEG algorithm uses uniform midtread quantization to quantize the various coefficients.
The quantizer step sizes are organized in a table called the quantization table and can be
viewed as the fixed part of the quantization. An example of a quantization table from the
JPEG recommendation [10] is shown in Table 13.7. Each quantized value is represented by
a label. The label corresponding to the quantized value of the transform coefficient \theta_{ij} is
obtained as

l_{ij} = \left\lfloor \frac{\theta_{ij}}{Q_{ij}} + 0.5 \right\rfloor   (13.58)
TABLE 13.7 Sample quantization table.
16 11 10 16 24 40 51 61
12 12 14 19 26 58 60 55
14 13 16 24 40 57 69 56
14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99
TABLE 13.8 The quantizer labels obtained by using the quantization table on the coefficients.
2 1 0 0 0 0 0 0
−9 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
where Q_{ij} is the (i, j)th element of the quantization table, and \lfloor x \rfloor is the largest integer
smaller than or equal to x. Consider the \theta_{00} coefficient from Table 13.6. The value of \theta_{00} is 39.88.
From Table 13.7, Q_{00} is 16. Therefore,

l_{00} = \left\lfloor \frac{39.88}{16} + 0.5 \right\rfloor = \lfloor 2.9925 \rfloor = 2.   (13.59)

The reconstructed value is obtained from the label by multiplying the label with the
corresponding entry in the quantization table. Therefore, the reconstructed value of \theta_{00} would
be l_{00} \times Q_{00}, which is 2 \times 16 = 32. The quantization error in this case is 32 - 39.88 = -7.88.
Similarly, from Tables 13.6 and 13.7, \theta_{01} is 6.56 and Q_{01} is 11. Therefore,

l_{01} = \left\lfloor \frac{6.56}{11} + 0.5 \right\rfloor = \lfloor 1.096 \rfloor = 1.   (13.60)

The reconstructed value is 11, and the quantization error is 11 - 6.56 = 4.44. Continuing in
this fashion, we obtain the labels shown in Table 13.8.
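The labeling rule (13.58) and the dequantization step can be written down directly. The short Python sketch below (ours, using the worked values above and the corresponding entries of Table 13.7) reproduces the labels, the reconstructed coefficients, and the quantization errors.

import numpy as np

theta = np.array([39.88, 6.56, -102.43])         # theta_00, theta_01, theta_10 from Table 13.6
Q = np.array([16.0, 11.0, 12.0])                 # Q_00, Q_01, Q_10 from Table 13.7

labels = np.floor(theta / Q + 0.5).astype(int)   # Equation (13.58)
reconstructed = labels * Q                       # dequantization
print(labels)                                    # [  2   1  -9]
print(reconstructed - theta)                     # quantization errors, e.g., -7.88 and 4.44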
From the sample quantization table shown in Table 13.7, we can see that the step size
generally increases as we move from the DC coefficient to the higher-order coefficients.
Because the quantization error is an increasing function of the step size, more quantization
error will be introduced in the higher-frequency coefficients than in the lower-frequency
coefficients. The decision on the relative size of the step sizes is based on how errors in
these coefficients will be perceived by the human visual system. Different coefficients in the
transform have widely different perceptual importance. Quantization errors in the DC and
lower AC coefficients are more easily detectable than the quantization error in the higher AC
coefficients. Therefore, we use larger step sizes for perceptually less important coefficients.
Because the quantizers are all midtread quantizers (that is, they all have a zero out-
put level), the quantization process also functions as the thresholding operation. All coeffi-
cients with magnitudes less than half the corresponding step size will be set to zero. Because
the step sizes at the tail end of the zigzag scan are larger, the probability of finding a long
run of zeros increases at the end of the scan. This is the case for the 8 ×8 block of labels
shown in Table 13.8. The entire run of zeros at the tail end of the scan can be coded with
an EOB code after the last nonzero label, resulting in substantial compression.
Furthermore, this effect also provides us with a method to vary the rate. By making the
step sizes larger, we can reduce the number of nonzero values that need to be transmitted,
which translates to a reduction in the number of bits that need to be transmitted.
13.6.3 Coding
Chen and Pratt [191] used separate Huffman codes for encoding the label for each coefficient
and the number of coefficients since the last nonzero label. The JPEG approach is somewhat
more complex but results in higher compression. In the JPEG approach, the labels for the
DC and AC coefficients are coded differently.
From Figure 13.4 we can see that the basis matrix corresponding to the DC coefficient
is a constant matrix. Thus, the DC coefficient is some multiple of the average value in
the 8 × 8 block. The average pixel value in any 8 × 8 block will not differ substantially
from the average value in the neighboring 8×8 block; therefore, the DC coefficient values
will be quite close. Given that the labels are obtained by dividing the coefficients with the
corresponding entry in the quantization table, the labels corresponding to these coefficients
will be closer still. Therefore, it makes sense to encode the differences between neighboring
labels rather than to encode the labels themselves.
Depending on the number of bits used to encode the pixel values, the number of values
that the labels, and hence the differences, can take on may become quite large. A Huffman
code for such a large alphabet would be quite unmanageable. The JPEG recommendation
resolves this problem by partitioning the possible values that the differences can take on into
categories. The size of these categories grows as a power of two. Thus, category 0 has only
one member (0), category 1 has two members (−1 and 1), category 2 has four members
(−3, −2, 2, 3), and so on. The category numbers are then Huffman coded. The number of
codewords in the Huffman code is equal to the base two logarithm of the number of possible
values that the label differences can take on. If the differences can take on 4096 possible
values, the size of the Huffman code is log2 4096 = 12. The elements within each category
are specified by tacking on extra bits to the end of the Huffman code for that category. As
the categories are different sizes, we need a differing number of bits to identify the value
in each category. For example, because category 0 contains only one element, we need no
additional bits to specify the value. Category 1 contains two elements, so we need 1 bit
tacked on to the end of the Huffman code for category 1 to specify the particular element
in that category. Similarly, we need 2 bits to specify the element in category 2, 3 bits for
category 3, and n bits for category n.
The categories and the corresponding difference values are shown in Table 13.9. For
example, if the difference between two labels was 6, we would send the Huffman code
for category 3. As category 3 contains the eight values (-7, -6, -5, -4, 4, 5, 6, 7), the
Huffman code for category 3 would be followed by 3 bits that would specify which of the
eight values in category 3 was being transmitted.
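The mapping from a difference value to its category and to its position within that category follows the pattern of Table 13.9: the category is the number of bits in the magnitude of the difference, and within category C the 2^C values run from -(2^C - 1) up to -2^{C-1} followed by 2^{C-1} up to 2^C - 1. A small Python sketch (ours, not part of the standard text) illustrates this.

def category(d):
    """Category of a DC-label difference d, per Table 13.9 (0 for d = 0)."""
    return abs(d).bit_length()

def index_in_category(d):
    """Position of d in the ordering of Table 13.9; needs category(d) extra bits."""
    c = category(d)
    return d if d > 0 else d + (1 << c) - 1

for d in (0, 1, -1, 3, 6, -6, 255):
    print(d, category(d), index_in_category(d) if d else None)
# 6 falls in category 3 and is the seventh of its eight values (index 6),
# so it is sent as the Huffman code for category 3 followed by 3 extra bits.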
The binary code for the AC coefficients is generated in a slightly different manner. The
category C that a nonzero label falls in and the number of zero-valued labels Z since the
last nonzero label form a pointer to a specific Huffman code as shown in Table 13.10. Thus,
if the label being encoded falls in category 3, and there have been 15 zero-valued labels
prior to this nonzero label in the zigzag scan, then we form the pointer F/3, which points
TABLE 13.9 Coding of the differences of the DC labels.

Category   Difference values
 0         0
 1         -1, 1
 2         -3, -2, 2, 3
 3         -7, ..., -4, 4, ..., 7
 4         -15, ..., -8, 8, ..., 15
 5         -31, ..., -16, 16, ..., 31
 6         -63, ..., -32, 32, ..., 63
 7         -127, ..., -64, 64, ..., 127
 8         -255, ..., -128, 128, ..., 255
 9         -511, ..., -256, 256, ..., 511
10         -1,023, ..., -512, 512, ..., 1,023
11         -2,047, ..., -1,024, 1,024, ..., 2,047
12         -4,095, ..., -2,048, 2,048, ..., 4,095
13         -8,191, ..., -4,096, 4,096, ..., 8,191
14         -16,383, ..., -8,192, 8,192, ..., 16,383
15         -32,767, ..., -16,384, 16,384, ..., 32,767
16         32,768
TABLE 13.10 Sample table for obtaining the Huffman code for a given label value and run length. The values of Z are represented in hexadecimal.

Z/C         Codeword       Z/C    Codeword        ...   Z/C         Codeword
0/0 (EOB)   1010                                  ...   F/0 (ZRL)   11111111001
0/1         00             1/1    1100            ...   F/1         1111111111110101
0/2         01             1/2    11011           ...   F/2         1111111111110110
0/3         100            1/3    1111001         ...   F/3         1111111111110111
0/4         1011           1/4    111110110       ...   F/4         1111111111111000
0/5         11010          1/5    11111110110     ...   F/5         1111111111111001
...
to the codeword 1111111111110111. Because the label falls in category 3, we follow this
codeword with 3 bits that indicate which of the eight possible values in category 3 is the
value that the label takes on.
There are two special codes shown in Table 13.10. The first is for the end-of-block
(EOB). This is used in the same way as in the Chen and Pratt [191] algorithm; that is, if
a particular label value is the last nonzero value along the zigzag scan, the code for it is
immediately followed by the EOB code. The other code is the ZRL code, which is used
when the number of consecutive zero values along the zigzag scan exceeds 15.
To see how all of this fits together, let’s encode the labels in Table 13.8. The label
corresponding to the DC coefficient is coded by first taking the difference between the
value of the quantized label in this block and the quantized label in the previous block. If
we assume that the corresponding label in the previous block was −1, then the difference
would be 3. From Table 13.9 we can see that this value falls in category 2. Therefore, we
13.6 Application to Image Compression-
—JPEG 415
would send the Huffman code for category 2 followed by the 2-bit sequence 11 to indicate
that the value in category 2 being encoded was 3, and not −3 −2, or 2. To encode the AC
coefficients, we first order them using the zigzag scan. We obtain the sequence
1 −9 3 0 0 0···0
The first value, 1, belongs to category 1. Because there are no zeros preceding it, we transmit
the Huffman code corresponding to 0/1, which from Table 13.10 is 00. We then follow this
by a single bit 1 to indicate that the value being transmitted is 1 and not −1. Similarly,
−9 is the seventh element in category 4. Therefore, we send the binary string 1011, which
is the Huffman code for 0/4, followed by 0110 to indicate that −9 is the seventh element
in category 4. The next label is 3, which belongs to category 2, so we send the Huffman
code 01 corresponding to 0/2, followed by the 2 bits 11. All the labels after this point are
0, so we send the EOB Huffman code, which in this case is 1010. If we assume that the
Huffman code for the DC coefficient was 2 bits long, we have sent a grand total of 21 bits
to represent this 8 × 8 block. This translates to an average of 21/64 bits per pixel.
To obtain a reconstruction of the original block, we perform the dequantization, which
simply consists of multiplying the labels in Table 13.8 with the corresponding values in
Table 13.7. Taking the inverse transform of the quantized coefficients shown in Table 13.11
and adding 128, we get the reconstructed block shown in Table 13.12. We can see that in
spite of going from 8 bits per pixel to 21/64 bits per pixel, the reproduction is remarkably close
to the original.
TABLE 13.11 The quantized values of the coefficients.
32 11 0 0 0 0 0 0
−108 0 0 0 0 0 0 0
42 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
TABLE 13.12 The reconstructed block.
123 122 122 121 120 120 119 119
121 121 121 120 119 118 118 118
121 121 120 119 119 118 117 117
124 124 123 122 122 121 120 120
130 130 129 129 128 128 128 127
141 141 140 140 139 138 138 137
152 152 151 151 150 149 149 148
159 159 158 157 157 156 155 155
FIGURE 13.8 Sinan image coded at 0.5 bits per pixel using the JPEG algorithm.
If we wanted an even more accurate reproduction, we could do so at the cost of increased
bit rate by multiplying the step sizes in the quantization table by one-half and using these
values as the new step sizes. Using the same assumptions as before, we can show that this
will result in an increase in the number of bits transmitted. We can go in the other direction
by multiplying the step sizes with a number greater than one. This will result in a reduction
in bit rate at the cost of increased distortion.
Finally, we present some examples of JPEG-coded images in Figures 13.8 and 13.9.
These were coded using shareware generated by the Independent JPEG Group (orga-
nizer, Dr. Thomas G. Lane). Notice the high degree of “blockiness” in the lower-rate image
(Figure 13.8). This is a standard problem of most block-based techniques, and specifically
of the transform coding approach. A number of solutions have been suggested for removing
this blockiness, including postfiltering at the block edges as well as transforms that overlap
the block boundaries. Each approach has its own drawbacks. The filtering approaches tend
to reduce the resolution of the reconstructions, while the overlapped approaches increase the
complexity. One particular overlapped approach that is widely used in audio compression is
the modified DCT (MDCT), which is described in the next section.
13 . 7 Application to Audio Compression—The MDCT
As mentioned in the previous section, the use of the block based transform has the unfortunate
effect of causing distortion at the block boundaries at low rates. A number of techniques
that use overlapping blocks have been developed over the years [192]. One that has gained
F I G U R E 13. 9 Sinan image coded at 0.25 bits per pixel using the JPEG algorithm.
wide acceptance in audio compression is a transform based on the discrete cosine transform
called the modified discrete cosine transform (MDCT). It is used in almost all popular audio
coding standards from mp3 and AAC to Ogg Vorbis.
The MDCT used in these algorithms uses 50% overlap. That is, each block overlaps half
of the previous block and half of the next block of data. Consequently, each audio sample
is part of two blocks. If we were to keep all the frequency coefficients we would end up
with twice as many coefficients as samples. Reducing the number of frequency coefficients
results in the introduction of distortion in the inverse transform. The distortion is referred
to as time domain aliasing [193]. The reason for the name is evident if we consider that the
distortion is being introduced by subsampling in the frequency domain. Recall that sampling
at less than the Nyquist frequency in the time domain leads to an overlap of replicas of the
frequency spectrum, or frequency aliasing. The lapped transforms are successful because
they are constructed in such a way that while the inverse transform of each block results in
time-domain aliasing, the aliasing in consecutive blocks cancel each other out.
F I G U R E 13. 10 Source output sequence (overlapping blocks i − 1, i, i + 1, i + 2, with subblocks p, q, and r).
Consider the scenario shown in Figure 13.10. Let’s look at the coding for block i and
block i +1. The inverse transform of the coefficients resulting from both these blocks will
result in the audio samples in the subblock q. We assume that the blocksize is N and
therefore the subblock size is N/2. The forward transform can be represented by an N/2×N
matrix P. Let us partition the matrix into two N/2×N/2 blocks, A and B. Thus
P = \begin{bmatrix} A & B \end{bmatrix}
Let x_i = \begin{bmatrix} p \\ q \end{bmatrix}; then the forward transform Px_i can be written in terms of the subblocks as
X_i = \begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} p \\ q \end{bmatrix}
The inverse transform matrix Q can be represented by an N × N/2 matrix, which can be partitioned into two N/2 × N/2 blocks, C and D:
Q = \begin{bmatrix} C \\ D \end{bmatrix}
Applying the inverse transform, we get the reconstruction values x̂:
x̂_i = QX_i = QPx_i = \begin{bmatrix} C \\ D \end{bmatrix} \begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} p \\ q \end{bmatrix} = \begin{bmatrix} CAp + CBq \\ DAp + DBq \end{bmatrix}
Repeating the process for block i + 1 we get
x̂_{i+1} = QX_{i+1} = QPx_{i+1} = \begin{bmatrix} C \\ D \end{bmatrix} \begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} q \\ r \end{bmatrix} = \begin{bmatrix} CAq + CBr \\ DAq + DBr \end{bmatrix}
To cancel out the aliasing in the second half of the block we need
CAq +CBr +DAp+DBq = q
From this we can get the requirements on the transform
CB = 0 (13.61)
DA = 0 (13.62)
CA+DB = I (13.63)
Note that the same requirements will help cancel the aliasing in the first half of block i by
using the second half of the inverse transform of block i−1. One selection that satisfies the
last condition is
CA = \frac{1}{2}(I - J) (13.64)
DB = \frac{1}{2}(I + J) (13.65)
The forward modified discrete transform is given by the following equation:
X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{2\pi}{N}\left(k + \frac{1}{2}\right)\left(n + \frac{1}{2} + \frac{N}{4}\right)\right] (13.66)
where xn are the audio samples and Xk are the frequency coefficients. The inverse MDCT is given by
y_n = \frac{2}{N} \sum_{k=0}^{\frac{N}{2}-1} X_k \cos\left[\frac{2\pi}{N}\left(k + \frac{1}{2}\right)\left(n + \frac{1}{2} + \frac{N}{4}\right)\right] (13.67)
or in terms of our matrix notation,
P_{ij} = \cos\left[\frac{2\pi}{N}\left(i + \frac{1}{2}\right)\left(j + \frac{1}{2} + \frac{N}{4}\right)\right] (13.68)
Q_{ij} = \frac{2}{N} \cos\left[\frac{2\pi}{N}\left(i + \frac{1}{2}\right)\left(j + \frac{1}{2} + \frac{N}{4}\right)\right] (13.69)
It is easy to verify that, given a value of N, these matrices satisfy the conditions for alias
cancellation.
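Readers who want to check this numerically can do so with a few lines of numpy. The sketch below builds P from Equation (13.68) and takes Q = (2/N)Pᵀ, which is what Equation (13.67) gives; the block size N = 8 is an arbitrary choice for the test.

    import numpy as np

    N = 8                                   # block size; the MDCT produces N/2 coefficients
    n = np.arange(N)
    k = np.arange(N // 2)

    # Forward transform matrix, Equation (13.68)
    P = np.cos(2 * np.pi / N * (k[:, None] + 0.5) * (n[None, :] + 0.5 + N / 4))
    Q = (2.0 / N) * P.T                     # inverse transform matrix from Equation (13.67)

    R = Q @ P
    half = N // 2
    CA, CB = R[:half, :half], R[:half, half:]
    DA, DB = R[half:, :half], R[half:, half:]

    I = np.eye(half)
    J = np.fliplr(I)                        # the counter-identity matrix
    print(np.allclose(CB, 0), np.allclose(DA, 0))       # Equations (13.61) and (13.62)
    print(np.allclose(CA + DB, I))                      # Equation (13.63)
    print(np.allclose(CA, 0.5 * (I - J)), np.allclose(DB, 0.5 * (I + J)))

All of the printed checks should come out true, confirming the alias cancellation conditions.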
Thus, while the inverse transform for any one block will contain aliasing, by using the
inverse transform of neighboring blocks the aliasing can be canceled. What about blocks
that do not have neighbors—that is, the first and last blocks? One way to resolve this
problem is to pad the sampled audio sequence with N/2 zeros at the beginning and end
of the sequence. In practice, this is not necessary, because the data to be transformed is
windowed prior to the transform. For the first and last blocks we use a special window that
has the same effect as introducing zeros. For information on the design of windows for the
MDCT, see [194]. For more on how the MDCT is used in audio compression techniques, see
Chapter 16.
13 . 8 Summary
In this chapter we have described the concept of transform coding and provided some of the
details needed for the investigation of this compression scheme. The basic encoding scheme
works as follows:
• Divide the source output into blocks. In the case of speech or audio data, they will be
one-dimensional blocks. In the case of images, they will be two-dimensional blocks.
In image coding, a typical block size is 8×8. In audio coding the blocks are generally
overlapped by 50%.
• Take the transform of this block. In the case of one-dimensional data, this involves
pre-multiplying the N vector of source output samples by the transform matrix. In the
case of image data, for the transforms we have looked at, this involves pre-multiplying
the N × N block by the transform matrix and post-multiplying the result with the
transpose of the transform matrix. Fast algorithms exist for performing the transforms
described in this chapter (see [195]).
• Quantize the coefficients. Various techniques exist for the quantization of these coef-
ficients. We have described the approach used by JPEG. In Chapter 16 we describe
the quantization techniques used in various audio coding algorithms.
• Encode the quantized value. The quantized value can be encoded using a fixed-length
code or any of the different variable-length codes described in earlier chapters. We
have described the approach taken by JPEG.
The decoding scheme is the inverse of the encoding scheme for image compression. For the
overlapped transform used in audio coding the decoder adds the overlapped portions of the
inverse transform to cancel aliasing.
The basic approach can be modified depending on the particular characteristics of the
data. We have described some of the modifications used by various commercial algorithms
for transform coding of audio signals.
F u r t h e r R e a d i n g
1. For detailed information about the JPEG standard, JPEG Still Image Data Compression
Standard, by W.B. Pennebaker and J.L. Mitchell [10], is an invaluable reference. This
book also contains the entire text of the official draft JPEG recommendation, ISO DIS
10918-1 and ISO DIS 10918-2.
2. For a detailed discussion of the MDCT and how it is used in audio coding, an
excellent source is Introduction to Digital Audio Coding Standards, by M. Bosi and
R.E. Goldberg [194].
3. Chapter 12 in Digital Coding of Waveforms, by N.S. Jayant and P. Noll [123], provides
a more mathematical treatment of the subject of transform coding.
4. A good source for information about transforms is Fundamentals of Digital Image
Processing, by A.K. Jain [196]. Another one is Digital Image Processing, by R.C.
Gonzales and R.E. Wood [96]. This book has an especially nice discussion of the
Hotelling transform.
5. The bit allocation problem and its solutions are described in Vector Quantization and
Signal Compression, by A. Gersho and R.M. Gray [5].
6. A very readable description of transform coding of images is presented in Digital
Image Compression Techniques, by M. Rabbani and P.W. Jones [80].
7. The Data Compression Book, by M. Nelson and J.-L. Gailly [60], provides a very
readable discussion of the JPEG algorithm.
13 . 9 Projects and Problems
1. A square matrix A has the property that A^T A = AA^T = I, where I is the identity
matrix. If X_1 and X_2 are two N-dimensional vectors and
\Theta_1 = AX_1
\Theta_2 = AX_2
then show that
\| X_1 - X_2 \|^2 = \| \Theta_1 - \Theta_2 \|^2 (13.70)
2. Consider the following sequence of values:
10 11 12 11 12 13 12 11
10 −10 8 −7 8 −8 7 −7
(a) Transform each row separately using an eight-point DCT. Plot the resulting 16
transform coefficients.
(b) Combine all 16 numbers into a single vector and transform it using a 16-point
DCT. Plot the 16 transform coefficients.
(c) Compare the results of (a) and (b). For this particular case would you suggest a
block size of 8 or 16 for greater compression? Justify your answer.
3. Consider the following “image”:
4 3 2 1
3 2 1 1
2 1 1 1
1 1 1 1
(a) Obtain the two-dimensional DWHT transform by first taking the one-dimensional
transform of the rows, then taking the column-by-column transform of the result-
ing matrix.
(b) Obtain the two-dimensional DWHT transform by first taking the one-dimensional
transform of the columns, then taking the row-by-row transform of the resulting
matrix.
(c) Compare and comment on the results of (a) and (b).
4. (This problem was suggested by P.F. Swaszek.) Let us compare the energy compaction
properties of the DCT and the DWHT transforms.
(a) For the Sena image, compute the mean squared value of each of the 64 coeffi-
cients using the DCT. Plot these values.
(b) For the Sena image, compute the mean squared value of each of the 64 coeffi-
cients using the DWHT. Plot these values.
(c) Compare the results of (a) and (b). Which transform provides more energy
compaction? Justify your answer.
5. Implement the transform and quantization portions of the JPEG standard. For coding
the labels use an arithmetic coder instead of the modified Huffman code described in
this chapter.
(a) Encode the Sena image using this transform coder at rates of (approximately)
0.25, 0.5, and 0.75 bits per pixel. Compute the mean squared error at each rate
and plot the rate versus the mse.
(b) Repeat part (a) using one of the public domain implementations of JPEG.
(c) Compare the plots obtained using the two coders and comment on the relative
performance of the coders.
6. One of the extensions to the JPEG standard allows for the use of multiple quantization
matrices. Investigate the issues involved in designing a set of quantization matrices.
Should the quantization matrices be similar or dissimilar? How would you measure
their similarity? Given a particular block, do you need to quantize it with each
quantization matrix to select the best? Or is there a computationally more efficient
approach? Describe your findings in a report.
14
Subband Coding
14 . 1 Overview
In this chapter we present the second of three approaches to compression in
which the source output is decomposed into constituent parts. Each constituent
part is encoded using one or more of the methods that have been described
previously. The approach described in this chapter, known as subband coding,
relies on separating the source output into different bands of frequencies using
digital filters. We provide a general description of the subband coding system and, for those
readers with some knowledge of Z-transforms, a more mathematical analysis of the system.
The sections containing the mathematical analysis are not essential to understanding the
rest of the chapter and are marked with a ⋆. If you are not interested in the mathematical
analysis, you should skip these sections. This is followed by a description of a popular
approach to bit allocation. We conclude the chapter with applications to audio and image
compression.
14 . 2 Introduction
In previous chapters we looked at a number of different compression schemes. Each of these
schemes is most efficient when the data have certain characteristics. A vector quantization
scheme is most effective if blocks of the source output show a high degree of clustering.
A differential encoding scheme is most effective when the sample-to-sample difference is
small. If the source output is truly random, it is best to use scalar quantization or lattice vector
quantization. Thus, if a source exhibited certain well-defined characteristics, we could choose
a compression scheme most suited to that characteristic. Unfortunately, most source outputs
exhibit a combination of characteristics, which makes it difficult to select a compression
scheme exactly suited to the source output.
In the last chapter we looked at techniques for decomposing the source output into
different frequency bands using block transforms. The transform coefficients had differing
statistics and differing perceptual importance. We made use of these differences in allocating
bits for encoding the different coefficients. This variable bit allocation resulted in a decrease
in the average number of bits required to encode the source output. One of the drawbacks
of transform coding is the artificial division of the source output into blocks, which results
in the generation of coding artifacts at the block edges, or blocking. One approach to
avoiding this blocking is the lapped orthogonal transform (LOT) [192]. In this chapter
we look at a popular approach to decomposing the image into different frequency bands
without the imposition of an arbitrary block structure. After the input has been decomposed
into its constituents, we can use the coding technique best suited to each constituent to
improve compression performance. Furthermore, each component of the source output may
have different perceptual characteristics. For example, quantization error that is perceptually
objectionable in one component may be acceptable in a different component of the source
output. Therefore, a coarser quantizer that uses fewer bits can be used to encode the
component that is perceptually less important.
Consider the sequence xn plotted in Figure 14.1. We can see that, while there is a
significant amount of sample-to-sample variation, there is also an underlying long-term trend
shown by the dotted line that varies slowly.
One way to extract this trend is to average the sample values in a moving window.
The averaging operation smooths out the rapid variations, making the slow variations more
evident. Let’s pick a window of size two and generate a new sequence yn by averaging
neighboring values of xn:
y_n = \frac{x_n + x_{n-1}}{2} (14.1)
The consecutive values of yn will be closer to each other than the consecutive values of xn.
Therefore, the sequence yn can be coded more efficiently using differential encoding than
we could encode the sequence xn. However, we want to encode the sequence xn, not
the sequence yn. Therefore, we follow the encoding of the averaged sequence yn by the
difference sequence zn:
z_n = x_n - y_n = x_n - \frac{x_n + x_{n-1}}{2} = \frac{x_n - x_{n-1}}{2} (14.2)
F I G U R E 14. 1 A rapidly changing source output that contains a long-term
component with slow variations.
The sequences yn and zn can be coded independently of each other. This way we can
use the compression schemes that are best suited for each sequence.
E x a m p l e 14 . 2 . 1:
Suppose we want to encode the following sequence of values xn:
10 14 10 12 14 8 14 12 10 8 10 12
There is a significant amount of sample-to-sample correlation, so we might consider using
a DPCM scheme to compress this sequence. In order to get an idea of the requirements on
the quantizer in a DPCM scheme, let us take a look at the sample-to-sample differences
xn −xn−1:
10 4 −4 2 2 −6 6 −2 −2 −2 2 2
Ignoring the first value, the dynamic range of the differences is from −6 to 6. Suppose we
want to quantize these values using m bits per sample. This means we could use a quantizer with M = 2^m levels or reconstruction values. If we choose a uniform quantizer, the size of each quantization interval, Δ, is the range of possible input values divided by the total number of reconstruction values. Therefore,
Δ = \frac{12}{M}
which would give us a maximum quantization error of Δ/2, or 6/M.
Now let’s generate two new sequences yn and zn according to (14.1) and (14.2).
All three sequences are plotted in Figure 14.2. Notice that given yn and zn, we can always
recover xn:
xn = yn +zn (14.3)
Let’s try to encode each of these sequences. The sequence yn is
10 12 12 11 13 11 11 13 11 10 9 11
Notice that the yn sequence is “smoother” than the xn sequence—the sample-to-sample
variation is much smaller. This becomes evident when we look at the sample-to-sample
differences:
10 2 0 −1 2 −2 0 2 −2 −1 −1 2
The difference sequences xn − xn−1 and yn − yn−1 are plotted in Figure 14.3. Again, ignoring the first difference, the dynamic range of the differences yn − yn−1 is 4. If we take the dynamic range of these differences as a measure of the range of the quantizer, then for an M-level quantizer, the step size of the quantizer is 4/M and the maximum quantization error is 2/M. This maximum quantization error is one-third the maximum quantization error incurred when the xn sequence is quantized using an M-level quantizer. However, in order to reconstruct xn, we also need to transmit zn. The zn sequence is
0 2 −2 1 1 −3 3 −1 −1 −1 1 1
The dynamic range for zn is 6, half the dynamic range of the difference sequence for xn. (We could have inferred this directly from the definition of zn.) The sample-to-sample difference varies more than the actual values. Therefore, instead of differentially encoding this sequence, we quantize each individual sample. For an M-level quantizer, the required step size would be 6/M, giving a maximum quantization error of 3/M.
For the same number of bits per sample, we can code both yn and zn and incur less distortion. At the receiver, we add yn and zn to get the original sequence xn back. The maximum possible quantization error in the reconstructed sequence would be 5/M, which is less than the maximum error we would incur if we encoded the xn sequence directly.
F I G U R E 14. 2 Original set of samples and the two components.
F I G U R E 14. 3 Difference sequences generated from the original and averaged sequences.
Although we use the same number of bits for each value of yn and zn, the number of
elements in each of the yn and zn sequences is the same as the number of elements in
the original xn sequence. Although we are using the same number of bits per sample, we
are transmitting twice as many samples and, in effect, doubling the bit rate.
We can avoid this by sending every other value of yn and zn. Let’s divide the sequence
yn into subsequences y2n and y2n−1—that is, a subsequence containing only the odd-numbered elements y1, y3, …, and a subsequence containing only the even-numbered elements y2, y4, …. Similarly, we divide the zn sequence into subsequences z2n and z2n−1. If we transmit either the even-numbered subsequences or the odd-numbered subsequences, we would transmit only as many elements as in the original sequence. To see how we recover the sequence xn from these subsequences, suppose we only transmitted the subsequences y2n and z2n:
y_{2n} = \frac{x_{2n} + x_{2n-1}}{2}
z_{2n} = \frac{x_{2n} - x_{2n-1}}{2}
To recover the even-numbered elements of the xn sequence, we add the two subse-
quences. In order to obtain the odd-numbered members of the xn sequence, we take the
difference:
y2n +z2n = x2n (14.4)
y2n −z2n = x2n−1 (14.5)
Thus, we can recover the entire original sequence xn, sending only as many bits as required
to transmit the original sequence while incurring less distortion.
Is the last part of the previous statement still true? In our original scheme we proposed
to transmit the sequence yn by transmitting the differences yn −yn−1. As we now need to
transmit the subsequence y2n, we will be transmitting the differences y2n −y2n−2 instead.
In order for our original statement about reduction in distortion to hold, the dynamic range
of this new sequence of differences should be less than or equal to the dynamic range of the
original difference. A quick check of the yn shows us that the dynamic range of the new
differences is still 4, and our claim of incurring less distortion still holds. 
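The bookkeeping in this example is easy to reproduce in a few lines of Python. In the sketch below the sample preceding x1 is taken to be 0 (an assumption on our part; its value does not affect the even-numbered subsequences that are actually transmitted), and the reconstruction of Equations (14.4) and (14.5) is verified.

    import numpy as np

    x = np.array([10, 14, 10, 12, 14, 8, 14, 12, 10, 8, 10, 12], dtype=float)
    xp = np.concatenate(([0.0], x))        # prepend the assumed sample before x_1

    y = (xp[1:] + xp[:-1]) / 2             # Equation (14.1): the "average" sequence
    z = (xp[1:] - xp[:-1]) / 2             # Equation (14.2): the "difference" sequence

    y2, z2 = y[1::2], z[1::2]              # keep only y_2, y_4, ... and z_2, z_4, ...

    x_rec = np.empty_like(x)
    x_rec[1::2] = y2 + z2                  # Equation (14.4): recovers x_2, x_4, ...
    x_rec[0::2] = y2 - z2                  # Equation (14.5): recovers x_1, x_3, ...
    print(np.array_equal(x_rec, x))        # True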
There are several things we can see from this example. First, the number of different
values that we transmit is the same, whether we send the original sequence xn or the
two subsequences yn and zn. Decomposing the xn sequence into subsequences did
not result in any increase in the number of values that we need to transmit. Second, the
two subsequences had distinctly different characteristics, which led to our use of different
techniques to encode the different sequences. If we had not split the xn sequence, we would
have been using essentially the same approach to compress both subsequences. Finally,
we could have used the same decomposition approach to decompose the two constituent
sequences, which then could be decomposed further still.
While this example was specific to a particular set of values, we can see that decomposing
a signal can lead to different ways of looking at the problem of compression. This added
flexibility can lead to improved compression performance.
Before we leave this example let us formalize the process of decomposing or analysis,
and recomposing or synthesis. In our example, we decomposed the input sequence xn into
two subsequences yn and zn by the operations
y_n = \frac{x_n + x_{n-1}}{2} (14.6)
z_n = \frac{x_n - x_{n-1}}{2} (14.7)
We can implement these operations using discrete time filters. We briefly considered discrete
time filters in Chapter 12. We take a slightly more detailed look at filters in the next section.
14 . 3 Filters
A system that isolates certain frequency components is called a filter. The analogy here
with mechanical filters such as coffee filters is obvious. A coffee filter or a filter in a water
purification system blocks coarse particles and allows only the finer-grained components
of the input to pass through. The analogy is not complete, however, because mechanical
filters always block the coarser components of the input, while the filters we are discussing
can selectively let through or block any range of frequencies. Filters that only let through
components below a certain frequency f0 are called low-pass filters; filters that block all
frequency components below a certain value f0 are called high-pass filters. The frequency
f0 is called the cutoff frequency. Filters that let through components that have frequency
content above some frequency f1 but below frequency f2 are called band-pass filters.
One way to characterize filters is by their magnitude transfer function—the ratio of the
magnitude of the input and output of the filter as a function of frequency. In Figure 14.4
we show the magnitude transfer function for an ideal low-pass filter and a more realistic
low-pass filter, both with a cutoff frequency of f0. In the ideal case, all components of
the input signal with frequencies below f0 are unaffected except for a constant amount of
F I G U R E 14. 4 Ideal and realistic low-pass filter characteristics (magnitude versus frequency, cutoff frequency f0).
amplification. All frequencies above f0 are blocked. In other words, the cutoff is sharp. In
the case of the more realistic filter, the cutoff is more gradual. Also, the amplification for the
components with frequency less than f0 is not constant, and components with frequencies
above f0 are not totally blocked. This phenomenon is referred to as ripple in the passband
and stopband.
The filters we will discuss are digital filters, which operate on a sequence of numbers
that are usually samples of a continuously varying signal. We have discussed sampling
in Chapter 12. For those of you who skipped that chapter, let us take a brief look at the
sampling operation.
How often does a signal have to be sampled in order to reconstruct the signal from the
samples? If one signal changes more rapidly than another, it is reasonable to assume that we
would need to sample the more rapidly varying signal more often than the slowly varying
signal in order to achieve an accurate representation. In fact, it can be shown mathematically
that if the highest frequency component of a signal is f0, then we need to sample the signal
at more than 2f0 times per second. This result is known as the Nyquist theorem or Nyquist
rule after Harry Nyquist, a famous mathematician from Bell Laboratories. His pioneering
work laid the groundwork for much of digital communication. The Nyquist rule can also
be extended to signals that only have frequency components between two frequencies f1
and f2. If f1 and f2 satisfy certain criteria, then we can show that in order to recover the
signal exactly, we need to sample the signal at a rate of at least 2(f2 − f1) samples per
second [123].
What would happen if we violated the Nyquist rule and sampled at less than twice
the highest frequency? In Chapter 12 we showed that it would be impossible to recover
the original signal from the sample. Components with frequencies higher than half the
sampling rate show up at lower frequencies. This process is called aliasing. In order to
prevent aliasing, most systems that require sampling will contain an “anti-aliasing filter” that
restricts the input to the sampler to be less than half the sampling frequency. If the signal
contains components at more than half the sampling frequency, we will introduce distortion
by filtering out these components. However, the distortion due to aliasing is generally more
severe than the distortion we introduce due to filtering.
Digital filtering involves taking a weighted sum of current and past inputs to the filter
and, in some cases, the past outputs of the filter. The general form of the input-output
relationships of the filter is given by
y_n = \sum_{i=0}^{N} a_i x_{n-i} + \sum_{i=1}^{M} b_i y_{n-i} (14.8)
where the sequence xn is the input to the filter, the sequence yn is the output from the
filter, and the values ai and bi are called the filter coefficients.
If the input sequence is a single 1 followed by all 0s, the output sequence is called the
impulse response of the filter. Notice that if the bi are all 0, then the impulse response will
die out after N samples. These filters are called finite impulse response (FIR) filters. The
number N is sometimes called the number of taps in the filter. If any of the bi have nonzero
values, the impulse response can, in theory, continue forever. Filters with nonzero values
for some of the bi are called infinite impulse response (IIR) filters.
E x a m p l e 14 . 3 . 1:
Suppose we have a filter with a0 = 1.25 and a1 = 0.5. If the input sequence xn is given by
x_n = \begin{cases} 1 & n = 0 \\ 0 & n \neq 0 \end{cases} (14.9)
then the output is given by
y_0 = a_0 x_0 + a_1 x_{-1} = 1.25
y_1 = a_0 x_1 + a_1 x_0 = 0.5
y_n = 0 \quad n < 0 \text{ or } n > 1
This output is called the impulse response of the filter. The impulse response sequence is usually represented by hn. Therefore, for this filter we would say that
h_n = \begin{cases} 1.25 & n = 0 \\ 0.5 & n = 1 \\ 0 & \text{otherwise.} \end{cases} (14.10)
Notice that if we know the impulse response we also know the values of ai. Knowledge
of the impulse response completely specifies the filter. Furthermore, because the impulse
response goes to zero after a finite number of samples (two in this case), the filter is an FIR
filter.
The filters we used in Example 14.2.1 are both two-tap FIR filters with impulse responses
h_n = \begin{cases} \frac{1}{2} & n = 0 \\ \frac{1}{2} & n = 1 \\ 0 & \text{otherwise} \end{cases} (14.11)
for the “averaging” or low-pass filter, and
h_n = \begin{cases} \frac{1}{2} & n = 0 \\ -\frac{1}{2} & n = 1 \\ 0 & \text{otherwise} \end{cases} (14.12)
for the “difference” or high-pass filter.
Now let’s consider a different filter with a0 = 1 and b1 = 0.9. For the same input as above, the output is given by
y_0 = a_0 x_0 + b_1 y_{-1} = 1(1) + 0.9(0) = 1 (14.13)
y_1 = a_0 x_1 + b_1 y_0 = 1(0) + 0.9(1) = 0.9 (14.14)
y_2 = a_0 x_2 + b_1 y_1 = 1(0) + 0.9(0.9) = 0.81 (14.15)
\vdots
y_n = (0.9)^n (14.16)
The impulse response can be written more compactly as
h_n = \begin{cases} 0 & n < 0 \\ (0.9)^n & n \geq 0 \end{cases} (14.17)
Notice that the impulse response is nonzero for all n ≥ 0, which makes this an IIR filter. 
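The two filters in this example can be exercised with a direct, if naive, implementation of the difference equation (14.8); the function below is our own sketch, not an optimized filtering routine.

    import numpy as np

    def apply_filter(x, a, b=()):
        # Direct form of Equation (14.8); b[i-1] weights the output delayed by i samples
        y = np.zeros(len(x))
        for n in range(len(x)):
            y[n] = sum(ai * x[n - i] for i, ai in enumerate(a) if n - i >= 0)
            y[n] += sum(bi * y[n - i] for i, bi in enumerate(b, start=1) if n - i >= 0)
        return y

    impulse = np.zeros(8)
    impulse[0] = 1.0
    print(apply_filter(impulse, a=[1.25, 0.5]))      # FIR: 1.25, 0.5, 0, 0, ...
    print(apply_filter(impulse, a=[1.0], b=[0.9]))   # IIR: 1, 0.9, 0.81, 0.729, ...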
Although it is not as clear in the IIR case as it was in the FIR case, the impulse response
completely specifies the filter. Once we know the impulse response of the filter, we know
the relationship between the input and output of the filter. If xn and yn are the input and
output, respectively, of a filter with impulse response \{h_n\}_{n=0}^{M}, then yn can be obtained from xn and hn via the following relationship:
y_n = \sum_{k=0}^{M} h_k x_{n-k} (14.18)
where M is finite for an FIR filter and infinite for an IIR filter. The relationship, shown in
(14.18), is known as convolution and can be easily obtained through the use of the properties
of linearity and shift invariance (see Problem 1).
Because FIR filters are simply weighted averages, they are always stable. When we say a
filter is stable we mean that as long as the input is bounded, the output will also be bounded.
This is not true of IIR filters. Certain IIR filters can give an unbounded output even when
the input is bounded.
E x a m p l e 14 . 3 . 2:
Consider a filter with a0 = 1 and b1 = 2. Suppose the input sequence is a single 1 followed
by 0s. Then the output is
y_0 = a_0 x_0 + b_1 y_{-1} = 1(1) + 2(0) = 1 (14.19)
y_1 = a_0 x_1 + b_1 y_0 = 1(0) + 2(1) = 2 (14.20)
y_2 = a_0 x_2 + b_1 y_1 = 1(0) + 2(2) = 4 (14.21)
\vdots
y_n = 2^n (14.22)
Even though the input contained a single 1, the output at time n = 30 is 2^{30}, or more than a billion! 
Although IIR filters can become unstable, they can also provide better performance, in
terms of sharper cutoffs and less ripple in the passband and stopband for a fewer number of
coefficients.
The study of design and analysis of digital filters is a fascinating and important subject.
We provide some of the details in Sections 14.5–14.8. If you are not interested in these
topics, you can take a more utilitarian approach and make use of the literature to select the
necessary filters rather than design them. In the following section we briefly describe some
of the families of filters used to generate the examples in this chapter. We also provide filter
coefficients that you can use for experiment.
14 . 3 . 1 S o m e F i l t e r s U s e d i n S u b b a n d C o d i n g
The most frequently used filter banks in subband coding consist of a cascade of stages, where
each stage consists of a low-pass filter and a high-pass filter, as shown in Figure 14.5. The
most popular among these filters are the quadrature mirror filters (QMF), which were first
proposed by Crosier, Esteban, and Galand [197]. These filters have the property that if the
impulse response of the low-pass filter is given by hn, then the high-pass impulse response
is given by (−1)^n h_{N-1-n}. The QMF filters designed by Johnston [198] are widely used in
a number of applications. The filter coefficients for 8-, 16-, and 32-tap filters are given in
Tables 14.1–14.3. Notice that the filters are symmetric; that is,
h_{N-1-n} = h_n \quad n = 0, 1, \ldots, \frac{N}{2} - 1 (14.23)
As we shall see later, the filters with fewer taps are less efficient in their decomposition
than the filters with more taps. However, from Equation (14.18) we can see that the number
of taps dictates the number of multiply-add operations necessary to generate the filter outputs.
Thus, if we want to obtain more efficient decompositions, we do so by increasing the amount
of computation.
Another popular set of filters are the Smith-Barnwell filters [199], some of which are
shown in Tables 14.4 and 14.5.
F I G U R E 14. 5 An eight-band filter bank (a cascade of low-pass/high-pass splits).
T A B L E 14 . 1 Coefficients for the 8-tap Johnston low-pass filter.
h0, h7 0.00938715
h1, h6 0.06942827
h2, h5 −0.07065183
h3, h4 0.48998080
T A B L E 14 . 2 Coefficients for the 16-tap Johnston low-pass filter.
h0, h15 0.002898163
h1, h14 −0.009972252
h2, h13 −0.001920936
h3, h12 0.03596853
h4, h11 −0.01611869
h5, h10 −0.09530234
h6, h9 0.1067987
h7, h8 0.4773469
T A B L E 14 . 3 Coefficients for the 32-tap Johnston low-pass filter.
h0, h31 0.0022551390
h1, h30 −0.0039715520
h2, h29 −0.0019696720
h3, h28 0.0081819410
h4, h27 0.00084268330
h5, h26 −0.014228990
h6, h25 0.0020694700
h7, h24 0.022704150
h8, h23 −0.0079617310
h9, h22 −0.034964400
h10, h21 0.019472180
h11, h20 0.054812130
h12, h19 −0.044524230
h13, h18 −0.099338590
h14, h17 0.13297250
h15, h16 0.46367410
T A B L E 14 . 4 Coefficients for the eight-tap Smith-Barnwell low-pass filter.
h0 0.0348975582178515
h1 −0.01098301946252854
h2 −0.06286453934951963
h3 0.223907720892568
h4 0.556856993531445
h5 0.357976304997285
h6 −0.02390027056113145
h7 −0.07594096379188282
T A B L E 14 . 5 Coefficients for the 16-tap Smith-Barnwell low-pass filter.
h0 0.02193598203004352
h1 0.001578616497663704
h2 −0.06025449102875281
h3 −0.0118906596205391
h4 0.137537915636625
h5 0.05745450056390939
h6 −0.321670296165893
h7 −0.528720271545339
h8 −0.295779674500919
h9 0.0002043110845170894
h10 0.02906699789446796
h11 −0.03533486088708146
h12 −0.006821045322743358
h13 0.02606678468264118
h14 0.001033363491944126
h15 −0.01435930957477529
These families of filters differ in a number of ways. For example, consider the Johnston
eight-tap filter and the Smith-Barnwell eight-tap filter. The magnitude transfer functions for
these two filters are plotted in Figure 14.6. Notice that the cutoff for the Smith-Barnwell
filter is much sharper than the cutoff for the Johnston filter. This means that the separation
provided by the eight-tap Johnston filter is not as good as that provided by the eight-tap
Smith-Barnwell filter. We will see the effect of this when we look at image compression
later in this chapter.
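The mirror relationship between the two filters in a QMF pair can also be seen directly from the coefficients. The sketch below builds the eight-tap Johnston low-pass filter from Table 14.1, forms the high-pass filter as (−1)ⁿ h(N−1−n), and checks that the two magnitude responses are mirror images of each other about π/2; the brute-force evaluation of the frequency response is used only for clarity.

    import numpy as np

    half = [0.00938715, 0.06942827, -0.07065183, 0.48998080]   # Table 14.1
    h_lp = np.array(half + half[::-1])                          # the symmetric 8-tap prototype

    N = len(h_lp)
    h_hp = np.array([(-1) ** n * h_lp[N - 1 - n] for n in range(N)])

    w = np.linspace(0, np.pi, 512)
    def mag(h):
        return np.abs([np.sum(h * np.exp(-1j * wi * np.arange(N))) for wi in w])

    H_lp, H_hp = mag(h_lp), mag(h_hp)
    print(np.allclose(H_hp, H_lp[::-1]))     # |H_hp(e^jw)| = |H_lp(e^j(pi - w))|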
These filters are examples of some of the more popular filters. Many more filters exist
in the literature, and more are being discovered.
F I G U R E 14. 6 Magnitude transfer functions (in dB, versus frequency in Hz) of the (a) eight-tap Johnston and (b) eight-tap Smith-Barnwell filters.
14 . 4 The Basic Subband Coding Algorithm
The basic subband coding system is shown in Figure 14.7.
14 . 4 . 1 A n a l y s i s
The source output is passed through a bank of filters, called the analysis filter bank, which
covers the range of frequencies that make up the source output. The passbands of the filters
can be nonoverlapping or overlapping. Nonoverlapping and overlapping filter banks are
shown in Figure 14.8. The outputs of the filters are then subsampled.
The justification for the subsampling is the Nyquist rule and its generalization, which
tells us that we only need twice as many samples per second as the range of frequencies.
This means that we can reduce the number of samples at the output of the filter because the
range of frequencies at the output of the filter is less than the range of frequencies at the
input to the filter. This process of reducing the number of samples is called decimation,¹ or downsampling. The amount of decimation depends on the ratio of the bandwidth of the filter
output to the filter input. If the bandwidth at the output of the filter is 1/M of the bandwidth
at the input to the filter, we would decimate the output by a factor of M by keeping every
Mth sample. The symbol M ↓ is used to denote this decimation.
Once the output of the filters has been decimated, the output is encoded using one of
several encoding schemes, including ADPCM, PCM, and vector quantization.
F I G U R E 14. 7 Block diagram of the subband coding system (analysis filters and decimators, encoders, channel, decoders, upsamplers, and synthesis filters).
¹ The word decimation has a rather bloody origin. During the time of the Roman empire, if a legion broke ranks and ran during battle, its members were lined up and every tenth person was killed. This process was called decimation.
F I G U R E 14. 8 Nonoverlapping and overlapping filter banks (magnitude versus frequency).
14 . 4 . 2 Q u a n t i z a t i o n a n d C o d i n g
Along with the selection of the compression scheme, the allocation of bits between the
subbands is an important design parameter. Different subbands contain differing amounts of
information. Therefore, we need to allocate the available bits among the subbands according
to some measure of the information content. There are a number of different ways we could
distribute the available bits. For example, suppose we were decomposing the source output
into four bands and we wanted a coding rate of 1 bit per sample. We could accomplish this
by using 1 bit per sample for each of the four bands. On the other hand, we could simply
discard the output of two of the bands and use 2 bits per sample for the two remaining
bands. Or, we could discard the output of three of the four filters and use 4 bits per sample
to encode the output of the remaining filter.
This bit allocation procedure can have a significant impact on the quality of the final
reconstruction, especially when the information content of different bands is very different.
If we use the variance of the output of each filter as a measure of information, and
assume that the compression scheme is scalar quantization, we can arrive at several simple
bit allocation schemes (see Section 13.5). If we use a slightly more sophisticated model
for the outputs of the filters, we can arrive at significantly better bit allocation procedures
(see Section 14.9).
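As a rough illustration of the kind of variance-based rule referred to above (of the sort derived in Section 13.5), the sketch below gives each band the average rate plus half the base-2 log of the ratio of its variance to the geometric mean of all the variances. The function name and the example variances are ours; in practice, negative allocations are clipped to zero and the freed-up bits redistributed among the remaining bands.

    import numpy as np

    def log_variance_allocation(variances, avg_rate):
        v = np.asarray(variances, dtype=float)
        geo_mean = np.exp(np.mean(np.log(v)))         # geometric mean of the variances
        return avg_rate + 0.5 * np.log2(v / geo_mean)

    # Four subbands with very different energies, at an average rate of 1 bit per sample
    print(log_variance_allocation([100.0, 20.0, 4.0, 1.0], avg_rate=1.0))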
14 . 4 . 3 S y n t h e s i s
The quantized and coded coefficients are used to reconstruct a representation of the original
signal at the decoder. First, the encoded samples from each subband are decoded at the
receiver. These decoded values are then upsampled by inserting an appropriate number of
0s between samples. Once the number of samples per second has been brought back to the
original rate, the upsampled signals are passed through a bank of reconstruction filters. The
outputs of the reconstruction filters are added to give the final reconstructed outputs.
We can see that the basic subband system is simple. The three major components of this
system are the analysis and synthesis filters, the bit allocation scheme, and the encoding
scheme. A substantial amount of research has focused on each of these components. Various
filter bank structures have been studied in order to find filters that are simple to implement
and provide good separation between the frequency bands. In the next section we briefly
look at some of the techniques used in the design of filter banks, but our descriptions
are necessarily limited. For a (much) more detailed look, see the excellent book by P.P.
Vaidyanathan [200].
The bit allocation procedures have also been extensively studied in the contexts of
subband coding, wavelet-based coding, and transform coding. We have already described
some bit allocation schemes in Section 13.5, and we describe a different approach in
Section 14.9. There are also some bit allocation procedures that have been developed in the
context of wavelets, which we describe in the next chapter.
The separation of the source output according to frequency also opens up the possibility
for innovative ways to use compression algorithms. The decomposition of the source output
in this manner provides inputs for the compression algorithms, each of which has more clearly
defined characteristics than the original source output. We can use these characteristics to
select separate compression schemes appropriate to each of the different inputs.
Human perception of audio and video inputs is frequency dependent. We can use this
fact to design our compression schemes so that the frequency bands that are most important
to perception are reconstructed most accurately. Whatever distortion there has to be is
introduced in the frequency bands to which humans are least sensitive. We describe some
applications to the coding of speech, audio, and images later in this chapter.
Before we proceed to bit allocation procedures and implementations, we provide a more
mathematical analysis of the subband coding system. We also look at some approaches to
the design of filter banks for subband coding. The analysis relies heavily on the Z-transform
concepts introduced in Chapter 12 and will primarily be of interest to readers with an
electrical engineering background. The material is not essential to understanding the rest of
the chapter; if you are not interested in these details, you should skip these sections and go
directly to Section 14.9.
14 . 5 Design of Filter Banks ⋆
In this and the following starred section we will take a closer look at the analysis, down-
sampling, upsampling, and synthesis operations. Our approach follows that of [201]. We
assume familiarity with the Z-transform concepts of Chapter 12. We begin with some nota-
tion. Suppose we have a sequence x0, x1, x2, …. We can divide this sequence into two subsequences: x0, x2, x4, … and x1, x3, x5, … using the scheme shown in Figure 14.9,
where z−1
corresponds to a delay of one sample and ↓ M denotes a subsampling by a factor
of M. This subsampling process is called downsampling or decimation.
F I G U R E 14. 9 Decomposition of an input sequence into its odd and even components.
F I G U R E 14. 10 Reconstructing the input sequence from its odd and even components.
The original sequence can be recovered from the two downsampled sequences by insert-
ing 0s between consecutive samples of the subsequences, delaying the top branch by one
sample and adding the two together. Adding 0s between consecutive samples is called
upsampling and is denoted by ↑ M. The reconstruction process is shown in Figure 14.10.
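The delay, downsample, upsample, and add operations of Figures 14.9 and 14.10 can be mimicked directly in a few lines; in the sketch below the sample before x0 is taken to be 0, and the output is the input delayed by one sample, just as in the figure.

    import numpy as np

    x = np.arange(1.0, 9.0)                      # a short test sequence x_0, ..., x_7
    xd = np.concatenate(([0.0], x[:-1]))         # x delayed by one sample

    even = x[0::2]                               # top branch of Figure 14.9: x_0, x_2, ...
    odd = xd[0::2]                               # bottom branch (delay, then downsample)

    up_even = np.zeros(len(x)); up_even[0::2] = even    # insert zeros between samples
    up_odd = np.zeros(len(x));  up_odd[0::2] = odd

    recon = np.concatenate(([0.0], up_even[:-1])) + up_odd   # delay the top branch and add
    print(np.allclose(recon, xd))                # True: the input, delayed by one sample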
While we have decomposed the source output sequence into two subsequences, there is
no reason for the statistical and spectral properties of these subsequences to be different. As
our objective is to decompose the source output sequences into subsequences with differing
characteristics, there is much more yet to be done.
Generalizing this, we obtain the system shown in Figure 14.11. The source output
sequence is fed to an ideal low-pass filter and an ideal high-pass filter, each with a bandwidth
of /2. We assume that the source output sequence had a bandwidth of . If the original
source signal was sampled at the Nyquist rate, as the output of the two filters have bandwidths
half that of the original sequence, the filter outputs are actually oversampled by a factor
of two. We can, therefore, subsample these signals by a factor of two without any loss of
information. The two bands now have different characteristics and can be encoded differently.
For the moment let’s assume that the encoding is performed in a lossless manner so that the
reconstructed sequence exactly matches the source output sequence.
F I G U R E 14. 11 Decomposition into two bands using ideal filters.
Let us look at how this system operates in the frequency domain. We begin by looking
at the downsampling operation.
14 . 5 . 1 D o w n s a m p l i n g 
To see the effects of downsampling, we will obtain the Z-transform of the downsampled
sequence in terms of the original source sequence. Because it is easier to understand what is
going on if we can visualize the process, we will use the example of a source sequence that
has the frequency profile shown in Figure 14.12. For this sequence the output of the ideal
filters will have the shape shown in Figure 14.13.
Let’s represent the downsampled sequences as w_{i,n}. The Z-transform W_1(z) of the downsampled sequence w_{1,n} is
W_1(z) = \sum_{n} w_{1,n} z^{-n} (14.24)
The downsampling operation means that
w_{1,n} = y_{1,2n} (14.25)
F I G U R E 14. 12 Spectrum of the source output.
F I G U R E 14. 13 Spectrum of the outputs of the ideal filters.
In order to find the Z-transform of this sequence, we go through a two-step process. Define
the sequence
y'_{1,n} = \frac{1}{2}\left(1 + e^{j\pi n}\right) y_{1,n} (14.26)
= \begin{cases} y_{1,n} & n \text{ even} \\ 0 & \text{otherwise.} \end{cases} (14.27)
We could also have written Equation (14.26) as
y'_{1,n} = \frac{1}{2}\left(1 + (-1)^n\right) y_{1,n}
however, writing the relationship as in Equation (14.26) makes it easier to extend this
development to the case where we divide the source output into more than two bands.
The Z-transform of y'_{1,n} is given as
Y'_1(z) = \sum_{n=-\infty}^{\infty} \frac{1}{2}\left(1 + e^{j\pi n}\right) y_{1,n} z^{-n} (14.28)
Assuming all summations converge,
Y'_1(z) = \frac{1}{2} \sum_{n=-\infty}^{\infty} y_{1,n} z^{-n} + \frac{1}{2} \sum_{n=-\infty}^{\infty} y_{1,n} \left(z e^{-j\pi}\right)^{-n} (14.29)
= \frac{1}{2} Y_1(z) + \frac{1}{2} Y_1(-z) (14.30)
where we have used the fact that
e^{-j\pi} = \cos\pi - j\sin\pi = -1
Noting that
w_{1,n} = y'_{1,2n} (14.31)
W_1(z) = \sum_{n=-\infty}^{\infty} w_{1,n} z^{-n} = \sum_{n=-\infty}^{\infty} y'_{1,2n} z^{-n} (14.32)
Substituting m = 2n,
W_1(z) = \sum_{m=-\infty}^{\infty} y'_{1,m} z^{-\frac{m}{2}} (14.33)
= Y'_1\!\left(z^{\frac{1}{2}}\right) (14.34)
= \frac{1}{2} Y_1\!\left(z^{\frac{1}{2}}\right) + \frac{1}{2} Y_1\!\left(-z^{\frac{1}{2}}\right) (14.35)
Why didn’t we simply write the Z-transform of w1n directly in terms of y1n and use the
substitution m = 2n? If we had, the equivalent equation to (14.33) would contain the odd
indexed terms of y1n, which we know do not appear at the output of the downsampler. In
Equation (14.33), we also get the odd indexed terms of y'_{1,n}; however, as these terms are all
zero (see Equation (14.26)), they do not contribute to the Z-transform.
Substituting z = e^{j\omega} we get
W_1(e^{j\omega}) = \frac{1}{2} Y_1\!\left(e^{j\frac{\omega}{2}}\right) + \frac{1}{2} Y_1\!\left(-e^{j\frac{\omega}{2}}\right) (14.36)
Plotting this for the Y_1(e^{j\omega}) of Figure 14.13, we get the spectral shape shown in Figure 14.14; that is, the spectral shape of the downsampled signal is a stretched version of the spectral shape of the original signal. A similar situation exists for the downsampled signal w_{2,n}.
F I G U R E 14. 14 Spectrum of the downsampled low-pass filter output.
14 . 5 . 2 U p s a m p l i n g ⋆
Let’s take a look now at what happens after the upsampling. The upsampled sequence v_{1,n} can be written as
v_{1,n} = \begin{cases} w_{1,\frac{n}{2}} & n \text{ even} \\ 0 & n \text{ odd.} \end{cases} (14.37)
The Z-transform V_1(z) is thus
V_1(z) = \sum_{n=-\infty}^{\infty} v_{1,n} z^{-n} (14.38)
= \sum_{\substack{n=-\infty \\ n \text{ even}}}^{\infty} w_{1,\frac{n}{2}} z^{-n} (14.39)
= \sum_{m=-\infty}^{\infty} w_{1,m} z^{-2m} (14.40)
= W_1(z^2) (14.41)
The spectrum is sketched in Figure 14.15. The “stretching” of the sequence in the
time domain has led to a compression in the frequency domain. This compression has also
resulted in a replication of the spectrum in the 0  interval. This replication effect is called
imaging. We remove the images by using an ideal low-pass filter in the top branch and an
ideal high-pass filter in the bottom branch.
Because the use of the filters prior to sampling reduces the bandwidth, which in turn
allows the downsampling operation to proceed without aliasing, these filters are called anti-
aliasing filters. Because they decompose the source output into components, they are also
called analysis filters. The filters after the upsampling operation are used to recompose the
original signal; therefore, they are called synthesis filters. We can also view these filters as
interpolating between nonzero values to recover the signal at the point that we have inserted
zeros. Therefore, these filters are also called interpolation filters.
F I G U R E 14. 15 Spectrum of the upsampled signal.
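The compression of the spectrum and the appearance of an image can also be observed numerically. The sketch below upsamples a random half-rate sequence by two and compares the magnitude spectra; the random test signal is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    w1 = rng.standard_normal(64)             # a half-rate sequence w_{1,n}

    v1 = np.zeros(2 * len(w1))
    v1[::2] = w1                             # upsample by two: insert a zero after each sample

    W1 = np.fft.rfft(w1)                     # spectrum of w_{1,n} over [0, pi]
    V1 = np.fft.rfft(v1)                     # spectrum of v_{1,n} over [0, pi]

    m = len(W1)
    print(np.allclose(np.abs(V1[:m]), np.abs(W1)))            # compressed copy in [0, pi/2]
    print(np.allclose(np.abs(V1[m - 1:]), np.abs(W1)[::-1]))  # its mirror image in [pi/2, pi]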
Although the use of ideal filters would give us perfect reconstruction of the source
output, in practice we do not have ideal filters available. When we use more realistic filters
in place of the ideal filters, we end up introducing distortion. In the next section we look at
this situation and discuss how we can reduce or remove this distortion.
14 . 6 Perfect Reconstruction Using Two-Channel Filter Banks ⋆
Suppose we replace the ideal low-pass filter in Figure 14.11 with a more realistic filter with
the magnitude response shown in Figure 14.4. The spectrum of the output of the low-pass
filter is shown in Figure 14.16. Notice that we now have nonzero values for frequencies
above π/2. If we now subsample by two, we will end up sampling at less than twice the
highest frequency, or in other words, we will be sampling at below the Nyquist rate. This
will result in the introduction of aliasing distortion, which will show up in the reconstruction.
A similar situation will occur when we replace the ideal high-pass filter with a realistic
high-pass filter.
In order to get perfect reconstruction after synthesis, we need to somehow get rid of
the aliasing and imaging effects. Let us look at the conditions we need to impose upon
the filters H_1(z), H_2(z), K_1(z), and K_2(z) in order to accomplish this. These conditions are
called perfect reconstruction (PR) conditions.
Consider Figure 14.17. Let’s obtain an expression for X̂(z) in terms of H_1(z), H_2(z), K_1(z), and K_2(z). We start with the reconstruction:
\hat{X}(z) = U_1(z) + U_2(z) (14.42)
= V_1(z)K_1(z) + V_2(z)K_2(z) (14.43)
Therefore, we need to find V_1(z) and V_2(z). The sequence v_{1,n} is obtained by upsampling w_{1,n}. Therefore, from Equation (14.41),
V_1(z) = W_1(z^2) (14.44)
F I G U R E 14. 16 Output of the low-pass filter.
F I G U R E 14. 17 Two-channel subband decimation and interpolation.
The sequence w_{1,n} is obtained by downsampling y_{1,n},
Y_1(z) = X(z)H_1(z)
Therefore, from Equation (14.35),
W_1(z) = \frac{1}{2}\left[ X\!\left(z^{\frac{1}{2}}\right) H_1\!\left(z^{\frac{1}{2}}\right) + X\!\left(-z^{\frac{1}{2}}\right) H_1\!\left(-z^{\frac{1}{2}}\right) \right] (14.45)
and
V_1(z) = \frac{1}{2}\left[ X(z)H_1(z) + X(-z)H_1(-z) \right] (14.46)
Similarly, we can also show that
V_2(z) = \frac{1}{2}\left[ X(z)H_2(z) + X(-z)H_2(-z) \right] (14.47)
Substituting the expressions for V_1(z) and V_2(z) into Equation (14.43) we obtain
\hat{X}(z) = \frac{1}{2}\left[ H_1(z)K_1(z) + H_2(z)K_2(z) \right] X(z)
+ \frac{1}{2}\left[ H_1(-z)K_1(z) + H_2(-z)K_2(z) \right] X(-z) (14.48)
For perfect reconstruction we would like X̂(z) to be a delayed and perhaps amplitude-scaled version of X(z); that is,
\hat{X}(z) = cX(z)z^{-n_0} (14.49)
In order for this to be true, we need to impose conditions on H_1(z), H_2(z), K_1(z), and K_2(z). There are several ways we can do this, with each approach providing a different solution. One approach involves writing Equation (14.48) in matrix form as
\hat{X}(z) = \frac{1}{2} \begin{bmatrix} K_1(z) & K_2(z) \end{bmatrix} \begin{bmatrix} H_1(z) & H_1(-z) \\ H_2(z) & H_2(-z) \end{bmatrix} \begin{bmatrix} X(z) \\ X(-z) \end{bmatrix} (14.50)
For perfect reconstruction, we need
\begin{bmatrix} K_1(z) & K_2(z) \end{bmatrix} \begin{bmatrix} H_1(z) & H_1(-z) \\ H_2(z) & H_2(-z) \end{bmatrix} = \begin{bmatrix} cz^{-n_0} & 0 \end{bmatrix} (14.51)
where we have absorbed the factor of \frac{1}{2} into the constant c. This means that the synthesis filters K_1(z) and K_2(z) satisfy
\begin{bmatrix} K_1(z) & K_2(z) \end{bmatrix} = \frac{cz^{-n_0}}{\det[\mathbf{H}(z)]} \begin{bmatrix} H_2(-z) & -H_1(-z) \end{bmatrix} (14.52)
where
\mathbf{H}(z) = \begin{bmatrix} H_1(z) & H_1(-z) \\ H_2(z) & H_2(-z) \end{bmatrix} (14.53)
If H_1(z) and/or H_2(z) are IIR filters, the reconstruction filters can become quite complex. Therefore, we would like to have both the analysis and synthesis filters be FIR filters. If we select the analysis filters to be FIR, then in order to guarantee that the synthesis filters are also FIR we need
\det[\mathbf{H}(z)] = \alpha z^{-n_1}
where \alpha is a constant. Examining \det[\mathbf{H}(z)],
\det[\mathbf{H}(z)] = H_1(z)H_2(-z) - H_1(-z)H_2(z)
= P(z) - P(-z) = \alpha z^{-n_1} (14.54)
where P(z) = H_1(z)H_2(-z). If we examine Equation (14.54), we can see that n_1 has to be odd because all terms containing even powers of z in P(z) will be canceled out by the corresponding terms in P(-z). Thus, P(z) can have an arbitrary number of even-indexed coefficients (as they will get canceled out), but there must be only one nonzero coefficient of an odd power of z. By choosing any valid factorization of the form
P(z) = P_1(z)P_2(z) (14.55)
we can obtain many possible solutions of perfect reconstruction FIR filter banks with
H_1(z) = P_1(z) (14.56)
and
H_2(z) = P_2(-z) (14.57)
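As a tiny worked instance of this recipe (our own choice, not one from the text), take P(z) = (1 + z⁻¹)², so that P₁(z) = P₂(z) = 1 + z⁻¹, H₁(z) = 1 + z⁻¹, and H₂(z) = 1 − z⁻¹. Here det[H(z)] = 4z⁻¹, and choosing cz⁻ⁿ⁰ = 2z⁻¹ in Equation (14.52) gives K₁(z) = ½(1 + z⁻¹) and K₂(z) = −½(1 − z⁻¹). The sketch below checks by polynomial multiplication that the coefficient of X(z) in Equation (14.48) is a pure delay and that the coefficient of X(−z) vanishes.

    import numpy as np

    def pmul(a, b):                      # product of polynomials in z^{-1}
        return np.convolve(a, b)

    def padd(a, b):                      # sum of polynomials, padded to a common length
        n = max(len(a), len(b))
        return np.pad(a, (0, n - len(a))) + np.pad(b, (0, n - len(b)))

    def flip(a):                         # H(z) -> H(-z): negate the odd powers of z^{-1}
        return a * (-1.0) ** np.arange(len(a))

    H1 = np.array([1.0, 1.0])            # 1 + z^{-1}
    H2 = np.array([1.0, -1.0])           # 1 - z^{-1}
    K1 = np.array([0.5, 0.5])            # (1/2)(1 + z^{-1})
    K2 = np.array([-0.5, 0.5])           # -(1/2)(1 - z^{-1})

    delay_term = 0.5 * padd(pmul(H1, K1), pmul(H2, K2))              # coefficient of X(z)
    alias_term = 0.5 * padd(pmul(flip(H1), K1), pmul(flip(H2), K2))  # coefficient of X(-z)
    print(delay_term)                    # [0. 1. 0.] : a pure delay z^{-1}
    print(alias_term)                    # [0. 0. 0.] : the aliasing term cancels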
Although these filters are perfect reconstruction filters, for applications in data compression
they suffer from one significant drawback. Because these filters may be of unequal band-
width, the output of the larger bandwidth filter suffers from severe aliasing. If the output of
both bands is available to the receiver, this is not a problem because the aliasing is canceled
out in the reconstruction process. However, in many compression applications we discard
the subband containing the least amount of energy, which will generally be the output of
the filter with the smaller bandwidth. In this case the reconstruction will contain a large
amount of aliasing distortion. In order to avoid this problem for compression applications,
we generally wish to minimize the amount of aliasing in each subband. A class of filters that
is useful in this situation is the quadrature mirror filters (QMF). We look at these filters in
the next section.
14 . 6 . 1 T w o - C h a n n e l P R Q u a d r a t u r e M i r r o r F i l t e r s ⋆
Before we introduce the quadrature mirror filters, let’s rewrite Equation (14.48) as
\hat{X}(z) = T(z)X(z) + S(z)X(-z) (14.58)
where
T(z) = \frac{1}{2}\left[ H_1(z)K_1(z) + H_2(z)K_2(z) \right] (14.59)
S(z) = \frac{1}{2}\left[ H_1(-z)K_1(z) + H_2(-z)K_2(z) \right] (14.60)
In order for the reconstruction of the input sequence x_n to be a delayed, and perhaps scaled, version of x_n, we need to get rid of the aliasing term X(-z) and have T(z) be a pure delay. To get rid of the aliasing term, we need
S(z) = 0 \quad \forall z
From Equation (14.60), this will happen if
K_1(z) = H_2(-z) (14.61)
K_2(z) = -H_1(-z) (14.62)
After removing the aliasing distortion, a delayed version of the input will be available at the output if
T(z) = cz^{-n_0} \quad c \text{ is a constant} (14.63)
Replacing z by e^{j\omega}, this means that we want
\left| T(e^{j\omega}) \right| = \text{constant} (14.64)
\arg\!\left[ T(e^{j\omega}) \right] = K\omega \quad K \text{ a constant} (14.65)
The first requirement eliminates amplitude distortion, while the second, the linear phase requirement, is necessary to eliminate phase distortion. If these requirements are satisfied,
\hat{x}(n) = cx(n - n_0) (14.66)
That is, the reconstructed signal is a delayed version of the input signal x(n). However, meeting both requirements simultaneously is not a trivial task.
Consider the problem of designing T(z) to have linear phase. Substituting (14.61) and (14.62) into Equation (14.59), we obtain
T(z) = \frac{1}{2}\left[ H_1(z)H_2(-z) - H_1(-z)H_2(z) \right] (14.67)
Therefore, if we choose H_1(z) and H_2(z) to be linear phase FIR, T(z) will also be a linear phase FIR filter. In the QMF approach, we first select the low-pass filter H_1(z), then define the high-pass filter H_2(z) to be a mirror image of the low-pass filter:
H_2(z) = H_1(-z) (14.68)
This is referred to as a mirror condition and is the original reason for the name of the QMF filters [200]. We can see that this condition will force both filters to have equal bandwidth. Given the mirror condition and H_1(z), a linear phase FIR filter, we will have linear phase and
T(z) = \frac{1}{2}\left[ H_1^2(z) - H_1^2(-z) \right] (14.69)
It is not clear that T(e^{j\omega}) is a constant. In fact, we will show in Section 14.8 that a linear phase two-channel FIR QMF bank with the filters chosen as in Equation (14.68) can have the PR property if and only if H_1(z) is in the simple two-tap form
H_1(z) = h_0 z^{-2k_0} + h_1 z^{-(2k_1+1)} (14.70)
Then, T(z) is given by
T(z) = 2h_0 h_1 z^{-(2k_0 + 2k_1 + 1)} (14.71)
which is of the desired form cz^{-n_0}. However, if we look at the magnitude characteristics of the two filters, we see that they have poor cutoff characteristics. The magnitude of the low-pass filter is given by
\left| H_1(e^{j\omega}) \right|^2 = h_0^2 + h_1^2 + 2h_0 h_1 \cos\!\left[(2k_0 - 2k_1 - 1)\omega\right] (14.72)
and the high-pass filter is given by
\left| H_2(e^{j\omega}) \right|^2 = h_0^2 + h_1^2 - 2h_0 h_1 \cos\!\left[(2k_0 - 2k_1 - 1)\omega\right] (14.73)
For h_0 = h_1 = k_0 = k_1 = 1, the magnitude responses are plotted in Figure 14.18. Notice the poor cutoff characteristics of these two filters.
Thus, for perfect reconstruction with no aliasing and no amplitude or phase distortion, the mirror condition does not seem like such a good idea. However, if we slightly relax these rather strict conditions, we can obtain some very nice designs. For example, instead of attempting to eliminate all phase and amplitude distortion, we could elect to eliminate only the phase distortion and minimize the amplitude distortion. We can optimize the coefficients of H_1(z) such that T(e^{j\omega}) is made as close to a constant as possible, while minimizing the stopband energy of H_1(z) in order to have a good low-pass characteristic. Such an optimization has been suggested by Johnston [198] and Jain and Crochiere [202]. They construct the objective function
J = \alpha \int_{\omega_s}^{\pi} \left| H_1(e^{j\omega}) \right|^2 d\omega + (1 - \alpha) \int_{0}^{\pi} \left[ 1 - \left| T(e^{j\omega}) \right|^2 \right] d\omega (14.74)
which has to be minimized to obtain H_1(z) and T(z), where \omega_s is the cutoff frequency of the filter and \alpha is a weighting constant.
F I G U R E 14. 18 Magnitude characteristics of the two-tap PR filters (magnitude in dB versus frequency).
We can also go the other way and eliminate the amplitude distortion, then attempt to
minimize the phase distortion. A review of these approaches can be found in [201, 200].
14 . 6 . 2 P o w e r S y m m e t r i c F I R F i l t e r s 
Another approach, independently discovered by Smith and Barnwell [199] and Mintzer [203], can be used to design a two-channel filter bank in which aliasing, amplitude distortion, and phase distortion can be completely eliminated. As discussed earlier, choosing
$$K_1(z) = -H_2(-z)$$
$$K_2(z) = H_1(-z) \qquad (14.75)$$
eliminates aliasing. This leaves us with
$$T(z) = \frac{1}{2}\left[H_1(-z)H_2(z) - H_1(z)H_2(-z)\right].$$
In the approach due to Smith and Barnwell [199] and Mintzer [203], with $N$ an odd integer, we select
$$H_2(z) = z^{-N}H_1(-z^{-1}) \qquad (14.76)$$
so that
$$T(z) = \frac{1}{2}z^{-N}\left[H_1(z)H_1(z^{-1}) + H_1(-z)H_1(-z^{-1})\right]. \qquad (14.77)$$
Therefore, the perfect reconstruction requirement reduces to finding a prototype low-pass filter $H(z) = H_1(z)$ such that
$$Q(z) = H(z)H(z^{-1}) + H(-z)H(-z^{-1}) = \text{constant}. \qquad (14.78)$$
Defining
$$R(z) = H(z)H(z^{-1}), \qquad (14.79)$$
the perfect reconstruction requirement becomes
$$Q(z) = R(z) + R(-z) = \text{constant}. \qquad (14.80)$$
But $R(z)$ is simply the Z-transform of the autocorrelation sequence of $h(n)$. The autocorrelation sequence $\rho(n)$ is given by
$$\rho(n) = \sum_{k=0}^{N} h(k)h(k+n). \qquad (14.81)$$
The Z-transform of $\rho(n)$ is given by
$$R(z) = \mathcal{Z}\left[\rho(n)\right] = \mathcal{Z}\!\left[\sum_{k=0}^{N} h(k)h(k+n)\right]. \qquad (14.82)$$
We can express the sum $\sum_{k=0}^{N} h(k)h(k+n)$ as a convolution:
$$h(n)\circledast h(-n) = \sum_{k=0}^{N} h(k)h(k+n). \qquad (14.83)$$
Using the fact that the Z-transform of a convolution of two sequences is the product of the Z-transforms of the individual sequences, we obtain
$$R(z) = \mathcal{Z}\left[h(n)\right]\mathcal{Z}\left[h(-n)\right] = H(z)H(z^{-1}). \qquad (14.84)$$
Writing out $R(z)$ as the Z-transform of the sequence $\rho(n)$, we obtain
$$R(z) = \rho(N)z^{N} + \rho(N-1)z^{N-1} + \cdots + \rho(0) + \cdots + \rho(N-1)z^{-(N-1)} + \rho(N)z^{-N}. \qquad (14.85)$$
Then $R(-z)$ is
$$R(-z) = -\rho(N)z^{N} + \rho(N-1)z^{N-1} - \cdots + \rho(0) - \cdots + \rho(N-1)z^{-(N-1)} - \rho(N)z^{-N}. \qquad (14.86)$$
Adding $R(z)$ and $R(-z)$, we obtain $Q(z)$ as
$$Q(z) = 2\rho(N-1)z^{N-1} + 2\rho(N-3)z^{N-3} + \cdots + 2\rho(0) + \cdots + 2\rho(N-1)z^{-(N-1)}. \qquad (14.87)$$
Notice that the terms containing the odd powers of $z$ got canceled out. Thus, for $Q(z)$ to be a constant, all we need is that for even values of the lag $n$ (except for $n = 0$), $\rho(n)$ be zero. In other words,
$$\rho(2n) = \sum_{k=0}^{N} h(k)h(k+2n) = 0, \qquad n \neq 0. \qquad (14.88)$$
Writing this requirement in terms of the impulse response:
$$\sum_{k=0}^{N} h(k)h(k+2n) = \begin{cases} \alpha & n = 0 \\ 0 & n \neq 0. \end{cases} \qquad (14.89)$$
If we now normalize the impulse response,
$$\sum_{k=0}^{N} \left[h(k)\right]^2 = 1, \qquad (14.90)$$
we obtain the perfect reconstruction requirement
$$\sum_{k=0}^{N} h(k)h(k+2n) = \delta(n). \qquad (14.91)$$
In other words, for perfect reconstruction, the impulse response of the prototype filter is
orthogonal to the twice-shifted version of itself.
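A minimal numerical check of this double-shift orthogonality, using the four-tap Daubechies low-pass filter whose coefficients appear later in Table 14.6, is sketched below: the sum in Equation (14.91) evaluates to one at lag zero and to zero at the nonzero even lags.

```python
import numpy as np

# Four-tap Daubechies low-pass filter (Table 14.6).
h = np.array([0.4829629131445341, 0.8365163037378079,
              0.2241438680420134, -0.1294095225512604])

def double_shift_corr(h, n):
    """Compute sum_k h(k) h(k + 2n), treating out-of-range taps as zero."""
    n = abs(n)                                  # the sum is symmetric in n
    hpad = np.concatenate([h, np.zeros(2 * n)])
    return np.sum(h * hpad[2 * n : 2 * n + len(h)])

# Equation (14.91): unit energy at lag 0, zero correlation at even lags.
for n in range(2):
    print(n, double_shift_corr(h, n))   # ~1.0 for n = 0, ~0.0 for n = 1
```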
14 . 7 M -Band QMF Filter Banks 
We have looked at how we can decompose an input signal into two bands. In many
applications it is necessary to divide the input into multiple bands. We can do this by using
a recursive two-band splitting as shown in Figure 14.19, or we can obtain banks of filters
that directly split the input into multiple bands. Given that we have good filters that provide
two-band splitting, it would seem that using a recursive splitting, as shown in Figure 14.19,
would be an efficient way of obtaining an M-band split. Unfortunately, even when the
spectral characteristics of the filters used for the two-band split are quite good, when we
employ them in the tree structure shown in Figure 14.19, the spectral characteristics may
not be very good. For example, consider the four-tap filter with filter coefficients shown in
Table 14.6. In Figure 14.20 we show what happens to the spectral characteristics when we
look at the two-band split (at point A in Figure 14.19), the four-band split (at point B in
Figure 14.19), and the eight-band split (at point C in Figure 14.19). For a two-band split the
magnitude characteristic is flat, with some aliasing. When we employ these same filters to
obtain a four-band split from the two-band split, there is an increase in the aliasing. When
we go one step further to obtain an eight-band split, the magnitude characteristics deteriorate
substantially, as evidenced by Figure 14.20. The various bands are no longer clearly distinct.
There is significant overlap between the bands, and hence there will be a significant amount
of aliasing in each band.
In order to see why there is an increase in distortion, let us follow the top branch of
the tree. The path followed by the signal is shown in Figure 14.21a. As we will show later
F I G U R E 14. 19 Decomposition of an input sequence into multiple bands by recursively using a two-band split (each stage splits its input with a low-pass/high-pass filter pair; the two-, four-, and eight-band outputs are tapped at points A, B, and C).
T A B L E 14 . 6 Coefficients for the four-tap
Daubechies low-pass filter.
h0 0.4829629131445341
h1 0.8365163037378079
h2 0.2241438680420134
h3 −0.1294095225512604
(Section 14.8), the three filters and downsamplers can be replaced by a single filter and downsampler as shown in Figure 14.21b, where
$$A(z) = H_L(z)\,H_L(z^2)\,H_L(z^4). \qquad (14.92)$$
If $H_L(z)$ corresponds to a four-tap filter, then $A(z)$ corresponds to a filter of degree $3+6+12 = 21$, that is, a 22-tap filter. However, this is a severely constrained filter because it was generated using only
F I G U R E 14. 20 Spectral characteristics at points A, B, and C: magnitude responses for the two-band, four-band, and eight-band splits.
F I G U R E 14. 21 Equivalent structures for recursive filtering using a two-band split: (a) a cascade of three filters, each followed by downsampling by two; (b) the equivalent single filter A(z) followed by downsampling by eight.
four coefficients. If we had set out to design a 22-tap filter from scratch, we would have had significantly more freedom in selecting the coefficients. This is a strong motivation for designing filters directly for the M-band case.
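The equivalent filter $A(z)$ of Equation (14.92) is easy to construct numerically: upsample the impulse response of $H_L(z)$ to obtain those of $H_L(z^2)$ and $H_L(z^4)$, then convolve the three. The sketch below does this for the four-tap filter of Table 14.6 (a minimal illustration; the 22-tap length follows directly from the degrees of the three factors).

```python
import numpy as np

# Four-tap low-pass filter from Table 14.6.
hL = np.array([0.4829629131445341, 0.8365163037378079,
               0.2241438680420134, -0.1294095225512604])

def upsample(h, L):
    """Insert L-1 zeros between taps: the impulse response of H(z**L)."""
    out = np.zeros(len(h) * L - (L - 1))
    out[::L] = h
    return out

# Equation (14.92): A(z) = HL(z) HL(z^2) HL(z^4), realized as a convolution
# of the impulse responses of the three stages.
a = np.convolve(np.convolve(hL, upsample(hL, 2)), upsample(hL, 4))
print(len(a))   # 22 taps -- but only 4 free coefficients behind them
```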
An M-band filter bank has two sets of filters that are arranged as shown in Figure 14.7.
The input signal $x(n)$ is split into $M$ frequency bands using an analysis bank of $M$ filters of bandwidth $\pi/M$. The signal in any of these $M$ channels is then downsampled by a factor $L$.
This constitutes the analysis bank. The subband signals yk n are encoded and transmitted.
At the synthesis stage the subband signals are then decoded, upsampled by a factor of L
by interlacing adjacent samples with L − 1 zeros, and then passed through the synthesis
or interpolation filters. The output of all these synthesis filters is added together to obtain
the reconstructed signal. This constitutes the synthesis filter bank. Thus, the analysis and
synthesis filter banks together take an input signal x n and produce an output signal x̂ n .
These filters could be any combination of FIR and IIR filters.
Depending on whether M is less than, equal to, or greater than L, the filter bank is called
an underdecimated, critically (maximally) decimated, or overdecimated filter bank. For most
practical applications, maximal decimation or “critical subsampling” is used.
A detailed study of M-band filters is beyond the scope of this chapter. Suffice it to say
that in broad outline much of what we said about two-band filters can be generalized to
M-band filters. (For more on this subject, see [200].)
14 . 8 The Polyphase Decomposition 
A major problem with representing the combination of filters and downsamplers is the
time-varying nature of the up- and downsamplers. An elegant way of solving this problem
is with the use of polyphase decomposition. In order to demonstrate this concept, let us first
consider the simple case of two-band splitting. We will first consider the analysis portion of
the system shown in Figure 14.22. Suppose the analysis filter $H_1(z)$ is given by
$$H_1(z) = h_0 + h_1 z^{-1} + h_2 z^{-2} + h_3 z^{-3} + \cdots. \qquad (14.93)$$
By grouping the odd and even terms together, we can write this as
$$H_1(z) = \left(h_0 + h_2 z^{-2} + h_4 z^{-4} + \cdots\right) + z^{-1}\left(h_1 + h_3 z^{-2} + h_5 z^{-4} + \cdots\right). \qquad (14.94)$$
F I G U R E 14. 22 Analysis portion of a two-band subband coder: H1(z) and H2(z), each followed by downsampling by two.
Define
$$H_{10}(z) = h_0 + h_2 z^{-1} + h_4 z^{-2} + \cdots \qquad (14.95)$$
$$H_{11}(z) = h_1 + h_3 z^{-1} + h_5 z^{-2} + \cdots \qquad (14.96)$$
Then $H_1(z) = H_{10}(z^2) + z^{-1}H_{11}(z^2)$. Similarly, we can decompose the filter $H_2(z)$ into components $H_{20}(z)$ and $H_{21}(z)$, and we can represent the system of Figure 14.22 as shown in Figure 14.23. The filters $H_{10}(z)$, $H_{11}(z)$ and $H_{20}(z)$, $H_{21}(z)$ are called the polyphase components of $H_1(z)$ and $H_2(z)$.
Let's take the inverse Z-transform of the polyphase components of $H_1(z)$:
$$h_{10}(n) = h(2n), \qquad n = 0, 1, \ldots \qquad (14.97)$$
$$h_{11}(n) = h(2n+1), \qquad n = 0, 1, \ldots \qquad (14.98)$$
Thus, $h_{10}(n)$ and $h_{11}(n)$ are simply the impulse response $h(n)$ downsampled by two. Consider the output of the downsampler for a given input $X(z)$. The input to the downsampler is $X(z)H_1(z)$; thus, the output from Equation (14.35) is
$$Y_1(z) = \frac{1}{2}X\!\left(z^{1/2}\right)H_1\!\left(z^{1/2}\right) + \frac{1}{2}X\!\left(-z^{1/2}\right)H_1\!\left(-z^{1/2}\right). \qquad (14.99)$$
F I G U R E 14. 23 Alternative representation of the analysis portion of a two-band subband coder, using the polyphase components H10(z), H11(z), H20(z), and H21(z).
Replacing $H_1(z)$ with its polyphase representation, we get
$$Y_1(z) = \frac{1}{2}X\!\left(z^{1/2}\right)\left[H_{10}(z) + z^{-1/2}H_{11}(z)\right] + \frac{1}{2}X\!\left(-z^{1/2}\right)\left[H_{10}(z) - z^{-1/2}H_{11}(z)\right] \qquad (14.100)$$
$$= H_{10}(z)\left[\frac{1}{2}X\!\left(z^{1/2}\right) + \frac{1}{2}X\!\left(-z^{1/2}\right)\right] + H_{11}(z)\left[\frac{1}{2}z^{-1/2}X\!\left(z^{1/2}\right) - \frac{1}{2}z^{-1/2}X\!\left(-z^{1/2}\right)\right] \qquad (14.101)$$
Note that the first expression in square brackets is the output of a downsampler whose input is $X(z)$, while the quantity in the second set of square brackets is the output of a downsampler whose input is $z^{-1}X(z)$. Therefore, we could implement this system as shown in Figure 14.24.
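The equivalence between Figure 14.22 and Figure 14.24 can be checked directly: filtering and then downsampling gives the same samples as splitting the input into its even samples and its delayed odd samples, filtering each stream with the corresponding polyphase component, and adding. A minimal sketch follows (the four-tap filter of Table 14.6 is used here only as a stand-in analysis filter).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
h = np.array([0.4829629131445341, 0.8365163037378079,
              0.2241438680420134, -0.1294095225512604])   # stand-in analysis filter

# Direct path: filter with H1(z), then keep every second output sample.
y_direct = np.convolve(x, h)[: len(x)][::2]

# Polyphase path (Figure 14.24): split h into even/odd taps, split the
# input into x(2m) and x(2m-1), filter each stream, and add.
h10, h11 = h[::2], h[1::2]                                # Equations (14.95)-(14.96)
x_even = x[::2]                                           # downsampled x(n)
x_odd_delayed = np.concatenate(([0.0], x[1::2]))[:-1]     # downsampled z^{-1} x(n)
y_poly = (np.convolve(x_even, h10)[: len(x_even)]
          + np.convolve(x_odd_delayed, h11)[: len(x_even)])

print(np.allclose(y_direct, y_poly))   # True
```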
Now let us consider the synthesis portion of the two-band system shown in Figure 14.25. As in the case of the analysis portion, we can write the transfer functions in terms of their polyphase representation. Thus,
$$G_1(z) = G_{10}(z^2) + z^{-1}G_{11}(z^2) \qquad (14.102)$$
$$G_2(z) = G_{20}(z^2) + z^{-1}G_{21}(z^2). \qquad (14.103)$$
Consider the output of the synthesis filter $G_1(z)$ given an input $Y_1(z)$. From Equation (14.41), the output of the upsampler is
$$U_1(z) = Y_1(z^2) \qquad (14.104)$$
F I G U R E 14. 24 Polyphase representation of the analysis portion of a two-band subband coder, with the downsamplers moved ahead of the polyphase filters.
F I G U R E 14. 25 The synthesis portion of a two-band subband coder: upsampling by two followed by G1(z) and G2(z).
and the output of $G_1(z)$ is
$$V_1(z) = Y_1(z^2)G_1(z) \qquad (14.105)$$
$$= Y_1(z^2)G_{10}(z^2) + z^{-1}Y_1(z^2)G_{11}(z^2). \qquad (14.106)$$
The first term in the equation above is the output of an upsampler that follows a filter with transfer function $G_{10}(z)$ with input $Y_1(z)$. Similarly, $Y_1(z^2)G_{11}(z^2)$ is the output of an upsampler that follows a filter with transfer function $G_{11}(z)$ with input $Y_1(z)$. Thus, this system can be represented as shown in Figure 14.26.
Putting the polyphase representations of the analysis and synthesis portions together, we
get the system shown in Figure 14.27. Looking at the portion in the dashed box, we can see
that this is a completely linear time-invariant system.
F I G U R E 14. 26 Polyphase representation of the synthesis portion of a two-band subband coder, with the polyphase filters placed ahead of the upsamplers.
F I G U R E 14. 27 Polyphase representation of the two-band subband coder, combining the analysis structure of Figure 14.24 with the synthesis structure of Figure 14.26.
The polyphase representation can be a very useful tool for the design and analysis of filters. While many of its uses are beyond the scope of this chapter, we can use this representation to prove our statement about the two-band perfect reconstruction QMF filters. Recall that we want
$$T(z) = \frac{1}{2}\left[H_1(z)H_2(-z) - H_1(-z)H_2(z)\right] = cz^{-n_0}.$$
If we impose the mirror condition $H_2(z) = H_1(-z)$, $T(z)$ becomes
$$T(z) = \frac{1}{2}\left[H_1^2(z) - H_1^2(-z)\right]. \qquad (14.107)$$
The polyphase decomposition of $H_1(z)$ is
$$H_1(z) = H_{10}(z^2) + z^{-1}H_{11}(z^2).$$
Substituting this into Equation (14.107) for $H_1(z)$, and
$$H_1(-z) = H_{10}(z^2) - z^{-1}H_{11}(z^2)$$
for $H_1(-z)$, we obtain
$$T(z) = 2z^{-1}H_{10}(z^2)H_{11}(z^2). \qquad (14.108)$$
Clearly, the only way $T(z)$ can have the form $cz^{-n_0}$ is if both $H_{10}(z)$ and $H_{11}(z)$ are simple delays; that is,
$$H_{10}(z) = h_0 z^{-k_0} \qquad (14.109)$$
$$H_{11}(z) = h_1 z^{-k_1}. \qquad (14.110)$$
This results in
$$T(z) = 2h_0 h_1 z^{-(2k_0+2k_1+1)}, \qquad (14.111)$$
which is of the form $cz^{-n_0}$ as desired. The resulting filters have the transfer functions
$$H_1(z) = h_0 z^{-2k_0} + h_1 z^{-(2k_1+1)} \qquad (14.112)$$
$$H_2(z) = h_0 z^{-2k_0} - h_1 z^{-(2k_1+1)}. \qquad (14.113)$$
14 . 9 Bit Allocation
Once we have separated the source output into the constituent sequences, we need to decide
how much of the coding resource should be used to encode the output of each synthesis
filter. In other words, we need to allocate the available bits between the subband sequences.
In the previous chapter we described a bit allocation procedure that uses the variances of
the transform coefficients. In this section we describe a bit allocation approach that attempts
to use as much information about the subbands as possible to distribute the bits.
Let’s begin with some notation. We have a total of BT bits that we need to distribute
among M subbands. Suppose R corresponds to the average rate in bits per sample for the
overall system, and Rk is the average rate for subband k. Let’s begin with the case where
the input is decomposed into M equal bands, each of which is decimated by a factor of M.
Finally, let’s assume that we know the rate distortion function for each band. (If you recall
from Chapter 8, this is a rather strong assumption and we will relax it shortly.) We also
assume that the distortion measure is such that the total distortion is the sum of the distortion
contribution of each band.
We want to find the bit allocation $R_k$ such that
$$R = \frac{1}{M}\sum_{k=1}^{M} R_k \qquad (14.114)$$
and the reconstruction error is minimized. Each value of $R_k$ corresponds to a point on the rate distortion curve. The question is where on the rate distortion curve for each subband we should operate to minimize the average distortion. There is a trade-off between rate and distortion. If we decrease the rate (that is, move down the rate distortion curve), we will increase the distortion. Similarly, if we want to move to the left on the rate distortion curve and minimize the distortion, we end up increasing the rate. We need a formulation that incorporates both rate and distortion and the trade-off involved. The formulation we use is based on a landmark paper in 1988 by Yaacov Shoham and Allen Gersho [204]. Let's define a functional $J_k$:
$$J_k = D_k + \lambda R_k \qquad (14.115)$$
where $D_k$ is the distortion contribution from the $k$th subband and $\lambda$ is a Lagrangian parameter. This is the quantity we wish to minimize. In this expression the parameter $\lambda$ in some sense specifies the trade-off. If we are primarily interested in minimizing the distortion, we can set $\lambda$ to a small value. If our primary interest is in minimizing the rate, we keep the value of
$\lambda$ large. We can show that the values of $D_k$ and $R_k$ that minimize $J_k$ occur where the slope of the rate distortion curve is $-\lambda$. Thus, given a value of $\lambda$ and the rate distortion function, we can immediately identify the values of $R_k$ and $D_k$. So what should the value of $\lambda$ be, and how should it vary between subbands?
Let's take the second question first. We would like to allocate bits in such a way that any increase in any of the rates will have the same impact on the distortion. This will happen when we pick $R_k$ in such a way that the slopes of the rate distortion functions for the different subbands are the same; that is, we want to use the same $\lambda$ for each subband. Let's see what happens if we do not. Consider the two rate distortion functions shown in Figure 14.28. Suppose the points marked x on the rate distortion functions correspond to the selected rates. Obviously, the slopes, and hence the values of $\lambda$, are different in the two cases. Because of the difference in the slopes, an increase of $\Delta R$ in the rate $R_1$ will result in a much larger decrease in the distortion than the increase in distortion if we decreased $R_2$ by $\Delta R$. Because the total distortion is the sum of the individual distortions, we can therefore reduce the overall distortion by increasing $R_1$ and decreasing $R_2$. We will be able to keep doing this until the slopes corresponding to the rates are the same in both cases. Thus, the answer to our second question is that we want to use the same value of $\lambda$ for all the subbands.
Given a set of rate distortion functions and a value of $\lambda$, we automatically get a set of rates $R_k$. We can then compute the average and check whether it satisfies our constraint on the total number of bits we can spend. If it does not, we modify the value of $\lambda$ until we get a set of rates that satisfies our rate constraint.
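A sketch of this search is given below. It assumes a simple operational model for each subband, $D_k(R_k) = \sigma_k^2 2^{-2R_k}$ (an assumption made only for illustration, not the procedure of [204] itself), picks for each trial value of $\lambda$ the rate at which the slope of every curve equals $-\lambda$, and bisects on $\lambda$ until the average rate meets the budget. The subband variances used are made-up values.

```python
import numpy as np

def allocate(variances, target_rate, iters=60):
    """Bisect on the Lagrange parameter lambda until the average of the
    per-band rates R_k (the slope-matching points of D_k(R) = var_k 2^{-2R})
    equals the target average rate in bits per sample."""
    def rates(lam):
        r = 0.5 * np.log2(2 * np.log(2) * variances / lam)
        return np.maximum(r, 0.0)                    # rates cannot be negative

    lo, hi = 1e-12, 2 * np.log(2) * variances.max()  # brackets for lambda
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if rates(lam).mean() > target_rate:
            lo = lam       # too many bits: increase lambda to penalize rate
        else:
            hi = lam
    return rates(0.5 * (lo + hi))

variances = np.array([20.0, 6.0, 1.5, 0.4])    # hypothetical subband variances
print(allocate(variances, target_rate=1.0))    # more bits go to the high-variance bands
```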
However, generally we do not have rate distortion functions available. In these cases we
use whatever is available. For some cases we might have operational rate distortion curves
available. By “operational” we mean performance curves for particular types of encoders
operating on specific types of sources. For example, if we knew we were going to be using
pdf-optimized nonuniform quantizers with entropy coding, we could estimate the distribution
of the subband and use the performance curve for pdf-optimized nonuniform quantizers for
F I G U R E 14. 28 Two rate distortion functions, with operating points at rates R1 and R2.
that distribution. We might only have the performance of the particular encoding scheme
for a limited number of rates. In this case we need to have some way of obtaining the slope
from a few points. We could estimate this numerically from these points. Or we could fit
the points to a curve and estimate the slope from the curve. In these cases we might not be
able to get exactly the average rate we wanted.
Finally, we have been talking about a situation where the number of samples in each
subband is exactly the same, and therefore the total rate is simply the sum of the individual
rates. If this is not true, we need to weight the rates of the individual subbands. The functional
to be minimized becomes
$$J = \sum_{k} D_k + \lambda\sum_{k} \alpha_k R_k \qquad (14.116)$$
where $\alpha_k$ is the weight reflecting the relative length of the sequence generated by the $k$th filter. The distortion contribution from each subband might not be equally relevant, perhaps because of the filter construction or because of the perceptual weight attached to those frequencies [205]. In these cases we can modify our functional still further to include the unequal weighting of the distortion:
$$J = \sum_{k} w_k D_k + \lambda\sum_{k} \alpha_k R_k \qquad (14.117)$$
14 . 10 Application to Speech Coding—G.722
The ITU-T recommendation G.722 provides a technique for wideband coding of speech
signals that is based on subband coding. The basic objective of this recommendation is to
provide high-quality speech at 64 kbits per second (kbps). The recommendation also contains
two other modes that encode the input at 56 and 48 kbps. These two modes are used when
an auxiliary channel is needed. The first mode provides for an auxiliary channel of 8 kbps;
the second mode, for an auxiliary channel of 16 kbps.
The speech output or audio signal is filtered to 7 kHz to prevent aliasing, then sampled
at 16,000 samples per second. Notice that the cutoff frequency for the anti-aliasing filter is
7 kHz, not 8 kHz, even though we are sampling at 16,000 samples per second. One reason
for this is that the cutoff for the anti-aliasing filter is not going to be sharp like that of the
ideal low-pass filter. Therefore, the highest frequency component in the filter output will be
greater than 7 kHz. Each sample is encoded using a 14-bit uniform quantizer. This 14-bit
input is passed through a bank of two 24-coefficient FIR filters. The coefficients of the
low-pass QMF filter are shown in Table 14.7.
The coefficients for the high-pass QMF filter can be obtained by the relationship
$$h_{HP}(n) = (-1)^n\,h_{LP}(n) \qquad (14.118)$$
The low-pass filter passes all frequency components in the range of 0 to 4 kHz, while the
high-pass filter passes all remaining frequencies. The output of the filters is downsampled by
a factor of two. The downsampled sequences are encoded using adaptive differential PCM
(ADPCM) systems.
The ADPCM system that encodes the downsampled output of the low-frequency filter
uses 6 bits per sample, with the option of dropping 1 or 2 least significant bits in order to
T A B L E 14 . 7 Transmit and receive QMF coefficient values.
h0, h23     3.66211 × 10^-4
h1, h22    −1.34277 × 10^-3
h2, h21    −1.34277 × 10^-3
h3, h20     6.46973 × 10^-3
h4, h19     1.46484 × 10^-3
h5, h18    −1.90430 × 10^-2
h6, h17     3.90625 × 10^-3
h7, h16     4.41895 × 10^-2
h8, h15    −2.56348 × 10^-2
h9, h14    −9.82666 × 10^-2
h10, h13    1.16089 × 10^-1
h11, h12    4.73145 × 10^-1
provide room for the auxiliary channel. The output of the high-pass filter is encoded using 2
bits per sample. Because the 2 least significant bits of the quantizer output of the low-pass
ADPCM system could be dropped and then not available to the receiver, the adaptation
and prediction at both the transmitter and receiver are performed using only the 4 most
significant bits of the quantizer output.
If all 6 bits are used in the encoding of the low-frequency subband, we end up with a rate
of 48 kbps for the low band. Since the high band is encoded at 2 bits per sample, the output
rate for the high subband is 16 kbps. Therefore, the total output rate for the subband-ADPCM
system is 64 kbps.
The quantizer is adapted using a variation of the Jayant algorithm [110]. Both ADPCM systems use the past two reconstructed values and the past six quantizer outputs to predict the next sample, in the same way as the predictor for recommendation G.726 described in Chapter 11.
The predictor is adapted in the same manner as the predictor used in the G.726 algorithm.
At the receiver, after being decoded by the ADPCM decoder, each output signal is
upsampled by the insertion of a zero after each sample. The upsampled signals are passed
through the reconstruction filters. These filters are identical to the filters used for decompos-
ing the signal. The low-pass reconstruction filter coefficients are given in Table 14.7, and
the coefficients for the high-pass filter can be obtained using Equation (14.118).
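A small sketch of how the two QMF filters are related is given below: the 24-tap low-pass filter is assembled from the half-table of Table 14.7 using the symmetry $h(n) = h(23-n)$ implied by the paired listing, and the high-pass filter follows from Equation (14.118).

```python
import numpy as np

# First half of the G.722 QMF low-pass filter (Table 14.7); the filter is
# symmetric, h(n) = h(23 - n), so the second half is the mirror image.
half = np.array([ 3.66211e-4, -1.34277e-3, -1.34277e-3,  6.46973e-3,
                  1.46484e-3, -1.90430e-2,  3.90625e-3,  4.41895e-2,
                 -2.56348e-2, -9.82666e-2,  1.16089e-1,  4.73145e-1])
h_lp = np.concatenate([half, half[::-1]])        # 24 coefficients

# Equation (14.118): the high-pass filter is the low-pass with alternating signs.
n = np.arange(24)
h_hp = (-1.0) ** n * h_lp

# The low-pass gain at DC is about one, while the high-pass gain at DC is
# essentially zero, as expected for this complementary pair.
print(abs(h_lp.sum()), abs(h_hp.sum()))
```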
14 . 11 Application to Audio Coding—MPEG Audio
The Moving Picture Experts Group (MPEG) has proposed an audio coding scheme that
is based in part on subband coding. Actually, MPEG has proposed three coding schemes,
called Layer I, Layer II, and Layer III coding. Each is more complex than the previous and
provides higher compression. The coders are also “upward” compatible; a Layer N decoder
is able to decode the bitstream generated by the Layer N −1 encoder. In this section we will
look primarily at the Layer 1 and Layer 2 coders.
The Layer 1 and Layer 2 coders both use a bank of 32 filters, splitting the input into
32 bands, each with a bandwidth of fs/64, where fs is the sampling frequency. Allowable
sampling frequencies are 32,000 samples per second, 44,100 samples per second, and 48,000
samples per second. Details of these coders are provided in Chapter 16.
14 . 12 Application to Image Compression
We have discussed how to separate a sequence into its components. However, all the exam-
ples we have used are one-dimensional sequences. What do we do when the sequences
contain two-dimensional dependencies such as images? The obvious answer is that we need
two-dimensional filters that separate the source output into components based on both the hor-
izontal and vertical frequencies. Fortunately, in most cases, this two-dimensional filter can be
implemented as two one-dimensional filters, which can be applied first in one dimension, then
in the other. Filters that have this property are called separable filters. Two-dimensional non-
separable filters do exist [206]; however, the gains are offset by the increase in complexity.
Generally, for subband coding of images we filter each row of the image separately using
a high-pass and low-pass filter. The output of the filters is decimated by a factor of two.
Assume that the image is of size $N \times N$. After this first stage, we will have two images of size $N \times \frac{N}{2}$. We then filter each column of the two subimages, decimating the outputs of the filters again by a factor of two. This results in four images of size $\frac{N}{2} \times \frac{N}{2}$. We can stop
at this point or continue the decomposition process with one or more of the four subimages,
resulting in 7, 10, 13, or 16 images. Generally, of the four original subimages, only one or
two are further decomposed. The reason for not decomposing the other subimages is that
many of the pixel values in the high-frequency subimages are close to zero. Thus, there is
little reason to spend computational power to decompose these subimages.
E x a m p l e 14 . 12 . 1:
Let’s take the “image” in Table 14.8 and decompose it using the low-pass and high-pass
filters of Example 14.2.1. After filtering each row with the low-pass filter, the output is
decimated by a factor of two. Each output from the filter depends on the current input and
the past input. For the very first input (that is, the pixels at the left edge of the image),
we will assume that the past values of the input were zero. The decimated output of the
low-pass and high-pass filters is shown in Table 14.9.
We take each of these subimages and filter them column by column using the low-pass
and high-pass filters and decimate the outputs by two. In this case, the first input to the filters
T A B L E 14 . 8 A sample “image.”
10 14 10 12 14 8 14 12
10 12 8 12 10 6 10 12
12 10 8 6 8 10 12 14
8 6 4 6 4 6 8 10
14 12 10 8 6 4 6 8
12 8 12 10 6 6 6 6
12 10 6 6 6 6 6 6
6 6 6 6 6 6 6 6
T A B L E 14 . 9 Filtered and decimated output.
Decimated Decimated
Low-Pass Output High-Pass Output
5 12 13 11 5 −2 1 3
5 10 11 8 5 −2 −1 2
6 9 7 11 6 −1 1 1
4 5 5 7 4 −1 −1 1
7 11 7 5 7 −1 −1 1
6 10 8 6 6 2 −2 0
6 8 6 6 6 −2 0 0
3 6 6 6 3 0 0 0
T A B L E 14 . 10 Four subimages.
Low-Low Image              Low-High Image
2.5   6     6.5   5.5      2.5   6     6.5   5.5
5.5   9.5   9     9.5      0.5  −0.5  −2     1.5
5.5   8     6     6        1.5   3     1    −1
6     9     7     6        0    −1    −1     0
High-Low Image             High-High Image
2.5  −1     0.5   1.5      2.5  −1     0.5   1.5
5.5  −1.5   0     1.5      0.5   0.5   1    −0.5
5.5  −1    −1     1        1.5   0     0     0
6     0    −1     0        0    −2     1     0
is the top element in each row. We assume that there is a zero row of pixels right above this
row in order to provide the filter with “past” values. After filtering and decimation, we get
four subimages (Table 14.10). The subimage obtained by low-pass filtering of the columns
of the subimage (which was the output of the row low-pass filtering) is called the low-low
(LL) image. Similarly, the other images are called the low-high (LH), high-low (HL), and
high-high (HH) images. 
If we look closely at the final set of subimages in the previous example, we notice that
there is a difference in the characteristics of the values in the left or top row and the interiors
of some of the subimages. For example, in the high-low subimage, the values in the first
column are significantly larger than the other values in the subimage. Similarly, in the low-
high subimage, the values in the first row are generally very different than the other values
in the subimage. The reason for this variance is our assumption that the “past” of the image
above the first row and to the left of the column was zero. The difference between zero and
the image values was much larger than the normal pixel-to-pixel differences. Therefore, we
ended up adding some spurious structure to the image reflected in the subimages. Generally,
this is undesirable because it is easier to select appropriate compression schemes when the
characteristics of the subimages are as uniform as possible. For example, if we did not have
T A B L E 14 . 11 Alternate four subimages.
Low-Low Image              Low-High Image
10    12    13    11       0     0    −0.5  −0.5
11     9.5   9     9.5     1    −0.5  −2     1.5
11     8     6     6       3     3     1    −1
12     9     7     6       0    −1    −1     0
High-Low Image             High-High Image
 0    −2     1     3       0     0     0     0
 0    −1.5   0     1.5     0     0.5   1    −0.5
 0    −1    −1     1       0     0     0     0
 0     0    −1     0       0    −2     1     0
the relatively large values in the first column of the high-low subimage, we could choose a
quantizer with a smaller step size.
In this example, this effect was limited to a single row or column because the filters used
a single past value. However, most filters use a substantially larger number of past values
in the filtering operation, and a larger portion of the subimage is affected.
We can avoid this problem by assuming a different “past.” There are a number of ways
this can be done. A simple method that works well is to reflect the values of the pixels at
the boundary. For example, for the sequence 6 9 5 4 7 2 ···, which was to be filtered with a
three-tap filter, we would assume the past as 9 6 6 9 5 4 7 2···. If we use this approach
for the image in Example 14.12.1, the four subimages would be as shown in Table 14.11.
Notice how much sparser each image is, except for the low-low image. Most of the
energy in the original image has been compacted into the low-low image. Since the other
subimages have very few values that need to be encoded, we can devote most of our
resources to the low-low subimage.
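The row-column decomposition with a reflected "past" is easy to reproduce. The sketch below assumes the simple two-tap averaging and differencing filters (consistent with the values in Tables 14.9 through 14.11) and computes the four subimages of the sample image; its low-low output can be compared with Table 14.11.

```python
import numpy as np

def analyze_1d(x):
    """Two-tap low-pass/high-pass split with the boundary sample reflected
    as the 'past' value, followed by decimation by two."""
    xp = np.concatenate(([x[0]], x))              # reflected past value
    low  = 0.5 * (xp[1:] + xp[:-1])               # (x[n] + x[n-1]) / 2
    high = 0.5 * (xp[1:] - xp[:-1])               # (x[n] - x[n-1]) / 2
    return low[::2], high[::2]

def analyze_2d(img):
    """Filter the rows, then the columns, giving the LL, LH, HL, HH subimages."""
    row_pairs = [analyze_1d(r) for r in img]
    rows_low  = np.array([p[0] for p in row_pairs])
    rows_high = np.array([p[1] for p in row_pairs])
    def split_cols(a):
        col_pairs = [analyze_1d(c) for c in a.T]
        return (np.array([p[0] for p in col_pairs]).T,
                np.array([p[1] for p in col_pairs]).T)
    ll, lh = split_cols(rows_low)
    hl, hh = split_cols(rows_high)
    return ll, lh, hl, hh

img = np.array([[10,14,10,12,14, 8,14,12],
                [10,12, 8,12,10, 6,10,12],
                [12,10, 8, 6, 8,10,12,14],
                [ 8, 6, 4, 6, 4, 6, 8,10],
                [14,12,10, 8, 6, 4, 6, 8],
                [12, 8,12,10, 6, 6, 6, 6],
                [12,10, 6, 6, 6, 6, 6, 6],
                [ 6, 6, 6, 6, 6, 6, 6, 6]], dtype=float)
ll, lh, hl, hh = analyze_2d(img)
print(ll)   # compare with the low-low subimage of Table 14.11
```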
14 . 12 . 1 D e c o m p o s i n g a n I m a g e
Earlier a set of filters was provided to be used in one-dimensional subband coding. We can
use those same filters to decompose an image into its subbands.
E x a m p l e 14 . 12 . 2:
Let’s use the eight-tap Johnston filter to decompose the Sinan image into four subbands.
The results of the decomposition are shown in Figure 14.29. Notice that, as in the case of
the image in Example 14.12.1, most of the signal energy is concentrated in the low-low
subimage. However, there remains a substantial amount of energy in the higher bands. To
see this more clearly, let’s look at the decomposition using the 16-tap Johnston filter. The
results are shown in Figure 14.30. Notice how much less energy there is in the higher
F I G U R E 14. 29 Decomposition of Sinan image using the eight-tap Johnston filter.
F I G U R E 14. 30 Decomposition of Sinan image using the 16-tap Johnston filter.
subbands. In fact, the high-high subband seems completely empty. As we shall see later,
this difference in energy compaction can have a drastic effect on the reconstruction.
F I G U R E 14. 31 Decomposition of Sinan image using the eight-tap
Smith-Barnwell filter.
Increasing the size of the filter is not necessarily the only way of improving the energy
compaction. Figure 14.31 shows the decomposition obtained using the eight-tap Smith-
Barnwell filter. The results are almost identical to the 16-tap Johnston filter. Therefore,
rather than increase the computational load by going to a 16-tap filter, we can keep the same
computational load and simply use a different filter. 
14 . 12 . 2 C o d i n g t h e S u b b a n d s
Once we have decomposed an image into subbands, we need to find the best encoding
scheme to use with each subband. The coding schemes we have studied to date are scalar
quantization, vector quantization, and differential encoding. Let us encode some of the
decomposed images from the previous section using two of the coding schemes we have
studied earlier, scalar quantization and differential encoding.
E x a m p l e 14 . 12 . 3:
In the previous example we noted the fact that the eight-tap Johnston filter did not compact
the energy as well as the 16-tap Johnston filter or the eight-tap Smith-Barnwell filter. Let’s
see how this affects the encoding of the decomposed images.
When we encode these images at an average rate of 0.5 bits per pixel, there are $4 \times 0.5 = 2$ bits available to encode four values, one value from each of the four subbands. If we use
the recursive bit allocation procedure on the eight-tap Johnston filter outputs, we end up
allocating 1 bit to the low-low band and 1 bit to the high-low band. As the pixel-to-pixel
difference in the low-low band is quite small, we use a DPCM encoder for the low-low band.
The high-low band does not show this behavior, which means we can simply use scalar
quantization for the high-low band. As there are no bits available to encode the other two
bands, these bands can be discarded. This results in the image shown in Figure 14.32, which
is far from pleasing. However, if we use the same compression approach with the image
decomposed using the eight-tap Smith-Barnwell filter, the result is Figure 14.33, which is
much more pleasing.
F I G U R E 14. 32 Sinan image coded at 0.5 bits per pixel using the eight-tap
Johnston filter.
To understand why we get such different results from using the two filters, we need to
look at the way the bits were allocated to the different bands. In this implementation, we
used the recursive bit allocation algorithm. In the image decomposed using the Johnston
filter, there was significant energy in the high-low band. The algorithm allocated 1 bit to
the low-low band and 1 bit to the high-low band. This resulted in poor encoding for both,
and subsequently poor reconstruction. There was very little signal content in any of the
bands other than the low-low band for the image decomposed using the Smith-Barnwell
filter. Therefore, the bit allocation algorithm assigned both bits to the low-low band, which
provided a reasonable reconstruction.
F I G U R E 14. 33 Sinan image coded at 0.5 bits per pixel using the eight-tap
Smith-Barnwell filter.
If the problem with the encoding of the image decomposed by the Johnston filter is an
insufficient number of bits for encoding the low-low band, why not simply assign both bits
to the low-low band? The problem is that the bit allocation scheme assigned a bit to the
high-low band because there was a significant amount of information in that band. If both
bits were assigned to the low-low band, we would have no bits left for use in encoding
the high-low band, and we would end up throwing away information necessary for the
reconstruction. 
The issue of energy compaction becomes a very important factor in reconstruction
quality. Filters that allow for more energy compaction permit the allocation of bits to a
smaller number of subbands. This in turn results in a better reconstruction.
The coding schemes used in this example were DPCM and scalar quantization, the
techniques generally preferred in subband coding. The advantage provided by subband coding
is readily apparent if we compare the result shown in Figure 14.33 to results in the previous
chapters where we used either DPCM or scalar quantization without prior decomposition.
It would appear that the subband approach lends itself naturally to vector quantization.
After decomposing an image into subbands, we could design separate codebooks for each
subband to reflect the characteristics of that particular subband. The only problem with this
idea is that the low-low subband generally requires a large number of bits per pixel. As
we mentioned in Chapter 10, it is generally not feasible to operate the nonstructured vector
quantizers at high rates. Therefore, when vector quantizers are used, they are generally
470 14 S U B B A N D C O D I N G
used only for encoding the higher frequency bands. This may change as vector quantization
algorithms that operate at higher rates are developed.
14 . 13 Summary
In this chapter we introduced another approach to the decomposition of signals. In subband
coding we decompose the source output into components. Each of these components can
then be encoded using one of the techniques described in the previous chapters. The general
subband encoding procedure can be summarized as follows:
• Select a set of filters for decomposing the source. We have provided a number of filters in this chapter. Many more filters can be obtained from the published literature (we give some references below).
• Using the filters, obtain the subband signals $y_k(n)$:
$$y_k(n) = \sum_{i=0}^{N-1} h_k(i)\,x(n-i) \qquad (14.119)$$
where $h_k(n)$ are the coefficients of the $k$th filter.
• Decimate the output of the filters.
• Encode the decimated output.
The decoding procedure is the inverse of the encoding procedure. When encoding images
the filtering and decimation operations have to be performed twice, once along the rows and
once along the columns. Care should be taken to avoid problems at edges, as described in
Section 14.12.
F u r t h e r R e a d i n g
1. Handbook for Digital Signal Processing, edited by S.K. Mitra and J.F. Kaiser [162],
is an excellent source of information about digital filters.
2. Multirate Systems and Filter Banks, by P.P. Vaidyanathan [200], provides detailed
information on QMF filters, as well as the relationship between wavelets and filter
banks and much more.
3. The topic of subband coding is also covered in Digital Coding of Waveforms, by
N.S. Jayant and P. Noll [123].
4. The MPEG-1 audio coding algorithm is described in “ISO-MPEG-1 Audio: A
Generic Standard for Coding of High-Quality Digital Audio,” by K. Brandenburg and
G. Stoll [28], in the October 1994 issue of the Journal of the Audio Engineering Society.
5. A review of the rate distortion method of bit allocation is provided in “Rate Distortion
Methods for Image and Video Compression,” by A. Ortega and K. Ramchandran, in
the November 1998 issue of IEEE Signal Processing Magazine [169].
14 . 14 Projects and Problems
1. A linear shift invariant system has the following properties:
• If for a given input sequence $x(n)$ the output of the system is the sequence $y(n)$, then if we delay the input sequence by $k$ units to obtain the sequence $x(n-k)$, the corresponding output will be the sequence $y(n)$ delayed by $k$ units.
• If the output corresponding to the sequence $x_1(n)$ is $y_1(n)$, and the output corresponding to the sequence $x_2(n)$ is $y_2(n)$, then the output corresponding to the sequence $x_1(n) + x_2(n)$ is $y_1(n) + y_2(n)$.
Use these two properties to show the convolution property given in Equation (14.18).
2. Let’s design a set of simple four-tap filters that satisfies the perfect reconstruction
condition.
(a) We begin with the low-pass filter. Assume that the impulse response of the filter
is given by h1kk=3
k=0. Further assume that

h1k

 =

h1j

 ∀jk
Find a set of values for hij that satisfies Equation (14.91).
(b) Plot the magnitude of the transfer function H1 z .
(c) Using Equation (14.23), find the high-pass filter coefficients h2k.
(d) Find the magnitude of the transfer function H2 z .
3. Given an input sequence
$$x(n) = \begin{cases} (-1)^n & n = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
(a) Find the output sequence $y(n)$ if the filter impulse response is
$$h(n) = \begin{cases} \frac{1}{\sqrt{2}} & n = 0, 1 \\ 0 & \text{otherwise.} \end{cases}$$
(b) Find the output sequence $w(n)$ if the impulse response of the filter is
$$h(n) = \begin{cases} \frac{1}{\sqrt{2}} & n = 0 \\ -\frac{1}{\sqrt{2}} & n = 1 \\ 0 & \text{otherwise.} \end{cases}$$
(c) Looking at the sequences $y(n)$ and $w(n)$, what can you say about the sequence $x(n)$?
4. Given an input sequence
$$x(n) = \begin{cases} 1 & n = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
(a) Find the output sequence $y(n)$ if the filter impulse response is
$$h(n) = \begin{cases} \frac{1}{\sqrt{2}} & n = 0, 1 \\ 0 & \text{otherwise.} \end{cases}$$
(b) Find the output sequence $w(n)$ if the impulse response of the filter is
$$h(n) = \begin{cases} \frac{1}{\sqrt{2}} & n = 0 \\ -\frac{1}{\sqrt{2}} & n = 1 \\ 0 & \text{otherwise.} \end{cases}$$
(c) Looking at the sequences $y(n)$ and $w(n)$, what can you say about the sequence $x(n)$?
5. Write a program to perform the analysis and downsampling operations and another to
perform the upsampling and synthesis operations for an image compression application.
The programs should read the filter parameters from a file. The synthesis program
should read the output of the analysis program and write out the reconstructed images.
The analysis program should also write out the subimages scaled so that they can be
displayed. Test your program using the Johnston eight-tap filter and the Sena image.
6. In this problem we look at some of the many ways we can encode the subimages
obtained after subsampling. Use the eight-tap Johnston filter to decompose the Sena
image into four subimages.
(a) Encode the low-low band using an adaptive delta modulator (CFDM or CVSD).
Encode all other bands using a 1-bit scalar quantizer.
(b) Encode the low-low band using a 2-bit adaptive DPCM system. Encode the
low-high and high-low bands using a 1-bit scalar quantizer.
(c) Encode the low-low band using a 3-bit adaptive DPCM system. Encode the
low-high and high-low band using a 0.5 bit/pixel vector quantizer.
(d) Compare the reconstructions obtained using the different schemes.
15
Wavelet-Based Compression
15 . 1 Overview
In this chapter we introduce the concept of wavelets and describe how to use
wavelet-based decompositions in compression schemes. We begin with an
introduction to wavelets and multiresolution analysis and then describe how
we can implement a wavelet decomposition using filters. We then examine
the implementations of several wavelet-based compression schemes.
15 . 2 Introduction
In the previous two chapters we looked at a number of ways to decompose a signal. In this
chapter we look at another approach to decompose a signal that has become increasingly
popular in recent years: the use of wavelets. Wavelets are being used in a number of different
applications. Depending on the application, different aspects of wavelets can be emphasized.
As our particular application is compression, we will emphasize those aspects of wavelets
that are important in the design of compression algorithms. You should be aware that there
is much more to wavelets than is presented in this chapter. At the end of the chapter we
suggest options if you want to delve more deeply into this subject.
The practical implementation of wavelet compression schemes is very similar to that
of subband coding schemes. As in the case of subband coding, we decompose the signal
(analysis) using filter banks. The outputs of the filter banks are downsampled, quantized,
and encoded. The decoder decodes the coded representations, upsamples, and recomposes
the signal using a synthesis filter bank.
In the next several sections we will briefly examine the construction of wavelets and
describe how we can obtain a decomposition of a signal using multiresolution analysis. We
will then describe some of the currently popular schemes for image compression. If you are
474 15 W A V E L E T - B A S E D C O M P R E S S I O N
primarily interested at this time in implementation of wavelet-based compression schemes,
you should skip the next few sections and go directly to Section 15.5.
In the last two chapters we have described several ways of decomposing signals. Why do
we need another one? To answer this question, let’s begin with our standard tool for analysis,
the Fourier transform. Given a function $f(t)$, we can find the Fourier transform $F(\omega)$ as
$$F(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{-j\omega t}\,dt.$$
Integration is an averaging operation; therefore, the analysis we obtain, using the Fourier
transform, is in some sense an “average” analysis, where the averaging interval is all of
time. Thus, by looking at a particular Fourier transform, we can say, for example, that there
is a large component of frequency 10 kHz in a signal, but we cannot tell when in time this
component occurred. Another way of saying this is that Fourier analysis provides excellent
localization in frequency and none in time. The converse is true for the time function $f(t)$, which provides exact information about the value of the function at each instant of time but does not directly provide spectral information. It should be noted that both $f(t)$ and $F(\omega)$ represent the same function, and all the information is present in each representation.
However, each representation makes different kinds of information easily accessible.
If we have a very nonstationary signal, like the one shown in Figure 15.1, we would
like to know not only the frequency components but when in time the particular frequency
components occurred. One way to obtain this information is via the short-term Fourier
transform (STFT). With the STFT, we break the time signal $f(t)$ into pieces of length $T$ and
apply Fourier analysis to each piece. This way we can say, for example, that a component at
10 kHz occurred in the third piece—that is, between time 2T and time 3T. Thus, we obtain
an analysis that is a function of both time and frequency. If we simply chopped the function
into pieces, we could get distortion in the form of boundary effects (see Problem 1). In order
to reduce the boundary effects, we window each piece before we take the Fourier transform.
If the window shape is given by gt, the STFT is formally given by
F =
 
−
ftg∗
t −ejt
dt (15.1)
If the window function gt is a Gaussian, the STFT is called the Gabor transform.
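A minimal sketch of a Gabor-style STFT is shown below. The test signal is a made-up stand-in for Figure 15.1 (a low-frequency segment followed by a high-frequency burst); the windowed transform shows the dominant frequency moving with time, which is exactly the information the ordinary Fourier transform hides.

```python
import numpy as np

# A nonstationary test signal in the spirit of Figure 15.1: a low-frequency
# segment followed by a short high-frequency burst (values are made up).
fs = 1000.0
t = np.arange(0, 2.0, 1 / fs)
sig = np.where(t < 1.5, np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 80 * t))

def stft(x, win_len=200, hop=100):
    """Windowed Fourier transform in the spirit of Equation (15.1),
    using a Gaussian window."""
    n = np.arange(win_len)
    g = np.exp(-0.5 * ((n - win_len / 2) / (win_len / 6)) ** 2)
    frames = [x[i:i + win_len] * g for i in range(0, len(x) - win_len, hop)]
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])

S = stft(sig)
# Each row is one time slice; the peak frequency bin moves from a low value
# to a high value partway through the signal.
print(S.argmax(axis=1)[:5], S.argmax(axis=1)[-5:])
```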
F I G U R E 15. 1 A nonstationary signal (the time axis marks t0 and 2t0).
The problem with the STFT is the fixed window size. Consider Figure 15.1. In order to
obtain the low-pass component at the beginning of the function, the window size should be
at least t0 so that the window will contain at least one cycle of the low-frequency component.
However, a window size of t0 or greater means that we will not be able to accurately localize
the high-frequency spurt. A large window in the time domain corresponds to a narrow filter
in the frequency domain, which is what we want for the low-frequency components—and
what we do not want for the high-frequency components. This dilemma is formalized in
the uncertainty principle, which states that for a given window gt, the product of the time
spread 2
t and the frequency spread 2
 is lower bounded by
√
1/2, where
2
t =

t2
gt2
dt

gt2
dt
(15.2)
2
 =

2
G2
d

G2
d
 (15.3)
Thus, if we wish to have finer resolution in time, that is, reduce 2
t , we end up with an
increase in 2
, or a lower resolution in the frequency domain. How do we get around this
problem?
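The bound is easy to check numerically for a Gaussian window, which in fact meets it with equality. The sketch below evaluates Equations (15.2) and (15.3) by discrete sums (the width parameter is arbitrary).

```python
import numpy as np

# Gaussian window on a grid wide enough to capture its tails.
s = 0.7                                   # arbitrary width parameter
t = np.linspace(-20, 20, 200001)
g = np.exp(-t**2 / (2 * s**2))

# Time spread, Equation (15.2), as a ratio of discrete sums.
sigma_t2 = np.sum(t**2 * g**2) / np.sum(g**2)

# Frequency spread, Equation (15.3), using the closed-form transform of a
# Gaussian (which is itself a Gaussian).
w = np.linspace(-60, 60, 200001)
G = np.exp(-w**2 * s**2 / 2)
sigma_w2 = np.sum(w**2 * G**2) / np.sum(G**2)

print(np.sqrt(sigma_t2 * sigma_w2))   # ~0.5: the Gaussian meets the bound
```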
Let’s take a look at the discrete STFT in terms of basis expansion, and for the moment,
let’s look at just one interval:
Fm0 =
 
−
ftg∗
te−jm0t
dt (15.4)
The basis functions are gt, gtejot
, gtej2ot
, and so on. The first three basis functions
are shown in Figure 15.2. We can see that we have a window with constant size, and
within this window, we have sinusoids with an increasing number of cycles. Let’s conjure
up a different set of functions in which the number of cycles is constant, but the size of
the window keeps changing, as shown in Figure 15.3. Notice that although the number of
F I G U R E 15. 2 The first three STFT basis functions for the first time interval.
F I G U R E 15. 3 Three wavelet basis functions.
cycles of the sinusoid in each window is the same, as the size of the window gets smaller,
these cycles occur in a smaller time interval; that is, the frequency of the sinusoid increases.
Furthermore, the lower frequency functions cover a longer time interval, while the higher
frequency functions cover a shorter time interval, thus avoiding the problem that we had
with the STFT. If we can write our function in terms of these functions and their translates,
we have a representation that gives us time and frequency localization and can provide high
frequency resolution at low frequencies (longer time window) and high time resolution at
high frequencies (shorter time window). This, crudely speaking, is the basic idea behind
wavelets.
In the following section we will formalize the concept of wavelets. Then we will discuss
how to get from a wavelet basis set to an implementation. If you wish to move directly to
implementation issues, you should skip to Section 15.5.
15 . 3 Wavelets
In the example at the end of the previous section, we started out with a single function. All
other functions were obtained by changing the size of the function or scaling and translating
this single function. This function is called the mother wavelet. Mathematically, we can
scale a function $f(t)$ by replacing $t$ with $t/a$, where the parameter $a$ governs the amount of scaling. For example, consider the function
$$f(t) = \begin{cases} \cos(\pi t) & -1 \le t \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
We have plotted this function in Figure 15.4. To scale this function by 0.5, we replace $t$ by $t/0.5$:
$$f\!\left(\frac{t}{0.5}\right) = \begin{cases} \cos\!\left(\frac{\pi t}{0.5}\right) & -1 \le \frac{t}{0.5} \le 1 \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} \cos(2\pi t) & -\frac{1}{2} \le t \le \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$
We have plotted the scaled function in Figure 15.5. If we define the norm of a function $f(t)$ by
$$\|f(t)\|^2 = \int_{-\infty}^{\infty} f^2(t)\,dt,$$
scaling obviously changes the norm of the function:
$$\left\|f\!\left(\frac{t}{a}\right)\right\|^2 = \int_{-\infty}^{\infty} f^2\!\left(\frac{t}{a}\right)dt = a\int_{-\infty}^{\infty} f^2(x)\,dx$$
F I G U R E 15. 4 A function f(t).
F I G U R E 15. 5 The function f(t/0.5).
where we have used the substitution $x = t/a$. Thus,
$$\left\|f\!\left(\frac{t}{a}\right)\right\|^2 = a\,\|f(t)\|^2.$$
If we want the scaled function to have the same norm as the original function, we need to multiply it by $1/\sqrt{a}$.
Mathematically, we can represent the translation of a function to the right or left by an amount $b$ by replacing $t$ by $t-b$ or $t+b$. For example, if we want to translate the scaled function shown in Figure 15.5 by one, we have
$$f\!\left(\frac{t-1}{0.5}\right) = \begin{cases} \cos(2\pi(t-1)) & -\frac{1}{2} \le t-1 \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} \cos(2\pi(t-1)) & \frac{1}{2} \le t \le \frac{3}{2} \\ 0 & \text{otherwise.} \end{cases}$$
The scaled and translated function is shown in Figure 15.6. Thus, given a mother wavelet $\psi(t)$, the remaining functions are obtained as
$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right) \qquad (15.5)$$
with Fourier transforms
$$\Psi(\omega) = \mathcal{F}\left[\psi(t)\right], \qquad \Psi_{a,b}(\omega) = \mathcal{F}\left[\psi_{a,b}(t)\right]. \qquad (15.6)$$
F I G U R E 15. 6 A scaled and translated function.
Our expansion using coefficients with respect to these functions is obtained from the inner product of $f(t)$ with the wavelet functions:
$$w_{a,b} = \langle \psi_{a,b}(t), f(t)\rangle = \int_{-\infty}^{\infty} \psi_{a,b}(t)\,f(t)\,dt. \qquad (15.7)$$
We can recover the function $f(t)$ from the $w_{a,b}$ by
$$f(t) = \frac{1}{C_\psi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} w_{a,b}\,\psi_{a,b}(t)\,\frac{da\,db}{a^2} \qquad (15.8)$$
where
$$C_\psi = \int_{0}^{\infty} \frac{|\Psi(\omega)|^2}{\omega}\,d\omega. \qquad (15.9)$$
For the integral in (15.8) to exist, we need $C_\psi$ to be finite. For $C_\psi$ to be finite, we need $\Psi(0) = 0$. Otherwise, we have a singularity in the integrand of (15.9). Note that $\Psi(0)$ is the average value of $\psi(t)$; therefore, a requirement on the mother wavelet is that it have zero mean. The condition that $C_\psi$ be finite is often called the admissibility condition. We would also like the wavelets to have finite energy; that is, we want the wavelets to belong to the vector space $L_2$ (see Example 12.3.1). Using Parseval's relationship, we can write this requirement as
$$\int_{-\infty}^{\infty} |\Psi(\omega)|^2\,d\omega < \infty.$$
For this to happen, $|\Psi(\omega)|^2$ has to decay as $\omega$ goes to infinity. These requirements mean that the energy in $\Psi(\omega)$ is concentrated in a narrow frequency band, which gives the wavelet its frequency localization capability.
If $a$ and $b$ are continuous, then $w_{a,b}$ is called the continuous wavelet transform (CWT). Just as with other transforms, we will be more interested in the discrete version of this transform. We first obtain a series representation where the basis functions are continuous functions of time with discrete scaling and translating parameters $a$ and $b$. The discrete versions of the scaling and translating parameters have to be related to each other, because if the scale is such that the basis functions are narrow, the translation step should be correspondingly small, and vice versa. There are a number of ways we can choose these parameters. The most popular approach is to select $a$ and $b$ according to
$$a = a_0^{-m}, \qquad b = n b_0 a_0^{-m} \qquad (15.10)$$
where $m$ and $n$ are integers, $a_0$ is selected to be 2, and $b_0$ has a value of 1. This gives us the wavelet set
$$\psi_{m,n}(t) = a_0^{m/2}\,\psi(a_0^{m} t - n b_0), \qquad m, n \in \mathbb{Z}. \qquad (15.11)$$
For $a_0 = 2$ and $b_0 = 1$, we have
$$\psi_{m,n}(t) = 2^{m/2}\,\psi(2^{m} t - n). \qquad (15.12)$$
(Note that these are the most commonly used choices, but they are not the only choices.) If this set is complete, then $\{\psi_{m,n}(t)\}$ are called affine wavelets. The wavelet coefficients are given by
$$w_{m,n} = \langle f(t), \psi_{m,n}(t)\rangle \qquad (15.13)$$
$$= a_0^{m/2}\int f(t)\,\psi(a_0^{m} t - n b_0)\,dt. \qquad (15.14)$$
The function $f(t)$ can be reconstructed from the wavelet coefficients by
$$f(t) = \sum_{m}\sum_{n} w_{m,n}\,\psi_{m,n}(t). \qquad (15.15)$$
Wavelets come in many shapes. We will look at some of the more popular ones later in this chapter. One of the simplest wavelets is the Haar wavelet, which we will use to explore the various aspects of wavelets. The Haar wavelet is given by
$$\psi(t) = \begin{cases} 1 & 0 \le t < \frac{1}{2} \\ -1 & \frac{1}{2} \le t < 1. \end{cases} \qquad (15.16)$$
By translating and scaling this mother wavelet, we can synthesize a variety of functions.
This version of the transform, where $f(t)$ is a continuous function while the transform consists of discrete values, is a wavelet series analogous to the Fourier series. It is also called the discrete time wavelet transform (DTWT). We have moved from the continuous wavelet transform, where both the time function $f(t)$ and its transform $w_{a,b}$ were continuous functions of their arguments, to the wavelet series, where the time function is continuous but the time-scale wavelet representation is discrete. Given that in data compression we are generally dealing with sampled functions that are discrete in time, we would like both the time and frequency representations to be discrete. This is called the discrete wavelet transform (DWT). However, before we get to that, let's look into one additional concept—multiresolution analysis.
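Before moving on, a short sketch may help make the discrete wavelet family concrete. It builds the Haar wavelets of Equation (15.12) and checks numerically that a few members of the family are orthonormal (a minimal illustration only).

```python
import numpy as np

def haar(t):
    """Haar mother wavelet, Equation (15.16)."""
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

def psi(m, n, t):
    """Scaled and translated wavelet, Equation (15.12)."""
    return 2.0 ** (m / 2) * haar(2.0 ** m * t - n)

t = np.linspace(-4, 4, 800001)
dt = t[1] - t[0]

# Inner products of a few members of the family: approximately 1 on the
# diagonal and 0 off it, i.e., these Haar wavelets are orthonormal.
members = [(0, 0), (0, 1), (1, 0), (1, 1), (-1, 0)]
gram = np.array([[np.sum(psi(*a, t) * psi(*b, t)) * dt for b in members]
                 for a in members])
print(np.round(gram, 3))
```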
15 . 4 Multiresolution Analysis and the Scaling
Function
The idea behind multiresolution analysis is fairly simple. Let's define a function $\phi(t)$ that we call a scaling function. We will later see that the scaling function is closely related to the mother wavelet. By taking linear combinations of the scaling function and its translates we can generate a large number of functions
$$f(t) = \sum_{k} a_k\,\phi(t-k). \qquad (15.17)$$
The scaling function has the property that a function that can be represented by the scaling function can also be represented by the dilated versions of the scaling function.
For example, one of the simplest scaling functions is the Haar scaling function:
$$\phi(t) = \begin{cases} 1 & 0 \le t < 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (15.18)$$
Then $f(t)$ can be any piecewise continuous function that is constant in the interval $[k, k+1)$ for all $k$.
Let's define
$$\phi_k(t) = \phi(t-k). \qquad (15.19)$$
The set of all functions that can be obtained using a linear combination of the set $\{\phi_k(t)\}$,
$$f(t) = \sum_{k} a_k\,\phi_k(t), \qquad (15.20)$$
is called the span of the set $\{\phi_k(t)\}$, or $\mathrm{Span}\{\phi_k(t)\}$. If we now add all functions that are limits of sequences of functions in $\mathrm{Span}\{\phi_k(t)\}$, this is referred to as the closure of $\mathrm{Span}\{\phi_k(t)\}$ and denoted by $\overline{\mathrm{Span}}\{\phi_k(t)\}$. Let's call this set $V_0$.
If we want to generate functions at a higher resolution, say, functions that are required to be constant over only half a unit interval, we can use a dilated version of the "mother" scaling function. In fact, we can obtain scaling functions at different resolutions in a manner similar to the procedure used for wavelets:
$$\phi_{j,k}(t) = 2^{j/2}\,\phi(2^{j} t - k). \qquad (15.21)$$
The indexing scheme is the same as that used for wavelets, with the first index referring to the resolution while the second index denotes the translation. For the Haar example,
$$\phi_{1,0}(t) = \begin{cases} \sqrt{2} & 0 \le t < \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases} \qquad (15.22)$$
We can use translates of $\phi_{1,0}(t)$ to represent all functions that are constant over intervals $[k/2, (k+1)/2)$ for all $k$. Notice that in general any function that can be represented by the translates of $\phi(t)$ can also be represented by a linear combination of translates of $\phi_{1,0}(t)$. The converse, however, is not true. Defining
$$V_1 = \overline{\mathrm{Span}}\{\phi_{1,k}(t)\}, \qquad (15.23)$$
we can see that $V_0 \subset V_1$. Similarly, we can show that $V_1 \subset V_2$, and so on.
E x a m p l e 15 . 4 . 1:
Consider the function shown in Figure 15.7. We can approximate this function using translates of the Haar scaling function $\phi(t)$. The approximation is shown in Figure 15.8a. If we call this approximation $\phi_f^{(0)}(t)$, then
$$\phi_f^{(0)}(t) = \sum_{k} c_{0,k}\,\phi_k(t) \qquad (15.24)$$
F I G U R E 15. 7 A sample function.
F I G U R E 15. 8 Approximations of the function shown in Figure 15.7: (a), (b), and (c) show successively higher-resolution approximations.
where
$$c_{0,k} = \int_{k}^{k+1} f(t)\,\phi_k(t)\,dt. \qquad (15.25)$$
We can obtain a more refined approximation, or an approximation at a higher resolution, $\phi_f^{(1)}(t)$, shown in Figure 15.8b, if we use the set $\{\phi_{1,k}(t)\}$:
$$\phi_f^{(1)}(t) = \sum_{k} c_{1,k}\,\phi_{1,k}(t). \qquad (15.26)$$
Notice that we need twice as many coefficients at this resolution compared to the previous resolution. The coefficients at the two resolutions are related by
$$c_{0,k} = \frac{1}{\sqrt{2}}\left[c_{1,2k} + c_{1,2k+1}\right]. \qquad (15.27)$$
Continuing in this manner (Figure 15.8c), we can get higher and higher resolution approximations of $f(t)$ with
$$\phi_f^{(m)}(t) = \sum_{k} c_{m,k}\,\phi_{m,k}(t). \qquad (15.28)$$
Recall that, according to the Nyquist rule, if the highest frequency component of a signal is at $f_0$ Hz, we need $2f_0$ samples per second to accurately represent it. Therefore, we could obtain an accurate representation of $f(t)$ using the set of translates $\{\phi_{j,k}(t)\}$, where $2^{-j} < \frac{1}{2f_0}$. As
$$c_{j,k} = 2^{j/2}\int_{k2^{-j}}^{(k+1)2^{-j}} f(t)\,dt, \qquad (15.29)$$
by the mean value theorem of calculus, $c_{j,k}$ is equal to a sample value of $f(t)$ in the interval $[k2^{-j}, (k+1)2^{-j}]$. Therefore, the function $\phi_f^{(j)}(t)$ would represent more than $2f_0$ samples per second of $f(t)$.
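The coefficient relation (15.27) is easy to verify numerically. The sketch below computes $c_{j,k}$ via Equation (15.29) for an arbitrary test function (the function chosen here is only an example) and compares the two sides of Equation (15.27).

```python
import numpy as np

def c(j, k, f, npts=2000):
    """Scaling coefficient c_{j,k} = 2^{j/2} times the integral of f over
    [k 2^-j, (k+1) 2^-j), Equation (15.29), approximated by a Riemann sum."""
    a, b = k * 2.0 ** (-j), (k + 1) * 2.0 ** (-j)
    t = np.linspace(a, b, npts, endpoint=False)
    return 2.0 ** (j / 2) * np.sum(f(t)) * (b - a) / npts

f = lambda t: np.sin(2 * np.pi * t) + 2.0       # an arbitrary test function

# Equation (15.27): a coefficient at resolution 0 is the normalized sum of
# the two coefficients that refine it at resolution 1.
for k in range(3):
    lhs = c(0, k, f)
    rhs = (c(1, 2 * k, f) + c(1, 2 * k + 1, f)) / np.sqrt(2)
    print(k, lhs, rhs)    # the two columns agree up to the quadrature error
```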
We said earlier that a scaling function has the property that any function that can be
represented exactly by an expansion at some resolution j can also be represented by dilations
of the scaling function at resolution j +1. In particular, this means that the scaling function
itself can be represented by its dilations at a higher resolution:
φ(t) = Σ_k h_k φ_{1,k}(t)    (15.30)
Substituting φ_{1,k}(t) = √2 φ(2t − k), we obtain the multiresolution analysis (MRA) equation:

φ(t) = Σ_k h_k √2 φ(2t − k)    (15.31)
This equation will be of great importance to us when we begin looking at ways of imple-
menting the wavelet transform.
Example 15.4.2:
Consider the Haar scaling function. Picking

h_0 = h_1 = 1/√2

and

h_k = 0 for k > 1

satisfies the recursion equation.
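A quick numerical check of this example (a sketch, not from the text): sample the Haar scaling function on a grid and confirm that φ(t) = h_0 √2 φ(2t) + h_1 √2 φ(2t − 1).

import numpy as np

def phi(t):
    # Haar scaling function (15.18)
    return np.where((t >= 0) & (t < 1), 1.0, 0.0)

t = np.linspace(-1, 2, 1201)
h = [1 / np.sqrt(2), 1 / np.sqrt(2)]          # h_0, h_1 from this example
rhs = sum(hk * np.sqrt(2) * phi(2 * t - k) for k, hk in enumerate(h))
print(np.allclose(phi(t), rhs))               # True: the MRA equation holds pointwise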
Example 15.4.3:
Consider the triangle scaling function shown in Figure 15.9. For this function

h_0 = 1/(2√2),   h_1 = 1/√2,   h_2 = 1/(2√2)

satisfies the recursion equation.
FIGURE 15.9  Triangular scaling function.
While both the Haar scaling function and the triangle scaling function are valid scaling
functions, there is an important difference between the two. The Haar function is orthogonal
to its translates; that is,

∫ φ(t) φ(t − m) dt = δ_m
This is obviously not true of the triangle function. In this chapter we will be principally
concerned with scaling functions that are orthogonal because they give rise to orthonormal
transforms that, as we have previously seen, are very useful in compression.
How about the Haar wavelet? Can it be used as a scaling function? Some reflection
will show that we cannot obtain the Haar wavelet from a linear combination of its dilated
versions.
So, where do wavelets come into the picture? Let's continue with our example using the
Haar scaling function. Let us assume for the moment that there is a function g(t) that can
be exactly represented by φ_g^{(1)}(t); that is, g(t) is a function in the set V_1. We can decompose
φ_g^{(1)}(t) into the sum of a lower-resolution version of itself, namely, φ_g^{(0)}(t), and the difference
φ_g^{(1)}(t) − φ_g^{(0)}(t). Let's examine this difference over an arbitrary unit interval [k, k+1):

φ_g^{(1)}(t) − φ_g^{(0)}(t) = { c_{0,k} − √2 c_{1,2k},      k ≤ t < k + 1/2
                               c_{0,k} − √2 c_{1,2k+1},    k + 1/2 ≤ t < k + 1      (15.32)
Substituting for c_{0,k} from (15.27), we obtain

φ_g^{(1)}(t) − φ_g^{(0)}(t) = { −(1/√2) c_{1,2k} + (1/√2) c_{1,2k+1},    k ≤ t < k + 1/2
                                (1/√2) c_{1,2k} − (1/√2) c_{1,2k+1},     k + 1/2 ≤ t < k + 1      (15.33)
Defining

d_{0,k} = −(1/√2) c_{1,2k} + (1/√2) c_{1,2k+1}

over the arbitrary interval [k, k+1),

φ_g^{(1)}(t) − φ_g^{(0)}(t) = d_{0,k} ψ_{0,k}(t)    (15.34)
where

ψ_{0,k}(t) = { 1,    k ≤ t < k + 1/2
              −1,    k + 1/2 ≤ t < k + 1      (15.35)
But this is simply the kth translate of the Haar wavelet. Thus, for this particular case the
function can be represented as the sum of a scaling function and a wavelet at the same
resolution:
φ_g^{(1)}(t) = Σ_k c_{0,k} φ_{0,k}(t) + Σ_k d_{0,k} ψ_{0,k}(t)    (15.36)
In fact, we can show that this decomposition is not limited to this particular example.
A function in V1 can be decomposed into a function in V0—that is, a function that is a
linear combination of the scaling function at resolution 0, and a function that is a linear
combination of translates of a mother wavelet. Denoting the set of functions that can be
obtained by a linear combination of the translates of the mother wavelet as W0, we can write
this symbolically as
V1 = V0 ⊕W0 (15.37)
In other words, any function in V1 can be represented using functions in V0 and W0.
Obviously, once a scaling function is selected, the choice of the wavelet function cannot
be arbitrary. The wavelet that generates the set W0 and the scaling function that generates
the sets V0 and V1 are intrinsically related. In fact, from (15.37), W0 ⊂ V1, and therefore any
function in W_0 can be represented by a linear combination of φ_{1,k}(t). In particular, we can
write the mother wavelet ψ(t) as
ψ(t) = Σ_k w_k φ_{1,k}(t)    (15.38)
or

ψ(t) = Σ_k w_k √2 φ(2t − k)    (15.39)
This is the counterpart of the multiresolution analysis equation for the wavelet function and
will be of primary importance in the implementation of the decomposition.
All of this development has been for a function in V1. What if the function can only
be accurately represented at resolution j + 1? If we define W_j as the closure of the span of
ψ_{j,k}(t), we can show that
V_{j+1} = V_j ⊕ W_j    (15.40)

But, as j is arbitrary,

V_j = V_{j−1} ⊕ W_{j−1}    (15.41)

and

V_{j+1} = V_{j−1} ⊕ W_{j−1} ⊕ W_j    (15.42)

Continuing in this manner, we can see that for any k ≤ j

V_{j+1} = V_k ⊕ W_k ⊕ W_{k+1} ⊕ ··· ⊕ W_j    (15.43)
In other words, if we have a function that belongs to Vj+1 (i.e., that can be exactly represented
by the scaling function at resolution j + 1), we can decompose it into a sum of functions
starting with a lower-resolution approximation followed by a sequence of functions generated
by dilations of the wavelet that represent the leftover details. This is very much like what
we did in subband coding. A major difference is that, while the subband decomposition
is in terms of sines and cosines, the decomposition in this case can use a variety of
scaling functions and wavelets. Thus, we can adapt the decomposition to the signal being
decomposed by selecting the scaling function and wavelet.
15.5 Implementation Using Filters
One of the most popular approaches to implementing the decomposition discussed in the
previous section is using a hierarchical filter structure similar to the one used in subband
coding. In this section we will look at how to obtain the structure and the filter coefficients.
We start with the MRA equation
φ(t) = Σ_k h_k √2 φ(2t − k)    (15.44)
Substituting 2^j t − m for t, we obtain the equation for an arbitrary dilation and translation:

φ(2^j t − m) = Σ_k h_k √2 φ(2(2^j t − m) − k)    (15.45)
             = Σ_k h_k √2 φ(2^{j+1} t − 2m − k)    (15.46)
             = Σ_l h_{l−2m} √2 φ(2^{j+1} t − l)    (15.47)
where in the last equation we have used the substitution l = 2m + k. Suppose we have
a function f(t) that can be accurately represented at resolution j + 1 by some scaling
function φ(t). We assume that the scaling function and its dilations and translations form
an orthonormal set. The coefficients c_{j+1,k} can be obtained by

c_{j+1,k} = ∫ f(t) φ_{j+1,k}(t) dt    (15.48)
If we can represent f(t) accurately at resolution j + 1 with a linear combination of φ_{j+1,k}(t),
then from the previous section we can decompose it into two functions: one in terms of
φ_{j,k}(t) and one in terms of the jth dilation of the corresponding wavelet, ψ_{j,k}(t). The
coefficients c_{j,k} are given by

c_{j,k} = ∫ f(t) φ_{j,k}(t) dt    (15.49)
        = ∫ f(t) 2^{j/2} φ(2^j t − k) dt    (15.50)
Substituting for φ(2^j t − k) from (15.47), we get

c_{j,k} = ∫ f(t) 2^{j/2} Σ_l h_{l−2k} √2 φ(2^{j+1} t − l) dt    (15.51)
Interchanging the order of summation and integration, we get

c_{j,k} = Σ_l h_{l−2k} ∫ f(t) 2^{j/2} √2 φ(2^{j+1} t − l) dt    (15.52)
But the integral is simply c_{j+1,l}. Therefore,

c_{j,k} = Σ_l h_{l−2k} c_{j+1,l}    (15.53)
We have encountered this relationship before in the context of the Haar function. Equa-
tion (15.27) provides the relationship between coefficients of the Haar expansion at two
resolution levels. In a more general setting, the coefficients h_k provide a link between the
coefficients c_{j,k} at different resolutions. Thus, given the coefficients at resolution level
j + 1, we can obtain the coefficients at all other resolution levels. But how do we start the
process? Recall that f(t) can be accurately represented at resolution j + 1. Therefore, we
can replace c_{j+1,k} by the samples of f(t). Let's represent these samples by x_k. Then the
coefficients of the low-resolution expansion are given by

c_{j,k} = Σ_l h_{l−2k} x_l    (15.54)
In Chapter 12, we introduced the input-output relationship of a linear filter as
y_m = Σ_k h_k x_{m−k} = Σ_k h_{m−k} x_k    (15.55)

Replacing m by 2m, we get every other sample of the output

y_{2m} = Σ_k h_{2m−k} x_k    (15.56)
Comparing (15.56) with (15.54), we can see that the coefficients of the low-resolution
approximation are every other output of a linear filter whose impulse response is h_{−k}. Recall
that h_k are the coefficients that satisfy the MRA equation. Using the terminology of
subband coding, the coefficients c_{j,k} are the downsampled output of the linear filter with
impulse response h_{−k}.
The detail portion of the representation is obtained in a similar manner. Again we start
from the recursion relationship. This time we use the recursion relationship for the wavelet
function as our starting point:
ψ(t) = Σ_k w_k √2 φ(2t − k)    (15.57)
Again substituting 2^j t − m for t and using the same simplifications, we get

ψ(2^j t − m) = Σ_k w_{k−2m} √2 φ(2^{j+1} t − k)    (15.58)
Using the fact that the dilated and translated wavelets form an orthonormal basis, we can
obtain the detail coefficients d_{j,k} by

d_{j,k} = ∫ f(t) ψ_{j,k}(t) dt    (15.59)
        = ∫ f(t) 2^{j/2} ψ(2^j t − k) dt    (15.60)
        = ∫ f(t) 2^{j/2} Σ_l w_{l−2k} √2 φ(2^{j+1} t − l) dt    (15.61)
        = Σ_l w_{l−2k} ∫ f(t) 2^{(j+1)/2} φ(2^{j+1} t − l) dt    (15.62)
        = Σ_l w_{l−2k} c_{j+1,l}    (15.63)
Thus, the detail coefficients are the decimated outputs of a filter with impulse response
w_{−k}.
At this point we can use exactly the same arguments to further decompose the coefficients
cj.
In order to retrieve c_{j+1,k} from c_{j,k} and d_{j,k}, we upsample the lower-resolution
coefficients and filter, using filters with impulse responses h_k and w_k:

c_{j+1,k} = Σ_l c_{j,l} h_{k−2l} + Σ_l d_{j,l} w_{k−2l}
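As a concrete illustration of these relations (a sketch, not code from the book), the following Python fragment carries out one level of the decomposition in (15.53) and (15.63) and then the reconstruction above, using the Haar filters h = (1/√2, 1/√2) and w = (1/√2, −1/√2); the input samples x_k play the role of c_{j+1,k}.

import numpy as np

h = np.array([1, 1]) / np.sqrt(2)        # scaling (low-pass) coefficients h_k
w = np.array([1, -1]) / np.sqrt(2)       # wavelet (high-pass) coefficients w_k

def analyze(x):
    # c_j(k) = sum_l h_{l-2k} x_l ,  d_j(k) = sum_l w_{l-2k} x_l
    c = np.array([sum(h[l - 2 * k] * x[l] for l in range(2 * k, 2 * k + len(h)))
                  for k in range(len(x) // 2)])
    d = np.array([sum(w[l - 2 * k] * x[l] for l in range(2 * k, 2 * k + len(w)))
                  for k in range(len(x) // 2)])
    return c, d

def synthesize(c, d):
    # c_{j+1}(k) = sum_l c_j(l) h_{k-2l} + sum_l d_j(l) w_{k-2l}
    x = np.zeros(2 * len(c))
    for l in range(len(c)):
        for m in range(len(h)):
            x[2 * l + m] += c[l] * h[m] + d[l] * w[m]
    return x

x = np.array([26., 6., 13., 10., -7., 7., 6., 4.])   # arbitrary test samples
c, d = analyze(x)
print(np.allclose(synthesize(c, d), x))               # True: perfect reconstruction

The check prints True, confirming that the upsample-and-filter synthesis step exactly inverts the filter-and-downsample analysis step for this orthogonal filter pair.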
15.5.1 Scaling and Wavelet Coefficients
In order to implement the wavelet decomposition, the coefficients h_k and w_k are of
primary importance. In this section we look at some of the properties of these coefficients
that will help us in finding different decompositions.
We start with the MRA equation. Integrating both sides of the equation over all t, we
obtain

∫_{−∞}^{∞} φ(t) dt = ∫_{−∞}^{∞} Σ_k h_k √2 φ(2t − k) dt    (15.64)
Interchanging the summation and integration on the right-hand side of the equation, we get

∫_{−∞}^{∞} φ(t) dt = Σ_k h_k √2 ∫_{−∞}^{∞} φ(2t − k) dt    (15.65)
Substituting x = 2t − k with dx = 2dt in the right-hand side of the equation, we get

∫_{−∞}^{∞} φ(t) dt = Σ_k h_k √2 ∫_{−∞}^{∞} φ(x) (1/2) dx    (15.66)
                   = Σ_k h_k (1/√2) ∫_{−∞}^{∞} φ(x) dx    (15.67)
Assuming that the average value of the scaling function is not zero, we can divide both sides
by the integral and we get
Σ_k h_k = √2    (15.68)
If we normalize the scaling function to have a magnitude of one, we can use the orthogonality
condition on the scaling function to get another condition on h_k:

∫ [φ(t)]^2 dt = ∫ [Σ_k h_k √2 φ(2t − k)] [Σ_m h_m √2 φ(2t − m)] dt    (15.69)
             = Σ_k Σ_m h_k h_m 2 ∫ φ(2t − k) φ(2t − m) dt    (15.70)
             = Σ_k Σ_m h_k h_m ∫ φ(x − k) φ(x − m) dx    (15.71)
where in the last equation we have used the substitution x = 2t. The integral on the right-hand
side is zero except when k = m. When k = m, the integral is unity and we obtain
Σ_k h_k^2 = 1    (15.72)
We can actually get a more general property by using the orthogonality of the translates
of the scaling function

∫ φ(t) φ(t − m) dt = δ_m    (15.73)
Rewriting this using the MRA equation to substitute for φ(t) and φ(t − m), we obtain

∫ [Σ_k h_k √2 φ(2t − k)] [Σ_l h_l √2 φ(2t − 2m − l)] dt
    = Σ_k Σ_l h_k h_l 2 ∫ φ(2t − k) φ(2t − 2m − l) dt    (15.74)
Substituting x = 2t, we get

∫ φ(t) φ(t − m) dt = Σ_k Σ_l h_k h_l ∫ φ(x − k) φ(x − 2m − l) dx    (15.75)
                   = Σ_k Σ_l h_k h_l δ_{k−2m−l}    (15.76)
                   = Σ_k h_k h_{k−2m}    (15.77)
Therefore, we have

Σ_k h_k h_{k−2m} = δ_m    (15.78)
Notice that this is the same relationship we had to satisfy for perfect reconstruction in the
previous chapter.
Using these relationships, we can generate scaling coefficients for filters of various
lengths.
Example 15.5.1:
For k = 2, we have from (15.68) and (15.72)

h_0 + h_1 = √2    (15.79)
h_0^2 + h_1^2 = 1    (15.80)

These equations are uniquely satisfied by

h_0 = h_1 = 1/√2,

which is the Haar scaling function.
An orthogonal expansion does not exist for all lengths. In the following example, we
consider the case of k = 3.
Example 15.5.2:
For k = 3, from the three conditions (15.68), (15.72), and (15.78), we have

h_0 + h_1 + h_2 = √2    (15.81)
h_0^2 + h_1^2 + h_2^2 = 1    (15.82)
h_0 h_2 = 0    (15.83)

The last condition can only be satisfied if h_0 = 0 or h_2 = 0. In either case we will be left
with the two-coefficient filter for the Haar scaling function.
In fact, we can see that for k odd, we will always end up with a condition that will
force one of the coefficients to zero, thus leaving an even number of coefficients. When the
number of coefficients gets larger than the number of conditions, we end up with an infinite
number of solutions.
Example 15.5.3:
Consider the case when k = 4. The three conditions give us the following three equations:

h_0 + h_1 + h_2 + h_3 = √2    (15.84)
h_0^2 + h_1^2 + h_2^2 + h_3^2 = 1    (15.85)
h_0 h_2 + h_1 h_3 = 0    (15.86)
We have three equations and four unknowns; that is, we have one degree of freedom. We
can use this degree of freedom to impose further conditions on the solution. The solutions
to these equations include the Daubechies four-tap solution:
h_0 = (1 + √3)/(4√2),   h_1 = (3 + √3)/(4√2),   h_2 = (3 − √3)/(4√2),   h_3 = (1 − √3)/(4√2)
Given the close relationship between the scaling function and the wavelet, it seems
reasonable that we should be able to obtain the coefficients for the wavelet filter from the
coefficients of the scaling filter. In fact, if the wavelet function is orthogonal to the scaling
function at the same scale

∫ φ(t − k) ψ(t − m) dt = 0    (15.87)

then

w_k = ±(−1)^k h_{N−k}    (15.88)

and

Σ_k h_k w_{k−2n} = 0    (15.89)

Furthermore,

Σ_k w_k = 0    (15.90)
The proof of these relationships is somewhat involved [207].
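Both sets of conditions are easy to check numerically. The following sketch (plain Python, not from the book; the sign in (15.88) is taken as + with N = 3) builds the four-tap Daubechies scaling filter from Example 15.5.3, derives the corresponding wavelet filter, and verifies (15.84)-(15.86) and (15.88)-(15.90):

import math

s = 4 * math.sqrt(2)
h = [(1 + math.sqrt(3)) / s, (3 + math.sqrt(3)) / s,
     (3 - math.sqrt(3)) / s, (1 - math.sqrt(3)) / s]       # Daubechies 4-tap h_k

N = len(h) - 1
w = [(-1) ** k * h[N - k] for k in range(len(h))]          # (15.88) with the + sign

ok = lambda x: abs(x) < 1e-12
print(ok(sum(h) - math.sqrt(2)))                           # (15.84): sum h_k = sqrt(2)
print(ok(sum(hk * hk for hk in h) - 1.0))                  # (15.85): sum h_k^2 = 1
print(ok(h[0] * h[2] + h[1] * h[3]))                       # (15.86)
print(ok(sum(w)))                                          # (15.90): sum w_k = 0
for n in (-1, 0, 1):                                       # (15.89) at even shifts 2n
    print(ok(sum(h[k] * w[k - 2 * n]
                 for k in range(len(h)) if 0 <= k - 2 * n < len(w))))

Every check prints True, which is the numerical counterpart of the perfect reconstruction conditions of the previous chapter.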
15.5.2 Families of Wavelets
Let’s move to the more practical aspects of compression using wavelets. We have said
that there is an infinite number of possible wavelets. Which one is best depends on the
application. In this section we list different wavelets and their corresponding filters. You are
encouraged to experiment with these to find those best suited to your application.
The 4-tap, 12-tap, and 20-tap Daubechies filters are shown in Tables 15.1–15.3. The
6-tap, 12-tap, and 18-tap Coiflet filters are shown in Tables 15.4–15.6.
TABLE 15.1  Coefficients for the 4-tap Daubechies low-pass filter.

h_0     0.4829629131445341
h_1     0.8365163037378079
h_2     0.2241438680420134
h_3    −0.1294095225512604
TABLE 15.2  Coefficients for the 12-tap Daubechies low-pass filter.

h_0     0.111540743350
h_1     0.494623890398
h_2     0.751133908021
h_3     0.315250351709
h_4    −0.226264693965
h_5    −0.129766867567
h_6     0.097501605587
h_7     0.027522865530
h_8    −0.031582039318
h_9     0.000553842201
h_10    0.004777257511
h_11   −0.001077301085
TABLE 15.3  Coefficients for the 20-tap Daubechies low-pass filter.

h_0     0.026670057901
h_1     0.188176800078
h_2     0.527201188932
h_3     0.688459039454
h_4     0.281172343661
h_5    −0.249846424327
h_6    −0.195946274377
h_7     0.127369340336
h_8     0.093057364604
h_9    −0.071394147166
h_10   −0.029457536822
h_11    0.033212674059
h_12    0.003606553567
h_13   −0.010733175483
h_14    0.001395351747
h_15    0.001992405295
h_16   −0.000685856695
h_17   −0.000116466855
h_18    0.000093588670
h_19   −0.000013264203
TABLE 15.4  Coefficients for the 6-tap Coiflet low-pass filter.

h_0    −0.051429728471
h_1     0.238929728471
h_2     0.602859456942
h_3     0.272140543058
h_4    −0.051429972847
h_5    −0.011070271529
TABLE 15.5  Coefficients for the 12-tap Coiflet low-pass filter.

h_0     0.011587596739
h_1    −0.029320137980
h_2    −0.047639590310
h_3     0.273021046535
h_4     0.574682393857
h_5     0.294867193696
h_6    −0.054085607092
h_7    −0.042026480461
h_8     0.016744410163
h_9     0.003967883613
h_10   −0.001289203356
h_11   −0.000509505539
TABLE 15.6  Coefficients for the 18-tap Coiflet low-pass filter.

h_0    −0.002682418671
h_1     0.005503126709
h_2     0.016583560479
h_3    −0.046507764479
h_4    −0.043220763560
h_5     0.286503335274
h_6     0.561285256870
h_7     0.302983571773
h_8    −0.050770140755
h_9    −0.058196250762
h_10    0.024434094321
h_11    0.011229240962
h_12   −0.006369601011
h_13   −0.001820458916
h_14    0.000790205101
h_15    0.000329665174
h_16   −0.000050192775
h_17   −0.000024465734
15.6 Image Compression
One of the most popular applications of wavelets has been to image compression. The JPEG
2000 standard, which is designed to update and replace the current JPEG standard, will use
wavelets instead of the DCT to perform decomposition of the image. During our discussion
we have always referred to the signal to be decomposed as a one-dimensional signal;
however, images are two-dimensional signals. There are two approaches to the subband
decomposition of two-dimensional signals: using two-dimensional filters, or using separable
transforms that can be implemented using one-dimensional filters on the rows first and then
on the columns (or vice versa). Most approaches, including the JPEG 2000 verification
model, use the second approach.
In Figure 15.10 we show how an image can be decomposed using subband decomposition.
We begin with an N × M image. We filter each row and then downsample to obtain two
N × M/2 images. We then filter each column and subsample the filter output to obtain four
N/2 × M/2 images. Of the four subimages, the one obtained by low-pass filtering the rows and
columns is referred to as the LL image; the one obtained by low-pass filtering the rows
and high-pass filtering the columns is referred to as the LH image; the one obtained by
high-pass filtering the rows and low-pass filtering the columns is called the HL image; and
the subimage obtained by high-pass filtering the rows and columns is referred to as the HH
image. This decomposition is sometimes represented as shown in Figure 15.11. Each of the
subimages obtained in this fashion can then be filtered and subsampled to obtain four more
subimages. This process can be continued until the desired subband structure is obtained.
Three popular structures are shown in Figure 15.12. In the structure in Figure 15.12a, the
LL subimage has been decomposed after each decomposition into four more subimages,
resulting in a total of 10 subimages. This is one of the more popular decompositions.
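A sketch of this row-column procedure in Python (assumptions: NumPy is available, the image has even dimensions, and the Haar analysis filters are used for brevity rather than the Daubechies filters of Example 15.6.1):

import numpy as np

h = np.array([1, 1]) / np.sqrt(2)      # low-pass analysis filter
g = np.array([1, -1]) / np.sqrt(2)     # high-pass analysis filter

def analyze_1d(x, f):
    # filter and keep every other output (downsample by 2)
    return np.array([np.dot(f, x[2 * k:2 * k + len(f)]) for k in range(len(x) // 2)])

def decompose(image):
    # rows first: N x M  ->  two N x M/2 images
    lo = np.apply_along_axis(analyze_1d, 1, image, h)
    hi = np.apply_along_axis(analyze_1d, 1, image, g)
    # then columns: four N/2 x M/2 subimages LL, LH, HL, HH
    LL = np.apply_along_axis(analyze_1d, 0, lo, h)
    LH = np.apply_along_axis(analyze_1d, 0, lo, g)
    HL = np.apply_along_axis(analyze_1d, 0, hi, h)
    HH = np.apply_along_axis(analyze_1d, 0, hi, g)
    return LL, LH, HL, HH

img = np.random.rand(8, 8)
for band in decompose(img):
    print(band.shape)                  # each subimage is (4, 4)

Applying decompose again to the LL band produces the 10-band structure of Figure 15.12a.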
FIGURE 15.10  Subband decomposition of an N × M image.
FIGURE 15.11  First-level decomposition.
FIGURE 15.12  Three popular subband structures.
Example 15.6.1:
Let’s use the Daubechies wavelet filter to repeat what we did in Examples 14.12.2 and 14.12.3
using the Johnston and the Smith-Barnwell filters. If we use the 4-tap Daubechies filter,
we obtain the decomposition shown in Figure 15.13. Notice that even though we are only
using a 4-tap filter, we get results comparable to the 16-tap Johnston filter and the 8-tap
Smith-Barnwell filters.
If we now encode this image at the rate of 0.5 bits per pixel, we get the reconstructed
image shown in Figure 15.14. Notice that the quality is comparable to that obtained using
filters requiring two or four times as much computation.
In this example we used a simple scalar quantizer for quantization of the coefficients.
However, if we use strategies that are motivated by the properties of the coefficients them-
selves, we can obtain significant performance improvements. In the next sections we examine
two popular quantization strategies developed specifically for wavelets.
FIGURE 15.13  Decomposition of Sinan image using the four-tap Daubechies filter.

FIGURE 15.14  Reconstruction of Sinan image encoded using 0.5 bits per pixel and the four-tap Daubechies filter.
15.7 Embedded Zerotree Coder
The embedded zerotree wavelet (EZW) coder was introduced by Shapiro [208]. It is a
quantization and coding strategy that incorporates some characteristics of the wavelet decom-
position. Just as the quantization and coding approach used in the JPEG standard, which
were motivated by the characteristics of the coefficients, were superior to the generic zonal
coding algorithms, the EZW approach and its descendants significantly outperform some
of the generic approaches. The particular characteristic used by the EZW algorithm is that
there are wavelet coefficients in different subbands that represent the same spatial loca-
tion in the image. If the decomposition is such that the size of the different subbands is
different (the first two decompositions in Figure 15.12), then a single coefficient in the
smaller subband may represent the same spatial location as multiple coefficients in the other
subbands.
In order to put our discussion on more solid ground, consider the 10-band decomposition
shown in Figure 15.15. The coefficient a in the upper-left corner of band I represents the
same spatial location as coefficients a_1 in band II, a_2 in band III, and a_3 in band IV. In turn,
the coefficient a_1 represents the same spatial location as coefficients a_{11}, a_{12}, a_{13}, and a_{14}
in band V. Each of these pixels represents the same spatial location as four pixels in band
VIII, and so on. In fact, we can visualize the relationships of these coefficients in the form
of a tree: The coefficient a forms the root of the tree with three descendants a_1, a_2, and a_3.
The coefficient a_1 has descendants a_{11}, a_{12}, a_{13}, and a_{14}. The coefficient a_2 has descendants
a_{21}, a_{22}, a_{23}, and a_{24}, and the coefficient a_3 has descendants a_{31}, a_{32}, a_{33}, and a_{34}. Each of
these coefficients in turn has four descendants, making a total of 64 coefficients in this tree.
A pictorial representation of the tree is shown in Figure 15.16.
Recall that when natural images are decomposed in this manner most of the energy is
compacted into the lower bands. Thus, in many cases the coefficients closer to the root of the
tree have higher magnitudes than coefficients further away from the root. This means that
often if a coefficient has a magnitude less than a given threshold, all its descendants will have
magnitudes less than that threshold. In a scalar quantizer, the outer levels of the quantizer
correspond to larger magnitudes. Consider the 3-bit quantizer shown in Figure 15.17. If we
determine that all coefficients arising from a particular root have magnitudes smaller than
T0 and we inform the decoder of this situation, then for all coefficients in that tree we need
only use 2 bits per sample, while getting the same performance as we would have obtained
using the 3-bit quantizer. If the binary coding scheme used in Figure 15.17 is used, in which
the first bit is the sign bit and the next bit is the most significant bit of the magnitude, then
the information that a set of coefficients has value less than T0 is the same as saying that
the most significant bit of the magnitude is 0. If there are N coefficients in the tree, this
is a savings of N bits minus however many bits are needed to inform the decoder of this
situation.
Before we describe the EZW algorithm, we need to introduce some terminology. Given
a threshold T, if a given coefficient has a magnitude greater than T, it is called a significant
coefficient at level T. If the magnitude of the coefficient is less than T (it is insignificant),
and all its descendants have magnitudes less than T, then the coefficient is called a zerotree
root. Finally, it might happen that the coefficient itself is less than T but some of its
descendants have a value greater than T. Such a coefficient is called an isolated zero.
FIGURE 15.15  A 10-band wavelet decomposition.
The EZW algorithm is a multiple-pass algorithm, with each pass consisting of two steps:
significance map encoding or the dominant pass, and refinement or the subordinate pass. If
cmax is the value of the largest coefficient, the initial value of the threshold T0 is given by
T_0 = 2^{⌊log_2 c_max⌋}    (15.91)

This selection guarantees that the largest coefficient will lie in the interval [T_0, 2T_0). In each
pass, the threshold T_i is reduced to half the value it had in the previous pass:

T_i = (1/2) T_{i−1}    (15.92)
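In code, the threshold schedule of (15.91)-(15.92) is simply the following (a trivial sketch, not from the text):

import math

def ezw_thresholds(cmax, passes):
    T = 2 ** math.floor(math.log2(cmax))     # (15.91)
    for _ in range(passes):
        yield T
        T /= 2                               # (15.92)

print(list(ezw_thresholds(26, 3)))           # [16, 8.0, 4.0]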
FIGURE 15.16  Data structure used in the EZW coder.
FIGURE 15.17  A 3-bit midrise quantizer.
For a given value of Ti, we assign one of four possible labels to the coefficients: significant
positive (sp), significant negative (sn), zerotree root (zr), and isolated zero (iz). If we used
a fixed-length code, we would need 2 bits to represent each of the labels. Note that when
a coefficient has been labeled a zerotree root, we do not need to label its descendants. This
assignment is referred to as significance map coding.
We can view the significance map coding in part as quantization using a three-level
midtread quantizer. This situation is shown in Figure 15.18. The coefficients labeled signifi-
cant are simply those that fall in the outer levels of the quantizer and are assigned an initial
reconstructed value of 1.5T_i or −1.5T_i, depending on whether the coefficient is positive or
negative. Note that selecting T_i according to (15.91) and (15.92) guarantees that the significant
coefficients will lie in the interval [T_i, 2T_i). Once a determination of significance has been
made, the significant coefficients are included in a list for further refinement in the refine-
ment or subordinate passes. In the refinement pass, we determine whether the coefficient
lies in the upper or lower half of the interval [T, 2T). In successive refinement passes, as
the value of T is reduced, the interval containing the significant coefficient is narrowed still
further and the reconstruction is updated accordingly. An easy way to perform the refinement
is to take the difference between the coefficient value and its reconstruction and quantize it
using a two-level quantizer with reconstruction values ±T/4. This quantized value is then
added on to the current reconstruction value as a correction term.
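A sketch of this refinement update for a single coefficient (not from the text; the mapping of the correction sign to a transmitted bit is an assumption):

def refine(value, reconstruction, T):
    # quantize the residual with a two-level quantizer, reconstruction values +/- T/4
    correction = T / 4 if value >= reconstruction else -T / 4
    bit = 1 if correction > 0 else 0     # assumed bit convention for the correction sign
    return reconstruction + correction, bit

print(refine(26, 24, 16))    # (28.0, 1), matching the first subordinate pass of Example 15.7.1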
The wavelet coefficients that have not been previously determined significant are scanned
in the manner depicted in Figure 15.19, with each parent node in a tree scanned before its
offspring. This makes sense because if the parent is determined to be a zerotree root, we
would not need to encode the offspring.
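The labeling rule for a single coefficient in a dominant pass can be sketched as follows (hypothetical helper, not from the text; coefficients already found significant in earlier passes are assumed to have been removed before calling it, and a set is treated as significant when its magnitude reaches the threshold):

def ezw_label(value, descendants, T):
    """Classify one coefficient for the dominant pass at threshold T.

    value        -- the coefficient itself
    descendants  -- values of all of its descendants in the tree
    Returns 'sp', 'sn', 'zr', or 'iz'.
    """
    if abs(value) >= T:
        return 'sp' if value > 0 else 'sn'
    if all(abs(d) < T for d in descendants):
        return 'zr'      # zerotree root: nothing below needs to be coded
    return 'iz'          # isolated zero: some descendant is significant

# First dominant pass of Example 15.7.1 (T = 16): the root 26 and the subtree rooted at 6.
print(ezw_label(26, [], 16))                      # sp
print(ezw_label(6, [13, 10, 6, 4], 16))           # zr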
FIGURE 15.18  A three-level midtread quantizer.
FIGURE 15.19  Scanning of wavelet coefficients for encoding using the EZW algorithm.
Although this may sound confusing, in order to see how simple the encoding procedure
actually is, let’s use an example.
Example 15.7.1:
Let’s use the seven-level decomposition shown below to demonstrate the various steps of
EZW:
26 6 13 10
−7 7 6 4
4 −4 4 −3
2 −2 −2 0
To obtain the initial threshold value T0, we find the maximum magnitude coefficient, which
in this case is 26. Then
T_0 = 2^{⌊log_2 26⌋} = 16
Comparing the coefficients against 16, we find 26 is greater than 16 so we send sp. The next
coefficient in the scan is 6, which is less than 16. Furthermore, its descendants (13, 10, 6,
and 4) are all less than 16. Therefore, 6 is a zerotree root, and we encode this entire set with
the label zr. The next coefficient in the scan is −7, which is also a zerotree root, as is 7, the
final element in the scan. We do not need to encode the rest of the coefficients separately
because they have already been encoded as part of the various zerotrees. The sequence of
labels to be transmitted at this point is
sp zr zr zr
Since each label requires 2 bits (for fixed-length encoding), we have used up 8 bits from
our bit budget. The only significant coefficient in this pass is the coefficient with a value of
26. We include this coefficient in our list to be refined in the subordinate pass. Calling the
subordinate list LS, we have
LS = {26}
The reconstructed value of this coefficient is 1.5T_0 = 24, and the reconstructed bands look
like this:
24 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
The next step is the subordinate pass, in which we obtain a correction term for the
reconstruction value of the significant coefficients. In this case, the list LS contains only one
element. The difference between this element and its reconstructed value is 26 − 24 = 2.
Quantizing this with a two-level quantizer with reconstruction levels ±T0/4, we obtain
a correction term of 4. Thus, the reconstruction becomes 24 + 4 = 28. Transmitting the
correction term costs a single bit, therefore at the end of the first pass we have used up 9
bits. Using only these 9 bits, we would obtain the following reconstruction:
28 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
We now reduce the value of the threshold by a factor of two and repeat the process. The
value of T1 is 8. We rescan the coefficients that have not yet been deemed significant. To
emphasize the fact that we do not consider the coefficients that have been deemed significant
in the previous pass, we replace them with ∗:

∗ 6 13 10
−7 7 6 4
4 −4 4 −3
2 −2 −2 0
The first coefficient we encounter has a value of 6. This is less than the threshold value
of 8; however, the descendants of this coefficient include coefficients with values of 13 and
10. Therefore, this coefficient cannot be classified as a zerotree root. This is an example of
what we defined as an isolated zero. The next two coefficients in the scan are −7 and 7.
Both of these coefficients have magnitudes less than the threshold value of 8. Furthermore,
all their descendants also have magnitudes less than 8. Therefore, these two coefficients are
coded as zr. The next two elements in the scan are 13 and 10, which are both coded as sp.
The final two elements in the scan are 6 and 4. These are both less than the threshold, but
they do not have any descendants. We code these coefficients as iz. Thus, this dominant
pass is coded as
iz zr zr sp sp iz iz
which requires 14 bits, bringing the total number of bits used to 23. The significant coef-
ficients are reconstructed with values 1.5T_1 = 12. Thus, the reconstruction at this point
is
28 0 12 12
0 0 0 0
0 0 0 0
0 0 0 0
We add the new significant coefficients to the subordinate list:
LS = {26, 13, 10}
In the subordinate pass, we take the difference between the coefficients and their
reconstructions and quantize these to obtain the correction or refinement values for these
coefficients. The possible values for the correction terms are ±T1/4 = ±2:
26−28 = −2 ⇒ Correction term = −2
13−12 = 1 ⇒ Correction term = 2 (15.93)
10−12 = −2 ⇒ Correction term = −2
Each correction requires a single bit, bringing the total number of bits used to 26. With these
corrections, the reconstruction at this stage is
26 0 14 10
0 0 0 0
0 0 0 0
0 0 0 0
If we go through one more pass, we reduce the threshold value to 4. The coefficients to
be scanned are

∗ 6 ∗ ∗
−7 7 6 4
4 −4 4 −3
2 −2 −2 0
The dominant pass results in the following coded sequence:
sp sn sp sp sp sp sn iz iz sp iz iz iz
This pass cost 26 bits, equal to the total number of bits used previous to this pass. The
reconstruction upon decoding of the dominant pass is
26 6 14 10
−6 6 6 6
6 −6 6 0
0 0 0 0
The subordinate list is
LS = {26, 13, 10, 6, −7, 7, 6, 4, 4, −4, 4}
By now it should be reasonably clear how the algorithm works. We continue encoding
until we have exhausted our bit budget or until some other criterion is satisfied. 
There are several observations we can make from this example. Notice that the encoding
process is geared to provide the most bang for the bit at each step. At each step the bits
are used to provide the maximum reduction in the reconstruction error. If at any time the
encoding is interrupted, the reconstruction using this (interrupted) encoding is the best that
the algorithm could have provided using this many bits. The encoding improves as more
bits are transmitted. This form of coding is called embedded coding. In order to enhance
this aspect of the algorithm, we can also sort the subordinate list at the end of each pass
using information available to both encoder and decoder. This would increase the likelihood
of larger coefficients being encoded first, thus providing for a greater reduction in the
reconstruction error.
Finally, in the example we determined the number of bits used by assuming fixed-length
encoding. In practice, arithmetic coding is used, providing a further reduction in rate.
15.8 Set Partitioning in Hierarchical Trees
The SPIHT (Set Partitioning in Hierarchical Trees) algorithm is a generalization of the EZW
algorithm and was proposed by Amir Said and William Pearlman [209]. Recall that in EZW
we transmit a lot of information for little cost when we declare an entire subtree to be
insignificant and represent all the coefficients in it with a zerotree root label zr. The SPIHT
algorithm uses a partitioning of the trees (which in SPIHT are called spatial orientation trees)
in a manner that tends to keep insignificant coefficients together in larger subsets. The
partitioning decisions are binary decisions that are transmitted to the decoder, providing a
significance map encoding that is more efficient than EZW. In fact, the efficiency of the
significance map encoding in SPIHT is such that arithmetic coding of the binary decisions
provides very little gain. The thresholds used for checking significance are powers of two,
so in essence the SPIHT algorithm sends the binary representation of the integer value of
the wavelet coefficients. As in EZW, the significance map encoding, or set partitioning and
ordering step, is followed by a refinement step in which the representations of the significant
coefficients are refined.
Let’s briefly describe the algorithm and then look at some examples of its operation.
However, before we do that we need to get familiar with some notation. The data structure
used by the SPIHT algorithm is similar to that used by the EZW algorithm—although not
the same. The wavelet coefficients are again divided into trees originating from the lowest
resolution band (band I in our case). The coefficients are grouped into 2 × 2 arrays that,
except for the coefficients in band I, are offsprings of a coefficient of a lower resolution band.
The coefficients in the lowest resolution band are also divided into 2 ×2 arrays. However,
unlike the EZW case, all but one of them are root nodes. The coefficient in the top-left
corner of the array does not have any offsprings. The data structure is shown pictorially in
Figure 15.20 for a seven-band decomposition.
The trees are further partitioned into four types of sets, which are sets of coordinates of
the coefficients:
O(i, j): This is the set of coordinates of the offsprings of the wavelet coefficient at
location (i, j). As each node can either have four offsprings or none, the size of O(i, j)
is either zero or four. For example, in Figure 15.20 the set O(0, 1) consists of the
coordinates of the coefficients b_1, b_2, b_3, and b_4.
FIGURE 15.20  Data structure used in the SPIHT algorithm.
D(i, j): This is the set of all descendants of the coefficient at location (i, j). Descendants
include the offsprings, the offsprings of the offsprings, and so on. For example,
in Figure 15.20 the set D(0, 1) consists of the coordinates of the coefficients
b_1, …, b_4, b_{11}, …, b_{14}, …, b_{44}. Because the number of offsprings can either be zero
or four, the size of D(i, j) is either zero or a sum of powers of four.
H: This is the set of all root nodes—essentially band I in the case of Figure 15.20.
L(i, j): This is the set of coordinates of all the descendants of the coefficient at location
(i, j) except for the immediate offsprings of the coefficient at location (i, j). In other
words,

L(i, j) = D(i, j) − O(i, j)
In Figure 15.20 the set L(0, 1) consists of the coordinates of the coefficients
b_{11}, …, b_{14}, …, b_{44}.
A set D(i, j) or L(i, j) is said to be significant if any coefficient in the set has a
magnitude greater than the threshold. Finally, thresholds used for checking significance are
powers of two, so in essence the SPIHT algorithm sends the binary representation of the
integer value of the wavelet coefficients. The bits are numbered with the least significant bit
being the zeroth bit, the next bit being the first significant bit, and the kth bit being referred
to as the k−1 most significant bit.
With these definitions under our belt, let us now describe the algorithm. The algorithm
makes use of three lists: the list of insignificant pixels (LIP), the list of significant pixels
(LSP), and the list of insignificant sets (LIS). The LIP and LSP lists will contain the
coordinates of individual coefficients, while the LIS list will contain the coordinates of the roots of sets
of type D or L. We start by determining the initial value of the threshold. We do this by
calculating

n = ⌊log_2 c_max⌋

where c_max is the maximum magnitude of the coefficients to be encoded. The LIP list is
initialized with the set H. Those elements of H that have descendants are also placed in
LIS as type D entries. The LSP list is initially empty.
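A sketch of this initialization and of the significance test used throughout the algorithm (hypothetical helper names, not from the text; coefficients are assumed to be stored in a dictionary keyed by coordinate, and descendants(i, j) is assumed to return the set D(i, j) for the chosen decomposition):

import math

def significant(coeffs, coords, n):
    # a coordinate set is significant at level n if any |c| >= 2^n
    return any(abs(coeffs[c]) >= 2 ** n for c in coords)

def spiht_init(coeffs, roots, descendants):
    """coeffs: dict (i,j) -> value; roots: the set H; descendants: (i,j) -> D(i,j)."""
    n = math.floor(math.log2(max(abs(v) for v in coeffs.values())))
    LIP = list(roots)                                   # individual coefficients
    LIS = [(c, 'D') for c in roots if descendants(c)]   # roots of type-D sets
    LSP = []                                            # nothing significant yet
    return n, LIP, LIS, LSP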
In each pass, we will first process the members of LIP, then the members of LIS. This
is essentially the significance map encoding step. We then process the elements of LSP in
the refinement step.
We begin by examining each coordinate contained in LIP. If the coefficient at that
coordinate is significant (that is, its magnitude is greater than or equal to 2^n), we transmit a 1 followed by a bit
representing the sign of the coefficient (we will assume 0 for positive, 1 for negative).
We then move that coefficient to the LSP list. If the coefficient at that coordinate is not
significant, we transmit a 0.
After examining each coordinate in LIP, we begin examining the sets in LIS. If the set at
coordinate (i, j) is not significant, we transmit a 0. If the set is significant, we transmit a 1.
What we do after that depends on whether the set is of type D or L.
If the set is of type D, we check each of the offsprings of the coefficient at that coordinate.
In other words, we check the four coefficients whose coordinates are in O(i, j). For each
coefficient that is significant, we transmit a 1, the sign of the coefficient, and then move the
coefficient to the LSP. For the rest we transmit a 0 and add their coordinates to the LIP.
Now that we have removed the coordinates of O(i, j) from the set, what is left is simply the
set L(i, j). If this set is not empty, we move it to the end of the LIS and mark it to be of
type L. Note that this new entry into the LIS has to be examined during this pass. If the set
is empty, we remove the coordinate (i, j) from the list.
If the set is of type L, we add each coordinate in O(i, j) to the end of the LIS as the
root of a set of type D. Again, note that these new entries in the LIS have to be examined
during this pass. We then remove L(i, j) from the LIS.
Once we have processed each of the sets in the LIS (including the newly formed ones),
we proceed to the refinement step. In the refinement step we examine each coefficient that
was in the LSP prior to the current pass and output the nth most significant bit of c_{i,j}.
We ignore the coefficients that have been added to the list in this pass because, by declaring
them significant at this particular level, we have already informed the decoder of the value
of the nth most significant bit.
This completes one pass. Depending on the availability of more bits or external factors,
if we decide to continue with the coding process, we decrement n by one and continue. Let’s
see the functioning of this algorithm on an example.
Example 15.8.1:
Let’s use the same example we used for demonstrating the EZW algorithm:
26 6 13 10
−7 7 6 4
4 −4 4 −3
2 −2 −2 0
We will go through three passes at the encoder and generate the transmitted bitstream, then
decode this bitstream.
First Pass The value for n in this case is 4. The three lists at the encoder are
LIP: {(0,0) → 26, (0,1) → 6, (1,0) → −7, (1,1) → 7}
LIS: {(0,1), (1,0), (1,1)}
LSP: {}
In the listing for LIP, we have included the → # to make it easier to follow the example.
Beginning our algorithm, we examine the contents of LIP. The coefficient at location (0, 0) is
greater than 16. In other words, it is significant; therefore, we transmit a 1, then a 0 to indicate
the coefficient is positive and move the coordinate to LSP. The next three coefficients are
all insignificant at this value of the threshold; therefore, we transmit a 0 for each coefficient
and leave them in LIP. The next step is to examine the contents of LIS. Looking at the
descendants of the coefficient at location (0, 1) (13, 10, 6, and 4), we see that none of them
are significant at this value of the threshold so we transmit a 0. Looking at the descendants
of c_{1,0} and c_{1,1}, we can see that none of these are significant at this value of the threshold.
Therefore, we transmit a 0 for each set. As this is the first pass, there are no elements from
the previous pass in LSP; therefore, we do not do anything in the refinement pass. We have
transmitted a total of 8 bits at the end of this pass (10000000), and the situation of the three
lists is as follows:
LIP: {(0,1) → 6, (1,0) → −7, (1,1) → 7}
LIS: {(0,1), (1,0), (1,1)}
LSP: {(0,0) → 26}
Second Pass For the second pass we decrement n by 1 to 3, which corresponds to a threshold
value of 8. Again, we begin our pass by examining the contents of LIP. There are three
elements in LIP. Each is insignificant at this threshold so we transmit three 0s. The next
step is to examine the contents of LIS. The first element of LIS is the set containing the
descendants of the coefficient at location (0, 1). Of this set, both 13 and 10 are significant at
this value of the threshold; in other words, the set D(0, 1) is significant. We signal this by
sending a 1 and examine the offsprings of c_{0,1}. The first offspring has a value of 13, which
is significant and positive, so we send a 1 followed by a 0. The same is true for the second
offspring, which has a value of 10. So we send another 1 followed by a 0. We move the
coordinates of these two to the LSP. The next two offsprings are both insignificant at this
level; therefore, we move these to LIP and transmit a 0 for each. As L(0, 1) = ∅, we remove
(0, 1) from LIS. Looking at the other elements of LIS, we can clearly see that both of
these are insignificant at this level; therefore, we send a 0 for each. In the refinement pass
we examine the contents of LSP from the previous pass. There is only one element in there
that is not from the current sorting pass, and it has a value of 26. The third MSB of 26 is 1;
therefore, we transmit a 1 and complete this pass. In the second pass we have transmitted 13
bits: 0001101000001. The condition of the lists at the end of the second pass is as follows:
LIP: {(0,1) → 6, (1,0) → −7, (1,1) → 7, (1,2) → 6, (1,3) → 4}
LIS: {(1,0), (1,1)}
LSP: {(0,0) → 26, (0,2) → 13, (0,3) → 10}
Third Pass The third pass proceeds with n = 2. As the threshold is now smaller, there
are significantly more coefficients that are deemed significant, and we end up sending 26
bits. You can easily verify for yourself that the transmitted bitstream for the third pass is
10111010101101100110000010. The condition of the lists at the end of the third pass is as
follows:
LIP: {(3,0) → 2, (3,1) → −2, (2,3) → −3, (3,2) → −2, (3,3) → 0}
LIS: {}
LSP: {(0,0) → 26, (0,2) → 13, (0,3) → 10, (0,1) → 6, (1,0) → −7, (1,1) → 7,
      (1,2) → 6, (1,3) → 4, (2,0) → 4, (2,1) → −4, (2,2) → 4}
Now for decoding this sequence. At the decoder we also start out with the same lists as
the encoder:
LIP: {(0,0), (0,1), (1,0), (1,1)}
LIS: {(0,1), (1,0), (1,1)}
LSP: {}
We assume that the initial value of n is transmitted to the decoder. This allows us to set the
threshold value at 16. Upon receiving the results of the first pass (10000000), we can see
that the first element of LIP is significant and positive and no other coefficient is significant
at this level. Using the same reconstruction procedure as in EZW, we can reconstruct the
coefficients at this stage as
24 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
and, following the same procedure as at the encoder, the lists can be updated as
LIP: {(0,1), (1,0), (1,1)}
LIS: {(0,1), (1,0), (1,1)}
LSP: {(0,0)}
For the second pass we decrement n by one and examine the transmitted bitstream:
0001101000001. Since the first 3 bits are 0 and there are only three entries in LIP, all
the entries in LIP are still insignificant. The next 9 bits give us information about the sets
in LIS. The fourth bit of the received bitstream is 1. This means that the set with root
at coordinate (0,1) is significant. Since this set is of type D, the next bits relate to its
offsprings. The 101000 sequence indicates that the first two offsprings are significant at
this level and positive and the last two are insignificant. Therefore, we move the first two
offsprings to LSP and the last two to LIP. We can also approximate these two significant
coefficients in our reconstruction by 1.5 × 2^3 = 12. We also remove (0, 1) from LIS. The
next two bits are both 0, indicating that the two remaining sets are still insignificant. The
final bit corresponds to the refinement pass. It is a 1, so we update the reconstruction of the
(0, 0) coefficient to 24+8/2 = 28. The reconstruction at this stage is
28 0 12 12
0 0 0 0
0 0 0 0
0 0 0 0
and the lists are as follows:
LIP: {(0,1), (1,0), (1,1), (1,2), (1,3)}
LIS: {(1,0), (1,1)}
LSP: {(0,0), (0,2), (0,3)}
For the third pass we again decrement n, which is now 2, giving a threshold value of 4.
Decoding the bitstream generated during the third pass (10111010101101100110000010),
we update our reconstruction to
26 6 14 10
−6 6 6 6
6 −6 6 0
0 0 0 0
and our lists become
LIP: {(3,0), (3,1), (2,3), (3,2), (3,3)}
LIS: {}
LSP: {(0,0), (0,2), (0,3), (0,1), (1,0), (1,1), (1,2), (1,3), (2,0), (2,1), (2,2)}
At this stage we do not have any sets left in LIS and we simply update the values of
the coefficients. 
Finally, let’s look at an example of an image coded using SPIHT. The image shown in
Figure 15.21 is the reconstruction obtained from a compressed representation that used 0.5
bits per pixel. (The programs used to generate this image were obtained from the authors.)

FIGURE 15.21  Reconstruction of Sinan image encoded using SPIHT at 0.5 bits per pixel.
Comparing this with Figure 15.14, we can see a definite improvement in the quality.
Wavelet decomposition has been finding its way into various standards. The earliest
example was the FBI fingerprint image compression standard. The latest is the new image
compression standard being developed by the JPEG committee, commonly referred to as JPEG 2000.
We take a brief look at the current status of JPEG 2000.
15.9 JPEG 2000
The current JPEG standard provides excellent performance at rates above 0.25 bits per pixel.
However, at lower rates there is a sharp degradation in the quality of the reconstructed
image. To correct this and other shortcomings, the JPEG committee initiated work on another
standard, commonly known as JPEG 2000. The JPEG 2000 standard is based on wavelet
decomposition.
There are actually two types of wavelet filters that are included in the standard. One
type is the wavelet filters we have been discussing in this chapter. Another type consists
of filters that generate integer coefficients; this type is particularly useful when the wavelet
decomposition is part of a lossless compression scheme.
The coding scheme is based on a scheme, originally proposed by Taubman [210] and
Taubman and Zakhor [211], known as EBCOT. The acronym EBCOT stands for “Embedded
Block Coding with Optimized Truncation,” which nicely summarizes the technique. It
is a block coding scheme that generates an embedded bitstream. The block coding is
independently performed on nonoverlapping blocks within individual subbands. Within a
subband all blocks that do not lie on the right or lower boundaries are required to have the
same dimensions. A dimension cannot exceed 256.
Embedding and independent block coding seem inherently contradictory. The way
EBCOT resolves this contradiction is to organize the bitstream in a succession of layers.
Each layer corresponds to a certain distortion level. Within each layer each block is coded
with a variable number of bits (which could be zero). The partitioning of bits between blocks
is obtained using a Lagrangian optimization that dictates the partitioning or truncation points.
The quality of the reproduction is proportional to the number of layers received.
The embedded coding scheme is similar in philosophy to the EZW and SPIHT algo-
rithms; however, the data structures used are different. The EZW and SPIHT algorithms
used trees of coefficients from the same spatial location across different bands. In the case
of the EBCOT algorithm, each block resides entirely within a subband, and each block
is coded independently of other blocks, which precludes the use of trees of the type used
by EZW and SPIHT. Instead, the EBCOT algorithm uses a quadtree data structure. At
the lowest level, we have a 2 × 2 set of blocks of coefficients. These are, in turn, orga-
nized into sets of 2 × 2 quads, and so on. A node in this tree is said to be significant
at level n if any of its descendants are significant at that level. A coefficient c_{i,j} is said
to be significant at level n if |c_{i,j}| ≥ 2^n. As in the case of EZW and SPIHT, the algo-
rithm makes multiple passes, including significance map encoding passes and a magnitude
refinement pass. The bits generated during these procedures are encoded using arithmetic
coding.
15.10 Summary
In this chapter we have introduced the concepts of wavelets and multiresolution analysis, and
we have seen how we can use wavelets to provide an efficient decomposition of signals prior
to compression. We have also described several compression techniques based on wavelet
decomposition. Wavelets and their applications are currently areas of intensive research.
Further Reading
1. There are a number of excellent introductory books on wavelets. The one I found
most accessible was Introduction to Wavelets and Wavelet Transforms—A Primer, by
C.S. Burrus, R.A. Gopinath, and H. Guo [207].
2. Probably the best mathematical source on wavelets is the book Ten Lectures on
Wavelets, by I. Daubechies [58].
3. There are a number of tutorials on wavelets available on the Internet. The best source
for all matters related to wavelets (and more) on the Internet is “The Wavelet Digest”
(http://www.wavelet.org). This site includes pointers to many other interesting and
useful sites dealing with different applications of wavelets.
4. The JPEG 2000 standard is covered in detail in JPEG 2000: Image Compression
Fundamentals, Standards and Practice, by D. Taubman and M. Marcellin [212].
15.11 Projects and Problems
1. In this problem we consider the boundary effects encountered when using the short-
term Fourier transform. Given the signal
f(t) = sin(2πt)
(a) Find the Fourier transform F of f(t).
(b) Find the STFT F_1 of f(t) using a rectangular window

g(t) = { 1,   −2 ≤ t ≤ 2
         0,   otherwise

for the interval [−2, 2].
(c) Find the STFT F_2 of f(t) using a window

g(t) = { 1 + cos(πt/2),   −2 ≤ t ≤ 2
         0,               otherwise

(d) Plot F, F_1, and F_2. Comment on the effect of using different window functions.
2. For the function
f(t) = { 1 + sin(2πt),   0 ≤ t ≤ 1
         sin(2πt),        otherwise

using the Haar wavelet, find and plot the coefficients c_{j,k}, j = 0, 1, 2; k = 0, …, 10.
3. For the seven-level decomposition shown below:
21 6 15 12
−6 3 6 3
3 −3 0 −3
3 0 0 0
(a) Find the bitstream generated by the EZW coder.
(b) Decode the bitstream generated in the previous step. Verify that you get the
original coefficient values.
4. Using the coefficients from the seven-level decomposition in the previous problem:
(a) Find the bitstream generated by the SPIHT coder.
(b) Decode the bitstream generated in the previous step. Verify that you get the
original coefficient values.
16
Audio Coding
16.1 Overview
Lossy compression schemes can be based on a source model, as in the case of
speech compression, or a user or sink model, as is somewhat the case in image
compression. In this chapter we look at audio compression approaches that are
explicitly based on the model of the user. We will look at audio compression
approaches in the context of audio compression standards. Principally, we will
examine the different MPEG standards for audio compression. These include MPEG Layer
I, Layer II, Layer III (or mp3) and the Advanced Audio Coding Standard. As with other
standards described in this book, the goal here is not to provide all the details required for
implementation. Rather the goal is to provide the reader with enough familiarity so that they
can then find it much easier to understand these standards.
16.2 Introduction
The various speech coding algorithms we studied in the previous chapter rely heavily on
the speech production model to identify structures in the speech signal that can be used
for compression. Audio compression systems have taken, in some sense, the opposite tack.
Unlike speech signals, audio signals can be generated using a large number of different
mechanisms. Lacking a unique model for audio production, the audio compression methods
have focused on the unique model for audio perception, a psychoacoustic model for hearing.
At the heart of the techniques described in this chapter is a psychoacoustic model of human
perception. By identifying what can and, more important what cannot be heard, the schemes
described in this chapter obtain much of their compression by discarding information that
cannot be perceived. The motivation for the development of many of these perceptual coders
was their potential application in broadcast multimedia. However, their major impact has
been in the distribution of audio over the Internet.
We live in an environment rich in auditory stimuli. Even an environment described
as quiet is filled with all kinds of natural and artificial sounds. The sounds are always
present and come to us from all directions. Living in this stimulus-rich environment, it is
essential that we have mechanisms for ignoring some of the stimuli and focusing on others.
Over the course of our evolutionary history we have developed limitations on what we
can hear. Some of these limitations are physiological, based on the machinery of hearing.
Others are psychological, based on how our brain processes auditory stimuli. The insight of
researchers in audio coding has been the understanding that these limitations can be useful
in selecting information that needs to be encoded and information that can be discarded.
The limitations of human perception are incorporated into the compression process through
the use of psychoacoustic models. We briefly describe the auditory model used by the most
popular audio compression approaches. Our description is necessarily superficial and we
refer readers interested in more detail to [97, 194].
The machinery of hearing is frequency dependent. The variation of what is perceived
as equally loud at different frequencies was first measured by Fletcher and Munson at
Bell Labs in the mid-1930s [96]. These measurements of perceptual equivalence were later
refined by Robinson and Dadson. This dependence is usually displayed as a set of equal
loudness curves, where the sound pressure level (SPL) is plotted as a function of frequency
for tones perceived to be equally loud. Clearly, what two people think of as equally loud will
be different. Therefore, these curves are actually averages and serve as a guide to human
auditory perception. The particular curve that is of special interest to us is the threshold-of-
hearing curve. This is the SPL curve that delineates the boundary of audible and inaudible
sounds at different frequencies. In Figure 16.1 we show a plot of this audibility threshold in
quiet. Sounds that lie below the threshold are not perceived by humans. Thus, we can see
that a low amplitude sound at a frequency of 3 kHz may be perceptible while the same level
of sound at 100 Hz would not be perceived.
FIGURE 16.1  A typical plot of the audibility threshold.
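A widely used closed-form approximation to this threshold-in-quiet curve (due to Terhardt and commonly used in MPEG psychoacoustic models; it is not given in this text, so treat the formula as an external reference) can be evaluated as follows:

import math

def threshold_in_quiet(f_hz):
    """Approximate absolute threshold of hearing in dB SPL at frequency f_hz."""
    f = f_hz / 1000.0
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

for f in (100, 1000, 3000, 10000):
    print(f, round(threshold_in_quiet(f), 1))

The printed values reproduce the qualitative shape of Figure 16.1: a high threshold at low frequencies, a minimum in the 3-4 kHz region, and a rise again toward 10 kHz and beyond.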
16.2.1 Spectral Masking
Lossy compression schemes require the use of quantization at some stage. Quantization can
be modeled as an additive noise process in which the output of the quantizer is the input
plus the quantization noise. To hide quantization noise, we can make use of the fact that
signals below a particular amplitude at a particular frequency are not audible. If we select the
quantizer step size such that the quantization noise lies below the audibility threshold, the
noise will not be perceived. Furthermore, the threshold of audibility is not absolutely fixed
and typically rises when multiple sounds impinge on the human ear. This phenomenon gives
rise to spectral masking. A tone at a certain frequency will raise the threshold in a critical
band around that frequency. These critical bands have a roughly constant Q, the ratio of
center frequency to bandwidth. Thus, at low frequencies a critical band can have a bandwidth
as low as 100 Hz, while at higher frequencies the bandwidth can be as large as 4 kHz. This
increase of the threshold has major implications for compression. Consider the situation in
Figure 16.2. Here a tone at 1 kHz has raised the threshold of audibility so that the adjacent
tone above it in frequency is no longer audible. At the same time, while the tone at 500 Hz
is audible, the increase in the threshold means that it can be quantized more crudely, since
the raised threshold allows us to introduce more quantization noise at that frequency. The
degree to which the threshold is increased depends on a variety of factors, including whether
the masking signal is tonal (sinusoidal) or atonal (noiselike).
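To see how these ideas might be turned into a quantization decision, consider the following C sketch. It is illustrative only: the critical bandwidth is approximated with Zwicker's formula, the raised threshold inside the masker's critical band is modeled crudely as the masker level minus a fixed offset (the 16 dB used here is an assumed value, not one prescribed by any standard), and we use the rule of thumb that each additional quantizer bit lowers the quantization noise by roughly 6 dB.

/* Sketch (not from the text): choosing quantizer resolution from a crude
 * spectral masking model.  The masking offset and the 6 dB-per-bit rule
 * are illustrative assumptions.
 */
#include <math.h>
#include <stdio.h>

/* Zwicker's approximation to the critical bandwidth (Hz) at frequency f (Hz). */
double critical_bandwidth(double f)
{
    double khz = f / 1000.0;
    return 25.0 + 75.0 * pow(1.0 + 1.4 * khz * khz, 0.69);
}

/* Bits needed so that the quantization noise stays below the masked
 * threshold, assuming each additional bit lowers the noise by about 6 dB. */
int bits_needed(double signal_spl, double masked_threshold_spl)
{
    double smr = signal_spl - masked_threshold_spl;  /* signal-to-mask ratio */
    if (smr <= 0.0)
        return 0;                /* the signal itself is masked: send nothing */
    return (int)ceil(smr / 6.02);
}

int main(void)
{
    double masker_spl = 60.0;                      /* 1 kHz masker at 60 dB SPL */
    double masked_threshold = masker_spl - 16.0;   /* assumed masking offset    */

    printf("critical bandwidth at 1 kHz : %.0f Hz\n", critical_bandwidth(1000.0));
    printf("bits for a 50 dB SPL tone near 1 kHz: %d\n",
           bits_needed(50.0, masked_threshold));
    return 0;
}

Note that the bandwidth formula gives about 100 Hz at low frequencies and about 4 kHz near 15 kHz, consistent with the range quoted above.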
16 . 2 . 2 T e m p o r a l M a s k i n g
Along with spectral masking, psychoacoustic coders also make use of the phenomenon of
temporal masking: a sound raises the audibility threshold for a brief interval both
preceding and following it. In
Figure 16.3 we show the threshold of audibility close to a masking sound. Sounds that occur
in an interval around the masking sound (both before and after the masking tone) can be
masked. If the masked sound occurs prior to the masking tone, this is called premasking
or backward masking; if the sound being masked occurs after the masking tone, the effect
is called postmasking or forward masking. Forward masking remains in effect for a much
longer time interval than backward masking.
F I G U R E 16. 2 Change in the audibility threshold (SPL in dB versus frequency from 20 Hz to 20,000 Hz; the raised threshold of audibility around the masking tone lies above the original threshold of audibility, enlarging the inaudible region).
F I G U R E 16. 3 Change in the audibility threshold in time (SPL in dB versus time in msec; premasking precedes the masking sound and postmasking follows it).
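A crude way to capture the shape of Figure 16.3 in code is to treat the threshold increase as a short ramp before the masker and a longer decay after it. The C sketch below is purely illustrative: the premasking and postmasking durations (20 ms and 150 ms here) and the linear shapes are assumed round numbers chosen to mimic the figure, not measured values.

/* Sketch (illustrative only): a crude model of temporal masking.  The
 * decay shapes and durations below are assumptions, not measured data.
 */
#include <stdio.h>

#define PRE_MS    20.0    /* assumed premasking duration (ms)  */
#define POST_MS  150.0    /* assumed postmasking duration (ms) */

/* Threshold increase (dB) at time t (ms) relative to a masker that is on
 * during [0, masker_ms] and raises the threshold by peak_db while it is on.
 * Outside the masking intervals the increase is zero. */
double threshold_boost(double t, double masker_ms, double peak_db)
{
    if (t >= 0.0 && t <= masker_ms)                 /* masker is on            */
        return peak_db;
    if (t < 0.0 && t >= -PRE_MS)                    /* premasking: short ramp  */
        return peak_db * (1.0 + t / PRE_MS);
    if (t > masker_ms && t <= masker_ms + POST_MS)  /* postmasking: long decay */
        return peak_db * (1.0 - (t - masker_ms) / POST_MS);
    return 0.0;
}

int main(void)
{
    double times[] = { -30.0, -10.0, 50.0, 120.0, 250.0 };
    int i;

    for (i = 0; i < 5; i++)
        printf("t = %6.1f ms : boost = %5.1f dB\n",
               times[i], threshold_boost(times[i], 100.0, 40.0));
    return 0;
}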
16 . 2 . 3 P s y c h o a c o u s t i c M o d e l
These attributes of the ear are used by all algorithms that use a psychoacoustic model to
determine how much quantization noise can be introduced at each frequency without the
noise becoming audible.
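As a rough indication of how such a model fits into an encoder, the following C sketch combines the threshold in quiet with a crudely modeled masking threshold in each of a handful of bands and reports the signal-to-mask ratio (SMR) that a bit allocator could use. This is not the psychoacoustic model of any particular standard: the band centers, the signal and masker levels, and the 16 dB masking offset are all assumed example numbers.

/* Sketch (illustrative, not the MPEG psychoacoustic model): a highly
 * simplified per-band masking calculation.  For each band we take the
 * larger of the threshold in quiet and the threshold raised by the
 * strongest masker, and report the signal-to-mask ratio (SMR).
 */
#include <math.h>
#include <stdio.h>

#define NBANDS 4

int main(void)
{
    /* assumed band centers (Hz), band signal levels and masker levels (dB SPL) */
    double center[NBANDS] = { 250.0, 1000.0, 3000.0, 8000.0 };
    double signal[NBANDS] = {  45.0,   62.0,   30.0,   40.0 };
    double masker[NBANDS] = {  50.0,   60.0,   55.0,   20.0 };
    double offset = 16.0;              /* assumed tonal-masker offset (dB) */
    int b;

    for (b = 0; b < NBANDS; b++) {
        double khz = center[b] / 1000.0;
        /* threshold in quiet, Terhardt approximation (see earlier sketch) */
        double quiet = 3.64 * pow(khz, -0.8)
                     - 6.5 * exp(-0.6 * (khz - 3.3) * (khz - 3.3))
                     + 1e-3 * pow(khz, 4.0);
        double masked = masker[b] - offset;            /* raised threshold     */
        double thresh = quiet > masked ? quiet : masked;
        double smr = signal[b] - thresh;               /* signal-to-mask ratio */

        printf("band %d (%5.0f Hz): threshold %6.2f dB, SMR %6.2f dB%s\n",
               b, center[b], thresh, smr,
               smr <= 0.0 ? "  -> inaudible, no bits needed" : "");
    }
    return 0;
}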
  • 2. T H I R D E D I T I O N Introduction to Data Compression
  • 3. The Morgan Kaufmann Series in Multimedia Information and Systems Series Editor, Edward A. Fox, Virginia Polytechnic University Introduction to Data Compression, Third Edition Khalid Sayood Understanding Digital Libraries, Second Edition Michael Lesk Bioinformatics: Managing Scientific Data Zoe Lacroix and Terence Critchlow How to Build a Digital Library Ian H. Witten and David Bainbridge Digital Watermarking Ingemar J. Cox, Matthew L. Miller, and Jeffrey A. Bloom Readings in Multimedia Computing and Networking Edited by Kevin Jeffay and HongJiang Zhang Introduction to Data Compression, Second Edition Khalid Sayood Multimedia Servers: Applications, Environments, and Design Dinkar Sitaram and Asit Dan Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition Ian H. Witten, Alistair Moffat, and Timothy C. Bell Digital Compression for Multimedia: Principles and Standards Jerry D. Gibson, Toby Berger, Tom Lookabaugh, Dave Lindbergh, and Richard L. Baker Readings in Information Retrieval Edited by Karen Sparck Jones and Peter Willett
  • 4. T H I R D E D I T I O N Introduction to Data Compression Khalid Sayood University of Nebraska AMSTERDAM • BOSTON • HEIDELBERG • LONDON NEW YORK • OXFORD • PARIS • SAN DIEGO SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO Morgan Kaufmann is an imprint of Elsevier
  • 5. Senior Acquisitions Editor Rick Adams Publishing Services Manager Simon Crump Assistant Editor Rachel Roumeliotis Cover Design Cate Barr Composition Integra Software Services Pvt. Ltd. Copyeditor Jessika Bella Mura Proofreader Jacqui Brownstein Indexer Northwind Editorial Sevices Interior printer Maple Vail Book Manufacturing Group Cover printer Phoenix Color Morgan Kaufmann Publishers is an imprint of Elsevier. 500 Sansome Street, Suite 400, San Francisco, CA 94111 This book is printed on acid-free paper. ©2006 by Elsevier Inc. All rights reserved. Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written permission of the publisher. Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.co.uk. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com) by selecting “Customer Support” and then “Obtaining Permissions.” Library of Congress Cataloging-in-Publication Data Sayood, Khalid. Introduction to data compression/Khalid Sayood.—3rd ed. p. cm. Includes bibliographical references and index. ISBN-13: 978-0-12-620862-7 ISBN-10: 0-12-620862-X 1. Data compression (Telecommunication) 2. Coding theory. I. Title TK5102.92.S39 2005 005.74 6—dc22 2005052759 ISBN 13: 978-0-12-620862-7 ISBN 10: 0-12-620862-X For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.books.elsevier.com Printed in the United States of America 05 06 07 08 09 5 4 3 2 1 Working together to grow libraries in developing countries www.elsevier.com | www.bookaid.org | www.sabre.org
  • 8. Contents Preface xvii 1 Introduction 1 1.1 Compression Techniques 3 1.1.1 Lossless Compression 4 1.1.2 Lossy Compression 5 1.1.3 Measures of Performance 5 1.2 Modeling and Coding 6 1.3 Summary 10 1.4 Projects and Problems 11 2 Mathematical Preliminaries for Lossless Compression 13 2.1 Overview 13 2.2 A Brief Introduction to Information Theory 13 2.2.1 Derivation of Average Information 18 2.3 Models 23 2.3.1 Physical Models 23 2.3.2 Probability Models 23 2.3.3 Markov Models 24 2.3.4 Composite Source Model 27 2.4 Coding 27 2.4.1 Uniquely Decodable Codes 28 2.4.2 Prefix Codes 31 2.4.3 The Kraft-McMillan Inequality 32 2.5 Algorithmic Information Theory 35 2.6 Minimum Description Length Principle 36 2.7 Summary 37 2.8 Projects and Problems 38 3 Huffman Coding 41 3.1 Overview 41 3.2 The Huffman Coding Algorithm 41 3.2.1 Minimum Variance Huffman Codes 46 3.2.2 Optimality of Huffman Codes 48 3.2.3 Length of Huffman Codes 49 3.2.4 Extended Huffman Codes 51
  • 9. viii C O N T E N T S 3.3 Nonbinary Huffman Codes 55 3.4 Adaptive Huffman Coding 58 3.4.1 Update Procedure 59 3.4.2 Encoding Procedure 62 3.4.3 Decoding Procedure 63 3.5 Golomb Codes 65 3.6 Rice Codes 67 3.6.1 CCSDS Recommendation for Lossless Compression 67 3.7 Tunstall Codes 69 3.8 Applications of Huffman Coding 72 3.8.1 Lossless Image Compression 72 3.8.2 Text Compression 74 3.8.3 Audio Compression 75 3.9 Summary 77 3.10 Projects and Problems 77 4 Arithmetic Coding 81 4.1 Overview 81 4.2 Introduction 81 4.3 Coding a Sequence 83 4.3.1 Generating a Tag 84 4.3.2 Deciphering the Tag 91 4.4 Generating a Binary Code 92 4.4.1 Uniqueness and Efficiency of the Arithmetic Code 93 4.4.2 Algorithm Implementation 96 4.4.3 Integer Implementation 102 4.5 Comparison of Huffman and Arithmetic Coding 109 4.6 Adaptive Arithmetic Coding 112 4.7 Applications 112 4.8 Summary 113 4.9 Projects and Problems 114 5 Dictionary Techniques 117 5.1 Overview 117 5.2 Introduction 117 5.3 Static Dictionary 118 5.3.1 Digram Coding 119 5.4 Adaptive Dictionary 121 5.4.1 The LZ77 Approach 121 5.4.2 The LZ78 Approach 125 5.5 Applications 133 5.5.1 File Compression—UNIX compress 133 5.5.2 Image Compression—The Graphics Interchange Format (GIF) 133 5.5.3 Image Compression—Portable Network Graphics (PNG) 134 5.5.4 Compression over Modems—V.42 bis 136
  • 10. C O N T E N T S ix 5.6 Summary 138 5.7 Projects and Problems 139 6 Context-Based Compression 141 6.1 Overview 141 6.2 Introduction 141 6.3 Prediction with Partial Match (ppm) 143 6.3.1 The Basic Algorithm 143 6.3.2 The Escape Symbol 149 6.3.3 Length of Context 150 6.3.4 The Exclusion Principle 151 6.4 The Burrows-Wheeler Transform 152 6.4.1 Move-to-Front Coding 156 6.5 Associative Coder of Buyanovsky (ACB) 157 6.6 Dynamic Markov Compression 158 6.7 Summary 160 6.8 Projects and Problems 161 7 Lossless Image Compression 163 7.1 Overview 163 7.2 Introduction 163 7.2.1 The Old JPEG Standard 164 7.3 CALIC 166 7.4 JPEG-LS 170 7.5 Multiresolution Approaches 172 7.5.1 Progressive Image Transmission 173 7.6 Facsimile Encoding 178 7.6.1 Run-Length Coding 179 7.6.2 CCITT Group 3 and 4—Recommendations T.4 and T.6 180 7.6.3 JBIG 183 7.6.4 JBIG2—T.88 189 7.7 MRC—T.44 190 7.8 Summary 193 7.9 Projects and Problems 193 8 Mathematical Preliminaries for Lossy Coding 195 8.1 Overview 195 8.2 Introduction 195 8.3 Distortion Criteria 197 8.3.1 The Human Visual System 199 8.3.2 Auditory Perception 200 8.4 Information Theory Revisited 201 8.4.1 Conditional Entropy 202 8.4.2 Average Mutual Information 204 8.4.3 Differential Entropy 205
  • 11. x C O N T E N T S 8.5 Rate Distortion Theory 208 8.6 Models 215 8.6.1 Probability Models 216 8.6.2 Linear System Models 218 8.6.3 Physical Models 223 8.7 Summary 224 8.8 Projects and Problems 224 9 Scalar Quantization 227 9.1 Overview 227 9.2 Introduction 227 9.3 The Quantization Problem 228 9.4 Uniform Quantizer 233 9.5 Adaptive Quantization 244 9.5.1 Forward Adaptive Quantization 244 9.5.2 Backward Adaptive Quantization 246 9.6 Nonuniform Quantization 253 9.6.1 pdf-Optimized Quantization 253 9.6.2 Companded Quantization 257 9.7 Entropy-Coded Quantization 264 9.7.1 Entropy Coding of Lloyd-Max Quantizer Outputs 265 9.7.2 Entropy-Constrained Quantization 265 9.7.3 High-Rate Optimum Quantization 266 9.8 Summary 269 9.9 Projects and Problems 270 10 Vector Quantization 273 10.1 Overview 273 10.2 Introduction 273 10.3 Advantages of Vector Quantization over Scalar Quantization 276 10.4 The Linde-Buzo-Gray Algorithm 282 10.4.1 Initializing the LBG Algorithm 287 10.4.2 The Empty Cell Problem 294 10.4.3 Use of LBG for Image Compression 294 10.5 Tree-Structured Vector Quantizers 299 10.5.1 Design of Tree-Structured Vector Quantizers 302 10.5.2 Pruned Tree-Structured Vector Quantizers 303 10.6 Structured Vector Quantizers 303 10.6.1 Pyramid Vector Quantization 305 10.6.2 Polar and Spherical Vector Quantizers 306 10.6.3 Lattice Vector Quantizers 307 10.7 Variations on the Theme 311 10.7.1 Gain-Shape Vector Quantization 311 10.7.2 Mean-Removed Vector Quantization 312
  • 12. C O N T E N T S xi 10.7.3 Classified Vector Quantization 313 10.7.4 Multistage Vector Quantization 313 10.7.5 Adaptive Vector Quantization 315 10.8 Trellis-Coded Quantization 316 10.9 Summary 321 10.10 Projects and Problems 322 11 Differential Encoding 325 11.1 Overview 325 11.2 Introduction 325 11.3 The Basic Algorithm 328 11.4 Prediction in DPCM 332 11.5 Adaptive DPCM 337 11.5.1 Adaptive Quantization in DPCM 338 11.5.2 Adaptive Prediction in DPCM 339 11.6 Delta Modulation 342 11.6.1 Constant Factor Adaptive Delta Modulation (CFDM) 343 11.6.2 Continuously Variable Slope Delta Modulation 345 11.7 Speech Coding 345 11.7.1 G.726 347 11.8 Image Coding 349 11.9 Summary 351 11.10 Projects and Problems 352 12 Mathematical Preliminaries for Transforms, Subbands, and Wavelets 355 12.1 Overview 355 12.2 Introduction 355 12.3 Vector Spaces 356 12.3.1 Dot or Inner Product 357 12.3.2 Vector Space 357 12.3.3 Subspace 359 12.3.4 Basis 360 12.3.5 Inner Product—Formal Definition 361 12.3.6 Orthogonal and Orthonormal Sets 361 12.4 Fourier Series 362 12.5 Fourier Transform 365 12.5.1 Parseval’s Theorem 366 12.5.2 Modulation Property 366 12.5.3 Convolution Theorem 367 12.6 Linear Systems 368 12.6.1 Time Invariance 368 12.6.2 Transfer Function 368 12.6.3 Impulse Response 369 12.6.4 Filter 371
  • 13. xii C O N T E N T S 12.7 Sampling 372 12.7.1 Ideal Sampling—Frequency Domain View 373 12.7.2 Ideal Sampling—Time Domain View 375 12.8 Discrete Fourier Transform 376 12.9 Z-Transform 378 12.9.1 Tabular Method 381 12.9.2 Partial Fraction Expansion 382 12.9.3 Long Division 386 12.9.4 Z-Transform Properties 387 12.9.5 Discrete Convolution 387 12.10 Summary 389 12.11 Projects and Problems 390 13 Transform Coding 391 13.1 Overview 391 13.2 Introduction 391 13.3 The Transform 396 13.4 Transforms of Interest 400 13.4.1 Karhunen-Loéve Transform 401 13.4.2 Discrete Cosine Transform 402 13.4.3 Discrete Sine Transform 404 13.4.4 Discrete Walsh-Hadamard Transform 404 13.5 Quantization and Coding of Transform Coefficients 407 13.6 Application to Image Compression—JPEG 410 13.6.1 The Transform 410 13.6.2 Quantization 411 13.6.3 Coding 413 13.7 Application to Audio Compression—the MDCT 416 13.8 Summary 419 13.9 Projects and Problems 421 14 Subband Coding 423 14.1 Overview 423 14.2 Introduction 423 14.3 Filters 428 14.3.1 Some Filters Used in Subband Coding 432 14.4 The Basic Subband Coding Algorithm 436 14.4.1 Analysis 436 14.4.2 Quantization and Coding 437 14.4.3 Synthesis 437 14.5 Design of Filter Banks 438 14.5.1 Downsampling 440 14.5.2 Upsampling 443 14.6 Perfect Reconstruction Using Two-Channel Filter Banks 444 14.6.1 Two-Channel PR Quadrature Mirror Filters 447 14.6.2 Power Symmetric FIR Filters 449
  • 14. C O N T E N T S xiii 14.7 M-Band QMF Filter Banks 451 14.8 The Polyphase Decomposition 454 14.9 Bit Allocation 459 14.10 Application to Speech Coding—G.722 461 14.11 Application to Audio Coding—MPEG Audio 462 14.12 Application to Image Compression 463 14.12.1 Decomposing an Image 465 14.12.2 Coding the Subbands 467 14.13 Summary 470 14.14 Projects and Problems 471 15 Wavelet-Based Compression 473 15.1 Overview 473 15.2 Introduction 473 15.3 Wavelets 476 15.4 Multiresolution Analysis and the Scaling Function 480 15.5 Implementation Using Filters 486 15.5.1 Scaling and Wavelet Coefficients 488 15.5.2 Families of Wavelets 491 15.6 Image Compression 494 15.7 Embedded Zerotree Coder 497 15.8 Set Partitioning in Hierarchical Trees 505 15.9 JPEG 2000 512 15.10 Summary 513 15.11 Projects and Problems 513 16 Audio Coding 515 16.1 Overview 515 16.2 Introduction 515 16.2.1 Spectral Masking 517 16.2.2 Temporal Masking 517 16.2.3 Psychoacoustic Model 518 16.3 MPEG Audio Coding 519 16.3.1 Layer I Coding 520 16.3.2 Layer II Coding 521 16.3.3 Layer III Coding—mp3 522 16.4 MPEG Advanced Audio Coding 527 16.4.1 MPEG-2 AAC 527 16.4.2 MPEG-4 AAC 532 16.5 Dolby AC3 (Dolby Digital) 533 16.5.1 Bit Allocation 534 16.6 Other Standards 535 16.7 Summary 536
  • 15. xiv C O N T E N T S 17 Analysis/Synthesis and Analysis by Synthesis Schemes 537 17.1 Overview 537 17.2 Introduction 537 17.3 Speech Compression 539 17.3.1 The Channel Vocoder 539 17.3.2 The Linear Predictive Coder (Government Standard LPC-10) 542 17.3.3 Code Excited Linear Predicton (CELP) 549 17.3.4 Sinusoidal Coders 552 17.3.5 Mixed Excitation Linear Prediction (MELP) 555 17.4 Wideband Speech Compression—ITU-T G.722.2 558 17.5 Image Compression 559 17.5.1 Fractal Compression 560 17.6 Summary 568 17.7 Projects and Problems 569 18 Video Compression 571 18.1 Overview 571 18.2 Introduction 571 18.3 Motion Compensation 573 18.4 Video Signal Representation 576 18.5 ITU-T Recommendation H.261 582 18.5.1 Motion Compensation 583 18.5.2 The Loop Filter 584 18.5.3 The Transform 586 18.5.4 Quantization and Coding 586 18.5.5 Rate Control 588 18.6 Model-Based Coding 588 18.7 Asymmetric Applications 590 18.8 The MPEG-1 Video Standard 591 18.9 The MPEG-2 Video Standard—H.262 594 18.9.1 The Grand Alliance HDTV Proposal 597 18.10 ITU-T Recommendation H.263 598 18.10.1 Unrestricted Motion Vector Mode 600 18.10.2 Syntax-Based Arithmetic Coding Mode 600 18.10.3 Advanced Prediction Mode 600 18.10.4 PB-frames and Improved PB-frames Mode 600 18.10.5 Advanced Intra Coding Mode 600 18.10.6 Deblocking Filter Mode 601 18.10.7 Reference Picture Selection Mode 601 18.10.8 Temporal, SNR, and Spatial Scalability Mode 601 18.10.9 Reference Picture Resampling 601 18.10.10 Reduced-Resolution Update Mode 602 18.10.11 Alternative Inter VLC Mode 602 18.10.12 Modified Quantization Mode 602 18.10.13 Enhanced Reference Picture Selection Mode 603
  • 16. C O N T E N T S xv 18.11 ITU-T Recommendation H.264, MPEG-4 Part 10, Advanced Video Coding 603 18.11.1 Motion-Compensated Prediction 604 18.11.2 The Transform 605 18.11.3 Intra Prediction 605 18.11.4 Quantization 606 18.11.5 Coding 608 18.12 MPEG-4 Part 2 609 18.13 Packet Video 610 18.14 ATM Networks 610 18.14.1 Compression Issues in ATM Networks 611 18.14.2 Compression Algorithms for Packet Video 612 18.15 Summary 613 18.16 Projects and Problems 614 A Probability and Random Processes 615 A.1 Probability 615 A.1.1 Frequency of Occurrence 615 A.1.2 A Measure of Belief 616 A.1.3 The Axiomatic Approach 618 A.2 Random Variables 620 A.3 Distribution Functions 621 A.4 Expectation 623 A.4.1 Mean 624 A.4.2 Second Moment 625 A.4.3 Variance 625 A.5 Types of Distribution 625 A.5.1 Uniform Distribution 625 A.5.2 Gaussian Distribution 626 A.5.3 Laplacian Distribution 626 A.5.4 Gamma Distribution 626 A.6 Stochastic Process 626 A.7 Projects and Problems 629 B A Brief Review of Matrix Concepts 631 B.1 A Matrix 631 B.2 Matrix Operations 632 C The Root Lattices 637 Bibliography 639 Index 655
  • 18. Preface Within the last decade the use of data compression has become ubiquitous. From mp3 players whose headphones seem to adorn the ears of most young (and some not so young) people, to cell phones, to DVDs, to digital television, data compression is an integral part of almost all information technology. This incorporation of compression into more and more of our lives also points to a certain degree of maturation of the technology. This maturity is reflected in the fact that there are fewer differences between this and the previous edition of this book than there were between the second and first editions. In the second edition we had added new techniques that had been developed since the first edition of this book came out. In this edition our purpose is more to include some important topics, such as audio compression, that had not been adequately covered in the second edition. During this time the field has not entirely stood still and we have tried to include information about new developments. We have added a new chapter on audio compression (including a description of the mp3 algorithm). We have added information on new standards such as the new video coding standard and the new facsimile standard. We have reorganized some of the material in the book, collecting together various lossless image compression techniques and standards into a single chapter, and we have updated a number of chapters, adding information that perhaps should have been there from the beginning. All this has yet again enlarged the book. However, the intent remains the same: to provide an introduction to the art or science of data compression. There is a tutorial description of most of the popular compression techniques followed by a description of how these techniques are used for image, speech, text, audio, and video compression. Given the pace of developments in this area, there are bound to be new ones that are not reflected in this book. In order to keep you informed of these developments, we will periodically provide updates at http://www.mkp.com. A u d i e n c e If you are designing hardware or software implementations of compression algorithms, or need to interact with individuals engaged in such design, or are involved in development of multimedia applications and have some background in either electrical or computer engineering, or computer science, this book should be useful to you. We have included a large number of examples to aid in self-study. We have also included discussion of various multimedia standards. The intent here is not to provide all the details that may be required to implement a standard but to provide information that will help you follow and understand the standards documents.
  • 19. xviii P R E F A C E C o u r s e U s e The impetus for writing this book came from the need for a self-contained book that could be used at the senior/graduate level for a course in data compression in either electrical engineering, computer engineering, or computer science departments. There are problems and project ideas after most of the chapters. A solutions manual is available from the publisher. Also at http://sensin.unl.edu/idc/index.html we provide links to various course homepages, which can be a valuable source of project ideas and support material. The material in this book is too much for a one semester course. However, with judicious use of the starred sections, this book can be tailored to fit a number of compression courses that emphasize various aspects of compression. If the course emphasis is on lossless compression, the instructor could cover most of the sections in the first seven chapters. Then, to give a taste of lossy compression, the instructor could cover Sections 1–5 of Chapter 9, followed by Chapter 13 and its description of JPEG, and Chapter 18, which describes video compression approaches used in multimedia communications. If the class interest is more attuned to audio compression, then instead of Chapters 13 and 18, the instructor could cover Chapters 14 and 16. If the latter option is taken, depending on the background of the students in the class, Chapter 12 may be assigned as background reading. If the emphasis is to be on lossy compression, the instructor could cover Chapter 2, the first two sections of Chapter 3, Sections 4 and 6 of Chapter 4 (with a cursory overview of Sections 2 and 3), Chapter 8, selected parts of Chapter 9, and Chapter 10 through 15. At this point depending on the time available and the interests of the instructor and the students portions of the remaining three chapters can be covered. I have always found it useful to assign a term project in which the students can follow their own interests as a means of covering material that is not covered in class but is of interest to the student. A p p r o a c h In this book, we cover both lossless and lossy compression techniques with applications to image, speech, text, audio, and video compression. The various lossless and lossy coding techniques are introduced with just enough theory to tie things together. The necessary theory is introduced just before we need it. Therefore, there are three mathematical prelim- inaries chapters. In each of these chapters, we present the mathematical material needed to understand and appreciate the techniques that follow. Although this book is an introductory text, the word introduction may have a different meaning for different audiences. We have tried to accommodate the needs of different audiences by taking a dual-track approach. Wherever we felt there was material that could enhance the understanding of the subject being discussed but could still be skipped without seriously hindering your understanding of the technique, we marked those sections with a star (). If you are primarily interested in understanding how the various techniques function, especially if you are using this book for self-study, we recommend you skip the starred sections, at least in a first reading. Readers who require a slightly more theoretical approach should use the starred sections. Except for the starred sections, we have tried to keep the mathematics to a minimum.
  • 20. P R E F A C E xix L e a r n i n g f r o m T h i s B o o k I have found that it is easier for me to understand things if I can see examples. Therefore, I have relied heavily on examples to explain concepts. You may find it useful to spend more time with the examples if you have difficulty with some of the concepts. Compression is still largely an art and to gain proficiency in an art we need to get a “feel” for the process. We have included software implementations for most of the techniques discussed in this book, along with a large number of data sets. The software and data sets can be obtained from ftp://ftp.mkp.com/pub/Sayood/. The programs are written in C and have been tested on a number of platforms. The programs should run under most flavors of UNIX machines and, with some slight modifications, under other operating systems as well. More detailed information is contained in the README file in the pub/Sayood directory. You are strongly encouraged to use and modify these programs to work with your favorite data in order to understand some of the issues involved in compression. A useful and achievable goal should be the development of your own compression package by the time you have worked through this book. This would also be a good way to learn the trade-offs involved in different approaches. We have tried to give comparisons of techniques wherever possible; however, different types of data have their own idiosyncrasies. The best way to know which scheme to use in any given situation is to try them. C o n t e n t a n d O r g a n i z a t i o n The organization of the chapters is as follows: We introduce the mathematical preliminaries necessary for understanding lossless compression in Chapter 2; Chapters 3 and 4 are devoted to coding algorithms, including Huffman coding, arithmetic coding, Golomb-Rice codes, and Tunstall codes. Chapters 5 and 6 describe many of the popular lossless compression schemes along with their applications. The schemes include LZW, ppm, BWT, and DMC, among others. In Chapter 7 we describe a number of lossless image compression algorithms and their applications in a number of international standards. The standards include the JBIG standards and various facsimile standards. Chapter 8 is devoted to providing the mathematical preliminaries for lossy compression. Quantization is at the heart of most lossy compression schemes. Chapters 9 and 10 are devoted to the study of quantization. Chapter 9 deals with scalar quantization, and Chapter 10 deals with vector quantization. Chapter 11 deals with differential encoding techniques, in particular differential pulse code modulation (DPCM) and delta modulation. Included in this chapter is a discussion of the CCITT G.726 standard. Chapter 12 is our third mathematical preliminaries chapter. The goal of this chapter is to provide the mathematical foundation necessary to understand some aspects of the transform, subband, and wavelet-based techniques that are described in the next three chapters. As in the case of the previous mathematical preliminaries chapters, not all material covered is necessary for everyone. We describe the JPEG standard in Chapter 13, the CCITT G.722 international standard in Chapter 14, and EZW, SPIHT, and JPEG 2000 in Chapter 15. Chapter 16 is devoted to audio compression. We describe the various MPEG audio compression schemes in this chapter including the scheme popularly known as mp3.
  • 21. xx P R E F A C E Chapter 17 covers techniques in which the data to be compressed are analyzed, and a model for the generation of the data is transmitted to the receiver. The receiver uses this model to synthesize the data. These analysis/synthesis and analysis by synthesis schemes include linear predictive schemes used for low-rate speech coding and the fractal compres- sion technique. We describe the federal government LPC-10 standard. Code-excited linear prediction (CELP) is a popular example of an analysis by synthesis scheme. We also discuss three CELP-based standards, the federal standard 1016, the CCITT G.728 international stan- dard, and the relatively new wideband speech compression standard G.722.2. We have also included a discussion of the mixed excitation linear prediction (MELP) technique, which is the new federal standard for speech coding at 2.4 kbps. Chapter 18 deals with video coding. We describe popular video coding techniques via description of various international standards, including H.261, H.264, and the various MPEG standards. A P e r s o n a l V i e w For me, data compression is more than a manipulation of numbers; it is the process of discovering structures that exist in the data. In the 9th century, the poet Omar Khayyam wrote The moving finger writes, and having writ, moves on; not all thy piety nor wit, shall lure it back to cancel half a line, nor all thy tears wash out a word of it. (The Rubaiyat of Omar Khayyam) To explain these few lines would take volumes. They tap into a common human expe- rience so that in our mind’s eye, we can reconstruct what the poet was trying to convey centuries ago. To understand the words we not only need to know the language, we also need to have a model of reality that is close to that of the poet. The genius of the poet lies in identifying a model of reality that is so much a part of our humanity that centuries later and in widely diverse cultures, these few words can evoke volumes. Data compression is much more limited in its aspirations, and it may be presumptuous to mention it in the same breath as poetry. But there is much that is similar to both endeavors. Data compression involves identifying models for the many different types of structures that exist in different types of data and then using these models, perhaps along with the perceptual framework in which these data will be used, to obtain a compact representation of the data. These structures can be in the form of patterns that we can recognize simply by plotting the data, or they might be statistical structures that require a more mathematical approach to comprehend. In The Long Dark Teatime of the Soul by Douglas Adams, the protagonist finds that he can enter Valhalla (a rather shoddy one) if he tilts his head in a certain way. Appreciating the structures that exist in data sometimes require us to tilt our heads in a certain way. There are an infinite number of ways we can tilt our head and, in order not to get a pain in the neck (carrying our analogy to absurd limits), it would be nice to know some of the ways that
  • 22. P R E F A C E xxi will generally lead to a profitable result. One of the objectives of this book is to provide you with a frame of reference that can be used for further exploration. I hope this exploration will provide as much enjoyment for you as it has given to me. A c k n o w l e d g m e n t s It has been a lot of fun writing this book. My task has been made considerably easier and the end product considerably better because of the help I have received. Acknowledging that help is itself a pleasure. The first edition benefitted from the careful and detailed criticism of Roy Hoffman from IBM, Glen Langdon from the University of California at Santa Cruz, Debra Lelewer from California Polytechnic State University, Eve Riskin from the University of Washington, Ibrahim Sezan from Kodak, and Peter Swaszek from the University of Rhode Island. They provided detailed comments on all or most of the first edition. Nasir Memon from Polytechnic University, Victor Ramamoorthy then at S3, Grant Davidson at Dolby Corporation, Hakan Caglar, who was then at TÜBITAK in Istanbul, and Allen Gersho from the University of California at Santa Barbara reviewed parts of the manuscript. For the second edition Steve Tate at the University of North Texas, Sheila Horan at New Mexico State University, Edouard Lamboray at Oerlikon Contraves Group, Steven Pigeon at the University of Montreal, and Jesse Olvera at Raytheon Systems reviewed the entire manuscript. Emin Anarım of Boğaziçi University and Hakan Çağlar helped me with the development of the chapter on wavelets. Mark Fowler provided extensive comments on Chapters 12–15, correcting mistakes of both commission and omission. Tim James, Devajani Khataniar, and Lance Pérez also read and critiqued parts of the new material in the second edition. Chloeann Nelson, along with trying to stop me from splitting infinitives, also tried to make the first two editions of the book more user-friendly. Since the appearance of the first edition, various readers have sent me their comments and critiques. I am grateful to all who sent me comments and suggestions. I am especially grateful to Roberto Lopez-Hernandez, Dirk vom Stein, Christopher A. Larrieu, Ren Yih Wu, Humberto D’Ochoa, Roderick Mills, Mark Elston, and Jeerasuda Keesorth for pointing out errors and suggesting improvements to the book. I am also grateful to the various instructors who have sent me their critiques. In particular I would like to thank Bruce Bomar from the University of Tennessee, Mark Fowler from SUNY Binghamton, Paul Amer from the University of Delaware, K.R. Rao from the University of Texas at Arlington, Ralph Wilkerson from the University of Missouri–Rolla, Adam Drozdek from Duquesne University, Ed Hong and Richard Ladner from the University of Washington, Lars Nyland from the Colorado School of Mines, Mario Kovac from the University of Zagreb, and Pierre Jouvelet from the Ecole Superieure des Mines de Paris. Frazer Williams and Mike Hoffman, from my department at the University of Nebraska, provided reviews for the first edition of the book. Mike read the new chapters in the second and third edition in their raw form and provided me with critiques that led to major rewrites. His insights were always helpful and the book carries more of his imprint than he is perhaps aware of. It is nice to have friends of his intellectual caliber and generosity. Rob Maher at Montana State University provided me with an extensive critique of the new chapter on
  • 23. xxii P R E F A C E audio compression pointing out errors in my thinking and gently suggesting corrections. I thank him for his expertise, his time, and his courtesy. Rick Adams, Rachel Roumeliotis, and Simon Crump at Morgan Kaufmann had the task of actually getting the book out. This included the unenviable task of getting me to meet deadlines. Vytas Statulevicius helped me with LaTex problems that were driving me up the wall. Most of the examples in this book were generated in a lab set up by Andy Hadenfeldt. James Nau helped me extricate myself out of numerous software puddles giving freely of his time. In my times of panic, he was always just an email or voice mail away. I would like to thank the various “models” for the data sets that accompany this book and were used as examples. The individuals in the images are Sinan Sayood, Sena Sayood, and Elif Sevuktekin. The female voice belongs to Pat Masek. This book reflects what I have learned over the years. I have been very fortunate in the teachers I have had. David Farden, now at North Dakota State University, introduced me to the area of digital communication. Norm Griswold at Texas AM University introduced me to the area of data compression. Jerry Gibson, now at University of California at Santa Barbara was my Ph.D. advisor and helped me get started on my professional career. The world may not thank him for that, but I certainly do. I have also learned a lot from my students at the University of Nebraska and Boğaziçi University. Their interest and curiosity forced me to learn and kept me in touch with the broad field that is data compression today. I learned at least as much from them as they learned from me. Much of this learning would not have been possible but for the support I received from NASA. The late Warner Miller and Pen-Shu Yeh at the Goddard Space Flight Center and Wayne Whyte at the Lewis Research Center were a source of support and ideas. I am truly grateful for their helpful guidance, trust, and friendship. Our two boys, Sena and Sinan, graciously forgave my evenings and weekends at work. They were tiny (witness the images) when I first started writing this book. Soon I will have to look up when talking to them. “The book” has been their (sometimes unwanted) companion through all these years. For their graciousness and for always being such perfect joys, I thank them. Above all the person most responsible for the existence of this book is my partner and closest friend Füsun. Her support and her friendship gives me the freedom to do things I would not otherwise even consider. She centers my universe and, as with every significant endeavor that I have undertaken since I met her, this book is at least as much hers as it is mine.
  • 24. 1 Introduction I n the last decade we have been witnessing a transformation—some call it a revolution—in the way we communicate, and the process is still under way. This transformation includes the ever-present, ever-growing Internet; the explosive development of mobile communications; and the ever-increasing importance of video communication. Data compression is one of the enabling technologies for each of these aspects of the multimedia revolution. It would not be practical to put images, let alone audio and video, on websites if it were not for data compression algorithms. Cellular phones would not be able to provide communication with increasing clarity were it not for compression. The advent of digital TV would not be possible without compression. Data compression, which for a long time was the domain of a relatively small group of engineers and scientists, is now ubiquitous. Make a long-distance call and you are using compression. Use your modem, or your fax machine, and you will benefit from compression. Listen to music on your mp3 player or watch a DVD and you are being entertained courtesy of compression. So, what is data compression, and why do we need it? Most of you have heard of JPEG and MPEG, which are standards for representing images, video, and audio. Data compression algorithms are used in these standards to reduce the number of bits required to represent an image or a video sequence or music. In brief, data compression is the art or science of representing information in a compact form. We create these compact representations by identifying and using structures that exist in the data. Data can be characters in a text file, numbers that are samples of speech or image waveforms, or sequences of numbers that are generated by other processes. The reason we need data compression is that more and more of the information that we generate and use is in digital form—in the form of numbers represented by bytes of data. And the number of bytes required to represent multimedia data can be huge. For example, in order to digitally represent 1 second of video without compression (using the CCIR 601 format), we need more than 20 megabytes, or 160 megabits. If we consider the number of seconds in a movie, we can easily see why we would need compression. To represent 2 minutes of uncompressed CD-quality
  • 25. 2 1 I N T R O D U C T I O N music (44,100 samples per second, 16 bits per sample) requires more than 84 million bits. Downloading music from a website at these rates would take a long time. As human activity has a greater and greater impact on our environment, there is an ever- increasing need for more information about our environment, how it functions, and what we are doing to it. Various space agencies from around the world, including the European Space Agency (ESA), the National Aeronautics and Space Agency (NASA), the Canadian Space Agency (CSA), and the Japanese Space Agency (STA), are collaborating on a program to monitor global change that will generate half a terabyte of data per day when they are fully operational. Compare this to the 130 terabytes of data currently stored at the EROS data center in South Dakota, that is the largest archive for land mass data in the world. Given the explosive growth of data that needs to be transmitted and stored, why not focus on developing better transmission and storage technologies? This is happening, but it is not enough. There have been significant advances that permit larger and larger volumes of information to be stored and transmitted without using compression, including CD-ROMs, optical fibers, Asymmetric Digital Subscriber Lines (ADSL), and cable modems. However, while it is true that both storage and transmission capacities are steadily increasing with new technological innovations, as a corollary to Parkinson’s First Law,1 it seems that the need for mass storage and transmission increases at least twice as fast as storage and transmission capacities improve. Then there are situations in which capacity has not increased significantly. For example, the amount of information we can transmit over the airwaves will always be limited by the characteristics of the atmosphere. An early example of data compression is Morse code, developed by Samuel Morse in the mid-19th century. Letters sent by telegraph are encoded with dots and dashes. Morse noticed that certain letters occurred more often than others. In order to reduce the average time required to send a message, he assigned shorter sequences to letters that occur more frequently, such as e (·) and a (· −), and longer sequences to letters that occur less frequently, such as q (− − · −) and j (· − − −). This idea of using shorter codes for more frequently occurring characters is used in Huffman coding, which we will describe in Chapter 3. Where Morse code uses the frequency of occurrence of single characters, a widely used form of Braille code, which was also developed in the mid-19th century, uses the frequency of occurrence of words to provide compression [1]. In Braille coding, 2 ×3 arrays of dots are used to represent text. Different letters can be represented depending on whether the dots are raised or flat. In Grade 1 Braille, each array of six dots represents a single character. However, given six dots with two positions for each dot, we can obtain 26 , or 64, different combinations. If we use 26 of these for the different letters, we have 38 combinations left. In Grade 2 Braille, some of these leftover combinations are used to represent words that occur frequently, such as “and” and “for.” One of the combinations is used as a special symbol indicating that the symbol that follows is a word and not a character, thus allowing a large number of words to be represented by two arrays of dots. 
These modifications, along with contractions of some of the words, result in an average reduction in space, or compression, of about 20% [1]. 1 Parkinson’s First Law: “Work expands so as to fill the time available,” in Parkinson’s Law and Other Studies in Administration, by Cyril Northcote Parkinson, Ballantine Books, New York, 1957.
  • 26. 1.1 Compression Techniques 3 Statistical structure is being used to provide compression in these examples, but that is not the only kind of structure that exists in the data. There are many other kinds of structures existing in data of different types that can be exploited for compression. Consider speech. When we speak, the physical construction of our voice box dictates the kinds of sounds that we can produce. That is, the mechanics of speech production impose a structure on speech. Therefore, instead of transmitting the speech itself, we could send information about the conformation of the voice box, which could be used by the receiver to synthesize the speech. An adequate amount of information about the conformation of the voice box can be represented much more compactly than the numbers that are the sampled values of speech. Therefore, we get compression. This compression approach is being used currently in a number of applications, including transmission of speech over mobile radios and the synthetic voice in toys that speak. An early version of this compression approach, called the vocoder (voice coder), was developed by Homer Dudley at Bell Laboratories in 1936. The vocoder was demonstrated at the New York World’s Fair in 1939, where it was a major attraction. We will revisit the vocoder and this approach to compression of speech in Chapter 17. These are only a few of the many different types of structures that can be used to obtain compression. The structure in the data is not the only thing that can be exploited to obtain compression. We can also make use of the characteristics of the user of the data. Many times, for example, when transmitting or storing speech and images, the data are intended to be perceived by a human, and humans have limited perceptual abilities. For example, we cannot hear the very high frequency sounds that dogs can hear. If something is represented in the data that cannot be perceived by the user, is there any point in preserving that information? The answer often is “no.” Therefore, we can make use of the perceptual limitations of humans to obtain compression by discarding irrelevant information. This approach is used in a number of compression schemes that we will visit in Chapters 13, 14, and 16. Before we embark on our study of data compression techniques, let’s take a general look at the area and define some of the key terms and concepts we will be using in the rest of the book. 1 . 1 Compression Techniques When we speak of a compression technique or compression algorithm,2 we are actually referring to two algorithms. There is the compression algorithm that takes an input and generates a representation c that requires fewer bits, and there is a reconstruction algorithm that operates on the compressed representation c to generate the reconstruction . These operations are shown schematically in Figure 1.1. We will follow convention and refer to both the compression and reconstruction algorithms together to mean the compression algorithm. 2 The word algorithm comes from the name of an early 9th-century Arab mathematician, Al-Khwarizmi, who wrote a treatise entitled The Compendious Book on Calculation by al-jabr and al-muqabala, in which he explored (among other things) the solution of various linear and quadratic equations via rules or an “algorithm.” This approach became known as the method of Al-Khwarizmi. The name was changed to algoritni in Latin, from which we get the word algorithm. 
The name of the treatise also gave us the word algebra [2].
[Figure 1.1: the original data $x$ enter the compression algorithm, which produces the compressed representation $x_c$; the reconstruction algorithm then operates on $x_c$ to produce the reconstruction $y$.]

FIGURE 1.1  Compression and reconstruction.

Based on the requirements of reconstruction, data compression schemes can be divided into two broad classes: lossless compression schemes, in which $y$ is identical to $x$, and lossy compression schemes, which generally provide much higher compression than lossless compression but allow $y$ to be different from $x$.

1.1.1 Lossless Compression

Lossless compression techniques, as their name implies, involve no loss of information. If data have been losslessly compressed, the original data can be recovered exactly from the compressed data. Lossless compression is generally used for applications that cannot tolerate any difference between the original and reconstructed data.

Text compression is an important area for lossless compression. It is very important that the reconstruction is identical to the original text, as very small differences can result in statements with very different meanings. Consider the sentences "Do not send money" and "Do now send money." A similar argument holds for computer files and for certain types of data such as bank records.

If data of any kind are to be processed or "enhanced" later to yield more information, it is important that the integrity be preserved. For example, suppose we compressed a radiological image in a lossy fashion, and the difference between the reconstruction and the original was visually undetectable. If this image was later enhanced, the previously undetectable differences may cause the appearance of artifacts that could seriously mislead the radiologist. Because the price for this kind of mishap may be a human life, it makes sense to be very careful about using a compression scheme that generates a reconstruction that is different from the original.

Data obtained from satellites often are processed later to obtain different numerical indicators of vegetation, deforestation, and so on. If the reconstructed data are not identical to the original data, processing may result in "enhancement" of the differences. It may not
  • 28. 1.1 Compression Techniques 5 be possible to go back and obtain the same data over again. Therefore, it is not advisable to allow for any differences to appear in the compression process. There are many situations that require compression where we want the reconstruction to be identical to the original. There are also a number of situations in which it is possible to relax this requirement in order to get more compression. In these situations we look to lossy compression techniques. 1 . 1 . 2 L o s s y C o m p r e s s i o n Lossy compression techniques involve some loss of information, and data that have been compressed using lossy techniques generally cannot be recovered or reconstructed exactly. In return for accepting this distortion in the reconstruction, we can generally obtain much higher compression ratios than is possible with lossless compression. In many applications, this lack of exact reconstruction is not a problem. For example, when storing or transmitting speech, the exact value of each sample of speech is not necessary. Depending on the quality required of the reconstructed speech, varying amounts of loss of information about the value of each sample can be tolerated. If the quality of the reconstructed speech is to be similar to that heard on the telephone, a significant loss of information can be tolerated. However, if the reconstructed speech needs to be of the quality heard on a compact disc, the amount of information loss that can be tolerated is much lower. Similarly, when viewing a reconstruction of a video sequence, the fact that the reconstruc- tion is different from the original is generally not important as long as the differences do not result in annoying artifacts. Thus, video is generally compressed using lossy compression. Once we have developed a data compression scheme, we need to be able to measure its performance. Because of the number of different areas of application, different terms have been developed to describe and measure the performance. 1 . 1 . 3 M e a s u r e s o f P e r f o r m a n c e A compression algorithm can be evaluated in a number of different ways. We could measure the relative complexity of the algorithm, the memory required to implement the algorithm, how fast the algorithm performs on a given machine, the amount of compression, and how closely the reconstruction resembles the original. In this book we will mainly be concerned with the last two criteria. Let us take each one in turn. A very logical way of measuring how well a compression algorithm compresses a given set of data is to look at the ratio of the number of bits required to represent the data before compression to the number of bits required to represent the data after compression. This ratio is called the compression ratio. Suppose storing an image made up of a square array of 256×256 pixels requires 65,536 bytes. The image is compressed and the compressed version requires 16,384 bytes. We would say that the compression ratio is 4:1. We can also represent the compression ratio by expressing the reduction in the amount of data required as a percentage of the size of the original data. In this particular example the compression ratio calculated in this manner would be 75%.
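Under the assumptions of this example (one byte per pixel), the arithmetic is easy to check with a small Python sketch; this is ours, purely for illustration of the definitions.

```python
# Compression ratio and percent reduction for the 256 x 256 image example.
original_bytes = 256 * 256        # 65,536 bytes, one byte per pixel
compressed_bytes = 16_384

ratio = original_bytes / compressed_bytes                    # 4.0, i.e., 4:1
reduction = 100 * (1 - compressed_bytes / original_bytes)    # 75%
print(f"compression ratio {ratio:.0f}:1, reduction {reduction:.0f}%")
```

Both conventions describe the same result; which one is reported is simply a matter of custom in a given application area.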
  • 29. 6 1 I N T R O D U C T I O N Another way of reporting compression performance is to provide the average number of bits required to represent a single sample. This is generally referred to as the rate. For example, in the case of the compressed image described above, if we assume 8 bits per byte (or pixel), the average number of bits per pixel in the compressed representation is 2. Thus, we would say that the rate is 2 bits per pixel. In lossy compression, the reconstruction differs from the original data. Therefore, in order to determine the efficiency of a compression algorithm, we have to have some way of quantifying the difference. The difference between the original and the reconstruction is often called the distortion. (We will describe several measures of distortion in Chapter 8.) Lossy techniques are generally used for the compression of data that originate as analog signals, such as speech and video. In compression of speech and video, the final arbiter of quality is human. Because human responses are difficult to model mathematically, many approximate measures of distortion are used to determine the quality of the reconstructed waveforms. We will discuss this topic in more detail in Chapter 8. Other terms that are also used when talking about differences between the reconstruction and the original are fidelity and quality. When we say that the fidelity or quality of a reconstruction is high, we mean that the difference between the reconstruction and the original is small. Whether this difference is a mathematical difference or a perceptual difference should be evident from the context. 1 . 2 Modeling and Coding While reconstruction requirements may force the decision of whether a compression scheme is to be lossy or lossless, the exact compression scheme we use will depend on a number of different factors. Some of the most important factors are the characteristics of the data that need to be compressed. A compression technique that will work well for the compression of text may not work well for compressing images. Each application presents a different set of challenges. There is a saying attributed to Bobby Knight, the basketball coach at Texas Tech University: “If the only tool you have is a hammer, you approach every problem as if it were a nail.” Our intention in this book is to provide you with a large number of tools that you can use to solve the particular data compression problem. It should be remembered that data compression, if it is a science at all, is an experimental science. The approach that works best for a particular application will depend to a large extent on the redundancies inherent in the data. The development of data compression algorithms for a variety of data can be divided into two phases. The first phase is usually referred to as modeling. In this phase we try to extract information about any redundancy that exists in the data and describe the redundancy in the form of a model. The second phase is called coding. A description of the model and a “description” of how the data differ from the model are encoded, generally using a binary alphabet. The difference between the data and the model is often referred to as the residual. In the following three examples we will look at three different ways that data can be modeled. We will then use the model to obtain compression.
Example 1.2.1:

Consider the following sequence of numbers $\{x_1, x_2, x_3, \ldots\}$:

9 11 11 11 14 13 15 17 16 17 20 21

If we were to transmit or store the binary representations of these numbers, we would need to use 5 bits per sample. However, by exploiting the structure in the data, we can represent the sequence using fewer bits. If we plot these data as shown in Figure 1.2, we see that the data seem to fall on a straight line. A model for the data could therefore be a straight line given by the equation

$$\hat{x}_n = n + 8, \qquad n = 1, 2, \ldots$$

FIGURE 1.2  A sequence of data values.

Thus, the structure in the data can be characterized by an equation. To make use of this structure, let's examine the difference between the data and the model. The difference (or residual) is given by the sequence

$$e_n = x_n - \hat{x}_n: \qquad 0\ \ 1\ \ 0\ \ {-1}\ \ 1\ \ {-1}\ \ 0\ \ 1\ \ {-1}\ \ {-1}\ \ 1\ \ 1$$

The residual sequence consists of only three numbers: $-1$, 0, and 1. If we assign a code of 00 to $-1$, a code of 01 to 0, and a code of 10 to 1, we need to use 2 bits to represent each element of the residual sequence. Therefore, we can obtain compression by transmitting or storing the parameters of the model and the residual sequence. The encoding can be exact if the required compression is to be lossless, or approximate if the compression can be lossy.
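A minimal Python sketch of this example (the model and the 2-bit residual code are exactly those used above; the function and variable names are ours):

```python
# Example 1.2.1: encode the data as model parameters plus a small residual.
data = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]

def model(n):
    return n + 8                              # straight-line model: x_hat_n = n + 8

residual = [x - model(n) for n, x in enumerate(data, start=1)]
print(residual)                               # [0, 1, 0, -1, 1, -1, 0, 1, -1, -1, 1, 1]

code = {-1: "00", 0: "01", 1: "10"}           # 2 bits per residual value
encoded = "".join(code[e] for e in residual)
print(len(encoded), "bits versus", 5 * len(data), "bits for the raw 5-bit samples")
```

Transmitting the model parameters plus these 24 bits, in place of the 60-bit raw representation, is exactly the saving described above.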
The type of structure or redundancy that existed in these data follows a simple law. Once we recognize this law, we can make use of the structure to predict the value of each element in the sequence and then encode the residual. Structure of this type is only one of many types of structure. Consider the following example.

Example 1.2.2:

Consider the following sequence of numbers:

27 28 29 28 26 27 29 28 30 32 34 36 38

The sequence is plotted in Figure 1.3.

FIGURE 1.3  A sequence of data values.

The sequence does not seem to follow a simple law as in the previous case. However, each value is close to the previous value. Suppose we send the first value, then in place of subsequent values we send the difference between it and the previous value. The sequence of transmitted values would be

27 1 1 −1 −2 1 2 −1 2 2 2 2 2

Like the previous example, the number of distinct values has been reduced. Fewer bits are required to represent each number and compression is achieved. The decoder adds each received value to the previous decoded value to obtain the reconstruction corresponding
to the received value. Techniques that use the past values of a sequence to predict the current value and then encode the error in prediction, or residual, are called predictive coding schemes. We will discuss lossless predictive compression schemes in Chapter 7 and lossy predictive coding schemes in Chapter 11. Assuming both encoder and decoder know the model being used, we would still have to send the value of the first element of the sequence.

A very different type of redundancy is statistical in nature. Often we will encounter sources that generate some symbols more often than others. In these situations, it will be advantageous to assign binary codes of different lengths to different symbols.

Example 1.2.3:

Suppose we have the following sequence:

a/barayaran/barray/bran/bfar/bfaar/bfaaar/baway

which is typical of all sequences generated by a source. Notice that the sequence is made up of eight different symbols. In order to represent eight symbols, we need to use 3 bits per symbol. Suppose instead we used the code shown in Table 1.1. Notice that we have assigned a codeword with only a single bit to the symbol that occurs most often, and correspondingly longer codewords to symbols that occur less often. If we substitute the codes for each symbol, we will use 106 bits to encode the entire sequence. As there are 41 symbols in the sequence, this works out to approximately 2.58 bits per symbol. This means we have obtained a compression ratio of 1.16:1. We will study how to use statistical redundancy of this sort in Chapters 3 and 4.

TABLE 1.1  A code with codewords of varying length.

  a    1
  /    001
  b    01100
  f    0100
  n    0111
  r    000
  w    01101
  y    0101

When dealing with text, along with statistical redundancy, we also see redundancy in the form of words that repeat often. We can take advantage of this form of redundancy by constructing a list of these words and then represent them by their position in the list. This type of compression scheme is called a dictionary compression scheme. We will study these schemes in Chapter 5.
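The bookkeeping in such examples is easy to automate. The short sketch below compares a fixed-length code with a table of variable-length codewords; the toy sequence and codewords here are our own illustration, not those of Table 1.1.

```python
# Compare a fixed-length code with a variable-length code for a toy sequence.
from math import ceil, log2

sequence = "abracadabra"
codewords = {"a": "1", "b": "001", "r": "000", "c": "0100", "d": "0101"}

fixed_bits = ceil(log2(len(set(sequence)))) * len(sequence)   # 3 bits x 11 symbols = 33
variable_bits = sum(len(codewords[s]) for s in sequence)      # 25 bits

print(f"fixed: {fixed_bits} bits, variable: {variable_bits} bits")
print(f"compression ratio {fixed_bits / variable_bits:.2f}:1, "
      f"{variable_bits / len(sequence):.2f} bits/symbol")
```

As in Example 1.2.3, the saving comes entirely from giving the most frequent symbol the shortest codeword.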
  • 33. 10 1 I N T R O D U C T I O N Often the structure or redundancy in the data becomes more evident when we look at groups of symbols. We will look at compression schemes that take advantage of this in Chapters 4 and 10. Finally, there will be situations in which it is easier to take advantage of the structure if we decompose the data into a number of components. We can then study each component separately and use a model appropriate to that component. We will look at such schemes in Chapters 13, 14, and 15. There are a number of different ways to characterize data. Different characterizations will lead to different compression schemes. We will study these compression schemes in the upcoming chapters, and use a number of examples that should help us understand the relationship between the characterization and the compression scheme. With the increasing use of compression, there has also been an increasing need for standards. Standards allow products developed by different vendors to communicate. Thus, we can compress something with products from one vendor and reconstruct it using the products of a different vendor. The different international standards organizations have responded to this need, and a number of standards for various compression applications have been approved. We will discuss these standards as applications of the various compression techniques. Finally, compression is still largely an art, and to gain proficiency in an art you need to get a feel for the process. To help, we have developed software implementations of most of the techniques discussed in this book, and also provided the data sets used for developing the examples in this book. Details on how to obtain these programs and data sets are provided in the Preface. You should use these programs on your favorite data or on the data sets provided in order to understand some of the issues involved in compression. We would also encourage you to write your own software implementations of some of these techniques, as very often the best way to understand how an algorithm works is to implement the algorithm. 1 . 3 Summary In this chapter we have introduced the subject of data compression. We have provided some motivation for why we need data compression and defined some of the terminology we will need in this book. Additional terminology will be introduced as needed. We have briefly introduced the two major types of compression algorithms: lossless compression and lossy compression. Lossless compression is used for applications that require an exact reconstruction of the original data, while lossy compression is used when the user can tolerate some differences between the original and reconstructed representations of the data. An important element in the design of data compression algorithms is the modeling of the data. We have briefly looked at how modeling can help us in obtaining more compact representations of the data. We have described some of the different ways we can view the data in order to model it. The more ways we have of looking at the data, the more successful we will be in developing compression schemes that take full advantage of the structures in the data.
  • 34. 1.4 Projects and Problems 11 1 . 4 Projects and Problems 1. Use the compression utility on your computer to compress different files. Study the effect of the original file size and file type on the ratio of compressed file size to original file size. 2. Take a few paragraphs of text from a popular magazine and compress them by remov- ing all words that are not essential for comprehension. For example, in the sentence “This is the dog that belongs to my friend,” we can remove the words is, the, that, and to and still convey the same meaning. Let the ratio of the words removed to the total number of words in the original text be the measure of redundancy in the text. Repeat the experiment using paragraphs from a technical journal. Can you make any quanti- tative statements about the redundancy in the text obtained from different sources?
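For Project 1, the measurements are easy to automate. The sketch below uses Python's zlib module as a stand-in for whatever compression utility is available on your machine; the file names are placeholders to be replaced with your own test files.

```python
# Report compressed-to-original size ratios for a few files (Project 1).
import zlib
from pathlib import Path

for name in ["report.txt", "photo.bmp", "archive.zip"]:   # placeholder file names
    data = Path(name).read_bytes()
    compressed = zlib.compress(data, 9)                   # level 9: best compression
    print(f"{name}: {len(data)} -> {len(compressed)} bytes "
          f"(compressed/original = {len(compressed) / len(data):.2f})")
```

Running this over files of different types and sizes should make the effect of file type on the achievable ratio quite visible.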
  • 36. 2 Mathematical Preliminaries for Lossless Compression 2 . 1 Overview T he treatment of data compression in this book is not very mathematical. (For a more mathematical treatment of some of the topics covered in this book, see [3, 4, 5, 6].) However, we do need some mathematical preliminaries to appreciate the compression techniques we will discuss. Compression schemes can be divided into two classes, lossy and lossless. Lossy compression schemes involve the loss of some information, and data that have been compressed using a lossy scheme generally cannot be recovered exactly. Lossless schemes compress the data without loss of information, and the original data can be recovered exactly from the compressed data. In this chapter, some of the ideas in information theory that provide the framework for the development of lossless data compression schemes are briefly reviewed. We will also look at some ways to model the data that lead to efficient coding schemes. We have assumed some knowledge of probability concepts (see Appendix A for a brief review of probability and random processes). 2 . 2 A Brief Introduction to Information Theory Although the idea of a quantitative measure of information has been around for a while, the person who pulled everything together into what is now called information theory was Claude Elwood Shannon [7], an electrical engineer at Bell Labs. Shannon defined a quantity called self-information. Suppose we have an event A, which is a set of outcomes of some random
experiment. If $P(A)$ is the probability that the event $A$ will occur, then the self-information associated with $A$ is given by

$$i(A) = \log_b \frac{1}{P(A)} = -\log_b P(A) \qquad (2.1)$$

Note that we have not specified the base of the log function. We will discuss this in more detail later in the chapter. The use of the logarithm to obtain a measure of information was not an arbitrary choice as we shall see later in this chapter. But first let's see if the use of a logarithm in this context makes sense from an intuitive point of view. Recall that $\log 1 = 0$, and $-\log x$ increases as $x$ decreases from one to zero. Therefore, if the probability of an event is low, the amount of self-information associated with it is high; if the probability of an event is high, the information associated with it is low. Even if we ignore the mathematical definition of information and simply use the definition we use in everyday language, this makes some intuitive sense. The barking of a dog during a burglary is a high-probability event and, therefore, does not contain too much information. However, if the dog did not bark during a burglary, this is a low-probability event and contains a lot of information. (Obviously, Sherlock Holmes understood information theory!)1 Although this equivalence of the mathematical and semantic definitions of information holds true most of the time, it does not hold all of the time. For example, a totally random string of letters will contain more information (in the mathematical sense) than a well-thought-out treatise on information theory.

Another property of this mathematical definition of information that makes intuitive sense is that the information obtained from the occurrence of two independent events is the sum of the information obtained from the occurrence of the individual events. Suppose $A$ and $B$ are two independent events. The self-information associated with the occurrence of both event $A$ and event $B$ is, by Equation (2.1),

$$i(AB) = \log_b \frac{1}{P(AB)}$$

As $A$ and $B$ are independent,

$$P(AB) = P(A)P(B)$$

and

$$i(AB) = \log_b \frac{1}{P(A)P(B)} = \log_b \frac{1}{P(A)} + \log_b \frac{1}{P(B)} = i(A) + i(B)$$

The unit of information depends on the base of the log. If we use log base 2, the unit is bits; if we use log base e, the unit is nats; and if we use log base 10, the unit is hartleys.

1 Silver Blaze by Arthur Conan Doyle.
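A quick numerical check of Equation (2.1) and of the additivity property, written as a short Python sketch (ours; the probabilities are arbitrary):

```python
# Self-information in bits, and additivity for independent events.
from math import log2

def self_information(p):
    """i(A) = -log2 P(A), in bits (Equation 2.1 with b = 2)."""
    return -log2(p)

p_a, p_b = 0.5, 0.125                 # assumed probabilities of two independent events
print(self_information(p_a))          # 1.0 bit
print(self_information(p_b))          # 3.0 bits
print(self_information(p_a * p_b))    # 4.0 bits = i(A) + i(B)
```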
Note that to calculate the information in bits, we need to take the logarithm base 2 of the probabilities. Because this probably does not appear on your calculator, let's review logarithms briefly. Recall that

$$\log_b x = a \quad \text{means that} \quad b^a = x$$

Therefore, if we want to take the log base 2 of $x$,

$$\log_2 x = a \;\Rightarrow\; 2^a = x$$

we want to find the value of $a$. We can take the natural log (log base e) or log base 10 of both sides (which do appear on your calculator). Then

$$\ln 2^a = \ln x \;\Rightarrow\; a \ln 2 = \ln x$$

and

$$a = \frac{\ln x}{\ln 2}$$

Example 2.2.1:

Let H and T be the outcomes of flipping a coin. If the coin is fair, then

$$P(H) = P(T) = \frac{1}{2}$$

and

$$i(H) = i(T) = 1 \text{ bit}$$

If the coin is not fair, then we would expect the information associated with each event to be different. Suppose

$$P(H) = \frac{1}{8} \qquad P(T) = \frac{7}{8}$$

Then

$$i(H) = 3 \text{ bits} \qquad i(T) = 0.193 \text{ bits}$$

At least mathematically, the occurrence of a head conveys much more information than the occurrence of a tail. As we shall see later, this has certain consequences for how the information conveyed by these outcomes should be encoded.

If we have a set of independent events $A_i$, which are sets of outcomes of some experiment, such that

$$\bigcup_i A_i = S$$
where $S$ is the sample space, then the average self-information associated with the random experiment is given by

$$H = \sum P(A_i)\, i(A_i) = -\sum P(A_i) \log_b P(A_i)$$

This quantity is called the entropy associated with the experiment. One of the many contributions of Shannon was that he showed that if the experiment is a source that puts out symbols $A_i$ from a set $\mathcal{A}$, then the entropy is a measure of the average number of binary symbols needed to code the output of the source. Shannon showed that the best that a lossless compression scheme can do is to encode the output of a source with an average number of bits equal to the entropy of the source.

The set of symbols $\mathcal{A}$ is often called the alphabet for the source, and the symbols are referred to as letters. For a general source $\mathcal{S}$ with alphabet $\mathcal{A} = \{1, 2, \ldots, m\}$ that generates a sequence $\{X_1, X_2, \ldots\}$, the entropy is given by

$$H = \lim_{n \to \infty} \frac{1}{n} G_n \qquad (2.2)$$

where

$$G_n = -\sum_{i_1=1}^{m} \sum_{i_2=1}^{m} \cdots \sum_{i_n=1}^{m} P(X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) \log P(X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n)$$

and $\{X_1, X_2, \ldots, X_n\}$ is a sequence of length $n$ from the source. We will talk more about the reason for the limit in Equation (2.2) later in the chapter.

If each element in the sequence is independent and identically distributed (iid), then we can show that

$$G_n = -n \sum_{i_1=1}^{m} P(X_1 = i_1) \log P(X_1 = i_1) \qquad (2.3)$$

and the equation for the entropy becomes

$$H(\mathcal{S}) = -\sum P(X_1) \log P(X_1) \qquad (2.4)$$

For most sources Equations (2.2) and (2.4) are not identical. If we need to distinguish between the two, we will call the quantity computed in (2.4) the first-order entropy of the source, while the quantity in (2.2) will be referred to as the entropy of the source.

In general, it is not possible to know the entropy for a physical source, so we have to estimate the entropy. The estimate of the entropy depends on our assumptions about the structure of the source sequence. Consider the following sequence:

1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10

Assuming the frequency of occurrence of each number is reflected accurately in the number of times it appears in the sequence, we can estimate the probability of occurrence of each symbol as follows:

$$P(1) = P(6) = P(7) = P(10) = \frac{1}{16} \qquad P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = \frac{2}{16}$$
Assuming the sequence is iid, the entropy for this sequence is the same as the first-order entropy as defined in (2.4). The entropy can then be calculated as

$$H = -\sum_{i=1}^{10} P(i) \log_2 P(i)$$

With our stated assumptions, the entropy for this source is 3.25 bits. This means that the best scheme we could find for coding this sequence could only code it at 3.25 bits/sample.

However, if we assume that there was sample-to-sample correlation between the samples and we remove the correlation by taking differences of neighboring sample values, we arrive at the residual sequence

1 1 1 −1 1 1 1 −1 1 1 1 1 1 −1 1 1

This sequence is constructed using only two values with probabilities $P(1) = \frac{13}{16}$ and $P(-1) = \frac{3}{16}$. The entropy in this case is 0.70 bits per symbol. Of course, knowing only this sequence would not be enough for the receiver to reconstruct the original sequence. The receiver must also know the process by which this sequence was generated from the original sequence. The process depends on our assumptions about the structure of the sequence. These assumptions are called the model for the sequence. In this case, the model for the sequence is

$$x_n = x_{n-1} + r_n$$

where $x_n$ is the $n$th element of the original sequence and $r_n$ is the $n$th element of the residual sequence. This model is called a static model because its parameters do not change with $n$. A model whose parameters change or adapt with $n$ to the changing characteristics of the data is called an adaptive model.

Basically, we see that knowing something about the structure of the data can help to "reduce the entropy." We have put "reduce the entropy" in quotes because the entropy of the source is a measure of the amount of information generated by the source. As long as the information generated by the source is preserved (in whatever representation), the entropy remains the same. What we are reducing is our estimate of the entropy. The "actual" structure of the data in practice is generally unknowable, but anything we can learn about the data can help us to estimate the actual source entropy. Theoretically, as seen in Equation (2.2), we accomplish this in our definition of the entropy by picking larger and larger blocks of data to calculate the probability over, letting the size of the block go to infinity.

Consider the following contrived sequence:

1 2 1 2 3 3 3 3 1 2 3 3 3 3 1 2 3 3 1 2

Obviously, there is some structure to this data. However, if we look at it one symbol at a time, the structure is difficult to extract. Consider the probabilities: $P(1) = P(2) = \frac{1}{4}$, and $P(3) = \frac{1}{2}$. The entropy is 1.5 bits/symbol. This particular sequence consists of 20 symbols; therefore, the total number of bits required to represent this sequence is 30. Now let's take the same sequence and look at it in blocks of two. Obviously, there are only two symbols, 1 2, and 3 3. The probabilities are $P(1\ 2) = \frac{1}{2}$, $P(3\ 3) = \frac{1}{2}$, and the entropy is 1 bit/symbol.
  • 41. 18 2 L O S S L E S S C O M P R E S S I O N As there are 10 such symbols in the sequence, we need a total of 10 bits to represent the entire sequence—a reduction of a factor of three. The theory says we can always extract the structure of the data by taking larger and larger block sizes; in practice, there are limitations to this approach. To avoid these limitations, we try to obtain an accurate model for the data and code the source with respect to the model. In Section 2.3, we describe some of the models commonly used in lossless compression algorithms. But before we do that, let’s make a slight detour and see a more rigorous development of the expression for average information. While the explanation is interesting, it is not really necessary for understanding much of what we will study in this book and can be skipped. 2 . 2 . 1 D e r i v a t i o n o f A v e r a g e I n f o r m a t i o n We start with the properties we want in our measure of average information. We will then show that requiring these properties in the information measure leads inexorably to the particular definition of average information, or entropy, that we have provided earlier. Given a set of independent events A1, A2 , An with probability pi = PAi, we desire the following properties in the measure of average information H: 1. We want H to be a continuous function of the probabilities pi. That is, a small change in pi should only cause a small change in the average information. 2. If all events are equally likely, that is, pi = 1/n for all i, then H should be a mono- tonically increasing function of n. The more possible outcomes there are, the more information should be contained in the occurrence of any particular outcome. 3. Suppose we divide the possible outcomes into a number of groups. We indicate the occurrence of a particular event by first indicating the group it belongs to, then indi- cating which particular member of the group it is. Thus, we get some information first by knowing which group the event belongs to and then we get additional information by learning which particular event (from the events in the group) has occurred. The information associated with indicating the outcome in multiple stages should not be any different than the information associated with indicating the outcome in a single stage. For example, suppose we have an experiment with three outcomes A1, A2, and A3, with corresponding probabilities p1, p2, and p3. The average information associated with this experiment is simply a function of the probabilities: H = Hp1p2p3 Let’s group the three outcomes into two groups B1 = A1 B2 = A2A3 The probabilities of the events Bi are given by q1 = PB1 = p1 q2 = PB2 = p2 +p3
  • 42. 2.2 A Brief Introduction to Information Theory 19 If we indicate the occurrence of an event Ai by first declaring which group the event belongs to and then declaring which event occurred, the total amount of average information would be given by H = Hq1q2+q1H p1 q1 +q2H p2 q2 p3 q2 We require that the average information computed either way be the same. In his classic paper, Shannon showed that the only way all these conditions could be satisfied was if H = −K pi logpi where K is an arbitrary positive constant. Let’s review his proof as it appears in the appendix of his paper [7]. Suppose we have an experiment with n = km equally likely outcomes. The average information H1 n 1 n 1 n associated with this experiment is a function of n. In other words, H 1 n 1 n 1 n = An We can indicate the occurrence of an event from km events by a series of m choices from k equally likely possibilities. For example, consider the case of k = 2 and m = 3. There are eight equally likely events; therefore, H1 8 1 8 1 8 = A8. We can indicate occurrence of any particular event as shown in Figure 2.1. In this case, we have a sequence of three selections. Each selection is between two equally likely possibilities. Therefore, H 1 8 1 8 1 8 = A8 = H1 2 1 2 + 1 2 H1 2 1 2 + 1 2 H1 2 1 2 + 1 2 H1 2 1 2 + 1 2 H1 2 1 2 + 1 2 H1 2 1 2 + 1 2 H1 2 1 2 (2.5) = 3H1 2 1 2 = 3A2 In other words, A8 = 3A2 (The rather odd way of writing the left-hand side of Equation (2.5) is to show how the terms correspond to the branches of the tree shown in Figure 2.1.) We can generalize this for the case of n = km as An = Akm = mAk
  • 43. 20 2 L O S S L E S S C O M P R E S S I O N First selection Second selection Third selection Third selection Third selection Third selection Second selection F I G U R E 2. 1 A possible way of identifying the occurrence of an event. Similarly, for jl choices, Ajl = lAj We can pick l arbitrarily large (more on this later) and then choose m so that km ≤ jl ≤ km+1 Taking logarithms of all terms, we get mlogk ≤ llogj ≤ m+1logk Now divide through by llogk to get m l ≤ logj logk ≤ m l + 1 l Recall that we picked l arbitrarily large. If l is arbitrarily large, then 1 l is arbitrarily small. This means that the upper and lower bounds of logj logk can be made arbitrarily close to m l by picking l arbitrarily large. Another way of saying this is m l − logj logk
  • 44. 2.2 A Brief Introduction to Information Theory 21 where can be made arbitrarily small. We will use this fact to find an expression for An and hence for H1 n 1 n . To do this we use our second requirement that H1 n 1 n be a monotonically increasing function of n. As H 1 n 1 n = An this means that An is a monotonically increasing function of n. If km ≤ jl ≤ km+1 then in order to satisfy our second requirement Akm ≤ Ajl ≤ Akm+1 or mAk ≤ lAj ≤ m+1Ak Dividing through by lAk, we get m l ≤ Aj Ak ≤ m l + 1 l Using the same arguments as before, we get m l − Aj Ak where can be made arbitrarily small. Now Aj Ak is at most a distance of away from m l , and logj logk is at most a distance of away from m l . Therefore, Aj Ak is at most a distance of 2 away from logj logk . Aj Ak − logj logk 2 We can pick to be arbitrarily small, and j and k are arbitrary. The only way this inequality can be satisfied for arbitrarily small and arbitrary j and k is for Aj = K logj, where K is an arbitrary constant. In other words, H = K logn Up to this point we have only looked at equally likely events. We now make the transition to the more general case of an experiment with outcomes that are not equally likely. We do that by considering an experiment with ni equally likely outcomes that are grouped in n unequal groups of size ni with rational probabilities (if the probabilities are not rational, we approximate them with rational probabilities and use the continuity requirement): pi = ni n j=1 nj
  • 45. 22 2 L O S S L E S S C O M P R E S S I O N Given that we have ni equally likely events, from the development above we have H = K log nj (2.6) If we indicate an outcome by first indicating which of the n groups it belongs to, and second indicating which member of the group it is, then by our earlier development the average information H is given by H = Hp1p2pn+p1H 1 n1 1 n1 +···+pnH 1 nn 1 nn (2.7) = Hp1p2pn+p1K logn1 +p2K logn2 +···+pnK lognn (2.8) = Hp1p2pn+K n i=1 pi logni (2.9) Equating the expressions in Equations (2.6) and (2.9), we obtain K log nj = Hp1p2pn+K n i=1 pi logni or Hp1p2pn = K log nj −K n i=1 pi logni = −K n i=1 pi logni −log n j=1 nj = −K n i=1 pi logni −log n j=1 nj n i=1 pi (2.10) = −K n i=1 pi logni − n i=1 pi log n j=1 nj = −K n i=1 pi logni −log n j=1 nj = −K n i=1 pi log ni n j=1 nj (2.11) = −K pi logpi (2.12) where, in Equation (2.10) we have used the fact that n i=1 pi = 1. By convention we pick K to be 1, and we have the formula H = − pi logpi Note that this formula is a natural outcome of the requirements we imposed in the beginning. It was not artificially forced in any way. Therein lies the beauty of information theory. Like the laws of physics, its laws are intrinsic in the nature of things. Mathematics is simply a tool to express these relationships.
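As a concrete check of this formula, the short sketch below estimates the first-order entropy of the sequence 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10 used earlier in this section, of its residual sequence, and of the contrived sequence taken in blocks of two. The code is our own illustration; the numbers it prints match the estimates quoted in the text.

```python
# First-order entropy estimates for the sequences discussed in Section 2.2.
from collections import Counter
from math import log2

def entropy(symbols):
    """Estimate -sum p_i log2 p_i from relative frequencies."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * log2(c / n) for c in counts.values())

seq = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
# Residuals of the model x_n = x_{n-1} + r_n, taking x_0 = 0 for the first sample.
residuals = [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]

print(f"{entropy(seq):.2f} bits/symbol")          # about 3.25
print(f"{entropy(residuals):.2f} bits/symbol")    # about 0.70 for the residuals

blocky = [1, 2, 1, 2, 3, 3, 3, 3, 1, 2, 3, 3, 3, 3, 1, 2, 3, 3, 1, 2]
pairs = list(zip(blocky[::2], blocky[1::2]))      # look at the data in blocks of two
print(f"{entropy(blocky):.2f} vs {entropy(pairs):.2f} bits/symbol")   # 1.50 vs 1.00
```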
  • 46. 2.3 Models 23 2 . 3 Models As we saw in Section 2.2, having a good model for the data can be useful in estimating the entropy of the source. As we will see in later chapters, good models for sources lead to more efficient compression algorithms. In general, in order to develop techniques that manipulate data using mathematical operations, we need to have a mathematical model for the data. Obviously, the better the model (i.e., the closer the model matches the aspects of reality that are of interest to us), the more likely it is that we will come up with a satisfactory technique. There are several approaches to building mathematical models. 2 . 3 . 1 P h y s i c a l M o d e l s If we know something about the physics of the data generation process, we can use that information to construct a model. For example, in speech-related applications, knowledge about the physics of speech production can be used to construct a mathematical model for the sampled speech process. Sampled speech can then be encoded using this model. We will discuss speech production models in more detail in Chapter 8. Models for certain telemetry data can also be obtained through knowledge of the under- lying process. For example, if residential electrical meter readings at hourly intervals were to be coded, knowledge about the living habits of the populace could be used to determine when electricity usage would be high and when the usage would be low. Then instead of the actual readings, the difference (residual) between the actual readings and those predicted by the model could be coded. In general, however, the physics of data generation is simply too complicated to under- stand, let alone use to develop a model. Where the physics of the problem is too complicated, we can obtain a model based on empirical observation of the statistics of the data. 2 . 3 . 2 P r o b a b i l i t y M o d e l s The simplest statistical model for the source is to assume that each letter that is generated by the source is independent of every other letter, and each occurs with the same probability. We could call this the ignorance model, as it would generally be useful only when we know nothing about the source. (Of course, that really might be true, in which case we have a rather unfortunate name for the model!) The next step up in complexity is to keep the indepen- dence assumption, but remove the equal probability assumption and assign a probability of occurrence to each letter in the alphabet. For a source that generates letters from an alphabet = a1a2aM , we can have a probability model = Pa1Pa2PaM . Given a probability model (and the independence assumption), we can compute the entropy of the source using Equation (2.4). As we will see in the following chapters using the probability model, we can also construct some very efficient codes to represent the letters in . Of course, these codes are only efficient if our mathematical assumptions are in accord with reality. If the assumption of independence does not fit with our observation of the data, we can generally find better compression schemes if we discard this assumption. When we discard
the independence assumption, we have to come up with a way to describe the dependence of elements of the data sequence on each other.

2.3.3 Markov Models

One of the most popular ways of representing dependence in the data is through the use of Markov models, named after the Russian mathematician Andrei Andrevich Markov (1856–1922). For models used in lossless compression, we use a specific type of Markov process called a discrete time Markov chain. Let $\{x_n\}$ be a sequence of observations. This sequence is said to follow a $k$th-order Markov model if

$$P(x_n \mid x_{n-1}, \ldots, x_{n-k}) = P(x_n \mid x_{n-1}, \ldots, x_{n-k}, \ldots) \qquad (2.13)$$

In other words, knowledge of the past $k$ symbols is equivalent to the knowledge of the entire past history of the process. The values taken on by the set $\{x_{n-1}, \ldots, x_{n-k}\}$ are called the states of the process. If the size of the source alphabet is $l$, then the number of states is $l^k$. The most commonly used Markov model is the first-order Markov model, for which

$$P(x_n \mid x_{n-1}) = P(x_n \mid x_{n-1}, x_{n-2}, x_{n-3}, \ldots) \qquad (2.14)$$

Equations (2.13) and (2.14) indicate the existence of dependence between samples. However, they do not describe the form of the dependence. We can develop different first-order Markov models depending on our assumption about the form of the dependence between samples.

If we assumed that the dependence was introduced in a linear manner, we could view the data sequence as the output of a linear filter driven by white noise. The output of such a filter can be given by the difference equation

$$x_n = \rho x_{n-1} + \epsilon_n \qquad (2.15)$$

where $\epsilon_n$ is a white noise process. This model is often used when developing coding algorithms for speech and images.

The use of the Markov model does not require the assumption of linearity. For example, consider a binary image. The image has only two types of pixels, white pixels and black pixels. We know that the appearance of a white pixel as the next observation depends, to some extent, on whether the current pixel is white or black. Therefore, we can model the pixel process as a discrete time Markov chain. Define two states $S_w$ and $S_b$ ($S_w$ would correspond to the case where the current pixel is a white pixel, and $S_b$ corresponds to the case where the current pixel is a black pixel). We define the transition probabilities $P(w|b)$ and $P(b|w)$, and the probability of being in each state $P(S_w)$ and $P(S_b)$. The Markov model can then be represented by the state diagram shown in Figure 2.2.

The entropy of a finite state process with states $S_i$ is simply the average value of the entropy at each state:

$$H = \sum_{i=1}^{M} P(S_i) H(S_i) \qquad (2.16)$$
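Equation (2.16) is straightforward to evaluate for a two-state model. The sketch below is our own illustration; plugging in the probabilities of the binary-image example that follows reproduces the value quoted there.

```python
# Entropy of a two-state Markov model, Equation (2.16).
from math import log2

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p)."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def markov_entropy(p_sw, p_b_given_w, p_w_given_b):
    """Average the per-state entropies weighted by the state probabilities."""
    return p_sw * binary_entropy(p_b_given_w) + (1 - p_sw) * binary_entropy(p_w_given_b)

# Values from the binary-image example that follows (Example 2.3.1).
print(f"{markov_entropy(30/31, 0.01, 0.3):.3f} bits")   # about 0.107, versus 0.206 iid
```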
[Figure 2.2: a state diagram with states $S_w$ and $S_b$; the arcs are labeled with the transition probabilities P(w|w), P(b|w), P(w|b), and P(b|b).]

FIGURE 2.2  A two-state Markov model for binary images.

For our particular example of a binary image

$$H(S_w) = -P(b|w) \log P(b|w) - P(w|w) \log P(w|w)$$

where $P(w|w) = 1 - P(b|w)$. $H(S_b)$ can be calculated in a similar manner.

Example 2.3.1: Markov model

To see the effect of modeling on the estimate of entropy, let us calculate the entropy for a binary image, first using a simple probability model and then using the finite state model described above. Let us assume the following values for the various probabilities:

$$P(S_w) = 30/31 \qquad P(S_b) = 1/31$$
$$P(w|w) = 0.99 \qquad P(b|w) = 0.01 \qquad P(b|b) = 0.7 \qquad P(w|b) = 0.3$$

Then the entropy using a probability model and the iid assumption is

$$H = -\frac{30}{31} \log \frac{30}{31} - \frac{1}{31} \log \frac{1}{31} = 0.206 \text{ bits}$$

Now using the Markov model

$$H(S_b) = -0.3 \log 0.3 - 0.7 \log 0.7 = 0.881 \text{ bits}$$

and

$$H(S_w) = -0.01 \log 0.01 - 0.99 \log 0.99 = 0.081 \text{ bits}$$

which, using Equation (2.16), results in an entropy for the Markov model of 0.107 bits, about half of the entropy obtained using the iid assumption.

Markov Models in Text Compression

As expected, Markov models are particularly useful in text compression, where the probability of the next letter is heavily influenced by the preceding letters. In fact, the use of Markov models for written English appears in the original work of Shannon [7]. In current text compression literature, the kth-order Markov models are more widely known
as finite context models, with the word context being used for what we have earlier defined as state. Consider the word preceding. Suppose we have already processed precedin and are going to encode the next letter. If we take no account of the context and treat each letter as a surprise, the probability of the letter g occurring is relatively low. If we use a first-order Markov model or single-letter context (that is, we look at the probability model given n), we can see that the probability of g would increase substantially. As we increase the context size (go from n to in to din and so on), the probability of the alphabet becomes more and more skewed, which results in lower entropy.

Shannon used a second-order model for English text consisting of the 26 letters and one space to obtain an entropy of 3.1 bits/letter [8]. Using a model where the output symbols were words rather than letters brought down the entropy to 2.4 bits/letter. Shannon then used predictions generated by people (rather than statistical models) to estimate the upper and lower bounds on the entropy of the second-order model. For the case where the subjects knew the 100 previous letters, he estimated these bounds to be 1.3 and 0.6 bits/letter, respectively.

The longer the context, the better its predictive value. However, if we were to store the probability model with respect to all contexts of a given length, the number of contexts would grow exponentially with the length of context. Furthermore, given that the source imposes some structure on its output, many of these contexts may correspond to strings that would never occur in practice. Consider a context model of order four (the context is determined by the last four symbols). If we take an alphabet size of 95, the possible number of contexts is $95^4$, more than 81 million!

This problem is further exacerbated by the fact that different realizations of the source output may vary considerably in terms of repeating patterns. Therefore, context modeling in text compression schemes tends to be an adaptive strategy in which the probabilities for different symbols in the different contexts are updated as they are encountered. However, this means that we will often encounter symbols that have not been encountered before for any of the given contexts (this is known as the zero frequency problem). The larger the context, the more often this will happen. This problem could be resolved by sending a code to indicate that the following symbol was being encountered for the first time, followed by a prearranged code for that symbol. This would significantly increase the length of the code for the symbol on its first occurrence (in the given context). However, if this situation did not occur too often, the overhead associated with such occurrences would be small compared to the total number of bits used to encode the output of the source. Unfortunately, in context-based encoding, the zero frequency problem is encountered often enough for overhead to be a problem, especially for longer contexts.

Solutions to this problem are presented by the ppm (prediction with partial match) algorithm and its variants (described in detail in Chapter 6). Briefly, the ppm algorithms first attempt to find if the symbol to be encoded has a nonzero probability with respect to the maximum context length. If this is so, the symbol is encoded and transmitted.
If not, an escape symbol is transmitted, the context size is reduced by one, and the process is repeated. This procedure is repeated until a context is found with respect to which the symbol has a nonzero probability. To guarantee that this process converges, a null context is always included with respect to which all symbols have equal probability. Initially, only the shorter contexts are likely to be used. However, as more and more of the source output is processed, the longer contexts, which offer better prediction,
  • 50. 2.4 Coding 27 Source 1 Source 2 Source n Switch F I G U R E 2. 3 A composite source. will be used more often. The probability of the escape symbol can be computed in a number of different ways leading to different implementations [1]. The use of Markov models in text compression is a rich and active area of research. We describe some of these approaches in Chapter 6 (for more details, see [1]). 2 . 3 . 4 C o m p o s i t e S o u r c e M o d e l In many applications, it is not easy to use a single model to describe the source. In such cases, we can define a composite source, which can be viewed as a combination or composition of several sources, with only one source being active at any given time. A composite source can be represented as a number of individual sources i, each with its own model i, and a switch that selects a source i with probability Pi (as shown in Figure 2.3). This is an exceptionally rich model and can be used to describe some very complicated processes. We will describe this model in more detail when we need it. 2 . 4 Coding When we talk about coding in this chapter (and through most of this book), we mean the assignment of binary sequences to elements of an alphabet. The set of binary sequences is called a code, and the individual members of the set are called codewords. An alphabet is a collection of symbols called letters. For example, the alphabet used in writing most books consists of the 26 lowercase letters, 26 uppercase letters, and a variety of punctuation marks. In the terminology used in this book, a comma is a letter. The ASCII code for the letter a is 1000011, the letter A is coded as 1000001, and the letter “,” is coded as 0011010. Notice that the ASCII code uses the same number of bits to represent each symbol. Such a code is called a fixed-length code. If we want to reduce the number of bits required to represent different messages, we need to use a different number of bits to represent different symbols. If we use fewer bits to represent symbols that occur more often, on the average we would use fewer bits per symbol. The average number of bits per symbol is often called the rate of the code. The idea of using fewer bits to represent symbols that occur more often is the
same idea that is used in Morse code: the codewords for letters that occur more frequently are shorter than for letters that occur less frequently. For example, the codeword for E is ·, while the codeword for Z is − − · · [9].

2.4.1 Uniquely Decodable Codes

The average length of the code is not the only important point in designing a "good" code. Consider the following example adapted from [10]. Suppose our source alphabet consists of four letters $a_1$, $a_2$, $a_3$, and $a_4$, with probabilities $P(a_1) = \frac{1}{2}$, $P(a_2) = \frac{1}{4}$, and $P(a_3) = P(a_4) = \frac{1}{8}$. The entropy for this source is 1.75 bits/symbol. Consider the codes for this source in Table 2.1.

TABLE 2.1  Four different codes for a four-letter alphabet.

  Letters          Probability   Code 1   Code 2   Code 3   Code 4
  a1               0.5           0        0        0        0
  a2               0.25          0        1        10       01
  a3               0.125         1        00       110      011
  a4               0.125         10       11       111      0111
  Average length                 1.125    1.25     1.75     1.875

The average length $\bar{l}$ for each code is given by

$$\bar{l} = \sum_{i=1}^{4} P(a_i)\, n(a_i)$$

where $n(a_i)$ is the number of bits in the codeword for letter $a_i$ and the average length is given in bits/symbol. Based on the average length, Code 1 appears to be the best code. However, to be useful, a code should have the ability to transfer information in an unambiguous manner. This is obviously not the case with Code 1. Both $a_1$ and $a_2$ have been assigned the codeword 0. When a 0 is received, there is no way to know whether an $a_1$ was transmitted or an $a_2$. We would like each symbol to be assigned a unique codeword.

At first glance Code 2 does not seem to have the problem of ambiguity; each symbol is assigned a distinct codeword. However, suppose we want to encode the sequence $a_2\, a_1\, a_1$. Using Code 2, we would encode this with the binary string 100. However, when the string 100 is received at the decoder, there are several ways in which the decoder can decode this string. The string 100 can be decoded as $a_2\, a_1\, a_1$, or as $a_2\, a_3$. This means that once a sequence is encoded with Code 2, the original sequence cannot be recovered with certainty. In general, this is not a desirable property for a code. We would like unique decodability from the code; that is, any given sequence of codewords can be decoded in one, and only one, way.

We have already seen that Code 1 and Code 2 are not uniquely decodable. How about Code 3? Notice that the first three codewords all end in a 0. In fact, a 0 always denotes the termination of a codeword. The final codeword contains no 0s and is 3 bits long. Because all other codewords have fewer than three 1s and terminate in a 0, the only way we can get three 1s in a row is as a code for $a_4$. The decoding rule is simple. Accumulate bits until you get a 0 or until you have three 1s. There is no ambiguity in this rule, and it is reasonably
  • 52. 2.4 Coding 29 easy to see that this code is uniquely decodable. With Code 4 we have an even simpler condition. Each codeword starts with a 0, and the only time we see a 0 is in the beginning of a codeword. Therefore, the decoding rule is accumulate bits until you see a 0. The bit before the 0 is the last bit of the previous codeword. There is a slight difference between Code 3 and Code 4. In the case of Code 3, the decoder knows the moment a code is complete. In Code 4, we have to wait till the beginning of the next codeword before we know that the current codeword is complete. Because of this property, Code 3 is called an instantaneous code. Although Code 4 is not an instantaneous code, it is almost that. While this property of instantaneous or near-instantaneous decoding is a nice property to have, it is not a requirement for unique decodability. Consider the code shown in Table 2.2. Let’s decode the string 011111111111111111. In this string, the first codeword is either 0 corresponding to a1 or 01 corresponding to a2. We cannot tell which one until we have decoded the whole string. Starting with the assumption that the first codeword corresponds to a1, the next eight pairs of bits are decoded as a3. However, after decoding eight a3s, we are left with a single (dangling) 1 that does not correspond to any codeword. On the other hand, if we assume the first codeword corresponds to a2, we can decode the next 16 bits as a sequence of eight a3s, and we do not have any bits left over. The string can be uniquely decoded. In fact, Code 5, while it is certainly not instantaneous, is uniquely decodable. We have been looking at small codes with four letters or less. Even with these, it is not immediately evident whether the code is uniquely decodable or not. In deciding whether larger codes are uniquely decodable, a systematic procedure would be useful. Actually, we should include a caveat with that last statement. Later in this chapter we will include a class of variable-length codes that are always uniquely decodable, so a test for unique decodability may not be that necessary. You might wish to skip the following discussion for now, and come back to it when you find it necessary. Before we describe the procedure for deciding whether a code is uniquely decodable, let’s take another look at our last example. We found that we had an incorrect decoding because we were left with a binary string (1) that was not a codeword. If this had not happened, we would have had two valid decodings. For example, consider the code shown in Table 2.3. Let’s T A B L E 2 . 2 Code 5. Letter Codeword a1 0 a2 01 a3 11 T A B L E 2 . 3 Code 6. Letter Codeword a1 0 a2 01 a3 10
encode the sequence $a_1$ followed by eight $a_3$s using this code. The coded sequence is 01010101010101010. The first bit is the codeword for $a_1$. However, we can also decode it as the first bit of the codeword for $a_2$. If we use this (incorrect) decoding, we decode the next seven pairs of bits as the codewords for $a_2$. After decoding seven $a_2$s, we are left with a single 0 that we decode as $a_1$. Thus, the incorrect decoding is also a valid decoding, and this code is not uniquely decodable.

A Test for Unique Decodability

In the previous examples, in the case of the uniquely decodable code, the binary string left over after we had gone through an incorrect decoding was not a codeword. In the case of the code that was not uniquely decodable, in the incorrect decoding what was left was a valid codeword. Based on whether the dangling suffix is a codeword or not, we get the following test [11, 12].

We start with some definitions. Suppose we have two binary codewords a and b, where a is k bits long, b is n bits long, and $k < n$. If the first k bits of b are identical to a, then a is called a prefix of b. The last $n - k$ bits of b are called the dangling suffix [11]. For example, if a = 010 and b = 01011, then a is a prefix of b and the dangling suffix is 11.

Construct a list of all the codewords. Examine all pairs of codewords to see if any codeword is a prefix of another codeword. Whenever you find such a pair, add the dangling suffix to the list unless you have added the same dangling suffix to the list in a previous iteration. Now repeat the procedure using this larger list. Continue in this fashion until one of the following two things happens:

1. You get a dangling suffix that is a codeword.
2. There are no more unique dangling suffixes.

If you get the first outcome, the code is not uniquely decodable. However, if you get the second outcome, the code is uniquely decodable. Let's see how this procedure works with a couple of examples.

Example 2.4.1:

Consider Code 5. First list the codewords:

{0, 01, 11}

The codeword 0 is a prefix for the codeword 01. The dangling suffix is 1. There are no other pairs for which one element of the pair is the prefix of the other. Let us augment the codeword list with the dangling suffix:

{0, 01, 11, 1}

Comparing the elements of this list, we find 0 is a prefix of 01 with a dangling suffix of 1. But we have already included 1 in our list. Also, 1 is a prefix of 11. This gives us a dangling suffix of 1, which is already in the list. There are no other pairs that would generate a dangling suffix, so we cannot augment the list any further. Therefore, Code 5 is uniquely decodable.
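The procedure just described is mechanical enough to code up directly. The sketch below is our own implementation of the dangling-suffix test; it reproduces the conclusions of the two examples.

```python
# Test a code for unique decodability by iterating dangling suffixes.
def is_uniquely_decodable(codewords):
    codeset = set(codewords)
    items = set(codeset)                  # codewords plus dangling suffixes found so far
    while True:
        new_suffixes = set()
        for a in items:
            for b in items:
                if a != b and b.startswith(a):
                    new_suffixes.add(b[len(a):])   # a is a prefix of b
        if new_suffixes & codeset:
            return False                  # a dangling suffix is itself a codeword
        if new_suffixes <= items:
            return True                   # no new dangling suffixes were produced
        items |= new_suffixes

print(is_uniquely_decodable(["0", "01", "11"]))   # Code 5: True
print(is_uniquely_decodable(["0", "01", "10"]))   # Code 6: False
```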
  • 54. 2.4 Coding 31 E x a m p l e 2 . 4 . 2: Consider Code 6. First list the codewords: 0, 01, 10. The codeword 0 is a prefix for the codeword 01. The dangling suffix is 1. There are no other pairs for which one element of the pair is the prefix of the other. Augmenting the codeword list with 1, we obtain the list 0, 01, 10, 1. In this list, 1 is a prefix for 10. The dangling suffix for this pair is 0, which is the codeword for a1. Therefore, Code 6 is not uniquely decodable.
  2 . 4 . 2 P r e f i x C o d e s The test for unique decodability requires examining the dangling suffixes initially generated by codeword pairs in which one codeword is the prefix of the other. If the dangling suffix is itself a codeword, then the code is not uniquely decodable. One type of code in which we will never face the possibility of a dangling suffix being a codeword is a code in which no codeword is a prefix of another. In this case, the set of dangling suffixes is the null set, and we do not have to worry about finding a dangling suffix that is identical to a codeword. A code in which no codeword is a prefix of another codeword is called a prefix code. A simple way to check if a code is a prefix code is to draw the rooted binary tree corresponding to the code. Draw a tree that starts from a single node (the root node) and has a maximum of two possible branches at each node. One of these branches corresponds to a 1 and the other branch corresponds to a 0. In this book, we will adopt the convention that when we draw a tree with the root node at the top, the left branch corresponds to a 0 and the right branch corresponds to a 1. Using this convention, we can draw the binary trees for Code 2, Code 3, and Code 4 as shown in Figure 2.4. Note that apart from the root node, the trees have two kinds of nodes: nodes that give rise to other nodes and nodes that do not. The first kind are called internal nodes, and the second kind are called external nodes or leaves. In a prefix code, the codewords are associated only with the external nodes. A code that is not a prefix code, such as Code 4, will have codewords associated with internal nodes. F I G U R E 2. 4 Binary trees for Code 2, Code 3, and Code 4. The code for any symbol can be obtained
  • 55. 32 2 L O S S L E S S C O M P R E S S I O N by traversing the tree from the root to the external node corresponding to that symbol. Each branch on the way contributes a bit to the codeword: a 0 for each left branch and a 1 for each right branch. It is nice to have a class of codes, whose members are so clearly uniquely decodable. However, are we losing something if we restrict ourselves to prefix codes? Could it be that if we do not restrict ourselves to prefix codes, we can find shorter codes? Fortunately for us the answer is no. For any nonprefix uniquely decodable code, we can always find a prefix code with the same codeword lengths. We prove this in the next section. 2 . 4 . 3 T h e K r a f t - M c M i l l a n I n e q u a l i t y The particular result we look at in this section consists of two parts. The first part provides a necessary condition on the codeword lengths of uniquely decodable codes. The second part shows that we can always find a prefix code that satisfies this necessary condition. Therefore, if we have a uniquely decodable code that is not a prefix code, we can always find a prefix code with the same codeword lengths. T h e o r e m Let be a code with N codewords with lengths l1l2lN . If is uniquely decodable, then K = N i=1 2−li ≤ 1 This inequality is known as the Kraft-McMillan inequality. P r o o f The proof works by looking at the nth power of K. If K is greater than one, then Kn should grow exponentially with n. If it does not grow exponentially with n, then this is proof that N i=1 2−li ≤ 1. Let n be an arbitrary integer. Then N i=1 2−li n = N i1=1 2−li1 N i2=1 2−li2 ··· N in=1 2−lin (2.17) = N i1=1 N i2=1 ··· N in=1 2−li1 +li2 +···+lin (2.18) The exponent li1 +li2 +···+lin is simply the length of n codewords from the code . The smallest value that this exponent can take is greater than or equal to n, which would be the case if all codewords were 1 bit long. If l = maxl1l2lN then the largest value that the exponent can take is less than or equal to nl. Therefore, we can write this summation as Kn = nl k=n Ak2−k
  • 56. 2.4 Coding 33 where Ak is the number of combinations of n codewords that have a combined length of k. Let’s take a look at the size of this coefficient. The number of possible distinct binary sequences of length k is 2k . If this code is uniquely decodable, then each sequence can represent one and only one sequence of codewords. Therefore, the number of possible combinations of codewords whose combined length is k cannot be greater than 2k . In other words, Ak ≤ 2k This means that Kn = nl k=n Ak2−k ≤ nl k=n 2k 2−k = nl−n+1 (2.19) But if K is greater than one, it will grow exponentially with n, while nl − 1 + 1 can only grow linearly. So if K is greater than one, we can always find an n large enough that the inequality (2.19) is violated. Therefore, for a uniquely decodable code , K is less than or equal to one. This part of the Kraft-McMillan inequality provides a necessary condition for uniquely decodable codes. That is, if a code is uniquely decodable, the codeword lengths have to satisfy the inequality. The second part of this result is that if we have a set of codeword lengths that satisfy the inequality, we can always find a prefix code with those codeword lengths. The proof of this assertion presented here is adapted from [6]. T h e o r e m Given a set of integers l1l2lN that satisfy the inequality N i=1 2−li ≤ 1 we can always find a prefix code with codeword lengths l1l2lN . P r o o f We will prove this assertion by developing a procedure for constructing a prefix code with codeword lengths l1l2lN that satisfy the given inequality. Without loss of generality, we can assume that l1 ≤ l2 ≤ ··· ≤ lN Define a sequence of numbers w1w2wN as follows: w1 = 0 wj = j−1 i=1 2lj−li j 1
  • 57. 34 2 L O S S L E S S C O M P R E S S I O N The binary representation of wj for j 1 would take up log2wj + 1 bits. We will use this binary representation to construct a prefix code. We first note that the number of bits in the binary representation of wj is less than or equal to lj. This is obviously true for w1. For j 1, log2wj +1 = log2 j−1 i=1 2lj−li +1 = log2 2lj j−1 i=1 2−li +2−lj = lj +log2 j i=1 2−li ≤ lj The last inequality results from the hypothesis of the theorem that N i=1 2−li ≤ 1, which implies that j i=1 2−li ≤ 1. As the logarithm of a number less than one is negative, lj + log2 j i=1 2−li has to be less than lj. Using the binary representation of wj, we can devise a binary code in the following manner: If log2wj +1 = lj, then the jth codeword cj is the binary representation of wj. If log2wj + 1 lj, then cj is the binary representation of wj, with lj − log2wj + 1 zeros appended to the right. This is certainly a code, but is it a prefix code? If we can show that the code = c1c2cN is a prefix code, then we will have proved the theorem by construction. Suppose that our claim is not true. Then for some j k, cj is a prefix of ck. This means that the lj most significant bits of wk form the binary representation of wj. Therefore if we right-shift the binary representation of wk by lk − lj bits, we should get the binary representation for wj. We can write this as wj = wk 2lk−lj However, wk = k−1 i=1 2lk−li Therefore, wk 2lk−lj = k−1 i=0 2lj−li = wj + k−1 i=j 2lj−li = wj +20 + k−1 i=j+1 2lj−li ≥ wj +1 (2.20)
  • 58. 2.5 Algorithmic Information Theory 35 That is, the smallest value for wk 2lk−lj is wj +1. This contradicts the requirement for cj being the prefix of ck. Therefore, cj cannot be the prefix for ck. As j and k were arbitrary, this means that no codeword is a prefix of another codeword, and the code is a prefix code. Therefore, if we have a uniquely decodable code, the codeword lengths have to satisfy the Kraft-McMillan inequality. And, given codeword lengths that satisfy the Kraft-McMillan inequality, we can always find a prefix code with those codeword lengths. Thus, by restricting ourselves to prefix codes, we are not in danger of overlooking nonprefix uniquely decodable codes that have a shorter average length. 2 . 5 Algorithmic Information Theory The theory of information described in the previous sections is intuitively satisfying and has useful applications. However, when dealing with real world data, it does have some theoretical difficulties. Suppose you were given the task of developing a compression scheme for use with a specific set of documentations. We can view the entire set as a single long string. You could develop models for the data. Based on these models you could calculate probabilities using the relative frequency approach. These probabilities could then be used to obtain an estimate of the entropy and thus an estimate of the amount of compression available. All is well except for a fly in the “ointment.” The string you have been given is fixed. There is nothing probabilistic about it. There is no abstract source that will generate different sets of documentation at different times. So how can we talk about the entropies without pretending that reality is somehow different from what it actually is? Unfortunately, it is not clear that we can. Our definition of entropy requires the existence of an abstract source. Our estimate of the entropy is still useful. It will give us a very good idea of how much compression we can get. So, practically speaking, information theory comes through. However, theoretically it seems there is some pretending involved. Algorithmic information theory is a different way of looking at information that has not been as useful in practice (and therefore we will not be looking at it a whole lot) but it gets around this theoretical problem. At the heart of algorithmic information theory is a measure called Kolmogorov complexity. This measure, while it bears the name of one person, was actually discovered independently by three people: R. Solomonoff, who was exploring machine learning; the Russian mathematician A.N. Kolmogorov; and G. Chaitin, who was in high school when he came up with this idea. The Kolmogorov complexity Kx of a sequence x is the size of the program needed to generate x. In this size we include all inputs that might be needed by the program. We do not specify the programming language because it is always possible to translate a program in one language to a program in another language at fixed cost. If x was a sequence of all ones, a highly compressible sequence, the program would simply be a print statement in a loop. On the other extreme, if x were a random sequence with no structure then the only program that could generate it would contain the sequence itself. The size of the program, would be slightly larger than the sequence itself. Thus, there is a clear correspondence between the size of the smallest program that can generate a sequence and the amount of compression that can be obtained. 
Kolmogorov complexity seems to be the
  • 59. 36 2 L O S S L E S S C O M P R E S S I O N ideal measure to use in data compression. The problem is that we do not know of any systematic way of computing or closely approximating Kolmogorov complexity. Clearly, the size of any program that can generate a particular sequence is an upper bound on the Kolmogorov complexity of that sequence. However, we have no way of determining a lower bound. Thus, while the notion of Kolmogorov complexity is more satisfying theoretically than the notion of entropy when compressing sequences, in practice it is not yet as helpful. However, given the active interest in these ideas, it is quite possible that they will result in more practical applications.
  2 . 6 Minimum Description Length Principle One of the more practical offshoots of Kolmogorov complexity is the minimum description length (MDL) principle. The first discoverer of Kolmogorov complexity, Ray Solomonoff, viewed the concept of a program that would generate a sequence as a way of modeling the data. Independently of Solomonoff, but inspired nonetheless by the ideas of Kolmogorov complexity, Jorma Rissanen in 1978 [13] developed the modeling approach commonly known as MDL. F I G U R E 2. 5 An example to illustrate the MDL principle.
  • 60. 2.7 Summary 37 Let Mj be a model from a set of models that attempt to characterize the structure in a sequence x. Let DMj be the number of bits required to describe the model Mj. For example, if the set of models can be represented by a (possibly variable) number of coefficients, then the description of Mj would include the number of coefficients and the value of each coefficient. Let RMj x be the number of bits required to represent x with respect to the model Mj. The minimum description length would be given by min j DMj +RMj x Consider the example shown as Figure 2. 5, where the X’s represent data values. Suppose the set of models is the set of kth order polynomials. We have also sketched two polynomials that could be used to model the data. Clearly, the higher-order polynomial does a much “better” job of modeling the data in the sense that the model exactly describes the data. To describe the higher order polynomial, we need to specify the value of each coefficient. The coefficients have to be exact if the polynomial is to exactly model the data requiring a large number of bits. The quadratic model, on the other hand, does not fit any of the data values. However, its description is very simple and the data values are either +1 or −1 away from the quadratic. So we could exactly represent the data by sending the coefficients of the quadratic (1, 0) and 1 bit per data value to indicate whether each data value is +1 or −1 away from the quadratic. In this case, from a compression point of view, using the worse model actually gives better compression. 2 . 7 Summary In this chapter we learned some of the basic definitions of information theory. This was a rather brief visit, and we will revisit the subject in Chapter 8. However, the coverage in this chapter will be sufficient to take us through the next four chapters. The concepts introduced in this chapter allow us to estimate the number of bits we need to represent the output of a source given the probability model for the source. The process of assigning a binary representation to the output of a source is called coding. We have introduced the concepts of unique decodability and prefix codes, which we will use in the next two chapters when we describe various coding algorithms. We also looked, rather briefly, at different approaches to modeling. If we need to understand a model in more depth later in the book, we will devote more attention to it at that time. However, for the most part, the coverage of modeling in this chapter will be sufficient to understand methods described in the next four chapters. F u r t h e r R e a d i n g 1. A very readable book on information theory and its applications in a number of fields is Symbols, Signals, and Noise—The Nature and Process of Communications, by J.R. Pierce [14]. 2. Another good introductory source for the material in this chapter is Chapter 6 of Coding and Information Theory, by R.W. Hamming [9].
  • 61. 38 2 L O S S L E S S C O M P R E S S I O N 3. Various models for text compression are described very nicely and in more detail in Text Compression, by T.C. Bell, J.G. Cleary, and I.H. Witten [1]. 4. For a more thorough and detailed account of information theory, the following books are especially recommended (the first two are my personal favorites): Information The- ory, by R.B. Ash [15]; Transmission of Information, by R.M. Fano [16]; Information Theory and Reliable Communication, by R.G. Gallagher [11]; Entropy and Informa- tion Theory, by R.M. Gray [17]; Elements of Information Theory, by T.M. Cover and J.A. Thomas [3]; and The Theory of Information and Coding, by R.J. McEliece [6]. 5. Kolmogorov complexity is addressed in detail in An Introduction to Kolmogorov Complexity and Its Applications, by M. Li and P. Vitanyi [18]. 6. A very readable overview of Kolmogorov complexity in the context of lossless compression can be found in the chapter Complexity Measures, by S.R. Tate [19]. 7. Various aspects of the minimum description length principle are discussed in Advances in Minimum Description Length edited by P. Grunwald, I.J. Myung, and M.A. Pitt [20]. Included in this book is a very nice introduction to the minimum description length principle by Peter Grunwald [21]. 2 . 8 Projects and Problems 1. Suppose X is a random variable that takes on values from an M-letter alphabet. Show that 0 ≤ HX ≤ log2 M. 2. Show that for the case where the elements of an observed sequence are iid, the entropy is equal to the first-order entropy. 3. Given an alphabet = a1a2a3a4, find the first-order entropy in the following cases: (a) Pa1 = Pa2 = Pa3 = Pa4 = 1 4 . (b) Pa1 = 1 2 , Pa2 = 1 4 , Pa3 = Pa4 = 1 8 . (c) Pa1 = 0505, Pa2 = 1 4 , Pa3 = 1 8 , and Pa4 = 012. 4. Suppose we have a source with a probability model P = p0p1pm and entropy HP. Suppose we have another source with probability model Q = q0q1qm and entropy HQ, where qi = pi i = 01j −2j +1m and qj = qj−1 = pj +pj−1 2 How is HQ related to HP (greater, equal, or less)? Prove your answer.
  • 62. 2.8 Projects and Problems 39 5. There are several image and speech files among the accompanying data sets. (a) Write a program to compute the first-order entropy of some of the image and speech files. (b) Pick one of the image files and compute its second-order entropy. Comment on the difference between the first- and second-order entropies. (c) Compute the entropy of the differences between neighboring pixels for the image you used in part (b). Comment on what you discover. 6. Conduct an experiment to see how well a model can describe a source. (a) Write a program that randomly selects letters from the 26-letter alphabet abz and forms four-letter words. Form 100 such words and see how many of these words make sense. (b) Among the accompanying data sets is a file called 4letter.words, which contains a list of four-letter words. Using this file, obtain a probability model for the alphabet. Now repeat part (a) generating the words using the probability model. To pick letters according to a probability model, construct the cumulative density function (cdf) FXx (see Appendix A for the definition of cdf ). Using a uniform pseudorandom number generator to generate a value r, where 0 ≤ r 1, pick the letter xk if FXxk −1 ≤ r FXxk. Compare your results with those of part (a). (c) Repeat (b) using a single-letter context. (d) Repeat (b) using a two-letter context. 7. Determine whether the following codes are uniquely decodable: (a) 00111111 (b) 001110111 (c) 010110111 (d) 110110111 8. Using a text file compute the probabilities of each letter pi. (a) Assumethatweneedacodewordoflengthlog2 1 pi toencodetheletteri.Determine the number of bits needed to encode the file. (b) Compute the conditional probabilities Pi/j of a letter i given that the previous letter is j. Assume that we need log2 1 Pi/j to represent a letter i that follows a letter j. Determine the number of bits needed to encode the file.
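A small helper along the following lines may be useful for the entropy estimates asked for in Problems 5 and 8. This is only a sketch under assumptions not stated in the text: the file name is a placeholder, the data are read as raw bytes, and the probabilities are estimated with the relative frequency approach described earlier in the chapter.

    import math
    from collections import Counter

    def first_order_entropy(symbols):
        # Estimate H = -sum p_i log2 p_i using relative frequencies
        counts = Counter(symbols)
        n = len(symbols)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # "testfile.raw" is a placeholder; substitute one of the accompanying image or speech files
    with open("testfile.raw", "rb") as f:
        data = f.read()
    print(first_order_entropy(data), "bits/symbol")

    # For Problem 5(c): entropy of the differences between neighboring samples
    diffs = [data[i] - data[i - 1] for i in range(1, len(data))]
    print(first_order_entropy(diffs), "bits/symbol")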
  • 64. 3 Huffman Coding 3 . 1 Overview I n this chapter we describe a very popular coding algorithm called the Huffman coding algorithm. We first present a procedure for building Huffman codes when the probability model for the source is known, then a procedure for building codes when the source statistics are unknown. We also describe a few techniques for code design that are in some sense similar to the Huffman coding approach. Finally, we give some examples of using the Huffman code for image compression, audio compression, and text compression. 3 . 2 The Huffman Coding Algorithm This technique was developed by David Huffman as part of a class assignment; the class was the first ever in the area of information theory and was taught by Robert Fano at MIT [22]. The codes generated using this technique or procedure are called Huffman codes. These codes are prefix codes and are optimum for a given model (set of probabilities). The Huffman procedure is based on two observations regarding optimum prefix codes. 1. In an optimum code, symbols that occur more frequently (have a higher probability of occurrence) will have shorter codewords than symbols that occur less frequently. 2. Inanoptimumcode,thetwosymbolsthatoccurleastfrequentlywillhavethesamelength. It is easy to see that the first observation is correct. If symbols that occur more often had codewords that were longer than the codewords for symbols that occurred less often, the average number of bits per symbol would be larger than if the conditions were reversed. Therefore, a code that assigns longer codewords to symbols that occur more frequently cannot be optimum.
  • 65. 42 3 H U F F M A N C O D I N G To see why the second observation holds true, consider the following situation. Suppose an optimum code exists in which the two codewords corresponding to the two least probable symbols do not have the same length. Suppose the longer codeword is k bits longer than the shorter codeword. Because this is a prefix code, the shorter codeword cannot be a prefix of the longer codeword. This means that even if we drop the last k bits of the longer codeword, the two codewords would still be distinct. As these codewords correspond to the least probable symbols in the alphabet, no other codeword can be longer than these codewords; therefore, there is no danger that the shortened codeword would become the prefix of some other codeword. Furthermore, by dropping these k bits we obtain a new code that has a shorter average length than . But this violates our initial contention that is an optimal code. Therefore, for an optimal code the second observation also holds true. The Huffman procedure is obtained by adding a simple requirement to these two obser- vations. This requirement is that the codewords corresponding to the two lowest probability symbols differ only in the last bit. That is, if and are the two least probable symbols in an alphabet, if the codeword for was m∗0, the codeword for would be m∗1. Here m is a string of 1s and 0s, and ∗ denotes concatenation. This requirement does not violate our two observations and leads to a very simple encoding procedure. We describe this procedure with the help of the following example. E x a m p l e 3 . 2 . 1: D e s i g n o f a H u f f m a n c o d e Let us design a Huffman code for a source that puts out letters from an alphabet = a1a2a3a4a5 with Pa1 = Pa3 = 0 2, Pa2 = 0 4, and Pa4 = Pa5 = 0 1. The entropy for this source is 2.122 bits/symbol. To design the Huffman code, we first sort the letters in a descending probability order as shown in Table 3.1. Here cai denotes the codeword for ai. T A B L E 3 . 1 The initial five-letter alphabet. Letter Probability Codeword a2 0 4 ca2 a1 0 2 ca1 a3 0 2 ca3 a4 0 1 ca4 a5 0 1 ca5 The two symbols with the lowest probability are a4 and a5. Therefore, we can assign their codewords as ca4 = 1 ∗0 ca5 = 1 ∗1 where 1 is a binary string, and ∗ denotes concatenation.
  • 66. 3.2 The Huffman Coding Algorithm 43 We now define a new alphabet A with a four-letter alphabet a1, a2, a3, a 4, where a 4 is composed of a4 and a5 and has a probability Pa 4 = Pa4+Pa5 = 0 2. We sort this new alphabet in descending order to obtain Table 3.2. T A B L E 3 . 2 The reduced four-letter alphabet. Letter Probability Codeword a2 0.4 ca2 a1 0.2 ca1 a3 0.2 ca3 a 4 0.2 1 In this alphabet, a3 and a 4 are the two letters at the bottom of the sorted list. We assign their codewords as ca3 = 2 ∗0 ca 4 = 2 ∗1 but ca 4 = 1. Therefore, 1 = 2 ∗1 which means that ca4 = 2 ∗10 ca5 = 2 ∗11 At this stage, we again define a new alphabet A that consists of three letters a1, a2, a 3, where a 3 is composed of a3 and a 4 and has a probability Pa 3 = Pa3+Pa 4 = 0 4. We sort this new alphabet in descending order to obtain Table 3.3. T A B L E 3 . 3 The reduced three-letter alphabet. Letter Probability Codeword a2 0.4 ca2 a 3 0.4 2 a1 0.2 ca1 In this case, the two least probable symbols are a1 and a 3. Therefore, ca 3 = 3 ∗0 ca1 = 3 ∗1
  • 67. 44 3 H U F F M A N C O D I N G But ca 3 = 2. Therefore, 2 = 3 ∗0 which means that ca3 = 3 ∗00 ca4 = 3 ∗010 ca5 = 3 ∗011 Again we define a new alphabet, this time with only two letters a 3, a2. Here a 3 is composed of the letters a 3 and a1 and has probability Pa 3 = Pa 3+Pa1 = 0 6. We now have Table 3.4. T A B L E 3 . 4 The reduced two-letter alphabet. Letter Probability Codeword a 3 0.6 3 a2 0.4 ca2 As we have only two letters, the codeword assignment is straightforward: ca 3 = 0 ca2 = 1 which means that 3 = 0, which in turn means that ca1 = 01 ca3 = 000 ca4 = 0010 ca5 = 0011 T A B L E 3 . 5 Huffman code for the original five-letter alphabet. Letter Probability Codeword a2 0.4 1 a1 0.2 01 a3 0.2 000 a4 0.1 0010 a5 0.1 0011
  • 68. 3.2 The Huffman Coding Algorithm 45 a2(0.4) a1(0.2) a3(0.2) a4(0.1) 0 1 0 1 0 1 a5(0.1) a2(0.4) a1(0.2) a3(0.2) a4(0.2) a2(0.4) a3(0.4) a1(0.2) a3 (0.6) a2(0.4) ' ' F I G U R E 3. 1 The Huffman encoding procedure. The symbol probabilities are listed in parentheses. and the Huffman code is given by Table 3.5. The procedure can be summarized as shown in Figure 3.1. The average length for this code is l = 4×1+ 2×2+ 2×3+ 1×4+ 1×4 = 2 2 bits/symbol A measure of the efficiency of this code is its redundancy—the difference between the entropy and the average length. In this case, the redundancy is 0.078 bits/symbol. The redundancy is zero when the probabilities are negative powers of two. An alternative way of building a Huffman code is to use the fact that the Huffman code, by virtue of being a prefix code, can be represented as a binary tree in which the external nodes or leaves correspond to the symbols. The Huffman code for any symbol can be obtained by traversing the tree from the root node to the leaf corresponding to the symbol, adding a 0 to the codeword every time the traversal takes us over an upper branch and a 1 every time the traversal takes us over a lower branch. We build the binary tree starting at the leaf nodes. We know that the codewords for the two symbols with smallest probabilities are identical except for the last bit. This means that the traversal from the root to the leaves corresponding to these two symbols must be the same except for the last step. This in turn means that the leaves corresponding to the two symbols with the lowest probabilities are offspring of the same node. Once we have connected the leaves corresponding to the symbols with the lowest probabilities to a single node, we treat this node as a symbol of a reduced alphabet. The probability of this symbol is the sum of the probabilities of its offspring. We can now sort the nodes corresponding to the reduced alphabet and apply the same rule to generate a parent node for the nodes corresponding to the two symbols in the reduced alphabet with lowest probabilities. Continuing in this manner, we end up with a single node, which is the root node. To obtain the code for each symbol, we traverse the tree from the root to each leaf node, assigning a 0 to the upper branch and a 1 to the lower branch. This procedure as applied to the alphabet of Example 3.2.1 is shown in Figure 3.2. Notice the similarity between Figures 3.1 and 3.2. This is not surprising, as they are a result of viewing the same procedure in two different ways.
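The merge-and-relabel procedure described above translates directly into code. The sketch below is our own Python illustration, not taken from the text: it keeps the nodes on a heap, repeatedly merges the two least probable nodes, and prefixes a 0 to the codewords in one subtree and a 1 to those in the other. For the probabilities of Example 3.2.1 it yields an average length of 2.2 bits/symbol; the individual codewords may differ from Table 3.5 because ties between equal probabilities can be broken in more than one way, a point taken up in the next section.

    import heapq
    from itertools import count

    def huffman_code(probabilities):
        # Build a binary Huffman code from a dictionary {symbol: probability}
        tie = count()                    # unique counter so the heap never compares dictionaries
        heap = [(p, next(tie), {sym: ""}) for sym, p in probabilities.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, code0 = heapq.heappop(heap)     # least probable node
            p1, _, code1 = heapq.heappop(heap)     # next least probable node
            merged = {s: "0" + c for s, c in code0.items()}
            merged.update({s: "1" + c for s, c in code1.items()})
            heapq.heappush(heap, (p0 + p1, next(tie), merged))
        return heap[0][2]

    probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
    code = huffman_code(probs)
    avg = sum(probs[s] * len(code[s]) for s in probs)
    print(code)
    print(round(avg, 2), "bits/symbol")            # 2.2, as in Example 3.2.1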
  • 69. 46 3 H U F F M A N C O D I N G a2(0.4) a1(0.2) a3(0.2) a4(0.1) 0 1 a5(0.1) (0.4) (0.2) (0.2) 0 (0.2) 1 (0.4) (0.4) 0 (0.2) 1 (0.6) 0 (0.4) 1 (1.0) F I G U R E 3. 2 Building the binary Huffman tree. 3 . 2 . 1 M i n i m u m V a r i a n c e H u f f m a n C o d e s By performing the sorting procedure in a slightly different manner, we could have found a different Huffman code. In the first re-sort, we could place a 4 higher in the list, as shown in Table 3.6. Now combine a1 and a3 into a 1, which has a probability of 0 4. Sorting the alphabet a2, a 4, a 1 and putting a 1 as far up the list as possible, we get Table 3.7. Finally, by combining a2 and a 4 and re-sorting, we get Table 3.8. If we go through the unbundling procedure, we get the codewords in Table 3.9. The procedure is summarized in Figure 3.3. The average length of the code is l = 4×2+ 2×2+ 2×2+ 1×3+ 1×3 = 2 2 bits/symbol The two codes are identical in terms of their redundancy. However, the variance of the length of the codewords is significantly different. This can be clearly seen from Figure 3.4. T A B L E 3 . 6 Reduced four-letter alphabet. Letter Probability Codeword a2 0.4 ca2 a 4 0.2 1 a1 0.2 ca1 a3 0.2 ca3 T A B L E 3 . 7 Reduced three-letter alphabet. Letter Probability Codeword a 1 0.4 2 a2 0.4 ca2 a 4 0.2 1
  • 70. 3.2 The Huffman Coding Algorithm 47 T A B L E 3 . 8 Reduced two-letter alphabet. Letter Probability Codeword a 2 0.6 3 a 1 0.4 2 T A B L E 3 . 9 Minimum variance Huffman code. Letter Probability Codeword a1 0.2 10 a2 0.4 00 a3 0.2 11 a4 0.1 010 a5 0.1 011 a2(0.4) a1(0.2) a3(0.2) a4(0.1) a5(0.1) a2(0.4) a4(0.2) a1(0.2) 0 1 0 1 0 1 0 1 a3(0.2) a1(0.4) a2(0.4) a4(0.2) a2(0.6) a1(0.4) ′ ′ ′ ′′ F I G U R E 3. 3 The minimum variance Huffman encoding procedure. a3 a4 a5 a1 0 0 1 1 a2 a2 a4 a5 a3 a1 0 1 1 F I G U R E 3. 4 Two Huffman trees corresponding to the same probabilities. Remember that in many applications, although you might be using a variable-length code, the available transmission rate is generally fixed. For example, if we were going to transmit symbols from the alphabet we have been using at 10,000 symbols per second, we might ask for transmission capacity of 22,000 bits per second. This means that during each second the channel expects to receive 22,000 bits, no more and no less. As the bit generation rate will
  • 71. 48 3 H U F F M A N C O D I N G vary around 22,000 bits per second, the output of the source coder is generally fed into a buffer. The purpose of the buffer is to smooth out the variations in the bit generation rate. However, the buffer has to be of finite size, and the greater the variance in the codewords, the more difficult the buffer design problem becomes. Suppose that the source we are discussing generates a string of a4s and a5s for several seconds. If we are using the first code, this means that we will be generating bits at a rate of 40,000 bits per second. For each second, the buffer has to store 18,000 bits. On the other hand, if we use the second code, we would be generating 30,000 bits per second, and the buffer would have to store 8000 bits for every second this condition persisted. If we have a string of a2s instead of a string of a4s and a5s, the first code would result in the generation of 10,000 bits per second. Remember that the channel will still be expecting 22,000 bits every second, so somehow we will have to make up a deficit of 12,000 bits per second. The same situation using the second code would lead to a deficit of 2000 bits per second. Thus, it seems reasonable to elect to use the second code instead of the first. To obtain the Huffman code with minimum variance, we always put the combined letter as high in the list as possible. 3 . 2 . 2 O p t i m a l i t y o f H u f f m a n C o d e s The optimality of Huffman codes can be proven rather simply by first writing down the necessary conditions that an optimal code has to satisfy and then showing that satisfying these conditions necessarily leads to designing a Huffman code. The proof we present here is based on the proof shown in [16] and is obtained for the binary case (for a more general proof, see [16]). The necessary conditions for an optimal variable-length binary code are as follows: Condition 1: Given any two letters aj and ak, if P aj ≥ P ak , then lj ≤ lk, where lj is the number of bits in the codeword for aj. Condition 2: The two least probable letters have codewords with the same maximum length lm. We have provided the justification for these two conditions in the opening sections of this chapter. Condition 3: In the tree corresponding to the optimum code, there must be two branches stemming from each intermediate node. If there were any intermediate node with only one branch coming from that node, we could remove it without affecting the decipherability of the code while reducing its average length. Condition 4: Suppose we change an intermediate node into a leaf node by combining all the leaves descending from it into a composite word of a reduced alphabet. Then, if the original tree was optimal for the original alphabet, the reduced tree is optimal for the reduced alphabet. If this condition were not satisfied, we could find a code with smaller average code length for the reduced alphabet and then simply expand the composite word again to get a new
  • 72. 3.2 The Huffman Coding Algorithm 49 code tree that would have a shorter average length than our original “optimum” tree. This would contradict our statement about the optimality of the original tree. In order to satisfy conditions 1, 2, and 3, the two least probable letters would have to be assigned codewords of maximum length lm. Furthermore, the leaves corresponding to these letters arise from the same intermediate node. This is the same as saying that the codewords for these letters are identical except for the last bit. Consider the common prefix as the codeword for the composite letter of a reduced alphabet. Since the code for the reduced alphabet needs to be optimum for the code of the original alphabet to be optimum, we follow the same procedure again. To satisfy the necessary conditions, the procedure needs to be iterated until we have a reduced alphabet of size one. But this is exactly the Huffman procedure. Therefore, the necessary conditions above, which are all satisfied by the Huffman procedure, are also sufficient conditions. 3 . 2 . 3 L e n g t h o f H u f f m a n C o d e s We have said that the Huffman coding procedure generates an optimum code, but we have not said what the average length of an optimum code is. The length of any code will depend on a number of things, including the size of the alphabet and the probabilities of individual letters. In this section we will show that the optimal code for a source , hence the Huffman code for the source , has an average code length l bounded below by the entropy and bounded above by the entropy plus 1 bit. In other words, H ≤ l H+1 (3.1) In order for us to do this, we will need to use the Kraft-McMillan inequality introduced in Chapter 2. Recall that the first part of this result, due to McMillan, states that if we have a uniquely decodable code with K codewords of length liK i=1, then the following inequality holds: K i=1 2−li ≤ 1 (3.2) E x a m p l e 3 . 2 . 2: Examining the code generated in Example 3.2.1 (Table 3.5), the lengths of the codewords are 12344. Substituting these values into the left-hand side of Equation (3.2), we get 2−1 +2−2 +2−3 +2−4 +2−4 = 1 which satisfies the Kraft-McMillan inequality. If we use the minimum variance code (Table 3.9), the lengths of the codewords are 22233. Substituting these values into the left-hand side of Equation (3.2), we get 2−2 +2−2 +2−2 +2−3 +2−3 = 1 which again satisfies the inequality.
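The Kraft-McMillan sum is one line of code, and checking a proposed set of codeword lengths this way is a quick sanity test. The snippet below is our own illustration; it verifies the two codes of Example 3.2.2 and shows a set of lengths for which no uniquely decodable code can exist.

    def kraft_sum(lengths):
        # Left-hand side of the Kraft-McMillan inequality for a binary code
        return sum(2.0 ** -l for l in lengths)

    print(kraft_sum([1, 2, 3, 4, 4]))   # lengths from Table 3.5: 1.0
    print(kraft_sum([2, 2, 2, 3, 3]))   # lengths from Table 3.9: 1.0
    print(kraft_sum([1, 2, 2, 2]))      # 1.25 > 1: no uniquely decodable code has these lengths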
  • 73. 50 3 H U F F M A N C O D I N G The second part of this result, due to Kraft, states that if we have a sequence of positive integers liK i=1, which satisfies (3.2), then there exists a uniquely decodable code whose codeword lengths are given by the sequence liK i=1. Using this result, we will now show the following: 1. The average codeword length l of an optimal code for a source is greater than or equal to H. 2. The average codeword length l of an optimal code for a source is strictly less than H+1. For a source with alphabet = a1a2 aK, and probability model Pa1Pa2 PaK, the average codeword length is given by l = K i=1 Paili Therefore, we can write the difference between the entropy of the source H and the average length as H−l = − K i=1 Pailog2 Pai− K i=1 Paili = K i=1 Pai log2 1 Pai −li = K i=1 Pai log2 1 Pai −log2 2li = K i=1 Pailog2 2−li Pai ≤ log2 K i=1 2−li The last inequality is obtained using Jensen’s inequality, which states that if fx is a concave (convex cap, convex ∩) function, then E fX ≤ fE X . The log function is a concave function. As the code is an optimal code K i=1 2−li ≤ 1, therefore H−l ≤ 0 (3.3) We will prove the upper bound by showing that there exists a uniquely decodable code with average codeword length H+1. Therefore, if we have an optimal code, this code must have an average length that is less than or equal to H+1. Given a source, alphabet, and probability model as before, define li = log2 1 Pai
  • 74. 3.2 The Huffman Coding Algorithm 51 where x is the smallest integer greater than or equal to x. For example, 3 3 = 4 and 5 = 5. Therefore, x = x + where 0 ≤ 1 Therefore, log2 1 Pai ≤ li log2 1 Pai +1 (3.4) From the left inequality of (3.4) we can see that 2−li ≤ Pai Therefore, K i=1 2−li ≤ K i=1 Pai = 1 and by the Kraft-McMillan inequality there exists a uniquely decodable code with codeword lengths li. The average length of this code can be upper-bounded by using the right inequality of (3.4): l = K i=1 Paili K i=1 Pai log2 1 Pai +1 or l H+1 We can see from the way the upper bound was derived that this is a rather loose upper bound. In fact, it can be shown that if pmax is the largest probability in the probability model, then for pmax ≥ 0 5, the upper bound for the Huffman code is H+pmax, while for pmax 0 5, the upper bound is H+pmax +0 086. Obviously, this is a much tighter bound than the one we derived above. The derivation of this bound takes some time (see [23] for details). 3 . 2 . 4 E x t e n d e d H u f f m a n C o d e s In applications where the alphabet size is large, pmax is generally quite small, and the amount of deviation from the entropy, especially in terms of a percentage of the rate, is quite small. However, in cases where the alphabet is small and the probability of occurrence of the different letters is skewed, the value of pmax can be quite large and the Huffman code can become rather inefficient when compared to the entropy. E x a m p l e 3 . 2 . 3: Consider a source that puts out iid letters from the alphabet = a1a2a3 with the probability model Pa1 = 0 8, Pa2 = 0 02, and Pa3 = 0 18. The entropy for this source is 0.816 bits/symbol. A Huffman code for this source is shown in Table 3.10.
  • 75. 52 3 H U F F M A N C O D I N G T A B L E 3 . 10 Huffman code for the alphabet . Letter Codeword a1 0 a2 11 a3 10 The average length for this code is 1 2 bits/symbol. The difference between the average code length and the entropy, or the redundancy, for this code is 0.384 bits/symbol, which is 47% of the entropy. This means that to code this sequence we would need 47% more bits than the minimum required. We can sometimes reduce the coding rate by blocking more than one symbol together. To see how this can happen, consider a source S that emits a sequence of letters from an alphabet = a1a2 am. Each element of the sequence is generated independently of the other elements in the sequence. The entropy for this source is given by HS = − m i=1 Pailog2 Pai We know that we can generate a Huffman code for this source with rate R such that HS ≤ R HS+1 (3.5) We have used the looser bound here; the same argument can be made with the tighter bound. Notice that we have used “rate R” to denote the number of bits per symbol. This is a standard convention in the data compression literature. However, in the communication literature, the word “rate” often refers to the number of bits per second. Suppose we now encode the sequence by generating one codeword for every n symbols. As there are mn combinations of n symbols, we will need mn codewords in our Huffman code. We could generate this code by viewing the mn symbols as letters of an extended alphabet n = n times a1a1 a1a1a1 a2 a1a1 ama1a1 a2a1 amam am from a source Sn . Let us denote the rate for the new source as Rn . Then we know that HSn ≤ Rn HSn +1 (3.6) Rn is the number of bits required to code n symbols. Therefore, the number of bits required per symbol, R, is given by R = 1 n Rn
  • 76. 3.2 The Huffman Coding Algorithm 53 The number of bits per symbol can be bounded as HSn n ≤ R HSn n + 1 n In order to compare this to (3.5), and see the advantage we get from encoding symbols in blocks instead of one at a time, we need to express HSn in terms of HS. This turns out to be a relatively easy (although somewhat messy) thing to do. HSn = − m i1=1 m i2=1 m in=1 Pai1 ai2 ain log Pai1 ai2 ain = − m i1=1 m i2=1 m in=1 Pai1 Pai2 Pain log Pai1 Pai2 Pain = − m i1=1 m i2=1 m in=1 Pai1 Pai2 Pain n j=1 log Paij = − m i1=1 Pai1 log Pai1 m i2=1 m in=1 Pai2 Pain − m i2=1 Pai2 log Pai2 m i1=1 m i3=1 m in=1 Pai1 Pai3 Pain − m in=1 Pain log Pain m i1=1 m i2=1 m in−1=1 Pai1 Pai2 Pain−1 The n−1 summations in braces in each term sum to one. Therefore, HSn = − m i1=1 Pai1 log Pai1 − m i2=1 Pai2 log Pai2 −···− m in=1 Pain log Pain = nHS and we can write (3.6) as HS ≤ R ≤ HS+ 1 n (3.7) Comparing this to (3.5), we can see that by encoding the output of the source in longer blocks of symbols we are guaranteed a rate closer to the entropy. Note that all we are talking about here is a bound or guarantee about the rate. As we have seen in the previous chapter, there are a number of situations in which we can achieve a rate equal to the entropy with a block length of one!
  • 77. 54 3 H U F F M A N C O D I N G E x a m p l e 3 . 2 . 4: For the source described in the previous example, instead of generating a codeword for every symbol, we will generate a codeword for every two symbols. If we look at the source sequence two at a time, the number of possible symbol pairs, or size of the extended alphabet, is 32 = 9. The extended alphabet, probability model, and Huffman code for this example are shown in Table 3.11. T A B L E 3 . 11 The extended alphabet and corresponding Huffman code. Letter Probability Code a1a1 0 64 0 a1a2 0 016 10101 a1a3 0 144 11 a2a1 0 016 101000 a2a2 0 0004 10100101 a2a3 0 0036 1010011 a3a1 0 1440 100 a3a2 0 0036 10100100 a3a3 0 0324 1011 The average codeword length for this extended code is 1.7228 bits/symbol. However, each symbol in the extended alphabet corresponds to two symbols from the original alphabet. Therefore, in terms of the original alphabet, the average codeword length is 1 7228/2 = 0 8614 bits/symbol. This redundancy is about 0.045 bits/symbol, which is only about 5.5% of the entropy. We see that by coding blocks of symbols together we can reduce the redundancy of Huffman codes. In the previous example, two symbols were blocked together to obtain a rate reasonably close to the entropy. Blocking two symbols together means the alphabet size goes from m to m2 , where m was the size of the initial alphabet. In this case, m was three, so the size of the extended alphabet was nine. This size is not an excessive burden for most applications. However, if the probabilities of the symbols were more unbalanced, then it would require blocking many more symbols together before the redundancy lowered to acceptable levels. As we block more and more symbols together, the size of the alphabet grows exponentially, and the Huffman coding scheme becomes impractical. Under these conditions, we need to look at techniques other than Huffman coding. One approach that is very useful in these conditions is arithmetic coding. We will discuss this technique in some detail in the next chapter.
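The numbers in Example 3.2.4 are easy to reproduce. The sketch below is our own Python illustration, assuming the iid source of Example 3.2.3: it forms the nine-letter extended alphabet, designs a Huffman code for it, and reports the rate per original symbol. The codewords themselves may differ from Table 3.11, but any Huffman code for this extended alphabet has the same average rate of approximately 0.8614 bits/symbol.

    import heapq
    from itertools import count, product

    def huffman_lengths(probabilities):
        # Return {symbol: codeword length} for a binary Huffman code
        tie = count()
        heap = [(p, next(tie), {s: 0}) for s, p in probabilities.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, d0 = heapq.heappop(heap)
            p1, _, d1 = heapq.heappop(heap)
            merged = {s: length + 1 for s, length in {**d0, **d1}.items()}
            heapq.heappush(heap, (p0 + p1, next(tie), merged))
        return heap[0][2]

    probs = {"a1": 0.8, "a2": 0.02, "a3": 0.18}
    ext = {x + y: probs[x] * probs[y] for x, y in product(probs, repeat=2)}   # nine two-letter symbols
    lengths = huffman_lengths(ext)
    rate = sum(ext[s] * lengths[s] for s in ext) / 2     # bits per original symbol
    print(round(rate, 4))                                # approximately 0.8614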
  • 78. 3.3 Nonbinary Huffman Codes 55 3 . 3 Nonbinary Huffman Codes The binary Huffman coding procedure can be easily extended to the nonbinary case where the code elements come from an m-ary alphabet, and m is not equal to two. Recall that we obtained the Huffman algorithm based on the observations that in an optimum binary prefix code 1. symbols that occur more frequently (have a higher probability of occurrence) will have shorter codewords than symbols that occur less frequently, and 2. the two symbols that occur least frequently will have the same length, and the requirement that the two symbols with the lowest probability differ only in the last position. We can obtain a nonbinary Huffman code in almost exactly the same way. The obvious thing to do would be to modify the second observation to read: “The m symbols that occur least frequently will have the same length,” and also modify the additional requirement to read “The m symbols with the lowest probability differ only in the last position.” However, we run into a small problem with this approach. Consider the design of a ternary Huffman code for a source with a six-letter alphabet. Using the rules described above, we would first combine the three letters with the lowest probability into a composite letter. This would give us a reduced alphabet with four letters. However, combining the three letters with lowest probability from this alphabet would result in a further reduced alphabet consisting of only two letters. We have three values to assign and only two letters. Instead of combining three letters at the beginning, we could have combined two letters. This would result in a reduced alphabet of size five. If we combined three letters from this alphabet, we would end up with a final reduced alphabet size of three. Finally, we could combine two letters in the second step, which would again result in a final reduced alphabet of size three. Which alternative should we choose? Recall that the symbols with lowest probability will have the longest codeword. Fur- thermore, all the symbols that we combine together into a composite symbol will have codewords of the same length. This means that all letters we combine together at the very first stage will have codewords that have the same length, and these codewords will be the longest of all the codewords. This being the case, if at some stage we are allowed to combine less than m symbols, the logical place to do this would be in the very first stage. In the general case of an m-ary code and an M-letter alphabet, how many letters should we combine in the first phase? Let m be the number of letters that are combined in the first phase. Then m is the number between two and m, which is equal to M modulo m−1. E x a m p l e 3 . 3 . 1: Generate a ternary Huffman code for a source with a six-letter alphabet and a probability model Pa1 = Pa3 = Pa4 = 0 2, Pa5 = 0 25, Pa6 = 0 1, and Pa2 = 0 05. In this case m = 3, therefore m is either 2 or 3. 6 mod 2 = 0 2 mod 2 = 0 3 mod 2 = 1
  • 79. 56 3 H U F F M A N C O D I N G Since 6 mod 2 = 2 mod 2, m = 2. Sorting the symbols in probability order results in Table 3.12. T A B L E 3 . 12 Sorted six-letter alphabet. Letter Probability Codeword a5 0 25 ca5 a1 0 20 ca1 a3 0 20 ca3 a4 0 20 ca4 a6 0 10 ca6 a2 0 05 ca2 As m is 2, we can assign the codewords of the two symbols with lowest probability as ca6 = 1 ∗0 ca2 = 1 ∗1 where 1 is a ternary string and * denotes concatenation. The reduced alphabet is shown in Table 3.13. T A B L E 3 . 13 Reduced five-letter alphabet. Letter Probability Codeword a5 0 25 ca5 a1 0 20 ca1 a3 0 20 ca3 a4 0 20 ca4 a 6 0 15 1 Now we combine the three letters with the lowest probability into a composite letter a 3 and assign their codewords as ca3 = 2 ∗0 ca4 = 2 ∗1 ca 6 = 2 ∗2 But ca 6 = 1. Therefore, 1 = 2 ∗2
  • 80. 3.3 Nonbinary Huffman Codes 57 which means that ca6 = 2 ∗20 ca2 = 2 ∗21 Sorting the reduced alphabet, we have Table 3.14. Thus, 2 = 0, ca5 = 1, and ca1 = 2. Substituting for 2, we get the codeword assignments in Table 3.15. T A B L E 3 . 14 Reduced three-letter alphabet. Letter Probability Codeword a 3 0 45 2 a5 0 25 ca5 a1 0 20 ca1 T A B L E 3 . 15 Ternary code for six-letter alphabet. Letter Probability Codeword a1 0 20 2 a2 0 05 021 a3 0 20 00 a4 0 20 01 a5 0 25 1 a6 0 10 020 The tree corresponding to this code is shown in Figure 3.5. Notice that at the lowest level of the tree we have only two codewords. If we had combined three letters at the first step, and combined two letters at a later step, the lowest level would have contained three codewords and a longer average code length would result (see Problem 7). 1 2 0 a1 a5 1 2 0 a4 a3 1 0 a2 a6 F I G U R E 3. 5 Code tree for the nonbinary Huffman code.
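A convenient way to implement the rule for the first combining step, shown in the sketch below (our own Python illustration, not the text's procedure verbatim), is to pad the alphabet with zero-probability dummy letters until every combining step can merge exactly m nodes; merging a dummy in the first step has the same effect as combining fewer than m real letters in that step. For the source of Example 3.3.1 the resulting tree may group the letters differently from Figure 3.5, but the average length of 1.7 ternary digits per symbol is the same.

    import heapq
    from itertools import count

    def huffman_mary_lengths(probabilities, m):
        # Codeword lengths (in m-ary digits) of an m-ary Huffman code
        tie = count()
        nodes = [(p, next(tie), {s: 0}) for s, p in probabilities.items()]
        # Pad with zero-probability dummy letters so every step combines exactly m nodes
        while (len(nodes) - 1) % (m - 1) != 0:
            nodes.append((0.0, next(tie), {}))
        heapq.heapify(nodes)
        while len(nodes) > 1:
            total, merged = 0.0, {}
            for _ in range(m):
                p, _, d = heapq.heappop(nodes)
                total += p
                merged.update({s: length + 1 for s, length in d.items()})
            heapq.heappush(nodes, (total, next(tie), merged))
        return nodes[0][2]

    probs = {"a1": 0.2, "a2": 0.05, "a3": 0.2, "a4": 0.2, "a5": 0.25, "a6": 0.1}
    lengths = huffman_mary_lengths(probs, 3)
    print(round(sum(probs[s] * lengths[s] for s in probs), 2))   # 1.7 ternary digits/symbol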
  • 81. 58 3 H U F F M A N C O D I N G 3 . 4 Adaptive Huffman Coding Huffman coding requires knowledge of the probabilities of the source sequence. If this knowledge is not available, Huffman coding becomes a two-pass procedure: the statistics are collected in the first pass, and the source is encoded in the second pass. In order to convert this algorithm into a one-pass procedure, Faller [24] and Gallagher [23] independently developed adaptive algorithms to construct the Huffman code based on the statistics of the symbols already encountered. These were later improved by Knuth [25] and Vitter [26]. Theoretically, if we wanted to encode the (k+1)-th symbol using the statistics of the first k symbols, we could recompute the code using the Huffman coding procedure each time a symbol is transmitted. However, this would not be a very practical approach due to the large amount of computation involved—hence, the adaptive Huffman coding procedures. The Huffman code can be described in terms of a binary tree similar to the ones shown in Figure 3.4. The squares denote the external nodes or leaves and correspond to the symbols in the source alphabet. The codeword for a symbol can be obtained by traversing the tree from the root to the leaf corresponding to the symbol, where 0 corresponds to a left branch and 1 corresponds to a right branch. In order to describe how the adaptive Huffman code works, we add two other parameters to the binary tree: the weight of each leaf, which is written as a number inside the node, and a node number. The weight of each external node is simply the number of times the symbol corresponding to the leaf has been encountered. The weight of each internal node is the sum of the weights of its offspring. The node number yi is a unique number assigned to each internal and external node. If we have an alphabet of size n, then the 2n−1 internal and external nodes can be numbered as y1 y2n−1 such that if xj is the weight of node yj, we have x1 ≤ x2 ≤ ··· ≤ x2n−1. Furthermore, the nodes y2j−1 and y2j are offspring of the same parent node, or siblings, for 1 ≤ j n, and the node number for the parent node is greater than y2j−1 and y2j. These last two characteristics are called the sibling property, and any tree that possesses this property is a Huffman tree [23]. In the adaptive Huffman coding procedure, neither transmitter nor receiver knows any- thing about the statistics of the source sequence at the start of transmission. The tree at both the transmitter and the receiver consists of a single node that corresponds to all symbols not yet transmitted (NYT) and has a weight of zero. As transmission progresses, nodes corre- sponding to symbols transmitted will be added to the tree, and the tree is reconfigured using an update procedure. Before the beginning of transmission, a fixed code for each symbol is agreed upon between transmitter and receiver. A simple (short) code is as follows: If the source has an alphabet a1a2 am of size m, then pick e and r such that m = 2e +r and 0 ≤ r 2e . The letter ak is encoded as the e+1-bit binary representation of k − 1, if 1 ≤ k ≤ 2r; else, ak is encoded as the e-bit binary representation of k − r − 1. For example, suppose m = 26, then e = 4, and r = 10. The symbol a1 is encoded as 00000, the symbol a2 is encoded as 00001, and the symbol a22 is encoded as 1011. When a symbol is encountered for the first time, the code for the NYT node is transmitted, followed by the fixed code for the symbol. 
A node for the symbol is then created, and the symbol is taken out of the NYT list. Both transmitter and receiver start with the same tree structure. The updating procedure used by both transmitter and receiver is identical. Therefore, the encoding and decoding processes remain synchronized.
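The fixed code just described is simple to generate. The following sketch is our own Python illustration: it derives e and r from the alphabet size m and reproduces the codes quoted above for a1, a2, and a22 when m = 26.

    def fixed_code(k, m):
        # Fixed code for the k-th letter (1-indexed) of an m-letter alphabet:
        # pick e and r with m = 2**e + r and 0 <= r < 2**e; the first 2r letters
        # get (e + 1)-bit codes, the remaining letters get e-bit codes
        e = m.bit_length() - 1
        r = m - 2 ** e
        if k <= 2 * r:
            return format(k - 1, "0{}b".format(e + 1))
        return format(k - r - 1, "0{}b".format(e))

    # m = 26 gives e = 4 and r = 10
    print(fixed_code(1, 26))    # a1  -> 00000
    print(fixed_code(2, 26))    # a2  -> 00001
    print(fixed_code(22, 26))   # a22 -> 1011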
  • 82. 3.4 Adaptive Huffman Coding 59 3 . 4 . 1 U p d a t e P r o c e d u r e The update procedure requires that the nodes be in a fixed order. This ordering is preserved by numbering the nodes. The largest node number is given to the root of the tree, and the smallest number is assigned to the NYT node. The numbers from the NYT node to the root of the tree are assigned in increasing order from left to right, and from lower level to upper level. The set of nodes with the same weight makes up a block. Figure 3.6 is a flowchart of the updating procedure. START Go to symbol external node Node number max in block? Increment node weight Is this the root node? STOP First appearance for symbol? NYT gives birth to new NYT and external node Increment weight of external node and old NYT node Go to old NYT node No Yes Yes Yes Switch node with highest numbered node in block No Go to parent node No F I G U R E 3. 6 Update procedure for the adaptive Huffman coding algorithm.
  • 83. 60 3 H U F F M A N C O D I N G The function of the update procedure is to preserve the sibling property. In order that the update procedures at the transmitter and receiver both operate with the same information, the tree at the transmitter is updated after each symbol is encoded, and the tree at the receiver is updated after each symbol is decoded. The procedure operates as follows: After a symbol has been encoded or decoded, the external node corresponding to the symbol is examined to see if it has the largest node number in its block. If the external node does not have the largest node number, it is exchanged with the node that has the largest node number in the block, as long as the node with the higher number is not the parent of the node being updated. The weight of the external node is then incremented. If we did not exchange the nodes before the weight of the node is incremented, it is very likely that the ordering required by the sibling property would be destroyed. Once we have incremented the weight of the node, we have adapted the Huffman tree at that level. We then turn our attention to the next level by examining the parent node of the node whose weight was incremented to see if it has the largest number in its block. If it does not, it is exchanged with the node with the largest number in the block. Again, an exception to this is when the node with the higher node number is the parent of the node under consideration. Once an exchange has taken place (or it has been determined that there is no need for an exchange), the weight of the parent node is incremented. We then proceed to a new parent node and the process is repeated. This process continues until the root of the tree is reached. If the symbol to be encoded or decoded has occurred for the first time, a new external node is assigned to the symbol and a new NYT node is appended to the tree. Both the new external node and the new NYT node are offsprings of the old NYT node. We increment the weight of the new external node by one. As the old NYT node is the parent of the new external node, we increment its weight by one and then go on to update all the other nodes until we reach the root of the tree. E x a m p l e 3 . 4 . 1: U p d a t e p r o c e d u r e Assume we are encoding the message [a a r d v a r k], where our alphabet consists of the 26 lowercase letters of the English alphabet. The updating process is shown in Figure 3.7. We begin with only the NYT node. The total number of nodes in this tree will be 2×26−1 = 51, so we start numbering backwards from 51 with the number of the root node being 51. The first letter to be transmitted is a. As a does not yet exist in the tree, we send a binary code 00000 for a and then add a to the tree. The NYT node gives birth to a new NYT node and a terminal node corresponding to a. The weight of the terminal node will be higher than the NYT node, so we assign the number 49 to the NYT node and 50 to the terminal node corresponding to the letter a. The second letter to be transmitted is also a. This time the transmitted code is 1. The node corresponding to a has the highest number (if we do not consider its parent), so we do not need to swap nodes. The next letter to be transmitted is r. This letter does not have a corresponding node on the tree, so we send the codeword for the NYT node, which is 0 followed by the index of r, which is 10001. The NYT node gives birth to a new NYT node and an external node corresponding to r. Again, no update is required. 
The next letter to be transmitted is d, which is also being sent for the first time. We again send the code for
the NYT node, which is now 00, followed by the index for d, which is 00011. The NYT node again gives birth to two new nodes. However, an update is still not required. This changes with the transmission of the next letter, v, which has also not yet been encountered. Nodes 43 and 44 are added to the tree, with 44 as the terminal node corresponding to v. We examine the grandparent node of v (node 47) to see if it has the largest number in its block. As it does not, we swap it with node 48, which has the largest number in its block. We then increment node 48 and move to its parent, which is node 49. In the block containing node 49, the largest number belongs to node 50. Therefore, we swap nodes 49 and 50 and then increment node 50. We then move to the parent node of node 50, which is node 51. As this is the root node, all we do is increment node 51.

[Figure 3.7: Adaptive Huffman tree after [a a r d v] is processed.]
3.4.2 Encoding Procedure

The flowchart for the encoding procedure is shown in Figure 3.8. Initially, the tree at both the encoder and the decoder consists of a single node, the NYT node. Therefore, the codeword for the very first symbol that appears is a previously agreed-upon fixed code. After the very first symbol, whenever we have to encode a symbol that is being encountered for the first time, we send the code for the NYT node, followed by the previously agreed-upon fixed code for the symbol. The code for the NYT node is obtained by traversing the Huffman tree from the root to the NYT node. This alerts the receiver to the fact that the symbol whose code follows does not as yet have a node in the Huffman tree. If a symbol to be encoded has a corresponding node in the tree, then the code for the symbol is generated by traversing the tree from the root to the external node corresponding to the symbol.

[Figure 3.8: Flowchart of the encoding procedure. For each symbol read in, if it is appearing for the first time, the code for the NYT node is sent followed by the symbol's index in the NYT list; otherwise, the code is the path from the root node to the corresponding node. The update procedure is then called, and the process repeats until the last symbol has been encoded.]
To see how the coding operation functions, we use the same example that was used to demonstrate the update procedure.

Example 3.4.2: Encoding procedure

In Example 3.4.1 we used an alphabet consisting of 26 letters. In order to obtain our prearranged code, we have to find e and r such that 2^e + r = 26, where 0 ≤ r < 2^e. It is easy to see that the values of e = 4 and r = 10 satisfy this requirement.

The first symbol encoded is the letter a. As a is the first letter of the alphabet, k = 1. As 1 is less than 2r = 20, a is encoded as the 5-bit binary representation of k − 1, or 0, which is 00000. The Huffman tree is then updated as shown in the figure. The NYT node gives birth to an external node corresponding to the element a and a new NYT node. As a has occurred once, the external node corresponding to a has a weight of one. The weight of the NYT node is zero. The internal node also has a weight of one, as its weight is the sum of the weights of its offspring.

The next symbol is again a. As we have an external node corresponding to symbol a, we simply traverse the tree from the root node to the external node corresponding to a in order to find the codeword. This traversal consists of a single right branch. Therefore, the Huffman code for the symbol a is 1. After the code for a has been transmitted, the weight of the external node corresponding to a is incremented, as is the weight of its parent.

The third symbol to be transmitted is r. As this is the first appearance of this symbol, we send the code for the NYT node followed by the previously arranged binary representation for r. If we traverse the tree from the root to the NYT node, we get a code of 0 for the NYT node. The letter r is the 18th letter of the alphabet; therefore, the binary representation of r is 10001. The code for the symbol r becomes 010001. The tree is again updated as shown in the figure, and the coding process continues with symbol d. Using the same procedure for d, the code for the NYT node, which is now 00, is sent, followed by the index for d, resulting in the codeword 0000011.

The next symbol, v, is the 22nd symbol in the alphabet. As this is greater than 20, we send the code for the NYT node followed by the 4-bit binary representation of 22 − 10 − 1 = 11. The code for the NYT node at this stage is 000, and the 4-bit binary representation of 11 is 1011; therefore, v is encoded as 0001011. The next symbol is a, for which the code is 0, and the encoding proceeds.

3.4.3 Decoding Procedure

The flowchart for the decoding procedure is shown in Figure 3.9. As we read in the received binary string, we traverse the tree in a manner identical to that used in the encoding procedure. Once a leaf is encountered, the symbol corresponding to that leaf is decoded. If the leaf is the NYT node, then we check the next e bits to see if the resulting number is less than r. If it is less than r, we read in another bit to complete the code for the symbol. The index for the symbol is obtained by adding one to the decimal number corresponding to the e- or (e + 1)-bit binary string. Once the symbol has been decoded, the tree is updated and the next received bit is used to start another traversal down the tree. To see how this procedure works, let us decode the binary string generated in the previous example.
[Figure 3.9: Flowchart of the decoding procedure. Bits are read and used to walk down the tree from the root. If an external node other than the NYT node is reached, the corresponding element is decoded. If the NYT node is reached, e bits are read to obtain a number p; if p is less than r, one more bit is read to complete the (e + 1)-bit value, otherwise r is added to p, and the (p + 1)th element of the NYT list is decoded. The update procedure is then called and the traversal restarts at the root, until the last bit has been read.]
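Both the encoder and the decoder rely on the same prearranged fixed code for a symbol's first appearance. The sketch below is a minimal Python illustration (it is not part of the book's accompanying software, and the function names are my own); it derives e and r from the alphabet size and reproduces the fixed codes 00000, 10001, 00011, and 1011 used for a, r, d, and v in Example 3.4.2, following the rule used there: if k ≤ 2r, send the (e + 1)-bit representation of k − 1; otherwise send the e-bit representation of k − r − 1.

    def fixed_code_params(alphabet_size):
        """Find e and r such that 2**e + r = alphabet_size, with 0 <= r < 2**e."""
        e = 0
        while 2 ** (e + 1) <= alphabet_size:
            e += 1
        return e, alphabet_size - 2 ** e

    def fixed_code(k, e, r):
        """Prearranged code for the k-th letter of the alphabet (k = 1, 2, ...)."""
        if k <= 2 * r:
            return format(k - 1, "0{}b".format(e + 1))   # (e+1)-bit code
        return format(k - r - 1, "0{}b".format(e))       # e-bit code

    e, r = fixed_code_params(26)      # e = 4, r = 10 for the 26-letter alphabet
    print(fixed_code(1, e, r))        # a -> 00000
    print(fixed_code(18, e, r))       # r -> 10001
    print(fixed_code(4, e, r))        # d -> 00011
    print(fixed_code(22, e, r))       # v -> 1011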
  • 88. 3.5 Golomb Codes 65 E x a m p l e 3 . 4 . 3: D e c o d i n g p r o c e d u r e The binary string generated by the encoding procedure is 000001010001000001100010110 Initially, the decoder tree consists only of the NYT node. Therefore, the first symbol to be decoded must be obtained from the NYT list. We read in the first 4 bits, 0000, as the value of e is four. The 4 bits 0000 correspond to the decimal value of 0. As this is less than the value of r, which is 10, we read in one more bit for the entire code of 00000. Adding one to the decimal value corresponding to this binary string, we get the index of the received symbol as 1. This is the index for a; therefore, the first letter is decoded as a. The tree is now updated as shown in Figure 3.7. The next bit in the string is 1. This traces a path from the root node to the external node corresponding to a. We decode the symbol a and update the tree. In this case, the update consists only of incrementing the weight of the external node corresponding to a. The next bit is a 0, which traces a path from the root to the NYT node. The next 4 bits, 1000, correspond to the decimal number 8, which is less than 10, so we read in one more bit to get the 5-bit word 10001. The decimal equivalent of this 5-bit word plus one is 18, which is the index for r. We decode the symbol r and then update the tree. The next 2 bits, 00, again trace a path to the NYT node. We read the next 4 bits, 0001. Since this corresponds to the decimal number 1, which is less than 10, we read another bit to get the 5-bit word 00011. To get the index of the received symbol in the NYT list, we add one to the decimal value of this 5-bit word. The value of the index is 4, which corresponds to the symbol d. Continuing in this fashion, we decode the sequence aardva. Although the Huffman coding algorithm is one of the best-known variable-length coding algorithms, there are some other lesser-known algorithms that can be very useful in cer- tain situations. In particular, the Golomb-Rice codes and the Tunstall codes are becoming increasingly popular. We describe these codes in the following sections. 3 . 5 Golomb Codes The Golomb-Rice codes belong to a family of codes designed to encode integers with the assumption that the larger an integer, the lower its probability of occurrence. The simplest code for this situation is the unary code. The unary code for a positive integer n is simply n 1s followed by a 0. Thus, the code for 4 is 11110, and the code for 7 is 11111110. The unary code is the same as the Huffman code for the semi-infinite alphabet 123 with probability model P k = 1 2k Because the Huffman code is optimal, the unary code is also optimal for this probability model. Although the unary code is optimal in very restricted conditions, we can see that it is certainly very simple to implement. One step higher in complexity are a number of coding schemes that split the integer into two parts, representing one part with a unary code and
the other part with a different code. An example of such a code is the Golomb code. Other examples can be found in [27].

The Golomb code is described in a succinct paper [28] by Solomon Golomb, which begins "Secret Agent 00111 is back at the Casino again, playing a game of chance, while the fate of mankind hangs in the balance." Agent 00111 requires a code to represent runs of success in a roulette game, and Golomb provides it!

The Golomb code is actually a family of codes parameterized by an integer m > 0. In the Golomb code with parameter m, we represent a nonnegative integer n using two numbers q and r, where

    q = ⌊n/m⌋    and    r = n − qm

and ⌊x⌋ is the integer part of x. In other words, q is the quotient and r is the remainder when n is divided by m. The quotient q can take on the values 0, 1, 2, … and is represented by the unary code of q. The remainder r can take on the values 0, 1, 2, …, m − 1. If m is a power of two, we use the log2 m-bit binary representation of r. If m is not a power of two, we could still use ⌈log2 m⌉ bits, where ⌈x⌉ is the smallest integer greater than or equal to x. We can reduce the number of bits required if we use the ⌊log2 m⌋-bit binary representation of r for the first 2^⌈log2 m⌉ − m values, and the ⌈log2 m⌉-bit binary representation of r + 2^⌈log2 m⌉ − m for the rest of the values.

Example 3.5.1: Golomb code

Let's design a Golomb code for m = 5. As ⌈log2 5⌉ = 3 and ⌊log2 5⌋ = 2, the first 8 − 5 = 3 values of r (that is, r = 0, 1, 2) will be represented by the 2-bit binary representation of r, and the next two values (that is, r = 3, 4) will be represented by the 3-bit representation of r + 3. The quotient q is always represented by the unary code for q. Thus, the codeword for 3 is 0110, and the codeword for 21 is 1111001. The codewords for n = 0, …, 15 are shown in Table 3.16.

Table 3.16  Golomb code for m = 5.

     n   q   r   Codeword        n   q   r   Codeword
     0   0   0   000              8   1   3   10110
     1   0   1   001              9   1   4   10111
     2   0   2   010             10   2   0   11000
     3   0   3   0110            11   2   1   11001
     4   0   4   0111            12   2   2   11010
     5   1   0   1000            13   2   3   110110
     6   1   1   1001            14   2   4   110111
     7   1   2   1010            15   3   0   111000
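The encoding rule is compact enough to state as code. The following sketch is illustrative only (golomb_encode is a hypothetical function name, not something from the book's accompanying software, and it assumes m ≥ 2); it implements the quotient/remainder construction described above and reproduces the codewords of Example 3.5.1 and Table 3.16.

    import math

    def golomb_encode(n, m):
        """Golomb codeword for a nonnegative integer n with parameter m (m >= 2).

        The quotient q = n // m is sent in unary (q 1s followed by a 0).  The
        remainder uses floor(log2 m) bits for the first 2**ceil(log2 m) - m
        values, and ceil(log2 m) bits, after an offset, for the rest."""
        q, r = divmod(n, m)
        unary = "1" * q + "0"
        c = math.ceil(math.log2(m))
        cutoff = 2 ** c - m                 # number of "short" remainders
        if r < cutoff:
            return unary + format(r, "0{}b".format(c - 1))
        return unary + format(r + cutoff, "0{}b".format(c))

    print(golomb_encode(3, 5))      # 0110
    print(golomb_encode(21, 5))     # 1111001
    print([golomb_encode(n, 5) for n in range(8)])
    # ['000', '001', '010', '0110', '0111', '1000', '1001', '1010']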
  • 90. 3.6 Rice Codes 67 It can be shown that the Golomb code is optimal for the probability model Pn = pn−1 q q = 1−p when m = − 1 log2 p 3 . 6 Rice Codes The Rice code was originally developed by Robert F. Rice (he called it the Rice machine) [29, 30] and later extended by Pen-Shu Yeh and Warner Miller [31]. The Rice code can be viewed as an adaptive Golomb code. In the Rice code, a sequence of nonnegative integers (which might have been obtained from the preprocessing of other data) is divided into blocks of J integers apiece. Each block is then coded using one of several options, most of which are a form of Golomb codes. Each block is encoded with each of these options, and the option resulting in the least number of coded bits is selected. The particular option used is indicated by an identifier attached to the code for each block. The easiest way to understand the Rice code is to examine one of its implementations. We will study the implementation of the Rice code in the recommendation for lossless compression from the Consultative Committee on Space Data Standards (CCSDS). 3 . 6 . 1 C C S D S R e c o m m e n d a t i o n f o r L o s s l e s s C o m p r e s s i o n As an application of the Rice algorithm, let’s briefly look at the algorithm for lossless data compression recommended by CCSDS. The algorithm consists of a preprocessor (the mod- eling step) and a binary coder (coding step). The preprocessor removes correlation from the input and generates a sequence of nonnegative integers. This sequence has the property that smaller values are more probable than larger values. The binary coder generates a bitstream to represent the integer sequence. The binary coder is our main focus at this point. The preprocessor functions as follows: Given a sequence yi, for each yi we generate a prediction ŷi. A simple way to generate a prediction would be to take the previous value of the sequence to be a prediction of the current value of the sequence: ŷi = yi−1 We will look at more sophisticated ways of generating a prediction in Chapter 7. We then generate a sequence whose elements are the difference between yi and its predicted value ŷi: di = yi −ŷi The di value will have a small magnitude when our prediction is good and a large value when it is not. Assuming an accurate modeling of the data, the former situation is more likely than the latter. Let ymax and ymin be the largest and smallest values that the sequence
y_i takes on. It is reasonable to assume that the value of ŷ_i will be confined to the range [y_min, y_max]. Define

    T_i = min{ y_max − ŷ_i, ŷ_i − y_min }                                (3.8)

The sequence d_i can be converted into a sequence of nonnegative integers x_i using the following mapping:

    x_i = 2 d_i          if 0 ≤ d_i ≤ T_i
    x_i = 2 |d_i| − 1    if −T_i ≤ d_i < 0                               (3.9)
    x_i = T_i + |d_i|    otherwise

The value of x_i will be small whenever the magnitude of d_i is small. Therefore, the value of x_i will be small with higher probability. The sequence x_i is divided into segments, with each segment being further divided into blocks of size J. It is recommended by CCSDS that J have a value of 16. Each block is then coded using one of the following options. The coded block is transmitted along with an identifier that indicates which particular option was used.

Fundamental sequence: This is a unary code. A number n is represented by a sequence of n 0s followed by a 1 (or a sequence of n 1s followed by a 0).

Split sample options: These options consist of a set of codes indexed by a parameter m. The code for a k-bit number n using the mth split sample option consists of the m least significant bits of n followed by a unary code representing the k − m most significant bits. For example, suppose we wanted to encode the 8-bit number 23 using the third split sample option. The 8-bit representation of 23 is 00010111. The three least significant bits are 111. The remaining bits (00010) correspond to the number 2, which has a unary code 001. Therefore, the code for 23 using the third split sample option is 111001. Notice that different values of m will be preferable for different values of x_i, with higher values of m used for higher-entropy sequences.

Second extension option: The second extension option is useful for sequences with low entropy, when, in general, many of the values of x_i will be zero. In the second extension option the sequence is divided into consecutive pairs of samples. Each pair is used to obtain an index γ using the following transformation:

    γ = (1/2)(x_i + x_{i+1})(x_i + x_{i+1} + 1) + x_{i+1}                (3.10)

and the value of γ is encoded using a unary code. The value of γ is an index to a lookup table, with each value of γ corresponding to a pair of values (x_i, x_{i+1}).

Zero block option: The zero block option is used when one or more of the blocks of x_i are zero, generally when we have long sequences of y_i that have the same value. In this case the number of zero blocks is transmitted using the code shown in Table 3.17. The ROS code is used when the last five or more blocks in a segment are all zero.

The Rice code has been used in several space applications, and variations of the Rice code have been proposed for a number of different applications.
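To make the options more concrete, here is a small illustrative sketch of the residual mapping of Equation (3.9) and of one split sample option. This is not the CCSDS reference implementation, and the function names are my own; it reproduces the codeword 111001 obtained above for the 8-bit number 23 under the third split sample option.

    def ccsds_map(d, T):
        """Map a prediction residual d to a nonnegative integer (Equation (3.9))."""
        if 0 <= d <= T:
            return 2 * d
        if -T <= d < 0:
            return 2 * abs(d) - 1
        return T + abs(d)

    def unary(n):
        """Fundamental sequence: n 0s followed by a 1."""
        return "0" * n + "1"

    def split_sample(n, m, k=8):
        """m-th split sample option for a k-bit number n: the m least significant
        bits of n, followed by the unary code of the k - m most significant bits."""
        bits = format(n, "0{}b".format(k))
        return bits[k - m:] + unary(int(bits[:k - m], 2))

    print(split_sample(23, 3))                   # 111001
    print(ccsds_map(-3, 10), ccsds_map(3, 10))   # 5 6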
  • 92. 3.7 Tunstall Codes 69 T A B L E 3 . 17 Code used for zero block option. Number of All-Zero Blocks Codeword 1 1 2 01 3 001 4 0001 5 000001 6 0000001 63 63 0s 000···01 ROS 00001 3 . 7 Tunstall Codes Most of the variable-length codes that we look at in this book encode letters from the source alphabet using codewords with varying numbers of bits: codewords with fewer bits for letters that occur more frequently and codewords with more bits for letters that occur less frequently. The Tunstall code is an important exception. In the Tunstall code, all codewords are of equal length. However, each codeword represents a different number of letters. An example of a 2-bit Tunstall code for an alphabet = AB is shown in Table 3.18. The main advantage of a Tunstall code is that errors in codewords do not propagate, unlike other variable-length codes, such as Huffman codes, in which an error in one codeword will cause a series of errors to occur. E x a m p l e 3 . 7 . 1: Let’s encode the sequence AAABAABAABAABAAA using the code in Table 3.18. Starting at the left, we can see that the string AAA occurs in our codebook and has a code of 00. We then code B as 11, AAB as 01, and so on. We finally end up with coded string 001101010100. T A B L E 3 . 18 A 2-bit Tunstall code. Sequence Codeword AAA 00 AAB 01 AB 10 B 11
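Because the entries of a Tunstall codebook form an exhaustive, prefix-free set of source strings, encoding amounts to accumulating source symbols until the accumulated string matches a codebook entry. The sketch below is illustrative only (the function name is my own); it reproduces the encoding of Example 3.7.1 with the code of Table 3.18.

    def tunstall_encode(sequence, codebook):
        """Encode a source sequence with a fixed-length Tunstall codebook."""
        out, current = [], ""
        for symbol in sequence:
            current += symbol
            if current in codebook:            # entries are prefix-free, so the
                out.append(codebook[current])  # first match is the right one
                current = ""
        if current:
            raise ValueError("sequence ended in the middle of a codebook entry")
        return "".join(out)

    table_3_18 = {"AAA": "00", "AAB": "01", "AB": "10", "B": "11"}
    print(tunstall_encode("AAABAABAABAABAAA", table_3_18))   # 001101010100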
  • 93. 70 3 H U F F M A N C O D I N G T A B L E 3 . 19 A 2-bit (non-Tunstall) code. Sequence Codeword AAA 00 ABA 01 AB 10 B 11 The design of a code that has a fixed codeword length but a variable number of symbols per codeword should satisfy the following conditions: 1. We should be able to parse a source output sequence into sequences of symbols that appear in the codebook. 2. We should maximize the average number of source symbols represented by each codeword. In order to understand what we mean by the first condition, consider the code shown in Table 3.19. Let’s encode the same sequence AAABAABAABAABAAA as in the previous example using the code in Table 3.19. We first encode AAA with the code 00. We then encode B with 11. The next three symbols are AAB. However, there are no codewords corresponding to this sequence of symbols. Thus, this sequence is unencodable using this particular code—not a desirable situation. Tunstall [32] gives a simple algorithm that fulfills these conditions. The algorithm is as follows: Suppose we want an n-bit Tunstall code for a source that generates iid letters from an alphabet of size N. The number of codewords is 2n . We start with the N letters of the source alphabet in our codebook. Remove the entry in the codebook that has the highest probability and add the N strings obtained by concatenating this letter with every letter in the alphabet (including itself). This will increase the size of the codebook from N to N +N −1. The probabilities of the new entries will be the product of the probabilities of the letters concatenated to form the new entry. Now look through the N + N − 1 entries in the codebook and find the entry that has the highest probability, keeping in mind that the entry with the highest probability may be a concatenation of symbols. Each time we perform this operation we increase the size of the codebook by N −1. Therefore, this operation can be performed K times, where N +KN −1 ≤ 2n E x a m p l e 3 . 7 . 2: T u n s t a l l c o d e s Let us design a 3-bit Tunstall code for a memoryless source with the following alphabet: = ABC PA = 0 6 PB = 0 3 PC = 0 1
  • 94. 3.7 Tunstall Codes 71 T A B L E 3 . 20 Source alphabet and associated probabilities. Letter Probability A 0.60 B 0.30 C 0.10 T A B L E 3 . 21 The codebook after one iteration. Sequence Probability B 0.30 C 0.10 AA 0.36 AB 0.18 AC 0.06 T A B L E 3 . 22 A 3-bit Tunstall code. Sequence Probability B 000 C 001 AB 010 AC 011 AAA 100 AAB 101 AAC 110 We start out with the codebook and associated probabilities shown in Table 3.20. Since the letter A has the highest probability, we remove it from the list and add all two-letter strings beginning with A as shown in Table 3.21. After one iteration we have 5 entries in our codebook. Going through one more iteration will increase the size of the codebook by 2, and we will have 7 entries, which is still less than the final codebook size. Going through another iteration after that would bring the codebook size to 10, which is greater than the maximum size of 8. Therefore, we will go through just one more iteration. Looking through the entries in Table 3.22, the entry with the highest probability is AA. Therefore, at the next step we remove AA and add all extensions of AA as shown in Table 3.22. The final 3-bit Tunstall code is shown in Table 3.22.
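The codebook construction just described takes only a few lines. The following sketch is illustrative only (tunstall_codebook is a hypothetical function name); for the source of Example 3.7.2 it produces exactly the entry set of Table 3.22, with one of the possible assignments of 3-bit codewords to entries.

    def tunstall_codebook(probs, n):
        """Build a Tunstall codebook with at most 2**n entries for an iid source.

        Starting from the single letters, the highest-probability entry is
        repeatedly replaced by all of its one-letter extensions, as long as
        the codebook stays within 2**n entries."""
        N = len(probs)
        entries = dict(probs)                       # string -> probability
        while len(entries) + (N - 1) <= 2 ** n:
            best = max(entries, key=entries.get)    # highest-probability entry
            p = entries.pop(best)
            for letter, q in probs.items():
                entries[best + letter] = p * q
        return {seq: format(i, "0{}b".format(n))    # fixed-length codewords
                for i, seq in enumerate(sorted(entries, key=len))}

    probs = {"A": 0.6, "B": 0.3, "C": 0.1}
    for seq, code in tunstall_codebook(probs, 3).items():
        print(seq, code)
    # B 000, C 001, AB 010, AC 011, AAA 100, AAB 101, AAC 110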
  • 95. 72 3 H U F F M A N C O D I N G 3 . 8 Applications of Huffman Coding In this section we describe some applications of Huffman coding. As we progress through the book, we will describe more applications, since Huffman coding is often used in conjunction with other coding techniques. 3 . 8 . 1 L o s s l e s s I m a g e C o m p r e s s i o n A simple application of Huffman coding to image compression would be to generate a Huffman code for the set of values that any pixel may take. For monochrome images, this set usually consists of integers from 0 to 255. Examples of such images are contained in the accompanying data sets. The four that we will use in the examples in this book are shown in Figure 3.10. F I G U R E 3. 10 Test images.
  • 96. 3.8 Applications of Huffman Coding 73 T A B L E 3 . 23 Compression using Huffman codes on pixel values. Image Name Bits/Pixel Total Size (bytes) Compression Ratio Sena 7.01 57,504 1.14 Sensin 7.49 61,430 1.07 Earth 4.94 40,534 1.62 Omaha 7.12 58,374 1.12 We will make use of one of the programs from the accompanying software (see Preface) to generate a Huffman code for each image, and then encode the image using the Huffman code. The results for the four images in Figure 3.10 are shown in Table 3.23. The Huffman code is stored along with the compressed image as the code will be required by the decoder to reconstruct the image. The original (uncompressed) image representation uses 8 bits/pixel. The image consists of 256 rows of 256 pixels, so the uncompressed representation uses 65,536 bytes. The compression ratio is simply the ratio of the number of bytes in the uncompressed represen- tation to the number of bytes in the compressed representation. The number of bytes in the compressed representation includes the number of bytes needed to store the Huffman code. Notice that the compression ratio is different for different images. This can cause some problems in certain applications where it is necessary to know in advance how many bytes will be needed to represent a particular data set. The results in Table 3.23 are somewhat disappointing because we get a reduction of only about 1 2 to 1 bit/pixel after compression. For some applications this reduction is acceptable. For example, if we were storing thousands of images in an archive, a reduction of 1 bit/pixel saves many megabytes in disk space. However, we can do better. Recall that when we first talked about compression, we said that the first step for any compression algorithm was to model the data so as to make use of the structure in the data. In this case, we have made absolutely no use of the structure in the data. From a visual inspection of the test images, we can clearly see that the pixels in an image are heavily correlated with their neighbors. We could represent this structure with the crude model x̂n = xn−1. The residual would be the difference between neighboring pixels. If we carry out this differencing operation and use the Huffman coder on the residuals, the results are as shown in Table 3.24. As we can see, using the structure in the data resulted in substantial improvement. T A B L E 3 . 24 Compression using Huffman codes on pixel difference values. Image Name Bits/Pixel Total Size (bytes) Compression Ratio Sena 4.02 32,968 1.99 Sensin 4.70 38,541 1.70 Earth 4.13 33,880 1.93 Omaha 6.42 52,643 1.24
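The differencing model is easy to experiment with. The sketch below is only a toy illustration (the pixel row is made up, and this is not the huff_enc program from the accompanying software); it forms the residuals of the crude model x̂_n = x_{n−1} and compares the first-order entropy of the raw values with that of the residuals, which is roughly the rate a Huffman coder applied to the residuals would approach.

    import math
    from collections import Counter

    def first_order_entropy(values):
        """Empirical first-order entropy in bits per sample."""
        counts = Counter(values)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    def residuals(pixels):
        """Differences between neighboring pixels; the first pixel is kept as is."""
        return [pixels[0]] + [pixels[i] - pixels[i - 1]
                              for i in range(1, len(pixels))]

    row = [90, 91, 91, 93, 94, 94, 96, 200, 201, 201]   # a made-up row of pixels
    print(first_order_entropy(row))                     # entropy of the raw values
    print(first_order_entropy(residuals(row)))          # smaller for correlated data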
  • 97. 74 3 H U F F M A N C O D I N G T A B L E 3 . 25 Compression using adaptive Huffman codes on pixel difference values. Image Name Bits/Pixel Total Size (bytes) Compression Ratio Sena 3.93 32,261 2.03 Sensin 4.63 37,896 1.73 Earth 4.82 39,504 1.66 Omaha 6.39 52,321 1.25 The results in Tables 3.23 and 3.24 were obtained using a two-pass system, in which the statistics were collected in the first pass and a Huffman table was generated. Instead of using a two-pass system, we could have used a one-pass adaptive Huffman coder. The results for this are given in Table 3.25. Notice that there is little difference between the performance of the adaptive Huffman code and the two-pass Huffman coder. In addition, the fact that the adaptive Huffman coder can be used as an on-line or real-time coder makes the adaptive Huffman coder a more attractive option in many applications. However, the adaptive Huffman coder is more vulnerable to errors and may also be more difficult to implement. In the end, the particular application will determine which approach is more suitable. 3 . 8 . 2 T e x t C o m p r e s s i o n Text compression seems natural for Huffman coding. In text, we have a discrete alphabet that, in a given class, has relatively stationary probabilities. For example, the probability model for a particular novel will not differ significantly from the probability model for another novel. Similarly, the probability model for a set of FORTRAN programs is not going to be much different than the probability model for a different set of FORTRAN programs. The probabilities in Table 3.26 are the probabilities of the 26 letters (upper- and lowercase) obtained for the U.S. Constitution and are representative of English text. The probabilities in Table 3.27 were obtained by counting the frequency of occurrences of letters in an earlier version of this chapter. While the two documents are substantially different, the two sets of probabilities are very much alike. We encoded the earlier version of this chapter using Huffman codes that were created using the probabilities of occurrence obtained from the chapter. The file size dropped from about 70,000 bytes to about 43,000 bytes with Huffman coding. While this reduction in file size is useful, we could have obtained better compression if we first removed the structure existing in the form of correlation between the symbols in the file. Obviously, there is a substantial amount of correlation in this text. For example, Huf is always followed by fman! Unfortunately, this correlation is not amenable to simple numerical models, as was the case for the image files. However, there are other somewhat more complex techniques that can be used to remove the correlation in text files. We will look more closely at these in Chapters 5 and 6.
  • 98. 3.8 Applications of Huffman Coding 75 T A B L E 3 . 26 Probabilities of occurrence of the letters in the English alphabet in the U.S. Constitution. Letter Probability Letter Probability A 0 057305 N 0.056035 B 0 014876 O 0.058215 C 0 025775 P 0.021034 D 0 026811 Q 0.000973 E 0 112578 R 0.048819 F 0 022875 S 0.060289 G 0 009523 T 0.078085 H 0 042915 U 0.018474 I 0 053475 V 0.009882 J 0 002031 W 0.007576 K 0 001016 X 0.002264 L 0 031403 Y 0.011702 M 0 015892 Z 0.001502 T A B L E 3 . 27 Probabilities of occurrence of the letters in the English alphabet in this chapter. Letter Probability Letter Probability A 0 049855 N 0.048039 B 0 016100 O 0.050642 C 0 025835 P 0.015007 D 0 030232 Q 0.001509 E 0 097434 R 0.040492 F 0 019754 S 0.042657 G 0 012053 T 0.061142 H 0 035723 U 0.015794 I 0 048783 V 0.004988 J 0 000394 W 0.012207 K 0 002450 X 0.003413 L 0 025835 Y 0.008466 M 0 016494 Z 0.001050 3 . 8 . 3 A u d i o C o m p r e s s i o n Another class of data that is very suitable for compression is CD-quality audio data. The audio signal for each stereo channel is sampled at 44.1 kHz, and each sample is represented by 16 bits. This means that the amount of data stored on one CD is enormous. If we want to transmit this data, the amount of channel capacity required would be significant. Compression is definitely useful in this case. In Table 3.28 we show for a variety of audio material the file size, the entropy, the estimated compressed file size if a Huffman coder is used, and the resulting compression ratio.
  • 99. 76 3 H U F F M A N C O D I N G T A B L E 3 . 28 Huffman coding of 16-bit CD-quality audio. Original Entropy Estimated Compressed Compression File Name File Size (bytes) (bits) File Size (bytes) Ratio Mozart 939,862 12.8 725,420 1.30 Cohn 402,442 13.8 349,300 1.15 Mir 884,020 13.7 759,540 1.16 The three segments used in this example represent a wide variety of audio material, from a symphonic piece by Mozart to a folk rock piece by Cohn. Even though the material is varied, Huffman coding can lead to some reduction in the capacity required to transmit this material. Note that we have only provided the estimated compressed file sizes. The estimated file size in bits was obtained by multiplying the entropy by the number of samples in the file. We used this approach because the samples of 16-bit audio can take on 65,536 distinct values, and therefore the Huffman coder would require 65,536 distinct (variable-length) codewords. In most applications, a codebook of this size would not be practical. There is a way of handling large alphabets, called recursive indexing, that we will describe in Chapter 9. There is also some recent work [14] on using a Huffman tree in which leaves represent sets of symbols with the same probability. The codeword consists of a prefix that specifies the set followed by a suffix that specifies the symbol within the set. This approach can accommodate relatively large alphabets. As with the other applications, we can obtain an increase in compression if we first remove the structure from the data. Audio data can be modeled numerically. In later chapters we will examine more sophisticated modeling approaches. For now, let us use the very simple model that was used in the image-coding example; that is, each sample has the same value as the previous sample. Using this model we obtain the difference sequence. The entropy of the difference sequence is shown in Table 3.29. Note that there is a further reduction in the file size: the compressed file sizes are about 60% of the original files. Further reductions can be obtained by using more sophisticated models. Many of the lossless audio compression schemes, including FLAC (Free Lossless Audio Codec), Apple’s ALAC or ALE, Shorten [33], Monkey’s Audio, and the proposed (as of now) MPEG-4 ALS [34] algorithms, use a linear predictive model to remove some of T A B L E 3 . 29 Huffman coding of differences of 16-bit CD-quality audio. Original Entropy Estimated Compressed Compression File Name File Size (bytes) of Differences (bits) File Size (bytes) Ratio Mozart 939,862 9.7 569,792 1.65 Cohn 402,442 10.4 261,590 1.54 Mir 884,020 10.9 602,240 1.47
  • 100. 3.10 Projects and Problems 77 the structure from the audio sequence and then use Rice coding to encode the residuals. Most others, such as AudioPak [35] and OggSquish, use Huffman coding to encode the residuals. 3 . 9 Summary In this chapter we began our exploration of data compression techniques with a description of the Huffman coding technique and several other related techniques. The Huffman coding technique and its variants are some of the most commonly used coding approaches. We will encounter modified versions of Huffman codes when we look at compression techniques for text, image, and video. In this chapter we described how to design Huffman codes and discussed some of the issues related to Huffman codes. We also described how adaptive Huffman codes work and looked briefly at some of the places where Huffman codes are used. We will see more of these in future chapters. To explore further applications of Huffman coding, you can use the programs huff_enc, huff_dec, and adap_huff to generate your own Huffman codes for your favorite applications. F u r t h e r R e a d i n g 1. A detailed and very accessible overview of Huffman codes is provided in “Huffman Codes,” by S. Pigeon [36], in Lossless Compression Handbook. 2. Details about nonbinary Huffman codes and a much more theoretical and rigor- ous description of variable-length codes can be found in The Theory of Informa- tion and Coding, volume 3 of Encyclopedia of Mathematic and Its Application, by R.J. McEliece [6]. 3. The tutorial article “Data Compression” in the September 1987 issue of ACM Com- puting Surveys, by D.A. Lelewer and D.S. Hirschberg [37], along with other material, provides a very nice brief coverage of the material in this chapter. 4. A somewhat different approach to describing Huffman codes can be found in Data Compression—Methods and Theory, by J.A. Storer [38]. 5. A more theoretical but very readable account of variable-length coding can be found in Elements of Information Theory, by T.M. Cover and J.A. Thomas [3]. 6. Although the book Coding and Information Theory, by R.W. Hamming [9], is mostly about channel coding, Huffman codes are described in some detail in Chapter 4. 3 . 10 Projects and Problems 1. The probabilities in Tables 3.27 and 3.27 were obtained using the program countalpha from the accompanying software. Use this program to compare prob- abilities for different types of text, C programs, messages on Usenet, and so on.
  • 101. 78 3 H U F F M A N C O D I N G Comment on any differences you might see and describe how you would tailor your compression strategy for each type of text. 2. Use the programs huff_enc and huff_dec to do the following (in each case use the codebook generated by the image being compressed): (a) Code the Sena, Sinan, and Omaha images. (b) Write a program to take the difference between adjoining pixels, and then use huffman to code the difference images. (c) Repeat (a) and (b) using adap_huff. Report the resulting file sizes for each of these experiments and comment on the differences. 3. Using the programs huff_enc and huff_dec, code the Bookshelf1 and Sena images using the codebook generated by the Sinan image. Compare the results with the case where the codebook was generated by the image being compressed. 4. A source emits letters from an alphabet = a1a2a3a4a5 with probabilities Pa1 = 0 15, Pa2 = 0 04, Pa3 = 0 26, Pa4 = 0 05, and Pa5 = 0 50. (a) Calculate the entropy of this source. (b) Find a Huffman code for this source. (c) Find the average length of the code in (b) and its redundancy. 5. For an alphabet = a1a2a3a4 with probabilities Pa1 = 0 1, Pa2 = 0 3, Pa3 = 0 25, and Pa4 = 0 35, find a Huffman code (a) using the first procedure outlined in this chapter, and (b) using the minimum variance procedure. Comment on the difference in the Huffman codes. 6. In many communication applications, it is desirable that the number of 1s and 0s transmitted over the channel are about the same. However, if we look at Huffman codes, many of them seem to have many more 1s than 0s or vice versa. Does this mean that Huffman coding will lead to inefficient channel usage? For the Huffman code obtained in Problem 3, find the probability that a 0 will be transmitted over the channel. What does this probability say about the question posed above? 7. For the source in Example 3.3.1, generate a ternary code by combining three letters in the first and second steps and two letters in the third step. Compare with the ternary code obtained in the example. 8. In Example 3.4.1 we have shown how the tree develops when the sequence a a r d v is transmitted. Continue this example with the next letters in the sequence, a r k. 9. The Monte Carlo approach is often used for studying problems that are difficult to solve analytically. Let’s use this approach to study the problem of buffering when
  • 102. 3.10 Projects and Problems 79 using variable-length codes. We will simulate the situation in Example 3.2.1, and study the time to overflow and underflow as a function of the buffer size. In our program, we will need a random number generator, a set of seeds to initialize the random number generator, a counter B to simulate the buffer occupancy, a counter T to keep track of the time, and a value N, which is the size of the buffer. Input to the buffer is simulated by using the random number generator to select a letter from our alphabet. The counter B is then incremented by the length of the codeword for the letter. The output to the buffer is simulated by decrementing B by 2 except when T is divisible by 5. For values of T divisible by 5, decrement B by 3 instead of 2 (why?). Keep incrementing T, each time simulating an input and an output, until either B ≥ N, corresponding to a buffer overflow, or B 0, corresponding to a buffer underflow. When either of these events happens, record what happened and when, and restart the simulation with a new seed. Do this with at least 100 seeds. Perform this simulation for a number of buffer sizes (N = 100100010000), and the two Huffman codes obtained for the source in Example 3.2.1. Describe your results in a report. 10. While the variance of lengths is an important consideration when choosing between two Huffman codes that have the same average lengths, it is not the only consideration. Another consideration is the ability to recover from errors in the channel. In this problem we will explore the effect of error on two equivalent Huffman codes. (a) For the source and Huffman code of Example 3.2.1 (Table 3.5), encode the sequence a2 a1 a3 a2 a1 a2 Suppose there was an error in the channel and the first bit was received as a 0 instead of a 1. Decode the received sequence of bits. How many characters are received in error before the first correctly decoded character? (b) Repeat using the code in Table 3.9. (c) Repeat parts (a) and (b) with the error in the third bit. 11. (This problem was suggested by P.F. Swaszek.) (a) For a binary source with probabilities P0 = 0 9, P1 = 0 1, design a Huffman code for the source obtained by blocking m bits together, m = 1, 2, , 8. Plot the average lengths versus m. Comment on your result. (b) Repeat for P0 = 0 99, P1 = 0 01. You can use the program huff_enc to generate the Huffman codes. 12. Encode the following sequence of 16 values using the Rice code with J = 8 and one split sample option. 32333539373839404040403940404140
  • 103. 80 3 H U F F M A N C O D I N G For prediction use the previous value in the sequence ŷi = yi−1 and assume a prediction of zero for the first element of the sequence. 13. For an alphabet = a1a2a3 with probabilities Pa1 = 0 7, Pa2 = 0 2, Pa3 = 0 1, design a 3-bit Tunstall code. 14. Write a program for encoding images using the Rice algorithm. Use eight options, including the fundamental sequence, five split sample options, and the two low-entropy options. Use J = 16. For prediction use either the pixel to the left or the pixel above. Encode the Sena image using your program. Compare your results with the results obtained by Huffman coding the differences between pixels.
  • 104. 4 Arithmetic Coding 4 . 1 Overview I n the previous chapter we saw one approach to generating variable-length codes. In this chapter we see another, increasingly popular, method of gen- erating variable-length codes called arithmetic coding. Arithmetic coding is especially useful when dealing with sources with small alphabets, such as binary sources, and alphabets with highly skewed probabilities. It is also a very useful approach when, for various reasons, the modeling and coding aspects of lossless com- pression are to be kept separate. In this chapter, we look at the basic ideas behind arithmetic coding, study some of the properties of arithmetic codes, and describe an implementation. 4 . 2 Introduction In the last chapter we studied the Huffman coding method, which guarantees a coding rate R within 1 bit of the entropy H. Recall that the coding rate is the average number of bits used to represent a symbol from a source and, for a given probability model, the entropy is the lowest rate at which the source can be coded. We can tighten this bound somewhat. It has been shown [23] that the Huffman algorithm will generate a code whose rate is within pmax +0086 of the entropy, where pmax is the probability of the most frequently occurring symbol. We noted in the last chapter that, in applications where the alphabet size is large, pmax is generally quite small, and the amount of deviation from the entropy, especially in terms of a percentage of the rate, is quite small. However, in cases where the alphabet is small and the probability of occurrence of the different letters is skewed, the value of pmax can be quite large and the Huffman code can become rather inefficient when compared to the entropy. One way to avoid this problem is to block more than one symbol together and generate an extended Huffman code. Unfortunately, this approach does not always work.
  • 105. 82 4 A R I T H M E T I C C O D I N G E x a m p l e 4 . 2 . 1: Consider a source that puts out independent, identically distributed (iid) letters from the alphabet = a1a2a3 with the probability model Pa1 = 095, Pa2 = 002, and Pa3 = 003. The entropy for this source is 0.335 bits/symbol. A Huffman code for this source is given in Table 4.1. T A B L E 4 . 1 Huffman code for three-letter alphabet. Letter Codeword a1 0 a2 11 a3 10 The average length for this code is 1.05 bits/symbol. The difference between the average code length and the entropy, or the redundancy, for this code is 0.715 bits/symbol, which is 213% of the entropy. This means that to code this sequence we would need more than twice the number of bits promised by the entropy. Recall Example 3.2.4. Here also we can group the symbols in blocks of two. The extended alphabet, probability model, and code can be obtained as shown in Table 4.2. The average rate for the extended alphabet is 1.222 bits/symbol, which in terms of the original alphabet is 0.611 bits/symbol. As the entropy of the source is 0.335 bits/symbol, the additional rate over the entropy is still about 72% of the entropy! By continuing to block symbols together, we find that the redundancy drops to acceptable values when we block eight symbols together. The corresponding alphabet size for this level of blocking is 6561! A code of this size is impractical for a number of reasons. Storage of a code like this requires memory that may not be available for many applications. While it may be possible to design reasonably efficient encoders, decoding a Huffman code of this size would be a highly inefficient and time-consuming procedure. Finally, if there were some perturbation in the statistics, and some of the assumed probabilities changed slightly, this would have a major impact on the efficiency of the code. T A B L E 4 . 2 Huffman code for extended alphabet. Letter Probability Code a1a1 0.9025 0 a1a2 0.0190 111 a1a3 0.0285 100 a2a1 0.0190 1101 a2a2 0.0004 110011 a2a3 0.0006 110001 a3a1 0.0285 101 a3a2 0.0006 110010 a3a3 0.0009 110000
  • 106. 4.3 Coding a Sequence 83 We can see that it is more efficient to generate codewords for groups or sequences of symbols rather than generating a separate codeword for each symbol in a sequence. However, this approach becomes impractical when we try to obtain Huffman codes for long sequences of symbols. In order to find the Huffman codeword for a particular sequence of length m, we need codewords for all possible sequences of length m. This fact causes an exponential growth in the size of the codebook. We need a way of assigning codewords to particular sequences without having to generate codes for all sequences of that length. The arithmetic coding technique fulfills this requirement. In arithmetic coding a unique identifier or tag is generated for the sequence to be encoded. This tag corresponds to a binary fraction, which becomes the binary code for the sequence. In practice the generation of the tag and the binary code are the same process. However, the arithmetic coding approach is easier to understand if we conceptually divide the approach into two phases. In the first phase a unique identifier or tag is generated for a given sequence of symbols. This tag is then given a unique binary code. A unique arithmetic code can be generated for a sequence of length m without the need for generating codewords for all sequences of length m. This is unlike the situation for Huffman codes. In order to generate a Huffman code for a sequence of length m, where the code is not a concatenation of the codewords for the individual symbols, we need to obtain the Huffman codes for all sequences of length m. 4 . 3 Coding a Sequence In order to distinguish a sequence of symbols from another sequence of symbols we need to tag it with a unique identifier. One possible set of tags for representing sequences of symbols are the numbers in the unit interval 01. Because the number of numbers in the unit interval is infinite, it should be possible to assign a unique tag to each distinct sequence of symbols. In order to do this we need a function that will map sequences of symbols into the unit interval. A function that maps random variables, and sequences of random variables, into the unit interval is the cumulative distribution function (cdf) of the random variable associated with the source. This is the function we will use in developing the arithmetic code. (If you are not familiar with random variables and cumulative distribution functions, or need to refresh your memory, you may wish to look at Appendix A.) The use of the cumulative distribution function to generate a binary code for a sequence has a rather interesting history. Shannon, in his original 1948 paper [7], mentioned an approach using the cumulative distribution function when describing what is now known as the Shannon-Fano code. Peter Elias, another member of Fano’s first information theory class at MIT (this class also included Huffman), came up with a recursive implementation for this idea. However, he never published it, and we only know about it through a mention in a 1963 book on information theory by Abramson [39]. Abramson described this coding approach in a note to a chapter. In another book on information theory by Jelinek [40] in 1968, the idea of arithmetic coding is further developed, this time in an appendix, as an example of variable- length coding. Modern arithmetic coding owes its birth to the independent discoveries in 1976 of Pasco [41] and Rissanen [42] that the problem of finite precision could be resolved.
  • 107. 84 4 A R I T H M E T I C C O D I N G Finally, several papers appeared that provided practical arithmetic coding algorithms, the most well known of which is the paper by Rissanen and Langdon [43]. Before we begin our development of the arithmetic code, we need to establish some notation. Recall that a random variable maps the outcomes, or sets of outcomes, of an experiment to values on the real number line. For example, in a coin-tossing experiment, the random variable could map a head to zero and a tail to one (or it could map a head to 23675 and a tail to −192). To use this technique, we need to map the source symbols or letters to numbers. For convenience, in the discussion in this chapter we will use the mapping Xai = i ai ∈ (4.1) where = a1a2am is the alphabet for a discrete source and X is a random variable. This mapping means that given a probability model for the source, we also have a probability density function for the random variable PX = i = Pai and the cumulative density function can be defined as FXi = i k=1 PX = k Notice that for each symbol ai with a nonzero probability we have a distinct value of FXi. We will use this fact in what follows to develop the arithmetic code. Our development may be more detailed than what you are looking for, at least on the first reading. If so, skip or skim Sections 4.3.1–4.4.1 and go directly to Section 4.4.2. 4 . 3 . 1 G e n e r a t i n g a T a g The procedure for generating the tag works by reducing the size of the interval in which the tag resides as more and more elements of the sequence are received. We start out by first dividing the unit interval into subintervals of the form FXi−1FXi, i = 1, . . . , m. Because the minimum value of the cdf is zero and the maximum value is one, this exactly partitions the unit interval. We associate the subin- terval FXi − 1FXi with the symbol ai. The appearance of the first symbol in the sequence restricts the interval containing the tag to one of these subintervals. Suppose the first symbol was ak. Then the interval containing the tag value will be the subinterval FXk−1FXk. This subinterval is now partitioned in exactly the same proportions as the original interval. That is, the jth interval corresponding to the symbol aj is given by FXk−1+FXj −1/FXk−FXk−1FXk−1+FXj/FXk−FXk−1. So if the second symbol in the sequence is aj, then the interval containing the tag value becomes FXk−1+FXj −1/FXk−FXk−1FXk−1+FXj/FXk−FXk−1. Each succeeding symbol causes the tag to be restricted to a subinterval that is further partitioned in the same proportions. This process can be more clearly understood through an example.
  • 108. 4.3 Coding a Sequence 85 E x a m p l e 4 . 3 . 1: Consider a three-letter alphabet = a1a2a3 with Pa1 = 07, Pa2 = 01, and Pa3 = 02. Using the mapping of Equation (4.1), FX1 = 07, FX2 = 08, and FX3 = 1. This partitions the unit interval as shown in Figure 4.1. 1.0 0.8 a2 a3 a1 0.7 0.0 0.70 0.56 a2 a3 a1 0.49 0.00 0.560 0.546 a2 a3 a1 0.539 0.490 0.5600 0.5572 a2 a3 a1 0.5558 0.5460 F I G U R E 4. 1 Restricting the interval containing the tag for the input sequence a1a2a3 . The partition in which the tag resides depends on the first symbol of the sequence being encoded. For example, if the first symbol is a1, the tag lies in the interval 0007; if the first symbol is a2, the tag lies in the interval 0708; and if the first symbol is a3, the tag lies in the interval 0810. Once the interval containing the tag has been determined, the rest of the unit interval is discarded, and this restricted interval is again divided in the same proportions as the original interval. Suppose the first symbol was a1. The tag would be contained in the subinterval 0007. This subinterval is then subdivided in exactly the same proportions as the original interval, yielding the subintervals 00049, 049056, and 05607. The first partition as before corresponds to the symbol a1, the second partition corresponds to the symbol a2, and the third partition 05607 corresponds to the symbol a3. Suppose the second symbol in the sequence is a2. The tag value is then restricted to lie in the interval 049056. We now partition this interval in the same proportion as the original interval to obtain the subintervals 0490539 corresponding to the symbol a1, 05390546 corresponding to the symbol a2, and 0546056 corresponding to the symbol a3. If the third symbol is a3, the tag will be restricted to the interval 0546056, which can then be subdivided further. This process is described graphically in Figure 4.1. Notice that the appearance of each new symbol restricts the tag to a subinterval that is disjoint from any other subinterval that may have been generated using this process. For
  • 109. 86 4 A R I T H M E T I C C O D I N G the sequence beginning with a1a2a3, by the time the third symbol a3 is received, the tag has been restricted to the subinterval 0546056. If the third symbol had been a1 instead of a3, the tag would have resided in the subinterval 0490539, which is disjoint from the subinterval 0546056. Even if the two sequences are identical from this point on (one starting with a1a2a3 and the other beginning with a1a2a1), the tag interval for the two sequences will always be disjoint. As we can see, the interval in which the tag for a particular sequence resides is disjoint from all intervals in which the tag for any other sequence may reside. As such, any member of this interval can be used as a tag. One popular choice is the lower limit of the interval; another possibility is the midpoint of the interval. For the moment, let’s use the midpoint of the interval as the tag. In order to see how the tag generation procedure works mathematically, we start with sequences of length one. Suppose we have a source that puts out symbols from some alphabet = a1a2am. We can map the symbols ai to real numbers i. Define T̄Xai as T̄Xai = i−1 k=1 PX = k+ 1 2 PX = i (4.2) = FXi−1+ 1 2 PX = i (4.3) For each ai, T̄Xai will have a unique value. This value can be used as a unique tag for ai. E x a m p l e 4 . 3 . 2: Consider a simple dice-throwing experiment with a fair die. The outcomes of a roll of the die can be mapped into the numbers 126. For a fair die PX = k = 1 6 for k = 1, 2, …, 6. Therefore, using (4.3) we can find the tag for X = 2 as T̄X2 = PX = 1+ 1 2 PX = 2 = 1 6 + 1 12 = 025 and the tag for X = 5 as T̄X5 = 4 k=1 PX = k+ 1 2 PX = 5 = 075 The tags for all other outcomes are shown in Table 4.3.
  • 110. 4.3 Coding a Sequence 87 T A B L E 4 . 3 Toss for outcomes in a dice-throwing experment. Outcome Tag 1 00833 3 04166 4 05833 6 09166 As we can see from the example above, giving a unique tag to a sequence of length one is an easy task. This approach can be extended to longer sequences by imposing an order on the sequences. We need an ordering on the sequences because we will assign a tag to a particular sequence xi as T̄ m X xi = yxi Py+ 1 2 Pxi (4.4) where y x means that y precedes x in the ordering, and the superscript denotes the length of the sequence. An easy ordering to use is lexicographic ordering. In lexicographic ordering, the ordering of letters in an alphabet induces an ordering on the words constructed from this alphabet. The ordering of words in a dictionary is a good (maybe the original) example of lexicographic ordering. Dictionary order is sometimes used as a synonym for lexicographic order. E x a m p l e 4 . 3 . 3: We can extend Example 4.3.1 so that the sequence consists of two rolls of a die. Using the ordering scheme described above, the outcomes (in order) would be 11 12 13 66. The tags can then be generated using Equation (4.4). For example, the tag for the sequence 13 would be T̄X13 = Px = 11+Px = 12+1/2Px = 13 (4.5) = 1/36+1/36+1/21/36 (4.6) = 5/72 (4.7) Notice that to generate the tag for 13 we did not have to generate a tag for every other possible message. However, based on Equation (4.4) and Example 4.3.3, we need to know the probability of every sequence that is “less than” the sequence for which the tag is being generated. The requirement that the probability of all sequences of a given length be explicitly calculated can be as prohibitive as the requirement that we have codewords for all sequences of a given length. Fortunately, we shall see that to compute a tag for a given sequence of symbols, all we need is the probability of individual symbols, or the probability model.
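Equations (4.3) and (4.4) are easy to check numerically. The sketch below is illustrative only (single_tag and pair_tag are my own names); using exact fractions, it reproduces the tags computed in Example 4.3.2 and the tag of 5/72 for the sequence 1 3 of Example 4.3.3.

    from fractions import Fraction

    def single_tag(i, pmf):
        """Tag for a single outcome i: F_X(i - 1) + P(X = i)/2 (Equation (4.3))."""
        return sum(pmf[k] for k in pmf if k < i) + Fraction(1, 2) * pmf[i]

    def pair_tag(x, pmf):
        """Tag for a two-outcome sequence x under lexicographic ordering
        (Equation (4.4)): the total probability of all pairs preceding x,
        plus half the probability of x itself."""
        outcomes = sorted(pmf)
        preceding = sum(pmf[a] * pmf[b] for a in outcomes for b in outcomes
                        if (a, b) < x)
        return preceding + Fraction(1, 2) * pmf[x[0]] * pmf[x[1]]

    die = {k: Fraction(1, 6) for k in range(1, 7)}
    print(single_tag(2, die))     # 1/4
    print(single_tag(5, die))     # 3/4
    print(pair_tag((1, 3), die))  # 5/72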
  • 111. 88 4 A R I T H M E T I C C O D I N G Recall that, given our construction, the interval containing the tag value for a given sequence is disjoint from the intervals containing the tag values of all other sequences. This means that any value in this interval would be a unique identifier for xi. Therefore, to fulfill our initial objective of uniquely identifying each sequence, it would be sufficient to compute the upper and lower limits of the interval containing the tag and select any value in that interval. The upper and lower limits can be computed recursively as shown in the following example. E x a m p l e 4 . 3 . 4: We will use the alphabet of Example 4.3.2 and find the upper and lower limits of the interval containing the tag for the sequence 322. Assume that we are observing 3 2 2 in a sequential manner; that is, first we see 3, then 2, and then 2 again. After each observation we will compute the upper and lower limits of the interval containing the tag of the sequence observed to that point. We will denote the upper limit by un and the lower limit by ln , where n denotes the length of the sequence. We first observe 3. Therefore, u1 = FX3 l1 = FX2 We then observe 2 and the sequence is x = 32. Therefore, u2 = F 2 X 32 l2 = F 2 X 31 We can compute these values as follows: F 2 X 32 = Px = 11+Px = 12+···+Px = 16 +Px = 21+Px = 22+···+Px = 26 +Px = 31+Px = 32 But, i=6 i=1 Px = ki = i=6 i=1 Px1 = kx2 = i = Px1 = k where x = x1x2. Therefore, F 2 X 32 = Px1 = 1+Px1 = 2+Px = 31+Px = 32 = FX2+Px = 31+Px = 32 However, assuming each roll of the dice is independent of the others, Px = 31 = Px1 = 3Px2 = 1 and Px = 32 = Px1 = 3Px2 = 2
  • 112. 4.3 Coding a Sequence 89 Therefore, Px = 31+Px = 32 = Px1 = 3Px2 = 1+Px2 = 2 = Px1 = 3FX2 Noting that Px1 = 3 = FX3−FX2 we can write Px = 31+Px = 32 = FX3−FX2FX2 and F 2 X 32 = FX2+FX3−FX2FX2 We can also write this as u2 = l1 +u1 −l1 FX2 We can similarly show that F 2 X 31 = FX2+FX3−FX2FX1 or l2 = l1 +u1 −l1 FX1 The third element of the observed sequence is 2, and the sequence is x = 322. The upper and lower limits of the interval containing the tag for this sequence are u3 = F 3 X 322 l3 = F 3 X 321 Using the same approach as above we find that F 3 X 322 = F 2 X 31+F 2 X 32−F 2 X 31FX2 (4.8) F 3 X 321 = F 2 X 31+F 2 X 32−F 2 X 31FX1 or u3 = l2 +u2 −l2 FX2 l3 = l2 +u2 −l2 FX1 In general, we can show that for any sequence x = x1x2 xn ln = ln−1 +un−1 −ln−1 FXxn −1 (4.9) un = ln−1 +un−1 −ln−1 FXxn (4.10)
  • 113. 90 4 A R I T H M E T I C C O D I N G Notice that throughout this process we did not explicitly need to compute any joint probabilities. If we are using the midpoint of the interval for the tag, then T̄Xx = un +ln 2 Therefore, the tag for any sequence can be computed in a sequential fashion. The only information required by the tag generation procedure is the cdf of the source, which can be obtained directly from the probability model. E x a m p l e 4 . 3 . 5: G e n e r a t i n g a t a g Consider the source in Example 3.2.4. Define the random variable Xai = i. Suppose we wish to encode the sequence 1 3 2 1. From the probability model we know that FXk = 0 k ≤ 0 FX1 = 08 FX2 = 082 FX3 = 1 FXk = 1 k 3 We can use Equations (4.9) and (4.10) sequentially to determine the lower and upper limits of the interval containing the tag. Initializing u0 to 1, and l0 to 0, the first element of the sequence 1 results in the following update: l1 = 0+1−00 = 0 u1 = 0+1−008 = 08 That is, the tag is contained in the interval 008. The second element of the sequence is 3. Using the update equations we get l2 = 0+08−0FX2 = 08×082 = 0656 u2 = 0+08−0FX3 = 08×10 = 08 Therefore, the interval containing the tag for the sequence 1 3 is 065608. The third element, 2, results in the following update equations: l3 = 0656+08−0656FX1 = 0656+0144×08 = 07712 u3 = 0656+08−0656FX2 = 0656+0144×082 = 077408 and the interval for the tag is 07712077408. Continuing with the last element, the upper and lower limits of the interval containing the tag are l4 = 07712+077408−07712FX0 = 07712+000288×00 = 07712 u4 = 07712+077408−01152FX1 = 07712+000288×08 = 0773504 and the tag for the sequence 1 3 2 1 can be generated as T̄X1321 = 07712+0773504 2 = 0772352
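The recursion of Equations (4.9) and (4.10) amounts to only a few lines of code. The sketch below is my own transcription (Python, using exact rational arithmetic so the printed values match the text rather than floating-point approximations); it regenerates the limits and the tag of Example 4.3.5.

```python
from fractions import Fraction

def tag_interval(seq, cdf):
    """Update the tag interval per Equations (4.9) and (4.10).
    cdf[k] holds F_X(k); cdf[0] must be 0."""
    low, high = Fraction(0), Fraction(1)
    for x in seq:
        width = high - low
        # the new upper limit is computed first, so that `low` on the
        # right-hand side still refers to l_(n-1)
        high = low + width * cdf[x]        # Equation (4.10)
        low = low + width * cdf[x - 1]     # Equation (4.9)
    return low, high

# Source of Example 4.3.5: F_X(1) = 0.8, F_X(2) = 0.82, F_X(3) = 1
F = {0: Fraction(0), 1: Fraction(8, 10), 2: Fraction(82, 100), 3: Fraction(1)}
low, high = tag_interval([1, 3, 2, 1], F)
print(float(low), float(high))             # 0.7712 0.773504
print(float((low + high) / 2))             # 0.772352, the tag of the text
```

The same function with $F_X(k) = k/6$ reproduces the interval $[79/216, 80/216)$ obtained for the sequence 3 2 2 in Example 4.3.4.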
  • 114. 4.3 Coding a Sequence 91 Notice that each succeeding interval is contained in the preceding interval. If we examine the equations used to generate the intervals, we see that this will always be the case. This property will be used to decipher the tag. An undesirable consequence of this process is that the intervals get smaller and smaller and require higher precision as the sequence gets longer. To combat this problem, a rescaling strategy needs to be adopted. In Section 4.4.2, we will describe a simple rescaling approach that takes care of this problem. 4 . 3 . 2 D e c i p h e r i n g t h e T a g We have spent a considerable amount of time showing how a sequence can be assigned a unique tag, given a minimal amount of information. However, the tag is useless unless we can also decipher it with minimal computational cost. Fortunately, deciphering the tag is as simple as generating it. We can see this most easily through an example. E x a m p l e 4 . 3 . 6: D e c i p h e r i n g a t a g Given the tag obtained in Example 4.3.5, let’s try to obtain the sequence represented by the tag. We will try to mimic the encoder in order to do the decoding. The tag value is 0772352. The interval containing this tag value is a subset of every interval obtained in the encoding process. Our decoding strategy will be to decode the elements in the sequence in such a way that the upper and lower limits uk and lk will always contain the tag value for each k. We start with l0 = 0 and u0 = 1. After decoding the first element of the sequence x1, the upper and lower limits become l1 = 0+1−0FXx1 −1 = FXx1 −1 u1 = 0+1−0FXx1 = FXx1 In other words, the interval containing the tag is FXx1 −1FXx1. We need to find the value of x1 for which 0772352 lies in the interval FXx1 −1FXx1. If we pick x1 = 1, the interval is 008. If we pick x1 = 2, the interval is 08082, and if we pick x1 = 3, the interval is 08210. As 0772352 lies in the interval 0008 , we choose x1 = 1. We now repeat this procedure for the second element x2, using the updated values of l1 and u1 : l2 = 0+08−0FXx2 −1 = 08FXx2 −1 u2 = 0+08−0FXx2 = 08FXx2 If we pick x2 = 1, the updated interval is 0064, which does not contain the tag. Therefore, x2 cannot be 1. If we pick x2 = 2, the updated interval is 0640656, which also does not contain the tag. If we pick x2 = 3, the updated interval is 065608, which does contain the tag value of 0772352. Therefore, the second element in the sequence is 3. Knowing the second element of the sequence, we can update the values of l2 and u2 and find the element x3, which will give us an interval containing the tag: l3 = 0656+08−0656FXx3 −1 = 0656+0144×FXx3 −1 u3 = 0656+08−0656FXx3 = 0656+0144×FXx3
  • 115. 92 4 A R I T H M E T I C C O D I N G However, the expressions for l3 and u3 are cumbersome in this form. To make the comparisons more easily, we could subtract the value of l2 from both the limits and the tag. That is, we find the value of x3 for which the interval 0144×FXx3 −1, 0144×FXx3 contains 0772352−0656 = 0116352. Or, we could make this even simpler and divide the residual tag value of 0116352 by 0144 to get 0808, and find the value of x3 for which 0808 falls in the interval FXx3 − 1FXx3. We can see that the only value of x3 for which this is possible is 2. Substituting 2 for x3 in the update equations, we can update the values of l3 and u3 . We can now find the element x4 by computing the upper and lower limits as l4 = 07712+077408−07712FXx4 −1 = 07712+000288×FXx4 −1 u4 = 07712+077408−01152FXx4 = 07712+000288×FXx4 Again we can subtract l3 from the tag to get 0772352 − 07712 = 0001152 and find the value of x4 for which the interval 000288 × FXx4 − 1, 000288 × FXx4 contains 0001152. To make the comparisons simpler, we can divide the residual value of the tag by 000288 to get 04, and find the value of x4 for which 04 is contained in FXx4 −1FXx4. We can see that the value is x4 = 1, and we have decoded the entire sequence. Note that we knew the length of the sequence beforehand and, therefore, we knew when to stop. From the example above, we can deduce an algorithm that can decipher the tag. 1. Initialize l0 = 0 and u0 = 1. 2. For each k find t∗ = tag −lk−1 /uk−1 −lk−1 . 3. Find the value of xk for which FXxk −1 ≤ t∗ FXxk. 4. Update uk and lk . 5. Continue until the entire sequence has been decoded. There are two ways to know when the entire sequence has been decoded. The decoder may know the length of the sequence, in which case the deciphering process is stopped when that many symbols have been obtained. The second way to know if the entire sequence has been decoded is that a particular symbol is denoted as an end-of-transmission symbol. The decoding of this symbol would bring the decoding process to a close. 4 . 4 Generating a Binary Code Using the algorithm described in the previous section, we can obtain a tag for a given sequence x. However, the binary code for the sequence is what we really want to know. We want to find a binary code that will represent the sequence x in a unique and efficient manner. We have said that the tag forms a unique representation for the sequence. This means that the binary representation of the tag forms a unique binary code for the sequence. However, we have placed no restrictions on what values in the unit interval the tag can take. The binary
  • 116. 4.4 Generating a Binary Code 93 representation of some of these values would be infinitely long, in which case, although the code is unique, it may not be efficient. To make the code efficient, the binary representation has to be truncated. But if we truncate the representation, is the resulting code still unique? Finally, is the resulting code efficient? How far or how close is the average number of bits per symbol from the entropy? We will examine all these questions in the next section. Even if we show the code to be unique and efficient, the method described to this point is highly impractical. In Section 4.4.2, we will describe a more practical algorithm for generating the arithmetic code for a sequence. We will give an integer implementation of this algorithm in Section 4.4.3. 4 . 4 . 1 U n i q u e n e s s a n d E f f i c i e n c y o f t h e A r i t h m e t i c C o d e T̄Xx is a number in the interval 01. A binary code for T̄Xx can be obtained by taking the binary representation of this number and truncating it to lx = log 1 Px +1 bits. E x a m p l e 4 . 4 . 1: Consider a source that generates letters from an alphabet of size four, = a1a2a3a4 with probabilities Pa1 = 1 2 Pa2 = 1 4 Pa3 = 1 8 Pa4 = 1 8 A binary code for this source can be generated as shown in Table 4.4. The quantity T̄x is obtained using Equation (4.3). The binary representation of T̄x is truncated to log 1 Px +1 bits to obtain the binary code. T A B L E 4 . 4 A binary code for a four-letter alphabet. Symbol FX T̄X In Binary log 1 Px +1 Code 1 .5 .25 .010 2 01 2 .75 .625 .101 3 101 3 .875 .8125 .1101 4 1101 4 1.0 .9375 .1111 4 1111 We will show that a code obtained in this fashion is a uniquely decodable code. We first show that this code is unique, and then we will show that it is uniquely decodable. Recall that while we have been using T̄Xx as the tag for a sequence x, any number in the interval FXx −1FXx would be a unique identifier. Therefore, to show that the code T̄Xxlx is unique, all we need to do is show that it is contained in the interval
  • 117. 94 4 A R I T H M E T I C C O D I N G FXx −1FXx. Because we are truncating the binary representation of T̄Xx to obtain T̄Xxlx, T̄Xxlx is less than or equal to T̄Xx. More specifically, 0 ≤ T̄Xx−T̄Xxlx 1 2lx (4.11) As T̄Xx is strictly less than FXx, T̄Xxlx FXx To show that T̄Xxlx ≥ FXx −1, note that 1 2lx = 1 2log 1 Px +1 1 2log 1 Px +1 = 1 2 1 Px = Px 2 From (4.3) we have Px 2 = T̄Xx−FXx −1 Therefore, T̄Xx−FXx −1 1 2lx (4.12) Combining (4.11) and (4.12), we have T̄Xxlx FXx −1 (4.13) Therefore, the code T̄Xxlx is a unique representation of T̄Xx. To show that this code is uniquely decodable, we will show that the code is a prefix code; that is, no codeword is a prefix of another codeword. Because a prefix code is always uniquely decodable, by showing that an arithmetic code is a prefix code, we automatically show that it is uniquely decodable. Given a number a in the interval 01 with an n-bit binary representation b1b2 bn , for any other number b to have a binary representation with b1b2 bn as the prefix, b has to lie in the interval aa+ 1 2n . (See Problem 1.) If x and y are two distinct sequences, we know that T̄Xxlx and T̄Xyly lie in two disjoint intervals, FXx−1FXx and FXy−1FXy. Therefore, if we can show that for any sequence x, the interval T̄XxlxT̄Xxlx + 1 2lx lies entirely within the interval FXx − 1FXx, this will mean that the code for one sequence cannot be the prefix for the code for another sequence.
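The construction behind Table 4.4 (compute the midpoint tag and keep its first $\lceil \log \frac{1}{P(x)} \rceil + 1$ binary digits) can be checked mechanically. The following sketch is my own illustration of that truncation (Python; the function name is hypothetical), not the book's program.

```python
from math import ceil, log2
from fractions import Fraction

def truncated_tag_code(probs):
    """Codewords of Example 4.4.1: tag = F(x-1) + P(x)/2, written in binary
    and truncated to ceil(log2(1/P(x))) + 1 bits."""
    codes = []
    cum = Fraction(0)
    for p in probs:
        tag = cum + p / 2
        nbits = ceil(log2(1 / p)) + 1
        # extract `nbits` binary digits of the fractional part of `tag`
        word, frac = "", tag
        for _ in range(nbits):
            frac *= 2
            word += "1" if frac >= 1 else "0"
            frac -= int(frac)
        codes.append(word)
        cum += p
    return codes

print(truncated_tag_code([Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]))
# ['01', '101', '1101', '1111'], the codewords of Table 4.4
```

Each truncated codeword still lies inside $[F_X(x-1), F_X(x))$, which is the containment the argument now completes.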
  • 118. 4.4 Generating a Binary Code 95 We have already shown that T̄Xxlx FXx − 1. Therefore, all we need to do is show that FXx−T̄Xxlx 1 2lx This is true because FXx−T̄Xxlx FXx− ¯ TXx = Px 2 1 2lx This code is prefix free, and by taking the binary representation of T̄Xx and truncating it to lx = log 1 Px +1 bits, we obtain a uniquely decodable code. Although the code is uniquely decodable, how efficient is it? We have shown that the number of bits lx required to represent FXx with enough accuracy such that the code for different values of x are distinct is lx = log 1 Px +1 Remember that lx is the number of bits required to encode the entire sequence x. So, the average length of an arithmetic code for a sequence of length m is given by lAm = Pxlx (4.14) = Px log 1 Px +1 (4.15) Px log 1 Px +1+1 (4.16) = − PxlogPx+2 Px (4.17) = HXm +2 (4.18) Given that the average length is always greater than the entropy, the bounds on lAm are HXm ≤ lAm HXm +2 The length per symbol, lA, or rate of the arithmetic code is lAm m . Therefore, the bounds on lA are HXm m ≤ lA HXm m + 2 m (4.19) We have shown in Chapter 3 that for iid sources HXm = mHX (4.20)
  • 119. 96 4 A R I T H M E T I C C O D I N G Therefore, HX ≤ lA HX+ 2 m (4.21) By increasing the length of the sequence, we can guarantee a rate as close to the entropy as we desire. 4 . 4 . 2 A l g o r i t h m I m p l e m e n t a t i o n In Section 4.3.1 we developed a recursive algorithm for the boundaries of the interval containing the tag for the sequence being encoded as ln = ln−1 +un−1 −ln−1 FXxn −1 (4.22) un = ln−1 +un−1 −ln−1 FXxn (4.23) where xn is the value of the random variable corresponding to the nth observed symbol, ln is the lower limit of the tag interval at the nth iteration, and un is the upper limit of the tag interval at the nth iteration. Before we can implement this algorithm, there is one major problem we have to resolve. Recall that the rationale for using numbers in the interval 01 as a tag was that there are an infinite number of numbers in this interval. However, in practice the number of numbers that can be uniquely represented on a machine is limited by the maximum number of digits (or bits) we can use for representing the number. Consider the values of ln and un in Example 4.3.5. As n gets larger, these values come closer and closer together. This means that in order to represent all the subintervals uniquely we need increasing precision as the length of the sequence increases. In a system with finite precision, the two values are bound to converge, and we will lose all information about the sequence from the point at which the two values converged. To avoid this situation, we need to rescale the interval. However, we have to do it in a way that will preserve the information that is being transmitted. We would also like to perform the encoding incrementally—that is, to transmit portions of the code as the sequence is being observed, rather than wait until the entire sequence has been observed before transmitting the first bit. The algorithm we describe in this section takes care of the problems of synchronized rescaling and incremental encoding. As the interval becomes narrower, we have three possibilities: 1. The interval is entirely confined to the lower half of the unit interval 005. 2. The interval is entirely confined to the upper half of the unit interval 0510. 3. The interval straddles the midpoint of the unit interval. We will look at the third case a little later in this section. First, let us examine the first two cases. Once the interval is confined to either the upper or lower half of the unit interval, it is forever confined to that half of the unit interval. The most significant bit of the binary representation of all numbers in the interval 005 is 0, and the most significant bit of the binary representation of all numbers in the interval 051 is 1. Therefore, once the interval gets restricted to either the upper or lower half of the unit interval, the most significant bit of
  • 120. 4.4 Generating a Binary Code 97 the tag is fully determined. Therefore, without waiting to see what the rest of the sequence looks like, we can indicate to the decoder whether the tag is confined to the upper or lower half of the unit interval by sending a 1 for the upper half and a 0 for the lower half. The bit that we send is also the first bit of the tag. Once the encoder and decoder know which half contains the tag, we can ignore the half of the unit interval not containing the tag and concentrate on the half containing the tag. As our arithmetic is of finite precision, we can do this best by mapping the half interval containing the tag to the full 01 interval. The mappings required are E1 005 → 01 E1x = 2x (4.24) E2 051 → 01 E2x = 2x −05 (4.25) As soon as we perform either of these mappings, we lose all information about the most significant bit. However, this should not matter because we have already sent that bit to the decoder. We can now continue with this process, generating another bit of the tag every time the tag interval is restricted to either half of the unit interval. This process of generating the bits of the tag without waiting to see the entire sequence is called incremental encoding. E x a m p l e 4 . 4 . 2: T a g g e n e r a t i o n w i t h s c a l i n g Let’s revisit Example 4.3.5. Recall that we wish to encode the sequence 1 3 2 1. The probability model for the source is Pa1 = 08, Pa2 = 002, Pa3 = 018. Initializing u0 to 1, and l0 to 0, the first element of the sequence, 1, results in the following update: l1 = 0+1−00 = 0 u1 = 0+1−008 = 08 The interval 008 is not confined to either the upper or lower half of the unit interval, so we proceed. The second element of the sequence is 3. This results in the update l2 = 0+08−0FX2 = 08×082 = 0656 u2 = 0+08−0FX3 = 08×10 = 08 The interval 065608 is contained entirely in the upper half of the unit interval, so we send the binary code 1 and rescale: l2 = 2×0656−05 = 0312 u2 = 2×08−05 = 06 The third element, 2, results in the following update equations: l3 = 0312+06−0312FX1 = 0312+0288×08 = 05424 u3 = 0312+08−0312FX2 = 0312+0288×082 = 054816
  • 121. 98 4 A R I T H M E T I C C O D I N G The interval for the tag is 05424054816, which is contained entirely in the upper half of the unit interval. We transmit a 1 and go through another rescaling: l3 = 2×05424−05 = 00848 u3 = 2×054816−05 = 009632 This interval is contained entirely in the lower half of the unit interval, so we send a 0 and use the E1 mapping to rescale: l3 = 2×00848 = 01696 u3 = 2×009632 = 019264 The interval is still contained entirely in the lower half of the unit interval, so we send another 0 and go through another rescaling: l3 = 2×01696 = 03392 u3 = 2×019264 = 038528 Because the interval containing the tag remains in the lower half of the unit interval, we send another 0 and rescale one more time: l3 = 2×03392 = 06784 u3 = 2×038528 = 077056 Now the interval containing the tag is contained entirely in the upper half of the unit interval. Therefore, we transmit a 1 and rescale using the E2 mapping: l3 = 2×06784−05 = 03568 u3 = 2×077056−05 = 054112 At each stage we are transmitting the most significant bit that is the same in both the upper and lower limit of the tag interval. If the most significant bits in the upper and lower limit are the same, then the value of this bit will be identical to the most significant bit of the tag. Therefore, by sending the most significant bits of the upper and lower endpoint of the tag whenever they are identical, we are actually sending the binary representation of the tag. The rescaling operations can be viewed as left shifts, which make the second most significant bit the most significant bit. Continuing with the last element, the upper and lower limits of the interval containing the tag are l4 = 03568+054112−03568FX0 = 03568+018422×00 = 03568 u4 = 03568+054112−03568FX1 = 03568+018422×08 = 0504256
  • 122. 4.4 Generating a Binary Code 99 At this point, if we wished to stop encoding, all we need to do is inform the receiver of the final status of the tag value. We can do so by sending the binary representation of any value in the final tag interval. Generally, this value is taken to be ln . In this particular example, it is convenient to use the value of 05. The binary representation of 05 is 10 . Thus, we would transmit a 1 followed by as many 0s as required by the word length of the implementation being used. Notice that the tag interval size at this stage is approximately 64 times the size it was when we were using the unmodified algorithm. Therefore, this technique solves the finite precision problem. As we shall soon see, the bits that we have been sending with each mapping constitute the tag itself, which satisfies our desire for incremental encoding. The binary sequence generated during the encoding process in the previous example is 1100011. We could simply treat this as the binary expansion of the tag. A binary number 1100011 corresponds to the decimal number 07734375. Looking back to Example 4.3.5, notice that this number lies within the final tag interval. Therefore, we could use this to decode the sequence. However, we would like to do incremental decoding as well as incremental encoding. This raises three questions: 1. How do we start decoding? 2. How do we continue decoding? 3. How do we stop decoding? The second question is the easiest to answer. Once we have started decoding, all we have to do is mimic the encoder algorithm. That is, once we have started decoding, we know how to continue decoding. To begin the decoding process, we need to have enough information to decode the first symbol unambiguously. In order to guarantee unambiguous decoding, the number of bits received should point to an interval smaller than the smallest tag interval. Based on the smallest tag interval, we can determine how many bits we need before we start the decoding procedure. We will demonstrate this procedure in Example 4.4.4. First let’s look at other aspects of decoding using the message from Example 4.4.2. E x a m p l e 4 . 4 . 3: We will use a word length of 6 for this example. Note that because we are dealing with real numbers this word length may not be sufficient for a different sequence. As in the encoder, we start with initializing u0 to 1 and l0 to 0. The sequence of received bits is 110001100 0. The first 6 bits correspond to a tag value of 0765625, which means that the first element of the sequence is 1, resulting in the following update: l1 = 0+1−00 = 0 u1 = 0+1−008 = 08
  • 123. 100 4 A R I T H M E T I C C O D I N G The interval 008 is not confined to either the upper or lower half of the unit interval, so we proceed. The tag 0765625 lies in the top 18% of the interval 008; therefore, the second element of the sequence is 3. Updating the tag interval we get l2 = 0+08−0FX2 = 08×082 = 0656 u2 = 0+08−0FX3 = 08×10 = 08 The interval 065608 is contained entirely in the upper half of the unit interval. At the encoder, we sent the bit 1 and rescaled. At the decoder, we will shift 1 out of the receive buffer and move the next bit in to make up the 6 bits in the tag. We will also update the tag interval, resulting in l2 = 2×0656−05 = 0312 u2 = 2×08−05 = 06 while shifting a bit to give us a tag of 0546875. When we compare this value with the tag interval, we can see that this value lies in the 80–82% range of the tag interval, so we decode the next element of the sequence as 2. We can then update the equations for the tag interval as l3 = 0312+06−0312FX1 = 0312+0288×08 = 05424 u3 = 0312+08−0312FX2 = 0312+0288×082 = 054816 As the tag interval is now contained entirely in the upper half of the unit interval, we rescale using E2 to obtain l3 = 2×05424−05 = 00848 u3 = 2×054816−05 = 009632 We also shift out a bit from the tag and shift in the next bit. The tag is now 000110. The interval is contained entirely in the lower half of the unit interval. Therefore, we apply E1 and shift another bit. The lower and upper limits of the tag interval become l3 = 2×00848 = 01696 u3 = 2×009632 = 019264 and the tag becomes 001100. The interval is still contained entirely in the lower half of the unit interval, so we shift out another 0 to get a tag of 011000 and go through another rescaling: l3 = 2×01696 = 03392 u3 = 2×019264 = 038528
  • 124. 4.4 Generating a Binary Code 101 Because the interval containing the tag remains in the lower half of the unit interval, we shift out another 0 from the tag to get 110000 and rescale one more time: l3 = 2×03392 = 06784 u3 = 2×038528 = 077056 Now the interval containing the tag is contained entirely in the upper half of the unit interval. Therefore, we shift out a 1 from the tag and rescale using the E2 mapping: l3 = 2×06784−05 = 03568 u3 = 2×077056−05 = 054112 Now we compare the tag value to the the tag interval to decode our final element. The tag is 100000, which corresponds to 05. This value lies in the first 80% of the interval, so we decode this element as 1. If the tag interval is entirely contained in the upper or lower half of the unit interval, the scaling procedure described will prevent the interval from continually shrinking. Now we consider the case where the diminishing tag interval straddles the midpoint of the unit interval. As our trigger for rescaling, we check to see if the tag interval is contained in the interval 025075. This will happen when ln is greater than 025 and un is less than 075. When this happens, we double the tag interval using the following mapping: E3 025075 → 01 E3x = 2x −025 (4.26) We have used a 1 to transmit information about an E2 mapping, and a 0 to transmit information about an E1 mapping. How do we transfer information about an E3 mapping to the decoder? We use a somewhat different strategy in this case. At the time of the E3 mapping, we do not send any information to the decoder; instead, we simply record the fact that we have used the E3 mapping at the encoder. Suppose that after this, the tag interval gets confined to the upper half of the unit interval. At this point we would use an E2 mapping and send a 1 to the receiver. Note that the tag interval at this stage is at least twice what it would have been if we had not used the E3 mapping. Furthermore, the upper limit of the tag interval would have been less than 075. Therefore, if the E3 mapping had not taken place right before the E2 mapping, the tag interval would have been contained entirely in the lower half of the unit interval. At this point we would have used an E1 mapping and transmitted a 0 to the receiver. In fact, the effect of the earlier E3 mapping can be mimicked at the decoder by following the E2 mapping with an E1 mapping. At the encoder, right after we send a 1 to announce the E2 mapping, we send a 0 to help the decoder track the changes in the tag interval at the decoder. If the first rescaling after the E3 mapping happens to be an E1 mapping, we do exactly the opposite. That is, we follow the 0 announcing an E1 mapping with a 1 to mimic the effect of the E3 mapping at the encoder. What happens if we have to go through a series of E3 mappings at the encoder? We simply keep track of the number of E3 mappings and then send that many bits of the opposite variety after the first E1 or E2 mapping. If we went through three E3 mappings at the encoder,
  • 125. 102 4 A R I T H M E T I C C O D I N G followed by an E2 mapping, we would transmit a 1 followed by three 0s. On the other hand, if we went through an E1 mapping after the E3 mappings, we would transmit a 0 followed by three 1s. Since the decoder mimics the encoder, the E3 mappings are also applied at the decoder when the tag interval is contained in the interval 025075. 4 . 4 . 3 I n t e g e r I m p l e m e n t a t i o n We have described a floating-point implementation of arithmetic coding. Let us now repeat the procedure using integer arithmetic and generate the binary code in the process. E n c o d e r I m p l e m e n t a t i o n The first thing we have to do is decide on the word length to be used. Given a word length of m, we map the important values in the 01 interval to the range of 2m binary words. The point 0 gets mapped to m times 00 0 1 gets mapped to m times 11 1 The value of 05 gets mapped to 1 m−1 times 00 0 The update equations remain almost the same as Equations (4.9) and (4.10). As we are going to do integer arithmetic, we need to replace FXx in these equations. Define nj as the number of times the symbol j occurs in a sequence of length Total Count. Then FXk can be estimated by FXk = k i=1 ni Total Count (4.27) If we now define Cum_Countk = k i=1 ni we can write Equations (4.9) and (4.10) as ln = ln−1 + un−1 −ln−1 +1×Cum_Countxn −1 Total Count (4.28) un = ln−1 + un−1 −ln−1 +1×Cum_Countxn Total Count −1 (4.29)
  • 126. 4.4 Generating a Binary Code 103 where xn is the nth symbol to be encoded, x is the largest integer less than or equal to x, and where the addition and subtraction of one is to handle the effects of the integer arithmetic. Because of the way we mapped the endpoints and the halfway points of the unit interval, when both ln and un are in either the upper half or lower half of the interval, the leading bit of un and ln will be the same. If the leading or most significant bit (MSB) is 1, then the tag interval is contained entirely in the upper half of the 00 011 1 interval. If the MSB is 0, then the tag interval is contained entirely in the lower half. Applying the E1 and E2 mappings is a simple matter. All we do is shift out the MSB and then shift in a 1 into the integer code for un and a 0 into the code for ln . For example, suppose m was 6, un was 54, and ln was 33. The binary representations of un and ln are 110110 and 100001, respectively. Notice that the MSB for both endpoints is 1. Following the procedure above, we would shift out (and transmit or store) the 1, and shift in 1 for un and 0 for ln , obtaining the new value for un as 101101, or 45, and a new value for ln as 000010, or 2. This is equivalent to performing the E2 mapping. We can see how the E1 mapping would also be performed using the same operation. To see if the E3 mapping needs to be performed, we monitor the second most significant bit of un and ln . When the second most significant bit of un is 0 and the second most significant bit of ln is 1, this means that the tag interval lies in the middle half of the 00 011 1 interval. To implement the E3 mapping, we complement the second most significant bit in un and ln , and shift left, shifting in a 1 in un and a 0 in ln . We also keep track of the number of E3 mappings in Scale3. We can summarize the encoding algorithm using the following pseudocode: Initialize l and u. Get symbol. l ←− l+ u−l+1×Cum_Countx −1 TotalCount u ←− l+ u−l+1×Cum_Countx TotalCount −1 while(MSB of u and l are both equal to b or E3 condition holds) if(MSB of u and l are both equal to b) { send b shift l to the left by 1 bit and shift 0 into LSB shift u to the left by 1 bit and shift 1 into LSB while(Scale3 0) { send complement of b decrement Scale3 } }
  • 127. 104 4 A R I T H M E T I C C O D I N G if(E3 condition holds) { shift l to the left by 1 bit and shift 0 into LSB shift u to the left by 1 bit and shift 1 into LSB complement (new) MSB of l and u increment Scale3 } To see how all this functions together, let’s look at an example. E x a m p l e 4 . 4 . 4: We will encode the sequence 1 3 2 1 with parameters shown in Table 4.5. First we need to select the word length m. Note that Cum_Count1 and Cum_Count2 differ by only 1. Recall that the values of Cum_Count will get translated to the endpoints of the subintervals. We want to make sure that the value we select for the word length will allow enough range for it to be possible to represent the smallest difference between the endpoints of intervals. We always rescale whenever the interval gets small. In order to make sure that the endpoints of the intervals always remain distinct, we need to make sure that all values in the range from 0 to Total_Count, which is the same as Cum_Count3, are uniquely represented in the smallest range an interval under consideration can be without triggering a rescaling. The interval is smallest without triggering a rescaling when ln is just below the midpoint of the interval and un is at three-quarters of the interval, or when un is right at the midpoint of the interval and ln is just below a quarter of the interval. That is, the smallest the interval ln un can be is one-quarter of the total available range of 2m values. Thus, m should be large enough to accommodate uniquely the set of values between 0 and Total_Count. T A B L E 4 . 5 Values of some of the parameters for arithmetic coding example. Count1 = 40 Cum_Count0 = 0 Scale3 = 0 Count2 = 1 Cum_Count1 = 40 Count3 = 9 Cum_Count2 = 41 Total_Count = 50 Cum_Count3 = 50 For this example, this means that the total interval range has to be greater than 200. A value of m = 8 satisfies this requirement. With this value of m we have l0 = 0 = 000000002 (4.30) u0 = 255 = 111111112 (4.31) where ···2 is the binary representation of a number.
  • 128. 4.4 Generating a Binary Code 105 The first element of the sequence to be encoded is 1. Using Equations (4.28) and (4.29), l1 = 0+ 256×Cum_Count0 50 = 0 = 000000002 (4.32) u1 = 0+ 256×Cum_Count1 50 −1 = 203 = 110010112 (4.33) The next element of the sequence is 3. l2 = 0+ 204×Cum_Count2 50 = 167 = 101001112 (4.34) u2 = 0+ 204×Cum_Count3 50 −1 = 203 = 110010112 (4.35) The MSBs of l2 and u2 are both 1. Therefore, we shift this value out and send it to the decoder. All other bits are shifted left by 1 bit, giving l2 = 010011102 = 78 (4.36) u2 = 100101112 = 151 (4.37) Notice that while the MSBs of the limits are different, the second MSB of the upper limit is 0, while the second MSB of the lower limit is 1. This is the condition for the E3 mapping. We complement the second MSB of both limits and shift 1 bit to the left, shifting in a 0 as the LSB of l2 and a 1 as the LSB of u2 . This gives us l2 = 000111002 = 28 (4.38) u2 = 101011112 = 175 (4.39) We also increment Scale3 to a value of 1. The next element in the sequence is 2. Updating the limits, we have l3 = 28+ 148×Cum_Count1 50 = 146 = 100100102 (4.40) u3 = 28+ 148×Cum_Count2 50 −1 = 148 = 100101002 (4.41) The two MSBs are identical, so we shift out a 1 and shift left by 1 bit: l3 = 001001002 = 36 (4.42) u3 = 001010012 = 41 (4.43) As Scale3 is 1, we transmit a 0 and decrement Scale3 to 0. The MSBs of the upper and lower limits are both 0, so we shift out and transmit 0: l3 = 010010002 = 72 (4.44) u3 = 010100112 = 83 (4.45)
  • 129. 106 4 A R I T H M E T I C C O D I N G Both MSBs are again 0, so we shift out and transmit 0: l3 = 100100002 = 144 (4.46) u3 = 101001112 = 167 (4.47) Now both MSBs are 1, so we shift out and transmit a 1. The limits become l3 = 001000002 = 32 (4.48) u3 = 010011112 = 79 (4.49) Once again the MSBs are the same. This time we shift out and transmit a 0. l3 = 010000002 = 64 (4.50) u3 = 100111112 = 159 (4.51) Now the MSBs are different. However, the second MSB for the lower limit is 1 while the second MSB for the upper limit is 0. This is the condition for the E3 mapping. Applying the E3 mapping by complementing the second MSB and shifting 1 bit to the left, we get l3 = 000000002 = 0 (4.52) u3 = 101111112 = 191 (4.53) We also increment Scale3 to 1. The next element in the sequence to be encoded is 1. Therefore, l4 = 0+ 192×Cum_Count0 50 = 0 = 000000002 (4.54) u4 = 0+ 192×Cum_Count1 50 −1 = 152 = 100110002 (4.55) The encoding continues in this fashion. To this point we have generated the binary sequence 1100010. If we wished to terminate the encoding at this point, we have to send the current status of the tag. This can be done by sending the value of the lower limit l4 . As l4 is 0, we will end up sending eight 0s. However, Scale3 at this point is 1. Therefore, after we send the first 0 from the value of l4 , we need to send a 1 before sending the remaining seven 0s. The final transmitted sequence is 1100010010000000. D e c o d e r I m p l e m e n t a t i o n Once we have the encoder implementation, the decoder implementation is easy to describe. As mentioned earlier, once we have started decoding all we have to do is mimic the encoder algorithm. Let us first describe the decoder algorithm using pseudocode and then study its implementation using Example 4.4.5.
  • 130. 4.4 Generating a Binary Code 107 D e c o d e r A l g o r i t h m Initialize l and u. Read the first m bits of the received bitstream into tag t. k = 0 while t −l+1×Total Count−1 u−l+1 ≥ Cum_Countk k ←− k+1 decode symbol x. l ←− l+ u−l+1×Cum_Countx −1 Total Count u ←− l+ u−l+1×Cum_Countx Total Count −1 while(MSB of u and l are both equal to b or E3 condition holds) if(MSB of u and l are both equal to b) { shift l to the left by 1 bit and shift 0 into LSB shift u to the left by 1 bit and shift 1 into LSB shift t to the left by 1 bit and read next bit from received bitstream into LSB } if(E3 condition holds) { shift l to the left by 1 bit and shift 0 into LSB shift u to the left by 1 bit and shift 1 into LSB shift t to the left by 1 bit and read next bit from received bitstream into LSB complement (new) MSB of l, u, and t } E x a m p l e 4 . 4 . 5: After encoding the sequence in Example 4.4.4, we ended up with the following binary sequence: 1100010010000000. Treating this as the received sequence and using the param- eters from Table 4.5, let us decode this sequence. Using the same word length, eight, we read in the first 8 bits of the received sequence to form the tag t: t = 110001002 = 196 We initialize the lower and upper limits as l = 000000002 = 0 u = 111111112 = 255
  • 131. 108 4 A R I T H M E T I C C O D I N G To begin decoding, we compute t −l+1×Total Count−1 u−l+1 = 197×50−1 255−0+1 = 38 and compare this value to Cum_Count = ⎡ ⎢ ⎢ ⎣ 0 40 41 50 ⎤ ⎥ ⎥ ⎦ Since 0 ≤ 38 40 we decode the first symbol as 1. Once we have decoded a symbol, we update the lower and upper limits: l = 0+ 256×Cum_Count0 Total Count = 0+ 256× 0 50 = 0 u = 0+ 256×Cum_Count1 Total Count −1 = 0+ 256× 40 50 −1 = 203 or l = 000000002 u = 110010112 The MSB of the limits are different and the E3 condition does not hold. Therefore, we continue decoding without modifying the tag value. To obtain the next symbol, we compare t −l+1×Total Count−1 u−l+1 which is 48, against the Cum_Count array: Cum_Count2 ≤ 48 Cum_Count3 Therefore, we decode 3 and update the limits: l = 0+ 204×Cum_Count2 Total Count = 0+ 204× 41 50 = 167 = 10100112 u = 0+ 204×Cum_Count3 Total Count −1 = 0+ 204× 50 50 −1 = 203 = 110010112
  • 132. 4.5 Comparison of Huffman and Arithmetic Coding 109 As the MSB of u and l are the same, we shift the MSB out and read in a 0 for the LSB of l and a 1 for the LSB of u. We mimic this action for the tag as well, shifting the MSB out and reading in the next bit from the received bitstream as the LSB: l = 010011102 u = 100101112 t = 100010012 Examining l and u we can see we have an E3 condition. Therefore, for l, u, and t, we shift the MSB out, complement the new MSB, and read in a 0 as the LSB of l, a 1 as the LSB of u, and the next bit in the received bitstream as the LSB of t. We now have l = 000111002 = 28 u = 101011112 = 175 t = 100100102 = 146 To decode the next symbol, we compute t −l+1×Total Count−1 u−l+1 = 40 Since 40 ≤ 40 41, we decode 2. Updating the limits using this decoded symbol, we get l = 28+ 175−28+1×40 50 = 146 = 100100102 u = 28+ 175−28+1×41 50 −1 = 148 = 100101002 We can see that we have quite a few bits to shift out. However, notice that the lower limit l has the same value as the tag t. Furthermore, the remaining received sequence consists entirely of 0s. Therefore, we will be performing identical operations on numbers that are the same, resulting in identical numbers. This will result in the final decoded symbol being 1. We knew this was the final symbol to be decoded because only four symbols had been encoded. In practice this information has to be conveyed to the decoder. 4 . 5 Comparison of Huffman and Arithmetic Coding We have described a new coding scheme that, although more complicated than Huffman coding, allows us to code sequences of symbols. How well this coding scheme works depends on how it is used. Let’s first try to use this code for encoding sources for which we know the Huffman code.
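Before working through the rate comparison, it may help to see the integer encoder and decoder pseudocode of Section 4.4.3 collected into one runnable sketch. What follows is my own transcription (Python; the names encode, decode, and next_bit are mine), written for the counts of Table 4.5 and checked against Examples 4.4.4 and 4.4.5; it is not the Witten, Neal, and Cleary program cited at the end of the chapter.

```python
# Integer arithmetic coder sketch. cum_count[k] plays the role of Cum_Count(k),
# with cum_count[0] = 0 and cum_count[-1] = Total_Count.

def encode(seq, cum_count, m):
    total, mask, msb, smsb = cum_count[-1], (1 << m) - 1, 1 << (m - 1), 1 << (m - 2)
    low, high, scale3, out = 0, mask, 0, []
    for x in seq:
        width = high - low + 1
        high = low + (width * cum_count[x]) // total - 1      # Eq. (4.29)
        low = low + (width * cum_count[x - 1]) // total       # Eq. (4.28)
        while True:
            if (low & msb) == (high & msb):                   # E1 or E2: MSBs agree
                bit = (low & msb) >> (m - 1)
                out.append(bit)
                low = (low << 1) & mask
                high = ((high << 1) & mask) | 1
                out.extend([1 - bit] * scale3)                # flush pending E3 bits
                scale3 = 0
            elif (low & smsb) and not (high & smsb):          # E3: middle half
                low = ((low << 1) & mask) ^ msb
                high = (((high << 1) & mask) | 1) ^ msb
                scale3 += 1
            else:
                break
    # Termination: transmit low, inserting the pending E3 bits after its MSB.
    tail = [(low >> i) & 1 for i in range(m - 1, -1, -1)]
    out += tail[:1] + [1 - tail[0]] * scale3 + tail[1:]
    return out

def decode(bits, nsym, cum_count, m):
    total, mask, msb, smsb = cum_count[-1], (1 << m) - 1, 1 << (m - 1), 1 << (m - 2)
    pos = m
    tag = int("".join(map(str, bits[:m])), 2)
    low, high, out = 0, mask, []

    def next_bit():
        nonlocal pos
        pos += 1
        return bits[pos - 1] if pos - 1 < len(bits) else 0    # pad with zeros

    for _ in range(nsym):
        value = ((tag - low + 1) * total - 1) // (high - low + 1)
        x = next(k for k in range(1, len(cum_count)) if value < cum_count[k])
        out.append(x)
        width = high - low + 1
        high = low + (width * cum_count[x]) // total - 1
        low = low + (width * cum_count[x - 1]) // total
        while True:
            if (low & msb) == (high & msb):
                low = (low << 1) & mask
                high = ((high << 1) & mask) | 1
                tag = ((tag << 1) & mask) | next_bit()
            elif (low & smsb) and not (high & smsb):
                low = ((low << 1) & mask) ^ msb
                high = (((high << 1) & mask) | 1) ^ msb
                tag = (((tag << 1) & mask) | next_bit()) ^ msb
            else:
                break
    return out

cum = [0, 40, 41, 50]                    # counts 40, 1, 9 of Table 4.5
code = encode([1, 3, 2, 1], cum, m=8)
print("".join(map(str, code)))           # 1100010010000000, as in Example 4.4.4
print(decode(code, 4, cum, m=8))         # [1, 3, 2, 1]
```

The bit stream and the decoded sequence match the worked examples; as noted above, in practice the number of symbols (or an end-of-transmission symbol) still has to be conveyed to the decoder.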
  • 133. 110 4 A R I T H M E T I C C O D I N G Looking at Example 4.4.1, the average length for this code is l = 2×05+3×025+4×0125+4×0125 (4.56) = 275 bits/symbol (4.57) Recall from Section 2.4 that the entropy of this source was 1.75 bits/symbol and the Huffman code achieved this entropy. Obviously, arithmetic coding is not a good idea if you are going to encode your message one symbol at a time. Let’s repeat the example with messages consisting of two symbols. (Note that we are only doing this to demonstrate a point. In practice, we would not code sequences this short using an arithmetic code.) E x a m p l e 4 . 5 . 1: If we encode two symbols at a time, the resulting code is shown in Table 4.6. T A B L E 4 . 6 Arithmetic code for two-symbol sequences. Message Px T̄Xx T̄Xx in Binary log 1 Px +1 Code 11 25 125 001 3 001 12 125 3125 0101 4 0101 13 0625 40625 01101 5 01101 14 0625 46875 01111 5 01111 21 125 5625 1001 4 1001 22 0625 65625 10101 5 10101 23 03125 703125 101101 6 101101 24 03125 734375 101111 6 101111 31 0625 78125 11001 5 11001 32 03125 828125 110101 6 110101 33 015625 8515625 1101101 7 1101101 34 015625 8671875 1101111 7 1101111 41 0625 90625 11101 5 11101 42 03125 953125 111101 6 111101 43 015625 9765625 1111101 7 1111101 44 015625 984375 1111111 7 1111111 The average length per message is 4.5 bits. Therefore, using two symbols at a time we get a rate of 2.25 bits/symbol (certainly better than 2.75 bits/symbol, but still not as good as the best rate of 1.75 bits/symbol). However, we see that as we increase the number of symbols per message, our results get better and better. How many samples do we have to group together to make the arithmetic coding scheme perform better than the Huffman coding scheme? We can get some idea by looking at the bounds on the coding rate.
  • 134. 4.5 Comparison of Huffman and Arithmetic Coding 111 Recall that the bounds on the average length lA of the arithmetic code are HX ≤ lA ≤ HX+ 2 m It does not take many symbols in a sequence before the coding rate for the arithmetic code becomes quite close to the entropy. However, recall that for Huffman codes, if we block m symbols together, the coding rate is HX ≤ lH ≤ HX+ 1 m The advantage seems to lie with the Huffman code, although the advantage decreases with increasing m. However, remember that to generate a codeword for a sequence of length m, using the Huffman procedure requires building the entire code for all possible sequences of length m. If the original alphabet size was k, then the size of the codebook would be km . Taking relatively reasonable values of k = 16 and m = 20 gives a codebook size of 1620 ! This is obviously not a viable option. For the arithmetic coding procedure, we do not need to build the entire codebook. Instead, we simply obtain the code for the tag corresponding to a given sequence. Therefore, it is entirely feasible to code sequences of length 20 or much more. In practice, we can make m large for the arithmetic coder and not for the Huffman coder. This means that for most sources we can get rates closer to the entropy using arithmetic coding than by using Huffman coding. The exceptions are sources whose probabilities are powers of two. In these cases, the single-letter Huffman code achieves the entropy, and we cannot do any better with arithmetic coding, no matter how long a sequence we pick. The amount of gain also depends on the source. Recall that for Huffman codes we are guaranteed to obtain rates within 0086+pmax of the entropy, where pmax is the probability of the most probable letter in the alphabet. If the alphabet size is relatively large and the probabilities are not too skewed, the maximum probability pmax is generally small. In these cases, the advantage of arithmetic coding over Huffman coding is small, and it might not be worth the extra complexity to use arithmetic coding rather than Huffman coding. However, there are many sources, such as facsimile, in which the alphabet size is small, and the probabilities are highly unbalanced. In these cases, the use of arithmetic coding is generally worth the added complexity. Another major advantage of arithmetic coding is that it is easy to implement a system with multiple arithmetic codes. This may seem contradictory, as we have claimed that arithmetic coding is more complex than Huffman coding. However, it is the computational machinery that causes the increase in complexity. Once we have the computational machinery to implement one arithmetic code, all we need to implement more than a single arithmetic code is the availability of more probability tables. If the alphabet size of the source is small, as in the case of a binary source, there is very little added complexity indeed. In fact, as we shall see in the next section, it is possible to develop multiplication-free arithmetic coders that are quite simple to implement (nonbinary multiplication-free arithmetic coders are described in [44]). Finally, it is much easier to adapt arithmetic codes to changing input statistics. All we need to do is estimate the probabilities of the input alphabet. This can be done by keeping a count of the letters as they are coded. There is no need to preserve a tree, as with adaptive Huffman codes. Furthermore, there is no need to generate a code a priori, as in the case of
  • 135. 112 4 A R I T H M E T I C C O D I N G Huffman coding. This property allows us to separate the modeling and coding procedures in a manner that is not very feasible with Huffman coding. This separation permits greater flexibility in the design of compression systems, which can be used to great advantage. 4 . 6 Adaptive Arithmetic Coding We have seen how to construct arithmetic coders when the distribution of the source, in the form of cumulative counts, is available. In many applications such counts are not available a priori. It is a relatively simple task to modify the algorithms discussed so that the coder learns the distribution as the coding progresses. A straightforward implementation is to start out with a count of 1 for each letter in the alphabet. We need a count of at least 1 for each symbol, because if we do not we will have no way of encoding the symbol when it is first encountered. This assumes that we know nothing about the distribution of the source. If we do know something about the distribution of the source, we can let the initial counts reflect our knowledge. After coding is initiated, the count for each letter encountered is incremented after that letter has been encoded. The cumulative count table is updated accordingly. It is very important that the updating take place after the encoding; otherwise the decoder will not be using the same cumulative count table as the encoder to perform the decoding. At the decoder, the count and cumulative count tables are updated after each letter is decoded. In the case of the static arithmetic code, we picked the size of the word based on Total Count, the total number of symbols to be encoded. In the adaptive case, we may not know ahead of time what the total number of symbols is going to be. In this case we have to pick the word length independent of the total count. However, given a word length m we know that we can only accomodate a total count of 2m−2 or less. Therefore, during the encoding and decoding processes when the total count approaches 2m−2 , we have to go through a rescaling, or renormalization, operation. A simple rescaling operation is to divide all counts by 2 and rounding up the result so that no count gets rescaled to zero. This periodic rescaling can have an added benefit in that the count table better reflects the local statisitcs of the source. 4 . 7 Applications Arithmetic coding is used in a variety of lossless and lossy compression applications. It is a part of many international standards. In the area of multimedia there are a few principal organizations that develop standards. The International Standards Organization (ISO) and the International Electrotechnical Commission (IEC) are industry groups that work on multimedia standards, while the International Telecommunications Union (ITU), which is part of the United Nations, works on multimedia standards on behalf of the member states of the United Nations. Quite often these institutions work together to create international standards. In later chapters we will be looking at a number of these standards, and we will see how arithmetic coding is used in image compression, audio compression, and video compression standards. For now let us look at the lossless compression example from the previous chapter.
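Before looking at those results, here is a minimal sketch of the count bookkeeping described in Section 4.6 (Python; the class name, the default word length, and the exact rescaling trigger are my own choices for illustration): counts start at 1, are updated only after a symbol has been coded, and are halved, rounding up, when the total approaches the $2^{m-2}$ limit.

```python
class AdaptiveCounts:
    """Symbol counts for adaptive arithmetic coding. Counts start at 1 so
    every symbol is encodable on first sight; they are updated only *after*
    a symbol has been coded, so encoder and decoder stay synchronized, and
    they are halved (rounding up) when the total nears the 2**(m-2) limit."""

    def __init__(self, alphabet_size, m=16):
        self.counts = [1] * alphabet_size
        self.limit = 1 << (m - 2)

    def cum_counts(self):
        # Cum_Count table: cum[0] = 0, cum[k] = n_1 + ... + n_k
        cum = [0]
        for c in self.counts:
            cum.append(cum[-1] + c)
        return cum

    def update(self, symbol):
        # Call after `symbol` (0-based index) has been encoded or decoded.
        self.counts[symbol] += 1
        if sum(self.counts) >= self.limit:
            self.counts = [(c + 1) // 2 for c in self.counts]   # never rescales to 0

model = AdaptiveCounts(alphabet_size=3, m=8)
print(model.cum_counts())   # [0, 1, 2, 3] before anything has been seen
model.update(0)
print(model.cum_counts())   # [0, 2, 3, 4]
```

The compression results for the test images follow.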
TABLE 4.7  Compression using adaptive arithmetic coding of pixel values.

    Image Name   Bits/Pixel   Total Size (bytes)   Compression Ratio (arithmetic)   Compression Ratio (Huffman)
    Sena         6.52         53,431               1.23                             1.16
    Sensin       7.12         58,306               1.12                             1.27
    Earth        4.67         38,248               1.71                             1.67
    Omaha        6.84         56,061               1.17                             1.14

TABLE 4.8  Compression using adaptive arithmetic coding of pixel differences.

    Image Name   Bits/Pixel   Total Size (bytes)   Compression Ratio (arithmetic)   Compression Ratio (Huffman)
    Sena         3.89         31,847               2.06                             2.08
    Sensin       4.56         37,387               1.75                             1.73
    Earth        3.92         32,137               2.04                             2.04
    Omaha        6.27         51,393               1.28                             1.26

In Tables 4.7 and 4.8, we show the results of using adaptive arithmetic coding to encode the same test images that were previously encoded using Huffman coding. We have included the compression ratios obtained using the Huffman code from the previous chapter for comparison. Comparing these values to those obtained in the previous chapter, we can see very little change. The reason is that, because the alphabet size for the images is quite large, the value of $p_{max}$ is quite small, and the Huffman coder performs very close to the entropy.

As we mentioned before, a major advantage of arithmetic coding over Huffman coding is the ability to separate the modeling and coding aspects of the compression approach. In terms of image coding, this allows us to use a number of different models that take advantage of local properties. For example, we could use different decorrelation strategies in regions of the image that are quasi-constant and will, therefore, have small differences, and in regions where there is a lot of activity, which produces larger difference values.

4.8 Summary

In this chapter we introduced the basic ideas behind arithmetic coding. We have shown that the arithmetic code is a uniquely decodable code that provides a rate close to the entropy for long stationary sequences. This ability to encode sequences directly, instead of as a concatenation of the codes for the elements of the sequence, makes this approach more efficient than Huffman coding for alphabets with highly skewed probabilities. We have also looked in some detail at the implementation of the arithmetic coding approach. The arithmetic coding results in this chapter were obtained using the program provided by Witten, Neal, and Cleary [45]. This code can be used (with some modifications) for exploring different aspects of arithmetic coding (see problems).
  • 137. 114 4 A R I T H M E T I C C O D I N G F u r t h e r R e a d i n g 1. The book Text Compression, by T.C. Bell, J.G. Cleary, and I.H. Witten [1], contains a very readable section on arithmetic coding, complete with pseudocode and C code. 2. A thorough treatment of various aspects of arithmetic coding can be found in the excellent chapter Arithmetic Coding, by Amir Said [46] in the Lossless Compression Handbook. 3. There is an excellent tutorial article by G.G. Langdon, Jr. [47] in the March 1984 issue of the IBM Journal of Research and Development. 4. The separate model and code paradigm is explored in a precise manner in the context of arithmetic coding in a paper by J.J. Rissanen and G.G. Langdon [48]. 5. The separation of modeling and coding is exploited in a very nice manner in an early paper by G.G. Langdon and J.J. Rissanen [49]. 6. Various models for text compression that can be used effectively with arithmetic coding are described by T.G. Bell, I.H. Witten, and J.G. Cleary [50] in an article in the ACM Computing Surveys. 7. The coder used in the JBIG algorithm is a descendant of the Q coder, described in some detail in several papers [51, 52, 53] in the November 1988 issue of the IBM Journal of Research and Development. 4 . 9 Projects and Problems 1. Given a number a in the interval 01 with an n-bit binary representation b1b2 bn , show that for any other number b to have a binary representation with b1b2 bn as the prefix, b has to lie in the interval aa+ 1 2n . 2. The binary arithmetic coding approach specified in the JBIG standard can be used for coding gray-scale images via bit plane encoding. In bit plane encoding, we combine the most significant bits for each pixel into one bit plane, the next most significant bits into another bit plane, and so on. Use the function extrctbp to obtain eight bit planes for the sena.img and omaha.img test images, and encode them using arithmetic coding. Use the low-resolution contexts shown in Figure 7.11. 3. Bit plane encoding is more effective when the pixels are encoded using a Gray code. The Gray code assigns numerically adjacent values binary codes that differ by only 1 bit. To convert from the standard binary code b0b1b2 b7 to the Gray code g0g1g2 g7, we can use the equations g0 = b0 gk = bk ⊕bk−1 Convert the test images sena.img and omaha.img to a Gray code representation, and bit plane encode. Compare with the results for the non-Gray-coded representation.
  • 138. 4.9 Projects and Problems 115 T A B L E 4 . 9 Probability model for Problems 5 and 6. Letter Probability a1 .2 a2 .3 a3 .5 T A B L E 4 . 10 Frequency counts for Problem 7. Letter Count a 37 b 38 c 25 4. In Example 4.4.4, repeat the encoding using m = 6. Comment on your results. 5. Given the probability model in Table 4.9, find the real valued tag for the sequence a1 a1 a3 a2 a3 a1. 6. For the probability model in Table 4.9, decode a sequence of length 10 with the tag 063215699. 7. Given the frequency counts shown in Table 4.10: (a) What is the word length required for unambiguous encoding? (b) Find the binary code for the sequence abacabb. (c) Decode the code you obtained to verify that your encoding was correct. 8. Generate a binary sequence of length L with P0 = 08, and use the arithmetic coding algorithm to encode it. Plot the difference of the rate in bits/symbol and the entropy as a function of L. Comment on the effect of L on the rate.
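As an aid for Problem 3 above, the Gray-code mapping $g_0 = b_0$, $g_k = b_k \oplus b_{k-1}$ and its inverse (needed to recover the pixel values after decoding) can be sketched as follows; this is my own illustration for 8-bit pixels, not code supplied with the book.

```python
def to_gray(b):
    """Gray code of an 8-bit value: g0 = b0, gk = bk XOR b(k-1),
    which for an integer is simply b XOR (b >> 1)."""
    return b ^ (b >> 1)

def from_gray(g):
    """Inverse mapping: b0 = g0, bk = gk XOR b(k-1)."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

assert all(from_gray(to_gray(v)) == v for v in range(256))
print(to_gray(127), to_gray(128))   # 64 and 192: adjacent values differ in a single bit
```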
  • 140. 5 Dictionary Techniques 5 . 1 Overview I n the previous two chapters we looked at coding techniques that assume a source that generates a sequence of independent symbols. As most sources are correlated to start with, the coding step is generally preceded by a decorrelation step. In this chapter we will look at techniques that incorporate the structure in the data in order to increase the amount of compression. These techniques— both static and adaptive (or dynamic)—build a list of commonly occurring patterns and encode these patterns by transmitting their index in the list. They are most useful with sources that generate a relatively small number of patterns quite frequently, such as text sources and computer commands. We discuss applications to text compression, modem communications, and image compression. 5 . 2 Introduction In many applications, the output of the source consists of recurring patterns. A classic example is a text source in which certain patterns or words recur constantly. Also, there are certain patterns that simply do not occur, or if they do, occur with great rarity. For example, we can be reasonably sure that the word Limpopo1 occurs in a very small fraction of the text sources in existence. A very reasonable approach to encoding such sources is to keep a list, or dictionary, of frequently occurring patterns. When these patterns appear in the source output, they are encoded with a reference to the dictionary. If the pattern does not appear in the dictionary, then it can be encoded using some other, less efficient, method. In effect we are splitting 1 “How the Elephant Got Its Trunk” in Just So Stories by Rudyard Kipling.
  • 141. 118 5 D I C T I O N A R Y T E C H N I Q U E S the input into two classes, frequently occurring patterns and infrequently occurring patterns. For this technique to be effective, the class of frequently occurring patterns, and hence the size of the dictionary, must be much smaller than the number of all possible patterns. Suppose we have a particular text that consists of four-character words, three characters from the 26 lowercase letters of the English alphabet followed by a punctuation mark. Suppose our source alphabet consists of the 26 lowercase letters of the English alphabet and the punctuation marks comma, period, exclamation mark, question mark, semicolon, and colon. In other words, the size of the input alphabet is 32. If we were to encode the text source one character at a time, treating each character as an equally likely event, we would need 5 bits per character. Treating all 324 ( = 220 = 1048576) four-character patterns as equally likely, we have a code that assigns 20 bits to each four-character pattern. Let us now put the 256 most likely four-character patterns into a dictionary. The transmission scheme works as follows: Whenever we want to send a pattern that exists in the dictionary, we will send a 1-bit flag, say, a 0, followed by an 8-bit index corresponding to the entry in the dictionary. If the pattern is not in the dictionary, we will send a 1 followed by the 20-bit encoding of the pattern. If the pattern we encounter is not in the dictionary, we will actually use more bits than in the original scheme, 21 instead of 20. But if it is in the dictionary, we will send only 9 bits. The utility of our scheme will depend on the percentage of the words we encounter that are in the dictionary. We can get an idea about the utility of our scheme by calculating the average number of bits per pattern. If the probability of encountering a pattern from the dictionary is p, then the average number of bits per pattern R is given by R = 9p+211−p = 21−12p (5.1) For our scheme to be useful, R should have a value less than 20. This happens when p ≥ 0084. This does not seem like a very large number. However, note that if all patterns were occurring in an equally likely manner, the probability of encountering a pattern from the dictionary would be less than 000025! We do not simply want a coding scheme that performs slightly better than the simple- minded approach of coding each pattern as equally likely; we would like to improve the performance as much as possible. In order for this to happen, p should be as large as possible. This means that we should carefully select patterns that are most likely to occur as entries in the dictionary. To do this, we have to have a pretty good idea about the structure of the source output. If we do not have information of this sort available to us prior to the encoding of a particular source output, we need to acquire this information somehow when we are encoding. If we feel we have sufficient prior knowledge, we can use a static approach; if not, we can take an adaptive approach. We will look at both these approaches in this chapter. 5 . 3 Static Dictionary Choosing a static dictionary technique is most appropriate when considerable prior knowl- edge about the source is available. This technique is especially suitable for use in specific applications. For example, if the task were to compress the student records at a university, a static dictionary approach may be the best. 
This is because we know ahead of time that certain words such as "Name" and "Student ID" are going to appear in almost all of the records.
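(As a brief aside, the break-even behavior captured by Equation (5.1) is easy to explore numerically. The sketch below is not from the text; it evaluates the average rate for the 256-entry dictionary of Section 5.2 and, as one possible generalization in the spirit of Problem 1 at the end of the chapter, for a dictionary of M entries, assuming the index then costs ⌈log₂ M⌉ bits.)

import math

def average_rate(p, M=256, pattern_bits=20):
    """Average bits per four-character pattern for the flag-plus-index
    scheme of Section 5.2.  With M = 256 this reduces to Equation (5.1):
    R = 9p + 21(1 - p) = 21 - 12p."""
    index_bits = math.ceil(math.log2(M))
    return (1 + index_bits) * p + (1 + pattern_bits) * (1 - p)

# R drops below 20 bits per pattern once p exceeds 1/12, i.e. about 0.084.
for p in (0.05, 0.084, 0.25, 0.50):
    print(f"p = {p:.3f}   R = {average_rate(p):.2f} bits per pattern")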
  • 142. 5.3 Static Dictionary 119 Other words such as “Sophomore,” “credits,” and so on will occur quite often. Depending on the location of the university, certain digits in social security numbers are more likely to occur. For example, in Nebraska most student ID numbers begin with the digits 505. In fact, most entries will be of a recurring nature. In this situation, it is highly efficient to design a compression scheme based on a static dictionary containing the recurring patterns. Similarly, there could be a number of other situations in which an application-specific or data-specific static-dictionary-based coding scheme would be the most efficient. It should be noted that these schemes would work well only for the applications and data they were designed for. If these schemes were to be used with different applications, they may cause an expansion of the data instead of compression. A static dictionary technique that is less specific to a single application is digram coding. We describe this in the next section. 5 . 3 . 1 D i g r a m C o d i n g One of the more common forms of static dictionary coding is digram coding. In this form of coding, the dictionary consists of all letters of the source alphabet followed by as many pairs of letters, called digrams, as can be accommodated by the dictionary. For example, suppose we were to construct a dictionary of size 256 for digram coding of all printable ASCII characters. The first 95 entries of the dictionary would be the 95 printable ASCII characters. The remaining 161 entries would be the most frequently used pairs of characters. The digram encoder reads a two-character input and searches the dictionary to see if this input exists in the dictionary. If it does, the corresponding index is encoded and transmitted. If it does not, the first character of the pair is encoded. The second character in the pair then becomes the first character of the next digram. The encoder reads another character to complete the digram, and the search procedure is repeated. E x a m p l e 5 . 3 . 1: Suppose we have a source with a five-letter alphabet = abcdr. Based on knowledge about the source, we build the dictionary shown in Table 5.1. T A B L E 5 . 1 A sample dictionary. Code Entry Code Entry 000 a 100 r 001 b 101 ab 010 c 110 ac 011 d 111 ad Suppose we wish to encode the sequence abracadabra The encoder reads the first two characters ab and checks to see if this pair of letters exists in the dictionary. It does and is encoded using the codeword 101. The encoder then reads
  • 143. 120 5 D I C T I O N A R Y T E C H N I Q U E S the next two characters ra and checks to see if this pair occurs in the dictionary. It does not, so the encoder sends out the code for r, which is 100, then reads in one more character, c, to make the two-character pattern ac. This does exist in the dictionary and is encoded as 110. Continuing in this fashion, the remainder of the sequence is coded. The output string for the given input sequence is 101100110111101100000. T A B L E 5 . 2 Thirty most frequently occurring pairs of characters in a 41,364-character-long LaTeX document. Pair Count Pair Count e/ b 1128 ar 314 / bt 838 at 313 / b/ b 823 / bw 309 th 817 te 296 he 712 / bs 295 in 512 d/ b 272 s/ b 494 / bo 266 er 433 io 257 / ba 425 co 256 t/ b 401 re 247 en 392 / b$ 246 on 385 r/ b 239 n/ b 353 di 230 ti 322 ic 229 / bi 317 ct 226 T A B L E 5 . 3 Thirty most frequently occurring pairs of characters in a collection of C programs containing 64,983 characters. Pair Count Pair Count / b/ b 5728 st 442 nl/ b 1471 le 440 nl 1133 ut 440 in 985 f 416 nt 739 ar 381 = / b 687 or 374 / bi 662 r/ b 373 t/ b 615 en 371 / b = 612 er 358 558 ri 357 / b 554 at 352 nlnl 506 pr 351 / bf 505 te 349 e/ b 500 an 348 / b∗ 444 lo 347
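The digram encoding loop of Example 5.3.1 is simple enough to sketch directly. The few lines below are an illustration rather than anything prescribed by the text: the dictionary is the one from Table 5.1, the 3-bit codewords are handled as strings purely for readability, and the function name is our own.

def digram_encode(text, dictionary):
    """Encode text with a static dictionary that maps both single
    characters and frequent two-character pairs to codewords."""
    output = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in dictionary:
            output.append(dictionary[pair])      # pair found: send its codeword
            i += 2
        else:
            output.append(dictionary[text[i]])   # fall back to a single character
            i += 1
    return "".join(output)

# The dictionary of Table 5.1, with codewords written out as bit strings.
table_5_1 = {"a": "000", "b": "001", "c": "010", "d": "011", "r": "100",
             "ab": "101", "ac": "110", "ad": "111"}

print(digram_encode("abracadabra", table_5_1))
# 101100110111101100000, matching Example 5.3.1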
  • 144. 5.4 Adaptive Dictionary 121 A list of the 30 most frequently occurring pairs of characters in an earlier version of this chapter is shown in Table 5.2. For comparison, the 30 most frequently occurring pairs of characters in a set of C programs is shown in Table 5.3. In these tables, / b corresponds to a space and nl corresponds to a new line. Notice how different the two tables are. It is easy to see that a dictionary designed for compressing L A TEX documents would not work very well when compressing C programs. However, generally we want techniques that will be able to compress a variety of source outputs. If we wanted to compress computer files, we do not want to change techniques based on the content of the file. Rather, we would like the technique to adapt to the characteristics of the source output. We discuss adaptive-dictionary-based techniques in the next section. 5 . 4 Adaptive Dictionary Most adaptive-dictionary-based techniques have their roots in two landmark papers by Jacob Ziv and Abraham Lempel in 1977 [54] and 1978 [55]. These papers provide two different approaches to adaptively building dictionaries, and each approach has given rise to a number of variations. The approaches based on the 1977 paper are said to belong to the LZ77 family (also known as LZ1), while the approaches based on the 1978 paper are said to belong to the LZ78, or LZ2, family. The transposition of the initials is a historical accident and is a convention we will observe in this book. In the following sections, we first describe an implementation of each approach followed by some of the more well-known variations. 5 . 4 . 1 T h e L Z 7 7 A p p r o a c h In the LZ77 approach, the dictionary is simply a portion of the previously encoded sequence. The encoder examines the input sequence through a sliding window as shown in Figure 5.1. The window consists of two parts, a search buffer that contains a portion of the recently encoded sequence, and a look-ahead buffer that contains the next portion of the sequence to be encoded. In Figure 5.1, the search buffer contains eight symbols, while the look-ahead buffer contains seven symbols. In practice, the sizes of the buffers are significantly larger; however, for the purpose of explanation, we will keep the buffer sizes small. To encode the sequence in the look-ahead buffer, the encoder moves a search pointer back through the search buffer until it encounters a match to the first symbol in the look-ahead a b a r a a a d b a r r a r r a r Search buffer Match pointer Look-ahead buffer x x x x F I G U R E 5. 1 Encoding using the LZ77 approach.
  • 145. 122 5 D I C T I O N A R Y T E C H N I Q U E S buffer. The distance of the pointer from the look-ahead buffer is called the offset. The encoder then examines the symbols following the symbol at the pointer location to see if they match consecutive symbols in the look-ahead buffer. The number of consecutive symbols in the search buffer that match consecutive symbols in the look-ahead buffer, starting with the first symbol, is called the length of the match. The encoder searches the search buffer for the longest match. Once the longest match has been found, the encoder encodes it with a triple olc, where o is the offset, l is the length of the match, and c is the codeword corresponding to the symbol in the look-ahead buffer that follows the match. For example, in Figure 5.1 the pointer is pointing to the beginning of the longest match. The offset o in this case is 7, the length of the match l is 4, and the symbol in the look-ahead buffer following the match is . The reason for sending the third element in the triple is to take care of the situation where no match for the symbol in the look-ahead buffer can be found in the search buffer. In this case, the offset and match-length values are set to 0, and the third element of the triple is the code for the symbol itself. If the size of the search buffer is S, the size of the window (search and look-ahead buffers) is W, and the size of the source alphabet is A, then the number of bits needed to code the triple using fixed-length codes is log2 S+log2 W+log2 A. Notice that the second term is log2 W, not log2 S. The reason for this is that the length of the match can actually exceed the length of the search buffer. We will see how this happens in Example 5.4.1. In the following example, we will look at three different possibilities that may be encountered during the coding process: 1. There is no match for the next character to be encoded in the window. 2. There is a match. 3. The matched string extends inside the look-ahead buffer. E x a m p l e 5 . 4 . 1: T h e L Z 7 7 a p p r o a c h Suppose the sequence to be encoded is cabracadabrarrarrad Suppose the length of the window is 13, the size of the look-ahead buffer is six, and the current condition is as follows: cabraca dabrar with dabrar in the look-ahead buffer. We look back in the already encoded portion of the window to find a match for d. As we can see, there is no match, so we transmit the triple 00Cd. The first two elements of the triple show that there is no match to d in the search buffer, while Cd is the code for the character d. This seems like a wasteful way to encode a single character, and we will have more to say about this later.
  • 146. 5.4 Adaptive Dictionary 123 For now, let’s continue with the encoding process. As we have encoded a single character, we move the window by one character. Now the contents of the buffer are abracad abrarr with abrarr in the look-ahead buffer. Looking back from the current location, we find a match to a at an offset of two. The length of this match is one. Looking further back, we have another match for a at an offset of four; again the length of the match is one. Looking back even further in the window, we have a third match for a at an offset of seven. However, this time the length of the match is four (see Figure 5.2). So we encode the string abra with the triple 74Cr, and move the window forward by five characters. The window now contains the following characters: adabrar rarrad Now the look-ahead buffer contains the string rarrad. Looking back in the window, we find a match for r at an offset of one and a match length of one, and a second match at an offset of three with a match length of what at first appears to be three. It turns out we can use a match length of five instead of three. c a r b a a c d b a r r a r r a r d a l = 4 o = 7 Search pointer F I G U R E 5. 2 The encoding process. Why this is so will become clearer when we decode the sequence. To see how the decoding works, let us assume that we have decoded the sequence cabraca and we receive the triples 00Cd, 74Cr, and 35Cd. The first triple is easy to decode; there was no match within the previously decoded string, and the next symbol is d. The decoded string is now cabracad. The first element of the next triple tells the decoder to move the copy pointer back seven characters, and copy four characters from that point. The decoding process works as shown in Figure 5.3. Finally, let’s see how the triple 35Cd gets decoded. We move back three characters and start copying. The first three characters we copy are rar. The copy pointer moves once again, as shown in Figure 5.4, to copy the recently copied character r. Similarly, we copy the next character a. Even though we started copying only three characters back, we end up decoding five characters. Notice that the match only has to start in the search buffer; it can extend into the look-ahead buffer. In fact, if the last character in the look-ahead buffer
[Figure 5.3: Decoding of the triple ⟨7, 4, C(r)⟩. The decoder moves the copy pointer back 7 characters, copies 4 characters one at a time, and then decodes C(r).]
[Figure 5.4: Decoding the triple ⟨3, 5, C(d)⟩. The decoder moves back 3 characters and copies 5 characters; the copy runs into symbols it has just produced, after which C(d) is decoded.]
  • 148. 5.4 Adaptive Dictionary 125 had been r instead of d, followed by several more repetitions of rar, the entire sequence of repeated rars could have been encoded with a single triple. As we can see, the LZ77 scheme is a very simple adaptive scheme that requires no prior knowledge of the source and seems to require no assumptions about the characteristics of the source. The authors of this algorithm showed that asymptotically the performance of this algorithm approached the best that could be obtained by using a scheme that had full knowledge about the statistics of the source. While this may be true asymptotically, in practice there are a number of ways of improving the performance of the LZ77 algorithm as described here. Furthermore, by using the recent portions of the sequence, there is an assumption of sorts being used here—that is, that patterns recur “close” together. As we shall see, in LZ78 the authors removed this “assumption” and came up with an entirely different adaptive-dictionary-based scheme. Before we get to that, let us look at the different variations of the LZ77 algorithm. V a r i a t i o n s o n t h e L Z 7 7 T h e m e There are a number of ways that the LZ77 scheme can be made more efficient, and most of these have appeared in the literature. Many of the improvements deal with the efficient encoding of the triples. In the description of the LZ77 algorithm, we assumed that the triples were encoded using a fixed-length code. However, if we were willing to accept more complexity, we could encode the triples using variable-length codes. As we saw in earlier chapters, these codes can be adaptive or, if we were willing to use a two-pass algorithm, they can be semiadaptive. Popular compression packages, such as PKZip, Zip, LHarc, PNG, gzip, and ARJ, all use an LZ77-based algorithm followed by a variable-length coder. Other variations on the LZ77 algorithm include varying the size of the search and look- ahead buffers. To make the search buffer large requires the development of more effective search strategies. Such strategies can be implemented more effectively if the contents of the search buffer are stored in a manner conducive to fast searches. The simplest modification to the LZ77 algorithm, and one that is used by most variations of the LZ77 algorithm, is to eliminate the situation where we use a triple to encode a single character. Use of a triple is highly inefficient, especially if a large number of characters occur infrequently. The modification to get rid of this inefficiency is simply the addition of a flag bit, to indicate whether what follows is the codeword for a single symbol. By using this flag bit we also get rid of the necessity for the third element of the triple. Now all we need to do is to send a pair of values corresponding to the offset and length of match. This modification to the LZ77 algorithm is referred to as LZSS [56, 57]. 5 . 4 . 2 T h e L Z 7 8 A p p r o a c h The LZ77 approach implicitly assumes that like patterns will occur close together. It makes use of this structure by using the recent past of the sequence as the dictionary for encoding.
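Before looking at how LZ78 removes the window limitation, it may help to see the LZ77 matching step of the previous section in code. The sketch below is a deliberately brute-force illustration, not the efficient search structures used by real implementations; because Example 5.4.1 assumes the first seven symbols have already been encoded, the triples produced here for the full string will not line up one-for-one with those in the example, but the decoder shows exactly the copy-beyond-the-offset behavior of Figures 5.3 and 5.4.

def lz77_encode(seq, search_size, lookahead_size):
    """Produce (offset, length, next_symbol) triples as in Section 5.4.1.
    A match may start in the search buffer and extend into the look-ahead
    buffer, so the length can exceed the offset."""
    triples, pos = [], 0
    while pos < len(seq):
        best_off, best_len = 0, 0
        max_len = min(lookahead_size, len(seq) - pos) - 1   # leave room for the next symbol
        for off in range(1, min(search_size, pos) + 1):
            length = 0
            while length < max_len and seq[pos - off + length] == seq[pos + length]:
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        triples.append((best_off, best_len, seq[pos + best_len]))
        pos += best_len + 1
    return triples

def lz77_decode(triples):
    out = []
    for off, length, sym in triples:
        for _ in range(length):
            out.append(out[-off])   # the copy may run into freshly decoded output
        out.append(sym)
    return "".join(out)

sequence = "cabracadabrarrarrad"
triples = lz77_encode(sequence, search_size=7, lookahead_size=6)
assert lz77_decode(triples) == sequence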
  • 149. 126 5 D I C T I O N A R Y T E C H N I Q U E S a c b d f e g i h a c b d f e g i h a c b d f e g i h Search buffer Look-ahead buffer F I G U R E 5. 5 The Achilles’ heel of LZ77. However, this means that any pattern that recurs over a period longer than that covered by the coder window will not be captured. The worst-case situation would be where the sequence to be encoded was periodic with a period longer than the search buffer. Consider Figure 5.5. This is a periodic sequence with a period of nine. If the search buffer had been just one symbol longer, this sequence could have been significantly compressed. As it stands, none of the new symbols will have a match in the search buffer and will have to be represented by separate codewords. As this involves sending along overhead (a 1-bit flag for LZSS and a triple for the original LZ77 algorithm), the net result will be an expansion rather than a compression. Although this is an extreme situation, there are less drastic circumstances in which the finite view of the past would be a drawback. The LZ78 algorithm solves this problem by dropping the reliance on the search buffer and keeping an explicit dictionary. This dictionary has to be built at both the encoder and decoder, and care must be taken that the dictionaries are built in an identical manner. The inputs are coded as a double ic, with i being an index corresponding to the dictionary entry that was the longest match to the input, and c being the code for the character in the input following the matched portion of the input. As in the case of LZ77, the index value of 0 is used in the case of no match. This double then becomes the newest entry in the dictionary. Thus, each new entry into the dictionary is one new symbol concatenated with an existing dictionary entry. To see how the LZ78 algorithm works, consider the following example. E x a m p l e 5 . 4 . 2: T h e L Z 7 8 a p p r o a c h Let us encode the following sequence using the LZ78 approach: wabba/ bwabba/ bwabba/ bwabba/ bwoo/ bwoo/ bwoo2 where / b stands for space. Initially, the dictionary is empty, so the first few symbols encoun- tered are encoded with the index value set to 0. The first three encoder outputs are 0Cw, 0Ca, 0Cb, and the dictionary looks like Table 5.4. The fourth symbol is a b, which is the third entry in the dictionary. If we append the next symbol, we would get the pattern ba, which is not in the dictionary, so we encode these two symbols as 3Ca, and add the pattern ba as the fourth entry in the dictionary. Continuing in this fashion, the encoder output and the dictionary develop as in Table 5.5. Notice that the entries in the dictionary generally keep getting longer, and if this particular 2 “The Monster Song” from Sesame Street.
  • 150. 5.4 Adaptive Dictionary 127 T A B L E 5 . 4 The initial dictionary. Index Entry 1 w 2 a 3 b T A B L E 5 . 5 Development of dictionary. Dictionary Encoder Output Index Entry 0Cw 1 w 0Ca 2 a 0Cb 3 b 3Ca 4 ba 0C/ b 5 / b 1Ca 6 wa 3Cb 7 bb 2C/ b 8 a/ b 6Cb 9 wab 4C/ b 10 ba/ b 9Cb 11 wabb 8Cw 12 a/ bw 0Co 13 o 13C/ b 14 o/ b 1Co 15 wo 14Cw 16 o/ bw 13Co 17 oo sentence was repeated often, as it is in the song, after a while the entire sentence would be an entry in the dictionary. While the LZ78 algorithm has the ability to capture patterns and hold them indefinitely, it also has a rather serious drawback. As seen from the example, the dictionary keeps growing without bound. In a practical situation, we would have to stop the growth of the dictionary at some stage, and then either prune it back or treat the encoding as a fixed dictionary scheme. We will discuss some possible approaches when we study applications of dictionary coding. V a r i a t i o n s o n t h e L Z 7 8 T h e m e - — T h e L Z W A l g o r i t h m There are a number of ways the LZ78 algorithm can be modified, and as is the case with the LZ77 algorithm, anything that can be modified probably has been. The most well-known modification, one that initially sparked much of the interest in the LZ algorithms, is a modification by Terry Welch known as LZW [58]. Welch proposed a technique for removing
  • 151. 128 5 D I C T I O N A R Y T E C H N I Q U E S the necessity of encoding the second element of the pair ic. That is, the encoder would only send the index to the dictionary. In order to do this, the dictionary has to be primed with all the letters of the source alphabet. The input to the encoder is accumulated in a pattern p as long as p is contained in the dictionary. If the addition of another letter a results in a pattern p ∗ a (∗ denotes concatenation) that is not in the dictionary, then the index of p is transmitted to the receiver, the pattern p∗a is added to the dictionary, and we start another pattern with the letter a. The LZW algorithm is best understood with an example. In the following two examples, we will look at the encoder and decoder operations for the same sequence used to explain the LZ78 algorithm. E x a m p l e 5 . 4 . 3: T h e L Z W a l g o r i t h m - — e n c o d i n g We will use the sequence previously used to demonstrate the LZ78 algorithm as our input sequence: wabba/ bwabba/ bwabba/ bwabba/ bwoo/ bwoo/ bwoo Assuming that the alphabet for the source is / babow, the LZW dictionary initially looks like Table 5.6. T A B L E 5 . 6 Initial LZW dictionary. Index Entry 1 / b 2 a 3 b 4 o 5 w The encoder first encounters the letter w. This “pattern” is in the dictionary so we concatenate the next letter to it, forming the pattern wa. This pattern is not in the dictionary, so we encode w with its dictionary index 5, add the pattern wa to the dictionary as the sixth element of the dictionary, and begin a new pattern starting with the letter a. As a is in the dictionary, we concatenate the next element b to form the pattern ab. This pattern is not in the dictionary, so we encode a with its dictionary index value 2, add the pattern ab to the dictionary as the seventh element of the dictionary, and start constructing a new pattern with the letter b. We continue in this manner, constructing two-letter patterns, until we reach the letter w in the second wabba. At this point the output of the encoder consists entirely of indices from the initial dictionary: 5 2 3 3 2 1. The dictionary at this point looks like Table 5.7. (The 12th entry in the dictionary is still under construction.) The next symbol in the sequence is a. Concatenating this to w, we get the pattern wa. This pattern already exists in the dictionary (item 6), so we read the next symbol, which is b. Concatenating this to wa, we get the pattern wab. This pattern does not exist in the dictionary, so we include it as the 12th entry in the dictionary and start a new pattern with the symbol b. We also encode
  • 152. 5.4 Adaptive Dictionary 129 wa with its index value of 6. Notice that after a series of two-letter entries, we now have a three-letter entry. As the encoding progresses, the length of the entries keeps increasing. The longer entries in the dictionary indicate that the dictionary is capturing more of the structure in the sequence. The dictionary at the end of the encoding process is shown in Table 5.8. Notice that the 12th through the 19th entries are all either three or four letters in length. Then we encounter the pattern woo for the first time and we drop back to two-letter patterns for three more entries, after which we go back to entries of increasing length. T A B L E 5 . 7 Constructing the 12th entry of the LZW dictionary. Index Entry 1 / b 2 a 3 b 4 o 5 w 6 wa 7 ab 8 bb 9 ba 10 a/ b 11 / bw 12 w… T A B L E 5 . 8 The LZW dictionary for encoding wabba/ bwabba/ bwabba/ bwabba/ bwoo/ bwoo/ bwoo. Index Entry Index Entry 1 / b 14 a/ bw 2 a 15 wabb 3 b 16 ba/ b 4 o 17 / bwa 5 w 18 abb 6 wa 19 ba/ bw 7 ab 20 wo 8 bb 21 oo 9 ba 22 o/ b 10 a/ b 23 / bwo 11 / bw 24 oo/ b 12 wab 25 / bwoo 13 bba The encoder output sequence is 5 2 3 3 2 1 6 8 10 12 9 11 7 16 5 4 4 11 21 23 4.
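The encoding procedure just illustrated is compact enough to write out. The following sketch is a plain dictionary-of-strings illustration (practical implementations use trie-like structures and bounded dictionaries); an ordinary space character stands in for /b, and indices are 1-based to match Table 5.6.

def lzw_encode(sequence, alphabet):
    """LZW encoding: grow a pattern p while p is in the dictionary; when
    p*a falls outside it, send the index of p, add p*a as a new entry,
    and restart the pattern with the letter a."""
    dictionary = {symbol: i + 1 for i, symbol in enumerate(alphabet)}
    output, p = [], ""
    for a in sequence:
        if p + a in dictionary:
            p = p + a                                    # keep extending the pattern
        else:
            output.append(dictionary[p])                 # transmit the index of p
            dictionary[p + a] = len(dictionary) + 1      # add p*a as a new entry
            p = a                                        # start a new pattern with a
    output.append(dictionary[p])                         # flush the final pattern
    return output

text = "wabba wabba wabba wabba woo woo woo"             # spaces play the role of /b
print(lzw_encode(text, alphabet=[" ", "a", "b", "o", "w"]))
# [5, 2, 3, 3, 2, 1, 6, 8, 10, 12, 9, 11, 7, 16, 5, 4, 4, 11, 21, 23, 4], as in Example 5.4.3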
  • 153. 130 5 D I C T I O N A R Y T E C H N I Q U E S E x a m p l e 5 . 4 . 4: T h e L Z W a l g o r i t h m - — d e c o d i n g In this example we will take the encoder output from the previous example and decode it using the LZW algorithm. The encoder output sequence in the previous example was 5 2 3 3 2 1 6 8 10 12 9 11 7 16 5 4 4 11 21 23 4 This becomes the decoder input sequence. The decoder starts with the same initial dictionary as the encoder (Table 5.6). The index value 5 corresponds to the letter w, so we decode w as the first element of our sequence. At the same time, in order to mimic the dictionary construction procedure of the encoder, we begin construction of the next element of the dictionary. We start with the letter w. This pattern exists in the dictionary, so we do not add it to the dictionary and continue with the decoding process. The next decoder input is 2, which is the index corresponding to the letter a. We decode an a and concatenate it with our current pattern to form the pattern wa. As this does not exist in the dictionary, we add it as the sixth element of the dictionary and start a new pattern beginning with the letter a. The next four inputs 3 3 2 1 correspond to the letters bba/ b and generate the dictionary entries ab, bb, ba, and a/ b. The dictionary now looks like Table 5.9, where the 11th entry is under construction. T A B L E 5 . 9 Constructing the 11th entry of the LZW dictionary while decoding. Index Entry 1 / b 2 a 3 b 4 o 5 w 6 wa 7 ab 8 bb 9 ba 10 a/ b 11 / b The next input is 6, which is the index of the pattern wa. Therefore, we decode a w and an a. We first concatenate w to the existing pattern, which is / b, and form the pattern / bw. As / bw does not exist in the dictionary, it becomes the 11th entry. The new pattern now starts with the letter w. We had previously decoded the letter a, which we now concatenate to w to obtain the pattern wa. This pattern is contained in the dictionary, so we decode the next input, which is 8. This corresponds to the entry bb in the dictionary. We decode the first b and concatenate it to the pattern wa to get the pattern wab. This pattern does not exist in the dictionary, so we add it as the 12th entry in the dictionary and start a new pattern with the letter b. Decoding the second b and concatenating it to the new pattern, we get the pattern bb. This pattern exists in the dictionary, so we decode the next element in the
  • 154. 5.4 Adaptive Dictionary 131 sequence of encoder outputs. Continuing in this fashion, we can decode the entire sequence. Notice that the dictionary being constructed by the decoder is identical to that constructed by the encoder. There is one particular situation in which the method of decoding the LZW algorithm described above breaks down. Suppose we had a source with an alphabet = ab, and we were to encode the sequence beginning with abababab . The encoding process is still the same. We begin with the initial dictionary shown in Table 5.10 and end up with the final dictionary shown in Table 5.11. The transmitted sequence is 1 2 3 5 . This looks like a relatively straightforward sequence to decode. However, when we try to do so, we run into a snag. Let us go through the decoding process and see what happens. We begin with the same initial dictionary as the encoder (Table 5.10). The first two elements in the received sequence 1 2 3 5 are decoded as a and b, giving rise to the third dictionary entry ab, and the beginning of the next pattern to be entered in the dictionary, b. The dictionary at this point is shown in Table 5.12. T A B L E 5 . 10 Initial dictionary for abababab. Index Entry 1 a 2 b T A B L E 5 . 11 Final dictionary for abababab. Index Entry 1 a 2 b 3 ab 4 ba 5 aba 6 abab 7 b T A B L E 5 . 12 Constructing the fourth entry of the dictionary while decoding. Index Entry 1 a 2 b 3 ab 4 b
  • 155. 132 5 D I C T I O N A R Y T E C H N I Q U E S T A B L E 5 . 13 Constructing the fifth entry (stage one). Index Entry 1 a 2 b 3 ab 4 ba 5 a T A B L E 5 . 14 Constructing the fifth entry (stage two). Index Entry 1 a 2 b 3 ab 4 ba 5 ab The next input to the decoder is 3. This corresponds to the dictionary entry ab. Decoding each in turn, we first concatenate a to the pattern under construction to get ba. This pattern is not contained in the dictionary, so we add this to the dictionary (keep in mind, we have not used the b from ab yet), which now looks like Table 5.13. The new entry starts with the letter a. We have only used the first letter from the pair ab. Therefore, we now concatenate b to a to obtain the pattern ab. This pattern is contained in the dictionary, so we continue with the decoding process. The dictionary at this stage looks like Table 5.14. The first four entries in the dictionary are complete, while the fifth entry is still under construction. However, the very next input to the decoder is 5, which corresponds to the incomplete entry! How do we decode an index for which we do not as yet have a complete dictionary entry? The situation is actually not as bad as it looks. (Of course, if it were, we would not now be studying LZW.) While we may not have a fifth entry for the dictionary, we do have the beginnings of the fifth entry, which is ab . Let us, for the moment, pretend that we do indeed have the fifth entry and continue with the decoding process. If we had a fifth entry, the first two letters of the entry would be a and b. Concatenating a to the partial new entry we get the pattern aba. This pattern is not contained in the dictionary, so we add this to our dictionary, which now looks like Table 5.15. Notice that we now have the fifth entry in the dictionary, which is aba. We have already decoded the ab portion of aba. We can now decode the last letter a and continue on our merry way. This means that the LZW decoder has to contain an exception handler to handle the special case of decoding an index that does not have a corresponding complete entry in the decoder dictionary.
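The decoder, including the exception handler for an index whose dictionary entry is still under construction, can be sketched in the same style. As before, this is an illustrative implementation with 1-based indices and a space standing in for /b, not code taken from the text.

def lzw_decode(indices, alphabet):
    """LZW decoding that mirrors the encoder's dictionary construction."""
    dictionary = {i + 1: symbol for i, symbol in enumerate(alphabet)}
    previous = dictionary[indices[0]]
    output = [previous]
    for index in indices[1:]:
        if index in dictionary:
            entry = dictionary[index]
        else:
            # Exception: the index refers to the entry under construction,
            # which must be the previous pattern plus its own first letter.
            entry = previous + previous[0]
        dictionary[len(dictionary) + 1] = previous + entry[0]
        output.append(entry)
        previous = entry
    return "".join(output)

codes = [5, 2, 3, 3, 2, 1, 6, 8, 10, 12, 9, 11, 7, 16, 5, 4, 4, 11, 21, 23, 4]
print(lzw_decode(codes, alphabet=[" ", "a", "b", "o", "w"]))   # recovers the wabba sequence

print(lzw_decode([1, 2, 3, 5], alphabet=["a", "b"]))           # "abababa": the special case of Tables 5.10-5.15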
  • 156. 5.5 Applications 133 T A B L E 5 . 15 Completion of the fifth entry. Index Entry 1 a 2 b 3 ab 4 ba 5 aba 6 a 5 . 5 Applications Since the publication of Terry Welch’s article [58], there has been a steadily increasing number of applications that use some variant of the LZ78 algorithm. Among the LZ78 variants, by far the most popular is the LZW algorithm. In this section we describe two of the best-known applications of LZW: GIF, and V.42 bis. While the LZW algorithm was initially the algorithm of choice patent concerns has lead to increasing use of the LZ77 algorithm. The most popular implementation of the LZ77 algorithm is the deflate algorithm initially designed by Phil Katz. It is part of the popular zlib library developed by Jean-loup Gailly and Mark Adler. Jean-loup Gailly also used deflate in the widely used gzip algorithm. The deflate algorithm is also used in PNG which we describe below. 5 . 5 . 1 F i l e C o m p r e s s i o n - — U N I X c o m p r e s s The UNIX compress command is one of the earlier applications of LZW. The size of the dictionary is adaptive. We start with a dictionary of size 512. This means that the transmitted codewords are 9 bits long. Once the dictionary has filled up, the size of the dictionary is doubled to 1024 entries. The codewords transmitted at this point have 10 bits. The size of the dictionary is progressively doubled as it fills up. In this way, during the earlier part of the coding process when the strings in the dictionary are not very long, the codewords used to encode them also have fewer bits. The maximum size of the codeword, bmax, can be set by the user to between 9 and 16, with 16 bits being the default. Once the dictionary contains 2bmax entries, compress becomes a static dictionary coding technique. At this point the algorithm monitors the compression ratio. If the compression ratio falls below a threshold, the dictionary is flushed, and the dictionary building process is restarted. This way, the dictionary always reflects the local characteristics of the source. 5 . 5 . 2 I m a g e C o m p r e s s i o n - — T h e G r a p h i c s I n t e r c h a n g e F o r m a t ( G I F ) The Graphics Interchange Format (GIF) was developed by Compuserve Information Service to encode graphical images. It is another implementation of the LZW algorithm and is very similar to the compress command. The compressed image is stored with the first byte
  • 157. 134 5 D I C T I O N A R Y T E C H N I Q U E S T A B L E 5 . 16 Comparison of GIF with arithmetic coding. Arithmetic Coding Arithmetic Coding Image GIF of Pixel Values of Pixel Differences Sena 51,085 53,431 31,847 Sensin 60,649 58,306 37,126 Earth 34,276 38,248 32,137 Omaha 61,580 56,061 51,393 being the minimum number of bits b per pixel in the original image. For the images we have been using as examples, this would be eight. The binary number 2b is defined to be the clear code. This code is used to reset all compression and decompression parameters to a start-up state. The initial size of the dictionary is 2b+1 . When this fills up, the dictionary size is doubled, as was done in the compress algorithm, until the maximum dictionary size of 4096 is reached. At this point the compression algorithm behaves like a static dictionary algorithm. The codewords from the LZW algorithm are stored in blocks of characters. The characters are 8 bits long, and the maximum block size is 255. Each block is preceded by a header that contains the block size. The block is terminated by a block terminator consisting of eight 0s. The end of the compressed image is denoted by an end-of-information code with a value of 2b +1. This codeword should appear before the block terminator. GIF has become quite popular for encoding all kinds of images, both computer-generated and “natural” images. While GIF works well with computer-generated graphical images, and pseudocolor or color-mapped images, it is generally not the most efficient way to losslessly compress images of natural scenes, photographs, satellite images, and so on. In Table 5.16 we give the file sizes for the GIF-encoded test images. For comparison, we also include the file sizes for arithmetic coding the original images and arithmetic coding the differences. Notice that even if we account for the extra overhead in the GIF files, for these images GIF barely holds its own even with simple arithmetic coding of the original pixels. While this might seem odd at first, if we examine the image on a pixel level, we see that there are very few repetitive patterns compared to a text source. Some images, like the Earth image, contain large regions of constant values. In the dictionary coding approach, these regions become single entries in the dictionary. Therefore, for images like these, the straight forward dictionary coding approach does hold its own. However, for most other images, it would probably be preferable to perform some preprocessing to obtain a sequence more amenable to dictionary coding. The PNG standard described next takes advantage of the fact that in natural images the pixel-to-pixel variation is generally small to develop an appropriate preprocessor. We will also revisit this subject in Chapter 7. 5 . 5 . 3 I m a g e C o m p r e s s i o n - — P o r t a b l e N e t w o r k G r a p h i c s ( P N G ) The PNG standard is one of the first standards to be collaboratively developed over the Internet. The impetus for it was an announcement in December 1994 by Unisys (which had acquired the patent for LZW from Sperry) and CompuServe that they would start charging
  • 158. 5.5 Applications 135 royalties to authors of software that included support for GIF. The announcement resulted in an uproar in the segment of the compression community that formed the core of the Usenet group comp.compression. The community decided that a patent-free replacement for GIF should be developed, and within three months PNG was born. (For a more detailed history of PNG as well as software and much more, go to the PNG website maintained by Greg Roelof, http://www.libpng.org/pub/png/.) Unlike GIF, the compression algorithm used in PNG is based on LZ77. In particular, it is based on the deflate [59] implementation of LZ77. This implementation allows for match lengths of between 3 and 258. At each step the encoder examines three bytes. If it cannot find a match of at least three bytes it puts out the first byte and examines the next three bytes. So, at each step it either puts out the value of a single byte, or literal, or the pair match length offset . The alphabets of the literal and match length are combined to form an alphabet of size 286 (indexed by 0−−285). The indices 0−−255 represent literal bytes and the index 256 is an end-of-block symbol. The remaining 29 indices represent codes for ranges of lengths between 3 and 258, as shown in Table 5.17. The table shows the index, the number of selector bits to follow the index, and the lengths represented by the index and selector bits. For example, the index 277 represents the range of lengths from 67 to 82. To specify which of the sixteen values has actually occurred, the code is followed by four selector bits. The index values are represented using a Huffman code. The Huffman code is specified in Table 5.18. The offset can take on values between 1 and 32,768. These values are divided into 30 ranges. The thirty range values are encoded using a Huffman code (different from the Huffman code for the literal and length values) and the code is followed by a number of selector bits to specify the particular distance within the range. We have mentioned earlier that in natural images there is not great deal of repetition of sequences of pixel values. However, pixel values that are spatially close also tend to have values that are similar. The PNG standard makes use of this structure by estimating the value of a pixel based on its causal neighbors and subtracting this estimate from the pixel. The difference modulo 256 is then encoded in place of the original pixel. There are four different ways of getting the estimate (five if you include no estimation), and PNG allows T A B L E 5 . 17 Codes for representations of match length [59]. Index # of selector bits Length Index # of selector bits Length Index # of selector bits Length 257 0 3 267 1 15,16 277 4 67–82 258 0 4 268 1 17,18 278 4 83–98 259 0 5 269 2 19–22 279 4 99–114 260 0 6 270 2 23–26 280 4 115–130 261 0 7 271 2 27–30 281 5 131–162 262 0 8 272 2 31–34 282 5 163–194 263 0 9 273 3 35–42 283 5 195–226 264 0 10 274 3 43–50 284 5 227–257 265 1 11, 12 275 3 51–58 285 0 258 266 1 13, 14 276 3 59–66
  • 159. 136 5 D I C T I O N A R Y T E C H N I Q U E S T A B L E 5 . 18 Huffman codes for the match length alphabet [59]. Index Ranges # of bits Binary Codes 0–143 8 00110000 through 10111111 144–255 9 110010000 through 111111111 256–279 7 0000000 through 0010111 280–287 8 11000000 through 11000111 T A B L E 5 . 19 Comparison of PNG with GIF and arithmetic coding. Arithmetic Coding Arithmetic Coding Image PNG GIF of Pixel Values of Pixel Differences Sena 31,577 51,085 53,431 31,847 Sensin 34,488 60,649 58,306 37,126 Earth 26,995 34,276 38,248 32,137 Omaha 50,185 61,580 56,061 51,393 the use of a different method of estimation for each row. The first way is to use the pixel from the row above as the estimate. The second method is to use the pixel to the left as the estimate. The third method uses the average of the pixel above and the pixel to the left. The final method is a bit more complex. An initial estimate of the pixel is first made by adding the pixel to the left and the pixel above and subtracting the pixel to the upper left. Then the pixel that is closest to the initial esitmate (upper, left, or upper left) is taken as the estimate. A comparison of the performance of PNG and GIF on our standard image set is shown in Table 5.19. The PNG method clearly outperforms GIF. 5 . 5 . 4 C o m p r e s s i o n o v e r M o d e m s - — V . 4 2 b i s The ITU-T Recommendation V.42 bis is a compression standard devised for use over a telephone network along with error-correcting procedures described in CCITT Recommen- dation V.42. This algorithm is used in modems connecting computers to remote users. The algorithm described in this recommendation operates in two modes, a transparent mode and a compressed mode. In the transparent mode, the data are transmitted in uncompressed form, while in the compressed mode an LZW algorithm is used to provide compression. The reason for the existence of two modes is that at times the data being transmitted do not have repetitive structure and therefore cannot be compressed using the LZW algorithm. In this case, the use of a compression algorithm may even result in expansion. In these situations, it is better to send the data in an uncompressed form. A random data stream would cause the dictionary to grow without any long patterns as elements of the dictionary. This means that most of the time the transmitted codeword would represent a single letter
  • 160. 5.5 Applications 137 from the source alphabet. As the dictionary size is much larger than the source alphabet size, the number of bits required to represent an element in the dictionary is much more than the number of bits required to represent a source letter. Therefore, if we tried to compress a sequence that does not contain repeating patterns, we would end up with more bits to transmit than if we had not performed any compression. Data without repetitive structure are often encountered when a previously compressed file is transferred over the telephone lines. The V.42 bis recommendation suggests periodic testing of the output of the compression algorithm to see if data expansion is taking place. The exact nature of the test is not specified in the recommendation. In the compressed mode, the system uses LZW compression with a variable-size dictio- nary. The initial dictionary size is negotiated at the time a link is established between the transmitter and receiver. The V.42 bis recommendation suggests a value of 2048 for the dictionary size. It specifies that the minimum size of the dictionary is to be 512. Suppose the initial negotiations result in a dictionary size of 512. This means that our codewords that are indices into the dictionary will be 9 bits long. Actually, the entire 512 indices do not correspond to input strings; three entries in the dictionary are reserved for control codewords. These codewords in the compressed mode are shown in Table 5.20. When the numbers of entries in the dictionary exceed a prearranged threshold C3, the encoder sends the STEPUP control code, and the codeword size is incremented by 1 bit. At the same time, the threshold C3 is also doubled. When all available dictionary entries are filled, the algorithm initiates a reuse procedure. The location of the first string entry in the dictionary is maintained in a variable N5. Starting from N5, a counter C1 is incremented until it finds a dictionary entry that is not a prefix to any other dictionary entry. The fact that this entry is not a prefix to another dictionary entry means that this pattern has not been encountered since it was created. Furthermore, because of the way it was located, among patterns of this kind this pattern has been around the longest. This reuse procedure enables the algorithm to prune the dictionary of strings that may have been encountered in the past but have not been encountered recently, on a continual basis. In this way the dictionary is always matched to the current source statistics. To reduce the effect of errors, the CCITT recommends setting a maximum string length. This maximum length is negotiated at link setup. The CCITT recommends a range of 6–250, with a default value of 6. The V.42 bis recommendation avoids the need for an exception handler for the case where the decoder receives a codeword corresponding to an incomplete entry by forbidding the use of the last entry in the dictionary. Instead of transmitting the codeword corresponding to the last entry, the recommendation requires the sending of the codewords corresponding T A B L E 5 . 20 Control codewords in compressed mode. Codeword Name Description 0 ETM Enter transparent mode 1 FLUSH Flush data 2 STEPUP Increment codeword size
  • 161. 138 5 D I C T I O N A R Y T E C H N I Q U E S to the constituents of the last entry. In the example used to demonstrate this quirk of the LZW algorithm, instead of transmitting the codeword 5, the V.42 bis recommendation would have forced us to send the codewords 3 and 1. 5 . 6 Summary In this chapter we have introduced techniques that keep a dictionary of recurring patterns and transmit the index of those patterns instead of the patterns themselves in order to achieve compression. There are a number of ways the dictionary can be constructed. In applications where certain patterns consistently recur, we can build application- specific static dictionaries. Care should be taken not to use these dictionaries outside their area of intended application. Otherwise, we may end up with data expansion instead of data compression. The dictionary can be the source output itself. This is the approach used by the LZ77 algorithm. When using this algorithm, there is an implicit assumption that recurrence of a pattern is a local phenomenon. This assumption is removed in the LZ78 approach, which dynamically constructs a dictionary from patterns observed in the source output. Dictionary-based algorithms are being used to compress all kinds of data; however, care should be taken with their use. This approach is most useful when structural constraints restrict the frequently occurring patterns to a small subset of all possible patterns. This is the case with text, as well as computer-to-computer communication. F u r t h e r R e a d i n g 1. Text Compression, by T.C. Bell, J.G. Cleary, and I.H. Witten [1], provides an excellent exposition of dictionary-based coding techniques. 2. The Data Compression Book, by M. Nelson and J.-L. Gailley [60], also does a good job of describing the Ziv-Lempel algorithms. There is also a very nice description of some of the software implementation aspects. 3. Data Compression, by G. Held and T.R. Marshall [61], contains a description of digram coding under the name “diatomic coding.” The book also includes BASIC programs that help in the design of dictionaries. 4. The PNG algorithm is described in a very accessible manner in “PNG Lossless Compression,” by G. Roelofs [62] in the Lossless Compression Handbook. 5. A more in-depth look at dictionary compression is provided in “Dictionary-Based Data Compression: An Algorithmic Perspective,” by S.C. Şahinalp and N.M. Rajpoot [63] in the Lossless Compression Handbook.
  • 162. 5.7 Projects and Problems 139 5 . 7 Projects and Problems 1. To study the effect of dictionary size on the efficiency of a static dictionary technique, we can modify Equation (5.1) so that it gives the rate as a function of both p and the dictionary size M. Plot the rate as a function of p for different values of M, and discuss the trade-offs involved in selecting larger or smaller values of M. 2. Design and implement a digram coder for text files of interest to you. (a) Study the effect of the dictionary size, and the size of the text file being encoded on the amount of compression. (b) Use the digram coder on files that are not similar to the ones you used to design the digram coder. How much does this affect your compression? 3. Given an initial dictionary consisting of the letters a b r y / b, encode the following message using the LZW algorithm: a/ bbar/ barray/ bby/ bbarrayar/ bbay. 4. A sequence is encoded using the LZW algorithm and the initial dictionary shown in Table 5.21. T A B L E 5 . 21 Initial dictionary for Problem 4. Index Entry 1 a 2 / b 3 h 4 i 5 s 6 t (a) The output of the LZW encoder is the following sequence: 6 3 4 5 2 3 1 6 2 9 11 16 12 14 4 20 10 8 23 13 Decode this sequence. (b) Encode the decoded sequence using the same initial dictionary. Does your answer match the sequence given above? 5. A sequence is encoded using the LZW algorithm and the initial dictionary shown in Table 5.22. (a) The output of the LZW encoder is the following sequence: 3 1 4 6 8 4 2 1 2 5 10 6 11 13 6 Decode this sequence.
  • 163. 140 5 D I C T I O N A R Y T E C H N I Q U E S T A B L E 5 . 22 Initial dictionary for Problem 5. Index Entry 1 a 2 / b 3 r 4 t (b) Encode the decoded sequence using the same initial dictionary. Does your answer match the sequence given above? 6. Encode the following sequence using the LZ77 algorithm: barrayar/ bbar/ bby/ bbarrayar/ bbay Assume you have a window size of 30 with a look-ahead buffer of size 15. Furthermore, assume that Ca = 1, Cb = 2, C/ b = 3, Cr = 4, and Cy = 5. 7. A sequence is encoded using the LZ77 algorithm. Given that Ca = 1, C/ b = 2, Cr = 3, and Ct = 4, decode the following sequence of triples: 003 001 004 282 312 003 644 954 Assume that the size of the window is 20 and the size of the look-ahead buffer is 10. Encode the decoded sequence and make sure you get the same sequence of triples. 8. Given the following primed dictionary and the received sequence below, build an LZW dictionary and decode the transmitted sequence. Received Sequence: 4, 5, 3, 1, 2, 8, 2, 7, 9, 7, 4 Decoded Sequence: Initial dictionary: (a) S (b) b (c) I (d) T (e) H
  • 164. 6 Context-Based Compression 6 . 1 Overview I n this chapter we present a number of techniques that use minimal prior assumptions about the statistics of the data. Instead they use the context of the data being encoded and the past history of the data to provide more efficient compression. We will look at a number of schemes that are principally used for the compression of text. These schemes use the context in which the data occurs in different ways. 6 . 2 Introduction In Chapters 3 and 4 we learned that we get more compression when the message that is being coded has a more skewed set of probabilities. By “skewed” we mean that certain symbols occur with much higher probability than others in the sequence to be encoded. So it makes sense to look for ways to represent the message that would result in greater skew. One very effective way to do so is to look at the probability of occurrence of a letter in the context in which it occurs. That is, we do not look at each symbol in a sequence as if it had just happened out of the blue. Instead, we examine the history of the sequence before determining the likely probabilities of different values that the symbol can take. In the case of English text, Shannon [8] showed the role of context in two very interesting experiments. In the first, a portion of text was selected and a subject (possibly his wife, Mary Shannon) was asked to guess each letter. If she guessed correctly, she was told that she was correct and moved on to the next letter. If she guessed incorrectly, she was told the correct answer and again moved on to the next letter. Here is a result from one of these experiments. Here the dashes represent the letters that were correctly guessed.
  • 165. 142 6 C O N T E X T - B A S E D C O M P R E S S I O N Actual Text T H E R O O M W A S N O T V E R Y L I G H T A S M A L L O B L O N G Subject Performance _ _ _ _ R O O _ _ _ _ _ _ N O T _ V _ _ _ _ _ I _ _ _ _ _ _ S M _ _ _ _ O B L _ _ _ Notice that there is a good chance that the subject will guess the letter, especially if the letter is at the end of a word or if the word is clear from the context. If we now represent the original sequence by the subject performance, we would get a very different set of probabilities for the values that each element of the sequence takes on. The probabilities are definitely much more skewed in the second row: the “letter” _ occurs with high probability. If a mathematical twin of the subject were available at the other end, we could send the “reduced” sentence in the second row and have the twin go through the same guessing process to come up with the original sequence. In the second experiment, the subject was allowed to continue guessing until she had guessed the correct letter and the number of guesses required to correctly predict the letter was noted. Again, most of the time the subject guessed correctly, resulting in 1 being the most probable number. The existence of a mathematical twin at the receiving end would allow this skewed sequence to represent the original sequence to the receiver. Shannon used his experiments to come up with upper and lower bounds for the English alphabet (1.3 bits per letter and 0.6 bits per letter, respectively). The difficulty with using these experiments is that the human subject was much better at predicting the next letter in a sequence than any mathematical predictor we can develop. Grammar is hypothesized to be innate to humans [64], in which case development of a predictor as efficient as a human for language is not possible in the near future. However, the experiments do provide an approach to compression that is useful for compression of all types of sequences, not simply language representations. If a sequence of symbols being encoded does not consist of independent occurrences of the symbols, then the knowledge of which symbols have occurred in the neighborhood of the symbol being encoded will give us a much better idea of the value of the symbol being encoded. If we know the context in which a symbol occurs we can guess with a much greater likelihood of success what the value of the symbol is. This is just another way of saying that, given the context, some symbols will occur with much higher probability than others. That is, the probability distribution given the context is more skewed. If the context is known to both encoder and decoder, we can use this skewed distribution to perform the encoding, thus increasing the level of compression. The decoder can use its knowledge of the context to determine the distribution to be used for decoding. If we can somehow group like contexts together, it is quite likely that the symbols following these contexts will be the same, allowing for the use of some very simple and efficient compression strategies. We can see that the context can play an important role in enhancing compression, and in this chapter we will look at several different ways of using the context. Consider the encoding of the word probability. Suppose we have already encoded the first four letters, and we want to code the fifth letter, a. If we ignore the first four letters, the probability of the letter a is about 0.06. 
If we use the information that the previous letter is b, this reduces the probability of several letters such as q and z occurring and boosts the probability of an a occurring. In this example, b would be the first-order context for a, ob would be the second-order context for a, and so on. Using more letters to define the context in which a occurs, or higher-order contexts, will generally increase the probability
of the occurrence of a in this example, and hence reduce the number of bits required to encode its occurrence. Therefore, what we would like to do is to encode each letter using the probability of its occurrence with respect to a context of high order. If we want to have probabilities with respect to all possible high-order contexts, this might be an overwhelming amount of information. Consider an alphabet of size M. The number of first-order contexts is M, the number of second-order contexts is M², and so on. Therefore, if we wanted to encode a sequence from an alphabet of size 256 using contexts of order 5, we would need 256⁵, or about 1.09951 × 10¹², probability distributions! This is not a practical alternative. A set of algorithms that resolve this problem in a very simple and elegant way is based on the prediction with partial match (ppm) approach. We will describe this in the next section.

6 . 3 Prediction with Partial Match (ppm)

The best-known context-based algorithm is the ppm algorithm, first proposed by Cleary and Witten [65] in 1984. It has not been as popular as the various Ziv-Lempel-based algorithms mainly because of the faster execution speeds of the latter algorithms. Lately, with the development of more efficient variants, ppm-based algorithms are becoming increasingly more popular. The idea of the ppm algorithm is elegantly simple. We would like to use large contexts to determine the probability of the symbol being encoded. However, the use of large contexts would require us to estimate and store an extremely large number of conditional probabilities, which might not be feasible. Instead of estimating these probabilities ahead of time, we can reduce the burden by estimating the probabilities as the coding proceeds. This way we only need to store those contexts that have occurred in the sequence being encoded. This is a much smaller number than the number of all possible contexts. While this mitigates the problem of storage, it also means that, especially at the beginning of an encoding, we will need to code letters that have not occurred previously in this context. In order to handle this situation, the source coder alphabet always contains an escape symbol, which is used to signal that the letter to be encoded has not been seen in this context.

6 . 3 . 1 The Basic Algorithm

The basic algorithm initially attempts to use the largest context. The size of the largest context is predetermined. If the symbol to be encoded has not previously been encountered in this context, an escape symbol is encoded and the algorithm attempts to use the next smaller context. If the symbol has not occurred in this context either, the size of the context is further reduced. This process continues until either we obtain a context that has previously been encountered with this symbol, or we arrive at the conclusion that the symbol has not been encountered previously in any context. In this case, we use a probability of 1/M to encode the symbol, where M is the size of the source alphabet. For example, when coding the a of probability, we would first attempt to see if the string proba has previously occurred—that is, if a had previously occurred in the context of prob. If not, we would encode an
  • 167. 144 6 C O N T E X T - B A S E D C O M P R E S S I O N escape and see if a had occurred in the context of rob. If the string roba had not occurred previously, we would again send an escape symbol and try the context ob. Continuing in this manner, we would try the context b, and failing that, we would see if the letter a (with a zero-order context) had occurred previously. If a was being encountered for the first time, we would use a model in which all letters occur with equal probability to encode a. This equiprobable model is sometimes referred to as the context of order −1. As the development of the probabilities with respect to each context is an adaptive process, each time a symbol is encountered, the count corresponding to that symbol is updated. The number of counts to be assigned to the escape symbol is not obvious, and a number of different approaches have been used. One approach used by Cleary and Witten is to give the escape symbol a count of one, thus inflating the total count by one. Cleary and Witten call this method of assigning counts Method A, and the resulting algorithm ppma. We will describe some of the other ways of assigning counts to the escape symbol later in this section. Before we delve into some of the details, let’s work through an example to see how all this works together. As we will be using arithmetic coding to encode the symbols, you might wish to refresh your memory of the arithmetic coding algorithms. E x a m p l e 6 . 3 . 1: Let’s encode the sequence this/ bis/ bthe/ btithe Assuming we have already encoded the initial seven characters this/ bis, the various counts and Cum_Count arrays to be used in the arithmetic coding of the symbols are shown in Tables 6.1–6.4. In this example, we are assuming that the longest context length is two. This is a rather small value and is used here to keep the size of the example reasonably small. A more common value for the longest context length is five. We will assume that the word length for arithmetic coding is six. Thus, l = 000000 and u = 111111. As this/ bis has already been encoded, the next letter to be encoded is / b. The second-order context for this letter is is. Looking at Table 6.4, we can see that the letter / b T A B L E 6 . 1 Count array for −1 order context. Letter Count Cum_Count t 1 1 h 1 2 i 1 3 s 1 4 e 1 5 / b 1 6 Total Count 6
T A B L E 6 . 2 Count array for zero-order context.
Letter   Count   Cum_Count
t        1       1
h        1       2
i        2       4
s        2       6
/ b      1       7
Esc      1       8
Total Count      8

T A B L E 6 . 3 Count array for first-order contexts.
Context   Letter   Count   Cum_Count
t         h        1       1
          Esc      1       2
          Total Count      2
h         i        1       1
          Esc      1       2
          Total Count      2
i         s        2       2
          Esc      1       3
          Total Count      3
/ b       i        1       1
          Esc      1       2
          Total Count      2
s         / b      1       1
          Esc      1       2
          Total Count      2

is the first letter in this context with a Cum_Count value of 1. As the Total_Count in this case is 2, the update equations for the lower and upper limits are

l = 0 + ⌊(63 − 0 + 1) × 0/2⌋ = 0 = 000000
u = 0 + ⌊(63 − 0 + 1) × 1/2⌋ − 1 = 31 = 011111
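Every step in this example uses the same interval-update rule from the integer implementation of arithmetic coding discussed earlier in the book. The short Python sketch below is not part of the standard ppm description (the function name and the list-based Cum_Count convention are just illustrative), but it reproduces the numbers above for the coding of / b, written here as an ordinary space, in the context is.

    def update_limits(l, u, cum_count, symbol_index, total_count):
        # One interval-update step of the integer arithmetic coder.
        # cum_count[k] is the cumulative count of the first k entries of the
        # current context's table, so symbol number j occupies the slice
        # (cum_count[j-1], cum_count[j]].
        rng = u - l + 1
        new_l = l + (rng * cum_count[symbol_index - 1]) // total_count
        new_u = l + (rng * cum_count[symbol_index]) // total_count - 1
        return new_l, new_u

    # Coding /b in the second-order context "is" (Table 6.4): the table holds
    # /b and Esc, so Cum_Count = [0, 1, 2] and Total_Count = 2.
    l, u = update_limits(0, 63, [0, 1, 2], 1, 2)
    print(l, u)                                  # 0 31
    print(format(l, "06b"), format(u, "06b"))    # 000000 011111

Coding the escape symbol in the next step is the same call with symbol_index = 2, which gives l = 32 and u = 63, matching the update that follows.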
  • 169. 146 6 C O N T E X T - B A S E D C O M P R E S S I O N T A B L E 6 . 4 Count array for second-order contexts. Context Letter Count Cum_Count th i 1 1 Esc 1 2 Total Count 2 hi s 1 1 Esc 1 2 Total Count 2 is / b 1 1 Esc 1 2 Total Count 2 s/ b i 1 1 Esc 1 2 Total Count 2 / bi s 1 1 Esc 1 2 Total Count 2 As the MSBs of both l and u are the same, we shift that bit out, shift a 0 into the LSB of l, and a 1 into the LSB of u. The transmitted sequence, lower limit, and upper limit after the update are Transmitted sequence 0 l 000000 u 111111 We also update the counts in Tables 6.2–6.4. The next letter to be encoded in the sequence is t. The second-order context is s/ b. Looking at Table 6.4, we can see that t has not appeared before in this context. We therefore encode an escape symbol. Using the counts listed in Table 6.4, we update the lower and upper limits: l = 0+ 63−0+1× 1 2 = 32 = 100000 u = 0+ 63−0+1× 2 2 −1 = 63 = 111111
  • 170. 6.3 Prediction with Partial Match (ppm) 147 Again, the MSBs of l and u are the same, so we shift the bit out and shift 0 into the LSB of l, and 1 into u, restoring l to a value of 0 and u to a value of 63. The transmitted sequence is now 01. After transmitting the escape, we look at the first-order context of t, which is / b. Looking at Table 6.3, we can see that t has not previously occurred in this context. To let the decoder know this, we transmit another escape. Updating the limits, we get l = 0+ 63−0+1× 1 2 = 32 = 100000 u = 0+ 63−0+1× 2 2 −1 = 63 = 111111 As the MSBs of l and u are the same, we shift the MSB out and shift 0 into the LSB of l and 1 into the LSB of u. The transmitted sequence is now 011. Having escaped out of the first-order contexts, we examine Table 6.5, the updated version of Table 6.2, to see if we can encode t using a zero-order context. Indeed we can, and using the Cum_Count array, we can update l and u: l = 0+ 63−0+1× 0 9 = 0 = 000000 u = 0+ 63−0+1× 1 9 −1 = 6 = 000110 T A B L E 6 . 5 Updated count array for zero-order context. Letter Count Cum_Count t 1 1 h 1 2 i 2 4 s 2 6 / b 2 8 Esc 1 9 Total Count 9 The three most significant bits of both l and u are the same, so we shift them out. After the update we get Transmitted sequence 011000 l 000000 u 110111
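The steps just traced for t (an escape out of the order-two context, an escape out of the order-one context, and finally a code in the order-zero context) illustrate the escape cascade that drives the whole algorithm. The sketch below records the sequence of coding decisions for one symbol; it is a Python illustration only, with plain spaces standing in for / b, an invented table layout, and the arithmetic coding and count updates left out.

    def ppm_code_events(symbol, history, tables, max_order=2):
        # tables maps a context string (the empty string for the zero-order
        # context) to a dict of symbol counts, mirroring Tables 6.2-6.4.
        # Each returned event is (context, symbol actually coded).
        events = []
        for order in range(max_order, -1, -1):
            context = history[len(history) - order:]   # assumes len(history) >= order
            table = tables.get(context)
            if table is None:
                continue                     # context never seen: nothing is sent
            if symbol in table:
                events.append((context, symbol))       # code the symbol here
                return events
            events.append((context, "Esc"))            # escape to a shorter context
        events.append(("order -1", symbol))            # equiprobable fallback
        return events

    # The letter t after "this is " escapes twice and is coded at order zero:
    tables = {"s ": {"i": 1}, " ": {"i": 1},
              "": {"t": 1, "h": 1, "i": 2, "s": 2, " ": 2}}
    print(ppm_code_events("t", "this is ", tables))
    # [('s ', 'Esc'), (' ', 'Esc'), ('', 't')]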
  • 171. 148 6 C O N T E X T - B A S E D C O M P R E S S I O N The next letter to be encoded is h. The second-order context / bt has not occurred previ- ously, so we move directly to the first-order context t. The letter h has occurred previously in this context, so we update l and u and obtain Transmitted sequence 0110000 l 000000 u 110101 T A B L E 6 . 6 Count array for zero-order context. Letter Count Cum_Count t 2 2 h 2 4 i 2 6 s 2 8 / b 2 10 Esc 1 11 Total Count 11 T A B L E 6 . 7 Count array for first-order contexts. Context Letter Count Cum_Count t h 2 2 Esc 1 3 Total Count 3 h i 1 1 Esc 1 2 Total Count 2 i s 2 2 Esc 1 3 Total Count 3 / b i 1 1 t 1 2 Esc 1 3 Total Count 3 s / b 2 2 Esc 1 3 Total Count 3
  • 172. 6.3 Prediction with Partial Match (ppm) 149 T A B L E 6 . 8 Count array for second-order contexts. Context Letter Count Cum_Count th i 1 1 Esc 1 2 Total Count 2 hi s 1 1 Esc 1 2 Total Count 2 is / b 2 2 Esc 1 3 Total Count 3 s/ b i 1 1 t 1 2 Esc 1 3 Total Count 3 / bi s 1 1 Esc 1 2 Total Count 2 / bt h 1 1 Esc 1 2 Total Count 2 The method of encoding should now be clear. At this point the various counts are as shown in Tables 6.6–6.8. Now that we have an idea of how the ppm algorithm works, let’s examine some of the variations. 6 . 3 . 2 T h e E s c a p e S y m b o l In our example we used a count of one for the escape symbol, thus inflating the total count in each context by one. Cleary and Witten call this Method A, and the corresponding algorithm is referred to as ppma. There is really no obvious justification for assigning a count of one to the escape symbol. For that matter, there is no obvious method of assigning counts to the escape symbol. There have been various methods reported in the literature. Another method described by Cleary and Witten is to reduce the counts of each symbol by one and assign these counts to the escape symbol. For example, suppose in a given
  • 173. 150 6 C O N T E X T - B A S E D C O M P R E S S I O N T A B L E 6 . 9 Counts using Method A. Context Symbol Count prob a 10 l 9 o 3 Esc 1 Total Count 23 T A B L E 6 . 10 Counts using Method B. Context Symbol Count prob a 9 l 8 o 2 Esc 3 Total Count 22 sequence a occurs 10 times in the context of prob, l occurs 9 times, and o occurs 3 times in the same context (e.g., problem, proboscis, etc.). In Method A we assign a count of one to the escape symbol, resulting in a total count of 23, which is one more than the number of times prob has occurred. The situation is shown in Table 6.9. In this second method, known as Method B, we reduce the count of each of the symbols al, and o by one and give the escape symbol a count of three, resulting in the counts shown in Table 6.10. The reasoning behind this approach is that if in a particular context more symbols can occur, there is a greater likelihood that there is a symbol in this context that has not occurred before. This increases the likelihood that the escape symbol will be used. Therefore, we should assign a higher probability to the escape symbol. A variant of Method B, appropriately named Method C, was proposed by Moffat [66]. In Method C, the count assigned to the escape symbol is the number of symbols that have occurred in that context. In this respect, Method C is similar to Method B. The difference comes in the fact that, instead of “robbing” this from the counts of individual symbols, the total count is inflated by this amount. This situation is shown in Table 6.11. While there is some variation in the performance depending on the characteristics of the data being encoded, of the three methods for assigning counts to the escape symbol, on the average, Method C seems to provide the best performance. 6 . 3 . 3 L e n g t h o f C o n t e x t It would seem that as far as the maximum length of the contexts is concerned, more is better. However, this is not necessarily true. A longer maximum length will usually result
  • 174. 6.3 Prediction with Partial Match (ppm) 151 T A B L E 6 . 11 Counts using Method C. Context Symbol Count prob a 10 l 9 o 3 Esc 3 Total Count 25 in a higher probability if the symbol to be encoded has a nonzero count with respect to that context. However, a long maximum length also means a higher probability of long sequences of escapes, which in turn can increase the number of bits used to encode the sequence. If we plot the compression performance versus maximum context length, we see an initial sharp increase in performance until some value of the maximum length, followed by a steady drop as the maximum length is further increased. The value at which we see a downturn in performance changes depending on the characteristics of the source sequence. An alternative to the policy of a fixed maximum length is used in the algorithm ppm∗ [67]. This algorithm uses the fact that long contexts that give only a single prediction are seldom followed by a new symbol. If mike has always been followed by y in the past, it will probably not be followed by / b the next time it is encountered. Contexts that are always followed by the same symbol are called deterministic contexts. The ppm∗ algorithm first looks for the longest deterministic context. If the symbol to be encoded does not occur in that context, an escape symbol is encoded and the algorithm defaults to the maximum context length. This approach seems to provide a small but significant amount of improvement over the basic algorithm. Currently, the best variant of the ppm∗ algorithm is the ppmz algorithm by Charles Bloom. Details of the ppmz algorithm as well as implementations of the algorithm can be found at http://www.cbloom.com/src/ppmz.html. 6 . 3 . 4 T h e E x c l u s i o n P r i n c i p l e The basic idea behind arithmetic coding is the division of the unit interval into subintervals, each of which represents a particular letter. The smaller the subinterval, the more bits are required to distinguish it from other subintervals. If we can reduce the number of symbols to be represented, the number of subintervals goes down as well. This in turn means that the sizes of the subintervals increase, leading to a reduction in the number of bits required for encoding. The exclusion principle used in ppm provides this kind of reduction in rate. Suppose we have been compressing a text sequence and come upon the sequence proba, and suppose we are trying to encode the letter a. Suppose also that the state of the two-letter context ob and the one-letter context b are as shown in Table 6.12. First we attempt to encode a with the two-letter context. As a does not occur in this context, we issue an escape symbol and reduce the size of the context. Looking at the table for the one-letter context b, we see that a does occur in this context with a count of 4 out of a total possible count of 21. Notice that other letters in this context include l and o. However,
  • 175. 152 6 C O N T E X T - B A S E D C O M P R E S S I O N T A B L E 6 . 12 Counts for exclusion example. Context Symbol Count ob l 10 o 3 Esc 2 Total Count 15 b l 5 o 3 a 4 r 2 e 2 Esc 5 Total Count 21 T A B L E 6 . 13 Modified table used for exclusion example. Context Symbol Count b a 4 r 2 e 2 Esc 3 Total Count 11 by sending the escape symbol in the context of ob, we have already signalled to the decoder that the symbol being encoded is not any of the letters that have previously been encountered in the context of ob. Therefore, we can increase the size of the subinterval corresponding to a by temporarily removing l and o from the table. Instead of using Table 6.12, we use Table 6.13 to encode a. This exclusion of symbols from contexts on a temporary basis can result in cumulatively significant savings in terms of rate. You may have noticed that we keep talking about small but significant savings. In lossless compression schemes, there is usually a basic principle, such as the idea of prediction with partial match, followed by a host of relatively small modifications. The importance of these modifications should not be underestimated because often together they provide the margin of compression that makes a particular scheme competitive. 6 . 4 The Burrows-Wheeler Transform The Burrows-Wheeler Transform (BWT) algorithm also uses the context of the symbol being encoded, but in a very different way, for lossless compression. The transform that
  • 176. 6.4 The Burrows-Wheeler Transform 153 is a major part of this algorithm was developed by Wheeler in 1983. However, the BWT compression algorithm, which uses this transform, saw the light of day in 1994 [68]. Unlike most of the previous algorithms we have looked at, the BWT algorithm requires that the entire sequence to be coded be available to the encoder before the coding takes place. Also, unlike most of the previous algorithms, the decoding procedure is not immediately evident once we know the encoding procedure. We will first describe the encoding procedure. If it is not clear how this particular encoding can be reversed, bear with us and we will get to it. The algorithm can be summarized as follows. Given a sequence of length N, we create N −1 other sequences where each of these N −1 sequences is a cyclic shift of the original sequence. These N sequences are arranged in lexicographic order. The encoder then transmits the sequence of length N created by taking the last letter of each sorted, cyclically shifted, sequence. This sequence of last letters L, and the position of the original sequence in the sorted list, are coded and sent to the decoder. As we shall see, this information is sufficient to recover the original sequence. We start with a sequence of length N and end with a representation that contains N + 1 elements. However, this sequence has a structure that makes it highly amenable to compression. In particular we will use a method of coding called move-to-front (mtf), which is particularly effective on the type of structure exhibited by the sequence L. Before we describe the mtf approach, let us work through an example to generate the L sequence. E x a m p l e 6 . 4 . 1: Let’s encode the sequence this/ bis/ bthe We start with all the cyclic permutations of this sequence. As there are a total of 11 characters, there are 11 permutations, shown in Table 6.14. T A B L E 6 . 14 Permutations of this/ bis/ bthe. 0 t h i s / b i s / b t h e 1 h i s / b i s / b t h e t 2 i s / b i s / b t h e t h 3 s / b i s / b t h e t h i 4 / b i s / b t h e t h i s 5 i s / b t h e t h i s / b 6 s / b t h e t h i s / b i 7 / b t h e t h i s / b i s 8 t h e t h i s / b i s / b 9 h e t h i s / b i s / b t 10 e t h i s / b i s / b t h
  • 177. 154 6 C O N T E X T - B A S E D C O M P R E S S I O N T A B L E 6 . 15 Sequences sorted into lexicographic order. 0 / b i s / b t h e t h i s 1 / b t h e t h i s / b i s 2 e t h i s / b i s / b t h 3 h e t h i s / b i s / b t 4 h i s / b i s / b t h e t 5 i s / b i s / b t h e t h 6 i s / b t h e t h i s / b 7 s / b i s / b t h e t h i 8 s / b t h e t h i s / b i 9 t h e t h i s / b i s / b 10 t h i s / b i s / b t h e Now let’s sort these sequences in lexicographic (dictionary) order (Table 6.15). The sequence of last letters L in this case is L sshtth/ bii/ be Notice how like letters have come together. If we had a longer sequence of letters, the runs of like letters would have been even longer. The mtf algorithm, which we will describe later, takes advantage of these runs. The original sequence appears as sequence number 10 in the sorted list, so the encoding of the sequence consists of the sequence L and the index value 10. Now that we have an encoding of the sequence, let’s see how we can decode the original sequence by using the sequence L and the index to the original sequence in the sorted list. The important thing to note is that all the elements of the initial sequence are contained in L. We just need to figure out the permutation that will let us recover the original sequence. The first step in obtaining the permutation is to generate the sequence F consisting of the first element of each row. That is simple to do because we lexicographically ordered the sequences. Therefore, the sequence F is simply the sequence L in lexicographic order. In our example this means that F is given as F / b/ behhiisstt We can use L and F to generate the original sequence. Look at Table 6.15 containing the cyclically shifted sequences sorted in lexicographic order. Because each row is a cyclical shift, the letter in the first column of any row is the letter appearing after the last column in the row in the original sequence. If we know that the original sequence is in the kth row, then we can begin unraveling the original sequence starting with the kth element of F.
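Before turning to the decoder, it may help to see the whole forward transform in one place. The sketch below (Python, with an ordinary space standing in for / b; the function name is only illustrative) reproduces the L sequence and the index value 10 of Example 6.4.1.

    def bwt_forward(s):
        # Form all cyclic shifts, sort them, and keep the last letter of each
        # sorted row together with the position of the original sequence.
        n = len(s)
        rows = sorted(s[i:] + s[:i] for i in range(n))
        L = "".join(row[-1] for row in rows)
        return L, rows.index(s)

    L, k = bwt_forward("this is the")
    print(repr(L), k)        # 'sshtth ii e' 10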
E x a m p l e 6 . 4 . 2: In our example, written as rows to save space,

F = (/ b, / b, e, h, h, i, i, s, s, t, t)    L = (s, s, h, t, t, h, / b, i, i, / b, e)

The original sequence is sequence number 10, so the first letter of the original sequence is F10 = t. To find the letter following t, we look for t in the array L. There are two t's in L. Which should we use? The t in F that we are working with is the lower of the two t's, so we pick the lower of the two t's in L. This is L4. Therefore, the next letter in our reconstructed sequence is F4 = h. The reconstructed sequence to this point is th. To find the next letter, we look for h in the L array. Again there are two h's. The h at F4 is the lower of the two h's in F, so we pick the lower of the two h's in L. This is the fifth element of L, so the next element in our decoded sequence is F5 = i. The decoded sequence to this point is thi. The process continues as depicted in Figure 6.1 to generate the original sequence.

F I G U R E 6 . 1 Decoding process (indices 0-10, with F = / b / b e h h i i s s t t across the top and L = s s h t t h / b i i / b e across the bottom).
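The matching of occurrences between F and L described above is all the decoder needs. A minimal sketch of that procedure (Python, again with spaces for / b; the helper names are only for this illustration) recovers the sequence of Example 6.4.1:

    from collections import defaultdict

    def bwt_inverse(L, k):
        # The m-th occurrence of a letter in F and the m-th occurrence of the
        # same letter in L refer to the same position of the original sequence,
        # so from row i we step to the row of L holding the matching copy of F[i].
        F = sorted(L)
        rows_of = defaultdict(list)      # rows of L containing each letter, in order
        for row, ch in enumerate(L):
            rows_of[ch].append(row)
        seen = defaultdict(int)
        next_row = [0] * len(L)          # next_row[i]: row whose first letter follows F[i]
        for i, ch in enumerate(F):
            next_row[i] = rows_of[ch][seen[ch]]
            seen[ch] += 1
        out, row = [], k
        for _ in range(len(L)):
            out.append(F[row])
            row = next_row[row]
        return "".join(out)

    print(bwt_inverse("sshtth ii e", 10))    # this is the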
  • 179. 156 6 C O N T E X T - B A S E D C O M P R E S S I O N Why go through all this trouble? After all, we are going from a sequence of length N to another sequence of length N plus an index value. It appears that we are actually causing expansion instead of compression. The answer is that the sequence L can be compressed much more efficiently than the original sequence. Even in our small example we have runs of like symbols. This will happen a lot more when N is large. Consider a large sample of text that has been cyclically shifted and sorted. Consider all the rows of A beginning with he/ b. With high probability he/ b would be preceded by t. Therefore, in L we would get a long run of ts. 6 . 4 . 1 M o v e - t o - F r o n t C o d i n g A coding scheme that takes advantage of long runs of identical symbols is the move-to-front (mtf) coding. In this coding scheme, we start with some initial listing of the source alphabet. The symbol at the top of the list is assigned the number 0, the next one is assigned the number 1, and so on. The first time a particular symbol occurs, the number corresponding to its place in the list is transmitted. Then it is moved to the top of the list. If we have a run of this symbol, we transmit a sequence of 0s. This way, long runs of different symbols get transformed to a large number of 0s. Applying this technique to our example does not produce very impressive results due to the small size of the sequence, but we can see how the technique functions. E x a m p l e 6 . 4 . 3: Let’s encode L = sshtth/ bii/ be. Let’s assume that the source alphabet is given by = / behist We start out with the assignment 0 1 2 3 4 5 / b e h i s t The first element of L is s, which gets encoded as a 4. We then move s to the top of the list, which gives us 0 1 2 3 4 5 s / b e h i t The next s is encoded as 0. Because s is already at the top of the list, we do not need to make any changes. The next letter is h, which we encode as 3. We then move h to the top of the list:
  • 180. 6.5 Associative Coder of Buyanovsky (ACB) 157 0 1 2 3 4 5 h s / b e i t The next letter is t, which gets encoded as 5. Moving t to the top of the list, we get 0 1 2 3 4 5 t h s / b e i The next letter is also a t, so that gets encoded as a 0. Continuing in this fashion, we get the sequence 4 0 3 5 0 1 3 5 0 1 5 As we warned, the results are not too impressive with this small sequence, but we can see how we would get large numbers of 0s and small values if the sequence to be encoded was longer. 6 . 5 Associative Coder of Buyanovsky (ACB) A different approach to using contexts for compression is employed by the eponymous compression utility developed by George Buyanovsky. The details of this very efficient coder are not well known; however, the way the context is used is interesting and we will briefly describe this aspect of ACB. More detailed descriptions are available in [69] and [70]. The ACB coder develops a sorted dictionary of all encountered contexts. In this it is similar to other context based encoders. However, it also keeps track of the contents of these contexts. The content of a context is what appears after the context. In a traditional left-to-right reading of text, the contexts are unbounded to the left and the contents to the right (to the limits of text that has already been encoded). When encoding the coder searches for the longest match to the current context reading right to left. This again is not an unusual thing to do. What is interesting is what the coder does after the best match is found. Instead of simply examining the content corresponding to the best matched context, the coder also examines the contents of the coders in the neighborhood of the best matched contexts. Fenwick [69] describes this process as first finding an anchor point then searching the contents of the neighboring contexts for the best match. The location of the anchor point is known to both the encoder and the decoder. The location of the best content match is signalled to the decoder by encoding the offset of the context of this content from the anchor point. We have not specified what we mean by “best” match. The coder takes the utilitarian approach that the best match is the one that ends up providing the most compression. Thus, a longer match farther away from the anchor may not be as advantageous as a shorter match closer to the anchor because of the number of bits required to encode . The length of the match is also sent to the decoder.
  • 181. 158 6 C O N T E X T - B A S E D C O M P R E S S I O N The interesting aspect of this scheme is that it moves away from the idea of exactly matching the past. It provides a much richer environment and flexibility to enhance the compression and will, hopefully, provide a fruitful avenue for further research. 6 . 6 Dynamic Markov Compression Quite often the probabilities of the value that the next symbol in a sequence takes on depend not only on the current value but on the past values as well. The ppm scheme relies on this longer-range correlation. The ppm scheme, in some sense, reflects the application, that is, text compression, for which it is most used. Dynamic Markov compression (DMC), introduced by Cormack and Horspool [71], uses a more general framework to take advantage of relationships and correlations, or contexts, that extend beyond a single symbol. Consider the sequence of pixels in a scanned document. The sequence consists of runs of black and white pixels. If we represent black by 0 and white by 1, we have runs of 0s and 1s. If the current value is 0, the probability that the next value is 0 is higher than if the current value was 1. The fact that we have two different sets of probabilities is reflected in the two-state model shown in Figure 6.2. Consider state A. The probability of the next value being 1 changes depending on whether we reached state A from state B or from state A itself. We can have the model reflect this by cloning state A, as shown in Figure 6.3, to create state A . Now if we see a white pixel after a run of black pixels, we go to state A . The probability that the next value will be 1 is very high in this state. This way, when we estimate probabilities for the next pixel value, we take into account not only the value of the current pixel but also the value of the previous pixel. This process can be continued as long as we wish to take into account longer and longer histories. “As long as we wish” is a rather vague statement when it comes to implementing the algorithm. In fact, we have been rather vague about a number of implementation issues. We will attempt to rectify the situation. There are a number of issues that need to be addressed in order to implement this algorithm: 1. What is the initial number of states? 2. How do we estimate probabilities? 1 0 0 1 A B F I G U R E 6. 2 A two-state model for binary sequences.
F I G U R E 6 . 3 A three-state model obtained by cloning (states A, A′, and B).

3. How do we decide when a state needs to be cloned?

4. What do we do when the number of states becomes too large?

Let's answer each question in turn. We can start the encoding process with a single state with two self-loops for 0 and 1. This state can be cloned to two and then a higher number of states. In practice it has been found that, depending on the particular application, it is more efficient to start with a larger number of states than one. The probabilities from a given state can be estimated by simply counting the number of times a 0 or a 1 occurs in that state divided by the number of times the particular state is occupied. For example, if in state V the number of times a 0 occurs is denoted by n_0^V and the number of times a 1 occurs is denoted by n_1^V, then

P(0 | V) = n_0^V / (n_0^V + n_1^V)        P(1 | V) = n_1^V / (n_0^V + n_1^V)

What if a 1 has never previously occurred in this state? This approach would assign a probability of zero to the occurrence of a 1. This means that there will be no subinterval assigned to the possibility of a 1 occurring, and when it does occur, we will not be able to represent it. In order to avoid this, instead of counting from zero, we start the count of 1s and 0s with a small number c and estimate the probabilities as

P(0 | V) = (n_0^V + c) / (n_0^V + n_1^V + 2c)        P(1 | V) = (n_1^V + c) / (n_0^V + n_1^V + 2c)
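As a small illustration of these estimates (a sketch only; the value of c is a user choice, and the c = 0.2 used here is not prescribed by the algorithm):

    def dmc_probabilities(n0, n1, c=0.2):
        # n0 and n1 are the number of 0s and 1s seen so far while in this state;
        # the starting count c keeps either probability from becoming exactly zero.
        total = n0 + n1 + 2 * c
        return (n0 + c) / total, (n1 + c) / total

    # Five 0s and no 1s: a 1 can still be represented.
    print(dmc_probabilities(5, 0))           # (0.962..., 0.037...)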
  • 183. 160 6 C O N T E X T - B A S E D C O M P R E S S I O N 0 1 0 0 0 1 Cloning 0 A B C 0 A C 1 0 C' B F I G U R E 6. 4 The cloning process. Whenever we have two branches leading to a state, it can be cloned. And, theoretically, cloning is never harmful. By cloning we are providing additional information to the encoder. This might not reduce the rate, but it should never result in an increase in the rate. However, cloning does increase the complexity of the coding, and hence the decoding, process. In order to control the increase in the number of states, we should only perform cloning when there is a reasonable expectation of reduction in rate. We can do this by making sure that both paths leading to the state being considered for cloning are used often enough. Consider the situation shown in Figure 6.4. Suppose the current state is A and the next state is C. As there are two paths entering C, C is a candidate for cloning. Cormack and Horspool suggest that C be cloned if nA 0 T1 and nB 0 T2, where T1 and T2 are threshold values set by the user. If there are more than three paths leading to a candidate for cloning, then we check that both the number of transitions from the current state is greater than T1 and the number of transitions from all other states to the candidate state is greater than T2. Finally, what do we do when, for practical reasons, we cannot accommodate any more states? A simple solution is to restart the algorithm. In order to make sure that we do not start from ground zero every time, we can train the initial state configuration using a certain number of past inputs. 6 . 7 Summary The context in which a symbol occurs can be very informative about the value that the symbol takes on. If this context is known to the decoder then this information need not be encoded: it can be inferred by the decoder. In this chapter we have looked at several creative ways in which the knowledge of the context can be used to provide compression.
  • 184. 6.8 Projects and Problems 161 F u r t h e r R e a d i n g 1. The basic ppm algorithm is described in detail in Text Compression, by T.C. Bell, J.G. Cleary, and I.H. Witten [1]. 2. For an excellent description of Burrows-Wheeler Coding, including methods of imple- mentation and improvements to the basic algorithm, see “Burrows-Wheeler Compres- sion,” by P. Fenwick [72] in Lossless Compression Handbook. 3. The ACB algorithm is described in “Symbol Ranking and ACB Compression,” by P. Fenwick [69] in the Lossless Compression Handbook, and in Data Compression: The Complete Reference by D. Salomon [70]. The chapter by Fenwick also explores compression schemes based on Shannon’s experiments. 6 . 8 Projects and Problems 1. Decode the bitstream generated in Example 6.3.1. Assume you have already decoded this/ bis and Tables 6.1–6.4 are available to you. 2. Given the sequence the/ bbeta/ bcat/ bate/ bthe/ bceta/ bhat: (a) Encode the sequence using the ppma algorithm and an adaptive arithmetic coder. Assume a six-letter alphabet hetac/ b . (b) Decode the encoded sequence. 3. Given the sequence eta/ bceta/ band/ bbeta/ bceta: (a) Encode using the Burrows-Wheeler transform and move-to-front coding. (b) Decode the encoded sequence. 4. A sequence is encoded using the Burrows-Wheeler transform. Given L = elbkkee, and index = 5 (we start counting from 1, not 0), find the original sequence.
  • 186. 7 Lossless Image Compression 7 . 1 Overview I n this chapter we examine a number of schemes used for lossless compression of images. We will look at schemes for compression of grayscale and color images as well as schemes for compression of binary images. Among these schemes are several that are a part of international standards. 7 . 2 Introduction In the previous chapters we have focused on compression techniques. Although some of them may apply to some preferred applications, the focus has been on the technique rather than on the application. However, there are certain techniques for which it is impossible to separate the technique from the application. This is because the techniques rely upon the properties or characteristics of the application. Therefore, we have several chapters in this book that focus on particular applications. In this chapter we will examine techniques specifically geared toward lossless image compression. Later chapters will examine speech, audio, and video compression. In the previous chapters we have seen that a more skewed set of probabilities for the message being encoded results in better compression. In Chapter 6 we saw how the use of context to obtain a skewed set of probabilities can be especially effective when encoding text. We can also transform the sequence (in an invertible fashion) into another sequence that has the desired property in other ways. For example, consider the following sequence: 1 2 5 7 2 −2 0 −5 −3 −1 1 −2 −7 −4 −2 1 3 4
  • 187. 164 7 L O S S L E S S I M A G E C O M P R E S S I O N If we consider this sample to be fairly typical of the sequence, we can see that the probability of any given number being in the range from −7 to 7 is about the same. If we were to encode this sequence using a Huffman or arithmetic code, we would use almost 4 bits per symbol. Instead of encoding this sequence directly, we could do the following: add two to the previous number in the sequence and send the difference between the current element in the sequence and this predicted value. The transmitted sequence would be 1 −1 1 0 −7 −4 0 −7 0 0 0 −5 −7 1 0 1 0 −1 This method uses a rule (add two) and the history (value of the previous symbol) to generate the new sequence. If the rule by which this residual sequence was generated is known to the decoder, it can recover the original sequence from the residual sequence. The length of the residual sequence is the same as the original sequence. However, notice that the residual sequence is much more likely to contain 0s, 1s, and −1s than other values. That is, the probability of 0, 1, and −1 will be significantly higher than the probabilities of other numbers. This, in turn, means that the entropy of the residual sequence will be low and, therefore, provide more compression. We used a particular method of prediction in this example (add two to the previous element of the sequence) that was specific to this sequence. In order to get the best possible performance, we need to find the prediction approach that is best suited to the particular data we are dealing with. We will look at several prediction schemes used for lossless image compression in the following sections. 7 . 2 . 1 T h e O l d J P E G S t a n d a r d The Joint Photographic Experts Group (JPEG) is a joint ISO/ITU committee responsible for developing standards for continuous-tone still-picture coding. The more famous standard produced by this group is the lossy image compression standard. However, at the time of the creation of the famous JPEG standard, the committee also created a lossless standard [73]. At this time the standard is more or less obsolete, having been overtaken by the much more efficient JPEG-LS standard described later in this chapter. However, the old JPEG standard is still useful as a first step into examining predictive coding in images. The old JPEG lossless still compression standard [73] provides eight different predictive schemes from which the user can select. The first scheme makes no prediction. The next seven are listed below. Three of the seven are one-dimensional predictors, and four are two-dimensional prediction schemes. Here, Iij is the ijth pixel of the original image, and Îij is the predicted value for the ijth pixel. 1 Îij = Ii−1j (7.1) 2 Îij = Iij −1 (7.2) 3 Îij = Ii−1j −1 (7.3) 4 Îij = Iij −1+Ii−1j−Ii−1j −1 (7.4)
5   Î(i,j) = I(i,j−1) + (I(i−1,j) − I(i−1,j−1))/2    (7.5)
6   Î(i,j) = I(i−1,j) + (I(i,j−1) − I(i−1,j−1))/2    (7.6)
7   Î(i,j) = (I(i,j−1) + I(i−1,j))/2                 (7.7)

Different images can have different structures that can be best exploited by one of these eight modes of prediction. If compression is performed in a nonreal-time environment—for example, for the purposes of archiving—all eight modes of prediction can be tried and the one that gives the most compression is used. The mode used to perform the prediction can be stored in a 3-bit header along with the compressed file. We encoded our four test images using the various JPEG modes. The residual images were encoded using adaptive arithmetic coding. The results are shown in Table 7.1. The best results—that is, the smallest compressed file sizes—are indicated in bold in the table. From these results we can see that a different JPEG predictor is the best for the different images.

T A B L E 7 . 1 Compressed file size in bytes of the residual images obtained using the various JPEG prediction modes.
Image     JPEG 0   JPEG 1   JPEG 2   JPEG 3   JPEG 4   JPEG 5   JPEG 6   JPEG 7
Sena      53,431   37,220   31,559   38,261   31,055   29,742   33,063   32,179
Sensin    58,306   41,298   37,126   43,445   32,429   33,463   35,965   36,428
Earth     38,248   32,295   32,137   34,089   33,570   33,057   33,072   32,672
Omaha     56,061   48,818   51,283   53,909   53,771   53,520   52,542   52,189

In Table 7.2, we compare the best JPEG results with the file sizes obtained using GIF and PNG. Note that PNG also uses predictive coding with four possible predictors, where each row of the image can be encoded using a different predictor. The PNG approach is described in Chapter 5.

T A B L E 7 . 2 Comparison of the file sizes obtained using JPEG lossless compression, GIF, and PNG.
Image     Best JPEG   GIF      PNG
Sena      31,055      51,085   31,577
Sensin    32,429      60,649   34,488
Earth     32,137      34,276   26,995
Omaha     48,818      61,341   50,185

Even if we take into account the overhead associated with GIF, from this comparison we can see that the predictive approaches are generally better suited to lossless image compression than the dictionary-based approach when the images are “natural” gray-scale images. The situation is different when the images are graphic images or pseudocolor images. A possible exception could be the Earth image. The best compressed file size using the second JPEG mode and adaptive arithmetic coding is 32,137 bytes, compared to 34,276 bytes using GIF. The difference between the file sizes is not significant. We can see the reason by looking at the Earth image. Note that a significant portion of the image is the
  • 189. 166 7 L O S S L E S S I M A G E C O M P R E S S I O N background, which is of a constant value. In dictionary coding, this would result in some very long entries that would provide significant compression. We can see that if the ratio of background to foreground were just a little different in this image, the dictionary method in GIF might have outperformed the JPEG approach. The PNG approach which allows the use of a different predictor (or no predictor) on each row, prior to dictionary coding significantly outperforms both GIF and JPEG on this image. 7 . 3 CALIC The Context Adaptive Lossless Image Compression (CALIC) scheme, which came into being in response to a call for proposal for a new lossless image compression scheme in 1994 [74, 75], uses both context and prediction of the pixel values. The CALIC scheme actually functions in two modes, one for gray-scale images and another for bi-level images. In this section, we will concentrate on the compression of gray-scale images. In an image, a given pixel generally has a value close to one of its neighbors. Which neighbor has the closest value depends on the local structure of the image. Depending on whether there is a horizontal or vertical edge in the neighborhood of the pixel being encoded, the pixel above, or the pixel to the left, or some weighted average of neighboring pixels may give the best prediction. How close the prediction is to the pixel being encoded depends on the surrounding texture. In a region of the image with a great deal of variability, the prediction is likely to be further from the pixel being encoded than in the regions with less variability. In order to take into account all these factors, the algorithm has to make a determination of the environment of the pixel to be encoded. The only information that can be used to make this determination has to be available to both encoder and decoder. Let’s take up the question of the presence of vertical or horizontal edges in the neigh- borhood of the pixel being encoded. To help our discussion, we will refer to Figure 7.1. In this figure, the pixel to be encoded has been marked with an X. The pixel above is called the north pixel, the pixel to the left is the west pixel, and so on. Note that when pixel X is being encoded, all other marked pixels (N W NW NE WW NN NE, and NNE) are available to both encoder and decoder. NN NNE NW N NE WW W X F I G U R E 7. 1 Labeling the neighbors of pixel X.
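In code, the labeling of Figure 7.1 simply picks out pixels above and to the left of X, all of which the decoder has already reconstructed. The helper below is only an illustration of that labeling (border handling is omitted, and the function name is not part of CALIC):

    def neighbors(img, i, j):
        # img[i][j] is the pixel X being coded; every neighbor named in
        # Figure 7.1 lies above it or to its left, so the decoder knows it too.
        return {
            "N":   img[i - 1][j],     "W":   img[i][j - 1],
            "NW":  img[i - 1][j - 1], "NE":  img[i - 1][j + 1],
            "WW":  img[i][j - 2],     "NN":  img[i - 2][j],
            "NNE": img[i - 2][j + 1],
        }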
We can get an idea of what kinds of boundaries may or may not be in the neighborhood of X by computing

dh = |W − WW| + |N − NW| + |NE − N|
dv = |W − NW| + |N − NN| + |NE − NNE|

The relative values of dh and dv are used to obtain the initial prediction of the pixel X. This initial prediction is then refined by taking other factors into account. If the value of dh is much higher than the value of dv, this will mean there is a large amount of horizontal variation, and it would be better to pick N to be the initial prediction. If, on the other hand, dv is much larger than dh, this would mean that there is a large amount of vertical variation, and the initial prediction is taken to be W. If the differences are more moderate or smaller, the predicted value is a weighted average of the neighboring pixels. The exact algorithm used by CALIC to form the initial prediction is given by the following pseudocode:

if dh − dv > 80
    X̂ ← N
else if dv − dh > 80
    X̂ ← W
else
{
    X̂ ← (N + W)/2 + (NE − NW)/4
    if dh − dv > 32
        X̂ ← (X̂ + N)/2
    else if dv − dh > 32
        X̂ ← (X̂ + W)/2
    else if dh − dv > 8
        X̂ ← (3X̂ + N)/4
    else if dv − dh > 8
        X̂ ← (3X̂ + W)/4
}

Using the information about whether the pixel values are changing by large or small amounts in the vertical or horizontal direction in the neighborhood of the pixel being encoded provides a good initial prediction. In order to refine this prediction, we need some information about the interrelationships of the pixels in the neighborhood. Using this information, we can generate an offset or refinement to our initial prediction. We quantify the information about the neighborhood by first forming the vector

[N, W, NW, NE, NN, WW, 2N − NN, 2W − WW]

We then compare each component of this vector with our initial prediction X̂. If the value of the component is less than the prediction, we replace the value with a 1; otherwise
  • 191. 168 7 L O S S L E S S I M A G E C O M P R E S S I O N we replace it with a 0. Thus, we end up with an eight-component binary vector. If each component of the binary vector was independent, we would end up with 256 possible vectors. However, because of the dependence of various components, we actually have 144 possible configurations. We also compute a quantity that incorporates the vertical and horizontal variations and the previous error in prediction by = dh +dv +2N −N̂ (7.8) where N̂ is the predicted value of N. This range of values of is divided into four intervals, each being represented by 2 bits. These four possibilities, along with the 144 texture descriptors, create 144×4 = 576 contexts for X. As the encoding proceeds, we keep track of how much prediction error is generated in each context and offset our initial prediction by that amount. This results in the final predicted value. Once the prediction is obtained, the difference between the pixel value and the prediction (the prediction error, or residual) has to be encoded. While the prediction process outlined above removes a lot of the structure that was in the original sequence, there is still some structure left in the residual sequence. We can take advantage of some of this structure by coding the residual in terms of its context. The context of the residual is taken to be the value of defined in Equation (7.8). In order to reduce the complexity of the encoding, rather than using the actual value as the context, CALIC uses the range of values in which lies as the context. Thus: 0 ≤ q1 ⇒ Context 1 q1 ≤ q2 ⇒ Context 2 q2 ≤ q3 ⇒ Context 3 q3 ≤ q4 ⇒ Context 4 q4 ≤ q5 ⇒ Context 5 q5 ≤ q6 ⇒ Context 6 q6 ≤ q7 ⇒ Context 7 q7 ≤ q8 ⇒ Context 8 The values of q1–q8 can be prescribed by the user. If the original pixel values lie between 0 and M −1, the differences or prediction residuals will lie between −M − 1 and M − 1. Even though most of the differences will have a magnitude close to zero, for arithmetic coding we still have to assign a count to all possible symbols. This means a reduction in the size of the intervals assigned to values that do occur, which in turn means using a larger number of bits to represent these values. The CALIC algorithm attempts to resolve this problem in a number of ways. Let’s describe these using an example.
Consider the sequence

xn:  0 7 4 3 5 2 1 7

We can see that all the numbers lie between 0 and 7, a range of values that would require 3 bits to represent. Now suppose we predict a sequence element by the previous element in the sequence. The sequence of differences rn = xn − xn−1 is given by

rn:  0 7 −3 −1 2 −3 −1 6

If we were given this sequence, we could easily recover the original sequence by using

xn = xn−1 + rn

However, the prediction residual values rn lie in the range [−7, 7]. That is, the alphabet required to represent these values is almost twice the size of the original alphabet. However, if we look closely we can see that the value of rn actually lies between −xn−1 and 7 − xn−1. The smallest value that rn can take on occurs when xn has a value of 0, in which case rn will have a value of −xn−1. The largest value that rn can take on occurs when xn is 7, in which case rn has a value of 7 − xn−1. In other words, given a particular value for xn−1, the number of different values that rn can take on is the same as the number of values that xn can take on. Generalizing from this, we can see that if a pixel takes on values between 0 and M − 1, then given a predicted value X̂, the difference X − X̂ will take on values in the range −X̂ to M − 1 − X̂. We can use this fact to map the difference values into the range [0, M − 1], using the following mapping:

0 → 0
1 → 1
−1 → 2
2 → 3
⋮
−X̂ → 2X̂
X̂ + 1 → 2X̂ + 1
X̂ + 2 → 2X̂ + 2
⋮
M − 1 − X̂ → M − 1

where we have assumed that X̂ ≤ (M − 1)/2.
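A sketch of this mapping (Python; the function name is illustrative and, as in the text, it assumes X̂ ≤ (M − 1)/2, the opposite case being handled symmetrically):

    def remap_residual(r, x_hat, M):
        # Map a residual r in [-x_hat, M-1-x_hat] onto [0, M-1] by interleaving
        # the values around zero and then continuing linearly past 2*x_hat.
        assert -x_hat <= r <= M - 1 - x_hat
        if r < 0:
            return -2 * r                        # -1 -> 2, -2 -> 4, ..., -x_hat -> 2*x_hat
        if r <= x_hat:
            return 2 * r - 1 if r > 0 else 0     # 1 -> 1, 2 -> 3, ..., x_hat -> 2*x_hat - 1
        return r + x_hat                         # x_hat+1 -> 2*x_hat+1, ..., M-1-x_hat -> M-1

    # With M = 8 and a predicted value of 3, the residuals -3,...,4 map to:
    print([remap_residual(r, 3, 8) for r in range(-3, 5)])   # [6, 4, 2, 0, 1, 3, 5, 7]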
  • 193. 170 7 L O S S L E S S I M A G E C O M P R E S S I O N Another approach used by CALIC to reduce the size of its alphabet is to use a modifi- cation of a technique called recursive indexing [76]. Recursive indexing is a technique for representing a large range of numbers using only a small set. It is easiest to explain using an example. Suppose we want to represent positive integers using only the integers between 0 and 7—that is, a representation alphabet of size 8. Recursive indexing works as follows: If the number to be represented lies between 0 and 6, we simply represent it by that number. If the number to be represented is greater than or equal to 7, we first send the number 7, subtract 7 from the original number, and repeat the process. We keep repeating the process until the remainder is a number between 0 and 6. Thus, for example, 9 would be represented by 7 followed by a 2, and 17 would be represented by two 7s followed by a 3. The decoder, when it sees a number between 0 and 6, would decode it at its face value, and when it saw 7, would keep accumulating the values until a value between 0 and 6 was received. This method of representation followed by entropy coding has been shown to be optimal for sequences that follow a geometric distribution [77]. In CALIC, the representation alphabet is different for different coding contexts. For each coding context k, we use an alphabet Ak = 01Nk . Furthermore, if the residual occurs in context k, then the first number that is transmitted is coded with respect to context k; if further recursion is needed, we use the k+1 context. We can summarize the CALIC algorithm as follows: 1. Find initial prediction X̂. 2. Compute prediction context. 3. Refine prediction by removing the estimate of the bias in that context. 4. Update bias estimate. 5. Obtain the residual and remap it so the residual values lie between 0 and M −1, where M is the size of the initial alphabet. 6. Find the coding context k. 7. Code the residual using the coding context. All these components working together have kept CALIC as the state of the art in lossless image compression. However, we can get almost as good a performance if we simplify some of the more involved aspects of CALIC. We study such a scheme in the next section. 7 . 4 JPEG-LS The JPEG-LS standard looks more like CALIC than the old JPEG standard. When the initial proposals for the new lossless compression standard were compared, CALIC was rated first in six of the seven categories of images tested. Motivated by some aspects of CALIC, a team from Hewlett-Packard proposed a much simpler predictive coder, under the name LOCO-I (for low complexity), that still performed close to CALIC [78]. As in CALIC, the standard has both a lossless and a lossy mode. We will not describe the lossy coding procedures.
The initial prediction is obtained using the following algorithm:

if NW ≥ max(W, N)
    X̂ = min(W, N)
else
{
    if NW ≤ min(W, N)
        X̂ = max(W, N)
    else
        X̂ = W + N − NW
}

This prediction approach is a variation of Median Adaptive Prediction [79], in which the predicted value is the median of the values N, W, and N + W − NW. The initial prediction is then refined using the average value of the prediction error in that particular context. The contexts in JPEG-LS also reflect the local variations in pixel values. However, they are computed differently from CALIC. First, measures of differences D1, D2, and D3 are computed as follows:

D1 = NE − N
D2 = N − NW
D3 = NW − W

The values of these differences define a three-component context vector Q. The components of Q (Q1, Q2, and Q3) are defined by the following mappings:

Di ≤ −T3          ⇒ Qi = −4
−T3 < Di ≤ −T2    ⇒ Qi = −3
−T2 < Di ≤ −T1    ⇒ Qi = −2
−T1 < Di < 0      ⇒ Qi = −1
Di = 0            ⇒ Qi = 0
0 < Di ≤ T1       ⇒ Qi = 1
T1 < Di ≤ T2      ⇒ Qi = 2
T2 < Di ≤ T3      ⇒ Qi = 3
T3 < Di           ⇒ Qi = 4          (7.9)

where T1, T2, and T3 are positive coefficients that can be defined by the user. Given nine possible values for each component of the context vector, this results in 9 × 9 × 9 = 729 possible contexts. In order to simplify the coding process, the number of contexts is reduced by replacing any context vector Q whose first nonzero element is negative by −Q. Whenever
this happens, a variable SIGN is also set to −1; otherwise, it is set to +1. This reduces the number of contexts to 365. The vector Q is then mapped into a number between 0 and 364. (The standard does not specify the particular mapping to use.) The variable SIGN is used in the prediction refinement step. The correction is first multiplied by SIGN and then added to the initial prediction. The prediction error rn is mapped into an interval that is the same size as the range occupied by the original pixel values. The mapping used in JPEG-LS is as follows:

rn < −M/2  ⇒ rn ← rn + M
rn > M/2   ⇒ rn ← rn − M

Finally, the prediction errors are encoded using adaptively selected codes based on Golomb codes, which have also been shown to be optimal for sequences with a geometric distribution.

T A B L E 7 . 3 Comparison of the file sizes obtained using the new and old JPEG lossless compression standards and CALIC.
Image     Old JPEG   New JPEG   CALIC
Sena      31,055     27,339     26,433
Sensin    32,429     30,344     29,213
Earth     32,137     26,088     25,280
Omaha     48,818     50,765     48,249

In Table 7.3 we compare the performance of the old and new JPEG standards and CALIC. The results for the new JPEG scheme were obtained using a software implementation courtesy of HP. We can see that for most of the images the new JPEG standard performs very close to CALIC and outperforms the old standard by 6% to 18%. The only case where the performance is not as good is for the Omaha image. While the performance improvement in these examples may not be very impressive, we should keep in mind that for the old JPEG we are picking the best result out of eight. In practice, this would mean trying all eight JPEG predictors and picking the best. On the other hand, both CALIC and the new JPEG standard are single-pass algorithms. Furthermore, because of the ability of both CALIC and the new standard to function in multiple modes, both perform very well on compound documents, which may contain images along with text.

7 . 5 Multiresolution Approaches

Our final predictive image compression scheme is perhaps not as competitive as the other schemes. However, it is an interesting algorithm because it approaches the problem from a slightly different point of view.
  • 196. 7.5 Multiresolution Approaches 173 • Δ X • Δ • X • Δ * • • * • * • * • • X • X • • X * • • * • * • * • • X • Δ • X • Δ * • • * • * • * • • X • X • • X * • • * • * • * • • X • Δ • X • Δ Δ Δ F I G U R E 7. 2 The HINT scheme for hierarchical prediction. Multiresolution models generate representations of an image with varying spatial reso- lution. This usually results in a pyramidlike representation of the image, with each layer of the pyramid serving as a prediction model for the layer immediately below. One of the more popular of these techniques is known as HINT (Hierarchical INTerpola- tion) [80]. The specific steps involved in HINT are as follows. First, residuals corresponding to the pixels labeled in Figure 7.2 are obtained using linear prediction and transmitted. Then, the intermediate pixels () are estimated by linear interpolation, and the error in estimation is then transmitted. Then, the pixels X are estimated from and , and the estimation error is transmitted. Finally, the pixels labeled ∗ and then • are estimated from known neighbors, and the errors are transmitted. The reconstruction process proceeds in a similar manner. One use of a multiresolution approach is in progressive image transmission. We describe this application in the next section. 7 . 5 . 1 P r o g r e s s i v e I m a g e T r a n s m i s s i o n The last few years have seen a very rapid increase in the amount of information stored as images, especially remotely sensed images (such as images from weather and other satellites) and medical images (such as CAT scans, magnetic resonance images, and mammograms). It is not enough to have information. We also need to make these images accessible to individuals who can make use of them. There are many issues involved with making large amounts of information accessible to a large number of people. In this section we will look at one particular issue—transmitting these images to remote users. (For a more general look at the problem of managing large amounts of information, see [81].) Suppose a user wants to browse through a number of images in a remote database. The user is connected to the database via a 56 kbits per second (kbps) modem. Suppose the
  • 197. 174 7 L O S S L E S S I M A G E C O M P R E S S I O N images are of size 1024 ×1024, and on the average users have to look through 30 images before finding the image they are looking for. If these images were monochrome with 8 bits per pixel, this process would take close to an hour and 15 minutes, which is not very practical. Even if we compressed these images before transmission, lossless compression on average gives us about a two-to-one compression. This would only cut the transmission in half, which still makes the approach cumbersome. A better alternative is to send an approximation of each image first, which does not require too many bits but still is sufficiently accurate to give users an idea of what the image looks like. If users find the image to be of interest, they can request a further refinement of the approximation, or the complete image. This approach is called progressive image transmission. E x a m p l e 7 . 5 . 1: A simple progressive transmission scheme is to divide the image into blocks and then send a representative pixel for the block. The receiver replaces each pixel in the block with the representative value. In this example, the representative value is the value of the pixel in the top-left corner. Depending on the size of the block, the amount of data that would need to be transmitted could be substantially reduced. For example, to transmit a 1024×1024 image at 8 bits per pixel over a 56 kbps line takes about two and a half minutes. Using a block size of 8 × 8, and using the top-left pixel in each block as the representative value, means we approximate the 1024 ×1024 image with a 128 ×128 subsampled image. Using 8 bits per pixel and a 56 kbps line, the time required to transmit this approximation to the image takes less than two and a half seconds. Assuming that this approximation was sufficient to let the user decide whether a particular image was the desired image, the time required now to look through 30 images becomes a minute and a half instead of the hour and a half mentioned earlier. If the approximation using a block size of 8×8 does not provide enough resolution to make a decision, the user can ask for a refinement. The transmitter can then divide the 8×8 block into four 4×4 blocks. The pixel at the upper-left corner of the upper-left block was already transmitted as the representative pixel for the 8 ×8 block, so we need to send three more pixels for the other three 4×4 blocks. This takes about seven seconds, so even if the user had to request a finer approximation every third image, this would only increase the total search time by a little more than a minute. To see what these approximations look like, we have taken the Sena image and encoded it using different block sizes. The results are shown in Figure 7.3. The lowest-resolution image, shown in the top left, is a 32 × 32 image. The top-left image is a 64×64 image. The bottom-left image is a 128×128 image, and the bottom-right image is the 256×256 original. Notice that even with a block size of 8 the image is clearly recognizable as a person. Therefore, if the user was looking for a house, they would probably skip over this image after seeing the first approximation. If the user was looking for a picture of a person, they could still make decisions based on the second approximation. Finally, when an image is built line by line, the eye tends to follow the scan line. 
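A sketch of the scheme in Example 7.5.1 and of the transmission-time arithmetic used above (Python; the image is taken to be a list of rows, and the function names are only illustrative):

    def block_approximation(image, block):
        # Keep the top-left pixel of every block x block block as its representative.
        return [row[::block] for row in image[::block]]

    def transmission_seconds(rows, cols, bits_per_pixel=8, rate_bps=56000):
        return rows * cols * bits_per_pixel / rate_bps

    example = [[r * 10 + c for c in range(4)] for r in range(4)]
    print(block_approximation(example, 2))             # [[0, 2], [20, 22]]

    # A 1024 x 1024, 8 bits/pixel image over a 56 kbps modem, and its 128 x 128
    # approximation obtained with 8 x 8 blocks:
    print(round(transmission_seconds(1024, 1024)))     # 150 seconds, about 2.5 minutes
    print(round(transmission_seconds(128, 128), 1))    # 2.3 seconds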
With the progressive transmission approach, the user gets a more global view of the image very early in the image formation process. Consider the images in Figure 7.4. The images on the left are the 8×8, 4×4, and 2×2 approximations of the Sena image. On the right, we show
  • 198. 7.5 Multiresolution Approaches 175 F I G U R E 7. 3 Sena image coded using different block sizes for progressive transmission. Top row: block size 8 × 8 and block size 4 × 4. Bottom row: block size 2 × 2 and original image. how much of the image we would see in the same amount of time if we used the standard line-by-line raster scan order. We would like the first approximations that we transmit to use as few bits as possible yet be accurate enough to allow the user to make a decision to accept or reject the image with a certain degree of confidence. As these approximations are lossy, many progressive transmission schemes use well-known lossy compression schemes in the first pass.
  • 199. 176 7 L O S S L E S S I M A G E C O M P R E S S I O N F I G U R E 7. 4 Comparison between the received image using progressive transmission and using the standard raster scan order.
The more popular lossy compression schemes, such as transform coding, tend to require a significant amount of computation. As the decoders for most progressive transmission schemes have to function on a wide variety of platforms, they are generally implemented in software and need to be simple and fast. This requirement has led to the development of a number of progressive transmission schemes that do not use lossy compression schemes for their initial approximations. Most of these schemes have a form similar to the one described in Example 7.5.1, and they are generally referred to as pyramid schemes because of the manner in which the approximations are generated and the image is reconstructed.

When we use the pyramid form, we still have a number of ways to generate the approximations. One of the problems with the simple approach described in Example 7.5.1 is that if the pixel values vary a lot within a block, the "representative" value may not be very representative. To prevent this from happening, we could represent the block by some sort of an average or composite value. For example, suppose we start out with a 512 × 512 image. We first divide the image into 2 × 2 blocks and compute the integer value of the average of each block [82, 83]. The integer values of the averages would constitute the penultimate approximation. The approximation to be transmitted prior to that can be obtained by taking the average of 2 × 2 averages, and so on, as shown in Figure 7.5.

F I G U R E 7. 5 The pyramid structure for progressive transmission.

Using the simple technique in Example 7.5.1, we ended up transmitting the same number of values as the original number of pixels. However, when we use the mean of the pixels as our approximation, after we have transmitted the mean values at each level, we still have to transmit the actual pixel values. The reason is that when we take the integer part of the average we end up throwing away information that cannot be retrieved. To avoid this problem of data expansion, we can transmit the sum of the values in the 2 × 2 block. Then we only need to transmit three more values to recover the original four values. With this approach, although we would be transmitting the same number of values as the number of pixels in the image, we might still end up sending more bits because representing all possible
values of the sum would require transmitting 2 more bits than was required for the original value. For example, if the pixels in the image can take on values between 0 and 255, which can be represented by 8 bits, their sum will take on values between 0 and 1020, which would require 10 bits. If we are allowed to use entropy coding, we can remove the problem of data expansion by using the fact that the neighboring values in each approximation are heavily correlated, as are values in different levels of the pyramid. This means that differences between these values can be efficiently encoded using entropy coding. By doing so, we end up getting compression instead of expansion.

Instead of taking the arithmetic average, we could also form some sort of weighted average. The general procedure would be similar to that described above. (For one of the more well-known weighted average techniques, see [84].)

The representative value does not have to be an average. We could use the pixel values in the approximation at the lower levels of the pyramid as indices into a lookup table. The lookup table can be designed to preserve important information such as edges. The problem with this approach would be the size of the lookup table. If we were using 2 × 2 blocks of 8-bit values, the lookup table would have 2^32 values, which is too large for most applications. The size of the table could be reduced if the number of bits per pixel was lower or if, instead of taking 2 × 2 blocks, we used rectangular blocks of size 2 × 1 and 1 × 2 [85].

Finally, we do not have to build the pyramid one layer at a time. After sending the lowest-resolution approximations, we can use some measure of the information contained in a block to decide whether it should be transmitted [86]. One possible measure could be the difference between the largest and smallest intensity values in the block. Another might be to look at the maximum number of similar pixels in a block. Using an information measure to guide the progressive transmission of images allows the user to see the visually more significant portions of the image first.

7 . 6 Facsimile Encoding

One of the earliest applications of lossless compression in the modern era has been the compression of facsimile, or fax. In facsimile transmission, a page is scanned and converted into a sequence of black or white pixels. The requirements on how fast the facsimile of an A4 document (210 × 297 mm) must be transmitted have changed over the last two decades. The CCITT (now ITU-T) has issued a number of recommendations based on the speed requirements at a given time. The CCITT classifies the apparatus for facsimile transmission into four groups. Although several considerations are used in this classification, if we only consider the time needed to transmit an A4-size document over phone lines, the four groups can be described as follows:

Group 1: This apparatus is capable of transmitting an A4-size document in about six minutes over phone lines using an analog scheme. The apparatus is standardized in recommendation T.2.

Group 2: This apparatus is capable of transmitting an A4-size document over phone lines in about three minutes. A Group 2 apparatus also uses an analog scheme and,
therefore, does not use data compression. The apparatus is standardized in recommendation T.3.

Group 3: This apparatus uses a digitized binary representation of the facsimile. Because it is a digital scheme, it can and does use data compression and is capable of transmitting an A4-size document in about a minute. The apparatus is standardized in recommendation T.4.

Group 4: This apparatus has the same speed requirement as Group 3. The apparatus is standardized in recommendations T.6, T.503, T.521, and T.563.

With the arrival of the Internet, facsimile transmission has changed as well. Given the wide range of rates and "apparatus" used for digital communication, it makes sense to focus more on protocols than on apparatus. The newer recommendations from the ITU provide standards for compression that are more or less independent of apparatus. Later in this chapter, we will look at the compression schemes described in the ITU-T recommendations T.4, T.6, T.82 (JBIG), T.88 (JBIG2), and T.44 (MRC). We begin with a look at an earlier technique for facsimile called run-length coding, which still survives as part of the T.4 recommendation.

7 . 6 . 1 R u n - L e n g t h C o d i n g

The model that gives rise to run-length coding is the Capon model [87], a two-state Markov model with states Sw and Sb (Sw corresponds to the case where the pixel that has just been encoded is a white pixel, and Sb corresponds to the case where the pixel that has just been encoded is a black pixel). The transition probabilities P(w|b) and P(b|w), and the probabilities of being in each state, P(Sw) and P(Sb), completely specify this model. For facsimile images, P(w|w) and P(w|b) are generally significantly higher than P(b|w) and P(b|b). The Markov model is represented by the state diagram shown in Figure 7.6. The entropy of a finite-state process with states Si is given by Equation (2.16). Recall that in Example 2.3.1, the entropy using a probability model and the iid assumption was significantly more than the entropy using the Markov model.

F I G U R E 7. 6 The Capon model for binary images (states Sw and Sb, with transition probabilities P(b|w), P(w|b), P(w|w), and P(b|b)).
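To see how the Capon model translates into numbers, the short C sketch below evaluates the entropy of the two-state process (the weighted sum of the per-state entropies, cf. Equation (2.16)) and compares it with the iid entropy for the same overall proportion of black pixels. The transition probabilities are made-up values chosen only for illustration; they are not taken from the text.

    /* A small sketch of the entropy of the two-state Capon model.
     * The transition probabilities are illustrative placeholders. */
    #include <stdio.h>
    #include <math.h>

    static double h2(double p)               /* binary entropy in bits */
    {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -p * log2(p) - (1.0 - p) * log2(1.0 - p);
    }

    int main(void)
    {
        double p_bw = 0.01;   /* P(b|w): a black pixel follows a white pixel */
        double p_wb = 0.30;   /* P(w|b): a white pixel follows a black pixel */

        /* Stationary probabilities of the two states. */
        double p_sw = p_wb / (p_wb + p_bw);
        double p_sb = p_bw / (p_wb + p_bw);

        /* Entropy of the finite-state process: per-state entropies
         * weighted by the state probabilities. */
        double h_markov = p_sw * h2(p_bw) + p_sb * h2(p_wb);

        /* Entropy under an iid assumption with the same overall P(black). */
        double h_iid = h2(p_sb);

        printf("Markov entropy: %.3f bits/pixel\n", h_markov);
        printf("iid entropy   : %.3f bits/pixel\n", h_iid);
        return 0;
    }

With run-dominated statistics like these, the Markov entropy falls well below the iid entropy, and that gap is what run-length coding tries to exploit.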
Let us try to interpret what the model says about the structure of the data. The highly skewed nature of the probabilities P(b|w) and P(w|w), and to a lesser extent P(w|b) and P(b|b), says that once a pixel takes on a particular color (black or white), it is highly likely that the following pixels will also be of the same color. So, rather than code the color of each pixel separately, we can simply code the length of the runs of each color. For example, if we had 190 white pixels followed by 30 black pixels, followed by another 210 white pixels, instead of coding the 430 pixels individually, we would code the sequence 190, 30, 210, along with an indication of the color of the first string of pixels. Coding the lengths of runs instead of coding individual values is called run-length coding.

7 . 6 . 2 C C I T T G r o u p 3 a n d 4 — R e c o m m e n d a t i o n s T . 4 a n d T . 6

The recommendations for Group 3 facsimile include two coding schemes. One is a one-dimensional scheme in which the coding on each line is performed independently of any other line. The other is two-dimensional; the coding of one line is performed using the line-to-line correlations.

The one-dimensional coding scheme is a run-length coding scheme in which each line is represented as a series of alternating white runs and black runs. The first run is always a white run. If the first pixel is a black pixel, then we assume that we have a white run of length zero. Runs of different lengths occur with different probabilities; therefore, they are coded using a variable-length code. The approach taken in the CCITT standards T.4 and T.6 is to use a Huffman code to encode the run lengths. However, the number of possible run lengths is extremely large, and it is simply not feasible to build a codebook that large. Therefore, instead of generating a Huffman code for each run length rl, the run length is expressed in the form

rl = 64 × m + t   for t = 0, 1, …, 63, and m = 1, 2, …, 27.   (7.10)

When we have to represent a run length rl, instead of finding a code for rl, we use the corresponding codes for m and t. The codes for t are called the terminating codes, and the codes for m are called the make-up codes. If rl ≤ 63, we only need to use a terminating code. Otherwise, both a make-up code and a terminating code are used. For the range of m and t given here, we can represent run lengths of up to 1728, which is the number of pixels per line in an A4-size document. However, if the document is wider, the recommendations provide an optional set of 13 codes for the longer runs. Except for the optional codes, there are separate codes for black and white run lengths. This coding scheme is generally referred to as a modified Huffman (MH) scheme.

In the two-dimensional scheme, instead of reporting the run lengths, which in terms of our Markov model is the length of time we remain in one state, we report the transition times when we move from one state to another state. Look at Figure 7.7. We can encode this in two ways. We can say that the first row consists of a sequence of runs 0, 2, 3, 3, 8, and the second row consists of runs of length 0, 1, 8, 3, 4 (notice the first runs of length zero). Or, we can encode the location of the pixel values that occur at a transition from white to
  • 204. 7.6 Facsimile Encoding 181 F I G U R E 7. 7 Two rows of an image. The transition pixels are marked with a dot. black or black to white. The first pixel is an imaginary white pixel assumed to be to the left of the first actual pixel. Therefore, if we were to code transition locations, we would encode the first row as 1369 and the second row as 121013. Generally, rows of a facsimile image are heavily correlated. Therefore, it would be easier to code the transition points with reference to the previous line than to code each one in terms of its absolute location, or even its distance from the previous transition point. This is the basic idea behind the recommended two-dimensional coding scheme. This scheme is a modification of a two-dimensional coding scheme called the Relative Element Address Designate (READ) code [88, 89] and is often referred to as Modified READ (MR). The READ code was the Japanese proposal to the CCITT for the Group 3 standard. To understand the two-dimensional coding scheme, we need some definitions. a0: This is the last pixel whose value is known to both encoder and decoder. At the beginning of encoding each line, a0 refers to an imaginary white pixel to the left of the first actual pixel. While it is often a transition pixel, it does not have to be. a1: This is the first transition pixel to the right of a0. By definition its color should be the opposite of a0. The location of this pixel is known only to the encoder. a2: This is the second transition pixel to the right of a0. Its color should be the opposite of a1, which means it has the same color as a0. The location of this pixel is also known only to the encoder. b1: This is the first transition pixel on the line above the line currently being encoded to the right of a0 whose color is the opposite of a0. As the line above is known to both encoder and decoder, as is the value of a0, the location of b1 is also known to both encoder and decoder. b2: This is the first transition pixel to the right of b1 in the line above the line currently being encoded. For the pixels in Figure 7.7, if the second row is the one being currently encoded, and if we have encoded the pixels up to the second pixel, the assignment of the different pixels is shown in Figure 7.8. The pixel assignments for a slightly different arrangement of black and white pixels are shown in Figure 7.9. If b1 and b2 lie between a0 and a1, we call the coding mode used the pass mode. The transmitter informs the receiver about the situation by sending the code 0001. Upon receipt of this code, the receiver knows that from the location of a0 to the pixel right below b2, all pixels are of the same color. If this had not been true, we would have encountered a transition pixel. As the first transition pixel to the right of a0 is a1, and as b2 occurs before a1, no transitions have occurred and all pixels from a0 to right below b2 are the same color. At this time, the last pixel known to both the transmitter and receiver is the pixel below b2.
  • 205. 182 7 L O S S L E S S I M A G E C O M P R E S S I O N a0 a1 a2 b1 b2 F I G U R E 7. 8 Two rows of an image. The transition pixels are marked with a dot. a0 a1 a2 b1 b2 F I G U R E 7. 9 Two rows of an image. The transition pixels are marked with a dot. Therefore, this now becomes the new a0, and we find the new positions of b1 and b2 by examining the row above the one being encoded and continue with the encoding process. If a1 is detected before b2 by the encoder, we do one of two things. If the distance between a1 and b1 (the number of pixels from a1 to right under b1) is less than or equal to three, then we send the location of a1 with respect to b1, move a0 to a1, and continue with the coding process. This coding mode is called the vertical mode. If the distance between a1 and b1 is large, we essentially revert to the one-dimensional technique and send the distances between a0 and a1, and a1 and a2, using the modified Huffman code. Let us look at exactly how this is accomplished. In the vertical mode, if the distance between a1 and b1 is zero (that is, a1 is exactly under b1), we send the code 1. If the a1 is to the right of b1 by one pixel (as in Figure 7.9), we send the code 011. If a1 is to the right of b1 by two or three pixels, we send the codes 000011 or 0000011, respectively. If a1 is to the left of b1 by one, two, or three pixels, we send the codes 010, 000010, or 0000010, respectively. In the horizontal mode, we first send the code 001 to inform the receiver about the mode, and then send the modified Huffman codewords corresponding to the run length from a0 to a1, and a1 to a2. As the encoding of a line in the two-dimensional algorithm is based on the previous line, an error in one line could conceivably propagate to all other lines in the transmission. To prevent this from happening, the T.4 recommendations contain the requirement that after each line is coded with the one-dimensional algorithm, at most K − 1 lines will be coded using the two-dimensional algorithm. For standard vertical resolution, K = 2, and for high resolution, K = 4.
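The coding decision described above can be summarized in a few lines. The sketch below is not an implementation of the full T.4 algorithm; it only shows how an encoder might choose among the pass, vertical, and horizontal modes for one step, using the mode prefixes and vertical-mode codewords listed in the text. The modified Huffman coding of the horizontal-mode run lengths is left as a stub, and the pixel positions in main are invented for illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* Vertical-mode codeword for d = a1 - b1, with |d| <= 3. */
    static const char *vertical_code(int d)
    {
        static const char *right[] = { "1", "011", "000011", "0000011" };
        static const char *left[]  = { "1", "010", "000010", "0000010" };
        return d >= 0 ? right[d] : left[-d];
    }

    /* Emit the codeword prefix for one coding decision. */
    static void encode_step(int a1, int b1, int b2)
    {
        if (b2 < a1) {
            /* Pass mode: b1 and b2 lie before a1. */
            printf("pass      : 0001\n");
        } else if (abs(a1 - b1) <= 3) {
            /* Vertical mode: a1 is within three pixels of b1. */
            printf("vertical  : %s\n", vertical_code(a1 - b1));
        } else {
            /* Horizontal mode: 001 followed by the modified Huffman codes
             * for the run lengths a0a1 and a1a2 (not shown here). */
            printf("horizontal: 001 + MH(a0a1) + MH(a1a2)\n");
        }
    }

    int main(void)
    {
        encode_step(4, 3, 10);   /* a1 one pixel right of b1 -> code 011 */
        encode_step(12, 3, 6);   /* b2 before a1             -> pass     */
        encode_step(20, 3, 30);  /* a1 far from b1           -> horizontal */
        return 0;
    }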
The Group 4 encoding algorithm, as standardized in CCITT recommendation T.6, is identical to the two-dimensional encoding algorithm in recommendation T.4. The main difference between T.6 and T.4 from the compression point of view is that T.6 does not have a one-dimensional coding algorithm, which means that the restriction described in the previous paragraph is also not present. This slight modification of the modified READ algorithm has earned the name modified modified READ (MMR)!

7 . 6 . 3 J B I G

Many bi-level images have a lot of local structure. Consider a digitized page of text. In large portions of the image we will encounter white pixels with a probability approaching 1. In other parts of the image there will be a high probability of encountering a black pixel. We can make a reasonable guess of the situation for a particular pixel by looking at values of the pixels in the neighborhood of the pixel being encoded. For example, if the pixels in the neighborhood of the pixel being encoded are mostly white, then there is a high probability that the pixel to be encoded is also white. On the other hand, if most of the pixels in the neighborhood are black, there is a high probability that the pixel being encoded is also black. Each case gives us a skewed probability—a situation ideally suited for arithmetic coding. If we treat each case separately, using a different arithmetic coder for each of the two situations, we should be able to obtain improvement over the case where we use the same arithmetic coder for all pixels. Consider the following example.

Suppose the probability of encountering a black pixel is 0.2 and the probability of encountering a white pixel is 0.8. The entropy for this source is given by

H = −0.2 log2 0.2 − 0.8 log2 0.8 = 0.722   (7.11)

If we use a single arithmetic coder to encode this source, we will get an average bit rate close to 0.722 bits per pixel. Now suppose, based on the neighborhood of the pixels, that we can divide the pixels into two sets, one comprising 80% of the pixels and the other 20%. In the first set, the probability of encountering a white pixel is 0.95, and in the second set the probability of encountering a black pixel is 0.7. The entropies of these sets are 0.286 and 0.881 bits, respectively. If we used two different arithmetic coders for the two sets with frequency tables matched to the probabilities, we would get rates close to 0.286 bits per pixel about 80% of the time and close to 0.881 bits per pixel about 20% of the time. The average rate would be about 0.405 bits per pixel, which is almost half the rate required if we used a single arithmetic coder. If we use only those pixels in the neighborhood that have already been transmitted to the receiver to make our decision about which arithmetic coder to use, the decoder can keep track of which encoder was used to encode a particular pixel.

As we have mentioned before, the arithmetic coding approach is particularly amenable to the use of multiple coders. All coders use the same computational machinery, with each coder using a different set of probabilities. The JBIG algorithm makes full use of this feature of arithmetic coding. Instead of checking to see if most of the pixels in the neighborhood are white or black, the JBIG encoder uses the pattern of pixels in the neighborhood, or context, to decide which set of probabilities to use in encoding a particular pixel. If the neighborhood consists of 10 pixels, with each pixel capable of taking on two different values, the number of
possible patterns is 1024. The JBIG coder uses 1024 to 4096 coders, depending on whether a low- or high-resolution layer is being encoded.

For the low-resolution layer, the JBIG encoder uses one of the two different neighborhoods shown in Figure 7.10. The pixel to be coded is marked X, while the pixels to be used for templates are marked O or A. The A and O pixels are previously encoded pixels and are available to both encoder and decoder. The A pixel can be thought of as a floating member of the neighborhood. Its placement is dependent on the input being encoded. Suppose the image has vertical lines 30 pixels apart. The A pixel would be placed 30 pixels to the left of the pixel being encoded. The A pixel can be moved around to capture any structure that might exist in the image. This is especially useful in halftone images in which the A pixels are used to capture the periodic structure. The location and movement of the A pixel are transmitted to the decoder as side information.

F I G U R E 7. 10 (a) Three-line and (b) two-line neighborhoods.

In Figure 7.11, the symbols in the neighborhoods have been replaced by 0s and 1s. We take 0 to correspond to white pixels, while 1 corresponds to black pixels. The pixel to be encoded is enclosed by the heavy box. The pattern of 0s and 1s is interpreted as a binary number, which is used as an index to the set of probabilities. The context in the case of the three-line neighborhood (reading left to right, top to bottom) is 0001000110, which corresponds to an index of 70. For the two-line neighborhood, the context is 0011100001, or 225. Since there are 10 bits in these templates, we will have 1024 different arithmetic coders.

F I G U R E 7. 11 (a) Three-line and (b) two-line contexts.

In the JBIG standard, the 1024 arithmetic coders are a variation of the arithmetic coder known as the QM coder. The QM coder is a modification of an adaptive binary arithmetic coder called the Q coder [51, 52, 53], which in turn is an extension of another binary adaptive arithmetic coder called the skew coder [90]. In our description of arithmetic coding, we updated the tag interval by updating the endpoints of the interval, u^(n) and l^(n). We could just as well have kept track of one endpoint
and the size of the interval. This is the approach adopted in the QM coder, which tracks the lower end of the tag interval l^(n) and the size of the interval A^(n), where

A^(n) = u^(n) − l^(n)   (7.12)

The tag for a sequence is the binary representation of l^(n). We can obtain the update equation for A^(n) by subtracting Equation (4.9) from Equation (4.10) and making this substitution:

A^(n) = A^(n−1) [F_X(x_n) − F_X(x_n − 1)]   (7.13)
      = A^(n−1) P(x_n)   (7.14)

Substituting A^(n) for u^(n) − l^(n) in Equation (4.9), we get the update equation for l^(n):

l^(n) = l^(n−1) + A^(n−1) F_X(x_n − 1)   (7.15)

Instead of dealing directly with the 0s and 1s put out by the source, the QM coder maps them into a More Probable Symbol (MPS) and a Less Probable Symbol (LPS). If 0 represents black pixels and 1 represents white pixels, then in a mostly black image 0 will be the MPS, whereas in an image with mostly white regions 1 will be the MPS. Denoting the probability of occurrence of the LPS for the context C by q_c and mapping the MPS to the lower subinterval, the occurrence of an MPS symbol results in the following update equations:

l^(n) = l^(n−1)   (7.16)
A^(n) = A^(n−1) (1 − q_c)   (7.17)

while the occurrence of an LPS symbol results in the following update equations:

l^(n) = l^(n−1) + A^(n−1) (1 − q_c)   (7.18)
A^(n) = A^(n−1) q_c   (7.19)

Until this point, the QM coder looks very much like the arithmetic coder described earlier in this chapter. To make the implementation simpler, the JBIG committee recommended several deviations from the standard arithmetic coding algorithm. The update equations involve multiplications, which are expensive in both hardware and software. In the QM coder, the multiplications are avoided by assuming that A^(n) has a value close to 1, so that multiplication by A^(n) can be approximated by multiplication by 1. Therefore, the update equations become

For MPS:
l^(n) = l^(n−1)   (7.20)
A^(n) = 1 − q_c   (7.21)

For LPS:
l^(n) = l^(n−1) + (1 − q_c)   (7.22)
A^(n) = q_c   (7.23)
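Read as code, the approximate update equations (7.20)–(7.23) amount to the following small C sketch. It deliberately ignores the renormalization and probability adaptation that the actual QM coder performs (these are described next), so the numbers it prints are only meaningful as an illustration of the interval bookkeeping; the value of q_c is an assumed placeholder.

    #include <stdio.h>

    typedef struct { double l; double A; } qm_interval;

    /* Update the interval for one symbol; is_mps is nonzero for an MPS. */
    static void qm_update(qm_interval *iv, int is_mps, double q_c)
    {
        if (is_mps) {
            /* MPS: lower end unchanged, interval set to 1 - q_c. (7.20)-(7.21) */
            iv->A = 1.0 - q_c;
        } else {
            /* LPS: lower end moves up by 1 - q_c, interval set to q_c. (7.22)-(7.23) */
            iv->l += 1.0 - q_c;
            iv->A  = q_c;
        }
    }

    int main(void)
    {
        qm_interval iv = { 0.0, 1.0 };
        double q_c = 0.1;                  /* assumed LPS probability for context C */
        int symbols[] = { 1, 1, 0, 1 };    /* 1 = MPS, 0 = LPS */

        for (int i = 0; i < 4; i++) {
            qm_update(&iv, symbols[i], q_c);
            printf("l = %.4f  A = %.4f\n", iv.l, iv.A);
        }
        return 0;
    }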
In order not to violate the assumption on A^(n), whenever the value of A^(n) drops below 0.75 the QM coder goes through a series of rescalings until the value of A^(n) is greater than or equal to 0.75. The rescalings take the form of repeated doubling, which corresponds to a left shift in the binary representation of A^(n). To keep all parameters in sync, the same scaling is also applied to l^(n). The bits shifted out of the buffer containing the value of l^(n) make up the encoder output. Looking at the update equations for the QM coder, we can see that a rescaling will occur every time an LPS occurs. Occurrence of an MPS may or may not result in a rescale, depending on the value of A^(n).

The probability q_c of the LPS for context C is updated each time a rescaling takes place and the context C is active. An ordered list of values for q_c is stored in a table. Every time a rescaling occurs, the value of q_c is changed to the next lower or next higher value in the table, depending on whether the rescaling was caused by the occurrence of an LPS or an MPS.

In a nonstationary situation, the symbol assigned to LPS may actually occur more often than the symbol assigned to MPS. This condition is detected when q_c > A^(n) − q_c. In this situation, the assignments are reversed; the symbol assigned the LPS label is assigned the MPS label and vice versa. The test is conducted every time a rescaling takes place.

The decoder for the QM coder operates in much the same way as the decoder described in this chapter, mimicking the encoder operation.

P r o g r e s s i v e T r a n s m i s s i o n

In some applications we may not always need to view an image at full resolution. For example, if we are looking at the layout of a page, we may not need to know what each word or letter on the page is. The JBIG standard allows for the generation of progressively lower-resolution images. If the user is interested in some gross patterns in the image (for example, if they were interested in seeing whether there were any figures on a particular page) they could request a lower-resolution image, which could be transmitted using fewer bits. Once the lower-resolution image was available, the user could decide whether a higher-resolution image was necessary. The JBIG specification recommends generating one lower-resolution pixel for each 2 × 2 block in the higher-resolution image. The number of lower-resolution images (called layers) is not specified by JBIG.

A straightforward method for generating lower-resolution images is to replace every 2 × 2 block of pixels with the average value of the four pixels, thus reducing the resolution by two in both the horizontal and vertical directions. This approach works well as long as three of the four pixels are either black or white. However, when we have two pixels of each kind, we run into trouble; consistently replacing the four pixels with either a white or black pixel causes a severe loss of detail, and randomly replacing with a black or white pixel introduces a considerable amount of noise into the image [81].

Instead of simply taking the average of every 2 × 2 block, the JBIG specification provides a table-based method for resolution reduction. The table is indexed by the neighboring pixels shown in Figure 7.12, in which the circles represent the lower-resolution layer pixels and the squares represent the higher-resolution layer pixels. Each pixel contributes a bit to the index.
The table is formed by computing the expression 4e + 2(b + d + f + h) + (a + c + g + i) − 3(B + C) − A.
F I G U R E 7. 12 Pixels used to determine the value of a lower-level pixel.

If the value of this expression is greater than 4.5, the pixel X is tentatively declared to be 1. The table has certain exceptions to this rule to reduce the amount of edge smearing generally encountered in a filtering operation. There are also exceptions that preserve periodic patterns and dither patterns.

Because the lower-resolution layers are obtained from the higher-resolution image, they are available when the higher-resolution layers are encoded. The JBIG specification takes advantage of this by using pixels of the lower-resolution layers as part of the context for encoding the higher-resolution layers. The contexts used for coding the lowest-resolution layer are those shown in Figure 7.10. The contexts used in coding the higher-resolution layers are shown in Figure 7.13. Ten pixels are used in each context. If we include the 2 bits required to indicate which context template is being used, 12 bits will be used to indicate the context. This means that we can have 4096 different contexts.

C o m p a r i s o n o f M H , M R , M M R , a n d J B I G

In this section we have seen three old facsimile coding algorithms: modified Huffman, modified READ, and modified modified READ. Before we proceed to the more modern techniques found in T.88 and T.44, we compare the performance of these algorithms with the earliest of the modern techniques, namely JBIG. We described the JBIG algorithm as an application of arithmetic coding in Chapter 4. This algorithm has been standardized in ITU-T recommendation T.82. As we might expect, the JBIG algorithm performs better than the MMR algorithm, which performs better than the MR algorithm, which in turn performs better than the MH algorithm. The level of complexity also follows the same trend, although we could argue that MMR is actually less complex than MR.

A comparison of the schemes for some facsimile sources is shown in Table 7.4. The modified READ algorithm was used with K = 4, while the JBIG algorithm was used with an adaptive three-line template and adaptive arithmetic coder to obtain the results in this table. As we go from the one-dimensional MH coder to the two-dimensional MMR coder, we get a factor-of-two reduction in file size for the sparse text sources. We get even more reduction when we use an adaptive coder and an adaptive model, as is the case for the JBIG coder. When we come to the dense text, the advantage of the two-dimensional MMR over the one-dimensional MH is not as significant, as the amount of two-dimensional correlation becomes substantially less.
F I G U R E 7. 13 Contexts used in the coding of higher-resolution layers.

T A B L E 7 . 4 Comparison of binary image coding schemes. Data from [91].

Source description    Original size (pixels)    MH (bytes)    MR (bytes)    MMR (bytes)    JBIG (bytes)
Letter                4352 × 3072                   20,605        14,290          8,531          6,682
Sparse text           4352 × 3072                   26,155        16,676          9,956          7,696
Dense text            4352 × 3072                  135,705       105,684         92,100         70,703

The compression schemes specified in T.4 and T.6 break down when we try to use them to encode halftone images. In halftone images, gray levels are represented using binary pixel patterns. A gray level closer to black would be represented by a pattern that contains more black pixels, while a gray level closer to white would be represented by a pattern with fewer black pixels. Thus, the model that was used to develop the compression schemes specified in T.4 and T.6 is not valid for halftone images. The JBIG algorithm, with its adaptive model and coder, suffers from no such drawbacks and performs well for halftone images also [91].
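Returning for a moment to the resolution-reduction rule quoted earlier, the expression and its 4.5 threshold reduce to a few lines of integer arithmetic. In the sketch below, a through i are the nine high-resolution pixels that contribute to the low-resolution pixel X and A, B, C are previously computed low-resolution neighbors; their exact geometric placement follows Figure 7.12 and is not reproduced here, and the table-driven exceptions for edges, periodic patterns, and dither patterns are omitted.

    #include <stdio.h>

    /* Tentative value of the low-resolution pixel X, before the table's
     * exception rules are applied. All arguments are 0 or 1. */
    static int reduce_pixel(int a, int b, int c,
                            int d, int e, int f,
                            int g, int h, int i,
                            int A, int B, int C)
    {
        int value = 4 * e
                  + 2 * (b + d + f + h)
                  + (a + c + g + i)
                  - 3 * (B + C)
                  - A;
        /* The inputs are integers, so "greater than 4.5" is the same as ">= 5". */
        return value > 4 ? 1 : 0;
    }

    int main(void)
    {
        /* A mostly black high-resolution neighborhood with white
         * low-resolution neighbors. */
        printf("%d\n", reduce_pixel(0, 1, 0,
                                    1, 1, 1,
                                    0, 1, 0,
                                    0, 0, 0));
        return 0;
    }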
7 . 6 . 4 J B I G 2 — T . 8 8

The JBIG2 standard was approved in February of 2000. Besides facsimile transmission, the standard is also intended for document storage, archiving, wireless transmission, print spooling, and coding of images on the Web. The standard provides specifications only for the decoder, leaving the encoder design open. This means that the encoder design can be constantly refined, subject only to compatibility with the decoder specifications. This situation also allows for lossy compression, because the encoder can incorporate lossy transformations to the data that enhance the level of compression.

The compression algorithm in JBIG provides excellent compression of a generic bi-level image. The compression algorithm proposed for JBIG2 uses the same arithmetic coding scheme as JBIG. However, it takes advantage of the fact that a significant number of bi-level images contain structure that can be used to enhance the compression performance. A large percentage of bi-level images consist of text on some background, while another significant percentage of bi-level images are or contain halftone images. The JBIG2 approach allows the encoder to select the compression technique that would provide the best performance for the type of data. To do so, the encoder divides the page to be compressed into three types of regions called symbol regions, halftone regions, and generic regions. The symbol regions are those containing text data, the halftone regions are those containing halftone images, and the generic regions are all the regions that do not fit into either category.

The partitioning information has to be supplied to the decoder. The decoder requires that all information provided to it be organized in segments that are made up of a segment header, a data header, and segment data. The page information segment contains information about the page, including its size and resolution. The decoder uses this information to set up the page buffer. It then decodes the various regions using the appropriate decoding procedure and places the different regions in the appropriate locations.

G e n e r i c D e c o d i n g P r o c e d u r e s

There are two procedures used for decoding the generic regions: the generic region decoding procedure and the generic refinement region decoding procedure. The generic region decoding procedure uses either the MMR technique used in the Group 3 and Group 4 fax standards or a variation of the technique used to encode the lowest-resolution layer in the JBIG recommendation. We described the operation of the MMR algorithm earlier in this chapter. The latter procedure is a variation of the JBIG approach that adds a step called typical prediction, and it works as follows.

In a bi-level image, a line of pixels is often identical to the line above. In typical prediction, if the current line is the same as the line above, a bit flag called LNTP_n is set to 0, and the line is not transmitted. If the line is not the same, the flag is set to 1, and the line is coded using the contexts currently used for the low-resolution layer in JBIG. The value of LNTP_n is encoded by generating another bit, SLNTP_n, according to the rule

SLNTP_n = !(LNTP_n ⊕ LNTP_{n−1})

which is treated as a virtual pixel to the left of each row. If the decoder decodes an LNTP value of 0, it copies the line above. If it decodes an LNTP value of 1, the following bits
  • 213. 190 7 L O S S L E S S I M A G E C O M P R E S S I O N in the segment data are decoded using an arithmetic decoder and the contexts described previously. The generic refinement decoding procedure assumes the existence of a reference layer and decodes the segment data with reference to this layer. The standard leaves open the specification of the reference layer. S y m b o l R e g i o n D e c o d i n g The symbol region decoding procedure is a dictionary-based decoding procedure. The symbol region segment is decoded with the help of a symbol dictionary contained in the symbol dictionary segment. The data in the symbol region segment contains the location where a symbol is to be placed, as well as the index to an entry in the symbol dictionary. The symbol dictionary consists of a set of bitmaps and is decoded using the generic decoding procedures. Note that because JBIG2 allows for lossy compression, the symbols do not have to exactly match the symbols in the original document. This feature can significantly increase the compression performance when the original document contains noise that may preclude exact matches with the symbols in the dictionary. H a l f t o n e R e g i o n D e c o d i n g The halftone region decoding procedure is also a dictionary-based decoding procedure. The halftone region segment is decoded with the help of a halftone dictionary contained in the halftone dictionary segment. The halftone dictionary segment is decoded using the generic decoding procedures. The data in the halftone region segment consists of the location of the halftone region and indices to the halftone dictionary. The dictionary is a set of fixed-size halftone patterns. As in the case of the symbol region, if lossy compression is allowed, the halftone patterns do not have to exactly match the patterns in the original document. By allowing for nonexact matches, the dictionary can be kept small, resulting in higher compression. 7 . 7 MRC- —T.44 With the rapid advance of technology for document production, documents have changed in appearance. Where a document used to be a set of black and white printed pages, now documents contain multicolored text as well as color images. To deal with this new type of document, the ITU-T developed the recommendation T.44 for Mixed Raster Content (MRC). This recommendation takes the approach of separating the document into ele- ments that can be compressed using available techniques. Thus, it is more an approach of partitioning a document image than a compression technique. The compression strategies employed here are borrowed from previous standards such as JPEG (T.81), JBIG (T.82), and even T.6. The T.44 recommendation divides a page into slices where the width of the slice is equal to the width of the entire page. The height of the slice is variable. In the base mode, each
  • 214. 7.7 MRC- —T.44 191 You are invited to a PARTY It Will Soon be June 4 with Ruby and Hanna to CELEBRATE That’s Ruby’s Birthday! F I G U R E 7. 14 Ruby’s birthday invitation. This area not coded or sent F I G U R E 7. 15 The background layer. slice is represented by three layers: a background layer, a foreground layer, and a mask layer. These layers are used to effectively represent three basic data types: color images (which may be continuous tone or color mapped), bi-level data, and multilevel (multicolor) data. The multilevel image data is put in the background layer, and the mask and foreground layers are used to represent the bi-level and multilevel nonimage data. To work through the various definitions, let us use the document shown in Figure 7.14 as an example. We have divided the document into two slices. The top slice contains the picture of the cake and two lines of writing in two “colors.” Notice that the heights of the two slices are not the same and the complexity of the information contained in the two slices is not the same. The top slice contains multicolored text and a continuous tone image whereas the bottom slice contains only bi-level text. Let us take the upper slice first and see how to divide it into the three layers. We will discuss how to code these layers later. The background layer consists of the cake and nothing else. The default color for the background layer is white (though this can be changed). Therefore, we do not need to send the left half of this layer, which contains only white pixels.
  • 215. 192 7 L O S S L E S S I M A G E C O M P R E S S I O N It Will Soon be June 4 That’s Ruby’s Birthday! F I G U R E 7. 16 The mask layer. This area not coded or sent. F I G U R E 7. 17 The foreground layer. The mask layer (Figure 7.16) consists of a bi-level representation of the textual infor- mation, while the foreground layer contains the colors used in the text. To reassemble the slice we begin with the background layer. We then add to it pixels from the foreground layer using the mask layer as the guide. Wherever the mask layer pixel is black (1) we pick the corresponding pixel from the foreground layer. Wherever the mask pixel is white (0) we use the pixel from the background layer. Because of its role in selecting pixels, the mask layer is also known as the selector layer. During transmission the mask layer is transmitted first, followed by the background and the foreground layers. During the rendering process the background layer is rendered first. When we look at the lower slice we notice that it contains only bi-level information. In this case we only need the mask layer because the other two layers would be superfluous. In order to deal with this kind of situation, the standard defines three different kinds of stripes. Three-layer stripes (3LS) contain all three layers and is useful when there is both image and textual data in the strip. Two-layer stripes (2LS) only contain two layers, with the third set to a constant value. This kind of stripe would be useful when encoding a stripe with multicolored text and no images, or a stripe with images and bi-level text or line drawings. The third kind of stripe is a one-layer stripe (1LS) which would be used when a stripe contains only bi-level text or line art, or only continuous tone images. Once the document has been partitioned it can be compressed. Notice that the types of data we have after partitioning are continuous tone images, bi-level information, and multilevel regions. We already have efficient standards for compressing these types of data. For the mask layer containing bi-level information, the recommendation suggests that one of several approaches can be used, including modified Huffman or modified READ
  • 216. 7.9 Projects and Problems 193 (as described in recomendation T.4), MMR (as described in recommendation T.6) or JBIG (recommendation T.82). The encoder includes information in the datastream about which algorithm has been used. For the continuous tone images and the multilevel regions contained in the foreground and background layers, the recommendation suggests the use of the JPEG standard (recommendation T.81) or the JBIG standard. The header for each slice contains information about which algorithm is used for compression. 7 . 8 Summary In this section we have examined a number of ways to compress images. All these approaches exploit the fact that pixels in an image are generally highly correlated with their neighbors. This correlation can be used to predict the actual value of the current pixel. The prediction error can then be encoded and transmitted. Where the correlation is especially high, as in the case of bi-level images, long stretches of pixels can be encoded together using their similarity with previous rows. Finally, by identifying different components of an image that have common characteristics, an image can be partitioned and each partition encoded using the algorithm best suited to it. F u r t h e r R e a d i n g 1. A detailed survey of lossless image compression techniques can be found in “Lossless Image Compression” by K.P. Subbalakshmi. This chapter appears in the Lossless Compression Handbook, Academic Press, 2003. 2. For a detailed description of the LOCO-I and JPEG-LS compression algorithm, see “The LOCO-I Lossless Image Compression Algorithm: Principles and Standardiza- tion into JPEG-LS,” Hewlett-Packard Laboratories Technical Report HPL-98-193, November 1998 [92]. 3. The JBIG and JBIG2 standards are described in a very accessible manner in “Lossless Bilevel Image Compression,” by M.W. Hoffman. This chapter appears in the Lossless Compression Handbook, Academic Press, 2003. 4. The area of lossless image compression is a very active one, and new schemes are being published all the time. These articles appear in a number of journals, including Journal of Electronic Imaging, Optical Engineering, IEEE Transactions on Image Processing, IEEE Transactions on Communications, Communications of the ACM, IEEE Transactions on Computers, and Image Communication, among others. 7 . 9 Projects and Problems 1. Encode the binary image shown in Figure 7.18 using the modified Huffman scheme. 2. Encode the binary image shown in Figure 7.18 using the modified READ scheme. 3. Encode the binary image shown in Figure 7.18 using the modified modified READ scheme.
F I G U R E 7. 18 An 8 × 16 binary image.

4. Suppose we want to transmit a 512 × 512, 8-bits-per-pixel image over a 9600 bits per second line.
(a) If we were to transmit this image using raster scan order, how many rows of the image will the user have received after 15 seconds? To what fraction of the image does this correspond?
(b) If we were to transmit the image using the method of Example 7.5.1, how long would it take the user to receive the first approximation? How long would it take to receive the first two approximations?

5. An implementation of the progressive transmission example (Example 7.5.1) is included in the programs accompanying this book. The program is called prog_tran1.c. Using this program as a template, experiment with different ways of generating approximations (you could use various types of weighted averages) and comment on the qualitative differences (or lack thereof) between the various schemes. Try different block sizes and comment on the practical effects in terms of quality and rate.

6. The program jpegll_enc.c generates the residual image for the different JPEG prediction modes, while the program jpegll_dec.c reconstructs the original image from the residual image. The output of the encoder program can be used as the input to the public domain arithmetic coding program mentioned in Chapter 4 and the Huffman coding programs mentioned in Chapter 3. Study the performance of different combinations of prediction mode and entropy coder using three images of your choice. Account for any differences you see.

7. Extend jpegll_enc.c and jpegll_dec.c with an additional prediction mode—be creative! Compare the performance of your predictor with the JPEG predictors.

8. Implement the portions of the CALIC algorithm described in this chapter. Encode the Sena image using your implementation.
8 Mathematical Preliminaries for Lossy Coding

8 . 1 Overview

Before we discussed lossless compression, we presented some of the mathematical background necessary for understanding and appreciating the compression schemes that followed. We will try to do the same here for lossy compression schemes. In lossless compression schemes, rate is the general concern. With lossy compression schemes, the loss of information associated with such schemes is also a concern. We will look at different ways of assessing the impact of the loss of information. We will also briefly revisit the subject of information theory, mainly to get an understanding of the part of the theory that deals with the trade-offs involved in reducing the rate, or number of bits per sample, at the expense of the introduction of distortion in the decoded information. This aspect of information theory is also known as rate distortion theory. We will also look at some of the models used in the development of lossy compression schemes.

8 . 2 Introduction

This chapter will provide some mathematical background that is necessary for discussing lossy compression techniques. Most of the material covered in this chapter is common to many of the compression techniques described in the later chapters. Material that is specific to a particular technique is described in the chapter in which the technique is presented. Some of the material presented in this chapter is not essential for understanding the techniques described in this book. However, to follow some of the literature in this area, familiarity with these topics is necessary. We have marked these sections with a star (★). If you are primarily interested in the techniques, you may wish to skip these sections, at least on first reading.
  • 219. 196 8 L O S S Y C O D I N G On the other hand, if you wish to delve more deeply into these topics, we have included a list of resources at the end of this chapter that provide a more mathematically rigorous treatment of this material. When we were looking at lossless compression, one thing we never had to worry about was how the reconstructed sequence would differ from the original sequence. By definition, the reconstruction of a losslessly constructed sequence is identical to the original sequence. However, there is only a limited amount of compression that can be obtained with lossless compression. There is a floor (a hard one) defined by the entropy of the source, below which we cannot drive the size of the compressed sequence. As long as we wish to preserve all of the information in the source, the entropy, like the speed of light, is a fundamental limit. The limited amount of compression available from using lossless compression schemes may be acceptable in several circumstances. The storage or transmission resources available to us may be sufficient to handle our data requirements after lossless compression. Or the possible consequences of a loss of information may be much more expensive than the cost of additional storage and/or transmission resources. This would be the case with the storage and archiving of bank records; an error in the records could turn out to be much more expensive than the cost of buying additional storage media. If neither of these conditions hold—that is, resources are limited and we do not require absolute integrity—we can improve the amount of compression by accepting a certain degree of loss during the compression process. Performance measures are necessary to determine the efficiency of our lossy compression schemes. For the lossless compression schemes we essentially used only the rate as the performance measure. That would not be feasible for lossy compression. If rate were the only criterion for lossy compression schemes, where loss of information is permitted, the best lossy compression scheme would be simply to throw away all the data! Therefore, we need some additional performance measure, such as some measure of the difference between the original and reconstructed data, which we will refer to as the distortion in the reconstructed data. In the next section, we will look at some of the more well-known measures of difference and discuss their advantages and shortcomings. In the best of all possible worlds we would like to incur the minimum amount of distortion while compressing to the lowest rate possible. Obviously, there is a trade-off between minimizing the rate and keeping the distortion small. The extreme cases are when we transmit no information, in which case the rate is zero, or keep all the information, in which case the distortion is zero. The rate for a discrete source is simply the entropy. The study of the situations between these two extremes is called rate distortion theory. In this chapter we will take a brief look at some important concepts related to this theory. Finally, we need to expand the dictionary of models available for our use, for several reasons. First, because we are now able to introduce distortion, we need to determine how to add distortion intelligently. For this, we often need to look at the sources somewhat differently than we have done previously. 
Another reason is that we will be looking at compression schemes for sources that are analog in nature, even though we have treated them as discrete sources in the past. We need models that more precisely describe the true nature of these sources. We will describe several different models that are widely used in the development of lossy compression algorithms. We will use the block diagram and notation used in Figure 8.1 throughout our dis- cussions. The output of the source is modeled as a random variable X. The source coder
  • 220. 8.3 Distortion Criteria 197 Source Source encoder Channel X Xc User Source decoder Y Xc ^ F I G U R E 8. 1 Block diagram of a generic compression scheme. takes the source output and produces the compressed representation Xc. The channel block represents all transformations the compressed representation undergoes before the source is reconstructed. Usually, we will take the channel to be the identity mapping, which means Xc = X̂c. The source decoder takes the compressed representation and produces a recon- struction of the source output for the user. 8 . 3 Distortion Criteria How do we measure the closeness or fidelity of a reconstructed source sequence to the original? The answer frequently depends on what is being compressed and who is doing the answering. Suppose we were to compress and then reconstruct an image. If the image is a work of art and the resulting reconstruction is to be part of a book on art, the best way to find out how much distortion was introduced and in what manner is to ask a person familiar with the work to look at the image and provide an opinion. If the image is that of a house and is to be used in an advertisement, the best way to evaluate the quality of the reconstruction is probably to ask a real estate agent. However, if the image is from a satellite and is to be processed by a machine to obtain information about the objects in the image, the best measure of fidelity is to see how the introduced distortion affects the functioning of the machine. Similarly, if we were to compress and then reconstruct an audio segment, the judgment of how close the reconstructed sequence is to the original depends on the type of material being examined as well as the manner in which the judging is done. An audiophile is much more likely to perceive distortion in the reconstructed sequence, and distortion is much more likely to be noticed in a musical piece than in a politician’s speech. In the best of all worlds we would always use the end user of a particular source output to assess quality and provide the feedback required for the design. In practice this is not often possible, especially when the end user is a human, because it is difficult to incorporate the human response into mathematical design procedures. Also, there is difficulty in objectively reporting the results. The people asked to assess one person’s design may be more easygoing than the people who were asked to assess another person’s design. Even though the reconstructed output using one person’s design is rated “excellent” and the reconstructed output using the other person’s design is only rated “acceptable,” switching observers may change the ratings. We could reduce this kind of bias by recruiting a large
number of observers in the hope that the various biases will cancel each other out. This is often the option used, especially in the final stages of the design of compression systems. However, the rather cumbersome nature of this process is limiting. We generally need a more practical method for looking at how close the reconstructed signal is to the original.

A natural thing to do when looking at the fidelity of a reconstructed sequence is to look at the differences between the original and reconstructed values—in other words, the distortion introduced in the compression process. Two popular measures of distortion or difference between the original and reconstructed sequences are the squared error measure and the absolute difference measure. These are called difference distortion measures. If {x_n} is the source output and {y_n} is the reconstructed sequence, then the squared error measure is given by

d(x, y) = (x − y)^2   (8.1)

and the absolute difference measure is given by

d(x, y) = |x − y|   (8.2)

In general, it is difficult to examine the difference on a term-by-term basis. Therefore, a number of average measures are used to summarize the information in the difference sequence. The most often used average measure is the average of the squared error measure. This is called the mean squared error (mse) and is often represented by the symbol σ^2 or σ_d^2:

σ_d^2 = (1/N) Σ_{n=1}^{N} (x_n − y_n)^2   (8.3)

If we are interested in the size of the error relative to the signal, we can find the ratio of the average squared value of the source output and the mse. This is called the signal-to-noise ratio (SNR):

SNR = σ_x^2 / σ_d^2   (8.4)

where σ_x^2 is the average squared value of the source output, or signal, and σ_d^2 is the mse. The SNR is often measured on a logarithmic scale and the units of measurement are decibels (abbreviated to dB):

SNR (dB) = 10 log_10 (σ_x^2 / σ_d^2)   (8.5)

Sometimes we are more interested in the size of the error relative to the peak value of the signal, x_peak, than in the size of the error relative to the average squared value of the signal. This ratio is called the peak-signal-to-noise ratio (PSNR) and is given by

PSNR (dB) = 10 log_10 (x_peak^2 / σ_d^2)   (8.6)
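The average measures just defined are easy to compute. The following C fragment evaluates the mse of Equation (8.3) and the SNR and PSNR of Equations (8.5) and (8.6) for two short made-up sequences; the numbers have no significance beyond illustration.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double x[] = { 100, 102, 98, 101, 99, 97, 103, 100 };   /* source         */
        double y[] = { 100, 101, 99, 101, 98, 97, 104, 100 };   /* reconstruction */
        int N = 8;
        double x_peak = 255.0;          /* peak value for 8-bit data */

        double mse = 0.0, sig = 0.0;
        for (int n = 0; n < N; n++) {
            mse += (x[n] - y[n]) * (x[n] - y[n]);
            sig += x[n] * x[n];
        }
        mse /= N;                       /* sigma_d^2, Equation (8.3) */
        sig /= N;                       /* sigma_x^2                 */

        printf("mse  = %.3f\n", mse);
        printf("SNR  = %.2f dB\n", 10.0 * log10(sig / mse));
        printf("PSNR = %.2f dB\n", 10.0 * log10(x_peak * x_peak / mse));
        return 0;
    }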
  • 222. 8.3 Distortion Criteria 199 Another difference distortion measure that is used quite often, although not as often as the mse, is the average of the absolute difference, or d1 = 1 N N n=1 xn −yn (8.7) This measure seems especially useful for evaluating image compression algorithms. In some applications, the distortion is not perceptible as long as it is below some threshold. In these situations we might be interested in the maximum value of the error magnitude, d = max n xn −yn (8.8) We have looked at two approaches to measuring the fidelity of a reconstruction. The first method involving humans may provide a very accurate measure of perceptible fidelity, but it is not practical and not useful in mathematical design approaches. The second is mathematically tractable, but it usually does not provide a very accurate indication of the perceptible fidelity of the reconstruction. A middle ground is to find a mathematical model for human perception, transform both the source output and the reconstruction to this perceptual space, and then measure the difference in the perceptual space. For example, suppose we could find a transformation that represented the actions performed by the human visual system (HVS) on the light intensity impinging on the retina before it is “perceived” by the cortex. We could then find x and y and examine the difference between them. There are two problems with this approach. First, the process of human perception is very difficult to model, and accurate models of perception are yet to be discovered. Second, even if we could find a mathematical model for perception, the odds are that it would be so complex that it would be mathematically intractable. In spite of these disheartening prospects, the study of perception mechanisms is still important from the perspective of design and analysis of compression systems. Even if we cannot obtain a transformation that accurately models perception, we can learn something about the properties of perception that may come in handy in the design of compression systems. In the following, we will look at some of the properties of the human visual system and the perception of sound. Our review will be far from thorough, but the intent here is to present some properties that will be useful in later chapters when we talk about compression of images, video, speech, and audio. 8 . 3 . 1 T h e H u m a n V i s u a l S y s t e m The eye is a globe-shaped object with a lens in the front that focuses objects onto the retina in the back of the eye. The retina contains two kinds of receptors, called rods and cones. The rods are more sensitive to light than cones, and in low light most of our vision is due to the operation of rods. There are three kinds of cones, each of which are most sensitive at different wavelengths of the visible spectrum. The peak sensitivities of the cones are in the red, blue, and green regions of the visible spectrum [93]. The cones are mostly concentrated in a very small area of the retina called the fovea. Although the rods are more numerous than the cones, the cones provide better resolution because they are more closely packed in the fovea. The muscles of the eye move the eyeball, positioning the image of the object on
  • 223. 200 8 L O S S Y C O D I N G Light source Spatial low-pass filter Logarithmic nonlinearity F I G U R E 8. 2 A model of monochromatic vision. the fovea. This becomes a drawback in low light. One way to improve what you see in low light is to focus to one side of the object. This way the object is imaged on the rods, which are more sensitive to light. The eye is sensitive to light over an enormously large range of intensities; the upper end of the range is about 1010 times the lower end of the range. However, at a given instant we cannot perceive the entire range of brightness. Instead, the eye adapts to an average brightness level. The range of brightness levels that the eye can perceive at any given instant is much smaller than the total range it is capable of perceiving. If we illuminate a screen with a certain intensity I and shine a spot on it with different intensity, the spot becomes visible when the difference in intensity is I. This is called the just noticeable difference (jnd). The ratio I I is known as the Weber fraction or Weber ratio. This ratio is known to be constant at about 002 over a wide range of intensities in the absence of background illumination. However, if the background illumination is changed, the range over which the Weber ratio remains constant becomes relatively small. The constant range is centered around the intensity level to which the eye adapts. If I I is constant, then we can infer that the sensitivity of the eye to intensity is a logarithmic function (dlogI = dI/I). Thus, we can model the eye as a receptor whose output goes to a logarithmic nonlinearity. We also know that the eye acts as a spatial low- pass filter [94, 95]. Putting all of this information together, we can develop a model for monochromatic vision, shown in Figure 8.2. How does this description of the human visual system relate to coding schemes? Notice that the mind does not perceive everything the eye sees. We can use this knowledge to design compression systems such that the distortion introduced by our lossy compression scheme is not noticeable. 8 . 3 . 2 A u d i t o r y P e r c e p t i o n The ear is divided into three parts, creatively named the outer ear, the middle ear, and the inner ear. The outer ear consists of the structure that directs the sound waves, or pressure waves, to the tympanic membrane, or eardrum. This membrane separates the outer ear from the middle ear. The middle ear is an air-filled cavity containing three small bones that provide coupling between the tympanic membrane and the oval window, which leads into the inner ear. The tympanic membrane and the bones convert the pressure waves in the air to acoustical vibrations. The inner ear contains, among other things, a snail-shaped passage called the cochlea that contains the transducers that convert the acoustical vibrations to nerve impulses.
  • 224. 8.4 Information Theory Revisited 201 The human ear can hear sounds from approximately 20 Hz to 20 kHz, a 1000:1 range of frequencies. The range decreases with age; older people are usually unable to hear the higher frequencies. As in vision, auditory perception has several nonlinear components. One is that loudness is a function not only of the sound level, but also of the frequency. Thus, for example, a pure 1 kHz tone presented at a 20 dB intensity level will have the same apparent loudness as a 50 Hz tone presented at a 50 dB intensity level. By plotting the amplitude of tones at different frequencies that sound equally loud, we get a series of curves called the Fletcher-Munson curves [96]. Another very interesting audio phenomenon is that of masking, where one sound blocks out or masks the perception of another sound. The fact that one sound can drown out another seems reasonable. What is not so intuitive about masking is that if we were to try to mask a pure tone with noise, only the noise in a small frequency range around the tone being masked contributes to the masking. This range of frequencies is called the critical band. For most frequencies, when the noise just masks the tone, the ratio of the power of the tone divided by the power of the noise in the critical band is a constant [97]. The width of the critical band varies with frequency. This fact has led to the modeling of auditory perception as a bank of band-pass filters. There are a number of other, more complicated masking phenomena that also lend support to this theory (see [97, 98] for more information). The limitations of auditory perception play a major role in the design of audio compression algorithms. We will delve further into these limitations when we discuss audio compression in Chapter 16. 8 . 4 Information Theory Revisited In order to study the trade-offs between rate and the distortion of lossy compression schemes, we would like to have rate defined explicitly as a function of the distortion for a given distortion measure. Unfortunately, this is generally not possible, and we have to go about it in a more roundabout way. Before we head down this path, we need a few more concepts from information theory. In Chapter 2, when we talked about information, we were referring to letters from a single alphabet. In the case of lossy compression, we have to deal with two alphabets, the source alphabet and the reconstruction alphabet. These two alphabets are generally different from each other. E x a m p l e 8 . 4 . 1: A simple lossy compression approach is to drop a certain number of the least significant bits from the source output. We might use such a scheme between a source that generates monochrome images at 8 bits per pixel and a user whose display facility can display only 64 different shades of gray. We could drop the two least significant bits from each pixel before transmitting the image to the user. There are other methods we can use in this situation that are much more effective, but this is certainly simple. Suppose our source output consists of 4-bit words 01215. The source encoder encodes each value by shifting out the least significant bit. The output alphabet for the source coder is 0127. At the receiver we cannot recover the original value exactly. However,
we can get an approximation by shifting in a 0 as the least significant bit, or in other words, multiplying the source encoder output by two. Thus, the reconstruction alphabet is {0, 2, 4, ..., 14}, and the source and reconstruction do not take values from the same alphabet.

As the source and reconstruction alphabets can be distinct, we need to be able to talk about the information relationships between two random variables that take on values from two different alphabets.

8.4.1 Conditional Entropy

Let X be a random variable that takes values from the source alphabet \mathcal{X} = {x_0, x_1, ..., x_{N-1}}. Let Y be a random variable that takes on values from the reconstruction alphabet \mathcal{Y} = {y_0, y_1, ..., y_{M-1}}. From Chapter 2 we know that the entropy of the source and the reconstruction are given by

H(X) = -\sum_{i=0}^{N-1} P(x_i) \log_2 P(x_i)

and

H(Y) = -\sum_{j=0}^{M-1} P(y_j) \log_2 P(y_j)

A measure of the relationship between two random variables is the conditional entropy (the average value of the conditional self-information). Recall that the self-information for an event A was defined as

i(A) = \log \frac{1}{P(A)} = -\log P(A)

In a similar manner, the conditional self-information of an event A, given that another event B has occurred, can be defined as

i(A|B) = \log \frac{1}{P(A|B)} = -\log P(A|B)

Suppose B is the event "Frazer has not drunk anything in two days," and A is the event "Frazer is thirsty." Then P(A|B) should be close to one, which means that the conditional self-information i(A|B) would be close to zero. This makes sense from an intuitive point of view as well. If we know that Frazer has not drunk anything in two days, then the statement that Frazer is thirsty would not be at all surprising to us and would contain very little information.

As in the case of self-information, we are generally interested in the average value of the conditional self-information. This average value is called the conditional entropy. The conditional entropies of the source and reconstruction alphabets are given as

H(X|Y) = -\sum_{i=0}^{N-1} \sum_{j=0}^{M-1} P(x_i|y_j) P(y_j) \log_2 P(x_i|y_j)        (8.9)
  • 226. 8.4 Information Theory Revisited 203 and HYX = − N−1 i=0 M−1 j=0 PxiyjPyjlog2 Pyjxi (8.10) The conditional entropy HXY can be interpreted as the amount of uncertainty remaining about the random variable X, or the source output, given that we know what value the reconstruction Y took. The additional knowledge of Y should reduce the uncertainty about X, and we can show that HXY ≤ HX (8.11) (see Problem 5). E x a m p l e 8 . 4 . 2: Suppose we have the 4-bits-per-symbol source and compression scheme described in Exam- ple 8.4.1. Assume that the source is equally likely to select any letter from its alphabet. Let us calculate the various entropies for this source and compression scheme. As the source outputs are all equally likely, PX = i = 1 16 for all i ∈ 01215, and therefore HX = − i 1 16 log 1 16 = log16 = 4 bits (8.12) We can calculate the probabilities of the reconstruction alphabet: PY = j = PX = j+PX = j +1 = 1 16 + 1 16 = 1 8 (8.13) Therefore, HY = 3 bits. To calculate the conditional entropy HXY, we need the condi- tional probabilities Pxiyj. From our construction of the source encoder, we see that PX = iY = j = 1 2 if i = j or i = j +1, for j = 024 14 0 otherwise. (8.14) Substituting this in the expression for HXY in Equation (8.9), we get HXY = − i j PX = iY = jPY = jlogPX = iY = j = − j PX = jY = jPY = jlogPX = jY = j +PX = j +1Y = jPY = jlogPX = j +1Y = j = −8 1 2 · 1 8 log 1 2 + 1 2 · 1 8 log 1 2 (8.15) = 1 (8.16) Let us compare this answer to what we would have intuitively expected the uncertainty to be, based on our knowledge of the compression scheme. With the coding scheme described
  • 227. 204 8 L O S S Y C O D I N G here, knowledge of Y means that we know the first 3 bits of the input X. The only thing about the input that we are uncertain about is the value of the last bit. In other words, if we know the value of the reconstruction, our uncertainty about the source output is 1 bit. Therefore, at least in this case, our intuition matches the mathematical definition. To obtain HYX, we need the conditional probabilities Pyjxi. From our knowledge of the compression scheme, we see that PY = jX = i = 1 if i = j or i = j +1, for j = 024 14 0 otherwise. (8.17) If we substitute these values into Equation (8.10), we get HYX = 0 bits (note that 0log0 = 0). This also makes sense. For the compression scheme described here, if we know the source output, we know 4 bits, the first 3 of which are the reconstruction. Therefore, in this example, knowledge of the source output at a specific time completely specifies the corresponding reconstruction. 8 . 4 . 2 A v e r a g e M u t u a l I n f o r m a t i o n We make use of one more quantity that relates the uncertainty or entropy of two random variables. This quantity is called the mutual information and is defined as ixkyj = log Pxkyj Pxk (8.18) We will use the average value of this quantity, appropriately called the average mutual information, which is given by IXY = N−1 i=0 M−1 j=0 Pxiyjlog Pxiyj Pxi (8.19) = N−1 i=0 M−1 j=0 PxiyjPyjlog Pxiyj Pxi (8.20) We can write the average mutual information in terms of the entropy and the conditional entropy by expanding the argument of the logarithm in Equation (8.20). IXY = N−1 i=0 M−1 j=0 Pxiyjlog Pxiyj Pxi (8.21) = N−1 i=0 M−1 j=0 PxiyjlogPxiyj− N−1 i=0 M−1 j=0 PxiyjlogPxi (8.22) = HX−HXY (8.23) where the second term in Equation (8.22) is HX, and the first term is −HXY. Thus, the average mutual information is the entropy of the source minus the uncertainty that remains
  • 228. 8.4 Information Theory Revisited 205 about the source output after the reconstructed value has been received. The average mutual information can also be written as IXY = HY−HYX = IYX (8.24) E x a m p l e 8 . 4 . 3: For the source coder of Example 8.4.2, HX = 4 bits, and HXY = 1 bit. Therefore, using Equation (8.23), the average mutual information IXY is 3 bits. If we wish to use Equation (8.24) to compute IXY, we would need HY and HYX, which from Example 8.4.2 are 3 and 0, respectively. Thus, the value of IXY still works out to be 3 bits. 8 . 4 . 3 D i f f e r e n t i a l E n t r o p y Up to this point we have assumed that the source picks its outputs from a discrete alphabet. When we study lossy compression techniques, we will see that for many sources of interest to us this assumption is not true. In this section, we will extend some of the information theoretic concepts defined for discrete random variables to the case of random variables with continuous distributions. Unfortunately, we run into trouble from the very beginning. Recall that the first quantity we defined was self-information, which was given by log 1 Pxi , where Pxi is the probability that the random variable will take on the value xi. For a random variable with a continuous distribution, this probability is zero. Therefore, if the random variable has a continuous distribution, the “self-information” associated with any value is infinity. If we do not have the concept of self-information, how do we go about defining entropy, which is the average value of the self-information? We know that many continuous functions can be written as limiting cases of their discretized version. We will try to take this route in order to define the entropy of a continuous random variable X with probability density function (pdf) fXx. While the random variable X cannot generally take on a particular value with nonzero probability, it can take on a value in an interval with nonzero probability. Therefore, let us divide the range of the random variable into intervals of size . Then, by the mean value theorem, in each interval i−1 i , there exists a number xi, such that fXxi = i i−1 fXxdx (8.25) Let us define a discrete random variable Xd with pdf PXd = xi = fXxi (8.26) Then we can obtain the entropy of this random variable as HXd = − i=− PxilogPxi (8.27) = − i=− fXxi logfXxi (8.28)
  • 229. 206 8 L O S S Y C O D I N G = − i=− fXxi logfXxi− i=− fXxi log (8.29) = − i=− fXxilogfXxi −log (8.30) Taking the limit as → 0 of Equation (8.30), the first term goes to − − fXxlogfXxdx, which looks like the analog to our definition of entropy for discrete sources. However, the second term is −log , which goes to plus infinity when goes to zero. It seems there is not an analog to entropy as defined for discrete sources. However, the first term in the limit serves some functions similar to that served by entropy in the discrete case and is a useful function in its own right. We call this term the differential entropy of a continuous source and denote it by hX. E x a m p l e 8 . 4 . 4: Suppose we have a random variable X that is uniformly distributed in the interval ab. The differential entropy of this random variable is given by hX = − − fXxlogfXxdx (8.31) = − b a 1 b −a log 1 b −a dx (8.32) = logb −a (8.33) Notice that when b −a is less than one, the differential entropy will become negative—in contrast to the entropy, which never takes on negative values. Later in this chapter, we will find particular use for the differential entropy of the Gaussian source. E x a m p l e 8 . 4 . 5: Suppose we have a random variable X that has a Gaussian pdf, fXx = 1 √ 2 2 exp− x −2 2 2 (8.34) The differential entropy is given by hX = − − 1 √ 2 2 exp− x −2 2 2 log 1 √ 2 2 exp− x −2 2 2 dx (8.35) = −log 1 √ 2 2 − fXxdx + − x −2 2 2 logefXxdx (8.36)
  • 230. 8.4 Information Theory Revisited 207 = 1 2 log2 2 + 1 2 loge (8.37) = 1 2 log2e 2 (8.38) Thus, the differential entropy of a Gaussian random variable is directly proportional to its variance. The differential entropy for the Gaussian distribution has the added distinction that it is larger than the differential entropy for any other continuously distributed random variable with the same variance. That is, for any random variable X, with variance 2 hX ≤ 1 2 log2e 2 (8.39) The proof of this statement depends on the fact that for any two continuous distributions fXX and gXX − − fXxlogfXxdx ≤ − − fXxloggXxdx (8.40) We will not prove Equation (8.40) here, but you may refer to [99] for a simple proof. To obtain Equation (8.39), we substitute the expression for the Gaussian distribution for gXx. Noting that the left-hand side of Equation (8.40) is simply the differential entropy of the random variable X, we have hX ≤ − − fXxlog 1 √ 2 2 exp− x −2 2 2 dx = 1 2 log2 2 +loge − fXx x −2 2 2 dx = 1 2 log2 2 + loge 2 2 − fXxx −2 dx = 1 2 log2e 2 (8.41) We seem to be striking out with continuous random variables. There is no analog for self-information and really none for entropy either. However, the situation improves when we look for an analog for the average mutual information. Let us define the random variable Yd in a manner similar to the random variable Xd, as the discretized version of a continuous valued random variable Y. Then we can show (see Problem 4) HXdYd = − i=− j=− fXY xiyjfY yjlogfXY xiyj −log (8.42)
  • 231. 208 8 L O S S Y C O D I N G Therefore, the average mutual information for the discretized random variables is given by IXdYd = HXd−HXdYd (8.43) = − i=− fXxi logfXxi (8.44) − i=− j=− fXY xiyjfY yjlogfXY xiyj (8.45) Notice that the two log s in the expression for HXd and HXdYd cancel each other out, and as long as hX and hXY are not equal to infinity, when we take the limit as → 0 of IXdYd we get IXY = hX−hXY (8.46) The average mutual information in the continuous case can be obtained as a limiting case of the average mutual information for the discrete case and has the same physical significance. We have gone through a lot of mathematics in this section. But the information will be used immediately to define the rate distortion function for a random source. 8 . 5 Rate Distortion Theory Rate distortion theory is concerned with the trade-offs between distortion and rate in lossy compression schemes. Rate is defined as the average number of bits used to represent each sample value. One way of representing the trade-offs is via a rate distortion function R(D). The rate distortion function RD specifies the lowest rate at which the output of a source can be encoded while keeping the distortion less than or equal to D. On our way to mathematically defining the rate distortion function, let us look at the rate and distortion for some different lossy compression schemes. In Example 8.4.2, knowledge of the value of the input at time k completely specifies the reconstructed value at time k. In this situation, Pyjxi = 1 for some j = ji 0 otherwise. (8.47) Therefore, D = N−1 i=0 M−1 j=0 PyjxiPxidxiyj (8.48) = N−1 i=0 Pxidxiyji (8.49) where we used the fact that Pxiyj = PyjxiPxi in Equation (8.48). The rate for this source coder is the output entropy HY of the source decoder. If this were always the case, the task of obtaining a rate distortion function would be relatively simple. Given a
distortion constraint D*, we could look at all encoders with distortion less than D* and pick the one with the lowest output entropy. This entropy would be the rate corresponding to the distortion D*. However, the requirement that knowledge of the input at time k completely specifies the reconstruction at time k is very restrictive, and there are many efficient compression techniques that would have to be excluded under this requirement. Consider the following example.

Example 8.5.1: With a data sequence that consists of height and weight measurements, obviously height and weight are quite heavily correlated. In fact, after studying a long sequence of data, we find that if we plot the height along the x axis and the weight along the y axis, the data points cluster along the line y = 2.5x. In order to take advantage of this correlation, we devise the following compression scheme. For a given pair of height and weight measurements, we find the orthogonal projection on the y = 2.5x line as shown in Figure 8.3. The point on this line can be represented as the distance (to the nearest integer) from the origin. Thus, we encode a pair of values into a single value. At the time of reconstruction, we simply map this value back into a pair of height and weight measurements.

For instance, suppose somebody is 72 inches tall and weighs 200 pounds (point A in Figure 8.3). This corresponds to a point at a distance of 212 along the y = 2.5x line. The reconstructed values of the height and weight corresponding to this value are 79 and 197. Notice that the reconstructed values differ from the original values.

FIGURE 8.3 Compression scheme for encoding height-weight pairs (height in inches on the horizontal axis, weight in pounds on the vertical axis, with example points A and B on either side of the line y = 2.5x).

Suppose we now have
  • 233. 210 8 L O S S Y C O D I N G another individual who is also 72 inches tall but weighs 190 pounds (point B in Figure 8.3). The source coder output for this pair would be 203, and the reconstructed values for height and weight are 75 and 188, respectively. Notice that while the height value in both cases was the same, the reconstructed value is different. The reason for this is that the reconstructed value for the height depends on the weight. Thus, for this particular source coder, we do not have a conditional probability density function Pyjxi of the form shown in Equation (8.47). Let us examine the distortion for this scheme a little more closely. As the conditional probability for this scheme is not of the form of Equation (8.47), we can no longer write the distortion in the form of Equation (8.49). Recall that the general form of the distortion is D = N−1 i=0 M−1 j=0 dxiyjPxiPyjxi (8.50) Each term in the summation consists of three factors: the distortion measure dxiyj, the source density Pxi, and the conditional probability Pyjxi. The distortion measure is a measure of closeness of the original and reconstructed versions of the signal and is generally determined by the particular application. The source probabilities are solely determined by the source. The third factor, the set of conditional probabilities, can be seen as a description of the compression scheme. Therefore, for a given source with some pdf Pxi and a specified distortion measure d··, the distortion is a function only of the conditional probabilities Pyjxi; that is, D = DPyjxi (8.51) Therefore, we can write the constraint that the distortion D be less than some value D∗ as a requirement that the conditional probabilities for the compression scheme belong to a set of conditional probabilities that have the property that = Pyjxi such that DPyjxi ≤ D∗ (8.52) Once we know the set of compression schemes to which we have to confine ourselves, we can start to look at the rate of these schemes. In Example 8.4.2, the rate was the entropy of Y. However, that was a result of the fact that the conditional probability describing that particular source coder took on only the values 0 and 1. Consider the following trivial situation. E x a m p l e 8 . 5 . 2: Suppose we have the same source as in Example 8.4.2 and the same reconstruction alphabet. Suppose the distortion measure is dxiyj = xi −yj2
  • 234. 8.5 Rate Distortion Theory 211 and D∗ = 225. One compression scheme that satisfies the distortion constraint randomly maps the input to any one of the outputs; that is, Pyjxi = 1 8 for i = 0115 and j = 0214 We can see that this conditional probability assignment satisfies the distortion constraint. As each of the eight reconstruction values is equally likely, HY is 3 bits. However, we are not transmitting any information. We could get exactly the same results by transmitting 0 bits and randomly picking Y at the receiver. Therefore, the entropy of the reconstruction HY cannot be a measure of the rate. In his 1959 paper on source coding [100], Shannon showed that the minimum rate for a given distortion is given by RD = min Pyjxi∈ IXY (8.53) To prove this is beyond the scope of this book. (Further information can be found in [3] and [4].) However, we can at least convince ourselves that defining the rate as an average mutual information gives sensible answers when used for the examples shown here. Consider Example 8.4.2. The average mutual information in this case is 3 bits, which is what we said the rate was. In fact, notice that whenever the conditional probabilities are constrained to be of the form of Equation (8.47), HYX = 0 then IXY = HY which had been our measure of rate. In Example 8.5.2, the average mutual information is 0 bits, which accords with our intuitive feeling of what the rate should be. Again, whenever HYX = HY that is, knowledge of the source gives us no knowledge of the reconstruction, IXY = 0 which seems entirely reasonable. We should not have to transmit any bits when we are not sending any information. At least for the examples here, it seems that the average mutual information does represent the rate. However, earlier we had said that the average mutual information between the source output and the reconstruction is a measure of the information conveyed by the reconstruction about the source output. Why are we then looking for compression schemes that minimize this value? To understand this, we have to remember that the process of finding the performance of the optimum compression scheme had two parts. In the first part we
  • 235. 212 8 L O S S Y C O D I N G specified the desired distortion. The entire set of conditional probabilities over which the average mutual information is minimized satisfies the distortion constraint. Therefore, we can leave the question of distortion, or fidelity, aside and concentrate on minimizing the rate. Finally, how do we find the rate distortion function? There are two ways: one is a computational approach developed by Arimoto [101] and Blahut [102]. While the derivation of the algorithm is beyond the scope of this book, the algorithm itself is relatively simple. The other approach is to find a lower bound for the average mutual information and then show that we can achieve this bound. We use this approach to find the rate distortion functions for two important sources. E x a m p l e 8 . 5 . 3: R a t e d i s t o r t i o n f u n c t i o n f o r t h e b i n a r y s o u r c e Suppose we have a source alphabet 01, with P0 = p. The reconstruction alphabet is also binary. Given the distortion measure dxiyj = xi ⊕yj (8.54) where ⊕ is modulo 2 addition, let us find the rate distortion function. Assume for the moment that p 1 2 . For D p an encoding scheme that would satisfy the distortion criterion would be not to transmit anything and fix Y = 1. So for D ≥ p RD = 0 (8.55) We will find the rate distortion function for the distortion range 0 ≤ D p. Find a lower bound for the average mutual information: IXY = HX−HXY (8.56) = HX−HX ⊕YY (8.57) ≥ HX−HX ⊕Y from Equation (8.11). (8.58) In the second step we have used the fact that if we know Y, then knowing X we can obtain X ⊕Y and vice versa as X ⊕Y ⊕Y = X. Let us look at the terms on the right-hand side of (8.11): HX = −plog2 p−1−plog21−p = Hbp (8.59) where Hbp is called the binary entropy function and is plotted in Figure 8.4. Note that Hbp = Hb1−p. Given that HX is completely specified by the source probabilities, our task now is to find the conditional probabilities Pxiyj such that HX ⊕ Y is maximized while the average distortion E dxiyj ≤ D. HX ⊕ Y is simply the binary entropy function HbPX ⊕Y = 1, where PX ⊕Y = 1 = PX = 0Y = 1+PX = 1Y = 0 (8.60)
  • 236. 8.5 Rate Distortion Theory 213 Hb(p) p 1.0 0.5 1.0 F I G U R E 8. 4 The binary entropy function. Therefore, to maximize HX ⊕Y, we would want PX ⊕Y = 1 to be as close as possible to one-half. However, the selection of PX ⊕Y also has to satisfy the distortion constraint. The distortion is given by E dxiyj = 0×PX = 0Y = 0+1×PX = 0Y = 1 +1×PX = 1Y = 0+0×PX = 1Y = 1 = PX = 0Y = 1+PX = 1Y = 0 = PY = 1X = 0p+PY = 0X = 11−p (8.61) But this is simply the probability that X ⊕ Y = 1. Therefore, the maximum value that PX ⊕Y = 1 can have is D. Our assumptions were that D p and p ≤ 1 2 , which means that D 1 2 . Therefore, PX ⊕ Y = 1 is closest to 1 2 while being less than or equal to D when PX ⊕Y = 1 = D. Therefore, IXY ≥ Hbp−HbD (8.62) We can show that for PX = 0Y = 1 = PX = 1Y = 0 = D, this bound is achieved. That is, if PX = 0Y = 1 = PX = 1Y = 0 = D, then IXY = Hbp−HbD (8.63) Therefore, for D p and p ≤ 1 2 , RD = Hbp−HbD (8.64)
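As a quick numerical check of Equation (8.64) (the case p > 1/2 and the combined statement of the rate distortion function follow below), the following Python sketch, which is illustrative and not part of the text, evaluates R(D) = H_b(p) - H_b(D) for a binary source with Hamming distortion.

import math

def binary_entropy(p):
    """H_b(p) = -p log2(p) - (1 - p) log2(1 - p), with H_b(0) = H_b(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_distortion_binary(D, p):
    """R(D) = H_b(p) - H_b(D) for D < min(p, 1 - p), and 0 otherwise,
    anticipating the general statement given in the text for either value of p."""
    if D >= min(p, 1 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(D)

p = 0.25
for D in (0.0, 0.05, 0.10, 0.15, 0.20, 0.25):
    print(f"D = {D:.2f}   R(D) = {rate_distortion_binary(D, p):.4f} bits")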
  • 237. 214 8 L O S S Y C O D I N G Finally, if p 1 2 , then we simply switch the roles of p and 1−p. Putting all this together, the rate distortion function for a binary source is RD = Hbp−HbD for D minp1−p 0 otherwise. (8.65) E x a m p l e 8 . 5 . 4: R a t e d i s t o r t i o n f u n c t i o n f o r t h e G a u s s i a n s o u r c e Suppose we have a continuous amplitude source that has a zero mean Gaussian pdf with variance 2 . If our distortion measure is given by dxy = x −y2 (8.66) our distortion constraint is given by E X −Y2 ≤ D (8.67) Our approach to finding the rate distortion function will be the same as in the previous example; that is, find a lower bound for IXY given a distortion constraint, and then show that this lower bound can be achieved. First we find the rate distortion function for D 2 . IXY = hX−hXY (8.68) = hX−hX −YY (8.69) ≥ hX−hX −Y (8.70) In order to minimize the right-hand side of Equation (8.70), we have to maximize the second term subject to the constraint given by Equation (8.67). This term is maximized if X −Y is Gaussian, and the constraint can be satisfied if E X −Y2 = D. Therefore, hX −Y is the differential entropy of a Gaussian random variable with variance D, and the lower bound becomes IXY ≥ 1 2 log2e 2 − 1 2 log2eD (8.71) = 1 2 log 2 D (8.72) This average mutual information can be achieved if Y is zero mean Gaussian with variance 2 −D, and fXY xy = 1 √ 2D exp −x2 2D (8.73) For D 2 , if we set Y = 0, then IXY = 0 (8.74)
  • 238. 8.6 Models 215 and E X −Y2 = 2 D (8.75) Therefore, the rate distortion function for the Gaussian source can be written as RD = 1 2 log 2 D for D 2 0 for D 2 . (8.76) Like the differential entropy for the Gaussian source, the rate distortion function for the Gaussian source also has the distinction of being larger than the rate distortion function for any other source with a continuous distribution and the same variance. This is especially valuable because for many sources it can be very difficult to calculate the rate distortion function. In these situations, it is helpful to have an upper bound for the rate distortion function. It would be very nice if we also had a lower bound for the rate distortion function of a continuous random variable. Shannon described such a bound in his 1948 paper [7], and it is appropriately called the Shannon lower bound. We will simply state the bound here without derivation (for more information, see [4]). The Shannon lower bound for a random variable X and the magnitude error criterion dxy = x −y (8.77) is given by RSLBD = hX−log2eD (8.78) If we used the squared error criterion, the Shannon lower bound is given by RSLBD = hX− 1 2 log2eD (8.79) In this section we have defined the rate distortion function and obtained the rate distortion function for two important sources. We have also obtained upper and lower bounds on the rate distortion function for an arbitrary iid source. These functions and bounds are especially useful when we want to know if it is possible to design compression schemes to provide a specified rate and distortion given a particular source. They are also useful in determining the amount of performance improvement that we could obtain by designing a better compression scheme. In these ways the rate distortion function plays the same role for lossy compression that entropy plays for lossless compression. 8 . 6 Models As in the case of lossless compression, models play an important role in the design of lossy compression algorithms; there are a variety of approaches available. The set of models we can draw on for lossy compression is much wider than the set of models we studied for
lossless compression. We will look at some of these models in this section. What is presented here is by no means an exhaustive list of models. Our only intent is to describe those models that will be useful in the following chapters.

8.6.1 Probability Models

An important method for characterizing a particular source is through the use of probability models. As we shall see later, knowledge of the probability model is important for the design of a number of compression schemes. Probability models used for the design and analysis of lossy compression schemes differ from those used in the design and analysis of lossless compression schemes. When developing models in the lossless case, we tried for an exact match. The probability of each symbol was estimated as part of the modeling process. When modeling sources in order to design or analyze lossy compression schemes, we look more to the general rather than exact correspondence. The reasons are more pragmatic than theoretical. Certain probability distribution functions are more analytically tractable than others, and we try to match the distribution of the source with one of these "nice" distributions.

Uniform, Gaussian, Laplacian, and Gamma distributions are four probability models commonly used in the design and analysis of lossy compression systems:

Uniform Distribution: As for lossless compression, this is again our ignorance model. If we do not know anything about the distribution of the source output, except possibly the range of values, we can use the uniform distribution to model the source. The probability density function for a random variable uniformly distributed between a and b is

f_X(x) = \begin{cases} \frac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise.} \end{cases}        (8.80)

Gaussian Distribution: The Gaussian distribution is one of the most commonly used probability models for two reasons: it is mathematically tractable and, by virtue of the central limit theorem, it can be argued that in the limit the distribution of interest goes to a Gaussian distribution. The probability density function for a random variable with a Gaussian distribution and mean \mu and variance \sigma^2 is

f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)        (8.81)

Laplacian Distribution: Many sources that we deal with have distributions that are quite peaked at zero. For example, speech consists mainly of silence. Therefore, samples of speech will be zero or close to zero with high probability. Image pixels themselves do not have any attraction to small values. However, there is a high degree of correlation among pixels. Therefore, a large number of the pixel-to-pixel differences will have values close to zero. In these situations, a Gaussian distribution is not a very close match to the data. A closer match is the Laplacian distribution, which is peaked at zero. The distribution function for a zero mean random variable with Laplacian distribution and variance \sigma^2 is

f_X(x) = \frac{1}{\sqrt{2\sigma^2}} \exp\left(-\frac{\sqrt{2}|x|}{\sigma}\right)        (8.82)

Gamma Distribution: A distribution that is even more peaked, though considerably less tractable, than the Laplacian distribution is the Gamma distribution. The distribution function for a Gamma distributed random variable with zero mean and variance \sigma^2 is given by

f_X(x) = \frac{\sqrt[4]{3}}{\sqrt{8\pi\sigma|x|}} \exp\left(-\frac{\sqrt{3}|x|}{2\sigma}\right)        (8.83)

The shapes of these four distributions, assuming a mean of zero and a variance of one, are shown in Figure 8.5.

FIGURE 8.5 Uniform, Gaussian, Laplacian, and Gamma distributions.

One way of obtaining the estimate of the distribution of a particular source is to divide the range of outputs into "bins" or intervals I_k. We can then find the number of values n_k that fall into each interval. A plot of n_k/n_T, where n_T is the total number of source outputs being considered, should give us some idea of what the input distribution looks like. Be aware that this is a rather crude method and can at times be misleading. For example, if we were not careful in our selection of the source output, we might end up modeling some local peculiarities of the source. If the bins are too large, we might effectively filter out some important properties of the source. If the bin sizes are too small, we may miss out on some of the gross behavior of the source.
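To get a feel for how sharply peaked these densities are, the short Python sketch below evaluates the four unit-variance, zero-mean densities of Equations (8.80)-(8.83) at a few points; plotting the returned values reproduces the qualitative shapes of Figure 8.5. The sketch is illustrative and not part of the text.

import math

def uniform_pdf(x, a=-math.sqrt(3.0), b=math.sqrt(3.0)):
    # Variance of a uniform density is (b - a)^2 / 12, so b - a = 2*sqrt(3) gives variance 1.
    return 1.0 / (b - a) if a <= x <= b else 0.0

def gaussian_pdf(x, sigma=1.0):
    return math.exp(-x * x / (2 * sigma * sigma)) / math.sqrt(2 * math.pi * sigma * sigma)

def laplacian_pdf(x, sigma=1.0):
    return math.exp(-math.sqrt(2) * abs(x) / sigma) / math.sqrt(2 * sigma * sigma)

def gamma_pdf(x, sigma=1.0):
    # Equation (8.83); the density is singular at x = 0, so evaluate away from the origin.
    return (3 ** 0.25) / math.sqrt(8 * math.pi * sigma * abs(x)) * \
        math.exp(-math.sqrt(3) * abs(x) / (2 * sigma))

for x in (0.25, 0.5, 1.0, 2.0, 3.0):
    print(f"x = {x:4.2f}  uniform = {uniform_pdf(x):.4f}  Gaussian = {gaussian_pdf(x):.4f}"
          f"  Laplacian = {laplacian_pdf(x):.4f}  Gamma = {gamma_pdf(x):.4f}")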
Once we have decided on some candidate distributions, we can select between them using a number of sophisticated tests. These tests are beyond the scope of this book but are described in [103].

Many of the sources that we deal with when we design lossy compression schemes have a great deal of structure in the form of sample-to-sample dependencies. The probability models described here capture none of these dependencies. Fortunately, we have a lot of models that can capture most of this structure. We describe some of these models in the next section.

8.6.2 Linear System Models

A large class of processes can be modeled in the form of the following difference equation:

x_n = \sum_{i=1}^{N} a_i x_{n-i} + \sum_{j=1}^{M} b_j \epsilon_{n-j} + \epsilon_n        (8.84)

where {x_n} are samples of the process we wish to model, and {\epsilon_n} is a white noise sequence. We will assume throughout this book that we are dealing with real valued samples. Recall that a zero-mean wide-sense-stationary noise sequence {\epsilon_n} is a sequence with autocorrelation function

R_{\epsilon\epsilon}(k) = \begin{cases} \sigma_\epsilon^2 & k = 0 \\ 0 & \text{otherwise.} \end{cases}        (8.85)

In digital signal-processing terminology, Equation (8.84) represents the output of a linear discrete time invariant filter with N poles and M zeros. In the statistical literature, this model is called an autoregressive moving average model of order (N, M), or an ARMA(N, M) model. The autoregressive label is because of the first summation in Equation (8.84), while the second summation gives us the moving average portion of the name.

If all the b_j were zero in Equation (8.84), only the autoregressive part of the ARMA model would remain:

x_n = \sum_{i=1}^{N} a_i x_{n-i} + \epsilon_n        (8.86)

This model is called an Nth-order autoregressive model and is denoted by AR(N). In digital signal-processing terminology, this is an all pole filter. The AR(N) model is the most popular of all the linear models, especially in speech compression, where it arises as a natural consequence of the speech production model. We will look at it a bit more closely.

First notice that for the AR(N) process, knowing all the past history of the process gives no more information than knowing the last N samples of the process; that is,

P(x_n | x_{n-1}, x_{n-2}, \ldots) = P(x_n | x_{n-1}, x_{n-2}, \ldots, x_{n-N})        (8.87)

which means that the AR(N) process is a Markov model of order N.

The autocorrelation function of a process can tell us a lot about the sample-to-sample behavior of a sequence. A slowly decaying autocorrelation function indicates a high sample-to-sample correlation, while a fast decaying autocorrelation denotes low sample-to-sample correlation. In the case of no sample-to-sample correlation, such as white noise, the autocorrelation function is zero for lags greater than zero, as seen in Equation (8.85). The autocorrelation function for the AR(N) process can be obtained as follows:

R_{xx}(k) = E[x_n x_{n-k}]        (8.88)
= E\left[\left(\sum_{i=1}^{N} a_i x_{n-i} + \epsilon_n\right) x_{n-k}\right]        (8.89)
= E\left[\sum_{i=1}^{N} a_i x_{n-i} x_{n-k}\right] + E[\epsilon_n x_{n-k}]        (8.90)
= \begin{cases} \sum_{i=1}^{N} a_i R_{xx}(k-i) & k > 0 \\ \sum_{i=1}^{N} a_i R_{xx}(i) + \sigma_\epsilon^2 & k = 0. \end{cases}        (8.91)

Example 8.6.1: Suppose we have an AR(3) process. Let us write out the equations for the autocorrelation coefficient for lags 1, 2, 3:

R_{xx}(1) = a_1 R_{xx}(0) + a_2 R_{xx}(1) + a_3 R_{xx}(2)
R_{xx}(2) = a_1 R_{xx}(1) + a_2 R_{xx}(0) + a_3 R_{xx}(1)
R_{xx}(3) = a_1 R_{xx}(2) + a_2 R_{xx}(1) + a_3 R_{xx}(0)

If we know the values of the autocorrelation function R_{xx}(k), for k = 0, 1, 2, 3, we can use this set of equations to find the AR(3) coefficients {a_1, a_2, a_3}. On the other hand, if we know the model coefficients and \sigma_\epsilon^2, we can use the above equations along with the equation for R_{xx}(0) to find the first four autocorrelation coefficients. All the other autocorrelation values can be obtained by using Equation (8.91).

To see how the autocorrelation function is related to the temporal behavior of the sequence, let us look at the behavior of a simple AR(1) source.

Example 8.6.2: An AR(1) source is defined by the equation

x_n = a_1 x_{n-1} + \epsilon_n        (8.92)

The autocorrelation function for this source (see Problem 8) is given by

R_{xx}(k) = \frac{a_1^{k}}{1 - a_1^2}\,\sigma_\epsilon^2        (8.93)

From this we can see that the autocorrelation will decay more slowly for larger values of a_1. Remember that the value of a_1 in this case is an indicator of how closely the current sample is related to the previous sample. The autocorrelation function is plotted for two values of a_1 in Figure 8.6. Notice that for a_1 close to 1, the autocorrelation function decays extremely slowly. As the value of a_1 moves farther away from 1, the autocorrelation function decays much faster.

FIGURE 8.6 Autocorrelation function of an AR(1) process with two values of a_1 (a_1 = 0.6 and a_1 = 0.99).

Sample waveforms for a_1 = 0.99 and a_1 = 0.6 are shown in Figures 8.7 and 8.8. Notice the slower variations in the waveform for the process with a higher value of a_1. Because the waveform in Figure 8.7 varies more slowly than the waveform in Figure 8.8, samples of this waveform are much more likely to be close in value than the samples of the waveform of Figure 8.8.

FIGURE 8.7 Sample function of an AR(1) process with a_1 = 0.99.
FIGURE 8.8 Sample function of an AR(1) process with a_1 = 0.6.

Let's look at what happens when the AR(1) coefficient is negative. The sample waveforms are plotted in Figures 8.9 and 8.10. The sample-to-sample variation in these waveforms is much higher than in the waveforms shown in Figures 8.7 and 8.8. However, if we were to look at the variation in magnitude, we can see that the higher value of a_1 results in magnitude values that are closer together. This behavior is also reflected in the autocorrelation function, shown in Figure 8.11, as we might expect from looking at Equation (8.93).

FIGURE 8.9 Sample function of an AR(1) process with a_1 = -0.99.
FIGURE 8.10 Sample function of an AR(1) process with a_1 = -0.6.
FIGURE 8.11 Autocorrelation function of an AR(1) process with two negative values of a_1 (a_1 = -0.6 and a_1 = -0.99).
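The behavior illustrated in Figures 8.6-8.11 is easy to reproduce numerically. The Python sketch below is illustrative rather than from the text (the sample size and seed are arbitrary); it generates an AR(1) sequence according to Equation (8.92) and compares the estimated autocorrelation with the closed form of Equation (8.93).

import random

def generate_ar1(a1, n, noise_std=1.0, seed=0):
    """Generate n samples of x_n = a1 * x_{n-1} + eps_n (Equation (8.92)).
    For simplicity the recursion starts at zero, with no burn-in period."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for _ in range(n):
        prev = a1 * prev + rng.gauss(0.0, noise_std)
        x.append(prev)
    return x

def sample_autocorrelation(x, max_lag):
    """Estimate R_xx(k) = E[x_n x_{n-k}] for k = 0, ..., max_lag."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) / (n - k)
            for k in range(max_lag + 1)]

for a1 in (0.99, 0.6, -0.6, -0.99):
    x = generate_ar1(a1, 50_000)
    r = sample_autocorrelation(x, 1)
    # Theory from Equation (8.93) with sigma_eps^2 = 1: R(0) = 1/(1 - a1^2), R(1)/R(0) = a1.
    print(f"a1 = {a1:+.2f}   R(0): estimate = {r[0]:7.2f}, theory = {1 / (1 - a1 ** 2):7.2f}"
          f"   R(1)/R(0): estimate = {r[1] / r[0]:+.3f}, theory = {a1:+.3f}")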
  • 246. 8.6 Models 223 In Equation (8.84), instead of setting all the bj coefficients to zero, if we set all the ai coefficients to zero, we are left with the moving average part of the ARMA process: xn = M j=1 bjn−j +n (8.94) This process is called an Mth-order moving average process. This is a weighted average of the current and M past samples. Because of the form of this process, it is most useful when modeling slowly varying processes. 8 . 6 . 3 P h y s i c a l M o d e l s Physical models are based on the physics of the source output production. The physics are generally complicated and not amenable to a reasonable mathematical approximation. An exception to this rule is speech generation. S p e e c h P r o d u c t i o n There has been a significant amount of research conducted in the area of speech production [104], and volumes have been written about it. We will try to summarize some of the pertinent aspects in this section. Speech is produced by forcing air first through an elastic opening, the vocal cords, and then through cylindrical tubes with nonuniform diameter (the laryngeal, oral, nasal, and pharynx passages), and finally through cavities with changing boundaries such as the mouth and the nasal cavity. Everything past the vocal cords is generally referred to as the vocal tract. The first action generates the sound, which is then modulated into speech as it traverses through the vocal tract. We will often be talking about filters in the coming chapters. We will try to describe filters more precisely at that time. For our purposes at present, a filter is a system that has an input and an output, and a rule for converting the input to the output, which we will call the transfer function. If we think of speech as the output of a filter, the sound generated by the air rushing past the vocal cords can be viewed as the input, while the rule for converting the input to the output is governed by the shape and physics of the vocal tract. The output depends on the input and the transfer function. Let’s look at each in turn. There are several different forms of input that can be generated by different conformations of the vocal cords and the associated cartilages. If the vocal cords are stretched shut and we force air through, the vocal cords vibrate, providing a periodic input. If a small aperture is left open, the input resembles white noise. By opening an aperture at different locations along the vocal cords, we can produce a white-noise–like input with certain dominant frequencies that depend on the location of the opening. The vocal tract can be modeled as a series of tubes of unequal diameter. If we now examine how an acoustic wave travels through this series of tubes, we find that the mathematical model that best describes this process is an autoregressive model. We will often encounter the autoregressive model when we discuss speech compression algorithms.
  • 247. 224 8 L O S S Y C O D I N G 8 . 7 Summary In this chapter we have looked at a variety of topics that will be useful to us when we study various lossy compression techniques, including distortion and its measurement, some new concepts from information theory, average mutual information and its connection to the rate of a compression scheme, and the rate distortion function. We have also briefly looked at some of the properties of the human visual system and the auditory system— most importantly, visual and auditory masking. The masking phenomena allow us to incur distortion in such a way that the distortion is not perceptible to the human observer. We also presented a model for speech production. F u r t h e r R e a d i n g There are a number of excellent books available that delve more deeply in the area of information theory: 1. Information Theory, by R.B. Ash [15]. 2. Information Transmission, by R.M. Fano [16]. 3. Information Theory and Reliable Communication, by R.G. Gallagher [11]. 4. Entropy and Information Theory, by R.M. Gray [17]. 5. Elements of Information Theory, by T.M. Cover and J.A. Thomas [3]. 6. The Theory of Information and Coding, by R.J. McEliece [6]. The subject of rate distortion theory is discussed in very clear terms in Rate Distortion Theory, by T. Berger [4]. For an introduction to the concepts behind speech perception, see Voice and Speech Processing, by T. Parsons [105]. 8 . 8 Projects and Problems 1. Although SNR is a widely used measure of distortion, it often does not correlate with perceptual quality. In order to see this we conduct the following experiment. Using one of the images provided, generate two “reconstructed” images. For one of the reconstructions add a value of 10 to each pixel. For the other reconstruction, randomly add either +10 or −10 to each pixel. (a) What is the SNR for each of the reconstructions? Do the relative values reflect the difference in the perceptual quality? (b) Devise a mathematical measure that will better reflect the difference in perceptual quality for this particular case. 2. Consider the following lossy compression scheme for binary sequences. We divide the binary sequence into blocks of size M. For each block we count the number
  • 248. 8.8 Projects and Problems 225 of 0s. If this number is greater than or equal to M/2, we send a 0; otherwise, we send a 1. (a) If the sequence is random with P0 = 08, compute the rate and distortion (use Equation (8.54)) for M = 124816. Compare your results with the rate distortion function for binary sources. (b) Repeat assuming that the output of the encoder is encoded at a rate equal to the entropy of the output. 3. Write a program to implement the compression scheme described in the previous problem. (a) Generate a random binary sequence with P0 = 08, and compare your simula- tion results with the analytical results. (b) Generate a binary first-order Markov sequence with P00 = 09, and P11 = 09. Encode it using your program. Discuss and comment on your results. 4. Show that HXdYd = − j=− i=− fXY xiyjfY yj logfXY xiyj−log (8.95) 5. For two random variables X and Y, show that HXY ≤ HX with equality if X is independent of Y. Hint: E logfx ≤ logE fx (Jensen’s inequality). 6. Given two random variables X and Y, show that IXY = IYX. 7. For a binary source with P0 = p, PX = 0Y = 1 = PX = 1Y = 0 = D, and distortion measure dxiyj = xi ⊕yj show that IXY = Hbp−HbD (8.96) 8. Find the autocorrelation function in terms of the model coefficients and 2 for (a) an AR(1) process, (b) an MA(1) process, and (c) an AR(2) process.
9 Scalar Quantization

9.1 Overview

In this chapter we begin our study of quantization, one of the simplest and most general ideas in lossy compression. We will look at scalar quantization in this chapter and continue with vector quantization in the next chapter. First, the general quantization problem is stated, then various solutions are examined, starting with the simpler solutions, which require the most assumptions, and proceeding to more complex solutions that require fewer assumptions. We describe uniform quantization with fixed-length codewords, first assuming a uniform source, then a source with a known probability density function (pdf) that is not necessarily uniform, and finally a source with unknown or changing statistics. We then look at pdf-optimized nonuniform quantization, followed by companded quantization. Finally, we return to the more general statement of the quantizer design problem and study entropy-coded quantization.

9.2 Introduction

In many lossy compression applications we are required to represent each source output using one of a small number of codewords. The number of possible distinct source output values is generally much larger than the number of codewords available to represent them. The process of representing a large (possibly infinite) set of values with a much smaller set is called quantization.

Consider a source that generates numbers between -10.0 and 10.0. A simple quantization scheme would be to represent each output of the source with the integer value closest to it. (If the source output is equally close to two integers, we will randomly pick one of them.) For example, if the source output is 2.47, we would represent it as 2, and if the source output is 3.1415926, we would represent it as 3.
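A minimal Python sketch of this rounding quantizer (illustrative, not from the text) is given below. Note that Python's built-in round() resolves ties by rounding to the nearest even integer rather than randomly; this detail does not affect the discussion.

def quantize_to_integer(x):
    """Represent a source output in [-10.0, 10.0] by the closest integer."""
    return int(round(x))

for value in (2.47, 3.1415926, -9.73, 0.2):
    print(value, "->", quantize_to_integer(value))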
This approach reduces the size of the alphabet required to represent the source output; the infinite number of values between -10.0 and 10.0 are represented with a set that contains only 21 values (-10, -9, ..., 9, 10). At the same time we have also forever lost the original value of the source output. If we are told that the reconstruction value is 3, we cannot tell whether the source output was 2.95, 3.16, 3.057932, or any other of an infinite set of values. In other words, we have lost some information. This loss of information is the reason for the use of the word "lossy" in many lossy compression schemes.

The set of inputs and outputs of a quantizer can be scalars or vectors. If they are scalars, we call the quantizers scalar quantizers. If they are vectors, we call the quantizers vector quantizers. We will study scalar quantizers in this chapter and vector quantizers in Chapter 10.

9.3 The Quantization Problem

Quantization is a very simple process. However, the design of the quantizer has a significant impact on the amount of compression obtained and loss incurred in a lossy compression scheme. Therefore, we will devote a lot of attention to issues related to the design of quantizers.

In practice, the quantizer consists of two mappings: an encoder mapping and a decoder mapping. The encoder divides the range of values that the source generates into a number of intervals. Each interval is represented by a distinct codeword. The encoder represents all the source outputs that fall into a particular interval by the codeword representing that interval. As there could be many, possibly infinitely many, distinct sample values that can fall in any given interval, the encoder mapping is irreversible. Knowing the code only tells us the interval to which the sample value belongs. It does not tell us which of the many values in the interval is the actual sample value. When the sample value comes from an analog source, the encoder is called an analog-to-digital (A/D) converter.

The encoder mapping for a quantizer with eight reconstruction values is shown in Figure 9.1. For this encoder, all samples with values between -1 and 0 would be assigned the code 011. All values between 0 and 1.0 would be assigned the codeword 100, and so on. On the two boundaries, all inputs with values greater than 3 would be assigned the code 111, and all inputs with values less than -3.0 would be assigned the code 000.

FIGURE 9.1 Mapping for a 3-bit encoder (decision boundaries at -3.0, -2.0, -1.0, 0, 1.0, 2.0, 3.0; codes 000 through 111 assigned from the lowest interval to the highest).

Thus, any input
that we receive will be assigned a codeword depending on the interval in which it falls. As we are using 3 bits to represent each value, we refer to this quantizer as a 3-bit quantizer.

For every codeword generated by the encoder, the decoder generates a reconstruction value. Because a codeword represents an entire interval, and there is no way of knowing which value in the interval was actually generated by the source, the decoder puts out a value that, in some sense, best represents all the values in the interval. Later, we will see how to use information we may have about the distribution of the input in the interval to obtain a representative value. For now, we simply use the midpoint of the interval as the representative value generated by the decoder. If the reconstruction is analog, the decoder is often referred to as a digital-to-analog (D/A) converter. A decoder mapping corresponding to the 3-bit encoder shown in Figure 9.1 is shown in Figure 9.2.

FIGURE 9.2 Mapping for a 3-bit D/A converter.

    Input Code    Output
    000           −3.5
    001           −2.5
    010           −1.5
    011           −0.5
    100            0.5
    101            1.5
    110            2.5
    111            3.5

Example 9.3.1:

Suppose a sinusoid 4cos(2πt) was sampled every 0.05 second. The sample was digitized using the A/D mapping shown in Figure 9.1 and reconstructed using the D/A mapping shown in Figure 9.2. The first few inputs, codewords, and reconstruction values are given in Table 9.1. Notice the first two samples in Table 9.1. Although the two input values are distinct, they both fall into the same interval in the quantizer. The encoder, therefore, represents both inputs with the same codeword, which in turn leads to identical reconstruction values.

TABLE 9.1 Digitizing a sine wave.

    t      4cos(2πt)   A/D Output   D/A Output    Error
    0.05   3.804       111          3.5            0.304
    0.10   3.236       111          3.5           −0.264
    0.15   2.351       110          2.5           −0.149
    0.20   1.236       101          1.5           −0.264
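The A/D and D/A mappings of Figures 9.1 and 9.2 are easy to express in code. The following Python sketch is not part of the text; the function names are ours, and the step size of 1.0 is the one implied by the figures. It reproduces the rows of Table 9.1.

import math

DELTA = 1.0  # step size implied by Figures 9.1 and 9.2

def encode_3bit(x):
    # A/D mapping of Figure 9.1: map x to a level index 0..7 (the 3-bit codeword)
    level = int(math.floor(x / DELTA)) + 4
    return max(0, min(7, level))          # clamp to the two outer intervals

def decode_3bit(level):
    # D/A mapping of Figure 9.2: midpoint of the interval for this level
    return (level - 4) * DELTA + DELTA / 2.0

# Digitize the sinusoid of Example 9.3.1: 4cos(2*pi*t) sampled every 0.05 s
for n in range(1, 5):
    t = 0.05 * n
    x = 4.0 * math.cos(2.0 * math.pi * t)
    code = encode_3bit(x)
    xhat = decode_3bit(code)
    print(f"t={t:.2f}  input={x:.3f}  code={code:03b}  output={xhat:.1f}  error={x - xhat:.3f}")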
Construction of the intervals (their location, etc.) can be viewed as part of the design of the encoder. Selection of reconstruction values is part of the design of the decoder. However, the fidelity of the reconstruction depends on both the intervals and the reconstruction values. Therefore, when designing or analyzing encoders and decoders, it is reasonable to view them as a pair. We call this encoder-decoder pair a quantizer. The quantizer mapping for the 3-bit encoder-decoder pair shown in Figures 9.1 and 9.2 can be represented by the input-output map shown in Figure 9.3. The quantizer accepts sample values, and depending on the interval in which the sample values fall, it provides an output codeword and a representation value. Using the map of Figure 9.3, we can see that an input to the quantizer of 1.7 will result in an output of 1.5, and an input of −0.3 will result in an output of −0.5.

From Figures 9.1–9.3 we can see that we need to know how to divide the input range into intervals, assign binary codes to these intervals, and find representation or output values for these intervals in order to specify a quantizer. We need to do all of this while satisfying distortion and rate criteria. In this chapter we will define distortion to be the average squared difference between the quantizer input and output. We call this the mean squared quantization error (msqe) and denote it by σ_q².

FIGURE 9.3 Quantizer input-output map.

The rate of the quantizer is the average number of bits
required to represent a single quantizer output. We would like to get the lowest distortion for a given rate, or the lowest rate for a given distortion.

Let us pose the design problem in precise terms. Suppose we have an input modeled by a random variable X with pdf f_X(x). If we wished to quantize this source using a quantizer with M intervals, we would have to specify M + 1 endpoints for the intervals, and a representative value for each of the M intervals. The endpoints of the intervals are known as decision boundaries, while the representative values are called reconstruction levels. We will often model discrete sources with continuous distributions. For example, the difference between neighboring pixels is often modeled using a Laplacian distribution even though the differences can only take on a limited number of discrete values. Discrete processes are modeled with continuous distributions because it can simplify the design process considerably, and the resulting designs perform well in spite of the incorrect assumption. Several of the continuous distributions used to model source outputs are unbounded—that is, the range of values is infinite. In these cases, the first and last endpoints are generally chosen to be ±∞.

Let us denote the decision boundaries by {b_i}_{i=0}^{M}, the reconstruction levels by {y_i}_{i=1}^{M}, and the quantization operation by Q(·). Then

    Q(x) = y_i   iff   b_{i-1} < x \le b_i                                            (9.1)

The mean squared quantization error is then given by

    \sigma_q^2 = \int_{-\infty}^{\infty} (x - Q(x))^2 f_X(x)\,dx                      (9.2)
               = \sum_{i=1}^{M} \int_{b_{i-1}}^{b_i} (x - y_i)^2 f_X(x)\,dx           (9.3)

The difference between the quantizer input x and output y = Q(x), besides being referred to as the quantization error, is also called the quantizer distortion or quantization noise. But the word “noise” is somewhat of a misnomer. Generally, when we talk about noise we mean a process external to the source process. Because of the manner in which the quantization error is generated, it is dependent on the source process and, therefore, cannot be regarded as external to the source process. One reason for the use of the word “noise” in this context is that from time to time we will find it useful to model the quantization process as an additive noise process as shown in Figure 9.4.

If we use fixed-length codewords to represent the quantizer output, then the size of the output alphabet immediately specifies the rate. If the number of quantizer outputs is M, then the rate is given by

    R = \log_2 M                                                                      (9.4)

For example, if M = 8, then R = 3. In this case, we can pose the quantizer design problem as follows: Given an input pdf f_X(x) and the number of levels M in the quantizer, find the decision boundaries {b_i} and the reconstruction levels {y_i} so as to minimize the mean squared quantization error given by Equation (9.3).
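Equation (9.3) can be evaluated numerically once the boundaries, reconstruction levels, and pdf are fixed. The sketch below is ours, not from the text; it uses a unit-variance Gaussian and an 8-level uniform quantizer with the step size that appears later in Table 9.3, so the resulting SNR should come out near the 14.27 dB listed there.

import math

def msqe(boundaries, levels, pdf, steps=20000):
    # Equation (9.3) by midpoint-rule integration.
    # boundaries: b_0 < ... < b_M (large finite values stand in for +/- infinity)
    # levels:     y_1, ..., y_M
    total = 0.0
    for i, y in enumerate(levels):
        lo, hi = boundaries[i], boundaries[i + 1]
        dx = (hi - lo) / steps
        for k in range(steps):
            x = lo + (k + 0.5) * dx
            total += (x - y) ** 2 * pdf(x) * dx
    return total

gauss = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
delta = 0.586                                             # 8-level Gaussian step size (Table 9.3)
b = [-8.0] + [delta * k for k in range(-3, 4)] + [8.0]    # -8 and 8 stand in for +/- infinity
y = [delta * (k + 0.5) for k in range(-4, 4)]             # midpoint reconstruction levels
d = msqe(b, y, gauss)
print("msqe =", d, "  SNR =", 10.0 * math.log10(1.0 / d), "dB")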
  • 255. 232 9 S C A L A R Q U A N T I Z A T I O N Quantization noise Quantizer output Quantizer input + F I G U R E 9. 4 Additive noise model of a quantizer. T A B L E 9 . 2 Codeword assignment for an eight-level quantizer. y1 1110 y2 1100 y3 100 y4 00 y5 01 y6 101 y7 1101 y8 1111 However, if we are allowed to use variable-length codes, such as Huffman codes or arithmetic codes, along with the size of the alphabet, the selection of the decision boundaries will also affect the rate of the quantizer. Consider the codeword assignment for the output of an eight-level quantizer shown in Table 9.2. According to this codeword assignment, if the output y4 occurs, we use 2 bits to encode it, while if the output y1 occurs, we need 4 bits to encode it. Obviously, the rate will depend on how often we have to encode y4 versus how often we have to encode y1. In other words, the rate will depend on the probability of occurrence of the outputs. If li is the length of the codeword corresponding to the output yi, and Pyi is the probability of occurrence of yi, then the rate is given by R = M i=1 liPyi (9.5) However, the probabilities Pyi depend on the decision boundaries bi. For example, the probability of yi occurring is given by Pyi = bi bi−1 fXxdx
  • 256. 9.4 Uniform Quantizer 233 Therefore, the rate R is a function of the decision boundaries and is given by the expression R = M i=1 li bi bi−1 fXxdx (9.6) From this discussion and Equations (9.3) and (9.6), we see that for a given source input, the partitions we select and the representation for those partitions will determine the distortion incurred during the quantization process. The partitions we select and the binary codes for the partitions will determine the rate for the quantizer. Thus, the problem of finding the optimum partitions, codes, and representation levels are all linked. In light of this information, we can restate our problem statement: Given a distortion constraint 2 q ≤ D∗ (9.7) find the decision boundaries, reconstruction levels, and binary codes that mini- mize the rate given by Equation (9.6), while satisfying Equation (9.7). Or, given a rate constraint R ≤ R∗ (9.8) find the decision boundaries, reconstruction levels, and binary codes that mini- mize the distortion given by Equation (9.3), while satisfying Equation (9.8). This problem statement of quantizer design, while more general than our initial statement, is substantially more complex. Fortunately, in practice there are situations in which we can simplify the problem. We often use fixed-length codewords to encode the quantizer output. In this case, the rate is simply the number of bits used to encode each output, and we can use our initial statement of the quantizer design problem. We start our study of quantizer design by looking at this simpler version of the problem, and later use what we have learned in this process to attack the more complex version. 9 . 4 Uniform Quantizer The simplest type of quantizer is the uniform quantizer. All intervals are the same size in the uniform quantizer, except possibly for the two outer intervals. In other words, the decision boundaries are spaced evenly. The reconstruction values are also spaced evenly, with the same spacing as the decision boundaries; in the inner intervals, they are the midpoints of the intervals. This constant spacing is usually referred to as the step size and is denoted by . The quantizer shown in Figure 9.3 is a uniform quantizer with = 1. It does not have zero as one of its representation levels. Such a quantizer is called a midrise quantizer. An alternative uniform quantizer could be the one shown in Figure 9.5. This is called a midtread quantizer. As the midtread quantizer has zero as one of its output levels, it is especially useful in situations where it is important that the zero value be represented—for example,
  • 257. 234 9 S C A L A R Q U A N T I Z A T I O N −3.5 −2.5 −1.5 −0.5 −1.0 −2.0 −3.0 1.0 2.0 3.0 Output Input 0.5 1.5 2.5 3.5 F I G U R E 9. 5 A midtread quantizer. control systems in which it is important to represent a zero value accurately, and audio coding schemes in which we need to represent silence periods. Notice that the midtread quantizer has only seven intervals or levels. That means that if we were using a fixed-length 3-bit code, we would have one codeword left over. Usually, we use a midrise quantizer if the number of levels is even and a midtread quantizer if the number of levels is odd. For the remainder of this chapter, unless we specifically mention otherwise, we will assume that we are dealing with midrise quantizers. We will also generally assume that the input distribution is symmetric around the origin and the quantizer is also symmetric. (The optimal minimum mean squared error quantizer for a symmetric distribution need not be symmetric [106].) Given all these assumptions, the design of a uniform quantizer consists of finding the step size that minimizes the distortion for a given input process and number of decision levels. U n i f o r m Q u a n t i z a t i o n o f a U n i f o r m l y D i s t r i b u t e d S o u r c e We start our study of quantizer design with the simplest of all cases: design of a uniform quantizer for a uniformly distributed source. Suppose we want to design an M-level uniform quantizer for an input that is uniformly distributed in the interval −XmaxXmax . This means
we need to divide the [−Xmax, Xmax] interval into M equally sized intervals. In this case, the step size Δ is given by

    \Delta = \frac{2 X_{max}}{M}                                                      (9.9)

The distortion in this case becomes

    \sigma_q^2 = 2 \sum_{i=1}^{M/2} \int_{(i-1)\Delta}^{i\Delta} \left( x - \frac{2i-1}{2}\Delta \right)^2 \frac{1}{2 X_{max}}\,dx     (9.10)

If we evaluate this integral (after some suffering), we find that the msqe is Δ²/12. The same result can be more easily obtained if we examine the behavior of the quantization error q given by

    q = x - Q(x)                                                                      (9.11)

In Figure 9.6 we plot the quantization error versus the input signal for an eight-level uniform quantizer, with an input that lies in the interval [−Xmax, Xmax]. Notice that the quantization error lies in the interval [−Δ/2, Δ/2]. As the input is uniform, it is not difficult to establish that the quantization error is also uniform over this interval. Thus, the mean squared quantization error is the second moment of a random variable uniformly distributed in the interval [−Δ/2, Δ/2]:

    \sigma_q^2 = \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} q^2\,dq                 (9.12)
               = \frac{\Delta^2}{12}                                                  (9.13)

Let us also calculate the signal-to-noise ratio for this case. The signal variance σ_s² for a uniform random variable, which takes on values in the interval [−Xmax, Xmax], is (2Xmax)²/12.

FIGURE 9.6 Quantization error for a uniform midrise quantizer with a uniformly distributed input.
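The Δ²/12 result is easy to confirm by simulation; the short Python sketch below is ours and is not from the text.

import random

def uniform_midrise(x, delta, m):
    # M-level uniform midrise quantizer centered at zero
    k = int(x // delta)                     # index of the interval containing x
    k = max(-m // 2, min(m // 2 - 1, k))    # clamp to the outer intervals
    return (k + 0.5) * delta

xmax, m = 4.0, 8
delta = 2.0 * xmax / m                      # Equation (9.9)
random.seed(0)
total, n = 0.0, 200000
for _ in range(n):
    x = random.uniform(-xmax, xmax)
    total += (x - uniform_midrise(x, delta, m)) ** 2
print("simulated msqe:", total / n)
print("delta^2 / 12  :", delta * delta / 12.0)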
The value of the step size Δ is related to Xmax and the number of levels M by

    \Delta = \frac{2 X_{max}}{M}

For the case where we use a fixed-length code, with each codeword being made up of n bits, the number of codewords or the number of reconstruction levels M is 2^n. Combining all this, we have

    SNR(dB) = 10 \log_{10} \frac{\sigma_s^2}{\sigma_q^2}                                                        (9.14)
            = 10 \log_{10} \left( \frac{(2 X_{max})^2}{12} \cdot \frac{12}{\Delta^2} \right)                    (9.15)
            = 10 \log_{10} \left( \frac{(2 X_{max})^2}{12} \cdot \frac{12}{(2 X_{max}/M)^2} \right)             (9.16)
            = 10 \log_{10} M^2 = 20 \log_{10} 2^n = 6.02\,n \text{ dB}                                          (9.17)

This equation says that for every additional bit in the quantizer, we get an increase in the signal-to-noise ratio of 6.02 dB. This is a well-known result and is often used to get an indication of the maximum gain available if we increase the rate. However, remember that we obtained this result under some assumptions about the input. If the assumptions are not true, this result will not hold true either.

Example 9.4.1: Image compression

A probability model for the variations of pixels in an image is almost impossible to obtain because of the great variety of images available. A common approach is to declare the pixel values to be uniformly distributed between 0 and 2^b − 1, where b is the number of bits per pixel. For most of the images we deal with, the number of bits per pixel is 8; therefore, the pixel values would be assumed to vary uniformly between 0 and 255.

Let us quantize our test image Sena using a uniform quantizer. If we wanted to use only 1 bit per pixel, we would divide the range [0, 255] into two intervals, [0, 127] and [128, 255]. The first interval would be represented by the value 64, the midpoint of the first interval; the pixels in the second interval would be represented by the pixel value 196, the midpoint of the second interval. In other words, the boundary values are {0, 128, 255}, while the reconstruction values are {64, 196}. The quantized image is shown in Figure 9.7. As expected, almost all the details in the image have disappeared. If we were to use a 2-bit quantizer, with boundary values {0, 64, 128, 196, 255} and reconstruction levels {32, 96, 160, 224}, we get considerably more detail. The level of detail increases as the use of bits increases until at 6 bits per pixel, the reconstructed image is indistinguishable from the original, at least to a casual observer. The 1-, 2-, and 3-bit images are shown in Figure 9.7.
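The quantization in Example 9.4.1 amounts to keeping the top b bits of each pixel and reconstructing at the midpoint of each interval. A sketch of that procedure follows (ours, not from the text; note that for 1 bit it reconstructs the upper interval at 192, whereas the text uses 196):

def quantize_pixels(pixels, bits):
    # Uniform quantization of 8-bit pixel values to 2**bits levels,
    # reconstructing each interval at its midpoint.
    levels = 2 ** bits
    step = 256 // levels
    return [min(p // step, levels - 1) * step + step // 2 for p in pixels]

print(quantize_pixels([10, 100, 130, 250], bits=1))   # -> [64, 64, 192, 192]
print(quantize_pixels([10, 100, 130, 250], bits=2))   # -> [32, 96, 160, 224]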
  • 260. 9.4 Uniform Quantizer 237 F I G U R E 9. 7 Top left: original Sena image; top right: 1 bit/pixel image; bottom left: 2 bits/pixel; bottorm right: 3 bits/pixel. Looking at the lower-rate images, we notice a couple of things. First, the lower-rate images are darker than the original, and the lowest-rate reconstructions are the darkest. The reason for this is that the quantization process usually results in scaling down of the dynamic range of the input. For example, in the 1-bit-per-pixel reproduction, the highest pixel value is 196, as opposed to 255 for the original image. As higher gray values represent lighter shades, there is a corresponding darkening of the reconstruction. The other thing to notice in the low-rate reconstruction is that wherever there were smooth changes in gray values there are now abrupt transitions. This is especially evident in the face and neck area, where gradual shading has been transformed to blotchy regions of constant values. This is because a range of values is being mapped to the same value, as was the case for the first two samples of the sinusoid in Example 9.3.1. For obvious reasons, this effect is called contouring. The perceptual effect of contouring can be reduced by a procedure called dithering [107].
Uniform Quantization of Nonuniform Sources

Quite often the sources we deal with do not have a uniform distribution; however, we still want the simplicity of a uniform quantizer. In these cases, even if the sources are bounded, simply dividing the range of the input by the number of quantization levels does not produce a very good design.

Example 9.4.2:

Suppose our input fell within the interval [−1, 1] with probability 0.95, and fell in the intervals [−100, −1), (1, 100] with probability 0.05. Suppose we wanted to design an eight-level uniform quantizer. If we followed the procedure of the previous section, the step size would be 25. This means that inputs in the [−1, 0) interval would be represented by the value −12.5, and inputs in the interval [0, 1) would be represented by the value 12.5. The maximum quantization error that can be incurred is 12.5. However, at least 95% of the time, the minimum error that will be incurred is 11.5. Obviously, this is not a very good design. A much better approach would be to use a smaller step size, which would result in better representation of the values in the [−1, 1] interval, even if it meant a larger maximum error. Suppose we pick a step size of 0.3. In this case, the maximum quantization error goes from 12.5 to 98.95. However, 95% of the time the quantization error will be less than 0.15. Therefore, the average distortion, or msqe, for this quantizer would be substantially less than the msqe for the first quantizer.

We can see that when the distribution is no longer uniform, it is not a good idea to obtain the step size by simply dividing the range of the input by the number of levels. This approach becomes totally impractical when we model our sources with distributions that are unbounded, such as the Gaussian distribution. Therefore, we include the pdf of the source in the design process. Our objective is to find the step size that, for a given value of M, will minimize the distortion. The simplest way to do this is to write the distortion as a function of the step size, and then minimize this function.

An expression for the distortion, or msqe, for an M-level uniform quantizer as a function of the step size can be found by replacing the b_i's and y_i's in Equation (9.3) with functions of Δ. As we are dealing with a symmetric condition, we need only compute the distortion for positive values of x; the distortion for negative values of x will be the same. From Figure 9.8, we see that the decision boundaries are integral multiples of Δ, and the representation level for the interval [(k−1)Δ, kΔ) is simply (2k−1)Δ/2. Therefore, the expression for msqe becomes

    \sigma_q^2 = 2 \sum_{i=1}^{M/2 - 1} \int_{(i-1)\Delta}^{i\Delta} \left( x - \frac{2i-1}{2}\Delta \right)^2 f_X(x)\,dx
                 + 2 \int_{(M/2 - 1)\Delta}^{\infty} \left( x - \frac{M-1}{2}\Delta \right)^2 f_X(x)\,dx          (9.18)
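As noted above, the simplest route is to treat the distortion of Equation (9.18) as a function of Δ and minimize it numerically. The sketch below (ours, not from the text) does this with a coarse grid search for a unit-variance Gaussian and M = 8; the minimum should land near the 0.586 step size listed later in Table 9.3.

import math

def gauss(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def integral(lo, hi, y, pdf, steps=2000):
    # midpoint rule for the integral of (x - y)^2 pdf(x) over [lo, hi]
    dx = (hi - lo) / steps
    return sum((lo + (k + 0.5) * dx - y) ** 2 * pdf(lo + (k + 0.5) * dx) * dx
               for k in range(steps))

def msqe_uniform(delta, m, pdf, tail=8.0):
    # Equation (9.18): granular term plus overload term (tail stands in for infinity)
    granular = sum(integral((i - 1) * delta, i * delta, (2 * i - 1) * delta / 2.0, pdf)
                   for i in range(1, m // 2))
    overload = integral((m // 2 - 1) * delta, tail, (m - 1) * delta / 2.0, pdf)
    return 2.0 * (granular + overload)

m = 8
best = min((msqe_uniform(d / 200.0, m, gauss), d / 200.0) for d in range(60, 181))
print("optimum step size ~", best[1], "  msqe ~", best[0])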
  • 262. 9.4 Uniform Quantizer 239 −3Δ −2Δ −Δ −7Δ/2 −5Δ/2 −3Δ/2 −Δ/2 7Δ/2 5Δ/2 3Δ/2 Δ/2 Output Input Δ 2Δ 3Δ F I G U R E 9. 8 A uniform midrise quantizer. To find the optimal value of , we simply take a derivative of this equation and set it equal to zero [108] (see Problem 1). 2 q = − M 2 −1 i=1 2i−1 i i−1 x − 2i−1 2 fXxdx −M −1 M 2 −1 x − M −1 2 fXxdx = 0 (9.19) This is a rather messy-looking expression, but given the pdf fXx, it is easy to solve using any one of a number of numerical techniques (see Problem 2). In Table 9.3, we list step sizes found by solving (9.19) for nine different alphabet sizes and three different distributions. Before we discuss the results in Table 9.3, let’s take a look at the quantization noise for the case of nonuniform sources. Nonuniform sources are often modeled by pdfs with unbounded support. That is, there is a nonzero probability of getting an unbounded input. In practical situations, we are not going to get inputs that are unbounded, but often it is very convenient to model the source process with an unbounded distribution. The classic example of this is measurement error, which is often modeled as having a Gaussian distribution,
  • 263. 240 9 S C A L A R Q U A N T I Z A T I O N T A B L E 9 . 3 Optimum step size and SNR for uniform quantizers for different distributions and alphabet sizes [108, 109]. Alphabet Uniform Gaussian Laplacian Size Step Size SNR Step Size SNR Step Size SNR 2 1.732 6.02 1.596 4.40 1.414 3.00 4 0.866 12.04 0.9957 9.24 1.0873 7.05 6 0.577 15.58 0.7334 12.18 0.8707 9.56 8 0.433 18.06 0.5860 14.27 0.7309 11.39 10 0.346 20.02 0.4908 15.90 0.6334 12.81 12 0.289 21.60 0.4238 17.25 0.5613 13.98 14 0.247 22.94 0.3739 18.37 0.5055 14.98 16 0.217 24.08 0.3352 19.36 0.4609 15.84 32 0.108 30.10 0.1881 24.56 0.2799 20.46 −4Δ −3Δ −2Δ −Δ Δ −Δ/2 Granular noise Overload noise Δ/2 x − Q(x) x 2Δ 3Δ 4Δ F I G U R E 9. 9 Quantization error for a uniform midrise quantizer. even when the measurement error is known to be bounded. If the input is unbounded, the quantization error is no longer bounded either. The quantization error as a function of input is shown in Figure 9.9. We can see that in the inner intervals the error is still bounded by 2 ; however, the quantization error in the outer intervals is unbounded. These two types of quantization errors are given different names. The bounded error is called granular error or granular noise, while the unbounded error is called overload error or overload noise. In the expression for the msqe in Equation (9.18), the first term represents the granular noise, while the second term represents the overload noise. The probability that the input will fall into the overload region is called the overload probability (Figure 9.10).
  • 264. 9.4 Uniform Quantizer 241 −4Δ −3Δ −2Δ −Δ Overload probability Granular probability Δ 2Δ 3Δ x 4Δ F I G U R E 9. 10 Overload and granular regions for a 3-bit uniform quantizer. The nonuniform sources we deal with have probability density functions that are generally peaked at zero and decay as we move away from the origin. Therefore, the overload probability is generally much smaller than the probability of the input falling in the granular region. As we see from Equation (9.19), an increase in the size of the step size will result in an increase in the value of M 2 −1 , which in turn will result in a decrease in the overload probability and the second term in Equation (9.19). However, an increase in the step size will also increase the granular noise, which is the first term in Equation (9.19). The design process for the uniform quantizer is a balancing of these two effects. An important parameter that describes this trade-off is the loading factor fl, defined as the ratio of the maximum value the input can take in the granular region to the standard deviation. A common value of the loading factor is 4. This is also referred to as 4 loading. Recall that when quantizing an input with a uniform distribution, the SNR and bit rate are related by Equation (9.17), which says that for each bit increase in the rate there is an increase of 6.02 dB in the SNR. In Table 9.3, along with the step sizes, we have also listed the SNR obtained when a million input values with the appropriate pdf are quantized using the indicated quantizer. From this table, we can see that, although the SNR for the uniform distribution follows the rule of a 6.02 dB increase in the signal-to-noise ratio for each additional bit, this is not true for the other distributions. Remember that we made some assumptions when we obtained the 602n rule that are only valid for the uniform distribution. Notice that the more peaked a distribution is (that is, the further away from uniform it is), the more it seems to vary from the 6.02 dB rule. We also said that the selection of is a balance between the overload and granular errors. The Laplacian distribution has more of its probability mass away from the origin in
  • 265. 242 9 S C A L A R Q U A N T I Z A T I O N its tails than the Gaussian distribution. This means that for the same step size and number of levels there is a higher probability of being in the overload region if the input has a Laplacian distribution than if the input has a Gaussian distribution. The uniform distribution is the extreme case, where the overload probability is zero. For the same number of levels, if we increase the step size, the size of the overload region (and hence the overload probability) is reduced at the expense of granular noise. Therefore, for a given number of levels, if we were picking the step size to balance the effects of the granular and overload noise, distributions that have heavier tails will tend to have larger step sizes. This effect can be seen in Table 9.3. For example, for eight levels the step size for the uniform quantizer is 0.433. The step size for the Gaussian quantizer is larger (0.586), while the step size for the Laplacian quantizer is larger still (0.7309). M i s m a t c h E f f e c t s We have seen that for a result to hold, the assumptions we used to obtain the result have to hold. When we obtain the optimum step size for a particular uniform quantizer using Equation (9.19), we make some assumptions about the statistics of the source. We assume a certain distribution and certain parameters of the distribution. What happens when our assumptions do not hold? Let’s try to answer this question empirically. We will look at two types of mismatches. The first is when the assumed distribution type matches the actual distribution type, but the variance of the input is different from the assumed variance. The second mismatch is when the actual distribution type is different from the distribution type assumed when obtaining the value of the step size. Throughout our discussion, we will assume that the mean of the input distribution is zero. In Figure 9.11, we have plotted the signal-to-noise ratio as a function of the ratio of the actual to assumed variance of a 4-bit Gaussian uniform quantizer, with a Gaussian 20 −2 0 2 4 6 8 10 SNR (dB) 14 12 18 16 0 0.5 1.0 1.5 2.0 Ratio of input variance to design variance 2.5 3.0 3.5 4.0 F I G U R E 9. 11 Effect of variance mismatch on the performance of a 4-bit uniform quantizer.
  • 266. 9.4 Uniform Quantizer 243 input. (To see the effect under different conditions, see Problem 5.) Remember that for a distribution with zero mean, the variance is given by 2 x = E X2 , which is also a measure of the power in the signal X. As we can see from the figure, the signal-to-noise ratio is maximum when the input signal variance matches the variance assumed when designing the quantizer. From the plot we also see that there is an asymmetry; the SNR is considerably worse when the input variance is lower than the assumed variance. This is because the SNR is a ratio of the input variance and the mean squared quantization error. When the input variance is smaller than the assumed variance, the mean squared quantization error actually drops because there is less overload noise. However, because the input variance is low, the ratio is small. When the input variance is higher than the assumed variance, the msqe increases substantially, but because the input power is also increasing, the ratio does not decrease as dramatically. To see this more clearly, we have plotted the mean squared error versus the signal variance separately in Figure 9.12. We can see from these figures that the decrease in signal-to-noise ratio does not always correlate directly with an increase in msqe. The second kind of mismatch is where the input distribution does not match the distribu- tion assumed when designing the quantizer. In Table 9.4 we have listed the SNR when inputs with different distributions are quantized using several different eight-level quantizers. The quantizers were designed assuming a particular input distribution. Notice that as we go from left to right in the table, the designed step size becomes progressively larger than the “correct” step size. This is similar to the situation where the input variance is smaller than the assumed variance. As we can see when we have a mismatch that results in a smaller step size relative to the optimum step size, there is a greater drop in performance than when the quantizer step size is larger than its optimum value. 2.0 0 0.2 0.4 0.6 0.8 1.0 msqe 1.4 1.2 1.8 1.6 0 1 2 3 4 Ratio of input variance to design variance 5 6 7 8 9 10 F I G U R E 9. 12 The msqe as a function of variance mismatch with a 4-bit uniform quantizer.
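The mismatch curves in Figures 9.11 and 9.12 can be regenerated along the following lines. This is our own sketch, not code from the text: a 16-level (4-bit) uniform quantizer designed for a unit-variance Gaussian (step size taken from Table 9.3) is driven by Gaussian inputs whose variance is swept over a range of ratios.

import math, random

def uniform_midrise(x, delta, m):
    k = int(math.floor(x / delta))
    k = max(-m // 2, min(m // 2 - 1, k))
    return (k + 0.5) * delta

design_delta, m = 0.3352, 16       # Table 9.3 step size for a 16-level Gaussian quantizer
random.seed(1)
for ratio in [0.25, 0.5, 1.0, 2.0, 4.0]:
    sigma = math.sqrt(ratio)       # actual input standard deviation (design variance is 1)
    signal = noise = 0.0
    for _ in range(100000):
        x = random.gauss(0.0, sigma)
        q = uniform_midrise(x, design_delta, m)
        signal += x * x
        noise += (x - q) ** 2
    print(f"variance ratio {ratio:4.2f}   SNR = {10.0 * math.log10(signal / noise):5.2f} dB")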
  • 267. 244 9 S C A L A R Q U A N T I Z A T I O N T A B L E 9 . 4 Demonstration of the effect of mismatch using eight- level quantizers (dB). Input Uniform Gaussian Laplacian Gamma Distribution Quantizer Quantizer Quantizer Quantizer Uniform 18.06 15.56 13.29 12.41 Gaussian 12.40 14.27 13.37 12.73 Laplacian 8.80 10.79 11.39 11.28 Gamma 6.98 8.06 8.64 8.76 9 . 5 Adaptive Quantization One way to deal with the mismatch problem is to adapt the quantizer to the statistics of the input. Several things might change in the input relative to the assumed statistics, including the mean, the variance, and the pdf. The strategy for handling each of these variations can be different, though certainly not exclusive. If more than one aspect of the input statistics changes, it is possible to combine the strategies for handling each case separately. If the mean of the input is changing with time, the best strategy is to use some form of differential encoding (discussed in some detail in Chapter 11). For changes in the other statistics, the common approach is to adapt the quantizer parameters to the input statistics. There are two main approaches to adapting the quantizer parameters: an off-line or forward adaptive approach, and an on-line or backward adaptive approach. In forward adaptive quantization, the source output is divided into blocks of data. Each block is analyzed before quantization, and the quantizer parameters are set accordingly. The settings of the quantizer are then transmitted to the receiver as side information. In backward adaptive quantization, the adaptation is performed based on the quantizer output. As this is available to both transmitter and receiver, there is no need for side information. 9 . 5 . 1 F o r w a r d A d a p t i v e Q u a n t i z a t i o n Let us first look at approaches for adapting to changes in input variance using the forward adaptive approach. This approach necessitates a delay of at least the amount of time required to process a block of data. The insertion of side information in the transmitted data stream may also require the resolution of some synchronization problems. The size of the block of data processed also affects a number of other things. If the size of the block is too large, then the adaptation process may not capture the changes taking place in the input statistics. Furthermore, large block sizes mean more delay, which may not be tolerable in certain applications. On the other hand, small block sizes mean that the side information has to be transmitted more often, which in turn means the amount of overhead per sample increases. The selection of the block size is a trade-off between the increase in side infor- mation necessitated by small block sizes and the loss of fidelity due to large block sizes (see Problem 7).
  • 268. 9.5 Adaptive Quantization 245 The variance estimation procedure is rather simple. At time n we use a block of N future samples to compute an estimate of the variance ˆ 2 q = 1 N N−1 i=0 x2 n+i (9.20) Note that we are assuming that our input has a mean of zero. The variance information also needs to be quantized so that it can be transmitted to the receiver. Usually, the number of bits used to quantize the value of the variance is significantly larger than the number of bits used to quantize the sample values. E x a m p l e 9 . 5 . 1: In Figure 9.13 we show a segment of speech quantized using a fixed 3-bit quantizer. The step size of the quantizer was adjusted based on the statistics of the entire sequence. The sequence was the testm.raw sequence from the sample data sets, consisting of about 4000 samples of a male speaker saying the word “test.” The speech signal was sampled at 8000 samples per second and digitized using a 16-bit A/D. 2.5 −2.0 0 1.0 1.5 2.0 0.5 −0.5 −1.0 −1.5 180 200 220 240 260 280 300 320 Original Reconstructed F I G U R E 9. 13 Original 16-bit speech and compressed 3-bit speech sequences. We can see from the figure that, as in the case of the example of the sinusoid earlier in this chapter, there is a considerable loss in amplitude resolution. Sample values that are close together have been quantized to the same value. The same sequence quantized with a forward adaptive quantizer is shown in Figure 9.14. For this example, we divided the input into blocks of 128 samples. Before quantizing the samples in a block, the standard deviation for the samples in the block was obtained. This value was quantized using an 8-bit quantizer and sent to both the transmitter and receiver.
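A sketch of the forward adaptive procedure used in this example, in our own Python (none of it is from the text): the standard deviation of each block is estimated as in Equation (9.20), crudely quantized to 8 bits over an assumed range so that the same value is available to both encoder and decoder, and then used to normalize the block before a fixed 3-bit quantizer is applied. The per-unit-variance step size and the sigma range are assumptions, not values given in the text.

import math

def midrise(x, delta, m):
    # fixed M-level uniform midrise quantizer
    k = int(math.floor(x / delta))
    k = max(-m // 2, min(m // 2 - 1, k))
    return (k + 0.5) * delta

def forward_adaptive(samples, block_size=128, bits=3, step=0.7309, sigma_max=4.0):
    # step: step size for unit-variance input (here the Table 9.3 Laplacian value, an assumption)
    # sigma_max: assumed range for the 8-bit quantization of the block standard deviation
    m = 2 ** bits
    reconstructed, side_info = [], []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        sigma = math.sqrt(sum(x * x for x in block) / len(block)) + 1e-9   # Equation (9.20)
        q_sigma = (min(int(sigma / sigma_max * 256), 255) + 0.5) * (sigma_max / 256)
        side_info.append(q_sigma)       # transmitted to the receiver as side information
        reconstructed.extend(q_sigma * midrise(x / q_sigma, step, m) for x in block)
    return reconstructed, side_info

# usage: rec, side = forward_adaptive(list_of_speech_samples)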
  • 269. 246 9 S C A L A R Q U A N T I Z A T I O N 2.5 −2.5 −0.5 0.5 1.0 1.5 2.0 0 −1.0 −1.5 −2.0 180 200 220 240 260 280 300 320 Original Reconstructed F I G U R E 9. 14 Original 16-bit speech sequence and sequence obtained using an eight-level forward adaptive quantizer. The samples in the block were then normalized using this value of the standard deviation. Notice that the reconstruction follows the input much more closely, though there seems to be room for improvement, especially in the latter half of the displayed samples. E x a m p l e 9 . 5 . 2: In Example 9.4.1, we used a uniform quantizer with the assumption that the input is uniformly distributed. Let us refine this source model a bit and say that while the source is uniformly distributed over different regions, the range of the input changes. In a forward adaptive quantization scheme, we would obtain the minimum and maximum values for each block of data, which would be transmitted as side information. In Figure 9.15, we see the Sena image quantized with a block size of 8×8 using 3-bit forward adaptive uniform quantization. The side information consists of the minimum and maximum values in each block, which require 8 bits each. Therefore, the overhead in this case is 16 8×8 or 0.25 bits per pixel, which is quite small compared to the number of bits per sample used by the quantizer. The resulting image is hardly distinguishable from the original. Certainly at higher rates, forward adaptive quantization seems to be a very good alternative. 9 . 5 . 2 B a c k w a r d A d a p t i v e Q u a n t i z a t i o n In backward adaptive quantization, only the past quantized samples are available for use in adapting the quantizer. The values of the input are only known to the encoder; therefore, this information cannot be used to adapt the quantizer. How can we get information about mismatch simply by examining the output of the quantizer without knowing what the input was? If we studied the output of the quantizer for a long period of time, we could get some idea about mismatch from the distribution of output values. If the quantizer step size is
  • 270. 9.5 Adaptive Quantization 247 F I G U R E 9. 15 Sena image quantized to 3.25 bits per pixel using forward adaptive quantization. well matched to the input, the probability that an input to the quantizer would land in a particular interval would be consistent with the pdf assumed for the input. However, if the actual pdf differs from the assumed pdf, the number of times the input falls in the different quantization intervals will be inconsistent with the assumed pdf. If is smaller than what it should be, the input will fall in the outer levels of the quantizer an excessive number of times. On the other hand, if is larger than it should be for a particular source, the input will fall in the inner levels an excessive number of times. Therefore, it seems that we should observe the output of the quantizer for a long period of time, then expand the quantizer step size if the input falls in the outer levels an excessive number of times, and contract the step size if the input falls in the inner levels an excessive number of times. Nuggehally S. Jayant at Bell Labs showed that we did not need to observe the quantizer output over a long period of time [110]. In fact, we could adjust the quantizer step size after observing a single output. Jayant named this quantization approach “quantization with one word memory.” The quantizer is better known as the Jayant quantizer. The idea behind the Jayant quantizer is very simple. If the input falls in the outer levels, the step size needs to be expanded, and if the input falls in the inner quantizer levels, the step size needs to be reduced. The expansions and contractions should be done in such a way that once the quantizer is matched to the input, the product of the expansions and contractions is unity. The expansion and contraction of the step size is accomplished in the Jayant quantizer by assigning a multiplier Mk to each interval. If the n−1th input falls in the kth interval, the step size to be used for the nth input is obtained by multiplying the step size used for the n−1th input with Mk. The multiplier values for the inner levels in the quantizer are less than one, and the multiplier values for the outer levels of the quantizer are greater than one.
  • 271. 248 9 S C A L A R Q U A N T I Z A T I O N Therefore, if an input falls into the inner levels, the quantizer used to quantize the next input will have a smaller step size. Similarly, if an input falls into the outer levels, the step size will be multiplied with a value greater than one, and the next input will be quantized using a larger step size. Notice that the step size for the current input is modified based on the previous quantizer output. The previous quantizer output is available to both the transmitter and receiver, so there is no need to send any additional information to inform the receiver about the adaptation. Mathematically, the adaptation process can be represented as n = Mln−1 n−1 (9.21) where ln−1 is the quantization interval at time n−1. In Figure 9.16 we show a 3-bit uniform quantizer. We have eight intervals represented by the different quantizer outputs. However, the multipliers for symmetric intervals are identical because of symmetry: M0 = M4 M1 = M5 M2 = M6 M3 = M7 Therefore, we only need four multipliers. To see how the adaptation proceeds, let us work through a simple example using this quantizer. −3Δ −2Δ −Δ −7Δ/2 −5Δ/2 −3Δ/2 –Δ/2 7Δ/2 5Δ/2 3Δ/2 Δ/2 Output Input Δ 2Δ 3Δ 3 2 0 1 4 5 6 7 F I G U R E 9. 16 Output levels for the Jayant quantizer.
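The adaptation rule of Equation (9.21) takes only a few lines of code. The Python sketch below is ours, not from the text; with the multipliers and initial step size of Example 9.5.3, which follows, it reproduces the outputs of Table 9.5 (the last two inputs are taken from that table). The Δmin and Δmax safeguards discussed after the example are included but set loosely so they do not affect the result.

def jayant_quantize(inputs, multipliers, delta0, bits=3, delta_min=1e-4, delta_max=100.0):
    # Backward adaptive (Jayant) uniform midrise quantizer.
    # multipliers[k] is the step-size multiplier for magnitude level k = 0 .. 2**(bits-1) - 1
    half = 2 ** (bits - 1)
    delta = delta0
    outputs = []
    for x in inputs:
        k = min(int(abs(x) / delta), half - 1)              # magnitude level of this input
        outputs.append((k + 0.5) * delta * (1 if x >= 0 else -1))
        # Equation (9.21): adapt the step size used for the next input
        delta = min(max(delta * multipliers[k], delta_min), delta_max)
    return outputs

# Example 9.5.3: M0..M3 = 0.8, 0.9, 1.0, 1.2 and an initial step size of 0.5
seq = [0.1, -0.2, 0.2, 0.1, -0.3, 0.1, 0.2, 0.5, 0.9, 1.5, 1.0, 0.9]
print(jayant_quantize(seq, [0.8, 0.9, 1.0, 1.2], delta0=0.5))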
Example 9.5.3: Jayant quantizer

For the quantizer in Figure 9.16, suppose the multiplier values are M0 = M4 = 0.8, M1 = M5 = 0.9, M2 = M6 = 1, M3 = M7 = 1.2; the initial value of the step size, Δ0, is 0.5; and the sequence to be quantized is 0.1, −0.2, 0.2, 0.1, −0.3, 0.1, 0.2, 0.5, 0.9, 1.5. When the first input is received, the quantizer step size is 0.5. Therefore, the input falls into level 0, and the output value is 0.25, resulting in an error of 0.15. As this input fell into the quantizer level 0, the new step size Δ1 is M0 × Δ0 = 0.8 × 0.5 = 0.4. The next input is −0.2, which falls into level 4. As the step size at this time is 0.4, the output is −0.2. To update, we multiply the current step size with M4. Continuing in this fashion, we get the sequence of step sizes and outputs shown in Table 9.5.

TABLE 9.5 Operation of a Jayant quantizer.

    n    Δn        Input   Output Level   Output     Error     Update Equation
    0    0.5        0.1    0               0.25       0.15     Δ1 = M0 × Δ0
    1    0.4       −0.2    4              −0.2        0.0      Δ2 = M4 × Δ1
    2    0.32       0.2    0               0.16       0.04     Δ3 = M0 × Δ2
    3    0.256      0.1    0               0.128      0.028    Δ4 = M0 × Δ3
    4    0.2048    −0.3    5              −0.3072    −0.0072   Δ5 = M5 × Δ4
    5    0.1843     0.1    0               0.0922    −0.0078   Δ6 = M0 × Δ5
    6    0.1475     0.2    1               0.2212     0.0212   Δ7 = M1 × Δ6
    7    0.1328     0.5    3               0.4646    −0.0354   Δ8 = M3 × Δ7
    8    0.1594     0.9    3               0.5578    −0.3422   Δ9 = M3 × Δ8
    9    0.1913     1.5    3               0.6696    −0.8304   Δ10 = M3 × Δ9
    10   0.2296     1.0    3               0.8036     0.1964   Δ11 = M3 × Δ10
    11   0.2755     0.9    3               0.9643     0.0643   Δ12 = M3 × Δ11

Notice how the quantizer adapts to the input. In the beginning of the sequence, the input values are mostly small, and the quantizer step size becomes progressively smaller, providing better and better estimates of the input. At the end of the sample sequence, the input values are large and the step size becomes progressively bigger. However, the size of the error is quite large during the transition. This means that if the input was changing rapidly, which would happen if we had a high-frequency input, such transition situations would be much more likely to occur, and the quantizer would not function very well. However, in cases where the statistics of the input change slowly, the quantizer could adapt to the input. As most natural sources such as speech and images tend to be correlated, their values do not change drastically from sample to sample. Even when some of this structure is removed through some transformation, the residual structure is generally enough for the Jayant quantizer (or some variation of it) to function quite effectively.

The step size in the initial part of the sequence in this example is progressively getting smaller. We can easily conceive of situations where the input values would be small for a long period. Such a situation could occur during a silence period in speech-encoding systems,
  • 273. 250 9 S C A L A R Q U A N T I Z A T I O N or while encoding a dark background in image-encoding systems. If the step size continues to shrink for an extended period of time, in a finite precision system it would result in a value of zero. This would be catastrophic, effectively replacing the quantizer with a zero output device. Usually, a minimum value min is defined, and the step size is not allowed to go below this value to prevent this from happening. Similarly, if we get a sequence of large values, the step size could increase to a point that, when we started getting smaller values, the quantizer would not be able to adapt fast enough. To prevent this from happening, a maximum value max is defined, and the step size is not allowed to increase beyond this value. The adaptivity of the Jayant quantizer depends on the values of the multipliers. The further the multiplier values are from unity, the more adaptive the quantizer. However, if the adaptation algorithm reacts too fast, this could lead to instability. So how do we go about selecting the multipliers? First of all, we know that the multipliers correponding to the inner levels are less than one, and the multipliers for the outer levels are greater than one. If the input process is stationary and Pk represents the probability of being in quantizer interval k (generally estimated by using a fixed quantizer for the input data), then we can impose a stability criterion for the Jayant quantizer based on our requirement that once the quantizer is matched to the input, the product of the expansions and contractions are equal to unity. That is, if nk is the number of times the input falls in the kth interval, M k=0 M nk k = 1 (9.22) Taking the Nth root of both sides (where N is the total number of inputs) we obtain M k=0 M nk N k = 1 or M k=0 M Pk k = 1 (9.23) where we have assumed that Pk = nk/N. There are an infinite number of multiplier values that would satisfy Equation (9.23). One way to restrict this number is to impose some structure on the multipliers by requiring them to be of the form Mk = lk (9.24) where is a number greater than one and lk takes on only integer values [111, 112]. If we substitute this expression for Mk into Equation (9.23), we get M k=0 lkPk = 1 (9.25)
  • 274. 9.5 Adaptive Quantization 251 which implies that M k=0 lkPk = 0 (9.26) The final step is the selection of , which involves a significant amount of creativity. The value we pick for determines how fast the quantizer will respond to changing statistics. A large value of will result in faster adaptation, while a smaller value of will result in greater stability. E x a m p l e 9 . 5 . 4: Suppose we have to obtain the multiplier functions for a 2-bit quantizer with input proba- bilities P0 = 08, P1 = 02. First, note that the multiplier value for the inner level has to be less than 1. Therefore, l0 is less than 0. If we pick l0 = −1 and l1 = 4, this would satisfy Equation (9.26), while making M0 less than 1 and M1 greater than 1. Finally, we need to pick a value for . In Figure 9.17 we see the effect of using different values of in a rather extreme example. The input is a square wave that switches between 0 and 1 every 30 samples. The input is quantized using a 2-bit Jayant quantizer. We have used l0 = −1 and l1 = 2. Notice what happens when the input switches from 0 to 1. At first the input falls in the outer level of the quantizer, and the step size increases. This process continues until is just greater than 1. If is close to 1, has been increasing quite slowly and should have a value close to 1 right before its value increases to greater than 1. Therefore, the output at this point is close to 1.5. When becomes greater than 1, the input falls in the inner level, and if is close to 1, the output suddenly drops to about 0.5. The step size now decreases until it is just 1.2 0 0.2 0.4 0.6 Amplitude 0.8 1.0 0 10 20 30 40 50 Time 60 70 80 90 F I G U R E 9. 17 Effect of on the performance of the Jayant quantizer.
  • 275. 252 9 S C A L A R Q U A N T I Z A T I O N below 1, and the process repeats, causing the “ringing” seen in Figure 9.17. As increases, the quantizer adapts more rapidly, and the magnitude of the ringing effect decreases. The reason for the decrease is that right before the value of increases above 1, its value is much smaller than 1, and subsequently the output value is much smaller than 1.5. When increases beyond 1, it may increase by a significant amount, so the inner level may be much greater than 0.5. These two effects together compress the ringing phenomenon. Looking at this phenomenon, we can see that it may have been better to have two adaptive strategies, one for when the input is changing rapidly, as in the case of the transitions between 0 and 1, and one for when the input is constant, or nearly so. We will explore this approach further when we describe the quantizer used in the CCITT standard G.726. When selecting multipliers for a Jayant quantizer, the best quantizers expand more rapidly than they contract. This makes sense when we consider that, when the input falls into the outer levels of the quantizer, it is incurring overload error, which is essentially unbounded. This situation needs to be mitigated with dispatch. On the other hand, when the input falls in the inner levels, the noise incurred is granular noise, which is bounded and therefore may be more tolerable. Finally, the discussion of the Jayant quantizer was motivated by the need for robustness in the face of changing input statistics. Let us repeat the earlier experiment with changing input variance and distributions and see the performance of the Jayant quantizer compared to the pdf-optimized quantizer. The results for these experiments are presented in Figure 9.18. Notice how flat the performance curve is. While the performance of the Jayant quantizer is much better than the nonadaptive uniform quantizer over a wide range of input variances, at the point where the input variance and design variance agree, the performance of the nonadaptive quantizer is significantly better than the performance of the Jayant quantizer. 20 15 10 SNR (dB) 5 0 0 1 2 3 4 5 Ratio of input variance to design variance 6 7 8 9 10 F I G U R E 9. 18 Performance of the Jayant quantizer for different input variances.
  • 276. 9.6 Nonuniform Quantization 253 This means that if we know the input statistics and we are reasonably certain that the input statistics will not change over time, it is better to design for those statistics than to design an adaptive system. 9 . 6 Nonuniform Quantization As we can see from Figure 9.10, if the input distribution has more mass near the origin, the input is more likely to fall in the inner levels of the quantizer. Recall that in lossless compression, in order to minimize the average number of bits per input symbol, we assigned shorter codewords to symbols that occurred with higher probability and longer codewords to symbols that occurred with lower probability. In an analogous fashion, in order to decrease the average distortion, we can try to approximate the input better in regions of high probability, perhaps at the cost of worse approximations in regions of lower probability. We can do this by making the quantization intervals smaller in those regions that have more probability mass. If the source distribution is like the distribution shown in Figure 9.10, we would have smaller intervals near the origin. If we wanted to keep the number of intervals constant, this would mean we would have larger intervals away from the origin. A quantizer that has nonuniform intervals is called a nonuniform quantizer. An example of a nonuniform quantizer is shown in Figure 9.19. Notice that the intervals closer to zero are smaller. Hence the maximum value that the quantizer error can take on is also smaller, resulting in a better approximation. We pay for this improvement in accuracy at lower input levels by incurring larger errors when the input falls in the outer intervals. However, as the probability of getting smaller input values is much higher than getting larger signal values, on the average the distortion will be lower than if we had a uniform quantizer. While a nonuniform quantizer provides lower average distortion, the design of nonuniform quantizers is also somewhat more complex. However, the basic idea is quite straightforward: find the decision boundaries and reconstruction levels that minimize the mean squared quantization error. We look at the design of nonuniform quantizers in more detail in the following sections. 9 . 6 . 1 p d f - O p t i m i z e d Q u a n t i z a t i o n A direct approach for locating the best nonuniform quantizer, if we have a probability model for the source, is to find the bi and yi that minimize Equation (9.3). Setting the derivative of Equation (9.3) with respect to yj to zero, and solving for yj, we get yj = bj bj−1 xfXxdx bj bj−1 fXxdx (9.27) The output point for each quantization interval is the centroid of the probability mass in that interval. Taking the derivative with respect to bj and setting it equal to zero, we get an expression for bj as bj = yj+1 +yj 2 (9.28)
  • 277. 254 9 S C A L A R Q U A N T I Z A T I O N b1 b2 b3 b4 y1 y4 y8 y7 y3 y2 y6 y5 Output Input b5 b6 b7 F I G U R E 9. 19 A nonuniform midrise quantizer. The decision boundary is simply the midpoint of the two neighboring reconstruction levels. Solving these two equations will give us the values for the reconstruction levels and decision boundaries that minimize the mean squared quantization error. Unfortunately, to solve for yj, we need the values of bj and bj−1, and to solve for bj, we need the values of yj+1 and yj. In a 1960 paper, Joel Max [108] showed how to solve the two equations iteratively. The same approach was described by Stuart P. Lloyd in a 1957 internal Bell Labs memorandum. Generally, credit goes to whomever publishes first, but in this case, because much of the early work in quantization was done at Bell Labs, Lloyd’s work was given due credit and the algorithm became known as the Lloyd-Max algorithm. However, the story does not end (begin?) there. Allen Gersho [113] points out that the same algorithm was published by Lukaszewicz and Steinhaus in a Polish journal in 1955 [114]! Lloyd’s paper remained unpublished until 1982, when it was finally published in a special issue of the IEEE Transactions on Information Theory devoted to quantization [115].
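One common way to solve Equations (9.27) and (9.28) is to alternate between them until the reconstruction levels stop moving; the sweep over initial guesses of y1 described next is another. The sketch below (our own code, not from the text) takes the alternating route for a unit-variance Gaussian; for M = 8 the reconstruction levels should come out near ±0.245, ±0.756, ±1.34, and ±2.15, in line with the Gaussian column of Table 9.6.

import math

def gauss(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def lloyd_max(m, pdf=gauss, lo=-8.0, hi=8.0, steps=4000, iterations=100):
    # pdf-optimized quantizer by alternating Equations (9.27) and (9.28);
    # lo and hi stand in for -infinity and +infinity.
    dx = (hi - lo) / steps
    xs = [lo + (k + 0.5) * dx for k in range(steps)]
    ps = [pdf(x) * dx for x in xs]
    y = [lo + (hi - lo) * (i + 0.5) / m for i in range(m)]   # start from a uniform quantizer
    b = []
    for _ in range(iterations):
        b = [lo] + [(y[i] + y[i + 1]) / 2.0 for i in range(m - 1)] + [hi]   # Equation (9.28)
        for i in range(m):                                                  # Equation (9.27)
            num = sum(x * p for x, p in zip(xs, ps) if b[i] <= x < b[i + 1])
            den = sum(p for x, p in zip(xs, ps) if b[i] <= x < b[i + 1])
            if den > 0.0:
                y[i] = num / den
    return b, y

b, y = lloyd_max(8)
print("boundaries:", [round(v, 4) for v in b[1:-1]])
print("levels    :", [round(v, 4) for v in y])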
  • 278. 9.6 Nonuniform Quantization 255 To see how this algorithm works, let us apply it to a specific situation. Suppose we want to design an M-level symmetric midrise quantizer. To define our symbols, we will use Figure 9.20. From the figure, we see that in order to design this quantizer, we need to obtain the reconstruction levels y1y2 yM 2 and the decision boundaries b1b2 bM 2 −1. The reconstruction levels y−1y−2 y− M 2 and the decision bound- aries b−1b−2 b− M 2 −1 can be obtained through symmetry, the decision boundary b0 is zero, and the decision boundary bM 2 is simply the largest value the input can take on (for unbounded inputs this would be ). Let us set j equal to 1 in Equation (9.27): y1 = b1 b0 xfXxdx b1 b0 fXxdx (9.29) b−3 b−2 b−1 b0 y−4 1 −1 y−1 y4 y3 y−2 y−3 y2 y1 Output Input b1 b2 b3 1 2 3 4 F I G U R E 9. 20 A nonuniform midrise quantizer.
  • 279. 256 9 S C A L A R Q U A N T I Z A T I O N As b0 is known to be 0, we have two unknowns in this equation, b1 and y1. We make a guess at y1 and later we will try to refine this guess. Using this guess in Equation (9.29), we numerically find the value of b1 that satisfies Equation (9.29). Setting j equal to 1 in Equation (9.28), and rearranging things slightly, we get y2 = 2b1 +y1 (9.30) from which we can compute y2. This value of y2 can then be used in Equation (9.27) with j = 2 to find b2, which in turn can be used to find y3. We continue this process, until we obtain a value for y1y2yM 2 and b1b2bM 2 −1. Note that the accuracy of all the values obtained to this point depends on the quality of our initial estimate of y1. We can check this by noting that yM 2 is the centroid of the probability mass of the interval bM 2 −1bM 2 . We know bM 2 from our knowledge of the data. Therefore, we can compute the integral yM 2 = b M 2 b M 2 −1 xfXxdx b M 2 b M 2 −1 fXxdx (9.31) and compare it with the previously computed value of yM 2 . If the difference is less than some tolerance threshold, we can stop. Otherwise, we adjust the estimate of y1 in the direction indicated by the sign of the difference and repeat the procedure. Decision boundaries and reconstruction levels for various distributions and number of levels generated using this procedure are shown in Table 9.6. Notice that the distributions that have heavier tails also have larger outer step sizes. However, these same quantizers have smaller inner step sizes because they are more heavily peaked. The SNR for these quantizers is also listed in the table. Comparing these values with those for the pdf-optimized uniform quantizers, we can see a significant improvement, especially for distributions further away from the uniform distribution. Both uniform and nonuniform pdf-optimized, or Lloyd-Max, T A B L E 9 . 6 Quantizer boundary and reconstruction levels for nonuniform Gaussian and Laplacian quantizers. Gaussian Laplacian Levels bi yi SNR bi yi SNR 4 0.0 0.4528 0.0 0.4196 0.9816 1.510 9.3 dB 1.1269 1.8340 7.54 dB 6 0.0 0.3177 0.0 0.2998 0.6589 1.0 0.7195 1.1393 1.447 1.894 12.41 dB 1.8464 2.5535 10.51 dB 8 0.0 0.2451 0.0 0.2334 0.7560 0.6812 0.5332 0.8330 1.050 1.3440 1.2527 1.6725 1.748 2.1520 14.62 dB 2.3796 3.0867 12.64 dB
  • 280. 9.6 Nonuniform Quantization 257 quantizers have a number of interesting properties. We list these properties here (their proofs can be found in [116, 117, 118]): Property 1: The mean values of the input and output of a Lloyd-Max quantizer are equal. Property 2: For a given Lloyd-Max quantizer, the variance of the output is always less than or equal to the variance of the input. Property 3: The mean squared quantization error for a Lloyd-Max quantizer is given by 2 q = 2 x − M j=1 y2 j P bj−1 ≤ X bj (9.32) where 2 x is the variance of the quantizer input, and the second term on the right-hand side is the second moment of the output (or variance if the input is zero mean). Property 4: Let N be the random variable corresponding to the quantization error. Then for a given Lloyd-Max quantizer, E XN = − 2 q (9.33) Property 5: For a given Lloyd-Max quantizer, the quantizer output and the quantiza- tion noise are orthogonal: E QXN b0b1bM = 0 (9.34) M i s m a t c h E f f e c t s As in the case of uniform quantizers, the pdf-optimized nonuniform quantizers also have problems when the assumptions underlying their design are violated. In Figure 9.21 we show the effects of variance mismatch on a 4-bit Laplacian nonuniform quantizer. This mismatch effect is a serious problem because in most communication systems the input variance can change considerably over time. A common example of this is the telephone system. Different people speak with differing amounts of loudness into the telephone. The quantizer used in the telephone system needs to be quite robust to the wide range of input variances in order to provide satisfactory service. One solution to this problem is the use of adaptive quantization to match the quantizer to the changing input characteristics. We have already looked at adaptive quantization for the uniform quantizer. Generalizing the uniform adaptive quantizer to the nonuniform case is relatively straightforward, and we leave that as a practice exercise (see Problem 8). A somewhat different approach is to use a nonlinear mapping to flatten the performance curve shown in Figure 9.21. In order to study this approach, we need to view the nonuniform quantizer in a slightly different manner. 9 . 6 . 2 C o m p a n d e d Q u a n t i z a t i o n Instead of making the step size small, we could make the interval in which the input lies with high probability large—that is, expand the region in which the input lands with high
probability in proportion to the probability with which the input lands in this region. This is the idea behind companded quantization.

[Figure 9.21 Effect of mismatch on nonuniform quantization: SNR (dB) plotted against the ratio of input variance to assumed variance, for ratios from 0.5 to 3.0.]

This quantization approach can be represented by the block diagram shown in Figure 9.22. The input is first mapped through a compressor function. This function “stretches” the high-probability regions close to the origin, and correspondingly “compresses” the low-probability regions away from the origin. Thus, regions close to the origin in the input to the compressor occupy a greater fraction of the total region covered by the compressor.

[Figure 9.22 Block diagram for log companded quantization: compressor, uniform quantizer, and expander in cascade, each shown with its input-output characteristic.]

If the output of the compressor function is quantized using a uniform quantizer, and the quantized value transformed via an expander function,
the overall effect is the same as using a nonuniform quantizer. To see this, we devise a simple compander and see how the process functions.

Example 9.6.1:

Suppose we have a source that can be modeled as a random variable taking values in the interval [−4, 4] with more probability mass near the origin than away from it. We want to quantize this using the quantizer of Figure 9.3. Let us try to flatten out this distribution using the following compander, and then compare the companded quantization with straightforward uniform quantization. The compressor characteristic we will use is given by the following equation:

$$c(x) = \begin{cases} 2x & -1 \le x \le 1 \\[1ex] \dfrac{2x}{3} + \dfrac{4}{3} & x > 1 \\[1ex] \dfrac{2x}{3} - \dfrac{4}{3} & x < -1 \end{cases} \qquad (9.35)$$

The mapping is shown graphically in Figure 9.23. The inverse mapping is given by

$$c^{-1}(x) = \begin{cases} \dfrac{x}{2} & -2 \le x \le 2 \\[1ex] \dfrac{3x}{2} - 2 & x > 2 \\[1ex] \dfrac{3x}{2} + 2 & x < -2 \end{cases} \qquad (9.36)$$

The inverse mapping is shown graphically in Figure 9.24.

[Figure 9.23 Compressor mapping: c(x) plotted against x over the interval [−4, 4].]
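The compander in this example is simple enough to exercise in a few lines of code. The following sketch (in Python, not from the book's accompanying software; the function names, and the assumption that the quantizer of Figure 9.3 is an eight-level uniform quantizer on [−4, 4] with step size 1, are ours) implements the compressor of Equation (9.35), the expander of Equation (9.36), and the uniform quantizer, and compares direct quantization with companded quantization for a few inputs. It reproduces the input-output pairs worked out below.

```python
import numpy as np

def compress(x):
    # Compressor of Equation (9.35): stretch [-1, 1] onto [-2, 2].
    if -1 <= x <= 1:
        return 2 * x
    return 2 * x / 3 + 4 / 3 if x > 1 else 2 * x / 3 - 4 / 3

def expand(x):
    # Expander of Equation (9.36): inverse of the compressor.
    if -2 <= x <= 2:
        return x / 2
    return 3 * x / 2 - 2 if x > 2 else 3 * x / 2 + 2

def uniform_q(x, delta=1.0, levels=8):
    # Midrise uniform quantizer on [-4, 4]: outputs +/-0.5, +/-1.5, +/-2.5, +/-3.5.
    half = levels // 2
    index = np.clip(np.floor(x / delta), -half, half - 1)
    return (index + 0.5) * delta

for x in [0.9, 2.7, -0.3]:
    direct = uniform_q(x)
    companded = expand(uniform_q(compress(x)))
    print(f"x={x:5.2f}  uniform -> {direct:5.2f} (err {abs(x - direct):.2f})  "
          f"companded -> {companded:5.2f} (err {abs(x - companded):.2f})")
```

For the input 0.9 this prints a companded reconstruction of 0.75 against 0.5 for direct quantization, matching the discussion that follows.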
  • 283. 260 9 S C A L A R Q U A N T I Z A T I O N −4 −3 −2 −1 1 2 3 −4 −3 −2 −1 c−1(x) 3 2 1 x 4 F I G U R E 9. 24 Expander mapping. Let’s see how using these mappings affects the quantization error both near and far from the origin. Suppose we had an input of 0.9. If we quantize directly with the uniform quantizer, we get an output of 0.5, resulting in a quantization error of 0.4. If we use the companded quantizer, we first use the compressor mapping, mapping the input value of 0.9 to 1.8. Quantizing this with the same uniform quantizer results in an output of 1.5, with an apparent error of 0.3. The expander then maps this to the final reconstruction value of 0.75, which is 0.15 away from the input. Comparing 0.15 with 0.4, we can see that relative to the input we get a substantial reduction in the quantization error. In fact, for all values in the interval −11 , we will not get any increase in the quantization error, and for most values we will get a decrease in the quantization error (see Problem 6 at the end of this chapter). Of course, this will not be true for the values outside the −11 interval. Suppose we have an input of 2.7. If we quantized this directly with the uniform quantizer, we would get an output of 2.5, with a corresponding error of 0.2. Applying the compressor mapping, the value of 2.7 would be mapped to 3.13, resulting in a quantized value of 3.5. Mapping this back through the expander, we get a reconstructed value of 3.25, which differs from the input by 0.55. As we can see, the companded quantizer effectively works like a nonuniform quantizer with smaller quantization intervals in the interval −11 and larger quantization intervals outside this interval. What is the effective input-output map of this quantizer? Notice that all inputs in the interval [0, 0.5] get mapped into the interval [0, 1], for which the quantizer output is 0.5, which in turn corresponds to the reconstruction value of 0.25. Essentially, all values in the interval [0, 0.5] are represented by the value 0.25. Similarly, all values in
  • 284. 9.6 Nonuniform Quantization 261 −4 −3 −2 −1 1 2 3 −4 −3 −2 −1 Output 3 2 1 Input 4 F I G U R E 9. 25 Nonuniform companded quantizer. the interval [0.5, 1] are represented by the value 0.75, and so on. The effective quantizer input-output map is shown in Figure 9.25. If we bound the source output by some value xmax, any nonuniform quantizer can always be represented as a companding quantizer. Let us see how we can use this fact to come up with quantizers that are robust to mismatch. First we need to look at some of the properties of high-rate quantizers, or quantizers with a large number of levels. Define k = bk −bk−1 (9.37) If the number of levels is high, then the size of each quantization interval will be small, and we can assume that the pdf of the input fXx is essentially constant in each quantization interval. Then fXx = fXyk if bk−1 ≤ x bk (9.38) Using this we can rewrite Equation (9.3) as 2 q = M i=1 fXyi bi bi−1 x −yi2 dx (9.39) = 1 12 M i=1 fXyi 3 i (9.40)
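Equation (9.40) is easy to sanity-check numerically. The sketch below (a hedged illustration under our own assumptions, not the book's software) applies a 32-level uniform quantizer to a unit-variance Gaussian, computes the granular distortion exactly by numerical integration, and compares it with the high-rate approximation; the overload region beyond x_max is ignored in both computations.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Uniform M-level quantizer on [-xmax, xmax] applied to a unit-variance Gaussian.
M, xmax = 32, 4.0
delta = 2 * xmax / M
b = -xmax + delta * np.arange(M + 1)        # decision boundaries b_0 ... b_M
y = (b[:-1] + b[1:]) / 2                    # midpoint reconstruction levels

# Exact granular distortion: integral of (x - y_i)^2 f_X(x) over each interval.
exact = sum(integrate.quad(lambda x, yi=yi: (x - yi) ** 2 * norm.pdf(x), lo, hi)[0]
            for lo, hi, yi in zip(b[:-1], b[1:], y))

# High-rate approximation of Equation (9.40): (1/12) * sum of f_X(y_i) * delta_i^3.
approx = np.sum(norm.pdf(y) * delta ** 3) / 12

print(f"exact granular MSE  = {exact:.6f}")
print(f"Eq. (9.40) estimate = {approx:.6f}")
```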
  • 285. 262 9 S C A L A R Q U A N T I Z A T I O N xmax c(bk) c(bk − 1) xmax Δk . . . . . . F I G U R E 9. 26 A compressor function. Armed with this result, let us return to companded quantization. Let cx be a companding characteristic for a symmetric quantizer, and let c x be the derivative of the compressor characteristic with respect to x. If the rate of the quantizer is high, that is, if there are a large number of levels, then within the kth interval, the compressor characteristic can be approximated by a straight line segment (see Figure 9.26), and we can write c yk = cbk−cbk−1 k (9.41) From Figure 9.26 we can also see that cbk−cbk−1 is the step size of a uniform M-level quantizer. Therefore, cbk−cbk−1 = 2xmax M (9.42) Substituting this into Equation (9.41) and solving for k, we get k = 2xmax Mcyk (9.43)
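Equation (9.43) can also be checked on the piecewise-linear compander of Example 9.6.1, for which it holds exactly within each linear segment. The sketch below (our own illustration, assuming x_max = 4, M = 8, and the compander of Equations (9.35) and (9.36)) maps the uniform decision boundaries back through the expander and compares the resulting interval widths with the step sizes predicted by 2x_max/(M c'(y_k)).

```python
import numpy as np

xmax, M = 4.0, 8

def compress(x):
    # Compressor of Equation (9.35).
    return 2 * x if abs(x) <= 1 else 2 * x / 3 + np.sign(x) * 4 / 3

def expand(x):
    # Expander of Equation (9.36).
    return x / 2 if abs(x) <= 2 else 3 * x / 2 - np.sign(x) * 2

def c_prime(x):
    # Derivative of the compressor: 2 near the origin, 2/3 in the outer regions.
    return 2.0 if abs(x) <= 1 else 2.0 / 3.0

# Uniform decision boundaries in the compressed domain, mapped back through the expander.
uniform_b = np.linspace(0, xmax, M // 2 + 1)          # 0, 1, 2, 3, 4
b = np.array([expand(v) for v in uniform_b])          # 0, 0.5, 1.0, 2.5, 4.0
exact_widths = np.diff(b)

# Step sizes predicted by Equation (9.43), evaluated at the interval midpoints.
y = (b[:-1] + b[1:]) / 2
predicted = np.array([2 * xmax / (M * c_prime(yk)) for yk in y])

print("exact interval widths :", exact_widths)
print("Eq. (9.43) prediction :", predicted)
```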
Finally, substituting this expression for $\Delta_k$ into Equation (9.40), we get the following relationship between the quantizer distortion, the pdf of the input, and the compressor characteristic:

$$\sigma_q^2 = \frac{1}{12}\sum_{i=1}^{M} f_X(y_i)\left(\frac{2x_{max}}{M\,c'(y_i)}\right)^3 = \frac{x_{max}^2}{3M^2}\sum_{i=1}^{M}\frac{f_X(y_i)}{[c'(y_i)]^2}\cdot\frac{2x_{max}}{M\,c'(y_i)} = \frac{x_{max}^2}{3M^2}\sum_{i=1}^{M}\frac{f_X(y_i)}{[c'(y_i)]^2}\,\Delta_i \qquad (9.44)$$

which for small $\Delta_i$ can be written as

$$\sigma_q^2 = \frac{x_{max}^2}{3M^2}\int_{-x_{max}}^{x_{max}}\frac{f_X(x)}{[c'(x)]^2}\,dx \qquad (9.45)$$

This is a famous result, known as the Bennett integral after its discoverer, W.R. Bennett [119], and it has been widely used to analyze quantizers. We can see from this integral that the quantizer distortion is dependent on the pdf of the source sequence. However, it also tells us how to get rid of this dependence. Define

$$c'(x) = \frac{x_{max}}{\alpha x} \qquad (9.46)$$

where $\alpha$ is a constant. From the Bennett integral we get

$$\sigma_q^2 = \frac{x_{max}^2}{3M^2}\,\frac{\alpha^2}{x_{max}^2}\int_{-x_{max}}^{x_{max}} x^2 f_X(x)\,dx \qquad (9.47)$$

$$= \frac{\alpha^2}{3M^2}\,\sigma_x^2 \qquad (9.48)$$

where

$$\sigma_x^2 = \int_{-x_{max}}^{x_{max}} x^2 f_X(x)\,dx \qquad (9.49)$$

Substituting the expression for $\sigma_q^2$ into the expression for SNR, we get

$$\mathrm{SNR} = 10\log_{10}\frac{\sigma_x^2}{\sigma_q^2} \qquad (9.50)$$

$$= 10\log_{10}(3M^2) - 20\log_{10}\alpha \qquad (9.51)$$

which is independent of the input pdf. This means that if we use a compressor characteristic whose derivative satisfies Equation (9.46), then regardless of the input variance, the signal-to-noise ratio will remain constant. This is an impressive result. However, we do need some caveats. Notice that we are not saying that the mean squared quantization error is independent of the quantizer input. It is not, as is clear from Equation (9.48). Remember also that this
result is valid as long as the underlying assumptions are valid. When the input variance is very small, our assumption about the pdf being constant over the quantization interval is no longer valid, and when the variance of the input is very large, our assumption about the input being bounded by $x_{max}$ may no longer hold. With fair warning, let us look at the resulting compressor characteristic. We can obtain the compressor characteristic by integrating Equation (9.46):

$$c(x) = x_{max} + \beta\log\frac{x}{x_{max}} \qquad (9.52)$$

where $\beta$ is a constant. The only problem with this compressor characteristic is that it becomes very large for small x. Therefore, in practice we approximate this characteristic with a function that is linear around the origin and logarithmic away from it. Two companding characteristics that are widely used today are $\mu$-law companding and A-law companding. The $\mu$-law compressor function is given by

$$c(x) = x_{max}\,\frac{\ln\!\left(1 + \mu\dfrac{|x|}{x_{max}}\right)}{\ln(1+\mu)}\,\mathrm{sgn}(x) \qquad (9.53)$$

The expander function is given by

$$c^{-1}(x) = \frac{x_{max}}{\mu}\left[(1+\mu)^{|x|/x_{max}} - 1\right]\mathrm{sgn}(x) \qquad (9.54)$$

This companding characteristic with $\mu = 255$ is used in the telephone systems in North America and Japan. The rest of the world uses the A-law characteristic, which is given by

$$c(x) = \begin{cases} \dfrac{A|x|}{1+\ln A}\,\mathrm{sgn}(x) & 0 \le \dfrac{|x|}{x_{max}} \le \dfrac{1}{A} \\[2ex] x_{max}\,\dfrac{1+\ln\dfrac{A|x|}{x_{max}}}{1+\ln A}\,\mathrm{sgn}(x) & \dfrac{1}{A} \le \dfrac{|x|}{x_{max}} \le 1 \end{cases} \qquad (9.55)$$

and

$$c^{-1}(x) = \begin{cases} \dfrac{x(1+\ln A)}{A} & 0 \le \dfrac{|x|}{x_{max}} \le \dfrac{1}{1+\ln A} \\[2ex] \dfrac{x_{max}}{A}\exp\!\left(\dfrac{|x|}{x_{max}}(1+\ln A) - 1\right)\mathrm{sgn}(x) & \dfrac{1}{1+\ln A} \le \dfrac{|x|}{x_{max}} \le 1 \end{cases} \qquad (9.56)$$

9.7 Entropy-Coded Quantization

In Section 9.3 we mentioned three tasks: selection of boundary values, selection of reconstruction levels, and selection of codewords. Up to this point we have talked about accomplishment of the first two tasks, with the performance measure being the mean squared quantization error. In this section we will look at accomplishing the third task, assigning codewords to the quantization interval. Recall that this becomes an issue when we use variable-length codes. In this section we will be looking at the latter situation, with the rate being the performance measure. We can take two approaches to the variable-length coding of quantizer outputs. We can redesign the quantizer by taking into account the fact that the selection of the decision boundaries will affect the rate, or we can keep the design of the quantizer the same
  • 288. 9.7 Entropy-Coded Quantization 265 (i.e., Lloyd-Max quantization) and simply entropy-code the quantizer output. Since the latter approach is by far the simpler one, let’s look at it first. 9 . 7 . 1 E n t r o p y C o d i n g o f L l o y d - M a x Q u a n t i z e r O u t p u t s The process of trying to find the optimum quantizer for a given number of levels and rate is a rather difficult task. An easier approach to incorporating entropy coding is to design a quantizer that minimizes the msqe, that is, a Lloyd-Max quantizer, then entropy-code its output. In Table 9.7 we list the output entropies of uniform and nonuniform Lloyd-Max quan- tizers. Notice that while the difference in rate for lower levels is relatively small, for a larger number of levels, there can be a substantial difference between the fixed-rate and entropy-coded cases. For example, for 32 levels a fixed-rate quantizer would require 5 bits per sample. However, the entropy of a 32-level uniform quantizer for the Laplacian case is 3.779 bits per sample, which is more than 1 bit less. Notice that the difference between the fixed rate and the uniform quantizer entropy is generally greater than the difference between the fixed rate and the entropy of the output of the nonuniform quantizer. This is because the nonuniform quantizers have smaller step sizes in high-probability regions and larger step sizes in low-probability regions. This brings the probability of an input falling into a low-probability region and the probability of an input falling in a high-probability region closer together. This, in turn, raises the output entropy of the nonuniform quantizer with respect to the uniform quantizer. Finally, the closer the distribution is to being uniform, the less difference in the rates. Thus, the difference in rates is much less for the quantizer for the Gaussian source than the quantizer for the Laplacian source. 9 . 7 . 2 E n t r o p y - C o n s t r a i n e d Q u a n t i z a t i o n Although entropy coding the Lloyd-Max quantizer output is certainly simple, it is easy to see that we could probably do better if we take a fresh look at the problem of quantizer T A B L E 9 . 7 Output entropies in bits per sample for minimum mean squared error quantizers. Number of Gaussian Laplacian Levels Uniform Nonuniform Uniform Nonuniform 4 1.904 1.911 1.751 1.728 6 2.409 2.442 2.127 2.207 8 2.759 2.824 2.394 2.479 16 3.602 3.765 3.063 3.473 32 4.449 4.730 3.779 4.427
  • 289. 266 9 S C A L A R Q U A N T I Z A T I O N design, this time with the entropy as a measure of rate rather than the alphabet size. The entropy of the quantizer output is given by HQ = − M i=1 Pi log2 Pi (9.57) where Pi is the probability of the input to the quantizer falling in the ith quantization interval and is given by Pi = bi bi−1 fXxdx (9.58) Notice that the selection of the representation values yj has no effect on the rate. This means that we can select the representation values solely to minimize the distortion. However, the selection of the boundary values affects both the rate and the distortion. Initially, we found the reconstruction levels and decision boundaries that minimized the distortion, while keeping the rate fixed by fixing the quantizer alphabet size and assuming fixed-rate coding. In an analogous fashion, we can now keep the entropy fixed and try to minimize the distortion. Or, more formally: For a given Ro, find the decision boundaries bj that minimize 2 q given by Equation (9.3), subject to HQ ≤ Ro. The solution to this problem involves the solution of the following M − 1 nonlinear equations [120]: ln Pl+1 Pl = yk+1 −ykyk+1 +yk −2bk (9.59) where is adjusted to obtain the desired rate, and the reconstruction levels are obtained using Equation (9.27). A generalization of the method used to obtain the minimum mean squared error quantizers can be used to obtain solutions for this equation [121]. The process of finding optimum entropy-constrained quantizers looks complex. Fortunately, at higher rates we can show that the optimal quantizer is a uniform quantizer, simplifying the problem. Furthermore, while these results are derived for the high-rate case, it has been shown that the results also hold for lower rates [121]. 9 . 7 . 3 H i g h - R a t e O p t i m u m Q u a n t i z a t i o n At high rates, the design of optimum quantizers becomes simple, at least in theory. Gish and Pierce’s work [122] says that at high rates the optimum entropy-coded quantizer is a uniform quantizer. Recall that any nonuniform quantizer can be represented by a compander and a uniform quantizer. Let us try to find the optimum compressor function at high rates that minimizes the entropy for a given distortion. Using the calculus of variations approach, we will construct the functional J = HQ+ 2 q (9.60) then find the compressor characteristic to minimize it.
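Equations (9.57) and (9.58) translate directly into a short computation. The sketch below (a minimal Python illustration, not the book's software) computes the output entropy of an eight-level uniform quantizer applied to a unit-variance Laplacian source, using the step size 0.7309 that the book quotes for this quantizer; the result should come out close to the 2.394 bits per sample listed for this case in Table 9.7.

```python
import numpy as np
from scipy.stats import laplace

def quantizer_entropy(boundaries, dist):
    # Equation (9.58): P_i is the probability mass of each quantization interval;
    # Equation (9.57): H(Q) = -sum P_i log2 P_i.
    p = np.diff(dist.cdf(np.asarray(boundaries)))
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Eight-level uniform midrise quantizer with step size `delta` applied to a
# unit-variance Laplacian (scale = 1/sqrt(2)).  The outermost intervals are unbounded.
delta = 0.7309
M = 8
inner = delta * np.arange(-(M // 2 - 1), M // 2)      # -3*delta ... 3*delta
boundaries = np.concatenate(([-np.inf], inner, [np.inf]))

src = laplace(scale=1 / np.sqrt(2))
print(f"H(Q) = {quantizer_entropy(boundaries, src):.3f} bits/sample "
      f"(fixed-rate cost would be {np.log2(M):.0f} bits)")
```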
  • 290. 9.7 Entropy-Coded Quantization 267 For the distortion 2 q , we will use the Bennett integral shown in Equation (9.45). The quantizer entropy is given by Equation (9.57). For high rates, we can assume (as we did before) that the pdf fXx is constant over each quantization interval i, and we can replace Equation (9.58) by Pi = fXyi i (9.61) Substituting this into Equation (9.57), we get HQ = − fXyi i log fXyi i (9.62) = − fXyilog fXyi i − fXyilog i i (9.63) = − fXyilog fXyi i − fXyilog 2xmax/M cyi i (9.64) where we have used Equation (9.43) for i. For small i we can write this as HQ = − fXxlogfXxdx − fXxlog 2xmax/M cx dx (9.65) = − fXxlogfXxdx −log 2xmax M + fXxlogc xdx (9.66) where the first term is the differential entropy of the source hX. Let’s define g = c x. Then substituting the value of HQ into Equation (9.60) and differentiating with respect to g, we get fXx g−1 −2 x2 max 3M2 g−3 dx = 0 (9.67) This equation is satisfied if the integrand is zero, which gives us g = 2 3 xmax M = Kconstant (9.68) Therefore, c x = K (9.69) and cx = Kx + (9.70) If we now use the boundary conditions c0 = 0 and cxmax = xmax, we get cx = x, which is the compressor characteristic for a uniform quantizer. Thus, at high rates the optimum quantizer is a uniform quantizer. Substituting this expression for the optimum compressor function in the Bennett integral, we get an expression for the distortion for the optimum quantizer: 2 q = x2 max 3M2 (9.71)
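The distortion formula in Equation (9.71) can be verified by simulation for any bounded source, since at high rates it does not depend on the pdf. The sketch below (our own Monte Carlo illustration; the Beta-shaped source is an arbitrary choice) applies a 128-level uniform quantizer to a decidedly nonuniform source on [−1, 1] and compares the measured mean squared error with x_max²/(3M²).

```python
import numpy as np

rng = np.random.default_rng(0)
xmax, M = 1.0, 128
delta = 2 * xmax / M

# A bounded, decidedly non-uniform source on [-1, 1]: a symmetric Beta distribution.
x = 2 * rng.beta(4, 4, size=1_000_000) - 1

# Midrise uniform quantizer on [-xmax, xmax].
idx = np.clip(np.floor(x / delta), -M // 2, M // 2 - 1)
xq = (idx + 0.5) * delta

measured = np.mean((x - xq) ** 2)
predicted = xmax ** 2 / (3 * M ** 2)       # Equation (9.71), i.e. delta^2 / 12
print(f"measured MSE   = {measured:.3e}")
print(f"xmax^2/(3*M^2) = {predicted:.3e}")
```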
  • 291. 268 9 S C A L A R Q U A N T I Z A T I O N Substituting the expression for cx in Equation (9.66), we get the expression for the entropy of the optimum quantizer: HQ = hX−log 2xmax M (9.72) Note that while this result provides us with an easy method for designing optimum quantizers, our derivation is only valid if the source pdf is entirely contained in the interval −xmaxxmax , and if the step size is small enough that we can reasonably assume the pdf to be constant over a quantization interval. Generally, these conditions can only be satisfied if we have an extremely large number of quantization intervals. While theoretically this is not much of a problem, most of these reconstruction levels will be rarely used. In practice, as mentioned in Chapter 3, entropy coding a source with a large output alphabet is very problematic. One way we can get around this is through the use of a technique called recursive indexing. Recursive indexing is a mapping of a countable set to a collection of sequences of symbols from another set with finite size [76]. Given a countable set A = a0a1 and a finite set B = b0b1 bM of size M + 1, we can represent any element in A by a sequence of elements in B in the following manner: 1. Take the index i of element ai of A. 2. Find the quotient m and remainder r of the index i such that i = mM +r 3. Generate the sequence: bM bM ···bM m times br . B is called the representation set. We can see that given any element in A we will have a unique sequence from B representing it. Furthermore, no representative sequence is a prefix of any other sequence. Therefore, recursive indexing can be viewed as a trivial, uniquely decodable prefix code. The inverse mapping is given by bM bM ···bM m times br → amM+r Since it is one-to-one, if it is used at the output of the quantizer to convert the index sequence of the quantizer output into the sequence of the recursive indices, the former can be recovered without error from the latter. Furthermore, when the size M + 1 of the representation set B is chosen appropriately, in effect we can achieve the reduction in the size of the output alphabets that are used for entropy coding. E x a m p l e 9 . 7 . 1: Suppose we want to represent the set of nonnegative integers A = 012 with the rep- resentation set B = 012345. Then the value 12 would be represented by the sequence 5, 5, 2, and the value 16 would be represented by the sequence 5, 5, 5, 1. Whenever the
  • 292. 9.8 Summary 269 decoder sees the value 5, it simply adds on the next value until the next value is smaller than 5. For example, the sequence 3, 5, 1, 2, 5, 5, 1, 5, 0 would be decoded as 3, 6, 2, 11, 5. Recursive indexing is applicable to any representation of a large set by a small set. One way of applying recursive indexing to the problem of quantization is as follows: For a given step size 0 and a positive integer K, define xl and xh as follows: xl = − K −1 2 xh = xl +K −1 where x is the largest integer not exceeding x. We define a recursively indexed quantizer of size K to be a uniform quantizer with step size and with xl and xh being its smallest and largest output levels. (Q defined this way also has 0 as its output level.) The quantization rule Q, for a given input value x, is as follows: 1. If x falls in the interval xl + 2 xh − 2 , then Qx is the nearest output level. 2. If x is greater than xh − 2 , see if x1 = x − xh ∈ xl + 2 xh − 2 . If so, Qx = xhQx1 If not, form x2 = x−2xh and do the same as for x1 This process continues until for some m, xm = x−mxh falls in xl + 2 xh − 2 , which will be quantized into Qx = xhxh xh m times Qxm (9.73) 3. If x is smaller than xl + 2 , a similar procedure to the above is used; that is, form xm = x +mxl so that it falls in xl + 2 xh − 2 , and quantize it to xlxl xlQxm. In summary, the quantizer operates in two modes: one when the input falls in the range xlxh, the other when it falls outside of the specified range. The recursive nature in the second mode gives it the name. We pay for the advantage of encoding a larger set by a smaller set in several ways. If we get a large input to our quantizer, the representation sequence may end up being intolerably large. We also get an increase in the rate. If HQ is the entropy of the quantizer output, and is the average number of representation symbols per input symbol, then the minimum rate for the recursively indexed quantizer is HQ. In practice, neither cost is too large. We can avoid the problem of intolerably large sequences by adopting some simple strategies for representing these sequences, and the value of is quite close to one for reasonable values of M. For Laplacian and Gaussian quantizers, a typical value for M would be 15 [76]. 9 . 8 Summary The area of quantization is a well-researched area and much is known about the subject. In this chapter, we looked at the design and performance of uniform and nonuniform quantizers for a variety of sources, and how the performance is affected when the assumptions used
  • 293. 270 9 S C A L A R Q U A N T I Z A T I O N in the design process are not correct. When the source statistics are not well known or change with time, we can use an adaptive strategy. One of the more popular approaches to adaptive quantization is the Jayant quantizer. We also looked at the issues involved with entropy-coded quantization. F u r t h e r R e a d i n g With an area as broad as quantization, we had to keep some of the coverage rather cursory. However, there is a wealth of information on quantization available in the published literature. The following sources are especially useful for a general understanding of the area: 1. A very thorough coverage of quantization can be found in Digital Coding of Wave- forms, by N.S. Jayant and P. Noll [123]. 2. The paper “Quantization,” by A. Gersho, in IEEE Communication Magazine, Septem- ber 1977 [113], provides an excellent tutorial coverage of many of the topics listed here. 3. The original paper by J. Max, “Quantization for Minimum Distortion,” IRE Transac- tions on Information Theory [108], contains a very accessible description of the design of pdf-optimized quantizers. 4. A thorough study of the effects of mismatch is provided by W. Mauersberger in [124]. 9 . 9 Projects and Problems 1. Show that the derivative of the distortion expression in Equation (9.18) results in the expression in Equation (9.19). You will have to use a result called Leibnitz’s rule and the idea of a telescoping series. Leibnitz’s rule states that if at and bt are monotonic, then t bt at fxtdx = bt at fxt t dx +fbtt bt t −fatt at t (9.74) 2. Use the program falspos to solve Equation (9.19) numerically for the Gaussian and Laplacian distributions. You may have to modify the function func in order to do this. 3. Design a 3-bit uniform quantizer (specify the decision boundaries and representa- tion levels) for a source with a Laplacian pdf, with a mean of 3 and a variance of 4. 4. The pixel values in the Sena image are not really distributed uniformly. Obtain a histogram of the image (you can use the hist_image routine), and using the fact that the quantized image should be as good an approximation as possible for the original, design 1-, 2-, and 3-bit quantizers for this image. Compare these with the results displayed in Figure 9.7. (For better comparison, you can reproduce the results in the book using the program uquan_img.)
  • 294. 9.9 Projects and Problems 271 5. Use the program misuquan to study the effect of mismatch between the input and assumed variances. How do these effects change with the quantizer alphabet size and the distribution type? 6. For the companding quantizer of Example 9.6.1, what are the outputs for the following inputs: −0812050632−03? Compare your results with the case when the input is directly quantized with a uniform quantizer with the same number of levels. Comment on your results. 7. Use the test images Sena and Bookshelf1 to study the trade-offs involved in the selection of block sizes in the forward adaptive quantization scheme described in Example 9.5.2. Compare this with a more traditional forward adaptive scheme in which the variance is estimated and transmitted. The variance information should be transmitted using a uniform quantizer with differing number of bits. 8. Generalize the Jayant quantizer to the nonuniform case. Assume that the input is from a known distribution with unknown variance. Simulate the performance of this quantizer over the same range of ratio of variances as we have done for the uniform case. Com- pare your results to the fixed nonuniform quantizer and the adaptive uniform quantizer. To get a start on your program, you may wish to use misnuq.c and juquan.c. 9. Let’s look at the rate distortion performance of the various quanitzers. (a) Plot the rate-distortion function RD for a Gaussian source with mean zero and variance 2 X = 2. (b) Assuming fixed length codewords, compute the rate and distortion for 1, 2, and 3 bit pdf-optimized nonuniform quantizers. Also, assume that X is a Gaussian random variable with mean zero and 2 X = 2. Plot these values on the same graph with x’s. (c) For the 2 and 3 bit quantizers, compute the rate and distortion assuming that the quantizer outputs are entropy coded. Plot these on the graph with o’s.
  • 296. 10 Vector Quantization 10 . 1 Overview B y grouping source outputs together and encoding them as a single block, we can obtain efficient lossy as well as lossless compression algorithms. Many of the lossless compression algorithms that we looked at took advantage of this fact. We can do the same with quantization. In this chapter, several quantization techniques that operate on blocks of data are described. We can view these blocks as vectors, hence the name “vector quantization.” We will describe several different approaches to vector quantization. We will explore how to design vector quantizers and how these quantizers can be used for compression. 10 . 2 Introduction In the last chapter, we looked at different ways of quantizing the output of a source. In all cases the quantizer inputs were scalar values, and each quantizer codeword represented a single sample of the source output. In Chapter 2 we saw that, by taking longer and longer sequences of input samples, it is possible to extract the structure in the source coder output. In Chapter 4 we saw that, even when the input is random, encoding sequences of samples instead of encoding individual samples separately provides a more efficient code. Encoding sequences of samples is more advantageous in the lossy compression framework as well. By “advantageous” we mean a lower distortion for a given rate, or a lower rate for a given distortion. As in the previous chapter, by “rate” we mean the average number of bits per input sample, and the measures of distortion will generally be the mean squared error and the signal-to-noise ratio. The idea that encoding sequences of outputs can provide an advantage over the encoding of individual samples was first put forward by Shannon, and the basic results in information
  • 297. 274 10 V E C T O R Q U A N T I Z A T I O N theory were all proved by taking longer and longer sequences of inputs. This indicates that a quantization strategy that works with sequences or blocks of output would provide some improvement in performance over scalar quantization. In other words, we wish to generate a representative set of sequences. Given a source output sequence, we would represent it with one of the elements of the representative set. In vector quantization we group the source output into blocks or vectors. For example, we can treat L consecutive samples of speech as the components of an L-dimensional vector. Or, we can take a block of L pixels from an image and treat each pixel value as a component of a vector of size or dimension L. This vector of source outputs forms the input to the vector quantizer. At both the encoder and decoder of the vector quantizer, we have a set of L-dimensional vectors called the codebook of the vector quantizer. The vectors in this codebook, known as code-vectors, are selected to be representative of the vectors we generate from the source output. Each code-vector is assigned a binary index. At the encoder, the input vector is compared to each code-vector in order to find the code-vector closest to the input vector. The elements of this code-vector are the quantized values of the source output. In order to inform the decoder about which code-vector was found to be the closest to the input vector, we transmit or store the binary index of the code-vector. Because the decoder has exactly the same codebook, it can retrieve the code-vector given its binary index. A pictorial representation of this process is shown in Figure 10.1. Although the encoder may have to perform a considerable amount of computations in order to find the closest reproduction vector to the vector of source outputs, the decoding consists of a table lookup. This makes vector quantization a very attractive encoding scheme for applications in which the resources available for decoding are considerably less than the resources available for encoding. For example, in multimedia applications, considerable Source output Group into vectors Find closest code-vector Codebook Index Unblock Reconstruction Decoder Encoder Table lookup Codebook Index . . . F I G U R E 10. 1 The vector quantization procedure.
computational resources may be available for the encoding operation. However, if the decoding is to be done in software, the amount of computational resources available to the decoder may be quite limited. Even though vector quantization is a relatively new area, it has developed very rapidly, and now even some of the subspecialties are broad areas of research. In this chapter we will try to introduce you to as much of this fascinating area as we can. If your appetite is whetted by what is available here and you wish to explore further, there is an excellent book by Gersho and Gray [5] devoted to the subject of vector quantization. Our approach in this chapter is as follows: First, we try to answer the question of why we would want to use vector quantization over scalar quantization. There are several answers to this question, each illustrated through examples. In our discussion, we assume that you are familiar with the material in Chapter 9. We will then turn to one of the most important elements in the design of a vector quantizer, the generation of the codebook. While there are a number of ways of obtaining the vector quantizer codebook, most of them are based on one particular approach, popularly known as the Linde-Buzo-Gray (LBG) algorithm. We devote a considerable amount of time in describing some of the details of this algorithm. Our intent here is to provide you with enough information so that you can write your own programs for design of vector quantizer codebooks. In the software accompanying this book, we have also included programs for designing codebooks that are based on the descriptions in this chapter. If you are not currently thinking of implementing vector quantization routines, you may wish to skip these sections (Sections 10.4.1 and 10.4.2). We follow our discussion of the LBG algorithm with some examples of image compression using codebooks designed with this algorithm, and then with a brief sampling of the many different kinds of vector quantizers. Finally, we describe another quantization strategy, called trellis-coded quantization (TCQ), which, though different in implementation from the vector quantizers, also makes use of the advantage to be gained from operating on sequences. Before we begin our discussion of vector quantization, let us define some of the terminology we will be using. The amount of compression will be described in terms of the rate, which will be measured in bits per sample. Suppose we have a codebook of size K, and the input vector is of dimension L. In order to inform the decoder of which code-vector was selected, we need to use $\lceil \log_2 K \rceil$ bits. For example, if the codebook contained 256 code-vectors, we would need 8 bits to specify which of the 256 code-vectors had been selected at the encoder. Thus, the number of bits per vector is $\lceil \log_2 K \rceil$ bits. As each code-vector contains the reconstruction values for L source output samples, the number of bits per sample would be $\frac{\lceil \log_2 K \rceil}{L}$. Thus, the rate for an L-dimensional vector quantizer with a codebook of size K is $\frac{\lceil \log_2 K \rceil}{L}$. As our measure of distortion we will use the mean squared error. When we say that in a codebook $\mathcal{C}$, containing the K code-vectors $\{Y_i\}$, the input vector X is closest to $Y_j$, we will mean that

$$\|X - Y_j\|^2 \le \|X - Y_i\|^2 \quad \text{for all } Y_i \in \mathcal{C} \qquad (10.1)$$

where $X = (x_1\, x_2\, \cdots\, x_L)$ and

$$\|X\|^2 = \sum_{i=1}^{L} x_i^2 \qquad (10.2)$$
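The encoding and decoding operations described above are straightforward to sketch in code. The fragment below is illustrative only; the code-vectors are hypothetical height-weight pairs and the function names are ours. It performs the full-search nearest-neighbor encoding of Equation (10.1), the table-lookup decoding, and reports the rate of ⌈log₂ K⌉/L bits per sample.

```python
import numpy as np

def vq_encode(vectors, codebook):
    # For each input vector, find the index of the closest code-vector
    # in the squared-error sense (Equation (10.1)).
    d = np.sum((vectors[:, None, :] - codebook[None, :, :]) ** 2, axis=2)
    return np.argmin(d, axis=1)

def vq_decode(indices, codebook):
    # Decoding is just a table lookup.
    return codebook[indices]

# Toy example: K = 4 hypothetical code-vectors of dimension L = 2.
codebook = np.array([[44, 41], [57, 90], [62, 116], [69, 168]], dtype=float)
data = np.array([[60, 110], [72, 180], [56, 91]], dtype=float)

idx = vq_encode(data, codebook)
rec = vq_decode(idx, codebook)
K, L = codebook.shape
print("indices:", idx, " rate =", np.ceil(np.log2(K)) / L, "bits/sample")
print("reconstruction:\n", rec)
```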
  • 299. 276 10 V E C T O R Q U A N T I Z A T I O N The term sample will always refer to a scalar value. Thus, when we are discussing compression of images, a sample refers to a single pixel. Finally, the output points of the quantizer are often referred to as levels. Thus, when we wish to refer to a quantizer with K output points or code-vectors, we may refer to it as a K-level quantizer. 10 . 3 Advantages of Vector Quantization over Scalar Quantization For a given rate (in bits per sample), use of vector quantization results in a lower distortion than when scalar quantization is used at the same rate, for several reasons. In this section we will explore these reasons with examples (for a more theoretical explanation, see [3, 4, 17]). If the source output is correlated, vectors of source output values will tend to fall in clusters. By selecting the quantizer output points to lie in these clusters, we have a more accurate representation of the source output. Consider the following example. E x a m p l e 10 . 3 . 1: In Example 8.5.1, we introduced a source that generates the height and weight of individuals. Suppose the height of these individuals varied uniformly between 40 and 80 inches, and the weight varied uniformly between 40 and 240 pounds. Suppose we were allowed a total of 6 bits to represent each pair of values. We could use 3 bits to quantize the height and 3 bits to quantize the weight. Thus, the weight range between 40 and 240 pounds would be divided into eight intervals of equal width of 25 and with reconstruction values 5277 227. Similarly, the height range between 40 and 80 inches can be divided into eight intervals of width five, with reconstruction levels 4247 77. When we look at the representation of height and weight separately, this approach seems reasonable. But let’s look at this quantization scheme in two dimensions. We will plot the height values along the x-axis and the weight values along the y-axis. Note that we are not changing anything in the quantization process. The height values are still being quantized to the same eight different values, as are the weight values. The two-dimensional representation of these two quantizers is shown in Figure 10.2. From the figure we can see that we effectively have a quantizer output for a person who is 80 inches (6 feet 8 inches) tall and weighs 40 pounds, as well as a quantizer output for an individual whose height is 42 inches but weighs more than 200 pounds. Obviously, these outputs will never be used, as is the case for many of the other outputs. A more sensible approach would be to use a quantizer like the one shown in Figure 10.3, where we take account of the fact that the height and weight are correlated. This quantizer has exactly the same number of output points as the quantizer in Figure 10.2; however, the output points are clustered in the area occupied by the input. Using this quantizer, we can no longer quantize the height and weight separately. We have to consider them as the coordinates of a point in two dimensions in order to find the closest quantizer output point. However, this method provides a much finer quantization of the input.
[Figure 10.2 The height/weight scalar quantizers when viewed in two dimensions: the eight-level height quantizer (40 to 80 inches) along one axis and the eight-level weight quantizer (40 to 240 pounds) along the other, with the quantizer outputs forming a rectangular grid.]

Note that we have not said how we would obtain the locations of the quantizer outputs shown in Figure 10.3. These output points make up the codebook of the vector quantizer, and we will be looking at codebook design in some detail later in this chapter. We can see from this example that, as in lossless compression, looking at longer sequences of inputs brings out the structure in the source output. This structure can then be used to provide more efficient representations. We can easily see how structure in the form of correlation between source outputs can make it more efficient to look at sequences of source outputs rather than looking at each sample separately. However, the vector quantizer is also more efficient than the scalar quantizer when the source output values are not correlated. The reason for this is actually
  • 301. 278 10 V E C T O R Q U A N T I Z A T I O N 165 40 190 215 Weight (lb) 140 115 90 65 40 50 60 70 Height (in) F I G U R E 10. 3 The height-weight vector quantizer. quite simple. As we look at longer and longer sequences of source outputs, we are afforded more flexibility in terms of our design. This flexibility in turn allows us to match the design of the quantizer to the source characteristics. Consider the following example. E x a m p l e 10 . 3 . 2: Suppose we have to design a uniform quantizer with eight output values for a Laplacian input. Using the information from Table 9.3 in Chapter 9, we would obtain the quantizer shown in Figure 10.4, where is equal to 0.7309. As the input has a Laplacian distribution, the probability of the source output falling in the different quantization intervals is not the same. For example, the probability that the input will fall in the interval 0 is 0.3242, while the probability that a source output will fall in the interval 3 is 0.0225. Let’s look at how this quantizer will quantize two consecutive source outputs. As we did in the previous example, let’s plot the first sample along the x-axis and the second sample along the y-axis. We can represent this two-dimensional view of the quantization process as shown in Figure 10.5. Note that, as in the previous example, we have not changed the quantization process; we are simply representing it differently. The first quantizer input, which we have represented in the figure as x1, is quantized to the same eight possible output values as before. The same is true for the second quantizer input, which we have represented in the
  • 302. 10.3 Advantages of Vector Quantization over Scalar Quantization 279 −3Δ −2Δ −Δ −7Δ/2 −5Δ/2 −3Δ/2 −Δ/2 7Δ/2 5Δ/2 3Δ/2 Δ/2 Output Input Δ 2Δ 3Δ −3Δ −2Δ −Δ Quantizer output Δ 0 2Δ 3Δ F I G U R E 10. 4 Two representations of an eight-level scalar quantizer. figure as x2. This two-dimensional representation allows us to examine the quantization process in a slightly different manner. Each filled-in circle in the figure represents a sequence of two quantizer outputs. For example, the top rightmost circle represents the two quantizer outputs that would be obtained if we had two consecutive source outputs with a value greater than 3 . We computed the probability of a single source output greater than 3 to be 0.0225. The probability of two consecutive source outputs greater than 2.193 is simply 00225×00225 = 00005, which is quite small. Given that we do not use this output point very often, we could simply place it somewhere else where it would be of more use. Let us move this output point to the origin, as shown in Figure 10.6. We have now modified the quantization process. Now if we get two consecutive source outputs with values greater than 3 , the quantizer output corresponding to the second source output may not be the same as the first source output.
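The effect of this single modification can be estimated by simulation. The sketch below (our own Monte Carlo illustration, not the book's software) draws unit-variance Laplacian pairs, quantizes them first with the 64-point product grid of the scalar quantizer (step size 0.7309) and then with the modified grid in which the corner point (7Δ/2, 7Δ/2) is moved to the origin, and prints the two SNRs; with enough samples they should come out near the 11.44 dB and 11.73 dB figures quoted next.

```python
import numpy as np

rng = np.random.default_rng(1)
delta = 0.7309                                  # 8-level uniform quantizer for a Laplacian
levels = (np.arange(8) - 3.5) * delta           # scalar outputs -3.5*delta ... 3.5*delta

# Unit-variance Laplacian samples, grouped into two-dimensional vectors.
n = 200_000
x = rng.laplace(scale=1 / np.sqrt(2), size=(n, 2))

def nearest(points, outputs):
    # Full search: assign each 2-D input to the closest output point.
    d = np.stack([np.sum((points - o) ** 2, axis=1) for o in outputs], axis=1)
    return outputs[np.argmin(d, axis=1)]

# Original quantizer: the 64 points of the scalar product grid.
grid = np.array([(a, b) for a in levels for b in levels])

# Modified quantizer: move the corner point (3.5*delta, 3.5*delta) to the origin.
modified = grid.copy()
corner = np.argmax(modified.sum(axis=1))
modified[corner] = (0.0, 0.0)

for name, outputs in [("original", grid), ("modified", modified)]:
    xq = nearest(x, outputs)
    snr = 10 * np.log10(np.mean(x ** 2) / np.mean((x - xq) ** 2))
    print(f"{name:8s}: SNR = {snr:5.2f} dB")
```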
  • 303. 280 10 V E C T O R Q U A N T I Z A T I O N −3Δ −2Δ −Δ −3Δ −2Δ −Δ 3Δ 2Δ Δ x2 x1 Δ 2Δ 3Δ Quantizer output F I G U R E 10. 5 Input-output map for consecutive quantization of two inputs using an eight-level scalar quantizer. If we compare the rate distortion performance of the two vector quantizers, the SNR for the first vector quantizer is 11.44 dB, which agrees with the result in Chapter 9 for the uniform quantizer with a Laplacian input. The SNR for the modified vector quantizer, however, is 11.73 dB, an increase of about 0.3 dB. Recall that the SNR is a measure of the average squared value of the source output samples and the mean squared error. As the average squared value of the source output is the same in both cases, an increase in SNR means a decrease in the mean squared error. Whether this increase in SNR is significant will depend on the particular application. What is important here is that by treating the source output in groups of two we could effect a positive change with only a minor modification. We could argue that this modification is really not that minor since the uniform characteristic of the original quantizer has been destroyed. However, if we begin with a nonuniform quantizer and modify it in a similar way, we get similar results. Could we do something similar with the scalar quantizer? If we move the output point at 7 2 to the origin, the SNR drops from 11.44 dB to 10.8 dB. What is it that permits us to make
modifications in the vector case, but not in the scalar case? This advantage is caused by the added flexibility we get by viewing the quantization process in higher dimensions. Consider the effect of moving the output point from $\frac{7\Delta}{2}$ to the origin in terms of two consecutive inputs. This one change in one dimension corresponds to moving 15 output points in two dimensions. Thus, modifications at the scalar quantizer level are gross modifications when viewed from the point of view of the vector quantizer. Remember that in this example we have only looked at two-dimensional vector quantizers. As we block the input into larger and larger blocks or vectors, these higher dimensions provide even greater flexibility and the promise of further gains to be made.

[Figure 10.6 Modified two-dimensional vector quantizer.]

In Figure 10.6, notice how the quantization regions have changed for the outputs around the origin, as well as for the two neighbors of the output point that was moved. The decision boundaries between the reconstruction levels can no longer be described as easily as in the case for the scalar quantizer. However, if we know the distortion measure, simply knowing the output points gives us sufficient information to implement the quantization
  • 305. 282 10 V E C T O R Q U A N T I Z A T I O N process. Instead of defining the quantization rule in terms of the decision boundary, we can define the quantization rule as follows: QX = Yj iff dXYj dXYi ∀i = j (10.3) For the case where the input X is equidistant from two output points, we can use a simple tie-breaking rule such as “use the output point with the smaller index.” The quantization regions Vj can then be defined as Vj = X dXYj dXYi ∀i = j (10.4) Thus, the quantizer is completely defined by the output points and a distortion measure. From a multidimensional point of view, using a scalar quantizer for each input restricts the output points to a rectangular grid. Observing several source output values at once allows us to move the output points around. Another way of looking at this is that in one dimension the quantization intervals are restricted to be intervals, and the only parameter that we can manipulate is the size of these intervals. When we divide the input into vectors of some length n, the quantization regions are no longer restricted to be rectangles or squares. We have the freedom to divide the range of the inputs in an infinite number of ways. These examples have shown two ways in which the vector quantizer can be used to improve performance. In the first case, we exploited the sample-to-sample dependence of the input. In the second case, there was no sample-to-sample dependence; the samples were independent. However, looking at two samples together still improved performance. These two examples can be used to motivate two somewhat different approaches toward vector quantization. One approach is a pattern-matching approach, similar to the process used in Example 10.3.1, while the other approach deals with the quantization of random inputs. We will look at both of these approaches in this chapter. 10 . 4 The Linde-Buzo-Gray Algorithm In Example 10.3.1 we saw that one way of exploiting the structure in the source output is to place the quantizer output points where the source output (blocked into vectors) are most likely to congregate. The set of quantizer output points is called the codebook of the quantizer, and the process of placing these output points is often referred to as codebook design. When we group the source output in two-dimensional vectors, as in the case of Example 10.3.1, we might be able to obtain a good codebook design by plotting a representative set of source output points and then visually locate where the quantizer output points should be. However, this approach to codebook design breaks down when we design higher-dimensional vector quantizers. Consider designing the codebook for a 16-dimensional quantizer. Obviously, a visual placement approach will not work in this case. We need an automatic procedure for locating where the source outputs are clustered. This is a familiar problem in the field of pattern recognition. It is no surprise, there- fore, that the most popular approach to designing vector quantizers is a clustering pro- cedure known as the k-means algorithm, which was developed for pattern recognition applications.
  • 306. 10.4 The Linde-Buzo-Gray Algorithm 283 The k-means algorithm functions as follows: Given a large set of output vectors from the source, known as the training set, and an initial set of k representative patterns, assign each element of the training set to the closest representative pattern. After an element is assigned, the representative pattern is updated by computing the centroid of the training set vectors assigned to it. When the assignment process is complete, we will have k groups of vectors clustered around each of the output points. Stuart Lloyd [115] used this approach to generate the pdf-optimized scalar quantizer, except that instead of using a training set, he assumed that the distribution was known. The Lloyd algorithm functions as follows: 1. Start with an initial set of reconstruction values y 0 i M i=1 . Set k = 0D0 = 0. Select threshold . 2. Find decision boundaries b k j = y k j+1 +y k j 2 j = 12 M −1 3. Compute the distortion Dk = M i=1 b k i b k i−1 x −yi2 fXxdx 4. If Dk −Dk−1 , stop; otherwise, continue. 5. k = k+1. Compute new reconstruction values y k j = b k−1 j b k−1 j−1 xfXxdx b k−1 j b k−1 j−1 fXxdx Go to Step 2. Linde, Buzo, and Gray generalized this algorithm to the case where the inputs are no longer scalars [125]. For the case where the distribution is known, the algorithm looks very much like the Lloyd algorithm described above. 1. Start with an initial set of reconstruction values Y 0 i M i=1 . Set k = 0D0 = 0. Select threshold . 2. Find quantization regions V k i = X dXYi dXYj ∀j = i j = 12 M 3. Compute the distortion Dk = M i=1 V k i X −Y k i 2 fXXdX
  • 307. 284 10 V E C T O R Q U A N T I Z A T I O N 4. If Dk−Dk−1 Dk , stop; otherwise, continue. 5. k = k+1. Find new reconstruction values Y k i M i=1 that are the centroids of V k−1 i . Go to Step 2. This algorithm is not very practical because the integrals required to compute the distor- tions and centroids are over odd-shaped regions in n dimensions, where n is the dimension of the input vectors. Generally, these integrals are extremely difficult to compute, making this particular algorithm more of an academic interest. Of more practical interest is the algorithm for the case where we have a training set available. In this case, the algorithm looks very much like the k-means algorithm. 1. Start with an initial set of reconstruction values Y 0 i M i=1 and a set of training vectors XnN n=1. Set k = 0D0 = 0. Select threshold . 2. The quantization regions V k i M i=1 are given by V k i = Xn dXnYi dXnYj ∀j = i i = 12 M We assume that none of the quantization regions are empty. (Later we will deal with the case where V k i is empty for some i and k.) 3. Compute the average distortion Dk between the training vectors and the representative reconstruction value. 4. If Dk−Dk−1 Dk , stop; otherwise, continue. 5. k = k+1. Find new reconstruction values Y k i M i=1 that are the average value of the elements of each of the quantization regions V k−1 i . Go to Step 2. This algorithm forms the basis of most vector quantizer designs. It is popularly known as the Linde-Buzo-Gray or LBG algorithm, or the generalized Lloyd algorithm (GLA) [125]. Although the paper of Linde, Buzo, and Gray [125] is a starting point for most of the work on vector quantization, the latter algorithm had been used several years prior by Edward E. Hilbert at the NASA Jet Propulsion Laboratories in Pasadena, California. Hilbert’s starting point was the idea of clustering, and although he arrived at the same algorithm as described above, he called it the cluster compression algorithm [126]. In order to see how this algorithm functions, consider the following example of a two- dimensional vector quantizer codebook design. E x a m p l e 10 . 4 . 1: Suppose our training set consists of the height and weight values shown in Table 10.1. The initial set of output points is shown in Table 10.2. (For ease of presentation, we will always round the coordinates of the output points to the nearest integer.) The inputs, outputs, and quantization regions are shown in Figure 10.7.
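Before stepping through the iterations of this example by hand, it may help to see the training-set algorithm just listed written out. The sketch below is a minimal Python rendering under our own naming, not the book's accompanying software; it assumes squared-error distortion and that no quantization region becomes empty. Run on the data of Tables 10.1 and 10.2, it should arrive at essentially the final codebook of this example, with an average distortion close to the 60.17 obtained below (small differences arise because the book rounds the output points to integers).

```python
import numpy as np

def lbg(training, codebook, eps=1e-3):
    # Training-set LBG / generalized Lloyd iteration with squared-error distortion.
    # Assumes no quantization region ever becomes empty.
    prev_d = None
    while True:
        # Step 2: assign each training vector to its nearest code-vector.
        d = np.sum((training[:, None, :] - codebook[None, :, :]) ** 2, axis=2)
        nearest = np.argmin(d, axis=1)
        # Step 3: average distortion for this assignment.
        dist = np.mean(d[np.arange(len(training)), nearest])
        # Step 4: stop when the relative drop in distortion is below the threshold.
        if prev_d is not None and (prev_d - dist) / dist < eps:
            return codebook, dist
        prev_d = dist
        # Step 5: replace each code-vector by the centroid of its region.
        codebook = np.array([training[nearest == i].mean(axis=0)
                             for i in range(len(codebook))])

# Training set and initial output points of Example 10.4.1 (Tables 10.1 and 10.2).
training = np.array([[72, 180], [65, 120], [59, 119], [64, 150], [65, 162], [57, 88],
                     [72, 175], [44, 41], [62, 114], [60, 110], [56, 91], [70, 172]],
                    dtype=float)
initial = np.array([[45, 50], [75, 117], [45, 117], [80, 180]], dtype=float)

final, dist = lbg(training, initial)
print(np.round(final), "average distortion:", round(dist, 2))
```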
  • 308. 10.4 The Linde-Buzo-Gray Algorithm 285 T A B L E 10 . 1 Training set for designing vector quantizer codebook. Height Weight 72 180 65 120 59 119 64 150 65 162 57 88 72 175 44 41 62 114 60 110 56 91 70 172 T A B L E 10 . 2 Initial set of output points for codebook design. Height Weight 45 50 75 117 45 117 80 180 The input (44, 41) has been assigned to the first output point; the inputs (56, 91), (57, 88), (59, 119), and (60, 110) have been assigned to the second output point; the inputs (62, 114), and (65, 120) have been assigned to the third output; and the five remaining vectors from the training set have been assigned to the fourth output. The distortion for this assignment is 387.25. We now find the new output points. There is only one vector in the first quantization region, so the first output point is (44, 41). The average of the four vectors in the second quantization region (rounded up) is the vector (58, 102), which is the new second output point. In a similar manner, we can compute the third and fourth output points as (64, 117) and (69, 168). The new output points and the corresponding quantization regions are shown in Figure 10.8. From Figure 10.8, we can see that, while the training vectors that were initially part of the first and fourth quantization regions are still in the same quantization regions, the training vectors (59,115) and (60,120), which were in quantization region 2, are now in quantization region 3. The distortion corresponding to this assignment of training vectors to quantization regions is 89, considerably less than the original 387.25. Given the new assignments, we can obtain a new set of output points. The first and fourth output points do not change because the training vectors in the corresponding regions have not changed. However, the training vectors in regions 2 and 3 have changed. Recomputing the output points for these regions, we get (57, 90) and (62, 116). The final form of the
[Figure 10.7 Initial state of the vector quantizer: training vectors, output points, and the four quantization regions in the height (in) versus weight (lb) plane.]

[Figure 10.8 The vector quantizer after one iteration.]
  • 310. 10.4 The Linde-Buzo-Gray Algorithm 287 40 60 80 100 120 140 160 180 Weight (lb) Height (in) 70 60 50 40 4 3 2 1 x x x x x x x x x x x x F I G U R E 10. 9 Final state of the vector quantizer. quantizer is shown in Figure 10.9. The distortion corresponding to the final assignments is 60.17. The LBG algorithm is conceptually simple, and as we shall see later, the resulting vector quantizer is remarkably effective in the compression of a wide variety of inputs, both by itself and in conjunction with other schemes. In the next two sections we will look at some of the details of the codebook design process. While these details are important to consider when designing codebooks, they are not necessary for the understanding of the quantization process. If you are not currently interested in these details, you may wish to proceed directly to Section 10.4.3. 10 . 4 . 1 I n i t i a l i z i n g t h e L B G A l g o r i t h m The LBG algorithm guarantees that the distortion from one iteration to the next will not increase. However, there is no guarantee that the procedure will converge to the optimal solution. The solution to which the algorithm converges is heavily dependent on the initial conditions. For example, if our initial set of output points in Example 10.4 had been those
  • 311. 288 10 V E C T O R Q U A N T I Z A T I O N T A B L E 10 . 3 An alternate initial set of output points. Height Weight 75 50 75 117 75 127 80 180 T A B L E 10 . 4 Final codebook obtained using the alternative initial codebook. Height Weight 44 41 60 107 64 150 70 172 shown in Table 10.3 instead of the set in Table 10.2, by using the LBG algorithm we would get the final codebook shown in Table 10.4. The resulting quantization regions and their membership are shown in Figure 10.10. This is a very different quantizer than the one we had previously obtained. Given this heavy dependence on initial conditions, the selection of the initial codebook is a matter of some importance. We will look at some of the better-known methods of initialization in the following section. Linde, Buzo, and Gray described a technique in their original paper [125] called the splitting technique for initializing the design algorithm. In this technique, we begin by designing a vector quantizer with a single output point; in other words, a codebook of size one, or a one-level vector quantizer. With a one-element codebook, the quantization region is the entire input space, and the output point is the average value of the entire training set. From this output point, the initial codebook for a two-level vector quantizer can be obtained by including the output point for the one-level quantizer and a second output point obtained by adding a fixed perturbation vector . We then use the LBG algorithm to obtain the two-level vector quantizer. Once the algorithm has converged, the two codebook vectors are used to obtain the initial codebook of a four-level vector quantizer. This initial four-level codebook consists of the two codebook vectors from the final codebook of the two-level vector quantizer and another two vectors obtained by adding to the two codebook vectors. The LBG algorithm can then be used until this four-level quantizer converges. In this manner we keep doubling the number of levels until we reach the desired number of levels. By including the final codebook of the previous stage at each “splitting,” we guarantee that the codebook after splitting will be at least as good as the codebook prior to splitting.
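A sketch of the splitting initialization is given below (our own rendering, with an LBG routine repeated so the fragment is self-contained; squared-error distortion and non-empty regions are assumed). Starting from the one-level codebook and a fixed perturbation vector of (10, 10), it should essentially retrace the progression of Table 10.5 in the example that follows, up to the rounding used in the book.

```python
import numpy as np

def lbg(training, codebook, eps=1e-3):
    # Generalized Lloyd iteration with squared-error distortion (see Section 10.4).
    prev = None
    while True:
        d = np.sum((training[:, None, :] - codebook[None, :, :]) ** 2, axis=2)
        nearest = np.argmin(d, axis=1)
        dist = np.mean(np.min(d, axis=1))
        if prev is not None and (prev - dist) / dist < eps:
            return codebook
        prev = dist
        codebook = np.array([training[nearest == i].mean(axis=0)
                             for i in range(len(codebook))])

def splitting_design(training, levels, perturbation):
    # Splitting initialization: start from the one-level codebook (the mean of the
    # training set) and double the codebook size until `levels` is reached,
    # running LBG after every split.
    codebook = training.mean(axis=0, keepdims=True)
    while len(codebook) < levels:
        codebook = np.vstack([codebook, codebook + perturbation])
        codebook = lbg(training, codebook)
    return codebook

training = np.array([[72, 180], [65, 120], [59, 119], [64, 150], [65, 162], [57, 88],
                     [72, 175], [44, 41], [62, 114], [60, 110], [56, 91], [70, 172]],
                    dtype=float)
print(np.round(splitting_design(training, levels=4, perturbation=np.array([10.0, 10.0]))))
```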
  • 312. 10.4 The Linde-Buzo-Gray Algorithm 289 40 60 80 100 120 140 160 180 Weight (lb) Height (in) 70 60 50 40 4 3 2 1 x x x x x x x x x x x x F I G U R E 10. 10 Final state of the vector quantizer. E x a m p l e 10 . 4 . 2: Let’s revisit Example 10.4.1. This time, instead of using the initial codewords used in Example 10.4.1, we will use the splitting technique. For the perturbations, we will use a fixed vector = 1010. The perturbation vector is usually selected randomly; however, for purposes of explanation it is more useful to use a fixed perturbation vector. We begin with a single-level codebook. The codeword is simply the average value of the training set. The progression of codebooks is shown in Table 10.5. The perturbed vectors are used to initialize the LBG design of a two-level vector quan- tizer. The resulting two-level vector quantizer is shown in Figure 10.11. The resulting distortion is 468.58. These two vectors are perturbed to get the initial output points for the four-level design. Using the LBG algorithm, the final quantizer obtained is shown in Figure 10.12. The distortion is 156.17. The average distortion for the training set for this quantizer using the splitting algorithm is higher than the average distortion obtained pre- viously. However, because the sample size used in this example is rather small, this is no indication of relative merit.
Table 10.5 Progression of codebooks using splitting.

    Codebook              Height    Weight
    One-level               62        127
    Initial two-level       62        127
                            72        137
    Final two-level         58         98
                            69        168
    Initial four-level      58         98
                            68        108
                            69        168
                            79        178
    Final four-level        52         73
                            62        116
                            65        156
                            71        176

Figure 10.11 Two-level vector quantizer designed using the splitting approach.
  • 314. 10.4 The Linde-Buzo-Gray Algorithm 291 40 60 80 100 120 140 160 180 Weight (lb) Height (in) 70 60 50 40 1 2 x x x x x x x x x x x x F I G U R E 10. 12 Final design using the splitting approach. If the desired number of levels is not a power of two, then in the last step, instead of generating two initial points from each of the output points of the vector quantizer designed previously, we can perturb as many vectors as necessary to obtain the desired number of vectors. For example, if we needed an eleven-level vector quantizer, we would generate a one-level vector quantizer first, then a two-level, then a four-level, and then an eight-level vector quantizer. At this stage, we would perturb only three of the eight vectors to get the eleven initial output points of the eleven-level vector quantizer. The three points should be those with the largest number of training set vectors, or the largest distortion. The approach used by Hilbert [126] to obtain the initial output points of the vector quantizer was to pick the output points randomly from the training set. This approach guarantees that, in the initial stages, there will always be at least one vector from the training set in each quantization region. However, we can still get different codebooks if we use different subsets of the training set as our initial codebook. E x a m p l e 10 . 4 . 3: Using the training set of Example 10.4.1, we selected different vectors of the training set as the initial codebook. The results are summarized in Table 10.6. If we pick the codebook labeled “Initial Codebook 1,” we obtain the codebook labeled “Final Codebook 1.” This
codebook is identical to the one obtained using the split algorithm. The set labeled "Initial Codebook 2" results in the codebook labeled "Final Codebook 2." This codebook is identical to the quantizer we obtained in Example 10.4.1. In fact, most of the other selections result in one of these two quantizers.

Table 10.6 Effect of using different subsets of the training sequence as the initial codebook.

    Codebook              Height    Weight
    Initial Codebook 1      72        180
                            72        175
                            65        120
                            59        119
    Final Codebook 1        71        176
                            65        156
                            62        116
                            52         73
    Initial Codebook 2      65        120
                            44         41
                            59        119
                            57         88
    Final Codebook 2        69        168
                            44         41
                            62        116
                            57         90

Notice that by picking different subsets of the input as our initial codebook, we can generate different vector quantizers. A good approach to codebook design is to initialize the codebook randomly several times and, from the resulting quantizers, pick the one that gives the least distortion on the training set.

In 1989, Equitz [127] introduced a method for generating the initial codebook called the pairwise nearest neighbor (PNN) algorithm. In the PNN algorithm, we start with as many clusters as there are training vectors and end with the initial codebook. At each stage, we combine the two closest clusters into a single cluster and replace the two vectors by their mean. The idea is to merge those clusters that would result in the smallest increase in distortion. Equitz showed that when we combine two clusters $C_i$ and $C_j$, the increase in distortion is

$$\frac{n_i n_j}{n_i + n_j} \left\| Y_i - Y_j \right\|^2 \qquad (10.5)$$

where $n_i$ is the number of elements in the cluster $C_i$, and $Y_i$ is the corresponding output point. In the PNN algorithm, we combine the clusters that cause the smallest increase in the distortion.
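A deliberately unoptimized sketch of the PNN construction, with Equation (10.5) as the merge cost. The function name and the brute-force pair search are ours; the fast PNN variant mentioned in the next paragraph exists precisely to avoid this cost.

```python
import numpy as np

def pnn_initial_codebook(training, target_size):
    """Start with one cluster per training vector and repeatedly merge the
    pair of clusters whose merger increases the distortion the least."""
    clusters = [(1, np.asarray(v, dtype=float)) for v in training]   # (n_i, Y_i)
    while len(clusters) > target_size:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ni, yi = clusters[i]
                nj, yj = clusters[j]
                cost = ni * nj / (ni + nj) * np.sum((yi - yj) ** 2)  # Eq. (10.5)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        ni, yi = clusters[i]
        nj, yj = clusters[j]
        merged = (ni + nj, (ni * yi + nj * yj) / (ni + nj))          # weighted mean
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return np.array([y for _, y in clusters])
```

The array this returns is then used as the initial codebook for the LBG iterations, as in Example 10.4.4.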
Figure 10.13 Obtaining initial output points using the PNN approach.

Example 10.4.4: Using the PNN algorithm, we combine the elements in the training set as shown in Figure 10.13. At each step we combine the two clusters that are closest in the sense of Equation (10.5). If we use these values to initialize the LBG algorithm, we get a vector quantizer with output points (70, 172), (60, 107), (44, 41), (64, 150), and a distortion of 104.08.

Although it was a relatively easy task to generate the initial codebook using the PNN algorithm in Example 10.4.4, we can see that, as the size of the training set increases, this procedure becomes progressively more time-consuming. In order to avoid this cost, we can use a fast PNN algorithm that does not attempt to find the absolute smallest cost at each step (see [127] for details).

Finally, a simple way to obtain an initial codebook is to use the set of output points from the corresponding scalar quantizers. In the beginning of this chapter we saw how scalar quantization of a sequence of inputs can be viewed as vector quantization using a rectangular vector quantizer. We can use this rectangular vector quantizer as the initial set of output points.

Example 10.4.5: Return once again to the quantization of the height-weight data set. If we assume that the heights are uniformly distributed between 40 and 80, then a two-level scalar quantizer for the heights would have reconstruction values 50 and 70. Similarly, if we assume that the weights are uniformly distributed between 40 and 180, the reconstruction values would be 75 and 145. The initial reconstruction values for the vector quantizer are therefore (50, 75), (50, 145), (70, 75), and (70, 145). The final design for this initial set is the same as the one obtained in Example 10.4.1, with a distortion of 60.17.
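The scalar-quantizer initialization of Example 10.4.5 amounts to taking the Cartesian product of the per-component reconstruction levels. A small sketch, using the values from the example:

```python
import itertools
import numpy as np

# Per-component scalar reconstruction levels from Example 10.4.5.
height_levels = [50, 70]    # heights assumed uniform on [40, 80]
weight_levels = [75, 145]   # weights assumed uniform on [40, 180]

# The rectangular (product) codebook used to initialize the LBG iterations.
initial_codebook = np.array(list(itertools.product(height_levels, weight_levels)))
# -> [[ 50  75], [ 50 145], [ 70  75], [ 70 145]]
```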
  • 317. 294 10 V E C T O R Q U A N T I Z A T I O N We have looked at four different ways of initializing the LBG algorithm. Each has its own advantages and drawbacks. The PNN initialization has been shown to result in better designs, producing a lower distortion for a given rate than the splitting approach [127]. However, the procedure for obtaining the initial codebook is much more involved and complex. We cannot make any general claims regarding the superiority of any one of these initialization techniques. Even the PNN approach cannot be proven to be optimal. In practice, if we are dealing with a wide variety of inputs, the effect of using different initialization techniques appears to be insignificant. 10 . 4 . 2 T h e E m p t y C e l l P r o b l e m Let’s take a closer look at the progression of the design in Example 10.4.5. When we assign the inputs to the initial output points, no input point gets assigned to the output point at (70, 75). This is a problem because in order to update an output point, we need to take the average value of the input vectors. Obviously, some strategy is needed. The strategy that we actually used in Example 10.4.5 was not to update the output point if there were no inputs in the quantization region associated with it. This strategy seems to have worked in this particular example; however, there is a danger that we will end up with an output point that is never used. A common approach to avoid this is to remove an output point that has no inputs associated with it, and replace it with a point from the quantization region with the most output points. This can be done by selecting a point at random from the region with the highest population of training vectors, or the highest associated distortion. A more systematic approach is to design a two-level quantizer for the training vectors in the most heavily populated quantization region. This approach is computationally expensive and provides no significant improvement over the simpler approach. In the program accompanying this book, we have used the first approach. (To compare the two approaches, see Problem 3.) 10 . 4 . 3 U s e o f L B G f o r I m a g e C o m p r e s s i o n One application for which the vector quantizer described in this section has been extremely popular is image compression. For image compression, the vector is formed by taking blocks of pixels of size N ×M and treating them as an L = NM dimensional vector. Generally, we take N = M. Instead of forming vectors in this manner, we could form the vector by taking L pixels in a row of the image. However, this does not allow us to take advantage of the two- dimensional correlations in the image. Recall that correlation between the samples provides the clustering of the input, and the LBG algorithm takes advantage of this clustering. E x a m p l e 10 . 4 . 6: Let us quantize the Sinan image shown in Figure 10.14 using a 16-dimensional quantizer. The input vectors are constructed using 4 × 4 blocks of pixels. The codebook was trained on the Sinan image. The results of the quantization using codebooks of size 16, 64, 256, and 1024 are shown in Figure 10.15. The rates and compression ratios are summarized in Table 10.7. To see how these quantities were calculated, recall that if we have K vectors in a codebook, we need
  • 318. 10.4 The Linde-Buzo-Gray Algorithm 295 F I G U R E 10. 14 Original Sinan image. log2 K bits to inform the receiver which of the K vectors is the quantizer output. This quantity is listed in the second column of Table 10.7 for the different values of K. If the vectors are of dimension L, this means that we have used log2 K bits to send the quantized value of L pixels. Therefore, the rate in bits per pixel is log2 K L . (We have assumed that the codebook is available to both transmitter and receiver, and therefore we do not have to use any bits to transmit the codebook from the transmitter to the receiver.) This quantity is listed in the third column of Table 10.7. Finally, the compression ratio, given in the last column of Table 10.7, is the ratio of the number of bits per pixel in the original image to the number of bits per pixel in the compressed image. The Sinan image was digitized using 8 bits per pixel. Using this information and the rate after compression, we can obtain the compression ratios. Looking at the images, we see that reconstruction using a codebook of size 1024 is very close to the original. At the other end, the image obtained using a codebook with 16 reconstruction vectors contains a lot of visible artifacts. The utility of each reconstruction depends on the demands of the particular application. In this example, we used codebooks trained on the image itself. Generally, this is not the preferred approach because the receiver has to have the same codebook in order to reconstruct the image. Either the codebook must be transmitted along with the image, or the receiver has the same training image so that it can generate an identical codebook. This is impractical because, if the receiver already has the image in question, much better compression can be obtained by simply sending the name of the image to the receiver. Sending the codebook with the image is not unreasonable. However, the transmission of
  • 319. 296 10 V E C T O R Q U A N T I Z A T I O N F I G U R E 10. 15 Top left: codebook size 16; top right: codebook size 64; bottom left: codebook size 256; bottom right: codebook size 1024. T A B L E 10 . 7 Summary of compression measures for image compression example. Codebook Size Bits Needed to (# of codewords) Select a Codeword Bits per Pixel Compression Ratio 16 4 0.25 32:1 64 6 0.375 21.33:1 256 8 0.50 16:1 1024 10 0.625 12.8:1
the codebook is overhead that could be avoided if a more generic codebook, one that is available to both transmitter and receiver, were to be used.

In order to compute the overhead, we need to calculate the number of bits required to transmit the codebook to the receiver. If each codeword in the codebook is a vector with L elements and if we use B bits to represent each element, then in order to transmit the codebook of a K-level quantizer we need $B \times L \times K$ bits. In our example, B = 8 and L = 16. Therefore, we need $K \times 128$ bits to transmit the codebook. As our image consists of $256 \times 256$ pixels, the overhead in bits per pixel is $128K/65536$. The overhead for different values of K is summarized in Table 10.8. We can see that while the overhead for a codebook of size 16 seems reasonable, the overhead for a codebook of size 1024 is over three times the rate required for quantization.

Table 10.8 Overhead in bits per pixel for codebooks of different sizes.

    Codebook Size K    Overhead in Bits per Pixel
         16                   0.03125
         64                   0.125
        256                   0.5
       1024                   2.0

Given the excessive amount of overhead required for sending the codebook along with the vector quantized image, there has been substantial interest in the design of codebooks that are more generic in nature and, therefore, can be used to quantize a number of images. To investigate the issues that might arise, we quantized the Sinan image using four different codebooks generated by the Sena, Sensin, Earth, and Omaha images. The results are shown in Figure 10.16. As expected, the reconstructed images from this approach are not of the same quality as when the codebook is generated from the image to be quantized. However, this is only true as long as the overhead required for storage or transmission of the codebook is ignored. If we include the extra rate required to encode and transmit the codebook of output points, using the codebook generated by the image to be quantized seems unrealistic. Although using the codebook generated by another image to perform the quantization may be realistic, the quality of the reconstructions is quite poor. Later in this chapter we will take a closer look at the subject of vector quantization of images and consider a variety of ways to improve this performance.

You may have noticed that the bit rates for the vector quantizers used in the examples are quite low. The reason is that the size of the codebook increases exponentially with the rate. Suppose we want to encode a source using R bits per sample; that is, the average number of bits per sample in the compressed source output is R. By "sample" we mean a scalar element of the source output sequence. If we wanted to use an L-dimensional quantizer, we would group L samples together into vectors. This means that we would have RL bits available to represent each vector. With RL bits, we can represent $2^{RL}$ different output vectors. In other words, the size of the codebook for an L-dimensional, R-bits-per-sample quantizer is $2^{RL}$. From Table 10.7, we can see that when we quantize an image using 0.25 bits per pixel and 16-dimensional quantizers, we have $16 \times 0.25 = 4$ bits available to represent each vector.
Figure 10.16 Sinan image quantized at the rate of 0.5 bits per pixel. The images used to obtain the codebook were (clockwise from top left) Sensin, Sena, Earth, Omaha.

Hence, the size of the codebook is $2^4 = 16$. The quantity RL is often called the rate dimension product. Note that the size of the codebook grows exponentially with this product. Consider the problems. The codebook size for a 16-dimensional, 2-bits-per-sample vector quantizer would be $2^{16 \times 2}$! (If the source output was originally represented using 8 bits per sample, a rate of 2 bits per sample for the compressed source corresponds to a compression ratio of 4:1.) This large size causes problems both with storage and with the quantization process. To store $2^{32}$ sixteen-dimensional vectors, assuming that we can store each component of the vector in a single byte, requires $2^{32} \times 16$ bytes, or approximately 64 gigabytes of storage. Furthermore, to quantize a single input vector would require over four billion vector
  • 322. 10.5 Tree-Structured Vector Quantizers 299 comparisons to find the closest output point. Obviously, neither the storage requirements nor the computational requirements are realistic. Because of this problem, most vector quantization applications operate at low bit rates. In many applications, such as low-rate speech coding, we want to operate at very low rates; therefore, this is not a drawback. However, for applications such as high-quality video coding, which requires higher rates, this is definitely a problem. There are several approaches to solving these problems. Each entails the introduction of some structure in the codebook and/or the quantization process. While the introduction of structure mitigates some of the storage and computational problems, there is generally a trade-off in terms of the distortion performance. We will look at some of these approaches in the following sections. 10 . 5 Tree-Structured Vector Quantizers One way we can introduce structure is to organize our codebook in such a way that it is easy to pick which part contains the desired output vector. Consider the two-dimensional vector quantizer shown in Figure 10.17. Note that the output points in each quadrant are the mirror image of the output points in neighboring quadrants. Given an input to this vector quantizer, we can reduce the number of comparisons necessary for finding the closest output point by using the sign on the components of the input. The sign on the components of the input vector will tell us in which quadrant the input lies. Because all the quadrants are mirror images of the neighboring quadrants, the closest output point to a given input will lie in the same quadrant as the input itself. Therefore, we only need to compare the input to the output points that lie in the same quadrant, thus reducing the number of required comparisons by a factor of four. This approach can be extended to L dimensions, where the signs on the L components of the input vector can tell us in which of the 2L hyperquadrants the input lies, which in turn would reduce the number of comparisons by 2L . This approach works well when the output points are distributed in a symmetrical manner. However, it breaks down as the distribution of the output points becomes less symmetrical. F I G U R E 10. 17 A symmetrical vector quantizer in two dimensions.
  • 323. 300 10 V E C T O R Q U A N T I Z A T I O N E x a m p l e 10 . 5 . 1: Consider the vector quantizer shown in Figure 10.18. This is different from the output points in Figure 10.17; we have dropped the mirror image requirement of the previous example. The output points are shown as filled circles, and the input point is the X. It is obvious from the figure that while the input is in the first quadrant, the closest output point is in the fourth quadrant. However, the quantization approach described above will force the input to be represented by an output in the first quadrant. x F I G U R E 10. 18 Breakdown of the method using the quadrant approach. The situation gets worse as we lose more and more of the symmetry. Consider the situation in Figure 10.19. In this quantizer, not only will we get an incorrect output point when the input is close to the boundaries of the first quadrant, but also there is no significant reduction in the amount of computation required. F I G U R E 10. 19 Breakdown of the method using the quadrant approach. Most of the output points are in the first quadrant. Therefore, whenever the input falls in the first quadrant, which it will do quite often if the quantizer design is reflective of the distribution of the input, knowing that it is in the first quadrant does not lead to a great reduction in the number of comparisons.
Figure 10.20 Division of output points into two groups.

The idea of using the L-dimensional equivalents of quadrants to partition the output points in order to reduce the computational load can be extended to nonsymmetrical situations, like those shown in Figure 10.19, in the following manner. Divide the set of output points into two groups, group0 and group1, and assign to each group a test vector such that output points in each group are closer to the test vector assigned to that group than to the test vector assigned to the other group (Figure 10.20). Label the two test vectors 0 and 1. When we get an input vector, we compare it against the test vectors. Depending on the outcome, the input is compared to the output points associated with the test vector closest to the input. After these two comparisons, we can discard half of the output points. Comparison with the test vectors takes the place of looking at the signs of the components to decide which set of output points to discard from contention. If the total number of output points is K, with this approach we have to make $\frac{K}{2} + 2$ comparisons instead of K comparisons.

This process can be continued by splitting the output points in each group into two groups and assigning a test vector to the subgroups. So group0 would be split into group00 and group01, with associated test vectors labeled 00 and 01, and group1 would be split into group10 and group11, with associated test vectors labeled 10 and 11. Suppose the result of the first set of comparisons was that the output point would be searched for in group1. The input would be compared to the test vectors 10 and 11. If the input was closer to the test vector 10, then the output points in group11 would be discarded, and the input would be compared to the output points in group10. We can continue the procedure by successively dividing each group of output points into two, until finally, if the number of output points is a power of two, the last set of groups would consist of single points. The number of comparisons required to obtain the final output point would be $2\log_2 K$ instead of K. Thus, for a codebook of size 4096 we would need 24 vector comparisons instead of 4096 vector comparisons. This is a remarkable decrease in computational complexity. However, we pay for this decrease in two ways. The first penalty is a possible increase in distortion. It is possible at some stage that the input is closer to one test vector while at the same time being closest to an output belonging to the rejected group. This is similar to the situation shown in Figure 10.18. The other penalty is an increase in storage requirements. Now we not only have to store the output points from the vector quantizer codebook, we also must store the test vectors. This means almost a doubling of the storage requirement.
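A sketch of this search with the test vectors stored as an explicit binary tree. The class and function names are ours; note that the bits collected on the way down are exactly the codeword discussed below.

```python
import numpy as np

class TSVQNode:
    """Internal nodes hold test vectors; leaves hold output points.
    The root's own vector is never compared against, so it may be None."""
    def __init__(self, vector=None, left=None, right=None):
        self.vector = None if vector is None else np.asarray(vector, dtype=float)
        self.left = left      # subtree reached by emitting bit 0
        self.right = right    # subtree reached by emitting bit 1

def tsvq_encode(root, x):
    """Descend the tree: two distance computations and one bit per level."""
    x = np.asarray(x, dtype=float)
    bits, node = [], root
    while node.left is not None:                 # not yet at a leaf
        d0 = np.sum((x - node.left.vector) ** 2)
        d1 = np.sum((x - node.right.vector) ** 2)
        if d0 <= d1:
            bits.append(0)
            node = node.left
        else:
            bits.append(1)
            node = node.right
    return bits, node.vector                     # codeword, output point
```

For a codebook of size K stored in a balanced tree, the loop runs $\log_2 K$ times and performs two comparisons per pass, which is the $2\log_2 K$ figure quoted above.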
  • 325. 302 10 V E C T O R Q U A N T I Z A T I O N 01 011 010 001 000 0 00 11 111 110 101 100 1 10 F I G U R E 10. 21 Decision tree for quantization. The comparisons that must be made at each step are shown in Figure 10.21. The label inside each node is the label of the test vector that we compare the input against. This tree of decisions is what gives tree-structured vector quantizers (TSVQ) their name. Notice also that, as we are progressing down a tree, we are also building a binary string. As the leaves of the tree are the output points, by the time we reach a particular leaf or, in other words, select a particular output point, we have obtained the binary codeword corresponding to that output point. This process of building the binary codeword as we progress through the series of decisions required to find the final output can result in some other interesting properties of tree-structured vector quantizers. For instance, even if a partial codeword is transmitted, we can still get an approximation of the input vector. In Figure 10.21, if the quantized value was the codebook vector 5, the binary codeword would be 011. However, if only the first two bits 01 were received by the decoder, the input can be approximated by the test vector labeled 01. 10 . 5 . 1 D e s i g n o f T r e e - S t r u c t u r e d V e c t o r Q u a n t i z e r s In the last section we saw how we could reduce the computational complexity of the design process by imposing a tree structure on the vector quantizer. Rather than imposing this structure after the vector quantizer has been designed, it makes sense to design the vector quantizer within the framework of the tree structure. We can do this by a slight modification of the splitting design approach proposed by Linde et al. [125]. We start the design process in a manner identical to the splitting technique. First, obtain the average of all the training vectors, perturb it to obtain a second vector, and use these vectors to form a two-level vector quantizer. Let us label these two vectors 0 and 1, and the groups of training set vectors that would be quantized to each of these two vectors group0 and group1. We will later use these vectors as test vectors. We perturb these output points to get the initial vectors for a four-level vector quantizer. At this point, the design procedure
  • 326. 10.6 Structured Vector Quantizers 303 for the tree-structured vector quantizer deviates from the splitting technique. Instead of using the entire training set to design a four-level vector quantizer, we use the training set vectors in group0 to design a two-level vector quantizer with output points labeled 00 and 01. We use the training set vectors in group1 to design a two-level vector quantizer with output points labeled 10 and 11. We also split the training set vectors in group0 and group1 into two groups each. The vectors in group0 are split, based on their proximity to the vectors labeled 00 and 01, into group00 and group01, and the vectors in group1 are divided in a like manner into the groups group10 and group11. The vectors labeled 00, 01, 10, and 11 will act as test vectors at this level. To get an eight-level quantizer, we use the training set vectors in each of the four groups to obtain four two-level vector quantizers. We continue in this manner until we have the required number of output points. Notice that in the process of obtaining the output points, we have also obtained the test vectors required for the quantization process. 10 . 5 . 2 P r u n e d T r e e - S t r u c t u r e d V e c t o r Q u a n t i z e r s Once we have built a tree-structured codebook, we can sometimes improve its rate distortion performance by removing carefully selected subgroups. Removal of a subgroup, referred to as pruning, will reduce the size of the codebook and hence the rate. It may also result in an increase in distortion. Therefore, the objective of the pruning is to remove those subgroups that will result in the best trade-off of rate and distortion. Chou, Lookabaugh, and Gray [128] have developed an optimal pruning algorithm called the generalized BFOS algorithm. The name of the algorithm derives from the fact that it is an extension of an algorithm originally developed by Brieman, Freidman, Olshen, and Stone [129] for classification applications. (See [128] and [5] for description and discussion of the algorithm.) Pruning output points from the codebook has the unfortunate effect of removing the structure that was previously used to generate the binary codeword corresponding to the output points. If we used the structure to generate the binary codewords, the pruning would cause the codewords to be of variable length. As the variable-length codes would correspond to the leaves of a binary tree, this code would be a prefix code and, therefore, certainly usable. However, it would not require a large increase in complexity to assign fixed-length codewords to the output points using another method. This increase in complexity is generally offset by the improvement in performance that results from the pruning [130]. 10 . 6 Structured Vector Quantizers The tree-structured vector quantizer solves the complexity problem, but acerbates the storage problem. We now take an entirely different tack and develop vector quantizers that do not have these storage problems; however, we pay for this relief in other ways. Example 10.3.1 was our motivation for the quantizer obtained by the LBG algorithm. This example showed that the correlation between samples of the output of a source leads to clustering. This clustering is exploited by the LBG algorithm by placing output points at the location of these clusters. However, in Example 10.3.2, we saw that even when there
is no correlation between samples, there is a kind of probabilistic structure that becomes more evident as we group the random inputs of a source into larger and larger blocks or vectors.

In Example 10.3.2, we changed the position of the output point in the top-right corner. All four corner points have the same probability, so we could have chosen any of these points. In the case of the two-dimensional Laplacian distribution in Example 10.3.2, all points that lie on the contour described by $|x| + |y| = \text{constant}$ have equal probability. These are called contours of constant probability. For spherically symmetrical distributions like the Gaussian distribution, the contours of constant probability are circles in two dimensions, spheres in three dimensions, and hyperspheres in higher dimensions.

We mentioned in Example 10.3.2 that the points away from the origin have very little probability mass associated with them. Based on what we have said about the contours of constant probability, we can be a little more specific and say that the points on constant probability contours farther away from the origin have very little probability mass associated with them. Therefore, we can get rid of all of the points outside some contour of constant probability without incurring much of a distortion penalty. In addition, as the number of reconstruction points is reduced, there is a decrease in rate, thus improving the rate distortion performance.

Example 10.6.1: Let us design a two-dimensional uniform quantizer by keeping only the output points in the quantizer of Example 10.3.2 that lie on or within the contour of constant probability given by $|x_1| + |x_2| = 5\Delta$. If we count all the points that are retained, we get 60 points. This is close enough to 64 that we can compare it with the eight-level uniform scalar quantizer. If we simulate this quantization scheme with a Laplacian input, and the same step size as the scalar quantizer, that is, $\Delta = 0.7309$, we get an SNR of 12.22 dB. Comparing this to the 11.44 dB obtained with the scalar quantizer, we see that there is a definite improvement. We can get slightly more improvement in performance if we modify the step size.

Notice that the improvement in the previous example is obtained only by restricting the outer boundary of the quantizer. Unlike Example 10.3.2, we did not change the shape of any of the inner quantization regions. This gain is referred to in the quantization literature as boundary gain. In terms of the description of quantization noise in Chapter 8, we reduced the overload error by reducing the overload probability, without a commensurate increase in the granular noise. In Figure 10.22, we have marked the 12 output points that belonged to the original 64-level quantizer, but do not belong to the 60-level quantizer, by drawing circles around them. Removal of these points results in an increase in overload probability. We have also marked the eight output points that belong to the 60-level quantizer, but were not part of the original 64-level quantizer, by drawing squares around them. Adding these points results in a decrease in the overload probability. If we calculate the increases and decreases (Problem 5), we find that the net result is a decrease in overload probability. This overload probability is further reduced as the dimension of the vector is increased.
Figure 10.22 Contours of constant probability.

10.6.1 Pyramid Vector Quantization

As the dimension of the input vector increases, something interesting happens. Suppose we are quantizing a random variable X with pdf $f_X(x)$ and differential entropy $h(X)$. Suppose we block samples of this random variable into a vector $\mathbf{X}$. A result of Shannon's, called the asymptotic equipartition property (AEP), states that for sufficiently large L and arbitrarily small $\epsilon$

$$\left| \frac{\log f_{\mathbf{X}}(\mathbf{X})}{L} + h(X) \right| < \epsilon \qquad (10.6)$$
for all but a set of vectors with a vanishingly small probability [7]. This means that almost all the L-dimensional vectors will lie on a contour of constant probability given by

$$\frac{\log f_{\mathbf{X}}(\mathbf{X})}{L} = -h(X) \qquad (10.7)$$

Given that this is the case, Sakrison [131] suggested that an optimum manner to encode the source would be to distribute $2^{RL}$ points uniformly in this region. Fischer [132] used this insight to design a vector quantizer called the pyramid vector quantizer for the Laplacian source that looks quite similar to the quantizer described in Example 10.6.1. The vector quantizer consists of points of the rectangular quantizer that fall on the hyperpyramid given by

$$\sum_{i=1}^{L} |x_i| = C$$

where C is a constant depending on the variance of the input. Shannon's result is asymptotic, and for realistic values of L, the input vector is generally not localized to a single hyperpyramid. For this case, Fischer first finds the distance

$$r = \sum_{i=1}^{L} |x_i|$$

This value is quantized and transmitted to the receiver. The input is normalized by this gain term and quantized using a single hyperpyramid. The quantization process for the shape term consists of two stages: finding the output point on the hyperpyramid closest to the scaled input, and finding a binary codeword for this output point. (See [132] for details about the quantization and coding process.) This approach is quite successful, and for a rate of 3 bits per sample and a vector dimension of 16, we get an SNR value of 16.32 dB. If we increase the vector dimension to 64, we get an SNR value of 17.03 dB. Compared to the SNR obtained from using a nonuniform scalar quantizer, this is an improvement of more than 4 dB. Notice that in this approach we separated the input vector into a gain term and a pattern or shape term. Quantizers of this form are called gain-shape vector quantizers, or product code vector quantizers [133].

10.6.2 Polar and Spherical Vector Quantizers

For the Gaussian distribution, the contours of constant probability are circles in two dimensions and spheres and hyperspheres in three and higher dimensions. In two dimensions, we can quantize the input vector by first transforming it into polar coordinates r and $\theta$:

$$r = \sqrt{x_1^2 + x_2^2} \qquad (10.8)$$

and

$$\theta = \tan^{-1} \frac{x_2}{x_1} \qquad (10.9)$$
$r$ and $\theta$ can then be either quantized independently [134], or we can use the quantized value of r as an index to a quantizer for $\theta$ [135]. The former is known as a polar quantizer; the latter, an unrestricted polar quantizer. The advantage to quantizing r and $\theta$ independently is one of simplicity. The quantizers for r and $\theta$ are independent scalar quantizers. However, the performance of the polar quantizers is not significantly higher than that of scalar quantization of the components of the two-dimensional vector. The unrestricted polar quantizer has a more complex implementation, as the quantization of $\theta$ depends on the quantization of r. However, the performance is also somewhat better than that of the polar quantizer. The polar quantizer can be extended to three or more dimensions [136].

10.6.3 Lattice Vector Quantizers

Recall that quantization error is composed of two kinds of error, overload error and granular error. The overload error is determined by the location of the quantization regions furthest from the origin, or the boundary. We have seen how we can design vector quantizers to reduce the overload probability and thus the overload error. We called this the boundary gain of vector quantization. In scalar quantization, the granular error was determined by the size of the quantization interval. In vector quantization, the granular error is affected by the size and shape of the quantization interval.

Consider the square and circular quantization regions shown in Figure 10.23. We show only the quantization region at the origin. These quantization regions need to be distributed in a regular manner over the space of source outputs. However, for now, let us simply consider the quantization region at the origin. Let's assume they both have the same area so that we can compare them. This way it would require the same number of quantization regions to cover a given area. That is, we will be comparing two quantization regions of the same "size." To have an area of one, the square has to have sides of length one. As the area of a circle is given by $\pi r^2$, the radius of the circle is $\frac{1}{\sqrt{\pi}}$. The maximum quantization error possible with the square quantization region is when the input is at one of the four corners of the square. In this case, the error is $\frac{1}{\sqrt{2}}$, or about 0.707. For the circular quantization region, the maximum error occurs when the input falls on the boundary of the circle. In this case, the error is $\frac{1}{\sqrt{\pi}}$, or about 0.56. Thus, the maximum granular error is larger for the square region than the circular region.

In general, we are more concerned with the average squared error than the maximum error. If we compute the average squared error for the square region, we obtain

$$\int_{\text{Square}} \|X\|^2 \, dX = 0.1\overline{6}$$

Figure 10.23 Possible quantization regions.
For the circle, we obtain

$$\int_{\text{Circle}} \|X\|^2 \, dX = 0.159$$

Thus, the circular region would introduce less granular error than the square region. Our choice seems to be clear; we will use the circle as the quantization region. Unfortunately, a basic requirement for the quantizer is that for every possible input vector there should be a unique output vector. In order to satisfy this requirement and have a quantizer with sufficient structure that can be used to reduce the storage space, a union of translates of the quantization region should cover the output space of the source. In other words, the quantization region should tile space. A two-dimensional region can be tiled by squares, but it cannot be tiled by circles. If we tried to tile the space with circles, we would either get overlaps or holes. Apart from squares, other shapes that tile space include rectangles and hexagons. It turns out that the best shape to pick for a quantization region in two dimensions is a hexagon [137].

In two dimensions, it is relatively easy to find the shapes that tile space, and then select the one that gives the smallest amount of granular error. However, when we start looking at higher dimensions, it is difficult, if not impossible, to visualize different shapes, let alone find which ones tile space. An easy way out of this dilemma is to remember that a quantizer can be completely defined by its output points. In order for this quantizer to possess structure, these points should be spaced in some regular manner. Regular arrangements of output points in space are called lattices. Mathematically, we can define a lattice as follows: Let $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_L$ be L independent L-dimensional vectors. Then the set

$$\mathcal{L} = \left\{ \mathbf{x} : \mathbf{x} = \sum_{i=1}^{L} u_i \mathbf{a}_i \right\} \qquad (10.10)$$

is a lattice if the $u_i$ are all integers. When a subset of lattice points is used as the output points of a vector quantizer, the quantizer is known as a lattice vector quantizer. From this definition, the pyramid vector quantizer described earlier can be viewed as a lattice vector quantizer. Basing a quantizer on a lattice solves the storage problem. As any lattice point can be regenerated if we know the basis set, there is no need to store the output points. Further, the highly structured nature of lattices makes finding the closest output point to an input relatively simple. Note that what we give up when we use lattice vector quantizers is the clustering property of LBG quantizers.

Let's take a look at a few examples of lattices in two dimensions. If we pick $\mathbf{a}_1 = (1, 0)$ and $\mathbf{a}_2 = (0, 1)$, we obtain the integer lattice, the lattice that contains all points in two dimensions whose coordinates are integers.
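Equation (10.10) translates directly into code. The sketch below simply enumerates the lattice points whose integer coefficients lie in a small window; the function name and the window are arbitrary choices of ours, and the same call works for the D and A lattices introduced next once their basis vectors are plugged in.

```python
import itertools
import numpy as np

def lattice_points(basis, coeff_range=range(-3, 4)):
    """All points sum_i u_i * a_i with integer u_i drawn from coeff_range."""
    basis = [np.asarray(a, dtype=float) for a in basis]
    points = []
    for coeffs in itertools.product(coeff_range, repeat=len(basis)):
        points.append(sum(u * a for u, a in zip(coeffs, basis)))
    return np.array(points)

# The integer lattice: a1 = (1, 0), a2 = (0, 1).
integer_lattice = lattice_points([(1, 0), (0, 1)])
```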
Figure 10.24 The D2 lattice.

If we pick $\mathbf{a}_1 = (1, 1)$ and $\mathbf{a}_2 = (1, -1)$, we get the lattice shown in Figure 10.24. This lattice has a rather interesting property. Any point in the lattice is given by $n\mathbf{a}_1 + m\mathbf{a}_2$, where n and m are integers. But

$$n\mathbf{a}_1 + m\mathbf{a}_2 = (n+m, n-m)$$

and the sum of the coordinates is $(n+m) + (n-m) = 2n$, which is even for all n. Therefore, all points in this lattice have an even coordinate sum. Lattices with these properties are called D lattices. Finally, if $\mathbf{a}_1 = (1, 0)$ and $\mathbf{a}_2 = \left(-\frac{1}{2}, \frac{\sqrt{3}}{2}\right)$, we get the hexagonal lattice shown in Figure 10.25. This is an example of an A lattice.

There are a large number of lattices that can be used to obtain lattice vector quantizers. In fact, given a dimension L, there are an infinite number of possible sets of L independent vectors. Among these, we would like to pick the lattice that produces the greatest reduction in granular noise. When comparing the square and circle as candidates for quantization regions, we used the integral of $\|X\|^2$ over the shape. This is simply the second moment of the shape. The shape with the smallest second moment for a given volume is known to be the circle in two dimensions and the sphere and hypersphere in higher dimensions [138]. Unfortunately, circles and spheres cannot tile space; either there will be overlap or there will be holes.
Figure 10.25 The A2 lattice.

As the ideal case is unattainable, we can try to approximate it. We can look for ways of arranging spheres so that they cover space with minimal overlap [139], or look for ways of packing spheres with the least amount of space left over [138]. The centers of these spheres can then be used as the output points. The quantization regions will not be spheres, but they may be close approximations to spheres. The problems of sphere covering and sphere packing are widely studied in a number of different areas. Lattices discovered in these studies have also been useful as vector quantizers [138]. Some of these lattices, such as the A2 and D2 lattices described earlier, are based on the root systems of Lie algebras [140]. The study of Lie algebras is beyond the scope of this book; however, we have included a brief discussion of the root systems and how to obtain the corresponding lattices in Appendix C.

One of the nice things about root lattices is that we can use their structural properties to obtain fast quantization algorithms. For example, consider building a quantizer based on the D2 lattice. Because of the way in which we described the D2 lattice, the size of the lattice is fixed. We can change the size by picking the basis vectors as $(\Delta, \Delta)$ and $(\Delta, -\Delta)$ instead of (1, 1) and (1, -1). We can have exactly the same effect by dividing each input by $\Delta$ before quantization, and then multiplying the reconstruction values by $\Delta$. Suppose we pick the latter approach and divide the components of the input vector by $\Delta$. If we wanted to find the closest lattice point to the input, all we need to do is find the closest integer to each coordinate of the scaled input. If the sum of these integers is even, we have a lattice point. If not, find the coordinate that incurred the largest distortion during conversion to an integer and then find the next closest integer. The sum of the coordinates of this new vector differs from the sum of the coordinates of the previous vector by one. Therefore, if the sum of the coordinates of the previous vector was odd, the sum of the coordinates of the current vector will be even, and we have the closest lattice point to the input.

Example 10.6.2: Suppose the input vector is given by (2.3, 1.9). Rounding each component to the nearest integer, we get the vector (2, 2). The sum of the coordinates is even; therefore, this is the closest lattice point to the input.
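A minimal sketch of this rounding rule (Example 10.6.2 continues below with an input that needs the re-rounding step). The function name is ours, and ties at exactly .5 would need whatever care a real implementation chooses to give them.

```python
import numpy as np

def nearest_d2_point(x, delta=1.0):
    """Closest point of the D2 lattice (even coordinate sum), scaled by delta."""
    x = np.asarray(x, dtype=float) / delta
    rounded = np.round(x)
    if int(rounded.sum()) % 2 != 0:
        # Re-round the coordinate with the largest rounding error toward
        # its next closest integer; this flips the parity of the sum.
        errors = x - rounded
        i = int(np.argmax(np.abs(errors)))
        rounded[i] += 1.0 if errors[i] > 0 else -1.0
    return rounded * delta

print(nearest_d2_point([2.3, 1.9]))   # [2. 2.]
print(nearest_d2_point([3.4, 1.8]))   # [4. 2.]
```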
  • 334. 10.7 Variations on the Theme 311 Suppose the input was (3.4, 1.8). Rounding the components to the nearest integer, we get (3, 2). The sum of the components is 5, which is odd. The differences between the components of the input vector and the nearest integer are 0.4 and 0.2. The largest difference was incurred by the first component, so we round it up to the next closest integer, and the resulting vector is (4, 2). The sum of the coordinates is 6, which is even; therefore, this is the closest lattice point. Many of the lattices have similar properties that can be used to develop fast algorithms for finding the closest output point to a given input [141, 140]. To review our coverage of lattice vector quantization, overload error can be reduced by careful selection of the boundary, and we can reduce the granular noise by selection of the lattice. The lattice also provides us with a way to avoid storage problems. Finally, we can use the structural properties of the lattice to find the closest lattice point to a given input. Now we need two things: to know how to find the closest output point (remember, not all lattice points are output points), and to find a way of assigning a binary codeword to the output point and recovering the output point from the binary codeword. This can be done by again making use of the specific structures of the lattices. While the procedures necessary are simple, explanations of the procedures are lengthy and involved (see [142] and [140] for details). 10 . 7 Variations on the Theme Because of its capability to provide high compression with relatively low distortion, vector quantization has been one of the more popular lossy compression techniques over the last decade in such diverse areas as video compression and low-rate speech compression. During this period, several people have come up with variations on the basic vector quantization approach. We briefly look at a few of the more well-known variations here, but this is by no means an exhaustive list. For more information, see [5] and [143]. 10 . 7 . 1 G a i n - S h a p e V e c t o r Q u a n t i z a t i o n In some applications such as speech, the dynamic range of the input is quite large. One effect of this is that, in order to be able to represent the various vectors from the source, we need a very large codebook. This requirement can be reduced by normalizing the source output vectors, then quantizing the normalized vector and the normalization factor separately [144, 133]. In this way, the variation due to the dynamic range is represented by the normalization factor or gain, while the vector quantizer is free to do what it does best, which is to capture the structure in the source output. Vector quantizers that function in this manner are called gain-shape vector quantizers. The pyramid quantizer discussed earlier is an example of a gain-shape vector quantizer.
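A gain-shape encoder can be sketched in a few lines. The text does not pin down the normalization; the sketch below assumes a Euclidean gain, whereas the pyramid quantizer uses the sum of absolute values. The codebooks, which are assumed given, and the function names are ours.

```python
import numpy as np

def gain_shape_encode(x, gain_levels, shape_codebook):
    """Quantize the gain with a scalar quantizer and the normalized shape
    with a vector quantizer whose codewords are unit-norm vectors."""
    x = np.asarray(x, dtype=float)
    gain = np.linalg.norm(x)
    g_idx = int(np.argmin(np.abs(gain_levels - gain)))
    shape = x / gain if gain > 0 else x
    s_idx = int(np.argmin(((shape_codebook - shape) ** 2).sum(axis=1)))
    return g_idx, s_idx

def gain_shape_decode(g_idx, s_idx, gain_levels, shape_codebook):
    return gain_levels[g_idx] * shape_codebook[s_idx]
```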
  • 335. 312 10 V E C T O R Q U A N T I Z A T I O N 10 . 7 . 2 M e a n - R e m o v e d V e c t o r Q u a n t i z a t i o n If we were to generate a codebook from an image, differing amounts of background illumi- nation would result in vastly different codebooks. This effect can be significantly reduced if we remove the mean from each vector before quantization. The mean and the mean-removed vector can then be quantized separately. The mean can be quantized using a scalar quantiza- tion scheme, while the mean-removed vector can be quantized using a vector quantizer. Of course, if this strategy is used, the vector quantizer should be designed using mean-removed vectors as well. E x a m p l e 10 . 7 . 1: Let us encode the Sinan image using a codebook generated by the Sena image, as we did in Figure 10.16. However, this time we will use a mean-removed vector quantizer. The result is shown in Figure 10.26. For comparison we have also included the reconstructed image from Figure 10.16. Notice the annoying blotches on the shoulder have disappeared. How- ever, the reconstructed image also suffers from more blockiness. The blockiness increases because adding the mean back into each block accentuates the discontinuity at the block boundaries. F I G U R E 10. 26 Left: Reconstructed image using mean-removed vector quantization and the Sena image as the training set. Right: LBG vector quantization with the Sena image as the training set. Each approach has its advantages and disadvantages. Which approach we use in a particular application depends very much on the application.
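A sketch of the mean-removed scheme described above: the block mean goes to a scalar quantizer, and the mean-removed residual goes to a vector quantizer whose codebook is assumed to have been designed from mean-removed training vectors. Names are ours.

```python
import numpy as np

def mean_removed_encode(block, mean_levels, codebook):
    """Return the index of the quantized mean and of the quantized residual."""
    block = np.asarray(block, dtype=float)
    m = block.mean()
    m_idx = int(np.argmin(np.abs(mean_levels - m)))
    residual = block - m
    v_idx = int(np.argmin(((codebook - residual) ** 2).sum(axis=1)))
    return m_idx, v_idx

def mean_removed_decode(m_idx, v_idx, mean_levels, codebook):
    # Adding the mean back is what accentuates block-boundary discontinuities.
    return mean_levels[m_idx] + codebook[v_idx]
```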
10.7.3 Classified Vector Quantization

We can sometimes divide the source output into separate classes with different spatial properties. In these cases, it can be very beneficial to design separate vector quantizers for the different classes. This approach, referred to as classified vector quantization, is especially useful in image compression, where edges and nonedge regions form two distinct classes. We can separate the training set into vectors that contain edges and vectors that do not. A separate vector quantizer can be developed for each class. During the encoding process, the vector is first tested to see if it contains an edge. A simple way to do this is to check the variance of the pixels in the vector. A large variance will indicate the presence of an edge. More sophisticated techniques for edge detection can also be used. Once the vector is classified, the corresponding codebook can be used to quantize the vector. The encoder transmits both the label for the codebook used and the label for the vector in the codebook [145].

A slight variation of this strategy is to use different kinds of quantizers for the different classes of vectors. For example, if certain classes of source outputs require quantization at a higher rate than is possible using LBG vector quantizers, we can use lattice vector quantizers. An example of this approach can be found in [146].

10.7.4 Multistage Vector Quantization

Multistage vector quantization [147] is an approach that reduces both the encoding complexity and the memory requirements for vector quantization, especially at high rates. In this approach, the input is quantized in several stages. In the first stage, a low-rate vector quantizer is used to generate a coarse approximation of the input. This coarse approximation, in the form of the label of the output point of the vector quantizer, is transmitted to the receiver. The error between the original input and the coarse representation is quantized by the second-stage quantizer, and the label of the output point is transmitted to the receiver. In this manner, the input to the nth-stage vector quantizer is the difference between the original input and the reconstruction obtained from the outputs of the preceding n-1 stages. The difference between the input to a quantizer and the reconstruction value is often called the residual, and the multistage vector quantizers are also known as residual vector quantizers [148]. The reconstructed vector is the sum of the output points of each of the stages. Suppose we have a three-stage vector quantizer, with the three quantizers represented by $Q_1$, $Q_2$, and $Q_3$. Then for a given input $X$, we find

$$Y_1 = Q_1(X)$$
$$Y_2 = Q_2(X - Q_1(X))$$
$$Y_3 = Q_3\big(X - Q_1(X) - Q_2(X - Q_1(X))\big) \qquad (10.11)$$

The reconstruction $\hat{X}$ is given by

$$\hat{X} = Y_1 + Y_2 + Y_3 \qquad (10.12)$$

This process is shown in Figure 10.27.
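Equations (10.11) and (10.12) generalize to any number of stages. A sketch, with the per-stage codebooks assumed to be given:

```python
import numpy as np

def vq(x, codebook):
    """Nearest-codeword quantization: return the index and the output point."""
    idx = int(np.argmin(((codebook - x) ** 2).sum(axis=1)))
    return idx, codebook[idx]

def multistage_encode(x, codebooks):
    """Each stage quantizes the residual left by the preceding stages (Eq. 10.11).
    Only the list of stage indices needs to be transmitted."""
    residual = np.asarray(x, dtype=float)
    indices = []
    for cb in codebooks:
        idx, y = vq(residual, cb)
        indices.append(idx)
        residual = residual - y
    return indices

def multistage_decode(indices, codebooks):
    """The reconstruction is the sum of the stage output points (Eq. 10.12)."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))
```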
  • 337. 314 10 V E C T O R Q U A N T I Z A T I O N Index X Q1 Y1 + − + − + − Index Q2 Y2 Y3 Index Q3 F I G U R E 10. 27 A three-stage vector quantizer. If we have K stages, and the codebook size of the nth-stage vector quantizer is Ln, then the effective size of the overall codebook is L1 ×L2 ×···×LK. However, we need to store only L1 + L2 + ··· + LK vectors, which is also the number of comparisons required. Suppose we have a five-stage vector quantizer, each with a codebook size of 32, meaning that we would have to store 160 codewords. This would provide an effective codebook size of 325 = 33554432. The computational savings are also of the same order. This approach allows us to use vector quantization at much higher rates than we could otherwise. However, at rates at which it is feasible to use LBG vector quantizers, the performance of the multistage vector quantizers is generally lower than the LBG vector quantizers [5]. The reason for this is that after the first few stages, much of the structure used by the vector quantizer has been removed, and the vector quantization advantage that depends on this structure is not available. Details on the design of residual vector quantizers can be found in [148, 149]. There may be some vector inputs that can be well represented by fewer stages than others. A multistage vector quantizer with a variable number of stages can be implemented by extending the idea of recursively indexed scalar quantization to vectors. It is not possible to do this directly because there are some fundamental differences between scalar and vector quantizers. The input to a scalar quantizer is assumed to be iid. On the other hand, the vector quantizer can be viewed as a pattern-matching algorithm [150]. The input is assumed to be one of a number of different patterns. The scalar quantizer is used after the redundancy has been removed from the source sequence, while the vector quantizer takes advantage of the redundancy in the data. With these differences in mind, the recursively indexed vector quantizer (RIVQ) can be described as a two-stage process. The first stage performs the normal pattern-matching function, while the second stage recursively quantizes the residual if the magnitude of the residual is greater than some prespecified threshold. The codebook of the second stage is ordered so that the magnitude of the codebook entries is a nondecreasing function of its index. We then choose an index I that will determine the mode in which the RIVQ operates. The quantization rule Q, for a given input value X, is as follows: Quantize X with the first-stage quantizer Q1. If the residual X−Q1X is below a specified threshold, then Q1X is the nearest output level.
  • 338. 10.7 Variations on the Theme 315 Otherwise, generate X1 = X − Q1X and quantize using the second-stage quantizer Q2. Check if the index J1 of the output is below the index I. If so, QX = Q1X+Q2X1 If not, form X2 = X1 −QX1 and do the same for X2 as we did for X1. This process is repeated until for some m, the index Jm falls below the index I, in which case X will be quantized to QX = Q1X+Q2X1+···+Q2XM Thus, the RIVQ operates in two modes: when the index J of the quantized input falls below a given index I and when the index J falls above the index I. Details on the design and performance of the recursively indexed vector quantizer can be found in [151, 152]. 10 . 7 . 5 A d a p t i v e V e c t o r Q u a n t i z a t i o n While LBG vector quantizers function by using the structure in the source output, this reliance on the use of the structure can also be a drawback when the characteristics of the source change over time. For situations like these, we would like to have the quantizer adapt to the changes in the source output. For mean-removed and gain-shape vector quantizers, we can adapt the scalar aspect of the quantizer, that is, the quantization of the mean or the gain using the techniques discussed in the previous chapter. In this section, we look at a few approaches to adapting the codebook of the vector quantizer to changes in the characteristics of the input. One way of adapting the codebook to changing input characteristics is to start with a very large codebook designed to accommodate a wide range of source characteristics [153]. This large codebook can be ordered in some manner known to both transmitter and receiver. Given a sequence of input vectors to be quantized, the encoder can select a subset of the larger codebook to be used. Information about which vectors from the large codebook were used can be transmitted as a binary string. For example, if the large codebook contained 10 vectors, and the encoder was to use the second, third, fifth, and ninth vectors, we would send the binary string 0110100010, with a 1 representing the position of the codeword used in the large codebook. This approach permits the use of a small codebook that is matched to the local behavior of the source. This approach can be used with particular effectiveness with the recursively indexed vector quantizer [151]. Recall that in the recursively indexed vector quantizer, the quantized output is always within a prescribed distance of the inputs, determined by the index I. This means that the set of output values of the RIVQ can be viewed as an accurate representation of the inputs and their statistics. Therefore, we can treat a subset of the output set of the previous intervals as our large codebook. We can then use the method described in [153] to
  • 339. 316 10 V E C T O R Q U A N T I Z A T I O N inform the receiver of which elements of the previous outputs form the codebook for the next interval. This method (while not the most efficient) is quite simple. Suppose an output set, in order of first appearance, is paqsltr, and the desired codebook for the interval to be encoded is aqlr. Then we would transmit the binary string 0110101 to the receiver. The 1s correspond to the letters in the output set, which would be elements of the desired codebook. We select the subset for the current interval by finding the closest vectors from our collection of past outputs to the input vectors of the current set. This means that there is an inherent delay of one interval imposed by this approach. The overhead required to send the codebook selection is M/N, where M is the number of vectors in the output set and N is the interval size. Another approach to updating the codebook is to check the distortion incurred while quantizing each input vector. Whenever this distortion is above some specified threshold, a different higher-rate mechanism is used to encode the input. The higher-rate mechanism might be the scalar quantization of each component, or the use of a high-rate lattice vector quantizer. This quantized representation of the input is transmitted to the receiver and, at the same time, added to both the encoder and decoder codebooks. In order to keep the size of the codebook the same, an entry must be discarded when a new vector is added to the codebook. Selecting an entry to discard is handled in a number of different ways. Variations of this approach have been used for speech coding, image coding, and video coding (see [154, 155, 156, 157, 158] for more details). 10 . 8 Trellis-Coded Quantization Finally, we look at a quantization scheme that appears to be somewhat different from other vector quantization schemes. In fact, some may argue that it is not a vector quantizer at all. However, the trellis-coded quantization (TCQ) algorithm gets its performance advantage by exploiting the statistical structure exploited by the lattice vector quantizer. Therefore, we can argue that it should be classified as a vector quantizer. The trellis-coded quantization algorithm was inspired by the appearance of a revolution- ary concept in modulation called trellis-coded modulation (TCM). The TCQ algorithm and its entropy-constrained variants provide some of the best performance when encoding ran- dom sources. This quantizer can be viewed as a vector quantizer with very large dimension, but a restricted set of values for the components of the vectors. Like a vector quantizer, the TCQ quantizes sequences of source outputs. Each element of a sequence is quantized using 2R reconstruction levels selected from a set of 2R+1 reconstruction levels, where R is the number of bits per sample used by a trellis-coded quantizer. The 2R element subsets are predefined; which particular subset is used is based on the reconstruction level used to quantize the previous quantizer input. However, the TCQ algorithm allows us to postpone a decision on which reconstruction level to use until we can look at a sequence of decisions. This way we can select the sequence of decisions that gives us the lowest amount of average distortion. Let’s take the case of a 2-bit quantizer. As described above, this means that we will need 23 , or 8, reconstruction levels. Let’s label these reconstruction levels as shown in Figure 10.28. 
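Before going further into trellis-coded quantization, the short sketch below illustrates the subset-signalling bookkeeping of Section 10.7.5 above: for one block of inputs, the encoder marks which entries of a large, ordered codebook are actually used and sends one bit per entry as side information, just as the string 0110101 signalled the subset aqlr of the output set paqsltr. The scalar "vectors", the sizes, and the data are placeholders chosen only to show the mechanics; they are not from the programs accompanying the book.

/* Sketch of the codebook-subset signalling idea of Section 10.7.5:
   for a block of input vectors, mark which entries of a large, ordered
   codebook are actually used and send one bit per entry.
   Toy scalar "vectors" and sizes, for illustration only. */
#include <stdio.h>
#include <math.h>

#define LARGE 10     /* size of the large codebook  */
#define BLOCK 4      /* input vectors in this block */

int main(void)
{
    double codebook[LARGE] = {-4, -3, -2, -1, 0, 1, 2, 3, 4, 5};
    double input[BLOCK]    = {-2.9, 1.2, -3.1, 3.8};
    int used[LARGE] = {0};

    /* mark the nearest large-codebook entry for every input vector */
    for (int n = 0; n < BLOCK; n++) {
        int best = 0;
        for (int i = 1; i < LARGE; i++)
            if (fabs(input[n] - codebook[i]) <
                fabs(input[n] - codebook[best]))
                best = i;
        used[best] = 1;
    }

    /* the bit string sent as side information: a 1 in position i means
       entry i of the large codebook is in the small codebook */
    printf("side information: ");
    for (int i = 0; i < LARGE; i++)
        printf("%d", used[i]);
    printf("\n");
    return 0;
}

The overhead is one bit per large-codebook entry per block, which is the M/N figure quoted in the text once the output set is used as the large codebook.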
The set of reconstruction levels is partitioned into two subsets: one consisting
  • 340. 10.8 Trellis-Coded Quantization 317 Q1,1 Q3,1 Set #1 Set #2 Q1,2 Q2,2 Q3,2 Q0,1 Q2,1 Q0,2 F I G U R E 10. 28 Reconstruction levels for a 2-bit trellis-coded quantizer. of the reconstruction values labeled Q0i and Q2i, and the remainder comprising the second set. We use the first set to perform the quantization if the previous quantization level was one labeled Q0i or Q1i; otherwise, we use the second set. Because the current reconstructed value defines the subset that can be used to perform the quantization on the next input, sometimes it may be advantageous to actually accept more distortion than necessary for the current sample in order to have less distortion in the next quantization step. In fact, at times it may be advantageous to accept poor quantization for several samples so that several samples down the line the quantization can result in less distortion. If you have followed this reasoning, you can see how we might be able to get lower overall distortion by looking at the quantization of an entire sequence of source outputs. The problem with delaying a decision is that the number of choices increases exponentially with each sample. In the 2-bit example, for the first sample we have four choices; for each of these four choices we have four choices for the second sample. For each of these 16 choices we have four choices for the third sample, and so on. Luckily, there is a technique that can be used to keep this explosive growth of choices under control. The technique, called the Viterbi algorithm [159], is widely used in error control coding. In order to explain how the Viterbi algorithm works, we will need to formalize some of what we have been discussing. The sequence of choices can be viewed in terms of a state diagram. Let’s suppose we have four states: S0, S1, S2, and S3. We will say we are in state Sk if we use the reconstruction levels Qk1 or Qk2. Thus, if we use the reconstruction levels Q0i, we are in state S0. We have said that we use the elements of Set #1 if the previous quantization levels were Q0i or Q1i. As Set #1 consists of the quantization levels Q0i and Q2i, this means that we can go from state S0 and S1 to states S0 and S2. Similarly, from states S2 and S3 we can only go to states S1 and S3. The state diagram can be drawn as shown in Figure 10.29.
  • 341. 318 10 V E C T O R Q U A N T I Z A T I O N S3 S0 S1 S2 F I G U R E 10. 29 State diagram for the selection process. Let’s suppose we go through two sequences of choices that converge to the same state, after which both sequences are identical. This means that the sequence of choices that had incurred a higher distortion at the time the two sequences converged will have a higher distortion from then on. In the end we will select the sequence of choices that results in the lowest distortion; therefore, there is no point in continuing to keep track of a sequence that we will discard anyway. This means that whenever two sequences of choices converge, we can discard one of them. How often does this happen? In order to see this, let’s introduce time into our state diagram. The state diagram with the element of time introduced into it is called a trellis diagram. The trellis for this particular example is shown in Figure 10.30. At each time instant, we can go from one state to two other states. And, at each step we S0 S2 S1 S3 F I G U R E 10. 30 Trellis diagram for the selection process.
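The sketch below carries out the survivor pruning just described for the 2-bit trellis-coded quantizer: four states, two allowed successors per state, and a single surviving path kept at each state after every input sample. It uses the absolute difference as the distortion measure; applied to the sequence 0.2, 1.6, 2.3 of Example 10.8.1 below, it selects the path S3, S1, S2 with an accumulated distortion of 1.0. The code is an illustrative rendering of the search, with the reconstruction levels of Figure 10.32 written in directly.

/* Survivor search for the 2-bit TCQ of Figure 10.28: four states,
   two allowed successors per state, one surviving path per state.
   Absolute difference is used as the distortion measure. */
#include <stdio.h>
#include <math.h>

#define NSTATES 4
#define NSAMP   3

/* reconstruction levels Qk,1 and Qk,2 for state Sk (Figure 10.32) */
static const double level[NSTATES][2] = {
    {-3.5, 0.5}, {-2.5, 1.5}, {-1.5, 2.5}, {-0.5, 3.5}
};
/* allowed next states: S0,S1 -> {S0,S2};  S2,S3 -> {S1,S3} */
static const int next_state[NSTATES][2] = {
    {0, 2}, {0, 2}, {1, 3}, {1, 3}
};

int main(void)
{
    const double x[NSAMP] = {0.2, 1.6, 2.3};
    double cost[NSTATES];
    int from[NSAMP][NSTATES];              /* survivor bookkeeping */

    /* first sample: every state is still reachable */
    for (int s = 0; s < NSTATES; s++) {
        double d0 = fabs(x[0] - level[s][0]);
        double d1 = fabs(x[0] - level[s][1]);
        cost[s] = d0 < d1 ? d0 : d1;
        from[0][s] = -1;
    }
    /* remaining samples: extend each survivor, keep the cheaper of the
       two paths that merge at every state */
    for (int n = 1; n < NSAMP; n++) {
        double newcost[NSTATES];
        for (int s = 0; s < NSTATES; s++) newcost[s] = 1e30;
        for (int s = 0; s < NSTATES; s++)
            for (int b = 0; b < 2; b++) {
                int t = next_state[s][b];
                double d0 = fabs(x[n] - level[t][0]);
                double d1 = fabs(x[n] - level[t][1]);
                double c  = cost[s] + (d0 < d1 ? d0 : d1);
                if (c < newcost[t]) { newcost[t] = c; from[n][t] = s; }
            }
        for (int s = 0; s < NSTATES; s++) cost[s] = newcost[s];
    }
    /* pick the best terminal state and trace the survivor back */
    int best = 0;
    for (int s = 1; s < NSTATES; s++) if (cost[s] < cost[best]) best = s;
    int path[NSAMP]; path[NSAMP - 1] = best;
    for (int n = NSAMP - 1; n > 0; n--) path[n - 1] = from[n][path[n]];

    printf("states: S%d S%d S%d   distortion: %.1f\n",
           path[0], path[1], path[2], cost[best]);
    return 0;
}

Because two paths merge at every state at every step, only four survivors ever need to be kept, no matter how long the input sequence is; this is the point made above.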
  • 342. 10.8 Trellis-Coded Quantization 319 S0 S2 S1 S3 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 F I G U R E 10. 31 Trellis diagram for the selection process with binary labels for the state transitions. have two sequences that converge to each state. If we discard one of the two sequences that converge to each state, we can see that, no matter how long a sequence of decisions we use, we will always end up with four sequences. Notice that, assuming the initial state is known to the decoder, any path through this particular trellis can be described to the decoder using 1 bit per sample. From each state we can only go to two other states. In Figure 10.31, we have marked the branches with the bits used to signal that transition. Given that each state corresponds to two quantization levels, specifying the quantization level for each sample would require an additional bit, resulting in a total of 2 bits per sample. Let’s see how all this works together in an example. E x a m p l e 10 . 8 . 1: Using the quantizer whose quantization levels are shown in Figure 10.32, we will quantize the sequence of values 0.2, 1.6, 2.3. For the distortion measure we will use the sum of absolute differences. If we simply used the quantization levels marked as Set #1 in Figure 10.28, we would quantize 0.2 to the reconstruction value 0.5, for a distortion of 0.3. The second sample value of 1.6 would be quantized to 2.5, and the third sample value of 2.3 would also be quantized to 2.5, resulting in a total distortion of 1.4. If we used Set #2 to quantize these values, we would end up with a total distortion of 1.6. Let’s see how much distortion results when using the TCQ algorithm. We start by quantizing the first sample using the two quantization levels Q01 and Q02. The reconstruction level Q02, or 0.5, is closer and results in an absolute difference of 0.3. We mark this on the first node corresponding to S0. We then quantize the first sample using Q1,1 Q3,1 Q1,2 Q2,2 Q3,2 Q0,1 Q2,1 Q0,2 −2.5 −0.5 1.5 2.5 3.5 −3.5 −1.5 0.5 F I G U R E 10. 32 Reconstruction levels for a 2-bit trellis-coded quantizer.
  • 343. 320 10 V E C T O R Q U A N T I Z A T I O N S0 S2 S1 S3 0 0.3 1.7 1.3 0.7 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 F I G U R E 10. 33 Quantizing the first sample. Q11 and Q12. The closest reconstruction value is Q12, or 1.5, which results in a distortion value of 1.3. We mark the first node corresponding to S1. Continuing in this manner, we get a distortion value of 1.7 when we use the reconstruction levels corresponding to state S2 and a distortion value of 0.7 when we use the reconstruction levels corresponding to state S3. At this point the trellis looks like Figure 10.33. Now we move on to the second sample. Let’s first quantize the second sample value of 1.6 using the quantization levels associated with state S0. The reconstruction levels associated with state S0 are −35 and 0.5. The closest value to 1.6 is 0.5. This results in an absolute difference for the second sample of 1.1. We can reach S0 from S0 and from S1. If we accept the first sample reconstruction corresponding to S0, we will end up with an accumulated distortion of 1.4. If we accept the reconstruction corresponding to state S1, we get an accumulated distortion of 2.4. Since the accumulated distortion is less if we accept the transition from state S0, we do so and discard the transition from state S1. Continuing in this fashion for the remaining states, we end up with the situation depicted in Figure 10.34. The sequence of decisions that have been terminated are shown by an X on the branch corresponding to the particular transition. The accumulated distortion is listed at each node. Repeating this procedure for the third sample value of 2.3, we obtain the S0 S2 S1 0 0.3 1.7 X X X X 1.3 0.7 1.4 1.2 0.8 2.6 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 F I G U R E 10. 34 Quantizing the second sample.
  • 344. 10.9 Summary 321 S0 S2 S1 S3 0 0.3 1.7 X X X X X X 1.3 0.7 2.6 1.0 2.0 2.4 1.4 X X 1.2 0.8 2.6 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 F I G U R E 10. 35 Quantizing the third sample. trellis shown in Figure 10.35. If we wanted to terminate the algorithm at this time, we could pick the sequence of decisions with the smallest accumulated distortion. In this particular example, the sequence would be S3, S1, S2. The accumulated distortion is 1.0, which is less than what we would have obtained using either Set #1 or Set #2. 10 . 9 Summary In this chapter we introduced the technique of vector quantization. We have seen how we can make use of the structure exhibited by groups, or vectors, of values to obtain compression. Because there are different kinds of structure in different kinds of data, there are a number of different ways to design vector quantizers. Because data from many sources, when viewed as vectors, tend to form clusters, we can design quantizers that essentially consist of representations of these clusters. We also described aspects of the design of vector quantizers and looked at some applications. Recent literature in this area is substantial, and we have barely skimmed the surface of the large number of interesting variations of this technique. F u r t h e r R e a d i n g The subject of vector quantization is dealt with extensively in the book Vector Quantization and Signal Compression, by A. Gersho and R.M. Gray [5]. There is also an excellent collection of papers called Vector Quantization, edited by H. Abut and published by IEEE Press [143]. There are a number of excellent tutorial articles on this subject: 1. “Vector Quantization,” by R.M. Gray, in the April 1984 issue of IEEE Acoustics, Speech, and Signal Processing Magazine [160]. 2. “Vector Quantization: A Pattern Matching Technique for Speech Coding,” by A. Gersho and V. Cuperman, in the December 1983 issue of IEEE Communications Magazine [150].
  • 345. 322 10 V E C T O R Q U A N T I Z A T I O N 3. “Vector Quantization in Speech Coding,” by J. Makhoul, S. Roucos, and H. Gish, in the November 1985 issue of the Proceedings of the IEEE [161]. 4. “Vector Quantization,” by P.F. Swaszek, in Communications and Networks, edited by I.F. Blake and H.V. Poor [162]. 5. A survey of various image-coding applications of vector quantization can be found in “Image Coding Using Vector Quantization: A Review,” by N.M. Nasrabadi and R.A. King, in the August 1988 issue of the IEEE Transactions on Communications [163]. 6. A thorough review of lattice vector quantization can be found in “Lattice Quantiza- tion,” by J.D. Gibson and K. Sayood, in Advances in Electronics and Electron Physics [140]. The area of vector quantization is an active one, and new techniques that use vector quantization are continually being developed. The journals that report work in this area include IEEE Transactions on Information Theory, IEEE Transactions on Communications, IEEE Transactions on Signal Processing, and IEEE Transactions on Image Processing, among others. 10 . 10 Projects and Problems 1. In Example 10.3.2 we increased the SNR by about 03 dB by moving the top-left output point to the origin. What would happen if we moved the output points at the four corners to the positions ± 0, 0± . As in the example, assume the input has a Laplacian distribution with mean zero and variance one, and = 07309. You can obtain the answer analytically or through simulation. 2. For the quantizer of the previous problem, rather than moving the output points to ± 0 and 0± , we could have moved them to other positions that might have provided a larger increase in SNR. Write a program to test different (reasonable) possibilities and report on the best and worst cases. 3. In the program trainvq.c the empty cell problem is resolved by replacing the vector with no associated training set vectors with a training set vector from the quantization region with the largest number of vectors. In this problem, we will investigate some possible alternatives. Generate a sequence of pseudorandom numbers with a triangular distribution between 0 and 2. (You can obtain a random number with a triangular distribution by adding two uniformly distributed random numbers.) Design an eight-level, two-dimensional vector quantizer with the initial codebook shown in Table 10.9. (a) Use the trainvq program to generate a codebook with 10,000 random numbers as the training set. Comment on the final codebook you obtain. Plot the elements of the codebook and discuss why they ended up where they did. (b) Modify the program so that the empty cell vector is replaced with a vector from the quantization region with the largest distortion. Comment on any changes in
  • 346. 10.10 Projects and Problems 323 T A B L E 10 . 9 Initial codebook for Problem 3. 1 1 1 2 1 05 05 1 05 05 15 1 2 5 3 3 the distortion (or lack of change). Is the final codebook different from the one you obtained earlier? (c) Modify the program so that whenever an empty cell problem arises, a two-level quantizer is designed for the quantization region with the largest number of output points. Comment on any differences in the codebook and distortion from the previous two cases. 4. Generate a 16-dimensional codebook of size 64 for the Sena image. Construct the vector as a 4 × 4 block of pixels, an 8 × 2 block of pixels, and a 16 × 1 block of pixels. Comment on the differences in the mean squared errors and the quality of the reconstructed images. You can use the program trvqsp_img to obtain the codebooks. 5. In Example 10.6.1 we designed a 60-level two-dimensional quantizer by taking the two-dimensional representation of an 8-level scalar quantizer, removing 12 output points from the 64 output points, and adding 8 points in other locations. Assume the input is Laplacian with zero mean and unit variance, and = 07309. (a) Calculate the increase in the probability of overload by the removal of the 12 points from the original 64. (b) Calculate the decrease in overload probability when we added the 8 new points to the remaining 52 points. 6. In this problem we will compare the performance of a 16-dimensional pyramid vector quantizer and a 16-dimensional LBG vector quantizer for two different sources. In each case the codebook for the pyramid vector quantizer consists of 272 elements: 32 vectors with 1 element equal to ± , and the other 15 equal to zero, and 240 vectors with 2 elements equal to ± and the other 14 equal to zero. The value of should be adjusted to give the best performance. The codebook for the LBG vector quantizer will be obtained by using the program trvqsp_img on the source output. You will have to modify trvqsp_img slightly to give you a codebook that is not a power of two.
  • 347. 324 10 V E C T O R Q U A N T I Z A T I O N (a) Use the two quantizers to quantize a sequence of 10,000 zero mean unit variance Laplacian random numbers. Using either the mean squared error or the SNR as a measure of performance, compare the performance of the two quantizers. (b) Use the two quantizers to quantize the Sinan image. Compare the two quantizers using either the mean squared error or the SNR and the reconstructed image. Compare the difference between the performance of the two quantizers with the difference when the input was random.
  • 348. 11 Differential Encoding 11 . 1 Overview S ources such as speech and images have a great deal of correlation from sample to sample. We can use this fact to predict each sample based on its past and only encode and transmit the differences between the prediction and the sample value. Differential encoding schemes are built around this premise. Because the prediction techniques are rather simple, these schemes are much easier to implement than other compression schemes. In this chapter, we will look at various components of differential encoding schemes and study how they are used to encode sources—in particular, speech. We will also look at a widely used international differential encoding standard for speech encoding. 11 . 2 Introduction In the last chapter we looked at vector quantization—a rather complex scheme requiring a significant amount of computational resources—as one way of taking advantage of the structure in the data to perform lossy compression. In this chapter, we look at a different approach that uses the structure in the source output in a slightly different manner, resulting in a significantly less complex system. When we design a quantizer for a given source, the size of the quantization interval depends on the variance of the input. If we assume the input is uniformly distributed, the variance depends on the dynamic range of the input. In turn, the size of the quantization interval determines the amount of quantization noise incurred during the quantization process. In many sources of interest, the sampled source output xn does not change a great deal from one sample to the next. This means that both the dynamic range and the variance of the sequence of differences dn = xn −xn−1 are significantly smaller than that of the source output sequence. Furthermore, for correlated sources the distribution of dn is highly peaked
  • 349. 326 11 D I F F E R E N T I A L E N C O D I N G at zero. We made use of this skew, and resulting loss in entropy, for the lossless compression of images in Chapter 7. Given the relationship between the variance of the quantizer input and the incurred quantization error, it is also useful, in terms of lossy compression, to look at ways to encode the difference from one sample to the next rather than encoding the actual sample value. Techniques that transmit information by encoding differences are called differential encoding techniques. E x a m p l e 11 . 2 . 1: Consider the half cycle of a sinusoid shown in Figure 11.1 that has been sampled at the rate of 30 samples per cycle. The value of the sinusoid ranges between 1 and −1. If we wanted to quantize the sinusoid using a uniform four-level quantizer, we would use a step size of 0.5, which would result in quantization errors in the range −025025. If we take the sample-to-sample differences (excluding the first sample), the differences lie in the range −0202. To quantize this range of values with a four-level quantizer requires a step size of 0.1, which results in quantization noise in the range −005005. 1.0 −1.0 −0.2 0.2 0.4 0.6 0.8 0 −0.4 −0.6 −0.8 1 0 2 3 4 5 6 Original Difference F I G U R E 11. 1 Sinusoid and sample-to-sample differences. The sinusoidal signal in the previous example is somewhat contrived. However, if we look at some of the real-world sources that we want to encode, we see that the dynamic range that contains most of the differences is significantly smaller than the dynamic range of the source output. E x a m p l e 11 . 2 . 2: Figure 11.2 is the histogram of the Sinan image. Notice that the pixel values vary over almost the entire range of 0 to 255. To represent these values exactly, we need 8 bits per
  • 350. 11.2 Introduction 327 1200 1000 800 600 400 200 0 0 50 100 150 200 250 F I G U R E 11. 2 Histogram of the Sinan image. 8000 7000 6000 5000 4000 1000 0 3000 2000 –100 –50 0 50 100 F I G U R E 11. 3 Histogram of pixel-to-pixel differences of the Sinan image. pixel. To represent these values in a lossy manner to within an error in the least significant bit, we need 7 bits per pixel. Figure 11.3 is the histogram of the differences. More than 99% of the pixel values lie in the range −31 to 31. Therefore, if we were willing to accept distortion in the least significant bit, for more than 99% of the difference values we need 5 bits per pixel rather than 7. In fact, if we were willing to have a small percentage of the differences with a larger error, we could get by with 4 bits for each difference value.
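As a quick numerical check of the range argument, the sketch below samples a half cycle of a sinusoid at 30 samples per cycle, as in Example 11.2.1, and prints the range of the samples against the range of the sample-to-sample differences. The choice of a cosine half cycle and the output format are illustrative assumptions.

/* Range of samples versus range of sample-to-sample differences for a
   half cycle of a sinusoid sampled at 30 samples per cycle
   (in the spirit of Example 11.2.1; details are illustrative). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double PI = 3.14159265358979323846;
    const int N = 15;                 /* half of 30 samples per cycle */
    double x[16];
    double xmax = -1e9, xmin = 1e9, dmax = -1e9, dmin = 1e9;

    for (int n = 0; n <= N; n++) {
        x[n] = cos(PI * n / N);       /* half cycle, runs from 1 to -1 */
        if (x[n] > xmax) xmax = x[n];
        if (x[n] < xmin) xmin = x[n];
        if (n > 0) {
            double d = x[n] - x[n - 1];
            if (d > dmax) dmax = d;
            if (d < dmin) dmin = d;
        }
    }
    printf("samples     lie in [%+.2f, %+.2f]\n", xmin, xmax);
    printf("differences lie in [%+.2f, %+.2f]\n", dmin, dmax);
    /* a uniform four-level quantizer needs a step of (max-min)/4 in each
       case, so the differences can be coded with a much smaller step */
    return 0;
}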
In both examples, we have shown that the dynamic range of the differences between samples is substantially less than the dynamic range of the source output. In the following sections we describe encoding schemes that take advantage of this fact to provide improved compression performance.

11.3 The Basic Algorithm

Although it takes fewer bits to encode differences than it takes to encode the original pixel, we have not said whether it is possible to recover an acceptable reproduction of the original sequence from the quantized difference values. When we were looking at lossless compression schemes, we found that if we encoded and transmitted the first value of a sequence, followed by the encoding of the differences between samples, we could losslessly recover the original sequence. Unfortunately, a strictly analogous situation does not exist for lossy compression.

Example 11.3.1: Suppose a source puts out the sequence

6.2  9.7  13.2  5.9  8.0  7.4  4.2  1.8

We could generate the following sequence by taking the difference between samples (assume that the first sample value is zero):

6.2  3.5  3.5  -7.3  2.1  -0.6  -3.2  -2.4

If we losslessly encoded these values, we could recover the original sequence at the receiver by adding back the difference values. For example, to obtain the second reconstructed value, we add the difference 3.5 to the first received value 6.2 to obtain a value of 9.7. The third reconstructed value can be obtained by adding the received difference value of 3.5 to the second reconstructed value of 9.7, resulting in a value of 13.2, which is the same as the third value in the original sequence. Thus, by adding the nth received difference value to the (n-1)th reconstructed value, we can recover the original sequence exactly.

Now let us look at what happens if these difference values are encoded using a lossy scheme. Suppose we had a seven-level quantizer with output values -6, -4, -2, 0, 2, 4, 6. The quantized sequence would be

6  4  4  -6  2  0  -4  -2

If we follow the same procedure for reconstruction as we did for the lossless compression scheme, we get the sequence

6  10  14  8  10  10  6  4

The difference, or error, between the original sequence and the reconstructed sequence is

0.2  -0.3  -0.8  -2.1  -2.0  -2.6  -1.8  -2.2
Notice that initially the magnitudes of the error are quite small (0.2, 0.3). As the reconstruction progresses, the magnitudes of the error become significantly larger (2.6, 1.8, 2.2).

To see what is happening, consider a sequence xn. A difference sequence dn is generated by taking the differences dn = xn - xn-1. This difference sequence is quantized to obtain the sequence d̂n:

d̂n = Q[dn] = dn + qn

where qn is the quantization error. At the receiver, the reconstructed sequence x̂n is obtained by adding d̂n to the previous reconstructed value x̂n-1:

x̂n = x̂n-1 + d̂n

Let us assume that both transmitter and receiver start with the same value x0, that is, x̂0 = x0. Follow the quantization and reconstruction process for the first few samples:

d1 = x1 - x0                                        (11.1)
d̂1 = Q[d1] = d1 + q1                                (11.2)
x̂1 = x0 + d̂1 = x0 + d1 + q1 = x1 + q1               (11.3)
d2 = x2 - x1                                        (11.4)
d̂2 = Q[d2] = d2 + q2                                (11.5)
x̂2 = x̂1 + d̂2 = x1 + q1 + d2 + q2                    (11.6)
   = x2 + q1 + q2                                   (11.7)

Continuing this process, at the nth iteration we get

x̂n = xn + Σ(k=1 to n) qk                            (11.8)

We can see that the quantization error accumulates as the process continues. Theoretically, if the quantization error process is zero mean, the errors will cancel each other out in the long run. In practice, often long before that can happen, the finite precision of the machines causes the reconstructed value to overflow.

Notice that the encoder and decoder are operating with different pieces of information. The encoder generates the difference sequence based on the original sample values, while the decoder adds back the quantized difference onto a distorted version of the original signal. We can solve this problem by forcing both encoder and decoder to use the same information during the differencing and reconstruction operations. The only information available to the receiver about the sequence xn is the reconstructed sequence x̂n. As this information is also available to the transmitter, we can modify the differencing operation to use the reconstructed value of the previous sample, instead of the previous sample itself, that is,

dn = xn - x̂n-1                                      (11.9)
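The sketch below runs both differencing rules on the data of Example 11.3.1 with the seven-level quantizer used there. The open-loop branch differences the original samples and reproduces the drifting error sequence above; the closed-loop branch differences against the previous reconstruction, as in (11.9), so only the quantization error of the current difference appears in each output sample. The rounding rule used for the quantizer is an assumption chosen to be consistent with the quantized values listed in the example.

/* Open-loop versus closed-loop differencing on the data of
   Example 11.3.1, using the seven-level quantizer -6,...,6.
   Errors are printed as (original - reconstruction), as in the text. */
#include <stdio.h>
#include <math.h>

/* seven-level quantizer with outputs -6, -4, -2, 0, 2, 4, 6 */
static double q7(double d)
{
    double q = 2.0 * floor(d / 2.0 + 0.5);   /* nearest multiple of 2 */
    if (q >  6.0) q =  6.0;                  /* clamp to outer levels */
    if (q < -6.0) q = -6.0;
    return q;
}

int main(void)
{
    const double x[8] = {6.2, 9.7, 13.2, 5.9, 8.0, 7.4, 4.2, 1.8};
    double open = 0.0, closed = 0.0;         /* previous reconstructions */

    printf("  n      x   open-loop err   closed-loop err\n");
    for (int n = 0; n < 8; n++) {
        /* open loop: difference of the original samples */
        double prev_x = (n == 0) ? 0.0 : x[n - 1];
        open += q7(x[n] - prev_x);
        /* closed loop, Eq. (11.9): difference against the reconstruction */
        closed += q7(x[n] - closed);
        printf("%3d %6.1f %14.1f %16.1f\n",
               n + 1, x[n], x[n] - open, x[n] - closed);
    }
    return 0;
}

The open-loop column drifts toward errors larger than two, while the closed-loop column stays at the size of a single quantization error, which is the point of the modification in (11.9).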
  • 353. 330 11 D I F F E R E N T I A L E N C O D I N G Using this new differencing operation, let’s repeat our examination of the quantization and reconstruction process. We again assume that x̂0 = x0. d1 = x1 −x0 (11.10) ˆ d1 = Qd1 = d1 +q1 (11.11) x̂1 = x0 +d̂1 = x0 +d1 +q1 = x1 +q1 (11.12) d2 = x2 −x̂1 (11.13) ˆ d2 = Qd2 = d2 +q2 (11.14) ˆ x2 = x̂1 +d̂2 = x̂1 +d2 +q2 (11.15) = x2 +q2 (11.16) At the nth iteration we have x̂n = xn +qn (11.17) and there is no accumulation of the quantization noise. In fact, the quantization noise in the nth reconstructed sequence is the quantization noise incurred by the quantization of the nth difference. The quantization error for the difference sequence is substantially less than the quantization error for the original sequence. Therefore, this procedure leads to an overall reduction of the quantization error. If we are satisfied with the quantization error for a given number of bits per sample, then we can use fewer bits with a differential encoding procedure to attain the same distortion. E x a m p l e 11 . 3 . 2: Let us try to quantize and then reconstruct the sinusoid of Example 11.2.1 using the two different differencing approaches. Using the first approach, we get a dynamic range of 1.0 0.6 0.4 0 –0.2 –0.8 –1.0 0.8 0.2 –0.4 –0.6 0.5 1.0 1.5 2.0 2.5 3.0 Original Approach 2 Approach 1 + + + + + + + + + + + + + + F I G U R E 11. 4 Sinusoid and reconstructions.
  • 354. 11.3 The Basic Algorithm 331 differences from −02 to 0.2. Therefore, we use a quantizer step size of 0.1. In the second approach, the differences lie in the range −0404. In order to cover this range, we use a step size in the quantizer of 0.2. The reconstructed signals are shown in Figure 11.4. Notice in the first case that the reconstruction diverges from the signal as we process more and more of the signal. Although the second differencing approach uses a larger step size, this approach provides a more accurate representation of the input. A block diagram of the differential encoding system as we have described it to this point is shown in Figure 11.5. We have drawn a dotted box around the portion of the encoder that mimics the decoder. The encoder must mimic the decoder in order to obtain a copy of the reconstructed sample used to generate the next difference. We would like our difference value to be as small as possible. For this to happen, given the system we have described to this point, x̂n−1 should be as close to xn as possible. However, x̂n−1 is the reconstructed value of xn−1; therefore, we would like x̂n−1 to be close to xn−1. Unless xn−1 is always very close to xn, some function of past values of the reconstructed sequence can often provide a better prediction of xn. We will look at some of these predictor functions later in this chapter. For now, let’s modify Figure 11.5 and replace the delay block with a predictor block to obtain our basic differential encoding system as shown in Figure 11.6. The output of the predictor is the prediction sequence pn given by pn = fx̂n−1x̂n−2 x̂0 (11.18) This basic differential encoding system is known as the differential pulse code modulation (DPCM) system. The DPCM system was developed at Bell Laboratories a few years after World War II [164]. It is most popular as a speech-encoding system and is widely used in telephone communications. As we can see from Figure 11.6, the DPCM system consists of two major components, the predictor and the quantizer. The study of DPCM is basically the study of these two components. In the following sections, we will look at various predictor and quantizer designs and see how they function together in a differential encoding system. Q Delay Encoder Decoder dn xn xn xn−1 xn−1 dn + − + + dn xn ^ ^ ^ ^ ^ ^ + + F I G U R E 11. 5 A simple differential encoding system.
  • 355. 332 11 D I F F E R E N T I A L E N C O D I N G Q P P Encoder Decoder dn xn xn pn pn pn dn + – + + dn xn ^ ^ ^ ^ + + F I G U R E 11. 6 The basic algorithm. 11 . 4 Prediction in DPCM Differential encoding systems like DPCM gain their advantage by the reduction in the variance and dynamic range of the difference sequence. How much the variance is reduced depends on how well the predictor can predict the next symbol based on the past reconstructed symbols. In this section we will mathematically formulate the prediction problem. The analytical solution to this problem will give us one of the more widely used approaches to the design of the predictor. In order to follow this development, some familiarity with the mathematical concepts of expectation and correlation is needed. These concepts are described in Appendix A. Define 2 d, the variance of the difference sequence, as 2 d = Exn −pn 2 (11.19) where E is the expectation operator. As the predictor outputs pn are given by (11.18), the design of a good predictor is essentially the selection of the function f· that minimizes 2 d. One problem with this formulation is that x̂n is given by x̂n = xn +qn and qn depends on the variance of dn. Thus, by picking f· , we affect 2 d, which in turn affects the reconstruction x̂n, which then affects the selection of f· . This coupling makes an explicit solution extremely difficult for even the most well-behaved source [165]. As most real sources are far from well behaved, the problem becomes computationally intractable in most applications. We can avoid this problem by making an assumption known as the fine quantization assumption. We assume that quantizer step sizes are so small that we can replace x̂n by xn, and therefore pn = fxn−1xn−2 x0 (11.20) Once the function f· has been found, we can use it with the reconstructed values x̂n to obtain pn. If we now assume that the output of the source is a stationary process, from the study of random processes [166], we know that the function that minimizes 2 d
  • 356. 11.4 Prediction in DPCM 333 is the conditional expectation Exn xn−1xn−2 x0. Unfortunately, the assumption of stationarity is generally not true, and even if it were, finding this conditional expectation requires the knowledge of nth-order conditional probabilities, which would generally not be available. Given the difficulty of finding the best solution, in many applications we simplify the problem by restricting the predictor function to be linear. That is, the prediction pn is given by pn = N i=1 aix̂n−i (11.21) The value of N specifies the order of the predictor. Using the fine quantization assumption, we can now write the predictor design problem as follows: Find the ai so as to minimize 2 d. 2 d = E xn − N i=1 aixn−i 2 (11.22) where we assume that the source sequence is a realization of a real valued wide sense stationary process. Take the derivative of 2 d with respect to each of the ai and set this equal to zero. We get N equations and N unknowns: 2 d a1 = −2E xn − N i=1 aixn−i xn−1 = 0 (11.23) 2 d a2 = −2E xn − N i=1 aixn−i xn−2 = 0 (11.24) 2 d aN = −2E xn − N i=1 aixn−i xn−N = 0 (11.25) Taking the expectations, we can rewrite these equations as N i=1 aiRxxi−1 = Rxx1 (11.26) N i=1 aiRxxi−2 = Rxx2 (11.27) N i=1 aiRxxi−N = RxxN (11.28) where Rxxk is the autocorrelation function of xn: Rxxk = Exnxn+k (11.29)
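As an illustration of how these normal equations are used, the sketch below estimates Rxx(0), Rxx(1), and Rxx(2) by simple time averages from a synthetic correlated sequence, solves the resulting 2 x 2 system for a second-order predictor, and reports the ratio of the source variance to the residual variance. The AR(1) test data and all names are illustrative; this is not the speech data of the example that follows.

/* Second-order predictor design from the normal equations above:
   estimate Rxx(0..2) by time averages, then solve
       a1*Rxx(0) + a2*Rxx(1) = Rxx(1)
       a1*Rxx(1) + a2*Rxx(0) = Rxx(2)
   by Cramer's rule.  Synthetic AR(1) data, for illustration only. */
#include <stdio.h>
#include <stdlib.h>

#define M 4000

static double Rxx(const double *x, int m, int k)
{
    double s = 0.0;
    for (int i = 0; i < m - k; i++) s += x[i] * x[i + k];
    return s / (m - k);
}

int main(void)
{
    static double x[M];
    /* synthetic correlated source: x[n] = 0.8 x[n-1] + noise */
    srand(1);
    x[0] = 0.0;
    for (int n = 1; n < M; n++)
        x[n] = 0.8 * x[n - 1] + ((double)rand() / RAND_MAX - 0.5);

    double r0 = Rxx(x, M, 0), r1 = Rxx(x, M, 1), r2 = Rxx(x, M, 2);

    /* solve the 2x2 Wiener-Hopf system */
    double det = r0 * r0 - r1 * r1;
    double a1  = (r1 * r0 - r2 * r1) / det;
    double a2  = (r2 * r0 - r1 * r1) / det;

    /* variance of the prediction residual versus the source variance */
    double vres = 0.0;
    for (int n = 2; n < M; n++) {
        double d = x[n] - a1 * x[n - 1] - a2 * x[n - 2];
        vres += d * d;
    }
    vres /= (M - 2);

    printf("a1 = %.3f  a2 = %.3f  variance ratio = %.2f\n",
           a1, a2, r0 / vres);
    return 0;
}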
We can write these equations in matrix form as

RA = P                                              (11.30)

where

R = [ Rxx(0)     Rxx(1)     Rxx(2)     ...  Rxx(N-1)
      Rxx(1)     Rxx(0)     Rxx(1)     ...  Rxx(N-2)
      Rxx(2)     Rxx(1)     Rxx(0)     ...  Rxx(N-3)
        .          .          .                .
      Rxx(N-1)   Rxx(N-2)   Rxx(N-3)   ...  Rxx(0)  ]     (11.31)

A = [ a1  a2  a3  ...  aN ]^T                       (11.32)

P = [ Rxx(1)  Rxx(2)  Rxx(3)  ...  Rxx(N) ]^T       (11.33)

where we have used the fact that Rxx(-k) = Rxx(k) for real-valued wide sense stationary processes. These equations are referred to as the discrete form of the Wiener-Hopf equations. If we know the autocorrelation values Rxx(k) for k = 0, 1, ..., N, then we can find the predictor coefficients as

A = R^(-1) P                                        (11.34)

Example 11.4.1: For the speech sequence shown in Figure 11.7, let us find predictors of orders one, two, and three and examine their performance. We begin by estimating the autocorrelation values from the data. Given M data points, we use the following average to find the value for Rxx(k):

Rxx(k) = (1/(M-k)) Σ(i=1 to M-k) xi xi+k            (11.35)

Using these autocorrelation values, we obtain the following coefficients for the three different predictors. For N = 1, the predictor coefficient is a1 = 0.66; for N = 2, the coefficients are a1 = 0.596 and a2 = 0.096; and for N = 3, the coefficients are a1 = 0.577, a2 = -0.025, and a3 = 0.204. We used these coefficients to generate the residual sequence. In order to see the reduction in variance, we computed the ratio of the source output variance to the variance of
  • 358. 11.4 Prediction in DPCM 335 3 1 0 −2 −3 2 −1 500 1000 1500 2000 2500 3000 3500 4000 F I G U R E 11. 7 A segment of speech: a male speaker saying the word “test.” the residual sequence. For comparison, we also computed this ratio for the case where the residual sequence is obtained by taking the difference of neighboring samples. The sample-to- sample differences resulted in a ratio of 1.63. Compared to this, the ratio of the input variance to the variance of the residuals from the first-order predictor was 2.04. With a second-order predictor, this ratio rose to 3.37, and with a third-order predictor, the ratio was 6.28. The residual sequence for the third-order predictor is shown in Figure 11.8. Notice that although there has been a reduction in the dynamic range, there is still substantial structure 3 1 0 −2 −3 2 −1 500 1000 1500 2000 2500 3000 3500 4000 F I G U R E 11. 8 The residual sequence using a third-order predictor.
  • 359. 336 11 D I F F E R E N T I A L E N C O D I N G in the residual sequence, especially in the range of samples from about the 700th sample to the 2000th sample. We will look at ways of removing this structure when we discuss speech coding. Let us now introduce a quantizer into the loop and look at the performance of the DPCM system. For simplicity, we will use a uniform quantizer. If we look at the histogram of the residual sequence, we find that it is highly peaked. Therefore, we will assume that the input to the quantizer will be Laplacian. We will also adjust the step size of the quantizer based on the variance of the residual. The step sizes provided in Chapter 9 are based on the assumption that the quantizer input has a unit variance. It is easy to show that when the variance differs from unity, the optimal step size can be obtained by multiplying the step size for a variance of one with the standard deviation of the input. Using this approach for a four-level Laplacian quantizer, we obtain step sizes of 0.75, 0.59, and 0.43 for the first-, second-, and third-order predictors, and step sizes of 0.3, 0.4, and 0.5 for an eight-level Laplacian quantizer. We measure the performance using two different measures, the signal-to-noise ratio (SNR) and the signal-to-prediction-error ratio. These are defined as follows: SNRdB = M i=1 x2 i M i=1xi −x̂i 2 (11.36) SPERdB = M i=1 x2 i M i=1xi −pi 2 (11.37) The results are tabulated in Table 11.1. For comparison we have also included the results when no prediction is used; that is, we directly quantize the input. Notice the large difference between using a first-order predictor and a second-order predictor, and then the relatively minor increase when going from a second-order predictor to a third-order predictor. This is fairly typical when using a fixed quantizer. Finally, let’s take a look at the reconstructed speech signal. The speech coded using a third-order predictor and an eight-level quantizer is shown in Figure 11.9. Although the reconstructed sequence looks like the original, notice that there is significant distortion in areas where the source output values are small. This is because in these regions the input to the quantizer is close to zero. Because the quantizer does not have a zero output level, T A B L E 11 . 1 Performance of DPCM system with different predictors and quantizers. Quantizer Predictor Order SNR (dB) SPER (dB) Four-level None 2.43 0 1 3.37 2.65 2 8.35 5.9 3 8.74 6.1 Eight-level None 3.65 0 1 3.87 2.74 2 9.81 6.37 3 10.16 6.71
  • 360. 11.5 Adaptive DPCM 337 3 1 0 −2 −3 2 −1 500 1000 1500 2000 2500 3000 3500 4000 F I G U R E 11. 9 The reconstructed sequence using a third-order predictor and an eight-level uniform quantizer. the output of the quantizer flips between the two inner levels. If we listened to this signal, we would hear a hissing sound in the reconstructed signal. The speech signal used to generate this example is contained among the data sets accompanying this book in the file testm.raw. The function readau.c can be used to read the file. You are encouraged to reproduce the results in this example and listen to the resulting reconstructions. If we look at the speech sequence in Figure 11.7, we can see that there are several distinct segments of speech. Between sample number 700 and sample number 2000, the speech looks periodic. Between sample number 2200 and sample number 3500, the speech is low amplitude and noiselike. Given the distinctly different characteristics in these two regions, it would make sense to use different approaches to encode these segments. Some approaches to dealing with these issues are specific to speech coding, and we will discuss these approaches when we specifically discuss encoding speech using DPCM. However, the problem is also much more widespread than when encoding speech. A general response to the nonstationarity of the input is the use of adaptation in prediction. We will look at some of these approaches in the next section. 11 . 5 Adaptive DPCM As DPCM consists of two main components, the quantizer and the predictor, making DPCM adaptive means making the quantizer and the predictor adaptive. Recall that we can adapt a system based on its input or output. The former approach is called forward adaptation; the latter, backward adaptation. In the case of forward adaptation, the parameters of the
  • 361. 338 11 D I F F E R E N T I A L E N C O D I N G system are updated based on the input to the encoder, which is not available to the decoder. Therefore, the updated parameters have to be sent to the decoder as side information. In the case of backward adaptation, the adaptation is based on the output of the encoder. As this output is also available to the decoder, there is no need for transmission of side information. In cases where the predictor is adaptive, especially when it is backward adaptive, we generally use adaptive quantizers (forward or backward). The reason for this is that the backward adaptive predictor is adapted based on the quantized outputs. If for some reason the predictor does not adapt properly at some point, this results in predictions that are far from the input, and the residuals will be large. In a fixed quantizer, these large residuals will tend to fall in the overload regions with consequently unbounded quantization errors. The reconstructed values with these large errors will then be used to adapt the predictor, which will result in the predictor moving further and further from the input. The same constraint is not present for quantization, and we can have adaptive quantization with fixed predictors. 11 . 5 . 1 A d a p t i v e Q u a n t i z a t i o n i n D P C M In forward adaptive quantization, the input is divided into blocks. The quantizer parameters are estimated for each block. These parameters are transmitted to the receiver as side information. In DPCM, the quantizer is in a feedback loop, which means that the input to the quantizer is not conveniently available in a form that can be used for forward adaptive quantization. Therefore, most DPCM systems use backward adaptive quantization. The backward adaptive quantization used in DPCM systems is basically a variation of the backward adaptive Jayant quantizer described in Chapter 9. In Chapter 9, the Jayant algorithm was used to adapt the quantizer to a stationary input. In DPCM, the algorithm is used to adapt the quantizer to the local behavior of nonstationary inputs. Consider the speech segment shown in Figure 11.7 and the residual sequence shown in Figure 11.8. Obviously, the quantizer used around the 3000th sample should not be the same quantizer that was used around the 1000th sample. The Jayant algorithm provides an effective approach to adapting the quantizer to the variations in the input characteristics. E x a m p l e 11 . 5 . 1: Let’s encode the speech sample shown in Figure 11.7 using a DPCM system with a backward adaptive quantizer. We will use a third-order predictor and an eight-level quantizer. We will also use the following multipliers [110]: M0 = 090 M1 = 090 M2 = 125 M3 = 175 The results are shown in Figure 11.10. Notice the region at the beginning of the speech sample and between the 3000th and 3500th sample, where the DPCM system with the fixed quantizer had problems. Because the step size of the adaptive quantizer can become quite small, these regions have been nicely reproduced. However, right after this region, the speech output has a larger spike than the reconstructed waveform. This is an indication that the quantizer is not expanding rapidly enough. This can be remedied by increasing the
  • 362. 11.5 Adaptive DPCM 339 3 1 0 −2 −3 2 −1 500 1000 1500 2000 2500 3000 3500 4000 F I G U R E 11. 10 The reconstructed sequence using a third-order predictor and an eight-level Jayant quantizer. value of M3. The program used to generate this example is dpcm_aqb. You can use this program to study the behavior of the system for different configurations. 11 . 5 . 2 A d a p t i v e P r e d i c t i o n i n D P C M The equations used to obtain the predictor coefficients were derived based on the assumption of stationarity. However, we see from Figure 11.7 that this assumption is not true. In the speech segment shown in Figure 11.7, different segments have different characteristics. This is true for most sources we deal with; while the source output may be locally stationary over any significant length of the output, the statistics may vary considerably. In this situation, it is better to adapt the predictor to match the local statistics. This adaptation can be forward adaptive or backward adaptive. D P C M w i t h F o r w a r d A d a p t i v e P r e d i c t i o n ( D P C M - A P F ) In forward adaptive prediction, the input is divided into segments or blocks. In speech coding this block consists of about 16 ms of speech. At a sampling rate of 8000 samples per second, this corresponds to 128 samples per block [123, 167]. In image coding, we use an 8 × 8 block [168]. The autocorrelation coefficients are computed for each block. The predictor coefficients are obtained from the autocorrelation coefficients and quantized using a relatively high-rate quantizer. If the coefficient values are to be quantized directly, we need to use at least 12 bits per coefficient [123]. This number can be reduced considerably if we represent the predictor coefficients in terms of parcor coefficients; we will describe how to obtain
  • 363. 340 11 D I F F E R E N T I A L E N C O D I N G the parcor coefficients in Chapter 17. For now, let’s assume that the coefficients can be transmitted with an expenditure of about 6 bits per coefficient. In order to estimate the autocorrelation for each block, we generally assume that the sam- ple values outside each block are zero. Therefore, for a block length of M, the autocorrelation function for the lth block would be estimated by Rl xx k = 1 M −k lM−k i=l−1 M+1 xixi+k (11.38) for k positive, or Rl xx k = 1 M +k lM i=l−1 M+1−k xixi+k (11.39) for k negative. Notice that Rl xx k = Rl xx −k , which agrees with our initial assumption. D P C M w i t h B a c k w a r d A d a p t i v e P r e d i c t i o n ( D P C M - A P B ) Forward adaptive prediction requires that we buffer the input. This introduces delay in the transmission of the speech. As the amount of buffering is small, the use of forward adaptive prediction when there is only one encoder and decoder is not a big problem. However, in the case of speech, the connection between two parties may be several links, each of which may consist of a DPCM encoder and decoder. In such tandem links, the amount of delay can become large enough to be a nuisance. Furthermore, the need to transmit side information makes the system more complex. In order to avoid these problems, we can adapt the predictor based on the output of the encoder, which is also available to the decoder. The adaptation is done in a sequential manner [169, 167]. In our derivation of the optimum predictor coefficients, we took the derivative of the statistical average of the squared prediction error or residual sequence. In order to do this, we had to assume that the input process was stationary. Let us now remove that assumption and try to figure out how to adapt the predictor to the input algebraically. To keep matters simple, we will start with a first-order predictor and then generalize the result to higher orders. For a first-order predictor, the value of the residual squared at time n would be given by d2 n = xn −a1x̂n−1 2 (11.40) If we could plot the value of d2 n against a1, we would get a graph similar to the one shown in Figure 11.11. Let’s take a look at the derivative of d2 n as a function of whether the current value of a1 is to the left or right of the optimal value of a1—that is, the value of a1 for which d2 n is minimum. When a1 is to the left of the optimal value, the derivative is negative. Furthermore, the derivative will have a larger magnitude when a1 is further away from the optimal value. If we were asked to adapt a1, we would add to the current value of a1. The amount to add would be large if a1 was far from the optimal value, and small if a1 was close to the optimal value. If the current value was to the right of the optimal value, the derivative would be positive, and we would subtract some amount from a1 to adapt it. The
  • 364. 11.5 Adaptive DPCM 341 dn a1 2 F I G U R E 11. 11 A plot of the residual squared versus the predictor coefficient. amount to subtract would be larger if we were further from the optimal, and as before, the derivative would have a larger magnitude if a1 were further from the optimal value. At any given time, in order to adapt the coefficient at time n + 1, we add an amount proportional to the magnitude of the derivative with a sign that is opposite to that of the derivative of d2 n at time n: a n+1 1 = a n 1 − d2 n a1 (11.41) where is some proportionality constant. d2 n a1 = −2xn −a1x̂n−1 x̂n−1 (11.42) = −2dnx̂n−1 (11.43) Substituting this into (11.41), we get a n+1 1 = a n 1 + dnx̂n−1 (11.44) where we have absorbed the 2 into . The residual value dn is available only to the encoder. Therefore, in order for both the encoder and decoder to use the same algorithm, we replace dn by d̂n in (11.44) to obtain a n+1 1 = a n 1 + d̂nx̂n−1 (11.45) Extending this adaptation equation for a first-order predictor to an Nth-order predictor is relatively easy. The equation for the squared prediction error is given by d2 n = xn − N i=1 aix̂n−i 2 (11.46)
  • 365. 342 11 D I F F E R E N T I A L E N C O D I N G Taking the derivative with respect to aj will give us the adaptation equation for the jth predictor coefficient: a n+1 j = a n j + d̂nx̂n−j (11.47) We can combine all N equations in vector form to get An+1 = An + d̂nX̂n−1 (11.48) where X̂n = ⎡ ⎢ ⎢ ⎢ ⎣ x̂n x̂n−1 x̂n−N+1 ⎤ ⎥ ⎥ ⎥ ⎦ (11.49) This particular adaptation algorithm is called the least mean squared (LMS) algorithm [170]. 11 . 6 Delta Modulation A very simple form of DPCM that has been widely used in a number of speech-coding applications is the delta modulator (DM). The DM can be viewed as a DPCM system with a 1-bit (two-level) quantizer. With a two-level quantizer with output values ±, we can only represent a sample-to-sample difference of . If, for a given source sequence, the sample-to-sample difference is often very different from , then we may incur substantial distortion. One way to limit the difference is to sample more often. In Figure 11.12 we see a signal that has been sampled at two different rates. The lower-rate samples are shown by open circles, while the higher-rate samples are represented by +. It is apparent that the lower-rate samples are further apart in value. The rate at which a signal is sampled is governed by the highest frequency component of a signal. If the highest frequency component in a signal is W, then in order to obtain an exact reconstruction of the signal, we need to sample it at least at twice the highest frequency, or 2W. In systems that use delta modulation, we usually sample the signal at much more than twice the highest frequency. If Fs is the sampling frequency, then the ratio of Fs to 2W can range from almost 1 to almost 100 [123]. The higher sampling rates are used for high-quality A/D converters, while the lower rates are more common for low-rate speech coders. + + + + + + + + + + + + + + + + + + + + + + + F I G U R E 11. 12 A signal sampled at two different rates.
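Before continuing with delta modulation, the LMS update of (11.47) derived above can be summarized in a short loop. The sketch below adapts the predictor from the quantized residual and the past reconstructions, both of which the decoder also has, so no side information is needed. The step size, the crude uniform quantizer, and the sinusoidal test input are illustrative assumptions, not values from any standard.

/* Backward-adaptive DPCM with the LMS predictor update of Eq. (11.47).
   The coefficients are adapted only from quantities the decoder also
   has: the quantized residual and past reconstructions. */
#include <stdio.h>
#include <math.h>

#define N    3          /* predictor order     */
#define MU   0.05       /* LMS adaptation step */
#define STEP 0.5        /* quantizer step size */

static double quantize(double d)            /* 8-level uniform midrise */
{
    double q = STEP * (floor(d / STEP) + 0.5);
    if (q >  3.5 * STEP) q =  3.5 * STEP;
    if (q < -3.5 * STEP) q = -3.5 * STEP;
    return q;
}

int main(void)
{
    double a[N]    = {0};       /* predictor coefficients, start at 0 */
    double xhat[N] = {0};       /* last N reconstructions             */

    for (int n = 0; n < 200; n++) {
        double x = sin(0.07 * n);               /* illustrative input */

        double p = 0.0;                         /* prediction         */
        for (int i = 0; i < N; i++) p += a[i] * xhat[i];

        double dq  = quantize(x - p);           /* quantized residual */
        double rec = p + dq;                    /* reconstruction     */

        /* LMS update, Eq. (11.47): a_j += MU * dq * xhat[n-j] */
        for (int i = 0; i < N; i++) a[i] += MU * dq * xhat[i];

        /* shift the reconstruction history */
        for (int i = N - 1; i > 0; i--) xhat[i] = xhat[i - 1];
        xhat[0] = rec;

        if (n % 50 == 49)
            printf("n=%3d  a = %.3f %.3f %.3f  |err| = %.3f\n",
                   n, a[0], a[1], a[2], fabs(x - rec));
    }
    return 0;
}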
  • 366. 11.6 Delta Modulation 343 Granular region Slope overload region F I G U R E 11. 13 A source output sampled and coded using delta modulation. If we look at a block diagram of a delta modulation system, we see that, while the block diagram of the encoder is identical to that of the DPCM system, the standard DPCM decoder is followed by a filter. The reason for the existence of the filter is evident from Figure 11.13, where we show a source output and the unfiltered reconstruction. The samples of the source output are represented by the filled circles. As the source is sampled at several times the highest frequency, the staircase shape of the reconstructed signal results in distortion in frequency bands outside the band of frequencies occupied by the signal. The filter can be used to remove these spurious frequencies. The reconstruction shown in Figure 11.13 was obtained with a delta modulator using a fixed quantizer. Delta modulation systems that use a fixed step size are often referred to as linear delta modulators. Notice that the reconstructed signal shows one of two behaviors. In regions where the source output is relatively constant, the output alternates up or down by ; these regions are called the granular regions. In the regions where the source output rises or falls fast, the reconstructed output cannot keep up; these regions are called the slope overload regions. If we want to reduce the granular error, we need to make the step size small. However, this will make it more difficult for the reconstruction to follow rapid changes in the input. In other words, it will result in an increase in the overload error. To avoid the overload condition, we need to make the step size large so that the reconstruction can quickly catch up with rapid changes in the input. However, this will increase the granular error. One way to avoid this impasse is to adapt the step size to the characteristics of the input, as shown in Figure 11.14. In quasi-constant regions, make the step size small in order to reduce the granular error. In regions of rapid change, increase the step size in order to reduce overload error. There are various ways of adapting the delta modulator to the local characteristics of the source output. We describe two of the more popular ways here. 11 . 6 . 1 C o n s t a n t F a c t o r A d a p t i v e D e l t a M o d u l a t i o n ( C F D M ) The objective of adaptive delta modulation is clear: increase the step size in overload regions and decrease it in granular regions. The problem lies in knowing when the system is in each of these regions. Looking at Figure 11.13, we see that in the granular region the output of
the quantizer changes sign with almost every input sample; in the overload region, the sign of the quantizer output is the same for a string of input samples. Therefore, we can define an overload or granular condition based on whether the output of the quantizer has been changing signs.

FIGURE 11.14 A source output sampled and coded using adaptive delta modulation.

A very simple system [171] uses a history of one sample to decide whether the system is in overload or granular condition and whether to expand or contract the step size. If sn denotes the sign of the quantizer output d̂n,

sn = 1 if d̂n >= 0,  sn = -1 if d̂n < 0               (11.50)

the adaptation logic is given by

Δn = M1 Δn-1 if sn = sn-1,   Δn = M2 Δn-1 if sn ≠ sn-1     (11.51)

where M1 = 1/M2 = M >= 1. In general, M < 2.

By increasing the memory, we can improve the response of the CFDM system. For example, if we looked at two past samples, we could decide that the system was moving from overload to granular condition if the sign had been the same for the past two samples and then changed with the current sample:

sn ≠ sn-1 = sn-2                                    (11.52)

In this case it would be reasonable to assume that the step size had been expanding previously and, therefore, needed a sharp contraction. If

sn = sn-1 ≠ sn-2                                    (11.53)

then it would mean that the system was probably entering the overload region, while

sn = sn-1 = sn-2                                    (11.54)

would mean the system was in overload and the step size should be expanded rapidly.
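The sketch below implements the one-sample-memory rule of (11.51): the step size is multiplied by M when the quantizer output repeats its sign (suspected slope overload) and by 1/M when the sign alternates (granular region). The value of M, the step-size limits, and the ramp-plus-constant test input are illustrative choices, not values from [171].

/* One-sample-memory constant factor adaptive delta modulation,
   Eq. (11.51): expand the step on repeated signs, contract on
   alternating signs.  All constants are illustrative. */
#include <stdio.h>

int main(void)
{
    const double M = 1.5, dmin = 0.02, dmax = 1.0;
    double step = 0.1, xhat = 0.0;
    int prev_sign = 1;

    for (int n = 0; n < 60; n++) {
        /* test input: a fast ramp followed by a constant segment */
        double x = (n < 30) ? 0.2 * n : 6.0;

        int sign = (x >= xhat) ? 1 : -1;        /* 1-bit quantizer       */
        step *= (sign == prev_sign) ? M : 1.0 / M;
        if (step > dmax) step = dmax;
        if (step < dmin) step = dmin;

        xhat += sign * step;                    /* decoder repeats this
                                                   from the received bit */
        prev_sign = sign;

        if (n % 10 == 9)
            printf("n=%2d  x=%5.2f  xhat=%5.2f  step=%5.3f\n",
                   n, x, xhat, step);
    }
    return 0;
}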
For the encoding of speech, the following multipliers Mi are recommended by [172] for a CFDM system with two-sample memory:

sn ≠ sn-1 = sn-2:   M1 = 0.4                        (11.55)
sn ≠ sn-1 ≠ sn-2:   M2 = 0.9                        (11.56)
sn = sn-1 ≠ sn-2:   M3 = 1.5                        (11.57)
sn = sn-1 = sn-2:   M4 = 2.0                        (11.58)

The amount of memory can be increased further with a concurrent increase in complexity. The space shuttle used a delta modulator with a memory of seven [173].

11.6.2 Continuously Variable Slope Delta Modulation

The CFDM systems described use a rapid adaptation scheme. For low-rate speech coding, it is more pleasing if the adaptation is over a longer period of time. This slower adaptation results in a decrease in the granular error and generally an increase in overload error. Delta modulation systems that adapt over longer periods of time are referred to as syllabically companded. A popular class of syllabically companded delta modulation systems is the continuously variable slope delta modulation (CVSD) systems. The adaptation logic used in CVSD systems is as follows [123]:

Δn = β Δn-1 + αn Δ0                                 (11.59)

where β is a number less than but close to one, and αn is equal to one if J of the last K quantizer outputs were of the same sign. That is, we look in a window of length K to obtain the behavior of the source output. If this condition is not satisfied, then αn is equal to zero. Standard values for J and K are J = 3 and K = 3.

11.7 Speech Coding

Differential encoding schemes are immensely popular for speech encoding. They are used in the telephone system, voice messaging, and multimedia applications, among others. Adaptive DPCM is a part of several international standards (ITU-T G.721, ITU G.723, ITU G.726, ITU-T G.722), which we will look at here and in later chapters. Before we do that, let's take a look at one issue specific to speech coding.

In Figure 11.7, we see that there is a segment of speech that looks highly periodic. We can see this periodicity if we plot the autocorrelation function of the speech segment (Figure 11.15). The autocorrelation peaks at a lag value of 47 and multiples of 47. This indicates a periodicity of 47 samples. This period is called the pitch period. The predictor we originally designed did not take advantage of this periodicity, as the largest predictor was a third-order predictor, and this periodic structure takes 47 samples to show up. We can take advantage of this periodicity by constructing an outer prediction loop around the basic DPCM structure
as shown in Figure 11.16. This can be a simple single-coefficient predictor of the form $b\hat{x}_{n-\tau}$, where $\tau$ is the pitch period. Using this system on testm.raw, we get the residual sequence shown in Figure 11.17. Notice the decrease in amplitude in the periodic portion of the speech.

FIGURE 11.15 Autocorrelation function for test.snd.

FIGURE 11.16 The DPCM structure with a pitch predictor.

Finally, remember that we have been using the mean squared error as the distortion measure in all of our discussions. However, perceptual tests do not always correlate with the mean squared error. The level of distortion we perceive is often related to the level of the speech signal. In regions where the speech signal is of higher amplitude, we have a harder time perceiving the distortion, but the same amount of distortion in a different frequency band might be very perceptible. We can take advantage of this by shaping the quantization error so that most of the error lies in the region where the signal has a higher amplitude. This variation of DPCM is called noise feedback coding (NFC) (see [123] for details).
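The pitch predictor needs an estimate of the pitch period $\tau$. The following is a minimal sketch of how such an estimate can be obtained from the autocorrelation peak, together with an open-loop version of the single-coefficient pitch predictor. The search range, the coefficient value b = 0.8, and the file-reading lines are illustrative assumptions; the system described in the text applies the pitch predictor inside the DPCM loop of Figure 11.16 rather than open loop.

    import numpy as np

    def estimate_pitch_period(x, min_lag=20, max_lag=150):
        # Locate the largest autocorrelation peak in a plausible lag range.
        # For the segment discussed in the text this peak occurs at a lag of 47.
        x = x - np.mean(x)
        r = np.correlate(x, x, mode='full')[len(x) - 1:]   # r[k] for k >= 0
        return min_lag + int(np.argmax(r[min_lag:max_lag + 1]))

    def pitch_residual(x, tau, b=0.8):
        # Single-coefficient pitch predictor b * x_{n-tau}, applied open loop
        # (prediction from past input samples) just to visualize the drop in
        # amplitude of the periodic portion of the residual.
        d = np.copy(x).astype(float)
        d[tau:] = x[tau:] - b * x[:-tau]
        return d

    # Usage sketch (file name and segment boundaries are hypothetical):
    # speech = np.fromfile('testm.raw', dtype=np.int16).astype(float)
    # tau = estimate_pitch_period(speech[2000:2400])
    # resid = pitch_residual(speech, tau)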
FIGURE 11.17 The residual sequence using the DPCM system with a pitch predictor.

11.7.1 G.726

The International Telecommunication Union has published recommendations for a standard ADPCM system, including recommendations G.721, G.723, and G.726. G.726 supersedes G.721 and G.723. In this section we will describe the G.726 recommendation for ADPCM systems at rates of 40, 32, 24, and 16 kbits per second.

The Quantizer

The recommendation assumes that the speech output is sampled at a rate of 8000 samples per second, so the rates of 40, 32, 24, and 16 kbits per second translate to 5 bits per sample, 4 bits per sample, 3 bits per sample, and 2 bits per sample. Compared to the PCM rate of 8 bits per sample, this means compression ratios of 1.6:1, 2:1, 2.67:1, and 4:1. Except for the 16 kbits-per-second system, the number of levels in the quantizer is $2^{n_b} - 1$, where $n_b$ is the number of bits per sample. Thus, the number of levels in the quantizer is odd, which means that for the higher rates we use a midtread quantizer.

The quantizer is a backward adaptive quantizer with an adaptation algorithm that is similar to the Jayant quantizer. The recommendation describes the adaptation of the quantization interval in terms of the adaptation of a scale factor. The input $d_k$ is normalized by a scale factor $\alpha_k$. This normalized value is quantized, and the normalization is removed by multiplying with $\alpha_k$. In this way the quantizer is kept fixed and $\alpha_k$ is adapted to the input. Therefore, for example, instead of expanding the step size, we would increase the value of $\alpha_k$. The fixed quantizer is a nonuniform midtread quantizer. The recommendation describes the quantization boundaries and reconstruction values in terms of the log of the scaled input. The input-output characteristics for the 24 kbits-per-second system are shown in Table 11.2. An output value of $-\infty$ in the table corresponds to a reconstruction value of 0.
TABLE 11.2 Recommended input-output characteristics of the quantizer for 24-kbits-per-second operation.

    Input Range $\log_2 |d_k|/\alpha_k$     Label $|I_k|$     Output $\log_2 |\hat{d}_k|/\alpha_k$
    $[2.58, \infty)$                        3                 2.91
    $[1.70, 2.58)$                          2                 2.13
    $[-0.06, 1.70)$                         1                 1.05
    $(-\infty, -0.06)$                      0                 $-\infty$

The adaptation algorithm is described in terms of the logarithm of the scale factor:

$y_k = \log_2 \alpha_k$   (11.60)

The adaptation of the scale factor $\alpha_k$ (or its log $y_k$) depends on whether the input is speech or speechlike, where the sample-to-sample difference can fluctuate considerably, or whether the input is voice-band data, which might be generated by a modem, where the sample-to-sample fluctuation is quite small. In order to handle both these situations, the scale factor is composed of two values, a locked slow scale factor for when the sample-to-sample differences are quite small, and an unlocked value for when the input is more dynamic:

$y_k = a_l(k)\, y_u(k-1) + \bigl(1 - a_l(k)\bigr)\, y_l(k-1)$   (11.61)

The value of $a_l(k)$ depends on the variance of the input. It will be close to one for speech inputs and close to zero for tones and voice-band data.

The unlocked scale factor is adapted using the Jayant algorithm with one slight modification. If we were to use the Jayant algorithm, the unlocked scale factor would be adapted as

$\alpha_u(k) = \alpha_{k-1}\, M(I_{k-1})$   (11.62)

where $M(\cdot)$ is the multiplier. In terms of logarithms, this becomes

$y_u(k) = y_{k-1} + \log M(I_{k-1})$   (11.63)

The modification consists of introducing some memory into the adaptive process so that the encoder and decoder converge following transmission errors:

$y_u(k) = (1 - \epsilon)\bigl[\, y_{k-1} + W(I_{k-1})\,\bigr]$   (11.64)

where $W(\cdot) = \log M(\cdot)$, and $\epsilon = 2^{-5}$. The locked scale factor is obtained from the unlocked scale factor through

$y_l(k) = (1 - \gamma)\, y_l(k-1) + \gamma\, y_u(k), \qquad \gamma = 2^{-6}$   (11.65)
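The structure of this two-speed adaptation is easy to see in code. The sketch below follows Equations (11.61), (11.64), and (11.65); the log-domain multiplier table W(·) and the speed-control term a_l(k) are placeholder values (the recommendation specifies both precisely), so this illustrates the structure rather than reproducing G.726.

    import numpy as np

    # Illustrative log-domain multipliers W(I) for a small quantizer;
    # the actual G.726 tables differ.
    W = {0: -0.12, 1: 0.05, 2: 0.35, 3: 1.07}

    EPS = 2.0 ** -5     # epsilon in Eq. (11.64)
    GAMMA = 2.0 ** -6   # gamma in Eq. (11.65)

    def update_scale_factor(y, y_u, y_l, I_prev, a_l):
        """One step of the scale-factor adaptation.

        y       : combined log scale factor y(k-1)
        y_u,y_l : unlocked and locked log scale factors at k-1
        I_prev  : magnitude of the previous quantizer label |I(k-1)|
        a_l     : speed-control term in [0,1]; near 1 for speech, near 0 for
                  voice-band data (its computation is omitted here).
        """
        y_u_new = (1.0 - EPS) * (y + W[I_prev])            # Eq. (11.64)
        y_l_new = (1.0 - GAMMA) * y_l + GAMMA * y_u_new    # Eq. (11.65)
        y_new = a_l * y_u_new + (1.0 - a_l) * y_l_new      # Eq. (11.61)
        return y_new, y_u_new, y_l_new

    # The scale factor itself is alpha_k = 2**y_k, so the quantizer operates
    # on the normalized input d_k / 2**y_k.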
The Predictor

The recommended predictor is a backward adaptive predictor that uses a linear combination of the two past reconstructed values as well as the six past quantized differences to generate the prediction:

$p_k = \sum_{i=1}^{2} a_i^{(k-1)} \hat{x}_{k-i} + \sum_{i=1}^{6} b_i^{(k-1)} \hat{d}_{k-i}$   (11.66)

The set of predictor coefficients is updated using a simplified form of the LMS algorithm:

$a_1^{(k)} = \bigl(1 - 2^{-8}\bigr) a_1^{(k-1)} + 3 \times 2^{-8}\, \mathrm{sgn}(z_k)\, \mathrm{sgn}(z_{k-1})$   (11.67)

$a_2^{(k)} = \bigl(1 - 2^{-7}\bigr) a_2^{(k-1)} + 2^{-7} \Bigl[ \mathrm{sgn}(z_k)\, \mathrm{sgn}(z_{k-2}) - f\bigl(a_1^{(k-1)}\bigr)\, \mathrm{sgn}(z_k)\, \mathrm{sgn}(z_{k-1}) \Bigr]$   (11.68)

where

$z_k = \hat{d}_k + \sum_{i=1}^{6} b_i^{(k-1)} \hat{d}_{k-i}$   (11.69)

$f(\alpha) = \begin{cases} 4\alpha & |\alpha| \le \tfrac{1}{2} \\ 2\, \mathrm{sgn}(\alpha) & |\alpha| > \tfrac{1}{2} \end{cases}$   (11.70)

The coefficients $b_i$ are updated using the following equation:

$b_i^{(k)} = \bigl(1 - 2^{-8}\bigr) b_i^{(k-1)} + 2^{-7}\, \mathrm{sgn}(\hat{d}_k)\, \mathrm{sgn}(\hat{d}_{k-i})$   (11.71)

Notice that in the adaptive algorithms we have replaced products of reconstructed values and products of quantizer outputs with products of their signs. This is computationally much simpler and does not lead to any significant degradation of the adaptation process. Furthermore, the values of the coefficients are selected such that multiplication with these coefficients can be accomplished using shifts and adds. The predictor coefficients are all set to zero when the input moves from tones to speech.

11.8 Image Coding

We saw in Chapter 7 that differential encoding provided an efficient approach to the lossless compression of images. The case for using differential encoding in the lossy compression of images has not been made as clearly. In the early days of image compression, both differential encoding and transform coding were popular forms of lossy image compression. At the current time differential encoding has a much more restricted role as part of other compression strategies.

Several currently popular approaches to image compression decompose the image into lower- and higher-frequency components. As low-frequency signals have high sample-to-sample correlation, several schemes use differential encoding to compress the low-frequency components. We will see this use of differential encoding when we look at subband- and wavelet-based compression schemes and, to a lesser extent, when we study transform coding.
For now let us look at the performance of a couple of stand-alone differential image compression schemes. We will compare the performance of these schemes with the performance of the JPEG compression standard.

Consider a simple differential encoding scheme in which the predictor $p_{j,k}$ for the pixel in the $j$th row and the $k$th column is given by

$p_{j,k} = \begin{cases} \hat{x}_{j,k-1} & k > 0 \\ \hat{x}_{j-1,k} & k = 0 \text{ and } j > 0 \\ 128 & j = 0 \text{ and } k = 0 \end{cases}$

where $\hat{x}_{j,k}$ is the reconstructed pixel in the $j$th row and $k$th column. We use this predictor in conjunction with a fixed four-level uniform quantizer and code the quantizer output using an arithmetic coder. The coding rate for the compressed image is approximately 1 bit per pixel. We compare this reconstructed image with a JPEG-coded image at the same rate in Figure 11.18. The signal-to-noise ratio for the differentially encoded image is 22.33 dB (PSNR 31.42 dB) and for the JPEG-encoded image is 32.52 dB (PSNR 41.60 dB), a difference of more than 10 dB! However, this is an extremely simple system compared to the JPEG standard, which has been fine-tuned for encoding images.

FIGURE 11.18 Left: Reconstructed image using differential encoding at 1 bit per pixel. Right: Reconstructed image using JPEG at 1 bit per pixel.

Let's make our differential encoding system slightly more complicated by replacing the uniform quantizer with a recursively indexed quantizer and the predictor by a somewhat more complicated predictor. For each pixel (except for the boundary pixels) we compute the following three values:

$p_1 = 0.5\, \hat{x}_{j-1,k} + 0.5\, \hat{x}_{j,k-1}$   (11.72)
$p_2 = 0.5\, \hat{x}_{j-1,k-1} + 0.5\, \hat{x}_{j,k-1}$
$p_3 = 0.5\, \hat{x}_{j-1,k-1} + 0.5\, \hat{x}_{j-1,k}$
then obtain the predicted value as

$p_{j,k} = \mathrm{median}(p_1, p_2, p_3)$

For the boundary pixels we use the simple prediction scheme. At a coding rate of 1 bit per pixel, we obtain the image shown in Figure 11.19. For reference we show it next to the JPEG-coded image at the same rate. The signal-to-noise ratio for this reconstruction is 29.20 dB (PSNR 38.28 dB). We have made up two-thirds of the difference using some relatively minor modifications. We can see that it might be feasible to develop differential encoding schemes that are competitive with other image compression techniques. Therefore, it makes sense not to dismiss differential encoding out of hand when we need to develop image compression systems.

FIGURE 11.19 Left: Reconstructed image using differential encoding at 1 bit per pixel using the median predictor and recursively indexed quantizer. Right: Reconstructed image using JPEG at 1 bit per pixel.

11.9 Summary

In this chapter we described some of the more well-known differential encoding techniques. Although differential encoding does not provide compression as high as vector quantization, it is very simple to implement. This approach is especially suited to the encoding of speech, where it has found broad application. The DPCM system consists of two main components, the quantizer and the predictor. We spent a considerable amount of time discussing the quantizer in Chapter 9, so most of the discussion in this chapter focused on the predictor. We have seen different ways of making the predictor adaptive, and looked at some of the improvements to be obtained from source-specific modifications to the predictor design.
Further Reading

1. Digital Coding of Waveforms, by N.S. Jayant and P. Noll [123], contains some very detailed and highly informative chapters on differential encoding.
2. "Adaptive Prediction in Speech Differential Encoding Systems," by J.D. Gibson [167], is a comprehensive treatment of the subject of adaptive prediction.
3. A real-time video coding system based on DPCM has been developed by NASA. Details can be found in [174].

11.10 Projects and Problems

1. Generate an AR(1) process using the relationship
$x_n = 0.9\, x_{n-1} + \epsilon_n$
where $\epsilon_n$ is the output of a Gaussian random number generator (this is option 2 in rangen).
(a) Encode this sequence using a DPCM system with a one-tap predictor with predictor coefficient 0.9 and a three-level Gaussian quantizer. Compute the variance of the prediction error. How does this compare with the variance of the input? How does the variance of the prediction error compare with the variance of the $\epsilon_n$ sequence?
(b) Repeat using predictor coefficient values of 0.5, 0.6, 0.7, 0.8, and 1.0. Comment on the results.

2. Generate an AR(5) process using the following coefficients: 1.381, 0.6, 0.367, -0.7, 0.359.
(a) Encode this with a DPCM system with a 3-bit Gaussian nonuniform quantizer and a first-, second-, third-, fourth-, and fifth-order predictor. Obtain these predictors by solving (11.30). For each case compute the variance of the prediction error and the SNR in dB. Comment on your results.
(b) Repeat using a 3-bit Jayant quantizer.

3. DPCM can also be used for encoding images. Encode the Sinan image using a one-tap predictor of the form
$\hat{x}_{i,j} = a \times x_{i,j-1}$
and a 2-bit quantizer. Experiment with quantizers designed for different distributions. Comment on your results.

4. Repeat the image-coding experiment of the previous problem using a Jayant quantizer.

5. DPCM-encode the Sinan, Elif, and bookshelf1 images using a one-tap predictor and a four-level quantizer followed by a Huffman coder. Repeat using a five-level quantizer. Compute the SNR for each case, and compare the rate distortion performances.
  • 376. 11.10 Projects and Problems 353 6. We want to DPCM-encode images using a two-tap predictor of the form x̂ij = a×xij−1 +b ×xi−1j and a four-level quantizer followed by a Huffman coder. Find the equations we need to solve to obtain coefficients a and b that minimize the mean squared error. 7. (a) DPCM-encode the Sinan, Elif, and bookshelf1 images using a two-tap predictor and a four-level quantizer followed by a Huffman coder. (b) Repeat using a five-level quantizer. Compute the SNR and rate (in bits per pixel) for each case. (c) Compare the rate distortion performances with the one-tap case. (d) Repeat using a five-level quantizer. Compute the SNR for each case, and compare the rate distortion performances using a one-tap and two-tap predictor.
  • 378. 12 Mathematical Preliminaries for Transforms, Subbands, and Wavelets 12 . 1 Overview I n this chapter we will review some of the mathematical background necessary for the study of transforms, subbands, and wavelets. The topics include Fourier series, Fourier transforms, and their discrete counterparts. We will also look at sampling and briefly review some linear system concepts. 12 . 2 Introduction The roots of many of the techniques we will study can be found in the mathematical literature. Therefore, in order to understand the techniques, we will need some mathematical background. Our approach in general will be to introduce the mathematical tools just prior to when they are needed. However, there is a certain amount of background that is required for most of what we will be looking at. In this chapter we will present only that material that is a common background to all the techniques we will be studying. Our approach will be rather utilitarian; more sophisticated coverage of these topics can be found in [175]. We will be introducing a rather large number of concepts, many of which depend on each other. In order to make it easier for you to find a particular concept, we will identify the paragraph in which the concept is first introduced. We will begin our coverage with a brief introduction to the concept of vector spaces, and in particular the concept of the inner product. We will use these concepts in our description of Fourier series and Fourier transforms. Next is a brief overview of linear systems, then
  • 379. 356 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S a look at the issues involved in sampling a function. Finally, we will revisit the Fourier concepts in the context of sampled functions and provide a brief introduction to Z-transforms. Throughout, we will try to get a physical feel for the various concepts. 12 . 3 Vector Spaces The techniques we will be using to obtain compression will involve manipulations and decompositions of (sampled) functions of time. In order to do this we need some sort of mathematical framework. This framework is provided through the concept of vector spaces. We are very familiar with vectors in two- or three-dimensional space. An example of a vector in two-dimensional space is shown in Figure 12.1. This vector can be represented in a number of different ways: we can represent it in terms of its magnitude and direction, or we can represent it as a weighted sum of the unit vectors in the x and y directions, or we can represent it as an array whose components are the coefficients of the unit vectors. Thus, the vector v in Figure 12.1 has a magnitude of 5 and an angle of 36.86 degrees, v = 4ux +3uy and v = 4 3 We can view the second representation as a decomposition of V into simpler building blocks, namely, the basis vectors. The nice thing about this is that any vector in two dimensions can be decomposed in exactly the same way. Given a particular vector A and a 4 3 2 1 4 3 2 1 ux v uy F I G U R E 12. 1 A vector.
  • 380. 12.3 Vector Spaces 357 basis set (more on this later), decomposition means finding the coefficients with which to weight the unit vectors of the basis set. In our simple example it is easy to see what these coefficients should be. However, we will encounter situations where it is not a trivial task to find the coefficients that constitute the decomposition of the vector. We therefore need some machinery to extract these coefficients. The particular machinery we will use here is called the dot product or the inner product. 12 . 3 . 1 D o t o r I n n e r P r o d u c t Given two vectors a and b such that a = a1 a2 b = b1 b2 the inner product between a and b is defined as a ·b = a1b1 +a2b2 Two vectors are said to be orthogonal if their inner product is zero. A set of vectors is said to be orthogonal if each vector in the set is orthogonal to every other vector in the set. The inner product between a vector and a unit vector from an orthogonal basis set will give us the coefficient corresponding to that unit vector. It is easy to see that this is indeed so. We can write ux and uy as ux = 1 0 uy = 0 1 These are obviously orthogonal. Therefore, the coefficient a1 can be obtained by a ·ux = a1 ×1+a2 ×0 = a1 and the coefficient of uy can be obtained by a ·uy = a1 ×0+a2 ×1 = a2 The inner product between two vectors is in some sense a measure of how “similar” they are, but we have to be a bit careful in how we define “similarity.” For example, consider the vectors in Figure 12.2. The vector a is closer to ux than to uy. Therefore a ·ux will be greater than a ·uy. The reverse is true for b. 12 . 3 . 2 V e c t o r S p a c e In order to handle not just two- or three-dimensional vectors but general sequences and functions of interest to us, we need to generalize these concepts. Let us begin with a more general definition of vectors and the concept of a vector space. A vector space consists of a set of elements called vectors that have the operations of vector addition and scalar multiplication defined on them. Furthermore, the results of these operations are also elements of the vector space.
  • 381. 358 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S ux b a uy F I G U R E 12. 2 Example of different vectors. By vector addition of two vectors, we mean the vector obtained by the pointwise addition of the components of the two vectors. For example, given two vectors a and b: a = ⎡ ⎣ a1 a2 a3 ⎤ ⎦ b = ⎡ ⎣ b1 b2 b3 ⎤ ⎦ (12.1) the vector addition of these two vectors is given as a +b = ⎡ ⎣ a1 +b1 a2 +b2 a3 +b3 ⎤ ⎦ (12.2) By scalar multiplication, we mean the multiplication of a vector with a real or complex number. For this set of elements to be a vector space it has to satisfy certain axioms. Suppose V is a vector space; xyz are vectors; and and are scalars. Then the following axioms are satisfied: 1. x +y = y+x (commutativity). 2. x +y+z = x +y+z and x = x (associativity). 3. There exists an element in V such that x + = x for all x in V. is called the additive identity. 4. x +y = x +y, and +x = x +x (distributivity). 5. 1·x = x, and 0·x = . 6. For every x in V, there exists a −x such that x +−x = . A simple example of a vector space is the set of real numbers. In this set zero is the additive identity. We can easily verify that the set of real numbers with the standard
  • 382. 12.3 Vector Spaces 359 operations of addition and multiplication obey the axioms stated above. See if you can verify that the set of real numbers is a vector space. One of the advantages of this exercise is to emphasize the fact that a vector is more than a line with an arrow at its end. E x a m p l e 12 . 3 . 1: Another example of a vector space that is of more practical interest to us is the set of all functions ft with finite energy. That is, − ft2 dt (12.3) Let’s see if this set constitutes a vector space. If we define additions as pointwise addition and scalar multiplication in the usual manner, the set of functions ft obviously satisfies axioms 1, 2, and 4. If ft and gt are functions with finite energy, and is a scalar, then the functions ft+gt and ft also have finite energy. If ft and gt are functions with finite energy, then ft+gt = gt+ft (axiom 1). If ft, gt, and ht are functions with finite energy, and and are scalars, then ft+gt+ht = ft+gt+ht and ft = ft (axiom 2). If ft, gt, and ht are functions with finite energy, and is a scalar, then ft+ gt = ft+gt and +ft = ft+ft (axiom 4). Let us define the additive identity function t as the function that is identically zero for all t. This function satisfies the requirement of finite energy, and we can see that axioms 3 and 5 are also satisfied. Finally, if a function ft has finite energy, then from Equation (12.3), the function −ft also has finite energy, and axiom 6 is satisfied. Therefore, the set of all functions with finite energy constitutes a vector space. This space is denoted by L2f, or simply L2. 12 . 3 . 3 S u b s p a c e A subspace S of a vector space V is a subset of V whose members satisfy all the axioms of the vector space and has the additional property that if x and y are in S, and is a scalar, then x +y and x are also in S. E x a m p l e 12 . 3 . 2: Consider the set S of continuous bounded functions on the interval [0, 1]. Then S is a subspace of the vector space L2.
  • 383. 360 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S 12 . 3 . 4 B a s i s One way we can generate a subspace is by taking linear combinations of a set of vectors. If this set of vectors is linearly independent, then the set is called a basis for the subspace. A set of vectors x1x2 is said to be linearly independent if no vector of the set can be written as a linear combination of the other vectors in the set. A direct consequence of this definition is the following theorem: T h e o r e m A set of vectors X = x1x2 xN is linearly independent if and only if the expression N i=1 ixi = implies that i = 0 for all i = 12 N. P r o o f The proof of this theorem can be found in most books on linear algebra [175]. The set of vectors formed by all possible linear combinations of vectors from a linearly independent set X forms a vector space (Problem 1). The set X is said to be the basis for this vector space. The basis set contains the smallest number of linearly independent vectors required to represent each element of the vector space. More than one set can be the basis for a given space. E x a m p l e 12 . 3 . 3: Consider the vector space consisting of vectors ab T , where a and b are real numbers. Then the set X = 1 0 0 1 forms a basis for this space, as does the set X = 1 1 1 0 In fact, any two vectors that are not scalar multiples of each other form a basis for this space. The number of basis vectors required to generate the space is called the dimension of the vector space. In the previous example the dimension of the vector space is two. The dimension of the space of all continuous functions on the interval 01 is infinity. Given a particular basis, we can find a representation with respect to this basis for any vector in the space.
  • 384. 12.3 Vector Spaces 361 E x a m p l e 12 . 3 . 4: If a = 34 T , then a = 3 1 0 +4 0 1 and a = 4 1 1 +−1 1 0 so the representation of a with respect to the first basis set is (3, 4), and the representation of a with respect to the second basis set is 4 −1. In the beginning of this section we had described a mathematical machinery for finding the components of a vector that involved taking the dot product or inner product of the vector to be decomposed with basis vectors. In order to use the same machinery in more abstract vector spaces we need to generalize the notion of inner product. 12 . 3 . 5 I n n e r P r o d u c t - — F o r m a l D e f i n i t i o n An inner product between two vectors x and y, denoted by xy, associates a scalar value with each pair of vectors. The inner product satisfies the following axioms: 1. xy = yx∗ , where ∗ denotes complex conjugate. 2. x +yz = xz+yz. 3. xy = xy. 4. xx ≥ 0, with equality if and only if x = . The quantity xx denoted by x is called the norm of x and is analogous to our usual concept of distance. 12 . 3 . 6 O r t h o g o n a l a n d O r t h o n o r m a l S e t s As in the case of Euclidean space, two vectors are said to be orthogonal if their inner product is zero. If we select our basis set to be orthogonal (that is, each vector is orthogonal to every other vector in the set) and further require that the norm of each vector be one (that is, the basis vectors are unit vectors), such a basis set is called an orthonormal basis set. Given an orthonormal basis, it is easy to find the representation of any vector in the space in terms of the basis vectors using the inner product. Suppose we have a vector space SN with an orthonormal basis set xi N i=1. Given a vector y in the space SN , by definition of the basis set we can write y as a linear combination of the vectors xi: y = N i=1 ixi
  • 385. 362 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S To find the coefficient k, we find the inner product of both sides of this equation with xk: yxk = N i=1 ixixk Because of orthonormality, xixk = 1 i = k 0 i = k and yxk = k By repeating this with each xi, we can get all the coefficients i. Note that in order to use this machinery, the basis set has to be orthonormal. We now have sufficient information in hand to begin looking at some of the well-known techniques for representing functions of time. This was somewhat of a crash course in vector spaces, and you might, with some justification, be feeling somewhat dazed. Basically, the important ideas that we would like you to remember are the following: Vectors are not simply points in two- or three-dimensional space. In fact, functions of time can be viewed as elements in a vector space. Collections of vectors that satisfy certain axioms make up a vector space. All members of a vector space can be represented as linear, or weighted, combinations of the basis vectors (keep in mind that you can have many different basis sets for the same space). If the basis vectors have unit magnitude and are orthogonal, they are known as an orthonormal basis set. If a basis set is orthonormal, the weights, or coefficients, can be obtained by taking the inner product of the vector with the corresponding basis vector. In the next section we use these concepts to show how we can represent periodic functions as linear combinations of sines and cosines. 12 . 4 Fourier Series The representation of periodic functions in terms of a series of sines and cosines was discovered by Jean Baptiste Joseph Fourier. Although he came up with this idea in order to help him solve equations describing heat diffusion, this work has since become indispensable in the analysis and design of systems. The work was awarded the grand prize for mathematics in 1812 and has been called one of the most revolutionary contributions of the last century. A very readable account of the life of Fourier and the impact of his discovery can be found in [176].
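Before developing the Fourier series, here is a small numerical illustration of the coefficient-extraction machinery just summarized for Section 12.3: we take an orthonormal basis of R^3 (the particular basis is an arbitrary choice for illustration), obtain the coefficients of a vector by taking inner products with the basis vectors, and verify that the weighted sum of the basis vectors reproduces the vector.

    import numpy as np

    # An orthonormal basis of R^3 (arbitrary illustrative choice): the standard
    # basis rotated about the z-axis by 30 degrees.
    c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
    basis = [np.array([c, s, 0.0]), np.array([-s, c, 0.0]), np.array([0.0, 0.0, 1.0])]

    y = np.array([4.0, 3.0, 2.0])

    # alpha_k = <y, x_k>: this shortcut is valid only because the basis is orthonormal.
    alphas = [np.dot(y, x) for x in basis]

    # Reconstruct y as the weighted sum of the basis vectors.
    y_rebuilt = sum(a * x for a, x in zip(alphas, basis))
    print(alphas)
    print(np.allclose(y, y_rebuilt))   # True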
Fourier showed that any periodic function, no matter how awkward looking, could be represented as the sum of smooth, well-behaved sines and cosines. Given a periodic function $f(t)$ with period $T$,

$f(t) = f(t + nT), \qquad n = \pm 1, \pm 2, \ldots$

we can write $f(t)$ as

$f(t) = a_0 + \sum_{n=1}^{\infty} a_n \cos n\omega_0 t + \sum_{n=1}^{\infty} b_n \sin n\omega_0 t, \qquad \omega_0 = \frac{2\pi}{T}$   (12.4)

This form is called the trigonometric Fourier series representation of $f(t)$. A more useful form of the Fourier series representation from our point of view is the exponential form of the Fourier series:

$f(t) = \sum_{n=-\infty}^{\infty} c_n e^{jn\omega_0 t}$   (12.5)

We can easily move between the exponential and trigonometric representations by using Euler's identity

$e^{j\phi} = \cos\phi + j \sin\phi$

where $j = \sqrt{-1}$.

In the terminology of the previous section, all periodic functions with period $T$ form a vector space. The complex exponential functions $\{e^{jn\omega_0 t}\}$ constitute a basis for this space. The parameters $\{c_n\}_{n=-\infty}^{\infty}$ are the representation of a given function $f(t)$ with respect to this basis set. Therefore, by using different values of $\{c_n\}_{n=-\infty}^{\infty}$, we can build different periodic functions. If we wanted to inform somebody what a particular periodic function looked like, we could send the values of $\{c_n\}_{n=-\infty}^{\infty}$, and they could synthesize the function.

We would like to see if this basis set is orthonormal. If it is, we want to be able to obtain the coefficients that make up the Fourier representation using the approach described in the previous section. In order to do all this, we need a definition of the inner product on this vector space. If $f(t)$ and $g(t)$ are elements of this vector space, the inner product is defined as

$\langle f(t), g(t) \rangle = \frac{1}{T} \int_{t_0}^{t_0+T} f(t)\, g(t)^* \, dt$   (12.6)

where $t_0$ is an arbitrary constant and $*$ denotes complex conjugate. For convenience we will take $t_0$ to be zero. Using this inner product definition, let us check to see if the basis set is orthonormal:

$\langle e^{jn\omega_0 t}, e^{jm\omega_0 t} \rangle = \frac{1}{T} \int_{0}^{T} e^{jn\omega_0 t} e^{-jm\omega_0 t} \, dt$   (12.7)

$= \frac{1}{T} \int_{0}^{T} e^{j(n-m)\omega_0 t} \, dt$   (12.8)
When $n = m$, Equation (12.7) becomes the norm of the basis vector, which is clearly one. When $n \neq m$, let us define $k = n - m$. Then

$\langle e^{jn\omega_0 t}, e^{jm\omega_0 t} \rangle = \frac{1}{T} \int_0^T e^{jk\omega_0 t} \, dt$   (12.9)

$= \frac{1}{jk\omega_0 T} \left( e^{jk\omega_0 T} - 1 \right)$   (12.10)

$= \frac{1}{jk\omega_0 T} \left( e^{jk 2\pi} - 1 \right)$   (12.11)

$= 0$   (12.12)

where we have used the facts that $\omega_0 = \frac{2\pi}{T}$ and

$e^{jk2\pi} = \cos 2k\pi + j \sin 2k\pi = 1$

Thus, the basis set is orthonormal.

Using this fact, we can find the coefficient $c_n$ by taking the inner product of $f(t)$ with the basis vector $e^{jn\omega_0 t}$:

$c_n = \langle f(t), e^{jn\omega_0 t} \rangle = \frac{1}{T} \int_0^T f(t)\, e^{-jn\omega_0 t} \, dt$   (12.13)

What do we gain from obtaining the Fourier representation $\{c_n\}_{n=-\infty}^{\infty}$ of a function $f(t)$? Before we answer this question, let us examine the context in which we generally use Fourier analysis. We start with some signal generated by a source. If we wish to look at how this signal changes its amplitude over a period of time (or space), we represent it as a function of time $f(t)$ (or a function of space $f(x)$). Thus, $f(t)$ (or $f(x)$) is a representation of the signal that brings out how this signal varies in time (or space). The sequence $\{c_n\}_{n=-\infty}^{\infty}$ is a different representation of the same signal. However, this representation brings out a different aspect of the signal. The basis functions are sinusoids that differ from each other in how fast they fluctuate in a given time interval. The basis vector $e^{j2\omega_0 t}$ fluctuates twice as fast as the basis vector $e^{j\omega_0 t}$. The coefficients of the basis vectors $\{c_n\}_{n=-\infty}^{\infty}$ give us a measure of the different amounts of fluctuation present in the signal. Fluctuation of this sort is usually measured in terms of frequency. A frequency of 1 Hz denotes the completion of one period in one second, a frequency of 2 Hz denotes the completion of two cycles in one second, and so on. Thus, the coefficients $\{c_n\}_{n=-\infty}^{\infty}$ provide us with a frequency profile of the signal: how much of the signal changes at the rate of $\frac{\omega_0}{2\pi}$ Hz, how much of the signal changes at the rate of $\frac{2\omega_0}{2\pi}$ Hz, and so on. This information cannot be obtained by looking at the time representation $f(t)$. On the other hand, the $\{c_n\}_{n=-\infty}^{\infty}$ representation tells us little about how the signal changes with time. Each representation emphasizes a different aspect of the signal. The ability to view the same signal in different ways helps us to better understand the nature of the signal, and thus develop tools for manipulation of the signal. Later, when we talk about wavelets, we will look at representations that provide information about both the time profile and the frequency profile of the signal.

The Fourier series provides us with a frequency representation of periodic signals. However, many of the signals we will be dealing with are not periodic. Fortunately, the Fourier series concepts can be extended to nonperiodic signals.
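Equation (12.13) is easy to approximate numerically. The sketch below estimates the exponential Fourier series coefficients of a square wave (an arbitrary test function with period 1) and checks the familiar result that the odd-indexed coefficients have magnitude close to 2/(πn) while the even-indexed ones are close to zero; the grid size is an arbitrary choice.

    import numpy as np

    def fourier_coefficient(f, n, T, num_points=4096):
        # Numerical approximation of Eq. (12.13):
        # c_n = (1/T) * integral over one period of f(t) * exp(-j n w0 t) dt
        t = np.linspace(0.0, T, num_points, endpoint=False)
        w0 = 2 * np.pi / T
        return np.mean(f(t) * np.exp(-1j * n * w0 * t))

    # Square wave with period T = 1: +1 on the first half period, -1 on the second.
    square = lambda t: np.where((t % 1.0) < 0.5, 1.0, -1.0)

    for n in (1, 2, 3, 5):
        cn = fourier_coefficient(square, n, T=1.0)
        print(n, abs(cn))   # odd n: close to 2/(pi*n); even n: close to 0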
  • 388. 12.5 Fourier Transform 365 12 . 5 Fourier Transform Consider the function ft shown in Figure 12.3. Let us define a function fPt as fPt = n=− ft −nT (12.14) where T t1. This function, which is obviously periodic (fPt + T = fPt), is called the periodic extension of the function ft. Because the function fPt is periodic, we can define a Fourier series expansion for it: cn = 1 T T 2 − T 2 fPte−jn0t dt (12.15) fPt = n=− cnejn0t (12.16) Define CnT = cnT and = 0 and let us slightly rewrite the Fourier series equations: CnT = T 2 − T 2 fPte−jnt dt (12.17) fPt = n=− CnT T ejnt (12.18) We can recover ft from fPt by taking the limit of fPt as T goes to infinity. Because = 0 = 2 T , this is the same as taking the limit as goes to zero. As goes to zero, n goes to a continuous variable . Therefore, lim T→→0 T 2 − T 2 fPte−jnt dt = − fte−jt dt (12.19) f(t) t1 t F I G U R E 12. 3 A function of time.
  • 389. 366 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S From the right-hand side, we can see that the resulting function is a function only of . We call this function the Fourier transform of ft, and we will denote it by F. To recover ft from Fw, we apply the same limits to Equation (12.18): ft lim T→ fPt = lim T→→0 n=− CnT 2 ejnt (12.20) = 1 2 − Fejt d (12.21) The equation F = − fte−jt dt (12.22) is generally called the Fourier transform. The function F tells us how the signal fluctuates at different frequencies. The equation ft = 1 2 − Fwejt d (12.23) is called the inverse Fourier transform, and it shows us how we can construct a signal using components that fluctuate at different frequencies. We will denote the operation of the Fourier transform by the symbol . Thus, in the preceding, F = ft . There are several important properties of the Fourier transform, three of which will be of particular use to us. We state them here and leave the proof to the problems (Problems 2, 3, and 4). 12 . 5 . 1 P a r s e v a l ’ s T h e o r e m The Fourier transform is an energy-preserving transform; that is, the total energy when we look at the time representation of the signal is the same as the total energy when we look at the frequency representation of the signal. This makes sense because the total energy is a physical property of the signal and should not change when we look at it using different representations. Mathematically, this is stated as − ft2 = 1 2 − F2 d (12.24) The 1 2 factor is a result of using units of radians () for frequency instead of Hertz (f). If we substitute = 2f in Equation (12.24), the 2 factor will go away. This property applies to any vector space representation obtained using an orthonormal basis set. 12 . 5 . 2 M o d u l a t i o n P r o p e r t y If ft has the Fourier transform F, then the Fourier transform of ftej0t is Fw−w0. That is, multiplication with a complex exponential in the time domain corresponds to a shift
  • 390. 12.5 Fourier Transform 367 in the frequency domain. As a sinusoid can be written as a sum of complex exponentials, multiplication of ft by a sinusoid will also correspond to shifts of F. For example, cos0t = ej0t +e−j0t 2 Therefore, ftcos0t = 1 2 F−0+F+0 12 . 5 . 3 C o n v o l u t i o n T h e o r e m When we examine the relationships between the input and output of linear systems, we will encounter integrals of the following forms: ft = − f1f2t −d or ft = − f1t −f2d These are called convolution integrals. The convolution operation is often denoted as ft = f1t⊗f2t The convolution theorem states that if F = ft = f1t⊗f2t F1 = f1t , and F2 = f2t , then F = F1F2 We can also go in the other direction. If F = F1⊗F2 = F1F2−d then ft = f1tf2t As mentioned earlier, this property of the Fourier transform is important because the convolution integral relates the input and output of linear systems, which brings us to one of the major reasons for the popularity of the Fourier transform. We have claimed that the Fourier series and Fourier transform provide us with an alternative frequency profile of a signal. Although sinusoids are not the only basis set that can provide us with a frequency profile, they do, however, have an important property that helps us study linear systems, which we describe in the next section.
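For sampled signals, the discrete analog of the convolution theorem can be checked directly with the discrete Fourier transform introduced later in this chapter: the transform of a circular convolution equals the product of the transforms. A minimal sketch with arbitrary test sequences:

    import numpy as np

    # Discrete analog of the convolution theorem: the DFT of the circular
    # convolution of two sequences equals the product of their DFTs.
    rng = np.random.default_rng(0)
    N = 16
    f1 = rng.standard_normal(N)
    f2 = rng.standard_normal(N)

    # Circular convolution computed directly from its definition
    conv = np.array([np.sum(f1 * np.roll(f2[::-1], n + 1)) for n in range(N)])

    # ... and via the transform domain
    conv_fft = np.real(np.fft.ifft(np.fft.fft(f1) * np.fft.fft(f2)))

    print(np.allclose(conv, conv_fft))   # True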
  • 391. 368 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S 12 . 6 Linear Systems A linear system is a system that has the following two properties: Homogeneity: Suppose we have a linear system L with input ft and output gt: gt = L ft If we have two inputs, f1t and f2t, with corresponding outputs, g1t and g2t, then the output of the sum of the two inputs is simply the sum of the two outputs: L f1t+f2t = g1t+g2t Scaling: Given a linear system L with input ft and output gt, if we multiply the input with a scalar , then the output will be multiplied by the same scalar: L ft = L ft = gt The two properties together are referred to as superposition. 12 . 6 . 1 T i m e I n v a r i a n c e Of specific interest to us are linear systems that are time invariant. A time-invariant system has the property that the shape of the response of this system does not depend on the time at which the input was applied. If the response of a linear system L to an input ft is gt, L ft = gt and we delay the input by some interval t0, then if L is a time-invariant system, the output will be gt delayed by the same amount: L ft −t0 = gt −t0 (12.25) 12 . 6 . 2 T r a n s f e r F u n c t i o n Linear time-invariant systems have a very interesting (and useful) response when the input is a sinusoid. If the input to a linear system is a sinusoid of a certain frequency 0, then the output is also a sinusoid of the same frequency that has been scaled and delayed; that is, L cos0t = cos0t −td or in terms of the complex exponential L ej0t = ej0t−td Thus, given a linear system, we can characterize its response to sinusoids of a particular frequency by a pair of parameters, the gain and the delay td. In general, we use the phase = 0td in place of the delay. The parameters and will generally be a function of the
  • 392. 12.6 Linear Systems 369 frequency, so in order to characterize the system for all frequencies, we will need a pair of functions and . As the Fourier transform allows us to express the signal as coefficients of sinusoids, given an input ft, all we need to do is, for each frequency , multiply the Fourier transform of ft with some ej , where and are the gain and phase terms of the linear system for that particular frequency. This pair of functions and constitute the transfer function of the linear time-invariant system H: H = Hej where H = . Because of the specific way in which a linear system responds to a sinusoidal input, given a linear system with transfer function H, input ft, and output gt, the Fourier transforms of the input and output F and G are related by Gw = HF Using the convolution theorem, ft and gt are related by gt = − fht −d or gt = − ft −hd where H is the Fourier transform of ht. 12 . 6 . 3 I m p u l s e R e s p o n s e To see what ht is, let us look at the input-output relationship of a linear time-invariant system from a different point of view. Let us suppose we have a linear system L with input ft. We can obtain a staircase approximation fSt to the function ft, as shown in Figure 12.4: fSt = fntrect t −nt t (12.26) where rect t T = 1 t T 2 0 otherwise (12.27) The response of the linear system can be written as L fSt = L fntrect t −nt t (12.28) = L fnt rect t−nt t t t (12.29)
  • 393. 370 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S Δt F I G U R E 12. 4 A function of time. For a given value of t, we can use the superposition property of linear systems to obtain L fSt = fntL rectt−nt t t t (12.30) If we now take the limit as t goes to zero in this equation, on the left-hand side fSt will go to ft. To see what happens on the right-hand side of the equation, first let’s look at the effect of this limit on the function rect t t /t. As t goes to zero, this function becomes narrower and taller. However, at all times the integral of this function is equal to one. The limit of this function as t goes to zero is called the Dirac delta function, or impulse function, and is denoted by t: lim t→0 rectt−nt t t = t (12.31) Therefore, L ft = lim t→0 L fSt = fL t − d (12.32) Denote the response of the system L to an impulse, or the impulse response, by ht: ht = L t (12.33) Then, if the system is also time invariant, L ft = fht −d (12.34) Using the convolution theorem, we can see that the Fourier transform of the impulse response ht is the transfer function H. The Dirac delta function is an interesting function. In fact, it is not clear that it is a function at all. It has an integral that is clearly one, but at the only point where it is not zero,
  • 394. 12.6 Linear Systems 371 it is undefined! One property of the delta function that makes it very useful is the sifting property: t2 t1 ftt −t0dt = ft0 t1 ≤ t0 ≤ t2 0 otherwise (12.35) 12 . 6 . 4 F i l t e r The linear systems of most interest to us will be systems that permit certain frequency components of the signal to pass through, while attenuating all other components of the signal. Such systems are called filters. If the filter allows only frequency components below a certain frequency W Hz to pass through, the filter is called a low-pass filter. The transfer function of an ideal low-pass filter is given by H = e−j 2W 0 otherwise (12.36) This filter is said to have a bandwidth of W Hz. The magnitude of this filter is shown in Figure 12.5. A low-pass filter will produce a smoothed version of the signal by blocking higher-frequency components that correspond to fast variations in the signal. A filter that attenuates the frequency components below a certain frequency W and allows the frequency components above this frequency to pass through is called a high-pass filter. A high-pass filter will remove slowly changing trends from the signal. Finally, a signal that lets through a range of frequencies between two specified frequencies, say, W1 and W2, is called a band-pass filter. The bandwidth of this filter is said to be W2 −W1 Hz. The magnitude of the transfer functions of an ideal high-pass filter and an ideal band-pass filter with bandwidth W are shown in Figure 12.6. In all the ideal filter characteristics, there is a sharp transition between the passband of the filter (the range of frequencies that are not attenuated) and the stopband of the filter (those frequency intervals where the signal is completely attenuated). Real filters do not have such sharp transitions, or cutoffs. 2πW ω H(ω) F I G U R E 12. 5 Magnitude of the transfer function of an ideal low-pass filter.
  • 395. 372 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S 2πW ω ω ω0 ω0 +2πW H(ω) H(ω) F I G U R E 12. 6 Magnitudes of the transfer functions of ideal high-pass (left) and ideal band-pass (right) filters. 2πW ω H(ω) F I G U R E 12. 7 Magnitude of the transfer functions of a realistic low-pass filter. The magnitude characteristics of a more realistic low-pass filter are shown in Figure 12.7. Notice the more gentle rolloff. But when the cutoff between stopband and passband is not sharp, how do we define the bandwidth? There are several different ways of defining the bandwidth. The most common way is to define the frequency at which the magnitude of the transfer function is 1/ √ 2 of its maximum value (or the magnitude squared is 1/2 of its maximum value) as the cutoff frequency. 12 . 7 Sampling In 1928 Harry Nyquist at Bell Laboratories showed that if we have a signal whose Fourier transform is zero above some frequency W Hz, it can be accurately represented using 2W equally spaced samples per second. This very important result, known as the sampling theorem, is at the heart of our ability to transmit analog waveforms such as speech and video
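As a small illustration of this definition, the sketch below computes the magnitude response of a four-tap moving-average filter (an arbitrary, very simple low-pass example) and locates the frequency at which the magnitude falls to 1/sqrt(2) of its maximum, i.e., the cutoff frequency in the sense just described.

    import numpy as np

    # Impulse response of a 4-tap moving-average filter.
    h = np.ones(4) / 4.0

    # Magnitude of the transfer function H(w) = sum_n h[n] e^{-jwn} on a grid
    w = np.linspace(0.0, np.pi, 2048)
    H = np.array([np.sum(h * np.exp(-1j * w_k * np.arange(len(h)))) for w_k in w])
    mag = np.abs(H)

    # Cutoff: first frequency where |H| falls to 1/sqrt(2) of its maximum
    cutoff = w[np.argmax(mag < mag.max() / np.sqrt(2))]
    print(cutoff / np.pi)   # about 0.23*pi radians for this filter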
using digital means. There are several ways to prove this result. We will use the results presented in the previous section to do so.

12.7.1 Ideal Sampling—Frequency Domain View

Let us suppose we have a function $f(t)$ with Fourier transform $F(\omega)$, shown in Figure 12.8, which is zero for $|\omega|$ greater than $2\pi W$. Define the periodic extension of $F(\omega)$ as

$F_P(\omega) = \sum_{n=-\infty}^{\infty} F(\omega - n\sigma_0), \qquad \sigma_0 = 4\pi W$   (12.37)

The periodic extension is shown in Figure 12.9. As $F_P(\omega)$ is periodic, we can express it in terms of a Fourier series expansion:

$F_P(\omega) = \sum_{n=-\infty}^{\infty} c_n e^{-jn\frac{1}{2W}\omega}$   (12.38)

The coefficients of the expansion $\{c_n\}_{n=-\infty}^{\infty}$ are then given by

$c_n = \frac{1}{4\pi W} \int_{-2\pi W}^{2\pi W} F_P(\omega)\, e^{jn\frac{1}{2W}\omega} \, d\omega$   (12.39)

However, in the interval $(-2\pi W, 2\pi W)$, $F(\omega)$ is identical to $F_P(\omega)$; therefore,

$c_n = \frac{1}{4\pi W} \int_{-2\pi W}^{2\pi W} F(\omega)\, e^{jn\frac{1}{2W}\omega} \, d\omega$   (12.40)

FIGURE 12.8 A function F(ω).

FIGURE 12.9 The periodic extension F_P(ω).

The function $F(\omega)$ is zero outside the interval $(-2\pi W, 2\pi W)$, so we can extend the limits to infinity without changing the result:

$c_n = \frac{1}{2W} \left[ \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega)\, e^{jn\frac{1}{2W}\omega} \, d\omega \right]$   (12.41)

The expression in brackets is simply the inverse Fourier transform evaluated at $t = \frac{n}{2W}$; therefore,

$c_n = \frac{1}{2W}\, f\!\left(\frac{n}{2W}\right)$   (12.42)

Knowing $\{c_n\}_{n=-\infty}^{\infty}$ and the value of $W$, we can reconstruct $F_P(\omega)$. Because $F_P(\omega)$ and $F(\omega)$ are identical in the interval $(-2\pi W, 2\pi W)$, knowing $\{c_n\}_{n=-\infty}^{\infty}$ we can also reconstruct $F(\omega)$ in this interval. But $\{c_n\}_{n=-\infty}^{\infty}$ are simply the samples of $f(t)$ taken every $\frac{1}{2W}$ seconds, and $F(\omega)$ is zero outside this interval. Therefore, given the samples of a function $f(t)$ obtained at a rate of $2W$ samples per second, we should be able to exactly reconstruct the function $f(t)$. Let us see how we can do this:

$f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega)\, e^{j\omega t} \, d\omega$   (12.43)

$= \frac{1}{2\pi} \int_{-2\pi W}^{2\pi W} F(\omega)\, e^{j\omega t} \, d\omega$   (12.44)

$= \frac{1}{2\pi} \int_{-2\pi W}^{2\pi W} F_P(\omega)\, e^{j\omega t} \, d\omega$   (12.45)

$= \frac{1}{2\pi} \int_{-2\pi W}^{2\pi W} \sum_{n=-\infty}^{\infty} c_n e^{-jn\frac{1}{2W}\omega} e^{j\omega t} \, d\omega$   (12.46)

$= \frac{1}{2\pi} \sum_{n=-\infty}^{\infty} c_n \int_{-2\pi W}^{2\pi W} e^{j\omega\left(t - \frac{n}{2W}\right)} \, d\omega$   (12.47)

Evaluating the integral and substituting for $c_n$ from Equation (12.42), we obtain

$f(t) = \sum_{n=-\infty}^{\infty} f\!\left(\frac{n}{2W}\right) \operatorname{Sinc}\!\left[2W\left(t - \frac{n}{2W}\right)\right]$   (12.48)

where

$\operatorname{Sinc}(x) = \frac{\sin(\pi x)}{\pi x}$   (12.49)

Thus, given samples of $f(t)$ taken every $\frac{1}{2W}$ seconds, or, in other words, samples of $f(t)$ obtained at a rate of $2W$ samples per second, we can reconstruct $f(t)$ by interpolating between the samples using the Sinc function.
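A small numerical illustration of Equation (12.48): we sample a sinusoid well above its Nyquist rate and rebuild it on a finer grid by Sinc interpolation. The test signal, sampling rate, and record length are arbitrary choices, and because the sum is truncated to a finite record the reconstruction is only approximate (it is evaluated away from the ends of the record, where truncation error is largest).

    import numpy as np

    def sinc_reconstruct(samples, fs, t):
        # Eq. (12.48): f(t) = sum_n f(n/2W) Sinc[2W (t - n/2W)], with 2W = fs.
        # np.sinc(x) is sin(pi x)/(pi x), matching the book's definition (12.49).
        n = np.arange(len(samples))
        return np.array([np.sum(samples * np.sinc(fs * tk - n)) for tk in t])

    # Sample a 50 Hz sinusoid at fs = 400 samples/second (well above 2W = 100)
    fs = 400.0
    n = np.arange(64)
    samples = np.sin(2 * np.pi * 50 * n / fs)

    # Reconstruct on a finer grid and compare with the true signal
    t = np.arange(0.02, 0.10, 1e-4)          # stay away from the ends of the record
    approx = sinc_reconstruct(samples, fs, t)
    exact = np.sin(2 * np.pi * 50 * t)
    print(np.max(np.abs(approx - exact)))     # modest error, due only to truncating the sum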
  • 398. 12.7 Sampling 375 12 . 7 . 2 I d e a l S a m p l i n g - — T i m e D o m a i n V i e w Let us look at this process from a slightly different point of view, starting with the sampling operation. Mathematically, we can represent the sampling operation by multiplying the function ft with a train of impulses to obtain the sampled function fSt: fSt = ft n=− t −nT T 1 2W (12.50) To obtain the Fourier transform of the sampled function, we use the convolution theorem: ft n=− t −nT = ft ⊗ n=− t −nT (12.51) Let us denote the Fourier transform of ft by F. The Fourier transform of a train of impulses in the time domain is a train of impulses in the frequency domain (Problem 5): n=− t −nT = 0 n=− w−n0 0 = 2 T (12.52) Thus, the Fourier transform of fSt is FS = F⊗ n=− w−n0 (12.53) = n=− F⊗w−n0 (12.54) = n=− F−n0 (12.55) where the last equality is due to the sifting property of the delta function. Pictorially, for F as shown in Figure 12.8, FS is shown in Figure 12.10. Note that if T is less than 1 2W 0 is greater than 4W, and as long as 0 is greater than 4W, we can recover F by passing FS through an ideal low-pass filter with bandwidth W Hz (2W radians). What happens if we do sample at a rate less than 2W samples per second (that is, 0 is less than 4W)? Again we can see the results most easily in a pictorial fashion. The result 2πW ω σ0 −2πW F I G U R E 12. 10 Fourier transform of the sampled function.
  • 399. 376 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S 2πW ω −2πW 2πW ω −2πW F I G U R E 12. 11 Effect of sampling at a rate less than 2W samples per second. 2πW ω −2πW F I G U R E 12. 12 Aliased reconstruction. for 0 equal to 3W is shown in Figure 12.11. Filtering this signal through an ideal low-pass filter, we get the distorted signal shown in Figure 12.12. Therefore, if 0 is less than 4W, we can no longer recover the signal ft from its samples. This distortion is known as aliasing. In order to prevent aliasing, it is useful to filter the signal prior to sampling using a low-pass filter with a bandwidth less than half the sampling frequency. Once we have the samples of a signal, sometimes the actual times they were sampled at are not important. In these situations we can normalize the sampling frequency to unity. This means that the highest frequency component in the signal is at 0.5 Hz, or radians. Thus, when dealing with sampled signals, we will often talk about frequency ranges of − to . 12 . 8 Discrete Fourier Transform The procedures that we gave for obtaining the Fourier series and transform were based on the assumption that the signal we were examining could be represented as a continuous function of time. However, for the applications that we will be interested in, we will primarily be dealing with samples of a signal. To obtain the Fourier transform of nonperiodic signals, we
  • 400. 12.8 Discrete Fourier Transform 377 started from the Fourier series and modified it to take into account the nonperiodic nature of the signal. To obtain the discrete Fourier transform (DFT), we again start from the Fourier series. We begin with the Fourier series representation of a sampled function, the discrete Fourier series. Recall that the Fourier series coefficients of a periodic function ft with period T is given by ck = 1 T T 0 ftejkw0t dt (12.56) Suppose instead of a continuous function, we have a function sampled N times during each period T. We can obtain the coefficients of the Fourier series representation of this sampled function as Fk = 1 T T 0 ft N−1 n=0 t − n N T ejkw0t dt (12.57) = 1 T N−1 n=0 f n N T ej 2kn N (12.58) where we have used the fact that w0 = 2 T , and we have replaced ck by Fk. Taking T = 1 for convenience and defining fn = f n N we get the coefficients for the discrete Fourier series (DFS) representation: Fk = N−1 n=0 fnej 2kn N (12.59) Notice that the sequence of coefficients Fk is periodic with period N. The Fourier series representation was given by ft = k=− ckejn0t (12.60) Evaluating this for t = n N T, we get fn = f n N T = k=− ckej 2kn N (12.61) Let us write this in a slightly different form: fn = N−1 k=0 l=− ck+lN ej 2nk+lN N (12.62) but ej 2nk+lN N = ej 2kn N ej2nl (12.63) = ej 2kn N (12.64)
  • 401. 378 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S Therefore, fn = N−1 k=0 ej 2kn N l=− ck+lN (12.65) Define c̄k = l=− ck+lN (12.66) Clearly, c̄k is periodic with period N. In fact, we can show that c̄k = 1 N Fk and fn = 1 N N−1 k=0 Fkej 2kn N (12.67) Obtaining the discrete Fourier transform from the discrete Fourier series is simply a matter of interpretation. We are generally interested in the discrete Fourier transform of a finite-length sequence. If we assume that the finite-length sequence is one period of a periodic sequence, then we can use the DFS equations to represent this sequence. The only difference is that the expressions are only valid for one “period” of the “periodic” sequence. The DFT is an especially powerful tool because of the existence of a fast algorithm, called appropriately the fast Fourier transform (FFT), that can be used to compute it. 12 . 9 Z-Transform In the previous section we saw how to extend the Fourier series to use with sampled functions. We can also do the same with the Fourier transform. Recall that the Fourier transform was given by the equation F = − fte−jt dt (12.68) Replacing ft with its sampled version, we get F = − ft n=− t −nTe−jt dt (12.69) = n=− fne−jnT (12.70) where fn = fnT. This is called the discrete time Fourier transform. The Z-transform of the sequence fn is a generalization of the discrete time Fourier transform and is given by Fz = n=− fnz−n (12.71) where z = eT+jwT (12.72)
  • 402. 12.9 Z-Transform 379 Notice that if we let equal zero, we get the original expression for the Fourier transform of a discrete time sequence. We denote the Z-transform of a sequence by Fz = fn We can express this another way. Notice that the magnitude of z is given by z = eT Thus, when equals zero, the magnitude of z is one. Because z is a complex number, the magnitude of z is equal to one on the unit circle in the complex plane. Therefore, we can say that the Fourier transform of a sequence can be obtained by evaluating the Z-transform of the sequence on the unit circle. Notice that the Fourier transform thus obtained will be periodic, which we expect because we are dealing with a sampled function. Further, if we assume T to be one, varies from − to , which corresponds to a frequency range of −05 to 0.5 Hz. This makes sense because, by the sampling theorem, if the sampling rate is one sample per second, the highest frequency component that can be recovered is 0.5 Hz. For the Z-transform to exist—in other words, for the power series to converge—we need to have n=− fnz−n Whether this inequality holds will depend on the sequence itself and the value of z. The values of z for which the series converges are called the region of convergence of the Z-transform. From our earlier discussion, we can see that for the Fourier transform of the sequence to exist, the region of convergence should include the unit circle. Let us look at a simple example. E x a m p l e 12 . 9 . 1: Given the sequence fn = an u n where u n is the unit step function u n = 1 n ≥ 0 0 n 0 (12.73) the Z-transform is given by Fz = n=0 an z−n (12.74) = n=0 az−1 n (12.75) This is simply the sum of a geometric series. As we confront this kind of sum quite often, let us briefly digress and obtain the formula for the sum of a geometric series.
Suppose we have a sum

$S_{m,n} = \sum_{k=m}^{n} x^k = x^m + x^{m+1} + \cdots + x^n$   (12.76)

then

$x S_{m,n} = x^{m+1} + x^{m+2} + \cdots + x^{n+1}$   (12.77)

Subtracting Equation (12.77) from Equation (12.76), we get

$(1 - x) S_{m,n} = x^m - x^{n+1}$

and

$S_{m,n} = \frac{x^m - x^{n+1}}{1 - x}$

If the upper limit of the sum is infinity, we take the limit as $n$ goes to infinity. This limit exists only when $|x| < 1$. Using this formula, we get the Z-transform of the sequence $\{f_n\}$ as

$F(z) = \frac{1}{1 - a z^{-1}}, \qquad |az^{-1}| < 1$   (12.78)

$= \frac{z}{z - a}, \qquad |z| > |a|$   (12.79)

In this example the region of convergence is the region $|z| > |a|$. For the Fourier transform to exist, we need to include the unit circle in the region of convergence. In order for this to happen, $|a|$ has to be less than one.

Using this example, we can get some other Z-transforms that will be useful to us.

Example 12.9.2:

In the previous example we found that

$\sum_{n=0}^{\infty} a^n z^{-n} = \frac{z}{z - a}, \qquad |z| > |a|$   (12.80)

If we take the derivative of both sides of the equation with respect to $a$, we get

$\sum_{n=0}^{\infty} n a^{n-1} z^{-n} = \frac{z}{(z - a)^2}, \qquad |z| > |a|$   (12.81)

Thus,

$\mathcal{Z}\bigl[ n a^{n-1} u[n] \bigr] = \frac{z}{(z - a)^2}, \qquad |z| > |a|$
  • 404. 12.9 Z-Transform 381 If we differentiate Equation (12.80) m times, we get n=0 nn−1···n−m+1an−m = m!z z−am+1 In other words, n m an−m u n = z z−am+1 (12.82) In these examples the Z-transform is a ratio of polynomials in z. For sequences of interest to us, this will generally be the case, and the Z-transform will be of the form Fz = Nz Dz The values of z for which Fz is zero are called the zeros of Fz; the values for which Fz is infinity are called the poles of Fz. For finite values of z, the poles will occur at the roots of the polynomial Dz. The inverse Z-transform is formally given by the contour integral 1 2j C Fzzn−1 dz where the integral is over the counterclockwise contour C, and C lies in the region of convergence. This integral can be difficult to evaluate directly; therefore, in most cases we use alternative methods for finding the inverse Z-transform. 12 . 9 . 1 T a b u l a r M e t h o d The inverse Z-transform has been tabulated for a number of interesting cases (see Table 12.1). If we can write Fz as a sum of these functions Fz = iFiz T A B L E 12 . 1 Some Z-transform pairs. fn Fz an u n z z−a nTu n Tz−1 1−z−12 sinnT sin nTz−1 1−2cosTz−1+z−2 cosnT cosnTz−1 1−2cosTz−1+z−2
  • 405. 382 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S then the inverse Z-transform is given by fn = ifin where Fiz = fin . E x a m p l e 12 . 9 . 3: Fz = z z−05 + 2z z−03 From our earlier example we know the inverse Z-transform of z/z − a. Using that, the inverse Z-transform of Fz is fn = 05n u n +203n u n 12 . 9 . 2 P a r t i a l F r a c t i o n E x p a n s i o n In order to use the tabular method, we need to be able to decompose the function of interest to us as a sum of simpler terms. The partial fraction expansion approach does exactly that when the function is a ratio of polynomials in z. Suppose Fz can be written as a ratio of polynomials Nz and Dz. For the moment let us assume that the degree of Dz is greater than the degree of Nz, and that all the roots of Dz are distinct (distinct roots are referred to as simple roots); that is, Fz = Nz z−z1z−z2···z−zL (12.83) Then we can write Fz/z as Fz z = L i=1 Ai z−zi (12.84) If we can find the coefficients Ai, then we can write Fz as Fz = L i=1 Aiz z−zi and the inverse Z-transform will be given by fn = L i=1 Aizn i u n
The question then becomes one of finding the value of the coefficients $A_i$. This can be simply done as follows. Suppose we want to find the coefficient $A_k$. Multiply both sides of Equation (12.84) by $(z - z_k)$. Simplifying this we obtain
$$\frac{F(z)(z - z_k)}{z} = \sum_{i=1}^{L} \frac{A_i (z - z_k)}{z - z_i} \qquad (12.85)$$
$$= A_k + \sum_{\substack{i=1 \\ i \ne k}}^{L} \frac{A_i (z - z_k)}{z - z_i} \qquad (12.86)$$
Evaluating this equation at $z = z_k$, all the terms in the summation go to zero and
$$A_k = \left. \frac{F(z)(z - z_k)}{z} \right|_{z = z_k} \qquad (12.87)$$
Example 12.9.4: Let us use the partial fraction expansion method to find the inverse Z-transform of
$$F(z) = \frac{6z^2 - 9z}{z^2 - 2.5z + 1}$$
Then
$$\frac{F(z)}{z} = \frac{1}{z} \cdot \frac{6z^2 - 9z}{z^2 - 2.5z + 1} \qquad (12.88)$$
$$= \frac{6z - 9}{(z - 0.5)(z - 2)} \qquad (12.89)$$
We want to write $F(z)/z$ in the form
$$\frac{F(z)}{z} = \frac{A_1}{z - 0.5} + \frac{A_2}{z - 2}$$
Using the approach described above, we obtain
$$A_1 = \left. \frac{(6z - 9)(z - 0.5)}{(z - 0.5)(z - 2)} \right|_{z = 0.5} \qquad (12.90)$$
$$= 4 \qquad (12.91)$$
$$A_2 = \left. \frac{(6z - 9)(z - 2)}{(z - 0.5)(z - 2)} \right|_{z = 2} \qquad (12.92)$$
$$= 2 \qquad (12.93)$$
Therefore,
$$F(z) = \frac{4z}{z - 0.5} + \frac{2z}{z - 2}$$
  • 407. 384 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S and fn = 405n +22n u n The procedure becomes slightly more complicated when we have repeated roots of Dz. Suppose we have a function Fz = Nz z−z1z−z22 The partial fraction expansion of this function is Fz z = A1 z−z1 + A2 z−z2 + A3 z−z22 The values of A1 and A3 can be found as shown previously: A1 = Fzz−z1 z z=z1 (12.94) A3 = Fzz−z22 z z=z2 (12.95) However, we run into problems when we try to evaluate A2. Let’s see what happens when we multiply both sides by z−z2: Fzz−z2 z = A1z−z2 z−z1 +A2 + A3 z−z2 (12.96) If we now evaluate this equation at z = z2, the third term on the right-hand side becomes undefined. In order to avoid this problem, we first multiply both sides by z−z22 and take the derivative with respect to z prior to evaluating the equation at z = z2: Fzz−z22 z = A1z−z22 z−z1 +A2z−z2+A3 (12.97) Taking the derivative of both sides with respect to z, we get d dz Fzz−z22 z = 2A1z−z2z−z1−A1z−z22 z−z12 +A2 (12.98) If we now evaluate the expression at z = z2, we get A2 = d dz Fzz−z22 z z=z2 (12.99)
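Equation (12.87) is easy to mechanize when the roots are simple. The following illustrative Python sketch (not code from the text) recovers the coefficients of Example 12.9.4 numerically from the numerator and denominator of F(z)/z.

```python
# Sketch: compute the partial fraction coefficients of Example 12.9.4.
# F(z)/z = (6z - 9) / (z^2 - 2.5z + 1); the (monic) denominator is assumed
# to have only simple roots, so Equation (12.87) reduces to dividing the
# numerator at z_k by the product of the remaining linear factors.
import numpy as np

num = np.poly1d([6, -9])
den = np.poly1d([1, -2.5, 1])

for zk in den.roots:                        # roots are 2.0 and 0.5
    others = np.prod([zk - zi for zi in den.roots if zi != zk])
    print(zk, num(zk) / others)             # prints 2.0 -> 2.0 and 0.5 -> 4.0

# With these coefficients, f_n = 4 (0.5)^n u[n] + 2 (2)^n u[n], as in the text.
```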
  • 408. 12.9 Z-Transform 385 Generalizing this approach, we can show that if Dz has a root of order m at some zk, that portion of the partial fraction expansion can be written as Fz z = A1 z−zk + A2 z−zk2 +···+ Am z−zkm (12.100) and the lth coefficient can be obtained as Al = 1 m−l! dm−l dzm−l Fzz−zk m z z=zk (12.101) Finally, let us drop the requirement that the degree of Dz be greater or equal to the degree of Nz. When the degree of Nz is greater than the degree of Dz, we can simply divide Nz by Dz to obtain Fz = Nz Dz = Qz+ Rz Dz (12.102) where Qz is the quotient and Rz is the remainder of the division operation. Clearly, Rz will have degree less than Dz. To see how all this works together, consider the following example. E x a m p l e 12 . 9 . 5: Let us find the inverse Z-transform of the function Fz = 2z4 +1 2z3 −5z2 +4z−1 (12.103) The degree of the numerator is greater than the degree of the denominator, so we divide once to obtain Fz = z+ 5z3 −4z2 +z+1 2z3 −5z2 +4z−1 (12.104) The inverse Z-transform of z is n−1, where n is the discrete delta function defined as n = 1 n = 0 0 otherwise. (12.105) Let us call the remaining ratio of polynomials F1z. We find the roots of the denominator of F1z as F1z = 5z3 −4z2 +z+1 2z−05z−12 (12.106)
  • 409. 386 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S Then F1z z = 5z3 −4z2 +z+1 2zz−05z−12 (12.107) = A1 z + A2 z−05 + A3 z−1 + A4 z−12 (12.108) Then A1 = 5z3 −4z2 +z+1 2z−05z−12 z=0 = −1 (12.109) A2 = 5z3 −4z2 +z+1 2zz−12 z=05 = 45 (12.110) A4 = 5z3 −4z2 +z+1 2zz−05 z=1 = 3 (12.111) To find A3, we take the derivative with respect to z, then set z = 1: A3 = d dz 5z3 −4z2 +2z+1 2zz−05 z=1 = −3 (12.112) Therefore, F1z = −1+ 45z z−05 − 3z z−1 + 3z z−12 (12.113) and f1n = −n +4505n u n −3u n +3nu n (12.114) and fn = n−1 −n +4505n u n −3−3nu n (12.115) 12 . 9 . 3 L o n g D i v i s i o n If we could write Fz as a power series, then from the Z-transform expression the coefficients of z−n would be the sequence values fn. E x a m p l e 12 . 9 . 6: Let’s find the inverse z-transform of Fz = z z−a
  • 410. 12.9 Z-Transform 387 Dividing the numerator by the denominator we get the following: 1 + az−1 + a2 z−2 ··· z−a z z − a a a − a2 z−1 a2 z−1 Thus, the quotient is 1+az−1 +a2 z−2 +··· = n=0 an z−n We can easily see that the sequence for which Fz is the Z-transform is fn = an u n 12 . 9 . 4 Z - T r a n s f o r m P r o p e r t i e s Analogous to the continuous linear systems, we can define the transfer function of a discrete linear system as a function of z that relates the Z-transform of the input to the Z-transform of the output. Let fn n=− be the input to a discrete linear time-invariant system, and gn n=− be the output. If Fz is the Z-transform of the input sequence, and Gz is the Z-transform of the output sequence, then these are related to each other by Gz = HzFz (12.116) and Hz is the transfer function of the discrete linear time-invariant system. If the input sequence fn n=− had a Z-transform of one, then Gz would be equal to Hz. It is an easy matter to find the requisite sequence: Fz = n=− fnz−n = 1 ⇒ fn = 1 n = 0 0 otherwise. (12.117) This particular sequence is called the discrete delta function. The response of the system to the discrete delta function is called the impulse response of the system. Obviously, the transfer function Hz is the Z-transform of the impulse response. 12 . 9 . 5 D i s c r e t e C o n v o l u t i o n In the continuous time case, the output of the linear time-invariant system was a convolution of the input with the impulse response. Does the analogy hold in the discrete case? We can check this out easily by explicitly writing out the Z-transforms in Equation (12.116). For
simplicity let us assume the sequences are all one-sided; that is, they are only nonzero for nonnegative values of the subscript:
$$\sum_{n=0}^{\infty} g_n z^{-n} = \left( \sum_{n=0}^{\infty} h_n z^{-n} \right) \left( \sum_{m=0}^{\infty} f_m z^{-m} \right) \qquad (12.118)$$
Equating like powers of $z$:
$$g_0 = h_0 f_0$$
$$g_1 = f_0 h_1 + f_1 h_0$$
$$g_2 = f_0 h_2 + f_1 h_1 + f_2 h_0$$
$$\vdots$$
$$g_n = \sum_{m=0}^{n} f_m h_{n-m}$$
Thus, the output sequence is a result of the discrete convolution of the input sequence with the impulse response.
Most of the discrete linear systems we will be dealing with will be made up of delay elements, and their input-output relations can be written as constant coefficient difference equations. For example, for the system shown in Figure 12.13, the input-output relationship can be written in the form of the following difference equation:
$$g_k = a_0 f_k + a_1 f_{k-1} + a_2 f_{k-2} + b_1 g_{k-1} + b_2 g_{k-2} \qquad (12.119)$$
The transfer function of this system can be easily found by using the shifting theorem. The shifting theorem states that if the Z-transform of a sequence $\{f_n\}$ is $F(z)$, then the Z-transform of the sequence shifted by some integer number of samples $n_0$ is $z^{-n_0} F(z)$.
FIGURE 12.13 A discrete system (delay elements with feedforward coefficients $a_0$, $a_1$, $a_2$ applied to $f_k$, $f_{k-1}$, $f_{k-2}$, and feedback coefficients $b_1$, $b_2$ applied to $g_{k-1}$, $g_{k-2}$).
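A short simulation ties the difference equation (12.119), the impulse response, and the convolution sum together. The sketch below is illustrative Python only; the tap values are arbitrary choices, not taken from the text.

```python
# Sketch: run the difference equation of Figure 12.13 / Equation (12.119)
# and verify that its output equals the convolution of the input with the
# impulse response. The coefficient values are arbitrary.
a0, a1, a2 = 1.0, 0.5, 0.25      # feedforward taps
b1, b2 = 0.4, -0.1               # feedback taps

def run(f, n_out):
    """Apply the recursion with zero initial conditions."""
    f = list(f) + [0.0] * n_out  # zero-pad the input
    g = [0.0] * n_out
    for k in range(n_out):
        fk1 = f[k - 1] if k >= 1 else 0.0
        fk2 = f[k - 2] if k >= 2 else 0.0
        gk1 = g[k - 1] if k >= 1 else 0.0
        gk2 = g[k - 2] if k >= 2 else 0.0
        g[k] = a0 * f[k] + a1 * fk1 + a2 * fk2 + b1 * gk1 + b2 * gk2
    return g

h = run([1.0], 20)               # impulse response: the input is the delta function

f = [3.0, -1.0, 2.0, 0.5]        # an arbitrary input sequence
g_direct = run(f, 20)
g_conv = [sum(f[m] * h[n - m] for m in range(len(f)) if 0 <= n - m < len(h))
          for n in range(20)]    # discrete convolution, as derived from (12.118)
print(max(abs(x - y) for x, y in zip(g_direct, g_conv)))   # ~0 (roundoff only)
```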
  • 412. 12.10 Summary 389 The theorem is easy to prove. Suppose we have a sequence fn with Z-transform Fz. Let us look at the Z-transform of the sequence fn−n0 : fn−n0 = n=− fn−n0 z−n (12.120) = m=− fmz−m−n0 (12.121) = z−n0 m=− fmz−m (12.122) = z−n0 Fz (12.123) Assuming Gz is the Z-transform of gn and Fz is the Z-transform of fn , we can take the Z-transform of both sides of the difference equation (12.119): Gz = a0Fz+a1z−1 Fz+a2z−2 Fz+b1z−1 Gz+b2z−2 Gz (12.124) from which we get the relationship between Gz and Fz as Gz = a0 +a1z−1 +a2z−2 1−b1z−1 −b2z−2 Fz (12.125) By definition the transfer function Hz is therefore Hz = Gz Fz (12.126) = a0 +a1z−1 +a2z−2 1−b1z−1 −b2z−2 (12.127) 12 . 10 Summary In this chapter we have reviewed some of the mathematical tools we will be using throughout the remainder of this book. We started with a review of vector space concepts, followed by a look at a number of ways we can represent a signal, including the Fourier series, the Fourier transform, the discrete Fourier series, the discrete Fourier transform, and the Z-transform. We also looked at the operation of sampling and the conditions necessary for the recovery of the continuous representation of the signal from its samples. F u r t h e r R e a d i n g 1. There are a large number of books that provide a much more detailed look at the concepts described in this chapter. A nice one is Signal Processing and Linear Systems, by B.P. Lathi [177]. 2. For a thorough treatment of the fast Fourier transform (FFT), see Numerical Recipes in C, by W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.J. Flannery [178].
  • 413. 390 12 T R A N S F O R M S , S U B B A N D S , A N D W A V E L E T S 12 . 11 Projects and Problems 1. Let X be a set of N linearly independent vectors, and let V be the collection of vectors obtained using all linear combinations of the vectors in X. (a) Show that given any two vectors in V, the sum of these vectors is also an element of V. (b) Show that V contains an additive identity. (c) Show that for every x in V, there exists a −x in V such that their sum is the additive identity. 2. Prove Parseval’s theorem for the Fourier transform. 3. Prove the modulation property of the Fourier transform. 4. Prove the convolution theorem for the Fourier transform. 5. Show that the Fourier transform of a train of impulses in the time domain is a train of impulses in the frequency domain: n=− t −nT = 0 n=− w−n0 0 = 2 T (12.128) 6. Find the Z-transform for the following sequences: (a) hn = 2−n u n , where u n is the unit step function. (b) hn = n2 −n3−n u n . (c) hn = n2−n +06n u n . 7. Given the following input-output relationship: yn = 06yn−1 +05xn +02xn−1 (a) Find the transfer function Hz. (b) Find the impulse response hn . 8. Find the inverse Z-transform of the following: (a) Hz = 5 z−2 . (b) Hz = z z2−025 . (c) Hz = z z−05 .
13 Transform Coding
13.1 Overview
In this chapter we will describe a technique in which the source output is decomposed, or transformed, into components that are then coded according to their individual characteristics. We will then look at a number of different transforms, including the popular discrete cosine transform, and discuss the issues of quantization and coding of the transformed coefficients. This chapter concludes with a description of the baseline sequential JPEG image-coding algorithm and some of the issues involved with transform coding of audio signals.
13.2 Introduction
In the previous chapter we developed a number of tools that can be used to transform a given sequence into different representations. If we take a sequence of inputs and transform them into another sequence in which most of the information is contained in only a few elements, we can then encode and transmit those elements, along with their location in the new sequence, resulting in data compression.
In our discussion, we will use the terms "variance" and "information" interchangeably. The justification for this is shown in the results in Chapter 7. For example, recall that for a Gaussian source the differential entropy is given as $\frac{1}{2}\log 2\pi e \sigma^2$. Thus, an increase in the variance results in an increase in the entropy, which is a measure of the information contained in the source output.
To begin our discussion of transform coding, consider the following example.
  • 415. 392 13 T R A N S F O R M C O D I N G E x a m p l e 13 . 2 . 1: Let’s revisit Example 8.5.1. In Example 8.5.1, we studied the encoding of the output of a source that consisted of a sequence of pairs of numbers. Each pair of numbers corresponds to the height and weight of an individual. In particular, let’s look at the sequence of outputs shown in Table 13.1. If we look at the height and weight as the coordinates of a point in two-dimensional space, the sequence can be shown graphically as in Figure 13.1. Notice that the output T A B L E 13 . 1 Original sequence. Height Weight 65 170 75 188 60 150 70 170 56 130 80 203 68 160 50 110 40 80 50 153 69 148 62 140 76 164 64 120 190 200 180 170 160 150 140 130 120 110 100 90 80 10 20 30 40 50 60 70 80 F I G U R E 13. 1 Source output sequence.
values tend to cluster around the line $y = 2.5x$. We can rotate this set of values by the transformation
$$\boldsymbol{\theta} = \mathbf{A}\mathbf{x} \qquad (13.1)$$
where $\mathbf{x}$ is the two-dimensional source output vector
$$\mathbf{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} \qquad (13.2)$$
$x_0$ corresponds to height and $x_1$ corresponds to weight, $\mathbf{A}$ is the rotation matrix
$$\mathbf{A} = \begin{bmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{bmatrix} \qquad (13.3)$$
$\varphi$ is the angle between the x-axis and the $y = 2.5x$ line, and
$$\boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} \qquad (13.4)$$
is the rotated or transformed set of values. For this particular case the matrix $\mathbf{A}$ is
$$\mathbf{A} = \begin{bmatrix} 0.37139068 & 0.92847669 \\ -0.92847669 & 0.37139068 \end{bmatrix} \qquad (13.5)$$
and the transformed sequence (rounded to the nearest integer) is shown in Table 13.2. (For a brief review of matrix concepts, see Appendix B.)
Notice that for each pair of values, almost all the energy is compacted into the first element of the pair, while the second element of the pair is significantly smaller. If we plot this sequence in pairs, we get the result shown in Figure 13.2. Note that we have rotated the original values by an angle of approximately 68 degrees ($\arctan 2.5$).
TABLE 13.2 Transformed sequence.
First Coordinate    Second Coordinate
182                   3
202                   0
162                   0
184                  -2
141                  -4
218                   1
174                  -4
121                  -6
 90                  -7
161                  10
163                  -9
153                  -6
181                  -9
135                 -15
  • 417. 394 13 T R A N S F O R M C O D I N G 5 −5 10 −10 80 90 100 110 130 160 170 180 190 200 210 220 230 120 140 150 F I G U R E 13. 2 The transformed sequence. Suppose we set all the second elements of the transformation to zero, that is, the second coordinates of the sequence shown in Table 13.2. This reduces the number of elements that need to be encoded by half. What is the effect of throwing away half the elements of the sequence? We can find that out by taking the inverse transform of the reduced sequence. The inverse transform consists of reversing the rotation. We can do this by multiplying the blocks of two of the transformed sequences with the second element in each block set to zero with the matrix A−1 = cos −sin sin cos (13.6) and obtain the reconstructed sequence shown in Table 13.3. Comparing this to the original sequence in Table 13.1, we see that, even though we transmitted only half the number of elements present in the original sequence, this “reconstructed” sequence is very close to the original. The reason there is so little error introduced in the sequence xn is that for this T A B L E 13 . 3 Reconstructed sequence. Height Weight 68 169 75 188 60 150 68 171 53 131 81 203 65 162 45 112 34 84 60 150 61 151 57 142 67 168 50 125
  • 418. 13.2 Introduction 395 particular transformation the error introduced into the xn sequence is equal to the error introduced into the n sequence. That is, N−1 i=0 xi −x̂i 2 = N−1 i=0 i − ˆ i 2 (13.7) where x̂n is the reconstructed sequence, and ˆ i = i i = 0 2 4 0 otherwise (13.8) (see Problem 1). The error introduced in the n sequence is the sum of squares of the ns that are set to zero. The magnitudes of these elements are quite small, and therefore the total error introduced into the reconstructed sequence is quite small also. We could reduce the number of samples we needed to code because most of the infor- mation contained in each pair of values was put into one element of each pair. As the other element of the pair contained very little information, we could discard it without a significant effect on the fidelity of the reconstructed sequence. The transform in this case acted on pairs of values; therefore, the maximum reduction in the number of significant samples was a factor of two. We can extend this idea to longer blocks of data. By compacting most of the information in a source output sequence into a few elements of the transformed sequence using a reversible transform, and then discarding the elements of the sequence that do not contain much information, we can get a large amount of compression. This is the basic idea behind transform coding. In Example 13.2.1 we have presented a geometric view of the transform process. We can also examine the transform process in terms of the changes in statistics between the original and transformed sequences. It can be shown that we can get the maximum amount of compaction if we use a transform that decorrelates the input sequence; that is, the sample- to-sample correlation of the transformed sequence is zero. The first transform to provide decorrelation for discrete data was presented by Hotelling [179] in the Journal of Educational Psychology in 1933. He called his approach the method of principal components. The analogous transform for continuous functions was obtained by Karhunen [180] and Loéve [181]. This decorrelation approach was first utilized for compression, in what we now call transform coding, by Kramer and Mathews [182], and Huang and Schultheiss [183]. Transform coding consists of three steps. First, the data sequence xn is divided into blocks of size N. Each block is mapped into a transform sequence n using a reversible mapping in a manner similar to that described in Example 13.2.1. As shown in the example, different elements of each block of the transformed sequence generally have different statis- tical properties. In Example 13.2.1, most of the energy of the block of two input values was contained in the first element of the block of two transformed values, while very little of the energy was contained in the second element. This meant that the second element of each block of the transformed sequence would have a small magnitude, while the magnitude of the first element could vary considerably depending on the magnitude of the elements in the input block. The second step consists of quantizing the transformed sequence. The quantiza- tion strategy used will depend on three main factors: the desired average bit rate, the statistics
  • 419. 396 13 T R A N S F O R M C O D I N G of the various elements of the transformed sequence, and the effect of distortion in the trans- formed coefficients on the reconstructed sequence. In Example 13.2.1, we could take all the bits available to us and use them to quantize the first coefficient. In more complex situations, the strategy used may be very different. In fact, we may use different techniques, such as differential encoding and vector quantization [118], to encode the different coefficients. Finally, the quantized value needs to be encoded using some binary encoding technique. The binary coding may be as simple as using a fixed-length code or as complex as a combination of run-length coding and Huffman or arithmetic coding. We will see an example of the latter when we describe the JPEG algorithm. The various quantization and binary coding techniques have been described at some length in previous chapters, so we will spend the next section describing various transforms. We will then discuss quantization and coding strategies in the context of these transforms. 13 . 3 The Transform All the transforms we deal with will be linear transforms; that is, we can get the sequence n from the sequence xn as n = N−1 i=0 xian i (13.9) This is referred to as the forward transform. For the transforms that we will be considering, a major difference between the transformed sequence n and the original sequence xn is that the characteristics of the elements of the sequence are determined by their position within the sequence. For example, in Example 13.2.1 the first element of each pair of the transformed sequence was more likely to have a large magnitude compared to the second element. In general, we cannot make such statements about the source output sequence xn. A measure of the differing characteristics of the different elements of the transformed sequence n is the variance 2 n of each element. These variances will strongly influence how we encode the transformed sequence. The size of the block N is dictated by practical considerations. In general, the complexity of the transform grows more than linearly with N. Therefore, beyond a certain value of N, the computational costs overwhelm any marginal improvements that might be obtained by increasing N. Furthermore, in most real sources the statistical characteristics of the source output can change abruptly. For example, when we go from a silence period to a voiced period in speech, the statistics change drastically. Similarly, in images, the statistical characteristics of a smooth region of the image can be very different from the statistical characteristics of a busy region of the image. If N is large, the probability that the statistical characteristics change significantly within a block increases. This generally results in a larger number of the transform coefficients with large values, which in turn leads to a reduction in the compression ratio. The original sequence xn can be recovered from the transformed sequence n via the inverse transform: xn = N−1 i=0 ibn i (13.10)
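In blockwise form, Equations (13.9) and (13.10) are just matrix-vector products. As an illustration of the discard-and-reconstruct step of Example 13.2.1, and of the error equality (13.7), here is a short Python sketch; it is illustrative only and uses just the first few pairs of Table 13.1.

```python
# Sketch: forward transform (13.9), zero the second coefficient of each pair,
# inverse transform (13.10), and check the error equality (13.7) for the
# rotation of Example 13.2.1.
import numpy as np

phi = np.arctan(2.5)                            # about 68 degrees
A = np.array([[ np.cos(phi), np.sin(phi)],
              [-np.sin(phi), np.cos(phi)]])     # forward transform matrix
B = A.T                                         # inverse transform: A is orthonormal

x = np.array([[65, 170], [75, 188], [60, 150], [70, 170]], dtype=float)

theta = x @ A.T                 # each row is a transformed (height, weight) pair
theta_hat = theta.copy()
theta_hat[:, 1] = 0.0           # discard the second coefficient of each pair
x_hat = theta_hat @ B.T         # reconstructed pairs

# the two squared errors are equal (up to roundoff), as stated in (13.7)
print(np.sum((x - x_hat)**2), np.sum((theta - theta_hat)**2))
```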
  • 420. 13.3 The Transform 397 The transforms can be written in matrix form as = Ax (13.11) x = B (13.12) where A and B are N ×N matrices and the i j th element of the matrices is given by Ai j = ai j (13.13) Bi j = bi j (13.14) The forward and inverse transform matrices A and B are inverses of each other; that is, AB = BA = I, where I is the identity matrix. Equations (13.9) and (13.10) deal with the transform coding of one-dimensional sequences, such as sampled speech and audio sequences. However, transform coding is one of the most popular methods used for image compression. In order to take advantage of the two-dimensional nature of dependencies in images, we need to look at two-dimensional transforms. Let Xi j be the i j th pixel in an image. A general linear two-dimensional transform for a block of size N ×N is given as k l = N−1 i=0 N−1 j=0 Xi jai j k l (13.15) All two-dimensional transforms in use today are separable transforms; that is, we can take the transform of a two-dimensional block by first taking the transform along one dimension, then repeating the operation along the other direction. In terms of matrices, this involves first taking the (one-dimensional) transform of the rows, and then taking the column-by-column transform of the resulting matrix. We can also reverse the order of the operations, first taking the transform of the columns, and then taking the row-by-row transform of the resulting matrix. The transform operation can be represented as k l = N−1 i=0 N−1 j=0 ak iXi jai j (13.16) which in matrix terminology would be given by = AXAT (13.17) The inverse transform is given as X = BBT (13.18) All the transforms we deal with will be orthonormal transforms. An orthonormal trans- form has the property that the inverse of the transform matrix is simply its transpose because the rows of the transform matrix form an orthonormal basis set: B = A−1 = AT (13.19)
  • 421. 398 13 T R A N S F O R M C O D I N G For an orthonormal transform, the inverse transform will be given as X = AT A (13.20) Orthonormal transforms are energy preserving; that is, the sum of the squares of the transformed sequence is the same as the sum of the squares of the original sequence. We can see this most easily in the case of the one-dimensional transform: N−1 i=0 2 i = T (13.21) = Ax T Ax (13.22) = xT AT Ax (13.23) If A is an orthonormal transform, AT A = A−1 A = I, then xT AT Ax = xT x (13.24) = N−1 n=0 x2 n (13.25) and N−1 i=0 2 i = N−1 n=0 x2 n (13.26) The efficacy of a transform depends on how much energy compaction is provided by the transform. One way of measuring the amount of energy compaction afforded by a particular orthonormal transform is to take a ratio of the arithmetic mean of the variances of the transform coefficient to their geometric means [123]. This ratio is also referred to as the transform coding gain GTC: GTC = 1 N N−1 i=0 2 i N−1 i=0 2 i 1 N (13.27) where 2 i is the variance of the ith coefficient i. Transforms can be interpreted in several ways. We have already mentioned a geometric interpretation and a statistical interpretation. We can also interpret them as a decomposition of the signal in terms of a basis set. For example, suppose we have a two-dimensional orthonormal transform A. The inverse transform can be written as x0 x1 = a00 a10 a01 a11 0 1 = 0 a00 a01 +1 a10 a11 (13.28) We can see that the transformed values are actually the coefficients of an expansion of the input sequence in terms of the rows of the transform matrix. The rows of the transform matrix are often referred to as the basis vectors for the transform because they form an orthonormal basis set, and the elements of the transformed sequence are often called the transform coefficients. By characterizing the basis vectors in physical terms we can get a physical interpretation of the transform coefficients.
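The energy-preservation property (13.26) and the transform coding gain (13.27) can be checked numerically. The sketch below is illustrative Python; the first-order Gauss-Markov source, its correlation coefficient, and the 2x2 orthonormal transform are arbitrary choices made only for the demonstration.

```python
# Sketch: verify energy preservation (13.26) and estimate the transform
# coding gain G_TC of (13.27) for a 2x2 orthonormal transform applied to
# a synthetic correlated source x_i = rho * x_(i-1) + w_i.
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.9, 100000
x = np.zeros(n)
for i in range(1, n):
    x[i] = rho * x[i - 1] + rng.standard_normal()

A = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)   # orthonormal transform
blocks = x.reshape(-1, 2)
theta = blocks @ A.T                                   # transform each block

print(np.sum(blocks**2), np.sum(theta**2))   # equal: energy is preserved

var = theta.var(axis=0)                      # variances of the two coefficients
gain = var.mean() / var.prod() ** (1 / var.size)       # Equation (13.27)
print(gain)          # noticeably greater than 1 for this correlated source
```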
  • 422. 13.3 The Transform 399 E x a m p l e 13 . 3 . 1: Consider the following transform matrix: A = 1 √ 2 1 1 1 −1 (13.29) We can verify that this is indeed an orthonormal transform. Notice that the first row of the matrix would correspond to a “low-pass” signal (no change from one component to the next), while the second row would correspond to a “high-pass” signal. Thus, if we tried to express a sequence in which each element has the same value in terms of these two rows, the second coefficient should be zero. Suppose the original sequence is . Then 0 1 = 1 √ 2 1 1 1 −1 = √ 2 0 (13.30) The “low-pass” coefficient has a value of √ 2, while the “high-pass” coefficient has a value of 0. The “low-pass” and “high-pass” coefficients are generally referred to as the low-frequency and high-frequency coefficients. Let us take two sequences in which the components are not the same and the degree of variation is different. Consider the two sequences (3, 1) and 3 −1 . In the first sequence the second element differs from the first by 2; in the second sequence, the magnitude of the difference is 4. We could say that the second sequence is more “high pass” than the first sequence. The transform coefficients for the two sequences are 2 √ 2 √ 2 and √ 2 2 √ 2 , respectively. Notice that the high-frequency coefficient for the sequence in which we see a larger change is twice that of the high-frequency coefficient for the sequence with less change. Thus, the two coefficients do seem to behave like the outputs of a low-pass filter and a high-pass filter. Finally, notice that in every case the sum of the squares of the original sequence is the same as the sum of the squares of the transform coefficients; that is, the transform is energy preserving, as it must be, since A is orthonormal. We can interpret one-dimensional transforms as an expansion in terms of the rows of the transform matrix. Similarly, we can interpret two-dimensional transforms as expansions in terms of matrices that are formed by the outer product of the rows of the transform matrix. Recall that the outer product is given by xxT = ⎡ ⎢ ⎢ ⎢ ⎣ x0x0 x0x1 ··· x0xN−1 x1x0 x1x1 ··· x1xN−1 xN−1x0 xN−1x1 ··· xN−1xN−1 ⎤ ⎥ ⎥ ⎥ ⎦ (13.31) To see this more clearly, let us use the transform introduced in Example 13.3.1 for a two-dimensional transform.
  • 423. 400 13 T R A N S F O R M C O D I N G E x a m p l e 13 . 3 . 2: For an N ×N transform A, let i j be the outer product of the ith and jth rows: i j = ⎡ ⎢ ⎢ ⎢ ⎣ ai0 ai1 aiN−1 ⎤ ⎥ ⎥ ⎥ ⎦ aj0 aj1 ··· ajN−1 (13.32) = ⎡ ⎢ ⎢ ⎢ ⎣ ai0aj0 ai0aj1 ··· ai0ajN−1 ai1aj0 ai1aj1 ··· ai1ajN−1 aiN−1aj0 aiN−1aj1 ··· aiN−1ajN−1 ⎤ ⎥ ⎥ ⎥ ⎦ (13.33) For the transform of Example 13.3.1, the outer products are 0 0 = 1 2 1 1 1 1 0 1 = 1 2 1 −1 1 −1 (13.34) 1 0 = 1 2 1 1 −1 −1 1 1 = 1 2 1 −1 −1 1 (13.35) From (13.20), the inverse transform is given by x01 x01 x10 x11 = 1 2 1 1 1 −1 00 01 10 11 1 1 1 −1 (13.36) = 1 2 00 +01 +10 +11 00 −01 +10 −11 00 +01 −10 −11 00 −01 −10 +11 (13.37) = 000 0 +010 1 +101 0 +111 1 (13.38) The transform values ij can be viewed as the coefficients of the expansion of x in terms of the matrices i j. The matrices i j are known as the basis matrices. For historical reasons, the coefficient 00, corresponding to the basis matrix 0 0, is called the DC coefficient, while the coefficients corresponding to the other basis matrices are called AC coefficients. DC stands for direct current, which is current that does not change with time. AC stands for alternating current, which does change with time. Notice that all the elements of the basis matrix 0 0 are the same, hence the DC designation. In the following section we will look at some of the variety of transforms available to us, then at some of the issues involved in quantization and coding. Finally, we will describe in detail two applications, one for image coding and one for audio coding. 13 . 4 Transforms of Interest In Example 13.2.1, we constructed a transform that was specific to the data. In practice, it is generally not feasible to construct a transform for the specific situation, for several
  • 424. 13.4 Transforms of Interest 401 reasons. Unless the characteristics of the source output are stationary over a long interval, the transform needs to be recomputed often, and it is generally burdensome to compute a transform for every different set of data. Furthermore, the overhead required to transmit the transform itself might negate any compression gains. Both of these problems become especially acute when the size of the transform is large. However, there are times when we want to find out the best we can do with transform coding. In these situations, we can use data-dependent transforms to obtain an idea of the best performance available. The best-known data-dependent transform is the discrete Karhunen-Loéve transform (KLT). We will describe this transform in the next section. 13 . 4 . 1 K a r h u n e n - L o é v e T r a n s f o r m The rows of the discrete Karhunen-Loéve transform [184], also known as the Hotelling transform, consist of the eigenvectors of the autocorrelation matrix. The autocorrelation matrix for a random process X is a matrix whose i j th element Ri j is given by Ri j = E XnXn+i−j (13.39) We can show [123] that a transform constructed in this manner will minimize the geometric mean of the variance of the transform coefficients. Hence, the Karhunen-Loéve transform provides the largest transform coding gain of any transform coding method. If the source output being compressed is nonstationary, the autocorrelation function will change with time. Thus, the autocorrelation matrix will change with time, and the KLT will have to be recomputed. For a transform of any reasonable size, this is a significant amount of computation. Furthermore, as the autocorrelation is computed based on the source output, it is not available to the receiver. Therefore, either the autocorrelation or the transform itself has to be sent to the receiver. The overhead can be significant and remove any advantages to using the optimum transform. However, in applications where the statistics change slowly and the transform size can be kept small, the KLT can be of practical use [185]. E x a m p l e 13 . 4 . 1: Let us see how to obtain the KLT transform of size two for an arbitrary input sequence. The autocorrelation matrix of size two for a stationary process is R = Rxx 0 Rxx 1 Rxx 1 Rxx 0 (13.40) Solving the equation I−R = 0, we get the two eigenvalues 1 = Rxx 0 + Rxx 1 , and 2 = Rxx 0 −Rxx 1 . The corresponding eigenvectors are V1 = V2 = − (13.41)
  • 425. 402 13 T R A N S F O R M C O D I N G where and are arbitrary constants. If we now impose the orthonormality condition, which requires the vectors to have a magnitude of 1, we get = = 1 √ 2 and the transform matrix K is K = 1 √ 2 1 1 1 −1 (13.42) Notice that this matrix is not dependent on the values of Rxx 0 and Rxx 1 . This is only true of the 2×2 KLT. The transform matrices of higher order are functions of the autocorrelation values. Although the Karhunen-Loéve transform maximizes the transform coding gain as defined by (13.27), it is not practical in most circumstances. Therefore, we need transforms that do not depend on the data being transformed. We describe some of the more popular transforms in the following sections. 13 . 4 . 2 D i s c r e t e C o s i n e T r a n s f o r m The discrete cosine transform (DCT) gets its name from the fact that the rows of the N ×N transform matrix C are obtained as a function of cosines. Ci j = ⎧ ⎨ ⎩ 1 N cos 2j+1 i 2N i = 0 j = 0 1 N −1 2 N cos 2j+1 i 2N i = 1 2 N −1 j = 0 1 N −1 (13.43) The rows of the transform matrix are shown in graphical form in Figure 13.3. Notice how the amount of variation increases as we progress down the rows; that is, the frequency of the rows increases as we go from top to bottom. The outer products of the rows are shown in Figure 13.4. Notice that the basis matri- ces show increased variation as we go from the top-left matrix, corresponding to the 00 coefficient, to the bottom-right matrix, corresponding to the N−1 N−1 coefficient. The DCT is closely related to the discrete Fourier transform (DFT) mentioned in Chapter 11, and in fact can be obtained from the DFT. However, in terms of compression, the DCT performs better than the DFT. To see why, recall that when we find the Fourier coefficients for a sequence of length N, we assume that the sequence is periodic with period N. If the original sequence is as shown in Figure 13.5a, the DFT assumes that the sequence outside the interval of interest behaves in the manner shown in Figure 13.5b. This introduces sharp discontinuities, at the beginning and the end of the sequence. In order to represent these sharp discontinuities, the DFT needs nonzero coefficients for the high-frequency components. Because these components are needed only at the two endpoints of the sequence, their effect needs to be canceled out at other points in the sequence. Thus, the DFT adjusts other coefficients accordingly. When we discard the high-frequency coefficients (which should not have been there anyway) during
the compression process, the coefficients that were canceling out the high-frequency effect in other parts of the sequence result in the introduction of additional distortion.
FIGURE 13.3 Basis set for the discrete cosine transform. The numbers in the circles correspond to the row of the transform matrix.
The DCT can be obtained using the DFT by mirroring the original $N$-point sequence to obtain a $2N$-point sequence, as shown in Figure 13.6b. The DCT is simply the first $N$ points of the resulting $2N$-point DFT. When we take the DFT of the $2N$-point mirrored sequence, we again have to assume periodicity. However, as we can see from Figure 13.6c, this does not introduce any sharp discontinuities at the edges.
The DCT is substantially better at energy compaction for most correlated sources when compared to the DFT [123]. In fact, for Markov sources with high correlation coefficient $\rho$,
$$\rho = \frac{E[x_n x_{n+1}]}{E[x_n^2]} \qquad (13.44)$$
the compaction ability of the DCT is very close to that of the KLT. As many sources can be modeled as Markov sources with high values for $\rho$, this superior compaction ability has made the DCT the most popular transform. It is a part of many international standards, including JPEG, MPEG, and CCITT H.261, among others.
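One way to get a feel for Equation (13.43) is to construct the transform matrix directly. The sketch below (illustrative Python, not from the text) builds the 8x8 DCT matrix, confirms that it is orthonormal, and shows the compaction obtained on one slowly varying block.

```python
# Sketch: build the N x N DCT matrix of Equation (13.43) and look at its
# behavior on a smooth (highly correlated) block of samples.
import numpy as np

def dct_matrix(N):
    C = np.zeros((N, N))
    for i in range(N):
        scale = np.sqrt(1.0 / N) if i == 0 else np.sqrt(2.0 / N)
        for j in range(N):
            C[i, j] = scale * np.cos((2 * j + 1) * i * np.pi / (2 * N))
    return C

C = dct_matrix(8)
print(np.allclose(C @ C.T, np.eye(8)))   # True: the rows form an orthonormal set

x = np.linspace(100, 121, 8)             # a slowly varying block of samples
theta = C @ x
print(np.round(theta, 2))                # most of the energy sits in theta[0], theta[1]
```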
  • 427. 404 13 T R A N S F O R M C O D I N G F I G U R E 13. 4 The basis matrices for the DCT. 13 . 4 . 3 D i s c r e t e S i n e T r a n s f o r m The discrete sine transform (DST) is a complementary transform to the DCT. Where the DCT provides performance close to the optimum KLT when the correlation coefficient is large, the DST performs close to the optimum KLT in terms of compaction when the magnitude of is small. Because of this property, it is often used as the complementary transform to DCT in image [186] and audio [187] coding applications. The elements of the transform matrix for an N ×N DST are Sij = 2 N +1 sin i+1 j +1 N +1 i j = 0 1 N −1 (13.45) 13 . 4 . 4 D i s c r e t e W a l s h - H a d a m a r d T r a n s f o r m A transform that is especially simple to implement is the discrete Walsh-Hadamard transform (DWHT). The DWHT transform matrices are rearrangements of discrete Hadamard matrices, which are of particular importance in coding theory [188]. A Hadamard matrix of order N is defined as an N ×N matrix H, with the property that HHT = NI, where I is the N ×N
FIGURE 13.5 Taking the discrete Fourier transform of a sequence.
FIGURE 13.6 Taking the discrete cosine transform of a sequence.
  • 429. 406 13 T R A N S F O R M C O D I N G identity matrix. Hadamard matrices whose dimensions are a power of two can be constructed in the following manner: H2N = HN HN HN −HN (13.46) with H1 = 1. Therefore, H2 = H1 H1 H1 −H1 = 1 1 1 −1 (13.47) H4 = H2 H2 H2 −H2 = ⎡ ⎢ ⎢ ⎣ 1 1 1 −1 1 1 1 −1 1 1 1 −1 −1 −1 −1 1 ⎤ ⎥ ⎥ ⎦ (13.48) H8 = H4 H4 H4 −H4 = ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ 1 1 1 1 1 1 1 1 1 −1 1 −1 1 −1 1 −1 1 1 −1 −1 1 1 −1 −1 1 −1 −1 1 1 −1 −1 1 1 1 1 1 −1 −1 −1 −1 1 −1 1 −1 −1 1 −1 1 1 1 −1 −1 −1 −1 1 1 1 −1 −1 1 −1 1 1 −1 ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ (13.49) The DWHT transform matrix H can be obtained from the Hadamard matrix by multiplying it by a normalizing factor so that HHT = I instead of NI, and by reordering the rows in increasing sequency order. The sequency of a row is half the number of sign changes in that row. In H8 the first row has sequency 0, the second row has sequency 7/2, the third row has sequency 3/2, and so on. Normalization involves multiplying the matrix by 1 √ N . Reordering the H8 matrix in increasing sequency order, we get H = 1 √ 8 ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ 1 1 1 1 1 1 1 1 1 1 1 1 −1 −1 −1 −1 1 1 −1 −1 −1 −1 1 1 1 1 −1 −1 1 1 −1 −1 1 −1 −1 1 1 −1 −1 1 1 −1 −1 1 −1 1 1 −1 1 −1 1 −1 −1 1 −1 1 1 −1 1 −1 1 −1 1 −1 ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ (13.50) Because the matrix without the scaling factor consists of ±1, the transform operation consists simply of addition and subtraction. For this reason, this transform is useful in situations where minimizing the amount of computations is very important. However, the amount of energy compaction obtained with this transform is substantially less than the compaction obtained by the use of the DCT. Therefore, where sufficient computational power is available, DCT is the transform of choice.
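The construction (13.46) and the sequency reordering are easy to express in code. The following Python sketch (illustrative only, not from the text) builds H_8 recursively, normalizes it by 1/sqrt(8), and sorts the rows by sequency; the result should reproduce the matrix in (13.50).

```python
# Sketch: recursive construction of the Hadamard matrix per (13.46),
# followed by normalization and reordering of the rows in increasing
# sequency order (sequency = half the number of sign changes in a row).
import numpy as np

def hadamard(N):                 # N is assumed to be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < N:
        H = np.block([[H, H], [H, -H]])
    return H

H8 = hadamard(8)
sequency = [np.sum(row[:-1] != row[1:]) / 2 for row in H8]   # sign changes / 2
W = H8[np.argsort(sequency)] / np.sqrt(8)                    # DWHT matrix, as in (13.50)

print(np.allclose(W @ W.T, np.eye(8)))          # True: W is orthonormal
print(np.rint(W * np.sqrt(8)).astype(int))      # the +/-1 pattern of Equation (13.50)
```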
  • 430. 13.5 Quantization and Coding of Transform Coefficients 407 13 . 5 Quantization and Coding of Transform Coefficients If the amount of information conveyed by each coefficient is different, it makes sense to assign differing numbers of bits to the different coefficients. There are two approaches to assigning bits. One approach relies on the average properties of the transform coefficients, while the other approach assigns bits as needed by individual transform coefficients. In the first approach, we first obtain an estimate of the variances of the transform coefficients. These estimates can be used by one of two algorithms to assign the number of bits used to quantize each of the coefficients. We assume that the relative variance of the coefficients corresponds to the amount of information contained in each coefficient. Thus, coefficients with higher variance are assigned more bits than coefficients with smaller variance. Let us find an expression for the distortion, then find the bit allocation that minimizes the distortion. To perform the minimization we will use the method of Lagrange [189]. If the average number of bits per sample to be used by the transform coding system is R, and the average number of bits per sample used by the kth coefficient is Rk, then R = 1 M M k=1 Rk (13.51) where M is the number of transform coefficients. The reconstruction error variance for the kth quantizer 2 rk is related to the kth quantizer input variance 2 k by the following: 2 rk = k2−2Rk 2 k (13.52) where k is a factor that depends on the input distribution and the quantizer. The total reconstruction error is given by 2 r = M k=1 k2−2Rk 2 k (13.53) The objective of the bit allocation procedure is to find Rk to minimize (13.53) subject to the constraint of (13.51). If we assume that k is a constant for all k, we can set up the minimization problem in terms of Lagrange multipliers as J = M k=1 2−2Rk 2 k − R− 1 M M k=1 Rk (13.54) Taking the derivative of J with respect to Rk and setting it equal to zero, we can obtain this expression for Rk: Rk = 1 2 log2 2ln 22 k − 1 2 log2 (13.55) Substituting this expression for Rk in (13.51), we get a value for : = M k=1 2ln 22 k 1 M 2−2R (13.56)
  • 431. 408 13 T R A N S F O R M C O D I N G Substituting this expression for in (13.55), we finally obtain the individual bit allocations: Rk = R+ 1 2 log2 2 k M k=1 2 k 1 M (13.57) Although these values of Rk will minimize (13.53), they are not guaranteed to be integers, or even positive. The standard approach at this point is to set the negative Rks to zero. This will increase the average bit rate above the constraint. Therefore, the nonzero Rks are uniformly reduced until the average rate is equal to R. The second algorithm that uses estimates of the variance is a recursive algorithm and functions as follows: 1. Compute 2 k for each coefficient. 2. Set Rk = 0 for all k and set Rb = MR, where Rb is the total number of bits available for distribution. 3. Sort the variances 2 k . Suppose 2 1 is the maximum. 4. Increment Rl by 1, and divide 2 1 by 2. 5. Decrement Rb by 1. If Rb = 0, then stop; otherwise, go to 3. If we follow this procedure, we end up allocating more bits to the coefficients with higher variance. This form of bit allocation is called zonal sampling. The reason for this name can be seen from the example of a bit allocation map for the 8 × 8 DCT of an image shown in Table 13.4. Notice that there is a zone of coefficients that roughly comprises the right lower diagonal of the bit map that has been assigned zero bits. In other words, these coefficients are to be discarded. The advantage to this approach is its simplicity. Once the bit allocation has been obtained, every coefficient at a particular location is always quantized using the same number of bits. The disadvantage is that, because the bit allocations are performed based on average value, variations that occur on the local level are not reconstructed properly. For example, consider an image of an object with sharp edges in front of a relatively plain background. The number of pixels that occur on edges is quite small compared to the total number of pixels. Therefore, if we allocate bits based on average variances, the coefficients that are important for representing edges (the high-frequency coefficients) will get few or T A B L E 13 . 4 Bit allocation map for an 8 × 8 transform. 8 7 5 3 1 1 0 0 7 5 3 2 1 0 0 0 4 3 2 1 1 0 0 0 3 3 2 1 1 0 0 0 2 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  • 432. 13.5 Quantization and Coding of Transform Coefficients 409 no bits assigned to them. This means that the reconstructed image will not contain a very good representation of the edges. This problem can be avoided by using a different approach to bit allocation known as threshold coding [190, 93, 191]. In this approach, which coefficient to keep and which to discard is not decided a priori. In the simplest form of threshold coding, we specify a threshold value. Coefficients with magnitude below this threshold are discarded, while the other coefficients are quantized and transmitted. The information about which coefficients have been retained is sent to the receiver as side information. A simple approach described by Pratt [93] is to code the first coefficient on each line regardless of the magnitude. After this, when we encounter a coefficient with a magnitude above the threshold value, we send two codewords: one for the quantized value of the coefficient, and one for the count of the number of coefficients since the last coefficient with magnitude greater than the threshold. For the two-dimensional case, the block size is usually small, and each “line” of the transform is very short. Thus, this approach would be quite expensive. Chen and Pratt [191] suggest scanning the block of transformed coefficients in a zigzag fashion, as shown in Figure 13.7. If we scan an 8×8 block of quantized transform coefficients in this manner, we will find that in general a large section of the tail end of the scan will consist of zeros. This is because F I G U R E 13. 7 The zigzag scanning pattern for an 8 × 8 transform.
  • 433. 410 13 T R A N S F O R M C O D I N G generally the higher-order coefficients have smaller amplitude. This is reflected in the bit allocation table shown in Table 13.4. As we shall see later, if we use midtread quantizers (quantizers with a zero output level), combined with the fact that the step sizes for the higher-order coefficients are generally chosen to be quite large, this means that many of these coefficients will be quantized to zero. Therefore, there is a high probability that after a few coefficients along the zigzag scan, all coefficients will be zero. In this situation, Chen and Pratt suggest the transmission of a special end-of-block (EOB) symbol. Upon reception of the EOB signal, the receiver would automatically set all remaining coefficients along the zigzag scan to zero. The algorithm developed by the Joint Photographic Experts Group (JPEG), described in the next section, uses a rather clever variation of this approach. 13 . 6 Application to Image Compression- —JPEG The JPEG standard is one of the most widely known standards for lossy image compression. It is a result of the collaboration of the International Standards Organization (ISO), which is a private organization, and what was the CCITT (now ITU-T), a part of the United Nations. The approach recommended by JPEG is a transform coding approach using the DCT. The approach is a modification of the scheme proposed by Chen and Pratt [191]. In this section we will briefly describe the baseline JPEG algorithm. In order to illustrate the various components of the algorithm, we will use an 8×8 block of the Sena image, shown in Table 13.5. For more details, see [10]. 13 . 6 . 1 T h e T r a n s f o r m The transform used in the JPEG scheme is the DCT transform described earlier. The input image is first “level shifted” by 2P−1 ; that is, we subtract 2P−1 from each pixel value, where P is the number of bits used to represent each pixel. Thus, if we are dealing with 8-bit images whose pixels take on values between 0 and 255, we would subtract 128 from each pixel so that the value of the pixel varies between −128 and 127. The image is divided into blocks of size 8×8, which are then transformed using an 8×8 forward DCT. If any dimension of the image is not a multiple of eight, the encoder replicates the last column or row until the T A B L E 13 . 5 An 8 × 8 block from the Sena image. 124 125 122 120 122 119 117 118 121 121 120 119 119 120 120 118 126 124 123 122 121 121 120 120 124 124 125 125 126 125 124 124 127 127 128 129 130 128 127 125 143 142 143 142 140 139 139 139 150 148 152 152 152 152 150 151 156 159 158 155 158 158 157 156
  • 434. 13.6 Application to Image Compression- —JPEG 411 T A B L E 13 . 6 The DCT coefficients corresponding to the block of data from the Sena image after level shift. 3988 656 −224 122 −037 −108 079 113 −10243 456 226 112 035 −063 −105 −048 3777 131 177 025 −150 −221 −010 023 −567 224 −132 −081 141 022 −013 017 −337 −074 −175 077 −062 −265 −130 076 598 −013 −045 −077 199 −026 146 000 397 552 239 −055 −0051 −084 −052 −013 −343 051 −107 087 096 009 033 001 final size is a multiple of eight. These additional rows or columns are removed during the decoding process. If we take the 8 × 8 block of pixels shown in Table 13.5, subtract 128 from it, and take the DCT of this level-shifted block, we obtain the DCT coefficients shown in Table 13.6. Notice that the lower-frequency coefficients in the top-left corner of the table have larger values than the higher-frequency coefficients. This is generally the case, except for situations in which there is substantial activity in the image block. 13 . 6 . 2 Q u a n t i z a t i o n The JPEG algorithm uses uniform midtread quantization to quantize the various coefficients. The quantizer step sizes are organized in a table called the quantization table and can be viewed as the fixed part of the quantization. An example of a quantization table from the JPEG recommendation [10] is shown in Table 13.7. Each quantized value is represented by a label. The label corresponding to the quantized value of the transform coefficient ij is obtained as lij = ij Qij +05 (13.58) T A B L E 13 . 7 Sample quantization table. 16 11 10 16 24 40 51 61 12 12 14 19 26 58 60 55 14 13 16 24 40 57 69 56 14 17 22 29 51 87 80 62 18 22 37 56 68 109 103 77 24 35 55 64 81 104 113 92 49 64 78 87 103 121 120 101 72 92 95 98 112 100 103 99
where $Q_{ij}$ is the $(i,j)$th element of the quantization table, and $\lfloor x \rfloor$ is the largest integer smaller than or equal to $x$.
Consider the $\theta_{00}$ coefficient from Table 13.6. The value of $\theta_{00}$ is 39.88. From Table 13.7, $Q_{00}$ is 16. Therefore,
$$l_{00} = \left\lfloor \frac{39.88}{16} + 0.5 \right\rfloor = \lfloor 2.9925 \rfloor = 2 \qquad (13.59)$$
The reconstructed value is obtained from the label by multiplying the label with the corresponding entry in the quantization table. Therefore, the reconstructed value of $\theta_{00}$ would be $l_{00} \times Q_{00}$, which is $2 \times 16 = 32$. The quantization error in this case is $32 - 39.88 = -7.88$. Similarly, from Tables 13.6 and 13.7, $\theta_{01}$ is 6.56 and $Q_{01}$ is 11. Therefore,
$$l_{01} = \left\lfloor \frac{6.56}{11} + 0.5 \right\rfloor = \lfloor 1.096 \rfloor = 1 \qquad (13.60)$$
The reconstructed value is 11, and the quantization error is $11 - 6.56 = 4.44$. Continuing in this fashion, we obtain the labels shown in Table 13.8.
TABLE 13.8 The quantizer labels obtained by using the quantization table on the coefficients.
 2   1   0   0   0   0   0   0
-9   0   0   0   0   0   0   0
 3   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
From the sample quantization table shown in Table 13.7, we can see that the step size generally increases as we move from the DC coefficient to the higher-order coefficients. Because the quantization error is an increasing function of the step size, more quantization error will be introduced in the higher-frequency coefficients than in the lower-frequency coefficients. The decision on the relative size of the step sizes is based on how errors in these coefficients will be perceived by the human visual system. Different coefficients in the transform have widely different perceptual importance. Quantization errors in the DC and lower AC coefficients are more easily detectable than the quantization error in the higher AC coefficients. Therefore, we use larger step sizes for perceptually less important coefficients.
Because the quantizers are all midtread quantizers (that is, they all have a zero output level), the quantization process also functions as the thresholding operation. All coefficients with magnitudes less than half the corresponding step size will be set to zero. Because the step sizes at the tail end of the zigzag scan are larger, the probability of finding a long run of zeros increases at the end of the scan. This is the case for the $8 \times 8$ block of labels shown in Table 13.8. The entire run of zeros at the tail end of the scan can be coded with an EOB code after the last nonzero label, resulting in substantial compression.
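The quantization rule (13.58) and the long zero run along the zigzag scan can be reproduced in a few lines. The sketch below is illustrative Python: it uses SciPy's orthonormal 2-D DCT in place of the separable matrix product used in the text, the pixel block of Table 13.5, and the quantization table of Table 13.7.

```python
# Sketch: quantize the DCT of the level-shifted block of Table 13.5 with the
# quantization table of Table 13.7, per Equation (13.58), and scan the labels
# in zigzag order to expose the run of trailing zeros.
import numpy as np
from scipy.fft import dctn        # orthonormal 2-D DCT-II

block = np.array([
    [124,125,122,120,122,119,117,118],
    [121,121,120,119,119,120,120,118],
    [126,124,123,122,121,121,120,120],
    [124,124,125,125,126,125,124,124],
    [127,127,128,129,130,128,127,125],
    [143,142,143,142,140,139,139,139],
    [150,148,152,152,152,152,150,151],
    [156,159,158,155,158,158,157,156]], dtype=float)

Q = np.array([
    [16,11,10,16,24,40,51,61], [12,12,14,19,26,58,60,55],
    [14,13,16,24,40,57,69,56], [14,17,22,29,51,87,80,62],
    [18,22,37,56,68,109,103,77], [24,35,55,64,81,104,113,92],
    [49,64,78,87,103,121,120,101], [72,92,95,98,112,100,103,99]], dtype=float)

theta = dctn(block - 128, norm="ortho")          # approximately Table 13.6
labels = np.floor(theta / Q + 0.5).astype(int)   # Equation (13.58)

# Zigzag order: traverse anti-diagonals, alternating direction (Figure 13.7).
order = sorted(((i, j) for i in range(8) for j in range(8)),
               key=lambda p: (p[0] + p[1],
                              p[0] if (p[0] + p[1]) % 2 else -p[0]))
scan = [int(labels[i, j]) for i, j in order]
print(scan)   # 2, 1, -9, 3, then zeros (compare with Table 13.8): one EOB covers the tail
```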
  • 436. 13.6 Application to Image Compression- —JPEG 413 Furthermore, this effect also provides us with a method to vary the rate. By making the step sizes larger, we can reduce the number of nonzero values that need to be transmitted, which translates to a reduction in the number of bits that need to be transmitted. 13 . 6 . 3 C o d i n g Chen and Pratt [191] used separate Huffman codes for encoding the label for each coefficient and the number of coefficients since the last nonzero label. The JPEG approach is somewhat more complex but results in higher compression. In the JPEG approach, the labels for the DC and AC coefficients are coded differently. From Figure 13.4 we can see that the basis matrix corresponding to the DC coefficient is a constant matrix. Thus, the DC coefficient is some multiple of the average value in the 8 × 8 block. The average pixel value in any 8 × 8 block will not differ substantially from the average value in the neighboring 8×8 block; therefore, the DC coefficient values will be quite close. Given that the labels are obtained by dividing the coefficients with the corresponding entry in the quantization table, the labels corresponding to these coefficients will be closer still. Therefore, it makes sense to encode the differences between neighboring labels rather than to encode the labels themselves. Depending on the number of bits used to encode the pixel values, the number of values that the labels, and hence the differences, can take on may become quite large. A Huffman code for such a large alphabet would be quite unmanageable. The JPEG recommendation resolves this problem by partitioning the possible values that the differences can take on into categories. The size of these categories grows as a power of two. Thus, category 0 has only one member (0), category 1 has two members (−1 and 1), category 2 has four members (−3, −2, 2, 3), and so on. The category numbers are then Huffman coded. The number of codewords in the Huffman code is equal to the base two logarithm of the number of possible values that the label differences can take on. If the differences can take on 4096 possible values, the size of the Huffman code is log2 4096 = 12. The elements within each category are specified by tacking on extra bits to the end of the Huffman code for that category. As the categories are different sizes, we need a differing number of bits to identify the value in each category. For example, because category 0 contains only one element, we need no additional bits to specify the value. Category 1 contains two elements, so we need 1 bit tacked on to the end of the Huffman code for category 1 to specify the particular element in that category. Similarly, we need 2 bits to specify the element in category 2, 3 bits for category 3, and n bits for category n. The categories and the corresponding difference values are shown in Table 13.9. For example, if the difference between two labels was 6, we would send the Huffman code for category 3. As category 3 contains the eight values −7 −6 −5 −4 4 5 6 7, the Huffman code for category 3 would be followed by 3 bits that would specify which of the eight values in category 3 was being transmitted. The binary code for the AC coefficients is generated in a slightly different manner. The category C that a nonzero label falls in and the number of zero-valued labels Z since the last nonzero label form a pointer to a specific Huffman code as shown in Table 13.10. 
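The category of a label (or of a DC difference) is just the number of bits needed for its magnitude, and the (Z, C) pairs for the AC labels can be generated with a small helper. The sketch below is illustrative Python, with names and helpers of my own choosing; the label sequence is taken to be the zigzag-scanned AC labels of Table 13.8, and the lookup of actual Huffman codewords in Table 13.10 (as well as the extra bits that select a value within a category) is not shown.

```python
# Sketch: map zigzag-scanned AC labels into (zero-run Z, category C, value)
# symbols, with ZRL for runs of 16 zeros and a single EOB for the zero tail.
def category(v):
    """Category = number of bits needed for |v| (0 for v == 0)."""
    return abs(v).bit_length()

def ac_symbols(ac_labels):
    symbols, run = [], 0
    last = max((k for k, v in enumerate(ac_labels) if v != 0), default=-1)
    for v in ac_labels[:last + 1]:
        if v == 0:
            run += 1
            if run == 16:               # sixteen consecutive zeros -> ZRL
                symbols.append("ZRL")
                run = 0
        else:
            symbols.append((run, category(v), v))
            run = 0
    symbols.append("EOB")               # everything after the last nonzero label
    return symbols

ac = [1, -9, 3] + [0] * 60              # AC labels of Table 13.8 in zigzag order
print(ac_symbols(ac))                   # [(0, 1, 1), (0, 4, -9), (0, 2, 3), 'EOB']
print(category(3))                      # a DC difference of 3 falls in category 2
```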
Thus, if the label being encoded falls in category 3, and there have been 15 zero-valued labels prior to this nonzero label in the zigzag scan, then we form the pointer F/3, which points
  • 437. 414 13 T R A N S F O R M C O D I N G T A B L E 13 . 9 Coding of the differences of the DC labels. 0 0 1 −1 1 2 −3 −2 2 3 3 −7 ··· −4 4 ··· 7 4 −15 ··· −8 8 ··· 15 5 −31 ··· −16 16 ··· 31 6 −63 ··· −32 32 ··· 63 7 −127 ··· −64 64 ··· 127 8 −255 ··· −128 128 ··· 255 9 −511 ··· −256 256 ··· 511 10 −1 023 ··· −512 512 ··· 1,023 11 −2 047 ··· −1 024 1,024 ··· 2,047 12 −4 095 ··· −2 048 2,048 ··· 4,095 13 −8 191 ··· −4 096 4,096 ··· 8,191 14 −16 383 ··· −8 192 8,192 ··· 16,383 15 −32 767 ··· −16 384 16,384 ··· 32,767 16 32,768 T A B L E 13 . 10 Sample table for obtaining the Huffman code for a given label value and run length. The values of Z are represented in hexadecimal. Z/C Codeword Z/C Codeword ··· Z/C Codeword 0/0 (EOB) 1010 ··· F/0 (ZRL) 11111111001 0/1 00 1/1 1100 ··· F/1 1111111111110101 0/2 01 1/2 11011 ··· F/2 1111111111110110 0/3 100 1/3 1111001 ··· F/3 1111111111110111 0/4 1011 1/4 111110110 ··· F/4 1111111111111000 0/5 11010 1/5 11111110110 ··· F/5 1111111111111001 to the codeword 1111111111110111. Because the label falls in category 3, we follow this codeword with 3 bits that indicate which of the eight possible values in category 3 is the value that the label takes on. There are two special codes shown in Table 13.10. The first is for the end-of-block (EOB). This is used in the same way as in the Chen and Pratt [191] algorithm; that is, if a particular label value is the last nonzero value along the zigzag scan, the code for it is immediately followed by the EOB code. The other code is the ZRL code, which is used when the number of consecutive zero values along the zigzag scan exceeds 15. To see how all of this fits together, let’s encode the labels in Table 13.8. The label corresponding to the DC coefficient is coded by first taking the difference between the value of the quantized label in this block and the quantized label in the previous block. If we assume that the corresponding label in the previous block was −1, then the difference would be 3. From Table 13.9 we can see that this value falls in category 2. Therefore, we
would send the Huffman code for category 2 followed by the 2-bit sequence 11 to indicate that the value in category 2 being encoded was 3, and not −3, −2, or 2.

To encode the AC coefficients, we first order them using the zigzag scan. We obtain the sequence

1  −9  3  0  0  0 ··· 0

The first value, 1, belongs to category 1. Because there are no zeros preceding it, we transmit the Huffman code corresponding to 0/1, which from Table 13.10 is 00. We then follow this by a single bit 1 to indicate that the value being transmitted is 1 and not −1. Similarly, −9 is the seventh element in category 4. Therefore, we send the binary string 1011, which is the Huffman code for 0/4, followed by 0110 to indicate that −9 is the seventh element in category 4. The next label is 3, which belongs to category 2, so we send the Huffman code 01 corresponding to 0/2, followed by the 2 bits 11. All the labels after this point are 0, so we send the EOB Huffman code, which in this case is 1010.

If we assume that the Huffman code for the DC coefficient was 2 bits long, we have sent a grand total of 21 bits to represent this 8 × 8 block. This translates to an average of 21/64 bits per pixel. To obtain a reconstruction of the original block, we perform the dequantization, which simply consists of multiplying the labels in Table 13.8 with the corresponding values in Table 13.7. Taking the inverse transform of the quantized coefficients shown in Table 13.11 and adding 128, we get the reconstructed block shown in Table 13.12. We can see that in spite of going from 8 bits per pixel to 9/32 bits per pixel, the reproduction is remarkably close to the original.

TABLE 13.11  The quantized values of the coefficients.

  32    11    0    0    0    0    0    0
−108     0    0    0    0    0    0    0
  42     0    0    0    0    0    0    0
   0     0    0    0    0    0    0    0
   0     0    0    0    0    0    0    0
   0     0    0    0    0    0    0    0
   0     0    0    0    0    0    0    0
   0     0    0    0    0    0    0    0

TABLE 13.12  The reconstructed block.

123  122  122  121  120  120  119  119
121  121  121  120  119  118  118  118
121  121  120  119  119  118  117  117
124  124  123  122  122  121  120  120
130  130  129  129  128  128  128  127
141  141  140  140  139  138  138  137
152  152  151  151  150  149  149  148
159  159  158  157  157  156  155  155
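To make the category/extra-bits construction concrete, here is a minimal Python sketch (our own illustration, not taken from the standard text; the function name is hypothetical). It maps a DC label difference to its category number and to the extra bits that follow the category's Huffman code, using the negative-value offset implied by the worked example, which maps −9 in category 4 to the bits 0110.

```python
def dc_category_and_bits(diff):
    # Category is the number of bits needed to represent |diff|;
    # the extra bits identify the value within that category.
    if diff == 0:
        return 0, ""
    category = abs(diff).bit_length()
    if diff < 0:
        # Negative values are offset so that, e.g., -3 in category 2 becomes 00.
        diff += (1 << category) - 1
    return category, format(diff, f"0{category}b")

print(dc_category_and_bits(3))    # (2, '11'), as in the worked example
print(dc_category_and_bits(6))    # (3, '110')
print(dc_category_and_bits(-9))   # (4, '0110'), the seventh element of category 4
```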
  • 439. 416 13 T R A N S F O R M C O D I N G F I G U R E 13. 8 Sinan image coded at 0.5 bits per pixel using the JPEG algorithm. If we wanted an even more accurate reproduction, we could do so at the cost of increased bit rate by multiplying the step sizes in the quantization table by one-half and using these values as the new step sizes. Using the same assumptions as before, we can show that this will result in an increase in the number of bits transmitted. We can go in the other direction by multiplying the step sizes with a number greater than one. This will result in a reduction in bit rate at the cost of increased distortion. Finally, we present some examples of JPEG-coded images in Figures 13.8 and 13.9. These were coded using shareware generated by the Independent JPEG Group (orga- nizer, Dr. Thomas G. Lane). Notice the high degree of “blockiness” in the lower-rate image (Figure 13.8). This is a standard problem of most block-based techniques, and specifically of the transform coding approach. A number of solutions have been suggested for removing this blockiness, including postfiltering at the block edges as well as transforms that overlap the block boundaries. Each approach has its own drawbacks. The filtering approaches tend to reduce the resolution of the reconstructions, while the overlapped approaches increase the complexity. One particular overlapped approach that is widely used in audio compression is the modified DCT (MDCT), which is described in the next section. 13 . 7 Application to Audio Compression- —The MDCT As mentioned in the previous section, the use of the block based transform has the unfortunate effect of causing distortion at the block boundaries at low rates. A number of techniques that use overlapping blocks have been developed over the years [192]. One that has gained
  • 440. 13.7 Application to Audio Compression- —The MDCT 417 F I G U R E 13. 9 Sinan image coded at 0.25 bits per pixel using the JPEG algorithm. wide acceptance in audio compression is a transform based on the discrete cosine transform called the modified discrete cosine transform (MDCT). It is used in almost all popular audio coding standards from mp3 and AAC to Ogg Vorbis. The MDCT used in these algorithms uses 50% overlap. That is, each block overlaps half of the previous block and half of the next block of data. Consequently, each audio sample is part of two blocks. If we were to keep all the frequency coefficients we would end up with twice as many coefficients as samples. Reducing the number of frequency coefficients results in the introduction of distortion in the inverse transform. The distortion is referred to as time domain aliasing [193]. The reason for the name is evident if we consider that the distortion is being introduced by subsampling in the frequency domain. Recall that sampling at less than the Nyquist frequency in the time domain leads to an overlap of replicas of the frequency spectrum, or frequency aliasing. The lapped transforms are successful because they are constructed in such a way that while the inverse transform of each block results in time-domain aliasing, the aliasing in consecutive blocks cancel each other out. [ ][ ][ ][ Block i Block i + 1 Block i + 2 Block i – 1 r p q F I G U R E 13. 10 Source output sequence.
Consider the scenario shown in Figure 13.10. Let's look at the coding for block i and block i + 1. The inverse transform of the coefficients resulting from both these blocks will result in the audio samples in the subblock q. We assume that the blocksize is N and therefore the subblock size is N/2. The forward transform can be represented by an N/2 × N matrix P. Let us partition the matrix into two N/2 × N/2 blocks, A and B. Thus

P = \begin{bmatrix} A & B \end{bmatrix}

Let x_i = \begin{bmatrix} p \\ q \end{bmatrix}; then the forward transform P x_i can be written in terms of the subblocks as

X_i = \begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} p \\ q \end{bmatrix}

The inverse transform matrix Q can be represented by an N × N/2 matrix, which can be partitioned into two N/2 × N/2 blocks, C and D:

Q = \begin{bmatrix} C \\ D \end{bmatrix}

Applying the inverse transform, we get the reconstruction values \hat{x}:

\hat{x}_i = Q X_i = Q P x_i = \begin{bmatrix} C \\ D \end{bmatrix} \begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} p \\ q \end{bmatrix} = \begin{bmatrix} CAp + CBq \\ DAp + DBq \end{bmatrix}

Repeating the process for block i + 1 we get

\hat{x}_{i+1} = Q X_{i+1} = Q P x_{i+1} = \begin{bmatrix} C \\ D \end{bmatrix} \begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} q \\ r \end{bmatrix} = \begin{bmatrix} CAq + CBr \\ DAq + DBr \end{bmatrix}

To cancel out the aliasing in the second half of block i we need

CAq + CBr + DAp + DBq = q

From this we can get the requirements on the transform:

CB = 0    (13.61)
DA = 0    (13.62)
CA + DB = I    (13.63)

Note that the same requirements will help cancel the aliasing in the first half of block i by using the second half of the inverse transform of block i − 1. One selection that satisfies the last condition is

CA = \frac{1}{2}(I - J)    (13.64)
DB = \frac{1}{2}(I + J)    (13.65)

where J denotes the counteridentity (exchange) matrix, with ones on the antidiagonal and zeros elsewhere.
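As a quick numerical check of these conditions (a sketch of our own, not part of the book's development), the following Python fragment builds the forward and inverse MDCT matrices from the cosine kernel given in Equations (13.68) and (13.69) below, performs a 50%-overlapped analysis and synthesis, and verifies that the overlap-add of neighboring inverse transforms cancels the time-domain aliasing for all interior samples.

```python
import numpy as np

def mdct_matrices(N):
    # Forward (P, N/2 x N) and inverse (Q, N x N/2) MDCT matrices built from
    # the kernel of Eqs. (13.68)-(13.69); for this kernel Q = (2/N) * P^T.
    k = np.arange(N // 2).reshape(-1, 1)   # frequency index
    n = np.arange(N).reshape(1, -1)        # time index
    P = np.cos(2 * np.pi / N * (k + 0.5) * (n + 0.5 + N / 4))
    Q = (2.0 / N) * P.T
    return P, Q

def mdct_analysis_synthesis(x, N):
    # 50% overlapped MDCT analysis followed by IMDCT and overlap-add.
    P, Q = mdct_matrices(N)
    hop = N // 2
    y = np.zeros(len(x))
    for start in range(0, len(x) - N + 1, hop):
        X = P @ x[start:start + N]     # N/2 coefficients per block
        y[start:start + N] += Q @ X    # aliasing cancels where blocks overlap
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N = 8
    x = rng.standard_normal(64)
    y = mdct_analysis_synthesis(x, N)
    # Interior samples are recovered exactly (up to floating-point error).
    print(np.max(np.abs(y[N // 2:-N // 2] - x[N // 2:-N // 2])))  # ~1e-15
```

The first and last N/2 samples are not recovered by this sketch, which is exactly why the zero padding or special boundary windows discussed next are needed.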
The forward modified discrete cosine transform is given by the following equation:

X_k = \sum_{n=0}^{N-1} x_n \cos\left[ \frac{2\pi}{N}\left(k + \frac{1}{2}\right)\left(n + \frac{1}{2} + \frac{N}{4}\right) \right]    (13.66)

where x_n are the audio samples and X_k are the frequency coefficients. The inverse MDCT is given by

y_n = \frac{2}{N} \sum_{k=0}^{\frac{N}{2}-1} X_k \cos\left[ \frac{2\pi}{N}\left(k + \frac{1}{2}\right)\left(n + \frac{1}{2} + \frac{N}{4}\right) \right]    (13.67)

or, in terms of our matrix notation,

P_{ij} = \cos\left[ \frac{2\pi}{N}\left(i + \frac{1}{2}\right)\left(j + \frac{1}{2} + \frac{N}{4}\right) \right]    (13.68)

Q_{ij} = \frac{2}{N} \cos\left[ \frac{2\pi}{N}\left(j + \frac{1}{2}\right)\left(i + \frac{1}{2} + \frac{N}{4}\right) \right]    (13.69)

It is easy to verify that, given a value of N, these matrices satisfy the conditions for alias cancellation. Thus, while the inverse transform for any one block will contain aliasing, by using the inverse transform of neighboring blocks the aliasing can be canceled.

What about blocks that do not have neighbors—that is, the first and last blocks? One way to resolve this problem is to pad the sampled audio sequence with N/2 zeros at the beginning and end of the sequence. In practice, this is not necessary, because the data to be transformed is windowed prior to the transform. For the first and last blocks we use a special window that has the same effect as introducing zeros. For information on the design of windows for the MDCT, see [194]. For more on how the MDCT is used in audio compression techniques, see Chapter 16.

13.8 Summary

In this chapter we have described the concept of transform coding and provided some of the details needed for the investigation of this compression scheme. The basic encoding scheme works as follows:

- Divide the source output into blocks. In the case of speech or audio data, they will be one-dimensional blocks. In the case of images, they will be two-dimensional blocks. In image coding, a typical block size is 8 × 8. In audio coding the blocks are generally overlapped by 50%.

- Take the transform of this block. In the case of one-dimensional data, this involves pre-multiplying the N vector of source output samples by the transform matrix. In the case of image data, for the transforms we have looked at, this involves pre-multiplying the N × N block by the transform matrix and post-multiplying the result with the
transpose of the transform matrix. Fast algorithms exist for performing the transforms described in this chapter (see [195]).

- Quantize the coefficients. Various techniques exist for the quantization of these coefficients. We have described the approach used by JPEG. In Chapter 16 we describe the quantization techniques used in various audio coding algorithms.

- Encode the quantized value. The quantized value can be encoded using a fixed-length code or any of the different variable-length codes described in earlier chapters. We have described the approach taken by JPEG.

The decoding scheme is the inverse of the encoding scheme for image compression. For the overlapped transform used in audio coding the decoder adds the overlapped portions of the inverse transform to cancel aliasing. The basic approach can be modified depending on the particular characteristics of the data. We have described some of the modifications used by various commercial algorithms for transform coding of audio signals.

Further Reading

1. For detailed information about the JPEG standard, JPEG Still Image Data Compression Standard, by W.B. Pennebaker and J.L. Mitchell [10], is an invaluable reference. This book also contains the entire text of the official draft JPEG recommendation, ISO DIS 10918-1 and ISO DIS 10918-2.
2. For a detailed discussion of the MDCT and how it is used in audio coding, an excellent source is Introduction to Digital Audio Coding Standards, by M. Bosi and R.E. Goldberg [194].
3. Chapter 12 in Digital Coding of Waveforms, by N.S. Jayant and P. Noll [123], provides a more mathematical treatment of the subject of transform coding.
4. A good source for information about transforms is Fundamentals of Digital Image Processing, by A.K. Jain [196]. Another one is Digital Image Processing, by R.C. Gonzalez and R.E. Woods [96]. This book has an especially nice discussion of the Hotelling transform.
5. The bit allocation problem and its solutions are described in Vector Quantization and Signal Compression, by A. Gersho and R.M. Gray [5].
6. A very readable description of transform coding of images is presented in Digital Image Compression Techniques, by M. Rabbani and P.W. Jones [80].
7. The Data Compression Book, by M. Nelson and J.-L. Gailly [60], provides a very readable discussion of the JPEG algorithm.
13.9 Projects and Problems

1. A square matrix A has the property that A^T A = A A^T = I, where I is the identity matrix. If X_1 and X_2 are two N-dimensional vectors and

\theta_1 = A X_1
\theta_2 = A X_2

then show that

\| X_1 - X_2 \|^2 = \| \theta_1 - \theta_2 \|^2    (13.70)

2. Consider the following sequence of values:

10   11   12   11   12   13   12   11
10  −10    8   −7    8   −8    7   −7

(a) Transform each row separately using an eight-point DCT. Plot the resulting 16 transform coefficients.
(b) Combine all 16 numbers into a single vector and transform it using a 16-point DCT. Plot the 16 transform coefficients.
(c) Compare the results of (a) and (b). For this particular case would you suggest a block size of 8 or 16 for greater compression? Justify your answer.

3. Consider the following "image":

4  3  2  1
3  2  1  1
2  1  1  1
1  1  1  1

(a) Obtain the two-dimensional DWHT transform by first taking the one-dimensional transform of the rows, then taking the column-by-column transform of the resulting matrix.
(b) Obtain the two-dimensional DWHT transform by first taking the one-dimensional transform of the columns, then taking the row-by-row transform of the resulting matrix.
(c) Compare and comment on the results of (a) and (b).

4. (This problem was suggested by P.F. Swaszek.) Let us compare the energy compaction properties of the DCT and the DWHT transforms.

(a) For the Sena image, compute the mean squared value of each of the 64 coefficients using the DCT. Plot these values.
(b) For the Sena image, compute the mean squared value of each of the 64 coefficients using the DWHT. Plot these values.
(c) Compare the results of (a) and (b). Which transform provides more energy compaction? Justify your answer.
  • 445. 422 13 T R A N S F O R M C O D I N G 5. Implement the transform and quantization portions of the JPEG standard. For coding the labels use an arithmetic coder instead of the modified Huffman code described in this chapter. (a) Encode the Sena image using this transform coder at rates of (approximately) 0.25, 0.5, and 0.75 bits per pixel. Compute the mean squared error at each rate and plot the rate versus the mse. (b) Repeat part (a) using one of the public domain implementations of JPEG. (c) Compare the plots obtained using the two coders and comment on the relative performance of the coders. 6. One of the extensions to the JPEG standard allows for the use of multiple quantization matrices. Investigate the issues involved in designing a set of quantization matrices. Should the quantization matrices be similar or dissimilar? How would you measure their similarity? Given a particular block, do you need to quantize it with each quantization matrix to select the best? Or is there a computationally more efficient approach? Describe your findings in a report.
  • 446. 14 Subband Coding 14 . 1 Overview I n this chapter we present the second of three approaches to compression in which the source output is decomposed into constituent parts. Each constituent part is encoded using one or more of the methods that have been described previously. The approach described in this chapter, known as subband coding, relies on separating the source output into different bands of frequencies using digital filters. We provide a general description of the subband coding system and, for those readers with some knowledge of Z-transforms, a more mathematical analysis of the system. The sections containing the mathematical analysis are not essential to understanding the rest of the chapter and are marked with a . If you are not interested in the mathematical analysis, you should skip these sections. This is followed by a description of a popular approach to bit allocation. We conclude the chapter with applications to audio and image compression. 14 . 2 Introduction In previous chapters we looked at a number of different compression schemes. Each of these schemes is most efficient when the data have certain characteristics. A vector quantization scheme is most effective if blocks of the source output show a high degree of clustering. A differential encoding scheme is most effective when the sample-to-sample difference is small. If the source output is truly random, it is best to use scalar quantization or lattice vector quantization. Thus, if a source exhibited certain well-defined characteristics, we could choose a compression scheme most suited to that characteristic. Unfortunately, most source outputs exhibit a combination of characteristics, which makes it difficult to select a compression scheme exactly suited to the source output.
In the last chapter we looked at techniques for decomposing the source output into different frequency bands using block transforms. The transform coefficients had differing statistics and differing perceptual importance. We made use of these differences in allocating bits for encoding the different coefficients. This variable bit allocation resulted in a decrease in the average number of bits required to encode the source output. One of the drawbacks of transform coding is the artificial division of the source output into blocks, which results in the generation of coding artifacts at the block edges, or blocking. One approach to avoiding this blocking is the lapped orthogonal transform (LOT) [192]. In this chapter we look at a popular approach to decomposing the image into different frequency bands without the imposition of an arbitrary block structure.

After the input has been decomposed into its constituents, we can use the coding technique best suited to each constituent to improve compression performance. Furthermore, each component of the source output may have different perceptual characteristics. For example, quantization error that is perceptually objectionable in one component may be acceptable in a different component of the source output. Therefore, a coarser quantizer that uses fewer bits can be used to encode the component that is perceptually less important.

Consider the sequence xn plotted in Figure 14.1. We can see that, while there is a significant amount of sample-to-sample variation, there is also an underlying long-term trend shown by the dotted line that varies slowly. One way to extract this trend is to average the sample values in a moving window. The averaging operation smooths out the rapid variations, making the slow variations more evident. Let's pick a window of size two and generate a new sequence yn by averaging neighboring values of xn:

y_n = \frac{x_n + x_{n-1}}{2}    (14.1)

The consecutive values of yn will be closer to each other than the consecutive values of xn. Therefore, the sequence yn can be coded more efficiently using differential encoding than we could encode the sequence xn. However, we want to encode the sequence xn, not the sequence yn. Therefore, we follow the encoding of the averaged sequence yn by the difference sequence zn:

z_n = x_n - y_n = x_n - \frac{x_n + x_{n-1}}{2} = \frac{x_n - x_{n-1}}{2}    (14.2)

FIGURE 14.1  A rapidly changing source output that contains a long-term component with slow variations.
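The averaging/differencing split of Equations (14.1) and (14.2) and the recovery of xn can be sketched in a few lines of Python (our own illustration; the handling of the first sample, where we assume x_{-1} = x_0, is one possible convention):

```python
import numpy as np

def analysis(x):
    # Two-band split of Eqs. (14.1)-(14.2): moving average and moving difference.
    x = np.asarray(x, dtype=float)
    xprev = np.concatenate((x[:1], x[:-1]))   # assume x_{-1} = x_0
    return (x + xprev) / 2, (x - xprev) / 2

def synthesis(y, z):
    # Eq. (14.3): the original samples are recovered as x_n = y_n + z_n.
    return y + z

x = [10, 14, 10, 12, 14, 8, 14, 12, 10, 8, 10, 12]
y, z = analysis(x)
print(np.allclose(synthesis(y, z), x))   # True
```

With the sequence used in Example 14.2.1 below, this reproduces the zn sequence given there and recovers xn exactly as yn + zn.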
The sequences yn and zn can be coded independently of each other. This way we can use the compression schemes that are best suited for each sequence.

Example 14.2.1:

Suppose we want to encode the following sequence of values xn:

10  14  10  12  14  8  14  12  10  8  10  12

There is a significant amount of sample-to-sample correlation, so we might consider using a DPCM scheme to compress this sequence. In order to get an idea of the requirements on the quantizer in a DPCM scheme, let us take a look at the sample-to-sample differences xn − xn−1:

10  4  −4  2  2  −6  6  −2  −2  −2  2  2

Ignoring the first value, the dynamic range of the differences is from −6 to 6. Suppose we want to quantize these values using m bits per sample. This means we could use a quantizer with M = 2^m levels or reconstruction values. If we choose a uniform quantizer, the size of each quantization interval, \Delta, is the range of possible input values divided by the total number of reconstruction values. Therefore,

\Delta = \frac{12}{M}

which would give us a maximum quantization error of \Delta/2, or 6/M.

Now let's generate two new sequences yn and zn according to (14.1) and (14.2). All three sequences are plotted in Figure 14.2. Notice that given yn and zn, we can always recover xn:

x_n = y_n + z_n    (14.3)

Let's try to encode each of these sequences. The sequence yn is

10  12  12  11  13  11  11  13  11  10  9  11

Notice that the yn sequence is "smoother" than the xn sequence—the sample-to-sample variation is much smaller. This becomes evident when we look at the sample-to-sample differences:

10  2  0  −1  2  −2  0  2  −2  −1  −1  2

The difference sequences xn − xn−1 and yn − yn−1 are plotted in Figure 14.3. Again, ignoring the first difference, the dynamic range of the differences yn − yn−1 is 4. If we take the dynamic range of these differences as a measure of the range of the quantizer, then for an M-level quantizer, the step size of the quantizer is 4/M and the maximum quantization error is 2/M.
FIGURE 14.2  Original set of samples and the two components (the sequences xn, yn, and zn plotted against sample number).

FIGURE 14.3  Difference sequences generated from the original and averaged sequences (xn − xn−1 and yn − yn−1 plotted against sample number).
This maximum quantization error is one-third the maximum quantization error incurred when the xn sequence is quantized using an M-level quantizer. However, in order to reconstruct xn, we also need to transmit zn. The zn sequence is

0  2  −2  1  1  −3  3  −1  −1  −1  1  1

The dynamic range for zn is 6, half the dynamic range of the difference sequence for xn. (We could have inferred this directly from the definition of zn.) The sample-to-sample difference varies more than the actual values. Therefore, instead of differentially encoding this sequence, we quantize each individual sample. For an M-level quantizer, the required step size would be 6/M, giving a maximum quantization error of 3/M.

For the same number of bits per sample, we can code both yn and zn and incur less distortion. At the receiver, we add yn and zn to get the original sequence xn back. The maximum possible quantization error in the reconstructed sequence would be 5/M, which is less than the maximum error we would incur if we encoded the xn sequence directly.

Although we use the same number of bits for each value of yn and zn, the number of elements in each of the yn and zn sequences is the same as the number of elements in the original xn sequence. Although we are using the same number of bits per sample, we are transmitting twice as many samples and, in effect, doubling the bit rate. We can avoid this by sending every other value of yn and zn. Let's divide the sequence yn into subsequences y2n and y2n−1—that is, a subsequence containing only the odd-numbered elements y1, y3, ..., and a subsequence containing only the even-numbered elements y2, y4, .... Similarly, we divide the zn sequence into subsequences z2n and z2n−1. If we transmit either the even-numbered subsequences or the odd-numbered subsequences, we would transmit only as many elements as in the original sequence. To see how we recover the sequence xn from these subsequences, suppose we only transmitted the subsequences y2n and z2n:

y_{2n} = \frac{x_{2n} + x_{2n-1}}{2}

z_{2n} = \frac{x_{2n} - x_{2n-1}}{2}

To recover the even-numbered elements of the xn sequence, we add the two subsequences. In order to obtain the odd-numbered members of the xn sequence, we take the difference:

y_{2n} + z_{2n} = x_{2n}    (14.4)

y_{2n} - z_{2n} = x_{2n-1}    (14.5)

Thus, we can recover the entire original sequence xn, sending only as many bits as required to transmit the original sequence while incurring less distortion. Is the last part of the previous statement still true? In our original scheme we proposed to transmit the sequence yn by transmitting the differences yn − yn−1. As we now need to transmit the subsequence y2n, we will be transmitting the differences y2n − y2n−2 instead. In order for our original statement about reduction in distortion to hold, the dynamic range
of this new sequence of differences should be less than or equal to the dynamic range of the original difference. A quick check of the yn shows us that the dynamic range of the new differences is still 4, and our claim of incurring less distortion still holds.

There are several things we can see from this example. First, the number of different values that we transmit is the same, whether we send the original sequence xn or the two subsequences yn and zn. Decomposing the xn sequence into subsequences did not result in any increase in the number of values that we need to transmit. Second, the two subsequences had distinctly different characteristics, which led to our use of different techniques to encode the different sequences. If we had not split the xn sequence, we would have been using essentially the same approach to compress both subsequences. Finally, we could have used the same decomposition approach to decompose the two constituent sequences, which then could be decomposed further still.

While this example was specific to a particular set of values, we can see that decomposing a signal can lead to different ways of looking at the problem of compression. This added flexibility can lead to improved compression performance. Before we leave this example let us formalize the process of decomposing, or analysis, and recomposing, or synthesis. In our example, we decomposed the input sequence xn into two subsequences yn and zn by the operations

y_n = \frac{x_n + x_{n-1}}{2}    (14.6)

z_n = \frac{x_n - x_{n-1}}{2}    (14.7)

We can implement these operations using discrete time filters. We briefly considered discrete time filters in Chapter 12. We take a slightly more detailed look at filters in the next section.

14.3 Filters

A system that isolates certain frequency components is called a filter. The analogy here with mechanical filters such as coffee filters is obvious. A coffee filter or a filter in a water purification system blocks coarse particles and allows only the finer-grained components of the input to pass through. The analogy is not complete, however, because mechanical filters always block the coarser components of the input, while the filters we are discussing can selectively let through or block any range of frequencies. Filters that only let through components below a certain frequency f0 are called low-pass filters; filters that block all frequency components below a certain value f0 are called high-pass filters. The frequency f0 is called the cutoff frequency. Filters that let through components that have frequency content above some frequency f1 but below frequency f2 are called band-pass filters.

One way to characterize filters is by their magnitude transfer function—the ratio of the magnitude of the input and output of the filter as a function of frequency. In Figure 14.4 we show the magnitude transfer function for an ideal low-pass filter and a more realistic low-pass filter, both with a cutoff frequency of f0. In the ideal case, all components of the input signal with frequencies below f0 are unaffected except for a constant amount of amplification.
FIGURE 14.4  Ideal and realistic low-pass filter characteristics (magnitude versus frequency, with cutoff frequency f0).

All frequencies above f0 are blocked. In other words, the cutoff is sharp. In the case of the more realistic filter, the cutoff is more gradual. Also, the amplification for the components with frequency less than f0 is not constant, and components with frequencies above f0 are not totally blocked. This phenomenon is referred to as ripple in the passband and stopband.

The filters we will discuss are digital filters, which operate on a sequence of numbers that are usually samples of a continuously varying signal. We have discussed sampling in Chapter 12. For those of you who skipped that chapter, let us take a brief look at the sampling operation. How often does a signal have to be sampled in order to reconstruct the signal from the samples? If one signal changes more rapidly than another, it is reasonable to assume that we would need to sample the more rapidly varying signal more often than the slowly varying signal in order to achieve an accurate representation. In fact, it can be shown mathematically that if the highest frequency component of a signal is f0, then we need to sample the signal at more than 2f0 times per second. This result is known as the Nyquist theorem or Nyquist rule after Harry Nyquist, a famous mathematician from Bell Laboratories. His pioneering work laid the groundwork for much of digital communication. The Nyquist rule can also be extended to signals that only have frequency components between two frequencies f1 and f2. If f1 and f2 satisfy certain criteria, then we can show that in order to recover the signal exactly, we need to sample the signal at a rate of at least 2(f2 − f1) samples per second [123].

What would happen if we violated the Nyquist rule and sampled at less than twice the highest frequency? In Chapter 12 we showed that it would be impossible to recover the original signal from the samples. Components with frequencies higher than half the sampling rate show up at lower frequencies. This process is called aliasing. In order to prevent aliasing, most systems that require sampling will contain an "anti-aliasing filter" that restricts the input to the sampler to be less than half the sampling frequency. If the signal contains components at more than half the sampling frequency, we will introduce distortion by filtering out these components. However, the distortion due to aliasing is generally more severe than the distortion we introduce due to filtering.
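A short numerical illustration of aliasing (our own example, with an arbitrarily chosen 100 Hz sampling rate): a 70 Hz cosine sampled at 100 samples per second produces exactly the same samples as a 30 Hz cosine, so after sampling the two components are indistinguishable.

```python
import numpy as np

fs = 100.0                              # sampling rate in Hz (illustrative choice)
t = np.arange(0, 1, 1 / fs)
x_high = np.cos(2 * np.pi * 70 * t)     # 70 Hz: above fs/2 = 50 Hz
x_alias = np.cos(2 * np.pi * 30 * t)    # appears at 100 - 70 = 30 Hz
print(np.allclose(x_high, x_alias))     # True: the sample sequences are identical
```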
Digital filtering involves taking a weighted sum of current and past inputs to the filter and, in some cases, the past outputs of the filter. The general form of the input-output relationship of the filter is given by

y_n = \sum_{i=0}^{N} a_i x_{n-i} + \sum_{i=1}^{M} b_i y_{n-i}    (14.8)

where the sequence xn is the input to the filter, the sequence yn is the output from the filter, and the values ai and bi are called the filter coefficients. If the input sequence is a single 1 followed by all 0s, the output sequence is called the impulse response of the filter. Notice that if the bi are all 0, then the impulse response will die out after N samples. These filters are called finite impulse response (FIR) filters. The number N is sometimes called the number of taps in the filter. If any of the bi have nonzero values, the impulse response can, in theory, continue forever. Filters with nonzero values for some of the bi are called infinite impulse response (IIR) filters.

Example 14.3.1:

Suppose we have a filter with a_0 = 1.25 and a_1 = 0.5. If the input sequence xn is given by

x_n = \begin{cases} 1 & n = 0 \\ 0 & n \neq 0 \end{cases}    (14.9)

then the output is given by

y_0 = a_0 x_0 + a_1 x_{-1} = 1.25
y_1 = a_0 x_1 + a_1 x_0 = 0.5
y_n = 0, \quad n < 0 \text{ or } n > 1

This output is called the impulse response of the filter. The impulse response sequence is usually represented by hn. Therefore, for this filter we would say that

h_n = \begin{cases} 1.25 & n = 0 \\ 0.5 & n = 1 \\ 0 & \text{otherwise.} \end{cases}    (14.10)

Notice that if we know the impulse response we also know the values of ai. Knowledge of the impulse response completely specifies the filter. Furthermore, because the impulse response goes to zero after a finite number of samples (two in this case), the filter is an FIR filter. The filters we used in Example 14.2.1 are both two-tap FIR filters with impulse responses

h_n = \begin{cases} \frac{1}{2} & n = 0 \\ \frac{1}{2} & n = 1 \\ 0 & \text{otherwise} \end{cases}    (14.11)
for the "averaging" or low-pass filter, and

h_n = \begin{cases} \frac{1}{2} & n = 0 \\ -\frac{1}{2} & n = 1 \\ 0 & \text{otherwise} \end{cases}    (14.12)

for the "difference" or high-pass filter.

Now let's consider a different filter with a_0 = 1 and b_1 = 0.9. For the same input as above, the output is given by

y_0 = a_0 x_0 + b_1 y_{-1} = 1(1) + 0.9(0) = 1    (14.13)
y_1 = a_0 x_1 + b_1 y_0 = 1(0) + 0.9(1) = 0.9    (14.14)
y_2 = a_0 x_2 + b_1 y_1 = 1(0) + 0.9(0.9) = 0.81    (14.15)
  ⋮
y_n = (0.9)^n    (14.16)

The impulse response can be written more compactly as

h_n = \begin{cases} 0 & n < 0 \\ (0.9)^n & n \geq 0 \end{cases}    (14.17)

Notice that the impulse response is nonzero for all n ≥ 0, which makes this an IIR filter. Although it is not as clear in the IIR case as it was in the FIR case, the impulse response completely specifies the filter. Once we know the impulse response of the filter, we know the relationship between the input and output of the filter. If xn and yn are the input and output, respectively, of a filter with impulse response \{h_n\}_{n=0}^{M}, then yn can be obtained from xn and hn via the following relationship:

y_n = \sum_{k=0}^{M} h_k x_{n-k}    (14.18)

where M is finite for an FIR filter and infinite for an IIR filter. The relationship shown in (14.18) is known as convolution and can be easily obtained through the use of the properties of linearity and shift invariance (see Problem 1).

Because FIR filters are simply weighted averages, they are always stable. When we say a filter is stable we mean that as long as the input is bounded, the output will also be bounded. This is not true of IIR filters. Certain IIR filters can give an unbounded output even when the input is bounded.
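A direct, if naive, Python rendering of Equation (14.8) can be used to reproduce the impulse responses computed in these examples (the function and its argument layout are our own; a = [a_0, ..., a_N] and b = [b_1, ..., b_M], with past values taken as zero):

```python
import numpy as np

def difference_equation(x, a, b):
    # y_n = sum_i a_i x_{n-i} + sum_i b_i y_{n-i}, as in Eq. (14.8).
    y = np.zeros(len(x))
    for n in range(len(x)):
        for i, ai in enumerate(a):
            if n - i >= 0:
                y[n] += ai * x[n - i]
        for i, bi in enumerate(b, start=1):
            if n - i >= 0:
                y[n] += bi * y[n - i]
    return y

impulse = np.zeros(8)
impulse[0] = 1.0
print(difference_equation(impulse, a=[1.25, 0.5], b=[]))   # FIR: 1.25, 0.5, 0, ...
print(difference_equation(impulse, a=[1.0], b=[0.9]))      # IIR: 1, 0.9, 0.81, ... = (0.9)^n
```

The first call reproduces the FIR impulse response of Example 14.3.1 and the second the IIR response of Equation (14.16).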
Example 14.3.2:

Consider a filter with a_0 = 1 and b_1 = 2. Suppose the input sequence is a single 1 followed by 0s. Then the output is

y_0 = a_0 x_0 + b_1 y_{-1} = 1(1) + 2(0) = 1    (14.19)
y_1 = a_0 x_1 + b_1 y_0 = 1(0) + 2(1) = 2    (14.20)
y_2 = a_0 x_2 + b_1 y_1 = 1(0) + 2(2) = 4    (14.21)
  ⋮
y_n = 2^n    (14.22)

Even though the input contained a single 1, the output at time n = 30 is 2^30, or more than a billion!

Although IIR filters can become unstable, they can also provide better performance, in terms of sharper cutoffs and less ripple in the passband and stopband, using fewer coefficients. The study of design and analysis of digital filters is a fascinating and important subject. We provide some of the details in Sections 14.5–14.8. If you are not interested in these topics, you can take a more utilitarian approach and make use of the literature to select the necessary filters rather than design them. In the following section we briefly describe some of the families of filters used to generate the examples in this chapter. We also provide filter coefficients that you can use for experiments.

14.3.1 Some Filters Used in Subband Coding

The most frequently used filter banks in subband coding consist of a cascade of stages, where each stage consists of a low-pass filter and a high-pass filter, as shown in Figure 14.5. The most popular among these filters are the quadrature mirror filters (QMF), which were first proposed by Croisier, Esteban, and Galand [197]. These filters have the property that if the impulse response of the low-pass filter is given by hn, then the high-pass impulse response is given by (−1)^n h_{N−1−n}. The QMF filters designed by Johnston [198] are widely used in a number of applications. The filter coefficients for 8-, 16-, and 32-tap filters are given in Tables 14.1–14.3. Notice that the filters are symmetric; that is,

h_{N-1-n} = h_n, \quad n = 0, 1, \ldots, \frac{N}{2} - 1    (14.23)

As we shall see later, the filters with fewer taps are less efficient in their decomposition than the filters with more taps. However, from Equation (14.18) we can see that the number of taps dictates the number of multiply-add operations necessary to generate the filter outputs. Thus, if we want to obtain more efficient decompositions, we do so by increasing the amount of computation. Another popular set of filters are the Smith-Barnwell filters [199], some of which are shown in Tables 14.4 and 14.5.
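The QMF relationship between the low-pass and high-pass impulse responses is easy to state in code (a sketch of our own; the function name is hypothetical). Applied to the two-tap averaging filter of Example 14.2.1, it returns the corresponding difference filter of Equation (14.12):

```python
import numpy as np

def qmf_highpass(h_low):
    # QMF companion filter: h_hp[n] = (-1)^n * h_lp[N-1-n].
    h_low = np.asarray(h_low, dtype=float)
    n = np.arange(len(h_low))
    return ((-1.0) ** n) * h_low[::-1]

print(qmf_highpass([0.5, 0.5]))   # [ 0.5 -0.5]: the "difference" (high-pass) filter
```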
FIGURE 14.5  An eight-band filter bank (a three-stage cascade of low-pass and high-pass filters).

TABLE 14.1  Coefficients for the 8-tap Johnston low-pass filter.

h0, h7      0.00938715
h1, h6      0.06942827
h2, h5     −0.07065183
h3, h4      0.48998080

TABLE 14.2  Coefficients for the 16-tap Johnston low-pass filter.

h0, h15     0.002898163
h1, h14    −0.009972252
h2, h13    −0.001920936
h3, h12     0.03596853
h4, h11    −0.01611869
h5, h10    −0.09530234
h6, h9      0.1067987
h7, h8      0.4773469
TABLE 14.3  Coefficients for the 32-tap Johnston low-pass filter.

h0, h31     0.0022551390
h1, h30    −0.0039715520
h2, h29    −0.0019696720
h3, h28     0.0081819410
h4, h27     0.00084268330
h5, h26    −0.014228990
h6, h25     0.0020694700
h7, h24     0.022704150
h8, h23    −0.0079617310
h9, h22    −0.034964400
h10, h21    0.019472180
h11, h20    0.054812130
h12, h19   −0.044524230
h13, h18   −0.099338590
h14, h17    0.13297250
h15, h16    0.46367410

TABLE 14.4  Coefficients for the eight-tap Smith-Barnwell low-pass filter.

h0      0.0348975582178515
h1     −0.01098301946252854
h2     −0.06286453934951963
h3      0.223907720892568
h4      0.556856993531445
h5      0.357976304997285
h6     −0.02390027056113145
h7     −0.07594096379188282

TABLE 14.5  Coefficients for the 16-tap Smith-Barnwell low-pass filter.

h0      0.02193598203004352
h1      0.001578616497663704
h2     −0.06025449102875281
h3     −0.0118906596205391
h4      0.137537915636625
h5      0.05745450056390939
h6     −0.321670296165893
h7     −0.528720271545339
h8     −0.295779674500919
h9      0.0002043110845170894
h10     0.02906699789446796
h11    −0.03533486088708146
h12    −0.006821045322743358
h13     0.02606678468264118
h14     0.001033363491944126
h15    −0.01435930957477529
These families of filters differ in a number of ways. For example, consider the Johnston eight-tap filter and the Smith-Barnwell eight-tap filter. The magnitude transfer functions for these two filters are plotted in Figure 14.6. Notice that the cutoff for the Smith-Barnwell filter is much sharper than the cutoff for the Johnston filter. This means that the separation provided by the eight-tap Johnston filter is not as good as that provided by the eight-tap Smith-Barnwell filter. We will see the effect of this when we look at image compression later in this chapter.

These filters are examples of some of the more popular filters. Many more filters exist in the literature, and more are being discovered.

FIGURE 14.6  Magnitude transfer functions (in dB, versus frequency in Hz) of the (a) eight-tap Johnston and (b) eight-tap Smith-Barnwell filters.
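Comparisons like Figure 14.6 can be reproduced by evaluating the magnitude transfer function directly from the tabulated coefficients. The sketch below (our own code, not from the book) rebuilds the full 8-tap Johnston impulse response from the half listed in Table 14.1 using the symmetry of Equation (14.23) and evaluates |H(e^{jω})| on a frequency grid; applying the same function to the Table 14.4 coefficients shows the sharper Smith-Barnwell cutoff.

```python
import numpy as np

def magnitude_response_db(h, n_points=512):
    # |H(e^{jw})| in dB on a grid of frequencies in [0, pi).
    w = np.linspace(0, np.pi, n_points, endpoint=False)
    n = np.arange(len(h))
    H = np.array([np.sum(h * np.exp(-1j * wk * n)) for wk in w])
    return w, 20 * np.log10(np.abs(H) + 1e-12)

# Johnston 8-tap prototype: Table 14.1 lists h0..h3; the filter is symmetric,
# h[N-1-n] = h[n], so the full impulse response is the half mirrored.
half = np.array([0.00938715, 0.06942827, -0.07065183, 0.48998080])
h_johnston8 = np.concatenate((half, half[::-1]))
w, mag_db = magnitude_response_db(h_johnston8)
```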
  • 459. 436 14 S U B B A N D C O D I N G 14 . 4 The Basic Subband Coding Algorithm The basic subband coding system is shown in Figure 14.7. 14 . 4 . 1 A n a l y s i s The source output is passed through a bank of filters, called the analysis filter bank, which covers the range of frequencies that make up the source output. The passbands of the filters can be nonoverlapping or overlapping. Nonoverlapping and overlapping filter banks are shown in Figure 14.8. The outputs of the filters are then subsampled. The justification for the subsampling is the Nyquist rule and its generalization, which tells us that we only need twice as many samples per second as the range of frequencies. This means that we can reduce the number of samples at the output of the filter because the range of frequencies at the output of the filter is less than the range of frequencies at the input to the filter. This process of reducing the number of samples is called decimation,1 or downsampling. The amount of decimation depends on the ratio of the bandwidth of the filter output to the filter input. If the bandwidth at the output of the filter is 1/M of the bandwidth at the input to the filter, we would decimate the output by a factor of M by keeping every Mth sample. The symbol M ↓ is used to denote this decimation. Once the output of the filters has been decimated, the output is encoded using one of several encoding schemes, including ADPCM, PCM, and vector quantization. Synthesis filter M Decoder M M Synthesis filter 3 Decoder 3 Encoder 3 M M Synthesis filter 2 Decoder 2 Encoder 2 M M Synthesis filter 1 Decoder 1 Encoder 1 M M Encoder M M Analysis filter M Analysis filter 3 Analysis filter 2 Analysis filter 1 Channel F I G U R E 14. 7 Block diagram of the subband coding system. 1 The word decimation has a rather bloody origin. During the time of the Roman empire, if a legion broke ranks and ran during battle, its members were lined up and every tenth person was killed. This process was called decimation.
  • 460. 14.4 The Basic Subband Coding Algorithm 437 Magnitude Frequency Magnitude Frequency F I G U R E 14. 8 Nonoverlapping and overlapping filter banks. 14 . 4 . 2 Q u a n t i z a t i o n a n d C o d i n g Along with the selection of the compression scheme, the allocation of bits between the subbands is an important design parameter. Different subbands contain differing amounts of information. Therefore, we need to allocate the available bits among the subbands according to some measure of the information content. There are a number of different ways we could distribute the available bits. For example, suppose we were decomposing the source output into four bands and we wanted a coding rate of 1 bit per sample. We could accomplish this by using 1 bit per sample for each of the four bands. On the other hand, we could simply discard the output of two of the bands and use 2 bits per sample for the two remaining bands. Or, we could discard the output of three of the four filters and use 4 bits per sample to encode the output of the remaining filter. This bit allocation procedure can have a significant impact on the quality of the final reconstruction, especially when the information content of different bands is very different. If we use the variance of the output of each filter as a measure of information, and assume that the compression scheme is scalar quantization, we can arrive at several simple bit allocation schemes (see Section 13.5). If we use a slightly more sophisticated model for the outputs of the filters, we can arrive at significantly better bit allocation procedures (see Section 14.9). 14 . 4 . 3 S y n t h e s i s The quantized and coded coefficients are used to reconstruct a representation of the original signal at the decoder. First, the encoded samples from each subband are decoded at the receiver. These decoded values are then upsampled by inserting an appropriate number of
  • 461. 438 14 S U B B A N D C O D I N G 0s between samples. Once the number of samples per second has been brought back to the original rate, the upsampled signals are passed through a bank of reconstruction filters. The outputs of the reconstruction filters are added to give the final reconstructed outputs. We can see that the basic subband system is simple. The three major components of this system are the analysis and synthesis filters, the bit allocation scheme, and the encoding scheme. A substantial amount of research has focused on each of these components. Various filter bank structures have been studied in order to find filters that are simple to implement and provide good separation between the frequency bands. In the next section we briefly look at some of the techniques used in the design of filter banks, but our descriptions are necessarily limited. For a (much) more detailed look, see the excellent book by P.P. Vaidyanathan [200]. The bit allocation procedures have also been extensively studied in the contexts of subband coding, wavelet-based coding, and transform coding. We have already described some bit allocation schemes in Section 13.5, and we describe a different approach in Section 14.9. There are also some bit allocation procedures that have been developed in the context of wavelets, which we describe in the next chapter. The separation of the source output according to frequency also opens up the possibility for innovative ways to use compression algorithms. The decomposition of the source output in this manner provides inputs for the compression algorithms, each of which has more clearly defined characteristics than the original source output. We can use these characteristics to select separate compression schemes appropriate to each of the different inputs. Human perception of audio and video inputs is frequency dependent. We can use this fact to design our compression schemes so that the frequency bands that are most important to perception are reconstructed most accurately. Whatever distortion there has to be is introduced in the frequency bands to which humans are least sensitive. We describe some applications to the coding of speech, audio, and images later in this chapter. Before we proceed to bit allocation procedures and implementations, we provide a more mathematical analysis of the subband coding system. We also look at some approaches to the design of filter banks for subband coding. The analysis relies heavily on the Z-transform concepts introduced in Chapter 12 and will primarily be of interest to readers with an electrical engineering background. The material is not essential to understanding the rest of the chapter; if you are not interested in these details, you should skip these sections and go directly to Section 14.9. 14 . 5 Design of Filter Banks In this and the following starred section we will take a closer look at the analysis, down- sampling, upsampling, and synthesis operations. Our approach follows that of [201]. We assume familiarity with the Z-transform concepts of Chapter 12. We begin with some nota- tion. Suppose we have a sequence x0x1x2 . We can divide this sequence into two subsequences: x0x2x4 and x1x3x5 using the scheme shown in Figure 14.9, where z−1 corresponds to a delay of one sample and ↓ M denotes a subsampling by a factor of M. This subsampling process is called downsampling or decimation.
  • 462. 14.5 Design of Filter Banks 439 2 x0 x1 x2 x0 x2 x4 2 0 x1 x3 x5 z–1 F I G U R E 14. 9 Decomposition of an input sequence into its odd and even components. 2 x0 x2 x4 x0 0 x2 0 x4 2 x1 x3 x5 x1 0 x3 0 x5 0 0 0 x0 x1 x2 0 z–1 F I G U R E 14. 10 Reconstructing the input sequence from its odd and even components. The original sequence can be recovered from the two downsampled sequences by insert- ing 0s between consecutive samples of the subsequences, delaying the top branch by one sample and adding the two together. Adding 0s between consecutive samples is called upsampling and is denoted by ↑ M. The reconstruction process is shown in Figure 14.10. While we have decomposed the source output sequence into two subsequences, there is no reason for the statistical and spectral properties of these subsequences to be different. As our objective is to decompose the source output sequences into subsequences with differing characteristics, there is much more yet to be done. Generalizing this, we obtain the system shown in Figure 14.11. The source output sequence is fed to an ideal low-pass filter and an ideal high-pass filter, each with a bandwidth of /2. We assume that the source output sequence had a bandwidth of . If the original source signal was sampled at the Nyquist rate, as the output of the two filters have bandwidths half that of the original sequence, the filter outputs are actually oversampled by a factor of two. We can, therefore, subsample these signals by a factor of two without any loss of information. The two bands now have different characteristics and can be encoded differently. For the moment let’s assume that the encoding is performed in a lossless manner so that the reconstructed sequence exactly matches the source output sequence.
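The splitting and recombination of Figures 14.9 and 14.10 amount to keeping the even- and odd-indexed samples separately and then interleaving them again; a minimal sketch (our own illustration):

```python
import numpy as np

def split_even_odd(x):
    # Downsampling by 2 on the two branches of Figure 14.9:
    # one branch keeps x0, x2, ..., the delayed branch keeps x1, x3, ...
    return x[::2], x[1::2]

def merge_even_odd(even, odd):
    # Interleave the two subsequences. This is equivalent to upsampling each
    # by 2 (inserting zeros), delaying one branch, and adding, as in Figure 14.10.
    y = np.zeros(len(even) + len(odd))
    y[::2] = even
    y[1::2] = odd
    return y

x = np.arange(8.0)
even, odd = split_even_odd(x)
print(merge_even_odd(even, odd))   # recovers 0, 1, ..., 7
```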
  • 463. 440 14 S U B B A N D C O D I N G 2 xn y1, n z–1 z–1 Ideal low-pass filter 2 2 2 y2, n xn Ideal high-pass filter Ideal low-pass filter Ideal high-pass filter Encoder 1 Encoder 2 Decoder 1 Decoder 2 F I G U R E 14. 11 Decomposition into two bands using ideal filters. Let us look at how this system operates in the frequency domain. We begin by looking at the downsampling operation. 14 . 5 . 1 D o w n s a m p l i n g To see the effects of downsampling, we will obtain the Z-transform of the downsampled sequence in terms of the original source sequence. Because it is easier to understand what is going on if we can visualize the process, we will use the example of a source sequence that has the frequence profile shown in Figure 14.12. For this sequence the output of the ideal filters will have the shape shown in Figure 14.13. Let’s represent the downsampled sequence as win. The Z-transform W1 z of the downsampled sequence w1n is W1 z = w1nz−n (14.24) The downsampling operation means that w1n = y12n (14.25) π/2 π ω X(ejω) F I G U R E 14. 12 Spectrum of the source output.
  • 464. 14.5 Design of Filter Banks 441 π/2 ω Y1(ejω) π/2 π ω Y2(ejω) F I G U R E 14. 13 Spectrum of the outputs of the ideal filters. In order to find the Z-transform of this sequence, we go through a two-step process. Define the sequence y 1n = 1 2 1+ejn y1n (14.26) = y1n n even 0 otherwise. (14.27) We could also have written Equation (14.26) as y 1n = 1 2 1+ −1 n y1n however, writing the relationship as in Equation (14.26) makes it easier to extend this development to the case where we divide the source output into more than two bands. The Z-transform of y 1n is given as Y 1 z = n=− 1 2 1+ejn y1nz−n (14.28) Assuming all summations converge, Y 1 z = 1 2 n=− y1nz−n + 1 2 n=− y1n ze−j −n (14.29) = 1 2 Y1 z + 1 2 Y1 −z (14.30)
  • 465. 442 14 S U B B A N D C O D I N G where we have used the fact that e−j = cos −j sin = −1 Noting that w1n = y 12n (14.31) W1 z = n=− w1nz−n = − y 12nz−n (14.32) Substituting m = 2n, W1 z = − y 1mz −m 2 (14.33) = Y 1 z 1 2 (14.34) = 1 2 Y1 z 1 2 + 1 2 Y1 −z 1 2 (14.35) Why didn’t we simply write the Z-transform of w1n directly in terms of y1n and use the substitution m = 2n? If we had, the equivalent equation to (14.33) would contain the odd indexed terms of y1n, which we know do not appear at the output of the downsampler. In Equation (14.33), we also get the odd indexed terms of y 1n; however, as these terms are all zero (see Equation (14.26)), they do not contribute to the Z-transform. Substituting z = ej we get W1 ej = 1 2 Y1 ej 2 + 1 2 Y −e j 2 (14.36) Plotting this for the Y1 ej of Figure 14.13, we get the spectral shape shown in Figure 14.14; that is, the spectral shape of the downsampled signal is a stretched version of the spectral shape of the original signal. A similar situation exists for the downsampled signal w2n. π/2 π ω W1(ejω) F I G U R E 14. 14 Spectrum of the downsampled low-pass filter output.
  • 466. 14.5 Design of Filter Banks 443 14 . 5 . 2 U p s a m p l i n g Let’s take a look now at what happens after the upsampling. The upsampled sequence v1n can be written as v1n = w1 n 2 n even 0 n odd. (14.37) The Z-transform V1 z is thus V1 z = n=− v1nz−n (14.38) = n=− w1 n 2 z−n n even (14.39) = m=− w1mz−2m (14.40) = W1 z2 (14.41) The spectrum is sketched in Figure 14.15. The “stretching” of the sequence in the time domain has led to a compression in the frequency domain. This compression has also resulted in a replication of the spectrum in the 0 interval. This replication effect is called imaging. We remove the images by using an ideal low-pass filter in the top branch and an ideal high-pass filter in the bottom branch. Because the use of the filters prior to sampling reduces the bandwidth, which in turn allows the downsampling operation to proceed without aliasing, these filters are called anti- aliasing filters. Because they decompose the source output into components, they are also called analysis filters. The filters after the upsampling operation are used to recompose the original signal; therefore, they are called synthesis filters. We can also view these filters as interpolating between nonzero values to recover the signal at the point that we have inserted zeros. Therefore, these filters are also called interpolation filters. ␲/2 ␲ ␻ V1(ej␻) F I G U R E 14. 15 Spectrum of the upsampled signal.
  • 467. 444 14 S U B B A N D C O D I N G Although the use of ideal filters would give us perfect reconstruction of the source output, in practice we do not have ideal filters available. When we use more realistic filters in place of the ideal filters, we end up introducing distortion. In the next section we look at this situation and discuss how we can reduce or remove this distortion. 14 . 6 Perfect Reconstruction Using Two-Channel Filter Banks Suppose we replace the ideal low-pass filter in Figure 14.11 with a more realistic filter with the magnitude response shown in Figure 14.4. The spectrum of the output of the low-pass filter is shown in Figure 14.16. Notice that we now have nonzero values for frequencies above 2 . If we now subsample by two, we will end up sampling at less than twice the highest frequency, or in other words, we will be sampling at below the Nyquist rate. This will result in the introduction of aliasing distortion, which will show up in the reconstruction. A similar situation will occur when we replace the ideal high-pass filter with a realistic high-pass filter. In order to get perfect reconstruction after synthesis, we need to somehow get rid of the aliasing and imaging effects. Let us look at the conditions we need to impose upon the filters H1 z , H2 z , K1 z , and K2 z in order to accomplish this. These conditions are called perfect reconstruction (PR) conditions. Consider Figure 14.17. Let’s obtain an expression for X̂ z in terms of H1 z , H2 z , K1 z , and K2 z . We start with the reconstruction: X̂ z = U1 z +U2 z (14.42) = V1 z K1 z +V2 z K2 z (14.43) Therefore, we need to find V1 z and V2 z . The sequence v1n is obtained by upsampling w1n. Therefore, from Equation (14.41), V1 z = W1 z2 (14.44) π/2 ω V1(ejω) F I G U R E 14. 16 Output of the low-pass filter.
  • 468. 14.6 Perfect Reconstruction Using Two-Channel Filter Banks 445 xn xn ˆ y1, n w1, n H1(z) K1(z) 2 2 v1, n y2, n w2, n H2(z) K2(z) 2 2 v2, n u1, n u2, n F I G U R E 14. 17 Two-channel subband decimation and interpolation. The sequence w1n is obtained by downsampling y1n, Y1 z = X z H1 z Therefore, from Equation (14.35), W1 z = 1 2 X z 1 2 H1 z 1 2 +X −z 1 2 H1 −z 1 2 (14.45) and V1 z = 1 2 X z H1 z +X −z H1 −z (14.46) Similarly, we can also show that V2 z = 1 2 X z H2 z +X −z H2 −z (14.47) Substituting the expressions for V1 z and V2 z into Equation (14.43) we obtain X̂ z = 1 2 H1 z K1 z +H2 z K2 z X z + 1 2 H1 −z K1 z +H2 −z K2 z X −z (14.48) For perfect reconstruction we would like X̂ z to be a delayed and perhaps amplitude- scaled version of X z ; that is, X̂ z = cX z z−n0 (14.49) In order for this to be true, we need to impose conditions on H1 z , H2 z , K1 z , and K2 z . There are several ways we can do this, with each approach providing a different solution. One approach involves writing Equation (14.48) in matrix form as X̂ z = 1 2 K1 z K2 z H1 z H1 −z H2 z H2 −z X z X −z (14.50) For perfect reconstruction, we need K1 z K2 z H1 z H1 −z H2 z H2 −z = cz−n0 0 (14.51)
  • 469. 446 14 S U B B A N D C O D I N G where we have absorbed the factor of 1 2 into the constant c. This means that the synthesis filters K1 z and K2 z satisfy K1 z K2 z = cz−n0 det z H2 −z −H1 −z (14.52) where z = H1 z H1 −z H2 z H2 −z (14.53) If H1 z and/or H2 z are IIR filters, the reconstruction filters can become quite complex. Therefore, we would like to have both the analysis and synthesis filters be FIR filters. If we select the analysis filters to be FIR, then in order to guarantee that the synthesis filters are also FIR we need det z = z−n1 where is a constant. Examining det z det z = H1 z H2 −z −H1 −z H2 z = P z −P −z = z−n1 (14.54) where P z = H1 z H2 −z . If we examine Equation (14.54), we can see that n1 has to be odd because all terms containing even powers of z in P z will be canceled out by the corresponding terms in P −z . Thus, P z can have an arbitrary number of even-indexed coefficients (as they will get canceled out), but there must be only one nonzero coefficient of an odd power of z. By choosing any valid factorization of the form P z = P1 z P2 z (14.55) we can obtain many possible solutions of perfect reconstruction FIR filter banks with H1 z = P1 z (14.56) and H2 z = P2 −z (14.57) Although these filters are perfect reconstruction filters, for applications in data compression they suffer from one significant drawback. Because these filters may be of unequal band- width, the output of the larger bandwidth filter suffers from severe aliasing. If the output of both bands is available to the receiver, this is not a problem because the aliasing is canceled out in the reconstruction process. However, in many compression applications we discard the subband containing the least amount of energy, which will generally be the output of the filter with the smaller bandwidth. In this case the reconstruction will contain a large amount of aliasing distortion. In order to avoid this problem for compression applications, we generally wish to minimize the amount of aliasing in each subband. A class of filters that is useful in this situation is the quadrature mirror filters (QMF). We look at these filters in the next section.
  • 470. 14.6 Perfect Reconstruction Using Two-Channel Filter Banks 447 14 . 6 . 1 T w o - C h a n n e l P R Q u a d r a t u r e M i r r o r F i l t e r s Before we introduce the quadrature mirror filters, let’s rewrite Equation (14.48) as X̂ z = T z X z +S z X −z (14.58) where T z = 1 2 H1 z K1 z +H2 z K2 z (14.59) S z = 1 2 H1 −z K1 z +H2 −z K2 z (14.60) In order for the reconstruction of the input sequence xn to be a delayed, and perhaps scaled, version of xn, we need to get rid of the aliasing term X −z and have T z be a pure delay. To get rid of the aliasing term, we need S z = 0 ∀z From Equation (14.60), this will happen if K1 z = H2 −z (14.61) K2 z = −H1 −z (14.62) After removing the aliasing distortion, a delayed version of the input will be available at the output if T z = cz−n0 c is a constant (14.63) Replacing z by ej , this means that we want T ej = constant (14.64) arg T ej = Kw K constant (14.65) The first requirement eliminates amplitude distortion, while the second, the linear phase requirement, is necessary to eliminate phase distortion. If these requirements are satisfied, x̂ n = cx n−n0 (14.66) That is, the reconstructed signal is a delayed version of input signal x n . However, meeting both requirements simultaneously is not a trivial task. Consider the problem of designing T z to have linear phase. Substituting (14.61) and (14.62) into Equation (14.59), we obtain T z = 1 2 H1 z H2 −z −H1 −z H2 z (14.67)
  • 471. 448 14 S U B B A N D C O D I N G Therefore, if we choose H1 z and H2 z to be linear phase FIR, T z will also be a linear phase FIR filter. In the QMF approach, we first select the low-pass filter H1 z , then define the high-pass filter H2 z to be a mirror image of the low-pass filter: H2 z = H1 −z (14.68) This is referred to as a mirror condition and is the original reason for the name of the QMF filters [200]. We can see that this condition will force both filters to have equal bandwidth. Given the mirror condition and H1 z , a linear phase FIR filter, we will have linear phase and T z = 1 2 H2 1 z −H2 1 −z (14.69) It is not clear that T ej is a constant. In fact, we will show in Section 14.8 that a linear phase two-channel FIR QMF bank with the filters chosen as in Equation (14.68) can have PR property if and only if H1 z is in the simple two-tap form H1 z = h0z−2k0 +h1z− 2k1+1 (14.70) Then, T z is given by T z = 2h0h1z− 2k0+2k1+1 (14.71) which is of the desired form cz−n0 . However, if we look at the magnitude characteristics of the two filters, we see that they have poor cutoff characteristics. The magnitude of the low-pass filter is given by H1 ej 2 = h2 0 +h2 1 +2h0h1 cos 2k0 −2k1 −1 (14.72) and the high-pass filter is given by H2 ej 2 = h2 0 +h2 1 −2h0h1 cos 2k0 −2k1 −1 (14.73) For h0 = h1 = k0 = k1 = 1, the magnitude responses are plotted in Figure 14.18. Notice the poor cutoff characteristics of these two filters. Thus, for perfect reconstruction with no aliasing and no amplitude or phase distortion, the mirror condition does not seem like such a good idea. However, if we slightly relax these rather strict conditions, we can obtain some very nice designs. For example, instead of attempting to eliminate all phase and amplitude distortion, we could elect to eliminate only the phase distortion and minimize the amplitude distortion. We can optimize the coefficients of H1 z such that T ej is made as close to a constant as possible, while minimizing the stopband energy of H1 z in order to have a good low-pass characteristic. Such an optimization has been suggested by Johnston [198] and Jain and Crochiere [202]. They construct the objective function J = s H1 ej 2 d + 1− 0 1− T ej 2 d (14.74) which has to be minimized to obtain H1 z and T1 z , where s is the cutoff frequency of the filter.
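Before looking at the magnitude plots, a short Python sketch may make the trade-off concrete. It builds the two-tap pair of Equation (14.70) with our own choice of parameters (k0 = k1 = 0 and h0 = h1 = 1/sqrt(2)), forms the synthesis filters from Equations (14.61)-(14.62), and confirms that aliasing and distortion vanish while the low-pass response of Equation (14.72) rolls off only as a raised cosine. The helper flip_odd is ours.

    import numpy as np

    h0 = h1 = 1 / np.sqrt(2)
    H1 = np.array([h0,  h1])         # two-tap low-pass, Equation (14.70)
    H2 = np.array([h0, -h1])         # mirror condition (14.68): H2(z) = H1(-z)

    def flip_odd(h):                 # coefficients of H(-z)
        g = h.copy(); g[1::2] *= -1.0
        return g

    K1 = flip_odd(H2)                # Equation (14.61)
    K2 = -flip_odd(H1)               # Equation (14.62)

    S = 0.5 * (np.convolve(flip_odd(H1), K1) + np.convolve(flip_odd(H2), K2))
    T = 0.5 * (np.convolve(H1, K1) + np.convolve(H2, K2))
    print("S(z) coefficients:", S)   # all zero: aliasing cancelled
    print("T(z) coefficients:", T)   # single term 2*h0*h1*z^-1, as in (14.71)

    # The price paid: a very gentle low-pass rolloff, Equation (14.72).
    w = np.linspace(0, np.pi, 5)
    print("|H1|^2 at w = 0..pi:", np.round(h0**2 + h1**2 + 2*h0*h1*np.cos(w), 3))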
  • 472. 14.6 Perfect Reconstruction Using Two-Channel Filter Banks 449 −30 10 5 0 −5 −10 −15 −20 −25 0 0.5 1 1.5 2 2.5 3 Frequency Magnitude (dB) F I G U R E 14. 18 Magnitude characteristics of the two-tap PR filters. We can also go the other way and eliminate the amplitude distortion, then attempt to minimize the phase distortion. A review of these approaches can be found in [201, 200]. 14 . 6 . 2 P o w e r S y m m e t r i c F I R F i l t e r s Another approach, independently discovered by Smith and Barnwell [199] and Mintzer [203], can be used to design a two-channel filter bank in which aliasing, amplitude distortion, and phase distortion can be completely eliminated. As discussed earlier, choosing K1 z = −H2 −z K2 z = H1 −z (14.75) eliminates aliasing. This leaves us with T z = 1 2 H1 −z H2 z −H1 z H2 −z In the approach due to Smith and Barnwell [199] and Mintzer [203], with N an odd integer, we select H2 z = z−N H1 −z−1 (14.76)
  • 473. 450 14 S U B B A N D C O D I N G so that T z = 1 2 z−N H1 z H1 z−1 +H1 −z H1 −z−1 (14.77) Therefore, the perfect reconstruction requirement reduces to finding a prototype low-pass filter H z = H1 z such that Q z = H z H z−1 +H −z H −z−1 = constant (14.78) Defining R z = H z H z−1 (14.79) the perfect reconstruction requirement becomes Q z = R z +R −z = constant (14.80) But R z is simply the Z-transform of the autocorrelation sequence of h n . The auto- correlation sequence n is given by n = N k=0 hkhk+n (14.81) The Z-transform of n is given by R z = n = N k=0 hkhk+n (14.82) We can express the sum N k=0 hkhk+n as a convolution: hn ⊗h−n = N k=0 hkhk+n (14.83) Using the fact that the Z-transform of a convolution of two sequences is the product of the Z-transforms of the individual sequences, we obtain R z = hn h−n = H z H z−1 (14.84) Writing out R z as the Z-transform of the sequence n we obtain R z = N zN + N −1 zN−1 +···+ 0 +···+ N −1 z−N−1 + N z−N (14.85) Then R −z is R −z = − N zN + N −1 zN−1 −···+ 0 −···+ N −1 z−N−1 − N z−N (14.86) Adding R z and R −z , we obtain Q z as Q z = 2 N −1 zN−1 +2 N −1 zN−3 +···+ 0 +···+2 N −1 z−N−1 (14.87)
  • 474. 14.7 M-Band QMF Filter Banks 451 Notice that the terms containing the odd powers of z got canceled out. Thus, for Q z to be a constant all we need is that for even values of the lag n (except for n = 0), n be zero. In other words 2n = N k=0 hkhk+2n = 0 n = 0 (14.88) Writing this requirement in terms of the impulse response: N k=0 hkhk+2n = 0 n = 0 0 n = 0 (14.89) If we now normalize the impulse response, N k=0 hk 2 = 1 (14.90) we obtain the perfect reconstruction requirement N k=0 hkhk+2n = n (14.91) In other words, for perfect reconstruction, the impulse response of the prototype filter is orthogonal to the twice-shifted version of itself. 14 . 7 M -Band QMF Filter Banks We have looked at how we can decompose an input signal into two bands. In many applications it is necessary to divide the input into multiple bands. We can do this by using a recursive two-band splitting as shown in Figure 14.19, or we can obtain banks of filters that directly split the input into multiple bands. Given that we have good filters that provide two-band splitting, it would seem that using a recursive splitting, as shown in Figure 14.19, would be an efficient way of obtaining an M-band split. Unfortunately, even when the spectral characteristics of the filters used for the two-band split are quite good, when we employ them in the tree structure shown in Figure 14.19, the spectral characteristics may not be very good. For example, consider the four-tap filter with filter coefficients shown in Table 14.6. In Figure 14.20 we show what happens to the spectral characteristics when we look at the two-band split (at point A in Figure 14.19), the four-band split (at point B in Figure 14.19), and the eight-band split (at point C in Figure 14.19). For a two-band split the magnitude characteristic is flat, with some aliasing. When we employ these same filters to obtain a four-band split from the two-band split, there is an increase in the aliasing. When we go one step further to obtain an eight-band split, the magnitude characteristics deteriorate substantially, as evidenced by Figure 14.20. The various bands are no longer clearly distinct. There is significant overlap between the bands, and hence there will be a significant amount of aliasing in each band. In order to see why there is an increase in distortion, let us follow the top branch of the tree. The path followed by the signal is shown in Figure 14.21a. As we will show later
FIGURE 14.19 Decomposition of an input sequence into multiple bands by recursively using a two-band split. (Each stage splits its input with a low-pass/high-pass pair; the outputs of the first, second, and third stages are the points labeled A, B, and C.)

TABLE 14.6 Coefficients for the four-tap Daubechies low-pass filter.

h0    0.4829629131445341
h1    0.8365163037378079
h2    0.2241438680420134
h3   -0.1294095225512604

(Section 14.8), the three filters and downsamplers can be replaced by a single filter and downsampler as shown in Figure 14.21b, where

A(z) = H_L(z) H_L(z^2) H_L(z^4)   (14.92)

If H_L(z) corresponds to a four-tap filter, then A(z) corresponds to a filter of degree 3 + 6 + 12 = 21, that is, a 22-tap filter! However, this is a severely constrained filter, because it was generated using only four coefficients.
FIGURE 14.20 Spectral characteristics at points A, B, and C (the two-band, four-band, and eight-band splits).

FIGURE 14.21 Equivalent structures for recursive filtering using a two-band split: (a) the top branch as a cascade of three filters H_L(z), each followed by downsampling by two; (b) a single equivalent filter A(z) followed by downsampling by eight.
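Both claims about this prototype are easy to check numerically. The Python sketch below (ours; upsample_taps is a helper we introduce here) verifies that the four-tap Daubechies filter of Table 14.6 satisfies the orthogonality condition (14.91), and then builds the single equivalent filter A(z) of Figure 14.21b by zero-stuffing and convolving.

    import numpy as np

    hL = np.array([0.4829629131445341, 0.8365163037378079,
                   0.2241438680420134, -0.1294095225512604])

    # Perfect reconstruction condition (14.91): unit norm and zero correlation
    # at even lags.
    for lag in (0, 2):
        val = float(np.sum(hL[:len(hL) - lag] * hL[lag:]))
        print(f"sum_k h[k] h[k+{lag}] = {val:.6f}")     # ~1 and ~0

    # Equivalent filter A(z) = HL(z) HL(z^2) HL(z^4) of Equation (14.92).
    def upsample_taps(h, m):
        out = np.zeros((len(h) - 1) * m + 1)            # insert m-1 zeros between taps
        out[::m] = h
        return out

    A = np.convolve(np.convolve(hL, upsample_taps(hL, 2)), upsample_taps(hL, 4))
    print("taps in A(z):", len(A))                      # 22 for the four-tap prototype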
If we had set out to design a 22-tap filter from scratch, we would have had significantly more freedom in selecting the coefficients. This is a strong motivation for designing filters directly for the M-band case.

An M-band filter bank has two sets of filters that are arranged as shown in Figure 14.7. The input signal x(n) is split into M frequency bands using an analysis bank of M filters, each of bandwidth π/M. The signal in each of these M channels is then downsampled by a factor of L. This constitutes the analysis bank. The subband signals y_k(n) are encoded and transmitted. At the synthesis stage the subband signals are decoded, upsampled by a factor of L by interlacing adjacent samples with L − 1 zeros, and then passed through the synthesis, or interpolation, filters. The outputs of all these synthesis filters are added together to obtain the reconstructed signal. This constitutes the synthesis filter bank. Thus, the analysis and synthesis filter banks together take an input signal x(n) and produce an output signal x̂(n). These filters can be any combination of FIR and IIR filters. Depending on whether the decimation factor L is less than, equal to, or greater than the number of bands M, the filter bank is called an underdecimated, critically (maximally) decimated, or overdecimated filter bank. For most practical applications, maximal decimation, or "critical subsampling," is used.

A detailed study of M-band filters is beyond the scope of this chapter. Suffice it to say that in broad outline much of what we said about two-band filters can be generalized to M-band filters. (For more on this subject, see [200].)

14.8 The Polyphase Decomposition

A major problem with representing the combination of filters and downsamplers is the time-varying nature of the up- and downsamplers. An elegant way around this problem is the polyphase decomposition. To demonstrate the concept, let us first consider the simple case of two-band splitting, beginning with the analysis portion of the system shown in Figure 14.22. Suppose the analysis filter H1(z) is given by

H1(z) = h0 + h1 z^-1 + h2 z^-2 + h3 z^-3 + ...   (14.93)

By grouping the even and odd terms together, we can write this as

H1(z) = (h0 + h2 z^-2 + h4 z^-4 + ...) + z^-1 (h1 + h3 z^-2 + h5 z^-4 + ...)   (14.94)

FIGURE 14.22 Analysis portion of a two-band subband coder. (The input is filtered by H1(z) and by H2(z), and each filter output is downsampled by two.)
  • 478. 14.8 The Polyphase Decomposition 455 Define H10 z = h0 +h2z−1 +h4z−2 +··· (14.95) H11 z = h1 +h3z−1 +h5z−2 +··· (14.96) Then H1 z = H10 z2 + z−1 H11 z2 . Similarly, we can decompose the filter H2 z into components H20 z and H21 z , and we can represent the system of Figure 14.22 as shown in Figure 14.23. The filters H10 z , H11 z and H20 z H21 z are called the polyphase components of H1 z and H2 z . Let’s take the inverse Z-transform of the polyphase components of H1 z : h10 n = h2n n = 01 (14.97) h11 n = h2n+1 n = 01 (14.98) Thus, h10 n and h11 n are simply the impulse response hn downsampled by two. Consider the output of the downsampler for a given input X z . The input to the downsampler is X z H1 z ; thus, the output from Equation (14.35) is Y1 z = 1 2 X z 1 2 H1 z 1 2 + 1 2 X −z 1 2 H1 −z 1 2 (14.99) 2 H10(z) H11(z) z–1 2 H20(z) H21(z) z–1 F I G U R E 14. 23 Alternative representation of the analysis portion of a two-band subband coder.
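The polyphase split is easy to verify numerically. The sketch below (ours, with an arbitrary made-up six-tap filter) confirms that filtering with H1(z) and then downsampling produces exactly the same samples as downsampling first and filtering with the short components H10(z) and H11(z), which is the structure of Figure 14.24.

    import numpy as np

    h1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    h10, h11 = h1[0::2], h1[1::2]      # polyphase components, (14.95)-(14.96)

    x = np.random.randn(64)

    # Direct form: filter with H1(z), then keep every second output sample.
    direct = np.convolve(x, h1)[::2]

    # Polyphase form: downsample first, then filter each branch with its short
    # polyphase component; the second branch sees the input delayed by one sample.
    branch0 = np.convolve(x[::2], h10)
    branch1 = np.convolve(np.concatenate(([0.0], x))[::2], h11)

    m = min(len(direct), len(branch0), len(branch1))
    print(np.allclose(direct[:m], branch0[:m] + branch1[:m]))   # True

Because the filtering now happens at the lower sampling rate, this rearrangement also halves the number of multiplications per input sample, which is one of the practical attractions of the polyphase form.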
  • 479. 456 14 S U B B A N D C O D I N G Replacing H1 z with its polyphase representation, we get Y1 z = 1 2 X z 1 2 H10 z +z− 1 2 H11 z + 1 2 X −z 1 2 H10 z −z− 1 2 H11 z (14.100) = H10 z 1 2 X z 1 2 + 1 2 X −z 1 2 +H11 z 1 2 z− 1 2 X z 1 2 − 1 2 z− 1 2 X −z 1 2 (14.101) Note that the first expression in square brackets is the output of a downsampler whose input is X z , while the quantity in the second set of square brackets is the output of a downsampler whose input is z−1 X z . Therefore, we could implement this system as shown in Figure 14.24. Now let us consider the synthesis portion of the two-band system shown in Figure 14.25. As in the case of the analysis portion, we can write the transfer functions in terms of their polyphase representation. Thus, G1 z = G10 z2 +z−1 G11 z2 (14.102) G2 z = G20 z2 +z−1 G21 z2 (14.103) Consider the output of the synthesis filter G1 z given an input Y1 z . From Equation (14.41), the output of the upsampler is U1 z = Y1 z2 (14.104) 2 2 H10(z) H11(z) z–1 2 2 H20(z) H21(z) z–1 F I G U R E 14. 24 Polyphase representation of the analysis portion of a two-band subband coder.
  • 480. 14.8 The Polyphase Decomposition 457 2 G1(z) 2 G2(z) F I G U R E 14. 25 The synthesis portion of a two-band subband coder. and the output of G1 z is V1 z = Y1 z2 G1 z (14.105) = Y1 z2 G10 z2 +z−1 Y1 z2 G11 z2 (14.106) The first term in the equation above is the output of an upsampler that follows a filter with transfer function G10 z with input Y z . Similarly, Y1 z2 G11 z2 is the output of an upsampler that follows a filter with transfer function G11 z with input Y z . Thus, this system can be represented as shown in Figure 14.26. Putting the polyphase representations of the analysis and synthesis portions together, we get the system shown in Figure 14.27. Looking at the portion in the dashed box, we can see that this is a completely linear time-invariant system. G10(z) 2 G11(z) z–1 2 G20(z) 2 G21(z) z–1 2 F I G U R E 14. 26 Polyphase representation of the synthesis portion of a two-band subband coder.
  • 481. 458 14 S U B B A N D C O D I N G G10(z) 2 G11(z) z–1 2 G20(z) 2 G21(z) z–1 2 2 2 H10(z) H11(z) z–1 2 2 H20(z) H21(z) z–1 F I G U R E 14. 27 Polyphase representation of the two-band subband coder. The polyphase representation can be a very useful tool for the design and analysis of filters. While many of its uses are beyond the scope of this chapter, we can use this representation to prove our statement about the two-band perfect reconstruction QMF filters. Recall that we want T z = 1 2 H1 z H2 −z −H1 −z H2 z = cz−n0 If we impose the mirror condition H2 z = H1 −z T z becomes T z = 1 2 H2 1 z −H2 1 −z (14.107) The polyphase decomposition of H1 z is H1 z = H10 z2 +z−1 H11 z2 Substituting this into Equation (14.107) for H1 z and H1 −z = H10 z2 −z−1 H11 z2 for H1 −z , we obtain T z = 2z−1 H10 z2 H11 z2 (14.108) Clearly, the only way T z can have the form cz−n0 is if both H10 z and H11 z are simple delays; that is, H10 z = h0z−k0 (14.109) H11 z = h1z−k1 (14.110)
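Before moving on to bit allocation, here is a round-trip check of perfect reconstruction (a sketch of ours, not part of the text). It uses the four-tap Daubechies filter of Table 14.6 as the prototype of the power symmetric construction of Section 14.6.2 (our choice of prototype), builds H2, K1, and K2 from Equations (14.76) and (14.75) with N = 3, and verifies that the analysis/synthesis chain returns a delayed copy of the input.

    import numpy as np

    h1 = np.array([0.4829629131445341, 0.8365163037378079,
                   0.2241438680420134, -0.1294095225512604])
    h2 = ((-1.0) ** (np.arange(4) + 1)) * h1[::-1]   # H2(z) = z^-3 H1(-z^-1), (14.76)

    def flip_odd(h):
        g = h.copy(); g[1::2] *= -1.0
        return g                                      # coefficients of H(-z)

    k1 = -flip_odd(h2)                                # K1(z) = -H2(-z), (14.75)
    k2 = flip_odd(h1)                                 # K2(z) =  H1(-z)

    x = np.random.randn(32)

    def down(x, h):
        return np.convolve(x, h)[::2]                 # analysis: filter, keep even samples

    def up(y, g):
        u = np.zeros(2 * len(y)); u[::2] = y          # synthesis: upsample by two, filter
        return np.convolve(u, g)

    xhat = up(down(x, h1), k1) + up(down(x, h2), k2)
    print(np.allclose(xhat[3:3 + len(x)], x))         # True: xhat[n] = x[n-3]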
  • 482. 14.9 Bit Allocation 459 This results in T z = 2h0h1z− 2k0+2k1+1 (14.111) which is of the form cz−n0 as desired. The resulting filters have the transfer functions H1 z = h0z−2k0 +h1z− 2k1+1 (14.112) H2 z = h0z−2k0 −h1z− 2k1+1 (14.113) 14 . 9 Bit Allocation Once we have separated the source output into the constituent sequences, we need to decide how much of the coding resource should be used to encode the output of each synthesis filter. In other words, we need to allocate the available bits between the subband sequences. In the previous chapter we described a bit allocation procedure that uses the variances of the transform coefficient. In this section we describe a bit allocation approach that attempts to use as much information about the subbands as possible to distribute the bits. Let’s begin with some notation. We have a total of BT bits that we need to distribute among M subbands. Suppose R corresponds to the average rate in bits per sample for the overall system, and Rk is the average rate for subband k. Let’s begin with the case where the input is decomposed into M equal bands, each of which is decimated by a factor of M. Finally, let’s assume that we know the rate distortion function for each band. (If you recall from Chapter 8, this is a rather strong assumption and we will relax it shortly.) We also assume that the distortion measure is such that the total distortion is the sum of the distortion contribution of each band. We want to find the bit allocation Rk such that R = 1 M M k=1 Rk (14.114) and the reconstruction error is minimized. Each value of Rk corresponds to a point on the rate distortion curve. The question is where on the rate distortion curve for each subband should we operate to minimize the average distortion. There is a trade-off between rate and distortion. If we decrease the rate (that is, move down the rate distortion curve), we will increase the distortion. Similarly, if we want to move to the left on the rate distortion curve and minimize the distortion, we end up increasing the rate. We need a formulation that incorporates both rate and distortion and the trade-off involved. The formulation we use is based on a landmark paper in 1988 by Yaacov Shoham and Allen Gersho [204]. Let’s define a functional Jk: Jk = Dk +Rk (14.115) where Dk is the distortion contribution from the kth subband and is a Lagrangian parameter. This is the quantity we wish to minimize. In this expression the parameter in some sense specifies the trade-off. If we are primarily interested in minimizing the distortion, we can set to a small value. If our primary interest is in minimizing the rate, we keep the value of
  • 483. 460 14 S U B B A N D C O D I N G large. We can show that the values of Dk and Rk that minimize Jk occur where the slope of the rate distortion curve is . Thus, given a value of and the rate distortion function, we can immediately identify the values of Rk and Dk. So what should the value of be, and how should it vary between subbands? Let’s take the second question first. We would like to allocate bits in such a way that any increase in any of the rates will have the same impact on the distortion. This will happen when we pick Rk in such a way that the slopes of the rate distortion functions for the different subbands are the same; that is, we want to use the same for each subband. Let’s see what happens if we do not. Consider the two rate distortion functions shown in Figure 14.28. Suppose the points marked x on the rate distortion functions correspond to the selected rates. Obviously, the slopes, and hence the values of , are different in the two cases. Because of the differences in the slope, an increase by R in the rate R1 will result in a much larger decrease in the distortion than the increase in distortion if we decreased R2 by R. Because the total distortion is the sum of the individual distortions, we can therefore reduce the overall distortions by increasing R1 and decreasing R2. We will be able to keep doing this until the slope corresponding to the rates are the same in both cases. Thus, the answer to our second question is that we want to use the same value of for all the subbands. Given a set of rate distortion functions and a value of , we automatically get a set of rates Rk. We can then compute the average and check if it satisfieies our constraint on the total number of bits we can spend. If it does not, we modify the value of until we get a set of rates that satisfies our rate constraint. However, generally we do not have rate distortion functions available. In these cases we use whatever is available. For some cases we might have operational rate distortion curves available. By “operational” we mean performance curves for particular types of encoders operating on specific types of sources. For example, if we knew we were going to be using pdf-optimized nonuniform quantizers with entropy coding, we could estimate the distribution of the subband and use the performance curve for pdf-optimized nonuniform quantizers for Distortion Rate R1 Distortion Rate R2 F I G U R E 14. 28 Two rate distortion functions.
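A sketch of this constant-slope search is given below, assuming the simple operational model D_k(R) = sigma_k^2 2^(-2R) for every subband (an assumption of ours; the chapter does not prescribe a model) and adjusting the Lagrangian parameter by bisection until the average rate meets the budget. The function name allocate and the variance values are illustrative only.

    import numpy as np

    # For D(R) = var * 2^(-2R), the rate at which the slope of D equals -lam is
    #   R = max(0, 0.5 * log2(2 ln(2) var / lam)).
    # Search for the lam whose allocation meets the average-rate budget.
    def allocate(variances, R_target, iters=60):
        lo, hi = 1e-12, 1e12                   # bracket for lam
        for _ in range(iters):
            lam = np.sqrt(lo * hi)             # geometric bisection
            R = np.maximum(0.0, 0.5 * np.log2(2 * np.log(2) * variances / lam))
            if R.mean() > R_target:
                lo = lam                       # rates too high -> steeper slope needed
            else:
                hi = lam
        return R

    variances = np.array([100.0, 20.0, 4.0, 1.0])   # made-up subband variances
    R = allocate(variances, R_target=2.0)
    print(np.round(R, 2), "average rate =", round(R.mean(), 2))

In practice the closed-form model would be replaced by measured operational rate distortion points, with the slope estimated numerically as described above; the bisection on the common slope stays the same.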
  • 484. 14.10 Application to Speech Coding- —G.722 461 that distribution. We might only have the performance of the particular encoding scheme for a limited number of rates. In this case we need to have some way of obtaining the slope from a few points. We could estimate this numerically from these points. Or we could fit the points to a curve and estimate the slope from the curve. In these cases we might not be able to get exactly the average rate we wanted. Finally, we have been talking about a situation where the number of samples in each subband is exactly the same, and therefore the total rate is simply the sum of the individual rates. If this is not true, we need to weight the rates of the individual subbands. The functional to be minimized becomes J = Dk + kRk (14.116) where k is the weight reflecting the relative length of the sequence generated by the kth filter. The distortion contribution from each subband might not be equally relevant, perhaps because of the filter construction or because of the perceptual weight attached to those frequencies [205]. In these cases we can modify our functional still further to include the unequal weighting of the distortion: J = wkDk + kRk (14.117) 14 . 10 Application to Speech Coding- —G.722 The ITU-T recommendation G.722 provides a technique for wideband coding of speech signals that is based on subband coding. The basic objective of this recommendation is to provide high-quality speech at 64 kbits per second (kbps). The recommendation also contains two other modes that encode the input at 56 and 48 kbps. These two modes are used when an auxiliary channel is needed. The first mode provides for an auxiliary channel of 8 kbps; the second mode, for an auxiliary channel of 16 kbps. The speech output or audio signal is filtered to 7 kHz to prevent aliasing, then sampled at 16,000 samples per second. Notice that the cutoff frequency for the anti-aliasing filter is 7 kHz, not 8 kHz, even though we are sampling at 16,000 samples per second. One reason for this is that the cutoff for the anti-aliasing filter is not going to be sharp like that of the ideal low-pass filter. Therefore, the highest frequency component in the filter output will be greater than 7 kHz. Each sample is encoded using a 14-bit uniform quantizer. This 14-bit input is passed through a bank of two 24-coefficient FIR filters. The coefficients of the low-pass QMF filter are shown in Table 14.7. The coefficients for the high-pass QMF filter can be obtained by the relationship hHPn = −1 n hLPn (14.118) The low-pass filter passes all frequency components in the range of 0 to 4 kHz, while the high-pass filter passes all remaining frequencies. The output of the filters is downsampled by a factor of two. The downsampled sequences are encoded using adaptive differential PCM (ADPCM) systems. The ADPCM system that encodes the downsampled output of the low-frequency filter uses 6 bits per sample, with the option of dropping 1 or 2 least significant bits in order to
TABLE 14.7 Transmit and receive QMF coefficient values.

h0, h23     3.66211 × 10^-4
h1, h22    −1.34277 × 10^-3
h2, h21    −1.34277 × 10^-3
h3, h20     6.46973 × 10^-3
h4, h19     1.46484 × 10^-3
h5, h18    −1.90430 × 10^-2
h6, h17     3.90625 × 10^-3
h7, h16     4.41895 × 10^-2
h8, h15    −2.56348 × 10^-2
h9, h14    −9.82666 × 10^-2
h10, h13    1.16089 × 10^-1
h11, h12    4.73145 × 10^-1

provide room for the auxiliary channel. The output of the high-pass filter is encoded using 2 bits per sample. Because the 2 least significant bits of the quantizer output of the low-pass ADPCM system may be dropped and therefore not be available to the receiver, the adaptation and prediction at both the transmitter and receiver are performed using only the 4 most significant bits of the quantizer output. If all 6 bits are used in the encoding of the low-frequency subband, we end up with a rate of 48 kbps for the low band. Since the high band is encoded at 2 bits per sample, the output rate for the high subband is 16 kbps. Therefore, the total output rate for the subband-ADPCM system is 64 kbps. The quantizer is adapted using a variation of the Jayant algorithm [110]. Both ADPCM systems use the past two reconstructed values and the past six quantizer outputs to predict the next sample, in the same way as the predictor for recommendation G.726 described in Chapter 11. The predictor is adapted in the same manner as the predictor used in the G.726 algorithm.

At the receiver, after being decoded by the ADPCM decoder, each output signal is upsampled by the insertion of a zero after each sample. The upsampled signals are passed through the reconstruction filters. These filters are identical to the filters used for decomposing the signal. The low-pass reconstruction filter coefficients are given in Table 14.7, and the coefficients for the high-pass filter can be obtained using Equation (14.118).
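Equation (14.118) can be applied directly to the values in Table 14.7. The short Python sketch below (ours) assembles the full symmetric 24-tap low-pass filter, derives the high-pass filter, and checks the gains at DC as a sanity test.

    import numpy as np

    # The 12 values of Table 14.7; the 24-tap filter is symmetric, h[n] = h[23-n].
    half = np.array([ 3.66211e-4, -1.34277e-3, -1.34277e-3,  6.46973e-3,
                      1.46484e-3, -1.90430e-2,  3.90625e-3,  4.41895e-2,
                     -2.56348e-2, -9.82666e-2,  1.16089e-1,  4.73145e-1])
    h_lp = np.concatenate((half, half[::-1]))

    # High-pass filter from Equation (14.118): h_HP[n] = (-1)^n h_LP[n]
    h_hp = ((-1.0) ** np.arange(24)) * h_lp

    print("low-pass gain at DC :", round(h_lp.sum(), 4))    # ~1.0
    print("high-pass gain at DC:", round(h_hp.sum(), 4))    # ~0.0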
  • 486. 14.12 Application to Image Compression 463 sampling frequencies are 32,000 samples per second, 44,100 samples per second, and 48,000 samples per second. Details of these coders are provided in Chapter 16. 14 . 12 Application to Image Compression We have discussed how to separate a sequence into its components. However, all the exam- ples we have used are one-dimensional sequences. What do we do when the sequences contain two-dimensional dependencies such as images? The obvious answer is that we need two-dimensional filters that separate the source output into components based on both the hor- izontal and vertical frequencies. Fortunately, in most cases, this two-dimensional filter can be implemented as two one-dimensional filters, which can be applied first in one dimension, then in the other. Filters that have this property are called separable filters. Two-dimensional non- separable filters do exist [206]; however, the gains are offset by the increase in complexity. Generally, for subband coding of images we filter each row of the image separately using a high-pass and low-pass filter. The output of the filters is decimated by a factor of two. Assume that the images were of size N ×N. After this first stage, we will have two images of size N × N 2 . We then filter each column of the two subimages, decimating the outputs of the filters again by a factor of two. This results in four images of size N 2 × N 2 . We can stop at this point or continue the decomposition process with one or more of the four subimages, resulting in 7, 10, 13, or 16 images. Generally, of the four original subimages, only one or two are further decomposed. The reason for not decomposing the other subimages is that many of the pixel values in the high-frequency subimages are close to zero. Thus, there is little reason to spend computational power to decompose these subimages. E x a m p l e 14 . 12 . 1: Let’s take the “image” in Table 14.8 and decompose it using the low-pass and high-pass filters of Example 14.2.1. After filtering each row with the low-pass filter, the output is decimated by a factor of two. Each output from the filter depends on the current input and the past input. For the very first input (that is, the pixels at the left edge of the image), we will assume that the past values of the input were zero. The decimated output of the low-pass and high-pass filters is shown in Table 14.9. We take each of these subimages and filter them column by column using the low-pass and high-pass filters and decimate the outputs by two. In this case, the first input to the filters T A B L E 14 . 8 A sample “image.” 10 14 10 12 14 8 14 12 10 12 8 12 10 6 10 12 12 10 8 6 8 10 12 14 8 6 4 6 4 6 8 10 14 12 10 8 6 4 6 8 12 8 12 10 6 6 6 6 12 10 6 6 6 6 6 6 6 6 6 6 6 6 6 6
  • 487. 464 14 S U B B A N D C O D I N G T A B L E 14 . 9 Filtered and decimated output. Decimated Decimated Low-Pass Output High-Pass Output 5 12 13 11 5 −2 1 3 5 10 11 8 5 −2 −1 2 6 9 7 11 6 −1 1 1 4 5 5 7 4 −1 −1 1 7 11 7 5 7 −1 −1 1 6 10 8 6 6 2 −2 0 6 8 6 6 6 −2 0 0 3 6 6 6 3 0 0 0 T A B L E 14 . 10 Four subimages. Low-Low Image Low-High Image 2.5 6 6.5 5.5 2.5 6 6.5 5.5 5.5 9.5 9 9.5 0.5 −05 −2 1.5 5.5 8 6 6 1.5 3 1 −1 6 9 7 6 0 −1 −1 0 High-Low Image High-High Image 2.5 −1 0.5 1.5 2.5 −1 0.5 1.5 5.5 −15 0 1.5 0.5 0.5 1 −05 5.5 −1 −1 1 1.5 0 0 0 6 0 −1 0 0 −2 1 0 is the top element in each row. We assume that there is a zero row of pixels right above this row in order to provide the filter with “past” values. After filtering and decimation, we get four subimages (Table 14.10). The subimage obtained by low-pass filtering of the columns of the subimage (which was the output of the row low-pass filtering) is called the low-low (LL) image. Similarly, the other images are called the low-high (LH), high-low (HL), and high-high (HH) images. If we look closely at the final set of subimages in the previous example, we notice that there is a difference in the characteristics of the values in the left or top row and the interiors of some of the subimages. For example, in the high-low subimage, the values in the first column are significantly larger than the other values in the subimage. Similarly, in the low- high subimage, the values in the first row are generally very different than the other values in the subimage. The reason for this variance is our assumption that the “past” of the image above the first row and to the left of the column was zero. The difference between zero and the image values was much larger than the normal pixel-to-pixel differences. Therefore, we ended up adding some spurious structure to the image reflected in the subimages. Generally, this is undesirable because it is easier to select appropriate compression schemes when the characteristics of the subimages are as uniform as possible. For example, if we did not have
  • 488. 14.12 Application to Image Compression 465 T A B L E 14 . 11 Alternate four subimages. Low-Low Image Low-High Image 10 12 13 11 0 0 −05 −05 11 95 9 95 1 −05 −2 15 11 8 6 6 3 3 1 −1 12 9 7 6 0 −1 −1 0 High-Low Image High-High Image 0 −2 1 3 0 0 0 0 0 −15 0 15 0 05 1 −05 0 −1 −1 1 0 0 0 0 0 0 −1 0 0 −2 1 0 the relatively large values in the first column of the high-low subimage, we could choose a quantizer with a smaller step size. In this example, this effect was limited to a single row or column because the filters used a single past value. However, most filters use a substantially larger number of past values in the filtering operation, and a larger portion of the subimage is affected. We can avoid this problem by assuming a different “past.” There are a number of ways this can be done. A simple method that works well is to reflect the values of the pixels at the boundary. For example, for the sequence 6 9 5 4 7 2 ···, which was to be filtered with a three-tap filter, we would assume the past as 9 6 6 9 5 4 7 2···. If we use this approach for the image in Example 14.12.1, the four subimages would be as shown in Table 14.11. Notice how much sparser each image is, except for the low-low image. Most of the energy in the original image has been compacted into the low-low image. Since the other subimages have very few values that need to be encoded, we can devote most of our resources to the low-low subimage. 14 . 12 . 1 D e c o m p o s i n g a n I m a g e Earlier a set of filters was provided to be used in one-dimensional subband coding. We can use those same filters to decompose an image into its subbands. E x a m p l e 14 . 12 . 2: Let’s use the eight-tap Johnston filter to decompose the Sinan image into four subbands. The results of the decomposition are shown in Figure 14.29. Notice that, as in the case of the image in Example 14.12.1, most of the signal energy is concentrated in the low-low subimage. However, there remains a substantial amount of energy in the higher bands. To see this more clearly, let’s look at the decomposition using the 16-tap Johnston filter. The results are shown in Figure 14.30. Notice how much less energy there is in the higher
  • 489. 466 14 S U B B A N D C O D I N G F I G U R E 14. 29 Decomposition of Sinan image using the eight-tap Johnston filter. F I G U R E 14. 30 Decomposition of Sinan image using the 16-tap Johnston filter.
subbands. In fact, the high-high subband seems completely empty. As we shall see later, this difference in energy compaction can have a drastic effect on the reconstruction.

FIGURE 14.31 Decomposition of Sinan image using the eight-tap Smith-Barnwell filter.

Increasing the size of the filter is not necessarily the only way of improving the energy compaction. Figure 14.31 shows the decomposition obtained using the eight-tap Smith-Barnwell filter. The results are almost identical to those obtained with the 16-tap Johnston filter. Therefore, rather than increase the computational load by going to a 16-tap filter, we can keep the same computational load and simply use a different filter.

14.12.2 Coding the Subbands

Once we have decomposed an image into subbands, we need to find the best encoding scheme to use with each subband. The coding schemes we have studied to date are scalar quantization, vector quantization, and differential encoding. Let us encode some of the decomposed images from the previous section using two of these schemes, scalar quantization and differential encoding.

Example 14.12.3: In the previous example we noted that the eight-tap Johnston filter did not compact the energy as well as the 16-tap Johnston filter or the eight-tap Smith-Barnwell filter. Let's see how this affects the encoding of the decomposed images.
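For readers who want to reproduce the decompositions used in these examples, the following sketch (ours; the helper names analyze_1d and analyze_2d are our own) performs one level of the separable row/column split with the two-tap filters of Example 14.2.1. With reflect=True it uses the reflected "past" and regenerates the subimages of Table 14.11; with reflect=False it regenerates those of Tables 14.9 and 14.10.

    import numpy as np

    def analyze_1d(x, reflect):
        past = x[0] if reflect else 0.0
        xp = np.concatenate(([past], x))            # prepend the assumed past value
        lp = 0.5 * (xp[1:] + xp[:-1])               # low-pass:  (x[n] + x[n-1]) / 2
        hp = 0.5 * (xp[1:] - xp[:-1])               # high-pass: (x[n] - x[n-1]) / 2
        return lp[::2], hp[::2]                     # decimate by two

    def analyze_2d(img, reflect=False):
        L = np.array([analyze_1d(r, reflect)[0] for r in img])   # row low-pass
        H = np.array([analyze_1d(r, reflect)[1] for r in img])   # row high-pass
        def cols(sub):
            lo = np.array([analyze_1d(c, reflect)[0] for c in sub.T]).T
            hi = np.array([analyze_1d(c, reflect)[1] for c in sub.T]).T
            return lo, hi
        (LL, LH), (HL, HH) = cols(L), cols(H)
        return LL, LH, HL, HH

    img = np.array([[10,14,10,12,14, 8,14,12],
                    [10,12, 8,12,10, 6,10,12],
                    [12,10, 8, 6, 8,10,12,14],
                    [ 8, 6, 4, 6, 4, 6, 8,10],
                    [14,12,10, 8, 6, 4, 6, 8],
                    [12, 8,12,10, 6, 6, 6, 6],
                    [12,10, 6, 6, 6, 6, 6, 6],
                    [ 6, 6, 6, 6, 6, 6, 6, 6]], dtype=float)

    for name, sub in zip(("LL", "LH", "HL", "HH"), analyze_2d(img, reflect=True)):
        print(name)
        print(sub)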
  • 491. 468 14 S U B B A N D C O D I N G When we encode these images at an average rate of 0.5 bits per pixel, there are 4×05 = 2 bits available to encode four values, one value from each of the four subbands. If we use the recursive bit allocation procedure on the eight-tap Johnston filter outputs, we end up allocating 1 bit to the low-low band and 1 bit to the high-low band. As the pixel-to-pixel difference in the low-low band is quite small, we use a DPCM encoder for the low-low band. The high-low band does not show this behavior, which means we can simply use scalar quantization for the high-low band. As there are no bits available to encode the other two bands, these bands can be discarded. This results in the image shown in Figure 14.32, which is far from pleasing. However, if we use the same compression approach with the image decomposed using the eight-tap Smith-Barnwell filter, the result is Figure 14.33, which is much more pleasing. F I G U R E 14. 32 Sinan image coded at 0.5 bits per pixel using the eight-tap Johnston filter. To understand why we get such different results from using the two filters, we need to look at the way the bits were allocated to the different bands. In this implementation, we used the recursive bit allocation algorithm. In the image decomposed using the Johnston filter, there was significant energy in the high-low band. The algorithm allocated 1 bit to the low-low band and 1 bit to the high-low band. This resulted in poor encoding for both, and subsequently poor reconstruction. There was very little signal content in any of the bands other than the low-low band for the image decomposed using the Smith-Barnwell filter. Therefore, the bit allocation algorithm assigned both bits to the low-low band, which provided a reasonable reconstruction.
  • 492. 14.12 Application to Image Compression 469 F I G U R E 14. 33 Sinan image coded at 0.5 bits per pixel using the eight-tap Smith-Barnwell filter. If the problem with the encoding of the image decomposed by the Johnston filter is an insufficient number of bits for encoding the low-low band, why not simply assign both bits to the low-low band? The problem is that the bit allocation scheme assigned a bit to the high-low band because there was a significant amount of information in that band. If both bits were assigned to the low-low band, we would have no bits left for use in encoding the high-low band, and we would end up throwing away information necessary for the reconstruction. The issue of energy compaction becomes a very important factor in reconstruction quality. Filters that allow for more energy compaction permit the allocation of bits to a smaller number of subbands. This in turn results in a better reconstruction. The coding schemes used in this example were DPCM and scalar quantization, the techniques generally preferred in subband coding. The advantage provided by subband coding is readily apparent if we compare the result shown in Figure 14.33 to results in the previous chapters where we used either DPCM or scalar quantization without prior decomposition. It would appear that the subband approach lends itself naturally to vector quantization. After decomposing an image into subbands, we could design separate codebooks for each subband to reflect the characteristics of that particular subband. The only problem with this idea is that the low-low subband generally requires a large number of bits per pixel. As we mentioned in Chapter 10, it is generally not feasible to operate the nonstructured vector quantizers at high rates. Therefore, when vector quantizers are used, they are generally
  • 493. 470 14 S U B B A N D C O D I N G used only for encoding the higher frequency bands. This may change as vector quantization algorithms that operate at higher rates are developed. 14 . 13 Summary In this chapter we introduced another approach to the decomposition of signals. In subband coding we decompose the source output into components. Each of these components can then be encoded using one of the techniques described in the previous chapters. The general subband encoding procedure can be summarized as follows: Select a set of filters for decomposing the source. We have provided a number of filters in this chapter. Many more filters can be obtained from the published literature (we give some references below). Using the filters, obtain the subband signals ykn: ykn = N−1 i=0 hkixn−i (14.119) where hkn are the coefficients of the kth filter. Decimate the output of the filters. Encode the decimated output. The decoding procedure is the inverse of the encoding procedure. When encoding images the filtering and decimation operations have to be performed twice, once along the rows and once along the columns. Care should be taken to avoid problems at edges, as described in Section 14.12. F u r t h e r R e a d i n g 1. Handbook for Digital Signal Processing, edited by S.K. Mitra and J.F. Kaiser [162], is an excellent source of information about digital filters. 2. Multirate Systems and Filter Banks, by P.P. Vaidyanathan [200], provides detailed information on QMF filters, as well as the relationship between wavelets and filter banks and much more. 3. The topic of subband coding is also covered in Digital Coding of Waveforms, by N.S. Jayant and P. Noll [123]. 4. The MPEG-1 audio coding algorithm is described in “ISO-MPEG-1 Audio: A Generic Standard for Coding of High-Quality Digital Audio,” by K. Brandenburg and G. Stoll [28], in the October 1994 issue of the Journal of the Audio Engineering Society. 5. A review of the rate distortion method of bit allocation is provided in “Rate Distortion Methods for Image and Video Compression,” by A. Ortega and K. Ramachandran, in the November 1998 issue of IEEE Signal Processing Magazine [169].
  • 494. 14.14 Projects and Problems 471 14 . 14 Projects and Problems 1. A linear shift invariant system has the following properties: If for a given input sequence xn the output of the system is the sequence yn, then if we delay the input sequence by k units to obtain the sequence xn−k, the corresponding output will be the sequence yn delayed by k units. If the output corresponding to the sequence x 1 n is y 1 n , and the output corresponding to the sequence x 2 n is y 2 n , then the output corresponding to the sequence x 1 n +x 2 n is y 1 n +y 2 n . Use these two properties to show the convolution property given in Equation (14.18). 2. Let’s design a set of simple four-tap filters that satisfies the perfect reconstruction condition. (a) We begin with the low-pass filter. Assume that the impulse response of the filter is given by h1kk=3 k=0. Further assume that h1k = h1j ∀jk Find a set of values for hij that satisfies Equation (14.91). (b) Plot the magnitude of the transfer function H1 z . (c) Using Equation (14.23), find the high-pass filter coefficients h2k. (d) Find the magnitude of the transfer function H2 z . 3. Given an input sequence xn = −1 n n = 012 0 otherwise (a) Find the output sequence yn if the filter impulse response is hn = 1 √ 2 n = 01 0 otherwise. (b) Find the output sequence wn if the impulse response of the filter is hn = ⎧ ⎪ ⎨ ⎪ ⎩ 1 √ 2 n = 0 − 1 √ 2 n = 1 0 otherwise. (c) Looking at the sequences yn and wn, what can you say about the sequence xn? 4. Given an input sequence xn = 1 n = 012 0 otherwise
  • 495. 472 14 S U B B A N D C O D I N G (a) Find the output sequence yn if the filter impulse response is hn = 1 √ 2 n = 01 0 otherwise. (b) Find the output sequence wn if the impulse response of the filter is hn = ⎧ ⎪ ⎨ ⎪ ⎩ 1 √ 2 n = 0 − 1 √ 2 n = 1 0 otherwise. (c) Looking at the sequences yn and wn, what can you say about the sequence xn? 5. Write a program to perform the analysis and downsampling operations and another to perform the upsampling and synthesis operations for an image compression application. The programs should read the filter parameters from a file. The synthesis program should read the output of the analysis program and write out the reconstructed images. The analysis program should also write out the subimages scaled so that they can be displayed. Test your program using the Johnston eight-tap filter and the Sena image. 6. In this problem we look at some of the many ways we can encode the subimages obtained after subsampling. Use the eight-tap Johnston filter to decompose the Sena image into four subimages. (a) Encode the low-low band using an adptive delta modulator (CFDM or CVSD). Encode all other bands using a 1-bit scalar quantizer. (b) Encode the low-low band using a 2-bit adaptive DPCM system. Encode the low-high and high-low bands using a 1-bit scalar quantizer. (c) Encode the low-low band using a 3-bit adaptive DPCM system. Encode the low-high and high-low band using a 0.5 bit/pixel vector quantizer. (d) Compare the reconstructions obtained using the different schemes.
  • 496. 15 Wavelet-Based Compression 15 . 1 Overview I n this chapter we introduce the concept of wavelets and describe how to use wavelet-based decompositions in compression schemes. We begin with an introduction to wavelets and multiresolution analysis and then describe how we can implement a wavelet decomposition using filters. We then examine the implementations of several wavelet-based compression schemes. 15 . 2 Introduction In the previous two chapters we looked at a number of ways to decompose a signal. In this chapter we look at another approach to decompose a signal that has become increasingly popular in recent years: the use of wavelets. Wavelets are being used in a number of different applications. Depending on the application, different aspects of wavelets can be emphasized. As our particular application is compression, we will emphasize those aspects of wavelets that are important in the design of compression algorithms. You should be aware that there is much more to wavelets than is presented in this chapter. At the end of the chapter we suggest options if you want to delve more deeply into this subject. The practical implementation of wavelet compression schemes is very similar to that of subband coding schemes. As in the case of subband coding, we decompose the signal (analysis) using filter banks. The outputs of the filter banks are downsampled, quantized, and encoded. The decoder decodes the coded representations, upsamples, and recomposes the signal using a synthesis filter bank. In the next several sections we will briefly examine the construction of wavelets and describe how we can obtain a decomposition of a signal using multiresolution analysis. We will then describe some of the currently popular schemes for image compression. If you are
  • 497. 474 15 W A V E L E T - B A S E D C O M P R E S S I O N primarily interested at this time in implementation of wavelet-based compression schemes, you should skip the next few sections and go directly to Section 15.5. In the last two chapters we have described several ways of decomposing signals. Why do we need another one? To answer this question, let’s begin with our standard tool for analysis, the Fourier transform. Given a function ft, we can find the Fourier transform F as F = − ftejt dt Integration is an averaging operation; therefore, the analysis we obtain, using the Fourier transform, is in some sense an “average” analysis, where the averaging interval is all of time. Thus, by looking at a particular Fourier transform, we can say, for example, that there is a large component of frequency 10 kHz in a signal, but we cannot tell when in time this component occurred. Another way of saying this is that Fourier analysis provides excellent localization in frequency and none in time. The converse is true for the time function ft, which provides exact information about the value of the function at each instant of time but does not directly provide spectral information. It should be noted that both ft and F represent the same function, and all the information is present in each representation. However, each representation makes different kinds of information easily accessible. If we have a very nonstationary signal, like the one shown in Figure 15.1, we would like to know not only the frequency components but when in time the particular frequency components occurred. One way to obtain this information is via the short-term Fourier transform (STFT). With the STFT, we break the time signal ft into pieces of length T and apply Fourier analysis to each piece. This way we can say, for example, that a component at 10 kHz occurred in the third piece—that is, between time 2T and time 3T. Thus, we obtain an analysis that is a function of both time and frequency. If we simply chopped the function into pieces, we could get distortion in the form of boundary effects (see Problem 1). In order to reduce the boundary effects, we window each piece before we take the Fourier transform. If the window shape is given by gt, the STFT is formally given by F = − ftg∗ t −ejt dt (15.1) If the window function gt is a Gaussian, the STFT is called the Gabor transform. t0 2t0 F I G U R E 15. 1 A nonstationary signal.
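As a small illustration of the idea (our own sketch, using plain FFTs rather than any particular STFT routine), the signal below has a low frequency in its first half and a high frequency in its second half. Windowing with a Gaussian and transforming each windowed piece reveals which frequency is present when, something the ordinary Fourier transform of the whole signal cannot tell us. The sampling rate, window length, and test frequencies are arbitrary choices of ours.

    import numpy as np

    fs = 1000                                    # samples per second
    t = np.arange(0, 1.0, 1/fs)
    x = np.where(t < 0.5, np.sin(2*np.pi*20*t), np.sin(2*np.pi*200*t))

    win_len = 100
    g = np.exp(-0.5 * ((np.arange(win_len) - win_len/2) / (win_len/6))**2)  # Gaussian window

    for start in (100, 700):                     # one piece from each half of the signal
        piece = x[start:start + win_len] * g
        peak_bin = np.argmax(np.abs(np.fft.rfft(piece)))
        print(f"window at t = {start/fs:.1f} s -> dominant component near "
              f"{peak_bin * fs / win_len:.0f} Hz")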
  • 498. 15.2 Introduction 475 The problem with the STFT is the fixed window size. Consider Figure 15.1. In order to obtain the low-pass component at the beginning of the function, the window size should be at least t0 so that the window will contain at least one cycle of the low-frequency component. However, a window size of t0 or greater means that we will not be able to accurately localize the high-frequency spurt. A large window in the time domain corresponds to a narrow filter in the frequency domain, which is what we want for the low-frequency components—and what we do not want for the high-frequency components. This dilemma is formalized in the uncertainty principle, which states that for a given window gt, the product of the time spread 2 t and the frequency spread 2 is lower bounded by √ 1/2, where 2 t = t2 gt2 dt gt2 dt (15.2) 2 = 2 G2 d G2 d (15.3) Thus, if we wish to have finer resolution in time, that is, reduce 2 t , we end up with an increase in 2 , or a lower resolution in the frequency domain. How do we get around this problem? Let’s take a look at the discrete STFT in terms of basis expansion, and for the moment, let’s look at just one interval: Fm0 = − ftg∗ te−jm0t dt (15.4) The basis functions are gt, gtejot , gtej2ot , and so on. The first three basis functions are shown in Figure 15.2. We can see that we have a window with constant size, and within this window, we have sinusoids with an increasing number of cycles. Let’s conjure up a different set of functions in which the number of cycles is constant, but the size of the window keeps changing, as shown in Figure 15.3. Notice that although the number of F I G U R E 15. 2 The first three STFT basis functions for the first time interval. F I G U R E 15. 3 Three wavelet basis functions.
  • 499. 476 15 W A V E L E T - B A S E D C O M P R E S S I O N cycles of the sinusoid in each window is the same, as the size of the window gets smaller, these cycles occur in a smaller time interval; that is, the frequency of the sinusoid increases. Furthermore, the lower frequency functions cover a longer time interval, while the higher frequency functions cover a shorter time interval, thus avoiding the problem that we had with the STFT. If we can write our function in terms of these functions and their translates, we have a representation that gives us time and frequency localization and can provide high frequency resolution at low frequencies (longer time window) and high time resolution at high frequencies (shorter time window). This, crudely speaking, is the basic idea behind wavelets. In the following section we will formalize the concept of wavelets. Then we will discuss how to get from a wavelet basis set to an implementation. If you wish to move directly to implementation issues, you should skip to Section 15.5. 15 . 3 Wavelets In the example at the end of the previous section, we started out with a single function. All other functions were obtained by changing the size of the function or scaling and translating this single function. This function is called the mother wavelet. Mathematically, we can scale a function ft by replacing t with t/a, where the parameter a governs the amount of scaling. For example, consider the function ft = cos t −1 ≤ t ≤ 1 0 otherwise. We have plotted this function in Figure 15.4. To scale this function by 0.5, we replace t by t/05: f t 05 = cos t 05 −1 ≤ t 05 ≤ 1 0 otherwise = cos2 t −1 2 ≤ t ≤ 1 2 0 otherwise. We have plotted the scaled function in Figure 15.5. If we define the norm of a function ft by ft2 = − f2 tdt scaling obviously changes the norm of the function: f t a 2 = − f2 t a dt = a − f2 xdx
  • 500. 15.3 Wavelets 477 −1.5 1 0.5 0 −0.5 −1 1.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 t f(t) F I G U R E 15. 4 A function f(t). −1.5 1 0.5 0 −0.5 −1 1.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 t f(t/0.5) F I G U R E 15. 5 The function f( t 05 ).
  • 501. 478 15 W A V E L E T - B A S E D C O M P R E S S I O N where we have used the substitution x = t/a. Thus, f t a 2 = aft2 If we want the scaled function to have the same norm as the original function, we need to multiply it by 1/ √ a. Mathematically, we can represent the translation of a function to the right or left by an amount b by replacing t by t −b or t +b. For example, if we want to translate the scaled function shown in Figure 15.5 by one, we have f t −1 05 = cos2 t −1 −1 2 ≤ t −1 ≤ 1 2 0 otherwise = cos2 t −1 1 2 ≤ t ≤ 3 2 0 otherwise. The scaled and translated function is shown in Figure 15.6. Thus, given a mother wavelet t, the remaining functions are obtained as abt = 1 √ a t −b a (15.5) with Fourier transforms = t ab = abt (15.6) −1.5 1 0.5 0 −0.5 −1 1.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 t f((t–1)/0.5) F I G U R E 15. 6 A scaled and translated function.
  • 502. 15.3 Wavelets 479 Our expansion using coefficients with respect to these functions is obtained from the inner product of ft with the wavelet functions: wab = abtft = − abtftdt (15.7) We can recover the function ft from the wab by ft = 1 C − − wab abt dadb a2 (15.8) where C = 0 2 d (15.9) For integral (15.8) to exist, we need C to be finite. For C to be finite, we need 0 = 0. Otherwise, we have a singularity in the integrand of (15.9). Note that 0 is the average value of t; therefore, a requirement on the mother wavelet is that it have zero mean. The condition that C be finite is often called the admissibility condition. We would also like the wavelets to have finite energy; that is, we want the wavelets to belong to the vector space L2 (see Example 12.3.1). Using Parseval’s relationship, we can write this requirement as − 2 d For this to happen, 2 has to decay as goes to infinity. These requirements mean that the energy in is concentrated in a narrow frequency band, which gives the wavelet its frequency localization capability. If a and b are continuous, then wab is called the continuous wavelet transform (CWT). Just as with other transforms, we will be more interested in the discrete version of this transform. We first obtain a series representation where the basis functions are continuous functions of time with discrete scaling and translating parameters a and b. The discrete versions of the scaling and translating parameters have to be related to each other because if the scale is such that the basis functions are narrow, the translation step should be correspondingly small and vice versa. There are a number of ways we can choose these parameters. The most popular approach is to select a and b according to a = a−m 0 b = nb0a−m 0 (15.10) where m and n are integers, a0 is selected to be 2, and b0 has a value of 1. This gives us the wavelet set mnt = am/2 0 am 0 t −nb0 mn ∈ Z (15.11) For a0 = 2 and b0 = 1, we have mnt = 2m/2 2m t −n (15.12)
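Returning for a moment to the dyadic wavelets of Equation (15.12), here is a quick numerical check (ours) that the Haar wavelets of Equation (15.16) form an orthonormal set, approximating the inner-product integrals by sums on a fine grid. The index pairs tested are an arbitrary sample.

    import numpy as np

    def haar_wavelet(t):
        return np.where((t >= 0) & (t < 0.5), 1.0,
                        np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

    def psi(m, n, t):
        return 2.0 ** (m / 2) * haar_wavelet(2.0 ** m * t - n)   # Equation (15.12)

    dt = 1e-5
    t = np.arange(-1.0, 3.0, dt)
    cases = [(0, 0), (0, 1), (1, 0), (2, 3)]
    for a in cases:
        for b in cases:
            ip = np.sum(psi(*a, t) * psi(*b, t)) * dt
            print(f"<psi_{a}, psi_{b}> = {ip: .3f}")
    # The inner product is ~1 when the index pairs match and ~0 otherwise.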
  • 503. 480 15 W A V E L E T - B A S E D C O M P R E S S I O N (Note that these are the most commonly used choices, but they are not the only choices.) If this set is complete, then mnt are called affine wavelets. The wavelet coefficients are given by wmn = ft mnt (15.13) = am/2 0 ft am 0 t −nb0dt (15.14) The function ft can be reconstructed from the wavelet coefficients by ft = m n wmn mnt (15.15) Wavelets come in many shapes. We will look at some of the more popular ones later in this chapter. One of the simplest wavelets is the Haar wavelet, which we will use to explore the various aspects of wavelets. The Haar wavelet is given by t = 1 0 ≤ t 1 2 −1 1 2 ≤ t 1 (15.16) By translating and scaling this mother wavelet, we can synthesize a variety of functions. This version of the transform, where ft is a continuous function while the transform consists of discrete values, is a wavelet series analogous to the Fourier series. It is also called the discrete time wavelet transform (DTWT). We have moved from the continuous wavelet transform, where both the time function ft and its transform wab were continuous functions of their arguments, to the wavelet series, where the time function is continuous but the time-scale wavelet representation is discrete. Given that in data compression we are generally dealing with sampled functions that are discrete in time, we would like both the time and frequency representations to be discrete. This is called the discrete wavelet transform (DWT). However, before we get to that, let’s look into one additional concept— multiresolution analysis. 15 . 4 Multiresolution Analysis and the Scaling Function The idea behind multiresolution analysis is fairly simple. Let’s define a function t that we call a scaling function. We will later see that the scaling function is closely related to the mother wavelet. By taking linear combinations of the scaling function and its translates we can generate a large number of functions ft = k akt −k (15.17) The scaling function has the property that a function that can be represented by the scaling function can also be represented by the dilated versions of the scaling function.
  • 504. 15.4 Multiresolution Analysis and the Scaling Function 481 For example, one of the simplest scaling functions is the Haar scaling function: t = 1 0 ≤ t 1 0 otherwise. (15.18) Then ft can be any piecewise continuous function that is constant in the interval kk+1 for all k. Let’s define kt = t −k (15.19) The set of all functions that can be obtained using a linear combination of the set kt ft = k akkt (15.20) is called the span of the set kt, or Spankt. If we now add all functions that are limits of sequences of functions in Spankt, this is referred to as the closure of Spankt and denoted by Spankt. Let’s call this set V0. If we want to generate functions at a higher resolution, say, functions that are required to be constant over only half a unit interval, we can use a dilated version of the “mother” scaling function. In fact, we can obtain scaling functions at different resolutions in a manner similar to the procedure used for wavelets: jkt = 2j/2 2j t −k (15.21) The indexing scheme is the same as that used for wavelets, with the first index referring to the resolution while the second index denotes the translation. For the Haar example, 10t = √ 2 0 ≤ t 1 2 0 otherwise. (15.22) We can use translates of 10t to represent all functions that are constant over intervals k/2k+1/2 for all k. Notice that in general any function that can be represented by the translates of t can also be represented by a linear combination of translates of 10t. The converse, however, is not true. Defining V1 = Span1kt (15.23) we can see that V0 ⊂ V1. Similarly, we can show that V1 ⊂ V2, and so on. E x a m p l e 15 . 4 . 1: Consider the function shown in Figure 15.7. We can approximate this function using trans- lates of the Haar scaling function t. The approximation is shown in Figure 15.8a. If we call this approximation 0 f t, then 0 f t = k c0kkt (15.24)
FIGURE 15.7 A sample function.

FIGURE 15.8 Approximations of the function shown in Figure 15.7: (a) the scaling-function approximation at resolution 0, (b) at resolution 1, (c) at resolution 2.
  • 506. 15.4 Multiresolution Analysis and the Scaling Function 483 where c0k = k+1 k ftktdt (15.25) We can obtain a more refined approximation, or an approximation at a higher resolution, 1 f t, shown in Figure 15.8b, if we use the set 1kt: 1 f t = k c1k1kt (15.26) Notice that we need twice as many coefficients at this resolution compared to the previous resolution. The coefficients at the two resolutions are related by c0k = 1 √ 2 c12k +c12k+1 (15.27) Continuing in this manner (Figure 15.8c), we can get higher and higher resolution approxi- mations of ft with m f t = k cmkmkt (15.28) Recall that, according to the Nyquist rule, if the highest frequency component of a signal is at f0 Hz, we need 2f0 samples per second to accurately represent it. Therefore, we could obtain an accurate representation of ft using the set of translates jkt, where 2−j 1 2f0 . As cjk = 2j/2 k+1 2j k 2j ftdt (15.29) by the mean value theorem of calculus, cjk is equal to a sample value of ft in the interval k2−j k + 12−j . Therefore, the function j f t would represent more than 2f0 samples per second of ft. We said earlier that a scaling function has the property that any function that can be represented exactly by an expansion at some resolution j can also be represented by dilations of the scaling function at resolution j +1. In particular, this means that the scaling function itself can be represented by its dilations at a higher resolution: t = k hk1kt (15.30) Substituting 1kt = √ 22t−k, we obtain the multiresolution analysis (MRA) equation: t = k hk √ 22t −k (15.31) This equation will be of great importance to us when we begin looking at ways of imple- menting the wavelet transform.
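The relationship (15.27) between the Haar approximation coefficients at two neighboring resolutions is easy to check numerically using (15.29). In the sketch below, the test function and the integration step are arbitrary choices made for illustration.

```python
import numpy as np

def c(j, k, f, dt=1e-5):
    # c_{j,k} = 2^(j/2) * integral of f over [k * 2^-j, (k+1) * 2^-j), as in (15.29)
    t = np.arange(k * 2.0 ** (-j), (k + 1) * 2.0 ** (-j), dt)
    return 2.0 ** (j / 2.0) * np.sum(f(t)) * dt

f = lambda t: np.sin(2 * np.pi * t) + 2 * t        # arbitrary test function

for k in range(4):
    lhs = c(0, k, f)
    rhs = (c(1, 2 * k, f) + c(1, 2 * k + 1, f)) / np.sqrt(2)
    print(k, lhs, rhs)   # the two columns agree (up to quadrature error), as in (15.27)

# For the Haar scaling function the MRA equation (15.31) with h0 = h1 = 1/sqrt(2)
# reads phi(t) = phi(2t) + phi(2t - 1): the unit box is the sum of two half-width boxes.
```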
  • 507. 484 15 W A V E L E T - B A S E D C O M P R E S S I O N E x a m p l e 15 . 4 . 2: Consider the Haar scaling function. Picking h0 = h1 = 1 √ 2 and hk = 0 for k 1 satisfies the recursion equation. E x a m p l e 15 . 4 . 3: Consider the triangle scaling function shown in Figure 15.9. For this function h0 = 1 2 √ 2 h1 = 1 √ 2 h2 = 1 2 √ 2 satisfies the recursion equation. F I G U R E 15. 9 Triangular scaling function. While both the Haar scaling function and the triangle scaling functions are valid scaling functions, there is an important difference between the two. The Haar function is orthogonal to its translates; that is, tt −mdt = m This is obviously not true of the triangle function. In this chapter we will be principally concerned with scaling functions that are orthogonal because they give rise to orthonormal transforms that, as we have previously seen, are very useful in compression. How about the Haar wavelet? Can it be used as a scaling function? Some reflection will show that we cannot obtain the Haar wavelet from a linear combination of its dilated versions. So, where do wavelets come into the picture? Let’s continue with our example using the Haar scaling function. Let us assume for the moment that there is a function gt that can be exactly represented by 1 g t; that is, gt is a function in the set V1. We can decompose
  • 508. 15.4 Multiresolution Analysis and the Scaling Function 485 1 g t into the sum of a lower-resolution version of itself, namely, 0 g t, and the difference 1 g t−0 g t. Let’s examine this difference over an arbitrary unit interval kk+1: 1 g t−0 g t = c0k − √ 2c12k k ≤ t k+ 1 2 c0k − √ 2c12k+1 k+ 1 2 ≤ t k+1 (15.32) Substituting for c0k from (15.27), we obtain 1 g t−0 g t = − 1 √ 2 c12k + 1 √ 2 c12k+1 k ≤ t k+ 1 2 1 √ 2 c12k − 1 √ 2 c12k+1 k+ 1 2 ≤ t k+1 (15.33) Defining d0k = − 1 √ 2 c12k + 1 √ 2 c12k+1 over the arbitrary interval kk+1, 1 g t−0 g t = d0k 0kt (15.34) where 0kt = 1 k ≤ t k+ 1 2 −1 k+ 1 2 ≤ t k+1 (15.35) But this is simply the kth translate of the Haar wavelet. Thus, for this particular case the function can be represented as the sum of a scaling function and a wavelet at the same resolution: 1 g t = k c0k0kt+ k d0k 0kt (15.36) In fact, we can show that this decomposition is not limited to this particular example. A function in V1 can be decomposed into a function in V0—that is, a function that is a linear combination of the scaling function at resolution 0, and a function that is a linear combination of translates of a mother wavelet. Denoting the set of functions that can be obtained by a linear combination of the translates of the mother wavelet as W0, we can write this symbolically as V1 = V0 ⊕W0 (15.37) In other words, any function in V1 can be represented using functions in V0 and W0. Obviously, once a scaling function is selected, the choice of the wavelet function cannot be arbitrary. The wavelet that generates the set W0 and the scaling function that generates the sets V0 and V1 are intrinsically related. In fact, from (15.37), W0 ⊂ V1, and therefore any function in W0 can be represented by a linear combination of 1k. In particular, we can write the mother wavelet t as t = k wk1kt (15.38)
  • 509. 486 15 W A V E L E T - B A S E D C O M P R E S S I O N or t = k wk √ 22t −k (15.39) This is the counterpart of the multiresolution analysis equation for the wavelet function and will be of primary importance in the implementation of the decomposition. All of this development has been for a function in V1. What if the function can only be accurately represented at resolution j +1? If we define Wj as the closure of the span of jkt, we can show that Vj+1 = Vj ⊕Wj (15.40) But, as j is arbitrary, Vj = Vj−1 ⊕Wj−1 (15.41) and Vj+1 = Vj−1 ⊕Wj−1 ⊕Wj (15.42) Continuing in this manner, we can see that for any k ≤ j Vj+1 = Vk ⊕Wk ⊕Wk+1 ⊕···⊕Wj (15.43) In other words, if we have a function that belongs to Vj+1 (i.e., that can be exactly represented by the scaling function at resolution j + 1), we can decompose it into a sum of functions starting with a lower-resolution approximation followed by a sequence of functions generated by dilations of the wavelet that represent the leftover details. This is very much like what we did in subband coding. A major difference is that, while the subband decomposition is in terms of sines and cosines, the decomposition in this case can use a variety of scaling functions and wavelets. Thus, we can adapt the decomposition to the signal being decomposed by selecting the scaling function and wavelet. 15 . 5 Implementation Using Filters One of the most popular approaches to implementing the decomposition discussed in the previous section is using a hierarchical filter structure similar to the one used in subband coding. In this section we will look at how to obtain the structure and the filter coefficients. We start with the MRA equation t = k hk √ 22t −k (15.44) Substituting t = 2j t −m, we obtain the equation for an arbitrary dilation and translation: 2j t −m = k hk √ 222j t −m−k (15.45) = k hk √ 22j+1 t −2m−k (15.46) = l hl−2m √ 22j+1 t −l (15.47)
  • 510. 15.5 Implementation Using Filters 487 where in the last equation we have used the substitution l = 2m + k. Suppose we have a function ft that can be accurately represented at resolution j + 1 by some scaling function t. We assume that the scaling function and its dilations and translations form an orthonormal set. The coefficients cj+1 can be obtained by cj+1k = ftj+1kdt (15.48) If we can represent ft accurately at resolution j +1 with a linear combination of j+1kt, then from the previous section we can decompose it into two functions: one in terms of jkt and one in terms of the jth dilation of the corresponding wavelet jkt. The coefficients cjk are given by cjk = ftjktdt (15.49) = ft2 j 2 2j t −kdt (15.50) Substituting for 2j t −k from (15.47), we get cjl = ft2 j 2 l hl−2k √ 22j+1 t −ldt (15.51) Interchanging the order of summation and integration, we get cjl = l hl−2k ft2 j 2 √ 22j+1 t −ldt (15.52) But the integral is simply cj+1k. Therefore, cjk = k hk−2mcj+1k (15.53) We have encountered this relationship before in the context of the Haar function. Equa- tion (15.27) provides the relationship between coefficients of the Haar expansion at two resolution levels. In a more general setting, the coefficients hj provide a link between the coefficients cjk at different resolutions. Thus, given the coefficients at resolution level j +1, we can obtain the coefficients at all other resolution levels. But how do we start the process? Recall that ft can be accurately represented at resolution j + 1. Therefore, we can replace cj+1k by the samples of ft. Let’s represent these samples by xk. Then the coefficients of the low-resolution expansion are given by cjk = k hk−2mxk (15.54) In Chapter 12, we introduced the input-output relationship of a linear filter as ym = k hkxm−k = k hm−kxk (15.55) Replacing m by 2m, we get every other sample of the output y2m = k h2m−kxk (15.56)
  • 511. 488 15 W A V E L E T - B A S E D C O M P R E S S I O N Comparing (15.56) with (15.54), we can see that the coefficients of the low-resolution approximation are every other output of a linear filter whose impulse response is h−k. Recall that hk are the coefficients that satisfy the MRA equation. Using the terminology of subband coding, the coefficients cjk are the downsampled output of the linear filter with impulse response h−k. The detail portion of the representation is obtained in a similar manner. Again we start from the recursion relationship. This time we use the recursion relationship for the wavelet function as our starting point: t = k wk √ 22t −k (15.57) Again substituting t = 2j t −m and using the same simplifications, we get 2j t −m = k wk−2m √ 22j+1 t −k (15.58) Using the fact that the dilated and translated wavelets form an orthonormal basis, we can obtain the detail coefficients djk by djk = ft jktdt (15.59) = ft2 j 2 2j t −kdt (15.60) = ft2 j 2 l wl−2k √ 22j+1 t −ldt (15.61) = l wl−2k ft2 j+1 2 2j+1 t −ldt (15.62) = l wl−2kcj+1l (15.63) Thus, the detail coefficients are the decimated outputs of a filter with impulse response w−k. At this point we can use exactly the same arguments to further decompose the coefficients cj. In order to retrieve cj+1k from cjk and djk, we upsample the lower resolution coefficients and filter, using filters with impulse response hk and wk cj+1k = l cjlbk−2l l djlwk−2l 15 . 5 . 1 S c a l i n g a n d W a v e l e t C o e f f i c i e n t s In order to implement the wavelet decomposition, the coefficients hk and wk are of primary importance. In this section we look at some of the properties of these coefficients that will help us in finding different decompositions.
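Before turning to those properties, note that the analysis and synthesis operations just described fit in a few lines of code. The sketch below uses the two-tap Haar filters and treats the signal as periodic (circular indexing) purely to keep the boundary handling short; neither choice comes from the text.

```python
import numpy as np

def analyze(c_fine, h, w):
    # c_j[m] = sum_k h[k - 2m] c_{j+1}[k]   and   d_j[m] = sum_k w[k - 2m] c_{j+1}[k],
    # i.e., filter with h(-k) and w(-k) and keep every other output sample.
    N = len(c_fine)
    c = np.zeros(N // 2)
    d = np.zeros(N // 2)
    for m in range(N // 2):
        for k in range(len(h)):
            c[m] += h[k] * c_fine[(2 * m + k) % N]   # circular boundary (assumption)
            d[m] += w[k] * c_fine[(2 * m + k) % N]
    return c, d

def synthesize(c, d, h, w):
    # c_{j+1}[k] = sum_l c_j[l] h[k - 2l] + sum_l d_j[l] w[k - 2l]:
    # upsample both sets of coefficients by two, filter with h and w, and add.
    N = 2 * len(c)
    out = np.zeros(N)
    for l in range(len(c)):
        for i in range(len(h)):
            out[(2 * l + i) % N] += c[l] * h[i] + d[l] * w[i]
    return out

h = np.array([1.0, 1.0]) / np.sqrt(2)     # Haar scaling (low-pass) filter
w = np.array([1.0, -1.0]) / np.sqrt(2)    # Haar wavelet (high-pass) filter

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])   # stand-in for samples of f(t)
c, d = analyze(x, h, w)
print(np.allclose(synthesize(c, d, h, w), x))             # True: perfect reconstruction
```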
  • 512. 15.5 Implementation Using Filters 489 We start with the MRA equation. Integrating both sides of the equation over all t, we obtain − tdt = − k hk √ 22t −kdt (15.64) Interchanging the summation and integration on the right-hand side of the equation, we get − tdt = k hk √ 2 − 2t −kdt (15.65) Substituting x = 2t −k with dx = 2dt in the right-hand side of the equation, we get − tdt = k hk √ 2 − x 1 2 dx (15.66) = k hk 1 √ 2 − xdx (15.67) Assuming that the average value of the scaling function is not zero, we can divide both sides by the integral and we get k hk = √ 2 (15.68) If we normalize the scaling function to have a magnitude of one, we can use the orthogonality condition on the scaling function to get another condition on hk: t2 dt = k hk √ 22t −k m hm √ 22t −mdt (15.69) = k m hkhm2 2t −k2t −mdt (15.70) = k m hkhm x −kx −mdx (15.71) where in the last equation we have used the substitution x = 2t. The integral on the right-hand side is zero except when k = m. When k = m, the integral is unity and we obtain k h2 k = 1 (15.72) We can actually get a more general property by using the orthogonality of the translates of the scaling function tt −mdt = m (15.73) Rewriting this using the MRA equation to substitute for t and t −m, we obtain k hk √ 22t −k l hl √ 22t −2m−l dt = k l hkhl2 2t −k2t −2m−ldt (15.74)
  • 513. 490 15 W A V E L E T - B A S E D C O M P R E S S I O N Substituting x = 2t, we get tt −mdt = k l hkhl x −kx −2m−ldx (15.75) = k l hkhlk−2m+l (15.76) = k hkhk−2m (15.77) Therefore, we have k hkhk−2m = m (15.78) Notice that this is the same relationship we had to satisfy for perfect reconstruction in the previous chapter. Using these relationships, we can generate scaling coefficients for filters of various lengths. E x a m p l e 15 . 5 . 1: For k = 2, we have from (15.68) and (15.72) h0 +h1 = √ 2 (15.79) h2 0 +h2 1 = 1 (15.80) These equations are uniquely satisfied by h0 = h1 = 1 √ 2 which is the Haar scaling function. An orthogonal expansion does not exist for all lengths. In the following example, we consider the case of k = 3. E x a m p l e 15 . 5 . 2: For k = 3, from the three conditions (15.68), (15.72), and (15.78), we have h0 +h1 +h2 = √ 2 (15.81) h2 0 +h2 1 +h2 2 = 1 (15.82) h0h2 = 0 (15.83) The last condition can only be satisfied if h0 = 0 or h2 = 0. In either case we will be left with the two-coefficient filter for the Haar scaling function. In fact, we can see that for k odd, we will always end up with a condition that will force one of the coefficients to zero, thus leaving an even number of coefficients. When the
  • 514. 15.5 Implementation Using Filters 491 number of coefficients gets larger than the number of conditions, we end up with an infinite number of solutions. E x a m p l e 15 . 5 . 3: Consider the case when k = 4. The three conditions give us the following three equations: h0 +h1 +h2 +h3 = √ 2 (15.84) h2 0 +h2 1 +h2 2 +h2 3 = 1 (15.85) h0h2 +h1h3 = 0 (15.86) We have three equations and four unknowns; that is, we have one degree of freedom. We can use this degree of freedom to impose further conditions on the solution. The solutions to these equations include the Daubechies four-tap solution: h0 = 1+ √ 3 4 √ 2 h1 = 3+ √ 3 4 √ 2 h2 = 3− √ 3 4 √ 2 h3 = 1− √ 3 4 √ 2 Given the close relationship between the scaling function and the wavelet, it seems reasonable that we should be able to obtain the coefficients for the wavelet filter from the coefficients of the scaling filter. In fact, if the wavelet function is orthogonal to the scaling function at the same scale t −k t −mdt = 0 (15.87) then wk = ±−1k hN−k (15.88) and k hkwn−2k = 0 (15.89) Furthermore, k wk = 0 (15.90) The proof of these relationships is somewhat involved [207]. 15 . 5 . 2 F a m i l i e s o f W a v e l e t s Let’s move to the more practical aspects of compression using wavelets. We have said that there is an infinite number of possible wavelets. Which one is best depends on the application. In this section we list different wavelets and their corresponding filters. You are encouraged to experiment with these to find those best suited to your application. The 4-tap, 12-tap, and 20-tap Daubechies filters are shown in Tables 15.1–15.3. The 6-tap, 12-tap, and 18-tap Coiflet filters are shown in Tables 15.4–15.6.
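Before listing them, here is a quick numerical check that the closed-form four-tap Daubechies coefficients of Example 15.5.3 satisfy (15.68), (15.72), and (15.78), together with a construction of the corresponding wavelet filter using one of the sign and ordering choices allowed by (15.88). The same check can be applied to any of the tabulated filters that follow.

```python
import numpy as np

s3 = np.sqrt(3.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))   # Daubechies 4-tap

print(np.isclose(h.sum(), np.sqrt(2.0)))             # (15.68): sum of h_k equals sqrt(2)
print(np.isclose(np.sum(h ** 2), 1.0))               # (15.72): sum of h_k^2 equals 1
print(np.isclose(h[0] * h[2] + h[1] * h[3], 0.0))    # (15.78) with m = 1

# Wavelet filter from the scaling filter: w_k = (-1)^k h_{N-k}, here with N = 3 (one valid choice)
w = np.array([(-1) ** k * h[3 - k] for k in range(4)])
print(np.isclose(w.sum(), 0.0))                      # (15.90): the wavelet filter sums to zero
print(np.isclose(np.sum(h * w), 0.0))                # h and w are orthogonal at zero shift
```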
TABLE 15.1 Coefficients for the 4-tap Daubechies low-pass filter.

h0   0.4829629131445341
h1   0.8365163037378079
h2   0.2241438680420134
h3  −0.1294095225512604

TABLE 15.2 Coefficients for the 12-tap Daubechies low-pass filter.

h0   0.111540743350
h1   0.494623890398
h2   0.751133908021
h3   0.315250351709
h4  −0.226264693965
h5  −0.129766867567
h6   0.097501605587
h7   0.027522865530
h8  −0.031582039318
h9   0.000553842201
h10  0.004777257511
h11 −0.001077301085

TABLE 15.3 Coefficients for the 20-tap Daubechies low-pass filter.

h0   0.026670057901
h1   0.188176800078
h2   0.527201188932
h3   0.688459039454
h4   0.281172343661
h5  −0.249846424327
h6  −0.195946274377
h7   0.127369340336
h8   0.093057364604
h9  −0.071394147166
h10 −0.029457536822
h11  0.033212674059
h12  0.003606553567
h13 −0.010733175483
h14  0.001395351747
h15  0.001992405295
h16 −0.000685856695
h17 −0.000116466855
h18  0.000093588670
h19 −0.000013264203
TABLE 15.4 Coefficients for the 6-tap Coiflet low-pass filter.

h0  −0.051429728471
h1   0.238929728471
h2   0.602859456942
h3   0.272140543058
h4  −0.051429972847
h5  −0.011070271529

TABLE 15.5 Coefficients for the 12-tap Coiflet low-pass filter.

h0   0.011587596739
h1  −0.029320137980
h2  −0.047639590310
h3   0.273021046535
h4   0.574682393857
h5   0.294867193696
h6  −0.054085607092
h7  −0.042026480461
h8   0.016744410163
h9   0.003967883613
h10 −0.001289203356
h11 −0.000509505539

TABLE 15.6 Coefficients for the 18-tap Coiflet low-pass filter.

h0  −0.002682418671
h1   0.005503126709
h2   0.016583560479
h3  −0.046507764479
h4  −0.043220763560
h5   0.286503335274
h6   0.561285256870
h7   0.302983571773
h8  −0.050770140755
h9  −0.058196250762
h10  0.024434094321
h11  0.011229240962
h12 −0.006369601011
h13 −0.001820458916
h14  0.000790205101
h15  0.000329665174
h16 −0.000050192775
h17 −0.000024465734
  • 517. 494 15 W A V E L E T - B A S E D C O M P R E S S I O N 15 . 6 Image Compression One of the most popular applications of wavelets has been to image compression. The JPEG 2000 standard, which is designed to update and replace the current JPEG standard, will use wavelets instead of the DCT to perform decomposition of the image. During our discussion we have always referred to the signal to be decomposed as a one-dimensional signal; however, images are two-dimensional signals. There are two approaches to the subband decomposition of two-dimensional signals: using two-dimensional filters, or using separable transforms that can be implemented using one-dimensional filters on the rows first and then on the columns (or vice versa). Most approaches, including the JPEG 2000 verification model, use the second approach. In Figure 15.10 we show how an image can be decomposed using subband decomposition. We begin with an N × M image. We filter each row and then downsample to obtain two N × M 2 images. We then filter each column and subsample the filter output to obtain four N 2 × M 2 images. Of the four subimages, the one obtained by low-pass filtering the rows and columns is referred to as the LL image; the one obtained by low-pass filtering the rows and high-pass filtering the columns is referred to as the LH image; the one obtained by high-pass filtering the rows and low-pass filtering the columns is called the HL image; and the subimage obtained by high-pass filtering the rows and columns is referred to as the HH image. This decomposition is sometimes represented as shown in Figure 15.11. Each of the subimages obtained in this fashion can then be filtered and subsampled to obtain four more subimages. This process can be continued until the desired subband structure is obtained. Three popular structures are shown in Figure 15.12. In the structure in Figure 15.12a, the LL subimage has been decomposed after each decomposition into four more subimages, resulting in a total of 10 subimages. This is one of the more popular decompositions. a0 H0 H0 H1 2 2 2 LL N/2 N M/2 M/2 N M x LH a1 H1 H0 H1 2 2 2 HL HH F I G U R E 15. 10 Subband decomposition of an N × M image.
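A minimal sketch of one level of this row–column decomposition is shown below. The two-tap Haar filters are used only to keep the code short; any of the low-pass/high-pass pairs derived from Tables 15.1–15.6 could be substituted.

```python
import numpy as np

def analyze_1d(x):
    # One level of a two-channel Haar filter bank along the last axis:
    # low-pass and high-pass outputs, each downsampled by two.
    lo = (x[..., 0::2] + x[..., 1::2]) / np.sqrt(2)
    hi = (x[..., 0::2] - x[..., 1::2]) / np.sqrt(2)
    return lo, hi

def analyze_2d(img):
    # Separable transform: filter and downsample the rows first, then the columns.
    L, H = analyze_1d(img)                      # each N x M/2
    LL, LH = analyze_1d(L.swapaxes(0, 1))       # filter the columns of L
    HL, HH = analyze_1d(H.swapaxes(0, 1))       # filter the columns of H
    # Naming follows the text: LH = low-pass rows, high-pass columns, and so on.
    return LL.T, LH.T, HL.T, HH.T

img = np.random.rand(8, 8)                      # stand-in for an N x M image
LL, LH, HL, HH = analyze_2d(img)
print(LL.shape, LH.shape, HL.shape, HH.shape)   # four N/2 x M/2 subimages
```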
FIGURE 15.11 First-level decomposition into the N/2 × M/2 LL, LH, HL, and HH subimages.

FIGURE 15.12 Three popular subband structures.

Example 15.6.1: Let's use the Daubechies wavelet filter to repeat what we did in Examples 14.12.2 and 14.12.3 using the Johnston and the Smith-Barnwell filters. If we use the 4-tap Daubechies filter, we obtain the decomposition shown in Figure 15.13. Notice that even though we are only using a 4-tap filter, we get results comparable to those of the 16-tap Johnston filter and the 8-tap Smith-Barnwell filter. If we now encode this image at the rate of 0.5 bits per pixel, we get the reconstructed image shown in Figure 15.14. Notice that the quality is comparable to that obtained using filters requiring two or four times as much computation.

In this example we used a simple scalar quantizer for quantization of the coefficients. However, if we use strategies that are motivated by the properties of the coefficients themselves, we can obtain significant performance improvements. In the next sections we examine two popular quantization strategies developed specifically for wavelets.
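The "simple scalar quantizer" mentioned above can be sketched as a uniform quantizer applied subband by subband; the step sizes in the fragment below are placeholders, not the values used to produce Figure 15.14.

```python
import numpy as np

def quantize(band, step):
    # Uniform quantization: map each coefficient to the index of its interval...
    return np.round(band / step).astype(int)

def dequantize(indices, step):
    # ...and reconstruct at the midpoint of that interval.
    return indices * step

steps = {"LL": 4.0, "LH": 8.0, "HL": 8.0, "HH": 16.0}   # placeholder step sizes
band = np.array([[26.0, 6.0], [13.0, 10.0]])            # a toy LL subband
rec = dequantize(quantize(band, steps["LL"]), steps["LL"])
print(rec)    # each coefficient rounded to the nearest multiple of the step size
```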
  • 519. 496 15 W A V E L E T - B A S E D C O M P R E S S I O N F I G U R E 15. 13 Decomposition of Sinan image using the four-tap Daubechies filter. F I G U R E 15. 14 Reconstruction of Sinan image encoded using 0.5 bits per pixel and the four-tap Daubechies filter.
  • 520. 15.7 Embedded Zerotree Coder 497 15 . 7 Embedded Zerotree Coder The embedded zerotree wavelet (EZW) coder was introduced by Shapiro [208]. It is a quantization and coding strategy that incorporates some characteristics of the wavelet decom- position. Just as the quantization and coding approach used in the JPEG standard, which were motivated by the characteristics of the coefficients, were superior to the generic zonal coding algorithms, the EZW approach and its descendants significantly outperform some of the generic approaches. The particular characteristic used by the EZW algorithm is that there are wavelet coefficients in different subbands that represent the same spatial loca- tion in the image. If the decomposition is such that the size of the different subbands is different (the first two decompositions in Figure 15.12), then a single coefficient in the smaller subband may represent the same spatial location as multiple coefficients in the other subbands. In order to put our discussion on more solid ground, consider the 10-band decomposition shown in Figure 15.15. The coefficient a in the upper-left corner of band I represents the same spatial location as coefficients a1 in band II, a2 in band III, and a3 in band IV. In turn, the coefficient a1 represents the same spatial location as coefficients a11a12a13, and a14 in band V. Each of these pixels represents the same spatial location as four pixels in band VIII, and so on. In fact, we can visualize the relationships of these coefficients in the form of a tree: The coefficient a forms the root of the tree with three descendants a1a2, and a3. The coefficient a1 has descendants a11a12a13, and a14. The coefficient a2 has descendants a21a22a23, and a24, and the coefficient a3 has descendants a31a32a33, and a34. Each of these coefficients in turn has four descendants, making a total of 64 coefficients in this tree. A pictorial representation of the tree is shown in Figure 15.16. Recall that when natural images are decomposed in this manner most of the energy is compacted into the lower bands. Thus, in many cases the coefficients closer to the root of the tree have higher magnitudes than coefficients further away from the root. This means that often if a coefficient has a magnitude less than a given threshold, all its descendants will have magnitudes less than that threshold. In a scalar quantizer, the outer levels of the quantizer correspond to larger magnitudes. Consider the 3-bit quantizer shown in Figure 15.17. If we determine that all coefficients arising from a particular root have magnitudes smaller than T0 and we inform the decoder of this situation, then for all coefficients in that tree we need only use 2 bits per sample, while getting the same performance as we would have obtained using the 3-bit quantizer. If the binary coding scheme used in Figure 15.17 is used, in which the first bit is the sign bit and the next bit is the most significant bit of the magnitude, then the information that a set of coefficients has value less than T0 is the same as saying that the most significant bit of the magnitude is 0. If there are N coefficients in the tree, this is a savings of N bits minus however many bits are needed to inform the decoder of this situation. Before we describe the EZW algorithm, we need to introduce some terminology. Given a threshold T, if a given coefficient has a magnitude greater than T, it is called a significant coefficient at level T. 
If the magnitude of the coefficient is less than T (it is insignificant), and all its descendants have magnitudes less than T, then the coefficient is called a zerotree root. Finally, it might happen that the coefficient itself is less than T but some of its descendants have a value greater than T. Such a coefficient is called an isolated zero.
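These definitions translate almost directly into code. The sketch below assumes a square dyadic decomposition whose coarsest band is the single coefficient at (0, 0), so that (0, 0) has the three coefficients (0, 1), (1, 0), and (1, 1) as offspring and every other coefficient (i, j) has offspring (2i, 2j), (2i, 2j+1), (2i+1, 2j), (2i+1, 2j+1). The function names are mine, and the array at the end is the one used in Example 15.7.1 below.

```python
import numpy as np

def offspring(i, j, n):
    # Parent-child relationship for an n x n decomposition with a single-coefficient LL band.
    if (i, j) == (0, 0):
        return [(0, 1), (1, 0), (1, 1)]
    if 2 * i >= n or 2 * j >= n:
        return []                                  # finest bands have no offspring
    return [(2 * i, 2 * j), (2 * i, 2 * j + 1), (2 * i + 1, 2 * j), (2 * i + 1, 2 * j + 1)]

def descendants(i, j, n):
    out = []
    for (p, q) in offspring(i, j, n):
        out.append((p, q))
        out += descendants(p, q, n)
    return out

def label(coeffs, i, j, T):
    # Classify a coefficient, not yet found significant, at threshold T.
    c = coeffs[i, j]
    if abs(c) > T:
        return "sp" if c > 0 else "sn"             # significant positive / negative
    desc = descendants(i, j, coeffs.shape[0])
    if desc and not any(abs(coeffs[p, q]) > T for (p, q) in desc):
        return "zr"                                # zerotree root
    return "iz"                                    # isolated zero (also used when there are no descendants)

coeffs = np.array([[26,  6, 13, 10],
                   [-7,  7,  6,  4],
                   [ 4, -4,  4, -3],
                   [ 2, -2, -2,  0]])
print([label(coeffs, i, j, 16) for (i, j) in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# ['sp', 'zr', 'zr', 'zr'] -- the labels sent in the first dominant pass of Example 15.7.1
```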
FIGURE 15.15 A 10-band wavelet decomposition.

The EZW algorithm is a multiple-pass algorithm, with each pass consisting of two steps: significance map encoding or the dominant pass, and refinement or the subordinate pass. If $c_{max}$ is the value of the largest coefficient, the initial value of the threshold $T_0$ is given by

$T_0 = 2^{\lfloor \log_2 c_{max} \rfloor}$   (15.91)

This selection guarantees that the largest coefficient will lie in the interval $[T_0, 2T_0)$. In each pass, the threshold $T_i$ is reduced to half the value it had in the previous pass:

$T_i = \frac{1}{2} T_{i-1}$   (15.92)
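In code, the threshold schedule of (15.91) and (15.92) amounts to the following; the value 26 is the largest coefficient of Example 15.7.1 below.

```python
import math

def initial_threshold(c_max):
    # T0 = 2 ** floor(log2(|c_max|)), so that T0 <= |c_max| < 2 * T0   (15.91)
    return 2 ** int(math.floor(math.log2(abs(c_max))))

T0 = initial_threshold(26)
print(T0)                                 # 16, as computed in Example 15.7.1
print([T0 // 2 ** i for i in range(4)])   # successive thresholds 16, 8, 4, 2   (15.92)
```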
FIGURE 15.16 Data structure used in the EZW coder.

FIGURE 15.17 A 3-bit midrise quantizer.
For a given value of $T_i$, we assign one of four possible labels to the coefficients: significant positive (sp), significant negative (sn), zerotree root (zr), and isolated zero (iz). If we used a fixed-length code, we would need 2 bits to represent each of the labels. Note that when a coefficient has been labeled a zerotree root, we do not need to label its descendants. This assignment is referred to as significance map coding.

We can view the significance map coding in part as quantization using a three-level midtread quantizer. This situation is shown in Figure 15.18. The coefficients labeled significant are simply those that fall in the outer levels of the quantizer and are assigned an initial reconstructed value of $1.5T_i$ or $-1.5T_i$, depending on whether the coefficient is positive or negative. Note that selecting $T_i$ according to (15.91) and (15.92) guarantees that the significant coefficients will lie in the interval $[T_i, 2T_i)$. Once a determination of significance has been made, the significant coefficients are included in a list for further refinement in the refinement or subordinate passes. In the refinement pass, we determine whether the coefficient lies in the upper or lower half of the interval $[T_i, 2T_i)$. In successive refinement passes, as the value of $T$ is reduced, the interval containing the significant coefficient is narrowed still further and the reconstruction is updated accordingly. An easy way to perform the refinement is to take the difference between the coefficient value and its reconstruction and quantize it using a two-level quantizer with reconstruction values $\pm T/4$. This quantized value is then added on to the current reconstruction value as a correction term.

The wavelet coefficients that have not been previously determined significant are scanned in the manner depicted in Figure 15.19, with each parent node in a tree scanned before its offspring. This makes sense because if the parent is determined to be a zerotree root, we would not need to encode the offspring.

FIGURE 15.18 A three-level midtread quantizer (deadzone from $-T$ to $T$, outer reconstruction levels at $\pm 1.5T$).
  • 524. 15.7 Embedded Zerotree Coder 501 F I G U R E 15. 19 Scanning of wavelet coefficients for encoding using the EZW algorithm. Although this may sound confusing, in order to see how simple the encoding procedure actually is, let’s use an example. E x a m p l e 15 . 7 . 1: Let’s use the seven-level decomposition shown below to demonstrate the various steps of EZW: 26 6 13 10 −7 7 6 4 4 −4 4 −3 2 −2 −2 0 To obtain the initial threshold value T0, we find the maximum magnitude coefficient, which in this case is 26. Then T0 = 2 log2 26 = 16
  • 525. 502 15 W A V E L E T - B A S E D C O M P R E S S I O N Comparing the coefficients against 16, we find 26 is greater than 16 so we send sp. The next coefficient in the scan is 6, which is less than 16. Furthermore, its descendants (13, 10, 6, and 4) are all less than 16. Therefore, 6 is a zerotree root, and we encode this entire set with the label zr. The next coefficient in the scan is −7, which is also a zerotree root, as is 7, the final element in the scan. We do not need to encode the rest of the coefficients separately because they have already been encoded as part of the various zerotrees. The sequence of labels to be transmitted at this point is sp zr zr zr Since each label requires 2 bits (for fixed-length encoding), we have used up 8 bits from our bit budget. The only significant coefficient in this pass is the coefficient with a value of 26. We include this coefficient in our list to be refined in the subordinate pass. Calling the subordinate list LS, we have LS = 26 The reconstructed value of this coefficient is 15T0 = 24, and the reconstructed bands look like this: 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 The next step is the subordinate pass, in which we obtain a correction term for the reconstruction value of the significant coefficients. In this case, the list LS contains only one element. The difference between this element and its reconstructed value is 26 − 24 = 2. Quantizing this with a two-level quantizer with reconstruction levels ±T0/4, we obtain a correction term of 4. Thus, the reconstruction becomes 24 + 4 = 28. Transmitting the correction term costs a single bit, therefore at the end of the first pass we have used up 9 bits. Using only these 9 bits, we would obtain the following reconstruction: 28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 We now reduce the value of the threshold by a factor of two and repeat the process. The value of T1 is 8. We rescan the coefficients that have not yet been deemed significant. To
  • 526. 15.7 Embedded Zerotree Coder 503 emphasize the fact that we do not consider the coefficients that have been deemed significant in the previous pass, we replace them with : 6 13 10 −7 7 6 4 4 −4 4 −3 2 −2 −2 0 The first coefficient we encounter has a value of 6. This is less than the threshold value of 8; however, the descendants of this coefficient include coefficients with values of 13 and 10. Therefore, this coefficient cannot be classified as a zerotree root. This is an example of what we defined as an isolated zero. The next two coefficients in the scan are −7 and 7. Both of these coefficients have magnitudes less than the threshold value of 8. Furthermore, all their descendants also have magnitudes less than 8. Therefore, these two coefficients are coded as zr. The next two elements in the scan are 13 and 10, which are both coded as sp. The final two elements in the scan are 6 and 4. These are both less than the threshold, but they do not have any descendants. We code these coefficients as iz. Thus, this dominant pass is coded as iz zr zr sp sp iz iz which requires 14 bits, bringing the total number of bits used to 23. The significant coef- ficients are reconstructed with values 15T1 = 12. Thus, the reconstruction at this point is 28 0 12 12 0 0 0 0 0 0 0 0 0 0 0 0 We add the new significant coefficients to the subordinate list: LS = 261310 In the subordinate pass, we take the difference between the coefficients and their reconstructions and quantize these to obtain the correction or refinement values for these coefficients. The possible values for the correction terms are ±T1/4 = ±2: 26−28 = −2 ⇒ Correction term = −2 13−12 = 1 ⇒ Correction term = 2 (15.93) 10−12 = −2 ⇒ Correction term = −2
  • 527. 504 15 W A V E L E T - B A S E D C O M P R E S S I O N Each correction requires a single bit, bringing the total number of bits used to 26. With these corrections, the reconstruction at this stage is 26 0 14 10 0 0 0 0 0 0 0 0 0 0 0 0 If we go through one more pass, we reduce the threshold value to 4. The coefficients to be scanned are 6 −7 7 6 4 4 −4 4 −3 2 −2 −2 0 The dominant pass results in the following coded sequence: sp sn sp sp sp sp sn iz iz sp iz iz iz This pass cost 26 bits, equal to the total number of bits used previous to this pass. The reconstruction upon decoding of the dominant pass is 26 6 14 10 −6 6 6 6 6 −6 6 0 0 0 0 0 The subordinate list is LS = 2613106−77644−44 By now it should be reasonably clear how the algorithm works. We continue encoding until we have exhausted our bit budget or until some other criterion is satisfied. There are several observations we can make from this example. Notice that the encoding process is geared to provide the most bang for the bit at each step. At each step the bits
  • 528. 15.8 Set Partitioning in Hierarchical Trees 505 are used to provide the maximum reduction in the reconstruction error. If at any time the encoding is interrupted, the reconstruction using this (interrupted) encoding is the best that the algorithm could have provided using this many bits. The encoding improves as more bits are transmitted. This form of coding is called embedded coding. In order to enhance this aspect of the algorithm, we can also sort the subordinate list at the end of each pass using information available to both encoder and decoder. This would increase the likelihood of larger coefficients being encoded first, thus providing for a greater reduction in the reconstruction error. Finally, in the example we determined the number of bits used by assuming fixed-length encoding. In practice, arithmetic coding is used, providing a further reduction in rate. 15 . 8 Set Partitioning in Hierarchical Trees The SPIHT (Set Partitioning in Hierarchical Trees) algorithm is a generalization of the EZW algorithm and was proposed by Amir Said and William Pearlman [209]. Recall that in EZW we transmit a lot of information for little cost when we declare an entire subtree to be insignificant and represent all the coefficients in it with a zerotree root label zr. The SPIHT algorithm uses a partitioning of the trees (which in SPIHT are called spatial orientation trees) in a manner that tends to keep insignificant coefficients together in larger subsets. The partitioning decisions are binary decisions that are transmitted to the decoder, providing a significance map encoding that is more efficient than EZW. In fact, the efficiency of the significance map encoding in SPIHT is such that arithmetic coding of the binary decisions provides very little gain. The thresholds used for checking significance are powers of two, so in essence the SPIHT algorithm sends the binary representation of the integer value of the wavelet coefficients. As in EZW, the significance map encoding, or set partitioning and ordering step, is followed by a refinement step in which the representations of the significant coefficients are refined. Let’s briefly describe the algorithm and then look at some examples of its operation. However, before we do that we need to get familiar with some notation. The data structure used by the SPIHT algorithm is similar to that used by the EZW algorithm—although not the same. The wavelet coefficients are again divided into trees originating from the lowest resolution band (band I in our case). The coefficients are grouped into 2 × 2 arrays that, except for the coefficients in band I, are offsprings of a coefficient of a lower resolution band. The coefficients in the lowest resolution band are also divided into 2 ×2 arrays. However, unlike the EZW case, all but one of them are root nodes. The coefficient in the top-left corner of the array does not have any offsprings. The data structure is shown pictorially in Figure 15.20 for a seven-band decomposition. The trees are further partitioned into four types of sets, which are sets of coordinates of the coefficients: ij This is the set of coordinates of the offsprings of the wavelet coefficient at location ij. As each node can either have four offsprings or none, the size of ij is either zero or four. For example, in Figure 15.20 the set 01 consists of the coordinates of the coefficients b1, b2, b3, and b4.
  • 529. 506 15 W A V E L E T - B A S E D C O M P R E S S I O N III IV I II V VII VI b1 b2 b3 b4 b11 b12 b13 b14 b21 b22 b23 b24 b31 b32 b33 b34 b41 b42 b43 b44 a b d c F I G U R E 15. 20 Data structure used in the SPIHT algorithm. ij This is the set of all descendants of the coefficient at location ij. Descen- dants include the offsprings, the offsprings of the offsprings, and so on. For exam- ple, in Figure 15.20 the set 01 consists of the coordinates of the coefficients b1 b4b11 b14 b44. Because the number of offsprings can either be zero or four, the size of ij is either zero or a sum of powers of four. This is the set of all root nodes—essentially band I in the case of Figure 15.20. ij This is the set of coordinates of all the descendants of the coefficient at location ij except for the immediate offsprings of the coefficient at location ij. In other words, ij = ij−ij
  • 530. 15.8 Set Partitioning in Hierarchical Trees 507 In Figure 15.20 the set 01 consists of the coordinates of the coefficients b11 b14 b44. A set ij or ij is said to be significant if any coefficient in the set has a magnitude greater than the threshold. Finally, thresholds used for checking significance are powers of two, so in essence the SPIHT algorithm sends the binary representation of the integer value of the wavelet coefficients. The bits are numbered with the least significant bit being the zeroth bit, the next bit being the first significant bit, and the kth bit being referred to as the k−1 most significant bit. With these definitions under our belt, let us now describe the algorithm. The algorithm makes use of three lists: the list of insignificant pixels (LIP), the list of significant pixels (LSP), and the list of insignificant sets (LIS). The LSP and LIS lists will contain the coordinates of coefficients, while the LIS will contain the coordinates of the roots of sets of type or . We start by determining the initial value of the threshold. We do this by calculating n = log2 cmax where cmax is the maximum magnitude of the coefficients to be encoded. The LIP list is initialized with the set . Those elements of that have descendants are also placed in LIS as type entries. The LSP list is initially empty. In each pass, we will first process the members of LIP, then the members of LIS. This is essentially the significance map encoding step. We then process the elements of LSP in the refinement step. We begin by examining each coordinate contained in LIP. If the coefficient at that coordinate is significant (that is, it is greater than 2n ), we transmit a 1 followed by a bit representing the sign of the coefficient (we will assume 1 for positive, 0 for negative). We then move that coefficient to the LSP list. If the coefficient at that coordinate is not significant, we transmit a 0. After examining each coordinate in LIP, we begin examining the sets in LIS. If the set at coordinate ij is not significant, we transmit a 0. If the set is significant, we transmit a 1. What we do after that depends on whether the set is of type or . If the set is of type , we check each of the offsprings of the coefficient at that coordinate. In other words, we check the four coefficients whose coordinates are in ij. For each coefficient that is significant, we transmit a 1, the sign of the coefficient, and then move the coefficient to the LSP. For the rest we transmit a 0 and add their coordinates to the LIP. Now that we have removed the coordinates of ij from the set, what is left is simply the set ij. If this set is not empty, we move it to the end of the LIS and mark it to be of type . Note that this new entry into the LIS has to be examined during this pass. If the set is empty, we remove the coordinate ij from the list. If the set is of type , we add each coordinate in ij to the end of the LIS as the root of a set of type . Again, note that these new entries in the LIS have to be examined during this pass. We then remove ij from the LIS. Once we have processed each of the sets in the LIS (including the newly formed ones), we proceed to the refinement step. In the refinement step we examine each coefficient that was in the LSP prior to the current pass and output the nth most significant bit of cij .
  • 531. 508 15 W A V E L E T - B A S E D C O M P R E S S I O N We ignore the coefficients that have been added to the list in this pass because, by declaring them significant at this particular level, we have already informed the decoder of the value of the nth most significant bit. This completes one pass. Depending on the availability of more bits or external factors, if we decide to continue with the coding process, we decrement n by one and continue. Let’s see the functioning of this algorithm on an example. E x a m p l e 15 . 8 . 1: Let’s use the same example we used for demonstrating the EZW algorithm: 26 6 13 10 −7 7 6 4 4 −4 4 −3 2 −2 −2 0 We will go through three passes at the encoder and generate the transmitted bitstream, then decode this bitstream. First Pass The value for n in this case is 4. The three lists at the encoder are LIP 00 → 2601 → 610 → −711 → 7 LIS 011011 LSP In the listing for LIP, we have included the → # to make it easier to follow the example. Beginning our algorithm, we examine the contents of LIP. The coefficient at location (0, 0) is greater than 16. In other words, it is significant; therefore, we transmit a 1, then a 0 to indicate the coefficient is positive and move the coordinate to LSP. The next three coefficients are all insignificant at this value of the threshold; therefore, we transmit a 0 for each coefficient and leave them in LIP. The next step is to examine the contents of LIS. Looking at the descendants of the coefficient at location (0, 1) (13, 10, 6, and 4), we see that none of them are significant at this value of the threshold so we transmit a 0. Looking at the descendants of c10 and c11, we can see that none of these are significant at this value of the threshold. Therefore, we transmit a 0 for each set. As this is the first pass, there are no elements from the previous pass in LSP; therefore, we do not do anything in the refinement pass. We have transmitted a total of 8 bits at the end of this pass (10000000), and the situation of the three lists is as follows: LIP 01 → 610 → −711 → 7 LIS 011011 LSP 00 → 26
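The set-significance tests that drive this pass are straightforward to code. The sketch below defines the sets O(i, j), D(i, j), and L(i, j) for a square decomposition whose band I is the 2 × 2 block in the upper-left corner, with (0, 0) having no offspring as described above; the function names mirror the text's set names, and everything else is an illustrative assumption. The two tests at the end reproduce the decision for the set rooted at (0, 1) in the first pass (insignificant at n = 4) and anticipate the second pass (significant at n = 3).

```python
import numpy as np

def O(i, j, n):
    # Offspring of (i, j): none for (0, 0); the 2 x 2 block at (2i, 2j) otherwise, if present.
    if (i, j) == (0, 0) or 2 * i >= n or 2 * j >= n:
        return []
    return [(2 * i, 2 * j), (2 * i, 2 * j + 1), (2 * i + 1, 2 * j), (2 * i + 1, 2 * j + 1)]

def D(i, j, n):
    # All descendants: offspring, offspring of offspring, and so on.
    out = []
    for (p, q) in O(i, j, n):
        out.append((p, q))
        out += D(p, q, n)
    return out

def L(i, j, n):
    # Descendants excluding the immediate offspring: L(i, j) = D(i, j) - O(i, j).
    return [c for c in D(i, j, n) if c not in O(i, j, n)]

def set_significant(coords, coeffs, nbit):
    # A set is significant at bit level nbit if any member has magnitude >= 2**nbit.
    return any(abs(coeffs[p, q]) >= 2 ** nbit for (p, q) in coords)

coeffs = np.array([[26,  6, 13, 10],
                   [-7,  7,  6,  4],
                   [ 4, -4,  4, -3],
                   [ 2, -2, -2,  0]])
print(set_significant(D(0, 1, 4), coeffs, 4))   # False: a 0 is sent for this set in the first pass
print(set_significant(D(0, 1, 4), coeffs, 3))   # True: 13 and 10 exceed 8 in the second pass
print(L(0, 1, 4))   # []: in this small example the offspring of (0, 1) have no further descendants
```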
  • 532. 15.8 Set Partitioning in Hierarchical Trees 509 Second Pass For the second pass we decrement n by 1 to 3, which corresponds to a threshold value of 8. Again, we begin our pass by examining the contents of LIP. There are three elements in LIP. Each is insignificant at this threshold so we transmit three 0s. The next step is to examine the contents of LIS. The first element of LIS is the set containing the descendants of the coefficient at location (0, 1). Of this set, both 13 and 10 are significant at this value of the threshold; in other words, the set 01 is significant. We signal this by sending a 1 and examine the offsprings of c01. The first offspring has a value of 13, which is significant and positive, so we send a 1 followed by a 0. The same is true for the second offspring, which has a value of 10. So we send another 1 followed by a 0. We move the coordinates of these two to the LSP. The next two offsprings are both insignificant at this level; therefore, we move these to LIP and transmit a 0 for each. As 01 = , we remove 01 from LIS. Looking at the other elements of LIS, we can clearly see that both of these are insignificant at this level; therefore, we send a 0 for each. In the refinement pass we examine the contents of LSP from the previous pass. There is only one element in there that is not from the current sorting pass, and it has a value of 26. The third MSB of 26 is 1; therefore, we transmit a 1 and complete this pass. In the second pass we have transmitted 13 bits: 0001101000001. The condition of the lists at the end of the second pass is as follows: LIP 01 → 610 → −711 → 712 → 613 → 4 LIS 1011 LSP 00 → 2602 → 1303 → 10 Third Pass The third pass proceeds with n = 2. As the threshold is now smaller, there are significantly more coefficients that are deemed significant, and we end up sending 26 bits. You can easily verify for yourself that the transmitted bitstream for the third pass is 10111010101101100110000010. The condition of the lists at the end of the third pass is as follows: LIP 30 → 231 → −223 → −332 → −233 → 0 LIS LSP 00 → 2602 → 1303 → 1001 → 610 → −711 → 7 12 → 613 → 420 → 421 → −422 → 4 Now for decoding this sequence. At the decoder we also start out with the same lists as the encoder: LIP 00011011 LIS 011011 LSP We assume that the initial value of n is transmitted to the decoder. This allows us to set the threshold value at 16. Upon receiving the results of the first pass (10000000), we can see
  • 533. 510 15 W A V E L E T - B A S E D C O M P R E S S I O N that the first element of LIP is significant and positive and no other coefficient is significant at this level. Using the same reconstruction procedure as in EZW, we can reconstruct the coefficients at this stage as 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 and, following the same procedure as at the encoder, the lists can be updated as LIP 011011 LIS 011011 LSP 00 For the second pass we decrement n by one and examine the transmitted bitstream: 0001101000001. Since the first 3 bits are 0 and there are only three entries in LIP, all the entries in LIP are still insignificant. The next 9 bits give us information about the sets in LIS. The fourth bit of the received bitstream is 1. This means that the set with root at coordinate (0,1) is significant. Since this set is of type , the next bits relate to its offsprings. The 101000 sequence indicates that the first two offsprings are significant at this level and positive and the last two are insignificant. Therefore, we move the first two offsprings to LSP and the last two to LIP. We can also approximate these two significant coefficients in our reconstruction by 15×23 = 12. We also remove 01 from LIS. The next two bits are both 0, indicating that the two remaining sets are still insignificant. The final bit corresponds to the refinement pass. It is a 1, so we update the reconstruction of the (0, 0) coefficient to 24+8/2 = 28. The reconstruction at this stage is 28 0 12 12 0 0 0 0 0 0 0 0 0 0 0 0 and the lists are as follows: LIP 0110111213 LIS 1011 LSP 000203
  • 534. 15.8 Set Partitioning in Hierarchical Trees 511 For the third pass we again decrement n, which is now 2, giving a threshold value of 4. Decoding the bitstream generated during the third pass (10111010101101100110000010), we update our reconstruction to 26 6 14 10 −6 6 6 6 6 −6 6 0 0 0 0 0 and our lists become LIP 3031 LIS LSP 00020301101112202132 At this stage we do not have any sets left in LIS and we simply update the values of the coefficients. Finally, let’s look at an example of an image coded using SPIHT. The image shown in Figure 15.21 is the reconstruction obtained from a compressed representation that used 0.5 F I G U R E 15. 21 Reconstruction of Sinan image encoded using SPIHT at 0.5 bits per pixel.
bits per pixel. (The programs used to generate this image were obtained from the authors.) Comparing this with Figure 15.14, we can see a definite improvement in the quality.

Wavelet decomposition has been finding its way into various standards. The earliest example was the FBI fingerprint image compression standard. The latest is the new image compression standard being developed by the JPEG committee, commonly referred to as JPEG 2000. We take a brief look at the current status of JPEG 2000.

15.9 JPEG 2000

The current JPEG standard provides excellent performance at rates above 0.25 bits per pixel. However, at lower rates there is a sharp degradation in the quality of the reconstructed image. To correct this and other shortcomings, the JPEG committee initiated work on another standard, commonly known as JPEG 2000. The JPEG 2000 standard will be based on wavelet decomposition. There are actually two types of wavelet filters that are included in the standard. One type is the wavelet filters we have been discussing in this chapter. Another type consists of filters that generate integer coefficients; this type is particularly useful when the wavelet decomposition is part of a lossless compression scheme.

The coding scheme is based on an approach, originally proposed by Taubman [210] and Taubman and Zakhor [211], known as EBCOT. The acronym EBCOT stands for "Embedded Block Coding with Optimized Truncation," which nicely summarizes the technique. It is a block coding scheme that generates an embedded bitstream. The block coding is independently performed on nonoverlapping blocks within individual subbands. Within a subband, all blocks that do not lie on the right or lower boundaries are required to have the same dimensions. Neither dimension can exceed 256. Embedding and independent block coding seem inherently contradictory. The way EBCOT resolves this contradiction is to organize the bitstream into a succession of layers. Each layer corresponds to a certain distortion level. Within each layer, each block is coded with a variable number of bits (which could be zero). The partitioning of bits between blocks is obtained using a Lagrangian optimization that dictates the partitioning or truncation points. The quality of the reproduction is proportional to the number of layers received.

The embedded coding scheme is similar in philosophy to the EZW and SPIHT algorithms; however, the data structures used are different. The EZW and SPIHT algorithms use trees of coefficients from the same spatial location across different bands. In the case of the EBCOT algorithm, each block resides entirely within a subband, and each block is coded independently of other blocks, which precludes the use of trees of the type used by EZW and SPIHT. Instead, the EBCOT algorithm uses a quadtree data structure. At the lowest level, we have a 2 × 2 set of blocks of coefficients. These are, in turn, organized into sets of 2 × 2 quads, and so on. A node in this tree is said to be significant at level n if any of its descendants are significant at that level. A coefficient $c_{ij}$ is said to be significant at level n if $|c_{ij}| \ge 2^n$. As in the case of EZW and SPIHT, the algorithm makes multiple passes, including significance map encoding passes and a magnitude refinement pass. The bits generated during these procedures are encoded using arithmetic coding.
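Both SPIHT and EBCOT effectively transmit the coefficient magnitudes bit plane by bit plane: a coefficient becomes significant in the pass whose threshold $2^n$ first drops to or below its magnitude, and later passes refine it one magnitude bit at a time. The fragment below, with an arbitrary set of coefficient values, is only meant to make that statement concrete.

```python
import numpy as np

coeffs = np.array([26, 13, 10, 6, -7, 4, -3, 0])          # arbitrary example values
n_max = int(np.floor(np.log2(np.max(np.abs(coeffs)))))    # index of the top bit plane

for n in range(n_max, -1, -1):
    significant = (np.abs(coeffs) >= 2 ** n).astype(int)  # who is significant at level n
    bits = (np.abs(coeffs) >> n) & 1                      # the n-th magnitude bit of each coefficient
    print(f"n={n}: significant={significant}, bit plane={bits}")
```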
  • 536. 15.11 Projects and Problems 513 15 . 10 Summary In this chapter we have introduced the concepts of wavelets and multiresolution analysis, and we have seen how we can use wavelets to provide an efficient decomposition of signals prior to compression. We have also described several compression techniques based on wavelet decomposition. Wavelets and their applications are currently areas of intensive research. F u r t h e r R e a d i n g 1. There are a number of excellent introductory books on wavelets. The one I found most accessible was Introduction to Wavelets and Wavelet Transforms—A Primer, by C.S. Burrus, R.A. Gopinath, and H. Guo [207]. 2. Probably the best mathematical source on wavelets is the book Ten Lectures on Wavelets, by I. Daubechies [58]. 3. There are a number of tutorials on wavelets available on the Internet. The best source for all matters related to wavelets (and more) on the Internet is “The Wavelet Digest” (http://www.wavelet.org). This site includes pointers to many other interesting and useful sites dealing with different applications of wavelets. 4. The JPEG 2000 standard is covered in detail in JPEG 2000: Image Compression Fundamentals, Standards and Practice, by D. Taubman and M. Marcellin [212]. 15 . 11 Projects and Problems 1. In this problem we consider the boundary effects encountered when using the short- term Fourier transform. Given the signal ft = sin2t (a) Find the Fourier transform F of ft. (b) Find the STFT F1 of ft using a rectangular window gt = 1 −2 ≤ t ≤ 2 0 otherwise for the interval −22 . (c) Find the STFT F2 of ft using a window gt = 1+cos 2 t −2 ≤ t ≤ 2 0 otherwise. (d) Plot F, F1, and F2. Comment on the effect of using different window functions.
  • 537. 514 15 W A V E L E T - B A S E D C O M P R E S S I O N 2. For the function ft = 1+sin2t 0 ≤ t ≤ 1 sin2t otherwise using the Haar wavelet find and plot the coefficients cjk, j = 012; k = 0 10. 3. For the seven-level decomposition shown below: 21 6 15 12 −6 3 6 3 3 −3 0 −3 3 0 0 0 (a) Find the bitstream generated by the EZW coder. (b) Decode the bitstream generated in the previous step. Verify that you get the original coefficient values. 4. Using the coefficients from the seven-level decomposition in the previous problem: (a) Find the bitstream generated by the SPIHT coder. (b) Decode the bitstream generated in the previous step. Verify that you get the original coefficient values.
  • 538. 16 Audio Coding 16 . 1 Overview L ossy compression schemes can be based on a source model, as in the case of speech compression, or a user or sink model, as is somewhat the case in image compression. In this chapter we look at audio compression approaches that are explicitly based on the model of the user. We will look at audio compression approaches in the context of audio compression standards. Principally, we will examine the different MPEG standards for audio compression. These include MPEG Layer I, Layer II, Layer III (or mp3) and the Advanced Audio Coding Standard. As with other standards described in this book, the goal here is not to provide all the details required for implementation. Rather the goal is to provide the reader with enough familiarity so that they can then find it much easier to understand these standards. 16 . 2 Introduction The various speech coding algorithms we studied in the previous chapter rely heavily on the speech production model to identify structures in the speech signal that can be used for compression. Audio compression systems have taken, in some sense, the opposite tack. Unlike speech signals, audio signals can be generated using a large number of different mechanisms. Lacking a unique model for audio production, the audio compression methods have focused on the unique model for audio perception, a psychoacoustic model for hearing. At the heart of the techniques described in this chapter is a psychoacoustic model of human perception. By identifying what can and, more important what cannot be heard, the schemes described in this chapter obtain much of their compression by discarding information that cannot be perceived. The motivation for the development of many of these perceptual coders was their potential application in broadcast multimedia. However, their major impact has been in the distribution of audio over the Internet.
  • 539. 516 16 A U D I O C O D I N G We live in an environment rich in auditory stimuli. Even an environment described as quiet is filled with all kinds of natural and artificial sounds. The sounds are always present and come to us from all directions. Living in this stimulus-rich environment, it is essential that we have mechanisms for ignoring some of the stimuli and focusing on others. Over the course of our evolutionary history we have developed limitations on what we can hear. Some of these limitations are physiological, based on the machinery of hearing. Others are psychological, based on how our brain processes auditory stimuli. The insight of researchers in audio coding has been the understanding that these limitations can be useful in selecting information that needs to be encoded and information that can be discarded. The limitations of human perception are incorporated into the compression process through the use of psychoacoustic models. We briefly describe the auditory model used by the most popular audio compression approaches. Our description is necessarily superficial and we refer readers interested in more detail to [97, 194]. The machinery of hearing is frequency dependent. The variation of what is perceived as equally loud at different frequencies was first measured by Fletcher and Munson at Bell Labs in the mid-1930s [96]. These measurements of perceptual equivalence were later refined by Robinson and Dadson. This dependence is usually displayed as a set of equal loudness curves, where the sound pressure level (SPL) is plotted as a function of frequency for tones perceived to be equally loud. Clearly, what two people think of as equally loud will be different. Therefore, these curves are actually averages and serve as a guide to human auditory perception. The particular curve that is of special interest to us is the threshold-of- hearing curve. This is the SPL curve that delineates the boundary of audible and inaudible sounds at different frequencies. In Figure 16.1 we show a plot of this audibility threshold in quiet. Sounds that lie below the threshold are not perceived by humans. Thus, we can see that a low amplitude sound at a frequency of 3 kHz may be perceptible while the same level of sound at 100 Hz would not be perceived. Threshold of audibility Audible region Inaudible region SPL (dB) 0 20 40 60 80 Frequency (Hz) 50 20 100 200 500 1000 2000 5000 10,000 20,000 F I G U R E 16. 1 A typical plot of the audibility threshold.
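The threshold-of-hearing curve in Figure 16.1 is often represented by an analytical approximation. A widely used one, due to Terhardt and common in the perceptual audio coding literature, is sketched below; the specific constants, and the suggestion that this is exactly the curve in the figure, are assumptions on my part rather than something given in the text.

```python
import numpy as np

def threshold_in_quiet(f_hz):
    # Approximate absolute threshold of hearing (dB SPL) as a function of frequency.
    f = np.asarray(f_hz, dtype=float) / 1000.0     # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

print(threshold_in_quiet(100.0))     # ~23 dB SPL: a low-level tone here is inaudible
print(threshold_in_quiet(3000.0))    # ~-4.6 dB SPL: the ear is most sensitive near 3-4 kHz
```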
  • 540. 16.2 Introduction 517 16 . 2 . 1 S p e c t r a l M a s k i n g Lossy compression schemes require the use of quantization at some stage. Quantization can be modeled as as an additive noise process in which the output of the quantizer is the input plus the quantization noise. To hide quantization noise, we can make use of the fact that signals below a particular amplitude at a particular frequency are not audible. If we select the quantizer step size such that the quantization noise lies below the audibility threshold, the noise will not be perceived. Furthermore, the threshold of audibility is not absolutely fixed and typically rises when multiple sounds impinge on the human ear. This phenomenon gives rise to spectral masking. A tone at a certain frequency will raise the threshold in a critical band around that frequency. These critical bands have a constant Q, which is the ratio of frequency to bandwidth. Thus, at low frequencies the critical band can have a bandwidth as low as 100 Hz, while at higher frequencies the bandwidth can be as large as 4 kHz. This increase of the threshold has major implications for compression. Consider the situation in Figure 16.2. Here a tone at 1 kHz has raised the threshold of audibility so that the adjacent tone above it in frequency is no longer audible. At the same time, while the tone at 500 Hz is audible, because of the increase in the threshold the tone can be quantized more crudely. This is because increase of the threshold will allow us to introduce more quantization noise at that frequency. The degree to which the threshold is increased depends on a variety of factors, including whether the signal is sinusoidal or atonal. 16 . 2 . 2 T e m p o r a l M a s k i n g Along with spectral masking, the psychoacoustic coders also make use of the phenomenon of temporal masking. The temporal masking effect is the masking that occurs when a sound raises the audibility threshold for a brief interval preceding and following the sound. In Figure 16.3 we show the threshold of audibility close to a masking sound. Sounds that occur in an interval around the masking sound (both after and before the masking tone) can be masked. If the masked sound occurs prior to the masking tone, this is called premasking Inaudible region Frequency (Hz) SPL (dB) 0 20 40 60 80 Original threshold of audibility Audible region Raised threshold of audibility 50 20 100 200 500 1000 2000 5000 10,000 20,000 F I G U R E 16. 2 Change in the audibility threshold.
  • 541. 518 16 A U D I O C O D I N G SPL (dB) 20 40 60 80 0 100 200 300 400 Masking sound P o s t m a s k i n g P r e m a s k i n g –100 Time (msec) F I G U R E 16. 3 Change in the audibility threshold in time. or backward masking, and if the sound being masked occurs after the masking tone this effect is called postmasking or forward masking. The forward masking remains in effect for a much longer time interval than the backward masking. 16 . 2 . 3 P s y c h o a c o u s t i c M o d e l These attributes of the ear are used by all algorithms that use