Simple and Fast Implementation of Segmented Matrix Algorithm for Haar DWT on a Low Cost GPU

ACEEE Int. J. on Signal & Image Processing, Vol. 03, No. 01, Jan 2012

Madeena Sultana¹ and Nurul Muntasir Mamun²
¹ Dept. of Computer Science and Engineering, Jahangirnagar University, Dhaka, Bangladesh
Email: deena.sultana@gmail.com
² Dept. of Applied Physics Electronics and Communication Engineering, Dhaka University, Dhaka, Bangladesh
Email: mamun.muntasir@yahoo.com

Abstract— The Haar discrete wavelet transform (DWT), the simplest among all DWTs, has diverse applications in signal and image processing. A traditional approach to the 2D Haar DWT is a 1D row operation followed by a 1D column operation. In 2002, Chen and Liao presented a fast algorithm for the 2D Haar DWT based on a segmented matrix. However, this method is infeasible for processing large images because of its high computational requirements. In this paper, we have implemented the segmented matrix algorithm on a low cost NVIDIA GPU to achieve speedup in computation. The efficiency of our GPU based implementation is measured and compared with CPU based algorithms. Our experimental results show a performance improvement by a factor of more than 28.5 compared with Chen and Liao’s CPU based segmented matrix algorithm and by a factor of 8 compared with MATLAB’s wavelet function for an image of size 2560×2560.

Index Terms—Haar discrete wavelet transform (DWT), CUDA, GPU, segmented matrix algorithm, parallel discrete wavelet transform

I. BACKGROUND AND INTRODUCTION

Discrete wavelet transforms (DWTs) have been used in a wide range of signal and image processing applications such as image and video coding (MPEG-4 or JPEG 2000), pattern recognition, image watermarking, and medical image analysis. In the traditional approach, the 2D (two-dimensional) Haar DWT is performed in two phases, one row operation and one column operation, and the column operation cannot be performed until the row operation is completed. Therefore, the speed of computation degrades significantly. To address this problem, Chen and Liao [1] proposed the segmented matrix algorithm, where the computation is performed by data rearrangements and one matrix multiplication. This simple algorithm can therefore produce the same results as the traditional 2D Haar DWT at a much faster speed. Moreover, it is highly suitable for parallel implementation, as only two rows are involved in the computation at a time.

Nowadays large images are common due to the availability and advancement of image capturing technology. Therefore, many wavelet based applications have to manage large-scale image processing. Parallel computing is a direct way of meeting these high computation requirements. A significant amount of work has already been done for all sorts of high performance computers, for special purpose hardware [2]-[4], for FPGAs [5][6], and for SIMD architectures [7]. A considerable amount of speedup has also been achieved by employing GPUs with OpenGL and Cg-based implementations for DWT computations [8]-[10].

However, GPU accelerated computation has become especially interesting since early 2007, when NVIDIA introduced CUDA (Compute Unified Device Architecture) enabled GPUs, which offer massive parallel computation power. Providing many hundreds of gigaflops of processing power, current GPUs exploit parallel computation more efficiently than a CPU [11].

Harnessed by many researchers, these commodity and readily available GPUs are providing dramatic computation speedup in various research fields. Joaquín Franco, Gregorio Bernabé, Juan Fernández and Manuel E. Acacio [12] achieved significant speedup with NVIDIA’s Tesla C870 over Intel’s Core 2 Quad Q6700 (2.66GHz). Vaclav Simek and Ram Rakesh Asn [13] used a CUDA enabled GPU for accelerated 2D wavelet based image compression. Recently, Wladimir J. van der Laan, Andrei C. Jalba and Jos B.T.M. Roerdink [14] implemented a fast hybrid method for the 2D DWT on CUDA for both 2D images and 3D volume data.

In this paper we have implemented the segmented matrix algorithm for the 2D Haar wavelet transform on a low cost, commodity GPU. Our objective is to achieve a computation speedup for processing large images without increasing computational complexity and cost.

II. TRADITIONAL COMPUTATION

The Haar DWT is the simplest DWT since it uses only two low pass filter coefficients (1, 1) and two high pass filter coefficients (1, -1). The Haar wavelet transform in the frequency domain can be obtained by addition and subtraction of the pixels of an image. The 2D Haar DWT decomposes an input image into four sub-bands: one average component (WLL) and three detail components (WLH, WHL, WHH).

Traditionally, the 2D Haar wavelet transform is accomplished by one row operation and one column operation, where the result of the row transform is the input of the column transform. Fig. 1 shows the 2D Haar wavelet transform of a 4×4 image.
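
The row-then-column procedure of the traditional approach can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `haar_rows`, `transpose` and `haar_2d_traditional` are hypothetical, the filters are applied unnormalised as (1, 1) and (1, -1), and the image is assumed to have even height and width.

```python
def haar_rows(matrix):
    """One 1D Haar pass over every row: pairwise sums, then pairwise differences."""
    result = []
    for row in matrix:
        sums = [row[k] + row[k + 1] for k in range(0, len(row), 2)]
        diffs = [row[k] - row[k + 1] for k in range(0, len(row), 2)]
        result.append(sums + diffs)
    return result


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def haar_2d_traditional(image):
    """Row operation first; the column operation needs its complete output."""
    row_pass = haar_rows(image)
    # Column pass = row pass applied to the transpose, then transposed back.
    return transpose(haar_rows(transpose(row_pass)))


# Illustrative 4x4 input (assumed values, not taken from Fig. 1).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
W = haar_2d_traditional(image)
```

In this sketch the top-left quadrant of `W` holds the sums over each 2×2 block (the average component) and the other three quadrants hold the detail components. The second pass cannot start until the first has finished, which is the dependency the segmented matrix algorithm avoids.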

© 2012 ACEEE 32
DOI: 01.IJSIP.03.01.117


Figure 1. 2D Haar DWT of a 4×4 image by the traditional approach.

III. SEGMENTED MATRIX ALGORITHM

Chen and Liao [1] proposed a computationally fast algorithm called the “segmented matrix algorithm”, where the 2D Haar DWT is performed by only one matrix multiplication instead of two separate 1D transforms. The step by step process of this algorithm is as follows.

Step 1: Consider I as the input image of size m×n. Form 2×2 sub-blocks Bij from the original image I, where i=1…m/2 and j=1…n/2. [Example not preserved in the source text.]

Step 2: Z-scan each Bij and generate the (m/2)×(n/2) row vectors Aij. [Example not preserved in the source text.]

Step 3: Express these row vectors as an intermediate matrix M.

Step 4: Consider the filter coefficient matrix C. [Matrix not preserved in the source text.] Find H = M×C.

Step 5: The Haar wavelet transform can be divided into four sub-matrices of size m/2 × n/2. The rearrangement of the elements of H into four sub-matrices will produce the resultant Haar wavelet transform matrix W. The rearrangements are as follows:
a) The elements in the first column of H are filled in WLL row by row.
b) The elements in the second column of H are filled in WHL row by row.
c) The elements in the third column of H are filled in WLH row by row.
d) The elements in the fourth column of H are filled in WHH row by row.

IV. CUDA IMPLEMENTATION

The CUDA platform is currently attracting enormous attention due to its tremendous potential for parallel processing. In November 2006, NVIDIA introduced CUDA with a new parallel programming model and instruction set architecture to solve many complex computational problems very efficiently [11]. Each CUDA compliant device is a set of multiprocessor cores, where each core has a SIMT (Single Instruction, Multiple Thread) architecture. Today, four quad-core CPUs can run only 16 threads concurrently, whereas the smallest executable parallel unit on a CUDA device comprises 32 threads. All CUDA enabled NVIDIA GPUs support at least 768 concurrently active threads per multiprocessor. Moreover, some GPUs can support 1,024 or more active threads per multiprocessor [11]. Devices comprising 30 multiprocessors (e.g. NVIDIA GeForce GTX 280) can support more than 30,000 active threads [15]. A good parallel implementation of an application on a GPU can achieve more than 100 times speedup over sequential execution [16].

In the SIMT architecture of CUDA, a portion of a parallel application is executed many times independently on different data, by many threads running on different processors, at any given clock cycle. This parallel portion can be isolated into a function called a kernel. A kernel is organized as a set of thread blocks, and each thread block is, in turn, organized as a three-dimensional array of threads. Threads within the same block can efficiently cooperate through shared memory and can synchronize with each other. Each thread has its own unique thread ID, which is defined by the three thread indices threadIdx.x, threadIdx.y, and threadIdx.z. Each block is identified by a unique two-dimensional coordinate given by the CUDA specific keywords blockIdx.x and blockIdx.y. All blocks must have an equal number of threads organized in exactly the same manner. The use of multidimensional identifiers simplifies memory addressing of multidimensional data. The block and grid dimensions, collectively known as the execution configuration, can be set at run-time.

In our implementation we have used blocks of 16×16 threads each. The grid size is set at run-time according to the size of the input image. Our CUDA implementation consists of the following steps:
1. Copy image data from host memory to GPU memory.
2. Determine the execution configuration.
3. The GPU executes a kernel to compute the elements of the intermediate matrix M on each core in a parallel fashion.



4. The resulting matrix H is computed simultaneously on the GPU.
5. Copy the result from GPU memory to host memory.

The CPU based algorithm is implemented on an Intel Pentium IV 3.00GHz processor equipped with 512MB DDR2 RAM. The GPU based algorithm is tested on an NVIDIA GeForce 8500GT graphics card containing 16 cores, with a maximum of 512 threads per block and 512 MB global memory.

V. RESULTS AND DISCUSSION

To test the computational efficiency of our GPU based segmented matrix algorithm, we have taken images of different sizes as inputs. Fig. 2 shows the one level 2D Haar DWT of the 256×256 Lena image using the CPU based and GPU based segmented matrix algorithms. For comparison we have also considered MATLAB’s dwt2() function from the Wavelet Toolbox and the CPU implementation of the segmented matrix algorithm.

Table I shows that the performance of the CPU based segmented matrix algorithm declined noticeably for large images, although it performed better for small images. In contrast, the GPU based implementation of this algorithm improved the performance by a factor of 10 to 28 for images sized 1024×1024 to 2560×2560. Moreover, it performed better than MATLAB’s wavelet function for all small and large images. Therefore, among the three algorithms, our GPU based segmented matrix algorithm performed the best for high resolution images.

However, the main drawback of GPU computation is the transfer time between host memory and device memory. Copying data from the host’s memory to the GPU’s global memory takes a large fraction of the total execution time. Therefore, if we excluded the data transfer time from the execution time, we would get a significant speedup for large images.
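
For reference, the CPU version of the segmented matrix algorithm compared in this section can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `segmented_haar` is hypothetical, the image is assumed to have even height and width, and the coefficient matrix `C` is an assumption (the unnormalised ±1 Haar matrix, with its columns ordered to match the WLL, WHL, WLH, WHH rearrangement of Section III), since the text does not spell the matrix out.

```python
# Assumed filter coefficient matrix: column 0 -> LL, 1 -> HL, 2 -> LH, 3 -> HH.
C = [[1,  1,  1,  1],
     [1, -1,  1, -1],
     [1,  1, -1, -1],
     [1, -1, -1,  1]]


def segmented_haar(image):
    """One-level 2D Haar DWT via a single matrix product H = M x C."""
    m, n = len(image), len(image[0])
    # Steps 1-3: Z-scan every 2x2 block into one row of the intermediate matrix M.
    M = [[image[i][j], image[i][j + 1], image[i + 1][j], image[i + 1][j + 1]]
         for i in range(0, m, 2) for j in range(0, n, 2)]
    # Step 4: H = M x C; each row of H depends on exactly one 2x2 block.
    H = [[sum(row[k] * C[k][c] for k in range(4)) for c in range(4)]
         for row in M]

    # Step 5: column c of H fills sub-band c row by row.
    def band(c):
        column = [h[c] for h in H]
        width = n // 2
        return [column[r * width:(r + 1) * width] for r in range(m // 2)]

    return {"LL": band(0), "HL": band(1), "LH": band(2), "HH": band(3)}


# Illustrative 4x4 input (assumed values).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
bands = segmented_haar(image)
```

Because every row of `H` is computed from a single 2×2 block, the rows are independent of one another, which is the property that lets the work be spread over many GPU threads.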

Figure 2. One level 2D Haar DWT using (a) CPU based and (b) GPU based segmented matrix algorithm.

Table I presents the comparison of the computing time of MATLAB’s dwt2() and of the segmented matrix algorithm on CPU and on GPU with increasing size of input images.

TABLE I. COMPUTATION TIME COMPARISON RELATIVE TO IMAGE SIZE

CONCLUSIONS

The widespread usage of the Haar Discrete Wavelet Transform (DWT) has motivated the implementation of a simple and low cost GPU based DWT algorithm. Our experimental results show that for an image of size 2560×2560, the GPU based segmented matrix algorithm is more than 28.5 times faster than the CPU computation, including data transfer. Moreover, this GPU based method achieved an approximately 8x speedup over the CPU based computation of MATLAB’s dwt2() for the same image. Given these fast calculations, we believe that the ideas presented in this paper will have widespread applications in processing large images.

REFERENCES

[1] P. Y. Chen and E. C. Liao, “A new algorithm for Haar discrete wavelet transform,” IEEE International Symposium on Intelligent Signal Processing and Communication Systems, pp. 453-457, 2002.
[2] M. Martina, G. Masera, G. Piccinini and M. Zamboni, “A VLSI Architecture for IWT (Integer Wavelet Transform),” Proc. of 43rd Midwest Symposium on Circuits and Systems, pp. 1174-1177, August 2000.
[3] K. Haapala, P. Kolinummi, T. Hamalainen, and J. Saarinen, “Parallel DSP implementation of wavelet transform in image compression,” Proc. of ISCAS IEEE International Symposium on Circuits and Systems, vol. 5, pp. 89-92, 2000.
[4] Matthias Hopf and Thomas Ertl, “Hardware Accelerated Wavelet Transformations,” Proc. of EG/IEEE TCVG Symposium on Visualization VisSym, pp. 93–103, 2000.
[5] C. Graves and C. Gloster, “Use of dynamically reconfigurable logic in adaptive wavelet packet applications,” Proc. of the 5th Canadian Workshop on Field-Programmable Devices, June 1998.
[6] Baofeng Li, Yong Dou, Haifang Zhou, and Xingming Zhou, “FPGA accelerator for wavelet-based automated global image registration,” EURASIP J. Embedded Syst., pp. 1–10, 2009.
[7] Mats Holmström, “Parallelizing the fast wavelet transform,” Parallel Computing, vol. 21(11), pp. 1837-1848, April 1995.
[8] T. T. Wong, C. S. Leung, P. A. Heng, and J. Wang, “Discrete wavelet transform on consumer-level graphics hardware,” IEEE Transactions on Multimedia, vol. 9(3), pp. 668–673, April 2007.



[9] C. Tenllado, J. Setoain, M. Prieto, L. Piñuel, and F. Tirado, “Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting,” IEEE Trans. Parallel Distrib. Syst., vol. 19(3), pp. 299-310, 2008.
[10] Antonio Garcia and Han-Wei Shen, “GPU-based 3D wavelet reconstruction with tileboarding,” The Visual Computer, vol. 21(8), pp. 755-763, September 2005.
[11] “NVIDIA CUDA C Programming Guide 4.0,” Available at http://developer.nvidia.com/cuda-toolkit-3.1-downloads, accessed August 02, 2011.
[12] Joaquín Franco, Gregorio Bernabé, Juan Fernández, and Manuel E. Acacio, “A Parallel Implementation of the 2D Wavelet Transform Using CUDA,” Proc. of International Conf. on Parallel, Distributed and Network-based Processing, pp. 111-118, 2009.
[13] Vaclav Simek and Ram Rakesh Asn, “GPU Acceleration of 2D-DWT Image Compression in MATLAB with CUDA,” Second UKSIM European Symposium on Computer Modeling and Simulation, pp. 274-277, 2008.
[14] Wladimir J. van der Laan, Andrei C. Jalba, and Jos B.T.M. Roerdink, “Accelerating Wavelet Lifting on Graphics Hardware Using CUDA,” IEEE Transactions on Parallel and Distributed Systems, vol. 22(1), pp. 132-146, January 2011.
[15] “CUDA C best practices guide,” Available at http://developer.nvidia.com/cuda-toolkit-31-downloads, accessed August 05, 2011.
[16] David B. Kirk and Wen-mei W. Hwu, Programming massively parallel processors: a hands-on approach, Elsevier Inc., USA, January 22, 2010.
