ACEEE Int. J. on Signal & Image Processing, Vol. 03, No. 01, Jan 2012



Simple and Fast Implementation of Segmented Matrix Algorithm for Haar DWT on a Low Cost GPU

Madeena Sultana¹ and Nurul Muntasir Mamun²
¹ Dept. of Computer Science and Engineering, Jahangirnagar University, Dhaka, Bangladesh
Email: deena.sultana@gmail.com
² Dept. of Applied Physics Electronics and Communication Engineering, Dhaka University, Dhaka, Bangladesh
Email: mamun.muntasir@yahoo.com


Abstract— Haar discrete wavelet transform (DWT), the simplest among all DWTs, has diverse applications in signal and image processing fields. A traditional approach for 2D Haar DWT is a 1D row operation followed by a 1D column operation. In 2002, Chen and Liao presented a fast algorithm for 2D Haar DWT based on a segmented matrix. However, this method becomes impractical for large images because of its high computational requirements. In this paper, we have implemented the segmented matrix algorithm on a low cost NVIDIA GPU to achieve speedup in computation. The efficiency of our GPU based implementation is measured and compared with CPU based algorithms. Our experimental results show a performance improvement of more than a factor of 28.5 compared with Chen and Liao’s CPU based segmented matrix algorithm and a factor of 8 compared with MATLAB’s wavelet function for an image of size 2560×2560.

Index Terms—Haar discrete wavelet transform (DWT), CUDA, GPU, segmented matrix algorithm, parallel discrete wavelet transform

I. BACKGROUND AND INTRODUCTION

Discrete wavelet transforms (DWTs) have been used in a wide range of signal and image processing applications such as image and video coding (MPEG-4 or JPEG 2000), pattern recognition, image watermarking and medical image analysis. In the traditional approach, the 2D (two-dimensional) Haar DWT is performed in two phases, one row operation and one column operation, and the column operation cannot be performed until the row operation is completed. Therefore, the speed of computation degrades significantly. To address this problem, Chen and Liao [1] proposed the segmented matrix algorithm, where computation is performed by data rearrangements and one matrix multiplication. This simple algorithm can therefore produce the same results as the traditional 2D Haar DWT at a much higher speed. Moreover, it is highly suitable for parallel implementation, as only two rows are involved in computation at a time.

Nowadays large images are common due to the availability and advancement of image capturing technology. Therefore many wavelet based applications have to manage large scale image processing. Parallel computing is a direct way of speeding up such computationally demanding tasks. A significant amount of work has already been done for all sorts of high performance computers, for special purpose hardware [2]-[4], for FPGAs [5][6] and for SIMD architectures [7]. A considerable amount of speedup has also been achieved by employing GPUs with OpenGL and Cg-based implementations for DWT computations [8]-[10].

However, GPU accelerated computation has become especially interesting since early 2007, when NVIDIA introduced CUDA (Compute Unified Device Architecture) enabled GPUs, which offer massive parallel computation power. Providing many hundreds of gigaflops of processing power, current GPUs exploit parallel computation more efficiently than a CPU [11].

Harnessed by many researchers, these commodity and readily available GPUs are providing dramatic computation speedup in various research fields. Joaquín Franco, Gregorio Bernabé, Juan Fernández and Manuel E. Acacio [12] achieved significant speedup with NVIDIA’s Tesla C870 over Intel’s Core 2 Quad Q6700 (2.66 GHz). Vaclav Simek and Ram Rakesh Asn [13] used a CUDA enabled GPU for accelerated 2D wavelet based image compression. Recently, Wladimir J. van der Laan, Andrei C. Jalba and Jos B.T.M. Roerdink [14] implemented a fast hybrid method for 2D DWT on CUDA for both 2D images and 3D volume data.

In this paper we have implemented the segmented matrix algorithm for the 2D Haar wavelet transform on a low cost, commodity GPU. Our objective is to speed up the processing of large images without increasing computational complexity or cost.

II. TRADITIONAL COMPUTATION

Haar DWT is the simplest DWT since it uses only two low pass filter coefficients (1,1) and two high pass filter coefficients (1,-1). The Haar wavelet transform in the frequency domain can be obtained by addition and subtraction of the pixels of an image. The 2D Haar DWT decomposes an input image into four sub-bands: one average component (WLL) and three detail components (WLH, WHL, WHH).

Traditionally, the 2D Haar wavelet transform is accomplished by one row operation and one column operation, where the result of the row transform is the input of the column transform. Fig. 1 represents the 2D Haar wavelet transform of a 4×4 image.
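
As an illustration of the row-then-column procedure described in Section II, the following Python sketch applies pairwise additions and subtractions along the rows and then along the columns. It is only a sketch under stated assumptions, not the code used in this paper: the function names are hypothetical, the image is assumed to have even width and height, and no normalization factor is applied because the text gives the filter coefficients as (1,1) and (1,-1).

```python
def haar_rows(matrix):
    """One 1D Haar step on every row: pairwise sums first, then pairwise differences."""
    out = []
    for row in matrix:
        lows = [row[k] + row[k + 1] for k in range(0, len(row), 2)]   # low pass (1, 1)
        highs = [row[k] - row[k + 1] for k in range(0, len(row), 2)]  # high pass (1, -1)
        out.append(lows + highs)
    return out


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def haar_dwt2_traditional(image):
    """Row operation followed by column operation (unnormalized, one level)."""
    row_pass = haar_rows(image)
    # The column operation is the row operation applied to the transpose,
    # so it cannot start before the row pass has finished.
    return transpose(haar_rows(transpose(row_pass)))


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    for row in haar_dwt2_traditional(image):
        print(row)
```

The top-left quadrant of the result holds the average component; the data dependency between the two passes is what the segmented matrix algorithm removes.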



© 2012 ACEEE                                                      32
DOI: 01.IJSIP.03.01.117


Figure 1. 2D Haar DWT of a 4×4 image by traditional approach.

III. SEGMENTED MATRIX ALGORITHM

Chen and Liao [1] proposed a computationally fast algorithm called the “segmented matrix algorithm”, where the 2D Haar DWT is performed by only one matrix multiplication instead of two separate 1D transforms. The step by step process of this algorithm is as follows.

Step 1: Consider I as the input image of size m×n. Form 2×2 sub-blocks Bij from the original image I, where i=1…m/2 and j=1…n/2. For example, a 4×4 image yields the four sub-blocks B11, B12, B21 and B22.

Step 2: Z-scan each Bij to generate (m/2)×(n/2) row vectors Aij, one per sub-block. For example, a sub-block whose first row is (a, b) and whose second row is (c, d) gives the row vector [a b c d].

Step 3: Express these row vectors as an intermediate matrix M, with one row per Aij.

Step 4: Consider the 4×4 filter coefficient matrix C. Find H = M×C.

Step 5: The Haar wavelet transform can be divided into four sub-matrices of size m/2 × n/2: WLL, WHL, WLH and WHH. The rearrangement of the elements of H into these four sub-matrices produces the resultant Haar wavelet transform matrix W.

The rearrangements are as follows:
a) The elements in the first column of H are filled in WLL row by row.
b) The elements in the second column of H are filled in WHL row by row.
c) The elements in the third column of H are filled in WLH row by row.
d) The elements in the fourth column of H are filled in WHH row by row.

IV. CUDA IMPLEMENTATION

The CUDA platform is currently attracting enormous attention due to its tremendous potential for parallel processing. In November 2006, NVIDIA introduced CUDA with a new parallel programming model and instruction set architecture to solve many complex computational problems very efficiently [11]. Each CUDA-compliant device is a set of multiprocessor cores where each core has a SIMT (Single Instruction, Multiple Thread) architecture. Today four quad-core CPUs can run only 16 threads concurrently, whereas the smallest executable parallel unit on a CUDA device comprises 32 threads. All CUDA enabled NVIDIA GPUs support at least 768 concurrently active threads per multiprocessor. Moreover, some GPUs can support 1,024 or more active threads per multiprocessor [11]. Devices comprising 30 multiprocessors (e.g. NVIDIA GeForce GTX 280) can support more than 30,000 active threads [15]. A good parallel implementation of an application on a GPU can achieve more than 100 times speedup over sequential execution [16].

In the SIMT architecture of CUDA, a portion of a parallel application is executed many times independently on different data, by many threads running on different processors at any given clock cycle. This parallel portion can be isolated into a function called a kernel. A kernel is organized as a set of thread blocks, and each thread block is, in turn, organized as a three-dimensional array of threads. Threads within the same block can efficiently cooperate through shared memory and can synchronize with each other. Each thread has its own unique thread ID, which is defined by the three thread indices threadIdx.x, threadIdx.y and threadIdx.z. Each block is identified by a unique two-dimensional coordinate given by the CUDA specific keywords blockIdx.x and blockIdx.y. All blocks must have the same number of threads, organized in exactly the same manner. The use of multidimensional identifiers simplifies memory addressing of multidimensional data. The block and grid dimensions, collectively known as the execution configuration, can be set at run-time.

In our implementation we have used blocks each having 16×16 threads. The grid size is set at run-time according to the size of the input image. Our CUDA implementation consists of the following steps:
1. Copy image data from host memory to GPU memory.
2. Determine the execution configuration.
3. GPU executes kernel to compute the elements of the intermediate matrix M on each core in a parallel fashion.



4. The resulting matrix H is computed simultaneously on the GPU.
5. Copy the result from GPU memory to host memory.

The CPU based algorithm is implemented on an Intel Pentium IV 3.00 GHz processor equipped with 512 MB DDR2 RAM. The GPU based algorithm is tested on an NVIDIA GeForce 8500 GT graphics card containing 16 cores, with a maximum of 512 threads per block and 512 MB of global memory.

V. RESULTS AND DISCUSSION

To test the computational efficiency of our GPU based segmented matrix algorithm, we have taken images of different sizes as inputs. Fig. 2 shows the one level 2D Haar DWT of the 256×256 Lena image using the CPU based and GPU based segmented matrix algorithms. For comparison we have also considered MATLAB’s dwt2() function from the Wavelet Toolbox and the CPU implementation of the segmented matrix algorithm.

Figure 2. One level 2D Haar DWT using (a) CPU based and (b) GPU based segmented matrix algorithm.

Table I presents the comparison of computing time of MATLAB’s dwt2(), the segmented matrix algorithm on CPU and the segmented matrix algorithm on GPU with increasing size of input images.

TABLE I. COMPUTATION TIME COMPARISON RELATIVE TO IMAGE SIZE

Table I shows that the performance of the CPU based segmented matrix algorithm declined noticeably for large images, although it performed better for small images. In contrast, the GPU based implementation of this algorithm improved the performance by a factor of 10 to 28 for images of size 1024×1024 to 2560×2560. Moreover, it performed better than MATLAB’s wavelet function for all image sizes, small and large. Therefore, among the three algorithms our GPU based segmented matrix algorithm performed the best for high resolution images.

However, the main drawback of GPU computation is the transfer time between host memory and device memory. Copying data from the host’s memory to the GPU’s global memory takes a large fraction of the total execution time. Therefore, if we exclude the data transfer time from the execution time, we would get an even larger speedup for large images.

CONCLUSIONS

The widespread usage of the Haar Discrete Wavelet Transform (DWT) has motivated the implementation of a simple and low cost GPU based DWT algorithm. Our experimental results show that for an image of size 2560×2560, the GPU based segmented matrix algorithm is more than 28.5 times faster than the CPU computation, including data transfer. Moreover, this GPU based method achieved approximately 8× speedup over the CPU based computation of MATLAB’s dwt2() for the same image. Because of this speed, we believe that the ideas presented in this paper will have widespread applications in processing large images.

REFERENCES

[1] P. Y. Chen and E. C. Liao, “A new algorithm for Haar discrete wavelet transform,” IEEE International Symposium on Intelligent Signal Processing and Communication Systems, pp. 453-457, 2002.
[2] M. Martina, G. Masera, G. Piccinini and M. Zamboni, “A VLSI Architecture for IWT (Integer Wavelet Transform),” Proc. of 43rd Midwest Symposium on Circuits and Systems, pp. 1174-1177, August 2000.
[3] K. Haapala, P. Kolinummi, T. Hamalainen, and J. Saarinen, “Parallel DSP implementation of wavelet transform in image compression,” Proc. of ISCAS IEEE International Symposium on Circuits and Systems, vol. 5, pp. 89-92, 2000.
[4] Matthias Hopf and Thomas Ertl, “Hardware Accelerated Wavelet Transformations,” Proc. of EG/IEEE TCVG Symposium on Visualization VisSym, pp. 93-103, 2000.
[5] C. Graves and C. Gloster, “Use of dynamically reconfigurable logic in adaptive wavelet packet applications,” Proc. of the 5th Canadian Workshop on Field-Programmable Devices, June 1998.
[6] Baofeng Li, Yong Dou, Haifang Zhou, and Xingming Zhou, “FPGA accelerator for wavelet-based automated global image registration,” EURASIP J. Embedded Syst., pp. 1-10, 2009.
[7] Mats Holmström, “Parallelizing the fast wavelet transform,” Parallel Computing, vol. 21(11), pp. 1837-1848, 1995.
[8] T. T. Wong, C. S. Leung, P. A. Heng, and J. Wang, “Discrete wavelet transform on consumer-level graphics hardware,” IEEE Transactions on Multimedia, vol. 9(3), pp. 668-673, April 2007.



[9] C. Tenllado, J. Setoain, M. Prieto, L. Piñuel, and F. Tirado, “Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting,” IEEE Trans. Parallel Distrib. Syst., vol. 19(3), pp. 299-310, 2008.
[10] Antonio Garcia and Han-Wei Shen, “GPU-based 3D wavelet reconstruction with tileboarding,” The Visual Computer, vol. 21(8), pp. 755-763, September 2005.
[11] “NVIDIA CUDA C Programming Guide 4.0,” Available at http://developer.nvidia.com/cuda-toolkit-3.1-downloads, accessed August 02, 2011.
[12] Joaquín Franco, Gregorio Bernabé, Juan Fernández, and Manuel E. Acacio, “A Parallel Implementation of the 2D Wavelet Transform Using CUDA,” Proc. of International Conf. on Parallel, Distributed and Network-based Processing, pp. 111-118, 2009.
[13] Vaclav Simek and Ram Rakesh Asn, “GPU Acceleration of 2D-DWT Image Compression in MATLAB with CUDA,” Second UKSIM European Symposium on Computer Modeling and Simulation, pp. 274-277, 2008.
[14] Wladimir J. van der Laan, Andrei C. Jalba, and Jos B.T.M. Roerdink, “Accelerating Wavelet Lifting on Graphics Hardware Using CUDA,” IEEE Transactions on Parallel and Distributed Systems, vol. 22(1), pp. 132-146, January 2011.
[15] “CUDA C best practices guide,” Available at http://developer.nvidia.com/cuda-toolkit-31-downloads, accessed August 05, 2011.
[16] David B. Kirk and Wen-mei W. Hwu, Programming massively parallel processors: a hands-on approach, Elsevier Inc., USA, January 22, 2010.
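
For readers who wish to experiment, the segmented matrix algorithm of Section III and the thread mapping of Section IV can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in this paper. The entries of the coefficient matrix C, the placement of WHL in the top-right and WLH in the bottom-left quadrant of W, the one-thread-per-sub-block mapping, and all function names are assumptions made for illustration. Only the 16×16 block shape and the column-to-sub-band rules a) to d) are taken from the text. The image is assumed to have even width and height.

```python
# Assumed unnormalized 4x4 Haar coefficient matrix C.
# Its columns produce, in order, the LL, HL, LH and HH value of a sub-block.
C = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

BLOCK = 16  # threads per block edge (16x16 threads per block, as in Section IV)


def grid_size(m, n, block=BLOCK):
    """Blocks needed to cover the (m/2) x (n/2) sub-blocks with one thread each."""
    # -(-x // block) is ceiling division for non-negative x.
    return (-(-(n // 2) // block), -(-(m // 2) // block))


def thread_work(image, i, j):
    """Work of one hypothetical thread: Z-scan sub-block B_ij, then one row of H = M x C."""
    a_ij = [image[2 * i][2 * j], image[2 * i][2 * j + 1],
            image[2 * i + 1][2 * j], image[2 * i + 1][2 * j + 1]]
    return [sum(a_ij[k] * C[k][col] for k in range(4)) for col in range(4)]


def haar_dwt2_segmented(image):
    """One-level 2D Haar DWT by rearrangement and a single multiplication by C."""
    m, n = len(image), len(image[0])
    hm, hn = m // 2, n // 2
    w = [[0] * n for _ in range(m)]
    # On a GPU these two loops would be replaced by the thread grid.
    for i in range(hm):
        for j in range(hn):
            ll, hl, lh, hh = thread_work(image, i, j)
            w[i][j] = ll               # first column of H  -> W_LL
            w[i][j + hn] = hl          # second column of H -> W_HL
            w[i + hm][j] = lh          # third column of H  -> W_LH
            w[i + hm][j + hn] = hh     # fourth column of H -> W_HH
    return w


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    for row in haar_dwt2_segmented(image):
        print(row)
    print(grid_size(2560, 2560))
```

Each sub-block is processed independently of all others, which is why the work maps naturally onto one thread per sub-block; with the assumed C and quadrant layout, the result for a small test image can be checked by hand against row-then-column sums and differences.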




© 2012 ACEEE                                                           35
DOI: 01.IJSIP.03.01.117

More Related Content

PDF
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
PPTX
Design and implementation of DADCT
PDF
Ad24210214
PDF
SECURED COLOR IMAGE WATERMARKING TECHNIQUE IN DWT-DCT DOMAIN
PDF
Bj31416421
PDF
Highly Parallel Pipelined VLSI Implementation of Lifting Based 2D Discrete Wa...
PDF
Robust Watermarking through Dual Band IWT and Chinese Remainder Theorem
PDF
High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter ...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
Design and implementation of DADCT
Ad24210214
SECURED COLOR IMAGE WATERMARKING TECHNIQUE IN DWT-DCT DOMAIN
Bj31416421
Highly Parallel Pipelined VLSI Implementation of Lifting Based 2D Discrete Wa...
Robust Watermarking through Dual Band IWT and Chinese Remainder Theorem
High Speed and Time Efficient 1-D DWT on Xilinx Virtex4 DWT Using 9/7 Filter ...

What's hot (20)

PDF
An35225228
PDF
FPGA Implementation of 2-D DCT & DWT Engines for Vision Based Tracking of Dyn...
PDF
Lifting Scheme Cores for Wavelet Transform
PDF
IRJET- Digital Watermarking using Integration of DWT & SVD Techniques
PDF
Paper id 25201467
PPTX
Ppt
PDF
PDF
Implementation of Vedic Multiplier in Image Compression Using Discrete Wavele...
PDF
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
PDF
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
PDF
A High Performance Modified SPIHT for Scalable Image Compression
PDF
Satellite Image Resolution Enhancement Technique Using DWT and IWT
PDF
PDF
R044120124
PDF
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
PDF
0 nidhi sethi_finalpaper--1-5
PDF
High Speed and Area Efficient 2D DWT Processor Based Image Compression
PDF
Iaetsd wavelet transform based latency optimized image compression for
PDF
www.ijerd.com
PDF
A Detailed Survey on VLSI Architectures for Lifting based DWT for efficient h...
An35225228
FPGA Implementation of 2-D DCT & DWT Engines for Vision Based Tracking of Dyn...
Lifting Scheme Cores for Wavelet Transform
IRJET- Digital Watermarking using Integration of DWT & SVD Techniques
Paper id 25201467
Ppt
Implementation of Vedic Multiplier in Image Compression Using Discrete Wavele...
Reconfigurable CORDIC Low-Power Implementation of Complex Signal Processing f...
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB
A High Performance Modified SPIHT for Scalable Image Compression
Satellite Image Resolution Enhancement Technique Using DWT and IWT
R044120124
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
0 nidhi sethi_finalpaper--1-5
High Speed and Area Efficient 2D DWT Processor Based Image Compression
Iaetsd wavelet transform based latency optimized image compression for
www.ijerd.com
A Detailed Survey on VLSI Architectures for Lifting based DWT for efficient h...
Ad

Similar to Simple and Fast Implementation of Segmented Matrix Algorithm for Haar DWT on a Low Cost GPU (20)

PDF
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
PDF
Ek35775781
PDF
FPGA IMPLEMENTATION OF EFFICIENT VLSI ARCHITECTURE FOR FIXED POINT 1-D DWT US...
PDF
Dynamic Texture Coding using Modified Haar Wavelet with CUDA
PPT
project ppt (1)FINAL vlsi_field_gate.ppt
PDF
Hz2514321439
PDF
Hz2514321439
PDF
Hz2514321439
PDF
Ijetr011837
PDF
An Energy Efficient and High Speed Image Compression System Using Stationary ...
PDF
Cb34474478
PDF
Architectural implementation of video compression
PDF
Image Compression Using Wavelet Packet Tree
PDF
Feature Based watermarking algorithm for Image Authentication using D4 Wavele...
PDF
Fpga sotcore architecture for lifting scheme revised
PDF
High Speed Data Exchange Algorithm in Telemedicine with Wavelet based on 4D M...
PPTX
discrete wavelet transform
PDF
Wavelet-Based Warping Technique for Mobile Devices
PDF
Hq3114621465
PDF
Ju3417721777
IRJET- An Efficient VLSI Architecture for 3D-DWT using Lifting Scheme
Ek35775781
FPGA IMPLEMENTATION OF EFFICIENT VLSI ARCHITECTURE FOR FIXED POINT 1-D DWT US...
Dynamic Texture Coding using Modified Haar Wavelet with CUDA
project ppt (1)FINAL vlsi_field_gate.ppt
Hz2514321439
Hz2514321439
Hz2514321439
Ijetr011837
An Energy Efficient and High Speed Image Compression System Using Stationary ...
Cb34474478
Architectural implementation of video compression
Image Compression Using Wavelet Packet Tree
Feature Based watermarking algorithm for Image Authentication using D4 Wavele...
Fpga sotcore architecture for lifting scheme revised
High Speed Data Exchange Algorithm in Telemedicine with Wavelet based on 4D M...
discrete wavelet transform
Wavelet-Based Warping Technique for Mobile Devices
Hq3114621465
Ju3417721777
Ad

More from IDES Editor (20)

PDF
Power System State Estimation - A Review
PDF
Artificial Intelligence Technique based Reactive Power Planning Incorporating...
PDF
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
PDF
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
PDF
Line Losses in the 14-Bus Power System Network using UPFC
PDF
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
PDF
Assessing Uncertainty of Pushover Analysis to Geometric Modeling
PDF
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
PDF
Selfish Node Isolation & Incentivation using Progressive Thresholds
PDF
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
PDF
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
PDF
Cloud Security and Data Integrity with Client Accountability Framework
PDF
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
PDF
Enhancing Data Storage Security in Cloud Computing Through Steganography
PDF
Low Energy Routing for WSN’s
PDF
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
PDF
Rotman Lens Performance Analysis
PDF
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
PDF
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
PDF
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...
Power System State Estimation - A Review
Artificial Intelligence Technique based Reactive Power Planning Incorporating...
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
Line Losses in the 14-Bus Power System Network using UPFC
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
Assessing Uncertainty of Pushover Analysis to Geometric Modeling
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
Selfish Node Isolation & Incentivation using Progressive Thresholds
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
Cloud Security and Data Integrity with Client Accountability Framework
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
Enhancing Data Storage Security in Cloud Computing Through Steganography
Low Energy Routing for WSN’s
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
Rotman Lens Performance Analysis
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...

