Hough Transform: Serial and Parallel Implementations

Jan Essbach¹, Björn Lindequist¹, Claudia Nacke¹
¹ University of Applied Sciences Engineering and Economics Berlin
Treskowallee 8, 10318 Berlin, Germany
Abstract – Circle detection is widely used in image processing applications. The Hough transform, the most popular method of shape detection, normally takes a long time to produce reasonable results, especially for large images. This makes real-time image processing with sequential algorithms on commodity computers almost impossible. OpenCL provides a programming paradigm that exposes the considerable computational power of parallel hardware for operations on vectors, matrices and higher-dimensional arrays. In this paper, five sequential and parallelized implementations of the Hough transform are examined using CPU and GPU execution. Experimental results indicate that the Hough transform on a GPU can achieve a speedup of up to about 4000 over the serial version on a CPU. Combined with efficient image scaling algorithms, real-time circle extraction can be achieved with GPU support.
Keywords – Hough Transform, GPU Acceleration, OpenCL,
Image Processing
I. INTRODUCTION
The Hough transform is a popular technique for feature extraction in image processing and computer vision. The concept was first proposed to detect straight lines [1] and was later generalized into a robust technique for detecting the position and orientation of arbitrary known shapes [2]. This scheme is known as the generalized Hough transform. Because of its strength in shape recognition, the Hough transform also plays an important role in image and object reconstruction. However, the classical Hough transform takes a brute-force approach, which normally leads to long execution times when detecting shapes with more than two parameters, such as circles and ellipses. Many researchers have worked on optimizing the Hough transform, but its execution time for shapes with multiple parameters remains too long for many applications.
Circle detection is used in many applications across a wide range of academic areas, such as medical image processing [3], [4] and robot vision [5]. Since a circle in the plane has three parameters, the parameter domain is a cube, which requires long execution times and a large memory capacity. To address this problem, researchers have tried to reduce the dimension of the parameter space with techniques tailored to specific problems [6]. However, a general solution has not been found yet.
This paper aims to accelerate Hough-transform-based circle detection in a three-dimensional parameter space using OpenCL on GPUs. It makes the following contributions:
• A sequential Hough transform algorithm [7] is examined using CPU execution (1),
• a sequential (2) and a parallelized (3) optimization of that algorithm are investigated, and
• a CPU (4) and a GPU (5) version of the Hough transform are implemented for the OpenCL architecture.
The rest of the paper is organized as follows: Section II discusses the Hough transform in general and focuses on circle detection using the generalized Hough transform. Section III introduces the possibilities of parallelization with regard to algorithms and choice of architecture. Section IV summarizes the current state of the art, and Section V provides the implementation details of the Hough transform for OpenCL. Section VI presents detailed experimental results and a performance analysis that demonstrate the effectiveness of GPU acceleration. Finally, Section VII gives our conclusion and describes future work.
II. HOUGH TRANSFORM AND CIRCLE DETECTION
In this paper, we only discuss shapes in a plane. Shapes with two parameters a and b can be represented by an equation f(a, b) = 0, such as $x + ay + b = 0$ for lines and $x^2 + ax + b - y = 0$ for a special type of parabola. While the image domain resides in the X-Y coordinate system, the transformed domain, or parameter domain, is located in the A-B coordinate system. Fig. 1 gives a rough idea of the Hough transform workflow. First, the edges of the input image are determined using an edge detection algorithm such as the Canny edge detector [8]. The image is then converted to a binary image, and the Hough transform is applied. The resulting Hough space is then searched for local maxima, which represent the circle centers.
[Figure: Input Image → Edge Detection → Binary Image → Hough Transform → Hough space → Find local maxima → Model Parameters]
Figure 1. General workflow of Hough Transform
The equation of a circle can be written as $(x - a)^2 + (y - b)^2 = c^2$, so we set our parameter domain to be a 3-D cube A-B-C. Assume we have an edge point $(x_0, y_0)$ in the image domain, which can be seen as a point on the circumference of a circle centered at $(x_c, y_c)$. We can select every other point as a candidate for this center, and the radius $r_c$ can then be calculated as $r_c = \sqrt{(x_c - x_0)^2 + (y_c - y_0)^2}$. Hence we get a set of coordinates $(x_c, y_c, r_c)$, each corresponding to a point in the parameter domain (see Fig. 2).
[Figure: edge point (x0, y0) on a circle of radius rc around the center (xc, yc), at angle ϕ]
Figure 2. Calculation of points in parameter domain
We set up counters for all points in that 3-D cube and increase them by 1 whenever they are “visited” by a calculated $(x_c, y_c, r_c)$. For one edge point $(x_0, y_0)$, the visited points form a conical surface that radiates from the point $(x_0, y_0, 0)$ along the line $f(a, b): \{a = x_0;\ b = y_0\}$ in the 3-D cube. These conical surfaces are the counterparts of the lines in the A-B domain. Conversely, for a real circle in the image, say $(x - a_0)^2 + (y - b_0)^2 = c_0^2$, the counter of the corresponding point $(a_0, b_0, c_0)$ in the parameter domain must stand out among its neighbors. This can be seen clearly in Fig. 3, where three circles obscured by heavy noise appear as peaks in the parameter domain. Another advantage of the circular Hough transform is that we do not have to worry about infinite slopes. The range of the radius should nevertheless be restricted to save memory and calculation time.
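The counting scheme described above can be sketched in C as follows. This is a minimal brute-force illustration and not the implementation benchmarked in this paper: the function name `hough_vote_point`, the flat accumulator layout, and the choice to round the radius to the nearest integer are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: casts the votes of one edge point (x0, y0).
 * acc is a flat, zero-initialized array of (r_max - r_min + 1) * h * w
 * counters, laid out as acc[r - r_min][yc][xc]. */
void hough_vote_point(unsigned *acc, int w, int h, int r_min, int r_max,
                      int x0, int y0)
{
    assert(acc != NULL && w > 0 && h > 0 && 0 < r_min && r_min <= r_max);
    for (int yc = 0; yc < h; ++yc) {
        for (int xc = 0; xc < w; ++xc) {
            const long dx = xc - x0;
            const long dy = yc - y0;
            /* Four times the squared distance, so that the half-integer
             * rounding bounds below stay in integer arithmetic. */
            const long d4 = 4 * (dx * dx + dy * dy);
            for (int r = r_min; r <= r_max; ++r) {
                const long lo = (2L * r - 1) * (2L * r - 1);
                const long hi = (2L * r + 1) * (2L * r + 1);
                /* r - 0.5 <= distance < r + 0.5: the radius rounds to r. */
                if (d4 >= lo && d4 < hi) {
                    const size_t idx =
                        ((size_t)(r - r_min) * (size_t)h + (size_t)yc)
                        * (size_t)w + (size_t)xc;
                    acc[idx] += 1u;
                    break;
                }
            }
        }
    }
}
```

Calling the function once per edge point fills the cube; every candidate center is visited for every edge point, which is exactly the cost that motivates the parallel variants discussed later.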
In real applications, the traditional algorithm should not always be followed literally; better ways to implement it are needed. There are several mapping strategies between the image domain and the parameter domain that make the right points stand out. Two strategies for the circular Hough transform and their relative merits will be discussed.
III. POSSIBILITIES OF PARALLELIZATION
Parallelization can be achieved through various approaches. They range from simple thread programming in popular programming languages (e.g., C/C++ or Java) to the use of powerful frameworks. The Hough transform processes each edge point of the original image independently of all others, so numerous parallelized solutions are possible. Some of these possibilities are presented and discussed in the first part of this section. The second part introduces popular computing platforms and programming models that allow the GPU of a modern graphics card to take part in the computation.
A. Algorithms
The basic idea of parallelization is to divide the overall task into independent sub-tasks that can be processed simultaneously. Modern computers have one or more CPUs with multiple cores each, so true parallel processing is possible. Dividing the original problem into appropriate sub-tasks is of crucial importance, especially when resources must be shared between them.
The simplest form of parallelization is to process multiple images simultaneously. Each available core calculates the complete Hough transform for one image, and no data exchange between the cores is required. When images of different sizes are processed, however, the work is very likely not distributed equally among the cores. To avoid such a waste of processing power, a more complex approach is required.
One possibility for recognizing circles with unknown radii is to process the individual radii for one image in parallel. Each CPU core computes the Hough transform for a single radius and generates a separate Hough space to which only that core has access. Synchronization of this space is therefore not required. All CPU cores can operate completely in parallel without blocking each other, since the shared resource, the list of contour points, is read-only. The result of each sub-task contains the center points of all detected circles with this particular radius.
Another approach is to process individual regions of an image in parallel. Each task computes the Hough transform for the contour points of its region and generates a separate section of the complete Hough space. Each section belongs exclusively to one task, so synchronization is not necessary. A disadvantage of this approach is an additional processing step in which the Hough space for neighboring regions must be composed from the individual spaces of each task. On the other hand, the entire Hough space does not have to be allocated at once. Only the subspaces of neighboring regions must be kept in memory simultaneously. Once a region is no longer needed for the voting process, its memory can be released. This reduction in memory use can be extremely useful when processing large images (e.g., with more than 100,000 pixels in each dimension).
There are many other approaches to parallelizing the Hough transform. The examples above are intended only to give a general overview of the possibilities.
B. Choice of architecture
The hardware architectures of a CPU and a GPU differ significantly from each other because the two devices serve different purposes. A task that is highly parallelizable can be calculated much faster on a GPU than on a CPU if the specifics of the GPU (e.g., memory usage) are taken into account. The speed gain of the GPU over the CPU is greater when a job involves many calculations of the kind a GPU is optimized for. For a detailed description of the hardware architecture and the resulting application scenarios for GPUs, see [9].
Figure 3. Hough transform on circles with r = 30
In recent years, various computing platforms have been introduced that allow software to be written for execution on GPUs. Two of these platforms are CUDA¹ and OpenCL². Apart from the details of their implementation, the most striking difference is the range of supported hardware devices. While OpenCL supports different devices (CPUs, GPUs, DSPs and other processors) from different manufacturers, CUDA can only utilize GPUs from NVIDIA. This restriction can give CUDA a speed advantage of up to 50% over an OpenCL program that has not been specifically adapted³ for NVIDIA hardware [10].
As already described, the Hough transform is a highly parallelizable task. Using a parallel computing platform to significantly reduce the computation time is therefore a reasonable solution.
IV. STATE OF THE ART
Digital image processing is an area with a wide range of applications and an active research and development community. Since the Hough transform is an important tool, it is included in many frameworks and applications. One of the most common is the cross-platform framework OpenCV.
Despite the existing solutions, the Hough transform remains a field of active development, especially since GPUs have become available for general calculations. Current GPU implementations of the Hough transform achieve remarkable speed increases compared with a CPU. Two recent implementations for the CUDA platform achieved speedup factors of up to 400 [11], [12].
¹ Compute Unified Device Architecture – NVIDIA
² Open Computing Language – initially Apple Inc., now Khronos Group
³ Adapted OpenCL programs can be equally fast but lose their portability
V. PARALLELIZATION OF HOUGH TRANSFORM
In our solution, the first step is to generate all jobs for the Hough kernel. Each job is represented by a triple (x, y, r). One such triple is generated for each edge point in the original image and each radius $r \in r_{total} = [r_{min}, r_{max}]$. The total number of jobs is $\text{edges} \cdot |r_{total}|$, where $|r_{total}| = r_{max} - r_{min} + 1$ is the number of radii, and it is used as the global work size for the first OpenCL kernel. The kernel is executed once for each triple (x, y, r) and uses the midpoint circle algorithm to “visit” the pixels of interest and increment each visited pixel by one.
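The work done for a single job (x, y, r) can be sketched in plain C as follows. This is an illustrative serial version rather than the OpenCL kernel itself: the names `hough_job` and `vote_mirrored` are hypothetical, the Hough space is assumed to be padded by r_max on every side and stored plane by plane, and skipping duplicate pixels on the circle's axes and diagonals is a choice of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Increments the mirror images (cx +/- a, cy +/- b) of one circle pixel,
 * skipping the duplicates that would arise when a or b is zero. */
static void vote_mirrored(unsigned *plane, int stride, int cx, int cy,
                          int a, int b)
{
    plane[(size_t)(cy + b) * (size_t)stride + (size_t)(cx + a)] += 1u;
    if (a != 0)
        plane[(size_t)(cy + b) * (size_t)stride + (size_t)(cx - a)] += 1u;
    if (b != 0)
        plane[(size_t)(cy - b) * (size_t)stride + (size_t)(cx + a)] += 1u;
    if (a != 0 && b != 0)
        plane[(size_t)(cy - b) * (size_t)stride + (size_t)(cx - a)] += 1u;
}

/* One job: draw a circle of radius r around the edge point (x, y) into the
 * Hough plane for radius r, using the midpoint circle algorithm. The
 * padding of r_max pixels makes bounds checks unnecessary. */
void hough_job(unsigned *space, int w, int h, int r_min, int r_max,
               int x, int y, int r)
{
    assert(space != NULL && 0 < r_min && r_min <= r && r <= r_max);
    assert(x >= 0 && x < w && y >= 0 && y < h);
    const int wp = w + 2 * r_max;           /* padded width  */
    const int hp = h + 2 * r_max;           /* padded height */
    unsigned *plane = space + (size_t)(r - r_min) * (size_t)wp * (size_t)hp;
    const int cx = x + r_max;               /* edge point in padded space */
    const int cy = y + r_max;
    int dx = 0, dy = r, d = 1 - r;          /* midpoint decision variable */
    while (dx <= dy) {
        vote_mirrored(plane, wp, cx, cy, dx, dy);
        if (dx != dy)
            vote_mirrored(plane, wp, cx, cy, dy, dx);
        if (d < 0) {
            d += 2 * dx + 3;
        } else {
            d += 2 * (dx - dy) + 5;
            --dy;
        }
        ++dx;
    }
}
```

Because each job writes only to the plane of its own radius but several jobs may hit the same pixel, a parallel version would need atomic increments or a comparable mechanism; the serial sketch sidesteps that question.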
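[Figure: the original image (w × h) is padded by rmax on every side to give the resized Hough space (w' × h'), which is stored as a linearly aligned array of w' · h' cells; an image point (x, y) maps to (x', y')]
Figure 4. Hough space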
The original image is transformed into the resized Hough space by adding the maximum radius $r_{max}$ to all sides of the image domain, so that $w' = w + 2 r_{max}$ and $h' = h + 2 r_{max}$. Within the OpenCL kernel, the resized Hough space is interpreted as a linearly aligned data structure. The array index n for a point (r, x, y) is calculated using the formula i(r, x, y) given in Eq. (2). The Hough space images for the individual radii are stored one after another, so Eq. (1) is used to determine the offset of a radius within the Hough space. To recover the original parameters (r, x, y) from an array index n, the inverse $i^{-1}(n)$ in Eq. (3) can be used. The added border of “junk data” simplifies the midpoint circle algorithm, because no boundary checks are needed within the OpenCL kernel; such conditional checks are very slow in OpenCL kernels executed on a GPU. The junk data is ignored in the voting process by setting the three-dimensional global work size and offset of the voting kernel accordingly.
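$$\phi(r) = (r - r_{min}) \cdot w' \cdot h' \qquad (1)$$
$$i(r, x', y') = \phi(r) + y' \cdot w' + x' \qquad (2)$$
$$i^{-1}(n) = \begin{cases} r = \lfloor n / (w' \cdot h') \rfloor + r_{min}, & r \in \mathbb{N} \\ y' = \lfloor (n - \phi(r)) / w' \rfloor, & y' \in \mathbb{N} \\ x' = n - \phi(r) - y' \cdot w', & x' \in \mathbb{N} \end{cases} \qquad (3)$$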
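The index mapping and its inverse can be sketched in C as follows. The type and function names (`HoughLayout`, `HoughCell`, `hough_phi`, `hough_index`, `hough_cell`) are hypothetical, and the sketch assumes that the inverse adds r_min back to the plane number so that it round-trips with the offset formula (1).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of the padded Hough space. */
typedef struct {
    int w_pad;   /* padded width  w' */
    int h_pad;   /* padded height h' */
    int r_min;   /* smallest radius  */
} HoughLayout;

typedef struct { int r, x, y; } HoughCell;

/* Eq. (1): offset of the plane that belongs to radius r. */
size_t hough_phi(const HoughLayout *l, int r)
{
    assert(l != NULL && r >= l->r_min);
    return (size_t)(r - l->r_min) * (size_t)l->w_pad * (size_t)l->h_pad;
}

/* Eq. (2): linear index of the cell (r, x, y) in padded coordinates. */
size_t hough_index(const HoughLayout *l, int r, int x, int y)
{
    assert(x >= 0 && x < l->w_pad && y >= 0 && y < l->h_pad);
    return hough_phi(l, r) + (size_t)y * (size_t)l->w_pad + (size_t)x;
}

/* Eq. (3): recover (r, x, y) from a linear index n. */
HoughCell hough_cell(const HoughLayout *l, size_t n)
{
    assert(l != NULL);
    const size_t plane = (size_t)l->w_pad * (size_t)l->h_pad;
    HoughCell c;
    c.r = (int)(n / plane) + l->r_min;
    const size_t rest = n - hough_phi(l, c.r);
    c.y = (int)(rest / (size_t)l->w_pad);
    c.x = (int)(rest - (size_t)c.y * (size_t)l->w_pad);
    return c;
}
```

For a layout with w' = 10, h' = 8 and r_min = 2, the cell (r = 3, x = 4, y = 5) maps to index 1 · 80 + 5 · 10 + 4 = 134, and `hough_cell` maps 134 back to the same triple.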
The voting kernel is started with a three-dimensional global work size and a global offset to skip all junk data, as shown in Listing 1.
/* Global offset: start at (r_max, r_max) to skip the left and top padding. */
size_t offset[3] = {r_max, r_max, 0};
/* Global work size for the x, y and radius dimensions. */
size_t worksize[3] = {img.cols + r_max, img.rows + r_max,
                      r_max - r_min + 1};
Listing 1. Setting clEnqueueNDRangeKernel parameters
Voting uses simple thresholding and a search for local maxima. The threshold depends on the radius of the circle. During voting, the coordinates (x', y') are converted back to the original image domain (x, y), and the number of circles found is stored in a __global variable. The vote space itself is allocated with the number of detected edge points as its maximum size, because dynamic data structures cannot be used within an OpenCL kernel. After the vote space kernel has finished, the data structure consisting of concrete instances of (x, y, r) is copied back to the host system, and the Hough transform is complete.
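The thresholding and local maximum search can be sketched serially in C for one radius plane. The names `Circle` and `find_circle_centers` are hypothetical, the threshold is simply passed in by the caller because its dependence on the radius is not specified here, and the fixed-capacity output array mirrors the restriction that no dynamic data structures are available in a kernel.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int x, y, r; } Circle;   /* hypothetical result record */

/* Scans the interior of one padded Hough plane (padding of pad >= 1 cells
 * on every side). A cell is reported when its count reaches the threshold
 * and none of its eight neighbors has a higher count. Coordinates are
 * converted back to the image domain by subtracting the padding. Plateaus
 * of equal counts yield several hits. Returns the number of circles
 * written to out, never more than max_out. */
size_t find_circle_centers(const unsigned *plane, int w_pad, int h_pad,
                           int pad, int r, unsigned threshold,
                           Circle *out, size_t max_out)
{
    assert(plane != NULL && out != NULL && pad >= 1);
    assert(w_pad > 2 * pad && h_pad > 2 * pad);
    size_t found = 0;
    for (int y = pad; y < h_pad - pad; ++y) {
        for (int x = pad; x < w_pad - pad; ++x) {
            const unsigned votes =
                plane[(size_t)y * (size_t)w_pad + (size_t)x];
            if (votes < threshold)
                continue;
            int is_peak = 1;
            for (int dy = -1; dy <= 1 && is_peak; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const size_t n = (size_t)(y + dy) * (size_t)w_pad
                                   + (size_t)(x + dx);
                    if (plane[n] > votes) {
                        is_peak = 0;
                        break;
                    }
                }
            }
            if (is_peak && found < max_out) {
                out[found].x = x - pad;
                out[found].y = y - pad;
                out[found].r = r;
                ++found;
            }
        }
    }
    return found;
}
```

In a parallel kernel the counter `found` would correspond to the shared __global variable and would have to be incremented atomically.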
VI. EVALUATION
To evaluate the OpenCL implementation of the Hough transform, a benchmark was conducted. The benchmark is structured as follows:
A. Test images
Images with four different square resolutions (256×256 to 2048×2048 pixels) and an increasing number of edge points (up to ≈ 100,000) were used in the benchmark. A sample image with a resolution of 2048×2048 pixels is shown in Fig. 5.
B. Implementations
In order to assess the presented implementation, it is compared with three other freely available solutions (V1 to V3):
• V1 – sequential Hough transform algorithm [7] (CPU)
• V2 – optimization of V1 (sequential, CPU)
• V3 – optimization of V1 (parallelized with 30 threads, CPU)
• V4 – presented OpenCL implementation (CPU)
• V5 – presented OpenCL implementation (GPU)
All implementations of the Hough transform were required to detect circles with radii from 10 to 40 in the respective image.
C. Hardware
Hardware used for the benchmark:
• CPU: Intel(R) Core(TM) i7-3720QM
• GPU: ATI Radeon HD 5770
Figure 5. Test image with 2048×2048 pixels and 92640 edge points
D. Results
To obtain reliable processing times, each image was processed ten times by each solution, and the average time was calculated.
The average times for all solutions and images are shown in Tab. I.
Format      Edges   V1         V2       V3       V4      V5
256x256     1503    11.321     0.813    0.210    0.022   0.031
512x512     5790    167.432    3.301    0.856    0.080   0.073
1024x1024   23160   1200.344   13.373   3.278    0.476   0.354
2048x2048   92640   3400.442   53.169   13.056   1.772   0.791
Table I. EXECUTION TIME OF HOUGH TRANSFORM IN SECONDS
A graphical representation of the benchmark results is shown in Fig. 6 (V1 has been omitted for clarity).
The benchmark shows that the presented OpenCL implementation (on both CPU and GPU) is significantly faster than any other solution. The small difference between V4 and V5 for images up to a resolution of 1024×1024 can be explained by the simplicity of the calculations, which do not fully exploit the potential of the GPU. Only for the largest image is the GPU faster than the CPU by a factor of about 2 (0.791 s versus 1.772 s). The resolution itself is almost irrelevant; what determines the amount of work is the number of contour points multiplied by the number of radii examined. For the 2048×2048 image there are 92,640 contour points and 31 radii (10 to 40), which results in 31 · 92,640 = 2,871,840 jobs, each calling the midpoint OpenCL kernel. With this large number of calls, the graphics card can begin to exploit its advantages.
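The job count follows directly from the number of edge points and the radius range. A small C helper makes the arithmetic explicit; the function name `hough_job_count` is hypothetical and the sketch assumes integer radii with both bounds included.

```c
#include <assert.h>
#include <stddef.h>

/* Number of (x, y, r) jobs: one per edge point and per integer radius
 * in the inclusive range [r_min, r_max]. */
size_t hough_job_count(size_t n_edges, int r_min, int r_max)
{
    assert(r_min <= r_max);
    return n_edges * (size_t)(r_max - r_min + 1);
}
```

For the largest test image, `hough_job_count(92640, 10, 40)` gives 2,871,840; for the smallest one, `hough_job_count(1503, 10, 40)` gives only 46,593, which is consistent with the GPU showing no advantage there.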

[Figure: line plot of execution time in seconds (0 to 50) over image dimension n × n (500 to 2,000) for V2, V3, V4 and V5]
Figure 6. Execution time of different approaches
If the processing times of solutions V1 and V5 for the largest image are compared, the speedup can be calculated as follows:
$$\text{Speedup}_{V1/V5} = \frac{3400.442}{0.791} \approx 4298.92 \qquad (4)$$
The implemented Hough transform for the OpenCL programming platform (processed on a GPU) achieves a speedup by a factor of ≈ 4000 compared with the serial CPU version. Optimizing the OpenCL kernels could possibly increase the speedup even further.
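The same ratio can be computed for any pair of columns in Table I. The helper below is a minimal sketch with a hypothetical name, `speedup`, that takes two execution times in seconds.

```c
#include <assert.h>

/* Speedup of a candidate over a reference implementation, given their
 * execution times in seconds. */
double speedup(double t_reference, double t_candidate)
{
    assert(t_reference > 0.0 && t_candidate > 0.0);
    return t_reference / t_candidate;
}
```

With the 2048×2048 values from Table I, `speedup(3400.442, 0.791)` is about 4299 (V1 against V5), while `speedup(1.772, 0.791)` is about 2.24 (V4 against V5). For the 256×256 image, `speedup(0.022, 0.031)` is below 1, so the OpenCL CPU version is faster than the GPU version there.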
VII. CONCLUSION
In this paper, the broad applicability of the Hough transform in digital image processing has been presented, and various possibilities for parallelization to reduce the processing time were discussed. The result of this work is an implementation of the Hough transform on the OpenCL programming platform. Its processing time was compared with that of other implementations (serial and parallel on the CPU). The introduced OpenCL solution reaches a speedup by a factor of up to ≈ 4000.
To further improve the solution, the implemented algorithms will be investigated to determine whether they can exploit the capabilities of the GPU even more.
So far, all tests have been conducted solely with artificial images. To assess the quality of the implemented OpenCL solution more thoroughly, real-world images will be used for testing in the next step. Possible sources for these test images are free databases of test images for image processing, e.g., [13], [14].
Another way to further increase the reliability of the evaluation is to use measures such as precision and recall. These values indicate how reliable the detection is, i.e., whether the Hough transform detected all circles in the image and whether objects that are not circles were identified as circles.
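Precision and recall can be computed from three counts, as the following C sketch shows. The names `DetectionQuality` and `detection_quality` are hypothetical, and how a detected circle is matched to a ground-truth circle (which decides what counts as a true positive) is left open.

```c
#include <assert.h>

typedef struct {
    double precision;   /* share of reported circles that are real   */
    double recall;      /* share of real circles that were reported  */
} DetectionQuality;

/* tp: real circles that were detected, fp: detections that are not
 * circles, fn: real circles that were missed. */
DetectionQuality detection_quality(unsigned tp, unsigned fp, unsigned fn)
{
    assert(tp + fp > 0u && tp + fn > 0u);
    DetectionQuality q;
    q.precision = (double)tp / (double)(tp + fp);
    q.recall = (double)tp / (double)(tp + fn);
    return q;
}
```

For example, six correct detections, two false detections and two missed circles give a precision of 0.75 and a recall of 0.75.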
The long-term goal of this work is the real-time detection of circles with variable radii in a video stream (e.g., from a webcam). The Hough transform requires an edge image, which must also be computed in real time. The next step will therefore be to evaluate appropriate edge detectors. The main criterion is efficient computability on a GPU using OpenCL. In [15], the possibility of implementing the Canny edge detector on a GPU was investigated. This method should be sufficient for the intended application.
REFERENCES
[1] P. V. C. Hough, “Method and means for recognizing complex patterns,” 1962.
[2] R. O. Duda and P. E. Hart, “Use of the Hough transformation to detect lines and curves in pictures,” in Communications of the ACM, 1972.
[3] S. Eom, R. Bise, and T. Kanade, “Detection of hematopoietic stem cells in microscopy images using a bank of ring filters,” in The IEEE International Symposium on Biomedical Imaging, 2010.
[4] M. Smereka and I. Duleba, “Circular object detection using a modified Hough transform,” in International Journal of Applied Mathematics and Computer Science, 2008.
[5] Y. Yabuta, H. Mizumoto, and S. Arii, “Binocular robot vision system with shape recognition,” in International Conference on Control, Automation and Systems, 2007.
[6] Y. Xie and Q. Ji, “A new efficient ellipse detection method,” in International Conference on Pattern Recognition, 2002.
[7] M. Bowes. (2009) Hough circle detector. [Online]. Available: https://github.com/marcbowes/Hough-Circle-Detector
[8] J. Canny, “A computational approach to edge detection,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 6, pp. 679–698, 1986.
[9] J. Owens and U. D. Davis. (2007) GPU architecture overview. [Online]. Available: http://gpgpu.org/static/s2007/slides/02-gpu-architecture-overview-s07.pdf
[10] K. Karimi, N. G. Dickson, and F. Hamze. (2010) A performance comparison of CUDA and OpenCL. [Online]. Available: http://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf
[11] S. Chen and H. Jiang, “Accelerating the Hough transform with CUDA on graphics processing units,” in Proceedings of 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), 2011.
[12] F. S. Tasel and A. Temizel, “Parallelization of Hough transform for circles using CUDA,” GPU Technology Conference, 2012.
[13] Computer vision test images. Carnegie Mellon University. [Online]. Available: http://www.cs.cmu.edu/~cil/v-images.html
[14] The USC-SIPI image database. Signal and Image Processing Institute. [Online]. Available: http://sipi.usc.edu/database/
[15] Y. Luo and R. Duraiswami, “Canny edge detection on NVIDIA CUDA,” in Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on, 2008, pp. 1–8.
[16] M. Bäuml and R. Stiefelhagen, “Evaluation of Local Features for Person Re-Identification in Image Sequences,” Research Paper, Institute of Technology, Karlsruhe, 2011.
[17] D. Wagner, “Marker-Based Tracking,” 2008. [Online]. Available: http://handheldar.icg.tugraz.at/markerbased.php
[18] D. Lowe, “Distinctive Image Features from Scale Invariant Keypoints,”
Ph.D. dissertation, University of British Columbia, Canada, 2004.
[19] ——, “Object Recognition from Local Scale-Invariant Features,” Re-
search Paper, University of British Columbia, Canada, 1999.
[20] R. E. Kalman, “A New Approach to Linear Filtering and Prediction
Problems,” Research Paper, Research Institute for Advanced Study,
Baltimore, 2002.
