Experiences Programming for GPUs with OpenCL
Oliver Fishstein, Alejandro Villegas
Abstract
This project looks at the complexities of parallel computing on GPUs using OpenCL. In order to write a parallel program, one needs to understand the possible synchronization issues along with the GPU execution and memory models, which are discussed in this paper. These concepts were used to write a genetic algorithm that solves the knapsack problem in OpenCL and to analyze the performance differences between computing in local memory and in global memory.
Keywords: OpenCL, Parallel Programming, GPGPU
1 Introduction
General-purpose computing on GPUs is the utilization of the graphics processing unit to perform computations and applications traditionally performed on the CPU, instead of limiting the GPU's uses to traditional graphics computations. Programming GPUs takes advantage of the parallel nature of graphics processing to increase execution speed [1].
The current dominant language for general-purpose GPU computing is OpenCL. It defines a C-like language in which functions called kernels are written; kernels are executed on compute devices, or accelerators. In our case, we will focus on GPUs as the accelerator. In order to execute OpenCL kernels, it is necessary to write a host program in either C or C++ that launches kernels on the compute device and manages the device memory, which is usually separate from the host memory [2].
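As context for the rest of the paper, the overall shape of such a host program is sketched below. This is a generic OpenCL 1.x skeleton with all error handling omitted, not the project's actual host code; the kernel source and sizes are placeholders.

#include <CL/cl.h>

int main(void) {
    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    // Kernels are compiled at run time from source strings.
    const char *src = "__kernel void k(__global int *d) { d[get_global_id(0)] = 0; }";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "k", &err);

    // Device memory is managed explicitly, separately from host memory.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 128 * sizeof(int), NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    // Launch 128 workitems in workgroups of 64, then copy the results back.
    size_t global = 128, local = 64;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    int out[128];
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);
    return 0;
}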
1.1 Synchronization Issues in Parallel Programming
One type of synchronization issue in parallel programming is the hazard. There are three types of hazards: read after write, write after read, and write after write. These occur when instructions from different execution threads modify shared data in an unexpected temporal order [3]. An example of code with hazards can be seen in Listing 1.
Listing 1: Hazard Example
// Both threads share this variable.
shared int a[2];
// Each thread has a private copy of this variable.
private int b;
// Returns 0 for thread 0 and 1 for thread 1.
private int id = get_id();
a[id] = id;   // line 1
b = a[1-id];  // line 2
a[id] = b;    // line 3
There is a read after write hazard between lines 1 and 2, because line 2 in thread 1 can execute before thread 0 executes line 1. In addition, there is another read after write hazard, along with a write after read hazard, between lines 2 and 3, because line 3 in thread 1 can execute before thread 0 executes line 2. To remove these hazards, a barrier can be placed between lines 1 and 2 and between lines 2 and 3. A barrier, written barrier(), indicates that all threads must reach it before proceeding to the next portion of the program.
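The corrected listing then reads as follows (same pseudocode style as Listing 1):

a[id] = id;   // line 1
barrier();    // both writes to a[] complete before either thread reads
b = a[1-id];  // line 2
barrier();    // both reads complete before line 3 overwrites a[]
a[id] = b;    // line 3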
The other type of synchronization issue is the critical section. A critical section is a region of code that accesses a shared resource (a device or a data structure) that cannot be concurrently accessed by more than one thread at a time [4]. Ignoring this requirement can lead to results the programmer does not expect, due to threads interfering with one another. An example of code with a critical section can be seen in Listing 2.
Listing 2: Critical Section Example
#define EMPTY -1
void insert(int *list, int val) {
    int i = 0;
    while (list[i] != EMPTY) i++;  // line 1
    list[i] = val;                 // line 2
}
Lines 1 and 2 form a critical section. Suppose thread 0 calls insert(list, 0) and thread 1 calls insert(list, 1): by making these lines a critical section, the only possible outputs are list = {0, 1} and list = {1, 0}, depending on which thread takes the lead. Without it, both threads could find the same empty slot and one value could overwrite the other. Critical sections can be implemented using locks. A lock is a variable that can have the value 1, indicating locked, or the value 0, indicating unlocked. A thread checks the state of the lock, and if it is locked, it waits until it is unlocked. If it is unlocked, the thread sets the lock to locked, executes the critical section, and then sets the lock back to unlocked. Thread APIs usually provide the programmer with functions to define and use locks, which rely on lower-level atomic operations. Listing 3 shows the code from Listing 2 corrected with a lock.
Listing 3: Lock Example
#define EMPTY -1
void insert(int *list, int val) {
    int i = 0;
    while (getLock(lock) == false) {}  // spin until the lock is acquired
    while (list[i] != EMPTY) i++;      // line 1
    list[i] = val;                     // line 2
    releaseLock(lock);
}
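On a GPU, such a lock can be built from the atomic operations that OpenCL provides. The sketch below is one possible realization of the getLock/releaseLock pseudocode above, not the only one:

// Acquire: atomically swap 0 -> 1; the swap succeeds only if the lock was free.
void getLock(volatile __global int *lock) {
    while (atomic_cmpxchg(lock, 0, 1) != 0) {}
}

// Release: write 0 so another thread can acquire the lock.
void releaseLock(volatile __global int *lock) {
    atomic_xchg(lock, 0);
}

Spinning locks must be used with care on GPUs: as Section 1.2 explains, workitems in the same wavefront execute in lock-step, so a workitem spinning on a lock held by a sibling in its own wavefront can deadlock.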
1.2 GPU Execution Model
The GPU execution model in OpenCL consists of workitems, wavefronts, and workgroups. The workitems are the individual threads that the program uses. They are divided into workgroups, each of which executes on one of the compute units of the GPU. The number of workitems in each workgroup is a multiple of the wavefront size, which is the number of workitems that can run concurrently within a compute unit. The wavefront size is fixed by the hardware. Each workitem in a wavefront executes in lock-step with the other workitems in the same wavefront, and a workgroup can be made up of multiple wavefronts.
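As an illustration, the workgroup layout is chosen on the host and queried inside the kernel. The following sketch (kernel name and sizes are hypothetical) launches 128 workitems in workgroups of 64:

// Host side: total workitems (global) and workgroup size (local).
size_t global = 128, local = 64;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

// Kernel side: each workitem can query where it sits.
__kernel void where_am_i(__global int *out) {
    int gid = get_global_id(0);  // 0..127: unique among all workitems
    int lid = get_local_id(0);   // 0..63: position within the workgroup
    int grp = get_group_id(0);   // 0..1: which workgroup
    out[gid] = grp * 1000 + lid;
}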
1.3 GPU Memory Model
The GPU memory model consists of global memory and local memory. Global memory is allocated by the host program and is visible to all the workitems running on the GPU. Local memory is a smaller and faster region of memory; it is declared within the kernel and is visible only to the workitems belonging to the same workgroup. Generally, doing computations in local memory is faster than using global memory.
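In OpenCL C, the two spaces are selected with address-space qualifiers. Below is a minimal sketch of staging data into local memory, with hypothetical buffer names and a workgroup size of 64:

__kernel void stage(__global int *in, __global int *out) {
    __local int tile[64];              // shared by the whole workgroup
    int lid = get_local_id(0);
    tile[lid] = in[get_global_id(0)];  // copy from global to local memory
    barrier(CLK_LOCAL_MEM_FENCE);      // wait until every workitem has written its element
    /* ... compute on tile[] ... */
    out[get_global_id(0)] = tile[lid];
}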
2 Genetic Algorithm and the Knapsack
Problem
A genetic algorithm is a search heuristic that mimics the process of natural selection. The knapsack problem is one particular problem to which genetic algorithms can be applied. Both are detailed below.
2.1 Genetic Algorithm
A genetic algorithm uses the biological concept of natural selection to search for the best, or "most fit", solution [5]. A population of potential solutions to an optimization problem is evolved toward better solutions. The evolution usually starts from a population of randomly generated values, although it can also use a predefined population, and goes through an iterative process that evolves in "generations". After the population of solutions is initialized, the selection process begins. During each iteration, values are selected to create the next generation by comparing two existing solutions and choosing the "fitter" one. The fitness comparison can take place over every value or over only a randomized sample.
The next step of the genetic algorithm is to mutate the more fit result. The mutation depends on the problem the algorithm is being applied to solve. The mutated value is then used to replace the less fit value. All of the values are then shuffled in order to change which values are compared. This whole process is repeated for as many iterations, or "generations", as needed; generally, more iterations give better results, especially when working with a more complex mutation. The overall loop is sketched below.
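A C-like sketch of this loop; the init_population, select_fitter, mutate, and shuffle helpers are placeholders for the problem-specific steps described above:

void genetic_algorithm(int *pop, int n, int generations) {
    init_population(pop, n);                    // random or predefined values
    for (int g = 0; g < generations; g++) {
        for (int i = 0; i < n / 2; i++) {
            int a = pop[i], b = pop[i + n/2];   // pair up two candidates
            int fit   = select_fitter(a, b);    // selection step
            int unfit = (fit == a) ? b : a;
            pop[i]       = fit;
            pop[i + n/2] = mutate(fit, unfit);  // mutation replaces the less fit value
        }
        shuffle(pop, n);                        // change which values are compared next
    }
}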
2.2 Knapsack Problem
The knapsack problem is a specific type of problem that can be solved by a genetic algorithm. Given a set of items, each with a corresponding mass, determine which items to include in the "knapsack" so that the total weight is less than or equal to a given limit. Ideally, the total weight should equal the limit. The complexity of the problem can be increased by adding additional criteria such as value and dimensions [6].
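In symbols, one standard way to write the mass-only variant used here: for item masses $w_i$ and limit $W$, find a selection $x_i \in \{0,1\}$ that fills the knapsack as far as the limit allows,

$$\max_{x_1,\dots,x_n \in \{0,1\}} \sum_{i=1}^{n} w_i x_i \quad \text{subject to} \quad \sum_{i=1}^{n} w_i x_i \le W.$$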
3 Implementing the Knapsack Problem
in OpenCL
The implementation of the knapsack problem in OpenCL used here takes an input of 128 values, initialized to the values 0 through 127, and uses 64 workitems to compute the ideal value. The entire implementation runs in a single workgroup and wavefront. The goal mass was 500. To compute the mass of each input, its binary representation was compared against a preexisting array of values (5, 10, 20, 50, 100, 300, 200, 150), and whenever a bit was high, the corresponding array value was added to the total mass for that input. For example, an element with the binary value 11010000 (bits read left to right against the array) would have a mass of 5 + 10 + 50 = 65.
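A sketch of this mass computation as it might appear in the kernel (names are hypothetical):

__constant int WEIGHTS[8] = {5, 10, 20, 50, 100, 300, 200, 150};

int mass(int candidate) {
    int total = 0;
    for (int bit = 0; bit < 8; bit++)
        if (candidate & (1 << bit))  // bit high: include this item's mass
            total += WEIGHTS[bit];
    return total;
}

Under this encoding, the ideal candidate 96 selects the masses 300 and 200, for a total of exactly 500.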
For the comparison process, the input corresponding to each workitem was compared to the input corresponding to the workitem index plus 64, so that all 128 values were covered. To determine which value was more fit, several conditions had to be set. If both values were less than the goal, the larger value was selected. If either value was equal to the goal, it was selected, and if both were equal to the goal, the first value was selected. If both values were greater than the goal, the smaller value was selected.
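These conditions translate directly into a small selection helper. A sketch, assuming the masses have already been computed (the mixed under/over case is not specified in the text, so the choice below is an assumption):

// Returns the fitter of two candidate masses, relative to the goal.
int fitter(int a, int b, int goal) {
    if (a == goal) return a;  // exact hits win; ties go to the first value
    if (b == goal) return b;
    if (a < goal && b < goal) return (a > b) ? a : b;  // both under: larger wins
    if (a > goal && b > goal) return (a < b) ? a : b;  // both over: smaller wins
    return (a < goal) ? a : b;  // assumption: prefer the value under the goal
}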
The mutation in this version of the algorithm was to replace the unfit value with the fit value. Both values were then written back into the input array, and the values corresponding to the workitems were shuffled by adding 1 to each index and taking the result modulo 128 before replacing them in the input array. Barriers were used to ensure that no hazards occurred during the replacement process.
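A sketch of one iteration inside the kernel, under the index-shuffle reading above. This is one possible arrangement consistent with the description, not the authors' exact kernel; fitter and mass are the helpers sketched earlier, and GOAL is 500:

int gid = get_global_id(0);  // 0..63: one workitem per pair
int a = input[gid];
int b = input[gid + 64];
// Selection: keep the candidate whose mass wins the comparison.
int keep = (fitter(mass(a), mass(b), GOAL) == mass(a)) ? a : b;
barrier(CLK_GLOBAL_MEM_FENCE);  // all reads finish before any write
// Mutation and shuffle: both slots receive the fit value, at indices rotated by 1.
input[(gid + 1) % 128]  = keep;
input[(gid + 65) % 128] = keep;
barrier(CLK_GLOBAL_MEM_FENCE);  // writes visible before the next iteration

Because the whole computation runs in a single workgroup, barrier() is sufficient to order these accesses; across multiple workgroups it would not be.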
The entire process is repeated for a set number of iterations in order to find the ideal result. The minimum number of iterations that ensured the ideal result was found was 100. The algorithm was then run with several different iteration counts to confirm the result and compare compute times. The ideal result was the decimal value 96, which corresponds to a mass of exactly the goal, 500.
4 Evaluation
The initial version of the knapsack implementation used global memory. This involved constantly passing values back to global memory, which is generally a time-consuming process. The algorithm was tested at 100, 1000, 10000, and 100000 iterations, and every time the result was the ideal 96. To determine the compute time, timing functions from C++11 were included in the host program (a sketch of such a measurement follows Table 1). Each configuration was run five times in order to get a good picture of the compute time, and these values can be seen in Table 1.
Tab. 1: Global Memory Results

Run   100 Iters   1000 Iters   10000 Iters   100000 Iters
1     0.000927s   0.002132s    0.014585s     0.147747s
2     0.000820s   0.002078s    0.014294s     0.140781s
3     0.000701s   0.002178s    0.014957s     0.142362s
4     0.000794s   0.002183s    0.024343s     0.138756s
5     0.000811s   0.002082s    0.016055s     0.159741s
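The paper's timing code is not shown; below is a minimal sketch of how such a measurement is typically written with C++11, reusing the hypothetical host names from Section 1 (queue, kernel, global, local):

#include <chrono>
#include <iostream>

// ... create the queue, kernel, and buffers as in the host skeleton ...
auto start = std::chrono::high_resolution_clock::now();
for (int it = 0; it < iterations; it++)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
clFinish(queue);  // wait for the GPU so the measurement covers the real work
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = end - start;
std::cout << elapsed.count() << "s\n";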
The genetic algorithm was also implemented using local memory. Theoretically, this implementation should be significantly faster, but copying every value from global memory to local memory required an additional barrier and took a significant amount of time. In addition, the computation was not complex enough to make up for that time, so the compute times in global memory and local memory were equivalent. The compute time values are in Table 2.
Tab. 2: Local Memory Results

Run   100 Iters   1000 Iters   10000 Iters   100000 Iters
1     0.000803s   0.002324s    0.014757s     0.208451s
2     0.000789s   0.002060s    0.021008s     0.134539s
3     0.000841s   0.002004s    0.015843s     0.135663s
4     0.000816s   0.002197s    0.014649s     0.137022s
5     0.000755s   0.002107s    0.014124s     0.144065s
4.1 Testbed Characteristics
CPU: 2.7 GHz Intel Core i7
GPUs: Intel HD Graphics 4000, NVIDIA GeForce GT 650M
Memory: 16 GB 1600 MHz DDR3
Operating System: OS X Yosemite
5 Conclusions
Through working on this project, a solid understanding of programming for GPUs with OpenCL was developed. Learning about the synchronization issues in parallel programming built the programming skills needed to complete more complex algorithms, such as the genetic algorithm discussed in this paper. Knowledge of the GPU execution and memory models gave a better understanding of how OpenCL interacts with the GPU, allowing more effective programming.
References
[1] General-purpose computing on graphics processing units. https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units
[2] OpenCL. https://en.wikipedia.org/wiki/OpenCL
[3] Hazard (computer architecture). https://en.wikipedia.org/wiki/Hazard_(computer_architecture)
[4] Critical section. https://en.wikipedia.org/wiki/Critical_section
[5] Genetic algorithm. https://en.wikipedia.org/wiki/Genetic_algorithm
[6] Knapsack problem. https://en.wikipedia.org/wiki/Knapsack_problem
