Introduction to OpenCL
Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes
This is a straightforward lecture. It introduces the OpenCL specification while building a simple vector addition program.
The Mona Lisa images in the slides may be misleading in that we are not actually using OpenCL images, but they were nicer to look at than diagrams of buffers.
It would probably be a good idea to open up the example code and walk through it along with the lecture.
OpenCL Architecture
OpenCL allows parallel computing on heterogeneous devices: CPUs, GPUs, and other processors (Cell, DSPs, etc.). It provides portable accelerated code.
The specification is defined in four parts:
- Platform Model
- Execution Model
- Memory Model
- Programming Model
(We're going to diverge from this structure a bit.)
Platform Model
Each OpenCL implementation (i.e., an OpenCL library from AMD, NVIDIA, etc.) defines platforms, which enable the host system to interact with OpenCL-capable devices.
- Currently each vendor supplies only a single platform per implementation.
OpenCL uses an "Installable Client Driver" model:
- The goal is to allow platforms from different vendors to coexist.
- Current systems' device driver model will not allow different vendors' GPUs to run at the same time.
Platform Model
The model consists of a host connected to one or more OpenCL devices.
- A device is divided into one or more compute units.
- Compute units are divided into one or more processing elements.
- Each processing element maintains its own program counter.
Host/Devices
The host is whatever the OpenCL library runs on: x86 CPUs for both NVIDIA and AMD.
Devices are processors that the library can talk to: CPUs, GPUs, and generic accelerators.
For AMD:
- All CPUs are combined into a single device (each core is a compute unit and processing element).
- Each GPU is a separate device.
Selecting a Platform
Platforms are queried with clGetPlatformIDs. The function is usually called twice:
- The first call is used to get the number of platforms available to the implementation.
- Space is then allocated for the platform objects.
- The second call is used to retrieve the platform objects.
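A minimal sketch of this two-call pattern (error checking omitted; assumes <CL/cl.h> and <stdlib.h> are included):

cl_uint numPlatforms = 0;
clGetPlatformIDs(0, NULL, &numPlatforms);            // first call: query the count

cl_platform_id *platforms = (cl_platform_id*)malloc(numPlatforms * sizeof(cl_platform_id));
clGetPlatformIDs(numPlatforms, platforms, NULL);     // second call: fill the array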
Selecting Devices
Once a platform is selected, we can then query for the devices that it knows how to interact with, using clGetDeviceIDs.
- We can specify which types of devices we are interested in (e.g., all devices, CPUs only, GPUs only).
- This call is performed twice, as with clGetPlatformIDs: the first call determines the number of devices, the second retrieves the device objects.
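A sketch of the same pattern for devices, assuming the platforms array from the previous step:

cl_uint numDevices = 0;
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);           // count the GPUs

cl_device_id *devices = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);      // retrieve them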
Contexts
A context refers to the environment for managing OpenCL objects and resources.
To manage OpenCL programs, the following are associated with a context:
- Devices: the things doing the execution
- Program objects: the program source that implements the kernels
- Kernels: functions that run on OpenCL devices
- Memory objects: data that are operated on by the device
- Command queues: mechanisms for interaction with the devices (memory commands such as data transfers, kernel execution, and synchronization)
Contexts
When you create a context, you will provide a list of devices to associate with it. For the rest of the OpenCL resources, you will associate them with the context as they are created.
[Figure: an empty context]
Contexts
clCreateContext creates a context given a list of devices.
- The properties argument specifies which platform to use (if NULL, the default chosen by the vendor will be used).
- The function also provides a callback mechanism for reporting errors to the user.
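A sketch using the devices queried above (NULL properties selects the vendor default platform; no error callback is registered):

cl_int status;
cl_context context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);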
Command Queues
A command queue is the mechanism for the host to request that an action be performed by the device (perform a memory transfer, begin executing, etc.).
- A separate command queue is required for each device.
- Commands within the queue can be synchronous or asynchronous.
- Commands can execute in-order or out-of-order.
Command Queues
A command queue, created with clCreateCommandQueue, establishes a relationship between a context and a device.
The command queue properties specify:
- whether out-of-order execution of commands is allowed
- whether profiling is enabled
Profiling is done using events (discussed in a later lecture) and will create some overhead.
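A sketch creating a default (in-order, non-profiling) queue for the first device, using the OpenCL 1.x entry point that this lecture assumes:

cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, &status);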
Command Queues
Command queues associate a context with a device. Despite the figure below, they are not a physical connection.
[Figure: command queues connecting the context to its devices]
Memory Objects
Memory objects are OpenCL data that can be moved on and off devices. Objects are classified as either buffers or images.
Buffers:
- Contiguous chunks of memory, stored sequentially and accessible directly (arrays, pointers, structs)
- Read/write capable
Images:
- Opaque objects (2D or 3D)
- Can only be accessed via read_image() and write_image()
- Can either be read or written in a kernel, but not both
Creating Buffers
clCreateBuffer creates a buffer (a cl_mem object) for the given context. Images are more complex and will be covered in a later lecture.
The flags specify:
- the combination of reading and writing allowed on the data
- whether the host pointer itself should be used to store the data
- whether the data should be copied from the host pointer
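A sketch creating the three buffers needed for the vector addition example (N elements of int; no host pointer supplied, so the data will be written later):

cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(int), NULL, &status);
cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(int), NULL, &status);
cl_mem bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N * sizeof(int), NULL, &status);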
Memory Objects
Memory objects are associated with a context. They must be explicitly transferred to devices prior to execution (covered later).
[Figure: uninitialized OpenCL memory objects inside the context; the original input/output data (not OpenCL memory objects) sit on the host and will be transferred to/from these objects later]
Transferring Data
OpenCL provides commands to transfer data to and from devices: clEnqueue{Read|Write}{Buffer|Image}.
- Copying from the host to a device is considered writing; copying from a device to the host is reading.
- The write command both initializes the memory object with data and places it on a device.
- The validity of memory objects that are present on multiple devices is undefined by the OpenCL spec (i.e., vendor specific).
- OpenCL calls also exist to directly map part of a memory object to a host pointer.
Transferring Data
clEnqueueWriteBuffer initializes the OpenCL memory object and writes data to the device associated with the command queue.
- The command writes data from a host pointer (ptr) to the device.
- The blocking_write parameter specifies whether or not the command should return before the data transfer is complete.
- Events (discussed in another lecture) can specify which commands should be completed before this one runs.
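A sketch writing host arrays A and B (assumed to hold N ints) into the buffers created earlier, using blocking writes and no event wait list:

clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, N * sizeof(int), A, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, N * sizeof(int), B, 0, NULL, NULL);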
Transferring Data
Memory objects are transferred to devices by specifying an action (read or write) and a command queue. The validity of memory objects that are present on multiple devices is undefined by the OpenCL spec (i.e., vendor specific).
[Figure: images are written to a device; the images appear twice to show that they are both part of the context (on the host) and physically on the device]
Programs
A program object is basically a collection of OpenCL kernels.
- It can be source code (text) or a precompiled binary.
- It can also contain constant data and auxiliary functions.
Creating a program object requires either reading in a string (source code) or a precompiled binary.
To compile the program:
- Specify which devices are targeted; the program is compiled for each device.
- Pass in compiler flags (optional).
- Check for compilation errors (optional, output to screen).
Programs
A program object is created and compiled by providing source code or a binary file and selecting which devices to target.
[Figure: a program object inside the context]
Creating Programs
clCreateProgramWithSource creates a program object from strings of source code.
- count specifies the number of strings.
- The user must create a function to read the source code into a string.
- If the strings are not NULL-terminated, the lengths fields are used to specify the string lengths.
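A sketch with a single NULL-terminated source string; readSourceFile() is a hypothetical stand-in for the user-supplied file-reading helper mentioned above:

const char *source = readSourceFile("vecadd.cl");    // hypothetical helper
cl_program program = clCreateProgramWithSource(context, 1, &source,
                                               NULL,  // lengths = NULL: string is NULL-terminated
                                               &status);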
Compiling Programs
clBuildProgram compiles and links an executable from the program object for each device in the context.
- If device_list is supplied, then only those devices are targeted.
- Optional preprocessor, optimization, and other options can be supplied in the options argument.
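A sketch building for every device in the context (NULL device_list) with no compiler options:

status = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);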
Reporting Compile Errors
If a program fails to compile, OpenCL requires the programmer to explicitly ask for the compiler output.
- A compilation failure is indicated by an error value returned from clBuildProgram().
- Calling clGetProgramBuildInfo() with the program object and the parameter CL_PROGRAM_BUILD_LOG returns a string with the compiler output.
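A sketch of the usual two-call retrieval of the build log for the first device (assumes <stdio.h> and <stdlib.h>):

if (status != CL_SUCCESS) {
    size_t logSize;
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    char *log = (char*)malloc(logSize);
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    printf("%s\n", log);   // print the compiler output
    free(log);
}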
Kernels
A kernel is a function declared in a program that is executed on an OpenCL device.
- A kernel object is a kernel function along with its associated arguments.
- A kernel object is created from a compiled program.
- Arguments (memory objects, primitives, etc.) must be explicitly associated with the kernel object.
Kernels
Kernel objects are created from a program object by specifying the name of the kernel function.
[Figure: kernel objects inside the context]
Kernels
clCreateKernel creates a kernel from the given program. The kernel that is created is specified by a string that matches the name of the function within the program.
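For the running example, where the kernel function is named vecadd:

cl_kernel kernel = clCreateKernel(program, "vecadd", &status);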
Runtime Compilation
There is a high overhead for compiling programs and creating kernels.
- Each operation only has to be performed once (at the beginning of the program).
- The kernel objects can be reused any number of times by setting different arguments.
The runtime compilation flow: read source code into an array, then clCreateProgramWithSource, then clBuildProgram, then clCreateKernel. Alternatively, clCreateProgramWithBinary skips the source-compilation step.
Setting Kernel Arguments
Kernel arguments are set by repeated calls to clSetKernelArg. Each call must specify:
- the index of the argument as it appears in the function signature
- the size of the argument
- a pointer to the data
Examples:
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&d_iImage);
clSetKernelArg(kernel, 1, sizeof(int), (void*)&a);
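For the vector addition example, the three buffers created earlier map to arguments 0 through 2:

clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);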
Kernel Arguments
Memory objects and individual data values can be set as kernel arguments.
[Figure: data (e.g., images) are set as kernel arguments]
Thread Structure
Massively parallel programs are usually written so that each thread computes one part of a problem.
- For vector addition, we will add corresponding elements from two arrays, so each thread will perform one addition.
- If we think about the thread structure visually, the threads will usually be arranged in the same shape as the data.
Thread Structure
Consider a simple vector addition of 16 elements: 2 input buffers (A, B) and 1 output buffer (C) are required.
[Figure: vector addition A + B = C with array indices 0 through 15]
Thread Structure
Create a thread structure to match the problem, a 1-dimensional structure in this case.
[Figure: thread IDs 0 through 15, one per element of the vector addition A + B = C]
Thread Structure
Each thread is responsible for adding the elements at the index corresponding to its ID.
[Figure: thread IDs 0 through 15 mapped onto the vector addition A + B = C]
Thread Structure
OpenCL's thread structure is designed to be scalable.
- Each instance of a kernel is called a work-item (though "thread" is commonly used as well).
- Work-items are organized as work-groups.
- Work-groups are independent from one another (this is where scalability comes from).
- An index space defines a hierarchy of work-groups and work-items.
Thread Structure
Work-items can uniquely identify themselves based on:
- A global ID (unique within the index space)
- A work-group ID and a local ID within the work-group
Thread Structure
API calls allow threads to identify themselves and their data.
Threads can determine their global ID in each dimension:
- get_global_id(dim)
- get_global_size(dim)
Or they can determine their work-group ID and their ID within the work-group:
- get_group_id(dim)
- get_num_groups(dim)
- get_local_id(dim)
- get_local_size(dim)
In a 2D index space, get_global_id(0) = column and get_global_id(1) = row, and get_num_groups(0) * get_local_size(0) == get_global_size(0).
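An illustrative OpenCL C fragment (not from the lecture code) showing how these calls recover a 2D position:

__kernel void identify(__global int* out, int width) {
    int col = get_global_id(0);              // column within the index space
    int row = get_global_id(1);              // row within the index space
    out[row * width + col] = get_local_id(0);  // e.g., record the local ID
}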
Memory Model
The OpenCL memory model defines the various types of memories (closely related to the GPU memory hierarchy).
[Figure: the OpenCL memory model diagram]
Memory Model
Memory management is explicit: data must be moved from host memory to device global memory, from global memory to local memory, and back.
- Work-groups are assigned to execute on compute units.
- There is no guaranteed communication/coherency between different work-groups (no software mechanism in the OpenCL specification).
Writing a Kernel
One instance of the kernel is created for each thread. Kernels:
- Must begin with the keyword __kernel
- Must have return type void
- Must declare the address space of each argument that is a memory object (next slide)
- Use API calls (such as get_global_id()) to determine which data a thread will work on
Address Space Identifiers
- __global – memory allocated from the global address space
- __constant – a special type of read-only memory
- __local – memory shared by a work-group
- __private – private per work-item memory
- __read_only/__write_only – used for images
Kernel arguments that are memory objects must be __global, __local, or __constant.
Example Kernel
Simple vector addition kernel:

__kernel
void vecadd(__global int* A,
            __global int* B,
            __global int* C) {
    int tid = get_global_id(0);
    C[tid] = A[tid] + B[tid];
}
Executing the Kernel
We need to set the dimensions of the index space and, optionally, of the work-group sizes. Kernels execute asynchronously from the host: clEnqueueNDRangeKernel just adds the kernel to the queue but doesn't guarantee that it will start executing.
Executing the Kernel
A thread structure is defined by the index space that is created; each thread executes the same kernel on different data.
[Figure: an index space of threads is created (dimensions match the data)]
[Figure: each thread executes the kernel]
Executing the Kernel
clEnqueueNDRangeKernel tells the device associated with a command queue to begin executing the specified kernel.
- The global sizes (index space) must be specified; the local (work-group) sizes are optional.
- A list of events can be used to specify prerequisite operations that must be complete before executing.
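A sketch launching vecadd over a 1-dimensional index space of N work-items, leaving the work-group size to the implementation:

size_t globalWorkSize = N;
clEnqueueNDRangeKernel(queue, kernel,
                       1,                // work dimensions
                       NULL,             // global offset
                       &globalWorkSize,
                       NULL,             // local size: implementation chooses
                       0, NULL, NULL);   // no event wait list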
Copying Data Back
The last step is to copy the data back from the device to the host. The call (clEnqueueReadBuffer) is similar to writing a buffer to a device, but the data is transferred back to the host.
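A sketch reading the result buffer back into host array C with a blocking read:

clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, N * sizeof(int), C, 0, NULL, NULL);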
Copying Data Back
The output data is read from the device back to the host.
[Figure: the result is copied back from the GPU to the host]
Releasing Resources
Most OpenCL resources/objects are pointers that should be freed after they are done being used. There is a clRelease{Resource} command for most OpenCL types, e.g., clReleaseProgram() and clReleaseMemObject().
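A sketch releasing the objects from the running example, roughly in reverse order of creation:

clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseMemObject(bufA);
clReleaseMemObject(bufB);
clReleaseMemObject(bufC);
clReleaseCommandQueue(queue);
clReleaseContext(context);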
Error Checking
OpenCL commands return error codes as negative integer values.
- A return value of 0 indicates CL_SUCCESS.
- Negative values indicate an error.
- cl.h defines the meaning of each return value.
Note: errors are sometimes reported asynchronously.

CL_DEVICE_NOT_FOUND                  -1
CL_DEVICE_NOT_AVAILABLE              -2
CL_COMPILER_NOT_AVAILABLE            -3
CL_MEM_OBJECT_ALLOCATION_FAILURE     -4
CL_OUT_OF_RESOURCES                  -5
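A minimal error-checking helper in this spirit (not part of the lecture code; assumes <stdio.h> and <stdlib.h>):

void checkStatus(cl_int status, const char *msg) {
    if (status != CL_SUCCESS) {
        fprintf(stderr, "%s failed with error %d\n", msg, status);   // code defined in cl.h
        exit(EXIT_FAILURE);
    }
}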

Big Picture
[Figure: the complete picture, from platform and devices through context, command queues, memory objects, program, and kernels]
Programming Model
Data parallel:
- One-to-one mapping between work-items and elements in a memory object
- Work-groups can be defined explicitly (like CUDA) or implicitly (specify the number of work-items and OpenCL creates the work-groups)
Task parallel:
- A kernel is executed independent of an index space
- Other ways to express parallelism: enqueueing multiple tasks, using device-specific vector types, etc.
Synchronization:
- Possible between items in a work-group
- Possible between commands in a context command queue
Running the Example Code
A simple vector addition OpenCL program that goes along with this lecture was provided.
Before running, the following should appear in your .bashrc file:
export LD_LIBRARY_PATH=<path to stream sdk>/lib/x86_64
To compile, make sure that vecadd.c and vecadd.cl are in the current working directory, then:
gcc -o vecadd vecadd.c -I<path to stream sdk>/include -L<path to stream sdk>/lib/x86_64 -lOpenCL
Summary
- OpenCL provides an interface for the interaction of hosts with accelerator devices.
- A context is created that contains all of the information and data required to execute an OpenCL program.
- Memory objects are created that can be moved on and off devices.
- Command queues allow the host to request operations to be performed by the device.
- Programs and kernels contain the code that devices need to execute.

Editor's Notes

  • #5: Devices can be associated with multiple contexts if desired
  • #6: There is another function called clCreateContextFromType(), which will create a context using all the GPUs, CPUs, etc.
  • #10: Even though we show images here, the example will really be working with buffers
  • #11: If the device were a CPU, it could execute on the memory object in-place.
  • #18: Since kernels are executed asynchronously but return an error value immediately, runtime errors will likely be reported later.