Introduction to OpenCL
Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes
This is a straightforward lecture. It introduces the OpenCL specification while building a simple vector addition program.
The Mona Lisa images in the slides may be misleading in that we are not actually using OpenCL images, but they were nicer to look at than diagrams of buffers.
It would probably be a good idea to open up the example code and walk through it along with the lecture.
OpenCL Architecture
OpenCL allows parallel computing on heterogeneous devices: CPUs, GPUs, and other processors (Cell, DSPs, etc.). It provides portable accelerated code.
The specification is defined in four parts:
- Platform Model
- Execution Model
- Memory Model
- Programming Model
(We're going to diverge from this structure a bit.)
Platform Model
Each OpenCL implementation (i.e., an OpenCL library from AMD, NVIDIA, etc.) defines platforms, which enable the host system to interact with OpenCL-capable devices.
- Currently each vendor supplies only a single platform per implementation.
OpenCL uses an "Installable Client Driver" model:
- The goal is to allow platforms from different vendors to coexist.
- Current systems' device driver model will not allow different vendors' GPUs to run at the same time.
Platform Model
The model consists of a host connected to one or more OpenCL devices.
- A device is divided into one or more compute units.
- Compute units are divided into one or more processing elements.
- Each processing element maintains its own program counter.
Host/Devices
The host is whatever the OpenCL library runs on: x86 CPUs for both NVIDIA and AMD.
Devices are processors that the library can talk to: CPUs, GPUs, and generic accelerators.
For AMD:
- All CPUs are combined into a single device (each core is a compute unit and processing element).
- Each GPU is a separate device.
Selecting a Platform
Platforms are queried with clGetPlatformIDs. The function is usually called twice:
- The first call is used to get the number of platforms available to the implementation.
- Space is then allocated for the platform objects.
- The second call is used to retrieve the platform objects.
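A minimal sketch of this two-call pattern (error checking omitted; assumes <CL/cl.h> and <stdlib.h> are included):

cl_uint numPlatforms = 0;
clGetPlatformIDs(0, NULL, &numPlatforms);            // first call: query the count

cl_platform_id *platforms = (cl_platform_id*)malloc(numPlatforms * sizeof(cl_platform_id));
clGetPlatformIDs(numPlatforms, platforms, NULL);     // second call: fill the array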
Selecting Devices
Once a platform is selected, we can then query for the devices that it knows how to interact with, using clGetDeviceIDs.
- We can specify which types of devices we are interested in (e.g., all devices, CPUs only, GPUs only).
- This call is performed twice, as with clGetPlatformIDs: the first call determines the number of devices, the second retrieves the device objects.
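A sketch of the same pattern for devices, assuming the platforms array from the previous step:

cl_uint numDevices = 0;
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);           // count the GPUs

cl_device_id *devices = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);      // retrieve them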
Contexts
A context refers to the environment for managing OpenCL objects and resources.
To manage OpenCL programs, the following are associated with a context:
- Devices: the things doing the execution
- Program objects: the program source that implements the kernels
- Kernels: functions that run on OpenCL devices
- Memory objects: data that are operated on by the device
- Command queues: mechanisms for interaction with the devices (memory commands such as data transfers, kernel execution, and synchronization)
Contexts
When you create a context, you will provide a list of devices to associate with it. For the rest of the OpenCL resources, you will associate them with the context as they are created.
[Figure: an empty context]
Contexts
clCreateContext creates a context given a list of devices.
- The properties argument specifies which platform to use (if NULL, the default chosen by the vendor will be used).
- The function also provides a callback mechanism for reporting errors to the user.
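A sketch using the devices queried above (NULL properties selects the vendor default platform; no error callback is registered):

cl_int status;
cl_context context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);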
Command Queues
A command queue is the mechanism for the host to request that an action be performed by the device (perform a memory transfer, begin executing, etc.).
- A separate command queue is required for each device.
- Commands within the queue can be synchronous or asynchronous.
- Commands can execute in-order or out-of-order.
Command Queues
A command queue, created with clCreateCommandQueue, establishes a relationship between a context and a device.
The command queue properties specify:
- whether out-of-order execution of commands is allowed
- whether profiling is enabled
Profiling is done using events (discussed in a later lecture) and will create some overhead.
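A sketch creating a default (in-order, non-profiling) queue for the first device, using the OpenCL 1.x entry point that this lecture assumes:

cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, &status);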
Command Queues
Command queues associate a context with a device. Despite the figure below, they are not a physical connection.
[Figure: command queues connecting the context to its devices]
Memory Objects
Memory objects are OpenCL data that can be moved on and off devices. Objects are classified as either buffers or images.
Buffers:
- Contiguous chunks of memory, stored sequentially and accessible directly (arrays, pointers, structs)
- Read/write capable
Images:
- Opaque objects (2D or 3D)
- Can only be accessed via read_image() and write_image()
- Can either be read or written in a kernel, but not both
Creating Buffers
clCreateBuffer creates a buffer (a cl_mem object) for the given context. Images are more complex and will be covered in a later lecture.
The flags specify:
- the combination of reading and writing allowed on the data
- whether the host pointer itself should be used to store the data
- whether the data should be copied from the host pointer
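A sketch creating the three buffers needed for the vector addition example (N elements of int; no host pointer supplied, so the data will be written later):

cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(int), NULL, &status);
cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(int), NULL, &status);
cl_mem bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N * sizeof(int), NULL, &status);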
Memory Objects
Memory objects are associated with a context. They must be explicitly transferred to devices prior to execution (covered later).
[Figure: uninitialized OpenCL memory objects inside the context; the original input/output data (not OpenCL memory objects) sit on the host and will be transferred to/from these objects later]
Transferring Data
OpenCL provides commands to transfer data to and from devices: clEnqueue{Read|Write}{Buffer|Image}.
- Copying from the host to a device is considered writing; copying from a device to the host is reading.
- The write command both initializes the memory object with data and places it on a device.
- The validity of memory objects that are present on multiple devices is undefined by the OpenCL spec (i.e., vendor specific).
- OpenCL calls also exist to directly map part of a memory object to a host pointer.
Transferring Data
clEnqueueWriteBuffer initializes the OpenCL memory object and writes data to the device associated with the command queue.
- The command writes data from a host pointer (ptr) to the device.
- The blocking_write parameter specifies whether or not the command should return before the data transfer is complete.
- Events (discussed in another lecture) can specify which commands should be completed before this one runs.
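A sketch writing host arrays A and B (assumed to hold N ints) into the buffers created earlier, using blocking writes and no event wait list:

clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, N * sizeof(int), A, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, N * sizeof(int), B, 0, NULL, NULL);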
Transferring Data
Memory objects are transferred to devices by specifying an action (read or write) and a command queue. The validity of memory objects that are present on multiple devices is undefined by the OpenCL spec (i.e., vendor specific).
[Figure: images are written to a device; the images appear twice to show that they are both part of the context (on the host) and physically on the device]
Programs
A program object is basically a collection of OpenCL kernels.
- It can be source code (text) or a precompiled binary.
- It can also contain constant data and auxiliary functions.
Creating a program object requires either reading in a string (source code) or a precompiled binary.
To compile the program:
- Specify which devices are targeted; the program is compiled for each device.
- Pass in compiler flags (optional).
- Check for compilation errors (optional, output to screen).
Programs
A program object is created and compiled by providing source code or a binary file and selecting which devices to target.
[Figure: a program object inside the context]
Creating Programs
clCreateProgramWithSource creates a program object from strings of source code.
- count specifies the number of strings.
- The user must create a function to read the source code into a string.
- If the strings are not NULL-terminated, the lengths fields are used to specify the string lengths.
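A sketch with a single NULL-terminated source string; readSourceFile() is a hypothetical stand-in for the user-supplied file-reading helper mentioned above:

const char *source = readSourceFile("vecadd.cl");    // hypothetical helper
cl_program program = clCreateProgramWithSource(context, 1, &source,
                                               NULL,  // lengths = NULL: string is NULL-terminated
                                               &status);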
Compiling Programs
clBuildProgram compiles and links an executable from the program object for each device in the context.
- If device_list is supplied, then only those devices are targeted.
- Optional preprocessor, optimization, and other options can be supplied in the options argument.
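A sketch building for every device in the context (NULL device_list) with no compiler options:

status = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);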
Reporting Compile Errors
If a program fails to compile, OpenCL requires the programmer to explicitly ask for the compiler output.
- A compilation failure is indicated by an error value returned from clBuildProgram().
- Calling clGetProgramBuildInfo() with the program object and the parameter CL_PROGRAM_BUILD_LOG returns a string with the compiler output.
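A sketch of the usual two-call retrieval of the build log for the first device (assumes <stdio.h> and <stdlib.h>):

if (status != CL_SUCCESS) {
    size_t logSize;
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    char *log = (char*)malloc(logSize);
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    printf("%s\n", log);   // print the compiler output
    free(log);
}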
Kernels
A kernel is a function declared in a program that is executed on an OpenCL device.
- A kernel object is a kernel function along with its associated arguments.
- A kernel object is created from a compiled program.
- Arguments (memory objects, primitives, etc.) must be explicitly associated with the kernel object.
Kernels
Kernel objects are created from a program object by specifying the name of the kernel function.
[Figure: kernel objects inside the context]
Kernels
clCreateKernel creates a kernel from the given program. The kernel that is created is specified by a string that matches the name of the function within the program.
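For the running example, where the kernel function is named vecadd:

cl_kernel kernel = clCreateKernel(program, "vecadd", &status);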
Runtime Compilation
There is a high overhead for compiling programs and creating kernels.
- Each operation only has to be performed once (at the beginning of the program).
- The kernel objects can be reused any number of times by setting different arguments.
The runtime compilation flow: read source code into an array, then clCreateProgramWithSource, then clBuildProgram, then clCreateKernel. Alternatively, clCreateProgramWithBinary skips the source-compilation step.
Setting Kernel Arguments
Kernel arguments are set by repeated calls to clSetKernelArg. Each call must specify:
- the index of the argument as it appears in the function signature
- the size of the argument
- a pointer to the data
Examples:
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&d_iImage);
clSetKernelArg(kernel, 1, sizeof(int), (void*)&a);
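For the vector addition example, the three buffers created earlier map to arguments 0 through 2:

clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);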
Kernel Arguments
Memory objects and individual data values can be set as kernel arguments.
[Figure: data (e.g., images) are set as kernel arguments]
Thread Structure
Massively parallel programs are usually written so that each thread computes one part of a problem.
- For vector addition, we will add corresponding elements from two arrays, so each thread will perform one addition.
- If we think about the thread structure visually, the threads will usually be arranged in the same shape as the data.
Thread Structure
Consider a simple vector addition of 16 elements: 2 input buffers (A, B) and 1 output buffer (C) are required.
[Figure: vector addition A + B = C with array indices 0 through 15]
Thread Structure
Create a thread structure to match the problem, a 1-dimensional structure in this case.
[Figure: thread IDs 0 through 15, one per element of the vector addition A + B = C]
Thread Structure
Each thread is responsible for adding the elements at the index corresponding to its ID.
[Figure: thread IDs 0 through 15 mapped onto the vector addition A + B = C]
Thread Structure
OpenCL's thread structure is designed to be scalable.
- Each instance of a kernel is called a work-item (though "thread" is commonly used as well).
- Work-items are organized as work-groups.
- Work-groups are independent from one another (this is where scalability comes from).
- An index space defines a hierarchy of work-groups and work-items.
Thread Structure
Work-items can uniquely identify themselves based on:
- A global ID (unique within the index space)
- A work-group ID and a local ID within the work-group
Thread Structure
API calls allow threads to identify themselves and their data.
Threads can determine their global ID in each dimension:
- get_global_id(dim)
- get_global_size(dim)
Or they can determine their work-group ID and their ID within the work-group:
- get_group_id(dim)
- get_num_groups(dim)
- get_local_id(dim)
- get_local_size(dim)
In a 2D index space, get_global_id(0) = column and get_global_id(1) = row, and get_num_groups(0) * get_local_size(0) == get_global_size(0).
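An illustrative OpenCL C fragment (not from the lecture code) showing how these calls recover a 2D position:

__kernel void identify(__global int* out, int width) {
    int col = get_global_id(0);              // column within the index space
    int row = get_global_id(1);              // row within the index space
    out[row * width + col] = get_local_id(0);  // e.g., record the local ID
}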
Memory Model
The OpenCL memory model defines the various types of memories (closely related to the GPU memory hierarchy).
[Figure: the OpenCL memory model diagram]
Memory Model
Memory management is explicit: data must be moved from host memory to device global memory, from global memory to local memory, and back.
- Work-groups are assigned to execute on compute units.
- There is no guaranteed communication/coherency between different work-groups (no software mechanism in the OpenCL specification).
Writing a Kernel
One instance of the kernel is created for each thread. Kernels:
- Must begin with the keyword __kernel
- Must have return type void
- Must declare the address space of each argument that is a memory object (next slide)
- Use API calls (such as get_global_id()) to determine which data a thread will work on
Address Space Identifiers
- __global – memory allocated from the global address space
- __constant – a special type of read-only memory
- __local – memory shared by a work-group
- __private – private per work-item memory
- __read_only/__write_only – used for images
Kernel arguments that are memory objects must be __global, __local, or __constant.
Example Kernel
Simple vector addition kernel:

__kernel
void vecadd(__global int* A,
            __global int* B,
            __global int* C) {
    int tid = get_global_id(0);
    C[tid] = A[tid] + B[tid];
}
Executing the Kernel
We need to set the dimensions of the index space and, optionally, of the work-group sizes. Kernels execute asynchronously from the host: clEnqueueNDRangeKernel just adds the kernel to the queue but doesn't guarantee that it will start executing.
Executing the Kernel
A thread structure is defined by the index space that is created; each thread executes the same kernel on different data.
[Figure: an index space of threads is created (dimensions match the data)]
[Figure: each thread executes the kernel]
Executing the Kernel
clEnqueueNDRangeKernel tells the device associated with a command queue to begin executing the specified kernel.
- The global sizes (index space) must be specified; the local (work-group) sizes are optional.
- A list of events can be used to specify prerequisite operations that must be complete before executing.
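A sketch launching vecadd over a 1-dimensional index space of N work-items, leaving the work-group size to the implementation:

size_t globalWorkSize = N;
clEnqueueNDRangeKernel(queue, kernel,
                       1,                // work dimensions
                       NULL,             // global offset
                       &globalWorkSize,
                       NULL,             // local size: implementation chooses
                       0, NULL, NULL);   // no event wait list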
Copying Data Back
The last step is to copy the data back from the device to the host. The call (clEnqueueReadBuffer) is similar to writing a buffer to a device, but the data is transferred back to the host.
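A sketch reading the result buffer back into host array C with a blocking read:

clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, N * sizeof(int), C, 0, NULL, NULL);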
Copying Data Back
The output data is read from the device back to the host.
[Figure: the result is copied back from the GPU to the host]
Releasing Resources
Most OpenCL resources/objects are pointers that should be freed after they are done being used. There is a clRelease{Resource} command for most OpenCL types, e.g., clReleaseProgram() and clReleaseMemObject().
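A sketch releasing the objects from the running example, roughly in reverse order of creation:

clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseMemObject(bufA);
clReleaseMemObject(bufB);
clReleaseMemObject(bufC);
clReleaseCommandQueue(queue);
clReleaseContext(context);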
Error Checking
OpenCL commands return error codes as negative integer values.
- A return value of 0 indicates CL_SUCCESS.
- Negative values indicate an error.
- cl.h defines the meaning of each return value.
Note: errors are sometimes reported asynchronously.

CL_DEVICE_NOT_FOUND                  -1
CL_DEVICE_NOT_AVAILABLE              -2
CL_COMPILER_NOT_AVAILABLE            -3
CL_MEM_OBJECT_ALLOCATION_FAILURE     -4
CL_OUT_OF_RESOURCES                  -5
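A minimal error-checking helper in this spirit (not part of the lecture code; assumes <stdio.h> and <stdlib.h>):

void checkStatus(cl_int status, const char *msg) {
    if (status != CL_SUCCESS) {
        fprintf(stderr, "%s failed with error %d\n", msg, status);   // code defined in cl.h
        exit(EXIT_FAILURE);
    }
}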

Big Picture
[Figure: the complete picture, from platform and devices through context, command queues, memory objects, program, and kernels]
Programming Model
Data parallel:
- One-to-one mapping between work-items and elements in a memory object
- Work-groups can be defined explicitly (like CUDA) or implicitly (specify the number of work-items and OpenCL creates the work-groups)
Task parallel:
- A kernel is executed independent of an index space
- Other ways to express parallelism: enqueueing multiple tasks, using device-specific vector types, etc.
Synchronization:
- Possible between items in a work-group
- Possible between commands in a context command queue
Running the Example Code
A simple vector addition OpenCL program that goes along with this lecture was provided.
Before running, the following should appear in your .bashrc file:
export LD_LIBRARY_PATH=<path to stream sdk>/lib/x86_64
To compile, make sure that vecadd.c and vecadd.cl are in the current working directory, then:
gcc -o vecadd vecadd.c -I<path to stream sdk>/include -L<path to stream sdk>/lib/x86_64 -lOpenCL
Summary
- OpenCL provides an interface for the interaction of hosts with accelerator devices.
- A context is created that contains all of the information and data required to execute an OpenCL program.
- Memory objects are created that can be moved on and off devices.
- Command queues allow the host to request operations to be performed by the device.
- Programs and kernels contain the code that devices need to execute.

Editor's Notes

  • #5: Devices can be associated with multiple contexts if desired
  • #6: There is another function called clCreateContextFromType(), which will create a context using all the GPUs, CPUs, etc.
  • #10: Even though we show images here, the example will really be working with buffers
  • #11: If the device were a CPU, it could execute on the memory object in-place.
  • #18: Since kernels are executed asynchronously but return an error value immediately, runtime errors will likely be reported later.