oneAPI DPC++ Workshop
9th December 2020
Agenda
• Intel® oneAPI
• Introduction
• DPC++
• Introduction
• DPC++ “Hello world”
• Lab
• Intel® DPC++ Compatibility Tool
• Introduction
• Demo
Introduction to Intel® oneAPI
Programming Challenges for XPUs
• Growth in specialized workloads
• Variety of data-centric hardware required
• No common programming language or APIs
• Inconsistent tool support across platforms
• Each platform requires unique software investment
Application workloads need diverse hardware: middleware/frameworks and languages & libraries sit on top of scalar, vector, matrix, and spatial XPUs (CPUs, GPUs, FPGAs, and other accelerators).
Introducing oneAPI
Unified programming model to simplify development across diverse architectures
• Unified and simplified language and libraries for expressing parallelism
• Uncompromised native high-level language performance
• Based on industry standards and open specifications
• Interoperable with existing HPC programming models
oneAPI is both an industry initiative and an Intel product: application workloads and middleware/frameworks run on oneAPI, which targets scalar, vector, matrix, and spatial XPUs (CPUs, GPUs, FPGAs, and other accelerators).
Data Parallel C++
Subarnarekha Ghosal
Introduction
Intel® oneAPI DPC++ Overview
DPC++ builds on C++17 plus the latest available SYCL specification plus Intel extensions (SYCL Next).
Intel® oneAPI DPC++ Overview
1. Data Parallel C++ is a high-level language designed to target heterogeneous architectures and take advantage of data parallelism.
2. Reuse code across CPUs and accelerators while performing custom tuning.
3. The open-source implementation on GitHub helps incorporate ideas from end users.
Before we start: Lambda Expressions
• A convenient way of defining an anonymous function object right at the location where it is invoked or passed as an argument to a function.
• Lambda functions can be used to define kernels in SYCL.
• The kernel lambda MUST use copy for all its captures (i.e., [=]); see the SYCL sketch below.
Example (capture clause, parameter list, lambda body):

#include <algorithm>
#include <cmath>

void abssort(float* x, unsigned n) {
  std::sort(x, x + n,
            // Lambda expression: [capture clause](parameter list) { lambda body }
            [](float a, float b) {
              return std::abs(a) < std::abs(b);
            });
}
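A hedged sketch of the same idea in SYCL (assuming a queue q, a buffer buf of N floats, and the headers and namespace from Step 1 below): the command-group lambda may capture by reference, while the kernel lambda captures by copy.

q.submit([&](handler& cgh) {                                  // command-group lambda: capture by reference is fine
  auto acc = buf.get_access(cgh);                             // accessor to the buffer for this command group
  cgh.parallel_for<class fill>(range<1>(N), [=](id<1> i) {    // kernel lambda: MUST capture by copy
    acc[i] = 0.0f;
  });
});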
DPC++ Program Flow
• The HOST queries for the available device(s) and creates a QUEUE; the queue executes the commands on the device.
• Command groups, built through a command group handler, control execution on the device and dispatch kernels to the device(s).
• Kernel model: send a kernel (lambda) for execution; parallel_for will execute in parallel across the compute elements of the device.
• Buffers (BUF A, BUF B, BUF C) and accessors (ACC A and ACC B for read, ACC C for write) manage memory across host and device.
DPC++ “Hello world”
Step 1
#include <CL/sycl.hpp>
using namespace cl::sycl;
Step 2
buffer bufA (A, range(SIZE));
buffer bufB (B, range(SIZE));
buffer bufC (C, range(SIZE));
Step 3
gpu_selector deviceSelector;
queue myQueue(deviceSelector);
• The device selector can be a default_selector, a cpu_selector, a gpu_selector, or an intel::fpga_selector.
• If the device is not explicitly mentioned during the creation of the command queue, the runtime selects one for you.
• It is good practice to specify the selector to make sure the right device is chosen; a small check is sketched below.
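A hedged sketch of such a check (assuming <iostream> is included along with the headers from Step 1): query the queue for the device it actually selected and print its name.

std::cout << "Running on: "
          << myQueue.get_device().get_info<info::device::name>()
          << std::endl;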
Step 4
myQueue.submit([&](handler& cgh) {
Step 5
auto A = bufA.get_access(cgh, read_only);
auto B = bufB.get_access(cgh, read_only);
auto C = bufC.get_access(cgh);
Step 6
cgh.parallel_for<class vector_add>(N, [=](auto i) {
  C[i] = A[i] + B[i];
});
• Each iteration (work-item) will have a separate index id (i).
DPC++ “Hello World”: Vector Addition, Entire Code

int main() {
  float A[N], B[N], C[N];
  {
    buffer bufA (A, range(N));
    buffer bufB (B, range(N));
    buffer bufC (C, range(N));
    queue myQueue;
    myQueue.submit([&](handler& cgh) {
      auto A = bufA.get_access(cgh, read_only);
      auto B = bufB.get_access(cgh, read_only);
      auto C = bufC.get_access(cgh);
      cgh.parallel_for<class vector_add>(N, [=](auto i) {
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 5; i++) {
    std::cout << "C[" << i << "] = " << C[i] << std::endl;
  }
  return 0;
}
Anatomy of a DPC++ Application
In the vector-addition code above:
• Host code: everything outside the lambda passed to myQueue.submit(), i.e. the array declarations, the buffer scope, the output loop, and the return, executes on the host.
• Accelerator device code: the body of the parallel_for kernel (C[i] = A[i] + B[i];) executes on the device.

DPC++ basics
• When the enclosing scope closes, the write buffer goes out of scope, so the kernel completes and the host pointer has a consistent view of the output.
DPC++ Demo Session
Intel® oneAPI DPC++ Heterogeneous Platform
The CPU acts as the host; the CPU, GPU, FPGA, and other accelerators act as devices.
For code samples on all these concepts, visit:
https://github.com/oneapi-src/oneAPI-samples/
DPC++ Summary
• DPC++ is an open, standards-based programming model for heterogeneous platforms.
• It can target different accelerators from different vendors.
• Single-source programming model.
• oneAPI specifications are available publicly: https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions
Feedback and active participation are encouraged.
Intel® DPC++ Compatibility Tool
What is the Intel® DPC++ Compatibility Tool?
• It migrates a portion of existing code written in CUDA to the newly developed DPC++ language.
• Our experience has shown that this can vary greatly, but on average about 80-90% of the CUDA code in applications can be migrated by this tool.
• Completion of the code and verification of the final code is expected to be a manual process done by the developer.
https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-intel-dpcpp-compatibility-tool/top.html
DPCT Demo Session
Backup
DPC++ Deep Dive
Intel® oneAPI DPC++ Heterogeneous Platform
The CPU acts as the host; the CPU, GPU, FPGA, and other accelerators act as devices.
Execution Flow
• A DPC++ application consists of host code, executed on the host (CPU), and device code, executed on the device (GPU, MIC, FPGA, …).
• The host submits command groups to command queues; a command group can contain synchronization commands, data-movement operations, and user-defined kernels.
• The host has its own host memory; the device exposes global/constant memory, local memory per compute unit (CU), and private memory per work-item.
A small sketch of this submit-and-wait flow follows.
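A hedged sketch of the flow above (the buffer bufA and size N are assumptions for illustration; headers and namespace as in Step 1):

queue q;                                       // host creates a command queue for a device
q.submit([&](handler& cgh) {                   // host submits a command group
  auto acc = bufA.get_access(cgh);             // data movement is handled via the accessor
  cgh.parallel_for<class scale>(range<1>(N), [=](id<1> i) {   // user-defined kernel
    acc[i] *= 2.0f;
  });
});
q.wait();                                      // synchronization: wait for the submitted work to finish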
Execution Flow (contd.): Execution of Kernel Instances
• A kernel instance = a kernel object, an nd_range, and a work-group decomposition.
• Kernel instances are enqueued from the command queues into a work-pool and executed on the compute units (CUs) of the device (GPU, FPGA, …).
Memory Model
Hardware Architecture
Memory Model
• Global memory: accessible to all work-items in all work-groups; reads and writes may be cached; persistent across kernel invocations.
• Constant memory: a region of global memory that remains constant during the execution of a kernel.
• Local memory: a memory region shared between work-items in a single work-group (a sketch of requesting local memory follows).
• Private memory: a region of memory private to a work-item; variables defined in one work-item's private memory are not visible to another work-item.
On the device (GPU, FPGA, …), each compute unit (CU) has its own local memory and each work-item has its own private memory; global/constant memory is shared by the whole device.
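A hedged sketch of requesting local memory in a kernel, using a SYCL 1.2.1-style local accessor (the buffer buf of at least 1024 floats and the work-group size of 64 are assumptions for illustration):

myQueue.submit([&](handler& cgh) {
  auto data = buf.get_access(cgh);
  // One local-memory tile per work-group, shared by its 64 work-items
  accessor<float, 1, access::mode::read_write, access::target::local>
      tile(range<1>(64), cgh);
  cgh.parallel_for<class use_local>(nd_range<1>(range<1>(1024), range<1>(64)),
                                    [=](nd_item<1> item) {
    tile[item.get_local_id(0)] = data[item.get_global_id(0)];   // stage into local memory
    item.barrier(access::fence_space::local_space);             // synchronize the work-group
  });
});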
DPC++ device memory model
Each device exposes global memory and constant memory shared by all of its work-groups. Every work-group has its own local memory, and every work-item within a work-group has its own private memory.
Unified Shared Memory
• The SYCL 1.2.1 specification offers buffers/accessors for tracking and managing memory transfers and guaranteeing data consistency across the host and DPC++ devices.
• Many HPC and enterprise applications use pointers to manage data.
• DPC++ extension for pointer-based programming: Unified Shared Memory (USM), where device kernels can access the data using pointers.
USM Allocation: Types of USM
• Device: explicit data movement.
• Host: data sent over a bus, such as PCIe.
• Shared: data can migrate between host and device.
A short sketch of a shared allocation follows.
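A hedged sketch of the shared flavor, using the malloc_shared/free USM APIs and the queue::parallel_for shortcut available in this DPC++ release (N and the doubling kernel are illustrative assumptions):

queue q;
const size_t N = 1024;
// Shared allocation: accessible on host and device; the runtime migrates it as needed
float* data = static_cast<float*>(malloc_shared(N * sizeof(float), q));
for (size_t i = 0; i < N; i++) data[i] = static_cast<float>(i);   // host writes through the pointer
q.parallel_for(range<1>(N), [=](id<1> i) {                        // device kernel uses the same pointer
  data[i] *= 2.0f;
}).wait();
free(data, q);                                                     // USM free, paired with the queue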
Kernel Model
Kernel Execution Model
• Kernel parallelism
• Multi-dimensional kernels
• ND-range
• Sub-group
• Work-group
• Work-item
Kernel Execution Model
• Explicit ND-range for control, similar to programming models such as OpenCL, SYCL, and CUDA.
• An ND-range defines the global work size, which is decomposed into work-groups made up of individual work-items.
nd_range & nd_item
• Example: process every pixel in a 1920x1080 image.
• Each pixel needs processing, so the kernel is executed on each pixel (work-item).
• 1920 x 1080 ≈ 2M pixels = global size.
• Not all 2M can run in parallel on the device; there are hardware resource limits.
• We have to split the work into smaller groups of pixel blocks = local size (work-group).
• Either let the compiler determine the work-group size, or specify the work-group size using nd_range().
nd_range & nd_item
Example: Process every pixel in a 1920x1080 image.

• Let the compiler determine the work-group size (only the global size is given):

h.parallel_for(range<2>(1920,1080), [=](id<2> item){
  // CODE THAT RUNS ON DEVICE
});

• Programmer specifies the work-group size (global size plus local size, i.e. the work-group size):

h.parallel_for(nd_range<2>(range<2>(1920,1080), range<2>(8,8)), [=](nd_item<2> item){
  // CODE THAT RUNS ON DEVICE
});
nd_range & nd_item
Example: Process every pixel in a 1920x1080 image. How do we choose the work-group size?
• A work-group size of 8x8 divides equally into 1920x1080: GOOD.
• A work-group size of 9x9 does not divide equally into 1920x1080: it results in an invalid work-group size error.
• A work-group size of 10x10 divides equally into 1920x1080: it works, but it is better to use a multiple of 8 for better resource utilization.
• A work-group size of 24x24 divides equally into 1920x1080, but 24x24 = 576 work-items per group will fail, assuming the GPU's maximum work-group size is 256.
The device's limit can be queried as sketched below.
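A hedged sketch of querying that limit (assuming a queue q and <iostream>):

auto maxWG = q.get_device().get_info<info::device::max_work_group_size>();
std::cout << "Max work-group size: " << maxWG << std::endl;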