OpenCL Kernel
Optimization Tips
Champ Yen (champ.yen@gmail.com)
http://champyen.blogspot.com
ver.20140820
Optimization - a form of balance
[Diagram: Device/Platform Features, Runtime/Toolchain, and Problem/Algorithm, with Optimization balanced among them]
Optimization is not a greedy search in a single direction. It is
more like finding a good balance point between the device, the
toolchain, and the problem.
Device - Computation
● device type
○ CPU - strong single-thread performance
○ GPU - many threads, high total throughput
● ISA design
○ scalar-based
○ vector-based
● number of compute units / processing elements
● estimate the impact of divergence and barriers
● capability for asynchronous data transfer
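
One way to estimate the impact of divergence before touching a kernel is a toy lockstep model. The sketch below is only an illustration, not a measurement of any real device: the group width of 16 and the branch costs of 10 and 30 cycles are made-up numbers, and the function names are hypothetical. It assumes that a group of work-items whose lanes disagree on a branch pays for both sides.

```python
def group_cost(predicates, cost_then, cost_else):
    """Estimated cycles for one lockstep group running `if p: A else: B`.

    Toy model (an assumption): if every lane agrees, only the taken
    branch runs; if lanes disagree, both branches run back to back.
    """
    any_then = any(predicates)
    any_else = not all(predicates)
    return cost_then * any_then + cost_else * any_else


def kernel_cost(predicates, group_width, cost_then, cost_else):
    """Sum the cost over consecutive groups of `group_width` work-items."""
    total = 0
    for start in range(0, len(predicates), group_width):
        group = predicates[start:start + group_width]
        total += group_cost(group, cost_then, cost_else)
    return total


# 64 work-items, half take each branch, laid out two different ways.
interleaved = [i % 2 == 0 for i in range(64)]  # neighbours disagree
blocked = [i < 32 for i in range(64)]          # neighbours agree

print(kernel_cost(interleaved, 16, 10, 30))  # 160
print(kernel_cost(blocked, 16, 10, 30))      # 80
```

The amount of work is identical in both layouts; only the arrangement of the data changes the estimate, which is why data layout is worth checking before rewriting the branch itself.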
Device - Memory
● get basic memory characteristics:
○ size
○ latency
○ throughput
○ coalescing effect
○ addressing mode
● global memory - unified with host memory or not
● local memory - dedicated hardware or emulated
● penalty for exceeding the available capacity
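
The coalescing effect can be approximated by counting how many distinct memory lines a group of work-items touches in one load. This is a rough sketch under stated assumptions: the 64-byte line size, 4-byte elements, and group of 16 work-items are illustrative values, not properties of any particular device, and `lines_touched` is a hypothetical helper.

```python
def lines_touched(num_items, stride_elems, elem_bytes=4, line_bytes=64):
    """Count distinct memory lines read when work-item i loads element i * stride.

    elem_bytes and line_bytes are assumed values for illustration only.
    """
    lines = {(i * stride_elems * elem_bytes) // line_bytes
             for i in range(num_items)}
    return len(lines)


print(lines_touched(16, 1))   # 1  (contiguous: one line serves all 16 loads)
print(lines_touched(16, 2))   # 2
print(lines_touched(16, 16))  # 16 (each work-item pulls in its own line)
```

Real devices differ in line size and in how they merge requests, so the vendor's optimization guide should supply the actual numbers.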
Toolchain/Runtime
● documents/tutorials/guides for debugging, profiling, and optimization.
● there is no perfect runtime/toolchain
● profiling/debugging tools.
● it is not always a good idea to debug or optimize on a platform other
than the target.
● automatic optimization MAY NOT HELP you reason about optimization
● tricky forms of computation/memory operations.
○ MAD operations
○ memory access mode
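
MAD operations are tricky partly because a fused multiply-add rounds once, while a separate multiply and add round twice. In OpenCL C, `fma` is specified as correctly rounded, whereas `mad` is allowed to trade accuracy for speed. The sketch below emulates the two behaviours on the host in double precision; the function names and input values are chosen for illustration, and the exact-arithmetic emulation is a stand-in for hardware, not how a device computes it.

```python
from fractions import Fraction


def mul_add_separate(a, b, c):
    """a * b is rounded to a double first, then the sum is rounded again."""
    return a * b + c


def mul_add_fused(a, b, c):
    """Emulate a fused multiply-add: compute exactly, round only once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(mul_add_separate(a, b, c))             # 0.0
print(mul_add_fused(a, b, c) == -2.0 ** -60)  # True
```

Whether a compiler turns `a * b + c` into a fused instruction depends on the toolchain and build options, so results can differ between platforms even when the source is identical.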
Problem/Algorithms
● DATA PARALLEL!
● multiple stages are not always bad.
○ doing everything in one stage uses more memory resources per work-item.
● vectorization is not always a good idea
● use an appropriate work-group size; a poor choice can mean:
○ bad memory access pattern, less coalescing
○ lower cache hit rate
○ less local memory for each work-item
○ possibly less private memory for each work-item.
● different forms of implementation
● do optimizations manually.
○ DO NOT rely on automatic features.
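
Two pieces of arithmetic come up whenever a work-group size is chosen. The sketch below assumes an OpenCL 1.x style launch, where the global size must be a multiple of the local size, and uses a 32 KiB local memory purely as an example figure. The helper names are hypothetical.

```python
def padded_global_size(num_items, local_size):
    """Smallest multiple of local_size that covers num_items."""
    return -(-num_items // local_size) * local_size


def local_bytes_per_item(local_mem_bytes, local_size):
    """Local memory share per work-item if one group uses all of it."""
    return local_mem_bytes // local_size


LOCAL_MEM = 32 * 1024  # assumed capacity, for illustration only

print(padded_global_size(1000, 64))       # 1024
print(local_bytes_per_item(LOCAL_MEM, 64))   # 512
print(local_bytes_per_item(LOCAL_MEM, 256))  # 128
```

With a padded launch, the kernel needs a bounds check on its global ID so the extra work-items do nothing. The second function shows the trade-off from the list above: quadrupling the group size cuts each work-item's share of local memory to a quarter.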
Q & A
