GPU Ray Tracing with CUDA
By Tom Pitkin

Bill Clark, PhD
Stu Steiner, MS, PhC

Objectives


Develop a sequential CPU and parallel GPU ray tracer



Illustrate the difference in rendering speed and design of a CPU and GPU ray tracer

Outline


Introduction to Ray Tracing



CUDA



Parallelization with CUDA / Results



Future Work



Questions

What is Ray Tracing?


Rendering technique used in computer graphics



Simulates the behavior of light



Can produce advanced optical effects

Light in the Physical World

[Figure labels: Light Source, Object with Red Reflectivity, Pinhole, Film]

The Virtual Camera Model


Eye Position – camera location in 3D space



Reference Point – point in 3D space where the camera is pointing



Orientation Vectors (u, v, n) – camera orientation in 3D space



Image Plane – projected plane of the camera’s field of view
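The camera parameters listed above can be turned into an orthonormal basis with two cross products. The following is a minimal sketch, not the thesis code: the names `Vec3`, `CameraBasis`, and `makeCameraBasis` are hypothetical, and it assumes the common convention that n points from the reference point back toward the eye.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

Vec3 normalize(Vec3 a) {
    double len = std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
    return {a.x / len, a.y / len, a.z / len};
}

// Hypothetical container for the three orientation vectors.
struct CameraBasis { Vec3 u, v, n; };

// Assumed convention: n points from the reference point toward the eye,
// u points to the camera's right, and v is the corrected up vector.
// upHint must not be parallel to the viewing direction.
CameraBasis makeCameraBasis(Vec3 eye, Vec3 reference, Vec3 upHint) {
    Vec3 n = normalize(sub(eye, reference));
    Vec3 u = normalize(cross(upHint, n));
    Vec3 v = cross(n, u);  // already unit length because n and u are orthonormal
    return {u, v, n};
}
```

For an eye at (0, 0, 5) looking at the origin with an up hint of (0, 1, 0), this yields the standard axes: u along x, v along y, n along z.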

[Figure labels: Eye Position, Reference Point, u, v (Up Vector), n]

Ray Generation


Map the physical screen to the image plane



Divide the image plane into a uniform grid of pixel locations



Send a ray through the center of each pixel location
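The three steps above can be sketched as a single function that returns the ray through the center of pixel (px, py). This is an illustrative sketch under stated assumptions: the `Camera` struct, its field names, and the choice that row 0 is the top of the image are not taken from the slides.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };
struct Ray { Vec3 origin, direction; };

Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }

Vec3 normalize(Vec3 a) {
    double len = std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
    return {a.x / len, a.y / len, a.z / len};
}

// Hypothetical camera description; all field names are illustrative.
struct Camera {
    Vec3 eye;
    Vec3 u, v, n;                    // orthonormal basis, n pointing away from the scene
    double planeWidth, planeHeight;  // image plane size in world units
    double planeDistance;            // distance from the eye to the image plane
};

Ray primaryRay(const Camera& cam, int px, int py, int screenWidth, int screenHeight) {
    // Size of one pixel on the image plane.
    double pixelWidth = cam.planeWidth / screenWidth;
    double pixelHeight = cam.planeHeight / screenHeight;
    // Offset of the pixel center from the center of the image plane.
    double sx = (px + 0.5) * pixelWidth - cam.planeWidth / 2.0;
    double sy = cam.planeHeight / 2.0 - (py + 0.5) * pixelHeight;  // row 0 is the top
    Vec3 dir = add(add(scale(cam.u, sx), scale(cam.v, sy)),
                   scale(cam.n, -cam.planeDistance));
    return {cam.eye, normalize(dir)};
}
```

The pixel size is simply the image plane extent divided by the screen resolution, which is the ratio shown on the slide's figure.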

[Figure labels: Eye Position, Pixel, Image Plane Height / Screen Height, Image Plane Width / Screen Width]

Ray Intersection Testing


Ray – Sphere Intersection



Ray – Triangle Intersection
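As an illustration of the first test, here is a minimal ray-sphere intersection sketch based on the usual quadratic solution. The slides do not show the formulation used in the thesis, so the function name, the out-parameter style, and the epsilon are assumptions. A miss is reported by returning false; what the caller does with a miss (for example, using a background color) is left open.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns true and writes the nearest positive ray parameter to tHit on a hit.
// Assumes dir is normalized, so the quadratic's leading coefficient is 1.
bool intersectSphere(Vec3 origin, Vec3 dir, Vec3 center, double radius, double& tHit) {
    const double eps = 1e-9;             // assumed tolerance against self-intersection
    Vec3 oc = sub(origin, center);
    double b = dot(oc, dir);             // half of the linear coefficient
    double c = dot(oc, oc) - radius * radius;
    double disc = b * b - c;             // quarter of the discriminant
    if (disc < 0.0) return false;        // the ray misses the sphere entirely
    double root = std::sqrt(disc);
    double t = -b - root;                // nearer intersection
    if (t <= eps) t = -b + root;         // origin inside the sphere: take the far one
    if (t <= eps) return false;          // sphere is behind the ray
    tHit = t;
    return true;
}
```

A ray starting at (0, 0, 5) and heading down the negative z axis hits a unit sphere at the origin at t = 4.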

Phong Reflection Model

Ambient + Diffuse + Specular = Phong Reflection

Specular Reflection


Recursive Ray Tracing
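The Phong sum (ambient + diffuse + specular) and the mirror direction followed by a reflected ray can be sketched as below. This is a single-light, scalar-intensity sketch, not the thesis implementation: the coefficients `ka`, `kd`, `ks`, and `shininess` are hypothetical material parameters, and all direction vectors are assumed to be unit length.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Mirror the incoming direction d about the unit normal n.
// A recursive ray tracer would send the reflected ray along this direction.
Vec3 reflect(Vec3 d, Vec3 n) { return sub(d, scale(n, 2.0 * dot(d, n))); }

// Scalar Phong intensity for one light.
// toLight and toEye point away from the surface point.
double phong(Vec3 normal, Vec3 toLight, Vec3 toEye,
             double ka, double kd, double ks, double shininess) {
    double ambient = ka;
    double nDotL = std::max(0.0, dot(normal, toLight));
    double diffuse = kd * nDotL;
    // The light arrives along -toLight; reflect it about the normal.
    Vec3 r = reflect(scale(toLight, -1.0), normal);
    double specular = 0.0;
    if (nDotL > 0.0) {  // no highlight when the light is behind the surface
        specular = ks * std::pow(std::max(0.0, dot(r, toEye)), shininess);
    }
    return ambient + diffuse + specular;
}
```

With the light and the eye both directly above the surface, all three terms contribute fully; with the light behind the surface, only the ambient term remains.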

What is CUDA?


Compute Unified Device Architecture (CUDA)



Parallel computing platform



Developed by Nvidia

Kernel Functions


Specifies the code to be executed in parallel



Single Program, Multiple Data (SPMD)

Kernel Execution


Grids



Blocks



Threads

Memory Model


Global Memory



Constant Memory



Texture Memory



Registers



Local Memory



Shared Memory

Thread Organization


2D array of blocks



2D array of threads



Each thread represents a ray
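The mapping from a 2D grid of blocks and threads to pixels comes down to a small piece of index arithmetic. The sketch below reproduces it in plain host-side C++ so it can be compiled without a GPU; in a CUDA kernel the same expression is normally written with the built-in `blockIdx`, `blockDim`, and `threadIdx` variables. The 16 x 16 block size in the usage note is an assumption, not a value from the slides.

```cpp
// Plain stand-in for CUDA's built-in 2D index types.
struct Dim2 { int x, y; };

// Pixel coordinates owned by one thread, plus a flag telling whether the
// thread falls inside the image (edge blocks may overhang it).
struct Pixel { int x, y; bool inImage; };

// Number of blocks needed to cover `pixels` along one axis (rounds up).
int blocksFor(int pixels, int threadsPerBlock) {
    return (pixels + threadsPerBlock - 1) / threadsPerBlock;
}

// Same arithmetic as the CUDA idiom blockIdx.x * blockDim.x + threadIdx.x.
Pixel pixelForThread(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx,
                     int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    return {x, y, x < width && y < height};
}
```

For a 640 x 480 image and an assumed 16 x 16 block, `blocksFor` gives a 40 x 30 grid, and thread (3, 5) of block (2, 1) traces the ray for pixel (35, 21).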

[Figure: Image Plane divided into Block (0, 0), Block (1, 0), Block (2, 0), Block (0, 1), Block (1, 1), Block (2, 1)]

Testing Environment


OS – Ubuntu Gnome Remix 13.04

CPU – Core i7-920
    Core Clock – 2.66 GHz

GPU – Nvidia GTX 570
    Core Clock – 742 MHz
    CUDA Cores – 480
    Memory Clock – 3800 MHz
    Video Memory – GDDR5 1280 MB

Test Objects


Teapot
    Surfaces: 1
    Triangles: 992

Al
    Surfaces: 174
    Triangles: 7,124

Crocodile
    Surfaces: 6
    Triangles: 34,404

Single Kernel

Time in milliseconds (logarithmic chart axis, 1 to 10,000,000)

Object      Surfaces   Triangles   Single Kernel       Single Thread
Teapot      1          992         160 (0.16 sec)      23,003 (23 sec)
Al          174        7,124       411 (0.41 sec)      55,260 (55.26 sec)
Crocodile   6          34,404      5,867 (5.87 sec)    1,617,160 (26.95 min)

Kernel Complexity and Size


Driver timeout



Register Spilling

Replacing Recursion


Iterative Loop



Layer based stack




Layers store color values returned from rays

Final image from convex combination of layers

Multi-Kernel

Time in milliseconds (logarithmic chart axis, 1 to 100,000)

Object      Surfaces   Triangles   Multi-Kernel          Single Kernel (Previous Kernel)
Teapot      1          992         381 (0.38 sec)        160 (0.16 sec)
Al          174        7,124       967 (0.97 sec)        411 (0.41 sec)
Crocodile   6          34,404      13,217 (13.22 sec)    5,867 (5.87 sec)

Multi-Kernel with Single-Precision Floating Points

Time in milliseconds (logarithmic chart axis, 1 to 100,000)

Object      Surfaces   Triangles   Multi-Kernel with Single-Precision Floating Points   Multi-Kernel (Previous Kernel)
Teapot      1          992         46 (0.05 sec)                                        381 (0.38 sec)
Al          174        7,124       118 (0.12 sec)                                       967 (0.97 sec)
Crocodile   6          34,404      1,556 (1.56 sec)                                     13,217 (13.22 sec)

Caching Surface Data


Object’s surface data stored in shared memory



All threads in same block have access to cached surface data



Removes duplicate memory requests



Data reuse

Multi-Kernel with Surface Caching

Time in milliseconds (logarithmic chart axis, 1 to 10,000)

Object      Surfaces   Triangles   Multi-Kernel with Surface Caching   Multi-Kernel with Single-Precision Floating Points (Previous Kernel)
Teapot      1          992         30 (0.03 sec)                       46 (0.05 sec)
Al          174        7,124       133 (0.13 sec)                      118 (0.12 sec)
Crocodile   6          34,404      1,007 (1.01 sec)                    1,556 (1.56 sec)

Simplifying Mesh Data




Triangle data originally stored as three points (vertices)



Optimize data by storing triangles as one point (vertex) and two edges


Calculate edges on host before kernel call
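Storing one vertex and two edges pairs naturally with the Möller–Trumbore ray-triangle test, which needs exactly those three vectors. The slides do not name the intersection algorithm used, so treat the following as an illustrative sketch of the idea; `EdgeTriangle`, `makeEdgeTriangle`, and `intersectTriangle` are hypothetical names.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// One vertex plus two edge vectors.
struct EdgeTriangle { Vec3 v0, e1, e2; };

// Done once on the host, so the edges are not recomputed for every ray.
EdgeTriangle makeEdgeTriangle(Vec3 a, Vec3 b, Vec3 c) {
    return {a, sub(b, a), sub(c, a)};
}

// Möller–Trumbore test (assumed here; not confirmed by the slides).
// Returns true and writes the ray parameter to tHit on a hit.
bool intersectTriangle(Vec3 origin, Vec3 dir, const EdgeTriangle& tri, double& tHit) {
    const double eps = 1e-9;
    Vec3 p = cross(dir, tri.e2);
    double det = dot(tri.e1, p);
    if (std::fabs(det) < eps) return false;      // ray parallel to the triangle plane
    double invDet = 1.0 / det;
    Vec3 s = sub(origin, tri.v0);
    double u = dot(s, p) * invDet;               // first barycentric coordinate
    if (u < 0.0 || u > 1.0) return false;
    Vec3 q = cross(s, tri.e1);
    double v = dot(dir, q) * invDet;             // second barycentric coordinate
    if (v < 0.0 || u + v > 1.0) return false;
    double t = dot(tri.e2, q) * invDet;
    if (t <= eps) return false;                  // triangle is behind the ray origin
    tHit = t;
    return true;
}
```

For the triangle with vertices (0, 0), (1, 0), and (0.5, 1) in the z = 0 plane, the stored form is the vertex (0, 0, 0) with edges (1, 0, 0) and (0.5, 1, 0).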

[Figure labels: (0.5, 1), (0, 0), (0.5, 1), (1, 0)]

Multi-Kernel with Mesh Optimization

Time in milliseconds (logarithmic chart axis, 1 to 10,000)

Object      Surfaces   Triangles   Multi-Kernel with Mesh Optimization   Multi-Kernel with Surface Caching (Previous Kernel)
Teapot      1          992         27 (0.03 sec)                         30 (0.03 sec)
Al          174        7,124       127 (0.13 sec)                        133 (0.13 sec)
Crocodile   6          34,404      873 (0.87 sec)                        1,007 (1.01 sec)

Final Results

Time in milliseconds (logarithmic chart axis, 1 to 10,000,000)

Object      Surfaces   Triangles   Multi-Kernel with Intersection Optimization   Single Thread
Teapot      1          992         27 (0.03 sec)                                 23,003 (23 sec)
Al          174        7,124       127 (0.13 sec)                                55,260 (55.26 sec)
Crocodile   6          34,404      873 (0.87 sec)                                1,617,160 (26.95 min)

Future Work


Spatial partitioning



Multiple GPUs



Optimize code for different GPUs

Questions?


Computer Science Thesis Defense


Editor's Notes

  • #3: Used C++ and CUDA
  • #7: Forward Ray Tracing; Backward Ray Tracing
  • #8: Pixel – picture element that represents one point on an image. Consists of a single color
  • #9: Don’t forget to mention what happens if a ray misses completely
  • #10: Ambient Light – indirect light reflected off of other objects in the scene; Diffuse Light – direct light reflected off the surface in all directions; Specular Light – direct light reflected off the surface in a single direction
  • #15: Blocks and threads have unique identifiers
  • #16: Register Memory – 50x faster than Global Memory; L2 Cache – LRU (Least Recently Used); L1 Cache – Spatial Locality (quickly accesses memory near the current memory reference), caches per-thread stack and other local data structures
  • #21: Logarithmic Scale; Single Pass, 640 x 480
  • #30: 1852X speedup!
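The speedup figure in the last note follows from dividing the single-thread time by the final GPU time reported on the Final Results slide. Worked out for all three test objects (rounded to whole numbers), the 1852X value corresponds to the Crocodile model:

```latex
\begin{align*}
\text{Teapot:}    \quad & \frac{23{,}003\ \text{ms}}{27\ \text{ms}} \approx 852\times \\
\text{Al:}        \quad & \frac{55{,}260\ \text{ms}}{127\ \text{ms}} \approx 435\times \\
\text{Crocodile:} \quad & \frac{1{,}617{,}160\ \text{ms}}{873\ \text{ms}} \approx 1852\times
\end{align*}
```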