GPU Programming on CPU - Using C++AMP

GPU Programming on CPUs Using C++AMP
Miller Lee

Outline
1. Introduction to C++AMP
2. Introduction to Tiling
3. tile_static
4. barrier.wait and solutions
   a. C++11 thread
   b. setjmp/longjmp
   c. ucontext

Computing example
● Simple matrix multiplication
(Figure, labelled "Homogeneous coordinates": Matrix A, a 4×4 grid with entries indexed (0, 0) through (3, 3), times vector b with entries 0 to 3, equals vector result with entries 0 to 3.)

C++ Version
int A[4][4];
int b[4];
int result[4];
for (int i = 0; i < 4; i++) {
    result[i] = 0;
    for (int j = 0; j < 4; j++)
        result[i] += A[i][j] * b[j];
}

C++AMP Version
array_view<float, 2> A(4, 4);
array_view<float, 1> b(4);
array_view<float, 1> result(4);
extent<1> ext(4);
// restrict(amp) lambdas must capture array_view objects by value
parallel_for_each(ext, [=](index<1> idx) restrict(amp)
{
    result[idx[0]] = 0;
    for (int i = 0; i < 4; i++)
        result[idx[0]] += A(idx[0], i) * b(i);
});

memory access
(Figure: processors P0 to P3 read elements 0 to 3 of b from global memory.)
● Cost of one global memory access: 100t
● Total access time = 400t

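shared memory
(Figure: elements 0 to 3 of b are placed in shared memory and read from there.)
● Cost of one shared memory access: 10t
● Cost of one global memory access: 100t
● Total access time = 130t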

array_view<float, 2> A(4, 4);
array_view<float, 1> b(4);
array_view<float, 1> result(4);
extent<1> ext(4);
parallel_for_each(ext.tile<4>(), [=](tiled_index<4> tidx) restrict(amp)
{
    int local = tidx.local[0];
    int global = tidx.global[0];
    tile_static float buf[4];     // shared by the threads of one tile
    buf[local] = b[global];       // each thread loads one element of b
    tidx.barrier.wait();          // wait until buf is completely filled
    result[global] = 0;
    for (int i = 0; i < 4; i++)
        result[global] += A(global, i) * buf[i];
});

Architecture
source: NVIDIA Tesla: A Unified Graphics and Computing Architecture
shared memory: accessible to all SPs

Goal
● Implement all the C++AMP functions on the CPU instead of the GPU, without any compiler modification.
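As a rough illustration of this goal, the sketch below runs a C++AMP-style kernel on the CPU with an ordinary compiler. The namespace cpu_amp and the types index1 and extent1 are hypothetical stand-ins invented for this example, not the real C++AMP headers, and the kernel is simply executed serially once per index.

```cpp
#include <cassert>
#include <vector>

namespace cpu_amp {            // hypothetical namespace, not the real C++AMP API
struct index1 {                // stand-in for index<1>
    int value;
    int operator[](int) const { return value; }
};
struct extent1 {               // stand-in for extent<1>
    int size;
};
// Runs the kernel once per index, one after another, on the calling thread.
template <typename Kernel>
void parallel_for_each(extent1 ext, Kernel kernel) {
    for (int i = 0; i < ext.size; ++i)
        kernel(index1{i});
}
}  // namespace cpu_amp

// 4x4 matrix (row-major) times 4-element vector, written against the stand-in API.
std::vector<float> multiply(const std::vector<float>& A,
                            const std::vector<float>& b) {
    assert(A.size() == 16 && b.size() == 4);
    std::vector<float> result(4, 0.0f);
    cpu_amp::parallel_for_each(cpu_amp::extent1{4}, [&](cpu_amp::index1 idx) {
        for (int i = 0; i < 4; ++i)
            result[idx[0]] += A[idx[0] * 4 + i] * b[i];
    });
    return result;
}
```

Calling multiply with an identity matrix returns b unchanged. A serial loop like this is enough for kernels without barriers; kernels that call barrier.wait need one of the approaches discussed later.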

tile_static
● The limitations of C++ syntax leave the following choices for defining tile_static
○ const, volatile
○ __attribute__(...)
○ static
● Choose static
○ static memory can be shared among all the threads
○ side effect: at most one thread group can be executed at a time.
#define tile_static static
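A small sketch of why static works as a substitute: a function-local static array is a single object, so every thread that enters the function sees the same buffer. The functions kernel and run_group are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <thread>
#include <vector>

#define tile_static static     // the mapping chosen on the slide

// Each thread writes its own slot; buf is one object shared by all callers.
int* kernel(int local) {
    assert(local >= 0 && local < 4);
    tile_static int buf[4];
    buf[local] = local * 10;
    return buf;
}

// Launches a group of four threads and sums what they left in the buffer.
int run_group() {
    std::vector<std::thread> group;
    for (int t = 0; t < 4; ++t)
        group.emplace_back(kernel, t);
    for (auto& th : group)
        th.join();
    int* buf = kernel(0);      // same storage the threads wrote to
    return buf[0] + buf[1] + buf[2] + buf[3];
}
```

Here run_group returns 60 (0 + 10 + 20 + 30). Because there is only one buf in the whole program, two thread groups running at once would overwrite each other's data, which is the side effect noted above.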

Barrier.wait
● Threads in the same thread group wait at the point where “wait” is called until every thread in the group has reached it.
● The program can
a. perform a real barrier action
b. jump out of the current execution context

Approaches
● True threading
○ C++11 thread
● Fake threading (coroutines)
○ setjmp/longjmp
○ makecontext/getcontext/swapcontext/setcontext

C++11 thread
● Launch hundreds of threads at a time.
● Implement my own barrier using the C++11 mutex library.
→ Extremely slow.
→ The data in static memory will be corrupted.
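The slides do not show the barrier itself. One possible shape, assuming a std::mutex combined with a std::condition_variable, is sketched below; MutexBarrier and exchange are hypothetical names, and this is not necessarily how the author's barrier was written.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class MutexBarrier {           // hypothetical name
public:
    explicit MutexBarrier(unsigned count)
        : count_(count), waiting_(0), generation_(0) { assert(count > 0); }

    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        unsigned gen = generation_;
        if (++waiting_ == count_) {      // last thread to arrive
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {                         // sleep until the generation changes
            cv_.wait(lock, [&] { return generation_ != gen; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned count_, waiting_, generation_;
};

// Phase 1: every thread stores a value. Phase 2: it reads its neighbour's value.
std::vector<int> exchange(unsigned n) {
    std::vector<int> in(n), out(n);
    MutexBarrier barrier(n);
    std::vector<std::thread> group;
    for (unsigned t = 0; t < n; ++t)
        group.emplace_back([&, t] {
            in[t] = int(t) + 1;
            barrier.wait();              // all writes finish before any read
            out[t] = in[(t + 1) % n];
        });
    for (auto& th : group)
        th.join();
    return out;
}
```

With four threads, exchange(4) yields {2, 3, 4, 1}. Every wait takes the lock and most threads go to sleep, which gives a feel for why this approach becomes slow with hundreds of threads.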

setjmp/longjmp
● int setjmp(jmp_buf env);
○ setjmp() saves the stack context/environment in env for later use by longjmp().
○ The stack context will be invalidated if the function which called setjmp() returns.
● void longjmp(jmp_buf env, int val);
○ longjmp() restores the environment saved by the last call of setjmp() with the same env.

#include <stdio.h>
#include <setjmp.h>

jmp_buf buf;

void wait(void) {
    printf("wait\n");      // prints
    longjmp(buf, 1);
}

void first(void) {
    wait();
    printf("first\n");     // does not print
}

int main() {
    if (!setjmp(buf))
        first();           // when executed, setjmp returns 0
    else                   // when longjmp jumps back, setjmp returns 1
        printf("main\n");  // prints
    return 0;
}

Pseudo code (1)
void entry()
{
    while(!finish)
        for(t : tasks)
            run(t)
}
void fun()
{
    ...
    wait();
    ...
}
void fun()
{
    ...
    wait();
    ...
}


#include <stdio.h>
#include <setjmp.h>

jmp_buf buf, b;

void wait(void) {
    printf("wait\n");
    if (setjmp(b) == 0)    // save the context of wait() in b
        longjmp(buf, 1);   // then jump back to main
}

void first(void) {
    wait();
}

int main() {
    if (!setjmp(buf))
        first();
    else {
        printf("main\n");
        longjmp(b, 10);    // try to resume wait()
    }
    return 0;
}

(Figure: stack walkthrough of the program above, in four steps.
1. setjmp(buf) in main saves the context in buf.
2. first() calls wait(); the return address is pushed on the stack and setjmp(b) saves the context in b.
3. longjmp(buf, 1) goes back to main; buf and b still hold their saved contexts.
4. longjmp(b, 10) jumps back into wait(), but the stack contents it relied on are now unknown (???), so wait() cannot return.)

Problems
● Cannot return
○ the return address on the stack is destroyed
● Cannot use too many static variables
○ spilled registers will be lost
→ can be solved by using “alloca”
http://www.codemud.net/~thinker/GinGin_CGI.py/show_id_doc/489

ucontext.h
● ucontext_t
● getcontext
● makecontext
● swapcontext
● setcontext

ucontext_t
typedef struct ucontext {
    struct ucontext *uc_link;
    sigset_t uc_sigmask;
    stack_t uc_stack;
    mcontext_t uc_mcontext;
    ...
} ucontext_t;
● uc_link
○ points to the context that will be resumed when the current context terminates
● uc_stack
○ the stack used by this context
● uc_mcontext
○ machine-specific representation of the saved context, which includes the calling thread's machine registers

Functions
● int getcontext(ucontext_t *ucp);
○ initializes the structure pointed to by ucp with the currently active context.
● int setcontext(const ucontext_t *ucp);
○ restores the user context pointed to by ucp.
● int swapcontext(ucontext_t *oucp, const ucontext_t *ucp);
○ saves the current context in the structure pointed to by oucp, and then activates the context pointed to by ucp.
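To make the save/restore pair concrete, here is a minimal sketch that uses getcontext and setcontext as a loop. The function count_resumes is a hypothetical name for this example; the counter is volatile so that its value survives each jump back.

```cpp
#include <ucontext.h>
#include <cassert>

// Counts how many times execution passes the point saved by getcontext().
int count_resumes(int limit) {
    assert(limit > 0);
    ucontext_t ctx;
    volatile int passes = 0;   // volatile: must keep its value across the jump
    getcontext(&ctx);          // execution continues here after setcontext()
    passes = passes + 1;
    if (passes < limit)
        setcontext(&ctx);      // jump back to just after getcontext()
    return passes;
}
```

count_resumes(3) returns 3: the code after getcontext runs once normally and twice more after setcontext restores the saved context.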

makecontext
● void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
○ Following the AMD64 ABI, glibc (x86_64) places the arguments in registers instead of pushing them on the stack
○ Each argument passed to makecontext should be no smaller than sizeof(register)

#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx[2];

static void f1 (void) {
    puts("start f1");
    swapcontext(&ctx[1], &ctx[0]);   // yield back to main
    puts("finish f1");
}

int main (void)
{
    char st1[8192];
    getcontext(&ctx[1]);
    ctx[1].uc_stack.ss_sp = st1;
    ctx[1].uc_stack.ss_size = sizeof st1;
    ctx[1].uc_link = &ctx[0];        // resume main when f1 finishes
    makecontext(&ctx[1], f1, 0);
    swapcontext(&ctx[0], &ctx[1]);   // run f1 until it yields
    swapcontext(&ctx[0], &ctx[1]);   // resume f1 until it finishes
    return 0;
}

#include <ucontext.h>

static ucontext_t ctx[3];

static void f1 (void) {
    ...
    swapcontext(&ctx[1], &ctx[0]);
    ...
}

static void f2 (void)
{
    ...
    swapcontext(&ctx[2], &ctx[1]);
    ...
}

int main (void)
{
    char st1[8192], st2[8192];
    ...
    ctx[1].uc_stack.ss_size = sizeof st1;
    ...
    ctx[2].uc_stack.ss_size = sizeof st2;
    ...
    return 0;
}

Fake threading (yield)
void entry()
{
    setup(fun, 2);
    while(!finish)
        switch_to();
}
void fun()
{
    ...
    wait();
    ...
}
void fun()
{
    ...
    wait();
    ...
}
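The pseudo code above can be sketched with ucontext roughly as follows. This is one possible reading, not the author's implementation: the globals (scheduler_ctx, task_ctx, done, current, shared_buf, results) are hypothetical, setup is folded into entry, and switch_to becomes a swapcontext call inside a round-robin loop.

```cpp
#include <ucontext.h>
#include <cassert>
#include <vector>

namespace {
constexpr int kTasks = 2;
ucontext_t scheduler_ctx;      // context of entry()
ucontext_t task_ctx[kTasks];   // one context per fake thread
bool done[kTasks];
int current = 0;               // index of the task that is running
int shared_buf[kTasks];        // plays the role of tile_static data
int results[kTasks];

// Barrier substitute: hand control back to the scheduler.
void wait() { swapcontext(&task_ctx[current], &scheduler_ctx); }

void fun() {
    int me = current;
    shared_buf[me] = (me + 1) * 100;              // phase 1: publish a value
    wait();                                       // let the other tasks catch up
    results[me] = shared_buf[(me + 1) % kTasks];  // phase 2: read the neighbour
    done[me] = true;
}
}  // namespace

std::vector<int> entry() {
    std::vector<std::vector<char>> stacks(kTasks, std::vector<char>(64 * 1024));
    for (int t = 0; t < kTasks; ++t) {            // setup(fun, 2)
        done[t] = false;
        getcontext(&task_ctx[t]);
        task_ctx[t].uc_stack.ss_sp = stacks[t].data();
        task_ctx[t].uc_stack.ss_size = stacks[t].size();
        task_ctx[t].uc_link = &scheduler_ctx;     // resume entry() when fun() ends
        makecontext(&task_ctx[t], fun, 0);
    }
    bool finish = false;
    while (!finish) {
        finish = true;
        for (current = 0; current < kTasks; ++current) {
            if (done[current]) continue;
            swapcontext(&scheduler_ctx, &task_ctx[current]);  // switch_to()
            finish = finish && done[current];
        }
    }
    assert(done[0] && done[1]);
    return {results[0], results[1]};
}
```

entry() returns {200, 100}: each task only reads its neighbour's slot after every task has passed the wait() call, which is the behaviour a barrier is supposed to give. Because each task keeps its own stack, wait() can return normally, unlike in the setjmp/longjmp version.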

Problems
1. How to pass a lambda?
○ makecontext(&ctx, (void (*)(void))&Kernel::operator(), …);
2. How to pass non-int arguments?
○ What if sizeof(Type) > sizeof(int)?
○ What about complex structures and classes?

Pass lambda
1. Use a wrapper function!!
template <typename Ker, typename Arg>
void fun(Ker k, Arg arg)
{
    k(arg);
}
template <typename Ker, typename Arg>
void makectx(Ker k, Arg arg)
{
    makecontext(&ctx, (void (*)(void))fun<Ker, Arg>, 2, k, arg);
}

Pass non-int arguments
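2. Pass pointer instead!!
template <typename Ker, typename Arg>
void fun(Ker *k, Arg *arg)
{
    (*k)(*arg);
}
template <typename Ker, typename Arg>
void makectx(Ker &k, Arg &arg)   // k and arg must outlive the context
{
    makecontext(&ctx, (void (*)(void))fun<Ker, Arg>, 2, &k, &arg);
}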
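A self-contained variant of the pointer approach is sketched below. It differs from the slide in one respect, which is an assumption made here for portability: since makecontext formally takes int arguments, each pointer is split into two 32-bit halves and rebuilt inside the wrapper. The names trampoline and run_in_context are hypothetical.

```cpp
#include <ucontext.h>
#include <cassert>
#include <cstdint>
#include <vector>

// Wrapper started by makecontext: rebuilds both pointers and calls the kernel.
template <typename Ker, typename Arg>
void trampoline(unsigned k_hi, unsigned k_lo, unsigned a_hi, unsigned a_lo) {
    std::uint64_t kbits = (std::uint64_t(k_hi) << 32) | k_lo;
    std::uint64_t abits = (std::uint64_t(a_hi) << 32) | a_lo;
    Ker* k = reinterpret_cast<Ker*>(static_cast<std::uintptr_t>(kbits));
    Arg* arg = reinterpret_cast<Arg*>(static_cast<std::uintptr_t>(abits));
    assert(k && arg);
    (*k)(*arg);
}

// Runs k(arg) on a separate context and returns once the kernel has finished.
template <typename Ker, typename Arg>
void run_in_context(Ker& k, Arg& arg) {
    ucontext_t caller, callee;
    std::vector<char> stack(64 * 1024);
    getcontext(&callee);
    callee.uc_stack.ss_sp = stack.data();
    callee.uc_stack.ss_size = stack.size();
    callee.uc_link = &caller;            // come back here when the kernel ends
    void (*entry)(unsigned, unsigned, unsigned, unsigned) = &trampoline<Ker, Arg>;
    std::uint64_t kbits = reinterpret_cast<std::uintptr_t>(&k);
    std::uint64_t abits = reinterpret_cast<std::uintptr_t>(&arg);
    makecontext(&callee, reinterpret_cast<void (*)()>(entry), 4,
                unsigned(kbits >> 32), unsigned(kbits),
                unsigned(abits >> 32), unsigned(abits));
    swapcontext(&caller, &callee);       // run the kernel to completion
}
```

For example, with int sum = 0; auto ker = [&sum](int x) { sum += x; }; int arg = 5; the call run_in_context(ker, arg) leaves sum equal to 5. The kernel and its argument are passed by reference, so they stay alive for as long as the context uses them.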

Additional
● Use a counter so that we can spawn coroutines dynamically
● Can it be multithreaded? Yes

true threading
barrier
There are 12 threads in one thread group

multithreading
barrier
Hardware Core = 4

barrier
#include <atomic>

struct bar_t {
    unsigned const count;
    std::atomic<unsigned> spaces;
    std::atomic<unsigned> generation;
    bar_t(unsigned count_) :
        count(count_), spaces(count_), generation(0)
    {}
    void wait() noexcept {
        unsigned const my_generation = generation;
        if (!--spaces) {
            // last thread to arrive: reset the counter and release the others
            spaces = count;
            ++generation;
        } else {
            // spin until the last thread starts the next generation
            while(generation == my_generation);
        }
    }
};
source: C++ Concurrency in Action: Practical Multithreading

Summary
● It works fine on AMP right now
● The importance of low level knowledge
