The document discusses challenges in parallel programming on GPUs: tasks with statically known data dependences, SIMD divergence, and the lack of both fine-grained synchronization and writeable coherent caches. It also presents performance results for sorting algorithms on several GPU and CPU architectures, in which the GPUs deliver much higher sorting throughput than the CPUs. Finally, it proposes parallel prefix sum as a way to allocate work in parallel tasks that require dynamic scheduling or allocation.
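To illustrate how a prefix sum can serve as an allocator, here is a minimal sketch. It is not taken from the document: the function names `exclusive_scan` and `allocate_and_fill` are hypothetical, and the code runs sequentially on the CPU. It assumes each task knows how many output elements it will produce. An exclusive prefix sum over those counts then gives every task a private starting offset in a shared output buffer, so tasks can write their results without locks or atomics.

```python
from itertools import accumulate


def exclusive_scan(counts):
    """Return (offsets, total), where offsets[i] == sum(counts[:i])."""
    inclusive = list(accumulate(counts))
    # Subtracting each count from its inclusive sum yields the exclusive sum.
    offsets = [s - c for s, c in zip(inclusive, counts)]
    total = inclusive[-1] if inclusive else 0
    return offsets, total


def allocate_and_fill(task_outputs):
    """Pack variable-length per-task outputs into one contiguous buffer.

    task_outputs[i] is the (hypothetical) list of items produced by task i.
    """
    counts = [len(items) for items in task_outputs]
    offsets, total = exclusive_scan(counts)
    buffer = [None] * total
    # On a GPU each iteration would be an independent thread; the ranges
    # [offsets[i], offsets[i] + counts[i]) are disjoint, so writes never collide.
    for i, items in enumerate(task_outputs):
        buffer[offsets[i]:offsets[i] + len(items)] = items
    return offsets, buffer


if __name__ == "__main__":
    offsets, buffer = allocate_and_fill([["a", "b", "c"], [], ["d", "e"], ["f"]])
    print(offsets)  # [0, 3, 3, 5]
    print(buffer)   # ['a', 'b', 'c', 'd', 'e', 'f']
```

On an actual GPU the scan itself would also run in parallel, for example with a work-efficient tree-based scan. The sequential `accumulate` call here only stands in for that step.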