Persistent RNNs: Stashing Recurrent Weights On-Chip

Persistent RNNs
(stashing recurrent weights on-chip)
Presenter: Gregory Diamos
Silicon Valley AI Lab
Baidu
Jun 20, 2016

Machine learning has benefited greatly from faster computer systems.
GPUs, in particular, have delivered a step forward.

Imagine the problems that you could solve
with even faster systems.

HPC is an opportunity
[Figure: TitanX GPU compared with the fastest supercomputer; labeled 10,000x]

Limits of data-parallelism

Hardware limits
[Figure: wall-clock time to convergence vs. mini-batch size; region labeled "inefficient hardware"]
Hardware becomes less efficient at small batch sizes.

Optimization limits
[Figure: plot over mini-batch size; region labeled "inefficient optimization"]
Optimization algorithms perform more work at large batch sizes.

Mini-batch limits
[Figure: plot over mini-batch size; regions labeled "inefficient hardware" and "inefficient optimization"]
These effects combine to limit the maximum number of GPUs.

Persistent RNNs
Open source CUDA implementation:
https://github.com/baidu-research/persistent-rnns

Persistent RNN Details

Persistent RNNs
[Figure: a GEMM-based RNN and a persistent RNN, each processing data0 through data4; boxes labeled "weights", "GEMM", and "Persistent RNN"]
RNNs built on GEMM routines reload the weights each timestep.
However, the weights are constant across timesteps, so reloading them is wasteful.
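
A rough way to see the waste is to count the bytes of recurrent weights read from off-chip memory. The sketch below assumes a square recurrent matrix, 4-byte weights, and that a GEMM-based RNN re-reads the full matrix every timestep; the function name and the sizes are illustrative, not taken from the talk.

```python
def weight_traffic_bytes(hidden_size, timesteps, bytes_per_weight=4):
    """Return (gemm_bytes, persistent_bytes) of recurrent-weight traffic.

    Assumes a square hidden_size x hidden_size recurrent matrix and ignores
    any reuse that on-chip caches might provide between timesteps.
    """
    matrix_bytes = hidden_size * hidden_size * bytes_per_weight
    gemm_bytes = matrix_bytes * timesteps  # reloaded once per timestep
    persistent_bytes = matrix_bytes        # loaded once, then kept on-chip
    return gemm_bytes, persistent_bytes


gemm, persistent = weight_traffic_bytes(hidden_size=1024, timesteps=100)
print(gemm // persistent)  # prints 100
```

Under these assumptions the ratio is simply the number of timesteps, so longer sequences waste proportionally more bandwidth.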

Cache weights in registers
[Figure: a GPU thread with its registers, which hold the weights, and its datapath]
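
Whether the weights can be cached this way depends on how the recurrent matrix compares with the total register file. A minimal capacity check, assuming 4-byte weights and a hypothetical GPU with 24 multiprocessors of 256 KiB of registers each; `usable_fraction` is also an assumption, standing in for registers that must stay free for other state.

```python
def fits_in_registers(hidden_size, register_file_bytes,
                      bytes_per_weight=4, usable_fraction=1.0):
    """Return True if a square recurrent matrix fits in the register budget."""
    weight_bytes = hidden_size * hidden_size * bytes_per_weight
    return weight_bytes <= register_file_bytes * usable_fraction


# Hypothetical GPU: 24 multiprocessors x 256 KiB of registers each.
REGISTER_FILE_BYTES = 24 * 256 * 1024

print(fits_in_registers(1024, REGISTER_FILE_BYTES))  # prints True
print(fits_in_registers(2048, REGISTER_FILE_BYTES))  # prints False
```

Lower-precision weights or a larger register file raise the hidden size that fits.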

A global barrier
[Figure: data0 and data1 processed on the GPU, separated by a barrier]
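
The role of the barrier can be sketched on the CPU. In this analogy, which is not the CUDA implementation, each Python thread stands in for a group of GPU threads that keeps its own rows of the weights for the whole sequence, and `threading.Barrier` plays the global barrier between timesteps. The tanh nonlinearity, the missing input term, and all names are assumptions made for illustration.

```python
import math
import threading


def persistent_rnn_forward(weights, h0, timesteps, num_workers):
    """Run h[t+1] = tanh(W h[t]) with rows of W split across workers."""
    n = len(h0)
    # One buffer per timestep, so a single barrier per step is enough.
    states = [list(h0)] + [[0.0] * n for _ in range(timesteps)]
    barrier = threading.Barrier(num_workers)

    def worker(worker_id):
        # Each worker holds its rows for the whole run, standing in for
        # weights that stay resident in registers.
        my_rows = {r: weights[r] for r in range(worker_id, n, num_workers)}
        for t in range(timesteps):
            prev = states[t]
            for r, w_row in my_rows.items():
                acc = sum(w * x for w, x in zip(w_row, prev))
                states[t + 1][r] = math.tanh(acc)
            # Global barrier: nobody starts step t + 1 until every row of
            # states[t + 1] has been written.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return states[-1]


W = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.2, 0.2]]
print(persistent_rnn_forward(W, [1.0, 0.5, -0.5], timesteps=5, num_workers=2))
```

Without the barrier, a fast worker could read a partially written hidden state for the next timestep.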

Experiments

Scaling to 128 GPUs

Exploring deep residual RNNs

Pascal and future
Future GPUs will enable bigger and faster RNN layers.

Three challenges

Close the gap with the fastest supercomputers.

Do not settle for inefficient algorithms.

Push performance to the edge of physical limits.
10 PetaFlops in 300 Watts.
150 ExaFlops in 25 MegaWatts.
