This document discusses CuPy, a NumPy-compatible array library for GPUs. It describes CuPy's elementwise and reduction operations and the problem that arises when many small functions are each called separately on the GPU: every call incurs communication overhead between the CPU and the GPU. To address this, the document proposes automatically fusing such functions into a single kernel call. It gives examples of the user interfaces for specifying elementwise and reduction kernels, and shows how existing code, such as an Adam optimizer, can be rewritten to use function fusion in CuPy. Benchmark results show that fusion can reduce both memory usage and running time compared with performing the operations separately.
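To make the idea of fusion concrete, the following is a minimal sketch, not the document's own code. It assumes CuPy's `cupy.fuse` decorator and applies it to a simplified Adam-style update (bias correction is omitted for brevity). The function name `adam_update`, the sample values, and the NumPy fallback are all illustrative assumptions; the fallback only lets the sketch run on a machine without a GPU and performs no fusion.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    xp, fuse = cp, cp.fuse
except Exception:
    # Assumption for illustration: fall back to NumPy with a no-op decorator
    # so the sketch still runs without a GPU (nothing is fused in this case).
    xp = np

    def fuse(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


def adam_update(param, grad, m, v, lr, beta1, beta2, eps):
    # Simplified Adam-style step (no bias correction). Called directly,
    # each arithmetic operation below is a separate elementwise call.
    m_new = beta1 * m + (1.0 - beta1) * grad
    v_new = beta2 * v + (1.0 - beta2) * grad * grad
    param_new = param - lr * m_new / (xp.sqrt(v_new) + eps)
    return param_new, m_new, v_new


# Wrapping the same function with fuse() asks CuPy to compile the whole
# body into a single elementwise kernel instead of one kernel per operation.
adam_update_fused = fuse()(adam_update)

param = xp.ones(4)
grad = xp.asarray([0.1, -0.2, 0.3, -0.4])
m = xp.zeros(4)
v = xp.zeros(4)
hyper = dict(lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8)

unfused = adam_update(param, grad, m, v, **hyper)
fused = adam_update_fused(param, grad, m, v, **hyper)
```

Both calls should produce the same values; the difference lies in how many kernels are launched and how many intermediate arrays are allocated along the way, which is where the reported savings in time and memory would come from.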