This document summarizes techniques for making deep learning models more efficient, including pruning, weight sharing, quantization, low-rank approximation, and Winograd transformations. It provides examples of applying these techniques to convolutional neural networks, reducing model size by up to 49x while maintaining accuracy. Specific techniques discussed include pruning low-impact connections, iteratively retraining the pruned model to recover accuracy, clustering weights so that many connections share a single value, and quantizing weights to fewer bits. The resulting smaller, lower-precision models require fewer operations, which reduces both energy use and computation.
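To make two of these steps concrete, here is a minimal NumPy sketch of magnitude-based pruning followed by k-means weight sharing. The function names, the 50% sparsity target, and the 4-bit codebook size are illustrative assumptions, not details from the source; a real pipeline would also interleave retraining, which is omitted here.

```python
# Minimal sketch: magnitude pruning + k-means weight sharing (assumed settings).
import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def share_weights_kmeans(weights, bits=4, iters=20):
    """Cluster the nonzero weights into 2**bits shared values (Lloyd's k-means)."""
    nonzero = weights[weights != 0]
    # Linearly initialize centroids across the range of surviving weights.
    centroids = np.linspace(nonzero.min(), nonzero.max(), 2 ** bits)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        assignments = np.argmin(np.abs(nonzero[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its assigned weights.
        for k in range(len(centroids)):
            members = nonzero[assignments == k]
            if members.size:
                centroids[k] = members.mean()
    # Replace each surviving weight with its shared (clustered) value; in a
    # compressed format only the small codebook plus per-weight indices are stored.
    shared = weights.copy()
    shared[shared != 0] = centroids[assignments]
    return shared, centroids

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
W_pruned, mask = prune_by_magnitude(W, sparsity=0.5)
W_shared, codebook = share_weights_kmeans(W_pruned, bits=4)
# With 50% sparsity and a 16-entry codebook, each surviving weight needs
# only a 4-bit index instead of a 32-bit float.
print(f"nonzero: {mask.mean():.0%}, unique shared values: {np.unique(W_shared).size}")
```

The compression arithmetic is what drives the headline numbers: pruning shrinks the number of stored weights, and weight sharing shrinks the bits per stored weight, so the two multiply.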