This document provides an overview of techniques to boost Spark performance, including:
1) Phase 1: memory management, code generation, and cache-aware algorithms, which together provided 5-30x speedups.
2) Phase 2: whole-stage code generation and columnar in-memory support, both enabled by default in Spark 2.0 and later.
3) Additional techniques: choosing an optimal garbage collector, using multiple small executors, exploiting data locality, disabling hardware prefetchers, and keeping hyper-threading on.
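
As a rough illustration of how the Spark-level items in this list might be expressed as configuration, the sketch below assembles a `spark-submit` command line from a dictionary of settings. The property names are standard Spark configuration keys, but everything else is an assumption made for the example: the choice of G1 as the garbage collector, the executor counts and sizes, the application file `my_app.py`, and the helper name `build_submit_command`. Disabling hardware prefetchers and keeping hyper-threading on are BIOS or OS settings, so they do not appear here.

```python
import shlex

# Illustrative values only; none of these numbers come from the summary above.
SETTINGS = {
    # Several small executors instead of one large JVM per node (sizes assumed).
    "spark.executor.instances": "12",
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
    # Garbage collector choice; G1 is an assumption, not a recommendation
    # taken from the summary.
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    # How long the scheduler waits for a data-local slot before falling back.
    "spark.locality.wait": "3s",
    # Whole-stage code generation is already on by default in Spark 2.0+;
    # listed only to make the setting visible.
    "spark.sql.codegen.wholeStage": "true",
}


def build_submit_command(app_path, settings):
    """Return a spark-submit argument list with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch needs no Spark install.
    print(shlex.join(build_submit_command("my_app.py", SETTINGS)))
```

The script only prints the command, which makes it easy to review the settings before trying them on a real cluster. Suitable values depend on the workload and hardware, so any of them should be validated by measurement.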