This document introduces Spark, an open-source cluster computing framework. Spark improves on Hadoop MapReduce by keeping intermediate data in memory rather than on disk, which speeds up iterative jobs considerably. Its core abstraction is the resilient distributed dataset (RDD), a fault-tolerant collection that recovers lost partitions by recomputing them from a lineage graph of the transformations that produced them. Spark runs on Hadoop YARN and HDFS and is programmed in Scala, a functional programming language that supports objects, higher-order functions, and nested functions.
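To make the in-memory and lineage points concrete, here is a minimal Scala sketch of an iterative Spark job. The application name, HDFS path, and local master are assumptions chosen for illustration, but the RDD operations (`textFile`, `map`, `cache`, `filter`, `count`) are the standard Spark API. The cached RDD is parsed once and reused on every pass, which is what makes iterative workloads fast relative to MapReduce; if a partition is lost, Spark rebuilds it from its lineage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeExample {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration; on a cluster this would typically
    // run under YARN instead of "local[*]".
    val conf = new SparkConf().setAppName("IterativeExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical input: a text file of numbers stored in HDFS.
    val numbers = sc.textFile("hdfs:///data/numbers.txt")
      .map(_.toDouble)
      .cache() // keep the parsed RDD in memory across iterations

    // An iterative job: each pass reuses the cached RDD instead of
    // re-reading and re-parsing the input file, as MapReduce would.
    var threshold = 0.0
    for (_ <- 1 to 10) {
      val count = numbers.filter(_ > threshold).count()
      threshold += count.toDouble / 1000
    }

    // If a node fails, Spark recomputes the lost partitions of `numbers`
    // from their lineage: textFile -> map, then re-caches them.
    sc.stop()
  }
}
```

The `cache()` call is a hint, not a guarantee: if memory runs short, Spark can drop cached partitions and later rebuild them on demand from the lineage graph, which is the fault-tolerance mechanism the paragraph above describes.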