Resilient Distributed Datasets (RDDs) provide a fault-tolerant abstraction for in-memory cluster computing. An RDD's data is partitioned across nodes and can be kept in memory for efficient reuse across jobs, while retaining MapReduce properties such as fault tolerance. Each RDD tracks the lineage of transformations that produced it, so lost partitions can be recomputed rather than restored from replicas. RDDs also allow data placement and partitioning to be optimized to minimize network shuffling.
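The lineage idea can be illustrated with a small, single-process sketch. This is a toy model and not the Spark API: the class `ToyRDD` and its methods `map`, `compute`, `collect`, and `lose_partition` are hypothetical names chosen for illustration, and Python lists stand in for partitions that would normally live on different nodes. The sketch assumes only narrow (per-partition) transformations, so a lost partition can be rebuilt from the matching partition of its parent.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the lineage needed to rebuild them.

    Hypothetical illustration only; this is not how Spark is implemented.
    """

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base data, one list per partition (root only)
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # lineage: the function applied to each element
        self.cache = {}       # in-memory partitions, keyed by partition index

    @property
    def num_partitions(self):
        # Walk the lineage back to the root, which knows the partition count.
        node = self
        while node.parent is not None:
            node = node.parent
        return len(node.source)

    def map(self, fn):
        # Lazy transformation: record lineage, compute nothing yet.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, i):
        # Serve partition i from memory if present, else rebuild it from lineage.
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            data = list(self.source[i])
        else:
            data = [self.fn(x) for x in self.parent.compute(i)]
        self.cache[i] = data
        return data

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]

    def lose_partition(self, i):
        # Simulate a node failure that drops one in-memory partition.
        self.cache.pop(i, None)


base = ToyRDD(source=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
first_result = squared.collect()   # computes and caches both partitions
squared.lose_partition(1)          # partition 1 is gone from memory
recovered = squared.compute(1)     # rebuilt by replaying the recorded map
```

Only the lost partition is recomputed here; partition 0 stays in the cache untouched. That is the property lineage-based recovery is meant to provide, in contrast to rolling back the whole dataset to a checkpoint.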