This document provides an overview of Spark, including:
- Spark Streaming's processing model chops live data streams into small micro-batches and treats each batch as an RDD, so the usual transformations and actions apply to streaming data (see the streaming sketch after this list).
- Resilient Distributed Datasets (RDDs) are Spark's primary abstraction: an immutable, distributed collection of objects that can be operated on in parallel.
- An example word count program illustrates how to create and manipulate RDDs to count the frequency of words in a text file; a minimal batch sketch appears after this list.
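
As a rough illustration of the micro-batch model described above, here is a minimal Spark Streaming sketch using the DStream API. The socket source on `localhost:9999` and the one-second batch interval are placeholder assumptions for the example, not taken from the document:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")

    // Each 1-second micro-batch of the stream becomes an RDD under the hood.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source: lines of text arriving on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Ordinary RDD-style transformations apply to each batch.
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()          // emit the counts computed for each batch
    ssc.start()
    ssc.awaitTermination()
  }
}
```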
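
And here is a minimal batch word count over RDDs, of the kind the document walks through later. The input path `input.txt` is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("WordCountSketch")
    val sc = new SparkContext(conf)

    // textFile creates an RDD of lines; each transformation returns a new immutable RDD.
    val counts = sc.textFile("input.txt")     // placeholder path
      .flatMap(_.split("\\s+"))               // split each line into words
      .map(word => (word, 1))                 // pair each word with a count of 1
      .reduceByKey(_ + _)                     // sum counts per word in parallel

    // collect() is an action: it triggers execution and returns results to the driver.
    counts.collect().foreach(println)
    sc.stop()
  }
}
```

Transformations like `flatMap` and `reduceByKey` are lazy; nothing runs until the `collect()` action forces evaluation.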