This document summarizes the agenda and key topics from Day 3 of a Data Science Bootcamp.
The agenda included:
- An introduction to Apache Spark and its configuration in single-node and cluster modes.
- An introduction to Apache Kafka and its single-node configuration, including creating topics and pushing messages.
The document reviews the Spark and SQL contexts, Resilient Distributed Datasets (RDDs), and DataFrames, described as the primary Spark abstraction. It discusses DataFrame transformations and actions, caching data in memory or on disk, and when to use DataFrames over RDDs. Finally, it provides an example of implementing MapReduce in Spark.
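
The bootcamp's own MapReduce example is not reproduced in this summary. As a rough illustration only, the sketch below walks through the map, shuffle, and reduce phases of a word count in plain Python, so it runs without a Spark installation. The function names and the sample input are hypothetical and are not taken from the bootcamp material.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values with an associative function (here, addition)."""
    return {key: reduce(add, values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases into a single word-count job."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input; in Spark the lines would typically be read with sc.textFile(...).
sample_lines = ["spark makes mapreduce simple", "kafka feeds spark"]
counts = word_count(sample_lines)
```

In PySpark the same pipeline is commonly written on an RDD as `rdd.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(add)`, where `flatMap` and `map` are transformations and the work only runs once an action such as `collect()` is called. Whether the bootcamp used this exact form is an assumption.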