Starting with Apache Spark, Best Practices and Learning from the Field discusses optimizing Spark performance. It recommends using Parquet format for storage, avoiding UDFs, caching frequently used data, and checkpointing for streaming jobs. Predicate pushdown and broadcast joins can improve query performance. Structured streaming extends DataFrames to streams, enabling exactly-once processing with checkpointing.
Related topics: