This document summarizes Spark SQL and DataFrames in Spark. It notes that Spark SQL is part of the core Spark distribution and allows running SQL and HiveQL queries. DataFrames provide a way to select, filter, aggregate and plot structured data like in R and Pandas. DataFrames allow writing less code through a high-level API and reading less data by using optimized formats and partitioning. The optimizer can optimize queries across functions and push down predicates to read less data. This allows creating and running Spark programs faster.
Related topics: