Spark SQL: Relational Data Processing in Spark
Aftab Alam
Department of Computer Engineering, Kyung Hee University
Contents
• Background
• Project Proposal
• Review
• Challenges & Solutions
• Programming Interface
• Catalyst Optimizer
• Evaluation
Background
Big Data
• A broad term for data sets so large or complex that traditional data processing tools are inadequate
– Characteristics: Volume, Variety, Velocity, Variability, Veracity
• Typical Big Data Stack
Background
Big Data Frameworks
• Apache Hadoop (1st Generation)
  – Batch processing
  – MapReduce does not support:
    o Iterative jobs (acyclic data flow only)
    o Interactive analysis
• Apache Spark (3rd Generation)
  – Iterative jobs (cyclic data flow)
  – Interactive analysis
  – Real-time processing
  – Improved efficiency through in-memory computing: up to 100× faster (2-10× on disk)
  – Improved usability through Scala, Java, and Python APIs: 2-5× less code
Background
Big Data Alternatives

                Hadoop Ecosystem    Spark Ecosystem
  Component     HDFS                Tachyon
                YARN                Mesos
  Tools         Pig                 Spark native API
                Hive                Spark SQL
                Mahout              MLlib
                Storm               Spark Streaming
                Giraph              GraphX
                HUE                 Spark Notebook/ISpark
Challenges and Solutions (Spark SQL)
• Early big data applications, e.g., MapReduce,
  – need manual optimization to achieve high performance
• Later systems added automatic optimization (result: Pig, Hive, Dremel, and Shark)
  – declarative queries & richer automatic optimization
• Users prefer declarative queries,
  – but these are insufficient for many big data applications
Challenges
1. Users perform Extract, Transform & Load (ETL) to and from various
   • semi-structured or unstructured data sources,
   • requiring custom code
2. Users want to perform advanced analytics, e.g.:
   • machine learning & graph processing,
   • which are hard to express in relational systems
3. Users have to choose between two classes of systems:
   • relational or procedural
Solutions
• A DataFrame API
  • that can perform relational operations
  • on both external data sources &
  • Spark's built-in distributed collections
• Catalyst
  • a highly extensible optimizer that
  • makes it easy to add
    • data sources,
    • optimization rules, and
    • data types for domains such as ML
Goals
Improvement upon Existing Art
• The Spark engine does not understand the structure of the data in RDDs or the semantics of user functions
  – limited optimization
• Shark, the earlier SQL-on-Spark effort:
  – could only query external data stored in the Hive catalog: limited data sources
  – could only be invoked via SQL strings from Spark: error prone
  – reused a Hive optimizer custom-made for MapReduce: difficult to extend
Goals
More goals for Spark SQL
• Support relational processing both within
  – Spark programs (on native Resilient Distributed Datasets (RDDs)) and
  – external data sources, using a programmer-friendly API
• Provide high performance using established DBMS techniques
• Support new data sources
  – semi-structured data and external databases
• Enable extension with advanced analytics algorithms
  – such as graph processing and machine learning
• Aside, an RDD: a data structure of immutable objects held in memory,
  – giving faster MapReduce-style operations and
  – supporting interactive & cyclic workloads
Programming Interface
Interfaces to Spark SQL, and interaction with Spark
• Catalyst Optimizer
  1. Trees & Rules
  2. Catalyst in Spark SQL
  3. Advanced Features
• Programming Interface
  1. DataFrame API
  2. Data Model
  3. DataFrame Operations
  4. DataFrames vs. Relational Query Languages
  5. Querying Native Datasets
  6. User-Defined Functions
• Evaluation
Programming Interface
1 - DataFrame API
• A DataFrame (DF)
  – is equivalent to a table in a relational database
  – can be constructed from tables in a system catalog
    o (based on external data sources) or
    o from existing RDDs of native Java/Python objects
  – keeps track of its schema and
    o supports relational operations,
    o unlike an RDD
• Lazy evaluation
  o Logical plan: a DF object represents a logical plan to compute a dataset
  o Physical plan: execution occurs only when an output operation is called, e.g., save or count
• Deferring execution lets the engine apply optimizations, e.g.,
  • if the underlying store is columnar, scanning only the "age" column, or
  • using an index in the data source to count the matching rows (a sketch follows)
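A minimal PySpark sketch of lazy evaluation; the SQLContext sqlCtx and the registered "users" table are assumptions for illustration:

    users = sqlCtx.table("users")
    young = users.where(users["age"] < 21)   # only builds a logical plan; nothing runs yet
    print(young.count())                     # the output operation triggers execution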
Programming Interface
2 - Data Model
• Supports primitive & complex SQL types
  o boolean, integer, double, decimal, string, timestamp
  o structs, arrays, maps, and unions
  – also user-defined types
  – first-class support for complex data types (see the sketch below)
• Models data from a variety of sources
  – Hive,
  – relational databases,
  – JSON, and
  – native objects in Java/Scala/Python
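A sketch of declaring a schema with nested/complex types in PySpark; the field names are invented for illustration:

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   IntegerType, ArrayType, MapType)

    schema = StructType([
        StructField("name",  StringType()),
        StructField("age",   IntegerType()),
        StructField("tags",  ArrayType(StringType())),              # complex type: array
        StructField("attrs", MapType(StringType(), StringType())),  # complex type: map
    ])
    df = sqlCtx.createDataFrame([("alice", 34, ["admin"], {"team": "dke"})], schema)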
Programming Interface
3 - DataFrame Operations
• Supports relational operators, expressed through a DSL of DataFrame column expressions (=, <, >, +, -)
  – project (select), aggregate (groupBy),
  – filter (where), & join
• Example: a query to compute the number of female employees per department (sketched below)
• All of these operators build up an abstract syntax tree (AST),
  – which is then optimized by Catalyst,
  – unlike code written against the native Spark API
• A DF can also be registered as a temporary SQL table and
  – queried with traditional SQL query strings
http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
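A PySpark sketch of that query; the employees and dept DataFrames and their column names are assumptions:

    from pyspark.sql.functions import count

    result = (employees
              .join(dept, employees["deptId"] == dept["id"])
              .where(employees["gender"] == "female")
              .groupBy(dept["id"], dept["name"])
              .agg(count("name")))   # number of female employees per department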
Programming Interface
4 - DataFrames vs. Relational Query Languages
• Full optimization across functions composed in different languages
• Control structures (e.g., if, for) can be intermixed with relational operations
• The logical plan is analyzed eagerly
  – identifies code errors associated with data schema issues on the fly
  – errors are reported immediately, before execution (see the sketch below)
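A sketch of eager analysis, reusing the hypothetical "users" table; the bad column name is caught as soon as the plan is built:

    users = sqlCtx.table("users")
    # analysis runs eagerly, so the invalid column is rejected here,
    # before any Spark job is launched
    bad = users.select("no_such_column")   # raises AnalysisException immediately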
Programming Interface
5 - Querying Native Datasets
• Pipelines extract data from heterogeneous sources
  – and run a wide variety of algorithms from different programming libraries
• Column names and types are inferred directly from data objects, via
  – reflection in Java and Scala, and
  – data sampling in Python, which is dynamically typed
• Native objects are accessed in place to avoid expensive data format transformation
• Benefits (a sketch follows):
  – Run relational operations on existing Spark programs
  – Combine RDDs with external structured data
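A sketch of querying native Python objects; the SparkContext sc and SQLContext sqlCtx are assumed:

    from pyspark.sql import Row

    rdd = sc.parallelize([Row(name="alice", age=34), Row(name="bob", age=19)])
    people = sqlCtx.createDataFrame(rdd)      # schema inferred by sampling the objects
    people.where(people["age"] > 21).show()   # relational operations over native data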
Programming Interface
6 - User-Defined Functions (UDFs)
• UDFs are an important database extension; e.g., MySQL relies on UDFs for features such as JSON and XML support
• In traditional databases, UDFs live in a programming environment separate from the query interface
  – e.g., a UDF for Pig must be written in a Java package that is then loaded into the Pig script
• The DataFrame API supports inline definition of UDFs
  – which can be defined on simple data types or entire tables
• Once registered, UDFs are available to other interfaces as well (JDBC/ODBC); a sketch follows
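A minimal sketch of an inline UDF in PySpark; the "users" table and its name column are assumptions:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    str_len = udf(lambda s: len(s), IntegerType())   # defined inline, no separate environment
    users.select(str_len(users["name"]))             # used directly from the DataFrame API

    sqlCtx.registerFunction("strLen", lambda s: len(s), IntegerType())
    sqlCtx.sql("SELECT strLen(name) FROM users")     # used from SQL after registration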
Catalyst Optimizer
• Catalyst: an extensible optimizer
  – based on functional programming constructs in Scala
• Purposes of an extensible optimizer
  1. Make it easy to add new optimization techniques & features
    o e.g., for big-data problems such as semi-structured data
  2. Enable developers to extend the optimizer
    o by adding data-source-specific rules
    o that can push filtering or aggregation into external storage systems,
    o or by adding support for new data types
Catalyst Optimizer
• Catalyst contains a core library for representing
  – trees and applying rules to manipulate them
  – cost-based optimization is performed
    o by generating multiple plans using rules,
    o and then computing their costs
• On top of this framework sit
  – libraries specific to relational query processing
    o e.g., expressions and logical query plans, and
    o several sets of rules that handle the different phases of query execution:
       analysis,
       logical optimization,
       physical planning, and
       code generation
Catalyst Optimizer
1 - Trees & Rules
• Trees: node types for expressions include
  – Literal(value: Int): a constant value
  – Attribute(name: String): an attribute from an input row, e.g., "x"
  – Add(left: TreeNode, right: TreeNode): the sum of two expressions
• Example: the expression x + (1 + 2) is represented as
  – Add(Attribute(x), Add(Literal(1), Literal(2)))
  – a constant-folding rule rewrites the tree to Add(Attribute(x), Literal(3)), as sketched below
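An illustrative Python sketch of that rewrite; this is not Catalyst's real API (Catalyst rules are Scala pattern matches over trees), just the idea:

    from collections import namedtuple

    Literal = namedtuple("Literal", "value")
    Attribute = namedtuple("Attribute", "name")
    Add = namedtuple("Add", "left right")

    def fold_constants(node):
        # bottom-up rewrite: replace Add(Literal, Literal) with a folded Literal
        if isinstance(node, Add):
            left, right = fold_constants(node.left), fold_constants(node.right)
            if isinstance(left, Literal) and isinstance(right, Literal):
                return Literal(left.value + right.value)
            return Add(left, right)
        return node

    tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    print(fold_constants(tree))   # Add(left=Attribute(name='x'), right=Literal(value=3))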
Catalyst Optimizer
2 - Using Catalyst in Spark SQL
• Catalyst's general tree transformation framework is used in four phases:
  (1) analyzing a logical plan to resolve references
  (2) logical plan optimization
  (3) physical planning, where
    o Catalyst may generate multiple plans and
    o compare them based on cost
  (4) code generation, to compile parts of the query to Java bytecode
• Spark SQL begins with a relation to be computed,
  o either from an abstract syntax tree (AST) returned by the SQL parser, or
  o from a DataFrame object constructed using the API
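The results of these phases can be inspected from PySpark; a sketch, reusing the hypothetical young DataFrame from the DataFrame API example:

    # prints the parsed, analyzed, and optimized logical plans,
    # plus the chosen physical plan
    young.explain(True)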
Catalyst Optimizer
2 - Using Catalyst in Spark SQL
1 - Analysis
• Unresolved logical plan
  – An attribute is unresolved if its type is not known, or
  – it is not matched to an input table
  – e.g., in SELECT col FROM sales, the type of col (or even whether it is a valid column name) is unknown until the sales table is looked up
• To resolve attributes, the analyzer
  – looks up relations by name in the catalog
  – maps named attributes, such as col, to the input provided by an operator's children
  – assigns a unique ID to references to the same value
  – propagates and coerces types through expressions
    o e.g., the return type of (1 + col) is unknown until col is resolved
Catalyst Optimizer
2 - Using Catalyst in Spark SQL
2 – Logical Optimization
• Applies standard rule-based optimizations (a sketch for observing them follows):
  – constant folding,
  – predicate pushdown,
  – projection pruning,
  – null propagation,
  – Boolean expression simplification, etc.
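One way to observe such rewrites from PySpark (a sketch; the "users" table is assumed):

    users = sqlCtx.table("users")
    q = users.select(users["name"], users["age"]).where(users["age"] > 30)
    q.explain(True)   # the optimized plan shows the filter pushed below the projection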
Catalyst Optimizer
2 - Using Catalyst in Spark SQL
3 – Physical Planning
[Figure: the logical plan (a filter over a join of an events file with a Hive users table) first becomes a physical plan (scans of events and users feeding a filter and join); with predicate pushdown and column pruning it becomes an optimized physical plan in which optimized scans of events and users feed the join directly.]
    def add_demographics(events):
        u = sqlCtx.table("users")                            # load partitioned Hive table
        return (events
                .join(u, events.user_id == u.user_id)        # join on user_id
                .withColumn("city", zipToCity(events.zip)))  # run UDF to add a city column

    events = add_demographics(sqlCtx.load("/data/events", "parquet"))
    training_data = (events.where(events.city == "Melbourne")
                           .select(events.timestamp)
                           .collect())
• The API stays expressive, while the optimized plan joins only the relevant users: the filter on city is pushed below the join.
Catalyst Optimizer
2 - Using Catalyst in Spark SQL
4 - Code Generation
• The final phase involves generating Java bytecode
  o to run on each machine
• Spark SQL operates on in-memory datasets,
  – where processing is CPU-bound,
  – so it supports code generation (via Scala quasiquotes) to speed up execution; an illustrative sketch follows
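An illustrative sketch of the idea only; Spark's actual code generation compiles expression trees to JVM bytecode with Scala quasiquotes, while this just shows why compiling once beats interpreting a tree per row:

    def compile_expr(src):
        # src is a hypothetical expression string, e.g. "x + y + 1";
        # build a specialized function once instead of walking a tree per row
        return eval("lambda x, y: " + src)

    f = compile_expr("x + y + 1")
    print(f(2, 3))   # 6, with no per-row tree interpretation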
Advanced Analytics Features
• Integration with Spark's machine learning library (MLlib)
• Schema inference for semi-structured data
  – JSON (a sketch follows)
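A sketch of JSON schema inference with the Spark 1.x API; the path and field names are assumptions:

    tweets = sqlCtx.jsonFile("/data/tweets.json")   # scans the records to infer a schema
    tweets.printSchema()
    tweets.registerTempTable("tweets")
    sqlCtx.sql("SELECT text FROM tweets WHERE lang = 'en'")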
Evaluation
• Evaluated along two dimensions: SQL query processing performance and Spark program performance
• The benchmark data set was 110 GB after columnar compression with Parquet
Conclusion
• Spark SQL is a new module in Apache Spark that integrates relational and procedural interfaces, making it easy to express large-scale data processing jobs. The seamless integration of the two interfaces is the paper's key contribution, and it already provides a unified interface for large-scale data processing.
THANK YOU!