Spark SQL: Relational Data Processing in Spark
Aftab Alam
Department of Computer Engineering, Kyung Hee University
Contents
• Background
• Project Proposal
• Review
• Challenges & Solutions
• Programming Interface
• Catalyst Optimizer
• Evaluation
Background
Big Data
• A broad term for data sets so large or complex that traditional data processing tools are inadequate
– Characteristics: Volume, Variety, Velocity, Variability, Veracity
• Typical Big Data Stack
Background
Big Data Frameworks
• Apache Hadoop (1st Generation)
  – Batch processing
  – MapReduce does not support:
    o Iterative jobs (acyclic data flow only)
    o Interactive analysis
• Apache Spark (3rd Generation)
  – Iterative jobs (cyclic data flow)
  – Interactive analysis
  – Real-time processing
  – Improved efficiency through in-memory computing: up to 100× faster (2-10× on disk)
  – Improved usability through Scala, Java, and Python APIs: 2-5× less code
Background
Big Data Alternatives

                Hadoop Ecosystem    Spark Ecosystem
  Component     HDFS                Tachyon
                YARN                Mesos
  Tools         Pig                 Spark native API
                Hive                Spark SQL
                Mahout              MLlib
                Storm               Spark Streaming
                Giraph              GraphX
                HUE                 Spark Notebook/ISpark
Challenges and Solutions (Spark SQL)
• Early big data applications, e.g., MapReduce,
  – need manual optimization to achieve high performance
• Later systems added automatic optimization (result: Pig, Hive, Dremel, and Shark)
  – declarative queries & richer automatic optimization
• Users prefer declarative queries,
  – but these are insufficient for many big data applications
Challenges
1. Users perform Extract, Transform & Load (ETL) to and from various
   • semi-structured or unstructured data sources,
   • requiring custom code
2. Users want to perform advanced analytics, e.g.:
   • machine learning & graph processing,
   • which are hard to express in relational systems
3. Users have to choose between two classes of systems:
   • relational or procedural
Solutions
• A DataFrame API
  • that can perform relational operations
  • on both external data sources &
  • Spark's built-in distributed collections
• Catalyst
  • a highly extensible optimizer that
  • makes it easy to add
    • data sources,
    • optimization rules, and
    • data types for domains such as ML
Goals
Improvement upon Existing Art
• The Spark engine does not understand the structure of the data in RDDs or the semantics of user functions
  – limited optimization
• Shark, the earlier SQL-on-Spark effort:
  – could only query external data stored in the Hive catalog: limited data sources
  – could only be invoked via SQL strings from Spark: error prone
  – reused a Hive optimizer custom-made for MapReduce: difficult to extend
Goals
More goals for Spark SQL
• Support relational processing both within
  – Spark programs (on native Resilient Distributed Datasets (RDDs)) and
  – external data sources, using a programmer-friendly API
• Provide high performance using established DBMS techniques
• Support new data sources
  – semi-structured data and external databases
• Enable extension with advanced analytics algorithms
  – such as graph processing and machine learning
• Aside, an RDD: a data structure of immutable objects held in memory,
  – giving faster MapReduce-style operations and
  – supporting interactive & cyclic workloads
Programming Interface
Interfaces to Spark SQL, and interaction with Spark
• Catalyst Optimizer
  1. Trees & Rules
  2. Catalyst in Spark SQL
  3. Advanced Features
• Programming Interface
  1. DataFrame API
  2. Data Model
  3. DataFrame Operations
  4. DataFrames vs. Relational Query Languages
  5. Querying Native Datasets
  6. User-Defined Functions
• Evaluation
Programming Interface
1 - DataFrame API
• A DataFrame (DF)
  – is equivalent to a table in a relational database
  – can be constructed from tables in a system catalog
    o (based on external data sources) or
    o from existing RDDs of native Java/Python objects
  – keeps track of its schema and
    o supports relational operations,
    o unlike an RDD
• Lazy evaluation
  o Logical plan: a DF object represents a logical plan to compute a dataset
  o Physical plan: execution occurs only when an output operation is called, e.g., save or count
• Deferring execution lets the engine apply optimizations, e.g.,
  • if the underlying store is columnar, scanning only the "age" column, or
  • using an index in the data source to count the matching rows (a sketch follows)
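A minimal PySpark sketch of lazy evaluation; the SQLContext sqlCtx and the registered "users" table are assumptions for illustration:

    users = sqlCtx.table("users")
    young = users.where(users["age"] < 21)   # only builds a logical plan; nothing runs yet
    print(young.count())                     # the output operation triggers execution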
Programming Interface
2 - Data Model
• Supports primitive & complex SQL types
  o boolean, integer, double, decimal, string, timestamp
  o structs, arrays, maps, and unions
  – also user-defined types
  – first-class support for complex data types (see the sketch below)
• Models data from a variety of sources
  – Hive,
  – relational databases,
  – JSON, and
  – native objects in Java/Scala/Python
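A sketch of declaring a schema with nested/complex types in PySpark; the field names are invented for illustration:

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   IntegerType, ArrayType, MapType)

    schema = StructType([
        StructField("name",  StringType()),
        StructField("age",   IntegerType()),
        StructField("tags",  ArrayType(StringType())),              # complex type: array
        StructField("attrs", MapType(StringType(), StringType())),  # complex type: map
    ])
    df = sqlCtx.createDataFrame([("alice", 34, ["admin"], {"team": "dke"})], schema)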
Programming Interface
3 - DataFrame Operations
• Supports relational operators, expressed through a DSL of DataFrame column expressions (=, <, >, +, -)
  – project (select), aggregate (groupBy),
  – filter (where), & join
• Example: a query to compute the number of female employees per department (sketched below)
• All of these operators build up an abstract syntax tree (AST),
  – which is then optimized by Catalyst,
  – unlike code written against the native Spark API
• A DF can also be registered as a temporary SQL table and
  – queried with traditional SQL query strings
http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
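A PySpark sketch of that query; the employees and dept DataFrames and their column names are assumptions:

    from pyspark.sql.functions import count

    result = (employees
              .join(dept, employees["deptId"] == dept["id"])
              .where(employees["gender"] == "female")
              .groupBy(dept["id"], dept["name"])
              .agg(count("name")))   # number of female employees per department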
Programming Interface
4 - DataFrames vs. Relational Query Languages
• Full optimization across functions composed in different languages
• Control structures (e.g., if, for) can be intermixed with relational operations
• The logical plan is analyzed eagerly
  – identifies code errors associated with data schema issues on the fly
  – errors are reported immediately, before execution (see the sketch below)
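A sketch of eager analysis, reusing the hypothetical "users" table; the bad column name is caught as soon as the plan is built:

    users = sqlCtx.table("users")
    # analysis runs eagerly, so the invalid column is rejected here,
    # before any Spark job is launched
    bad = users.select("no_such_column")   # raises AnalysisException immediately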
Programming Interface
5 - Querying Native Datasets
• Pipelines extract data from heterogeneous sources
  – and run a wide variety of algorithms from different programming libraries
• Column names and types are inferred directly from data objects, via
  – reflection in Java and Scala, and
  – data sampling in Python, which is dynamically typed
• Native objects are accessed in place to avoid expensive data format transformation
• Benefits (a sketch follows):
  – Run relational operations on existing Spark programs
  – Combine RDDs with external structured data
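A sketch of querying native Python objects; the SparkContext sc and SQLContext sqlCtx are assumed:

    from pyspark.sql import Row

    rdd = sc.parallelize([Row(name="alice", age=34), Row(name="bob", age=19)])
    people = sqlCtx.createDataFrame(rdd)      # schema inferred by sampling the objects
    people.where(people["age"] > 21).show()   # relational operations over native data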
Programming Interface
6 - User-Defined Functions (UDFs)
• UDFs are an important database extension; e.g., MySQL relies on UDFs for features such as JSON and XML support
• In traditional databases, UDFs live in a programming environment separate from the query interface
  – e.g., a UDF for Pig must be written in a Java package that is then loaded into the Pig script
• The DataFrame API supports inline definition of UDFs
  – which can be defined on simple data types or entire tables
• Once registered, UDFs are available to other interfaces as well (JDBC/ODBC); a sketch follows
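A minimal sketch of an inline UDF in PySpark; the "users" table and its name column are assumptions:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    str_len = udf(lambda s: len(s), IntegerType())   # defined inline, no separate environment
    users.select(str_len(users["name"]))             # used directly from the DataFrame API

    sqlCtx.registerFunction("strLen", lambda s: len(s), IntegerType())
    sqlCtx.sql("SELECT strLen(name) FROM users")     # used from SQL after registration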
Catalyst Optimizer
• Catalyst: an extensible optimizer
  – based on functional programming constructs in Scala
• Purposes of an extensible optimizer
  1. Make it easy to add new optimization techniques & features
    o e.g., for big-data problems such as semi-structured data
  2. Enable developers to extend the optimizer
    o by adding data-source-specific rules
    o that can push filtering or aggregation into external storage systems,
    o or by adding support for new data types
Catalyst Optimizer
• Catalyst contains a core library for representing
  – trees and applying rules to manipulate them
  – cost-based optimization is performed
    o by generating multiple plans using rules,
    o and then computing their costs
• On top of this framework sit
  – libraries specific to relational query processing
    o e.g., expressions and logical query plans, and
    o several sets of rules that handle the different phases of query execution:
       analysis,
       logical optimization,
       physical planning, and
       code generation
Catalyst Optimizer
1 - Trees & Rules
• Trees: node types for expressions include
  – Literal(value: Int): a constant value
  – Attribute(name: String): an attribute from an input row, e.g., "x"
  – Add(left: TreeNode, right: TreeNode): the sum of two expressions
• Example: the expression x + (1 + 2) is represented as
  – Add(Attribute(x), Add(Literal(1), Literal(2)))
  – a constant-folding rule rewrites the tree to Add(Attribute(x), Literal(3)), as sketched below
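An illustrative Python sketch of that rewrite; this is not Catalyst's real API (Catalyst rules are Scala pattern matches over trees), just the idea:

    from collections import namedtuple

    Literal = namedtuple("Literal", "value")
    Attribute = namedtuple("Attribute", "name")
    Add = namedtuple("Add", "left right")

    def fold_constants(node):
        # bottom-up rewrite: replace Add(Literal, Literal) with a folded Literal
        if isinstance(node, Add):
            left, right = fold_constants(node.left), fold_constants(node.right)
            if isinstance(left, Literal) and isinstance(right, Literal):
                return Literal(left.value + right.value)
            return Add(left, right)
        return node

    tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    print(fold_constants(tree))   # Add(left=Attribute(name='x'), right=Literal(value=3))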
Catalyst Optimizer
2 - Using Catalyst in Spark SQL
• Catalyst's general tree transformation framework is used in four phases:
  (1) analyzing a logical plan to resolve references
  (2) logical plan optimization
  (3) physical planning, where
    o Catalyst may generate multiple plans and
    o compare them based on cost
  (4) code generation, to compile parts of the query to Java bytecode
• Spark SQL begins with a relation to be computed,
  o either from an abstract syntax tree (AST) returned by the SQL parser, or
  o from a DataFrame object constructed using the API
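The results of these phases can be inspected from PySpark; a sketch, reusing the hypothetical young DataFrame from the DataFrame API example:

    # prints the parsed, analyzed, and optimized logical plans,
    # plus the chosen physical plan
    young.explain(True)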
Catalyst Optimizer
2 - Using Catalyst in Spark SQL
1 - Analysis
• Unresolved logical plan
  – An attribute is unresolved if its type is not known, or
  – it is not matched to an input table
  – e.g., in SELECT col FROM sales, the type of col (or even whether it is a valid column name) is unknown until the sales table is looked up
• To resolve attributes, the analyzer
  – looks up relations by name in the catalog
  – maps named attributes, such as col, to the input provided by an operator's children
  – assigns a unique ID to references to the same value
  – propagates and coerces types through expressions
    o e.g., the return type of (1 + col) is unknown until col is resolved
Catalyst Optimizer
2 - Using Catalyst in Spark SQL
2 – Logical Optimization
• Applies standard rule-based optimizations (a sketch for observing them follows):
  – constant folding,
  – predicate pushdown,
  – projection pruning,
  – null propagation,
  – Boolean expression simplification, etc.
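One way to observe such rewrites from PySpark (a sketch; the "users" table is assumed):

    users = sqlCtx.table("users")
    q = users.select(users["name"], users["age"]).where(users["age"] > 30)
    q.explain(True)   # the optimized plan shows the filter pushed below the projection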
Catalyst Optimizer
2 - Using Catalyst in Spark SQL
3 – Physical Planning
[Figure: the logical plan (a filter over a join of an events file with a Hive users table) first becomes a physical plan (scans of events and users feeding a filter and join); with predicate pushdown and column pruning it becomes an optimized physical plan in which optimized scans of events and users feed the join directly.]
    def add_demographics(events):
        u = sqlCtx.table("users")                            # load partitioned Hive table
        return (events
                .join(u, events.user_id == u.user_id)        # join on user_id
                .withColumn("city", zipToCity(events.zip)))  # run UDF to add a city column

    events = add_demographics(sqlCtx.load("/data/events", "parquet"))
    training_data = (events.where(events.city == "Melbourne")
                           .select(events.timestamp)
                           .collect())
• The API stays expressive, while the optimized plan joins only the relevant users: the filter on city is pushed below the join.
Catalyst Optimizer
2 - Using Catalyst in Spark SQL
4 - Code Generation
• The final phase involves generating Java bytecode
  o to run on each machine
• Spark SQL operates on in-memory datasets,
  – where processing is CPU-bound,
  – so it supports code generation (via Scala quasiquotes) to speed up execution; an illustrative sketch follows
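An illustrative sketch of the idea only; Spark's actual code generation compiles expression trees to JVM bytecode with Scala quasiquotes, while this just shows why compiling once beats interpreting a tree per row:

    def compile_expr(src):
        # src is a hypothetical expression string, e.g. "x + y + 1";
        # build a specialized function once instead of walking a tree per row
        return eval("lambda x, y: " + src)

    f = compile_expr("x + y + 1")
    print(f(2, 3))   # 6, with no per-row tree interpretation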
Advanced Analytics Features
• Integration with Spark's machine learning library (MLlib)
• Schema inference for semi-structured data
  – JSON (a sketch follows)
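A sketch of JSON schema inference with the Spark 1.x API; the path and field names are assumptions:

    tweets = sqlCtx.jsonFile("/data/tweets.json")   # scans the records to infer a schema
    tweets.printSchema()
    tweets.registerTempTable("tweets")
    sqlCtx.sql("SELECT text FROM tweets WHERE lang = 'en'")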
Evaluation
• Evaluated along two dimensions: SQL query processing performance and Spark program performance
• The benchmark data set was 110 GB after columnar compression with Parquet
Conclusion
• Spark SQL is a new module in Apache Spark that integrates relational and procedural interfaces, making it easy to express large-scale data processing jobs. The seamless integration of the two interfaces is the paper's key contribution, and it already provides a unified interface for large-scale data processing.
THANK YOU!