Department of Computer Science and Engineering
School of Engineering
Shiv Nadar University Chennai
Sundharakumar KB
SPARK SQL
• DataFrames are used instead of RDDs
• A DataFrame extends the RDD into a structured object
• A DataFrame (DF):
• Contains Row objects
• Can be queried with SQL
• Can have a schema
• Can be read from and written to JSON, Hive, CSV, etc.
• Communicates with JDBC/ODBC, Tableau, etc.
Spark SQL
• To use Spark SQL, create a SparkSession instead of a SparkContext.
• E.g.: from pyspark.sql import SparkSession, Row
• SparkSession is the entry point for working with DataFrames.
• A SparkSession is obtained through SparkSession.builder, ending the chain with getOrCreate().
Spark SQL
• inputData = spark.read.json(data)
• inputData.createOrReplaceTempView("myView")
• resultDF = spark.sql("SELECT foo FROM myView ORDER BY foobar")
Spark SQL
• resultDF.show()
• resultDF.select("someFieldName")
• resultDF.filter(resultDF["someFieldName"] > 200)
• resultDF.groupBy(resultDF["someFieldName"]).mean()
• resultDF.rdd.map(mapperFunction)
• Most current Spark code works with DataFrames rather than RDDs because of the flexibility that DataFrames offer.
Spark SQL
• Spark SQL exposes a JDBC/ODBC server.
• It can also be reached from the Spark shell.
• It can also be used with Hive; a table can be cached in memory with hiveCtx.cacheTable("tablename").
• It provides a SQL shell for creating new tables or querying existing ones directly.
Spark SQL
• from pyspark.sql.types import IntegerType
• def square(x):
•     return x * x
• spark.udf.register("square", square, IntegerType())
• df = spark.sql("SELECT square(SomeNumField) FROM tableName")
User Defined Functions


Introduction to Spark SQL, query types and UDF
