Basic
Using Spark DataFrame
For SQL
charsyam@naver.com
Create DataFrame From File
val path = "abc.txt"
val df = spark.read.text(path)
Create DataFrame From Kafka
import spark.implicits._ // needed for toDF
val rdd = KafkaUtils.createRDD[String, String](...)
val logsDF = rdd.map { _.value }.toDF
Spark DataFrame Column
1) col("column name")   // needs import org.apache.spark.sql.functions.col
2) $"column name"       // needs import spark.implicits._
1) and 2) are equivalent.
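The equivalence of the two column syntaxes can be checked with a small sketch. It assumes a local SparkSession is passed in; the object name `ColumnSyntaxSketch`, the column name `pw`, and the three values are made up for illustration and are not from the slides.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ColumnSyntaxSketch {
  // Runs the same projection twice, once per column syntax,
  // and returns both results so they can be compared.
  def select(spark: SparkSession): (Seq[Int], Seq[Int]) = {
    import spark.implicits._ // enables $"..." and toDF

    // Hypothetical single-column DataFrame used only for this sketch.
    val df = Seq(2, 24, 23).toDF("pw")

    val viaCol = df.select(col("pw") + 1).as[Int].collect().toSeq
    val viaDollar = df.select($"pw" + 1).as[Int].collect().toSeq
    (viaCol, viaDollar)
  }
}
```

Both elements of the returned pair should hold the same values, since each syntax produces a Column that refers to `pw`.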
Simple Iris TSV Logs
http://www.math.uah.edu/stat/data/Fisher.txt
Type PW PL SW SL
0 2 14 33 50
1 24 56 31 67
1 23 51 31 69
0 2 10 36 46
1 20 52 30 65
1 19 51 27 58
Load TSV with StructType
import org.apache.spark.sql.types._
val irisSchema = StructType(Array(
  StructField("Type", IntegerType, true),
  StructField("PetalWidth", IntegerType, true),
  StructField("PetalLength", IntegerType, true),
  StructField("SepalWidth", IntegerType, true),
  StructField("SepalLength", IntegerType, true)
))
Load TSV with Encoder #1
import org.apache.spark.sql.Encoders
case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
                      SepalWidth: Int, SepalLength: Int)
val irisSchema = Encoders.product[IrisSchema].schema
Load TSV
val irisDf = spark.read.format("csv"). // Use "csv" for both TSV and CSV files.
  option("header", "true").            // The file has a header line.
  option("delimiter", "\t").           // Tab for TSV; use "," for CSV.
  schema(irisSchema).                  // Schema that was built above.
  load("Fisher.txt")
irisDf.show(5)
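The load step can be tried without downloading Fisher.txt. The following sketch assumes a local SparkSession and writes a tiny two-column tab-separated file to a temporary path; the object name `LoadTsvSketch`, the file contents, and the reduced two-field schema are illustrative assumptions, not part of the slides.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object LoadTsvSketch {
  // Writes a small TSV file, loads it with the csv reader and a tab
  // delimiter, and returns the rows as (Type, PetalWidth) pairs.
  def load(spark: SparkSession): Seq[(Int, Int)] = {
    // Reduced schema for the sketch: only the first two iris columns.
    val schema = StructType(Array(
      StructField("Type", IntegerType, true),
      StructField("PetalWidth", IntegerType, true)
    ))

    // Hypothetical sample file: a header line plus two data rows.
    val file = Files.createTempFile("iris-sketch", ".tsv")
    Files.write(file, "Type\tPW\n0\t2\n1\t24\n".getBytes(StandardCharsets.UTF_8))

    spark.read.format("csv")
      .option("header", "true")   // skip the header line
      .option("delimiter", "\t")  // a real tab character, not the letter t
      .schema(schema)
      .load(file.toString)
      .collect()
      .map(r => (r.getInt(0), r.getInt(1)))
      .toSeq
  }
}
```

With the delimiter set to the letter "t" instead of "\t", the rows would not split on tabs, which is why the escape matters.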
Load TSV - Show Results
scala> irisDf.show(5)
+----+----------+-----------+----------+-----------+
|Type|PetalWidth|PetalLength|SepalWidth|SepalLength|
+----+----------+-----------+----------+-----------+
|   0|         2|         14|        33|         50|
|   1|        24|         56|        31|         67|
|   1|        23|         51|        31|         69|
|   0|         2|         10|        36|         46|
|   1|        20|         52|        30|         65|
+----+----------+-----------+----------+-----------+
only showing top 5 rows
Using sqlContext sql
The easiest way: register a temp view, then query it with SQL.
df.createOrReplaceTempView("tmp_iris") // returns Unit, so there is nothing to assign
val resultDF = df.sqlContext.sql("select type, PetalWidth from tmp_iris")
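A self-contained sketch of the temp-view approach, assuming a local SparkSession. It uses `spark.sql`, which runs the query against the same session as `df.sqlContext.sql`. The object name `TempViewSketch` and the added WHERE clause are illustrative; the rows are the first three sample rows from the iris table shown earlier.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TempViewSketch {
  // Registers a small DataFrame as a temp view and queries it with SQL.
  def query(spark: SparkSession): DataFrame = {
    import spark.implicits._ // enables toDF on local collections

    val df = Seq((0, 2, 14, 33, 50), (1, 24, 56, 31, 67), (1, 23, 51, 31, 69))
      .toDF("Type", "PetalWidth", "PetalLength", "SepalWidth", "SepalLength")

    // The view lives only for the lifetime of this SparkSession.
    df.createOrReplaceTempView("tmp_iris")
    spark.sql("SELECT Type, PetalWidth FROM tmp_iris WHERE PetalWidth > 10")
  }
}
```

The result is an ordinary DataFrame, so it can be combined with the DataFrame operations on the following slides.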
Simple Select
SQL:
Select type, petalwidth + sepalwidth as sum_width from …
val sumDF = df.withColumn("sum_width", col("PetalWidth") + col("SepalWidth"))
val resultDF = sumDF.selectExpr("Type", "sum_width")
// Equivalent of "select *":
val allDF = sumDF.selectExpr("*")
Select with where
SQL:
Select type, petalwidth from … where petalwidth > 10
val whereDF = df.filter($"petalwidth" > 10)
// where is an alias for filter, so this line is equivalent:
// val whereDF = df.where($"petalwidth" > 10)
val resultDF = whereDF.selectExpr("Type", "petalwidth")
Select with order by
SQL:
Select petalwidth, sepalwidth from … order by petalwidth, sepalwidth desc
1) val sortDF = df.sort($"petalwidth", $"sepalwidth".desc)
2) val sortDF = df.sort($"petalwidth", desc("sepalwidth"))
3) val sortDF = df.orderBy($"petalwidth", desc("sepalwidth"))
1), 2), and 3) are equivalent: orderBy is an alias for sort, and desc("...") needs import org.apache.spark.sql.functions.desc.
val resultDF = sortDF.selectExpr("petalwidth", "sepalwidth")
Select with Group by
SQL:
Select type, max(petalwidth) A, min(sepalwidth) B from … group by type
val groupDF = df.groupBy($"type").agg(max($"petalwidth").as("A"),
                                      min($"sepalwidth").as("B"))
val resultDF = groupDF.selectExpr("type", "A", "B")
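To check that the DataFrame form matches the SQL form, the sketch below runs both on the six sample rows shown earlier and returns the two results side by side. It assumes a local SparkSession; the object name `GroupBySketch` and the view name `tmp_iris_groupby` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{max, min}

object GroupBySketch {
  // Returns (DataFrame API result, SQL result), each as type -> (A, B).
  def compare(spark: SparkSession): (Map[Int, (Int, Int)], Map[Int, (Int, Int)]) = {
    import spark.implicits._

    // (type, petalwidth, sepalwidth) from the six sample rows.
    val df = Seq((0, 2, 33), (1, 24, 31), (1, 23, 31), (0, 2, 36), (1, 20, 30), (1, 19, 27))
      .toDF("type", "petalwidth", "sepalwidth")

    val viaApi = df.groupBy($"type")
      .agg(max($"petalwidth").as("A"), min($"sepalwidth").as("B"))

    df.createOrReplaceTempView("tmp_iris_groupby")
    val viaSql = spark.sql(
      "SELECT type, max(petalwidth) A, min(sepalwidth) B FROM tmp_iris_groupby GROUP BY type")

    // Collect each result into a map keyed by type for easy comparison.
    def toMap(d: DataFrame): Map[Int, (Int, Int)] =
      d.collect().map(r => (r.getInt(0), (r.getInt(1), r.getInt(2)))).toMap

    (toMap(viaApi), toMap(viaSql))
  }
}
```

Collecting into a map avoids depending on row order, which groupBy does not guarantee.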
Tip - Support MapType<String, String> like Hive
SQL in Hive:
Create table test (type map<string, string>);
Hive supports str_to_map, but Spark does not support it for DataFrames (Spark supports str_to_map only in HiveQL).
A UDF solves this.
val string_line = "A=1,B=2,C=3"
val df = logsDF.withColumn("type", str_to_map(lit(string_line))) // a UDF takes Columns, so wrap the string with lit
UDF - str_to_map
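import org.apache.spark.sql.functions.udf
val str_to_map = udf {
  text: String =>
    // Replace delimiter1 and delimiter2 with the real separators (for "A=1,B=2,C=3": "," and "=").
    val pairs = text.split("delimiter1|delimiter2").grouped(2)
    pairs.map { case Array(k, v) => k -> v }.toMap
}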
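The parsing logic inside the UDF can be tried as plain Scala, without Spark. This sketch assumes the separators are "," and "=", matching the sample string "A=1,B=2,C=3"; the object name `StrToMapSketch` is hypothetical. It uses collect instead of map so that a trailing key without a value is skipped rather than raising a MatchError, which is a small departure from the slide's version.

```scala
object StrToMapSketch {
  // Splits on either separator, pairs up neighbouring tokens,
  // and keeps only complete key/value pairs.
  def strToMap(text: String): Map[String, String] =
    text
      .split("[,=]")   // assumed separators: "," between pairs, "=" inside a pair
      .grouped(2)      // Array("A", "1"), Array("B", "2"), ...
      .collect { case Array(k, v) => k -> v }
      .toMap
}
```

For example, `StrToMapSketch.strToMap("A=1,B=2,C=3")` yields a map with keys A, B, and C. Wrapping the function with `udf(StrToMapSketch.strToMap _)` would be one way to turn it into a column function.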
Thank you.
