Type Checking Scala Spark Datasets:
Dataset Transforms
John Nestor 47 Degrees
www.47deg.com
Seattle Spark Meetup
September 22, 2016
47deg.com © Copyright 2016 47 Degrees
Outline
• Introduction
• Transforms
• Demos
• Implementation
• Getting the Code
Introduction
Spark Scala APIs
• RDD (pass closures)
• Functional programming model
• Types checked at compile time
• DataFrame (pass SQL)
• SQL programming model (can be optimized)
• Types checked at run time
• Dataset (pass closures or SQL)
• Combines best of RDDs and DataFrames
• Some (not all) types checked at compile time
Run-Time Scala Checking
• Field/column names
• Names specified as strings
• RT error if no such field
• Field/column types
• Specified via casting to expected type
• RT error if not of expected type
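Both failure modes can be reproduced in a few lines. The following is a minimal sketch, assuming Spark SQL is on the classpath and a local master is acceptable; the object name `RunTimeCheckingSketch`, the method `run`, and the misspelled column `bb` are illustrative and not part of the talk.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object RunTimeCheckingSketch {
  case class ABC(a: Int, b: String, c: String)
  case class CA(c: String, a: Int)

  /** Returns (misspelled-name query failed, mistyped-column query failed). */
  def run(): (Boolean, Boolean) = {
    // Assumption: a throwaway local session is fine for this demonstration.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("run-time-checking-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      val ds = Seq(ABC(3, "foo", "test")).toDS()

      // Field name given as a string: "bb" does not exist in ABC.
      // This compiles; Spark rejects it only when it analyzes the query.
      val badName = Try(ds.select($"bb" as "c", $"a").as[CA])

      // Field type fixed by the cast to CA: column "a" is a String here,
      // but CA.a is an Int. This also compiles and is rejected at run time.
      val badType = Try(ds.select($"b" as "c", $"c" as "a").as[CA])

      (badName.isFailure, badType.isFailure)
    } finally spark.stop()
  }
}
```

Neither mistake is visible to the Scala compiler, because the column names are plain strings and the row type is only asserted by `.as[CA]`.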
Dataset Example
case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

// Assumes a SparkSession named `spark` is in scope;
// its implicits provide toDS() and the $"..." column syntax.
import spark.implicits._

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking,
   but must pass a closure, which can't be optimized */
val ds1 = ds.map(abc => CA(abc.b, abc.a * 2 + abc.a))

/* Can be query optimized,
   but types and field names are checked at run time */
val ds2 = ds.select($"b" as "c",
  ($"a" * 2 + $"a") as "a").as[CA]
Transforms
Goal
• Add strong typing to Scala Spark Datasets
• Check field names at compile time
• Check field types at compile time
• Each transform maps one or more Datasets to a new Dataset
• Dataset rows are compile-time types: Scala case classes
Transform Example
case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking,
   and can still be query optimized */
val smap = SqlMap[ABC, CA]
  .act(cols => (cols.b, cols.a * 2 + cols.a))
val ds3 = smap(ds)


Current Transforms
• Filter
• Map
• Sort
• Join (combines 2 Datasets)
• Aggregate (sum, count, max)
Demos
Demo
• Dataset example
• map
• select
• Transform examples
• Map
• Sort
• Join
• Filter
• Aggregate
Implementation
Scala Macros
• Scala code executed at compile time
• Kinds
• Black box - single result type specified
• White box - result type computed (the kind used here)
Transform Implementation
• case class Person(name: String, age: Int)
  val p = Person("Sam", 30)
• Scala macro converts
• from: an arbitrary case class type
• classOf[Person]
• to: a meta structure that encodes field names and types
• case class PersonM(name: StringCol, age: IntCol)
  val cols = PersonM(name = StringCol("name"), age = IntCol("age"))
Column Operations
• StrCol("A") === StrCol("B") => BoolCol("A === B")
• IntCol("A") + IntCol("B") => IntCol("A + B")
• IntCol("A").max => IntCol("A.max")
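The column operations listed here can be mimicked in plain Scala, with no Spark and no macros. This is a hand-written sketch, not the library's code: the names `StrCol`, `IntCol`, and `BoolCol` come from the slide, while the `expr` field, the `*` operator, and the `ABCCols` record (standing in for the meta structure a macro would generate) are assumptions made for illustration.

```scala
object TypedColumnsSketch {
  // Each typed column carries the text of the expression it stands for.
  sealed trait Col { def expr: String }

  final case class BoolCol(expr: String) extends Col

  final case class StrCol(expr: String) extends Col {
    // Comparing two string columns yields a boolean column.
    def ===(other: StrCol): BoolCol = BoolCol(s"${expr} === ${other.expr}")
  }

  final case class IntCol(expr: String) extends Col {
    // Arithmetic is only defined between integer columns (or with an Int literal).
    def +(other: IntCol): IntCol = IntCol(s"${expr} + ${other.expr}")
    def *(n: Int): IntCol = IntCol(s"${expr} * $n")
    def max: IntCol = IntCol(s"${expr}.max")
  }

  // Hypothetical hand-written counterpart of ABC(a: Int, b: String, c: String).
  final case class ABCCols(a: IntCol, b: StrCol, c: StrCol)
  val cols: ABCCols = ABCCols(IntCol("a"), StrCol("b"), StrCol("c"))

  // The expression from the Transform Example, built from typed columns.
  val doubledPlusA: IntCol = cols.a * 2 + cols.a

  // These would be rejected by the compiler rather than at run time:
  //   cols.d              // no such field
  //   cols.a + cols.b     // IntCol + StrCol is not defined
}
```

Because field names become members of a case class and each operator fixes its operand and result types, a misspelled field or a mismatched type is a compile error. The sketch concatenates expression text without inserting parentheses, so it is not a faithful expression builder.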
White Box Macro Restrictions
• Works fine in SBT and Eclipse
• Not supported in IntelliJ, but can still be used there
• Reports type errors
• Does not show available completions
Getting the Code
Transforms Code
• https://github.com/nestorpersist/dataset-transform
• Code
• Documentation
• Examples
• "com.persist" % "dataset-transforms_2.11" % "0.0.5"
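In an sbt build, the coordinates above would typically be added as shown in this sketch. The artifact coordinates are copied from the slide; the `scalaVersion` value is an assumption inferred from the `_2.11` suffix, and the slides do not say whether an extra resolver is needed.

```scala
// build.sbt (sketch)
// Assumption: any Scala 2.11.x release matching the _2.11 artifact suffix.
scalaVersion := "2.11.8"

// Coordinates as given on the slide.
libraryDependencies += "com.persist" % "dataset-transforms_2.11" % "0.0.5"
```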
Questions
