Spark as the Gateway Drug to Typed Functional Programming
Jeff Smith
Rohan Aletty
x.ai
Real World AI
• Scale is increasing
• Complexity is increasing
• Human brain size is constant
System Complexity
(Diagram: Data Ingest, Annotation, Routing, Response Generation, Knowledge Base, and many Annotation Services and Models)
Problem Complexity
Complex Intelligence
Datanauts
Tools
Scala
• Bleeding edge
• Real world
Spark
• Incredibly powerful
• Easy to use
Typed Functional Programming
• Powerful abstractions
• Tough learning curve
Functions
Methods
• Collection of statements
• Might have side effects
• On an object
Methods
public class Dataset {
    private List<Double> observations;
    private Double average;

    public Dataset(List<Double> inputData) {
        observations = inputData;
    }
}
Methods
public class Dataset {
    public double getAverage() {
        Double runningSum = 0.0;
        for (Double observation : observations) {
            runningSum += observation;
        }
        average = runningSum / observations.size();
        return average;
    }
}
Methods
public class Dataset {
    public void setObservations(List<Double> inputData) {
        observations = inputData;
    }
}
Methods
import java.util.List;

public class Dataset {
    private List<Double> observations;
    private Double average;

    public Dataset(List<Double> inputData) {
        observations = inputData;
    }

    // Computes the mean and stores it in a field (a side effect).
    public double getAverage() {
        Double runningSum = 0.0;
        for (Double observation : observations) {
            runningSum += observation;
        }
        average = runningSum / observations.size();
        return average;
    }

    // Replaces the data in place (mutation).
    public void setObservations(List<Double> inputData) {
        observations = inputData;
    }
}
Functions
• Collection of expressions
• Returns a value
• Are objects (in Scala)
• Can be in-lined
Functions in Scala
val inputData = List(1.0, 2.0, 3.0)
Functions in Scala
def average(observations: List[Double]): Double = {
  observations.sum / observations.size
}
average(inputData)
Functions in Scala
def add(x: Double, y: Double) = {
  x + y
}
val sum = inputData.foldLeft(0.0)(add)
val average = sum / inputData.size
Functions in Scala
val sum = inputData.foldLeft(0.0)(add)
val average = sum / inputData.size
inputData.foldLeft(0.0)(_ + _) / inputData.size
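The progression on these slides can be checked side by side. The following is a minimal sketch in plain Scala; the object name `AverageDemo` and the helper `averageWith` are hypothetical names introduced only for illustration. It passes the same addition logic to `foldLeft` three ways: as a named method, as a function value, and as an anonymous function.

```scala
object AverageDemo {
  val inputData: List[Double] = List(1.0, 2.0, 3.0)

  // A named method, as on the slide.
  def add(x: Double, y: Double): Double = x + y

  // The same logic stored as a function value (functions are objects in Scala).
  val addValue: (Double, Double) => Double = (x, y) => x + y

  // Hypothetical helper: a higher-order function that takes the combining function.
  def averageWith(combine: (Double, Double) => Double): Double =
    inputData.foldLeft(0.0)(combine) / inputData.size

  val viaMethod: Double = averageWith(add)      // method converted to a function
  val viaValue: Double = averageWith(addValue)  // function value passed directly
  val viaLambda: Double = averageWith(_ + _)    // anonymous function, in-lined
}
```

All three calls produce the same average, which is the point of the slide: the function is just a value handed to `foldLeft`.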
Functions in Spark
inputData.foldLeft(0.0)(_ + _) / inputData.size
val observations = sc.parallelize(inputData)
observations.fold(0.0)(_ + _) / observations.count()
Immutability
Mutation
• Changing an object
Mutation
visits = {"Church": 2, "Backus": 1, "McCarthy": 4}
old_value = visits["Backus"]
visits["Backus"] = old_value + 1
Immutability
• Never changing objects
Immutability in Scala
val visits = Map("Church" -> 2, "Backus" -> 1, "McCarthy" -> 4)
val updatedVisits = visits.updated("Backus", 2)
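To make the contrast with the mutation example concrete, here is a small sketch (the object name `ImmutableVisits` is an assumption for illustration). It derives the new count from the old one rather than hard-coding it, and the original map is left untouched.

```scala
object ImmutableVisits {
  val visits: Map[String, Int] = Map("Church" -> 2, "Backus" -> 1, "McCarthy" -> 4)

  // `updated` returns a new Map; `visits` itself is never modified.
  val updatedVisits: Map[String, Int] =
    visits.updated("Backus", visits("Backus") + 1)
}
```

After this runs, `visits("Backus")` is still 1 while `updatedVisits("Backus")` is 2, so any code holding a reference to `visits` sees no change.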
Immutability in Spark
val manyVisits = sc.parallelize(visits.toSeq)
val additionalVisit = sc.parallelize(Seq(("Backus", 1)))
val updatedVisits = manyVisits.union(additionalVisit)
  .aggregateByKey(0)(_ + _, _ + _)
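The same union-then-aggregate idea can be tried without a cluster. This is a local-collections analogue, not Spark code: it uses only the Scala standard library, and the object name `MergeVisits` is hypothetical. Concatenation stands in for `union`, and grouping plus summing stands in for `aggregateByKey`.

```scala
object MergeVisits {
  val visits: Seq[(String, Int)] = Seq("Church" -> 2, "Backus" -> 1, "McCarthy" -> 4)
  val additionalVisit: Seq[(String, Int)] = Seq("Backus" -> 1)

  // Combine both sequences, then sum the counts per name.
  // Neither input sequence is modified.
  val updatedVisits: Map[String, Int] =
    (visits ++ additionalVisit)
      .groupBy { case (name, _) => name }
      .map { case (name, pairs) => name -> pairs.map(_._2).sum }
}
```

As in the Spark version, the update is expressed as new data combined with old data rather than as a write into an existing structure.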
Recap
Concepts
• Higher-order functions
• Anonymous functions
• Purity of functions
Concepts
• Currying
• Referential transparency
• Closures
• Resilient Distributed Datasets
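Several of the concepts listed above fit in a few lines of plain Scala. The sketch below is illustrative only; `ConceptsDemo`, `scale`, `twice`, and `above` are hypothetical names, not taken from the talk.

```scala
object ConceptsDemo {
  // Currying: separate parameter lists allow partial application.
  def scale(factor: Double)(x: Double): Double = factor * x
  val twice: Double => Double = scale(2.0)

  // Closure: the returned function captures `threshold` from the enclosing scope.
  def above(threshold: Double): Double => Boolean = x => x > threshold

  // Higher-order functions: map and filter take functions as arguments.
  val observations: List[Double] = List(1.0, 2.0, 3.0)
  val result: List[Double] = observations.map(twice).filter(above(3.0))

  // Referential transparency: scale is pure, so the call can be replaced by its value.
  val sameValue: Boolean = scale(2.0)(3.0) == 6.0
}
```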
Lazy Evaluation
Functional Programming — Lazy Evaluation
• Delaying evaluation of an expression until a value is needed
• Two major advantages of lazy evaluation
• Deferring computation lets the program evaluate only what is necessary
• The evaluation scheme can be rearranged to be more efficient
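Deferred evaluation can be observed directly in Scala. The following sketch assumes nothing beyond the standard library; `LazyDemo`, `expensive`, and `firstOrElse` are hypothetical names, and the mutable counter exists only to make evaluation visible.

```scala
object LazyDemo {
  // Instrumentation only: counts how many times the expensive block runs.
  var evaluations: Int = 0

  // Not computed at definition time; computed once, on first access.
  lazy val expensive: Int = { evaluations += 1; 42 }

  // By-name parameter: `fallback` is evaluated only if the list is empty.
  def firstOrElse(xs: List[Int], fallback: => Int): Int =
    xs.headOption.getOrElse(fallback)
}
```

Calling `LazyDemo.firstOrElse(List(1), LazyDemo.expensive)` never runs the expensive block, while calling it with `Nil` runs it exactly once, and later accesses reuse the cached value.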
Spark — Lazy Evaluation
• All transformations are lazy
• Each one is only recorded in Spark's computation DAG until an action runs
• Example DAGs
Spark — Lazy Evaluation
val rdd1 = sc.parallelize(...)
val rdd2 = rdd1.map(...)
val rdd3 = rdd1.map(...)
val rdd4 = rdd1.map(...)
rdd3.take(5)
Spark — Learning Laziness
• Advantage 1: deferred computation
• Comes directly from evaluating only the parts of the DAG that are necessary when executing an action
• Advantage 2: optimized evaluation scheme
• Comes directly from pipelining within Spark stages to make execution more efficient
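Both advantages have a rough analogue in the Scala standard library. The sketch below is an analogy using `Iterator`, not Spark itself; `PipelineDemo` and `firstFive` are hypothetical names, and the counter is instrumentation. The `map` step is lazy, and `take(5)` pulls only five elements through it, much as `take(5)` on an RDD avoids computing more than it needs.

```scala
object PipelineDemo {
  // Instrumentation only: counts how many elements the map step processed.
  var mapped: Int = 0

  // Iterator.map is lazy; elements flow through map one at a time,
  // and only as many as take(5) requests are ever computed.
  def firstFive(): List[Int] =
    Iterator.from(1)
      .map { x => mapped += 1; x * 10 }
      .take(5)
      .toList
}
```

The source iterator is unbounded, so an eager `map` would never finish; laziness is what makes the pipeline well defined at all.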
Types
Functional Programming — Type Systems
• Mechanism for defining algebraic data types (ADTs), which are useful for program structure
• i.e. “let’s group this data together and brand it a new type”
• Compile-time guarantees of program correctness
• e.g. “no, you cannot add Foo to Bar”
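The "Foo and Bar" remark can be made concrete with a short sketch. The types and the `combine` function below are hypothetical, chosen only to mirror the slide's wording.

```scala
object AdtDemo {
  // "Group this data together and brand it a new type."
  final case class Foo(count: Int)
  final case class Bar(label: String)

  // Only two Foo values can be combined.
  def combine(a: Foo, b: Foo): Foo = Foo(a.count + b.count)

  val total: Foo = combine(Foo(1), Foo(2))

  // combine(Foo(1), Bar("x"))  // rejected by the compiler: type mismatch
}
```

The commented-out line is the compile-time guarantee in action: the mistake is caught before the program runs rather than in production.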
Spark — Types
• RDDs (typed), Datasets (typed), DataFrames (untyped)
• Types enforce a schema on a dataset, which helps prevent unexpected behavior
Spark — Types
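import spark.implicits._ // provides the encoders used by as[Person] and groupByKey
case class Person(name: String, age: Long) // integers in JSON are read as Long
val peopleDS = spark.read.json(path).as[Person]
val ageGroupedDs = peopleDS.groupByKey(_.age) // typed grouping; groupBy takes columns, not a function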
Spark — Learning Types
• Using Spark through Scala also teaches pattern matching
• ADTs as both product types and union types
• Makes code easier to reason about
• Gives us compile-time safety
Spark — Learning Types
import org.apache.spark.rdd.RDD
// sealed lets the compiler check that the match below covers every case
sealed trait Person { def name: String }
case class Student(name: String, grade: String) extends Person
case class Professional(name: String, job: String) extends Person
val personRDD: RDD[Person] = sc.parallelize(…)
// working with both union and product types
val mappedRDD: RDD[String] = personRDD.map {
  case Student(name, grade) => grade
  case Professional(name, job) => job
}
Spark — Learning Types
val rdd1: RDD[Person] = sc.parallelize(...)
val rdd2: RDD[String] = rdd1.map(_.job) // Compilation error: Person has no member job
val rdd3: RDD[String] = rdd1.map("name: " + _.name) // It works!
Monads
Functional Programming — Monads
• In category theory:
• “a monad in X is just a monoid in the category of endofunctors of X”
• In functional programming, refers to a container that can:
• Inject a value into the container
• Perform operations on values returning a container with new values
• Flatten nested containers into a single container
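The three container capabilities listed above map onto standard-library operations. The sketch below uses `Option` and `List`; `ContainerOps` and `half` are hypothetical names used only for illustration.

```scala
object ContainerOps {
  // Inject a value into the container.
  val injected: Option[Int] = Option(3)

  // An operation that itself returns a container (it may fail).
  def half(n: Int): Option[Int] = if (n % 2 == 0) Some(n / 2) else None

  // Chain operations; flatMap keeps the result a single Option.
  val bound: Option[Int] = Option(8).flatMap(half).flatMap(half)
  val failed: Option[Int] = injected.flatMap(half)

  // Flatten nested containers into a single container.
  val flattened: List[Int] = List(List(1, 2), List(3)).flatten
}
```

Note that the failure case needs no explicit check between steps: once a step yields `None`, the remaining steps are skipped.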
Scala — Monads!
trait Monad[M[_]] {
  // constructs a Monad instance from the given value, e.g. List(1)
  def apply[T](v: T): M[T]
  // effectively lets you transform values within a Monad
  def bind[T, U](m: M[T])(fn: (T) => M[U]): M[U]
}
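One way to see how this trait is used is to implement it for `Option`. This is a minimal sketch under stated assumptions: the trait is repeated so the block is self-contained, and `MonadSketch`, `OptionMonad`, and the derived `map` helper are hypothetical names, not part of the talk or of any library.

```scala
object MonadSketch {
  trait Monad[M[_]] {
    def apply[T](v: T): M[T]
    def bind[T, U](m: M[T])(fn: T => M[U]): M[U]
  }

  // Hypothetical instance: Option satisfies the interface via Some and flatMap.
  object OptionMonad extends Monad[Option] {
    def apply[T](v: T): Option[T] = Some(v)
    def bind[T, U](m: Option[T])(fn: T => Option[U]): Option[U] = m.flatMap(fn)
  }

  // map can be derived from apply and bind alone.
  def map[M[_], T, U](m: M[T])(fn: T => U)(monad: Monad[M]): M[U] =
    monad.bind[T, U](m)(t => monad.apply(fn(t)))

  val example: Option[Int] = map(Option(20))(_ + 1)(OptionMonad)
  val empty: Option[Int] = map(Option.empty[Int])(_ + 1)(OptionMonad)
}
```

Deriving `map` from the two primitives is what the next slide means by building further transformations on top of a monad.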
Scala — Monads!
• Many monads in Scala
• List, Set, Option, etc.
• Powerful line of thinking
• Helps code comprehension
• Reduces error checking logic (pattern matching!)
• Can build further transformations: map(), filter(), foreach(), etc.
Spark — Learning Monads?
• We have many “computation builders”: RDDs, Datasets, DataFrames
• Containers on which transformations can be applied
• Similar to monads though not identical
• No unit function to wrap constituent values
• Cannot lift all types into flatMap function unconstrained
For Later
Conclusions
• Spark introduces all types of devs to Scala
• Scala helps people learn typed functional programming
• Typed functional programming improves Spark development
x.ai
@xdotai
hello@human.x.ai
New York, New York
Use the code
ctwsparks17
for 40% off!
Thank You

Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East talk by Jeff Smith and Rohan Aletty