Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Data Science Company
Boosting Big Data with Apache Spark
Mathias Lavaert
April 2015
About InfoFarm
Data Science ● Big Data
Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.
Java ● PHP ● E-Commerce ● Mobile ● Web Development
About me
Mathias Lavaert
Big Data Developer at InfoFarm since May 2014
Proud citizen of West-Flanders
Outdoor enthusiast
Agenda
• What is Apache Spark?
• An in-depth overview
– Spark Core and Resilient Distributed Datasets
– Unified access to structured data with Spark SQL
– Machine Learning with Spark MLLib
– Scalable streaming applications with Spark Streaming
• Q&A
• Wrap-up & lunch
What is Apache Spark?
“Apache Spark is a fast and general engine for big data
processing, with built-in modules for streaming, SQL,
machine learning and graph processing”
History
• Created by Matei Zaharia at UC Berkeley in 2009
• Based on the 2007 Microsoft Dryad paper
• Donated to the Apache Software Foundation in 2013
• 465 contributors in 2014, making it the most active Apache project
• Currently supported by Databricks, a company founded
by the creators of Apache Spark
Target users
● Data Scientists
○ Data exploration and data modelling using interactive
shells
○ Machine Learning
○ Ad hoc analysis to answer business questions or discover new insights
● Engineers
○ Fault-tolerant production data applications
○ ‘Productizing’ the work of the data scientist
○ Integration with business applications
Where to situate Apache Spark?
Differences with MapReduce
• Faster by minimizing I/O and keeping data in memory as much as possible
• Unified libraries
• Huge community effort, very fast
development pace.
• Ships with higher-level tools included
Daytona GraySort Contest
Differences with Hive, Pig, others...
• One integrated framework that suits a
wide range of problems
• No need for a workflow application like
Oozie
• Only 1 language/framework to learn
Explosion of Specialized Systems
Architecture
Advantages of unified libraries
Advancements in higher-level libraries are pushed down into core and
vice-versa
● Spark Core
○ Highly-optimized, low overhead, network-saturating shuffle
● Spark Streaming
○ Garbage collection, memory management, cleanup
improvements
● Spark GraphX
○ IndexedRDD for random access within a partition instead of scanning the entire partition
● Spark MLLib
○ Statistics (Correlations, sampling, heuristics)
Supported languages
Difference between Java and Scala
Cluster Resource Managers
● Spark Standalone
○ Suitable for a lot of production workloads
○ Only suitable for Spark workloads
● YARN
○ Allows hierarchies of resources
○ Kerberos integration
○ Multiple workloads from different execution frameworks
■ Hive, Pig, Spark, MapReduce, Cascading, etc…
● Mesos
○ Similar to YARN, but allows elastic allocation
○ Coarse-grained
■ A single long-running Mesos task runs Spark mini-tasks
○ Fine-grained
■ New Mesos task for each Spark task
■ Higher overhead, not good for long-running Spark jobs
(Streaming)
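The cluster manager is selected by the master URL given to Spark. A minimal sketch (host names, ports and the application name are placeholders, not from the original slides):

import org.apache.spark.{SparkConf, SparkContext}

// Pick one master URL depending on the cluster resource manager:
//   local[*]                  run locally with one thread per core
//   spark://master:7077       Spark Standalone
//   mesos://mesos-master:5050 Mesos
//   yarn-client               YARN (Spark 1.x client mode)
val conf = new SparkConf()
  .setAppName("ClusterManagerExample")   // placeholder name
  .setMaster("spark://master:7077")      // placeholder standalone master
val sc = new SparkContext(conf)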
Storage Layers for Spark
Spark can create distributed datasets from:
● Any file stored in the Hadoop distributed filesystem (HDFS)
● Any storage system supported by the Hadoop APIs
○ Local filesystem
○ S3
○ Cassandra
○ Hive
○ HBase
Note that Apache Spark doesn’t require Hadoop, but it has support for
storage systems implementing the Hadoop APIs.
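The same API works across these storage layers; only the URI scheme changes. A small sketch, assuming sc is an existing SparkContext (for example the one from the Spark shell) and the paths are placeholders:

val fromHdfs  = sc.textFile("hdfs://namenode:9000/data/input.txt")  // HDFS
val fromLocal = sc.textFile("file:///tmp/input.txt")                // local filesystem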
Short introduction to functional
programming
What is functional programming?
A programming paradigm where the
basic unit of abstraction is the function
Basic concepts
● Higher-order functions
○ Functions that take other functions as arguments,
○ or that return functions as results
● Pure functions
○ Purely functional expressions have no side effects
● Recursion
○ Iteration in functional languages is usually
accomplished via recursion.
● Immutable data structures
Small example with a functional
language: Scala
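The example itself appears on the slide as an image; a minimal illustrative sketch of the concepts above (not the original slide's code):

// Higher-order function: takes another function as an argument
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

// Pure function: no side effects, same output for the same input
val addOne: Int => Int = _ + 1
applyTwice(addOne, 5) // 7

// Recursion instead of a loop, over an immutable List
def sum(xs: List[Int]): Int = xs match {
  case Nil          => 0
  case head :: tail => head + sum(tail)
}
sum(List(1, 2, 3)) // 6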
Introduction to Spark concepts
Resilient Distributed Datasets (RDDs)
● Core Spark abstraction
● Immutable distributed collection of objects
● Split into multiple partitions
● May be computed on different nodes of the cluster
● Can contain any type of Scala, Java or Python object
including user-defined classes
“Distributed Scala collections”
Driver and context
● Driver
○ Shell
○ Standalone program
● Spark Context represents a connection to a computing cluster
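A minimal sketch of a driver creating a context and a first RDD (master, application name and file path are placeholders; in the Spark shell the context already exists as sc):

import org.apache.spark.SparkContext

val sc = new SparkContext("local[4]", "DriverExample")  // connect to a cluster (here: local)
val lines   = sc.textFile("data.txt")                   // RDD backed by a file
val numbers = sc.parallelize(1 to 1000)                 // RDD from an in-memory collection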
RDD Operations
● Transformations
○ map
○ filter
○ flatMap
○ sample
○ groupByKey
○ reduceByKey
○ union
○ join
○ sort
● Actions
○ count
○ collect
○ reduce
○ lookup
○ save
● Transformations are lazy
● Actions force the computation of transformations
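A minimal sketch of lazy transformations followed by actions, assuming sc is an existing SparkContext and the log file is a placeholder:

val lines   = sc.textFile("server.log")                     // nothing is read yet
val errors  = lines.filter(line => line.contains("ERROR"))  // transformation: lazy
val lengths = errors.map(line => line.length)               // transformation: lazy

val numErrors = errors.count()    // action: triggers the actual computation
val sample    = errors.take(10)   // action: returns a small sample to the driver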
Narrow vs wide dependencies
Demo using only core operations
Specialized operations for specific
types of RDDs
Specialized operations for Key/Value pairs
● reduceByKey
● groupByKey
● combineByKey
● mapValues
● flatMapValues
● keys
● sortByKey
● subtractByKey
● join
● rightOuterJoin
● leftOuterJoin
● cogroup
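A small word-count style sketch using pair-RDD operations (sc is an existing SparkContext; the file is a placeholder):

import org.apache.spark.SparkContext._   // pair-RDD functions on older Spark versions

val words      = sc.textFile("book.txt").flatMap(line => line.split(" "))
val pairs      = words.map(word => (word, 1))      // key/value RDD
val wordCounts = pairs.reduceByKey(_ + _)          // combine the counts per key
val sorted     = wordCounts.sortByKey()

// join two pair RDDs on their keys
val stock  = sc.parallelize(Seq(("apple", 3), ("pear", 7)))
val price  = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.3)))
val joined = stock.join(price)                     // RDD[(String, (Int, Double))]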
Specialized operations for numeric RDDs
● count
● mean
● sum
● max
● min
● variance
● sampleVariance
● stdev
● sampleStDev
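These operations become available on RDDs of doubles; a tiny sketch, assuming sc is an existing SparkContext:

val measurements = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))
measurements.mean()    // 3.0
measurements.stdev()   // population standard deviation
measurements.stats()   // count, mean, stdev, min and max in a single pass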
And many more...
● HadoopRDD
● FilteredRDD
● MappedRDD
● PairRDD
● ShuffledRDD
● UnionRDD
● DoubleRDD
● JdbcRDD
● JsonRDD
● SchemaRDD
● VertexRDD
● EdgeRDD
● CassandraRDD
● GeoRDD
● EsSpark (Elasticsearch)
Spark SQL
Spark SQL Overview
● Newest component of Spark
● Tightly integrated to work with structured data
○ Tables with rows and columns
● Transform RDDs using SQL
● Data source integration: Hive, Parquet, JSON and more…
● Optimizes execution plan
Differences with Spark Core
● Spark + RDDs
○ Functional transformations on
collections of objects
● SQL + SchemaRDDs
○ Declarative transformations on
collections of tuples
Getting started with Spark SQL
● Create an instance of SQLContext or HiveContext
○ Entry point for all SQL functionality
○ Wraps/extends existing Spark Context (Decorator Pattern)
● If you’re using the shell, a SQLContext has been created for you
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
val sparkContext = new SparkContext("local[4]", "SQL")
val sqlContext = new SQLContext(sparkContext)
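A minimal sketch of what querying might look like from there (the JSON file and its columns are hypothetical; the Spark 1.x SchemaRDD-era API is assumed):

// Load a JSON file; Spark SQL infers the schema
val people = sqlContext.jsonFile("people.json")    // hypothetical file
people.registerTempTable("people")                 // make it queryable from SQL

// The result of a query is again an RDD of rows
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)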
Language Integrated UDFs
● Ability to write custom SQL functions in any of the languages supported by Spark
● Another example of how Spark simplifies the big data stack
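A small sketch of registering a Scala function as a UDF and calling it from SQL (Spark 1.3-style API assumed; the people table is the hypothetical one registered earlier):

// Register a plain Scala function under a SQL name
sqlContext.udf.register("strLen", (s: String) => s.length)

sqlContext.sql("SELECT name, strLen(name) FROM people")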
Parquet compatibility
Native support for reading data stored in Parquet:
● Columnar storage avoids reading unneeded data
● SchemaRDDs can be written to Parquet while preserving the schema
● Convert other slower formats like JSON to Parquet for repeated querying.
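A sketch of the round trip, reusing the hypothetical people SchemaRDD from above (Spark 1.x API assumed; paths are placeholders):

// Write to Parquet, preserving the schema
people.saveAsParquetFile("people.parquet")

// Read it back later; only the requested columns are scanned
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT name FROM parquet_people")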
Demo: Spark SQL
Spark MLLib
Machine Learning Algorithms
● Supervised
○ Prediction: Train a model with existing data + label, predict
label for new data
■ Classification (categorical)
■ Regression (continuous numeric)
○ Recommendation: recommend to similar users
■ User -> user, item -> item, user -> item similarity
● Unsupervised
○ Clustering: Find natural clusters in data based on similarities
Algorithms provided by Spark
● Classification and regression
○ Linear models (SVMs, logistic regression, linear regression)
○ Naive Bayes
○ Decision trees
○ Ensembles of trees (Random Forests and Gradient-Boosted trees)
○ Isotonic regression
● Recommendations
○ Alternating Least Squares (ALS)
○ FP-growth
● Clustering
○ K-Means
○ Gaussian mixture
○ Power Iteration clustering
○ Latent Dirichlet allocation
○ Streaming k-means
● Dimensionality reduction
○ Singular value decomposition (SVD)
○ Principal component analysis (PCA)
Tools provided by Spark
● Tools for basic statistics including
○ Summary statistics
○ Correlations
○ Sampling
○ Hypothesis testing
○ Random data generation
● Tools for feature extraction and transformation
○ Extracting features out of text
○ Uniform Vector format to store features
● Tools to build Machine Learning Pipelines
using Spark SQL
Why choose MLLib?
● One of the best documented machine learning
libraries available for the JVM
● Simple API, constructs are the same for different
algorithms
● Well integrated with other Spark-components
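As an illustration of that API, a minimal k-means sketch (sc is an existing SparkContext; the CSV file of numeric features is hypothetical):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse comma-separated numeric features into MLLib vectors
val data = sc.textFile("features.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 3, 20)    // 3 clusters, 20 iterations
model.predict(Vectors.dense(1.0, 2.0))   // dimension must match the training data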
Demo: Spark MLLib
Spark Streaming
Spark Streaming Overview
● Built around the concept of DStreams or discretized
streams
● Long-running Spark application
● Micro-batch architecture
● Supports Flume, Kafka, Twitter, Amazon Kinesis,
Socket, File…
DStreams
● A sequence of RDDs
● Stateless transformations
● Stateful transformations
● Checkpointing
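A minimal streaming word-count sketch (sc is an existing SparkContext; the socket host and port are placeholders; Spark 1.3-style API assumed):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))  // micro-batches of 5 seconds
ssc.checkpoint("checkpoint")                    // required for stateful transformations

val lines  = ssc.socketTextStream("localhost", 9999)   // DStream: a sequence of RDDs
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)                   // stateless: applied per batch
counts.print()

ssc.start()
ssc.awaitTermination()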
Spark Streaming Use Cases
● ETL and enrichment of streaming data on ingestion
● Lambda Architecture
● Operational dashboards
Demo: Spark Streaming
Spark on Amazon EC2
Apache Spark runs easily on Amazon EC2
Apache Spark comes with a script to launch Spark clusters
on Amazon EC2.
So there is no need to invest in a cluster of servers...
Furthermore, it has support for multiple Amazon services.
● Spark can read files from Amazon S3
● Spark Streaming can easily be integrated with Amazon
Kinesis
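For example, reading from S3 only needs credentials and an s3n path (bucket name and keys are placeholders; in practice prefer IAM roles):

// Placeholder AWS credentials
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val logs = sc.textFile("s3n://my-bucket/logs/2015-04-01.log")  // s3n scheme as used in Spark 1.x
logs.count()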
Conclusion
Why choose Apache Spark?
● Modern integrated full-stack Big Data framework
● Suitable for both batch and (near) real time applications
● Well supported by a very large community
● The Big Data landscape seems to be shifting to Apache Spark
Questions?