Nada ZIRARI
Douaa HASNAOUI
Exponential
Growth in
Data
(1 trillion gigabytes!)
The 4 V’s of Big data
- Volume
- Velocity
- Variety
- Veracity
Exponential
Growth in
Data
Over 1 billion TB!
The 4 V’s of Big data
- Volume
- Velocity
- Variety
- Veracity
https://www.domo.com/learn/data-never-sleeps-7
How do we add value
with all this data?
• Real-time, reliable results
• streaming data
• efficient parallel processing
• fault handling
• Straightforward coding
• good libraries on several platforms
• Actionable insights!
• machine learning
• useful visualizations
Over 1 billion TB!
https://www.domo.com/learn/data-never-sleeps-7
• Open-source cluster-computing framework
• Developed at UC Berkeley’s AMPLab in 2009
• Donated to Apache Software Foundation in 2013, first release in 2014
• Developed to enable real-time processing of large/streaming datasets
Overview
Real-time
results!
• Distributes data to memory
• Up to 100x faster than reading from disk
• Cross-platform
• Hadoop, Apache Mesos, Kubernetes,
stand-alone or in the cloud
• Multiple languages
• Java, R, Scala, and Python
• Rich set of libraries
• Enable large variety of processing
Features
From https://intellipaat.com/blog/tutorial/spark-tutorial/spark-features/
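For trying these features locally, here is a minimal PySpark setup; a sketch, assuming PySpark is installed (e.g. via pip install pyspark). The application name and the local[*] master are illustrative choices, not something the slides prescribe.

from pyspark.sql import SparkSession

# Start a local Spark session; local[*] uses one worker thread per CPU core.
spark = (
    SparkSession.builder
    .appName("spark-demo")      # hypothetical application name
    .master("local[*]")
    .getOrCreate()
)
sc = spark.sparkContext         # classic entry point used by the RDD examples below
print(spark.version)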
• Spark outperforms Hadoop by up to 20x in iterative machine learning applications.
• avoids I/O and deserialization costs by storing data in memory as Java objects
• Spark can be used to query a 1 TB dataset interactively with latencies of 5–7 seconds
Performance
Page Rank Example
Machine Learning Example
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
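The in-memory numbers above come largely from caching: once an RDD is materialized in executor memory, iterative jobs reuse it instead of re-reading from disk. A minimal sketch, assuming a local SparkSession and an invented workload:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# Hypothetical iterative workload: the same derived RDD is used several times.
data = sc.parallelize(range(1_000_000))
squares = data.map(lambda x: x * x)

squares.cache()                 # keep computed partitions in memory (MEMORY_ONLY)
total = squares.sum()           # first action materializes and caches the RDD
count = squares.count()         # second action is served from memory, no recomputation
print(total, count)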
• Fundamental Data Structure of Apache Spark
• Immutable
• Parallel Data Structures
• Are manipulated through coarse-grained transformations (e.g. map, filter and join) that are
applied to whole datasets
• RDDs can only be created through deterministic operations (transformations) on data in stable
storage or on other RDDs. Hence, an RDD is not necessarily a materialized dataset but a
description of how to compute one.
• Fault tolerant
RDDs (Resilient Distributed Datasets)
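A minimal sketch of these properties in PySpark, with invented data: transformations produce new, immutable RDDs that only describe a computation, and nothing is executed until an action runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# An RDD created by a deterministic operation on in-memory data.
lines = sc.parallelize(["spark is fast", "rdds are immutable", "spark uses rdds"])

# Transformations return new RDDs; the originals are never modified.
words = lines.flatMap(lambda line: line.split())
spark_words = words.filter(lambda w: w == "spark")

# Nothing has been computed yet: the RDDs above are only a recipe.
print(spark_words.count())      # the action triggers the computation -> 2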
• DAGs provide the execution workflow for RDDs
• A DAG is a structure used to model pairwise
relations between objects
• The components of a graph are vertices (nodes
or points), which are connected by edges
(links or lines)
• A DAG’s edges have directions and
connect its vertices sequentially, with no cycles.
• In Apache Spark’s DAGs, vertices represent
RDDs and edges represent the
transformations applied to them.
DAGs (Directed acyclic graphs)
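A hedged sketch of how such a DAG forms in PySpark: the two source RDDs below are vertices, the filter and join are edges, and the whole graph remains a plan until an action forces execution. The data is invented for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# Two source RDDs: vertices of the DAG.
purchases = sc.parallelize([("alice", 30), ("bob", 12)])
cities = sc.parallelize([("alice", "Paris"), ("bob", "Rabat")])

# Transformations: directed edges that connect the vertices.
big_purchases = purchases.filter(lambda kv: kv[1] > 20)
joined = big_purchases.join(cities)     # two branches of the DAG converge here

# The DAG stays a plan until an action is called.
print(joined.collect())                 # [('alice', (30, 'Paris'))]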
Partitions
• Data that is large in size needs to be “partitioned”, i.e. sliced across different
nodes/machines
• The number of partitions can be chosen by the user
https://medium.com/@thejasbabu/spark-under-the-hood-partition-d386aaaa26b7
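A small sketch of user-chosen partitioning in PySpark; the partition counts used here (8, 2, 16) are arbitrary illustrative values.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# The second argument to parallelize is the number of partitions ("slices").
numbers = sc.parallelize(range(100), 8)
print(numbers.getNumPartitions())       # 8

# Partitioning can also be changed later.
fewer = numbers.coalesce(2)             # shrink without a full shuffle
more = numbers.repartition(16)          # full shuffle into 16 partitions
print(fewer.getNumPartitions(), more.getNumPartitions())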
Parallelized
• Functions applied to the dataset can be run simultaneously on each “slice” or “partition”.
https://medium.com/@lavishj77/spark-fundamentals-part-2-a2d1a78eff73
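A short sketch of per-slice parallelism in PySpark: glom() shows how the data was sliced, and mapPartitions() applies a function to each slice as an independent task. The slice contents in the comments are what a typical run produces and may differ.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(12), 4)      # 4 slices, processed in parallel

# glom() gathers each partition into a list, revealing the slicing.
print(rdd.glom().collect())
# e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

# mapPartitions applies one function per slice.
def partition_sum(numbers):
    yield sum(numbers)

print(rdd.mapPartitions(partition_sum).collect())   # e.g. [3, 12, 21, 30]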
Fault Tolerance/Failure Recovery
• The coarse-grained transformations
mentioned earlier make fault
tolerance cheap, because they are
logged (the log is called a lineage).
• If a node fails, the scheduler re-runs
the relevant part of the lineage to
restore the lost information
• Partitioned RDDs carry enough
information about how they were
derived from other datasets, so lost data
can be recovered without full replication
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
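A hedged sketch of inspecting that logged lineage from PySpark: toDebugString() prints the chain of coarse-grained transformations that Spark would replay to rebuild a lost partition.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

events = sc.parallelize(range(1000), 4)
cleaned = events.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)

# The lineage: parallelize -> filter -> map, kept as metadata, not as copied data.
print(cleaned.toDebugString().decode("utf-8"))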
Spark’s Libraries
https://intellipaat.com/blog/what-is-apache-spark/
• Uses two main concepts:
• DataFrame API
• Catalyst optimizer
• The DataFrame API exposes a distributed
collection of structured rows to Java, Scala, and Python
• The Catalyst optimizer contains a general
library that represents queries as trees and applies
different rules to transform them
• The library runs on top of Spark, as shown
in the figure
SQL and DataFrames
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
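A minimal sketch of the DataFrame API in PySpark, with invented data: the same relational query is expressed once through the DataFrame API and once in SQL, and both are planned by Catalyst.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A DataFrame is a distributed collection of rows with a known schema.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 27), ("carol", 41)],
    ["name", "age"],
)

# DataFrame API version of the query.
df.filter(F.col("age") > 30).select("name").show()

# Equivalent SQL version against a temporary view.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()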
• Catalyst’s main data type is a tree,
composed of node objects
• Trees are modified by rules: functional
transformations from one tree to
another
• Tree objects are immutable
• Spark SQL’s rules make heavy use of
pattern matching, which allows
values to be extracted from nested
structures of algebraic data types.
SQL and DataFrames
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
• Catalyst’s tree transformations are used in four phases (illustrated in the sketch below).
• Analysis: rule-based pass that resolves any unresolved relations or attribute
references
• Logical Optimization: rule-based optimization applied to the already-built logical plan
• Physical Planning: generates candidate physical plans and selects one using a cost model
• Code Generation: uses Scala “quasiquotes” to efficiently generate Java bytecode
SQL and DataFrames
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
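A quick way to see these phases from PySpark is DataFrame.explain(extended=True), which prints the parsed and analyzed logical plans, the optimized logical plan, and the selected physical plan. The query below is an invented example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "value"])
query = df.filter(F.col("value") > 1).groupBy("key").count()

# Prints the plans produced along Catalyst's analysis, logical optimization,
# and physical planning phases (code generation happens at execution time).
query.explain(extended=True)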
• Provides fast distributed implementations of common learning algorithms, including
linear models, naïve Bayes, decision trees, and k-means clustering
• Provides many algorithmic optimizations, such as in the ALS algorithm, building on
efficient linear algebra operations
• Provides machine learning pipelines, often described as workflows that sequentially
chain data preparation, feature extraction, model fitting and validation stages
(see the sketch below)
• Provides robust documentation, including a list of code dependencies
Machine Learning Library (MLlib)
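A hedged sketch of such a pipeline using PySpark’s spark.ml API and a tiny invented dataset: a feature-assembly stage followed by a decision tree, chained in the feature extraction then model fitting order described above.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Tiny hypothetical dataset: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (1.5, 0.2, 1.0), (0.3, 0.9, 0.0), (2.0, 0.1, 1.0)],
    ["f1", "f2", "label"],
)

# Pipeline stages: feature extraction, then model fitting.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
tree = DecisionTreeClassifier(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, tree])

model = pipeline.fit(train)
model.transform(train).select("features", "label", "prediction").show()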
• MLlib v1.4 is faster than MLlib v1.1, and
both are substantially faster than
Apache Mahout v0.9 (which runs on
Hadoop MapReduce). The benchmark
used for this graph was the ALS
algorithm.
• Comparing MLlib v1.1 with MLlib v1.0
shows an average 3.0x improvement
across different algorithms
Machine Learning Library (MLlib)
http://www.jmlr.org/papers/volume17/15-237/15-237.pdf
• Proposes a new streaming model in place of the well-known “continuous operator”
model
• The new model avoids common problems by structuring computations as a set of
short, stateless, deterministic tasks rather than long-running continuous operators. The
resulting streams are called discretized streams (D-Streams).
Spark Streaming
http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf
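A minimal Spark Streaming sketch in PySpark, closely following the word-count example from the linked programming guide. The socket source on localhost:9999 is an assumption (it can be fed with nc -lk 9999), and each 1-second micro-batch is processed as a short, deterministic RDD job.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# At least two local threads: one for the receiver, one for processing.
sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=1)     # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()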
• Graph Data
• Graphs consist of nodes and edges
• Nodes represent entities
• Edges represent relationships
• Most social media data are graph data
• Graph-parallel systems
• Optimized for graph-structured data (NoSQL)
GraphX – Intro to Graph Data
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
• Distributed graph framework that
unifies graph-parallel and data-
parallel computation
• Data can be viewed as both graphs and
tables
• Uses two RDDs - edge collection, vertex
collection
• Allows end-to-end pipeline of graph
and table operations without moving
the data
• Less efficient on pure graph-parallel
workflows
• More efficient on combined graph-
parallel and data-parallel workflows
Library: GraphX
https://spark.apache.org/docs/0.9.0/graphx-programming-guide.html
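GraphX itself exposes a Scala/Java API. For a Python-flavoured sketch of the same vertex-collection/edge-collection idea, the separate GraphFrames package can be used; it is an external dependency and an assumption here, not part of GraphX or of these slides.

from pyspark.sql import SparkSession
from graphframes import GraphFrame   # requires the graphframes Spark package

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Vertex table ("id" column) and edge table ("src"/"dst" columns).
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

# Graph view and table view of the same data.
g.inDegrees.show()                                  # graph-style query
g.edges.filter("relationship = 'follows'").show()   # ordinary DataFrame query

# PageRank, as mentioned in the Performance slide's example.
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()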
Technical Demonstration:
ML Lib - Decision Tree
Decision Tree - Purity
Technical Demonstration:
Spark Streaming
https://spark.apache.org/docs/latest/streaming-programming-guide.html
• Real-time Sentiment Analysis – based on social media posts (Facebook, Twitter, etc.)
• Real-time Fraud Detection of bank transactions
Use Cases
• Netflix
• captures all member activities to personalize recommendations
• processes 450 billion events per day
• MyFitnessPal
• used to build their data pipeline and create a list of ‘Verified Foods’
• provided a ten-fold speed improvement over their previous data pipeline
Need for Streaming Applications
Apache Spark at MyFitnessPal
MyFitnessPal, the largest health and fitness community, helps people achieve a healthy lifestyle through
better diet and exercise. MyFitnessPal uses Apache Spark to clean the data entered by users, with the
end goal of identifying high-quality food items. Using Spark, MyFitnessPal has been able to scan through
the food calorie data of about 80 million users. Earlier, MyFitnessPal used Hadoop to process 2.5 TB of
data, and it took several days to identify any errors or missing information.
Netflix
Netflix uses Apache Spark for real-time stream processing to provide online recommendations to its
customers. Streaming devices at Netflix send events that capture all member activities and play a vital
role in personalization. Netflix processes 450 billion events per day, which flow to server-side applications
and are directed to Apache Kafka.
Use cases
https://www.dezyre.com/article/top-5-apache-spark-use-cases/271
• With more data being created all the time and users requiring real-time results, we
need methods that can process big data quickly
• Spark distributes data to memory – up to 100x faster than reading from disk
• Rich set of libraries enable large variety of processing
• Enabled by RDDs (Resilient Distributed Datasets)
• Read-only, partitioned collections of records that are fault-tolerant
Key Take-Aways
Interactive Component
Speed Comparison: MapReduce vs Spark
[Room layout: four teams (Spark 1, Spark 2, MapReduce 1, MapReduce 2) arranged around Whiteboard 1, Whiteboard 2 and two TVs]
Instructions
Category               Lollipop   Chocolate Minibar   Kerr’s Toffee
Count                  8          9                   3
Count (after action)   0          10                  2
Sum Total              12 (0 + 10 + 2)
Each worker will take 20 candies from the bin in the middle
Task 1: Group candies by category
Task 2: Count candies
Task 3: For each category, perform the following actions:
- If candy count in Lollipop is even, discard all Lollipop candies
- If candy count in Chocolate Minibar is odd, add an extra Chocolate Minibar candy to your pile
- If candy count in Kerr’s Toffee is even, remove one Kerr’s Toffee candy from your pile
Task 4: Sum Total Candy (Regardless of Category)
Debrief
Speed Comparison: MapReduce vs Spark
• PySpark Installation Guide and Jupyter notebooks – available on D2L
• Spark website
• https://spark.apache.org
• Key technical papers
• M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: a fault-tolerant
abstraction for in-memory cluster computing”, NSDI ’12: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation,
April 2012. Available: http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
• M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: fault-tolerant streaming computation at scale”, SOSP 2013,
November 2013. Available: http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf
• X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh,
M. Zaharia, and A. Talwalkar, “MLlib: machine learning in Apache Spark”, Journal of Machine Learning Research, vol. 17, pp. 1-7, 2016. Available:
http://www.jmlr.org/papers/volume17/15-237/15-237.pdf
• M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia, “Spark SQL: relational data
processing in Spark”, SIGMOD 2015, June 2015. Available: http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
• R. S. Xin, D. Crankshaw, A. Dave, J. E. Gonzalez, M. J. Franklin, and I. Stoica, “GraphX: unifying data-parallel and graph-parallel analytics”, arXiv,
February 2014. Available: https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx.pdf
References