www.hadoopexpress.com
Introduction to Apache Spark
An Overview of Features
© Net Serpents LLC, USA
08-24-2016
Introduction to Apache Spark
Agenda
What is Apache Spark
Major Vendors and Users
Key Features
Hadoop Vs Spark
Spark Architecture
Spark Streaming
Spark Processing
Examples and Use Cases
Part 1: Introduction
Disclaimer: Apache Hadoop and Apache Spark are registered trademarks of the Apache Software Foundation (ASF). Hadoop Express
and Net Serpents are not affiliated in any way with the ASF. All educational material is created and owned by Net Serpents (dba
Hadoop Express) and is intended only to provide training. Net Serpents does not own any of the products on which it provides
training; many are owned by Apache, while others are owned by companies such as SAS and Oracle. Net
Serpents LLC is committed to education and online learning. All recognizable terms and names of software, tools, and programming
languages that appear on this site belong to their respective copyright and/or trademark owners.
 General data processing engine compatible with Hadoop data
 Used to query, analyze and transform data
 Developed in 2009 at AMPLab at University of California, Berkeley
 Open sourced in 2010
 Became a top-level Apache project in 2014
 First described in the Mesos white paper from AMPLab
 Optimized to run in memory
Up to 100 times faster than MapReduce when run in memory
Up to 10 times faster than MapReduce when writing data to disk
What is Apache Spark
Apache Spark is an open source big data processing framework
built around speed, ease of use, and sophisticated analytics
www.hadoopexpress.com
 A general-purpose data processing engine, suitable for use in a wide range
of circumstances
 Interactive queries across large data sets, processing of streaming data
from sensors or financial systems, and machine learning tasks
 Supports other data processing tasks with developer libraries and APIs
 Supports languages such as Java, Python, R and Scala
 Often used alongside Hadoop’s HDFS
 Can also integrate equally well with other popular data storage subsystems
such as HBase, Cassandra, MapR-DB, MongoDB and Amazon’s S3
What is Apache Spark
• Databricks – founded by the creators of Spark at Berkeley
• Cloudera
• Hortonworks
• MapR
Major Vendors
• More than 1,000 organizations are using Spark in production
• IBM, Huawei, Baidu, Alibaba Taobao (eCommerce web site)
• Tencent (social networking site with 800 million users; 8,000 compute nodes)
• Amazon, eBay, Yahoo! and many others…
Major Users
Major Vendors and Users
Simplicity / Ease of Use
Rich set of APIs
 to interact with large datasets
 Well documented
 Structured
Key Features
Speed
In Memory / On Disk
Spark is designed for speed, operating both in memory and on disk.
 In 2014, Spark won the Daytona GraySort benchmarking challenge, processing
100 terabytes of data on solid-state drives in 23 minutes. The previous winner,
using Hadoop MapReduce, took 72 minutes.
Key Features
Key Features
Stream processing
Process “streams” of data from multiple sources simultaneously
Machine learning
 Well suited to training machine learning algorithms.
Running broadly similar queries again and again, at scale, significantly
reduces the time required to iterate through a set of possible solutions in
order to find the most efficient algorithms.
Interactive analytics
 Explore data interactively by viewing query results and then either altering the
initial query slightly or drilling deeper into results
Data integration
 Spark (and Hadoop) are increasingly being used to reduce the cost and time
required for the ETL process.
Development Language Support
Scala
Python
Java
SQL
R
Key Features
Hadoop Versus Spark
 Hadoop has cluster management built in (provided by YARN), while
Spark requires an external cluster manager
 Spark can run on top of Hadoop and use its cluster manager (YARN),
or run separately with other cluster managers such as Mesos
 Spark is not designed for data management or cluster management;
Hadoop handles these well
 Hadoop provides advanced data security features that Spark lacks
 Hadoop provides disaster recovery capabilities to Spark
 Spark provides fast in-memory processing of large data
volumes, which Hadoop does not
 Spark provides enterprise-class streaming, graph processing and
machine learning capabilities that Hadoop can utilize
Spark is not a replacement for Hadoop; Spark and Hadoop complement each other.
Architecture
Integrations
Spark can run in the following modes:
•Standalone cluster mode
•On Hadoop YARN
•On Apache Mesos
Spark can access data in:
•HDFS
•Cassandra
•Hive
•HBase
•Tachyon
•Any Hadoop data source
Architecture
SPARK Technology Stack
Spark SQL | Spark Streaming | MLlib (Machine Learning) | GraphX (Graph Computation) | SparkR (R on Spark)
SPARK Core Engine
Standalone Scheduler | YARN | Mesos
Architecture
SPARK Core Engine
•Basic functionality of Spark
•Uses RDDs (Resilient Distributed
Datasets)
•Contains APIs for manipulating
RDDs
Spark RDDs are collections of items distributed across compute nodes.
Spark core APIs allow manipulation of these RDDs in parallel
Architecture
SPARK SQL
•Used for working with structured
data
•Allows querying with SQL and HQL
(Hive QL)
•Data sources can be Hive tables,
Parquet, JSON, and others
•Allows intermixing SQL with
programmatic manipulation of
RDDs in Python, Scala, Java
Note: Shark was an older predecessor of Spark SQL, developed at UC Berkeley
SPARK Streaming
•Used for processing live streams of
data
•E.g., log files or message queues
•Can manipulate data stored on
disk or in-memory as it arrives in
real time
Streaming offers high throughput and is fault tolerant and scalable
Architecture
MLlib
•Provides machine learning (ML)
algorithms
•E.g., clustering, regression analysis,
classification, filtering, model
evaluation, data import
•Includes lower level ML primitives
like gradient descent
MLlib is a library whose methods scale out across a cluster
Architecture
GraphX
•Library for manipulating graphs
•Allows viewing data as graphs
called property graphs
•The Pregel API lets you create
custom iterative graph algorithms
Property graphs are immutable, fault
tolerant and distributed (just like RDDs)
Architecture
SparkR
•Support for R in Spark is more
recent (added in release 1.4)
•Allows data scientists working in R
to utilize Spark capabilities
Architecture
Spark Streaming
• Allows ingestion of data from a wide range of data sources
• Data processed by Spark can be stored in external systems or presented in
dashboards
Input sources: Kafka, Flume, HDFS, Twitter
Output sinks: databases, HDFS, dashboards
Spark Streaming
Input stream of data is divided into discrete chunks
Each chunk represents data collected during a brief period
and is processed individually
Input data stream → discrete sequence of RDDs (@ time 0, @ time 1, @ time 2, …) → Spark Engine → processed RDDs
SPARK Processing
Source: https://spark.apache.org/docs/latest/cluster-overview.html
SPARK Processing
The driver program accesses Spark through a SparkContext object.
SPARK Processing
The SparkContext represents a connection to a computing cluster.
Once created, it can be used to build RDDs.
SPARK Processing
Cluster Manager is an external service
•A default built-in cluster manager, the Standalone cluster manager, comes
pre-packaged with Spark
•Hadoop YARN and Apache Mesos are two popular cluster managers
•The driver asks the cluster manager for resources to launch executors
•The cluster manager launches executors, which the driver then uses to run tasks
SPARK Processing
Tasks are the smallest unit of physical execution
•The driver program implicitly creates a DAG (Directed Acyclic Graph) of
operations
•This DAG is converted to a physical execution plan
•The driver uses this plan to execute tasks via executors
on the worker nodes
SPARK Processing
Executors are processes that execute tasks
•Executors run the tasks and return results to the driver
•Also provide in-memory storage for RDDs
SPARK Use Cases
Spark Streaming Use Cases
ETL (Extract Transform Load)
•With Spark Streaming it is possible to run ETL on streaming data that is
continually cleaned and aggregated before moving it to data stores
•This differs from the traditional batch-based approach to ETL
•IoT data collected via sensors on devices can be continually collected,
cleaned and stored in data stores for analytics
Online Data Enrichment
•With Spark Streaming it is possible to combine historical data of online
customers with changes in their buying behavior and preferences to
present targeted advertisements in real time
SPARK Use Cases
Spark Streaming Use Cases
Trigger Event Detection
•Spark Streaming is being used to detect events and respond quickly to
them by raising alerts, e.g., fraudulent-transaction detection in banking
systems, or detecting changes in a patient's vital signs such as heartbeat
and blood pressure in a hospital
Session Analysis on the Web
•Spark Streaming can be used to analyze a user's online activity on a web
site and provide real-time recommendations, e.g., suggesting movies
to a user on Netflix
SPARK Use Cases
Machine Learning Use Cases
MLlib is used for common big data functions like customer segmentation
and sentiment analysis
Network Security: predictive intelligence can be used to inspect data
packets arriving over the network and detect threats before passing
them to the storage platform.
SPARK Use Cases
Business examples
•Uber uses Kafka, Spark Streaming and HDFS to analyze terabytes of
user data by collecting and converting it from unstructured event data
into structured data
•Pinterest uses an ETL pipeline to gain insights into how users all over
the world engage with Pins, helping them select products to buy or plan trips
to destinations
•Conviva uses Spark to optimize video streams and manage live video
traffic of over 4 million video feeds per month
References
Special thanks to the following authors and contributors for providing
valuable material used in this presentation:
Apache web site: spark.apache.org
Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy
Konwinski, Patrick Wendell and Matei Zaharia
Getting Started with Apache Spark by James A. Scott
Top Apache Spark Use Cases: https://www.qubole.com/blog/big-data/apache-spark-use-cases/
Introduction to Apache Spark by Databricks (download slides:
http://cdn.liber118.com/workshop/itas_workshop.pdf)
Thank You!
© Net Serpents LLC, USA
For queries / suggestions / feedback, please send an email to
info@hadoopexpress.com or shashi@netserpents.com
