In Memory Analytics - Apache Spark
Ravi
Agenda
 Overview of Spark
 Spark with Hadoop MapReduce
 Spark Elements and Operations
 Spark Cluster Overview
 Spark Examples
 Spark Stack Extensions:
 Shark
 Streaming
 MLlib
 GraphX
In Memory Analytics
• In-memory analytics is an approach to querying data when it resides in a
computer’s random access memory (RAM), as opposed to querying data
that is stored on physical disks.
• This results in vastly shortened query response times, allowing business
intelligence (BI) and analytic applications to support faster business
decisions.
• As the cost of RAM declines, in-memory analytics is becoming feasible
for many businesses.
• BI and analytic applications have long supported caching data in RAM, but
older 32-bit operating systems provided only 4 GB of addressable memory.
• Newer 64-bit operating systems, with up to 1 terabyte (TB) of addressable memory (and perhaps more in the future), have made it possible to cache large volumes of data, potentially an entire data warehouse or data mart, in a computer’s RAM.
What is Spark
- Lightning-Fast Cluster Computing
 Not a modified version of Hadoop
 Separate, fast, MapReduce-like engine
 In-memory data storage for very fast iterative queries
 General execution graphs and powerful optimizations
 Up to 40x faster than Hadoop
 Spark beats Hadoop by providing primitives for in-memory cluster computing, thereby avoiding the I/O bottleneck between the individual jobs of an iterative MapReduce workflow that repeatedly performs computations on the same working set.
 Compatible with Hadoop’s storage APIs
 Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
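To make the in-memory idea above concrete, here is a minimal sketch in Scala (the HDFS path and app name are placeholders; "local[4]" simply means four local threads). The error lines are read once, cached in cluster memory, and then queried repeatedly at RAM speed instead of being re-read from disk:

import org.apache.spark.SparkContext

object LogMining {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "LogMining")
    // Load once, keep only the error lines, and pin them in memory
    val errors = sc.textFile("hdfs://.../logs").filter(_.contains("ERROR")).cache()
    println(errors.count())                                // first action materializes the cache
    println(errors.filter(_.contains("timeout")).count())  // later queries hit RAM, not disk
    sc.stop()
  }
}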
Quick Recap: Hadoop Ecosystem
Spark Programming Model
 Key idea: Resilient Distributed Dataset (RDD)
 Distributed collections of objects that can be cached in memory across cluster nodes
 Manipulated through various parallel operations
 Automatically rebuilt on failures
 Types of RDD:
 Parallelized collections: Take an existing Scala collection and run functions on it in
parallel
 scala> val distData = sc.parallelize(data)
 distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
 Hadoop datasets : Run functions on each record of a file in Hadoop distributed file
system or any other storage system supported by Hadoop
 scala> val distFile = sc.textFile("data.txt")
 distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
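As a brief sketch (the numbers and file name are illustrative), both kinds of RDD then support the same parallel operations, and either can be cached:

val distData = sc.parallelize(1 to 10000)        // parallelized collection
println(distData.reduce(_ + _))                  // parallel sum across the cluster

val distFile = sc.textFile("data.txt").cache()   // Hadoop dataset, cached in memory
println(distFile.map(_.length).reduce(_ + _))    // total characters; lost partitions are rebuilt from lineage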
Automatic Parallelization of Complex Flows
 When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as Apache Oozie is often required to carefully construct this sequence.
 With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so that the system has a complete picture of the execution graph.
 This approach allows the core scheduler to correctly map the dependencies across different stages in the application, and automatically parallelize the flow of operators without user intervention.
For example, consider the following (schematic) job:
rdd1.map(splitlines).filter("ERROR")
rdd2.map(splitlines).groupBy(key)
rdd2.join(rdd1, key).take(10)
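A runnable rendering of that schematic job might look like the sketch below (the paths, the tab-separated format, and the splitLines helper are assumptions for illustration). Nothing executes until take(10); the entire graph is built lazily first, which is what gives the scheduler its complete picture:

import org.apache.spark.SparkContext._   // pair-RDD operations such as join

// assumed helper: split a log line into a (key, rest-of-line) pair
def splitLines(line: String): (String, String) = {
  val parts = line.split("\t", 2)
  (parts(0), if (parts.length > 1) parts(1) else "")
}

val rdd1 = sc.textFile("hdfs://.../log1").map(splitLines).filter(_._2.contains("ERROR"))
val rdd2 = sc.textFile("hdfs://.../log2").map(splitLines)
rdd2.join(rdd1).take(10).foreach(println)   // one action; Spark schedules all stages together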
Spark vs Hadoop
Spark is a high-speed cluster computing system, compatible with Hadoop, that can outperform it by up to 100x thanks to its ability to perform computations in memory.
Transformations (e.g. map, filter, groupBy):
Create a new dataset from an existing one
Actions (e.g. count, collect, save):
Return a value to the driver program after running a computation on the dataset
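A small sketch of the distinction (file names are placeholders): the transformations below only build up lineage, and work happens when an action runs.

import org.apache.spark.SparkContext._

val lines  = sc.textFile("data.txt")       // nothing computed yet
val counts = lines.flatMap(_.split(" "))   // transformation: new RDD
                  .map(word => (word, 1))  // transformation
                  .reduceByKey(_ + _)      // transformation
println(counts.count())                    // action: triggers the whole computation
counts.saveAsTextFile("counts_out")        // action: writes the results out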
Spark Elements
 Application: User program built on Spark. Consists of a driver program and executors on the cluster.
 Driver program: The process running the main() function of the application and creating the SparkContext.
 Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
 Worker node: Any node that can run application code in the cluster.
 Executor: A process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
 Task: A unit of work that will be sent to one executor.
 Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
 Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
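These elements show up even in the smallest application; a sketch (the master URL, app name, and path are placeholders):

import org.apache.spark.SparkContext

object MyApp {                                      // the application
  def main(args: Array[String]) {
    // the driver program: runs main() and creates the SparkContext, which
    // asks the cluster manager for executors on the worker nodes
    val sc = new SparkContext("spark://master:7077", "MyApp")
    val n = sc.textFile("hdfs://.../input").count() // one action => one job, divided into stages of tasks
    println(n)
    sc.stop()
  }
}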
Spark Cluster Overview
Cluster Manager Types
• Standalone – a simple cluster manager included with Spark that makes it
easy to set up a cluster.
• Apache Mesos – a general cluster manager that can also run Hadoop
MapReduce and service applications.
• Hadoop YARN – the resource manager in Hadoop 2.
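The cluster manager is selected through the master URL handed to the SparkContext; a sketch (host names are placeholders, and an application uses exactly one of these):

import org.apache.spark.SparkContext

val sc = new SparkContext("spark://master:7077", "app")  // Standalone
// "mesos://master:5050"  -> Apache Mesos
// "yarn-client"          -> Hadoop YARN (cluster located via the Hadoop config)
// "local[4]"             -> local testing on four threads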
Mesos (Dynamic Resource Sharing for Clusters) Run Modes
 Spark can run over Mesos in two modes: “fine-grained” and “coarse-grained”.
 In fine-grained mode, which is the default, each Spark task runs as a separate Mesos task.
 This allows multiple instances of Spark (and other frameworks) to share machines at
a very fine granularity, where each application gets more or fewer machines as it
ramps up, but it comes with an additional overhead in launching each task, which
may be inappropriate for low-latency applications (e.g. interactive queries or serving
web requests).
 Coarse-grained mode will instead launch only one long-running Spark task on each Mesos machine, and dynamically schedule its own “mini-tasks” within it.
 The benefit is much lower startup overhead, but at the cost of reserving the Mesos
resources for the complete duration of the application.
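In Spark releases of this era the mode is selected with the spark.mesos.coarse configuration property; a minimal sketch (the Mesos master URL is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://master:5050")
  .setAppName("app")
  .set("spark.mesos.coarse", "true")  // coarse-grained: one long-running Spark task per machine;
                                      // leave unset (the default) for fine-grained mode
val sc = new SparkContext(conf)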
Task Scheduler
• Runs general DAGs
• Pipelines functions within a stage
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
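A sketch of what the cache- and partitioning-awareness buys (the input format and the parseKeyValue helper are assumptions for illustration):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

// assumed helper: parse a "key,value" line
def parseKeyValue(line: String): (String, Int) = {
  val Array(k, v) = line.split(",", 2)
  (k, v.trim.toInt)
}

val pairs = sc.textFile("hdfs://.../kv").map(parseKeyValue)
              .partitionBy(new HashPartitioner(8))  // fix the partitioning once...
              .cache()                              // ...and keep the partitions in memory
val sums = pairs.reduceByKey(_ + _)   // no shuffle: the existing partitioner already matches
val self = pairs.join(pairs)          // co-partitioned join also avoids a shuffle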
Spark Stack Extensions
Spark powers a stack of high-level tools, including:
 Shark for SQL
 MLlib for machine learning
 GraphX for graph processing
 Spark Streaming
You can combine these frameworks seamlessly in the same application.
Shark
Shark makes Hive faster and more powerful.
 Shark is a new data analysis system that marries query
processing with complex analytics on large clusters
 Shark is an open source distributed SQL query engine for
Hadoop data. It brings state-of-the-art performance and
advanced analytics to Hive users.
 Speed: Run Hive queries up to 100x faster in memory, or 10x on disk.
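A sketch of the mix of SQL and Spark code this enabled, assuming Shark's documented SharkContext API of the time (sql2rdd; treat the exact names as an assumption):

// assumes a shark.SharkContext named sharkCtx, created much like a SparkContext
sharkCtx.sql("CREATE TABLE logs (level STRING, msg STRING)")
val errors = sharkCtx.sql2rdd("SELECT msg FROM logs WHERE level = 'ERROR'")
println(errors.count())   // the query result is an ordinary RDD, so SQL mixes with Spark code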
Streaming
Spark Streaming makes it easy to build scalable fault-tolerant
streaming applications.
 Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs.
 It supports both Java and Scala.
 Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
 Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ.
 Since Spark Streaming is built on top of Spark, users can apply Spark's built-in machine learning (MLlib) and graph processing (GraphX) algorithms on data streams.
Counting tweets on a sliding window:
TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))

Find words with higher frequency than historic data:
stream.join(historicCounts).filter {
  case (word, (curCount, oldCount)) => curCount > oldCount
}
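The snippets above are fragments; a self-contained sketch of the same windowing pattern over a plain socket source (host, port, and window lengths are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))    // 5-second batches

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(5))
counts.print()                                      // 30-second window, sliding every 5 seconds

ssc.start()
ssc.awaitTermination()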
MLlib
MLlib is Apache Spark's scalable machine learning library.
 MLlib fits into Spark's APIs and interoperates with NumPy in
Python (starting in Spark 0.9). You can use any Hadoop data
source (e.g. HDFS, HBase, or local files), making it easy to plug
into Hadoop workflows.
Calling MLlib in Python (schematic):
points = spark.textFile("hdfs://...")
              .map(parsePoint)
model = KMeans.train(points)
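A corresponding Scala sketch, assuming the MLlib API of Spark 1.x (k, the iteration count, and the space-separated input format are illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs://.../points")
               .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
               .cache()
val model = KMeans.train(points, 10, 20)   // k = 10 clusters, 20 iterations
model.clusterCenters.foreach(println)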
GraphX
Unifying Graphs and Tables
 GraphX extends Spark's distributed, fault-tolerant collections API and interactive console with a new graph API that leverages recent advances in graph systems (e.g., GraphLab) to let users easily and interactively build, transform, and reason about graph-structured data at scale.
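A sketch of that unified view (the vertices and edges are invented for illustration):

import org.apache.spark.SparkContext._
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph    = Graph(vertices, edges)            // a graph built from two plain collections

val ranks = graph.pageRank(0.001).vertices       // graph computation yields a table...
ranks.join(vertices).collect().foreach(println)  // ...that joins back to the vertex names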
BDAS, the Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/
BDAS is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.
Software and Research Projects
 Shark - Hive and SQL on top of Spark
 MLbase - Machine Learning project on top of Spark
 BlinkDB - a massively parallel, approximate query engine built on top of Shark and Spark
 GraphX - a graph processing & analytics framework on top of Spark (GraphX has been merged into
Spark 0.9)
 Apache Mesos - Cluster management system that supports running Spark
 Tachyon - In memory storage system that supports running Spark
 Apache MRQL - A query processing and optimization system for large-scale, distributed data
analysis, built on top of Apache Hadoop, Hama, and Spark
 OpenDL - A deep learning algorithm library based on the Spark framework; newly started
 SparkR - R frontend for Spark
 Spark Job Server - REST interface for managing and submitting Spark jobs on the same cluster
Conclusion
 “Big data” is moving beyond one-pass batch jobs, to low-latency apps that need data sharing
 RDDs offer fault-tolerant sharing at memory speed
 Spark uses them to combine streaming, batch & interactive analytics in one system