www.edureka.co/r-for-analytics
www.edureka.co/apache-spark-scala-training
5 Reasons Why Spark Is in Demand!
Slide 2
Agenda
By the end of this webinar, you will know about:
• Reason #1: Low latency
• Reason #2: Streaming support
• Reason #3: Machine learning and graph
• Reason #4: Introduction to the DataFrame API
• Reason #5: Spark integration with Hadoop
Slide 3
Spark Architecture
• Machine learning library
• Graph programming
• Spark interface for RDBMS lovers
• Utility for continuous ingestion of data
Slide 4
Low Latency
Slide 5
Spark Cuts Down Read/Write I/O to Disk
Spark tries to keep data in the memory of its distributed workers, which allows significantly faster, lower-latency computation, whereas MapReduce writes intermediate results to disk and reads them back between steps.
Spark works well both for data that fits in memory and for data that does not.
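The contrast between re-reading from disk and reusing in-memory data can be sketched in plain Python. This is an illustration only, not Spark code: `read_from_disk` is a hypothetical stand-in for an expensive storage read, and the counter simply records how often it is called by an iterative job.

```python
# Plain-Python illustration (not the Spark API): count simulated disk reads
# for an iterative job with and without keeping the dataset in memory.

disk_reads = 0  # incremented on every simulated storage read


def read_from_disk():
    """Hypothetical stand-in for an expensive read from storage."""
    global disk_reads
    disk_reads += 1
    return list(range(1, 6))


def run_without_cache(iterations):
    """Re-read the input on every pass, as a disk-bound pipeline would."""
    total = 0
    for _ in range(iterations):
        total += sum(read_from_disk())
    return total


def run_with_cache(iterations):
    """Read once, keep the data in memory, and reuse it on every pass."""
    cached = read_from_disk()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


disk_reads = 0
result_no_cache = run_without_cache(10)
reads_no_cache = disk_reads

disk_reads = 0
result_cache = run_with_cache(10)
reads_cache = disk_reads

print(result_no_cache == result_cache)  # both runs compute the same total
print(reads_no_cache, reads_cache)      # reads needed by each approach
```

Both runs produce the same result; only the number of simulated reads differs, which is the effect the slide attributes to keeping data in memory.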
Slide 6
How Fast a System Can Sort 100 TB of Data
The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2,100 nodes.
Using 206 EC2 nodes, Spark completed the benchmark in 23 minutes.
Spark sorted the same data 3x faster using 10x fewer machines.
All the sorting took place on disk (HDFS), without using Spark's in-memory cache.
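The "3x faster, 10x fewer machines" summary can be checked against the figures quoted on this slide. The short calculation below uses only those four numbers; the variable names are illustrative.

```python
# Figures quoted on the slide.
hadoop_minutes, hadoop_nodes = 72, 2100
spark_minutes, spark_nodes = 23, 206

speedup = hadoop_minutes / spark_minutes      # ratio of run times
node_reduction = hadoop_nodes / spark_nodes   # ratio of cluster sizes

print(f"{speedup:.1f}x faster on {node_reduction:.1f}x fewer machines")
```

The ratios come out slightly above 3 and 10, so the rounded claim on the slide is consistent with its own numbers.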
Slide 7
Spark's Benchmark
2014, 4.27 TB/min
100 TB in 1,406 seconds
207 Amazon EC2 i2.8xlarge nodes x (32 vCores - 2.5 GHz Intel Xeon E5-2670 v2, 244 GB memory, 8 x 800 GB SSD)
Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia
Courtesy: sortbenchmark.org/
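The throughput figure follows directly from the data size and elapsed time listed above. A quick check, using only the numbers on this slide:

```python
data_tb = 100            # data sorted, in terabytes
elapsed_seconds = 1406   # run time reported for the benchmark

elapsed_minutes = elapsed_seconds / 60
tb_per_minute = data_tb / elapsed_minutes

print(f"{elapsed_minutes:.1f} minutes, {tb_per_minute:.2f} TB/min")
```

The elapsed time of roughly 23.4 minutes also matches the "23 minutes" quoted for the same run.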
Slide 8
Streaming Support
Slide 9
Event Processing
Spark Streaming is used for processing real-time streaming data.
It uses the DStream, a series of RDDs, to process real-time data.
It supports streaming analytics reasonably well.
The Spark Streaming API closely matches that of Spark Core.
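The idea of a stream handled as a series of small batches can be sketched in plain Python. This is a conceptual illustration, not the Spark Streaming API: the helper names and the event data are hypothetical, and each inner list stands in for one RDD in the series.

```python
from collections import defaultdict


def to_micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    # Return the batches in time order, one list per interval that saw data.
    return [batches[key] for key in sorted(batches)]


def word_count(batch):
    """Ordinary batch logic, reused unchanged on every micro-batch."""
    counts = defaultdict(int)
    for word in batch:
        counts[word] += 1
    return dict(counts)


# Hypothetical event stream: (arrival time in seconds, word).
events = [(0.2, "spark"), (0.9, "hadoop"), (1.1, "spark"),
          (1.7, "spark"), (2.5, "yarn")]

batch_results = [word_count(b) for b in to_micro_batches(events, batch_seconds=1)]
print(batch_results)
```

Because `word_count` is ordinary batch code applied to each interval, the sketch also hints at why a streaming API can closely mirror the core batch API.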
Slide 10
Machine Learning and Graph Implementation with DAG
Slide 11
Machine Learning
MLlib, a machine learning library
Classification, regression, clustering, collaborative filtering, and so on
Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering.
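To see how a least-squares fit can work on streaming data at all, consider the sketch below. It is a plain-Python illustration under stated assumptions, not MLlib's implementation: the class name `StreamingOLS` is hypothetical, it handles a single input feature, and it keeps running sums so that each arriving batch updates the fit without storing past data.

```python
class StreamingOLS:
    """Simple linear regression y = slope * x + intercept, updated per batch."""

    def __init__(self):
        self.n = 0
        self.sum_x = self.sum_y = self.sum_xx = self.sum_xy = 0.0

    def update(self, batch):
        """Fold one batch of (x, y) pairs into the running sums."""
        for x, y in batch:
            self.n += 1
            self.sum_x += x
            self.sum_y += y
            self.sum_xx += x * x
            self.sum_xy += x * y

    def coefficients(self):
        """Closed-form least-squares fit from the sums seen so far.

        Requires at least two distinct x values, otherwise denom is zero.
        """
        denom = self.n * self.sum_xx - self.sum_x ** 2
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope, intercept


model = StreamingOLS()
# Two batches drawn from y = 2x + 1 (hypothetical data).
model.update([(0, 1), (1, 3)])
model.update([(2, 5), (3, 7)])
slope, intercept = model.coefficients()
print(slope, intercept)
```

The model state is a handful of sums, so its size does not grow with the stream, which is the property that makes this kind of algorithm usable on continuous data.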
Slide 12
Cyclic Data Flows
• All jobs in Spark comprise a series of operators and run on a set of data.
• All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
• The DAG is optimized by rearranging and combining operators where possible.
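Combining operators can be illustrated with a small plain-Python sketch. It is not Spark's scheduler: the operator format (`"map"` or `"filter"` paired with a function) and the helper names are assumptions made for this example. It compares running each operator as its own pass with running a chain of operators as one combined pass.

```python
def run_naive(data, operators):
    """Apply each operator in its own full pass over the data."""
    passes = 0
    for kind, fn in operators:
        if kind == "map":
            data = [fn(x) for x in data]
        else:  # "filter"
            data = [x for x in data if fn(x)]
        passes += 1
    return data, passes


def fuse(operators):
    """Combine a chain of map/filter operators into one per-element function."""
    def fused(x):
        for kind, fn in operators:
            if kind == "map":
                x = fn(x)
            elif not fn(x):
                return False, None  # element dropped by a filter
        return True, x
    return fused


def run_fused(data, operators):
    """Single pass over the data using the combined operator."""
    fused = fuse(operators)
    out = []
    for x in data:
        keep, value = fused(x)
        if keep:
            out.append(value)
    return out, 1


job = [("map", lambda x: x * 2),
       ("filter", lambda x: x % 3 == 0),
       ("map", lambda x: x + 1)]

naive_result, naive_passes = run_naive(range(1, 10), job)
fused_result, fused_passes = run_fused(range(1, 10), job)
print(naive_result == fused_result, naive_passes, fused_passes)
```

The results are identical, but the fused version touches the data once and builds no intermediate lists, which is the kind of saving that combining operators in a DAG is meant to deliver.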
Slide 13
GraphX
Component for graphs and graph-parallel computation
Extends the Spark RDD by introducing a new Graph abstraction
Graph algorithms: PageRank, Connected Components, Triangle Counting
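As an example of the kind of algorithm listed here, the following is a minimal single-machine PageRank in plain Python. It does not use GraphX: the function signature, damping factor, iteration count, and the three-page graph are all assumptions chosen for illustration, and the sketch assumes every node has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict mapping each node to its outgoing links."""
    nodes = list(links)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the baseline "random jump" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node, targets in links.items():
            # A node splits its damped rank evenly among the pages it links to.
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph; every page has at least one outgoing link.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

A graph-parallel system distributes the same per-node update across a cluster; the sketch only shows the update rule itself.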
Slide 14
Support for Data Frames
Slide 15
DataFrame
As Spark continues to grow, it aims to let wider audiences beyond “big data” engineers leverage the power of distributed processing.
Inspired by data frames in R and Python (pandas).
The DataFrame API is designed to make big data processing on tabular data easier.
A DataFrame is a distributed collection of data organized into named columns.
It provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.
It can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
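The three operations named above (filter, group, aggregate) over data with named columns can be sketched in plain Python. This is a local stand-in for the semantics only, not the Spark DataFrame API; the column names and sample rows are hypothetical. In Spark the same steps would be expressed through the DataFrame API or Spark SQL and executed across a cluster.

```python
from collections import defaultdict

# Rows with named columns (hypothetical sample data).
rows = [
    {"dept": "eng", "name": "asha", "salary": 120},
    {"dept": "eng", "name": "ravi", "salary": 100},
    {"dept": "ops", "name": "lena", "salary": 90},
    {"dept": "ops", "name": "omar", "salary": 70},
]

# Filter: keep rows where salary is at least 90.
filtered = [row for row in rows if row["salary"] >= 90]

# Group by a named column.
groups = defaultdict(list)
for row in filtered:
    groups[row["dept"]].append(row["salary"])

# Aggregate: average salary per department.
avg_salary = {dept: sum(values) / len(values) for dept, values in groups.items()}
print(avg_salary)
```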
Slide 16
DataFrame Features
Ability to scale from KBs to PBs
Support for a wide array of data formats and storage systems
State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
Seamless integration with all big data tooling and infrastructure via Spark
APIs for Python, Java, Scala, and R (in development via SparkR)
Slide 17
Spark can use HDFS
Spark can use YARN
Slide 18
Spark Execution Platforms
Spark can leverage YARN, the resource negotiator of the Hadoop framework.
Spark workloads can make use of Symphony scheduling policies and execute via YARN.
Spark execution modes: Standalone, Mesos, YARN
Slide 19
Spark Features/Modules in Demand
Source: Typesafe
Slide 20
New Features in 2015
Data Frames
• Similar API to data frames in R and pandas
• Automatically optimised via Spark SQL
• Released in Spark 1.3
SparkR
• Released in Spark 1.4
• Exposes DataFrames, RDDs, and the ML library in R
Machine Learning Pipelines
• High-level API
• Featurization
• Evaluation
• Model tuning
External Data Sources
• Platform API to plug data sources into Spark
• Pushes logic into sources
Source: Databricks
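"Pushes logic into sources" can be pictured with a small plain-Python sketch. It is not Spark's data source API: the `InMemorySource` class, its `scan` method, and the sample rows are hypothetical. The point is only that a source which applies the filter itself hands fewer rows to the engine.

```python
class InMemorySource:
    """Hypothetical data source that can apply a filter before returning rows."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_returned = 0  # how many rows left the source

    def scan(self, predicate=None):
        """Return all rows, or only matching rows when a predicate is pushed down."""
        result = [r for r in self.rows if predicate is None or predicate(r)]
        self.rows_returned += len(result)
        return result


rows = [{"id": i, "country": "IN" if i % 4 == 0 else "US"} for i in range(100)]


def is_india(row):
    return row["country"] == "IN"


# Without pushdown: the engine pulls every row, then filters.
source_a = InMemorySource(rows)
without_pushdown = [r for r in source_a.scan() if is_india(r)]

# With pushdown: the source filters first and hands over fewer rows.
source_b = InMemorySource(rows)
with_pushdown = source_b.scan(predicate=is_india)

print(source_a.rows_returned, source_b.rows_returned)
```

Both paths yield the same matching rows; the difference is how much data has to leave the source to get there.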
Slide 21
Spark overview
Slide 22
Questions



Editor's Notes

  • #16: You can show a hands-on demo with DataFrames, for example: load data into Spark DataFrames, then explore the data with Spark SQL. Here is a reference for the hands-on: https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data#.VdxJofmqqko
  • #21: http://www.information-management.com/gallery/Big-Data-Hadoop-2015-Predictions-Forrester-10026357-1.html https://www.forrester.com/Predictions+2015+Hadoop+Will+Become+A+Cornerstone+Of+Your+Business+Technology+Agenda/fulltext/-/E-RES117705