B I G D A T A W O R K G R O U P . I R
WHAT IS SPARK
Apache Spark is an open source big data
processing framework built around speed, ease
of use, and sophisticated analytics. It was
originally developed in 2009 at UC Berkeley’s
AMPLab, open sourced in 2010, and later
donated to the Apache Software Foundation.
WHAT IS SPARK
Advantages: In Memory
Spark enables applications in Hadoop clusters to run up to 100
times faster in memory, and up to 10 times faster even when running
on disk, compared with Hadoop MapReduce.
WHAT IS SPARK
Advantages: Generic API
Spark lets you quickly write applications in Java, Scala, or
Python. It comes with a built-in set of over 80 high-level
operators, and you can use it interactively to query data from
the shell.
WHAT IS SPARK
Advantages: Many Applications
Spark gives us a comprehensive, unified framework for managing
big data processing requirements across data sets that are
diverse in nature (text data, graph data, etc.) and in
source (batch vs. real-time streaming data).
WHAT IS SPARK
Advantages: Many Applications
In addition to Map and Reduce operations, Spark supports SQL
queries, streaming data, machine learning, and graph data
processing. Developers can use these capabilities stand-alone
or combine them in a single data pipeline.
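As a rough illustration of the Map and Reduce operations mentioned above, the following plain-Python sketch counts words in two steps. It does not use Spark; the sample `lines` and the helper names `map_line` and `merge_counts` are assumptions made for this example.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: in Spark this would be a distributed dataset of lines.
lines = ["spark is fast", "spark is general"]


def map_line(line):
    """Map step: turn one line into a Counter of its words."""
    return Counter(line.split())


def merge_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


# Apply the map step to every line, then fold the partial counts together.
word_counts = reduce(merge_counts, map(map_line, lines), Counter())
print(dict(word_counts))
```

The map step works on each line independently, which is what lets a framework spread it across machines; only the reduce step needs to combine results.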
HADOOP AND SPARK
Hadoop:
MapReduce is suitable for one-pass computations.
Clusters are hard to set up and manage.
Needs to integrate with Mahout (machine learning) and Storm
(streaming data processing).
Spark:
Supports multi-step data pipelines using the directed acyclic graph (DAG)
pattern.
Supports in-memory data sharing across DAGs.
Serves as an alternative to Hadoop MapReduce.
SPARK FEATURES
Less expensive shuffles in data processing, with capabilities like
in-memory data storage.
Lazy evaluation of big data queries, which helps with optimization of the
steps in data processing workflows.
Higher-level API to improve developer productivity, and a consistent
architecture model for big data solutions.
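The lazy evaluation described above can be sketched in plain Python. This is a toy model, not Spark code: `LazyDataset` and its methods are hypothetical names chosen to mirror the idea that transformations are only recorded and an action triggers the actual work.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark class)."""

    def __init__(self, source, steps=()):
        self._source = source  # original data
        self._steps = steps    # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: only records the step and returns a new dataset.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also recorded, nothing is computed here.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: runs all recorded steps in a single pass over the data.
        result = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result


calls = []  # records when the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


pipeline = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has executed yet
output = pipeline.collect()       # the action triggers the work
print(calls_before_action, output)
```

Because the whole chain is known before anything runs, the toy `collect` can apply the map and the filter in one pass instead of materializing an intermediate list, which is one simple example of the kind of optimization that laziness makes possible.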
SPARK FEATURES
Spark holds intermediate results in memory rather than writing them to
disk.
Spark can also process datasets that are larger than the aggregate
memory in a cluster, spilling data to disk when it does not fit in memory.
SPARK ECOSYSTEM
Spark Streaming
Micro-batch style of computing and processing (DStream).
Spark SQL
JDBC API, SQL-like queries, ETL.
Spark MLlib
Includes classification, regression, clustering, collaborative filtering,
dimensionality reduction, as well as underlying optimization primitives.
SPARK ECOSYSTEM
Spark GraphX
GraphX extends the Spark RDD by introducing the
Resilient Distributed Property Graph.
It provides a set of fundamental operators (e.g., subgraph,
joinVertices, and aggregateMessages).
SPARK ECOSYSTEM
BlinkDB
Trades off query accuracy for response time.
Tachyon
Caches working set files in memory.
Spark Cassandra Connector
Accesses data stored in a Cassandra database.
SparkR
SPARK ARCHITECTURE
RESILIENT DISTRIBUTED DATASETS
RDDs are fault tolerant because an RDD knows how to recreate and recompute its
datasets.
RDDs are immutable.
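A minimal sketch of these two properties, in plain Python rather than Spark: each derived dataset remembers its parent and the function that produced it (its lineage), so a lost result can be rebuilt. `ToyRDD`, `lose_cache`, and the sample data are hypothetical names used only for illustration.

```python
class ToyRDD:
    """Toy immutable dataset that remembers its lineage (not Spark's RDD)."""

    def __init__(self, parent=None, fn=None, data=None):
        self._parent = parent  # dataset this one was derived from
        self._fn = fn          # function applied to each parent element
        self._data = data      # tuple for source data, None for derived
        self._cache = None     # computed result, which may be lost

    def map(self, fn):
        # Returns a new dataset; the existing one is never modified.
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        if self._data is not None:
            return self._data
        if self._cache is None:
            # Rebuild from the parent by replaying the recorded function.
            self._cache = tuple(self._fn(x) for x in self._parent.compute())
        return self._cache

    def lose_cache(self):
        # Simulates losing the computed data, e.g. when a node fails.
        self._cache = None


base = ToyRDD(data=(1, 2, 3))
squares = base.map(lambda x: x * x)
first = squares.compute()
squares.lose_cache()           # the computed result disappears
recovered = squares.compute()  # recomputed from the lineage
print(first, recovered, base.compute())
```

Immutability is what makes the replay safe in this sketch: since `base` never changes, recomputing `squares` always gives the same result.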
RDD OPERATIONS
HOW TO RUN SPARK
HOW TO INTERACT WITH SPARK
spark-shell.cmd
SPARK WEB CONSOLE
http://localhost:4040
SHARED VARIABLES
Broadcast Variables
Accumulators
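The two kinds of shared variables can be pictured with a small plain-Python model. In Spark, a broadcast variable is a read-only value made available to every task, and an accumulator is a variable that tasks can only add to while the driver reads the total. The sketch below imitates that with threads; `ToyAccumulator`, `country_names`, and `task` are hypothetical names and none of this is Spark's API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType


class ToyAccumulator:
    """Add-only counter: tasks add to it, the driver reads the total."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value


# "Broadcast" value: a read-only lookup table shared by every task.
country_names = MappingProxyType({"ir": "Iran", "de": "Germany"})
unknown_codes = ToyAccumulator()


def task(code):
    # Tasks read the shared table and report misses through the accumulator.
    name = country_names.get(code)
    if name is None:
        unknown_codes.add(1)
    return name


with ThreadPoolExecutor(max_workers=2) as pool:
    names = list(pool.map(task, ["ir", "xx", "de", "yy"]))
print(names, unknown_codes.value)
```

The split mirrors the intent of the two features: large read-only data is shipped once and shared, while counters and sums flow back from the tasks in one direction only.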
SPARK ECOSYSTEM
Spark SQL
JDBC API, SQL-like queries, ETL.
SPARK ECOSYSTEM
Spark Streaming
Micro-batch style of computing and processing (DStream).
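The micro-batch idea can be sketched in a few lines of plain Python: instead of handling each event as it arrives, the stream is cut into small batches and each batch is processed as a unit. This sketch batches by count for simplicity, whereas Spark Streaming batches by time interval; `micro_batches` and the sample `events` are assumptions for illustration, not Spark's API.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an incoming stream into small fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical event stream; each batch is processed as a small batch job.
events = [3, 1, 4, 1, 5, 9, 2]
batch_sums = [sum(batch) for batch in micro_batches(events, batch_size=3)]
print(batch_sums)
```

Treating a stream as a sequence of small batches lets the same batch-style operations be reused for streaming data, at the cost of a latency of roughly one batch.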

Big data processing with apache spark part1
