A big data processing tool built with Scala that runs on the JVM
Introduction to Spark
ADB 2017
Yen Hao Huang
1
Big Data
● 4Vs
○ Volume/Variety/Velocity/Veracity
Due to the rise of Big Data, faster tools are required for
processing data.
2
Hadoop
3
Hadoop
● A platform to store and process large-scale data
● Features
○ Scalable
○ Economical : many cheap servers
○ Flexible : schema-less
○ Reliable : replicas
4
Hadoop MapReduce
● Map
- Divide a job into many small tasks and distribute them to servers
● Reduce
- Summarize the results from those servers
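The Map and Reduce steps above can be sketched in plain Python as a word count. This is a single-process illustration and does not use the Hadoop API: the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions made for this example, and each input string stands in for a chunk that a real cluster would send to a separate server.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(key, values):
    """Reduce task: summarize all values that share one key."""
    return key, sum(values)

def word_count(chunks):
    # Each chunk would be handled by a different server in a real cluster.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["spark runs on the jvm", "hadoop runs on many servers"])
print(counts["runs"])  # 2
```

The shuffle step is not named on the slide, but some grouping by key is needed before the results can be summarized per key.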
5
Hadoop MapReduce
6
Figure Reference
Hadoop - Bottleneck
● File I/O - intermediate data is written to disk
[Figure: Read → Iteration → Write → Read → Iteration → Write]
7
Spark
8
RDD
In-memory computation framework
9
RDD (Resilient Distributed Dataset)
● Intermediate data is written to memory
● 10 to 100 times faster than Hadoop
[Figure: Read → Iteration → Memory Write → Memory Read → Iteration → Write, with the RDD held in memory between iterations]
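The difference between the two approaches can be illustrated with a small plain-Python sketch. It is not Spark or Hadoop code: `step` is a hypothetical iteration that adds 1 to every element, `iterate_via_disk` round-trips the intermediate data through a file on every iteration, and `iterate_in_memory` keeps it in a list. Both produce the same result; only the I/O pattern differs.

```python
import json
import os
import tempfile

def step(values):
    """One hypothetical iteration: add 1 to every element."""
    return [v + 1 for v in values]

def iterate_via_disk(values, iterations, workdir):
    """Every iteration reads its input from disk and writes its output back."""
    path = os.path.join(workdir, "intermediate.json")
    with open(path, "w") as f:
        json.dump(values, f)
    for _ in range(iterations):
        with open(path) as f:
            current = json.load(f)       # read from disk
        with open(path, "w") as f:
            json.dump(step(current), f)  # write to disk
    with open(path) as f:
        return json.load(f)

def iterate_in_memory(values, iterations):
    """Intermediate data stays in memory between iterations."""
    current = list(values)
    for _ in range(iterations):
        current = step(current)
    return current

with tempfile.TemporaryDirectory() as workdir:
    disk_result = iterate_via_disk([1, 2, 3], 3, workdir)
memory_result = iterate_in_memory([1, 2, 3], 3)
print(disk_result, memory_result)  # [4, 5, 6] [4, 5, 6]
```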
10
Spark
● Features
○ Speed
○ Ease of use : Scala, Python, Java, R
○ Supports Hadoop : HDFS, MapReduce
○ Accessibility : runs on many platforms
11
RDD Features
● Computations
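○ Transformation - lazily evaluated; nothing is computed until an action runs
○ Action - executes the pending computations
○ Persistence - keeps an RDD in RAM or on disk
[Figure: a Transformation leads from an RDD back to an RDD; an Action leads from an RDD to Output]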
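The three kinds of computation can be mimicked with a small plain-Python class. This is a sketch of the idea only, not Spark's RDD implementation: `LazyDataset` and `traced_square` are hypothetical names for this example. The method names `map`, `filter`, `persist`, and `collect` match RDD methods, but their behavior here is simplified.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)
        self._persist = False
        self._cache = None

    def map(self, fn):
        # Transformation: only recorded, nothing is computed yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        # Transformation: only recorded, nothing is computed yet.
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def persist(self):
        # Persistence: keep the result in memory after the first action.
        self._persist = True
        return self

    def collect(self):
        # Action: run every recorded transformation and return the result.
        if self._cache is not None:
            return list(self._cache)
        data = list(self._source)
        for kind, fn in self._ops:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._persist:
            self._cache = data
        return list(data)

calls = []

def traced_square(x):
    calls.append(x)  # record every real computation
    return x * x

squares = LazyDataset([1, 2, 3, 4]).map(traced_square).filter(lambda x: x > 4).persist()
calls_before_action = len(calls)  # 0: transformations are lazy
first = squares.collect()         # runs the pipeline
second = squares.collect()        # served from the persisted result
print(calls_before_action, first, second, len(calls))  # 0 [9, 16] [9, 16] 4
```

Without the `persist()` call, the second `collect()` would run `traced_square` four more times.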
12
RDD Lineage
● Error fixing
[Figure: RDD1 → Transformation → RDD2 → Action; values shown: [ 2, 3 ], [ ?, ? ], [ 7, 10 ]; f(x) = x2 + 1]
13
[Figure: RDD1 and RDD2 with steps 1 and 2 marked; [ 7, 10 ]; "Fix!"]
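Lineage-based recovery can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark's mechanism in detail: `LineageNode`, `lose`, and `times_ten` are hypothetical names, and the function and values are chosen for this example rather than taken from the figure. Each dataset remembers its parent and the function that produced it, so lost data can be recomputed instead of restored from a replica.

```python
class LineageNode:
    """Toy dataset that remembers how it was derived from its parent."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent  # the dataset this one was derived from
        self.fn = fn          # the transformation applied to the parent
        self.data = data      # materialized contents, or None if missing

    def compute(self):
        if self.data is None:
            if self.parent is None:
                raise RuntimeError("source data is lost and has no lineage")
            # Walk back through the lineage and replay the transformation.
            self.data = [self.fn(x) for x in self.parent.compute()]
        return self.data

    def lose(self):
        """Simulate a failure that drops the materialized contents."""
        self.data = None

def times_ten(x):
    # Hypothetical transformation used only for this example.
    return x * 10

rdd1 = LineageNode(data=[2, 3])
rdd2 = LineageNode(parent=rdd1, fn=times_ten)

before = rdd2.compute()     # [20, 30]
rdd2.lose()                 # contents are gone, lineage is still known
recovered = rdd2.compute()  # recomputed from rdd1
print(before, recovered)    # [20, 30] [20, 30]
```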
Spark Functionality
14
