taotao.li@datayes.com
03/11/2016
Introducing Spark
Copyright © 2014 DataYes. All rights reserved
Agenda
1. Spark! When, What, Why?
2. Basic Concepts in Spark
3. Programming Model in Spark
4. Demo & Next
5. Q & A
Spark! When, What, Why?
● 2009~2010
○ 2009: Spark is born at AMPLab, UC Berkeley
○ 2010: open-sourced
● 2013: enters the Apache Incubator
● 2014: becomes a top-level Apache project
● New stage: more than an open source project
Spark! When, What, Why?
From the official site: Apache Spark™ is a fast and general engine for large-scale data processing.
Key points:
● A framework
● Built for large-scale data processing
● Generalizes the programming model for data processing [more than MapReduce]
● Provides high-level APIs: Scala, Python, R, Java
● Armed to the teeth: SQL, Streaming, Machine Learning, GraphX
● Compatible with the existing ecosystem: Hadoop, Mesos, HDFS, Cassandra, HBase, S3 …
Spark! When, What, Why?
● General
● Fast to develop with
○ REPL exploration
○ RDD operations
○ Less code
● Fast in processing
● Compatible
● Built-in packages and third-party packages
● Memory keeps getting cheaper
● Companies that have adopted Spark
Figure: DDR4-3000 288-pin DIMM 4x4GB price trend
Basic Concepts in Spark
● Driver, Master, Worker, Executor
● Application
● SparkContext, i.e. sc
● RDD
● Transformations & Actions on RDDs
Need more? See: 『Spark』 2. Spark Basic Concepts Explained (spark 基本概念解析)
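To make the transformation/action distinction concrete, here is a minimal single-machine sketch in plain Python. It is not Spark's implementation or API: the `MiniRDD` class and its internals are hypothetical names invented for illustration. It only mimics the idea that transformations (`map`, `filter`) record work lazily, while actions (`collect`, `count`) actually run it.

```python
class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source   # any re-iterable collection, e.g. a list
        self._steps = steps     # recorded transformations, not yet executed

    # Transformations: return a new MiniRDD and compute nothing.
    def map(self, f):
        return MiniRDD(self._source, self._steps + (("map", f),))

    def filter(self, pred):
        return MiniRDD(self._source, self._steps + (("filter", pred),))

    # Helper: replay the recorded steps over the source data.
    def _run(self):
        data = iter(self._source)
        for kind, f in self._steps:
            data = map(f, data) if kind == "map" else filter(f, data)
        return data

    # Actions: execute the recorded steps and return a plain Python value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element the mapped function actually touches


def square(x):
    calls.append(x)
    return x * x


rdd = MiniRDD([1, 2, 3, 4, 5])
evens_squared = rdd.map(square).filter(lambda v: v % 2 == 0)

calls_before_action = len(calls)  # nothing has executed yet
result = evens_squared.collect()  # the action triggers the whole pipeline
```

In this sketch every action replays the recorded steps from the source, which is why caching an intermediate result matters when the same dataset is reused by several actions.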
Programming Model in Spark
Three basic steps to build a Spark application
● Load the dataset
○ static dataset
○ dynamic dataset
● Process
○ RDD operations
○ UDFs
○ Cache
● Output / display
○ collect
○ store in a database, file system ...
Demo & Next
● Wrap Spark for Uqer use cases
● Try Tungsten
● DataFrames & Datasets
● SQL & MLlib & Streaming
● Third-party package wrappers [sklearn, pandas, numpy ... etc.]
Demo & Next
● Monte Carlo in Spark
● Spark in finance: index similarity calculation
● Spark in finance: distributed strategy backtesting
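The slides do not include the Monte Carlo demo itself, so the following is only a sketch of the general pattern, written in plain Python so it runs without a cluster. It estimates π by splitting the sampling into independent partitions (a map step) and summing the partial counts (a reduce step). The function names, partition count, and sample sizes are assumptions chosen for illustration; in PySpark the same shape would usually be written with `sc.parallelize(...).map(...).reduce(...)`.

```python
import random
from functools import reduce


def count_hits(seed, samples):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # one independent generator per partition
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(partitions=8, samples_per_partition=50_000):
    # "map" step: each partition is processed independently of the others;
    # on a cluster these would run as separate tasks.
    partial_counts = map(
        lambda seed: count_hits(seed, samples_per_partition),
        range(partitions),
    )
    # "reduce" step: combine the per-partition counts into one total.
    total_hits = reduce(lambda a, b: a + b, partial_counts)
    return 4.0 * total_hits / (partitions * samples_per_partition)


pi_estimate = estimate_pi()
```

Seeding each partition separately keeps the run reproducible and avoids every partition drawing the same random sequence, which is a common pitfall when the sampling function is shipped to many workers.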
Demo, Demo, Demo
Q & A
Thank you
