© Copyright 2013 Pivotal. All rights reserved.
Intro to Apache Spark
A primer for POJGs
(Plain Old Java Geeks)
Scott Deeg: Sr. Field Engineer
sdeeg@gopivotal.com
Agenda
•  Intro: Agenda, it’s all about ME!, 10 seconds on Pivotal
•  What is Spark, and what does it have to do with BigData/Hadoop?
–  Ecosystem (Shark, Streaming, MLlib, GraphX)
•  Spark Programming Model
–  Demo: interactive shell
•  Related Projects
•  Spark 1.0
•  More Tech: WordCount, TicTacToe – dev experience, Java 8
•  Deployment Topologies
–  Simple Cluster Demo
Who Am I?
Just a Plain Old Java Guy
•  Java since 1996, Symantec Visual Café 1.0
•  Random consulting around Silicon Valley
•  Hacker on a Java-based BPM product for 10 years
•  Joined VMware in 2009 when it acquired SpringSource
•  Rolled into Pivotal April 1, 2013
What is Pivotal?
•  Cloud, Big Data, Fast Data, Modern Apps
•  Technology Bets
–  HDFS will be the way we talk to Enterprise data repositories
▪  Consolidate Silos in “Data Lake”
▪  Ecosystem of services will arise to utilize HDFS data
–  PaaS will manage the Application Life Cycle
–  OSS will be the basis for solutions
–  Cloud Architecture
▪  Distributed / Parallel
▪  CPU, Memory, Network … storage is a distributed service
[Figure: Pivotal Platform diagram. Labels: Data Sources; Application Platform; Stream Server; IMDG; ASF Services; MPP SQL; HDFS; SQL, Objects, JSON; GemFireXD; ...ETC; End Users; Developers; AppOps]
What Is Spark?
Hint: It’s all about the RDD
?
•  Is it “Big Data”?
•  Is it “Hadoop”?
•  It’s one of those “in memory” things, right?
•  JVM, Java, Scala
•  Is it real, or just another shiny technology with a long but ultimately small tail?
Spark is …
•  Distributed/Cluster Compute Execution Engine
–  Came out of the AMPLab project at UC Berkeley (UCB), now an ASF top-level project
•  Designed to work with data in memory
•  Scalability and fault tolerance similar to Hadoop Map/Reduce
–  Uses lineage to reconstitute data instead of replication
•  Generalization of Map/Reduce
–  Implementation of Resilient Distributed Dataset (RDD)
•  Programmatic or Interactive
•  Written in Scala
Spark is also …
•  An ASF top-level project
•  Has ~100 contributors across 25 companies
–  More active than Hadoop MapReduce
•  An ecosystem of domain-specific tools
–  Different models, but mostly interoperable
•  Hadoop Compatible
Berkeley Data Analytics Stack (BDAS)
Support:
•  Batch
•  Streaming
•  Interactive
Make it easy to compose them
Short History
•  2009 Started as research project at UCB
•  2010 Open Sourced
•  January 2011 AMPLab Created
•  October 2012 0.6
–  Java, standalone cluster, Maven
•  June 21 2013 Spark accepted into ASF Incubator
•  Feb 27 2014 Spark becomes top-level ASF project
•  May 30 2014 Spark 1.0
Spark Philosophy
•  Make life easy and productive for Data Scientists
•  Provide well-documented and expressive APIs
•  Powerful Domain Specific Libraries
•  Easy integration with storage systems
•  Caching to avoid data movement (performance)
•  Well-defined releases, stable API
Spark is not Hadoop, but is compatible
•  Often better than Hadoop (Eric Baldeschwieler)
–  M/R is fine for “Data Parallel”, but awkward for some workloads
–  Low latency dispatch, Iterative, Streaming
•  Natively accesses Hadoop data
•  Spark is just another YARN job
–  Maintains huge investment in data collection
–  Brings Spark to the Data
•  It’s not OR … it’s AND!
Improvements over Map/Reduce
•  Efficiency
–  General Execution Graphs (not just map->reduce->store)
–  In memory
•  Usability
–  Rich APIs in Scala, Java, Python
–  Interactive
•  Can Spark be the R for Big Data?
Spark Programming Model
RDDs in Detail
Core Concept
Think of a program as a set of transformations on a Distributed Dataset
Model: Resilient Distributed Dataset (RDD)
–  Read-only collection of objects spread across a cluster
–  RDDs are built through parallel transformations (map, filter, etc.)
–  Automatically rebuilt on failure using lineage
–  Controllable persistence (RAM, HDFS, etc.)
Operations
•  Create
–  From stable storage (HDFS)
•  Transform
–  Generate an RDD from another RDD (map, filter, groupBy)
–  Lazy operations that build a DAG
–  Once Spark knows your transformations it can build an efficient plan
•  Action
–  Return a result or write to storage (count, collect, reduce, save)
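The lazy-transformation versus action split has a close analogy in plain Java 8 streams, which may help a Java developer build intuition before touching Spark. The sketch below is not Spark code: `LazyPipelineSketch`, `errorLines`, and the sample log lines are made up for illustration. Intermediate stream operations play the role of transformations, and the terminal `collect` plays the role of an action.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-JDK analogy, NOT Spark: java.util.stream pipelines are lazy in the
// same spirit as RDD transformations. All names here are hypothetical.
final class LazyPipelineSketch {

    // "Create" + "Transform": describes the work but runs none of it.
    // filterCalls records how many times the filter function really executed.
    static Stream<String> errorLines(List<String> lines, AtomicInteger filterCalls) {
        return lines.stream()
                .filter(line -> {
                    filterCalls.incrementAndGet();
                    return line.contains("error");
                })
                .map(String::toUpperCase);
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("ok: started", "error: disk full", "error: timeout");
        AtomicInteger filterCalls = new AtomicInteger();

        Stream<String> pending = errorLines(log, filterCalls);
        System.out.println("filter calls before the action: " + filterCalls.get());

        // "Action": collecting the results forces the whole pipeline to run.
        List<String> errors = pending.collect(Collectors.toList());
        System.out.println("filter calls after the action: " + filterCalls.get());
        System.out.println(errors);
    }
}
```

The analogy stops at a single JVM: a stream can be consumed only once, whereas an RDD can be reused across many actions and is partitioned across a cluster.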
Demo: Log Mining
•  Scala shell
•  Load file from HDFS
•  Search for patterns
Transformations and Actions
•  Transformations
–  map
–  filter
–  flatMap
–  sample
–  groupByKey
–  reduceByKey
–  union
–  join
–  sort
•  Actions
–  count
–  collect
–  reduce
–  lookup
–  save
RDD Fault Tolerance
•  RDDs maintain lineage information that can be used to reconstruct lost partitions
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()
[Lineage diagram: HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD]
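To make the lineage idea concrete, here is a toy, single-JVM model in Java. It is not how Spark implements RDDs; `LineageSketch` and its methods are hypothetical names, and "losing a partition" is simulated by dropping a cached list. The point is only that each dataset remembers its parent and the function that derives it, so lost results can be recomputed instead of restored from a replica.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy model of lineage, NOT Spark's implementation. Every dataset keeps a
// reference to its parent plus the function used to derive it.
abstract class LineageSketch<T> {
    private List<T> cached; // null means "never materialized" or "lost"

    abstract List<T> compute();  // rebuild this dataset from its lineage
    abstract String lineage();   // human-readable chain of derivations

    List<T> collect() {
        if (cached == null) {
            cached = compute();
        }
        return cached;
    }

    // Simulates losing the in-memory copy of this dataset.
    void loseCache() {
        cached = null;
    }

    static <T> LineageSketch<T> source(String path, List<T> stableData) {
        return new LineageSketch<T>() {
            List<T> compute() { return new ArrayList<>(stableData); }
            String lineage() { return "Source(" + path + ")"; }
        };
    }

    LineageSketch<T> filter(String label, Predicate<T> keep) {
        LineageSketch<T> parent = this;
        return new LineageSketch<T>() {
            List<T> compute() {
                return parent.collect().stream().filter(keep).collect(Collectors.toList());
            }
            String lineage() { return parent.lineage() + " -> Filtered(" + label + ")"; }
        };
    }

    <R> LineageSketch<R> map(String label, Function<T, R> fn) {
        LineageSketch<T> parent = this;
        return new LineageSketch<R>() {
            List<R> compute() {
                return parent.collect().stream().map(fn).collect(Collectors.toList());
            }
            String lineage() { return parent.lineage() + " -> Mapped(" + label + ")"; }
        };
    }

    public static void main(String[] args) {
        // Sample tab-separated lines are invented for the example.
        LineageSketch<String> msgs = LineageSketch
                .source("hdfs://...", Arrays.asList(
                        "error\tdb\tdisk full", "info\tweb\tstarted", "error\tweb\ttimeout"))
                .filter("contains(error)", line -> line.contains("error"))
                .map("split(tab)(2)", line -> line.split("\t")[2]);

        System.out.println(msgs.lineage());
        System.out.println(msgs.collect());
        msgs.loseCache();                    // pretend the cached data vanished
        System.out.println(msgs.collect());  // recomputed from the parents
    }
}
```

In this model a recompute walks back only as far as the nearest ancestor that still holds its data, which mirrors the intent of the slide: replay the recorded derivation steps rather than keep extra copies.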
RDDs are Foundational
•  General purpose enough to implement other programming models
–  SQL
–  Graph
–  ML
–  MR
Related Projects
Things that run on Spark
Related Projects
•  Shark
•  Spark SQL
•  Spark Streaming
•  GraphX
•  MLbase
•  Others
Shark
•  Hive on Spark
–  HiveQL, UDFs, etc.
•  Turn SQL into RDD
–  Part of the lineage
•  Based on Hive, but takes advantage of Spark for
–  Fast Scheduling
–  Queries are DAGs of jobs, not chained M/R
–  Fast broadcast variables
© Apache Software Foundation
Shark (cont)
•  Optimized Columnar Storage format
•  Fast/Efficient Compression
–  From Yahoo!
–  Able to hold 3-20x more data in same cluster
•  Various other optimizations using partitioning
•  Will ultimately run on Spark SQL
–  No Hive dependencies except for accessing the Hive datastore
–  Long-running process with management tools
Spark SQL
•  Library in Spark Core to treat RDDs as relations
–  SchemaRDD
•  Lighter-weight version of Shark
–  No code from Hive
•  Import/Export in different storage formats
–  Parquet, learn schema from existing Hive warehouse
•  Takes columnar storage from Shark
Spark SQL Code
•  Go take a look
Spark Streaming
•  Extend Spark to do large-scale stream processing
–  100s of nodes and second-scale end-to-end latency
•  Stateful Processing
–  Hard to make fault tolerant (FT)
–  Storm: requires idempotent updates
•  Simple, batch-like API with RDDs
•  Single semantics for both real time and high latency
Streaming (cont)
•  Input is broken up into Batches that become RDDs
•  RDDs are composed into DAGs to generate output
•  Raw data is replicated in-memory for FT
Streaming (cont)
•  Other features
–  Window-based Transformations
–  Arbitrary join of streams
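The batching and windowing ideas can be sketched without any Spark dependency. The Java below is a rough illustration only: `MicroBatchSketch`, `toBatches`, and `windowedCounts` are hypothetical names, timestamps are plain milliseconds, and intervals with no events are simply skipped, which a real streaming engine would not do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-JDK illustration of micro-batching, NOT the Spark Streaming API.
final class MicroBatchSketch {

    // Discretize the input: bucket event timestamps (ms) into fixed-length
    // batches keyed by batch start time. Each bucket stands in for one RDD.
    static Map<Long, List<Long>> toBatches(List<Long> eventTimesMs, long batchMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : eventTimesMs) {
            long batchStart = (t / batchMs) * batchMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    // Window-based transformation: for each batch, count the events in that
    // batch plus the preceding (windowBatches - 1) non-empty batches.
    static List<Integer> windowedCounts(Map<Long, List<Long>> batches, int windowBatches) {
        List<Integer> sizes = new ArrayList<>();
        for (List<Long> batch : batches.values()) {
            sizes.add(batch.size());
        }
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < sizes.size(); i++) {
            int sum = 0;
            for (int j = Math.max(0, i - windowBatches + 1); j <= i; j++) {
                sum += sizes.get(j);
            }
            counts.add(sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(100L, 900L, 1200L, 1300L, 2500L);
        Map<Long, List<Long>> batches = toBatches(events, 1000L);
        System.out.println(batches);
        System.out.println(windowedCounts(batches, 2));
    }
}
```

With one-second batches, the five sample events fall into three batches of sizes 2, 2, and 1, so a two-batch window yields counts of 2, 4, and 3.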
GraphX (Alpha)
•  Graph processing
–  Replaces Spark Bagel
•  Graph Parallel not Data Parallel
–  Reason in the context of neighbors
–  GraphLab API
GraphX (cont)
•  Predicting things about people (e.g., political bias)
–  Look at posts, apply classifier, try to predict attribute
–  Local signal is difficult alone
–  Look at context of social network to improve prediction
•  Triangle processing
–  More triangles reveal greater community
•  Collaborative Filtering
–  Bipartite graph processing
–  What I like, who rated those things, what they like => what I may like
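As a small illustration of triangle processing, the sketch below counts triangles in an undirected graph on a single machine. It does not use GraphX; `TriangleSketch` and `countTriangles` are hypothetical names, and the input is assumed to contain no duplicate edges and no self-loops.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Single-machine illustration of triangle counting, NOT the GraphX API.
final class TriangleSketch {

    // Each edge is an int pair {u, v}; the graph is treated as undirected.
    // Assumes no duplicate edges and no self-loops.
    static int countTriangles(int[][] edges) {
        Map<Integer, Set<Integer>> neighbors = new HashMap<>();
        for (int[] e : edges) {
            neighbors.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            neighbors.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }

        int closedCorners = 0;
        for (int[] e : edges) {
            // Any vertex adjacent to both endpoints closes a triangle on this edge.
            Set<Integer> common = new HashSet<>(neighbors.get(e[0]));
            common.retainAll(neighbors.get(e[1]));
            closedCorners += common.size();
        }
        // Every triangle was found once per each of its three edges.
        return closedCorners / 3;
    }

    public static void main(String[] args) {
        int[][] edges = { {1, 2}, {2, 3}, {1, 3}, {3, 4} };
        System.out.println("triangles = " + countTriangles(edges));
    }
}
```

The sample graph has one triangle (1, 2, 3) and one dangling edge (3, 4). The neighbor-set intersection is the "reason in the context of neighbors" step that graph-parallel systems distribute across a cluster.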
GraphX (cont)
•  Graph Creation => Algorithm => Post Processing
–  Existing systems mainly deal with the Algorithm step and are not interactive
–  Unify collection and graph models
•  Graphs have
–  Vertices, edges
–  Transformation: reverse, filter, map
–  Joins: graphs and tables
–  Aggregate Neighbors
MLbase
•  Machine Learning toolset
–  Library and higher-level abstractions
•  General tool is MATLAB
–  Difficult for end users to learn, debug, scale solutions
•  Starting with MLlib
–  Low-level Distributed Machine Learning Library
•  Many different Algorithms
–  Classification, Regression, Collaborative Filtering, etc.
Others
•  Mesos
–  Enable multiple frameworks to share same cluster resources
–  Twitter is largest user: Over 6,000 servers
•  Tachyon
–  In-memory, fault-tolerant file system that exposes HDFS
•  Catalyst
–  SQL Query Optimizer
Spark 1.0
Release cycle
•  1.0 came out at end of May
•  1.X expected to be current for several years
•  Quarterly release cycle
–  2 mo dev / 1 mo QA
–  Actual release is based on vote
•  1.1 due end of August
1.0
•  API Stability in 1.X for all non-Alpha projects
–  Can recompile jobs, but hoping for binary compatibility
–  Internal APIs are marked @DeveloperApi or @Experimental
•  Focus: Core Engine, Streaming, MLlib, Spark SQL
•  History Server for Spark UI
–  Driving development of instrumentation
•  Job Submission Tool
–  Don’t configure Context in code (e.g., master)
1.0
•  Java 8 Lambdas
–  No more writing closures as Classes
–  Functions are interfaces
–  Return type sensitive functions
▪  mapToPair
•  Python improvements
1.0
•  Hadoop security
–  Kerberos, ACL for UI
•  Job cancel from UI
•  Distributed GC as things go out of scope
–  Good for long-lived services
•  Spark SQL
More Code and Demos
WordCount, TicTacToe, Java8
Code Review: WordCount
•  Java API
•  Java Code
•  More usage of RDDs
TicTacToe: a developer’s experience
•  IDE
•  Spring
•  Building/Logging
•  Debugging
Demo: Java 8
Lambda Lambda Lambda
Deployment Topologies
Topologies
•  Local
•  Spark Cluster (master/slaves)
•  Cluster Resource Managers
–  YARN
–  Mesos
•  (PaaS?)
Demo:
•  Start master and slaves
•  Show the UI
•  Run a Job
•  Talk about the History Server
This
And That
How Real is Spark?
•  There is some criticism
–  As expected
–  New project!
•  There are many indicators that Spark is heading to success
–  Solid technology
–  Good buzz
–  Significant community
Next Steps
•  Spark website: http://spark.apache.org
–  Lots’O’Goodstuff
•  Spark Summit June 30/July 01
–  http://spark-summit.org
A NEW PLATFORM FOR A NEW ERA

Spark For Plain Old Java Geeks (June2014 Meetup)
