Spark For Plain Old Java Geeks (June2014 Meetup)

© Copyright 2013 Pivotal. All rights reserved.
Intro to Apache Spark
A primer for POJGs
(Plain Old Java Geeks)
Scott Deeg: Sr. Field Engineer
sdeeg@gopivotal.com

Agenda
•  Intro: Agenda, it’s all about ME!, 10 seconds on Pivotal
•  What is Spark, and what does it have to do with Big Data/Hadoop?
–  Ecosystem (Shark, Streaming, MLlib, GraphX)
•  Spark Programming Model
–  Demo: interactive shell
•  Related Projects
•  Spark 1.0
•  More Tech: WordCount, TicTacToe – dev experience, Java 8
•  Deployment Topologies
–  Simple Cluster Demo

Who Am I?
Just a Plain Old Java Guy
•  Java since 1996, Symantec Visual Café 1.0
•  Random consulting around Silicon Valley
•  Hacker on a Java-based BPM product for 10 years
•  Joined VMware in 2009 when they acquired SpringSource
•  Rolled into Pivotal on April 1, 2013

What is Pivotal?
•  Cloud, Big Data, Fast Data, Modern Apps
•  Technology Bets
–  HDFS will be the way we talk to Enterprise data repositories
▪  Consolidate silos in a “Data Lake”
▪  An ecosystem of services will arise to utilize HDFS data
–  PaaS will manage the Application Life Cycle
–  OSS will be the basis for solutions
–  Cloud Architecture
▪  Distributed / Parallel
▪  CPU, Memory, Network … storage is a distributed service

[Figure: Pivotal Platform diagram. Labels: Data Sources; Application Platform; Stream Server; IMDG; ASF Services; MPP SQL; HDFS; SQL, Objects, JSON; GemFireXD; ...ETC; End Users, Developers, AppOps]

What Is Spark?
Hint: It’s all about the RDD

?
•  Is it “Big Data”?
•  Is it “Hadoop”?
•  It’s one of those “in memory” things, right?
•  JVM, Java, Scala?
•  Is it real, or just another shiny technology with a long but ultimately small tail?

Spark is …
•  Distributed/Cluster Compute Execution Engine
–  Came out of an AMPLab project at UC Berkeley (UCB), now an ASF top-level project
•  Designed to work with data in memory
•  Scalability and fault tolerance similar to Hadoop Map/Reduce
–  Uses lineage to reconstruct lost data instead of replication
•  Generalization of Map/Reduce
–  Implementation of Resilient Distributed Dataset (RDD)
•  Programmatic or Interactive
•  Written in Scala

Spark is also …
•  An ASF top-level project
•  ~100 contributors across 25 companies
–  More active than Hadoop MapReduce
•  An ecosystem of domain-specific tools
–  Different models, but mostly interoperable
•  Hadoop compatible

Berkeley Data Analytics Stack (BDAS)
Supports
•  Batch
•  Streaming
•  Interactive
Makes it easy to compose them

Short History
•  2009: Started as a research project at UCB
•  2010: Open sourced
•  January 2011: AMPLab created
•  October 2012: 0.6
–  Java, standalone cluster, Maven
•  June 21, 2013: Spark accepted into the ASF Incubator
•  Feb 27, 2014: Spark becomes a top-level ASF project
•  May 30, 2014: Spark 1.0

Spark Philosophy
•  Make life easy and productive for Data Scientists
•  Provide well-documented and expressive APIs
•  Powerful domain-specific libraries
•  Easy integration with storage systems
•  Caching to avoid data movement (performance)
•  Well-defined releases, stable API

Spark is not Hadoop, but is compatible
•  Often better than Hadoop (Eric Baldeschwieler)
–  M/R is fine for “Data Parallel”, but awkward for some workloads
–  Low-latency dispatch, Iterative, Streaming
•  Natively accesses Hadoop data
•  Spark is just another YARN job
–  Maintains the huge investment in data collection
–  Brings Spark to the Data
•  It’s not OR … it’s AND!

Improvements over Map/Reduce
•  Efficiency
–  General Execution Graphs (not just map -> reduce -> store)
–  In memory
•  Usability
–  Rich APIs in Scala, Java, Python
–  Interactive
•  Can Spark be the R for Big Data?

Spark Programming Model
RDDs in Detail

Core Concept
Think of a program as a set of transformations on a Distributed Dataset
Model: Resilient Distributed Dataset (RDD)
–  Read-only collection of objects spread across a cluster
–  RDDs are built through parallel transformations (map, filter, etc.)
–  Automatically rebuilt on failure using lineage
–  Controllable persistence (RAM, HDFS, etc.)

Operations
•  Create
–  From stable storage (HDFS)
•  Transform
–  Generate an RDD from another RDD (map, filter, groupBy)
–  Lazy operations that build a DAG
–  Once Spark knows your transformations it can build an efficient plan
•  Action
–  Return a result or write to storage (count, collect, reduce, save)
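
The transform/action split above can be sketched in plain Java. This is an analogy built on java.util.stream from the JDK, not the Spark API: intermediate stream operations are lazy in roughly the way RDD transformations are, and a terminal operation plays the role of an action. The class name LazyPipelineSketch, the counter, and the sample log lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyPipelineSketch {
    // Made-up sample data standing in for a file loaded from stable storage.
    static final List<String> LINES = Arrays.asList(
            "INFO start", "ERROR disk full", "INFO ok", "ERROR timeout");

    // Counts how often the filter function really runs.
    static final AtomicInteger filterCalls = new AtomicInteger();

    // Transform: describes the pipeline; nothing is evaluated here.
    static Stream<String> errors() {
        return LINES.stream()
                .filter(line -> {
                    filterCalls.incrementAndGet();
                    return line.startsWith("ERROR ");
                })
                .map(line -> line.substring("ERROR ".length()));
    }

    public static void main(String[] args) {
        Stream<String> pipeline = errors();
        System.out.println("filter calls before the action: " + filterCalls.get());

        // Action: the terminal operation pulls data through the pipeline.
        List<String> result = pipeline.collect(Collectors.toList());
        System.out.println("filter calls after the action: " + filterCalls.get());
        System.out.println(result);
    }
}
```

The counter stays at zero until collect runs, which is the point of the analogy: building the pipeline only records what to do. Unlike an RDD, a Java stream is single use and is not distributed.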

Demo: Log Mining
•  Scala shell
•  Load file from HDFS
•  Search for patterns

Transformations and Actions
•  Transformations
–  map
–  filter
–  flatMap
–  sample
–  groupByKey
–  reduceByKey
–  union
–  join
–  sort
•  Actions
–  count
–  collect
–  reduce
–  lookup
–  save

RDD Fault Tolerance
•  RDDs maintain lineage information that can be used to reconstruct lost partitions
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()
Lineage: HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD
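
A minimal plain-Java sketch of the lineage idea, assuming a single JVM and no real cluster. ToyDataset, LineageSketch, losePartition, and the sample lines are all hypothetical names made up here; none of them are Spark classes. Each derived dataset only remembers its parent and the function applied, so a lost cached partition can be rebuilt by replaying that chain from the source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Toy dataset that knows how to derive each partition from its parent. */
abstract class ToyDataset<T> {
    private final Map<Integer, List<T>> cached = new HashMap<>();
    private boolean cacheEnabled = false;
    int computeCount = 0; // how many times a partition was (re)computed here

    /** Lineage step: how to produce partition p. */
    abstract List<T> compute(int p);

    ToyDataset<T> cache() { cacheEnabled = true; return this; }

    List<T> partition(int p) {
        if (cacheEnabled && cached.containsKey(p)) return cached.get(p);
        computeCount++;
        List<T> result = compute(p); // replay the lineage
        if (cacheEnabled) cached.put(p, result);
        return result;
    }

    /** Simulates losing a cached partition, e.g. when a node fails. */
    void losePartition(int p) { cached.remove(p); }

    static <T> ToyDataset<T> fromSource(List<List<T>> partitions) {
        return new ToyDataset<T>() {
            List<T> compute(int p) { return partitions.get(p); }
        };
    }

    ToyDataset<T> filter(Predicate<T> pred) {
        ToyDataset<T> parent = this;
        return new ToyDataset<T>() {
            List<T> compute(int p) {
                return parent.partition(p).stream().filter(pred).collect(Collectors.toList());
            }
        };
    }

    <R> ToyDataset<R> map(Function<T, R> f) {
        ToyDataset<T> parent = this;
        return new ToyDataset<R>() {
            List<R> compute(int p) {
                return parent.partition(p).stream().map(f).collect(Collectors.toList());
            }
        };
    }
}

class LineageSketch {
    /** Builds the same shape of pipeline as the slide: filter, map, cache. */
    static ToyDataset<String> build() {
        List<List<String>> stored = Arrays.asList(
                Arrays.asList("error\tdb\tdisk full", "info\tweb\tok"),
                Arrays.asList("error\tweb\ttimeout"));
        return ToyDataset.fromSource(stored)
                .filter(line -> line.contains("error"))
                .map(line -> line.split("\t")[2])
                .cache();
    }

    public static void main(String[] args) {
        ToyDataset<String> msgs = build();
        System.out.println(msgs.partition(0)); // computed, then cached
        msgs.losePartition(0);                 // simulate a failure
        System.out.println(msgs.partition(0)); // rebuilt from lineage
    }
}
```

Only the lost partition is recomputed; the other partition and the source data are untouched, which is why lineage can replace replication of derived data.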

RDDs are Foundational
•  General purpose enough to implement other programming models
–  SQL
–  Graph
–  ML
–  MR

Related Projects
Things that run on Spark

Related Projects
•  Shark
•  Spark SQL
•  Spark Streaming
•  GraphX
•  MLbase
•  Others

Shark
•  Hive on Spark
–  HiveQL, UDFs, etc.
•  Turn SQL into RDDs
–  Part of the lineage
•  Based on Hive, but takes advantage of Spark for
–  Fast scheduling
–  Queries are DAGs of jobs, not chained M/R
–  Fast broadcast variables
© Apache Software Foundation

Shark (cont)
•  Optimized columnar storage format
•  Fast/efficient compression
–  From Yahoo!
–  Able to hold 3-20x more data in the same cluster
•  Various other optimizations using partitioning
•  Will ultimately run on Spark SQL
–  No Hive dependencies except for accessing the Hive datastore
–  Long-running process with management tools

Spark SQL
•  Library in Spark Core to treat RDDs as relations
–  SchemaRDD
•  Lighter-weight version of Shark
–  No code from Hive
•  Import/export in different storage formats
–  Parquet, learn schema from an existing Hive warehouse
•  Takes columnar storage from Shark

Spark SQL Code
•  Go take a look

Spark Streaming
•  Extend Spark to do large-scale stream processing
–  100s of nodes and second-scale end-to-end latency
•  Stateful processing
–  Hard to make fault tolerant (FT)
–  Storm: requires idempotent updates
•  Simple, batch-like API with RDDs
•  Single semantics for both real-time and high-latency processing

Streaming (cont)
•  Input is broken up into batches that become RDDs
•  RDDs are composed into DAGs to generate output
•  Raw data is replicated in memory for FT

Streaming (cont)
•  Other features
–  Window-based transformations
–  Arbitrary join of streams
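
A window-based transformation over micro-batches can be sketched in plain Java. This is not the Spark Streaming API; WindowSketch and windowedCounts are hypothetical names, and each inner list simply stands in for one batch of input. For every batch, the sketch reports the number of records seen in the last `window` batches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WindowSketch {
    /** For each batch, counts the records in a sliding window ending at that batch. */
    static List<Integer> windowedCounts(List<List<String>> batches, int window) {
        List<Integer> out = new ArrayList<>();
        for (int end = 0; end < batches.size(); end++) {
            int start = Math.max(0, end - window + 1); // window is shorter at the start
            int total = 0;
            for (List<String> batch : batches.subList(start, end + 1)) {
                total += batch.size();
            }
            out.add(total);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.<String>asList(),
                Arrays.asList("d", "e", "f"));
        System.out.println(windowedCounts(batches, 2));
    }
}
```

Because every batch is just a small collection, the window is an ordinary computation over the most recent few collections, which is what lets a batch-style API cover streaming.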

GraphX (Alpha)
•  Graph processing
–  Replaces Spark Bagel
•  Graph Parallel, not Data Parallel
–  Reason in the context of neighbors
–  GraphLab API

GraphX (cont)
•  Predicting things about people (e.g. political bias)
–  Look at posts, apply a classifier, try to predict an attribute
–  The local signal alone is hard to use
–  Look at the context of the social network to improve the prediction
•  Triangle processing
–  More triangles indicate a tighter community
•  Collaborative Filtering
–  Bipartite graph processing
–  What I like, who rated those things, what they like => what I may like

GraphX (cont)
•  Graph Creation => Algorithm => Post Processing
–  Existing systems mainly deal with the Algorithm step and are not interactive
–  Unify collection and graph models
•  Graphs have
–  Vertices, edges
–  Transformations: reverse, filter, map
–  Joins: graphs and tables
–  Aggregate Neighbors

MLbase
•  Machine Learning toolset
–  Library and higher-level abstractions
•  The common general-purpose tool is MATLAB
–  Difficult for end users to learn, debug, and scale solutions
•  Starting with MLlib
–  Low-level distributed Machine Learning library
•  Many different algorithms
–  Classification, Regression, Collaborative Filtering, etc.

Others
•  Mesos
–  Enables multiple frameworks to share the same cluster resources
–  Twitter is the largest user: over 6,000 servers
•  Tachyon
–  In-memory, fault-tolerant file system that exposes HDFS
•  Catalyst
–  SQL query optimizer

Spark 1.0

Release cycle
•  1.0 came out at the end of May
•  1.X expected to be current for several years
•  Quarterly release cycle
–  2 months dev / 1 month QA
–  Actual release is based on a vote
•  1.1 due at the end of August

1.0
•  API stability in 1.X for all non-Alpha projects
–  Can recompile jobs, but hoping for binary compatibility
–  Internal APIs are marked @DeveloperApi or @Experimental
•  Focus: Core Engine, Streaming, MLlib, Spark SQL
•  History Server for Spark UI
–  Driving development of instrumentation
•  Job Submission Tool
–  Don’t configure the Context in code (e.g. master)

1.0
•  Java 8 Lambdas
–  No more writing closures as classes
–  Functions are interfaces
–  Return-type-sensitive functions
▪  mapToPair
•  Python improvements

1.0
•  Hadoop security
–  Kerberos, ACL for UI
•  Job cancel from UI
•  Distributed GC as things go out of scope
–  Good for long-lived services
•  Spark SQL

More Code and Demos
WordCount, TicTacToe, Java8

Code Review: WordCount
•  Java API
•  Java Code
•  More usage of RDDs

TicTacToe: a developer's experience
•  IDE
•  Spring
•  Building/Logging
•  Debugging

Demo: Java 8
Lambda Lambda Lambda

Deployment Topologies

Topologies
•  Local
•  Spark Cluster (master/slaves)
•  Cluster Resource Managers
–  YARN
–  Mesos
•  (PaaS?)

Demo:
•  Start master and slaves
•  Show the UI
•  Run a job
•  Talk about the History Server

This
And That

How Real is Spark?
•  There is some criticism
–  As expected
–  New project!
•  There are many indicators that Spark is heading for success
–  Solid technology
–  Good buzz
–  Significant community

Next Steps
•  Spark website: http://spark.apache.org
–  Lots’O’Goodstuff
•  Spark Summit, June 30/July 1
–  http://spark-summit.org

A NEW PLATFORM FOR A NEW ERA
