Getting Started Running Apache Spark on Apache Mesos
2014-01-24

Paco Nathan 

liber118.com/pxn

@pacoid
Spark on Mesos, 2014-01-24	


• what is Apache Mesos?	

• launch a Mesos cluster in the cloud	

• configure and run Spark on Mesos	

• run jobs in Spark	

• further resources…
Datacenter Computing

Google has been doing datacenter computing for years, to address the complexities of large-scale data workflows:

• leveraging the modern kernel: isolation in lieu of VMs
• “most (>80%) jobs are batch jobs, but the majority of resources (55–80%) are allocated to service jobs”
• mixed workloads, multi-tenancy
• among the top 10 Linux kernel OSS contributors: cgroups
• relatively high utilization rates
• JVM? not so much…

take-aways: scheduling batch is not so difficult; scheduling services is hard+expensive
Google describes the business case…	

Taming Latency Variability

Jeff Dean

plus.google.com/u/0/+ResearchatGoogle/posts/C1dPhQhcDRv
“Return of the Borg”

Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/googleborg-twitter-mesos

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso, Urs Hölzle
research.google.com/pubs/pub35290.html

2011 GAFS Omega
John Wilkes, et al.
youtu.be/0ZFMlO98Jkc
Google describes the technology…	

Omega: flexible, scalable schedulers for large compute clusters	

Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes	

eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
Mesos – open source datacenter computing

a common substrate for cluster computing
mesos.apache.org

heterogeneous assets in your datacenter or cloud made available as a homogeneous set of resources

• top-level Apache project
• scalability to 10,000s of nodes
• obviates the need for virtual machines
• isolation (pluggable) for CPU, RAM, I/O, FS, etc.
• fault-tolerant leader election based on ZooKeeper
• APIs in C++, Java, Python, Go
• web UI for inspecting cluster state
• available for Linux, OpenSolaris, Mac OS X
Mesos – architecture

[diagram: layered stack, top to bottom]
Workloads: services, batch
Apps and Frameworks: Scalding, MPI, Impala, Hadoop, Shark, Spark, MySQL, Kafka, JBoss, Django, Rails, Chronos, Storm, Marathon
Kernel: distributed resources: CPU, RAM, I/O, FS, rack locality, etc.
DFS: distributed file system
Cluster
Mesos – architecture

apps: HA services, web apps, batch jobs, scripts, etc.
frameworks: Spark, Storm, MPI, Jenkins, etc.
task schedulers: Chronos, etc.
meta-frameworks: Aurora, Marathon
APIs: C++, JVM, Py, Go
Mesos, distrib kernel
HDFS, distrib file system
Linux: libcgroup, libprocess, libev, etc.
Mesos – dynamics

[diagram labels]
scheduled apps
HA services
distrib frameworks
Marathon: distrib init.d
Mesos: distrib kernel
Chronos: distrib cron
Mesos – dynamics

[diagram labels]
distributed framework: Scheduler, Executor (×3)
distributed kernel: Mesos master, Mesos slave (×3)
available resources
resource offers
Production Deployments (public)
Case Study: Twitter (bare metal / on premises)

“Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It’s one of the primary keys to our data center efficiency.”
Chris Fry, SVP Engineering

blog.twitter.com/2013/mesos-graduates-from-apache-incubation
wired.com/gadgetlab/2013/11/qa-with-chris-fry/

• key services run in production: analytics, typeahead, ads
• Twitter engineers rely on Mesos to build all new services
• instead of thinking about static machines, engineers think about resources like CPU, memory and disk
• allows services to scale and leverage a shared pool of servers across datacenters efficiently
• reduces the time between prototyping and launching
Spark on Mesos: launch a Mesos cluster in the cloud
http://elastic.mesosphere.io

launch a Mesos cluster in the Amazon AWS cloud in three simple steps, given:

• AWS credentials
• SSH public key
• email address
Spark on Mesos: configure and run Spark on Mesos
http://mesosphere.io/learn/run-spark-on-mesos/

configure and run Spark on a Mesos cluster on AWS, in a seven-step tutorial…
step 1: ssh to master
ssh -l ubuntu <master>
step 2: install git, jdk-7
sudo aptitude -y install git
sudo aptitude -y install openjdk-7-jdk
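Before moving on to the download and build steps, it can help to confirm that the tools they rely on are actually on the PATH. The following is a minimal sketch, not part of the original tutorial: `missing_cmds` is a hypothetical helper name, and the list of commands checked (git, java, javac, wget, tar) is an assumption based on what the later steps invoke.

```shell
#!/bin/sh
# Hypothetical helper (not from the tutorial): print the name of every
# command in the argument list that cannot be found on PATH.
# Prints nothing when all of them are available.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumed tool list: git and a JDK from step 2, plus wget and tar for step 3.
missing_cmds git java javac wget tar
```

If the last line prints anything, install the listed tools before continuing.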
step 3: download spark
wget http://spark-project.org/download/spark-0.8.0-incubating-bin-cdh4.tgz
tar xzf spark-0.8.0-incubating-bin-cdh4.tgz
cd spark-0.8.0-incubating-bin-cdh4/
step 4: sbt clean assembly
SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.4.0 sbt/sbt clean assembly
step 5: make distro, cp to HDFS
./make-distribution.sh --hadoop 2.0.0-mr1-cdh4.4.0
mv dist spark-0.8.0-2.0.0-mr1-cdh4.4.0
tar czf spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz spark-0.8.0-2.0.0-mr1-cdh4.4.0

hadoop fs -mkdir /tmp
hadoop fs -put spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz /tmp
step 6: config env
cd conf/
cp spark-env.sh.template spark-env.sh
vim spark-env.sh

# add these lines to spark-env.sh, replacing the <nn> and <master> placeholders:
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://<nn>/tmp/spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz
export MASTER=zk://<master>:2181/mesos

cat spark-env.sh
cd ..

./spark-shell
et voilà!
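The three settings from step 6 can also be generated rather than typed into an editor, which makes the `<nn>` and `<master>` placeholders explicit. This is only a sketch under stated assumptions: `spark_env_exports` is a hypothetical function name, `<nn>` is taken to mean the HDFS namenode address, and the hostnames in the example call are made up. The library path, tarball name, and ZooKeeper port are the ones shown in the slides.

```shell
#!/bin/sh
# Hypothetical helper (not from the tutorial): print the three export lines
# from step 6 on stdout.
#   $1: HDFS namenode address (assumed meaning of the <nn> placeholder)
#   $2: ZooKeeper host used by Mesos (the <master> placeholder)
#   $3: path to libmesos.so (optional; defaults to the path in the slides)
spark_env_exports() {
    nn=$1
    zk=$2
    lib=${3:-/usr/local/lib/libmesos.so}
    tgz=spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz
    printf 'export MESOS_NATIVE_LIBRARY=%s\n' "$lib"
    printf 'export SPARK_EXECUTOR_URI=hdfs://%s/tmp/%s\n' "$nn" "$tgz"
    printf 'export MASTER=zk://%s:2181/mesos\n' "$zk"
}

# Example with made-up hostnames; this only prints the lines for review.
spark_env_exports namenode.example.com zk.example.com
```

Once the printed lines look right for your cluster, they could be appended with `spark_env_exports <nn> <master> >> conf/spark-env.sh` in place of the manual edit.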
Spark on Mesos: run jobs in Spark
http://spark.incubator.apache.org/examples.html

run an example job in Spark, to filter an RDD of integers, in two steps at the REPL…
step 1: create an RDD
val data = 1 to 10000
val distData = sc.parallelize(data)
step 2: run the filter
distData.filter(_ < 10).collect()
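The filter keeps the integers from 1 to 10000 that are strictly less than 10, so the job should return the values 1 through 9. As a quick local cross-check that does not need a cluster, the same selection can be sketched in plain shell. `filter_below` is a hypothetical helper name, and this is an analogy for the expected result, not a way to run Spark.

```shell
#!/bin/sh
# Hypothetical helper (not from the tutorial): list the integers in [1, $1]
# that are strictly less than $2, joined with commas.
filter_below() {
    seq 1 "$1" | awk -v limit="$2" '$1 < limit' | paste -sd, -
}

# Same range and threshold as the Spark example above.
filter_below 10000 10   # prints 1,2,3,4,5,6,7,8,9
```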
Spark on Mesos: further resources…
Join us!	

O’Reilly Strata, Santa Clara

Feb 11-13

strataconf.com/strata2014

Mesos tutorial, Tue 2/11 1:30pm	

BOF lunch, Wed 2/12 12:10pm	

Mesos session, Thu 2/13 2:20pm	

office hours, Thu 2/13 3:15pm
More insights…

Monthly newsletter for events, conf summaries, workshops, etc.:
liber118.com/pxn/

collected Mesos notes:
goo.gl/jPtTP