BENEATH RDD
IN APACHE SPARK
USING SPARK-SHELL AND WEBUI
Jacek Laskowski / @jaceklaskowski / GitHub / Mastering Apache Spark notes
Jacek Laskowski is an independent consultant
Contact me at jacek@japila.pl or @JacekLaskowski
Delivering Development Services | Consulting | Training
Building and leading development teams
Mostly Apache Spark and Scala these days
Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark
Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl
Java Champion
http://bit.ly/mastering-apache-spark
SPARKCONTEXT
THE LIVING SPACE FOR RDDS
SPARKCONTEXT AND RDDS
An RDD belongs to one and only one Spark context.
You cannot share RDDs between contexts.
SparkContext tracks how many RDDs were created.
You may see it in toString output.
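A quick sketch in spark-shell (an assumption on my part that this is what the toString hint refers to: every RDD's toString carries its id in brackets, assigned in creation order by the owning SparkContext; ids and console line numbers will differ):
scala> val a = sc.parallelize(0 to 10)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val b = sc.parallelize(0 to 10)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24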
SPARKCONTEXT AND RDDS (2)
RDD
RESILIENT DISTRIBUTED DATASET
CREATING RDD - SC.PARALLELIZE
sc.parallelize(col, slices) to distribute a local collection of any elements.
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at
Alternatively, sc.makeRDD(col, slices)
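A minimal sketch with an explicit number of slices (run in the default spark-shell; RDD ids will differ):
scala> val rdd = sc.parallelize(0 to 10, 4)
scala> rdd.partitions.size
res0: Int = 4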
CREATING RDD - SC.RANGE
sc.range(start, end, step, slices) to create an RDD of Long numbers.
scala> val rdd = sc.range(0, 100)
rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>:
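A sketch with an explicit step and number of slices (end is exclusive):
scala> val rdd = sc.range(0, 100, 10, 2)
scala> rdd.collect
res1: Array[Long] = Array(0, 10, 20, 30, 40, 50, 60, 70, 80, 90)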
CREATING RDD - SC.TEXTFILE
sc.textFile(name, partitions) to create an RDD of lines from a file.
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFil
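The partitions argument is a minimum number of partitions - a sketch (assuming README.md exists in the current directory):
scala> val rdd = sc.textFile("README.md", 4)
scala> rdd.partitions.size   // usually 4 or more for a non-empty file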
CREATING RDD - SC.WHOLETEXTFILES
sc.wholeTextFiles(name, partitions) to create an RDD of (file name, content) pairs from a directory.
scala> val rdd = sc.wholeTextFiles("tags")
rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wh
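A sketch to peek at the file names only (assumes a local tags directory, as in the slide):
scala> val rdd = sc.wholeTextFiles("tags")
scala> rdd.keys.take(2)   // just the file paths, without the contents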
There are many more advanced functions in SparkContext to create RDDs.
PARTITIONS (AND SLICES)
Did you notice the words slices and partitions as
parameters?
Partitions (aka slices) are the level of parallelism.
We're going to talk about the level of parallelism later.
CREATING RDD - DATAFRAMES
RDDs are so last year :-) Use DataFrames...early and often!
A DataFrame is a higher-level abstraction over RDDs and
semi-structured data.
DataFrames require a SQLContext.
FROM RDDS TO DATAFRAMES
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at
scala> val df = rdd.toDF
df: org.apache.spark.sql.DataFrame = [_1: int]
scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]
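To inspect the resulting DataFrame (a sketch; Spark 1.x output format):
scala> df.printSchema
root
 |-- numbers: integer (nullable = false)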
...AND VICE VERSA
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at
scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]
scala> df.rdd
res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70]
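Getting the plain values back out of the Rows (a sketch):
scala> df.rdd.map(_.getInt(0)).collect
res24: Array[Int] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)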
CREATING DATAFRAMES -
SQLCONTEXT.CREATEDATAFRAME
sqlContext.createDataFrame(rowRDD, schema)
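A minimal sketch, assuming spark-shell where sqlContext is predefined (the schema and column name are illustrative):
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
scala> val rowRDD = sc.parallelize(0 to 10).map(n => Row(n))
scala> val schema = StructType(Seq(StructField("numbers", IntegerType, nullable = false)))
scala> val df = sqlContext.createDataFrame(rowRDD, schema)
df: org.apache.spark.sql.DataFrame = [numbers: int]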
CREATING DATAFRAMES - SQLCONTEXT.READ
sqlContext.read is the modern yet experimental way.
sqlContext.read.format(f).load(path), where f is one of:
jdbc
json
orc
parquet
text
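For example, with the json format (a sketch; assumes a people.json file exists in the current directory):
scala> val df = sqlContext.read.format("json").load("people.json")
scala> val people = sqlContext.read.json("people.json")   // shortcut for the same thing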
EXECUTION ENVIRONMENT
PARTITIONS AND LEVEL OF PARALLELISM
The number of partitions of an RDD is (roughly) the number of tasks.
Partitions are the hint to size jobs.
Tasks are the smallest unit of execution.
Tasks belong to TaskSets.
TaskSets belong to Stages.
Stages belong to Jobs.
Jobs, stages, and tasks are displayed in the web UI.
We're going to talk about the web UI later.
PARTITIONS AND LEVEL OF PARALLELISM (CONT'D)
In local[*] mode, the number of partitions equals the
number of cores (the default in spark-shell)
scala> sc.defaultParallelism
res0: Int = 8
scala> sc.master
res1: String = local[*]
Not necessarily true when you use local or local[n] master
URLs.
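A sketch with an explicit master URL (assuming spark-shell was started with --master local[2]):
scala> sc.master
res0: String = local[2]
scala> sc.defaultParallelism
res1: Int = 2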
LEVEL OF PARALLELISM IN SPARK CLUSTERS
TaskScheduler controls the level of parallelism
DAGScheduler, TaskScheduler, SchedulerBackend work
in tandem
DAGScheduler manages a "DAG" of RDDs (aka RDD
lineage)
SchedulerBackends manage TaskSets
DAGSCHEDULER
TASKSCHEDULER AND SCHEDULERBACKEND
RDD LINEAGE
RDD lineage is a graph of RDD dependencies.
Use toDebugString to see the lineage.
Be careful with the hops (the indented +- branches in toDebugString) - they mark shuffle boundaries.
Why is the RDD lineage important?
This is the R in RDD - resiliency.
But deep lineage costs processing time, doesn't it?
Persist (aka cache) it early and often!
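A minimal caching sketch (cache is persist with the default MEMORY_ONLY storage level):
scala> val rdd = sc.parallelize(0 to 10).map(_ * 2)
scala> rdd.cache    // same as rdd.persist(StorageLevel.MEMORY_ONLY)
scala> rdd.count    // the first action computes the partitions and caches them
res2: Long = 11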
RDD LINEAGE - DEMO
What does the following do?
val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
RDD LINEAGE - DEMO (CONT'D)
How many stages are there?
// val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
scala> rdd.toDebugString
res2: String =
(2) ShuffledRDD[3] at groupBy at <console>:24 []
+-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []
| MapPartitionsRDD[1] at map at <console>:24 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
Nothing happens yet - processing time-wise.
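Running any action materializes the lineage; the groupBy shuffle splits the job into the 2 stages shown above (a sketch):
scala> rdd.count   // two groups: keys 0 and 1
res3: Long = 2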
SPARK CLUSTERS
Spark supports the following clusters:
one-JVM local cluster
Spark Standalone
Apache Mesos
Hadoop YARN
You use --master to select the cluster
spark://hostname:port is for Spark Standalone
And you know the local master URLs, don't you?
local, local[n], or local[*]
MANDATORY PROPERTIES OF SPARK APP
Your task: Fill in the gaps below.
Any Spark application must specify the application name (aka appName) and the master URL.
Demo time! => spark-shell is a Spark app, too!
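A minimal sketch of what every Spark application sets up (spark-shell does this for you; the names here are illustrative):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("beneath-rdd-demo")   // application name (appName)
  .setMaster("local[*]")            // master URL
val sc = new SparkContext(conf)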
SPARK STANDALONE CLUSTER
The built-in Spark cluster
Start the standalone Master with sbin/start-master.sh
Use -h to control the host name to bind to.
Start a standalone Worker with sbin/start-slave.sh <master-url>
Run a single worker per machine (aka node)
http://localhost:8080/ = the web UI for the Standalone cluster
Don't confuse it with the web UI of the Spark application
Demo time! => Run Standalone cluster
SPARK-SHELL
SPARK REPL APPLICATION
SPARK-SHELL AND SPARK STANDALONE
You can connect to Spark Standalone using spark-shell through the --master command-line option.
Demo time! => we've already started the Standalone
cluster.
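For example (assuming the Standalone master from the earlier demo listens at spark://localhost:7077):
./bin/spark-shell --master spark://localhost:7077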
WEBUI
WEB USER INTERFACE FOR SPARK APPLICATION
WEBUI
It is available under http://localhost:4040/
You can disable it using spark.ui.enabled flag.
All the events are captured by Spark listeners
You can register your own Spark listener (see the sketch below).
Demo time! => webUI in action with different master URLs
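A minimal sketch of registering your own listener in spark-shell (the Spark 1.x listener API in org.apache.spark.scheduler; this one just prints finished job ids):
scala> import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
scala> sc.addSparkListener(new SparkListener {
     |   override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
     |     println(s"Job ${jobEnd.jobId} finished")
     | })
scala> sc.parallelize(0 to 10).count   // the listener prints the job id when the job finishes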
QUESTIONS?
- Visit Jacek Laskowski's blog
- Follow @jaceklaskowski on twitter
- Use Jacek's projects at GitHub
- Read the Mastering Apache Spark notes.