Apache Spark
when things go wrong
@rabbitonweb
Apache Spark - when things go wrong
val sc = new SparkContext(conf)
sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
  .take(300)
  .foreach(println)
If the above seems odd, this talk is probably not for you.
@rabbitonweb - tweet compliments and complaints
Target for this talk
4 things for take-away
1. Knowledge of how Apache Spark works internally
2. Courage to look at Spark's implementation (code)
3. Notion of how to write efficient Spark programs
4. Basic ability to monitor Spark when things are starting to act weird
Two words about examples used
Journal
Super Cool App
events
spark cluster
node 1
node 2
node 3
master A
master B
Two words about examples used
sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
  .take(300)
  .foreach(println)
What is an RDD?
Resilient Distributed Dataset
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
The same dataset, partitioned across node 1, node 2 and node 3.
What is an RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
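In code these three map roughly onto the members every RDD subclass provides. A paraphrased sketch of the base class shape, not the literal Spark source:
abstract class RDD[T](...) {
  protected def getDependencies: Seq[Dependency[_]]  // 1. pointers to the parent RDD(s)
  protected def getPartitions: Array[Partition]      // 2. how its data is split into partitions
  def compute(split: Partition, context: TaskContext): Iterator[T]  // 3. how to evaluate one partition
}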
What is a partition?
A partition represents a subset of the data within your distributed collection.
override def getPartitions: Array[Partition] = ???
How this subset is defined depends on the type of the RDD.
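You can check how an RDD ended up partitioned straight from the shell; a minimal sketch, where journal is the example RDD used throughout this talk:
val journal = sc.textFile("hdfs://journal/*")
println(journal.partitions.length)               // number of partitions backing this RDD
println(journal.partitions.map(_.index).toList)  // their indexes, 0 until N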
example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
How is a HadoopRDD partitioned?
In HadoopRDD a partition corresponds exactly to a file chunk (input split) in HDFS.
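If the default split-based layout gives too few partitions, textFile also accepts a minimum number of partitions as a hint; a small sketch, where the value 64 is just an example:
val journal = sc.textFile("hdfs://journal/*", 64)  // ask the input format for at least 64 splits
// the actual partition count can still be higher, depending on the HDFS block layout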
example: HadoopRDD
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
  ...
  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }
example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter { e =>
  LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1)
}
How is a MapPartitionsRDD partitioned?
MapPartitionsRDD inherits its partition information from its parent RDD.
example: MapPartitionsRDD
class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) {
  ...
  override def getPartitions: Array[Partition] = firstParent[T].partitions
What is an RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
RDD parent
sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
  .take(300)
  .foreach(println)
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
HadoopRDD
ShuffledRDD MapPartRDD MapPartRDD
Two types of parent dependencies:
1. narrow dependency
2. wide dependency
Tasks
Stage 1
Stage 2
Two important concepts:
1. shuffle write
2. shuffle read
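You can ask an RDD about its lineage directly. A quick sketch from the shell: narrow dependencies show up as e.g. OneToOneDependency, wide ones as ShuffleDependency, which is exactly where Spark cuts a stage and shuffle write/read happen:
val grouped = sc.textFile("hdfs://journal/*").groupBy(extractDate _)
grouped.dependencies.foreach(d => println(d.getClass.getSimpleName))                 // ShuffleDependency
grouped.map(identity).dependencies.foreach(d => println(d.getClass.getSimpleName))   // OneToOneDependency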
toDebugString
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
scala> events.toDebugString
res5: String =
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []
What is an RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
Stage 1
Stage 2
collect() is an action.
Actions are implemented using the sc.runJob method.
Running Job aka materializing DAG
/**
 * Run a function on a given set of partitions in an RDD and return the results as an array.
 */
def runJob[T, U](
  rdd: RDD[T],
  partitions: Seq[Int],
  func: Iterator[T] => U
): Array[U]
Running Job aka materializing DAG
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
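In the same spirit you can run your own tailored job. A minimal sketch, not from the deck, that counts elements per partition of the example journal RDD:
val perPartitionCounts: Array[Long] =
  sc.runJob(journal, (it: Iterator[String]) => it.size.toLong)
perPartitionCounts.zipWithIndex.foreach {
  case (n, i) => println(s"partition $i holds $n events")
}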
Multiple jobs for a single action
/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that
 * partition to estimate the number of additional partitions needed to satisfy the limit.
 */
def take(num: Int): Array[T] = {
  (….)
  val left = num - buf.size
  val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
  (….)
  res.foreach(buf ++= _.take(num - buf.size))
  partsScanned += numPartsToTry
  (….)
  buf.toArray
}
Let's test what we've learned
Towards efficiency
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
scala> events.toDebugString
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []
events.count
Stage 1
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
node 1
node 2
node 3
Everyday I'm Shuffling
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
node 1
node 2
node 3
Stage 2
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
node 1
node 2
node 3
Let's refactor
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
Let's refactor (1): push the filter before the shuffle
val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
Let's refactor (2): give groupBy an explicit number of partitions
val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .groupBy(extractDate _, 6)
  .map { case (date, events) => (date, events.size) }
Let's refactor (3): replace groupBy + map with combineByKey
val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .map(e => (extractDate(e), e))
  .combineByKey(e => 1, (counter: Int, e: String) => counter + 1, (c1: Int, c2: Int) => c1 + c2)
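The same per-date count can be expressed even more directly. A sketch, not from the original deck, using reduceByKey, which also combines on the map side before shuffling:
val eventsPerDay = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .map(e => (extractDate(e), 1))  // key by date, one count per event
  .reduceByKey(_ + _)             // partial sums per partition, then shuffle only the partials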
A bit more about partitions
val events = sc.textFile("hdfs://journal/*") // here a small number of partitions, let's say 4
  .repartition(256) // note, this will cause a shuffle
  .map(e => (extractDate(e), e))
A bit more about partitions
val events = sc.textFile("hdfs://journal/*") // here a lot of partitions, let's say 1024
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .coalesce(64) // this will NOT shuffle
  .map(e => (extractDate(e), e))
Enough theory
Few optimization tricks
1. Serialization issues (e.g. KryoSerializer)
2. Turn on speculation if on a shared cluster
3. Experiment with compression codecs
4. Learn the API
   a. groupBy->map vs combineByKey
   b. map vs mapValues (partitioner)
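For points 1-3 the knobs live in SparkConf. A minimal sketch with example values only, tune them for your own cluster:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("super-cool-app")
  // 1. Kryo is usually faster and more compact than Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // 2. speculatively re-launch slow tasks on a busy, shared cluster
  .set("spark.speculation", "true")
  // 3. codec for shuffle and RDD compression; snappy, lz4 and lzf are built in
  .set("spark.io.compression.codec", "lz4")
val sc = new SparkContext(conf)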
Spark performance - shuffle optimization
map groupBy join
Optimization: shuffle avoided if data is already partitioned
map groupBy map join
map groupBy mapValues join
map discards the partitioner, so the following join has to shuffle again; mapValues preserves it, so the join can reuse the existing partitioning.
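A small sketch of that difference (the value names are illustrative, not from the deck):
val byDate = sc.textFile("hdfs://journal/*")
  .map(e => (extractDate(e), e))
  .groupByKey()                                           // hash-partitioned by date after the shuffle
val sizesA = byDate.map { case (d, es) => (d, es.size) }  // plain map: partitioner is lost
val sizesB = byDate.mapValues(_.size)                     // mapValues: partitioner is kept
println(sizesA.partitioner)  // None
println(sizesB.partitioner)  // Some(HashPartitioner(...))
// joining sizesB with data partitioned the same way avoids another full shuffle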
Few optimization tricks
5. Understand the basics & internals to avoid common mistakes
Example: using an RDD inside another RDD's operation. This is invalid, since only the driver can invoke operations on RDDs (not the workers):
rdd.map((k,v) => otherRDD.get(k)) -> rdd.join(otherRdd)
rdd.map(e => otherRdd.map {}) -> rdd.cartesian(otherRdd)
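A concrete sketch of the join fix; the two small RDDs here are made up for illustration:
val users  = sc.parallelize(Seq((10, "Ania Nowak"), (12, "Jan Kowalski")))
val logins = sc.parallelize(Seq((10, "10/05/2015 10:14:01"), (12, "10/05/2015 10:17:03")))
// instead of users.map { case (id, name) => (name, logins.lookup(id)) }  // fails: RDDs can't be used inside tasks
val joined = users.join(logins)  // (10, ("Ania Nowak", "10/05/2015 10:14:01")), ...
joined.collect().foreach(println)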
Resources & further read
● Spark Summit Talks @ Youtube
  SPARK SUMMIT EUROPE!!! October 27th to 29th
● "Learning Spark" book, published by O'Reilly
● Apache Spark Documentation
  ○ https://spark.apache.org/docs/latest/monitoring.html
  ○ https://spark.apache.org/docs/latest/tuning.html
● Mailing List aka solution-to-your-problem-is-probably-already-there
● Pretty soon my blog & github :)
Paweł Szulc
blog: http://rabbitonweb.com
twitter: @rabbitonweb
github: https://github.com/rabbitonweb
More Related Content

PDF
Writing your own RDD for fun and profit
PDF
Know your platform. 7 things every scala developer should know about jvm
PDF
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
PDF
Column Stride Fields aka. DocValues
PDF
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
PDF
Cassandra data structures and algorithms
PDF
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
PDF
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Writing your own RDD for fun and profit
Know your platform. 7 things every scala developer should know about jvm
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
Column Stride Fields aka. DocValues
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Cassandra data structures and algorithms
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...

What's hot (12)

PDF
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
PDF
Cassandra 2.1
PDF
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
ODP
Buenos Aires Drools Expert Presentation
PDF
Cassandra summit keynote 2014
PPTX
Utahbigmountain ancestrydnahbasehadoop9-7-2013billyetman-130928100600-phpapp02
PDF
Cassandra Summit 2015
PPTX
Data Mining with Splunk
PDF
Cassandra Summit 2013 Keynote
PDF
Cassandra London - C* Spark Connector
PDF
MongoDB: Optimising for Performance, Scale & Analytics
KEY
Rails Model Basics
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Cassandra 2.1
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
Buenos Aires Drools Expert Presentation
Cassandra summit keynote 2014
Utahbigmountain ancestrydnahbasehadoop9-7-2013billyetman-130928100600-phpapp02
Cassandra Summit 2015
Data Mining with Splunk
Cassandra Summit 2013 Keynote
Cassandra London - C* Spark Connector
MongoDB: Optimising for Performance, Scale & Analytics
Rails Model Basics
Ad

Viewers also liked (14)

PDF
Functional Programming & Event Sourcing - a pair made in heaven
PDF
Introduction to type classes
PDF
Apache spark workshop
PDF
Introduction to type classes in 30 min
PDF
The cats toolbox a quick tour of some basic typeclasses
PDF
Real world gobbledygook
PDF
“Going bananas with recursion schemes for fixed point data types”
PDF
Advanced Threat Detection on Streaming Data
PDF
Introduction to Spark
PDF
Spark workshop
PDF
Stock Prediction Using NLP and Deep Learning
PDF
Going bananas with recursion schemes for fixed point data types
PDF
Make your programs Free
PDF
Applying Machine Learning to Live Patient Data
Functional Programming & Event Sourcing - a pair made in heaven
Introduction to type classes
Apache spark workshop
Introduction to type classes in 30 min
The cats toolbox a quick tour of some basic typeclasses
Real world gobbledygook
“Going bananas with recursion schemes for fixed point data types”
Advanced Threat Detection on Streaming Data
Introduction to Spark
Spark workshop
Stock Prediction Using NLP and Deep Learning
Going bananas with recursion schemes for fixed point data types
Make your programs Free
Applying Machine Learning to Live Patient Data
Ad

Similar to Apache spark when things go wrong (20)

PPT
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
PDF
A deeper-understanding-of-spark-internals
PDF
A deeper-understanding-of-spark-internals-aaron-davidson
PPTX
Apache Spark Workshop
PDF
DTCC '14 Spark Runtime Internals
PDF
Advanced spark training advanced spark internals and tuning reynold xin
PDF
Apache Spark Internals - Part 2
PPTX
Study Notes: Apache Spark
PDF
Introduction to Apache Spark
PDF
Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab
PDF
Apache Spark Best Practices Meetup Talk
PDF
TriHUG talk on Spark and Shark
PDF
G017143640
PDF
Big Data Analysis and Its Scheduling Policy – Hadoop
PDF
Spark_RDD_SyedAcademy
PPTX
Berlin buzzwords 2018
PDF
Most Popular Hadoop Interview Questions and Answers
PDF
Introduction to Apache Spark
PDF
Intro to big data choco devday - 23-01-2014
PDF
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals-aaron-davidson
Apache Spark Workshop
DTCC '14 Spark Runtime Internals
Advanced spark training advanced spark internals and tuning reynold xin
Apache Spark Internals - Part 2
Study Notes: Apache Spark
Introduction to Apache Spark
Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Best Practices Meetup Talk
TriHUG talk on Spark and Shark
G017143640
Big Data Analysis and Its Scheduling Policy – Hadoop
Spark_RDD_SyedAcademy
Berlin buzzwords 2018
Most Popular Hadoop Interview Questions and Answers
Introduction to Apache Spark
Intro to big data choco devday - 23-01-2014
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)

More from Pawel Szulc (18)

PDF
Getting acquainted with Lens
PDF
Impossibility
PDF
Maintainable Software Architecture in Haskell (with Polysemy)
PDF
Painless Haskell
PDF
Trip with monads
PDF
Trip with monads
PDF
Illogical engineers
PDF
RChain - Understanding Distributed Calculi
PDF
Illogical engineers
PDF
Understanding distributed calculi in Haskell
PDF
Software engineering the genesis
PDF
Category theory is general abolute nonsens
PDF
Fun never stops. introduction to haskell programming language
PDF
Monads asking the right question
PDF
Apache Spark 101 [in 50 min]
PDF
Javascript development done right
PDF
Architektura to nie bzdura
ODP
Testing and Testable Code
Getting acquainted with Lens
Impossibility
Maintainable Software Architecture in Haskell (with Polysemy)
Painless Haskell
Trip with monads
Trip with monads
Illogical engineers
RChain - Understanding Distributed Calculi
Illogical engineers
Understanding distributed calculi in Haskell
Software engineering the genesis
Category theory is general abolute nonsens
Fun never stops. introduction to haskell programming language
Monads asking the right question
Apache Spark 101 [in 50 min]
Javascript development done right
Architektura to nie bzdura
Testing and Testable Code

Recently uploaded (20)

PDF
medical staffing services at VALiNTRY
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PDF
System and Network Administration Chapter 2
PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PDF
Odoo Companies in India – Driving Business Transformation.pdf
PDF
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PDF
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PPTX
Essential Infomation Tech presentation.pptx
PDF
Softaken Excel to vCard Converter Software.pdf
PDF
System and Network Administraation Chapter 3
PDF
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PDF
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PDF
Understanding Forklifts - TECH EHS Solution
PPTX
history of c programming in notes for students .pptx
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
medical staffing services at VALiNTRY
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
System and Network Administration Chapter 2
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
Odoo Companies in India – Driving Business Transformation.pdf
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
Which alternative to Crystal Reports is best for small or large businesses.pdf
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
How to Choose the Right IT Partner for Your Business in Malaysia
Design an Analysis of Algorithms I-SECS-1021-03
Essential Infomation Tech presentation.pptx
Softaken Excel to vCard Converter Software.pdf
System and Network Administraation Chapter 3
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
Understanding Forklifts - TECH EHS Solution
history of c programming in notes for students .pptx
Internet Downloader Manager (IDM) Crack 6.42 Build 41

Apache spark when things go wrong

  • 1. Apache Spark when things go wrong @rabbitonweb
  • 2. Apache Spark - when things go wrong val sc = new SparkContext(conf) sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println) If above seems odd, this talk is rather not for you. @rabbitonweb tweet compliments and complaints
  • 12. 4 things for take-away
  • 13. 4 things for take-away 1. Knowledge of how Apache Spark works internally
  • 14. 4 things for take-away 1. Knowledge of how Apache Spark works internally 2. Courage to look at Spark’s implementation (code)
  • 15. 4 things for take-away 1. Knowledge of how Apache Spark works internally 2. Courage to look at Spark’s implementation (code) 3. Notion of how to write efficient Spark programs
  • 16. 4 things for take-away 1. Knowledge of how Apache Spark works internally 2. Courage to look at Spark’s implementation (code) 3. Notion of how to write efficient Spark programs 4. Basic ability to monitor Spark when things are starting to act weird
  • 17. Two words about examples used
  • 18. Two words about examples used Super Cool App
  • 19. Two words about examples used Journal Super Cool App events
  • 20. spark cluster Two words about examples used Journal Super Cool App events node 1 node 2 node 3 master A master B
  • 21. spark cluster Two words about examples used Journal Super Cool App events node 1 node 2 node 3 master A master B
  • 22. spark cluster Two words about examples used Journal Super Cool App events node 1 node 2 node 3 master A master B
  • 23. Two words about examples used
  • 24. Two words about examples used sc.textFile(“hdfs://journal/*”)
  • 25. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _)
  • 26. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) }
  • 27. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 28. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300)
  • 29. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println)
  • 30. What is a RDD?
  • 31. What is a RDD? Resilient Distributed Dataset
  • 32. What is a RDD? Resilient Distributed Dataset
  • 33. ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski What is a RDD?
  • 34. node 1 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski node 2 node 3 What is a RDD?
  • 35. node 1 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski node 2 node 3 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski What is a RDD?
  • 36. What is a RDD?
  • 37. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work:
  • 38. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to his parent
  • 39. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to his parent 2. how its internal data is partitioned
  • 40. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to his parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 41. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to his parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 42. What is a partition?
  • 43. What is a partition? A partition represents subset of data within your distributed collection.
  • 44. What is a partition? A partition represents subset of data within your distributed collection. override def getPartitions: Array[Partition] = ???
  • 45. What is a partition? A partition represents subset of data within your distributed collection. override def getPartitions: Array[Partition] = ??? How this subset is defined depends on type of the RDD
  • 46. example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”)
  • 47. example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”) How HadoopRDD is partitioned?
  • 48. example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”) How HadoopRDD is partitioned? In HadoopRDD partition is exactly the same as file chunks in HDFS
  • 49. example: HadoopRDD 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 50. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 51. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 52. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 53. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 54. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 55–57. example: HadoopRDD
  class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
    ...
    override def getPartitions: Array[Partition] = {
      val jobConf = getJobConf()
      SparkHadoopUtil.get.addCredentials(jobConf)
      val inputFormat = getInputFormat(jobConf)
      if (inputFormat.isInstanceOf[Configurable]) {
        inputFormat.asInstanceOf[Configurable].setConf(jobConf)
      }
      val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
      val array = new Array[Partition](inputSplits.size)
      for (i <- 0 until inputSplits.size) {
        array(i) = new HadoopPartition(id, i, inputSplits(i))
      }
      array
    }
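  The partition count produced by getSplits is easy to verify from the shell. A minimal sketch (same journal path as in the talk; minPartitions is only a lower-bound hint, and the actual counts depend on your input splits):

  val journal = sc.textFile("hdfs://journal/*")        // one partition per input split
  println(journal.partitions.length)

  val journal64 = sc.textFile("hdfs://journal/*", 64)  // minPartitions hint forwarded to getSplits
  println(journal64.partitions.length)                 // usually >= 64, not guaranteed to be exactly 64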
  • 58. example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  • 59. example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } How is a MapPartitionsRDD partitioned?
  • 60. example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } How is a MapPartitionsRDD partitioned? A MapPartitionsRDD inherits its partition information from its parent RDD
  • 61. example: MapPartitionsRDD
  class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) {
    ...
    override def getPartitions: Array[Partition] = firstParent[T].partitions
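  In other words, narrow transformations such as filter and map reuse the parent's partitions one to one. A quick sketch to confirm this from the shell (the predicate here is a stand-in, not the one from the slides):

  val journal   = sc.textFile("hdfs://journal/*")
  val fromMarch = journal.filter(_.nonEmpty)   // any narrow transformation will do
  assert(fromMarch.partitions.length == journal.partitions.length)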
  • 62. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 63. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
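  Those three pieces of information map directly onto members of the RDD class. A simplified sketch against the Spark 1.x API used in this talk (signatures abbreviated, not a working subclass):

  import scala.reflect.ClassTag
  import org.apache.spark.{Dependency, Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  abstract class SketchRDD[T: ClassTag](sc: SparkContext, deps: Seq[Dependency[_]])
      extends RDD[T](sc, deps) {                                        // 1. parent(s), via the dependencies
    protected def getPartitions: Array[Partition]                       // 2. how the data is partitioned
    def compute(split: Partition, context: TaskContext): Iterator[T]    // 3. how to evaluate one partition
  }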
  • 64. RDD parent sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println)
  • 65. RDD parent sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println)
  • 66. RDD parent sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 67. Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 68. Directed acyclic graph HadoopRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 69. Directed acyclic graph HadoopRDD ShuffledRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 70. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 71. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 72. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 73. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two types of parent dependencies: 1. narrow dependency 2. wide (shuffle) dependency
  • 74. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two types of parent dependencies: 1. narrow dependency 2. wide (shuffle) dependency
  • 75. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two types of parent dependencies: 1. narrow dependency 2. wide (shuffle) dependency
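  You can see which kind of dependency each transformation created by inspecting rdd.dependencies from the shell. A sketch (extractDate is the helper used throughout the talk):

  val grouped = sc.textFile("hdfs://journal/*").groupBy(extractDate _)      // wide: needs a shuffle
  val counted = grouped.map { case (date, events) => (date, events.size) }  // narrow

  grouped.dependencies.foreach(d => println(d.getClass.getSimpleName))  // ShuffleDependency
  counted.dependencies.foreach(d => println(d.getClass.getSimpleName))  // OneToOneDependency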
  • 76. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 77. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Tasks
  • 78. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Tasks
  • 79. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 80. Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 81. Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 82. Stage 1 Stage 2 Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 83. Stage 1 Stage 2 Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two important concepts: 1. shuffle write 2. shuffle read
  • 84. toDebugString val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 85. toDebugString val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } scala> events.toDebugString
  • 86. toDebugString
  val events = sc.textFile(“hdfs://journal/*”)
    .groupBy(extractDate _)
    .map { case (date, events) => (date, events.size) }
    .filter {
      case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
    }
  scala> events.toDebugString
  res5: String =
  (4) MapPartitionsRDD[22] at filter at <console>:50 []
   |  MapPartitionsRDD[21] at map at <console>:49 []
   |  ShuffledRDD[20] at groupBy at <console>:48 []
   +-(6) HadoopRDD[17] at textFile at <console>:47 []
  • 87. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 88. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 89. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { }
  • 90. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect()
  • 91. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect() action
  • 92. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect() action Actions are implemented using sc.runJob method
  • 93. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( ): Array[U]
  • 94. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( rdd: RDD[T], ): Array[U]
  • 95. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( rdd: RDD[T], partitions: Seq[Int], ): Array[U]
  • 96. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( rdd: RDD[T], partitions: Seq[Int], func: Iterator[T] => U, ): Array[U]
  • 97. Running Job aka materializing DAG
  /** Return an array that contains all of the elements in this RDD. */
  def collect(): Array[T] = {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }
  • 98. Running Job aka materializing DAG
  /** Return an array that contains all of the elements in this RDD. */
  def collect(): Array[T] = {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }
  /** Return the number of elements in the RDD. */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
  • 99. Multiple jobs for single action
  /** Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit. */
  def take(num: Int): Array[T] = {
    (….)
    val left = num - buf.size
    val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
    (….)
    res.foreach(buf ++= _.take(num - buf.size))
    partsScanned += numPartsToTry
    (….)
    buf.toArray
  }
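  All of these actions funnel into runJob, and you can call it yourself to see the per-partition results it returns. A minimal sketch using the simplest overload shown in collect above (one result per partition):

  val journal = sc.textFile("hdfs://journal/*")
  val linesPerPartition: Array[Long] =
    sc.runJob(journal, (iter: Iterator[String]) => iter.size.toLong)
  println(linesPerPartition.sum)   // same number that journal.count() would return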
  • 101. Towards efficiency val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 102–103. Towards efficiency
  val events = sc.textFile(“hdfs://journal/*”)
    .groupBy(extractDate _)
    .map { case (date, events) => (date, events.size) }
    .filter {
      case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
    }
  scala> events.toDebugString
  (4) MapPartitionsRDD[22] at filter at <console>:50 []
   |  MapPartitionsRDD[21] at map at <console>:49 []
   |  ShuffledRDD[20] at groupBy at <console>:48 []
   +-(6) HadoopRDD[17] at textFile at <console>:47 []
  events.count
  • 104. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 105. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 106. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 107. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 108. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 109. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 110. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 111. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 112. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 113. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 114. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 115. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 116. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 117. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 119. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 120. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 121. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 122. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 123. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 124. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 125. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 126. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 127. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 128. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 129. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 130. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 131. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 132. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 133. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 134. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 135. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 136. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 137. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 138. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 139. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 140. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 141. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 142. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) }
  • 143. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) }
  • 144. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 145. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 146. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 147. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 148. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 149. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) }
  • 150. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .map( e => (extractDate(e), e)) .combineByKey(e => 1, (counter: Int, e: String) => counter + 1, (c1: Int, c2: Int) => c1 + c2) — replacing the earlier .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) }
  • 151. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .map( e => (extractDate(e), e)) .combineByKey(e => 1, (counter: Int, e: String) => counter + 1, (c1: Int, c2: Int) => c1 + c2)
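  For this particular count-per-date job, the same map-side combining can be written more simply with reduceByKey. A sketch, equivalent in behaviour to the combineByKey version above (extractDate and the LocalDate import as before):

  val eventsPerDate = sc.textFile("hdfs://journal/*")
    .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015, 3, 1) }
    .map(e => (extractDate(e), 1))
    .reduceByKey(_ + _)   // partial counts are combined on the map side, as with combineByKey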
  • 152. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here small number of partitions, let’s say 4 .map( e => (extractDate(e), e))
  • 153. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here small number of partitions, let’s say 4 .repartition(256) // note, this will cause a shuffle .map( e => (extractDate(e), e))
  • 154. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here a lot of partitions, let’s say 1024 .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .map( e => (extractDate(e), e))
  • 155. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here a lot of partitions, let’s say 1024 .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .coalesce(64) // this will NOT shuffle .map( e => (extractDate(e), e))
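  The difference is visible directly in the lineage. A quick sketch to check it from the shell (partition counts are illustrative):

  val raw = sc.textFile("hdfs://journal/*")
  println(raw.repartition(256).toDebugString)  // lineage contains a ShuffledRDD: repartition always shuffles
  println(raw.coalesce(64).toDebugString)      // a CoalescedRDD on top of the parent: no shuffle stage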
  • 188. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer)
  • 189. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster
  • 190. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs
  • 191. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey
  • 192. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner)
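  The first three points are plain configuration switches. A sketch of how they might be set (property names from the Spark configuration docs; the app name and codec value are just illustrative choices):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("journal-stats")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // 1. serialization
    .set("spark.speculation", "true")                                       // 2. speculative execution
    .set("spark.io.compression.codec", "lz4")                               // 3. compression codec
  val sc = new SparkContext(conf)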
  • 193–197. Spark performance - shuffle optimization (diagrams): map → groupBy → join. Optimization: shuffle avoided if data is already partitioned
  • 198–201. Spark performance - shuffle optimization (diagrams): map → groupBy → map → join
  • 202–205. Spark performance - shuffle optimization (diagrams): map → groupBy → mapValues → join
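  These diagrams illustrate point 4b from the list: mapValues keeps the parent's partitioner, a plain map drops it, so a later join on the same key either reuses the existing partitioning or shuffles again. A sketch (extractDate as before, partition count illustrative):

  import org.apache.spark.HashPartitioner

  val byDate = sc.textFile("hdfs://journal/*")
    .map(e => (extractDate(e), e))
    .partitionBy(new HashPartitioner(64))

  println(byDate.mapValues(_.length).partitioner)                   // Some(HashPartitioner) - a join by date can skip the shuffle
  println(byDate.map { case (k, v) => (k, v.length) }.partitioner)  // None - the partitioner is lost, the join shuffles again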
  • 206. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner)
  • 207. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to avoid common mistakes
  • 208. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to avoid common mistakes Example: usage of an RDD within another RDD. Invalid, since only the driver can call operations on RDDs (not the workers)
  • 209. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to avoid common mistakes Example: usage of an RDD within another RDD. Invalid, since only the driver can call operations on RDDs (not the workers) rdd.map((k,v) => otherRDD.get(k)) rdd.map(e => otherRdd.map {})
  • 210. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to avoid common mistakes Example: usage of an RDD within another RDD. Invalid, since only the driver can call operations on RDDs (not the workers) rdd.map((k,v) => otherRDD.get(k)) -> rdd.join(otherRdd) rdd.map(e => otherRdd.map {}) -> rdd.cartesian(otherRdd)
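  When otherRdd is small enough to fit in memory, a third option besides join and cartesian is to ship it to every executor as a broadcast variable. A sketch (rdd and otherRdd are the illustrative names from the slide, assumed here to be pair RDDs):

  val small    = sc.broadcast(otherRdd.collect().toMap)               // collected once on the driver, shipped to executors
  val resolved = rdd.map { case (k, v) => (k, small.value.get(k)) }   // lookup without a nested RDD or a join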
  • 211. Resources & further read ● Spark Summit Talks @ Youtube
  • 212. Resources & further read ● Spark Summit Talks @ Youtube SPARK SUMMIT EUROPE!!! October 27th to 29th
  • 213. Resources & further read ● Spark Summit Talks @ Youtube
  • 214. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly
  • 215. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly ● Apache Spark Documentation ○ https://spark.apache.org/docs/latest/monitoring.html ○ https://spark.apache.org/docs/latest/tuning.html
  • 216. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly ● Apache Spark Documentation ○ https://spark.apache.org/docs/latest/monitoring.html ○ https://spark.apache.org/docs/latest/tuning.html ● Mailing List aka solution-to-your-problem-is-probably-already-there
  • 217. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly ● Apache Spark Documentation ○ https://spark.apache.org/docs/latest/monitoring.html ○ https://spark.apache.org/docs/latest/tuning.html ● Mailing List aka solution-to-your-problem-is-probably-already-there ● Pretty soon my blog & github :)
  • 223. Paweł Szulc blog: http://rabbitonweb.com twitter: @rabbitonweb github: https://github.com/rabbitonweb