Apache Spark
when things go wrong
@rabbitonweb
Apache Spark - when things go wrong
val sc = new SparkContext(conf)
sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
  .take(300)
  .foreach(println)
If the above seems odd, this talk is probably not for you.
@rabbitonweb - tweet compliments and complaints
Target for this talk
4 things for take-away
1. Knowledge of how Apache Spark works internally
2. Courage to look at Spark's implementation (code)
3. Notion of how to write efficient Spark programs
4. Basic ability to monitor Spark when things are starting to act weird
Two words about examples used
Journal
Super Cool App
events
spark cluster
node 1
node 2
node 3
master A
master B
Two words about examples used
sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
  .take(300)
  .foreach(println)
What is an RDD?
Resilient Distributed Dataset
...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
…
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski
The same dataset, partitioned across node 1, node 2 and node 3.
What is an RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
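In code these three map roughly onto the members every RDD subclass provides. A paraphrased sketch of the base class shape, not the literal Spark source:
abstract class RDD[T](...) {
  protected def getDependencies: Seq[Dependency[_]]  // 1. pointers to the parent RDD(s)
  protected def getPartitions: Array[Partition]      // 2. how its data is split into partitions
  def compute(split: Partition, context: TaskContext): Iterator[T]  // 3. how to evaluate one partition
}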
What is a partition?
A partition represents a subset of the data within your distributed collection.
override def getPartitions: Array[Partition] = ???
How this subset is defined depends on the type of the RDD.
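You can check how an RDD ended up partitioned straight from the shell; a minimal sketch, where journal is the example RDD used throughout this talk:
val journal = sc.textFile("hdfs://journal/*")
println(journal.partitions.length)               // number of partitions backing this RDD
println(journal.partitions.map(_.index).toList)  // their indexes, 0 until N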
example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
How is a HadoopRDD partitioned?
In HadoopRDD a partition corresponds exactly to a file chunk (input split) in HDFS.
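If the default split-based layout gives too few partitions, textFile also accepts a minimum number of partitions as a hint; a small sketch, where the value 64 is just an example:
val journal = sc.textFile("hdfs://journal/*", 64)  // ask the input format for at least 64 splits
// the actual partition count can still be higher, depending on the HDFS block layout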
example: HadoopRDD
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
node 1
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni
node 2 node 3
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni
example: HadoopRDD
class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
  ...
  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }
example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter { e =>
  LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1)
}
How is a MapPartitionsRDD partitioned?
MapPartitionsRDD inherits its partition information from its parent RDD.
example: MapPartitionsRDD
class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) {
  ...
  override def getPartitions: Array[Partition] = firstParent[T].partitions
What is an RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
RDD parent
sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
  .take(300)
  .foreach(println)
Directed acyclic graph
sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
HadoopRDD
ShuffledRDD MapPartRDD MapPartRDD
Two types of parent dependencies:
1. narrow dependency
2. wide dependency
Tasks
Stage 1
Stage 2
Two important concepts:
1. shuffle write
2. shuffle read
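You can ask an RDD about its lineage directly. A quick sketch from the shell: narrow dependencies show up as e.g. OneToOneDependency, wide ones as ShuffleDependency, which is exactly where Spark cuts a stage and shuffle write/read happen:
val grouped = sc.textFile("hdfs://journal/*").groupBy(extractDate _)
grouped.dependencies.foreach(d => println(d.getClass.getSimpleName))                 // ShuffleDependency
grouped.map(identity).dependencies.foreach(d => println(d.getClass.getSimpleName))   // OneToOneDependency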
toDebugString
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
scala> events.toDebugString
res5: String =
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []
What is an RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
Stage 1
Stage 2
collect() is an action.
Actions are implemented using the sc.runJob method.
Running Job aka materializing DAG
/**
 * Run a function on a given set of partitions in an RDD and return the results as an array.
 */
def runJob[T, U](
  rdd: RDD[T],
  partitions: Seq[Int],
  func: Iterator[T] => U
): Array[U]
Running Job aka materializing DAG
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
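In the same spirit you can run your own tailored job. A minimal sketch, not from the deck, that counts elements per partition of the example journal RDD:
val perPartitionCounts: Array[Long] =
  sc.runJob(journal, (it: Iterator[String]) => it.size.toLong)
perPartitionCounts.zipWithIndex.foreach {
  case (n, i) => println(s"partition $i holds $n events")
}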
Multiple jobs for a single action
/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that
 * partition to estimate the number of additional partitions needed to satisfy the limit.
 */
def take(num: Int): Array[T] = {
  (….)
  val left = num - buf.size
  val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
  (….)
  res.foreach(buf ++= _.take(num - buf.size))
  partsScanned += numPartsToTry
  (….)
  buf.toArray
}
Let's test what we've learned
Towards efficiency
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }
scala> events.toDebugString
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []
events.count
Stage 1
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
node 1
node 2
node 3
Everyday I'm Shuffling
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
node 1
node 2
node 3
Stage 2
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
node 1
node 2
node 3
Let's refactor
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
Let's refactor (1): push the filter before the shuffle
val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
Let's refactor (2): give groupBy an explicit number of partitions
val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .groupBy(extractDate _, 6)
  .map { case (date, events) => (date, events.size) }
Let's refactor (3): replace groupBy + map with combineByKey
val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .map(e => (extractDate(e), e))
  .combineByKey(e => 1, (counter: Int, e: String) => counter + 1, (c1: Int, c2: Int) => c1 + c2)
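The same per-date count can be expressed even more directly. A sketch, not from the original deck, using reduceByKey, which also combines on the map side before shuffling:
val eventsPerDay = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .map(e => (extractDate(e), 1))  // key by date, one count per event
  .reduceByKey(_ + _)             // partial sums per partition, then shuffle only the partials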
A bit more about partitions
val events = sc.textFile("hdfs://journal/*") // here a small number of partitions, let's say 4
  .repartition(256) // note, this will cause a shuffle
  .map(e => (extractDate(e), e))
A bit more about partitions
val events = sc.textFile("hdfs://journal/*") // here a lot of partitions, let's say 1024
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .coalesce(64) // this will NOT shuffle
  .map(e => (extractDate(e), e))
Enough theory
Few optimization tricks
1. Serialization issues (e.g. KryoSerializer)
2. Turn on speculation if on a shared cluster
3. Experiment with compression codecs
4. Learn the API
   a. groupBy->map vs combineByKey
   b. map vs mapValues (partitioner)
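For points 1-3 the knobs live in SparkConf. A minimal sketch with example values only, tune them for your own cluster:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("super-cool-app")
  // 1. Kryo is usually faster and more compact than Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // 2. speculatively re-launch slow tasks on a busy, shared cluster
  .set("spark.speculation", "true")
  // 3. codec for shuffle and RDD compression; snappy, lz4 and lzf are built in
  .set("spark.io.compression.codec", "lz4")
val sc = new SparkContext(conf)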
Spark performance - shuffle optimization
map groupBy join
Optimization: shuffle avoided if data is already partitioned
map groupBy map join
map groupBy mapValues join
map discards the partitioner, so the following join has to shuffle again; mapValues preserves it, so the join can reuse the existing partitioning.
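A small sketch of that difference (the value names are illustrative, not from the deck):
val byDate = sc.textFile("hdfs://journal/*")
  .map(e => (extractDate(e), e))
  .groupByKey()                                           // hash-partitioned by date after the shuffle
val sizesA = byDate.map { case (d, es) => (d, es.size) }  // plain map: partitioner is lost
val sizesB = byDate.mapValues(_.size)                     // mapValues: partitioner is kept
println(sizesA.partitioner)  // None
println(sizesB.partitioner)  // Some(HashPartitioner(...))
// joining sizesB with data partitioned the same way avoids another full shuffle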
Few optimization tricks
5. Understand the basics & internals to avoid common mistakes
Example: using an RDD inside another RDD's operation. This is invalid, since only the driver can invoke operations on RDDs (not the workers):
rdd.map((k,v) => otherRDD.get(k)) -> rdd.join(otherRdd)
rdd.map(e => otherRdd.map {}) -> rdd.cartesian(otherRdd)
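A concrete sketch of the join fix; the two small RDDs here are made up for illustration:
val users  = sc.parallelize(Seq((10, "Ania Nowak"), (12, "Jan Kowalski")))
val logins = sc.parallelize(Seq((10, "10/05/2015 10:14:01"), (12, "10/05/2015 10:17:03")))
// instead of users.map { case (id, name) => (name, logins.lookup(id)) }  // fails: RDDs can't be used inside tasks
val joined = users.join(logins)  // (10, ("Ania Nowak", "10/05/2015 10:14:01")), ...
joined.collect().foreach(println)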
Resources & further read
● Spark Summit Talks @ Youtube
  SPARK SUMMIT EUROPE!!! October 27th to 29th
● "Learning Spark" book, published by O'Reilly
● Apache Spark Documentation
  ○ https://spark.apache.org/docs/latest/monitoring.html
  ○ https://spark.apache.org/docs/latest/tuning.html
● Mailing List aka solution-to-your-problem-is-probably-already-there
● Pretty soon my blog & github :)
Paweł Szulc
blog: http://rabbitonweb.com
twitter: @rabbitonweb
github: https://github.com/rabbitonweb
More Related Content

PDF
Writing your own RDD for fun and profit
PDF
Know your platform. 7 things every scala developer should know about jvm
PDF
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
PDF
Column Stride Fields aka. DocValues
PDF
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
PDF
Cassandra data structures and algorithms
PDF
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
PDF
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Writing your own RDD for fun and profit
Know your platform. 7 things every scala developer should know about jvm
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
Column Stride Fields aka. DocValues
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Cassandra data structures and algorithms
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...

What's hot (12)

PDF
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
PDF
Cassandra 2.1
PDF
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
ODP
Buenos Aires Drools Expert Presentation
PDF
Cassandra summit keynote 2014
PPTX
Utahbigmountain ancestrydnahbasehadoop9-7-2013billyetman-130928100600-phpapp02
PDF
Cassandra Summit 2015
PPTX
Data Mining with Splunk
PDF
Cassandra Summit 2013 Keynote
PDF
Cassandra London - C* Spark Connector
PDF
MongoDB: Optimising for Performance, Scale & Analytics
KEY
Rails Model Basics
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Cassandra 2.1
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
Buenos Aires Drools Expert Presentation
Cassandra summit keynote 2014
Utahbigmountain ancestrydnahbasehadoop9-7-2013billyetman-130928100600-phpapp02
Cassandra Summit 2015
Data Mining with Splunk
Cassandra Summit 2013 Keynote
Cassandra London - C* Spark Connector
MongoDB: Optimising for Performance, Scale & Analytics
Rails Model Basics
Ad

Viewers also liked (14)

PDF
Functional Programming & Event Sourcing - a pair made in heaven
PDF
Introduction to type classes
PDF
Apache spark workshop
PDF
Introduction to type classes in 30 min
PDF
The cats toolbox a quick tour of some basic typeclasses
PDF
Real world gobbledygook
PDF
“Going bananas with recursion schemes for fixed point data types”
PDF
Advanced Threat Detection on Streaming Data
PDF
Introduction to Spark
PDF
Spark workshop
PDF
Stock Prediction Using NLP and Deep Learning
PDF
Going bananas with recursion schemes for fixed point data types
PDF
Make your programs Free
PDF
Applying Machine Learning to Live Patient Data
Functional Programming & Event Sourcing - a pair made in heaven
Introduction to type classes
Apache spark workshop
Introduction to type classes in 30 min
The cats toolbox a quick tour of some basic typeclasses
Real world gobbledygook
“Going bananas with recursion schemes for fixed point data types”
Advanced Threat Detection on Streaming Data
Introduction to Spark
Spark workshop
Stock Prediction Using NLP and Deep Learning
Going bananas with recursion schemes for fixed point data types
Make your programs Free
Applying Machine Learning to Live Patient Data
Ad

Similar to Apache spark when things go wrong (20)

PPT
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
PDF
A deeper-understanding-of-spark-internals
PDF
A deeper-understanding-of-spark-internals-aaron-davidson
PPTX
Apache Spark Workshop
PDF
DTCC '14 Spark Runtime Internals
PDF
Advanced spark training advanced spark internals and tuning reynold xin
PDF
Apache Spark Internals - Part 2
PPTX
Study Notes: Apache Spark
PDF
Introduction to Apache Spark
PDF
Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab
PDF
Apache Spark Best Practices Meetup Talk
PDF
TriHUG talk on Spark and Shark
PDF
G017143640
PDF
Big Data Analysis and Its Scheduling Policy – Hadoop
PDF
Spark_RDD_SyedAcademy
PPTX
Berlin buzzwords 2018
PDF
Most Popular Hadoop Interview Questions and Answers
PDF
Introduction to Apache Spark
PDF
Intro to big data choco devday - 23-01-2014
PDF
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals-aaron-davidson
Apache Spark Workshop
DTCC '14 Spark Runtime Internals
Advanced spark training advanced spark internals and tuning reynold xin
Apache Spark Internals - Part 2
Study Notes: Apache Spark
Introduction to Apache Spark
Advanced Spark Programming - Part 1 | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Best Practices Meetup Talk
TriHUG talk on Spark and Shark
G017143640
Big Data Analysis and Its Scheduling Policy – Hadoop
Spark_RDD_SyedAcademy
Berlin buzzwords 2018
Most Popular Hadoop Interview Questions and Answers
Introduction to Apache Spark
Intro to big data choco devday - 23-01-2014
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)

More from Pawel Szulc (18)

PDF
Getting acquainted with Lens
PDF
Impossibility
PDF
Maintainable Software Architecture in Haskell (with Polysemy)
PDF
Painless Haskell
PDF
Trip with monads
PDF
Trip with monads
PDF
Illogical engineers
PDF
RChain - Understanding Distributed Calculi
PDF
Illogical engineers
PDF
Understanding distributed calculi in Haskell
PDF
Software engineering the genesis
PDF
Category theory is general abolute nonsens
PDF
Fun never stops. introduction to haskell programming language
PDF
Monads asking the right question
PDF
Apache Spark 101 [in 50 min]
PDF
Javascript development done right
PDF
Architektura to nie bzdura
ODP
Testing and Testable Code
Getting acquainted with Lens
Impossibility
Maintainable Software Architecture in Haskell (with Polysemy)
Painless Haskell
Trip with monads
Trip with monads
Illogical engineers
RChain - Understanding Distributed Calculi
Illogical engineers
Understanding distributed calculi in Haskell
Software engineering the genesis
Category theory is general abolute nonsens
Fun never stops. introduction to haskell programming language
Monads asking the right question
Apache Spark 101 [in 50 min]
Javascript development done right
Architektura to nie bzdura
Testing and Testable Code

Recently uploaded (20)

PDF
medical staffing services at VALiNTRY
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PDF
System and Network Administration Chapter 2
PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PDF
Odoo Companies in India – Driving Business Transformation.pdf
PDF
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PDF
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PPTX
Essential Infomation Tech presentation.pptx
PDF
Softaken Excel to vCard Converter Software.pdf
PDF
System and Network Administraation Chapter 3
PDF
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PDF
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PDF
Understanding Forklifts - TECH EHS Solution
PPTX
history of c programming in notes for students .pptx
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
medical staffing services at VALiNTRY
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
System and Network Administration Chapter 2
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
Odoo Companies in India – Driving Business Transformation.pdf
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
Which alternative to Crystal Reports is best for small or large businesses.pdf
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
How to Choose the Right IT Partner for Your Business in Malaysia
Design an Analysis of Algorithms I-SECS-1021-03
Essential Infomation Tech presentation.pptx
Softaken Excel to vCard Converter Software.pdf
System and Network Administraation Chapter 3
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
Understanding Forklifts - TECH EHS Solution
history of c programming in notes for students .pptx
Internet Downloader Manager (IDM) Crack 6.42 Build 41

Apache spark when things go wrong

  • 1. Apache Spark when things go wrong @rabbitonweb
  • 2. Apache Spark - when things go wrong val sc = new SparkContext(conf) sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println) If above seems odd, this talk is rather not for you. @rabbitonweb tweet compliments and complaints
  • 12. 4 things for take-away
  • 13. 4 things for take-away 1. Knowledge of how Apache Spark works internally
  • 14. 4 things for take-away 1. Knowledge of how Apache Spark works internally 2. Courage to look at Spark’s implementation (code)
  • 15. 4 things for take-away 1. Knowledge of how Apache Spark works internally 2. Courage to look at Spark’s implementation (code) 3. Notion of how to write efficient Spark programs
  • 16. 4 things for take-away 1. Knowledge of how Apache Spark works internally 2. Courage to look at Spark’s implementation (code) 3. Notion of how to write efficient Spark programs 4. Basic ability to monitor Spark when things are starting to act weird
  • 17. Two words about examples used
  • 18. Two words about examples used Super Cool App
  • 19. Two words about examples used Journal Super Cool App events
  • 20. spark cluster Two words about examples used Journal Super Cool App events node 1 node 2 node 3 master A master B
  • 21. spark cluster Two words about examples used Journal Super Cool App events node 1 node 2 node 3 master A master B
  • 22. spark cluster Two words about examples used Journal Super Cool App events node 1 node 2 node 3 master A master B
  • 23. Two words about examples used
  • 24. Two words about examples used sc.textFile(“hdfs://journal/*”)
  • 25. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _)
  • 26. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) }
  • 27. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 28. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300)
  • 29. Two words about examples used sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println)
  • 30. What is a RDD?
  • 31. What is a RDD? Resilient Distributed Dataset
  • 32. What is a RDD? Resilient Distributed Dataset
  • 33. ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski What is a RDD?
  • 34. node 1 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski node 2 node 3 What is a RDD?
  • 35. node 1 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski node 2 node 3 ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski ... 10 10/05/2015 10:14:01 UserInitialized Ania Nowak 10 10/05/2015 10:14:55 FirstNameChanged Anna 12 10/05/2015 10:17:03 UserLoggedIn 12 10/05/2015 10:21:31 UserLoggedOut … 198 13/05/2015 21:10:11 UserInitialized Jan Kowalski What is a RDD?
  • 36. What is a RDD?
  • 37. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work:
  • 38. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to his parent
  • 39. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to his parent 2. how its internal data is partitioned
  • 40. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to his parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 41. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to his parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 42. What is a partition?
  • 43. What is a partition? A partition represents subset of data within your distributed collection.
  • 44. What is a partition? A partition represents subset of data within your distributed collection. override def getPartitions: Array[Partition] = ???
  • 45. What is a partition? A partition represents subset of data within your distributed collection. override def getPartitions: Array[Partition] = ??? How this subset is defined depends on type of the RDD
  • 46. example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”)
  • 47. example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”) How HadoopRDD is partitioned?
  • 48. example: HadoopRDD val journal = sc.textFile(“hdfs://journal/*”) How HadoopRDD is partitioned? In HadoopRDD partition is exactly the same as file chunks in HDFS
  • 49. example: HadoopRDD 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 50. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 51. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 52. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 53. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 54. example: HadoopRDD node 1 10 10/05/2015 10:14:01 UserInit 3 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 4 10/05/2015 10:21:31 UserLo 5 13/05/2015 21:10:11 UserIni node 2 node 3 16 10/05/2015 10:14:01 UserInit 20 10/05/2015 10:14:55 FirstNa 42 10/05/2015 10:17:03 UserLo 67 10/05/2015 10:21:31 UserLo 12 13/05/2015 21:10:11 UserIni 10 10/05/2015 10:14:01 UserInit 10 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 12 10/05/2015 10:21:31 UserLo 198 13/05/2015 21:10:11 UserIni 5 10/05/2015 10:14:01 UserInit 4 10/05/2015 10:14:55 FirstNa 12 10/05/2015 10:17:03 UserLo 142 10/05/2015 10:21:31 UserLo 158 13/05/2015 21:10:11 UserIni
  • 55–57. example: HadoopRDD
  class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
    ...
    override def getPartitions: Array[Partition] = {
      val jobConf = getJobConf()
      SparkHadoopUtil.get.addCredentials(jobConf)
      val inputFormat = getInputFormat(jobConf)
      if (inputFormat.isInstanceOf[Configurable]) {
        inputFormat.asInstanceOf[Configurable].setConf(jobConf)
      }
      val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
      val array = new Array[Partition](inputSplits.size)
      for (i <- 0 until inputSplits.size) {
        array(i) = new HadoopPartition(id, i, inputSplits(i))
      }
      array
    }
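  The partition count produced by getSplits is easy to verify from the shell. A minimal sketch (same journal path as in the talk; minPartitions is only a lower-bound hint, and the actual counts depend on your input splits):

  val journal = sc.textFile("hdfs://journal/*")        // one partition per input split
  println(journal.partitions.length)

  val journal64 = sc.textFile("hdfs://journal/*", 64)  // minPartitions hint forwarded to getSplits
  println(journal64.partitions.length)                 // usually >= 64, not guaranteed to be exactly 64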
  • 58. example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  • 59. example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } How is a MapPartitionsRDD partitioned?
  • 60. example: MapPartitionsRDD val journal = sc.textFile(“hdfs://journal/*”) val fromMarch = journal.filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } How is a MapPartitionsRDD partitioned? A MapPartitionsRDD inherits its partition information from its parent RDD
  • 61. example: MapPartitionsRDD
  class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) {
    ...
    override def getPartitions: Array[Partition] = firstParent[T].partitions
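  In other words, narrow transformations such as filter and map reuse the parent's partitions one to one. A quick sketch to confirm this from the shell (the predicate here is a stand-in, not the one from the slides):

  val journal   = sc.textFile("hdfs://journal/*")
  val fromMarch = journal.filter(_.nonEmpty)   // any narrow transformation will do
  assert(fromMarch.partitions.length == journal.partitions.length)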
  • 62. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 63. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
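  Those three pieces of information map directly onto members of the RDD class. A simplified sketch against the Spark 1.x API used in this talk (signatures abbreviated, not a working subclass):

  import scala.reflect.ClassTag
  import org.apache.spark.{Dependency, Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  abstract class SketchRDD[T: ClassTag](sc: SparkContext, deps: Seq[Dependency[_]])
      extends RDD[T](sc, deps) {                                        // 1. parent(s), via the dependencies
    protected def getPartitions: Array[Partition]                       // 2. how the data is partitioned
    def compute(split: Partition, context: TaskContext): Iterator[T]    // 3. how to evaluate one partition
  }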
  • 64. RDD parent sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println)
  • 65. RDD parent sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } .take(300) .foreach(println)
  • 66. RDD parent sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 67. Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 68. Directed acyclic graph HadoopRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 69. Directed acyclic graph HadoopRDD ShuffledRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 70. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 71. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 72. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 73. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two types of parent dependencies: 1. narrow dependency 2. wide (shuffle) dependency
  • 74. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two types of parent dependencies: 1. narrow dependency 2. wide (shuffle) dependency
  • 75. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two types of parent dependencies: 1. narrow dependency 2. wide (shuffle) dependency
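  You can see which kind of dependency each transformation created by inspecting rdd.dependencies from the shell. A sketch (extractDate is the helper used throughout the talk):

  val grouped = sc.textFile("hdfs://journal/*").groupBy(extractDate _)      // wide: needs a shuffle
  val counted = grouped.map { case (date, events) => (date, events.size) }  // narrow

  grouped.dependencies.foreach(d => println(d.getClass.getSimpleName))  // ShuffleDependency
  counted.dependencies.foreach(d => println(d.getClass.getSimpleName))  // OneToOneDependency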
  • 76. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 77. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Tasks
  • 78. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Tasks
  • 79. Directed acyclic graph HadoopRDD ShuffledRDD MapPartRDD MapPartRDD sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 80. Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 81. Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 82. Stage 1 Stage 2 Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()
  • 83. Stage 1 Stage 2 Directed acyclic graph sc.textFile() .groupBy() .map { } .filter { } .take() .foreach() Two important concepts: 1. shuffle write 2. shuffle read
  • 84. toDebugString val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 85. toDebugString val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } scala> events.toDebugString
  • 86. toDebugString
  val events = sc.textFile(“hdfs://journal/*”)
    .groupBy(extractDate _)
    .map { case (date, events) => (date, events.size) }
    .filter {
      case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
    }
  scala> events.toDebugString
  res5: String =
  (4) MapPartitionsRDD[22] at filter at <console>:50 []
   |  MapPartitionsRDD[21] at map at <console>:49 []
   |  ShuffledRDD[20] at groupBy at <console>:48 []
   +-(6) HadoopRDD[17] at textFile at <console>:47 []
  • 87. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 88. What is a RDD? RDD needs to hold 3 chunks of information in order to do its work: 1. pointer to its parent 2. how its internal data is partitioned 3. how to evaluate its internal data
  • 89. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { }
  • 90. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect()
  • 91. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect() action
  • 92. Stage 1 Stage 2 Running Job aka materializing DAG sc.textFile() .groupBy() .map { } .filter { } .collect() action Actions are implemented using sc.runJob method
  • 93. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( ): Array[U]
  • 94. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( rdd: RDD[T], ): Array[U]
  • 95. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( rdd: RDD[T], partitions: Seq[Int], ): Array[U]
  • 96. Running Job aka materializing DAG /** * Run a function on a given set of partitions in an RDD and return the results as an array. */ def runJob[T, U]( rdd: RDD[T], partitions: Seq[Int], func: Iterator[T] => U, ): Array[U]
  • 97. Running Job aka materializing DAG
  /** Return an array that contains all of the elements in this RDD. */
  def collect(): Array[T] = {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }
  • 98. Running Job aka materializing DAG
  /** Return an array that contains all of the elements in this RDD. */
  def collect(): Array[T] = {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }
  /** Return the number of elements in the RDD. */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
  • 99. Multiple jobs for single action
  /** Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit. */
  def take(num: Int): Array[T] = {
    (….)
    val left = num - buf.size
    val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
    (….)
    res.foreach(buf ++= _.take(num - buf.size))
    partsScanned += numPartsToTry
    (….)
    buf.toArray
  }
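  All of these actions funnel into runJob, and you can call it yourself to see the per-partition results it returns. A minimal sketch using the simplest overload shown in collect above (one result per partition):

  val journal = sc.textFile("hdfs://journal/*")
  val linesPerPartition: Array[Long] =
    sc.runJob(journal, (iter: Iterator[String]) => iter.size.toLong)
  println(linesPerPartition.sum)   // same number that journal.count() would return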
  • 101. Towards efficiency val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 102–103. Towards efficiency
  val events = sc.textFile(“hdfs://journal/*”)
    .groupBy(extractDate _)
    .map { case (date, events) => (date, events.size) }
    .filter {
      case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
    }
  scala> events.toDebugString
  (4) MapPartitionsRDD[22] at filter at <console>:50 []
   |  MapPartitionsRDD[21] at map at <console>:49 []
   |  ShuffledRDD[20] at groupBy at <console>:48 []
   +-(6) HadoopRDD[17] at textFile at <console>:47 []
  events.count
  • 104. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 105. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 106. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 107. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 108. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 109. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 110. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 111. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 112. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 113. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 114. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 115. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 116. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 117. Stage 1 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 119. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 120. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 121. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 122. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 123. Everyday I’m Shuffling val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 124. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 125. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 126. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 127. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 128. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 129. Stage 2 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) } node 1 node 2 node 3
  • 130. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 131. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 132. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 133. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 134. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 135. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 136. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 137. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 138. Stage 2 node 1 node 2 node 3 val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 139. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 140. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 141. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
  • 142. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) }
  • 143. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) }
  • 144. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 145. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 146. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 147. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 148. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) } node 1 node 2 node 3
  • 149. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) }
  • 150. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .map( e => (extractDate(e), e)) .combineByKey(e => 1, (counter: Int, e: String) => counter + 1, (c1: Int, c2: Int) => c1 + c2) — replacing the earlier .groupBy(extractDate _, 6) .map { case (date, events) => (date, events.size) }
  • 151. Let's refactor val events = sc.textFile(“hdfs://journal/*”) .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .map( e => (extractDate(e), e)) .combineByKey(e => 1, (counter: Int, e: String) => counter + 1, (c1: Int, c2: Int) => c1 + c2)
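  For this particular count-per-date job, the same map-side combining can be written more simply with reduceByKey. A sketch, equivalent in behaviour to the combineByKey version above (extractDate and the LocalDate import as before):

  val eventsPerDate = sc.textFile("hdfs://journal/*")
    .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015, 3, 1) }
    .map(e => (extractDate(e), 1))
    .reduceByKey(_ + _)   // partial counts are combined on the map side, as with combineByKey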
  • 152. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here small number of partitions, let’s say 4 .map( e => (extractDate(e), e))
  • 153. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here small number of partitions, let’s say 4 .repartition(256) // note, this will cause a shuffle .map( e => (extractDate(e), e))
  • 154. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here a lot of partitions, let’s say 1024 .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .map( e => (extractDate(e), e))
  • 155. A bit more about partitions val events = sc.textFile(“hdfs://journal/*”) // here a lot of partitions, let’s say 1024 .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) } .coalesce(64) // this will NOT shuffle .map( e => (extractDate(e), e))
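  The difference is visible directly in the lineage. A quick sketch to check it from the shell (partition counts are illustrative):

  val raw = sc.textFile("hdfs://journal/*")
  println(raw.repartition(256).toDebugString)  // lineage contains a ShuffledRDD: repartition always shuffles
  println(raw.coalesce(64).toDebugString)      // a CoalescedRDD on top of the parent: no shuffle stage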
  • 188. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer)
  • 189. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster
  • 190. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs
  • 191. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey
  • 192. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner)
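  The first three points are plain configuration switches. A sketch of how they might be set (property names from the Spark configuration docs; the app name and codec value are just illustrative choices):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("journal-stats")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // 1. serialization
    .set("spark.speculation", "true")                                       // 2. speculative execution
    .set("spark.io.compression.codec", "lz4")                               // 3. compression codec
  val sc = new SparkContext(conf)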
  • 193–197. Spark performance - shuffle optimization (diagrams): map → groupBy → join. Optimization: shuffle avoided if data is already partitioned
  • 198–201. Spark performance - shuffle optimization (diagrams): map → groupBy → map → join
  • 202–205. Spark performance - shuffle optimization (diagrams): map → groupBy → mapValues → join
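  These diagrams illustrate point 4b from the list: mapValues keeps the parent's partitioner, a plain map drops it, so a later join on the same key either reuses the existing partitioning or shuffles again. A sketch (extractDate as before, partition count illustrative):

  import org.apache.spark.HashPartitioner

  val byDate = sc.textFile("hdfs://journal/*")
    .map(e => (extractDate(e), e))
    .partitionBy(new HashPartitioner(64))

  println(byDate.mapValues(_.length).partitioner)                   // Some(HashPartitioner) - a join by date can skip the shuffle
  println(byDate.map { case (k, v) => (k, v.length) }.partitioner)  // None - the partitioner is lost, the join shuffles again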
  • 206. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner)
  • 207. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to avoid common mistakes
  • 208. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to avoid common mistakes Example: usage of an RDD within another RDD. Invalid, since only the driver can call operations on RDDs (not the workers)
  • 209. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to avoid common mistakes Example: usage of an RDD within another RDD. Invalid, since only the driver can call operations on RDDs (not the workers) rdd.map((k,v) => otherRDD.get(k)) rdd.map(e => otherRdd.map {})
  • 210. A few optimization tricks 1. Serialization issues (e.g. KryoSerializer) 2. Turn on speculation if on a shared cluster 3. Experiment with compression codecs 4. Learn the API a. groupBy->map vs combineByKey b. map vs mapValues (partitioner) 5. Understand basics & internals to avoid common mistakes Example: usage of an RDD within another RDD. Invalid, since only the driver can call operations on RDDs (not the workers) rdd.map((k,v) => otherRDD.get(k)) -> rdd.join(otherRdd) rdd.map(e => otherRdd.map {}) -> rdd.cartesian(otherRdd)
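  When otherRdd is small enough to fit in memory, a third option besides join and cartesian is to ship it to every executor as a broadcast variable. A sketch (rdd and otherRdd are the illustrative names from the slide, assumed here to be pair RDDs):

  val small    = sc.broadcast(otherRdd.collect().toMap)               // collected once on the driver, shipped to executors
  val resolved = rdd.map { case (k, v) => (k, small.value.get(k)) }   // lookup without a nested RDD or a join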
  • 211. Resources & further read ● Spark Summit Talks @ Youtube
  • 212. Resources & further read ● Spark Summit Talks @ Youtube SPARK SUMMIT EUROPE!!! October 27th to 29th
  • 213. Resources & further read ● Spark Summit Talks @ Youtube
  • 214. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly
  • 215. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly ● Apache Spark Documentation ○ https://spark.apache.org/docs/latest/monitoring.html ○ https://spark.apache.org/docs/latest/tuning.html
  • 216. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly ● Apache Spark Documentation ○ https://spark.apache.org/docs/latest/monitoring.html ○ https://spark.apache.org/docs/latest/tuning.html ● Mailing List aka solution-to-your-problem-is-probably-already-there
  • 217. Resources & further read ● Spark Summit Talks @ Youtube ● “Learning Spark” book, published by O’Reilly ● Apache Spark Documentation ○ https://spark.apache.org/docs/latest/monitoring.html ○ https://spark.apache.org/docs/latest/tuning.html ● Mailing List aka solution-to-your-problem-is-probably-already-there ● Pretty soon my blog & github :)
  • 223. Paweł Szulc blog: http://rabbitonweb.com twitter: @rabbitonweb github: https://github.com/rabbitonweb