Why your Spark job is failing
● Data science at Cloudera 
● Recently led Apache Spark development at 
Cloudera 
● Before that, committing on Apache YARN 
and MapReduce 
● Hadoop project management committee
com.esotericsoftware.kryo. 
KryoException: Unable to find class: 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC 
$$iwC$$iwC$$anonfun$4$$anonfun$apply$3
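(The $iwC$$iwC... wrappers are how the Scala shell nests each interpreted line; an error like this usually means an executor tried to deserialize a closure or class defined in the REPL that it could not load.)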
org.apache.spark.SparkException: Job aborted due to stage failure: Task 
0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on 
host bottou02-10g.pa.cloudera.com: java.lang.ArithmeticException: / by 
zero 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:13) 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) 
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:13) 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016) 
[...] 
Driver stacktrace: 
at org.apache.spark.scheduler.DAGScheduler. 
org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages 
(DAGScheduler.scala:1033) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1. 
apply(DAGScheduler.scala:1017) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1. 
apply(DAGScheduler.scala:1015) 
[...]
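The frames mentioning <console>:13 point back at user code typed into the shell. A minimal spark-shell sketch (hypothetical, not the presenter's code) that would produce a trace like this:

val nums = sc.parallelize(1 to 10)
nums.map(x => x / (x - x)).count()  // every element divides by zero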
val file = sc.textFile("hdfs://...")
file.filter(_.startsWith("banana"))
  .count()
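filter is lazy; only the count() action actually launches a job, which Spark then breaks into stages and tasks.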
[Diagram: a job is divided into stages, and each stage into parallel tasks]
val rdd1 = sc.textFile("hdfs://...")
  .map(someFunc)
  .filter(filterFunc)

textFile → map → filter
val rdd2 = sc.hadoopFile("hdfs://...")
  .groupByKey()
  .map(someOtherFunc)

hadoopFile → groupByKey → map
val rdd3 = rdd1.join(rdd2) 
.map(someFunc) 
join → map
rdd3.collect()
[Diagram: the combined lineage — textFile → map → filter and hadoopFile → groupByKey → map, feeding into join → map]
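You don't have to draw this by hand: the RDD API's toDebugString prints the lineage, with indentation marking the shuffle boundaries that become stage splits.

println(rdd3.toDebugString)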
[Diagram: the lineage is cut at shuffle boundaries into three stages, each running its own set of parallel tasks]
14/04/22 11:59:58 ERROR executor.Executor: Exception in task ID 2866
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:565)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:648)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:706)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:206)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:45)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
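This trace comes from an executor, not the driver, so it shows up in the executor's logs on the cluster rather than in the shell. On YARN with log aggregation enabled you can usually retrieve them with `yarn logs -applicationId <appId>` after the application finishes.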
[Diagram: YARN architecture — the Client submits an application to the ResourceManager, which launches an ApplicationMaster in a container on a NodeManager]
[Diagram: the ApplicationMaster requests further containers from the ResourceManager and runs the application's tasks in them (e.g. map and reduce tasks)]
Container [pid=63375, containerID=container_1388158490598_0001_01_000003]
is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB
physical memory used; 2.8 GB of 4.2 GB virtual memory used. Killing
container.
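YARN's NodeManager enforces the container's allocation, and an executor's total footprint is more than its JVM heap: off-heap allocations, thread stacks, and interned strings all count against the same limit. The properties below control how that memory is carved up.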
yarn.nodemanager.resource.memory-mb
  Executor container
    spark.yarn.executor.memoryOverhead
    spark.executor.memory
      spark.shuffle.memoryFraction
      spark.storage.memoryFraction
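A common fix when containers are killed like this is to leave more headroom between the heap and the container size. A sketch with illustrative values (Spark 1.x property names; the overhead is in megabytes):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "2g")                 // executor JVM heap
  .set("spark.yarn.executor.memoryOverhead", "1024")  // extra MB of off-heap headroom in the container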
[Diagram: on the reduce side of a shuffle, fetched blocks are deserialized into an ExternalAppendOnlyMap of key → values; when the map outgrows the memory available to it, its contents are sorted and spilled to disk]
rdd.reduceByKey(reduceFunc, numPartitions = 1000)
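Raising numPartitions means each reduce task handles fewer keys, so each ExternalAppendOnlyMap holds less in memory at once; it is often the simplest fix for shuffle-side out-of-memory errors.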
java.io.FileNotFoundException:
/dn6/spark/local/spark-local-20140610134115-2cee/30/merged_shuffle_0_368_14
(Too many open files)
[Diagram: during the shuffle write, each task buffers key → values and writes them out to files for the reduce side to fetch]
[Diagram: hash-based shuffle — each map task keeps a separate file, with its own in-memory buffer, open for every reduce partition]
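With one file and buffer per reduce partition, a shuffle with M map tasks and R reduce partitions can create on the order of M × R files per node, which is how the "Too many open files" error above arises.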
[Diagram: sort-based shuffle — records are buffered, sorted by partition, and spilled into a single file per map task, with an index file recording where each partition starts]
conf.set("spark.shuffle.manager", "sort")
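Sort-based shuffle writes one sorted file plus an index per map task, which keeps file-handle counts down; it became the default in Spark 1.2. A minimal sketch of enabling it explicitly when building the context:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().set("spark.shuffle.manager", "sort")
val sc = new SparkContext(conf)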
● No 
● Distributed systems are complicated