Hadoop MapReduce - System's View
By Niketan Pansare (np6@rice.edu)
Rice University
Job Submission at Client's side
(Nodes involved: Client Node, Job tracker Node, Task tracker Node)
Client Node:
The client pgm creates a Job and calls job.submit().
job.submit() creates a JobClient and calls jobClient.submitJobInternal().
The JobClient holds a client stub to the JobTracker (jobSubmissionClient);
jobSubmissionClient.getNewJobID() is an RPC call to the JobTracker.
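To make the boxes above concrete, here is a minimal driver of the kind the
client pgm runs; a sketch against the Hadoop 1.x "new" API, with placeholder
paths and the default identity map/reduce so it stays short. job.submit() is
exactly the entry point traced on these slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "submit-example");
    job.setJarByClass(SubmitExample.class);
    // Defaults: identity Mapper/Reducer over TextInputFormat records,
    // so the output key/value types are LongWritable/Text.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.submit();                 // kicks off the flow shown above
    job.waitForCompletion(true);  // then poll until done (verbose)
  }
}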
Inside jobClient.submitJobInternal(), the output specification is checked:
jobConf.getOutputFormat().checkOutputSpecs()
Copy Job Resources
1. Get destination paths from JobSubmissionFiles:
   - Job staging area (getStagingArea())
   - Job submission area
   - Job config file path (getJobConfPath())
   - Job jar file path (getJobJar())
   - Information about splits:
     (a) split meta file (getJobSplitMetaFile())
     (b) split file (getJobSplitFile())
2. Copy job resources (jar) to the shared FS (HDFS):
   the jar file, with replication = mapred.submit.replication (default: 10)
3. Copy job resources (splits/config) to the shared FS (HDFS):
   a. Compute splits:
      jobConf.getInputFormat().getSplits()
   b. Sort splits based on size (biggest goes first)
      - Modify Arrays.sort() in writeSplits() for randomization
      (a code sketch of steps a and b follows this list)
   c. Copy the split “meta” file (JobSplit.SplitMetaInfo) to the jobtracker,
      into the path given by getJobSplitMetaFile()
   d. Copy the split file (indexed per task by JobSplit.TaskSplitIndex) to
      HDFS (replication = 10), into the path given by getJobSplitFile()
   e. Copy the job config file to the JobTracker, into the path given by
      getJobConfPath()
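Steps (a) and (b) can be pictured with the old-API InputFormat directly. A
small sketch; SplitHelper is a made-up name, and the real sorting happens
inside JobClient's writeSplits() path:

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

class SplitHelper {  // hypothetical helper, for illustration only
  static InputSplit[] computeAndSortSplits(JobConf jobConf) throws IOException {
    InputFormat<?, ?> inputFormat = jobConf.getInputFormat();
    // (a) compute splits; the hint is the configured number of map tasks
    InputSplit[] splits =
        inputFormat.getSplits(jobConf, jobConf.getNumMapTasks());
    // (b) biggest split first, so the longest tasks start earliest
    Arrays.sort(splits, new Comparator<InputSplit>() {
      public int compare(InputSplit a, InputSplit b) {
        try {
          return Long.compare(b.getLength(), a.getLength());
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    });
    return splits;
  }
}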
After copying the job resources (jar, split files, config), the client stub
makes the RPC call submitJob() to the JobTracker.

Done with job submission at the client side...
Now let's look at the JobTracker's side.
Job Submission at Job tracker node

submitJob() arrives at the JobTracker, which:

1. Reads the job config file and creates a JobInProgress (job) object.

2. Calls job.initTasks():
   - createSplits(): read the split meta file (JobSplit.SplitMetaInfo)
     and build JobSplit.TaskSplitMetaInfo[] splits.
   - Create TaskInProgress[] maps: 1 map per split.
   - Build Map<Node, List<TIP>> nonRunningMapCache.
   - Create TaskInProgress[] reduces: count given by mapred.reduce.tasks.
   - Build Set<TaskInProgress> nonRunningReduces.
   - Other bookkeeping structures: runningMapCache, nonLocalMaps,
     failedMaps, ... + JobProfile, JobStatus.
   - Create TaskInProgress[2] setup and TaskInProgress[2] cleanup.
     These are run by a TaskTracker and are used to set up and clean up
     the job (2 = one for the map side and the other for the reduce side).

   What code does a TaskInProgress run?
   - For the map and reduce tasks: user-defined.
   - For setup and cleanup: the class specified by
     mapred.output.committer.class. Default: FileOutputCommitter
     (a committer sketch follows at the end of this section).

   Done initializing.

3. Asks the QueueManager (queueManager) whether the queue exists and
   whether the user has permission to submit to it.

4. Calls addJob() and notifies the listeners of the queue.

Done submitting the job !!!
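Since the setup/cleanup TIPs just run the job's OutputCommitter, swapping in
a custom one is a one-class job. A hedged sketch against the Hadoop 1.x old
API; LoggingOutputCommitter is a made-up name, and the exact signatures
should be treated as assumptions:

import java.io.IOException;
import org.apache.hadoop.mapred.FileOutputCommitter;
import org.apache.hadoop.mapred.JobContext;

public class LoggingOutputCommitter extends FileOutputCommitter {
  @Override
  public void setupJob(JobContext context) throws IOException {
    System.out.println("setup TIP: job starting");
    super.setupJob(context);    // default: create the output/_temporary area
  }
  @Override
  public void cleanupJob(JobContext context) throws IOException {
    super.cleanupJob(context);  // default: promote outputs, drop _temporary
    System.out.println("cleanup TIP: job finished");
  }
}

Wire it in via the mapred.output.committer.class knob from above, e.g.:
jobConf.setOutputCommitter(LoggingOutputCommitter.class);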
TaskScheduler class
• Used by the JobTracker to schedule Tasks on TaskTrackers.
• Uses one or more JobInProgressListeners to receive notifications about jobs.
• Uses ClusterStatus to get info about the state of the cluster.
• Methods (a do-nothing skeleton is sketched after this slide):
  • start(), terminate(), refresh()
  • Collection<JobInProgress> getJobs(String queueName)
  • List<Task> assignTasks(TaskTracker)
• Implementations:
  • Specified by mapred.jobtracker.taskScheduler
  • Default: FIFO scheduler (o.a.h.mapred.JobQueueTaskScheduler)
    - Multiple queues, each with a different priority (VERY_HIGH, HIGH, ...)
    - The user specifies the job priority (mapred.job.priority)
    - Logic: first select the queue with the highest priority,
      then FIFO within that queue
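The plug-in surface is small. A do-nothing skeleton using only the methods
listed above; a sketch, since TaskScheduler and its collaborators are
package-level types, so a real implementation must live in
org.apache.hadoop.mapred, and the exact signatures here are assumptions:

package org.apache.hadoop.mapred;

import java.io.IOException;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

class DoNothingTaskScheduler extends TaskScheduler {
  @Override
  public void start() throws IOException {
    // register JobInProgressListeners with the taskTrackerManager here
  }
  @Override
  public void terminate() throws IOException {
    // unregister listeners, stop any helper threads
  }
  @Override
  public List<Task> assignTasks(TaskTracker tracker) throws IOException {
    return Collections.emptyList();  // never hands out work
  }
  @Override
  public Collection<JobInProgress> getJobs(String queueName) {
    return Collections.emptyList();
  }
}

Activate it with mapred.jobtracker.taskScheduler set to the class name.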
Task Scheduling
(Job tracker Node)

The JobQueueTaskScheduler registers a JIPListener (JobInProgressListener)
with the JobTracker; when a job is added, the JobTracker invokes the
callback jobAdded(JIP).

List<Task> assignTasks(TaskTracker):

1. Calculate availableMapSlots:
availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps
                  = min(ceil(mapLoadFactor * trackerMapCapacity),
                        trackerMapCapacity) - trackerRunningMaps

where
  trackerMapCapacity = taskTrackerStatus.getMaxMapSlots()
  trackerRunningMaps = taskTrackerStatus.countMapTasks()
  mapLoadFactor = [ sum over all jobs of
                    (JIP's numMapTasks - finishedMapTasks) ]
                  / clusterStatus.getMaxMapTasks()

(Inputs come from TaskTrackerStatus, ClusterStatus, and each
JobInProgress (JIP), via the JIPListener.)
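In code, the calculation above is just a few lines. A sketch with the
slide's names as parameters; the real version sits inside
JobQueueTaskScheduler.assignTasks():

class SlotMath {
  static int availableMapSlots(int trackerMapCapacity,  // getMaxMapSlots()
                               int trackerRunningMaps,  // countMapTasks()
                               int clusterMapCapacity,  // getMaxMapTasks()
                               int remainingMapLoad) {  // sum(numMapTasks - finishedMapTasks)
    double mapLoadFactor = clusterMapCapacity == 0
        ? 1.0 : (double) remainingMapLoad / clusterMapCapacity;
    int trackerCurrentMapCapacity =
        Math.min((int) Math.ceil(mapLoadFactor * trackerMapCapacity),
                 trackerMapCapacity);
    return Math.max(0, trackerCurrentMapCapacity - trackerRunningMaps);
  }
}

For example: cluster capacity 100, remaining map load 50, a tracker with 4
slots and 1 running map gives mapLoadFactor = 0.5, current capacity =
ceil(0.5 * 4) = 2, so 1 slot is handed out.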
2. Assign map tasks:

for (i = 1 to availableMapSlots) {
  for (JIP job : JIPListener.getJobQueue()) {
    Task t = job.findNewMapTask()
    assignedTasks.add(t)
    // Also, make sure there are free slots in the cluster
    // for speculative tasks
  }
}

3. Do the same thing for reducers.

4. return assignedTasks

Notes:
- JIPListener's getJobQueue() uses a Map<JobSchedulingInfo, JIP> with the
  FIFO_JOB_QUEUE comparator, so jobs in the higher-priority queue are
  processed first.
- job.findNewMapTask() returns, in order of preference:
  (a) the task with the most failures, without locality, as long as it is
      not on the given machine (from JIP's failedMaps);
  (b) a non-running task chosen using locality info (from JIP's
      nonRunningMapCache);
  (c) a speculative task.
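Putting the pieces together, a self-contained toy version of this loop in
plain Java; ToyJob, ToyFifoScheduler, and findNewMapTask() mimic the slide's
shorthand and are not the Hadoop classes:

import java.util.*;

class ToyJob {
  final int priority;
  final Deque<String> pendingMaps;
  ToyJob(int priority, List<String> maps) {
    this.priority = priority;
    this.pendingMaps = new ArrayDeque<String>(maps);
  }
  String findNewMapTask() { return pendingMaps.poll(); } // null when drained
}

public class ToyFifoScheduler {
  private final List<ToyJob> jobQueue = new ArrayList<ToyJob>();

  void jobAdded(ToyJob job) {  // cf. the jobAdded(JIP) callback
    jobQueue.add(job);
    // Stable sort: highest priority first, FIFO within equal priority
    jobQueue.sort(Comparator.comparingInt((ToyJob j) -> -j.priority));
  }

  List<String> assignTasks(int availableMapSlots) {
    List<String> assignedTasks = new ArrayList<String>();
    for (int i = 0; i < availableMapSlots; i++) {
      for (ToyJob job : jobQueue) {
        String t = job.findNewMapTask();
        if (t != null) { assignedTasks.add(t); break; } // one task per slot
      }
    }
    return assignedTasks;
  }

  public static void main(String[] args) {
    ToyFifoScheduler s = new ToyFifoScheduler();
    s.jobAdded(new ToyJob(1, Arrays.asList("m1", "m2")));
    s.jobAdded(new ToyJob(5, Arrays.asList("m3")));
    System.out.println(s.assignTasks(2)); // [m3, m1]: higher priority first
  }
}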
TaskScheduler class - Implementations (continued)

• Default: FIFO scheduler (o.a.h.mapred.JobQueueTaskScheduler)
  - Does not support preemption
  - Bad for a production cluster (high priority can be misused)

• Facebook's FairScheduler
  Goal: provide fast response time for small jobs and guaranteed service
  levels for production jobs.
  Pools example: the cluster has 100 slots available; two pools have min
  shares of 30 slots and 40 slots, and a third has none. Allocation:
  40 slots / 30 slots / 30 slots (the remaining slots go to the pool
  without a min share). Within a pool, slots are shared fairly among its
  jobs: two jobs in a 30-slot pool get 15 slots each.
  Additional features:
  - Job weights for unequal sharing (based on priority or size)
  - Limits on the number of running jobs per user/pool
  Usage:
  - cp build/contrib/fairscheduler/*.jar lib
  - set mapred.jobtracker.taskScheduler to o.a.h.m.FairScheduler
  - set mapred.fairscheduler.allocation.file to /path/pool.xml
    (an example file is sketched below)

• Yahoo's CapacityScheduler
  ~ FairScheduler, but queues instead of pools.
  Each queue gets a share (%) of the cluster; a queue can hold jobs of
  different priorities.
  FIFO scheduling within each queue, so scheduling is more deterministic
  than FairScheduler's.
  Also, unlike the other two, provides support for memory-based
  scheduling and preemption.
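A hedged sketch of the allocation file referenced above
(mapred.fairscheduler.allocation.file). The element names follow the Hadoop
1.x fair scheduler documentation as best recalled, so verify them against
your version; the numbers mirror the slide's min-share example:

<?xml version="1.0"?>
<allocations>
  <pool name="research">
    <minMaps>30</minMaps>        <!-- "Min share: 30 slots" -->
    <minReduces>10</minReduces>
  </pool>
  <pool name="production">
    <minMaps>40</minMaps>        <!-- "Min share: 40 slots" -->
    <minReduces>20</minReduces>
    <weight>2.0</weight>         <!-- unequal sharing, per "job weights" -->
  </pool>
  <userMaxJobsDefault>5</userMaxJobsDefault> <!-- cap running jobs per user -->
</allocations>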
Task creation
Job tracker Node Task tracker Node
JobTracker TaskTracker jobClient
HeartbeatResponse heartbeatResponse =
jobClient.heartbeat(…);
this.jobClient = (InterTrackerProtocol)
UserGroupInformation.getLoginUser().doAs(
new PrivilegedExceptionAction<Object>() {
public Object run() throws IOException {
return RPC.waitForProxy(InterTrackerProtocol.class,
InterTrackerProtocol.versionID,
jobTrackAddr, fConf);
}
});
TaskScheduler
List<Task> assignTasks(TaskTracker)
Heartbeat protocol:
- Periodic
- Indicates health of the TaskTracker
- Failure detection
- Remote Procedure Call
- Piggybacks directives:
  - Launch a task
  - Perform cleanup/commit
Wednesday, March 27, 13
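For reference, the heartbeat RPC behind the proxy above has roughly the following shape in the 1.x source; the parameter list is reconstructed from memory and should be treated as an approximation, not a verbatim copy.

// The real types live in o.a.h.mapred; minimal stubs so the sketch stands alone
class TaskTrackerStatus { /* load, free slots, per-task statuses */ }
class HeartbeatResponse { /* responseId + TaskTrackerAction[] directives */ }

// Approximate shape of o.a.h.mapred.InterTrackerProtocol's heartbeat RPC
interface InterTrackerProtocolSketch {
  HeartbeatResponse heartbeat(TaskTrackerStatus status,  // load, slots, task states
                              boolean restarted,         // tracker restarted since last beat
                              boolean initialContact,    // first-ever heartbeat
                              boolean acceptNewTasks,    // tracker has free slots
                              short responseId)          // detects lost responses
      throws java.io.IOException;
}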
Task creation
Job tracker Node Task tracker Node
JobTracker TaskTracker
TaskScheduler
jobClient
HeartbeatResponse heartbeatResponse =
jobClient.heartbeat(…);
List<Task> assignTasks(TaskTracker)
offerService() {
  while (the TaskTracker is running) {
    HeartbeatResponse heartbeatResponse =
      transmitHeartBeat(now);
    TaskTrackerAction[] actions =
      heartbeatResponse.getActions();
    // types: LaunchTaskAction, CommitTaskAction,
    // or an explicit cleanup directive
    markUnresponsiveTasks();
    killOverflowingTasks(); // if disk space is low: kill reduces first,
                            // then the task with the least progress
  }
}
void run() {
  offerService();
}
TaskTracker uses 2 internal classes:
- TaskLauncher (two instances: mapLauncher and reduceLauncher)
- TaskInProgress, whose launchTask() calls TaskRunner
TaskRunner
start()
Wednesday, March 27, 13
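A condensed sketch of what the loop does with the piggybacked directives. The action class names mirror the slide; the stub types and handler names are illustrative, not the verbatim TaskTracker source.

// Stubs standing in for the real o.a.h.mapred action types
abstract class TaskTrackerAction {}
class LaunchTaskAction extends TaskTrackerAction {}
class CommitTaskAction extends TaskTrackerAction {}

class HeartbeatDispatchSketch {
  void dispatch(TaskTrackerAction[] actions) {
    for (TaskTrackerAction action : actions) {
      if (action instanceof LaunchTaskAction) {
        addToTaskQueue((LaunchTaskAction) action);  // handed to mapLauncher/reduceLauncher
      } else if (action instanceof CommitTaskAction) {
        commitPending((CommitTaskAction) action);   // promote the task's output
      } else {
        addToCleanupQueue(action);                  // kill/cleanup directives
      }
    }
  }
  void addToTaskQueue(LaunchTaskAction a) { /* enqueue for a TaskLauncher thread */ }
  void commitPending(CommitTaskAction a)  { /* mark the attempt ready to commit */ }
  void addToCleanupQueue(TaskTrackerAction a) { /* purge job/task state, kill JVMs */ }
}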
Task creation in little more detail
Task tracker Node
TaskTracker jobClient
void run() {
offerService();
}
TaskRunner
start()
LaunchTaskAction
void run() {
}
- Launches a new “child” JVM per task using class JvmManager.
- Why? Any bug in map/reduce code doesn’t affect the TaskTracker.
- Builds child JVM options using property mapred.child.java.opts (heap size
(max/initial), garbage-collection options). Default: -Xmx200m
- To control additional processes spawned by the child JVM (e.g., Hadoop
Streaming), use property mapred.child.ulimit (limit on virtual memory)
- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks
(default 1; -1 means unlimited reuse)
- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.
JvmManager
JvmRunner
runChild() {
..
tracker.getTaskController()
.launchTask(...)
..
}
- TaskController is pluggable through mapred.task.tracker.task-controller
(DefaultTaskController or LinuxTaskController)
- Creates directories for the task (attempt, working, log)
- Passes JVM args and OS-specific manipulations to TaskLog and then to
o.a.h.util.Shell, which invokes the JVM through Java’s ProcessBuilder.
Note: the JVM args were already set by TaskRunner’s getJVMArgs(...)
- Default main class: Child.java
Different JVM
Child (linked back to the TaskTracker via the “umbilical” RPC)
void main(..)
{ .... }
MapTask or ReduceTask
run(job, umbilical) {
}
Wednesday, March 27, 13
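These knobs are ordinary job configuration. A minimal sketch of setting them through the old (mapred) API; the values are examples, not recommendations.

import org.apache.hadoop.mapred.JobConf;

public class ChildJvmTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Child JVM options (default: -Xmx200m)
    conf.set("mapred.child.java.opts", "-Xmx512m -XX:+UseParallelGC");
    // Virtual-memory cap (in KB) for processes the child spawns, e.g. streaming
    conf.set("mapred.child.ulimit", "1048576");
    // Reuse each JVM for an unlimited number of the job's tasks (-1); default is 1
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
  }
}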
Task creation in little more detail
Task tracker Node
TaskTracker jobClient
void run() {
offerService();
}
TaskRunner
start()
LaunchTaskAction
void run() {
}
JvmManager
JvmRunner
runChild() {
..
tracker.getTaskController()
.launchTask(...)
..
}
Different JVM
Child (linked back to the TaskTracker via the “umbilical” RPC)
void main(..)
{ .... }
MapTask or ReduceTask
run(job, umbilical) {
}
TaskReporter
- Creates a TaskReporter, which also uses the umbilical object.
- Checks whether this is a job/task setup or cleanup task.
- If so, runs the respective method and returns.
- Otherwise, performs the Map/Reduce-specific actions!
- Performs the commit operation if required.
- For a speculative task, ensures that only one of the duplicate attempts is
committed.
Wednesday, March 27, 13
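The speculative-commit arbitration in the last bullet runs over the umbilical. A sketch of the idea: the method names (commitPending, canCommit, done) follow my reading of the 1.x TaskUmbilicalProtocol and should be treated as assumptions, and the String task id stands in for the real TaskAttemptID.

// Assumed umbilical methods for commit arbitration (see note above)
interface UmbilicalSketch {
  void commitPending(String taskAttemptId);  // "my output is ready to commit"
  boolean canCommit(String taskAttemptId);   // JobTracker picks exactly ONE attempt
  void done(String taskAttemptId);           // report success
}

class CommitSketch {
  void finish(UmbilicalSketch umbilical, String attempt) throws InterruptedException {
    umbilical.commitPending(attempt);
    while (!umbilical.canCommit(attempt)) {
      Thread.sleep(1000);  // poll until chosen (or killed as the losing duplicate)
    }
    // move output from the attempt directory to the final location, then:
    umbilical.done(attempt);
  }
}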
Map-specific actions:
map
map
map
Mapper, InputFormat
Instantiate the mapper & input format using ReflectionUtils.newInstance(...)
split 1
split 2
split 3
split 4
split 5
Build the split using MapTask’s getSplitDetails(splitIndex, ...) + use the FileSystem/Deserializer from JobConf
For each key-value pair read from the split (through context.nextKeyValue()), call the user-defined map
Store the output of map in an in-memory circular buffer (MapOutputBuffer)
- If there is no reducer, DirectMapOutputCollector is used instead, which writes immediately to disk.
- When the buffer reaches a certain threshold, a background thread (MapOutputBuffer’s inner class
SpillThread) starts spilling the buffer to disk (mapred.local.dir).
- If a combiner is specified, run it when there are at least 3 spill files (min.num.spills.for.combine)
- Before writing to disk, compress if mapred.compress.map.output is true.
- Sorting uses the user-defined Comparator and Partitioner.
Sort/Spill
Final output: One sorted
partitioned file
Wednesday, March 27, 13
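The combiner and compression behavior above is plain job configuration. A sketch using the old API; setCompressMapOutput/setMapOutputCompressorClass are standard JobConf methods, while the commented-out combiner class is whatever your job defines.

import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapred.JobConf;

public class MapSideTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Run the combiner on spills only once this many spill files exist
    conf.setInt("min.num.spills.for.combine", 3);
    // conf.setCombinerClass(MyCombiner.class);  // your Reducer implementation
    // Compress intermediate map output before it hits disk
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(DefaultCodec.class);
  }
}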
In-memory circular buffer
io.sort.mb (Default: 100MB = 104857600 bytes) = $1
$1 * io.sort.spill.percent (Default: 0.8)
$1 * io.sort.record.percent (Default: 0.05)
Record pointers (the index/partition buffers):
- kvoffsets (1 int per record): offsets into kvindices
- kvindices (3 ints per record): <partition, key offset, value offset>
Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776
INFO org.apache.hadoop.mapred.MapTask: data buffer =
79691776/99614720
INFO org.apache.hadoop.mapred.MapTask: record buffer =
262144/327680
Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes = 16 bytes per record) = 327680
2 common cases for spilling:
1. Lots of small records filling up the record buffer
- Spills before the data buffer is full. Tweak io.sort.record.percent using the heuristic:
io.sort.record.percent = 16 / (16 + avgRecordSize) ... (0.05 is optimal if avgRecordSize ≈ 300 bytes)
- See https://issues.apache.org/jira/browse/MAPREDUCE-64
INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2. Few but very large records filling up the data buffer
- Increase the buffer size and also the spill percent (≈ 1). Key: try to spill only once.
- Tradeoff: the buffer takes memory from the JVM (i.e., from mapred.child.java.opts). Therefore,
if the max JVM heap = 1GB and $1 = 128MB, then user code gets only 896MB.
INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full = true
Wednesday, March 27, 13
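Redoing the arithmetic for a hypothetical io.sort.mb = 128 (a sketch; the real MapTask does the equivalent with integer truncation):

public class SortBufferMath {
  public static void main(String[] args) {
    long sortBytes = 128L * 1024 * 1024;  // io.sort.mb = 128
    double spillPct = 0.8;                // io.sort.spill.percent
    double recordPct = 0.05;              // io.sort.record.percent
    long dataBuffer = (long) (sortBytes * (1 - recordPct) * spillPct);
    long maxRecords = (long) (sortBytes * recordPct / 16);  // 16 bytes of pointers per record
    System.out.println("avail data buffer = " + dataBuffer);  // ~102005473 bytes
    System.out.println("records w/o spill = " + maxRecords);  // ~419430 records
  }
}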
Sort/Spill
Reduce-specific actions:
map
map
map
MapperInputFormat
split 1
split 2
split 3
split 4
split 5
TaskTracker (map-side)
mapping info
TaskTracker (reduce-side)
TaskTracker (reduce-side)
JobTracker
thru heartbeat
Reducers know which
machines to fetch data from.
Wednesday, March 27, 13
Sort/Spill
Reduce-specific actions: fetch (TaskStatus.Phase.SHUFFLE)
[Diagram: InputFormat/Mapper with splits 1-5 feeding three map tasks; two reduce tasks
fetch the sorted/spilled map outputs from the map-side TaskTracker (mapping info).]
if (mapred.job.tracker != local), the ReduceTask runs a ReduceCopier whose fetchOutput()
spawns MapOutputCopier threads:
- Get output using HTTP (served by the TaskTracker's HttpServer via MapOutputServlet)
- mapred.reduce.parallel.copies: #MapOutputCopier threads (i.e. # fetches in parallel
  on each reduce task) - Default: 5
- tasktracker.http.threads: #clients the HttpServer will service - Default: 40
- MapReduce 2 will use Netty (2x #processors)
Wednesday, March 27, 13
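As a hedged sketch of how these two knobs are typically tuned on the classic API this deck follows (the class name below is made up): mapred.reduce.parallel.copies is a per-job setting, while tasktracker.http.threads is a TaskTracker daemon setting that belongs in mapred-site.xml.

import org.apache.hadoop.mapred.JobConf;

// Sketch: tuning the two shuffle knobs from this slide (illustrative class name).
public class ShuffleTuning {
    public static void main(String[] args) {
        // Per-job setting on the classic API: number of parallel MapOutputCopier
        // fetch threads per reduce task (default 5).
        JobConf conf = new JobConf();
        conf.setInt("mapred.reduce.parallel.copies", 10);

        // tasktracker.http.threads (default 40) is a TaskTracker daemon setting;
        // it must be raised in each TaskTracker's mapred-site.xml instead:
        //   <property>
        //     <name>tasktracker.http.threads</name>
        //     <value>80</value>
        //   </property>
    }
}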
Sort/Spill
Reduce-specific actions: fetch (TaskStatus.Phase.SHUFFLE), then merge (SORT)
[Diagram: two reduce tasks pulling the sorted/spilled map outputs through
ReduceCopier.fetchOutput() / MapOutputCopier.]
For each fetched map output, MapOutputCopier asks: is the map output size <
ShuffleRamManager's MaxSingleShuffleLimit?
- Yes: keep the output in memory
- No: write it to disk
MaxSingleShuffleLimit = (mapred.child.java.opts's -Xmx) *
mapred.job.shuffle.input.buffer.percent (default: 0.7) * 0.25f
INFO org.apache.hadoop.mapred.ReduceTask: Shuffling ? bytes (? raw bytes) into
(RAM/Local-FS) from attempt_?
Two merge threads run during the SORT phase:
InMemFSMergeThread performs an "in-memory merge" if
- used memory > (-Xmx * 0.7) * mapred.job.shuffle.merge.percent (default: 0.66), or
- #map outputs > mapred.inmem.merge.threshold (default: 1000)
LocalFSMerger performs an (interleaved) "on-disk merge" if
- #files on disk > 2 * io.sort.factor - 1 (fairly rare)
E.g., 50 files and io.sort.factor = 10: 5 rounds of merging, 10 files at a time*
Finally, it spills the in-memory data to disk. Why?
- It assumes the user's reduce() needs all the RAM.
- Can tweak this via mapred.job.reduce.input.buffer.percent (default: 0) up to ~0.7
  for a simple reducer.
Wednesday, March 27, 13
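A minimal stand-alone sketch of the thresholds above, assuming a 1 GB task heap (-Xmx1024m) and the defaults quoted on the slide (illustrative code, not Hadoop internals; all names are made up):

// Stand-alone sketch of the shuffle-memory thresholds above (not Hadoop code).
public class ShuffleMemoryMath {
    public static void main(String[] args) {
        long xmx = 1024L * 1024 * 1024;     // -Xmx from mapred.child.java.opts
        double inputBufferPercent = 0.70;   // mapred.job.shuffle.input.buffer.percent
        double mergePercent = 0.66;         // mapred.job.shuffle.merge.percent
        int inMemThreshold = 1000;          // mapred.inmem.merge.threshold
        int ioSortFactor = 10;              // io.sort.factor

        long shuffleBuffer = (long) (xmx * inputBufferPercent);
        long maxSingleShuffleLimit = (long) (shuffleBuffer * 0.25);   // 0.25f in the slide
        long inMemMergeTrigger = (long) (shuffleBuffer * mergePercent);

        System.out.println("MaxSingleShuffleLimit       = " + maxSingleShuffleLimit);  // ~188 MB
        System.out.println("in-memory merge triggers at > " + inMemMergeTrigger
                + " bytes used, or > " + inMemThreshold + " map outputs");
        System.out.println("on-disk merge triggers at   > " + (2 * ioSortFactor - 1)
                + " files on disk");
        // E.g., 50 on-disk files with io.sort.factor = 10 -> 5 merge rounds of 10 files.
    }
}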
Sort/Spill
Reduce-specific actions: fetch (SHUFFLE) -> merge (SORT) -> REDUCE
[Diagram: each reduce task runs the user's Reducer and an OutputFormat.]
REDUCE phase: use a RawKeyValueIterator over the merged map outputs and call the
user-defined Reducer class; the OutputFormat then writes each reducer's output to
its own part file (part-0, part-1, ...).
Wednesday, March 27, 13
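For completeness, this is what a user-defined Reducer class looks like on the classic org.apache.hadoop.mapred API the deck uses; a minimal summing reducer (the class name and key/value types are illustrative):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Minimal user-defined Reducer on the classic API: for each key, the framework
// feeds it the merged values (driven internally by a RawKeyValueIterator), and
// the job's OutputFormat writes the collected pairs to the part-N files.
public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}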
References
- Hadoop: The Definitive Guide, 3rd edition, by Tom White.
- Hadoop Operations, by Eric Sammer.
- Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer.
- Mining of Massive Datasets, by Rajaraman et al.
- Online Aggregation for Large MapReduce Jobs, by Pansare et al.
- Distributed and Cloud Computing, by Hwang et al.
- http://developer.yahoo.com/hadoop/tutorial/
- http://www.slideshare.net/cloudera/mr-perf
- http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html
- http://www.cs.rice.edu/~fd2/pdf/hpdc106-dinu.pdf
Wednesday, March 27, 13

More Related Content

PDF
Powering code reuse with context and render props
PPTX
Hadoop
PDF
Hadoop Cluster on Docker Containers
POTX
What's the Scoop on Hadoop? How It Works and How to WORK IT!
PDF
Hadoop - How It Works
PDF
What is hadoop and how it works?
PPTX
Learn Big Data & Hadoop
PDF
An Introduction to the World of Hadoop
Powering code reuse with context and render props
Hadoop
Hadoop Cluster on Docker Containers
What's the Scoop on Hadoop? How It Works and How to WORK IT!
Hadoop - How It Works
What is hadoop and how it works?
Learn Big Data & Hadoop
An Introduction to the World of Hadoop

Similar to How MapReduce part of Hadoop works (i.e. system's view) ? (20)

PPTX
Hadoop Introduction
PPT
Hadoop 2
PPT
DOCX
Big data unit iv and v lecture notes qb model exam
PDF
Lecture 2 part 1
PPT
hadoop.ppt
PPTX
Introduction to hadoop and hdfs
PPTX
Scheduling scheme for hadoop clusters
PPTX
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
PDF
PPT
Anatomy of classic map reduce in hadoop
PPTX
BIG DATA ANALYSIS
PPSX
Hadoop-Quick introduction
PDF
Ruby on hadoop
PDF
PPT
Hadoop Map-Reduce from the subject: Big Data Analytics
PPTX
Schedulers optimization to handle multiple jobs in hadoop cluster
PPTX
Juniper Innovation Contest
PDF
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
PPTX
Hadoop architecture by ajay
Hadoop Introduction
Hadoop 2
Big data unit iv and v lecture notes qb model exam
Lecture 2 part 1
hadoop.ppt
Introduction to hadoop and hdfs
Scheduling scheme for hadoop clusters
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Anatomy of classic map reduce in hadoop
BIG DATA ANALYSIS
Hadoop-Quick introduction
Ruby on hadoop
Hadoop Map-Reduce from the subject: Big Data Analytics
Schedulers optimization to handle multiple jobs in hadoop cluster
Juniper Innovation Contest
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
Hadoop architecture by ajay
Ad

Recently uploaded (20)

PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
Complications of Minimal Access Surgery at WLH
PPTX
Lesson notes of climatology university.
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
Insiders guide to clinical Medicine.pdf
PDF
01-Introduction-to-Information-Management.pdf
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PPTX
Pharma ospi slides which help in ospi learning
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PPTX
master seminar digital applications in india
PDF
Pre independence Education in Inndia.pdf
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PPTX
GDM (1) (1).pptx small presentation for students
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Final Presentation General Medicine 03-08-2024.pptx
Complications of Minimal Access Surgery at WLH
Lesson notes of climatology university.
STATICS OF THE RIGID BODIES Hibbelers.pdf
Insiders guide to clinical Medicine.pdf
01-Introduction-to-Information-Management.pdf
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Pharma ospi slides which help in ospi learning
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Pharmacology of Heart Failure /Pharmacotherapy of CHF
102 student loan defaulters named and shamed – Is someone you know on the list?
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Module 4: Burden of Disease Tutorial Slides S2 2025
master seminar digital applications in india
Pre independence Education in Inndia.pdf
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
GDM (1) (1).pptx small presentation for students
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Ad

How MapReduce part of Hadoop works (i.e. system's view) ?

  • 1. Hadoop MapReduce - System’sView By Niketan Pansare (np6@rice.edu) Rice University Wednesday, March 27, 13
  • 2. JobSubmission at Client’s side Client Node Job tracker Node Task tracker Node Wednesday, March 27, 13
  • 12. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Client stub to JobTrackerjobSubmissionClient.getNewJobID() Wednesday, March 27, 13
  • 13. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Client stub to JobTracker JobTracker jobSubmissionClient.getNewJobID() Wednesday, March 27, 13
  • 14. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Client stub to JobTracker JobTracker jobSubmissionClient.getNewJobID() Wednesday, March 27, 13
  • 15. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Client stub to JobTracker JobTracker jobSubmissionClient.getNewJobID() RPC call Wednesday, March 27, 13
  • 21. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources 1. Get destination paths - Job staging area (getStagingArea()) - Job submission area - Job config file path (getJobConfPath()) - Job jar file path (getJobJar()) - Information about splits: (a) split meta file (getJobSplitMetaFile()) (b) split file (getJobSplitFile()) JobSubmissionFiles Wednesday, March 27, 13
  • 22. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources (jar) Shared FS (HDFS) Wednesday, March 27, 13
  • 23. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources (jar) Shared FS (HDFS) Wednesday, March 27, 13
  • 24. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources (jar) Shared FS (HDFS) jar file + replication = 10 Wednesday, March 27, 13
  • 25. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources (jar) Shared FS (HDFS) jar file + replication = 10 replication = mapred.submit.replication = default: 10 Wednesday, March 27, 13
  • 26. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources (splits/config) Shared FS (HDFS) Wednesday, March 27, 13
  • 27. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources (splits/config) Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() Wednesday, March 27, 13
  • 28. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources (splits/config) Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization Wednesday, March 27, 13
  • 29. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources (splits/config) Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by Wednesday, March 27, 13
  • 30. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources (splits/config) JobSubmissionFiles Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by Wednesday, March 27, 13
  • 31. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Copy Job Resources (splits/config) JobSubmissionFiles Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by Wednesday, March 27, 13
  • 32. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() JobTracker Copy Job Resources (splits/config) JobSubmissionFiles Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by Wednesday, March 27, 13
  • 33. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() JobTracker Copy Job Resources (splits/config) JobSubmissionFiles Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by Wednesday, March 27, 13
  • 34. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() JobTracker Copy Job Resources (splits/config) JobSubmissionFiles Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by JobSplit.SplitMetaInfo Wednesday, March 27, 13
  • 35. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() JobTracker Copy Job Resources (splits/config) JobSubmissionFiles Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by JobSplit.SplitMetaInfo d. Copy split file to HDFS (replica=10) path given by Wednesday, March 27, 13
  • 36. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() JobTracker Copy Job Resources (splits/config) JobSubmissionFiles Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by JobSplit.SplitMetaInfo d. Copy split file to HDFS (replica=10) path given by Wednesday, March 27, 13
  • 37. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() JobTracker Copy Job Resources (splits/config) JobSubmissionFiles Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by JobSplit.SplitMetaInfo d. Copy split file to HDFS (replica=10) path given by JobSplit.TaskSplitIndex Wednesday, March 27, 13
  • 38. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() JobTracker Copy Job Resources (splits/config) JobSubmissionFiles Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by JobSplit.SplitMetaInfo d. Copy split file to HDFS (replica=10) path given by JobSplit.TaskSplitIndex e. Copy job config file to JobTracker path given by Wednesday, March 27, 13
  • 39. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() JobTracker Copy Job Resources (splits/config) JobSubmissionFiles Shared FS (HDFS) a. Compute splits jobConf.getInputFormat().getSplits() b. Sort splits based on size (biggest goes first) - Modify Array.sort() in writeSplit() for randomization c. Copy split “meta” file to jobtracker into path given by JobSplit.SplitMetaInfo d. Copy split file to HDFS (replica=10) path given by JobSplit.TaskSplitIndex e. Copy job config file to JobTracker path given by job config file Wednesday, March 27, 13
  • 40. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Client stub to JobTracker JobTracker After copying job resources (jar, split files, config) Wednesday, March 27, 13
  • 41. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Client stub to JobTracker JobTracker After copying job resources (jar, split files, config) Wednesday, March 27, 13
  • 42. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Client stub to JobTracker JobTracker After copying job resources (jar, split files, config) RPC submitJob() Wednesday, March 27, 13
  • 43. Client Node Client pgm Job job.submit() JobClient jobClient.submitJobInternal() Client stub to JobTracker JobTracker After copying job resources (jar, split files, config) RPC submitJob() Done with Job Submission at Client side .... Now let’s look at JobTracker’s side. Wednesday, March 27, 13
  • 44. JobSubmission at Job tracker node Client Node Job tracker Node Task tracker Node Client stub to JobTracker JobTracker Wednesday, March 27, 13
  • 45. JobSubmission at Job tracker node Client Node Job tracker Node Task tracker Node Client stub to JobTracker RPC submitJob() JobTracker Wednesday, March 27, 13
  • 46. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker Wednesday, March 27, 13
  • 47. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker Read job config file Wednesday, March 27, 13
  • 48. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) Read job config file Wednesday, March 27, 13
  • 49. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) Wednesday, March 27, 13
  • 50. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) Wednesday, March 27, 13
  • 51. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() Wednesday, March 27, 13
  • 52. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() createSplits() Wednesday, March 27, 13
  • 53. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() split meta file (JobSplit.SplitMetaInfo) createSplits() Wednesday, March 27, 13
  • 54. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() split meta file (JobSplit.SplitMetaInfo) createSplits() Wednesday, March 27, 13
  • 55. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() split meta file (JobSplit.SplitMetaInfo) createSplits() JobSplit.TaskSplitMetaInfo[] splits Wednesday, March 27, 13
  • 56. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() split meta file (JobSplit.SplitMetaInfo) JobSplit.TaskSplitMetaInfo[] splits Wednesday, March 27, 13
  • 57. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() split meta file (JobSplit.SplitMetaInfo) JobSplit.TaskSplitMetaInfo[] splits Wednesday, March 27, 13
  • 58. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits Wednesday, March 27, 13
  • 59. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps Wednesday, March 27, 13
  • 60. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps Wednesday, March 27, 13
  • 61. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps 1 map per split Wednesday, March 27, 13
  • 62. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps Wednesday, March 27, 13
  • 63. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps Wednesday, March 27, 13
  • 64. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps Map<Node, List<TIP>> nonRunningMapCache Wednesday, March 27, 13
  • 65. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Wednesday, March 27, 13
  • 66. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache mapred.reduce.tasks Wednesday, March 27, 13
  • 67. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Wednesday, March 27, 13
  • 68. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Wednesday, March 27, 13
  • 69. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces Wednesday, March 27, 13
  • 70. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces Other bookkeeping structures: runningMapCache, nonLocalMaps, failedMaps, ... + JobProfile, JobStatus Wednesday, March 27, 13
  • 71. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces Wednesday, March 27, 13
  • 72. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup Wednesday, March 27, 13
  • 73. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Wednesday, March 27, 13
  • 74. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Run by TaskTracker and are used to setup and to cleanup tasks Wednesday, March 27, 13
  • 75. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Wednesday, March 27, 13
  • 76. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup 2 = One for map and other for reduce task Wednesday, March 27, 13
  • 77. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Wednesday, March 27, 13
  • 78. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Wednesday, March 27, 13
  • 79. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup What code to run by TaskInProgress ? Wednesday, March 27, 13
  • 80. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup What code to run by TaskInProgress ?User-defined Wednesday, March 27, 13
  • 81. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup What code to run by TaskInProgress ? Wednesday, March 27, 13
  • 82. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup What code to run by TaskInProgress ? For setup and cleanup, specified by mapred.output.committer.class Default: FileOutputCommitter Wednesday, March 27, 13
  • 83. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup What code to run by TaskInProgress ? Wednesday, March 27, 13
  • 84. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Wednesday, March 27, 13
  • 85. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) job.initTasks() JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Done initializing: Wednesday, March 27, 13
  • 86. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Done initializing: Wednesday, March 27, 13
  • 87. JobSubmission at Job tracker node Job tracker Node JobTracker JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Wednesday, March 27, 13
  • 88. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Wednesday, March 27, 13
  • 89. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker QueueManagerqueueManager JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Wednesday, March 27, 13
  • 90. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker QueueManagerqueueManager JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Queue exists ? + User permissions Wednesday, March 27, 13
  • 91. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker QueueManagerqueueManager JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup Wednesday, March 27, 13
  • 92. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker QueueManagerqueueManager JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup addJob() Wednesday, March 27, 13
  • 93. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker QueueManagerqueueManager JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup addJob() Wednesday, March 27, 13
  • 94. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker QueueManagerqueueManager JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup addJob() Notify Listeners of the queue Wednesday, March 27, 13
  • 95. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker QueueManagerqueueManager JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup addJob() Wednesday, March 27, 13
  • 96. JobSubmission at Job tracker node Job tracker Node submitJob() JobTracker QueueManagerqueueManager JobInProgress (job) JobSplit.TaskSplitMetaInfo[] splits TaskInProgress[] maps TaskInProgress[] reduces Map<Node, List<TIP>> nonRunningMapCache Set<TaskInProgress> nonRunningReduces TaskInProgress[2] setup TaskInProgress[2] cleanup addJob() Done submitting the job !!! Wednesday, March 27, 13
  • 98. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. Wednesday, March 27, 13
  • 99. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. Wednesday, March 27, 13
  • 100. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. Wednesday, March 27, 13
  • 101. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. Wednesday, March 27, 13
  • 102. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: Wednesday, March 27, 13
  • 103. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() Wednesday, March 27, 13
  • 104. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) Wednesday, March 27, 13
  • 105. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) • List<Task> assignTasks(TaskTracker) Wednesday, March 27, 13
  • 106. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) • List<Task> assignTasks(TaskTracker) • Implementations: Wednesday, March 27, 13
  • 107. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) • List<Task> assignTasks(TaskTracker) • Implementations: • Specified by mapred.jobtracker.taskScheduler Wednesday, March 27, 13
  • 108. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) • List<Task> assignTasks(TaskTracker) • Implementations: • Specified by mapred.jobtracker.taskScheduler • Default: FIFO scheduler (o.a.h.mapred.JobQueueTaskScheduler) Wednesday, March 27, 13
  • 109. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) • List<Task> assignTasks(TaskTracker) • Implementations: • Specified by mapred.jobtracker.taskScheduler • Default: FIFO scheduler (o.a.h.mapred.JobQueueTaskScheduler) Wednesday, March 27, 13
  • 110. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) • List<Task> assignTasks(TaskTracker) • Implementations: • Specified by mapred.jobtracker.taskScheduler • Default: FIFO scheduler (o.a.h.mapred.JobQueueTaskScheduler) - Multiple queue, each with different priority (VERY_HIGH, HIGH, ....) Wednesday, March 27, 13
  • 111. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) • List<Task> assignTasks(TaskTracker) • Implementations: • Specified by mapred.jobtracker.taskScheduler • Default: FIFO scheduler (o.a.h.mapred.JobQueueTaskScheduler) - Multiple queue, each with different priority (VERY_HIGH, HIGH, ....) - User specifies job priority (mapred.job.priority) Wednesday, March 27, 13
  • 112. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) • List<Task> assignTasks(TaskTracker) • Implementations: • Specified by mapred.jobtracker.taskScheduler • Default: FIFO scheduler (o.a.h.mapred.JobQueueTaskScheduler) - Multiple queue, each with different priority (VERY_HIGH, HIGH, ....) - User specifies job priority (mapred.job.priority) - Logic: Wednesday, March 27, 13
  • 113. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) • List<Task> assignTasks(TaskTracker) • Implementations: • Specified by mapred.jobtracker.taskScheduler • Default: FIFO scheduler (o.a.h.mapred.JobQueueTaskScheduler) - Multiple queue, each with different priority (VERY_HIGH, HIGH, ....) - User specifies job priority (mapred.job.priority) - Logic: First select queue with highest priority Wednesday, March 27, 13
  • 114. TaskScheduler class • Used by JobTracker to schedule Task on TaskTracker. • Uses one or more JobInProgressListener to receive notifications about the jobs. • Uses ClusterStatus to get info about the state of cluster. • Methods: • start(), terminate(), refresh() • Collection<JobInProgress> getJobs(String queueName) • List<Task> assignTasks(TaskTracker) • Implementations: • Specified by mapred.jobtracker.taskScheduler • Default: FIFO scheduler (o.a.h.mapred.JobQueueTaskScheduler) - Multiple queue, each with different priority (VERY_HIGH, HIGH, ....) - User specifies job priority (mapred.job.priority) - Logic: First select queue with highest priority Then FIFO within that queue Wednesday, March 27, 13
  • 115. Task Scheduling (Job tracker Node)
  [Diagram: the JobTracker holds a QueueManager (queueManager) and the JobQueueTaskScheduler with its JIPListener. Each JobInProgress (job) holds: JobSplit.TaskSplitMetaInfo[] splits, TaskInProgress[] maps, TaskInProgress[] reduces, Map<Node, List<TIP>> nonRunningMapCache, Set<TaskInProgress> nonRunningReduces, TaskInProgress[2] setup, TaskInProgress[2] cleanup.]
  - When a job is added, the JobTracker fires the callback jobAdded(JIP) on the scheduler's JIPListener.
  - The scheduler's entry point is List<Task> assignTasks(TaskTracker), which:
  1. Calculates availableMapSlots
  • 123. Task Scheduling (Job tracker Node)
  1. Calculate availableMapSlots:

    availableMapSlots = trackerCurrentMapCapacity − trackerRunningMaps
    trackerCurrentMapCapacity = min(⌈mapLoadFactor × trackerMapCapacity⌉, trackerMapCapacity)

    where
    trackerMapCapacity = taskTrackerStatus.getMaxMapSlots()
    trackerRunningMaps = taskTrackerStatus.countMapTasks()
    mapLoadFactor = Σ over all jobs (JIP’s numMapTask − finishedMapTask) / clusterStatus.getMaxMapTasks()

  (TaskTrackerStatus and ClusterStatus supply the per-tracker and cluster-wide counts; the JIPListener supplies the JobInProgress (JIP) objects.)
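  The same calculation as a small Java sketch (parameter names follow the slide; this is illustrative, not the exact JobQueueTaskScheduler source):

    // How many map slots the scheduler may fill on this tracker right now.
    static int availableMapSlots(int trackerMapCapacity,   // taskTrackerStatus.getMaxMapSlots()
                                 int trackerRunningMaps,   // taskTrackerStatus.countMapTasks()
                                 int remainingMapLoad,     // sum over jobs: numMapTask - finishedMapTask
                                 int clusterMapCapacity) { // clusterStatus.getMaxMapTasks()
      double mapLoadFactor = (double) remainingMapLoad / clusterMapCapacity;
      // Scale this tracker's capacity by cluster-wide load, capped at its real capacity.
      int trackerCurrentMapCapacity =
          Math.min((int) Math.ceil(mapLoadFactor * trackerMapCapacity), trackerMapCapacity);
      return trackerCurrentMapCapacity - trackerRunningMaps;
    }

  For example, with 6 slots on the tracker, 2 running maps, and the cluster half loaded, the tracker is throttled to ⌈0.5 × 6⌉ = 3 slots, leaving 1 assignable.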
  • 124. Task Scheduling (Job tracker Node)
  List<Task> assignTasks(TaskTracker):

    for (i = 1 to availableMapSlots) {
      for (JIP job : JIPListener.getJobQ()) {
        Task t = job.findNewMapTask()
        assignedTasks.add(t)
        // Also, make sure there are free slots in the cluster for speculative tasks
      }
    }
    // Do the same thing for reducers
    return assignedTasks

  - getJobQueue() uses a Map<JobSchedulingInfo, JIP> with the FIFO_JOB_QUEUE comparator: jobs in the higher-priority queue are processed first.
  - findNewMapTask() tries, in order:
    - the task with the most failures that has not already failed on this machine, ignoring locality (JIP's failedMaps);
    - a non-running task chosen using locality info (JIP's nonRunningMapCache);
    - a speculative task.
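  The same loop as a runnable sketch (JobStub and its findNewMapTask() are hypothetical stand-ins for JobInProgress and its task-selection method):

    import java.util.ArrayList;
    import java.util.List;

    interface JobStub {
      String findNewMapTask(); // returns a task id, or null if nothing to run
    }

    class FifoAssignSketch {
      // One pass of map-task assignment: fill each free slot with the first
      // task any job in the (priority-then-FIFO) queue can offer.
      static List<String> assignMapTasks(int availableMapSlots, List<JobStub> jobQueue) {
        List<String> assignedTasks = new ArrayList<>();
        for (int slot = 0; slot < availableMapSlots; slot++) {
          for (JobStub job : jobQueue) {
            String t = job.findNewMapTask(); // failed > non-running/local > speculative
            if (t != null) {
              assignedTasks.add(t);          // one task per free slot
              break;
            }
          }
        }
        return assignedTasks;                // reduce slots follow the same pattern
      }
    }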
  • 135. TaskScheduler implementations: FIFO scheduler (the default, o.a.h.mapred.JobQueueTaskScheduler)
  - Does not support preemption.
  - Bad for a production cluster (high priority can be misused).
  • 137. TaskScheduler implementations: Facebook's FairScheduler
  Goal: provide fast response time for small jobs and guaranteed service levels for production jobs.
  Pools:
  - Each pool can be given a minimum share (the deck's example uses min shares of 30 slots and 40 slots).
  - Example: the cluster has 100 slots available. Allocate them! The resulting allocation is 40 + 30 + 30 slots (a later build of the diagram also shows a 15 + 15 split of one share).
  Additional features:
  - Job weights for unequal sharing (based on priority or size)
  - Limits on the number of running jobs per user/pool
  Usage:
  - cp build/contrib/fairscheduler/*.jar lib
  - Set mapred.jobtracker.taskScheduler to o.a.h.m.FairScheduler
  - Set mapred.fairscheduler.allocation.file to /path/pool.xml
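  One consistent reading of the 40/30/30 example is ordinary fair sharing with minimum shares ("water-filling"): pools whose min share beats an equal split keep their min share, and everyone else splits the rest evenly. A sketch under that reading (illustrative, not the FairScheduler source):

    // Fair shares for three pools with min shares {30, 40, 0} out of 100 slots -> {30, 40, 30}.
    static double[] fairShares(double totalSlots, double[] minShare) {
      int n = minShare.length;
      double[] share = new double[n];
      boolean[] pinned = new boolean[n];
      while (true) {
        double remaining = totalSlots;
        int free = 0;
        for (int i = 0; i < n; i++) {
          if (pinned[i]) remaining -= share[i]; else free++;
        }
        double equalSplit = (free == 0) ? 0 : remaining / free;
        boolean changed = false;
        for (int i = 0; i < n; i++) {
          if (!pinned[i] && minShare[i] > equalSplit) {
            share[i] = minShare[i];  // pin pools whose min share beats the equal split
            pinned[i] = true;
            changed = true;
          }
        }
        if (!changed) {
          for (int i = 0; i < n; i++) if (!pinned[i]) share[i] = equalSplit;
          return share;
        }
      }
    }

  Here the pool with min share 40 is pinned (40 > 100/3), and the remaining 60 slots split evenly between the other two pools: 30 each.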
  • 146. TaskScheduler implementations: Yahoo's CapacityScheduler
  - ~ FairScheduler, but with queues instead of pools.
  - Each queue is given a share (%) of the cluster; a queue can hold jobs of different priorities.
  - FIFO scheduling within each queue, so scheduling is more deterministic than with the FairScheduler.
  - Also, unlike the other two, it supports memory-based scheduling and preemption.
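  Whichever implementation is chosen, the JobTracker resolves it reflectively from mapred.jobtracker.taskScheduler. A simplified sketch of that lookup (error handling elided; placed in o.a.h.mapred because the scheduler classes are package-level types):

    package org.apache.hadoop.mapred;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SchedulerLookupSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Falls back to the FIFO JobQueueTaskScheduler when the property is unset.
        Class<? extends TaskScheduler> schedulerClass =
            conf.getClass("mapred.jobtracker.taskScheduler",
                          JobQueueTaskScheduler.class, TaskScheduler.class);
        TaskScheduler scheduler = ReflectionUtils.newInstance(schedulerClass, conf);
        System.out.println("Loaded scheduler: " + scheduler.getClass().getName());
      }
    }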
  • 152. Task creation (Job tracker Node ↔ Task tracker Node)
  Heartbeat protocol:
  - Periodic
  - Indicates the health of the TaskTracker (failure detection)
  - A Remote Procedure Call
  - Piggybacks directives: launch a task, perform cleanup/commit

  On startup, the TaskTracker builds an RPC stub (jobClient) to the JobTracker:

    this.jobClient = (InterTrackerProtocol)
        UserGroupInformation.getLoginUser().doAs(
            new PrivilegedExceptionAction<Object>() {
              public Object run() throws IOException {
                return RPC.waitForProxy(InterTrackerProtocol.class,
                                        InterTrackerProtocol.versionID,
                                        jobTrackAddr, fConf);
              }
            });

  Each heartbeat is then a remote call whose reply carries the directives:

    HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

  On the JobTracker side, the heartbeat drives the scheduler: List<Task> assignTasks(TaskTracker).
  • 159. Task creation (Task tracker Node)
  The TaskTracker's main thread just runs the heartbeat loop:

    void run() {
      offerService();
    }

    offerService() {
      while (the task-tracker-running flag is set) {
        HeartbeatResponse heartbeatResponse = transmitHeartBeat(now);
        TaskTrackerAction[] actions = heartbeatResponse.getActions();
        // types: LaunchTaskAction, CommitTaskAction,
        // or an explicit cleanup directive
        markUnresponsiveTasks();
        killOverflowingTasks(); // if disk space is low: kill reduces first,
                                // then the task with the least progress
      }
    }
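  A self-contained skeleton of this loop (the *Stub types are hypothetical stand-ins for InterTrackerProtocol and the heartbeat/action classes; the real offerService() also handles restarts, response ids, and the local health checks above):

    import java.util.List;

    class HeartbeatLoopSketch {
      interface InterTrackerStub {
        List<Runnable> heartbeat(); // directives piggybacked on the reply
      }

      private final InterTrackerStub jobClient;
      private final long intervalMs;
      private volatile boolean running = true;

      HeartbeatLoopSketch(InterTrackerStub jobClient, long intervalMs) {
        this.jobClient = jobClient;
        this.intervalMs = intervalMs;
      }

      void offerService() throws InterruptedException {
        while (running) {
          Thread.sleep(intervalMs);  // periodic: the call itself signals health
          for (Runnable action : jobClient.heartbeat()) {
            action.run();            // launch a task, commit, or clean up
          }
        }
      }
    }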
  • 174. Task creation (Task tracker Node)
  To act on a LaunchTaskAction, the TaskTracker uses two internal classes:
  - TaskLauncher: two launcher threads, mapLauncher and reduceLauncher.
  - TaskInProgress: its launchTask() creates a TaskRunner and calls its start().
  • 181. Task creation (Task tracker Node)
  TaskRunner's run():
  - Launches a new "child" JVM per task, using the class JvmManager.
  - Why? So that a bug in user map/reduce code cannot take down the TaskTracker.
  - Builds the child JVM options from the property mapred.child.java.opts (heap size (max/initial), garbage-collection options). Default: -Xmx200m
  - To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (a limit on virtual memory).
  - For short-lived tasks, reuse JVMs via mapred.job.reuse.jvm.num.tasks (default 1).
  - Tasks within a given JVM run sequentially, but tasks in different JVMs run in parallel.
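  These knobs can be set cluster-wide in mapred-site.xml or per job; a per-job sketch using the old-API JobConf (the values are only examples):

    import org.apache.hadoop.mapred.JobConf;

    public class ChildJvmTuning {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Heap and GC options for every child JVM of this job.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        // Virtual-memory cap, in KB, for the child and any processes it spawns
        // (e.g. a Hadoop Streaming subprocess).
        conf.set("mapred.child.ulimit", "1048576");
        // Reuse each JVM for up to 10 tasks of the same job; -1 = unlimited.
        conf.setNumTasksToExecutePerJvm(10); // mapred.job.reuse.jvm.num.tasks
      }
    }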
  • 189. Task creation in little more detail Job tracker Node Task tracker Node JobTracker TaskTracker TaskScheduler jobClient List<Task> assignTasks(TaskTracker) void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. Wednesday, March 27, 13
  • 190. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. Wednesday, March 27, 13
  • 191. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager Wednesday, March 27, 13
  • 192. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } Wednesday, March 27, 13
  • 193. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } Wednesday, March 27, 13
  • 194. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } - TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController) Wednesday, March 27, 13
  • 195. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } - TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController) - Creates directories for task (attempt, working, log) Wednesday, March 27, 13
  • 196. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } - TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController) - Creates directories for task (attempt, working, log) - Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder. Wednesday, March 27, 13
  • 197. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } - TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController) - Creates directories for task (attempt, working, log) - Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder. Note, args for JVM already set by TaskRunner’s getJVMArgs(...) Wednesday, March 27, 13
  • 198. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } - TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController) - Creates directories for task (attempt, working, log) - Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder. Note, args for JVM already set by TaskRunner’s getJVMArgs(...) - Default main class: Child.java Wednesday, March 27, 13
  • 199. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } - TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController) - Creates directories for task (attempt, working, log) - Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder. Note, args for JVM already set by TaskRunner’s getJVMArgs(...) - Default main class: Child.java Different JVM Wednesday, March 27, 13
  • 200. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } - TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController) - Creates directories for task (attempt, working, log) - Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder. Note, args for JVM already set by TaskRunner’s getJVMArgs(...) - Default main class: Child.java Different JVM Wednesday, March 27, 13
  • 201. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } - TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController) - Creates directories for task (attempt, working, log) - Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder. Note, args for JVM already set by TaskRunner’s getJVMArgs(...) - Default main class: Child.java Different JVM Child void main(..) { .... } Wednesday, March 27, 13
  • 202. Task creation in little more detail Task tracker Node TaskTrackerjobClient void run() { offerService(); } TaskRunner start() LaunchTaskAction void run() { } - Launches a new “child” JVM per task using class JvmManager. - Why? Any bug in map/reduce don’t affect TaskTracker. - Builds child JVM options using property mapred.java.child.opts (heapsize (max/initial), garbage collection options). Default: -Xmx200m - To control additional processes by child JVM (eg: Hadoop Streaming), use property mapred.child.ulimit (limit of virtual memory) - For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1) - Task for a given JVM: sequentially; but across JVMs: parallelly. JvmManager JvmRunner runChild() { .. tracker.getTaskController() .launchTask(...) .. } - TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController) - Creates directories for task (attempt, working, log) - Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder. Note, args for JVM already set by TaskRunner’s getJVMArgs(...) - Default main class: Child.java Different JVM umbilicalChild void main(..) { .... } Wednesday, March 27, 13
  • 203. Task creation in little more detail (Task tracker Node)
[Diagram: TaskTracker void run() { offerService(); } -> LaunchTaskAction void run() { } -> TaskRunner start() -> JvmManager -> JvmRunner runChild() { .. tracker.getTaskController().launchTask(...) .. } -> a different JVM running Child void main(..) { .... } -> MapTask or ReduceTask run(job, umbilical) { }, connected back over the umbilical]
TaskRunner:
- Launches a new "child" JVM per task using the class JvmManager.
- Why? A bug in user map/reduce code cannot bring down the TaskTracker.
- Builds the child JVM options from the property mapred.child.java.opts (heap size (max/initial), garbage-collection options). Default: -Xmx200m.
- To control additional processes spawned by the child JVM (e.g., Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit).
- For short-lived tasks, reuse JVMs via mapred.job.reuse.jvm.num.tasks (default 1).
- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.
JvmManager / JvmRunner:
- The TaskController is pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController).
- Creates the directories for the task (attempt, working, log).
- Passes the JVM args and OS-specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes the JVM through Java's ProcessBuilder. Note: the args for the JVM were already set by TaskRunner's getJVMArgs(...).
- Default main class: Child.java.
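Concretely, these knobs are ordinary job configuration. A minimal sketch, assuming the Hadoop 1.x property names quoted on this slide (the values are illustrative, not recommendations):

    import org.apache.hadoop.mapred.JobConf;

    public class ChildJvmConfig {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Heap and GC options for every child task JVM (default: -Xmx200m).
        conf.set("mapred.child.java.opts", "-Xmx512m -XX:+UseParallelGC");
        // Virtual-memory limit (in KB) for processes the child spawns,
        // e.g. a Hadoop Streaming script.
        conf.set("mapred.child.ulimit", "1048576");
        // Let one JVM run up to 5 short-lived tasks of the same job (-1 = unlimited).
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 5);
        // Pluggable controller the TaskTracker uses to launch the child JVM.
        conf.set("mapred.task.tracker.task-controller",
                 "org.apache.hadoop.mapred.DefaultTaskController");
      }
    }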
  • 211. Task creation in little more detail (Task tracker Node)
Inside the child JVM, Child void main(..) { .... } drives the MapTask or ReduceTask via run(job, umbilical) { }:
- Creates a TaskReporter, which also uses the umbilical object.
- Checks whether this is a job/task setup/cleanup task.
- If so, runs the respective method and returns.
- Else, performs the map/reduce-specific actions (next slides).
- Performs the commit operation if it is required.
- If this is a speculative task, ensures only one of the duplicate tasks is committed.
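A heavily simplified, runnable sketch of that control flow. The Umbilical and ChildTask types below are hypothetical stand-ins for Hadoop's internal TaskUmbilicalProtocol and Task classes, reduced to what the slide shows; the real Child.main also sets up RPC, the TaskReporter, and commit/cleanup handling:

    // Hypothetical stand-in for TaskUmbilicalProtocol: the child's RPC link
    // back to its TaskTracker.
    interface Umbilical {
      ChildTask getTask() throws Exception; // null when no task is left for this JVM
    }

    // Hypothetical stand-in for MapTask / ReduceTask.
    abstract class ChildTask {
      abstract void run(Umbilical umbilical) throws Exception;
    }

    public class ChildSketch {
      public static void main(String[] args) throws Exception {
        Umbilical umbilical = new Umbilical() { // real code: an RPC proxy
          private boolean done = false;
          public ChildTask getTask() {
            if (done) return null;
            done = true;
            return new ChildTask() {
              void run(Umbilical u) { System.out.println("task body runs here"); }
            };
          }
        };
        ChildTask task;
        // JVM-reuse loop: keep asking the TaskTracker for tasks to run.
        while ((task = umbilical.getTask()) != null) {
          task.run(umbilical); // reports progress and commits through the umbilical
        }
      }
    }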
  • 224. Map-specific actions
[Diagram: InputFormat -> split 1 .. split 5 -> map / map / map (Mapper) -> Sort/Spill]
- The mapper & input are instantiated using ReflectionUtils.newInstance(...).
- The split is rebuilt via MapTask's getSplitDetails(splitIndex, ...), using the FileSystem/Deserializer from the JobConf.
- For each key-value pair read from the split (through context.nextKeyValue()), the user-defined map is called.
Sort/Spill: the map output is stored in an in-memory circular buffer (MapOutputBuffer).
- If there is no reducer, a DirectMapOutputCollector is used instead, which writes immediately to disk.
- When the buffer reaches a certain threshold, a background thread (MapOutputBuffer's inner class SpillThread) starts spilling the buffer to disk (mapred.local.dir).
- If a combiner is specified, it is run when there are at least 3 spill files (min.num.spills.for.combine).
- Before writing to disk, the output is compressed if mapred.compress.map.output is true.
- The sort uses the user-defined Comparator and Partitioner.
Final output: one sorted, partitioned file.
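The per-record loop above is exactly what the new-API Mapper base class does in its run() method; a minimal user mapper that rides that loop (an identity mapper over text lines):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper.run() is essentially:
    //   setup(context);
    //   while (context.nextKeyValue())
    //     map(context.getCurrentKey(), context.getCurrentValue(), context);
    //   cleanup(context);
    public class IdentityLineMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Each write is collected into the MapOutputBuffer described above.
        context.write(offset, line);
      }
    }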
  • 244. In-memory circular buffer
io.sort.mb (default: 100MB = 104857600 bytes) = $1
- Record pointers: $1 * io.sort.record.percent (default: 0.05), split between the index buffer kvoffsets (1 int per record) and the partition buffer kvindices (3 ints per record: <partition, key offset, value offset>).
- Spill threshold: $1 * io.sort.spill.percent (default: 0.8).
Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776
Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

    INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
    INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

2 common cases for spilling:
1. Lots of small records filling up the record buffer
- The spill happens before the data buffer is full. Tweak io.sort.record.percent using the heuristic 16 / (16 + avgRecordSize); the default 0.05 is optimal when avgRecordSize is ~300 bytes.
- See https://issues.apache.org/jira/browse/MAPREDUCE-64

    INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true

2. Few but very large records filling up the data buffer
- Increase the buffer size and also the spill percent (toward 1). Key: try to spill only once.
- Tradeoff: the buffer takes memory from the task JVM (i.e., from mapred.child.java.opts). So if the max JVM heap is 1GB and $1 = 128MB, user code gets only 896MB.

    INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full = true
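The sizing arithmetic is worth checking by hand; a self-contained sketch using the defaults quoted above (only avgRecordSize is an assumed example value):

    public class SortBufferMath {
      public static void main(String[] args) {
        long ioSortMb = 100L * 1024 * 1024;   // io.sort.mb = 100MB = 104857600 bytes
        double spillPercent = 0.80;           // io.sort.spill.percent
        double recordPercent = 0.05;          // io.sort.record.percent
        int bytesPerRecordPtr = 4 * 4;        // 4 ints (kvoffsets + kvindices) * 4 bytes

        long dataBuffer = (long) (ioSortMb * (1 - recordPercent));            // 99614720
        long availData = (long) (dataBuffer * spillPercent);                  // 79691776
        long maxRecords = (long) (ioSortMb * recordPercent) / bytesPerRecordPtr; // 327680

        System.out.printf("data buffer = %d/%d%n", availData, dataBuffer);
        System.out.printf("record capacity (no spill) = %d%n", maxRecords);

        // Heuristic from the slide for the many-small-records case:
        int avgRecordSize = 100;              // assumed example value
        double tunedRecordPercent = 16.0 / (16 + avgRecordSize);
        System.out.printf("suggested io.sort.record.percent = %.3f%n", tunedRecordPercent);
      }
    }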
  • 249. Reduce-specific actions
[Diagram: map-side TaskTrackers (with Sort/Spill output) -> JobTracker -> reduce-side TaskTrackers]
- The map-side TaskTracker holds the mapping info (which map outputs live where).
- This information reaches the reduce-side TaskTrackers via the JobTracker, through the heartbeat.
- Reducers therefore know which machines to fetch data from.
  • 263. Reduce-specific actions: fetch phase (TaskStatus.Phase.SHUFFLE)
[Diagram: ReduceTask -> ReduceCopier fetchOutput() { } -> MapOutputCopier threads -> map-side TaskTracker's HttpServer / MapOutputServlet]
- if (mapred.job.tracker != local), the ReduceTask starts a ReduceCopier.
- The map output is fetched over HTTP from the map-side TaskTracker's MapOutputServlet.
- mapred.reduce.parallel.copies: #MapOutputCopier threads (i.e., #fetches in parallel on each reduce task). Default: 5.
- tasktracker.http.threads: #clients the HttpServer will service. Default: 40.
- MapReduce 2 will use Netty instead (2x #processors).
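Both sides of that pipe are tunable; a minimal sketch with the two properties named above (the values are illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleFetchTuning {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Client side: parallel fetch threads per reduce task (default 5).
        conf.setInt("mapred.reduce.parallel.copies", 10);
        // Server side: threads the TaskTracker's HttpServer uses to serve
        // map output; a cluster-wide TaskTracker setting (default 40).
        conf.setInt("tasktracker.http.threads", 80);
      }
    }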
  • 283. Reduce-specific actions: fetch phase (TaskStatus.Phase.SHUFFLE), continued
For each fetched map output, the MapOutputCopier asks: is the map output size < ShuffleRamManager's MaxSingleShuffleLimit?
- Yes: keep the output in memory.
- No: write it to disk.
MaxSingleShuffleLimit = (-Xmx from mapred.child.java.opts) * mapred.job.shuffle.input.buffer.percent (default: 0.7) * 0.25f

    INFO org.apache.hadoop.mapred.ReduceTask: Shuffling ? bytes (? raw bytes) into (RAM/Local-FS) from attempt_?

Two background threads merge outputs as they arrive:
InMemFSMergeThread performs an "in-memory merge" if
- used memory > (-Xmx * 0.7) * mapred.job.shuffle.merge.percent (default: 0.66), or
- #map outputs > mapred.inmem.merge.threshold (default: 1000).
LocalFSMerger performs an (interleaved) "on-disk merge" if
- #files on disk > 2 * io.sort.factor - 1 (fairly rare).
Merge SORT: e.g., 50 files and io.sort.factor = 10 give 5 rounds of merging, 10 files at a time.*
Finally, the in-memory data is spilled to disk. Why?
- The framework assumes the user's reduce() needs all the RAM.
- If the reducer is simple, this can be tweaked via mapred.job.reduce.input.buffer.percent (default: 0), up to ~0.7.
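The memory thresholds compose as below; a self-contained sketch, assuming a 1GB child heap (everything else is the defaults quoted above):

    public class ShuffleLimits {
      public static void main(String[] args) {
        long xmx = 1024L * 1024 * 1024;    // -Xmx from mapred.child.java.opts, assumed 1GB
        double inputBufferPercent = 0.70;  // mapred.job.shuffle.input.buffer.percent
        double mergePercent = 0.66;        // mapred.job.shuffle.merge.percent

        long shuffleBuffer = (long) (xmx * inputBufferPercent);
        long maxSingleShuffle = (long) (shuffleBuffer * 0.25);   // per-output in-RAM cap
        long inMemMergeTrigger = (long) (shuffleBuffer * mergePercent);

        System.out.printf("shuffle buffer        = %d bytes%n", shuffleBuffer);
        System.out.printf("MaxSingleShuffleLimit = %d bytes%n", maxSingleShuffle);
        System.out.printf("in-memory merge at    = %d bytes used%n", inMemMergeTrigger);
      }
    }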
  • 286. Reduce-specific actions: REDUCE phase (TaskStatus.Phase.REDUCE)
[Diagram: merged, sorted input -> Reducer -> OutputFormat -> part-0, part-1]
- After the shuffle (fetch) and merge-sort phases, use a RawKeyValueIterator over the merged input and call the user-defined Reducer class.
- The OutputFormat writes the results as part-0, part-1, ...
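For completeness, a minimal new-API reducer; the framework walks the RawKeyValueIterator, groups values by key, and hands each group to reduce():

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();  // one group = all values shuffled for this key
        }
        context.write(key, new IntWritable(sum));  // written via the OutputFormat
      }
    }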
  • 287. References
- Hadoop: The Definitive Guide, 3rd edition, by Tom White.
- Hadoop Operations by Eric Sammer.
- Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer.
- Mining of Massive Datasets by Rajaraman et al.
- Online Aggregation for Large MapReduce Jobs by Pansare et al.
- Distributed and Cloud Computing by Hwang et al.
- http://developer.yahoo.com/hadoop/tutorial/
- http://www.slideshare.net/cloudera/mr-perf
- http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html
- http://www.cs.rice.edu/~fd2/pdf/hpdc106-dinu.pdf