Wei's notes on hadoop resource awareness

Wei’s Notes on Resource Awareness
March 2011

Example workloads
- IO-bound
  - Indexing
  - Searching
  - Grouping
  - Decoding/decompressing
  - Data importing and exporting
- CPU-bound
  - Machine learning
  - Complex text mining
  - Natural language processing
  - Feature extraction

IO/CPU intensive?
- How to judge if a job is IO/CPU intensive?
  - Simplify: let the user specify
  - Otherwise:
    - Does it make more sense to find the pattern at the job level or at the task level?
    - Could a job be CPU intensive while its reduce tasks are IO intensive?

Goal
- Make task/job placement resource aware
- Proposal: provide a profiling mechanism that periodically quantifies demand and supply per job, per task type, and per machine, like a 3D score sheet. Any scheduler could generically adopt the score sheet and assign slots/tasks based on the weighted task/slot scores.
- [Figure: 3D score sheet with axes Job_TaskType, time, machine]
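One way to picture the 3D score sheet is as a small lookup structure. The sketch below is only an illustration under assumptions: the class name `ScoreSheet`, its methods, and the convention that a higher score means a better placement are all hypothetical and not part of the proposal.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class ScoreSheet:
    """Hypothetical score sheet: one score history per (job_task_type, machine)."""

    def __init__(self) -> None:
        # (job_task_type, machine) -> list of (time, score) samples
        self._scores: Dict[Tuple[str, str], List[Tuple[float, float]]] = defaultdict(list)

    def record(self, job_task_type: str, machine: str, time: float, score: float) -> None:
        # Called by the periodic profiler; keeps the full time axis.
        self._scores[(job_task_type, machine)].append((time, score))

    def latest(self, job_task_type: str, machine: str) -> Optional[float]:
        # Most recent score for this pair, or None if it was never profiled.
        history = self._scores.get((job_task_type, machine))
        if not history:
            return None
        return max(history, key=lambda entry: entry[0])[1]

    def best_machine(self, job_task_type: str) -> Optional[str]:
        # Assumption: higher score means a better placement.
        current = {m: self.latest(jt, m) for (jt, m) in self._scores if jt == job_task_type}
        return max(current, key=current.get) if current else None


sheet = ScoreSheet()
sheet.record("job1_map", "m1", time=0, score=0.2)
sheet.record("job1_map", "m1", time=10, score=0.9)
sheet.record("job1_map", "m2", time=10, score=0.5)
```

A scheduler would only need `latest` or `best_machine`; how the scores are computed stays behind `record`.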

Proposed scheme
1. Quantify resource capacities at cluster start.
2. Quantify machine/network variables periodically.
3. Profile the resource demand of tasks/jobs whenever a job is submitted, the first mapper task finishes, or the mappers are done.
4. Periodically assign a score per job, per task type, per possible machine placement (all slots on a given machine are homogeneous), based on the profiles obtained in steps 1, 2 and 3.

Variables
* Traffic on the link from which a given node has to transfer data

Idle Cluster: 1 Task – M Slots
Policy (without network IO; picking only, not scoring; ONLY for brainstorming):

    List<Node> nodes, s.t. availability_io > demand_io && availability_cpu > demand_cpu
    if nodes.size() == 1
        DONE!
    else if nodes.size() > 1
        for each node  // try to balance IO usage and CPU usage on a machine
            io_cpu_dist = dist(availability_io - demand_io, availability_cpu - demand_cpu)
        pick node with min(io_cpu_dist)
        DONE!
    else if nodes.size() == 0
        for each node in the cluster
            shortage = dist(availability_io, demand_io) + dist(availability_cpu, demand_cpu)
        pick node with min(shortage)
        DONE!
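The picking policy above can be sketched in Python as follows. Two things are assumptions here: the notes do not define `dist`, so absolute difference is used, and the `Node` class and `pick_node` function are hypothetical names. Availability and demand are assumed to be already normalized to comparable units.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    availability_io: float   # assumed to be in the same unit as demand_io
    availability_cpu: float  # assumed to be in the same unit as demand_cpu


def dist(a: float, b: float) -> float:
    # Assumption: dist() is not defined in the notes; absolute difference is used.
    return abs(a - b)


def pick_node(nodes: List[Node], demand_io: float, demand_cpu: float) -> Node:
    if not nodes:
        raise ValueError("no nodes to choose from")
    # Nodes that can satisfy both the IO and the CPU demand.
    fitting = [n for n in nodes
               if n.availability_io > demand_io and n.availability_cpu > demand_cpu]
    if len(fitting) == 1:
        return fitting[0]
    if fitting:
        # Several candidates: keep leftover IO and leftover CPU balanced.
        return min(fitting, key=lambda n: dist(n.availability_io - demand_io,
                                               n.availability_cpu - demand_cpu))
    # No node fits: fall back to the node with the smallest total shortage.
    return min(nodes, key=lambda n: dist(n.availability_io, demand_io)
                                    + dist(n.availability_cpu, demand_cpu))


cluster = [Node("a", 0.9, 0.3), Node("b", 0.6, 0.5), Node("c", 0.1, 0.1)]
balanced = pick_node(cluster, demand_io=0.2, demand_cpu=0.2)
fallback = pick_node(cluster, demand_io=1.0, demand_cpu=1.0)
```

With these made-up numbers, nodes a and b both fit the first task, and b wins because its leftover IO and CPU are closer to each other. Note that with absolute difference, the fallback branch also counts surplus as distance; counting only deficits would be a reasonable alternative.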

Busy Cluster: 1 Slot – M Tasks
- Closer to the usage pattern of production clusters
- Similar algorithm to the idle case; the same algorithm can be extended to assign scores.
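A rough sketch of how the idle-cluster policy might turn into a score for the busy case, where one free slot chooses among several waiting tasks. The function names, the tuple-shaped score, and the use of absolute difference as the distance are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def placement_score(avail_io: float, avail_cpu: float,
                    demand_io: float, demand_cpu: float) -> Tuple[int, float]:
    """Higher is better. Tasks that fit always outrank tasks that do not."""
    if avail_io > demand_io and avail_cpu > demand_cpu:
        # Fits: prefer the task that leaves IO and CPU leftovers balanced.
        imbalance = abs((avail_io - demand_io) - (avail_cpu - demand_cpu))
        return (1, -imbalance)
    # Does not fit: prefer the smallest total shortage.
    shortage = abs(avail_io - demand_io) + abs(avail_cpu - demand_cpu)
    return (0, -shortage)


def pick_task(avail_io: float, avail_cpu: float,
              demands: Dict[str, Tuple[float, float]]) -> str:
    # demands maps a task id to its (demand_io, demand_cpu) profile.
    return max(demands, key=lambda t: placement_score(avail_io, avail_cpu, *demands[t]))


waiting = {"index": (0.4, 0.1), "mix": (0.2, 0.2), "huge": (0.9, 0.9)}
chosen = pick_task(0.5, 0.5, waiting)
```

Here "mix" is chosen because it fits and leaves equal IO and CPU headroom on the machine; "huge" does not fit and ranks last.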

Limitations
- The score sheet only has scores for running tasks (extended to tasks of the same task type from the same job).
- It doesn’t benefit the very first mapper task or the very first reducer task.

Measurement & Quantification
- Profile a task type of a job by sampling
- How to measure IO and CPU of a given machine at a given time?
  - Availability = Capacity – (sum of resource consumption of running tasks). Capacity?
  - Or better: Availability = (sum of resource consumption of running tasks) * (1/usage percentage – 1)
    * This availability is based on the average demand of the currently running tasks. Step 1 in the proposed scheme could then potentially be skipped! Well… but it could come in handy when placing the very first task.
- How to normalize IO and CPU against each other?
  - Use percentages? Then demands have to be normalized with the same multipliers, for IO and CPU respectively.
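The two availability formulas agree whenever usage percentage = consumption / capacity, since capacity = consumption / usage. A minimal sketch with made-up numbers; the function names are hypothetical, and usage is expressed as a fraction between 0 and 1 rather than a percentage:

```python
from typing import Iterable


def availability_from_capacity(capacity: float, consumptions: Iterable[float]) -> float:
    # Availability = Capacity - sum(consumption of running tasks)
    return capacity - sum(consumptions)


def availability_from_usage(consumptions: Iterable[float], usage_fraction: float) -> float:
    # Availability = sum(consumption of running tasks) * (1 / usage - 1)
    # No capacity measurement is needed, but at least one task must be running.
    if not 0 < usage_fraction <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    return sum(consumptions) * (1 / usage_fraction - 1)


def normalize(value: float, capacity: float) -> float:
    # Express a raw IO or CPU figure as a fraction of that resource's capacity,
    # so supply and demand share the same multiplier per resource.
    return value / capacity


running = [20.0, 30.0]          # made-up per-task consumption, e.g. MB/s of IO
capacity = 200.0                # made-up machine capacity in the same unit
usage = normalize(sum(running), capacity)
```

The usage-based form returns 0 when nothing is running, which is why a capacity measured at cluster start is still useful for placing the very first task.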
