HaLoop: Efficient Iterative Data
Processing On Large Scale Clusters
Yingyi Bu, UC Irvine
Bill Howe, UW
Magda Balazinska, UW
Michael Ernst, UW
http://clue.cs.washington.edu/
Award IIS 0844572
Cluster Exploratory (CluE)
http://escience.washington.edu/
VLDB 2010, Singapore
Thesis in one slide
 Observation: MapReduce has proven successful as a
common runtime for non-recursive declarative languages
 HIVE (SQL)
 Pig (RA with nested types)
 Observation: Many people roll their own loops
 Graphs, clustering, mining, recursive queries
 iteration managed by external script
 Thesis: With minimal extensions, we can provide an efficient
common runtime for recursive languages
 Map, Reduce, Fixpoint
Related Work: Twister [Ekanayake HPDC 2010]
 Redesigned evaluation engine using pub/sub
 Termination condition evaluated by main()
while (!complete) {
  // run one MapReduce iteration, broadcasting the current centroids
  monitor = driver.runMapReduceBCast(cData);
  monitor.monitorTillCompletion();
  DoubleVectorData newCData =
      ((KMeansCombiner) driver.getCurrentCombiner()).getResults();
  // termination test runs in main(), on the driver
  totalError = getError(cData, newCData);
  cData = newCData;
  if (totalError < THRESHOLD) {
    complete = true;
    break;
  }
}
Only O(k) data (the k centroids) crosses to the driver for this termination check.
In Detail: PageRank (Twister)
while (!complete) {
  // run MR: start the PageRank MapReduce pass
  monitor = driver.runMapReduceBCast(
      new BytesValue(tmpCompressedDvd.getBytes()));
  monitor.monitorTillCompletion();
  // get the result of the pass
  newCompressedDvd =
      ((PageRankCombiner) driver.getCurrentCombiner()).getResults();
  // decompress the compressed pagerank values
  newDvd = decompress(newCompressedDvd);
  tmpDvd = decompress(tmpCompressedDvd);
  // termination condition: difference between new and old pagerank values
  totalError = getError(tmpDvd, newDvd);
  if (totalError < tolerance) {
    complete = true;
  }
  tmpCompressedDvd = newCompressedDvd;
}
Here the termination check is O(N) in the size of the graph: every iteration runs MapReduce, then ships the full (compressed) rank vector back to the driver and decompresses it just to decide whether to loop again.
Related Work: Spark [Zaharia HotCloud 2010]
 Reduction output collected at driver program
 “…does not currently support a grouped reduce operation as in MapReduce”
val spark = new SparkContext(<Mesos master>)
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 10000, 10)) {
val x = Math.random * 2 - 1
val y = Math.random * 2 - 1
if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count.value / 10000.0)
All output is sent to the driver.
Related Work: Pregel [Malewicz PODC 2009]
 Graphs only
 clustering: k-means, canopy, DBScan
 Assumes each vertex has access to its outgoing edges
 So an edge representation Edge(from, to) requires offline preprocessing, perhaps using MapReduce
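A tiny sketch of that preprocessing step (an in-memory stand-in with illustrative names; a real MapReduce job would map each edge keyed by its source and reduce to adjacency lists):

import java.util.*;

// Group an Edge(from, to) list into per-vertex adjacency lists -- the shape
// of work a map (key by 'from') + reduce (collect the 'to's) job performs.
public class EdgesToAdjacency {
  public static void main(String[] args) {
    String[][] edges = {{"a", "b"}, {"a", "c"}, {"c", "a"}, {"e", "c"}};
    Map<String, List<String>> adjacency = new HashMap<>();
    for (String[] e : edges)
      adjacency.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
    System.out.println(adjacency);  // e.g. {a=[b, c], c=[a], e=[c]}
  }
}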
Related Work: Piccolo [Power OSDI 2010]
 Partitioned table data model, with user-defined partitioning
 Programming model:
 message-passing with global synchronization barriers
 User can give locality hints
 Worth exploring a direct comparison
Locality-hint example: GroupTables(curr, next, graph)
Related Work: BOOM [c.f. Alvaro EuroSys 10]
 Distributed computing based on Overlog
(Datalog + temporal logic + more)
 Recursion supported naturally
 app: API-compliant implementation of MR
 Worth exploring a direct comparison
Details
 Architecture
 Programming Model
 Caching (and Indexing)
 Scheduling
Example 1: PageRank

Rank Table R0:
url        rank
www.a.com  1.0
www.b.com  1.0
www.c.com  1.0
www.d.com  1.0
www.e.com  1.0

Linkage Table L:
url_src    url_dest
www.a.com  www.b.com
www.a.com  www.c.com
www.c.com  www.a.com
www.e.com  www.c.com
www.d.com  www.b.com
www.c.com  www.e.com
www.e.com  www.c.com
www.a.com  www.d.com

Rank Table R3:
url        rank
www.a.com  2.13
www.b.com  3.89
www.c.com  2.60
www.d.com  2.60
www.e.com  2.13

Loop body (R_i → R_{i+1}):
R_{i+1} = π_{url_dest, γ_{url_dest} SUM(rank)}( R_i ⋈_{R_i.url = L.url_src} L ),
where each joined tuple first carries rank R_i.rank / γ_{url} COUNT(url_dest), i.e. the page’s current rank divided by its outdegree.
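To make the loop body concrete, here is a minimal, self-contained Java sketch of one iteration (in-memory rather than actual map/reduce tasks; the names are illustrative and details such as dangling pages are omitted, so this is not HaLoop’s implementation):

import java.util.*;

// One PageRank iteration in the shape of the relational loop body above:
// join R_i with L on url = url_src, scale each page's rank by 1/outdegree
// (the gamma_url COUNT(url_dest) step), then group by url_dest and SUM.
public class PageRankStep {
  static Map<String, Double> step(Map<String, Double> rank,
                                  Map<String, List<String>> links) {
    Map<String, Double> next = new HashMap<>();
    for (Map.Entry<String, List<String>> e : links.entrySet()) {
      double share = rank.getOrDefault(e.getKey(), 0.0) / e.getValue().size();
      for (String dest : e.getValue())
        next.merge(dest, share, Double::sum);   // SUM(rank) per url_dest
    }
    return next;
  }

  public static void main(String[] args) {
    Map<String, List<String>> links = new HashMap<>();
    links.put("www.a.com", Arrays.asList("www.b.com", "www.c.com", "www.d.com"));
    links.put("www.c.com", Arrays.asList("www.a.com", "www.e.com"));
    links.put("www.d.com", Arrays.asList("www.b.com"));
    links.put("www.e.com", Arrays.asList("www.c.com"));
    Map<String, Double> rank = new HashMap<>();
    for (String url : links.keySet()) rank.put(url, 1.0);
    rank.put("www.b.com", 1.0);                           // R_0: all ranks 1.0
    for (int i = 0; i < 3; i++) rank = step(rank, links); // compute R_3
    System.out.println(rank);
  }
}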
A MapReduce Implementation
[Diagram: R_i and the linkage-table splits L-split0/L-split1 feed a map/reduce pair that joins and computes rank; a second map/reduce pair performs the aggregate fixpoint evaluation; the client tests “Converged?” and either sets i = i+1 or reports done.]
What’s the problem?
1. L is loaded on each iteration
2. L is shuffled on each iteration
3. Fixpoint evaluated as a separate MapReduce job per iteration
L is loop invariant, but it is reloaded and reshuffled anyway, plus the fixpoint check costs an extra MapReduce job every iteration.
[Diagram: the same PageRank dataflow, annotated with problems 1-3.]
Example 2: Transitive Closure
Find all transitive friends of Eric over the Friend relation (semi-naïve evaluation):
R0 = {Eric, Eric}
R1 = {Eric, Elisa}
R2 = {Eric, Tom; Eric, Harry}
R3 = {} (empty delta: fixpoint reached)
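The same evaluation as a small, self-contained sketch (an in-memory stand-in for the Join and Dupe-elim jobs on the next slide; names are illustrative): each round joins only the newest tuples (the delta) against Friend, removes anything already seen, and stops when the delta is empty.

import java.util.*;

// Semi-naive evaluation of "all transitive friends of Eric".
public class TransitiveClosure {
  public static void main(String[] args) {
    Map<String, List<String>> friend = new HashMap<>();
    friend.put("Eric",  Arrays.asList("Elisa"));
    friend.put("Elisa", Arrays.asList("Tom", "Harry"));

    Set<String> seen  = new HashSet<>(Collections.singleton("Eric")); // R0
    Set<String> delta = new HashSet<>(seen);
    while (!delta.isEmpty()) {                    // fixpoint: empty delta
      Set<String> next = new HashSet<>();
      for (String p : delta)                      // Join: delta x Friend
        next.addAll(friend.getOrDefault(p, Collections.emptyList()));
      next.removeAll(seen);                       // Dupe-elim vs. all R_j
      seen.addAll(next);
      delta = next;
    }
    System.out.println(seen);  // [Eric, Elisa, Tom, Harry] (in some order)
  }
}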
Example 2 in MapReduce
[Diagram: S_i and the Friend splits Friend0/Friend1 feed a Join map/reduce pair (compute the next generation of friends), then a Dupe-elim map/reduce pair (remove the ones we’ve already seen); the client asks “Anything new?” and either sets i = i+1 or reports done.]
What’s the problem?
1. Friend is loaded on each iteration
2. Friend is shuffled on each iteration
Friend is loop invariant, but it is reloaded and reshuffled anyway.
[Diagram: the same Join/Dupe-elim dataflow, annotated with problems 1-2.]
Example 3: k-means
[Diagram: point partitions P0, P1, P2 are mapped against k_i, the k centroids at iteration i; reducers emit the new centroids k_{i+1}; the client tests k_i - k_{i+1} < threshold and either sets i = i+1 or reports done.]
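A compact, self-contained sketch of one such iteration (1-D points for brevity; illustrative only, not HaLoop code):

import java.util.*;

// One k-means iteration: "map" assigns each point to its nearest centroid,
// "reduce" recomputes each centroid as the mean of its assigned points.
public class KMeansStep {
  static double[] step(double[] points, double[] centroids) {
    double[] sum = new double[centroids.length];
    int[] count = new int[centroids.length];
    for (double p : points) {
      int best = 0;
      for (int c = 1; c < centroids.length; c++)
        if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
      sum[best] += p;
      count[best]++;
    }
    double[] next = centroids.clone();
    for (int c = 0; c < next.length; c++)
      if (count[c] > 0) next[c] = sum[c] / count[c];
    return next;                                    // k_{i+1}
  }

  public static void main(String[] args) {
    double[] points = {1, 2, 3, 10, 11, 12};        // P0 u P1 u P2
    double[] k = {0, 5};                            // k_0
    for (int i = 0; i < 10; i++) k = step(points, k);
    System.out.println(Arrays.toString(k));         // [2.0, 11.0]
  }
}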
What’s the problem?
1. P is loaded on each iteration
P is loop invariant, but it is reloaded anyway.
[Diagram: the same k-means dataflow, annotated with problem 1.]
Approach: Inter-iteration caching
Mapper input cache (MI)
Mapper output cache (MO)
Reducer input cache (RI)
Reducer output cache (RO)
[Diagram: a multi-step loop body of mapper and reducer stages, with the four cache points marked.]
RI: Reducer Input Cache
 Provides:
 Access to loop invariant data without
map/shuffle
 Used By:
 Reducer function
 Assumes:
1. Mapper output for a given table constant
across iterations
2. Static partitioning (implies: no new nodes)
 PageRank
 Avoid shuffling the network at every step
 Transitive Closure
 Avoid shuffling the graph at every step
 K-means
 No help
Reducer Input Cache Benefit
[Chart: overall run time. Transitive closure; Billion Triples dataset (120GB); 90 small instances on EC2.]
Reducer Input Cache Benefit
[Chart: join step only. Transitive closure; Billion Triples dataset (120GB); 90 small instances on EC2; Livejournal, 12GB.]
Reducer Input Cache Benefit
[Chart: reduce and shuffle of the join step. Transitive closure; Billion Triples dataset (120GB); 90 small instances on EC2; Livejournal, 12GB.]
[Diagram: the PageRank dataflow revisited (the “Join & compute rank” map/reduce pair plus the “Aggregate fixpoint evaluation” pair), annotated with the total cost.]
RO: Reducer Output Cache
 Provides:
 Distributed access to output of previous
iterations
 Used By:
 Fixpoint evaluation
 Assumes:
1. Partitioning constant across iterations
2. Reducer output key functionally
determines Reducer input key
 PageRank
 Allows distributed fixpoint evaluation
 Obviates extra MapReduce job
 Transitive Closure
 No help
 K-means
 No help
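A sketch of what that buys (hypothetical names, not HaLoop’s API): because the previous iteration’s output is cached at each reducer, and the output key determines the input key, a reducer can diff its new output against its cached output locally and report a per-partition verdict, so no extra MapReduce job is needed:

import java.util.*;

// Local convergence test at one reducer; the client only combines the
// per-reducer booleans instead of running a dedicated fixpoint job.
public class LocalFixpointCheck {
  static boolean converged(Map<String, Double> cached,
                           Map<String, Double> fresh, double threshold) {
    double delta = 0.0;
    for (Map.Entry<String, Double> e : fresh.entrySet())
      delta += Math.abs(e.getValue() - cached.getOrDefault(e.getKey(), 0.0));
    return delta < threshold;
  }

  public static void main(String[] args) {
    Map<String, Double> prev = Map.of("www.a.com", 2.10, "www.b.com", 3.80);
    Map<String, Double> next = Map.of("www.a.com", 2.13, "www.b.com", 3.89);
    System.out.println(converged(prev, next, 0.2));  // true
  }
}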
Reducer Output Cache Benefit
[Charts: fixpoint evaluation time (s) vs. iteration #. Livejournal dataset, 50 EC2 small instances; Freebase dataset, 90 EC2 small instances.]
MI: Mapper Input Cache
 Provides:
 Access to non-local mapper input on later
iterations
 Used:
 During scheduling of map tasks
 Assumes:
1. Mapper input does not change
 PageRank
 Subsumed by use of Reducer Input Cache
 Transitive Closure
 Subsumed by use of Reducer Input Cache
 K-means
 Avoids non-local data reads on iterations > 0
Mapper Input Cache Benefit
5% non-local data reads;
~5% improvement
Conclusions (last slide)
 Relatively simple changes to MapReduce/Hadoop can
support arbitrary recursive programs
 TaskTracker (Cache management)
 Scheduler (Cache awareness)
 Programming model (multi-step loop bodies, cache control)
 Optimizations
 Caching loop invariant data realizes largest gain
 Good to eliminate extra MapReduce step for termination checks
 Mapper input cache benefit inconclusive; need a busier cluster
 Future Work
 Analyze expressiveness of Map-Reduce-Fixpoint
 Consider a model of Map-(Reduce+)-Fixpoint
Data-Intensive
Scalable Science
http://clue.cs.washington.edu
http://escience.washington.edu
Award IIS 0844572
Cluster Exploratory (CluE)
Motivation in One Slide
 MapReduce can’t express recursion/iteration
 Lots of interesting programs need loops
 graph algorithms
 clustering
 machine learning
 recursive queries (CTEs, datalog, WITH clause)
 Dominant solution: Use a driver program outside of MapReduce
 Hypothesis: making MapReduce loop-aware
affords optimization
 …and lays a foundation for scalable implementations of
recursive languages
Experiments
 Amazon EC2
 20, 50, 90 default small instances
 Datasets
 Billion Triples (120GB) [1.5B nodes, 1.6B edges]
 Freebase (12GB) [7M nodes, 154M edges]
 Livejournal social network (18GB) [4.8M nodes, 67M edges]
 Queries
 Transitive Closure
 PageRank
 k-means
[VLDB 2010]
HaLoop Architecture
Scheduling Algorithm
Input: Node node
Global variables: HashMap<Node, List<Partition>> last, HashMap<Node, List<Partition>> current

if (iteration == 0) {
  // first iteration: the same as MapReduce
  Partition part = StandardMapReduceSchedule(node);
  current.add(node, part);
} else {
  if (node.hasFullLoad()) {
    // node is fully loaded: find a substitution nearby
    Node substitution = findNearbyNode(node);
    last.get(substitution).addAll(last.remove(node));
    return;
  }
  if (last.get(node).size() > 0) {
    // iteration-local schedule: re-assign a partition this node held last iteration
    Partition part = last.get(node).get(0);
    schedule(part, node);
    current.get(node).add(part);
    last.get(node).remove(part);
  }
}
Programming Interface
Job job = new Job();
job.AddMap(Map Rank, 1);                 // define loop body: step 1
job.AddReduce(Reduce Rank, 1);
job.AddMap(Map Aggregate, 2);            // define loop body: step 2
job.AddReduce(Reduce Aggregate, 2);
job.AddInvariantTable(#1);               // declare an input as invariant
job.SetInput(IterationInput);            // loop body input, parameterized by iteration #
job.SetFixedPointThreshold(0.1);         // termination condition:
job.SetDistanceMeasure(ResultDistance);  //   stop when results are this close...
job.SetMaxNumOfIterations(10);           //   ...or after 10 iterations
job.SetReducerInputCache(true);          // turn on caches
job.SetReducerOutputCache(true);
job.Submit();
Cache Infrastructure Details
 Programmer control
 Architecture for cache management
 Scheduling for inter-iteration locality
 Indexing the values in the cache
Other Extensions and Experiments
 Distributed databases and Pig/Hadoop for Astronomy [IASDS 09]
 Efficient “Friends of Friends” in Dryad [SSDBM 2010]
 SkewReduce: Automated skew handling [SOCC 2010]
 Image Stacking and Mosaicing with Hadoop [Hadoop Summit 2010]
 HaLoop: Efficient iterative processing with Hadoop [VLDB 2010]
MapReduce Broadly Applicable
 Biology
 [Schatz 08, 09]
 Astronomy
 [IASDS 09, SSDBM 10, SOCC 10, PASP 10]
 Oceanography
 [UltraVis 09]
 Visualization
 [UltraVis 09, EuroVis 10]
Key idea
 When the loop output is large…
 transitive closure
 connected components
 PageRank (with a convergence test as the
termination condition)
 …need a distributed fixpoint operator
 typically implemented as yet another
MapReduce job -- on every iteration
Background
 Why is MapReduce popular?
 Because it’s fast?
 Because it scales to 1000s of commodity
nodes?
 Because it’s fault tolerant?
 Witness
 MapReduce on GPUs
 MapReduce on MPI
 MapReduce in main memory
 MapReduce on <10 nodes
So why is MapReduce popular?
 The programming model
 Two serial functions, parallelism for free
 Easy and expressive
 Compare this with MPI
 70+ operations
 But it can’t express recursion
 graph algorithms
 clustering
 machine learning
 recursive queries (CTEs, datalog, WITH clause)
Fixpoint
 A fixpoint of a function f is a value x such that
f(x) = x
 The fixpoint queries FIX can be expressed with
the relational algebra plus a fixpoint operator
 Map - Reduce - Fixpoint
 hypothesis: sufficient model for all recursive queries
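As a concrete instance (my formulation, not from the slides): with a Friend relation F(x, y), the transitive closure of Example 2 is the least fixpoint of a single relational-algebra step,
TC = lfp( R ↦ F ∪ π_{R.x, F.y}( R ⋈_{R.y = F.x} F ) ),
and semi-naïve evaluation reaches it by joining only the newest tuples each round, which is exactly the Join/Dupe-elim loop of Example 2.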
Editor's Notes

  • #4: Programming Model
  • #6: Programming Model
  • #41: User writes two serial functions, and MapReduce creates a parallel program with fairly clear semantics and good scalability.