Facebook Bistro

This is a summary for the article "Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems" by Andrey Goder, Alex Spiridonov, Yin Wang

I wrote this summary to give you an overall idea of the article.

Concepts and definitions (a small illustrative sketch follows this list):

·       A job: overall work to be completed on a set of data shards

·       A task: work to be completed on one data shard (a job is a set of tasks)

·       A worker: a process that performs tasks

·       A scheduler: a process that dispatches tasks to workers

·       A brute-force scheduler: a scheduler whose every iteration examines all tasks of all jobs to find runnable tasks to execute, and updates resources on the forest model accordingly

·       Worker host: a computer that runs a worker process

·       Data host: a computer that stores data shards
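
As a rough illustration of how these terms relate (my own sketch, not code from the paper; the names Job, Task, and Worker below are hypothetical), a job over N shards simply decomposes into N per-shard tasks that workers execute:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    job_name: str   # which job this task belongs to
    shard: str      # the single data shard it operates on

@dataclass
class Job:
    name: str
    shards: list    # the data shards the job must cover

    def tasks(self):
        # a job is simply the set of per-shard tasks
        return [Task(self.name, s) for s in self.shards]

@dataclass
class Worker:
    host: str       # the worker host (may or may not also be a data host)

    def run(self, task: Task):
        print(f"{self.host}: running {task.job_name} on shard {task.shard}")

# Example: one job over three shards yields three tasks.
job = Job("row_update", ["db001", "db002", "db003"])
worker = Worker(host="worker1.example.com")
for t in job.tasks():
    worker.run(t)
```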

This article is about Facebook's Bistro, a scheduler that runs data-intensive batch jobs on live, customer-facing production systems. Bistro adopts a new hierarchical model of data and computational resources, which lets it schedule workloads efficiently and adapt quickly to changing configurations. According to the authors, Bistro is replacing Hadoop and custom-built batch schedulers in production without degrading performance.

1-Introduction

Facebook engineers frequently need to run batch jobs that process, transform, or transfer huge amounts of data. Data-processing systems such as Hadoop operate only on copies of the data. As stated in the article, creating such copies is clumsy, slow, error-prone, and sometimes impossible given the sheer volume of data, and maintaining them is inefficient. Moreover, some of their batch jobs cannot be run on copies at all. Running batch jobs directly on live production systems can therefore dramatically improve efficiency in an environment such as Facebook's. Most existing batch-job schedulers, however, are designed for offline operation and are not appropriate for online systems.

2-Scheduling

The scheduling objective is to maximize resource utilization subject to resource constraints: a scheduler should not leave a task waiting when its required resources are available. At large scale, the main challenge is to minimize scheduling overhead, i.e., to quickly find tasks to execute whenever extra resources become available. In Facebook's formulation of the problem, each task requires resources along a unique path in their forest model. FIFO queue-based solutions are common but turn out to be inefficient for this setting, while a non-FIFO scheduler might fail to notice runnable tasks, which is unacceptable for large-scale computations in an environment like Facebook's.
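
To make the forest-model idea concrete, here is a minimal sketch (my own illustration under simplifying assumptions, not the paper's code) of checking whether a task's path of resource nodes has enough capacity. The node names, capacities, and the assumption that a task needs the same amount at every level are all made up:

```python
# Each task needs resources along one leaf-to-root path in a resource tree,
# e.g. database -> host -> rack. A task is runnable only if every node on
# its path still has the capacity the task requires.

capacity = {            # remaining capacity per resource node (hypothetical)
    "rack1": 10,        # e.g. network bandwidth units at the rack level
    "host1": 2,         # e.g. concurrent tasks allowed per data host
    "db001": 1,         # e.g. one task at a time per database shard
}

parent = {"db001": "host1", "host1": "rack1", "rack1": None}

def path(leaf):
    """Return the leaf-to-root list of resource nodes for a shard."""
    nodes = []
    while leaf is not None:
        nodes.append(leaf)
        leaf = parent[leaf]
    return nodes

def try_schedule(leaf, need):
    """Claim `need` units on every node along the path, if all have enough."""
    nodes = path(leaf)
    if all(capacity[n] >= need for n in nodes):
        for n in nodes:
            capacity[n] -= need
        return True
    return False

print(try_schedule("db001", 1))   # True: the task is started
print(try_schedule("db001", 1))   # False: db001 is now saturated
```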

Their performance baseline is the brute-force scheduler, which avoids unnecessary head-of-line blocking by searching the entire queue for runnable tasks. In practice, however, brute-force scheduling is slow and incurs significant overhead for short tasks. Bistro leverages the fact that a task only requires resources in one tree of the resource forest, or just a subtree if it does not consume resources all the way to the root. Like the brute-force scheduler, Bistro runs its scheduling procedure in an infinite loop, but it waits for a task to finish and then invokes a scheduling iteration on the corresponding subtree only. The tree scheduler does not schedule tasks in large batches, but it enables multi-threaded scheduling.
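
The difference from brute-force scanning can be sketched as follows (again my own illustration, not the paper's implementation): when a task finishes and frees resources at some node, only the subtree rooted at that node needs to be re-examined for runnable tasks.

```python
# Hypothetical sketch: brute-force vs. tree-based rescheduling.
# `tasks_by_leaf` maps each leaf resource node to its still-pending tasks.

tasks_by_leaf = {
    "db001": ["scrape"],
    "db002": ["scrape", "compress"],
    "db003": [],
}
children = {"host1": ["db001", "db002"], "host2": ["db003"]}

def leaves_under(node):
    """All leaf resource nodes in the subtree rooted at `node`."""
    if node not in children:
        return [node]
    out = []
    for c in children[node]:
        out.extend(leaves_under(c))
    return out

def brute_force_iteration():
    # Rescan every task of every job, regardless of where capacity freed up.
    return {leaf: list(pending) for leaf, pending in tasks_by_leaf.items()}

def tree_iteration(freed_node):
    # Rescan only the subtree whose resources changed when a task finished.
    return {leaf: list(tasks_by_leaf[leaf]) for leaf in leaves_under(freed_node)}

print(brute_force_iteration())   # examines db001, db002 and db003
print(tree_iteration("host1"))   # examines only db001 and db002
```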

3-Implementation

The architecture of Bistro consists of six modules that work asynchronously and communicate via either snapshots or queues. The first is the config loader, which reads and periodically updates the resource and job configuration from a source such as a file, a URL, or a ZooKeeper instance. The second is the node fetcher, which builds and periodically updates the resource model. The third is the scheduler, which chooses tasks to execute based on the latest configuration and resource model as well as the statuses of running and unfinished tasks. These statuses are maintained by the fourth module, the status store. The fifth, the task runner, receives a list of tasks to start and can either run them locally or dispatch them to remote workers, depending on the configuration. Finally, users can observe the progress of jobs via a web UI provided by the last module, the monitor.
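
A very rough sketch of how such modules might be wired together with snapshots and queues (all names, fields, and data below are my own illustration, not Bistro's actual interfaces):

```python
import queue

# Hypothetical snapshots/queues connecting the modules.
config_snapshot = {"jobs": ["scrape", "compress"]}   # produced by the config loader
node_snapshot   = ["db001", "db002"]                 # produced by the node fetcher
tasks_to_run    = queue.Queue()                      # scheduler -> task runner
status_updates  = queue.Queue()                      # task runner -> status store

def scheduler_pass():
    """Scheduler: combine the latest config and node snapshots into runnable tasks."""
    for job in config_snapshot["jobs"]:
        for node in node_snapshot:
            tasks_to_run.put((job, node))

def task_runner_pass():
    """Task runner: execute (or dispatch) each task and report its status."""
    while not tasks_to_run.empty():
        job, node = tasks_to_run.get()
        status_updates.put((job, node, "done"))      # stand-in for real execution

scheduler_pass()
task_runner_pass()
while not status_updates.empty():                    # status store / monitor view
    print(status_updates.get())
```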


In Bistro, workers can reside either on the data hosts or in a separate worker pool, which leads to four scheduling modes. The single/co-locate mode has one central Bistro instance and one worker on each data host; each worker receives only tasks that access its local data. The multi/co-locate mode is for large-scale jobs that a single scheduler cannot handle efficiently due to an excessively large resource model or excessively high turnover. The single/remote mode is like single/co-locate, with one scheduler but workers on dedicated worker hosts, which is good for load balancing. The multi/remote mode has multiple Bistro instances. The scheduler also implements four scheduling policies: round robin, randomized priority, ranked priority, and long-tail scheduling.
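
As an illustration of what such policies might look like (these are my own sketches, not Bistro's code, and the long-tail rule shown is just one plausible interpretation), each policy is simply a different rule for picking which job's next task gets a node's free resources:

```python
import random
from itertools import cycle

# Hypothetical job list with made-up priorities and remaining task counts.
jobs = [
    {"name": "scrape",   "priority": 3, "remaining_tasks": 120},
    {"name": "compress", "priority": 1, "remaining_tasks": 5},
    {"name": "backfill", "priority": 2, "remaining_tasks": 5000},
]

# Round robin: cycle through the jobs in a fixed order.
round_robin = cycle(jobs)

# Randomized priority: pick a job with probability proportional to its priority.
def randomized_priority():
    return random.choices(jobs, weights=[j["priority"] for j in jobs])[0]

# Ranked priority: always pick the highest-priority job that still has tasks.
def ranked_priority():
    return max((j for j in jobs if j["remaining_tasks"] > 0),
               key=lambda j: j["priority"])

# Long tail: favor jobs that are almost finished, so stragglers complete sooner.
def long_tail():
    return min((j for j in jobs if j["remaining_tasks"] > 0),
               key=lambda j: j["remaining_tasks"])

print(next(round_robin)["name"], randomized_priority()["name"],
      ranked_priority()["name"], long_tail()["name"])
```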

4-Evaluation

For the evaluation, the authors compare Bistro's algorithm with the brute-force approach. In the microbenchmark, brute-force scheduling shows relatively stable performance, while tree-based scheduling achieves almost zero overhead when there are enough threads to handle the turnover rate; otherwise its performance degrades considerably.

For their production applications, they essentially replaced all of their previous schedulers with Bistro. Database iterator, an application that modifies rows directly in hundreds of thousands of databases distributed over thousands of hosts, was previously handled by asynchronous PHP functions. Database scraping, a common step in ETL pipelines, used to run on a distributed execution system for time-based jobs; Bistro took over those scraping jobs and significantly reduced the scraping time. The Haystack/f4 applications for Binary Large Objects (BLOBs) now use Bistro as well. HBase compression, a convenient application for long-term data storage with a schema, ran its compression jobs on Hadoop before Bistro took over.

5-Conclusion

In a nutshell, the authors present a tree-based scheduler called Bistro, a toolkit for building distributed computation systems. It can schedule and run distributed tasks, including data-parallel jobs. Bistro has gained popularity at Facebook to the point where it has replaced Hadoop and numerous custom-built schedulers in their production systems.
