Workflow Hacks! #1
Taro L. Saito

leo@treasure-data.com
Dec. 14, 2015
dots. Tokyo, Japan
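Survey
• A survey will be sent by email after the event
• Questions
• What systems are you currently using?
• What problems do you want to solve with workflows?
• Respondents will be entered into a drawing to win a Treasure Data hoodie!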
About Me: Taro L. Saito
2007 University of Tokyo. Ph.D.
XML DBMS, Transaction Processing
Relational-Style XML Query [SIGMOD 2008]
~ 2014 Assistant Professor at University of Tokyo
Genome Science Research
- Big Data Processing
- Distributed Computing
2014.03~ Treasure Data, Inc. Tokyo
2015.07~ Treasure Data, Inc. Mountain View, CA
Workflow Hacks #1 - dots. Tokyo
Cloud Platform for Data Analytics
• Importing 1,000,000~ records / sec.
• Presto (Distributed SQL engine)
• 50,000~ queries / day
• Processing 10 trillion records / day
• http://qiita.com/xerial/items/a9093b60062f2c613fda
[Figure: platform overview with the labels Import, Store, Analyze with Presto/Hive (Distributed SQL Engine), Export, Enterprise Data, BI]
Workflow Fundamental Features
• Dependency management
• task1 -> task2 -> task3 …
• Scheduling
• Execution monitoring
• State management
• Error handling
• Easy access to logs
• Notification
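The features above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation: the `dependencies` map, the `run_workflow` helper, and the print-based notification are all hypothetical, and scheduling and log storage are left out.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "task1": set(),
    "task2": {"task1"},
    "task3": {"task2"},
}

def run_workflow(deps, actions):
    """Run tasks in dependency order and record a state for each task."""
    state = {name: "pending" for name in deps}
    for name in TopologicalSorter(deps).static_order():
        # State management: skip a task when an upstream task did not succeed.
        if any(state[d] != "success" for d in deps[name]):
            state[name] = "skipped"
            continue
        try:
            actions[name]()
            state[name] = "success"
        except Exception as e:
            # Error handling plus a stand-in for a real notification channel.
            state[name] = "failed"
            print(f"[notify] {name} failed: {e}")
    return state

log = []
actions = {name: (lambda n=name: log.append(n)) for name in dependencies}
states = run_workflow(dependencies, actions)
print(log, states)
```

Keeping a per-task state is what makes monitoring and selective re-execution possible later.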
Workflow Tools
• Workflow Management Tools
• Python: Luigi, Airflow, Pinball
• For Hadoop: Oozie (XML)
• Script-based: Makefile, Azkaban
• Biological Science: Galaxy (Web UI), Nextflow
• Domestic (Japan): JP1, Hinemos
• Dataflow DSL
• Spark, Flink, DryadLINQ, TensorFlow
• Cascading (Java -> MR), Scalding (Scala -> MR)
Dataflow DSL
• Translate this data processing program
• into a cluster computing program
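[Figure: a data processing program A -> f -> B -> g -> C, and its cluster form in which A is split into partitions A0, A1, A2 and B into B0, B1, B2, with f as the map stage and g as the reduce stage]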
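As a rough sketch of this translation, the same two functions can be applied either to a whole dataset or to its partitions. The functions `f` and `g`, the `split` helper, and the choice of three partitions are assumptions for illustration; a real engine would run the partitions on different machines.

```python
from functools import reduce

def f(x):
    # Per-record transformation (the map side).
    return x * x

def g(acc, x):
    # Associative aggregation (the reduce side).
    return acc + x

A = list(range(10))

# The data processing program as written: C = g(f(A)).
C_local = reduce(g, map(f, A), 0)

def split(data, n):
    # Round-robin split into n partitions.
    return [data[i::n] for i in range(n)]

# The cluster form of the same program.
A_parts = split(A, 3)                                  # A0, A1, A2
B_parts = [[f(x) for x in part] for part in A_parts]   # B0, B1, B2 (map)
partials = [reduce(g, part, 0) for part in B_parts]    # reduce within each partition
C_cluster = reduce(g, partials, 0)                     # merge the partial results

print(C_local, C_cluster)
```

Both forms give the same answer only because `g` is associative, which is why partitioned execution can merge partial results safely.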
Redbook: Dataflow Engines
• Chapter 5: Large-Scale Dataflow Engines, by Peter Bailis
• http://www.redbook.io/ch5-dataflow.html
• DryadLINQ
• Most influential interface for dataflow DSL
• SQL-like operation
• Functional style
• Spark
• SparkSQL
• 70% of Spark accesses
• Dataset API
• Shift to the DataFrame-based API
Dataflow -> Execution Plan
• Example - Hive: SQL to MapReduce
• Mapping SQL stages into a MapReduce program
• SELECT page, count(*) FROM weblog GROUP BY page
[Figure: HDFS -> split -> map (A0, A1, A2) -> reduce (B0, B1, B2, B3) -> merge -> HDFS, annotated with the plan stages TableScan(weblog), GroupBy(hash(page)), count(weblog of a page), result]
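The stages of the query above can be imitated in plain Python. This is a sketch of the idea, not Hive's actual planner or runtime: the sample `weblog` rows, `NUM_REDUCERS`, and the use of CRC32 as the partitioning hash are all assumptions.

```python
import zlib
from collections import Counter, defaultdict

# Hypothetical input rows standing in for the weblog table.
weblog = [
    {"page": "/home"}, {"page": "/about"}, {"page": "/home"},
    {"page": "/docs"}, {"page": "/home"},
]
NUM_REDUCERS = 2

def map_phase(split):
    # TableScan(weblog): emit (page, 1) for every row in this split.
    return [(row["page"], 1) for row in split]

def partition(key):
    # GroupBy(hash(page)): a stable hash sends a page to one reducer only.
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def shuffle(mapped_splits):
    buckets = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            buckets[partition(key)].append((key, value))
    return buckets

def reduce_phase(pairs):
    # count(weblog of a page): sum the ones for each page.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

splits = [weblog[:3], weblog[3:]]          # input splits
mapped = [map_phase(s) for s in splits]    # map
buckets = shuffle(mapped)                  # shuffle by hash(page)
result = {}
for pairs in buckets.values():             # reduce, then merge
    result.update(reduce_phase(pairs))
print(result)
```

Because every occurrence of a page lands in the same bucket, the per-reducer counts never overlap and can be merged with a plain dictionary update.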
Workflows
[Figure: a workflow DAG over the nodes A through G, with tasks f and g]
Hadoop is not enough
• C. Olston et al. [SIGMOD 2011]
• continuous processing
• independent scheduling
• Incremental processing
• Google Percolator [OSDI 2010]
• Naiad - Differential Dataflow, Microsoft [SOSP 2013]
Continuous Processing
• The Dataflow Model
• Akidau et al., Google [VLDB 2015]
• Unbounded data processing
• late-coming data
• Integration of
• batch processing
• accumulation
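One way to picture late-coming data and accumulation is a counter over event-time windows that keeps updating a window whenever another event for it shows up. This is a simplified sketch and not the API of the Dataflow Model paper: `WINDOW_SECONDS`, `WindowedCounter`, and the arrival sequence are assumptions, and watermarks and triggers are omitted.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size

def window_start(event_time):
    # Assign an event to a fixed window by its event time, not its arrival time.
    return event_time - (event_time % WINDOW_SECONDS)

class WindowedCounter:
    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, event_time):
        """Add one event and return the updated (window, count) result."""
        w = window_start(event_time)
        self.counts[w] += 1          # accumulate into the existing result
        return w, self.counts[w]

# Event times in seconds, listed in arrival order; 58 arrives late,
# after events that belong to later windows.
arrivals = [5, 42, 61, 75, 130, 58]
counter = WindowedCounter()
panes = [counter.process(t) for t in arrivals]
print(panes)
```

The late event refines the first window's count instead of being dropped, so a batch recomputation over the same events would reach the same totals.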
Cluster Computing with Dryad, M. Budiu, 2008
Workflow Hacks!
Airflow
• Best practices with Airflow - An open source platform for workflows & schedules (Nov 2015)
• At Silicon Valley Data Engineering Meetup
• https://youtu.be/dgaoqOZlvEA
Workflow Development
• Programmatic
• Generate workflows by code
• Configuration as Code
• Workflow reuse/overwrite
• object oriented
• Parameterization
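Generating workflows by code can be as simple as a function that returns a dependency map for given parameters. The sketch below is tool-agnostic; the task naming scheme, `daily_workflow`, and `backfill` are hypothetical and do not correspond to the Airflow or Luigi APIs.

```python
from datetime import date, timedelta

def daily_workflow(day, tables):
    """Generate a task -> upstream-tasks map for one day (names are made up)."""
    d = day.isoformat()
    deps = {}
    for t in tables:
        deps[f"import_{t}_{d}"] = set()
        deps[f"aggregate_{t}_{d}"] = {f"import_{t}_{d}"}
    # The report waits for every aggregate of the same day.
    deps[f"report_{d}"] = {f"aggregate_{t}_{d}" for t in tables}
    return deps

def backfill(start, days, tables):
    """Reuse the same definition for a range of days."""
    workflow = {}
    for i in range(days):
        workflow.update(daily_workflow(start + timedelta(days=i), tables))
    return workflow

wf = backfill(date(2015, 12, 14), 2, ["weblog", "purchase"])
for task in sorted(wf):
    print(task, "<-", sorted(wf[task]))
```

Changing the date range or the table list changes the generated workflow without touching its definition, which is the point of parameterization.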
Luigi
• Workflow management with Luigi (article in Japanese)
• http://qiita.com/k24d/items/fb9bed08423e6249d376
Nextflow
• http://www.nextflow.io/
Dataflow DSL vs Workflow DSL
• Dataflow
• A -> B -> C -> …
• Data dependencies
• Workflow
• Task A -> Task B -> Task C -> …
• Task dependencies
• Data transfer is optional (through file or DB)
• + Scheduling
• + Task names
• For monitoring, redo, etc.
Weavelet (wvlet)
• Object-oriented workflow DSL for Scala
• Workflow reuse, extension, override
• Parameterization
• Function := Task, Workflow := Class
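The "Function := Task, Workflow := Class" idea can be illustrated with ordinary classes. The following is a Python analogy only, not Weavelet's Scala API; the class names, the `table` parameter, and the query strings are invented for the example.

```python
class DailyReport:
    """A workflow is a class; each method is a task."""
    table = "weblog"  # parameter that subclasses may change

    def extract(self):
        return f"SELECT * FROM {self.table}"

    def summarize(self):
        # Calling extract() expresses the dependency summarize -> extract.
        return f"summary of ({self.extract()})"

    def run(self):
        return self.summarize()

class PurchaseReport(DailyReport):
    # Reuse by parameterization: same tasks, different table.
    table = "purchase"

class SampledReport(DailyReport):
    # Reuse by override: replace one task and keep the rest of the workflow.
    def extract(self):
        return super().extract() + " LIMIT 100"

print(DailyReport().run())
print(PurchaseReport().run())
print(SampledReport().run())
```

Inheritance gives workflow extension for free, since a subclass only states what differs from the base workflow.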
Isolating DAG generation and its execution
• Alternatives to MapReduce (MR)
• Tez
• Pig on Spark https://issues.apache.org/jira/browse/PIG-4059
• Asakusa on Hadoop, Spark
[Figure: the DSL generates a DAG, which is then executed on a Local, Hadoop, or Spark backend to produce the Result]
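A small sketch of this separation: the DAG is built once as plain data, and different executors decide what to do with it. The node format, `LocalExecutor`, and `ExplainExecutor` are assumptions made for illustration; a Hadoop or Spark executor would take the same DAG and submit jobs instead.

```python
from graphlib import TopologicalSorter

def build_dag():
    # DAG generation: node -> (operation, list of upstream nodes).
    return {
        "A": (lambda: [1, 2, 3, 4], []),
        "B": (lambda a: [x * 10 for x in a], ["A"]),
        "C": (lambda b: sum(b), ["B"]),
    }

def plan(dag):
    # Order the nodes so that every node comes after its upstream nodes.
    graph = {node: set(upstream) for node, (_, upstream) in dag.items()}
    return list(TopologicalSorter(graph).static_order())

class LocalExecutor:
    """Runs every node in-process."""
    def run(self, dag, target):
        results = {}
        for node in plan(dag):
            op, upstream = dag[node]
            results[node] = op(*(results[u] for u in upstream))
        return results[target]

class ExplainExecutor:
    """Runs nothing; only reports the execution order."""
    def run(self, dag, target):
        return plan(dag)

dag = build_dag()
value = LocalExecutor().run(dag, "C")
order = ExplainExecutor().run(dag, "C")
print(value, order)
```

Since neither executor changes how the DAG is built, the same definition can be tested locally before it is sent to a cluster backend.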
Stream DSL
• Add "moving stream" support to Dataflow DSL
• "moving" streams and "resting" datasets
• Example
• Spark Streaming
• Spark DSL + Micro-batch for stream
• Microsoft Azure Stream SQL
• Windowing support for moving data
• Norikra
• Stream processing with SQL
• Reactive programming
• ReactiveX (Netflix), Akka Streams (beta) <- Stream DSL (DAG)
• Back-pressure support
• Controlling data transfer speed from receiver side
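Back-pressure can be demonstrated with a bounded buffer between a fast sender and a slow receiver. A blocking queue is used here as a simple stand-in; it is not how ReactiveX or Akka implement back-pressure, and `BUFFER_SIZE`, the sleep interval, and the item count are arbitrary assumptions.

```python
import queue
import threading
import time

BUFFER_SIZE = 2
DONE = object()  # sentinel that marks the end of the stream

def producer(q, items):
    for item in items:
        q.put(item)   # blocks while the buffer is full: the sender is slowed down
    q.put(DONE)

def consumer(q, out, max_seen):
    while True:
        max_seen[0] = max(max_seen[0], q.qsize())
        item = q.get()
        if item is DONE:
            break
        time.sleep(0.001)  # simulate a slow receiver
        out.append(item)

q = queue.Queue(maxsize=BUFFER_SIZE)
received, max_seen = [], [0]
t1 = threading.Thread(target=producer, args=(q, range(20)))
t2 = threading.Thread(target=consumer, args=(q, received, max_seen))
t1.start()
t2.start()
t1.join()
t2.join()
print(len(received), max_seen[0])
```

The sender never gets more than `BUFFER_SIZE` items ahead, so the receiver's pace bounds memory use on both sides.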
Task Execution Retry
• Design patterns for retries and idempotency (article in Japanese)
• http://frsyuki.hatenablog.com/entry/2014/06/09/164559
• System failures
• Process is not responding
• network, hardware failures
• Middleware failures
• provisioning failures, missing components
• User failures
• Wrong configuration
• Programming error
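The failure categories above suggest different retry policies: system failures are often transient and worth retrying, while user failures will fail the same way every time. The sketch below assumes the task can tell them apart; the exception classes and `run_with_retry` are hypothetical, and backoff between attempts is left out for brevity.

```python
class SystemFailure(Exception):
    """Transient problem such as a network or hardware failure."""

class UserFailure(Exception):
    """Wrong configuration or programming error; retrying cannot help."""

def run_with_retry(task, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except SystemFailure:
            if attempt == max_attempts:
                raise  # give up after the last attempt
        # UserFailure is not caught here, so it propagates immediately.

attempts = {"n": 0}

def flaky():
    # Fails twice with a system failure, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SystemFailure("process is not responding")
    return "ok"

user_calls = {"n": 0}

def bad_config():
    user_calls["n"] += 1
    raise UserFailure("wrong configuration")

result = run_with_retry(flaky)
print(result, attempts["n"])
```

Calling `run_with_retry(bad_config)` raises `UserFailure` after a single attempt, so no retries are spent on an error that needs a human fix.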
Retry Example
• Example: Task calling a REST API /create/xxx
• Client: First attempt
• Server returns 200 (success)
• But the client fails to receive the status code
• Client retries the task
• Gets a 409 Conflict error (entry xxx has already been created)
• Solution (Application side)
• Handle the 409 error as success in the client (idempotent execution)
• Stricter approach
• Make xxx unique for each request
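The scenario above can be reproduced with an in-memory stand-in for the server. Everything here is hypothetical: `FakeServer` simulates the lost response, and `create_idempotent` is one possible client-side policy, not a real REST client.

```python
import uuid

class FakeServer:
    """In-memory stand-in for the REST endpoint /create/xxx."""
    def __init__(self):
        self.entries = {}
        self.drop_next_response = False

    def create(self, name):
        if name in self.entries:
            return 409                      # conflict: entry already exists
        self.entries[name] = True
        if self.drop_next_response:
            # The entry was created, but the client never sees the 200.
            self.drop_next_response = False
            raise ConnectionError("response lost")
        return 200

def create_idempotent(server, name, max_attempts=3):
    for _ in range(max_attempts):
        try:
            status = server.create(name)
        except ConnectionError:
            continue                        # outcome unknown, so retry
        # Treat 409 as success: the entry exists, which is the desired result.
        return status in (200, 409)
    return False

server = FakeServer()
server.drop_next_response = True
ok = create_idempotent(server, "xxx")

# Stricter approach: a name that is unique for each request means a 409 can
# only be caused by this client's own earlier attempt.
unique_name = f"xxx-{uuid.uuid4()}"
ok_unique = create_idempotent(server, unique_name)
print(ok, ok_unique, len(server.entries))
```

Without the unique name, a 409 could also mean that another client created the same entry, which is why the second approach is the stricter one.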
Fault Tolerance
• Presto: Distributed query engine developed by Facebook
• Uses HTTP data transfer
• No fault-tolerance
• 99.5% of queries finish without any failure
• For queries processing 10 billion or more rows => drops to 85%
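A back-of-the-envelope model shows why larger queries fail more often without fault tolerance. It assumes independent task failures, and the per-task success rate and task counts below are made-up inputs chosen only to land near the figures above; they are not measurements from Presto.

```python
def query_success_rate(task_success, num_tasks):
    """Chance that every task succeeds when a single failure fails the query."""
    return task_success ** num_tasks

def success_with_retries(query_success, attempts):
    """Chance that at least one of several whole-query attempts succeeds."""
    return 1 - (1 - query_success) ** attempts

# Hypothetical inputs: 99.99% per-task success, 50 tasks vs 1600 tasks.
small = query_success_rate(0.9999, 50)
large = query_success_rate(0.9999, 1600)

# Retrying an 85%-reliable query once more, assuming independent attempts.
two_tries = success_with_retries(0.85, 2)

print(round(small, 4), round(large, 4), round(two_tries, 4))
```

Under these assumptions a single retry lifts an 85% query to roughly 97.75%, and smaller tasks keep each retry cheap.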
[Figure: the split -> map -> reduce -> merge pipeline from the Hive example, with the stages TableScan(weblog), GroupBy(hash(page)), count(weblog of a page), result]
Summary
• Recent workflow tools
• Driven by Python community
• Because of this book! (=>)
• Airflow, Luigi, etc.
• Workflow manager
• Handle system failures, monitoring
• Workflow development
• DAG based DSL (dataflow, workflow, stream processing) -> Execution
• Does not cover application logic errors
• Idempotent execution
• Requires splitting large tasks into smaller ones
