Guagua: An Iterative Computing Framework on Hadoop
Zhang Pengshan (David), PayPal
AGENDA 
• Introduction 
• Distributed Neural Network Algorithm 
• What is Guagua? 
• Guagua Advanced Features 
• Shifu on Guagua 
• Future Plans
ALIPAY vs. PAYPAL 
Q: Where is risk control in PayPal? 
A: Risk control is everywhere in paypal.com.
FRAUD TYPES IN PAYPAL
• Account Take Over
• Stolen Financials
• INR/SNAD
• Credit Cards
INR: Item Not Received
SNAD: Significantly Not as Described
RISK CONTROL IN PAYPAL
• Models
• Rules
• Agents
RISK MODELING IN PAYPAL
MODELING CHALLENGES
• Thousands of Features
• Algorithms (LR, NN, DT)
• Big Training Data
• SLA (Online)
• Simulation
Q: How to train models with TB data and thousands of features?
DISTRIBUTED NEURAL NETWORK ALGORITHM*
[Diagram: in each iteration, every worker sends its gradients (double[]) to the master; the master accumulates the gradients, updates the weights, and sends the weights (double[]) back to every worker for the next iteration.]
* Distributed batch gradient descent algorithm.
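The round trip in the diagram can be sketched in a few lines of Python. This is a minimal single-process illustration, not Guagua or Shifu code: a linear model with squared error stands in for the neural network, and the function names, toy data, learning rate, and iteration count are all assumptions made for the example.

```python
import random

def worker_gradient(weights, shard):
    """Worker side: sum the squared-error gradients over this worker's shard."""
    grad = [0.0] * len(weights)
    for features, target in shard:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad

def master_update(weights, worker_gradients, learning_rate, n_records):
    """Master side: accumulate the gradients from all workers, then update the weights once."""
    total = [sum(g[j] for g in worker_gradients) for j in range(len(weights))]
    return [w - learning_rate * t / n_records for w, t in zip(weights, total)]

def train(shards, n_features, iterations=200, learning_rate=0.5):
    weights = [0.0] * n_features
    n_records = sum(len(shard) for shard in shards)
    for _ in range(iterations):
        # Every worker computes gradients against the same weights for this iteration.
        gradients = [worker_gradient(weights, shard) for shard in shards]
        # The master produces the weights that all workers use in the next iteration.
        weights = master_update(weights, gradients, learning_rate, n_records)
    return weights

# Toy data (assumed): target = 2*x1 - 3*x2, split across three "workers".
rng = random.Random(0)
records = []
for _ in range(300):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    records.append((x, 2 * x[0] - 3 * x[1]))
shards = [records[i::3] for i in range(3)]
weights = train(shards, n_features=2)
```

Because the master sums the per-shard gradients before the single weight update, the result matches batch gradient descent over the full data set, however the records are split.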
Q: How to implement it?
WHY NOT MAHOUT OR SPARK?
Mahout
• No distributed logistic regression or neural network.
• Iteration runs as a series of separate Hadoop jobs, which performs badly.
Spark
• No independent Spark cluster.
• Hadoop cluster is still 1.0 based, not YARN.
Q: How to implement it in Hadoop?
POSSIBLE SOLUTIONS
Hadoop YARN
Pros
• Flexible framework for frameworks
• Self resource management
Cons
• 2.0.3-Alpha
• PayPal Clusters: Hadoop 0.20.2
• Extra fault tolerance, splits, UI …
Hadoop MapReduce
Pros
• Works well on all Hadoop versions
• Mature computing model
• Internal fault tolerance, splits, UI …
Cons
• Different computing model
• How to do iterative coordination?
Q: How to implement it in Hadoop MapReduce?
ITERATIVE COMPUTING MODEL IN GUAGUA
[Diagram: in each iteration, every worker sends its worker result to the master, and the master sends its master result back to every worker for the next iteration.]
Guagua is a framework built on this iterative computing model, just as Hadoop 1.0 is a framework built on MapReduce.
GUAGUA API 
MasterComputable 
WorkerComputable
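A Python analogue of the two roles may help show how they fit together. Only the names MasterComputable and WorkerComputable come from the slide; the method signatures, the type parameters, the `run` driver, and the mean-computing example are assumptions for illustration and do not reproduce Guagua's Java API.

```python
from abc import ABC, abstractmethod
from typing import Generic, List, Optional, Tuple, TypeVar

M = TypeVar("M")  # master result type
W = TypeVar("W")  # worker result type

class MasterComputable(ABC, Generic[M, W]):
    """Assumed shape: combine all worker results into one master result."""
    @abstractmethod
    def compute(self, worker_results: List[W]) -> M: ...

class WorkerComputable(ABC, Generic[M, W]):
    """Assumed shape: use the last master result to produce a worker result."""
    @abstractmethod
    def compute(self, master_result: Optional[M]) -> W: ...

class SumWorker(WorkerComputable[float, Tuple[float, int]]):
    def __init__(self, shard: List[float]) -> None:
        self.shard = shard

    def compute(self, master_result: Optional[float]) -> Tuple[float, int]:
        # Each worker reports a partial sum and a record count for its shard.
        return (sum(self.shard), len(self.shard))

class MeanMaster(MasterComputable[float, Tuple[float, int]]):
    def compute(self, worker_results: List[Tuple[float, int]]) -> float:
        total = sum(s for s, _ in worker_results)
        count = sum(n for _, n in worker_results)
        return total / count

def run(master, workers, iterations):
    """Hypothetical in-process driver: workers first, then the master, once per iteration."""
    master_result = None
    for _ in range(iterations):
        worker_results = [w.compute(master_result) for w in workers]
        master_result = master.compute(worker_results)
    return master_result

workers = [SumWorker([1.0, 2.0]), SumWorker([3.0]), SumWorker([4.0, 5.0])]
mean = run(MeanMaster(), workers, iterations=1)
```

The application only supplies the two `compute` methods; everything about ordering and result exchange lives in the driver, which is the part a framework would own.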
GUAGUA OVERVIEW
[Architecture diagram of the iterative computing framework. Labels shown:]
• Core
• Master-Workers Core
• Coordination
• Fault Tolerance
• MapReduce Adapter (for Hadoop 1.0)
• YARN Adapter (for Hadoop 2.0)
• Consistent Client
• Distributed Neural Network Application
PLUGGABLE, SCALABLE INTERCEPTORS
Master
• Fault Tolerance Interceptor
• ZooKeeper Coordinator
• Perf Interceptor (Timer)
• User Defined Interceptors
• Master Computation
Worker
• Fault Tolerance Interceptor
• ZooKeeper Coordinator
• Perf Interceptor (Timer)
• User Defined Interceptors
• Worker Computation
* These two diagrams show the aspects applied in each iteration.
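One common way to wrap a computation in pluggable layers is a chain of pre and post hooks. The sketch below assumes that shape; the class names, the hook names `pre_iteration` and `post_iteration`, and the dictionary used as a context are hypothetical and are not taken from Guagua.

```python
import time

class Interceptor:
    """Hypothetical base class: hooks that run before and after the computation."""
    def pre_iteration(self, context):
        pass

    def post_iteration(self, context):
        pass

class RecordingInterceptor(Interceptor):
    """Stand-in for any interceptor; it only records when its hooks fire."""
    def __init__(self, name):
        self.name = name

    def pre_iteration(self, context):
        context["events"].append("pre:" + self.name)

    def post_iteration(self, context):
        context["events"].append("post:" + self.name)

class TimerInterceptor(Interceptor):
    """Measures the time spent between its pre and post hooks."""
    def pre_iteration(self, context):
        context["started"] = time.perf_counter()

    def post_iteration(self, context):
        context["elapsed"] = time.perf_counter() - context["started"]

def run_iteration(interceptors, computation, context):
    # Pre hooks run in list order, post hooks in reverse order,
    # so the first interceptor is the outermost layer.
    for interceptor in interceptors:
        interceptor.pre_iteration(context)
    context["result"] = computation(context)
    for interceptor in reversed(interceptors):
        interceptor.post_iteration(context)
    return context["result"]

def computation(context):
    context["events"].append("compute")
    return 42

context = {"events": []}
chain = [RecordingInterceptor("fault-tolerance"), TimerInterceptor(), RecordingInterceptor("user")]
result = run_iteration(chain, computation, context)
```

Adding a user-defined interceptor then means appending one more object to the list, with no change to the computation itself.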
GUAGUA RUNTIME
[Diagram: one master and five workers, each running as a Mapper (Container), all register with the ZooKeeper cluster.]
1. The master watches the znodes of the workers.
2. The workers watch the znode of the master.
GUAGUA RUNTIME
[Diagram: the master and the five workers each update their iteration state in the ZooKeeper cluster.]
1. Data is loaded into worker memory in the first iteration.
2. The whole process finishes when the maximum iteration is reached or a halt condition is triggered.
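The two runtime rules above can be sketched as a loop. This is a toy, single-process illustration under assumed names (`CachingWorker`, `run`, `is_halted`); the halving "error" is invented for the example, and ZooKeeper coordination is left out entirely.

```python
class CachingWorker:
    """Loads its shard once, in the first iteration, and keeps it in memory."""
    def __init__(self, loader):
        self.loader = loader
        self.data = None
        self.loads = 0

    def compute(self, master_result):
        if self.data is None:
            # Only the first iteration pays the cost of reading the input.
            self.data = self.loader()
            self.loads += 1
        return sum(self.data)

def run(workers, master_compute, max_iterations, is_halted):
    """Stop at the maximum iteration or as soon as the halt condition holds."""
    master_result = None
    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        worker_results = [w.compute(master_result) for w in workers]
        master_result = master_compute(iteration, worker_results)
        if is_halted(master_result):
            break
    return iteration, master_result

def toy_master(iteration, worker_results):
    # Invented "error" that halves every iteration, just to trigger the halt.
    return sum(worker_results) / 2 ** iteration

def below_threshold(error):
    return error < 0.1

workers = [CachingWorker(lambda: [1, 2]), CachingWorker(lambda: [3, 4])]
halted_at, halted_error = run(workers, toy_master, max_iterations=100, is_halted=below_threshold)

capped_workers = [CachingWorker(lambda: [1, 2]), CachingWorker(lambda: [3, 4])]
capped_at, capped_error = run(capped_workers, toy_master, max_iterations=5, is_halted=below_threshold)
```

In the first run the halt condition ends the loop early; in the second the iteration cap ends it before the condition is ever met.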
FAULT TOLERANCE
[Diagram: the master and its workers across iterations 1, 2, 3, 4, …, n.]
* The same on workers.
STRAGGLER MITIGATION
[Diagram: the master and its workers across iterations 1, 2, 3.]
PROGRESS AND STATE REPORT
Progress = (Current Iteration) / (Total Iterations), for example 432/501 ≈ 0.86
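The reported figure is a plain ratio, which a short check confirms. The function name and the clamping to 1.0 are assumptions for illustration, not Guagua behavior.

```python
def progress(current_iteration, total_iterations):
    """Fraction of iterations completed, clamped to 1.0 (clamping is an assumption)."""
    if total_iterations <= 0:
        raise ValueError("total_iterations must be positive")
    return min(current_iteration / total_iterations, 1.0)

value = progress(432, 501)
reported = round(value, 2)
```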
GUAGUA UNIT
WHAT IS SHIFU?
Shifu* is an open-source, end-to-end machine learning and data mining framework built on top of Hadoop.
Pipeline steps: NEW, INIT, STATS, VARSELECT, NORMALIZE, TRAIN, POSTTRAIN, EVAL
Built on Guagua
* Want to try Shifu? Please visit http://shifu.ml.
SHIFU ON GUAGUA (TRAIN STEP)
[Class diagram. Classes shown: NNMaster, NNWorker, NNOutput, MasterComputable, WorkerComputable, AbstractWorkerComputable, MasterInterceptor, BasicMasterInterceptor, Gradients, Weights. Legend: Guagua API, Shifu code, Encog code.]
SHIFU NN vs. SPARK LR
Run Time Comparison
Shifu-NN: 1102*20*1 Network, 319 Mappers * 1G
Spark-LR: 1102 features, 120 executors * 3G
[Chart: time (y-axis, 0 to 45) per iteration (x-axis, 1 to 20) for Shifu-NN and Spark-LR.]
SHIFU NN BENCHMARK RESULTS
All data is held in memory. At most 2400 mappers were used, with 20 epochs.
[Chart: run time in seconds (y-axis, 0 to 1400) against size of input (x-axis: 125G, 375G, 500G, 625G, 750G, 875G, 1000G).]
WHAT’S NEXT? 
• More open source docs 
• Support more (distributed) machine learning algorithms 
• Improve YARN (Beta) implementation 
• Support more input formats 
• Big model support 
• Deep learning support
Q&A
APPENDIX
• Website
  • http://shifu.ml
  • http://shifu.ml/docs/guagua/
• Guagua issue website
  • https://github.com/shifuml/shifu/issues
  • https://github.com/shifuml/guagua/issues
• Shifu & Guagua source code
  • https://github.com/shifuml/shifu/
  • https://github.com/shifuml/guagua/
@InfoQ infoqchina
