Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI

How Spark Speedup AI
Mike Tang

Outline
● Spark ecosystem
● Spark ML and XGBoost
● Spark Deep Learning pipeline

Big data

What is machine learning or AI?
⬢ Database, big data, machine learning, AI?
⬢ “Using algorithms to understand the patterns in data”
⬢ Goals: prediction and insight

History of big data
⬢ Application driven
○ Billions of web pages
○ New system requirements
■ Cheap
■ Robust
■ Efficient
○ 2004 Google
○ 2007 Yahoo
○ Hadoop ecosystem
■ HDFS
■ MapReduce
■ YARN
○ 2012 Hortonworks

Application-driven big data
⬢ Ecosystem of Hadoop
○ How does Facebook use Hadoop?
■ Hive for OLAP query processing
■ HBase for tracking the activity of billions of users
○ How does Twitter use Hadoop?
■ Storm: stream processing for Twitter's streaming data
○ How does LinkedIn use Hadoop?
■ Kafka to subscribe to users' streaming data
○ How do the Hadoop components come together?
■ Ambari: node management and deployment of the different components

The leading data science platform for big data
[Figure: Apache Spark and Hadoop with interactive, streaming, and batch workloads, NoSQL, and TensorFlow]
⬢ Apache Spark
○ Driven by machine learning applications
○ The leading computation engine for big data processing
○ Data pipeline across different data sources and other computation engines
○ Uniform data processing objects: RDD and DataFrame
○ Memory based

Data pipeline for machine learning
[Figure: a Resilient Distributed Dataset spread across servers, feeding a pipeline of ETL, exploration, and machine learning: raw data processing, structured data, interactive/OLAP queries with Spark SQL, feature engineering, model training, data product, visualization]

ML is only a small part of a real-world ML system

Bring Data Science to Big Data
[Figure: ML lifecycle with history data, feedback data, and operational data; data scientist work (feature engineering, model selection, model tuning); ML pipeline and ML model; deploying, scoring, retraining, and continuous updating]

Outline
● Spark ecosystem
● Spark XGBoost
● Spark Deep Learning pipeline

Motivation
⬢ Machine learning for big data
⬢ Example applications
○ House price prediction
○ CTR prediction
○ ….
○ Product recommendation
⬢ ML job categories
○ Regression
○ Classification
○ Clustering
○ Etc.
⬢ XGBoost is good at
○ Regression and Classification

Motivation
⬢ XGBoost is the state-of-the-art approach on Kaggle for structured data
○ Reportedly, about 80% of winning teams base their solution on XGBoost
○ A tree-based model
○ Excellent at classification and regression
○ Ref: http://xgboost.readthedocs.io/en/latest/model.html

Motivation
⬢ Ensembles and boosting make model training time-consuming
○ Ensemble
○ An ensemble is a combination of prediction models that outputs a single final result
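As a minimal sketch of the ensemble idea above, the simplest way to combine several prediction models is to average their outputs. The three toy models and the name `EnsembleSketch` are made up for illustration and are not from the talk; real ensembles such as XGBoost combine many trained trees.

```scala
object EnsembleSketch {
  // A prediction model maps a feature vector to a numeric prediction.
  type Model = Array[Double] => Double

  // Three hypothetical weak models, each looking at the input differently.
  val models: Seq[Model] = Seq[Model](
    x => x(0) * 2.0,          // uses only the first feature
    x => x(1) + 1.0,          // uses only the second feature
    x => (x(0) + x(1)) / 2.0  // uses the mean of both features
  )

  // The ensemble output is the average of the individual predictions.
  def predict(x: Array[Double]): Double =
    models.map(m => m(x)).sum / models.size

  def main(args: Array[String]): Unit =
    println(predict(Array(1.0, 3.0)))
}
```

Training cost grows with the number of member models, which is why the slides treat ensembles as time-consuming.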

Motivation
⬢ Ensembles and boosting make model training time-consuming
○ Gradient boosting
○ Multiple rounds (1…M) of iteration, each correcting the errors of the previous rounds
○ Ref: https://www.slideshare.net/LonghowLam/machine-learning-overview
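The rounds described above can be sketched in plain Scala. This is a simplified illustration, not XGBoost's algorithm: it assumes squared-error loss, a single feature, and decision stumps as the weak learner, and all names (`BoostingSketch`, `Stump`, `eta` as the shrinkage factor) are chosen here for illustration.

```scala
object BoostingSketch {
  // A decision stump on one feature: predict `left` if x < threshold, else `right`.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x < threshold) left else right
  }

  private def mean(v: Seq[Double]): Double = if (v.isEmpty) 0.0 else v.sum / v.size

  // Fit the stump that minimizes squared error against the given targets.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    val pairs = xs.zip(targets)
    val candidates = xs.distinct.map { t =>
      val (l, r) = pairs.partition(_._1 < t)
      Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
    }
    candidates.minBy(s => pairs.map { case (x, y) => math.pow(y - s.predict(x), 2) }.sum)
  }

  // The boosted model predicts the shrunken sum of all stumps.
  def predict(model: Seq[Stump], x: Double, eta: Double): Double =
    model.map(s => eta * s.predict(x)).sum

  // Rounds 1..M: each new stump is fitted to the residuals left by earlier rounds.
  def boost(xs: Seq[Double], ys: Seq[Double], rounds: Int, eta: Double): Seq[Stump] =
    (1 to rounds).foldLeft(Vector.empty[Stump]) { (model, _) =>
      val residuals = xs.zip(ys).map { case (x, y) => y - predict(model, x, eta) }
      model :+ fitStump(xs, residuals)
    }

  def mse(model: Seq[Stump], xs: Seq[Double], ys: Seq[Double], eta: Double): Double =
    xs.zip(ys).map { case (x, y) => math.pow(y - predict(model, x, eta), 2) }.sum / xs.size

  def main(args: Array[String]): Unit = {
    val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    val ys = Seq(1.2, 1.9, 3.1, 3.9, 5.2, 6.1)
    for (m <- Seq(1, 5, 20))
      println(s"rounds=$m mse=${mse(boost(xs, ys, m, 0.3), xs, ys, 0.3)}")
  }
}
```

Because round m needs the residuals of rounds 1 to m-1, the rounds themselves run sequentially; the parallelism has to come from inside each round.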

Motivation
⬢ Training XGBoost is time-consuming
[Figure: training data feeds XGBoost to produce a model; testing data feeds model evaluation. A: training algorithm, B: model evaluation, C: model tuning]

Motivation
⬢ What should we do?
[Figure: training data, XGBoost, model, testing data, and model evaluation, annotated with A: speed up training (parallel and GPU), B: model evaluation with Spark ML, C: automatic model tuning with Spark ML]

Motivation
⬢ From a single machine to parallel computation
○ Distributed training
○ GPU support
○ Works with the big data ecosystem
⬢ How to provide an end-to-end solution for data scientists?
○ Front end
■ Easy and efficient way to run parallel XGBoost computation
■ Notebook front end for model visualization
○ Back end
■ YARN to allocate resources for applications (CPU, memory, GPU)
■ Docker support

How Spark enhances XGBoost
⬢ Efficient distributed training and Spark ML pipeline
○ DataFrame and RDD support for efficient data preprocessing
⬢ Ref: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html

How Spark enhances XGBoost
⬢ Each XGBoost node needs Rabit to communicate with the others
○ Efficient, but Rabit is not easy to manage
[Figure: training data partitions 1 to 4, each handled by its own XGBoost worker (1 to 4); the workers sync statistics to agree on the optimal split value]
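The statistic sync can be pictured as an allreduce: every worker summarizes its own partition, the summaries are summed element-wise, and each worker ends up with the same global statistics. The sketch below simulates this in a single process and does not call Rabit; the histogram layout (one gradient sum per candidate split bin) and all names are assumptions for illustration.

```scala
object AllreduceSketch {
  // Per-partition statistic: summed gradients for each candidate split bin.
  type Histogram = Vector[Double]

  // Each (simulated) worker builds a histogram from its own partition only.
  // A row is (bin index of the feature value, gradient).
  def localHistogram(rows: Seq[(Int, Double)], numBins: Int): Histogram =
    rows.foldLeft(Vector.fill(numBins)(0.0)) { case (h, (bin, grad)) =>
      h.updated(bin, h(bin) + grad)
    }

  // Allreduce with a sum operator: element-wise sum of all local histograms.
  // In a real cluster every worker would receive this same result.
  def allreduceSum(locals: Seq[Histogram]): Histogram =
    locals.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq((0, 0.5), (1, -0.25)),
      Seq((1, 0.75), (2, 1.0)),
      Seq((0, -0.5), (2, 0.25))
    )
    val global = allreduceSum(partitions.map(p => localHistogram(p, numBins = 3)))
    println(global)
  }
}
```

Only the small histograms cross the network, never the training rows, which keeps the sync cheap relative to the data size.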

XGBoost on Spark ML pipeline
⬢ Distributed XGBoost inside Spark ML pipeline
⬢ XGBoost estimator
○ Extends the Spark ML Estimator
⬢ XGBoost model
○ Extends the Spark ML PipelineModel
○ Works naturally inside a Spark ML Pipeline for model materialization
⬢ XGBoost parameters
○ Extend the Spark ML parameters
○ Enable automatic parameter tuning
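Automatic parameter tuning usually means searching a grid of parameter combinations and keeping the one that scores best on held-out data. The sketch below shows that idea in plain Scala without any Spark dependency. The `validationError` function is a hypothetical stand-in for "train a model with these parameters and evaluate it", and the candidate values are illustrative.

```scala
object GridSearchSketch {
  type ParamMap = Map[String, Double]

  // Build every combination of the candidate values (the Cartesian product).
  def grid(candidates: Seq[(String, Seq[Double])]): Seq[ParamMap] =
    candidates.foldLeft(Seq(Map.empty[String, Double])) { case (maps, (name, values)) =>
      for (m <- maps; v <- values) yield m + (name -> v)
    }

  // Pick the combination with the lowest validation error.
  def tune(paramMaps: Seq[ParamMap], validationError: ParamMap => Double): ParamMap =
    paramMaps.minBy(validationError)

  def main(args: Array[String]): Unit = {
    val paramMaps = grid(Seq("eta" -> Seq(0.1, 0.3), "max_depth" -> Seq(2.0, 4.0, 6.0)))
    // Hypothetical stand-in for "train a model, then score it on held-out data".
    val fakeError: ParamMap => Double =
      p => math.abs(p("eta") - 0.3) + math.abs(p("max_depth") - 4.0)
    println(s"${paramMaps.size} combinations, best: ${tune(paramMaps, fakeError)}")
  }
}
```

Every grid point is an independent training job, so a cluster can evaluate them in parallel.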

XGBoost on Spark ML pipeline
⬢ Distributed XGBoost
○ Parameters:
■ val paramMap = List("eta" -> 0.1f, "max_depth" -> 2, "objective" -> "binary:logistic").toMap
○ Training:
■ val xgboostModelRDD = XGBoost.train(trainRDD, paramMap, 1, 4, useExternalMemory = true)
■ val xgboostModelDF = XGBoost.trainWithDataFrame(trainDF, paramMap, 1, 4, useExternalMemory = true)
○ Prediction:
■ val xgboostPredictionRDD = xgboostModelRDD.predict(trainRDD.map { x => x.features })
○ XGBoost inside an ML pipeline:
■ val xgboostEstimator = new XGBoostEstimator(Map[String, Any]("num_round" -> 30, "nworkers" -> 10, "objective" -> "reg:linear", "eta" -> 0.3, "max_depth" -> 6, "early_stopping_rounds" -> 10))
■ val pipeline = new Pipeline().setStages(Array(assembler, xgboostEstimator))
■ val pipelineData = dataset.withColumnRenamed("PE", "label")
■ val pipelineModel = pipeline.fit(pipelineData)

GPU speedup for XGBoost
⬢ Where to improve the tree building procedure?
⬢ Procedure to build a tree
○ for each feature of input data
■ for each leaf of current tree
● find the best split
■ split the leaf node
[Figure: decision tree nodes labeled A with Y/N branches]
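The tree-building procedure above can be sketched as an exhaustive search. This is an illustration only: it assumes squared-error reduction as the split criterion (XGBoost uses a gradient-based gain), it loops over leaves first and features second, and names such as `SplitSearchSketch` are invented here.

```scala
object SplitSearchSketch {
  final case class Row(features: Vector[Double], label: Double)
  final case class Split(feature: Int, threshold: Double, gain: Double)

  // Sum of squared deviations from the mean: the error of predicting the mean.
  private def sse(rows: Seq[Row]): Double =
    if (rows.isEmpty) 0.0
    else {
      val mean = rows.map(_.label).sum / rows.size
      rows.map(r => math.pow(r.label - mean, 2)).sum
    }

  // For one leaf: scan every feature and every observed value as a threshold.
  def bestSplit(leaf: Seq[Row]): Option[Split] = {
    val numFeatures = leaf.headOption.map(_.features.size).getOrElse(0)
    val candidates = for {
      f <- 0 until numFeatures
      t <- leaf.map(_.features(f)).distinct
      (left, right) = leaf.partition(_.features(f) < t)
      if left.nonEmpty && right.nonEmpty
    } yield Split(f, t, sse(leaf) - sse(left) - sse(right))
    if (candidates.isEmpty) None else Some(candidates.maxBy(_.gain))
  }

  // One level of tree growth: find the best split for each current leaf.
  def growLevel(leaves: Seq[Seq[Row]]): Seq[Option[Split]] = leaves.map(bestSplit)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row(Vector(1.0, 5.0), 1.0),
      Row(Vector(2.0, 1.0), 1.0),
      Row(Vector(3.0, 6.0), 5.0),
      Row(Vector(4.0, 2.0), 5.0)
    )
    println(growLevel(Seq(rows)))
  }
}
```

The nested scan over features, leaves, and thresholds dominates training time, so it is the part worth parallelizing.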

GPU speedup for XGBoost
⬢ GPU speedup of XGBoost on a single machine
○ Process all nodes at the same level concurrently
○ Optimize split point selection
○ Optimize memory usage for sparse data
⬢ Algorithm to speed up XGBoost via GPU
○ Phase 1: Find splits
○ Phase 2: Update node positions
○ Phase 3: Sort node buckets
○ Ref: https://peerj.com/articles/cs-127/
Instance ID      1    4    3    2
Feature value    0.1  0.2  0.4  0.5
Gradient         0.3  0.5  0.3  0.3
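With instances sorted by feature value as in the table above, the gradient sum on the left of every candidate split is a running (prefix) sum, and the right side is the total minus the left. The sketch below computes this sequentially with `scanLeft`. It assumes that this scan is the core of the "find splits" phase and only mirrors the arithmetic, not the parallel GPU implementation.

```scala
object GradientScanSketch {
  // Rows from the table above, already sorted by feature value.
  val featureValues: Vector[Double] = Vector(0.1, 0.2, 0.4, 0.5)
  val gradients: Vector[Double]     = Vector(0.3, 0.5, 0.3, 0.3)

  // Inclusive prefix sums: element i is the gradient sum of positions 0..i.
  def prefixSums(values: Vector[Double]): Vector[Double] =
    values.scanLeft(0.0)(_ + _).tail

  // For a split placed after position i, return (left sum, right sum).
  def leftRightSums(values: Vector[Double]): Vector[(Double, Double)] = {
    val sums = prefixSums(values)
    val total = sums.last
    sums.init.map(left => (left, total - left))
  }

  def main(args: Array[String]): Unit =
    featureValues.zip(leftRightSums(gradients)).foreach { case (v, (l, r)) =>
      println(f"split after feature value $v%.1f: left=$l%.2f right=$r%.2f")
    }
}
```

A prefix sum is a classic data-parallel primitive, which is one reason this formulation maps well to a GPU.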

GPU speedup for XGBoost
⬢ XGBoost on GPU achieves a 4.x speedup over the CPU-based version
⬢ Ref: https://devblogs.nvidia.com/gradient-boosting-decision-trees-xgboost-cuda/

GPU speedup for XGBoost
⬢ GPUs are good, but managing a GPU cluster is not easy
○ Different GPU driver versions
○ Users have to build XGBoost with GPU support themselves
○ GPU resources are hard to manage
○ GPU resources cannot be shared
⬢ An ideal environment includes everything
○ Spark is an efficient distributed engine for data processing
○ Spark ML pipeline for model tuning
○ GPU is used to speed up XGBoost training
○ YARN manages the cluster resources
○ Notebooks serve the end users

What you can learn from this notebook
⬢ Combine Spark and XGBoost
○ Train and deploy XGBoost models in a unified data platform
○ Automatically tune the XGBoost model with the Spark ML pipeline
○ Speed up XGBoost training with distributed computation and GPU
○ Multiple users can share the same cluster with GPU and Spark
⬢ Benefits
○ End-to-end solution for an ML pipeline with XGBoost support
○ No need to care about GPU management
○ Train XGBoost with the Spark ML APIs
○ Visualize the prediction results in the notebook

Spark and XGBoost for Fintech
⬢ Lending Club data
⬢ Spark DataFrame for ETL
⬢ Spark SQL for OLAP
⬢ Spark ML for automatic model tuning and model serving
⬢ Notebook links (use Databricks Community Edition):
○ Part 1: (https://bit.ly/2QuLQ9b) https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/27242371102049/8135547933712821/latest.html
○ Part 2: (https://bit.ly/2AZJI3Z) https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/27242371102070/8135547933712821/latest.html
⬢ Acknowledgment: https://databricks.com/blog/2018/08/09/loan-risk-analysis-with-xgboost-and-databricks-runtime-for-machine-learning.html

Why Deep Learning
Data explosion
Computation explosion
An AI-driven world

What is deep learning
⬢ A set of machine learning techniques that can learn useful representations of
features directly from images, text and sound.
⬢ Achievements
○ ImageNet
○ Google Neural Machine Translation
○ AlphaGo/AlphaZero
⬢ Benefits from big data and GPUs

A typical Deep Learning workflow
[Figure: load data → select a neural network architecture → optimize the parameters]

Build your own deep learning model
Dataset        Images (#)   Classes (#)
ImageNet       14M          20K
Skin cancer    129,450      757

Transfer Learning Pipeline
[Figure: load data as DataFrame → pre-trained CNN model → softmax classification (trainable parameters)]
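The transfer learning pipeline above keeps the pre-trained network fixed and trains only a softmax classifier on its outputs. The sketch below imitates that split in plain Scala. `frozenFeatures` is a hypothetical stand-in for a pre-trained CNN (it just computes a few summary statistics), and the tiny "images", learning rate, and epoch count are made up for illustration.

```scala
object TransferLearningSketch {
  // Stand-in for a pre-trained CNN: a fixed function whose parameters are never updated.
  def frozenFeatures(pixels: Vector[Double]): Vector[Double] =
    Vector(pixels.sum / pixels.size, pixels.max - pixels.min, 1.0) // last entry acts as a bias

  def softmax(scores: Vector[Double]): Vector[Double] = {
    val shift = scores.max // subtract the max for numerical stability
    val exps = scores.map(s => math.exp(s - shift))
    val total = exps.sum
    exps.map(_ / total)
  }

  private def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Trainable part: one weight vector per class on top of the frozen features.
  def train(data: Seq[(Vector[Double], Int)], numClasses: Int,
            epochs: Int, lr: Double): Vector[Vector[Double]] = {
    val feats = data.map { case (x, y) => (frozenFeatures(x), y) }
    val dim = feats.head._1.size
    var w = Vector.fill(numClasses)(Vector.fill(dim)(0.0))
    for (_ <- 1 to epochs; (f, y) <- feats) {
      val p = softmax(w.map(row => dot(row, f)))
      // Cross-entropy gradient for class k's weights is (p_k - 1[k == y]) * f.
      w = w.zipWithIndex.map { case (row, k) =>
        val err = p(k) - (if (k == y) 1.0 else 0.0)
        row.zip(f).map { case (wi, fi) => wi - lr * err * fi }
      }
    }
    w
  }

  def predict(w: Vector[Vector[Double]], pixels: Vector[Double]): Int = {
    val scores = w.map(row => dot(row, frozenFeatures(pixels)))
    scores.indexOf(scores.max)
  }

  def main(args: Array[String]): Unit = {
    // Class 0 = dark images, class 1 = bright images.
    val data = Seq(
      (Vector(0.0, 0.1, 0.2, 0.1), 0), (Vector(0.1, 0.3, 0.2, 0.2), 0),
      (Vector(0.9, 0.8, 1.0, 0.9), 1), (Vector(0.7, 0.9, 0.8, 0.8), 1)
    )
    val w = train(data, numClasses = 2, epochs = 500, lr = 0.5)
    println(predict(w, Vector(0.05, 0.1, 0.15, 0.1)))
    println(predict(w, Vector(0.95, 0.9, 0.85, 0.9)))
  }
}
```

Only the small weight matrix is learned, which is why a dataset far smaller than ImageNet can still be enough.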

Deep Learning in Spark MLlib Pipeline
⬢ Spark MLlib pipeline
○ Sequence of Transformers and Estimators
○ Simple, concise API and ease of use
⬢ Integrates with Spark APIs
○ Spark is great at scaling out computations
○ Image representation and reader in Spark DataFrame/Dataset (new in Spark 2.3)
⬢ Spark Deep Learning Pipelines (github.com/databricks/spark-deep-learning)
○ Plug in your own TensorFlow graph or Keras model as Transformers
○ Open source under Apache 2.0 license
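The "sequence of Transformers and Estimators" idea can be shown with a toy re-creation in plain Scala. The traits below only borrow the names of the Spark MLlib concepts; they are not the Spark API, and the two example stages (`clip`, `center`) are invented for illustration.

```scala
object PipelineSketch {
  type Data = Vector[Double]

  // A Transformer maps a dataset to a new dataset.
  trait Transformer { def transform(data: Data): Data }

  // An Estimator learns from a dataset and produces a Transformer (the fitted model).
  trait Estimator { def fit(data: Data): Transformer }

  // A stage is either a ready-made Transformer or an Estimator that still needs fitting.
  type Stage = Either[Transformer, Estimator]

  // Fit stages in order, feeding each stage the output of the previous ones.
  def fitPipeline(stages: Seq[Stage], data: Data): Seq[Transformer] = {
    val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], data)) {
      case ((done, current), stage) =>
        val t = stage match {
          case Left(transformer) => transformer
          case Right(estimator)  => estimator.fit(current)
        }
        (done :+ t, t.transform(current))
    }
    fitted
  }

  // Applying the fitted pipeline means running every Transformer in sequence.
  def transformAll(fitted: Seq[Transformer], data: Data): Data =
    fitted.foldLeft(data)((d, t) => t.transform(d))

  // Example stages: clip negatives (Transformer), then center on the training mean (Estimator).
  val clip: Transformer = new Transformer {
    def transform(d: Data): Data = d.map(v => math.max(v, 0.0))
  }
  val center: Estimator = new Estimator {
    def fit(d: Data): Transformer = {
      val mean = d.sum / d.size
      new Transformer { def transform(x: Data): Data = x.map(_ - mean) }
    }
  }

  def main(args: Array[String]): Unit = {
    val fitted = fitPipeline(Seq(Left(clip), Right(center)), Vector(-2.0, 2.0, 4.0))
    println(transformAll(fitted, Vector(-1.0, 3.0)))
  }
}
```

In this picture a deep learning model is just another stage, which is the role the slide assigns to a TensorFlow graph or Keras model plugged in as a Transformer.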

Auto ML in Spark ML pipeline
⬢ Spark to prepare the data
○ Spark streaming
○ Spark SQL
⬢ Spark for model parameter tuning
○ Hyperparameters
○ Save memory usage
⬢ TensorFlow automatic network structure tuning
○ Reinforcement learning
○ Transfer learning
⬢ Model deployment as a service

Case study
⬢ Car damage estimation
⬢ Intelligence agent
⬢ X-ray image analysis
⬢ Anti-terrorism

What you can learn from this section
⬢ How to combine deep learning and Spark
⬢ Use DL as an operator in a Spark ML pipeline
⬢ Transfer learning with a DL model
⬢ DL model parameter tuning
⬢ Apply a DL model in Spark SQL
⬢ Notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4999972933037924/4324977500035919/8135547933712821/latest.html
⬢ Acknowledgment: https://docs.databricks.com/applications/deep-learning/deep-learning-pipelines.html

Resources:
https://drive.google.com/drive/folders/1wGKNGq7w75YKYazMZ7ytgaAtfTCgvsED?usp=sharing
