AI on Greenplum Using Apache MADlib and MADlib Flow - Greenplum Summit 2019

© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Sridhar Paladugu
Frank McQuillan

Greenplum Integrated Analytics
Data Transformation
Traditional BI
Machine Learning
Graph
Data Science Productivity Tools
Geospatial
Text
Deep Learning
Build
Manage
Deploy

Agenda
■ Machine learning
■ Deep learning
■ Model management
■ Deployment and orchestration of models

1. Machine Learning with Apache MADlib

Apache MADlib: Big Data Machine Learning in SQL
Scalable, In-Database Machine Learning
• Open source: https://github.com/apache/madlib
• Downloads and docs: http://madlib.apache.org/
• Wiki: https://cwiki.apache.org/confluence/display/MADLIB/
Open source, top level Apache project
For PostgreSQL and Greenplum Database
Powerful machine learning, graph, statistics and analytics for data scientists

History
MADlib project was initiated in 2011 by EMC/Greenplum architects and
Professor Joe Hellerstein from University of California, Berkeley.
UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.

Functions
Apache MADlib 1.15.1
Data Types and Transformations
Array and Matrix Operations
Matrix Factorization
• Low Rank
• Singular Value Decomposition (SVD)
Norms and Distance Functions
Sparse Vectors
Encoding Categorical Variables
Path Functions
Pivot
Sessionize
Stemming
Graph
All Pairs Shortest Path (APSP)
Breadth-First Search
Hyperlink-Induced Topic Search (HITS)
Average Path Length
Closeness Centrality
Graph Diameter
In-Out Degree
PageRank and Personalized PageRank
Single Source Shortest Path (SSSP)
Weakly Connected Components
Model Selection
Cross Validation
Prediction Metrics
Train-Test Split
Statistics
Descriptive Statistics
• Cardinality Estimators
• Correlation and Covariance
• Summary
Inferential Statistics
• Hypothesis Tests
Probability Functions
Supervised Learning
Neural Networks
Support Vector Machines (SVM)
Conditional Random Field (CRF)
Regression Models
• Clustered Variance
• Cox-Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Naïve Bayes
• Ordinal Regression
• Robust Variance
Tree Methods
• Decision Tree
• Random Forest
Time Series Analysis
• ARIMA
Unsupervised Learning
Association Rules (Apriori)
Clustering (k-Means)
Principal Component Analysis (PCA)
Topic Modelling (Latent Dirichlet Allocation)
Utility Functions
Columns to Vector
Conjugate Gradient
Linear Solvers
• Dense Linear Systems
• Sparse Linear Systems
Mini-Batching
PMML Export
Term Frequency for Text
Vector to Columns
Nearest Neighbors
• k-Nearest Neighbors
Sampling
Balanced
Random
Stratified
Comprehensive and mature data science library

Why MADlib on Greenplum?
• Better parallelism
• Better scalability
• Higher predictive accuracy
• Top level ASF project
“Apache MADlib Comes of Age”, Frank McQuillan, Oct. 2017,
https://content.pivotal.io/blog/apache-madlib-comes-of-age

Greenplum Database with MADlib
[Architecture diagram: SQL enters at the Master Host (with a Standby Master); the Interconnect links it to Segment Hosts Node1, Node2, Node3, …, NodeN (Massively Parallel Processing).]
In-Database Functions: machine learning, statistics, math, graph, and utilities
Storage and systems shown around the cluster: Local Storage, Other RDBMSes, Spark, GemFire, Cloud Object Storage, HDFS, Kafka, ETL, Spring Cloud Data Flow

Iterative Model Execution
Master: Stored Procedure for Model
    model = init(…)
    WHILE model not converged
        model = SELECT model.aggregation(…) FROM data table
    ENDWHILE
Broadcast from the Master to Segment 1, Segment 2, …, Segment n
1. Transition Function: Operates on tuples or mini-batches to update transition state (model)
2. Merge Function: Combines transition states
3. Final Function: Transforms transition state into output value
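The transition / merge / final pattern in this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not MADlib's implementation: the model is a one-variable linear fit trained by gradient descent, the "segments" are ordinary lists, and every name (`transition`, `merge`, `final`, `run_segment`, `train`) is hypothetical.

```python
# Illustrative sketch only: mimics the aggregate pattern with plain Python lists
# standing in for Greenplum segments. All names here are hypothetical.

def transition(state, row, model):
    """Fold one tuple into the segment-local state (gradient sums, row count)."""
    grad_w, grad_b, n = state
    x, y = row
    w, b = model
    err = (w * x + b) - y
    return (grad_w + err * x, grad_b + err, n + 1)

def merge(a, b):
    """Combine the transition states of two segments."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def final(state, model, lr):
    """Turn the merged state into the next model (one gradient step on MSE)."""
    grad_w, grad_b, n = state
    w, b = model
    return (w - lr * 2 * grad_w / n, b - lr * 2 * grad_b / n)

def run_segment(rows, model):
    """Run the transition function over every tuple held by one segment."""
    state = (0.0, 0.0, 0)
    for row in rows:
        state = transition(state, row, model)
    return state

def train(segments, lr=0.05, tol=1e-9, max_iter=10000):
    model = (0.0, 0.0)                      # model = init(...)
    for _ in range(max_iter):               # WHILE model not converged
        states = [run_segment(rows, model) for rows in segments]  # "broadcast" model
        merged = states[0]
        for other in states[1:]:
            merged = merge(merged, other)
        new_model = final(merged, model, lr)
        step = max(abs(new_model[0] - model[0]), abs(new_model[1] - model[1]))
        model = new_model
        if step < tol:
            break
    return model

# Three "segments" holding points on the line y = 2x + 1.
segments = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)], [(4.0, 9.0), (5.0, 11.0)]]
w, b = train(segments)
print(round(w, 4), round(b, 4))
```

Because `merge` is a plain sum, the result does not depend on how rows are distributed across segments, which is what lets each segment work on its local data independently.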

Familiar SQL Interface
Train (build a predictive model)
Predict (use model on new data)

Familiar SQL Interface From house pricing model
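The SQL shown on these two slides did not survive extraction. As a rough stand-in, the sketch below only composes the text of a train call and a predict call using MADlib's `linregr_train` and `linregr_predict` functions. The table and column names (`houses`, `houses_linregr`, `price`, `tax`, `bath`, `size`) and the helper names `train_sql` and `predict_sql` are assumptions for illustration, not taken from the slides.

```python
# Sketch: build the SQL text for training and prediction; nothing is executed.
# Table/column names are hypothetical. Plain string formatting like this is
# only acceptable for trusted, hard-coded identifiers.

def feature_array(features):
    """ARRAY expression with a leading 1 so the model fits an intercept."""
    return "ARRAY[1, " + ", ".join(features) + "]"

def train_sql(source, model, dependent, features):
    """Train: build a predictive model and store it in the `model` table."""
    return ("SELECT madlib.linregr_train('{0}', '{1}', '{2}', '{3}');"
            .format(source, model, dependent, feature_array(features)))

def predict_sql(source, model, features):
    """Predict: apply the stored coefficients to rows of `source`."""
    return ("SELECT t.*, madlib.linregr_predict(m.coef, {0}) AS predict "
            "FROM {1} t, {2} m;".format(feature_array(features), source, model))

features = ["tax", "bath", "size"]
train_stmt = train_sql("houses", "houses_linregr", "price", features)
predict_stmt = predict_sql("houses", "houses_linregr", features)
print(train_stmt)
print(predict_stmt)
```

Both statements are ordinary `SELECT`s, so they can be issued from psql or any SQL client connected to a database with MADlib installed.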

SVM Scale with Data Size
Greenplum cluster:
● 1 master
● 4 segment hosts with 6 segments per host
[Chart: Support Vector Machines]

PageRank Scale with Graph Size
Greenplum cluster:
● 1 master
● 4 segment hosts with 6 segments per host
Normal random graphs with a mean degree of 50 edges per vertex (i.e., 5B edges in the largest case)
[Chart, log-log scale: size ticks from 1K to 100M; time ticks at 1 s, 100 s, 10K s, and 1M s; largest case marked 5B edges]
“Graph Processing on Greenplum Database using Apache MADlib”, Frank McQuillan, Jan 2018,
https://content.pivotal.io/blog/graph-processing-on-greenplum-database-using-apache-madlib

But modeling is only part of the story...
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”
- Jeffrey Heer, Professor of Computer Science at the University of Washington and Co-founder of Trifacta
“For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, Aug. 17, 2014
https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Feature Engineering
Example data science workflow

2. Deep Learning

Deep Learning
• Type of machine learning inspired by biology of the brain
• Artificial neural networks with multiple layers between input and output

Example Deep Learning Algorithms
• Multilayer perceptron (MLP): “The Original”
• Recurrent neural network (RNN): e.g., machine translation
• Convolutional neural network (CNN): e.g., image classification

Convolutional Neural Networks (CNN)
• Effective for computer vision
• Fewer parameters than fully connected networks
• Translational invariance
• Classic networks: LeNet-5, AlexNet, VGG

Graphics Processing Units (GPUs)
• Great at performing a lot of simple computations such as matrix operations
• Well suited to deep learning algorithms

[Diagram: Single Server (Host Node 1) with GPU 1 … GPU N]

Moving Data Greenplum <-> Single Server
Greenplum: Data preparation, feature generation, machine learning, geospatial, etc.
Single server: Deep learning
Large data transfer between the two: Suboptimal

Integrated Deep Learning with Greenplum
[Architecture diagram: SQL enters at the Master Host (with a Standby Master); the Interconnect links it to Segment Hosts Node1, Node2, Node3, …, NodeN, each with GPU 1 … GPU N (Massively Parallel Processing).]
In-Database Functions: machine learning, statistics, math, graph, and utilities

Deep Learning on a Cluster
Num | Approach | Description
1 | Distributed deep learning | Train single model architecture across the cluster. Data distributed (usually randomly) across segments.
2 | Data parallel models | Train same model architecture in parallel on different data groups (e.g., build separate models per country).
3 | Hyperparameter tuning | Train same model architecture in parallel with different hyperparameter settings and incorporate cross validation. Same data on each segment.
4 | Neural architecture search | Train different model architectures in parallel. Same data on each segment.
Current work
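As a loose illustration of approach 3 (hyperparameter tuning), the sketch below expands a small hyperparameter grid and deals the resulting configurations out to segments round-robin, so that each segment would train the same architecture with different settings. The grid values, the segment count, and the function name `assign_configs` are all assumptions; the slides do not describe how MADlib schedules this work.

```python
from itertools import product

def assign_configs(grid, num_segments):
    """Expand a hyperparameter grid and deal configurations out round-robin.

    Returns a dict mapping segment index -> list of configurations to train.
    """
    names = sorted(grid)  # fixed order so the plan is reproducible
    configs = [dict(zip(names, values))
               for values in product(*(grid[name] for name in names))]
    plan = {segment: [] for segment in range(num_segments)}
    for index, config in enumerate(configs):
        plan[index % num_segments].append(config)
    return plan

# Hypothetical grid: 2 learning rates x 3 batch sizes = 6 configurations.
grid = {"learning_rate": [0.01, 0.1], "batch_size": [32, 64, 128]}
plan = assign_configs(grid, num_segments=4)
for segment, configs in plan.items():
    print(segment, configs)
```

With more configurations than segments, some segments train several models in sequence; with fewer, some segments stay idle.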

Testing Infrastructure
• Google Cloud Platform (GCP)
• Type n1-highmem-32 (32 vCPUs, 208 GB memory)
• NVIDIA Tesla P100 GPUs
• Greenplum database config
– Tested up to 20 segment (worker node) clusters
– 1 GPU per segment

6-layer CNN - Runtime (CIFAR-10)
Method: Model weight averaging
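Model weight averaging can be pictured as follows: each segment fits the same architecture on its local data, then the per-segment weights are averaged element-wise into one model. The sketch below shows only the averaging step, as a plain unweighted mean over flat weight lists. The function name and the sample weights are hypothetical, and whether the benchmark weighted segments equally is not stated in the slides.

```python
def average_weights(segment_weights):
    """Element-wise mean of per-segment weight vectors (unweighted)."""
    if not segment_weights:
        raise ValueError("need weights from at least one segment")
    size = len(segment_weights[0])
    if any(len(weights) != size for weights in segment_weights):
        raise ValueError("all segments must train the same model architecture")
    count = len(segment_weights)
    return [sum(weights[i] for weights in segment_weights) / count
            for i in range(size)]

# Hypothetical flattened weights from three segments after one local pass.
local = [[0.2, -1.0, 3.0],
         [0.4, -3.0, 1.0],
         [0.6, -2.0, 2.0]]
averaged = average_weights(local)
print(averaged)
```

In a full training loop the averaged weights would be sent back to every segment as the starting point for the next pass.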

3. Model Management

Try and try and try...
• Data scientists typically try many different types of models with many different parameter combinations

Model Persistence in MADlib 1.x
One model at a time

Model Persistence in MADlib 2.0
Multiple models at a time in model library

4. MADlib Flow

Data Science Process
[Diagram labels: Model Evaluation, Operationalization, Model Building, Feature Engineering, Data Review, User Feedback, Problem Definition, Setup]

Model Operationalization
[Data Science Process diagram, as on the previous slide]
Model Operationalization is the process of deploying data science models to production for ongoing use by other software

Common Challenges With Operationalizing Models
[Data Science Process diagram, as on the previous slides]
Common challenges with model operationalization:
● Handling production data
● Engineering for scale and performance
● Model transportation
● Managing and orchestrating deployed models
● Data Scientists are not developers or platform experts

Patterns For Operationalizing Models
BATCH TRAINING, BATCH INFERENCE
● ~40% of today’s use cases
● Example: Tax Return Fraud: Score database of tax returns - on a nightly basis - to flag likely fraudulent returns for audit
● PostgreSQL/Greenplum with MADlib supports this pattern
EVENT DRIVEN TRAINING, EVENT DRIVEN INFERENCE
● <5% of today’s use cases
● Example: Online Advertising: Maximize Click Thru Rate by algorithmically selecting and testing advertisement placement in real time
● Highly specialized, low number of enterprise use cases
BATCH TRAINING, EVENT DRIVEN INFERENCE
● ~55% of today’s use cases (growing)
● Example: Real Time Transaction Fraud: Train an ML model on historical data to classify - in real time - whether or not new credit/debit transactions are likely to be fraudulent
● PostgreSQL/Greenplum with MADlib & MADlib Flow supports this pattern
AI For The PostgreSQL Community
Standardized end-to-end Data Science in SQL with the Greenplum/Postgres stack
Experimentation: Initial code development and testing, model experimentation on samples.
Modeling at Scale: Heavy compute tasks such as model training across big data
Deployment: Production deployment of models to feed downstream applications and reports
Artificial Intelligence: Closed Loop Machine Learning

Model Deployment With MADlib Flow
User Operations
1. ML Training: Train ML model in Postgres or Greenplum using Apache MADlib
2. madlibflow --deploy: Set configs in .yml and deploy model from Greenplum to Docker, PCF or Kubernetes
Automated Backend Operations
3. Docker pull: Pull docker containers with optimized Postgres and MADlib
4. Pull Model: Extract model and feature table schema layout from Greenplum database
5. Load Model: Load model and feature table schema into optimized Postgres
6. Deploy: Deploy docker container to target environment

Containerized Deployment Of Models
Containerized deployment of Apache MADlib Machine Learning workflows for low latency event driven inference and scale
$ madlibflow --deploy --target kubernetes --type model
Single command to deploy a MADlib trained model from GPDB/Postgres to Docker, PCF or Kubernetes
Key benefits of MADlib Flow
● Easy to deploy & lightweight
● Highly scalable REST and Streaming
● End-to-end SQL workflow
● Low latency inference/predictions
● Feature Transformations

MADlib Flow: Hello World!
Let us demonstrate a Linear Regression Model deployment
Dependent variable:
● patient has had a second heart attack within 1 year
Independent variables:
● patient completed a treatment on anger control
● anxiety scale score
Workflow: Create schema -> Load data -> Train model -> Deploy model -> Test -> Batch prediction

Model Deployment
Deployment manifest
$ madlibflow --name patient-lr --type model --action deploy --target kubernetes --inputJson config.json

Example Flow for Fraud Detection
Automated deployment of scalable low latency end-to-end ML pipelines (“Data Science Ops.”)
Credit/Debit Card Transaction (Input) Message
{
  "transaction_ts": ,
  "credit_card_number": ,
  "transaction_amt": ,
  "merchant_id":
}
Greenplum Database
● Accounts: credit_card_number, num_transactions_30days, max_transactions_30days
● Merchants: merchant_id, num_fraud_cases, avg_transaction_amount_30days
Cache Loader: No code conversion - engineer features and populate cache in SQL
SELECT create_accounts(); SELECT create_merchants();
Cache (Gemfire, PCC, Redis, etc.), reached through a Cache Abstraction, holding Accounts and Merchants
Feature Engine: Join data from the incoming message with cached data
SELECT mch.*
      ,acct.*
      ,log(msg.transaction_amt + 1) AS log_transaction_amt
FROM message msg
JOIN merchants mch ON msg.merchant_id = mch.merchant_id
JOIN accounts acct ON msg.credit_card_number = acct.credit_card_number;
MADlib REST
Approved Credit/Debit Card Transaction (Output) Message
{
  "transaction_ts": ,
  "transaction_amt": ,
  "credit_card_number": ,
  "num_transactions_30days": ,
  "max_transactions_30days": ,
  "merchant_id": ,
  "num_fraud_cases": ,
  "avg_transaction_amount_30days": ,
  "fraud_risk_score": 0.92,
  "approved": True
}
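The feature-engine join in this flow can be mimicked in a few lines of Python, with dictionaries standing in for the cached Accounts and Merchants tables. This is a sketch under assumptions, not MADlib Flow code: the sample values, the key formats, and the function name `build_features` are invented for illustration. It uses `math.log10` because PostgreSQL's single-argument `log()` is base 10.

```python
import math

# Hypothetical cache contents keyed the same way as the SQL join columns.
accounts = {
    "4111-0000": {"num_transactions_30days": 12, "max_transactions_30days": 250.0},
}
merchants = {
    "m-42": {"num_fraud_cases": 3, "avg_transaction_amount_30days": 80.5},
}

def build_features(msg, accounts, merchants):
    """Join an incoming message with cached account and merchant features.

    Returns None when either lookup misses, mirroring an inner join
    that produces no row.
    """
    acct = accounts.get(msg["credit_card_number"])
    mch = merchants.get(msg["merchant_id"])
    if acct is None or mch is None:
        return None
    row = dict(msg)
    row.update(mch)
    row.update(acct)
    # PostgreSQL log(x) with one argument is the base-10 logarithm.
    row["log_transaction_amt"] = math.log10(msg["transaction_amt"] + 1)
    return row

msg = {
    "transaction_ts": "2019-03-19T10:00:00Z",
    "credit_card_number": "4111-0000",
    "transaction_amt": 99.0,
    "merchant_id": "m-42",
}
features = build_features(msg, accounts, merchants)
print(features)
```

The resulting row carries the same columns the output message lists, apart from `fraud_risk_score` and `approved`, which come from the model and the approval decision rather than the join.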

5. Learn More!

• Download
– http://madlib.apache.org/
• ~40 Jupyter notebooks
– https://github.com/apache/madlib-site/tree/asf-site/community-artifacts
• Wednesday March 20 @PostgresConf

Backup Slides

Looking Ahead
MADlib 1.16
● Initial deep learning release for image classification (Keras/TensorFlow)
● Postgres 11 support
● Improve speed of k-nearest neighbors via approximate method
MADlib 2.0
● More deep learning capabilities
○ Improved model performance
○ Hyperparameter tuning
● Model repositories and management for streamlined data science workflows
● New and improved SQL interface for MADlib functions
MADlib Flow
● Support for PL/Python and PL/R
● Native deployment to Pivotal Cloud Foundry as a buildpack
● Beta release in May ’19
● Metrics collector

Apache MADlib Resources
• Web site
– http://madlib.apache.org/
• Wiki
– https://cwiki.apache.org/confluence/display/MADLIB/Apache+MADlib
• User docs
– http://madlib.apache.org/docs/latest/index.html
• Jupyter notebooks
– https://github.com/apache/madlib-site/tree/asf-site/community-artifacts
• Technical docs
– http://madlib.apache.org/design.pdf
• Pivotal commercial site
– http://pivotal.io/madlib
• Mailing lists and JIRAs
– https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/
– http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/
– https://issues.apache.org/jira/browse/MADLIB
• PivotalR
– https://cran.r-project.org/web/packages/PivotalR/index.html
• Github
– https://github.com/apache/madlib
– https://github.com/pivotalsoftware/PivotalR

Execution Flow
[Diagram labels: Client (psql), Database Server, Master, Segment 1, Segment 2, …, Segment n, SQL, Stored Procedure, Result Set, String, Aggregation]

Artificial Intelligence Landscape
Deep Learning

Distributed Deep Learning Methods
• Open area of research*
• Methods we have investigated so far:
– Simple averaging
– Ensembling
– Elastic averaging stochastic gradient descent (EASGD)
* Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
https://arxiv.org/pdf/1802.09941.pdf

Some Results with CIFAR-10
• 60k 32x32 color images in 10 classes, with 6k images per class
• 50k training images and 10k test images
https://www.cs.toronto.edu/~kriz/cifar.html

MADlib Flow Benefits
■ Experimentation -> Modeling at scale -> Deployment, all in SQL
■ Single platform from model development to deployment using Postgres/Greenplum
■ Low latency inference
■ Easy to deploy both feature generation code and model
■ Join data from event message with feature cache objects using ANSI SQL
■ Continuously generate the features and feed them into the feature engine
■ Multiple versions of models can be deployed for accuracy measurement
■ Same tool can deploy to multiple container environments: PKS, AKS, GKE, etc.
