AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Sridhar Paladugu
Frank McQuillan
Greenplum Integrated Analytics
[Diagram labels: Data Transformation, Traditional BI, Machine Learning, Graph, Data Science Productivity Tools, Geospatial, Text, Deep Learning; lifecycle: Build, Manage, Deploy]
Agenda
■ Machine learning
■ Deep learning
■ Model management
■ Deployment and orchestration of models
1. Machine Learning with Apache MADlib
Scalable, In-Database Machine Learning
• Open source https://github.com/apache/madlib
• Downloads and docs http://madlib.apache.org/
• Wiki https://cwiki.apache.org/confluence/display/MADLIB/
Apache MADlib: Big Data Machine Learning in SQL
• Open source, top level Apache project
• For PostgreSQL and Greenplum Database
• Powerful machine learning, graph, statistics and analytics for data scientists
History
The MADlib project was initiated in 2011 by EMC/Greenplum architects and Professor Joe Hellerstein from the University of California, Berkeley.
UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.
Functions (Apache MADlib 1.15.1)
Data Types and Transformations
Array and Matrix Operations
Matrix Factorization
• Low Rank
• Singular Value Decomposition (SVD)
Norms and Distance Functions
Sparse Vectors
Encoding Categorical Variables
Path Functions
Pivot
Sessionize
Stemming
Graph
All Pairs Shortest Path (APSP)
Breadth-First Search
Hyperlink-Induced Topic Search (HITS)
Average Path Length
Closeness Centrality
Graph Diameter
In-Out Degree
PageRank and Personalized PageRank
Single Source Shortest Path (SSSP)
Weakly Connected Components
Model Selection
Cross Validation
Prediction Metrics
Train-Test Split
Statistics
Descriptive Statistics
• Cardinality Estimators
• Correlation and Covariance
• Summary
Inferential Statistics
• Hypothesis Tests
Probability Functions
Supervised Learning
Neural Networks
Support Vector Machines (SVM)
Conditional Random Field (CRF)
Regression Models
• Clustered Variance
• Cox-Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Naïve Bayes
• Ordinal Regression
• Robust Variance
Tree Methods
• Decision Tree
• Random Forest
Time Series Analysis
• ARIMA
Unsupervised Learning
Association Rules (Apriori)
Clustering (k-Means)
Principal Component Analysis (PCA)
Topic Modelling (Latent Dirichlet Allocation)
Utility Functions
Columns to Vector
Conjugate Gradient
Linear Solvers
• Dense Linear Systems
• Sparse Linear Systems
Mini-Batching
PMML Export
Term Frequency for Text
Vector to Columns
Nearest Neighbors
• k-Nearest Neighbors
Sampling
• Balanced
• Random
• Stratified
Comprehensive and mature data science library
Why MADlib on Greenplum?
• Better parallelism
• Better scalability
• Higher predictive accuracy
• Top level ASF project
“Apache MADlib Comes of Age”, Frank McQuillan, Oct. 2017,
https://content.pivotal.io/blog/apache-madlib-comes-of-age
Greenplum Database with MADlib
[Diagram labels: Standby Master, Master Host, SQL, Interconnect, Segment Host Node1 through NodeN, Local Storage. External systems: Other RDBMSes, Spark, GemFire, Cloud Object Storage, HDFS, Kafka, ETL, Spring Cloud Data Flow. In-Database Functions: machine learning, statistics, math, graph, utilities. Massively Parallel Processing.]
Iterative Model Execution
Stored Procedure for Model (on the Master):
model = init(…)
WHILE model not converged
    model = SELECT model.aggregation(…) FROM data table
ENDWHILE
The model is broadcast to Segment 1, Segment 2, …, Segment n. The aggregation has three parts:
1. Transition Function: operates on tuples or mini-batches to update transition state (model)
2. Merge Function: combines transition states
3. Final Function: transforms transition state into output value
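The transition / merge / final pattern on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not MADlib's implementation: the model is a single slope for y = w * x, each inner list stands in for the rows held by one segment, and the names `transition`, `merge`, `final` and `train` are hypothetical.

```python
from functools import reduce

def transition(state, row, model):
    """Fold one tuple (x, y) into the running state (gradient sum, row count)."""
    grad_sum, count = state
    x, y = row
    return (grad_sum + (model * x - y) * x, count + 1)

def merge(a, b):
    """Combine the transition states produced by two segments."""
    return (a[0] + b[0], a[1] + b[1])

def final(state, model, lr):
    """Turn the merged state into the next model (one gradient step)."""
    grad_sum, count = state
    return model - lr * grad_sum / count

def train(segments, lr=0.1, tol=1e-9, max_iter=1000):
    """Driver loop: the role the stored procedure on the master plays."""
    model = 0.0  # model = init(...)
    for _ in range(max_iter):  # WHILE model not converged
        # Each segment folds its own rows with the current (broadcast) model.
        states = [
            reduce(lambda s, r: transition(s, r, model), seg, (0.0, 0))
            for seg in segments
        ]
        new_model = final(reduce(merge, states), model, lr)
        if abs(new_model - model) < tol:
            return new_model
        model = new_model
    return model

# Hypothetical data following y = 2x, split across two "segments".
segments = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
slope = train(segments)
print(round(slope, 6))
```

Because `merge` only adds partial sums, the result does not depend on how rows are distributed across segments, which is what lets each segment work on its local data independently.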
Familiar SQL Interface
Train (build a predictive model)
Predict (use model on new data)
Familiar SQL Interface From house pricing model
SVM Scale with Data Size
Greenplum cluster:
● 1 master
● 4 segment hosts with 6 segments per host
[Chart: Support Vector Machines]
PageRank Scale with Graph Size
Greenplum cluster:
● 1 master
● 4 segment hosts with 6 segments per host
Normal random graphs with a mean degree of 50 edges per vertex (i.e., 5B edges in the largest case)
[Chart, log-log scale: one axis runs from 1K to 100M, the other from 1 s to 1M s; the 5B-edge case is marked]
“Graph Processing on Greenplum Database using Apache MADlib”, Frank McQuillan, Jan 2018,
https://content.pivotal.io/blog/graph-processing-on-greenplum-database-using-apache-madlib
But modeling is only part of the story...
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”
- Jeffrey Heer, Professor of Computer Science at the University of Washington and Co-founder of Trifacta
“For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, Aug. 17, 2014
https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
Feature Engineering
Example
data science
workflow
2. Deep Learning
Deep Learning
• Type of machine learning inspired by biology of the brain
• Artificial neural networks with multiple layers between input and output
Example Deep Learning Algorithms
• Multilayer perceptron (MLP): “The Original”
• Recurrent neural network (RNN): e.g., machine translation
• Convolutional neural network (CNN): e.g., image classification
Convolutional Neural Networks (CNN)
• Effective for computer vision
• Fewer parameters than fully connected networks
• Translational invariance
• Classic networks: LeNet-5, AlexNet, VGG
Graphics Processing Units (GPUs)
• Great at performing a lot of simple computations such as matrix operations
• Well suited to deep learning algorithms
[Diagram labels: Single Server, Host Node 1, GPU 1 … GPU N]
Moving Data Greenplum <-> Single Server
[Diagram: data preparation, feature generation, machine learning, geospatial, etc. in Greenplum; deep learning on the single server; the large data transfer between the two is suboptimal]
Integrated Deep Learning with Greenplum
[Diagram labels: Standby Master, Master Host, SQL, Interconnect, Segment Host Node1 through NodeN, with GPU 1 … GPU N on each segment host. In-Database Functions: machine learning, statistics, math, graph, utilities. Massively Parallel Processing.]
Deep Learning on a Cluster
Num | Approach | Description
1 | Distributed deep learning | Train single model architecture across the cluster. Data distributed (usually randomly) across segments.
2 | Data parallel models | Train same model architecture in parallel on different data groups (e.g., build separate models per country).
3 | Hyperparameter tuning | Train same model architecture in parallel with different hyperparameter settings and incorporate cross validation. Same data on each segment.
4 | Neural architecture search | Train different model architectures in parallel. Same data on each segment.
Current
work
Data Loading and Formatting
Testing Infrastructure
• Google Cloud Platform (GCP)
• Type n1-highmem-32 (32 vCPUs, 208 GB memory)
• NVIDIA Tesla P100 GPUs
• Greenplum database config
– Tested up to 20 segment (worker node) clusters
– 1 GPU per segment
6-layer CNN - Runtime (CIFAR-10)
Method: Model weight averaging
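The slide names model weight averaging without giving details, so the following is only a sketch of one common synchronous scheme, not the method actually used for the reported runtimes. It assumes every segment starts each round from the same weights, runs local SGD on its own shard, and the per-segment weights are then averaged element-wise. The one-weight model y = w * x and the function names are hypothetical.

```python
def local_fit(weights, shard, lr=0.05):
    """One pass of SGD over a segment's shard, starting from the shared weights."""
    w = list(weights)
    for x, y in shard:
        err = w[0] * x - y
        w[0] -= lr * err * x
    return w

def average_weights(per_segment):
    """Element-wise mean of the weight vectors returned by the segments."""
    n = len(per_segment)
    return [sum(column) / n for column in zip(*per_segment)]

def train(shards, rounds=50):
    """Alternate local training on every shard with a global averaging step."""
    weights = [0.0]
    for _ in range(rounds):
        weights = average_weights([local_fit(weights, s) for s in shards])
    return weights

# Hypothetical data following y = 3x, split across two "segments".
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]]
weights = train(shards)
print(round(weights[0], 6))
```

The averaging step is the only point where segments exchange information, so its frequency trades communication cost against how far the local models drift apart.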
3. Model Management
Try and try and try...
• Data scientists typically try many different types of models with many different parameter combinations
Model Persistence in MADlib 1.x
One model at a time
Model Persistence in MADlib 2.0
Multiple models at a time in model library
4. MADlib Flow
Data Science Process
[Diagram stage labels: Problem Definition, Setup, Data Review, Feature Engineering, Model Building, Model Evaluation, Operationalization, User Feedback]
Model Operationalization
Model operationalization is the process of deploying data science models to production for ongoing use by other software.
Common Challenges With Operationalizing Models
Common challenges with model operationalization:
● Handling production data
● Engineering for scale and performance
● Model transportation
● Managing and orchestrating deployed models
● Data scientists are not developers or platform experts
Patterns For Operationalizing Models
BATCH TRAINING, BATCH INFERENCE
~40% of today’s use cases
Example: Tax Return Fraud: Score a database of tax returns on a nightly basis to flag likely fraudulent returns for audit
PostgreSQL/Greenplum with MADlib supports this pattern
BATCH TRAINING, EVENT DRIVEN INFERENCE
~55% of today’s use cases (growing)
Example: Real Time Transaction Fraud: Train an ML model on historical data to classify, in real time, whether or not new credit/debit transactions are likely to be fraudulent
PostgreSQL/Greenplum with MADlib & MADlib Flow supports this pattern
EVENT DRIVEN TRAINING, EVENT DRIVEN INFERENCE
<5% of today’s use cases
Example: Online Advertising: Maximize click-through rate by algorithmically selecting and testing advertisement placement in real time
Highly specialized, with a low number of enterprise use cases
AI For The PostgreSQL Community
Standardized end-to-end Data Science in SQL with the Greenplum/Postgres stack
Experimentation
Initial code development and testing, model experimentation on samples.
Modeling at Scale
Heavy compute tasks such as model training across big data.
Deployment
Production deployment of models to feed downstream applications and reports.
Artificial Intelligence: Closed Loop Machine Learning
Model Deployment With MADlib Flow
User Operations
1. ML Training: Train ML model in Postgres or Greenplum using Apache MADlib
2. madlibflow --deploy: Set configs in .yml and deploy model from Greenplum to Docker, PCF or Kubernetes
Automated Backend Operations
3. Docker pull: Pull docker containers with optimized Postgres and MADlib
4. Pull Model: Extract model and feature table schema layout from Greenplum database
5. Load Model: Load model and feature table schema into optimized Postgres
6. Deploy: Deploy docker container to target environment
Containerized Deployment Of Models
$ madlibflow --deploy --target kubernetes --type model
Key benefits of MADlib Flow
● Easy to deploy and lightweight
● Highly scalable REST and Streaming
● End-to-end SQL workflow
● Low latency inference/predictions
● Feature Transformations
Single command to deploy a MADlib trained model from GPDB/Postgres to Docker, PCF or Kubernetes
Containerized deployment of Apache MADlib machine learning workflows for low latency, event driven inference at scale
MADlib Flow Components
MADlib Flow : Hello World!
Let us demonstrate a Linear Regression Model deployment.
Dependent variable:
● patient has had a second heart attack within 1 year
Independent variables:
● patient completed a treatment on anger control
● anxiety scale score
Workflow, in order: Create schema, Load data, Train model, Deploy model, Test, Batch prediction
Model Deployment
Deployment manifest
$ madlibflow --name patient-lr --type model --action deploy --target kubernetes --inputJson config.json
Model Deployment
Greenplum Database
Feature Engine
Credit/Debit Card Transaction
(Input)
Message
{
“transaction_ts”: ,
“credit_card_number”: ,
“transaction_amt”:,
“merchant_id”:
}
Approved Credit/Debit Card
Transaction
(Output)
Message
{
“transaction_ts”: ,
“transaction_amt”:,
“credit_card_number”:,
“num_transactions_30days”:,
“max_transactions_30days”:,
“merchant_id”:,
“num_fraud_cases”:,
“avg_transaction_amount_30days”:,
“fraud_risk_score”: 0.92,
“approved”: True
}
Accounts
credit_card_number
num_transactions_30days
max_transactions_30days
Merchants
merchant_id
num_fraud_cases
avg_transaction_amount_30days
Cache
(Gemfire, PCC, Redis, etc.)
Cache Abstraction
Cache Abstraction
SELECT mch.*
      ,acct.*
      ,log(msg.transaction_amt + 1) AS log_transaction_amt
FROM message msg
JOIN merchants mch ON msg.merchant_id = mch.merchant_id
JOIN accounts acct ON msg.credit_card_number = acct.credit_card_number;
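To make the join concrete, here is a rough Python equivalent of the feature query above. It is a sketch under stated assumptions: the cache is modeled as two plain dictionaries keyed by `credit_card_number` and `merchant_id`, all sample values are invented, and `build_features` is a hypothetical name. In PostgreSQL the single-argument `log()` is the base-10 logarithm, so the sketch uses `math.log10`.

```python
import math

# Hypothetical cache contents; column names follow the Accounts and Merchants tables.
accounts = {
    "4111": {"num_transactions_30days": 12, "max_transactions_30days": 3},
}
merchants = {
    "m-7": {"num_fraud_cases": 2, "avg_transaction_amount_30days": 54.0},
}

def build_features(msg, accounts, merchants):
    """Join an incoming message with cached account and merchant rows."""
    acct = accounts.get(msg["credit_card_number"])
    mch = merchants.get(msg["merchant_id"])
    if acct is None or mch is None:
        return None  # an inner join produces no row when either side is missing
    features = {
        "merchant_id": msg["merchant_id"],
        **mch,
        "credit_card_number": msg["credit_card_number"],
        **acct,
    }
    # log(msg.transaction_amt + 1) in the SQL; base 10 in PostgreSQL.
    features["log_transaction_amt"] = math.log10(msg["transaction_amt"] + 1)
    return features

message = {
    "transaction_ts": "2019-03-20T10:00:00",
    "credit_card_number": "4111",
    "transaction_amt": 99.0,
    "merchant_id": "m-7",
}
features = build_features(message, accounts, merchants)
print(features)
```

Adding 1 before taking the logarithm keeps a zero-amount transaction well defined, since the logarithm of 0 is not.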
MADlib REST
Cache Loader
Automated deployment
of scalable low latency
end-to-end ML pipelines
(“Data Science Ops.”)
No code conversion -
engineer features and
populate cache in SQL
Join data from the
incoming message with
cached data
Accounts Merchants
SELECT create_accounts(); SELECT create_merchants();
Example Flow for Fraud Detection
5. Learn More!
• Download
– http://madlib.apache.org/
• ~40 Jupyter notebooks
– https://github.com/apache/madlib-site/tree/asf-site/community-artifacts
• Wednesday March 20 @PostgresConf
#ScaleMatters
Backup Slides
Looking Ahead
MADlib 2.0
● More deep learning capabilities
○ Improved model performance
○ Hyperparameter tuning
● Model repositories and management for streamlined data science workflows
● New and improved SQL interface for MADlib functions
MADlib Flow
● Support for PL/Python and PL/R
● Native deployment to Pivotal Cloud Foundry as a buildpack
● Beta release in May ’19
● Metrics collector
MADlib 1.16
● Initial deep learning release for image classification (Keras/TensorFlow)
● Postgres 11 support
● Improve speed of k-nearest neighbors via approximate method
Apache MADlib Resources
• Web site
– http://madlib.apache.org/
• Wiki
– https://cwiki.apache.org/confluence/display/MADLIB/Apache+MADlib
• User docs
– http://madlib.apache.org/docs/latest/index.html
• Jupyter notebooks
– https://github.com/apache/madlib-site/tree/asf-site/community-artifacts
• Technical docs
– http://madlib.apache.org/design.pdf
• Pivotal commercial site
– http://pivotal.io/madlib
• Mailing lists and JIRAs
– https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/
– http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/
– https://issues.apache.org/jira/browse/MADLIB
• PivotalR
– https://cran.r-project.org/web/packages/PivotalR/index.html
• Github
– https://github.com/apache/madlib
– https://github.com/pivotalsoftware/PivotalR
Execution Flow
[Diagram labels: Client (psql), Database Server, Master, Segment 1, Segment 2, …, Segment n, SQL, Stored Procedure, Result Set, String Aggregation]
Artificial Intelligence Landscape
Deep
Learning
Distributed Deep Learning Methods
• Open area of research*
• Methods we have investigated so far:
– Simple averaging
– Ensembling
– Elastic averaging stochastic gradient descent (EASGD)
* Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
https://arxiv.org/pdf/1802.09941.pdf
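For readers unfamiliar with EASGD, a minimal sketch of the synchronous update as it is commonly stated may help: each worker takes a gradient step plus an elastic pull toward a shared center variable, and the center moves toward the workers. The slides do not say how it was configured on Greenplum, so everything below (scalar parameters, the toy objective, `lr`, `rho`, the function name) is an assumption for illustration.

```python
def easgd_round(workers, center, grads, lr, rho):
    """One synchronous EASGD step for scalar parameters.

    Each worker follows its own gradient plus an elastic term pulling it
    toward the center; the center moves toward the workers' positions.
    """
    new_workers = [
        x - lr * (g + rho * (x - center)) for x, g in zip(workers, grads)
    ]
    new_center = center + lr * rho * sum(x - center for x in workers)
    return new_workers, new_center

def grad(x):
    """Gradient of the toy objective f(x) = (x - 4)^2 / 2."""
    return x - 4.0

# Two hypothetical workers starting far apart, center starting at 0.
workers, center = [0.0, 8.0], 0.0
for _ in range(500):
    workers, center = easgd_round(
        workers, center, [grad(x) for x in workers], lr=0.1, rho=0.5
    )
print(round(center, 6), [round(x, 6) for x in workers])
```

A larger `rho` ties the workers more tightly to the center, while a smaller one lets them explore more independently between synchronizations.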
Some Results with CIFAR-10
• 60k 32x32 color images in 10 classes, with 6k images per class
• 50k training images and 10k test images
https://www.cs.toronto.edu/~kriz/cifar.html
MADlib Flow Benefits
■ Experimentation -> Modeling at scale -> Deployment, all in SQL
■ Single platform from model development to deployment using Postgres/Greenplum
■ Low latency inference
■ Easy to deploy both feature generation code and model
■ Join data from event message with feature cache objects using ANSI SQL
■ Continuously generate the features and feed them into the feature engine
■ Multiple versions of models can be deployed for accuracy measurement
■ Same tool can deploy to multiple container environments (PKS, AKS, GKE, etc.)
Model Training
Model Testing


AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019

  • 2. © Copyright 2019 Pivotal Software, Inc. All rights Reserved. Sridhar Paladugu Frank McQuillan AI on Greenplum Using Apache MADlib and MADlib Flow
  • 3. Greenplum Integrated Analytics Data Transformation Traditional BI Machine Learning Graph Data Science Productivity Tools Geospatial Text Deep Learning Build Manage Deploy
  • 4. ■ Machine learning ■ Deep learning ■ Model management ■ Deployment and orchestration of models Agenda
  • 5. © Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved. 1. Machine Learning with Apache MADlib
  • 6. Scalable, In-Database Machine Learning • Open source https://github.com/apache/madlib • Downloads and docs http://madlib.apache.org/ • Wiki https://cwiki.apache.org/confluence/display/MADLIB/ Apache MADlib: Big Data Machine Learning in SQL Open source, top level Apache project For PostgreSQL and Greenplum Database Powerful machine learning, graph, statistics and analytics for data scientists
  • 7. History MADlib project was initiated in 2011 by EMC/Greenplum architects and Professor Joe Hellerstein from University of California, Berkeley. UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills.
  • 8. Functions Data Types and Transformations Array and Matrix Operations Matrix Factorization • Low Rank • Singular Value Decomposition (SVD) Norms and Distance Functions Sparse Vectors Encoding Categorical Variables Path Functions Pivot Sessionize Stemming Apache MADlib 1.15.1 Graph All Pairs Shortest Path (APSP) Breadth-First Search Hyperlink-Induced Topic Search (HITS) Average Path Length Closeness Centrality Graph Diameter In-Out Degree PageRank and Personalized PageRank Single Source Shortest Path (SSSP) Weakly Connected Components Model Selection Cross Validation Prediction Metrics Train-Test Split Statistics Descriptive Statistics • Cardinality Estimators • Correlation and Covariance • Summary Inferential Statistics • Hypothesis Tests Probability Functions Supervised Learning Neural Networks Support Vector Machines (SVM) Conditional Random Field (CRF) Regression Models • Clustered Variance • Cox-Proportional Hazards Regression • Elastic Net Regularization • Generalized Linear Models • Linear Regression • Logistic Regression • Marginal Effects • Multinomial Regression • Naïve Bayes • Ordinal Regression • Robust Variance Tree Methods • Decision Tree • Random Forest Time Series Analysis • ARIMA Unsupervised Learning Association Rules (Apriori) Clustering (k-Means) Principal Component Analysis (PCA) Topic Modelling (Latent Dirichlet Allocation) Utility Functions Columns to Vector Conjugate Gradient Linear Solvers • Dense Linear Systems • Sparse Linear Systems Mini-Batching PMML Export Term Frequency for Text Vector to Columns Nearest Neighbors • k-Nearest Neighbors Sampling Balanced Random Stratified Comprehensive and mature data science library
  • 9. Why MADlib on Greenplum? • Better parallelism • Better scalability • Higher predictive accuracy • Top level ASF project “Apache MADlib Comes of Age”, Frank McQuillan, Oct. 2017, https://content.pivotal.io/blog/apache-madlib-comes-of-age
  • 10. Greenplum Database with MADlib Standby Master … Master Host SQL Interconnect Segment Host Node1 Segment Host Node2 Segment Host Node3 Segment Host NodeN Local Storage Other RDBMSes SparkGemFire Cloud Object Storage HDFS KafkaETL Spring Cloud Data Flow In-Database Functions Machine learning & statistics & math & graph & utilities MassivelyParallelProcessing
  • 11. Iterative Model Execution Master model = init(…) WHILE model not converged model = SELECT model.aggregation(…) FROM data table ENDWHILE Stored Procedure for Model … Broadcast Segment 2 Segment n … Transition Function Operates on tuples or mini-batches to update transition state (model) 1 Merge Function Combines transition states2 Final Function Transforms transition state into output value 3 Segment 1
  • 12. Familiar SQL Interface Train (build a predictive model) Predict (use model on new data)
  • 13. Familiar SQL Interface From house pricing model
  • 14. SVM Scale with Data Size Greenplum cluster: ● 1 master ● 4 segment hosts with 6 segments per host Support Vector Machines
  • 15. PageRank Scale with Graph Size Greenplum cluster: ● 1 master ● 4 segment hosts with 6 segments per host Normal random graphs with mean degrees 50 edges per vertex (i.e., 5B edges in the largest case) 5B edges (1K) (10K) (100K) (1M) (10M) (100M) Note: log-log scale (100s) (1s) (10K s) (1M s) “Graph Processing on Greenplum Database using Apache MADlib”, Frank McQuillan, Jan 2018, https://content.pivotal.io/blog/graph-processing-on-greenplum-database-using-apache-madlib
  • 16. But modeling is only part of the story... “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.” - Jeffrey Heer, Professor of Computer Science at the University of Washington and Co- founder of Trifacta “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, Aug. 17, 2014 https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
  • 18. © Copyright 2019 Pivotal Software, Inc. All rights Reserved.© Copyright 2019 Pivotal Software, Inc. All rights Reserved. 2. Deep Learning
  • 19. Deep Learning • Type of machine learning inspired by biology of the brain • Artificial neural networks with multiple layers between input and output
  • 20. Example Deep Learning Algorithms Multilayer perceptron (MLP) “The Original” Recurrent neural network (RNN) E.g., machine translation Convolutional neural network (CNN) E.g., image classification
  • 21. Convolutional Neural Networks (CNN) • Effective for computer vision • Fewer parameters than fully connected networks • Translational invariance • Classic networks: LeNet-5, AlexNet, VGG
  • 22. Graphics Processing Units (GPUs) • Great at performing a lot of simple computations such as matrix operations • Well suited to deep learning algorithms
  • 24. Moving Data Greenplum <-> Single Server Deep learningData preparation, feature generation, machine learning, geospatial, etc. Large data transfer Suboptimal
  • 25. Integrated Deep Learning with Greenplum Standby Master … Master Host SQL Interconnect Segment Host Node1 Segment Host Node2 Segment Host Node3 Segment Host NodeN GPU N … GPU 1 GPU N … GPU 1 GPU N … GPU 1 … GPU N … GPU 1 In-Database Functions Machine learning & statistics & math & graph & utilities MassivelyParallelProcessing
  • 26. Deep Learning on a Cluster (Num. Approach: Description). 1. Distributed deep learning: train single model architecture across the cluster; data distributed (usually randomly) across segments. 2. Data parallel models: train same model architecture in parallel on different data groups (e.g., build separate models per country). 3. Hyperparameter tuning: train same model architecture in parallel with different hyperparameter settings and incorporate cross validation; same data on each segment. 4. Neural architecture search: train different model architectures in parallel; same data on each segment. Callout on slide: Current work
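Approach 3 on this slide (hyperparameter tuning) amounts to enumerating a search space and handing each configuration to a segment. The following is a minimal sketch of that bookkeeping only; the parameter names, values, segment count, and the round-robin policy are hypothetical and do not represent a MADlib API.

```python
from itertools import product

# Hypothetical search space; the names and values are illustrative only.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 64],
}
num_segments = 4

# Every combination of settings, as one dict per configuration.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

# Round-robin assignment: segment i trains configs i, i + num_segments, ...
assignment = {seg: configs[seg::num_segments] for seg in range(num_segments)}

for seg, seg_configs in assignment.items():
    print(seg, seg_configs)
```

With six configurations and four segments, two segments train two configurations each and the other two train one, so a real scheduler would likely balance by expected training cost instead.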
  • 27. Data Loading and Formatting
  • 28. Testing Infrastructure • Google Cloud Platform (GCP) • Type n1-highmem-32 (32 vCPUs, 208 GB memory) • NVIDIA Tesla P100 GPUs • Greenplum database config – Tested up to 20 segment (worker node) clusters – 1 GPU per segment
  • 29. 6-layer CNN - Runtime (CIFAR-10) Method: Model weight averaging
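The method named on this slide, model weight averaging, can be sketched as follows: every segment starts a round from the same global weights, trains on its local data, and the per-segment weights are then averaged element-wise to form the next global weights. This is a simplified illustration with flat weight vectors and made-up gradients, not the MADlib implementation.

```python
def average_weights(segment_weights):
    """Element-wise mean of per-segment weight vectors of equal length."""
    num_segments = len(segment_weights)
    return [sum(column) / num_segments for column in zip(*segment_weights)]


def local_sgd_step(weights, gradient, learning_rate=0.1):
    """One plain SGD update on a single segment."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


# All segments start the round from the same global weights.
global_weights = [0.0, 0.0]
# Hypothetical gradients, one per segment, each from that segment's data.
segment_gradients = [[1.0, -2.0], [3.0, 0.0], [2.0, -1.0]]

local = [local_sgd_step(global_weights, g) for g in segment_gradients]
global_weights = average_weights(local)
print(global_weights)
```

In practice each segment would run many mini-batch updates between averaging steps; a single update per round is used here only to keep the example short.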
  • 30. 3. Model Management
  • 31. Try and try and try... • Data scientists typically try many different types of models with many different parameter combinations
  • 32. Model Persistence in MADlib 1.x One model at a time
  • 33. Model Persistence in MADlib 2.0 Multiple models at a time in model library
  • 34. 4. MADlib Flow
  • 35. Data Science Process [cycle diagram with stages: Model Evaluation, Operationalization, Model Building, Feature Engineering, Data Review, User Feedback, Problem Definition, Setup]
  • 36. Model Operationalization [same data science process diagram]. Model operationalization is the process of deploying data science models to production for ongoing use by other software
  • 37. Common Challenges With Operationalizing Models [same data science process diagram]: ● Handling production data ● Engineering for scale and performance ● Model transportation ● Managing and orchestrating deployed models ● Data scientists are not developers or platform experts
  • 38. Patterns For Operationalizing Models. (1) BATCH TRAINING, BATCH INFERENCE: ~40% of today’s use cases. Example: Tax Return Fraud: score a database of tax returns on a nightly basis to flag likely fraudulent returns for audit. PostgreSQL/Greenplum with MADlib supports this pattern. (2) BATCH TRAINING, EVENT DRIVEN INFERENCE: ~55% of today’s use cases (growing). Example: Real Time Transaction Fraud: train an ML model on historical data to classify, in real time, whether or not new credit/debit transactions are likely to be fraudulent. PostgreSQL/Greenplum with MADlib & MADlib Flow supports this pattern. (3) EVENT DRIVEN TRAINING, EVENT DRIVEN INFERENCE: <5% of today’s use cases. Example: Online Advertising: maximize click-through rate by algorithmically selecting and testing advertisement placement in real time. Highly specialized; low number of enterprise use cases
  • 39. AI For The PostgreSQL Community. Standardized end-to-end Data Science in SQL with the Greenplum/Postgres stack. Experimentation: initial code development and testing, model experimentation on samples. Modeling at Scale: heavy compute tasks such as model training across big data. Deployment: production deployment of models to feed downstream applications and reports. Artificial Intelligence: Closed Loop Machine Learning
  • 40. Model Deployment With MADlib Flow. User Operations: 1. ML Training: train ML model in Postgres or Greenplum using Apache MADlib. 2. madlibflow --deploy: set configs in .yml and deploy model from Greenplum to Docker, PCF or Kubernetes. Automated Backend Operations: 3. Docker pull: pull Docker containers with optimized Postgres and MADlib. 4. Pull Model: extract model and feature table schema layout from Greenplum database. 5. Load Model: load model and feature table schema into optimized Postgres. 6. Deploy: deploy Docker container to target environment
  • 41. Containerized Deployment Of Models. Containerized deployment of Apache MADlib machine learning workflows for low latency event driven inference and scale. Single command to deploy a MADlib trained model from GPDB/Postgres to Docker, PCF or Kubernetes: $ madlibflow --deploy --target kubernetes --type model Key benefits of MADlib Flow: ● Easy to deploy & lightweight ● Highly scalable REST and streaming ● End-to-end SQL workflow ● Low latency inference/predictions ● Feature transformations
  • 43. MADlib Flow: Hello World! Let us demonstrate a Linear Regression model deployment. Dependent variable: ● patient has had a second heart attack within 1 year. Independent variables: ● patient completed a treatment on anger control ● anxiety scale score. Workflow: Create schema, Load data, Train model, Deploy model, Test, Batch prediction
  • 44. Model Deployment Deployment manifest $ madlibflow --name patient-lr --type model --action deploy --target kubernetes --inputJson config.json
  • 46. Example Flow for Fraud Detection. Automated deployment of scalable low latency end-to-end ML pipelines (“Data Science Ops.”). No code conversion: engineer features and populate cache in SQL. Components: Greenplum Database; Cache Loader; Cache (Gemfire, PCC, Redis, etc.) behind a Cache Abstraction; Feature Engine; MADlib REST. Cache population in SQL: SELECT create_accounts(); SELECT create_merchants(); Cached tables: Accounts (credit_card_number, num_transactions_30days, max_transactions_30days); Merchants (merchant_id, num_fraud_cases, avg_transaction_amount_30days). Credit/Debit Card Transaction (Input) Message: { "transaction_ts": , "credit_card_number": , "transaction_amt": , "merchant_id": } Feature Engine joins data from the incoming message with cached data: SELECT mch.*, acct.*, log(msg.transaction_amt + 1) AS log_transaction_amt FROM message msg JOIN merchants mch ON msg.merchant_id=mch.merchant_id JOIN accounts acct ON msg.credit_card_number=acct.credit_card_number; Approved Credit/Debit Card Transaction (Output) Message: { "transaction_ts": , "transaction_amt": , "credit_card_number": , "num_transactions_30days": , "max_transactions_30days": , "merchant_id": , "num_fraud_cases": , "avg_transaction_amount_30days": , "fraud_risk_score": 0.92, "approved": True }
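The join on this slide can be mimicked outside the database to show what the feature engine produces for a single message. The sketch below uses plain Python dictionaries as stand-ins for the cached Accounts and Merchants tables; all keys and values are invented for illustration, and the scoring step is left out because the model is not described in the slides.

```python
import math

# Stand-ins for the cached feature tables (a real deployment would use a
# cache such as those named on the slide); all values are made up.
accounts = {
    "4111-0000": {"num_transactions_30days": 12, "max_transactions_30days": 5},
}
merchants = {
    "m-17": {"num_fraud_cases": 3, "avg_transaction_amount_30days": 82.5},
}


def build_features(message):
    """Join an incoming message with cached account and merchant features."""
    # Message fields are kept so the output message can echo them.
    features = dict(message)
    features.update(accounts[message["credit_card_number"]])
    features.update(merchants[message["merchant_id"]])
    # PostgreSQL's one-argument log() is base 10, hence log10 here.
    features["log_transaction_amt"] = math.log10(message["transaction_amt"] + 1)
    return features


message = {
    "transaction_ts": "2019-03-19T10:15:00",
    "credit_card_number": "4111-0000",
    "transaction_amt": 99.0,
    "merchant_id": "m-17",
}
features = build_features(message)
print(features["log_transaction_amt"])  # 2.0
```

Keeping the lookup keyed by `credit_card_number` and `merchant_id` mirrors the two JOIN conditions in the slide's SQL, so each incoming event costs two cache reads plus one arithmetic transform.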
  • 47. 5. Learn More!
  • 48. • Download – http://madlib.apache.org/ • ~40 Jupyter notebooks – https://github.com/apache/madlib-site/tree/asf-site/community-artifacts • Wednesday March 20 @PostgresConf
  • 49. #ScaleMatters
  • 50. Backup Slides
  • 51. Looking Ahead. MADlib 1.16: ● Initial deep learning release for image classification (Keras/TensorFlow) ● Postgres 11 support ● Improve speed of k-nearest neighbors via approximate method. MADlib 2.0: ● More deep learning capabilities ○ Improved model performance ○ Hyperparameter tuning ● Model repositories and management for streamlined data science workflows ● New and improved SQL interface for MADlib functions. MADlib Flow: ● Support for PL/Python and PL/R ● Native deployment to Pivotal Cloud Foundry as a buildpack ● Beta release in May ’19 ● Metrics collector
  • 52. Apache MADlib Resources • Web site – http://madlib.apache.org/ • Wiki – https://cwiki.apache.org/confluence/display/MADLIB/Apache+MADlib • User docs – http://madlib.apache.org/docs/latest/index.html • Jupyter notebooks – https://github.com/apache/madlib-site/tree/asf-site/community-artifacts • Technical docs – http://madlib.apache.org/design.pdf • Pivotal commercial site – http://pivotal.io/madlib • Mailing lists and JIRAs – https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/ – http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/ – https://issues.apache.org/jira/browse/MADLIB • PivotalR – https://cran.r-project.org/web/packages/PivotalR/index.html • Github – https://github.com/apache/madlib – https://github.com/pivotalsoftware/PivotalR
  • 53. Execution Flow [diagram labels: Client (psql); SQL; Database Server; Master; Segment 1, Segment 2, …, Segment n; Stored Procedure; String Aggregation; Result Set]
  • 55. Distributed Deep Learning Methods • Open area of research* • Methods we have investigated so far: – Simple averaging – Ensembling – Elastic averaging stochastic gradient descent (EASGD) * Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis https://arxiv.org/pdf/1802.09941.pdf
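Of the methods listed, EASGD is the least self-explanatory, so here is a minimal sketch of one synchronous variant on a toy problem. Each worker takes a gradient step while being pulled elastically toward a shared center variable, and the center moves toward the workers. The update rule follows the synchronous form as commonly stated in the EASGD literature; the scalar parameters, learning rate, `rho`, and quadratic objectives are assumptions for illustration and say nothing about how the presenters implemented it.

```python
def easgd_round(workers, center, gradients, lr=0.1, rho=0.5):
    """One synchronous EASGD update for scalar parameters."""
    alpha = lr * rho  # strength of the elastic pull
    new_workers = [x - lr * g - alpha * (x - center)
                   for x, g in zip(workers, gradients)]
    new_center = center + alpha * sum(x - center for x in workers)
    return new_workers, new_center


# Toy objective per worker: f_i(x) = 0.5 * (x - target_i) ** 2.
targets = [1.0, 2.0, 3.0]
workers = [0.0, 0.0, 0.0]
center = 0.0

for _ in range(200):
    gradients = [x - t for x, t in zip(workers, targets)]
    workers, center = easgd_round(workers, center, gradients)

print(round(center, 3))
```

On this toy problem the center settles at the mean of the targets, while each worker stays between its own target and the center. With simple averaging the workers would instead be reset to a common value after every round.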
  • 56. Some Results with CIFAR-10 • 60k 32x32 color images in 10 classes, with 6k images per class • 50k training images and 10k test images https://www.cs.toronto.edu/~kriz/cifar.html
  • 57. MADlib Flow Benefits ■ Experimentation -> Modeling at scale -> Deployment, all in SQL ■ Single platform from model development to deployment using Postgres/Greenplum ■ Low latency inference ■ Easy to deploy both feature generation code and model ■ Join data from event message with feature cache objects using ANSI SQL ■ Continuously generate the features and feed them into the feature engine ■ Multiple versions of models can be deployed for accuracy measurement ■ Same tool can deploy to multiple container environments: PKS, AKS, GKE, etc.