Pivotal Data Labs – Technology and 
Tools in our Data Scientist’s Arsenal 
Srivatsan Ramanujam 
Senior Data Scientist 
15 Oct 2014 Pivotal Data Labs 
© Copyright 2013 Pivotal. All rights reserved. 1
Agenda 
Ÿ Pivotal: Technology and Tools Introduction 
– Greenplum MPP Database and Pivotal Hadoop with HAWQ 
Ÿ Data Parallelism 
– PL/Python, PL/R, PL/Java, PL/C 
Ÿ Complete Parallelism 
– MADlib 
Ÿ Python and R Wrappers 
– PyMADlib and PivotalR 
Ÿ Open Source Integration 
– Spark and PySpark examples 
Ÿ Live Demos – Pivotal Data Science Tools in Action 
– Topic and Sentiment Analysis 
– Content Based Image Search 
Technology and Tools 
MPP Architectural Overview 
Think of it as multiple PostgreSQL servers:
• Master
• Segments/Workers
Rows are distributed across segments by a particular field (or randomly).
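A minimal sketch of distributing rows by a field, assuming a simple CRC32-modulo scheme. The function names and the hash choice are illustrative assumptions; the hash function Greenplum actually uses is not described on the slide.

```python
import zlib
from collections import defaultdict


def segment_for(key, n_segments):
    """Map a distribution-key value to a segment number."""
    # crc32 is stable across runs, unlike Python's built-in hash() of strings.
    return zlib.crc32(str(key).encode("utf-8")) % n_segments


def distribute(rows, key_field, n_segments):
    """Group rows by the segment that their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], n_segments)].append(row)
    return segments


# Hypothetical rows, for illustration only.
rows = [{"id": i, "body_style": "sedan" if i % 2 else "hatchback"} for i in range(8)]
by_id = distribute(rows, "id", n_segments=4)
```

Rows with the same key value always land on the same segment, which is what lets a per-key computation run locally on one segment.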
Implicit Parallelism – Procedural 
Languages 
Data Parallelism – Embarrassingly Parallel Tasks 
Ÿ Little or no effort is required to break up the problem into a 
number of parallel tasks, and there exists no dependency (or 
communication) between those parallel tasks. 
Ÿ Examples: 
– map() function in Python: 
>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 
>>> map(lambda e: e*e, x) 
>>> [1, 4, 9, 16, 25, 36, 49, 64, 81, 100] 
www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013 
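Because no task depends on another, the map() example can be handed to a pool of workers without changing its logic. The sketch below uses a thread pool from the standard library purely for portability; the helper names are illustrative, and a process pool or a cluster would be the usual choice for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def square(e):
    """Independent unit of work: needs no data from any other task."""
    return e * e


def parallel_squares(values, workers=4):
    # Executor.map hands each element to a free worker and returns the
    # results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


x = list(range(1, 11))
squares = parallel_squares(x)
```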
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.} 
• Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
• The interpreter/VM of the language 'X' is installed on each node of the Greenplum Database cluster
• Data Parallelism:
– PL/X piggybacks on Greenplum/HAWQ's MPP architecture
[Diagram labels: SQL, Master Host, Standby Master, Interconnect, and several Segment Hosts, each running multiple Segments.]
User Defined Functions – PL/Python Example 
Ÿ Procedural languages need to be installed on each database used. 
Ÿ Syntax is like normal Python function with function definition line replaced by SQL wrapper. 
Alternatively like a SQL User Defined Function with Python inside. 
CREATE FUNCTION pymax (a integer, b integer)
  RETURNS integer
AS $$
  if a > b:
    return a
  return b
$$ LANGUAGE plpythonu;
(SQL wrapper: the CREATE FUNCTION ... AS $$ lines and the closing $$ LANGUAGE line. Normal Python: the body between the $$ markers.)
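For comparison, the same body as a standalone Python function. Only the signature line differs from the PL/Python version; this is an illustration for reading the example, not something the database runs.

```python
def pymax(a, b):
    # Same logic as the PL/Python body: return the larger of two integers.
    if a > b:
        return a
    return b
```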
Returning Results 
Ÿ Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.) 
Ÿ Composite types can be returned by creating a composite type in the database: 
CREATE 
TYPE 
named_value 
AS 
( 
name 
text, 
value 
integer 
); 
Ÿ Then you can return a list, tuple or dict (not sets) which reference the same structure as the table: 
CREATE 
FUNCTION 
make_pair 
(name 
text, 
value 
integer) 
RETURNS 
named_value 
AS 
$$ 
return 
[ 
name, 
value 
] 
# 
or 
alternatively, 
as 
tuple: 
return 
( 
name, 
value 
) 
# 
or 
as 
dict: 
return 
{ 
"name": 
name, 
"value": 
value 
} 
# 
or 
as 
an 
object 
with 
attributes 
.name 
and 
.value 
$$ 
LANGUAGE 
plpythonu; 
Ÿ For functions which return multiple rows, prefix “setof” before the return type 
http://www.slideshare.net/PyData/massively-parallel-process-with-prodedural-python-ian-huston 
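The return shapes listed above can be pictured with a small normaliser. `as_named_value`, `FIELDS` and `Pair` are hypothetical names that mirror the rules on the slide; this is not PL/Python's actual conversion code.

```python
FIELDS = ("name", "value")  # column order of the named_value composite type


class Pair:
    """Plain object exposing .name and .value attributes."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


def as_named_value(result):
    """Normalise a return value to a (name, value) tuple."""
    if isinstance(result, (set, frozenset)):
        # Sets are unordered, so positions cannot be matched to columns.
        raise TypeError("a set cannot be mapped onto a composite type")
    if isinstance(result, dict):
        return tuple(result[f] for f in FIELDS)       # matched by key
    if isinstance(result, (list, tuple)):
        if len(result) != len(FIELDS):
            raise ValueError("wrong number of fields")
        return tuple(result)                          # matched by position
    return tuple(getattr(result, f) for f in FIELDS)  # matched by attribute
```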
Returning more results 
You can return multiple results by wrapping them in a sequence (tuple, list or set), 
an iterator or a generator: 
Sequence:
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
$$ LANGUAGE plpythonu;
Generator:
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  for i in range(3):
    yield (name, i)
$$ LANGUAGE plpythonu;
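The two bodies behave differently in plain Python as well: the sequence is built in full before it is returned, while the generator produces one row at a time. A small sketch, with function names chosen here for illustration:

```python
import types


def make_pair_sequence(name):
    # All three rows exist in memory before the function returns.
    return ([name, 1], [name, 2], [name, 3])


def make_pair_generator(name):
    # Rows are produced lazily, one per iteration.
    for i in range(3):
        yield (name, i)


rows_from_sequence = [tuple(r) for r in make_pair_sequence("x")]
rows_from_generator = list(make_pair_generator("x"))
is_lazy = isinstance(make_pair_generator("x"), types.GeneratorType)
```

Note that the two slide examples do not return the same values: the sequence uses 1 to 3, while range(3) yields 0 to 2.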
Accessing Packages 
Ÿ On Greenplum DB: To be available packages must be installed on the 
individual segment nodes. 
– Can use “parallel ssh” tool gpssh to conda/pip install 
– Currently Greenplum DB ships with Python 2.6 (!) 
Ÿ Then just import as usual inside function: 
CREATE 
FUNCTION 
make_pair 
(name 
text) 
RETURNS 
named_value 
AS 
$$ 
import 
numpy 
as 
np 
return 
((name,i) 
for 
i 
in 
np.arange(3)) 
$$ 
LANGUAGE 
plpythonu; 
UCI Auto MPG Dataset – A toy problem 
Sample Data 
Ÿ Sample Task: Aero-dynamics aside (attributable to body style), what is the effect of engine parameters 
(bore, stroke, compression_ratio, horsepower, peak_rpm) on the highway mpg of cars? 
Ÿ Solution: Build a Linear Regression model for each body style (hatchback, sedan) using the features 
bore, stroke, compression ration, horsepower and peak_rpm with highway_mpg as the target label. 
Ÿ This is a data parallel task which can be executed in parallel by simply piggybacking on the MPP 
architecture. One segment can build a model for Hatchbacks another for Sedan 
http://archive.ics.uci.edu/ml/datasets/Auto+MPG 
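A sketch of the "one model per body style" idea in plain Python, assuming a single feature (horsepower) and made-up data rather than the real dataset. Each group is fitted independently, which is what allows a different segment to handle each one.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_group(rows):
    """Group rows by body style and fit one independent model per group."""
    groups = defaultdict(list)
    for body_style, horsepower, highway_mpg in rows:
        groups[body_style].append((horsepower, highway_mpg))
    return {style: fit_line(points) for style, points in groups.items()}


# Made-up rows: (body_style, horsepower, highway_mpg).
rows = [
    ("hatchback", 70, 38), ("hatchback", 90, 34), ("hatchback", 110, 30),
    ("sedan", 100, 33), ("sedan", 140, 27), ("sedan", 180, 21),
]
models = fit_per_group(rows)  # {body_style: (intercept, slope)}
```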
Ridge Regression with scikit-learn on PL/Python 
[Code screenshot, annotated: a User Defined Type, a User Defined Aggregate and a User Defined Function; the function is a SQL wrapper around a Python body.]
PL/Python + scikit-learn : Model Coefficients 
[Query screenshot, annotated: choose features; build feature vector; invoke UDF; one model per body style; the output shows the physical machine on the cluster on which each regression model was built.]
Parallelized R in Pivotal via PL/R: 
An Example 
Ÿ With placeholders in SQL, write functions in the native R language 
Ÿ Accessible, powerful modeling framework 
http://pivotalsoftware.github.io/gp-r/ 
Parallelized R in Pivotal via PL/R: 
An Example 
Ÿ Execute PL/R function 
Ÿ Plain and simple table is returned 
http://pivotalsoftware.github.io/gp-r/ 
Parallelized R in Pivotal via PL/R: 
Parallel Bagged Decision Trees 
[Diagram: each tree makes a prediction; the predictions are aggregated to obtain the final prediction.]
http://pivotalsoftware.github.io/gp-r/
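The bagging pattern in the diagram can be sketched in a few lines. This is an illustration in Python rather than the PL/R code the slide refers to, and it assumes depth-one trees (decision stumps) on a single feature with made-up data.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold on x that best separates labels 0 and 1."""
    best = None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(label) for x, label in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]


def bagged_predict(data, x_new, n_models=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: sample with replacement, same size as the data.
        sample = [rng.choice(data) for _ in data]
        threshold = fit_stump(sample)          # models are trained independently
        votes.append(int(x_new > threshold))   # each model makes a prediction
    # Aggregate: the majority vote is the final prediction.
    return Counter(votes).most_common(1)[0][0]


# Made-up (x, label) pairs.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
```

Since the models do not communicate during training, each one could be fitted on a different segment and only the votes combined at the end.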
Complete Parallelism 
Complete Parallelism – Beyond Data Parallel Tasks 
Ÿ Data Parallel computation via PL/X libraries only allow us to 
run ‘n’ models in parallel. 
Ÿ This works great when we are building one model for each 
value of the group by column, but we need parallelized 
algorithms to be able to build a single model on all the 
available data 
Ÿ For this, we use MADlib – an open source library of parallel 
in-database machine learning algorithms. 
MADlib: Scalable, in-database Machine Learning 
http://madlib.net
MADlib In-Database Functions
Predictive Modeling Library
Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
Linear Systems
• Sparse and Dense Solvers
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank
Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
Linear Regression: Streaming Algorithm 
Ÿ Finding linear 
dependencies between 
variables 
Ÿ How to compute with a 
single scan? 
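One way to see why a single scan can suffice is the standard least-squares algebra, stated here as background rather than taken from the slide. Writing x_i for the i-th row of X (a row vector), the normal-equation solution depends on the data only through two sums over rows:

```latex
\hat{\beta} = (X^T X)^{-1} X^T y,
\qquad
X^T X = \sum_i x_i^T x_i,
\qquad
X^T y = \sum_i x_i^T y_i
```

Both sums can be accumulated row by row, so one pass over the table yields everything needed for the final small solve.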
Linear Regression: Parallel Computation 
$X^T y = \sum_i x_i^T y_i$
Linear Regression: Parallel Computation 
$X^T y = X_1^T y_1 + X_2^T y_2$
[Diagram: Segment 1 computes $X_1^T y_1$, Segment 2 computes $X_2^T y_2$, and the Master adds the two.]
Linear Regression: Parallel Computation 
$X^T y = X_1^T y_1 + X_2^T y_2$
[Diagram: the partial results from Segment 1 and Segment 2 are combined on the Master.]
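The segment/master split can be sketched in plain Python, assuming an intercept plus one feature so that the final solve is a 2x2 system. The function names and data are made up, and this is not MADlib's implementation.

```python
def partial_sums(rows):
    """One segment's contribution: X^T X and X^T y over its own rows.

    Each row is (x, y); the feature vector is [1, x].
    """
    xtx = [[0.0, 0.0], [0.0, 0.0]]
    xty = [0.0, 0.0]
    for x, y in rows:
        f = (1.0, float(x))
        for a in range(2):
            xty[a] += f[a] * y
            for b in range(2):
                xtx[a][b] += f[a] * f[b]
    return xtx, xty


def merge(p, q):
    """Master step: add two partial states element-wise."""
    xtx = [[p[0][a][b] + q[0][a][b] for b in range(2)] for a in range(2)]
    xty = [p[1][a] + q[1][a] for a in range(2)]
    return xtx, xty


def solve(state):
    """Solve the 2x2 normal equations (X^T X) beta = X^T y by Cramer's rule."""
    (a, b), (c, d) = state[0]
    e, f = state[1]
    det = a * d - b * c
    return ((e * d - b * f) / det, (a * f - e * c) / det)


# Made-up data lying exactly on y = 3 + 2x, split across two segments.
segment_1 = [(1, 5), (2, 7)]
segment_2 = [(3, 9), (4, 11)]
beta = solve(merge(partial_sums(segment_1), partial_sums(segment_2)))
```

Only the small accumulated matrices travel to the master; the rows themselves never leave their segment.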
Performing a linear regression on 10 million rows in 
seconds 
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of 
the VLDB Endowment 5.12 (2012): 1700-1711. 
Calling MADlib Functions: Fast Training, Scoring 
Ÿ MADlib allows users to easily and 
create models without moving data 
out of the systems 
– Model generation 
– Model validation 
– Scoring (evaluation of) new data 
Ÿ All the data can be used in one 
model 
Ÿ Built-in functionality to create of 
multiple smaller models (e.g. 
classification grouped by feature) 
Ÿ Open-source lets you tweak and 
extend methods, or build your own 
MADlib model function 
Table containing 
training data 
SELECT madlib.linregr_train( 'houses’,! 
'houses_linregr’,! 
'price’,! 
'ARRAY[1, tax, bath, size]’);! 
Table in which to 
save results 
Column containing 
Features included in the dependent variable 
model 
https://www.youtube.com/watch?v=Gur4FS9gpAg 
Calling MADlib Functions: Fast Training, Scoring 
SELECT madlib.linregr_train( 'houses',                      -- table containing training data
                             'houses_linregr',              -- table in which to save results
                             'price',                       -- column containing dependent variable
                             'ARRAY[1, tax, bath, size]',   -- features included in the model
                             'bedroom');                    -- create multiple output models (one for each value of bedroom)
https://www.youtube.com/watch?v=Gur4FS9gpAg
Calling MADlib Functions: Fast Training, Scoring 
SELECT madlib.linregr_train( 'houses',
                             'houses_linregr',
                             'price',
                             'ARRAY[1, tax, bath, size]');
SELECT houses.*,
       madlib.linregr_predict(ARRAY[1, tax, bath, size],  -- MADlib model scoring function
                              m.coef
                             ) AS predict
FROM houses,            -- table with data to be scored
     houses_linregr m;  -- table containing model
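Scoring with a linear model amounts to a dot product between the feature vector and the fitted coefficients. A sketch with made-up coefficients; `predict` is an illustrative helper, not the MADlib function itself.

```python
def predict(features, coef):
    """Dot product of a feature vector with fitted coefficients."""
    if len(features) != len(coef):
        raise ValueError("feature and coefficient vectors differ in length")
    return sum(f * c for f, c in zip(features, coef))


# Made-up coefficients for [intercept, tax, bath, size].
coef = [10000.0, 20.0, 5000.0, 50.0]
house = {"tax": 600, "bath": 2, "size": 1000}

# The leading 1 pairs with the intercept, as in ARRAY[1, tax, bath, size].
features = [1, house["tax"], house["bath"], house["size"]]
predicted_price = predict(features, coef)
```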
Python and R wrappers to MADlib 
PivotalR: Bringing MADlib and HAWQ to a familiar 
R interface 
Ÿ Challenge 
Want to harness the familiarity of R’s interface and the performance & 
scalability benefits of in-DB analytics 
Ÿ Simple solution: 
Translate R code into SQL 
Pivotal R 
d <- db.data.frame(”houses")! 
houses_linregr <- madlib.lm(price ~ tax! 
! ! !+ bath! 
! ! !+ size! 
! ! !, data=d)! 
SQL Code 
SELECT madlib.linregr_train( 'houses’,! 
'houses_linregr’,! 
'price’,! 
'ARRAY[1, tax, bath, size]’);! 
https://github.com/pivotalsoftware/PivotalR 
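To make the "translate into SQL" idea concrete, here is a toy translator written in Python. It handles only formulas of the form `y ~ a + b + c`; `formula_to_sql` is a hypothetical helper and says nothing about how PivotalR is actually implemented.

```python
def formula_to_sql(formula, source_table, output_table):
    """Translate 'y ~ a + b' into a madlib.linregr_train call (toy version)."""
    response, _, rhs = formula.partition("~")
    terms = [t.strip() for t in rhs.split("+") if t.strip()]
    # The leading 1 is the intercept term, as in the slide's SQL.
    features = ", ".join(["1"] + terms)
    return (
        "SELECT madlib.linregr_train('{0}', '{1}', '{2}', 'ARRAY[{3}]');"
        .format(source_table, output_table, response.strip(), features)
    )


sql = formula_to_sql("price ~ tax + bath + size", "houses", "houses_linregr")
```

Plain string formatting is acceptable for a sketch; code that accepts untrusted table or column names would need proper identifier quoting.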
PivotalR: Bringing MADlib and HAWQ to a familiar 
R interface 
Ÿ Challenge 
Want to harness the familiarity of R’s interface and the performance & 
scalability benefits of in-DB analytics 
Ÿ Simple solution: 
Translate R code into SQL 
Pivotal R 
# Build a regression model with a different! 
# intercept term for each state! 
# (state=1 as baseline).! 
# Note that PivotalR supports automated! 
# indicator coding a la as.factor()!! 
d <- db.data.frame(”houses")! 
houses_linregr <- madlib.lm(price ~ as.factor(state)! 
! ! ! !+ tax! 
! ! ! !+ bath! 
! ! ! !+ size! 
! ! ! !, data=d)! 
https://github.com/pivotalsoftware/PivotalR 
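Indicator coding replaces a categorical column with one 0/1 column per level, leaving out the baseline level so that its effect is absorbed by the intercept. A sketch in Python with made-up state values; the column naming is an assumption made here, not PivotalR's.

```python
def indicator_columns(values, baseline):
    """Build one 0/1 column per level, skipping the baseline level."""
    levels = sorted(set(values) - {baseline})
    return {
        "state_{0}".format(level): [int(v == level) for v in values]
        for level in levels
    }


states = [1, 2, 3, 2, 1]
columns = indicator_columns(states, baseline=1)
```

A row in the baseline state has zeros in every indicator column, so each remaining coefficient reads as an offset relative to state 1.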
PivotalR Design Overview 
[Diagram: (1) PivotalR translates R → SQL; (2) the SQL to execute is sent through RPostgreSQL to the Database/Hadoop w/ MADlib; (3) the computation results are returned. No data on the R side; the data lives in the database.]
• Call MADlib's in-DB machine learning functions directly from R
• Syntax is analogous to native R functions
• Data doesn't need to leave the database
• All heavy lifting, including model estimation & computation, is done in the database
https://github.com/pivotalsoftware/PivotalR
PyMADlib : Power of MADlib + Flexibility of Python 
Current PyMADlib Algorithms
– Linear Regression
– Logistic Regression
– K-Means
– LDA
Extras
– Support for Categorical variables
– Pivoting
http://pivotalsoftware.github.io/pymadlib/
Visualization 
Visualization 
Open Source Commercial 
Hack one when needed – Pandas_via_psql 
http://vatsan.github.io/pandas_via_psql/
SQL Client 
DB 
Integration with Open Source – 
(Py)Spark Example 
Apache Spark Project – Quick Overview 
• Apache Project, originated in AMPLab Berkeley 
• Supported on Pivotal Hadoop 2.0! 
http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
MapReduce vs. Spark 
http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
Data Parallelism in PySpark – A Simple Example 
• Next we’ll take the UCI automobile dataset example from PL/Python and 
demonstrate how to run in PySpark 
Scikit-Learn on PySpark – UCI Auto Dataset Example 
• This is in essence similar to the PL/Python example from
the earlier slide, except we're using data stored on HDFS
(Pivotal HD) with Spark as the platform in place of
HAWQ/Greenplum
Large Scale Topic and Sentiment 
Analysis of Tweets 
Social Media Demo 
Pivotal GNIP Decahose Pipeline 
[Pipeline diagram, components as labelled: Twitter Decahose (~55 million tweets/day); ingest with source: http and sink: hdfs; HDFS; PXF external tables; parallel parsing of JSON; nightly cron jobs; topic analysis through MADlib pLDA; unsupervised sentiment analysis (PL/Python); D3.js.]
http://www.slideshare.net/SrivatsanRamanujam/a-pipeline-for-distributed-topic-and-sentiment-analysis-of-tweets-on-pivotal-greenplum-database
Data Science + Agile = Quick Wins 
Ÿ The Team 
– 1 Data Scientist 
– 2 Agile Developers 
– 1 Designer (part-time) 
– 1 Project Manager (part-time) 
Ÿ Duration 
– 3 weeks! 
Live Demo – Topic and Sentiment Analysis 
Content Based Image Search 
CBIR Live Demo 
Pivotal Confidential–Internal Use Only 47
Content Based Information Retrieval - Task 
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
CBIR - Components 
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
Live Demo – Content Based Image Search 
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
Appendix 
Acknowledgements 
• Ian Huston, Woo Jung, Sarah Aerni, Gautam Muralidhar, Regunathan 
Radhakrishnan, Ronert Obst, Hai Qian, MADlib Engineering Team, 
Sumedh Mungee, Girish Lingappa 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

  • 1. Pivotal Data Labs – Technology and Tools in our Data Scientist’s Arsenal Srivatsan Ramanujam Senior Data Scientist 15 Oct 2014 Pivotal Data Labs © Copyright 2013 Pivotal. All rights reserved. 1
  • 2. Agenda Ÿ Pivotal: Technology and Tools Introduction – Greenplum MPP Database and Pivotal Hadoop with HAWQ Ÿ Data Parallelism – PL/Python, PL/R, PL/Java, PL/C Ÿ Complete Parallelism – MADlib Ÿ Python and R Wrappers – PyMADlib and PivotalR Ÿ Open Source Integration – Spark and PySpark examples Ÿ Live Demos – Pivotal Data Science Tools in Action – Topic and Sentiment Analysis – Content Based Image Search © Copyright 2013 Pivotal. All rights reserved. 2
  • 3. Technology and Tools © Copyright 2013 Pivotal. All rights reserved. 3
  • 4. MPP Architectural Overview Think of it as multiple PostGreSQL servers Master Segments/Workers Rows are distributed across segments by a particular field (or randomly) © Copyright 2013 Pivotal. All rights reserved. 4
  • 5. Implicit Parallelism – Procedural Languages © Copyright 2013 Pivotal. All rights reserved. 5
  • 6. Data Parallelism – Embarrassingly Parallel Tasks Ÿ Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks. Ÿ Examples: – map() function in Python: >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] >>> map(lambda e: e*e, x) >>> [1, 4, 9, 16, 25, 36, 49, 64, 81, 100] www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013 © Copyright 2013 Pivotal. All rights reserved. 6
  • 7. PL/X : X in {pgsql, R, Python, Java, Perl, C etc.} • Allows users to write Greenplum/ PostgreSQL functions in the R/Python/ Java, Perl, pgsql or C languages Standby Ÿ The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster • Data Parallelism: - PL/X piggybacks on Greenplum/HAWQ’s MPP architecture Master Segment Host Segment Segment … Master Host SQL Interconnect Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment © Copyright 2013 Pivotal. All rights reserved. 7
  • 8. User Defined Functions – PL/Python Example Ÿ Procedural languages need to be installed on each database used. Ÿ Syntax is like normal Python function with function definition line replaced by SQL wrapper. Alternatively like a SQL User Defined Function with Python inside. CREATE FUNCTION pymax (a integer, b integer) RETURNS integer AS $$ if a > b: return a return b $$ LANGUAGE plpythonu; SQL wrapper Normal Python SQL wrapper © Copyright 2013 Pivotal. All rights reserved. 8
  • 9. Returning Results Ÿ Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.) Ÿ Composite types can be returned by creating a composite type in the database: CREATE TYPE named_value AS ( name text, value integer ); Ÿ Then you can return a list, tuple or dict (not sets) which reference the same structure as the table: CREATE FUNCTION make_pair (name text, value integer) RETURNS named_value AS $$ return [ name, value ] # or alternatively, as tuple: return ( name, value ) # or as dict: return { "name": name, "value": value } # or as an object with attributes .name and .value $$ LANGUAGE plpythonu; Ÿ For functions which return multiple rows, prefix “setof” before the return type http://www.slideshare.net/PyData/massively-parallel-process-with-prodedural-python-ian-huston © Copyright 2013 Pivotal. All rights reserved. 9
  • 10. Returning more results You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator: CREATE FUNCTION make_pair (name text) RETURNS SETOF named_value AS $$ return ([ name, 1 ], [ name, 2 ], [ name, 3]) $$ LANGUAGE plpythonu; Sequence Generator CREATE FUNCTION make_pair (name text) RETURNS SETOF named_value AS $$ for i in range(3): yield (name, i) $$ LANGUAGE plpythonu; © Copyright 2013 Pivotal. All rights reserved. 10
  • 11. Accessing Packages Ÿ On Greenplum DB: To be available packages must be installed on the individual segment nodes. – Can use “parallel ssh” tool gpssh to conda/pip install – Currently Greenplum DB ships with Python 2.6 (!) Ÿ Then just import as usual inside function: CREATE FUNCTION make_pair (name text) RETURNS named_value AS $$ import numpy as np return ((name,i) for i in np.arange(3)) $$ LANGUAGE plpythonu; © Copyright 2013 Pivotal. All rights reserved. 11
  • 12. UCI Auto MPG Dataset – A toy problem Sample Data Ÿ Sample Task: Aero-dynamics aside (attributable to body style), what is the effect of engine parameters (bore, stroke, compression_ratio, horsepower, peak_rpm) on the highway mpg of cars? Ÿ Solution: Build a Linear Regression model for each body style (hatchback, sedan) using the features bore, stroke, compression ration, horsepower and peak_rpm with highway_mpg as the target label. Ÿ This is a data parallel task which can be executed in parallel by simply piggybacking on the MPP architecture. One segment can build a model for Hatchbacks another for Sedan http://archive.ics.uci.edu/ml/datasets/Auto+MPG © Copyright 2013 Pivotal. All rights reserved. 12
  • 13. Ridge Regression with scikit-learn on PL/Python SQL wrapper Python SQL wrapper User Defined Type User Defined Aggregate User Defined Function © Copyright 2013 Pivotal. All rights reserved. 13
  • 14. PL/Python + scikit-learn : Model Coefficients Physical machine on the cluster in which the regression model was built Invoke UDF Build Feature Vector Choose Features One model per body style © Copyright 2013 Pivotal. All rights reserved. 14
  • 15. Parallelized R in Pivotal via PL/R: An Example • With placeholders in SQL, write functions in the native R language • Accessible, powerful modeling framework http://pivotalsoftware.github.io/gp-r/
  • 16. Parallelized R in Pivotal via PL/R: An Example • Execute PL/R function • Plain and simple table is returned http://pivotalsoftware.github.io/gp-r/
  • 17. Parallelized R in Pivotal via PL/R: Parallel Bagged Decision Trees • Each tree makes a prediction • Aggregate and obtain final prediction http://pivotalsoftware.github.io/gp-r/
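The aggregation step for bagged trees can be sketched in a few lines of Python. This is only an illustration of the "each tree predicts, then aggregate" idea under assumed inputs: the per-tree predictions are invented, the function name `majority_vote` is hypothetical, and majority voting is one common aggregation rule for classification (the talk's PL/R code may aggregate differently).

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree label lists (one label per row) into one label per row."""
    final = []
    for row_votes in zip(*tree_predictions):
        # most_common(1) returns [(label, count)] for the most frequent label.
        final.append(Counter(row_votes).most_common(1)[0][0])
    return final


# Hypothetical predictions from three independently trained trees for three rows.
per_tree = [
    ["yes", "no", "yes"],
    ["yes", "no", "no"],
    ["no", "no", "yes"],
]
final_prediction = majority_vote(per_tree)
print(final_prediction)
```

With an odd number of trees and two labels there are no ties; other settings would need an explicit tie-breaking rule.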
  • 18. Complete Parallelism
  • 19. Complete Parallelism – Beyond Data Parallel Tasks • Data parallel computation via PL/X libraries only allows us to run ‘n’ models in parallel. • This works well when we are building one model for each value of the GROUP BY column, but to build a single model on all the available data we need parallelized algorithms. • For this, we use MADlib, an open source library of parallel in-database machine learning algorithms.
  • 20. MADlib: Scalable, in-database Machine Learning http://madlib.net
  • 21. MADlib In-Database Functions
      Predictive Modeling Library
      – Machine Learning Algorithms: Principal Component Analysis (PCA); Association Rules (Affinity Analysis, Market Basket); Topic Modeling (Parallel LDA); Decision Trees; Ensemble Learners (Random Forests); Support Vector Machines; Conditional Random Field (CRF); Clustering (K-means); Cross Validation
      – Linear Systems: Sparse and Dense Solvers
      – Generalized Linear Models: Linear Regression; Logistic Regression; Multinomial Logistic Regression; Cox Proportional Hazards Regression; Elastic Net Regularization; Sandwich Estimators (Huber-White, clustered, marginal effects)
      – Matrix Factorization: Singular Value Decomposition (SVD); Low-Rank
      Descriptive Statistics
      – Sketch-based Estimators: CountMin (Cormode-Muthukrishnan); FM (Flajolet-Martin); MFV (Most Frequent Values)
      – Correlation
      – Summary
      Support Modules
      – Array Operations; Sparse Vectors; Random Sampling; Probability Functions
  • 22. Linear Regression: Streaming Algorithm • Finding linear dependencies between variables • How to compute with a single scan?
  • 23. Linear Regression: Parallel Computation $X^T y = \sum_i x_i^T y_i$
  • 24. Linear Regression: Parallel Computation [Diagram: Segment 1 computes $X_1^T y_1$, Segment 2 computes $X_2^T y_2$, and the Master adds them] $X^T y = X_1^T y_1 + X_2^T y_2$
  • 25. Linear Regression: Parallel Computation [Diagram, continued: Segment 1, Segment 2, Master] $X^T y = X_1^T y_1 + X_2^T y_2$
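The decomposition on these slides can be sketched in pure Python. The sketch assumes two in-memory lists standing in for the rows held by two segments, and it applies the same row-wise sum to $X^T X$ as the slides show for $X^T y$ (both are sums over rows, so both split across segments). The function names `partial_sums`, `merge` and `solve` are hypothetical, and this is not MADlib's actual implementation.

```python
def partial_sums(rows):
    """One segment's contribution: (X^T X, X^T y) over its local rows only."""
    k = len(rows[0][0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for x, y in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master step: add two partial results element-wise."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty


def solve(a, b):
    """Solve a * coef = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Toy rows of ([1, x], y) generated from y = 1 + 2x, split across two "segments".
segment_1 = [([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
segment_2 = [([1.0, 3.0], 7.0), ([1.0, 4.0], 9.0)]

xtx, xty = merge(partial_sums(segment_1), partial_sums(segment_2))
coef = solve(xtx, xty)  # normal equations: (X^T X) coef = X^T y
print([round(c, 6) for c in coef])
```

Each segment scans its rows once and ships only a small k-by-k matrix and a length-k vector to the master, which is why the computation fits a single pass over the data.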
  • 26. Performing a linear regression on 10 million rows in seconds Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
  • 27. Calling MADlib Functions: Fast Training, Scoring • MADlib allows users to easily create models without moving data out of the system – Model generation – Model validation – Scoring (evaluation of) new data • All the data can be used in one model • Built-in functionality to create multiple smaller models (e.g. classification grouped by feature) • Open source lets you tweak and extend methods, or build your own • MADlib model function:
      SELECT madlib.linregr_train(
        'houses',                     -- table containing training data
        'houses_linregr',             -- table in which to save results
        'price',                      -- column containing the dependent variable
        'ARRAY[1, tax, bath, size]'); -- features included in the model
    https://www.youtube.com/watch?v=Gur4FS9gpAg
  • 28. Calling MADlib Functions: Fast Training, Scoring • MADlib model function with a grouping column:
      SELECT madlib.linregr_train(
        'houses',                     -- table containing training data
        'houses_linregr',             -- table in which to save results
        'price',                      -- column containing the dependent variable
        'ARRAY[1, tax, bath, size]',  -- features included in the model
        'bedroom');                   -- create multiple output models (one for each value of bedroom)
    https://www.youtube.com/watch?v=Gur4FS9gpAg
  • 29. Calling MADlib Functions: Fast Training, Scoring • Training:
      SELECT madlib.linregr_train(
        'houses',
        'houses_linregr',
        'price',
        'ARRAY[1, tax, bath, size]');
    • MADlib model scoring function:
      SELECT houses.*,
             madlib.linregr_predict(ARRAY[1, tax, bath, size], m.coef) AS predict
      FROM houses,            -- table with data to be scored
           houses_linregr m;  -- table containing the model
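For a linear model, scoring a row amounts to a dot product between the row's feature vector and the fitted coefficients. The Python sketch below shows that arithmetic only; the coefficient values and the house row are invented for illustration, and `predict` is a hypothetical helper, not a MADlib function.

```python
def predict(features, coef):
    """Linear-model score: sum of feature * coefficient over matching positions."""
    if len(features) != len(coef):
        raise ValueError("features and coef must have the same length")
    return sum(f * c for f, c in zip(features, coef))


# Hypothetical coefficients for [1 (intercept), tax, bath, size].
coef = [10.0, 0.5, 20.0, 0.25]
# One hypothetical house: the leading 1 pairs with the intercept term.
house = [1, 100, 2, 1000]
score = predict(house, coef)
print(score)
```

The leading 1 in the feature array is what lets the first coefficient act as the intercept, which matches the `ARRAY[1, tax, bath, size]` pattern used in the training call.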
  • 30. Python and R wrappers to MADlib
  • 31. PivotalR: Bringing MADlib and HAWQ to a familiar R interface • Challenge: we want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics • Simple solution: translate R code into SQL • PivotalR:
      d <- db.data.frame("houses")
      houses_linregr <- madlib.lm(price ~ tax
                                        + bath
                                        + size
                                        , data=d)
    • SQL code:
      SELECT madlib.linregr_train(
        'houses',
        'houses_linregr',
        'price',
        'ARRAY[1, tax, bath, size]');
    https://github.com/pivotalsoftware/PivotalR
  • 32. PivotalR: Bringing MADlib and HAWQ to a familiar R interface • PivotalR:
      # Build a regression model with a different
      # intercept term for each state
      # (state=1 as baseline).
      # Note that PivotalR supports automated
      # indicator coding a la as.factor()!
      d <- db.data.frame("houses")
      houses_linregr <- madlib.lm(price ~ as.factor(state)
                                        + tax
                                        + bath
                                        + size
                                        , data=d)
    https://github.com/pivotalsoftware/PivotalR
  • 33. PivotalR Design Overview [Diagram: PivotalR (no data here), RPostgreSQL, Database/Hadoop w/ MADlib (data lives here); steps: 1. R → SQL, 2. SQL to execute, 3. Computation results] • Call MADlib’s in-DB machine learning functions directly from R • Syntax is analogous to native R functions • Data doesn’t need to leave the database • All heavy lifting, including model estimation & computation, is done in the database https://github.com/pivotalsoftware/PivotalR
  • 34. PyMADlib : Power of MADlib + Flexibility of Python [Code screenshots: Linear Regression, Logistic Regression, Extras] • Current PyMADlib Algorithms – Linear Regression – Logistic Regression – K-Means – LDA • Extras – Support for categorical variables – Pivoting http://pivotalsoftware.github.io/pymadlib/
  • 35. Visualization
  • 36. Visualization Open Source Commercial
  • 37. Hack one when needed – Pandas_via_psql http://vatsan.github.io/pandas_via_psql/ SQL Client DB
  • 38. Integration with Open Source – (Py)Spark Example
  • 39. Apache Spark Project – Quick Overview • Apache Project, originated in AMPLab Berkeley • Supported on Pivotal Hadoop 2.0! http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
  • 40. MapReduce vs. Spark http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
  • 41. Data Parallelism in PySpark – A Simple Example • Next we’ll take the UCI automobile dataset example from PL/Python and demonstrate how to run it in PySpark
  • 42. Scikit-Learn on PySpark – UCI Auto Dataset Example • This is in essence similar to the PL/Python example from the earlier slide, except that we’re using data stored on HDFS (Pivotal HD) with Spark as the platform in place of HAWQ/Greenplum
  • 43. Large Scale Topic and Sentiment Analysis of Tweets Social Media Demo
  • 44. Pivotal GNIP Decahose Pipeline [Pipeline diagram; components: Twitter Decahose (~55 million tweets/day); Source: http; Sink: hdfs; HDFS; External Tables (PXF); Parallel Parsing of JSON; Nightly Cron Jobs; Topic Analysis through MADlib pLDA; Unsupervised Sentiment Analysis (PL/Python); D3.js] http://www.slideshare.net/SrivatsanRamanujam/a-pipeline-for-distributed-topic-and-sentiment-analysis-of-tweets-on-pivotal-greenplum-database
  • 45. Data Science + Agile = Quick Wins • The Team – 1 Data Scientist – 2 Agile Developers – 1 Designer (part-time) – 1 Project Manager (part-time) • Duration – 3 weeks!
  • 46. Live Demo – Topic and Sentiment Analysis
  • 47. Content Based Image Search CBIR Live Demo Pivotal Confidential–Internal Use Only 47
  • 48. Content Based Image Retrieval - Task http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
  • 49. CBIR - Components http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
  • 50. Live Demo – Content Based Image Search http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
  • 52. Acknowledgements • Ian Huston, Woo Jung, Sarah Aerni, Gautam Muralidhar, Regunathan Radhakrishnan, Ronert Obst, Hai Qian, MADlib Engineering Team, Sumedh Mungee, Girish Lingappa