Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal

Srivatsan Ramanujam
Senior Data Scientist
15 Oct 2014 Pivotal Data Labs
© Copyright 2013 Pivotal. All rights reserved.

Agenda
Ÿ Pivotal: Technology and Tools Introduction
– Greenplum MPP Database and Pivotal Hadoop with HAWQ
Ÿ Data Parallelism
– PL/Python, PL/R, PL/Java, PL/C
Ÿ Complete Parallelism
– MADlib
Ÿ Python and R Wrappers
– PyMADlib and PivotalR
Ÿ Open Source Integration
– Spark and PySpark examples
Ÿ Live Demos – Pivotal Data Science Tools in Action
– Topic and Sentiment Analysis
– Content Based Image Search

Technology and Tools

MPP Architectural Overview
Think of it as multiple
PostgreSQL servers
Master
Segments/Workers
Rows are distributed across segments by
a particular field (or randomly)

Implicit Parallelism – Procedural
Languages

Data Parallelism – Embarrassingly Parallel Tasks
Ÿ Little or no effort is required to break up the problem into a
number of parallel tasks, and there exists no dependency (or
communication) between those parallel tasks.
Ÿ Examples:
– map() function in Python:
>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> map(lambda e: e*e, x)
>>> [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013
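A minimal sketch of the same idea with an explicit worker pool, assuming Python 3's standard `concurrent.futures` module (not part of the original slides). Each element is squared independently, so the work can be split across workers with no communication between tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def square(e):
    """Square one element; no element depends on any other."""
    return e * e


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Executor.map returns results in input order, like the built-in map().
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(square, x))

print(squares)
```

Threads keep the sketch self-contained; for CPU-bound work, `ProcessPoolExecutor` from the same module is the usual substitute.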

PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
• Allows users to write Greenplum/PostgreSQL functions in R, Python, Java, Perl, pgsql or C
• The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster
• Data Parallelism:
- PL/X piggybacks on Greenplum/HAWQ’s MPP architecture
[Figure: MPP architecture. SQL arrives at the master host (with a standby), which connects over the interconnect to the segment hosts; each segment host runs multiple segments.]

User Defined Functions – PL/Python Example
Ÿ Procedural languages need to be installed on each database used.
Ÿ Syntax is like normal Python function with function definition line replaced by SQL wrapper.
Alternatively like a SQL User Defined Function with Python inside.
CREATE
FUNCTION
pymax
(a
integer,
b
integer)
RETURNS
integer
AS
$$
if
a
>
b:
return
a
return
b
$$
LANGUAGE
plpythonu;
SQL wrapper
Normal Python
SQL wrapper
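Because the body between the `$$` markers is ordinary Python, it can be exercised outside the database first. The sketch below is an assumption about workflow rather than something shown in the slides: it wraps the same logic in a plain function (also named `pymax` here) so it can be checked before being pasted into the SQL wrapper.

```python
def pymax(a, b):
    """Same logic as the PL/Python body: return the larger of a and b."""
    if a > b:
        return a
    return b


# Quick checks that mirror calls such as SELECT pymax(2, 7);
print(pymax(2, 7))
print(pymax(9, 3))
print(pymax(4, 4))
```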

Returning Results
Ÿ Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
Ÿ Composite types can be returned by creating a composite type in the database:
CREATE
TYPE
named_value
AS
(
name
text,
value
integer
);
Ÿ Then you can return a list, tuple or dict (not sets) which reference the same structure as the table:
CREATE
FUNCTION
make_pair
(name
text,
value
integer)
RETURNS
named_value
AS
$$
return
[
name,
value
]
#
or
alternatively,
as
tuple:
return
(
name,
value
)
#
or
as
dict:
return
{
"name":
name,
"value":
value
}
#
or
as
an
object
with
attributes
.name
and
.value
$$
LANGUAGE
plpythonu;
Ÿ For functions which return multiple rows, prefix “setof” before the return type
http://www.slideshare.net/PyData/massively-parallel-process-with-prodedural-python-ian-huston

Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set),
an iterator or a generator:
Sequence
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
$$ LANGUAGE plpythonu;
Generator
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  for i in range(3):
    yield (name, i)
$$ LANGUAGE plpythonu;
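Outside the database, the two styles can be compared in plain Python. This is an illustrative sketch only: `rows_as_sequence` and `rows_as_generator` are hypothetical names, and each returned or yielded item stands in for one `named_value` row. As in the slides, the sequence version numbers its rows 1 to 3 and the generator version 0 to 2.

```python
def rows_as_sequence(name):
    """Build every [name, value] row up front and return them together."""
    return ([name, 1], [name, 2], [name, 3])


def rows_as_generator(name):
    """Yield one (name, value) row at a time."""
    for i in range(3):
        yield (name, i)


print(list(rows_as_sequence("a")))
print(list(rows_as_generator("a")))
```

The generator produces rows lazily, which avoids holding the whole result in memory when the row count is large.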

Accessing Packages
Ÿ On Greenplum DB: To be available packages must be installed on the
individual segment nodes.
– Can use “parallel ssh” tool gpssh to conda/pip install
– Currently Greenplum DB ships with Python 2.6 (!)
Ÿ Then just import as usual inside function:
CREATE
FUNCTION
make_pair
(name
text)
RETURNS
named_value
AS
$$
import
numpy
as
np
return
((name,i)
for
i
in
np.arange(3))
$$
LANGUAGE
plpythonu;

UCI Auto MPG Dataset – A toy problem
Sample Data
Ÿ Sample Task: Aero-dynamics aside (attributable to body style), what is the effect of engine parameters
(bore, stroke, compression_ratio, horsepower, peak_rpm) on the highway mpg of cars?
Ÿ Solution: Build a Linear Regression model for each body style (hatchback, sedan) using the features
bore, stroke, compression ration, horsepower and peak_rpm with highway_mpg as the target label.
Ÿ This is a data parallel task which can be executed in parallel by simply piggybacking on the MPP
architecture. One segment can build a model for Hatchbacks another for Sedan
http://archive.ics.uci.edu/ml/datasets/Auto+MPG
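The per-group idea can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation from the slides: the rows are made-up numbers, only one feature (horsepower) is used instead of five, and `fit_line` is a hypothetical helper.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up rows: (body_style, horsepower, highway_mpg).
rows = [
    ("hatchback", 60, 40), ("hatchback", 80, 35), ("hatchback", 100, 30),
    ("sedan", 90, 34), ("sedan", 110, 30), ("sedan", 130, 26),
]

# Group rows by body style, much as a GROUP BY would.
groups = defaultdict(list)
for body_style, horsepower, mpg in rows:
    groups[body_style].append((horsepower, mpg))

# Each group is fitted independently, so the fits could run in parallel.
models = {style: fit_line(points) for style, points in groups.items()}
for style, (intercept, slope) in sorted(models.items()):
    print(style, round(intercept, 2), round(slope, 3))
```

No fit reads another group's rows, which is what makes the task data parallel.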

Ridge Regression with scikit-learn on PL/Python
[Figure: code screenshot. Callouts: User Defined Type, User Defined Aggregate, and User Defined Function (Python body between SQL wrappers).]

PL/Python + scikit-learn : Model Coefficients
[Figure: query and result screenshot. Callouts: choose features, build feature vector, invoke UDF. The output has one model per body style and shows the physical machine on the cluster on which each regression model was built.]

Parallelized R in Pivotal via PL/R:
An Example
Ÿ With placeholders in SQL, write functions in the native R language
Ÿ Accessible, powerful modeling framework
http://pivotalsoftware.github.io/gp-r/

An Example
Ÿ Execute PL/R function
Ÿ Plain and simple table is returned

Parallel Bagged Decision Trees
Each tree makes a prediction
Aggregate and obtain final prediction
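A small sketch of the two steps, under loud assumptions: one-split "stumps" stand in for full decision trees, the data is made up, and all names are hypothetical. Each stump is trained on its own bootstrap sample, so the training runs are independent and could be spread across segments; the final prediction is a majority vote.

```python
import random
from collections import Counter


def bootstrap(data, rng):
    """Sample len(data) rows with replacement, retrying until both classes appear."""
    while True:
        sample = [rng.choice(data) for _ in data]
        if len({label for _, label in sample}) == 2:
            return sample


def train_stump(sample):
    """One-split tree: threshold halfway between the two class means."""
    zeros = [x for x, label in sample if label == 0]
    ones = [x for x, label in sample if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def bagged_predict(stumps, x):
    """Each stump votes; the most common vote is the final prediction."""
    votes = Counter(1 if x > threshold else 0 for threshold in stumps)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]

# Five independent training runs (an odd count avoids tied votes).
stumps = [train_stump(bootstrap(data, rng)) for _ in range(5)]

print(bagged_predict(stumps, 1.5), bagged_predict(stumps, 8.5))
```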

Complete Parallelism

Complete Parallelism – Beyond Data Parallel Tasks
Ÿ Data Parallel computation via PL/X libraries only allow us to
run ‘n’ models in parallel.
Ÿ This works great when we are building one model for each
value of the group by column, but we need parallelized
algorithms to be able to build a single model on all the
available data
Ÿ For this, we use MADlib – an open source library of parallel
in-database machine learning algorithms.

MADlib: Scalable, in-database Machine Learning
http://madlib.net

MADlib In-Database Functions
Predictive Modeling Library
Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
Linear Systems
• Sparse and Dense Solvers
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank
Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions

Linear Regression: Streaming Algorithm
Ÿ Finding linear
dependencies between
variables
Ÿ How to compute with a
single scan?

Linear Regression: Parallel Computation
$$X^T y = \sum_i x_i^T y_i$$
[Figure: the rows of X and y are split across segments. Segment 1 computes $X_1^T y_1$, Segment 2 computes $X_2^T y_2$, and the master adds the partial results: $X_1^T y_1 + X_2^T y_2 = X^T y$.]
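The single-scan, per-segment computation can be sketched in plain Python. The slides show the decomposition for X^T y; X^T X splits across rows in the same way, and the standard least-squares solution is then (X^T X)^{-1} X^T y. Everything concrete here is an assumption for illustration: the data is made up, there is one feature plus an intercept, and `partial_sums`, `combine` and `solve_2x2` are hypothetical helpers, not MADlib functions.

```python
def partial_sums(rows):
    """One scan over a segment's rows: accumulate X^T X and X^T y.

    Each row is (x, y); the design vector is [1, x] (intercept and slope).
    """
    xtx = [[0.0, 0.0], [0.0, 0.0]]
    xty = [0.0, 0.0]
    for x, y in rows:
        v = (1.0, x)
        for i in range(2):
            xty[i] += v[i] * y
            for j in range(2):
                xtx[i][j] += v[i] * v[j]
    return xtx, xty


def combine(a, b):
    """Master step: add the partial sums from two segments."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[xtx_a[i][j] + xtx_b[i][j] for j in range(2)] for i in range(2)]
    xty = [xty_a[i] + xty_b[i] for i in range(2)]
    return xtx, xty


def solve_2x2(xtx, xty):
    """Solve (X^T X) beta = X^T y with Cramer's rule."""
    (a, b), (c, d) = xtx
    det = a * d - b * c
    return [(d * xty[0] - b * xty[1]) / det, (a * xty[1] - c * xty[0]) / det]


# Made-up rows following y = 1 + 2x, split across two segments.
segment_1 = [(0.0, 1.0), (1.0, 3.0)]
segment_2 = [(2.0, 5.0), (3.0, 7.0)]

total = combine(partial_sums(segment_1), partial_sums(segment_2))
beta = solve_2x2(*total)
print(beta)
```

Only the small accumulated matrices travel to the master, never the rows themselves, which is why one pass over the data suffices.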

Performing a linear regression on 10 million rows in
seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of
the VLDB Endowment 5.12 (2012): 1700-1711.

Calling MADlib Functions: Fast Training, Scoring
Ÿ MADlib allows users to easily and
create models without moving data
out of the systems
– Model generation
– Model validation
– Scoring (evaluation of) new data
Ÿ All the data can be used in one
model
Ÿ Built-in functionality to create of
multiple smaller models (e.g.
classification grouped by feature)
Ÿ Open-source lets you tweak and
extend methods, or build your own
MADlib model function
Table containing
training data
SELECT madlib.linregr_train( 'houses’,!
'houses_linregr’,!
'price’,!
'ARRAY[1, tax, bath, size]’);!
Table in which to
save results
Column containing
Features included in the dependent variable
model
https://www.youtube.com/watch?v=Gur4FS9gpAg

SELECT madlib.linregr_train('houses',
                            'houses_linregr',
                            'price',
                            'ARRAY[1, tax, bath, size]',
                            'bedroom');
Callouts:
– madlib.linregr_train: MADlib model function
– 'houses': table containing training data
– 'houses_linregr': table in which to save results
– 'price': column containing dependent variable
– 'ARRAY[1, tax, bath, size]': features included in the model
– 'bedroom': create multiple output models (one for each value of bedroom)
https://www.youtube.com/watch?v=Gur4FS9gpAg
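When such calls are issued from a script, the statement can be assembled programmatically. The helper below, `linregr_train_sql`, is hypothetical and only builds the SQL text for the training call shown on these slides, with the grouping column as an optional fifth argument; it does not connect to a database.

```python
def linregr_train_sql(source, output, dependent, features, group_by=None):
    """Assemble a madlib.linregr_train call as a SQL string.

    Illustration only: arguments are pasted in unescaped, so they must
    come from trusted code, never from user input.
    """
    args = [source, output, dependent,
            "ARRAY[1, " + ", ".join(features) + "]"]
    if group_by is not None:
        args.append(group_by)
    quoted = ", ".join("'" + arg + "'" for arg in args)
    return "SELECT madlib.linregr_train(" + quoted + ");"


single = linregr_train_sql("houses", "houses_linregr", "price",
                           ["tax", "bath", "size"])
grouped = linregr_train_sql("houses", "houses_linregr", "price",
                            ["tax", "bath", "size"], group_by="bedroom")
print(single)
print(grouped)
```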

SELECT houses.*,
       madlib.linregr_predict(ARRAY[1, tax, bath, size],
                              m.coef
                             ) AS predict
FROM houses, houses_linregr m;
Callouts:
– madlib.linregr_predict: MADlib model scoring function
– houses: table with data to be scored
– houses_linregr: table containing model
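For a linear model, scoring one row amounts to a dot product between the fitted coefficients and that row's feature vector, with a leading 1 for the intercept. A minimal sketch of that arithmetic follows; the coefficient values and the `predict_row` helper are made up for illustration and are not taken from any real model.

```python
def predict_row(coef, features):
    """Dot product of model coefficients and one row's feature vector."""
    if len(coef) != len(features):
        raise ValueError("coef and features must have the same length")
    return sum(c * x for c, x in zip(coef, features))


# Hypothetical coefficients for [intercept, tax, bath, size].
coef = [10.0, 2.0, 5.0, 0.5]
row = {"tax": 3.0, "bath": 2.0, "size": 100.0}

# Same layout as ARRAY[1, tax, bath, size] in the query.
features = [1.0, row["tax"], row["bath"], row["size"]]
prediction = predict_row(coef, features)
print(prediction)
```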

Python and R wrappers to MADlib

PivotalR: Bringing MADlib and HAWQ to a familiar
R interface
Ÿ Challenge
Want to harness the familiarity of R’s interface and the performance &
scalability benefits of in-DB analytics
Ÿ Simple solution:
Translate R code into SQL
Pivotal R
d <- db.data.frame(”houses")!
houses_linregr <- madlib.lm(price ~ tax!
! ! !+ bath!
! ! !+ size!
! ! !, data=d)!
SQL Code
'price’,!
https://github.com/pivotalsoftware/PivotalR

PivotalR: Bringing MADlib and HAWQ to a familiar
R interface
Ÿ Challenge
Want to harness the familiarity of R’s interface and the performance &
scalability benefits of in-DB analytics
Ÿ Simple solution:
Translate R code into SQL
Pivotal R
# Build a regression model with a different
# intercept term for each state
# (state=1 as baseline).
# Note that PivotalR supports automated
# indicator coding a la as.factor()!
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ as.factor(state)
                            + tax
                            + bath
                            + size
                            , data=d)

PivotalR Design Overview
• Call MADlib’s in-DB machine learning functions directly from R
• Syntax is analogous to native R function
• Data doesn’t need to leave the database
• All heavy lifting, including model estimation & computation, is done in the database
[Figure: PivotalR (no data here) translates R into SQL (1. R → SQL), sends the SQL to execute through RPostgreSQL (2.) to the Database/Hadoop w/ MADlib (data lives here), and receives the computation results (3.).]

PyMADlib : Power of MADlib + Flexibility of Python
Current PyMADlib Algorithms
– Linear Regression
– Logistic Regression
– K-Means
– LDA
Extras
– Support for Categorical variables
– Pivoting
[Figure: example panels for Linear Regression and Logistic Regression]
http://pivotalsoftware.github.io/pymadlib/
Visualization

Visualization
Open Source Commercial

Hack one when needed – Pandas_via_psql
http://vatsan.github.io/pandas_via_psql/
SQL Client
DB

Integration with Open Source –
(Py)Spark Example

Apache Spark Project – Quick Overview
• Apache Project, originated in AMPLab Berkeley
• Supported on Pivotal Hadoop 2.0!
http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

MapReduce vs. Spark
http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

Data Parallelism in PySpark – A Simple Example
• Next we’ll take the UCI automobile dataset example from PL/Python and
demonstrate how to run in PySpark

Scikit-Learn on PySpark – UCI Auto Dataset Example
• This is in essence similar to
the PL/Python example from
the earlier slide, except we’re
using data stored on HDFS
(Pivotal HD) with Spark as the
platform in place of HAWQ/
Greenplum

Large Scale Topic and Sentiment
Analysis of Tweets
Social Media Demo

Pivotal GNIP Decahose Pipeline
[Figure: pipeline diagram. Components shown: Twitter Decahose (~55 million tweets/day); ingestion (source: http, sink: hdfs); HDFS; PXF external tables; parallel parsing of JSON; nightly cron jobs; topic analysis through MADlib pLDA; unsupervised sentiment analysis (PL/Python); D3.js.]
http://www.slideshare.net/SrivatsanRamanujam/a-pipeline-for-distributed-topic-and-sentiment-analysis-of-tweets-on-pivotal-greenplum-database

Data Science + Agile = Quick Wins
Ÿ The Team
– 1 Data Scientist
– 2 Agile Developers
– 1 Designer (part-time)
– 1 Project Manager (part-time)
Ÿ Duration
– 3 weeks!

Live Demo – Topic and Sentiment Analysis

Content Based Image Search
CBIR Live Demo

Content Based Information Retrieval - Task
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database

CBIR - Components

Live Demo – Content Based Image Search

Appendix

Acknowledgements
• Ian Huston, Woo Jung, Sarah Aerni, Gautam Muralidhar, Regunathan
Radhakrishnan, Ronert Obst, Hai Qian, MADlib Engineering Team,
Sumedh Mungee, Girish Lingappa
