Large-Scale Machine Learning with Apache Spark
DB Tsai
Machine Learning Engineering Lead @ Alpine Data Labs
Internet of Things Conference @ Moscone Center, SF
http://www.iotaconf.com/
October 20, 2014
The Path to Innovation
TRADITIONAL DESKTOP → IN-DATABASE METHODS → WEB-BASED AND COLLABORATIVE → SIMPLIFIED CODE-FREE HADOOP & MPP DATABASE → ONGOING INNOVATION
The Path to Innovation
• Iterative algorithms scan through the data each time
• With Spark, data is cached in memory after the first iteration
• Quasi-Newton methods enhance the in-memory benefits
[Chart: runtime drops from 921s to 97s on a 150M-row dataset]
Machine Learning in the Big Data Era
• Hadoop MapReduce solutions
• MapReduce scales well for batch processing
• Lots of machine learning algorithms are iterative by nature
• There are lots of tricks people use, like training with sub-samples of the data and then averaging the models. Why have big data if you're only approximating?
Lightning-fast cluster computing 
• Empower users to iterate 
through the data by utilizing 
the in-memory cache. 
• Logistic regression runs up 
to 100x faster than Hadoop 
M/R in memory. 
• We’re able to train exact 
models without doing any 
approximation. 
Why MLlib? 
• MLlib is a Spark subproject providing Machine Learning 
primitives 
• It’s built on Apache Spark, a fast and general engine for 
large-scale data processing 
• Shipped with Apache Spark since version 0.8 
• High quality engineering design and effort 
• More than 50 contributors since July 2014 
Algorithms supported in MLlib 
• Classification: SVMs, logistic regression, decision trees, 
naïve Bayes, and random forests 
• Regression: linear regression, and random forests 
• Collaborative filtering: alternating least squares (ALS) 
• Clustering: k-means 
• Dimensionality reduction: singular value decomposition 
(SVD), and principal component analysis (PCA) 
• Basic statistics: summary statistics, correlations, stratified 
sampling, hypothesis testing, and random data generation 
• Feature extraction and transformation: TF-IDF, Word2Vec, 
StandardScaler, and Normalizer 
MapReduce Review 
• MapReduce – Simplified Data Processing on Large 
Clusters, 2004. 
• Scales Linearly 
• Data Locality 
• Fault Tolerance in Data Storage and Computation 
Hadoop MapReduce Review 
• Mapper: Loads the data and emits a set of key-value pairs.
• Reducer: Collects the key-value pairs with the same key, processes them, and outputs the result.
• Combiner: Can reduce shuffle traffic by combining key-value pairs locally before they go to the reducer.
• In-Mapper Combiner: Aggregates results on the mapper side, using an LRU cache to avoid running out of heap space. http://alpinenow.com/blog/in-mapper-combiner/
• Good: Built-in fault tolerance, scalable, and production-proven in industry.
• Bad: Optimized for disk IO without leveraging memory well; iterative algorithms go through disk IO again and again; the primitive API is not easy and clean to develop against.
Spark MapReduce 
• Spark also uses MapReduce as a programming model, but with much richer APIs in Scala, Java, and Python.
• With Scala's expressive APIs, 5-10x less code.
• Not just a distributed computation framework: Spark provides several pre-built components that help users implement applications faster and more easily.
- Spark Streaming
- Spark SQL
- MLlib (Machine Learning)
- GraphX (Graph Processing)
Resilient Distributed Datasets (RDDs) 
• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• RDDs can be created by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Hive, or any data source offering a Hadoop InputFormat.
• RDDs can be cached in memory or on disk.
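As a hedged illustration of the two creation routes (the app name and path are placeholders, not from the slides):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-demo"))

// From an existing collection in the driver program:
val numbers = sc.parallelize(1 to 1000000)

// From an external storage system offering a Hadoop InputFormat:
val lines = sc.textFile("hdfs:///data/input.txt")
```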
Hadoop M/R vs Spark M/R 
• Hadoop [diagram: each iteration reads from and writes back to HDFS]
• Spark [diagram: intermediate data cached in memory across iterations]
RDD Operations - two types of operations 
• Transformations: Create a new dataset from an existing one. They are lazy, in that they do not compute their results right away.
• Actions: Return a value to the driver program after running a computation on the dataset.
Transformations 
• map(func) - Return a new distributed dataset formed by passing each 
element of the source through a function func. 
• filter(func) - Return a new dataset formed by selecting those elements of the 
source on which func returns true. 
• flatMap(func) - Similar to map, but each input item can be mapped to 0 or 
more output items (so func should return a Seq rather than a single item). 
• mapPartitions(func) - Similar to map, but runs separately on each partition 
(block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when 
running on an RDD of type T. 
• groupByKey([numTasks]) - When called on a dataset of (K, V) pairs, returns a 
dataset of (K, Iterable<V>) pairs. 
• reduceByKey(func, [numTasks]) – When called on a dataset of (K, V) pairs, 
returns a dataset of (K, V) pairs where the values for each key are 
aggregated using the given reduce function func, which must be of type (V,V) 
=> V. 
http://spark.apache.org/docs/latest/programming-guide.html#transformations
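A minimal sketch exercising the transformations above (the data is made up for illustration):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val doubled = pairs.map { case (k, v) => (k, v * 2) }   // map: one output per input
val large   = pairs.filter { case (_, v) => v > 1 }     // filter: keep matching elements
val grouped = pairs.groupByKey()                        // (K, Iterable[V]) pairs
val summed  = pairs.reduceByKey(_ + _)                  // values aggregated per key
```

Nothing is computed yet: all four are lazy and only build the lineage graph until an action runs.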
Actions 
• reduce(func) - Aggregate the elements of the dataset 
using a function func (which takes two arguments and 
returns one). The function should be commutative and 
associative so that it can be computed correctly in 
parallel. 
• collect() - Return all the elements of the dataset as an 
array at the driver program. This is usually useful after a 
filter or other operation that returns a sufficiently small 
subset of the data. 
• count(), first(), take(n), saveAsTextFile(path), etc. 
http://spark.apache.org/docs/latest/programming-guide.html#actions
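A hedged sketch of these actions on a small in-memory RDD (output path is a placeholder):

```scala
val nums = sc.parallelize(1 to 100)

val sum   = nums.reduce(_ + _)                    // commutative & associative combine
val n     = nums.count()
val first = nums.first()
val five  = nums.take(5)                          // Array(1, 2, 3, 4, 5)
val small = nums.filter(_ % 10 == 0).collect()    // safe: the filtered subset is small
nums.saveAsTextFile("hdfs:///out/nums")           // placeholder output path
```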
RDD Persistence/Cache 
• RDD can be persisted using the persist() or cache() 
methods on it. The first time it is computed in an action, it 
will be kept in memory on the nodes. Spark’s cache is 
fault-tolerant – if any partition of an RDD is lost, it will 
automatically be recomputed using the transformations 
that originally created it. 
• Each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, or to persist it in memory but as serialized Java objects (to save space).
RDD Storage Level 
• MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. 
If the RDD does not fit in memory, some partitions will not be cached 
and will be recomputed on the fly each time they're needed. This is the 
default level. 
• MEMORY_AND_DISK - Store RDD as deserialized Java objects in the 
JVM. If the RDD does not fit in memory, store the partitions that don't fit 
on disk, and read them from there when they're needed. 
• MEMORY_ONLY_SER - Store RDD as serialized Java objects (one 
byte array per partition). This is generally more space-efficient than 
deserialized objects, especially when using a fast serializer, but more 
CPU-intensive to read. 
• MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but 
spill partitions that don't fit in memory to disk instead of recomputing 
them on the fly each time they're needed. 
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
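A small sketch of choosing storage levels (paths are placeholders):

```scala
import org.apache.spark.storage.StorageLevel

val hot = sc.textFile("hdfs:///data/hot.txt").cache()   // MEMORY_ONLY, the default
val big = sc.textFile("hdfs:///data/big.txt")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)            // serialized, spills to disk

val n = hot.count()   // the first action materializes the cache
```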
Word Count Example in Scala 
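The code screenshot is missing from the transcript; the canonical Spark word count it presumably showed looks like this (paths are placeholders):

```scala
val counts = sc.textFile("hdfs:///data/input.txt")   // placeholder input path
  .flatMap(line => line.split(" "))                  // one record per word
  .map(word => (word, 1))                            // (word, 1) pairs
  .reduceByKey(_ + _)                                // count per word

counts.saveAsTextFile("hdfs:///data/wordcounts")     // placeholder output path
```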
API design philosophy in MLlib
• Works seamlessly with Spark Core and Spark SQL; users can use the core APIs or Spark SQL for data pre-processing, and then pipe into the training step.
• Algorithms are implemented in Scala. Public interfaces don't use advanced Scala features, to ensure Java compatibility.
• Many of MLlib's APIs have Python bindings.
• MLlib is under active development. The APIs marked Experimental/DeveloperApi may change in future releases, and a migration guide will be provided if they change.
• APIs are well documented and designed to be expressive.
• The code is well tested, with comprehensive unit-test coverage. There are lots of comments in the code, and it's an enjoyable experience to read.
Data Types 
• MLlib local vectors and local matrices currently wrap the Breeze implementation; as a result, the underlying linear algebra operations are provided by Breeze and jblas. https://github.com/scalanlp/breeze
• However, the methods converting MLlib vectors/matrices to Breeze and back are private to the org.apache.spark.mllib scope. This restriction can be worked around by putting your custom code in an org.apache.spark.mllib.something package.
• A training sample used in supervised learning is stored in a LabeledPoint, which contains a label/response and a feature vector in dense or sparse format.
• Distributed RowMatrix – basically an RDD[Vector], which doesn't have meaningful row indices.
• Distributed IndexedRowMatrix – similar to RowMatrix, but each row is represented by its index and a local vector.
Local vector 
The base class of local vectors is Vector, and we provide two implementations: 
DenseVector and SparseVector. 
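The slide's code block is missing; it presumably resembled the standard MLlib guide example:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense vector (1.0, 0.0, 3.0):
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// The same vector in sparse form: size 3, non-zeros at indices 0 and 2.
val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
```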
Some useful tips related to local vector 
• If you want to use native Breeze functionality, you can put your code in the org.apache.spark.mllib package.
Real code in MLlib: MultivariateOnlineSummarizer
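The source screenshot is missing from the transcript. As a hedged stand-in for the pattern both slides describe, here is code placed under org.apache.spark.mllib so it can reach the private Breeze converters (toBreeze is assumed from the Spark 1.x internals; the package suffix and object name are hypothetical):

```scala
// Hypothetical file placed under org.apache.spark.mllib to see private[mllib] members.
package org.apache.spark.mllib.myextensions

import org.apache.spark.mllib.linalg.Vector

object BreezeBridge {
  // toBreeze was private[mllib] in Spark 1.x; outside this package it is not visible.
  def asBreeze(v: Vector): breeze.linalg.Vector[Double] = v.toBreeze
}
```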
LabeledPoint 
• Double is used for storing the label, so we can use the labeled points 
in both regression and classification. For binary classification, a label 
should be either 0.0 or 1.0. For N-class classification, labels should 
be class indices starting from zero: 0.0, 1.0, 2.0, …, N - 1 
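For example:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Positive example with a dense feature vector:
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Negative example with the same features in sparse form:
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
```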
Supervised Learning 
• Binary Classification: linear SVMs (SGD), logistic regression (L-BFGS and SGD), decision trees, random forests (Spark 1.2), and naïve Bayes.
• Multiclass Classification: decision trees and naïve Bayes (coming soon: multinomial logistic regression as in GLMNET).
• Regression: linear least squares (SGD), Lasso (SGD + soft-threshold), ridge regression (SGD), decision trees, and random forests (Spark 1.2).
• Currently, the regularization in the linear models penalizes all the weights, including the intercept, which is not desired in some use cases. Alpine has a GLMNET implementation using OWLQN which can exactly reproduce R's GLMNET package results at scale. We're in the process of merging it into the MLlib community.
LinearRegressionWithSGD 
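The code screenshot is missing; in the Spark 1.x API the call looked roughly like this (training is an assumed RDD[LabeledPoint], and 100 iterations is a placeholder choice):

```scala
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// training: RDD[LabeledPoint] prepared as above.
val model = LinearRegressionWithSGD.train(training, 100)

val labelsAndPreds = training.map(lp => (lp.label, model.predict(lp.features)))
```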
SVMWithSGD 
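Likewise, a hedged sketch of the missing SVM snippet (same assumptions as above):

```scala
import org.apache.spark.mllib.classification.SVMWithSGD

// Labels must be 0.0 or 1.0 for binary classification.
val svmModel = SVMWithSGD.train(training, 100)
svmModel.clearThreshold()   // return raw margins instead of hard 0/1 predictions

val scores = training.map(lp => (lp.label, svmModel.predict(lp.features)))
```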
SPARK-2934: LogisticRegressionWithLBFGS 
• Merged in Spark 1.1 
• Contributed by Alpine Data Labs 
• Uses L-BFGS to train Logistic Regression instead of the default Gradient Descent.
• Users don't have to construct the objective function for Logistic Regression themselves, and don't have to implement all the details.
• Together with SPARK-2979, which minimizes the condition number, the convergence rate is further improved.
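In the Spark 1.1 API the call is roughly (training is an assumed RDD[LabeledPoint]):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Optimizer settings (iterations, tolerance, etc.) can be tuned via lr.optimizer.
val lr = new LogisticRegressionWithLBFGS()
val lrModel = lr.run(training)
```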
SPARK-2979: Improve the convergence rate by standardizing the training features
• Merged in Spark 1.1
• Contributed by Alpine Data Labs
• Due to the invariance property of MLEs, the scale of your inputs is irrelevant.
• However, the optimizer will not be happy with poor condition numbers, which can often be improved by scaling.
• The model is trained in the scaled space, but the coefficients are converted back to the original space; as a result, it's transparent to users.
• Without this, some training datasets that mix columns with different scales may not be able to converge.
• Scikit-learn and the glmnet package also standardize the features before training to improve convergence.
• Only enabled in Logistic Regression for now; a sketch of the idea follows below.
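A hedged sketch of the idea, not the MLlib implementation itself: scale each feature to unit standard deviation, train in the scaled space, then map the weights back.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

// Per-feature standard deviations over the training set (training: RDD[LabeledPoint]).
val sigma = Statistics.colStats(training.map(_.features)).variance.toArray.map(math.sqrt)

val scaled = training.map { lp =>
  val xs = lp.features.toArray.zip(sigma).map { case (x, s) => if (s != 0.0) x / s else x }
  LabeledPoint(lp.label, Vectors.dense(xs))
}
// After training on `scaled`, weights map back as w_orig(i) = w_scaled(i) / sigma(i).
```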
a9a Dataset Benchmark
Logistic Regression with the a9a dataset (11M rows, 123 features, 11% non-zero elements)
16 executors on a 5-node Hadoop 2.0.5-alpha cluster (Intel Xeon E3-1230v3, 32GB memory per node)
[Chart: log-likelihood / number of samples vs. iterations, comparing L-BFGS and GD]
rcv1 Dataset Benchmark
Logistic Regression with the rcv1 dataset (6.8M rows, 677,399 features, 0.15% non-zero elements)
16 executors on a 5-node Hadoop 2.0.5-alpha cluster (Intel Xeon E3-1230v3, 32GB memory per node)
[Chart: log-likelihood / number of samples vs. seconds, comparing L-BFGS and GD with sparse vectors]
news20 Dataset Benchmark
Logistic Regression with the news20 dataset (0.14M rows, 1,355,191 features, 0.034% non-zero elements)
16 executors on a 5-node Hadoop 2.0.5-alpha cluster (Intel Xeon E3-1230v3, 32GB memory per node)
[Chart: log-likelihood / number of samples vs. seconds, comparing L-BFGS and GD with sparse vectors]
K-Means 
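The code screenshot is missing; a minimal sketch in the Spark 1.x MLlib API (path, k = 2, and 20 iterations are placeholder choices):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs:///data/kmeans.txt")   // placeholder path
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val clusters = KMeans.train(data, 2, 20)
val wssse = clusters.computeCost(data)   // within-set sum of squared errors
```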
PCA + K-Means 
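Again the screenshot is missing; a hedged sketch that projects rows onto the top 10 principal components and clusters in the reduced space (data is the RDD[Vector] from the previous sketch):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(data)
val pc = mat.computePrincipalComponents(10)          // 10 components is a placeholder
val projected = mat.multiply(pc).rows.cache()        // rows in PC space

val clustersInPcSpace = KMeans.train(projected, 2, 20)
```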
Collaborative Filtering 
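A hedged sketch of MLlib's ALS-based collaborative filtering (file format, rank, iterations, and lambda are placeholder choices):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

val alsModel = ALS.train(ratings, 10, 20, 0.01)   // rank, iterations, lambda
val prediction = alsModel.predict(1, 42)          // predicted rating of product 42 by user 1
```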
Spark-1157: L-BFGS Optimizer 
• No, it's not a blender!
What is Spark-1157: L-BFGS Optimizer 
• Merged in Spark 1.0
• Contributed by Alpine Data Labs
• A popular algorithm for parameter estimation in Machine Learning.
• It's a quasi-Newton method.
• The Hessian matrix of second derivatives doesn't need to be evaluated directly.
• The Hessian matrix is approximated using gradient evaluations.
• It converges much faster than the default optimizer in Spark, Gradient Descent.
• We are contributing OWLQN, a variant of L-BFGS that handles L1 problems, to Spark. It's a building block of GLMNET.
SPARK-2505: Weighted Regularization (ongoing work)
• Each component of the weights can be penalized differently.
• We can exclude the intercept from regularization in this framework.
• Decouples regularization from the raw gradient update, which is not used in other optimization schemes.
• Allows various update/learning-rate schemes (AdaGrad, normalized adaptive gradient, etc.) to be applied independently of the regularization.
• Smooth and L1 regularization will be handled differently in the optimizer.
SPARK-2309: Multinomial Logistic Regression (ongoing work)
• For a K-class multinomial problem, we can generalize via K - 1 linear models with logistic link functions.
• As a result, the weights will have dimension (K - 1)(N + 1), where N is the number of features.
• The MLlib interface is designed for one set of parameters per model, so this requires some interface design changes.
• Expected to be merged in the next release of MLlib, Spark 1.2.
Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
SPARK-2272: Transformer 
A spark, the soul of a transformer 
SPARK-2272: Transformer 
l Merged in Spark 1.1 
l Contributed by Alpine Data Labs 
l MLlib data preprocessing pipeline. 
l StandardScaler 
- Standardize features by removing the mean and scaling to unit variance. 
- RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear 
models typically works better with zero mean and unit variance. 
l Normalizer 
- Normalizes samples individually to unit L^n norm. 
- Common operation for text classification or clustering for instance. 
- For example, the dot product of two l2-normalized TF-IDF vectors is the 
cosine similarity of the vectors. 
Learn more about Advanced Analytics at http://www.alpinenow.com
StandardScaler 
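The code screenshot is missing; usage in the Spark 1.1-era API looks roughly like this (training is an assumed RDD[LabeledPoint]):

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// Fit on the feature vectors; note that withMean = true densifies sparse input.
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(training.map(_.features))

val standardized = training.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
```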
Normalizer
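A corresponding hedged sketch for the missing Normalizer snippet (same assumed training RDD):

```scala
import org.apache.spark.mllib.feature.Normalizer

val normalizer = new Normalizer()   // unit L2 norm by default; new Normalizer(p) for L^p
val unitRows = training.map(lp => normalizer.transform(lp.features))
```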
SPARK-1969: Online Summarizer
• Merged in Spark 1.1
• Contributed by Alpine Data Labs
• Online algorithms for computing the mean, variance, min, and max in a streaming fashion.
• Two online summarizers can be merged, so we can use one summarizer per block of data in the map phase, and merge all of them in the reduce phase to obtain the global summary.
• A numerically stable one-pass algorithm is implemented to avoid the catastrophic cancellation of the naive implementation. Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
• Optimized for sparse vectors: the time complexity is O(non-zeros) instead of O(numCols) per sample. A usage sketch follows.
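A hedged sketch of the map-side/reduce-side merge pattern described above (features is an assumed RDD[Vector]):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// Fold samples into partition-local summarizers, then merge into a global summary.
val summary = features.aggregate(new MultivariateOnlineSummarizer)(
  (summarizer, v: Vector) => summarizer.add(v),   // map phase: per-block accumulation
  (a, b) => a.merge(b)                            // reduce phase: merge block summaries
)

println(s"mean=${summary.mean} variance=${summary.variance} min=${summary.min} max=${summary.max}")
```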
Spark SQL 
• Spark SQL allows relational queries expressed in SQL, HiveQL, or 
Scala to be executed using Spark. At the core of this component is a 
new type of RDD, SchemaRDD. 
• SchemaRDDs are composed of Row objects, along with a schema 
that describes the data types of each column in the row. A 
SchemaRDD is similar to a table in a traditional relational database. 
• A SchemaRDD can be created from an existing RDD, a Parquet file, 
a JSON dataset, or by running HiveQL against data stored in Apache 
Hive. 
http://spark.apache.org/docs/latest/sql-programming-guide.html
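A hedged sketch in the Spark 1.1-era API (path, table name, and query are placeholders):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val people = sqlContext.jsonFile("hdfs:///data/people.json")   // SchemaRDD
people.registerTempTable("people")

val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
```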
Spark SQL + MLlib 
l With SparkSQL, users can easily load the parquet/ 
avro datasets into Spark, and perform the data pre-processing 
before the training steps. 
l MLlib considers to use schemaRDD as a native 
typed data format, like R’s data-frame. This allows 
us to create output model with types and column 
names, and also be easier to create PMML model. 
Learn more about Advanced Analytics at http://www.alpinenow.com
Example: Prepare training data using Spark SQL
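The example code on these slides is not in the transcript. A hedged sketch: query with Spark SQL, then map each Row into a LabeledPoint for MLlib (table and column names are placeholders):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assumes the sqlContext from the previous sketch and a registered "training_table".
val rows = sqlContext.sql("SELECT label, f1, f2, f3 FROM training_table")

val training = rows.map { row =>
  LabeledPoint(
    row.getDouble(0),
    Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3)))
}.cache()
```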
Interested in MLlib?
• MLlib official guide - https://spark.apache.org/docs/latest/mllib-guide.html
• GitHub – https://github.com/apache/spark
• Mailing lists - user@spark.apache.org or dev@spark.apache.org
For more information, contact us
1550 Bryant Street
Suite 1000
San Francisco, CA 94103
USA
+1 (877) 542-0062
www.alpinenow.com
Get Started Today!
http://start.alpinenow.com
