Copyright 2015 CATENATE Group – All rights reserved
H2O - Thirst for Machine Learning
Meetup Machine Learning/Data Science,
Rome, 15 March 2017
Gabriele Nocco, Senior Data Scientist
gabriele.nocco@catenate.com
Catenate s.r.l.
AGENDA
● H2O Introduction
● GBM
● Demo
H2O INTRODUCTION
H2O is an open-source, in-memory Machine Learning engine. It is Java-based and exposes convenient APIs in Java, Scala, Python and R, plus a notebook-like user interface called Flow.
This breadth of languages opens the framework to many different professional roles, from analysts to programmers to more “academic” data scientists. H2O can therefore serve as a complete infrastructure, from the prototype model to the engineered solution.
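As a minimal sketch of the Python API (the R, Scala and Flow interfaces follow the same model); the dataset path and column names below are placeholders, and the cluster is assumed to run locally:

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()                                         # start or attach to a local H2O cluster
frame = h2o.import_file("data/train.csv")          # hypothetical CSV path
frame["label"] = frame["label"].asfactor()         # treat the target as categorical

model = H2OGradientBoostingEstimator(ntrees=100, max_depth=5, seed=42)
model.train(x=[c for c in frame.columns if c != "label"],
            y="label",
            training_frame=frame)

print(model.model_performance(frame))              # training metrics; use a validation frame in practice
h2o.cluster().shutdown()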
H2O INTRODUCTION - GARTNER
In 2017, H2O.ai became a Visionary in the Magic Quadrant for Data Science Platforms:
STRENGTHS
● Market awareness
● Customer satisfaction
● Flexibility and scalability
CAUTIONS
● Data access and preparation
● High technical bar for use
● Visualization and data exploration
● Sales execution
https://www.gartner.com/doc/reprints?id=1-3TKPVG1&ct=170215&st=sb
H2O INTRODUCTION - FEATURES
● H2O Eco-System Benefits:
○ Scalable to massive datasets on large clusters, fully parallelized
○ Low-latency Java (“POJO”) scoring code is auto-generated
○ Easy to deploy on a laptop, a server, a Hadoop cluster, a Spark cluster or HPC
○ APIs include R, Python, Flow, Scala, Java, JavaScript, REST
● Regularization techniques: Dropout, L1/L2
● Early stopping, N-fold cross-validation, grid search (see the sketch after this list)
● Handling of categorical, missing and sparse data
● Gaussian/Laplace/Poisson/Gamma/Tweedie regression with offsets, observation weights and various loss functions
● Unsupervised mode for nonlinear dimensionality reduction and outlier detection
● File types supported: CSV, ORC, SVMLight, ARFF, XLS, XLSX, Avro, Parquet
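A hedged sketch of how grid search, early stopping and the downloadable scoring artifacts fit together in the Python API; the dataset path, column names and hyperparameter values are illustrative assumptions, not taken from the talk:

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
frame = h2o.import_file("data/train.csv")                   # hypothetical path
frame["label"] = frame["label"].asfactor()
train, valid = frame.split_frame(ratios=[0.8], seed=42)

grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(ntrees=500,
                                       stopping_rounds=5,   # early stopping on the validation metric
                                       stopping_metric="AUC",
                                       seed=42),
    hyper_params={"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1]})
grid.train(x=[c for c in frame.columns if c != "label"], y="label",
           training_frame=train, validation_frame=valid)

best = grid.get_grid(sort_by="auc", decreasing=True).models[0]
best.download_mojo(path=".")   # MOJO/POJO artifacts give low-latency scoring outside the cluster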
H2O INTRODUCTION - ALGORITHMS (diagram slide)
H2O INTRODUCTION - ARCHITECTURE (diagram slides)
H2O INTRODUCTION - H2O + TENSORFLOW
H2O can develop Deep Neural Networks natively, or through integration with TensorFlow. It is now possible to train very deep networks (from 5 to 1000 layers!) and to handle huge amounts of data, on the order of gigabytes or terabytes.
Another great advantage is the ability to exploit GPUs to perform the computations.
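As a hedged sketch of the native Deep Learning path (H2O's own H2ODeepLearningEstimator; the Deep Water backends for TensorFlow, MXNet and Caffe use a separate estimator not shown here). The architecture and dataset are illustrative assumptions:

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
frame = h2o.import_file("data/train.csv")                  # hypothetical path
frame["label"] = frame["label"].asfactor()

dl = H2ODeepLearningEstimator(
    hidden=[200, 200, 200],             # three fully connected hidden layers
    epochs=20,
    activation="RectifierWithDropout",  # dropout regularization
    input_dropout_ratio=0.1,
    l1=1e-5)                            # L1 weight penalty
dl.train(x=[c for c in frame.columns if c != "label"], y="label", training_frame=frame)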
H2O INTRODUCTION - H2O + TENSORFLOW
With the release of TensorFlow, H2O has embraced the wave of enthusiasm around the growth of Deep Learning. Thanks to Deep Water, H2O lets us interact in a direct and simple way with Deep Learning tools like TensorFlow, MXNet and Caffe.
H2O INTRODUCTION - H2O + SPARK
One of the first plugins developed for H2O was the one for Apache Spark, named Sparkling Water. Binding to a rising open-source project such as Spark, with the computing power that distributed processing allows, has been a great driving force for the growth of H2O.
H2O INTRODUCTION - H2O + SPARK
A Sparkling Water application runs as a job that can be started with spark-submit. The Spark Master then produces the DAG and distributes the execution across the Workers, each of which loads the H2O libraries into its Java process.
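A hedged sketch of the PySpark side of Sparkling Water; exact class and method names (H2OContext.getOrCreate, asH2OFrame) vary slightly across Sparkling Water versions, and the data used here is purely illustrative:

# Run inside a Spark session launched with the Sparkling Water dependencies
# (e.g. via spark-submit or pyspark with the proper --packages/--jars options).
from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("sparkling-water-demo").getOrCreate()
hc = H2OContext.getOrCreate(spark)      # starts the H2O nodes inside the Spark executors

df = spark.read.csv("data/train.csv", header=True, inferSchema=True)   # hypothetical path
h2o_frame = hc.asH2OFrame(df)           # hand the distributed DataFrame over to H2O

# From here on the regular H2O API applies (e.g. H2OGradientBoostingEstimator).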
H2O INTRODUCTION - H2O + SPARK
The Sparkling Water solution is, of course, certified for all the major Spark distributions: Hortonworks, Cloudera and MapR. Databricks provides a Spark cluster in the cloud, and H2O works perfectly in this environment. H2O Rains with Databricks Cloud!
AGENDA
● H2O Introduction
● GBM
● Demo
GBM
Gradient Boosting Machine
Gradient Boosting Machine is one of the most powerful techniques for building predictive models. It can be applied to classification or regression, so it is a supervised algorithm.
It is one of the most widely used algorithms in the Kaggle community, performing better than SVMs, Decision Trees and Neural Networks in a large number of cases.
https://www.quora.com/Why-does-Gradient-boosting-work-so-well-for-so-many-Kaggle-problems
GBM can be an optimal solution when the size of the dataset or the available computing power does not allow training a Deep Neural Network.
GBM - KAGGLE
Kaggle is the biggest platform for Machine Learning contests in the world.
https://www.kaggle.com/
In early March 2017, Google announced the acquisition of the Kaggle community.
GBM - ORIGIN OF BOOSTING IDEA
A weak learner is an algorithm whose performance is only marginally better than random chance. Boosting was developed in the 1980s as the answer to the following question: “can we combine many weak learners to create a very strong one?”
Boosting revolves around filtering observations: new learners are focused on the samples that previous weak learners found difficult to classify.
Using this idea, we can train a succession of weak learning methods, each one focused on the patterns that were misclassified previously.
GBM - ADABOOST
The First Boosting Algorithm
The first algorithm to gain wide popularity in the boosting family was Adaptive Boosting, or AdaBoost for short. In the original formulation, the weak learners are decision trees with a single split, called decision stumps.
AdaBoost works by weighting the observations and sampling the dataset at each iteration with more emphasis on the instances that are difficult to classify. We then sequentially add stumps, each one trying to classify those hard instances better.
At every step, predictions are made by taking a majority vote of the weak learners’ outputs, weighted by a measure of their individual accuracy.
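A minimal from-scratch sketch of this reweighting idea (discrete AdaBoost for binary labels in {-1, +1}), using scikit-learn decision stumps as weak learners; it illustrates the mechanism and is not meant to replace a library implementation:

# Discrete AdaBoost on labels y in {-1, +1}: reweight hard examples,
# then combine stumps by an accuracy-weighted majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                           # uniform observation weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # a decision stump
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)     # weighted error
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)         # learner weight
        w *= np.exp(-alpha * y * pred)                # boost misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)                             # weighted majority vote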
GBM - GRADIENT BOOSTING
Generalization of AdaBoost as Gradient Boosting
In later years, it was realized that AdaBoost can be derived formally as the minimization of a specific cost function with an exponential loss. This allowed the algorithm to be recast in a statistical framework.
Gradient Boosting Machines, later called just gradient boosting (or gradient tree boosting when trees are used), are the natural generalization of AdaBoost to boosting with any loss function, following a gradient descent procedure:
GBM = Boosting + Gradient descent
This class of algorithms remains stage-wise additive, since new learners are added iteratively while the old ones are kept fixed. The generalization allows arbitrary differentiable loss functions to be used, providing more flexible algorithms that handle regression, multi-class classification and more.
GBM - GRADIENT BOOSTING
How Gradient Boosting Works
Summarizing, a GBM requires us to specify three components:
● The loss function to optimize (whose gradient the new weak learners will fit).
● The specific form of the weak learner (e.g., stumps).
● An additive scheme that combines the weak learners so as to minimize the loss function.
GBM - GRADIENT BOOSTING
Loss Function
The loss function determines the behavior of the algorithm. The only requirement is differentiability, so that gradient descent can be applied to it. Although arbitrary losses can be defined, in practice only a handful are used: for example, regression may use a squared error and classification may use a logarithmic loss.
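For concreteness, the two losses mentioned above and their negative gradients with respect to the model output F(x); in the squared-loss case the negative gradient is simply the residual, which is the key to the Additive Model slide below.

% Squared error (regression): the negative gradient is the residual
L\bigl(y, F(x)\bigr) = \tfrac{1}{2}\bigl(y - F(x)\bigr)^2,
\qquad -\frac{\partial L}{\partial F(x)} = y - F(x)

% Logarithmic (deviance) loss for y \in \{-1, +1\} (classification)
L\bigl(y, F(x)\bigr) = \log\bigl(1 + e^{-y\,F(x)}\bigr),
\qquad -\frac{\partial L}{\partial F(x)} = \frac{y}{1 + e^{\,y\,F(x)}}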
GBM - GRADIENT BOOSTING
Weak Learner
In H2O, the weak learners are implemented as decision trees, making this an instance of decision tree boosting. In order to allow their outputs to be added together, regression trees (which output real values) are used.
When building each decision tree, the algorithm iteratively selects split points in a greedy fashion, based on a measure of “purity” of the data, in order to minimize the loss. The depth of the trees can be increased to obtain more flexible decision boundaries.
Conversely, to limit overfitting we can constrain the topology of the trees by, e.g., limiting the depth, the number of splits, or the number of leaf nodes.
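A small sketch of what constraining a single weak learner looks like with a scikit-learn regression tree; the constraint values are arbitrary examples:

# A deliberately shallow regression tree: keeping the individual learner weak
# (limited depth, few leaves, minimum leaf size) is what limits overfitting.
from sklearn.tree import DecisionTreeRegressor

weak_learner = DecisionTreeRegressor(
    max_depth=3,
    max_leaf_nodes=8,
    min_samples_leaf=20)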
GBM - GRADIENT BOOSTING
Additive Model
Gradient descent is a generic iterative technique to minimize objective functions. At each iteration, the gradient of the loss function (e.g., the error on the training set) is computed and used to choose a set of parameters that decreases its value.
In a GBM, the optimization problem is formulated in terms of functions such as trees (functional optimization), which makes it relatively hard in general. The basic idea is to approximate this gradient using only its values at the training points.
In a GBM with squared loss, the resulting algorithm is extremely simple: at each step we train a new tree on the “residual errors” with respect to the previous weak learners. This can be seen as a gradient descent step with respect to our loss, where all previous weak learners are kept fixed and the gradient is approximated. This generalizes easily to different losses.
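A minimal from-scratch sketch of this residual-fitting loop for the squared loss, with scikit-learn regression trees as weak learners; the shrinkage factor anticipates the “learning rate” improvement described two slides below and is an illustrative choice:

# Least-squares gradient boosting: each new tree is fit to the current
# residuals (the negative gradient of the squared loss), then added to
# the ensemble with a small shrinkage factor.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    f0 = np.mean(y)                      # initial constant prediction
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction       # negative gradient of 1/2 (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gbm_predict(f0, trees, X, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)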
GBM - GRADIENT BOOSTING
Output and Stop Condition
The output of the new tree is then added to the output of the existing sequence of trees, in an effort to correct or improve the final output of the model. In particular, a different weighting parameter is associated with each decision region of the newly constructed tree; this is done by solving a new optimization problem with respect to these weights.
Training stops once a fixed number of trees has been added, or once the loss reaches an acceptable level or no longer improves on an external validation dataset.
GBM - GRADIENT BOOSTING
Improvements to Basic Gradient Boosting
Gradient boosting is a greedy algorithm and can overfit a training dataset quickly. It benefits from regularization methods that penalize various parts of the algorithm and generally improve its performance by reducing overfitting.
There are four main enhancements to basic gradient boosting (mapped to concrete parameters in the sketch after this list):
● Tree constraints
● Learning rate (shrinkage)
● Stochastic gradient boosting (row/column subsampling)
● Penalized learning (L1/L2 regularization of the regression trees’ output)
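A hedged sketch of how the first three enhancements map onto H2O's GBM parameters (the values are illustrative). The fourth, penalized learning on the trees' outputs, is not a standard H2O GBM option; it appears instead in libraries such as XGBoost (reg_alpha/reg_lambda):

# Illustrative H2O GBM configuration: tree constraints, shrinkage and
# row/column subsampling, plus early stopping on a validation frame.
from h2o.estimators.gbm import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    ntrees=500,
    max_depth=5,              # tree constraint: limited depth
    min_rows=10,              # tree constraint: minimum observations per leaf
    learn_rate=0.05,          # learning rate / shrinkage
    sample_rate=0.8,          # stochastic boosting: row subsampling
    col_sample_rate=0.8,      # stochastic boosting: column subsampling
    stopping_rounds=5,        # stop when the validation metric no longer improves
    stopping_metric="AUC",
    seed=42)
# gbm.train(x=features, y=target, training_frame=train, validation_frame=valid)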
AGENDA
● H2O Introduction
● GBM
● Demo
Q&A