Machine Learning - Supervised Learning

GIORGIO ALFREDO SPEDICATO, PHD FCAS FSA CSPA
UNISACT 2018
Machine Learning and Actuarial Science
Supervised learning

Intro
▪There is an explicit target that the algorithm aims to predict
▪The dependent variable could, for example, be:
▪ Continuous: e.g. loss cost, income, …
▪ Integer: e.g. # of claims, # of purchased covers, # of successes in n trials.
▪ Binary: fraud, retention and conversion probability, …
▪ Multinomial: a-priori categories.
▪Classical multivariate regression is the most common algorithm

Linear models, GLM & GAM
GLMs are the first predictive models to be widely used in the insurance
industry, and are currently the gold standard for personal lines pricing:
$g(\mu_i) = x_i^T \beta = f(x_i^T) + \text{offset}$
Possible link functions (g):
• Logit, log(p/(1-p)), to model probabilities;
• Log, log(μ), for Poisson or Gamma regression (# of claims and severity modeling);
• Identity, μ, for linear Gaussian regression.
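As a minimal sketch of the links listed above, the following pure-Python functions map a mean to the linear predictor and back. The function names, the coefficients and the `exposure` argument are illustrative assumptions, not taken from any pricing tool. The offset is assumed to enter as log(exposure) under the log link, which is the usual way to model claim counts per unit of exposure.

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse logit: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(x, beta, offset=0.0):
    """x^T beta + offset for a single observation."""
    return sum(xj * bj for xj, bj in zip(x, beta)) + offset

def poisson_mean(x, beta, exposure=1.0):
    """Expected claim count under a log link; log(exposure) is the offset."""
    return math.exp(linear_predictor(x, beta, offset=math.log(exposure)))

# Hypothetical coefficients: an intercept and one rating factor.
beta = [-2.0, 0.3]
x = [1.0, 1.0]
full_year = poisson_mean(x, beta, exposure=1.0)
half_year = poisson_mean(x, beta, exposure=0.5)
print(full_year, half_year)
```

With a log link the offset acts multiplicatively, so halving the exposure halves the expected number of claims.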

▪$f(x_i^T)$ can be any linear component, including:
▪ Additive terms;
▪ Binned continuous variables;
▪ Splines / Polynomials;
▪GAMs use smooth functions as additive terms.
▪All actuarial pricing software (e.g. Emblem, SAS) implements GLMs and their
extensions (splines, GAM, …)

▪Pros:
▪ GLMs are the market-wide standard (the baseline against which competing models are compared);
▪ Easy to fit;
▪ Interpretability.
▪Cons:
▪ Strong non-linearities are difficult to handle;
▪ Interactions must be defined explicitly;
▪ Collinearity must be addressed as the # of predictors increases.

Elasticnet
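▪GLMs are usually fit by maximizing the log-likelihood: $\max \text{LogLik}$
▪The elastic net approach extends GLMs by optimizing $\max (\text{LogLik} - \text{Penalty})$,
where $\text{Penalty} = \lambda \left( \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right)$, with:
▪ $\|\beta\|_1 = \sum_{k=1}^{p} |\beta_k|$ is the Lasso component (it helps to drop non-significant
predictors);
▪ $\|\beta\|_2^2 = \sum_{k=1}^{p} \beta_k^2$ is the Ridge component (it shrinks coefficients, which helps to handle correlated predictors);
▪ α is the relative weight of Lasso versus Ridge, and λ is the overall weight of the penalty.
▪Elastic net combines the interpretability of GLMs with higher robustness.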
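The penalty above can be sketched in a few lines of Python. This is only an illustration of the formula as written on the slide: the function names are hypothetical, the intercept is assumed to be excluded from `beta`, and libraries may scale the Ridge term differently (glmnet, for example, uses (1 − α)/2).

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * sum|b_k| + (1 - alpha) * sum b_k^2)."""
    lasso = sum(abs(b) for b in beta)   # L1 norm of the coefficients
    ridge = sum(b * b for b in beta)    # squared L2 norm of the coefficients
    return lam * (alpha * lasso + (1.0 - alpha) * ridge)

def penalized_objective(loglik, beta, lam, alpha):
    """Quantity to maximize: log-likelihood minus the penalty."""
    return loglik - elastic_net_penalty(beta, lam, alpha)

beta = [1.0, -2.0]                       # hypothetical coefficients
pure_lasso = elastic_net_penalty(beta, lam=0.5, alpha=1.0)
pure_ridge = elastic_net_penalty(beta, lam=0.5, alpha=0.0)
mixed = elastic_net_penalty(beta, lam=0.5, alpha=0.5)
print(pure_lasso, pure_ridge, mixed)
```

Setting α = 1 gives a pure Lasso penalty and α = 0 a pure Ridge penalty; intermediate values blend the two.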

Elasticnet
Elastic net is implemented in R (glmnet
package), in SAS (PROC GLMSELECT), and in H2O.
Elastic net output has the same form as GLM output.

Classification and Regression Tree
▪A CART creates hierarchical partitions of the input data set as a function of the predictors.
▪At every split, the predictor and the cut-off are chosen to maximize the loss reduction of the outcome prediction.
▪CARTs can handle continuous and categorical predictors, optimizing F-test or $\chi^2$ statistics when defining the splits.
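To make the split search concrete, here is a minimal sketch for one continuous predictor and a continuous outcome. As an assumption for illustration, it scores each candidate cut-off by the sum of squared errors of the two child nodes rather than by an F-test or χ² statistic; `best_split` and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (cut_off, loss) minimizing the total SSE of the two child nodes."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))                      # no split: loss of the parent node
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                           # cannot split between equal values
        cut_off = (pairs[i][0] + pairs[i - 1][0]) / 2.0
        left = [yy for _, yy in pairs[:i]]
        right = [yy for _, yy in pairs[i:]]
        loss = sse(left) + sse(right)
        if loss < best[1]:
            best = (cut_off, loss)
    return best

x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
cut_off, loss = best_split(x, y)
print(cut_off, loss)
```

A full tree applies the same search recursively to each child node, across all predictors.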

Classification and Regression Trees
▪ Software:
▪ SAS STAT: PROC HPSPLIT
▪ R packages: rpart, C50, party, partykit
▪ Python libraries: sklearn (DecisionTreeClassifier)
▪ SPSS: CHAID

Classification and Regression Trees
▪Pros:
▪Easy to explain to a non-technical audience
▪Allow predictor importance and interactions to be understood easily
▪Give insight into variable importance (useful as a first analysis)
▪Cons:
▪Sensitive to outliers («pruning» is needed)
▪Predictions are constant within intervals, so they can perform worse than other
approaches

Random Forest & Bagging
▪ An extension of classification and regression trees that follows the «bagging» approach. It can handle
continuous and categorical outcomes
▪ «Bagging» means:
▪ Creating many independent samples;
▪ Fitting a «simple» model on each of them (a «forest» overall)
▪ The prediction for an observation is the average of the individual trees' predictions
▪ The most important parameters are:
▪ «mtries»: number of predictors (out of p) sampled as split candidates; heuristically √p for
classification problems, p/3 for regression ones
▪ Max depth, min rows: maximum depth of a single tree and minimum number of observations per leaf
▪ Ntrees: number of trees (independent samples) to fit (usually >= 50)
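The bagging steps above can be sketched as follows. This is an illustrative simplification, not a full random forest: the «simple» model is assumed to be a one-split regression tree (a stump) on a single predictor, the independent samples are assumed to be bootstrap samples, and no predictor subsampling is done since there is only one predictor. All function names are hypothetical.

```python
import random

def fit_stump(x, y):
    """Fit a one-split regression tree; returns (cut_off, left_mean, right_mean)."""
    pairs = sorted(zip(x, y))
    overall = sum(y) / len(y)
    best = (float("inf"), None, overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        loss = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if loss < best[0]:
            best = (loss, (pairs[i][0] + pairs[i - 1][0]) / 2.0, ml, mr)
    return best[1], best[2], best[3]

def predict_stump(stump, xi):
    cut_off, ml, mr = stump
    if cut_off is None:          # degenerate sample: no split was possible
        return ml
    return ml if xi <= cut_off else mr

def fit_bagged(x, y, ntrees=50, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    forest = []
    for _ in range(ntrees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict_bagged(forest, xi):
    """Average the predictions of the individual trees."""
    return sum(predict_stump(s, xi) for s in forest) / len(forest)

x = [float(i) for i in range(20)]
y = [0.0 if xi < 10 else 1.0 for xi in x]
forest = fit_bagged(x, y, ntrees=50, seed=0)
print(predict_bagged(forest, 2.0), predict_bagged(forest, 17.0))
```

Each stump places its cut-off slightly differently depending on its bootstrap sample, so the averaged prediction is smoother near the step than any single tree.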

Random Forest – Grid search
▪As with most ML models, no closed form is available to define the optimal parameters
▪Grid search is needed: various parameter configurations are tried to find the one that
maximizes predictive performance.
▪Possible approaches are:
▪ Cartesian grid: all possible combinations of parameters (subject to «curse of dimensionality» issues);
▪ Random grid search: a sample of the Cartesian grid;
▪ Bayesian optimization / Genetic algorithms: an initial grid search is performed, then
parameters are changed following a direction that tends to increase predictive
performance.
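The first two approaches can be sketched with the Python standard library alone. The search space and `toy_score` below are hypothetical placeholders: in practice the score would be a cross-validated performance measure of a model fitted with each configuration.

```python
import itertools
import random

# Hypothetical search space; the names mirror the random forest parameters.
grid = {"ntrees": [50, 100, 200], "max_depth": [3, 5, 10], "mtries": [2, 4]}

def cartesian_grid(space):
    """All possible combinations of the parameter values."""
    keys = list(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]

def random_grid(space, n_configs, seed=0):
    """A random sample (without replacement) of the Cartesian grid."""
    rng = random.Random(seed)
    return rng.sample(cartesian_grid(space), n_configs)

def grid_search(configs, score):
    """Return the configuration with the highest score."""
    return max(configs, key=score)

def toy_score(cfg):
    # Stand-in for predictive performance; peaks at one known configuration.
    return (-abs(cfg["max_depth"] - 5)
            - abs(cfg["ntrees"] - 100) / 100.0
            - abs(cfg["mtries"] - 4))

all_configs = cartesian_grid(grid)
sampled = random_grid(grid, 5)
best = grid_search(all_configs, toy_score)
print(len(all_configs), best)
```

The Cartesian grid grows multiplicatively with each added parameter (here 3 × 3 × 2 = 18 configurations), which is why a random sample is often preferred for larger spaces.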

Random Forest
▪Pros:
▪ Generally, it offers good fits
▪ Little sensitivity to outliers
▪ Easily scalable
▪Cons:
▪ Opaque (just variable importance is available)
▪ Difficult to use to fit rates (offset)

Boosted Models
▪Strong predictive performance, frequently used in Kaggle
competitions.
▪They extend CART, with enhancements to avoid overfitting and to increase
predictive performance;
▪They can be used for:
▪ Classification problems;
▪ Regression. A «base margin» allows initial estimates to be used (e.g. to boost an existing model) or
offsets to be handled
▪ Ranking and Survival modeling

Boosted Models
▪The boosting algorithm is a recursive error-correction model: $F_t(x) = F_{t-1}(x) + \eta \cdot h_t(x)$,
where:
▪ $h_t(x)$ is a simple model that predicts the prediction errors of step t-1
▪ $\eta$ is a shrinkage factor that increases model robustness.
▪The gradient descent algorithm is used to minimize a chosen loss function.
▪The number of iterations is determined by either:
▪ A fixed #;
▪ A moving average approach.
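The recursion above can be sketched for squared-error loss, where the negative gradient is simply the current residual. This is a minimal illustration under stated assumptions: the simple model is a one-split regression tree on a single predictor, the number of iterations is fixed (no moving-average stopping rule), and the function names and data are hypothetical.

```python
def fit_stump(x, r):
    """Fit a one-split regression tree to targets r; returns (cut_off, left_mean, right_mean)."""
    pairs = sorted(zip(x, r))
    overall = sum(r) / len(r)
    best = (float("inf"), None, overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        loss = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if loss < best[0]:
            best = (loss, (pairs[i][0] + pairs[i - 1][0]) / 2.0, ml, mr)
    return best[1], best[2], best[3]

def predict_stump(stump, xi):
    cut_off, ml, mr = stump
    if cut_off is None:
        return ml
    return ml if xi <= cut_off else mr

def boost(x, y, eta=0.1, n_iter=100):
    """F_t(x) = F_{t-1}(x) + eta * h_t(x), with h_t fitted to the residuals of F_{t-1}."""
    f0 = sum(y) / len(y)                      # F_0: constant prediction
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_iter):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)       # h_t predicts the errors of step t-1
        stumps.append(stump)
        pred = [pi + eta * predict_stump(stump, xi) for pi, xi in zip(pred, x)]
    return f0, eta, stumps

def predict(model, xi):
    f0, eta, stumps = model
    return f0 + eta * sum(predict_stump(s, xi) for s in stumps)

def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = boost(x, y, eta=0.1, n_iter=100)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
final_mse = mse(y, [predict(model, xi) for xi in x])
print(baseline_mse, final_mse)
```

A smaller η makes each correction more cautious, so more iterations are needed to reach the same training error.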

Boosted Models
Typical boosted models parameters are:
• η (shrinkage) and n (number of iterations / sub-trees);
• Fraction of observations sampled at each iteration;
• Fraction of predictors available at each step;
• Max depth of each tree;
• Other regularization parameters
A grid search approach is needed to tune the hyperparameter configuration.

Boosted Models
▪GBM is the first widely used gradient boosting algorithm;
▪XGBoost is currently the gold standard of boosted trees. It extends GBM thanks to:
▪ Parallelization;
▪ Regularization;
▪ Checkpointing: a new model starts from the results of a previous one.
▪LightGBM is a promising, very recent evolution of XGBoost from Microsoft Research.

Boosted Models
▪R (libraries gbm, xgboost, lightgbm) and Python (libraries scikit-learn, xgboost) are the core
tools to fit boosted models.
▪Boosted models are also available in SAS (Enterprise Miner) and Matlab (Statistics and Machine Learning
Toolbox)
▪The H2O suite implements both GBM and XGBoost, allowing easy parallelization (also across
computing clusters) and a GPU extension.

Stacked Ensemble
▪Different algorithms are combined by a «superlearner» to obtain an even more robust prediction;
▪The algorithm is:
▪ Estimate L separate algorithms on the N observations using k-fold cross-validation;
▪ Combine the L predictions with a «superlearner», fit by finding the L best weights
▪ The final prediction is the weighted average of the L models' individual predictions
▪Pros & Cons:
▪ Pros: generally increases predictive performance by combining the strengths of the L models;
▪ Cons: higher computing time, lower interpretability
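The weighting step can be illustrated with a deliberately simple superlearner for L = 2 models. The out-of-fold predictions `p1` and `p2` are made-up numbers, and the weight search is a plain grid over [0, 1] chosen for readability; it is an assumption for this sketch, not how any specific library fits its metalearner.

```python
def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

def blend(preds, weights):
    """Weighted average of the individual models' predictions."""
    n = len(preds[0])
    return [sum(w * p[i] for w, p in zip(weights, preds)) for i in range(n)]

def fit_weights_two_models(y, p1, p2, steps=100):
    """Find w in [0, 1] minimizing the MSE of w * p1 + (1 - w) * p2."""
    best_w, best_loss = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        loss = mse(y, blend([p1, p2], [w, 1.0 - w]))
        if loss < best_loss:
            best_w, best_loss = w, loss
    return [best_w, 1.0 - best_w]

# Made-up out-of-fold predictions from two hypothetical base models.
y = [1.0, 2.0, 3.0, 4.0]
p1 = [1.5, 1.5, 3.5, 3.5]
p2 = [0.5, 2.5, 2.5, 4.5]

weights = fit_weights_two_models(y, p1, p2)
stacked = blend([p1, p2], weights)
print(weights, mse(y, p1), mse(y, p2), mse(y, stacked))
```

In this toy case the two models err in opposite directions, so the blend outperforms both. Fitting the weights on out-of-fold predictions matters because in-sample predictions would favor the most overfit base model.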

Stacked Ensemble
▪ H2O (stackedEnsemble)
▪ Easy to generalize

Deep Learning
▪Multi-layer neural networks
▪Very effective for:
▪Image recognition
▪Natural Language Processing
▪Multivariate time series analysis
▪Unsupervised learning
▪Requirements:
▪Huge data
▪Computing power (GPU computing is often necessary)

Deep Learning
▪A deep neural network consists of several layers of neurons, each of which:
▪ Retrieves inputs from the previous layer
▪ Properly weights the inputs
▪ Returns an output $\varphi_i\left(\sum_{j=1}^{q} w_j x_j + b_i\right)$
▪The increase in computing power and the introduction of methodologies (e.g.
Dropout) that reduce overfitting contributed to the renewed attention to
neural networks, which are currently the state of the art in ML and Artificial
Intelligence.
▪The most relevant drawbacks are:
▪ Lack of interpretability;
▪ Difficulty in defining the best architecture configuration
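The neuron output formula above can be sketched as a forward pass through one hidden layer. The weights, biases and activation choices below are arbitrary illustrative values, not a trained network, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def neuron(x, w, b, phi):
    """Output of one neuron: phi(sum_j w_j * x_j + b)."""
    return phi(sum(wj * xj for wj, xj in zip(w, x)) + b)

def dense_layer(x, weights, biases, phi):
    """A layer is a list of neurons that all read the same inputs."""
    return [neuron(x, w, b, phi) for w, b in zip(weights, biases)]

x = [1.0, 2.0]                                  # inputs from the previous layer
hidden = dense_layer(x, [[0.5, -0.25], [1.0, 1.0]], [0.0, -1.0], relu)
output = neuron(hidden, [1.0, 0.5], -1.0, sigmoid)
print(hidden, output)
```

Training consists of adjusting the weights and biases to minimize a loss function; only the forward computation is shown here.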

Deep Learning: types of neural networks
▪Multi-layer perceptron (MLP): it consists of an
input layer, an output layer, and one or more hidden
layers. Used for regression and classification;
▪Convolutional neural networks (CNN):
convolution layers extract spatial
features. Useful for image recognition and
natural language processing;
▪Recurrent neural networks (RNN): they capture memory
effects, which can be useful in
sequence analysis (translation, NLP, time series
analysis).

word2vec
Natural Language Processing applications.
A word's occurrence depends on the frequency of its neighbors.
A weight vector is associated with each word (of dimension n ∈ [150, 300]).
Each word belongs to an $\mathbb{R}^n$ space, which means that:
1. Semantic similarity between words can be computed
2. Word algebra can be performed: “king” – “man” + “woman” gives a word vector close to
that of “queen”.
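Both properties can be illustrated with toy vectors. The two-dimensional vectors below are hand-made for the example (roughly «royalty» and «gender» axes); they are not the output of a trained word2vec model, which would have a few hundred dimensions. The function names are hypothetical.

```python
import math

# Hand-made toy embeddings, for illustration only.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [0.1, 0.0],
}

def cosine(u, v):
    """Cosine similarity: 1 for the same direction, 0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vocab):
    """Word whose vector is closest to a - b + c, excluding the query words."""
    target = [p - q + r for p, q, r in zip(vocab[a], vocab[b], vocab[c])]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman", vectors)
print(result, cosine(vectors["king"], vectors["man"]))
```

Excluding the query words from the candidates is a common convention, since the nearest vector to the result is otherwise often one of the inputs.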
