Machine Learning From Raw Data To The Predictions

Luca Zavarella

#SqlSat675 – 18/11/2017
Survey: http://bit.do/ml-survey
Organizers: GetLatestVersion.it

Who Am I?
Data Science Microsoft Professional Program
Microsoft SQL Server BI MCTS & MCITP
Working with SQL Server since 2007
BI & Advanced Analytics Technical Director @
Email: lzavarella@solidq.com
Twitter: @lucazav
LinkedIn: http://it.linkedin.com/in/lucazavarella
Survey: http://bit.do/ml-survey

Agenda
• What’s Azure Machine Learning
• How to Prepare Your Data
• Missing Values
• Outliers
• Feature Engineering
• Feature Selection
• ML Algorithms
• Demo

The Azure Machine Learning Offer
Azure ML Services
• Experimentation Service
• Model Management Service
• Workbench
Azure ML Studio
• Deploy on Web Service
DSVM (Data Science Virtual Machine)

The Azure ML Studio Environment
Stages: Business problem → Modeling → Deployment → Business value
Data
• Cloud storage: Azure Storage, Azure Table, Hive, etc.
• Local storage: upload data from PC…
Azure Machine Learning
• ML Studio (Web IDE)
• Workspace: Experiments, Datasets, Trained models, Notebooks, Access settings
• ML Web Services (REST API Services)
• Azure ML Gallery (community)
• Azure Marketplace (applications store)
Consumers
• Business Apps
• Excel
• API
Diagram labels: Data, Model, Manage, API

DEMO 1
The Azure ML Studio Environment

Why Do We Need to Prepare Data?
• Feed ML algorithms the right data for the problem to solve
• It has to be on a useful scale
• It has to be in the right format
• Meaningful features have to be included

Useful Steps for Data Preparation
• Here are some steps to follow:
• Check missing values
• Check outliers
• Do some Feature Engineering
• Do some Feature Selection

Missing Values - Ask Yourself Some Questions
• What data is missing?
  • Which variables are affected, and how much is missing?
• Do we know why it is missing?
  • Is it missing because it should be missing, or are the cases incomplete?
• What would be the implications for modeling?
  • Do we have enough samples to model without it?
  • Do we have bias due to missing values?

Missing Values - Three Options, No Silver Bullets
• Do nothing!
  • Data is truly missing and should not exist
    • The lack of information can become a new feature (true/false)
  • Data that should be there is missing at random
• Weight the data
  • Create a pseudo-population of weighted copies of the complete cases to remove the selection bias introduced by the missing data
• Impute the missing values
  • Impute the mean, median or most common value
    • Relationships with other variables are lost
  • Imputation through predictive models
    • Impute several values for each missing case
    • Return several imputed data sets
    • Allows you to run multiple models and pool them together at the end
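
Two of the options above (turning the lack of information into a true/false feature, and imputing the median) can be sketched in plain Python. This is only a minimal illustration: the function name `impute_median_with_flag`, the `ages` sample and the use of `None` as the missing-value marker are assumptions made for the example, not something taken from the slides.

```python
from statistics import median


def impute_median_with_flag(values):
    """Return (imputed_values, missing_flags) for a list that uses None for missing."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    # The flag keeps the information that the value was originally missing.
    flags = [v is None for v in values]
    imputed = [fill if v is None else v for v in values]
    return imputed, flags


# Hypothetical sample column with two missing entries.
ages = [23, None, 35, 41, None, 29]
imputed_ages, age_was_missing = impute_median_with_flag(ages)
print(imputed_ages)
print(age_was_missing)
```

As the slide warns, a single fill value ignores the relationships with the other variables, which is why the flag column is kept next to the imputed one.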

Outliers – What They Are
• An outlier is an observation that lies an abnormal distance from other values in a random sample from a population
  • The “abnormal” distance has no fixed measure! It depends on the data
  • The analyst decides what is considered abnormal
• Extreme values sometimes have a big effect on statistical operations
  • That effect is not necessarily a good effect ☺

Outliers – Why They Are There
• Data entry errors
• Measurement error
  • There are 10 weighing machines: 9 of them are correct, 1 is faulty
• Experimental error
  • In a 100m sprint with 7 runners, one runner was not concentrating on the ‘Go’ call and started late
• Intentional outlier
  • Teens typically under-report the amount of alcohol they consume
• Data processing error
  • Manipulation or extraction errors may lead to outliers in the dataset
• Sampling error
  • We have to measure the height of athletes and, by mistake, we include a few basketball players in the sample
• Natural outlier
  • Malls sell more products in the Christmas period

Outliers – Extreme and Mild Ones
The Boxplot is your friend!
It only works for univariate analysis, and for bivariate analysis with a categorical variable… ☺
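
The mild/extreme split a boxplot draws can be sketched with the usual Tukey fences. The 1.5 × IQR and 3 × IQR multipliers are the common convention and are assumed here, since the slide does not state them; the function name `tukey_outliers` and the sample data are also hypothetical.

```python
from statistics import quantiles


def tukey_outliers(values):
    """Split values into (mild, extreme) outliers using IQR-based fences."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # mild fences
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # extreme fences
    mild, extreme = [], []
    for v in values:
        if v < outer_low or v > outer_high:
            extreme.append(v)
        elif v < inner_low or v > inner_high:
            mild.append(v)
    return mild, extreme


# Hypothetical univariate sample.
sample = [8, 9, 10, 10, 11, 12, 12, 13, 14, 22, 40]
mild, extreme = tukey_outliers(sample)
print("mild:", mild, "extreme:", extreme)
```

Different quartile definitions move the fences slightly, so borderline points may be classified differently by other tools.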

Multivariate Outliers – 2D
• Cases with an unusual combination of values on different variables
• Finding outliers in 2D is trickier than in univariate analysis

Multivariate Outliers – More Than 2D
Outliers in more than 2 dimensions can be well hidden!
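
One common way to expose an "unusual combination of values" is the Mahalanobis distance, which measures how far a point is from the mean while taking the correlation between variables into account. The slides do not name a specific technique, so this is an assumed illustration; the function name `mahalanobis_2d` and the data are hypothetical, and the sketch is limited to two variables to keep the matrix inverse explicit.

```python
from math import sqrt
from statistics import mean


def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Entries of the sample covariance matrix (divide by n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, written out for the 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(sqrt(max(d2, 0.0)))
    return distances


# Eight points close to the line y = x, plus one point (2, 7.0) whose
# coordinates are ordinary one at a time but unusual in combination.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.2), (8, 7.8), (2, 7.0)]
distances = mahalanobis_2d(points)
most_unusual = distances.index(max(distances))
print(points[most_unusual], round(distances[most_unusual], 2))
```

The last point sits well inside the range of both variables, so a per-variable boxplot would not flag it, yet it gets the largest distance by a clear margin.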

Outliers – Remove Them or What?
• Generally, it’s a bad idea to remove them
  • It decreases the ability of the model to generalize if the cause of this type of variance is unknown (it may not be just noise)
• The effect of outliers can often be mitigated by appropriate math transformations (log! ☺)
• Remove them only if they are truly “abnormal”
• Sometimes you can substitute them with NA and then impute them, or replace them with the mean/median
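
A quick way to see why a log transformation mitigates outliers is to compare how far the largest value sits from the median before and after the transformation. This is a small sketch on made-up numbers; `max_to_median_ratio` and the `incomes` sample are hypothetical, and `log1p` (log of 1 + x) is chosen only because it also tolerates zeros.

```python
from math import log1p
from statistics import median


def max_to_median_ratio(values):
    """How many times larger the maximum is than the median."""
    return max(values) / median(values)


# Hypothetical skewed sample with one very large value.
incomes = [20, 25, 30, 35, 40, 1000]
log_incomes = [log1p(v) for v in incomes]

raw_ratio = max_to_median_ratio(incomes)
log_ratio = max_to_median_ratio(log_incomes)
print(round(raw_ratio, 2), round(log_ratio, 2))
```

The extreme value is still the largest one after the transformation, but it no longer dominates the scale.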

What is Feature Engineering
• It’s the process of using domain knowledge of the data to create features that make ML algorithms work better
  • E.g. Annual Debt Ratio = Monthly Debt * 12 / Annual Income
• FE is sometimes necessary in order to turn raw input data into things the ML algorithm can understand
• It’s often an art, and it’s the most creative phase of Data Processing
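
The Annual Debt Ratio example above translates directly into code. The function name and the sample figures are hypothetical, and returning `None` for a non-positive income is an assumption about how such cases might be handled.

```python
def annual_debt_ratio(monthly_debt, annual_income):
    """Annual Debt Ratio = Monthly Debt * 12 / Annual Income."""
    if annual_income <= 0:
        # Assumption: the ratio is undefined here, so treat it as a missing value.
        return None
    return monthly_debt * 12 / annual_income


ratio = annual_debt_ratio(monthly_debt=500, annual_income=30000)
print(ratio)
```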

Examples of Simple Feature Engineering
One-hot encoding (or dummy variables)

Before:
ItemID  ItemColor
1       Yellow
2       Green
3       Unknown
4       Green
5       Yellow

After:
ItemID  HasColor  IsYellow  IsGreen
1       1         1         0
2       1         0         1
3       0         0         0
4       1         0         1
5       1         1         0

Binning / Bucketing
1 – 10 → Low
11 – 25 → Medium
26 – 80 → High

Before:
ItemID  Price
1       10
2       12
3       5
4       23
5       76

After:
ItemID  PriceRange
1       Low
2       Medium
3       Low
4       Medium
5       High
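
Both transformations on this slide can be reproduced with a few lines of plain Python. The column names and bin limits come from the slide; the function names are hypothetical, and rejecting prices outside 1 to 80 is an assumption, since the slide does not say what happens there.

```python
def encode_color(color):
    """One-hot style encoding of ItemColor, with 'Unknown' meaning no color."""
    return {
        "HasColor": int(color != "Unknown"),
        "IsYellow": int(color == "Yellow"),
        "IsGreen": int(color == "Green"),
    }


def price_range(price):
    """Bin a price: 1-10 Low, 11-25 Medium, 26-80 High."""
    if not 1 <= price <= 80:
        # Assumption: the slide defines no bin outside this range.
        raise ValueError(f"price {price} is outside the defined bins")
    if price <= 10:
        return "Low"
    if price <= 25:
        return "Medium"
    return "High"


colors = ["Yellow", "Green", "Unknown", "Green", "Yellow"]
prices = [10, 12, 5, 23, 76]

encoded = [encode_color(c) for c in colors]
ranges = [price_range(p) for p in prices]
print(encoded[2])
print(ranges)
```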

An Example of “Exotic” Feature Engineering
• Cyclical features need to be treated carefully
• Hours: the model can’t understand that 24h == 0h
• In a linear model the “distance” between hour 0 and hour 23, |t0 – t23| = 23, looks “huge”
• Use the clock hand’s projections to convert hours to smooth variables:
  $x = \cos\left(\frac{\pi}{2} - \frac{2\pi h}{24}\right) = \sin\left(\frac{2\pi h}{24}\right)$ and $y = \cos\left(\frac{2\pi h}{24}\right)$
ItemID  Hours
1       22
2       23
3       0
4       1
5       2

[Plot: Y_Hours against X_Hours]

ItemID  X_Hours  Y_Hours
1       -0.5000  0.8660
2       -0.2588  0.9659
3       0.0000   1.0000
4       0.2588   0.9659
5       0.5000   0.8660
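
The X_Hours and Y_Hours columns follow from the sine and cosine of the clock-hand angle, and a short sketch can reproduce them. The function names `encode_hour` and `hour_distance` are hypothetical; the formula is the one on the slide.

```python
from math import cos, hypot, pi, sin


def encode_hour(hour):
    """Project the hour hand of a 24h clock: x = sin(2*pi*h/24), y = cos(2*pi*h/24)."""
    angle = 2 * pi * hour / 24
    return sin(angle), cos(angle)


def hour_distance(h1, h2):
    """Euclidean distance between two hours in the encoded (x, y) space."""
    x1, y1 = encode_hour(h1)
    x2, y2 = encode_hour(h2)
    return hypot(x1 - x2, y1 - y2)


for hour in [22, 23, 0, 1, 2]:
    x, y = encode_hour(hour)
    print(f"{hour:>2}  {x:7.4f}  {y:7.4f}")

# In the encoded space 23h is as close to 0h as 0h is to 1h.
print(hour_distance(23, 0), hour_distance(0, 1))
```

Both coordinates are needed: with the sine alone, 0h and 12h would map to the same value.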

DEMO 2
Data Preparation on Azure ML Studio
Data Exploration with IDEAR

What is Feature Selection
• It’s the process of selecting a subset of relevant features for use in model construction
  • FS includes or excludes attributes present in the data without changing them
• Dimensionality reduction is a different thing
  • It creates new combinations of attributes that maximize total variance
  • Methods: Principal Component Analysis, Singular Value Decomposition, Sammon’s Mapping
• Feature selection helps to:
  • Reduce the noise due to irrelevant features
  • Avoid overfitting the training set
  • Make the model generalize better
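
One simple family of feature selection techniques is the filter method: score each feature against the target and keep the best ones, leaving the kept columns unchanged. The sketch below ranks features by absolute Pearson correlation. The slides do not prescribe a specific method, so this choice, the function names and the tiny housing-style dataset are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def select_top_k(features, target, k):
    """Keep the names of the k features most correlated (in absolute value) with the target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical data: two informative features and one mostly noisy one.
features = {
    "size": [50, 60, 80, 100, 120],
    "noise": [3, 1, 4, 1, 5],
    "age": [30, 25, 20, 10, 5],
}
price = [150, 180, 240, 310, 355]

selected = select_top_k(features, price, k=2)
print(selected)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a feature that matters only in combination with others could be dropped by this kind of filter.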

ML Algorithms by Learning Style
Broadly split into three main categories:
• Supervised Learning
  • Predicting the future
  • Learn from known past examples
  • Labels provided
• Unsupervised Learning
  • Making sense of data
  • Understanding the past
  • Learning the structure of data
  • Labels not provided
• Reinforcement Learning
  • Machine trained to make decisions using trial and error
  • Maximize a cumulative reward
  • Exploration vs exploitation trade-off

Common Supervised Learning Algorithms 1
• Regression
  • Modeling and predicting continuous, numeric variables
• Linear Regression
  • Straightforward to understand
  • Not flexible enough to capture more complex patterns
• Regression Tree
  • Can learn non-linear relationships
  • Fairly robust to outliers (using ensembles)
  • Ensembles perform very well
    • They combine predictions from many individual trees
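
Both Linear Regression bullets can be seen in a few lines: an ordinary least squares fit of a straight line recovers a linear relationship exactly, but leaves systematic errors on a curved one. This is a minimal single-feature sketch with hypothetical names and data, not the Azure ML Studio module.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return slope, intercept


xs = [1, 2, 3, 4, 5]

# Linear data (y = 2x + 1): the line fits perfectly.
slope, intercept = fit_line(xs, [3, 5, 7, 9, 11])
print(slope, intercept)

# Quadratic data (y = x^2): the best straight line still misses every point.
quadratic = [x ** 2 for x in xs]
q_slope, q_intercept = fit_line(xs, quadratic)
residuals = [y - (q_slope * x + q_intercept) for x, y in zip(xs, quadratic)]
print(residuals)
```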

Common Supervised Learning Algorithms 2
• Classification
  • Modeling and predicting categorical variables
• Logistic Regression
  • Yes, it says “Regression”, but it is not for predicting continuous variables
  • Predictions ∈ [0, 1] → class probabilities
  • Not flexible enough to capture complex relationships
• Classification (or Decision) Tree
  • Robust to outliers and scalable
  • Naturally models non-linear decision boundaries
  • Individual trees are prone to overfitting; ensembles much less so
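
The reason logistic regression outputs can be read as class probabilities is the sigmoid function, which squeezes a linear score into the interval between 0 and 1. The sketch below only shows the prediction step; the weights and bias are hypothetical numbers standing in for a trained model, and the 0.5 threshold is an assumed default.

```python
from math import exp


def sigmoid(z):
    """Map any real score to the open interval (0, 1), avoiding overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical coefficients, as if they came from a trained model.
weights, bias = [0.8, -0.5], -1.0
probability = predict_proba([3.0, 1.0], weights, bias)
predicted_class = int(probability >= 0.5)  # assumed decision threshold
print(round(probability, 4), predicted_class)
```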

Common Unsupervised Learning Algorithms
• Clustering
  • Find natural groupings of observations based on the inherent structure within your dataset
  • K-Means
    • Makes clusters based on geometric distance
    • Points are grouped around centroids in a globular form
    • Simple and flexible
    • The number of clusters has to be specified
• Anomaly Detection
  • Identify unusual patterns that do not conform to expected behavior
  • Find outliers! ☺
  • Used when there is a lot of "normal" data and not many cases of anomalies
  • One-class Support Vector Machine (SVM)
    • Points in space, mapped so that separate categories are divided by a clear gap that is as wide as possible
  • PCA-Based
    • For each instance, first compute its projection on the eigenvectors, then compute the normalized reconstruction error
    • The higher the error, the more anomalous the instance is
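
The K-Means points above (geometric distance, centroids, a number of clusters chosen up front) map onto the classic assign-then-update loop. This is a bare-bones sketch: taking the first k points as starting centroids is an assumption made to keep the example deterministic, and real implementations use smarter initialization.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Return (labels, centroids) after alternating assignment and centroid update."""
    centroids = [tuple(p) for p in points[:k]]  # assumed, naive initialization
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Two hypothetical, well-separated groups of 2D points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because every point is pulled toward the nearest centroid, the resulting clusters tend to be globular, which is why K-Means struggles with elongated or nested shapes.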

No Free Lunch!
• The “No Free Lunch” theorem in ML
• No one algorithm works best for every problem
• Many factors at play, e.g. size and structure of dataset
• Try many different algorithms for your problem!

DEMO 3
Price Elasticity in Azure ML

Lessons Learned
• It’s easy to create a predictive model with Azure ML Studio
• But you have to know how to model the data before putting it in the pot, to avoid bad predictions!

Links
• What is Azure ML (https://is.gd/zUaGXs)
• Price Elasticity Experiment (https://is.gd/nq8jJs)
• Elasticity in Economics (https://is.gd/652EGX)
• TDSP Utility IDEAR (https://is.gd/5KKdPM)
• How to Better Evaluate the Goodness-of-Fit of Regressions (https://is.gd/2z30H4)

#SqlSat675
Thank you! ☺
