Machine Learning From Raw
Data To The Predictions
Luca Zavarella
#SqlSat675 – 18/11/2017 http://bit.do/ml-survey
Who Am I?
Data Science Microsoft Professional Program
Microsoft SQL Server BI MCTS & MCITP
Working with SQL Server since 2007
BI & Advanced Analytics Technical Director @
Email: lzavarella@solidq.com
Twitter: @lucazav
LinkedIn: http://it.linkedin.com/in/lucazavarella
Survey: http://bit.do/ml-survey
Agenda
• What’s Azure Machine Learning
• How to Prepare Your Data
• Missing Values
• Outliers
• Feature Engineering
• Feature Selection
• ML Algorithms
• Demo
What’s Azure ML
The Azure Machine Learning Offer
• Azure ML Services
  • Experimentation Service
  • Model Management Service
  • Workbench
• Azure ML Studio
  • Deploy on Web Service
• DSVM
The Azure ML Studio Environment
• Data
  • Cloud storage: Azure Storage, Azure Table, Hive, etc.
  • Local storage: upload data from PC…
• Azure Machine Learning
  • ML Studio (Web IDE)
  • Workspace: Experiments, Datasets, Trained models, Notebooks, Access settings
  • ML Web Services (REST API Services)
  • Azure ML Gallery (community)
  • Azure Marketplace (Applications store)
• Consumers (via API)
  • Business Apps
  • Excel
• Diagram labels: Business problem, Modeling, Business value, Deployment; Data, Model, Manage, API
DEMO 1
The Azure ML Studio Environment
Data Preparation
Why Do We Need to Prepare Data?
• Feed ML algorithms the right data for the problem to solve
• It has to be on a useful scale
• It has to be in the right format
• Meaningful features have to be included
Useful Steps for Data Preparation
• Here are some steps to follow:
• Check missing values
• Check outliers
• Do some Feature Engineering
• Do some Feature Selection
Missing Values
Missing Values - Ask Yourself Some Questions
• What data is missing?
• Which variables are affected, and how much is missing?
• Do we know why it is missing?
• Is it missing because it should be missing, or are the cases incomplete?
• What would the implications be for modeling?
• Do we have enough samples to model without it?
• Do we have bias due to missing values?
Missing Values - Three Options, No Silver Bullets
• Do nothing!
  • Data is truly missing and should not exist
  • The lack of information can become a new feature (true/false)
  • Data is absent, at random, when it should be there
• Weight the data
  • Create a pseudo-population of weighted copies of the complete cases to remove the selection bias introduced by the missing data
• Impute missing values
  • Impute the mean, median or most common value
    • Relationships with other variables are lost
  • Imputation through predictive models
    • Impute several values for each missing case
    • Return several imputed data sets
    • Allows you to run multiple models and pool them together at the end
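Two of the options above, turning the lack of information into a true/false feature and imputing the mean, median or most common value, can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the `ages` column and the function names are hypothetical and are not taken from the demo.

```python
from statistics import mean, median, mode

def missing_indicator(values):
    """Return 1 where a value is missing (None), else 0: a new true/false feature."""
    return [1 if v is None else 0 for v in values]

def impute(values, strategy="median"):
    """Replace None with the mean, median or most common observed value."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "most_common": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [23, None, 31, 40, None, 31]      # hypothetical column with gaps
age_missing = missing_indicator(ages)    # [0, 1, 0, 0, 1, 0]
ages_median = impute(ages, "median")     # gaps filled with 31.0
ages_mean = impute(ages, "mean")         # gaps filled with 31.25
```

Every gap receives the same value, so, as the slide notes, any relationship with other variables is lost.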
Outliers
Outliers – What They Are
• An outlier is an observation that lies an abnormal distance from other values in a random sample from a population
  • The “abnormal” distance has no fixed measure! It depends on the data
  • The analyst decides what is considered abnormal
• Extreme values sometimes have a big effect on statistical operations
  • That effect is not necessarily a good one ☺
Outliers – Why They Are There
• Data Entry Errors
• Measurement Error: there are 10 weighing machines; 9 of them are correct, 1 is faulty
• Experimental Error: in a 100m sprint of 7 runners, one runner missed the ‘Go’ call and started late
• Intentional Outlier: teens typically under-report the amount of alcohol they consume
• Data Processing Error: manipulation or extraction errors may lead to outliers in the dataset
• Sampling Error: we have to measure the height of athletes and, by mistake, we include a few basketball players in the sample
• Natural Outlier: malls sell more products in the Christmas period
Outliers – Extreme and Mild Ones
The boxplot is your friend!
It works only for univariate analysis and for bivariate analysis with a categorical variable… ☺
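The mild/extreme distinction drawn by a boxplot can be reproduced numerically. The sketch below assumes the usual Tukey convention, which the slide does not spell out: points more than 1.5 × IQR outside the quartiles are "mild" outliers, and points more than 3 × IQR outside are "extreme". The `prices` data and the function name are hypothetical.

```python
from statistics import quantiles

def classify_outliers(values):
    """Split values into mild and extreme outliers using Tukey's fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    result = {"mild": [], "extreme": []}
    for v in values:
        outside = max(q1 - v, v - q3)   # distance outside the box (<= 0 inside it)
        if outside > 3 * iqr:
            result["extreme"].append(v)
        elif outside > 1.5 * iqr:
            result["mild"].append(v)
    return result

prices = [10, 12, 11, 13, 12, 14, 11, 13, 20, 40]   # hypothetical data
outliers = classify_outliers(prices)
# Q1 = 11.25, Q3 = 13.75, IQR = 2.5, so 20 is mild and 40 is extreme
```

Different quartile definitions move the fences slightly, so plotting libraries may not flag exactly the same points.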
Multivariate Outliers – 2D
• Cases with an unusual combination of values on different variables
• Finding outliers in 2D is trickier than in univariate analysis
Multivariate Outliers – More Than 2D
Outliers in more than 2D can be well hidden!
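One common way to score an "unusual combination of values" is the Mahalanobis distance, which measures how far a point lies from the centre of the data while accounting for the correlation between variables. The slides do not name this technique, so it is introduced here only as an illustration. The sketch is limited to two variables so that the covariance matrix can be inverted by hand; the data and the function name are hypothetical.

```python
from math import sqrt

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample variances and covariance (n - 1 in the denominator)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2          # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for 2x2
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(sqrt(d2))
    return distances

# Seven points on the line y = x, plus (2, 6): both of its coordinates fall
# inside the range 1..7 of the others, but the combination is unusual.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (2, 6)]
distances = mahalanobis_2d(points)
most_unusual = points[distances.index(max(distances))]   # (2, 6)
```

A boxplot of either variable alone would not flag (2, 6), which is why multivariate outliers are harder to spot.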
Outliers – Remove Them or What?
• Generally, it’s a bad idea to remove them
  • Removing them decreases the ability of the model to generalize if the cause of this type of variance is unknown (it may not be just noise)
• The effect of outliers can often be mitigated by appropriate mathematical transformations (log! ☺)
• Remove them only if they are truly “abnormal”
• Sometimes you can substitute them with NA and then impute them, or replace them with the mean/median
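To see why a log transformation mitigates the effect of an outlier, compare how far the maximum sits from the median before and after the transformation. This is a minimal sketch with hypothetical data; the helper name `max_to_median` is made up for the illustration.

```python
from math import log10
from statistics import median

def max_to_median(values):
    """How many times larger the maximum is than the median."""
    return max(values) / median(values)

sales = [10, 12, 5, 23, 7600]            # hypothetical data; 7600 is the outlier
log_sales = [log10(v) for v in sales]    # log requires strictly positive values

ratio_raw = max_to_median(sales)         # about 633
ratio_log = max_to_median(log_sales)     # about 3.6
```

The outlier is still the largest value after the transformation, but it no longer dominates the scale.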
Feature Engineering
What is Feature Engineering
• It’s the process of using domain knowledge of the data to create features that make ML algorithms work better
  • E.g. Annual Debt Ratio = Monthly Debt * 12 / Annual Income
• FE is sometimes necessary in order to turn raw input data into things the ML algorithm can understand
• It’s often an art, and it’s the most creative phase of Data Processing
Examples of Simple Feature Engineering

One-hot encoding or dummy variables:

ItemID  ItemColor
1       Yellow
2       Green
3       Unknown
4       Green
5       Yellow

ItemID  HasColor  IsYellow  IsGreen
1       1         1         0
2       1         0         1
3       0         0         0
4       1         0         1
5       1         1         0

Binning / Bucketing:

ItemID  Price
1       10
2       12
3       5
4       23
5       76

ItemID  PriceRange
1       Low
2       Medium
3       Low
4       Medium
5       High

1 – 10 → Low
11 – 25 → Medium
26 – 80 → High
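Both transformations on this slide are easy to express in code. The sketch below reproduces the two tables with plain Python; the function names are hypothetical, and rejecting prices outside 1–80 is an assumption, since the slide does not say how such values should be handled.

```python
def one_hot(color):
    """Encode ItemColor as the three indicator columns shown on the slide."""
    return {
        "HasColor": int(color != "Unknown"),
        "IsYellow": int(color == "Yellow"),
        "IsGreen": int(color == "Green"),
    }

def price_range(price):
    """Bin a price into Low (1-10), Medium (11-25) or High (26-80)."""
    if not 1 <= price <= 80:
        # Assumption: the slide defines no bin outside 1-80
        raise ValueError(f"price {price} is outside the defined bins")
    if price <= 10:
        return "Low"
    if price <= 25:
        return "Medium"
    return "High"

colors = ["Yellow", "Green", "Unknown", "Green", "Yellow"]
prices = [10, 12, 5, 23, 76]

encoded = [one_hot(c) for c in colors]
ranges = [price_range(p) for p in prices]
# ranges == ["Low", "Medium", "Low", "Medium", "High"]
```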
An Example of “Exotic” Feature Engineering
• Cyclical features need to be treated carefully
  • Hours: the model can’t understand that 24h == 0h
  • In a linear model the “length” (t0 – t23) = 23 is “huge”
• Use the clock hand’s projections to convert hours to smooth, continuous variables:
  $x = \cos\left(\frac{\pi}{2} - \frac{2\pi h}{24}\right) = \sin\left(\frac{2\pi h}{24}\right)$ and $y = \cos\left(\frac{2\pi h}{24}\right)$

ItemID  Hours
1       22
2       23
3       0
4       1
5       2

ItemID  X_Hours  Y_Hours
1       -0.5000  0.8660
2       -0.2588  0.9659
3       0.0000   1.0000
4       0.2588   0.9659
5       0.5000   0.8660
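The clock-hand projection can be checked directly against the table above. This is a minimal sketch; the function name `encode_hour` is hypothetical.

```python
from math import sin, cos, pi, dist

def encode_hour(h):
    """Project the hour hand onto the x and y axes of a 24-hour clock face."""
    angle = 2 * pi * h / 24
    return sin(angle), cos(angle)

hours = [22, 23, 0, 1, 2]
encoded = [tuple(round(v, 4) for v in encode_hour(h)) for h in hours]
# (-0.5, 0.866), (-0.2588, 0.9659), (0.0, 1.0), (0.2588, 0.9659), (0.5, 0.866)

# After encoding, 23h is exactly as close to 0h as 1h is
gap_23_to_0 = dist(encode_hour(23), encode_hour(0))
gap_0_to_1 = dist(encode_hour(0), encode_hour(1))
```

Both coordinates are needed: with `x` alone, 1h and 11h would be indistinguishable.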
DEMO 2
Data Preparation on Azure ML Studio
Data Exploration with IDEAR
Feature Selection
What is Feature Selection
• It’s the process of selecting a subset of relevant features for use in model construction
  • FS includes or excludes attributes present in the data without changing them
• Dimensionality reduction is a different thing
  • It builds new combinations of attributes that maximize total variance
  • Methods: Principal Component Analysis, Singular Value Decomposition, Sammon’s Mapping
• Reduce the noise due to irrelevant features
• Avoid overfitting the training set
• Help the model generalize
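The slide does not prescribe a selection method. As one simple possibility, a "filter" approach keeps only the features whose absolute Pearson correlation with the target reaches a threshold. The sketch below is an illustration under that assumption; the data, the threshold of 0.5 and the function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient (undefined for constant columns)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(features, target, threshold=0.5):
    """Keep the names of features whose |correlation| with the target is high enough."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]

features = {
    "size": [1, 2, 3, 4, 5],     # strongly related to the target
    "noise": [3, 1, 4, 1, 5],    # weakly related (correlation about 0.35)
}
target = [2, 4, 6, 8, 10]

selected = select_features(features, target)   # ["size"]
```

The selected attributes are kept unchanged, which is what distinguishes feature selection from dimensionality reduction.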
ML Algorithms
ML Algorithms by Learning Style
Broadly split into three main categories:
• Supervised Learning
  • Predicting the future
  • Learn from known past examples
  • Labels provided
• Unsupervised Learning
  • Making sense of data
  • Understanding the past
  • Learning the structure of data
  • Labels not provided
• Reinforcement Learning
  • Machine trained to make decisions using trial and error
  • Maximize a cumulative reward
  • Exploration vs exploitation trade-off
Common Supervised Learning Algorithms 1
• Regression
  • Modeling and predicting continuous, numeric variables
• Linear Regression
  • Straightforward to understand
  • Not flexible enough to capture more complex patterns
• Regression Tree
  • Can learn non-linear relationships
  • Fairly robust to outliers (using ensembles)
  • Ensembles perform very well
  • They combine predictions from many individual trees
Common Supervised Learning Algorithms 2
• Classification
  • Modeling and predicting categorical variables
• Logistic Regression
  • Yes! “Regression”, but not for continuous variables
  • Predictions ∈ [0, 1] → class probabilities
  • Not flexible enough to capture complex relationships
• Classification (or Decision) Tree
  • Robust to outliers and scalable
  • Naturally models non-linear decision boundaries
  • Individual trees are prone to overfitting; ensembles much less so
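The reason logistic regression yields values in [0, 1] is the sigmoid function applied to a linear score. The sketch below shows only the prediction step for a single feature; the weight and bias are hypothetical values chosen for illustration, not the result of any training.

```python
from math import exp

def predict_proba(x, weight, bias):
    """Probability of the positive class: sigmoid of the linear score."""
    score = weight * x + bias
    return 1 / (1 + exp(-score))

def predict_class(x, weight, bias, threshold=0.5):
    """Turn the probability into a class label (1 or 0)."""
    return int(predict_proba(x, weight, bias) >= threshold)

weight, bias = 0.8, -4.0                               # hypothetical model parameters
p_low = predict_proba(2, weight, bias)                 # about 0.08
p_high = predict_proba(9, weight, bias)                # about 0.96
labels = [predict_class(x, weight, bias) for x in (2, 5, 9)]   # [0, 1, 1]
```

The decision boundary sits where the linear score equals zero (here at x = 5), which is why the model cannot capture more complex relationships on its own.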
Common Unsupervised Learning Algorithms
• Clustering
  • Find natural groupings of observations based on the inherent structure within your dataset
  • K-Means
    • Makes clusters based on geometric distance
    • Points are grouped around centroids in a globular form
    • Simple and flexible
    • The number of clusters has to be specified
• Anomaly Detection
  • Identify unusual patterns that do not conform to expected behavior
  • Find outliers! ☺
  • Used when there is a lot of "normal" data and not many cases of anomalies
  • One-class Support Vector Machine (SVM)
    • Points in space, mapped so that separate categories are divided by a clear gap that is as wide as possible
  • PCA-Based
    • First computes the projection of each instance on the eigenvectors, then computes the normalized reconstruction error
    • The higher the error, the more anomalous the instance is
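The K-Means bullets above (geometric distance, centroids, a number of clusters chosen up front) map onto a short loop. This is a minimal sketch of the standard assign-then-update iteration with hand-picked starting centroids; real implementations choose the starting centroids more carefully and stop when nothing changes. The data and names are hypothetical.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and moving centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]   # two obvious groups
start = [(0, 0), (10, 10)]                                  # the number of clusters is fixed here: 2
centroids, clusters = k_means(points, start)
```

The number of clusters is implied by the number of starting centroids, which reflects the slide's point that it has to be specified in advance.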
No Free Lunch!
• The “No Free Lunch” theorem in ML
• No one algorithm works best for every problem
• Many factors at play, e.g. size and structure of dataset
• Try many different algorithms for your problem!
DEMO 3
Price Elasticity in Azure ML
Lesson Learned
• It’s easy to create a predictive model with Azure ML Studio
• But you have to know how to model the data before putting it in the pot, to avoid bad predictions!
Links
• What is Azure ML (https://is.gd/zUaGXs)
• Price Elasticity Experiment (https://is.gd/nq8jJs)
• Elasticity in Economics (https://is.gd/652EGX)
• TDSP Utility IDEAR (https://is.gd/5KKdPM)
• How to Better Evaluate the Goodness-of-Fit of Regressions (https://is.gd/2z30H4)
Thank you! ☺
