Machine Learning with R
Barbara Fusinska
@BasiaFusinska
About me
Programmer
Machine Learning
Data Solutions Architect
@BasiaFusinska
https://github.com/BasiaFusinska/MachineLearningWithR
Agenda
• What's Machine Learning?
• Exploratory Data Analysis
• Classification
• Clustering
• Regression
Setup
• Install R: https://www.r-project.org/
• Install RStudio: https://www.rstudio.com/
• GitHub repository: https://github.com/BasiaFusinska/MachineLearningWithR
• Packages
Machine Learning?
Movies Genres
Title           | # Kisses | # Kicks | Genre
Taken           | 3        | 47      | Action
Love story      | 24       | 2       | Romance
P.S. I love you | 17       | 3       | Romance
Rush hours      | 5        | 51      | Action
Bad boys        | 7        | 42      | Action
Question: What is the genre of "Gone with the wind"?
Data-based classification
Id | Feature 1 | Feature 2 | Class
1  | 3         | 47        | A
2  | 24        | 2         | B
3  | 17        | 3         | B
4  | 5         | 51        | A
5  | 7         | 42        | A
Question: What is the class of the entry with the following features: F1: 31, F2: 4?
Data Visualization
[Figure: scatter plot of the entries with a line separating class A from class B]
Rule 1: If on the left side of the line then Class = A
Rule 2: If on the right side of the line then Class = B
Chick sexing
Supervised learning
• Classification, regression
• Label, target value
• Training & Validation phases
Unsupervised learning
• Clustering, feature selection
• Finding structure of data
• Statistical values describing the data
Supervised Machine Learning workflow
[Diagram components: Preprocess data, Clean data, Data split, Training data, Test data, Machine Learning algorithm, Trained model, Score]
Publishing the model
[Diagram components: Training data, Model Training, Machine Learning Model, Publish model, Published Machine Learning Model, Test stream, Prediction, Scores]
Exploratory Data Analysis
Demo
Classification problem
Model training
Data & Labels
Classification data
Source     | #Links | #Characters | … | Fake
TopNews    | 10     | 2750        | … | T
Twitter    | 2      | 120         | … | F
TopNews    | 235    | 502         | … | F
Channel X  | 1530   | 3024        | … | T
Twitter    | 24     | 70          | … | F
StoryLeaks | 722    | 1408        | … | T
Facebook   | 98     | 230         | … | T
…          | …      | …           | … | …
Features: Source, #Links, #Characters, …
Labels: Fake
Task: Iris EDA
• Descriptive statistics (dimensions, rows, columns, data types, correlation)
• Data visualization (distributions, outliers)
• Features distributions & classes separation
• 2D visualisation
http://archive.ics.uci.edu/ml/datasets/Iris
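The EDA steps listed in this task could be sketched in base R roughly as follows. This is only an illustration, not the repository's demo script: it uses the copy of the Iris data that ships with R as `iris`, and the choice of which features to plot is an assumption.

```r
# `iris` ships with base R (datasets package); no extra packages are needed.
data(iris)

# Descriptive statistics: dimensions, column types and per-column summaries.
dims <- dim(iris)                 # number of rows, number of columns
str(iris)
print(summary(iris))

# Correlation between the four numeric features.
feature_cor <- cor(iris[, 1:4])
print(round(feature_cor, 2))

# Per-class feature means give a first hint of how well the classes separate.
class_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(class_means)

# Plots are only drawn in an interactive session (for example RStudio).
if (interactive()) {
  # Distribution and outliers of one feature, split by class.
  boxplot(Petal.Length ~ Species, data = iris,
          main = "Petal length by species")
  # 2D visualisation of two features, coloured by class.
  plot(iris$Petal.Length, iris$Petal.Width,
       col = as.integer(iris$Species),
       xlab = "Petal length", ylab = "Petal width")
}
```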
K-Nearest Neighbours Algorithm
• Object is classified by a majority vote
• k – algorithm parameter
• Distance metrics: Euclidean (continuous variables), Hamming (text)
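A minimal sketch of the majority vote, applied to the five entries from the "Data-based classification" slide and the query F1: 31, F2: 4. The helper `knn_predict` is a hypothetical name written for this illustration, not a library function, and k = 3 is an arbitrary choice.

```r
# Toy training set from the "Data-based classification" slide.
train <- data.frame(
  f1    = c(3, 24, 17, 5, 7),
  f2    = c(47, 2, 3, 51, 42),
  class = c("A", "B", "B", "A", "A")
)

# knn_predict is a hypothetical helper written for this sketch.
knn_predict <- function(train, query, k = 3) {
  # Euclidean distance from the query point to every training row.
  d <- sqrt((train$f1 - query[1])^2 + (train$f2 - query[2])^2)
  # Class labels of the k closest training rows.
  nearest <- train$class[order(d)[seq_len(k)]]
  # Majority vote; ties go to the alphabetically first label.
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

prediction <- knn_predict(train, c(31, 4), k = 3)
print(prediction)
```

With k = 3 the nearest rows are the two B entries and one A entry, so the vote returns B.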
Naïve Bayes classifier
$$p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}, \qquad \mathbf{x} = (x_1, \dots, x_n)$$
posterior: $p(C_k \mid x_1, \dots, x_n)$
prior: $p(C_k)$
likelihood: $p(\mathbf{x} \mid C_k)$
evidence: $p(\mathbf{x})$
Naïve Bayes example
Training data:
Sex    | Height | Weight | Foot size
Male   | 6      | 190    | 11
Male   | 6.2    | 170    | 10
Female | 5      | 130    | 6
…      | …      | …      | …
Entry to classify:
Sex | Height | Weight | Foot size
?   | 5.9    | 140    | 8
$$P(\text{male} \mid \mathbf{x}) = \frac{P(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male})}{\text{evidence}}$$
$$P(\text{female} \mid \mathbf{x}) = \frac{P(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})}{\text{evidence}}$$
$$\text{evidence} = P(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male}) + P(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})$$
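The calculation above can be carried out by hand in base R. The sketch below makes two assumptions that are not on the slide: each feature is modelled with a Gaussian density per class, and the training table is completed with invented rows (only the first three rows come from the slide; the rest are hypothetical so that every class has a variance).

```r
# Hypothetical training data: rows 1-3 are the ones visible on the slide,
# rows 4-6 are invented for illustration only.
people <- data.frame(
  sex    = c("male", "male", "female", "male", "female", "female"),
  height = c(6.0, 6.2, 5.0, 5.8, 5.4, 5.6),
  weight = c(190, 170, 130, 180, 120, 150),
  foot   = c(11, 10, 6, 12, 7, 8)
)

# Entry to classify, as on the slide.
query <- c(height = 5.9, weight = 140, foot = 8)

# Numerator of the posterior for one class:
# prior * product of per-feature likelihoods (Gaussian assumption).
class_score <- function(cls) {
  rows  <- people[people$sex == cls, ]
  prior <- nrow(rows) / nrow(people)
  lik <- sapply(names(query), function(f) {
    dnorm(query[[f]], mean = mean(rows[[f]]), sd = sd(rows[[f]]))
  })
  prior * prod(lik)
}

scores    <- c(male = class_score("male"), female = class_score("female"))
evidence  <- sum(scores)        # the shared denominator
posterior <- scores / evidence  # posteriors for both classes
print(posterior)
```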
Logistic regression
$$z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
$$y = \begin{cases} 1 & \text{for } z > 0 \\ 0 & \text{for } z < 0 \end{cases}$$
$$y = \begin{cases} 1 & \text{for } \phi(z) > 0.5 \\ 0 & \text{for } \phi(z) < 0.5 \end{cases}$$
Logistic function
Coefficients
Best fit of β
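The two decision rules are equivalent because the logistic function, $\phi(z) = 1 / (1 + e^{-z})$, equals 0.5 exactly at z = 0 and is increasing. A small sketch of this, where the coefficients and feature values are hypothetical numbers chosen only for illustration:

```r
# Logistic (sigmoid) function; base R provides the same function as plogis().
phi <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients (beta0, beta1, beta2) and two feature vectors.
beta <- c(-1.0, 0.8, -0.5)
x    <- rbind(c(2.0, 0.5),
              c(0.5, 1.0))

# Linear score z = beta0 + beta1 * x1 + beta2 * x2 for each row of x.
z <- as.numeric(beta[1] + x %*% beta[-1])

# Both decision rules from the slide produce the same labels.
y_from_z   <- as.integer(z > 0)
y_from_phi <- as.integer(phi(z) > 0.5)
print(data.frame(z = z, phi = phi(z), y = y_from_z))
```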
Data processing
Demo
Data classification
Demo
Evaluation methods for classification
Confusion Matrix
                    | Reference Positive | Reference Negative
Prediction Positive | TP                 | FP
Prediction Negative | FN                 | TN
Receiver Operating Characteristic (ROC) curve
Area under the curve (AUC)
$$\text{Accuracy} = \frac{\#\text{correct}}{\#\text{predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$$
How good it is at detecting positives
$$\text{Specificity} = \frac{TN}{TN + FP}$$
How good it is at avoiding false alarms
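As a worked example, the measures can be computed from a confusion matrix built with `table()`. The label vectors below are hypothetical and exist only to give the four cells concrete counts.

```r
# Hypothetical reference labels and predictions for a binary problem.
lv <- c("pos", "neg")
reference  <- factor(c("pos", "pos", "pos", "pos", "neg",
                       "neg", "neg", "neg", "neg", "neg"), levels = lv)
prediction <- factor(c("pos", "pos", "pos", "neg", "pos",
                       "neg", "neg", "neg", "neg", "neg"), levels = lv)

# Rows are predictions, columns are reference labels, as on the slide.
cm <- table(Prediction = prediction, Reference = reference)
print(cm)

TP <- cm["pos", "pos"]; FP <- cm["pos", "neg"]
FN <- cm["neg", "pos"]; TN <- cm["neg", "neg"]

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)   # sensitivity
specificity <- TN / (TN + FP)
print(c(accuracy = accuracy, precision = precision,
        recall = recall, specificity = specificity))
```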
Task: Iris Classification
• Data preprocessing
• Split data into training and test sets
• Classification using: kNN and Naïve Bayes
• Performance evaluation
• Results Visualisation
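One possible shape for the kNN part of this task is sketched below. It assumes the `class` package (a recommended package bundled with standard R installations) for `knn()`. The 70/30 split, the seed and k = 5 are arbitrary choices for illustration. Naïve Bayes is left out because base R does not provide it.

```r
library(class)  # knn(); bundled with standard R installations

set.seed(42)    # arbitrary seed so the split is reproducible
data(iris)

# Split: 70% of the rows for training, the rest for testing.
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))

# Preprocessing: standardise features using training-set statistics only.
train_x <- scale(iris[train_idx, 1:4])
test_x  <- scale(iris[-train_idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Classification: each test row gets the majority class of its 5 neighbours.
pred <- knn(train = train_x, test = test_x,
            cl = iris$Species[train_idx], k = 5)

# Performance evaluation: confusion matrix and accuracy on the test set.
cm <- table(Prediction = pred, Reference = iris$Species[-train_idx])
accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)
```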
Task: Binary Classification
• Only two classes in the dataset (versicolor & virginica)
• Classification using logistic regression
• Performance evaluation
• Results Visualisation
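A hedged sketch of this task using `glm()` from base R. The two petal features, the 70/30 split and the seed are assumptions made for the example, not choices taken from the slides.

```r
data(iris)

# Keep only the two classes named in the task and drop the unused level.
two <- droplevels(iris[iris$Species != "setosa", ])

set.seed(1)  # arbitrary seed for a reproducible split
train_idx <- sample(nrow(two), size = 70)

# Logistic regression; with a factor response glm() treats the first level
# (versicolor) as 0 and the second (virginica) as 1.
fit <- glm(Species ~ Petal.Length + Petal.Width,
           data = two[train_idx, ], family = binomial)

# Predicted probability of virginica, thresholded at 0.5.
prob <- predict(fit, newdata = two[-train_idx, ], type = "response")
pred <- factor(ifelse(prob > 0.5, "virginica", "versicolor"),
               levels = levels(two$Species))

# Performance evaluation on the held-out rows.
cm <- table(Prediction = pred, Reference = two$Species[-train_idx])
accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)
```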
Resampling: Bootstrapping
k-fold cross validation
Data resampling
Demo
Data tuning
Demo
Task: Resampling & Tuning
• Repeated k-fold cross validation
• Use Naïve Bayes as classification algorithm
• Tune the parameters using specific values
• Performance evaluation
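Repeated k-fold cross validation with a small tuning grid can be written by hand, as sketched below. Note the substitution: base R has no Naïve Bayes, so kNN from the bundled `class` package stands in as the classifier and the tuned parameter is the number of neighbours. The helper `cv_accuracy`, the grid values, 5 folds and 3 repeats are all assumptions for illustration. For brevity the features are scaled once on the full data rather than inside each fold.

```r
library(class)  # knn(); bundled with standard R installations

set.seed(7)     # arbitrary seed for reproducible folds
data(iris)
x <- scale(iris[, 1:4])
y <- iris$Species

# cv_accuracy is a hypothetical helper: mean accuracy over `repeats` rounds
# of `folds`-fold cross validation for a given number of neighbours.
cv_accuracy <- function(k_neighbours, folds = 5, repeats = 3) {
  acc <- c()
  for (r in seq_len(repeats)) {
    # Random assignment of every row to one of the folds.
    fold_id <- sample(rep(seq_len(folds), length.out = nrow(x)))
    for (f in seq_len(folds)) {
      held_out <- fold_id == f
      pred <- knn(x[!held_out, ], x[held_out, ],
                  cl = y[!held_out], k = k_neighbours)
      acc <- c(acc, mean(pred == y[held_out]))
    }
  }
  mean(acc)
}

# Tune over specific values and keep the best-scoring one.
k_grid  <- c(1, 3, 5, 7, 9)
results <- sapply(k_grid, cv_accuracy)
names(results) <- k_grid
best_k <- k_grid[which.max(results)]
print(results)
print(best_k)
```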
Clustering problem
K-means Algorithm
Hierarchical clustering
• Decision of where the cluster should be split
• Metric: distance between pairs of observations
• Linkage criterion: dissimilarity of sets
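The three ingredients above map onto three base R calls: `dist()` for the metric, `hclust()` for the linkage criterion and `cutree()` for where to cut. A sketch on the Iris features, where Euclidean distance, average linkage and three clusters are illustrative choices only:

```r
data(iris)
x <- scale(iris[, 1:4])

# Metric: pairwise Euclidean distances between observations.
d <- dist(x, method = "euclidean")

# Linkage criterion: average dissimilarity between sets of observations.
tree <- hclust(d, method = "average")

# Decide where to split: cut the tree into three clusters.
clusters <- cutree(tree, k = 3)
print(table(clusters))

# The dendrogram is only drawn in an interactive session.
if (interactive()) plot(tree, labels = FALSE)
```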
Clustering
Demo
Evaluating methods for clustering
• Sum of squares
• Class based measures
• Underlying true
Task: Iris Clustering
• Clustering using k-means and hierarchies
• Compare clusters with the original classes assignments
• Visualise the findings
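A minimal sketch of the k-means part of this task with base R's `kmeans()`. Three centres, 20 random starts and the seed are assumptions for illustration. The cross-table compares clusters with the original classes and the within-cluster sum of squares gives the sum-of-squares measure.

```r
set.seed(123)  # k-means starts from random centres; fix the seed
data(iris)
x <- iris[, 1:4]

# k-means with three clusters, keeping the best of 20 random starts.
km <- kmeans(x, centers = 3, nstart = 20)

# Class-based check: cross-tabulate clusters against the true species.
comparison <- table(Cluster = km$cluster, Species = iris$Species)
print(comparison)

# Sum-of-squares check: within-cluster versus total sum of squares.
print(c(within = km$tot.withinss, total = km$totss))

# Visualise the findings in an interactive session.
if (interactive()) {
  plot(x$Petal.Length, x$Petal.Width, col = km$cluster,
       pch = as.integer(iris$Species),
       xlab = "Petal length", ylab = "Petal width")
}
```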
Regression problem
• Dependent value
• Predicting the real value
• Fitting the coefficients
• Analytical solutions
• Gradient descent
$$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
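To illustrate the gradient descent route, the sketch below fits a one-feature model on synthetic data and compares the result with `lm()`. The data, the learning rate of 0.1 and the 2000 iterations are assumptions chosen for the example.

```r
set.seed(3)  # arbitrary seed for reproducible synthetic data
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)  # synthetic data with known coefficients

X    <- cbind(1, x)   # design matrix: a column of ones for beta0, then x
beta <- c(0, 0)       # starting point
rate <- 0.1           # learning rate (assumed value)

for (step in 1:2000) {
  # Gradient of the mean squared error (up to a factor of 2).
  gradient <- t(X) %*% (X %*% beta - y) / n
  beta <- beta - rate * as.numeric(gradient)
}

# The analytical least-squares fit from lm() should agree closely.
reference <- unname(coef(lm(y ~ x)))
print(rbind(gradient_descent = beta, lm = reference))
```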
Ordinary linear regression
Residual sum of squares (RSS)
$$S(w) = \sum_{i=1}^{n} (y_i - x_i^T w)^2 = (y - Xw)^T (y - Xw)$$
$$\hat{w} = \arg\min_{w} S(w)$$
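Setting the gradient of S(w) to zero gives the normal equations $X^T X w = X^T y$, which can be solved directly. A sketch using the built-in `mtcars` data as a stand-in (the dataset and the chosen predictors are assumptions for illustration):

```r
data(mtcars)

# Design matrix with an intercept column and two predictors.
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg

# Solve the normal equations (X'X) w = X'y for the minimiser of S(w).
w_hat <- solve(t(X) %*% X, t(X) %*% y)

# Residual sum of squares in both forms given on the slide.
rss_sum    <- sum((y - X %*% w_hat)^2)
rss_matrix <- as.numeric(t(y - X %*% w_hat) %*% (y - X %*% w_hat))

# lm() fits the same model and should return the same coefficients.
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(cbind(normal_equations = as.numeric(w_hat), lm = coef(fit)))
```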
Task: Prestige EDA
• Descriptive statistics (dimensions, rows, columns, data types, correlation)
• Data visualization (distributions, outliers)
• Handle missing data
• Features significance
Evaluation methods for regression
• Errors
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (f_i - y_i)^2}{n}}$$
$$R^2 = 1 - \frac{\sum (f_i - y_i)^2}{\sum (\bar{y} - y_i)^2}$$
• Statistics (t, ANOVA)
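Both error measures follow directly from the fitted values $f_i$ and observed values $y_i$. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, not the dataset used in the talk) and checks R² against the value reported by `summary()`.

```r
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

f <- fitted(fit)   # predictions f_i
y <- mtcars$mpg    # observed values y_i

# Root mean squared error.
rmse <- sqrt(sum((f - y)^2) / length(y))

# Coefficient of determination.
r2 <- 1 - sum((f - y)^2) / sum((mean(y) - y)^2)
print(c(RMSE = rmse, R2 = r2))

# Statistics: t tests per coefficient and the ANOVA table.
print(summary(fit)$coefficients)
print(anova(fit))
```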
Residuals vs Fitted
• Check if residuals have non-linear patterns
• Check if the model captures the non-linear relationship
• Should show equally spread residuals around the horizontal line
Normal Q-Q
• Shows if the residuals are normally distributed
• Points should lie along the straight dashed line
• Check that the residuals do not deviate severely from it
Scale-Location
• Shows if residuals are spread equally along the ranges of predictors
• Tests the assumption of equal variance (homoscedasticity)
• Should show a horizontal line with equally (randomly) spread points
Residuals vs Leverage
• Helps to find influential cases
• Cases outside the Cook's distance lines are influential
• With no influential cases, the Cook's distance lines should be barely visible
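In base R the four diagnostic plots come from calling `plot()` on a fitted `lm` object. The `which` values 1, 2, 3 and 5 select Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage. The sketch uses `mtcars` as a stand-in dataset, and the 4/n cut-off for Cook's distance is a common rule of thumb assumed here, not something stated on the slides.

```r
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Draw the four diagnostic plots in a 2 x 2 grid (interactive sessions only).
if (interactive()) {
  par(mfrow = c(2, 2))
  plot(fit, which = c(1, 2, 3, 5))
}

# Cook's distance per observation, to list potentially influential cases.
cooks <- cooks.distance(fit)
influential <- names(cooks)[cooks > 4 / nrow(mtcars)]  # rule-of-thumb cut-off
print(influential)
```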
Regression problem
Demo
Task: Prestige Regression
• Numeric and categorical features
• Other than linear relations
• Combining the features
Categorical data for regression
• Categories A, B, C are coded as dummy variables
• In general, a variable with k categories is encoded into k-1 dummy variables
Category | V1 | V2
A        | 0  | 0
B        | 1  | 0
C        | 0  | 1
$$f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_j x_j + \beta_{j+1} v_1 + \dots + \beta_{j+k-1} v_{k-1}$$
Categorical data for regression
$$f(x) = \beta_0 + \beta_1 x + \beta_2 v_1 + \dots + \beta_k v_{k-1} + \beta_{k+1} v_1 x + \dots + \beta_{2k-1} v_{k-1} x$$
y ~ x + cat + x:cat
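R builds the dummy variables and interaction terms automatically when a factor appears in a model formula. The sketch below uses synthetic data (the values and effect sizes are invented for illustration) with a three-level factor, so k = 3 yields k-1 = 2 dummy columns and 2k = 6 coefficients in the interaction model.

```r
set.seed(11)  # arbitrary seed for reproducible synthetic data
d <- data.frame(
  x   = runif(30),
  cat = factor(rep(c("A", "B", "C"), each = 10))
)
# Synthetic response with a different offset per category (invented values).
offsets <- c(A = 0, B = 1, C = -1)
d$y <- 1 + 2 * d$x + unname(offsets[as.character(d$cat)]) + rnorm(30, sd = 0.1)

# Dummy coding: an intercept plus k-1 indicator columns; A is the baseline.
dummies <- model.matrix(~ cat, data = d)
print(unique(dummies))

# Model with main effects and the x:cat interaction from the slide.
fit <- lm(y ~ x + cat + x:cat, data = d)
print(names(coef(fit)))
```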
Keep in touch
BarbaraFusinska.com
@BasiaFusinska
https://github.com/BasiaFusinska/MachineLearningWithR
